arXiv

Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling

June 2, 2026 · Xiaolong Tang, Meina Kan, Shiguang Shan, Xilin Chen

Title: Plan-R1: Ensuring Safety and Feasibility in Trajectory Planning Through Language Modeling

Abstract:

For real-world autonomous driving systems, the ability to generate safe and feasible trajectories is paramount. However, current learning-based planning approaches are heavily dependent on expert demonstrations. This reliance poses a significant risk: such data often lacks explicit safety awareness and may inadvertently teach the model undesirable habits, such as speeding, derived from suboptimal human driving records. Drawing inspiration from the advancements in large language models, we introduce Plan-R1, a novel two-stage trajectory planning framework that separates principle alignment from behavior learning.

The first stage involves pre-training a general trajectory predictor on expert data to capture a wide range of human-like driving behaviors. In the second stage, the model undergoes fine-tuning using Group Relative Policy Optimization (GRPO) with rule-based rewards. This process explicitly aligns the ego-vehicle’s planning with core principles, including traffic rule compliance, comfort, and safety. By adopting this two-stage approach, the framework preserves the naturalistic qualities of human driving while simultaneously enhancing safety awareness and filtering out negative patterns present in the original demonstrations.

We also identify a critical limitation of applying standard GRPO directly to planning tasks. Group-wise normalization, which divides each group's centered rewards by that group's standard deviation, erases scale differences between groups. As a result, rare groups with high-variance safety violations receive advantages similar in magnitude to those of abundant, low-variance safe groups, which inadvertently suppresses the optimization of safety-critical objectives. To resolve this, we propose Variance-Decoupled GRPO (VD-GRPO). This method replaces standard normalization with centering and fixed scaling, thereby preserving the absolute magnitude of rewards. This adjustment ensures that safety-critical objectives remain dominant throughout training.
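The normalization issue can be illustrated with a small numerical sketch. This is not the authors' implementation: the function names, the reward values, and the fixed scale of 1.0 are all assumptions chosen for illustration. It compares per-group standardization, as in standard GRPO, with centering followed by division by a fixed constant, as the abstract describes for VD-GRPO.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantages: center by the group mean,
    then divide by the group's own standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def vd_advantages(rewards, scale=1.0):
    """Variance-decoupled sketch: center by the group mean, then divide
    by a fixed constant (hypothetical value) shared by all groups."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / scale


# Hypothetical reward groups, one group per scenario.
# Safe group: all rollouts are safe and differ only slightly (e.g. comfort).
safe_group = [0.98, 1.00, 1.02, 1.00]
# Unsafe group: one rollout incurs a large safety penalty (e.g. a collision).
unsafe_group = [1.0, 1.0, -9.0, 1.0]

# Largest advantage magnitude per group under each scheme.
std_safe = np.abs(grpo_advantages(safe_group)).max()
std_unsafe = np.abs(grpo_advantages(unsafe_group)).max()
vd_safe = np.abs(vd_advantages(safe_group)).max()
vd_unsafe = np.abs(vd_advantages(unsafe_group)).max()

print(f"per-group std : safe={std_safe:.2f}, unsafe={std_unsafe:.2f}")
print(f"fixed scale   : safe={vd_safe:.2f}, unsafe={vd_unsafe:.2f}")
```

With per-group standardization, both groups produce advantages of roughly the same magnitude, so the collision contributes about as much learning signal as a minor comfort difference. With a fixed scale, the advantage for the collision rollout stays far larger than anything in the safe group, which is the effect the abstract attributes to preserving the absolute magnitude of rewards.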

Experiments conducted on the nuPlan benchmark indicate that Plan-R1 significantly enhances both the safety and feasibility of planning, achieving state-of-the-art results, especially within realistic reactive scenarios. Our code is publicly available at https://github.com/XiaolongTang23/Plan-R1.
