Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models
Title: Reading the Trajectory to Guide the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models
Abstract:
Diffusion large language models (dLLMs) construct responses through an iterative process that simultaneously unmasks and refines multiple token positions. This mechanism produces a detailed denoising trace, offering insights into token confidence levels, areas of instability, and the timing of final commitments. Current reinforcement learning approaches for dLLMs make little use of this information and rely on either flat or tree-based rollouts. Flat rollouts are computationally inexpensive but distribute a single, global outcome reward across the entire sequence. Conversely, tree rollouts offer more granular, verifiable signals by branching partial sequences and propagating rewards from leaf nodes upward; however, they demand significant computational resources.
This study investigates whether the denoising trace can serve as a source of tree-like supervision without incurring the high costs associated with full tree expansion. To this end, we propose CAPR (Cached-Amortized Path Refinement), a novel RL algorithm tailored for dLLMs. CAPR condenses the denoising trace into a compact path state and utilizes cached trajectory states to efficiently generate sibling continuations. Additionally, it employs a block-level value head to enable local, block-wise supervision.
Operating under a block-wise unmasking schedule, CAPR logs path-state and block-progress metrics before allocating the final outcome reward to specific blocks based on the tokens revealed within them. This approach trains the value head to transform a sparse reward signal into block-level Proximal Policy Optimization (PPO) weights. Consequently, CAPR achieves a level of granularity comparable to tree search while circumventing the need for complete tree expansion. In terms of efficiency, this method reduces rollout-generation costs to approximately 0.75 times that of flat rollouts and just 0.6 times that of tree rollouts under standard configurations.
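As a minimal sketch of the block-level credit assignment described above, the following Python splits one scalar outcome reward across blocks and forms per-block weights. The abstract does not give the allocation rule, so the proportional split by revealed-token count, the uniform fallback, and the "target minus value prediction" weighting are assumptions made for illustration. The function names allocate_block_rewards and block_weights are hypothetical and are not taken from CAPR.

```python
from typing import List


def allocate_block_rewards(outcome_reward: float,
                           revealed_per_block: List[int]) -> List[float]:
    """Split one outcome reward across blocks.

    Assumption: each block's share is proportional to the number of
    tokens revealed (unmasked) inside that block.
    """
    total = sum(revealed_per_block)
    n = len(revealed_per_block)
    if n == 0:
        return []
    if total == 0:
        # Assumption: fall back to a uniform split if nothing was revealed.
        return [outcome_reward / n] * n
    return [outcome_reward * count / total for count in revealed_per_block]


def block_weights(block_targets: List[float],
                  block_values: List[float]) -> List[float]:
    """Per-block weight as target minus predicted value.

    Assumption: block_values stands in for the output of a block-level
    value head; the resulting differences would scale a PPO-style loss.
    """
    if len(block_targets) != len(block_values):
        raise ValueError("targets and values must have the same length")
    return [t - v for t, v in zip(block_targets, block_values)]


# Example: three blocks revealing 8, 16, and 8 tokens, outcome reward 1.0.
targets = allocate_block_rewards(1.0, [8, 16, 8])
weights = block_weights(targets, [0.25, 0.25, 0.5])
```

In this example the middle block, which revealed half of the tokens, receives half of the reward. Its weight is positive because the assumed value prediction underestimated its share.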
Evaluations across 4x4 Sudoku, Countdown, GSM8K, and Math500 tasks—utilizing both dense and mixture-of-experts LLaDA backbones—demonstrate that CAPR establishes a new state-of-the-art for RL-tuned dLLMs at both 256- and 512-token budgets. Notably, on the Sudoku task, CAPR matches the performance of the strongest tree-structured baseline while requiring less than one-third of the per-step computational cost.
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC


