arXiv

Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models

Title: Reading the Trajectory to Guide the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models

Abstract:

Diffusion large language models (dLLMs) construct responses through an iterative process that simultaneously unmasks and refines multiple token positions. This mechanism produces a detailed denoising trace, offering insights into token confidence levels, areas of instability, and the timing of final commitments. Current reinforcement learning approaches for dLLMs make little use of this information and rely on either flat or tree-based rollouts. Flat rollouts are computationally inexpensive but distribute a single, global outcome reward across the entire sequence. Conversely, tree rollouts offer more granular, verifiable signals by branching partial sequences and propagating rewards from leaf nodes upward; however, they demand significant computational resources.

This study investigates whether the denoising trace can serve as a source of tree-like supervision without incurring the high costs associated with full tree expansion. To this end, we propose CAPR (Cached-Amortized Path Refinement), a novel RL algorithm tailored for dLLMs. CAPR condenses the denoising trace into a compact path state and utilizes cached trajectory states to efficiently generate sibling continuations. Additionally, it employs a block-level value head to enable local, block-wise supervision.
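The idea of reusing cached trajectory states to produce sibling continuations can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`denoise_block`, `rollout_with_cache`, `sibling_from_cache`) are hypothetical, the "model" is a seeded random filler, and the cache stores plain token lists, whereas a real dLLM would presumably cache model-internal state. It only shows why resuming from a cached block boundary costs fewer block passes than regenerating the whole sequence.

```python
import random
from typing import Dict, List, Optional, Tuple

MASK = None  # placeholder for a still-masked position


def denoise_block(tokens: List[Optional[int]], block: int, block_size: int,
                  rng: random.Random) -> List[Optional[int]]:
    """Toy stand-in for denoising one block: fill its masked slots with digits."""
    out = list(tokens)
    start = block * block_size
    for i in range(start, start + block_size):
        if out[i] is MASK:
            out[i] = rng.randrange(10)
    return out


def rollout_with_cache(length: int, block_size: int,
                       seed: int) -> Tuple[List[Optional[int]], Dict[int, List[Optional[int]]]]:
    """Run one block-wise rollout, caching the state at every block boundary."""
    rng = random.Random(seed)
    tokens: List[Optional[int]] = [MASK] * length
    cache = {0: list(tokens)}  # state before any block is denoised
    for b in range(length // block_size):
        tokens = denoise_block(tokens, b, block_size, rng)
        cache[b + 1] = list(tokens)  # state after block b
    return tokens, cache


def sibling_from_cache(cache: Dict[int, List[Optional[int]]], branch_block: int,
                       n_blocks: int, block_size: int,
                       seed: int) -> Tuple[List[Optional[int]], int]:
    """Resume from the cached state before `branch_block`.

    Returns the sibling sequence and the number of block passes it needed.
    """
    rng = random.Random(seed)
    tokens = list(cache[branch_block])
    for b in range(branch_block, n_blocks):
        tokens = denoise_block(tokens, b, block_size, rng)
    return tokens, n_blocks - branch_block


base, cache = rollout_with_cache(length=16, block_size=4, seed=0)
sibling, passes = sibling_from_cache(cache, branch_block=2, n_blocks=4,
                                     block_size=4, seed=1)
```

In this toy setting the sibling shares the first two blocks with the base rollout and needs only two block passes instead of four; how CAPR actually chooses branch points and what it caches is not specified in the abstract.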

Operating under a block-wise unmasking schedule, CAPR logs path-state and block-progress metrics before allocating the final outcome reward to specific blocks based on the tokens revealed within them. This approach trains the value head to transform a sparse reward signal into block-level Proximal Policy Optimization (PPO) weights. Consequently, CAPR achieves a level of granularity comparable to tree search while circumventing the need for complete tree expansion. In terms of efficiency, this method reduces the rollout-generation cost to approximately 0.75 times that of flat rollouts and just 0.6 times that of tree rollouts under standard configurations.
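As a rough illustration of block-level credit assignment, the sketch below splits one scalar outcome reward across blocks and turns the result into per-block weights. Both rules are assumptions made for illustration, since the abstract does not give the formulas: the reward is split in proportion to the number of tokens revealed in each block, and the weight is the allocated reward minus a value prediction. The function names are hypothetical.

```python
from typing import List


def allocate_block_rewards(outcome_reward: float,
                           revealed_per_block: List[int]) -> List[float]:
    """Split a scalar outcome reward across blocks.

    Assumed rule: each block's share is proportional to the number of
    tokens revealed in it.
    """
    total = sum(revealed_per_block)
    n = len(revealed_per_block)
    if n == 0:
        return []
    if total == 0:
        # Nothing revealed anywhere: fall back to a uniform split (illustrative).
        return [outcome_reward / n] * n
    return [outcome_reward * count / total for count in revealed_per_block]


def block_weights(block_rewards: List[float],
                  block_values: List[float]) -> List[float]:
    """Advantage-style weight per block: allocated reward minus predicted value.

    `block_values` stands in for the outputs of a block-level value head.
    """
    if len(block_rewards) != len(block_values):
        raise ValueError("need one value prediction per block")
    return [r - v for r, v in zip(block_rewards, block_values)]


rewards = allocate_block_rewards(1.0, [8, 4, 4])
weights = block_weights(rewards, [0.25, 0.25, 0.5])
```

With these inputs the first block, which revealed half of the tokens, receives half of the reward, and a block whose predicted value exceeds its allocated reward gets a negative weight. In a PPO-style update such weights would scale the policy-gradient term for the tokens in each block.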

Evaluations across 4x4 Sudoku, Countdown, GSM8K, and Math500 tasks—utilizing both dense and mixture-of-experts LLaDA backbones—demonstrate that CAPR establishes a new state-of-the-art for RL-tuned dLLMs at both 256- and 512-token budgets. Notably, on the Sudoku task, CAPR matches the performance of the strongest tree-structured baseline while requiring less than one-third of the per-step computational cost.


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
