arXiv

Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

Abstract: While Reinforcement Learning (RL) with outcome-based supervision has been shown to enable Transformers to spontaneously generate intermediate reasoning steps (Chain-of-Thought), the precise mechanism by which sparse rewards guide policy gradients to uncover such systematic reasoning remains unclear. To resolve this, we examine the policy gradient dynamics of single-layer Transformers applied to a synthetic graph traversal task. This task is unsolvable without Chain-of-Thought but allows for a straightforward iterative solution. We demonstrate that training exclusively on final-answer accuracy causes the policy gradient to drive the Transformer toward a structured, interpretable algorithm that traverses the graph vertex by vertex. Our analysis identifies specific distributional properties necessary for this emergence, highlighting the pivotal importance of "simple examples"—instances that demand fewer reasoning steps. If the training distribution assigns adequate probability mass to these simpler cases, the Transformer acquires a generalizable traversal strategy capable of extrapolating to longer chains. Conversely, if this mass disappears, policy gradient learning fails. We support these theoretical insights with experiments on synthetic datasets and with real-world language models on mathematical reasoning tasks, confirming that our findings are applicable to practical scenarios.
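To make the setting concrete, here is a minimal sketch of a graph traversal task scored only on the final answer. It is not the paper's construction: the path-shaped graph, the helper names (`make_chain_instance`, `traverse`, `outcome_reward`, `sample_num_steps`), and the parameter `p_simple` are all assumptions made for illustration. The sketch shows the vertex-by-vertex iterative solution the abstract describes, a reward that looks only at the last vertex produced, and a training distribution that reserves probability mass for one-step "simple examples".

```python
import random


def make_chain_instance(num_steps, rng):
    """Hypothetical task instance: a path with num_steps edges, given as a shuffled edge list."""
    vertices = list(range(num_steps + 1))
    rng.shuffle(vertices)
    edges = list(zip(vertices[:-1], vertices[1:]))
    rng.shuffle(edges)  # edge order carries no information about the path order
    return edges, vertices[0], vertices[-1]


def traverse(edges, start):
    """Iterative solution: follow successor edges one vertex at a time until none remains."""
    successor = dict(edges)
    trace = [start]
    while trace[-1] in successor:
        trace.append(successor[trace[-1]])
    return trace  # the intermediate vertices play the role of a chain of thought


def outcome_reward(trace, target):
    """Outcome-based supervision: only the final vertex is scored, not the steps before it."""
    return 1.0 if trace[-1] == target else 0.0


def sample_num_steps(max_steps, p_simple, rng):
    """Assumed training distribution: with probability p_simple, draw a one-step instance."""
    if rng.random() < p_simple:
        return 1
    return rng.randint(2, max_steps)


if __name__ == "__main__":
    rng = random.Random(0)
    steps = sample_num_steps(max_steps=8, p_simple=0.3, rng=rng)
    edges, start, target = make_chain_instance(steps, rng)
    trace = traverse(edges, start)
    print(steps, trace, outcome_reward(trace, target))
```

In this sketch, setting `p_simple` to 0 corresponds to the regime in which the mass on simple examples disappears. The sketch only generates data and rewards; it does not model the policy gradient dynamics analyzed in the paper.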


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
