arXiv

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

Original: arXiv:2605.12652v2 (announce type: replace-cross)

Abstract: Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide limited guidance about where reasoning succeeds or fails. On-policy distillation (OPD) offers denser token-level supervision by training on student-generated trajectories, yet existing methods typically distill each rollout independently and ignore the other attempts sampled for the same prompt. We introduce Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned distillation framework that uses the student's local rollout group to construct more informative teacher signals. MOPD conditions the teacher on both successful and failed peer rollouts: successes provide positive evidence for valid reasoning patterns, while failures provide structured negative evidence about plausible mistakes to avoid. We study two peer-context constructions: positive peer imitation and contrastive success-failure conditioning. Experiments on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks show that MOPD consistently improves over standard on-policy baselines. Further teacher-signal analysis shows that mixed success-failure contexts better align teacher scores with verifier rewards, indicating that the gains arise from more faithful, instance-adaptive supervision. These results indicate that effective on-policy distillation should exploit the student's multi-rollout trial-and-error behavior rather than treating rollouts as isolated samples.

Rewrite:

Title: Leveraging Peer Outcomes for Multi-Rollout On-Policy Distillation

Abstract: Post-training for large language models frequently relies on sparse verifier rewards. While these rewards signal whether a sampled trajectory is successful, they offer little insight into the specific points where reasoning succeeds or fails. On-policy distillation (OPD) addresses this by providing denser, token-level supervision through training on trajectories generated by the student. However, current approaches generally process each rollout in isolation, disregarding other attempts made for the same prompt. To address this limitation, we propose Multi-Rollout On-Policy Distillation (MOPD), a framework that leverages the student’s local group of rollouts to generate more informative teacher signals. MOPD conditions the teacher on both successful and failed peer attempts; successes serve as positive evidence for correct reasoning patterns, whereas failures offer structured negative evidence regarding plausible errors to avoid. We explore two methods for constructing this peer context: positive peer imitation and contrastive conditioning based on success and failure. Our experiments across benchmarks for competitive programming, mathematical reasoning, scientific question answering, and tool use demonstrate that MOPD consistently outperforms standard on-policy baselines. Additionally, an analysis of teacher signals reveals that combining success and failure contexts leads to a closer alignment between teacher scores and verifier rewards. This suggests that the performance improvements stem from supervision that is both more faithful and adaptive to specific instances. Ultimately, these findings highlight that effective on-policy distillation should capitalize on the student’s trial-and-error dynamics across multiple rollouts, rather than treating each sample as an independent unit.
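The abstract names two peer-context constructions but does not specify their format. The sketch below is only an illustration of the general idea, not the authors' implementation: it groups the rollouts sampled for one prompt by verifier outcome and builds the text a teacher could be conditioned on when scoring one target rollout. The names `Rollout`, `split_peers`, and `build_teacher_context`, the prompt template, the 0.5 success threshold, the `max_peers` cap, and the choice to exclude the target rollout from its own peer set are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rollout:
    text: str      # student-generated trajectory
    reward: float  # verifier reward; assumed here to be 1.0 (success) or 0.0 (failure)


def split_peers(group: List[Rollout], index: int,
                threshold: float = 0.5) -> Tuple[List[Rollout], List[Rollout]]:
    """Split the peers of group[index] into (successes, failures).

    The target rollout itself is excluded (an assumption of this sketch).
    """
    peers = [r for i, r in enumerate(group) if i != index]
    successes = [r for r in peers if r.reward >= threshold]
    failures = [r for r in peers if r.reward < threshold]
    return successes, failures


def build_teacher_context(prompt: str, group: List[Rollout], index: int,
                          mode: str = "contrastive", max_peers: int = 2) -> str:
    """Build a hypothetical teacher context for scoring group[index].

    mode="positive":    include successful peers only.
    mode="contrastive": include successful and failed peers.
    """
    if mode not in ("positive", "contrastive"):
        raise ValueError(f"unknown mode: {mode!r}")
    successes, failures = split_peers(group, index)
    parts = [f"Problem:\n{prompt}"]
    for k, r in enumerate(successes[:max_peers], start=1):
        parts.append(f"Successful attempt {k}:\n{r.text}")
    if mode == "contrastive":
        for k, r in enumerate(failures[:max_peers], start=1):
            parts.append(f"Failed attempt {k}:\n{r.text}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    group = [
        Rollout("TARGET: x = 3", reward=0.0),
        Rollout("PEER_OK: x = 4", reward=1.0),
        Rollout("PEER_BAD: x = 5", reward=0.0),
    ]
    print(build_teacher_context("Solve 2x = 8.", group, index=0))
```

In a full pipeline, the teacher would then score the target rollout's tokens given this context, and those token-level scores would serve as the distillation signal. That step depends on the model stack and is left out here.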


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
