arXiv

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

Abstract: On-Policy Distillation (OPD) for large language models is moving away from comprehensive KL supervision across full traces and toward more selective training. Current OPD approaches increasingly focus on three questions: which trajectories are worth training on, which tokens are most informative, and how reliable the supervision signal is. Building on this trend, we reexamine the granularity of OPD optimization and introduce FiRe-OPD (Filter, then Reweight), a method that modulates supervision signals at both the trajectory level and the token level. FiRe-OPD first filters out low-quality rollout samples so that only high-quality trajectories remain, then softly reweights tokens within the retained trajectories to emphasize the most informative ones. Unlike hard token selection, this soft weighting reduces information loss and improves optimization stability, yielding a more refined OPD optimization process. We demonstrate the efficacy of FiRe-OPD in strong-to-weak, single-teacher, and multi-teacher scenarios. FiRe-OPD outperforms recent token-level OPD methods, with improvements of +6.25 on AIME 2024 in the strong-to-weak setting and +18.81 on Miner in the multi-teacher setting. The code for this work is available at https://github.com/YuYingLi0/FiRe-OPD.
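
The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how trajectory quality is scored or how token weights are computed, so the sketch assumes a hypothetical per-trajectory score (`traj_scores`), a top-fraction filter (`keep_ratio`), and token weights given by a softmax over per-token student-teacher divergence (`temperature`). All of these names and choices are assumptions made for illustration.

```python
import numpy as np

def filter_then_reweight_loss(token_kl, mask, traj_scores,
                              keep_ratio=0.5, temperature=1.0):
    """Illustrative filter-then-reweight loss (assumed form, not the paper's).

    token_kl:    (B, T) per-token student-teacher divergence
    mask:        (B, T) 1 for real tokens, 0 for padding
    traj_scores: (B,)   hypothetical per-trajectory quality score
    Assumes every trajectory has at least one unmasked token.
    """
    token_kl = np.asarray(token_kl, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    traj_scores = np.asarray(traj_scores, dtype=np.float64)

    # Step 1 (filter): keep the top `keep_ratio` fraction of trajectories.
    n_keep = max(1, int(np.ceil(keep_ratio * len(traj_scores))))
    keep_idx = np.argsort(-traj_scores)[:n_keep]
    kl, m = token_kl[keep_idx], mask[keep_idx]

    # Step 2 (reweight): masked softmax over each kept trajectory's tokens,
    # so every real token keeps a nonzero weight (soft, not hard, selection).
    logits = np.where(m > 0, kl / temperature, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits) * m
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Weighted divergence per trajectory, averaged over kept trajectories.
    loss = float((weights * kl).sum(axis=1).mean())
    return loss, keep_idx, weights
```

In this sketch, raising `temperature` flattens the token weights toward a uniform average, while lowering it approaches hard selection of the highest-divergence token; the soft regime in between is what the abstract contrasts with hard token selection.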


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
