Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation
Abstract: On-Policy Distillation (OPD) for large language models is moving away from comprehensive KL supervision over full traces and toward more selective training. Recent OPD approaches increasingly focus on identifying high-value trajectories, selecting the most informative tokens, and assessing the reliability of the supervision signal. Following this trend, we reexamine the granularity of OPD optimization and introduce FiRe-OPD (Filter, then Reweight), which modulates the supervision signal at both the trajectory and token levels. FiRe-OPD first filters out low-quality rollout samples so that only high-quality trajectories are retained, then softly reweights the tokens within the retained trajectories to emphasize the most informative ones. Unlike hard token selection, this soft weighting reduces information loss and improves optimization stability. We evaluate FiRe-OPD in strong-to-weak, single-teacher, and multi-teacher scenarios. FiRe-OPD outperforms recent token-level OPD methods, with improvements of +6.25 on AIME 2024 in the strong-to-weak setting and +18.81 on Miner in the multi-teacher setting. The code is available at https://github.com/YuYingLi0/FiRe-OPD.
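The two-stage idea in the abstract (filter trajectories, then softly reweight tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how trajectory quality is scored or how token weights are computed, so the `score` field, the `min_score` threshold, the log-probability-gap weighting, and all function names below are assumptions chosen purely for illustration.

```python
import math

def filter_trajectories(rollouts, min_score):
    """Stage 1 (assumed criterion): keep rollouts whose score is at least min_score."""
    return [r for r in rollouts if r["score"] >= min_score]

def soft_token_weights(student_logp, teacher_logp, temperature=1.0):
    """Stage 2 (assumed weighting): softmax over per-token |teacher - student|
    log-prob gaps, rescaled so the weights average to 1 within a trajectory.
    Every token keeps a strictly positive weight, unlike hard selection."""
    gaps = [abs(t - s) / temperature for s, t in zip(student_logp, teacher_logp)]
    if not gaps:
        return []
    peak = max(gaps)  # subtract the max for numerical stability
    exps = [math.exp(g - peak) for g in gaps]
    total = sum(exps)
    return [len(gaps) * e / total for e in exps]

def filter_then_reweight_loss(rollouts, min_score=0.5, temperature=1.0):
    """Weighted sampled reverse-KL estimate, averaged over retained trajectories."""
    kept = [r for r in filter_trajectories(rollouts, min_score) if r["student_logp"]]
    if not kept:
        return 0.0
    per_trajectory = []
    for r in kept:
        s, t = r["student_logp"], r["teacher_logp"]
        w = soft_token_weights(s, t, temperature)
        # Per-token estimate on student-sampled tokens: log p_student - log p_teacher.
        token_losses = [wi * (si - ti) for wi, si, ti in zip(w, s, t)]
        per_trajectory.append(sum(token_losses) / len(token_losses))
    return sum(per_trajectory) / len(per_trajectory)

# Hypothetical rollouts: the second one falls below the threshold and is dropped.
rollouts = [
    {"score": 1.0, "student_logp": [-1.0, -2.0], "teacher_logp": [-1.0, -0.5]},
    {"score": 0.0, "student_logp": [-3.0], "teacher_logp": [-0.1]},
]
loss = filter_then_reweight_loss(rollouts)
```

In this sketch the weights are treated as constants; in a real training loop they would typically be computed without gradient flow, which is again an assumption rather than something the abstract states.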
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



