OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning
Abstract:
This paper investigates on-policy self-distillation (OPSD), a method where language models enhance their reasoning capabilities by absorbing privileged teacher distributions derived from their own on-policy trajectories. While promising, OPSD often encounters training instability stemming from discrepancies between teacher and student response patterns. Specifically, self-reflective teacher responses can embed reflection-induced biases and rigid response templates, which misalign token-level supervision and consequently degrade the student model’s reasoning performance.
To address these challenges, we introduce OGLS-SD, a framework that employs outcome-guided logit steering to calibrate privileged teacher logits using verifiable outcome rewards. OGLS-SD functions by contrasting the logits generated by successful versus failed on-policy trajectories, thereby establishing an outcome-discriminative steering direction to guide token-level decisions. Our experiments on mathematical reasoning benchmarks demonstrate that OGLS-SD not only stabilizes the self-distillation process but also yields superior performance compared to standard OPSD and its alternative variants.
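The steering idea described above can be illustrated with a small, hedged sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the steering direction is taken as the difference between mean next-token logits from successful and failed trajectories, it is added to the teacher logits with a hypothetical scale `alpha`, and the student is trained with a forward KL divergence against the steered teacher. The function names and toy values are likewise hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_logits(rows):
    """Element-wise mean over a list of equal-length logit vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(success_logits, failure_logits):
    """ASSUMPTION: direction = mean(success logits) - mean(failure logits)."""
    pos = mean_logits(success_logits)
    neg = mean_logits(failure_logits)
    return [p - q for p, q in zip(pos, neg)]


def steer_teacher_logits(teacher_logits, direction, alpha=1.0):
    """ASSUMPTION: additive steering with a hypothetical scale alpha."""
    return [t + alpha * d for t, d in zip(teacher_logits, direction)]


def kl_divergence(p, q):
    """Forward KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token logits over a 3-token vocabulary at one position.
# Trajectories are split by a verifiable outcome reward (correct vs. incorrect).
success = [[2.0, 0.5, 0.1], [1.8, 0.7, 0.0]]
failure = [[0.2, 1.9, 0.3], [0.4, 2.1, 0.1]]
teacher = [1.0, 1.0, 0.2]  # privileged teacher logits (illustrative values)
student = [0.8, 1.1, 0.3]  # student logits (illustrative values)

direction = steering_direction(success, failure)
steered = steer_teacher_logits(teacher, direction, alpha=0.5)

# Distillation loss of the student against the steered teacher distribution.
loss = kl_divergence(softmax(steered), softmax(student))
```

In this toy setup the steered teacher puts more probability on the token favored by successful trajectories than the unsteered teacher does, which is the qualitative effect the abstract attributes to outcome-guided steering. How the actual method aggregates logits, scales the direction, or chooses the divergence may differ.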
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





