Self-Distilled Policy Gradient
Abstract: On-policy self-distillation, in which a language model uses privileged context to guide its own outputs, is a valuable source of dense supervision for sparse-reward reinforcement learning. It can be expressed formally as an auxiliary full-vocabulary reverse Kullback-Leibler (KL) divergence loss from student to teacher. Building on this formulation, we introduce SDPG, a self-distilled policy-gradient framework that combines group-relative verifier advantages with standard-deviation normalization, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization. Experiments show that SDPG is more stable and achieves better performance than both RLVR and self-distillation baselines. The source code is available at https://github.com/lauyikfung/SDPG.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
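
The three ingredients named in the abstract can be illustrated with a minimal per-token sketch in plain Python. This is not the authors' implementation (see the linked repository for that). The function names, the coefficients `distill_coef` and `ref_coef`, the direction of the reference-policy KL term, and the way the terms are summed are all assumptions made for illustration; only the overall structure (group-relative advantages normalized by the group standard deviation, a full-vocabulary student-to-teacher reverse KL, and a KL penalty toward a reference policy) comes from the abstract.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), summed over the full vocabulary."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def group_relative_advantages(rewards, eps=1e-6):
    """Center verifier rewards within a group and divide by the group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def sdpg_token_loss(advantage, logp_sampled,
                    student_logits, teacher_logits, ref_logits,
                    distill_coef=1.0, ref_coef=0.1):
    """Hypothetical per-token loss combining the three terms.

    distill_coef and ref_coef are illustrative weights, not values
    taken from the paper.
    """
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)  # same model, with privileged context
    ref = softmax(ref_logits)

    # Policy-gradient surrogate for the sampled token.
    pg = -advantage * logp_sampled
    # Reverse KL from student to teacher: expectation under the student.
    distill = kl_divergence(student, teacher)
    # Assumed direction for the reference-policy regularizer.
    ref_kl = kl_divergence(student, ref)
    return pg + distill_coef * distill + ref_coef * ref_kl


# Usage: four sampled responses with binary verifier rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
student_logits = [2.0, 0.5, -1.0]
loss = sdpg_token_loss(
    advantage=advantages[0],
    logp_sampled=math.log(softmax(student_logits)[0]),
    student_logits=student_logits,
    teacher_logits=[2.5, 0.0, -1.0],
    ref_logits=[1.5, 0.5, -0.5],
)
```

Taking the expectation under the student makes the distillation term mode-seeking: the student is penalized for placing mass where the teacher assigns little probability, which is the usual motivation for a reverse rather than forward KL in on-policy distillation.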






