Physics-Guided Policy Optimization with Self-Distillation
Original: arXiv:2606.03620v1 Announce Type: cross Abstract: Self-distilled policy optimization (SDPO) has become a popular paradigm for LLM post-training, where a model learns from its own predictions conditioned on privileged information. SDPO, however, is sensitive to how much each update step should be trusted: corrections from a self-teacher can be highly informative on some batches and misleading on others, and applying them uniformly with a fixed step size can destabilize training. Drawing inspiration from viscous-fluid dynamics and formalizing the analogy at the SDE level, we propose Physics-Guided Policy Optimization (PGPO), which introduces an information-modulated step-size multiplier derived from a mutual-information estimate between the student's predictions and the feedback-conditioned teacher. We show that this modulation preserves the order-1 weak-approximation guarantees of vanilla SGD, and incurs negligible overhead per iteration. We evaluate PGPO on the Science-QA dataset, where it outperforms SDPO on 3 of the 4 domains with gains of up to +4.5 points, while remaining stable in a setting where SDPO collapses late in training.
Rewrite: Physics-Guided Policy Optimization with Self-Distillation
Self-distilled policy optimization (SDPO) has become a popular approach to large language model (LLM) post-training: the model learns from its own predictions conditioned on privileged information. SDPO is, however, sensitive to how far each update step can be trusted. Corrections from the self-teacher are informative on some batches and misleading on others, so applying them uniformly with a fixed step size can destabilize training. To address this, we introduce Physics-Guided Policy Optimization (PGPO), a method inspired by viscous-fluid dynamics, with the analogy formalized at the level of stochastic differential equations (SDEs). PGPO scales the step size by an information-modulated multiplier derived from a mutual-information estimate between the student's predictions and the feedback-conditioned teacher. We show that this modulation preserves the order-1 weak-approximation guarantees of vanilla stochastic gradient descent (SGD) and adds negligible overhead per iteration. On the Science-QA dataset, PGPO outperforms SDPO on 3 of the 4 domains, with gains of up to +4.5 points, and remains stable in a setting where SDPO collapses late in training.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
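The abstract describes a step-size multiplier driven by a mutual-information estimate between student and teacher predictions, but it gives neither the estimator nor the mapping. The Python sketch below is only an illustration of that general idea under stated assumptions: predictions are treated as discrete answers, the estimate is a simple plug-in (empirical) one, and the multiplier is a clipped linear map in which higher estimated information yields a larger step. The names `plugin_mutual_information`, `step_size_multiplier`, `mi_ref`, `low`, and `high` are hypothetical and do not come from the paper, and the direction of the mapping is an assumption.

```python
import math
from collections import Counter


def plugin_mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats from paired discrete samples.

    Assumption: xs are the student's discrete answers and ys the
    teacher's, one pair per example in a batch. The paper's actual
    estimator is not given in the abstract.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts.
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    # Guard against tiny negative values from floating-point rounding.
    return max(mi, 0.0)


def step_size_multiplier(mi, mi_ref, low=0.5, high=1.5):
    """Hypothetical map from an MI estimate to a multiplier in [low, high].

    Assumption: the multiplier grows linearly with mi / mi_ref and is
    clipped; the paper's actual functional form is not stated.
    """
    if mi_ref <= 0:
        raise ValueError("mi_ref must be positive")
    frac = min(max(mi / mi_ref, 0.0), 1.0)
    return low + (high - low) * frac


if __name__ == "__main__":
    student = ["A", "B", "A", "B"]
    teacher = ["A", "B", "A", "B"]
    mi = plugin_mutual_information(student, teacher)
    base_lr = 1e-5  # illustrative value only
    print(base_lr * step_size_multiplier(mi, mi_ref=math.log(2)))
```

In a training loop, a multiplier of this kind would scale the base learning rate for the current batch before the optimizer step. Plug-in estimates are biased on small batches, so a practical variant would likely smooth the estimate across steps.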



