Constitutional On-Policy Safe Distillation
Original: arXiv:2606.03089v1 Announce Type: cross Abstract: On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm by using a teacher conditioned on privileged information to provide dense token-level supervision. Prior work has shown that OPSD can collapse in verifiable reasoning tasks, but safety alignment differs in that it is guided by high-level constitutions rather than explicit target answers, making it a natural setting to revisit dense distillation. However, our pilot study shows that safety OPSD still suffers from severe collapse: constitutional conditioning contracts the teacher distribution toward short and overly conservative responses, and Reverse KL further amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. Based on this analysis, we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold-start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety-helpfulness trade-off than baselines while substantially reducing the safety tax on general reasoning ability.
Rewrite: On-policy self-distillation (OPSD) has gained traction as an efficient post-training approach in which a teacher model conditioned on privileged information provides dense, token-level supervision. Previous research shows that OPSD can collapse on verifiable reasoning tasks. Safety alignment differs because it is guided by high-level constitutions rather than explicit target answers, which makes it a natural setting for re-examining dense distillation. Nevertheless, the pilot study reveals that safety-focused OPSD still suffers from severe collapse. Specifically, conditioning on a constitution contracts the teacher's distribution toward short, overly conservative responses, and Reverse KL amplifies this contraction into reduced expressiveness. The authors formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. To address it, they propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold-start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety-helpfulness trade-off than the baselines while substantially reducing the safety tax on general reasoning ability.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
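
The abstract refers to dense token-level supervision with a Reverse KL objective. As a rough illustration only, the sketch below computes a mean per-token reverse KL, KL(student || teacher), on toy logits. It is not the paper's implementation: the function names, the four-token vocabulary, and the logit values are all assumptions made for this example, and the step of sampling tokens from the student (the "on-policy" part) is omitted.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def reverse_kl_loss(student_logits, teacher_logits):
    """Mean per-token reverse KL, KL(student || teacher), over a sequence.

    Each argument is a list with one logit vector per token position.
    """
    per_token = [
        kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)

# Toy data (hypothetical): one position, vocabulary of 4 tokens.
# The teacher is sharply peaked on token 0; the student is uniform.
teacher_logits = [[4.0, 0.0, 0.0, 0.0]]
student_logits = [[0.0, 0.0, 0.0, 0.0]]

reverse = reverse_kl_loss(student_logits, teacher_logits)
forward = kl(softmax(teacher_logits[0]), softmax(student_logits[0]))
print(f"reverse KL(student || teacher) = {reverse:.4f}")
print(f"forward KL(teacher || student) = {forward:.4f}")
```

In this toy case the reverse KL is the larger of the two, because the student places mass on tokens to which the peaked teacher assigns very low probability. That is consistent with the abstract's description of a contracted teacher pulling the student toward a narrower distribution, but it is a single hand-picked example, not evidence for the paper's claims.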



