arXiv

Constitutional On-Policy Safe Distillation

Original: arXiv:2606.03089v1 Announce Type: cross Abstract: On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm by using a teacher conditioned on privileged information to provide dense token-level supervision. Prior work has shown that OPSD can collapse in verifiable reasoning tasks, but safety alignment differs in that it is guided by high-level constitutions rather than explicit target answers, making it a natural setting to revisit dense distillation. However, our pilot study shows that safety OPSD still suffers from severe collapse: constitutional conditioning contracts the teacher distribution toward short and overly conservative responses, and Reverse KL further amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. Based on this analysis, we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold-start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety-helpfulness trade-off than baselines while substantially reducing the safety tax on general reasoning ability.

Rewrite: On-policy self-distillation (OPSD) has gained traction as an efficient post-training approach in which a teacher conditioned on privileged information provides dense, token-level supervision. Prior work shows that OPSD can collapse on verifiable reasoning tasks. Safety alignment differs because it is guided by high-level constitutions rather than explicit target answers, which makes it a natural setting for re-examining dense distillation. Nevertheless, our pilot study shows that safety OPSD still suffers from severe collapse. Conditioning on a constitution contracts the teacher’s distribution toward short, overly conservative responses, and reverse KL amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. To address this, we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold-start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety-helpfulness trade-off than baselines while substantially reducing the safety tax on general reasoning ability.
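
To make the role of reverse KL concrete, the sketch below computes the textbook quantity KL(student || teacher) at a single token position and averages it over a response. The abstract does not state the paper's exact loss, so this is only an illustration under stated assumptions: the loss is taken to be the mean per-token reverse KL over a student-sampled response, and the names `log_softmax`, `reverse_kl`, and `sequence_reverse_kl` are hypothetical, not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position.

    The expectation is taken under the student, so the student is
    penalized for placing mass where the teacher places little.
    """
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def sequence_reverse_kl(student_seq, teacher_seq):
    """Mean per-token reverse KL over a response (assumed aggregation)."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(per_token) / len(per_token)


# Toy two-token vocabulary: uniform student, teacher with probabilities (0.75, 0.25).
student = [0.0, 0.0]
teacher = [math.log(3.0), 0.0]
loss = reverse_kl(student, teacher)
```

Because the expectation is taken under the student, minimizing this quantity pulls the student toward the regions where the teacher concentrates its mass. If constitutional conditioning has already narrowed the teacher toward short, conservative replies, this is the direction in which the abstract says the contraction is amplified.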


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
