Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight
Abstract:
As large language models (LLMs) grow more capable, the ability of weaker supervisors to offer reliable labels, preferences, or final judgments for complex outputs diminishes. This limitation hinders both weak-to-strong generalization and the scalability of oversight mechanisms. To address this, we explore a more manageable approach to weak supervision: employing a weak model as a critic rather than a labeler or judge. In this framework, the weak critic is not required to solve the task or identify the correct answer; instead, it merely needs to offer a revision direction that is not misleading, thereby enabling the strong model to better leverage its own internal knowledge. We term this paradigm weak-critic strong oversight.
Our analysis shows that weak critiques can improve the performance of frozen strong models at inference time, and that critique quality is a key factor in this improvement. Furthermore, we introduce progressive on-policy critique distillation (OPCD), a method that filters for high-quality critiques and transfers critic-guided behaviors into the strong model using adaptive self-teacher signals. Empirical results across reasoning and alignment benchmarks indicate that our approach improves strong-model performance over successive training epochs, highlighting a promising avenue for achieving scalable oversight through weak supervision.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




