On the Generalization Gap in Self-Evolving Language Model Reasoning
Title: Investigating the Generalization Gap in Self-Evolving Language Model Reasoning
Abstract: Recent work suggests that large language models (LLMs) can improve themselves through self-evolution (SE), a process driven by supervision signals that the models generate themselves. This study examines how well such systems work in a strict closed-loop setting, in which the self-evolution algorithm has access only to an unlabeled prompt dataset and a base model. The central question is how closely internally generated supervision can approach the performance of oracle-supervised training. We examine four methods within a unified offline self-evolution framework: single-round verification, iterative training, curriculum learning, and multi-turn revision with feedback. Our experiments primarily use Knights and Knaves (KK) logical reasoning tasks, chosen for their deterministic answers, adjustable difficulty, and suitability as a testbed for easy-to-hard generalization. We find that self-evolution reliably improves performance over the baseline, but gains diminish with additional compute and a significant gap to oracle supervision remains. Multi-turn critic-revision with larger models yields the best results; for instance, Gemma 12B approaches the performance of oracle-supervised training. On real-world reasoning benchmarks, improvements remain limited. Together, these results delineate the limits of closed-loop self-evolution and indicate that internally derived supervision is inadequate in this minimal configuration.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





