Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO
Title: Leveraging Small Models as Natural Explorers to Enhance Policy Diversity in GRPO
Abstract:
This study introduces an approach to increasing rollout diversity in Group Relative Policy Optimization (GRPO) for Large Language Models (LLMs). GRPO depends on diverse rollouts, yet current methods predominantly obtain them by injecting additional token-level randomness. This step-wise noise often yields incoherent trajectories. In contrast, we show that smaller models within the same family naturally exhibit higher policy-level diversity, as evidenced by pass@k scores that surpass those of larger models as the number of samples k grows. Unlike artificial token-level noise, this inherent diversity is temporally correlated, maintains logical consistency, and delivers structured exploration signals for gradient estimation.
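The pass@k comparison above can be made concrete with a short sketch. The abstract does not say how pass@k is computed, so this assumes the commonly used unbiased estimator 1 - C(n-c, k)/C(n, k), where n samples are drawn per problem and c of them are correct. The per-problem counts below are invented toy numbers, not results from the paper; they only illustrate how a model with lower pass@1 can overtake a stronger but less diverse model as k grows.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given correct counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical toy counts (NOT from the paper), n = 8 samples per problem.
# "large": always solves two problems, never solves the other two.
# "small": solves each problem only occasionally, but covers all four.
N_SAMPLES = 8
large_counts = [8, 8, 0, 0]
small_counts = [2, 2, 2, 2]

for k in (1, 4):
    large = mean_pass_at_k(large_counts, N_SAMPLES, k)
    small = mean_pass_at_k(small_counts, N_SAMPLES, k)
    print(f"k={k}: large={large:.3f} small={small:.3f}")
```

In this toy setting the large model leads at k=1 and the small model leads at k=4, which is the crossover pattern the abstract describes.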
Based on these findings, we propose S2L-PO (Small-to-Large Policy Optimization), a framework that uses fixed small models as natural explorers to train larger models. To balance exploration with exploitation, we develop a progressive annealing strategy that gradually shifts from offline rollouts generated by the small models to rollouts sampled directly from the large learner. This transition avoids the performance degradation during training that typically arises from the capacity limitations of small models, enabling faster convergence and a higher performance ceiling. S2L-PO yields significant accuracy improvements across several mathematical reasoning benchmarks, including an 8.8% gain on AIME 24 when a 1.7B model guides an 8B model, while also reducing the computational cost of rollouts.
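One way to picture the progressive annealing described above is as a schedule over the share of each rollout group that comes from the small model's offline pool. The abstract does not give the schedule's form, so the linear decay below, along with the names `small_model_fraction`, `split_group`, and `anneal_steps`, are illustrative assumptions rather than the paper's method.

```python
def small_model_fraction(step: int, anneal_steps: int,
                         start: float = 1.0, end: float = 0.0) -> float:
    """Assumed linear schedule: share of rollouts taken from the small model.

    Decays from `start` at step 0 to `end` at `anneal_steps`, then stays flat.
    """
    if anneal_steps <= 0:
        raise ValueError("anneal_steps must be positive")
    progress = min(max(step / anneal_steps, 0.0), 1.0)
    return start + (end - start) * progress


def split_group(group_size: int, fraction: float) -> tuple[int, int]:
    """Split one rollout group into (offline small-model, on-policy learner) counts."""
    n_small = min(max(round(group_size * fraction), 0), group_size)
    return n_small, group_size - n_small


# Hypothetical settings: groups of 8 rollouts, annealed over 100 steps.
GROUP_SIZE = 8
ANNEAL_STEPS = 100

for step in (0, 25, 50, 100, 150):
    frac = small_model_fraction(step, ANNEAL_STEPS)
    n_small, n_large = split_group(GROUP_SIZE, frac)
    print(f"step={step:3d} fraction={frac:.2f} small={n_small} learner={n_large}")
```

Early groups are drawn entirely from the small explorer, and after `anneal_steps` every rollout comes from the large learner, so the final phase of training is unaffected by the small model's capacity limit. Other decay shapes (cosine, stepwise) would fit the same description.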
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



