arXiv

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Title: Leveraging Small Models as Natural Explorers to Enhance Policy Diversity in GRPO

Abstract:

This study introduces an approach to increasing rollout diversity in Group Relative Policy Optimization (GRPO) for large language models (LLMs). GRPO depends on diverse rollouts, but current methods mostly obtain them by injecting additional token-level randomness. That tactic adds step-wise noise and often yields incoherent trajectories. In contrast, we show that smaller models within the same family naturally exhibit higher policy-level diversity, as evidenced by pass@k scores that exceed those of larger models as the number of samples k grows. Unlike artificial token-level noise, this inherent diversity is temporally correlated, maintains logical consistency, and delivers structured exploration signals for gradient estimation.
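To make the pass@k comparison concrete, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c the number that are correct. The abstract does not say which estimator the authors use, so this choice is an assumption. The per-problem counts in `SMALL` and `LARGE` are invented for illustration and are not results from the paper; they only show how a model with lower pass@1 can have higher pass@k at large k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples out of n = 100 on four problems.
# LARGE is consistent but narrow; SMALL is less reliable but covers more problems.
LARGE = [100, 100, 0, 0]
SMALL = [30, 30, 5, 5]

if __name__ == "__main__":
    for k in (1, 8, 64):
        print(k, mean_pass_at_k(SMALL, 100, k), mean_pass_at_k(LARGE, 100, k))
```

With these made-up counts, the large model leads at k = 1 while the small model leads at k = 64, which is the qualitative pattern the abstract describes.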

Based on these findings, we propose S2L-PO (Small-to-Large Policy Optimization), a framework that uses fixed small models as natural explorers to train larger models. To balance exploration with exploitation, we develop a progressive annealing strategy that gradually shifts from offline rollouts generated by small models to rollouts sampled directly from the large learner. This transition avoids the performance degradation during training that typically arises from the capacity limitations of small models, and it enables faster convergence and a higher performance ceiling. S2L-PO delivers significant accuracy improvements across various mathematical reasoning benchmarks (for example, an 8.8% gain on AIME 24 when a 1.7B model guides an 8B model) while also reducing the computational cost of rollouts.
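A minimal sketch of how such an annealing schedule could be wired into a GRPO-style update follows. The abstract does not specify the schedule or how rollouts are combined, so the linear decay, the within-group mixing, and the names `small_model_fraction` and `split_group` are all assumptions made for illustration. The advantage computation is the usual group-relative normalization (reward minus group mean, divided by group standard deviation); any off-policy correction for the small-model rollouts is left out.

```python
from statistics import mean, pstdev


def small_model_fraction(step: int, total_steps: int,
                         start: float = 1.0, end: float = 0.0) -> float:
    """Assumed linear anneal of the share of rollouts taken from the small model."""
    if total_steps <= 1:
        return end
    t = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * t


def split_group(group_size: int, frac_small: float) -> tuple[int, int]:
    """Number of (small-model, large-model) rollouts in one group of a given size."""
    n_small = min(group_size, max(0, round(group_size * frac_small)))
    return n_small, group_size - n_small


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    total_steps, group_size = 100, 8
    for step in (0, 50, 99):
        frac = small_model_fraction(step, total_steps)
        print(step, split_group(group_size, frac))
    # Binary correctness rewards for one hypothetical mixed group.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Early in training every rollout in a group comes from the fixed small model; by the final step all rollouts are sampled from the large learner, matching the small-to-large shift described above.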


Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
