arXiv

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

Title: Implementing Advantage-Prioritized Experience Replay at the Rollout Level for GRPO

Abstract:

For post-training reasoning in Large Language Models (LLMs), Reinforcement Learning from Verifiable Rewards using Group Relative Policy Optimization (GRPO) has emerged as a standard methodology. However, this approach suffers from significant sample inefficiency, as each generated rollout is utilized for only one gradient update before being discarded. Standard experience replay techniques are ill-suited for this context because LLM policies drift rapidly with each gradient step, causing stored rollouts to become outdated and potentially destabilizing the training process.

To address these challenges, we introduce a rollout-level replay buffer specifically designed for GRPO. Unlike traditional methods that handle groups of samples, our buffer stores and samples individual rollouts. To manage data staleness, the buffer employs an age-based eviction strategy, automatically removing any rollout that exceeds $\tau_{\max}$ training steps. Furthermore, we maintain on-policy integrity through fresh-anchored composition. This mechanism ensures that each training batch consists of fresh, on-policy rollouts concatenated with replayed rollouts sampled independently from the buffer.
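The buffer mechanics described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Rollout`, `RolloutReplayBuffer`, and `compose_batch`, the uniform sampling without replacement, and the choice to add fresh rollouts to the buffer only after the batch is composed are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    # Hypothetical record for one stored rollout (field names are assumptions).
    prompt_id: int
    tokens: list
    advantage: float
    step: int  # training step at which the rollout was generated


class RolloutReplayBuffer:
    """Stores individual rollouts and evicts those older than tau_max steps."""

    def __init__(self, tau_max):
        self.tau_max = tau_max
        self.items = []

    def add(self, rollouts):
        self.items.extend(rollouts)

    def evict(self, current_step):
        # Age-based eviction: drop any rollout whose age exceeds tau_max.
        self.items = [
            r for r in self.items if current_step - r.step <= self.tau_max
        ]

    def sample(self, k, rng):
        # Uniform sampling is an assumption made for this sketch.
        k = min(k, len(self.items))
        return rng.sample(self.items, k)


def compose_batch(fresh, buffer, n_replay, current_step, rng):
    """Fresh-anchored composition: fresh rollouts first, then replayed ones."""
    buffer.evict(current_step)
    replayed = buffer.sample(n_replay, rng)
    batch = list(fresh) + replayed
    # Store the fresh rollouts afterwards so they are not replayed
    # in the same step in which they were generated.
    buffer.add(fresh)
    return batch
```

In this sketch every batch contains the full set of fresh rollouts, so the replayed portion can only add to, never displace, the on-policy data.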

Our prioritization strategy is based on the magnitude of each rollout's advantage, selectively replaying rollouts whose advantage is large in magnitude. We evaluated this method across three scales of the Qwen3-Base model on five mathematical benchmarks. The results show that our approach consistently surpasses both standard GRPO and naive replay baselines. Improvements are observed at every model scale and increase with model size. The largest gain was recorded at the 4B parameter scale: +4.35 percentage points in the average score across the five benchmarks. Additionally, under an AES metric that balances accuracy and token efficiency, the 4B model showed the greatest efficiency improvement over GRPO, with a margin of +0.579.
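As a rough illustration of advantage-prioritized replay, the sketch below samples stored rollouts with probability proportional to the magnitude of their advantage. The abstract does not give the exact priority formula, so the exponent `alpha`, the smoothing constant `eps`, sampling with replacement, and the function names are all assumptions rather than the authors' method.

```python
import random


def priority_weights(advantages, alpha=1.0, eps=1e-6):
    # Assumed priority: |advantage|^alpha, with eps so that
    # zero-advantage rollouts keep a tiny nonzero weight.
    return [(abs(a) + eps) ** alpha for a in advantages]


def sample_prioritized(advantages, k, rng, alpha=1.0):
    """Return k buffer indices drawn in proportion to their priority."""
    if not advantages:
        return []
    weights = priority_weights(advantages, alpha)
    # random.Random.choices samples with replacement.
    return rng.choices(range(len(advantages)), weights=weights, k=k)
```

For example, `sample_prioritized([0.0, 2.0, -2.0], k=4, rng=random.Random(0))` draws almost exclusively from indices 1 and 2, since both have the same advantage magnitude and index 0 has nearly zero weight.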


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
