Rollout-Level Advantage-Prioritized Experience Replay for GRPO
Title: Implementing Advantage-Prioritized Experience Replay at the Rollout Level for GRPO
Abstract:
For post-training reasoning in Large Language Models (LLMs), Reinforcement Learning from Verifiable Rewards using Group Relative Policy Optimization (GRPO) has emerged as a standard methodology. However, this approach suffers from significant sample inefficiency, as each generated rollout is utilized for only one gradient update before being discarded. Standard experience replay techniques are ill-suited for this context because LLM policies drift rapidly with each gradient step, causing stored rollouts to become outdated and potentially destabilizing the training process.
To address these challenges, we introduce a rollout-level replay buffer specifically designed for GRPO. Unlike traditional methods that handle groups of samples, our buffer stores and samples individual rollouts. To manage data staleness, the buffer employs an age-based eviction strategy, automatically removing any rollout that exceeds $\tau_{\max}$ training steps. Furthermore, we maintain on-policy integrity through fresh-anchored composition. This mechanism ensures that each training batch consists of fresh, on-policy rollouts concatenated with replayed rollouts sampled independently from the buffer.
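The buffer mechanics described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Rollout`, `RolloutBuffer`, `compose_batch`, and `n_replay` are hypothetical, replay sampling here is uniform, and the choice to evict before sampling and to insert fresh rollouts only after the batch is built is an assumption made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt_id: int
    tokens: list
    advantage: float
    born_step: int  # training step at which the rollout was generated


class RolloutBuffer:
    """Stores individual rollouts (not whole groups) with age-based eviction."""

    def __init__(self, tau_max: int, seed: int = 0):
        self.tau_max = tau_max
        self.items = []
        self.rng = random.Random(seed)

    def add(self, rollouts):
        self.items.extend(rollouts)

    def evict(self, step: int):
        # Drop every rollout whose age exceeds tau_max training steps.
        self.items = [r for r in self.items if step - r.born_step <= self.tau_max]

    def sample(self, k: int):
        # Uniform sampling without replacement (assumption for this sketch).
        k = min(k, len(self.items))
        return self.rng.sample(self.items, k)


def compose_batch(fresh, buffer, n_replay, step):
    """Fresh-anchored composition: fresh rollouts first, replayed ones appended."""
    buffer.evict(step)
    replayed = buffer.sample(n_replay)
    batch = list(fresh) + replayed
    buffer.add(fresh)  # fresh rollouts become replay candidates for later steps
    return batch
```

In this sketch every batch always contains the full set of fresh rollouts, so the replayed portion can only supplement on-policy data, never replace it.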
Our prioritization strategy focuses on the magnitude of the advantage per rollout, selectively recycling those with high advantage values. We evaluated this method across three scales of the Qwen3-Base model on five mathematical benchmarks. The results demonstrate that our approach consistently surpasses both standard GRPO and naive replay baselines. Performance improvements are observed at every model scale and increase as model size grows. The most significant performance boost was recorded at the 4B parameter level, achieving a +4.35 percentage point gain in the average score across the five benchmarks. Additionally, when assessed using an AES metric that balances accuracy and token efficiency, the 4B model showed the greatest efficiency improvement over GRPO, with a margin of +0.579.
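As a rough illustration of prioritizing by advantage magnitude, the sketch below computes group-relative advantages in the usual GRPO style (reward minus group mean, divided by group standard deviation) and then draws replay indices with probability proportional to $|A|^{\alpha}$. The abstract does not state the exact sampling rule, so the proportional scheme, the exponent `alpha`, the smoothing constant `eps`, and both function names are assumptions rather than the paper's method.

```python
import random
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def sample_by_advantage(advantages, k, alpha=1.0, eps=1e-6, rng=None):
    """Draw k indices independently, weighted by |advantage| ** alpha.

    eps keeps every weight strictly positive so that a buffer of
    all-zero advantages can still be sampled (assumed behaviour).
    """
    rng = rng or random.Random(0)
    weights = [(abs(a) + eps) ** alpha for a in advantages]
    return rng.choices(range(len(advantages)), weights=weights, k=k)
```

For example, with binary verifiable rewards `[1, 0, 0, 0]`, the single correct rollout receives the largest advantage magnitude in its group and is therefore the most likely to be replayed under this weighting.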
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC




