Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation
Title: Diagnosing and Mitigating Advantage Collapse in Group Relative Policy Optimization
Abstract:
Group Relative Policy Optimization (GRPO), a leading method within the Reinforcement Learning from Verifiable Rewards (RLVR) paradigm, has proven effective at improving the reasoning skills of large language models (LLMs). However, GRPO is susceptible to "advantage collapse," a failure mode in which homogeneous reward signals within a group (for example, when all responses are correct or all are incorrect) yield near-zero advantages and vanishing gradients. To address this, we introduce the Advantage Collapse Rate (ACR), a diagnostic metric that quantifies the fraction of training batches that suffer from ineffective gradients. Our analysis, covering models from 0.5B to 14B parameters on several mathematical reasoning benchmarks, shows that ACR is a strong predictor of both final performance and training stagnation. We further introduce Adaptive Virtual Sample Policy Optimization (AVSPO), a lightweight extension of GRPO. By incorporating virtual reward samples, activated based on real-time ACR monitoring, AVSPO enables learning from homogeneous groups without requiring extra model rollouts. Relative to standard GRPO, AVSPO reduces advantage collapse by 58–63% and delivers consistent accuracy improvements of 4–6 percentage points across all model scales, while preserving generalization on out-of-domain tasks. The code and datasets are available at https://github.com/hexixiang/Advantage-Collapse-Rate.
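To make the failure mode concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the common GRPO formulation in which each reward is normalized by its group's mean and standard deviation, and it counts a group as collapsed when its reward standard deviation is (numerically) zero. The function names, the `eps` and `tol` parameters, and the per-group definition of the rate are illustrative assumptions; the abstract describes ACR only as the fraction of training batches with ineffective gradients.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps).

    `eps` is an assumed stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_collapsed(rewards, tol=1e-8):
    """Assumed criterion: a group is collapsed if its rewards are all equal."""
    return pstdev(rewards) <= tol


def advantage_collapse_rate(groups, tol=1e-8):
    """Fraction of groups whose advantages carry no learning signal."""
    if not groups:
        raise ValueError("need at least one group of rewards")
    collapsed = sum(1 for g in groups if is_collapsed(g, tol))
    return collapsed / len(groups)


# Toy batch of binary verifiable rewards, four responses per prompt.
groups = [
    [1.0, 1.0, 1.0, 1.0],  # all correct: every advantage is zero
    [0.0, 0.0, 0.0, 0.0],  # all incorrect: every advantage is zero
    [1.0, 0.0, 0.0, 1.0],  # mixed: non-zero advantages
    [0.0, 1.0, 0.0, 0.0],  # mixed: non-zero advantages
]

acr = advantage_collapse_rate(groups)  # 2 of 4 groups collapse, so 0.5
collapsed_adv = group_advantages(groups[0])
mixed_adv = group_advantages(groups[2])
```

In this toy batch, the two homogeneous groups produce all-zero advantages and therefore contribute no gradient, while the mixed groups assign positive advantages to correct responses and negative advantages to incorrect ones.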
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





