arXiv

Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

Title: Diagnosing and Mitigating Advantage Collapse in Group Relative Policy Optimization

Abstract:

Group Relative Policy Optimization (GRPO), a leading method within the Reinforcement Learning from Verifiable Rewards (RLVR) paradigm, has demonstrated significant efficacy in enhancing the reasoning skills of large language models (LLMs). Nevertheless, GRPO is susceptible to "advantage collapse," a specific failure mode wherein homogeneous reward signals within a group—such as when all responses are either correct or incorrect—result in near-zero advantages and vanishing gradients. To tackle this issue, we present the Advantage Collapse Rate (ACR), a novel diagnostic metric designed to quantify the fraction of training batches that suffer from ineffective gradients. Our analysis, spanning models ranging from 0.5B to 14B parameters across various mathematical reasoning benchmarks, reveals that ACR is a strong predictor of both final performance and training stagnation. Furthermore, we introduce Adaptive Virtual Sample Policy Optimization (AVSPO), a lightweight enhancement to GRPO. By incorporating virtual reward samples—activated based on real-time ACR monitoring—AVSPO facilitates learning from homogeneous groups without necessitating extra model rollouts. Relative to standard GRPO, AVSPO diminishes advantage collapse by 58–63% and delivers consistent accuracy improvements of 4–6 percentage points across all model scales, while preserving generalization capabilities on out-of-domain tasks. The associated code and datasets can be accessed at https://github.com/hexixiang/Advantage-Collapse-Rate.
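The collapse mechanism described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used group-normalized advantage, (reward minus group mean) divided by (group standard deviation plus a small epsilon), and it reads ACR as the fraction of groups whose rewards are all (near-)identical. The function names, the `eps` and `tol` parameters, and the toy reward values are hypothetical choices for illustration. They are not taken from the paper or its repository, and the paper's exact ACR definition may differ.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumed form of the GRPO advantage; eps is a hypothetical stabilizer.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def is_collapsed(rewards, tol=1e-8):
    """Hypothetical collapse test: the group's rewards are (near-)identical."""
    return pstdev(rewards) <= tol


def advantage_collapse_rate(groups, tol=1e-8):
    """Fraction of groups flagged as collapsed (an assumed reading of ACR)."""
    if not groups:
        raise ValueError("need at least one group of rewards")
    return sum(is_collapsed(g, tol) for g in groups) / len(groups)


# Toy binary rewards for four prompts, four sampled responses each.
groups = [
    [1.0, 1.0, 1.0, 1.0],  # all correct: every advantage is zero
    [0.0, 0.0, 0.0, 0.0],  # all incorrect: every advantage is zero
    [1.0, 0.0, 0.0, 1.0],  # mixed: non-zero advantages
    [0.0, 1.0, 0.0, 0.0],  # mixed: non-zero advantages
]

acr = advantage_collapse_rate(groups)
collapsed_advantages = group_advantages(groups[0])
mixed_advantages = group_advantages(groups[2])
```

In this toy setup, two of the four groups have homogeneous rewards, so they contribute zero advantages and therefore no policy-gradient signal, while the two mixed groups still do. Tracking this fraction over training is one plausible way to monitor the failure mode the abstract describes.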


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
