Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning
Title: Refining Credit Assignment in Mathematical Reasoning Through Outcome-Grounded Advantage Reshaping
Abstract:
Group Relative Policy Optimization (GRPO) has recently gained traction as a critic-free reinforcement learning framework for reasoning tasks. However, standard GRPO assigns credit coarsely: it propagates the same group-level reward signal to every token in a sequence, ignoring how much each individual reasoning step actually contributes. To address this limitation, we propose Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment mechanism that redistributes advantages according to how strongly each token influences the model’s final answer.
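For illustration, the uniform credit assignment described above can be sketched as follows. This is a minimal sketch of the commonly used group-normalized GRPO advantage, not code from the paper; the function name `grpo_token_advantages`, its parameters, and the `eps` stabilizer are assumptions made for the example.

```python
import numpy as np

def grpo_token_advantages(rewards, lengths, eps=1e-8):
    """Hypothetical helper: broadcast one group-relative advantage per
    sampled response to all of that response's tokens."""
    rewards = np.asarray(rewards, dtype=np.float64)
    # Normalize each response's reward against its sampling group.
    seq_adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Every token in a response receives the identical advantage value.
    return [np.full(n, a) for a, n in zip(seq_adv, lengths)]

# Four sampled responses to one prompt, with outcome rewards and token counts.
token_advs = grpo_token_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 1])
```

Within each response the advantage is constant, so a decisive reasoning step and a filler token are credited equally.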
We instantiate OAR with two complementary strategies: (1) OAR-P, which uses counterfactual token perturbations to estimate outcome sensitivity, providing a high-fidelity attribution signal; and (2) OAR-G, which uses an input-gradient sensitivity proxy to approximate the same influence signal with a single backward pass. These importance signals are combined with a conservative bi-level advantage reshaping scheme that amplifies critical tokens and suppresses low-impact ones while preserving the total advantage mass. Extensive experiments on multiple mathematical reasoning benchmarks show that OAR-P sets the performance upper bound, while OAR-G achieves comparable gains with minimal computational overhead. Both variants significantly outperform a strong GRPO baseline, pushing the limits of critic-free Large Language Model (LLM) reasoning.
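The abstract does not give the reshaping formula, so the following is only a sketch of one way a mass-preserving bi-level reshaping could look. The mean-importance threshold, the `boost` and `damp` factors, and the function name `reshape_advantage` are all assumptions for illustration, not the authors' method; the importance scores stand in for whatever OAR-P or OAR-G would produce.

```python
import numpy as np

def reshape_advantage(seq_adv, importance, boost=1.5, damp=0.5):
    """Hypothetical bi-level reshaping: up-weight tokens whose importance
    exceeds the sequence mean, down-weight the rest, then rescale so the
    summed advantage equals seq_adv * num_tokens."""
    importance = np.asarray(importance, dtype=np.float64)
    # Two levels only: "critical" tokens versus "low-impact" tokens.
    weights = np.where(importance > importance.mean(), boost, damp)
    # Rescale so the weights average to 1, which keeps total advantage mass fixed.
    weights = weights * (weights.size / weights.sum())
    return seq_adv * weights

# A 4-token response with sequence advantage 2.0 and assumed importance scores.
reshaped = reshape_advantage(2.0, [0.1, 0.9, 0.2, 0.8])
```

Because the weights average to one, the reshaped advantages sum to the same total as the uniform assignment, and a sequence whose tokens all have equal importance is left unchanged.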
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC


