Policy Improvement Reinforcement Learning
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a pivotal post-training strategy for enhancing the reasoning skills of large language models. However, current methods share a significant oversight: they update the policy using instantaneous group or batch statistics without verifying whether those updates actually improve model performance. This open-loop design, in which each update depends solely on immediate within-batch reward signals, has no mechanism to detect or correct optimization drift and collapse. We posit that the crucial missing component is feedback on policy improvement: the capacity to directly measure and optimize progress across iterations.
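To make "instantaneous group statistics" concrete, the following is a minimal sketch of the group-normalized advantage commonly used in GRPO-style methods. It assumes scalar verifiable rewards for one group of sampled responses; the function name and the `eps` parameter are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (hypothetical helper).

    Each advantage depends only on the current group's mean and standard
    deviation; nothing here checks whether the previous policy update
    actually improved performance.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, with binary verifiable rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative, but the signal is recomputed from scratch for every group, which is the open-loop property described above.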
To address this, we present Policy Improvement Reinforcement Learning (PIRL), a framework that shifts the focus from maximizing surrogate rewards to explicitly optimizing cumulative policy improvement over time. We demonstrate that this temporal objective is perfectly aligned with maximizing final task performance. Leveraging PIRL, we introduce Policy Improvement Policy Optimization (PIPO), a method that achieves closed-loop optimization via retrospective verification. PIPO operates by assessing whether prior updates delivered genuine improvements relative to a sliding-window historical baseline. It then actively reinforces positive updates while suppressing detrimental ones, effectively converting the training process into a self-correcting system. Theoretical analysis confirms that PIPO performs ascent on the PIRL objective in expectation. Furthermore, experiments conducted on mathematical reasoning benchmarks reveal that PIPO offers superior stability and performance compared to GRPO and its variants.
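The retrospective verification step can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: it assumes the sliding-window baseline is the mean of recent per-iteration mean rewards and that the feedback is a simple sign. The class name, the `window` parameter, and the sign-valued signal are all hypothetical; the abstract does not specify how PIPO converts this check into a gradient term.

```python
from collections import deque
from statistics import mean


class RetrospectiveVerifier:
    """Checks each iteration against a sliding-window baseline (hypothetical)."""

    def __init__(self, window=4):
        # Keeps only the most recent `window` per-iteration mean rewards.
        self.history = deque(maxlen=window)

    def verify(self, mean_reward):
        """Return +1.0 if the latest update improved on the baseline,
        -1.0 if it regressed, and 0.0 if there is no history or no change."""
        if not self.history:
            signal = 0.0
        else:
            baseline = mean(self.history)
            gain = mean_reward - baseline
            signal = float((gain > 0) - (gain < 0))
        self.history.append(mean_reward)
        return signal


verifier = RetrospectiveVerifier(window=3)
# Mean rewards observed after four successive policy updates (made-up values).
signals = [verifier.verify(r) for r in [0.40, 0.50, 0.30, 0.45]]
# signals == [0.0, 1.0, -1.0, 1.0]
```

In a closed-loop trainer, a positive signal would be used to reinforce the preceding update and a negative one to suppress it; the sign is used here only to keep the sketch small.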
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC




