arXiv

Policy Improvement Reinforcement Learning

Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a pivotal post-training strategy for enhancing the reasoning skills of large language models. However, current methodologies suffer from a significant oversight: they adjust policies using instantaneous group or batch statistics, failing to verify if these adjustments actually enhance model performance. This open-loop architecture, which updates policies in isolation based solely on immediate within-batch reward signals, lacks mechanisms to detect or correct optimization drift and collapse. We posit that the crucial missing component is feedback on policy improvement, specifically the capacity to directly measure and optimize progress across iterations.

To address this, we present Policy Improvement Reinforcement Learning (PIRL), a framework that shifts the focus from maximizing surrogate rewards to explicitly optimizing cumulative policy improvement over time. We demonstrate that this temporal objective is perfectly aligned with maximizing final task performance. Leveraging PIRL, we introduce Policy Improvement Policy Optimization (PIPO), a method that achieves closed-loop optimization via retrospective verification. PIPO operates by assessing whether prior updates delivered genuine improvements relative to a sliding-window historical baseline. It then actively reinforces positive updates while suppressing detrimental ones, effectively converting the training process into a self-correcting system. Theoretical analysis confirms that PIPO performs ascent on the PIRL objective in expectation. Furthermore, experiments conducted on mathematical reasoning benchmarks reveal that PIPO offers superior stability and performance compared to GRPO and its variants.
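The abstract gives no formulas, so the following is only a minimal sketch of the closed-loop idea, not the paper's PIPO algorithm. It assumes the improvement signal is the current mean reward minus the average over a sliding window of past iterations, and that the sign of that signal decides whether an update is reinforced or suppressed. The names `SlidingWindowBaseline`, `update_weight`, and `cumulative_improvement` are hypothetical, as are the window size and the sign-based rule.

```python
from collections import deque
from statistics import fmean


class SlidingWindowBaseline:
    """Hypothetical helper: tracks the mean reward of recent iterations."""

    def __init__(self, window: int = 4) -> None:
        # Only the last `window` per-iteration mean rewards are kept.
        self.history = deque(maxlen=window)

    def improvement(self, mean_reward: float) -> float:
        """Return mean_reward minus the window average, then record it."""
        # With no history yet, the baseline is the current value (zero signal).
        baseline = fmean(self.history) if self.history else mean_reward
        self.history.append(mean_reward)
        return mean_reward - baseline


def update_weight(improvement: float) -> float:
    """Assumed gating rule: +1 reinforces, -1 suppresses, 0 is neutral."""
    if improvement > 0:
        return 1.0
    if improvement < 0:
        return -1.0
    return 0.0


def cumulative_improvement(rewards: list[float]) -> float:
    """Sum of step-to-step improvements over the whole run."""
    return sum(later - earlier for earlier, later in zip(rewards, rewards[1:]))


# Illustrative per-iteration mean rewards (made-up numbers, not results).
rewards = [0.25, 0.5, 0.25, 0.75]
tracker = SlidingWindowBaseline(window=2)
signals = [update_weight(tracker.improvement(r)) for r in rewards]
```

In this toy run the third iteration falls below its two-step baseline and receives a negative weight, while the second and fourth are reinforced. The step-to-step improvements also telescope to the last reward minus the first, which is one simple way to see why a cumulative-improvement objective can line up with final performance.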


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
