arXiv

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

Abstract:

Process reward models (PRMs) are a staple of language model training, particularly where dense, step-level supervision is required. The prevailing assumption is that PRM scores remain stable under label-preserving transformations, that is, changes that alter the structure of the reasoning while leaving the correctness of the final answer intact. We contend that this premise has not been sufficiently validated. Such transformations can change the relationship between PRM scores and correctness signals, and the resulting failure modes differ from model to model.

To bridge this gap, we propose EST-PRM, a stress-testing framework for dense process rewards. The framework applies three transformations: step inflation, dependency-aware step reordering, and the insertion of confidence markers. We also define a vulnerability decomposition metric that separates reward inflation from a decline in correctness sensitivity.
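The three transformations can be illustrated with a minimal Python sketch. The abstract does not give the actual templates or procedures, so everything here is an assumption made for illustration: the filler sentence, the marker phrase, the `deps` format (step index mapped to the indices it depends on), and the rule of always picking the latest available step when reordering.

```python
from typing import Dict, List, Set

# Hypothetical templates; the paper's actual filler and marker text are not given.
FILLER_STEP = "Let us restate what we have so far before continuing."
CONFIDENCE_MARKER = "Clearly, "


def inflate_steps(steps: List[str], every: int = 1) -> List[str]:
    """Step inflation: add a content-free filler after every `every` steps."""
    out: List[str] = []
    for i, step in enumerate(steps, start=1):
        out.append(step)
        if i % every == 0:
            out.append(FILLER_STEP)
    return out


def add_confidence_markers(steps: List[str]) -> List[str]:
    """Confidence markers: prefix each step with an assertive phrase."""
    return [CONFIDENCE_MARKER + step for step in steps]


def reorder_steps(steps: List[str], deps: Dict[int, Set[int]]) -> List[str]:
    """Dependency-aware reordering: emit steps in an order that respects
    `deps`, always choosing the latest step whose prerequisites are done,
    so the result differs from the original order where dependencies allow."""
    remaining = {i: set(deps.get(i, ())) for i in range(len(steps))}
    order: List[int] = []
    while remaining:
        ready = [i for i, pre in remaining.items() if not pre]
        if not ready:
            raise ValueError("cyclic dependencies between steps")
        pick = max(ready)
        order.append(pick)
        del remaining[pick]
        for pre in remaining.values():
            pre.discard(pick)
    return [steps[i] for i in order]
```

For a three-step chain in which the last step depends on the first two, `reorder_steps(["a", "b", "c"], {2: {0, 1}})` swaps the two independent steps and leaves the final step in place. None of the three functions touches the final answer, which is what makes them label-preserving in the sense used above.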

We evaluated five PRM-style models on 4,687 reasoning chains drawn from MATH-500, GSM8K, and PRMBench. The models show distinct vulnerability patterns. Math-Shepherd was the most sensitive to position perturbations, with a Pearson correlation drop of $0.152 \pm 0.038$ and a score inflation rate of $32.8 \pm 4.9\%$. In contrast, Qwen2.5-Math-PRM was most affected by step inflation, with an inflation rate of $47.6 \pm 4.3\%$. Confidence-based perturbations also distorted reward calibration, exposing inconsistencies in how correctness is estimated. Finally, we assessed three mitigation strategies, which exposed a trade-off between broader robustness coverage and higher false-positive rates.
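The two reported quantities can be made concrete with a small Python sketch. The abstract does not define them, so the definitions below are assumptions chosen for illustration: the correlation drop is taken as the Pearson correlation between chain-level scores and correctness labels before perturbation minus the same correlation after it, and the inflation rate as the fraction of chains whose score rises under perturbation. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def vulnerability(
    scores_before: List[float],
    scores_after: List[float],
    correct: List[int],
) -> Tuple[float, float]:
    """Return (correlation_drop, inflation_rate) under the assumed definitions.

    correlation_drop: loss of correctness sensitivity after perturbation.
    inflation_rate:   share of chains scored higher after perturbation.
    """
    drop = pearson(scores_before, correct) - pearson(scores_after, correct)
    inflated = sum(a > b for a, b in zip(scores_after, scores_before))
    return drop, inflated / len(scores_before)


# Toy data (not from the paper): two correct and two incorrect chains,
# where the perturbation raises the scores of the incorrect chains only.
before = [0.9, 0.2, 0.8, 0.1]
after = [0.9, 0.6, 0.8, 0.5]
labels = [1, 0, 1, 0]
drop, rate = vulnerability(before, after, labels)
```

Keeping the two numbers separate reflects the decomposition described earlier: a perturbation can raise scores across the board without changing how well they track correctness, or it can weaken that tracking without raising the average score.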


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
