EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing
Abstract:
Process reward models (PRMs) are widely used in language model training, particularly where dense, step-level supervision is required. A common assumption is that PRM scores remain stable under label-preserving transformations, that is, changes that alter the structure of a reasoning chain without changing whether its final answer is correct. We argue that this assumption has not been adequately validated: such transformations can fundamentally alter the relationship between PRM scores and correctness signals, and the resulting failure modes differ from model to model.
To bridge this gap, we propose EST-PRM, a stress-testing framework for dense process rewards. The framework applies three transformations: step inflation, dependency-aware step reordering, and confidence-marker insertion. We also define a vulnerability decomposition metric that separates reward inflation from a decline in correctness sensitivity.
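As an illustration only, the three transformation families named above might be sketched as follows. The abstract does not specify how each transformation is implemented, so the function names (`inflate_steps`, `reorder_steps`, `add_confidence_markers`), the filler and marker strings, and the dependency format are all hypothetical assumptions rather than the authors' method. Each function leaves the final answer untouched, which is the sense of "label-preserving" used here.

```python
def inflate_steps(steps, filler="Restating the previous step: "):
    """Step inflation (assumed form): follow every step with a redundant restatement."""
    out = []
    for step in steps:
        out.append(step)
        out.append(filler + step)
    return out


def reorder_steps(steps, deps):
    """Dependency-aware reordering (assumed form).

    deps maps a step index to the set of step indices it depends on.
    Emits a valid topological order that prefers later steps whenever
    their dependencies are already satisfied.
    """
    remaining = set(range(len(steps)))
    done = []
    while remaining:
        ready = [i for i in remaining if deps.get(i, set()) <= set(done)]
        if not ready:
            raise ValueError("cyclic dependencies between steps")
        pick = max(ready)  # choose the latest step that is ready
        done.append(pick)
        remaining.remove(pick)
    return [steps[i] for i in done]


def add_confidence_markers(steps, marker="Clearly, "):
    """Confidence markers (assumed form): prefix each step with an assertive phrase."""
    return [marker + step for step in steps]


chain = ["a = 2", "b = 3", "c = a + b"]
reordered = reorder_steps(chain, {2: {0, 1}})
```

In this toy chain the first two steps are independent, so they can be swapped without invalidating the third step, which depends on both.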
We evaluated five PRM-style models using 4,687 reasoning chains drawn from MATH-500, GSM8K, and PRMBench. Our findings reveal distinct vulnerability patterns among the models. Math-Shepherd exhibited the highest sensitivity to position perturbations, characterized by a Pearson correlation drop of $0.152 \pm 0.038$ and a score inflation rate of $32.8 \pm 4.9\%$. In contrast, Qwen2.5-Math-PRM was most impacted by step inflation, showing an inflation rate of $47.6 \pm 4.3\%$. Furthermore, confidence-based perturbations distorted reward calibration, exposing inconsistencies in how correctness is estimated. Finally, we assessed three mitigation strategies, which highlighted the inherent trade-offs between expanding robustness coverage and managing false-positive rates.
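The two quantities reported above, a Pearson correlation drop and a score inflation rate, could be computed along the following lines. This is a minimal sketch under stated assumptions: the abstract does not give the exact definitions, so here the correlation drop is taken to be the difference between the score-correctness correlation before and after perturbation, and the inflation rate is taken to be the fraction of chains whose score rises after perturbation. The function names, the `eps` threshold, and the toy data are hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (undefined if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def vulnerability(orig_scores, pert_scores, correct, eps=0.0):
    """Assumed decomposition: return (correlation drop, inflation rate).

    correlation drop: loss in score-correctness correlation after perturbation.
    inflation rate:   fraction of chains whose score increases by more than eps.
    """
    corr_drop = pearson(orig_scores, correct) - pearson(pert_scores, correct)
    inflated = sum(p > o + eps for o, p in zip(orig_scores, pert_scores))
    return corr_drop, inflated / len(orig_scores)


# Toy data: the perturbation raises the scores of the two incorrect chains only.
correct = [1, 0, 1, 0]
orig_scores = [0.9, 0.2, 0.8, 0.3]
pert_scores = [0.9, 0.6, 0.8, 0.7]
corr_drop, inflation_rate = vulnerability(orig_scores, pert_scores, correct)
```

Reporting the two numbers separately reflects the distinction drawn in the abstract: scores can be inflated across the board while still ranking correct chains above incorrect ones, or the ranking itself can degrade.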
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





