When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming
Abstract:
Reinforcement learning from human feedback (RLHF) enables large-scale post-training by replacing vague human objectives with learned, scalable proxies. This replacement, however, introduces a structured failure surface: optimization may raise the learned reward at the expense of external quality, degrade both proxy and judge metrics, expose under-alignment in the proxy, or produce evaluator-specific disagreements. We empirically examine these failure modes in a compact RLHF pipeline that covers proximal policy optimization (PPO), direct preference optimization (DPO), and uncertainty-penalized PPO (UP-PPO), and that tracks reward-model uncertainty, approximate policy drift, and diversity and repetition diagnostics alongside two external LLM judges. Rather than treating reward hacking as a single endpoint, we classify matched transitions between checkpoints by the trajectories of the learned reward, the individual judge scores, and the average judge score. Across 61 checkpoint rows and 1,920 row-level transitions, aggressive PPO exhibits the highest localized reward-hacking rate (14.45%; bootstrap 95% CI: 10.16–18.75), whereas UP-PPO shows lower rates (10.94–11.33%) under similar aggressive conditions. A logistic model trained on pre-transition data predicts future row-level reward hacking with an ROC-AUC of 0.821. Row-level analysis also reveals localized reward hacking that checkpoint averages miss in three of twelve settings. The primary conclusion is methodological: RLHF failures are not merely final-model pathologies but training dynamics that can be classified, localized, and partially anticipated.
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
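Illustrative sketch (not from the paper): the abstract describes labeling each row-level transition by how the learned reward and the two judge scores move between matched checkpoints, then reporting a bootstrap confidence interval for the reward-hacking rate. The exact decision rules are not stated in the abstract, so the label names, the threshold `eps`, the rule ordering, and the toy deltas below are all assumptions chosen only to make the idea concrete.

```python
import random

def classify_transition(d_reward, d_judge_a, d_judge_b, eps=0.0):
    """Label one row-level transition from score deltas between two checkpoints.

    Hypothetical rules: the abstract names the failure types but does not
    give thresholds or precedence, so these are illustrative only.
    """
    d_avg = (d_judge_a + d_judge_b) / 2.0
    judges_split = (d_judge_a > eps and d_judge_b < -eps) or (
        d_judge_a < -eps and d_judge_b > eps
    )
    if judges_split:
        return "evaluator_disagreement"   # judges move in opposite directions
    if d_reward > eps and d_avg < -eps:
        return "reward_hacking"           # proxy up, external quality down
    if d_reward < -eps and d_avg < -eps:
        return "collapse"                 # proxy and judges both degrade
    if d_reward < -eps and d_avg > eps:
        return "proxy_under_alignment"    # judges improve, proxy misses it
    return "other"

def bootstrap_rate_ci(flags, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a list of 0/1 flags."""
    rng = random.Random(seed)
    n = len(flags)
    # Resample rows with replacement and recompute the rate each time.
    rates = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_boot))
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(flags) / n, lo, hi

# Toy deltas (d_reward, d_judge_a, d_judge_b); made-up values, not paper data.
toy_rows = [(0.5, -0.2, -0.1), (0.3, 0.1, 0.2), (-0.4, -0.3, -0.2), (0.2, 0.3, -0.3)]
labels = [classify_transition(*row) for row in toy_rows]
hack_flags = [1 if label == "reward_hacking" else 0 for label in labels]
rate, ci_low, ci_high = bootstrap_rate_ci(hack_flags)
```

Applying the labels per row rather than to checkpoint means is what would let localized reward hacking surface even when the checkpoint-level averages look benign.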



