One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models
Title: A Cascade of Biases: Mechanistic Reward Shaping and the Persistence of Flaws in Language Reward Models
Abstract:
Reward models (RMs) play a pivotal role in the online alignment of language models (LMs) with human preferences. However, preference-tuning strategies that rely on RMs are susceptible to reward hacking, in which LM policies learn undesirable behaviors from imperfect RMs. Through a systematic evaluation of five high-quality RMs, including the current state of the art, this study shows that longstanding issues with length, sycophancy, and overconfidence remain unresolved despite previous efforts. We also identify new biases toward model-specific "styles" and answer ordering. We classify RM failures into two categories: those tractable to linear intervention and those resistant to it. To address low-complexity biases that stem from spurious correlations, we introduce a simple post-hoc intervention. This approach, termed mechanistic reward shaping, reduces the targeted biases without compromising reward quality or requiring extensive labeled data. The method also adapts to emerging biases, can be applied as a model-internal adjustment, and generalizes to out-of-distribution inputs.
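The abstract does not spell out how the post-hoc intervention is implemented. As an illustration only, the sketch below shows one generic form a linear correction could take: estimate a single "bias direction" in an RM's hidden space from a small contrast set, then project that direction out before a linear reward head is applied. Everything here is an assumption made for the example, including the difference-of-means estimator, the linear head `w`, `b`, the synthetic features, and the function names; it should not be read as the authors' procedure.

```python
import numpy as np

def bias_direction(feats_biased, feats_neutral):
    """Unit vector along the difference of class means (hypothetical estimator)."""
    d = feats_biased.mean(axis=0) - feats_neutral.mean(axis=0)
    return d / np.linalg.norm(d)

def project_out(hidden, direction):
    """Remove the component of each hidden vector along `direction`."""
    return hidden - np.outer(hidden @ direction, direction)

def shaped_reward(hidden, w, b, direction):
    """Linear reward head applied after the bias direction is removed."""
    return project_out(hidden, direction) @ w + b

# Synthetic stand-in for RM hidden states (assumption: no real model is used).
rng = np.random.default_rng(0)
dim = 8
spurious_axis = np.zeros(dim)
spurious_axis[0] = 1.0

neutral = rng.normal(size=(64, dim))
# Same "content", shifted along a spurious axis (e.g. a length-like feature).
biased = neutral + 3.0 * spurious_axis

w = rng.normal(size=dim)  # hypothetical reward-head weights
b = 0.0

d = bias_direction(biased, neutral)
raw_gap = (biased @ w + b) - (neutral @ w + b)
shaped_gap = shaped_reward(biased, w, b, d) - shaped_reward(neutral, w, b, d)
```

In this toy setup the raw reward gap between each biased and neutral pair is driven entirely by the spurious axis, and it vanishes after projection. A correction of this kind can only help with biases that are linearly encoded, which is consistent with the abstract's distinction between failures that are tractable to linear intervention and those that resist it.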
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC






