HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models
Abstract:
While reward models play a pivotal role in aligning large language models (LLMs), they are notably susceptible to reward hacking. To assess the resilience of these models, we present RewardHackBench, a benchmark comprising 13 distinct reward-hacking patterns that span both real-world high-stakes scenarios and general contexts. Our evaluation reveals significant vulnerabilities in specific subcategories across eight different reward models.
To address these weaknesses, we introduce HARVE, a training-free technique that edits the reward head of scalar reward models. Rather than fine-tuning, HARVE identifies a multi-directional hacking subspace from residual-stream directions associated with selected hacking subcategories, then removes the component of the reward-head vector that lies in this subspace. This reduces the reward head’s sensitivity to hacking-associated features using only a small set of contrastive gold-hacked examples and no gradient updates.
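The projection step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that one hacking direction per subcategory is estimated as the mean difference between hacked and gold residual-stream activations, and that the directions are orthonormalized by QR before being projected out of the reward-head vector. The function names, the subcategory labels (`length_padding`, `sycophancy`), and the synthetic data are all hypothetical.

```python
import numpy as np


def hacking_basis(pairs_by_category):
    """Return an orthonormal basis (d, k) with one direction per subcategory.

    pairs_by_category maps a subcategory name to (gold_acts, hacked_acts),
    each an (n, d) array of residual-stream activations (assumed layout).
    """
    directions = [
        (hacked - gold).mean(axis=0)
        for gold, hacked in pairs_by_category.values()
    ]
    # QR orthonormalizes the stacked directions; the columns of q span the subspace.
    q, _ = np.linalg.qr(np.stack(directions, axis=1))
    return q


def edit_reward_head(w, basis):
    """Remove the component of w lying in span(basis); no gradients are used."""
    return w - basis @ (basis.T @ w)


# Toy data: each hypothetical subcategory shifts activations along one axis.
rng = np.random.default_rng(0)
d, n = 16, 32
pairs = {}
for name, axis in [("length_padding", 0), ("sycophancy", 1)]:
    gold = rng.normal(size=(n, d))
    shift = np.zeros(d)
    shift[axis] = 3.0
    pairs[name] = (gold, gold + shift)

w = rng.normal(size=d)  # stand-in for a scalar reward head's weight vector
basis = hacking_basis(pairs)
w_edited = edit_reward_head(w, basis)

# Mean reward gap between hacked and gold responses, before and after editing.
gold, hacked = pairs["sycophancy"]
gap_before = float(np.mean((hacked - gold) @ w))
gap_after = float(np.mean((hacked - gold) @ w_edited))
```

In this toy setup the edited head is orthogonal to every estimated hacking direction, so the reward gap between hacked and gold activations vanishes up to floating-point error, while components of `w` outside the subspace are left unchanged.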
Extensive experiments on eight reward models demonstrate that HARVE improves robustness against hacking, outperforms fine-tuning baselines, and preserves the general capabilities of the reward models. Additional analysis indicates that reward hacking is better characterized as a multidimensional structure in the residual space than as a set of isolated surface-level cues.
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



