Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling
Abstract:
While reward models derived from human preferences are essential for aligning large language models (LLMs) using reinforcement learning from human feedback (RLHF), they frequently suffer from reward hacking. This vulnerability often stems from noisy data and systematic biases, including those related to response length or stylistic choices. To address these challenges, we introduce the Bayesian Non-Negative Reward Model (BNRM), a rigorous framework that embeds non-negative factor analysis within the Bradley-Terry (BT) preference model.
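For reference, the Bradley-Terry model mentioned above scores a preference pair by the sigmoid of the reward gap between the chosen and rejected responses. The following is a minimal sketch of that standard likelihood only, not of BNRM itself; the function names `bt_preference_prob` and `bt_nll` are hypothetical and chosen for illustration.

```python
import math


def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    """P(chosen is preferred) under Bradley-Terry: sigmoid(r_chosen - r_rejected)."""
    gap = r_chosen - r_rejected
    # Branch on the sign of the gap so exp() never receives a large positive argument.
    if gap >= 0:
        return 1.0 / (1.0 + math.exp(-gap))
    e = math.exp(gap)
    return e / (1.0 + e)


def bt_nll(pairs) -> float:
    """Mean negative log-likelihood over (r_chosen, r_rejected) reward pairs.

    Uses -log(sigmoid(gap)) = softplus(-gap), computed in a numerically stable form.
    """
    total = 0.0
    for r_chosen, r_rejected in pairs:
        gap = r_chosen - r_rejected
        total += max(-gap, 0.0) + math.log1p(math.exp(-abs(gap)))
    return total / len(pairs)
```

Because the likelihood depends only on the difference of the two rewards, constraining rewards to be non-negative does not restrict which preferences the model can express.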
BNRM characterizes rewards through a generative process of sparse, non-negative latent factors operating at two distinct, complementary levels. First, instance-specific latent variables facilitate the creation of disentangled reward representations. Second, sparsity imposed on global latent factors serves as an implicit debiasing mechanism, effectively suppressing spurious correlations. This "disentanglement-then-debiasing" architecture supports robust, uncertainty-aware reward learning.
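The two-level structure described above can be illustrated with a toy generative sketch. The abstract does not specify the priors or how the factors combine, so everything here is an assumption made for illustration: Gamma draws with shape below 1 stand in for "sparse, non-negative" variables, and the reward is taken to be a weighted sum of instance-level factors. The names `sample_sparse_nonneg` and `factor_reward` are hypothetical.

```python
import random


def sample_sparse_nonneg(k: int, rng: random.Random, shape: float = 0.3, scale: float = 1.0):
    """Draw k non-negative values from Gamma(shape, scale).

    Assumption: with shape < 1 the density piles up near zero, so most entries
    are small and a few are large, which mimics a sparse non-negative code.
    """
    return [rng.gammavariate(shape, scale) for _ in range(k)]


def factor_reward(instance_factors, global_weights) -> float:
    """Assumed combination rule: reward = sum_k weight_k * factor_k."""
    return sum(z * w for z, w in zip(instance_factors, global_weights))


rng = random.Random(0)
num_factors = 8
global_weights = sample_sparse_nonneg(num_factors, rng)    # shared across all responses
instance_factors = sample_sparse_nonneg(num_factors, rng)  # specific to one response
reward = factor_reward(instance_factors, global_weights)
```

In this toy setup, a global weight that is close to zero effectively switches off its factor for every response, which is one way to picture how sparsity on global factors could suppress a spurious feature such as length.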
To scale BNRM to contemporary LLMs, we design an amortized variational inference network conditioned on deep model representations, enabling efficient end-to-end training. Empirical evaluations show that BNRM significantly reduces reward over-optimization, improves robustness under distribution shift, and yields more interpretable reward decompositions than strong baseline methods.
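Amortized inference means that a shared network maps each input representation to the parameters of its variational posterior, rather than optimizing separate parameters per example. The sketch below shows the idea for non-negative latents under stated assumptions: the abstract does not name the variational family, so a Weibull distribution is assumed here because it is non-negative and can be sampled by inverting its CDF; the linear heads, the `0.1` floor, and the names `encode` and `sample_weibull` are all hypothetical.

```python
import math
import random


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); always non-negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def encode(h, w_shape, w_scale, floor: float = 0.1):
    """Map a representation h to per-factor Weibull (shape, scale) parameters.

    Assumption: one linear head per parameter, followed by softplus plus a
    small floor so both parameters stay strictly positive.
    """
    def dot(row):
        return sum(a * b for a, b in zip(row, h))

    shapes = [softplus(dot(row)) + floor for row in w_shape]
    scales = [softplus(dot(row)) + floor for row in w_scale]
    return shapes, scales


def sample_weibull(shapes, scales, rng: random.Random):
    """Reparameterized draw: z = scale * (-log(1 - u)) ** (1 / shape), u ~ U[0, 1)."""
    samples = []
    for k, lam in zip(shapes, scales):
        u = rng.random()
        samples.append(lam * (-math.log1p(-u)) ** (1.0 / k))
    return samples


rng = random.Random(0)
h = [0.5, -1.0, 2.0]                              # stand-in for an LLM hidden state
w_shape = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]]     # two latent factors
w_scale = [[0.2, -0.1, 0.4], [0.0, 0.5, -0.2]]
shapes, scales = encode(h, w_shape, w_scale)
z = sample_weibull(shapes, scales, rng)
```

Because the sample is a deterministic function of the parameters and the uniform noise `u`, gradients could flow from a downstream loss back into the encoder when the same computation is written in an autodiff framework.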
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC






