arXiv

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

Abstract:

In rubric-based reinforcement learning (RL), large language models acting as judges (LaaJ) score model outputs against specific criteria to generate rewards. This approach is vulnerable to reward hacking, in which the policy model exploits hidden biases in the judge and training yields outcomes that are unsafe or ineffective. In practice, hacking behaviors are often subtle and entangled with multiple judge biases, which makes them hard to analyze, detect, and mitigate.

To address these challenges, this study presents CHERRL, a controlled environment designed for rubric-based RL that facilitates the reproduction of reward hacking. By introducing known biases into the LaaJ, CHERRL allows for the stable replication of hacking incidents, the clear observation of reward divergence, and the accurate pinpointing of when hacking begins. This setup serves as a rigorous experimental platform for investigating both the underlying mechanisms and potential solutions for reward hacking in rubric-based RL.
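As an illustration of what observing reward divergence and pinpointing its onset could look like, the sketch below compares per-step rewards from a biased judge against rewards from an unbiased reference. It is a minimal sketch under stated assumptions, not the CHERRL implementation: the function names, the moving-average smoothing, the `window` and `threshold` parameters, and the sample reward values are all hypothetical.

```python
from typing import Optional, Sequence


def moving_average(values: Sequence[float], window: int) -> list[float]:
    """Trailing moving average; early entries average over what is available."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def hacking_onset(
    biased_rewards: Sequence[float],
    reference_rewards: Sequence[float],
    window: int = 3,
    threshold: float = 0.2,
) -> Optional[int]:
    """Hypothetical onset rule (an assumption, not taken from the paper).

    Returns the first training step from which the smoothed gap between the
    biased-judge reward and the reference reward stays above `threshold`
    until the end of the run, or None if no such step exists.
    """
    gaps = [b - r for b, r in zip(biased_rewards, reference_rewards)]
    smoothed = moving_average(gaps, window)
    onset = None
    for step, gap in enumerate(smoothed):
        if gap > threshold:
            if onset is None:
                onset = step  # candidate onset; kept only if the gap persists
        else:
            onset = None  # gap closed again, so discard the candidate
    return onset


# Made-up reward curves: the biased-judge reward keeps rising while the
# reference reward stalls and then declines.
biased = [0.5, 0.55, 0.6, 0.7, 0.85, 0.95, 1.0]
reference = [0.5, 0.55, 0.6, 0.6, 0.55, 0.5, 0.45]
print(hacking_onset(biased, reference))
```

Requiring the gap to persist until the end of the run is one possible design choice; it avoids flagging a transient spike as the onset, at the cost of reacting only after the smoothed gap has clearly opened.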

To illustrate the framework's effectiveness, we examine various judge biases through the lenses of discoverability and exploitability. Additionally, we investigate an agent-based system capable of automatically identifying the onset of reward hacking by analyzing training logs. The associated code and environment are accessible at https://github.com/THUAIS-Lab/CHERRL.


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
