arXiv

EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

Title: EvalStop: Leveraging World Feedback to Identify and Mitigate Reward Overoptimization in Multi-Tenant RLHF Environments

Abstract:

Cloud-based platforms for fine-tuning large language models increasingly handle Reinforcement Learning from Human Feedback (RLHF) jobs, in which a trained reward model serves as a surrogate for human judgment. However, as demonstrated by Gao et al. (2023), this surrogate tends to drift away from actual world feedback, as measured by downstream evaluation metrics, under prolonged optimization pressure. This divergence is known as "reward overoptimization." Current platform schedulers do not address it: non-clairvoyant schedulers optimize solely for Job Completion Time (JCT) and use no quality signal; SLAQ-style quality-aware schedulers rely on training loss, a weaker proxy that can be artificially driven to drop monotonically; and traditional per-job early stopping requires constant human oversight and does not reclaim shared GPU resources.

To address these limitations, we introduce EvalStop, a modular scheduling primitive that terminates a job after k consecutive drops in its evaluation score. It frees GPU capacity, retains the highest-performing checkpoint, and integrates with any underlying base scheduler. We frame scheduler-level early stopping as a detection problem and assess it using a discrete-event simulator. The simulator generates RLHF workloads containing both reward-hacking runs and healthy runs, while keeping ground-truth labels hidden from the schedulers.
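The stopping rule described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `EvalStopMonitor`, the helper `first_stop`, the default `k = 3`, and the reading of a "drop" as a score strictly below the previous evaluation are all assumptions made for illustration. GPU reclamation and the base scheduler are out of scope here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class EvalStopMonitor:
    """Hypothetical per-job monitor: flags a job after k consecutive eval drops."""
    k: int = 3                          # assumed value; the abstract leaves k unspecified
    best_score: float = float("-inf")   # highest evaluation score seen so far
    best_step: Optional[int] = None     # step of the checkpoint to retain
    last_score: Optional[float] = None
    drops: int = 0                      # current run of consecutive drops

    def observe(self, step: int, score: float) -> bool:
        """Record one evaluation score; return True if the job should stop."""
        if score > self.best_score:
            self.best_score, self.best_step = score, step
        # Assumption: a "drop" means strictly lower than the previous evaluation.
        if self.last_score is not None and score < self.last_score:
            self.drops += 1
        else:
            self.drops = 0
        self.last_score = score
        return self.drops >= self.k


def first_stop(scores: List[float], k: int = 3) -> Tuple[Optional[int], Optional[int]]:
    """Return (step at which the job is stopped or None, step of best checkpoint)."""
    monitor = EvalStopMonitor(k=k)
    for step, score in enumerate(scores):
        if monitor.observe(step, score):
            return step, monitor.best_step
    return None, monitor.best_step


# A run whose evaluation score peaks and then declines is stopped;
# a noisy but improving run is left alone.
hacking_run = [0.50, 0.60, 0.70, 0.65, 0.60, 0.55]
healthy_run = [0.50, 0.60, 0.55, 0.65, 0.60, 0.70]
print(first_stop(hacking_run))
print(first_stop(healthy_run))
```

In this sketch a scheduler would call `observe` after each evaluation and, on a `True` result, release the job's GPUs and keep the checkpoint at `best_step`.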

In environments where RLHF tasks constitute 80% of the workload across 64 GPUs, EvalStop demonstrated a precision of 98% and a recall of 99%, with a false positive rate (FPR) of just 1.5%. Furthermore, it reduced JCT by 9% and decreased wasted computational resources by 22% compared to SRTF-Est (p<0.05). In contrast, simpler methods based on fixed progress or loss plateaus suffered from a 65% FPR on healthy RLHF tasks or failed to identify more than half of the actual hacking cases. The benefits of EvalStop are consistent across all tested base schedulers, yielding JCT improvements between 9% and 25%. Additionally, detection accuracy remained robust under varying conditions, maintaining a precision of at least 91% with evaluation noise standard deviations up to 0.05, and at least 89% across hacking base rates ranging from 20% to 80%.
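For readers unfamiliar with the detection metrics quoted above, the following sketch shows how precision, recall, and FPR are conventionally computed once the hidden ground-truth labels are revealed. The function name `detection_metrics` and the input layout are hypothetical, and the sketch assumes the paper uses the standard definitions, with FPR taken over healthy jobs.

```python
from typing import Sequence, Tuple


def detection_metrics(stopped: Sequence[bool],
                      hacking: Sequence[bool]) -> Tuple[float, float, float]:
    """Compute (precision, recall, FPR) for early-stopping decisions.

    stopped[i]: the scheduler terminated job i early.
    hacking[i]: ground-truth label, True if job i was a reward-hacking run.
    """
    pairs = list(zip(stopped, hacking))
    tp = sum(1 for s, h in pairs if s and h)          # hacking runs stopped
    fp = sum(1 for s, h in pairs if s and not h)      # healthy runs stopped
    fn = sum(1 for s, h in pairs if not s and h)      # hacking runs missed
    tn = sum(1 for s, h in pairs if not s and not h)  # healthy runs kept

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, fpr


# Toy example with five jobs (illustrative values only, not data from the paper).
stopped = [True, True, False, False, True]
hacking = [True, True, True, False, False]
print(detection_metrics(stopped, hacking))
```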


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
