EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms
Title: EvalStop: Leveraging World Feedback to Identify and Mitigate Reward Overoptimization in Multi-Tenant RLHF Environments
Abstract:
Cloud-based platforms for fine-tuning large language models increasingly run Reinforcement Learning from Human Feedback (RLHF) jobs, in which a trained reward model serves as a proxy for human judgment. However, as Gao et al. (2023) showed, this proxy diverges from actual world feedback, as measured by downstream evaluation metrics, under sustained optimization pressure. This divergence is known as "reward overoptimization." Current platform schedulers do not address it: non-clairvoyant schedulers optimize Job Completion Time (JCT) without any quality signal; SLAQ-style quality-aware schedulers rely on training loss, a weaker proxy that can be gamed into decreasing monotonically; and traditional per-job early stopping requires constant human oversight and does not reclaim shared GPU resources.
To address these limitations, we introduce EvalStop, a modular scheduling primitive that terminates a job after k consecutive drops in its evaluation score. On termination it frees the job's GPU capacity and retains the highest-scoring checkpoint, and it can be layered on top of any base scheduler. We frame scheduler-level early stopping as a detection problem and evaluate it with a discrete-event simulator. The simulator generates RLHF workloads containing both reward-hacking runs and healthy runs, and keeps the ground-truth labels hidden from the schedulers.
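The stopping rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `EvalStopMonitor`, the helper `run_monitor`, the default `k=3`, and the reading of "drop" as "strictly lower than the previous evaluation score" are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


@dataclass
class EvalStopMonitor:
    """Per-job monitor: flags a job after k consecutive evaluation-score drops.

    Assumption: a "drop" means the new score is strictly lower than the
    previous one; the abstract does not spell out the exact comparison.
    """
    k: int = 3  # hypothetical default; the abstract gives no value for k
    scores: List[float] = field(default_factory=list)
    consecutive_drops: int = 0
    best_step: Optional[int] = None
    best_score: float = float("-inf")

    def observe(self, step: int, score: float) -> bool:
        """Record one evaluation score; return True if the job should stop."""
        if self.scores and score < self.scores[-1]:
            self.consecutive_drops += 1
        else:
            # An improvement or a tie resets the streak.
            self.consecutive_drops = 0
        self.scores.append(score)
        # Remember which checkpoint scored highest so it can be retained.
        if score > self.best_score:
            self.best_score, self.best_step = score, step
        return self.consecutive_drops >= self.k


def run_monitor(scores: Sequence[float], k: int) -> Tuple[Optional[int], Optional[int]]:
    """Feed a score trace to a monitor; return (stop_step, best_checkpoint_step)."""
    monitor = EvalStopMonitor(k=k)
    for step, score in enumerate(scores):
        if monitor.observe(step, score):
            return step, monitor.best_step
    return None, monitor.best_step


# Illustrative traces (made up for this example, not data from the paper).
hacking_trace = [0.50, 0.60, 0.70, 0.68, 0.65, 0.61, 0.60]
healthy_trace = [0.50, 0.60, 0.58, 0.62, 0.70]
print(run_monitor(hacking_trace, k=3))
print(run_monitor(healthy_trace, k=3))
```

In a real scheduler the `True` return would also trigger release of the job's GPUs back to the shared pool, which this sketch does not model.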
In a setting where RLHF jobs make up 80% of the workload on 64 GPUs, EvalStop achieves a precision of 98% and a recall of 99%, with a false positive rate (FPR) of only 1.5%. It also reduces JCT by 9% and wasted computation by 22% relative to SRTF-Est (p<0.05). In contrast, simpler rules based on fixed progress or loss plateaus either incur a 65% FPR on healthy RLHF jobs or miss more than half of the actual hacking cases. The benefits of EvalStop hold across all tested base schedulers, with JCT improvements between 9% and 25%. Detection accuracy also remains robust under varying conditions: precision stays at or above 91% for evaluation noise standard deviations up to 0.05, and at or above 89% across hacking base rates from 20% to 80%.
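For readers less familiar with the detection metrics quoted above, the following sketch shows how precision, recall, and FPR follow from per-job outcomes when "positive" means "the scheduler stopped the job". The function name `detection_metrics` and the example labels are hypothetical and chosen only for illustration; they are not the paper's data.

```python
from typing import Dict, Sequence


def detection_metrics(is_hacking: Sequence[bool], was_stopped: Sequence[bool]) -> Dict[str, float]:
    """Compute precision, recall, and FPR for an early-stopping detector.

    is_hacking[i]  : ground truth, True if job i is a reward-hacking run
    was_stopped[i] : detector decision, True if the scheduler stopped job i
    """
    pairs = list(zip(is_hacking, was_stopped))
    tp = sum(1 for truth, pred in pairs if truth and pred)          # hacking, stopped
    fp = sum(1 for truth, pred in pairs if not truth and pred)      # healthy, stopped
    fn = sum(1 for truth, pred in pairs if truth and not pred)      # hacking, missed
    tn = sum(1 for truth, pred in pairs if not truth and not pred)  # healthy, kept
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


# Made-up outcomes for eight jobs: three hacking runs, five healthy runs.
truth = [True, True, True, False, False, False, False, False]
stopped = [True, True, False, True, False, False, False, False]
metrics = detection_metrics(truth, stopped)
print(metrics)
```

FPR is measured over healthy jobs only, so a rule can combine a low FPR with high precision only if it rarely stops healthy runs.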
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC





