Before the Model Learns the Bug: Fuzzing RLVR Verifiers
Title: Exploiting Flaws Before the Model Does: Fuzzing RLVR Verifiers
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) replaces human preference annotations with executable reward functions such as JSON tool-call validators, code unit-test harnesses, and mathematical answer checkers. Because these rewards are partly software artifacts, a flawed verifier can lead the optimization process to learn the verifier's bugs. To investigate this vulnerability, we introduce a lightweight framework for fuzzing verifiers. The framework generates adversarial completions, compares buggy verifiers against stricter reference versions, and records paired decision logs. It then reports metrics for false positives, false negatives, disagreements, exploits, and uncertainty.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
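Illustrative sketch (not the paper's implementation): the loop described in the abstract can be pictured as differential fuzzing of a permissive verifier against a stricter one. Everything named below is hypothetical and chosen only for illustration: the two toy answer checkers, the mutation templates, and the metric definitions. In particular, the sketch assumes the strict reference is treated as ground truth and that an "exploit" is a distinct completion the buggy verifier accepts but the reference rejects. The uncertainty metric is not modeled here.

```python
import random
from dataclasses import dataclass


def buggy_verifier(completion: str, answer: str) -> bool:
    # Hypothetical flaw: accepts any completion containing the answer as a substring.
    return answer in completion


def strict_verifier(completion: str, answer: str) -> bool:
    # Hypothetical reference: the whitespace-stripped completion must equal the answer.
    return completion.strip() == answer


@dataclass
class Decision:
    """One row of the paired decision log."""
    completion: str
    buggy: bool
    strict: bool


# Hypothetical mutation templates; {a} is the correct answer, {b} a wrong one.
TEMPLATES = ["{a}", " {a} ", "{a}0", "the answer is not {a}", "{a} or {b}", "{b}", ""]


def fuzz(answer: str, n: int = 200, seed: int = 0) -> list[Decision]:
    """Generate n adversarial completions and log both verifiers' decisions."""
    rng = random.Random(seed)
    log = []
    for _ in range(n):
        wrong = str(int(answer) + rng.randint(1, 9))
        completion = rng.choice(TEMPLATES).format(a=answer, b=wrong)
        log.append(Decision(completion,
                            buggy_verifier(completion, answer),
                            strict_verifier(completion, answer)))
    return log


def summarize(log: list[Decision]) -> dict:
    """Count disagreements, taking the strict verifier as ground truth (an assumption)."""
    fp = [d for d in log if d.buggy and not d.strict]   # buggy accepts, reference rejects
    fn = [d for d in log if not d.buggy and d.strict]   # buggy rejects, reference accepts
    return {
        "false_positives": len(fp),
        "false_negatives": len(fn),
        "disagreements": len(fp) + len(fn),
        "disagreement_rate": (len(fp) + len(fn)) / len(log) if log else 0.0,
        "exploits": sorted({d.completion for d in fp}),  # distinct accepted-but-wrong strings
    }


if __name__ == "__main__":
    print(summarize(fuzz("42")))
```

Logging both decisions for every completion, rather than only the disagreements, keeps the denominators available for rate-style metrics and lets the same log be re-scored against a different reference later.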




