Self-Trained Verification for Training- and Test-Time Self-Improvement
Original (arXiv:2605.30290v2): Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training time, we additionally train the generator using RL with STV verifier's feedback inside the V-R loop - a procedure we call verifier-in-the-loop training (ViL). Starting from an RL-converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification. Website: https://ar-forum.github.io/stv-webpage
Rewrite: Self-improvement at scale has long been a goal for reasoning models, and it can happen in two places: at test time, through verification-refinement (V-R) loops, and at training time, through self-training. Both are limited by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, or when the feedback is too generic to act on. Self-training fails in a similar way when bad self-generated data is added to the training set. Better verification would unlock both, but the skill that needs training, catching self-generated errors, lacks a direct training signal.
To overcome this, we introduce Self-Trained Verification (STV). Our key observation is that although a model cannot catch these errors on its own, it can when shown the reference solution. We turn this asymmetry into a supervision target, training the verifier to imitate a more informed version of itself.
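The supervision target can be pictured with a minimal sketch: the model critiques a self-generated attempt while it can see the reference solution, and that critique becomes the target for a verifier input that omits the reference. The function name `build_stv_examples`, the `informed_verify` callable, the dictionary keys, and the prompt layout are all hypothetical illustrations, not the paper's implementation.

```python
from typing import Callable, Dict, List


def build_stv_examples(
    problems: List[Dict[str, str]],
    informed_verify: Callable[[str, str, str], str],
) -> List[Dict[str, str]]:
    """Build (input, target) pairs for verifier training.

    Each problem dict uses hypothetical keys: 'question', 'attempt'
    (a self-generated solution), and 'reference' (a reference solution).
    `informed_verify` stands in for the model critiquing an attempt
    while it can see the reference.
    """
    examples = []
    for p in problems:
        # Target: critique written with access to the reference solution.
        target = informed_verify(p["question"], p["attempt"], p["reference"])
        # Input: the reference is withheld, so the trained verifier must
        # reproduce the informed critique from question and attempt alone.
        prompt = f"Question: {p['question']}\nAttempt: {p['attempt']}\nCritique:"
        examples.append({"input": prompt, "target": target})
    return examples
```

The resulting pairs could then feed any standard fine-tuning pipeline; how the targets are filtered or weighted is not specified in the abstract and is left out here.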
At test time, STV substantially improves V-R loops on hard problems, whereas alternatives such as supervised fine-tuning (SFT), RL on verifier scores, and even meta-verifiers do not. STV roughly doubles accuracy on hard math and raises it 14x on scientific reasoning tasks (from 1.5% to 21%).
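For readers unfamiliar with the term, a verification-refinement loop can be sketched roughly as follows. The callables `generate`, `verify`, and `refine`, the acceptance threshold, and the round cap are assumptions made for illustration; the abstract does not specify these details.

```python
from typing import Callable, Tuple


def vr_loop(
    question: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Tuple[float, str]],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 4,
    accept_score: float = 0.9,
) -> str:
    """Run a verification-refinement loop and return the final attempt."""
    attempt = generate(question)
    for _ in range(max_rounds):
        # The verifier returns a score and textual feedback on the attempt.
        score, feedback = verify(question, attempt)
        if score >= accept_score:
            break  # Verifier accepts the attempt; stop refining.
        # Otherwise the generator revises the attempt using the feedback.
        attempt = refine(question, attempt, feedback)
    return attempt
```

In this sketch, the two failure modes named above correspond to `verify` returning a high score for a wrong attempt (the loop exits early) or feedback too generic for `refine` to use (the attempt never improves).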
At training time, we additionally train the generator using RL with the STV verifier's feedback inside the V-R loop, a procedure we call verifier-in-the-loop training (ViL). Starting from a generator that has already converged under standard RL, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, rises 30% relative to where standard RL had converged. These results suggest that the next frontier in reasoning on hard problems may lie in how we train for and with verification.
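As a quick check on how the reported figures relate, the 14x factor follows directly from the two accuracies, and a relative gain is measured against the baseline value. The symbols $p_{\text{old}}$ and $p_{\text{new}}$ are introduced here only for illustration; the abstract does not report the underlying pass@1 values.

```latex
\[
\frac{21\%}{1.5\%} = 14,
\qquad
\text{relative gain} = \frac{p_{\text{new}} - p_{\text{old}}}{p_{\text{old}}},
\quad\text{so a 30\% relative gain means } p_{\text{new}} = 1.3\, p_{\text{old}}.
\]
```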
Website: https://ar-forum.github.io/stv-webpage
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





