arXiv

Self-Trained Verification for Training- and Test-Time Self-Improvement

Original: arXiv:2605.30290v2

Abstract: Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training time, we additionally train the generator using RL with STV verifier's feedback inside the V-R loop - a procedure we call verifier-in-the-loop training (ViL). Starting from an RL-converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification. Website: https://ar-forum.github.io/stv-webpage

Rewrite: Self-improvement at scale has long been a goal for reasoning models, and there are two natural places to pursue it: at test time, through verification-refinement (V-R) loops, and at training time, through self-training. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, or when the feedback is too generic to act on. Self-training fails in a similar way when bad self-generated data is added to the training set. Better verification would unlock both, but the skill that needs to be trained, catching self-generated errors, lacks a direct training signal.
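As a rough illustration of the test-time loop described above, a V-R loop can be pictured as generate, verify, refine, repeated until the verifier accepts or a budget runs out. This is a minimal sketch, not the paper's implementation: the names `vr_loop`, `Verdict`, `max_rounds`, and the toy callables are all hypothetical, and the toy verifier is a hand-written rule standing in for a learned model.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Verdict:
    accepted: bool  # verifier's accept/reject decision
    feedback: str   # critique handed to the refiner when rejected


def vr_loop(
    problem: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Verdict],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 4,
) -> Tuple[str, int]:
    """Return (final solution, number of refinement steps performed)."""
    solution = generate(problem)
    for steps in range(max_rounds):
        verdict = verify(problem, solution)
        if verdict.accepted:
            return solution, steps
        solution = refine(problem, solution, verdict.feedback)
    # Budget exhausted: the last refinement has not been checked.
    return solution, max_rounds


# Toy stand-ins (assumptions for illustration only).
def toy_generate(problem: str) -> str:
    return "6"  # a deliberately wrong first attempt at "2 + 3"


def toy_verify(problem: str, solution: str) -> Verdict:
    if int(solution) == 5:
        return Verdict(True, "")
    return Verdict(False, "too high" if int(solution) > 5 else "too low")


def toy_refine(problem: str, solution: str, feedback: str) -> str:
    step = -1 if feedback == "too high" else 1
    return str(int(solution) + step)


print(vr_loop("2 + 3", toy_generate, toy_verify, toy_refine))
```

The sketch makes the bottleneck visible: if `verify` accepts wrong solutions (inflated scores) or returns feedback that `refine` cannot use, the loop ends early or makes no progress regardless of how capable the generator is.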

To overcome this, we introduce Self-Trained Verification (STV). Our core insight is that although a model struggles to identify these errors in isolation, it succeeds when presented with a reference solution. We leverage this discrepancy as a supervisory signal, training the verifier to mimic a more knowledgeable version of itself.
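One way to picture how this asymmetry could become a supervision target is sketched below. The abstract does not specify the data format or training objective, so everything here is an assumption: `build_stv_examples`, `blind_prompt`, the record fields, and the toy `informed_verify` are hypothetical names. The idea shown is only that the critique is produced with the reference solution visible, while the training prompt paired with it omits the reference.

```python
from typing import Callable, Dict, List


def blind_prompt(problem: str, candidate: str) -> str:
    """Prompt the verifier sees at training and test time: no reference."""
    return (
        f"Problem:\n{problem}\n\n"
        f"Candidate solution:\n{candidate}\n\n"
        "Verify the candidate and explain any error."
    )


def build_stv_examples(
    records: List[Dict[str, str]],
    informed_verify: Callable[[str, str, str], str],
) -> List[Dict[str, str]]:
    """Pair reference-free prompts with critiques written with the reference."""
    examples = []
    for rec in records:
        # The "more informed" pass: same model, but shown the reference.
        critique = informed_verify(rec["problem"], rec["candidate"], rec["reference"])
        examples.append(
            {
                "prompt": blind_prompt(rec["problem"], rec["candidate"]),
                "target": critique,
            }
        )
    return examples


# Toy informed verifier (assumption): compares final answers directly.
def toy_informed_verify(problem: str, candidate: str, reference: str) -> str:
    if candidate.strip() == reference.strip():
        return "Correct."
    return "Incorrect: the candidate's final answer is wrong."


records = [
    {"problem": "What is 6 * 7?", "candidate": "41", "reference": "42"},
    {"problem": "What is 9 + 1?", "candidate": "10", "reference": "10"},
]
for example in build_stv_examples(records, toy_informed_verify):
    print(example["target"])
```

In a real setup the informed pass would be a model call rather than a string comparison, and the resulting pairs would feed whatever training procedure the authors use; neither detail is given in the abstract.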

At test time, STV substantially improves V-R loops on hard problems, whereas alternatives such as supervised fine-tuning (SFT), reinforcement learning (RL) on verifier scores, and even meta-verifiers do not. Specifically, STV roughly doubles accuracy on hard math and raises it by a factor of 14 on scientific reasoning tasks (from 1.5% to 21%).
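The "factor of 14" follows directly from the two reported accuracies. The short check below only restates that arithmetic and separates a fold change from a relative gain and an absolute gain; the helper names are hypothetical and no numbers beyond the reported 1.5% and 21% are used.

```python
def fold_change(before: float, after: float) -> float:
    """How many times larger `after` is than `before`."""
    return after / before


def relative_gain(before: float, after: float) -> float:
    """Increase expressed as a fraction of the starting value."""
    return (after - before) / before


def absolute_gain(before: float, after: float) -> float:
    """Increase in the metric's own units (percentage points here)."""
    return after - before


before_pct, after_pct = 1.5, 21.0  # scientific reasoning accuracy, in percent

print(fold_change(before_pct, after_pct))    # 14.0
print(relative_gain(before_pct, after_pct))  # 13.0
print(absolute_gain(before_pct, after_pct))  # 19.5
```

The same distinction matters for the other figures in the abstract: a gain stated as relative scales with the starting value, so it cannot be converted to percentage points without the baseline, which the abstract does not give.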

At training time, the generator is additionally trained with RL using the STV verifier's feedback inside the V-R loop, a procedure called verifier-in-the-loop training (ViL). Starting from a generator that has already converged under RL, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, rises 30% in relative terms beyond the point where standard RL had converged. These results suggest that the next frontier in reasoning on hard problems may lie in how models are trained both for verification and with verification.

Website: https://ar-forum.github.io/stv-webpage


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
