arXiv

Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

June 3, 2026 · Jiahui Li, Jianfeng Shan, Wenpei Chen, Shunyu Wu, Jian Lou, Wenjie Feng, Dan Li, See-Kiong Ng

Title: Bridging the Verification-Generation Divide: Test-Time Reinforcement Learning via Confidence-Conditioned Verification

Abstract:

Test-time reinforcement learning has recently gained traction as a powerful, label-free approach for boosting the complex reasoning capabilities of large language models. Prior research has largely prioritized Pass@1, yet optimizing Pass@k remains a critical and under-addressed challenge in label-free settings, because Pass@k is a key indicator of generation coverage during sustained exploration. Optimizing Pass@k in these settings is non-trivial: directly adapting the Pass@k advantage strategies that work well for reinforcement learning with verifiable rewards (RLVR) yields suboptimal results.
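For readers unfamiliar with the metric, the sketch below shows the unbiased Pass@k estimator commonly used in evaluation: given n sampled answers of which c are correct, it returns the probability that at least one of k answers drawn without replacement is correct. The abstract does not say which estimator the authors use, so this is an illustration of the metric under that assumption. The function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random subset of k samples contains no correct answer.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every subset of size k
        # must contain at least one correct answer.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With 16 samples and 4 correct, Pass@1 equals the plain accuracy.
    print(pass_at_k(16, 4, 1))
    print(pass_at_k(16, 4, 8))
```

Note that the count of correct samples requires ground-truth labels, so this estimator applies at evaluation time. During label-free training no such count is available, which is part of why optimizing Pass@k is harder there.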

Through empirical analysis, we identify two underlying causes of this performance bottleneck: pseudo-label estimates for low-confidence samples are prone to significant errors, whereas candidate answers for high-confidence samples frequently suffer a severe collapse in diversity. To address these issues, we introduce TTRL-CoCoV (Test-Time Reinforcement Learning with Confidence-Conditioned Verification), a novel confidence-adaptive framework designed to broaden Pass@k coverage and enhance Pass@1 performance.
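The two failure modes can be made concrete with a small sketch. It assumes the pseudo-label is the majority answer among sampled rollouts and that confidence is the share of rollouts agreeing with it; the abstract does not define its confidence measure, so both are assumptions, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> tuple[str, float, int]:
    """Return (pseudo_label, confidence, distinct_answers).

    pseudo_label: the most frequent answer among the rollouts.
    confidence: fraction of rollouts that agree with the pseudo-label
        (an assumed definition, not taken from the paper).
    distinct_answers: number of different answers, a crude diversity proxy.
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers), len(counts)


if __name__ == "__main__":
    # High agreement: the label is likely reliable, but diversity is low.
    print(majority_vote(["42", "42", "42", "41"]))
    # Low agreement: plenty of diversity, but the label is easily wrong.
    print(majority_vote(["1", "2", "3", "2"]))
```

In the first call most rollouts repeat the same answer, which illustrates diversity collapse. In the second, the winning answer holds only half the votes, which illustrates why a low-confidence pseudo-label can be unreliable.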

Guided by the insight that verification proficiency generally underpins generation capability, TTRL-CoCoV applies a different treatment at each confidence level. For high-confidence samples, it bootstraps the verifier and applies an exploration-enhancing reward to mitigate diversity collapse. For low-confidence samples, it delegates pseudo-label selection to the verifier to filter out inaccurate labels. Medium-confidence samples bypass verification entirely.
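The three-way split described above can be sketched as a simple router. The thresholds `low=0.3` and `high=0.8`, the `Regime` enum, and the `route` function are hypothetical and chosen only for illustration; the abstract does not state how the confidence bands are defined. The treatment strings paraphrase the abstract.

```python
from enum import Enum


class Regime(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Treatment applied in each regime, paraphrased from the abstract.
TREATMENT = {
    Regime.HIGH: "bootstrap the verifier and add an exploration-enhancing reward",
    Regime.MEDIUM: "bypass verification",
    Regime.LOW: "let the verifier select the pseudo-label",
}


def route(confidence: float, low: float = 0.3, high: float = 0.8) -> Regime:
    """Map a confidence score in [0, 1] to a regime.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= high:
        return Regime.HIGH
    if confidence < low:
        return Regime.LOW
    return Regime.MEDIUM


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        regime = route(score)
        print(score, regime.value, "->", TREATMENT[regime])
```

Keeping the routing separate from the treatments makes it straightforward to swap in whatever confidence measure and cut-offs the released code actually uses.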

Extensive experiments confirm that TTRL-CoCoV surpasses the best competing methods across six widely recognized benchmarks. It achieves average absolute gains of +9.8% in Pass@1 and +18.7% in Pass@16 compared to TTRL. Furthermore, when evaluated against fully supervised RL methods, TTRL-CoCoV delivers absolute Pass@1 improvements of up to +5.0% across multiple reasoning benchmarks.

Our code repository is available at: https://github.com/shanjf666/CoCoV.

Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
