Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification
Title: Bridging the Verification-Generation Divide: Test-Time Reinforcement Learning via Confidence-Conditioned Verification
Abstract:
Test-time reinforcement learning has recently gained traction as a powerful, label-free approach for boosting the complex reasoning capabilities of large language models. While prior research has largely prioritized Pass@1, optimizing Pass@k remains a critical yet under-addressed challenge in label-free settings, because Pass@k is a key indicator of generation coverage during sustained exploration. Optimizing Pass@k in these settings is non-trivial: Pass@k advantage strategies that work well for reinforcement learning with verifiable rewards (RLVR) yield suboptimal results when adapted directly.
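For readers unfamiliar with the metric, the sketch below computes Pass@k with the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n sampled answers of which c are correct. It assumes this common estimator; the abstract does not state which estimator the evaluation uses, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples with c correct ones.

    Assumption: the widely used combinatorial estimator; not confirmed
    as the exact evaluation protocol behind the reported numbers.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are incorrect, any k-subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random k-subset contains at least one correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 16 samples, 4 correct: Pass@1 equals the fraction of correct samples.
    print(pass_at_k(16, 4, 1))
```

With k = 1 the estimate reduces to c / n, which is why Pass@1 tracks accuracy while larger k rewards coverage across samples.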
Through empirical analysis, we identify the causes of these performance bottlenecks: pseudo-label estimates for low-confidence samples are prone to significant errors, whereas candidate answers for high-confidence samples frequently suffer a severe collapse in diversity. To address these issues, we introduce TTRL-CoCoV (Test-Time Reinforcement Learning with Confidence-Conditioned Verification), a confidence-adaptive framework designed to broaden Pass@k coverage and improve Pass@1 performance.
Guided by the insight that verification proficiency generally underpins generation capability, TTRL-CoCoV applies a different treatment at each confidence level. For high-confidence samples, it bootstraps the verifier and applies an exploration-enhancing reward to mitigate diversity collapse. For low-confidence samples, it delegates pseudo-label selection to the verifier to filter out inaccurate labels. Medium-confidence samples bypass verification entirely.
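The three-way routing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: confidence is taken to be the majority-vote share among sampled answers, the thresholds `low` and `high` are hypothetical values, and `verifier_pick` is a hypothetical callable standing in for the verifier. The exploration-enhancing reward and verifier bootstrapping for the high-confidence branch are not modeled.

```python
from collections import Counter
from typing import Callable, List, Tuple


def majority_confidence(answers: List[str]) -> Tuple[str, float]:
    """Return the most frequent answer and its vote share (assumed confidence)."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def route_sample(
    answers: List[str],
    verifier_pick: Callable[[List[str]], str],  # hypothetical verifier interface
    low: float = 0.3,   # hypothetical threshold, not taken from the paper
    high: float = 0.8,  # hypothetical threshold, not taken from the paper
) -> Tuple[str, str]:
    """Assign a confidence branch and a pseudo-label to one prompt's samples."""
    label, conf = majority_confidence(answers)
    if conf >= high:
        # High confidence: keep the majority label; the exploration-enhancing
        # reward would be applied downstream (not modeled here).
        return "high", label
    if conf < low:
        # Low confidence: let the verifier choose among the distinct candidates.
        return "low", verifier_pick(sorted(set(answers)))
    # Medium confidence: bypass verification and use the majority label.
    return "medium", label


if __name__ == "__main__":
    pick_last = lambda candidates: candidates[-1]  # toy stand-in verifier
    print(route_sample(["4"] * 9 + ["5"], pick_last))
    print(route_sample(["1", "2", "3", "4", "5"], pick_last))
```

Keeping the verifier behind a plain callable makes it easy to swap in a model-based verifier without changing the routing logic.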
Extensive experiments confirm that TTRL-CoCoV surpasses the best competing methods across six widely recognized benchmarks. It achieves average absolute gains of +9.8% in Pass@1 and +18.7% in Pass@16 compared to TTRL. Furthermore, when evaluated against fully supervised RL methods, TTRL-CoCoV delivers absolute Pass@1 improvements of up to +5.0% across multiple reasoning benchmarks.
Our code repository is available at: https://github.com/shanjf666/CoCoV.
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



