arXiv

Smart Picks in the Dark: Towards Efficient RLVR for Reasoning via Tracing Metacognitive Pivots

June 4, 2026 · Guangcheng Zhu, Shenzhi Yang, Haobo Wang, Xing Zheng, Yingfan MA, Xuening Feng, Zhongqi Chen, Bowen Song, Weiqiang Wang, Gang Chen

Title: Navigating Uncertainty: Enhancing Reasoning Efficiency in RLVR Through Metacognitive Pivot Tracing

Abstract:

Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the capabilities of large reasoning models (LRMs). However, this progress is often bottlenecked by the need for extensive, fully annotated datasets during time-consuming training. To address this data inefficiency, researchers have explored two primary approaches. The first is data selection, which identifies a minimal set of "golden" samples capable of matching the performance of full-data training; these methods, however, depend on the availability of a pre-labeled data pool. The second is unsupervised RLVR, which applies a model’s internal supervision signals to vast amounts of unlabeled data; these approaches, however, frequently yield suboptimal results.

In response, this study examines the "pick in the dark" framework for RLVR. This approach seeks to identify unlabeled samples that offer the highest training value and warrant annotation, all without relying on prior supervisory signals. Our systematic analysis reveals that effective selection depends critically on a robust uncertainty estimator, which facilitates the strategic division of data into adaptive training regimes.

Building on this finding, we introduce PivotTrace, a novel three-way data triage system. PivotTrace uses attention dynamics to track metacognitive pivots that occur during the reasoning process. By measuring uncertainty through pivot density, the framework routes data automatically, improving both annotation efficiency and training efficiency. Empirical evaluations show that PivotTrace outperforms fully supervised LRM baselines while using only 29.3% of the annotated samples and accelerating convergence by a factor of 2.75.
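
To make the idea of routing by pivot density concrete, the following is a minimal sketch and not the authors' implementation. The abstract states that PivotTrace detects pivots from attention dynamics and does not specify the three routes or any thresholds. Everything here is therefore an assumption for illustration: the lexical cue list `PIVOT_CUES` stands in for attention-based pivot detection, and the function names, bucket names, and the `low`/`high` cutoffs are hypothetical.

```python
# Hypothetical lexical cues standing in for metacognitive pivots.
# ASSUMPTION: the paper detects pivots from attention dynamics; this
# word-list proxy is only an illustration and is not the paper's method.
PIVOT_CUES = {"wait", "hmm", "alternatively", "however"}


def pivot_density(trace: str) -> float:
    """Return the fraction of tokens in a reasoning trace that are pivot cues."""
    tokens = [t.strip(".,;:!?").lower() for t in trace.split()]
    tokens = [t for t in tokens if t]  # drop tokens that were only punctuation
    if not tokens:
        return 0.0
    return sum(t in PIVOT_CUES for t in tokens) / len(tokens)


def triage(density: float, low: float = 0.05, high: float = 0.20) -> str:
    """Route a sample to one of three buckets by pivot density.

    ASSUMPTION: the bucket names and the low/high thresholds are
    hypothetical; the abstract does not define them.
    """
    if density < low:
        return "low_uncertainty"
    if density < high:
        return "medium_uncertainty"
    return "high_uncertainty"


if __name__ == "__main__":
    for trace in [
        "The answer is four.",
        "Wait the sum of two and two is exactly four",
        "Wait, hmm, the answer is four.",
    ]:
        d = pivot_density(trace)
        print(f"{d:.2f} {triage(d)} :: {trace}")
```

In this sketch, a higher pivot density is read as higher model uncertainty on the sample. How each bucket would map to a training or annotation regime is left open, since the abstract does not say.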
