arXiv

Aletheia: What Makes RLVR For Code Verifiers Tick?

June 3, 2026 · Vatsal Venkatkrishna, Indraneil Paul, Iryna Gurevych

Title: Aletheia: Deciphering the Mechanics of RLVR Code Verifiers

Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR) is a core component of current post-training strategies for multi-domain thinking verifiers, yet its adoption in code generation has lagged behind the use of execution feedback. The lag is attributed mainly to the high financial and computational cost of the full RLVR workflow. To address this, we examine three factors along the performance-cost trade-off of RLVR: the utility of intermediate thinking traces, the value of learning from negative samples, and the necessity of on-policy training.
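The three factors can be read as independent on/off switches on a training recipe. The following is a minimal sketch of that reading, not the authors' configuration: the `Recipe` class, its field names, and the full 2x2x2 grid are assumptions made for illustration, and the study may cover only a subset of these combinations.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Recipe:
    """Hypothetical description of one verifier training recipe."""
    thinking_traces: bool   # verifier emits intermediate reasoning before judging
    negative_samples: bool  # training also learns from incorrect candidates
    on_policy: bool         # training data is sampled from the current policy

    @property
    def is_full_rlvr(self) -> bool:
        # Assumption: the "complete RLVR workflow" keeps all three switched on.
        return self.thinking_traces and self.negative_samples and self.on_policy


def ablation_grid() -> list[Recipe]:
    """Enumerate every on/off combination of the three factors."""
    return [Recipe(*flags) for flags in product([True, False], repeat=3)]


recipes = ablation_grid()
full = [r for r in recipes if r.is_full_rlvr]
```

Each cheaper recipe switches off one or more factors relative to the single all-on entry, which is what makes a cost-versus-accuracy comparison across recipes possible.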

We present Aletheia, a rigorous, execution-based testbed that enables contamination-free evaluation of code verifier training methods. It supports assessing training recipes across model scales and covariate shifts for two prevalent verifier use cases. Our empirical findings indicate that the ideal training strategy depends heavily on model scale. For smaller verifiers, on-policy learning is the dominant driver of performance gains. At larger scales, the allocation of the thinking budget becomes the most important factor.
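A minimal sketch of how the two use cases could be scored against execution outcomes, assuming they correspond to the top-1 selection and ranking metrics named later in the abstract. The function names, the input layout (per-problem lists of verifier scores and pass/fail flags), and the pairwise ranking measure are illustrative assumptions, not Aletheia's actual interface or metrics.

```python
def top1_accuracy(problems):
    """Fraction of problems whose highest-scored candidate passes execution."""
    hits = 0
    for scores, passed in problems:
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += passed[best]
    return hits / len(problems)


def pairwise_ranking_accuracy(problems):
    """Fraction of (passing, failing) candidate pairs that the verifier orders correctly."""
    correct = total = 0
    for scores, passed in problems:
        for s_pos, ok_pos in zip(scores, passed):
            if not ok_pos:
                continue
            for s_neg, ok_neg in zip(scores, passed):
                if ok_neg:
                    continue
                total += 1
                correct += s_pos > s_neg
    return correct / total if total else float("nan")


# Toy inputs: (verifier scores, execution pass/fail) per problem.
problems = [
    ([0.9, 0.2, 0.4], [True, False, False]),
    ([0.3, 0.8, 0.6], [True, False, True]),
]
top1 = top1_accuracy(problems)
pairwise = pairwise_ranking_accuracy(problems)
```

In the toy data the verifier picks a passing candidate for the first problem but a failing one for the second, so both measures come out at 0.5.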

Negative samples consistently improve top-1 selection accuracy regardless of model size, while their role in reconstructing rankings grows steadily as models scale up, becoming essential for stabilizing training in larger models. Through Pareto optimality analysis, we observe that removing on-policy training at larger scales produces a verifier that matches the performance of the full RLVR approach. We also find that omitting thinking traces is a highly compute-efficient strategy for lower-budget scenarios, striking a favorable balance between cost and accuracy. Overall, this work establishes the empirical groundwork needed to deploy robust code verifiers efficiently, promoting their broader integration into post-training pipelines for large-scale code generation models.
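For readers unfamiliar with Pareto optimality analysis, the sketch below shows the general idea on a cost-accuracy plane: a recipe is kept only if no other recipe is at least as cheap and at least as accurate while being strictly better on one of the two. The recipe labels and numbers are made-up placeholders for illustration, not results from the paper.

```python
def pareto_front(points):
    """Return names of points not dominated on (lower cost, higher accuracy)."""
    front = []
    for name, cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in points
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder (name, training cost, accuracy) triples; NOT the paper's data.
points = [
    ("recipe_a", 10.0, 0.80),
    ("recipe_b", 6.0, 0.80),
    ("recipe_c", 2.0, 0.70),
    ("recipe_d", 9.0, 0.65),
]
front = pareto_front(points)
```

Here `recipe_b` dominates `recipe_a` because it reaches the same accuracy at lower cost, which mirrors the kind of statement the abstract makes about a cheaper recipe matching the full one.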
