arXiv

Aletheia: What Makes RLVR For Code Verifiers Tick?

Title: Aletheia: Deciphering the Mechanics of RLVR Code Verifiers

Abstract:

While Reinforcement Learning with Verifiable Rewards (RLVR) is a core component of contemporary post-training strategies for multi-domain thinking verifiers, the adoption of such verifiers for code generation has lagged, with execution feedback remaining the more common signal. This lag is primarily attributed to the high financial and computational cost of the complete RLVR workflow. To address this, our study examines three variables along the performance-cost spectrum of RLVR: the utility of intermediate thinking traces, the value of learning from negative samples, and the necessity of on-policy training.

We present Aletheia, a rigorous, execution-based testbed designed to enable contamination-free evaluation of code verifier training methodologies. This framework allows for the assessment of various training recipes across different model scales and covariate shifts, specifically targeting two prevalent verifier use cases. Our empirical findings indicate that the ideal training strategy relies heavily on model scale. For smaller verifiers, on-policy learning serves as the dominant factor for performance gains. Conversely, at larger scales, the allocation of the thinking budget emerges as the most crucial element.

Although the use of negative samples consistently enhances top-1 selection accuracy regardless of model size, their role in reconstructing rankings grows steadily as models scale up, becoming essential for stabilizing training processes in larger architectures. Through Pareto optimality analysis, we observe that removing on-policy training at larger scales produces a verifier that matches the performance of the comprehensive RLVR approach. Additionally, we identify that omitting thinking traces represents a highly efficient, compute-saving strategy for lower-budget scenarios, striking a favorable balance between cost and accuracy. Ultimately, this research establishes the empirical groundwork required to deploy robust code verifiers efficiently, thereby promoting their broader integration into post-training pipelines for large-scale code generation models.
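
As a rough illustration of the kind of Pareto optimality analysis mentioned above, the following Python sketch picks out the training recipes that no other recipe beats on both cost and accuracy. The recipe names, the `Recipe` type, the `pareto_front` helper, and all numeric values are hypothetical placeholders chosen for illustration; they are not results or code from the paper.

```python
from typing import List, NamedTuple


class Recipe(NamedTuple):
    name: str
    cost: float      # training compute in arbitrary units (placeholder)
    accuracy: float  # verifier accuracy in [0, 1] (placeholder)


def pareto_front(recipes: List[Recipe]) -> List[Recipe]:
    """Return the recipes that are not dominated by any other recipe.

    A recipe is dominated if another one costs no more, is at least as
    accurate, and is strictly better on at least one of the two axes.
    """
    front = []
    for r in recipes:
        dominated = any(
            o.cost <= r.cost
            and o.accuracy >= r.accuracy
            and (o.cost < r.cost or o.accuracy > r.accuracy)
            for o in recipes
        )
        if not dominated:
            front.append(r)
    # Sort by cost so the frontier reads from cheapest to most expensive.
    return sorted(front, key=lambda r: r.cost)


# Placeholder values for illustration only; NOT results from the paper.
recipes = [
    Recipe("full_rlvr", cost=10.0, accuracy=0.80),
    Recipe("no_on_policy", cost=6.0, accuracy=0.80),
    Recipe("no_thinking", cost=2.0, accuracy=0.70),
    Recipe("no_negatives", cost=9.0, accuracy=0.65),
]

front_names = [r.name for r in pareto_front(recipes)]
print(front_names)
```

With these made-up numbers, a recipe that matches the full pipeline's accuracy at lower cost dominates it, while a much cheaper but less accurate recipe stays on the frontier as the low-budget option.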


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
