Global News Digest

arXiv

Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification

Title: Aligning Expected Values for Generative Reward Modeling in Formal Math Verification

Original: arXiv:2606.01160v1. Announce Type: new. Abstract: Large Language Models (LLMs) are increasingly used with formal interactive theorem provers such as Lean 4. Scaling these systems with reinforcement learning or search methods requires process reward models (PRMs) that can evaluate intermediate reasoning steps. Existing reward-model designs expose a practical trade-off. Value-head models provide continuous scores but modify the generative model interface, while generative reward models preserve textual rationales but are poorly matched to continuous floating-point regression because numeric values are split across tokens. We introduce Expected Value Alignment (EVA), a reward-modeling procedure that keeps the surface output discrete while extracting continuous scores from the model's token distribution. The model emits integer scores in a structured JSON format, and EVA computes a continuous score as the expectation over the logits of the corresponding anchor tokens. Training combines the causal language modeling objective with an auxiliary mean squared error loss on these expected values. We instantiate EVA in Leibniz, a reward model for Lean 4 formal verification, and evaluate it against zero-shot and reward-modeling baselines. The evaluation demonstrates that continuous logit-based scoring significantly reduces discretization artifacts while retaining the interpretability of generative critiques.

Rewrite:

Large Language Models (LLMs) are increasingly paired with formal interactive theorem provers such as Lean 4. Scaling these systems with reinforcement learning or search requires process reward models (PRMs) that can assess intermediate reasoning steps. Current reward-model designs force a practical trade-off: value-head models produce continuous scores but alter the standard generative interface, whereas generative reward models keep their textual rationales but are poorly suited to continuous floating-point regression, because a numeric value is split across several tokens.

To address this, we present Expected Value Alignment (EVA), a reward-modeling procedure that keeps the surface output discrete while extracting a continuous score from the model's token distribution. The model emits integer scores in a structured JSON format, and EVA computes the continuous score as an expectation taken over the logits of the corresponding anchor tokens. Training combines the standard causal language modeling objective with an auxiliary mean squared error loss on these expected values.
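
The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the 0 to 4 score range, the choice to renormalize the softmax over the anchor tokens only, and the names expected_score, combined_loss, and mse_weight are all hypothetical and do not come from the abstract.

```python
import math


def expected_score(anchor_logits, anchor_values):
    """Return the expected integer score under a softmax over anchor logits.

    anchor_logits: logits the model assigns to each candidate score token
                   at the position where the integer score is emitted.
    anchor_values: the integer each anchor token stands for (assumed 0..4 here).
    """
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(anchor_logits)
    weights = [math.exp(logit - peak) for logit in anchor_logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Probability-weighted average of the integer values gives a continuous score.
    return sum(p * v for p, v in zip(probs, anchor_values))


def combined_loss(target_token_logprobs, anchor_logits, anchor_values,
                  target_score, mse_weight=1.0):
    """Causal LM loss plus an auxiliary MSE term on the expected score.

    target_token_logprobs: log-probabilities the model gives to each
                           reference token of the critique (hypothetical input).
    mse_weight: assumed mixing coefficient; the abstract does not specify one.
    """
    # Average negative log-likelihood over the reference tokens.
    lm_loss = -sum(target_token_logprobs) / len(target_token_logprobs)
    predicted = expected_score(anchor_logits, anchor_values)
    mse = (predicted - target_score) ** 2
    return lm_loss + mse_weight * mse


# Example: a model torn between scores 2 and 3 yields a value between them,
# even though the text it emits would contain a single integer.
score = expected_score([0.1, 0.3, 2.0, 2.0, 0.2], [0, 1, 2, 3, 4])
```

The point of the sketch is that the emitted text still contains one integer, while the value used for ranking or regression varies smoothly with the model's uncertainty across neighboring integers.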

We instantiate EVA in Leibniz, a reward model for Lean 4 formal verification, and evaluate it against zero-shot and reward-modeling baselines. The evaluation shows that continuous logit-based scoring significantly reduces discretization artifacts while retaining the interpretability of generative critiques.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC

Related Articles

Schroders Renewable Unit Targets AI Assets as Power Demand Soars
Bloomberg

Schroders’ renewable unit targets AI infrastructure, pivoting to meet soaring energy demand from artificial intelligence...

State Street's Paglia on SBI Group Partnership, ETFs
Bloomberg

State Street's Paglia discusses the SBI Group partnership and ETFs.

Nvidia Boss Says Workers Should Be Paid ‘as Much as Possible’
Bloomberg

Nvidia CEO Jensen Huang advocates for paying workers “as much as possible,” emphasizing maximum compensation. This stanc...

TSE Talking With Regulator For Easing ETF Listing Rules
Bloomberg

The Tokyo Stock Exchange is in discussions with regulators about easing ETF listing rules. This aims to simplify market access an...

S&P DJI CEO on Japan Markets, Mega IPOs
Bloomberg

S&P DJI CEO discusses Japan's financial markets and major IPOs.