Global News Digest

arXiv

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

Title: Leveraging Pre-Trained LLMs as Process Scorers: A Training-Free Approach to PRM for Math Reasoning

Abstract

Selecting the best of several responses sampled from a smaller model with a more powerful scorer is a straightforward inference-time tactic, but it fails when the smaller model has already committed to flawed reasoning trajectories. Process Reward Models (PRMs) mitigate this by evaluating candidate continuations during generation; however, they require training a reward model on step-level annotations. To remove this dependency, we introduce Chunk-Level Guided Generation, a training-free method that uses an off-the-shelf large language model (LLM) as a process scorer. At every generation step, a smaller model produces k candidate chunks of fixed length. The larger model then ranks these candidates by the likelihood it assigns to them, without generating any text itself. By committing to the highest-scoring chunk before proceeding to the next step, the method steers generation away from errors before they propagate.
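The propose, score, and commit loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `propose` stands in for the smaller model sampling k fixed-length chunks, `score` stands in for the larger model's likelihood-based ranking, and the `eos` token id, the default values, and the stopping rule are all assumptions made for the example.

```python
from typing import Callable, List

Tokens = List[int]


def guided_generate(
    prompt: Tokens,
    propose: Callable[[Tokens, int, int], List[Tokens]],  # hypothetical small-model sampler
    score: Callable[[Tokens, Tokens], float],             # hypothetical large-model scorer
    k: int = 4,
    chunk_len: int = 8,
    max_chunks: int = 16,
    eos: int = -1,                                        # assumed end-of-sequence id
) -> Tokens:
    """Greedy chunk-level guidance: propose k chunks, commit to the best, repeat."""
    context = list(prompt)
    for _ in range(max_chunks):
        # The small model proposes k candidate chunks of the same length.
        candidates = propose(context, k, chunk_len)
        # The large model only scores the candidates; it never generates text.
        best = max(candidates, key=lambda chunk: score(context, chunk))
        # Commit to the winning chunk before moving on to the next step.
        context.extend(best)
        if eos in best:
            break
    return context[len(prompt):]
```

Because the scorer is passed in as a plain function, either selection rule (likelihood-based or contrastive) can be plugged into the same loop.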

We implement two selection strategies within this framework: Likelihood-Guided Selection (LGS), which chooses the chunk with the highest length-normalized log-probability under the larger model, and Contrastive-Guided Selection (CGS), which improves on LGS by subtracting the smaller model’s log-probability from the larger model’s, thereby favoring chunks that the larger model prefers far more strongly than the smaller one does. Our analysis shows that scoring reasoning steps of varying lengths with large-model likelihoods is unreliable because of a systematic length bias that length normalization does not fully eliminate. Fixed-length chunks avoid this confound.
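As a rough illustration of the two selection rules, the sketch below scores chunks from per-token log-probabilities. The function names and input layout are hypothetical, and the abstract does not say whether CGS weights or normalizes the two terms; the example assumes an unweighted difference of mean log-probabilities, which does not change the ranking when all chunks have the same length.

```python
from typing import List, Sequence


def lgs_score(large_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of a chunk under the larger model."""
    return sum(large_logprobs) / len(large_logprobs)


def cgs_score(large_logprobs: Sequence[float], small_logprobs: Sequence[float]) -> float:
    """Larger-model score minus smaller-model score (assumed unweighted)."""
    return lgs_score(large_logprobs) - lgs_score(small_logprobs)


def select_chunk(
    large: List[Sequence[float]],
    small: List[Sequence[float]],
    contrastive: bool = True,
) -> int:
    """Return the index of the highest-scoring candidate chunk."""
    if contrastive:
        scores = [cgs_score(lg, sm) for lg, sm in zip(large, small)]
    else:
        scores = [lgs_score(lg) for lg in large]
    return max(range(len(scores)), key=scores.__getitem__)


# Two fixed-length candidates with made-up per-token log-probabilities.
large = [[-0.5, -0.5], [-1.0, -1.0]]
small = [[-0.25, -0.25], [-2.0, -2.0]]
lgs_pick = select_chunk(large, small, contrastive=False)  # likelihood only
cgs_pick = select_chunk(large, small, contrastive=True)   # contrastive
```

With these made-up numbers, LGS picks the first chunk because the larger model finds it most likely, while CGS picks the second because the larger model rates it well above where the smaller model does.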

Empirical evaluations across GSM8K, MATH, Minerva Math, AMC23, and AIME24 demonstrate the efficacy of our approach. When Qwen2.5-1.5B is guided by Qwen2.5-32B, and Llama-3.2-1B is guided by Llama-3.1-70B, CGS surpasses majority voting by as much as 28 percentage points. Furthermore, under comparable guidance budgets, it performs on par with or better than Qwen2.5-Math-PRM-72B guided search on most benchmarks, all without the need for reward model training. Additionally, using Qwen2.5-7B guided by Qwen2.5-72B with k=16, CGS achieves scores of 81.8% on MATH and 63.6% on Minerva Math, outperforming majority voting by 4–6 percentage points. Notably, Chunk-Level Guided Generation also yields significantly more concise reasoning traces compared to PRM-guided search.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
