arXiv

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

June 2, 2026 · Atoosa Chegini, Soheil Feizi

Title: Leveraging Pre-Trained LLMs as Process Scorers: A Training-Free Approach to PRM for Math Reasoning

Abstract

Using a stronger scorer to select the best of several responses sampled from a smaller model is a simple inference-time tactic, but it fails when the smaller model has already committed to a flawed reasoning trajectory. Process Reward Models (PRMs) address this by evaluating candidate continuations during generation; however, they require training a reward model on step-level annotations. To remove this dependency, we introduce Chunk-Level Guided Generation, a training-free method that uses an off-the-shelf large language model (LLM) as a process scorer. At each generation step, the smaller model produces k fixed-length candidate chunks, and the larger model ranks them by likelihood without generating any text itself. The method commits to the highest-scoring chunk before moving to the next step, which steers generation and prevents errors from propagating.
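The propose, rank, and commit loop described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `propose` and `score` callables, the token-list representation, and the end-of-sequence stopping rule are all assumptions made for this example.

```python
from typing import Callable, List

Tokens = List[int]


def chunk_guided_generate(
    prompt: Tokens,
    propose: Callable[[Tokens, int, int], List[Tokens]],  # hypothetical: small model sampler
    score: Callable[[Tokens, Tokens], float],             # hypothetical: large model scorer
    k: int = 4,
    chunk_len: int = 8,
    max_chunks: int = 16,
    eos: int = 0,  # assumed end-of-sequence token id
) -> Tokens:
    """Greedy chunk-level guidance: sample k chunks, keep the best, repeat."""
    prefix = list(prompt)
    for _ in range(max_chunks):
        # The smaller model proposes k candidate chunks of fixed length.
        candidates = propose(prefix, k, chunk_len)
        # The larger model only scores the candidates; it generates no text.
        best = max(candidates, key=lambda chunk: score(prefix, chunk))
        # Commit to the winning chunk before generating the next one.
        prefix.extend(best)
        # Assumed stopping rule: halt once the committed chunk contains EOS.
        if eos in best:
            break
    return prefix


# Toy stand-ins so the sketch runs without any model.
def toy_propose(prefix: Tokens, k: int, chunk_len: int) -> List[Tokens]:
    return [[i + 1] * chunk_len for i in range(k)]


def toy_score(prefix: Tokens, chunk: Tokens) -> float:
    return -abs(chunk[0] - 3)  # pretend the scorer prefers token id 3


result = chunk_guided_generate([9], toy_propose, toy_score, k=4, chunk_len=2, max_chunks=3)
```

In a real setting, `propose` would sample from the smaller model and `score` would run a single forward pass of the larger model over each candidate, so the larger model is never used for decoding.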

We instantiate two selection strategies within this framework. Likelihood-Guided Selection (LGS) chooses the chunk with the highest length-normalized log-probability under the larger model. Contrastive-Guided Selection (CGS) further improves performance by subtracting the smaller model’s log-probability from the larger model’s, favoring chunks that the larger model prefers much more strongly than the smaller one does. Our analysis shows that scoring reasoning steps of varying lengths with large-model likelihoods is unreliable because of a systematic length bias that length normalization does not fully eliminate. Using fixed-length chunks avoids this confound.
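The two selection rules can be illustrated with a short sketch operating on per-token log-probabilities. The function names, the input format, and the choice to length-normalize both terms of the contrastive score are assumptions for this example. With fixed-length chunks, normalization only rescales all scores by the same constant and does not change the ranking.

```python
from typing import List, Sequence


def lgs_score(large_logprobs: Sequence[float]) -> float:
    """Mean per-token log-probability of a chunk under the larger model."""
    return sum(large_logprobs) / len(large_logprobs)


def cgs_score(large_logprobs: Sequence[float], small_logprobs: Sequence[float]) -> float:
    """Larger-model score minus smaller-model score for the same chunk."""
    small_mean = sum(small_logprobs) / len(small_logprobs)
    return lgs_score(large_logprobs) - small_mean


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest-scoring candidate."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Two made-up candidate chunks of three tokens each (illustrative values only).
large: List[List[float]] = [[-0.5, -0.5, -0.5], [-1.0, -1.0, -1.0]]
small: List[List[float]] = [[-0.25, -0.25, -0.25], [-2.0, -2.0, -2.0]]

lgs_choice = argmax([lgs_score(lp) for lp in large])
cgs_choice = argmax([cgs_score(lp, sp) for lp, sp in zip(large, small)])
```

In this toy case LGS picks the first chunk, which both models find likely, while CGS picks the second, where the larger model's preference exceeds the smaller model's by the widest margin.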

We evaluate the approach on GSM8K, MATH, Minerva Math, AMC23, and AIME24. With Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 percentage points. Under comparable guidance budgets, it matches or exceeds search guided by Qwen2.5-Math-PRM-72B on most benchmarks, without any reward model training. With Qwen2.5-7B guided by Qwen2.5-72B and k=16, CGS reaches 81.8% on MATH and 63.6% on Minerva Math, outperforming majority voting by 4–6 percentage points. Chunk-Level Guided Generation also yields significantly more concise reasoning traces than PRM-guided search.
