arXiv

T1: Tool-integrated Verification for Test-time Compute Scaling in Small Language Models

June 2, 2026 · Minki Kang, Jongwon Jeong, Jaewoong Cho

Title: T1: Enhancing Small Language Model Performance at Test Time via Tool-Integrated Verification

Abstract:

Recent research indicates that scaling test-time compute can substantially improve small language models (sLMs). Existing studies, however, have mostly relied on a larger model as the verifier, leaving the ability of sLMs to verify their own outputs underexplored. This study examines how well sLMs can verify output candidates during test-time scaling. Our analysis shows that, even with knowledge distillation from larger verifier models, sLMs remain ineffective at verification tasks that depend heavily on memorization, such as fact-checking and numerical computation.

To overcome this constraint, we introduce Tool-integrated verification (T1), a two-stage framework: external tools first filter the candidate outputs, and the sLM then verifies the candidates that remain. Offloading memorization-heavy operations to tools such as code interpreters reduces how much the sLM itself must memorize. We show both theoretically and empirically that this offloading improves the model's test-time scaling performance.
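
As a rough illustration of the two-stage idea (not the authors' implementation), the sketch below filters best-of-N candidates with a tool and then lets a verifier pick among the survivors. Everything named here is an assumption made for the example: the `Candidate` format with explicit `claims`, the helpers `evaluate`, `passes_tool_check`, and `select`, and the fallback to the unfiltered pool when the filter rejects every candidate. A toy arithmetic evaluator stands in for the code interpreter, and an arbitrary `score` callable stands in for the sLM verifier.

```python
import ast
import operator
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Arithmetic operators the toy "code interpreter" is allowed to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}


def evaluate(expr: str) -> float:
    """Evaluate a plain arithmetic expression such as '3 * (4 + 5)'."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval").body)


@dataclass
class Candidate:
    """Hypothetical candidate format: a final answer plus its stated calculations."""
    answer: str
    # Each claim is (expression, value stated by the generator), e.g. ("12 * 7", 84).
    claims: List[Tuple[str, float]]


def passes_tool_check(c: Candidate, tol: float = 1e-9) -> bool:
    """Stage 1: reject a candidate if any stated calculation is wrong."""
    try:
        return all(abs(evaluate(e) - v) <= tol for e, v in c.claims)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False


def select(candidates: List[Candidate],
           score: Callable[[Candidate], float]) -> Optional[Candidate]:
    """Stage 2: score the survivors with the verifier and return the best one."""
    survivors = [c for c in candidates if passes_tool_check(c)]
    # Assumption: if the tool rejects everything, fall back to the full pool.
    pool = survivors or list(candidates)
    return max(pool, key=score, default=None)


candidates = [
    Candidate(answer="86", claims=[("12 * 7", 86)]),  # arithmetic slip
    Candidate(answer="84", claims=[("12 * 7", 84)]),
]
# Stand-in verifier scores that wrongly prefer the first candidate.
scores = {"86": 0.9, "84": 0.6}
best = select(candidates, score=lambda c: scores[c.answer])
print(best.answer)  # prints 84
```

In this toy run the verifier alone would have chosen the candidate with the arithmetic slip. The tool stage removes it first, so the verifier only ranks candidates whose calculations check out.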

On the MATH benchmark, a Llama-3.2 1B model equipped with T1 and test-time scaling outperforms the substantially larger Llama-3.1 8B model. T1 also increases verification accuracy for both process reward models (PRMs) and critic models. These results underscore the potential of external tools to strengthen the verification capabilities of small language models.
