arXiv

ATLAS: Agentic Test-time Learning-to-Allocate Scaling

June 2, 2026 · Peijia Qin, Qi Cao, Pengtao Xie

Abstract:

While test-time scaling has emerged as a primary method for enhancing the reasoning capabilities of large language models, its execution has traditionally relied on rigid, designer-engineered orchestration. Conventional approaches typically employ static parameters—such as fixed sample budgets, unchanging refinement loops, predetermined scoring rules, or set search policies—to dictate compute allocation. This structure leaves the model responsible for solving problems but not for managing the orchestration process itself. To address this limitation, we present ATLAS, an agentic test-time scaling framework that grants an LLM orchestrator end-to-end control of the loop.

In this system, the orchestrator determines when to halt, how to synthesize the final response, and whether to collect additional evidence through a single, extensible action: "explore." This action dispatches a new, independent solver to the original problem, with each call potentially defining specific parameters such as the solver type, reasoning intensity, or prompting strategy.
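The control flow described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `Explore`, `Stop`, `run_orchestrator`, `toy_policy`, and `toy_solver` are hypothetical, the parameter fields simply mirror the solver type, reasoning intensity, and prompting strategy mentioned in the abstract, and a rule-based policy stands in for the LLM orchestrator.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Explore:
    """Hypothetical 'explore' action: dispatch one more independent solver."""
    solver: str = "default"       # assumed solver-type label
    effort: str = "medium"        # assumed reasoning-intensity label
    prompt_style: str = "direct"  # assumed prompting-strategy label


@dataclass
class Stop:
    """Hypothetical halt action carrying the orchestrator's final answer."""
    answer: str


def run_orchestrator(problem, policy, solve, max_calls=8):
    """Let `policy` decide each step: explore again, or stop and answer.

    Returns (final_answer, number_of_solver_calls).
    """
    evidence = []  # (Explore action, solver output) pairs kept across steps
    while True:
        action = policy(problem, evidence)
        if isinstance(action, Stop):
            return action.answer, len(evidence)
        if len(evidence) >= max_calls:
            raise RuntimeError("policy kept exploring past the call budget")
        # Each solver sees only the original problem, not earlier outputs.
        evidence.append((action, solve(problem, action)))


def toy_policy(problem, evidence):
    """Rule-based stand-in for the LLM orchestrator: stop once two runs agree."""
    counts = Counter(answer for _, answer in evidence)
    if counts:
        best, n = counts.most_common(1)[0]
        if n >= 2:
            return Stop(answer=best)
    # Start cheap, then raise the reasoning effort on later calls.
    return Explore(effort="low" if not evidence else "high")


def toy_solver(problem, action):
    """Deterministic fake solver so the example runs without any API."""
    return "41" if action.effort == "low" else "42"


if __name__ == "__main__":
    print(run_orchestrator("What is 6 * 7?", toy_policy, toy_solver))
```

In this toy run the policy spends one low-effort call and two high-effort calls, then stops when the two high-effort answers agree. The point of the sketch is only that the number of solver calls is decided by the policy at run time rather than fixed in advance.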

We evaluated ATLAS with a Claude Sonnet 4.6 backbone on four benchmarks spanning scientific question answering, code generation, and multimodal reasoning. ATLAS required significantly fewer API calls than fixed-workflow baselines while reaching 56.00% on HLE-Verified, 82.29% on LiveCodeBench, 85.75% on GPQA-Diamond, and 23.71% on BabyVision.

A multi-model variant, ATLAS-MM, adds solver selection as a further action dimension. It raised scores to 60.00% on HLE-Verified and 85.63% on LiveCodeBench, with consistent gains on GPQA-Diamond and BabyVision. In ablation studies, replacing the orchestrator’s direct synthesis capability with a separate integrator either degraded performance or failed to improve accuracy on three of the four benchmarks. These results point to stateful evidence management as a key driver of the observed gains.

Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC