arXiv

ATLAS: Agentic Test-time Learning-to-Allocate Scaling

Abstract:

While test-time scaling has emerged as a primary method for enhancing the reasoning capabilities of large language models, its execution has traditionally relied on rigid, designer-engineered orchestration. Conventional approaches typically employ static parameters—such as fixed sample budgets, unchanging refinement loops, predetermined scoring rules, or set search policies—to dictate compute allocation. This structure leaves the model responsible for solving problems but not for managing the orchestration process itself. To address this limitation, we present ATLAS, an agentic test-time scaling framework that grants an LLM orchestrator end-to-end control of the loop.
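For contrast, a fixed-workflow recipe of the kind described above can be sketched as majority voting over a fixed number of samples. This is an illustrative assumption, not one of the paper's baselines: the function name `fixed_budget_vote`, the `solver` callable, and the `n_samples` parameter are all hypothetical.

```python
from collections import Counter
from typing import Callable


def fixed_budget_vote(
    problem: str,
    solver: Callable[[str], str],
    n_samples: int = 5,
) -> str:
    """Illustrative static orchestration: always draw exactly n_samples answers.

    The budget and the aggregation rule (majority vote) are chosen by the
    designer ahead of time; nothing here adapts to the problem at hand.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [solver(problem) for _ in range(n_samples)]
    # most_common(1) returns the single most frequent answer and its count
    return Counter(answers).most_common(1)[0][0]
```

The point of the sketch is that the number of solver calls is the same for an easy problem as for a hard one, which is the rigidity an agentic orchestrator is meant to remove.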

In this system, the orchestrator determines when to halt, how to synthesize the final response, and whether to collect additional evidence through a single, extensible action: "explore." This action dispatches a new, independent solver to the original problem, with each call potentially defining specific parameters such as the solver type, reasoning intensity, or prompting strategy.
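The control loop described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: in ATLAS the decisions come from an LLM orchestrator, whereas here a deterministic stand-in (`toy_policy`) stops once two solver outputs agree. All names (`ExploreCall`, `State`, `run_orchestrator`, `make_scripted_solver`) and the specific fields and values are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple, Union


@dataclass
class ExploreCall:
    """Hypothetical per-call parameters for one explore action."""
    solver: str = "default"
    effort: str = "medium"
    strategy: str = "direct"


@dataclass
class State:
    """Evidence the orchestrator has accumulated so far."""
    problem: str
    evidence: List[Tuple[ExploreCall, str]] = field(default_factory=list)


# A policy returns ("explore", ExploreCall) or ("answer", final_text).
Action = Tuple[str, Union[ExploreCall, str]]
Policy = Callable[[State, bool], Action]
Solver = Callable[[str, ExploreCall], str]


def run_orchestrator(problem: str, policy: Policy, solver: Solver,
                     max_calls: int = 8) -> Tuple[str, int]:
    """Loop until the policy answers; return (answer, number of solver calls)."""
    state = State(problem)
    while True:
        must_answer = len(state.evidence) >= max_calls
        action, payload = policy(state, must_answer)
        if action == "answer":
            return str(payload), len(state.evidence)
        if must_answer:
            raise RuntimeError("policy must answer once the budget is spent")
        # Each explore sends the ORIGINAL problem to a fresh, independent solver.
        state.evidence.append((payload, solver(state.problem, payload)))


def toy_policy(state: State, must_answer: bool) -> Action:
    """Stand-in for the LLM orchestrator: stop when two outputs agree."""
    counts = Counter(answer for _, answer in state.evidence)
    if counts:
        top, n = counts.most_common(1)[0]
        if n >= 2 or must_answer:
            return ("answer", top)
    if must_answer:
        return ("answer", "")
    # Escalate reasoning effort with each additional explore call.
    efforts = ["low", "medium", "high"]
    return ("explore", ExploreCall(effort=efforts[min(len(state.evidence), 2)]))


def make_scripted_solver(outputs: Sequence[str]) -> Solver:
    """Hypothetical stand-in for real solver calls: replays canned outputs."""
    it = iter(outputs)

    def solver(problem: str, call: ExploreCall) -> str:
        return next(it)

    return solver
```

With canned solver outputs `["41", "42", "42", "42"]`, `run_orchestrator("q", toy_policy, make_scripted_solver([...]))` stops after three explore calls rather than spending the full budget of eight, because the second and third outputs agree.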

We evaluated ATLAS with a Claude Sonnet 4.6 backbone on four benchmarks spanning scientific question answering, code generation, and multimodal reasoning. ATLAS required significantly fewer API calls than fixed-workflow baselines while reaching 56.00% on HLE-Verified, 82.29% on LiveCodeBench, 85.75% on GPQA-Diamond, and 23.71% on BabyVision.

A multi-model variant, ATLAS-MM, adds solver selection as a further action dimension and improves results again: 60.00% on HLE-Verified and 85.63% on LiveCodeBench, with consistent gains on GPQA-Diamond and BabyVision. In ablations, replacing the orchestrator's direct synthesis with a separate integrator either degraded performance or failed to improve accuracy on three of the four benchmarks. These results point to stateful evidence management as a key driver of the observed gains.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
