arXiv

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

June 4, 2026 · Zhangchen Xu, Junda Chen, Yue Huang, Dongfu Jiang, Jiefeng Chen, Hang Hua, Zijian Wu, Zheyuan Liu, Zexue He, Lichi Li, Shizhe Diao, Jiaxin Pei, Jinsung Yoon, Hao Zhang, Mengdi Wang, Radha Poovendran, Misha Sra, Alex Pentland, Zichen Chen

Abstract:

Scientific and engineering advancements are inherently driven by a long-horizon, iterative cycle involving hypothesis generation, experimentation, outcome measurement, and the continuous refinement of work products. However, current benchmarks for frontier models largely focus on single-turn interactions or short-horizon agent trajectories, thereby missing the complexities associated with sustained, iterative improvement over extended periods. To bridge this gap, we present AutoLab, a novel benchmark designed for ultra long-horizon closed-loop optimization.

AutoLab features 36 expert-curated, realistic tasks distributed across four distinct domains: system optimization, puzzle and challenge solving, model development, and CUDA kernel optimization. Each task starts with a baseline that is functionally correct but intentionally suboptimal, requiring agents to enhance performance within a rigid wall-clock time limit.

Our evaluation of 17 state-of-the-art models highlights a critical finding: success is driven less by the quality of an agent’s initial attempt and more by its persistence in continuously benchmarking, editing, and integrating empirical feedback. While models like claude-opus-4.6 demonstrate robust capabilities in long-horizon optimization, the majority of frontier models—including several proprietary systems—tend to terminate prematurely or deplete their time budgets with negligible progress. These findings emphasize the necessity of time awareness and persistent iteration for autonomous agents. To facilitate further research into truly capable long-horizon agents, we have open-sourced the complete benchmark, evaluation harness, and task artifacts.
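The closed-loop pattern described above (start from a correct but suboptimal baseline, then repeatedly edit, benchmark, and keep what measurably helps until a wall-clock limit) can be illustrated with a minimal sketch. This is not the AutoLab harness or any agent's actual strategy; `optimize_within_budget`, `propose`, `measure`, and the toy objective are hypothetical names chosen only for illustration, and the random-perturbation proposal stands in for whatever edits a real agent would make.

```python
import random
import time


def optimize_within_budget(baseline, propose, measure, budget_seconds, reserve_seconds=0.0):
    """Iterate until the wall-clock deadline, keeping the best measured candidate.

    All names here are illustrative assumptions, not part of the AutoLab API.
    """
    deadline = time.monotonic() + budget_seconds
    best, best_score = baseline, measure(baseline)  # the baseline is always a valid fallback
    history = [best_score]                          # best score after each iteration

    # Time awareness: stop early enough to leave `reserve_seconds` for final submission.
    while time.monotonic() < deadline - reserve_seconds:
        candidate = propose(best)      # edit the current best work product
        score = measure(candidate)     # benchmark the edit
        if score > best_score:         # integrate empirical feedback: keep only improvements
            best, best_score = candidate, score
        history.append(best_score)

    return best, best_score, history


# Toy task (assumed for illustration): maximize -(x - 3)^2 starting from x = 0.
rng = random.Random(0)


def measure(x):
    return -(x - 3.0) ** 2


def propose(x):
    return x + rng.uniform(-0.5, 0.5)


best, best_score, history = optimize_within_budget(
    0.0, propose, measure, budget_seconds=0.05
)
```

Because the loop only ever replaces the current best with a strictly better candidate, the returned score can never fall below the baseline, and an agent that exits the loop early simply forfeits iterations it could have used. The sketch shows the two behaviors the abstract singles out, time awareness and persistent iteration, in the simplest possible form.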
