arXiv

SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

June 2, 2026 · Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, Jingqi Tong, Shihan Dou, Ming Zhang, Lei Bai, Zhenfei Yin, Tao Gui, Xingjun Ma, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang

Abstract:

Scientific reasoning fundamentally requires the integration of advanced toolkits to navigate specialized domain knowledge. However, existing benchmarks frequently neglect the capacity of agents to coordinate tools for such rigorous, complex workflows. To address this limitation, we introduce SciAgentGym, a scalable interactive environment equipped with a robust execution infrastructure. This platform features 1,780 domain-specific tools spanning four natural science disciplines. In parallel, we present SciAgentBench, a hierarchical evaluation suite designed to rigorously test agentic abilities, ranging from basic operations to extended, long-horizon workflows.

Our assessment reveals a significant bottleneck: even state-of-the-art models struggle with complex scientific tool-use, and their performance deteriorates markedly as interactions grow longer. To mitigate these issues, we propose SciForge, a data synthesis approach that represents the tool action space as a dependency graph to produce logic-aware training trajectories. After fine-tuning on these trajectories, our model, SciAgent-8B, surpasses the considerably larger Qwen3-VL-235B-Instruct and also demonstrates effective cross-domain transfer of scientific tool-use capabilities. These findings highlight the promise of next-generation autonomous scientific agents.
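
To make the dependency-graph idea concrete, the following is a minimal sketch and not the SciForge implementation, whose details the abstract does not give. It assumes that each tool lists the tools whose outputs it consumes, and that a "logic-aware" trajectory is any call order in which every tool runs only after its prerequisites. The tool names, `TOOL_DEPS`, and `sample_trajectory` are all hypothetical.

```python
import random
from typing import Dict, List, Set

# Hypothetical toy tool set: each tool maps to the tools whose outputs it needs.
# These names are illustrative only and are not taken from SciAgentGym.
TOOL_DEPS: Dict[str, Set[str]] = {
    "load_molecule": set(),
    "compute_mass": {"load_molecule"},
    "compute_moles": {"compute_mass"},
    "load_solvent": set(),
    "compute_concentration": {"compute_moles", "load_solvent"},
}


def sample_trajectory(
    deps: Dict[str, Set[str]], target: str, rng: random.Random
) -> List[str]:
    """Return a tool-call order ending at `target` that respects dependencies."""
    # Collect the target and all of its transitive prerequisites.
    needed: Set[str] = set()
    stack = [target]
    while stack:
        tool = stack.pop()
        if tool not in needed:
            needed.add(tool)
            stack.extend(deps[tool])

    # Randomized topological ordering: at each step, pick any tool whose
    # prerequisites have all been called already.
    order: List[str] = []
    finished: Set[str] = set()
    while len(order) < len(needed):
        ready = sorted(
            t for t in needed if t not in finished and deps[t] <= finished
        )
        if not ready:
            raise ValueError("dependency cycle detected")
        choice = rng.choice(ready)
        order.append(choice)
        finished.add(choice)
    return order


if __name__ == "__main__":
    print(sample_trajectory(TOOL_DEPS, "compute_concentration", random.Random(0)))
```

Sampling with different seeds yields different but always valid orderings of the same sub-graph, which is one plausible way such a graph could be used to generate varied training trajectories.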
