SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents
Abstract:
Scientific reasoning fundamentally requires the integration of advanced toolkits to navigate specialized domain knowledge. However, existing benchmarks frequently neglect the capacity of agents to coordinate tools for such rigorous, complex workflows. To address this limitation, we introduce SciAgentGym, a scalable interactive environment equipped with a robust execution infrastructure. This platform features 1,780 domain-specific tools spanning four natural science disciplines. In parallel, we present SciAgentBench, a hierarchical evaluation suite designed to rigorously test agentic abilities, ranging from basic operations to extended, long-horizon workflows.
Our assessment reveals a significant bottleneck: even state-of-the-art models struggle with complex scientific tool-use, and their performance deteriorates markedly as interactions grow longer. To mitigate these issues, we propose SciForge, a novel data synthesis approach that models the tool action space as a dependency graph to generate logic-aware training trajectories. Fine-tuned on these trajectories, our model, SciAgent-8B, surpasses the considerably larger Qwen3-VL-235B-Instruct and also demonstrates effective cross-domain transfer of scientific tool-use capabilities. These findings highlight the promise of next-generation autonomous scientific agents.
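To illustrate what treating the tool action space as a dependency graph could look like, the sketch below samples a tool-call order in which every tool runs only after the tools it depends on. It is a minimal illustration under stated assumptions, not the SciForge implementation: the tool names, the `TOOL_DEPS` table, and the `sample_trajectory` helper are all hypothetical, and the sampling uses a plain randomized topological sort (Kahn's algorithm) as a stand-in for whatever procedure the paper actually uses.

```python
import random

# Hypothetical tools: each maps to the set of tools whose outputs it consumes.
# These names are illustrative and do not come from SciAgentGym.
TOOL_DEPS = {
    "parse_formula": set(),
    "molar_mass": {"parse_formula"},
    "moles_from_mass": {"molar_mass"},
    "lookup_constant": set(),
    "ideal_gas_volume": {"moles_from_mass", "lookup_constant"},
}


def sample_trajectory(deps, target, rng):
    """Return a tool-call order ending at `target` in which every tool
    appears only after all of its dependencies."""
    # Collect the target and everything it transitively depends on.
    needed, stack = set(), [target]
    while stack:
        tool = stack.pop()
        if tool not in needed:
            needed.add(tool)
            stack.extend(deps[tool])

    # Randomized topological order (Kahn's algorithm) over the needed tools.
    order, done = [], set()
    while len(order) < len(needed):
        # A tool is ready once all of its dependencies have been called.
        ready = sorted(t for t in needed - done if deps[t] <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        choice = rng.choice(ready)
        order.append(choice)
        done.add(choice)
    return order


trajectory = sample_trajectory(TOOL_DEPS, "ideal_gas_volume", random.Random(0))
print(trajectory)
```

Different seeds may interleave independent tools (here, `lookup_constant` and the formula chain) differently, but every sampled order respects the dependency edges, which is the sense in which such trajectories would be "logic-aware".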
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





