Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains
Abstract:
Tool-augmented multimodal agents frequently show significant improvements on benchmarks, leading many to conclude that these systems have mastered tool use. This conclusion may be hasty: a record of tool invocation alone does not establish whether the tool actually provided information essential to deriving the answer. To investigate this, we conducted a systematic analysis of two prominent "thinking with images" agents, Thyme and DeepEyesV2, evaluating their performance on real-world understanding, optical character recognition (OCR), chart interpretation, and mathematical reasoning.
We compared each agent against two baselines: a version of itself stripped of tool access, and a Pure-Text Reasoner trained on the same source data but without exposure to tool-calling trajectories. The results indicate that tool access provides little consistent aggregate benefit, does not reliably lower the cost of generated tokens, and accounts for only a very small set of problems solved exclusively through tools. Specifically, 93% of the problems solved by DeepEyesV2 via tools were also solved by at least one non-tool configuration, as were 96% of Thyme’s tool-solved cases.
Further ablation studies reveal that the complete tool-use loop does not consistently surpass the performance of either the tool-call format in isolation or the execution results returned by the tool. In the contexts examined, the agents appear to master the mechanics of tool-calling more effectively than they leverage tools for genuine capability expansion. Consequently, we argue that future evaluations must clearly distinguish between the mere availability of tools and the extent to which tools actually enable agents to solve problems they otherwise could not.
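The overlap statistic reported above (the share of tool-solved problems that at least one non-tool configuration also solves) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `tool_overlap`, the problem IDs, and the toy sets are hypothetical, and it assumes each configuration's results are available as a set of solved problem identifiers.

```python
from typing import Iterable, Set, Tuple


def tool_overlap(
    tool_solved: Iterable[str],
    non_tool_solved: Iterable[Set[str]],
) -> Tuple[float, Set[str]]:
    """Return (fraction of tool-solved problems also solved without tools,
    set of problems solved only with tools).

    Hypothetical helper: `tool_solved` holds IDs solved by the tool-enabled
    agent; `non_tool_solved` holds one set of solved IDs per non-tool
    configuration (e.g. tool-disabled agent, pure-text reasoner).
    """
    tool_set = set(tool_solved)
    if not tool_set:
        raise ValueError("tool_solved must contain at least one problem ID")

    # A problem counts as covered if ANY non-tool configuration solved it.
    covered = set().union(*non_tool_solved)

    also_solved = tool_set & covered
    tool_exclusive = tool_set - covered
    return len(also_solved) / len(tool_set), tool_exclusive


# Toy data (made up for illustration, not taken from the paper).
with_tools = {"q1", "q2", "q3", "q4"}
tools_disabled = {"q1", "q2"}
pure_text_reasoner = {"q2", "q3", "q5"}

fraction, exclusive = tool_overlap(with_tools, [tools_disabled, pure_text_reasoner])
print(f"also solved without tools: {fraction:.0%}; tool-exclusive: {sorted(exclusive)}")
```

Under this reading, a high overlap fraction means tools were invoked on problems the model family could already solve, so only the tool-exclusive set is evidence of capability expansion.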
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




