Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents
Title: Prioritizing Diversity Over Frequency: Reevaluating Tool Utilization in Visual Chain-of-Thought Agents
Abstract:
Visual agents call external visual tools within their visual chains of thought to incorporate fine-grained evidence. Existing work, however, has studied these tools mainly in the context of visual search, and their role in more complex visual reasoning remains largely unexplored. This study shifts the focus from basic visual search to more demanding tasks, such as 3D spatial reasoning and medical visual question answering (VQA), in which agents must integrate local evidence obtained through tools with the global context.
We identify a "tool-use collapse" phenomenon, in which models gradually stop calling tools even as their task accuracy improves. We also observe a distinct asymmetry: (i) removing tools entirely degrades performance, while (ii) explicitly encouraging tool use yields only slight gains, despite a significant increase in how often tools are invoked. Our analysis shows that both standard training and tool-use incentives reduce the diversity of rollout trajectories, which explains why more frequent tool use does not necessarily translate into better reasoning.
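To make the distinction between frequency and diversity concrete, the following is a minimal sketch of how the two quantities could be measured separately over a batch of rollouts. It is an illustration only: the abstract does not specify the diversity measure used in the study, so the Shannon entropy over distinct tool-call sequences, the function names, and the tool names `crop` and `depth` are all assumptions.

```python
import math
from collections import Counter


def tool_call_rate(rollouts):
    """Mean number of tool calls per rollout (a frequency measure)."""
    return sum(len(r) for r in rollouts) / len(rollouts)


def trajectory_entropy(rollouts):
    """Shannon entropy (in nats) of the empirical distribution over
    distinct tool-call sequences (one possible diversity measure)."""
    counts = Counter(tuple(r) for r in rollouts)
    n = len(rollouts)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Hypothetical rollouts: each is the ordered list of tools one trajectory invoked.
frequent_but_uniform = [["crop", "crop", "crop"]] * 4
sparse_but_varied = [["crop"], ["depth"], [], ["crop", "depth"]]

for name, batch in [("frequent", frequent_but_uniform), ("varied", sparse_but_varied)]:
    print(name, tool_call_rate(batch), trajectory_entropy(batch))
```

In this toy batch, the first set of rollouts calls tools three times as often as the second yet has zero sequence entropy, which is the kind of gap between frequency and diversity the analysis describes.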
Based on these insights, we introduce an entropy regularization term that encourages more diverse exploration during rollouts. This approach achieves superior performance, even though tool-use frequency continues to decline. We observe similar dynamics in medical VQA, indicating that tool-use collapse extends beyond 3D spatial reasoning. Ultimately, our results suggest viewing tools as scaffolding during training: promoting broader exploration across both language generation and visual tool invocation improves reasoning, even as tool use itself collapses.
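As a rough illustration of what an entropy regularization term does, the sketch below uses the standard entropy bonus from policy-gradient reinforcement learning for a single decision step. The abstract does not give the exact form of the term, so this formulation, the coefficient `beta`, and all function names are assumptions rather than the authors' method. The action set is meant to stand for any next-step choice, whether a language token or a tool call.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_regularized_loss(logits, action, advantage, beta=0.01):
    """Policy-gradient loss for one step minus an entropy bonus.

    Subtracting beta * H(pi) means that minimizing the loss rewards
    policies that keep probability mass spread over several actions.
    """
    probs = softmax(logits)
    pg_loss = -advantage * math.log(probs[action])
    return pg_loss - beta * entropy(probs)


# With zero advantage, only the entropy bonus differs between the two policies.
uniform_loss = entropy_regularized_loss([0.0, 0.0, 0.0, 0.0], 0, 0.0, beta=0.1)
peaked_loss = entropy_regularized_loss([5.0, 0.0, 0.0, 0.0], 0, 0.0, beta=0.1)
print(uniform_loss < peaked_loss)
```

Under this sketch the bonus favors a more diverse policy without rewarding tool calls directly, which is consistent with performance improving while tool-use frequency still declines.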
Project page: https://scaffolded-exploration.github.io
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




