ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents
Abstract: Although tool-augmented vision-language agents can leverage external perceptual evidence via techniques such as OCR, detection, and segmentation, executing every suggested tool call is often expensive and redundant. This paper investigates the pre-call control challenge: specifically, whether a perceptual tool call proposed by a ReAct-style VLM agent should be executed or skipped before its results are incorporated into the context. Our evaluation across five benchmarks reveals that baseline agents suffer from poor local selectivity, with helpful and harmful calls occurring at comparable rates (11.8% versus 9.9%), and the majority of calls failing to alter the immediate forced-answer prediction. To address this, we propose ToolGate, a lightweight external controller that determines execute or skip decisions based on trajectory text and basic structural features. Utilizing two Qwen3-VL backbones, ToolGate cuts token costs to between 64% and 69% of the unrestricted ReAct baseline, while maintaining average accuracy in cross-domain scenarios. Furthermore, when trained on matched-domain trajectories with Qwen3-VL-30B, it boosts average accuracy by an additional 1.65 points. These findings demonstrate that tool-augmented VLM agents gain significant advantages not just from enhanced perceptual tools, but also from explicit mechanisms to control when tool outputs justify their cost.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
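
The pre-call control step described in the abstract can be illustrated with a minimal sketch: a proposed tool call is scored before execution, and its output enters the context only if the gate says "execute". Everything named here is an assumption made for illustration. The abstract states only that the controller uses trajectory text and basic structural features, so the specific features (`step_index`, `repeats_of_tool`, `trajectory_chars`), the `toy_score` heuristic, and the threshold are hypothetical stand-ins, not the paper's trained controller.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ProposedCall:
    tool: str       # e.g. "ocr", "detection", "segmentation"
    argument: str   # whatever the agent passes to the tool


@dataclass
class GateFeatures:
    # Hypothetical structural features; the paper's actual feature set is not given here.
    step_index: int         # number of trajectory entries before this call
    repeats_of_tool: int    # earlier executed calls to the same tool
    trajectory_chars: int   # total length of the trajectory text so far


def extract_features(trajectory: List[str], call: ProposedCall,
                     executed: List[ProposedCall]) -> GateFeatures:
    return GateFeatures(
        step_index=len(trajectory),
        repeats_of_tool=sum(1 for c in executed if c.tool == call.tool),
        trajectory_chars=sum(len(t) for t in trajectory),
    )


def toy_score(f: GateFeatures) -> float:
    # Placeholder heuristic standing in for a learned controller:
    # each repeated call to the same tool is assumed to be less useful.
    return 1.0 / (1.0 + f.repeats_of_tool)


def gated_call(trajectory: List[str], call: ProposedCall,
               executed: List[ProposedCall],
               run_tool: Callable[[ProposedCall], str],
               score_fn: Callable[[GateFeatures], float] = toy_score,
               threshold: float = 0.6) -> Optional[str]:
    """Execute the call only if the gate score reaches the threshold."""
    features = extract_features(trajectory, call, executed)
    if score_fn(features) < threshold:
        return None  # skip: the tool is never run and nothing enters the context
    output = run_tool(call)
    executed.append(call)
    trajectory.append(output)
    return output
```

In this sketch the gate sits outside the agent, so a skipped call costs no tool tokens and leaves the trajectory unchanged; swapping `toy_score` for a trained classifier would not change the surrounding loop.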



