arXiv

SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

June 2, 2026 · Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, Jonathan Tremblay

Title: SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

Abstract:

Vision Language Models (VLMs) excel at qualitative visual comprehension but often falter on the metrically precise spatial reasoning that embodied tasks demand. An agentic framework offers a potential solution by letting VLMs call diverse tools, such as depth estimators, segmentation models, and pose estimators, to strengthen these capabilities. Realizing this potential without rigid, handcrafted prompting or fixed, predefined tool pipelines remains a significant hurdle, because such constraints limit the VLM's capacity to discover optimal tool-use strategies.

Reinforcement Learning (RL) could bridge this gap, but previous efforts have been confined to single-tool reasoning because the search space in multi-tool scenarios is so large. To address this, we present Double Interactive Reinforcement Learning (DIRL), a two-phase training framework in which VLMs learn to coordinate multiple tools through interactive exploration and feedback.

DIRL operates in two stages. In the teaching phase, demonstrations from a single-tool specialist trained via interactive RL are combined with interaction traces generated by a frontier model that uses all available tools. In the exploration phase, the model further refines its multi-tool coordination through continued RL.
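The data flow of the two phases can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Trace` record, the function names, the toy questions, and the binary reward are all hypothetical, and both the supervised update and the RL policy update are deliberately omitted. It only shows how single-tool and multi-tool traces might be pooled for teaching, and how rollouts might be scored during exploration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trace:
    """One interaction trace: question, tool calls made, final answer."""
    question: str
    tool_calls: List[str]
    answer: str
    source: str  # "specialist" or "frontier"; labels are illustrative


def build_teaching_set(specialist: List[Trace], frontier: List[Trace],
                       seed: int = 0) -> List[Trace]:
    """Teaching-phase data: pool single-tool and multi-tool traces, then shuffle."""
    pooled = list(specialist) + list(frontier)
    random.Random(seed).shuffle(pooled)
    return pooled


def exploration_rewards(policy: Callable[[str], Trace],
                        questions: List[str],
                        reward_fn: Callable[[Trace], float]) -> List[float]:
    """Exploration-phase stand-in: roll out the policy and score each rollout.

    A real RL loop would use these rewards to update the policy; that update
    is omitted here.
    """
    return [reward_fn(policy(q)) for q in questions]


# Toy data, invented purely for illustration.
specialist_demos = [
    Trace("How far away is the mug?", ["depth"], "0.8 m", "specialist")
]
frontier_traces = [
    Trace("Is the mug left of the bowl?", ["segmentation", "depth"], "yes", "frontier")
]
teaching_set = build_teaching_set(specialist_demos, frontier_traces)


def toy_policy(question: str) -> Trace:
    # Stand-in for the model after the teaching phase: always calls two tools.
    return Trace(question, ["segmentation", "depth"], "yes", "policy")


def toy_reward(trace: Trace) -> float:
    # Hypothetical reward: 1.0 for the expected answer, else 0.0.
    return 1.0 if trace.answer == "yes" else 0.0


rewards = exploration_rewards(toy_policy, ["Is the mug left of the bowl?"], toy_reward)
```

In this sketch the teaching set mixes both trace sources before any training step, so the model would see single-tool and multi-tool behavior together; how the paper weights or orders the two sources is not stated in the abstract.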

The resulting model, SpaceTools, has tool-augmented spatial reasoning ability and achieves state-of-the-art performance on spatial understanding benchmarks, including RoboSpatial-Home, BLINK, and BOP-ASK. It also demonstrates robust real-world manipulation when paired with a 7-DOF robot acting as a physical tool. DIRL yields significant gains over baseline methods, outperforming vanilla Supervised Fine-Tuning (SFT) by 12% and standard RL by 16% on the RoboSpatial benchmark.

Project page: https://spacetools.github.io/.

Source: arXiv