Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models
Title: Active Exploration Resembling Pigeon Behavior: Enhancing Spatial Reasoning through Agentic Vision-Language Models
Abstract:
Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Current methods often treat VLMs as passive observers, which limits their usefulness in real-world scenarios. Furthermore, traditional reinforcement learning techniques rely on sparse rewards, which restricts their effectiveness on complex reasoning problems. Inspired by how pigeons build and use cognitive maps for navigation, we introduce an agentic framework for spatial reasoning.
Our approach begins with the development of a \emph{dynamic cognitive map}, which encodes scene layouts through the positions and orientations of objects. This component acts as a continuous memory system for integrating new visual inputs. Additionally, we present \emph{Spatial Assertion Codes (SAC)}, a set of Python expressions that algorithmically define spatial relationships. By working in tandem with the dynamic cognitive map, SAC facilitates the validation of intermediate reasoning steps, thereby generating dense reward signals to guide learning. The model undergoes optimization through a combination of supervised learning and reinforcement finetuning.
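To make the interplay between the two components concrete, the following is a minimal sketch, not the authors' implementation. It assumes a 2D ground-plane map keyed by object name and assertion codes written as plain Python expressions; the names `Pose`, `CognitiveMap`, `left_of`, `in_front_of`, and `dense_reward` are hypothetical and do not come from the paper or its repository.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Pose:
    x: float        # position on the ground plane
    y: float
    heading: float  # orientation in degrees, 0 = +x axis, counter-clockwise


@dataclass
class CognitiveMap:
    """Hypothetical persistent memory: object name -> latest pose."""
    objects: dict = field(default_factory=dict)

    def update(self, name, x, y, heading=0.0):
        # A new observation overwrites the stored pose for that object.
        self.objects[name] = Pose(x, y, heading)

    def _local(self, target, anchor):
        # Coordinates of `target` in `anchor`'s frame: (forward, leftward).
        a, t = self.objects[anchor], self.objects[target]
        dx, dy = t.x - a.x, t.y - a.y
        rad = math.radians(a.heading)
        forward = dx * math.cos(rad) + dy * math.sin(rad)
        leftward = -dx * math.sin(rad) + dy * math.cos(rad)
        return forward, leftward

    def left_of(self, target, anchor):
        return self._local(target, anchor)[1] > 0

    def in_front_of(self, target, anchor):
        return self._local(target, anchor)[0] > 0


def dense_reward(cmap, assertions):
    """Fraction of assertion strings that evaluate to True on the map."""
    if not assertions:
        return 0.0
    scope = {"left_of": cmap.left_of, "in_front_of": cmap.in_front_of}
    passed = 0
    for code in assertions:
        try:
            # Builtins are disabled so only the spatial predicates are callable.
            passed += bool(eval(code, {"__builtins__": {}}, scope))
        except Exception:
            pass  # malformed or unknown-object assertions earn nothing
    return passed / len(assertions)


cmap = CognitiveMap()
cmap.update("viewer", 0.0, 0.0, heading=90.0)  # facing +y
cmap.update("chair", -1.0, 2.0)
cmap.update("lamp", 1.0, -1.0)

steps = [
    "left_of('chair', 'viewer')",
    "in_front_of('chair', 'viewer')",
    "in_front_of('lamp', 'viewer')",  # false: the lamp is behind the viewer
]
reward = dense_reward(cmap, steps)
```

In this sketch each intermediate reasoning step is scored on its own, so a partially correct chain still earns partial credit, which is the sense in which the reward is dense rather than sparse. Evaluating model-written expressions with `eval` is used here only for brevity; a real system would more likely parse assertions into a restricted grammar.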
Evaluations on the MindCube benchmark reveal that our method achieves state-of-the-art results, attaining an overall accuracy of \emph{80.5\%}. Notably, on the difficult \textsc{Rotation} subset, it surpasses the leading existing approach by \emph{29.5} accuracy points, marking a relative improvement of \emph{53.2\%}. The associated code and datasets have been made publicly available at https://github.com/dw-dengwei/active-spatial-reasoning.git.
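As a rough consistency check on the \textsc{Rotation} figures, and assuming the relative improvement is measured against the prior best accuracy on that subset (written here as $a_{\mathrm{prev}}$, a symbol introduced only for this check), the two reported numbers imply the following approximate values. Neither value is stated in the abstract itself.

```latex
\frac{29.5}{a_{\mathrm{prev}}} = 0.532
\;\Rightarrow\;
a_{\mathrm{prev}} \approx 55.5\%,
\qquad
a_{\mathrm{ours}} \approx 55.5\% + 29.5\% = 85.0\%
```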
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





