Limits of Spatial Imagery Reasoning in Frontier LLM Models
The Boundaries of Spatial Visualization in Advanced LLMs
Abstract
While Large Language Models (LLMs) have shown remarkable reasoning abilities, they continue to falter on spatial tasks that demand mental simulation, such as mental rotation. This study explores whether an external "Imagery Module" (a tool that renders and rotates 3D models) can serve as a "cognitive prosthetic" to overcome these limitations. We employed a dual-module framework in which a reasoning component (a multimodal LLM, or MLLM) collaborates with an imagery component on tasks involving the rotation of 3D objects. Despite this setup, performance fell short of expectations, peaking at an accuracy of only 62.5%. Subsequent analysis indicates that the system still fails even when the burden of maintaining and manipulating a complete 3D state is delegated to the external module. This outcome shows that contemporary frontier models lack the fundamental visual-spatial primitives needed to interact effectively with imagery tools. Specifically, these models are deficient in: (1) low-level sensitivity for extracting spatial cues, including (a) depth perception, (b) motion detection, and (c) short-term dynamic prediction; and (2) the ability to reason contemplatively over visual data, which involves dynamically adjusting visual attention and balancing imagery with symbolic and associative information.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





