Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models
Abstract:
Vision-Language Models (VLMs) perform well across many applications, but they often struggle with spatial reasoning when essential details are hidden from view. Such problems call for "imaginative perception": the ability to visualize a scene from viewpoints that are not currently visible, navigate through obstructed areas, or synthesize fragmented observations into a unified spatial understanding. We present Imaginative Perception Tokens (IPT), intermediate perceptual representations that let a VLM externalize what it could perceive under different spatial arrangements while remaining strictly consistent with the actual input.
To evaluate this capability, we define three tasks: Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC). We build corresponding datasets of roughly 20,000 examples with ground-truth imaginations and answers, along with evaluation benchmarks. With the BAGEL unified VLM as the backbone, IPT supervision consistently improves spatial reasoning. Notably, it often outperforms textual chain-of-thought training, even though it requires no image generation at inference. Specifically, IPT increases accuracy by 3.4% on MVC and achieves performance on PT that is competitive with leading closed-source models.
Our analysis shows that combining IPT with label-only supervision yields further improvements. In contrast, relying on textual chain-of-thought training can significantly hurt performance, which suggests a modality mismatch when spatial computation is forced through language. Overall, IPT provides a robust supervision signal for reasoning about unobserved spatial structure, improving generalization while offering interpretable intermediate representations.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



