arXiv

Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models

June 3, 2026 · S Divakar Bhat, Toshihiko Yamasaki

Title: Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models

Abstract:

Spatial reasoning is a cornerstone of robotics, autonomous systems, and embodied AI, yet contemporary vision-language models (VLMs) still struggle to answer metric distance questions reliably. Consistent predictions across camera viewpoints are often taken as a sign of true geometric grounding. Our investigation shows otherwise: state-of-the-art VLMs frequently give answers that stay the same across viewpoints even when those answers are wrong. This behavior points to a weak link between the model’s decisions and the visual evidence actually available from each viewpoint.

To address this, we present ViewDiag, a multi-view evaluation framework derived from the Hypersim, ScanNet, and KITTI360 datasets. The protocol covers 176 object-pair sequences across 80 distinct scenes, with 2 to 10 views per sequence. We assess models along three dimensions: metric precision, distributional concentration, and a latent feature probe that detects internal collapse and so separates failures in decision-making from failures in representation.
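
To make the distinction between accuracy and cross-view stability concrete, the sketch below scores a single object-pair sequence on both. It is only an illustration under stated assumptions: the function names, the use of mean relative error and coefficient of variation, and the two thresholds are hypothetical choices made here, not the metric definitions used by ViewDiag.

```python
from statistics import mean, pstdev


def sequence_diagnostics(predictions_m, true_distance_m):
    """Score one object-pair sequence seen from several views.

    predictions_m: predicted distance in meters, one value per view.
    true_distance_m: ground-truth distance in meters (same for every view).
    Returns (mean relative error, coefficient of variation across views).
    """
    if len(predictions_m) < 2:
        raise ValueError("need at least two views to measure consistency")
    if true_distance_m <= 0:
        raise ValueError("ground-truth distance must be positive")

    # Accuracy proxy (assumed): mean absolute relative error over views.
    rel_error = mean(
        abs(p - true_distance_m) / true_distance_m for p in predictions_m
    )

    # Stability proxy (assumed): spread of predictions relative to their mean.
    # A value of 0 means the model gave the same answer from every view.
    avg = mean(predictions_m)
    spread = pstdev(predictions_m) / avg if avg else float("inf")
    return rel_error, spread


def is_consistent_yet_wrong(rel_error, spread, max_spread=0.05, min_error=0.5):
    """Flag sequences that are stable across views but far from the truth.

    Both thresholds are illustrative defaults, not values from the paper.
    """
    return spread <= max_spread and rel_error >= min_error


# A model that answers "2 m" from every view when the truth is 4 m is
# perfectly consistent and still badly wrong.
err, spread = sequence_diagnostics([2.0, 2.0, 2.0, 2.0], 4.0)
flagged = is_consistent_yet_wrong(err, spread)
```

Reporting the two numbers separately, rather than folding them into one score, keeps a stable but wrong sequence from being mistaken for a well-grounded one.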

Our analysis of various models uncovers a recurring trend: high stability in predictions coincides with significant inaccuracies. These models tend to cluster in a specific operational regime defined by robust consistency yet poor accuracy. These findings undermine the widespread practice of using cross-view consistency as a surrogate for genuine geometric comprehension. Instead, our work suggests that such stable predictions often stem from prior-driven collapse rather than reasoning that is sensitive to visual evidence. ViewDiag offers both a controlled benchmark and a diagnostic toolkit to assess spatial VLMs on metrics that go beyond simple accuracy. The associated code and data are available at https://github.com/SDivakarBhat/Consistent_Yet_Wrong.git.
