arXiv

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

June 2, 2026 · Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng, Wang Yang, Sudong Cai, Shuyuan Zheng, Akiko Aizawa, Sadao Kurohashi

Title: Uncovering the Mechanisms Behind Spatial Lexical Bias in Multimodal Large Language Models’ Spatial Reasoning

Abstract:

Multimodal large language models (MLLMs) frequently struggle with spatial multiple-choice questions, a performance gap traditionally blamed on inadequate visual attention. This study, however, pinpoints a distinct failure mechanism: spatial lexical bias. We demonstrate that simply introducing a spatial relation term into the answer choices can sway the model’s decision-making process, significantly increasing the likelihood that the newly inserted option will be chosen. Through experiments involving nine open-weight MLLMs, we establish that this bias is a pervasive issue. Specifically, we find that models may correctly resolve a binary spatial query but systematically choose an incorrect third option when it is introduced into the selection pool.

We refer to instances in which a model answers correctly in the binary setting but fails in the ternary setting as diagnostic cases. Applying mechanistic interpretability techniques to these cases, we find that the root cause lies primarily in the language component rather than in visual processing. Analyses of visual attention and residual-stream probes indicate that the correct spatial relationship is still internally represented even during these failures. Furthermore, irrelevant-option controls, activation patching, and sparse component interventions trace the source of the bias to specific channels and neurons within the LLM. Leveraging these insights, we apply a lightweight Direct Preference Optimization (DPO) update using only a small synthetic dataset of single-object pairs. This intervention reduces the bias, improving four-way robust accuracy by up to 100 points on synthetic data, and by 68.0, 32.6, and 20.1 points on the broader WhatsUp, SpatialMQA-Direct, and VSR evaluation datasets, respectively.
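As an illustration only, the notion of a diagnostic case described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code: the `Trial` record, its field names, and the helper functions are hypothetical, and the abstract does not specify how many prompts or option orderings are aggregated before a failure counts as systematic.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Trial:
    """One question asked twice: with two options, then with a third added.

    All field names are hypothetical and chosen for illustration.
    """
    item_id: str
    gold: str            # correct spatial relation, e.g. "left"
    binary_choice: str   # model answer in the two-option setting
    ternary_choice: str  # model answer after the third option is inserted
    inserted: str        # the newly inserted (incorrect) option


def is_diagnostic(trial: Trial) -> bool:
    """True if the model is right in the binary setting but switches to the
    incorrect inserted option in the ternary setting."""
    return (
        trial.inserted != trial.gold
        and trial.binary_choice == trial.gold
        and trial.ternary_choice == trial.inserted
    )


def diagnostic_cases(trials: Iterable[Trial]) -> List[Trial]:
    """Keep only the trials that satisfy the diagnostic-case condition."""
    return [t for t in trials if is_diagnostic(t)]


def inserted_option_rate(trials: List[Trial]) -> float:
    """Fraction of trials in which the model picked the inserted option."""
    if not trials:
        return 0.0
    return sum(t.ternary_choice == t.inserted for t in trials) / len(trials)
```

Under these assumptions, `diagnostic_cases` isolates the items on which a mechanistic analysis would focus, while `inserted_option_rate` gives a coarse measure of how often the inserted option is selected.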
