Title: The Cartesian Shortcut: Re-evaluating Visual Reasoning in Polar Coordinate Space
Abstract:
As Multimodal Large Language Models (MLLMs) rapidly saturate standard visual reasoning benchmarks, a critical question arises: do these impressive scores truly indicate robust visual comprehension? We identify a pervasive vulnerability that we call the "Cartesian Shortcut." Current visual reasoning benchmarks predominantly rely on orthogonal, grid-based layouts that are easily discretized into explicit textual coordinates. Models systematically exploit this property, relying heavily on text-based deductive reasoning to solve visual problems.
To dismantle this shortcut, we introduce Polaris-Bench. This new benchmark reformulates 53 visual reasoning tasks in Polar coordinate space and provides paired Cartesian counterparts as references. Crucially, the reformulation preserves the logical constraints and task semantics of each task while breaking the orthogonal prior that models typically exploit.
Comprehensive evaluations across 14 state-of-the-art MLLMs reveal a stark contrast in performance: frontier models that achieve scores between 70% and 83% on Cartesian layouts collapse to a range of 31%–39% on their Polar equivalents. This degradation persists even when the tasks are logically equivalent. Furthermore, the reasoning improvements observed on Cartesian layouts are significantly diminished when applied to Polar equivalents. These findings expose a critical deficiency in current MLLMs: a lack of topology-invariant visual reasoning.
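As a rough illustration of what moving a grid task into Polar coordinate space can look like, the sketch below maps a logical grid cell to a ring-and-sector cell. This is only an assumed mapping (rows become concentric rings, columns become angular sectors); the abstract does not specify how Polaris-Bench constructs its layouts, and the names `grid_to_polar`, `polar_to_xy`, `r_inner`, and `ring_width` are hypothetical.

```python
import math


def grid_to_polar(row, col, n_rows, n_cols, r_inner=1.0, ring_width=1.0):
    """Map a logical grid cell to the center of a polar cell.

    Assumption (not from the paper): rows become concentric rings and
    columns become equal angular sectors. Returns (radius, theta), with
    theta in radians in [0, 2*pi).
    """
    if not (0 <= row < n_rows and 0 <= col < n_cols):
        raise ValueError("cell is outside the grid")
    radius = r_inner + (row + 0.5) * ring_width    # middle of the ring
    theta = 2.0 * math.pi * (col + 0.5) / n_cols   # middle of the sector
    return radius, theta


def polar_to_xy(radius, theta):
    """Convert a polar position to image-plane (x, y) for rendering."""
    return radius * math.cos(theta), radius * math.sin(theta)


# The logical index (row, col) is unchanged, so any rule stated over
# indices applies to both layouts; only the rendered position differs.
radius, theta = grid_to_polar(row=0, col=0, n_rows=4, n_cols=8)
x, y = polar_to_xy(radius, theta)
```

Under this assumed mapping the cells are no longer axis-aligned, so reading a layout off as explicit textual row/column coordinates is less direct than on an orthogonal grid.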
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





