BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs
Abstract:
Although Vision-Language Models (VLMs) exhibit impressive zero-shot recognition abilities across a wide array of multimodal tasks, it remains unclear whether these systems truly grasp geometric structure or simply rely on RGB textures and contextual priors as statistical shortcuts. Current evaluation methods do not adequately isolate this mechanism: they often mix semantic reasoning with texture mapping and depend on inaccurate annotations that may unintentionally expose environmental hints. To bridge this gap, we present BareBones, a zero-shot benchmark specifically crafted to test pure geometric shape understanding. We curate pixel-level silhouettes representing geometrically distinct classes from six datasets: five established segmentation sources (ImageNet-S, DIS5K, ThinObject5K, PASCAL VOC, and CUB-200) alongside our new flagship collection, WTP-Bench, which creates a noise-free geometric taxonomy. WTP-Bench serves as an extreme, fine-grained visual challenge, requiring models to discern inter-class geometric concepts based solely on boundary contours. Our assessment of 26 leading proprietary and open-weight VLMs (including GPT-4.1, Gemini, Claude Sonnet 4.5, and LLaVA) shows a consistent and significant drop in performance when RGB information is removed, a trend we label the Texture Bias Cliff. By highlighting universal structural blind spots, BareBones provides a stringent standard for assessing genuine geometric grounding.
Project Page: https://eternal-f1ame.github.io/WTP-Bench/
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





