arXiv

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

June 2, 2026 · Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng, Qiyao Sun, Xuanyu Ji, Qingyong Hu

Abstract

Multimodal Large Language Models (MLLMs) frequently show a disconnect between understanding and execution: they may accurately perceive visual elements and articulate the governing pattern, yet still select the incorrect answer in Abstract Visual Reasoning (AVR) tasks. Current AVR benchmarks cannot expose this disconnect because they collapse perception, rule induction, and answer selection into a single right-or-wrong score. To address this, we present StemBind, a diagnostic benchmark designed to isolate specific failure points. StemBind poses three aligned queries about a single visual stem: Perception (identifying image contents), Rule (determining the governing pattern), and Full (selecting the completing option). Because all three queries draw on identical evidence, final-answer errors can be attributed to specific sub-steps.

The benchmark comprises 2,298 carefully curated, knowledge-light stems covering nine auditable visual operations. This results in 19,533 total tasks across the Perception, Rule, and Full categories. Each complete item is annotated according to Sternberg’s four-stage reasoning framework: S1 Encode, S2 Infer, S3 Map, and S4 Apply. Our evaluation of 24 leading MLLM configurations revealed four critical insights:

The R-F Chasm: On 22 of the 24 models tested, accuracy on rule identification surpassed accuracy on the full item, indicating that most errors occur after the rule has been correctly identified.
Persistent Binding Gap: Even in instances where both perception (P) and rule induction (R) were correct for a given stem, models still answered the final question (F) incorrectly 51.2% of the time.
The Bottleneck at S3: Through process diagnostics and Stage-wise Stimulus Augmentation, we pinpointed the rule-to-instance mapping (Stage S3) as the primary source of failure.
Inefficacy of Scale and Reflection: Increasing model size or employing explicit "thinking" modes did not reliably bridge the accuracy gap. In fact, the thinking mode resulted in decreased accuracy for both rule identification and final item selection.
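The first two findings are simple statistics over per-stem correctness, and a short sketch may make the definitions concrete. The code below is illustrative only, not the authors' evaluation code: the `StemResult` record, the function names, and the toy data are all hypothetical, and it assumes exactly one Perception, one Rule, and one Full outcome per stem, which may not match how the benchmark actually groups its tasks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StemResult:
    """Hypothetical record: one model's correctness on one stem's three queries."""
    perception_ok: bool  # P: image contents identified correctly
    rule_ok: bool        # R: governing pattern identified correctly
    full_ok: bool        # F: completing option selected correctly


def accuracy(flags):
    """Fraction of True values; NaN for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def rule_full_gap(results):
    """Rule accuracy minus Full accuracy; positive means R exceeds F."""
    return accuracy(r.rule_ok for r in results) - accuracy(r.full_ok for r in results)


def binding_gap(results):
    """Share of stems with P and R both correct where F is still wrong."""
    eligible = [r for r in results if r.perception_ok and r.rule_ok]
    if not eligible:
        return float("nan")
    return sum(not r.full_ok for r in eligible) / len(eligible)


# Toy data, invented for illustration (not results from the paper).
results = [
    StemResult(True, True, True),
    StemResult(True, True, False),
    StemResult(True, False, False),
    StemResult(False, True, False),
]

print(rule_full_gap(results))  # 0.75 - 0.25 = 0.5
print(binding_gap(results))    # 1 of 2 eligible stems fails F: 0.5
```

Under these assumptions, the reported 51.2% corresponds to `binding_gap` and the R-F comparison to the sign of `rule_full_gap`. Conditioning on both P and R being correct is what separates a binding failure from an upstream perception or rule-induction error.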

StemBind shifts AVR evaluation from ranking final answers to diagnosing where abstract visual reasoning breaks down. By identifying rule-to-instance binding as a specific weakness, it offers a concrete target for improving vision-grounded reasoning in future models.
