Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
Title: Enhancing Complex Spatial Reasoning in Multimodal Large Language Models via Wide-Baseline Matching
Abstract:
Wide-baseline matching (WBM) serves as a rigorous benchmark for spatial reasoning in multimodal large language models (MLLMs) operating within physical spaces, as it demands the synthesis of geometric comprehension, perspective shifts, detailed perception, and occlusion logic. Despite its importance, existing MLLMs lack structured frameworks for both evaluation and training in this domain. To address this, we present ReasonMatch-Bench, a comprehensive benchmark organized by viewpoint displacement and matching granularity across object-centric, indoor, and outdoor contexts. Our analysis reveals that contemporary MLLMs still struggle to establish fine-grained wide-baseline correspondences: on a challenging subset of 90 samples, human annotators achieved an F1 score of 84.0, whereas the top-performing current baseline managed only 37.2.
To close this performance gap, we developed a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from extensive video-3D datasets, such as SfM reconstructions and RGB-D videos, providing diverse and verifiable supervisory signals. Additionally, we introduce Dynamic Correspondence Reinforcement Learning (DCRL), a method that relies on verifiable rewards rather than explicit Chain-of-Thought (CoT) supervision, combining Image-Level Viewpoint Progression with a Point-Level Correspondence Curriculum. Extensive empirical results demonstrate that DCRL significantly boosts performance on ReasonMatch-Bench and generalizes effectively to other spatial benchmarks, while preserving general visual understanding capabilities and yielding modest improvements on several standard metrics.
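As a rough illustration of how wide-baseline view pairs might be isolated from posed imagery such as an SfM reconstruction, the sketch below keeps image pairs whose relative camera rotation falls inside an angular band. This is only a minimal sketch under stated assumptions, not the pipeline described above: the function names, the 45 to 120 degree thresholds, and the toy poses are all hypothetical, and a practical pipeline would presumably also check translation baseline and co-visibility of scene points.

```python
import math
from itertools import combinations

def rotation_y(deg):
    """Rotation matrix about the y-axis (used here only to build toy camera poses)."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def relative_rotation_deg(R_a, R_b):
    """Angle in degrees of the relative rotation R_a^T R_b."""
    # trace(R_a^T R_b) equals the sum of elementwise products of R_a and R_b.
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def select_wide_baseline_pairs(poses, min_deg=45.0, max_deg=120.0):
    """Return (i, j, angle) for pairs whose viewpoint change lies in the band.

    The thresholds are hypothetical placeholders, not values from the paper.
    """
    pairs = []
    for (i, R_i), (j, R_j) in combinations(enumerate(poses), 2):
        angle = relative_rotation_deg(R_i, R_j)
        if min_deg <= angle <= max_deg:
            pairs.append((i, j, round(angle, 1)))
    return pairs

# Toy example: four cameras rotated about a common vertical axis.
poses = [rotation_y(d) for d in (0, 10, 60, 150)]
pairs = select_wide_baseline_pairs(poses)
print(pairs)
```

In this toy setup, pairs with too little viewpoint change (such as 0 and 10 degrees) or too much (such as 0 and 150 degrees) are discarded. Binning the retained pairs by angle would be one plausible way to stage a progression from easier to harder viewpoint changes, though the source does not specify how its curriculum is built.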
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC





