arXiv

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Title: Enhancing Complex Spatial Reasoning in Multimodal Large Language Models via Wide-Baseline Matching

Abstract:

Wide-baseline matching (WBM) serves as a rigorous benchmark for spatial reasoning in multimodal large language models (MLLMs) operating within physical spaces, as it demands the synthesis of geometric comprehension, perspective shifts, detailed perception, and occlusion reasoning. Despite its importance, structured frameworks for evaluating and training MLLMs on this task are lacking. To address this, we present ReasonMatch-Bench, a comprehensive benchmark categorized by viewpoint displacement and matching granularity across object-centric, indoor, and outdoor contexts. Our analysis reveals that contemporary MLLMs still struggle to establish fine-grained wide-baseline correspondences: on a challenging subset of 90 samples, human annotators achieved an F1 score of 84.0, whereas the top-performing current baseline managed only 37.2.
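To make the reported F1 figures concrete, the following is a minimal sketch of how an F1 score over point correspondences could be computed. The abstract does not specify the benchmark's scoring protocol, so everything here is an assumption for illustration: the function name `correspondence_f1`, the pixel tolerance `pixel_tol`, and the convention that `None` marks a query point with no visible match (for example, an occluded point).

```python
import math


def correspondence_f1(predicted, ground_truth, pixel_tol=10.0):
    """Hypothetical F1 over point correspondences (not the benchmark's official metric).

    predicted, ground_truth: dict mapping a query-point id to an (x, y)
    location in the target image, or None when the point has no match.
    """
    tp = fp = fn = 0
    for point_id, gt_xy in ground_truth.items():
        pred_xy = predicted.get(point_id)
        if gt_xy is None:
            # No true match exists; any predicted location is a false positive.
            if pred_xy is not None:
                fp += 1
        elif pred_xy is None:
            # A true match exists but the model declined to predict one.
            fn += 1
        elif math.dist(pred_xy, gt_xy) <= pixel_tol:
            tp += 1
        else:
            # A wrong location is both a spurious match and a missed one.
            fp += 1
            fn += 1

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

Multiplying the returned value by 100 gives a score on the same 0 to 100 scale as the numbers quoted above.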

To close this performance gap, we develop a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D datasets, such as SfM reconstructions and RGB-D videos, providing diverse and verifiable supervisory signals. Additionally, we introduce Dynamic Correspondence Reinforcement Learning (DCRL), a method that leverages verifiable rewards, eliminating the need for explicit Chain-of-Thought (CoT) supervision, through a combination of Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum. Extensive empirical results demonstrate that DCRL significantly boosts performance on ReasonMatch-Bench and generalizes effectively to other spatial benchmarks, while preserving general visual understanding capabilities and yielding modest improvements on several standard metrics.
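As an illustration of the view-pair extraction step, the sketch below selects wide-baseline pairs from known camera rotations by thresholding the relative rotation angle. This is not the authors' pipeline, whose criteria the abstract does not describe. The function names, the 45-degree threshold `min_angle_deg`, and the use of rotation alone are assumptions; a practical pipeline would likely also check translation and scene overlap.

```python
import math
from itertools import combinations


def rotation_angle_deg(R_a, R_b):
    """Angle (degrees) of the relative rotation between two 3x3 rotation matrices."""
    # trace(R_a^T R_b) equals the sum of element-wise products of R_a and R_b.
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for numerical safety
    return math.degrees(math.acos(cos_theta))


def rot_y(deg):
    """Helper for the demo: rotation about the vertical (y) axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def select_wide_baseline_pairs(rotations, min_angle_deg=45.0):
    """Return (i, j, angle) for every frame pair whose viewpoint change is large.

    min_angle_deg is an assumed threshold, not a value taken from the paper.
    """
    pairs = []
    for (i, R_i), (j, R_j) in combinations(enumerate(rotations), 2):
        angle = rotation_angle_deg(R_i, R_j)
        if angle >= min_angle_deg:
            pairs.append((i, j, angle))
    return pairs


# Three cameras orbiting a scene at 0, 30, and 90 degrees of yaw.
demo_pairs = select_wide_baseline_pairs([rot_y(0), rot_y(30), rot_y(90)])
```

In this demo, the pair of cameras at 0 and 30 degrees is rejected as too narrow, while the two pairs involving the 90-degree camera are kept. Because the poses come from the 3D reconstruction, correspondences for the selected pairs can in principle be checked geometrically, which is what makes the supervision verifiable.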


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
