arXiv

Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation

Abstract:

Using 2D Vision-Language Models (VLMs) for zero-shot 3D scene understanding has attracted growing attention, largely because of their strong spatial reasoning abilities. Standard methods extract multiple 2D views from a 3D point cloud and pass them to a pre-trained VLM to answer a given question. Because the VLM sees the scene only through these views, the quality of the input context is critical: the challenge is to retain as much task-relevant 3D information as possible within a limited input budget.

To address this, we introduce KeyVT, a hierarchical strategy that gathers input context at both the view and token levels. First, we combine pixel features with camera parameters to score each view by both its semantic content and its geometric position, so that the selected views are spatially consistent and relevant to the task. Second, we reduce redundancy among image patches across the selected views by identifying representative tokens within an optimal transport (OT) framework. In this formulation, view tokens and key tokens are treated as two distributions in the embedding space; minimizing the OT distance between them encourages the selected key tokens to cover the full set of view features. Evaluations on three prominent benchmarks show that our framework significantly outperforms current tuning-free methods and performs on par with approaches that require training.
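
The token-level step can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine ground cost, the uniform weights on both distributions, the entropic (Sinkhorn) approximation of the OT distance, the greedy search, and the function names `cosine_distance`, `sinkhorn_cost`, and `select_key_tokens` are all assumptions made for illustration. The sketch picks k key tokens, one at a time, so that the transport cost between the full set of view tokens and the chosen subset is as small as possible.

```python
import math


def cosine_distance(x, y):
    """Cosine distance between two embeddings (assumed ground cost)."""
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(sum(p * p for p in x))
    norm_y = math.sqrt(sum(q * q for q in y))
    return 1.0 - dot / (norm_x * norm_y)


def sinkhorn_cost(cost, a, b, reg=0.1, n_iters=100):
    """Transport cost of the entropic-regularised plan between histograms a and b."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    for _ in range(n_iters):
        # Alternately rescale columns and rows to match the two marginals.
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j] for i in range(n) for j in range(m)
    )


def select_key_tokens(tokens, k, reg=0.1):
    """Greedily choose k token indices whose distribution is OT-closest to all tokens."""
    n = len(tokens)
    dist = [[cosine_distance(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]
    a = [1.0 / n] * n  # uniform mass on every view token (assumption)
    selected = []
    for _ in range(k):
        best_idx, best_cost = None, math.inf
        for j in range(n):
            if j in selected:
                continue
            idx = selected + [j]
            b = [1.0 / len(idx)] * len(idx)  # uniform mass on key tokens (assumption)
            sub_cost = [[dist[i][c] for c in idx] for i in range(n)]
            candidate_cost = sinkhorn_cost(sub_cost, a, b, reg)
            if candidate_cost < best_cost:
                best_idx, best_cost = j, candidate_cost
        selected.append(best_idx)
    return selected


# Toy input: nine 3-D "token embeddings" forming three tight groups.
tokens = [
    [1.0, 0.02, 0.0], [1.0, 0.0, 0.03], [1.0, 0.01, 0.01],
    [0.02, 1.0, 0.0], [0.0, 1.0, 0.03], [0.01, 1.0, 0.01],
    [0.0, 0.02, 1.0], [0.03, 0.0, 1.0], [0.01, 0.01, 1.0],
]
keys = select_key_tokens(tokens, k=3)
print(keys)
```

On this toy input the three selected indices fall in three different groups, since a second key token from an already covered group leaves the mass of the other groups with nowhere cheap to go. The greedy search is quadratic in the number of tokens per step and is meant only to make the objective concrete; a practical system would use a vectorised OT solver.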


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
