arXiv

Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation

June 3, 2026 · Dongsheng Wang, Dawei Su, Hui Huang

Title: Enhancing Zero-Shot 3D Question Answering Through Hierarchical View-to-Token Transport

Abstract:

Applying 2D Vision-Language Models (VLMs) to zero-shot 3D scene understanding has attracted growing attention, largely because of their strong spatial reasoning abilities. A standard pipeline extracts multiple 2D views from a 3D point cloud and feeds them to a pre-trained VLM to answer a given question. This makes the quality of the input context critical: the challenge is to retain as much task-relevant 3D information as possible within a limited input budget.

To address this, we introduce \texttt{KeyVT}, a hierarchical strategy that gathers input context at both the view level and the token level. First, we combine pixel features with camera parameters to score the importance of each view, accounting for both its semantic content and its geometric position. This ensures that the selected views are spatially consistent and relevant to the task. Second, we reduce redundancy among the image patches of the selected views by identifying representative tokens within an optimal transport (OT) framework. In this formulation, view tokens and key tokens are treated as two distinct distributions in the embedding space. By minimizing the OT distance between them, the selected key tokens are encouraged to cover all view features. Evaluations on three prominent benchmarks show that our framework significantly outperforms current tuning-free methods while performing on par with approaches that require training.
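The token-level step can be illustrated with a minimal sketch. The abstract does not specify the cost function, the OT solver, or how key tokens are parameterized, so everything below is an assumption made for illustration: cosine distance as the transport cost, uniform mass on both distributions, an entropic (Sinkhorn) solver, and a barycentric update of the key tokens followed by snapping each one to its nearest real view token. The names `sinkhorn_plan` and `select_key_tokens` are hypothetical and are not taken from the paper.

```python
import numpy as np


def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=200):
    """Entropic-regularized OT plan between histograms a (N,) and b (K,)."""
    kernel = np.exp(-cost / eps)          # Gibbs kernel, shape (N, K)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)            # scale columns toward marginal b
        u = a / (kernel @ v)              # scale rows toward marginal a
    return u[:, None] * kernel * v[None, :]


def select_key_tokens(tokens, k, n_outer=10, eps=0.1, seed=0):
    """Return indices of up to k rows of `tokens` (N, D) chosen as key tokens.

    Assumed procedure (not the paper's): alternate between solving OT from
    view tokens to key tokens and moving each key token to the barycenter of
    the mass it receives, then snap each key to its nearest view token.
    """
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    x = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    keys = x[rng.choice(n, size=k, replace=False)].copy()
    a = np.full(n, 1.0 / n)               # uniform mass on view tokens
    b = np.full(k, 1.0 / k)               # uniform mass on key tokens
    for _ in range(n_outer):
        cost = 1.0 - x @ keys.T           # cosine distance, values in [0, 2]
        plan = sinkhorn_plan(cost, a, b, eps)
        keys = (plan.T @ x) / plan.sum(axis=0)[:, None]   # barycentric update
        keys /= np.linalg.norm(keys, axis=1, keepdims=True)
    cost = 1.0 - x @ keys.T
    # Nearest view token per key; duplicates are merged, so fewer than k
    # indices may be returned.
    return np.unique(cost.argmin(axis=0))
```

In this sketch, `tokens` would hold the patch embeddings of all selected views stacked into one `(N, D)` array, and the returned indices identify the patches to keep as VLM input. Uniform mass on the key tokens forces each one to account for an equal share of the view tokens, which is one simple way to discourage several keys from collapsing onto the same region of the embedding space.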
