arXiv

MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence

June 2, 2026 · Hilton Raj, Vishnuram AV

Original: arXiv:2606.02463v1

Announce Type: cross

Abstract: Embodied agents in 3D environments answer spatial questions by reasoning over a range of modalities, such as natural language, RGB imagery, depth maps, point clouds, and camera poses. However, current Vision-Language Models (VLMs) are typically fine-tuned on a single modality, which ignores question semantics that may be better served by a modality other than the one used for training. To overcome this limitation, we introduce MASER (Modality-Adaptive SpEcialist Routing), a streamlined framework that trains five distinct modality adapters on a shared VLM backbone and uses a neural routing policy to select the most suitable adapter for each query at inference. Specifically, a frozen sentence transformer encodes every question, and the resulting embedding is fed into a lightweight multi-layer perceptron (MLP) trained on oracle adapter-accuracy labels. On the Open3D-VQA benchmark, no single modality is universally optimal; notably, point-cloud responses prove superior in 51.5% of instances. MASER achieves an oracle agreement rate of 51.3%, outperforming a Random-Forest ablation that scored 43.5%, while requiring only one adapter call per question.
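
The routing step described in the abstract (question embedding, then a small MLP, then one adapter selected per query) can be illustrated with a minimal sketch. This is not the authors' implementation. The adapter names, the layer sizes, the random untrained weights, and the hash-based `embed_question` stand-in for the frozen sentence transformer are all assumptions made so that the example runs with the Python standard library alone.

```python
import hashlib
import math
import random

# Assumed adapter names: one per modality listed in the abstract.
ADAPTERS = ["language", "rgb", "depth", "point_cloud", "camera_pose"]
EMBED_DIM = 16   # assumed size; a real sentence embedding is much larger
HIDDEN_DIM = 8   # assumed hidden width of the routing MLP


def embed_question(question: str) -> list[float]:
    """Hypothetical stand-in for the frozen sentence transformer.

    Maps a question to a fixed, deterministic vector via a hash. It carries
    no semantics; it only plays the role of a frozen encoder in this sketch.
    """
    digest = hashlib.sha256(question.encode("utf-8")).digest()
    return [byte / 255.0 - 0.5 for byte in digest[:EMBED_DIM]]


def init_mlp(seed: int = 0):
    """Random (untrained) weights for a one-hidden-layer MLP.

    In the paper the MLP is trained on oracle adapter-accuracy labels;
    training is omitted here.
    """
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.5) for _ in range(EMBED_DIM)] for _ in range(HIDDEN_DIM)]
    b1 = [0.0] * HIDDEN_DIM
    w2 = [[rng.gauss(0.0, 0.5) for _ in range(HIDDEN_DIM)] for _ in range(len(ADAPTERS))]
    b2 = [0.0] * len(ADAPTERS)
    return w1, b1, w2, b2


def mlp_logits(x: list[float], params) -> list[float]:
    """Forward pass: linear layer, ReLU, linear layer."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over the adapter logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(question: str, params):
    """Return the single adapter to call for this question, plus its scores."""
    probs = softmax(mlp_logits(embed_question(question), params))
    best = max(range(len(ADAPTERS)), key=probs.__getitem__)
    return ADAPTERS[best], probs


if __name__ == "__main__":
    params = init_mlp(seed=0)
    adapter, probs = route("How far is the chair from the door?", params)
    print(adapter, [round(p, 3) for p in probs])
```

Because the router returns exactly one adapter name, only that adapter would be run on the query, which matches the abstract's point that routing needs a single adapter call per question rather than one call per modality.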

Source: arXiv

Generated at: 2026-06-02 00:00:00 UTC