arXiv

Recent Advances in Multi-modal 3D Intelligence: A Comprehensive Survey and Evaluation

June 2, 2026 · Yinjie Lei, Zixuan Wang, Feng Chen, Guoqing Wang, Peng Wang, Yang Yang


Abstract:

Multi-modal 3D Intelligence has attracted significant interest, particularly for its broad applicability in domains such as world simulation and autonomous driving. By incorporating an extra modality beyond traditional single-modal 3D perception, these systems not only enhance the accuracy and depth of scene understanding but also establish a robust basis for complex interactions with the physical world. This capability is vital in diverse and difficult settings where 3D data alone proves insufficient. Despite a notable increase in multi-modal 3D methodologies over the last six years, particularly those combining 3D data with multi-camera imagery (3D+2D) or text (3D+language), thorough, holistic reviews of this area are still lacking. To address this gap, this study offers a systematic overview of recent developments. We start by outlining the distinct challenges associated with various 3D multi-modal tasks. Subsequently, we introduce a new classification framework that organizes current methods based on their modalities and specific tasks, while examining their respective advantages and drawbacks. The paper also provides a comparative analysis of recent techniques across multiple benchmark datasets, accompanied by detailed insights. Finally, we highlight existing open problems and suggest promising directions for future inquiry.

Source: arXiv