Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs
Title: Transferring Neuro-Symbolic Logic into 3D Multi-Modal Large Language Models
Abstract:
Existing approaches to 3D spatial reasoning fall into two camps with complementary weaknesses. Neuro-symbolic 3D (NS3D) concept learners offer transparent, compositional, program-based reasoning but are limited by closed-set vocabularies and simplistic logic. End-to-end 3D multi-modal LLMs (3D MLLMs), by contrast, handle open-vocabulary concepts and complex natural language, yet their reasoning remains opaque and lacks explicit spatial verification. To bridge the two paradigms, we present APEIRIA, a neuro-symbolic 3D MLLM that distills symbolic reasoning structures into natural language chain-of-thought (CoT) sequences within MLLMs.
APEIRIA employs a three-stage curriculum to systematically develop reasoning proficiency: (1) 3D Perception Alignment: this initial phase anchors the visual-geometric features of objects to the LLM's understanding. (2) CoT-SFT: supervised fine-tuning on symbolic program traces instructs the model in query decomposition and stepwise verification. (3) CoT-RL: reinforcement learning generalizes these reasoning patterns to open-set concepts and deeply nested instructions.
By focusing on the transfer of reasoning patterns rather than specific concept knowledge, APEIRIA retains the core advantages of NS3D, including transparent logic and the modular interchangeability of planning and perception modules. Our evaluations across grounding, question answering, and captioning tasks demonstrate that APEIRIA outperforms previous NS3D methods while achieving performance comparable to state-of-the-art 3D MLLMs on 3D spatial reasoning benchmarks. This achievement effectively merges the systematic rigor of symbolic methods with the adaptability of MLLMs. The project code is accessible at https://github.com/oceanflowlab/APEIRIA.
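To make the idea of a symbolic program trace rendered as chain-of-thought text concrete, the following is a minimal sketch and not the authors' implementation. The toy scene, the `Obj` class, the `run_program` function, the "near" relation, and the 1.0 m threshold are all hypothetical, chosen only for illustration. The abstract does not specify the program format or the trace template.

```python
from dataclasses import dataclass

# Hypothetical toy scene; none of these names come from the APEIRIA code base.
@dataclass(frozen=True)
class Obj:
    oid: int
    category: str
    center: tuple  # (x, y, z) in metres

SCENE = [
    Obj(0, "chair", (0.0, 0.0, 0.0)),
    Obj(1, "chair", (3.0, 0.0, 0.0)),
    Obj(2, "table", (3.5, 0.0, 0.0)),
]

def dist(a, b):
    """Euclidean distance between two object centers."""
    return sum((p - q) ** 2 for p, q in zip(a.center, b.center)) ** 0.5

def run_program(scene, target, anchor, max_dist):
    """Run filter(target) -> filter(anchor) -> relate(near), logging each step as text."""
    steps = []
    targets = [o for o in scene if o.category == target]
    steps.append(f"Step 1: find every {target}: objects {[o.oid for o in targets]}.")
    anchors = [o for o in scene if o.category == anchor]
    steps.append(f"Step 2: find every {anchor}: objects {[o.oid for o in anchors]}.")
    # Assumed definition of "near": within max_dist of at least one anchor.
    result = [t for t in targets if any(dist(t, a) <= max_dist for a in anchors)]
    ids = [o.oid for o in result]
    steps.append(f"Step 3: keep each {target} within {max_dist} m of a {anchor}: objects {ids}.")
    steps.append(f"Answer: {ids}")
    return ids, "\n".join(steps)

# Query: "the chair near the table"
answer, cot = run_program(SCENE, "chair", "table", 1.0)
print(cot)
```

In a setup like this, pairs of (query, `cot`) could serve as supervised fine-tuning targets, so that the model learns the decompose-then-verify pattern rather than the specific concepts in the toy vocabulary.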
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




