EvoMemNav: Efficient Self-Evolving Fine-Grained Memory for Zero-Shot Embodied Navigation
Abstract:
Effective long-horizon planning in zero-shot embodied navigation relies heavily on robust memory systems. Traditional approaches face significant limitations: detector-centric scene graphs tend to reduce observations to sparse nodes, which often results in the loss of detailed visual information and the accumulation of noise, whereas methods based on 3D reconstruction are typically too computationally expensive. To address these challenges, we introduce EvoMemNav, a novel framework designed for efficient, self-evolving, and fine-grained memory management in zero-shot embodied navigation.
At the core of EvoMemNav is the Visual-Semantic Memory Graph (VSMGraph). It treats raw visual views as the primary memory elements and organizes them, through lightweight semantic indicators and topological connections, into a hierarchy of rooms, views, and objects. Because the views themselves are kept, fine-grained details remain available for accurate disambiguation and stop verification. To keep cost under control as the memory grows, we propose a budgeted coarse-to-fine policy. This strategy first compresses the search space in a coarse stage to identify promising regions, then engages a Vision-Language Model (VLM) only in a fine stage for specific verification and decision-making tasks.
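As a rough illustration of the budgeted coarse-to-fine idea, the following minimal sketch ranks stored views with a cheap tag-overlap score and passes only the top few to an expensive verifier. It is not the paper's implementation: the `View` and `Room` classes, the `tag_overlap` score, the `budget` parameter, and the `verify` callback (standing in for a VLM call) are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class View:
    view_id: str
    image_path: str   # the raw view is kept as the primary memory element
    tags: set         # lightweight semantic indicators (assumed to be string tags)
    neighbors: list = field(default_factory=list)  # topological links to other view ids


@dataclass
class Room:
    name: str
    views: list = field(default_factory=list)


def tag_overlap(goal_tags, view):
    """Cheap coarse score: fraction of goal tags present in the view."""
    if not goal_tags:
        return 0.0
    return len(goal_tags & view.tags) / len(goal_tags)


def coarse_to_fine(rooms, goal_tags, verify, budget=2):
    """Rank all views cheaply, then spend at most `budget` verifier calls."""
    candidates = [v for room in rooms for v in room.views]
    # Coarse stage: set intersection only, no expensive model calls.
    ranked = sorted(candidates, key=lambda v: tag_overlap(goal_tags, v), reverse=True)
    # Fine stage: the expensive verifier only sees the top `budget` views.
    for view in ranked[:budget]:
        if verify(view):
            return view
    return None


# Usage with a stub verifier that records how often it is called.
kitchen = Room("kitchen", [
    View("k1", "k1.png", {"sink", "counter"}),
    View("k2", "k2.png", {"fridge", "counter"}),
])
bedroom = Room("bedroom", [View("b1", "b1.png", {"bed", "lamp"})])

calls = []

def stub_verify(view):
    calls.append(view.view_id)      # stands in for one VLM query
    return "fridge" in view.tags

best = coarse_to_fine([kitchen, bedroom], {"fridge", "counter"}, stub_verify, budget=2)
```

In this toy run the coarse stage places `k2` first, so the stub verifier is queried once even though three views are stored. The point of the design is that the number of expensive calls is capped by `budget` rather than by the size of the memory.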
Furthermore, EvoMemNav goes beyond static memory storage by performing a reflection-driven write-back after each subtask. This process updates graph-attached priors that encapsulate accumulated environmental knowledge, allowing the system to refine future decisions without retraining. Experiments on the GOAT-Bench and HM3D datasets show consistent improvements in Success Rate (SR) and Success weighted by Path Length (SPL) across object, text-description, and image-goal modalities. These gains are accompanied by an improved ability to disambiguate multiple instances, fewer premature stops, and stronger zero-shot generalization.
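One way to picture a write-back that refines decisions without retraining is a table of counts attached to graph nodes and updated after every subtask. The sketch below assumes, purely for illustration, that a prior is a Laplace-smoothed hit rate per (room, object category) pair. The abstract does not specify the form of the priors, so the `RoomPriors` class and its methods are hypothetical.

```python
class RoomPriors:
    """Hypothetical graph-attached prior: smoothed hit rate per (room, category)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.hits = {}
        self.trials = {}

    def write_back(self, room, category, found):
        """Record the outcome of one finished subtask (no model weights change)."""
        key = (room, category)
        self.trials[key] = self.trials.get(key, 0.0) + 1.0
        if found:
            self.hits[key] = self.hits.get(key, 0.0) + 1.0

    def score(self, room, category):
        """Laplace-smoothed success estimate; 0.5 when nothing has been observed."""
        key = (room, category)
        hits = self.hits.get(key, 0.0)
        trials = self.trials.get(key, 0.0)
        return (hits + self.smoothing) / (trials + 2.0 * self.smoothing)


# Usage: two subtasks finish, and later room rankings reflect their outcomes.
priors = RoomPriors()
priors.write_back("kitchen", "mug", found=True)
priors.write_back("bedroom", "mug", found=False)

rooms_ranked = sorted(
    ["bedroom", "hallway", "kitchen"],
    key=lambda room: priors.score(room, "mug"),
    reverse=True,
)
```

After these two updates the kitchen ranks above the unvisited hallway, which in turn ranks above the bedroom, so later searches for the same category start in the room where it was previously found. Only the stored counts change, which is the sense in which such a scheme avoids retraining.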
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC





