Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation
Vision-Language Navigation in Continuous Environments (VLN-CE) is a hard problem for autonomous agents: they must combine visual input with natural-language instructions to traverse complex 3D indoor environments. Current methods often struggle on long-horizon tasks because of inadequate scene understanding, suboptimal planning, and the lack of robust decision-making structures.
To overcome these limitations, we present the Hierarchical Semantic-Augmented Navigation (HSAN) framework, which addresses VLN-CE through three interconnected contributions. First, HSAN builds a dynamic, hierarchical semantic scene graph. Using vision-language models, it captures multi-level environmental representations (from individual objects to broader regions and zones), which supports detailed spatial reasoning.
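As a rough illustration of what a three-level scene graph could look like as a data structure, the following minimal sketch stores object, region, and zone nodes with parent links. The class names, fields, and the strict one-level-up parent rule are assumptions made for this example only; the abstract does not specify HSAN's actual representation or how vision-language models populate it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed tiers, ordered from finest to coarsest (hypothetical naming).
LEVELS = ("object", "region", "zone")


@dataclass
class Node:
    name: str
    level: str
    position: Tuple[float, float, float]
    parent: Optional[str] = None
    children: List[str] = field(default_factory=list)


class SceneGraph:
    """Toy hierarchical scene graph: objects sit in regions, regions in zones."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def add(self, name, level, position, parent=None) -> None:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        if parent is not None:
            parent_node = self.nodes[parent]
            # Assumption: a parent is exactly one tier coarser than its child.
            if LEVELS.index(parent_node.level) != LEVELS.index(level) + 1:
                raise ValueError("parent must be one level above the child")
            parent_node.children.append(name)
        self.nodes[name] = Node(name, level, position, parent)

    def ancestors(self, name) -> List[str]:
        """Return the chain of enclosing nodes, nearest first."""
        chain = []
        current = self.nodes[name].parent
        while current is not None:
            chain.append(current)
            current = self.nodes[current].parent
        return chain


graph = SceneGraph()
graph.add("living_area", "zone", (0.0, 0.0, 0.0))
graph.add("kitchen", "region", (2.0, 0.0, 1.0), parent="living_area")
graph.add("mug", "object", (2.3, 0.9, 1.1), parent="kitchen")
print(graph.ancestors("mug"))  # ['kitchen', 'living_area']
```

Walking the ancestor chain is one simple way an instruction such as "find the mug in the kitchen" could be checked against coarser context; a dynamic graph would additionally update nodes as new observations arrive.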
Second, the framework uses a topological planner based on optimal transport, rooted in Kantorovich’s duality. This component selects long-term goals by balancing semantic relevance against spatial feasibility, and it comes with theoretical optimality guarantees.
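For reference, the standard discrete Kantorovich problem and its dual can be written as below. The symbols are assumptions for illustration: $\mu$ and $\nu$ are probability vectors over source and target graph nodes, and $c_{ij}$ is a generic transport cost. The particular cost shown on the last line, which trades a spatial distance $d_{ij}$ against a semantic score $s_{ij}$ with a hypothetical weight $\lambda$, is only one plausible choice and is not taken from the paper.

```latex
% Primal (Kantorovich) problem over transport plans \pi
\min_{\pi \ge 0} \; \sum_{i,j} c_{ij}\, \pi_{ij}
\quad \text{s.t.} \quad
\sum_{j} \pi_{ij} = \mu_i \;\; \forall i, \qquad
\sum_{i} \pi_{ij} = \nu_j \;\; \forall j .

% Dual problem over potentials f and g
\max_{f,\, g} \; \sum_{i} f_i\, \mu_i + \sum_{j} g_j\, \nu_j
\quad \text{s.t.} \quad
f_i + g_j \le c_{ij} \;\; \forall i, j .

% Illustrative cost (assumption): spatial distance minus weighted semantic score
c_{ij} = d_{ij} - \lambda\, s_{ij}, \qquad \lambda \ge 0 .
```

In this finite setting both problems are linear programs, so strong duality holds and their optimal values coincide; this is the sense in which a dual solution can certify that a transport plan is optimal.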
Finally, a graph-aware reinforcement learning policy handles fine-grained control, executing subgoals accurately while maintaining robust obstacle avoidance. By combining spectral graph theory, optimal transport, and multimodal learning, HSAN mitigates the limitations of the static maps and heuristic planners common in prior work. Evaluations on several challenging VLN-CE datasets show that HSAN achieves state-of-the-art results, with substantial gains in navigation success rate and in generalization to unseen environments.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





