arXiv

Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation

Vision-Language Navigation in Continuous Environments (VLN-CE) is a difficult problem for autonomous agents, which must combine visual input with natural-language instructions to traverse complex 3D indoor environments. Existing methods often struggle with long-horizon tasks because of limited scene understanding, suboptimal planning, and the lack of robust decision-making frameworks.

To overcome these limitations, we present the Hierarchical Semantic-Augmented Navigation (HSAN) framework, which addresses VLN-CE through three interconnected components. First, HSAN builds a dynamic, hierarchical semantic scene graph. Using vision-language models, it captures multi-level representations of the environment, from individual objects up to regions and zones, which supports fine-grained spatial reasoning.
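As a rough illustration of what a three-tier scene graph might look like as a data structure, the following minimal sketch stores object, region, and zone nodes with parent links. The class names, the fixed three-level ordering, and the `ancestors` helper are assumptions made for this example; the abstract does not describe HSAN's actual graph schema or how vision-language models populate it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumed tiers, ordered from finest to coarsest (hypothetical naming).
LEVELS = ("object", "region", "zone")


@dataclass
class SceneNode:
    node_id: str
    level: str                      # one of LEVELS
    label: str                      # e.g. a label produced by a vision-language model
    parent: Optional[str] = None    # id of the node one tier above, if any
    children: List[str] = field(default_factory=list)


class HierarchicalSceneGraph:
    """Toy container for a multi-tier scene graph (illustrative only)."""

    def __init__(self) -> None:
        self.nodes: Dict[str, SceneNode] = {}

    def add_node(self, node_id: str, level: str, label: str,
                 parent: Optional[str] = None) -> SceneNode:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        if parent is not None:
            parent_node = self.nodes[parent]  # KeyError if the parent is missing
            # A parent must sit exactly one tier above its child.
            if LEVELS.index(parent_node.level) != LEVELS.index(level) + 1:
                raise ValueError("parent must be exactly one tier above the child")
            parent_node.children.append(node_id)
        node = SceneNode(node_id, level, label, parent)
        self.nodes[node_id] = node
        return node

    def ancestors(self, node_id: str) -> List[str]:
        """Return labels of enclosing nodes, nearest first."""
        chain: List[str] = []
        current = self.nodes[node_id].parent
        while current is not None:
            chain.append(self.nodes[current].label)
            current = self.nodes[current].parent
        return chain


if __name__ == "__main__":
    graph = HierarchicalSceneGraph()
    graph.add_node("z0", "zone", "living area")
    graph.add_node("r0", "region", "sofa corner", parent="z0")
    graph.add_node("o0", "object", "sofa", parent="r0")
    print(graph.ancestors("o0"))
```

Because nodes can be added at any time, a structure like this can grow as the agent explores, which is one plausible reading of "dynamic" here.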

Second, the framework uses a topological planner based on optimal transport and grounded in Kantorovich duality. This component selects long-term goals by balancing semantic relevance against spatial feasibility, with theoretical optimality guarantees.
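To make the trade-off between semantic relevance and spatial feasibility concrete, the sketch below matches instruction phrases to candidate waypoints with a transport plan. Everything here is an assumption for illustration: the cost is a hypothetical weighted sum of semantic dissimilarity and travel distance (weight `alpha`), and the plan is computed with entropy-regularized Sinkhorn iterations, a standard approximation to the Kantorovich problem rather than the exact solver or cost the authors use, which the abstract does not specify.

```python
import math
from typing import List

Matrix = List[List[float]]


def combined_cost(semantic: Matrix, distance: Matrix, alpha: float = 0.7) -> Matrix:
    """Hypothetical cost: alpha * semantic dissimilarity + (1 - alpha) * distance."""
    return [[alpha * s + (1.0 - alpha) * d for s, d in zip(s_row, d_row)]
            for s_row, d_row in zip(semantic, distance)]


def sinkhorn_plan(cost: Matrix, a: List[float], b: List[float],
                  reg: float = 0.2, n_iters: int = 200) -> Matrix:
    """Entropy-regularized transport plan between marginals a (rows) and b (columns)."""
    n, m = len(a), len(b)
    # Gibbs kernel: low-cost pairs receive large kernel values.
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    # Two instruction phrases, three candidate waypoints (made-up numbers).
    semantic = [[0.1, 0.8, 0.9], [0.9, 0.2, 0.7]]
    distance = [[0.2, 0.5, 1.0], [0.2, 0.5, 1.0]]
    cost = combined_cost(semantic, distance, alpha=0.7)
    plan = sinkhorn_plan(cost, a=[0.5, 0.5], b=[1 / 3, 1 / 3, 1 / 3])
    for row in plan:
        print([round(p, 3) for p in row])
```

In this toy setup, each row of the plan indicates how strongly a phrase is matched to each waypoint, so a planner could read off the highest-mass waypoint for the current phrase as its next long-term goal. Smaller values of `reg` bring the plan closer to the unregularized optimum at the price of slower convergence.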

Finally, a graph-aware reinforcement learning policy handles fine-grained control, executing subgoals accurately while avoiding obstacles robustly. By combining spectral graph theory, optimal transport, and multi-modal learning, HSAN addresses the limitations of the static maps and heuristic planners common in prior work. Extensive evaluations on several challenging VLN-CE datasets show that HSAN achieves state-of-the-art results, with substantial gains in navigation success rate and in generalization to unseen environments.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
