arXiv

Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

Abstract:

Vision-Language Navigation (VLN) tasks require agents to traverse three-dimensional spaces by interpreting natural language commands. Although recent advancements in Video Large Language Models (Video-LLMs) have significantly boosted VLN capabilities, these models struggle with State Drift during extended scenarios. This issue arises when an agent’s internal state diverges from the actual progress of the task, resulting in disoriented wandering and the inability to perform critical maneuvers specified in the instructions. We identify two primary cognitive deficits responsible for this failure: Progress Drift, where the agent cannot differentiate between achieved sub-goals and those yet to be completed, and Memory Drift, wherein the degradation of historical representations causes the agent to lose track of previously visited landmarks.

To resolve these issues, we introduce a Dual-Anchoring Framework designed to explicitly stabilize both instruction progress and historical memory. To counter Progress Drift, we implement Instruction Progress Anchoring, a mechanism that guides the agent to output structured text tokens clearly separating completed sub-goals from those remaining. To alleviate Memory Drift, we propose Memory Landmark Anchoring, which employs a Landmark-Centric World Model. This component retrospectively predicts object-centric embeddings derived from the Segment Anything Model, thereby forcing the agent to actively verify past observations and maintain distinct records of visited landmarks.
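The two anchoring signals described above can be illustrated with a small sketch. The abstract does not give the actual token format or training objective, so everything named here is an assumption made for illustration: the `<done>`, `<todo>`, and `<action>` tags, the `ProgressState` class, and the use of a cosine distance between predicted and target landmark embeddings. Plain Python lists stand in for the embeddings derived from the Segment Anything Model.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ProgressState:
    """Hypothetical container for an instruction split into ordered sub-goals."""
    sub_goals: List[str]
    num_completed: int

    def to_target_text(self, action: str) -> str:
        # Separate achieved sub-goals from remaining ones using
        # illustrative tags (not the paper's actual token format).
        done = self.sub_goals[: self.num_completed]
        todo = self.sub_goals[self.num_completed:]
        done_text = "; ".join(done) if done else "none"
        todo_text = "; ".join(todo) if todo else "none"
        return (
            f"<done> {done_text} </done> "
            f"<todo> {todo_text} </todo> "
            f"<action> {action} </action>"
        )


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def landmark_anchoring_loss(
    predicted: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
) -> float:
    """Mean cosine distance between predicted and target landmark embeddings.

    Assumed objective: the agent re-predicts embeddings of landmarks it has
    already passed, and is penalized when they diverge from the targets.
    """
    pairs = list(zip(predicted, targets))
    return sum(cosine_distance(p, t) for p, t in pairs) / len(pairs)


state = ProgressState(
    sub_goals=["exit the bedroom", "turn left at the sofa", "stop at the fridge"],
    num_completed=1,
)
print(state.to_target_text("turn left"))
print(landmark_anchoring_loss([[3.0, 4.0], [1.0, 0.0]], [[3.0, 4.0], [0.0, 1.0]]))
```

In this sketch the first landmark is recalled exactly and the second is orthogonal to its target, so the mean distance is 0.5; a model with degraded memory of visited landmarks would score higher under this assumed objective.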

To support this framework, we have compiled two large-scale datasets: one containing 3.6 million samples with explicit progress descriptions, and another comprising 937,000 grounded landmark instances for retrospective verification. Comprehensive experiments in both simulated and physical environments demonstrate the effectiveness of our approach, which yields a 15.2% increase in Success Rate and a 24.7% improvement on long-horizon trajectories. We plan to release our codebase, data generation pipelines, and the curated datasets.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
