Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
Abstract:
Vision-Language Navigation (VLN) tasks require agents to traverse three-dimensional spaces by interpreting natural language commands. Although recent advancements in Video Large Language Models (Video-LLMs) have significantly boosted VLN capabilities, these models struggle with State Drift in long-horizon scenarios. This issue arises when an agent’s internal state diverges from the actual progress of the task, resulting in disoriented wandering and the inability to perform critical maneuvers specified in the instructions. We identify two primary cognitive deficits responsible for this failure: Progress Drift, where the agent cannot differentiate between achieved sub-goals and those yet to be completed, and Memory Drift, wherein the degradation of historical representations causes the agent to lose track of previously visited landmarks.
To resolve these issues, we introduce a Dual-Anchoring Framework designed to explicitly stabilize both instruction progress and historical memory. To counter Progress Drift, we implement Instruction Progress Anchoring, a mechanism that guides the agent to output structured text tokens clearly separating completed sub-goals from those remaining. To alleviate Memory Drift, we propose Memory Landmark Anchoring, which employs a Landmark-Centric World Model. This component retrospectively predicts object-centric embeddings derived from the Segment Anything Model, thereby forcing the agent to actively verify past observations and maintain distinct records of visited landmarks.
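As a rough illustration of the two anchoring signals described above, the following Python sketch shows one way a structured progress target could be written and parsed, and one way a retrospective landmark objective could be scored. It is not the authors' implementation. The tag names `<completed>` and `<remaining>`, the semicolon separator, the function names, and the choice of cosine distance as the embedding loss are all assumptions made for illustration; the abstract specifies none of them.

```python
import math
import re
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical tag names: the abstract does not specify the token format.
DONE_TAG, TODO_TAG = "completed", "remaining"


@dataclass
class ProgressState:
    completed: List[str]
    remaining: List[str]


def format_progress(sub_goals: Sequence[str], num_done: int) -> str:
    """Build a structured text target separating finished from pending sub-goals."""
    if not 0 <= num_done <= len(sub_goals):
        raise ValueError("num_done must lie between 0 and len(sub_goals)")
    done = "; ".join(sub_goals[:num_done])
    todo = "; ".join(sub_goals[num_done:])
    return f"<{DONE_TAG}>{done}</{DONE_TAG}><{TODO_TAG}>{todo}</{TODO_TAG}>"


def parse_progress(text: str) -> ProgressState:
    """Recover the two sub-goal lists from a structured progress string."""

    def grab(tag: str) -> List[str]:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        if match is None:
            raise ValueError(f"missing <{tag}> block")
        # Sub-goals are assumed not to contain ';' themselves.
        return [part.strip() for part in match.group(1).split(";") if part.strip()]

    return ProgressState(completed=grab(DONE_TAG), remaining=grab(TODO_TAG))


def cosine_distance(pred: Sequence[float], target: Sequence[float]) -> float:
    """1 - cosine similarity between a predicted and a reference embedding."""
    if len(pred) != len(target):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    if norm == 0.0:
        raise ValueError("zero-norm embedding")
    return 1.0 - dot / norm


def landmark_loss(pairs: Sequence[Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Mean cosine distance over (predicted, reference) landmark embeddings.

    The reference vectors stand in for object-centric embeddings of past
    landmarks; the loss choice here is an assumption, not the paper's.
    """
    if not pairs:
        raise ValueError("at least one landmark pair is required")
    return sum(cosine_distance(p, t) for p, t in pairs) / len(pairs)


if __name__ == "__main__":
    goals = ["exit the bedroom", "turn left at the sofa", "stop by the fridge"]
    target = format_progress(goals, num_done=1)
    print(target)
    print(parse_progress(target))
    print(landmark_loss([([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]))
```

In this sketch the progress string would serve as a supervised text target at each step, while the landmark loss would be applied to embeddings the model predicts for previously observed objects. How the two terms are weighted and combined with the navigation objective is left open here, since the abstract does not say.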
Supporting this framework, we have compiled two large-scale datasets: one containing 3.6 million samples featuring explicit progress descriptions, and another comprising 937,000 instances of grounded landmark data for retrospective verification. Comprehensive experiments conducted in both simulated and physical environments highlight the effectiveness of our approach, which yields a 15.2% increase in Success Rate and a significant 24.7% improvement on long-horizon trajectories. To support the broader research community, we plan to release our codebase, data generation pipelines, and the curated datasets.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





