arXiv

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

June 2, 2026 · Muyi Bao, Yuxin Cai, Hang Xu, Zongtai Li, Jinxi He, Jingfan Tang, Chen Lv, Ji Zhang, Yaqi Xie, Wenshan Wang


Abstract: Vision-language models (VLMs) have become a standard backbone for vision-and-language navigation in continuous environments (VLN-CE). However, most current VLM-driven approaches treat navigation as low-level action prediction. This interface has several drawbacks: it is ambiguous, it is restricted to short-horizon motion primitives, and it is computationally inefficient because it requires repeated queries to the VLM. To address these limitations, we introduce Goal2Pixel, a pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding. Instead of generating action commands, Goal2Pixel uses the image plane as a shared spatial interface between VLM reasoning and robot motion. The model predicts a visible navigable pixel, which is back-projected into a 3D waypoint to guide forward movement. For maneuvers other than moving forward, auxiliary directive regions are added to the image plane: the left, right, and bottom regions correspond to turning left, turning right, and stopping, respectively. To support long-horizon navigation, we develop a visibility-aware keyframe memory that provides a compact yet informative representation of the navigation history. Furthermore, we adapt pretrained VLMs to navigable pixel grounding by incorporating semantic embeddings and coordinate-aware auxiliary losses. Goal2Pixel achieves results competitive with the state of the art while significantly reducing the number of VLM inference calls. On the R2R-CE Val-Unseen split, it achieves a Success Rate (SR) of 54.1% and a Success weighted by Path Length (SPL) of 52.5% using only 7.75 VLM calls per episode. This is roughly a six-fold reduction from the 46.62 calls needed by direct action prediction, which reaches a lower SR of 32.9%. The same efficiency trend is observed on the RxR-CE benchmark.
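
The pixel-to-action interface described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a standard pinhole camera model, a metric depth value at the predicted pixel, and directive regions modelled as fixed-width bands along the left, right, and bottom image borders. The names `Intrinsics`, `back_project`, `decode_pixel`, the `margin` parameter, and the rule that the bottom band takes priority in the corners are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point3D = Tuple[float, float, float]


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (values are supplied by the caller)."""
    fx: float
    fy: float
    cx: float
    cy: float


def back_project(u: float, v: float, depth_m: float, k: Intrinsics) -> Point3D:
    """Lift pixel (u, v) with metric depth into a camera-frame 3D point.

    Assumed convention: x points right, y points down, z points forward.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def decode_pixel(
    u: float,
    v: float,
    width: int,
    height: int,
    depth_m: float,
    k: Intrinsics,
    margin: float = 0.1,  # assumed band width as a fraction of the image size
) -> Tuple[str, Optional[Point3D]]:
    """Map one predicted pixel to a discrete directive or a 3D waypoint."""
    band_x = margin * width
    band_y = margin * height
    # Assumption: the bottom (stop) band wins where bands overlap in the corners.
    if v >= height - band_y:
        return ("stop", None)
    if u < band_x:
        return ("turn_left", None)
    if u >= width - band_x:
        return ("turn_right", None)
    # Any other pixel is treated as a navigable point and lifted to 3D.
    return ("goto", back_project(u, v, depth_m, k))


if __name__ == "__main__":
    cam = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    print(decode_pixel(420, 240, 640, 480, depth_m=2.0, k=cam))
    print(decode_pixel(10, 240, 640, 480, depth_m=2.0, k=cam))
```

In this sketch a single prediction yields either a waypoint that may lie several metres ahead or one discrete directive, which is the intuition behind needing fewer VLM calls per episode than step-by-step action prediction. How the actual directive regions are laid out and how depth is obtained are details of the paper that this example does not attempt to reproduce.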

Project Page: https://baobao0926.github.io/Goal2Pixel/.

Source: arXiv