arXiv

See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation

June 2, 2026 · Tingjun Dai, Mingfei Han, Tingwen Du, Zhiheng Liu, Zihao Zhang, Zhihui Li, Salman Khan, Jun Yu, Xiaojun Chang


Abstract: Reliable robotic manipulation hinges on the ability to measure task progress through clear, actionable milestones. This awareness of progress allows a system to anchor its current operational status, predict verifiable intermediate outcomes, and identify and recover from failures when movement halts. To implement this capability, we present See, Plan, Rewind (SPR), a progress-aware vision-language-action framework that dynamically translates language commands into a sequence of spatial subgoals. SPR runs a continuous loop of three stages: Seeing the current state and the upcoming milestone, Planning a path to the next 2D waypoint, and Rewinding to a recoverable position if a failure occurs, all while tracking progress against the anticipated sequence. This closed-loop design enables effective error correction without extra training data or supplementary models. Comprehensive experiments confirm the framework’s efficacy, generalizability, and resilience: SPR surpasses the MolmoAct baseline by 5% on the LIBERO benchmark. On the more demanding LIBERO-Plus benchmark, which features unseen instructions and initial states, SPR achieves state-of-the-art robustness with the smallest performance decline, outperforming OpenVLA-OFT and UniVLA.
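
The See, Plan, Rewind loop described in the abstract can be illustrated with a toy control loop. This is only a minimal sketch on a 2D grid, not the authors' implementation: the names `ToyState`, `step_toward`, `run_episode`, the `patience` stall threshold, and the `blocked_steps` failure simulation are all hypothetical, and the real system would use a vision-language-action model rather than hand-written rules.

```python
from dataclasses import dataclass, field

Point = tuple[int, int]


@dataclass
class ToyState:
    """Hypothetical agent state: current position plus recoverable positions."""
    position: Point
    history: list[Point] = field(default_factory=list)


def step_toward(pos: Point, target: Point) -> Point:
    """Plan: propose the next 2D waypoint, one grid cell closer to the target."""
    dx = (target[0] > pos[0]) - (target[0] < pos[0])
    dy = (target[1] > pos[1]) - (target[1] < pos[1])
    return (pos[0] + dx, pos[1] + dy)


def run_episode(start: Point, subgoals: list[Point],
                blocked_steps: frozenset = frozenset(),
                patience: int = 2, max_steps: int = 50) -> tuple[int, int]:
    """Return (subgoals reached, rewinds performed) for one toy episode.

    Assumption: a failure is simulated as a move that has no effect at the
    time steps listed in blocked_steps; `patience` consecutive failed moves
    count as a stall and trigger a rewind.
    """
    state = ToyState(start, [start])
    reached = rewinds = stalled = 0
    for t in range(max_steps):
        if reached == len(subgoals):
            break
        # See: read the current status and the upcoming milestone.
        target = subgoals[reached]
        # Plan: choose the next 2D waypoint toward that milestone.
        waypoint = step_toward(state.position, target)
        # Act: the move fails silently on blocked steps.
        if t in blocked_steps:
            stalled += 1
        else:
            state.position = waypoint
            stalled = 0
        if state.position == target:
            # Milestone reached: record it as a recoverable position.
            reached += 1
            state.history.append(state.position)
        elif stalled >= patience:
            # Rewind: return to the last recoverable position and retry.
            state.position = state.history[-1]
            rewinds += 1
            stalled = 0
    return reached, rewinds


if __name__ == "__main__":
    print(run_episode((0, 0), [(2, 0), (2, 2)], blocked_steps=frozenset({1, 2})))
```

In this toy setting the agent stalls for two consecutive steps, rewinds once to its last recorded position, and then completes both subgoals. The point of the sketch is only the control flow: progress is tracked against an ordered list of milestones, and recovery reuses states the agent has already visited instead of relying on an extra model.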

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC