arXiv

VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training

June 4, 2026 · Siyuan Yang, Linzheng Guo, Ouyang Lu, Zhaxizhuoma, Daoran Zhang, Xinmiao Wang, Ting Xiao, Fangzheng Yan, Zhijun Chen, Yan Ding, Chao Yu, Chenjia Bai, Xuelong Li

Title: VISTA: Adapting UMI Data for VLA Training via Vision-Grounded and Physics-Validated Methods

Abstract:

While the Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, using this data to train large-scale Vision-Language-Action (VLA) models poses fundamental challenges. We identify two mismatches. First, wrist-mounted fisheye cameras produce images with severe radial distortion and local, gripper-centric viewpoints that fall outside the distribution of data used to pretrain Vision-Language Models (VLMs). Second, human-collected trajectories often violate kinematic constraints, cause collisions, or exceed controller bandwidth limits, and so teach VLA policies to execute physically infeasible actions.

To address both gaps, we introduce VISTA, a framework with three integrated components. First, we present UMI-VQA, the first large-scale Visual Question Answering dataset tailored to wrist-mounted fisheye observations, which aligns VLM representations with the distorted visual domain through auxiliary vision-language supervision. Second, we implement a systematic physical-validation pipeline that runs a data-completeness pre-check and then scores valid trajectories on trajectory continuity, self-collision risk, and execution fidelity before they are included in training. Third, we employ a two-stage co-training strategy that simultaneously learns vision-language grounding on UMI-VQA and action prediction on the validated trajectories.
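As an illustration of the second component, the following is a minimal sketch of what a pre-check followed by per-trajectory scoring could look like. It is not the released pipeline: the function names, the `(time, xyz)` sample format, the speed limit, and the tracking tolerance are all assumptions made for this example, and the self-collision term is left out because it would require a robot model.

```python
import math

# Hypothetical sample format: (timestamp_s, (x, y, z)) with positions in metres.

def is_complete(traj):
    """Pre-check: at least two samples, finite values, strictly increasing time."""
    if len(traj) < 2:
        return False
    for t, pos in traj:
        if not all(math.isfinite(v) for v in (t, *pos)):
            return False
    return all(b[0] > a[0] for a, b in zip(traj, traj[1:]))

def continuity_score(traj, max_speed=1.0):
    """Fraction of steps whose average speed is at most max_speed (m/s, assumed limit)."""
    steps = list(zip(traj, traj[1:]))
    ok = sum(
        math.dist(p0, p1) / (t1 - t0) <= max_speed
        for (t0, p0), (t1, p1) in steps
    )
    return ok / len(steps)

def fidelity_score(commanded, executed, tol=0.01):
    """Fraction of samples where the executed position is within tol (m) of the command."""
    hits = sum(math.dist(c[1], e[1]) <= tol for c, e in zip(commanded, executed))
    return hits / len(commanded)

def validate(commanded, executed):
    """Return a dict of scores, or None if the pre-check rejects the pair."""
    if not (is_complete(commanded) and is_complete(executed)):
        return None
    if len(commanded) != len(executed):
        return None
    return {
        "continuity": continuity_score(commanded),
        "fidelity": fidelity_score(commanded, executed),
    }

# Example: the last step jumps 0.5 m in 0.1 s, and the replay misses the final pose.
commanded = [(0.0, (0.0, 0.0, 0.0)), (0.1, (0.05, 0.0, 0.0)),
             (0.2, (0.1, 0.0, 0.0)), (0.3, (0.6, 0.0, 0.0))]
executed = commanded[:3] + [(0.3, (0.65, 0.0, 0.0))]
scores = validate(commanded, executed)
```

In a sketch like this, trajectories that fail the pre-check are dropped outright, while the scores of the remaining ones could be thresholded or used as sampling weights; which of these the paper does is not stated in the abstract.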

Our experiments show that adding UMI-VQA consistently improves downstream policy performance, and that physical-validation scores are strong predictors of deployment success. Across a range of simulation and real-world manipulation tasks, VISTA outperforms strong baselines such as $\pi_{0.5}$, LingBot-VLA, and Wall-X. To support the broader research community, we are releasing the physical-validation pipeline, the UMI-VQA dataset, the validated trajectory data, and the pre-trained model.
