arXiv

VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training

Abstract:

While the Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, using this data to train large-scale Vision-Language-Action (VLA) models raises fundamental challenges. This study identifies two primary mismatches. First, wrist-mounted fisheye cameras produce images with severe radial distortion and local, gripper-centric viewpoints that fall outside the distribution of the data used to pretrain Vision-Language Models (VLMs). Second, human-collected trajectories often violate kinematic constraints, cause collisions, or exceed controller bandwidth limits, thereby teaching VLA policies to execute physically infeasible actions.

To overcome these obstacles, we introduce VISTA, a framework that bridges this dual gap through three integrated components. First, we present UMI-VQA, the first large-scale Visual Question Answering dataset tailored to wrist-mounted fisheye observations. This dataset aligns VLM representations with the distorted visual domain through auxiliary vision-language supervision. Second, we implement a systematic physical-validation pipeline that runs a data-completeness pre-check and then scores valid trajectories on trajectory continuity, self-collision risk, and execution fidelity before they are included in training. Third, we employ a two-stage co-training strategy that jointly learns vision-language grounding on UMI-VQA and action prediction on the validated trajectories.
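As a rough illustration of the validation idea, the following minimal sketch shows a completeness pre-check followed by a continuity score and a threshold filter. It is not the released pipeline: the function names, the fixed sampling interval `dt`, the `max_speed` limit, and the `threshold` are all hypothetical choices made for this example, and the self-collision and execution-fidelity scores are omitted because they depend on a robot model and controller that the abstract does not specify.

```python
import math

def is_complete(positions, min_len=2):
    """Hypothetical pre-check: enough samples, each a finite 3-D point."""
    if len(positions) < min_len:
        return False
    return all(
        len(p) == 3 and all(math.isfinite(v) for v in p)
        for p in positions
    )

def continuity_score(positions, dt, max_speed):
    """Fraction of consecutive steps whose speed stays within max_speed.

    Assumes positions are sampled at a fixed interval dt (seconds).
    """
    steps = list(zip(positions, positions[1:]))
    within_limit = sum(
        1 for a, b in steps if math.dist(a, b) / dt <= max_speed
    )
    return within_limit / len(steps)

def filter_trajectories(trajectories, dt=0.1, max_speed=1.0, threshold=0.9):
    """Keep trajectories that pass the pre-check and score above threshold."""
    kept = []
    for traj in trajectories:
        if not is_complete(traj):
            continue  # incomplete data never reaches the scoring stage
        if continuity_score(traj, dt, max_speed) >= threshold:
            kept.append(traj)
    return kept

# Toy end-effector paths (metres), sampled every 0.1 s.
smooth = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (0.1, 0.0, 0.0)]
jumpy = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.05, 0.0, 0.0)]
broken = [(0.0, 0.0, 0.0), (float("nan"), 0.0, 0.0)]

validated = filter_trajectories([smooth, jumpy, broken])
```

In this toy setup only `smooth` survives: `broken` fails the pre-check because of the NaN sample, and `jumpy` is rejected because one of its two steps implies a speed above the assumed limit.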

Our empirical results show that integrating UMI-VQA consistently improves downstream policy performance, and that physical-validation scores are strong predictors of deployment success. Across a variety of simulation and real-world manipulation tasks, VISTA outperforms strong baselines such as $\pi_{0.5}$, LingBot-VLA, and Wall-X. To support the broader research community, we are releasing the physical-validation pipeline, the UMI-VQA dataset, the validated trajectory data, and the pre-trained model.


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
