arXiv

Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics

Original: arXiv:2506.06006v3. Abstract: Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically plausible transitions between frames from instructions. Nevertheless, we identify a crucial asymmetry in multimodal grounding: fine-tuning a VLM to learn inverse dynamics prediction (IDP), effectively captioning the action between frames, is significantly easier than learning FDP. In turn, IDP can be used to bootstrap FDP through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference time verification. Firstly, IDP can annotate actions for unlabelled pairs of video frame observations to expand the training data scale for FDP. Secondly, IDP can assign rewards to multiple samples of FDP to score them, effectively guiding search at inference time. We evaluate the FDP resulting from both strategies through the task of action-centric image editing on Aurora-Bench with two families of VLMs. Despite remaining general-purpose, our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin between 7% and 13% according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.

Rewrite: Title: Leveraging Inverse Dynamics to Bootstrap World Models for Future State Prediction in VLMs

Abstract: This study investigates whether unified vision-language models (VLMs) can perform forward dynamics prediction (FDP), that is, predicting the next visual state (as an image) from a prior observation and an action expressed in language. Our analysis shows that VLMs struggle to produce physically plausible frame transitions from textual instructions. However, we identify an asymmetry in multimodal grounding: fine-tuning a VLM for inverse dynamics prediction (IDP), which amounts to captioning the action that occurred between two frames, is substantially easier than fine-tuning it for FDP. IDP can therefore be used to bootstrap FDP in two ways: weakly supervised learning from synthetic data, and verification at inference time. In the first, IDP labels unannotated pairs of video frames with actions, increasing the volume of training data available for FDP. In the second, IDP assigns rewards to multiple sampled FDP outputs, scoring them to guide search at inference time. We evaluate the FDP models produced by both strategies on action-centric image editing with Aurora-Bench, using two families of VLMs. Although it remains general-purpose, our best model is competitive with state-of-the-art image editing models: it improves on them by a margin of 7% to 13% according to GPT-4o used as a judge, and it achieves the best average human evaluation score across all Aurora-Bench subsets.
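The two bootstrapping strategies described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`pseudo_label_pairs`, `best_of_n`) and the callables `idp_caption`, `fdp_sample`, and `idp_score` are hypothetical stand-ins for the fine-tuned models, and the abstract does not specify how the IDP reward is actually computed. Frames are replaced by integers and actions by strings such as "add 3" so that the example runs without any model.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def pseudo_label_pairs(
    frame_pairs: Sequence[Tuple[Frame, Frame]],
    idp_caption: Callable[[Frame, Frame], str],
) -> List[Tuple[Frame, str, Frame]]:
    """Strategy 1 (weak supervision): an IDP model annotates unlabelled
    (before, after) frame pairs, yielding (before, action, after) triples
    that could serve as extra FDP training data."""
    return [(before, idp_caption(before, after), after)
            for before, after in frame_pairs]


def best_of_n(
    before: Frame,
    action: str,
    fdp_sample: Callable[[Frame, str], Frame],
    idp_score: Callable[[Frame, Frame, str], float],
    n: int = 4,
) -> Tuple[Frame, float]:
    """Strategy 2 (inference-time verification): draw n FDP candidates and
    keep the one to which the IDP-based scorer assigns the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [fdp_sample(before, action) for _ in range(n)]
    scores = [idp_score(before, cand, action) for cand in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]


# Toy stand-ins (assumptions for illustration only): a "frame" is an int,
# and an action is the string "add k".
def toy_idp_caption(before: int, after: int) -> str:
    return f"add {after - before}"


def toy_idp_score(before: int, after: int, action: str) -> float:
    # Reward is higher (closer to 0) when the observed change matches the action.
    target = int(action.split()[1])
    return -abs((after - before) - target)
```

In this sketch the verifier only needs to rank candidates, so any scalar score from the IDP model would do; a best-of-n selection is one simple way to use such a score, and other search procedures are possible.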


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
