arXiv

SKIP: Sparse Keyframe Interpolation Paradigm for Efficient Embodied World Models

Abstract:

Embodied world models have recently gained traction in robotics as a way to forecast how a robot’s actions change its environment. However, generating long-horizon manipulation videos in pixel space remains computationally prohibitive, as these sequences typically require frame-by-frame synthesis. Simply discarding frames to lower costs is not a viable solution, because downstream policies depend on the intact representation of critical, sparse events like approaching, contacting, grasping, and releasing objects.

To overcome this limitation, we introduce the Sparse Keyframe Interpolation Paradigm (SKIP). This framework operates on an event-preserving, sparse-to-dense principle, eliminating the need for dense, frame-by-frame generation. SKIP begins by detecting task-critical keyframes using multimodal features that are aware of the robot’s state. It then employs a sparse video diffusion model to generate only these essential frames. Subsequently, a learned gap predictor and an action-conditioned interpolator are used to fill in the missing temporal intervals based on the robot’s actions.
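The sparse-to-dense idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names `select_keyframes` and `densify` are hypothetical, a simple threshold crossing on a gripper signal stands in for the multimodal keyframe detector, the gap lengths are read directly from the keyframe indices rather than predicted by a learned model, and plain linear interpolation stands in for the action-conditioned interpolator. Frames are represented as flat lists of floats.

```python
from typing import List, Sequence


def select_keyframes(gripper: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Pick the first frame, the last frame, and every frame where the gripper
    signal crosses `threshold` (a crude proxy for grasp/release events).
    Assumes a non-empty signal."""
    keys = {0, len(gripper) - 1}
    for t in range(1, len(gripper)):
        was_open = gripper[t - 1] < threshold
        is_open = gripper[t] < threshold
        if was_open != is_open:
            keys.add(t)
    return sorted(keys)


def densify(key_idx: Sequence[int],
            key_frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Fill the gaps between consecutive keyframes by linear interpolation.
    In this toy version the gap length is simply the index difference; the
    paper describes a learned gap predictor and interpolator instead."""
    dense: List[List[float]] = []
    for k in range(len(key_idx) - 1):
        i0, i1 = key_idx[k], key_idx[k + 1]
        f0, f1 = key_frames[k], key_frames[k + 1]
        gap = i1 - i0
        for step in range(gap):
            a = step / gap  # interpolation weight in [0, 1)
            dense.append([(1 - a) * x + a * y for x, y in zip(f0, f1)])
    dense.append(list(key_frames[-1]))  # close the sequence with the final keyframe
    return dense


# Toy rollout: the gripper closes at t=3 and reopens at t=6.
gripper_signal = [0, 0, 0, 1, 1, 1, 0, 0]
key_idx = select_keyframes(gripper_signal)

# Hypothetical one-pixel "frames" whose value equals the frame index.
key_frames = [[float(i)] for i in key_idx]
dense_rollout = densify(key_idx, key_frames)
```

Only the keyframes would need to be produced by the expensive generative model; every other frame comes from the cheaper fill-in step, which is where the cost saving in a sparse-to-dense scheme comes from.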

Experiments on the LIBERO benchmark demonstrate that SKIP produces dense rollouts $4.16\times$ more quickly than a dense baseline, while simultaneously enhancing visual quality and lowering the aggregate Fréchet Video Distance (FVD) by $89.0\%$. Crucially, videos generated by SKIP serve as high-quality data for policy training. When SKIP-generated videos completely substitute for real-world demonstrations, the success rate of $\pi_{0.5}$ decreases by only $1.3$ percentage points in LIBERO simulations and $6.7$ percentage points on physical robots. In contrast, policies trained on fully dense, frame-by-frame generated videos suffer a dramatic performance collapse, with success rates dropping by $48$ to $58$ percentage points.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
