SKIP: Sparse Keyframe Interpolation Paradigm for Efficient Embodied World Models
Title: SKIP: A Sparse Keyframe Interpolation Paradigm for Efficient Embodied World Models
Abstract:
Embodied world models have recently gained traction in robotics as a way to forecast how a robot’s actions affect its environment. However, generating long-horizon manipulation videos in pixel space remains computationally prohibitive, because these sequences are typically synthesized frame by frame. Simply discarding frames to cut cost is not viable: downstream policies depend on sparse but critical events, such as approaching, contacting, grasping, and releasing objects, being preserved intact.
To overcome this limitation, we introduce the Sparse Keyframe Interpolation Paradigm (SKIP). This framework operates on an event-preserving, sparse-to-dense principle, eliminating the need for dense, frame-by-frame generation. SKIP begins by detecting task-critical keyframes using multimodal features that are aware of the robot’s state. It then employs a sparse video diffusion model to generate only these essential frames. Subsequently, a learned gap predictor and an action-conditioned interpolator are used to fill in the missing temporal intervals based on the robot’s actions.
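As a rough illustration of the sparse-to-dense idea, the toy Python sketch below selects keyframes and then fills the gaps between them. It is not the SKIP implementation. The function names `select_keyframes` and `densify`, the gripper-change threshold, and the use of linear blending are all assumptions made for illustration. The thresholding stands in for the multimodal keyframe detector, the fixed keyframe indices stand in for the learned gap predictor, and the linear blend stands in for the action-conditioned interpolator. Frames are represented as flat lists of floats rather than images.

```python
from typing import List, Sequence


def select_keyframes(gripper: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Toy keyframe selector (hypothetical, assumes a non-empty sequence).

    Keeps the first and last frame, plus every interior frame where the
    gripper signal jumps by more than `threshold` relative to the previous
    frame. This is only a stand-in for a learned, state-aware detector.
    """
    last = len(gripper) - 1
    keys = [0]
    for t in range(1, last):
        if abs(gripper[t] - gripper[t - 1]) > threshold:
            keys.append(t)
    if last > 0:
        keys.append(last)
    return keys


def densify(keys: List[int], key_frames: List[List[float]]) -> List[List[float]]:
    """Fill the gap between consecutive keyframes by linear blending.

    Linear blending is a placeholder for an action-conditioned interpolator;
    gap lengths are read from the keyframe indices instead of being predicted.
    """
    dense: List[List[float]] = []
    pairs = zip(zip(keys, key_frames), zip(keys[1:], key_frames[1:]))
    for (t0, f0), (t1, f1) in pairs:
        gap = t1 - t0
        for step in range(gap):
            w = step / gap  # blend weight, 0 at the left keyframe
            dense.append([(1 - w) * a + w * b for a, b in zip(f0, f1)])
    dense.append(list(key_frames[-1]))  # close the sequence with the final keyframe
    return dense


# Usage: an 8-step episode where the gripper closes at step 3 and opens at step 6.
gripper = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
keys = select_keyframes(gripper)
key_frames = [[float(k)] for k in keys]  # one-pixel "frames" for the demo
dense = densify(keys, key_frames)
```

In this toy run only the selected keyframes would need to be produced by an expensive generator, and the dense sequence recovered by `densify` has the same length as the original episode.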
Experiments on the LIBERO benchmark demonstrate that SKIP produces dense rollouts $4.16\times$ faster than a dense baseline, while also improving visual quality and lowering the aggregate Fréchet Video Distance (FVD) by $89.0\%$. Crucially, videos generated by SKIP serve as high-quality data for policy training. When SKIP-generated videos fully replace real-world demonstrations, the success rate of $\pi_{0.5}$ drops by only $1.3$ percentage points in LIBERO simulations and $6.7$ percentage points on physical robots. In contrast, policies trained on fully dense, frame-by-frame generated videos collapse, with success rates dropping by $48$ to $58$ percentage points.
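To make the reported ratios concrete, the short Python sketch below converts the headline numbers into fractions of the dense baseline. The helper names are hypothetical, and the calculation assumes the $4.16\times$ figure is a ratio of generation times and the $89.0\%$ figure is a relative reduction in FVD. Note that the policy results are stated in percentage points, which are absolute differences in success rate and are not converted here.

```python
def fraction_of_baseline_time(speedup: float) -> float:
    """Generation time as a fraction of the baseline, given a speedup factor."""
    return 1.0 / speedup


def remaining_fraction(percent_reduction: float) -> float:
    """Fraction of a metric that remains after a relative reduction in percent."""
    return 1.0 - percent_reduction / 100.0


time_frac = fraction_of_baseline_time(4.16)  # roughly 0.24 of the baseline time
fvd_frac = remaining_fraction(89.0)          # roughly 0.11 of the baseline FVD
```

Under these assumptions, a $4.16\times$ speedup corresponds to about 24% of the baseline generation time, and an $89.0\%$ reduction leaves about 11% of the baseline FVD.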
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





