Unified Video-Action Joint Denoising for Dexterous Action and Data Generation
Abstract: Current world action models build upon video foundation models by aligning extensive visual-dynamics priors with executable robot movements. This study re-examines that alignment through a distributional lens. While conventional approaches typically restrict the aligned prior to an observation-conditioned policy distribution focused on future actions, our method maintains a broader distribution. We achieve this by modeling the joint space of interaction videos and executable hand trajectories across various conditioning regimes. We introduce Donk, a unified denoising model for dexterous hands that handles both video and action. Given language prompts, an initial image, and the starting hand state, Donk functions as an action policy, sampling future videos alongside bimanual MANO trajectories. If the image condition is omitted, the identical denoising architecture generates paired video-action rollouts from a text-conditioned distribution, effectively transforming the aligned video prior into a data generation engine. Evaluations across action, video, and text-only generation tasks demonstrate that Donk enhances dexterous trajectory accuracy, maintains high video fidelity, and yields smooth text-conditioned action rollouts, all achieved under a single, unified training framework.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
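
The two conditioning regimes described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `Condition`, `regime`, `toy_denoiser`, and `sample_joint` are hypothetical names, the "video" and "action" are short lists of floats rather than video latents and bimanual MANO trajectories, and the denoiser is a hand-written stand-in for a learned network. It only shows the control flow the abstract implies: one sampler denoises video and action jointly, and omitting the image condition switches it from policy-style sampling to text-conditioned data generation.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Condition:
    text: str
    image: Optional[List[float]] = None       # initial-frame features; None = omitted (assumed encoding)
    hand_state: Optional[List[float]] = None  # starting hand state; None = omitted (assumed encoding)


def regime(cond: Condition) -> str:
    """Name the conditioning regime implied by which inputs are present."""
    return "policy" if cond.image is not None else "data_generation"


def toy_denoiser(video: List[float], action: List[float],
                 cond: Condition) -> Tuple[List[float], List[float]]:
    """Hypothetical stand-in for the learned network.

    Returns a clean estimate for both modalities. The targets are simple
    functions of whichever conditions are present; a real model would be
    a trained neural network.
    """
    text_code = (len(cond.text) % 7) / 7.0
    video_target = sum(cond.image) / len(cond.image) if cond.image else text_code
    action_target = (sum(cond.hand_state) / len(cond.hand_state)
                     if cond.hand_state else text_code)
    return [video_target] * len(video), [action_target] * len(action)


def sample_joint(cond: Condition, video_len: int = 4, action_len: int = 4,
                 steps: int = 10, seed: int = 0) -> Tuple[List[float], List[float]]:
    """Jointly denoise a toy video and a toy action trajectory from noise."""
    rng = random.Random(seed)
    video = [rng.gauss(0.0, 1.0) for _ in range(video_len)]
    action = [rng.gauss(0.0, 1.0) for _ in range(action_len)]
    for step in range(steps):
        v_hat, a_hat = toy_denoiser(video, action, cond)
        w = 1.0 / (steps - step)  # fraction of the remaining gap closed this step
        # Both modalities are updated in the same step by the same model call.
        video = [v + w * (vh - v) for v, vh in zip(video, v_hat)]
        action = [a + w * (ah - a) for a, ah in zip(action, a_hat)]
    return video, action


# Same sampler, two regimes: with and without the image condition.
policy_cond = Condition(text="pick up the cup", image=[0.2, 0.4], hand_state=[0.0, 1.0])
textonly_cond = Condition(text="pick up the cup")
policy_video, policy_action = sample_joint(policy_cond)
gen_video, gen_action = sample_joint(textonly_cond)
```

In this sketch the only difference between the two calls is which fields of `Condition` are populated; the sampling loop itself is unchanged, which mirrors the single-architecture claim in the abstract under the simplifying assumptions stated above.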






