arXiv

Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation

June 2, 2026 · Hanlin Chen, Jiaxin Wei, Xibin Song, Yifu Wang, Steve Wang, Hongdong Li, Pan Ji, Gim Hee Lee

Original: arXiv:2605.30855v2 (announce type: replace)

Abstract: Frame-wise action-controlled image-to-video generation is a promising paradigm for interactive world simulation, where each control signal should elicit an immediate visual response. However, maintaining visual fidelity and 3D consistency over long autoregressive rollouts remains challenging. Existing 3D-aware methods often suffer from catastrophic drift due to two impediments: information loss from Latent-RGB Cycling, where generated latents are repeatedly decoded to RGB and re-encoded for future conditioning, and the training-inference gap induced by the error-free hypothesis, where clean training memory fails to match prediction-corrupted inference memory. To address these challenges, we present Robust Dreamer, a memory-augmented framework built around how to design 3D memory and how to use it robustly. First, we introduce Latent Gaussian Memory, which anchors diffusion latents inherited from the generation process to Gaussian primitives and recalls them via latent-space Gaussian splatting. This provides dense, geometry-aware, view-aligned conditioning while avoiding accumulated degradation from repeated VAE conversion. Second, we propose Deviation Learning with Dynamic Deviation Archive, which synthesizes rollout-induced latent deviations through a one-step approximation, stores them by autoregressive stage and denoising timestamp, and injects them into historical memory during training. This exposes the generator to realistic corrupted memory states and teaches internal correction before inference. Experiments on ScanNet, DL3DV, and OmniWorldGame demonstrate state-of-the-art long-horizon performance.
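The Latent-RGB Cycling problem described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_decode_encode` is a hypothetical stand-in for one VAE decode plus re-encode pass, modeled as a small circular blur, and it is not the codec used by the authors. The point is only that a lossy, non-idempotent round trip moves the memory further from the original latent on every cycle, while a latent that never leaves latent space is unchanged.

```python
import math

def toy_decode_encode(latent):
    """Hypothetical stand-in for one decode-to-RGB + re-encode pass.

    Modeled as a circular 3-tap blur: a lossy map that is not idempotent,
    so applying it repeatedly keeps removing information.
    """
    n = len(latent)
    return [
        0.25 * latent[(i - 1) % n] + 0.5 * latent[i] + 0.25 * latent[(i + 1) % n]
        for i in range(n)
    ]

def rms_error(a, b):
    """Root-mean-square difference between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def cycling_errors(latent, num_cycles):
    """RMS error against the original latent after each round trip."""
    errors, current = [], list(latent)
    for _ in range(num_cycles):
        current = toy_decode_encode(current)
        errors.append(rms_error(current, latent))
    return errors

# A made-up 1D "latent" with two frequency components.
latent = [math.sin(0.9 * i) + 0.5 * math.cos(2.3 * i) for i in range(32)]

errors = cycling_errors(latent, num_cycles=8)   # memory cycled through the toy codec
kept_in_latent_space = list(latent)             # memory that is never re-encoded
```

In this toy setting the entries of `errors` never decrease from one cycle to the next, which mirrors the accumulated degradation the abstract attributes to repeated VAE conversion.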

Rewrite: Frame-wise, action-controlled image-to-video generation is a promising paradigm for interactive world simulation, because every control input should produce an immediate visual response. Preserving visual fidelity and 3D consistency over long autoregressive rollouts, however, remains difficult. Current 3D-aware approaches frequently suffer catastrophic drift from two sources. The first is Latent-RGB Cycling: generated latents are repeatedly decoded to RGB and re-encoded for later conditioning, and information is lost on each pass. The second is the training-inference gap induced by the error-free hypothesis: memory is clean during training, whereas memory at inference is corrupted by the model's own predictions. To overcome these obstacles, we introduce Robust Dreamer, a memory-augmented framework built around two questions: how to design 3D memory and how to use it robustly. Our approach begins with Latent Gaussian Memory, which anchors diffusion latents inherited from the generation process to Gaussian primitives and recalls them through latent-space Gaussian splatting. This yields dense, geometry-aware, view-aligned conditioning without the cumulative degradation caused by repeated VAE conversion. We then propose Deviation Learning with Dynamic Deviation Archive, which synthesizes rollout-induced latent deviations through a one-step approximation, stores them by autoregressive stage and denoising timestamp, and injects them into historical memory during training. This exposes the generator to realistic corrupted memory states and teaches it to correct them internally before inference. Experiments on ScanNet, DL3DV, and OmniWorldGame demonstrate state-of-the-art long-horizon performance.
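The store-and-inject idea behind the Dynamic Deviation Archive can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `DeviationArchive`, the helpers `latent_deviation` and `inject`, the bounded first-in-first-out eviction, and the choice to define a deviation as the predicted latent minus the clean latent are all hypothetical. The abstract only states that deviations are stored by autoregressive stage and denoising timestamp and injected into historical memory during training.

```python
import random
from collections import defaultdict, deque

class DeviationArchive:
    """Hypothetical archive of latent deviations keyed by (stage, timestep)."""

    def __init__(self, capacity_per_key=4):
        # Assumption: each key keeps a bounded FIFO, so older deviations
        # are evicted as training produces newer ones.
        self._store = defaultdict(lambda: deque(maxlen=capacity_per_key))

    def add(self, stage, timestep, deviation):
        self._store[(stage, timestep)].append(list(deviation))

    def sample(self, stage, timestep, rng):
        """Return one stored deviation for this key, or None if there is none."""
        bucket = self._store.get((stage, timestep))
        return rng.choice(list(bucket)) if bucket else None

def latent_deviation(predicted, clean):
    """Assumed definition: element-wise predicted latent minus clean latent."""
    return [p - c for p, c in zip(predicted, clean)]

def inject(memory, deviation):
    """Corrupt clean memory with a deviation; leave it unchanged if none exists."""
    if deviation is None:
        return list(memory)
    return [m + d for m, d in zip(memory, deviation)]

rng = random.Random(0)
archive = DeviationArchive(capacity_per_key=2)

clean = [0.0, 1.0, 2.0]        # clean training memory (made-up values)
predicted = [0.5, 0.5, 2.5]    # what a rollout might have produced instead

archive.add(stage=1, timestep=500, deviation=latent_deviation(predicted, clean))

corrupted = inject(clean, archive.sample(1, 500, rng))  # stage with a stored deviation
untouched = inject(clean, archive.sample(2, 500, rng))  # stage with nothing stored yet
```

Training on `corrupted` rather than `clean` is what would expose a generator to the prediction-corrupted memory it meets at inference. Keying by stage and timestep lets the injected deviation match the point in the rollout and the denoising schedule at which it was observed.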
