arXiv

MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents

June 2, 2026 · Minkyung Kwon, Jinhyeok Choi, Youngjin Shin, Jaeyeong Kim, JongMin Lee, Seungryong Kim

Original: arXiv:2606.02491v1

Abstract:

We present MORPHOS, a novel autoregressive framework that generates dynamic 3D assets from videos across diverse representations, including meshes, 3D Gaussians, and radiance fields. Existing methods are typically limited to a single representation, struggle to model topological changes, or fail to maintain temporal consistency over long videos. To address these limitations, we introduce the Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance along the temporal dimension. Leveraging T-SLAT, MORPHOS autoregressively generates dynamic 3D assets via causal attention, conditioning each frame on its preceding history to ensure temporal consistency while handling evolving topologies. We also propose a temporal-structural augmentation to mitigate error accumulation in autoregressive generation. MORPHOS achieves state-of-the-art performance in appearance and competitive results in geometry across multiple benchmarks, demonstrating superior generalization across various representations and robustness in long-horizon generation.

Rewritten:

Abstract:

We introduce MORPHOS, an autoregressive framework that generates dynamic 3D assets from video input. MORPHOS supports multiple output representations, including meshes, 3D Gaussians, and radiance fields. Existing methods are typically confined to a single representation, struggle to model topological changes, or fail to preserve temporal consistency over long videos. To address these limitations, we propose Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance along the temporal dimension. Building on T-SLAT, MORPHOS generates dynamic 3D assets autoregressively using causal attention: each frame is conditioned on the frames that precede it, which maintains temporal consistency while accommodating changes in topology. We also introduce a temporal-structural augmentation that mitigates the error accumulation inherent in autoregressive generation. MORPHOS achieves state-of-the-art performance in appearance and competitive results in geometry across multiple benchmarks, demonstrating strong generalization across representations and robustness in long-horizon generation.
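The frame-level conditioning described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how T-SLAT tokens are laid out or how attention is masked, so the function names `frame_causal_mask` and `autoregressive_rollout`, the fixed `tokens_per_frame` layout, and the `step_fn` callback are all hypothetical. The sketch assumes only that each frame contributes a block of latent tokens and that a token may attend to tokens in its own frame and in earlier frames.

```python
def frame_causal_mask(num_frames, tokens_per_frame):
    """Build a block-causal attention mask over per-frame latent tokens.

    mask[q][k] is True when query token q may attend to key token k,
    i.e. when k belongs to the same frame as q or to an earlier frame.
    The fixed tokens_per_frame layout is an assumption for illustration.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        # Keys from frames up to and including q_frame are visible.
        row = [(k // tokens_per_frame) <= q_frame for k in range(total)]
        mask.append(row)
    return mask


def autoregressive_rollout(first_latent, num_frames, step_fn):
    """Generate a sequence one frame at a time.

    step_fn is a hypothetical stand-in for a model call: it receives the
    list of all latents generated so far and returns the next latent.
    """
    history = [first_latent]
    for _ in range(num_frames - 1):
        history.append(step_fn(history))
    return history


if __name__ == "__main__":
    # Print the mask for 3 frames with 2 tokens each (1 = may attend).
    for row in frame_causal_mask(num_frames=3, tokens_per_frame=2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Because every generated frame is fed back as context for the next one, an early mistake can influence all later frames, which is the error accumulation the abstract's augmentation is intended to reduce.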

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC