Genie 4D: Semantic-Prior-Guided 4D Dynamic Scene Reconstruction
Abstract:
4D reconstruction of dynamic scenes is central to computer vision and robotic perception because it links low-level geometric sensing with high-level semantic understanding. In this work, we introduce Genie 4D, a framework that turns standard smartphone footage into a semantically anchored, action-responsive 4D world model. The architecture pairs a real-time visual-inertial Gaussian splatting front-end, which recovers metric geometry, with a feed-forward 4D backbone. Frozen DINOv3 features stabilize the backbone by serving as structural priors, and these semantic constraints mitigate identity drift during dynamic tracking. Because regression backends often lose fine detail, a conditional diffusion refiner restores high-frequency surface textures.
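To make the role of the semantic prior concrete, the following minimal sketch shows one way a frozen feature descriptor could penalize identity drift: each tracked point keeps a reference descriptor, and the penalty grows as its current descriptor diverges from it. This is an illustration under stated assumptions, not the paper's loss. The function names `cosine_similarity` and `semantic_drift_penalty` are hypothetical, and plain lists of floats stand in for DINOv3 features, which are not computed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as carrying no semantic evidence
    return dot / (norm_a * norm_b)


def semantic_drift_penalty(reference_feats, current_feats):
    """Mean (1 - cosine similarity) over tracks.

    reference_feats[i] is the descriptor stored for track i when it was
    first observed; current_feats[i] is the descriptor sampled at the
    track's current location. Both are assumed (hypothetically) to come
    from a frozen feature extractor, so only the track positions, not
    the features, would be adjusted to reduce this penalty.
    """
    if len(reference_feats) != len(current_feats):
        raise ValueError("one current descriptor is required per track")
    if not reference_feats:
        return 0.0
    total = sum(
        1.0 - cosine_similarity(ref, cur)
        for ref, cur in zip(reference_feats, current_feats)
    )
    return total / len(reference_feats)
```

A track whose descriptor is unchanged contributes zero, while a track that has slid onto a semantically unrelated surface (orthogonal descriptor) contributes one, which is the kind of signal a tracker could use to resist identity drift.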
The pipeline ends with a lightweight latent-action head that connects the reconstructed 4D state to a Genie-style world model. The world model is trained with a JEPA-style next-embedding objective, so the scene can be rolled forward in time in response to user inputs. Evaluated on the Point Odyssey and TUM-Dynamics benchmarks, Genie 4D retains the linear time complexity, O(T), characteristic of feed-forward baselines while significantly improving both 3D tracking accuracy (APD) and reconstruction completeness. The framework runs interactively on a single consumer-grade GPU (RTX 5090) and works with capture clients on iPhone, Mac, Windows, and Linux. Genie 4D thus offers a practical, semantically guided route toward physically grounded world models.
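As a rough illustration of a next-embedding objective, the sketch below predicts the next scene embedding from the current embedding and an action vector, then scores the prediction against a target embedding that is treated as a fixed constant. It is a sketch under stated assumptions rather than the paper's model. The linear predictor, the matrices `W` and `U`, and the function names `predict_next` and `next_embedding_loss` are all hypothetical.

```python
def predict_next(z, action, W, U):
    """Hypothetical linear predictor: z_next_hat = W @ z + U @ action.

    z      : current scene embedding (list of floats)
    action : latent action vector (list of floats)
    W, U   : weight matrices as lists of rows
    """
    return [
        sum(W[i][j] * z[j] for j in range(len(z)))
        + sum(U[i][k] * action[k] for k in range(len(action)))
        for i in range(len(W))
    ]


def next_embedding_loss(z_pred, z_target):
    """Mean squared error in embedding space.

    z_target is assumed to come from a separate target encoder and is
    treated as a constant, so the error is measured between embeddings
    rather than between rendered pixels.
    """
    if len(z_pred) != len(z_target):
        raise ValueError("prediction and target must have the same size")
    return sum((p - t) ** 2 for p, t in zip(z_pred, z_target)) / len(z_target)


# Usage: identity dynamics plus an action that shifts the first coordinate.
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0], [0.0]]
z_hat = predict_next([1.0, 2.0], [0.5], W, U)
loss = next_embedding_loss(z_hat, [0.5, 2.0])
```

Rolling the scene forward for several steps would amount to feeding each prediction back in as the next `z` together with the next action.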
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





