Geometry-Aware Implicit Memory for Video World Models
Abstract:
While video world models strive to generate controllable visual environments, their ability to execute long-horizon rollouts is often limited by how well they retain information once observations exceed the model’s native context window. Current approaches typically rely on explicit memories, such as storing individual frames or performing online 3D reconstructions; however, these methods are prone to heuristic retrieval failures, excessive storage of redundant visual data, and artifacts resulting from reconstruction processes. Alternatively, implicit memory systems compress historical data into a compact state representation, yet prior designs have lacked explicit constraints to ensure the accurate encoding of cross-view scene geometry.
To address these limitations, we introduce GIM-World, a novel framework that integrates geometry-aware implicit memory into video world models. This architecture utilizes a lightweight transformer encoder to distill variable-length historical sequences into a fixed-size set of memory tokens. During the training phase, a camera-queryable geometry head extracts 3D scene structures from a frozen foundation model, embedding this geometric knowledge into the memory. Furthermore, an information-guided pruning mechanism ensures that the computational cost of encoding remains manageable as the history expands. Notably, the geometry teacher component is removed during inference, resulting in a streamlined and efficient memory module. Evaluations on the MIND dataset demonstrate that GIM-World outperforms both explicit and implicit memory baselines in maintaining geometric and visual consistency over extended horizons.
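The compression step described above, which turns a variable-length history into a fixed-size set of memory tokens, can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and not the GIM-World implementation: the names `prune_history` and `compress_to_memory`, the single-head cross-attention with a fixed set of query tokens, the random (untrained) weights, and in particular the L2-norm score, which stands in for the information-guided pruning criterion that the abstract does not specify. The geometry head and the frozen teacher are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def prune_history(history, budget):
    """Keep at most `budget` history tokens, preserving temporal order.

    The score is the token's L2 norm: a placeholder assumption, not the
    paper's information criterion.
    """
    if history.shape[0] <= budget:
        return history
    scores = np.linalg.norm(history, axis=1)
    keep = np.sort(np.argsort(scores)[-budget:])  # top-`budget`, in time order
    return history[keep]


def compress_to_memory(history, queries, w_k, w_v):
    """Single-head cross-attention: M query tokens attend over T history tokens.

    The output has shape (M, D) regardless of T, which is what makes the
    memory fixed-size.
    """
    keys = history @ w_k                                    # (T, D)
    values = history @ w_v                                  # (T, D)
    logits = queries @ keys.T / np.sqrt(queries.shape[1])   # (M, T)
    attn = softmax(logits, axis=-1)                         # rows sum to 1
    return attn @ values                                    # (M, D)


# Hypothetical sizes: token width D, M memory tokens, pruning budget.
rng = np.random.default_rng(0)
D, M, BUDGET = 16, 4, 32
queries = rng.normal(size=(M, D))           # would be learned in practice
w_k = rng.normal(size=(D, D)) / np.sqrt(D)  # untrained projection weights
w_v = rng.normal(size=(D, D)) / np.sqrt(D)

for T in (10, 200):
    history = rng.normal(size=(T, D))
    pruned = prune_history(history, BUDGET)
    memory = compress_to_memory(pruned, queries, w_k, w_v)
    print(T, pruned.shape[0], memory.shape)
```

For both history lengths the memory has shape `(M, D)`, and the number of tokens entering the attention step never exceeds `BUDGET`, so the encoding cost in this sketch stays bounded as the history grows.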
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





