Beyond Pixel Histories: World Models with Persistent 3D State
Title: Moving Past Pixel-Based Histories: World Models with Enduring 3D States
Abstract: Interactive world models facilitate open-ended video generation by dynamically responding to user inputs. Yet, most current approaches fail to incorporate an explicit 3D environmental representation. Consequently, these models must implicitly deduce 3D consistency from data, while their spatial memory is confined to short temporal windows. This limitation leads to unnatural interactions and hinders applications like agent training. To overcome these challenges, we introduce PERSIST, a novel world model framework that simulates the progression of a latent 3D scene, encompassing the environment, camera movements, and rendering processes. This approach enables the synthesis of new frames endowed with persistent spatial memory and geometric consistency. Our evaluation, comprising both quantitative metrics and a qualitative user study, reveals significant enhancements in spatial memory, 3D consistency, and long-horizon stability compared to existing methods, thereby supporting the creation of coherent, evolving 3D worlds. Additionally, we showcase new functionalities, such as generating varied 3D environments from a single image and allowing for precise, geometry-aware control through direct 3D space editing and specification.
Project page: https://francelico.github.io/persist.github.io
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC