Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends
Title: Navigating the Frontiers of Interactive Video World Modeling: Current Landscape, Obstacles, Evaluation Standards, and Upcoming Trajectories
Abstract
The accelerating evolution of diffusion-based content generation and large language models has positioned world modeling as a focal point of research interest, yielding significant benefits for downstream sectors including embodied AI, autonomous driving, and game engines. By explicitly integrating user actions into the mechanics of world state transitions, recent studies have endowed world modeling with interactivity. This shift toward action-conditioned video or 3D generation paradigms not only strengthens control over world dynamics but also enables users to freely navigate, manipulate, and personalize their environments.
This paper provides a systematic review of the latest trends, technical advancements, and evaluation benchmarks in interactive world modeling, and outlines potential future avenues. We begin by summarizing current developments across application scenarios, scene modalities, and the evolution of world states. The discussion then moves to three pivotal technical hurdles: ensuring action-conditioned controllability, managing long-horizon interactions and memory, and responding to actions quickly enough to support real-time interactivity. Additionally, we conduct a comprehensive comparison of existing benchmarks and metrics across four key domains: robotics, autonomous driving, game engines, and open-world exploration. Finally, we explore promising directions for the next generation of interactive world modeling. The associated repository is accessible at: https://github.com/liujiuming123/Awesome-Interactive-World-Model.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





