MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data
Abstract:
Video world models are a cornerstone of generative technologies for embodied AI and the Metaverse, but current methods are limited to a single agent viewing the world from one perspective. Extending them to multi-agent scenarios raises two significant hurdles. The first is data scarcity: gathering coordinated multi-view recordings for broad, open-domain contexts is prohibitively costly. The second is world-state alignment: independently generated video streams do not guarantee that shared physical settings and events unfold consistently across viewpoints.
To overcome these obstacles, we introduce MetaWorld, an innovative framework designed to scale multi-agent video world models to open-domain environments using only single-view video data. Our approach begins with Monocular World-State Unrolling (MWSU), a technique that explicitly separates monocular footage into the camera operator’s ego-motion and the spatial trajectory of the visible subject. This decomposition of camera and subject movement naturally yields synchronized multi-agent motion data within a unified 3D space, eliminating the necessity for multi-camera installations.
For precise visual control, we develop the Subject-Aware World Generator, which supports appearance-driven simulation conditioned on identity images specific to each agent. Furthermore, to ensure that all views are anchored in the same physical reality, we propose World-State Alignment (WSA). This mechanism inserts per-frame inter-branch cross-attention at every transformer layer of the video DiT. By synchronizing the denoising process, WSA enforces both static geometric consistency and dynamic motion consistency, so that the shared 3D environment and physical events remain well-aligned across egocentric viewpoints. Comprehensive experiments confirm that MetaWorld delivers superior cross-view consistency and identity fidelity, thereby establishing a highly scalable, physics-driven paradigm for multi-agent video world modeling.
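
To make the phrase "per-frame inter-branch cross-attention" concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes two view branches whose latent tokens have shape (frames, tokens per frame, channels), a single attention head, and a residual connection. The function name `per_frame_cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and all sizes are hypothetical and chosen only for illustration. The one property it is meant to show is that tokens of frame t in one branch attend only to tokens of the same frame t in the other branch.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def per_frame_cross_attention(own, other, w_q, w_k, w_v):
    """Illustrative single-head cross-attention between two view branches.

    own, other: arrays of shape (T, N, D) = (frames, tokens per frame, channels).
    Queries come from `own`; keys and values come from `other`.
    The leading axis T acts as a batch axis, so frame t of `own`
    only sees frame t of `other` (the "per-frame" restriction).
    """
    q = own @ w_q                                  # (T, N, D)
    k = other @ w_k                                # (T, N, D)
    v = other @ w_v                                # (T, N, D)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (T, N, N)
    attn = softmax(scores, axis=-1)
    return own + attn @ v                          # residual update, (T, N, D)


# Hypothetical sizes and random stand-ins for latents and learned weights.
rng = np.random.default_rng(0)
T, N, D = 4, 6, 8
branch_a = rng.normal(size=(T, N, D))
branch_b = rng.normal(size=(T, N, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

# Each branch is updated with information from the other branch.
out_a = per_frame_cross_attention(branch_a, branch_b, w_q, w_k, w_v)
out_b = per_frame_cross_attention(branch_b, branch_a, w_q, w_k, w_v)
```

In a full model, a block like this would presumably sit alongside the usual self-attention and feed-forward sublayers at each layer, with learned, possibly multi-head projections. Those details are not specified in the abstract and are left out here.
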
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



