arXiv

MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

Abstract:

Video world models serve as a cornerstone for generative technologies in embodied AI and the Metaverse; however, current methodologies are constrained by their reliance on single agents viewing the world from one perspective. Expanding these models to accommodate multi-agent scenarios presents two significant hurdles: a lack of data, since gathering coordinated multi-view recordings for broad, open-domain contexts is prohibitively costly, and the issue of world state alignment, where independently produced video streams fail to guarantee that shared physical settings and events unfold consistently across different viewpoints.

To overcome these obstacles, we introduce MetaWorld, an innovative framework designed to scale multi-agent video world models to open-domain environments using only single-view video data. Our approach begins with Monocular World-State Unrolling (MWSU), a technique that explicitly separates monocular footage into the camera operator’s ego-motion and the spatial trajectory of the visible subject. This decomposition of camera and subject movement naturally yields synchronized multi-agent motion data within a unified 3D space, eliminating the necessity for multi-camera installations.
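
The geometric idea behind placing the camera operator and the visible subject in one 3D space can be sketched as follows. This is a minimal illustration and not the authors' MWSU pipeline: it assumes that per-frame camera-to-world poses and per-frame subject positions in the camera frame have already been estimated by some upstream method, and the function name `unroll_world_state` and its argument names are hypothetical.

```python
import numpy as np

def unroll_world_state(cam_rotations, cam_translations, subject_in_cam):
    """Place the camera operator and the subject in one world frame.

    Assumed (hypothetical) inputs, one entry per video frame:
      cam_rotations    : (T, 3, 3) camera-to-world rotation matrices
      cam_translations : (T, 3)    camera centers in world coordinates
      subject_in_cam   : (T, 3)    subject positions in the camera frame
    Returns two (T, 3) trajectories that share the same time index:
      the operator's ego-motion and the subject's world trajectory.
    """
    R = np.asarray(cam_rotations, dtype=float)
    t = np.asarray(cam_translations, dtype=float)
    p = np.asarray(subject_in_cam, dtype=float)
    # Rigid transform per frame: p_world = R_t @ p_cam + t_t
    subject_world = np.einsum("tij,tj->ti", R, p) + t
    operator_world = t.copy()
    return operator_world, subject_world

# Toy example: the camera walks along +x while the subject stays 2 units ahead.
T = 4
rotations = np.tile(np.eye(3), (T, 1, 1))
translations = np.stack([np.arange(T, dtype=float), np.zeros(T), np.zeros(T)], axis=1)
subject_cam = np.tile(np.array([0.0, 0.0, 2.0]), (T, 1))
operator_traj, subject_traj = unroll_world_state(rotations, translations, subject_cam)
```

Because both trajectories are indexed by the same frames, a single monocular clip yields two time-synchronized agent paths, which is the property the paragraph above relies on.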

For precise visual control, we develop the Subject-Aware World Generator, which enables appearance-driven simulation conditioned on an identity image for each agent. Furthermore, to guarantee that all views are anchored in the same physical reality, we propose World-State Alignment (WSA). This mechanism employs per-frame inter-branch cross-attention, inserted at every transformer layer of the video DiT. By synchronizing the denoising process across branches, WSA enforces both static geometric consistency and dynamic motion consistency, ensuring that the shared 3D environment and physical events remain well-aligned across egocentric viewpoints. Comprehensive experiments confirm that MetaWorld delivers superior cross-view consistency and identity fidelity, thereby establishing a highly scalable, physics-driven paradigm for multi-agent video world modeling.
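
To make "per-frame inter-branch cross-attention" concrete, the sketch below shows one possible reading for two agent branches. It is an assumption-laden illustration, not the paper's implementation: it uses a single attention head, a residual update, weights shared between the two directions, and hypothetical names (`per_frame_cross_attention`, `align_branches`); the abstract does not specify any of these details.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def per_frame_cross_attention(x_query, x_other, w_q, w_k, w_v):
    """Tokens of one branch attend to the other branch's tokens, frame by frame.

    x_query, x_other : (T, N, D) latent tokens (T frames, N tokens per frame)
    w_q, w_k, w_v    : (D, D) projection matrices (hypothetical, single head)
    """
    q = x_query @ w_q
    k = x_other @ w_k
    v = x_other @ w_v
    # The shared leading index t restricts attention to the same frame.
    scores = np.einsum("tnd,tmd->tnm", q, k) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    # Residual update: each branch keeps its own content and adds cross-view context.
    return x_query + np.einsum("tnm,tmd->tnd", attn, v)

def align_branches(x_a, x_b, w_q, w_k, w_v):
    """Symmetric exchange between two agent branches within one layer."""
    new_a = per_frame_cross_attention(x_a, x_b, w_q, w_k, w_v)
    new_b = per_frame_cross_attention(x_b, x_a, w_q, w_k, w_v)
    return new_a, new_b

rng = np.random.default_rng(0)
T, N, D = 3, 4, 8
x_a = rng.normal(size=(T, N, D))
x_b = rng.normal(size=(T, N, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))
out_a, out_b = align_branches(x_a, x_b, w_q, w_k, w_v)
```

In this reading, frame t of one view only sees frame t of the other view, so the two denoising branches exchange information about the same instant. Applying such a block at every transformer layer, as the text describes, would repeat this exchange throughout the network.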


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
