arXiv

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

June 4, 2026 · Jiayi Zhu, Jianing Zhang, Yiying Yang, Wei Cheng, Xiaoyun Yuan

Abstract:

This study introduces ShareVerse, a novel video generation framework designed to facilitate multi-agent shared world modeling. It addresses a critical limitation in current research: the absence of unified methods for constructing shared environments involving multi-agent interactions. By harnessing the generative power of large-scale video models, ShareVerse incorporates three distinct technical advancements. First, it establishes a comprehensive dataset for large-scale multi-agent world modeling, developed using the CARLA simulation platform. This dataset encompasses a wide variety of scenes and weather conditions, providing paired multi-view videos (capturing front, rear, left, and right perspectives for each agent) alongside corresponding camera data.

Second, the framework applies a spatial concatenation strategy to the four-view videos from independent agents. This lets the model cover a larger environment while keeping the views geometrically consistent with one another. Third, cross-agent attention blocks are inserted into the pretrained video model. These blocks exchange spatio-temporal information between agents, which keeps overlapping regions of the shared world consistent and yields plausible outputs in non-overlapping regions. ShareVerse generates large-scale videos of 49 frames, accurately detects the positions of dynamic agents, and achieves coherent shared world modeling.
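The spatial concatenation and cross-agent attention described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The 2x2 tiling layout, the function names concat_four_views and cross_agent_attention, the single attention head, the absence of learned projections, and the residual connection are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def concat_four_views(front, rear, left, right):
    """Tile four view videos of one agent into a single larger video.

    Each input has shape (T, H, W, C). The output has shape (T, 2H, 2W, C).
    The 2x2 layout (front/right on top, left/rear below) is an assumption.
    """
    top = np.concatenate([front, right], axis=2)    # join along width
    bottom = np.concatenate([left, rear], axis=2)   # join along width
    return np.concatenate([top, bottom], axis=1)    # stack along height


def cross_agent_attention(tokens_a, tokens_b):
    """Let agent A's tokens attend to agent B's tokens.

    tokens_a: (N_a, D) flattened spatio-temporal tokens of agent A.
    tokens_b: (N_b, D) flattened spatio-temporal tokens of agent B.
    Plain scaled dot-product attention with a residual connection;
    a real block would add learned query/key/value projections.
    """
    d = tokens_a.shape[-1]
    scores = tokens_a @ tokens_b.T / np.sqrt(d)     # (N_a, N_b)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over B's tokens
    return tokens_a + weights @ tokens_b            # residual update of A


# Hypothetical toy sizes: 49 frames, 4x6 pixels per view, 3 channels.
rng = np.random.default_rng(0)
views = [rng.standard_normal((49, 4, 6, 3)) for _ in range(4)]
canvas = concat_four_views(*views)
print(canvas.shape)  # (49, 8, 12, 3)
```

In this sketch each agent's tokens are updated using the other agent's tokens, which is one simple way information about overlapping regions could flow between agents.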

Source: arXiv