Coding Agent Is Good As World Simulator
Title: Coding Agents Perform Well as World Simulators
Original: arXiv:2605.14398v2 Announce Type: replace Abstract: World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demonstrating impressive progress in generating visually plausible dynamics. However, because these models typically infer dynamics from video and represent them in latent states, they do not explicitly enforce physical constraints. As a result, the generated video rollouts are not physically plausible, exhibiting unstable contacts, distorted shapes, or inconsistent motion. In this paper, we present an agentic framework constructing physics-based world models through executable simulation code. The framework coordinates planning, code generation, visual review, and physics analysis agents. The planning agent converts the natural language prompt into a structured scene plan, the code agent implements it as executable simulation code, and the visual review agent provides visual feedback while the physics analysis agent checks physical consistency. The code is iteratively revised based on the feedback until the simulation matches the prompt requirements and physical constraints. Experimental results show that our framework outperforms advanced video-based models in physical accuracy, instruction fidelity and visual quality, which could be applied to various scenarios including driving simulation and embodied robot tasks.
Rewrite: World models have become a powerful approach for building interactive simulation environments, and recent video-based methods have made impressive progress in generating visually plausible dynamics. However, because these models usually infer dynamics from video and encode them in latent states, they do not explicitly enforce physical constraints. As a result, the generated video rollouts are often physically implausible, showing unstable contacts, distorted shapes, or inconsistent motion. This paper introduces an agentic framework that builds physics-based world models as executable simulation code. The framework coordinates four agents: planning, code generation, visual review, and physics analysis. The planning agent converts the natural language prompt into a structured scene plan, and the code agent implements that plan as executable simulation code. The visual review agent then provides visual feedback, and the physics analysis agent checks physical consistency. The code is revised iteratively based on this feedback until the simulation satisfies both the prompt’s requirements and the physical constraints. Experiments show that the framework outperforms advanced video-based models in physical accuracy, instruction fidelity, and visual quality, with potential applications in scenarios such as driving simulation and embodied robot tasks.
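The plan, generate, review, revise loop described in the abstract can be sketched as a small control loop. This is a minimal illustration only, not the paper's implementation: the names `refine`, `Feedback`, and the `toy_*` agents are hypothetical, and the agents are deterministic stubs standing in for what would presumably be model calls and a real physics engine.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Plan = Dict[str, Any]  # hypothetical structured scene plan


@dataclass
class Feedback:
    """Issues reported by one reviewing agent; an empty list means approval."""
    source: str
    issues: List[str] = field(default_factory=list)


def refine(
    prompt: str,
    plan_agent: Callable[[str], Plan],
    code_agent: Callable[[Plan, List[str]], str],
    reviewers: List[Callable[[Plan, str], Feedback]],
    max_rounds: int = 5,
) -> Tuple[str, int, bool]:
    """Plan once, then generate and review code until no reviewer objects."""
    plan = plan_agent(prompt)
    issues: List[str] = []
    code = ""
    for round_idx in range(1, max_rounds + 1):
        # The code agent sees the issues raised in the previous round (none at first).
        code = code_agent(plan, issues)
        issues = [
            f"{fb.source}: {issue}"
            for fb in (review(plan, code) for review in reviewers)
            for issue in fb.issues
        ]
        if not issues:
            return code, round_idx, True
    return code, max_rounds, False  # round budget exhausted without approval


# --- Toy stand-ins (assumptions for illustration only) ---
def toy_plan(prompt: str) -> Plan:
    return {"objects": ["ball", "ramp"], "gravity": -9.81}


def toy_code(plan: Plan, issues: List[str]) -> str:
    # The first draft deliberately gets the sign of gravity wrong;
    # it is corrected only after a reviewer flags it.
    flagged = any("gravity" in issue for issue in issues)
    gravity = plan["gravity"] if flagged else abs(plan["gravity"])
    return f"gravity = {gravity}\nobjects = {plan['objects']!r}"


def toy_visual_review(plan: Plan, code: str) -> Feedback:
    missing = [name for name in plan["objects"] if name not in code]
    return Feedback("visual", [f"missing object {name}" for name in missing])


def toy_physics_review(plan: Plan, code: str) -> Feedback:
    consistent = f"gravity = {plan['gravity']}" in code
    return Feedback("physics", [] if consistent else ["gravity does not match the plan"])


final_code, rounds, accepted = refine(
    "a ball rolls down a ramp",
    toy_plan,
    toy_code,
    [toy_visual_review, toy_physics_review],
)
print(rounds, accepted)  # prints: 2 True
```

In this toy run the physics reviewer rejects the first draft and the second draft passes both reviewers. The `max_rounds` cap is an added assumption; the abstract only states that revision continues until the prompt requirements and physical constraints are met.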
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




