WALL-WM: Carving World Action Modeling at the Event Joints
Abstract
WALL-WM shifts World Action Modeling (WAM) from chunk-centric optimization to event-grounded Vision-Language-Action (VLA) pretraining, in which semantically coherent action events are the atomic unit of learning.
Current WAMs typically start from multimodal or video foundation models and then optimize fixed-length action chunks conditioned on the current observation and instruction. This is convenient, but it introduces a granularity mismatch: language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales. Forcing all three modalities into a single fixed-length prediction window reduces VLA training to short-horizon correlation fitting.
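The granularity mismatch can be illustrated with a toy trajectory. The sketch below is not taken from the paper: the per-step event labels, the horizon of 4, and the helper names `fixed_chunks` and `event_segments` are all hypothetical, chosen only to show how fixed-length windows can straddle event boundaries while event-aligned segments do not.

```python
from itertools import groupby


def fixed_chunks(labels, horizon):
    """Split a trajectory into consecutive windows of at most `horizon` steps."""
    return [labels[i:i + horizon] for i in range(0, len(labels), horizon)]


def event_segments(labels):
    """Split a trajectory into maximal runs of steps that share one event label."""
    return [list(group) for _, group in groupby(labels)]


# Hypothetical per-step event labels for a 9-step trajectory.
labels = ["reach"] * 3 + ["grasp"] * 2 + ["lift"] * 4

chunks = fixed_chunks(labels, horizon=4)
segments = event_segments(labels)

# Chunks that contain more than one event cut across an event boundary.
mixed_chunks = [chunk for chunk in chunks if len(set(chunk)) > 1]

print("fixed chunks:  ", chunks)
print("event segments:", segments)
print("chunks spanning several events:", len(mixed_chunks), "of", len(chunks))
```

In this toy case two of the three fixed-length chunks mix steps from different events, whereas each event segment is semantically homogeneous and has its own length.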
WALL-WM resolves this mismatch by organizing both data and supervision around semantic events. It combines event-grounded VLA pretraining with a data system built on event-level captions and cluster-balanced sampling. Together, these support scalable learning across diverse behaviors, environments, and task structures.
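The abstract does not specify how cluster-balanced sampling is implemented. One common reading, sketched below under that assumption, is to pick a cluster uniformly at random and then pick an example uniformly within it, so that rare behaviors are drawn as often as frequent ones. The function name `cluster_balanced_sample`, the cluster labels, and the dataset sizes are hypothetical.

```python
import random
from collections import Counter, defaultdict


def cluster_balanced_sample(cluster_ids, num_samples, seed=0):
    """Draw example indices so every cluster is equally likely per draw.

    Assumption: "balanced" means uniform over clusters, then uniform
    within the chosen cluster. The source may use a different scheme.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(index)
    clusters = sorted(by_cluster)  # fixed order keeps sampling reproducible
    picks = []
    for _ in range(num_samples):
        cluster = rng.choice(clusters)
        picks.append(rng.choice(by_cluster[cluster]))
    return picks


# Hypothetical, heavily imbalanced dataset: 90 / 9 / 1 examples per cluster.
cluster_ids = ["pick"] * 90 + ["pour"] * 9 + ["fold"] * 1

picks = cluster_balanced_sample(cluster_ids, num_samples=3000)
counts = Counter(cluster_ids[i] for i in picks)
print(counts)
```

Under this scheme each cluster receives roughly one third of the draws despite the 90:9:1 imbalance in the raw data. The trade-off is that examples from small clusters are repeated many times.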
Built on the same event-pretrained backbone, WALL-WM offers two complementary inference modes. Event mode takes a next-event description as input and executes a variable-length segment. Unified mode uses a Vision-Language Model (VLM) with Staircase Decoding to support conventional fixed-length chunk inference while keeping the VLA pathway gradient-continuous. Together with large-scale pretraining infrastructure built on the Muon optimizer, WALL-WM provides a practical recipe for scaling general-purpose WAMs. In large-scale real-world generalization evaluations, WALL-WM achieves state-of-the-art performance and generalizes across diverse languages, scenes, and tasks.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





