arXiv

WALL-WM: Carving World Action Modeling at the Event Joints

Abstract

WALL-WM shifts World Action Modeling (WAM) from chunk-centric optimization to event-grounded Vision-Language-Action (VLA) pretraining, in which semantically coherent action events are the atomic unit of learning.

Current WAMs typically start from multimodal or video foundation models and then optimize fixed-length action chunks conditioned on the current observation and instruction. This is convenient, but it introduces a granularity mismatch: language expresses semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales. Forcing all three modalities into a single fixed-length prediction window reduces VLA training to short-horizon correlation fitting.
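The mismatch can be illustrated with a toy example. The following is a minimal sketch, not the paper's method: the trajectory length, the chunk length, the event boundaries, and all function names are hypothetical and chosen only to show how fixed-length windows can cut across semantic events while event-aligned segments do not.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open interval [start, end) over control steps


def fixed_chunks(n_steps: int, chunk_len: int) -> List[Span]:
    """Split [0, n_steps) into fixed-length windows (the last may be shorter)."""
    return [(s, min(s + chunk_len, n_steps)) for s in range(0, n_steps, chunk_len)]


def event_segments(n_steps: int, boundaries: Sequence[int]) -> List[Span]:
    """Split [0, n_steps) at the given event boundaries (variable-length spans)."""
    inner = sorted(b for b in set(boundaries) if 0 < b < n_steps)
    cuts = [0] + inner + [n_steps]
    return list(zip(cuts[:-1], cuts[1:]))


def straddling(spans: Sequence[Span], boundaries: Sequence[int]) -> int:
    """Count spans that contain an event boundary strictly inside them."""
    return sum(any(s < b < e for b in boundaries) for s, e in spans)


# Hypothetical 20-step trajectory: reach [0, 7), grasp [7, 9), place [9, 20).
boundaries = [7, 9]
chunks = fixed_chunks(20, chunk_len=8)   # [(0, 8), (8, 16), (16, 20)]
events = event_segments(20, boundaries)  # [(0, 7), (7, 9), (9, 20)]

print(straddling(chunks, boundaries))  # 2: two chunks mix parts of different events
print(straddling(events, boundaries))  # 0: event segments never cross a boundary
```

In this toy setting, two of the three fixed-length chunks pair a single instruction with actions from two different events, which is the kind of misalignment the abstract describes.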

WALL-WM resolves this mismatch by organizing both data and supervision around semantic events. It combines event-grounded VLA pretraining with a dedicated data system that provides event-level captions and cluster-balanced sampling. This design supports scalable learning across diverse behaviors, environments, and task structures.
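The abstract does not specify how cluster-balanced sampling is implemented. One common interpretation, sketched below under that assumption, is to draw a cluster uniformly at random and then draw an item uniformly within it, so that rare behaviors are not drowned out by frequent ones. The function name, the cluster labels, and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence


def cluster_balanced_sample(
    items: Sequence[Hashable],
    cluster_of: Dict[Hashable, Hashable],
    k: int,
    seed: int = 0,
) -> List[Hashable]:
    """Draw k items with replacement: uniform over clusters, then uniform within."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for item in items:
        by_cluster[cluster_of[item]].append(item)
    clusters = sorted(by_cluster)  # fixed order keeps sampling reproducible
    return [rng.choice(by_cluster[rng.choice(clusters)]) for _ in range(k)]


# Hypothetical imbalanced dataset: 90 "pick" events and 10 "pour" events.
items = [f"event_{i}" for i in range(100)]
cluster_of = {name: ("pick" if i < 90 else "pour") for i, name in enumerate(items)}

sample = cluster_balanced_sample(items, cluster_of, k=2000)
counts = Counter(cluster_of[name] for name in sample)
print(counts)  # both clusters appear roughly equally often despite the 9:1 imbalance
```

Sampling items uniformly would instead return "pick" events about 90% of the time; balancing over clusters trades that natural frequency for more even coverage of behaviors.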

From the same event-pretrained backbone, WALL-WM offers two complementary inference modes. The "event mode" takes next-event descriptions as input and executes variable-length segments. The "unified mode" uses a Vision-Language Model (VLM) with Staircase Decoding to support conventional fixed-length chunk inference while keeping the VLA pathway gradient-continuous. Backed by large-scale pretraining infrastructure built on the Muon optimizer, WALL-WM provides a practical recipe for scaling general-purpose WAMs. In large-scale real-world generalization evaluations, WALL-WM achieves state-of-the-art performance and generalizes broadly across languages, scenes, and tasks.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
