Physical Object Understanding with a Physically Controllable World Model
Title: Mastering Physical Object Comprehension via a Physically Manipulable World Model
Abstract:
A central challenge in visual intelligence is inferring the physical structure of a scene directly from raw video: how regions group into distinct objects, and which laws govern their interactions. Meeting this challenge requires world models that can infer distributions over the state of the environment from incomplete observations, a capability that existing architectures lack.
To bridge this gap, we propose a new class of probabilistic world models that estimate the distribution of any visual attribute, such as motion or appearance, conditioned on any other available variables. We show that such models can be trained effectively with autoregressive sequence modeling, and that this training gives rise to sophisticated object understanding.
We validate the approach with several findings. First, we show that the model internalizes the physical laws governing object motion by generating multiple plausible futures through sequential inference. Second, by examining motion correlations across these predicted futures, the model isolates objects and identifies subparts of articulated structures. Once these entities are identified, we show that the world model can manipulate them in 3D. Finally, we illustrate how the model can compute physical relationships between entities, paving the way for practical applications such as Visual Jenga.
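The idea of isolating objects from motion correlations across sampled futures can be illustrated with a minimal sketch. This is not the paper's method: the function names, the toy data generator, the normalized-correlation measure, and the threshold are all assumptions made for illustration. The sketch assumes a world model has already produced K sampled motion fields of shape (K, H, W, 2), and groups the pixels whose motion co-varies with a chosen query pixel.

```python
import numpy as np


def segment_by_motion_correlation(flows, query, threshold=0.8, eps=1e-8):
    """Group pixels whose sampled motion co-varies with a query pixel.

    flows: array of shape (K, H, W, 2), K sampled motion fields
           (hypothetical output of a probabilistic world model).
    query: (row, col) of a pixel assumed to lie on the object of interest.
    Returns a boolean (H, W) mask.
    """
    flows = np.asarray(flows, dtype=float)
    k, h, w, _ = flows.shape
    # Remove each pixel's mean motion so only variation across samples remains.
    centered = flows - flows.mean(axis=0, keepdims=True)
    # One feature vector per pixel: its (dx, dy) deviations over all K samples.
    feats = centered.transpose(1, 2, 0, 3).reshape(h, w, 2 * k)
    q = feats[query[0], query[1]]
    # Normalized correlation between the query pixel and every other pixel.
    dots = feats @ q
    norms = np.linalg.norm(feats, axis=-1) * np.linalg.norm(q)
    corr = dots / (norms + eps)
    return corr > threshold


def toy_futures(k=32, size=8, seed=0):
    """Toy stand-in for sampled futures: a 3x3 block moves rigidly with a
    different random velocity in each sample; the background only jitters."""
    rng = np.random.default_rng(seed)
    flows = 0.01 * rng.standard_normal((k, size, size, 2))
    velocities = rng.standard_normal((k, 2))
    flows[:, 2:5, 2:5, :] += velocities[:, None, None, :]
    return flows


if __name__ == "__main__":
    mask = segment_by_motion_correlation(toy_futures(), query=(3, 3))
    print(mask.astype(int))
```

In this toy setting, pixels on the rigid block share the same per-sample velocity and therefore correlate strongly with the query, while background pixels do not. A real system would need a more careful treatment of articulated parts, where correlations are only partial.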
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





