arXiv

Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

June 2, 2026 · Zheng Lu, Mingqi Gao, Qinlei Xie, Wanqi Zhong, Hanwen Cui, Heng Cao, Zirui Song, Yifan Yang, Chong Luo, Bei Liu, Yiming Li

Abstract:

Existing benchmarks for embodied vision-language planning frequently prioritize linguistic next-token prediction over reasoning that is grounded in physical states. This approach incentivizes models to rely on statistical language priors rather than tracking causal relationships, thereby reducing the complexity of physical planning to superficial sequence modeling. We contend that achieving reliable physical autonomy necessitates a fundamental shift from linguistically driven token prediction to reasoning that is grounded in physical causality.

To address this, we present Causal-Plan-Bench, a high-fidelity diagnostic suite that evaluates embodied planning across four distinct causal dimensions and was curated through multi-stage verification. We also introduce Causal-Plan-1M, a corpus of one million explicit reasoning traces generated by a four-stage annotation pipeline applied to egocentric videos.

Our extensive evaluations reveal that state-of-the-art models still struggle to demonstrate genuine physical agency; for instance, Gemini 3 Pro scores only 38.18 on our benchmark. In contrast, our proposed training methodology allows Causal Planner, built on Qwen3-VL-8B, to internalize physical logic, yielding more precise next-state estimations. The model performs robustly both in-domain and when generalizing to other benchmarks. Our findings also uncover a Causal Scaling Law: expanding causal training data to one million instances delivers a relative performance improvement of 36.3%, raising scores from 33.22 to 45.28. This work marks a significant step in transforming agents from superficial token predictors into physically grounded causal reasoners.
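
The 36.3% figure is a relative gain, not a difference in benchmark points. As a quick check on the arithmetic, the minimal sketch below recomputes it from the two scores quoted in the abstract. The function and variable names are hypothetical and chosen only for illustration; nothing here reflects the authors' evaluation code.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of `baseline`."""
    return (improved - baseline) / baseline * 100.0


# The two scores quoted in the abstract (illustrative variable names).
baseline_score = 33.22
scaled_score = 45.28

absolute_gain = scaled_score - baseline_score
relative_gain = relative_improvement(baseline_score, scaled_score)

print(f"absolute gain: {absolute_gain:.2f} points")
print(f"relative gain: {relative_gain:.1f}%")
```

The absolute gain is about 12.06 points; dividing by the 33.22 starting score gives the reported 36.3% relative improvement.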
