arXiv

X-Foresight: A Joint Vision-Action Causal Forecasting Network via Predictive World Modeling

June 2, 2026 · Baolu Li (Victor), Jingyu Qian (Victor), Rui Guo (Victor), Yilun Chen (Victor), Hanpeng Liu (Victor), Yuan Lin (Victor), Junhong Zhou (Victor), Ruixin Liu (Victor), Willow Yang (Victor), Yutong Zheng (Victor), Zhenli Zhang (Victor), Tenglong (Victor), G

Abstract:

Physical intuition is largely embedded within video data. Integrating this knowledge into Vision-Language-Action (VLA) models is essential for robust, safe, and generalizable planning. Predictive world modeling allows VLAs to grasp physical dynamics and long-range causality by forecasting future video sequences from prior observations. Nevertheless, standard next-frame prediction encounters two primary hurdles. First, video tokens have low entropy and high redundancy, unlike the semantically distinct tokens found in text, so predictions often devolve into trivial extrapolation. Second, world modeling presents a temporal trade-off: while dense prediction effectively captures immediate dynamics, it struggles to efficiently represent long-horizon causality.

To address these issues, we propose X-Foresight, a predictive world model integrated directly into the VLA architecture. This framework jointly learns world modeling and real-time action control. The core innovation is a long-horizon, chunk-wise auto-regressive strategy. By forecasting semantically distant chunks instead of consecutive frames, the model avoids trivial extrapolation. At the same time, it retains dense frames within each chunk to capture instantaneous dynamics and employs sparse transitions between chunks to model long-term causality. A curriculum learning approach further aids this process by progressively increasing prediction horizons, thereby stabilizing training over extended periods.
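
As a rough illustration of the chunk-wise layout described above, the sketch below builds frame indices that are dense inside each chunk and sparse between chunks, and grows the number of predicted chunks over training. The abstract does not specify chunk sizes, strides, or the curriculum schedule, so the function names, parameters, and the linear schedule are all assumptions made for illustration, not the authors' implementation.

```python
def chunk_frame_indices(start, num_chunks, frames_per_chunk, chunk_stride):
    """Frame indices per chunk: consecutive inside a chunk, strided between chunks.

    All parameters are hypothetical; the abstract gives no concrete values.
    """
    if chunk_stride < frames_per_chunk:
        raise ValueError("chunk_stride must be >= frames_per_chunk to avoid overlap")
    return [
        [start + c * chunk_stride + f for f in range(frames_per_chunk)]
        for c in range(num_chunks)
    ]


def curriculum_num_chunks(step, total_steps, min_chunks, max_chunks):
    """Assumed linear curriculum: the prediction horizon grows with training progress."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return min_chunks + round(frac * (max_chunks - min_chunks))


# Example: 3 chunks of 4 dense frames each, with chunk starts 20 frames apart.
layout = chunk_frame_indices(start=0, num_chunks=3, frames_per_chunk=4, chunk_stride=20)
```

Under these assumptions, a larger `chunk_stride` pushes successive chunks further apart in time, which is one way to make each prediction target less of a trivial extrapolation of the previous chunk.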

To enhance the capture of long-term causality, we introduce temporal importance sampling. This technique focuses supervisory signals on safety-critical chunks, which are identified through ego-motion and behavioral indicators. Additionally, we offload photorealistic synthesis to a diffusion-based multi-view renderer to improve visual fidelity. Extensive experiments show that X-Foresight surpasses VLA baselines in planning capabilities while preserving strong generative quality, offering a reliable framework for autonomous systems driven by world knowledge.
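
One possible reading of temporal importance sampling is sketched below: each chunk receives a score from ego-motion magnitudes, and chunks are then drawn for supervision in proportion to that score. The abstract does not say which indicators are used or how they are combined, so the choice of acceleration and yaw rate, the additive score, and the `floor` term are assumptions for illustration only.

```python
import random


def chunk_importance(accel, yaw_rate, floor=0.1):
    """Turn per-chunk ego-motion magnitudes into sampling probabilities.

    accel, yaw_rate: hypothetical per-chunk indicator values (equal-length lists).
    floor: assumed constant so that uneventful chunks are still sampled sometimes.
    """
    scores = [floor + abs(a) + abs(y) for a, y in zip(accel, yaw_rate)]
    total = sum(scores)
    return [s / total for s in scores]


def sample_chunks(probs, k, seed=0):
    """Draw k chunk indices with replacement, weighted by importance."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=k)


# Example: the middle chunk contains hard braking and a sharp turn.
probs = chunk_importance(accel=[0.0, 3.0, 0.0], yaw_rate=[0.0, 0.9, 0.0])
picked = sample_chunks(probs, k=8)
```

In this sketch the eventful chunk dominates the draw, while the floor keeps some supervision on the quiet ones.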
