arXiv

X-Foresight: A Joint Vision-Action Causal Forecasting Network via Predictive World Modeling

Abstract:

Physical intuition is largely embedded within video data. Integrating this knowledge into Vision-Language-Action (VLA) models is essential for robust, safe, and generalizable planning. Predictive world modeling allows VLAs to grasp physical dynamics and long-range causality by forecasting future video sequences from prior observations. Nevertheless, standard next-frame prediction encounters two primary hurdles. First, video tokens have low entropy and high redundancy, unlike the semantically distinct tokens found in text, so predictions often devolve into trivial extrapolations. Second, world modeling presents a temporal trade-off: dense prediction captures immediate dynamics effectively but struggles to represent long-horizon causality efficiently.

To address these issues, we propose X-Foresight, a predictive world model seamlessly integrated into the VLA architecture. This framework simultaneously learns world modeling and real-time action control. The core innovation is a long-horizon, chunk-wise auto-regressive strategy. By forecasting semantically distant chunks instead of consecutive frames, the model avoids trivial extrapolation. Simultaneously, it retains dense frames within each chunk to capture instantaneous dynamics and employs sparse transitions between chunks to model long-term causality. A curriculum learning approach further aids this process by progressively increasing prediction horizons, thereby stabilizing training over extended periods.
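The chunk layout described above can be illustrated with a small indexing sketch. This is not the authors' implementation: the function names, the fixed stride between chunk starts, and the linear curriculum schedule are all assumptions made for illustration. It only shows how frames stay dense inside a chunk while chunks themselves are spaced far apart, and how the spacing might grow during training.

```python
def chunk_frame_indices(num_frames, chunk_len, stride, num_chunks, start=0):
    """Return up to `num_chunks` chunks of consecutive frame indices.

    Frames inside a chunk are dense (consecutive); chunk starts are
    `stride` frames apart, so transitions between chunks are sparse.
    All parameter names are hypothetical, not taken from the paper.
    """
    chunks = []
    for k in range(num_chunks):
        first = start + k * stride
        last = first + chunk_len
        if last > num_frames:
            break  # stop once a chunk would run past the end of the clip
        chunks.append(list(range(first, last)))
    return chunks


def curriculum_stride(step, total_steps, min_stride, max_stride):
    """Linearly grow the gap between chunks as training progresses.

    The linear schedule is an assumption; the abstract only states that
    prediction horizons increase progressively.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return int(round(min_stride + frac * (max_stride - min_stride)))


if __name__ == "__main__":
    stride = curriculum_stride(step=50, total_steps=100, min_stride=4, max_stride=20)
    print(chunk_frame_indices(num_frames=100, chunk_len=4, stride=stride, num_chunks=3))
```

With a stride larger than the chunk length, consecutive chunks never overlap, so each prediction target is separated from its context by a real temporal gap rather than a single frame.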

To enhance the capture of long-term causality, we introduce temporal importance sampling. This technique focuses supervisory signals on safety-critical chunks, which are identified through ego-motion and behavioral indicators. Additionally, we offload photorealistic synthesis to a diffusion-based multi-view renderer to improve visual fidelity. Extensive experiments show that X-Foresight surpasses VLA baselines in planning capabilities while preserving strong generative quality, offering a reliable framework for autonomous systems driven by world knowledge.
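A minimal sketch of the temporal importance sampling idea mentioned above, under stated assumptions. The abstract does not specify how ego-motion and behavioral indicators are combined, so the score used here (largest frame-to-frame speed change plus largest absolute yaw rate within a chunk) is a hypothetical stand-in, as are all function names. The point is only to show supervision being drawn more often from chunks with stronger ego-motion signals.

```python
import random


def chunk_scores(chunks, eps=1e-3):
    """Score each chunk from its ego-motion signals.

    `chunks` is a list of (speeds, yaw_rates) pairs, one pair per chunk.
    The scoring rule is an illustrative assumption, not the paper's.
    """
    scores = []
    for speeds, yaw_rates in chunks:
        # Largest speed change between consecutive frames in the chunk.
        accel = max((abs(b - a) for a, b in zip(speeds, speeds[1:])), default=0.0)
        # Largest absolute yaw rate in the chunk.
        turn = max((abs(y) for y in yaw_rates), default=0.0)
        # eps keeps uneventful chunks at a small nonzero probability.
        scores.append(accel + turn + eps)
    return scores


def sampling_probs(scores):
    """Normalize scores into a probability distribution over chunks."""
    total = sum(scores)
    return [s / total for s in scores]


def sample_chunks(chunks, k, seed=0):
    """Draw k chunk indices (with replacement) proportional to importance."""
    probs = sampling_probs(chunk_scores(chunks))
    rng = random.Random(seed)
    return rng.choices(range(len(chunks)), weights=probs, k=k)


if __name__ == "__main__":
    cruising = ([10.0, 10.0, 10.0], [0.0, 0.0, 0.0])  # steady driving
    braking = ([10.0, 6.0, 2.0], [0.0, 0.3, 0.5])     # hard braking while turning
    print(sampling_probs(chunk_scores([cruising, braking])))
    print(sample_chunks([cruising, braking], k=5))
```

Sampling with replacement keeps the sketch simple; a training pipeline might instead reweight the loss per chunk, which is a design choice the abstract leaves open.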


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
