arXiv

A 3D Isovist World Model -- Revealing a City's Unseen Geometry and Its Emergent Cross-City Signature

Abstract:

For embodied agents navigating urban environments, effective world models must anticipate how the surroundings evolve during movement. However, successful navigation depends less on the visual appearance of structures and more on determining accessible pathways. Conventional world models typically prioritize photometric accuracy, learning how a scene appears rather than the navigable space within it. Approaches that do focus on geometry, such as bird’s-eye-view occupancy grids, often reduce the three-dimensional environment to a two-dimensional ground plane, thereby ignoring the multi-level and above-ground complexities that define real-world navigation. There remains a need for a predictive framework that captures the navigable geometry an agent actually traverses, avoiding both photometric distractions and the loss of vertical dimensionality.

To address this, we propose modeling the open volume between buildings, essentially the negative space, using a 3D isovist. This representation functions as a spherical visibility-depth map, recording the distance to the nearest surface in every direction. We introduce an embodied world model that forecasts the next isovist from a brief history of previous isovists and the agent’s movement actions. The prediction is structured as a depth residual, allowing the decoder to retain sharp building edges. The model is trained with self-rollout scheduled sampling, which keeps the corrupted context on the geometry manifold, and it incorporates a persistent latent bird’s-eye-view spatial map to ensure consistency across different paths.
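As an illustration of the representation described above, the following is a minimal sketch and not the authors' pipeline. It assumes that buildings can be approximated by axis-aligned boxes, that the observer stands in open space outside every box, that no ground plane is modeled, and that rays are sampled on an equirectangular elevation-by-azimuth grid. The function names `sphere_directions` and `isovist_3d`, the grid resolution, and the 100 m range cap are hypothetical choices. The final lines show one way a depth-residual target could be formed from two consecutive isovists.

```python
import numpy as np

def sphere_directions(n_elev=17, n_azim=32):
    """Unit vectors on an equirectangular (elevation x azimuth) grid."""
    elev = np.linspace(-np.pi / 2, np.pi / 2, n_elev)            # straight down .. straight up
    azim = np.linspace(0.0, 2 * np.pi, n_azim, endpoint=False)   # full turn around the observer
    e, a = np.meshgrid(elev, azim, indexing="ij")
    return np.stack([np.cos(e) * np.cos(a), np.cos(e) * np.sin(a), np.sin(e)], axis=-1)

def isovist_3d(origin, boxes, dirs, max_range=100.0):
    """Distance to the nearest box surface along each direction (slab method).

    boxes: array of shape (B, 2, 3) holding (min_corner, max_corner) per box.
    Assumes the origin lies outside every box. Rays that hit nothing, or that
    hit beyond max_range, are clipped to max_range.
    """
    origin = np.asarray(origin, dtype=float)
    d = dirs.reshape(-1, 3)
    # Replace near-zero components so axis-parallel rays do not divide by zero.
    safe = np.where(np.abs(d) < 1e-12, 1e-12, d)
    depth = np.full(d.shape[0], float(max_range))
    for lo, hi in np.asarray(boxes, dtype=float):
        t1 = (lo - origin) / safe
        t2 = (hi - origin) / safe
        t_near = np.minimum(t1, t2).max(axis=1)   # latest entry across the three slabs
        t_far = np.maximum(t1, t2).min(axis=1)    # earliest exit across the three slabs
        hit = (t_near <= t_far) & (t_near >= 0.0)
        depth = np.minimum(depth, np.where(hit, t_near, max_range))
    return depth.reshape(dirs.shape[:-1])

# Hypothetical scene: one building 10 m ahead of an observer at 1 m height.
dirs = sphere_directions()
boxes = np.array([[[10.0, -5.0, 0.0], [20.0, 5.0, 30.0]]])
iso_prev = isovist_3d([0.0, 0.0, 1.0], boxes, dirs)
iso_next = isovist_3d([2.0, 0.0, 1.0], boxes, dirs)   # after moving 2 m forward

# Residual target: the change in depth per direction, so that
# iso_next can be recovered as iso_prev + residual.
residual = iso_next - iso_prev
```

In this toy scene the residual is zero for every ray that misses the building in both frames, which hints at why predicting a residual rather than an absolute depth map could make it easier to preserve sharp edges.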

Our primary discovery is both emergent and surprising: a single model, trained exclusively on data from Manhattan and Paris without city-specific labels, develops a distinct cross-city spatial signature. The identity of the city can be linearly decoded from the model’s temporal latents with significantly higher accuracy than single-frame baselines, indicating that this signature resides in the learned dynamics rather than in visual appearance. This representation is lightweight, interpretable, and reproducible, providing a geometric foundation for spatial reasoning in embodied AI, robotics, and urban analysis. The work is accompanied by an open dataset and pipeline.
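Linear decodability, as used above, can be illustrated with a small probe. The sketch below is a toy under stated assumptions: the "latents" are synthetic Gaussian vectors standing in for the model's temporal latents, the two labels stand in for the two cities, and `fit_linear_probe` and `probe_accuracy` are hypothetical helper names. It shows only the mechanics of fitting a logistic-regression probe on frozen features and does not reproduce any result from the paper.

```python
import numpy as np

def fit_linear_probe(X, y, lr=0.1, steps=500):
    """Binary logistic regression trained by batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probability of class 1
        grad = p - y                              # gradient of the log loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, X, y):
    """Fraction of samples whose thresholded logit matches the label."""
    return float(((X @ w + b > 0).astype(int) == y).mean())

# Synthetic stand-in data (assumption): two classes whose latent means differ.
rng = np.random.default_rng(0)
n, d = 400, 8
labels = rng.integers(0, 2, size=n)               # 0 and 1 play the role of two cities
latents = rng.normal(size=(n, d)) + 1.5 * labels[:, None]

# Hold out the last 100 samples; center features with training statistics only.
train, test = slice(0, 300), slice(300, n)
mu = latents[train].mean(axis=0)
w, b = fit_linear_probe(latents[train] - mu, labels[train])
acc = probe_accuracy(w, b, latents[test] - mu, labels[test])
```

The probe has no hidden layers, so high held-out accuracy would indicate that the class information is linearly separable in the features themselves. To mirror the comparison described above, the same probe would be fit once on temporal latents and once on single-frame features.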


Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
