Title: A 3D Isovist World Model -- Revealing a City's Unseen Geometry and Its Emergent Cross-City Signature
Abstract:
For embodied agents navigating urban environments, effective world models must anticipate how the surroundings evolve during movement. However, successful navigation depends less on the visual appearance of structures and more on determining accessible pathways. Conventional world models typically prioritize photometric accuracy, learning how a scene appears rather than the navigable space within it. Approaches that do focus on geometry, such as bird’s-eye-view occupancy grids, often reduce the three-dimensional environment to a two-dimensional ground plane, thereby ignoring the multi-level and above-ground complexities that define real-world navigation. There remains a need for a predictive framework that captures the navigable geometry an agent actually traverses, avoiding both photometric distractions and the loss of vertical dimensionality.
To address this, we propose modeling the open volume between buildings, essentially the negative space, using a 3D isovist. This representation functions as a spherical visibility-depth map, recording the distance to the nearest surface in every direction. We introduce an embodied world model that forecasts the next isovist from a brief history of previous isovists and the agent’s movement actions. The prediction is structured as a depth residual, allowing the decoder to retain sharp building edges. The model is trained with self-rollout scheduled sampling, which keeps context corrupted by the model’s own predictions on the geometry manifold, and it incorporates a persistent latent bird’s-eye-view spatial map to ensure consistency across different paths.
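To make the isovist representation concrete, the following is a minimal sketch of a spherical visibility-depth map, not the authors' pipeline. It assumes buildings are approximated by axis-aligned boxes and that directions are sampled on an equirectangular (elevation by azimuth) grid. The function names `ray_directions` and `isovist_depth`, the grid resolution, and the `max_range` cutoff are all hypothetical choices made for illustration.

```python
import numpy as np

def ray_directions(n_elev=32, n_azim=64):
    """Unit direction vectors on an equirectangular (elevation x azimuth) grid."""
    # Cell-centred angles avoid duplicating the poles and the 0 / 2*pi seam.
    elev = (np.arange(n_elev) + 0.5) / n_elev * np.pi - np.pi / 2
    azim = (np.arange(n_azim) + 0.5) / n_azim * 2 * np.pi
    el, az = np.meshgrid(elev, azim, indexing="ij")
    return np.stack(
        [np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)], axis=-1
    )

def isovist_depth(origin, boxes, dirs, max_range=500.0):
    """Distance to the nearest box surface along each direction (slab method).

    `boxes` is a list of (lo, hi) corner pairs; this box-shaped building model
    is an assumption of the sketch. Rays that hit nothing return `max_range`.
    """
    origin = np.asarray(origin, dtype=float)
    depth = np.full(dirs.shape[:-1], max_range)
    # Replace near-zero components so the division is defined; such rays are
    # parallel to a slab and the huge quotients behave like +/- infinity.
    safe = np.where(np.abs(dirs) < 1e-12, 1e-12, dirs)
    for lo, hi in boxes:
        t1 = (np.asarray(lo, dtype=float) - origin) / safe
        t2 = (np.asarray(hi, dtype=float) - origin) / safe
        t_near = np.minimum(t1, t2).max(axis=-1)  # latest entry over the 3 slabs
        t_far = np.maximum(t1, t2).min(axis=-1)   # earliest exit over the 3 slabs
        hit = (t_far >= t_near) & (t_far > 0)
        # If the origin lies inside a box, t_near is negative; clamp it to 0.
        t_hit = np.where(hit, np.maximum(t_near, 0.0), max_range)
        depth = np.minimum(depth, t_hit)
    return depth

# Usage: one box-shaped "building" east of an observer at eye height 1.5.
dirs = ray_directions()
depth_map = isovist_depth((0.0, 0.0, 1.5), [((10, -5, 0), (20, 5, 30))], dirs)
```

Here `depth_map` has one distance per sampled direction, so it can be treated as a single-channel spherical image. A residual formulation would then predict the difference between consecutive depth maps instead of the next map itself.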
Our primary finding is surprising: a single model, trained exclusively on data from Manhattan and Paris without city-specific labels, develops an emergent, distinct cross-city spatial signature. The identity of the city can be linearly decoded from the model’s temporal latents with significantly higher accuracy than single-frame baselines, indicating that this signature resides in the learned dynamics rather than in visual appearance. This representation is lightweight, interpretable, and reproducible, providing a geometric foundation for spatial reasoning in embodied AI, robotics, and urban analysis. The work is accompanied by an open dataset and pipeline.
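"Linearly decoded" refers to fitting a linear classifier (a probe) on frozen latents and measuring held-out accuracy. The sketch below illustrates that procedure only. The latents are synthetic Gaussian stand-ins, not the model's temporal latents, and the helper names `fit_linear_probe` and `probe_accuracy`, the latent dimension, and the optimizer settings are hypothetical.

```python
import numpy as np

def fit_linear_probe(z, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    w = np.zeros(z.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(z @ w + b)))  # predicted P(label = 1)
        err = p - y                             # per-sample log-loss gradient w.r.t. the logit
        w -= lr * (z.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

def probe_accuracy(z, y, w, b):
    """Fraction of samples whose thresholded logit matches the label."""
    return float(((z @ w + b > 0).astype(int) == y).mean())

# Synthetic stand-in data (an assumption of this sketch): 16-dim "latents"
# whose mean is shifted by the binary city label (0 or 1).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
latents = rng.normal(size=(400, 16)) + 1.5 * labels[:, None]

# Centre with training statistics only, then evaluate on held-out samples.
train_mean = latents[:300].mean(axis=0)
w, b = fit_linear_probe(latents[:300] - train_mean, labels[:300])
held_out_acc = probe_accuracy(latents[300:] - train_mean, labels[300:], w, b)
```

Running the same probe on two sets of features, for example temporal latents and single-frame features, and comparing held-out accuracy is the kind of comparison the abstract describes. The probe itself has no hidden layers, so high accuracy indicates that the label is linearly accessible in the features.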
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



