arXiv

UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations

June 2, 2026 · Dominik J. Mühlematter, Lin Che, Ye Hong, Martin Raubal, Nina Wiedemann

Abstract: Accurately predicting urban dynamics, such as public health metrics and housing market trends, depends on integrating diverse geospatial datasets. Existing approaches typically rely on task-specific models, while recent general-purpose spatial representation models often support only a limited set of modalities and lack robust multimodal fusion mechanisms. To address these limitations, we introduce UrbanFusion, a spatial representation framework built around Stochastic Multimodal Fusion (SMF). UrbanFusion uses modality-specific encoders to process varied inputs, including street-level imagery, remote sensing data, cartographic maps, and points of interest (POI) data. A Transformer-based fusion module integrates these heterogeneous inputs into unified representations. In an evaluation across 56 global cities and 41 distinct tasks, UrbanFusion shows stronger predictive performance and generalization than leading GeoAI models. Key advantages include: 1) improved location-encoding performance compared to previous benchmarks; 2) the ability to incorporate multimodal data at inference time; and 3) strong generalization to geographic regions not seen during training. UrbanFusion can also use any available subset of modalities for a given location during both pretraining and inference, making it adaptable to a wide range of data-availability scenarios.
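
To make the "any available subset of modalities" idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how SMF samples modalities, so the uniform random subset below is one plausible reading, and the element-wise mean is only a stand-in for the Transformer-based fusion module. The function names `sample_modality_subset` and `fuse` and the toy two-dimensional embeddings are hypothetical.

```python
import random

# The four modalities named in the abstract; the identifiers are illustrative.
MODALITIES = ["street_view", "remote_sensing", "map", "poi"]


def sample_modality_subset(available, rng):
    """Pick a random non-empty subset of the modalities available at a location.

    Assumption: subset size is drawn uniformly; the abstract does not say
    how the stochastic selection is actually performed.
    """
    available = list(available)
    if not available:
        raise ValueError("at least one modality is required")
    k = rng.randint(1, len(available))  # inclusive on both ends
    return sorted(rng.sample(available, k))


def fuse(embeddings):
    """Placeholder fusion: element-wise mean of the embeddings that are present.

    Stand-in only: the paper describes a Transformer-based fusion module.
    """
    vectors = list(embeddings.values())
    if not vectors:
        raise ValueError("nothing to fuse")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all embeddings must share one dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# A location where only three of the four modalities are available.
rng = random.Random(0)
location = {
    "street_view": [1.0, 0.0],
    "poi": [0.0, 1.0],
    "map": [1.0, 1.0],
}
subset = sample_modality_subset(location.keys(), rng)
fused = fuse({name: location[name] for name in subset})
```

Because `fuse` accepts whatever embeddings it is given, the same code path works for a location with one modality or all four, which mirrors the flexibility the abstract claims for both pretraining and inference.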
