Pinpoint: Grounded Worldwide Image Geolocation via Cross-Source Retrieval and Reranking
Title: Pinpoint: Achieving Global Image Geolocation Through Cross-Source Retrieval and Reranking
Abstract:
Image geolocation aims to determine where a photograph was taken from its visual content alone. Scaling this to the whole globe is difficult because visual cues are ambiguous, diverse, and unevenly distributed. Prior research has treated the geolocation of ordinary internet photographs and of street-view images as separate problems, overlooking how the two sources complement each other: internet photos more closely match the visual characteristics of user-generated queries, whereas street-view data offers denser, geographically anchored coverage.
To address this, we introduce Pinpoint, a retrieve-and-rerank framework that integrates both data sources within a coarse-to-fine workflow. A contrastive image-GPS embedder, trained on a combination of user-uploaded Flickr images and street-view imagery, maps images and GPS coordinates into a single shared embedding space, which is used for the initial retrieval of candidate locations. An attention-based reranker then refines these candidates by combining each candidate's visual and GPS features with cross-source contextual evidence from its surrounding area.
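The coarse-to-fine workflow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the gallery layout, the function names (`retrieve`, `rerank`), and the parameters `alpha` and `temperature` are all hypothetical, and the reranker shown is a hand-written stand-in for the learned attention-based model. It only assumes that query images and candidate locations already live in one shared embedding space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def retrieve(query_emb, gallery, k):
    """Coarse stage: the k gallery entries most similar to the query.

    gallery is a list of (lat, lon, embedding) tuples (assumed layout).
    """
    ranked = sorted(gallery, key=lambda g: cosine(query_emb, g[2]), reverse=True)
    return ranked[:k]

def rerank(query_emb, candidates, alpha=0.7, temperature=0.1):
    """Fine stage (stand-in, not the paper's learned reranker).

    Each candidate's score mixes its own similarity to the query with a
    context score: an attention-weighted average of the other candidates'
    similarities, where attention comes from candidate-to-candidate
    embedding similarity. Returns (score, lat, lon) sorted best first.
    """
    own = [cosine(query_emb, c[2]) for c in candidates]
    results = []
    for i, cand in enumerate(candidates):
        others = [j for j in range(len(candidates)) if j != i]
        if others:
            logits = [cosine(cand[2], candidates[j][2]) / temperature for j in others]
            weights = softmax(logits)
            context = sum(w * own[j] for w, j in zip(weights, others))
        else:
            context = own[i]  # a lone candidate has no context to attend to
        results.append((alpha * own[i] + (1 - alpha) * context, cand[0], cand[1]))
    results.sort(key=lambda r: r[0], reverse=True)
    return results

# Toy data: three made-up gallery entries with 3-dimensional embeddings.
gallery = [
    (48.8584, 2.2945, [0.9, 0.1, 0.0]),
    (48.8606, 2.3376, [0.8, 0.2, 0.1]),
    (40.6892, -74.0445, [0.1, 0.9, 0.2]),
]
query = [1.0, 0.0, 0.0]
top = rerank(query, retrieve(query, gallery, k=2))
```

In this sketch the coarse stage narrows the gallery cheaply, and the fine stage is the only place where candidates are compared against one another, which mirrors the division of labour the abstract describes.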
Unlike recent approaches, Pinpoint does not rely on multimodal large language models, which makes inference faster and results easier to reproduce. The model achieves state-of-the-art results across all evaluation metrics on established benchmarks: IM2GPS3k and YFCC4k for internet photos, and OSV-5M for street-view imagery.
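The abstract does not name the evaluation metrics. Geolocation benchmarks of this kind are commonly scored by the fraction of predictions that fall within a given great-circle distance of the true location, so the sketch below assumes that kind of metric. The function names and the threshold value are illustrative and are not taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def accuracy_within(preds, truths, threshold_km):
    """Fraction of predictions within threshold_km of the ground truth.

    preds and truths are equal-length lists of (lat, lon) pairs.
    """
    hits = sum(
        1
        for (plat, plon), (tlat, tlon) in zip(preds, truths)
        if haversine_km(plat, plon, tlat, tlon) <= threshold_km
    )
    return hits / len(preds)

# Toy example: one prediction a few kilometres off, one on the wrong continent.
preds = [(48.8584, 2.2945), (0.0, 0.0)]
truths = [(48.8606, 2.3376), (40.6892, -74.0445)]
share_within_25km = accuracy_within(preds, truths, threshold_km=25.0)
```

Reporting this fraction at several thresholds, from street level to continent level, shows both fine-grained and coarse accuracy in a single table.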
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC



