arXiv

CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation

June 4, 2026 · Yurim Jeon, Dongseong Seo, Seung-Woo Seo

Abstract:

Cross-view geo-localization determines the geographic position of a ground-level photograph by comparing it with images from an aerial database. Current approaches generally address this challenge through either large-scale retrieval or high-precision pose estimation, but rarely both. Retrieval-centric methods allow for searches across wide areas but suffer from lower localization accuracy, whereas pose estimation techniques offer precise results but are restricted to narrow search spaces. Simply chaining these separate pipelines leads to error propagation and misaligned feature representations.

To address these limitations, we define cross-view geo-localization as a single, unified problem that demands simultaneous city-scale retrieval and accurate 3-DoF pose estimation. We introduce CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a novel architecture that executes both tasks jointly via mutually reinforcing feature learning. CIPER employs a shared transformer encoder equipped with task-specific tokens, effectively separating global retrieval features from spatial localization cues.
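As a rough illustration of the shared-encoder idea, the sketch below prepends two learnable task tokens to the patch tokens and reads one back as a global retrieval descriptor while the patch outputs keep the spatial layout. It is a toy under stated assumptions, not the CIPER implementation: the abstract does not say how many task tokens are used or how they are read out, the attention here has a single head with identity projections, and the names `encode`, `retrieval_token`, and `pose_token` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (toy simplification)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def encode(patch_tokens, retrieval_token, pose_token, num_layers=2):
    """Run all tokens through one shared encoder; split the outputs by task afterwards."""
    # Assumption: one token per task, prepended to the image patch tokens.
    tokens = [list(retrieval_token), list(pose_token)] + [list(p) for p in patch_tokens]
    for _ in range(num_layers):
        attended = self_attention(tokens)
        # Residual connection, as in a standard transformer block.
        tokens = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    # Retrieval branch: L2-normalised global descriptor from the retrieval token.
    norm = math.sqrt(sum(x * x for x in tokens[0])) or 1.0
    global_descriptor = [x / norm for x in tokens[0]]
    # Pose branch: the pose token plus the per-patch (spatial) features.
    return global_descriptor, tokens[1], tokens[2:]
```

In a sketch like this, the descriptor would feed a nearest-neighbour search over aerial tiles, while the pose token and spatial features would go to a separate pose decoder, so the two tasks share weights but read different outputs.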

To overcome the significant domain gap between ground and aerial perspectives, we present a two-way transformer pose decoder. This component leverages ground features as spatial queries to facilitate bidirectional cross-attention. Additionally, a set prediction strategy ensures stable 3-DoF regression within a unified multi-task objective. Evaluations on the VIGOR, KITTI, and Ford Multi-AV datasets show competitive results, particularly in scenarios involving limited fields of view and arbitrary orientations. The code is publicly accessible at https://github.com/yurimjeon1892/CIPER.
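The set prediction strategy can be pictured with a small DETR-style sketch: the decoder emits several candidate 3-DoF poses (x, y, yaw) with confidence scores, the candidate closest to the ground truth is matched, and only that one receives the regression loss. The abstract does not give the actual cost terms or matching rule, so everything here is an assumption: with a single ground-truth pose the matching reduces to an argmin, and `pose_cost`, `set_prediction_loss`, and `yaw_weight` are hypothetical names.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def pose_cost(pred, target, yaw_weight=1.0):
    """Cost between two 3-DoF poses (x, y, yaw): planar distance plus wrapped yaw error."""
    translation = math.hypot(pred[0] - target[0], pred[1] - target[1])
    rotation = abs(wrap_angle(pred[2] - target[2]))
    return translation + yaw_weight * rotation


def set_prediction_loss(candidates, scores, target):
    """Match the lowest-cost candidate to the target and build a toy multi-part loss.

    candidates: list of (x, y, yaw) predictions, one per decoder query.
    scores: confidence in (0, 1) for each candidate.
    """
    costs = [pose_cost(c, target) for c in candidates]
    matched = min(range(len(costs)), key=costs.__getitem__)
    eps = 1e-9
    # Binary cross-entropy: the matched query should be confident, the others not.
    confidence_loss = -sum(
        math.log(s + eps) if i == matched else math.log(1.0 - s + eps)
        for i, s in enumerate(scores)
    ) / len(scores)
    # Regression loss applies to the matched query only.
    return matched, costs[matched] + confidence_loss
```

Wrapping the yaw difference matters for arbitrary orientations: a prediction of 3.1 rad against a target of -3.1 rad is only about 0.08 rad off, not 6.2 rad.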
