arXiv

Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Latent Priors

June 2, 2026 · Ziang Xu, Bin Li, Yang Hu, Chenyu Zhang, James East, Sharib Ali, Jens Rittscher

Title: Enhancing Self-Supervised Monocular Depth and Pose Estimation in Endoscopy via Latent Priors

Abstract: Accurate 3D mapping in endoscopy is essential for comprehensive, quantitative assessment of lesions within the gastrointestinal (GI) tract, and it depends on reliable depth and pose estimation. Endoscopes are monocular, yet current approaches often generalize poorly in difficult endoscopic environments, typically because they rely on synthetic data or overly complex architectures. To address these limitations, we introduce a robust self-supervised framework for monocular depth and pose estimation that integrates a Generative Latent Bank with a Variational Autoencoder (VAE). The Generative Latent Bank draws on extensive depth data from natural images to condition the depth network, injecting latent feature priors that significantly improve the realism and robustness of depth predictions. For pose estimation, we recast the problem within a VAE, treating pose transitions as latent variables. This regularizes scale, stabilizes z-axis prominence, and improves sensitivity in the x-y plane. The resulting dual-refinement pipeline delivers accurate depth and pose estimates under the challenging textures and lighting conditions of the GI tract. Comprehensive evaluations on the SimCol and EndoSLAM datasets show that our approach outperforms existing self-supervised methods for endoscopic depth and pose estimation.
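
The abstract describes treating pose transitions as latent variables in a VAE but gives no implementation details. As a rough illustration only, the sketch below shows the two generic ingredients such a formulation usually involves: sampling a pose with the reparameterization trick, and a KL term that regularizes the latent toward a prior. The 6-DoF layout, the standard normal prior, the example values, and the names `reparameterize` and `kl_to_standard_normal` are all assumptions made here, not details taken from the paper.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps per dimension, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))


# Hypothetical encoder output for one frame pair, laid out as an assumed
# 6-DoF relative pose (tx, ty, tz, rx, ry, rz). Values are illustrative.
mu = [0.01, -0.02, 0.10, 0.0, 0.005, -0.003]
logvar = [-4.0] * 6

rng = random.Random(0)                    # seeded for repeatability
pose_sample = reparameterize(mu, logvar, rng)
kl = kl_to_standard_normal(mu, logvar)    # would be added to the training loss
```

In a training loop, a term like `kl` would typically be weighted and added to the self-supervised photometric loss; how the paper balances these terms is not stated in the abstract.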
