Honey, I Shrunk the Arc de Triomphe!
Original: arXiv:2606.02379v1 Announce Type: new Abstract: Metric scale monocular geometry estimation has seen significant progress through large-scale data aggregation, yet current foundation models suffer from a persistent "scale-collapse" phenomenon: distant landmarks and vast landscapes are metrically underestimated. We hypothesize that this performance gap stems from a training data bottleneck, where existing metric-scale datasets are hardware-constrained to homogeneous vehicle-captured LiDAR or short-range indoor scans, or consist of synthetic data that lacks the semantic complexity of the physical world. To bridge this gap, we curate a new metrically-grounded, in-the-wild dataset that we call MetricScenes, gathered from a variety of sources including Internet photo collections and stereo imagery. We estimate camera poses and initial depth maps for each scene using off-the-shelf methods, and recover absolute scale from geo-tagged metadata as well as known stereo camera baselines. We also improve the quality of depth maps derived from MetricScenes via a new two-stage Poisson completion method. Fine-tuning MoGe-2 on our dataset significantly mitigates scale-collapse and achieves superior metric accuracy in unconstrained, open-domain scenes while maintaining state-of-the-art performance on standard benchmarks.
Rewrite:
arXiv:2606.02379v1 Announcement Type: New
Abstract:
While large-scale data aggregation has driven substantial progress in metric-scale monocular geometry estimation, contemporary foundation models still suffer from a persistent "scale-collapse" phenomenon: distant landmarks and expansive landscapes are systematically underestimated in metric terms. We hypothesize that this gap arises from a bottleneck in training data: existing metric-scale datasets are either hardware-constrained to homogeneous vehicle-captured LiDAR or short-range indoor scans, or they consist of synthetic data that fails to capture the semantic complexity of the physical world.
To address this gap, we curate MetricScenes, a new metrically grounded dataset captured in the wild. It is gathered from a variety of sources, including Internet photo collections and stereo imagery. For every scene, we use off-the-shelf methods to estimate camera poses and initial depth maps. Absolute scale is then recovered from geo-tagged metadata and from known stereo camera baselines. We further improve the quality of the depth maps derived from MetricScenes through a new two-stage Poisson completion method. Fine-tuning MoGe-2 on our dataset significantly mitigates scale-collapse, yielding higher metric accuracy in unconstrained, open-domain scenes while maintaining state-of-the-art performance on standard benchmarks.
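The abstract does not say how absolute scale is recovered, so the following is only a minimal sketch of the two general ideas it names, not the authors' method. It assumes the geo-tags have already been converted to local metric coordinates (for example, east-north-up in meters) and that camera centers from an up-to-scale reconstruction are available. The function names `estimate_metric_scale` and `stereo_depth` are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def estimate_metric_scale(sfm_centers, metric_centers):
    """Estimate s such that s * (reconstruction distance) ~ metric distance.

    sfm_centers:    camera centers in arbitrary reconstruction units.
    metric_centers: the same cameras in local metric coordinates (assumed
                    to be derived from geo-tags; the conversion is not shown).
    The median of pairwise distance ratios is used so that a few noisy
    geo-tags do not dominate the estimate.
    """
    if len(sfm_centers) != len(metric_centers):
        raise ValueError("both inputs must list the same cameras")
    ratios = []
    for i, j in combinations(range(len(sfm_centers)), 2):
        d_sfm = math.dist(sfm_centers[i], sfm_centers[j])
        d_metric = math.dist(metric_centers[i], metric_centers[j])
        if d_sfm > 1e-9:  # skip coincident cameras
            ratios.append(d_metric / d_sfm)
    if not ratios:
        raise ValueError("need at least two distinct camera centers")
    return median(ratios)


def stereo_depth(focal_px, baseline_m, disparity_px):
    """Metric depth of a point from a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px
```

Multiplying an up-to-scale depth map by the estimated factor would put it in meters. In the stereo case no such alignment is needed, because the known baseline fixes the scale directly.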
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





