Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have
Title: Beyond Labels: Leveraging Existing Metadata to Adapt Vision Foundation Models
Abstract: This paper introduces a method for tailoring broad, powerful vision foundation models to niche scientific fields without relying on labeled data. Conventional supervised fine-tuning is frequently ineffective in these contexts due to limited label availability and the risk that task-specific training may erode the model’s general capabilities and robustness. To address this, we utilize metadata to adapt model representations to new domains through self-supervision. We present FINO, a technique that integrates a conventional self-supervised loss with adaptable metadata guidance, capable of managing both fine-grained discrete attributes and continuous variables. This approach promotes the retention of relevant information within the representation while filtering out noise. Evaluated across diverse fields including subcellular fluorescence microscopy, Earth observation, wildlife tracking, and medical imaging, FINO consistently surpasses standard unsupervised domain adaptation and fully supervised adaptation methods. Furthermore, it outperforms current state-of-the-art models specific to these domains, achieving these results without any task labels for backbone adaptation and requiring only lightweight probes for supervision.
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC




