arXiv

CLIP-like Model as a Foundational Density Ratio Estimator

June 2, 2026 · Fumiya Uchiyama, Rintaro Yanagi, Shohei Taniguchi, Shota Takashiro, Masahiro Suzuki, Hirokatsu Kataoka, Yusuke Iwasawa, Yutaka Matsuo

Title: Leveraging CLIP-Style Architectures as Universal Density Ratio Estimators

Abstract

Density ratio estimation is a cornerstone of statistical machine learning, offering a unified framework for tasks such as importance weighting, divergence estimation, and likelihood-free inference. Despite its significance, its application to vision-and-language models remains largely unexplored. Contemporary encoders such as CLIP and SigLIP are trained with contrastive objectives that, in effect, optimize the log density ratio between the joint distribution of image-text pairs and the product of its marginals. Consequently, these models implicitly learn similarity scores that correlate with log density ratios. While previous research has primarily exploited the embedding capabilities of these models, the density-ratio structure induced by contrastive learning has not been thoroughly investigated or used in multimodal settings.
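As a worked illustration of the claim above, the following sketch states the standard population-level optimum of a one-directional InfoNCE loss. The notation is assumed for illustration and is not taken from the paper: $x$ is an image, $y$ a caption, $s_\theta(x, y)$ the scaled similarity score, and $c(y)$ an arbitrary function of the caption alone.

```latex
% One-directional InfoNCE: pick the paired image among N candidates for caption y.
\[
\mathcal{L}(\theta)
  = -\,\mathbb{E}\left[
      \log \frac{\exp s_\theta(x_1, y)}{\sum_{j=1}^{N} \exp s_\theta(x_j, y)}
    \right],
\qquad (x_1, y) \sim p(x, y), \quad x_2, \dots, x_N \sim p(x).
\]

% At the optimum the score equals the log density ratio up to a caption-dependent offset.
\[
s^{*}(x, y)
  = \log \frac{p(x \mid y)}{p(x)} + c(y)
  = \log \frac{p(x, y)}{p(x)\,p(y)} + c(y).
\]
```

Because the offset $c(y)$ is not identified by the softmax loss, a raw similarity score should be read as a log density ratio only up to this per-caption constant. Any downstream estimator has to normalize it away or cancel it.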

To bridge this gap, we reframe CLIP-style architectures as pre-trained, versatile tools for density ratio estimation, demonstrating that this viewpoint unlocks novel algorithmic potential. We provide a comprehensive analysis of how contrastive objectives facilitate density ratio estimation and introduce two concrete applications: KL divergence estimation and Importance Weight Learning. Our approach to Importance Weight Learning is notably efficient, requiring merely one extra prompt, yet it boosts F1 scores by as much as seven points. Furthermore, we demonstrate that density ratios derived from CLIP can estimate KL divergences, which measure the shift in the distribution of one modality when conditioned on the other. Through qualitative case studies and N-gram analysis of captions, we observe that these divergences effectively reflect the semantic variety and mode structure inherent in multimodal datasets. Exploiting this insight, we develop a straightforward data curation strategy guided by KL divergence, which yields performance levels comparable to the rigorous filtering methods of LAION2B.
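To make the KL divergence idea concrete, here is a minimal sketch of one way to turn similarity scores into an estimate of $D_{\mathrm{KL}}(p(x \mid y) \,\|\, p(x))$ for a single caption $y$. It is not the authors' estimator. It assumes the scores are log density ratios up to an additive constant and that the scored images are samples from the marginal $p(x)$. The function name `kl_from_logits` and the self-normalization step are illustrative choices.

```python
import numpy as np


def kl_from_logits(logits):
    """Estimate KL(p(x|y) || p(x)) from scores s(x_i, y) on N images x_i ~ p(x).

    Assumption: s(x, y) = log p(x|y)/p(x) + c(y), with c(y) unknown.
    Self-normalizing over the N samples removes c(y), since E_{p(x)}[r] = 1.
    """
    logits = np.asarray(logits, dtype=np.float64)
    n = logits.shape[0]
    # Numerically stable log-softmax over the N candidate images.
    shifted = logits - logits.max()
    log_softmax = shifted - np.log(np.exp(shifted).sum())
    # Self-normalized log ratio: the ratios r_i average to exactly 1.
    log_r = log_softmax + np.log(n)
    r = np.exp(log_r)
    # KL(p(x|y) || p(x)) = E_{p(x)}[r log r], approximated by a sample mean.
    return float(np.mean(r * log_r))


# Usage with synthetic scores (hypothetical data, not real CLIP outputs).
rng = np.random.default_rng(0)
generic_caption = rng.normal(scale=0.1, size=1000)   # nearly flat scores
specific_caption = rng.normal(scale=3.0, size=1000)  # sharply peaked scores
print(kl_from_logits(generic_caption), kl_from_logits(specific_caption))
```

Under these assumptions the estimate is always between 0 and log N. A caption whose scores are nearly flat across images yields a value near zero, while a caption that singles out a few images yields a larger one, which matches the reading of the divergence as the shift in one modality when conditioned on the other.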
