arXiv

CLIP-like Model as a Foundational Density Ratio Estimator

Title: Leveraging CLIP-Style Architectures as Universal Density Ratio Estimators

Abstract

Density ratio estimation is a cornerstone of statistical machine learning, offering a unifying framework for tasks such as importance weighting, divergence estimation, and likelihood-free inference. Despite its significance, it remains largely unexplored in vision-and-language models. Contemporary encoders such as CLIP and SigLIP are trained with contrastive objectives that, in effect, fit the log density ratio between the joint distribution of image-text pairs and the product of their marginals. Consequently, the similarity scores these models learn correlate with log density ratios. Previous research has primarily exploited the embeddings these models produce, while the density-ratio structure induced by contrastive learning has not been thoroughly investigated or used in multimodal settings.

To bridge this gap, we reframe CLIP-style architectures as pre-trained, versatile tools for density ratio estimation, demonstrating that this viewpoint unlocks novel algorithmic potential. We provide a comprehensive analysis of how contrastive objectives facilitate density ratio estimation and introduce two concrete applications: KL divergence estimation and Importance Weight Learning. Our approach to Importance Weight Learning is notably efficient, requiring merely one extra prompt, yet it boosts F1 scores by as much as seven points. Furthermore, we demonstrate that density ratios derived from CLIP can estimate KL divergences, which measure the shift in the distribution of one modality when conditioned on the other. Through qualitative case studies and N-gram analysis of captions, we observe that these divergences effectively reflect the semantic variety and mode structure inherent in multimodal datasets. Exploiting this insight, we develop a straightforward data curation strategy guided by KL divergence, which yields performance levels comparable to the rigorous filtering methods of LAION2B.
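As a rough illustration of how a KL divergence could be read off contrastive scores, the sketch below makes two assumptions that are not taken from the paper: that a score s(x, y_j) equals the log density ratio log p(y_j | x) / p(y_j) up to an additive constant, and that the n candidate captions y_j are drawn from the marginal p(y). Under those assumptions, a self-normalized estimate of KL(p(y | x) || p(y)) is log n minus the entropy of the softmax of the scores. The function name `kl_from_log_ratios` is hypothetical, and this is not necessarily the estimator the authors use.

```python
import numpy as np


def kl_from_log_ratios(log_ratios):
    """Self-normalized estimate of KL(p(y|x) || p(y)) for one fixed x.

    Assumption (not from the paper): log_ratios[j] equals
    log p(y_j|x)/p(y_j) up to an additive constant, with y_j ~ p(y).
    """
    s = np.asarray(log_ratios, dtype=np.float64)
    n = s.size
    s = s - s.max()                        # shift for numerical stability
    log_w = s - np.log(np.exp(s).sum())    # log-softmax: normalized weights
    w = np.exp(log_w)
    # With r_j ~= n * w_j, KL = mean(r_j * log r_j) = sum_j w_j * log(n * w_j).
    return float(np.sum(w * (log_w + np.log(n))))


if __name__ == "__main__":
    # Flat scores: conditioning on x does not change the caption distribution.
    print(kl_from_log_ratios([0.3, 0.3, 0.3, 0.3]))
    # One dominant caption: the estimate approaches its ceiling of log(n).
    print(kl_from_log_ratios([50.0, 0.0, 0.0, 0.0]))
```

Because the weights are normalized over the candidate set, the estimate is invariant to any constant added to the scores and is bounded between 0 and log n, so it saturates when the true divergence exceeds log n.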


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
