KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models
Title: KODA: Aligning and Comparing Representations in Vision-Language Foundation Models via Contrastive Methods
Abstract: Multimodal learning systems frequently rely on vision-language foundation models such as SigLIP and CLIP for their robust representations. Although these models are commonly benchmarked by downstream task performance, such metrics rarely reveal the structural differences between their internal representations. This study addresses that gap by investigating Contrastive Embedding Clustering, the task of finding subsets of samples that cluster weakly under one representation yet cluster strongly under another. To this end, we introduce Kernel Optimization for Discrepancy Analysis (KODA), a kernel-based framework for comparing and aligning contrastive representations. KODA builds unified multimodal kernels by composing modality-specific kernels and frames discrepancy identification as a constrained optimization problem. The optimization seeks structures that are coherent in a target representation while suppressing coherence in a reference representation. The method therefore produces interpretable discrepancy directions that highlight specific modality interactions and sample subsets. To scale KODA to large vision-language datasets, we implement randomized low-dimensional approximations of joint kernels, using techniques such as Random Fourier Features for shift-invariant kernels. Our empirical results show that KODA consistently uncovers interpretable discrepancy structures across various vision-language representations and yields sample subsets suitable for representation alignment. The source code is available at https://github.com/yokiwuuu/KODA.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
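
The abstract mentions Random Fourier Features (RFF) as one way to approximate shift-invariant kernels in low dimension. The sketch below illustrates only that general technique for a Gaussian kernel on a single set of embeddings; it is not taken from the KODA repository. The function names (`rff_features`, `gaussian_kernel`), the bandwidth `sigma`, the feature count, and the random stand-in embeddings are all assumptions chosen for illustration.

```python
import numpy as np


def rff_features(X, num_features=2000, sigma=1.0, seed=0):
    """Map X of shape (n, d) to Z of shape (n, num_features) so that
    Z @ Z.T approximates the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)).

    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the kernel's spectral density, N(0, sigma^-2 I).
    W = rng.normal(scale=1.0 / sigma, size=(d, num_features))
    # Random phases, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)


def gaussian_kernel(X, sigma=1.0):
    """Exact Gaussian kernel matrix, used here only as a reference."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


# Random stand-in for embeddings (assumption: 50 samples, 8 dimensions).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))

Z = rff_features(X, num_features=4000, sigma=2.0)
K_exact = gaussian_kernel(X, sigma=2.0)
K_approx = Z @ Z.T

# Largest entrywise gap between the exact and approximate kernel matrices.
max_err = np.max(np.abs(K_exact - K_approx))
print(f"max abs error: {max_err:.4f}")
```

The approximation error shrinks roughly as one over the square root of the number of random features, so the feature count trades accuracy against memory and compute. Working with the n-by-D feature matrix `Z` instead of the n-by-n kernel matrix is what makes this kind of approximation attractive for large datasets.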






