Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality
Title: Enhancing Multimodal Embedding Quality Through Collaborative Attention-Based Content Reconstruction
Abstract:
Multimodal embedding models, which are grounded in multimodal large language models (MLLMs), have demonstrated substantial performance gains across various applications, including classification and retrieval. Nevertheless, the majority of current methodologies depend heavily on large-scale contrastive learning, with insufficient investigation into how MLLM training paradigms and architectural designs influence embedding efficacy. Although the causal attention mechanism and next-token prediction framework are highly effective for generative tasks, they do not explicitly foster the development of globally compact representations. This limitation restricts their utility as robust backbones for multimodal embeddings.
To overcome these challenges, we introduce CoCoA, a pre-training paradigm centered on Collaborative Attention for content reconstruction, designed to optimize multimodal embeddings. Our approach reconfigures the attention flow and incorporates an EOS-based reconstruction objective, compelling the model to reconstruct the input from its corresponding embedding. This process incentivizes the multimodal model to compress the semantic content of the input into the EOS token, thereby establishing a solid foundation for subsequent contrastive learning phases.
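As an illustration only: the abstract does not specify how the attention flow is reconfigured, so the following is a minimal sketch of one way an EOS bottleneck could be expressed as an attention mask, not the authors' implementation. It assumes a hypothetical sequence layout of input tokens, then a single EOS token, then reconstruction tokens, with causal attention everywhere except that reconstruction tokens are blocked from attending to the input directly. The function name `bottleneck_attention_mask` and its parameters are invented for this example.

```python
import numpy as np


def bottleneck_attention_mask(n_in: int, n_rec: int) -> np.ndarray:
    """Build a boolean mask where mask[i, j] is True if query i may attend to key j.

    Assumed layout (hypothetical): [n_in input tokens][EOS][n_rec reconstruction tokens].
    """
    n = n_in + 1 + n_rec
    eos = n_in  # index of the EOS token

    # Start from an ordinary causal (lower-triangular) mask.
    mask = np.tril(np.ones((n, n), dtype=bool))

    # Block reconstruction queries from seeing the input tokens, so the only
    # route from the input to the reconstruction is through the EOS position.
    mask[eos + 1:, :eos] = False
    return mask


mask = bottleneck_attention_mask(n_in=3, n_rec=2)
print(mask.astype(int))
```

Under these assumptions, the EOS row still sees every input token, while each reconstruction row sees only the EOS column and earlier reconstruction tokens. A reconstruction loss computed on those positions can therefore only be reduced by packing input information into the EOS representation.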
Extensive evaluations conducted on the MMEB-V1 benchmark reveal that CoCoA, when integrated with Qwen2-VL and Qwen2.5-VL, markedly enhances embedding quality. These results confirm that content reconstruction is a potent strategy for maximizing the utility of existing datasets. By enabling multimodal embedding models to produce compact and highly informative representations, this method raises the performance ceiling for such systems.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





