arXiv

Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality

Title: Enhancing Multimodal Embedding Quality Through Collaborative Attention-Based Content Reconstruction

Abstract:

Multimodal embedding models, which are grounded in multimodal large language models (MLLMs), have demonstrated substantial performance gains across various applications, including classification and retrieval. Nevertheless, the majority of current methodologies depend heavily on large-scale contrastive learning, with insufficient investigation into how MLLM training paradigms and architectural designs influence embedding efficacy. Although the causal attention mechanism and next-token prediction framework are highly effective for generative tasks, they do not explicitly foster the development of globally compact representations. This limitation restricts their utility as robust backbones for multimodal embeddings.

To overcome these challenges, we introduce CoCoA, a pre-training paradigm centered on Collaborative Attention for content reconstruction, designed to optimize multimodal embeddings. Our approach reconfigures the attention flow and incorporates an EOS-based reconstruction objective, compelling the model to reconstruct the input from its corresponding embedding. This process incentivizes the multimodal model to compress the semantic content of the input into the EOS token, thereby establishing a solid foundation for subsequent contrastive learning phases.
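The abstract does not spell out how the attention flow is reconfigured, so the following is only an illustrative sketch of one way an EOS bottleneck could be expressed as an attention mask. The sequence layout (input tokens, then one EOS token, then reconstruction tokens), the function names `build_bottleneck_mask` and `render`, and the masking rule itself are assumptions made for illustration, not the CoCoA implementation. The idea shown is that reconstruction positions may attend to the EOS token and to earlier reconstruction positions but never to the input directly, so any information needed to reconstruct the input has to pass through the EOS token.

```python
def build_bottleneck_mask(n_input, n_recon):
    """Return mask[i][j]: may query position i attend to key position j?

    Assumed layout (hypothetical, not taken from the paper):
        [input tokens] [EOS] [reconstruction tokens]
    """
    eos = n_input                      # index of the single EOS token
    total = n_input + 1 + n_recon
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):         # no position attends to the future
            if i <= eos:
                # Input tokens and EOS use ordinary causal attention,
                # so EOS can see the whole input.
                mask[i][j] = True
            else:
                # Reconstruction tokens see only EOS and earlier
                # reconstruction tokens; the raw input is hidden.
                mask[i][j] = j >= eos
    return mask


def render(mask):
    """Draw the mask as text: 'x' = attention allowed, '.' = blocked."""
    return "\n".join(
        "".join("x" if allowed else "." for allowed in row) for row in mask
    )


if __name__ == "__main__":
    # 3 input tokens, 1 EOS token, 2 reconstruction tokens
    print(render(build_bottleneck_mask(3, 2)))
```

In the rendered mask, the last two rows have no `x` in the first three columns, which is the bottleneck: the reconstruction positions cannot read the input tokens. A boolean mask in which `True` means "may attend" follows the convention of the boolean `attn_mask` argument of PyTorch's `torch.nn.functional.scaled_dot_product_attention`; other attention APIs may use the opposite convention.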

Extensive evaluations conducted on the MMEB-V1 benchmark reveal that CoCoA, when integrated with Qwen2-VL and Qwen2.5-VL, markedly enhances embedding quality. These results confirm that content reconstruction is a potent strategy for maximizing the utility of existing datasets. By enabling multimodal embedding models to produce compact and highly informative representations, this method raises the performance ceiling for such systems.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
