arXiv

UniNote: A Unified Embedding Model for Multimodal Representation and Ranking

June 2, 2026 · Jinghan Zhao, Wenwei Jin, Anqi Li, Jintao Tong, Luya Mo, Jiawei Li, Bin Li, Yao Hu

Abstract:

Item-to-Item (I2I) retrieval serves as a cornerstone for contemporary content platforms, underpinning essential industrial operations ranging from recommendation systems to content moderation. Although multimodal embedding techniques have significantly enhanced general retrieval capabilities, they frequently struggle in I2I contexts. These difficulties stem from the complex balance required between global content representation and fine-grained local retrieval, the systemic inefficiencies inherent in decoupled embedding-and-ranking pipelines, and the unavoidable compromises between model precision and serving latency.

To address these challenges, we introduce UniNote, a unified embedding model specifically engineered for industrial I2I retrieval. We incorporate tailored retrieval strategies to facilitate representation learning across complex, multimodal content at diverse levels of granularity. To implement these strategies effectively, UniNote utilizes a two-stage training framework. The initial stage employs contrastive Supervised Fine-Tuning (SFT) to create robust foundational embeddings. The subsequent stage enhances ranking quality via a reinforcement learning (RL) process that aligns the model with content relevance.
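The abstract does not specify the contrastive objective used in the first stage. As an illustration only, the sketch below shows a common choice for contrastive fine-tuning of embedding models: an in-batch InfoNCE loss, in which each query item is pulled toward its paired positive and pushed away from the other positives in the batch. The function names, the temperature value of 0.05, and the use of plain Python lists are assumptions made for this example; they are not taken from UniNote.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce_loss(queries, positives, temperature=0.05):
    """In-batch InfoNCE loss (hypothetical helper, not the paper's code).

    queries[i] and positives[i] form a matched pair; every other
    positives[j] in the batch acts as a negative for queries[i].
    The temperature is an assumed value.
    """
    q = [l2_normalize(v) for v in queries]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i against every positive in the batch.
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matched pair as the target class.
        total += log_denom - logits[i]
    return total / len(q)


aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a loss near zero, while the mismatched pairing gives a much larger one. That gap is the signal a contrastive stage optimizes.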

Our findings indicate that UniNote delivers state-of-the-art (SOTA) performance across a wide array of I2I tasks. In large-scale deployments at Xiaohongshu, where UniNote is integrated with Matryoshka Representation Learning (MRL), we observed substantial gains in both retrieval quality and cost efficiency.
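To show why Matryoshka Representation Learning bears on cost efficiency: an MRL-trained model is optimized so that leading prefixes of the embedding remain usable on their own, which lets a serving system store and compare shorter vectors. The sketch below shows only the serving-side step, assuming such a model already exists. The helper names and the example vectors are hypothetical, and nothing here reflects how the Xiaohongshu deployment is actually configured.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and re-normalize to unit length.

    Hypothetical helper: assumes the embedding came from a model trained
    with a Matryoshka-style objective, so that prefixes are meaningful.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else list(head)


def cosine(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


# Illustrative 4-dimensional embeddings (made-up values).
item_a = [3.0, 4.0, 12.0, 0.5]
item_b = [3.0, 4.0, -1.0, 2.0]

short_a = truncate_embedding(item_a, 2)
short_b = truncate_embedding(item_b, 2)
short_score = cosine(short_a, short_b)
```

Halving the dimension halves index storage and the cost of each dot product. How much retrieval quality is retained at each prefix length depends on the trained model and has to be measured.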
