UniPinRec: Unifying Generative Retrieval and Ranking at Pinterest Scale
Title: UniPinRec: Achieving Full-Stack Unification of Retrieval and Ranking at Pinterest Scale
Modern recommendation architectures typically train retrieval and ranking components as distinct models. This separation is inefficient, as both stages increasingly depend on large transformers to encode identical user behavior data, resulting in duplicated parameters, heightened computational demands, and increased serving costs. While previous efforts have attempted to merge model architectures, they have not addressed the broader pipeline; input formats, training protocols, and serving stacks remain disjointed across different stages.
We introduce UniPinRec, a solution that accomplishes a comprehensive, full-stack unification of retrieval and ranking at Pinterest. This approach utilizes a single input format, one unified model, and a consolidated training stage, all integrated into existing serving infrastructure. At its core, a shared transformer encodes user action sequences into candidate-independent representations. These representations then branch into retrieval tasks via ANN dot-product and ranking tasks through cross-attention, facilitated by task-specific heads.
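The branching described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: `encode_history` is a toy causal running mean standing in for the shared transformer, exhaustive dot-product scoring stands in for ANN search, and the item names and embeddings are hypothetical. The point it shows is that the history is encoded once, without reference to any candidate, and both heads consume the same states.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def encode_history(actions):
    """Toy stand-in for the shared transformer (assumption): a causal
    running mean, so each state depends only on the user's past actions
    and never on a candidate item."""
    states, running = [], [0.0] * len(actions[0])
    for t, emb in enumerate(actions, start=1):
        running = [r + e for r, e in zip(running, emb)]
        states.append([r / t for r in running])
    return states

def retrieval_head(states, item_embeddings, k):
    """Retrieval: dot product between the final user state and every item.
    Exhaustive search here; a production system would use an ANN index."""
    user_vec = states[-1]
    ordered = sorted(item_embeddings,
                     key=lambda i: dot(user_vec, item_embeddings[i]),
                     reverse=True)
    return ordered[:k]

def ranking_head(states, candidate_emb):
    """Ranking: the candidate acts as the query and cross-attends over the
    candidate-independent history states (used as keys and values)."""
    scale = math.sqrt(len(candidate_emb))
    weights = softmax([dot(candidate_emb, s) / scale for s in states])
    context = [sum(w * s[d] for w, s in zip(weights, states))
               for d in range(len(candidate_emb))]
    return dot(context, candidate_emb)  # toy relevance score

# Hypothetical data: three past actions and three candidate pins.
history = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
items = {"pin_a": [1.0, 0.0], "pin_b": [0.0, 1.0], "pin_c": [-1.0, -1.0]}

states = encode_history(history)                # encoded once
candidates = retrieval_head(states, items, k=2) # retrieval branch
ranked = sorted(candidates,
                key=lambda i: ranking_head(states, items[i]),
                reverse=True)                   # ranking branch
```

Because `states` never depends on a candidate, the same encoder output can feed both the dot-product lookup and the per-candidate cross-attention, which is the property the unified design relies on.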
The success of this framework rests on three key innovations:
1. Masked Action Modeling (MAM): This technique removes the need for interleaving, allowing weight sharing without doubling the context length.
2. Blended Training Examples: This method pairs action sequences with feedview impression slates, enabling the model to satisfy both retrieval and ranking objectives simultaneously.
3. Cross-Stage KV Cache Sharing: By reusing the user-history computations generated during retrieval in the ranking stage, this feature significantly reduces total floating-point operations (FLOPs) compared to running two independent models.
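The idea behind cross-stage KV cache sharing can be sketched as follows. This is a hedged illustration, not the deployed system: `ToyEncoder`, `SharedKVCache`, the request ID, and the toy key/value projections are all hypothetical names and simplifications. The sketch only shows that when retrieval and ranking read from one cache, the user history is encoded a single time per request.

```python
import math

class ToyEncoder:
    """Hypothetical stand-in for the shared transformer; counts forward passes."""
    def __init__(self):
        self.forward_passes = 0

    def encode(self, history):
        self.forward_passes += 1
        keys = [list(e) for e in history]                  # toy key projection
        values = [[2.0 * x for x in e] for e in history]   # toy value projection
        return keys, values

class SharedKVCache:
    """Caches per-request keys/values so later stages can reuse them."""
    def __init__(self, encoder):
        self.encoder = encoder
        self._entries = {}

    def get(self, request_id, history):
        if request_id not in self._entries:
            self._entries[request_id] = self.encoder.encode(history)
        return self._entries[request_id]

def retrieve(cache, request_id, history, items, k):
    # Retrieval populates the cache and scores items by dot product.
    _, values = cache.get(request_id, history)
    user_vec = values[-1]
    ordered = sorted(items,
                     key=lambda i: sum(u * v for u, v in zip(user_vec, items[i])),
                     reverse=True)
    return ordered[:k]

def rank(cache, request_id, history, items, candidates):
    # Ranking hits the cache: no second pass over the user history.
    keys, values = cache.get(request_id, history)

    def score(item_id):
        q = items[item_id]
        logits = [sum(a * b for a, b in zip(q, key)) for key in keys]
        m = max(logits)
        w = [math.exp(l - m) for l in logits]
        z = sum(w)
        ctx = [sum(wi * v[d] for wi, v in zip(w, values)) / z
               for d in range(len(q))]
        return sum(a * b for a, b in zip(ctx, q))

    return sorted(candidates, key=score, reverse=True)

# Hypothetical request: one user history, three candidate pins.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
items = {"pin_a": [1.0, 0.0], "pin_b": [0.0, 1.0], "pin_c": [-1.0, -1.0]}

encoder = ToyEncoder()
cache = SharedKVCache(encoder)
candidates = retrieve(cache, "req-1", history, items, k=2)
ranked = rank(cache, "req-1", history, items, candidates)
```

After both stages run, `encoder.forward_passes` is 1. Giving each stage its own encoder and cache would encode the same history twice, which is the duplicated computation that sharing avoids.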
Since its deployment on Pinterest’s core surfaces, UniPinRec has increased online engagement by approximately 1%. It has also reduced end-to-end serving latency by 11.1% and boosted queries per second (QPS) by 63.6%. To our knowledge, this is the first full-stack unification of retrieval and ranking (spanning inputs, model architecture, training, and serving) to be successfully deployed in a production recommendation system.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





