arXiv

UniPinRec: Unifying Generative Retrieval and Ranking at Pinterest Scale

Title: UniPinRec: Achieving Full-Stack Unification of Retrieval and Ranking at Pinterest Scale

Modern recommendation architectures typically train retrieval and ranking components as distinct models. This separation is inefficient, as both stages increasingly depend on large transformers to encode identical user behavior data, resulting in duplicated parameters, heightened computational demands, and increased serving costs. While previous efforts have attempted to merge model architectures, they have not addressed the broader pipeline; input formats, training protocols, and serving stacks remain disjointed across different stages.

We introduce UniPinRec, a solution that accomplishes a comprehensive, full-stack unification of retrieval and ranking at Pinterest. This approach utilizes a single input format, one unified model, and a consolidated training stage, all integrated into existing serving infrastructure. At its core, a shared transformer encodes user action sequences into candidate-independent representations. These representations then branch into retrieval tasks via ANN dot-product and ranking tasks through cross-attention, facilitated by task-specific heads.
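The branching described above can be illustrated with a minimal sketch. It is not the UniPinRec implementation: `encode_user` is a hypothetical stand-in for the shared transformer (it simply mean-pools the action embeddings), `retrieve` uses exact brute-force dot-product search in place of an ANN index, and `rank_score` is an assumed single-head cross-attention scorer. All function names and the toy data are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode_user(action_embeddings):
    """Hypothetical stand-in for the shared transformer.

    Returns per-action states (unchanged here) and a mean-pooled user
    vector. Neither output depends on any candidate item.
    """
    dim = len(action_embeddings[0])
    count = len(action_embeddings)
    pooled = [sum(a[i] for a in action_embeddings) / count for i in range(dim)]
    return action_embeddings, pooled


def retrieve(user_vector, item_embeddings, k):
    """Exact dot-product top-k; a real system would query an ANN index."""
    ordered = sorted(
        item_embeddings,
        key=lambda item_id: dot(user_vector, item_embeddings[item_id]),
        reverse=True,
    )
    return ordered[:k]


def rank_score(candidate, user_states):
    """Assumed ranking head: the candidate cross-attends over user states."""
    scale = math.sqrt(len(candidate))
    weights = softmax([dot(candidate, s) / scale for s in user_states])
    context = [
        sum(w * s[i] for w, s in zip(weights, user_states))
        for i in range(len(candidate))
    ]
    return dot(candidate, context)


def recommend(action_embeddings, item_embeddings, k):
    # The user is encoded once; both stages consume the same outputs.
    user_states, user_vector = encode_user(action_embeddings)
    candidates = retrieve(user_vector, item_embeddings, k)
    return sorted(
        candidates,
        key=lambda item_id: rank_score(item_embeddings[item_id], user_states),
        reverse=True,
    )
```

The point of the sketch is the data flow: the user encoding is computed once and is candidate-independent, so the retrieval branch can score it against an item index while the ranking branch attends over the same per-action states for each retrieved candidate.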

The success of this framework rests on three key innovations:

1. Masked Action Modeling (MAM): This technique removes the need for interleaving, allowing for weight sharing without necessitating a doubling of context length.
2. Blended Training Examples: This method pairs action sequences with feedview impression slates, enabling the model to satisfy both retrieval and ranking objectives simultaneously.
3. Cross-Stage KV Cache Sharing: By reusing user-history computations generated during retrieval for the ranking process, this feature significantly reduces total floating-point operations (FLOPs) compared to operating two independent models.
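The idea behind cross-stage KV cache sharing can be sketched as follows. The abstract does not describe the actual mechanism, so this is an illustration under stated assumptions: the history's key and value projections are built once per request (`build_history_kv`, a hypothetical helper), and the ranking stage attends over that cached result instead of re-projecting the history. The `projection_multiplies` cost model is a rough illustrative count, not a figure from the paper.

```python
import math


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_history_kv(history, w_key, w_value):
    """Project every history state to a key and a value, once per request."""
    return {
        "keys": [matvec(w_key, h) for h in history],
        "values": [matvec(w_value, h) for h in history],
    }


def attend(query, kv):
    """Attention over cached keys/values; the history is not re-projected."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in kv["keys"]]
    weights = softmax(scores)
    dim = len(kv["values"][0])
    return [sum(w * v[i] for w, v in zip(weights, kv["values"])) for i in range(dim)]


def projection_multiplies(history_length, dim):
    """Rough multiply count for one key and one value projection per state."""
    return 2 * history_length * dim * dim


# Toy request: the cache is built once (assumed to happen during retrieval)
# and then reused for every candidate scored during ranking.
history = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
kv = build_history_kv(history, identity, identity)
contexts = [attend(candidate, kv) for candidate in ([1.0, 0.0], [0.0, 1.0])]
```

Under this toy cost model, two independent models would each pay `projection_multiplies(n, d)` for the same history, whereas a shared cache pays it once; the saving grows with history length, which is where long user sequences make the duplication expensive.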

Since its deployment on Pinterest’s core surfaces, UniPinRec has increased online engagement by approximately 1%. It has also reduced end-to-end serving latency by 11.1% and raised queries per second (QPS) by 63.6%. To our knowledge, this is the first full-stack unification of retrieval and ranking (spanning inputs, model architecture, training, and serving) to be successfully deployed in a production recommendation system.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
