arXiv

Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval

Abstract:

Approximate Nearest Neighbour (ANN) search indices underpin modern recommender systems, enabling real-time candidate retrieval from catalogs of millions of items. Conventional approaches learn a single point-estimate embedding for each user and item; at serving time, the user’s embedding is used to query the index for relevant items. Because these representations are learned from sparse interaction data, they are inherently noisy, and a point estimate discards that uncertainty and so fails to capture the full complexity of "relevance." As a result, current retrieval pipelines are systematically biased toward the small subset of popular "head" items, whose embeddings are well estimated, and neglect the long-tail majority of niche, diverse, and serendipitous content.

To address this, we introduce DINOSAUR (Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval), a simple framework, compatible with existing infrastructure, that integrates embedding uncertainty into candidate generation. Instead of indexing a single point estimate per item, DINOSAUR samples $S_i$ embeddings for each item $i$ and builds the index on this expanded set. At query time, the user embedding is likewise sampled rather than fixed at its point estimate. This dual-sided stochastic retrieval mechanism implicitly marginalizes over embedding uncertainty without requiring any change to the model architecture or the underlying ANN index infrastructure.
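The dual-sided sampling described above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: embeddings are modelled as isotropic Gaussians (a mean vector plus a scalar standard deviation), scoring is by inner product, a brute-force scan stands in for a real ANN index, and the function names `build_sampled_index` and `retrieve` are hypothetical.

```python
import numpy as np


def build_sampled_index(item_means, item_stds, samples_per_item, rng):
    """Draw samples_per_item[i] embeddings for item i (assumed isotropic Gaussian).

    Returns the stacked sample vectors and, for each row, the id of the item
    it was sampled from. A real system would hand the vectors to an ANN index.
    """
    vectors, owners = [], []
    for item_id, (mean, std, n) in enumerate(
        zip(item_means, item_stds, samples_per_item)
    ):
        noise = rng.standard_normal((n, mean.shape[0]))
        vectors.append(mean + std * noise)
        owners.append(np.full(n, item_id))
    return np.vstack(vectors), np.concatenate(owners)


def retrieve(user_mean, user_std, index_vectors, index_owners, k, rng):
    """Sample one user embedding and return up to k distinct item ids.

    Brute-force inner-product search is used as a stand-in for the ANN query.
    Collapsing several samples of one item into a single hit is an assumption.
    """
    query = user_mean + user_std * rng.standard_normal(user_mean.shape[0])
    scores = index_vectors @ query
    ranked = np.argsort(-scores)  # best score first
    seen, result = set(), []
    for row in ranked:
        item = int(index_owners[row])
        if item not in seen:
            seen.add(item)
            result.append(item)
        if len(result) == k:
            break
    return result


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    item_means = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
    item_stds = np.array([0.05, 0.05, 0.8])  # the third item is the uncertain one
    samples_per_item = [2, 2, 8]             # plays the role of S_i
    vectors, owners = build_sampled_index(item_means, item_stds, samples_per_item, rng)
    print(retrieve(np.array([1.0, 0.1]), 0.1, vectors, owners, k=2, rng=rng))
```

Because the sampled vectors are ordinary points, the index itself needs no modification in this sketch; only the number of indexed rows grows, from one per item to $S_i$ per item. How $S_i$ is chosen and how the embedding distributions are obtained are not specified in the abstract and are left as inputs here.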

Analytically, we show that DINOSAUR converges to standard point-estimate retrieval as uncertainty diminishes, and we characterize how increased embedding variance broadens the regions of latent space from which uncertain items can be retrieved. Empirical results, which are fully reproducible, corroborate these theoretical expectations, revealing significant improvements in coverage with only minor reductions in offline recall.
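Both analytical claims can be made concrete under assumptions that the abstract does not state: isotropic Gaussian embedding noise and inner-product scoring. The symbols $\mu$, $\sigma$, $\varepsilon$, $e_i^{(s)}$, and $z_u$ are introduced here purely for illustration and may differ from the paper's notation.

```latex
\begin{align*}
e_i^{(s)} &= \mu_i + \sigma_i\,\varepsilon_i^{(s)},
  \qquad \varepsilon_i^{(s)} \sim \mathcal{N}(0, I), \quad s = 1, \dots, S_i, \\
z_u &= \mu_u + \sigma_u\,\varepsilon_u,
  \qquad \varepsilon_u \sim \mathcal{N}(0, I), \\
z_u^\top e_i^{(s)} &= \mu_u^\top \mu_i
  + \sigma_i\,\mu_u^\top \varepsilon_i^{(s)}
  + \sigma_u\,\varepsilon_u^\top \mu_i
  + \sigma_u \sigma_i\,\varepsilon_u^\top \varepsilon_i^{(s)}
  \;\longrightarrow\; \mu_u^\top \mu_i
  \quad \text{as } \sigma_u, \sigma_i \to 0, \\
z_u^\top e_i^{(s)} \,\big|\, z_u &\sim
  \mathcal{N}\!\left( z_u^\top \mu_i,\; \sigma_i^2 \lVert z_u \rVert^2 \right).
\end{align*}
```

The third line recovers the point-estimate score $\mu_u^\top \mu_i$ in the zero-uncertainty limit. The last line shows that, for a fixed query, the spread of an item's sampled scores grows with $\sigma_i$, so a high-variance item has a better chance of entering the top results for queries farther from $\mu_i$. This is a sketch of the qualitative behaviour under the stated assumptions, not a reproduction of the paper's derivation.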


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
