Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval
Abstract:
Approximate Nearest Neighbour (ANN) search indices underpin modern recommender systems, enabling real-time candidate retrieval from catalogs of millions of items. Conventional approaches learn a single point-estimate embedding for each user and item, and at serving time the user's embedding is used to query the index for relevant items. Because these representations are learned from sparse interaction data, however, they are inherently noisy. A point estimate discards the uncertainty in the data and so fails to capture the full complexity of "relevance." As a result, current retrieval pipelines are systematically biased toward the small subset of popular "head" items, whose embeddings are well estimated, and neglect the long-tail majority of niche, diverse, and serendipitous content.
To address this, we introduce DINOSAUR (Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval), a straightforward framework, compatible with existing infrastructure, that integrates embedding uncertainty into candidate generation. Instead of indexing a single point estimate per item, DINOSAUR generates $S_i$ embeddings for each item $i$ and builds the index on this expanded set. Similarly, at query time, the user embedding is sampled rather than taken as a point estimate. This dual-sided stochastic retrieval mechanism implicitly marginalizes over embedding uncertainty without requiring any changes to the model architecture or the underlying ANN index infrastructure.
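The dual-sided sampling described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation. It assumes diagonal Gaussian embedding distributions, inner-product similarity, and a brute-force scan standing in for a real ANN index. The helper names (`sample_gaussian`, `build_index`, `retrieve`) and the vote-based aggregation over repeated query samples are hypothetical choices made for this example.

```python
import random
from collections import Counter


def sample_gaussian(mean, std, rng):
    """Draw one sample from a diagonal Gaussian (assumed form of the uncertainty)."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def dot(a, b):
    """Inner-product similarity (an assumption; the abstract does not fix a metric)."""
    return sum(x * y for x, y in zip(a, b))


def build_index(item_means, item_stds, samples_per_item, rng):
    """Store S_i sampled embeddings per item; each row keeps its item id."""
    index = []
    for item_id, (mean, std) in enumerate(zip(item_means, item_stds)):
        for _ in range(samples_per_item[item_id]):
            index.append((item_id, sample_gaussian(mean, std, rng)))
    return index


def retrieve(index, user_mean, user_std, k, rng, num_queries=1):
    """Sample the user embedding, search the index, and deduplicate by item id."""
    votes = Counter()
    for _ in range(num_queries):
        query = sample_gaussian(user_mean, user_std, rng)
        # Brute-force top-k rows by score; a real system would query an ANN index here.
        top = sorted(index, key=lambda row: dot(query, row[1]), reverse=True)[:k]
        # Each item gets at most one vote per sampled query.
        votes.update({item_id for item_id, _ in top})
    return [item_id for item_id, _ in votes.most_common(k)]
```

A call such as `retrieve(build_index(means, stds, [4] * len(means), rng), user_mean, user_std, k=10, rng=rng, num_queries=8)` returns up to `k` distinct item ids. Because several rows can belong to the same item, one query may yield fewer than `k` distinct items, so a production system would likely over-fetch rows before deduplicating.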
From an analytical perspective, we demonstrate that DINOSAUR converges to standard point-estimate retrieval as uncertainty diminishes. Furthermore, we characterize how increased embedding variance broadens the regions of latent space where uncertain items can be retrieved. Empirical results, which are fully reproducible, corroborate these theoretical expectations, revealing significant improvements in coverage with only minor reductions in offline recall.
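Both analytical claims can be illustrated under simplifying assumptions that the abstract itself does not state. Suppose similarity is the inner product and the sampled embeddings are independent, $\tilde{u} = \mu_u + \varepsilon_u$ and $\tilde{v} = \mu_v + \varepsilon_v$, with zero-mean noise of covariance $\Sigma_u$ and $\Sigma_v$. The sampled score then satisfies:

$$
\mathbb{E}\big[\tilde{u}^\top \tilde{v}\big] = \mu_u^\top \mu_v,
\qquad
\operatorname{Var}\big[\tilde{u}^\top \tilde{v}\big] = \mu_u^\top \Sigma_v\, \mu_u + \mu_v^\top \Sigma_u\, \mu_v + \operatorname{tr}(\Sigma_u \Sigma_v).
$$

As $\Sigma_u, \Sigma_v \to 0$ the variance vanishes and the score collapses to the point-estimate score $\mu_u^\top \mu_v$. Conversely, a larger $\Sigma_v$ widens the spread of an item's score, so the item can enter the top-$k$ for queries where its mean score alone would fall short. The paper's own analysis may rest on different assumptions.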
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC




