arXiv

Do Neural Retrievers Prefer Certain Documents? Evidence of Learned Relevance Priors

Abstract:

Neural retrievers are designed to gauge query-document relevance from annotated pairs, but the annotation process itself may introduce bias. Because annotators typically label only a subset of documents, the selection criteria can inadvertently privilege certain document categories over others. This study examines whether supervised bi-encoder retrievers acquire a document-level relevance prior: a query-independent signal embedded in their representation space as a byproduct of training on labeled data. To quantify this prior, we fit simple classifiers on frozen document embeddings and assess three leading retrievers across several information retrieval benchmarks.
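The probing setup described above can be illustrated with a minimal sketch. It assumes the probe is a logistic-regression classifier that predicts, from a document embedding alone, whether the document was labeled relevant; the abstract does not specify the classifier or the label definition. The embeddings below are synthetic stand-ins rather than real retriever output, and the names `train_linear_probe`, `prior_score`, and `make_docs` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_linear_probe(embeddings, labels, lr=0.5, epochs=500, l2=1e-3):
    """Fit a logistic-regression probe with batch gradient descent.

    The embeddings are treated as fixed inputs (the encoder stays frozen);
    only the probe weights w and bias b are learned.
    """
    n_docs, dim = embeddings.shape
    w, b = np.zeros(dim), 0.0
    for _ in range(epochs):
        error = sigmoid(embeddings @ w + b) - labels
        w -= lr * (embeddings.T @ error / n_docs + l2 * w)
        b -= lr * error.mean()
    return w, b

def prior_score(embeddings, w, b):
    """Query-independent score: probe probability that a document is labeled relevant."""
    return sigmoid(embeddings @ w + b)

# Synthetic stand-in for frozen document embeddings (assumption: real
# embeddings would come from a trained bi-encoder's document tower).
rng = np.random.default_rng(0)
dim = 16
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)

def make_docs(n_docs):
    # Label 1 = "was annotated relevant for some query" (assumed label definition).
    labels = rng.integers(0, 2, size=n_docs).astype(float)
    embeddings = rng.normal(size=(n_docs, dim)) + 3.0 * labels[:, None] * direction
    return embeddings, labels

train_x, train_y = make_docs(400)
test_x, test_y = make_docs(200)  # documents the probe has never seen

w, b = train_linear_probe(train_x, train_y)
scores = prior_score(test_x, w, b)
held_out_accuracy = float(((scores > 0.5) == (test_y == 1.0)).mean())
print(f"held-out probe accuracy: {held_out_accuracy:.2f}")
```

If a probe of this kind beats chance on held-out documents, the embedding space carries a signal about annotation status that does not depend on any query, which is the sense of "relevance prior" used here.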

Our results indicate that supervised neural models do encode relevance priors that generalize to unseen documents and remain consistent across different architectures. These priors create a "findability gap": documents with lower prior scores are systematically more difficult to retrieve, even when they are genuinely relevant to the query. The gap is pronounced in supervised dense retrievers, notably weaker and less consistent in BM25, and persists under controlled comparisons of matched documents.
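One way to make the findability gap concrete is sketched below. It assumes the gap is measured as the difference in recall@k between relevant documents with above-median and below-median prior scores; the abstract does not state the exact metric or split, so both are illustrative choices. The function name `findability_gap` and the toy numbers are hypothetical.

```python
import numpy as np

def findability_gap(prior_scores, ranks, k=10):
    """Compare recall@k for high-prior vs. low-prior relevant documents.

    prior_scores[i]: query-independent prior score of the i-th relevant document.
    ranks[i]: rank (1 = best) at which the retriever returned that document
              for its own query.
    Returns (gap, recall_high, recall_low), where gap = recall_high - recall_low.
    """
    prior_scores = np.asarray(prior_scores, dtype=float)
    ranks = np.asarray(ranks)
    high = prior_scores > np.median(prior_scores)  # median split (assumed)
    hit = ranks <= k                               # retrieved within the top k
    recall_high = float(hit[high].mean())
    recall_low = float(hit[~high].mean())
    return recall_high - recall_low, recall_high, recall_low

# Toy data: eight relevant (query, document) pairs, invented for illustration.
prior_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
ranks = [1, 3, 12, 5, 8, 40, 25, 60]

gap, recall_high, recall_low = findability_gap(prior_scores, ranks, k=10)
print(f"recall@10 high-prior: {recall_high:.2f}, low-prior: {recall_low:.2f}, gap: {gap:.2f}")
```

A positive gap means that, among documents that are all relevant to their queries, those with low prior scores are retrieved less often. Running the same computation on BM25 ranks would give the comparison point mentioned above.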

Through LLM-generated explanations, we observe that documents deemed relevant by annotators are typically comprehensive, self-contained summaries of mainstream subjects. Conversely, niche, fragmented, or highly technical content is frequently excluded from judgment. Retrievers internalize this human bias, elevating documents with these favored characteristics above those that lack them, regardless of their true relevance. These findings reveal a structural constraint of supervised retrieval: models trained on annotated data do not merely learn relevance but also absorb the implicit document preferences inherent in their training sets.


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
