When Hard Negatives Hurt: Bridging the Generative-Discriminative Gap in Hard Negative Synthesis for Retrieval
Abstract
Hard negative mining is the prevailing method for training retrieval systems, but it has inherent limitations: the pool of negatives is bounded by the available corpus, selection relies on retriever scores rather than diagnostic utility, and the mined pool is increasingly contaminated by false negatives, that is, relevant documents mistakenly treated as negatives. Synthesis with Large Language Models (LLMs) offers a more robust alternative: generated negatives are not bounded by the corpus, can be precisely targeted, and are free of false-negative contamination. However, we find that naively adding these generated negatives to contrastive learning frameworks frequently degrades retrieval performance.
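For readers less familiar with the setup, the following is a minimal sketch of how negatives enter a standard InfoNCE-style contrastive loss. It is not taken from the paper: the function name `info_nce`, the temperature value, and the example scores are illustrative assumptions, and the abstract does not state which loss variant the authors use.

```python
import math

def info_nce(pos_score, neg_scores, temperature=0.05):
    """InfoNCE loss for one query (illustrative sketch, not the paper's code).

    pos_score:  similarity between the query and its relevant document.
    neg_scores: similarities between the query and each negative,
                whether mined from the corpus or generated by an LLM.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability assigned to the positive document.
    return log_denominator - logits[0]

# A negative scored close to the positive contributes more loss
# (and more gradient) than one scored far below it.
hard_loss = info_nce(0.9, [0.8])
easy_loss = info_nce(0.9, [0.1])
```

In this formulation every negative, whatever its origin, competes with the positive in the same softmax, which is why the quality and distribution of generated negatives directly shape the gradient.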
We identify and formalize the underlying issue as a "generative-discriminative gap." The gap arises because LLM generation optimizes for fluent, plausible text, whereas contrastive learning needs targeted violations of relevance near the decision boundary. Our investigation uncovers two compounding failure modes:
- Discriminative-agnostic generation: LLMs lack an explicit representation of query information needs, leading them to produce generic or topic-drifted text that fails to provide a meaningful contrastive signal.
- Source-dependent shortcuts: Distributional artifacts let the retriever distinguish negatives by their origin rather than their relevance, producing gradient drift that actively corrupts the optimization process.
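One way to check for the second failure mode is to ask how well a similarity score alone predicts whether a negative was generated or mined. The sketch below computes a pairwise AUC for that purpose. It is an assumed diagnostic, not a procedure described in the abstract, and `source_auc` and its arguments are hypothetical names.

```python
def source_auc(generated_scores, mined_scores):
    """Probability that a random generated negative scores higher than a
    random mined negative (ties count as half). Illustrative sketch only.

    Values near 0.5 mean the score carries little information about source;
    values near 0.0 or 1.0 mean source is almost fully recoverable from the
    score, which is the condition a shortcut can exploit.
    """
    if not generated_scores or not mined_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for g in generated_scores:
        for m in mined_scores:
            if g > m:
                wins += 1.0
            elif g == m:
                wins += 0.5
    return wins / (len(generated_scores) * len(mined_scores))

# Generated negatives that all sit above the mined ones are trivially separable.
separable = source_auc([0.9, 0.8], [0.1, 0.2])
# Identical score distributions give no source signal.
mixed = source_auc([0.1, 0.9], [0.1, 0.9])
```

The pairwise form is quadratic in the number of scores, which is acceptable for a per-query or per-batch diagnostic but would need a rank-based implementation for large pools.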
To bridge this gap, we introduce CausalNeg, a framework comprising two primary components:
- CoT-guided counterfactual perturbation for data construction: This module decomposes the reasons a document satisfies a query into explicit information requirements. It then surgically violates individual requirements to create negatives with controlled and interpretable hardness.
- Query-view entropy maximization during training: This technique disperses generated negatives across the similarity spectrum, reducing the mutual information between source identity and similarity scores to prevent the exploitation of shortcuts.
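To make the second component more concrete, here is a minimal sketch of one possible dispersion measure: for a single query, the similarity scores of its generated negatives are soft-binned across the similarity range, and the entropy of the resulting histogram is computed. The abstract does not give the actual objective, so the binning scheme, the bandwidth, and the names `soft_histogram` and `dispersion_entropy` are all assumptions made for illustration.

```python
import math

def soft_histogram(scores, num_bins=5, lo=-1.0, hi=1.0, bandwidth=0.2):
    """Soft-assign each score to bins over [lo, hi] (assumed scheme)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    centers = [lo + (hi - lo) * (i + 0.5) / num_bins for i in range(num_bins)]
    mass = [0.0] * num_bins
    for s in scores:
        s = min(max(s, lo), hi)  # clamp so at least one bin gets weight
        weights = [math.exp(-((s - c) / bandwidth) ** 2) for c in centers]
        total = sum(weights)
        for i, w in enumerate(weights):
            mass[i] += w / total
    return [m / len(scores) for m in mass]

def entropy(probs):
    """Shannon entropy in nats, skipping empty bins."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def dispersion_entropy(scores, **kwargs):
    """Entropy of one query's negative-score histogram.

    In training, the negated value could be added to the contrastive loss so
    that minimizing the loss spreads negatives across the similarity range.
    """
    return entropy(soft_histogram(scores, **kwargs))

clustered = dispersion_entropy([0.80, 0.81, 0.79, 0.80])
spread = dispersion_entropy([-0.8, -0.4, 0.0, 0.4, 0.8])
```

Negatives bunched at one similarity level yield low entropy, while negatives spread across the range approach the maximum of `log(num_bins)`. A differentiable version in an autodiff framework would follow the same structure, since the soft binning avoids hard bin assignments.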
We have made our code publicly available at https://github.com/mzhangzhicheng/CausalNeg.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





