arXiv

When Hard Negatives Hurt: Bridging the Generative-Discriminative Gap in Hard Negative Synthesis for Retrieval

June 2, 2026 · Zhicheng Zhang, Jiwei Tang, Kuicai Dong, Xiaopeng Li, Jieming Zhu, Jingyu Li, Qianhui Zhu, Fengyuan Lu, Wang Jiaheng, Gang Wang, Hai-Tao Zheng, Zhaocheng Du

Abstract

While hard negative mining has emerged as the prevailing method for training retrieval systems, it has inherent limitations: the pool of negatives is restricted to the available corpus, negatives are selected by retriever score rather than diagnostic utility, and the mined set is increasingly contaminated by false negatives (relevant documents mislabeled as negatives), which undermines the retriever. Synthesis with Large Language Models (LLMs) is a promising alternative, offering negatives that are unconstrained by the corpus, precisely targeted, and free of false-negative contamination. However, we find that naively adding these generated negatives to contrastive learning frequently degrades retrieval performance.
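As an illustration of what "straightforward integration" could look like, the sketch below appends generated negatives to the mined pool inside a standard InfoNCE loss. The abstract does not name the contrastive objective, so InfoNCE, the temperature value, and the function name info_nce are assumptions made for this example only.

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.05):
    """InfoNCE loss for one query, given precomputed similarity scores.

    pos_sim: similarity between the query and its positive document.
    neg_sims: similarities between the query and each negative document.
    Assumption: the abstract only says "contrastive learning"; InfoNCE is
    used here as a common stand-in, not as the authors' stated objective.
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability assigned to the positive document.
    return log_z - logits[0]

# Naive integration: generated negatives simply join the mined pool.
mined_negs = [0.3, 0.2]      # hypothetical similarity scores
generated_negs = [0.6]       # hypothetical similarity scores
loss_mined_only = info_nce(0.9, mined_negs)
loss_naive_mix = info_nce(0.9, mined_negs + generated_negs)
```

In this setup every negative contributes to the same denominator regardless of where it came from, which is the setting in which the abstract reports degraded retrieval performance.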

We identify and formalize the underlying issue as a "generative-discriminative gap." The gap arises because LLM generation optimizes for fluent, plausible text, whereas contrastive learning needs negatives that violate relevance in targeted ways near the decision boundary. Our investigation uncovers two compounding failure modes:

Discriminative-agnostic generation: LLMs lack an explicit representation of query information needs, leading them to produce generic or topic-drifted text that fails to provide a meaningful contrastive signal.
Source-dependent shortcuts: Distributional artifacts allow the model to differentiate negatives based on their origin rather than their relevance, resulting in gradient drift that actively corrupts the optimization process.
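One way to check for the second failure mode is to ask how well similarity scores alone reveal whether a negative was mined or generated. The sketch below is a hypothetical diagnostic, not a procedure taken from the paper: it computes a pairwise AUC, where a value near 0.5 means the score carries little information about origin and a value near 0 or 1 means the two sources are easy to tell apart.

```python
def source_auc(mined_sims, generated_sims):
    """Probability that a random generated negative outscores a random mined one.

    Hypothetical diagnostic (an assumption of this example): similarity
    scores are taken as given, and ties count as half a win.
    """
    wins = 0.0
    for g in generated_sims:
        for m in mined_sims:
            if g > m:
                wins += 1.0
            elif g == m:
                wins += 0.5
    return wins / (len(generated_sims) * len(mined_sims))

# Fully separable scores: origin is trivially recoverable from similarity.
separable = source_auc([0.1, 0.2], [0.8, 0.9])
# Identical score sets: similarity says nothing about origin.
overlapping = source_auc([0.1, 0.9], [0.1, 0.9])
```

A strongly separable result would suggest that the model can lower its loss by learning origin rather than relevance.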

To bridge this gap, we introduce CausalNeg, a framework comprising two primary components:

CoT-guided counterfactual perturbation for data construction: This module decomposes the reasons a document satisfies a query into explicit information requirements. It then surgically violates individual requirements to create negatives with controlled and interpretable hardness.
Query-view entropy maximization during training: This technique disperses generated negatives across the similarity spectrum, reducing the mutual information between source identity and similarity scores to prevent the exploitation of shortcuts.
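The abstract does not give the exact form of the entropy term, so the following is only one possible reading of the second component: treat the query-to-negative similarity scores as samples, build a soft histogram over the similarity range, and reward high entropy of that histogram. The bin count, bandwidth, weight, and function names are all assumptions made for illustration.

```python
import math

def similarity_entropy(sims, num_bins=5, bandwidth=0.25):
    """Entropy (in nats) of a soft histogram of similarity scores in [-1, 1].

    Assumed formulation, not the paper's: scores clustered in one region
    give low entropy, scores spread across the range give high entropy.
    """
    centers = [-1.0 + 2.0 * i / (num_bins - 1) for i in range(num_bins)]
    occupancy = [0.0] * num_bins
    for s in sims:
        # Soft-assign each score to the bins with a Gaussian kernel.
        weights = [math.exp(-((s - c) ** 2) / (2.0 * bandwidth ** 2)) for c in centers]
        total = sum(weights)
        for i, w in enumerate(weights):
            occupancy[i] += w / (total * len(sims))
    return -sum(p * math.log(p) for p in occupancy if p > 0.0)

def regularized_loss(contrastive_loss, sims, weight=0.1):
    """Subtract the entropy term so that minimizing the loss maximizes entropy."""
    return contrastive_loss - weight * similarity_entropy(sims)

clustered = [0.79, 0.80, 0.80, 0.81]      # hypothetical scores, all in one band
spread = [-1.0, -0.5, 0.0, 0.5, 1.0]      # hypothetical scores across the range
h_clustered = similarity_entropy(clustered)
h_spread = similarity_entropy(spread)
```

This pure-Python version only shows the quantity being rewarded; in actual training the same computation would have to be written with differentiable tensor operations so that gradients reach the encoder.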

We have made our code publicly available at https://github.com/mzhangzhicheng/CausalNeg.

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC
