arXiv

When Hard Negatives Hurt: Bridging the Generative-Discriminative Gap in Hard Negative Synthesis for Retrieval

Abstract

Hard negative mining is the prevailing method for training retrieval systems, but it has inherent limitations: the pool of negatives is bounded by the available corpus, selection relies on retriever scores rather than diagnostic utility, and the mined pool is increasingly contaminated by false negatives, that is, relevant documents mistakenly treated as negatives. Synthesis via Large Language Models (LLMs) offers an appealing alternative, since generated negatives are unconstrained by the corpus, precisely targeted, and free of false-negative contamination. However, our findings indicate that naively adding these generated negatives to contrastive learning frequently degrades retrieval performance.

We pinpoint and formalize the underlying issue as a "generative-discriminative gap." This disconnect arises because LLM generation prioritizes fluent and plausible text, whereas contrastive learning requires strategic breaches of relevance at the decision boundary. Our investigation uncovers two compounding failure modes:

  1. Discriminative-agnostic generation: LLMs lack an explicit representation of query information needs, leading them to produce generic or topic-drifted text that fails to provide a meaningful contrastive signal.
  2. Source-dependent shortcuts: Distributional artifacts allow the model to differentiate negatives based on their origin rather than their relevance, resulting in gradient drift that actively corrupts the optimization process.

To bridge this gap, we introduce CausalNeg, a framework comprising two primary components:

  1. CoT-guided counterfactual perturbation for data construction: This module decomposes the reasons a document satisfies a query into explicit information requirements. It then surgically violates individual requirements to create negatives with controlled and interpretable hardness.
  2. Query-view entropy maximization during training: This technique disperses generated negatives across the similarity spectrum, reducing the mutual information between source identity and similarity scores to prevent the exploitation of shortcuts.

We have made our code publicly available at https://github.com/mzhangzhicheng/CausalNeg.
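
As a rough illustration of the second component above, the sketch below shows one way an entropy term over similarity scores could reward dispersion. It is not taken from the authors' repository, and every detail in it is an assumption made for illustration: the soft-binning scheme, the function names (`soft_histogram`, `dispersion_entropy`, `total_loss`), the bin count, the bandwidth, the temperature, and the weighting are all hypothetical. The idea is only that, for a single query, the cosine similarities of its generated negatives are softly binned over [-1, 1], and the entropy of that histogram is subtracted from a standard InfoNCE loss, so negatives clustered at one similarity level are penalized relative to negatives spread across the range.

```python
import math


def soft_histogram(sims, num_bins=5, lo=-1.0, hi=1.0, bandwidth=0.2):
    """Softly assign each similarity to evenly spaced bin centers.

    Hypothetical helper: returns a probability vector over bins.
    """
    centers = [lo + (hi - lo) * (k + 0.5) / num_bins for k in range(num_bins)]
    mass = [0.0] * num_bins
    for s in sims:
        s = min(max(s, lo), hi)  # clamp so the kernel weights cannot all underflow
        weights = [math.exp(-((s - c) / bandwidth) ** 2) for c in centers]
        total = sum(weights)
        for k, w in enumerate(weights):
            mass[k] += w / total
    return [m / len(sims) for m in mass]


def entropy(probs):
    """Shannon entropy in nats, skipping empty bins."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def dispersion_entropy(neg_sims, **kwargs):
    """Entropy of one query's generated-negative similarity histogram."""
    return entropy(soft_histogram(neg_sims, **kwargs))


def info_nce(pos_sim, neg_sims, temperature=0.05):
    """Standard InfoNCE loss for one query with one positive."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


def total_loss(pos_sim, neg_sims, entropy_weight=0.1, temperature=0.05):
    """Contrastive loss minus a weighted entropy bonus (assumed combination)."""
    return info_nce(pos_sim, neg_sims, temperature) - entropy_weight * dispersion_entropy(neg_sims)


# Negatives clustered at one similarity level vs. spread across the range.
clustered = [0.80, 0.81, 0.79, 0.80, 0.80]
spread = [-0.8, -0.4, 0.0, 0.4, 0.8]
print(dispersion_entropy(clustered), dispersion_entropy(spread))
```

In this toy setting the spread set receives the higher entropy, so the regularizer favors it. In actual training the similarities would come from the retriever's embeddings and the term would be differentiated through an autodiff framework; the pure-Python version here only demonstrates the quantity being computed.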


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
