arXiv

Which Leakage Types Matter? A Quantitative Landscape Across 2,047 Benchmark Datasets

Title: Prioritizing Data Leakage: A Quantitative Analysis Across 2,047 Benchmark Datasets

Abstract:

This study evaluates the impact of four distinct categories of data leakage in machine learning through a comprehensive series of experiments. The methodology includes twenty-eight within-subject counterfactual tests conducted on 2,047 independent and identically distributed (iid) tabular datasets, alongside a boundary analysis involving 129 temporal datasets.

The findings reveal large differences in severity across these leakage types. Class I leakage (estimation errors, such as fitting scalers on the entire dataset) is negligible: in all nine tested conditions, the absolute change in Area Under the Curve, $|\Delta\mathrm{AUC}|$, was at most 0.005. In contrast, Class II leakage (selection biases, such as peeking at data and cherry-picking random seeds) has substantial effects. The data suggest that approximately 90% of the score inflation in this category is attributable to noise exploitation.
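To make the seed cherry-picking mechanism concrete, the following is a minimal simulation sketch, not the study's protocol. It assumes each random seed yields the true score plus zero-mean Gaussian evaluation noise; the function name `best_of_k_inflation` and all parameter values (`true_score`, `noise_sd`, `k`, `trials`) are hypothetical choices made for illustration.

```python
import random
import statistics


def best_of_k_inflation(true_score=0.80, noise_sd=0.01, k=10, trials=2000, seed=0):
    """Mean gap between the best-of-k reported score and the true score.

    Assumption (not from the paper): per-seed scores are the true score
    plus independent Gaussian noise with standard deviation `noise_sd`.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        # One noisy evaluation per candidate seed.
        scores = [true_score + rng.gauss(0.0, noise_sd) for _ in range(k)]
        # Reporting only the best seed exploits the noise.
        gaps.append(max(scores) - true_score)
    return statistics.mean(gaps)


honest = best_of_k_inflation(k=1)          # report a single pre-chosen seed
cherry_picked = best_of_k_inflation(k=10)  # report the best of ten seeds
print(f"k=1 inflation:  {honest:+.4f}")
print(f"k=10 inflation: {cherry_picked:+.4f}")
```

Under these assumptions the single-seed estimate is unbiased, while the best-of-ten estimate is inflated by an amount that depends only on the noise level and the number of seeds tried, not on any real improvement in the model.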

Class III leakage (memorization) grows with model capacity: at a 10% duplication rate, the effect size ($d_z$) ranges from 0.37 for Naive Bayes models to 1.11 for Decision Trees. Class IV leakage, by contrast, is undetectable under standard random cross-validation.
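The abstract does not define $d_z$. Assuming it is the standard paired-samples effect size (Cohen's $d_z$: the mean of the paired differences divided by their sample standard deviation), which fits a within-subject design where each dataset is scored with and without leakage, it can be sketched as follows. The function name `cohens_dz` and the example scores are hypothetical and are not results from the study.

```python
import statistics


def cohens_dz(leaky, clean):
    """Cohen's d_z for paired scores: mean(diff) / sample_sd(diff)."""
    if len(leaky) != len(clean):
        raise ValueError("paired samples must have equal length")
    if len(leaky) < 2:
        raise ValueError("need at least two pairs to estimate a standard deviation")
    # One difference per dataset: leaky-condition score minus clean-condition score.
    diffs = [a - b for a, b in zip(leaky, clean)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


# Hypothetical AUC values for four datasets, for illustration only.
leaky_auc = [0.82, 0.85, 0.80, 0.88]
clean_auc = [0.80, 0.84, 0.80, 0.85]
print(f"d_z = {cohens_dz(leaky_auc, clean_auc):.3f}")
```

Because $d_z$ standardizes by the spread of the per-dataset differences rather than by the spread of the raw scores, a small but consistent inflation can still yield a large effect size.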

For iid tabular data, these results invert the traditional textbook hierarchy of concerns: normalization-related leakage is the least impactful, while selection-based leakage poses the greatest risk at practical dataset scales.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
