Which Leakage Types Matter? A Quantitative Landscape Across 2,047 Benchmark Datasets
Title: Prioritizing Data Leakage: A Quantitative Analysis Across 2,047 Benchmark Datasets
Abstract:
This study evaluates the impact of four distinct categories of data leakage in machine learning through a comprehensive series of experiments. The methodology includes twenty-eight within-subject counterfactual tests conducted on 2,047 independent and identically distributed (iid) tabular datasets, alongside a boundary analysis involving 129 temporal datasets.
The findings reveal large disparities in the severity of these leakage types. Class I leakage (estimation errors, such as fitting scalers on the entire dataset) proves negligible: across all nine tested conditions, the absolute change in area under the curve, $|\Delta\mathrm{AUC}|$, stayed at or below 0.005. In contrast, Class II leakage (selection effects, such as peeking at data and cherry-picking random seeds) has substantial effects. The data suggest that, in this category, approximately 90% of noise exploitation feeds into inflated performance scores.
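The seed cherry-picking mechanism named above can be illustrated with a small simulation. This is a minimal sketch, not the study's protocol: the function name `simulate_seed_selection`, the true score, the noise level, and the number of seeds are all hypothetical values chosen for illustration.

```python
import random
import statistics


def simulate_seed_selection(true_auc=0.80, noise_sd=0.01, n_seeds=20,
                            n_trials=2000, rng_seed=0):
    """Compare an honest single-seed report with a best-of-n-seeds report.

    All parameter defaults are illustrative assumptions, not values
    taken from the study.
    """
    rng = random.Random(rng_seed)
    honest, picked = [], []
    for _ in range(n_trials):
        # Every seed has the same true score; only evaluation noise differs.
        scores = [rng.gauss(true_auc, noise_sd) for _ in range(n_seeds)]
        honest.append(scores[0])    # seed fixed before looking at results
        picked.append(max(scores))  # seed chosen after looking at results
    return statistics.mean(honest), statistics.mean(picked)


honest_mean, picked_mean = simulate_seed_selection()
inflation = picked_mean - honest_mean
print(f"honest: {honest_mean:.4f}  cherry-picked: {picked_mean:.4f}  "
      f"inflation: {inflation:.4f}")
```

Because every simulated seed shares the same underlying score, any gap between the two reports is pure noise exploitation, which is the kind of inflation attributed to Class II leakage.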
Class III leakage, related to memorization, grows with model capacity. At a 10% duplication rate, the standardized paired effect size ($d_z$) ranges from 0.37 for Naive Bayes models to 1.11 for Decision Trees. Meanwhile, Class IV leakage remains undetectable when standard random cross-validation is employed.
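For readers unfamiliar with $d_z$, it is the mean of the paired differences divided by their standard deviation. The sketch below computes it for paired leaky and clean scores, assuming the standard Cohen's $d_z$ definition; the helper name `cohens_dz` and the example scores are hypothetical and do not come from the study.

```python
import statistics


def cohens_dz(leaky_scores, clean_scores):
    """Paired effect size: mean of the differences over their sample SD."""
    if len(leaky_scores) != len(clean_scores):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(leaky_scores, clean_scores)]
    sd = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    if sd == 0:
        raise ValueError("d_z is undefined when all differences are equal")
    return statistics.mean(diffs) / sd


# Made-up AUCs for five datasets, each evaluated with and without duplicates.
leaky = [0.82, 0.79, 0.91, 0.75, 0.88]
clean = [0.80, 0.78, 0.88, 0.75, 0.85]
dz = cohens_dz(leaky, clean)
print(f"d_z = {dz:.2f}")
```

Pairing by dataset removes between-dataset variance from the denominator, so $d_z$ can be large even when the raw score differences are small.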
In the context of iid tabular data, these results invert the traditional textbook hierarchy of concerns. While normalization-related leakage is the least impactful, selection-based leakage poses the most significant risk at practical dataset scales.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





