Testing Most Influential Sets
Title: Evaluating the Extremes of Influence
Small subsets of data can wield disproportionate power over model outcomes, with just a handful of observations capable of reversing significant conclusions. Although recent research has focused on identifying these highly influential groups, a formal method for distinguishing between excessive influence and the natural variation expected from random sampling has been lacking. To bridge this gap, we introduce a rigorous framework for analyzing the most influential sets.
Focusing on linear least-squares regression, we derive a precise formula for influence and characterize the extreme value distributions of maximal influence. We find that maximal influence follows a heavy-tailed Fréchet distribution for constant-size sets and for heavy-tailed data, whereas for growing sets or light-tailed data it follows the better-behaved Gumbel distribution. These results make it possible to run rigorous hypothesis tests of whether influence is statistically excessive. We validate the approach on applications in machine learning benchmarks, biology, and economics, showing that it can replace informal heuristics with robust inference and resolve previously contested research findings.
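To make the notion of set influence concrete, the following is a minimal sketch, not the paper's own formula or test. It assumes influence is measured as the change in an OLS coefficient when a set of observations is dropped, and it uses the standard set-deletion identity beta_full - beta_without_S = (X'X)^{-1} X_S' (I - H_SS)^{-1} e_S, where H_SS is the sub-block of the hat matrix and e_S are the full-sample residuals on S. The function names, the simulated data, and the brute-force search are illustrative assumptions.

```python
import itertools
import numpy as np


def ols(X, y):
    """Ordinary least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def influence_refit(X, y, subset):
    """Change in coefficients from dropping `subset`, computed by refitting."""
    keep = np.setdiff1d(np.arange(len(y)), subset)
    return ols(X, y) - ols(X[keep], y[keep])


def influence_closed_form(X, y, subset):
    """Same quantity via the set-deletion identity (no refit needed)."""
    subset = np.asarray(subset)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    X_s = X[subset]
    resid_s = y[subset] - X_s @ beta          # full-sample residuals on S
    H_ss = X_s @ XtX_inv @ X_s.T              # hat-matrix block for S
    adj = np.linalg.solve(np.eye(len(subset)) - H_ss, resid_s)
    return XtX_inv @ X_s.T @ adj


def most_influential_set(X, y, k, coef=1):
    """Exhaustive search for the size-k set with the largest absolute
    influence on coefficient `coef` (only feasible for small n and k)."""
    best_set, best_val = None, -np.inf
    for subset in itertools.combinations(range(len(y)), k):
        val = abs(influence_closed_form(X, y, subset)[coef])
        if val > best_val:
            best_set, best_val = subset, val
    return best_set, best_val


# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(size=n)

best_set, best_val = most_influential_set(X, y, k=2)
print("most influential pair:", best_set, "slope change:", best_val)
```

The exhaustive search scales combinatorially in the set size, so it is only meant to illustrate the quantity whose extreme-value behavior the abstract describes. Deciding whether `best_val` is excessive would require comparing it against a reference distribution for the maximum, which this sketch does not implement.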
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





