Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection
Title: Investigating the Presence of Natural Experiments in Real-World Data: An Empirical Analysis via Causal Feature Selection
Natural experiments are real-world events that act as implicit interventions, affecting some individuals or groups while leaving others unaffected. A prime example is the coronavirus pandemic, in which the virus effectively intervened on the subset of the population it infected. This study investigates whether such natural experiments are present in existing real-world datasets and how they should be handled.
To identify them, we use causal discovery techniques to reconstruct the underlying causal graph and then perform feature selection based on the discovered causal relationships. The core hypothesis is that if modeling the data as interventional, rather than observational, improves downstream performance, then the dataset contains natural experiments.
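As a rough illustration of why the observational and interventional readings of a dataset can select different features, the sketch below computes the Markov blanket of a target in a small DAG, first as given and then after an intervention removes the edges into one node. The graph, the node names, the helper functions `markov_blanket` and `intervene`, and the use of the Markov blanket as the selection criterion are all assumptions made for this example, not the method of the study.

```python
def markov_blanket(parents, target):
    """Return parents, children, and co-parents of `target`.

    `parents` maps each node of a DAG to the set of its parent nodes.
    """
    children = {n for n, ps in parents.items() if target in ps}
    co_parents = {p for c in children for p in parents[c]} - {target}
    return set(parents.get(target, set())) | children | co_parents


def intervene(parents, node):
    """Return the graph under do(node): every edge into `node` is removed."""
    return {n: (set() if n == node else set(ps)) for n, ps in parents.items()}


# Hypothetical graph (an assumption for illustration): A -> Y, Y -> C, B -> C
graph = {"A": set(), "B": set(), "Y": {"A"}, "C": {"Y", "B"}}

# Observational reading: C is a child of Y, so C and its co-parent B are selected.
observational = markov_blanket(graph, "Y")

# Interventional reading: an external event sets C, so C no longer depends on Y.
interventional = markov_blanket(intervene(graph, "C"), "Y")
```

In this toy graph the observational feature set for `Y` is `A`, `B`, and `C`, while the interventional one shrinks to `A` alone, because an externally set `C` carries no information about `Y`.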
We first tested this hypothesis on synthetic datasets generated from constructed graphs, both with and without embedded natural experiments. Following this validation, we carried out an empirical assessment across a broad range of real-world datasets. The findings confirm that natural experiments are present in real-world data and that leveraging them through causal inference methods can significantly improve model performance. This work is a preliminary exploration of the topic within a constrained scope.
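One way to picture a synthetic dataset with an embedded natural experiment is sketched below. It samples from a hypothetical chain X -> Y -> Z and, for a random fraction of rows, lets an external event set Y independently of X. The graph, the coefficients, the noise scales, and the function names `sample_chain` and `pearson` are assumptions chosen for this example and are not taken from the study.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def sample_chain(n, intervened_fraction=0.0, seed=0):
    """Sample n rows (x, y, z, intervened) from the chain X -> Y -> Z.

    Each row is intervened with probability `intervened_fraction`; in those
    rows Y is drawn independently of X, which mimics do(Y).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        intervened = rng.random() < intervened_fraction
        if intervened:
            y = rng.gauss(3, 1)            # external event sets Y, ignoring X
        else:
            y = 2 * x + rng.gauss(0, 0.5)  # Y follows its usual mechanism
        z = -1.5 * y + rng.gauss(0, 0.5)   # Z always depends on Y
        rows.append((x, y, z, intervened))
    return rows


rows = sample_chain(5000, intervened_fraction=0.3, seed=1)
obs = [r for r in rows if not r[3]]
intv = [r for r in rows if r[3]]

corr_xy_obs = pearson([r[0] for r in obs], [r[1] for r in obs])
corr_xy_intv = pearson([r[0] for r in intv], [r[1] for r in intv])
corr_yz_intv = pearson([r[1] for r in intv], [r[2] for r in intv])
```

In the untouched rows X and Y are strongly correlated, while in the intervened rows that dependence disappears and the Y -> Z link is unchanged. Setting `intervened_fraction=0.0` gives the matching dataset without a natural experiment.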
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



