arXiv

A Grammar of Machine Learning Workflows: Rejecting Data Leakage at Call Time

June 2, 2026 · Simon Roth

Title: A Structural Grammar for Machine Learning Pipelines: Eliminating Data Leakage at Execution

Original: arXiv:2603.10742v4 · Announcement Type: Replacement

Abstract: Data leakage has been identified in 648 published papers across 30 scientific fields. The knowledge to prevent it has existed for over a decade; the problem persists because the tools do not enforce what the textbooks teach. This paper presents a grammar (eight typed primitives connected by a directed acyclic graph with four hard constraints) that makes the most damaging leakage types structurally unrepresentable within the grammar's scope. The core mechanism is a terminal assessment gate: the first call-time-enforced evaluate/assess boundary documented in the peer-reviewed ML methodology literature (to my knowledge, as of May 2026), backed by a specification precise enough for independent reimplementation. A companion landscape study across 2,047 datasets grounds the constraints in measured effect sizes. Two reference implementations (Python, R) are available.

Rewrite: Data leakage has been detected in 648 published papers spanning 30 scientific fields, even though the knowledge needed to prevent it has existed for more than a decade. The problem persists because current tools do not enforce what the textbooks teach. To close this gap, the paper introduces a grammar that makes the most damaging leakage types structurally unrepresentable within the grammar's scope. The grammar consists of eight typed primitives connected by a directed acyclic graph and governed by four hard constraints. Its core mechanism is a terminal assessment gate, which is, to the author's knowledge as of May 2026, the first call-time-enforced evaluate/assess boundary documented in the peer-reviewed machine learning methodology literature. The gate is backed by a specification precise enough for independent reimplementation. A companion landscape study across 2,047 datasets grounds the constraints in measured effect sizes, and reference implementations are available in Python and R.
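Illustration: The abstract does not show the grammar's API, so the following is only a minimal sketch of what a call-time-enforced evaluate/assess boundary could look like. It assumes that `evaluate` may be called repeatedly on validation data, while `assess` touches the test data once and closes the workflow. All names here (`GatedWorkflow`, `LeakageError`, `fit`, `evaluate`, `assess`) are hypothetical and are not taken from the paper or its reference implementations. The "model" is a toy mean predictor scored by mean absolute error.

```python
from dataclasses import dataclass, field
from typing import Sequence


class LeakageError(RuntimeError):
    """Raised when a call would cross the evaluate/assess boundary."""


def _mae(prediction: float, targets: Sequence[float]) -> float:
    # Mean absolute error of a constant prediction.
    return sum(abs(t - prediction) for t in targets) / len(targets)


@dataclass
class GatedWorkflow:
    # Hypothetical container; illustrative only, not the paper's API.
    train: Sequence[float]
    valid: Sequence[float]
    test: Sequence[float]
    _assessed: bool = field(default=False, init=False)

    def _check_open(self, op: str) -> None:
        # The gate: once test data has been assessed, every call is rejected.
        if self._assessed:
            raise LeakageError(f"{op}() rejected: test data already assessed")

    def fit(self) -> float:
        self._check_open("fit")
        return sum(self.train) / len(self.train)  # toy mean predictor

    def evaluate(self, model: float) -> float:
        # Repeatable: reads validation data only.
        self._check_open("evaluate")
        return _mae(model, self.valid)

    def assess(self, model: float) -> float:
        # Terminal: reads test data once, then closes the workflow.
        self._check_open("assess")
        self._assessed = True
        return _mae(model, self.test)


wf = GatedWorkflow(train=[1.0, 2.0, 3.0], valid=[2.0, 4.0], test=[3.0, 5.0])
model = wf.fit()
validation_error = wf.evaluate(model)  # may be repeated while tuning
test_error = wf.assess(model)          # allowed exactly once
try:
    wf.fit()                           # refitting after assessment is rejected
except LeakageError as err:
    rejection = str(err)
```

In this sketch the rejection happens when the offending call is made, not in a later audit, which is the property the abstract attributes to the gate. A real implementation would presumably track far more state than a single flag.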
