arXiv

MLSkip: Data Skipping for ML Filters via Lightweight Metadata

June 3, 2026 · Mihail Stoian, Mark Gerarts, Pascal Ginter, Andreas Zimmerer, Jan Van den Bussche, Andreas Kipf

Title: MLSkip: Optimizing ML Filter Performance with Minimal Metadata for Data Skipping

Abstract:

Recent releases of AI capabilities by database vendors have introduced machine learning functions into filter predicates. Because these functions typically depend on expensive, opaque ML models, they pose new data management challenges. In particular, conventional data skipping methods designed for strings and integers are ineffective for this new class of filters, so there is currently no established mechanism to skip non-matching row groups, for example when reading files from blob storage.

This paper initiates the study of data skipping strategies tailored to ML filters. We show that Parquet’s standard min-max metadata is already sufficient to enable pruning. We support this claim by connecting our approach to two existing research areas: (i) the emerging query language for ML models and (ii) neural network verification.
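To make the link to neural network verification concrete, the following minimal sketch uses interval bound propagation, one basic verification technique, to decide whether a row group can be skipped from its per-column min-max values alone. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy weights, and the filter form `model(x) > threshold` are all hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
# One layer is (weights, biases); weights[i] is the row for output neuron i.
Layer = Tuple[Matrix, List[float]]


def affine_bounds(weights: Matrix, biases: Sequence[float],
                  lo: Sequence[float], hi: Sequence[float]):
    """Propagate the box [lo, hi] through y = W x + b, neuron by neuron."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, biases):
        low = high = b
        for w, a, c in zip(row, lo, hi):
            if w >= 0:   # positive weight: lower input gives lower output
                low += w * a
                high += w * c
            else:        # negative weight: the roles of lo and hi swap
                low += w * c
                high += w * a
        out_lo.append(low)
        out_hi.append(high)
    return out_lo, out_hi


def output_bounds(layers: Sequence[Layer],
                  lo: Sequence[float], hi: Sequence[float]):
    """Sound output bounds of a ReLU network for all inputs in [lo, hi]."""
    for i, (weights, biases) in enumerate(layers):
        lo, hi = affine_bounds(weights, biases, lo, hi)
        if i < len(layers) - 1:  # ReLU on hidden layers only
            lo = [max(0.0, v) for v in lo]
            hi = [max(0.0, v) for v in hi]
    return lo, hi


def can_skip_row_group(layers: Sequence[Layer], col_min: Sequence[float],
                       col_max: Sequence[float], threshold: float) -> bool:
    """Assumed filter: model(x) > threshold (single output neuron).

    If even the upper bound does not exceed the threshold, no row in the
    row group can match, so the row group can be skipped.
    """
    _, hi = output_bounds(layers, col_min, col_max)
    return hi[0] <= threshold


# Hypothetical toy network: 2 input columns, 2 hidden ReLU units, 1 output.
toy_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[1.0, 1.0]], [0.0]),
]

print(can_skip_row_group(toy_layers, [0.0, 0.0], [1.0, 1.0], 2.0))  # True
print(can_skip_row_group(toy_layers, [0.0, 0.0], [4.0, 4.0], 2.0))  # False
```

The bounds are conservative: a `False` result only means the row group cannot be ruled out, not that it contains a match. Tighter verification methods would prune more row groups at a higher analysis cost.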

Our initial experiments with ReLU architectures on the TPC-H and TPC-DS datasets indicate that, for filters with selectivity under 0.1%, the average pruning effectiveness reaches 27.4%. Furthermore, drawing inspiration from spatial join research, we introduce an improved metadata structure: a size-capped 2D convex hull. This structure lets verification tools prune more effectively, raising pruning effectiveness to 38.31%, and requires at most 45 bytes per row group and column pair. In terms of performance, we recorded an end-to-end speedup of 1.07x compared to PyTorch when running within DuckDB.
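The intuition for why a 2D convex hull over a column pair can prune more than min-max values can be sketched as follows. For a single linear pre-activation `w[0]*x + w[1]*y + b`, the maximum over a convex polygon is attained at a vertex, so the hull yields a bound at least as tight as the bounding box. This is a hedged illustration only: the helper names and data are hypothetical, the hull is computed with the standard monotone chain algorithm, and the size cap and byte encoding mentioned above are not specified here and are left out.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def upper_bound_box(w: Point, b: float, points: Sequence[Point]) -> float:
    """Upper bound of w.x + b using only per-column min and max."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x = max(xs) if w[0] >= 0 else min(xs)
    y = max(ys) if w[1] >= 0 else min(ys)
    return w[0] * x + w[1] * y + b


def upper_bound_hull(w: Point, b: float, hull: Sequence[Point]) -> float:
    """Upper bound of w.x + b over the hull: check the vertices only."""
    return max(w[0] * x + w[1] * y + b for x, y in hull)


# Hypothetical row group with two positively correlated columns.
rows = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (1.0, 2.0), (2.0, 1.0)]
hull = convex_hull(rows)
w, b = (1.0, -1.0), 0.0  # pre-activation x - y

print(upper_bound_box(w, b, rows))   # 3.0
print(upper_bound_hull(w, b, hull))  # 1.0
```

With an assumed filter threshold of 2, the box bound (3.0) cannot rule the row group out, while the hull bound (1.0) can. Any size-capped variant would have to remain a superset of the exact hull for pruning to stay sound; how the cap is enforced is not shown here.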
