Modeling Robotics Dataset Construction as an Artifact-Based Build Process
Title: Treating Robotics Dataset Creation as an Artifact-Centric Build Pipeline
Abstract:
While robotic platforms produce vast amounts of multimodal sensor information, the transformation of ROS bag files into machine learning-ready datasets typically relies on fragmented, sequential scripts. This traditional method introduces significant engineering burdens and results in sluggish iteration cycles. To address these inefficiencies, we propose modeling dataset construction as an artifact-based build process governed by a dependency graph. We have realized this concept through Bagzel, an open-source extension for Bazel that enables reproducible and incremental dataset generation, with support for exporting data in the nuScenes format.
In our evaluation, we benchmarked Bagzel and its variant, Bagzel-xattr (which utilizes server-side digest management), against a standard sequential rosbag2nuscenes baseline. Bagzel lowers runtime in every tested execution mode, with the largest gains in iterative development scenarios. On a 20.4 GB dataset, Bagzel achieved speedups over the baseline of up to 386.26x for warm builds and 7.21x for incremental builds. As dataset size increased from 5.1 GB to 20.4 GB, the Bagzel variants also scaled better than the baseline, particularly in warm and incremental builds. Bagzel-xattr provided a further improvement, reducing runtime by 5.9% on average relative to standard Bagzel in our input granularity analysis. Overall, applying an artifact-based build framework to robotics dataset construction significantly decreases the latency of dataset updates while preserving a deterministic design that ensures reproducibility. Bagzel is accessible at https://github.com/UniBwTAS/bagzel.
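To illustrate why an artifact-based build with a dependency graph makes warm and incremental runs cheap, the following is a minimal sketch of content-addressed action caching. It is not Bagzel's or Bazel's implementation: the `Build` class, the target names (`bag_a`, `frames_a`, `dataset`), and the toy actions are all hypothetical and chosen only for illustration.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Action = Callable[[List[bytes]], bytes]


class Build:
    """Toy artifact-based build: hypothetical, not Bagzel's or Bazel's API."""

    def __init__(self) -> None:
        self.sources: Dict[str, bytes] = {}                 # raw inputs, e.g. bag contents
        self.rules: Dict[str, Tuple[List[str], Action]] = {}  # target -> (deps, action)
        self.cache: Dict[str, bytes] = {}                   # action key -> output artifact
        self.executed: List[str] = []                       # targets whose action actually ran

    def add_source(self, name: str, data: bytes) -> None:
        self.sources[name] = data

    def add_rule(self, name: str, deps: List[str], action: Action) -> None:
        self.rules[name] = (deps, action)

    def build(self, target: str) -> bytes:
        if target in self.sources:
            return self.sources[target]
        deps, action = self.rules[target]
        inputs = [self.build(dep) for dep in deps]
        # The cache key depends only on the target name and the input digests,
        # so unchanged inputs always map to the same cached artifact.
        key = hashlib.sha256(target.encode())
        for data in inputs:
            key.update(hashlib.sha256(data).digest())
        digest = key.hexdigest()
        if digest not in self.cache:
            self.executed.append(target)
            self.cache[digest] = action(inputs)
        return self.cache[digest]


def run_demo() -> Tuple[List[str], List[str], List[str]]:
    b = Build()
    b.add_source("bag_a", b"scan-1")
    b.add_source("bag_b", b"scan-2")
    # Per-bag extraction steps followed by one merge step (toy stand-ins).
    b.add_rule("frames_a", ["bag_a"], lambda ins: ins[0].upper())
    b.add_rule("frames_b", ["bag_b"], lambda ins: ins[0].upper())
    b.add_rule("dataset", ["frames_a", "frames_b"], lambda ins: b"|".join(ins))

    b.build("dataset")                      # cold build: every action runs
    cold, b.executed = b.executed, []

    b.build("dataset")                      # warm build: nothing changed
    warm, b.executed = b.executed, []

    b.add_source("bag_b", b"scan-2-v2")     # edit a single input
    b.build("dataset")                      # incremental build
    incremental = b.executed
    return cold, warm, incremental


if __name__ == "__main__":
    for label, ran in zip(("cold", "warm", "incremental"), run_demo()):
        print(label, ran)
```

In this sketch the warm build executes no actions, and the incremental build re-runs only the extraction for the modified bag plus the downstream merge. A sequential script, by contrast, would reprocess every bag on each run, which is the kind of cost the warm and incremental speedups above reflect.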
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





