An Open-Source Two-Stage Computer Vision Pipeline for Fine-Grained Vehicle Classification using Vision Transformers
Title: Open-Source Two-Stage Computer Vision Framework for High-Resolution Vehicle Classification via Vision Transformers
Abstract:
While vehicle body type is a critical factor in determining the severity of injuries sustained by cyclists during overtaking incidents, there is currently no open-source automated solution capable of categorizing vehicles into these injury-relevant classes using naturalistic road footage. Existing standard object detection datasets typically offer only broad vehicle labels, such as "car" or "bus," and current fine-grained recognition models are generally trained on controlled imagery, lacking validation for robust deployment across diverse recording environments. To address this gap, we introduce an open-source, two-stage computer vision pipeline. This system integrates a pre-trained RT-DETR detector for initial coarse vehicle localization with a fine-tuned Vision Transformer (ViT-Base/16) designed to classify vehicles into six specific body types: passenger cars, SUVs, pickup trucks, minivans, large vans, and commercial trucks.
To mitigate silent misclassifications, the pipeline employs a confidence-based abstention mechanism that assigns an "unknown" label when the softmax output drops below a threshold of 0.60. We evaluated this approach on 3,805 annotated overtaking events recorded along a bicycle-lane corridor in Ann Arbor, Michigan. In this in-distribution test, the system achieved an overall accuracy of 0.94, with per-class F1 scores ranging from 0.91 for minivans to 0.97 for SUVs.
Further testing on an independent, out-of-distribution dataset comprising 311 events from an open cycling repository—without any retraining—yielded an accuracy of 0.89. Under this domain shift, three of the four well-represented categories sustained F1 scores of at least 0.90. The most significant performance drop occurred in the minivan category, where the F1 score fell to 0.72. This decline was primarily driven by an increase in the abstention rate, which rose from 2.4% in the in-distribution setting to 25.0% in the out-of-distribution setting, rather than by active errors, thereby reflecting the model’s appropriate handling of genuine uncertainty. The complete pipeline, encompassing inference scripts, training code, evaluation utilities, and model weights, has been released as open-source software to facilitate reproducibility and support research in cycling safety and roadside video analysis.
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC




