arXiv

What to Test Next: Interpretable Coverage Gap Discovery in Driving VLMs

June 2, 2026 · Abhishek Aich, Sparsh Garg, Vijay Kumar BG, Turgun Yusuf Kashgari, Manmohan Chandraker

Title: Identifying the Next Test Target: Interpretable Discovery of Coverage Gaps in Autonomous Driving Vision-Language Models

Abstract:

Driving vision-language models (VLMs) must comprehend scenes across diverse Operational Design Domain (ODD) conditions, yet current verification is incomplete: many test slices are missing altogether, which makes empirical failure rates unreliable. To address this, we introduce SliceScorer, a deterministic scoring mechanism that recommends which missing test slices to fill. It combines two components: an exposure-based coverage prior that highlights rare, under-tested regions, and a neighbor-failure prior that propagates risk estimates from comparable, already-tested scenarios. Designed for safety-critical validation, SliceScorer prioritizes interpretability, auditability, and conservatism.
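To make the two-prior idea concrete, here is a minimal Python sketch. The abstract does not give SliceScorer's actual formulas, so everything below is an assumption made for illustration: the ODD vocabulary, the test log, the functional form of both priors, the weight `alpha`, and every function name. It shows only how a coverage term and a neighbor-failure term could be combined deterministically to rank untested slices.

```python
from itertools import product

# Hypothetical ODD vocabulary; attribute names and values are illustrative only.
VOCAB = {
    "weather": ["clear", "rain", "fog"],
    "time": ["day", "night"],
}

# Hypothetical test log: slice -> (number of tests, number of failures).
TESTED = {
    ("clear", "day"): (100, 5),
    ("clear", "night"): (40, 8),
    ("rain", "day"): (20, 6),
    ("fog", "day"): (5, 3),
}


def coverage_prior(slice_, tested, vocab):
    """Assumed form: higher when the slice's attribute values are rarely tested."""
    total = sum(n for n, _ in tested.values())
    rarity = []
    for i in range(len(vocab)):
        seen = sum(n for s, (n, _) in tested.items() if s[i] == slice_[i])
        rarity.append(1.0 - seen / total)
    return sum(rarity) / len(rarity)


def neighbor_failure_prior(slice_, tested):
    """Assumed form: mean failure rate of tested slices differing in one attribute."""
    rates = [
        fails / n
        for s, (n, fails) in tested.items()
        if sum(a != b for a, b in zip(s, slice_)) == 1
    ]
    return sum(rates) / len(rates) if rates else 0.0


def rank_missing_slices(tested, vocab, alpha=0.5):
    """Score every untested slice; break ties by slice name so output is reproducible."""
    missing = [s for s in product(*vocab.values()) if s not in tested]
    scored = [
        (
            alpha * coverage_prior(s, tested, vocab)
            + (1 - alpha) * neighbor_failure_prior(s, tested),
            s,
        )
        for s in missing
    ]
    return sorted(scored, key=lambda t: (-t[0], t[1]))


ranking = rank_missing_slices(TESTED, VOCAB)
for score, s in ranking:
    print(f"{s}: {score:.3f}")
```

With this toy log, the fog/night slice ranks above rain/night: fog is tested less often, and its tested neighbors fail at a higher rate. Because no step involves sampling or a learned model, the same log always yields the same ranking, and each score can be traced back to its two terms.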

For stress testing outside of declared ODDs, we integrate SliceScorer into SliceNav, a verification pipeline orchestrated by large language models (LLMs). In this system, the LLM interprets developer queries and selects the relevant operators (such as triage, scoring, acquisition, and evaluation) along with vocabulary extensions, composing them into verification workflows while all scoring remains deterministic and auditable. Our experiments on three driving VLMs (WiseAD, DriveMM, and Cosmos-Reason2-2B) demonstrate that SliceNav identifies high-risk coverage gaps more efficiently than existing slice-discovery techniques, while also providing diverse recommendations across the condition space. Ablation studies verify the contribution of both scoring components, and qualitative assessments illustrate the complete workflow from initial developer query to targeted evaluation.
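The separation between LLM planning and deterministic execution can be sketched as follows. This is not SliceNav's implementation: the four operator names come from the abstract, but their bodies, the `stub_planner` function standing in for the LLM, the state dictionary, and the risk values are all hypothetical. Vocabulary extension is omitted. The point is only that the planner chooses which registered operators run, while the operators themselves contain no model calls.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


# Placeholder operator bodies (assumptions, not the paper's logic).
def triage(state: State) -> State:
    state["candidates"] = sorted(state["missing_slices"])
    return state


def scoring(state: State) -> State:
    # Deterministic ranking from a fixed risk table; no LLM involved here.
    risk = state["risk_table"]
    state["ranked"] = sorted(
        state["candidates"], key=lambda s: (-risk.get(s, 0.0), s)
    )
    return state


def acquisition(state: State) -> State:
    state["to_collect"] = state["ranked"][: state["budget"]]
    return state


def evaluation(state: State) -> State:
    state["report"] = [f"evaluate {s}" for s in state["to_collect"]]
    return state


OPERATORS: Dict[str, Callable[[State], State]] = {
    "triage": triage,
    "scoring": scoring,
    "acquisition": acquisition,
    "evaluation": evaluation,
}


def stub_planner(query: str) -> List[str]:
    """Hypothetical stand-in for the LLM: maps a query to an operator sequence."""
    plan = ["triage", "scoring"]
    if "test" in query.lower() or "collect" in query.lower():
        plan += ["acquisition", "evaluation"]
    return plan


def run_workflow(query: str, state: State) -> State:
    plan = stub_planner(query)
    # Reject operators outside the registry, so the planner can only
    # select among known, deterministic steps.
    unknown = [op for op in plan if op not in OPERATORS]
    if unknown:
        raise ValueError(f"unknown operators: {unknown}")
    state["audit_log"] = list(plan)  # record the chosen plan for later audit
    for op in plan:
        state = OPERATORS[op](state)
    return state


result = run_workflow(
    "What should we test next in fog?",
    {
        "missing_slices": ["fog/night", "rain/night"],
        "risk_table": {"fog/night": 0.63, "rain/night": 0.53},
        "budget": 1,
    },
)
print(result["audit_log"], result["report"])
```

In this sketch the only non-deterministic component would be the planner, and its output is both validated against the registry and written to `audit_log`, so a reviewer can replay the exact operator sequence.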
