arXiv

WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents

June 3, 2026 · Bingnan Liu, Chenhang Cui, Rui Huang, Jiani Luo, Zhirong Shen, Tinghao Wang, Xiande Huang, Lingbei Meng, Fei Shen, An Zhang

Abstract:

We present WildRoadBench, a novel benchmark for aerial road-damage grounding that integrates direct visual localization via vision-language models (VLMs) with autonomous research-and-engineering capabilities driven by LLM-based agents, all within a single, professionally annotated UAV dataset. Evaluation is conducted on an identical image set using the per-class AP_50 metric across two distinct protocols. The VLM Track assesses the ability of a static VLM to identify domain-specific damage from a single image and a concise prompt, utilizing a standardized pipeline for prompting, decoding, and parsing. Conversely, the Agent Track evaluates an autonomous agent’s capacity to operate with only a written task description, a limited exploratory data slice, and a fixed interaction budget. This agent must navigate the public web, modify pretrained components, generate training and inference code, and submit predictions via a scalar-feedback oracle on a hidden holdout set.
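Both tracks are scored with per-class AP_50. As a rough illustration of what that metric measures, the following is a minimal sketch for a single damage class. It assumes greedy matching of score-sorted predictions to the best still-unmatched ground-truth box at IoU >= 0.5 and all-point interpolation of the precision-recall curve. The function names (`iou`, `ap50`) and the input layout are hypothetical, and the benchmark's released evaluation code may differ in matching and interpolation details.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ap50(
    preds: List[Tuple[str, float, Box]],  # (image_id, confidence, box), one class
    gts: Dict[str, List[Box]],            # image_id -> ground-truth boxes, same class
    thr: float = 0.5,
) -> float:
    """Average precision at IoU >= thr for one class (assumed matching rule)."""
    n_gt = sum(len(boxes) for boxes in gts.values())
    if n_gt == 0:
        return 0.0

    matched = {img: [False] * len(boxes) for img, boxes in gts.items()}
    tp_flags: List[bool] = []

    # Visit predictions from most to least confident.
    for img, _score, box in sorted(preds, key=lambda p: -p[1]):
        best_iou, best_j = 0.0, -1
        for j, gt_box in enumerate(gts.get(img, [])):
            if matched[img][j]:
                continue  # each ground-truth box can be claimed only once
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= thr:
            matched[img][best_j] = True
            tp_flags.append(True)
        else:
            tp_flags.append(False)

    # Cumulative precision and recall after each prediction.
    precisions: List[float] = []
    recalls: List[float] = []
    tp = 0
    for k, is_tp in enumerate(tp_flags, start=1):
        tp += int(is_tp)
        precisions.append(tp / k)
        recalls.append(tp / n_gt)

    # Make precision non-increasing in recall (upper envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

A per-class leaderboard entry would then be this value computed once per damage category over the shared image set. Averaging those values into a single number is a separate aggregation step that the abstract does not specify.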

Our evaluation covers a wide range of closed-source frontier models, open-source VLMs, and several leading LLM-driven agents. Results indicate that neither approach yet performs reliably in this challenging, uncontrolled setting. Closed-source frontier models dominate the VLM leaderboard, yet they still capture no more than half of the available metric score. Open-source grounding models score significantly lower and show no consistent improvement with newer generations or reasoning-oriented variants; small targets are particularly difficult for all open-source models. Although agents have richer affordances, they trail the most capable VLMs, and several fail to produce a valid submission within the allotted budget. To support reproducible future research, we have released the associated code and data at https://anonymous.4open.science/r/wildroadbench-0607.

Source: arXiv Generated at: 2026-06-03 00:00:00 UTC