arXiv

WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents

Abstract:

We present WildRoadBench, a novel benchmark for aerial road-damage grounding that integrates direct visual localization via vision-language models (VLMs) with autonomous research-and-engineering capabilities driven by LLM-based agents, all within a single, professionally annotated UAV dataset. Evaluation is conducted on an identical image set using the per-class AP_50 metric across two distinct protocols. The VLM Track assesses the ability of a static VLM to identify domain-specific damage from a single image and a concise prompt, utilizing a standardized pipeline for prompting, decoding, and parsing. Conversely, the Agent Track evaluates an autonomous agent’s capacity to operate with only a written task description, a limited exploratory data slice, and a fixed interaction budget. This agent must navigate the public web, modify pretrained components, generate training and inference code, and submit predictions via a scalar-feedback oracle on a hidden holdout set.
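As an illustration of the shared metric, the following is a minimal sketch of per-class AP_50 for box predictions. It assumes the common detection formulation (greedy matching in descending score order at an IoU threshold of 0.5, with average precision taken as the area under the monotone precision-recall envelope); the benchmark's own evaluation script may differ in details such as interpolation or tie handling. The function names, the tuple layouts, and the class labels "crack" and "pothole" are hypothetical and are not taken from the released code.

```python
from collections import defaultdict


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ap50(preds, gts, iou_thr=0.5):
    """AP for one class.

    preds: list of (image_id, score, box)
    gts:   dict mapping image_id -> list of ground-truth boxes
    """
    n_gt = sum(len(boxes) for boxes in gts.values())
    if n_gt == 0:
        return 0.0
    matched = {img: [False] * len(boxes) for img, boxes in gts.items()}
    flags = []  # True for a true positive, False for a false positive
    for img, _score, box in sorted(preds, key=lambda p: -p[1]):
        best_iou, best_j = 0.0, -1
        for j, gt_box in enumerate(gts.get(img, [])):
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        # A ground-truth box can be claimed by at most one prediction.
        if best_iou >= iou_thr and not matched[img][best_j]:
            matched[img][best_j] = True
            flags.append(True)
        else:
            flags.append(False)
    tp = fp = 0
    recalls, precisions = [], []
    for is_tp in flags:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing in recall (the precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def per_class_ap50(preds, gts):
    """preds: list of (image_id, label, score, box); gts: list of (image_id, label, box)."""
    gt_by_class = defaultdict(lambda: defaultdict(list))
    for img, label, box in gts:
        gt_by_class[label][img].append(box)
    pred_by_class = defaultdict(list)
    for img, label, score, box in preds:
        pred_by_class[label].append((img, score, box))
    return {c: ap50(pred_by_class[c], gt_by_class[c]) for c in gt_by_class}


# Toy example with hypothetical labels and boxes.
gts = [("img1", "crack", (0, 0, 10, 10)),
       ("img2", "crack", (0, 0, 10, 10)),
       ("img1", "pothole", (50, 50, 60, 60))]
preds = [("img1", "crack", 0.9, (0, 0, 10, 10)),    # exact match
         ("img1", "crack", 0.8, (20, 20, 30, 30)),  # no overlap
         ("img2", "crack", 0.7, (1, 1, 10, 10))]    # IoU 0.81
scores = per_class_ap50(preds, gts)
print(scores)
```

Reporting the metric per class rather than as a single mean keeps a class with no correct detections (here the hypothetical "pothole") visible instead of averaging it away.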

Our evaluation covers a wide range of closed-source frontier models, open-source VLMs, and several leading LLM-driven agents. Results indicate that neither approach yet achieves reliable performance in this challenging, uncontrolled setting. Closed-source frontier models dominate the VLM leaderboard, yet they still capture less than half of the attainable metric score. Open-source grounding models score substantially lower and show no consistent improvement across newer generations or reasoning-oriented variants; small targets prove particularly difficult for all open-source models. Although agents have richer affordances, they trail the most capable VLMs, and several fail to produce a valid submission within the allotted budget. To facilitate reproducible future research, we have released the associated code and data at https://anonymous.4open.science/r/wildroadbench-0607.


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
