arXiv

VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models

Abstract:

As Vision-Language-Action (VLA) models progress rapidly toward becoming generalist robot policies, it remains challenging to quantitatively assess their limitations and failure modes. To bridge this gap, we present VLA-Arena, a comprehensive benchmarking framework. We introduce a novel structured task design methodology that quantifies difficulty along three orthogonal dimensions: Task Structure, Language Command, and Visual Observation. This approach facilitates the systematic creation of tasks with fine-grained difficulty levels, allowing for a precise measurement of the current capabilities of VLA models.

In terms of Task Structure, VLA-Arena comprises 170 tasks categorized into four distinct dimensions: Safety, Distractor, Extrapolation, and Long Horizon. Each task is developed with three specific difficulty tiers (L0–L2). To accurately evaluate general capability, fine-tuning is conducted exclusively on the L0 level. Complementing this, language perturbations (W0–W4) and visual perturbations (V0–V4) can be applied independently to any task, enabling a decoupled analysis of model robustness.
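The evaluation space described above can be sketched as a small grid of conditions per task. The following Python snippet is an illustrative sketch only, not the VLA-Arena API: the names EvalCondition, full_grid, and decoupled_sweep are hypothetical, and it assumes that level 0 on each perturbation axis denotes the unperturbed baseline, which the abstract does not state explicitly.

```python
from dataclasses import dataclass
from itertools import product

# Level labels taken from the abstract; everything else here is hypothetical.
DIFFICULTY_LEVELS = ("L0", "L1", "L2")
LANGUAGE_LEVELS = tuple(f"W{i}" for i in range(5))  # W0..W4
VISUAL_LEVELS = tuple(f"V{i}" for i in range(5))    # V0..V4


@dataclass(frozen=True)
class EvalCondition:
    """One evaluation setting for a single task (hypothetical container)."""
    difficulty: str
    language: str
    visual: str


def full_grid():
    """Every combination of difficulty, language, and visual level for one task."""
    return [
        EvalCondition(d, w, v)
        for d, w, v in product(DIFFICULTY_LEVELS, LANGUAGE_LEVELS, VISUAL_LEVELS)
    ]


def decoupled_sweep(difficulty, axis):
    """Vary one perturbation axis while pinning the other at its level 0.

    Assumes level 0 means "no perturbation"; this is an assumption made for
    the sketch, not something confirmed by the abstract.
    """
    if axis == "language":
        return [EvalCondition(difficulty, w, "V0") for w in LANGUAGE_LEVELS]
    if axis == "visual":
        return [EvalCondition(difficulty, "W0", v) for v in VISUAL_LEVELS]
    raise ValueError(f"unknown axis: {axis!r}")


if __name__ == "__main__":
    print(len(full_grid()))  # 3 * 5 * 5 conditions per task
    for condition in decoupled_sweep("L1", "visual"):
        print(condition)
```

Sweeping a single axis with the other held fixed is what allows a drop in success rate to be attributed to language or to vision alone, rather than to their interaction.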

Our extensive evaluation of state-of-the-art VLAs highlights several critical shortcomings, including a pronounced bias toward memorization rather than generalization, asymmetric robustness, insufficient adherence to safety constraints, and an inability to compose learned skills for long-horizon tasks. To encourage research into these issues and ensure reproducibility, we release the full VLA-Arena framework. This includes an end-to-end toolchain spanning from task definition to automated evaluation, as well as the VLA-Arena-S/M/L datasets for fine-tuning. The benchmark, associated data, models, and leaderboard are accessible at https://vla-arena.github.io.


Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
