VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models
Abstract:
As Vision-Language-Action (VLA) models progress rapidly toward becoming generalist robot policies, it remains challenging to quantitatively assess their limitations and failure modes. To bridge this gap, we present VLA-Arena, a comprehensive benchmarking framework. We introduce a novel structured task design methodology that quantifies difficulty along three orthogonal dimensions: Task Structure, Language Command, and Visual Observation. This approach facilitates the systematic creation of tasks with fine-grained difficulty levels, allowing for a precise measurement of the current capabilities of VLA models.
In terms of Task Structure, VLA-Arena comprises 170 tasks categorized into four distinct dimensions: Safety, Distractor, Extrapolation, and Long Horizon. Each task is developed with three specific difficulty tiers (L0–L2). To accurately evaluate general capability, fine-tuning is conducted exclusively on the L0 level. Complementing this, language perturbations (W0–W4) and visual perturbations (V0–V4) can be applied independently to any task, enabling a decoupled analysis of model robustness.
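The way the difficulty tiers and the two perturbation axes combine can be illustrated with a minimal Python sketch. It is not the VLA-Arena API: `EvalConfig`, `full_grid`, and `decoupled_sweeps` are hypothetical names introduced here, and treating W0 and V0 as the unperturbed baseline is an assumption the abstract does not state.

```python
from dataclasses import dataclass
from itertools import product

# Labels taken from the abstract.
DIMENSIONS = ("Safety", "Distractor", "Extrapolation", "Long Horizon")
DIFFICULTY_LEVELS = ("L0", "L1", "L2")
LANGUAGE_LEVELS = tuple(f"W{i}" for i in range(5))  # W0..W4
VISUAL_LEVELS = tuple(f"V{i}" for i in range(5))    # V0..V4


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical container for one evaluation setting of a single task."""
    difficulty: str
    language: str = "W0"  # assumed baseline: no language perturbation
    visual: str = "V0"    # assumed baseline: no visual perturbation


def full_grid():
    """Every combination of difficulty, language level, and visual level."""
    return [
        EvalConfig(d, w, v)
        for d, w, v in product(DIFFICULTY_LEVELS, LANGUAGE_LEVELS, VISUAL_LEVELS)
    ]


def decoupled_sweeps(difficulty="L0"):
    """Vary one perturbation axis at a time, holding the other at its baseline."""
    language_sweep = [EvalConfig(difficulty, language=w) for w in LANGUAGE_LEVELS]
    visual_sweep = [EvalConfig(difficulty, visual=v) for v in VISUAL_LEVELS]
    return language_sweep, visual_sweep


if __name__ == "__main__":
    lang, vis = decoupled_sweeps("L1")
    print(len(full_grid()), len(lang), len(vis))
```

Because the two perturbation axes are independent of the task, a drop in success rate along the language sweep can be read separately from a drop along the visual sweep, which is what a decoupled robustness analysis requires.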
Our extensive evaluation of state-of-the-art VLAs highlights several critical shortcomings, including a pronounced bias toward memorization rather than generalization, asymmetric robustness, insufficient adherence to safety constraints, and an inability to compose learned skills for long-horizon tasks. To encourage research into these issues and ensure reproducibility, we release the full VLA-Arena framework. This includes an end-to-end toolchain spanning from task definition to automated evaluation, as well as the VLA-Arena-S/M/L datasets for fine-tuning. The benchmark, associated data, models, and leaderboard are accessible at https://vla-arena.github.io.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





