arXiv

ChartArena: Benchmarking Chart Parsing across Languages, Scenarios, and Formats

June 2, 2026 · Shangpin Peng, Gengluo Li, Xingyu Wan, Chengquan Zhang, Hao Feng, Binghong Wu, Huawen Shen, Weinong Wang, Ziyi Cai, Zhuotao Tian, Han Hu, Can Ma, Yu Zhou


Abstract

Charts are a fundamental tool for communicating quantitative and relational data, yet systematic evaluation of chart parsing models remains difficult. Current benchmarks often cover only a limited set of chart types and neglect diagrammatic structures such as mind maps and flowcharts. Furthermore, existing models generate outputs in incompatible formats, and datasets frequently omit real-world variations such as printed or hand-drawn images.

To resolve these limitations, we present ChartArena, a robust bilingual benchmark that encompasses eight distinct chart families, bridging both numeric charts and diagrammatic structures. Each category is assessed across three visual contexts: digital renderings, photographs of printed documents, and images of hand-drawn sketches. The dataset’s reliability is ensured through a human-agent collaborative annotation pipeline featuring multi-stage human verification.

To facilitate equitable comparisons between different models, we have developed a format-agnostic evaluation protocol. This system translates heterogeneous model outputs into two standardized semantic spaces—a normalized triple view and a directed graph view—allowing for scoring via structure-aware metrics.
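As a rough illustration of the idea, the sketch below maps two differently formatted outputs (a CSV table and JSON-style records) into one set of normalized triples, maps arrow-style flowchart text into directed edges, and scores each with a set-based F1. This is not the paper's protocol: the function names, the `(row, column, value)` triple layout, the `-->` edge syntax, and the F1 metric are all assumptions made for this example, and value normalization here is purely textual (so `10` and `10.0` would not match).

```python
from typing import Iterable, Set, Tuple

# Hypothetical types for this sketch; the benchmark's real schema may differ.
Triple = Tuple[str, str, str]
Edge = Tuple[str, str]


def norm(text: str) -> str:
    """Lowercase and collapse whitespace so cosmetic differences are ignored."""
    return " ".join(text.lower().split())


def triples_from_csv(csv_text: str) -> Set[Triple]:
    """Map a CSV table (header row, first column = category) to triples."""
    rows = [
        [cell.strip() for cell in line.split(",")]
        for line in csv_text.strip().splitlines()
    ]
    header, body = rows[0], rows[1:]
    return {
        (norm(row[0]), norm(col), norm(val))
        for row in body
        for col, val in zip(header[1:], row[1:])
    }


def triples_from_records(records: Iterable[dict], key: str) -> Set[Triple]:
    """Map JSON-style records to the same triple space, keyed by `key`."""
    return {
        (norm(str(rec[key])), norm(str(field)), norm(str(value)))
        for rec in records
        for field, value in rec.items()
        if field != key
    }


def edges_from_arrows(text: str) -> Set[Edge]:
    """Parse lines such as 'A --> B' into directed edges."""
    edges: Set[Edge] = set()
    for line in text.strip().splitlines():
        if "-->" in line:
            src, dst = line.split("-->", 1)
            edges.add((norm(src), norm(dst)))
    return edges


def f1(pred: set, gold: set) -> float:
    """Set-based F1 between predicted and reference items."""
    if not pred or not gold:
        return float(pred == gold)
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Numeric chart: reference in CSV, prediction in JSON with one wrong value.
gold_triples = triples_from_csv("Quarter,Sales,Cost\nQ1,10,4\nQ2,12,5")
pred_triples = triples_from_records(
    [
        {"Quarter": "Q1", "Sales": 10, "Cost": 4},
        {"Quarter": "Q2", "Sales": 12, "Cost": 6},
    ],
    key="Quarter",
)
triple_score = f1(pred_triples, gold_triples)

# Flowchart: the prediction recovers one of the two reference edges.
gold_edges = edges_from_arrows("Start --> Check\nCheck --> End")
pred_edges = edges_from_arrows("start --> check")
edge_score = f1(pred_edges, gold_edges)
```

Because both outputs land in the same triple or edge space before scoring, a model is not penalized for choosing CSV over JSON; only disagreements in content or structure lower the score.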

Our extensive evaluation of 26 state-of-the-art Multimodal Large Language Models (MLLMs) yielded three key insights:

1. Leading proprietary models, such as Gemini 3.1 Pro, dominate overall performance, although top-tier open-source systems are quickly narrowing the performance gap.
2. While document parsing models perform adequately with numeric charts, their capabilities drop significantly when handling diagrammatic structures.
3. Specialized expert chart parsers remain confined to specific, narrow chart families.

Radar charts and hand-drawn scenarios proved particularly difficult for all tested models. These results highlight distinct capability gaps in current technology and establish ChartArena as a unified foundation for future research. The ChartArena dataset and resources are publicly accessible at https://github.com/pspdada/ChartArena.
