arXiv

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

June 2, 2026 · Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu, Lu Fan, Zhi Li, You He

Abstract:

As End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) advance rapidly, their evaluation methods have lagged behind and still rely primarily on simple transcription tasks. Current benchmarks suffer from three major flaws: a strong skew toward high-resource languages, an emphasis on low-level automatic speech recognition (ASR) rather than semantic reasoning, and a general neglect of regional dialects. To address these shortcomings, we present PolySpeech-100, a large-scale benchmark for evaluating 'native-level' speech comprehension across 110 linguistic variants. We use a hybrid construction pipeline that combines gold-standard human recordings with instruction-driven synthetic speech, enabling coverage of 19 Chinese dialects and more than 80 low-resource languages.

Our extensive evaluation of 22 state-of-the-art models, including Gemini-3, GPT-Audio, and Qwen2.5-Omni, yields three key insights. First, we show that open-source E2E models surpass cascade systems (ASR+LLM) on heavily dialectal speech. This confirms that direct audio processing retains paralinguistic and prosodic cues, such as intonation and stress, that are typically lost in standard transcription. Second, we identify a stark performance divide: while commercial models remain robust, open-source models degrade significantly on low-resource languages. Finally, and surprisingly, we find that under standard zero-shot conditions, Chain-of-Thought prompting reduces speech understanding performance for most tested models, suggesting a potential modality alignment gap in current architectures. PolySpeech-100 sets a rigorous new standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.