Global PIQA: Evaluating Commonsense Reasoning Across 100+ Languages and Cultures
Abstract:
Few culturally specific evaluation benchmarks for large language models (LLMs) cover a broad range of languages and cultures. To address this gap, this paper introduces Global PIQA, a participatory commonsense reasoning benchmark for more than 100 languages. The dataset was built by over 350 researchers from 65 countries. Global PIQA covers 141 language varieties spanning five continents, 19 language families, and 24 writing systems.
The benchmark has two splits. In the non-parallel split, more than half of the examples reference locally relevant elements such as regional foods, customs, traditions, and other culture-specific details. The parallel split instead translates "culturally agnostic" commonsense reasoning questions into 131 language varieties, enabling direct cross-lingual comparison. All examples in both splits were validated by native speakers of the respective languages.
Our analysis shows that state-of-the-art LLMs perform well on Global PIQA in aggregate but considerably worse in lower-resource languages. In the parallel split, for instance, accuracy gaps between languages reach up to 68%. These findings indicate that, for many cultures, everyday knowledge remains an area where LLMs need improvement, alongside existing concerns about complex reasoning and expert knowledge. Beyond LLM evaluation, Global PIQA offers insight into the diversity of cultures in which human language is situated.
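The cross-lingual gap reported for the parallel split can be illustrated with a minimal sketch. It assumes each evaluation record is a hypothetical (language, is_correct) pair and that the gap is measured in percentage points between the best- and worst-scoring languages; the abstract does not specify either detail, and the function names and toy data below are illustrative only, not part of Global PIQA.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Return {language: accuracy} from (language, is_correct) pairs.

    The record layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for language, is_correct in records:
        totals[language] += 1
        correct[language] += int(is_correct)
    return {lang: correct[lang] / totals[lang] for lang in totals}


def largest_accuracy_gap(accuracy_by_language):
    """Return (best_language, worst_language, gap in percentage points)."""
    best = max(accuracy_by_language, key=accuracy_by_language.get)
    worst = min(accuracy_by_language, key=accuracy_by_language.get)
    gap_points = 100 * (accuracy_by_language[best] - accuracy_by_language[worst])
    return best, worst, gap_points


# Made-up toy records; these are not results from the paper.
toy_records = [
    ("lang_a", True), ("lang_a", True), ("lang_a", True), ("lang_a", False),
    ("lang_b", True), ("lang_b", False), ("lang_b", False), ("lang_b", False),
]

accuracy = per_language_accuracy(toy_records)
best, worst, gap = largest_accuracy_gap(accuracy)
print(f"{best} vs {worst}: gap of {gap:.1f} percentage points")
```

Because the parallel split poses the same questions in every language, a per-language comparison of this kind isolates the effect of language rather than question difficulty.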
Source: arXiv





