arXiv

CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

June 2, 2026 · Yangfan Ye, Xiaocheng Feng, Jialong Tang, Xiayu Cao, Zihan Zhang, Xiachong Feng, Baosong Yang, Bing Qin

Title: CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

Abstract: Current studies predominantly frame cultural intelligence in Large Language Models (LLMs) as a matter of factual knowledge, often neglecting the critical question of whether models can effectively apply this knowledge in practical contexts. To address this oversight, we present CultureForest, a novel benchmark designed for Cultural Norm Grounded Reasoning. The benchmark anchors each query in a concise set of atomic norms, facilitating evaluations that are both verifiable and attributable. It contains 5,378 instances spanning 53 countries or regions and covering eight distinct domains, allowing for a tiered assessment that ranges from multiple-choice questions to open-ended generation tasks.

Our comprehensive experiments demonstrate that even leading models suffer significant performance declines in open-ended scenarios, with notable disparities observed across different regions. Detailed analysis reveals several consistent trends: first, employing reasoning at test time offers minimal improvements and can actually widen existing disparities; second, models display highly similar regional preference structures; third, model outputs tend to be notably conservative, particularly when subjected to stricter cultural constraints; and fourth, by separating the acquisition of cultural knowledge from the act of cultural reasoning, we demonstrate that while LLMs hold considerable cultural knowledge, their effectiveness is hindered by their inability to utilize it effectively. These results underscore the need to shift evaluation focus from mere knowledge retention to the measurement of knowledge-grounded reasoning.
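
As a rough illustration of the setup the abstract describes (queries grounded in atomic norms, scored per region), the following minimal sketch shows one way a multiple-choice instance could be represented and how per-region accuracy and a simple regional gap could be computed. The `Instance` fields, the function names, the gap metric, and the toy data are all assumptions made for illustration; they are not taken from the paper or its released code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Instance:
    """Hypothetical record for one benchmark item (field names are assumptions)."""
    region: str              # country or region the item belongs to
    domain: str              # domain label for the item
    norms: Tuple[str, ...]   # atomic norms the query is grounded in
    answer: str              # gold option label for the multiple-choice tier


def accuracy_by_region(instances: List[Instance], predictions: List[str]) -> Dict[str, float]:
    """Return {region: accuracy} for multiple-choice predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for inst, pred in zip(instances, predictions):
        total[inst.region] += 1
        correct[inst.region] += int(pred == inst.answer)
    return {region: correct[region] / total[region] for region in total}


def regional_gap(per_region: Dict[str, float]) -> float:
    """Spread between the best- and worst-scoring regions (an assumed metric)."""
    return max(per_region.values()) - min(per_region.values())


# Toy data, invented purely for illustration (not drawn from the benchmark).
items = [
    Instance("Region A", "dining", ("norm a1",), "B"),
    Instance("Region A", "greetings", ("norm a2",), "A"),
    Instance("Region B", "dining", ("norm b1", "norm b2"), "C"),
    Instance("Region B", "gifts", ("norm b3",), "D"),
]
preds = ["B", "A", "C", "A"]

scores = accuracy_by_region(items, preds)
print(scores)                # {'Region A': 1.0, 'Region B': 0.5}
print(regional_gap(scores))  # 0.5
```

Keeping the grounding norms on each record is what would make a wrong answer attributable to a specific norm; the open-ended tier would need a separate judging step that this sketch does not cover.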

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC