arXiv

Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages

June 2, 2026 · Saeed Almheiri, Bilal Elbouardi, Salsabila Zahirah Pranida, Irina Nikishina, Ashwath Rao B, Parameswari Krishnamurthy, Muhammad Cendekia Airlangga, Rifo Ahmad Genadi, Nguyen Phan Gia Bao, Amir Hossein Yari, Hawau Olamide Toyin, Nurdaulet Mukhituly, Mena A

Title: Examining Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages

Abstract: Idiomatic expressions are a significant hurdle for multilingual natural language processing: their meanings shift between literal and figurative interpretations, so decoding them accurately requires contextual understanding. Earlier studies have predominantly targeted high-resource languages and concentrated on isolated questions about idiom meanings, largely neglecting realistic discourse scenarios. To address this gap, we present MIDI, a comprehensive multilingual idiom dataset developed by native speakers. It covers 3 high-resource, 3 medium-resource, and 12 low-resource languages. Unlike prior efforts, MIDI situates idioms in both sentence-level and conversational contexts, covering both literal and figurative uses. Our evaluation of state-of-the-art models shows that idiom comprehension deteriorates in low-resource settings. Furthermore, across all resource categories, literal interpretations prove significantly more challenging than figurative ones. Incorporating conversational context improves model performance but does not fully close these gaps. By using controlled tests and analyzing interventions on hidden representations, we disentangle memorization from reasoning and highlight fundamental limitations of existing models.

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC