Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages
Title: Examining Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages
Abstract: Idiomatic expressions present a significant hurdle for multilingual natural language processing: the same expression can carry a literal or a figurative meaning, and context is needed to tell which one is intended. Earlier studies have predominantly targeted high-resource languages and concentrated on isolated questions about idiom meanings, largely neglecting realistic discourse scenarios. To address this gap, we present MIDI, a comprehensive multilingual idiom dataset developed by native speakers that covers 3 high-resource, 3 medium-resource, and 12 low-resource languages. Unlike prior efforts, MIDI situates idioms in both sentence-level and conversational contexts and covers both their literal and figurative uses. Our evaluation of state-of-the-art models shows that idiom comprehension deteriorates in low-resource settings. Furthermore, across all resource categories, literal interpretations prove significantly more challenging than figurative ones. Incorporating conversational context improves model performance but does not fully close these gaps. By employing controlled tests and analyzing interventions on hidden representations, we disentangle memorization from reasoning, highlighting fundamental limitations of existing models.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




