Beyond the Literal: Decomposing Pragmatic Intent in Multimodal Meme Understanding
Title: Moving Beyond the Surface: Disentangling Pragmatic Intent in Multimodal Meme Interpretation
Abstract: Large Vision-Language Models (LVLMs) often describe the visual elements of a meme or sarcastic post when asked what it means, instead of capturing the author’s intended message. This limitation arises because standard instruction tuning entangles the literal content of a post with its pragmatic significance, so superficial details skew the final output. To address this, we recast meme comprehension as the problem of separating literal content from pragmatic intent. We introduce Intent Projection, a framework that disentangles these two signals at the representation, output, and objective levels within a single LVLM backbone.
At the representation level, an orthogonal projection module removes the dominant unimodal directions from the fused image-text representation, preserving only the pragmatic residual. In parallel, a surface-real affect classifier gives the decoder a discrete tag that identifies the polarity gap between surface and real affect. At the output level, the framework enforces structured reasoning chains; at the objective level, it uses a contrastive reward that explicitly penalizes responses that merely restate the literal description. Evaluated on six multimodal benchmarks, Intent Projection consistently surpasses open-source baselines and narrows the gap to proprietary models. The largest improvements occur on high-divergence posts, where collapsing to the literal reading harms understanding the most.
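The representation-level step can be illustrated with a minimal sketch. It assumes the dominant unimodal directions are already available as explicit vectors; the abstract does not say how they are estimated, and the function names `orthonormalize` and `project_out` are hypothetical, not taken from the paper. The sketch only shows the generic linear-algebra operation of projecting a fused vector onto the orthogonal complement of a set of directions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions, eps=1e-12):
    """Gram-Schmidt: build an orthonormal basis for span(directions).

    Directions that are (numerically) linearly dependent on earlier
    ones are skipped.
    """
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [x - c * y for x, y in zip(r, b)]
        norm = math.sqrt(dot(r, r))
        if norm > eps:
            basis.append([x / norm for x in r])
    return basis


def project_out(fused, directions):
    """Return `fused` minus its component inside span(directions).

    Assumption for illustration: `directions` stands in for the
    dominant unimodal (image-only, text-only) directions.
    """
    residual = list(fused)
    for b in orthonormalize(directions):
        c = dot(residual, b)
        residual = [x - c * y for x, y in zip(residual, b)]
    return residual


# Toy 3-d example with made-up vectors.
fused = [3.0, 2.0, 1.0]
image_dir = [1.0, 0.0, 0.0]
text_dir = [1.0, 1.0, 0.0]  # deliberately not orthogonal to image_dir
residual = project_out(fused, [image_dir, text_dir])
```

In this toy case the residual is orthogonal to both supplied directions, which is the property the pragmatic residual is meant to have with respect to the unimodal signal. Orthonormalizing first matters because subtracting correlated directions one at a time would not remove the full subspace.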
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC





