Multimodal Function Vectors for Visual Relations
Abstract
While Large Multimodal Models (LMMs) exhibit remarkable capabilities in learning from limited multimodal examples, the underlying mechanisms facilitating this in-context learning remain largely unclear. Drawing on previous research regarding Large Language Models, we reveal that a specific, limited group of attention heads within LMMs plays a critical role in conveying visual relationship representations. These heads generate activations, known as function vectors, which can be isolated and modified to directly influence an LMM's efficacy in relational tasks.
In this study, we utilize both synthetic and real-world image datasets to conduct causal mediation analysis, pinpointing attention heads that significantly impact relational predictions. From these, we extract multimodal function vectors that boost zero-shot accuracy during inference. Furthermore, we demonstrate that these vectors can be fine-tuned using a relatively small dataset without updating the LMM's core parameters, resulting in performance that substantially surpasses standard in-context learning baselines.
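As a rough illustration of the extraction step described above, the following sketch ranks attention heads by a per-head causal score, sums the task-averaged outputs of the top-ranked heads into a single function vector, and adds that vector to a hidden state. It assumes the formulation commonly used for function vectors in language models; the abstract does not spell out the exact procedure. All names (`select_top_heads`, `build_function_vector`, `inject`), the array shapes, and the random toy data are hypothetical stand-ins for real model activations and mediation scores.

```python
import numpy as np

def select_top_heads(indirect_effect, k):
    """Return (layer, head) pairs for the k heads with the largest scores."""
    # Sort all heads by score, largest first, over the flattened array.
    flat = np.argsort(indirect_effect, axis=None)[::-1][:k]
    return [
        tuple(int(i) for i in np.unravel_index(f, indirect_effect.shape))
        for f in flat
    ]

def build_function_vector(mean_head_outputs, heads):
    """Sum the task-averaged outputs of the selected heads into one vector."""
    return sum(mean_head_outputs[layer, head] for layer, head in heads)

def inject(hidden_state, function_vector, scale=1.0):
    """Add the (optionally scaled) function vector to a hidden state."""
    return hidden_state + scale * function_vector

# Toy data standing in for real measurements (assumption, not paper data).
rng = np.random.default_rng(0)
n_layers, n_heads, d_model = 4, 8, 16
# Mean output of each head over in-context prompts for one relation.
mean_head_outputs = rng.normal(size=(n_layers, n_heads, d_model))
# Per-head causal score, e.g. from a mediation analysis.
indirect_effect = rng.normal(size=(n_layers, n_heads))

heads = select_top_heads(indirect_effect, k=3)
fv = build_function_vector(mean_head_outputs, heads)
steered = inject(np.zeros(d_model), fv)
```

In a real setting the hidden state would come from a zero-shot forward pass at a chosen layer, and the vector could then be treated as a trainable parameter while the model weights stay frozen.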
Additionally, we show that function vectors tailored to specific relations can be linearly combined to resolve analogy problems involving previously unseen and untrained visual relationships, underscoring the robust generalization potential of this method. Extensive experiments on two distinct LMMs, OpenFlamingo and Qwen3-VL, indicate that visual relational knowledge is encoded within localized internal structures. These structures can be systematically extracted and optimized, thereby deepening our comprehension of model modularity and improving control over relational reasoning in LMMs.
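The linear combination mentioned above can be pictured as a weighted sum of relation-specific vectors. The sketch below is illustrative only: the relation names, the weights, and the helper `combine_function_vectors` are hypothetical, and the abstract does not state how the combination coefficients are chosen.

```python
import numpy as np

def combine_function_vectors(vectors, weights):
    """Weighted sum of relation-specific function vectors."""
    return sum(weights[name] * vectors[name] for name in sorted(weights))

# Toy vectors standing in for extracted function vectors (assumption).
rng = np.random.default_rng(0)
d_model = 16
vectors = {
    "left_of": rng.normal(size=d_model),  # hypothetical relation name
    "above": rng.normal(size=d_model),    # hypothetical relation name
}
# Hypothetical coefficients; in practice these might be fit on a few examples.
weights = {"left_of": 0.5, "above": 0.5}

combined = combine_function_vectors(vectors, weights)
```

The combined vector has the same dimensionality as its components, so it could be injected into the model in the same way as a single-relation vector.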
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




