arXiv

Hyper-ICL: Attention Calibration with Hyperbolic Anchor Distillation for Multimodal In-Context Learning

June 4, 2026 · Niloufar Alipour Talemi, Hossein Kashiani, Fatemeh Afghah

Title: Hyper-ICL: Enhancing Multimodal In-Context Learning via Attention Calibration and Hyperbolic Anchor Distillation

Abstract:

Multimodal In-Context Learning (ICL) has become a viable inference strategy for Multimodal Large Language Models: a small set of interleaved image-text In-Context Demonstrations (ICDs) guides the model toward solving novel tasks. Although multimodal ICL is flexible, it suffers from substantial inference latency and instability. The latency comes from processing the demonstration tokens alongside every query. The instability comes from the model's high sensitivity to the formatting, ordering, and content of the demonstrations.

To address these challenges, we introduce Hyper-ICL, a lightweight training framework that enables demonstration-free multimodal ICL. It reproduces the effect of demonstrations without requiring ICDs at inference time. Hyper-ICL uses a parameter-efficient low-rank adapter that operates at the logit level to calibrate attention distributions, so that they closely match the attention shifts induced by demonstrations.
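As a rough illustration of logit-level attention calibration, the NumPy sketch below adds a low-rank bias to the pre-softmax attention scores of the query tokens. It then measures how far the result is from a demonstration-conditioned teacher. Everything here is an assumption for illustration only. This includes the factorization `(h @ w_a) @ (h @ w_b).T`, the hypothetical helper names, and the choice to renormalize the teacher's attention over the query's own keys as the target. It also includes the use of KL divergence as the matching objective. The paper's exact parameterization and loss may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; -inf logits (masked positions) receive probability 0
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def calibrated_attention(q, k, h, w_a, w_b):
    """Causal attention over the query's own tokens with a low-rank logit bias.

    q, k     : (n, d_head) query/key projections for the n query tokens.
    h        : (n, d_model) hidden states fed to the adapter.
    w_a, w_b : (d_model, r) low-rank factors (hypothetical parameterization).
    """
    n, d_head = q.shape
    base = q @ k.T / np.sqrt(d_head)           # frozen model's attention logits
    delta = (h @ w_a) @ (h @ w_b).T            # (n, n) correction of rank <= r
    causal = np.tril(np.ones((n, n), dtype=bool))
    logits = np.where(causal, base + delta, -np.inf)
    return softmax(logits)

def teacher_query_block(p_teacher_full, num_demo_tokens):
    """Hypothetical target: keep the teacher's attention (rows = query tokens,
    cols = demo keys followed by query keys) on the query keys only, renormalized."""
    block = p_teacher_full[:, num_demo_tokens:]
    return block / np.sum(block, axis=-1, keepdims=True)

def attention_kl(p_target, p_student, eps=1e-12):
    """Mean row-wise KL(target || student) between attention distributions."""
    kl = np.sum(p_target * (np.log(p_target + eps) - np.log(p_student + eps)), axis=-1)
    return float(np.mean(kl))
```

Only `w_a` and `w_b` would be trained in this sketch, which keeps the parameter count at `2 * d_model * r` per adapted head group. Setting either factor to zero recovers the frozen model's attention exactly.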

We further introduce a query-adaptive modulation mechanism that dynamically adjusts the strength of this intervention for each token, layer, and attention head, tailoring it to the needs of each query. In addition, we develop a layer-wise hyperbolic anchor distillation loss that uses the Lorentz geodesic distance to align the student's intermediate features with those of a demonstration-conditioned teacher. This encourages the student to replicate the query-demonstration relationships that ICDs normally induce.
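One way to picture query-adaptive modulation is as a learned gate that scales the logit correction separately for each head and each query token. The sketch below assumes a sigmoid gate computed from the token's hidden state, with hypothetical per-layer parameters `w_gate` and `b_gate`. It is one plausible instantiation, not the paper's design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def token_head_gates(h, w_gate, b_gate):
    """Per-head, per-token intervention strength in (0, 1) for one layer.

    h      : (n, d_model) hidden states of the query tokens at this layer.
    w_gate : (num_heads, d_model) hypothetical gate weights for this layer.
    b_gate : (num_heads,) hypothetical gate biases for this layer.
    Returns gates of shape (num_heads, n).
    """
    return sigmoid(w_gate @ h.T + b_gate[:, None])

def modulated_logits(base_logits, delta_logits, gates):
    """Scale each query row of the correction by its gate.

    base_logits, delta_logits : (num_heads, n, n) attention logits and adapter correction.
    gates                     : (num_heads, n), broadcast over the key axis.
    Computes base + g[head, t] * delta[head, t, :].
    """
    return base_logits + gates[:, :, None] * delta_logits
```

A gate near 0 leaves a head's attention untouched for that token, and a gate near 1 applies the full correction. In a multi-layer model, each layer would hold its own `w_gate` and `b_gate`.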
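A minimal sketch of a layer-wise hyperbolic distillation loss needs three pieces. The first is a map from Euclidean features onto the Lorentz hyperboloid {x : <x, x>_L = -1/c, x_0 > 0}. The second is the Lorentz geodesic distance d(x, y) = arccosh(-c <x, y>_L) / sqrt(c). The third is a weighted sum over layers. The sketch assumes curvature -c and lifts both student and teacher features with the exponential map at the hyperboloid origin. It also treats the teacher features as the anchors and averages distances over tokens. These choices and all function names are assumptions, not details taken from the paper.

```python
import numpy as np

def lorentz_inner(x, y):
    # Minkowski inner product <x, y>_L = -x0*y0 + sum_i xi*yi over the last axis
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def expmap0(v, c=1.0, eps=1e-9):
    """Lift Euclidean vectors v (..., d) onto the hyperboloid <x, x>_L = -1/c in R^(d+1)
    via the exponential map at the origin (1/sqrt(c), 0, ..., 0)."""
    sqrt_c = np.sqrt(c)
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    norm_safe = np.maximum(norm, eps)          # avoid 0/0 for the zero vector
    x0 = np.cosh(sqrt_c * norm) / sqrt_c
    xs = np.sinh(sqrt_c * norm) * v / (sqrt_c * norm_safe)
    return np.concatenate([x0, xs], axis=-1)

def lorentz_distance(x, y, c=1.0):
    # d(x, y) = arccosh(-c <x, y>_L) / sqrt(c); clipping guards against rounding below 1
    arg = np.clip(-c * lorentz_inner(x, y), 1.0, None)
    return np.arccosh(arg) / np.sqrt(c)

def hyperbolic_anchor_loss(student_layers, teacher_layers, layer_weights=None, c=1.0):
    """Weighted sum over layers of the mean geodesic distance between student
    features and teacher anchors (each layer: array of shape (n_tokens, d))."""
    n = len(student_layers)
    if layer_weights is None:
        layer_weights = [1.0 / n] * n
    total = 0.0
    for w, s, t in zip(layer_weights, student_layers, teacher_layers):
        d = lorentz_distance(expmap0(s, c), expmap0(t, c), c)
        total += w * float(np.mean(d))
    return total
```

A useful sanity check for this construction is that the distance from the origin to `expmap0(v, c)` equals the Euclidean norm of `v`. That property is what makes the exponential map a natural way to lift features before comparing them.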

Comprehensive evaluations on six multimodal benchmarks, including VQAv2, OK-VQA, and COCO Caption, show that Hyper-ICL consistently outperforms both standard ICL and existing state-of-the-art methods in accuracy and stability.
