arXiv

Hyper-ICL: Attention Calibration with Hyperbolic Anchor Distillation for Multimodal In-Context Learning

Title: Hyper-ICL: Enhancing Multimodal In-Context Learning via Attention Calibration and Hyperbolic Anchor Distillation

Abstract:

Multimodal In-Context Learning (ICL) has become a viable inference strategy for Multimodal Large Language Models. This approach leverages a limited collection of interleaved image-text In-Context Demonstrations (ICDs) to guide the model in addressing novel tasks. However, while multimodal ICL offers significant flexibility, it is often hindered by substantial inference latency and instability. These issues stem from the model's high sensitivity to the formatting, sequence, and specific content of the demonstrations.

To overcome these challenges, we introduce Hyper-ICL, a lightweight, trainable framework for demonstration-free multimodal ICL: it reproduces the effect of demonstrations without requiring any ICDs at inference time. Hyper-ICL uses a parameter-efficient low-rank adapter that operates on the attention logits, calibrating the attention distributions so that they closely match the attention shifts that demonstrations induce.
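The idea of a low-rank adapter acting at the logit level can be illustrated with a minimal sketch. The abstract does not give the actual parameterization, so everything here is an assumption made for illustration: the additive form of the logit shift, the factor names `u` and `v`, the scalar `gate`, and the helper functions are all hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of logits.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # Plain list-of-lists matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def calibrated_attention(q, k, u, v, gate=1.0):
    """Attention weights with an assumed low-rank shift added to the logits.

    q: n x d queries, k: m x d keys.
    u, v: d x r low-rank factors (hypothetical names, r much smaller than d).
    gate: scalar intervention strength (hypothetical); 0.0 recovers plain attention.
    """
    scale = math.sqrt(len(q[0]))
    base = matmul(q, transpose(k))                          # q k^T
    shift = matmul(matmul(q, u), transpose(matmul(k, v)))   # (q u)(k v)^T, rank <= r
    logits = [
        [b / scale + gate * s for b, s in zip(base_row, shift_row)]
        for base_row, shift_row in zip(base, shift)
    ]
    return [softmax(row) for row in logits]

# Toy example: 2 queries, 3 keys, d = 2, rank r = 1.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
u = [[0.5], [-0.5]]
v = [[1.0], [0.0]]
plain = calibrated_attention(q, k, u, v, gate=0.0)
shifted = calibrated_attention(q, k, u, v, gate=1.0)
```

In this toy setup the shift adds only 2·d·r parameters per head, and setting `gate` to zero leaves the original attention untouched, which is one plausible reading of "parameter-efficient" calibration. A learned, per-token gate in place of the scalar would be a natural extension, but its form is not specified in the abstract.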

Furthermore, we present a query-adaptive modulation mechanism. This component dynamically adjusts the intensity of intervention at the token level across various layers and attention heads, tailoring the process to the specific requirements of each query. Additionally, we develop a layer-wise hyperbolic anchor distillation loss. This mechanism aligns the intermediate features of the student model with those of a demonstration-conditioned teacher, utilizing Lorentz geodesic distance. By doing so, the loss function encourages the student to effectively replicate the relationships between queries and demonstrations that are normally induced by ICDs.
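The Lorentz geodesic distance used by the distillation loss can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's loss: it assumes the hyperboloid model with curvature -1, lifts Euclidean feature vectors onto the hyperboloid by solving for the time coordinate, and averages per-layer distances. The function names and the averaging scheme are hypothetical.

```python
import math

def lift_to_hyperboloid(vec):
    # Map a Euclidean vector v to the point (sqrt(1 + |v|^2), v),
    # which satisfies <x, x>_L = -1 on the unit hyperboloid.
    time_coord = math.sqrt(1.0 + sum(c * c for c in vec))
    return [time_coord] + list(vec)

def lorentz_inner(x, y):
    # Lorentzian inner product: -x0*y0 + sum_i xi*yi.
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    # Geodesic distance on the hyperboloid: arccosh(-<x, y>_L).
    # The argument is clamped to 1.0 to guard against rounding error.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

def anchor_distillation_loss(student_layers, teacher_layers):
    # Assumed form: mean Lorentz distance between per-layer student
    # and teacher features (one feature vector per layer here).
    dists = [
        lorentz_distance(lift_to_hyperboloid(s), lift_to_hyperboloid(t))
        for s, t in zip(student_layers, teacher_layers)
    ]
    return sum(dists) / len(dists)

# Toy features for two layers.
teacher = [[0.3, -0.2], [1.0, 0.5]]
student = [[0.1, 0.0], [0.8, 0.7]]
loss = anchor_distillation_loss(student, teacher)
```

The loss is zero (up to rounding) when student and teacher features coincide and grows with their hyperbolic separation. How the actual method selects anchors and weights the layers is not described in the abstract.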

Evaluations on six multimodal benchmarks, including VQAv2, OK-VQA, and COCO Caption, show that Hyper-ICL consistently outperforms both standard ICL and current state-of-the-art techniques in accuracy and stability.


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
