arXiv

Video2LoRA: Parametric Video Internalization for Vision-Language Models

Abstract:

Handling video data within vision-language models (VLMs) incurs significant computational costs, as each frame consumes hundreds of tokens, causing inference expenses to escalate with every frame and repeated query. To address this, we present Video2LoRA, a novel approach for the parametric internalization of video content. This method employs a perceiver hypernetwork that ingests intermediate representations generated layer-by-layer as a frozen VLM processes a video, subsequently producing a Low-Rank Adaptation (LoRA) adapter in a single forward pass. In contrast to conventional LoRA fine-tuning, which relies on iterative gradient updates, Video2LoRA directly predicts these weights from the video input.
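
The mechanism described above can be illustrated with a toy sketch. It is not the authors' architecture: the single-head cross-attention, the random weights, and all names (`cross_attend`, `predict_lora`, `w_a`, `w_b`) are assumptions made for illustration. It only shows the general idea of a perceiver-style module in which a fixed set of latent queries reads each layer's hidden states and emits LoRA factors in one forward pass, with no gradient updates.

```python
import math
import random


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def cross_attend(latents, states):
    """Single-head cross-attention: latent queries read from one layer's hidden states."""
    d = len(latents[0])
    scores = matmul(latents, transpose(states))      # (rank, n_tokens)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, states)                   # (rank, d_model)


def predict_lora(layer_states, latents, w_a, w_b):
    """Map per-layer hidden states to per-layer LoRA factors (A, B) in one pass."""
    adapters = []
    for states in layer_states:
        summary = cross_attend(latents, states)      # (rank, d_model)
        a = matmul(summary, w_a)                     # A: (rank, d_in)
        b = transpose(matmul(summary, w_b))          # B: (d_out, rank)
        adapters.append((a, b))
    return adapters


# Hypothetical toy sizes; a real VLM would be far larger.
rng = random.Random(0)
d_model, n_tokens, n_layers, rank = 8, 20, 3, 2
layer_states = [rand_matrix(n_tokens, d_model, rng, 1.0) for _ in range(n_layers)]
latents = rand_matrix(rank, d_model, rng)
w_a = rand_matrix(d_model, d_model, rng)
w_b = rand_matrix(d_model, d_model, rng)

adapters = predict_lora(layer_states, latents, w_a, w_b)
delta_w = matmul(adapters[0][1], adapters[0][0])     # B @ A: (d_out, d_in)
```

In this sketch the number of latent queries fixes the adapter rank, so the size of the predicted adapter does not depend on how many visual tokens the video produced.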

Evaluated on SmolVLM2 models of 500M and 2.2B parameters for tasks involving video summarization and captioning, Video2LoRA allows the frozen VLM to respond to queries using only the adapter, eliminating the need for visual tokens in the context window at query time. Performance evaluations demonstrate that Video2LoRA is statistically equivalent to direct video-in-context inference across all five captioning benchmarks at both model scales, and matches it across seven of eight video question-answering benchmark pairings. Notably, despite being trained exclusively on 12 frames at 384px resolution, the model remains stable when processing up to 1,024 frames and 1024px resolution, a regime in which standard video-in-context inference typically degrades. Throughout this range, the method reduces the visual-token load at answer time by as much as 1,500-fold and decreases the time-to-first-token (TTFT) for queries by 6 to 80 times, all while ensuring outputs remain faithful to the source video. Furthermore, our findings indicate that adapters generated independently for non-overlapping video segments can be composed within rank space, pointing toward a viable strategy for internalizing long videos in chunks.
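
The abstract does not say which operator is used to compose adapters in rank space. One plausible reading, shown here purely as an assumption, is concatenation along the rank axis: stacking the rows of each segment's A factor and the columns of each B factor yields a single higher-rank adapter whose weight update equals the sum of the per-segment updates. The function name `compose_rank_space` and the tiny integer matrices are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def compose_rank_space(adapters):
    """Concatenate (A, B) pairs along the rank axis.

    Each A has shape (rank_i, d_in) and each B has shape (d_out, rank_i).
    The result has rank sum(rank_i) and satisfies B_cat @ A_cat == sum(B_i @ A_i).
    """
    a_cat = [row for a, _ in adapters for row in a]
    d_out = len(adapters[0][1])
    b_cat = [[v for _, b in adapters for v in b[i]] for i in range(d_out)]
    return a_cat, b_cat


# Two hypothetical per-segment adapters with d_in = 3 and d_out = 2.
a1, b1 = [[1, 0, 2]], [[1], [2]]                      # rank 1
a2, b2 = [[0, 1, 1], [1, 1, 0]], [[1, 0], [0, 3]]     # rank 2

a_cat, b_cat = compose_rank_space([(a1, b1), (a2, b2)])
composed_delta = matmul(b_cat, a_cat)

# Reference: add the two per-segment weight updates directly.
d1, d2 = matmul(b1, a1), matmul(b2, a2)
summed_delta = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(d1, d2)]
```

Under this assumption the composed adapter's rank grows with the number of segments, while the base model and each segment's adapter stay unchanged.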


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
