arXiv

Video2LoRA: Parametric Video Internalization for Vision-Language Models

June 4, 2026 · Manan Suri, Sarvesh Baskar, Dinesh Manocha

Abstract:

Processing video in vision-language models (VLMs) is computationally expensive: each frame consumes hundreds of tokens, so inference cost grows with every additional frame and every repeated query. To address this, we present Video2LoRA, a novel approach for the parametric internalization of video content. The method employs a perceiver hypernetwork that ingests the layer-wise intermediate representations a frozen VLM produces while processing a video, and emits a Low-Rank Adaptation (LoRA) adapter in a single forward pass. In contrast to conventional LoRA fine-tuning, which relies on iterative gradient updates, Video2LoRA predicts the adapter weights directly from the video input.
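The idea of predicting an adapter from layer-wise hidden states can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the class name `PerceiverHypernetSketch`, the single-head cross-attention, the use of one latent query per rank slot, the shared projection matrices, and the random (untrained) weights. It only shows the data flow: a variable number of video tokens per layer goes in, and fixed-size LoRA factors `A` and `B` come out in one pass.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class PerceiverHypernetSketch:
    """Hypothetical sketch: per-layer latent queries cross-attend over that
    layer's hidden states and are projected to LoRA factors (A, B)."""

    def __init__(self, n_layers, d_model, rank, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_model)
        self.d_model = d_model
        # Assumption: one latent query per rank slot, per VLM layer.
        self.latents = rng.normal(0, s, (n_layers, rank, d_model))
        self.w_k = rng.normal(0, s, (d_model, d_model))   # key projection
        self.w_v = rng.normal(0, s, (d_model, d_model))   # value projection
        self.to_a = rng.normal(0, s, (d_model, d_model))  # latent -> row of A
        self.to_b = rng.normal(0, s, (d_model, d_model))  # latent -> column of B

    def __call__(self, hidden_states):
        # hidden_states: (n_layers, n_tokens, d_model) from the frozen VLM.
        adapters = []
        for q, h in zip(self.latents, hidden_states):
            k, v = h @ self.w_k, h @ self.w_v
            attn = softmax(q @ k.T / np.sqrt(self.d_model))  # (rank, n_tokens)
            z = attn @ v                                     # (rank, d_model)
            A = z @ self.to_a                                # (rank, d_in)
            B = (z @ self.to_b).T                            # (d_out, rank)
            adapters.append((A, B))
        return adapters

def apply_lora(W, A, B, x, alpha=1.0):
    # Frozen weight W plus low-rank update: y = x W^T + alpha * x A^T B^T
    return x @ W.T + alpha * (x @ A.T) @ B.T

rng = np.random.default_rng(1)
n_layers, n_tokens, d_model, rank = 2, 36, 8, 4
hidden = rng.normal(size=(n_layers, n_tokens, d_model))  # stand-in activations
hyper = PerceiverHypernetSketch(n_layers, d_model, rank)
adapters = hyper(hidden)  # one (A, B) pair per layer, single forward pass
```

Because the latent queries have a fixed size, the adapter shapes do not depend on how many video tokens were attended over, which is what lets the visual tokens be dropped from the context once the adapter exists.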

We evaluate Video2LoRA on SmolVLM2 models of 500M and 2.2B parameters for video summarization and captioning tasks. With the generated adapter, the frozen VLM answers queries without any visual tokens in the context window at query time. Video2LoRA is statistically equivalent to direct video-in-context inference across all five captioning benchmarks at both model scales, and matches it on seven of eight video question-answering benchmark pairings. Although trained exclusively on 12 frames at 384px resolution, the model remains stable when processing up to 1,024 frames and 1024px resolution, a regime where standard video-in-context inference typically degrades. Across this range, the method reduces the visual-token load at answer time by as much as 1,500-fold and lowers query time-to-first-token (TTFT) by 6 to 80 times, while keeping outputs faithful to the source video. Finally, adapters generated independently for non-overlapping video segments can be composed in rank space, suggesting a viable strategy for internalizing long videos in chunks.
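The abstract does not spell out how rank-space composition is performed. One natural reading, assumed here purely for illustration, is to concatenate the per-chunk LoRA factors along the rank axis; the resulting update then equals the sum of the individual chunk updates. The helper name `compose_in_rank_space` and the random factors below are hypothetical.

```python
import numpy as np

def compose_in_rank_space(adapters):
    """Stack LoRA factors along the rank axis.

    Each adapter is (A_i, B_i) with A_i: (r_i, d_in) and B_i: (d_out, r_i).
    The stacked pair satisfies B @ A == sum_i B_i @ A_i.
    """
    A = np.concatenate([a for a, _ in adapters], axis=0)  # (sum r_i, d_in)
    B = np.concatenate([b for _, b in adapters], axis=1)  # (d_out, sum r_i)
    return A, B

rng = np.random.default_rng(0)
d_in, d_out, rank, n_chunks = 16, 12, 4, 3

# Stand-ins for adapters generated independently per video chunk.
chunk_adapters = [
    (rng.normal(size=(rank, d_in)), rng.normal(size=(d_out, rank)))
    for _ in range(n_chunks)
]

A, B = compose_in_rank_space(chunk_adapters)
delta_sum = sum(b @ a for a, b in chunk_adapters)
```

Under this assumption, the composed adapter has rank at most `n_chunks * rank`, so its size grows with the number of chunks rather than with the number of frames.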
