Where Do We (Not) Need Temporal Context in Low-Resource Video Task Adaptation?
Title: Determining the Necessity of Temporal Context in Low-Resource Video Task Adaptation
Abstract:
Parameter-efficient fine-tuning (PEFT) and probing adapt foundation models with few trainable parameters, which is particularly valuable in video understanding, where both annotation and computational costs are prohibitive. Despite this potential, research on video PEFT has concentrated on adapting models pre-trained on images rather than applying standard PEFT methods directly to video representations. The two approaches are seldom compared, and both tend to restrict temporal reasoning to a single model component, leaving open the question of how temporal context should be distributed among the backbone, PEFT modules, and probes. This study presents a comprehensive examination of adaptation strategies for video understanding. We assess various methods in appearance-centric, motion-centric, and spatially dense settings, with special emphasis on data-scarce scenarios, where parameter efficiency yields the greatest advantage. Our findings offer insights into applying PEFT and probing in each of these settings and highlight how the distribution of temporal context affects video adaptation.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC






