Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation
Title: Video-OPD: Streamlining the Post-Training of Multimodal Large Language Models for Temporal Video Grounding Through On-Policy Distillation
Abstract:
Reinforcement learning has gained traction as a robust post-training strategy for Temporal Video Grounding (TVG) because it optimizes on-policy, but current methods based on Group Relative Policy Optimization (GRPO) face two significant hurdles: sparse reward signals and high computational cost. To address these issues, we introduce Video-OPD, a streamlined post-training framework for TVG that builds on recent developments in on-policy distillation.
Video-OPD optimizes trajectories sampled directly from the current policy, so the training distribution stays aligned with the inference distribution. At the same time, a more capable teacher model provides dense, token-level supervision through a reverse Kullback-Leibler (KL) divergence objective. This preserves the on-policy property needed to reduce distributional shift while converting coarse, episode-level rewards into fine-grained, step-wise learning signals.
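To make the token-level objective concrete, the following is a minimal sketch of a per-token reverse KL, KL(student || teacher), averaged over one trajectory sampled from the student. It assumes both models expose logits over the same vocabulary at every step; the function names and the plain-Python list representation are illustrative assumptions, not the paper's implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position.

    The expectation is taken under the student's distribution, which is
    what makes this the "reverse" direction.
    """
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def trajectory_reverse_kl(student_steps, teacher_steps):
    """Mean per-token reverse KL over one student-sampled trajectory.

    Both arguments are lists of per-step logit lists (hypothetical layout).
    Returns (mean_loss, per_token_losses) so the dense signal stays visible.
    """
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token), per_token
```

Every token position contributes its own loss term, in contrast to a single scalar reward for the whole response. In a real training loop the same quantity would be computed on tensors with automatic differentiation, and it is often estimated from the sampled tokens only rather than from the full vocabulary distribution.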
Building upon the Video-OPD foundation, we present Teacher-Validated Disagreement Focusing (TVDF). This is a lightweight, iterative training curriculum designed to enhance efficiency by selectively prioritizing trajectories that offer maximum informative value to the student while remaining reliable according to the teacher.
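The abstract does not specify how TVDF scores or selects trajectories. As a sketch of the stated idea only, one hypothetical reading is to keep samples on which the teacher is judged reliable and then prioritize those where the student disagrees with the teacher most. The fields `teacher_score` and `disagreement`, the threshold, and the top-k rule below are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    # Hypothetical reliability score for the teacher on this sample,
    # e.g. how well the teacher's predicted segment matches the annotation.
    teacher_score: float
    # Hypothetical student-teacher gap, e.g. mean per-token reverse KL.
    disagreement: float


def select_for_next_round(samples, min_teacher_score=0.5, keep=2):
    """Keep teacher-validated samples, then take those with the largest disagreement."""
    validated = [s for s in samples if s.teacher_score >= min_teacher_score]
    validated.sort(key=lambda s: s.disagreement, reverse=True)
    return validated[:keep]


pool = [
    Sample("a", teacher_score=0.9, disagreement=0.1),
    Sample("b", teacher_score=0.8, disagreement=0.7),
    Sample("c", teacher_score=0.2, disagreement=0.9),  # teacher unreliable: dropped
    Sample("d", teacher_score=0.6, disagreement=0.4),
]
selected = select_for_next_round(pool)
```

Under these assumptions, sample "c" is excluded despite its high disagreement because the teacher is not trusted on it, and the remaining samples are ranked by how much the student still has to learn from them.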
Our experimental findings show that Video-OPD not only surpasses GRPO in performance but also converges significantly faster and at a reduced computational expense. These results position on-policy distillation as a viable and effective alternative to traditional reinforcement learning techniques for TVG tasks.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





