arXiv

Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation

Abstract:

Reinforcement learning has gained traction as a post-training strategy for Temporal Video Grounding (TVG) because it optimizes on-policy, but current methods based on Group Relative Policy Optimization (GRPO) are held back by sparse reward signals and high computational cost. To address these issues, we introduce Video-OPD, an efficient post-training framework for TVG that builds on recent developments in on-policy distillation.

Video-OPD optimizes trajectories sampled directly from the current policy, so the training distribution stays aligned with the inference distribution. At the same time, a stronger teacher model provides dense, token-level supervision through a reverse Kullback-Leibler (KL) divergence loss. This approach preserves the on-policy property that mitigates distribution shift, while converting sparse, episode-level rewards into fine-grained, step-wise learning signals.
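To make the token-level objective concrete, the following is a minimal sketch of a reverse KL loss, KL(student || teacher), averaged over the positions of a sequence that the student itself sampled. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of the full-vocabulary form of the divergence, and the plain-Python list representation of logits are all hypothetical choices made here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl_per_token(student_logits, teacher_logits):
    """Return KL(student || teacher) for each position of a sequence.

    Assumption: both inputs are lists of per-position logit vectors,
    computed on the SAME student-sampled (on-policy) token sequence.
    """
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        # Expectation is taken under the student distribution (reverse KL).
        kl = sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s_logp, t_logp))
        kls.append(kl)
    return kls


def sequence_loss(student_logits, teacher_logits):
    """Mean per-token reverse KL: one dense signal per generated token."""
    kls = reverse_kl_per_token(student_logits, teacher_logits)
    return sum(kls) / len(kls)


# Toy example: two positions, vocabulary of size two.
student = [[2.0, 0.0], [0.0, 0.0]]
teacher = [[0.0, 2.0], [0.0, 0.0]]
per_token = reverse_kl_per_token(student, teacher)
loss = sequence_loss(student, teacher)
```

In this toy case the first position, where student and teacher disagree, contributes a positive penalty, and the second, where they agree, contributes zero. This is the sense in which the signal is dense: every token receives its own gradient target instead of sharing one episode-level reward.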

Building upon the Video-OPD foundation, we present Teacher-Validated Disagreement Focusing (TVDF). This is a lightweight, iterative training curriculum designed to enhance efficiency by selectively prioritizing trajectories that offer maximum informative value to the student while remaining reliable according to the teacher.
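The abstract does not specify how TVDF measures reliability or informativeness, so the sketch below is only one plausible reading. It assumes, hypothetically, that the teacher is validated by temporal IoU against a ground-truth segment and that student-teacher disagreement is one minus the IoU between their predicted segments. The field names and both thresholds are invented for illustration.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def select_trajectories(samples, teacher_iou_min=0.5, disagreement_min=0.3):
    """Keep samples where the teacher is reliable and the student disagrees.

    Assumptions (not from the source): each sample is a dict with
    'student_span', 'teacher_span', and 'gt_span'; thresholds are arbitrary.
    Returns the kept samples, most-disagreeing first.
    """
    kept = []
    for sample in samples:
        teacher_ok = (
            temporal_iou(sample["teacher_span"], sample["gt_span"])
            >= teacher_iou_min
        )
        disagreement = 1.0 - temporal_iou(
            sample["student_span"], sample["teacher_span"]
        )
        if teacher_ok and disagreement >= disagreement_min:
            kept.append((disagreement, sample))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [sample for _, sample in kept]


samples = [
    # Teacher correct, student far off: informative and reliable.
    {"id": "a", "student_span": (0.0, 5.0), "teacher_span": (10.0, 20.0), "gt_span": (10.0, 20.0)},
    # Teacher correct, student already agrees: little left to learn.
    {"id": "b", "student_span": (10.0, 20.0), "teacher_span": (10.0, 20.0), "gt_span": (10.0, 20.0)},
    # Student disagrees, but the teacher itself is wrong: unreliable.
    {"id": "c", "student_span": (0.0, 5.0), "teacher_span": (30.0, 40.0), "gt_span": (10.0, 20.0)},
]
selected_ids = [s["id"] for s in select_trajectories(samples)]
```

Only sample "a" passes both filters here, which matches the stated intent of spending compute on trajectories that are both teacher-reliable and informative for the student.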

Our experimental findings show that Video-OPD not only surpasses GRPO in performance but also converges significantly faster and at a reduced computational expense. These results position on-policy distillation as a viable and effective alternative to traditional reinforcement learning techniques for TVG tasks.


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
