Global News Digest

arXiv

GIRL-DETR: Gradient-Isolated Reinforcement Learning for Video Moment Retrieval

Abstract:

Video Moment Retrieval (VMR) requires precisely identifying the temporal boundaries that correspond to a natural language query. However, many existing models stall late in training and become trapped in suboptimal boundary predictions. The cause is a misalignment between the continuous surrogate losses used for training and the non-differentiable metrics used for evaluation. Reinforcement Learning (RL) post-training has proven effective at refining localization in large-scale models, but applying it directly to lightweight networks often destabilizes the delicate feature representations formed during supervised learning.
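The mismatch between a surrogate loss and the evaluation metric can be seen on a toy case. The sketch below is illustrative only: it assumes an L1 boundary loss as the surrogate (the abstract does not say which surrogate the authors use) and temporal IoU with a 0.5 hit threshold as the metric. The segment values and the helper names `temporal_iou`, `l1_boundary_loss`, and `hit_at` are hypothetical.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    if pe <= ps or ge <= gs:
        return 0.0  # treat empty or inverted segments as no overlap
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union


def l1_boundary_loss(pred, gt):
    """Assumed surrogate: sum of absolute start and end errors."""
    return abs(pred[0] - gt[0]) + abs(pred[1] - gt[1])


def hit_at(pred, gt, threshold=0.5):
    """Thresholded metric: a step function of the boundaries."""
    return temporal_iou(pred, gt) >= threshold


gt = (10.0, 12.0)
pred_a = (9.0, 13.0)   # 1 s too early at the start, 1 s too late at the end
pred_b = (11.0, 13.0)  # whole segment shifted 1 s later

for name, pred in (("A", pred_a), ("B", pred_b)):
    print(name, l1_boundary_loss(pred, gt), round(temporal_iou(pred, gt), 3),
          hit_at(pred, gt))
```

Both predictions have the same L1 loss, yet only prediction A counts as a hit at the 0.5 threshold. A model trained on the surrogate alone receives no signal to prefer one over the other.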

To address this optimization bottleneck, we introduce GIRL-DETR (Gradient-Isolated Reinforcement Learning for DETR), which marks the first integration of RL post-training into a lightweight temporal localization framework. In this architecture, input video and text features undergo early alignment via Cross-Modal Interaction (CMI) prior to entering the transformer encoder. Subsequently, a Text-Guided Gating (TGG) mechanism dynamically injects semantic priors into the queries before the transformer decoder produces candidate proposals, thereby delivering high signal-to-noise ratio inputs for temporal prediction.
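The abstract does not specify how TGG is implemented. As a rough illustration of the general idea of gated injection of text information into decoder queries, the sketch below uses a generic per-channel sigmoid gate. The function name `text_guided_gate`, the weight vectors `w_query` and `w_text`, and the additive form `q + gate * t` are all assumptions made for this example, not the authors' design.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def text_guided_gate(queries, text_vec, w_query, w_text):
    """Add a gated copy of a sentence-level text vector to every query.

    queries:  list of d-dimensional query vectors (lists of floats)
    text_vec: d-dimensional text embedding
    w_query, w_text: hypothetical learned per-channel gate weights
    """
    gated = []
    for q in queries:
        out = []
        for q_c, t_c, wq_c, wt_c in zip(q, text_vec, w_query, w_text):
            # The gate depends on both the query and the text channel,
            # so each query admits a different amount of text signal.
            gate = sigmoid(wq_c * q_c + wt_c * t_c)
            out.append(q_c + gate * t_c)
        gated.append(out)
    return gated


queries = [[1.0, 2.0]]
text_vec = [2.0, -4.0]

# With zero gate weights every gate is sigmoid(0) = 0.5.
half_open = text_guided_gate(queries, text_vec, [0.0, 0.0], [0.0, 0.0])
print(half_open)
```

In a trained model the gate weights would be learned jointly with the rest of the network. Here they are fixed by hand to keep the behavior easy to follow.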

Once supervised training converges, the backbone network is frozen to preserve the integrity of the feature manifold. The detection head then directly optimizes the non-differentiable evaluation metric, temporal Intersection over Union (tIoU), using a Three-stage Progressive Reinforcement Learning (TPRL) strategy to boost localization precision. This approach decouples state representation from metric optimization. Evaluations on the Charades-STA, QVHighlights, and TACoS datasets confirm that GIRL-DETR mitigates surrogate loss degradation and delivers significant accuracy gains with minimal parameter adjustments, offering a resilient new pathway for applying RL to lightweight VMR models.
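The idea of freezing the backbone while the head optimizes tIoU directly can be sketched with a plain policy-gradient (REINFORCE) update. This is a minimal toy under stated assumptions, not the authors' TPRL procedure: the abstract does not describe the three stages, so a single stage is shown. The Gaussian policy over (start, end), the moving-average baseline, the linear head, and all names and numeric values are hypothetical.

```python
import random


def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments; 0 if inverted."""
    (ps, pe), (gs, ge) = pred, gt
    if pe <= ps or ge <= gs:
        return 0.0
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    return inter / ((pe - ps) + (ge - gs) - inter)


def frozen_backbone(x, backbone_w):
    """Stand-in feature extractor; its weights are only read, never updated."""
    return [w * x for w in backbone_w]


def head_mean(feats, head):
    """Linear head giving the mean (start, end) of a Gaussian policy."""
    return [sum(w * f for w, f in zip(row, feats)) + b
            for row, b in zip(head["W"], head["b"])]


def reinforce_step(x, gt, backbone_w, head, baseline, rng, sigma=1.0, lr=0.05):
    feats = frozen_backbone(x, backbone_w)
    mu = head_mean(feats, head)
    action = [rng.gauss(m, sigma) for m in mu]      # sample a segment
    reward = temporal_iou((action[0], action[1]), gt)  # metric as reward
    advantage = reward - baseline
    for k in range(2):
        score = (action[k] - mu[k]) / sigma ** 2    # d log pi / d mu_k
        head["b"][k] += lr * advantage * score
        for j, f in enumerate(feats):
            head["W"][k][j] += lr * advantage * score * f
    return reward


rng = random.Random(0)
backbone_w = [0.5, -0.25]
backbone_before = list(backbone_w)
head = {"W": [[0.0, 0.0], [0.0, 0.0]], "b": [8.0, 17.0]}
gt = (10.0, 20.0)

baseline = 0.0
for _ in range(200):
    reward = reinforce_step(1.0, gt, backbone_w, head, baseline, rng)
    baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline

print("backbone unchanged:", backbone_w == backbone_before)
print("head bias after RL:", head["b"])
```

Because the reward enters only through the log-probability gradient of the head's policy, the metric never needs to be differentiated. The backbone parameters are never written to, which is the gradient isolation the paragraph describes.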


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
