GIRL-DETR: Gradient-Isolated Reinforcement Learning for Video Moment Retrieval
Abstract:
Video Moment Retrieval (VMR) requires identifying the precise temporal boundaries in a video that correspond to a natural language query. Many existing models, however, stall late in training and settle on suboptimal boundary predictions. This stagnation stems from a mismatch between the continuous surrogate losses used for training and the non-differentiable metrics used for evaluation. Reinforcement Learning (RL) post-training has proven effective at refining localization in large-scale models, but applying it directly to lightweight networks often destabilizes the feature representations learned during supervised training.
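The mismatch between a surrogate loss and the evaluation metric can be made concrete with a small example. The sketch below is illustrative only and is not taken from the paper: it uses a plain L1 boundary loss as a stand-in surrogate, and the helper names (`tiou`, `l1_loss`, `hit`) and the example spans are assumptions. It shows three predictions with the same L1 error but different temporal IoU values.

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) spans given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def l1_loss(pred, gt):
    """Stand-in surrogate: summed absolute error on the two boundaries."""
    return abs(pred[0] - gt[0]) + abs(pred[1] - gt[1])


def hit(pred, gt, threshold):
    """Recall-style metric: a hard threshold on tIoU (a step function)."""
    return tiou(pred, gt) >= threshold


gt = (2.0, 6.0)  # hypothetical ground-truth moment
predictions = {
    "shifted": (3.0, 7.0),    # moved right by one second
    "enclosing": (1.0, 7.0),  # one second too wide on each side
    "inside": (3.0, 5.0),     # one second too narrow on each side
}

for name, pred in predictions.items():
    print(name, "L1 =", l1_loss(pred, gt), "tIoU =", round(tiou(pred, gt), 3))
```

All three predictions have an L1 error of 2, yet their tIoU values differ, so a gradient step that lowers the surrogate does not necessarily raise the metric. The thresholded `hit` function is piecewise constant, which is why it provides no useful gradient on its own.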
To address this optimization bottleneck, we introduce GIRL-DETR (Gradient-Isolated Reinforcement Learning for DETR), which marks the first integration of RL post-training into a lightweight temporal localization framework. In this architecture, video and text features are aligned early through Cross-Modal Interaction (CMI) before they enter the transformer encoder. A Text-Guided Gating (TGG) mechanism then dynamically injects semantic priors into the queries before the transformer decoder produces candidate proposals, so that temporal prediction receives inputs with a high signal-to-noise ratio.
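The abstract does not specify how TGG is computed. As a rough intuition for what gated injection of text information into a query can look like, the sketch below uses a generic per-dimension sigmoid gate with a residual connection. The function name `text_guided_gate`, its parameters, and the gating formula are all assumptions made for illustration and should not be read as the paper's formulation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def text_guided_gate(query, text, w_query, w_text, bias):
    """Hypothetical gated injection of a pooled text vector into one query.

    For each dimension d:
        gate_d = sigmoid(w_query[d] * query[d] + w_text[d] * text[d] + bias[d])
        out_d  = query[d] + gate_d * text[d]
    A gate near 0 leaves the query untouched; a gate near 1 adds the full
    text component.
    """
    out = []
    for q, t, wq, wt, b in zip(query, text, w_query, w_text, bias):
        gate = sigmoid(wq * q + wt * t + b)  # value in (0, 1)
        out.append(q + gate * t)
    return out


query = [0.5, -1.0]  # one decoder query (toy 2-d example)
text = [2.0, 4.0]    # pooled sentence feature (toy values)
zeros = [0.0, 0.0]

# With zero weights and zero bias every gate is exactly 0.5.
print(text_guided_gate(query, text, zeros, zeros, zeros))
```

In a real model the gate parameters would be learned and the text vector would come from the text encoder; the point here is only that the gate controls, per dimension, how much semantic prior enters each query.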
Once supervised training converges, the backbone network is frozen to preserve the learned feature manifold. The detection head then directly optimizes the non-differentiable evaluation metric, temporal Intersection over Union (tIoU), using a Three-stage Progressive Reinforcement Learning (TPRL) strategy to improve localization precision. This design decouples state representation from metric optimization. Evaluations on the Charades-STA, QVHighlights, and TACoS datasets show that GIRL-DETR mitigates surrogate loss degradation and delivers significant accuracy gains with minimal parameter adjustments, offering a robust route for applying RL to lightweight VMR models.
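The gradient-isolation idea can be illustrated with a toy policy-gradient loop. This is a minimal sketch under strong assumptions, not the TPRL procedure, whose three stages are not described in the abstract. Here the frozen backbone is represented by a fixed feature vector, the head is a hypothetical linear layer that outputs the mean of a Gaussian policy over (center, width) in normalized time, and the reward is tIoU against a single ground-truth moment. Plain REINFORCE with a mean-reward baseline is swapped in as the RL algorithm.

```python
import random


def tiou(pred, gt):
    """Temporal IoU between two (start, end) spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def to_span(center, width):
    """Convert (center, width) to (start, end), keeping the width positive."""
    width = max(width, 1e-3)
    return (center - width / 2, center + width / 2)


def predict(head, features):
    """Linear head: returns the policy mean [center, width]."""
    return [sum(w * f for w, f in zip(row, features)) for row in head]


def train_head(features, head, gt, steps=200, samples=8,
               sigma=0.05, lr=0.01, seed=0):
    """REINFORCE on the head only; `features` are never modified."""
    rng = random.Random(seed)
    head = [row[:] for row in head]  # work on a copy of the head weights
    for _ in range(steps):
        mean = predict(head, features)
        actions = [[rng.gauss(m, sigma) for m in mean] for _ in range(samples)]
        rewards = [tiou(to_span(a[0], a[1]), gt) for a in actions]
        baseline = sum(rewards) / samples  # variance-reduction baseline
        for action, reward in zip(actions, rewards):
            advantage = reward - baseline
            for k in range(2):
                # d/d mean_k of log N(a_k; mean_k, sigma^2)
                score = (action[k] - mean[k]) / sigma ** 2
                for j, f in enumerate(features):
                    head[k][j] += lr * advantage * score * f / samples
    return head


features = [1.0, 0.5]                  # output of the frozen backbone (toy)
init_head = [[0.4, 0.2], [0.2, 0.2]]   # initial mean: center 0.5, width 0.3
gt = (0.45, 0.70)                      # ground-truth moment, normalized time

trained = train_head(features, init_head, gt)
print("tIoU before:", tiou(to_span(*predict(init_head, features)), gt))
print("tIoU after: ", tiou(to_span(*predict(trained, features)), gt))
```

Because the reward enters only through the score-function estimator, the metric never needs to be differentiated, and because the update touches only `head`, the representation produced by the backbone stays exactly as supervised training left it.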
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




