CourseTimeQA: A Lecture-Video Benchmark and a Latency-Constrained Cross-Modal Fusion Method for Timestamped QA
Abstract:
This study investigates timestamped question answering over educational lecture videos under strict single-GPU memory and latency constraints. Given a natural-language query, the system retrieves relevant, timestamped video segments and generates a grounded answer. We introduce CourseTimeQA, a dataset of 902 queries across six courses totaling 52.3 hours of content, alongside CrossFusion-RAG, a lightweight retriever designed to operate within these latency constraints. This cross-modal architecture combines frozen encoders, a learned projection layer that maps 512-dimensional visual features to 768 dimensions, and a shallow, query-agnostic cross-attention mechanism applied to ASR text and video frames. It also incorporates a temporal-consistency regularizer and a compact cross-attentive reranker.
Experimental results on CourseTimeQA demonstrate that CrossFusion-RAG outperforms a robust BLIP-2 baseline, yielding improvements of 0.08 in Mean Reciprocal Rank (MRR) and 0.10 in nDCG@10. Notably, the model maintains a median end-to-end latency of approximately 1.55 seconds on a single A100 GPU. We benchmark our approach against several closely related methods under identical hardware and indexing conditions, including zero-shot CLIP multi-frame pooling, a combination of CLIP with a cross-encoder reranker and MMR, learned late-fusion gating, text-only hybrid systems with cross-encoder reranking (and their MMR variants), caption-augmented text retrieval, and non-learned temporal smoothing. To facilitate reproducible research, we provide comprehensive training and tuning details, along with robustness analyses regarding ASR noise (categorized by WER quartiles) and diagnostics for temporal localization.
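For readers unfamiliar with the two ranking metrics reported above, the following is a minimal sketch of how MRR and nDCG@10 are conventionally computed. It is not the authors' evaluation code: the function names, the use of string segment identifiers, and the choice of graded gains with a log2 discount are assumptions made for illustration, and the abstract does not state how a retrieved segment is matched against a ground-truth timestamp.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, seg_id in enumerate(ranked_ids, start=1):
        if seg_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over queries.

    runs: list of (ranked_ids, relevant_ids) pairs, one pair per query.
    """
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked_ids, gains, k=10):
    """Normalized discounted cumulative gain at cutoff k.

    gains: dict mapping segment id -> relevance grade (assumed convention);
    ids missing from the dict count as irrelevant (gain 0).
    """
    dcg = sum(
        gains.get(seg_id, 0.0) / math.log2(i + 1)
        for i, seg_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical toy data: three queries with made-up segment ids.
toy_runs = [
    (["a", "b", "c"], {"b"}),  # first relevant hit at rank 2
    (["x", "y"], {"x"}),       # first relevant hit at rank 1
    (["p"], {"q"}),            # relevant segment never retrieved
]
toy_mrr = mean_reciprocal_rank(toy_runs)
toy_ndcg = ndcg_at_k(["b", "a"], {"a": 1.0}, k=10)
```

Under these conventions, a reported gain of 0.08 in MRR or 0.10 in nDCG@10 is an absolute difference between two per-query averages on a 0 to 1 scale.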
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





