TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation
Abstract: Accurately understanding events across multiple videos requires systems capable of identifying and attributing relevant evidence dispersed throughout extensive, varied video collections. Current large vision-language models (LVLMs) frequently fall short in this domain, primarily because they rapidly deplete their context limits and face difficulties in pinpointing critical segments. Consequently, they often overlook dense informational elements like subtitles, broadcast graphics, and scoreboards. To address these challenges, we present TRACE, a framework that employs an evidence grounding-guided approach based on a "ground-before-reasoning" strategy for multi-video event analysis. TRACE initially constructs a structured, text-searchable timeline for each video by leveraging object detection and OCR. A text-only large language model (LLM) then performs query-aware evidence localization to identify relevant moments before any visual reasoning occurs. These retrieved frames, accompanied by their grounding summaries, guide the subsequent generation of claims and the consolidation of citations across videos using LVLMs. Evaluations on WikiVideo and the MAGMaR 2026 benchmark reveal that structured grounding significantly enhances both attribution fidelity and factual completeness. Specifically, on the MAGMaR validation set, TRACE increases the macro-average MiRAGE F1 score from 0.705 to 0.811 when compared to an unguided Qwen3-VL-30B baseline, with citation recall improving notably from 0.440 to 0.628. This method also achieves state-of-the-art performance on the official MAGMaR 2026 leaderboard. The source code is available at https://github.com/pengyu965/TRACE.
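The "ground-before-reasoning" flow described in the abstract can be illustrated with a minimal sketch. This is not the TRACE implementation: the `TimelineEntry` structure, the `localize_evidence` function, and the sample data are all hypothetical, and a simple term-overlap score stands in for the text-only LLM that the paper uses for query-aware evidence localization. The sketch only shows the idea of searching a text timeline first and passing the selected moments on to a visual model afterwards.

```python
from dataclasses import dataclass


@dataclass
class TimelineEntry:
    """One text-searchable moment in a video (hypothetical structure)."""
    video_id: str
    timestamp_s: float
    objects: list[str]  # labels an object detector might emit (assumed)
    ocr_text: str       # on-screen text an OCR pass might emit (assumed)

    def as_text(self) -> str:
        # Flatten the entry into a single searchable string.
        labels = ", ".join(self.objects)
        return (f"[{self.video_id} @ {self.timestamp_s:.1f}s] "
                f"objects: {labels}; text: {self.ocr_text}")


def localize_evidence(query: str, timeline: list[TimelineEntry],
                      top_k: int = 2) -> list[TimelineEntry]:
    """Rank entries by query-term overlap.

    Assumption: this keyword score is only a stand-in for the text-only
    LLM described in the abstract; it is not the paper's method.
    """
    terms = set(query.lower().split())

    def score(entry: TimelineEntry) -> int:
        text = entry.as_text().lower()
        for ch in ",;:":
            text = text.replace(ch, " ")
        return len(terms & set(text.split()))

    scored = [(score(entry), entry) for entry in timeline]
    # Keep only entries that match at least one term, best matches first.
    ranked = sorted((pair for pair in scored if pair[0] > 0),
                    key=lambda pair: -pair[0])
    return [entry for _, entry in ranked[:top_k]]


# Illustrative data only; not taken from WikiVideo or MAGMaR.
timeline = [
    TimelineEntry("vid_a", 12.0, ["person", "podium"], "Mayor announces evacuation"),
    TimelineEntry("vid_a", 47.5, ["car"], ""),
    TimelineEntry("vid_b", 3.0, ["scoreboard"], "HOME 2 AWAY 1"),
]

hits = localize_evidence("mayor evacuation", timeline)
for hit in hits:
    # In a full pipeline, the frame at this timestamp and its grounding
    # summary would be handed to an LVLM for claim generation.
    print(hit.as_text())
```

Because localization runs over text alone, only the selected frames would need to reach the vision-language model, which is how the abstract's approach avoids exhausting the model's context on long video collections.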
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




