A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs
Title: Assessing Positional Bias in Multi-Video Summarization via MLLMs: A Systematic Study
Abstract: Multimodal Large Language Models (MLLMs) are increasingly used for video comprehension, but their performance and reliability on multiple video inputs are not yet well characterized. This study investigates positional bias in multi-video summarization: the fidelity of an individual video's summary can vary with that video's position in the input sequence, even though its content is unchanged. To examine this effect, we build a benchmark from ActivityNet and news videos spanning the Cooking, Domestic, Leisure, and News categories, with two-video and four-video configurations. We evaluate nine MLLMs, both open-source and proprietary, using three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Positional effects vary substantially across domains and models. In particular, a small signed directional bias can coexist with marked underperformance on videos placed in middle positions, and expanding the visual or generation budget does not consistently eliminate this imbalance. We also explore prompt-level mitigation strategies. Together, these results show that multi-video summarization is highly sensitive to input ordering and protocol, underscoring the need for more robust, order-invariant multimodal systems.
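The claim that a small signed directional bias can coexist with a large middle-position deficit can be illustrated with a minimal Python sketch. The abstract does not define DPB or MEG, so the two formulas below are assumptions chosen for illustration only (first-half mean minus second-half mean, and edge mean minus interior mean). The function names and the coverage numbers are likewise hypothetical and are not results from the study.

```python
from statistics import mean


def position_means(scores_by_position):
    """Average the per-trial coverage scores observed at each input position."""
    return [mean(scores) for scores in scores_by_position]


def directional_bias(pos_means):
    """ASSUMED definition: mean of the first half minus mean of the second half.

    Positive values favour early positions, negative values favour late ones.
    """
    if len(pos_means) < 2:
        raise ValueError("need at least two positions")
    half = len(pos_means) // 2
    return mean(pos_means[:half]) - mean(pos_means[-half:])


def middle_edge_gap(pos_means):
    """ASSUMED definition: mean of the two edge positions minus mean of the interior."""
    if len(pos_means) < 3:
        raise ValueError("need at least three positions to have a middle")
    edges = [pos_means[0], pos_means[-1]]
    middle = pos_means[1:-1]
    return mean(edges) - mean(middle)


# Made-up coverage scores for a four-video setting: one inner list per
# input position, one entry per trial in which a video occupied that slot.
scores = [
    [0.80, 0.82, 0.78],  # position 1 (edge)
    [0.50, 0.52, 0.48],  # position 2 (middle)
    [0.50, 0.48, 0.52],  # position 3 (middle)
    [0.80, 0.78, 0.82],  # position 4 (edge)
]

means = position_means(scores)
dpb = directional_bias(means)
meg = middle_edge_gap(means)
print(f"per-position coverage: {[round(m, 2) for m in means]}")
print(f"directional bias: {dpb:+.2f}, middle-edge gap: {meg:+.2f}")
```

With these symmetric numbers the signed directional measure is near zero while the edge-versus-middle gap is about 0.3. A signed early-versus-late statistic alone could therefore miss a middle-position deficit, which is consistent with reporting both kinds of metric.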
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC



