Answer Self-Consistency with Margin-Triggered Question Re-Arbitration for the CVPR 2026 VidLLMs Challenge
Title: Leveraging Answer Self-Consistency and Margin-Triggered Re-Arbitration for the CVPR 2026 VidLLMs Challenge
This report outlines our approach to Track 2 of the CVPR 2026 VidLLMs Challenge, a competition focused on assessing visual relational reasoning in video content. The primary objective for participants is to enable models to deduce relationships that are not immediately or explicitly apparent within the visual data. To address this, we introduce Answer Self-Consistency with Margin-Triggered Question Re-Arbitration (ASC-MQRA), a novel training-free test-time reasoning framework grounded in a multimodal reasoning model.
The core component, ASC, performs multiple stochastic runs of video question answering and aggregates the resulting answer choices through answer-level self-consistency. It significantly outperforms standard single-pass inference and forms the basis of our final test submission.
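As a rough illustration of answer-level self-consistency, the following minimal sketch aggregates the answer choices from several stochastic runs by majority vote. The function name `aggregate_answers`, the letter-coded choices, and the first-seen tie-breaking rule are assumptions made for this example; the abstract does not specify how ties are resolved or how many runs are used.

```python
from collections import Counter


def aggregate_answers(sampled_answers):
    """Majority vote over answer choices from repeated stochastic runs.

    `sampled_answers` is a list of answer choices (e.g. "A", "B", ...),
    one per run. Returns the winning choice and the full vote counts.
    Hypothetical helper: the tie-breaking rule here (first-seen choice
    wins) is an assumption, not the authors' stated method.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    # most_common() sorts by count; equal counts keep first-seen order.
    winner, _ = counts.most_common(1)[0]
    return winner, counts


# Example: five stochastic runs on one multiple-choice question.
winner, counts = aggregate_answers(["B", "A", "B", "C", "B"])
print(winner, dict(counts))
```

The vote counts are returned alongside the winner so that a downstream step can inspect how decisive the vote was.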
We also investigate MQRA, a conditional module that re-arbitrates questions whose initial vote is uncertain, as indicated by a low-margin vote distribution. Our analysis shows that low-margin examples frequently retain the ground-truth answer among their top candidates. This motivates MQRA to narrow the candidate set and prompt the model to re-examine only the video segments associated with the retained options. On the validation set, MQRA further improved over ASC, suggesting that a low-margin vote distribution is an effective uncertainty signal. On the test set, however, it led to a slight performance decline, which suggests that re-arbitration is sensitive to the size and category distribution of the subset that triggers it.
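The margin trigger can be sketched as follows. This is only an illustration under stated assumptions: the margin is taken here to be the difference between the top two vote shares, and `margin_threshold` and `top_k` are hypothetical parameters with placeholder values. The abstract does not give the exact margin definition, threshold, or number of retained candidates.

```python
from collections import Counter


def vote_margin(sampled_answers):
    """Gap between the top two vote shares, in [0, 1].

    Assumed definition for illustration; the source does not state
    how the margin is computed.
    """
    ranked = Counter(sampled_answers).most_common()
    top = ranked[0][1]
    second = ranked[1][1] if len(ranked) > 1 else 0
    return (top - second) / len(sampled_answers)


def candidates_for_rearbitration(sampled_answers, margin_threshold=0.2, top_k=2):
    """Return the retained candidate choices if the vote is low-margin.

    Returns None when the margin is at or above the threshold, meaning
    the majority answer would be kept as-is. `margin_threshold` and
    `top_k` are hypothetical values, not taken from the source.
    """
    if vote_margin(sampled_answers) >= margin_threshold:
        return None
    ranked = Counter(sampled_answers).most_common()
    return [choice for choice, _ in ranked[:top_k]]


# A split vote triggers re-arbitration over the two leading choices.
print(candidates_for_rearbitration(["A", "B", "A", "B", "C"]))
# A decisive vote does not.
print(candidates_for_rearbitration(["A", "A", "A", "A", "B"]))
```

In this sketch only the retained candidates would be passed to a second prompt; how the associated video segments are selected is outside what the abstract describes.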
Consequently, our final test submission uses ASC alone, without the re-arbitration step. This configuration achieved an average accuracy of 72.73 and a category-wise macro-average accuracy of 78.34 on the validation set. On the test set, it achieved an average accuracy of 81.16 and a category-wise macro-average accuracy of 80.91. This report describes our prompting methodology, implementation details, ablation studies, and diagnostic analyses. The source code is available at https://github.com/data-analytics-labo/ASC-MQRA.
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC




