CASTLE2026 Team WDL Technical Report

June 2, 2026 · Zhengyang Li, Zhenglin Du, Yi Wen, Fang Liu, Shuo Li, Xu Liu

Original: arXiv:2606.00712v1 (Announce Type: new)

Abstract:

The CASTLE Challenge @ EgoVis 2026 evaluates long-form egocentric video question answering over 600+ hours of multi-perspective recordings. Each four-choice question requires evidence from videos, transcripts, auxiliary photos, people, days, rooms, and temporal context. We propose an evidence-aware multimodal reasoning pipeline based on Qwen. Our system parses question hints, retrieves ASR chunks, attaches auxiliary images, samples candidate video frames, and routes questions into static visual, speech/text, temporal, and mixed types with specialized prompts. Multiple inference passes are aggregated by confidence-weighted voting and converted into the official Codabench format. In ablation, LoRA improves the score from 0.21 to 0.50, and more sampled frames further raise it to 0.58. Our final system ranks first in the CASTLE Challenge @ EgoVis 2026.

Rewritten:

Abstract:

The CASTLE Challenge @ EgoVis 2026 evaluates long-form egocentric video question answering over more than 600 hours of multi-perspective recordings. Answering each four-choice question requires combining evidence from videos, transcripts, auxiliary photos, people, days, rooms, and temporal context. To address this, we introduce an evidence-aware multimodal reasoning pipeline built on Qwen. The pipeline parses hints in the question, retrieves relevant Automatic Speech Recognition (ASR) chunks, attaches auxiliary images, and samples candidate video frames. It then routes each question to one of four types (static visual, speech/text, temporal, or mixed), each with a specialized prompt. Predictions from multiple inference passes are aggregated by confidence-weighted voting and converted into the official Codabench format. In ablation, Low-Rank Adaptation (LoRA) improves the score from 0.21 to 0.50, and sampling more frames raises it further to 0.58. Our final system ranks first in the CASTLE Challenge @ EgoVis 2026.
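
The abstract does not specify how confidence-weighted voting is computed, so the following is only a minimal sketch of one common form of it, not the authors' implementation. It assumes each inference pass yields a predicted option letter plus a non-negative confidence score; the function name `confidence_weighted_vote`, the `OPTIONS` tuple, and the tie-breaking rule are all hypothetical choices made for illustration.

```python
from typing import Iterable, Tuple

# Assumption: the four answer choices are labeled A-D.
OPTIONS = ("A", "B", "C", "D")


def confidence_weighted_vote(passes: Iterable[Tuple[str, float]]) -> str:
    """Return the option whose summed confidence across passes is largest.

    Each element of `passes` is (predicted_option, confidence) from one
    inference pass. Ties are broken by option order (A before B, and so on).
    """
    totals = {}
    for option, confidence in passes:
        if option not in OPTIONS:
            continue  # ignore answers that could not be parsed into A-D
        # Clamp negative scores to zero so a pass can never subtract votes.
        totals[option] = totals.get(option, 0.0) + max(confidence, 0.0)
    if not totals:
        raise ValueError("no valid predictions to aggregate")
    # max() returns the first maximal element, giving the A-before-B tie-break.
    return max(OPTIONS, key=lambda o: totals.get(o, 0.0))


if __name__ == "__main__":
    # One confident pass for A outweighs two low-confidence passes for B,
    # whereas an unweighted majority vote would have picked B.
    print(confidence_weighted_vote([("A", 0.95), ("B", 0.4), ("B", 0.4)]))
```

Compared with a plain majority vote, this variant lets a single high-confidence pass override several uncertain ones, which is one plausible reason to weight votes when passes use different frame samples or prompts.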
