CASTLE2026 Team WDL Technical Report

Original: arXiv:2606.00712v1

Abstract:

The CASTLE Challenge @ EgoVis 2026 evaluates long-form egocentric video question answering over 600+ hours of multi-perspective recordings. Each four-choice question requires evidence from videos, transcripts, auxiliary photos, people, days, rooms, and temporal context. We propose an evidence-aware multimodal reasoning pipeline based on Qwen. Our system parses question hints, retrieves ASR chunks, attaches auxiliary images, samples candidate video frames, and routes questions into static visual, speech/text, temporal, and mixed types with specialized prompts. Multiple inference passes are aggregated by confidence-weighted voting and converted into the official Codabench format. In ablation, LoRA improves the score from 0.21 to 0.50, and more sampled frames further raise it to 0.58. Our final system ranks first in the CASTLE Challenge @ EgoVis 2026.

Rewritten:

Abstract:

The CASTLE Challenge @ EgoVis 2026 evaluates long-form egocentric video question answering over more than 600 hours of multi-perspective recordings. Each four-choice question requires combining evidence from several sources: video, transcripts, auxiliary photos, people, days, rooms, and temporal context. We introduce an evidence-aware multimodal reasoning pipeline built on Qwen. The pipeline parses hints in the question, retrieves relevant Automatic Speech Recognition (ASR) chunks, attaches auxiliary images, and samples candidate video frames. It then routes each question into one of four types (static visual, speech/text, temporal, or mixed) and applies a specialized prompt for that type. Results from multiple inference passes are aggregated by confidence-weighted voting and converted into the official Codabench format. In ablation, Low-Rank Adaptation (LoRA) improves the score from 0.21 to 0.50, and sampling more frames raises it further to 0.58. Our final system ranks first in the CASTLE Challenge @ EgoVis 2026.
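
The abstract names confidence-weighted voting as the aggregation step but does not say how confidences are obtained or how ties are resolved. The following is a minimal sketch of the general technique, not the authors' implementation. It assumes each inference pass yields one answer letter and a scalar confidence; the function name `confidence_weighted_vote`, the tuple layout, and the alphabetical tie-break are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def confidence_weighted_vote(passes):
    """Pick the answer with the largest summed confidence.

    `passes` is an iterable of (choice, confidence) pairs, one per
    inference pass. This layout is an assumption for illustration.
    """
    totals = defaultdict(float)
    for choice, confidence in passes:
        totals[choice] += confidence
    if not totals:
        raise ValueError("at least one inference pass is required")
    # Highest total wins; ties fall back to alphabetical order so the
    # result is deterministic (an assumed rule, not from the report).
    return min(totals, key=lambda c: (-totals[c], c))


# Made-up example: "B" has the single most confident pass, but "A"
# accumulates more total confidence across two passes.
example_passes = [("A", 0.55), ("B", 0.90), ("A", 0.50), ("C", 0.20)]
winner = confidence_weighted_vote(example_passes)
```

In this made-up example, two moderately confident passes that agree on "A" outweigh one highly confident pass for "B", which is the behavior that separates weighted voting from simply taking the most confident single pass.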


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
