arXiv

3rd Place at CVPR 2026 CASTLE Challenge: Agentic Multi-View Long-Context Video Understanding via Hierarchical Knowledge Graph Retrieval

June 2, 2026 · Raghad Albusayes, Munirah Alyahya

Title: Third-Place Finish at CVPR 2026 CASTLE Challenge: Hierarchical Knowledge Graph Retrieval Enables Agentic Multi-View Long-Context Video Understanding

Abstract: This study details the approach that earned our team third place in the global CASTLE 2026 Challenge, held at the EgoVis Workshop at CVPR 2026. The challenge required participants to answer complex questions spanning visual, spatiotemporal, and verbal domains (including visual counting, action localization, multi-view tracking, and speaker temporal reasoning) over large-scale multimodal video streams. The challenge dataset comprises more than 600 hours of synchronized footage from 15 distinct ego and exo camera sources.

To handle the extreme scale and long-context demands of this setting, we developed a training-free agentic framework for long-form video understanding. The framework rests on two architectural components: first, a Video Knowledge Graph that maps static and dynamic entities, their temporal relationships, and intersecting events to support multi-hop relational reasoning; and second, an adaptive agentic workflow that resolves complex queries through hierarchical retrieval and indexing. Empirical evaluations indicate that the framework achieves high zero-shot reasoning accuracy on long-context multi-view streams. The source code will be made publicly available at https://github.com/RaghadKhaled/CASTLE-Challenge-Framework.
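
To make the two components concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a toy graph whose nodes are segments, events, and entities, joined by labeled relations, and it substitutes simple keyword overlap for whatever retrieval scoring the actual framework uses. All names here (`VideoKnowledgeGraph`, `hierarchical_retrieve`, the node kinds, the relation labels, and the demo data) are hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class VideoKnowledgeGraph:
    """Toy graph: nodes are segments, events, or entities; edges are labeled relations."""
    nodes: dict = field(default_factory=dict)  # node_id -> attribute dict
    edges: dict = field(default_factory=lambda: defaultdict(list))  # node_id -> [(relation, node_id)]

    def add_node(self, node_id, kind, **attrs):
        self.nodes[node_id] = {"kind": kind, **attrs}

    def add_edge(self, src, relation, dst):
        # Store an inverse edge too, so traversal can start from either end.
        self.edges[src].append((relation, dst))
        self.edges[dst].append((f"inv_{relation}", src))

    def multi_hop(self, start, max_hops):
        """Breadth-first search: return {node_id: hop_distance} within max_hops."""
        seen = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if seen[current] == max_hops:
                continue  # do not expand past the hop limit
            for _, neighbor in self.edges[current]:
                if neighbor not in seen:
                    seen[neighbor] = seen[current] + 1
                    queue.append(neighbor)
        return seen


def hierarchical_retrieve(graph, query_terms, top_segments=1):
    """Coarse-to-fine lookup: rank segments first, then keep matching events inside them."""
    terms = {t.lower() for t in query_terms}

    def overlap(node_id):
        # Assumed scoring: number of query terms found in the node summary.
        words = graph.nodes[node_id].get("summary", "").lower().split()
        return len(terms & set(words))

    segments = [n for n, a in graph.nodes.items() if a["kind"] == "segment"]
    ranked = sorted(segments, key=overlap, reverse=True)[:top_segments]
    hits = []
    for seg in ranked:
        for relation, node_id in graph.edges[seg]:
            if relation == "contains" and overlap(node_id) > 0:
                hits.append(node_id)
    # Return events in temporal order (start time in seconds, an assumed attribute).
    return sorted(hits, key=lambda n: graph.nodes[n]["start"])


def build_demo_graph():
    """Hypothetical data: two coarse segments, three events, three entities."""
    g = VideoKnowledgeGraph()
    g.add_node("seg_morning", "segment", summary="kitchen breakfast cooking")
    g.add_node("seg_evening", "segment", summary="living room board game")
    g.add_node("ev_pour", "event", summary="alice pours coffee in kitchen", start=120, camera="ego_1")
    g.add_node("ev_cook", "event", summary="bob starts cooking eggs", start=300, camera="exo_2")
    g.add_node("ev_game", "event", summary="alice opens board game", start=36000, camera="exo_1")
    for entity in ("alice", "bob", "mug"):
        g.add_node(entity, "entity")
    g.add_edge("seg_morning", "contains", "ev_pour")
    g.add_edge("seg_morning", "contains", "ev_cook")
    g.add_edge("seg_evening", "contains", "ev_game")
    g.add_edge("alice", "participates_in", "ev_pour")
    g.add_edge("mug", "used_in", "ev_pour")
    g.add_edge("bob", "participates_in", "ev_cook")
    g.add_edge("alice", "participates_in", "ev_game")
    g.add_edge("ev_pour", "before", "ev_cook")
    return g


if __name__ == "__main__":
    graph = build_demo_graph()
    print(hierarchical_retrieve(graph, ["kitchen", "cooking"]))
    print(graph.multi_hop("mug", max_hops=2))
```

In this sketch, the coarse step narrows the search to one segment before any event is inspected, which is the general motivation for hierarchical indexing over hundreds of hours of footage. The multi-hop traversal then links an object to the people and later events connected to it, for example reaching `alice` from `mug` in two hops through the shared pouring event.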

Source: arXiv