EGOSTREAM: A Diagnostic Benchmark for Streaming Episodic Memory in Egocentric Vision
Abstract:
While continuous episodic memory is essential for autonomous agents navigating dynamic, real-world settings, existing streaming video benchmarks offer insufficient mechanisms to diagnose the longevity and nature of model recall. To address this limitation, we present Egostream, a specialized diagnostic benchmark designed to evaluate streaming episodic memory within the domain of egocentric vision.
Egostream comprises 2,250 meticulously curated questions structured across seven distinct cognitive dimensions: detail, spatial, temporal, event, social, causal, and prospective memory. Central to our methodology is the Answer Validity Window (AVW), a metric that defines the specific temporal interval during which an answer remains accurate as the observed scene progresses. This innovation allows us to scale the evaluation dataset to 8,528 recall-conditioned tests, facilitating controlled assessments ranging from immediate to ultra-long-term recall. Crucially, this approach disentangles genuine model forgetting from natural changes in the world state.
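The way an Answer Validity Window can turn one question into several recall-conditioned tests can be illustrated with a minimal sketch. This is not the benchmark's actual data format or generation code: the `Question` fields, the `RECALL_DELAYS` buckets and their durations, and the `expand_tests` helper are all hypothetical names chosen for illustration. The idea is that a question is asked at several delays after its evidence appears, and a delay is kept only if the query time still falls inside the window in which the answer is correct.

```python
from dataclasses import dataclass


@dataclass
class Question:
    qid: str
    evidence_time: float  # seconds into the stream when the evidence is observed
    valid_until: float    # end of the (hypothetical) answer validity window


# Hypothetical recall-delay buckets in seconds; the real benchmark's ranges may differ.
RECALL_DELAYS = {
    "immediate": 0.0,
    "short": 60.0,
    "long": 600.0,
    "ultra_long": 3600.0,
}


def expand_tests(question, stream_length):
    """Return (qid, bucket, query_time) for every delay that stays inside the window."""
    tests = []
    latest_valid = min(question.valid_until, stream_length)
    for bucket, delay in RECALL_DELAYS.items():
        query_time = question.evidence_time + delay
        # Outside the window the world state has changed, so a wrong answer
        # there would not indicate forgetting; such tests are skipped.
        if query_time <= latest_valid:
            tests.append((question.qid, bucket, query_time))
    return tests


q = Question(qid="q1", evidence_time=100.0, valid_until=800.0)
for test in expand_tests(q, stream_length=5000.0):
    print(test)
```

In this toy case the ultra-long bucket is dropped because the answer stops being valid before the query time, which is the kind of separation between forgetting and world-state change that the AVW is meant to provide.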
We establish baselines with a unified streaming Multimodal Large Language Model (MLLM) framework that evaluates several state-of-the-art memory management strategies: sliding windows, attention sinks, KV-cache pruning and merging, and offloading. Our experiments, all run on the same Qwen3-VL backbone, show that similar aggregate accuracy scores can obscure markedly different memory profiles. For example, token pruning preserves fine-grained details and temporal structure far more effectively than token merging, whereas quantized offloading proves essential for rescuing ultra-long-term recall.
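To make the strategy families named above concrete, the following toy sketch shows three cache-reduction policies on plain Python lists. It is an illustration under simplifying assumptions, not the paper's framework: real systems operate on per-layer key/value tensors, and the function names, the scalar "scores", and the pairwise-averaging merge rule are all hypothetical stand-ins.

```python
def evict_sink_window(positions, num_sinks, window):
    """Keep the first `num_sinks` positions plus the most recent `window` positions."""
    if len(positions) <= num_sinks + window:
        return list(positions)
    recent_start = len(positions) - window
    return list(positions[:num_sinks]) + list(positions[recent_start:])


def prune_by_score(scores, budget):
    """Keep the indices of the `budget` highest-scoring tokens, in stream order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def merge_adjacent(values):
    """Halve the cache by averaging neighbouring entries (a toy merge rule)."""
    merged = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values) - 1, 2)]
    if len(values) % 2:
        merged.append(values[-1])  # an unpaired last entry is kept unchanged
    return merged


positions = list(range(10))
print(evict_sink_window(positions, num_sinks=2, window=3))
print(prune_by_score([0.1, 0.9, 0.3, 0.8], budget=2))
print(merge_adjacent([1.0, 3.0, 5.0, 7.0]))
```

Pruning keeps a subset of the original entries untouched, while merging replaces entries with blends of their neighbours. That difference is one plausible intuition for why the two can yield different memory profiles at similar cache budgets, though the sketch itself does not demonstrate the reported results.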
Despite these nuances, all tested mechanisms fall short of real-time requirements, taking more than 1 second to process each frame. Furthermore, the highest-performing methods plateau at approximately 45% accuracy, highlighting significant deficiencies in current architectural designs. Egostream serves as a diagnostic testbed for identifying and closing these gaps.
Project website, news, and updates: https://saroo25.github.io/Egostream/
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




