arXiv

EGOSTREAM: A Diagnostic Benchmark for Streaming Episodic Memory in Egocentric Vision

Abstract:

Continuous episodic memory is essential for autonomous agents operating in dynamic, real-world settings, yet existing streaming video benchmarks offer few ways to diagnose how long a model retains information or what kind of information it retains. To address this limitation, we present Egostream, a diagnostic benchmark for evaluating streaming episodic memory in egocentric vision.

Egostream comprises 2,250 meticulously curated questions structured across seven distinct cognitive dimensions: detail, spatial, temporal, event, social, causal, and prospective memory. Central to our methodology is the Answer Validity Window (AVW), a metric that defines the specific temporal interval during which an answer remains accurate as the observed scene progresses. This innovation allows us to scale the evaluation dataset to 8,528 recall-conditioned tests, facilitating controlled assessments ranging from immediate to ultra-long-term recall. Crucially, this approach disentangles genuine model forgetting from natural changes in the world state.
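The role of the Answer Validity Window can be illustrated with a minimal sketch. It is not the benchmark's actual data format or expansion procedure: the `Question` fields, the `expand_to_recall_tests` helper, and the delay values are hypothetical names and numbers chosen for illustration. The sketch assumes that each question is re-asked at several recall delays and that a test is kept only if its query time still falls inside the window, so that a wrong answer reflects forgetting rather than a changed scene.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical question record; field names are assumptions."""
    qid: str
    evidence_time: float  # seconds into the stream where the answer is observed
    avw_end: float        # last second at which the answer is still true


def expand_to_recall_tests(question, delays):
    """Return (qid, query_time, delay) for each delay whose query time
    stays inside the question's Answer Validity Window."""
    tests = []
    for delay in delays:
        query_time = question.evidence_time + delay
        if query_time <= question.avw_end:
            tests.append((question.qid, query_time, delay))
    return tests


# Hypothetical recall delays in seconds, from immediate to long-term.
DELAYS = (0.0, 60.0, 600.0, 3600.0)

q1 = Question("q1", evidence_time=120.0, avw_end=900.0)   # answer expires early
q2 = Question("q2", evidence_time=30.0, avw_end=4000.0)   # answer stays valid

tests_q1 = expand_to_recall_tests(q1, DELAYS)
tests_q2 = expand_to_recall_tests(q2, DELAYS)
```

Under these assumed values, `q1` yields three tests because the one-hour delay falls outside its window, while `q2` yields four. Filtering of this kind is one way a fixed set of questions could expand into a larger set of recall-conditioned tests.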

We establish baselines using a unified streaming Multimodal Large Language Model (MLLM) framework that implements several state-of-the-art memory management strategies: sliding windows, attention sinks, KV-cache pruning and merging, and offloading. Our experiments, all conducted on the same Qwen3-VL backbone, show that similar aggregate accuracy scores can hide markedly different memory profiles. For example, token pruning preserves fine-grained detail and temporal structure far better than token merging, whereas quantized offloading proves essential for recovering ultra-long-term recall.
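To make two of the listed strategies concrete, the following is a minimal sketch of a sliding-window retention policy combined with attention sinks. It is a generic illustration of the idea (keep the first few token positions plus the most recent ones) and not the framework's implementation; the function name `keep_indices` and its parameters are hypothetical.

```python
def keep_indices(num_tokens, num_sink, window):
    """Return the cache positions retained by a sliding window with
    attention sinks: the first `num_sink` positions plus the most
    recent `window` positions. Everything in between is evicted."""
    if num_sink < 0 or window < 0:
        raise ValueError("num_sink and window must be non-negative")
    if num_tokens <= num_sink + window:
        # The cache still fits within the budget; nothing is evicted.
        return list(range(num_tokens))
    sinks = list(range(num_sink))
    recent = list(range(num_tokens - window, num_tokens))
    return sinks + recent


# With 10 cached tokens, 2 sink positions, and a window of 3,
# positions 2 through 6 are dropped.
kept = keep_indices(num_tokens=10, num_sink=2, window=3)
```

Setting `num_sink=0` reduces this to a plain sliding window. A policy of this shape bounds memory, but it discards all middle context, which is the kind of trade-off a recall-delay benchmark is positioned to expose.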

Despite these differences, none of the tested mechanisms meets real-time requirements: each takes more than 1 second to process a frame. Furthermore, the highest-performing methods plateau at approximately 45% accuracy, highlighting significant deficiencies in current architectural designs. Egostream provides a diagnostic testbed for identifying and closing these gaps.

Project website, news, and updates: https://saroo25.github.io/Egostream/


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
