SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory
Title: SuperMemory-VQA: A Benchmark for Long-Horizon Egocentric Visual Question Answering
Abstract:
AI-enabled eyewear offers a promising foundation for deploying artificial intelligence as a personalized memory aid. To deliver genuine utility, such systems must go beyond analyzing brief video segments and instead bridge the memory gaps that arise in practical, personal, or social contexts over extended periods of egocentric footage. Currently, most egocentric datasets prioritize action recognition or generic question answering derived from short clips, and therefore assess perceptual skills rather than the complex memory requirements of humans. To fill this void, we present SuperMemory-VQA, a new benchmark designed to evaluate AI assistants on practical, long-horizon memory challenges.
The dataset comprises 52.9 hours of daily activities captured via AI glasses, featuring synchronized data streams that include RGB video, audio transcriptions, eye-tracking metrics, IMU readings, and SLAM trajectories. Utilizing a rigorous, human-verified annotation process, we developed 4,853 grounded question-answer pairs. These items cover a diverse range of memory types, including object and location retention, intent and visual scene recall, timeline reconstruction, conversational history, and in-context retrieval. To assess resilience against hallucinations, every question is formatted as a multiple-choice item that includes a distinct "unanswerable" option.
Our benchmarking of state-of-the-art agentic frameworks and large language model backbones indicates that current systems remain far from reliable on real-world memory tasks. This gap underscores the need for novel architectures grounded in AI memory that can restrict their responses to situations where sufficient evidence exists. Furthermore, feedback from a participant survey confirms that the benchmark questions are realistic, useful, and well aligned with the memory demands of everyday life.
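As an illustration of the multiple-choice format with an "unanswerable" option described above, the following minimal Python sketch scores predictions separately on answerable and unanswerable items, so that a system that guesses when evidence is missing is visible in the report. The record layout (`QAItem`), the `UNANSWERABLE` label, and the `score` function are hypothetical names chosen for this example; they are not the benchmark's released schema or its official metric.

```python
from dataclasses import dataclass

# Hypothetical label for the abstain option; the benchmark's actual wording may differ.
UNANSWERABLE = "unanswerable"


@dataclass
class QAItem:
    """One multiple-choice question (assumed layout, for illustration only)."""
    question: str
    options: list[str]  # candidate answers, including the UNANSWERABLE option
    answer: str         # gold option text


def score(items, predictions):
    """Return accuracy overall, on answerable items, and on unanswerable items.

    A subset with no items is reported as None rather than 0.0.
    """
    # Each bucket holds [number correct, number seen].
    buckets = {"answerable": [0, 0], "unanswerable": [0, 0]}
    for item, pred in zip(items, predictions):
        key = "unanswerable" if item.answer == UNANSWERABLE else "answerable"
        buckets[key][0] += int(pred == item.answer)
        buckets[key][1] += 1

    correct = sum(c for c, _ in buckets.values())
    total = sum(n for _, n in buckets.values())
    report = {k: (c / n if n else None) for k, (c, n) in buckets.items()}
    report["overall"] = correct / total if total else None
    return report
```

Reporting the two subsets separately is a design choice of this sketch: a single overall accuracy would hide whether errors come from failed recall or from answering questions that the footage cannot support.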
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





