arXiv

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

June 3, 2026 · Yifei Li, Pengyiang Liu, Yuhang Zang, Zhongyue Shi, Qi Fu, Hongye Hao, Jiwen Lu

Abstract

Robotic systems, autonomous driving platforms, and augmented reality applications rely on multimodal agents that can reason about spatial layouts and locations. These agents must process continuous egocentric video streams and often need contextual evidence that lies outside the current field of view. Current evaluation falls short: existing benchmarks either assess performance on complete offline videos or focus on event detection rather than spatial structure. To address this gap, we present OVO-S-Bench, a comprehensive benchmark for streaming spatial intelligence.

The dataset is entirely human-annotated and includes 1,680 questions derived from 348 source videos. The rigorous annotation process engaged 12 trained annotators, who also acted as blind cross-reviewers, dedicating approximately 804 person-hours to multiple rounds of quality assurance. Each query is tagged with a specific timestamp and an associated evidence interval. During evaluation, models are restricted to viewing only the video prefix that precedes the query point.
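The prefix-only protocol described above can be illustrated with a minimal Python sketch. The field names, the `visible_prefix` helper, and the choice to include a frame that falls exactly at the query time are all assumptions made for illustration; the abstract does not specify the benchmark's data schema or loader.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class StreamingQuery:
    """Hypothetical record for one question (field names are assumptions)."""
    question: str
    query_time: float      # seconds into the video at which the question is asked
    evidence_start: float  # start of the annotated evidence interval, in seconds
    evidence_end: float    # end of the annotated evidence interval, in seconds


def visible_prefix(frame_times: Sequence[float], query: StreamingQuery) -> List[int]:
    """Return indices of frames the model may see.

    Only frames at or before the query time are kept; whether the boundary
    frame is included is an assumption of this sketch.
    """
    return [i for i, t in enumerate(frame_times) if t <= query.query_time]


def evidence_in_prefix(query: StreamingQuery) -> bool:
    """Check that the evidence interval ends no later than the query time."""
    return query.evidence_end <= query.query_time


# Toy example: one frame per second, question asked at 2.5 s.
frame_times = [0.0, 1.0, 2.0, 3.0, 4.0]
query = StreamingQuery("Where was the mug last seen?", 2.5, 0.5, 1.5)
allowed = visible_prefix(frame_times, query)
```

In this toy case the model would receive only the frames at 0, 1, and 2 seconds; frames after the query point are never shown, which is what separates streaming evaluation from offline evaluation on the full video.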

OVO-S-Bench organizes tasks into four hierarchical levels of increasing abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. We tested 38 proprietary and open-source Multimodal Large Language Models (MLLMs) on this benchmark. The results reveal that Gemini-3.1-Pro lags behind human experts by 27.4 points, scoring 59.2 against the human baseline of 86.6, with allocentric mapping identified as the primary bottleneck.
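A per-level breakdown of this kind can be sketched as follows. This assumes the reported score is a percentage accuracy and that each answer is judged simply correct or incorrect, neither of which the abstract states; the records are toy data, not benchmark results, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# The four level names come from the abstract; the string keys are illustrative.
LEVELS = (
    "instantaneous egocentric perception",
    "spatiotemporal context tracking",
    "spatial simulation and reasoning",
    "allocentric mapping",
)


def per_level_accuracy(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute percentage accuracy per level from (level, is_correct) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for level, is_correct in records:
        totals[level] += 1
        correct[level] += int(is_correct)
    return {level: 100.0 * correct[level] / totals[level] for level in totals}


def bottleneck_level(accuracy: Dict[str, float]) -> str:
    """Return the level with the lowest accuracy."""
    return min(accuracy, key=accuracy.get)


# Toy records only, chosen to make the arithmetic easy to follow.
toy_records = [
    (LEVELS[0], True), (LEVELS[0], True),
    (LEVELS[1], True), (LEVELS[1], False),
    (LEVELS[2], True), (LEVELS[2], False),
    (LEVELS[3], False), (LEVELS[3], False),
]
toy_accuracy = per_level_accuracy(toy_records)
weakest = bottleneck_level(toy_accuracy)
```

Reporting scores per level rather than as a single aggregate is what makes it possible to name one level as the bottleneck.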

MLLMs fine-tuned specifically for spatial tasks or streaming data underperform their base backbone models. Our analysis also shows that chain-of-thought reasoning tends to amplify spatial errors when it is not grounded in the video stream. By exposing these limitations, OVO-S-Bench provides a rigorous testbed for developing next-generation streaming spatial MLLMs.
