arXiv

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Abstract

Robotic systems, autonomous driving platforms, and augmented reality applications rely on multimodal agents capable of reasoning about spatial layouts and locations. These agents must process continuous egocentric video streams, frequently leveraging contextual evidence that falls outside the immediate field of view. Current evaluations fall short: existing benchmarks either assess performance on complete offline videos or focus on event detection rather than spatial structure. To address this gap, we present OVO-S-Bench, a comprehensive benchmark designed for streaming spatial intelligence.

The dataset is entirely human-annotated and includes 1,680 questions derived from 348 source videos. The rigorous annotation process engaged 12 trained annotators, who also acted as blind cross-reviewers, dedicating approximately 804 person-hours to multiple rounds of quality assurance. Each query is tagged with a specific timestamp and an associated evidence interval. During evaluation, models are restricted to viewing only the video prefix that precedes the query point.
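The prefix restriction described above can be sketched in a few lines. This is only an illustration, not the benchmark's released code: the `StreamingQuery` record, its field names, and the helper functions are hypothetical, and the sketch assumes that a frame stamped exactly at the query time counts as visible and that the evidence interval ends no later than the query point.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamingQuery:
    """Hypothetical record for one benchmark question (field names are assumptions)."""
    question: str
    query_time: float      # seconds into the video at which the question is asked
    evidence_start: float  # start of the annotated evidence interval, in seconds
    evidence_end: float    # end of the annotated evidence interval, in seconds


def visible_prefix(frame_times, query):
    """Return the frame timestamps a model may see for this query.

    frame_times must be sorted in ascending order. Frames stamped exactly at
    query_time are kept (an assumption about boundary handling).
    """
    cutoff = bisect_right(frame_times, query.query_time)
    return frame_times[:cutoff]


def evidence_precedes_query(query):
    """Sanity check: the evidence interval should lie inside the visible prefix."""
    return query.evidence_start <= query.evidence_end <= query.query_time


frame_times = [0.0, 0.5, 1.0, 1.5, 2.0]
query = StreamingQuery("Where is the door?", query_time=1.2,
                       evidence_start=0.4, evidence_end=0.9)
prefix = visible_prefix(frame_times, query)
```

Here `prefix` holds the frames at 0.0, 0.5, and 1.0 seconds; the frames at 1.5 and 2.0 seconds lie after the query point and are withheld from the model.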

OVO-S-Bench organizes tasks into four hierarchical levels of increasing abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. We tested 38 proprietary and open-source Multimodal Large Language Models (MLLMs) on this benchmark. The results reveal that Gemini-3.1-Pro lags behind human experts by 27.4 points (59.2 versus the human baseline of 86.6), with allocentric mapping identified as the primary bottleneck.
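As a rough illustration of how results over the four levels might be tabulated, the sketch below groups per-question outcomes by level and computes the gap between two overall scores. The `Level` enum, the function names, and the use of plain percent-correct per level are assumptions made for this example; the abstract does not specify the scoring protocol.

```python
from collections import defaultdict
from enum import Enum


class Level(Enum):
    """The four levels named in the abstract (identifiers are illustrative)."""
    EGOCENTRIC_PERCEPTION = 1
    CONTEXT_TRACKING = 2
    SIMULATION_REASONING = 3
    ALLOCENTRIC_MAPPING = 4


def accuracy_by_level(results):
    """Map each level to percent correct, given (Level, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, is_correct in results:
        totals[level] += 1
        correct[level] += int(is_correct)
    return {level: 100.0 * correct[level] / totals[level] for level in totals}


def gap_to_human(model_score, human_score):
    """Difference between human and model scores, rounded to one decimal place."""
    return round(human_score - model_score, 1)


# Toy outcomes, not benchmark data.
toy_results = [
    (Level.EGOCENTRIC_PERCEPTION, True),
    (Level.ALLOCENTRIC_MAPPING, True),
    (Level.ALLOCENTRIC_MAPPING, False),
]
per_level = accuracy_by_level(toy_results)
gap = gap_to_human(59.2, 86.6)  # the two overall scores quoted in the abstract
```

With the scores quoted in the abstract, `gap` evaluates to 27.4 points.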

Notably, MLLMs fine-tuned specifically for spatial tasks or streaming data underperformed their base backbone models. Our analysis also shows that chain-of-thought reasoning tends to exacerbate spatial errors when it is not properly grounded in the video stream. By highlighting these limitations, OVO-S-Bench provides a rigorous testbed to drive the development of next-generation streaming spatial MLLMs.


Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
