arXiv

Title: SCOPE: A Real-Time Natural Language Camera Agent Operating at the Edge

Abstract

Integrating language-driven agents into robotics requires evaluation frameworks that reflect the complexity of real-world tasks, with natural-language instructions and reproducible results. Such systems must connect large language models to callable perception and control tools, and must be assessed on deployment-critical metrics: latency, accuracy, and error modes. To address these needs, we introduce SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent that provides natural-language, open-vocabulary control of pan-tilt-zoom (PTZ) cameras and visual scene understanding, with a design optimized for local edge deployment.

The system runs both in a Blender-based simulation and on physical PTZ hardware, performing all perception, planning, and control locally on edge-accessible computing resources. We release a 536-task benchmark in the Blender simulation environment. It covers question answering, single- and multi-step command execution, counting, spatial reasoning, descriptive tasks, and optical character recognition, while exposing realistic PTZ control affordances. To assess performance, we combine execution traces with an LM-as-Judge framework to measure latency, accuracy, and error modes.
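
As a rough illustration of how per-task execution traces and judge verdicts could be reduced to the three reported quantities (latency, accuracy, error modes), consider the minimal sketch below. The `TaskResult` schema, the `summarize` helper, the error-mode labels, and the sample values are all hypothetical and chosen for illustration only; they are not SCOPE's trace format or results.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean, median
from typing import Optional


@dataclass
class TaskResult:
    """One benchmark task outcome (hypothetical schema, not SCOPE's trace format)."""
    category: str                     # e.g. "counting", "ocr"
    latency_s: float                  # end-to-end latency read from the execution trace
    judged_correct: bool              # verdict returned by the judge model
    error_mode: Optional[str] = None  # e.g. "hallucination"; None when correct


def summarize(results):
    """Aggregate accuracy, latency, and error-mode counts over a list of results."""
    if not results:
        raise ValueError("no results to summarize")
    latencies = [r.latency_s for r in results]
    # Count error modes only for tasks the judge marked incorrect.
    errors = Counter(
        r.error_mode for r in results if not r.judged_correct and r.error_mode
    )
    return {
        "accuracy": sum(r.judged_correct for r in results) / len(results),
        "mean_latency_s": mean(latencies),
        "median_latency_s": median(latencies),
        "error_modes": dict(errors),
    }


# Illustrative, made-up records (not data from the benchmark).
results = [
    TaskResult("counting", 1.0, True),
    TaskResult("ocr", 2.0, False, "hallucination"),
    TaskResult("spatial", 3.0, False, "tool_routing"),
    TaskResult("qa", 2.0, True),
]
summary = summarize(results)
print(summary)
```

Grouping the same records by `category`, or by planner-perception pairing, before calling `summarize` would give a per-capability or per-configuration breakdown under the same assumptions.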

We evaluate 19 planner-perception model combinations, pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision-language models (VLMs). The results indicate that stronger SLMs significantly reduce hallucinations and improve tool routing, yielding more reliable closed-loop behavior. However, once the SLM is sufficiently capable, perception becomes the primary bottleneck for overall performance. Mixture-of-Experts models on both the planning and perception sides consistently match or outperform their dense counterparts, with latency and memory usage comparable to much smaller networks. Quantization provides further efficiency gains with negligible impact on accuracy, identifying a practical, sim-to-real validated design point for real-time, edge-feasible language-driven PTZ control.


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
