arXiv

SCOPE: Real-Time Natural Language Camera Agent at the Edge

June 3, 2026 · Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra

Title: SCOPE: A Real-Time Natural Language Camera Agent Operating at the Edge

Abstract

Integrating language-driven agents into robotics requires evaluation frameworks that mirror real-world tasks, with natural-language instructions and reproducible results. Such systems must bridge large language models with callable perception and control mechanisms, and must be assessed against deployment-critical metrics: latency, accuracy, and error modes. To address these needs, we introduce SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent designed for local edge deployment. SCOPE provides natural-language, open-vocabulary control of pan-tilt-zoom (PTZ) cameras together with visual scene understanding.
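As an illustration of what "callable perception and control mechanisms" can look like, the following is a minimal sketch of a tool registry and dispatcher for a PTZ camera. It is not SCOPE's actual interface: the tool names (`move_camera`, `describe_scene`), the call format, and the pan/tilt/zoom limits are all hypothetical, and the perception tool is a placeholder where a VLM call would go.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class PTZState:
    pan: float = 0.0   # degrees
    tilt: float = 0.0  # degrees
    zoom: float = 1.0  # magnification factor


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def move_camera(state: PTZState, pan: float = 0.0, tilt: float = 0.0,
                zoom: float = 0.0) -> str:
    """Apply relative deltas, clamped to hypothetical mechanical limits."""
    state.pan = clamp(state.pan + pan, -170.0, 170.0)
    state.tilt = clamp(state.tilt + tilt, -30.0, 90.0)
    state.zoom = clamp(state.zoom + zoom, 1.0, 20.0)
    return f"pan={state.pan:.1f} tilt={state.tilt:.1f} zoom={state.zoom:.1f}"


def describe_scene(state: PTZState, question: str = "") -> str:
    """Placeholder for a perception call; a real agent would query a VLM."""
    return f"[VLM answer to: {question!r}]"


# Hypothetical registry mapping tool names to callables.
TOOLS: Dict[str, Callable[..., str]] = {
    "move_camera": move_camera,
    "describe_scene": describe_scene,
}


def dispatch(state: PTZState, call: Dict[str, Any]) -> str:
    """Route one planner-emitted call; return errors as text, not exceptions."""
    name = call.get("tool")
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    try:
        return TOOLS[name](state, **call.get("args", {}))
    except TypeError as exc:
        return f"error: bad arguments for {name}: {exc}"
```

Returning errors as strings instead of raising lets a planner read the failure and retry, which is one simple way to keep a closed loop running when a model emits an invalid tool name or argument.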

The system functions within a Blender-based simulation and on physical PTZ hardware, performing all perception, planning, and control tasks locally using edge-accessible computing resources. We have published a comprehensive 536-task benchmark conducted in the Blender simulation environment. This benchmark covers a wide range of capabilities, including question answering, single- and multi-step command execution, counting, spatial reasoning, descriptive tasks, and optical character recognition, all while exposing realistic PTZ control affordances. To assess performance, we combine execution traces with an LM-as-Judge framework to measure latency, accuracy, and error modes.
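The three reported quantities (latency, accuracy, and error modes) can be aggregated from per-task records along the following lines. This is a sketch under stated assumptions, not the paper's evaluation code: the `TaskRecord` fields, the category names, and the error-mode labels are hypothetical, and `judged_correct` stands in for whatever verdict the LM judge produces.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class TaskRecord:
    task_id: str
    category: str                      # e.g. "counting", "ocr" (hypothetical labels)
    latency_s: float                   # wall-clock time from the execution trace
    judged_correct: bool               # verdict from the judge model
    error_mode: Optional[str] = None   # label assigned to failures, if any


def summarize(records: List[TaskRecord]) -> Dict[str, object]:
    """Aggregate accuracy, latency, and error-mode counts over task records."""
    if not records:
        raise ValueError("no records to summarize")

    latencies = [r.latency_s for r in records]

    # Group judge verdicts by task category for a per-category accuracy.
    verdicts: Dict[str, List[bool]] = {}
    for r in records:
        verdicts.setdefault(r.category, []).append(r.judged_correct)

    return {
        "accuracy": sum(r.judged_correct for r in records) / len(records),
        "mean_latency_s": sum(latencies) / len(latencies),
        "median_latency_s": median(latencies),
        "per_category_accuracy": {
            cat: sum(v) / len(v) for cat, v in verdicts.items()
        },
        "error_modes": Counter(
            r.error_mode for r in records if r.error_mode is not None
        ),
    }
```

Reporting the median alongside the mean is a deliberate choice here: a few slow multi-step tasks can dominate the mean latency, while the median better reflects the typical task.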

Our evaluation covered 19 planner-perception combinations, pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision-language models (VLMs). The results indicate that more capable SLMs reduce hallucinations and improve tool routing, yielding more reliable closed-loop behavior. Once a sufficiently capable SLM is in place, however, perception becomes the primary bottleneck for overall performance. Mixture-of-Experts models on both the planning and perception sides consistently matched or outperformed their dense counterparts, with latency and memory usage comparable to much smaller networks. Quantization offered further efficiency gains with negligible impact on accuracy, identifying a practical, sim-to-real validated design point for real-time, edge-feasible, language-driven PTZ control.

Source: arXiv Generated at: 2026-06-03 00:00:00 UTC