Sandboxed Coding Agents are Competitive Omni-modal Task Solvers
Abstract
As multimodal large language models extend to video and audio, it is commonly assumed that such tasks require native omni-modal architectures. We show that this is not always the case: coding agents equipped only with text and image capabilities and a sandboxed tool-use interface can match, and in some settings surpass, state-of-the-art native omni-modal models and established multimodal agent frameworks across various audio-video benchmarks.
Our trajectory analysis indicates that the efficacy of these agents stems from their ability to generate code and coordinate tools to extract pertinent evidence from transcripts, video frames, and other modality-specific signals. This approach effectively transforms omni-modal challenges into retrieval and information-processing tasks, eliminating the need to ingest complete media streams. To better understand their constraints, we present a failure taxonomy supported by process-level trace analysis. Furthermore, we demonstrate that performance can be significantly enhanced through the injection of simple skills, encompassing both human-authored and self-distilled expertise.
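To make the retrieval framing concrete, here is a minimal sketch of the kind of helper code an agent might write inside its sandbox. It is an illustration under stated assumptions, not the pipeline used in the paper: the `Segment` type, the keyword-overlap scoring, and the function names `retrieve_segments` and `frame_timestamps` are all hypothetical. The idea is to pick the transcript segments most relevant to a question, then compute a few timestamps inside each one so that only those frames need to be extracted and inspected.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    """One timestamped transcript segment (hypothetical structure)."""
    start: float  # seconds
    end: float    # seconds
    text: str


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_segments(question, segments, top_k=2):
    """Rank segments by token overlap with the question.

    Segments with no overlap are dropped. Ties are broken by start time.
    Keyword overlap is an assumption made for this sketch; a real agent
    could use embeddings or any other scoring it chooses to write.
    """
    q_tokens = tokenize(question)
    scored = [(len(q_tokens & tokenize(s.text)), s) for s in segments]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1].start))
    return [s for _, s in scored[:top_k]]


def frame_timestamps(segment, n_frames=3):
    """Return n_frames timestamps at the midpoints of equal slices of the segment."""
    step = (segment.end - segment.start) / n_frames
    return [round(segment.start + step * (i + 0.5), 2) for i in range(n_frames)]


if __name__ == "__main__":
    transcript = [
        Segment(0.0, 10.0, "The speaker introduces the recipe"),
        Segment(10.0, 20.0, "She adds two cups of flour to the bowl"),
        Segment(20.0, 30.0, "The oven is preheated"),
    ]
    for seg in retrieve_segments("How many cups of flour does she add?", transcript):
        print(seg.text, frame_timestamps(seg, n_frames=2))
```

The returned timestamps could then be passed to a frame-extraction tool available in the sandbox, so the agent views a handful of images rather than the full video.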
To facilitate open-source elicitation, we introduce Code-X, a training methodology utilizing the OmniCoding trajectory dataset and verifiable rewards, establishing baselines on the Qwen-3.5-9B and Qwen-3.6-27B models. Finally, we posit that many-modality processing represents the next significant frontier in the field and present TerminalBench-O, a process-oriented benchmark designed for real-world omni-modal processing tasks. The associated code will be accessible at https://github.com/Dongping-Chen/OmniCoding.
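The abstract does not specify how the verifiable rewards are computed. As a hedged illustration only, one common form for multiple-choice benchmarks is an exact-match check on the final answer letter. The answer format (`Answer: X` with options A to D) and the names `extract_answer` and `verifiable_reward` below are assumptions made for this sketch.

```python
import re


def extract_answer(response):
    """Return the last 'Answer: X' option letter in the response, or None.

    Assumes options are labelled A-D, optionally wrapped in parentheses.
    The negative lookahead avoids matching the first letter of a word
    such as 'Answer: Apple'.
    """
    matches = re.findall(
        r"answer\s*:\s*\(?([A-D])(?![A-Za-z])",
        response,
        flags=re.IGNORECASE,
    )
    return matches[-1].upper() if matches else None


def verifiable_reward(response, gold):
    """Binary reward: 1.0 if the extracted letter equals the gold letter, else 0.0."""
    predicted = extract_answer(response)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == gold.strip().upper() else 0.0


if __name__ == "__main__":
    print(verifiable_reward("The transcript mentions flour. Answer: (b)", "B"))
    print(verifiable_reward("Answer: C", "B"))
```

A reward of this kind can be checked automatically against ground truth, which is what makes it usable as a training signal without a learned judge.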
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





