MUSE: A Unified Agentic Harness for MLLMs
Source: arXiv:2606.03005v1
Abstract:
Although multimodal large language models (MLLMs) have advanced rapidly, they still struggle with tasks that humans perform with ease, such as selecting the right puzzle piece or navigating a grid maze based on a screenshot. Instead of focusing on model retraining, this study investigates a different approach: determining how much potential can be unlocked from a frozen MLLM simply by enhancing the execution framework surrounding it.
To this end, we present MUSE, a multimodal unified structured execution harness. MUSE wraps any commercially available MLLM with a suite of composable modules designed for task representation, visual processing, perception tool utilization, structured parsing, deterministic verification, and verifier-guided repair. Notably, this process requires no retraining of the underlying model.
We evaluate MUSE across a wide array of benchmarks, including visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination, using several state-of-the-art MLLMs. The results show that MUSE consistently outperforms the unassisted models across all tested scenarios, with the largest improvements observed in complex cases. Further analysis indicates that many MLLM errors stem from limitations in the execution harness rather than inherent flaws in the models themselves; these errors can be resolved through verifier-guided repair without modifying the model weights. These insights underscore the agentic multimodal harness as a vital yet underappreciated design aspect, providing a path to enhancing MLLM capabilities that operates independently of model-centric optimization strategies.
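The abstract does not give implementation details, so the following is only an illustrative sketch of what structured parsing, deterministic verification, and verifier-guided repair around a frozen model might look like for a grid-maze task. Everything here is an assumption made for illustration: the grid encoding, the names `parse_moves`, `verify`, `solve_with_repair`, and `scripted_model`, and the prompt wording are hypothetical and are not taken from MUSE. The model is treated as an opaque callable from prompt text to answer text, and a real harness would also pass the screenshot.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[str]  # assumed encoding: '#' wall, '.' free cell, 'S' start, 'G' goal
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def find(grid: Grid, ch: str) -> Tuple[int, int]:
    """Return the (row, column) of the first occurrence of ch."""
    for r, row in enumerate(grid):
        c = row.find(ch)
        if c != -1:
            return r, c
    raise ValueError(f"{ch!r} not in grid")


def parse_moves(raw: str) -> Optional[List[str]]:
    """Structured parsing: accept only a sequence of U/D/L/R tokens."""
    tokens = raw.replace(",", " ").split()
    if tokens and all(t in MOVES for t in tokens):
        return tokens
    return None


def verify(grid: Grid, moves: List[str]) -> Optional[str]:
    """Deterministic verification: simulate the path.

    Returns None if the path reaches the goal, otherwise an error message.
    """
    r, c = find(grid, "S")
    for i, m in enumerate(moves):
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        if not (0 <= r < len(grid) and 0 <= c < len(grid[r])):
            return f"move {i} ({m}) leaves the grid"
        if grid[r][c] == "#":
            return f"move {i} ({m}) hits a wall at ({r}, {c})"
    if grid[r][c] != "G":
        return f"path ends at ({r}, {c}), not at the goal"
    return None


def solve_with_repair(
    model: Callable[[str], str], grid: Grid, max_rounds: int = 3
) -> Optional[List[str]]:
    """Verifier-guided repair: re-query the frozen model with verifier feedback."""
    prompt = "Maze:\n" + "\n".join(grid) + "\nAnswer with moves from U D L R."
    for _ in range(max_rounds):
        raw = model(prompt)
        moves = parse_moves(raw)
        if moves is None:
            error = "answer is not a list of U/D/L/R moves"
        else:
            error = verify(grid, moves)
        if error is None:
            return moves
        # The model weights never change; only the prompt is extended.
        prompt += f"\nPrevious answer: {raw}\nVerifier feedback: {error}. Try again."
    return None


def scripted_model(replies: List[str]) -> Callable[[str], str]:
    """Stand-in for an MLLM call: returns canned replies in order."""
    it = iter(replies)
    return lambda prompt: next(it)
```

With the two-row grid `["S#.", "..G"]`, a stub that first answers `R R` and then `D R R` is rejected once by the verifier (the first move hits a wall) and accepted on the second round. The point of the sketch is that the repair signal comes from a deterministic check rather than from the model itself.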



