MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents
Title: MobiBench: A Modular, Multi-Branch Benchmark for Mobile GUI Agents
Abstract:
Mobile GUI agents, AI intermediaries that interact with mobile applications on a user's behalf, hold the promise of revolutionizing human-computer interaction. Despite this potential, current evaluation methodologies for these agents are hindered by two core limitations. The first is the forced choice between single-path offline benchmarks and online live benchmarks. Offline approaches, which depend on static, single-path annotated datasets, unfairly penalize valid alternative actions. Conversely, online benchmarks struggle with scalability and reproducibility, largely because live environments are dynamic and unpredictable. The second limitation is that existing benchmarks treat agents as monolithic black boxes. This perspective ignores the contributions of individual components, often resulting in unfair comparisons and masking critical performance bottlenecks.
To overcome these challenges, we introduce MobiBench, the first modular and multi-path-aware offline benchmarking framework for mobile GUI agents. It enables high-fidelity, scalable, and reproducible evaluation entirely within offline environments. Our experimental results indicate that MobiBench achieves 94.72% agreement with human evaluators, matching the precision of carefully constructed online benchmarks while retaining the scalability and reproducibility of static offline methods. Additionally, our extensive module-level analysis reveals several significant insights, including a systematic review of the techniques employed in mobile GUI agents, optimal module configurations across different model scales, the inherent constraints of current Large Foundation Models (LFMs), and practical recommendations for engineering more capable and cost-effective mobile agents.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




