MOSAIC: Efficient Mixture-of-Agent Scheduling via Adaptive Aggregation and Inference Concurrency
Title: MOSAIC: Accelerating Mixture-of-Agent Workloads through Adaptive Aggregation and Concurrent Inference
Abstract: Mixture-of-Agents (MoA) architectures enhance reasoning precision by directing individual queries to various specialized large language models (LLMs) and synthesizing their responses. However, executing these workloads efficiently on constrained GPU infrastructure presents significant challenges. Skill-based routing leads to uneven demand across experts, while mixing instruction-tuned models with long-reasoning variants introduces substantial variation in output length. As a result, conventional scheduling methods are prone to severe GPU underutilization and throughput degradation caused by load imbalance. To address these issues, we introduce MOSAIC, a scheduling framework designed to boost MoA performance. Our approach begins with an Integer Linear Program (ILP)-based scheduler that jointly optimizes expert allocation and per-worker prompt distribution using offline-profiled cost data; this strategy duplicates computationally intensive reasoning experts across different workers while assigning lightweight models to specific instances. Furthermore, MOSAIC employs confidence-aware adaptive aggregation, which uses inter-expert consensus to bypass the resource-intensive final aggregator LLM for queries where agreement is already reached. In evaluations on a 4-GPU setup, MOSAIC delivers speedups of up to 2.5x during the expert stage, up to 4.23x during the aggregator stage, and 1.7x to 2.3x end-to-end compared to baseline schedulers, all while keeping accuracy within 0.1 percentage points of the original.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
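
The confidence-aware aggregation idea from the abstract can be illustrated with a minimal sketch. The abstract does not say how inter-expert consensus is measured, so everything here is an assumption made for illustration: the function name `aggregate_or_skip`, the use of a simple majority-vote fraction as the confidence signal, the string normalization, and the `threshold` value are all hypothetical and are not taken from MOSAIC itself.

```python
from collections import Counter


def aggregate_or_skip(expert_answers, threshold=0.75):
    """Decide whether the final aggregator LLM can be skipped.

    Hypothetical helper: consensus is approximated by the fraction of
    experts that returned the same (normalized) answer. Returns a pair
    (answer, skipped_aggregator). When agreement is below the threshold,
    answer is None and the caller should invoke the aggregator.
    """
    if not expert_answers:
        raise ValueError("need at least one expert answer")

    # Normalize so trivially different strings count as the same vote.
    votes = Counter(a.strip().lower() for a in expert_answers)
    top_answer, top_count = votes.most_common(1)[0]
    agreement = top_count / len(expert_answers)

    if agreement >= threshold:
        # Experts already agree: return their answer, skip the aggregator.
        return top_answer, True
    # No consensus: defer to the (expensive) aggregator LLM.
    return None, False
```

In this sketch, a query whose experts mostly agree is answered directly, and only queries without consensus pay for the aggregator call. A real system would likely need a more robust agreement measure for free-form outputs than exact string matching.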



