Multi-SPIN: Multi-Access Speculative Inference for Cooperative Token Generation at the Edge
Title: Multi-SPIN: Enabling Cooperative Token Generation at the Edge via Multi-Access Speculative Inference
Abstract:
Speculative inference (SPIN) is an established technique for accelerating large language model (LLM) inference. This study explores its distributed deployment in multiuser edge environments, where it enables cooperative token generation that balances the computational load between servers and resource-limited devices. We introduce Multi-access SPIN (Multi-SPIN), an architecture in which on-device small language models generate candidate token drafts and upload them to edge servers, which run LLMs to verify the drafts in parallel batches.
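The draft-then-verify exchange described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes greedy decoding with exact-match verification, and the names `NextToken`, `draft_tokens`, `verify_draft`, `toy_large`, and `toy_small` are hypothetical stand-ins for real models. A real server would check every draft position in one parallel forward pass; the loop below is sequential only for readability.

```python
from typing import Callable, List

# Hypothetical model interface: maps a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def draft_tokens(small_model: NextToken, context: List[int], draft_len: int) -> List[int]:
    """Device side: autoregressively draft `draft_len` tokens with the small model."""
    draft: List[int] = []
    for _ in range(draft_len):
        draft.append(small_model(context + draft))
    return draft


def verify_draft(large_model: NextToken, context: List[int], draft: List[int]) -> List[int]:
    """Server side: keep the longest draft prefix the large model agrees with,
    then append one token chosen by the large model itself."""
    accepted: List[int] = []
    for tok in draft:
        target = large_model(context + accepted)
        if tok != target:
            # First mismatch: replace it with the large model's token and stop.
            return accepted + [target]
        accepted.append(tok)
    # Every drafted token was accepted, so the large model adds a bonus token.
    return accepted + [large_model(context + accepted)]


def toy_large(ctx: List[int]) -> int:
    """Toy 'large model' (assumption): next token depends only on context length."""
    return (len(ctx) * 3) % 10


def toy_small(ctx: List[int]) -> int:
    """Toy 'small model' (assumption): agrees with toy_large except at some positions."""
    tok = toy_large(ctx)
    return (tok + 1) % 10 if len(ctx) % 4 == 3 else tok


if __name__ == "__main__":
    context = [1, 2]
    draft = draft_tokens(toy_small, context, draft_len=3)
    print("draft:", draft)
    print("tokens kept this round:", verify_draft(toy_large, context, draft))
```

Each round yields at least one token, so a longer draft can only help the token count per round; its cost shows up in drafting and upload time.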
When users' computational and communication resources are heterogeneous, draft length becomes a pivotal control variable: it directly affects the processing load at each node, the multi-access latency, and ultimately the aggregate token goodput. Focusing on frequency-division multiple access, we address multi-access draft control, a joint optimization of draft lengths and bandwidth allocation that aims to maximize the total token goodput.
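To see why draft length acts as a control variable, consider a toy goodput model. It is an illustrative sketch under stated assumptions, not the paper's formulation: each drafted token is accepted independently with probability `alpha`, which gives the standard expected yield of (1 - alpha^(L+1)) / (1 - alpha) tokens per round for draft length L; all users share one draft length; the server waits for the slowest user before verifying; and verification time `t_verify` does not depend on L. All function and parameter names are hypothetical.

```python
from typing import Sequence


def expected_tokens_per_round(alpha: float, draft_len: int) -> float:
    """Expected tokens gained per round when each drafted token is accepted
    independently with probability alpha (accepted prefix plus one server token)."""
    if alpha >= 1.0:
        return float(draft_len + 1)
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)


def homogeneous_goodput(
    draft_len: int,
    alphas: Sequence[float],      # per-user acceptance rates
    t_drafts: Sequence[float],    # per-user drafting time per token (s)
    rates: Sequence[float],       # per-user uplink rates (bit/s)
    bits_per_token: float,        # assumed upload payload per drafted token
    t_verify: float,              # assumed fixed batched verification time (s)
) -> float:
    """Total tokens per second when every user drafts `draft_len` tokens and the
    server starts verification only after the slowest draft has arrived."""
    slowest = max(draft_len * (t + bits_per_token / r) for t, r in zip(t_drafts, rates))
    tokens = sum(expected_tokens_per_round(a, draft_len) for a in alphas)
    return tokens / (slowest + t_verify)


def best_draft_length(max_len: int, **system) -> int:
    """Brute-force search over draft lengths 1..max_len."""
    return max(range(1, max_len + 1), key=lambda n: homogeneous_goodput(n, **system))


if __name__ == "__main__":
    system = dict(
        alphas=[0.9, 0.7, 0.5],
        t_drafts=[0.010, 0.020, 0.015],
        rates=[2e6, 1e6, 5e5],
        bits_per_token=16_000.0,
        t_verify=0.05,
    )
    n = best_draft_length(12, **system)
    print("best draft length:", n, "goodput:", homogeneous_goodput(n, **system))
```

In this model the expected yield saturates as L grows while drafting and upload time grow linearly, so goodput peaks at a finite draft length that depends on each user's acceptance rate, compute speed, and uplink rate.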
Our investigation covers two distinct scenarios: first, a homogeneous setting where all users employ identical draft lengths to streamline server-side batching; and second, a heterogeneous setting that leverages varying draft lengths to unlock additional dimensions for improving goodput. Through the application of decomposition techniques, we simplify these complex optimization challenges into manageable sub-problems, enabling the derivation of efficient draft control algorithms with closed-form solutions.
Our analysis reveals different bandwidth allocation strategies for the two scenarios. In the homogeneous case, bandwidth allocation compensates users with weaker computational and communication capabilities, because batching imposes a synchronization requirement. In the heterogeneous case, where this synchronization requirement is relaxed, bandwidth allocation instead rewards users with higher token acceptance rates. Experiments with Llama-2 and Qwen3.5 model pairs across various tasks show that Multi-SPIN improves goodput by up to 88% over baselines that do not account for system heterogeneity.
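The compensation effect in the homogeneous case can be illustrated with a small numerical sketch. It is not the paper's closed-form solution: it assumes each user's uplink rate is bandwidth times a fixed spectral efficiency, and it uses bisection to find the bandwidth split under which every draft reaches the server at the same time. The function `equalizing_bandwidth` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple


def equalizing_bandwidth(
    draft_len: int,
    t_drafts: Sequence[float],       # per-user drafting time per token (s)
    spectral_effs: Sequence[float],  # assumed per-user spectral efficiency (bit/s/Hz)
    bits_per_token: float,           # assumed upload payload per drafted token
    total_bw: float,                 # total bandwidth to split (Hz)
    iters: int = 60,
) -> Tuple[float, List[float]]:
    """Split `total_bw` so that drafting time plus upload time is the same value T
    for every user. Returns (T, per-user bandwidths)."""
    payload = draft_len * bits_per_token

    def needed_bw(T: float) -> float:
        # Bandwidth required for every user to finish drafting and uploading by T.
        return sum(payload / (s * (T - draft_len * t)) for t, s in zip(t_drafts, spectral_effs))

    lo = max(draft_len * t for t in t_drafts)  # T must exceed the slowest drafting time
    hi = lo + 1.0
    while needed_bw(hi) > total_bw:            # grow the bracket until T = hi is feasible
        hi = lo + 2.0 * (hi - lo)
    for _ in range(iters):                     # bisection on the common deadline T
        mid = 0.5 * (lo + hi)
        if needed_bw(mid) > total_bw:
            lo = mid
        else:
            hi = mid
    T = hi
    bws = [payload / (s * (T - draft_len * t)) for t, s in zip(t_drafts, spectral_effs)]
    return T, bws


if __name__ == "__main__":
    T, bws = equalizing_bandwidth(
        draft_len=4,
        t_drafts=[0.01, 0.03],      # the second user drafts three times slower
        spectral_effs=[2.0, 2.0],
        bits_per_token=16_000.0,
        total_bw=1e6,
    )
    print("common arrival time (s):", T)
    print("bandwidth per user (Hz):", bws)
```

With equal channel quality, the slower device receives the larger share of bandwidth, since that is the only way for it to meet the shared deadline that batching imposes.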
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC





