arXiv

Multi-SPIN: Multi-Access Speculative Inference for Cooperative Token Generation at the Edge

Title: Multi-SPIN: Enabling Cooperative Token Generation at the Edge via Multi-Access Speculative Inference

Abstract:

While speculative inference (SPIN) has traditionally served as an efficient architectural framework for accelerating Large Language Models (LLMs), this study explores its potential for distributed deployment within multiuser edge environments. The proposed approach facilitates cooperative token generation, effectively balancing computational burdens between servers and resource-limited devices. We introduce Multi-access SPIN (Multi-SPIN), an architecture where edge servers run LLMs to verify candidate token drafts in parallel batches, while on-device small language models generate and upload these drafts.
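The draft-and-verify split described above can be illustrated with a minimal sketch. It assumes greedy verification (accept the longest draft prefix that matches the large model's own choices, then append one token from the large model), which is one common variant of speculative inference and not necessarily the one used in the paper. All function names and the toy "models" are hypothetical; a real server would score every draft position in a single batched forward pass instead of looping.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextToken = Callable[[Sequence[Token]], Token]  # hypothetical model interface


def draft_tokens(small_model: NextToken, prefix: List[Token], draft_len: int) -> List[Token]:
    """On-device step: autoregressively draft `draft_len` candidate tokens."""
    ctx = list(prefix)
    draft = []
    for _ in range(draft_len):
        tok = small_model(ctx)
        draft.append(tok)
        ctx.append(tok)
    return draft


def verify_draft(large_model: NextToken, prefix: List[Token], draft: List[Token]) -> List[Token]:
    """Server step (greedy variant, an assumption): keep the matching prefix,
    then add one token from the large model (a correction or a bonus token)."""
    ctx = list(prefix)
    out = []
    for tok in draft:
        target = large_model(ctx)
        if tok != target:
            out.append(target)  # first mismatch: emit the correction and stop
            return out
        out.append(tok)
        ctx.append(tok)
    out.append(large_model(ctx))  # all drafts accepted: one bonus token
    return out


def verify_batch(large_model: NextToken,
                 requests: List[Tuple[List[Token], List[Token]]]) -> List[List[Token]]:
    """Verify drafts uploaded by several users (looped here for clarity)."""
    return [verify_draft(large_model, prefix, draft) for prefix, draft in requests]


# Toy stand-ins for the models: the large model counts modulo 10,
# the small model agrees except right after token 3.
def toy_large(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_small(ctx: Sequence[Token]) -> Token:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


user_a = ([0], draft_tokens(toy_small, [0], 5))
user_b = ([5], draft_tokens(toy_small, [5], 2))
results = verify_batch(toy_large, [user_a, user_b])
```

In this toy run the first user's draft diverges at its fourth token, so only three drafted tokens plus one correction are kept, while the second user's two drafted tokens are all accepted and followed by a bonus token.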

In environments characterized by significant heterogeneity in users’ computational and communication resources, the draft length becomes a pivotal control variable. It directly affects node-level processing loads, multi-access latency, and ultimately the aggregate token goodput. Focusing on frequency-division multiple access, we address multi-access draft control, a joint optimization of draft length and bandwidth allocation that aims to maximize total token goodput.
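To make the role of draft length concrete, the sketch below computes a per-user goodput under simplifying assumptions that are not taken from the paper: each drafted token is accepted independently with probability `alpha`, so a round with `draft_len` drafts yields (1 - alpha^(L+1)) / (1 - alpha) tokens in expectation (a standard result in the speculative decoding literature), and the round latency is a hypothetical linear model with per-token drafting and upload times plus a fixed verification time.

```python
def expected_tokens(alpha: float, draft_len: int) -> float:
    """Expected tokens produced per round, assuming i.i.d. acceptance with rate alpha."""
    if alpha == 1.0:
        return float(draft_len + 1)
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)


def goodput(alpha: float, draft_len: int, t_draft: float,
            t_upload: float, t_verify: float) -> float:
    """Tokens per second under a hypothetical linear latency model."""
    latency = draft_len * (t_draft + t_upload) + t_verify
    return expected_tokens(alpha, draft_len) / latency


def best_draft_length(alpha: float, t_draft: float, t_upload: float,
                      t_verify: float, max_len: int = 32) -> int:
    """Brute-force search over draft lengths (not the paper's closed form)."""
    return max(range(1, max_len + 1),
               key=lambda n: goodput(alpha, n, t_draft, t_upload, t_verify))


# A user whose drafts are accepted more often can afford longer drafts.
short = best_draft_length(alpha=0.5, t_draft=0.01, t_upload=0.005, t_verify=0.1)
long = best_draft_length(alpha=0.9, t_draft=0.01, t_upload=0.005, t_verify=0.1)
```

Longer drafts amortize the verification cost over more tokens, but each extra drafted token is less likely to survive verification, which is why an interior optimum appears and why it shifts with the acceptance rate and the link speed.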

Our investigation covers two distinct scenarios: first, a homogeneous setting where all users employ identical draft lengths to streamline server-side batching; and second, a heterogeneous setting that leverages varying draft lengths to unlock additional dimensions for improving goodput. Through the application of decomposition techniques, we simplify these complex optimization challenges into manageable sub-problems, enabling the derivation of efficient draft control algorithms with closed-form solutions.

Our analysis reveals distinct bandwidth allocation strategies for the two scenarios. In the homogeneous case, bandwidth is allocated to compensate users with weaker computational and communication capabilities, because batching requires all drafts to arrive in sync. In the heterogeneous case, by contrast, relaxing this synchronization constraint lets the system reward users with higher token acceptance rates. Experiments with Llama-2 and Qwen3.5 model pairs across various tasks show that Multi-SPIN improves goodput by up to 88% compared to baselines that do not account for system heterogeneity.
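The compensating behavior in the homogeneous case can be illustrated with a small sketch. It is an assumption-laden stand-in, not the paper's closed-form solution: each user's latency is modeled as a drafting time plus the time to upload a fixed payload over its bandwidth share, and the total bandwidth is split so that all users finish at the same instant. The function name, parameters, and the bisection approach are all hypothetical.

```python
from typing import List, Tuple


def equalize_latency(compute_s: List[float], spectral_eff: List[float],
                     payload_bits: float, total_bw_hz: float) -> Tuple[float, List[float]]:
    """Split bandwidth so every user finishes drafting and uploading together.

    Assumed latency model: latency_k = compute_s[k] + payload_bits / (bw_k * spectral_eff[k]).
    Returns the common finish time and the per-user bandwidths.
    """
    slowest = max(compute_s)

    def bw_needed(finish: float) -> List[float]:
        # Bandwidth each user needs to finish exactly at time `finish`.
        return [payload_bits / (s * (finish - c))
                for c, s in zip(compute_s, spectral_eff)]

    # Bracket the slack beyond the slowest drafting time, then bisect on it.
    lo, hi = 0.0, 1.0
    while sum(bw_needed(slowest + hi)) > total_bw_hz:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if sum(bw_needed(slowest + mid)) > total_bw_hz:
            lo = mid  # finishing this early needs too much bandwidth
        else:
            hi = mid
    finish = slowest + hi
    return finish, bw_needed(finish)


# Two users with equal link quality; the second drafts twice as slowly.
finish_time, shares = equalize_latency(
    compute_s=[0.1, 0.2], spectral_eff=[2.0, 2.0],
    payload_bits=1000.0, total_bw_hz=10000.0)
```

Under these assumptions the slower user receives the larger bandwidth share, which mirrors the compensation effect described above: with synchronized batching, the round is only as fast as its slowest participant.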


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
