arXiv

Efficient Reasoning on the Edge

June 4, 2026 · Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert, Fabio Valerio Massoli, Evgeny Mironov, Leyla Mirvakhabova, Tribhuvanesh Orekondy, Spyridon Stasis, Andrey Kuzmin, Anna Kuzina, Markus Nagel, Ankita Nayak, Corrado Rainone, Ork de Rooij, Paul N


Abstract

Large Language Models (LLMs) that use chain-of-thought reasoning deliver state-of-the-art results in complex problem-solving, but verbose reasoning traces and large context requirements hinder their deployment on edge devices. These limitations show up as high token generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile use. Conventional methods typically distill reasoning traces from larger models into smaller ones, which yields verbose and stylistically redundant outputs that are ill-suited to on-device inference.

To overcome these hurdles, we introduce a lightweight framework that gives small LLMs reasoning abilities by combining LoRA adapters with supervised fine-tuning. We then apply reinforcement learning to enforce budget constraints on these adapters, which sharply reduces response length while preserving accuracy. To address memory-bound decoding, we use parallel test-time scaling, which improves accuracy with only a negligible increase in latency. We also implement a dynamic adapter-switching mechanism that triggers reasoning only when necessary, together with a KV-cache sharing strategy during prompt encoding that reduces time-to-first-token for on-device inference. Our experiments on the Qwen2.5-7B model confirm that this method delivers efficient and accurate reasoning under strict resource constraints, making LLM reasoning viable for mobile applications. Videos demonstrating our solution running on mobile devices are available on our project page.
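As an illustration of the parallel test-time scaling step, the sketch below draws several reasoning samples for one prompt and keeps the most frequent final answer. The abstract does not say how the parallel samples are aggregated, so majority voting is an assumption here, and the names `majority_vote`, `parallel_scale`, and `sample_fn` are hypothetical rather than taken from the paper. On a device the samples would be decoded as one batch; the loop below only stands in for that.

```python
from collections import Counter
from typing import Callable, List, Optional


def majority_vote(answers: List[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer.

    Ties are broken in favour of the answer that appeared first.
    Returns None when no sample produced a usable answer.
    """
    valid = [a.strip() for a in answers if a and a.strip()]
    if not valid:
        return None
    counts = Counter(valid)
    # Counter preserves insertion order and max() returns the first
    # maximal key, so the earliest answer wins a tie.
    return max(counts, key=lambda answer: counts[answer])


def parallel_scale(
    sample_fn: Callable[[str, int], Optional[str]],
    prompt: str,
    n_samples: int,
) -> Optional[str]:
    """Draw n_samples answers for one prompt and aggregate them.

    sample_fn(prompt, seed) is a hypothetical stand-in for one sampled
    decode of the model that returns only the extracted final answer.
    """
    answers = [sample_fn(prompt, seed) for seed in range(n_samples)]
    return majority_vote(answers)


if __name__ == "__main__":
    # Toy sampler: pretends the model is right on most seeds.
    def fake_sampler(prompt: str, seed: int) -> str:
        return "8" if seed % 3 == 0 else "7"

    print(parallel_scale(fake_sampler, "What is 3 + 4?", 6))
```

Voting over final answers only, rather than over full reasoning traces, keeps the aggregation cost independent of trace length.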

