arXiv

Efficient Reasoning on the Edge

Title: Streamlined Reasoning at the Edge

Abstract

Large Language Models (LLMs) that use chain-of-thought reasoning deliver state-of-the-art results on complex problem-solving tasks, but their deployment on edge devices is hindered by verbose reasoning traces and large context requirements. These limitations show up as high token-generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile use. Conventional methods distill reasoning traces from larger models into smaller ones, which yields verbose, stylistically redundant outputs that are ill-suited to on-device inference.
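To make the KV-cache concern concrete, the following minimal sketch estimates cache size as a function of reasoning-trace length. The function name and the configuration values (layer count, number of key/value heads, head dimension, 16-bit storage) are illustrative assumptions for a grouped-query-attention model of roughly this scale, not figures taken from the paper; the real values should be read from the model's configuration.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed (hypothetical) model configuration, used only for illustration.
ASSUMED_CONFIG = dict(num_layers=28, num_kv_heads=4, head_dim=128)

if __name__ == "__main__":
    for trace_len in (512, 8192):
        mib = kv_cache_bytes(trace_len, **ASSUMED_CONFIG) / 2**20
        print(f"{trace_len} tokens -> {mib:.0f} MiB of KV cache")
```

Under these assumptions the cache grows linearly with the number of generated tokens, which is why shortening reasoning traces directly reduces memory pressure on a device.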

To address these problems, we introduce a lightweight framework that gives small LLMs reasoning abilities by combining LoRA adapters with supervised fine-tuning. We then apply reinforcement learning with budget constraints to these adapters, which substantially reduces response length while preserving accuracy. To work around memory-bound decoding, we use parallel test-time scaling, which improves accuracy with only a negligible increase in latency. We also implement a dynamic adapter-switching mechanism that triggers reasoning only when necessary, together with a KV-cache sharing strategy during prompt encoding that reduces time-to-first-token for on-device inference. Experiments on Qwen2.5-7B show that the method delivers efficient and accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile applications. Demonstration videos of the solution running on mobile devices are available on our project page.
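Two of the ideas above can be sketched in a few lines. The abstract does not specify the reward used for budget enforcement or the aggregation rule used for parallel test-time scaling, so both functions below are assumptions chosen for illustration: a correctness reward with a linear penalty for tokens beyond a budget, and majority voting over final answers sampled in parallel. All names and the penalty coefficient are hypothetical.

```python
from collections import Counter


def budgeted_reward(is_correct, num_tokens, budget, penalty_per_token=0.001):
    """Correctness reward minus a linear penalty for tokens beyond the budget.

    Assumed reward shape for illustration; not the paper's formulation.
    """
    overflow = max(0, num_tokens - budget)
    return (1.0 if is_correct else 0.0) - penalty_per_token * overflow


def majority_vote(answers):
    """Return the most frequent final answer among parallel samples."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # A correct response within budget keeps the full reward.
    print(budgeted_reward(True, num_tokens=150, budget=200))
    # A correct but over-long response is penalized.
    print(budgeted_reward(True, num_tokens=300, budget=200))
    # Aggregate final answers from several parallel decoding streams.
    print(majority_vote(["42", "41", "42", "42"]))
```

Voting suits memory-bound decoding because several sequences decoded in one batch reuse the same weight reads, so the extra samples add little latency compared with generating one longer trace.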


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
