Efficient Reasoning on the Edge
Abstract
Large Language Models (LLMs) with chain-of-thought reasoning achieve state-of-the-art results on complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them hard to deploy on edge devices. The costs show up as high token-generation overhead, a large KV-cache footprint, and inefficiency when distilling reasoning capabilities into smaller models for mobile use. Existing approaches typically distill reasoning traces from larger models into smaller ones; the resulting outputs are verbose and stylistically redundant, which makes them ill-suited to on-device inference.
To address these problems, we propose a lightweight approach that enables reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further apply reinforcement learning with budget constraints to these adapters, which substantially reduces response length while preserving accuracy. Because on-device decoding is memory-bound, we use parallel test-time scaling to improve accuracy with only a small increase in latency. We also introduce a dynamic adapter-switching mechanism that activates reasoning only when it is needed, and a KV-cache sharing strategy during prompt encoding that reduces time-to-first-token for on-device inference. Experiments on Qwen2.5-7B show that the method delivers efficient and accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Demonstration videos of the solution running on mobile devices are available on our project page.
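As a rough illustration of two ideas named above, the sketch below shows one possible shape for a budget-aware reward and for aggregating parallel samples by majority vote. The abstract does not specify either mechanism, so the function names, the linear overflow penalty, the `penalty` weight, and the voting rule are all assumptions made for illustration, not the authors' formulation.

```python
from collections import Counter
from typing import Sequence


def budgeted_reward(is_correct: bool, num_tokens: int, budget: int,
                    penalty: float = 0.5) -> float:
    """Hypothetical length-aware reward (an assumption, not the paper's).

    Gives 1.0 for a correct answer and 0.0 otherwise, then subtracts a
    penalty that grows linearly with the fraction of tokens spent beyond
    the budget, capped at `penalty`.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    base = 1.0 if is_correct else 0.0
    overflow = max(0, num_tokens - budget) / budget
    return base - penalty * min(1.0, overflow)


def majority_vote(answers: Sequence[str]) -> str:
    """Pick the most frequent final answer among parallel samples.

    Ties resolve to the answer that appeared first, which is how
    Counter.most_common orders equal counts.
    """
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # A correct answer within budget keeps the full reward.
    print(budgeted_reward(True, num_tokens=100, budget=200))
    # A correct answer that overshoots the budget by 50% is penalized.
    print(budgeted_reward(True, num_tokens=300, budget=200))
    # Three parallel samples, two of which agree.
    print(majority_vote(["42", "41", "42"]))
```

Under a reward of this kind, a policy is pushed toward shorter traces only once it exceeds the budget, so accuracy within the budget is not traded away. Voting over several samples decoded in parallel fits memory-bound decoding because the samples can share the same weight reads.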
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC


