arXiv

LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models

June 2, 2026 · Prateek Kumar Sikdar

Title: LayerRoute: Adaptive Layer Skipping in Agentic Language Models Through Input-Conditioned LoRA Fine-Tuning

Abstract

Agentic language models typically alternate between two fundamentally different operational modes: structured tool calls, which are short, deterministic, and low in perplexity, and open-ended planning or reasoning phases, which are long, complex, and high in perplexity. Existing inference frameworks nevertheless spend the same amount of compute on every step, ignoring this structural heterogeneity. To address this inefficiency, we present LayerRoute, a lightweight adapter that selectively bypasses transformer blocks on a per-input basis.

LayerRoute is integrated into all 24 transformer blocks of the Qwen2.5-0.5B-Instruct model. It adds two components to each block: a router with 897 parameters (a Linear(896,1) layer, i.e. 896 weights plus one bias) that produces a hard binary gate trained with a straight-through estimator, and rank-8 LoRA adapters on the Q/K/V/O attention projections. Across all 24 blocks, the LoRA adapters add roughly 1.08 million parameters. The backbone weights remain frozen throughout.
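The router and its hard gate can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not say how the router input is formed or which gate value means "skip", so the mean-pooled input vector `pooled`, the 0.5 threshold, the convention that a gate of 1 executes the block, and all function names are assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def router_forward(pooled, weight, bias):
    """Linear(d, 1) router followed by a hard threshold.

    Returns (gate, prob): gate is 0.0 or 1.0, prob is the soft sigmoid score.
    `pooled` is a hypothetical length-d summary of the block input.
    """
    logit = sum(w * h for w, h in zip(weight, pooled)) + bias
    prob = sigmoid(logit)
    gate = 1.0 if prob > 0.5 else 0.0  # hard, non-differentiable decision
    return gate, prob


def router_backward(grad_gate, pooled, prob):
    """Straight-through estimator.

    The threshold has zero gradient almost everywhere, so the backward pass
    pretends gate == prob and differentiates through the sigmoid instead.
    """
    grad_logit = grad_gate * prob * (1.0 - prob)
    grad_weight = [grad_logit * h for h in pooled]
    grad_bias = grad_logit
    return grad_weight, grad_bias


def gated_block(hidden, block_fn, gate):
    """Assumed convention: gate 1 runs the block, gate 0 passes hidden through."""
    if gate == 0.0:
        return hidden
    return block_fn(hidden)


# Toy usage with the hidden size implied by Linear(896, 1).
d = 896
weight = [0.01] * d
bias = 0.0
num_router_params = len(weight) + 1  # 896 weights + 1 bias

gate_on, prob_on = router_forward([0.5] * d, weight, bias)     # positive logit
gate_off, prob_off = router_forward([-0.5] * d, weight, bias)  # negative logit
grad_w, grad_b = router_backward(1.0, [0.5] * d, prob_on)
skipped = gated_block([1.0, 2.0], lambda h: [2.0 * x for x in h], gate_off)
```

With inference-time hard gates, a skipped block costs nothing beyond the router's single dot product, which is where the FLOP savings come from in a scheme of this kind.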

A single end-to-end training pass on agentic datasets (Hermes, Glaive, GSM8K, and Turing), combined with a gate regularization term, teaches the system which blocks can be skipped for which input types. After 3,000 training steps, which took 6.4 minutes on an A100 40GB GPU, LayerRoute shows a skip differential of 12.91 percentage points: tool calls see a 15.25% reduction in FLOPs, whereas planning steps see only 2.34%. This is achieved with 1.10 million trainable parameters, or 0.22% of the 494 million parameters in the backbone. Quality also improves on the base model because of the LoRA adaptation, with perplexity deltas of -1.29 for tool calls and -1.30 for planning tasks.
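The parameter and FLOP figures above can be cross-checked with a few lines of arithmetic. The sketch assumes the public Qwen2.5-0.5B configuration (K and V projections of width 128, i.e. 2 key/value heads of dimension 64) and bias-free LoRA factors; neither detail is stated in the abstract, and the variable names are illustrative.

```python
HIDDEN = 896       # from the Linear(896, 1) router in the abstract
KV_DIM = 128       # assumption: 2 KV heads x 64 head dim (Qwen2.5-0.5B config)
NUM_BLOCKS = 24    # from the abstract
RANK = 8           # from the abstract
BACKBONE = 494e6   # backbone parameter count from the abstract


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA adds A (rank x d_in) and B (d_out x rank); no bias assumed.
    return rank * (d_in + d_out)


router_per_block = HIDDEN + 1  # 896 weights + 1 bias

lora_per_block = (
    lora_params(HIDDEN, HIDDEN, RANK)    # Q projection
    + lora_params(HIDDEN, KV_DIM, RANK)  # K projection
    + lora_params(HIDDEN, KV_DIM, RANK)  # V projection
    + lora_params(HIDDEN, HIDDEN, RANK)  # O projection
)

lora_total = NUM_BLOCKS * lora_per_block      # about 1.08 million
router_total = NUM_BLOCKS * router_per_block  # about 0.02 million
trainable = lora_total + router_total         # about 1.10 million

trainable_pct = 100.0 * trainable / BACKBONE  # share of backbone, in percent

# Skip differential: gap between the two reported FLOP reductions.
skip_differential = round(15.25 - 2.34, 2)

print(f"LoRA: {lora_total:,}  routers: {router_total:,}  total: {trainable:,}")
print(f"trainable share: {trainable_pct:.2f}%  skip differential: {skip_differential}")
```

Under these assumptions the totals agree with the reported 1.08 million LoRA parameters, 1.10 million trainable parameters, and 0.22% share, and the skip differential is simply the difference between the tool-call and planning FLOP reductions.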
