Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction
Abstract:
While Transformer architectures achieve state-of-the-art accuracy through dense full-attention mechanisms, their time and memory costs grow quadratically with sequence length, which hinders practical deployment. Linear attention mechanisms scale linearly or near-linearly but frequently suffer from performance drops. Hybrid models that combine full and linear attention layers aim to balance efficiency and expressiveness; however, they face two significant hurdles: training such hybrids from scratch is computationally prohibitive, and manually determining the optimal arrangement of attention types is extremely difficult. To address these issues, we introduce DtR (Distill-then-Replace). The method first transfers weights from pretrained full-attention modules to their linear attention counterparts using blockwise local distillation. It then applies a greedy layer replacement strategy that iteratively swaps full attention blocks for linear ones while monitoring validation performance on the target task. DtR produces a task-specific hybrid model in a single, efficient pass, eliminating the need for expensive retraining or neural architecture search. The approach is versatile and can be applied to any pretrained full-attention backbone for various downstream tasks.
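The greedy replacement step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the stopping rule, the scoring metric, or any interface, so everything named here is an assumption made for illustration: the function `greedy_replace`, the `evaluate` callback, the `tolerance` threshold on the validation drop, and the toy per-layer costs. It also assumes the linear blocks have already been distilled, so that swapping a layer only requires re-evaluating on the validation set.

```python
from typing import Callable, List


def greedy_replace(
    num_layers: int,
    evaluate: Callable[[List[str]], float],
    tolerance: float,
) -> List[str]:
    """Greedily swap full-attention layers for linear ones.

    `evaluate` maps a layer configuration (a list of "full"/"linear")
    to a validation score where higher is better. Both the callback
    and the tolerance-based stopping rule are assumptions; the
    abstract does not specify them.
    """
    config = ["full"] * num_layers
    baseline = evaluate(config)  # score of the all-full-attention model

    while True:
        best_idx, best_score = None, float("-inf")
        # Try replacing each remaining full-attention layer on its own.
        for i, kind in enumerate(config):
            if kind != "full":
                continue
            trial = config.copy()
            trial[i] = "linear"
            score = evaluate(trial)
            if score > best_score:
                best_idx, best_score = i, score

        # Stop when nothing is left to replace, or when even the best
        # candidate drops the score by more than the tolerance.
        if best_idx is None or baseline - best_score > tolerance:
            break
        config[best_idx] = "linear"

    return config


# Toy stand-in for task validation: each layer has a made-up accuracy
# cost that is paid when that layer is switched to linear attention.
LAYER_COST = [0.001, 0.05, 0.002, 0.2]


def toy_evaluate(config: List[str]) -> float:
    penalty = sum(c for c, kind in zip(LAYER_COST, config) if kind == "linear")
    return 1.0 - penalty


hybrid = greedy_replace(len(LAYER_COST), toy_evaluate, tolerance=0.01)
print(hybrid)
```

In this toy setup the two cheapest layers are replaced and the two costly ones keep full attention. In practice `evaluate` would run the candidate hybrid model on the task's validation set, so each greedy round costs one evaluation per remaining full-attention layer.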
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



