MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation
Original: arXiv:2605.01374v2
Announce Type: replace
Abstract: While knowledge distillation serves as a primary method for compressing large language models (LLMs), current approaches typically restrict alignment to fixed layers or token-level outputs. This narrow focus overlooks the dynamic evolution of representations across network depth, resulting in weak guidance for students attempting to replicate the teacher’s internal relational structures and ultimately hindering effective knowledge transfer. To overcome this bottleneck, we introduce Multi-Granular Trajectory Alignment (MTA), a novel framework designed to align teacher and student representations throughout their layer-wise transformation journey. MTA employs a layer-adaptive mechanism: it aligns lower layers at the word level to safeguard lexical details, while operating at the phrase level—such as noun and verb phrases—in higher layers to better capture compositional semantics. We realize this concept via a Dynamic Structural Alignment loss, which synchronizes the relative geometry of semantic units within each layer. This architectural choice is supported by empirical evidence showing that Transformer representations grow more abstract as depth increases, aligning with linguistic theories that posit higher-level meaning arises from the composition of basic lexical elements. Additionally, we integrate a Hidden Representation Alignment loss to facilitate direct alignment between specific teacher and student layers. Our experimental results demonstrate that MTA consistently surpasses state-of-the-art baselines on standard benchmarks, with ablation studies validating the efficacy of each individual component.
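The abstract does not give the formula for the Dynamic Structural Alignment loss, so the following is only a minimal sketch of one plausible reading of "synchronizing the relative geometry of semantic units within each layer". It assumes that semantic units are mean-pooled token spans, that relative geometry is the pairwise cosine-similarity matrix of those units, that teacher and student share a tokenization, and that the word-to-phrase switch happens at a fixed fraction of depth. The names `pool_units`, `pairwise_cosine`, `structural_alignment_loss`, `spans_for_layer`, and the parameter `switch_frac` are all hypothetical and do not come from the paper.

```python
import numpy as np


def pool_units(hidden, spans):
    """Mean-pool token vectors into one vector per semantic unit.

    hidden: array of shape (num_tokens, hidden_dim) for one layer.
    spans:  list of (start, end) token index pairs, end exclusive.
            Single-token spans give word-level units; longer spans
            give phrase-level units (assumed to come from a parser).
    """
    return np.stack([hidden[start:end].mean(axis=0) for start, end in spans])


def pairwise_cosine(units, eps=1e-8):
    """Return the (num_units, num_units) cosine-similarity matrix."""
    norms = np.linalg.norm(units, axis=1, keepdims=True)
    normed = units / np.maximum(norms, eps)
    return normed @ normed.T


def structural_alignment_loss(teacher_hidden, student_hidden, spans):
    """Mean squared difference between teacher and student similarity matrices.

    Assumption: this is an illustrative stand-in, not the paper's loss.
    Because only unit-to-unit similarities are compared, the teacher and
    student may have different hidden sizes.
    """
    teacher_sim = pairwise_cosine(pool_units(teacher_hidden, spans))
    student_sim = pairwise_cosine(pool_units(student_hidden, spans))
    return float(np.mean((teacher_sim - student_sim) ** 2))


def spans_for_layer(layer, num_layers, word_spans, phrase_spans, switch_frac=0.5):
    """Pick word-level spans for lower layers and phrase-level spans above.

    Assumption: a hard switch at switch_frac of the depth; the paper's
    layer-adaptive mechanism may differ.
    """
    return word_spans if layer < switch_frac * num_layers else phrase_spans


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(6, 16))  # 6 tokens, teacher hidden size 16
    student = rng.normal(size=(6, 8))   # same tokens, student hidden size 8
    word_spans = [(i, i + 1) for i in range(6)]
    phrase_spans = [(0, 2), (2, 4), (4, 6)]  # hypothetical phrase boundaries

    for layer in (0, 11):
        spans = spans_for_layer(layer, 12, word_spans, phrase_spans)
        print(layer, len(spans), structural_alignment_loss(teacher, student, spans))
```

Comparing similarity matrices rather than raw vectors makes this sketch invariant to rotations and rescaling of the student's hidden space, which is one reason a relational loss can be applied without a projection layer. A direct layer-to-layer loss such as the Hidden Representation Alignment term would, by contrast, typically need some mapping between the two hidden sizes.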
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





