Title: Logit Distillation on Manifolds: Mapping by Learning
Abstract:
It is well established that the performance of machine learning systems can be improved by combining multiple models with diverse architectures rather than relying on a single algorithm. These models produce slightly different predictions and errors on the same data, so averaging their outputs improves robustness and accuracy. However, deploying a full ensemble is often impractical because of its computational cost and operational complexity, particularly for large-scale neural networks intended for broad user access.
To address this challenge, we propose a layer-wise and point-wise projection mapping technique that aligns the representations of the student and teacher models in a shared, high-dimensional embedding space during training. Combined with LoRA injection, this strategy reduces the number of trainable parameters in the student model to under 1% of those in the teacher model. Ablation studies confirm that the approach yields significant improvements in word error rate (WER) over existing distillation techniques. Furthermore, unlike mixture-of-experts models, our method supports rapid, parallel training.
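The abstract does not give the exact formulation, so the following is only a minimal sketch of the two ingredients it names, under stated assumptions. It assumes LoRA takes its usual form (a frozen weight plus a trainable low-rank update) and that the projection mapping can be approximated by two linear maps, hypothetically named `P_s` and `P_t`, that send student and teacher hidden states into a shared space where a mean squared error is averaged over layers and positions. All class and function names, the use of a single projection pair for every layer, and the choice of loss are illustrative assumptions, not the authors' method. Plain Python lists are used to keep the sketch dependency-free; a practical implementation would use a tensor library.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix drawn from a Gaussian (illustrative init)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, d_in, d_out, rank, alpha, rng):
        self.W = rand_matrix(d_out, d_in, rng)          # frozen
        self.A = rand_matrix(rank, d_in, rng)           # trainable
        self.B = [[0.0] * rank for _ in range(d_out)]   # trainable, zero init
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only A and B are trained: rank * (d_in + d_out) values.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


def alignment_loss(student_layers, teacher_layers, P_s, P_t):
    """Mean squared distance between projected student and teacher states.

    Each argument `*_layers` is a list over layers of a list over positions
    of hidden-state vectors. The loss is averaged over layers (layer-wise)
    and over positions (point-wise). Assumption: one shared projection pair
    (P_s, P_t) is reused for every layer.
    """
    total, count = 0.0, 0
    for student_seq, teacher_seq in zip(student_layers, teacher_layers):
        for h_s, h_t in zip(student_seq, teacher_seq):
            z_s, z_t = matvec(P_s, h_s), matvec(P_t, h_t)
            total += sum((a - b) ** 2 for a, b in zip(z_s, z_t)) / len(z_s)
            count += 1
    return total / count


if __name__ == "__main__":
    rng = random.Random(0)
    layer = LoRALinear(d_in=16, d_out=16, rank=2, alpha=4.0, rng=rng)
    x = [rng.gauss(0.0, 1.0) for _ in range(16)]
    h_student = layer.forward(x)                       # student state, dim 16
    h_teacher = [rng.gauss(0.0, 1.0) for _ in range(32)]  # teacher state, dim 32
    P_s = rand_matrix(8, 16, rng)                      # student -> shared space
    P_t = rand_matrix(8, 32, rng)                      # teacher -> shared space
    print("trainable:", layer.trainable_params(), "frozen:", layer.frozen_params())
    print("loss:", alignment_loss([[h_student]], [[h_teacher]], P_s, P_t))
```

In this sketch the trainable fraction depends on the rank and layer width: a rank-r adapter on a d_out x d_in weight trains r * (d_in + d_out) values, so the fraction shrinks as the layers grow wider. The toy sizes above are chosen for readability and do not reproduce the under-1% figure reported in the abstract.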
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




