Balancing Knowledge Distillation for Imbalance Learning with Bilevel Optimization
Title: Optimizing Knowledge Distillation for Imbalanced Learning via Bilevel Strategies
Abstract: Knowledge distillation transfers knowledge from a high-capacity teacher model to a more compact student by training the student on a weighted combination of a hard loss (against the ground-truth labels) and a soft loss (against the teacher’s outputs). On imbalanced datasets, however, keeping a fixed weight ratio between these two losses can destabilize training. Recent work reweights the two components for long-tailed distributions, but most existing approaches do not adjust the weights at the level of individual samples and ignore how the student’s behavior changes during training.
To overcome these limitations, we introduce BiKD, a bilevel optimization framework that dynamically balances the hard and soft losses on a per-sample basis. A weight generation network, guided by a small, balanced validation set, produces adaptive weights for each sample. The student model is then trained on the resulting weighted combination of hard and soft losses, which lets it optimize both terms effectively. We also present a multi-step Stochastic Gradient Descent (SGD) strategy that improves both the accuracy and the efficiency of optimizing the weight model. Experiments on long-tailed CIFAR-10 and CIFAR-100 show that our method outperforms contemporary balanced distillation techniques across various imbalance factors.
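To make the idea of balancing hard and soft losses per sample concrete, the following is a minimal Python sketch rather than the paper's implementation. The function names, the temperature of 4.0, the example logits, and the use of two separate weights per sample are all assumptions made for illustration. The per-sample weights are supplied by hand here; in BiKD they would come from the weight generation network, whose bilevel training on the balanced validation set is not shown.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_loss(student_logits, label):
    """Cross-entropy between the student's prediction and the true label."""
    return -math.log(softmax(student_logits)[label])

def soft_loss(student_logits, teacher_logits, temperature=4.0):
    """Temperature-scaled KL divergence from teacher to student (assumed form)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    return temperature ** 2 * kl

def weighted_kd_loss(batch, weights, temperature=4.0):
    """Mean of w_hard * hard + w_soft * soft, with one weight pair per sample.

    batch:   list of (student_logits, teacher_logits, label)
    weights: list of (w_hard, w_soft); hypothetical stand-ins for the
             outputs of a weight generation network.
    """
    total = 0.0
    for (s_logits, t_logits, label), (w_hard, w_soft) in zip(batch, weights):
        total += (w_hard * hard_loss(s_logits, label)
                  + w_soft * soft_loss(s_logits, t_logits, temperature))
    return total / len(batch)

# Two illustrative samples: (student logits, teacher logits, label).
batch = [
    ([2.0, 0.5, -1.0], [3.0, 0.0, -2.0], 0),
    ([0.1, 0.2, 0.3], [-1.0, 0.0, 2.5], 2),
]

# Static weighting: the same ratio for every sample.
static = weighted_kd_loss(batch, [(0.5, 0.5), (0.5, 0.5)])

# Per-sample weighting: each sample gets its own (hand-picked) ratio.
adaptive = weighted_kd_loss(batch, [(0.3, 0.7), (0.8, 0.2)])
```

With identical weight pairs for every sample, the function reduces to the fixed-ratio objective that the abstract describes as potentially unstable on imbalanced data; varying the pairs per sample is the degree of freedom that a learned weight model would control.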
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





