Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model Enhancement
Title: Align-KD: Enhancing Mobile Vision-Language Models by Distilling Cross-Modal Alignment Insights
Abstract: Vision-Language Models (VLMs) bring strong reasoning and comprehension capabilities to multimodal applications. At the same time, demand for advanced artificial intelligence on mobile platforms, particularly in AI assistant applications, is growing. To broaden the reach of VLMs, researchers are increasingly attempting to deploy them on edge devices. Reducing model size is the standard approach, but a smaller architecture sharpens the trade-off between performance and size. Knowledge distillation (KD) can improve a model's capabilities without increasing its size or requiring additional data. However, current distillation techniques for large models mainly target single-modal Large Language Models (LLMs) or rely on the teacher to generate synthetic data for the student. These methods overlook the cross-modal alignment knowledge that is most vital to VLMs.
To address this gap, we introduce Align-KD, a method that guides the student model to learn cross-modal matching at the shallow layers. The teacher also helps the student learn to project visual tokens into the text embedding space, guided by the focus of the text. With Align-KD, the 1.7B-parameter MobileVLM V2 model learns rich knowledge from a 7B teacher model through a lightweight training loss design. This yields an average score improvement of 2.0 across six benchmarks, evaluated under two distinct training subsets. The source code is publicly available at: https://github.com/fqhank/Align-KD.
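The abstract does not spell out the training loss, so the following is only a minimal sketch of how a distillation objective of this general kind is often assembled; it is not the Align-KD loss. It assumes two generic terms: a temperature-softened KL term on output logits (standard Hinton-style KD) and a mean-squared-error term that pulls the student's shallow-layer text-to-vision attention toward the teacher's. The function names, the weights `alpha` and `beta`, the temperature, and the assumption that attention maps are already averaged over heads (so teacher and student shapes match) are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def logit_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Generic logit distillation: KL between softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


def attention_alignment_loss(teacher_attn, student_attn):
    """MSE between text-to-vision attention maps (hypothetical alignment term).

    Each argument is a list of rows; row i holds the attention of text token i
    over the visual tokens, assumed already averaged over attention heads.
    """
    total, count = 0.0, 0
    for t_row, s_row in zip(teacher_attn, student_attn):
        for t, s in zip(t_row, s_row):
            total += (t - s) ** 2
            count += 1
    return total / count


def combined_loss(task_loss, teacher_logits, student_logits,
                  teacher_attn, student_attn, alpha=1.0, beta=1.0):
    """Task loss plus weighted distillation terms; alpha and beta are illustrative."""
    kd = logit_kd_loss(teacher_logits, student_logits)
    align = attention_alignment_loss(teacher_attn, student_attn)
    return task_loss + alpha * kd + beta * align
```

When the student reproduces the teacher's logits and attention exactly, both distillation terms vanish and `combined_loss` reduces to the task loss; any mismatch adds a positive penalty. In practice the teacher outputs would be computed without gradients and only the student would be updated.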
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



