arXiv

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model Enhancement

Title: Align-KD: Enhancing Mobile Vision-Language Models by Distilling Cross-Modal Alignment Insights

Abstract: Vision-Language Models (VLMs) offer robust reasoning and comprehension skills for multimodal applications. At the same time, demand is growing for sophisticated artificial intelligence on mobile platforms, particularly in AI assistant applications. To broaden the utility of VLMs, researchers are increasingly attempting to deploy them on edge devices. Reducing model complexity is the standard approach, but shrinking the architecture sharpens the trade-off between performance and model size. Knowledge distillation (KD) offers a way to improve a model's overall capabilities without expanding its footprint or requiring additional data. However, current distillation techniques for large models primarily focus on single-modal Large Language Models (LLMs) or rely on teachers to generate synthetic data for students. These methods overlook the knowledge that matters most in VLMs: cross-modal alignment.

To address this gap, we introduce Align-KD, a novel method designed to guide student models in mastering cross-modal matching at the shallow layers. Furthermore, the teacher model assists the student in learning how to project visual tokens into the text embedding space, guided by textual focus. Leveraging Align-KD, the 1.7B parameter MobileVLM V2 model is able to absorb extensive knowledge from a 7B teacher model through a streamlined training loss design. This approach yields an average score increase of 2.0 across six benchmarks, evaluated under two distinct training subsets. The source code is publicly accessible at: https://github.com/fqhank/Align-KD.
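To make the idea of distilling cross-modal matching concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "cross-modal matching" can be represented as each text token's attention distribution over the visual tokens in one shallow layer, and that the student is pulled toward the teacher with a KL divergence. The abstract does not specify the loss, so the function names (`softmax`, `kl_divergence`, `alignment_distill_loss`) and the choice of KL are hypothetical illustrations; the actual loss design is in the linked repository.

```python
import math


def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_distill_loss(teacher_scores, student_scores):
    """Mean KL(teacher || student) over text tokens.

    Assumption (hypothetical layout): each argument is a list of rows, one row
    per text token, holding that token's raw attention scores over the visual
    tokens in a single shallow layer.
    """
    if len(teacher_scores) != len(student_scores):
        raise ValueError("teacher and student must cover the same text tokens")
    losses = [
        kl_divergence(softmax(t_row), softmax(s_row))
        for t_row, s_row in zip(teacher_scores, student_scores)
    ]
    return sum(losses) / len(losses)


# Two text tokens attending over three visual tokens (made-up scores).
teacher = [[2.0, 0.0, 0.0], [0.0, 1.0, 3.0]]
student = [[0.0, 0.0, 0.0], [0.0, 1.0, 3.0]]
loss = alignment_distill_loss(teacher, student)
```

In this toy input the second text token already matches the teacher, so only the first token contributes to the loss. In training, a term of this kind would be added to the usual language-modeling loss; how it is weighted is not stated in the abstract.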


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
