Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM
Title: Query-Driven Cross-Modal Projector Enhances Mamba-Based Multimodal LLMs
Abstract: The quadratic computational complexity associated with input length in Transformer architectures creates an unsustainable burden for large language models (LLMs). Mamba, which utilizes a Selective Scan Structured State-Space Model, offers a robust solution to this efficiency bottleneck. This study introduces a novel query-based cross-modal projector aimed at optimizing Mamba’s performance in vision-language tasks. By leveraging cross-attention mechanisms, the projector compresses visual tokens according to the input query. Furthermore, this approach eliminates the requirement for manual specification of the 2D scan order typically needed to transform original image features into input sequences for Mamba LLMs. Evaluations across multiple vision-language understanding benchmarks demonstrate that integrating this cross-modal projector significantly improves both the accuracy and processing speed of Mamba-based multimodal LLMs.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
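
Illustrative sketch: the abstract describes a projector that uses cross-attention to compress visual tokens according to the input query. The pure-Python example below shows one way such a step could look; it is not the authors' implementation. All names (`project`, `latents`, `W_txt`, `W_q`, `W_k`, `W_v`), the mean pooling of the text query, the single attention head, the dimensions, and the random weights are assumptions made for illustration. In a real model the weights would be learned.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols, scale=0.1):
    """Random stand-in for a weight matrix that would normally be learned."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def project(visual, text, p):
    """Compress len(visual) visual tokens into len(p['latents']) output tokens."""
    # Summarize the text query by mean pooling (an assumption), then map it to model width.
    pooled = [[sum(col) / len(text) for col in zip(*text)]]   # 1 x d_txt
    summary = matmul(pooled, p["W_txt"])[0]                   # d_model values
    # Condition every latent query on the text summary.
    cond = [[l + s for l, s in zip(lat, summary)] for lat in p["latents"]]
    q = matmul(cond, p["W_q"])      # num_queries x d_model
    k = matmul(visual, p["W_k"])    # n_vis x d_model
    v = matmul(visual, p["W_v"])    # n_vis x d_model
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    # Each query attends over all visual tokens (scaled dot-product attention).
    attn = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(attn, v)          # num_queries x d_model


rng = random.Random(0)
d_vis, d_txt, d_model, num_queries = 16, 8, 12, 8
params = {
    "latents": rand_matrix(rng, num_queries, d_model),
    "W_txt": rand_matrix(rng, d_txt, d_model),
    "W_q": rand_matrix(rng, d_model, d_model),
    "W_k": rand_matrix(rng, d_vis, d_model),
    "W_v": rand_matrix(rng, d_vis, d_model),
}
visual = rand_matrix(rng, 49, d_vis, scale=1.0)  # e.g. a 7x7 patch grid, flattened
text = rand_matrix(rng, 5, d_txt, scale=1.0)     # 5 query-token embeddings

out = project(visual, text, params)
shuffled = project(rng.sample(visual, len(visual)), text, params)

print(len(out), len(out[0]))  # 8 12
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(out, shuffled)
    for a, b in zip(row_a, row_b)
)
print(same)  # True
```

Two properties of this sketch relate to the abstract's claims. The output length is fixed by the number of queries (8 here) rather than by the number of visual tokens (49), which is the compression step. And because no positional encoding is added to the keys or values in this sketch, shuffling the visual tokens leaves the output unchanged, so the sequence handed to the language model does not depend on how the 2D patch grid was flattened.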






