arXiv

Zamba2-VL Technical Report

June 2, 2026 · Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge

Abstract

This paper introduces Zamba2-VL, a family of vision-language models (VLMs) constructed upon Zamba2. The underlying Zamba2 architecture is a hybrid design that integrates Mamba2 state-space layers with a limited set of shared transformer blocks. Our evaluation demonstrates that Zamba2-VL performs competitively against top-tier open-weight Transformer-based VLMs of similar size, such as the Molmo2, Qwen3-VL, and InternVL3.5 series, across diverse tasks including image comprehension, reasoning, optical character recognition (OCR), grounding, and counting. Furthermore, it significantly surpasses earlier SSM-based and hybrid VLMs, including VL-Mamba, Cobra, and mmMamba.

By leveraging the Zamba2 backbone, Zamba2-VL benefits from prefill compute that scales near-linearly with input length and from a recurrent state that remains small and nearly constant in size. Consequently, these models achieve a time-to-first-token (TTFT) approximately ten times lower than that of comparable Transformer baselines. This efficiency advantage is particularly notable at the 1.2B and 2.7B parameter scales, which are critical for on-device and edge computing applications. We release three model variants (1.2B, 2.7B, and 7B) along with the corresponding inference code at https://huggingface.co/collections/Zyphra/zamba2-vl.
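To make the memory argument concrete, the following is a rough back-of-the-envelope sketch rather than anything taken from the report. It compares how a Transformer key-value cache grows with sequence length against a fixed-size recurrent state. Every layer count and dimension below is a hypothetical placeholder, not a Zamba2-VL configuration, and the function names are invented for illustration.

```python
def kv_cache_bytes(num_attn_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of a key-value cache: two tensors (K and V) per attention layer,
    each holding seq_len * num_kv_heads * head_dim elements."""
    return 2 * num_attn_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(num_ssm_layers, num_heads, head_dim, state_dim, bytes_per_elem=2):
    """Size of a recurrent state: one fixed-size tensor per state-space layer.
    Note that seq_len does not appear anywhere in this expression."""
    return num_ssm_layers * num_heads * head_dim * state_dim * bytes_per_elem


# Hypothetical shapes, chosen only to illustrate the scaling behaviour.
TRANSFORMER_ATTN_LAYERS = 32   # assumption: attention in every layer
HYBRID_ATTN_APPLICATIONS = 6   # assumption: a few attention block applications
HYBRID_SSM_LAYERS = 32         # assumption

for seq_len in (1024, 4096, 16384):
    transformer = kv_cache_bytes(TRANSFORMER_ATTN_LAYERS, 8, 128, seq_len)
    hybrid = (
        kv_cache_bytes(HYBRID_ATTN_APPLICATIONS, 8, 128, seq_len)
        + ssm_state_bytes(HYBRID_SSM_LAYERS, 64, 64, 128)
    )
    print(f"seq_len={seq_len:6d}  transformer={transformer / 2**20:8.1f} MiB  "
          f"hybrid={hybrid / 2**20:8.1f} MiB")
```

Under these assumed shapes, the Transformer cache grows in direct proportion to the number of tokens, while most of the hybrid's memory sits in a term that does not depend on sequence length. The actual ratio for any real model depends on its true layer layout and dimensions.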
