Tutorials | Source: MARKTECHPOST | June 12, 2026 | Importance: 4/5

Zyphra Releases Zamba2-VL: Open 1.2B–7B Hybrid Mamba2-Transformer Vision-Language Models with 10× Lower Time-to-First-Token

English summary

Zyphra has released Zamba2-VL, a family of open vision-language models in three sizes: 1.2B, 2.7B, and 7B parameters. Each model uses a hybrid backbone that combines Mamba2 state-space layers with a small number of shared transformer blocks in place of a dense-attention stack, which gives near-linear inference scaling with sequence length. A Qwen2.5-VL vision encoder feeds this backbone, and the models support single- and multi-image understanding as well as grounding. Across 14 benchmarks, Zamba2-VL is strong on visual counting and document understanding (e.g., 90.9 on DocVQA for the 2.7B model) but lags larger baselines on knowledge-heavy reasoning such as MMMU and MathVista. Its main advantage is a time-to-first-token roughly an order of magnitude lower than that of comparable Transformer VLMs, which particularly benefits long multimodal inputs and on-device deployment. The weights are released under the Apache 2.0 license on Hugging Face, with inference code available.
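To make the scaling claim concrete, here is a minimal back-of-the-envelope sketch of why prefill cost (and therefore time-to-first-token) diverges between a dense-attention layer and a state-space layer as the input grows. The cost formulas are textbook approximations, and the function names and default values for `d_model` and `d_state` are hypothetical choices for illustration. None of this is Zyphra's measurement or implementation, and a hybrid model with a few shared attention blocks would fall somewhere between the two extremes.

```python
# Illustrative cost model only. All constants and defaults are assumptions,
# not figures reported for Zamba2-VL.

def attention_prefill_cost(n_tokens: int, d_model: int) -> int:
    """Approximate FLOPs for one attention layer's QK^T scores plus the
    weighted sum over values: about 2*n^2*d each, so 4*n^2*d in total.
    Projections are ignored because they are linear in n for both designs."""
    return 4 * n_tokens * n_tokens * d_model


def ssm_prefill_cost(n_tokens: int, d_model: int, d_state: int) -> int:
    """Approximate FLOPs for one state-space scan: each token updates and
    reads a d_model x d_state state once, so the cost is linear in n."""
    return 2 * n_tokens * d_model * d_state


def cost_ratio(n_tokens: int, d_model: int = 2048, d_state: int = 64) -> float:
    """How many times more expensive the attention layer is at this length."""
    return attention_prefill_cost(n_tokens, d_model) / ssm_prefill_cost(
        n_tokens, d_model, d_state
    )


if __name__ == "__main__":
    for n in (1024, 4096, 16384):
        print(f"{n:>6} tokens: attention/SSM cost ratio = {cost_ratio(n):.0f}x")
```

Under these assumptions the ratio simplifies to `2 * n_tokens / d_state`, so it doubles every time the input length doubles. That is the qualitative reason long multimodal inputs benefit most from the hybrid design.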

Key points

Three sizes: 1.2B, 2.7B, and 7B parameters, released under the Apache 2.0 license.
Hybrid backbone combining Mamba2 state-space layers with shared transformer blocks, replacing dense attention.
Time-to-first-token is roughly an order of magnitude lower than that of comparable Transformer VLMs, especially on long sequences.
Strong performance on visual counting (PixMoCount, CountBenchQA) and document understanding (DocVQA), but weaker on knowledge reasoning (MMMU, MathVista).
Uses the Qwen2.5-VL vision encoder, with 2D rotary embeddings and dynamic resolution, and supports grounding tasks.
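Dynamic resolution means the number of visual tokens grows with image size, which is what makes multi-image and document inputs long. The sketch below estimates that count. The 14-pixel patch size and 2x2 patch merge reflect the publicly documented Qwen2.5-VL encoder configuration; whether Zamba2-VL keeps exactly these values is an assumption, and the helper names are hypothetical. Rounding each side up to a whole 28-pixel cell is a simplification of the real resizing logic.

```python
import math

PATCH = 14  # assumed ViT patch size in pixels
MERGE = 2   # assumed spatial merge factor: 2x2 patches become one token


def visual_tokens(height: int, width: int,
                  patch: int = PATCH, merge: int = MERGE) -> int:
    """Estimate visual tokens for one image under dynamic resolution.

    Each token covers a (patch * merge)-pixel square, so the count is the
    number of such cells needed to tile the image."""
    cell = patch * merge
    rows = max(1, math.ceil(height / cell))
    cols = max(1, math.ceil(width / cell))
    return rows * cols


def total_visual_tokens(image_sizes: list[tuple[int, int]]) -> int:
    """Sum the estimate over several (height, width) images."""
    return sum(visual_tokens(h, w) for h, w in image_sizes)


if __name__ == "__main__":
    print(visual_tokens(448, 448))                        # small thumbnail
    print(visual_tokens(1344, 1792))                      # scanned page
    print(total_visual_tokens([(1344, 1792)] * 4))        # four pages
```

Under these assumptions a 448x448 image yields 256 tokens and a 1344x1792 page yields 3072, so a four-page document alone contributes over twelve thousand tokens before any text is added.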
