Zyphra Releases Zamba2-VL: Open 1.2B–7B Hybrid Mamba2-Transformer Vision-Language Models with 10× Lower Time-to-First-Token
English summary
Zyphra has released Zamba2-VL, a family of open vision-language models in three sizes: 1.2B, 2.7B, and 7B parameters. Each model uses a hybrid backbone that combines Mamba2 state-space layers with a small number of shared Transformer blocks rather than a full stack of dense-attention layers, which gives near-linear inference scaling. The models pair a Qwen2.5-VL vision encoder with this backbone and support single- and multi-image understanding and grounding. Across 14 benchmarks, Zamba2-VL is strong on visual counting and document understanding (e.g., 90.9 on DocVQA for the 2.7B model) but lags larger baselines on knowledge-heavy reasoning such as MMMU and MathVista. Its main advantage is a time-to-first-token roughly an order of magnitude lower than that of comparable Transformer VLMs, which matters most for long multimodal inputs and on-device deployment. The weights are released under the Apache 2.0 license on Hugging Face, with inference code available.
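To see why a mostly state-space backbone helps more as inputs get longer, here is a toy cost model for the prompt-processing (prefill) step that determines time-to-first-token. It is a sketch under stated assumptions, not a measurement of Zamba2-VL: the function names, the constant factors, and the sizes `d_model` and `d_state` are illustrative choices that do not come from the release. It counts only the token-mixing work in a single layer.

```python
"""Toy cost model: quadratic attention mixing vs. a linear-time state-space scan.

All constants and sizes are illustrative assumptions, not Zamba2-VL measurements.
"""


def attention_mixing_flops(n_tokens: int, d_model: int) -> int:
    """Approximate prefill FLOPs for one layer's QK^T and scores @ V products."""
    # Two (n x n x d) matrix products, at 2 FLOPs per multiply-add.
    return 4 * n_tokens * n_tokens * d_model


def ssm_mixing_flops(n_tokens: int, d_model: int, d_state: int) -> int:
    """Approximate prefill FLOPs for one layer's state update and readout (assumed form)."""
    # Each token touches a (d_model x d_state) state once to update it and once to read it.
    return 4 * n_tokens * d_model * d_state


d_model, d_state = 2048, 64  # hypothetical sizes chosen only for illustration
for n_tokens in (1_024, 8_192, 65_536):
    ratio = attention_mixing_flops(n_tokens, d_model) / ssm_mixing_flops(
        n_tokens, d_model, d_state
    )
    print(f"{n_tokens:>6} tokens: attention/SSM mixing cost ratio = {ratio:.0f}")
```

In this simplified model the ratio reduces to `n_tokens / d_state`, so the gap widens in direct proportion to prompt length. Real latency also depends on projection and MLP layers, on kernel efficiency, and on the attention blocks the hybrid retains, which is why the scaling is described as near-linear.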
Key points
Three sizes: 1.2B, 2.7B, and 7B parameters, all released under the Apache 2.0 license.
Hybrid backbone combining Mamba2 state-space layers with shared Transformer blocks in place of a full dense-attention stack.
Time-to-first-token is roughly an order of magnitude lower than for comparable Transformer VLMs, especially on long sequences.
Strong performance on visual counting (PixMoCount, CountBenchQA) and document understanding (DocVQA), but weaker on knowledge reasoning (MMMU, MathVista).
Uses the Qwen2.5-VL vision encoder with 2D rotary embeddings and dynamic resolution, and supports grounding tasks.
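The hybrid-backbone point can be illustrated with a small layer-schedule sketch. It assumes that "shared" means a single Transformer block whose weights are reused at several depths. The layer counts, the interleaving interval, the parameter numbers, and the names `Block`, `build_schedule`, and `unique_params` are all hypothetical and are not taken from the Zamba2-VL release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A layer type with a parameter count (all numbers here are hypothetical)."""

    name: str
    params: int


def build_schedule(n_mamba: int, shared_every: int, mamba: Block, shared: Block) -> list[Block]:
    """Insert the one shared block after every `shared_every` Mamba2 layers."""
    schedule: list[Block] = []
    for i in range(1, n_mamba + 1):
        # Each Mamba2 layer gets its own name, so it counts as distinct weights.
        schedule.append(Block(f"{mamba.name}_{i}", mamba.params))
        if i % shared_every == 0:
            # The same block is appended every time: its weights are reused.
            schedule.append(shared)
    return schedule


def unique_params(schedule: list[Block]) -> int:
    """Sum parameters, counting each distinct block only once."""
    return sum(block.params for block in set(schedule))


mamba = Block("mamba2", 10)        # hypothetical per-layer parameter count
shared = Block("shared_attn", 25)  # hypothetical shared-block parameter count
schedule = build_schedule(n_mamba=6, shared_every=3, mamba=mamba, shared=shared)
print("layers applied:", len(schedule))
print("parameters stored:", unique_params(schedule))
print("parameters if unshared:", sum(block.params for block in schedule))
```

Under these assumptions the shared block runs twice but is stored once, so attention is applied at several depths without a matching increase in parameter count.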