vLLM v0.23.0 Released: DeepSeek-V4 Matures Across Backends, Model Runner V2 Expands to Llama/Mistral, and Rust Frontend Grows
English summary
vLLM v0.23.0 brings 408 commits from 200 contributors and deepens support for recent models. DeepSeek-V4 received major hardening, including sparse MLA decoupling, TRTLLM-gen attention, EPLB mega-MoE, and sliding-window KV cache retention. Model Runner V2 is now the default for Llama and Mistral dense models and adds FlashInfer sampling, breakable CUDA graphs, and pipeline-parallel bubble elimination. The Rust frontend gained streaming generate, dynamic LoRA endpoints, /version and /server_info, plus new tool parsers for InternLM2, Phi-4-mini, and Gemma4. Newly supported models include Gemma 4 Unified (encoder-free), MiMo-V2.5, Step-3.7-Flash, Cosmos3 Reasoner, and Cohere Mini Code. The release also deprecates Transformers v4, unifies reasoning/tool-call parsing, and introduces a multi-tier KV cache offloading framework with an object-store secondary tier.
Key points
DeepSeek-V4 received major hardening: sparse MLA decoupled from V3.2, TRTLLM-gen attention kernel, EPLB mega-MoE, selective prefix-cache retention, and index-share for DSA MTP.
Model Runner V2 is now the default for Llama and Mistral dense models, with a FlashInfer sampler, breakable CUDA graphs, and pipeline-parallel bubble elimination.
The Rust frontend adds streaming generate, dynamic LoRA endpoints, /version and /server_info, and new tool parsers for InternLM2, Phi-4-mini, and Gemma4.
New models include Gemma 4 Unified (encoder-free), MiMo-V2.5, Step-3.7-Flash, Cosmos3 Reasoner, and Cohere Mini Code.
Transformers v4 support is deprecated; vLLM now targets Transformers v5 and vendors MiniCPM-V/O processors.
Multi-tier KV cache offloading is added, with an object-store secondary tier and a per-request offloading policy via lifecycle hooks.
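As a rough illustration of the new frontend endpoints, the sketch below builds requests for /version, /server_info, and a dynamic LoRA load using only the Python standard library. The base URL, the `/v1/load_lora_adapter` path, and the `lora_name`/`lora_path` fields are assumptions modeled on vLLM's existing Python server; the summary above does not confirm that the Rust frontend uses the same paths or payloads. The helper names (`build_get`, `build_load_lora`, `fetch_json`) are hypothetical.

```python
import json
from urllib import request

# Assumption: a vLLM server is listening locally on the commonly used port.
BASE_URL = "http://localhost:8000"


def build_get(path, base_url=BASE_URL):
    """Build a GET request for an informational endpoint such as /version."""
    return request.Request(base_url.rstrip("/") + path, method="GET")


def build_load_lora(name, path, base_url=BASE_URL):
    """Build a POST request that asks the server to load a LoRA adapter.

    Assumption: the endpoint path and JSON fields mirror vLLM's Python
    server; the Rust frontend may differ.
    """
    body = json.dumps({"lora_name": name, "lora_path": path}).encode("utf-8")
    return request.Request(
        base_url.rstrip("/") + "/v1/load_lora_adapter",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(req, timeout=5.0):
    """Send a prepared request and decode the JSON response body."""
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, `fetch_json(build_get("/version"))` would return the parsed version payload. Building the request separately from sending it keeps the sketch inspectable without network access.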