llama.cpp b9626 Adds Support for Cohere2 Mixture of Experts Architecture
Summary
Release b9626 of llama.cpp adds support for the Cohere2 Mixture of Experts (MoE) architecture under the new architecture name "cohere2moe". It fixes the sliding window attention pattern, resolves MTP failures by switching to iSWA, and changes how the shared expert is combined with the routed experts to (routed + shared) * 0.5. Redundant gating function checks, the lm_head tensor check, and the cohere2-moe tokenizer type definition were removed; the tokenizer remains tiny_aya. Builds are provided for macOS (Apple Silicon/Intel), Linux (x64/arm64, with Vulkan, ROCm, OpenVINO, and SYCL variants), Android, and Windows (CPU/CUDA/Vulkan/SYCL/HIP), along with UI support.
Key points
Added new architecture "cohere2moe" to support Cohere2 Mixture of Experts models.
Shared expert combination changed to (routed + shared) * 0.5, i.e. the average of the routed and shared expert outputs.
Fixed sliding window attention pattern and resolved MTP failures by switching to iSWA.
Removed redundant gating_func checks, the lm_head tensor dependency, and the unnecessary tokenizer type cohere2-moe; the tokenizer is kept as tiny_aya.
Builds available for diverse platforms including macOS, Linux (CPU/Vulkan/ROCm/OpenVINO/SYCL), Android, and Windows (CPU/CUDA/Vulkan/SYCL/HIP).
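The shared expert combination noted above, (routed + shared) * 0.5, is an element-wise average of the two outputs. The following is a minimal illustrative sketch only, not llama.cpp's code: the function name combine_expert_outputs is hypothetical, and plain Python lists stand in for the tensors the real implementation operates on.

```python
def combine_expert_outputs(routed, shared):
    """Element-wise (routed + shared) * 0.5.

    Hypothetical helper for illustration; `routed` and `shared` are
    plain lists of floats standing in for expert output tensors.
    """
    if len(routed) != len(shared):
        raise ValueError("routed and shared outputs must have the same length")
    return [(r + s) * 0.5 for r, s in zip(routed, shared)]


# Made-up example values for one token's hidden vector
routed_out = [1.0, -2.0, 0.5]
shared_out = [3.0, 2.0, 0.5]
combined = combine_expert_outputs(routed_out, shared_out)
print(combined)  # [2.0, 0.0, 0.5]
```

Because both terms are scaled by the same 0.5 factor, the routed and shared paths contribute equally to the combined output.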