Tutorial | Source: MARKTECHPOST | June 9, 2026 | Importance: 5/5

Xiaomi MiMo and TileRT push a trillion-parameter model past 1000 tokens per second on commodity GPUs

Summary

Xiaomi's MiMo-V2.5-Pro-UltraSpeed runs a 1-trillion-parameter MoE model at over 1000 tokens per second on commodity GPUs, a milestone at this scale. The speedup comes from three coordinated techniques: FP4 quantization applied only to the MoE experts, DFlash speculative decoding that predicts entire token blocks in parallel, and the TileRT runtime, which is optimized for microsecond-scale operations. Rejection sampling keeps decoding lossless, so drafted tokens are verified against the full model and output quality is preserved. The system runs on a single 8-GPU node and is available through a limited API trial from June 9-23, 2026.
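The lossless claim rests on the accept/reject step of speculative decoding. The sketch below shows the standard speculative-sampling rule from the general literature, not DFlash or Xiaomi's implementation: each drafted token is accepted with probability min(1, p/q), and the first rejected position is resampled from the normalized residual max(p - q, 0). The names `sample` and `verify_block` and the list-of-lists probability layout are illustrative assumptions.

```python
import random


def sample(weights, rng):
    """Draw one index in proportion to the given non-negative weights."""
    return rng.choices(range(len(weights)), weights=weights)[0]


def verify_block(draft_tokens, draft_probs, target_probs, rng):
    """Verify one drafted block with the standard speculative-sampling rule.

    draft_tokens: k token ids proposed by the draft model.
    draft_probs:  k distributions (lists over the vocabulary) the draft used.
    target_probs: k + 1 target-model distributions, one per drafted position
                  plus one for the position right after the block.
    Returns the tokens emitted this step (between 1 and k + 1 of them).
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the drafted token with probability min(1, p[tok] / q[tok]).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # Rejected: resample from the residual max(p - q, 0) and stop here.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        emitted.append(sample(residual, rng))
        return emitted
    # Every drafted token was accepted: add one bonus token from the target.
    emitted.append(sample(target_probs[len(draft_tokens)], rng))
    return emitted


if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.5, 0.3, 0.2]  # toy draft distribution over a 3-token vocabulary
    p = [0.2, 0.5, 0.3]  # toy target distribution
    drafted = [sample(q, rng), sample(q, rng)]
    print(verify_block(drafted, [q, q], [p, p, p], rng))
```

With this rule, each emitted token is distributed exactly as if it had been sampled from the target model, which is why a faster draft does not change output quality. The number of tokens emitted per call corresponds to the acceptance length reported for such systems.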

Key takeaways

Achieves over 1000 tokens per second on a 1-trillion-parameter MoE model using commodity GPUs.
Three-layer model-system codesign: FP4 quantization, DFlash speculative decoding, and the TileRT runtime.
FP4 quantization is applied only to the MoE experts; quantization-aware training (QAT) preserves model quality.
DFlash uses block-level masked parallel prediction, reaching an average acceptance length of 6.30 on coding tasks.
The TileRT runtime uses persistent engine kernels and warp specialization for microsecond-scale efficiency.
The API trial runs June 9-23, 2026, priced at 3x the standard rate for roughly 10x the speed.
The checkpoint is open-sourced on Hugging Face; TileRT modules are partially open-sourced on GitHub.
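To make the experts-only FP4 point concrete, here is a minimal quantize-dequantize ("fake quantization") sketch of the kind commonly used in QAT forward passes. The source does not state the FP4 format, scaling scheme, or block size, so the E2M1 value grid, the per-block absmax scale, the block size of 32, and the `.experts.` parameter-name convention are all assumptions made for illustration.

```python
import math

# Non-negative magnitudes representable in the 4-bit E2M1 float format
# (assumed format; the source only says "FP4").
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quant_fp4(block):
    """Round one block of weights to E2M1 values under a shared scale."""
    amax = max(abs(w) for w in block)
    if amax == 0.0:
        return list(block)
    scale = amax / E2M1_GRID[-1]  # largest magnitude maps to 6.0
    out = []
    for w in block:
        # Pick the nearest representable magnitude, then restore scale and sign.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append(math.copysign(mag * scale, w))
    return out


def quantize_experts_only(params, block_size=32):
    """Fake-quantize only tensors whose (hypothetical) name marks an expert."""
    result = {}
    for name, weights in params.items():
        if ".experts." in name:
            result[name] = [
                v
                for i in range(0, len(weights), block_size)
                for v in fake_quant_fp4(weights[i:i + block_size])
            ]
        else:
            # Attention, router, and other weights keep their precision.
            result[name] = list(weights)
    return result


if __name__ == "__main__":
    print(fake_quant_fp4([6.0, -3.0, 0.4, 0.0]))  # [6.0, -3.0, 0.5, 0.0]
```

Restricting quantization to the experts targets the tensors that hold most of an MoE model's parameters while leaving the smaller shared layers untouched.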
