Xiaomi MiMo and TileRT Push a Trillion-Parameter Model Past 1000 Tokens per Second on Commodity GPUs
Summary
Xiaomi's MiMo-V2.5-Pro-UltraSpeed generates more than 1000 tokens per second from a 1-trillion-parameter MoE model on commodity GPUs, a milestone at this scale. The speedup comes from three coordinated techniques: FP4 quantization applied only to the MoE experts, DFlash speculative decoding that predicts entire token blocks in parallel, and the TileRT runtime, which is optimized for microsecond-scale operations. Rejection sampling keeps the speculative decoding lossless, so the speedup does not come at the cost of output quality. The system runs on a single 8-GPU node and is available through a limited API trial from June 9-23, 2026.
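The claim that rejection sampling keeps speculative decoding lossless can be illustrated with the standard speculative-sampling acceptance rule. The sketch below is a minimal illustration under stated assumptions, not DFlash's actual implementation: the function names `sample` and `verify_block` are hypothetical, distributions are plain dictionaries over a toy vocabulary, and the draft block is assumed to arrive with the per-position distribution each draft token was sampled from.

```python
import random

def sample(dist, rng):
    """Draw one token from a {token: probability} dictionary."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def verify_block(draft_tokens, draft_probs, target_probs, rng):
    """Verify a drafted block with the standard rejection-sampling rule.

    draft_probs[i]  : distribution the i-th draft token was sampled from
    target_probs[i] : target model's distribution at the same position;
                      it has one extra entry for the bonus position.
    Returns the tokens actually emitted for this step.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        q, p = draft_probs[i], target_probs[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            emitted.append(tok)
            continue
        # On rejection, resample from the residual max(p - q, 0) and stop.
        # This correction makes the emitted token follow p exactly.
        residual = {t: max(p[t] - q.get(t, 0.0), 0.0) for t in p}
        emitted.append(sample(residual, rng))
        return emitted
    # Every draft token was accepted: take one bonus token from the target.
    emitted.append(sample(target_probs[len(draft_tokens)], rng))
    return emitted
```

The number of tokens returned per call is what an acceptance-length statistic measures, although the exact counting convention used in the reported figures is not stated in the source. Because a rejected position is resampled from the residual distribution, each emitted token is distributed as if the target model had produced it directly.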
Key points
Generates more than 1000 tokens per second from a 1-trillion-parameter MoE model on commodity GPUs.
Three-layer model-system co-design: FP4 quantization, DFlash speculative decoding, and the TileRT runtime.
FP4 quantization is applied only to the MoE experts; quantization-aware training (QAT) preserves model quality.
DFlash uses block-level masked parallel prediction and reaches an average acceptance length of 6.30 on coding tasks.
The TileRT runtime uses persistent engine kernels and warp specialization for efficiency at microsecond scale.
The API trial runs June 9-23, 2026, priced at 3x the standard rate for roughly 10x the speed.
The checkpoint is open-sourced on Hugging Face; TileRT modules are partially open-sourced on GitHub.
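The expert-only FP4 point can be made concrete with a small fake-quantization sketch. Everything specific here is an assumption for illustration: the source does not state which FP4 encoding, scaling scheme, or group size is used, so the sketch assumes the common E2M1 value grid, a per-group absolute-maximum scale, and a hypothetical parameter-naming convention in which expert weights contain ".experts." in their names.

```python
import math

# Magnitudes representable in the E2M1 FP4 format (assumed encoding).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def fake_quant_fp4(values, group_size=16):
    """Round each group of weights to the nearest scaled FP4 value.

    The result stays in floating point ("fake" quantization), which is
    how a QAT forward pass typically simulates low-precision weights.
    """
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        amax = max(abs(v) for v in group)
        if amax == 0.0:
            out.extend(0.0 for _ in group)
            continue
        scale = amax / FP4_GRID[-1]  # map the largest magnitude to 6.0
        for v in group:
            mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append(math.copysign(mag * scale, v))
    return out

def quantize_experts_only(params, is_expert=lambda name: ".experts." in name):
    """Quantize expert weights and leave all other parameters untouched."""
    return {
        name: fake_quant_fp4(w) if is_expert(name) else list(w)
        for name, w in params.items()
    }

# Hypothetical parameter names; real checkpoints may be organized differently.
params = {
    "layers.0.attn.q_proj": [0.11, -0.27],
    "layers.0.mlp.experts.3.up_proj": [0.11, -0.27, 0.6, -0.05],
}
quantized = quantize_experts_only(params)
```

In this toy example the attention weights pass through unchanged, while the expert weights snap to multiples of the group scale. Restricting quantization to the experts targets the parameters that dominate the size of an MoE model while leaving the remaining layers at their original precision.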