Together AI Details GLM 5.1 Inference Optimizations: Indexer Kernel Rewrite and CPU Overhead Elimination
English summary
Together AI described the three main optimizations it applied to speed up GLM 5.1 inference: rewriting the indexer topk kernel, fusing the indexer kernel to reduce memory and launch overhead, and eliminating CPU overhead that was bottlenecking prefill throughput. The indexer changes yielded the largest performance gain. GLM 5.1 is now available on the Together AI platform.
Key points
Rewrote the indexer topk kernel to improve performance.
Fused the indexer kernel to reduce memory usage and launch overhead.
Eliminated CPU overhead that was limiting prefill throughput.
The indexer modifications were the most impactful, and the model is now available on Together AI.