MiniMax Sparse Attention
English summary
MiniMax Sparse Attention (MSA) is a method for efficiently processing ultra-long contexts (hundreds of thousands to millions of tokens) in large language models. It combines blockwise sparsity with an optimized GPU execution path to achieve significant speedups in both training and inference while maintaining model performance. The method builds on Grouped Query Attention (GQA) and adds two components: a lightweight Index Branch for group-specific sparse token retrieval, and a Main Branch that computes exact block-sparse attention. MSA is co-designed with its GPU kernels for cross-GPU scalability and has been deployed in a production-grade multimodal model, where it reduces per-token attention compute. The inference kernel and a model that uses MSA are openly available online.
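To give a rough sense of why blockwise sparsity lowers per-token attention compute, the following back-of-the-envelope sketch compares how many keys a single query attends to under dense attention and under a block-sparse scheme. The function name, the block size, and the number of blocks kept are all assumptions chosen for illustration; the summary does not state MSA's actual settings, and the cost of selecting the blocks is left out.

```python
def keys_attended_per_query(context_len, block_size, blocks_kept):
    """Return (dense, sparse) counts of keys one query attends to.

    block_size and blocks_kept are hypothetical knobs, not MSA's settings.
    """
    dense = context_len
    # A block-sparse query only touches the tokens inside the kept blocks.
    sparse = min(context_len, block_size * blocks_kept)
    return dense, sparse


# Illustrative values only; block selection cost is not counted here.
dense, sparse = keys_attended_per_query(
    context_len=1_000_000, block_size=64, blocks_kept=32
)
reduction = dense / sparse
print(f"dense={dense} sparse={sparse} reduction={reduction:.1f}x")
```

Under these assumed values the sparse count stays fixed as the context grows, which is the property that makes million-token contexts tractable in principle.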
Key points
Enables efficient attention over ultra-long contexts (up to millions of tokens) via blockwise sparsity and GPU-optimized execution.
Built on Grouped Query Attention (GQA), with a lightweight Index Branch for sparse token retrieval and a Main Branch for exact block-sparse attention.
Co-designed with its GPU execution path, delivering significant speedups in both training and decoding.
Deployed in a production-grade multimodal model, reducing per-token attention compute and accelerating various tasks.
The MSA inference kernel and a model powered by MSA are openly available online.
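The two-branch design on top of GQA can be illustrated with a minimal pure-Python sketch of a single decode step for one query group. This is not MSA's implementation: the function names, the mean-pooling score used by the index branch, the block size, and `top_k` are all assumptions made for illustration. The summary only states that an Index Branch does group-specific sparse retrieval and a Main Branch does exact block-sparse attention.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vec(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def index_branch(group_queries, keys, block_size, top_k):
    """Pick top_k key blocks for one GQA group (hypothetical scoring rule)."""
    # Assumption: one cheap summary per group and per block, compared by dot product.
    q_summary = mean_vec(group_queries)
    scores = []
    for start in range(0, len(keys), block_size):
        block_summary = mean_vec(keys[start:start + block_size])
        scores.append((dot(q_summary, block_summary), start // block_size))
    best = sorted(scores, reverse=True)[:top_k]
    return sorted(block_id for _, block_id in best)


def main_branch(query, keys, values, block_ids, block_size):
    """Exact softmax attention restricted to tokens in the selected blocks."""
    token_ids = [
        t
        for b in block_ids
        for t in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[t]) * scale for t in token_ids]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
        for d in range(dim)
    ]


def sparse_gqa_decode_step(group_queries, keys, values, block_size, top_k):
    """All query heads in the group share one KV head and one block selection."""
    block_ids = index_branch(group_queries, keys, block_size, top_k)
    outputs = [
        main_branch(q, keys, values, block_ids, block_size) for q in group_queries
    ]
    return block_ids, outputs


# Toy data: 8 cached tokens, head dimension 2, two query heads in one group.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
        [-1.0, 0.0], [-1.0, 0.0], [0.0, -1.0], [0.0, -1.0]]
values = [[float(i)] for i in range(8)]
group_queries = [[2.0, 0.0], [1.0, 0.5]]

block_ids, outputs = sparse_gqa_decode_step(
    group_queries, keys, values, block_size=2, top_k=2
)
print(block_ids, outputs)
```

Selecting blocks once per group rather than once per query head is what keeps the retrieval step lightweight in this sketch. When `top_k` covers every block, the result reduces to ordinary dense softmax attention, which is a convenient sanity check for the sparse path.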