Redesign Mixture-of-Experts Routers with Manifold Power Iteration
English summary
Researchers propose a router design for Mixture-of-Experts (MoE) models that uses Manifold Power Iteration (MPI) to align router rows with the principal singular directions of their associated expert matrices. A "Power-then-Retract" paradigm guides the rows toward these directions, which improves how well the router represents token-expert affinity. Experiments show that the alignment yields more effective MoE models and improves pretraining performance at multiple scales. The work targets the core routing mechanism of sparse models.
Key points
Proposes aligning MoE router rows with the principal singular directions of expert matrices via Manifold Power Iteration (MPI).
Introduces a "Power-then-Retract" paradigm that iteratively moves router rows toward the principal subspace of their experts.
Demonstrates improved pretraining performance across model scales, yielding more effective MoE models.
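
To make the "Power-then-Retract" idea concrete, here is a minimal NumPy sketch. The summary does not give the exact update rule, so everything below is an assumption for illustration: each expert is reduced to a single matrix W of shape (d_hidden, d_model), the power step multiplies a router row by W^T W, and the retraction is plain renormalization onto the unit sphere. The names power_then_retract and alignment, along with the toy dimensions, are hypothetical and are not taken from the work itself.

```python
import numpy as np


def power_then_retract(router, experts, steps=1):
    """Move each router row toward the top right singular vector of its expert.

    router:  (num_experts, d_model) array, one row per expert.
    experts: list of (d_hidden, d_model) arrays (an assumed simplification).
    """
    aligned = np.empty(router.shape, dtype=float)
    for e, (row, W) in enumerate(zip(router, experts)):
        r = row.astype(float)
        for _ in range(steps):
            r = W.T @ (W @ r)          # power step: amplify the dominant direction
            r = r / np.linalg.norm(r)  # retract: project back onto the unit sphere
        aligned[e] = r
    return aligned


def alignment(router, experts):
    """Absolute cosine between each router row and its expert's top right singular vector."""
    scores = []
    for r, W in zip(router, experts):
        _, _, Vt = np.linalg.svd(W, full_matrices=False)
        scores.append(abs(Vt[0] @ r) / np.linalg.norm(r))
    return np.array(scores)


rng = np.random.default_rng(0)
num_experts, d_model, d_hidden = 4, 16, 32

# Toy experts: small noise plus a rank-one spike, so each matrix has a clear
# dominant singular direction (an assumption that makes convergence fast).
experts = []
for _ in range(num_experts):
    u = rng.standard_normal(d_hidden)
    v = rng.standard_normal(d_model)
    u /= np.linalg.norm(u)
    v /= np.linalg.norm(v)
    experts.append(0.1 * rng.standard_normal((d_hidden, d_model)) + 3.0 * np.outer(u, v))

router = rng.standard_normal((num_experts, d_model))
before = alignment(router, experts)
aligned = power_then_retract(router, experts, steps=20)
after = alignment(aligned, experts)
print("alignment before:", np.round(before, 3))
print("alignment after: ", np.round(after, 3))
```

In this toy setting the routing logits would then be computed as a token representation multiplied by the transposed aligned router, so a token scores highly for an expert when it lies along the direction that expert amplifies most. How the work interleaves this step with gradient updates is not stated in the summary and is left out of the sketch.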