Title: Turn Waste into Worth: Rectifying Top-$k$ Router of MoE

Feb 21, 2024

Zhiyuan Zeng, Qipeng Guo, Zhaoye Fei, Zhangyue Yin, Yunhua Zhou, Linyang Li, Tianxiang Sun, Hang Yan, Dahua Lin, Xipeng Qiu



Abstract: Sparse Mixture of Experts (MoE) models are popular for training large language models because of their computational efficiency. However, the commonly used top-$k$ routing mechanism incurs redundant computation and memory costs because routing is unbalanced. Some experts overflow and their excess tokens are dropped, while other experts are vacant and are padded with zeros, which negatively impacts model performance. To address dropped tokens and padding, we propose the Rectify-Router, which comprises Intra-GPU Rectification and Fill-in Rectification. Intra-GPU Rectification handles dropped tokens by routing them to experts on the GPU where they already reside, which avoids inter-GPU communication. Fill-in Rectification addresses padding by replacing padding tokens with tokens that have high routing scores. Our experiments show that Intra-GPU Rectification and Fill-in Rectification effectively handle dropped tokens and padding, respectively. Combining them achieves superior performance, surpassing the accuracy of the vanilla top-1 router by 4.7%.
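The routing behavior described in the abstract can be illustrated with a small toy example. The sketch below is based only on the abstract's wording and is not the authors' implementation: the function names (`top1_route`, `intra_gpu_rectify`, `fill_in_rectify`), the per-expert `capacity` parameter, the score values, and the GPU layout are all assumptions made for illustration. It also assumes that a rerouted token goes to its best-scoring local expert without a second capacity check, and that fill-in candidates are drawn from all tokens not already held by the expert.

```python
def top1_route(scores, capacity):
    """Top-1 routing with a fixed per-expert capacity (assumed setup).

    scores[t][e] is the routing score of token t for expert e.
    Returns (slots, dropped): slots[e] lists the tokens kept by expert e,
    dropped lists the tokens whose chosen expert was already full.
    """
    num_experts = len(scores[0])
    slots = [[] for _ in range(num_experts)]
    dropped = []
    for t, row in enumerate(scores):
        e = max(range(num_experts), key=lambda i: row[i])
        if len(slots[e]) < capacity:
            slots[e].append(t)
        else:
            dropped.append(t)  # expert e overflowed
    return slots, dropped


def intra_gpu_rectify(scores, dropped, token_gpu, expert_gpu):
    """Send each dropped token to the best-scoring expert on its own GPU."""
    rerouted = {}
    for t in dropped:
        local = [e for e, g in enumerate(expert_gpu) if g == token_gpu[t]]
        rerouted[t] = max(local, key=lambda e: scores[t][e])
    return rerouted


def fill_in_rectify(scores, slots, capacity):
    """Fill each expert's empty slots with its highest-scoring other tokens."""
    filled = []
    for e, kept in enumerate(slots):
        free = capacity - len(kept)
        candidates = [t for t in range(len(scores)) if t not in kept]
        candidates.sort(key=lambda t: scores[t][e], reverse=True)
        filled.append(kept + candidates[:free])
    return filled


# Toy data (hypothetical): 6 tokens, 4 experts, capacity 2 per expert.
scores = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.5, 0.1, 0.1, 0.3],
    [0.1, 0.1, 0.6, 0.2],
    [0.4, 0.1, 0.3, 0.2],
    [0.1, 0.25, 0.2, 0.45],
]
token_gpu = [0, 0, 1, 1, 1, 1]   # GPU holding each token
expert_gpu = [0, 0, 1, 1]        # GPU holding each expert

slots, dropped = top1_route(scores, capacity=2)
rerouted = intra_gpu_rectify(scores, dropped, token_gpu, expert_gpu)
filled = fill_in_rectify(scores, slots, capacity=2)

print(slots)     # [[0, 1], [], [3], [5]]
print(dropped)   # [2, 4]
print(rerouted)  # {2: 3, 4: 2}
print(filled)    # [[0, 1], [5, 1], [3, 4], [5, 2]]
```

In this toy run, expert 0 overflows and drops tokens 2 and 4, while experts 1, 2, and 3 have empty slots that plain top-1 routing would pad with zeros. The rerouting step keeps both dropped tokens on GPU 1, and the fill-in step uses every free slot for a real token instead of padding.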
