Title: Adaptive Gating in Mixture-of-Experts based Language Models

Oct 11, 2023

Jiamin Li, Qiang Su, Yitao Yang, Yimin Jiang, Cong Wang, Hong Xu

Abstract: Large language models, such as OpenAI's ChatGPT, have demonstrated exceptional language understanding capabilities in various NLP tasks. Sparsely activated mixture-of-experts (MoE) has emerged as a promising solution for scaling models while maintaining a constant number of computational operations. Existing MoE models adopt a fixed gating network in which each token is processed by the same number of experts. However, this approach contradicts our intuition that the tokens in each sequence vary in linguistic complexity and, consequently, require different amounts of computation. Prior research has said little about the trade-off between computation per token and model performance. This paper introduces adaptive gating in MoE, a flexible training strategy that allows tokens to be processed by a variable number of experts based on the expert probability distribution. The proposed framework preserves sparsity while improving training efficiency. Additionally, curriculum learning is leveraged to further reduce training time. Extensive experiments on diverse NLP tasks show that adaptive gating reduces training time by up to 22.5% while maintaining inference quality. Moreover, we conduct a comprehensive analysis of the routing decisions and present our insights when adaptive gating is used.
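
The routing idea described in the abstract can be illustrated with a minimal sketch. The abstract states only that the number of experts per token depends on the expert probability distribution; the specific rule used here is an assumption for illustration: a token goes to a single expert when the gap between its two highest gate probabilities is at least `gap_threshold`, and to the top two experts otherwise. The names `adaptive_route`, `gap_threshold`, and `max_experts` are hypothetical and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_route(logits, gap_threshold=0.3, max_experts=2):
    """Pick the experts for one token (illustrative rule, not the paper's exact method).

    Returns a list of (expert_index, combine_weight) pairs.
    """
    probs = softmax(logits)
    # Rank experts from most to least probable and keep the best candidates.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = ranked[:max_experts]

    # Assumed rule: if the gate clearly prefers one expert, use only that expert.
    if len(top) < 2 or probs[top[0]] - probs[top[1]] >= gap_threshold:
        return [(top[0], 1.0)]

    # Otherwise keep the candidates and renormalize their probabilities
    # so the combine weights sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# A token with one dominant expert is routed to that expert only.
confident = adaptive_route([5.0, 0.0, 0.0, 0.0])

# A token with two close candidates is routed to both.
ambiguous = adaptive_route([1.0, 0.9, -2.0, -2.0])
```

Under this assumed rule, tokens the gate is confident about cost one expert's worth of computation, while ambiguous tokens still receive two, which is one way a model could stay sparse while spending less computation per token on average.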
