Title: DA-MoE: Towards Dynamic Expert Allocation for Mixture-of-Experts Models

Sep 10, 2024

Maryam Akhavan Aghdam, Hongpeng Jin, Yanzhao Wu

Abstract: Transformer-based Mixture-of-Experts (MoE) models have been driving several recent technological advancements in Natural Language Processing (NLP). These MoE models adopt a router mechanism to determine which experts to activate for each input token. However, existing router mechanisms allocate a fixed number of experts to each token, which neglects the varying importance of different input tokens. In this study, we propose a novel dynamic router mechanism that Dynamically Allocates a variable number of experts for Mixture-of-Experts (DA-MoE) models based on an effective token importance measure. First, we show that the Transformer attention mechanism provides a natural and effective way of calculating token importance. Second, we propose a dynamic router mechanism that effectively decides the optimal number of experts (K) and allocates the top-K experts for each input token. Third, comprehensive experiments on several benchmark datasets demonstrate that our DA-MoE approach consistently outperforms the state-of-the-art Transformer-based MoE model on the popular GLUE benchmark.
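To make the idea in the abstract concrete, here is a minimal sketch of importance-driven top-K routing in plain Python. It is not the authors' implementation. The abstract does not give the exact importance formula or the rule that maps importance to K, so both are assumptions here: importance is taken as the mean attention a token receives, and K is scaled linearly against the most important token. The names `token_importance`, `experts_per_token`, `route`, and the parameter `max_k` are hypothetical.

```python
import math


def token_importance(attn):
    """Mean attention each token receives (column mean of the attention matrix).

    Assumption: `attn` is a square list of lists whose rows each sum to 1,
    as produced by a softmax attention layer.
    """
    n = len(attn)
    return [sum(attn[q][t] for q in range(n)) / n for t in range(n)]


def experts_per_token(importance, max_k):
    """Map importance scores to a per-token expert count K in [1, max_k].

    Assumption: K grows linearly with importance relative to the most
    important token; the paper may use a different rule.
    """
    top = max(importance)
    return [min(max_k, max(1, math.ceil(max_k * (imp / top)))) for imp in importance]


def route(router_logits, k):
    """Select the top-k experts for one token and softmax-normalise their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[e] for e in chosen)  # subtract for numerical stability
    exps = [math.exp(router_logits[e] - peak) for e in chosen]
    total = sum(exps)
    return {e: w / total for e, w in zip(chosen, exps)}


# Toy example: 3 tokens, 4 experts. All numbers are made up for illustration.
attn = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.3, 0.2],
]
logits = [
    [0.1, 2.0, -1.0, 1.0],
    [1.5, 0.2, 0.3, -0.4],
    [0.0, 0.1, 2.2, 0.5],
]

ks = experts_per_token(token_importance(attn), max_k=4)
assignments = [route(token_logits, k) for token_logits, k in zip(logits, ks)]
```

In this toy input the first token receives the most attention, so it is sent to the most experts, while the least-attended token is sent to a single expert. A fixed-K router would instead give every token the same number of experts regardless of importance.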
