Title: Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity

May 03, 2023

Haoran Xu, Maha Elbayad, Kenton Murray, Jean Maillard, Vedanuj Goswami

Abstract: Mixture-of-experts (MoE) models that employ sparse activation have demonstrated effectiveness in significantly increasing the number of parameters while maintaining low computational requirements per token. However, recent studies have established that MoE models are inherently parameter-inefficient, as the improvement in performance diminishes with an increasing number of experts. We hypothesize that this parameter inefficiency is a result of all experts having equal capacity, which may not adequately meet the varying complexity requirements of different tokens or tasks; e.g., in a multilingual setting, languages might require different capacities depending on their resource levels. In light of this, we propose Stratified Mixture of Experts (SMoE) models, which feature a stratified structure and can assign dynamic capacity to different tokens. We demonstrate the effectiveness of SMoE on two multilingual machine translation benchmarks, where it outperforms multiple state-of-the-art MoE models. On a diverse 15-language dataset, SMoE improves the translation quality over vanilla MoE by +0.93 BLEU points on average. Additionally, SMoE is parameter-efficient, matching vanilla MoE performance with around 50% fewer parameters.
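The abstract's opening claim, that sparse activation lets the parameter count grow while per-token compute stays low, can be illustrated with a minimal sketch of generic top-k expert routing. This is not the SMoE architecture: the abstract does not specify how the stratified structure routes tokens, so nothing below attempts it. All names (`make_expert`, `moe_forward`, `count_params`), the ReLU feed-forward experts, the softmax gate, and the toy dimensions are assumptions chosen for illustration.

```python
import math
import random


def make_expert(d_model, d_hidden, rng):
    """One feed-forward expert: two weight matrices (biases omitted)."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    return w_in, w_out


def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def expert_forward(x, expert):
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(x, w_in)]  # ReLU activation
    return matvec(hidden, w_out)


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, experts, gate_w, top_k=1):
    """Route token x to its top_k experts; return the output and chosen ids."""
    probs = softmax(matvec(x, gate_w))
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:  # only the selected experts are evaluated
        y = expert_forward(x, experts[i])
        weight = probs[i] / norm
        out = [o + weight * v for o, v in zip(out, y)]
    return out, chosen


def count_params(experts):
    """Total expert parameters (gate weights excluded)."""
    return sum(len(w) * len(w[0]) for expert in experts for w in expert)


if __name__ == "__main__":
    rng = random.Random(0)
    d_model, d_hidden = 4, 8  # toy sizes, assumed for illustration
    for n_experts in (2, 8):
        experts = [make_expert(d_model, d_hidden, rng) for _ in range(n_experts)]
        gate_w = [[rng.gauss(0, 0.1) for _ in range(n_experts)] for _ in range(d_model)]
        token = [rng.gauss(0, 1) for _ in range(d_model)]
        out, chosen = moe_forward(token, experts, gate_w, top_k=1)
        # Prints "2 128 1" then "8 512 1": parameters grow fourfold,
        # while each token still passes through a single expert.
        print(n_experts, count_params(experts), len(chosen))
```

In this sketch every expert has the same size, which is the equal-capacity setup the abstract hypothesizes to be the source of parameter inefficiency.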
