Giang Do

Sparse Mixture of Experts as Unified Competitive Learning

Mar 29, 2025

Giang Do, Hung Le, Truyen Tran

Abstract: Sparse Mixture of Experts (SMoE) improves the efficiency of large language model training by directing input tokens to a subset of experts. Despite its success in generation tasks, its generalization ability remains an open question. In this paper, we demonstrate that current SMoEs, which fall into two categories, (1) Token Choice and (2) Expert Choice, struggle with tasks such as the Massive Text Embedding Benchmark (MTEB). By analyzing their mechanisms through the lens of competitive learning, our study finds that the Token Choice approach may overly focus on irrelevant experts, while the Expert Choice approach risks discarding important tokens, potentially affecting performance. Motivated by this analysis, we propose Unified Competitive Learning SMoE (USMoE), a novel and efficient framework designed to improve the performance of existing SMoEs in both scenarios: with and without training. Extensive experiments across various tasks show that USMoE achieves up to a 10% improvement over traditional approaches or reduces computational inference costs by 14% while maintaining strong performance.
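The two routing families named in this abstract can be illustrated with a small sketch. This is not the USMoE method and is not taken from the paper; the score matrix, the function names `token_choice` and `expert_choice`, and the capacity value are assumptions chosen only to show how Token Choice can crowd one expert and how Expert Choice can leave a token unrouted.

```python
def token_choice(scores, k):
    """Token Choice: each token (row) keeps its k highest-scoring experts."""
    return [sorted(range(len(row)), key=lambda e: -row[e])[:k] for row in scores]


def expert_choice(scores, capacity):
    """Expert Choice: each expert (column) keeps its `capacity` best tokens."""
    n_tokens, n_experts = len(scores), len(scores[0])
    return [
        sorted(range(n_tokens), key=lambda t: -scores[t][e])[:capacity]
        for e in range(n_experts)
    ]


# Hypothetical router scores: rows are tokens, columns are experts.
scores = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.40, 0.35, 0.25],
]

by_token = token_choice(scores, k=1)
by_expert = expert_choice(scores, capacity=1)

# Tokens that no expert selected under Expert Choice.
dropped = set(range(len(scores))) - {t for chosen in by_expert for t in chosen}

print(by_token)   # [[0], [0], [0]]  every token picks expert 0
print(by_expert)  # [[0], [2], [2]]  experts 1 and 2 both pick token 2
print(dropped)    # {1}              token 1 is routed nowhere
```

In this toy case Token Choice sends every token to the same expert, while Expert Choice balances load but drops token 1, which matches the two failure modes the abstract describes.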

* 18 pages

S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning

Mar 29, 2025

Giang Do, Hung Le, Truyen Tran

Abstract: Sparse Mixture of Experts (SMoE) enables efficient training of large language models by routing input tokens to a select number of experts. However, training SMoE remains challenging due to the issue of representation collapse. Recent studies have focused on improving the router to mitigate this problem, but existing approaches face two key limitations: (1) expert embeddings are significantly smaller than the model's dimension, contributing to representation collapse, and (2) routing each input to the Top-K experts can cause them to learn overly similar features. In this work, we propose a novel approach called Robust Sparse Mixture of Experts via Stochastic Learning (S2MoE), which is a mixture of experts designed to learn from both deterministic and non-deterministic inputs via Learning under Uncertainty. Extensive experiments across various tasks demonstrate that S2MoE achieves performance comparable to other routing methods while reducing computational inference costs by 28%.

* 4 pages

On the effectiveness of discrete representations in sparse mixture of experts

Nov 28, 2024

Giang Do, Kha Pham, Hung Le, Truyen Tran

Abstract: Sparse mixture of experts (SMoE) is an effective solution for scaling up model capacity without increasing computational costs. A crucial component of SMoE is the router, responsible for directing the input to relevant experts; however, it is also a major weakness, leading to routing inconsistencies and representation collapse. Instead of fixing the router as in previous works, we propose an alternative that assigns experts to inputs via indirection, employing a discrete representation of the input that points to the expert. The discrete representations are learnt via vector quantization, resulting in a new architecture dubbed Vector-Quantized Mixture of Experts (VQMoE). We provide theoretical support and empirical evidence demonstrating VQMoE's ability to overcome the challenges present in traditional routers. Through extensive evaluations on both large language models and vision tasks for pre-training and fine-tuning, we show that VQMoE achieves a 28% improvement in robustness compared to other SMoE routing methods, while maintaining strong performance in fine-tuning tasks.
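The idea of assigning an expert through a discrete code can be sketched as a nearest-neighbour lookup in a codebook. This is a minimal illustration of vector quantization in general, not the VQMoE architecture; the codebook values, the function name `nearest_code`, and the one-code-per-expert mapping are assumptions made for the example.

```python
def nearest_code(x, codebook):
    """Return the index of the codebook vector closest to x (squared L2)."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(x, codebook[i]))


# Hypothetical codebook; assume code i points to expert i.
codebook = [
    [0.0, 0.0],
    [1.0, 1.0],
    [4.0, 0.0],
]

inputs = [[0.9, 1.2], [3.5, 0.2], [0.1, -0.2]]
assignments = [nearest_code(x, codebook) for x in inputs]
print(assignments)  # [1, 2, 0]
```

Here the expert is determined by which code the input quantizes to, so no learned routing scores are involved in the lookup itself.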

* 17 pages

SimSMoE: Solving Representational Collapse via Similarity Measure

Jun 22, 2024

Giang Do, Hung Le, Truyen Tran

Abstract: Sparse mixture of experts (SMoE) has emerged as an effective approach for scaling large language models while keeping a constant computational cost. Despite several notable successes of SMoE, effectively training such architectures remains elusive due to the representation collapse problem, which in turn harms model performance and causes parameter redundancy. In this work, we present Similarity-based Sparse Mixture of Experts (SimSMoE), a novel neural network similarity algorithm that guarantees a solution to the representation collapse issue between experts given a fixed FLOPs budget. We conduct extensive empirical evaluations on three large language models for both pre-training and fine-tuning tasks to illustrate the efficacy, robustness, and scalability of our method. The results demonstrate that SimSMoE significantly enhances existing routing policies and outperforms other SMoE training methods on these tasks.

CompeteSMoE -- Effective Training of Sparse Mixture of Experts via Competition

Feb 04, 2024

Quang Pham, Giang Do, Huy Nguyen, TrungTin Nguyen, Chenghao Liu, Mina Sartipi, Binh T. Nguyen, Savitha Ramasamy, Xiaoli Li, Steven Hoi (+1 more)

Abstract: Sparse mixture of experts (SMoE) offers an appealing solution to scale up model complexity beyond the means of increasing the network's depth or width. However, effective training of SMoE has proven to be challenging due to the representation collapse issue, which causes parameter redundancy and limited representation potential. In this work, we propose a competition mechanism to address this fundamental challenge of representation collapse. By routing inputs only to experts with the highest neural response, we show that, under mild assumptions, competition enjoys the same convergence rate as the optimal estimator. We further propose CompeteSMoE, an effective and efficient algorithm to train large language models by deploying a simple router that predicts the competition outcomes. Consequently, CompeteSMoE enjoys strong performance gains from the competition routing policy while having low computational overhead. Our extensive empirical evaluations on two transformer architectures and a wide range of tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies.
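Routing "only to experts with the highest neural response" can be sketched as follows. This is an illustrative reading of the abstract, not the CompeteSMoE algorithm: the use of the output L2 norm as the "neural response", the toy experts, and the name `competition_route` are all assumptions.

```python
import math


def competition_route(x, experts, k):
    """Run every expert on x and keep the k with the largest output norm.

    Assumption: the 'neural response' is the L2 norm of the expert output.
    """
    outputs = [expert(x) for expert in experts]
    responses = [math.sqrt(sum(v * v for v in out)) for out in outputs]
    winners = sorted(range(len(experts)), key=lambda i: -responses[i])[:k]
    return winners, responses


# Hypothetical toy experts acting on a small vector.
experts = [
    lambda x: [2.0 * v for v in x],   # expert 0: scale up
    lambda x: [0.5 * v for v in x],   # expert 1: scale down
    lambda x: [v + 1.0 for v in x],   # expert 2: shift
]

winners, responses = competition_route([1.0, 0.0], experts, k=2)
print(winners)  # [2, 0]
```

Note that this sketch evaluates every expert, which is exactly the cost the abstract says a simple router is trained to avoid by predicting the competition outcome.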

HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts

Dec 12, 2023

Giang Do, Khiem Le, Quang Pham, TrungTin Nguyen, Thanh-Nam Doan, Binh T. Nguyen, Chenghao Liu, Savitha Ramasamy, Xiaoli Li, Steven Hoi

Abstract: By routing input tokens to only a few split experts, Sparse Mixture-of-Experts has enabled efficient training of large language models. Recent findings suggest that fixing the routers can achieve competitive performance by alleviating the collapsing problem, where all experts eventually learn similar representations. However, this strategy has two key limitations: (i) the policy derived from random routers might be sub-optimal, and (ii) it requires extensive resources during training and evaluation, leading to limited efficiency gains. This work introduces HyperRouter, which dynamically generates the router's parameters through a fixed hypernetwork and trainable embeddings to achieve a balance between training the routers and freezing them to learn an improved routing policy. Extensive experiments across a wide range of tasks demonstrate the superior performance and efficiency gains of HyperRouter compared to existing routing methods. Our implementation is publicly available at https://github.com/giangdip2410/HyperRouter.
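The phrase "generates the router's parameters through a fixed hypernetwork and trainable embeddings" can be pictured with a small sketch. This is not the released HyperRouter code; it assumes the simplest possible hypernetwork, a single frozen random linear map, and all names (`make_fixed_hypernetwork`, `generate_router`, `route_logits`) and sizes are hypothetical.

```python
import random


def make_fixed_hypernetwork(embed_dim, out_size, seed=0):
    """Frozen random linear map; in this sketch it is never updated."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(embed_dim)] for _ in range(out_size)]


def generate_router(hyper, embedding, n_experts, d_model):
    """Map the (trainable) embedding to an n_experts x d_model router matrix."""
    flat = [sum(h * e for h, e in zip(row, embedding)) for row in hyper]
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_experts)]


def route_logits(router, x):
    """One routing logit per expert for the token representation x."""
    return [sum(w * v for w, v in zip(row, x)) for row in router]


n_experts, d_model, embed_dim = 3, 2, 4
hyper = make_fixed_hypernetwork(embed_dim, n_experts * d_model)

# Only this embedding would receive gradient updates in training.
embedding = [0.1, -0.2, 0.3, 0.05]
router = generate_router(hyper, embedding, n_experts, d_model)
logits = route_logits(router, [1.0, -1.0])
```

Under these assumptions the router is neither fully trained nor fully frozen: its weights move only through the low-dimensional embedding, while the map that produces them stays fixed.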
