Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Toward Inference-optimal Mixture-of-Expert Large Language Models

Apr 03, 2024

Longfei Yun, Yonghao Zhuang, Yao Fu, Eric P Xing, Hao Zhang

Figure 1 for Toward Inference-optimal Mixture-of-Expert Large Language Models

Figure 2 for Toward Inference-optimal Mixture-of-Expert Large Language Models

Figure 3 for Toward Inference-optimal Mixture-of-Expert Large Language Models

Figure 4 for Toward Inference-optimal Mixture-of-Expert Large Language Models

Share this with someone who'll enjoy it:

Abstract:Mixture-of-Expert (MoE) based large language models (LLMs), such as the recent Mixtral and DeepSeek-MoE, have shown great promise in scaling model size without suffering from the quadratic growth of training cost of dense transformers. Like dense models, training MoEs requires answering the same question: given a training budget, what is the optimal allocation on the model size and number of tokens? We study the scaling law of MoE-based LLMs regarding the relations between the model performance, model size, dataset size, and the expert degree. Echoing previous research studying MoE in different contexts, we observe the diminishing return of increasing the number of experts, but this seems to suggest we should scale the number of experts until saturation, as the training cost would remain constant, which is problematic during inference time. We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss. We find that MoEs with a few (4/8) experts are the most serving efficient solution under the same performance, but costs 2.5-3.5x more in training. On the other hand, training a (16/32) expert MoE much smaller (70-85%) than the loss-optimal solution, but with a larger training dataset is a promising setup under a training budget.

* 15 pages, 8 figures

View paper on

Share this with someone who'll enjoy it:

Title:Toward Inference-optimal Mixture-of-Expert Large Language Models

Paper and Code