Title: Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

Mar 12, 2024

Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston (+1 more)

Abstract: We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains, such as coding, math reasoning and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in an embarrassingly parallel fashion with high throughput and reduced communication cost. After individual experts are asynchronously trained, BTX brings together their feedforward parameters as experts in Mixture-of-Experts (MoE) layers and averages the remaining parameters, followed by an MoE-finetuning stage to learn token-level routing. BTX generalizes two special cases: the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.
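
The merge step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the parameter names (`layer0.attn.wq`, `layer0.ffn.w1`), the `.ffn.` naming convention, the flat-list tensors, and the top-k softmax router are all assumptions made for illustration. The abstract only states that feedforward parameters become MoE experts, the remaining parameters are averaged, and token-level routing is learned during finetuning. Here the router logits are simply given rather than learned.

```python
import math

def btx_merge(models, is_ffn=lambda name: ".ffn." in name):
    """Merge per-domain models (dicts: name -> flat list of floats).

    Feedforward parameters are kept separately, one copy per expert;
    all remaining parameters are averaged element-wise.
    The ".ffn." naming convention is a hypothetical choice for this sketch.
    """
    merged = {}
    for name in models[0]:
        tensors = [m[name] for m in models]
        if is_ffn(name):
            # One entry per expert: these become the experts of an MoE layer.
            merged[name] = [list(t) for t in tensors]
        else:
            # Element-wise mean across the expert models.
            merged[name] = [sum(vals) / len(vals) for vals in zip(*tensors)]
    return merged

def route_token(logits, k=2):
    """Pick the top-k experts for one token and softmax-normalize their logits.

    Top-k softmax routing is an assumed, commonly used scheme, not one
    confirmed by the abstract.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Two toy "expert" models branched from the same seed (values are made up).
math_model = {"layer0.attn.wq": [1.0, 3.0], "layer0.ffn.w1": [0.1, 0.2]}
code_model = {"layer0.attn.wq": [3.0, 5.0], "layer0.ffn.w1": [0.7, 0.8]}

merged = btx_merge([math_model, code_model])
routing = route_token([0.5, 2.0, 2.0], k=2)
```

In this toy setup, `merged["layer0.attn.wq"]` holds the averaged attention weights, while `merged["layer0.ffn.w1"]` holds one feedforward weight list per expert. In the method as described, the router that produces the logits would be trained in the MoE-finetuning stage.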
