Raghu Kiran Ganti

Flexible and Effective Mixing of Large Language Models into a Mixture of Domain Experts

Aug 30, 2024

Rhui Dih Lee, Laura Wynter, Raghu Kiran Ganti

Abstract: We present a toolkit for creating low-cost Mixture-of-Domain-Experts (MOE) from trained models. The toolkit can be used for creating a mixture from models or from adapters. We perform extensive tests and offer guidance on defining the architecture of the resulting MOE using the toolkit. A public repository is available.

Enhancing Training Efficiency Using Packing with Flash Attention

Jul 12, 2024

Achintya Kundu, Rhui Dih Lee, Laura Wynter, Raghu Kiran Ganti

Abstract: Padding is often used in tuning LLM models by adding special tokens to shorter training examples to match the length of the longest sequence in each batch. While this ensures uniformity for batch processing, it introduces inefficiencies by including irrelevant padding tokens in the computation and wastes GPU resources. On the other hand, the Hugging Face SFT trainer offers the option to use packing to combine multiple training examples up to the maximum sequence length. This allows for maximal utilization of GPU resources. However, without proper masking of each packed training example, attention will not be computed correctly when using SFT trainer. We enable and then analyse packing and Flash Attention with proper attention masking of each example and show the benefits of this training paradigm.
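
As a rough illustration of the masking issue the abstract describes, the sketch below packs two short examples into one row and builds a per-example causal mask, so that tokens of one example cannot attend to tokens of another. It is a minimal pure-Python sketch, not the paper's implementation: the function names `pack_examples` and `block_causal_mask` are hypothetical, and the greedy single-row packing is an assumption made for brevity. The `cu_seqlens` list (cumulative sequence lengths) mirrors the kind of boundary information that variable-length attention kernels typically take in place of a dense mask.

```python
from typing import List, Tuple


def pack_examples(
    examples: List[List[int]], max_len: int
) -> Tuple[List[int], List[int], List[int]]:
    """Greedily pack examples into one row of at most max_len tokens.

    Returns (input_ids, position_ids, cu_seqlens). Hypothetical helper:
    a real packer would open a new row instead of stopping.
    """
    input_ids: List[int] = []
    position_ids: List[int] = []
    cu_seqlens: List[int] = [0]
    for ex in examples:
        if len(input_ids) + len(ex) > max_len:
            break
        input_ids.extend(ex)
        # Positions restart at 0 for every packed example.
        position_ids.extend(range(len(ex)))
        # Record where this example ends in the packed row.
        cu_seqlens.append(len(input_ids))
    return input_ids, position_ids, cu_seqlens


def block_causal_mask(cu_seqlens: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    Attention is allowed only within the same example and only to
    earlier or equal positions (causal).
    """
    total = cu_seqlens[-1]
    segment = [0] * total
    for seg_id, (start, end) in enumerate(zip(cu_seqlens, cu_seqlens[1:])):
        for t in range(start, end):
            segment[t] = seg_id
    return [
        [segment[i] == segment[j] and j <= i for j in range(total)]
        for i in range(total)
    ]


ids, pos, cu = pack_examples([[11, 12, 13], [21, 22]], max_len=8)
mask = block_causal_mask(cu)
```

Here the first token of the second example (index 3) cannot see the last token of the first example (index 2), which a plain causal mask over the packed row would have allowed.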
