Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jianhao Zhang

Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models

Jun 03, 2024

Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng(+6 more)

Figure 1 for Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models

Figure 2 for Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models

Figure 3 for Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models

Figure 4 for Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models

Abstract:In this technical report, we introduce the training methodologies implemented in the development of Skywork-MoE, a high-performance mixture-of-experts (MoE) large language model (LLM) with 146 billion parameters and 16 experts. It is initialized from the pre-existing dense checkpoints of our Skywork-13B model. We explore the comparative effectiveness of upcycling versus training from scratch initializations. Our findings suggest that the choice between these two approaches should consider both the performance of the existing dense checkpoints and the MoE training budget. We highlight two innovative techniques: gating logit normalization, which improves expert diversification, and adaptive auxiliary loss coefficients, allowing for layer-specific adjustment of auxiliary loss coefficients. Our experimental results validate the effectiveness of these methods. Leveraging these techniques and insights, we trained our upcycled Skywork-MoE on a condensed subset of our SkyPile corpus. The evaluation results demonstrate that our model delivers strong performance across a wide range of benchmarks.

Via

Access Paper or Ask Questions

Coop: Memory is not a Commodity

Nov 01, 2023

Jianhao Zhang, Shihan Ma, Peihong Liu, Jinhui Yuan

Abstract:Tensor rematerialization allows the training of deep neural networks (DNNs) under limited memory budgets by checkpointing the models and recomputing the evicted tensors as needed. However, the existing tensor rematerialization techniques overlook the memory system in deep learning frameworks and implicitly assume that free memory blocks at different addresses are identical. Under this flawed assumption, discontiguous tensors are evicted, among which some are not used to allocate the new tensor. This leads to severe memory fragmentation and increases the cost of potential rematerializations. To address this issue, we propose to evict tensors within a sliding window to ensure all evictions are contiguous and are immediately used. Furthermore, we proposed cheap tensor partitioning and recomputable in-place to further reduce the rematerialization cost by optimizing the tensor allocation. We named our method Coop as it is a co-optimization of tensor allocation and tensor rematerialization. We evaluated Coop on eight representative DNNs. The experimental results demonstrate that Coop achieves up to $2\times$ memory saving and hugely reduces compute overhead, search latency, and memory fragmentation compared to the state-of-the-art baselines.

* NeurIPS 2023 spotlight

Via

Access Paper or Ask Questions

Skywork: A More Open Bilingual Foundation Model

Oct 30, 2023

Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui Hu(+20 more)

Figure 1 for Skywork: A More Open Bilingual Foundation Model

Figure 2 for Skywork: A More Open Bilingual Foundation Model

Figure 3 for Skywork: A More Open Bilingual Foundation Model

Figure 4 for Skywork: A More Open Bilingual Foundation Model

Abstract:In this technical report, we present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts. This bilingual foundation model is the most extensively trained and openly published LLMs of comparable size to date. We introduce a two-stage training methodology using a segmented corpus, targeting general purpose training and then domain-specific enhancement training, respectively. We show that our model not only excels on popular benchmarks, but also achieves \emph{state of the art} performance in Chinese language modeling on diverse domains. Furthermore, we propose a novel leakage detection method, demonstrating that test data contamination is a pressing issue warranting further investigation by the LLM community. To spur future research, we release Skywork-13B along with checkpoints obtained during intermediate stages of the training process. We are also releasing part of our SkyPile corpus, a collection of over 150 billion tokens of web text, which is the largest high quality open Chinese pre-training corpus to date. We hope Skywork-13B and our open corpus will serve as a valuable open-source resource to democratize access to high-quality LLMs.

Via

Access Paper or Ask Questions

daBNN: A Super Fast Inference Framework for Binary Neural Networks on ARM devices

Aug 16, 2019

Jianhao Zhang, Yingwei Pan, Ting Yao, He Zhao, Tao Mei

Figure 1 for daBNN: A Super Fast Inference Framework for Binary Neural Networks on ARM devices

Figure 2 for daBNN: A Super Fast Inference Framework for Binary Neural Networks on ARM devices

Figure 3 for daBNN: A Super Fast Inference Framework for Binary Neural Networks on ARM devices

Figure 4 for daBNN: A Super Fast Inference Framework for Binary Neural Networks on ARM devices

Abstract:It is always well believed that Binary Neural Networks (BNNs) could drastically accelerate the inference efficiency by replacing the arithmetic operations in float-valued Deep Neural Networks (DNNs) with bit-wise operations. Nevertheless, there has not been open-source implementation in support of this idea on low-end ARM devices (e.g., mobile phones and embedded devices). In this work, we propose daBNN --- a super fast inference framework that implements BNNs on ARM devices. Several speed-up and memory refinement strategies for bit-packing, binarized convolution, and memory layout are uniquely devised to enhance inference efficiency. Compared to the recent open-source BNN inference framework, BMXNet, our daBNN is $7\times$$\sim$$23\times$ faster on a single binary convolution, and about $6\times$ faster on Bi-Real Net 18 (a BNN variant of ResNet-18). The daBNN is a BSD-licensed inference framework, and its source code, sample projects and pre-trained models are available on-line: https://github.com/JDAI-CV/dabnn.

* Accepted by 2019 ACMMM Open Source Software Competition. Source code: https://github.com/JDAI-CV/dabnn

Via

Access Paper or Ask Questions