Guoqing Jiang

SKDBERT: Compressing BERT via Stochastic Knowledge Distillation

Nov 29, 2022

Zixiang Ding, Guoqing Jiang, Shuai Zhang, Lin Guo, Wei Lin

Abstract: In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain a compact BERT-style language model dubbed SKDBERT. In each iteration, SKD samples a teacher model from a pre-defined teacher ensemble, which consists of multiple teacher models with multi-level capacities, and transfers its knowledge to the student model in a one-to-one manner. The sampling distribution plays an important role in SKD. We heuristically present three types of sampling distributions that assign appropriate probabilities to the multi-level teacher models. SKD has two advantages: 1) it preserves the diversity of the multi-level teacher models by stochastically sampling a single teacher model in each iteration, and 2) it improves the efficacy of knowledge distillation via multi-level teacher models when a large capacity gap exists between the teacher model and the student model. Experimental results on the GLUE benchmark show that SKDBERT reduces the size of a BERT$_{\rm BASE}$ model by 40% while retaining 99.5% of its language understanding performance and being 100% faster.

* This paper has been accepted by AAAI2023
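
The per-iteration teacher sampling described in the SKDBERT abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`sample_teacher`, `kd_loss`, `skd_step_losses`), the temperature-softened KL divergence used as the distillation loss, and the fixed toy logits are all assumptions made for illustration. The sampling probabilities are passed in by the caller, since the abstract does not spell out the three distributions.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: a standard soft-label distillation loss, chosen only
    to make the sketch concrete.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sample_teacher(teacher_names, probs, rng):
    """Draw exactly one teacher according to the sampling distribution."""
    return rng.choices(teacher_names, weights=probs, k=1)[0]


def skd_step_losses(student_logits, teacher_logits_by_name, probs, steps, seed=0):
    """Run `steps` iterations; each one distills from a single sampled teacher.

    A real training loop would update the student after every step;
    here the student logits are held fixed to keep the sketch short.
    """
    rng = random.Random(seed)
    names = list(teacher_logits_by_name)
    history = []
    for _ in range(steps):
        name = sample_teacher(names, probs, rng)
        loss = kd_loss(student_logits, teacher_logits_by_name[name])
        history.append((name, loss))
    return history


# Hypothetical ensemble of three teachers with different capacities.
teachers = {
    "small": [1.0, 0.5, 0.1],
    "medium": [2.0, 0.3, -0.5],
    "large": [3.0, 0.1, -1.0],
}
history = skd_step_losses([0.5, 0.5, 0.5], teachers, [1, 1, 1], steps=5)
```

Because only one teacher is drawn per step, the student sees each teacher's output distribution on its own rather than an averaged ensemble prediction, which is the one-to-one transfer the abstract refers to.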

Understanding Why Neural Networks Generalize Well Through GSNR of Parameters

Feb 24, 2020

Jinlong Liu, Guoqing Jiang, Yunzhi Bai, Ting Chen, Huayan Wang

Abstract: As deep neural networks (DNNs) achieve tremendous success across many application domains, researchers have explored from many angles why they generalize well. In this paper, we provide a novel perspective on this question using the gradient signal-to-noise ratio (GSNR) of parameters during the training process of DNNs. The GSNR of a parameter is defined as the ratio between its gradient's squared mean and variance, over the data distribution. Based on several approximations, we establish a quantitative relationship between model parameters' GSNR and the generalization gap. This relationship indicates that a larger GSNR during training leads to better generalization performance. Moreover, we show that, unlike in shallow models (e.g. logistic regression, support vector machines), the gradient descent optimization dynamics of DNNs naturally produce large GSNR during training, which is probably the key to DNNs' remarkable generalization ability.

* 14 pages, 8 figures, ICLR2020 accepted as spotlight presentation
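
The GSNR definition in the abstract above (squared mean of a parameter's gradient divided by its variance) can be made concrete with a short Python sketch. It rests on assumptions not taken from the paper: the expectation over the data distribution is approximated by a finite sample, the model is a one-parameter linear regressor with squared loss, and the names `per_sample_grads` and `gsnr` are hypothetical.

```python
import math


def per_sample_grads(w, xs, ys):
    """Per-sample gradient of 0.5 * (w * x - y)**2 with respect to w."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]


def gsnr(grads):
    """Empirical GSNR: squared mean of the gradients over their variance.

    Uses the population variance (divide by n) of the sample as a
    stand-in for the variance over the data distribution.
    """
    n = len(grads)
    mean = sum(grads) / n
    var = sum((g - mean) ** 2 for g in grads) / n
    if var == 0.0:
        # All samples agree exactly: no noise if the shared gradient is
        # nonzero; with an all-zero gradient there is no signal either.
        return math.inf if mean != 0.0 else 0.0
    return mean ** 2 / var


# Samples that agree on the gradient direction give a high GSNR,
# while samples that cancel each other out give a GSNR of zero.
consistent = gsnr([1.0, 3.0])
conflicting = gsnr([-1.0, 1.0])
toy = gsnr(per_sample_grads(0.0, [1.0, 2.0], [1.0, 2.0]))
```

A high value means the per-sample gradients point in a consistent direction relative to their spread, which is the quantity the abstract links to a smaller generalization gap.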
