Knowledge distillation is a training and compression strategy in which two neural networks, a teacher and a student, are coupled together during training. The teacher network serves as a trustworthy predictor and the student tries to mimic its predictions. Usually, a student with a lighter architecture is chosen, so that the model is compressed while still delivering high-quality results. In this setting, distillation is applied only to final predictions, although the student could also benefit from the teacher's supervision of internal components. Motivated by this, we study the problem of distillation for intermediate layers. Since there may not be a one-to-one alignment between student and teacher layers, existing techniques skip some teacher layers and distill from only a subset of them. This shortcoming directly affects quality, so we instead propose an attention-based combinatorial technique. Our model fuses teacher-side information, taking each layer's significance into consideration, and then performs distillation between the combined teacher layers and those of the student. Using our technique, we distilled a 12-layer BERT (Devlin et al. 2019) into 6-, 4-, and 2-layer counterparts and evaluated them on GLUE tasks (Wang et al. 2018). Experimental results show that our combinatorial approach outperforms existing techniques.
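
To make the fusion idea concrete, the sketch below shows one possible way to combine all teacher layers into a single distillation target per student layer. It is a minimal PyTorch illustration only: the dot-product scoring, softmax weighting, equal hidden sizes, and MSE objective are assumptions for this sketch, not necessarily the exact formulation used in the paper.

```python
# Minimal sketch: attention-based fusion of teacher layers for intermediate-layer
# distillation. Shapes, scoring function, and loss are illustrative assumptions.
import torch
import torch.nn.functional as F


def combined_layer_distillation_loss(student_hiddens, teacher_hiddens):
    """Distill each student layer from an attention-weighted mix of all teacher layers.

    student_hiddens: list of tensors [batch, seq_len, dim], one per student layer.
    teacher_hiddens: list of tensors [batch, seq_len, dim], one per teacher layer.
    Hidden sizes are assumed equal; otherwise a linear projection would be needed.
    """
    teacher_stack = torch.stack(teacher_hiddens, dim=2)   # [batch, seq, n_teacher, dim]
    loss = 0.0
    for s in student_hiddens:
        # Score every teacher layer against this student layer (scaled dot product),
        # then normalize the scores with a softmax to obtain fusion weights.
        scores = torch.einsum("bsd,bstd->bst", s, teacher_stack) / s.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)               # [batch, seq, n_teacher]
        # Fuse the teacher layers into a single target representation.
        fused = torch.einsum("bst,bstd->bsd", weights, teacher_stack)
        # Distill: pull the student layer toward the fused teacher representation.
        loss = loss + F.mse_loss(s, fused)
    return loss / len(student_hiddens)


# Example with random tensors standing in for BERT hidden states.
batch, seq, dim = 2, 8, 768
teacher_hiddens = [torch.randn(batch, seq, dim) for _ in range(12)]
student_hiddens = [torch.randn(batch, seq, dim, requires_grad=True) for _ in range(4)]
loss = combined_layer_distillation_loss(student_hiddens, teacher_hiddens)
loss.backward()
```

In this sketch, no teacher layer is skipped: every student layer attends over the full set of teacher layers, and the softmax weights play the role of each layer's learned significance.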