Title: Gradient Knowledge Distillation for Pre-trained Language Models

Nov 02, 2022

Lean Wang, Lei Li, Xu Sun

Abstract: Knowledge distillation (KD) is an effective framework to transfer knowledge from a large-scale teacher to a compact yet well-performing student. Previous KD practices for pre-trained language models mainly transfer knowledge by aligning instance-wise outputs between the teacher and student, while neglecting an important knowledge source, i.e., the gradient of the teacher. The gradient characterizes how the teacher responds to changes in inputs, which we assume is beneficial for the student to better approximate the underlying mapping function of the teacher. Therefore, we propose Gradient Knowledge Distillation (GKD) to incorporate the gradient alignment objective into the distillation process. Experimental results show that GKD outperforms previous KD methods in terms of student performance. Further analysis shows that incorporating gradient knowledge makes the student behave more consistently with the teacher, greatly improving interpretability.
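To make the idea of a gradient alignment objective concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the scalar `tanh` model, the squared-error form of both terms, the weighting parameter `alpha`, and the function name `gkd_style_loss` are all assumptions chosen for illustration. The abstract states only that a gradient alignment term is added to the distillation objective.

```python
import math

def model(w, x):
    """Toy scalar model (an assumption): f(x) = tanh(w . x)."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def input_gradient(w, x):
    """Analytic gradient of f with respect to the input x: (1 - f(x)^2) * w."""
    y = model(w, x)
    return [(1.0 - y * y) * wi for wi in w]

def gkd_style_loss(w_teacher, w_student, x, alpha=1.0):
    """Hypothetical objective: output matching plus alpha * input-gradient matching."""
    # Instance-wise output alignment, as in conventional KD.
    out_term = (model(w_student, x) - model(w_teacher, x)) ** 2
    # Alignment of how teacher and student respond to changes in the input.
    g_t = input_gradient(w_teacher, x)
    g_s = input_gradient(w_student, x)
    grad_term = sum((a - b) ** 2 for a, b in zip(g_s, g_t)) / len(x)
    return out_term + alpha * grad_term

teacher = [1.0, -1.0]
student = [-1.0, 1.0]   # same output as the teacher at x, opposite sensitivity
x = [1.0, 1.0]

print(gkd_style_loss(teacher, student, x, alpha=0.0))  # output term only
print(gkd_style_loss(teacher, student, x, alpha=1.0))  # with gradient term
```

In this toy setting the student matches the teacher's output exactly at `x`, so output alignment alone gives zero loss, while the gradient term exposes that the two models respond to input changes in opposite directions. For a real language model the gradient would typically be obtained by automatic differentiation rather than analytically.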

* Accepted by the NeurIPS ENLSP 2022 workshop (spotlight)
