Title: Efficient Knowledge Distillation from Model Checkpoints

Oct 12, 2022

Chaofei Wang, Qisen Yang, Rui Huang, Shiji Song, Gao Huang

Abstract: Knowledge distillation is an effective approach to learning compact models (students) under the supervision of large and strong models (teachers). Because there is empirically a strong correlation between the performance of teacher and student models, it is commonly believed that a high-performing teacher is preferred. Consequently, practitioners tend to use a well-trained network, or an ensemble of such networks, as the teacher. In this paper, we make an intriguing observation that an intermediate model, i.e., a checkpoint in the middle of the training procedure, often serves as a better teacher than the fully converged model, although the former has much lower accuracy. More surprisingly, a weak snapshot ensemble of several intermediate models from the same training trajectory can outperform a strong ensemble of independently trained and fully converged models when they are used as teachers. We show that this phenomenon can be partially explained by the information bottleneck principle: the feature representations of intermediate models can have higher mutual information regarding the input, and thus contain more "dark knowledge" for effective distillation. We further propose an optimal intermediate teacher selection algorithm based on maximizing the total task-related mutual information. Experiments verify its effectiveness and applicability.
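As a rough illustration of the setup the abstract describes, the sketch below builds a snapshot-ensemble teacher by averaging the softened predictions of several checkpoints and scores a student against it. It assumes the common temperature-scaled KL distillation loss, which the abstract does not specify. The function names (`softmax`, `snapshot_teacher`, `distillation_loss`) and the example logits are hypothetical, and the paper's mutual-information-based teacher selection is not implemented here.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def snapshot_teacher(checkpoint_logits, temperature=1.0):
    """Average the softened predictions of several checkpoints.

    checkpoint_logits holds one logit vector per checkpoint, all for the
    same input. Which checkpoints to include is left to the caller; this
    is NOT the paper's selection algorithm.
    """
    probs = [softmax(logits, temperature) for logits in checkpoint_logits]
    count = len(probs)
    return [sum(p[k] for p in probs) / count for k in range(len(probs[0]))]


def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    """KL(teacher || student) on softened outputs, scaled by T^2.

    The T^2 factor is the usual convention for this loss (an assumption
    here, not taken from the abstract).
    """
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )
    return temperature ** 2 * kl


# Hypothetical logits from two intermediate checkpoints for one input.
teacher = snapshot_teacher([[2.0, 1.0, 0.1], [1.5, 1.2, 0.3]], temperature=4.0)
loss = distillation_loss([0.0, 0.0, 0.0], teacher, temperature=4.0)
```

In a real training loop this term would typically be combined with the ordinary cross-entropy loss on the ground-truth labels and computed on batched tensors in a deep-learning framework.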

* Accepted by NeurIPS 2022

View paper on OpenReview
