Title: Less or More From Teacher: Exploiting Trilateral Geometry For Knowledge Distillation

Jan 01, 2024

Chengming Hu, Haolun Wu, Xuan Li, Chen Ma, Xi Chen, Jun Yan, Boyu Wang, Xue Liu

Figures 1-4 for Less or More From Teacher: Exploiting Trilateral Geometry For Knowledge Distillation

Abstract: Knowledge distillation aims to train a compact student network using soft supervision from a larger teacher network and hard supervision from ground truths. However, determining an optimal knowledge fusion ratio that balances these supervisory signals remains challenging. Prior methods generally resort to a constant or heuristic-based fusion ratio, which often falls short of a proper balance. In this study, we introduce a novel adaptive method for learning a sample-wise knowledge fusion ratio, exploiting both the correctness of teacher and student, as well as how well the student mimics the teacher on each sample. Our method naturally leads to the intra-sample trilateral geometric relations among the student prediction ($S$), teacher prediction ($T$), and ground truth ($G$). To counterbalance the impact of outliers, we further extend to the inter-sample relations, incorporating the teacher's global average prediction $\bar{T}$ for samples within the same class. A simple neural network then learns the implicit mapping from the intra- and inter-sample relations to an adaptive, sample-wise knowledge fusion ratio in a bilevel-optimization manner. Our approach provides a simple, practical, and adaptable solution for knowledge distillation that can be employed across various architectures and model sizes. Extensive experiments demonstrate consistent improvements over other loss re-weighting methods on image classification, attack detection, and click-through rate prediction.
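
To make the idea of a sample-wise fusion ratio concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes the fused loss takes the form alpha * CE(S, G) + (1 - alpha) * KL(T || S), that the trilateral relations are encoded as plain difference vectors among $S$, $T$, $G$, and $\bar{T}$, and that a single sigmoid unit stands in for the "simple neural network". The function names, the feature encoding, and the weights are all hypothetical, and the bilevel optimization that would train the weights is omitted.

```python
import math

def cross_entropy(student, target):
    """Hard-label loss: cross-entropy of student probabilities against one-hot G."""
    return -sum(g * math.log(s) for s, g in zip(student, target) if g > 0)

def kl_divergence(teacher, student):
    """Soft-label loss: KL(T || S) between teacher and student probabilities."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

def geometry_features(S, T, G, T_bar):
    """Assumed encoding of the relations as concatenated difference vectors.

    Intra-sample edges: S-G, T-G, S-T. Inter-sample edge: S-T_bar, where
    T_bar is the teacher's average prediction for the sample's class.
    """
    def diff(a, b):
        return [x - y for x, y in zip(a, b)]
    return diff(S, G) + diff(T, G) + diff(S, T) + diff(S, T_bar)

def fusion_ratio(features, weights, bias):
    """Single sigmoid unit (a stand-in for the small network) giving alpha in (0, 1)."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def fused_loss(S, T, G, T_bar, weights, bias):
    """Per-sample loss alpha * CE + (1 - alpha) * KD; returns (loss, alpha)."""
    alpha = fusion_ratio(geometry_features(S, T, G, T_bar), weights, bias)
    loss = alpha * cross_entropy(S, G) + (1.0 - alpha) * kl_divergence(T, S)
    return loss, alpha

# Illustrative three-class sample; all values are made up.
S = [0.7, 0.2, 0.1]      # student prediction
T = [0.6, 0.3, 0.1]      # teacher prediction
G = [1.0, 0.0, 0.0]      # one-hot ground truth
T_bar = [0.8, 0.1, 0.1]  # teacher's class-average prediction
weights = [0.0] * 12     # untrained weights: alpha defaults to 0.5
loss, alpha = fused_loss(S, T, G, T_bar, weights, bias=0.0)
```

With all-zero weights the ratio is 0.5 for every sample, which corresponds to the constant-ratio baseline the abstract contrasts against; a trained mapping would instead shift alpha per sample depending on how $S$, $T$, $G$, and $\bar{T}$ relate.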
