Abstract:Knowledge distillation transfers knowledge from a large model to a small one via task and distillation losses. In this paper, we observe a trade-off between the task and distillation losses, i.e., introducing the distillation loss limits the convergence of the task loss. We believe that the trade-off results from insufficient optimization of the distillation loss. The reason is that the teacher has a lower task loss than the student, and a lower distillation loss drives the student closer to the teacher, so a better-converged task loss could be obtained. To break the trade-off, we propose the Distillation-Oriented Trainer (DOT). DOT separately considers the gradients of the task and distillation losses, then applies a larger momentum to the distillation loss to accelerate its optimization. We empirically show that DOT breaks the trade-off, i.e., both losses are sufficiently optimized. Extensive experiments validate the superiority of DOT. Notably, DOT achieves a +2.59% accuracy improvement on ImageNet-1k for the ResNet50-MobileNetV1 pair. Conclusively, DOT greatly benefits the student's optimization in terms of loss convergence and model generalization. Code will be made publicly available.
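Below is a minimal sketch of the optimizer-level idea described in this abstract, assuming a plain SGD-with-momentum update in which the distillation gradient receives a larger momentum coefficient (mu + delta) and the task gradient a smaller one (mu - delta). The class name, the delta value, and the exact update rule are illustrative assumptions, not the authors' reference implementation.

```python
import torch

class DOTSketch:
    """Keeps two momentum buffers, one for the task gradient and one for the
    distillation gradient, and accelerates the latter with a larger momentum."""

    def __init__(self, params, lr=0.1, mu=0.9, delta=0.075):
        self.params = list(params)
        self.lr = lr
        self.mu_task = mu - delta   # decelerated momentum for the task loss
        self.mu_kd = mu + delta     # accelerated momentum for the distillation loss
        self.v_task = [torch.zeros_like(p) for p in self.params]
        self.v_kd = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self, task_grads, kd_grads):
        # task_grads / kd_grads come from two separate backward passes, e.g.
        # torch.autograd.grad(task_loss, params, retain_graph=True) and
        # torch.autograd.grad(kd_loss, params).
        for p, g_t, g_k, v_t, v_k in zip(self.params, task_grads, kd_grads,
                                         self.v_task, self.v_kd):
            v_t.mul_(self.mu_task).add_(g_t)
            v_k.mul_(self.mu_kd).add_(g_k)
            p.add_(v_t + v_k, alpha=-self.lr)
```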
Abstract:Distilling knowledge from convolutional neural networks (CNNs) is a double-edged sword for vision transformers (ViTs). It boosts performance, since the image-friendly local inductive bias of CNNs helps ViTs learn faster and better, but it also leads to two problems: (1) The network designs of CNNs and ViTs are completely different, which leads to different semantic levels of intermediate features, making spatial-wise knowledge transfer methods (e.g., feature mimicking) inefficient. (2) Distilling knowledge from a CNN limits the network's convergence in the later training period, since the ViT's capability of integrating global information is suppressed by the CNN's local-inductive-bias supervision. To this end, we present Cumulative Spatial Knowledge Distillation (CSKD). CSKD distills spatial-wise knowledge to all patch tokens of the ViT from the corresponding spatial responses of the CNN, without introducing intermediate features. Furthermore, CSKD exploits a Cumulative Knowledge Fusion (CKF) module, which introduces the global response of the CNN and increasingly emphasizes its importance during training. Applying CKF leverages the CNN's local inductive bias in the early training period and gives full play to the ViT's global capability in the later one. Extensive experiments and analysis on ImageNet-1k and downstream datasets demonstrate the superiority of our CSKD. Code will be publicly available.
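As a rough illustration of the distillation target described above, the sketch below builds a per-token target by mixing the CNN's dense spatial response with its global response, with the weight on the global response growing over training. The linear schedule, tensor layout, and the assumption that patch tokens already yield classification logits are illustrative; the paper's CKF module may differ in detail.

```python
import torch
import torch.nn.functional as F

def cskd_loss(vit_patch_logits,    # (B, N, C): logits from ViT patch tokens
              cnn_spatial_logits,  # (B, N, C): CNN dense responses at matching locations
              cnn_global_logits,   # (B, C):    CNN global (pooled) response
              step, total_steps, tau=1.0):
    # Cumulative Knowledge Fusion (assumed linear schedule): gradually shift the
    # per-token target from the CNN's local spatial response toward its global one.
    alpha = min(step / total_steps, 1.0)
    target = (1 - alpha) * cnn_spatial_logits + alpha * cnn_global_logits.unsqueeze(1)

    B, N, C = vit_patch_logits.shape
    p_teacher = F.softmax(target / tau, dim=-1).reshape(B * N, C)
    log_p_student = F.log_softmax(vit_patch_logits / tau, dim=-1).reshape(B * N, C)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (tau ** 2)
```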
Abstract:Diffusion models have recently achieved astonishing performance in generating high-fidelity photo-realistic images. Despite their huge success, it remains unclear whether synthetic images are applicable to knowledge distillation when real images are unavailable. In this paper, we extensively study whether and how synthetic images produced by state-of-the-art diffusion models can be used for knowledge distillation without access to real images, and obtain three key conclusions: (1) synthetic data from diffusion models can easily lead to state-of-the-art performance among existing synthesis-based distillation methods, (2) low-fidelity synthetic images are better teaching materials, and (3) relatively weak classifiers are better teachers. Code is available at https://github.com/zhengli97/DM-KD.
Abstract:New knowledge originates from the old. The various types of elements deposited in the training history are a wealth of resources for improving the learning of deep models. In this survey, we comprehensively review and summarize the topic ``Historical Learning: Learning Models with Learning History'', which learns better neural models with the help of their learning history during optimization, from three detailed aspects: Historical Type (what), Functional Part (where) and Storage Form (how). To the best of our knowledge, it is the first survey that systematically studies the methodologies which make use of various historical statistics when training deep neural networks. We also discuss related topics such as recurrent/memory networks, ensemble learning, and reinforcement learning. In addition, we expose future challenges of this topic and encourage the community to consider historical learning principles when designing algorithms. The paper list related to historical learning is available at \url{https://github.com/Martinser/Awesome-Historical-Learning}.
Abstract:Semi-supervised learning (SSL) has attracted enormous attention due to its vast potential for mitigating the dependence on large labeled datasets. The latest methods (e.g., FixMatch) use a combination of consistency regularization and pseudo-labeling to achieve remarkable successes. However, these methods all suffer from wasting hard examples, since all pseudo-labels have to be selected by a high threshold to filter out noisy ones; hence, examples with ambiguous predictions do not contribute to the training phase. To better leverage all unlabeled examples, we propose two novel techniques: Entropy Meaning Loss (EML) and Adaptive Negative Learning (ANL). EML incorporates the prediction distribution of non-target classes into the optimization objective to avoid competition with the target class, and thus generates more high-confidence predictions for pseudo-label selection. ANL introduces additional negative pseudo-labels for all unlabeled data to leverage low-confidence examples; it adaptively allocates these labels by dynamically evaluating the top-k performance of the model. EML and ANL do not introduce any additional parameters or hyperparameters. We integrate these techniques with FixMatch and develop a simple yet powerful framework called FullMatch. Extensive experiments on several common SSL benchmarks (CIFAR-10/100, SVHN, STL-10 and ImageNet) demonstrate that FullMatch exceeds FixMatch by a large margin. Integrated with FlexMatch (an advanced FixMatch-based framework), we achieve state-of-the-art performance. Source code is at https://github.com/megvii-research/FullMatch.
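The following is a hedged sketch of the negative-learning idea only: classes outside the model's current top-k predictions receive a negative pseudo-label, and their predicted probabilities are pushed toward zero. Here k is fixed for simplicity, whereas the paper determines it adaptively from the model's top-k performance; the function name and loss form are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def negative_learning_loss(logits, k=3, eps=1e-7):
    # logits: (B, C) predictions on weakly augmented unlabeled samples.
    probs = F.softmax(logits, dim=-1)
    topk = probs.topk(k, dim=-1).indices                      # classes kept as "possible"
    neg_mask = torch.ones_like(probs).scatter_(1, topk, 0.0)  # 1 for negative classes
    # Encourage low probability on negative classes: -log(1 - p).
    loss = -(torch.log(1 - probs + eps) * neg_mask).sum(dim=-1) / neg_mask.sum(dim=-1)
    return loss.mean()
```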
Abstract:Most existing distillation methods ignore the flexible role of the temperature in the loss function and fix it as a hyper-parameter that can be decided by an inefficient grid search. In general, the temperature controls the discrepancy between two distributions and can faithfully determine the difficulty level of the distillation task. Keeping a constant temperature, i.e., a fixed level of task difficulty, is usually sub-optimal for a growing student during its progressive learning stages. In this paper, we propose a simple curriculum-based technique, termed Curriculum Temperature for Knowledge Distillation (CTKD), which controls the task difficulty level during the student's learning career through a dynamic and learnable temperature. Specifically, following an easy-to-hard curriculum, we gradually increase the distillation loss w.r.t. the temperature, leading to increased distillation difficulty in an adversarial manner. As an easy-to-use plug-in technique, CTKD can be seamlessly integrated into existing knowledge distillation frameworks and brings general improvements at a negligible additional computation cost. Extensive experiments on CIFAR-100, ImageNet-2012, and MS-COCO demonstrate the effectiveness of our method. Our code is available at https://github.com/zhengli97/CTKD.
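A minimal sketch of a learnable temperature trained adversarially through a gradient-reversal layer, in the spirit of CTKD: the student minimizes the distillation loss while the temperature is updated to increase it, with the reversal strength lam following an easy-to-hard schedule. Module names, the clamping, and the detached temperature scaling are assumptions, not the released implementation; teacher_logits are assumed to be detached by the caller.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient: the temperature is updated to
        # *increase* the distillation loss, making the task gradually harder.
        return -ctx.lam * grad_output, None

class CurriculumTemperature(nn.Module):
    def __init__(self, init_tau=4.0):
        super().__init__()
        self.tau = nn.Parameter(torch.tensor(init_tau))

    def forward(self, student_logits, teacher_logits, lam):
        tau = GradReverse.apply(self.tau, lam).clamp(min=1e-2)
        p_t = F.softmax(teacher_logits / tau, dim=-1)
        log_p_s = F.log_softmax(student_logits / tau, dim=-1)
        # Detach tau in the usual T^2 scaling so only the KL term drives it.
        return F.kl_div(log_p_s, p_t, reduction="batchmean") * (tau.detach() ** 2)
```

In practice, the temperature parameter would be registered with the same optimizer as the student, and lam ramped up (e.g., from 0 to 1) over training to realize the easy-to-hard curriculum.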
Abstract:State-of-the-art distillation methods are mainly based on distilling deep features from intermediate layers, while the significance of logit distillation is greatly overlooked. To provide a novel viewpoint to study logit distillation, we reformulate the classical KD loss into two parts, i.e., target class knowledge distillation (TCKD) and non-target class knowledge distillation (NCKD). We empirically investigate and prove the effects of the two parts: TCKD transfers knowledge concerning the "difficulty" of training samples, while NCKD is the prominent reason why logit distillation works. More importantly, we reveal that the classical KD loss is a coupled formulation, which (1) suppresses the effectiveness of NCKD and (2) limits the flexibility to balance these two parts. To address these issues, we present Decoupled Knowledge Distillation (DKD), enabling TCKD and NCKD to play their roles more efficiently and flexibly. Compared with complex feature-based methods, our DKD achieves comparable or even better results and has better training efficiency on CIFAR-100, ImageNet, and MS-COCO datasets for image classification and object detection tasks. This paper proves the great potential of logit distillation, and we hope it will be helpful for future research. The code is available at https://github.com/megvii-research/mdistiller.
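A compact sketch of the decoupled formulation: TCKD is the KL divergence between the binary (target vs. non-target) distributions, and NCKD is the KL divergence between the distributions restricted to non-target classes, combined with independent weights alpha and beta. The hyperparameter values below are illustrative defaults, not prescriptions from the paper.

```python
import torch
import torch.nn.functional as F

def dkd_loss(student_logits, teacher_logits, target, alpha=1.0, beta=8.0, tau=4.0):
    B, C = student_logits.shape
    gt_mask = F.one_hot(target, C).bool()

    p_s = F.softmax(student_logits / tau, dim=-1)
    p_t = F.softmax(teacher_logits / tau, dim=-1)

    # TCKD: KL between binary (target vs. non-target) probability pairs.
    b_s = torch.stack([(p_s * gt_mask).sum(-1), (p_s * ~gt_mask).sum(-1)], dim=-1)
    b_t = torch.stack([(p_t * gt_mask).sum(-1), (p_t * ~gt_mask).sum(-1)], dim=-1)
    tckd = F.kl_div(b_s.log(), b_t, reduction="batchmean")

    # NCKD: KL between distributions over non-target classes only
    # (the target logit is masked out before the softmax).
    log_nt_s = F.log_softmax(student_logits / tau - 1000.0 * gt_mask, dim=-1)
    nt_t = F.softmax(teacher_logits / tau - 1000.0 * gt_mask, dim=-1)
    nckd = F.kl_div(log_nt_s, nt_t, reduction="batchmean")

    return (alpha * tckd + beta * nckd) * (tau ** 2)
```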
Abstract:Mix-based augmentation has been proven fundamental to the generalization of deep vision models. However, current augmentations only mix samples within the current data batch during training, ignoring the knowledge accumulated in the learning history. In this paper, we propose a recursive mixed-sample learning paradigm, termed "RecursiveMix" (RM), by exploring a novel training strategy that leverages historical input-prediction-label triplets. More specifically, we iteratively resize the input image batch from the previous iteration and paste it into the current batch, while their labels are fused proportionally to the area of the operated patches. Further, a consistency loss is introduced to align the identical image semantics across iterations, which helps the learning of scale-invariant feature representations. Based on ResNet-50, RM largely improves classification accuracy by $\sim$3.2\% on CIFAR-100 and $\sim$2.8\% on ImageNet with negligible extra computation/storage costs. In the downstream object detection task, the RM-pretrained model outperforms the baseline by 2.1 AP points and surpasses CutMix by 1.4 AP points under the ATSS detector on COCO. In semantic segmentation, RM also surpasses the baseline and CutMix by 1.9 and 1.1 mIoU points under UperNet on ADE20K, respectively. Codes and pretrained models are available at \url{https://github.com/megvii-research/RecursiveMix}.
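The sketch below illustrates one RecursiveMix-style training step under simplifying assumptions: the previous batch is resized by a fixed factor and pasted into the top-left corner (the paper samples the scale and is not restricted to a corner), labels are fused by the pasted area, a constant batch size is assumed, and a crude full-image KL term stands in for the paper's ROI-aligned consistency loss.

```python
import torch
import torch.nn.functional as F

def recursive_mix_step(model, images, labels, history, num_classes, lam=0.5):
    # images: (B, 3, H, W), labels: (B,); history carries the previous iteration's
    # mixed inputs, soft labels, and detached predictions (or None at step 0).
    B, _, H, W = images.shape
    y = F.one_hot(labels, num_classes).float()

    if history is not None:
        prev_images, prev_y, prev_logits = history
        h, w = int(H * lam), int(W * lam)
        patch = F.interpolate(prev_images, size=(h, w), mode="bilinear",
                              align_corners=False)
        images = images.clone()
        images[:, :, :h, :w] = patch              # paste the resized previous batch
        area = (h * w) / (H * W)
        y = (1 - area) * y + area * prev_y        # fuse labels by the pasted area

    logits = model(images)
    loss = -(y * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()  # soft cross-entropy

    if history is not None:
        # Crude stand-in for the paper's ROI-aligned consistency loss: align the
        # current prediction with last iteration's detached prediction.
        loss = loss + F.kl_div(F.log_softmax(logits, dim=-1),
                               F.softmax(prev_logits, dim=-1),
                               reduction="batchmean")

    history = (images.detach(), y.detach(), logits.detach())
    return loss, history
```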
Abstract:This work simultaneously considers the discriminability and transferability properties of deep representations in the typical supervised learning task, i.e., image classification. Through a comprehensive temporal analysis, we observe a trade-off between these two properties: discriminability keeps increasing as training progresses, while transferability diminishes sharply in the later training period. From the perspective of information-bottleneck theory, we reveal that the incompatibility between discriminability and transferability is attributed to the over-compression of input information. More importantly, we investigate why and how the InfoNCE loss can alleviate this over-compression, and further present a learning framework, named contrastive temporal coding~(CTC), to counteract the over-compression and alleviate the incompatibility. Extensive experiments validate that CTC successfully mitigates the incompatibility, yielding discriminative and transferable representations. Noticeable improvements are achieved on the image classification task and on challenging transfer learning tasks. We hope that this work will raise the significance of the transferability property in the conventional supervised learning setting. Code will be publicly available.
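As a pointer to the mechanism, here is a minimal InfoNCE-style objective between current features and stored features of the same images from an earlier training stage, roughly in the spirit of contrastive temporal coding; the memory handling, temperature value, and function name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def temporal_infonce(curr_feat, hist_feat, tau=0.07):
    # curr_feat, hist_feat: (B, D); hist_feat holds the stored representation of
    # the same samples from an earlier training stage (e.g., a memory bank).
    curr = F.normalize(curr_feat, dim=-1)
    hist = F.normalize(hist_feat, dim=-1)
    logits = curr @ hist.t() / tau                       # (B, B) similarity matrix
    labels = torch.arange(curr.size(0), device=curr.device)
    # Positive pair: the historical feature of the same image; others are negatives.
    return F.cross_entropy(logits, labels)
```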
Abstract:Fine-grained image classification is a challenging computer vision task in which various species share similar visual appearances, resulting in misclassification if predictions rely merely on visual cues. Therefore, it is helpful to leverage additional information, e.g., the locations and dates at which the images were taken, which are easily accessible but rarely exploited. In this paper, we first demonstrate that existing multimodal methods fuse multiple features only along a single dimension, which provides insufficient help for feature discrimination. To fully explore the potential of multimodal information, we propose a dynamic MLP on top of the image representation, which interacts with multimodal features at a higher and broader dimension. The dynamic MLP is an efficient structure parameterized by the learned embeddings of variable locations and dates. It can be regarded as an adaptive nonlinear projection for generating more discriminative image representations in visual tasks. To the best of our knowledge, it is the first attempt to explore the idea of dynamic networks for exploiting multimodal information in fine-grained image classification tasks. Extensive experiments demonstrate the effectiveness of our method. The t-SNE algorithm visually indicates that our technique improves the separability of image representations that are visually similar but belong to different categories. Furthermore, among published works across multiple fine-grained datasets, dynamic MLP consistently achieves SOTA results (https://paperswithcode.com/dataset/inaturalist) and takes third place in the iNaturalist challenge at FGVC8 (https://www.kaggle.com/c/inaturalist-2021/leaderboard). Code is available at https://github.com/ylingfeng/DynamicMLP.git
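A hedged sketch of the dynamic-MLP idea: a small meta-branch turns the embedded location/date metadata into the weights of a per-sample linear projection applied to a reduced image feature. The single dynamic layer, the class name, and the layer sizes are illustrative assumptions; the released code may differ in structure.

```python
import torch
import torch.nn as nn

class DynamicMLPSketch(nn.Module):
    def __init__(self, img_dim=2048, meta_dim=64, hidden_dim=64):
        super().__init__()
        self.reduce = nn.Linear(img_dim, hidden_dim)
        # The meta branch predicts a (hidden_dim x hidden_dim) weight matrix
        # from the embedded location/date metadata.
        self.weight_gen = nn.Sequential(
            nn.Linear(meta_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim * hidden_dim),
        )
        self.expand = nn.Linear(hidden_dim, img_dim)

    def forward(self, img_feat, meta_embed):
        # img_feat: (B, img_dim), meta_embed: (B, meta_dim)
        x = self.reduce(img_feat)                          # (B, h)
        w = self.weight_gen(meta_embed)                    # (B, h*h)
        w = w.view(-1, x.size(1), x.size(1))               # (B, h, h)
        x = torch.bmm(x.unsqueeze(1), w).squeeze(1)        # per-sample projection
        x = torch.relu(x)
        return self.expand(x)                              # enhanced image feature
```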