Title: Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders

Oct 05, 2022

Youngwan Lee, Jeffrey Willette, Jonghee Kim, Juho Lee, Sung Ju Hwang

Abstract: Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers. A representative MIM model, the masked auto-encoder (MAE), randomly masks a subset of image patches and reconstructs the masked patches from the unmasked ones. Concurrently, many recent works in self-supervised learning use the student/teacher paradigm, which provides the student with an additional target based on the output of a teacher composed of an exponential moving average (EMA) of previous students. Although this paradigm is common, relatively little is known about the dynamics of the interaction between the student and teacher. Through analysis of a simple linear model, we find that the teacher conditionally removes previous gradient directions based on feature similarities, which effectively acts as a conditional momentum regularizer. From this analysis, we present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), obtained by adding an EMA teacher to MAE. We find that RC-MAE converges faster and requires less memory than state-of-the-art self-distillation methods during pre-training, which may make the otherwise prohibitively expensive self-supervised learning of Vision Transformer models more practical. Additionally, we show that RC-MAE is more robust and performs better than MAE on downstream tasks such as ImageNet-1K classification, object detection, and instance segmentation.
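
The three ingredients named in the abstract (random patch masking, an EMA teacher, and a reconstruction objective paired with a student/teacher consistency term) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 0.75 mask ratio, the momentum value, and the equal-weight sum of two mean-squared-error terms are all assumptions made for illustration, and plain Python lists stand in for model weights and predictions.

```python
import random


def ema_update(teacher, student, momentum=0.99):
    """Return new teacher weights as an exponential moving average.

    Each teacher weight moves a fraction (1 - momentum) toward the
    corresponding student weight. The default momentum is an assumed value.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def random_mask(num_patches, mask_ratio=0.75, rng=None):
    """Split patch indices into (masked, visible) at random.

    mask_ratio=0.75 is an assumed default, not a value from the abstract.
    """
    rng = rng or random.Random()
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(rng.sample(range(num_patches), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_patches) if i not in masked_set]
    return masked, visible


def combined_loss(student_pred, teacher_pred, target):
    """Reconstruction MSE plus consistency MSE (assumed equal weighting).

    The reconstruction term compares the student to the true patch values;
    the consistency term compares the student to the teacher's prediction.
    """
    n = len(target)
    recon = sum((p - y) ** 2 for p, y in zip(student_pred, target)) / n
    consist = sum((p - q) ** 2 for p, q in zip(student_pred, teacher_pred)) / n
    return recon + consist
```

In a training loop built on this sketch, the student would be updated by gradient descent on `combined_loss` over the masked patches, and `ema_update` would be called after each step so that the teacher tracks an average of previous students.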

* pre-print

View paper on OpenReview
