Title: RevColV2: Exploring Disentangled Representations in Masked Image Modeling

Sep 02, 2023

Qi Han, Yuxuan Cai, Xiangyu Zhang

Abstract: Masked image modeling (MIM) has become a prevalent pre-training setup for vision foundation models and attains promising performance. Despite this success, existing MIM methods discard the decoder network in downstream applications, which makes representations inconsistent between pre-training and fine-tuning and can hamper downstream task performance. In this paper, we propose a new architecture, RevColV2, which tackles this issue by keeping the entire autoencoder architecture during both pre-training and fine-tuning. The main body of RevColV2 contains bottom-up columns and top-down columns, between which information is reversibly propagated and gradually disentangled. This design gives our architecture a useful property: it maintains disentangled low-level and semantic information at the end of the network in MIM pre-training. Our experimental results suggest that a foundation model with decoupled features can achieve competitive performance across multiple downstream vision tasks such as image classification, semantic segmentation, and object detection. For example, after intermediate fine-tuning on the ImageNet-22K dataset, RevColV2-L attains 88.4% top-1 accuracy on ImageNet-1K classification and 58.6 mIoU on ADE20K semantic segmentation. With an extra teacher and a large-scale dataset, RevColV2-L achieves 62.1 box AP on COCO detection and 60.4 mIoU on ADE20K semantic segmentation. Code and models are released at https://github.com/megvii-research/RevCol
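
To make "reversibly propagated" concrete, here is a minimal sketch of a generic additive reversible coupling, in which the inputs can be recovered exactly from the outputs. It is an illustration only and not the RevColV2 formulation: the functions f and g are hypothetical stand-ins for learned sub-networks, and the two-stream layout is an assumption made for brevity.

```python
import math

def f(v):
    # Hypothetical stand-in for a learned sub-network (assumption, not from the paper).
    return [math.tanh(x) for x in v]

def g(v):
    # Second hypothetical sub-network; it does not need to be invertible itself.
    return [0.5 * x * x for x in v]

def forward(x1, x2):
    # Additive coupling: each output adds a function of the other stream.
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def inverse(y1, y2):
    # Undo the two additions in reverse order to recover the inputs.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

if __name__ == "__main__":
    x1, x2 = [1.0, -2.0], [0.5, 3.0]
    y1, y2 = forward(x1, x2)
    r1, r2 = inverse(y1, y2)
    print(r1, r2)  # matches x1, x2 up to floating-point rounding
```

Because the inverse needs only the outputs, intermediate activations can be recomputed rather than stored, which is the usual motivation for reversible designs; how RevColV2 arranges its bottom-up and top-down columns is described in the paper and the linked repository.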
