Published as a conference paper at ICLR 2023

MASKED IMAGE MODELING WITH DENOISING CONTRAST

Kun Yi1, Yixiao Ge1, Xiaotong Li1,4, Shusheng Yang1,5, Dian Li3, Jianping Wu6, Ying Shan1, Xiaohu Qie2
1ARC Lab, Tencent PCG  2Tencent PCG  3Foundation Technology Center, Tencent PCG  4Peking University  5Huazhong University of Science and Technology  6Tsinghua University
Equal contribution. Corresponding author.

ABSTRACT

Throughout the development of self-supervised visual representation learning from contrastive learning to masked image modeling (MIM), the essence has not changed: how to design proper pretext tasks for vision dictionary look-up. MIM recently dominates this line of research with state-of-the-art performance on vision Transformers (ViTs), where the core is to enhance the patch-level visual context capturing of the network via a denoising auto-encoding mechanism. Rather than tailoring image tokenizers with extra training stages as in previous works, we unleash the great potential of contrastive learning on denoising auto-encoding and introduce a pure MIM method, ConMIM, which produces simple intra-image inter-patch contrastive constraints as the sole learning objectives for masked patch prediction. We further strengthen the denoising mechanism with asymmetric designs, including asymmetric image perturbations and asymmetric model progress rates, to improve the network pre-training. ConMIM-pretrained models of various scales achieve competitive results on downstream image classification, semantic segmentation, object detection, and instance segmentation tasks, e.g., on ImageNet-1K classification, we achieve 83.9% top-1 accuracy with ViT-Small and 85.3% with ViT-Base without extra data for pre-training. Code will be available at https://github.com/TencentARC/ConMIM.

1 INTRODUCTION

The great success of self-supervised learning in natural language processing (NLP) tasks, e.g., BERT (Devlin et al., 2019) and GPT (Radford et al., 2018; 2019), has sparked several revolutions in visual representation learning, during which the development of vision dictionary look-up is the most critical. In the age of convolutional neural networks (CNNs) (He et al., 2016; Krizhevsky et al., 2012), prominent works (He et al., 2020; Chen et al., 2020) perform self-supervised learning with a pretext task of instance-level dictionary look-up via contrastive learning, as demonstrated in Figure 1(a). With the advent of vision Transformers (ViTs) (Dosovitskiy et al., 2021), the gap between vision and NLP tasks has been further narrowed since the introduction of patch-level dictionary look-up via masked image modeling in the pioneering work BEiT (Bao et al., 2022) (see Figure 1(b)).

The introduction of masked image modeling (Bao et al., 2022), inspired by masked language modeling (Devlin et al., 2019) in NLP tasks, ushers in a new wave of self-supervised learning with vision Transformers (Dosovitskiy et al., 2021), i.e., a portion of vision tokens are randomly masked and then recovered by the Transformer network being trained. Concurrent works (Dong et al., 2021; Li et al., 2022; Wei et al., 2022) make efforts to design patch-level dictionaries, in other words image tokenizers, to build proper learning objectives (i.e., vision token ids) for masked image modeling.
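The masking step shared by these MIM methods can be sketched as below; this is a minimal illustration under assumed shapes and a placeholder mask ratio, not any particular paper's implementation. A random subset of patch embeddings is replaced by a learnable [MASK] token before the sequence is fed to the Transformer, which is then trained to recover the masked positions.

```python
import torch
import torch.nn as nn

def random_patch_masking(patch_embeds: torch.Tensor,
                         mask_token: torch.Tensor,
                         mask_ratio: float = 0.6):
    """patch_embeds: (B, N, D) patch embeddings; mask_token: (D,) learnable vector."""
    B, N, D = patch_embeds.shape
    num_mask = int(N * mask_ratio)
    # Draw an independent random permutation of patch indices for every image.
    ids = torch.rand(B, N, device=patch_embeds.device).argsort(dim=1)
    mask = torch.zeros(B, N, dtype=torch.bool, device=patch_embeds.device)
    mask[torch.arange(B, device=patch_embeds.device).unsqueeze(1), ids[:, :num_mask]] = True
    # Replace masked patch embeddings with the shared learnable [MASK] token.
    corrupted = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), patch_embeds)
    return corrupted, mask

# Hypothetical usage: 196 patches (14x14) of a 224x224 image with 768-d embeddings.
# embeds = torch.randn(8, 196, 768)
# corrupted, mask = random_patch_masking(embeds, nn.Parameter(torch.zeros(768)))
```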
Though advanced results can be achieved, the off-the-shelf image tokenizers, e.g., the discrete VAE (Ramesh et al., 2021) used in BEiT (Bao et al., 2022), depend on extra training stages and extra data knowledge, rendering an inflexible two-stage pre-training paradigm.

[Figure 1 illustrates three paradigms: (a) conventional contrastive learning with an instance-level dynamic dictionary; (b) conventional masked image modeling with a patch-level static dictionary built by an off-the-shelf tokenizer; (c) masked image modeling with denoising contrast, which looks up a patch-level dynamic dictionary via masking and denoising.]
Figure 1: Conventional contrastive learning methods (e.g., MoCo (He et al., 2020), SimCLR (Chen et al., 2020)) and masked image modeling methods (e.g., BEiT (Bao et al., 2022), PeCo (Dong et al., 2021)) both perform the pretext task of vision dictionary look-up, where the superiority of the latter lies in the patch-level denoising auto-encoding mechanism that enables fine-grained visual context understanding of vision Transformers (Dosovitskiy et al., 2021). We propose to cast masked image modeling as denoising contrastive learning to avoid the extra training stage of an image tokenizer, rendering a flexible, simple, and effective pre-training paradigm.

We would like to call for a revisit of the superiority of masked image modeling over contrastive learning for self-supervised learning with vision Transformers. Since both are essentially designed towards vision dictionary look-up, the key difference lies in the patch-level denoising auto-encoding mechanism of masked image modeling, which encourages the network's capability to capture fine-grained visual context and semantics. As for the auto-encoding objective, we do not have to intentionally discretize continuous visual signals into word-like tokens as in NLP tasks in order to cast masked prediction as a classification task. Instead, we can give full play to the wisdom of contrastive learning, which is well suited to structuring the visual space with semantically meaningful representations.

To this end, we introduce a new pre-training method for masked image modeling, namely ConMIM, which gets rid of extra tokenizing networks by revitalizing contrastive learning, as shown in Figure 1(c). ConMIM casts masked patch prediction in self-supervised image pre-training as denoising contrastive learning. The corrupted input, with a large proportion of patches masked, is fed into the encoder, in general a plain vision Transformer. The encoder learns to recover the representations of the masked patches, which are predicted by feeding the full input into the encoder. The training objective takes the form of an intra-image inter-patch contrastive loss. To be specific, the patch representations of a full input image build a dynamic dictionary; the patches at the same positions as the masked ones of the corrupted input serve as their respective positive keys, and the remaining ones, from different positions within the same image, are the negative keys.

To further improve the network via a stronger denoising auto-encoding mechanism, we introduce asymmetric designs in ConMIM training, including asymmetric image perturbations and asymmetric model progress rates. We adopt a strong augmentation for the full input and a weak augmentation for the corrupted input. For the image encoder, a slowly progressing momentum encoder (He et al., 2020) is employed for the full input, so as to embed more challenging but semantically consistent learning targets.
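To make the learning objective concrete, the following sketch shows how an intra-image inter-patch contrastive (InfoNCE-style) loss with a momentum-encoded dictionary could be written; the function name, shapes, and temperature are illustrative assumptions rather than the official ConMIM implementation.

```python
import torch
import torch.nn.functional as F

def denoising_patch_contrastive_loss(student_feats: torch.Tensor,
                                     teacher_feats: torch.Tensor,
                                     mask: torch.Tensor,
                                     temperature: float = 0.1) -> torch.Tensor:
    """
    student_feats: (B, N, D) patch features of the corrupted (masked) view from the encoder.
    teacher_feats: (B, N, D) patch features of the full view from the momentum encoder.
    mask:          (B, N) bool tensor, True at positions masked in the corrupted view.
    """
    q = F.normalize(student_feats, dim=-1)            # queries: recovered masked patches
    k = F.normalize(teacher_feats, dim=-1).detach()   # keys: patch-level dynamic dictionary
    # Similarity of every query patch to every key patch within the same image.
    logits = torch.einsum("bqd,bkd->bqk", q, k) / temperature          # (B, N, N)
    # The positive key of a masked patch is the key at the same spatial position;
    # the remaining patches of the same image act as negative keys.
    targets = torch.arange(q.size(1), device=q.device).expand(q.size(0), -1)  # (B, N)
    return F.cross_entropy(logits[mask], targets[mask])
```

Only the masked positions contribute to the loss, and the teacher branch is detached so that gradients flow solely through the corrupted view, consistent with the denoising auto-encoding mechanism described above.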
We perform self-supervised learning with ConMIM on ImageNet (Deng et al., 2009), and then fine-tune the pre-trained vision Transformers of various scales on image classification, semantic segmentation, object detection, and instance segmentation. Unlike methods that employ large models with super-scale extra data knowledge, ConMIM excels especially on small-scale architectures, which pose a more challenging setting for effective pre-training as well as a more practical one for real-world applications. With a vanilla ViT-Small model, we achieve 83.9% top-1 accuracy using only ImageNet-1K, suggesting that useful knowledge is effectively exploited from the data. This significantly outperforms the baseline BEiT (Bao et al., 2022) and comparable MIM methods without tokenizers (e.g., MAE (He et al., 2022), iBOT (Zhou et al., 2022)), owing to the stronger semantically structured regularization in ConMIM. Beyond the promising results, we would like to draw public attention to the great untapped potential of contrastive learning, often regarded as outdated, in visual representation learning.

2 RELATED WORK

Self-supervised learning via vision dictionary look-up. The pretext task of contrastive learning (Chen et al., 2020; He et al., 2020; Caron et al., 2020) dominated self-supervised visual pre-training in the era of CNNs. Contrastive learning methods generally perform instance-level dictionary look-up: anchors are pulled closer to their positive keys while being pushed away from the negative keys. The establishment of vision dictionaries is critical for the contrastive regularization. For example, the seminal work MoCo (He et al., 2020) builds the vision dictionary with a first-in-first-out queue, driven by a momentum encoder. The concurrent work SimCLR (Chen et al., 2020) uses a large batch size to enlarge the dictionary with more negative keys. SwAV (Caron et al., 2020) further introduces an online clustering algorithm in an unsupervised manner, and the cluster assignments serve as the dictionary keys. Despite their great achievements with CNNs, these methods were gradually abandoned with the introduction of ViTs (Dosovitskiy et al., 2021), which lack inductive bias and thus require stronger supervision for better pre-training performance.

Researchers attempt to reproduce the success of masked language modeling (Devlin et al., 2019) in self-supervised learning of ViTs via patch-level dictionary look-up. Specifically, BEiT (Bao et al., 2022) introduces a new pretext task, namely masked image modeling, for visual pre-training. It tokenizes high-dimensional images into discrete vision tokens with a discrete VAE (Ramesh et al., 2021) to establish a static patch-level dictionary as in NLP tasks. A proportion of image patches are randomly masked, and the backbone network is then trained to recover the vision token ids of the masked patches, rendering a denoising mechanism. Follow-up works make efforts to further improve the static dictionaries, e.g., mc-BEiT (Li et al., 2022) introduces eased and refined dictionaries with multiple choices, and PeCo (Dong et al., 2021) proposes to produce perceptual-aware keys in the patch-level dictionary. Despite promising results, these methods all require extra training stages and even extra data to obtain a proper image tokenizer. We would also like to mention and thank the classic denoising auto-encoding methods (Vincent et al., 2010; Seung, 1997).
Though they did not mask patches in Transformers, these pilot works on auto-encoding and reconstruction have greatly inspired the deep learning community.

Tokenizer-free masked image modeling (MIM) methods. There are other recent works that cast MIM as a pixel-level reconstruction task (e.g., MAE (He et al., 2022)) or a self-distillation task (e.g., iBOT (Zhou et al., 2022)) rather than dictionary look-up. However, they fail to achieve competitive results under the same number of training epochs and perform especially unsatisfactorily on small-scale architectures due to their weak regression constraints (see Appendix B.4). Moreover, iBOT is not a pure MIM method, as it heavily depends on the vanilla DINO (Caron et al., 2021) loss (i.e., the global self-distillation loss on [CLS] tokens). It actually conducts MIM on top of DINO and argues that MIM alone hardly captures visual semantics. We would like to clarify that this is actually due to improper MIM constraints: contrastive learning has proven good at structuring the visual space, but it had not been successfully employed in MIM before. We propose a flexible, pure MIM method without extra dependencies such as an offline tokenizer or a global discrimination loss.

Dense contrast vs. denoising contrast. Some previous works on contrastive learning are devoted to taking local feature representations into consideration, e.g., DenseCL (Wang et al., 2021). Though the form of InfoNCE (Van den Oord et al., 2018) is similar, they differ significantly from our ConMIM in both motivation and method design. They focus on learning better pre-trained weights for dense downstream tasks, e.g., object detection and segmentation, but hardly encourage patch-level visual context reasoning, since theirs is a contrastive-only task, showing inferior performance on ViT pre-training. Moreover, DenseCL depends on the global discrimination loss to ensure correct local correspondences and needs to carefully balance the global and local constraints. Such a chicken-and-egg problem is seamlessly addressed by our well-designed denoising mechanism, including both the masking operation and the asymmetric designs. See Appendix B.4.1 for experimental comparisons. There are also some concurrent works (Tao et al., 2022; Huang et al., 2022) that study contrastive learning in MIM; ConMIM was conducted independently of them.

3 PRELIMINARIES

The pre-training-and-then-fine-tuning paradigm has been proven effective for visual representation learning and various downstream tasks, where self-supervised pre-training is the most popular. Since no ground-truth annotations are available, the design of pretext tasks is critical to the pre-training performance. Though driven by various motivations and evolving architectures (He et al., 2016; Dosovitskiy et al., 2021), the pretext task of visual self-supervised learning is essentially to perform vision dictionary look-up, inspired by the success of NLP tasks.

3.1 CONTRASTIVE LEARNING: INSTANCE-LEVEL VISION DICTIONARY LOOK-UP

From the perspective of vision dictionary look-up, prominent contrastive learning methods establish instance-level vision dictionaries via a fixed-length queue (He et al., 2020) or batch-wise samples (Chen et al., 2020). The keys in the dictionary are dynamically updated as pre-training proceeds.
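For reference, this instance-level look-up is commonly optimized with an InfoNCE loss (Van den Oord et al., 2018; He et al., 2020). In its standard form, for an encoded query $q$, its positive key $k_+$, a dictionary of $K+1$ keys $\{k_i\}_{i=0}^{K}$, and a temperature $\tau$,

$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_{+} / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)},$$

which pulls the query towards its positive key while pushing it away from all other keys in the dictionary.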