# A Unified View of Masked Image Modeling

Published in Transactions on Machine Learning Research (03/2023)

Zhiliang Peng (pengzhiliang19@mails.ucas.ac.cn), University of Chinese Academy of Sciences
Li Dong (lidong1@microsoft.com), Microsoft Research
Hangbo Bao (t-habao@microsoft.com), Microsoft Research
Furu Wei (fuwei@microsoft.com), Microsoft Research
Qixiang Ye (qxye@ucas.ac.cn), University of Chinese Academy of Sciences

Contribution during internship at Microsoft Research. Corresponding authors.

Reviewed on OpenReview: https://openreview.net/forum?id=wmGlMhaBe0

Abstract

Masked image modeling has demonstrated great potential for eliminating the label-hungry problem of training large-scale vision Transformers, achieving impressive performance on various downstream tasks. In this work, we propose a unified view of masked image modeling after revisiting existing methods. Under the unified view, we introduce a simple yet effective method, termed MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions, conditioning on corrupted input images. Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable or superior to state-of-the-art methods. Using a huge vision Transformer and 300 pretraining epochs, MaskDistill obtains 88.3% fine-tuning top-1 accuracy on ImageNet-1k (224×224 input) and 58.8 mIoU for semantic segmentation on ADE20k (512×512 input). The code and pretrained models will be available at https://aka.ms/unimim.

1 Introduction

In recent years, Transformer architectures have shown promising results in natural language processing (Vaswani et al., 2017) and computer vision (Dosovitskiy et al., 2020). However, as Transformers scale up, they easily overfit small datasets and demand ever more training data. In NLP, self-supervised pretraining methods based on language modeling (Radford & Narasimhan, 2018; Devlin et al., 2019; Dong et al., 2019) have successfully addressed this problem. Motivated by masked language modeling, BEiT (Bao et al., 2022) proposes masked image modeling (MIM) to relieve the label-hungry problem of vision Transformers (ViT; Dosovitskiy et al. 2020), showing impressive results in learning visual representations.

MIM is conceptually simple: a model accepts a corrupted input image and predicts the target of the masked content. Taking the pioneering work BEiT (Bao et al., 2022) as an example, the encoder accepts corrupted image patches as input and predicts the corresponding discrete visual tokens of a tokenizer (Ramesh et al., 2021) at the masked positions. Subsequent work mainly differs in the architecture design (He et al., 2022; Chen et al., 2022a) and the reconstruction targets (He et al., 2022; Liu et al., 2022b; Wei et al., 2021; 2022a; Baevski et al., 2022).

In this work, we provide a unified view of masked image modeling, as illustrated in Equation 1 and Figure 1: a teacher model, a normalization layer, a student model, a MIM head, and a loss function. Based on this view, we conduct a systematic comparison of recent MIM methods and present it in Table 1. The most significant difference lies in the choice of teacher model, e.g., pixel values, tokenizers, pretrained models, or a momentum-updated teacher.
Under this unified view, we derive a simple yet effective method, named MaskDistill. As shown in Figure 1, the ingredients of MaskDistill are a teacher model based on CLIP (Radford et al., 2021), a fully-connected-layer MIM head, layer normalization for the target features, and the Smooth-ℓ1 loss function. Compared with the existing methods in Table 1, MaskDistill keeps the most straightforward design, yet shows impressive results. Compared with knowledge distillation, MaskDistill pays more attention to extrapolating the masked patches than to mimicking the target features.

We conduct MIM pretraining on ImageNet-1k (Russakovsky et al., 2015) for base-, large-, and huge-size ViTs. After that, we evaluate the pretrained models on downstream visual tasks: image classification on ImageNet-1k and semantic segmentation on ADE20k (Zhou et al., 2019). With the large-size CLIP teacher, MaskDistill using ViT-H/14 achieves 88.3% top-1 accuracy on ImageNet-1k and 58.8 mIoU on ADE20k after pretraining for 300 epochs.

The contributions of this study are summarized as follows:

- We provide a unified view of masked image modeling: a teacher model, a normalization layer, a student model, a MIM head, and a loss function.
- We propose a simple yet effective method, termed MaskDistill.
- We conduct extensive experiments on downstream tasks, including ImageNet fine-tuning and semantic segmentation. Experimental results show that the proposed approach improves performance across various settings.

2 A Unified View of Masked Image Modeling

In this section, we provide a unified view of the masked image modeling (MIM) task, consisting of a teacher model T, a normalization layer N, a student model S, a MIM head H, and an objective function L that measures the distance between the representation of the teacher model T and that of the student model S. The pretraining task can be unified as

$$\mathcal{L}_{\mathrm{MIM}} = \mathcal{L}\big(N(T(I_{\mathrm{full}})),\; H(S(I_{\mathrm{masked}}))\big), \qquad (1)$$

where $I_{\mathrm{full}}$ and $I_{\mathrm{masked}}$ denote the full (original) image and the masked image, respectively. According to Equation 1, we summarize recent popular MIM works in Table 1.
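To make the five ingredients concrete, the sketch below expresses Equation 1 as one PyTorch-style training step. It is our illustration rather than code from any of the cited papers; `teacher`, `student`, `mim_head`, `target_norm`, and `loss_fn` are placeholder callables.

```python
import torch

def mim_step(teacher, student, mim_head, target_norm, loss_fn, image, mask):
    """One masked-image-modeling step in the form of Equation 1.

    image: (B, C, H, W) full input image I_full.
    mask:  (B, N) boolean patch mask; True marks a masked position.
    """
    with torch.no_grad():                     # the teacher is frozen (or EMA-updated)
        target = target_norm(teacher(image))  # N(T(I_full)) -> (B, N, D)
    pred = mim_head(student(image, mask))     # H(S(I_masked)) -> (B, N, D)
    # the objective L compares teacher and student only at the masked positions
    return loss_fn(pred[mask], target[mask])
```

Each method in Table 1 then amounts to a particular choice of these five arguments.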
1) Teacher models T. According to the semantic level of the target, we split teachers into two groups: low-level and high-level targets. For low-level targets, ViT (Dosovitskiy et al., 2020), MAE (He et al., 2022), SimMIM (Liu et al., 2022b), ConvMAE (Gao et al., 2022), HiViT (Zhang et al., 2022), and GreenMIM (Huang et al., 2022) utilize original or normalized pixels as the MIM target. MaskFeat (Wei et al., 2021) introduces the feature descriptor HOG (Dalal & Triggs, 2005) as the regression target, and Ge2-AE (Liu et al., 2022a) regresses pixels and frequencies from a 2D discrete Fourier transform in parallel. As for high-level targets, BEiT (Bao et al., 2022), CAE (Chen et al., 2022a), SplitMask (El-Nouby et al., 2021), PeCo (Dong et al., 2021), and BEiT v2 (Peng et al., 2022) predict discrete tokens (instantiated as codes of a visual tokenizer; Ramesh et al., 2021; Esser et al., 2021; Peng et al., 2022) at the masked positions. MaskFeat (Wei et al., 2021) also proposes to directly regress the features of pretrained models (e.g., DINO (Caron et al., 2021) and DeiT (Touvron et al., 2020)), and MVP (Wei et al., 2022a) extends the pretrained teacher to the multimodal pretrained model CLIP (Radford et al., 2021). Moreover, following the BYOL paradigm (Grill et al., 2020), data2vec (Baevski et al., 2022), MSN (Assran et al., 2022), ConMIM (Yi et al., 2022), SIM (Tao et al., 2022), and BootMAE (Dong et al., 2022) construct the regression target from a momentum-updated teacher to bootstrap the model online.

Table 1: Systemic comparison of masked image modeling methods from the unified view. Here "/" denotes none.

| Methods | Teacher T | Student S | MIM Head H | Normalization N | Loss Function L |
|---|---|---|---|---|---|
| *Low-level pixel / feature* | | | | | |
| ViT (Dosovitskiy et al., 2020) | Pixel | ViT | FC | / | N/A |
| MAE (He et al., 2022) | Pixel | ViT | Decoder | LayerNorm | ℓ2 |
| SimMIM (Liu et al., 2022b) | Pixel | Swin | FC | / | ℓ1 |
| MaskFeat (Wei et al., 2021) | HOG | ViT | FC | / | ℓ2 |
| Ge2-AE (Liu et al., 2022a) | Pixel & Frequency | ViT | Decoders | / | ℓ2 |
| ConvMAE (Gao et al., 2022) | Pixel | Hybrid ViT | Decoder | LayerNorm | ℓ2 |
| HiViT (Zhang et al., 2022) | Pixel | HiViT | Decoder | LayerNorm | ℓ2 |
| GreenMIM (Huang et al., 2022) | Pixel | Swin | Decoder | LayerNorm | ℓ2 |
| *High-level feature* | | | | | |
| BEiT (Bao et al., 2022) | dVAE | ViT | FC | / | Cross Entropy |
| CAE (Chen et al., 2022a) | dVAE | ViT | Decoder | / | Cross Entropy |
| SplitMask (El-Nouby et al., 2021) | dVAE | ViT | Decoder | / | InfoNCE & Cross Ent. |
| PeCo (Dong et al., 2021) | VQGAN | ViT | FC | / | Cross Entropy |
| BEiT v2 (Peng et al., 2022) | VQ-KD | ViT | FC | / | Cross Entropy |
| MaskFeat (Wei et al., 2021) | DINO | ViT | FC | (ℓ2) | Cosine |
| MVP (Wei et al., 2022a) | CLIP | ViT | FC | (ℓ2) | Cosine |
| MILAN (Hou et al., 2022) | CLIP | ViT | Decoders | ℓ2-Norm | ℓ2 |
| MimCo (Zhou et al., 2022) | MoCo v3 | ViT | FC | / | InfoNCE |
| data2vec (Baevski et al., 2022) | EMA | ViT | FC | LayerNorm | Smooth-ℓ1 |
| MSN (Assran et al., 2022) | EMA | ViT | FC | / | Cross Entropy |
| SIM (Tao et al., 2022) | EMA | ViT | Decoder | BatchNorm | UniGrad loss |
| SdAE (Chen et al., 2022b) | EMA | ViT | Decoder | LayerNorm | Cosine |
| ConMIM (Yi et al., 2022) | EMA | ViT | FC | BatchNorm | InfoNCE |
| ExtreMA (Wu et al., 2022) | EMA | ViT | Cross Att. | LayerNorm | Cosine |
| BootMAE (Dong et al., 2022) | EMA & Pixel | ViT | Decoders | LayerNorm | ℓ2 |
| MaskDistill (Ours) | CLIP | ViT | FC | LayerNorm | Smooth-ℓ1 |

2) Student models S. The MIM task suits models rooted in attention interaction, like ViT (Dosovitskiy et al., 2020), Swin Transformers (Liu et al., 2022b), and their variants (Gao et al., 2022; Zhang et al., 2022). Because the backbone architecture is not the primary focus of this study, we choose the vanilla ViT (Dosovitskiy et al., 2020) as the analytical anchor.

3) MIM heads H. BEiT (Bao et al., 2022) uses a simple fully-connected (FC) layer as the task head to generate predictions at the masked positions. MAE (He et al., 2022) introduces a decoder to decouple the masked prediction task from the encoder; since the aim of the MAE decoder is still to predict the target pixels at the masked positions, we consider the decoder a MIM head in Table 1. This decoupling decoder is adopted by many recent works (Liu et al., 2022a; Gao et al., 2022; Zhang et al., 2022; Chen et al., 2022a; El-Nouby et al., 2021; Tao et al., 2022; Dong et al., 2022).
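To illustrate the two head families in Table 1, the snippet below contrasts an FC head with a decoder-style head. The widths (768-dimensional encoder, 512-dimensional target) are example values of ours, and the decoder is deliberately simplified; e.g., the actual MAE decoder additionally inserts mask tokens for the dropped patches before decoding.

```python
import torch.nn as nn

# FC head (BEiT-style): a single linear projection applied to every token.
fc_head = nn.Linear(768, 512)

# Decoder-style head (MAE-style, simplified): a few extra Transformer blocks
# re-process the token sequence before the final projection.
decoder_head = nn.Sequential(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    nn.Linear(768, 512),
)
```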
4) Normalization layers N. MAE (He et al., 2022) introduces per-patch normalized pixels (i.e., layer normalization without affine transformation) as the target to boost local pixel contrast, resulting in better performance. Meanwhile, normalization is usually applied to avoid feature collapse in methods based on contrastive learning (Grill et al., 2020; Chen et al., 2020). Similarly, EMA-based MIM methods (Tao et al., 2022; Baevski et al., 2022; Yi et al., 2022) adopt various normalization layers to stabilize training and boost performance. By default, there is no collapse issue when the teacher is raw pixels or a frozen model.

5) Loss functions L. When the target is pixels or features, ℓ1 or ℓ2 losses are appropriate for regression; when the target is discrete tokens, cross-entropy is the primary choice. Notably, after applying layer normalization, the variance of the target features rises, which makes the loss volatile; Smooth-ℓ1, a trade-off between ℓ1 and ℓ2, behaves more stably. Cosine similarity is also an alternative choice.

From Table 1, one can find that the main difference lies in the teacher models: pixels, momentum-updated teachers, and pretrained models. Pixels are easy to access but carry little high-level semantic knowledge. Momentum-updated teachers need no extra models or datasets but tend to suffer from the collapse issue. Pretrained models are off-the-shelf and contain richer semantic information than pixels, but how to prepare a high-quality teacher model becomes an essential problem.

Figure 1: Unified view of the masked image modeling framework: a teacher (e.g., Pixel, EMA, CLIP, DINO, MAE, BEiT, ...), a student (e.g., ViT, Swin, ...), a MIM head (FC or decoder), a normalization layer (LN, BN, identity, ...), and a loss function (ℓ1, ℓ2, Smooth-ℓ1, ...). The bold text denotes the default ingredients of MaskDistill.

3 Masked Distillation

Knowledge distillation (Hinton et al., 2015) has been shown to be a promising approach for compressing a large model (referred to as the teacher) into a small model (referred to as the student) that uses far fewer parameters and computations while attaining comparable results on downstream tasks. Based on the unified view, we offer a simple yet effective method, named MaskDistill, which distills a student model in a masked image modeling fashion. However, our purpose is not to compress the teacher model T into the student model S, but to push S to outperform T. We instantiate the student model S as ViT (Dosovitskiy et al., 2020) for comparison with other methods.

Specifically, given an input image $x \in \mathbb{R}^{H \times W \times C}$, where $(H, W)$ is the resolution and $C$ is the number of image channels, the student S first divides $x$ into $N$ non-overlapping patches $\{x_i^p\}_{i=1}^N$ and then linearly projects them into patch embeddings $\{e_i^p\}_{i=1}^N$. Following that, we select roughly 40% of the patch embeddings to be masked using a block-wise strategy (Bao et al., 2022). Denoting the masked position set as $\mathcal{M}$, we use a shared learnable embedding $e_{\mathrm{M}}$ to replace the original patch embedding $e_i^p$ if $i \in \mathcal{M}$. We thereby obtain the masked sequence

$$e_i^{\mathcal{M}} = \delta(i \in \mathcal{M})\, e_{\mathrm{M}} + \big(1 - \delta(i \in \mathcal{M})\big)\, e_i^p, \qquad (2)$$

where $\delta(\cdot)$ is the indicator function. Subsequently, we prepend a learnable class token $e_{\mathrm{CLS}}$, add learnable positional embeddings, and feed the sequence into stacked Transformer blocks. Lastly, a masked image modeling head (usually instantiated as a fully-connected layer) predicts features $O \in \mathbb{R}^{(N+1) \times D}$, where $D$ is the dimension of the target features.

Given a pretrained teacher model T, such as DINO (Caron et al., 2021) or CLIP (Radford et al., 2021), the same image $x$ is fed into T to obtain the patch-wise target features $\{t_i\}_{i=1}^N$. To ensure that the output resolutions of S and T match, the input resolution for T should be adjusted accordingly. Finally, the training objective of MaskDistill can be formulated as

$$\mathcal{L}_{\mathrm{MaskDistill}} = -\sum_{i \in \mathcal{M}} \log p\big(t_i(x) \mid x_i^p\big) = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \text{Smooth-}\ell_1\big(o_i, \mathrm{LN}(t_i)\big), \qquad (3)$$

where LN is layer normalization without affine transformation.
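For clarity, here is a minimal sketch of Equations 2 and 3; `corrupt` and `maskdistill_loss` are illustrative names of ours, not part of a released implementation.

```python
import torch
import torch.nn.functional as F

def corrupt(patch_emb, mask, mask_token):
    """Equation 2: replace masked patch embeddings e_i^p with the shared
    learnable embedding e_M.  patch_emb: (B, N, D); mask: (B, N) bool;
    mask_token: (D,) learnable parameter."""
    return torch.where(mask.unsqueeze(-1), mask_token, patch_emb)

def maskdistill_loss(pred, teacher_feats, mask):
    """Equation 3: Smooth-l1 between student predictions o_i and the
    layer-normalized teacher features LN(t_i), at masked positions only."""
    # layer normalization of the target *without* affine parameters
    target = F.layer_norm(teacher_feats, teacher_feats.shape[-1:])
    # mean over the masked elements, matching the 1/|M| average up to scale
    return F.smooth_l1_loss(pred[mask], target[mask])
```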
Table 2: Fine-tuning results on ImageNet-1K and ADE20k. PT Epochs denotes the pretraining epochs; Rel. Pos. means relative positional embeddings; "/" denotes not applicable.

| Methods | Teacher Model | Teacher Data | PT Epochs | Rel. Pos. | Top-1 Acc (%) | mIoU (%) |
|---|---|---|---|---|---|---|
| *Base-size models (ViT-B/16)* | | | | | | |
| BEiT (Bao et al., 2022) | DALL-E | 250M | 800 | | 83.2 | 45.6 |
| MAE (He et al., 2022) | Pixel | / | 1600 | | 83.6 | 48.1 |
| CAE (Chen et al., 2022a) | DALL-E | 250M | 1600 | | 83.9 | 50.2 |
| SdAE (Chen et al., 2022b) | EMA | / | 300 | N/A | 84.1 | 48.6 |
| SIM (Tao et al., 2022) | EMA | / | 1600 | N/A | 83.8 | N/A |
| MaskFeat (Wei et al., 2021) | HOG | / | 1600 | N/A | 84.0 | N/A |
| PeCo (Dong et al., 2021) | VQGAN | IN-1k | 300 | N/A | 84.1 | 46.7 |
| PeCo (Dong et al., 2021) | VQGAN | IN-1k | 800 | N/A | 84.5 | 48.5 |
| data2vec (Baevski et al., 2022) | EMA | / | 800 | | 84.2 | N/A |
| CLIP (Radford et al., 2021) | Text | / | / | | 84.9 | 51.1 |
| MVP (Wei et al., 2022a) | CLIP-B | 400M | 300 | N/A | 84.4 | 52.4 |
| BEiT v2 (Peng et al., 2022) | VQ-KD | 400M | 1600 | | 85.5 | 53.1 |
| MaskDistill (ours) | CLIP-B | 400M | 300 | ✗ | 84.9 | 52.7 |
| MaskDistill (ours) | CLIP-B | 400M | 300 | ✓ | 85.0 | 53.8 |
| MaskDistill (ours) | CLIP-B | 400M | 800 | ✓ | 85.5 | 54.3 |
| *Large-size models (ViT-L/16)* | | | | | | |
| MaskFeat (Wei et al., 2021) | HOG | / | 1600 | N/A | 85.7 | N/A |
| MAE (He et al., 2022) | Pixel | / | 1600 | | 85.9 | 53.6 |
| CAE (Chen et al., 2022a) | DALL-E | 250M | 1600 | | 86.3 | 54.7 |
| data2vec (Baevski et al., 2022) | EMA | / | 1600 | | 86.6 | N/A |
| BEiT v2 (Peng et al., 2022) | VQ-KD | 400M | 1600 | | 87.3 | 56.7 |
| MILAN (Hou et al., 2022) | CLIP-B | 400M | 400 | | 86.7 | 55.3 |
| MaskDistill (ours) | CLIP-B | 400M | 300 | ✓ | 86.8 | 56.3 |
| MaskDistill (ours) | CLIP-B | 400M | 800 | ✓ | 87.1 | 56.5 |

Table 3: Fine-tuning results on ImageNet-1K and ADE20k when scaling up to a larger teacher, CLIP ViT-L/14.

| Methods | Model Size | PT Epochs | Rel. Pos. | Top-1 Acc (%) | mIoU (%) |
|---|---|---|---|---|---|
| MaskDistill (ours) | ViT-B/16 | 300 | ✓ | 85.3 | 54.3 |
| MaskDistill (ours) | ViT-L/16 | 300 | ✓ | 87.6 | 57.9 |
| MaskDistill (ours) | ViT-H/14 | 300 | ✓ | 88.3 | 58.8 |

4 Experiments

We perform pretraining and then evaluate fine-tuning performance on downstream tasks, namely image classification and semantic segmentation. Moreover, we conduct ablation studies to compare the contributions of different design choices.

4.1 Setup

For all pretraining experiments, we only use the ImageNet-1k dataset (Russakovsky et al., 2015), which contains 1.28M images. We adopt the block-wise masking strategy to corrupt the input images for the student model but keep the full images for the teacher, constructing an asymmetric information bottleneck. All teacher model checkpoints come from the official releases.
When utilizing CLIP ViT-L/14 as the teacher, we set its input resolution to 196×196 so that its patch grid (196/14 = 14 patches per side) matches that of the student ViT-B/16 or ViT-L/16 at 224×224 (224/16 = 14). As for the student model, we use ViT-Base/Large equipped with relative positional embeddings, following BEiT (Bao et al., 2022; Peng et al., 2022). For the pretraining setting, we also mainly follow BEiT (Bao et al., 2022; Peng et al., 2022): batch size 2048, learning rate 1.5e-3, AdamW optimizer with weight decay 0.05, drop path 0.1 (0.2) for ViT-Base (ViT-Large), block-wise masking with a 40% ratio, and 300/800 epochs. More details can be found in the Appendix.

Evaluation. We consider the popular evaluation protocol for image classification on ImageNet-1k: fine-tuning top-1 accuracy. We adopt the BEiT (Bao et al., 2022) fine-tuning recipe. For ViT-Base, we fine-tune for 100 epochs with 20 warmup epochs, using the AdamW optimizer with weight decay 0.05, a learning rate of 5e-4 decayed with a cosine schedule, and layer-wise learning rate decay 0.65. For ViT-Large, we fine-tune for 50 epochs with 5 warmup epochs and layer decay 0.75; for ViT-Huge, 30 epochs with 5 warmup epochs and layer decay 0.85. All input images are 224×224. As for semantic segmentation, we evaluate the mIoU metric on the ADE20K dataset (Zhou et al., 2019) with the UperNet (Xiao et al., 2018) framework; the input resolution for training and evaluation is 512×512. Notably, for ViT-H/14 in Table 3, we convert it to ViT-H/16 for the segmentation task. Similarly, the AdamW optimizer with weight decay 0.05 is applied, training runs for 160K steps with batch size 16, and we sweep learning rates {5e-5, 8e-5, 1e-4} with layer decay 0.75 (0.85) and drop path 0.1 (0.2) for ViT-Base (ViT-Large). More details can be found in the Appendix.
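For reference, layer-wise learning rate decay assigns each block a learning-rate multiplier that shrinks geometrically with its distance from the task head. The helper below is a sketch under the BEiT-style convention we assume here (depth 0 = patch embedding, top depth = head), not the paper's released code.

```python
def layerwise_lr_scales(num_blocks, decay):
    """LR multiplier per depth: deeper blocks, which sit closer to the
    task head, train with larger learning rates."""
    return [decay ** (num_blocks - depth) for depth in range(num_blocks + 1)]

# ViT-B (12 blocks) with decay 0.65: the patch embedding trains at about
# 0.65**12 ~= 0.0057x the peak LR, the last block at 0.65x, the head at 1x.
scales = layerwise_lr_scales(12, 0.65)
```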
4.2 Main Results

Table 2 reports the top-1 accuracy of self-supervised methods on ImageNet-1k using ViT (Dosovitskiy et al., 2020) models. For ViT-Base, MaskDistill with the 800-epoch pretraining schedule obtains 85.5% top-1 accuracy, surpassing CLIP (Radford et al., 2021), MVP (Wei et al., 2022a), data2vec (Baevski et al., 2022), and MaskFeat (Wei et al., 2021) by 0.6%, 1.1%, 1.3%, and 1.5%, respectively. MaskDistill also achieves performance comparable with BEiT v2 (Peng et al., 2022) on ImageNet-1k while outperforming it by 1.2 mIoU on ADE20k; more comparisons with BEiT v2 can be found in Section 4.4. When scaling the student up to ViT-Large, MaskDistill achieves 86.8% top-1 accuracy and 56.3 mIoU. Compared with the recently proposed MILAN (Hou et al., 2022), MaskDistill outperforms it by 1% on the semantic segmentation task with fewer pretraining epochs.

In Table 3, we use the CLIP ViT-L/14 checkpoint as the teacher model and pretrain student models for 300 epochs. MaskDistill obtains consistent improvements over using the CLIP ViT-B/16 teacher. Remarkably, MaskDistill reaches 88.3% accuracy on ImageNet-1k and 58.8 mIoU on ADE20k with the ViT-Huge backbone.

Robustness evaluation. Following MAE (He et al., 2022) and BEiT v2 (Peng et al., 2022), we test the robustness of MaskDistill on three ImageNet validation sets, i.e., ImageNet-Adversarial (Hendrycks et al., 2021b), ImageNet-Rendition (Hendrycks et al., 2021a), and ImageNet-Sketch (Wang et al., 2019). In Table 4, both MAE and BEiT v2 pretrain for 1600 epochs, while MaskDistill pretrains for 800 epochs yet achieves comparable or superior performance.

Table 4: Robustness evaluation on ImageNet variants (Hendrycks et al., 2021b;a; Wang et al., 2019).

| Methods | ImageNet-Adversarial | ImageNet-Rendition | ImageNet-Sketch |
|---|---|---|---|
| *ViT-B/16* | | | |
| MAE | 35.9 | 48.3 | 34.5 |
| BEiT v2 | 54.4 | 61.0 | 45.6 |
| MaskDistill | 53.3 | 64.4 | 47.3 |
| *ViT-L/16* | | | |
| MAE | 57.1 | 59.9 | 45.3 |
| BEiT v2 | 69.0 | 69.9 | 53.5 |
| MaskDistill | 69.0 | 75.3 | 56.9 |

4.3 Comparison with Knowledge Distillation

Table 5: MaskDistill vs. knowledge distillation. The teacher model is CLIP ViT-Base (Radford et al., 2021).

| Student Models | Mask Ratio | Pretraining Epochs | Classification Accuracy (%) |
|---|---|---|---|
| ViT-B/16 | 0 | 300 | 85.3 |
| ViT-B/16 | 40% | 300 | 85.0 (−0.3) |
| ViT-B/16 | 0 | 800 | 85.2 |
| ViT-B/16 | 40% | 800 | 85.5 (+0.3) |
| ViT-L/16 | 0 | 300 | 85.4 |
| ViT-L/16 | 40% | 300 | 86.8 (+1.4) |

In Table 5, we compare MaskDistill with knowledge distillation, which can be considered a special case of MaskDistill where the mask ratio is 0 and the loss is calculated over all patches. Knowledge distillation surpasses MaskDistill by 0.3% under the 300-epoch pretraining schedule but falls behind by 0.3% under the 800-epoch schedule. Remarkably, MaskDistill outperforms knowledge distillation by a significant margin when the student scales up to the large size. The teacher model here is CLIP ViT-Base, which reaches 84.9% fine-tuning accuracy on ImageNet-1k. When the student is larger than the teacher, the student can easily reconstruct the teacher's latent space in the absence of an information bottleneck; this is why ViT-L/16 obtains performance comparable to ViT-B/16 under knowledge distillation (85.4% vs. 85.3% in Table 5). In MaskDistill, by contrast, the corrupted input encourages the student to extrapolate the masked patches rather than mimic features at the visible patches.

4.4 Comparison with BEiT v2

In BEiT v2 (Peng et al., 2022), CLIP ViT-Base acts as the teacher for distilling a vector-quantized visual tokenizer, which then provides the supervision for the subsequent MIM phase. Compared with MaskDistill, the quantization mechanism in BEiT v2 discards some fine-grained details from the teacher model. These details benefit the fast convergence of MaskDistill: it achieves comparable image classification performance with 800 pretraining epochs, while BEiT v2 needs 1600, as demonstrated in Table 2. In other words, MaskDistill avoids the codebook collapse problem of the tokenizer training phase (Peng et al., 2022) while achieving comparable performance. Meanwhile, such fine-grained details as supervision enhance the robustness of MaskDistill, as shown in Table 4.

4.5 Few-shot Classification

Table 6: Few-shot image classification on ImageNet-1k with ViT-B/16. We freeze the backbone and only learn the classifier during training; FT denotes fine-tuning top-1 accuracy.

| Methods | FT | k=2 | k=4 | k=8 | k=16 | k=32 |
|---|---|---|---|---|---|---|
| BEiT (Bao et al., 2022) | 83.2 | 1.7 | 3.0 | 5.0 | 7.0 | 8.9 |
| MAE (He et al., 2022) | 83.6 | 11.5 | 21.5 | 31.5 | 39.5 | 46.4 |
| CLIP (Radford et al., 2021) | 84.9 | 35.4 | 45.2 | 55.1 | 61.3 | 65.6 |
| BEiT v2 (Peng et al., 2022) | 85.5 | 36.6 | 47.6 | 56.3 | 63.0 | 67.6 |
| MaskDistill (ours) | 85.5 | 37.8 | 48.7 | 56.3 | 62.3 | 66.2 |

In Table 6, we report the few-shot image classification results of BEiT (Bao et al., 2022), MAE (He et al., 2022), CLIP (Radford et al., 2021), BEiT v2 (Peng et al., 2022), and MaskDistill on ImageNet-1k (Russakovsky et al., 2015). We randomly choose k samples from each category and use them to learn a classifier while keeping all other parameters frozen during training, and we evaluate on the entire validation set. For each method, we use the public model weights and sweep a wide range of learning rates for a fair comparison. From Table 6, one can see that MaskDistill consistently surpasses its teacher under various settings and achieves the best few-shot performance for k=2/4/8. Moreover, we find that few-shot performance correlates with fine-tuning performance under MIM, which is consistent with observations in contrastive learning (Ericsson et al., 2021).
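The few-shot protocol can be summarized by the following sketch (our reconstruction of the stated protocol, not the released evaluation code): features are extracted once with the frozen backbone, and only a linear classifier is fit on the k samples per class.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_features(backbone, loader, device="cuda"):
    """Extract frozen-backbone features; the backbone is never updated."""
    backbone.eval()
    feats, labels = [], []
    for images, targets in loader:
        feats.append(backbone(images.to(device)).cpu())
        labels.append(targets)
    return torch.cat(feats), torch.cat(labels)

def fit_linear_probe(feats, labels, num_classes, lr, epochs=100):
    """Fit only a linear classifier on the k-shot subset (full batch here);
    the learning rate lr is swept per method."""
    clf = torch.nn.Linear(feats.shape[-1], num_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(clf(feats), labels).backward()
        opt.step()
    return clf
```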
4.6 Comparison on Training Cost

Table 7: Comparison of MIM training time consumption and GPU memory.

| Method | Supervision | Average step time (s) | GPU memory (GB) |
|---|---|---|---|
| *ViT-B/16, batch size 64 per GPU* | | | |
| MAE (He et al., 2022) | Normalized pixels | 0.358 | 10.6 |
| data2vec (Baevski et al., 2022) | EMA features | 0.636 | 13.5 |
| BEiT v2 (Peng et al., 2022) | VQ-KD tokens | 0.605 | 15.5 |
| MaskDistill | CLIP-B features | 0.487 | 12.5 |
| *ViT-L/16, batch size 32 per GPU* | | | |
| MAE (He et al., 2022) | Normalized pixels | 0.551 | 13.0 |
| data2vec (Baevski et al., 2022) | EMA features | 1.079 | 21.9 |
| BEiT v2 (Peng et al., 2022) | VQ-KD tokens | 0.769 | 21.8 |
| MaskDistill | CLIP-B features | 0.718 | 19.7 |

In Table 7, we test all models under the same settings to compare the training cost. Average step time is calculated as (total training time) / (total steps), and GPU memory is measured once the training phase is stable. MAE (He et al., 2022), data2vec (Baevski et al., 2022), and BEiT v2 (Peng et al., 2022) are evaluated with their official codebases. Compared with using pixel values as the reconstruction target, the other methods, including our MaskDistill, tend to spend extra time obtaining targets. Compared with the EMA-based data2vec and the tokenizer-based BEiT v2, MaskDistill enjoys faster training and a lower GPU memory cost.

4.7 Ablation Studies

Teacher models. We collect several popular models to act as the teacher in MaskDistill and pretrain a ViT-Base student for 300 epochs in the MIM fashion. The performance of teachers and students is shown in Table 8. For #1 to #6, where the teachers are CLIP and SLIP (Mu et al., 2021) models trained on image-text pair datasets (YFCC15M, CC3M, CC12M, and a private 400M set) in a language-guided contrastive way, MaskDistill consistently boosts the teacher, by 0.1%–4.2% accuracy. For #7, the teacher is a ResNet trained in a supervised way; despite the architecture gap, the student still enjoys a significant gain (83.5% vs. 76.2%). For #8 and #9, the teachers SimCLR (Chen et al., 2020) and DINO (Caron et al., 2021) use only image data; MaskDistill boosts them by 1.6% and 0.9%, respectively. Comparing #1, #2, and #8 in Table 8, where the teachers share the same dataset and training epochs, the students in #1 and #8 achieve 83.8% and 84.1%, respectively; the former teacher uses text information while the latter does not, implying that language-guided supervision is not essential. Moreover, comparing #1–#5 with #9, both the teacher and student in #9, trained only on ImageNet-1k, reach performance comparable to those in #1–#5, which further suggests that extra language information is not the key. For #10 to #12, we choose models trained by MIM itself as the teacher and find that MaskDistill consistently outperforms the corresponding teacher. However, comparing #9 with #10, whose teachers reach the same fine-tuning accuracy, the student in #9 obtains better fine-tuning accuracy and segmentation mIoU than the one in #10, indicating that contrastively pretrained models tend to be better teachers, though not the only viable choice.

Loss functions & normalization. We compare MSE, cosine similarity, and Smooth-ℓ1 losses equipped with various normalization layers and present the results in Table 9. Smooth-ℓ1 equipped with LN achieves the best performance under the supervision of both DINO and CLIP, indicating that normalization plays an important role in the masked image modeling task.
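The compared variants can be expressed compactly; the helper below is our illustration only (the batch-normalization variant is omitted for brevity).

```python
import torch.nn.functional as F

def regression_loss(pred, target, loss="smooth_l1", norm="ln"):
    """Loss/normalization combinations of Table 9 (illustrative helper)."""
    if norm == "ln":  # layer normalization without affine parameters
        target = F.layer_norm(target, target.shape[-1:])
    if loss == "mse":
        return F.mse_loss(pred, target)
    if loss == "cosine":  # operates on l2-normalized features
        return (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()
    return F.smooth_l1_loss(pred, target)
```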
Target layer selection. Usually, the deeper-layer features of a model are biased toward its pretraining task, e.g., image-image contrastive learning in DINO and image-text contrastive learning in CLIP, but whether this bias benefits MaskDistill has not been examined. We conduct experiments with target features from the last layer, the average of the last 3 layers, and the average of the last 6 layers. As shown in Table 10, the last layer's features are better for the DINO teacher, while the average of the last 6 layers is marginally better for the CLIP teacher on ImageNet. Moreover, results on the segmentation task show that the last layer's features are the superior target. Therefore, we choose the last layer's features as the default target for all experiments.

Table 8: Ablation studies on teacher models used in MaskDistill. For the CLIP, SLIP, and SimCLR teachers trained on YFCC15M/CC3M/CC12M, the fine-tuning accuracy and model checkpoints are from SLIP (Mu et al., 2021); for CLIP (400M) and DINO, we use the official checkpoints and follow the BEiT (Bao et al., 2022) fine-tuning recipe to obtain the top-1 accuracy. All teacher models are ViT-Base except ResNet-50 (He et al., 2016). The student model is ViT-Base, pretrained for 300 epochs.

| # | Teacher Model | Teacher Data | Text | Teacher: ImageNet (%) | Student: ImageNet (%) | Student: ADE20k (%) |
|---|---|---|---|---|---|---|
| #1 | CLIP | YFCC15M | ✓ | 80.5 | 83.8 (+3.3) | 47.4 |
| #2 | SLIP | YFCC15M | ✓ | 82.6 | 84.3 (+1.7) | 49.9 |
| #3 | SLIP | YFCC15M | ✓ | 83.4 | 84.6 (+1.2) | 50.8 |
| #4 | CLIP | CC3M | ✓ | 79.5 | 83.7 (+4.2) | 45.7 |
| #5 | CLIP | CC12M | ✓ | 82.1 | 84.1 (+2.0) | 48.3 |
| #6 | CLIP | Private 400M | ✓ | 84.9 | 85.0 (+0.1) | 53.8 |
| #7 | ResNet | ImageNet-1k | ✗ | 76.2 | 83.5 (+7.3) | 46.9 |
| #8 | SimCLR | YFCC15M | ✗ | 82.5 | 84.1 (+1.6) | 49.4 |
| #9 | DINO | ImageNet-1k | ✗ | 83.6 | 84.5 (+0.9) | 50.4 |
| #10 | MAE | ImageNet-1k | ✗ | 83.6 | 84.3 (+0.7) | 49.3 |
| #11 | BEiT | ImageNet-1k | ✗ | 83.2 | 83.8 (+0.6) | 46.6 |
| #12 | BEiT v2 | ImageNet-1k | ✗ | 84.7 | 85.0 (+0.3) | 52.1 |

Figure 2: Block-wise masking vs. random masking under mask ratios from 0.2 to 0.8: (a) fine-tuning accuracy, (b) linear-probing accuracy, and (c) segmentation mIoU.

Masking strategy. We evaluate the block-wise masking strategy (Bao et al., 2022) against random masking in Figure 2. Block-wise masking performs better than random masking under low mask ratios but worse under high mask ratios. Taking the three evaluation protocols (fine-tuning on ImageNet-1k, linear probing on ImageNet-1k, and semantic segmentation on ADE20k) into consideration, we choose block-wise masking with a 40% mask ratio as the default; a simplified sketch of both strategies follows.
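The sketch below illustrates the two strategies. The block-wise variant is a strong simplification of the BEiT algorithm (which additionally samples aspect ratios and caps the overshoot); the defaults assume a 14×14 patch grid.

```python
import random
import torch

def random_mask(num_patches=196, ratio=0.4):
    """Uniformly sample masked patch positions."""
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[torch.randperm(num_patches)[: int(num_patches * ratio)]] = True
    return mask

def blockwise_mask(grid=14, ratio=0.4, min_block=16):
    """Simplified block-wise masking: mask random rectangles until the
    target ratio is reached (may slightly overshoot it)."""
    mask = torch.zeros(grid, grid, dtype=torch.bool)
    target = int(grid * grid * ratio)
    while mask.sum() < target:
        area = random.randint(min_block, target)
        h = max(1, min(grid, round(area ** 0.5)))
        w = max(1, min(grid, area // h))
        top, left = random.randint(0, grid - h), random.randint(0, grid - w)
        mask[top:top + h, left:left + w] = True
    return mask.flatten()
```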
Table 9: Ablation study of loss functions and normalization layers. All models are pretrained for 300 epochs.

| Teacher T | Loss L | Norm N | ImageNet | ADE20k |
|---|---|---|---|---|
| DINO | MSE | / | 84.3 | 49.6 |
| DINO | Cosine | (ℓ2) | 84.5 | 49.6 |
| DINO | Smooth-ℓ1 | LN | 84.5 | 50.4 |
| CLIP | MSE | / | 84.6 | 52.8 |
| CLIP | Cosine | (ℓ2) | 84.9 | 52.9 |
| CLIP | Smooth-ℓ1 | BN | 84.9 | 53.1 |
| CLIP | Smooth-ℓ1 | LN | 85.0 | 53.8 |

Table 10: Ablation study of target feature selection in MaskDistill. All models are pretrained for 300 epochs.

| Teacher T | Target | ImageNet | ADE20k |
|---|---|---|---|
| DINO | Last layer | 84.5 | 50.4 |
| DINO | Mean (last 3 layers) | 84.4 | 49.7 |
| DINO | Mean (last 6 layers) | 84.3 | 49.8 |
| CLIP | Last layer | 85.0 | 53.8 |
| CLIP | Mean (last 3 layers) | 85.0 | 53.5 |
| CLIP | Mean (last 6 layers) | 85.1 | 53.4 |

4.8 Analysis: MIM Enhances Shape Bias

We explore whether masked image modeling methods enhance shape-biased recognition. Shape bias is characterized as the fraction of correct decisions based on object shape. Naseer et al. (2021) show that humans are usually much more shape-biased than supervised classification models, such as convolutional networks and vision Transformers. We evaluate shape bias on a stylized version of ImageNet (Naseer et al., 2021) using the checkpoints fine-tuned on the original ImageNet-1k dataset. As shown in Figure 3, masked image modeling tends to promote the shape bias of the models. This result partially explains why MaskDistill generalizes better on the ImageNet variants in Table 4.

Figure 3: Shape-bias analysis under the teacher supervision of CLIP ViT-B/16 (left) and MAE ViT-B/16 (right), plotting the fraction of 'shape' vs. 'texture' decisions per shape category. Circles, triangles, and stars denote humans, teachers, and students, respectively; vertical lines are the corresponding average values. Masked image modeling enhances the shape bias. (Best viewed in color.)

5 Related Work

Masked image modeling. Masked language modeling with Transformers has achieved great success in learning strong language representations in recent years (Devlin et al., 2019; Dong et al., 2019; Bao et al., 2020). Inspired by it, BEiT (Bao et al., 2022) proposes a mask-then-predict framework that recovers discrete visual tokens (Ramesh et al., 2021), showing the great potential of masked image modeling for computer vision. Since then, various target supervisions have been explored under the masked image modeling framework, such as original or normalized pixels (He et al., 2022; Dong et al., 2021; Liu et al., 2022b; Gao et al., 2022; Liu et al., 2022a; Zhang et al., 2022; Huang et al., 2022), high-level features (Wei et al., 2021; 2022a; Peng et al., 2022; Zhou et al., 2022; Hou et al., 2022), and EMA-updated models (Baevski et al., 2022; Assran et al., 2022; Tao et al., 2022; Chen et al., 2022b; Yi et al., 2022; Wu et al., 2022; Dong et al., 2022). In this work, we decouple and analyze the components of recent masked image modeling methods and then propose a simple yet effective paradigm.

Contrastive learning. As a simple but effective self-supervised approach, contrastive learning has made rapid progress in recent years. The main idea is to enforce similarity between augmented views of an image while pushing views of other images away (Dosovitskiy et al., 2016; Wu et al., 2018; Hjelm et al., 2019; He et al., 2020; Chen et al., 2020), or to avoid model collapse after removing negative pairs (Grill et al., 2020; Chen & He, 2020; Chen et al., 2021; Caron et al., 2021). In the multimodal field, CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) learn image-language-aligned representations by pulling positive image-text pairs (an image and its corresponding tag or caption) closer and pushing negative image-text pairs apart. SLIP (Mu et al., 2021) combines language supervision with image self-supervision to further boost the learned visual representations.
In this work, we consider contrastive models as targets for masked image modeling.

Knowledge distillation. Knowledge distillation (Hinton et al., 2015) treats the output of a teacher model as the pseudo label for learning a student model. This strategy squeezes the potential of small models and brings impressive gains, and it has since been transferred to various tasks (Touvron et al., 2020; He et al., 2019; Yang et al., 2021) and domains (Jiao et al., 2020; Wang et al., 2020). Wei et al. (2022b) propose distilling normalized teacher features into a same-size student. In contrast, MaskDistill aims to reconstruct the teacher output at the masked patches rather than mimic the teacher's features at every patch.

6 Conclusion and Limitations

We summarized existing MIM works under the proposed unified view: teacher models, normalization layers, student models, MIM heads, and loss functions. We then proposed a simple yet effective method, termed MaskDistill, which predicts the normalized semantic features of CLIP's visual encoder at the masked positions, conditioning on the corrupted input image. This simple framework beats many previous works with specialized designs and shows impressive performance across model sizes and tasks. In the future, we would like to explore the proposed method for multimodal pretraining (Wang et al., 2022).

The proposed MaskDistill requires an extra teacher model, similar to the tokenizer in the BEiT series. Compared with methods that use pixels as targets, the teacher model in MaskDistill needs extra time to produce target features. Meanwhile, we point out in Section 4.7 that language-guided supervision is not essential on the academically accessible multimodal dataset YFCC15M; whether this conclusion also holds on private 400M image-text pair datasets remains an open question.

References

Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael G. Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. arXiv preprint arXiv:2204.07141, 2022.

Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555, 2022.

Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, and Hsiao-Wuen Hon. UniLMv2: Pseudo-masked language models for unified language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, volume 119 of Proceedings of Machine Learning Research, pp. 642–652. PMLR, 2020.

Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In International Conference on Learning Representations, 2022.

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.

Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, and Jingdong Wang. Context autoencoder for self-supervised representation learning. arXiv preprint arXiv:2202.03026, 2022a.
Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566, 2020.

Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057, 2021.

Yabo Chen, Yuchen Liu, Dongsheng Jiang, Xiaopeng Zhang, Wenrui Dai, Hongkai Xiong, and Qi Tian. SdAE: Self-distillated masked autoencoder. arXiv preprint arXiv:2208.00449, 2022b.

Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pp. 886–893. IEEE, 2005.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186. Association for Computational Linguistics, 2019.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 13042–13054, 2019.

Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, and Nenghai Yu. PeCo: Perceptual codebook for BERT pre-training of vision transformers. arXiv preprint arXiv:2111.12710, 2021.

Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, and Nenghai Yu. Bootstrapped masked autoencoders for vision BERT pretraining. arXiv preprint arXiv:2207.07116, 2022.

A. Dosovitskiy, P. Fischer, J. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis & Machine Intelligence, 38(9):1734–1747, 2016. doi: 10.1109/TPAMI.2015.2496141.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Alaaeldin El-Nouby, Gautier Izacard, Hugo Touvron, Ivan Laptev, Hervé Jegou, and Edouard Grave. Are large-scale datasets necessary for self-supervised pre-training? arXiv preprint arXiv:2112.10740, 2021.

Linus Ericsson, Henry Gouk, and Timothy M Hospedales. How well do self-supervised models transfer? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5414–5423, 2021.

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, pp. 12873–12883, 2021.

Peng Gao, Teli Ma, Hongsheng Li, Jifeng Dai, and Yu Jiao Qiao. ConvMAE: Masked convolution meets masked autoencoders. arXiv preprint arXiv:2205.03892, 2022.

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.

Tong He, Chunhua Shen, Zhi Tian, Dong Gong, Changming Sun, and Youliang Yan. Knowledge adaptation for efficient semantic segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 578–587, 2019.

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. In IEEE ICCV, 2021a.

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In IEEE CVPR, 2021b.

Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7), 2015.

R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019.

Zejiang Hou, Fei Sun, Yen-Kuang Chen, Yuan Xie, and S. Y. Kung. MILAN: Masked image pretraining on language assisted representation. arXiv preprint arXiv:2208.06049, 2022.

Lang Huang, Shan You, Mingkai Zheng, Fei Wang, Chen Qian, and T. Yamasaki. Green hierarchical vision transformer for masked image modeling. arXiv preprint arXiv:2205.13515, 2022.

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pp. 4904–4916. PMLR, 2021.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351, 2020.

Hao Liu, Xinghua Jiang, Xin Li, Antai Guo, Deqiang Jiang, and Bo Ren. The devil is in the frequency: Geminated gestalt autoencoder for self-supervised visual pre-training. arXiv preprint arXiv:2204.08227, 2022a.

Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo. Swin transformer v2: Scaling up capacity and resolution. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2022b.

Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. SLIP: Self-supervision meets language-image pre-training. arXiv preprint arXiv:2112.12750, 2021.

Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Intriguing properties of vision transformers. Advances in Neural Information Processing Systems, 34:23296–23308, 2021.

Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, and Furu Wei. BEiT v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366, 2022.

Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pp. 8748–8763. PMLR, 2021.

A. Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.

Chenxin Tao, Xizhou Zhu, Gao Huang, Yu Qiao, Xiaogang Wang, and Jifeng Dai. Siamese image modeling for self-supervised vision representation learning. arXiv preprint arXiv:2206.01204, 2022.

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998–6008, 2017.

Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pp. 10506–10518, 2019.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957, 2020.

Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022.

Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. Masked feature prediction for self-supervised visual pre-training. arXiv preprint arXiv:2112.09133, 2021.

Longhui Wei, Lingxi Xie, Wengang Zhou, Houqiang Li, and Qi Tian. MVP: Multimodality-guided visual pre-training. arXiv preprint arXiv:2203.05175, 2022a.

Yixuan Wei, Han Hu, Zhenda Xie, Zheng Zhang, Yue Cao, Jianmin Bao, Dong Chen, and Baining Guo. Contrastive learning rivals masked image modeling in fine-tuning via feature distillation. arXiv preprint arXiv:2205.14141, 2022b.

Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.

Zhirong Wu, Zihang Lai, Xiao Sun, and Stephen Lin. Extreme masking for learning instance and distributed visual representations. arXiv preprint arXiv:2206.04667, 2022.

Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018.

Jing Yang, Brais Martínez, Adrian Bulat, and Georgios Tzimiropoulos. Knowledge distillation via softmax regression representation learning. In ICLR, 2021.
Kun Yi, Yixiao Ge, Xiaotong Li, Shusheng Yang, Dian Li, Jianping Wu, Ying Shan, and Xiaohu Qie. Masked image modeling with denoising contrast. arXiv preprint arXiv:2205.09616, 2022.

Xiaosong Zhang, Yunjie Tian, Wei Huang, Qixiang Ye, Qi Dai, Lingxi Xie, and Qi Tian. HiViT: Hierarchical vision transformer meets masked image modeling. arXiv preprint arXiv:2205.14949, 2022.

Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. Int. J. Comput. Vis., 127(3):302–321, 2019.

Qiang Zhou, Chaohui Yu, Hao Luo, Zhibin Wang, and Hao Li. MimCo: Masked image modeling pre-training with contrastive teacher. arXiv preprint arXiv:2209.03063, 2022.

A Hyperparameters for MaskDistill Pretraining

Table 11: Hyperparameters for MaskDistill pretraining on ImageNet-1K.

| Hyperparameters | Base Size | Large Size | Huge Size |
|---|---|---|---|
| Layers | 12 | 24 | 32 |
| Hidden size | 768 | 1024 | 1280 |
| FFN inner hidden size | 3072 | 4096 | 5120 |
| Attention heads | 12 | 16 | 16 |
| Layer scale | 0.1 | 1e-5 | 1e-5 |
| Patch size | 16×16 | 16×16 | 14×14 |
| Training epochs | 300/800 | 300/800 | 300/800 |
| Batch size | 2048 | 2048 | 2048 |
| Adam ϵ | 1e-8 | 1e-8 | 1e-8 |
| Adam β | (0.9, 0.999) | (0.9, 0.999) | (0.9, 0.999) |
| Peak learning rate | 1.5e-3 | 1.5e-3 | 1.5e-3 |
| Minimal learning rate | 1e-5 | 1e-5 | 1e-5 |
| Learning rate schedule | Cosine | Cosine | Cosine |
| Warmup epochs | 10 | 10 | 10 |
| Stoch. depth | 0.1 | 0.2 | 0.25 |
| Gradient clipping | 3.0 | 3.0 | 3.0 |
| Dropout | ✗ | ✗ | ✗ |
| Weight decay | 0.05 | 0.05 | 0.05 |
| Data augmentation | RandomResizedCrop | RandomResizedCrop | RandomResizedCrop |
| Input resolution | 224×224 | 224×224 | 224×224 |
| Color jitter | 0.4 | 0.4 | 0.4 |

B Hyperparameters for ADE20K Semantic Segmentation Fine-tuning

Table 12: Hyperparameters for fine-tuning MaskDistill on ADE20K.

| Hyperparameters | ViT-B/16 | ViT-L/16 |
|---|---|---|
| Relative positional embeddings | ✓ | ✓ |
| Shared relative positional embeddings | ✓ | ✓ |
| Peak learning rate | {0.5, 0.8, 1.0, 1.5}e-4 | {0.5, 0.8, 1.0, 1.5}e-4 |
| Fine-tuning steps | 160K | 160K |
| Batch size | 16 | 16 |
| Adam ϵ | 1e-8 | 1e-8 |
| Adam β | (0.9, 0.999) | (0.9, 0.999) |
| Layer-wise learning rate decay | 0.75 | 0.85 |
| Minimal learning rate | 0 | 0 |
| Learning rate schedule | Linear | Linear |
| Warmup steps | 1500 | 1500 |
| Dropout | ✗ | ✗ |
| Stoch. depth | 0.1 | 0.2 |
| Weight decay | 0.05 | 0.05 |
| Input resolution | 512×512 | 512×512 |

C Hyperparameters for Image Classification Fine-tuning

Table 13: Hyperparameters for fine-tuning MaskDistill on ImageNet-1K.

| Hyperparameters | ViT-B/16 | ViT-L/16 | ViT-H/14 |
|---|---|---|---|
| Peak learning rate | 5e-4 | 5e-4 | 2e-4 |
| Fine-tuning epochs | 100 | 50 | 30 |
| Warmup epochs | 20 | 5 | 5 |
| Layer-wise learning rate decay | 0.65 | 0.8 | 0.85 |
| Batch size | 1024 | 1024 | 1024 |
| Adam ϵ | 1e-8 | 1e-8 | 1e-8 |
| Adam β | (0.9, 0.999) | (0.9, 0.999) | (0.9, 0.999) |
| Minimal learning rate | 1e-6 | 1e-6 | 1e-6 |
| Learning rate schedule | Cosine | Cosine | Cosine |
| Stoch. depth | 0.1 | 0.2 | 0.25 |
| Repeated augmentation | ✗ | ✗ | ✗ |
| Weight decay | 0.05 | 0.05 | 0.05 |
| Label smoothing ε | 0.1 | 0.1 | 0.1 |
| Dropout | ✗ | ✗ | ✗ |
| Gradient clipping | ✗ | ✗ | ✗ |
| Erasing prob. | 0.25 | 0.25 | 0.25 |
| Input resolution | 224×224 | 224×224 | 224×224 |
| RandAugment | 9/0.5 | 9/0.5 | 9/0.5 |
| Mixup prob. | 0.8 | 0.8 | 0.8 |
| Cutmix prob. | 1.0 | 1.0 | 1.0 |