# Semi-supervised Vision Transformers at Scale

Zhaowei Cai, Avinash Ravichandran, Paolo Favaro, Manchen Wang, Davide Modolo, Rahul Bhotika, Zhuowen Tu, Stefano Soatto
AWS AI Labs
{zhaoweic,ravinash,pffavaro,manchenw,dmodolo,ztu,soattos}@amazon.com

Abstract

We study semi-supervised learning (SSL) for vision transformers (ViT), an underexplored topic despite the wide adoption of the ViT architecture for different tasks. To tackle this problem, we use an SSL pipeline consisting of first un/self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning. At the semi-supervised fine-tuning stage, we adopt an exponential moving average (EMA)-Teacher framework instead of the popular FixMatch, since the former is more stable and delivers higher accuracy for semi-supervised vision transformers. In addition, we propose a probabilistic pseudo mixup mechanism that interpolates unlabeled samples and their pseudo labels for improved regularization, which is important for training ViTs with weak inductive bias. Our proposed method, dubbed Semi-ViT, achieves comparable or better performance than its CNN counterparts in the semi-supervised classification setting. Semi-ViT also enjoys the scalability benefits of ViTs and can be readily scaled up to large-size models with increasing accuracy. For example, Semi-ViT-Huge achieves an impressive 80% top-1 accuracy on ImageNet using only 1% of the labels, which is comparable with Inception-v4 using 100% of the ImageNet labels. The code is available at https://github.com/amazon-science/semi-vit.

1 Introduction

[Figure 1: (a) and (b) compare our Semi-ViT with state-of-the-art SSL algorithms (SimCLRv2, PAWS, EMAN, FixMatch, MPL) at different model scales, plotting ImageNet top-1 accuracy against the number of parameters (M) for 1% and 10% labels; (c) compares Semi-ViT-Huge trained with 1%/10% labels against state-of-the-art supervised models (ResNet-152, Inception-v4, ConvNeXt-L, EfficientNet-L2) trained with 100% ImageNet labels.]

In the past few years, Vision Transformers (ViT) [18], which adapt the transformer architecture [64] to the visual domain, have achieved remarkable progress in supervised learning [63, 44, 73], un/self-supervised learning [16, 12, 25], and many other computer vision tasks [11, 19, 1, 58] (with architecture modifications). However, ViTs have yet to show the same advantage in semi-supervised learning (SSL), where only a small subset of the training data is labeled, a problem that sits between supervised and un/self-supervised learning. Although several recent SSL methods have significantly advanced the field [39, 62, 7, 55, 70, 10, 53], the transfer of these methods from Convolutional Neural Networks (CNN) to ViT architectures has yet to show much promise. For example, as discussed in [68], the direct application of FixMatch [55], one of the most popular SSL methods, to ViT leads to inferior performance (about 10 points worse) than when it is used with a CNN architecture. The challenge is potentially caused by the fact that ViTs are known to require more data for training and to have a weaker inductive bias than CNNs [18].
However, in this paper we show that semi-supervised ViTs can outperform their CNN counterparts when trained properly, suggesting promising potential to advance SSL beyond CNN architectures. To achieve this, we use the following SSL pipeline: 1) un/self-supervised pre-training on all data (both labeled and unlabeled), followed by 2) supervised fine-tuning only on labeled data, and finally 3) semi-supervised fine-tuning on all data. In our experiments, this pipeline is stable and reduces the sensitivity to hyperparameter tuning when training ViTs for SSL. At the final stage of semi-supervised fine-tuning, we adopt the EMA-Teacher framework [62, 10], an improved version of the popular FixMatch [55]. Unlike FixMatch, which often fails to converge when training semi-supervised ViT, the EMA-Teacher shows more stable training behavior and better performance. In addition, we propose probabilistic pseudo mixup for pseudo-labeling based SSL methods, a technique that interpolates unlabeled samples together with their pseudo labels for enhanced regularization. In the standard mixup [75], the mixup ratio is randomly sampled from a Beta distribution. In contrast, in probabilistic pseudo mixup the ratio depends on the respective confidences of the two mixed-up samples, such that the sample with higher confidence weighs more in the final interpolated sample. This new data augmentation technique brings non-negligible gains since ViT has weak inductive bias, especially in scenarios where training is more difficult, e.g., without un/self-supervised pre-training or in data regimes with very few labeled samples (e.g., 1% labels).

We call our method Semi-ViT. Notice that Semi-ViT is built on exactly the same design as ViT (i.e., there are neither additional parameters nor architectural changes). Semi-ViT achieves promising results on several fronts (Figure 1). 1) For the first time, we show that pure ViTs can reach comparable or better accuracy than CNNs on SSL.¹ 2) Semi-ViT can be readily scaled up under the SSL setting. This is illustrated in Figure 1 (a) and (b) on ViT architectures at different scales, ranging from ViT-Small to ViT-Huge, and Semi-ViT outperforms the prior art such as SimCLRv2 [15]. 3) Semi-ViT has shown the potential for a substantial reduction of labeling cost. For example, as seen in Figure 1 (c), Semi-ViT-Huge with 1% (10%) of the ImageNet labels achieves performance comparable to a fully-supervised Inception-v4 [59] (ConvNeXt-L [45]). This implies a 100× (10×) reduction in human annotation cost. 4) Semi-ViT achieves state-of-the-art SSL results on ImageNet, e.g., 80.0% (84.3%) top-1 accuracy with only 1% (10%) labels. In addition, the substantial boost in performance by Semi-ViT is not isolated to ImageNet: we find an increase of 13%-21% (7%-10%) top-1 accuracy with 1% (10%) labels over the supervised fine-tuning baselines on other datasets, including Food-101 [9], iNaturalist [30] and Google Landmark [52].

2 Semi-supervised Vision Transformers

2.1 Pipeline

Several pipelines for semi-supervised learning exist in the literature. For example: 1) the model is directly trained from scratch using SSL techniques, e.g., FixMatch [55]; 2) the model is first un/self-supervised pretrained and later fine-tuned on labeled data [26, 14, 22]; 3) the model is first self-supervised pretrained and then fine-tuned via semi-supervised learning on both labeled and unlabeled data [10]. In this paper, we instead adopt the following pipeline: first, optional self-supervised pre-training on all data without using any labels; next, standard supervised fine-tuning on the available labeled data; and finally, semi-supervised fine-tuning on both labeled and unlabeled data. This procedure is similar to [15], with the difference that they use knowledge distillation [29] in their final stage. We find that this training pipeline trains semi-supervised vision transformers in a stable manner and achieves promising results, with possibly less hyperparameter tuning.
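The following is a minimal PyTorch-style sketch of this three-stage schedule. It is only an illustration of the pipeline structure: the helper functions, the timm model name, and all hyperparameter values are placeholders chosen by us, not the released Semi-ViT implementation.

```python
import copy
import torch
import torch.nn.functional as F
import timm


def build_backbone():
    # Stage 1: un/self-supervised pre-training on all images (no labels).
    # The paper uses off-the-shelf MAE checkpoints, so in practice this stage often
    # amounts to loading a pretrained ViT; here we only create the architecture.
    return timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=1000)


def supervised_finetune(model, labeled_loader, epochs=100, lr=1e-3):
    # Stage 2: standard supervised fine-tuning on the small labeled subset only.
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.05)
    for _ in range(epochs):
        for images, labels in labeled_loader:
            loss = F.cross_entropy(model(images), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model


def init_semisup_finetune(student):
    # Stage 3: semi-supervised fine-tuning on labeled + unlabeled data.
    # The teacher starts as a copy of the student and is updated only by EMA (Eq. (1));
    # see the training-step sketch in Section 2.3.
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return student, teacher
```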
¹ Although [68] was the first to use a transformer architecture for SSL, it does so by combining both CNN and ViT architectures, and it requires a CNN teacher to produce the pseudo labels.

[Figure 2: Comparison between FixMatch (a) and EMA-Teacher (b). x_s/x_w is the strongly/weakly augmented view of a sample x, and θ/θ′ are the model parameters; in FixMatch the student and teacher share the same ViT weights.]

2.2 EMA-Teacher Framework

FixMatch [55] has emerged as a popular SSL method in the past few years. As discussed in [10], it can be interpreted as a student-teacher framework in which the student and teacher models are identical, as seen in Figure 2 (a). However, FixMatch has unexpected behaviors, especially when the model incorporates batch normalization (BN) [35]. Although ViT uses Layer Normalization (LN) [4] instead of BN, we still found that FixMatch with ViT underperforms its CNN counterparts and often does not converge. This phenomenon was also observed in [68]. A potential reason is that the student and teacher models are identical in FixMatch, which can easily lead to model collapse [26, 22]. This instability of the identical student-teacher framework has also been observed in other areas, e.g., semi-supervised speech recognition [42, 48, 28]. As suggested in [10], the EMA-Teacher (shown in Figure 2 (b)) is an improved version of FixMatch, and we therefore adopt it for our Semi-ViT. In the EMA-Teacher framework, the teacher parameters θ′ are updated as an exponential moving average (EMA) of the student parameters θ,

θ′ := m·θ′ + (1 − m)·θ,    (1)

where the momentum decay m is a number close to 1, e.g., 0.9999. The student parameters θ are updated by standard optimization, e.g., SGD or AdamW [46]. All other components are exactly the same as in FixMatch, as seen in Figure 2. This temporal weight averaging stabilizes the training trajectories [3, 36] and avoids the model collapse issue [26, 22]. Our experiments also show that the EMA-Teacher framework gives better results and more stable training behavior than FixMatch when training Semi-ViT.

2.3 Semi-supervised Learning Formulation

In the EMA-Teacher framework, each training minibatch contains both labeled and unlabeled samples. The loss on the labeled samples {(x^l_i, y^l_i)}_{i=1}^{N_l} is the standard cross-entropy loss,

L_l = (1/N_l) Σ_i CE(x^l_i, y^l_i).

For an unlabeled sample x^u ∈ {x^u_i}_{i=1}^{N_u}, a weak and a strong augmentation are applied, generating x^{u,w} and x^{u,s}, respectively. The weakly augmented x^{u,w} is forwarded through the teacher network, which outputs the probabilities over classes, p = f(x^{u,w}; θ′). The pseudo label is then produced as ŷ = argmax_c p_c, with associated confidence o = max_c p_c. A pseudo label whose confidence is higher than a confidence threshold τ is used to supervise the learning of the student on the strongly augmented sample x^{u,s},

L_u = (1/N_u) Σ_i 1[o_i ≥ τ] · CE(x^{u,s}_i, ŷ_i),    (2)

where 1[·] is the indicator function. The overall loss is L = L_l + μL_u, where μ is the trade-off weight. Note that only pseudo labels with confidence higher than the threshold τ contribute to the final loss; the others are not used. The philosophy behind this filtering is that pseudo labels with low confidence are noisier and could hijack the SSL training.
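The training step implied by these definitions can be sketched as follows. This is a minimal illustration under our own assumptions: the augmentation transforms are applied outside the function, and the values of τ and μ are illustrative here (μ = 5 matches Section 4, while the per-model τ is given in the paper's appendix).

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(teacher, student, m=0.9999):
    # Eq. (1): theta' <- m * theta' + (1 - m) * theta.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s.detach(), alpha=1.0 - m)


def semisup_step(student, teacher, optimizer,
                 x_l, y_l,               # labeled images and labels
                 x_u_weak, x_u_strong,   # weak/strong views of the same unlabeled images
                 tau=0.7, mu=5.0):
    # Labeled loss L_l: standard cross-entropy.
    loss_l = F.cross_entropy(student(x_l), y_l)

    # Pseudo labels from the teacher on the weak view (no gradients).
    with torch.no_grad():
        probs = F.softmax(teacher(x_u_weak), dim=-1)
        conf, pseudo = probs.max(dim=-1)        # o_i and y_hat_i
        mask = (conf >= tau).float()            # indicator 1[o_i >= tau]

    # Unlabeled loss L_u (Eq. (2)): cross-entropy on the strong view, masked by confidence.
    per_sample = F.cross_entropy(student(x_u_strong), pseudo, reduction="none")
    loss_u = (per_sample * mask).mean()

    # Overall loss L = L_l + mu * L_u.
    loss = loss_l + mu * loss_u
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The teacher is updated by EMA after every student update.
    ema_update(teacher, student)
    return loss.item()
```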
[Figure 3: Different variations of mixup on unlabeled data: (a) Pseudo Mixup and (b) Pseudo Mixup+ use x̃_i = λx_i + (1 − λ)x_j with λ ~ Beta(α, α), while (c) Probabilistic Pseudo Mixup uses x̃_i = λ_i x_i + (1 − λ_i)x_j with λ_i = o_i/(o_i + o_j). The red samples are the ones passing the confidence threshold; the blue samples are not.]

3 Probabilistic Pseudo Mixup

3.1 Mixup

Mixup [75] performs convex combinations of pairs of samples and their labels,

x̃ = λx_i + (1 − λ)x_j,  ỹ = λy_i + (1 − λ)y_j,    (3)

where the mixup ratio λ ~ Beta(α, α) ∈ [0, 1], for α ∈ (0, 1). The samples are usually mixed up within a single minibatch during training. Given a minibatch B and its shuffled version B′, the mixed-up minibatch is B̃ = λB + (1 − λ)B′, where λ can be either batch-wise or element-wise. Due to its weak inductive bias, ViT is more data hungry than CNNs, so effective data augmentation, e.g., mixup, is critical for training fully-supervised ViT [18, 63, 44, 73]. This also applies to Semi-ViT, since it inherits the weak inductive bias of ViT. Although it is standard to use mixup in supervised learning, how to employ it in a pseudo-labeling based SSL framework, e.g., the EMA-Teacher, is still unclear; we discuss this next.

3.2 Pseudo Mixup

Under the pseudo-labeling based SSL framework [40, 55, 53, 10], given an unlabeled sample and its pseudo label (x^u, ŷ), it contributes to the loss L_u only when its confidence o is not smaller than the confidence threshold τ, as seen in (2). According to their confidence scores, the unlabeled minibatch B^u can thus be grouped into a clean subset B̂^u = {(x^u_i, ŷ_i) | o_i ≥ τ} and a noisy subset B̄^u = B^u \ B̂^u. One straightforward solution is to apply mixup to the full unlabeled minibatch B^u, with no differentiation between clean and noisy samples; we denote this pseudo mixup, as shown in Figure 3 (a). After pseudo mixup, still only the samples in B̂^u contribute to the loss, and the samples in B̄^u are discarded. In this way, the mixup operation is more than just a data augmentation: a sample in B̄^u will also contribute to the final loss if it is mixed up with a sample in B̂^u. As a result, due to the randomness of the pairing, a substantial number of noisy samples can enter the loss calculation, which goes against the philosophy of pseudo-labeling. Since only the clean subset B̂^u contributes to the final loss, another choice is to use mixup only on B̂^u, denoted pseudo mixup+, as shown in Figure 3 (b). In this way, no sample in the noisy subset B̄^u affects the training.

3.3 Probabilistic Pseudo Mixup

Although the samples in B̄^u are noisy, they still carry useful information for the model to learn. Pseudo mixup can partially leverage that information by blending the noisy and clean pseudo samples together. However, the mixup ratio is randomly drawn from a Beta distribution and does not depend on the confidence of each sample, which is not ideal. When two samples are mixed up, the sample with higher confidence should receive a higher mixup ratio, such that it weighs more in the final loss. Motivated by this intuition, we propose probabilistic pseudo mixup (Figure 3 (c)), where the mixup ratio λ reflects the sample confidence,

λ_i = o_i / (o_i + o_j).    (4)

For example, if o_i = 0.9 and o_j = 0.6, then λ_i = 0.9/1.5 = 0.6, so the more confident sample dominates the interpolation. Also, the confidence score of the mixed sample x̃^u_i is updated after the mixup operation as

õ_i = max(o_i, o_j),    (5)

because the confidence score should align with the majority of the image content. The final clean subset B̃^u = {(x̃^u_i, ỹ_i) | õ_i ≥ τ} contributes to the final loss. Probabilistic pseudo mixup can enhance regularization and leverage information from all samples, even the noisy ones, without violating the philosophy of pseudo-labeling. It effectively alleviates the issue of the weak inductive bias of Semi-ViT and brings substantial gains, as will be shown in our experiments.
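Below is a minimal sketch of probabilistic pseudo mixup on a minibatch of unlabeled samples, following Eqs. (3)-(5). The pairing by a random permutation and the soft-target loss are our assumptions about one reasonable implementation, not the released code.

```python
import torch
import torch.nn.functional as F


def prob_pseudo_mixup(x_u, pseudo_onehot, conf, tau=0.7):
    """Confidence-weighted mixup of unlabeled samples and their pseudo labels.

    x_u:           (N, C, H, W) strongly augmented unlabeled images
    pseudo_onehot: (N, K) one-hot (or soft) pseudo labels from the teacher
    conf:          (N,) teacher confidences o_i
    """
    perm = torch.randperm(x_u.size(0))            # pair sample i with sample perm[i]
    o_i, o_j = conf, conf[perm]
    lam = o_i / (o_i + o_j)                       # Eq. (4): confidence-based mixup ratio

    lam_x = lam.view(-1, 1, 1, 1)
    x_mix = lam_x * x_u + (1.0 - lam_x) * x_u[perm]
    lam_y = lam.view(-1, 1)
    y_mix = lam_y * pseudo_onehot + (1.0 - lam_y) * pseudo_onehot[perm]

    conf_mix = torch.maximum(o_i, o_j)            # Eq. (5): updated confidence
    mask = (conf_mix >= tau).float()              # keep only the final clean subset
    return x_mix, y_mix, mask


def unlabeled_loss(student, x_mix, y_mix, mask):
    # Soft-target cross-entropy on the mixed samples, masked by the updated confidence.
    log_probs = F.log_softmax(student(x_mix), dim=-1)
    per_sample = -(y_mix * log_probs).sum(dim=-1)
    return (per_sample * mask).mean()
```

For pseudo mixup (Figure 3 (a)) the ratio would instead be drawn from Beta(α, α) and the mask would use the original confidences, while pseudo mixup+ (Figure 3 (b)) would first discard the noisy samples before mixing; the version above differs only in how λ and the post-mixup confidence are computed.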
4 Experiments

We evaluate Semi-ViT mainly on ImageNet, which consists of 1.28M training and 50K validation images. We sample 10%/1% of the labels from the ImageNet training set for the semi-supervised evaluation. We study both scenarios: with and without self-supervised pre-training. Without self-pretraining, we only evaluate on 10% labels, since learning from scratch with 1% labels is very difficult. When self-pretrained, MAE [25] is mainly used, and we directly use their pretrained models. All learning is optimized with AdamW [46], using a cosine learning rate schedule, with a weight decay of 0.05. The default momentum decay m of (1) is 0.9999. In a minibatch, N_u = 5N_l, and the loss trade-off is μ = 5. The mixup is a combination of mixup [75] and CutMix [74], as in the implementation of [69]. More details can be found in the appendix.
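A rough sketch of this optimizer and augmentation setup is given below. The weight decay of 0.05 and the cosine schedule follow the text above, while the learning rate, epoch count, step count, and mixup/CutMix alpha values are placeholders (the exact values are in the paper's appendix and in the timm implementation [69]).

```python
import torch
import timm
from timm.data import Mixup

# Placeholder values for anything not stated in this section (lr, epochs, steps, alphas).
model = timm.create_model("vit_base_patch16_224", num_classes=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

epochs, steps_per_epoch = 100, 1000
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs * steps_per_epoch)

# Mixup + CutMix on the labeled data, as in the timm implementation.
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, num_classes=1000)
```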
4.1 Semi-ViT Results

Table 1: Semi-ViT results compared with fine-tuning. The models are self-pretrained by MAE [25].

| Model | Param | Method | 1% | 10% | 100% |
| --- | --- | --- | --- | --- | --- |
| ViT-Base | 86M | finetune | 57.4 | 73.7 | 83.7 |
| ViT-Base | 86M | Semi-ViT | 71.0 | 79.7 | - |
| ViT-Large | 307M | finetune | 67.1 | 79.2 | 86.0 |
| ViT-Large | 307M | Semi-ViT | 77.3 | 83.3 | - |
| ViT-Huge | 632M | finetune | 71.5 | 81.4 | 86.9 |
| ViT-Huge | 632M | Semi-ViT | 80.0 | 84.3 | - |

When the model is self-pretrained by MAE [25], we first evaluate the fine-tuning performance of MAE on the labeled data only, as is common practice in the self/un-supervised learning literature [26, 14, 22], with results shown in Table 1. This already leads to strong semi-supervised baselines, e.g., 81.4 top-1 accuracy for ViT-Huge on 10% labels, indicating that MAE is a strong self-supervised learning technique. However, Semi-ViT brings significant additional improvements over these strong baselines for all models, e.g., 8.5-13.6 points for 1% labels and 2.9-6.0 points for 10% labels. The fine-tuning results on 100% of the data are provided as upper-bounds for our Semi-ViT, and the gaps to Semi-ViT are small, e.g., 4.0/2.7/2.6 points for ViT-Base/Large/Huge on 10% labels. An interesting observation is that larger models are more effective when fewer labels are available, which is consistent with the observations in [15]. For example, the fine-tuning gaps between 1% and 100% labels are 26.3/18.9/15.4 points for ViT-Base/Large/Huge, i.e., decreasing with model size. The observation for Semi-ViT is similar, e.g., gaps of 12.7/8.7/6.9 points to the respective upper-bounds on 1% labels. These results show that vision transformers can perform very well in semi-supervised learning, in addition to supervised and un/self-supervised learning.

4.2 Ablation Studies

We ablate different factors of Semi-ViT in this section.

Table 2: Comparison between FixMatch and the EMA-Teacher. ✗ means that training failed, with accuracy close to 0.

| Model | Pretrained | Method | 1% | 10% |
| --- | --- | --- | --- | --- |
| ViT-Small | None | FixMatch | - | ✗ |
| ViT-Small | None | EMA-Teacher | - | 65.6 |
| ViT-Base | None | FixMatch | - | ✗ |
| ViT-Base | None | EMA-Teacher | - | 68.9 |
| ViT-Base | MAE | FixMatch | ✗ | 74.8 |
| ViT-Base | MAE | EMA-Teacher | 65.3 | 78.1 |

FixMatch vs. EMA-Teacher is compared in Table 2. These experiments do not yet use the pseudo mixup techniques of Section 3. When the model is not self-pretrained, FixMatch training is unstable and often fails. When the model is self-pretrained, FixMatch training becomes stable and starts to achieve reasonable results, e.g., 74.8 for ViT-Base on 10% labels, which is already better than the prior art on ResNet-50, e.g., 73.9 of MPL [53] and 74.0 of EMAN [10]. However, this is only 1.1 points higher than the fine-tuning baseline of Table 1, indicating that FixMatch is not an effective SSL framework for ViT. The EMA-Teacher achieves much better results, with 3.3 points of improvement over FixMatch when self-pretrained. Even without self-pretraining, the EMA-Teacher still achieves satisfactory performance, while FixMatch fails.

Table 3: Comparison among different mixup variations.

| Model | Pretrained | Mixup | 1% | 10% |
| --- | --- | --- | --- | --- |
| ViT-Small | None | EMA-Teacher | - | 65.6 |
| ViT-Small | None | Pseudo Mixup | - | 68.3 |
| ViT-Small | None | Pseudo Mixup+ | - | 68.8 |
| ViT-Small | None | Prob Pseudo Mixup | - | 70.9 |
| ViT-Base | None | EMA-Teacher | - | 68.9 |
| ViT-Base | None | Pseudo Mixup | - | 71.6 |
| ViT-Base | None | Pseudo Mixup+ | - | 72.1 |
| ViT-Base | None | Prob Pseudo Mixup | - | 73.5 |
| ViT-Base | MAE | EMA-Teacher | 65.3 | 78.1 |
| ViT-Base | MAE | Pseudo Mixup | 69.5 | 78.3 |
| ViT-Base | MAE | Pseudo Mixup+ | 70.1 | 78.7 |
| ViT-Base | MAE | Prob Pseudo Mixup | 71.0 | 79.7 |

Probabilistic Pseudo Mixup. Different mixup variations on unlabeled data are compared in Table 3. Note that the standard mixup with the implementation of [69] is used on the labeled data as usual. The EMA-Teacher rows do not use any mixup mechanism on the unlabeled data, serving as the baseline here. When the pseudo mixup of Figure 3 (a) is applied to the unlabeled data, the performance usually shows substantial gains over the EMA-Teacher baselines, especially in scenarios where training is more difficult, e.g., without self-pretraining or on 1% labels. This shows the importance of using mixup on the unlabeled data for improved regularization. However, as discussed in Section 3.2, pseudo mixup can pull many noisy samples into training. Pseudo mixup+ of Figure 3 (b) consistently improves over pseudo mixup, by about 0.5 points, showing that removing those noisy samples does help. In addition, probabilistic pseudo mixup of Figure 3 (c) further improves over pseudo mixup+ by 1-2 points in all cases. These results imply that the noisy samples do carry useful information for SSL training, but their weights should be suppressed, especially when their confidence scores are low. This data augmentation technique also effectively alleviates the training difficulty of semi-supervised vision transformers with weak inductive bias.

Table 4: Ablation on the confidence threshold τ.

| Method (ViT-Base) | label | τ=0 | τ=0.3 | τ=0.4 | τ=0.5 | τ=0.6 | τ=0.7 | τ=0.8 | τ=0.9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EMA-Teacher | 1% | 63.1 | 64.4 | 64.6 | 65.1 | 65.3 | 65.1 | 64.4 | 63.4 |
| EMA-Teacher | 10% | 75.4 | 76.7 | 77.2 | 77.7 | 77.9 | 78.1 | 78.2 | 77.9 |
| Semi-ViT | 1% | 70.8 | 71.4 | 71.3 | 71.3 | 71.0 | 70.4 | 68.6 | 61.8 |
| Semi-ViT | 10% | 79.4 | 79.5 | 79.7 | 79.7 | 79.6 | 79.4 | 79.0 | 77.2 |

Effect of Confidence Threshold. We ablate the effect of the confidence threshold τ of (2) in Table 4. We find that Semi-ViT is quite robust to low confidence thresholds. One possible reason is that Semi-ViT uses probabilistic pseudo mixup: when τ is low, the low-confidence samples do not hijack the training, since their contributions depend on their confidence scores. This implies that the hyperparameter τ can possibly be removed (τ = 0) in Semi-ViT. The EMA-Teacher, in contrast, drops by 2-3 points when the confidence threshold is removed. The final choices of τ for the different Semi-ViT models are shown in Table 13 in the appendix.
Table 5: Ablation on the momentum decay m of the exponential moving average. ✗ means that training failed, with accuracy close to 0.

| Method (ViT-Base) | label | m=0 | m=0.9 | m=0.99 | m=0.999 | m=0.9999 | m=0.99999 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EMA-Teacher | 1% | ✗ | 23.5 | 49.3 | 63.1 | 65.3 | 59.7 |
| EMA-Teacher | 10% | 74.8 | 75.3 | 76.4 | 77.2 | 78.1 | 77.9 |
| Semi-ViT | 1% | 69.3 | 69.7 | 71.1 | 71.6 | 71.0 | 63.1 |
| Semi-ViT | 10% | 79.5 | 79.5 | 79.6 | 79.8 | 79.7 | 79.0 |

Effect of Momentum Decay. Table 5 shows the effect of the momentum decay m in the EMA teacher. Note that when m = 0, the framework reduces to FixMatch. We find that Semi-ViT is robust to m. For 10% labels, Semi-ViT changes very little when m decreases to 0, but the EMA-Teacher drops by 3.3 points. For 1% labels, training is more challenging, and the choice of m becomes more important. In this case, Semi-ViT drops by 2.3 points from m = 0.999 to m = 0, but the EMA-Teacher can fail when m decreases to 0. The robustness of Semi-ViT to the momentum decay m is also attributed to the use of probabilistic pseudo mixup.

Effect of Self-pretraining. The self-pretraining of MAE [25] yields a substantial boost in performance, as seen in Table 3. For ViT-Base, MAE improves the EMA-Teacher with and without probabilistic pseudo mixup by 6.2 and 9.2 points, respectively. In addition, it helps to train the models in more challenging scenarios, e.g., 1% labels: without self-pretraining, training fails to deliver good results on 1% labels. Notice that, even without pre-training, our Semi-ViT ("Prob Pseudo Mixup" in Table 3) still achieves slightly better performance than the CNN counterparts: 70.9 for Semi-ViT-Small vs. 67.1 for FixMatch-ResNet50 [55] or 69.2 for EMAN-ResNet50 [10] when trained from scratch for 100 epochs.

Table 6: Ablation on supervised fine-tuning (number of supervised fine-tuning epochs).

| Method (ViT-Base) | label | epochs=0 | epochs=10 | epochs=50 | epochs=100 | epochs=200 |
| --- | --- | --- | --- | --- | --- | --- |
| Supervised-ViT | 1% | - | 24.7 | 53.6 | 57.4 | 56.9 |
| Supervised-ViT | 10% | - | 66.3 | 72.9 | 73.7 | 73.2 |
| EMA-Teacher | 1% | 62.7 | 62.5 | 60.9 | 65.3 | 66.9 |
| EMA-Teacher | 10% | 76.5 | 74.2 | 77.7 | 78.1 | 78.2 |
| Semi-ViT | 1% | 69.7 | 69.8 | 70.4 | 71.0 | 70.9 |
| Semi-ViT | 10% | 79.3 | 79.4 | 79.6 | 79.7 | 79.6 |

Effect of Supervised Fine-tuning is ablated in Table 6 by varying the number of supervised fine-tuning epochs. The Supervised-ViT rows report the supervised fine-tuning results, i.e., the starting points from which the subsequent semi-supervised fine-tuning begins. Semi-ViT is again robust to the length of supervised fine-tuning: the accuracy decrease is only 0.4 (1.3) points for 10% (1%) labels when supervised fine-tuning is removed. The performance of the EMA-Teacher, however, decreases by 1.7 (4.2) points. This shows that sufficient supervised fine-tuning does stabilize the training procedure, especially for a less robust framework such as the EMA-Teacher. Notice, however, that supervised fine-tuning can sometimes hurt the performance of the EMA-Teacher if it is not sufficient (e.g., epochs=10).

Other Self-pretraining Techniques. Beyond MAE, we also experiment with other self-pretraining techniques, including MoCo-v3 [16] and DINO [12], in Table 7.
By comparing the fine-tuning results, DINO is close to MoCo-v3 for ViT-Base but much better for ViT-Small, and both are better than MAE for ViT-Base, suggesting that DINO could be a better self-pretraining technique for smaller ViT models. On top of these strong fine-tuning baselines, semi-supervised fine-tuning using the EMA-Teacher still brings nontrivial improvements for both DINO and MoCo-v3, e.g., 5.8 (2.1) points on 1% (10%) labels for DINO-ViT-Base. In addition, probabilistic pseudo mixup further improves over the EMA-Teacher, independently of the self-pretraining algorithm. The final Semi-ViT-Base with DINO is 2.1 (0.5) points better than that with MAE on 1% (10%) labels.

Table 7: Semi-ViT results with other self-pretraining techniques.

| Model | Pretrained | Method | 1% | 10% |
| --- | --- | --- | --- | --- |
| ViT-Small | MoCo-v3 [16] | finetune | 51.2 | 69.1 |
| ViT-Small | MoCo-v3 [16] | EMA-Teacher | 61.9 | 72.3 |
| ViT-Small | MoCo-v3 [16] | +Prob Pseudo Mixup | 64.7 | 72.9 |
| ViT-Small | DINO [12] | finetune | 58.7 | 73.9 |
| ViT-Small | DINO [12] | EMA-Teacher | 66.3 | 76.3 |
| ViT-Small | DINO [12] | +Prob Pseudo Mixup | 68.0 | 77.1 |
| ViT-Base | MoCo-v3 [16] | finetune | 66.3 | 74.5 |
| ViT-Base | MoCo-v3 [16] | EMA-Teacher | 68.9 | 77.7 |
| ViT-Base | MoCo-v3 [16] | +Prob Pseudo Mixup | 72.3 | 79.2 |
| ViT-Base | DINO [12] | finetune | 65.0 | 76.0 |
| ViT-Base | DINO [12] | EMA-Teacher | 70.8 | 78.1 |
| ViT-Base | DINO [12] | +Prob Pseudo Mixup | 73.1 | 80.2 |

Table 8: Results on ConvNeXt [45].

| Model | Upper-bound | Method | 10% |
| --- | --- | --- | --- |
| ConvNeXt-T | 80.7 | supervised | 61.2 |
| ConvNeXt-T | 80.7 | EMA-Teacher | 70.4 |
| ConvNeXt-T | 80.7 | +Prob Pseudo Mixup | 74.1 |
| ConvNeXt-S | 81.4 | supervised | 64.1 |
| ConvNeXt-S | 81.4 | EMA-Teacher | 71.7 |
| ConvNeXt-S | 81.4 | +Prob Pseudo Mixup | 75.1 |

Other Network Architectures. Although this paper mainly focuses on ViT architectures, the proposed probabilistic pseudo mixup is not limited to them. We also tried it with CNN architectures, e.g., ResNet [27]. However, we find that the direct use of standard mixup does not improve fully-supervised ResNet performance, and neither does probabilistic pseudo mixup in the corresponding SSL setting. Instead, we evaluate it on the recently proposed ConvNeXt [45], which uses mixup for improved results. Since the goal is not to fully reproduce the results of [45], all models are trained for only 100 epochs, including the supervised upper-bounds. The results in Table 8 demonstrate that probabilistic pseudo mixup is not limited to ViT but also benefits CNN architectures, with improvements of 3-4 points, suggesting that it generalizes well.

4.3 Comparison with the State-of-the-Art

Semi-ViT is compared with state-of-the-art semi-supervised learning algorithms in Table 9. When the model capacity is comparable, our Semi-ViT shows much better results than the prior art, e.g., MPL-RN-50 [53] vs. Semi-ViT-Small, CowMix-RN152 [20] vs. Semi-ViT-Base, S4L-RN50-4× [8] vs. Semi-ViT-Large, and SimCLRv2+KD-RN152-3×-SK [15] vs. Semi-ViT-Huge. The only other transformer-based SSL method is SemiFormer [68], but it requires a CNN teacher model and blends convolution and transformer modules together for good performance. In contrast, our Semi-ViT is purely ViT based, without any additional parameters or architecture changes, and the Semi-ViT-Small model is already better than SemiFormer (77.1 vs. 75.5). These comparisons support that Semi-ViT advances the state of the art of semi-supervised learning.

Scalability is an advantage of ViT, and we compare the scalability of Semi-ViT with previous works in Figure 1 (a) and (b). The comparison shows that Semi-ViT achieves a better trade-off between model capacity and accuracy, and can be scaled up more effectively than the prior art, SimCLRv2 [15].
For example, SimCLRv2 and PAWS [2] scale up the model mainly in terms of network depth and width, and they appear to saturate when the model reaches medium size, e.g., around 300M parameters, whereas our Semi-ViT continues to improve steadily beyond that point.

Table 9: Comparison with state-of-the-art SSL models.

| Method | Architecture | Param | 1% | 10% |
| --- | --- | --- | --- | --- |
| UDA [70] | ResNet-50 | 26M | - | 68.8 |
| FixMatch [55] | ResNet-50 | 26M | - | 71.5 |
| S4L [8] | ResNet-50 (4×) | 375M | - | 73.2 |
| MPL [53] | ResNet-50 | 26M | - | 73.9 |
| CowMix [20] | ResNet-152 | 60M | - | 73.9 |
| EMAN [10] | ResNet-50 | 26M | 63.0 | 74.0 |
| PAWS [2] | ResNet-50 | 26M | 66.5 | 75.5 |
| SimCLRv2+KD [15] | RN152 (3×+SK) | 794M | 76.6 | 80.9 |
| Transformer: | | | | |
| DINO [12] | ViT-Small | 22M | 64.5 | 72.2 |
| SemiFormer [68] | ViT-S+Conv | 42M | - | 75.5 |
| Semi-ViT (ours) | ViT-Small | 22M | 68.0 | 77.1 |
| Semi-ViT (ours) | ViT-Base | 86M | 71.0 | 79.7 |
| Semi-ViT (ours) | ViT-Large | 307M | 77.3 | 83.3 |
| Semi-ViT (ours) | ViT-Huge | 632M | 80.0 | 84.3 |

Semi-ViT is also compared with the supervised state-of-the-art in Table 10. Our Semi-ViT-Huge is comparable with Inception-v4 [59] but with a 100× reduction in annotation cost, and comparable with ConvNeXt-L [45] (better than Swin-B [44]) but with a 10× reduction in annotation cost. These comparisons imply that Semi-ViT has great potential for reducing labeling cost.

Table 10: Comparison with state-of-the-art fully supervised models.

| Model | Param | Data | top-1 | top-5 |
| --- | --- | --- | --- | --- |
| ResNet-50 [27] | 26M | ImageNet | 76.0 | 93.0 |
| ResNet-152 [27] | 60M | ImageNet | 77.8 | 93.8 |
| DenseNet-264 [32] | 34M | ImageNet | 77.9 | 93.9 |
| Inception-v3 [60] | 24M | ImageNet | 78.8 | 94.4 |
| Inception-v4 [59] | 48M | ImageNet | 80.0 | 95.0 |
| ResNeXt-101 [72] | 84M | ImageNet | 80.9 | 95.6 |
| SENet-154 [31] | 146M | ImageNet | 81.3 | 95.5 |
| ConvNeXt-L [45] | 198M | ImageNet | 84.3 | - |
| EfficientNet-L2 [61] | 480M | ImageNet | 85.5 | 97.5 |
| Transformer: | | | | |
| ViT-Huge [18] | 632M | JFT+ImageNet | 88.6 | - |
| DeiT-B [63] | 86M | ImageNet | 81.8 | - |
| Swin-B [44] | 88M | ImageNet | 83.3 | - |
| MAE-ViT-Huge [25] | 632M | ImageNet | 86.9 | - |
| Semi-ViT-Huge (ours) | 632M | 1% ImageNet | 80.0 | 93.1 |
| Semi-ViT-Huge (ours) | 632M | 10% ImageNet | 84.3 | 96.6 |

4.4 Other Datasets

The generalization of Semi-ViT is evaluated on additional datasets, including Food-101 [9], iNaturalist [30] and Google Landmark [52]. Since these datasets go beyond ImageNet, we assume that the ImageNet dataset is available and that the model has already been supervised pretrained on ImageNet; the model is then fine-tuned on the different target datasets with a few labels. The results are shown in Table 11. On these datasets, our Semi-ViT improves over the fine-tuning baselines by 13-21 (7-10) points on 1% (10%) labels. Note that on Food-101, Semi-ViT with 1% (10%) labels is close to the fine-tuning baseline with 10% (100%) labels, i.e., 82.1 vs. 84.5 (91.3 vs. 93.1), indicating that Semi-ViT can help save annotation cost by about 10× on this dataset.

Table 11: Semi-ViT-Base results on other datasets.

| Dataset | # train/test | # class | Method | 1% | 10% | 100% |
| --- | --- | --- | --- | --- | --- | --- |
| Food-101 [9] | 75.7K/25.2K | 101 | Finetune | 60.9 | 84.5 | 93.1 |
| Food-101 [9] | 75.7K/25.2K | 101 | Semi-ViT | 82.1 | 91.3 | - |
| iNaturalist [30] | 265K/3K | 1010 | Finetune | 19.6 | 57.3 | 81.2 |
| iNaturalist [30] | 265K/3K | 1010 | Semi-ViT | 32.3 | 67.7 | - |
| Google Landmark [52] | 200K/15.6K | 256 | Finetune | 45.3 | 74.0 | 91.5 |
| Google Landmark [52] | 200K/15.6K | 256 | Semi-ViT | 61.0 | 81.0 | - |

5 Related Work

Semi-supervised learning has a long history of research [78, 13]. Recent works can be roughly clustered into two groups, consistency-based [39, 62, 50, 70, 66] and pseudo-labeling based [40, 55, 53, 10]. Consistency-based methods usually add some noise to the input or the model, and then enforce their feature or probability outputs to be consistent. For example, to construct two outputs for the consistency regularization, the Π-model [39] adds noise to the model weights using dropout [57], Mean-teacher [62] builds a teacher model that is EMA-updated from the student model, and UDA [70] applies a weak and a strong data augmentation to the input.
On the other hand, the idea of pseudo-labeling, or self-training, can be traced back to [34, 49], which uses model predictions as hard pseudo labels to guide the learning on unlabeled data. This idea has recently become popular in SSL [40, 55, 53, 10, 71], and some theoretical explanations are available [76, 24]. In offline pseudo-labeling [40, 71], the model used to generate pseudo labels is usually frozen or updated only once in a while during training, e.g., at the end of every training epoch, whereas in online pseudo-labeling [55, 10] the teacher model is updated continuously along with the student. Beyond classification, pseudo-labeling has also achieved promising progress on more challenging tasks, e.g., object detection [56, 43, 67]. Our Semi-ViT falls into the category of online pseudo-labeling.

Mixup [75] is an effective data augmentation technique, which linearly interpolates input samples and their labels and performs vicinal risk minimization. It has been successfully used in image classification and in other domains, e.g., generative adversarial networks [47], sentence classification [23], etc. Other variants have also been developed, e.g., Manifold Mixup [65], which mixes up in the feature space, or CutMix [74], which cuts a patch from one image and pastes it into another. Mixup has also been successfully adopted in self-supervised learning [37, 41] and semi-supervised learning [7, 66, 6]. Although [7, 66, 6] also use mixup for SSL, they differ from our probabilistic pseudo mixup: 1) they are consistency-based SSL frameworks, whereas ours is pseudo-labeling based; 2) their mixup ratio is randomly sampled, whereas ours depends on the pseudo-label confidence; 3) they have only shown success on small CNN architectures and small datasets, e.g., CIFAR [38] and SVHN [51], whereas our results are obtained with transformer architectures at various scales and on large-scale datasets, e.g., ImageNet [54], iNaturalist [30], Google Landmark [52], etc.

6 Conclusion

In this paper, we propose Semi-ViT for vision-transformer-based semi-supervised learning. This is the first time that pure vision transformers achieve promising results on semi-supervised learning and even surpass the previous best CNN-based counterparts by a large margin. In addition, Semi-ViT inherits the scalability benefits of ViT, and larger models lead to smaller gaps to the fully supervised upper-bounds. This points to a promising direction for semi-supervised learning. The advantages of Semi-ViT also generalize well to other datasets, suggesting potentially broader impact. We hope these promising results will encourage more efforts in semi-supervised vision transformers.

Limitations. Our paper only considers the standard semi-supervised classification setting, where the full dataset, e.g., ImageNet, is downsampled to smaller scales, and not the advanced setting where full ImageNet is used as labeled data and additional data, e.g., ImageNet-21K, is used as unlabeled data. Moreover, we have only evaluated our approach on the classification task.
It is unclear whether the same conclusions hold in more advanced classification settings and on more challenging tasks, e.g., detection or segmentation.

Potential Negative Social Impacts. Semi-ViT has shown that strong models can be obtained with only a few labels, e.g., 1%. This makes strong AI models accessible to anyone, which could potentially lead to inappropriate use.

References

[1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. Vivit: A video vision transformer. In ICCV, pages 6816 6826. IEEE, 2021. [2] Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Armand Joulin, Nicolas Ballas, and Michael G. Rabbat. Semi-supervised learning of visual features by non-parametrically predicting view assignments with support samples. In ICCV, pages 8423 8432. IEEE, 2021. [3] Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. There are many consistent explanations of unlabeled data: Why you should average. In ICLR. Open Review.net, 2019. [4] Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. Co RR, abs/1607.06450, 2016. [5] Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. ar Xiv preprint ar Xiv:2106.08254, 2021. [6] David Berthelot, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semi-supervised learning with distribution matching and augmentation anchoring. In ICLR. Open Review.net, 2020. [7] David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning. In Neur IPS, pages 5050 5060, 2019. [8] Lucas Beyer, Xiaohua Zhai, Avital Oliver, and Alexander Kolesnikov. S4L: self-supervised semi-supervised learning. In ICCV, pages 1476 1485. IEEE, 2019. [9] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 - mining discriminative components with random forests. In ECCV, volume 8694 of Lecture Notes in Computer Science, pages 446 461. Springer, 2014. [10] Zhaowei Cai, Avinash Ravichandran, Subhransu Maji, Charless C. Fowlkes, Zhuowen Tu, and Stefano Soatto. Exponential moving average normalization for self-supervised and semi-supervised learning. In CVPR, pages 194 203, 2021. [11] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, volume 12346 of Lecture Notes in Computer Science, pages 213 229. Springer, 2020. [12] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, pages 9630 9640. IEEE, 2021. [13] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning (chapelle, o. et al., eds.; 2006). IEEE Transactions on Neural Networks, 20(3):542 542, 2009. [14] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 1597 1607. PMLR, 2020. [15] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. Hinton. Big self-supervised models are strong semi-supervised learners. In Neur IPS, 2020. [16] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, pages 9620 9629. IEEE, 2021.
[17] Ekin Dogus Cubuk, Barret Zoph, Jonathon Shlens, and Quoc Le. Randaugment: Practical automated data augmentation with a reduced search space. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Neur IPS, 2020. [18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR. Open Review.net, 2021. [19] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In ICCV, pages 6804 6815. IEEE, 2021. [20] Geoff French, Avital Oliver, and Tim Salimans. Milking cowmask for semi-supervised image classification. In VISIGRAPP, pages 75 84. SCITEPRESS, 2022. [21] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour. Co RR, abs/1706.02677, 2017. [22] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Ávila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent - A new approach to selfsupervised learning. In Neur IPS, 2020. [23] Hongyu Guo. Nonlinear mixup: Out-of-manifold data augmentation for text classification. In AAAI, pages 4044 4051. AAAI Press, 2020. [24] Haiyun He, Hanshu Yan, and Vincent YF Tan. Information-theoretic characterization of the generalization error for iterative semi-supervised learning. Journal of Machine Learning Research, 23:1 52, 2022. [25] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. ar Xiv preprint ar Xiv:2111.06377, 2021. [26] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsuper- vised visual representation learning. In CVPR, pages 9726 9735. Computer Vision Foundation / IEEE, 2020. [27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770 778. IEEE Computer Society, 2016. [28] Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, and Takaaki Hori. Momentum pseudo-labeling for semi-supervised speech recognition. In Interspeech, pages 726 730. ISCA, 2021. [29] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. Co RR, abs/1503.02531, 2015. [30] Grant Van Horn, Oisin Mac Aodha, Yang Song, Alexander Shepard, Hartwig Adam, Pietro Perona, and Serge J. Belongie. The inaturalist challenge 2017 dataset. Co RR, abs/1707.06642, 2017. [31] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132 7141. Computer Vision Foundation / IEEE Computer Society, 2018. [32] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 2261 2269. IEEE Computer Society, 2017. [33] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, ECCV, volume 9908 of Lecture Notes in Computer Science, pages 646 661. Springer, 2016. [34] H. J. Scudder III. 
Probability of error of some adaptive pattern-recognition machines. IEEE Trans. Inf. Theory, 11(3):363 371, 1965. [35] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 448 456. JMLR.org, 2015. [36] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry P. Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. In UAI, pages 876 885. AUAI Press, 2018. [37] Yannis Kalantidis, Mert Bülent Sariyildiz, Noé Pion, Philippe Weinzaepfel, and Diane Larlus. Hard negative mixing for contrastive learning. In Neur IPS, 2020. [38] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. [39] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2017. [40] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, 2013. [41] Kibok Lee, Yian Zhu, Kihyuk Sohn, Chun-Liang Li, Jinwoo Shin, and Honglak Lee. i-mix: A domain- agnostic strategy for contrastive representation learning. In ICLR. Open Review.net, 2021. [42] Tatiana Likhomanenko, Qiantong Xu, Jacob Kahn, Gabriel Synnaeve, and Ronan Collobert. slimipl: Language-model-free iterative pseudo-labeling. In Interspeech, pages 741 745. ISCA, 2021. [43] Yen-Cheng Liu, Chih-Yao Ma, Zijian He, Chia-Wen Kuo, Kan Chen, Peizhao Zhang, Bichen Wu, Zsolt Kira, and Peter Vajda. Unbiased teacher for semi-supervised object detection. In ICLR. Open Review.net, 2021. [44] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 9992 10002. IEEE, 2021. [45] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. ar Xiv preprint ar Xiv:2201.03545, 2022. [46] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR. Open Review.net, [47] Thomas Lucas, Corentin Tallec, Yann Ollivier, and Jakob Verbeek. Mixed batches and symmetric discriminators for GAN training. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 2850 2859. PMLR, 2018. [48] Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, and Abdelrahman Mohamed. Kaizen: Continuously improving teacher using exponential moving average for semi-supervised speech recognition. In ASRU, pages 518 525. IEEE, 2021. [49] Geoffrey J Mc Lachlan. Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association, 70(350):365 369, 1975. [50] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell., 41(8):1979 1993, 2019. [51] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011. [52] Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-scale image retrieval with attentive deep local features. In ICCV, pages 3476 3485. IEEE Computer Society, 2017. 
[53] Hieu Pham, Zihang Dai, Qizhe Xie, and Quoc V. Le. Meta pseudo labels. In CVPR, pages 11557 11568. Computer Vision Foundation / IEEE, 2021. [54] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis., 115(3):211 252, 2015. [55] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In Neur IPS, 2020. [56] Kihyuk Sohn, Zizhao Zhang, Chun-Liang Li, Han Zhang, Chen-Yu Lee, and Tomas Pfister. A simple semi-supervised learning framework for object detection. ar Xiv preprint ar Xiv:2005.04757, 2020. [57] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929 1958, 2014. [58] Robin Strudel, Ricardo Garcia Pinel, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In ICCV, pages 7242 7252. IEEE, 2021. [59] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, inception- resnet and the impact of residual connections on learning. In AAAI, pages 4278 4284. AAAI Press, 2017. [60] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818 2826. IEEE Computer Society, 2016. [61] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 6105 6114. PMLR, 2019. [62] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Neur IPS, pages 1195 1204, 2017. [63] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 10347 10357. PMLR, 2021. [64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neur IPS, pages 5998 6008, 2017. [65] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 6438 6447. PMLR, 2019. [66] Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. In IJCAI, pages 3635 3641. ijcai.org, 2019. [67] Pei Wang, Zhaowei Cai, Hao Yang, Gurumurthy Swaminathan, Nuno Vasconcelos, Bernt Schiele, and Stefano Soatto. Omni-DETR: Omni-supervised object detection with transformers. In CVPR, 2022. [68] Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, and Yu-Gang Jiang. Semi-supervised vision transformers. ar Xiv preprint ar Xiv:2111.11067, 2021. [69] Ross Wightman. Pytorch image models. https://github.com/rwightman/ pytorch-image-models, 2019. [70] Qizhe Xie, Zihang Dai, Eduard H. Hovy, Thang Luong, and Quoc Le. 
Unsupervised data augmentation for consistency training. In Neur IPS, 2020. [71] Qizhe Xie, Minh-Thang Luong, Eduard H. Hovy, and Quoc V. Le. Self-training with noisy student improves imagenet classification. In CVPR, pages 10684 10695. Computer Vision Foundation / IEEE, 2020. [72] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transfor- mations for deep neural networks. In CVPR, pages 5987 5995. IEEE Computer Society, 2017. [73] Weijian Xu, Yifan Xu, Tyler A. Chang, and Zhuowen Tu. Co-scale conv-attentional image transformers. In ICCV, pages 9961 9970. IEEE, 2021. [74] Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Seong Joon Oh, Youngjoon Yoo, and Junsuk Choe. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, pages 6022 6031. IEEE, 2019. [75] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR. Open Review.net, 2018. [76] Shuai Zhang, Meng Wang, Sijia Liu, Pin-Yu Chen, and Jinjun Xiong. How does unlabeled data improve generalization in self-training? a one-hidden-layer theoretical analysis. ar Xiv preprint ar Xiv:2201.08514, 2022. [77] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In AAAI, pages 13001 13008. AAAI Press, 2020. [78] Xiaojin Jerry Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin- Madison Department of Computer Sciences, 2005. 1. For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? [Yes] See the abstract and introduction sections. (b) Did you describe the limitations of your work? [Yes] See the conclusion section. (c) Did you discuss any potential negative societal impacts of your work? [Yes] See the conclusion section. (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [N/A] (b) Did you include complete proofs of all theoretical results? [N/A] 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main ex- perimental results (either in the supplemental material or as a URL)? [Yes] See the supplemental material. The code will be released upon acceptance. (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See the supplemental material. We have tried our best to provide all experimental details. And the code will be released upon acceptance for reproduction purpose. (c) Did you report error bars (e.g., with respect to the random seed after running experi- ments multiple times)? [Yes] See the supplemental material. (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See the supplemental material. 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [Yes] See the references. (b) Did you mention the license of the assets? [No] Those are common assets. (c) Did you include any new assets either in the supplemental material or as a URL? [N/A] (d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? 
[N/A] (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No] Those are common assets. 5. If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]