Class-Imbalanced Semi-Supervised Learning with Adaptive Thresholding

Lan-Zhe Guo 1  Yu-Feng Li 1

Abstract

Semi-supervised learning (SSL) has proven to be successful in overcoming labeling difficulties by leveraging unlabeled data. Previous SSL algorithms typically assume a balanced class distribution. However, real-world datasets are usually class-imbalanced, causing the performance of existing SSL algorithms to degrade seriously. One essential reason is that pseudo-labels for unlabeled data are selected based on a fixed confidence threshold, resulting in low performance on minority classes. In this paper, we develop a simple yet effective framework, which only involves adaptive thresholding for different classes in SSL algorithms, and achieves remarkable performance improvement on more than twenty imbalance ratios. Specifically, we explicitly optimize the number of pseudo-labels for each class in the SSL objective, so as to simultaneously obtain adaptive thresholds and minimize empirical risk. Moreover, the adaptive threshold can be obtained efficiently via a closed-form solution. Extensive experimental results demonstrate the effectiveness of our proposed algorithms.

1. Introduction

Machine learning, especially deep learning, has repeatedly been reported to achieve competitive or even better performance than human beings on certain supervised learning tasks (LeCun et al., 2015). These tasks, however, crucially rely on the availability of a large number of labeled training data. In many practical tasks, large-scale well-labeled datasets are difficult to obtain, as the acquisition of labeled data requires huge human labor and financial costs (Zhou, 2017; Li et al., 2019). On the other hand, there are usually abundant unlabeled data. Therefore, it is desirable for machine learning models to work with unlabeled data.

1National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China. Correspondence to: Yu-Feng Li.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Semi-supervised learning (SSL) is one of the most promising learning paradigms to bypass the labeling cost by leveraging an abundance of unlabeled data (Chapelle et al., 2006). In much recent work, SSL can be categorized into several main classes in terms of the use of unlabeled data, such as entropy minimization (Grandvalet & Bengio, 2005), consistency regularization (Laine & Aila, 2017; Tarvainen & Valpola, 2017; Miyato et al., 2018), pseudo-labeling (Lee, 2013), and their combinations (Berthelot et al., 2019; Sohn et al., 2020; Berthelot et al., 2020; Xu et al., 2021). Due to its capability to handle both labeled and unlabeled data, SSL has been successfully applied to various tasks such as image classification (Sohn et al., 2020), object detection (Jeong et al., 2019), semantic segmentation (Souly et al., 2017), text classification (Miyato et al., 2017), etc. It has been reported that in certain cases, such as image classification (Sohn et al., 2020), SSL methods can match the performance of purely supervised learning even when a substantial portion of the labels in a given dataset has been discarded.
All of the positive results of SSL, however, are based on a basic assumption that the class distribution is balanced in both labeled and unlabeled data, i.e., the number of examples in each class is nearly the same. Such an assumption is difficult to hold in practical applications. For example, in computer vision tasks, the frequency distribution of visual categories in our daily life is inherently imbalanced (Wang et al., 2017); in medical diagnosis tasks, a malignant lesion is rare compared to benign ones (Johnson & Khoshgoftaar, 2019), to name but a few. It is well-known that machine learning models suffer severe performance degradation with such an imbalanced class distribution (Dong et al., 2019). Unfortunately, the class imbalance issue can be more problematic for SSL algorithms since they generate pseudo-labels for unlabeled data from the model's biased predictions. Take the state-of-the-art SSL algorithm FixMatch (Sohn et al., 2020) as an example: FixMatch only uses the unlabeled examples whose prediction confidence exceeds a fixed high threshold (e.g., 0.95) in classification tasks. However, the prediction confidence is biased towards the majority classes under a class-imbalanced distribution, so adopting a fixed threshold for all classes causes the minority classes to lose too many unlabeled examples with correct pseudo-labels, resulting in low performance (see Figure 1).

Figure 1. An example of experimental results on WideResNet-28-2 for the synthetically class-imbalanced CIFAR-10 dataset. (a) Both labeled and unlabeled datasets are class-imbalanced, where the most majority class has 100 times more examples than the most minority class. (b) Recall rate (%) on balanced test data. FixMatch selects pseudo-labels if the prediction confidence is greater than 0.95 for all classes, while the proposed Adsh algorithm selects pseudo-labels based on an adaptive class-dependent threshold. The results show that our proposal improves the recall on minority classes compared to FixMatch.

That is to say, it may not be good enough for SSL algorithms to use a fixed threshold to select pseudo-labels for all classes under a class-imbalanced data distribution. Unlike previous works, we aim for an approach that adaptively adjusts the threshold for each class based on the class distribution. This inspires us to consider answering the following question in this study: Can we design an SSL algorithm that selects pseudo-labels with adaptive thresholding?

To this end, we propose a generic SSL algorithm with adaptive thresholding (Adsh) that can adaptively select pseudo-labeled examples based on a class-dependent threshold during the training process. Specifically, our high-level idea is to explicitly consider the number of pseudo-labels to be selected for every class in the SSL objective so as to obtain adaptive thresholds and minimize empirical risk simultaneously. A highly efficient closed-form solution can be derived from the optimization objective. Then, based on this solution, we obtain an adaptive thresholding technique that encodes the class-wise distribution to obtain class-dependent thresholds. The proposal is simple and effective: it only involves adaptive thresholds for different classes compared with previous SSL algorithms, e.g., FixMatch (Sohn et al., 2020). Empirical evaluations in extensive settings demonstrate the effectiveness of Adsh compared with state-of-the-art SSL algorithms.
For example, experimental results on CIFAR-10, SVHN, and STL-10 datasets with different levels of class imbalance and different numbers of labeled data consistently show the performance improvement of our proposal. We also consider class imbalance and class distribution mismatch between labeled and unlabeled data simultaneously. Experimental results in this challenging setting also show the superiority of our proposal.

2. Related Works

This work is mainly related to class-imbalanced learning and semi-supervised learning.

Class-Imbalanced Learning. Real-world datasets usually yield a class-imbalanced label distribution (Liu et al., 2019) and make the standard training of machine learning models harder to generalize (Wang et al., 2017). Various algorithms have been proposed so far to address this problem (Buda et al., 2018; Johnson & Khoshgoftaar, 2019). The most commonly adopted approach is to re-balance the training objective with respect to the class-wise sample sizes. Two such methods are representative: a) re-weighting, which influences the loss function by assigning relatively higher costs to examples from minority classes (Cao et al., 2019; Cui et al., 2019; Huang et al., 2019; Khan et al., 2019; 2017; Lin et al., 2017; Ren et al., 2018; Hu et al., 2019); b) re-sampling, which directly adjusts the label distribution by over-sampling the minority classes, under-sampling the majority classes, or both, in order to obtain a balanced sampling distribution (Chawla et al., 2002; He & Garcia, 2009; Byrd & Lipton, 2019). However, naively re-balancing the objective usually results in over-fitting to minority classes. Recently, transfer-learning based methods have also been proposed that transfer features from majority classes to under-represented minority classes (Hariharan & Girshick, 2017; Liu et al., 2019; Yin et al., 2019). Nevertheless, these methods assume all labels are available and cannot be applied to SSL scenarios directly.

Semi-Supervised Learning. SSL methods that aim to improve model performance by leveraging unlabeled data have a long history of research (Chapelle et al., 2006). Our paper is mainly related to deep SSL, which introduces SSL techniques to DNNs and has achieved significant advancement in recent years (Berthelot et al., 2019; Grandvalet & Bengio, 2005; Laine & Aila, 2017; Miyato et al., 2018; Sohn et al., 2020; Tarvainen & Valpola, 2017). Typical ways of these SSL methods include training the model to fit pseudo-labels or optimizing a well-designed objective that does not rely on labels. For example, pseudo-labeling based methods (Lee, 2013) generate pseudo-labels for unlabeled examples and train the model to predict the pseudo-labels in a supervised manner; entropy minimization based methods (Grandvalet & Bengio, 2005) encourage the model's predicted distribution to have low entropy, which does not require label information; consistency regularization based methods, e.g., Temporal Ensembling (Laine & Aila, 2017), Mean-Teacher (Tarvainen & Valpola, 2017), VAT (Miyato et al., 2018), etc., produce augmentations for unlabeled examples and optimize the consistency loss between the model output on given examples and their augmented versions. There are also holistic methods that utilize these techniques simultaneously, such as MixMatch (Berthelot et al., 2019), ReMixMatch (Berthelot et al., 2020), and FixMatch (Sohn et al., 2020).
These SSL algorithms are reported to achieve near-supervised performance on benchmark tasks. However, in some realistic scenarios, SSL methods suffer poor performance improvement (Guo et al., 2022), e.g., when the labeled and unlabeled data distributions are mismatched (Oliver et al., 2018; Guo et al., 2020a;b; Zhou et al., 2021) or when the class distribution is imbalanced (Kim et al., 2020; Wei et al., 2021; Guo et al., 2021). In this paper, we mainly focus on the class-imbalanced SSL problem.

Class-Imbalanced Semi-Supervised Learning. Recently, two representative algorithms, DARP (Distribution Aligning Refinery of Pseudo-label) (Kim et al., 2020) and CReST (Class-Rebalancing Self-Training) (Wei et al., 2021), have been proposed to address class-imbalanced semi-supervised learning. Specifically, DARP refines the raw biased pseudo-labels to match the true class distribution. However, the process needs to know the ground-truth class distribution of unlabeled data as a prior, which is evidently impossible in real tasks. To alleviate this limitation, DARP further proposes to estimate the class distribution by assuming that the confusion matrices on labeled data and unlabeled data are the same. Unfortunately, this assumption is also inappropriate since the trained model tends to overfit the small labeled dataset and obtain a nearly perfect confusion matrix there, which does not generalize well to unlabeled data. CReST adopts a self-training manner that retrains the SSL model after adaptively selecting pseudo-labeled data from the unlabeled set to supplement the original labeled set. Different from the classical self-training strategy, CReST samples pseudo-labels according to the label frequency in order to progressively align the class distribution (i.e., examples are selected with higher probabilities if they are predicted as minority classes). However, CReST assumes that the class distributions of labeled data and unlabeled data are the same, which is difficult to verify since we have no idea about the true class distribution of unlabeled data. These strict assumptions limit their wider applications.

3. Preliminary and Background

This section provides the notations used in this paper and gives a brief review of SSL algorithms with a fixed threshold.

3.1. Problem Setting and Notations

For a $K$-class classification task, we are given a set of training data from an unknown distribution, which includes $N$ labeled examples $D_l = \{(x^l_1, y^l_1), \ldots, (x^l_N, y^l_N)\}$ and $M$ unlabeled examples $D_u = \{x^u_1, \ldots, x^u_M\}$, where $x \in \mathcal{X} \subseteq \mathbb{R}^d$ denotes the $d$-dimensional input feature vector and $y \in \mathcal{Y}$ is the corresponding one-hot label. The numbers of examples in class $k$ under $D_l$ and $D_u$ are denoted by $N_k$ and $M_k$, respectively, i.e., $\sum_{k=1}^{K} N_k = N$ and $\sum_{k=1}^{K} M_k = M$. Without loss of generality, we assume that the classes are sorted in descending order, i.e., $N_1 \geq N_2 \geq \cdots \geq N_K$ and $M_1 \geq M_2 \geq \cdots \geq M_K$. We measure the degree of class imbalance by the imbalance ratio, defined as $\gamma_l = N_1 / N_K$ and $\gamma_u = M_1 / M_K$ for labeled and unlabeled data, respectively. $\gamma_l$ and $\gamma_u$ can be much larger than 1, and it is noteworthy that they are usually not the same in practical tasks. The goal is to learn a model $f(x; \theta): \mathcal{X} \to \mathcal{Y}$ that generalizes well under a class-balanced test criterion, where $\theta$ is the model parameter. The training loss of an SSL algorithm usually contains a supervised loss $\mathcal{L}_s$ and an unsupervised loss $\mathcal{L}_u$ with a trade-off parameter $\lambda_u > 0$: $\mathcal{L}_s + \lambda_u \mathcal{L}_u$, where $\mathcal{L}_s$ is constructed on $D_l$ and $\mathcal{L}_u$ is constructed on $D_u$.
Typically, $\mathcal{L}_s$ applies the standard cross-entropy loss to labeled examples:

$$\mathcal{L}_s = \frac{1}{N}\sum_{i=1}^{N} H(y_i, f(y|x_i; \theta)), \quad H(y_i, f(y|x_i; \theta)) = -\sum_{k=1}^{K} y_{i,k} \log f(y=k|x_i; \theta) \tag{1}$$

where $f(y|x; \theta) \in [0, 1]^K$ is the vector of predicted probabilities produced by the model $f$ with parameter $\theta$ for the input $x$, and $H(\cdot, \cdot)$ is the cross-entropy function. Different constructions of the unsupervised loss $\mathcal{L}_u$ lead to different SSL methods. Typically, there are two ways of constructing $\mathcal{L}_u$: one is to use pseudo-labels to formulate a "supervised loss" such as the cross-entropy loss (e.g., FixMatch (Sohn et al., 2020)), and another is to optimize a regularization term that does not depend on labels, such as consistency regularization (e.g., UDA (Xie et al., 2020)). Next, we introduce a recent SSL work to illustrate how to generate pseudo-labels and construct the unsupervised loss $\mathcal{L}_u$.

3.2. FixMatch: An SSL Algorithm with Fixed Thresholding

Due to its simplicity yet empirical success, we select FixMatch (Sohn et al., 2020) as the SSL example in this subsection. Moreover, we consider FixMatch as a warm-up for the proposed algorithm: since FixMatch uses a fixed threshold to select unlabeled examples, it will serve as a comparison with the proposed algorithm. FixMatch applies weak and strong augmentations to unlabeled examples and generates pseudo-labels using the model's predictions on weakly augmented unlabeled examples. A pseudo-label is only retained if the model produces a high-confidence prediction. The model is then trained to predict the pseudo-label when fed a strongly augmented version of the same example. Specifically, we are given a batch of $B$ labeled examples $\{(x^l_b, y^l_b) : b \in (1, \ldots, B)\}$ and a batch of $\mu B$ unlabeled examples $\{x^u_b : b \in (1, \ldots, \mu B)\}$, where $\mu$ determines the relative batch size of labeled and unlabeled data. For unlabeled data, FixMatch tries to generate pseudo-labels via the model's predictions. FixMatch first predicts the class distribution given a weakly augmented version of an unlabeled example:

$$q_b = f(y|\alpha(x^u_b); \theta) \tag{2}$$

where $\alpha(\cdot)$ is the weak augmentation. Then, it creates a pseudo-label by

$$\hat{y}^u_b = \arg\max(q_b) \tag{3}$$

Following (Sohn et al., 2020), the argmax applied to a probability distribution produces a one-hot probability distribution. To construct the unsupervised loss, FixMatch computes the model prediction for a strong augmentation $A$ of the same unlabeled example $x^u_b$:

$$f(y|A(x^u_b); \theta) \tag{4}$$

The unsupervised loss is defined as the cross-entropy between $\hat{y}^u_b$ and $f(y|A(x^u_b); \theta)$:

$$H(\hat{y}^u_b, f(y|A(x^u_b); \theta)) \tag{5}$$

Eventually, FixMatch only uses the unlabeled examples with a high-confidence prediction, selected based on a fixed threshold $\tau = 0.95$ for all classes. Therefore, in FixMatch, the unsupervised loss with cross-entropy and confidence thresholding is defined as:

$$\mathcal{L}_u = \frac{1}{\mu B}\sum_{b=1}^{\mu B} \mathbb{I}(\max(q_b) \geq \tau)\, H(\hat{y}^u_b, f(y|A(x^u_b); \theta)) \tag{6}$$

where $\mathbb{I}(\cdot)$ is an indicator function. As discussed in the introduction, with class-imbalanced training data, adopting a fixed threshold for all classes may eliminate too many unlabeled examples with correct pseudo-labels in the minority classes (see Figure 1), resulting in low recall rates on minority classes and eventually dragging down the overall performance. It is natural to think that the threshold should be class-dependent and adaptive to the class distribution rather than fixed for all classes. Therefore, in the next section, we propose a new SSL scheme with adaptive thresholds for different classes.
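Before turning to the adaptive scheme, Eq. (6) can be summarized in code. Below is a minimal PyTorch-style sketch of the fixed-threshold unsupervised loss; the function name `fixmatch_unsup_loss` and the assumption that `model` returns class logits are ours for illustration, and the official FixMatch implementation differs in details such as augmentation pipelines and batching.

```python
import torch
import torch.nn.functional as F

def fixmatch_unsup_loss(model, x_weak, x_strong, tau=0.95):
    # Pseudo-labels come from the weakly augmented view, Eq. (2)-(3).
    with torch.no_grad():
        q = torch.softmax(model(x_weak), dim=-1)   # q_b
        conf, pseudo = q.max(dim=-1)               # max(q_b), argmax(q_b)
        mask = (conf >= tau).float()               # I(max(q_b) >= tau), fixed tau
    # Cross-entropy on the strongly augmented view, Eq. (5)-(6).
    logits_strong = model(x_strong)                # f(y | A(x_b^u); theta)
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (mask * loss).mean()                    # average over the mu*B examples
```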
4. Adsh: An SSL Algorithm with Adaptive Thresholding

We now turn to the framework we propose in this paper: Adsh, an SSL algorithm whose thresholds can be adaptively adjusted for different classes. The detailed procedure is presented in Algorithm 1.

Algorithm 1 Adsh Algorithm.
Input: Labeled data $D_l$, unlabeled data $D_u$, number of classes $K$, number of epochs $E$, number of iterations $T$, unlabeled loss weight $\lambda_u$, unlabeled data ratio $\mu$, class bias $s \in \mathbb{R}^K$, model parameter $\theta_0$.
1: $t = 0$
2: for $e = 1$ to $E$ do
3:   for $iter = 1$ to $T/E$ do
4:     Sample $\{(x^l_b, y^l_b) : b \in (1, \ldots, B)\}$ from $D_l$.
5:     Sample $\{x^u_b : b \in (1, \ldots, \mu B)\}$ from $D_u$.
6:     $\mathcal{L}_s = \frac{1}{B}\sum_{b=1}^{B} H(y^l_b, f(y|\alpha(x^l_b); \theta_t))$  // Cross-entropy loss on labeled examples
7:     for $b = 1$ to $\mu B$ do
8:       $q_b = f(y|\alpha(x^u_b); \theta_t)$  // Predicted probability distribution
9:       $\hat{y}^u_b = \arg\max(q_b)$  // Pseudo-label for $x^u_b$
10:      $H_b = H(\hat{y}^u_b, f(y|A(x^u_b); \theta_t))$  // Cross-entropy loss on pseudo-labeled examples
11:    end for
12:    for $k = 1$ to $K$ do
13:      $\tau_k = \exp(-s_k)$  // Class-dependent adaptive thresholds
14:    end for
15:    $\mathcal{L}_u = \frac{1}{\mu B}\sum_{b=1}^{\mu B} \mathbb{I}(\max(q_b) \geq \tau_{\hat{y}^u_b})\, H_b$
16:    $\mathcal{L} = \mathcal{L}_s + \lambda_u \mathcal{L}_u$
17:    $\theta_{t+1} = \mathrm{OptimizationStep}(\theta_t, \mathcal{L})$  // Update model parameter via gradient methods, e.g., SGD
18:    $t = t + 1$
19:  end for
20:  Update $s$ via Algorithm 2 or Eq. (11)
21: end for
22: return $\theta_T$.

To alleviate the drawbacks of fixed thresholding on class-imbalanced datasets, we propose to select pseudo-labels via class-dependent thresholds that adaptively change for each class. Specifically, we formulate the SSL objective as an optimization problem that explicitly encodes the number of pseudo-labels to be selected for each class:

$$\min_{\hat{y}, s, \theta} \; \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} -y_{i,k} \log f(y=k|\alpha(x^l_i); \theta) + \frac{\lambda_u}{M}\sum_{i=1}^{M}\sum_{k=1}^{K} \left[-\hat{y}_{i,k} \log f(y=k|\alpha(x^u_i); \theta) - s_k \hat{y}_{i,k}\right] \tag{7}$$

$$\text{s.t.} \quad \hat{y}_i = [\hat{y}_{i,1}, \ldots, \hat{y}_{i,K}] \in \{0, 1\}^K, \quad 0 \leq \mathbf{1}^\top \hat{y}_i \leq 1, \quad s_k > 0, \; 1 \leq k \leq K$$

where $\hat{y} \in \mathbb{R}^{M \times K}$ is the pseudo-label matrix for unlabeled examples and $\hat{y}_i$ is the pseudo-label vector for example $x^u_i$.

Algorithm 2 Algorithm for Computing $s$.
Input: Model parameter $\theta$, unlabeled data $D_u = \{x^u_i\}_{i=1}^{M}$, number of classes $K$, user-defined threshold $\tau_1$ for the most majority class.
1: Initialize an array $C$ with $K$ rows to save model prediction confidences
2: for $x^u$ in $D_u$ do
3:   $q = f(y|\alpha(x^u); \theta)$  // Prediction confidence for the unlabeled example
4:   $\hat{y} = \arg\max(q)$  // Predicted pseudo-label
5:   $C_{\hat{y}}.\mathrm{Append}(\max(q))$  // Save the maximum probability for each example
6: end for
7: $\rho = 1.0$
8: Sort $C_k$ in descending order for $1 \leq k \leq K$
9: for $len = 1$ to $\mathrm{length}(C_1)$ do
10:  if $C_1[len] < \tau_1$ then
11:    break
12:  end if
13:  $\rho = \frac{len}{\mathrm{length}(C_1)} \times 100\%$  // Percentage of selected pseudo-labels for the most majority class
14: end for
15: for $k = 1$ to $K$ do
16:  $s_k = -\log(C_k[\mathrm{length}(C_k) \cdot \rho])$  // Determine $s_k$ for the other classes
17: end for
18: return $s$.

$\hat{y}_i$ is required to be either a discrete one-hot vector or a zero vector; assigning $\hat{y}_i = \mathbf{0}$ means ignoring this pseudo-label in model training. $s_k$ introduces different levels of class-wise bias for pseudo-label selection; a larger $s_k$ indicates that a larger number of pseudo-labeled examples will be selected for class $k$. Eq. (7) shows that, on the one hand, the pseudo-label $\hat{y}$ should be consistent with the model prediction; on the other hand, the number of selected pseudo-labels is controlled by $s_k$ explicitly for each class $k$, rather than by a fixed threshold $\tau$.
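As a reading aid, here is a NumPy transcription of Algorithm 2, assuming the model's confidences and argmax predictions on the unlabeled set have already been collected into flat arrays. The function name `compute_s` and the fallback for classes with no predicted examples are our illustrative additions, not part of the paper's pseudocode.

```python
import numpy as np

def compute_s(confidences, pseudo_labels, num_classes, tau1=0.95):
    """Derive class-wise biases s so that tau_k = exp(-s_k) (Algorithm 2).

    confidences  : max predicted probability per unlabeled example
    pseudo_labels: argmax predicted class per unlabeled example
    tau1         : user-defined threshold for the most majority class (index 0)
    """
    # Steps 1-8: group confidences by predicted class, sorted descending.
    C = [np.sort(confidences[pseudo_labels == k])[::-1]
         for k in range(num_classes)]
    # Steps 9-14: fraction of class-1 predictions whose confidence reaches tau1.
    rho = float(np.mean(C[0] >= tau1)) if len(C[0]) > 0 else 1.0
    # Steps 15-17: threshold each class at the confidence ranked at the same
    # within-class percentile, i.e., s_k = -log(C_k[rho * length(C_k)]).
    s = np.full(num_classes, -np.log(tau1))  # fallback for empty classes (our choice)
    for k in range(num_classes):
        if len(C[k]) == 0:
            continue
        idx = int(np.clip(np.ceil(rho * len(C[k])) - 1, 0, len(C[k]) - 1))
        s[k] = -np.log(C[k][idx])
    return s
```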
Similar ideas of controlling the number of selected examples have also been applied to other machine learning problems, e.g., domain adaptation (Zou et al., 2018) and curriculum learning (Zou et al., 2019). Different from these works, our paper pays attention to class-imbalanced semi-supervised learning and presents a general scheme for pseudo-label selection, which is an important part of SSL algorithms. This sheds new light on how to apply SSL to more realistic and challenging scenarios.

Eq. (7) can be optimized alternately: first, solving $\hat{y}$ and $s$ given a fixed $\theta$; then, optimizing $\theta$ in a supervised manner by leveraging the pseudo-labels $\hat{y}$.

Solving $\hat{y}$ and $s$ given a fixed $\theta$. If the model parameter $\theta$ is fixed, we have the following theorem to guarantee the solution of $\hat{y}$.

Theorem 4.1. Given a learning model $f(x; \theta)$, the pseudo-label $\hat{y}$ in Eq. (7) has the closed-form solution:

$$\hat{y}_{i,k} = \begin{cases} 1, & \text{if } k = \arg\max_{k'} \dfrac{f(y=k'|\alpha(x^u_i); \theta)}{\exp(-s_{k'})} \text{ and } \dfrac{f(y=k|\alpha(x^u_i); \theta)}{\exp(-s_k)} \geq 1, \\ 0, & \text{otherwise.} \end{cases} \tag{8}$$

Theorem 4.1 implies that the pseudo-label $\hat{y}$ depends on both the model predictions and $s_k$. Moreover, we can show that under certain conditions, Eq. (8) gives a class-dependent adaptive threshold.

Lemma 4.2. If $\exp(s_k - s_{k'}) > \frac{f(y=k'|\alpha(x^u_i); \theta)}{f(y=k|\alpha(x^u_i); \theta)}$ holds for all $k$ and $k'$ that satisfy $f(y=k|\alpha(x^u_i); \theta) > f(y=k'|\alpha(x^u_i); \theta)$, then we have: $\arg\max_k \frac{f(y=k|\alpha(x^u_i); \theta)}{\exp(-s_k)} = \arg\max_k f(y=k|\alpha(x^u_i); \theta)$.

It is noteworthy that the condition in the above lemma is easy to satisfy since the model is prone to be over-confident (Thulasidasan et al., 2019); thus, $f(y=k'|\alpha(x^u_i); \theta) / f(y=k|\alpha(x^u_i); \theta)$ is relatively small in real tasks. The above analysis shows that instead of selecting pseudo-labels based on the original prediction confidence and a fixed threshold $\tau$, we select the pseudo-label for an unlabeled example $x^u_i$ that is predicted as $\hat{y}^u_i$ with

$$\mathbb{I}(\max(q_i) \geq \exp(-s_{\hat{y}^u_i})) \tag{9}$$

where $q_i = f(y|\alpha(x^u_i); \theta)$ and $\hat{y}^u_i = \arg\max(q_i)$. If the ground-truth class distribution of unlabeled data is known, we can solve for $s_k$ to make the pseudo-labels $\hat{y}$ have the same class distribution as the ground-truth $y^*$, i.e.,

$$\frac{\sum_{i=1}^{M} \hat{y}_{i,k}}{\sum_{i=1}^{M} \hat{y}_{i,k'}} = \frac{\sum_{i=1}^{M} y^*_{i,k}}{\sum_{i=1}^{M} y^*_{i,k'}}, \quad \forall k, k' \in \{1, \ldots, K\} \tag{10}$$

Specifically, we can first set $s_1$ using a user-defined hyper-parameter, e.g., $s_1 = -\log(0.95)$; then, $s_k$ for $2 \leq k \leq K$ can be computed from:

$$\sum_{i=1}^{M} \mathbb{I}(f(y=k|\alpha(x^u_i); \theta) \geq \exp(-s_k)) = \frac{\sum_{i=1}^{M} \mathbb{I}(f(y=1|\alpha(x^u_i); \theta) \geq \exp(-s_1))}{\gamma_k} \tag{11}$$

where $\gamma_k = \frac{\sum_{i=1}^{M} y^*_{i,1}}{\sum_{i=1}^{M} y^*_{i,k}}$ indicates the imbalance ratio between class 1 and class $k$.
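A small sketch of how Eq. (11) can be solved in practice, assuming the imbalance ratios $\gamma_k$ are known: since the left-hand side is monotone in $s_k$, one can pick $\exp(-s_k)$ as the confidence ranked at the target count. The function below and its ranking-based solution are our illustration; the paper only states the counting condition itself.

```python
import numpy as np

def solve_s_known_distribution(probs, gamma, s1=-np.log(0.95)):
    """Solve Eq. (11) for s given known imbalance ratios.

    probs : (M, K) array of predicted probabilities f(y=k | alpha(x_i); theta)
    gamma : length-K array with gamma[k] = ratio between class 1 and class k+1
            (so gamma[0] = 1)
    """
    M, K = probs.shape
    # Right-hand side numerator: number of class-1 selections under s_1.
    n1 = int(np.sum(probs[:, 0] >= np.exp(-s1)))
    s = np.empty(K)
    s[0] = s1
    for k in range(1, K):
        target = max(int(round(n1 / gamma[k])), 1)   # desired count for class k
        ranked = np.sort(probs[:, k])[::-1]          # confidences, descending
        threshold = ranked[min(target, M) - 1]       # confidence at rank `target`
        s[k] = -np.log(threshold)                    # exp(-s_k) selects ~target examples
    return s
```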
Table 1. Comparison of classification performance (Accuracy (%)) on the imbalanced CIFAR-10 dataset under three different imbalance ratios, γ = 50, 100, 150, and two different numbers of labeled data: N1 = 1500, M1 = 3000 (left block) and N1 = 500, M1 = 4000 (right block). The best results are indicated in bold.

| Algorithm | γ = 50 | γ = 100 | γ = 150 | γ = 50 | γ = 100 | γ = 150 |
|---|---|---|---|---|---|---|
| Supervised | 65.23 ± 0.05 | 58.94 ± 0.13 | 55.63 ± 0.38 | 51.31 ± 0.34 | 45.82 ± 0.41 | 40.90 ± 0.39 |
| CBL | 65.52 ± 0.31 | 58.52 ± 0.45 | 52.36 ± 0.58 | 51.94 ± 0.71 | 46.22 ± 0.92 | 41.58 ± 1.24 |
| Re-Sampling | 64.53 ± 0.39 | 56.34 ± 0.42 | 53.21 ± 0.51 | 51.96 ± 0.65 | 48.13 ± 1.25 | 40.26 ± 1.88 |
| cRT | 67.82 ± 0.14 | 63.43 ± 0.45 | 59.56 ± 0.44 | 56.28 ± 1.45 | 48.11 ± 0.79 | 45.02 ± 1.08 |
| LDAM | 68.91 ± 0.10 | 63.15 ± 0.24 | 58.68 ± 0.30 | 56.41 ± 0.92 | 49.27 ± 0.88 | 45.10 ± 0.75 |
| Mean-Teacher | 68.84 ± 0.82 | 61.33 ± 0.28 | 54.79 ± 0.31 | 56.34 ± 1.68 | 48.55 ± 0.77 | 45.32 ± 1.20 |
| MixMatch | 73.59 ± 0.46 | 65.03 ± 0.26 | 62.71 ± 0.29 | 65.32 ± 1.20 | 56.41 ± 1.96 | 52.38 ± 1.88 |
| ReMixMatch | 78.96 ± 0.29 | 72.88 ± 0.12 | 68.61 ± 0.40 | 76.83 ± 0.98 | 70.12 ± 1.23 | 59.58 ± 1.30 |
| FixMatch | 79.10 ± 0.14 | 71.50 ± 0.31 | 68.47 ± 0.15 | 77.34 ± 0.96 | 68.45 ± 0.94 | 60.10 ± 0.82 |
| DARP | 81.60 ± 0.31 | 75.23 ± 0.14 | 69.31 ± 0.26 | 76.72 ± 0.46 | 69.41 ± 0.50 | 61.23 ± 0.31 |
| CReST | 82.03 ± 0.26 | 75.08 ± 0.41 | 69.84 ± 0.39 | 76.18 ± 0.36 | 69.50 ± 0.70 | 60.81 ± 0.55 |
| Adsh | **83.38 ± 0.06** | **76.52 ± 0.35** | **71.49 ± 0.30** | **79.27 ± 0.38** | **70.97 ± 0.46** | **62.04 ± 0.51** |

However, in many realistic scenarios the class distribution is unknown. In this case we present a simple and effective alternative strategy to determine $s$ without introducing additional hyper-parameters. The full procedure is presented in Algorithm 2. Specifically, the algorithm exploits the class-wise confidence threshold by ranking all the probabilities predicted as class $k$ in descending order and setting $s_k$ such that $\exp(-s_k)$ equals the predicted probability ranked at $\rho \cdot \mathrm{length}(C_k)$, where $\mathrm{length}(C_k)$ is the number of unlabeled examples predicted as class $k$ and $\rho \times 100\%$ denotes the percentage of selected confident pseudo-labels. Such a strategy takes the predicted probability ranked at the $\rho \times 100\%$ position separately within each class as a reference for thresholding. The proportion $\rho$ is computed from the majority class, i.e., class 1. Specifically, we set $s_1$ using a user-defined hyper-parameter (e.g., $\tau_1 = 0.95$); then the proportion of pseudo-labels selected for class 1 is determined as:

$$\rho = \frac{\sum_{i=1}^{M} \mathbb{I}(f(y=1|\alpha(x^u_i); \theta) \geq \tau_1)}{\mathrm{length}(C_1)}$$

where $\mathrm{length}(C_1) = \sum_{i=1}^{M} \mathbb{I}(\arg\max(f(y|\alpha(x^u_i); \theta)) = 1)$. This ensures that pseudo-labels with the same within-class confidence level can be selected for every class.

Solving $\theta$ given fixed $\hat{y}$ and $s$. With the pseudo-labels $\hat{y}$, we can solve for $\theta$ in a supervised manner. As in previous SSL methods, we optimize $\theta$ using the SGD algorithm, in which the unsupervised loss $\mathcal{L}_u$ is given by

$$\mathcal{L}_u = \frac{1}{\mu B}\sum_{b=1}^{\mu B} \mathbb{I}(\max(q_b) \geq \tau_{\hat{y}^u_b})\, H(\hat{y}^u_b, f(y|A(x^u_b); \theta))$$

with $\tau_k = \exp(-s_k)$. The above unsupervised loss implies that pseudo-label selection does not depend on a fixed threshold. Instead, it depends on a threshold that adaptively changes for different classes. Selecting pseudo-labels with adaptive thresholding gives the advantage of selecting examples that have relatively low confidence but high within-class confidence, and thus helps alleviate the bias of the original predictions under a class-imbalanced distribution.
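For contrast with the FixMatch sketch in Section 3.2, here is the same loss with class-dependent thresholds; again a sketch under our own naming, where `s` is taken to be the output of Algorithm 2.

```python
import torch
import torch.nn.functional as F

def adsh_unsup_loss(model, x_weak, x_strong, s):
    """Unsupervised loss with class-dependent thresholds tau_k = exp(-s_k).

    s: length-K tensor of class biases, e.g., from Algorithm 2.
    """
    with torch.no_grad():
        q = torch.softmax(model(x_weak), dim=-1)
        conf, pseudo = q.max(dim=-1)
        tau = torch.exp(-s)                       # per-class adaptive thresholds
        mask = (conf >= tau[pseudo]).float()      # I(max(q_b) >= tau_{y_hat_b})
    logits_strong = model(x_strong)
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (mask * loss).mean()
```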
5. Experiments

In this section, we give comprehensive evaluations on various class-imbalanced SSL scenarios. We first describe the experimental setups in Section 5.1. Then, we present empirical results of our proposal and other compared methods under extensive setups in Section 5.2. Finally, we present detailed analyses to help understand the superiority of our proposal in Section 5.3.

5.1. Experimental Setup

Imbalanced Datasets. We conduct experiments on long-tailed variants of the CIFAR-10 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011), and STL-10 (Coates et al., 2011) datasets with various levels of class imbalance and different ratios of labeled data. These are all widely adopted datasets for evaluating SSL algorithms. For constructing the class-imbalanced training dataset, we use two parameters $\gamma_l$, $\gamma_u$ to denote the imbalance ratio of labeled and unlabeled data, i.e., $\gamma_l = N_1 / N_K$, $\gamma_u = M_1 / M_K$. Once $\gamma_l$, $\gamma_u$ and $N_1$, $M_1$ are given, we set $N_k = N_1 \cdot \gamma_l^{-\frac{k-1}{K-1}}$ and $M_k = M_1 \cdot \gamma_u^{-\frac{k-1}{K-1}}$ for $1 < k \leq K$ (a sketch of the resulting per-class counts is given after the Figure 2 caption below). Specifically, we consider two different numbers of labeled examples, i.e., $N_1 = 500, M_1 = 4000$ and $N_1 = 1500, M_1 = 3000$, and various imbalance ratios, i.e., $\gamma_l$ and $\gamma_u$ come from combinations of $[1, 50, 100, 150]$. The test set remains untouched and balanced, so accuracy is adopted as the evaluation criterion.

Figure 2. Comparison results of classification performance on CIFAR-10 with 12 different imbalance ratios, i.e., $\gamma_l \in [50, 100, 150]$, $\gamma_u \in [1, 50, 100, 150]$, and 2 different numbers of labeled examples, i.e., $N_1 = 1500$ (panels a-c) and $N_1 = 500$ (panels d-f).
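The promised sketch of the construction formula, in plain Python; the helper name `class_sizes` is ours, and the counts use simple truncation, which may differ by one example from the authors' exact rounding.

```python
def class_sizes(n1, gamma, num_classes=10):
    """Per-class counts for a long-tailed split: N_k = N_1 * gamma^(-(k-1)/(K-1))."""
    return [int(n1 * gamma ** (-(k - 1) / (num_classes - 1)))
            for k in range(1, num_classes + 1)]

# Labeled CIFAR-10 split with N1 = 1500 and gamma_l = 100:
print(class_sizes(1500, 100))
# -> [1500, 899, 539, 323, 193, 116, 69, 41, 25, 15]
```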
Compared Methods. We compare our Adsh with many methods, including class-imbalanced learning methods, SSL methods, and recently proposed class-imbalanced SSL methods. Specifically, for class-imbalanced learning, we consider a wide range of methods, including: a) Class-Balanced Loss (CBL) (Cui et al., 2019), a representative re-weighting strategy where labeled examples are re-weighted according to the inverse of the effective number of examples in each class; b) Re-Sampling (Byrd & Lipton, 2019), a typical re-sampling strategy where each labeled example is sampled with probability proportional to the inverse sample size of its class; c) classifier Re-Training (cRT) (Kang et al., 2020), which retrains the classifier with a balancing objective after training the whole network to learn a representation under the imbalanced distribution; d) Label-Distribution-Aware Margin (LDAM) (Cao et al., 2019), which imposes a larger margin on minority classes during training and balances the objective at the later stage of training. We also evaluate several classic SSL algorithms, including: a) Mean-Teacher (Tarvainen & Valpola, 2017), which adds a consistency regularization between the prediction of the current model and the ensemble of the models from previous training epochs; b) MixMatch (Berthelot et al., 2019), a holistic SSL method that adopts both pseudo-labeling and consistency regularization with Mixup augmentations; c) ReMixMatch (Berthelot et al., 2020), which further improves MixMatch by adding augmentation anchoring and distribution alignment; d) FixMatch (Sohn et al., 2020), reported as the best performing SSL method, which generates pseudo-labels from weakly augmented data and applies them to strongly augmented data.

To further show the efficacy of our proposal, we also compare with recently proposed algorithms that consider SSL and class imbalance simultaneously, including: a) Distribution Aligning Refinery of Pseudo-label (DARP) (Kim et al., 2020), which refines the pseudo-labels generated by the SSL model to match the ground-truth class distribution of unlabeled data; b) Class-Rebalancing Self-Training (CReST) (Wei et al., 2021), a self-training based strategy that selects pseudo-labels according to the inverse of the label frequency and aligns the distribution progressively.

Implementation Details. In all experiments, we adopt WideResNet-28-2 (Zagoruyko & Komodakis, 2016) as the backbone since it is commonly adopted by various SSL methods (Oliver et al., 2018). We train the model with batch size 64 for $2^{18}$ iterations. We adopt the Adam (Kingma & Ba, 2015) optimizer with a learning rate of $2 \times 10^{-3}$. Following (Sohn et al., 2020) and (Kim et al., 2020), the exponential moving average (EMA) technique is applied with a decay rate of 0.999. For all algorithms, we evaluate the model on the test dataset every 512 iterations and record the average test accuracy of the last 20 evaluations, following (Kim et al., 2020). Mean ± std accuracy over five random runs is reported. More details on the implementation are presented in the supplementary material.

Figure 3. Detailed analyses of Adsh. (a) and (b): Confusion matrices on unlabeled data produced by FixMatch (left) and Adsh (right); (c): Performance robustness as the hyper-parameter $\tau_1$ changes.

5.2. Empirical Results

We first evaluate Adsh against the compared methods on the CIFAR-10 dataset under various levels of imbalance ratio and different numbers of labeled examples. In particular, we study two situations: $\gamma_l = \gamma_u$ and $\gamma_l \neq \gamma_u$.

Results on CIFAR-10 with $\gamma_l = \gamma_u$. We first conduct experiments in the case that $\gamma := \gamma_l = \gamma_u$, the most natural scenario in which labeled and unlabeled data have the same distribution. Table 1 summarizes the performance of our Adsh and the compared methods. From the results, we observe that in most cases SSL methods perform better than class-imbalanced learning methods since they use more unlabeled training data. DARP and CReST achieve good performance among the compared methods since they consider both unlabeled data exploitation and the imbalanced distribution. It is noticeable that our proposal Adsh consistently achieves the best performance in all settings with various imbalance ratios and different numbers of labeled examples.

Results on CIFAR-10 with $\gamma_l \neq \gamma_u$. $\gamma_l \neq \gamma_u$ brings new challenges since the distributions of labeled and unlabeled data are mismatched. We conduct experiments on 24 settings with different imbalance ratios $\gamma_l$, $\gamma_u$ and different numbers of labeled examples. We report the results of the competitive methods FixMatch, DARP, and Adsh. CReST is omitted since it cannot be applied to the mismatched distribution. The results are summarized in Figure 2. An interesting observation is that for a fixed $\gamma_l$, all three methods suffer performance degradation when $\gamma_u = 1$, even though this is the most balanced unlabeled dataset. One possible reason is that the extent of distribution mismatch prevents performance improvement.
The results in Figure 2 show that our Adsh performs better than the DARP and FixMatch methods in almost all settings, while DARP performs even worse than FixMatch in some cases.

Table 2. Comparison of classification performance (Accuracy (%)) on the imbalanced SVHN dataset with $\gamma = \gamma_l = \gamma_u = 100$, and on the STL-10 dataset with $\gamma_l = 10$ or $20$ and unknown $\gamma_u$. The best results are indicated in bold.

| Algorithm | SVHN, γ = 100 | STL-10, γl = 10 | STL-10, γl = 20 |
|---|---|---|---|
| ReMixMatch | 88.91 ± 0.32 | 67.43 ± 0.43 | 60.82 ± 0.93 |
| FixMatch | 89.34 ± 0.20 | 73.25 ± 0.21 | 63.54 ± 0.21 |
| DARP | 90.15 ± 0.46 | 76.97 ± 0.45 | 68.87 ± 0.66 |
| CReST | 89.90 ± 0.64 | 76.30 ± 0.38 | 69.43 ± 0.89 |
| Adsh | **92.13 ± 0.39** | **79.25 ± 0.41** | **71.03 ± 0.20** |

Results on SVHN and STL-10. We also present experimental results on the SVHN and STL-10 datasets. In the case of SVHN, we construct an imbalanced dataset as described in Section 5.1, in which 20% of the data are labeled and $\gamma_l = \gamma_u = 100$. For STL-10, we construct a long-tailed variant of the labeled dataset with $N_1 = 450$ and $\gamma_l \in \{10, 20\}$. We fully use the unlabeled data in STL-10 with $M = 100{,}000$, whose class distribution is imbalanced but whose imbalance ratio $\gamma_u$ is unknown. Therefore, in the case of STL-10, the labeled and unlabeled datasets may not have the same class distribution, i.e., $\gamma_l \neq \gamma_u$. Table 2 summarizes the learning performance on the SVHN and STL-10 datasets. Since the simple class-imbalanced learning methods perform significantly worse than the SOTA SSL methods and class-imbalanced SSL methods, we omit their results. From the results, we can see that our proposal consistently improves the performance on both SVHN and STL-10.

5.3. Detailed Analyses

Quality of pseudo-labels. We evaluate Adsh by measuring the confusion matrix on unlabeled data to show that Adsh improves the quality of pseudo-labels. Figure 3(a) and Figure 3(b) visualize the confusion matrices of pseudo-labels using models trained on CIFAR-10 with $\gamma_l = \gamma_u = 100$, $N_1 = 1500$, $M_1 = 3000$. The results show that the raw pseudo-labels generated by FixMatch are biased towards the majority classes; for example, more than 30% of the examples belonging to class 9 are wrongly predicted as class 1. On the contrary, our proposal achieves a less biased confusion matrix. These results indicate that the quality of pseudo-labels is actually improved, which helps to improve the generalization performance.

Hyper-Parameter Sensitivity. We also study the sensitivity of Adsh to different values of the hyper-parameter $\tau_1$. The results for a model trained on the CIFAR-10 dataset with $N_1 = 1500$, $M_1 = 3000$, $\gamma_l = \gamma_u = 100$ are presented in Figure 3(c). When $\tau_1$ is set to 0.96, the model achieves the best performance, while changing it to other values does not hurt much. These results show that our proposal Adsh is robust to the hyper-parameter selection. Based on these results, to use the Adsh approach we suggest setting $\tau_1$ to 0.96 first, and further optimizing it over $\{0.95, 0.96, 0.97, 0.98, 0.99\}$.

6. Conclusions

In this paper, we tackle an important problem of SSL, that is, SSL in the presence of a class-imbalanced distribution. We propose a novel Adsh approach that adaptively selects pseudo-labels to train models based on a class-dependent threshold. We formulate the pseudo-label selection as an optimization objective by explicitly considering the number of pseudo-labels to be selected for each class, and derive a highly efficient closed-form solution.
The proposed Adsh method is a generic scheme that can be easily integrated with existing SSL methods. We demonstrate that adaptive class-dependent thresholding can help improve the performance of the SOTA SSL method FixMatch in extensive experiments, indicating the importance of adaptive thresholds in class-imbalanced SSL. How to construct robust SSL models in realistic scenarios has attracted great attention in recent years. Class-imbalanced SSL is a representative problem that brings robustness threats to SSL while still being understudied. Our work puts forward a promising scheme in this direction. One limitation of our scheme is that it does not have theoretical guarantees. We will put effort into this direction in future work, such as giving convergence analyses of SSL algorithms that use fixed thresholds and adaptive thresholds. The code of this paper has been released at http://www.lamda.nju.edu.cn/code_ADSH.ashx.

Acknowledgements

This research was supported by the National Science Foundation of China (62176118, 61921006) and the Huawei Cooperation Fund.

References

Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., and Raffel, C. MixMatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 5050-5060, 2019.

Berthelot, D., Carlini, N., Cubuk, E. D., Kurakin, A., Sohn, K., Zhang, H., and Raffel, C. ReMixMatch: Semi-supervised learning with distribution alignment and augmentation anchoring. In Proceedings of the 8th International Conference on Learning Representations, 2020.

Buda, M., Maki, A., and Mazurowski, M. A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249-259, 2018.

Byrd, J. and Lipton, Z. What is the effect of importance weighting in deep learning? In Proceedings of the 36th International Conference on Machine Learning, pp. 872-881, 2019.

Cao, K., Wei, C., Gaidon, A., Aréchiga, N., and Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, pp. 1565-1576, 2019.

Chapelle, O., Scholkopf, B., and Zien, A. Semi-Supervised Learning. MIT Press, 2006.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357, 2002.

Coates, A., Ng, A. Y., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 215-223, 2011.

Cui, Y., Jia, M., Lin, T.-Y., Song, Y., and Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9268-9277, 2019.

Dong, Q., Gong, S., and Zhu, X. Imbalanced deep learning by minority class incremental rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(6):1367-1381, 2019.

Grandvalet, Y. and Bengio, Y. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pp. 529-536, 2005.

Guo, L.-Z., Zhang, Z.-Y., Jiang, Y., Li, Y.-F., and Zhou, Z.-H. Safe deep semi-supervised learning for unseen-class unlabeled data. In Proceedings of the 37th International Conference on Machine Learning, pp. 3897-3906, 2020a.

Guo, L.-Z., Zhou, Z., and Li, Y.-F. RECORD: Resource constrained semi-supervised learning under distribution shift.
In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1636-1644, 2020b.

Guo, L.-Z., Zhou, Z., Shao, J.-J., Zhang, Q., Kuang, F., Li, G.-L., Liu, Z.-X., Wu, G., Ma, N., Li, Q., and Li, Y.-F. Learning from imbalanced and incomplete supervision with its application to ride-sharing liability judgment. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 487-495, 2021.

Guo, L.-Z., Zhou, Z., and Li, Y.-F. Robust deep semi-supervised learning: A brief introduction. CoRR, abs/2202.05975, 2022.

Hariharan, B. and Girshick, R. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3018-3027, 2017.

He, H. and Garcia, E. A. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263-1284, 2009.

Hu, Z., Tan, B., Salakhutdinov, R., Mitchell, T. M., and Xing, E. P. Learning data manipulation for augmentation and weighting. In Advances in Neural Information Processing Systems, pp. 15738-15749, 2019.

Huang, C., Li, Y., Loy, C. C., and Tang, X. Deep imbalanced learning for face recognition and attribute prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(11):2781-2794, 2019.

Jeong, J., Lee, S., Kim, J., and Kwak, N. Consistency-based semi-supervised learning for object detection. In Advances in Neural Information Processing Systems, pp. 10759-10768, 2019.

Johnson, J. M. and Khoshgoftaar, T. M. Survey on deep learning with class imbalance. Journal of Big Data, 6(1):1-54, 2019.

Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., and Kalantidis, Y. Decoupling representation and classifier for long-tailed recognition. In Proceedings of the 8th International Conference on Learning Representations, 2020.

Khan, S., Hayat, M., Zamir, S. W., Shen, J., and Shao, L. Striking the right balance with uncertainty. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 103-112, 2019.

Khan, S. H., Hayat, M., Bennamoun, M., Sohel, F. A., and Togneri, R. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 29(8):3573-3587, 2017.

Kim, J., Hur, Y., Park, S., Yang, E., Hwang, S. J., and Shin, J. Distribution aligning refinery of pseudo-label for imbalanced semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 14567-14579, 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical Report, 2009.

Laine, S. and Aila, T. Temporal ensembling for semi-supervised learning. In Proceedings of the 5th International Conference on Learning Representations, 2017.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436-444, 2015.

Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop on Challenges in Representation Learning, pp. 2-8, 2013.

Li, Y.-F., Guo, L.-Z., and Zhou, Z.-H. Towards safe weakly supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1):334-346, 2019.

Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp.
2980-2988, 2017.

Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., and Yu, S. X. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2537-2546, 2019.

Miyato, T., Dai, A. M., and Goodfellow, I. J. Adversarial training methods for semi-supervised text classification. In Proceedings of the 5th International Conference on Learning Representations, 2017.

Miyato, T., Maeda, S.-i., Koyama, M., and Ishii, S. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979-1993, 2018.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In Proceedings of the NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Oliver, A., Odena, A., Raffel, C. A., Cubuk, E. D., and Goodfellow, I. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pp. 3235-3246, 2018.

Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 4331-4340, 2018.

Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C., Cubuk, E. D., Kurakin, A., and Li, C.-L. FixMatch: Simplifying semi-supervised learning with consistency and confidence. In Advances in Neural Information Processing Systems, pp. 596-608, 2020.

Souly, N., Spampinato, C., and Shah, M. Semi-supervised semantic segmentation using generative adversarial network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5688-5696, 2017.

Tarvainen, A. and Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pp. 1195-1204, 2017.

Thulasidasan, S., Chennupati, G., Bilmes, J. A., Bhattacharya, T., and Michalak, S. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In Advances in Neural Information Processing Systems, pp. 13888-13899, 2019.

Wang, Y.-X., Ramanan, D., and Hebert, M. Learning to model the tail. In Advances in Neural Information Processing Systems, pp. 7032-7042, 2017.

Wei, C., Sohn, K., Mellina, C., Yuille, A. L., and Yang, F. CReST: A class-rebalancing self-training framework for imbalanced semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, 2021.

Xie, Q., Dai, Z., Hovy, E. H., Luong, T., and Le, Q. Unsupervised data augmentation for consistency training. In Advances in Neural Information Processing Systems, pp. 6256-6268, 2020.

Xu, Y., Shang, L., Ye, J., Qian, Q., Li, Y.-F., Sun, B., Li, H., and Jin, R. Dash: Semi-supervised learning with dynamic thresholding. In Proceedings of the 38th International Conference on Machine Learning, pp. 11525-11536, 2021.

Yin, X., Yu, X., Sohn, K., Liu, X., and Chandraker, M. Feature transfer learning for face recognition with under-represented data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5704-5713, 2019.

Zagoruyko, S. and Komodakis, N. Wide residual networks. In Proceedings of the British Machine Vision Conference, 2016.

Zhou, Z., Guo, L.-Z., Cheng, Z., Li, Y.-F., and Pu, S.
STEP: Out-of-distribution detection in the presence of limited in-distribution labeled data. In Advances in Neural Information Processing Systems, pp. 29168-29180, 2021.

Zhou, Z.-H. A brief introduction to weakly supervised learning. National Science Review, 5(1):44-53, 2017.

Zou, Y., Yu, Z., Kumar, B. V. K. V., and Wang, J. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision, pp. 297-313, 2018.

Zou, Y., Yu, Z., Liu, X., Kumar, B. V. K. V., and Wang, J. Confidence regularized self-training. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5981-5990, 2019.

A. Theorem Proof

Theorem A.1. The objective formulation

$$\min_{\hat{y}} \; \mathcal{L}_u = \frac{1}{M}\sum_{i=1}^{M}\sum_{k=1}^{K} \left[-\hat{y}_{i,k} \log f(y=k|\alpha(x^u_i); \theta) - s_k \hat{y}_{i,k}\right] \tag{13}$$

$$\text{subject to} \quad \hat{y}_i = [\hat{y}_{i,1}, \ldots, \hat{y}_{i,K}] \in \{0, 1\}^K, \quad 0 \leq \mathbf{1}^\top \hat{y}_i \leq 1, \quad s_k > 0, \; 1 \leq k \leq K$$

has the closed-form solution:

$$\hat{y}_{i,k} = \begin{cases} 1, & \text{if } k = \arg\max_{k'} \dfrac{f(y=k'|\alpha(x^u_i); \theta)}{\exp(-s_{k'})} \text{ and } \dfrac{f(y=k|\alpha(x^u_i); \theta)}{\exp(-s_k)} \geq 1, \\ 0, & \text{otherwise.} \end{cases} \tag{14}$$

Proof. To select the pseudo-label $\hat{y}_{i,k} = 1$ for $x^u_i$, two conditions need to be satisfied. First,

$$-\log f(y=k|\alpha(x^u_i); \theta) - s_k < -\log f(y=k'|\alpha(x^u_i); \theta) - s_{k'}$$

for all other classes $k'$. From the above inequality, we can derive that

$$\frac{f(y=k|\alpha(x^u_i); \theta)}{\exp(-s_k)} > \frac{f(y=k'|\alpha(x^u_i); \theta)}{\exp(-s_{k'})}$$

for all other classes $k'$. The second condition is

$$-\log f(y=k|\alpha(x^u_i); \theta) - s_k \leq 0,$$

from which we obtain

$$\frac{f(y=k|\alpha(x^u_i); \theta)}{\exp(-s_k)} \geq 1.$$

Therefore, the closed-form solution of our objective function is Eq. (14).
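A compact NumPy rendering of the closed-form solution in Eq. (14); the function name and the vectorized layout are our own, assuming the predicted probabilities for the whole unlabeled set are stacked in one matrix.

```python
import numpy as np

def closed_form_pseudo_labels(probs, s):
    """Eq. (14): one-hot pseudo-labels, or all-zero rows for rejected examples.

    probs: (M, K) predicted probabilities f(y=k | alpha(x_i); theta)
    s    : length-K array of positive class biases
    """
    scaled = probs / np.exp(-s)                 # f(y=k|.) / exp(-s_k)
    winners = scaled.argmax(axis=1)             # candidate class per example
    rows = np.arange(probs.shape[0])
    keep = scaled[rows, winners] >= 1.0         # second condition of Eq. (14)
    y_hat = np.zeros_like(probs)
    y_hat[rows[keep], winners[keep]] = 1.0      # one-hot only where both conditions hold
    return y_hat
```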
For DARP, we adopt the the official code and recommended parameters1. For c Re ST, we set the hyper-parameter as the original paper. Fix Match is adopted as the backbone SSL algorithm for DARP and c Re ST. For our Adsh we set τ1 = 0.95 as Fix Match and update s every 512 iterations. C. Combination of Class-Imbalanced Learning and SSL We also conduct experiments by combining the class-imbalanced learning method and SSL methods. Specifically, we examine Adsh and Fix Match by combining with the classifier re-training (c RT) algorithm (Kang et al., 2020), which is a recently introduced state-of-the-art re-balancing algorithm for the class-imbalanced dataset. These algorithms are denoted by "Fix Match + c RT" and "Adsh+ c RT", respectively. Table 3 summarized the performance of Fix Match and Adsh with/without c RT. From the results, we can observe that combining with c RT can further the performance of Adsh. Moreover, with c RT, our proposal Adsh still achieves better performance than Fix Match. Table 3. Comparison of classification performance (Accuracy (%)) on imbalanced CIFAR-10 dataset under three different class-imbalance ratio γ = γl = γu. The best results are indicated in bold. Imbalanced CIFAR-10 Algorithm γ = 50 γl = 100 γl = 150 Supervised 65.23 0.05 58.94 0.13 55.63 0.38 c RT 67.82 0.14 63.43 0.45 59.56 0.44 Fix Match 79.10 0.14 71.50 0.31 68.47 0.15 Adsh 83.38 0.06 76.52 0.35 71.49 0.30 Fix Match + c RT 84.32 0.40 78.39 0.45 73.26 0.23 Adsh + c RT 86.21 0.24 79.82 0.24 75.48 0.31 1https://github.com/bbuing9/DARP