# Cost-Sensitive Self-Training for Optimizing Non-Decomposable Metrics

Harsh Rangwani¹, Shrinivas Ramasubramanian¹, Sho Takemori², Takashi Kato², Yuhei Umeda², and R. Venkatesh Babu¹

¹Video Analytics Lab, Indian Institute of Science, Bengaluru, India
²Fujitsu Limited, Kanagawa, Japan

harshr@iisc.ac.in, shrinivas.ramasubramanian@gmail.com, takemori.sho@fujitsu.com, kato.takashi_01@fujitsu.com, umeda.yuhei@fujitsu.com, venky@iisc.ac.in

Abstract

Self-training based semi-supervised learning algorithms have enabled the learning of highly accurate deep neural networks using only a fraction of labeled data. However, the majority of work on self-training has focused on the objective of improving accuracy, whereas practical machine learning systems can have complex goals (e.g., maximizing the minimum recall across classes) that are non-decomposable in nature. In this work, we introduce the Cost-Sensitive Self-Training (CSST) framework, which generalizes self-training-based methods for optimizing non-decomposable metrics. We prove that our framework can better optimize the desired non-decomposable metric utilizing unlabeled data, under data distribution assumptions similar to those made for the analysis of self-training. Using the proposed CSST framework, we obtain practical self-training methods (for both vision and NLP tasks) for optimizing different non-decomposable metrics using deep neural networks. Our results demonstrate that CSST achieves an improvement over the state-of-the-art in the majority of cases across datasets and objectives.

1 Introduction

In recent years, semi-supervised learning algorithms are increasingly being used for training deep neural networks [3, 9, 31, 37]. These algorithms lead to accurate models by leveraging unlabeled data in addition to the limited labeled data present. For example, it is possible to obtain a model with minimal accuracy degradation (~1%) using 5% of labeled data with semi-supervised algorithms, compared to supervised models trained using 100% labeled data [31]. Hence, the development of these algorithms has resulted in a vast reduction in the requirement for expensive labeled data.

Self-training is one of the major paradigms for semi-supervised learning. It involves obtaining targets (e.g., pseudo-labels) from a network on the unlabeled data and using them to train the network further. Modern self-training methods also utilize additional regularizers that enforce prediction consistency across input transformations (e.g., adversarial perturbations [18], augmentations [36, 31], etc.), enabling them to achieve high performance using only a tiny fraction of labeled data. Currently, the enhanced variants of self-training with consistency regularization [40, 25] are among the state-of-the-art (SOTA) methods for semi-supervised learning.

Equal Contribution. Code: https://github.com/val-iisc/CostSensitiveSelfTraining

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Figure 1: We show a comparison of the SOTA CSL [20] method with self-training-based semi-supervised methods for optimizing the minimum-recall objective on the CIFAR10-LT dataset. Our proposed CSST framework produces significant gains in the desired metric by leveraging additional unlabeled data through our proposed novel weighted consistency regularizer and thresholding mechanism.
Despite the popularity of self-training methods, most works [36, 1, 31] have focused on the objective of improving prediction accuracy. However, real-world applications come with more nuanced objectives. Examples include minimizing the worst-case recall across classes [19], used in federated learning, or constraining classifier coverage for minority classes to ensure fairness [6]. These objectives are complex and cannot be expressed using a loss function on the prediction of a single input (i.e., they are non-decomposable). There has been considerable effort in optimizing non-decomposable objectives for different supervised machine learning models [20, 30]. However, as supervision can be expensive, in this work we aim to answer the question: Can we optimize non-decomposable objectives using self-training-based methods developed for semi-supervised learning?

We first demonstrate that vanilla self-training methods (e.g., FixMatch [31], UDA [36], etc.) can produce unsatisfactory results for non-decomposable metrics (Fig. 1). We then generalize the cost-sensitive loss for self-training by introducing a novel weighted consistency regularizer for a given non-decomposable metric. Further, for training neural networks, we introduce appropriate loss functions and pseudo-label selection (thresholding) mechanisms tailored to the non-decomposable metric we aim to optimize. We also prove that we can achieve better performance on the desired non-decomposable metric through our framework utilizing self-training, under assumptions on data distributions similar to those made in theoretical analyses of self-training [35]. We demonstrate the practical application by optimizing various non-decomposable metrics by plugging existing methods (e.g., FixMatch [31]) into our framework. Our framework leads to a significant average improvement in the desired metric of maximizing worst-case recall while maintaining similar accuracy (Fig. 1). In summary: a) we introduce a Cost-Sensitive Self-Training (CSST) framework for optimizing non-decomposable metrics that utilizes unlabeled data in addition to labeled data (Sec. 4); b) we provably demonstrate that our CSST framework can leverage unlabeled data to achieve better performance over the baseline on the desired non-decomposable metric (Sec. 3); c) we show that combining CSST with self-training frameworks (e.g., FixMatch [31], UDA [36]) leads to effective optimization of non-decomposable metrics, resulting in significant improvements over vanilla baselines (Sec. 5).

2 Preliminaries

2.1 Non-Decomposable Objectives and Reduction to Cost-Sensitive Learning

Table 1: Metrics defined using entries of the confusion matrix.

| Metric | Definition |
|---|---|
| Recall ($\mathrm{rec}_i[F]$) | $C_{ii}[F] \,/\, \sum_{j} C_{ij}[F]$ |
| Coverage ($\mathrm{cov}_i[F]$) | $\sum_{j} C_{ji}[F]$ |
| Precision ($\mathrm{prec}_i[F]$) | $C_{ii}[F] \,/\, \sum_{j} C_{ji}[F]$ |
| Worst-case recall | $\min_i \, C_{ii}[F] \,/\, \sum_{j} C_{ij}[F]$ |

We consider the K-class classification problem with an instance space $\mathcal{X}$ and the set of labels $\mathcal{Y} = [K]$. The data distribution on $\mathcal{X} \times [K]$ is denoted by $D$. For $i \in [K]$, we denote by $\pi_i$ the class prior $P(y = i)$. Notation commonly used across the paper is summarized in a table in the Appendix. For a classifier $F : \mathcal{X} \to [K]$, we define the confusion matrix $C[F] \in \mathbb{R}^{K \times K}$ by $C_{ij}[F] = \mathbb{E}_{(x,y)\sim D}\left[\mathbf{1}(y = i, F(x) = j)\right]$. Many metrics relevant to classification can be defined as functions of the entries of the confusion matrix, such as class coverage, recall, and accuracy, to name a few. We introduce more complex metrics, which are of practical importance in the case of imbalanced distributions [4] (Tab. 1).
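For concreteness, the following sketch (ours, not from the paper) computes the empirical confusion matrix and the metrics of Table 1 with NumPy; it assumes every class appears at least once in `y_true`.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Empirical C[i, j] = P(y = i, F(x) = j), matching C_ij[F] above."""
    C = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1.0
    return C / len(y_true)

def recall(C):      # rec_i[F] = C_ii / sum_j C_ij  (row-normalized diagonal)
    return np.diag(C) / C.sum(axis=1)

def coverage(C):    # cov_i[F] = sum_j C_ji  (fraction predicted as class i)
    return C.sum(axis=0)

def precision(C):   # prec_i[F] = C_ii / sum_j C_ji
    return np.diag(C) / C.sum(axis=0)

def worst_case_recall(C):  # min_i rec_i[F]
    return recall(C).min()
```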
Under such imbalanced distributions, a classifier often suffers low recall on tail (minority) classes. Therefore, one may want to maximize the worst-case recall:

$$\max_F \; \min_{i \in [K]} \; \mathrm{rec}_i[F].$$

Similarly, on long-tailed datasets, the tail classes suffer from low coverage, lower than their respective priors. An interesting objective in such circumstances is to maximize the mean recall subject to the coverage being within a given margin:

$$\max_F \; \frac{1}{K} \sum_{i \in [K]} \mathrm{rec}_i[F] \quad \text{s.t.} \quad \mathrm{cov}_j[F] \geq \frac{0.95}{K}, \;\; \forall j \in [K]. \tag{1}$$

Many of these metrics are non-decomposable, i.e., one cannot compute them by simply averaging scores on individual examples. Optimizing for these metrics can be regarded as an instance of cost-sensitive learning (CSL). More specifically, our focus in this work is on optimization problems that can be written as a linear combination of the entries $G_{ij}$ and $C_{ij}[F]$, where $G$ is a $K \times K$ matrix:

$$\max_F \; \sum_{i,j \in [K]} G_{ij} \, C_{ij}[F]. \tag{2}$$

The entry $G_{ij}$ represents the reward associated with predicting class $j$ when the true class is $i$. The matrix $G$ is called a gain matrix [20]. More complex non-decomposable objectives for classification can also be reduced to CSL [22, 32, 20]. For instance, the two aforementioned complex objectives can be reduced to CSL using a continuous relaxation or Lagrange multipliers, as below. Let $\Delta^{K-1} \subset \mathbb{R}^K$ be the $(K-1)$-dimensional probability simplex. Then, maximizing the minimum recall is equivalent to the saddle-point optimization problem:

$$\max_F \; \min_{\lambda \in \Delta^{K-1}} \; \sum_{i \in [K]} \lambda_i \, \frac{C_{ii}[F]}{\pi_i}. \tag{3}$$

Thus, for a fixed $\lambda$, the corresponding gain matrix is the diagonal matrix $\mathrm{diag}(G_1, \ldots, G_K)$ with $G_i = \lambda_i / \pi_i$ for $1 \leq i \leq K$. Similarly, using Lagrange multipliers $\lambda \in \mathbb{R}^K_{\geq 0}$, Eq. (1) is rewritten as a max-min optimization problem [20, Sec. 2]:

$$\max_F \; \min_{\lambda \in \mathbb{R}^K_{\geq 0}} \; \frac{1}{K} \sum_{i \in [K]} \frac{C_{ii}[F]}{\pi_i} + \sum_{j \in [K]} \lambda_j \left( \sum_{i \in [K]} C_{ij}[F] - \frac{0.95}{K} \right). \tag{4}$$

In this case, the corresponding gain matrix $G$ is given by $G_{ij} = \frac{\delta_{ij}}{K \pi_i} + \lambda_j$, where $\delta_{ij}$ is the Kronecker delta. One can solve these max-min problems by alternately updating $\lambda$ (using exponentiated gradient or projected gradient descent) and optimizing the cost-sensitive objective [20].

2.2 Loss Functions for Non-Decomposable Objectives

The cross-entropy (CE) loss is appropriate for optimizing accuracy with deep neural networks; however, learning with CE can yield low performance for cost-sensitive objectives [20]. Following [20], we introduce calibrated loss functions for a given gain matrix $G$. We let $p_m : \mathcal{X} \to \Delta^{K-1} \subset \mathbb{R}^K$ be the prediction function of a model, where $\Delta^{K-1}$ is the $(K-1)$-dimensional probability simplex. For a gain matrix $G$, the corresponding loss function is given as a combination of logit adjustment [17] and loss re-weighting [24]. We decompose the gain matrix as $G = MD$, where $D = \mathrm{diag}(G_{11}, \ldots, G_{KK})$ is a diagonal matrix with $D_{ii} > 0$ for all $i \in [K]$, and $M \in \mathbb{R}^{K \times K}$. For $y \in [K]$ and model prediction $p_m(x)$, the hybrid loss is defined as follows:

$$\ell_{\mathrm{hyb}}(y, p_m(x)) = -\sum_{i \in [K]} M_{yi} \log \left( \frac{(p_m(x))_i / D_{ii}}{\sum_{j \in [K]} (p_m(x))_j / D_{jj}} \right).$$

To make the dependence on $G$ explicit, we also denote $\ell_{\mathrm{hyb}}(y, p_m(x))$ as $\ell_{\mathrm{hyb}}(y, p_m(x); G)$. The average loss on a training sample $S$ is defined as $L_{\mathrm{hyb}}(S) = \frac{1}{|S|} \sum_{(x,y) \in S} \ell_{\mathrm{hyb}}(y, p_m(x))$. Narasimhan and Menon [20] proved that the hybrid loss is calibrated, that is, learning with $\ell_{\mathrm{hyb}}$ gives the Bayes-optimal classifier for $G$ (c.f. [20, Proposition 4], of which we provide a formal statement in Sec. N.1). If $G$ is a diagonal matrix (i.e., $M = I_K$), the hybrid loss is called the logit-adjusted (LA) loss, and $\ell_{\mathrm{hyb}}(y, p_m(x))$ is denoted by $\ell_{\mathrm{LA}}(y, p_m(x))$.
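As an illustrative reading of the hybrid loss (our sketch, not the authors' implementation), the function below evaluates $\ell_{\mathrm{hyb}}$ for a single example given a gain matrix $G$ with positive diagonal:

```python
import numpy as np

def hybrid_loss(y, probs, G):
    """l_hyb(y, p_m(x); G) for one example, using the decomposition G = M D
    with D = diag(G_11, ..., G_KK) (all diagonal entries assumed > 0)."""
    D = np.diag(G)                          # vector (G_11, ..., G_KK)
    M = G / D[np.newaxis, :]                # M_{y,i} = G_{y,i} / D_ii, so G = M D
    scaled = probs / D                      # (p_m(x))_i / D_ii
    log_q = np.log(scaled / scaled.sum())   # log of the D-adjusted softmax
    return -np.dot(M[y], log_q)             # -sum_i M_{y,i} log q_i
```

If $G$ is diagonal, $M$ is the identity and the sketch reduces to the LA loss, i.e., the negative log of the $D$-adjusted softmax at the true class.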
2.3 Consistency Regularizer for Semi-Supervised Learning

Modern self-training methods not only leverage pseudo-labels but also enforce consistent predictions of a classifier on augmented or neighboring examples [35, 18, 36, 31]. More formally, a classifier $F$ is trained so that a consistency regularizer $R(F)$ is small while a supervised loss, or a loss with respect to a pseudo-labeler, is minimized [35, 31]. Here the consistency regularizer $R(F)$ is defined as

$$R(F) = \mathbb{E}_x \left[ \mathbf{1}\left( \exists x' \text{ s.t. } F(x') \neq F(x), \text{ where } x' \text{ is a neighbor of an augmentation of } x \right) \right].$$

In existing works, consistency regularizers are considered for the optimization of accuracy. In the subsequent sections, we consider consistency regularizers for cost-sensitive objectives.

3 Cost-Sensitive Self-Training for Non-Decomposable Metrics

3.1 CSL and Weighted Error

In the case of accuracy or 0-1 error, a self-training based SSL algorithm using a consistency regularizer achieves state-of-the-art performance across a variety of datasets [31], and its effectiveness has been proved theoretically [35]. This section provides a theoretical analysis of a self-training based SSL algorithm for non-decomposable objectives by generalizing [35]. More precisely, the main result of this section (Theorem 5) states that an SSL method using a consistency regularizer improves upon a given pseudo-labeler for non-decomposable objectives. We provide all omitted proofs for theoretical results in the paper in the Appendix.

In Sec. 2, we considered non-decomposable metrics and their reduction to cost-sensitive learning objectives defined by Eq. (2) using a gain matrix. In this section, we consider an equivalent objective using the notion of weighted error. For a weight matrix $w = (w_{ij})_{1 \leq i,j \leq K}$ and a classifier $F : \mathcal{X} \to [K]$, the weighted error is defined as follows:

$$\mathrm{Err}_w(F) = \sum_{i,j \in [K]} w_{ij} \, \mathbb{E}_{x \sim P_i} \left[ \mathbf{1}(F(x) \neq j) \right],$$

where $P_i(x)$ denotes the class-conditional distribution $P(x \mid y = i)$. If $w = \mathrm{diag}(1/K, \ldots, 1/K)$, this coincides with the usual balanced error [17]. Since

$$C_{ij}[F] = \mathbb{E}_{(x,y)\sim D} \left[ \mathbf{1}(y = i, F(x) = j) \right] = P(y = i) - P(y = i) \, \mathbb{E}_{x \sim P_i} \left[ \mathbf{1}(F(x) \neq j) \right],$$

we can write:

$$G_{ij} \, C_{ij}[F] = G_{ij} \left( P(y = i) - P(y = i) \, \mathbb{E}_{x \sim P_i} \left[ \mathbf{1}(F(x) \neq j) \right] \right) = G_{ij} \left( \pi_i - \pi_i \, \mathbb{E}_{x \sim P_i} \left[ \mathbf{1}(F(x) \neq j) \right] \right).$$

Here $\pi_i$ is the class prior $P(y = i)$ for $1 \leq i \leq K$, as before. Hence the CSL objective (2), i.e., $\max_F \sum_{i,j} G_{ij} C_{ij}[F]$, is equivalent to minimizing the negative term above, i.e., $\sum_{i,j} G_{ij} \pi_i \, \mathbb{E}_{x \sim P_i}[\mathbf{1}(F(x) \neq j)]$, which is the same as $\mathrm{Err}_w(F)$ with $w_{ij} = G_{ij} \pi_i$ for $1 \leq i, j \leq K$. Hence, the notion of weighted error is equivalent to CSL, and we will also use it later for deriving loss functions. We further note that if we add to the gain matrix $G$ a matrix whose columns are all equal to some vector $c$, the maximizers of the CSL objective (2) remain the same as in the original problem (since $\sum_j C_{ij}[F] = \pi_i$ is independent of $F$). Hence, without loss of generality, we assume $w_{ij} \geq 0$. We also assume $w \neq 0$, i.e., $|w|_1 > 0$, to avoid degenerate solutions.

In the previous work [35], it is assumed that there exists a ground-truth classifier $\mathcal{X} \to [K]$ and that the supports of the distributions $\{P_i\}_{1 \leq i \leq K}$ are disjoint. However, if the supports of $\{P_i\}_{1 \leq i \leq K}$ are disjoint, a solution of the minimization problem $\min_F \mathrm{Err}_w(F)$ is independent of $w$ in some cases. More precisely, if $w = \mathrm{diag}(w_1, \ldots, w_K)$ is a diagonal matrix with $w_i > 0$ for all $i$, then the optimal classifier is given by $x \mapsto \mathrm{argmax}_{k \in [K]} w_k P_k(x)$ (this follows from [20, Proposition 1]). If the supports are disjoint, then this optimal classifier is the same as $x \mapsto \mathrm{argmax}_{k \in [K]} P_k(x)$, which coincides with the ground-truth classifier.
Therefore, unlike [35], we assume neither that the supports of the $P_i$ are disjoint nor that a ground-truth classifier exists. See Fig. 2 for an intuitive explanation.

3.2 Weighted Consistency Regularizer

For improving accuracy, consistency in prediction is equally important across the distributions $P_i$ for $1 \leq i \leq K$ [31, 35]. However, for our case of weighted error, if the entries of the $i_0$-th row of the weight matrix $w$ are larger than the other entries for some $i_0$, then the consistency of model predictions on examples drawn from the distribution $P_{i_0}$ is more important than on the other examples.

Figure 2: Using a simple example, we explain the difference in our theoretical assumptions compared to [35], which assumes the $\{P_i\}_{1 \leq i \leq K}$ have disjoint supports. We consider the case where $K = 2$, $w = \mathrm{diag}(w_1, w_2)$, and $P_1$ and $P_2$ are distributions on $\mathcal{X} \subset \mathbb{R}$; each panel plots $w_1 P_1(x)$, $w_2 P_2(x)$, and the optimal decision boundary, with $(w_1, w_2) = (1, 10)$ in panels (a), (c) and $(w_1, w_2) = (10, 1)$ in panels (b), (d). (a), (b): In a perfect setting where the two distributions $P_1$ and $P_2$ have disjoint supports, the Bayes-optimal classifier for the CSL is identical to the ground-truth classifier ($x \mapsto \mathrm{argmax}_i P_i(x)$) for any choice of weights $(w_1, w_2)$. (c), (d): In a more general setting, the Bayes-optimal classifier $x \mapsto \mathrm{argmax}_i w_i P_i(x)$ depends on the choice of weights (i.e., on the gain matrix). The optimal decision boundary (the green line) for the CSL moves to the right as $w_1/w_2$ increases.

In this case, we require a more restrictive consistency regularizer for the distribution $P_{i_0}$. Thus, we require a weighted (or cost-sensitive) consistency regularizer, which we define below. We assume that the instance space $\mathcal{X}$ is a normed vector space with norm $|\cdot|$ and that $\mathcal{T}$ is a set of augmentations, i.e., each $T \in \mathcal{T}$ is a map from $\mathcal{X}$ to itself. For a fixed $r > 0$, we define $\mathcal{B}(x) = \{x' \in \mathcal{X} : \exists T \in \mathcal{T} \text{ s.t. } |x' - T(x)| \leq r\}$. For each $i \in [K]$, we define the conditional consistency regularizer by

$$R_{\mathcal{B},i}(F) = \mathbb{E}_{x \sim P_i} \left[ \mathbf{1}\left( \exists x' \in \mathcal{B}(x) \text{ s.t. } F(x') \neq F(x) \right) \right].$$

Then, we define the weighted consistency regularizer by $R_{\mathcal{B},w}(F) = \sum_{i,j \in [K]} w_{ij} \, R_{\mathcal{B},i}(F)$. If the prediction of a classifier $F$ is robust to the augmentations $T \in \mathcal{T}$ and small noise, then $R_{\mathcal{B},w}(F)$ is small. For $\beta > 0$, we consider the following optimization objective for finding a classifier $F$:

$$\min_F \; \mathrm{Err}_w(F) \quad \text{subject to} \quad R_{\mathcal{B},w}(F) \leq \beta. \tag{6}$$

A solution of problem (6) is denoted by $F^\ast$. We let $F_{pl} : \mathcal{X} \to [K]$ be a pseudo-labeler (a classifier) with reasonable performance (we elaborate on this in Section 3.4). The following, mathematically informal, assumption is required to interpret our main theorem.

ASSUMPTION 1. We assume both $\beta$ and $\mathrm{Err}_w(F^\ast)$ are sufficiently small that they are negligible compared to $\mathrm{Err}_w(F_{pl})$.

Assumption 1 posits the existence of an optimal classifier $F^\ast$ that minimizes the error $\mathrm{Err}_w(F)$ (i.e., is Bayes-optimal) among the class of classifiers that are robust to data augmentation (i.e., have low weighted consistency $R_{\mathcal{B},w}(F)$). As we operate in the regime of overparameterized neural networks, such a classifier $F^\ast$ is bound to exist, but is unknown in our problem setup. In the case of the balanced error, the validity of this assumption is supported by the fact that existing work [31] using consistency regularization on data augmentation obtains classifiers that achieve high accuracy (i.e., low balanced error) on balanced datasets. In Appendix C.1, we also provide an example that supports the validity of the assumption in the case of Gaussian mixtures and diagonal weight matrices.

3.3 Expansion Property

For $x \in \mathcal{X}$, we define the neighborhood $\mathcal{N}(x)$ of $x$ by $\mathcal{N}(x) = \{x' \in \mathcal{X} : \mathcal{B}(x) \cap \mathcal{B}(x') \neq \emptyset\}$.
For a subset $S \subset \mathcal{X}$, the neighborhood of $S$ is defined as $\mathcal{N}(S) = \cup_{x \in S} \mathcal{N}(x)$. Similarly to [35], we consider the following property of distributions.

DEFINITION 2. Let $c : (0, 1] \to [1, \infty)$ be a non-increasing function. For a distribution $Q$ on $\mathcal{X}$, we say $Q$ has the $c$-expansion property if $Q(\mathcal{N}(S)) \geq c(Q(S)) \, Q(S)$ for any measurable $S \subset \mathcal{X}$.

The $c$-expansion property implies that if $Q(S)$ decreases, then the lower bound on the expansion factor $Q(\mathcal{N}(S))/Q(S)$ increases. This is a natural condition, because it roughly requires that if $Q(S)$ is small, then $Q(\mathcal{N}(S))$ is large compared to $Q(S)$. For intuition, let $S \subset \mathbb{R}^d$ be a ball of radius $l$ with volume $Q(S)$, and let its neighborhood $\mathcal{N}(S)$ expand it to a ball of radius $l + 1$. The expansion factor here is $((l+1)/l)^d$; hence, as $l$ (i.e., $Q(S)$) increases, $(1 + 1/l)^d$ (i.e., $Q(\mathcal{N}(S))/Q(S)$) decreases. Hence, it is natural to expect $c$ to be a non-increasing function. The $c$-expansion property (on each $P_i$) considered here is equivalent to the $(a, \tilde{c})$-expansion property, which was shown to be realistic for vision and used for the theoretical analysis of self-training in [35], where $a \in (0, 1)$ and $\tilde{c} > 1$ (see Sec. N.2 in the Appendix). In addition, we show that it is also satisfied for mixtures of Gaussians and mixtures of manifolds (see Example 1 in the Appendix for more details). Thus, the $c$-expansion property is a general property satisfied by a wide class of distributions.

ASSUMPTION 3. Define the weighted probability measure $P_w$ on $\mathcal{X}$ by $P_w(U) = \frac{\sum_{i,j \in [K]} w_{ij} P_i(U)}{\sum_{i,j \in [K]} w_{ij}}$ for $U \subset \mathcal{X}$. We assume $P_w$ satisfies $c$-expansion for a non-increasing function $c : (0, 1] \to [1, \infty)$.

3.4 Cost-Sensitive Self-Training with Weighted Consistency Regularizer

In this section, we first introduce the assumptions on the pseudo-labeler $F_{pl}$ and then introduce the theoretical Cost-Sensitive Self-Training (CSST) objective. $F_{pl}$ can be any classifier satisfying the following assumption; however, it is typically a classifier trained on a labeled dataset.

ASSUMPTION 4. We assume that $\mathrm{Err}_w(F_{pl}) + \mathrm{Err}_w(F^\ast) \leq |w|_1$. Let $\gamma = c(p_w)$, where $p_w = \frac{\mathrm{Err}_w(F_{pl}) + \mathrm{Err}_w(F^\ast)}{|w|_1}$. We also assume $\gamma > 3$.

Since $c$ is non-increasing, $\gamma$ is a non-increasing function of $\mathrm{Err}_w(F_{pl})$ (and of $\mathrm{Err}_w(F^\ast)$). Therefore, the assumption $\gamma > 3$ roughly requires that $\mathrm{Err}_w(F_{pl})$ is small. We provide concrete conditions on $\mathrm{Err}_w(F_{pl})$ that satisfy $\gamma > 3$ in the case of a mixture of isotropic $d$-dimensional Gaussians, for a region $\mathcal{B}(x)$ defined by $r$, in the Appendix (Example 2). In that example, with $\mathcal{X} \subset \mathbb{R}^d$, the condition $\gamma > 3$ is satisfied if $\mathrm{Err}_w(F_{pl}) < 0.17$ when $r = 1/(2\sqrt{d})$, and if $\mathrm{Err}_w(F_{pl}) < 0.33$ when $r = 3/(2\sqrt{d})$. Since we assume $\mathrm{Err}_w(F^\ast)$ is negligible compared to $\mathrm{Err}_w(F_{pl})$ (Assumption 1), the former condition in Assumption 4 is approximately equivalent to $\mathrm{Err}_w(F_{pl}) \leq |w|_1$, which is satisfied by the definition of $\mathrm{Err}_w$.

We define $L^{(i)}_{0\text{-}1}(F, F') = \mathbb{E}_{x \sim P_i}\left[\mathbf{1}(F(x) \neq F'(x))\right]$. Then, we consider the following objective:

$$\min_F \; L_w(F), \quad \text{where} \quad L_w(F) = \frac{\gamma + 1}{\gamma - 1} \, L_w(F, F_{pl}) + \frac{2\gamma}{\gamma - 1} \, R_{\mathcal{B},w}(F). \tag{7}$$

Here $L_w(F, F_{pl})$ is defined as $\sum_{i,j \in [K]} w_{ij} \, L^{(i)}_{0\text{-}1}(F, F_{pl})$. The above objective corresponds to a cost-sensitive self-training (with $F_{pl}$) objective with weighted consistency regularization. The following theorem relates the weighted error of a classifier $\hat{F}$ learnt using the above objective to the weighted error of the pseudo-labeler $F_{pl}$.

THEOREM 5. Any classifier $\hat{F}$ learnt using the loss $L_w$ (i.e., $\hat{F} \in \mathrm{argmin}_F L_w(F)$) satisfies:

$$\mathrm{Err}_w(\hat{F}) \leq \frac{2}{\gamma - 1} \, \mathrm{Err}_w(F_{pl}) + \frac{\gamma + 1}{\gamma - 1} \, \mathrm{Err}_w(F^\ast) + \frac{4\gamma}{\gamma - 1} \, R_{\mathcal{B},w}(F^\ast).$$

REMARK.
Since both $\mathrm{Err}_w(F^\ast)$ and $R_{\mathcal{B},w}(F^\ast) \leq \beta$ are negligible compared to $\mathrm{Err}_w(F_{pl})$, and $\gamma > 3$, Theorem 5 asserts that $\hat{F}$, learnt by minimizing the semi-supervised loss $L_w(F, F_{pl})$ with the consistency regularizer $R_{\mathcal{B},w}(F)$, can achieve superior performance to the pseudo-labeler $F_{pl}$ in terms of the weighted error $\mathrm{Err}_w$. The above theorem is a generalization of [35, Theorem 4.3], which provided a similar result for the balanced 0-1 error in the case of distributions with disjoint supports. In Appendix Sec. D, following [35, 34], we also provide a generalization bound for $\mathrm{Err}_w(F)$ using the all-layer margin [34] in the case where the classifiers are neural networks.

4 CSST in Practice

In the previous section, we proved that by using self-training (CSST), we can obtain a classifier $\hat{F}$ superior to the pseudo-labeler $F_{pl}$ through weighted consistency regularization. As we have established the equivalence of the weighted error $\mathrm{Err}_w$ to the CSL objective expressed in terms of $G$ (Sec. 3.1), we can in theory better optimize a given non-decomposable metric expressed by $G$ using CSST, utilizing the additional unlabeled data via self-training and weighted consistency regularization. We now show how CSST can be used in practice for optimizing non-decomposable metrics with neural networks. Practical self-training methods utilizing consistency regularization (e.g., FixMatch [31]) for semi-supervised learning have a supervised loss $L_s$ for labeled samples and a consistency-regularization loss $L_u$ for unlabeled samples, with a thresholding mechanism to select unlabeled samples. The final loss for training the network is $L_s + \lambda_u L_u$, where $\lambda_u$ is a hyperparameter. The supervised loss $L_s$ can be modified conveniently based on the desired non-decomposable metric by using a suitable $G$ (Sec. 2.1). We will now introduce the novel weighted consistency loss and its corresponding thresholding mechanism for unlabeled data in CSST, used for optimizing the desired non-decomposable metric.

Weighted Consistency Regularization. The idea of consistency regularization is to enforce consistency between model predictions on different augmentations of an input, which is usually achieved by minimizing some divergence $D$. Many recent works [18, 31, 36] in semi-supervised learning use $D_{KL}$ to enforce consistency between the model's predictions on unlabeled data and its augmentations, $p_m(x)$ and $p_m(A(x))$. Here $A$ usually denotes a form of strong augmentation. Across these works, the distribution of the model's prediction confidence is either sharpened or converted to a hard pseudo-label to obtain $\hat{p}_m(x)$. As we aim to optimize the cost-sensitive learning objective, we instead aim to match $p_m(A(x))$ to the normalized target distribution $\mathrm{norm}(G^\top \hat{p}_m(x)) = G^\top \hat{p}_m(x) / \sum_i (G^\top \hat{p}_m(x))_i$ (Proposition 2 of [20], restated in Prop. 7) by minimizing the KL divergence between them. We propose the following weighted consistency loss for optimizing this:

$$\ell^{wt}_u(\hat{p}_m(x), p_m(A(x)), G) = -\sum_{i=1}^{K} \left( G^\top \hat{p}_m(x) \right)_i \log\left( p_m(A(x))_i \right). \tag{8}$$

PROPOSITION 6. Minimizing $L^{wt}_u = \frac{1}{|B_u|} \sum_{x \in B_u} \ell^{wt}_u(\hat{p}_m(x), p_m(A(x)), G)$ leads to minimization of the KL divergence $D_{KL}\left(\mathrm{norm}(G^\top \hat{p}_m(x)) \,\|\, p_m(A(x))\right)$ for all $x \in \mathcal{X}$.

As the above loss is similar in nature to the cost-sensitive losses introduced by Narasimhan and Menon [20] (Sec. 2.2), we can use their logit-adjusted variants (i.e., $\ell_{\mathrm{LA}}$ or $\ell_{\mathrm{hyb}}$, depending on the type of $G$) in our final loss formulation ($L^{wt}_u$) for training overparameterized deep networks. We further show in Appendix Sec. B that the loss $\ell^{wt}_u$ approximately minimizes the theoretical weighted consistency regularization term $R_{\mathcal{B},w}(F)$ defined in Sec. 3.2.
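Concretely, Eq. (8) is a cross-entropy against the unnormalized target $G^\top \hat{p}_m(x)$; a minimal sketch (ours, assuming softmax outputs as NumPy vectors) is:

```python
import numpy as np

def weighted_consistency_loss(p_pseudo, p_aug, G):
    """l_u^wt of Eq. (8): cross-entropy between the G-reweighted pseudo-label
    distribution G^T p_pseudo and the prediction p_aug = p_m(A(x)) on the
    strongly augmented view; minimizing it over p_aug drives p_aug toward
    norm(G^T p_pseudo), matching Proposition 6."""
    target = G.T @ p_pseudo                # unnormalized target G^T p_pseudo
    return -np.dot(target, np.log(p_aug))
```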
Threshold Mechanism for CSST. In the usual semi-supervised learning formulation [31], a confidence threshold ($\max_i (p_m(x))_i > \tau$) is used to select the samples for which the consistency-regularization term is non-zero. We find that this leads to sub-optimal results, particularly in the case of a non-diagonal $G$, as only a few samples cross the threshold (Fig. 4): with a cost-sensitive loss formulation, samples may never achieve high enough confidence to cross the consistency-regularization threshold. This is also theoretically justified by the following proposition:

PROPOSITION 7 ([20], Proposition 2). Let $p^{opt}_m(x)$ be the optimal softmax model function obtained by optimizing the cost-sensitive objective of Eq. (2) through the average of the weighted loss $\ell^{wt}(y, p_m(x)) = -\sum_{i=1}^{K} G_{y,i} \log\left( \frac{(p_m(x))_i}{\sum_j (p_m(x))_j} \right)$. Then the optimal $p^{opt}_m(x)$ is:

$$\left( p^{opt}_m(x) \right)_i = \frac{G_{y,i}}{\sum_j G_{y,j}}, \quad \text{i.e.,} \quad p^{opt}_m(x) = \mathrm{norm}(G^\top \mathbf{y}) \;\; \forall (x, y).$$

Here $\mathbf{y}$ is the one-hot representation vector for a label $y$. This proposition demonstrates that, depending on $G$, a high-confidence $p_m(x)$ may not be optimal for a particular sample. We now propose a novel way of thresholding the samples to which consistency regularization is applied in CSST. Our thresholding method takes the objective of optimizing the non-decomposable metric into account through $G$. We propose to threshold on the KL divergence between the softmax output $p_m(x)$ and the optimal softmax (i.e., $\mathrm{norm}(G^\top \hat{p}_m(x))$) for the given $G$ corresponding to the pseudo-label (or sharpened prediction) $\hat{p}_m(x)$, and modify the consistency-regularization loss term accordingly:

$$L^{wt}_u(B_u) = \frac{1}{|B_u|} \sum_{x \in B_u} \mathbf{1}\left( D_{KL}\left(\mathrm{norm}(G^\top \hat{p}_m(x)) \,\|\, p_m(x)\right) \leq \tau \right) \, \ell^{wt}_u(\hat{p}_m(x), p_m(A(x)), G). \tag{9}$$

We name this proposed combination of KL-thresholding and weighted consistency regularization CSST in our experimental results. We find that for a non-diagonal gain matrix $G$, the proposed thresholding plays a major role in improving performance over supervised learning. This is demonstrated by the comparison of CSST and CSST w/o KL-Thresholding (without the proposed thresholding mechanism) in Fig. 4 and Tab. 3.
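Putting Eq. (8) and the KL-based threshold of Eq. (9) together, a batch-level sketch might look as follows (ours; the hard pseudo-labeling and the epsilon guard are implementation assumptions):

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for probability vectors (eps guards against log(0))."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def csst_unlabeled_loss(probs_weak, probs_strong, G, tau):
    """L_u^wt(B_u) of Eq. (9): the weighted consistency loss is applied only
    to samples whose current prediction p_m(x) is within KL distance tau of
    the optimal softmax norm(G^T \\hat{p}_m(x)) for their pseudo-label."""
    total = 0.0
    for p_w, p_s in zip(probs_weak, probs_strong):
        p_hat = np.eye(len(p_w))[p_w.argmax()]   # hard pseudo-label
        target = G.T @ p_hat
        target = target / target.sum()           # norm(G^T \hat{p}_m(x))
        if kl_div(target, p_w) <= tau:           # KL-based selection
            total += -np.dot(G.T @ p_hat, np.log(p_s + 1e-12))
    return total / len(probs_weak)
```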
We now empirically incorporate CSST, with its consistency-based losses and thresholding mechanism for unlabeled data, into the popular semi-supervised methods FixMatch [31] and Unsupervised Data Augmentation for Consistency Training (UDA) [36]. The exact expressions for the weighted consistency losses used for UDA and FixMatch are provided in Appendix Sec. H.2 and H.3.

Table 2: Results of maximizing the worst-case recall over all classes (cols 2-3) and over just the head and tail classes (cols 4-7).

| Method | CIFAR10-LT (ρ=100) Avg. Rec | CIFAR10-LT (ρ=100) Min. Rec | CIFAR100-LT (ρ=10) Avg. Rec | CIFAR100-LT (ρ=10) Min. HT Rec | ImageNet100-LT (ρ=10) Avg. Rec | ImageNet100-LT (ρ=10) Min. HT Rec |
|---|---|---|---|---|---|---|
| ERM | 0.52 | 0.26 | 0.36 | 0.14 | 0.40 | 0.30 |
| LA | 0.51 | 0.38 | 0.36 | 0.35 | 0.48 | 0.47 |
| CSL | 0.64 | 0.57 | 0.43 | 0.43 | 0.52 | 0.52 |
| Vanilla (FixMatch) | 0.78 | 0.48 | 0.63 | 0.36 | 0.58 | 0.49 |
| CSST (FixMatch) | 0.76 | 0.72 | 0.63 | 0.61 | 0.64 | 0.63 |

Figure 3: CIFAR-10 long-tail distribution (ρ = 100, µ = 4).

5 Experiments

We demonstrate that the proposed CSST framework shows significant gains in performance on both vision and NLP tasks on imbalanced datasets, with the imbalance ratio defined on the training set as $\rho = \frac{\max_i P(y=i)}{\min_i P(y=i)}$. We assume the labeled and unlabeled samples come from a similar data distribution and that unlabeled samples are much more abundant (µ times the labeled ones). The frequency of samples follows an exponentially decaying long-tailed distribution, as seen in Fig. 3, which closely imitates the distribution of real-world data [33, 10]. For CIFAR-10 [11], IMDb [16], and DBpedia-14 [14] we use ρ = 100, and ρ = 10 for the CIFAR-100 [11] and ImageNet-100 [27]² datasets. We compare our method against the supervised methods ERM, Logit Adjustment (LA) [17], and Cost-Sensitive Learning (CSL) [21], trained on the same number of labeled samples as used by the semi-supervised methods, along with the vanilla semi-supervised methods FixMatch (for vision) and UDA (for NLP tasks). We use Wide ResNets (WRN) [39], specifically WRN-28-2 and WRN-28-8, for CIFAR-10 and CIFAR-100 respectively. For ImageNet, we use a ResNet-50 [7] network, and we fine-tune DistilBERT (base-uncased) [29] for the IMDb and DBpedia-14 datasets. We divide the balanced held-out set for each dataset equally into validation and test sets. A detailed list of hyper-parameters and additional experiments can be found in Appendix Tab. 4 and Sec. O respectively.

²https://www.kaggle.com/datasets/ambityga/imagenet100

Maximizing Worst-Case Recall. For the CIFAR-10, IMDb, and DBpedia-14 datasets, we maximize the minimum recall among all classes (Eq. (3)). Given the low number of samples per class for datasets with a larger number of classes, such as CIFAR-100 and ImageNet-100, we instead pick objective (10). We define the head classes $H$ and tail classes $T$ as the first 90 and last 10 classes respectively. The Min. HT recall objective can be mathematically formulated as:

$$\max_F \; \min_{(\lambda_H, \lambda_T) \in \Delta^1} \; \frac{\lambda_H}{|H|} \sum_{i \in H} \mathrm{rec}_i[F] + \frac{\lambda_T}{|T|} \sum_{i \in T} \mathrm{rec}_i[F]. \tag{10}$$

The corresponding gain matrix is $G = \mathrm{diag}\left(\frac{\lambda_H}{\pi_1 |H|}, \frac{\lambda_H}{\pi_2 |H|}, \ldots, \frac{\lambda_T}{\pi_{K-1} |T|}, \frac{\lambda_T}{\pi_K |T|}\right)$. Since $G$ is diagonal here, we use the CSST(FixMatch) loss function of Eq. (9) with the corresponding $\ell^{wt}_u$ substituted by $\ell^{\mathrm{LA}}_u$, as defined in Sec. 2.2. For the labeled samples we likewise use $L^{\mathrm{LA}}_s$, as $G$ is diagonal; we then combine the losses and train the network using SGD. Every few steps of SGD are followed by an update of λ and G based on the uniform validation set (see Alg. 1 in Appendix). We find that CSST(FixMatch) significantly outperforms the other baselines in terms of Min. recall and Min. head-tail recall for all datasets, the metrics we aimed to optimize (Tab. 2), which shows the effectiveness of the CSST framework. Despite optimizing worst-case recall, our framework still maintains a reasonable average (Avg.) recall in comparison to the vanilla (FixMatch) baseline, which demonstrates its practical applicability. We find that optimizing Min. recall on NLP long-tailed classification tasks by plugging UDA into the CSST(UDA) framework shows similar improvements in performance (Tab. 4). This establishes the generality of our framework across self-training methods in the NLP domain as well. As $G$ is a diagonal matrix for this objective, the proposed KL-based thresholding is equivalent to the confidence-based threshold of FixMatch in this case; we discuss their equivalence in Appendix Sec. I. Despite this equivalence of the thresholding mechanism, we see significant gains in Min. recall (Tab. 2) from the weighted consistency regularization alone.
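For illustration, one plausible instantiation of the periodic exponentiated-gradient update of λ and G for the max-min-recall game of Eq. (3) is sketched below (ours; the step size `eta` and the sign convention are assumptions, and the paper's exact procedure is Alg. 1 in the Appendix):

```python
import numpy as np

def update_lambda_and_gain(lam, recalls, priors, eta=0.1):
    """One exponentiated-gradient step for the max-min-recall game of Eq. (3):
    lambda is pushed toward classes with low validation recall and projected
    back onto the simplex, then the diagonal gain matrix G = diag(lambda_i / pi_i)
    is rebuilt for the next rounds of SGD."""
    lam = lam * np.exp(-eta * recalls)   # upweight poorly recalled classes
    lam = lam / lam.sum()                # normalize back onto the simplex
    G = np.diag(lam / priors)
    return lam, G
```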
Maximizing Mean Recall Under Coverage Constraints. This objective seeks a model with good average recall while simultaneously constraining the proportion of predictions for each class to be uniform across classes. The ideal target coverage under a balanced evaluation set (or similar circumstances) is $\mathrm{cov}_i[F] = \frac{1}{K}$ for all $i \in [K]$. Along similar lines to objective (10), we modify objective (4) to maximize the average recall subject to both the average head-class and tail-class coverage (HT coverage) being above a given threshold of $\frac{0.95}{K}$:

$$\max_F \; \min_{(\lambda_H, \lambda_T) \in \mathbb{R}^2_{\geq 0}} \; \frac{1}{K} \sum_{i \in [K]} \frac{C_{ii}[F]}{\pi_i} + \lambda_H \left( \frac{1}{|H|} \sum_{j \in H} \mathrm{cov}_j[F] - \frac{0.95}{K} \right) + \lambda_T \left( \frac{1}{|T|} \sum_{j \in T} \mathrm{cov}_j[F] - \frac{0.95}{K} \right). \tag{11}$$

The gain matrix $G$ is non-diagonal here, with $G_{ij} = \left( \mathbf{1}(j \in H)\,\frac{\lambda_H}{|H|} + \mathbf{1}(j \in T)\,\frac{\lambda_T}{|T|} \right) + \frac{\delta_{ij}}{K \pi_i}$. As this objective corresponds to a non-diagonal $G$, as shown for Eq. (4) in Sec. 2.1, for introducing CSST into FixMatch we replace the supervised loss $L_s$ with $L^{\mathrm{hyb}}_s$. For the unlabeled data we use $\ell_{\mathrm{hyb}}$ in $L^{wt}_u$ (Eq. (9)). Hence, the final objective is $L = L^{\mathrm{hyb}}_s + \lambda_u L^{wt}_u$. We update the parameters of the cost-sensitive loss ($G$ and $\lambda$) periodically after a few steps of SGD on the model parameters (Alg. 2 in Appendix). In this case, our proposed thresholding mechanism in CSST(FixMatch), introduced in Sec. 4, leads to effective utilization of unlabeled data, resulting in improved performance over the naive CSST(FixMatch) without (w/o) KL-thresholding (Tab. 3). In these experiments, the mean recall of our proposed approach either improves over or matches the vanilla (FixMatch) implementation, but ours is the only one that comes close to satisfying the coverage constraint. We further note that, among all methods, the only ones that come close to satisfying the coverage constraints are those that include $\ell_{\mathrm{hyb}}$.

Table 3: Results of maximizing the mean recall subject to coverage constraints over all classes (cols 2-3) and over the head and tail classes (cols 4-7). CIFAR10-LT uses per-class coverage (ρ = 100, target 0.1); CIFAR100-LT and ImageNet100-LT use head-tail coverage (ρ = 10, target 0.01). The proposed CSST(FixMatch) approach compares favorably to ERM, LA, CSL, vanilla (FixMatch), and CSST(FixMatch) w/o KL-Thresh.: it is the best at both maximizing mean recall and coming close to satisfying the coverage constraint.

| Method | CIFAR10-LT Avg. Rec | CIFAR10-LT Min. Cov | CIFAR100-LT Avg. Rec | CIFAR100-LT Min. HT Cov | ImageNet100-LT Avg. Rec | ImageNet100-LT Min. HT Cov |
|---|---|---|---|---|---|---|
| ERM | 0.52 | 0.034 | 0.36 | 0.004 | 0.40 | 0.006 |
| LA | 0.51 | 0.039 | 0.36 | 0.009 | 0.48 | 0.009 |
| CSL | 0.60 | 0.090 | 0.45 | 0.010 | 0.48 | 0.010 |
| Vanilla (FixMatch) | 0.78 | 0.055 | 0.63 | 0.004 | 0.58 | 0.007 |
| CSST(FixMatch) w/o KL-Thresh. | 0.67 | 0.093 | 0.47 | 0.010 | 0.26 | 0.010 |
| CSST(FixMatch) | 0.80 | 0.092 | 0.63 | 0.010 | 0.58 | 0.010 |

Table 4: Results of maximizing the minimum recall over all classes for classification on NLP datasets. The proposed CSST(UDA) approach outperforms the ERM and vanilla (UDA) baselines.

| Method | IMDb (ρ=10) Avg. Rec | IMDb (ρ=10) Min. Rec | IMDb (ρ=100) Avg. Rec | IMDb (ρ=100) Min. Rec | DBpedia-14 (ρ=100) Avg. Rec | DBpedia-14 (ρ=100) Min. Rec |
|---|---|---|---|---|---|---|
| ERM | 0.79 | 0.61 | 0.50 | 0.00 | 0.95 | 0.58 |
| vanilla (UDA) | 0.82 | 0.66 | 0.50 | 0.00 | 0.96 | 0.65 |
| CSST(UDA) | 0.89 | 0.88 | 0.77 | 0.75 | 0.99 | 0.97 |

Ablation on amount of unlabeled data. Here, we ablate the total number of labeled samples while keeping the number of unlabeled samples constant. We observe (Fig. 5a) that as we increase the number of labeled samples, the mean recall improves, because more labeled samples lead to better pseudo-labeling on the unlabeled samples; conversely, as we decrease the number of labeled samples, the model's errors on the pseudo-labels increase, causing a reduction in mean recall. Hence, any additional labeled data can readily be used to improve CSST's performance.

Ablation on the τ threshold.
Fig. 5b shows that when the KL-divergence threshold is too high, a large number of samples with a very low degree of distribution match are used for generating the sharpened targets (or pseudo-labels); this worsens the mean recall, as many of those targets are incorrect. We find that a conservative value of τ = 0.3 works well across multiple experiments.

Figure 4: Fraction of unlabeled data used for maximizing average recall under coverage constraints for CIFAR-100 (ρ = 10, µ = 4) (Sec. 5).

Figure 5: Maximizing average recall under coverage constraints for CIFAR-10 long-tail (ρ = 100) (Sec. 5). (a) Effect of increasing the ratio of unlabeled to labeled samples, given a fixed number of unlabeled samples. (b) Ablation on the KL-divergence-based threshold for CSST(FixMatch).

6 Related Work

Self-Training. Self-training algorithms have been popularly used for the tasks of semi-supervised learning [1, 37, 31, 13] and unsupervised domain adaptation [28, 41]. In recent years, regularizers that enforce consistency in the neighborhood of a given sample (either an adversarial perturbation [18] or an augmentation [36]) have further enhanced the applicability and performance of self-training methods when used in conjunction. However, these works have focused mostly on improving the generic metric of accuracy, unlike the general non-decomposable metrics we consider.

Cost-Sensitive Learning. This refers to problem settings where the cost of an error differs based on the class a sample belongs to. These settings are very important for critical real-world applications like disease diagnosis, where mistakenly classifying a diseased person as healthy can be disastrous. A plethora of techniques have been proposed for this, which can be classified into importance-weighting [15, 38, 5] and adaptive-margin [2, 38] based techniques. For overparameterized models, Narasimhan and Menon [20] show that loss-weighting based techniques are ineffective and propose a logit-adjustment based cost-sensitive loss, which we also use in our framework.

Complex Metrics for Deep Learning. There has been a prolonged effort to optimize more complex metrics that take practical constraints into account [21, 26, 23]. However, most work has focused on linear models, leaving scope for work in the context of deep neural networks. Sanyal et al. [30] train DNNs using reweighting strategies for optimizing such metrics, Huang et al. [8] use a reinforcement-learning strategy to optimize complex metrics, and Kumar et al. [12] optimize the complex AUC (Area Under the Curve) metric for a deep neural network. However, all these works operate primarily in the supervised learning setup and are not designed to effectively make use of available unlabeled data.

7 Conclusion

In this work, we aim to optimize practical non-decomposable metrics widely used in machine learning through self-training with consistency regularization, a class of semi-supervised learning methods. We introduce the Cost-Sensitive Self-Training (CSST) framework, which minimizes a cost-sensitive error on pseudo-labels together with consistency regularization. We show theoretically that we can obtain classifiers that better optimize the desired non-decomposable metric than the original model used for obtaining pseudo-labels, under data distribution assumptions similar to those used in theoretical analyses of self-training.
We then apply CSST to the practical and effective self-training methods FixMatch and UDA, incorporating a novel regularizer and thresholding mechanism based on the given non-decomposable objective. We find that CSST leads to significant gains in the desired non-decomposable metric in comparison to vanilla self-training baselines. Analyzing the CSST framework when the distribution of unlabeled data differs significantly from the labeled data is a good direction for future work.

Acknowledgements: This work was supported in part by a Fujitsu Research grant. Harsh Rangwani is supported by the Prime Minister's Research Fellowship (PMRF).

References

[1] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A. Raffel. MixMatch: A holistic approach to semi-supervised learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/1cd138d0499a68f4bb72bee04bbec2d7-Paper.pdf.

[2] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. Advances in Neural Information Processing Systems, 32, 2019.

[3] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.

[4] Andrew Cotter, Heinrich Jiang, Maya R. Gupta, Serena Wang, Taman Narayan, Seungil You, and Karthik Sridharan. Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. Journal of Machine Learning Research, 20(172):1–59, 2019.

[5] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9268–9277, 2019.

[6] Gabriel Goh, Andrew Cotter, Maya Gupta, and Michael P. Friedlander. Satisfying real-world goals with dataset constraints. Advances in Neural Information Processing Systems, 29, 2016.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[8] Chen Huang, Shuangfei Zhai, Walter Talbott, Miguel Bautista Martin, Shih-Yu Sun, Carlos Guestrin, and Josh Susskind. Addressing the loss-metric mismatch with adaptive loss alignment. In International Conference on Machine Learning, pages 2891–2900. PMLR, 2019.

[9] Durk P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. Advances in Neural Information Processing Systems, 27, 2014.

[10] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.

[11] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[12] Abhishek Kumar, Harikrishna Narasimhan, and Andrew Cotter. Implicit rate-constrained optimization of non-decomposable objectives. In International Conference on Machine Learning, pages 5861–5871. PMLR, 2021.
[13] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.

[14] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015.

[15] Yi Lin, Yoonkyung Lee, and Grace Wahba. Support vector machines for classification in nonstandard situations. Machine Learning, 46(1):191–202, 2002.

[16] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015.

[17] Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. Long-tail learning via logit adjustment. In International Conference on Learning Representations, 2020.

[18] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993, 2018.

[19] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. In International Conference on Machine Learning, pages 4615–4625. PMLR, 2019.

[20] Harikrishna Narasimhan and Aditya K. Menon. Training over-parameterized models with non-decomposable objectives. Advances in Neural Information Processing Systems, 34, 2021.

[21] Harikrishna Narasimhan, Rohit Vaish, and Shivani Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. Advances in Neural Information Processing Systems, 27, 2014.

[22] Harikrishna Narasimhan, Harish Ramaswamy, Aadirupa Saha, and Shivani Agarwal. Consistent multiclass algorithms for complex performance measures. In International Conference on Machine Learning, pages 2398–2407. PMLR, 2015.

[23] Nagarajan Natarajan, Oluwasanmi Koyejo, Pradeep Ravikumar, and Inderjit Dhillon. Optimal classification with multivariate losses. In International Conference on Machine Learning, pages 1530–1538. PMLR, 2016.

[24] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1944–1952, 2017.

[25] Hieu Pham, Zihang Dai, Qizhe Xie, and Quoc V. Le. Meta pseudo labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11557–11568, 2021.

[26] Shameem Puthiya Parambath, Nicolas Usunier, and Yves Grandvalet. Optimizing F-measures by cost-sensitive classification. Advances in Neural Information Processing Systems, 27, 2014.

[27] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[28] Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation.
In International Conference on Machine Learning, pages 2988–2997. PMLR, 2017.

[29] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

[30] Amartya Sanyal, Pawan Kumar, Purushottam Kar, Sanjay Chawla, and Fabrizio Sebastiani. Optimizing non-decomposable measures with deep networks. Machine Learning, 107(8):1597–1620, 2018.

[31] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A. Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. FixMatch: Simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems, 33:596–608, 2020.

[32] Shiv Kumar Tavker, Harish Guruprasad Ramaswamy, and Harikrishna Narasimhan. Consistent plug-in classifiers for complex objectives and constraints. Advances in Neural Information Processing Systems, 33:20366–20377, 2020.

[33] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.

[34] Colin Wei and Tengyu Ma. Improved sample complexities for deep neural networks and robust classification via an all-layer margin. In International Conference on Learning Representations, 2019.

[35] Colin Wei, Kendrick Shen, Yining Chen, and Tengyu Ma. Theoretical analysis of self-training with deep networks on unlabeled data. In International Conference on Learning Representations, 2020.

[36] Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems, 33:6256–6268, 2020.

[37] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687–10698, 2020.

[38] Bianca Zadrozny, John Langford, and Naoki Abe. Cost-sensitive learning by cost-proportionate example weighting. In Third IEEE International Conference on Data Mining, pages 435–442. IEEE, 2003.

[39] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock, Richard C. Wilson, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016. ISBN 1-901725-59-6. doi: 10.5244/C.30.87. URL https://dx.doi.org/10.5244/C.30.87.

[40] Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. FlexMatch: Boosting semi-supervised learning with curriculum pseudo labeling. Advances in Neural Information Processing Systems, 34, 2021.

[41] Yang Zou, Zhiding Yu, Xiaofeng Liu, B.V.K. Kumar, and Jinsong Wang. Confidence regularized self-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5982–5991, 2019.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Section A in Appendix.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section A in Appendix.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] See Assumption 1, Assumption 3, and Assumption 4.
   (b) Did you include complete proofs of all theoretical results? [Yes] See Sections C, D, and E in Appendix.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] We will release our code here: https://github.com/val-iisc/CostSensitiveSelfTraining
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Refer to Section 5; more detail is available in Section L in the supplementary material.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] As our experiments are done on neural networks, which are expensive to train, we have not currently included error bars in the results. However, we do plan to provide error bars for a certain subset of experiments in Appendix Tab. 5.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Section G in Appendix.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] See Section G in Appendix.
   (b) Did you mention the license of the assets? [Yes] See Section G in Appendix.
   (c) Did you include any new assets either in the supplemental material or as a URL? [No]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]