# discriminative_complementarylabel_learning_with_weighted_loss__e2b8f930.pdf

Discriminative Complementary-Label Learning with Weighted Loss

Yi Gao 1 2 Min-Ling Zhang 2 3

Complementary-label learning (CLL) deals with the weak supervision scenario where each training instance is associated with one complementary label, which speciﬁes the class label that the instance does not belong to. Given the training instance x, existing CLL approaches aim at modeling the generative relationship between the complementary label y, i.e. P( y | x), and the ground-truth label y, i.e. P(y | x). Nonetheless, as the ground-truth label is not directly accessible for complementarily labeled training instance, strong generative assumptions may not hold for real-world CLL tasks. In this paper, we derive a simple and theoretically-sound discriminative model towards P( y | x), which naturally leads to a risk estimator with estimation error bound at O(1/ n) convergence rate. Accordingly, a practical CLL approach is proposed by further introducing weighted loss to the empirical risk to maximize the predictive gap between potential groundtruth label and complementary label. Extensive experiments clearly validate the effectiveness of the proposed discriminative complementary-label learning approach.

1. Introduction

Ordinary classiﬁcation tasks generally require vast data with high-quality labels, while accurately annotating large-scale datasets is costly and time-consuming. The weakly supervised learning (WSL) paradigm has brought a new inspiration to alleviate this problem, which allows learning algorithms to train classiﬁers with less expensive data (Zhou, 2017; Ishida et al., 2017). The researchers have studied various frameworks based on weak supervision informa-

1School of Cyber Science and Engineering, Southeast University, Nanjing 210096, China 2Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China 3School of Computer Science and Engineering, Southeast University, Nanjing 210096, China. Correspondence to: Min-Ling Zhang <zhangml@seu.edu.cn>.

Proceedings of the 38 th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

tion, including but not limited to, semi-supervised learning (Chapelle et al., 2006; Oliver et al., 2018; Calder et al., 2020; Izmailov et al., 2020), noisy-label learning (Ghosh et al., 2017; Zhang & Sabuncu, 2018; Ma et al., 2018; Kim et al., 2019; Liu & Guo, 2020; Han et al., 2020), positiveunlabeled learning (du Plessis et al., 2014; Sakai et al., 2018; Chapel et al., 2020; Hammoudeh & Lowd, 2020; Su et al., 2021), unlabeled-unlabeled learning (Lu et al., 2019; Golovnev et al., 2019) and partial label learning (Wu & Zhang, 2019; Lv et al., 2020).

Here, we consider another natural scenario of WSL complementary-label learning (CLL) (Ishida et al., 2017; Yu et al., 2018; Ishida et al., 2019; Xu et al., 2020; Chou et al., 2020; Feng et al., 2020) the class label speciﬁes one of the classes that the instance does not belong to, while the learned classiﬁer is expected to predict the ground-truth label of each instance. Collection of data with complementary labels is obviously much easier and less time-consuming than that of ordinary labels. To solve the CLL problem, previous approaches mainly focus on assuming the generative relationship between the complementary label y and the ground-truth label y of each instance, which could be roughly divided into two categories: the ﬁrst category assumes that the relationship between y and y is unbiased based on an uniform distribution, i.e. P( Y = y | X = x) = 1 c 1 P

y = y P(Y = y | X = x) (c refers to the number of classes) (Ishida et al., 2017; 2019; Feng et al., 2020), while the second one assumes that the relationship is biased, i.e. P( Y = y | X = x) = P

y = y P( Y = y | Y = y)P(Y = y | X = x) (Yu et al., 2018; Xu et al., 2020).

As a pioneering work, the approach proposed by Ishida et al. (2017) designed an unbiased risk estimator (URE) with a solid theoretical analysis according to the assumption of the ﬁrst category, which enables multi-class classiﬁcation with only complementary labels. However, this approach only works with a limited group of loss functions, i.e., the one-versus-all (OVA) and the pairwise comparison (PC) loss functions (Zhang, 2004). With the same unbiased generation assumption, Ishida et al. (2019) proposed a general URE framework of complementary-label learning, which is unrestricted in models and loss functions. Nonetheless, these URE-based approaches in CLL may suffer from overﬁtting, as the empirical gradients may deviate from true

Discriminative Complementary-Label Learning with Weighted Loss

gradients during the optimization procedure (Chou et al., 2020).

Unlike previous studies, Yu et al. (2018) take the assumption that a biased relationship exists between the complementary label y and the ground-truth label y which is modeled by the transition probability, i.e. P( Y = y | Y = y), y = y {1, . . . , c}. Their approach makes the widely-used multiclass Cross-Entropy (CE) loss be amenable for solving CLL tasks. Subsequently, Xu et al. (2020) applied Conditional Generative Adversarial Net (CGAN) (Goodfellow et al., 2014; Mirza & Osindero, 2014) based on the same biased assumption to improve the classiﬁcation accuracy of CLL. However, this method requires extra conditions to be satisﬁed such as the availability of a set of anchor instances to enable transition probability estimation, which may not be satisﬁed in reality.

Overall, existing approaches rely on modeling the generative relationship between the complementary label and the ground-truth label of each training instance. Nevertheless, the ground-truth label is unknown for CLL training examples such that these strong generative assumption may be not suitable for solving real-world CLL problem. To tackle this problem, we propose a discriminative solution to directly model P( y | x) from the output of trained classiﬁers, which naturally leads to a novel CLL risk estimator. Speciﬁcally, a weighted loss is introduced to the empirical risk yielding a practical discriminative CLL approach. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed discriminative CLL approach. The main contributions are summarized as follows:

(1) We directly model P( y | x) from the predictive probability of learned classiﬁers in a simple yet effective manner. Correspondingly, we derive a risk estimator with guaranteed estimation error bound at O(1/ n) convergence rate.

(2) A practical CLL approach is proposed by introducing weighted loss to enforce predictive gap between potential ground-truth label and complementary label.

The rest of this paper is organized as follows. Section 2 gives formal deﬁnitions and brieﬂy reviews existing approaches to CLL. Section 3 presents the proposed discriminative CLL approach with theoretical analyses and algorithmic details. Section 4 reports the results of comparative experimental studies. Finally, Section 5 concludes this paper.

2. Background and Formulation

In this section, we give notations used in this paper, and brieﬂy discuss ordinary multi-class classiﬁcation and complementary-label learning.

2.1. Ordinary Multi-Class Classiﬁcation In ordinary multi-class classiﬁcation, let X Rd be the feature space and Y = {1, . . . , c} be the label space, where

c is the number of classes and c 2. Let p(x, y) be the unknown probability density function over random variables (X, Y ) X Y, and D = {(xi, yi)}n i=1 be a set of n training examples each associated with a ground-truth label. Ordinary multi-class classiﬁcation tasks aim to learn a classiﬁer that maps from the feature space to the label space f : X Rc, which is trained by minimizing the following classiﬁcation risk:

R(f) = E(X,Y ) p(x,y) ℓ(f(X), e Y ) (1)

where e Y {0, 1}c is the one-hot encoded label of X, and the Y -th element of e Y is one with all other elements being zero. E and ℓdenote the expectation and the loss function, respectively. Accordingly, the most possible predicted label by of an instance x is determined as

by = argmax k Y fk(x) (2)

where fk( ) denotes the k-th element of f( ), referring to the posterior probability of the k-th label being the ground-truth one, i.e., fk(X) = P(Y = k|X). The optimal classiﬁer f

in function class F corresponds to the minimizer of classiﬁcation risk R(f): f = argminf FR(f). As the underlying distribution p(x, y) is unknown, the classiﬁcation risk in Eq.(1) is usually approximated by the empirical risk Rn(f), i.e. Rn(f) = 1

n Pn i=1 ℓ(f(xi), eyi). Similarly, the optimal classiﬁer w.r.t. the empirical risk corresponds to: fn = argminf FRn(f).

2.2. Complementary-Label Learning

Different from ordinary multi-class classiﬁcation, each instance only has one complementary label in CLL. Let D = {(xi, yi)}n i=1 denote the set of complementarily labeled training examples, where yi Y \ {yi} is the complementary label of the instance xi and each example is sampled from p(x, y) which denotes an unknown probability distribution. As discussed in Section 1, existing approaches generally aim at modeling generative relationship between P( Y = y | X = x) and P(Y = y | X = x) (WLOG, we rewrite these terms as P( y | x) and P(y | x) in the rest of this paper), which can be categorized into the unbiased generative assumption and the biased one, respectively. The work of Ishida et al. (2017) follows the ﬁrst assumption to deﬁne P( y | x) as

p(x, y) = 1 c 1

y = y p(x, y)

P( y | x) p(x) = 1 c 1

y = y P(y | x)p(x). (3)

Since p(x) = p(x), we have P( y | x) = 1 c 1 P

y = y p(y | x). Based on Eq.(3), the OVA loss and PC loss for CLL, which naturally lead to an URE serving as an alternative formulation to Eq.(1), are deﬁned as

Discriminative Complementary-Label Learning with Weighted Loss

LOV A(f(X), Y ) = 1 c 1

Y = Y ℓ(f Y (X)) + ℓ( f Y (X))

LP C(f(X), Y ) = X

Y = Y ℓ(f Y (X) f Y (X)) (4)

where ℓ(z) is a binary loss which satisﬁes ℓ(z)+ℓ( z) = 1, such as the sigmoid loss ℓS(z) = 1 1+ez .

Different from Ishida et al. (2017; 2019), Yu et al. (2018) took another assumption which considers that the generative relationship between P( y | x) and P(y | x) is biased, i.e. the complementary label of an instance X is non-uniformly selected from Y \ {Y }. Therefore, this biased assumption can be formalized as P( Y = j | X) = P

k =j P( Y = j | Y = k)P(Y = k | X), where this biased generative relationship can be characterized by a transition probability matrix Q, i.e. Qkj = P( Y = j | Y = k) and Qkk = 0, k, j {1, . . . , c}.

Although feasible CLL approaches have been developed by exploiting either the unbiased (uniform) or biased (transition-based) generative assumption, their performance may be suboptimal for real-world CLL tasks where the two assumptions do not necessarily hold. In this paper, we propose a simple yet effective discriminative solution towards CLL which directly models P( y | x) from the predictive probability, and the solution naturally results in a CLL risk estimator with estimation error bound. We further introduce the weighted loss to maximize the predictive gap between the potential ground-truth label and complementary label.

3. The Proposed Approach

In this section, we introduce the proposed discriminative model and weighted loss. Accordingly, we further derive the estimation error bound of our approach.

3.1. The Discriminative Model

In ordinary multi-class classiﬁcation, we aim to optimize Eq.(1) where the predictive probability of the ground-truth label approaches one, i.e. f Y (X) 1, and other other labels to zero. In contrast, due to the complementary label is obviously not the ground-truth label of an instance, CLL expects that the predictive probability of the complementary label approaches zero, i.e. f Y (X) 0 (Kim et al., 2019).

The idea of Kim et al. (2019) also brings a strong motivation for us to propose the discriminative model that directly estimates P( y | x) from the classiﬁers output. Different from the approach proposed by Chou et al. (2020), we directly deﬁne the prediction probability of complementary label as f(X) = 1 f(X). Hence, the complementary loss ℓcan be expressed as

ℓ(f(X), e Y ) = ℓ( f(X), e Y ) = ℓ(1 f(X), e Y ) (5)

where e Y {0, 1}c is a one-hot vector for label Y , in which the Y -th element of e Y is one and all other elements being zero. As discussed above, the novel risk estimator for CLL can be described as

R(f) = E(X, Y ) p(x, y) h ℓ(f(X), e Y ) i . (6)

Correspondingly, the empirical risk estimator corresponds to Rn = 1

k=1 ℓ(1 fk(xi), e yi k ) (7)

where e yi k is the k-th element of e yi, fk denotes the k-th element of f.

3.2. Estimation Error Bound

Let F = {f(x)} be a c-valued function class to minimize empirical risk, where f = {f1, . . . , fc} F. We denote b Rn(F) as the Rademacher complexity of F for X with data size n (Mohri et al., 2012) and f n = argminf F Rn(f). Using M and Lℓto denote the upper bound and Lipschitz constant of ordinary loss function ℓrespectively. Here, we ﬁrst establish the upper bound of b Rn( ℓ F) in Lemma 1, which naturally leads to the uniform deviation bound that further guarantees to derive the estimation error bound. We start investigating Lemma 1 from Assumption 1.

Assumption 1. The loss function ℓ( , ) satisﬁes

ℓ(1 fk(X), 1 e Y k ) ℓ(fk(X), e Y k ). (8)

Such an assumption holds for some commonly used loss functions, such as MSE (Mean Squared Error) loss and MAE (Mean Absolute Error) loss.

Lemma 1. Based on Eq.(5) and Assumption 1, it holds that

b Rn( ℓ F) c2Lℓb Rn (Fk) . (9)

The proof is provided in Appendix. Given the upper bound for b Rn( ℓ F), we can directly obtain Lemma 2 based on Mc Diarmid s inequality (Mc Diarmid, 2013) and symmetrization (Mohri et al., 2012), which deﬁnes the uniform deviation bound.

Lemma 2. For any δ > 0, with probability at least 1 δ,

R(f) Rn(f) 2c2Lℓb Rn(Fk) (10)

where R(f) and Rn(f) is deﬁned by Eq.(6) and Eq.(7) respectively. The proof is given in Appendix.

According to Lemma 2, we can establish the estimation error bound for the proposed CLL risk estimator. The estimation error bound is shown in Theorem 1, whose proof is presented in Appendix.

Discriminative Complementary-Label Learning with Weighted Loss

Algorithm 1 CLL with weighted loss

D : the complementary-label training set {xi, yi}n i=1; T : the number of epochs; A: an external stochastic optimization algorithm; Output: θ : model parameter for f(x; θ); fk( ) : the k-th element of f(x; θ) and the predictive probability of the k-th label being the ground-truth label of an instance; for t = 1 to T do do

Shufﬂe D into B mini-batchs each with size s; for b = 1 to B do do

Let xb i be the i-th instance in b-th mini-batch, and yb i be the corresponding complementary label;

Set wk i = 1 fk(xb i) Pc j=1(1 fj(xb i));

Let Lb be the risk of b-th mini-batches, Lb = 1 s Ps i=1 Pc k=1(1 + wk i ) ℓ(fk(xb i), e yb i k ); Set gradient θLb; Update θ by A; end for end for

Theorem 1. For any δ > 0, with probability at least 1 δ,

R( f n) R( f ) 4c2Lℓb Rn(Fk) + M

(11) For all parametric models with a bounded norm, as n , R( f n) R( f ). Theorem 1 shows that the proposed risk estimator exists an estimation error bound and convergence rate is O(1 n). Note that in Eq.(11), c2 shows that the number of labels have a strong impact on our empirical performance. This implication agrees well with our expectation: the fewer number of labels, the more effective our proposed CLL method.

3.3. The Weighted Loss Commonly, the estimated posterior probability can be regarded as one metric to measure the prediction uncertainty and increasing uncertainty could lead to a deteriorated prediction performance (Yao et al., 2020). In Subsection 3.1, we propose to estimate the posterior probability of the complementary label. Although the proposed method is theoretically sound, its performance still depends heavily on the number of instances and the number of labels. In this part, we consider employing the prediction uncertainty in our proposed method. In this way, the highly conﬁdent predictions in the early stage of learning can be employed to boost the performance of succeeding updating of the model. Our solution is to introduce a weighted loss term to ℓto minimize the loss value in CLL, which is deﬁned as

ℓ(f(X), e Y ) = wℓ(1 f(X), e Y ) (12)

where w corresponds to a loss weight vector in the cdimensional simplex c 1.

Intuitively, w should be related to the prediction uncertainty and should be updated constantly through the whole learning process. A recent line of work proposes strategies, including maximum likelihood and maximum margin, to highlight the ground-truth label (Nguyen & Caruana, 2008; Liu & Dietterich, 2012; Yao et al., 2020; Jin & Ghahramani, 2002). In order to stand out the ground-truth label from all labels, maximum likelihood methods generally adopt EM procedure to optimize their models, which ﬁrstly use an independent E-step to learn weights, then train the models until convergence in the M-step (Jin & Ghahramani, 2002; Liu & Dietterich, 2012; Lv et al., 2020). However, E-step of these methods are separated from the M-step. In this way, these methods are easy to have a greedy solution, which will lead to the overﬁtting problem(Lv et al., 2020). Maximum margin methods maximize the margin between the ground-truth label and other labels to make the ground-truth label gradually prominent (Nguyen & Caruana, 2008). As thoroughly discussed in Yao et al. (2020), these methods are difﬁcult to be calibrated in the later processing when false positive is selected in the current step.

To address aforementioned problems, we use the current prediction probability of the complementary label to make more use of highly possible complementary labels. Moreover, E-step and M-step are considered as a whole during the training procedure and weights can be updated easily as well. Specially, we set w = [w1, w2, . . . , wc], where wk is the k-th element of w and is deﬁned as

wk = 1 fk(X) Pc j=1(1 fj(X)). (13)

Note that Pc k=1 wk = 1 and wk 0. Let us explain the setting of w with a simple CLL task of three labels (c = 3). Let f(xi) = [0.1, 0.7, 0.2] denote the predicted posterior probability of the three labels for xi, we can infer that the ﬁrst label is the potential ground-truth label because f(xi) = 1 f(xi). By our setting of w, the weight vector in Eq.(13) become [0.1, 0.7, 0.2] as well, and we apply a smaller weight to treating the ground-truth label as the complementary label, and a larger weight to treating two other labels, especially the highly conﬁdent second label as complementary labels. Therefore, the potential groundtruth label will be prominent gradually as the increasing of the predictive gap between the potential ground-truth label and the complementary label of each instance. Accordingly, we add the weighted loss and the unweighted loss together, resulting in our targeted loss

ℓ(f(X), e Y ) =

k=1 (1 + λwk)ℓ(1 fk(X), e Y k ) (14)

and the empirical risk estimator for CLL is described as

Discriminative Complementary-Label Learning with Weighted Loss

MNIST, linear MNIST, MLP Fashion, linear

Fashion, MLP Kuzushiji, linear Kuzushiji, MLP

Figure 1. The experimental results on various test datasets with different loss functions and models for 300 epochs. The dark color is the mean accuracy and the light color corresponds to the std.

k=1 (1 + λwk i )ℓ(1 fk(xi), e yi k ) (15)

where wk i corresponds to the k-th element of loss weight vector wi for instance xi. λ is the tradeoff parameter between the original loss function and the weighted loss function, and we set it simply as 1. The overall procedure of the proposed approach is shown in Algorithm 1.

4. Experiments

In this section, we evaluate the performance of the proposed approach with comparative studies against state-ofthe-art complementary-label learning approaches. We use L-UW and L-W to denote the proposed CLL approach instantiated with complementary loss function in Eq.(5) and Eq.(14) respectively. Due to f is directly applied to a model training, which will result in the gradient diffusion problem of a model, we employ the constraint Pc k=1 fk(X) = 1, where softmax function is used to normalize f to make f satisfy the constraint. Then, f can be immediately deﬁned as f = softmax(1 f), where fk = exp(1 fk)/ Pc j=1 exp(1 fk). The cross-entropy loss is commonly applied to multi-class classiﬁcation tasks, which is adopted to replace ℓin this paper. All experiments are implemented based on Py Torch (Paszke et al., 2019) and Colab 1. The code is available at https://github.com/Yolkjustlike/complementary-label-learning.

1https://colab.research.google.com

4.1. Experimental Settings

Datasets Following Ishida et al. (2017; 2019); Yu et al. (2018); Feng et al. (2020), three widely-used benchmark datasets, namely MNIST (Lecun et al., 1998), Fashion MNIST (Fashion) (Xiao et al., 2017), and Kuzushiji-MNIST (Kuzushiji) (Clanuwat et al., 2018), are used for experimental studies.

MNIST dataset (Lecun et al., 1998) is a handwritten digits dataset that consists of 10 classes, which has 60,000 training examples and 10,000 test examples.

Fashion dataset is collected by Xiao et al. (2017) from standardized images of fashion items, which has 60,000 training images and 10,000 test images from 10 classes.

The size of Kuzushiji dataset (Clanuwat et al., 2018) is similar to MNIST dataset. Kuzushiji dataset derives from Kuzushiji which includes 60,000 training images and 10,000 test images from 10 classes.

Base models Two base models are utilized: linear model and MLP model (d 500 c).

Baselines We employ four state-of-the-art CLL approaches to be compared with, including Pairwise Comparison (PC) with sigmoid loss (Ishida et al., 2017), forward loss correction (Forward) (Yu et al., 2018), Gradient Ascent (GA)

Discriminative Complementary-Label Learning with Weighted Loss

MNIST, linear MNIST, MLP Fashion, linear

Fashion, MLP Kuzushiji, linear Kuzushiji, MLP

Figure 2. Empirical risk minimization procedure for various models and loss functions.

Table 1. Test accuracy (mean std) out of 10 trials (in %), where data with unbiased complementary labels is used to train. The best performance on each data set is shown in boldface.

Dataset Model PC Forward GA NN L-UW L-W

MNIST linear 82.31 0.72 90.42 0.17 83.23 0.43 84.56 0.31 89.98 0.20 90.22 0.11

MLP 84.04 0.55 91.93 0.25 92.49 0.25 89.99 0.42 92.45 0.24 92.08 0.28

Fashion linear 75.29 0.83 81.14 0.20 77.41 0.30 78.32 0.31 81.79 0.22 82.04 0.21

MLP 77.55 0.39 82.31 0.24 81.62 0.19 80.29 0.47 83.15 0.20 83.40 0.32

Kuzushiji linear 54.57 1.13 60.57 0.42 52.52 1.12 55.27 0.85 60.87 0.48 61.29 0.31

MLP 59.32 0.59 65.59 0.54 69.56 0.53 65.44 0.51 65.17 1.43 66.98 1.63

Table 2. The Win/Loss statistics for the proposed approach of Table 1. If our approach outperforms comparison baselines, add 1 to the count of ours; otherwise, add 1 to the comparison baselines.

Baselines PC Forward GA NN L-UW 6/0 4/2 4/2 5/1 L-W 6/0 5/1 4/2 6/0

(Ishida et al., 2019) and Non-Negative loss (NN) (Ishida et al., 2019).

4.2. Comparison on Unbiased Complementary Labels

Setup Weight decay is set as 1e-4 and learning rate of 5e-5 is used for MNIST, Fashion and Kuzushiji. Adam (Kingma & Ba, 2015) optimization method is applied. For all datasets, the number of epoch and mini-batch size are set as 300 and

Table 3. The Win/Loss statistics for the proposed approach of Table 4. If our approach outperforms comparison baselines, add 1 to the count of ours; otherwise, add 1 to the comparison baselines.

Baselines PC Forward GA L-UW 14/4 13/5 15/3 L-W 15/3 16/2 15/3

256 respectively.

We divide the original training dataset into training and validation parts with proportion 9/1, where complementary labels are generated by randomly choosing one of the labels other than the ground-truth one (unbiased complementarylabel generation). Test set with ordinary labels is used to evaluate the performance of each comparing approach. The mean and standard deviation (std) of test accuracy out of 10

Discriminative Complementary-Label Learning with Weighted Loss

Table 4. Test accuracy (mean std) on three datasets out of 5 trials (in %), where data with biased complementary labels is used to train. The best performance on each data set is shown in boldface.

Set 1 Baselines PC Forward GA L-UW L-W

MNIST linear 19.66 0.28 19.54 0.58 9.86 0.15 18.23 0.17 18.57 0.55 MLP 19.34 0.69 20.44 0.15 9.80 0.00 19.46 0.34 21.13 2.06

Fanshion linear 10.65 1.11 12.71 2.73 10.01 0.17 13.73 0.28 14.40 0.55 MLP 15.40 3.55 16.94 0.37 10.06 0.87 20.81 1.32 21.77 0.93

Kuzushiji linear 14.10 0.80 13.40 0.88 10.63 0.30 13.17 0.62 13.83 0.24 MLP 14.43 0.87 12.97 0.97 10.22 0.38 13.56 0.59 14.75 0.10 Set 2 Baselines PC Forward GA L-UW L-W

MNIST linear 19.69 0.63 20.31 0.10 10.19 0.16 23.55 2.05 23.67 0.74 MLP 22.59 2.32 20.44 0.20 10.09 0.00 23.35 0.66 126.76 2.00

Fanshion linear 12.43 2.51 12.71 2.73 9.75 0.32 17.30 0.35 20.14 0.70 MLP 15.41 3.74 16.94 0.37 10.29 0.50 23.54 0.93 23.17 0.56

Kuzushiji linear 12.73 0.14 13.35 0.40 9.89 0.57 12.43 0.27 12.51 0.25 MLP 14.93 1.03 12.71 0.66 10.00 0.00 16.45 0.26 17.28 0.45 Set 3 Baselines PC Forward GA L-UW L-W

MNIST linear 72.22 1.43 78.53 4.41 78.55 0.80 81.16 0.12 79.72 0.27 MLP 84.46 0.23 80.67 5.34 85.13 0.10 84.98 0.10 85.91 0.11

Fanshion linear 58.34 0.61 60.28 3.76 65.62 0.17 59.71 4.08 61.57 0.45 MLP 58.00 0.91 61.92 3.56 63.80 0.07 62.19 0.13 63.11 0.12

Kuzushiji linear 46.54 0.43 54.41 1.95 50.16 0.41 54.47 1.10 57.85 2.32 MLP 51.85 1.58 51.05 1.52 52.24 0.72 52.56 3.62 52.02 3.68

trials on the model that corresponds to the best validation score on 300 epochs are shown in Table 1.

Results We show the mean and std of test accuracy for 300 epochs on MNIST, Fanshion, and Kuzushiji in Figure 1. Figure 2 illustrates corresponding empirical risk of PC (Ishida et al., 2017), Forward (Yu et al., 2018), GA (Ishida et al., 2019), NN (Ishida et al., 2019), L-UW and L-W on three benchmark datasets for all epochs during the process of training.

Based on the reported results in Figure 1, we can observe that the proposed discriminative CLL approach achieves better or at least comparable performance against the comparing approaches on different datasets. As shown in Figure 1, the std of GA is greater than that of L-UW and L-W, which demonstrates that the performance of our approaches is more stable than GA on different training data partitioning. Furthermore, the test accuracy of approaches in later training epochs gradually decrease when the more complex models are applied, which is especially prominent under the case of using MLP model. This is because overparameterized deep neural networks are available to make the training loss go zero via memorizing training data, while the model becomes overconﬁdent with a weak generalization performance that result in the degraded test performance (Ishida et al., 2020).

The viewpoint as mentioned earlier is reﬂected in Figure 2 as well, all approaches work normally with linear base model on MNIST, Fashion and Kuzushiji, while empirical risk of URE-based methods, such as PC and NN, goes zero or even negative when MLP model is applied (Ishida et al., 2019). We observe that the test accuracy of PC starts decreasing when its empirical risk trends to negative. In comparison, the gradient ascent trick is used in GA when the empirical risk approaches negative, which saves GA from the deteriorated performance.

In Table 1, we report the mean and std of the classiﬁcation accuracy on test data out of 10 trials, where Table 2 shows the Win/Loss statistics of the proposed approach outperforming other baselines. The following observations can be made based on Table 1 and Table 2: 1) L-UW (without weighted loss term) achieves comparable test accuracy to PC, Forward, GA and NN on different datasets, which indicates that the proposed simple discriminative model (Eq.(5)) serves as a feasible solution to CLL problem; 2) L-W (with weighted loss term) works well under all cases; it shows that the introduction of weighted loss to the discriminative model does help improve the generalization performance by maximizing the predictive gap between potential groundtruth label and complementary label.

Discriminative Complementary-Label Learning with Weighted Loss

MNIST, linear, set 1 MNIST, linear, set 2 MNIST, linear, set 3

Fashion, linear, set 1 Fashion, linear, set 2 Fashion, linear, set 3

Kuzushiji, linear, set 1 Kuzushiji, linear, set 2 Kuzushiji, linear, set 3

Figure 3. The experimental results on various biased settings on the linear model for 300 epochs. The dark color is the mean accuracy and the light color corresponds to the std.

4.3. Comparison on Biased Complementary Labels

Setup We use training data which is associated with biased complementary labels to evaluate the effectiveness of our approach. The biased complementary-label generation is similar to Yu et al. (2018). More speciﬁcally, we adopt three settings to generate biased complementary labels. For all settings, the complementary label is selected from Y \ {y}, which is divided into three subsets randomly, each including three class labels. For set 1: the selected probabilities of each complementary label in three subsets are 0.75/3, 0.24/3 and 0.01/3 respectively; the selected probabilities of that are 0.66/3, 0.24/3 and 0.1/3 respectively for set 2; for set 3, the probability of each label is selected as the complementary label at probabilities 0.45/3, 0.3/3 and 0.25/3.

We utilize the training dataset associated with biased complementary labels to train the model, while test set with

ordinary labels is applied to evaluate the performance of approaches. The other experimental settings are same with Subsection 4.2. The mean and std of test accuracy out of 5 trials on the model that corresponds to the best validation score on 300 epochs are shown in Table 4. Table 3 is used to count the Win/Loss results of the proposed approach that is superior to other baselines.

Results From the results shown in Table 4, we can ﬁnd that the test accuracy of all approaches has improved as the biased degree of complementary labels decreasing, which also demonstrates that the performance of approaches will suffer from the non-uniform selection of complementary labels. Furthermore, experimental results in Table 4 and Table 3 show that our proposal is better to other baselines in most cases.

Figure 3 and Figure 4 illustrate the mean and std of test accuracy for all epochs on the linear model and MLP model

Discriminative Complementary-Label Learning with Weighted Loss

MNIST, MLP, set 1 MNIST, MLP, set 2 MNIST, MLP, set 3

Fashion, MLP, set 1 Fashion, MLP, set 2 Fashion, MLP, set 3

Kuzushiji, MLP, set 1 Kuzushiji, MLP, set 2 Kuzushiji, MLP, set 3

Figure 4. The experimental results on various biased settings on the MLP model for 300 epochs. The dark color is the mean accuracy and the light color corresponds to the std.

respectively with different biased settings. As shown in Figure 3 and Figure 4, L-W gets comparable test accuracy on various biased setting to Forward when the biased transition matrix with no additional information is available for Forward. Moreover, the ﬂuctuation frequency of L-UW and L-W is less than that of GA in Figure 3, which indicates that L-UW and L-W have a stable performance in the biased complementary-label case. Due to the corresponding empirical risk of biased setting follow the same trend as the unbiased one, it is put in the Appendix.

5. Conclusion

In this paper, a risk estimator with guaranteed estimation error bound based on discriminative model is proposed for CLL. It estimates the complementary label predictions P( y | x) by the output of discriminative classiﬁers with sound theoretical properties. Accordingly, the weighted

loss which makes use of the output of current classiﬁcation model during the training procedure is further introduced to the classiﬁcation risk to yield the empirical risk for CLL model training. The effectiveness of the proposed discriminative CLL model is clearly validated with extensive comparative studies over benchmark datasets.

Acknowledgements

The authors wish to thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Key R&D Program of China (2018YFB1004300), and the China University S&T Innovation Plan Guided by the Ministry of Education. We thank the Big Data Center of Southeast University for providing the facility support on the numerical calculations in this paper.

Discriminative Complementary-Label Learning with Weighted Loss

Calder, J., Cook, B., Thorpe, M., and Slepcev, D. Poisson learning: Graph based semi-supervised learning at very low label rates. In Proceedings of the 37th International Conference on Machine Learning, pp. 1306 1316, Virtual Event, 2020.

Chapel, L., Alaya, M. Z., and Gasso, G. Partial optimal tranport with applications on positive-unlabeled learning. In Advances in Neural Information Processing Systems 33, Virtual Event, 2020.

Chapelle, O., Sch olkopf, B., and Zien, A. Introduction to Semi-Supervised Learning. The MIT Press, 2006.

Chou, Y.-T., Niu, G., Lin, H.-T., and Sugiyama, M. Unbiased risk estimators can mislead: A case study of learning with complementary labels. In Proceedings of the 37th International Conference on Machine Learning, pp. 1929 1938, Virtual Event, 2020.

Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., and Ha, D. Deep learning for classical japanese literature. Co RR, abs/1812.01718, 2018.

du Plessis, M. C., Niu, G., and Sugiyama, M. Analysis of learning from positive and unlabeled data. In Advances in Neural Information Processing Systems 27, pp. 703 711, Montreal, Canada, 2014.

Feng, L., Kaneko, T., Han, B., Niu, G., An, B., and Sugiyama, M. Learning with multiple complementary labels. In Proceedings of the 37th International Conference on Machine Learning, pp. 3072 3081, Virtual Event, 2020.

Ghosh, A., Kumar, H., and Sastry, P. S. Robust loss functions under label noise for deep neural networks. In Proceedings of the 31st AAAI Conference on Artiﬁcial Intelligence, pp. 1919 1925, San Francisco, CA, 2017.

Golovnev, A., P al, D., and Sz or enyi, B. The informationtheoretic value of unlabeled data in semi-supervised learning. In Proceedings of the 36th International Conference on Machine Learning, pp. 2328 2336, Long Beach, CA, 2019.

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. C., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672 2680, Montreal, Canada, 2014.

Hammoudeh, Z. and Lowd, D. Learning from positive and unlabeled data with arbitrary positive shift. In Advances in Neural Information Processing Systems 33, Virtual Event, 2020.

Han, B., Niu, G., Yu, X., Yao, Q., Xu, M., Tsang, I. W., and Sugiyama, M. SIGUA: forgetting may make learning with noisy labels more robust. In Proceedings of the 37th International Conference on Machine Learning, pp. 4006 4016, Virtual Event, 2020.

Ishida, T., Niu, G., Hu, W., and Sugiyama, M. Learning from complementary labels. In Advances in Neural Information Processing Systems 30, pp. 5639 5649, Long Beach, CA, 2017.

Ishida, T., Niu, G., Menon, A. K., and Sugiyama, M. Complementary-label learning for arbitrary losses and models. In Proceedings of the 36th International Conference on Machine Learning, pp. 2971 2980, Long Beach, CA, 2019.

Ishida, T., Yamane, I., Sakai, T., Niu, G., and Sugiyama, M. Do we need zero training loss after achieving zero training error? In Proceedings of the 37th International Conference on Machine Learning, pp. 4604 4614, Virtual Event, 2020.

Izmailov, P., Kirichenko, P., Finzi, M., and Wilson, A. G. Semi-supervised learning with normalizing ﬂows. In Proceedings of the 37th International Conference on Machine Learningt, pp. 4615 4630, Virtual Event, 2020.

Jin, R. and Ghahramani, Z. Learning with multiple labels. In Advances in Neural Information Processing Systems 15, pp. 897 904, Vancouver, Canada, 2002.

Kim, Y., Yim, J., Yun, J., and Kim, J. NLNL: Negative learning for noisy labels. In Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, pp. 101 110, Seoul, South Korea, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, 2015.

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998.

Liu, L. and Dietterich, T. G. A conditional multinomial mixture model for superset label learning. In Advances in Neural Information Processing Systems 25, pp. 557 565, Lake Tahoe, NV, 2012.

Liu, Y. and Guo, H. Peer loss functions: Learning from noisy labels without knowing noise rates. In Proceedings of the 37th International Conference on Machine Learning, pp. 6226 6236, Virtual Event, 2020.

Lu, N., Niu, G., Menon, A. K., and Sugiyama, M. On the minimal supervision for training any binary classiﬁer

Discriminative Complementary-Label Learning with Weighted Loss

from only unlabeled data. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, 2019.

Lv, J., Xu, M., Feng, L., Niu, G., Geng, X., and Sugiyama, M. Progressive identiﬁcation of true labels for partiallabel learning. In Proceedings of the 37th International Conference on Machine Learning, pp. 6500 6510, Virtual Event, 2020.

Ma, X., Wang, Y., Houle, M. E., Zhou, S., Erfani, S. M., Xia, S., Wijewickrema, S. N. R., and Bailey, J. Dimensionalitydriven learning with noisy labels. In Proceedings of the 35th International Conference on Machine Learning, pp. 3361 3370, Stockholm, Sweden, 2018.

Mc Diarmid, C. On the method of bounded differences. Surveys in Combinatorics, pp. 148 188, 2013.

Mirza, M. and Osindero, S. Conditional generative adversarial nets. Co RR, abs/1411.1784, 2014.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2012.

Nguyen, N. and Caruana, R. Classiﬁcation with partial labels. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 24-27, 2008, pp. 551 559, Las Vegas, NV, 2008.

Oliver, A., Odena, A., Raffel, C., Cubuk, E. D., and Goodfellow, I. J. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems 31, pp. 3239 3250, Montr eal, Canada, 2018.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., K opf, A., Yang, E., De Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024 8035, Vancouver, Canada, 2019.

Sakai, T., Niu, G., and Sugiyama, M. Semi-supervised AUC optimization based on positive-unlabeled learning. Machine Learning, 107(4):767 794, 2018.

Su, G., Chen, W., and Xu, M. Positive-unlabeled learning from imbalanced data. In Proceedings of the 30th International Joint Conference on Artiﬁcial Intelligence, Virtual Event, 2021.

Wu, J. and Zhang, M. Disambiguation enabled linear discriminant analysis for partial label dimensionality reduction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 416 424, Anchorage, AK, 2019.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. Co RR, abs/1708.07747, 2017.

Xu, Y., Gong, M., Chen, J., Liu, T., Zhang, K., and Batmanghelich, K. Generative-discriminative complementary learning. In Proceedings of the 34th AAAI Conference on Artiﬁcial Intelligence, pp. 6526 6533, New York, NY, 2020.

Yao, Y., Deng, J., Chen, X., Gong, C., Wu, J., and Yang, J. Deep discriminative CNN with temporal ensembling for ambiguously-labeled image classiﬁcation. In Proceedings of the 34th AAAI Conference on Artiﬁcial Intelligence, pp. 12669 12676, New York, NY, 2020.

Yu, X., Liu, T., Gong, M., and Tao, D. Learning with biased complementary labels. In Proceedings of the 15th European Conference on Computer Vision, pp. 69 85, Munich, Germany, 2018.

Zhang, T. Statistical analysis of some multi-category large margin classiﬁcation methods. Journal of Machine Learning Research, 5:1225 1251, 2004.

Zhang, Z. and Sabuncu, M. R. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems 31, pp. 8792 8802, Montr eal, Canada, 2018.

Zhou, Z. H. A brief introduction to weakly supervised learning. National Science Review, (1):1, 2017.