# Learning with Multiple Complementary Labels

Lei Feng\*1, Takuo Kaneko\*2,3, Bo Han4,3, Gang Niu3, Bo An1, Masashi Sugiyama3,2

Abstract

A complementary label (CL) simply indicates an incorrect class of an example, but learning with CLs results in multi-class classifiers that can predict the correct class. Unfortunately, the problem setting only allows a single CL for each example, which notably limits its potential since our labelers may easily identify multiple CLs (MCLs) for one example. In this paper, we propose a novel problem setting that allows MCLs for each example, together with two ways of learning with MCLs. In the first way, we design two wrappers that decompose MCLs into many single CLs, so that we can use any method for learning with CLs. However, the supervision information that MCLs hold is conceptually diluted after decomposition. Thus, in the second way, we derive an unbiased risk estimator; minimizing it processes each set of MCLs as a whole and possesses an estimation error bound. We further improve the second way into minimizing properly chosen upper bounds. Experiments show that the former way works well for learning with MCLs, but the latter is even better.

1. Introduction

Ordinary machine learning tasks generally require massive data with accurate supervision information, while it is expensive and time-consuming to collect data with high-quality labels. To alleviate this problem, researchers have studied various weakly supervised learning frameworks (Zhou, 2018), including semi-supervised learning (Chapelle et al., 2006; Li & Liang, 2019; Miyato et al., 2018; Niu et al., 2013; Zhu & Goldberg, 2009), positive-unlabeled learning (du Plessis et al., 2014; 2015; Elkan & Noto, 2008; Kiryo et al., 2017; Sakai et al., 2017; 2018), noisy-label learning (Han et al., 2018a;b; Menon et al., 2015; Wei et al., 2020; Xia et al., 2019), partial label learning (Cour et al., 2011; Feng & An, 2018; 2019a;b; Zhang & Yu, 2015), positive-confidence learning (Ishida et al., 2018), similar-unlabeled learning (Bao et al., 2018), and unlabeled-unlabeled classification (Lu et al., 2019; 2020).

\*Equal contribution. Work done when LF was an intern at RIKEN AIP and TK belonged to The University of Tokyo and RIKEN AIP. 1School of Computer Science and Engineering, Nanyang Technological University, Singapore; 2The University of Tokyo; 3RIKEN Center for Advanced Intelligence Project; 4Department of Computer Science, Hong Kong Baptist University. Correspondence to: Lei Feng. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Here, we consider another weakly supervised classification framework called complementary-label learning (Ishida et al., 2017; 2019; Yu et al., 2018). In complementary-label learning, each training example is supplied with a complementary label (CL), which specifies one of the classes that the example does not belong to. Compared with ordinary labels, it is obviously easier to collect CLs. Recently, complementary-label learning has been applied to online learning (Kaneko et al., 2019) and medical image segmentation (Rezaei et al., 2019). In addition, another potential application of learning with CLs is data privacy. For example, collecting some survey data may require answers to extremely private questions (Ishida et al., 2017; 2019). It may be difficult to directly obtain the true answer (label) to such a question. Nonetheless, it would be mentally less demanding for a respondent to provide some incorrect answers. Moreover, the respondent may provide multiple incorrect answers, rather than exactly one. In this case, multiple complementary labels (MCLs) would be more widespread than a single CL.
In this paper, we propose a novel problem setting (Section 3.1) that allows MCLs for each example, and provide a real-world motivation (Section 3.2). Although existing complementary-label learning approaches (Ishida et al., 2017; 2019; Yu et al., 2018) have provided solid theoretical foundations and achieved promising performance, they are all restricted to the case where each example is associated with a single CL. To learn with MCLs, we first design two wrappers (Section 4.1) that decompose each example with MCLs into multiple examples, each with a single CL, in different manners. With the two wrappers, we are able to use arbitrary ordinary complementary-label learning approaches for learning with MCLs. However, the derived data with many single CLs may not match the assumed data distribution for complementary-label learning (Ishida et al., 2017; 2019). In addition, the supervision information would be conceptually diluted after decomposition. To solve these problems, we further propose an unbiased risk estimator (Section 4.2) for learning with MCLs, which processes each set of MCLs as a whole. Our risk estimator is conceptually consistent, and builds a prototype baseline for the new problem setting that may inspire more specially designed methods in the future. We then theoretically derive an estimation error bound, which guarantees that the empirical risk minimizer converges to the true risk minimizer with high probability as the number of training data approaches infinity. Furthermore, we improve the risk estimator into minimizing properly chosen upper bounds for practical implementation (Section 4.3), and show that they bring benefits to the gradient update. Experimental results show that the wrappers work well for learning with MCLs, while the (improved) risk estimator is even better on various benchmark datasets.

2. Related Work

In this section, we introduce some notation and briefly review the formulations of multi-class classification and complementary-label learning.

2.1. Multi-Class Classification

Suppose the feature space is $\mathcal{X} \subseteq \mathbb{R}^d$ with $d$ dimensions and the label space is $\mathcal{Y} = \{1, 2, \ldots, k\}$ with $k$ classes. An instance $x \in \mathcal{X}$ with its class label $y \in \mathcal{Y}$ is sampled from an unknown probability distribution with density $p(x, y)$. Ordinary multi-class classification aims to induce a learning function $f(x): \mathbb{R}^d \rightarrow \mathbb{R}^k$ that minimizes the classification risk:

$$R(f) = \mathbb{E}_{p(x,y)}\big[\mathcal{L}\big(f(x), y\big)\big], \quad (1)$$

where $\mathcal{L}(f(x), y)$ is a multi-class loss function. The predicted label is given as $\hat{y} = \arg\max_{y \in \mathcal{Y}} f_y(x)$, where $f_y(x)$ is the $y$-th coordinate of $f(x)$.

2.2. Complementary-Label Learning

Suppose the dataset for complementary-label learning is denoted by $\{(x_i, \bar{y}_i)\}_{i=1}^n$, where $\bar{y}_i \in \mathcal{Y}$ is a complementary label of $x_i$, and each complementarily labeled example is sampled from $\bar{p}(x, \bar{y})$. Ishida et al. (2017; 2019) assumed that $\bar{p}(x, \bar{y})$ is expressed as:

$$\bar{p}(x, \bar{y}) = \frac{1}{k-1} \sum\nolimits_{y \neq \bar{y}} p(x, y). \quad (2)$$

This assumption implies that all labels other than the correct label are chosen to be the complementary label with uniform probability. This is reasonable, as we do not have extra labeling information beyond a complementary label. Under this assumption, Ishida et al. (2017) proved that an unbiased estimator of the original classification risk can be obtained from only complementarily labeled data, when the loss function satisfies certain conditions. Specifically, they used the multi-class loss functions with the one-versus-all strategy and the pairwise-comparison strategy (Zhang, 2004):

$$\bar{\mathcal{L}}_{\mathrm{OVA}}\big(f(x), \bar{y}\big) = \frac{1}{k-1} \sum\nolimits_{y' \neq \bar{y}} \ell\big(f_{y'}(x)\big) + \ell\big(-f_{\bar{y}}(x)\big),$$

$$\bar{\mathcal{L}}_{\mathrm{PC}}\big(f(x), \bar{y}\big) = \sum\nolimits_{y' \neq \bar{y}} \ell\big(f_{y'}(x) - f_{\bar{y}}(x)\big),$$

where $\ell(z)$ is a binary loss function that satisfies $\ell(z) + \ell(-z) = 1$, such as the sigmoid loss $\ell_{\mathrm{S}}(z) = \frac{1}{1+e^{z}}$ and the ramp loss $\ell_{\mathrm{R}}(z) = \frac{1}{2}\max\big(0, \min(2, 1-z)\big)$. Later, a different assumption was used by Yu et al. (2018).
They assumed that all labels other than the correct label are chosen to be the complementary label with different probabilities, and proposed to estimate the class transition probability matrix for model training. Although they showed that the minimizer of their learning objective coincides with the minimizer of the original classification risk, they did not provide an unbiased risk estimator. Recently, a more general unbiased risk estimator (Ishida et al., 2019) was proposed, which does not rely on specific losses or models. Their formulation is as follows:

$$\bar{\mathcal{L}}_{\mathrm{FREE}}\big(f(x), \bar{y}\big) = \sum\nolimits_{y=1}^{k} \mathcal{L}\big(f(x), y\big) - (k-1)\,\mathcal{L}\big(f(x), \bar{y}\big),$$

with the corresponding unbiased risk

$$R(f) = \mathbb{E}_{\bar{p}(x,\bar{y})}\big[\bar{\mathcal{L}}_{\mathrm{FREE}}\big(f(x), \bar{y}\big)\big]. \quad (3)$$

For this formulation, they showed that due to the negative term, the empirical risk could be unbounded below, which leads to over-fitting. To alleviate this issue, the authors further proposed modified versions using the max operator and the gradient ascent strategy. In summary, although the above methods have provided solid theoretical foundations and achieved promising performance for complementary-label learning, they are all restricted to the case where each example is associated with a single CL. In this paper, we propose a novel problem setting that allows MCLs for each example.

3. Multiple Complementary Labels

In this section, we first introduce our problem setting where each example is associated with MCLs, and then provide a corresponding real-world motivation.

3.1. Data Generation Process

Suppose the given dataset for learning with MCLs is represented as $\bar{D} = \{(x_i, \bar{Y}_i)\}_{i=1}^n$, where $\bar{Y}_i$ is a set of complementary labels for the instance $x_i$. It is obvious that learning with MCLs is a generalization of complementary-label learning, which learns with a single CL. Specifically, if
In addition, if s Yi contains k 1 complementary labels where k denotes the total number of classes, we obtain an ordinary multi-class classification problem. It is easy to know that for all i, s Yi cannot be the empty set nor the full label set, hence s Yi P s Y where s Y t2Y H Yu and | s Y| 2k 2. For the generation process of each example with MCLs, we assume that it relies on the size of the set of MCLs. Let us represent the size of the complementary label set by a random variable s, and assume s is sampled from a distribution ppsq. In this way, we assume that each training example pxi, s Yiq is drawn from the following data distribution: sppx, s Y q ÿk 1 j 1 pps jqsppx, s Y | s jq, (4) sppx, s Y | s jq : y R s Y ppx, yq, if |s Y | j, 0, otherwise. It is clear that when pps 1q 1, our introduced distribution reduces to the assumed distribution (e.g., Eq. (2)) in ordinary complementary-label learning approaches (Ishida et al., 2017; 2019). Then, we show that sppx, s Y q is a valid probability distribution by the following theorem. Theorem 1. The following equality holds: ż X sppx, s Y qdx ds Y 1. (5) The proof is provided in Appendix A.1. 3.2. Real-World Motivation Here, we present a real-world motivation for the assumed data distribution. Since directly choosing the correct label is hard for labelers, it would be easier if a labeling system can randomly choose a label set and ask labelers whether the correct label is included in the proposed label set or not. Given a pattern x, suppose the labeling system first randomly samples the size s of the proposed label set from ppsq, and then randomly and uniformly chooses a specific label set with size s from s Y. In this way, the collected label sets that do not include the correct label precisely follow the same distribution as Eq. (4). We will demonstrate this fact in the following. We start by considering the case where the labeling system has already sampled the size s of the proposed label set. 
Then we have the following lemma.

Lemma 1. Given the sampled size $s$ of the proposed label set, for any pattern $x$ with its correct label $y$ and any label set $\bar{Y}$ with size $s$ (i.e., $|\bar{Y}| = s$), the following equality holds:

$$p(y \in \bar{Y} \mid x, s) = \frac{s}{k}. \quad (6)$$

The proof is provided in Appendix A.2.

Theorem 2. In the above setting, the distribution of collected data where the correct label $y$ ($y \in \mathcal{Y}$) is not included in the label set $\bar{Y}$ ($\bar{Y} \in \bar{\mathcal{Y}}$) is the same as Eq. (4), i.e.,

$$p(x, \bar{Y} \mid y \notin \bar{Y}) = \bar{p}(x, \bar{Y}). \quad (7)$$

The proof is provided in Appendix A.3.

4. Learning with Multiple Complementary Labels

In this section, we first present two wrappers that enable us to use any ordinary complementary-label learning approach for learning with MCLs. Then, we present an unbiased risk estimator that learns with MCLs as a whole, and establish an estimation error bound.

4.1. Wrappers

Since ordinary complementary-label learning approaches cannot directly deal with MCLs, it is natural to ask whether there exist strategies that enable us to use any existing complementary-label learning approach for learning with MCLs. Motivated by this, we propose two wrappers that decompose each example with MCLs into multiple examples, each with a single CL. Specifically, suppose a training example with MCLs is given as $(x_i, \bar{Y}_i)$ where $\bar{Y}_i = \{\bar{y}_1, \bar{y}_2\}$. Then ordinary complementary-label learning approaches may learn from $(x_i, \bar{y}_1)$ and $(x_i, \bar{y}_2)$. According to whether decomposition happens before or after shuffling the training set, there are two decomposition strategies (wrappers) when we optimize a loss function by a stochastic optimization algorithm:

Decomposition after Shuffle. Given the shuffled training set with MCLs, in each mini-batch, we decompose each example into multiple examples, each with a single CL.

Decomposition before Shuffle. Given the training set with MCLs, we derive a new training set by decomposing each example into multiple examples, each with a single CL.
Then, we shuffle the derived training set. Both of the above decomposition strategies enable us to use arbitrary ordinary complementary-label learning approaches for learning with MCLs. However, the derived training data with many single CLs may not match the originally assumed data distribution (i.e., Eq. (2)) for complementary-label learning, since these CLs are completely derived from MCLs, while the data distribution with MCLs depends on the size of each set of MCLs. As a consequence, the learning consistency would no longer be guaranteed, even if the complementary-label learning approach inside the wrappers is originally risk-consistent or classifier-consistent. Moreover, since ordinary complementary-label learning approaches can only learn with a single CL for each example at a time and treat each example independently, the supervision information for each set of MCLs would be conceptually diluted. We demonstrate this issue in Table 1.

Table 1. Supervision information for a set of MCLs (with size $s$).

| Setting | #TP | #FP | Supervision Purity |
| --- | --- | --- | --- |
| Many single CLs | $s$ | $(k-2)s$ | $1/(k-1)$ |
| A set of MCLs | $1$ | $k-s-1$ | $1/(k-s)$ |

As shown in Table 1, there are two settings according to whether a set of MCLs is decomposed into many single CLs or not. Since all the non-complementary labels have the possibility of being the correct label, we specifically count how many times the correct label serves as a non-complementary label (denoted by #TP), and how many times the labels other than the correct label serve as a non-complementary label (denoted by #FP). The supervision purity is then calculated as (#TP)/(#TP+#FP). Clearly, the wrappers follow the setting where a set of MCLs is decomposed into many single CLs.
If the size of the set of MCLs is $s$, then #TP equals $s$, since the correct label serves as a non-complementary label $s$ times after decomposition, while the other labels serve as a non-complementary label $(k-s-1)s + s(s-1) = (k-2)s$ times in total; hence the supervision purity is $s/(s + (k-2)s) = 1/(k-1)$. However, for the setting where the set of MCLs is not decomposed, we can easily see that the correct label serves as a non-complementary label once, and the labels other than the correct label serve as a non-complementary label $k-s-1$ times; hence the supervision purity is $1/(k-s)$. These observations clearly show that the supervision information is diluted by decomposing MCLs (for $s \geq 2$), which also motivates us to treat each set of MCLs as a whole. In the following, we introduce our proposed unbiased risk estimator, which is able to learn with MCLs as a whole.

4.2. Unbiased Risk Estimator

The above example has shown that the supervision information is diluted after decomposition. The basic reason is that ordinary complementary-label learning approaches are designed by only considering the data distribution with a single CL, i.e., $\bar{p}(x, \bar{y})$. However, the data distribution with MCLs, $\bar{p}(x, \bar{Y})$, is quite different, and the wrappers fail to capture this distribution because they do not treat the MCLs of each example as a whole. To solve this problem, we propose an unbiased estimator of the original classification risk for learning with MCLs as a whole. We first relate the data distribution with ordinary labels to that with MCLs by the following lemma.

Lemma 2. The following equality holds:

$$p(x, y) = \sum\nolimits_{j=1}^{k-1} \Big( \sum\nolimits_{|\bar{Y}|=j,\, y \notin \bar{Y}} \bar{p}(x, \bar{Y}, s=j) \;-\; \frac{k-1-j}{j} \sum\nolimits_{\bar{Y} \in \bar{\mathcal{Y}}_j^y} \bar{p}(x, \bar{Y}, s=j) \Big),$$

where $\bar{\mathcal{Y}}_j^y$ is the set of all possible label sets with size $j$ that include a specific label $y \in \mathcal{Y}$, i.e., $\bar{\mathcal{Y}}_j^y := \{\bar{Y} \in \bar{\mathcal{Y}} \mid y \in \bar{Y},\, |\bar{Y}| = j\}$. The proof is provided in Appendix B.1. Based on Lemma 2, we derive an unbiased estimator of the ordinary classification risk Eq.
(1) by the following theorem.

Theorem 3. The ordinary classification risk Eq. (1) can be equivalently expressed as

$$R(f) = \sum\nolimits_{j=1}^{k-1} p(s=j)\, \bar{R}_j(f), \quad (8)$$

$$\bar{R}_j(f) := \mathbb{E}_{\bar{p}(x, \bar{Y} \mid s=j)}\big[\bar{\mathcal{L}}_j\big(f(x), \bar{Y}\big)\big], \quad (9)$$

$$\bar{\mathcal{L}}_j\big(f(x), \bar{Y}\big) := \sum\nolimits_{y \notin \bar{Y}} \mathcal{L}\big(f(x), y\big) - \frac{k-1-j}{j} \sum\nolimits_{y' \in \bar{Y}} \mathcal{L}\big(f(x), y'\big). \quad (10)$$

The proof is provided in Appendix B.2. It is easy to verify that Eq. (8) reduces to Eq. (3) when $p(s=1) = 1$; that is, our approach is a generalization of Ishida et al. (2019). Furthermore, according to Corollary 2 in Ishida et al. (2019), our approach is also a generalization of Ishida et al. (2017). Given the dataset with MCLs $\bar{D} = \{(x_i, \bar{Y}_i)\}_{i=1}^n$, we can empirically approximate $p(s=j)$ by $n_j/n$, where $n_j$ denotes the number of examples whose complementary label set has size $j$. By further taking into account Eqs. (8)-(10), we obtain the following empirical approximation of the unbiased risk estimator introduced in Theorem 3:

$$\widehat{R}(f) = \frac{1}{n} \sum\nolimits_{i=1}^{n} \Big( \sum\nolimits_{y \notin \bar{Y}_i} \mathcal{L}\big(f(x_i), y\big) - \frac{k-1-|\bar{Y}_i|}{|\bar{Y}_i|} \sum\nolimits_{y' \in \bar{Y}_i} \mathcal{L}\big(f(x_i), y'\big) \Big). \quad (11)$$

Estimation Error Bound. Here, we derive an estimation error bound for the proposed unbiased risk estimator based on Rademacher complexity (Bartlett & Mendelson, 2002). Let $\mathcal{F} \subset \{f : \mathbb{R}^d \rightarrow \mathbb{R}^k\}$ be the hypothesis class, $\hat{f} := \arg\min_{f \in \mathcal{F}} \widehat{R}(f)$ be the empirical risk minimizer, and $f^{\star} := \arg\min_{f \in \mathcal{F}} R(f)$ be the true risk minimizer. Besides, we define the functional space $\mathcal{G}_y$ for the label $y \in \mathcal{Y}$ as $\mathcal{G}_y = \{g : x \mapsto f_y(x) \mid f \in \mathcal{F}\}$. Then, we have the following theorem.

Theorem 4. Assume the loss function $\mathcal{L}(f(x), y)$ is $\rho$-Lipschitz with respect to $f(x)$ $(0 < \rho < \infty)$ for all $y \in \mathcal{Y}$. Let $C_{\mathcal{L}} = \sup_{x \in \mathcal{X}, f \in \mathcal{F}, y \in \mathcal{Y}} \mathcal{L}(f(x), y)$ and $\mathfrak{R}_n(\mathcal{G}_y)$ be the Rademacher complexity of $\mathcal{G}_y$ given the sample size $n$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,

$$R(\hat{f}) - R(f^{\star}) \leq \sum\nolimits_{j=1}^{k-1} p(s=j) \Big( 4\sqrt{2}\,(2k-2j-1)\,\rho \sum\nolimits_{y=1}^{k} \mathfrak{R}_{n_j}(\mathcal{G}_y) + \frac{C_j}{\sqrt{n_j}} \Big),$$

where $C_j = (4k-4j-2)\, C_{\mathcal{L}} \sqrt{\ln(2/\delta)/2}$ for all $j \in \{1, \ldots, k-1\}$, and $n_j$ denotes the number of examples whose complementary label set has size $j$. The definition of Rademacher complexity and the proof of Theorem 4 are provided in Appendix C.
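As a concrete illustration, the per-example loss in Eq. (10) and the empirical risk in Eq. (11) can be sketched in a few lines of plain Python (a minimal sketch with hypothetical names of our own; `L` stands for any multi-class loss called as `L(f, y)`):

```python
def mcl_loss(f, Ybar, L):
    """Per-example loss of Eq. (10): treats the whole set of
    multiple complementary labels Ybar as one supervision signal.

    f    -- list of k real-valued class scores
    Ybar -- set of complementary labels (classes the example is NOT in)
    L    -- multi-class loss function, called as L(f, y)
    """
    k, j = len(f), len(Ybar)
    positive = sum(L(f, y) for y in range(k) if y not in Ybar)
    negative = sum(L(f, y) for y in Ybar)
    return positive - (k - 1 - j) / j * negative

def empirical_risk(examples, L):
    """Empirical approximation of Eq. (11) over (scores, MCL-set) pairs."""
    return sum(mcl_loss(f, Ybar, L) for f, Ybar in examples) / len(examples)
```

When $|\bar{Y}_i| = 1$, the weight $(k-1-j)/j$ becomes $k-2$ and `mcl_loss` reduces to the single-CL formulation of Ishida et al. (2019), $\sum_{y=1}^{k} \mathcal{L}(f(x), y) - (k-1)\mathcal{L}(f(x), \bar{y})$, matching the reduction noted above.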
Theorem 4 shows that the empirical risk minimizer converges to the true risk minimizer with high probability as the number of training data approaches infinity. It is worth noting that this bound is related not only to the Rademacher complexity of the function class, but also to $s$ and $k$. This observation accords with our intuition that the learning task becomes harder if the number of classes $k$ increases or the size $s$ of the complementary label set decreases.

4.3. Practical Implementation

In this section, we present the practical implementation of our proposed formulation and improvements of the used loss functions. As described above, we have provided a general unbiased risk estimator that can use arbitrary loss functions. A question arises: can all loss functions work well in our approach? Unfortunately, the answer is negative. The original classification risk estimator in Eq. (1) includes an expectation over a non-negative loss $\mathcal{L} : \mathbb{R}^k \times [k] \rightarrow \mathbb{R}_+$, hence the expected risk and its empirical approximation are both lower-bounded by zero. However, our proposed risk estimator in Theorem 3 contains a negative term. Although the expected risk estimator is unbiased, the empirical estimator may become unbounded below if the used loss function is unbounded, thereby leading to over-fitting. Similar issues have also been reported by Ishida et al. (2019); Kiryo et al. (2017). The above analysis suggests that a bounded loss is probably better than an unbounded loss in our empirical risk estimator (i.e., Eq. (11)). To verify this conjecture, we insert bounded and unbounded losses into Eq. (11) for comparison. Note that we assume the softmax function is absorbed in each loss, and denote by $p_\theta(y|x) = \exp(f_y(x)) / \sum_{j=1}^{k} \exp(f_j(x))$ the predicted probability of the instance $x$ belonging to class $y$, where $\theta$ denotes the parameters of the model $f$. We list the compared loss functions as follows.
Categorical Cross Entropy (CCE):
$$\mathcal{L}_{\mathrm{CCE}}\big(f(x), y\big) = -\log p_\theta(y|x).$$

Mean Absolute Error (MAE):
$$\mathcal{L}_{\mathrm{MAE}}\big(f(x), y\big) = 2 - 2\,p_\theta(y|x).$$

Mean Square Error (MSE):
$$\mathcal{L}_{\mathrm{MSE}}\big(f(x), y\big) = 1 - 2\,p_\theta(y|x) + \sum\nolimits_{j=1}^{k} p_\theta(j|x)^2.$$

Generalized Cross Entropy (GCE) (Zhang & Sabuncu, 2018):
$$\mathcal{L}_{\mathrm{GCE}}\big(f(x), y\big) = \frac{1 - p_\theta(y|x)^q}{q},$$
where $q \in (0, 1]$ is a user-defined hyper-parameter. We set $q = 0.7$, as suggested by Zhang & Sabuncu (2018).

Partially Huberised Cross Entropy (PHuber-CE) (Menon et al., 2020):
$$\mathcal{L}_{\mathrm{PHuber\text{-}CE}}\big(f(x), y\big) = \begin{cases} -\log p_\theta(y|x), & \text{if } p_\theta(y|x) \geq \frac{1}{\tau}, \\ -\tau\, p_\theta(y|x) + \log \tau + 1, & \text{otherwise,} \end{cases}$$
where $\tau > 0$ is a user-defined hyper-parameter. We set $\tau = 10$, because it works well in Menon et al. (2020).

The detailed derivations of the above loss functions and their bounds are provided in Appendix D. Among these losses, CCE is unbounded while the others are bounded. We will experimentally demonstrate (Figure 1) that, when inserted into Eq. (11), bounded losses are significantly better than the unbounded loss. Furthermore, we conduct a deeper analysis of MAE, because MAE is not only bounded but also satisfies the symmetric condition (Ghosh et al., 2017), i.e., $\sum_{y=1}^{k} \mathcal{L}_{\mathrm{MAE}}(f(x), y) = 2k - 2$, which means the sum of the losses over all classes is a constant for arbitrary examples. However, is MAE good enough? Previous studies (Wang et al., 2019; Zhang & Sabuncu, 2018) have shown that MAE suffers from an optimization issue, which affects its practical performance. To alleviate this problem, we further improve MAE by proposing two upper-bound surrogate loss functions. Specifically, by using MAE in Eq.
(11), we obtain

$$\sum\nolimits_{y \notin \bar{Y}_i} \mathcal{L}_{\mathrm{MAE}}\big(f(x_i), y\big) = 2\,\mathcal{L}'_{\mathrm{MAE}}\big(f(x_i), \bar{Y}_i\big) + Z_i, \quad (12)$$

where $\mathcal{L}'_{\mathrm{MAE}}(f(x_i), \bar{Y}_i) := 1 - \sum_{j \notin \bar{Y}_i} p_\theta(j|x_i)$, and $Z_i = 2(k - |\bar{Y}_i| - 1)$ is a constant independent of $f(x_i)$.

Figure 1. Experimental results of different loss functions (MAE, MSE, GCE, PHuber-CE, CCE) for different datasets and models: (a) MNIST, linear; (b) MNIST, MLP; (c) Fashion-MNIST, linear; (d) Fashion-MNIST, MLP; (e) Kuzushiji-MNIST, linear; (f) Kuzushiji-MNIST, MLP; (g) CIFAR-10, ResNet; (h) CIFAR-10, DenseNet. Each panel plots test accuracy against training epoch (0-250). Dark colors show the mean accuracy of 5 trials and light colors show the standard deviation.

It is clear that minimizing $\mathcal{L}'_{\mathrm{MAE}}(f(x_i), \bar{Y}_i)$ is equivalent to minimizing $\sum_{y \notin \bar{Y}_i} \mathcal{L}_{\mathrm{MAE}}(f(x_i), y)$. Based on this fact, we further introduce two upper-bound surrogate loss functions of $\mathcal{L}'_{\mathrm{MAE}}$:

$$\mathcal{L}_{\mathrm{EXP}}\big(f(x_i), \bar{Y}_i\big) = \exp\Big(-\sum\nolimits_{j \notin \bar{Y}_i} p_\theta(j|x_i)\Big),$$

$$\mathcal{L}_{\mathrm{LOG}}\big(f(x_i), \bar{Y}_i\big) = -\log\Big(\sum\nolimits_{j \notin \bar{Y}_i} p_\theta(j|x_i)\Big).$$

One can easily verify that $\mathcal{L}'_{\mathrm{MAE}}$ is upper-bounded by $\mathcal{L}_{\mathrm{EXP}}$ and $\mathcal{L}_{\mathrm{LOG}}$ using the two inequalities $1 - z \leq \exp(-z)$ and $1 - z \leq -\log z$, respectively. By replacing $\mathcal{L}'_{\mathrm{MAE}}$ with $\mathcal{L}_{\mathrm{EXP}}$ or $\mathcal{L}_{\mathrm{LOG}}$ in Eq. (12), we obtain two new methods for learning with MCLs. We explain the advantage of $\mathcal{L}_{\mathrm{EXP}}$ and $\mathcal{L}_{\mathrm{LOG}}$ over $\mathcal{L}'_{\mathrm{MAE}}$ by closely examining their gradients:

$$\frac{\partial \mathcal{L}'_{\mathrm{MAE}}}{\partial \theta} = -\sum\nolimits_{j \notin \bar{Y}_i} \nabla_\theta\, p_\theta(j|x_i),$$

$$\frac{\partial \mathcal{L}_{\mathrm{EXP}}}{\partial \theta} = -w_{\mathrm{EXP}} \sum\nolimits_{j \notin \bar{Y}_i} \nabla_\theta\, p_\theta(j|x_i),$$

$$\frac{\partial \mathcal{L}_{\mathrm{LOG}}}{\partial \theta} = -w_{\mathrm{LOG}} \sum\nolimits_{j \notin \bar{Y}_i} \nabla_\theta\, p_\theta(j|x_i),$$

where $w_{\mathrm{EXP}} = \exp\big(-\sum_{j \notin \bar{Y}_i} p_\theta(j|x_i)\big)$ and $w_{\mathrm{LOG}} = \big(\sum_{j \notin \bar{Y}_i} p_\theta(j|x_i)\big)^{-1}$. From the gradients, we can clearly observe that $\mathcal{L}'_{\mathrm{MAE}}$ basically treats each example equally, while $\mathcal{L}_{\mathrm{EXP}}$ and $\mathcal{L}_{\mathrm{LOG}}$ give larger weights to difficult examples. Concretely, if $\sum_{j \notin \bar{Y}_i} p_\theta(j|x_i)$ is small, both $w_{\mathrm{EXP}}$ and $w_{\mathrm{LOG}}$ would be large.
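As a hedged illustration of this weighting behavior (function names are ours, not from the paper's released code), $\mathcal{L}'_{\mathrm{MAE}}$ and the two surrogates can be written directly in terms of the predicted probabilities; the inequalities $1 - z \leq \exp(-z)$ and $1 - z \leq -\log z$ make both surrogates upper bounds of $\mathcal{L}'_{\mathrm{MAE}}$:

```python
import math

def softmax(f):
    # p_theta(.|x): softmax over the k class scores
    m = max(f)
    exps = [math.exp(v - m) for v in f]
    total = sum(exps)
    return [e / total for e in exps]

def noncomp_mass(p, Ybar):
    # sum of predicted probabilities over all non-complementary labels
    return sum(p[j] for j in range(len(p)) if j not in Ybar)

def mae_prime(p, Ybar):
    # L'_MAE(f(x), Ybar) = 1 - sum_{j not in Ybar} p_theta(j|x)
    return 1.0 - noncomp_mass(p, Ybar)

def exp_loss(p, Ybar):
    # L_EXP: upper bound of L'_MAE via 1 - z <= exp(-z)
    return math.exp(-noncomp_mass(p, Ybar))

def log_loss(p, Ybar):
    # L_LOG: upper bound of L'_MAE via 1 - z <= -log z
    return -math.log(noncomp_mass(p, Ybar))
```

For a uniform prediction over $k = 4$ classes with $\bar{Y} = \{0\}$, the non-complementary mass is 0.75, giving $\mathcal{L}'_{\mathrm{MAE}} = 0.25$ while the surrogates are larger; as the mass shrinks (a harder example), the surrogate values and their gradient weights grow.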
In other words, $\mathcal{L}_{\mathrm{EXP}}$ and $\mathcal{L}_{\mathrm{LOG}}$ pay more attention to hard examples, i.e., those whose sum of predicted confidences over all non-complementary labels is small.

5. Experiments

In this section, we conduct extensive experiments to evaluate the performance of our proposed approaches, including the two wrappers, the unbiased risk estimator with various loss functions, and the two upper-bound surrogate loss functions.

Datasets. We use five widely used benchmark datasets, MNIST (LeCun et al., 1998), Kuzushiji-MNIST (Clanuwat et al., 2018), Fashion-MNIST (Xiao et al., 2017), 20Newsgroups (Lang, 1995), and CIFAR-10 (Krizhevsky et al., 2009), and four datasets from the UCI repository (Blake & Merz, 1998). We use four base models: a linear model, an MLP model ($d$-500-$k$), ResNet (34 layers) (He et al., 2016), and DenseNet (22 layers) (Huang et al., 2017). Detailed descriptions of these datasets and the corresponding base models are provided in Appendix E.1. To generate MCLs, we instantiate $p(s) = \binom{k}{s}/(2^k - 2)$, $\forall s \in \{1, \ldots, k-1\}$, which means $p(s)$ is the ratio of the number of label sets of size $s$ to the number of all possible label sets. For each instance $x$, we first randomly sample $s$ from $p(s)$, and then uniformly and randomly sample a complementary label set $\bar{Y}$ of size $s$ (i.e., $p(\bar{Y}) = 1/\binom{k-1}{s}$).

Approaches. We absorb five ordinary complementary-label learning approaches in the two wrappers (introduced

Table 2. Classification accuracy (mean±std) of each algorithm on the four UCI datasets using a linear model for 5 trials. The best performance among all the approaches is highlighted in boldface. In addition, we record whether the performance of our approach (the best of EXP and LOG) is statistically superior/inferior to each comparing algorithm on each dataset (paired t-test at the 0.05 significance level).
| Group | Approach | Yeast | Texture | Dermatology | Synthetic Control |
| --- | --- | --- | --- | --- | --- |
| Upper-bound Losses | EXP | 54.94±1.56% | 97.51±0.09% | 98.89±0.37% | 27.87±5.13% |
| Upper-bound Losses | LOG | **60.11±1.93%** | 98.88±0.43% | **99.46±1.14%** | **90.73±4.41%** |
| Bounded Losses | MAE | 33.07±0.37% | 85.29±7.93% | 85.39±2.58% | 23.50±2.44% |
| Bounded Losses | MSE | 58.17±1.52% | 97.59±0.16% | 97.84±1.21% | 34.20±8.69% |
| Bounded Losses | GCE | 57.56±1.56% | 97.25±0.31% | 97.53±1.81% | 23.67±3.10% |
| Bounded Losses | PHuber-CE | 55.54±1.03% | 94.89±3.28% | 95.14±2.41% | 24.71±3.18% |
| Unbounded Loss | CCE | 49.50±3.58% | 92.08±1.15% | 83.19±3.65% | 63.47±6.91% |
| Decomposition before Shuffle | GA | 27.91±5.02% | 90.93±1.34% | 36.05±9.79% | 18.12±1.74% |
| Decomposition before Shuffle | NN | 32.73±3.59% | 96.29±0.39% | 61.49±6.83% | 55.12±4.43% |
| Decomposition before Shuffle | FREE | 35.50±2.79% | 94.36±1.08% | 86.30±5.62% | 76.95±3.26% |
| Decomposition before Shuffle | PC | 53.89±3.53% | 92.68±0.81% | 96.27±3.07% | 72.63±5.86% |
| Decomposition before Shuffle | Forward | 58.15±1.54% | **98.95±0.17%** | 99.37±0.85% | 38.77±6.06% |
| Decomposition after Shuffle | GA | 28.21±1.53% | 83.66±2.27% | 42.05±7.94% | 25.46±1.28% |
| Decomposition after Shuffle | NN | 36.04±2.24% | 93.91±0.40% | 62.54±9.19% | 59.80±5.14% |
| Decomposition after Shuffle | FREE | 43.47±1.36% | 93.94±0.72% | 86.22±6.07% | 73.33±2.17% |
| Decomposition after Shuffle | PC | 54.58±2.57% | 94.19±1.21% | 95.73±3.33% | 69.53±9.01% |
| Decomposition after Shuffle | Forward | 59.46±1.16% | 97.65±0.32% | 99.03±1.33% | 43.57±5.83% |
| Partial Label (Convex Formulation) | CLPL | 55.39±1.21% | 92.07±0.88% | 99.42±0.54% | 63.57±5.46% |

Table 3. Classification accuracy (mean±std) of each algorithm on the four benchmark datasets using a linear model for 5 trials. The best performance among all the approaches is highlighted in boldface. In addition, we record whether the performance of our approach (the best of EXP and LOG) is statistically superior/inferior to each comparing algorithm on each dataset (paired t-test at the 0.05 significance level).
| Group | Approach | MNIST | Kuzushiji | Fashion | 20Newsgroups |
| --- | --- | --- | --- | --- | --- |
| Upper-bound Losses | EXP | **92.67±0.11%** | 64.23±0.33% | **84.56±0.25%** | 81.72±0.39% |
| Upper-bound Losses | LOG | 92.58±0.09% | **68.89±0.25%** | 84.42±0.16% | **84.06±0.57%** |
| Bounded Losses | MAE | 92.66±0.12% | 64.03±0.19% | 84.50±0.16% | 79.68±1.40% |
| Bounded Losses | MSE | 92.64±0.13% | 64.51±0.55% | 84.53±0.20% | 81.55±0.52% |
| Bounded Losses | GCE | 92.66±0.12% | 64.44±0.17% | 84.44±0.15% | 81.78±0.60% |
| Bounded Losses | PHuber-CE | 92.02±0.07% | 63.81±0.75% | 83.76±0.22% | 73.52±1.04% |
| Unbounded Loss | CCE | 88.23±0.19% | 62.27±0.84% | 80.25±0.29% | 63.78±0.79% |
| Decomposition before Shuffle | GA | 85.51±0.26% | 55.61±0.24% | 78.64±0.33% | 76.64±0.62% |
| Decomposition before Shuffle | NN | 88.09±0.16% | 60.54±0.23% | 80.68±0.07% | 76.00±0.37% |
| Decomposition before Shuffle | FREE | 89.35±0.14% | 65.21±0.45% | 81.22±0.11% | 68.34±0.72% |
| Decomposition before Shuffle | PC | 88.21±0.23% | 62.76±0.40% | 80.60±0.18% | 66.91±1.20% |
| Decomposition before Shuffle | Forward | 92.57±0.05% | 63.51±0.22% | 84.38±0.20% | 74.69±1.14% |
| Decomposition after Shuffle | GA | 83.16±0.22% | 56.31±0.42% | 73.37±0.10% | 66.14±0.79% |
| Decomposition after Shuffle | NN | 88.79±0.26% | 63.19±0.12% | 79.77±0.14% | 66.35±0.53% |
| Decomposition after Shuffle | FREE | 89.02±0.22% | 64.18±0.18% | 80.11±0.04% | 66.16±0.60% |
| Decomposition after Shuffle | PC | 87.76±0.17% | 61.64±0.38% | 80.58±0.17% | 65.64±0.81% |
| Decomposition after Shuffle | Forward | 92.54±0.04% | 63.69±0.14% | 84.37±0.17% | 71.98±3.41% |
| Partial Label (Convex Formulation) | CLPL | 81.85±0.27% | 55.31±0.23% | 77.26±0.10% | 81.48±0.45% |

in Section 4.1): GA, NN, and FREE (Ishida et al., 2019), PC (Ishida et al., 2017), and Forward (Yu et al., 2018). We also use an unbounded loss, CCE, and four bounded losses, MAE, MSE, GCE (Zhang & Sabuncu, 2018), and PHuber-CE (Menon et al., 2020), in our empirical estimator Eq. (11). Besides, the two upper-bound loss functions LOG and EXP are inserted into Eq. (12). In addition, we also compare with a representative partial label learning approach, CLPL (Cour et al., 2011). For all the approaches, we adopt the same base model for a fair comparison. Learning rate and weight decay are selected from $\{10^{-6}, 10^{-5}, \ldots, 10^{-1}\}$. We implement our approach using PyTorch1, and use the Adam (Kingma & Ba, 2015) optimization method with mini-batch size set to 256 and number of epochs set to 250.
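The MCL generation procedure described above — draw the set size $s$ with probability $p(s) = \binom{k}{s}/(2^k - 2)$, then draw a uniform size-$s$ subset of the $k-1$ incorrect labels — can be sketched as follows (an illustrative sketch, not the authors' released code):

```python
import random
from math import comb

def sample_mcl(y, k, rng):
    """Sample a set of multiple complementary labels for an example
    with true label y, following p(s) = C(k, s) / (2^k - 2)."""
    sizes = list(range(1, k))                    # s ranges over {1, ..., k-1}
    weights = [comb(k, s) for s in sizes]        # p(s) proportional to C(k, s)
    s = rng.choices(sizes, weights=weights)[0]
    incorrect = [c for c in range(k) if c != y]  # candidate complementary labels
    return set(rng.sample(incorrect, s))         # uniform size-s subset
```

Every returned set is nonempty, excludes the true label, and never covers all $k$ classes, so it always lies in $\bar{\mathcal{Y}}$.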
Hyperparameters for all the approaches are selected so as to maximize the accuracy on a validation set (10% of the training set) of complementarily labeled data. All the experiments are conducted on NVIDIA Tesla V100 GPUs.

¹ www.pytorch.org

Table 4. Classification accuracy (mean ± std) of each algorithm on the five benchmark datasets using neural networks for 5 trials. The best performance among all the approaches is highlighted in boldface. In addition, a mark indicates whether the performance of our approach (the best of EXP and LOG) is statistically superior/inferior to the comparing algorithm on each dataset (paired t-test at the 0.05 significance level).

| Approach | MNIST | Kuzushiji | Fashion | CIFAR-10 R | CIFAR-10 D | 20Newsgroups |
|---|---|---|---|---|---|---|
| **Upper-bound Losses** | | | | | | |
| EXP | 97.80 ± 0.06% | 88.25 ± 0.28% | 88.07 ± 0.19% | 72.49 ± 0.84% | 75.53 ± 0.58% | 77.22 ± 1.22% |
| LOG | 97.86 ± 0.13% | 88.24 ± 0.08% | 88.36 ± 0.26% | 75.38 ± 0.34% | 75.80 ± 0.62% | 79.46 ± 0.94% |
| **Bounded Losses** | | | | | | |
| MAE | 97.81 ± 0.04% | 88.11 ± 0.40% | 88.13 ± 0.23% | 65.57 ± 4.08% | 68.24 ± 5.84% | 49.83 ± 4.01% |
| MSE | 96.84 ± 0.08% | 84.97 ± 0.23% | 86.14 ± 0.04% | 63.58 ± 1.19% | 70.89 ± 0.81% | 72.19 ± 0.59% |
| GCE | 96.62 ± 0.08% | 85.02 ± 0.26% | 87.03 ± 0.20% | 68.40 ± 1.05% | 71.54 ± 0.83% | 74.96 ± 0.47% |
| PHuber-CE | 95.00 ± 0.36% | 80.66 ± 0.32% | 85.52 ± 0.18% | 59.64 ± 1.21% | 66.49 ± 0.67% | 62.63 ± 2.32% |
| **Unbounded Loss** | | | | | | |
| CCE | 88.64 ± 0.50% | 67.86 ± 1.01% | 80.97 ± 0.23% | 18.01 ± 0.63% | 44.94 ± 1.20% | 54.96 ± 0.38% |
| **Decomposition before Shuffle** | | | | | | |
| GA | 96.36 ± 0.05% | 84.35 ± 0.22% | 85.59 ± 0.30% | 69.05 ± 0.83% | 65.38 ± 1.40% | 79.06 ± 0.57% |
| NN | 96.70 ± 0.08% | 82.21 ± 0.36% | 86.29 ± 0.10% | 63.85 ± 0.74% | 64.80 ± 1.28% | 76.81 ± 0.44% |
| FREE | 88.55 ± 0.38% | 70.32 ± 0.80% | 81.17 ± 0.36% | 32.02 ± 1.69% | 39.22 ± 0.43% | 61.22 ± 1.24% |
| PC | 92.74 ± 0.17% | 73.18 ± 0.59% | 83.32 ± 0.28% | 43.16 ± 2.21% | 49.53 ± 1.18% | 65.15 ± 2.05% |
| Forward | 97.67 ± 0.04% | 87.65 ± 0.24% | 88.08 ± 0.24% | 71.92 ± 1.09% | 71.30 ± 1.16% | 77.19 ± 0.76% |
| **Decomposition after Shuffle** | | | | | | |
| GA | 92.08 ± 0.22% | 74.64 ± 0.67% | 79.73 ± 0.19% | 53.12 ± 0.97% | 56.51 ± 0.89% | 63.37 ± 1.16% |
| NN | 92.47 ± 0.14% | 73.88 ± 0.63% | 82.99 ± 0.13% | 36.79 ± 0.78% | 53.78 ± 0.92% | 65.15 ± 0.73% |
| FREE | 88.99 ± 0.39% | 70.09 ± 0.74% | 81.74 ± 0.23% | 15.16 ± 2.22% | 47.45 ± 0.98% | 50.86 ± 1.56% |
| PC | 92.94 ± 0.05% | 68.60 ± 1.32% | 82.46 ± 0.26% | 33.16 ± 0.92% | 52.23 ± 0.88% | 64.32 ± 0.86% |
| Forward | 97.49 ± 0.08% | 86.47 ± 0.39% | 87.56 ± 0.14% | 72.16 ± 0.97% | 75.23 ± 1.02% | 79.35 ± 0.82% |

Loss Comparison. Figure 1 shows the mean and standard deviation of test accuracy over 5 trials, for the bounded loss functions MAE, MSE, GCE, and PHuber-CE and the unbounded loss function CCE used in our empirical risk estimator Eq. (11). We also record the mean and standard deviation of training accuracy (the training set is evaluated with ordinary labels) over 5 trials, and put the results in Appendix E.2. As can be seen from Figure 1, all the bounded losses are significantly better than the unbounded loss CCE in our formulation. This observation clearly accords with our discussion of the over-fitting issue in Section 4.3. In addition, MAE achieves performance comparable to the other bounded losses in most cases, while it is sometimes inferior to them due to its optimization issue (Zhang & Sabuncu, 2018). Both the advantage and the disadvantage of MAE motivate us to use the upper-bound loss functions EXP and LOG for improving the classification performance.

Performance Comparison. Table 2, Table 3, and Table 4 show the experimental results of the different approaches using a linear model or neural networks on the four UCI datasets and the other five benchmark datasets. In Table 4, CIFAR-10 R and CIFAR-10 D mean that we use ResNet and DenseNet on CIFAR-10, respectively. Note that CLPL is a convex approach for partial label learning, which is specially designed for a linear model; hence CLPL does not appear in Table 4. From the three tables, we can find that, equipped with the two wrappers Decomposition before Shuffle and Decomposition after Shuffle, ordinary complementary-label learning approaches work well for learning with MCLs.
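The statistical comparisons reported in the tables rest on a paired t-test over the per-trial accuracies. As a minimal sketch (our own, not the paper's code): with 5 trials the statistic has 4 degrees of freedom, whose two-sided 5% critical value is about 2.776:

```python
import math

def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i].

    Assumes the trials are paired (same seeds/splits) and the
    differences are not all identical (sample variance > 0).
    """
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Two-sided 0.05 critical value of Student's t with n - 1 = 4 dof.
T_CRIT = 2.776

def significantly_better(ours, theirs):
    """True iff `ours` is statistically superior at the 0.05 level."""
    return paired_t_statistic(ours, theirs) > T_CRIT
```

A symmetric check with the arguments swapped gives the "statistically inferior" mark.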
However, they are significantly outperformed by the upper-bound losses in most cases, which also achieve the best performance among all the approaches on various benchmark datasets. In addition, we also study the case where the size of each complementary label set s̄ is fixed at j (i.e., p(|s̄| = j) = 1) while increasing j from 1 to k − 2. The corresponding experimental results are provided in Appendix E.3, which show that the classification accuracy of our approaches increases as j increases. This observation is clearly in accordance with our derived estimation error bound (Theorem 4), as the estimation error decreases as j increases.

6. Conclusion

In this paper, we propose a novel problem setting called learning with multiple complementary labels (MCLs), which is a generalization of complementary-label learning (Ishida et al., 2017; 2019; Yu et al., 2018). To solve this learning problem, we first design two wrappers that enable us to use arbitrary complementary-label learning approaches for learning with MCLs. However, we find that the supervision information that MCLs hold is conceptually diluted after decomposition. Therefore, we further propose an unbiased risk estimator for learning with MCLs, which processes each set of MCLs as a whole. We then theoretically derive an estimation error bound, which guarantees learning consistency. Although our risk estimator does not rely on specific models or loss functions, we show that bounded losses are generally better than unbounded losses in our empirical risk estimator. In addition, we improve the risk estimator into minimizing properly chosen upper bounds for practical implementation. Extensive experiments demonstrate the effectiveness of the proposed approaches.
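For concreteness, the fixed-size setting in the ablation above corresponds to drawing, for each example, exactly j complementary labels uniformly from the k − 1 incorrect classes. A minimal sketch of that sampling step (our own illustration; names are hypothetical):

```python
import random

def draw_mcls(true_label, k, j, rng=random):
    """Sample a set of j complementary labels for one example:
    j distinct classes the example does NOT belong to, drawn
    uniformly without replacement from the k - 1 incorrect classes."""
    incorrect = [c for c in range(k) if c != true_label]
    return set(rng.sample(incorrect, j))
```

As j grows toward k − 2, each set rules out more classes and pins the true label down more tightly, which is consistent with the estimation error bound shrinking as j increases.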
Acknowledgements

This research was supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-RP-2019-0013), National Satellite of Excellence in Trustworthy Software Systems (Award No: NSOE-TSS2019-01), and NTU. BH was partially supported by the Early Career Scheme (ECS) through the Research Grants Council of Hong Kong under Grant No. 22200720, an HKBU Tier-1 Start-up Grant, and an HKBU CSD Start-up Grant. GN and MS were supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan.

References

Bao, H., Niu, G., and Sugiyama, M. Classification from pairwise similarity and unlabeled data. In ICML, pp. 452–461, 2018.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3(11):463–482, 2002.

Blake, C. L. and Merz, C. J. UCI repository of machine learning databases, 1998. URL http://archive.ics.uci.edu/ml/index.php.

Chapelle, O., Schölkopf, B., and Zien, A. Semi-Supervised Learning. MIT Press, 2006.

Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., and Ha, D. Deep learning for classical Japanese literature. arXiv preprint arXiv:1812.01718, 2018.

Cour, T., Sapp, B., and Taskar, B. Learning from partial labels. JMLR, 12(5):1501–1536, 2011.

du Plessis, M. C., Niu, G., and Sugiyama, M. Analysis of learning from positive and unlabeled data. In NeurIPS, pp. 703–711, 2014.

du Plessis, M. C., Niu, G., and Sugiyama, M. Convex formulation for learning from positive and unlabeled data. In ICML, pp. 1386–1394, 2015.

Elkan, C. and Noto, K. Learning classifiers from only positive and unlabeled data. In KDD, pp. 213–220, 2008.

Feng, L. and An, B. Leveraging latent label distributions for partial label learning. In IJCAI, pp. 2107–2113, 2018.

Feng, L. and An, B. Partial label learning with self-guided retraining. In AAAI, pp. 3542–3549, 2019a.

Feng, L. and An, B. Partial label learning by semantic difference maximization. In IJCAI, pp. 2294–2300, 2019b.

Ghosh, A., Kumar, H., and Sastry, P. Robust loss functions under label noise for deep neural networks. In AAAI, 2017.

Han, B., Yao, J., Niu, G., Zhou, M., Tsang, I., Zhang, Y., and Sugiyama, M. Masking: A new perspective of noisy supervision. In NeurIPS, pp. 5836–5846, 2018a.

Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pp. 8527–8537, 2018b.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR, pp. 4700–4708, 2017.

Ishida, T., Niu, G., Hu, W., and Sugiyama, M. Learning from complementary labels. In NeurIPS, pp. 5644–5654, 2017.

Ishida, T., Niu, G., and Sugiyama, M. Binary classification from positive-confidence data. In NeurIPS, pp. 5917–5928, 2018.

Ishida, T., Niu, G., Menon, A. K., and Sugiyama, M. Complementary-label learning for arbitrary losses and models. In ICML, pp. 2971–2980, 2019.

Kaneko, T., Sato, I., and Sugiyama, M. Online multiclass classification based on prediction margin for partial feedback. arXiv preprint arXiv:1902.01056, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.

Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, pp. 1674–1684, 2017.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Lang, K. Newsweeder: Learning to filter netnews. In ICML, 1995.

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, Y.-F. and Liang, D.-M. Safe semi-supervised learning: A brief introduction. Frontiers of Computer Science, 13(4):669–676, 2019.

Lu, N., Niu, G., Menon, A. K., and Sugiyama, M. On the minimal supervision for training any binary classifier from only unlabeled data. In ICLR, 2019.

Lu, N., Zhang, T., Niu, G., and Sugiyama, M. Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, 2020.

Menon, A., Van Rooyen, B., Ong, C. S., and Williamson, B. Learning from corrupted binary labels via class-probability estimation. In ICML, pp. 125–134, 2015.

Menon, A. K., Rawat, A. S., Reddi, S. J., and Kumar, S. Can gradient clipping mitigate label noise? In ICLR, 2020.

Miyato, T., Maeda, S.-i., Koyama, M., and Ishii, S. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. TPAMI, 41(8):1979–1993, 2018.

Niu, G., Jitkrittum, W., Dai, B., Hachiya, H., and Sugiyama, M. Squared-loss mutual information regularization: A novel information-theoretic approach to semi-supervised learning. In ICML, pp. 10–18, 2013.

Rezaei, M., Yang, H., and Meinel, C. Recurrent generative adversarial network for learning imbalanced medical image semantic segmentation. Multimedia Tools and Applications, pp. 1–20, 2019.

Sakai, T., du Plessis, M. C., Niu, G., and Sugiyama, M. Semi-supervised classification based on classification from positive and unlabeled data. In ICML, pp. 2998–3006, 2017.

Sakai, T., Niu, G., and Sugiyama, M. Semi-supervised AUC optimization based on positive-unlabeled learning. MLJ, 107(4):767–794, 2018.

Wang, X., Kodirov, E., Hua, Y., and Robertson, N. M. Improving MAE against CCE under label noise. arXiv preprint arXiv:1903.12141, 2019.

Wei, H., Feng, L., Chen, X., and An, B. Combating noisy labels by agreement: A joint training method with co-regularization. In CVPR, 2020.

Xia, X., Liu, T., Wang, N., Han, B., Gong, C., Niu, G., and Sugiyama, M. Are anchor points really indispensable in label-noise learning? In NeurIPS, pp. 6835–6846, 2019.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Yu, X., Liu, T., Gong, M., and Tao, D. Learning with biased complementary labels. In ECCV, pp. 68–83, 2018.

Zhang, M.-L. and Yu, F. Solving the partial label learning problem: An instance-based approach. In IJCAI, pp. 4048–4054, 2015.

Zhang, T. Statistical analysis of some multi-category large margin classification methods. JMLR, 5(10):1225–1251, 2004.

Zhang, Z. and Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, pp. 8778–8788, 2018.

Zhou, Z. A brief introduction to weakly supervised learning. National Science Review, 5(1):44–53, 2018.

Zhu, X. and Goldberg, A. B. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1–130, 2009.