# Learning with Multiple Complementary Labels

Lei Feng\*1, Takuo Kaneko\*2,3, Bo Han4,3, Gang Niu3, Bo An1, Masashi Sugiyama3,2

Abstract

A complementary label (CL) simply indicates an incorrect class of an example, but learning with CLs results in multi-class classifiers that can predict the correct class. Unfortunately, the problem setting only allows a single CL for each example, which notably limits its potential since our labelers may easily identify multiple CLs (MCLs) for one example. In this paper, we propose a novel problem setting that allows MCLs for each example, together with two ways of learning with MCLs. In the first way, we design two wrappers that decompose MCLs into many single CLs, so that we can use any method for learning with CLs. However, the supervision information that MCLs hold is conceptually diluted after decomposition. Thus, in the second way, we derive an unbiased risk estimator; minimizing it processes each set of MCLs as a whole and possesses an estimation error bound. We further improve the second way into minimizing properly chosen upper bounds. Experiments show that the former way works well for learning with MCLs, but the latter is even better.

1. Introduction

Ordinary machine learning tasks generally require massive data with accurate supervision information, while it is expensive and time-consuming to collect data with high-quality labels. To alleviate this problem, researchers have studied various weakly supervised learning frameworks (Zhou, 2018), including semi-supervised learning (Chapelle et al., 2006; Li & Liang, 2019; Miyato et al., 2018; Niu et al., 2013; Zhu & Goldberg, 2009), positive-unlabeled learning (du Plessis et al., 2014; 2015; Elkan & Noto, 2008; Kiryo et al., 2017; Sakai et al., 2017; 2018), noisy-label learning (Han et al., 2018a;b; Menon et al., 2015; Wei et al., 2020; Xia et al., 2019), partial label learning (Cour et al., 2011; Feng & An, 2018; 2019a;b; Zhang & Yu, 2015), positive-confidence learning (Ishida et al., 2018), similar-unlabeled learning (Bao et al., 2018), and unlabeled-unlabeled classification (Lu et al., 2019; 2020).

\*Equal contribution. Work done when LF was an intern at RIKEN AIP and TK belonged to The University of Tokyo and RIKEN AIP. 1School of Computer Science and Engineering, Nanyang Technological University, Singapore; 2The University of Tokyo; 3RIKEN Center for Advanced Intelligence Project; 4Department of Computer Science, Hong Kong Baptist University. Correspondence to: Lei Feng. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Here, we consider another weakly supervised classification framework called complementary-label learning (Ishida et al., 2017; 2019; Yu et al., 2018). In complementary-label learning, each training example is supplied with a complementary label (CL), which specifies one of the classes that the example does not belong to. Compared with ordinary labels, it is obviously easier to collect CLs. Recently, complementary-label learning has been applied to online learning (Kaneko et al., 2019) and medical image segmentation (Rezaei et al., 2019). In addition, another potential application of learning with CLs is data privacy. For example, collecting some survey data may require answers to extremely private questions (Ishida et al., 2017; 2019). It may be difficult to directly obtain the true answer (label) to such a question. Nonetheless, it would be mentally less demanding for a respondent to provide some incorrect answers. Moreover, the respondent may provide multiple incorrect answers, rather than exactly one. In this case, multiple complementary labels (MCLs) would be more widespread than a single CL.
In this paper, we propose a novel problem setting (Section 3.1) that allows MCLs for each example, and provide a real-world motivation (Section 3.2). Although existing complementary-label learning approaches (Ishida et al., 2017; 2019; Yu et al., 2018) have provided solid theoretical foundations and achieved promising performance, they are all restricted to the case where each example is associated with a single CL. To learn with MCLs, we first design two wrappers (Section 4.1) that decompose each example with MCLs into multiple examples, each with a single CL, in different manners. With the two wrappers, we are able to use arbitrary ordinary complementary-label learning approaches for learning with MCLs. However, the derived data with many single CLs may not match the assumed data distribution for complementary-label learning (Ishida et al., 2017; 2019). In addition, the supervision information would be conceptually diluted after decomposition. To solve these problems, we further propose an unbiased risk estimator (Section 4.2) for learning with MCLs, which processes each set of MCLs as a whole. Our risk estimator is conceptually consistent, and builds a prototype baseline for the new problem setting that may inspire more specially designed methods in the future. We then theoretically derive an estimation error bound, which guarantees that the empirical risk minimizer converges to the true risk minimizer with high probability as the number of training data approaches infinity. Furthermore, we improve the risk estimator into minimizing properly chosen upper bounds for practical implementation (Section 4.3), and show that they bring benefits to the gradient update. Experimental results show that the wrappers work well for learning with MCLs, while the (improved) risk estimator is even better on various benchmark datasets.

2. Related Work

In this section, we introduce some notation and briefly review the formulations of multi-class classification and complementary-label learning.

2.1. Multi-Class Classification

Suppose the feature space is $\mathcal{X} \subseteq \mathbb{R}^d$ with $d$ dimensions and the label space is $\mathcal{Y} = \{1, 2, \ldots, k\}$ with $k$ classes. An instance $x \in \mathcal{X}$ with its class label $y \in \mathcal{Y}$ is sampled from an unknown probability distribution with density $p(x, y)$. Ordinary multi-class classification aims to induce a learning function $f(x): \mathbb{R}^d \rightarrow \mathbb{R}^k$ that minimizes the classification risk:

$$R(f) = \mathbb{E}_{p(x,y)}\big[\mathcal{L}\big(f(x), y\big)\big], \quad (1)$$

where $\mathcal{L}(f(x), y)$ is a multi-class loss function. The predicted label is given as $\hat{y} = \arg\max_{y \in \mathcal{Y}} f_y(x)$, where $f_y(x)$ is the $y$-th coordinate of $f(x)$.

2.2. Complementary-Label Learning

Suppose the dataset for complementary-label learning is denoted by $\{(x_i, \bar{y}_i)\}_{i=1}^n$, where $\bar{y}_i \in \mathcal{Y}$ is a complementary label of $x_i$, and each complementarily labeled example is sampled from $\bar{p}(x, \bar{y})$. Ishida et al. (2017; 2019) assumed that $\bar{p}(x, \bar{y})$ is expressed as:

$$\bar{p}(x, \bar{y}) = \frac{1}{k-1} \sum\nolimits_{y \neq \bar{y}} p(x, y). \quad (2)$$

This assumption implies that all labels other than the correct label are chosen to be the complementary label with uniform probability. This is reasonable, as we do not have extra labeling information beyond a complementary label. Under this assumption, Ishida et al. (2017) proved that an unbiased estimator of the original classification risk can be obtained from only complementarily labeled data, when the loss function satisfies certain conditions. Specifically, they used the multi-class loss functions with the one-versus-all strategy and the pairwise-comparison strategy (Zhang, 2004):

$$\bar{\mathcal{L}}_{\mathrm{OVA}}\big(f(x), \bar{y}\big) = \frac{1}{k-1} \sum\nolimits_{y' \neq \bar{y}} \ell\big(f_{y'}(x)\big) + \ell\big(-f_{\bar{y}}(x)\big),$$

$$\bar{\mathcal{L}}_{\mathrm{PC}}\big(f(x), \bar{y}\big) = \sum\nolimits_{y' \neq \bar{y}} \ell\big(f_{y'}(x) - f_{\bar{y}}(x)\big),$$

where $\ell(z)$ is a binary loss function that satisfies $\ell(z) + \ell(-z) = 1$, such as the sigmoid loss $\ell_{\mathrm{S}}(z) = \frac{1}{1+e^{z}}$ and the ramp loss $\ell_{\mathrm{R}}(z) = \frac{1}{2}\max\big(0, \min(2, 1-z)\big)$. Later, a different assumption was used by Yu et al. (2018).
They assumed that all labels other than the correct label are chosen to be the complementary label with different probabilities, and proposed to estimate the class transition probability matrix for model training. Although they showed that the minimizer of their learning objective coincides with the minimizer of the original classification risk, they did not provide an unbiased risk estimator. Recently, a more general unbiased risk estimator (Ishida et al., 2019) was proposed, which does not rely on specific losses or models. Their formulation is as follows:

$$\bar{\mathcal{L}}_{\mathrm{FREE}}\big(f(x), \bar{y}\big) = \sum\nolimits_{y=1}^{k} \mathcal{L}\big(f(x), y\big) - (k-1)\,\mathcal{L}\big(f(x), \bar{y}\big),$$

with the corresponding unbiased risk

$$R(f) = \mathbb{E}_{\bar{p}(x,\bar{y})}\big[\bar{\mathcal{L}}_{\mathrm{FREE}}\big(f(x), \bar{y}\big)\big]. \quad (3)$$

For this formulation, they showed that due to the negative term, the empirical risk could be unbounded below, which leads to over-fitting. To alleviate this issue, the authors further proposed modified versions using the max operator and the gradient ascent strategy. In summary, although the above methods have provided solid theoretical foundations and achieved promising performance for complementary-label learning, they are all restricted to the case where each example is associated with a single CL. In this paper, we propose a novel problem setting that allows MCLs for each example.

3. Multiple Complementary Labels

In this section, we first introduce our problem setting where each example is associated with MCLs, and then provide a corresponding real-world motivation.

3.1. Data Generation Process

Suppose the given dataset for learning with MCLs is represented as $\bar{D} = \{(x_i, \bar{Y}_i)\}_{i=1}^n$, where $\bar{Y}_i$ is a set of complementary labels for the instance $x_i$. It is obvious that learning with MCLs is a generalization of complementary-label learning, which learns with a single CL. Specifically, if
In addition, if s Yi contains k 1 complementary labels where k denotes the total number of classes, we obtain an ordinary multi-class classification problem. It is easy to know that for all i, s Yi cannot be the empty set nor the full label set, hence s Yi P s Y where s Y t2Y H Yu and | s Y| 2k 2. For the generation process of each example with MCLs, we assume that it relies on the size of the set of MCLs. Let us represent the size of the complementary label set by a random variable s, and assume s is sampled from a distribution ppsq. In this way, we assume that each training example pxi, s Yiq is drawn from the following data distribution: sppx, s Y q ÿk 1 j 1 pps jqsppx, s Y | s jq, (4) sppx, s Y | s jq : y R s Y ppx, yq, if |s Y | j, 0, otherwise. It is clear that when pps 1q 1, our introduced distribution reduces to the assumed distribution (e.g., Eq. (2)) in ordinary complementary-label learning approaches (Ishida et al., 2017; 2019). Then, we show that sppx, s Y q is a valid probability distribution by the following theorem. Theorem 1. The following equality holds: ż X sppx, s Y qdx ds Y 1. (5) The proof is provided in Appendix A.1. 3.2. Real-World Motivation Here, we present a real-world motivation for the assumed data distribution. Since directly choosing the correct label is hard for labelers, it would be easier if a labeling system can randomly choose a label set and ask labelers whether the correct label is included in the proposed label set or not. Given a pattern x, suppose the labeling system first randomly samples the size s of the proposed label set from ppsq, and then randomly and uniformly chooses a specific label set with size s from s Y. In this way, the collected label sets that do not include the correct label precisely follow the same distribution as Eq. (4). We will demonstrate this fact in the following. We start by considering the case where the labeling system has already sampled the size s of the proposed label set. 
Then we have the following lemma.

Lemma 1. Given the sampled size $s$ of the proposed label set, for any pattern $x$ with its correct label $y$ and any label set $\bar{Y}$ with size $s$ (i.e., $|\bar{Y}| = s$), the following equality holds:

$$p(y \in \bar{Y} \mid x, s) = \frac{s}{k}. \quad (6)$$

The proof is provided in Appendix A.2.

Theorem 2. In the above setting, the distribution of collected data where the correct label $y$ ($y \in \mathcal{Y}$) is not included in the label set $\bar{Y}$ ($\bar{Y} \in \bar{\mathcal{Y}}$) is the same as Eq. (4), i.e.,

$$p(x, \bar{Y} \mid y \notin \bar{Y}) = \bar{p}(x, \bar{Y}). \quad (7)$$

The proof is provided in Appendix A.3.

4. Learning with Multiple Complementary Labels

In this section, we first present two wrappers that enable us to use any ordinary complementary-label learning approach for learning with MCLs. Then, we present an unbiased risk estimator that learns with MCLs as a whole, and establish an estimation error bound.

4.1. Wrappers

Since ordinary complementary-label learning approaches cannot directly deal with MCLs, it is natural to ask whether there exist strategies that enable us to use any existing complementary-label learning approach for learning with MCLs. Motivated by this, we propose two wrappers that decompose each example with MCLs into multiple examples, each with a single CL. Specifically, suppose a training example with MCLs is given as $(x_i, \bar{Y}_i)$ where $\bar{Y}_i = \{\bar{y}_1, \bar{y}_2\}$. Then ordinary complementary-label learning approaches may learn from $(x_i, \bar{y}_1)$ and $(x_i, \bar{y}_2)$. According to whether decomposition happens before or after shuffling the training set, there are two decomposition strategies (wrappers) when we optimize a loss function by a stochastic optimization algorithm:

Decomposition after Shuffle. Given the shuffled training set with MCLs, in each mini-batch, we decompose each example into multiple examples, each with a single CL.

Decomposition before Shuffle. Given the training set with MCLs, we derive a new training set by decomposing each example into multiple examples, each with a single CL.
Then, we shuffle the derived training set. Both of the above decomposition strategies enable us to use arbitrary ordinary complementary-label learning approaches for learning with MCLs. However, the derived training data with many single CLs may not match the originally assumed data distribution (i.e., Eq. (2)) for complementary-label learning, since these CLs are completely derived from MCLs, while the data distribution with MCLs depends on the size of each set of MCLs. As a consequence, the learning consistency would no longer be guaranteed, even if the complementary-label learning approach inside the wrappers is originally risk-consistent or classifier-consistent. Moreover, since ordinary complementary-label learning approaches can only learn with a single CL for each example at a time and treat each example independently, the supervision information for each set of MCLs would be conceptually diluted. We demonstrate this issue in Table 1.

Table 1. Supervision information for a set of MCLs (with size $s$).

| Setting | #TP | #FP | Supervision Purity |
| --- | --- | --- | --- |
| Many single CLs | $s$ | $(k-2)s$ | $1/(k-1)$ |
| A set of MCLs | $1$ | $k-s-1$ | $1/(k-s)$ |

As shown in Table 1, there are two settings according to whether a set of MCLs is decomposed into many single CLs or not. Since all the non-complementary labels have the possibility of being the correct label, we specifically count how many times the correct label serves as a non-complementary label (denoted by #TP), and how many times the labels other than the correct label serve as a non-complementary label (denoted by #FP). The supervision purity is then calculated as (#TP)/(#TP+#FP). Clearly, the wrappers follow the setting where a set of MCLs is decomposed into many single CLs.
If the size of the set of MCLs is $s$, then #TP equals $s$, since the correct label serves as a non-complementary label $s$ times after decomposition, while the other labels serve as a non-complementary label $(k-s-1)s + s(s-1) = (k-2)s$ times in total; hence the supervision purity is $s/(s + (k-2)s) = 1/(k-1)$. However, for the setting where the set of MCLs is not decomposed, we can easily see that the correct label serves as a non-complementary label once, and the labels other than the correct label serve as a non-complementary label $k-s-1$ times; hence the supervision purity is $1/(k-s)$. These observations clearly show that the supervision information is diluted by decomposing MCLs (for $s \geq 2$), which also motivates us to treat each set of MCLs as a whole. In the following, we introduce our proposed unbiased risk estimator, which is able to learn with MCLs as a whole.

4.2. Unbiased Risk Estimator

The above example has shown that the supervision information is diluted after decomposition. The basic reason is that ordinary complementary-label learning approaches are designed by only considering the data distribution with a single CL, i.e., $\bar{p}(x, \bar{y})$. However, the data distribution with MCLs, $\bar{p}(x, \bar{Y})$, is quite different, and the wrappers fail to capture this distribution because they do not treat the MCLs of each example as a whole. To solve this problem, we propose an unbiased estimator of the original classification risk for learning with MCLs as a whole. We first relate the data distribution with ordinary labels to that with MCLs by the following lemma.

Lemma 2. The following equality holds:

$$p(x, y) = \sum\nolimits_{j=1}^{k-1} \Big( \sum\nolimits_{|\bar{Y}|=j,\, y \notin \bar{Y}} \bar{p}(x, \bar{Y}, s=j) \;-\; \frac{k-1-j}{j} \sum\nolimits_{\bar{Y} \in \bar{\mathcal{Y}}_j^y} \bar{p}(x, \bar{Y}, s=j) \Big),$$

where $\bar{\mathcal{Y}}_j^y$ is the set of all possible label sets with size $j$ that include a specific label $y \in \mathcal{Y}$, i.e., $\bar{\mathcal{Y}}_j^y := \{\bar{Y} \in \bar{\mathcal{Y}} \mid y \in \bar{Y},\, |\bar{Y}| = j\}$. The proof is provided in Appendix B.1. Based on Lemma 2, we derive an unbiased estimator of the ordinary classification risk Eq.
(1) by the following theorem.

Theorem 3. The ordinary classification risk Eq. (1) can be equivalently expressed as

$$R(f) = \sum\nolimits_{j=1}^{k-1} p(s=j)\, \bar{R}_j(f), \quad (8)$$

$$\bar{R}_j(f) := \mathbb{E}_{\bar{p}(x, \bar{Y} \mid s=j)}\big[\bar{\mathcal{L}}_j\big(f(x), \bar{Y}\big)\big], \quad (9)$$

$$\bar{\mathcal{L}}_j\big(f(x), \bar{Y}\big) := \sum\nolimits_{y \notin \bar{Y}} \mathcal{L}\big(f(x), y\big) - \frac{k-1-j}{j} \sum\nolimits_{y' \in \bar{Y}} \mathcal{L}\big(f(x), y'\big). \quad (10)$$

The proof is provided in Appendix B.2. It is easy to verify that Eq. (8) reduces to Eq. (3) when $p(s=1) = 1$; that is, our approach is a generalization of Ishida et al. (2019). Furthermore, according to Corollary 2 in Ishida et al. (2019), our approach is also a generalization of Ishida et al. (2017). Given the dataset with MCLs $\bar{D} = \{(x_i, \bar{Y}_i)\}_{i=1}^n$, we can empirically approximate $p(s=j)$ by $n_j/n$, where $n_j$ denotes the number of examples whose complementary label set has size $j$. By further taking into account Eqs. (8)-(10), we obtain the following empirical approximation of the unbiased risk estimator introduced in Theorem 3:

$$\widehat{R}(f) = \frac{1}{n} \sum\nolimits_{i=1}^{n} \Big( \sum\nolimits_{y \notin \bar{Y}_i} \mathcal{L}\big(f(x_i), y\big) - \frac{k-1-|\bar{Y}_i|}{|\bar{Y}_i|} \sum\nolimits_{y' \in \bar{Y}_i} \mathcal{L}\big(f(x_i), y'\big) \Big). \quad (11)$$

Estimation Error Bound. Here, we derive an estimation error bound for the proposed unbiased risk estimator based on Rademacher complexity (Bartlett & Mendelson, 2002). Let $\mathcal{F} \subset \{f : \mathbb{R}^d \rightarrow \mathbb{R}^k\}$ be the hypothesis class, $\hat{f} := \arg\min_{f \in \mathcal{F}} \widehat{R}(f)$ be the empirical risk minimizer, and $f^{\star} := \arg\min_{f \in \mathcal{F}} R(f)$ be the true risk minimizer. Besides, we define the functional space $\mathcal{G}_y$ for the label $y \in \mathcal{Y}$ as $\mathcal{G}_y = \{g : x \mapsto f_y(x) \mid f \in \mathcal{F}\}$. Then, we have the following theorem.

Theorem 4. Assume the loss function $\mathcal{L}(f(x), y)$ is $\rho$-Lipschitz with respect to $f(x)$ $(0 < \rho < \infty)$ for all $y \in \mathcal{Y}$. Let $C_{\mathcal{L}} = \sup_{x \in \mathcal{X}, f \in \mathcal{F}, y \in \mathcal{Y}} \mathcal{L}(f(x), y)$ and $\mathfrak{R}_n(\mathcal{G}_y)$ be the Rademacher complexity of $\mathcal{G}_y$ given the sample size $n$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,

$$R(\hat{f}) - R(f^{\star}) \leq \sum\nolimits_{j=1}^{k-1} p(s=j) \Big( 4\sqrt{2}\,(2k-2j-1)\,\rho \sum\nolimits_{y=1}^{k} \mathfrak{R}_{n_j}(\mathcal{G}_y) + \frac{C_j}{\sqrt{n_j}} \Big),$$

where $C_j = (4k-4j-2)\, C_{\mathcal{L}} \sqrt{\ln(2/\delta)/2}$ for all $j \in \{1, \ldots, k-1\}$, and $n_j$ denotes the number of examples whose complementary label set has size $j$. The definition of Rademacher complexity and the proof of Theorem 4 are provided in Appendix C.
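As a concrete illustration, the per-example loss in Eq. (10) and the empirical risk in Eq. (11) can be sketched in a few lines of plain Python (a minimal sketch with hypothetical names of our own; `L` stands for any multi-class loss called as `L(f, y)`):

```python
def mcl_loss(f, Ybar, L):
    """Per-example loss of Eq. (10): treats the whole set of
    multiple complementary labels Ybar as one supervision signal.

    f    -- list of k real-valued class scores
    Ybar -- set of complementary labels (classes the example is NOT in)
    L    -- multi-class loss function, called as L(f, y)
    """
    k, j = len(f), len(Ybar)
    positive = sum(L(f, y) for y in range(k) if y not in Ybar)
    negative = sum(L(f, y) for y in Ybar)
    return positive - (k - 1 - j) / j * negative

def empirical_risk(examples, L):
    """Empirical approximation of Eq. (11) over (scores, MCL-set) pairs."""
    return sum(mcl_loss(f, Ybar, L) for f, Ybar in examples) / len(examples)
```

When $|\bar{Y}_i| = 1$, the weight $(k-1-j)/j$ becomes $k-2$ and `mcl_loss` reduces to the single-CL formulation of Ishida et al. (2019), $\sum_{y=1}^{k} \mathcal{L}(f(x), y) - (k-1)\mathcal{L}(f(x), \bar{y})$, matching the reduction noted above.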
Theorem 4 shows that the empirical risk minimizer converges to the true risk minimizer with high probability as the number of training data approaches infinity. It is worth noting that this bound is related not only to the Rademacher complexity of the function class, but also to $s$ and $k$. This observation accords with our intuition that the learning task becomes harder if the number of classes $k$ increases or the size $s$ of the complementary label set decreases.

4.3. Practical Implementation

In this section, we present the practical implementation of our proposed formulation and improvements of the used loss functions. As described above, we have provided a general unbiased risk estimator that can use arbitrary loss functions. A question arises: can all loss functions work well in our approach? Unfortunately, the answer is negative. The original classification risk estimator in Eq. (1) includes an expectation over a non-negative loss $\mathcal{L} : \mathbb{R}^k \times [k] \rightarrow \mathbb{R}_+$, hence the expected risk and its empirical approximation are both lower-bounded by zero. However, our proposed risk estimator in Theorem 3 contains a negative term. Although the expected risk estimator is unbiased, the empirical estimator may become unbounded below if the used loss function is unbounded, thereby leading to over-fitting. Similar issues have also been reported by Ishida et al. (2019); Kiryo et al. (2017). The above analysis suggests that a bounded loss is probably better than an unbounded loss in our empirical risk estimator (i.e., Eq. (11)). To verify this conjecture, we insert bounded and unbounded losses into Eq. (11) for comparison. Note that we assume the softmax function is absorbed in each loss, and denote by $p_\theta(y|x) = \exp(f_y(x)) / \sum_{j=1}^{k} \exp(f_j(x))$ the predicted probability of the instance $x$ belonging to class $y$, where $\theta$ denotes the parameters of the model $f$. We list the compared loss functions as follows.
Categorical Cross Entropy (CCE):
$$\mathcal{L}_{\mathrm{CCE}}\big(f(x), y\big) = -\log p_\theta(y|x).$$

Mean Absolute Error (MAE):
$$\mathcal{L}_{\mathrm{MAE}}\big(f(x), y\big) = 2 - 2\,p_\theta(y|x).$$

Mean Square Error (MSE):
$$\mathcal{L}_{\mathrm{MSE}}\big(f(x), y\big) = 1 - 2\,p_\theta(y|x) + \sum\nolimits_{j=1}^{k} p_\theta(j|x)^2.$$

Generalized Cross Entropy (GCE) (Zhang & Sabuncu, 2018):
$$\mathcal{L}_{\mathrm{GCE}}\big(f(x), y\big) = \frac{1 - p_\theta(y|x)^q}{q},$$
where $q \in (0, 1]$ is a user-defined hyper-parameter. We set $q = 0.7$, as suggested by Zhang & Sabuncu (2018).

Partially Huberised Cross Entropy (PHuber-CE) (Menon et al., 2020):
$$\mathcal{L}_{\mathrm{PHuber\text{-}CE}}\big(f(x), y\big) = \begin{cases} -\log p_\theta(y|x), & \text{if } p_\theta(y|x) \geq \frac{1}{\tau}, \\ -\tau\, p_\theta(y|x) + \log \tau + 1, & \text{otherwise,} \end{cases}$$
where $\tau > 0$ is a user-defined hyper-parameter. We set $\tau = 10$, because it works well in Menon et al. (2020).

The detailed derivations of the above loss functions and their bounds are provided in Appendix D. Among these losses, CCE is unbounded while the others are bounded. We will experimentally demonstrate (Figure 1) that, when inserted into Eq. (11), bounded losses are significantly better than the unbounded loss. Furthermore, we conduct a deeper analysis of MAE, because MAE is not only bounded but also satisfies the symmetric condition (Ghosh et al., 2017), i.e., $\sum_{y=1}^{k} \mathcal{L}_{\mathrm{MAE}}(f(x), y) = 2k - 2$, which means the sum of the losses over all classes is a constant for arbitrary examples. However, is MAE good enough? Previous studies (Wang et al., 2019; Zhang & Sabuncu, 2018) have shown that MAE suffers from an optimization issue, which affects its practical performance. To alleviate this problem, we further improve MAE by proposing two upper-bound surrogate loss functions. Specifically, by using MAE in Eq.
(11), we obtain

$$\sum\nolimits_{y \notin \bar{Y}_i} \mathcal{L}_{\mathrm{MAE}}\big(f(x_i), y\big) = 2\,\mathcal{L}'_{\mathrm{MAE}}\big(f(x_i), \bar{Y}_i\big) + Z_i, \quad (12)$$

where $\mathcal{L}'_{\mathrm{MAE}}(f(x_i), \bar{Y}_i) := 1 - \sum_{j \notin \bar{Y}_i} p_\theta(j|x_i)$, and $Z_i = 2(k - |\bar{Y}_i| - 1)$ is a constant independent of $f(x_i)$.

Figure 1. Experimental results of different loss functions (MAE, MSE, GCE, PHuber-CE, CCE) for different datasets and models: (a) MNIST, linear; (b) MNIST, MLP; (c) Fashion-MNIST, linear; (d) Fashion-MNIST, MLP; (e) Kuzushiji-MNIST, linear; (f) Kuzushiji-MNIST, MLP; (g) CIFAR-10, ResNet; (h) CIFAR-10, DenseNet. Each panel plots test accuracy against training epoch (0-250). Dark colors show the mean accuracy of 5 trials and light colors show the standard deviation.

It is clear that minimizing $\mathcal{L}'_{\mathrm{MAE}}(f(x_i), \bar{Y}_i)$ is equivalent to minimizing $\sum_{y \notin \bar{Y}_i} \mathcal{L}_{\mathrm{MAE}}(f(x_i), y)$. Based on this fact, we further introduce two upper-bound surrogate loss functions of $\mathcal{L}'_{\mathrm{MAE}}$:

$$\mathcal{L}_{\mathrm{EXP}}\big(f(x_i), \bar{Y}_i\big) = \exp\Big(-\sum\nolimits_{j \notin \bar{Y}_i} p_\theta(j|x_i)\Big),$$

$$\mathcal{L}_{\mathrm{LOG}}\big(f(x_i), \bar{Y}_i\big) = -\log\Big(\sum\nolimits_{j \notin \bar{Y}_i} p_\theta(j|x_i)\Big).$$

One can easily verify that $\mathcal{L}'_{\mathrm{MAE}}$ is upper-bounded by $\mathcal{L}_{\mathrm{EXP}}$ and $\mathcal{L}_{\mathrm{LOG}}$ using the two inequalities $1 - z \leq \exp(-z)$ and $1 - z \leq -\log z$, respectively. By replacing $\mathcal{L}'_{\mathrm{MAE}}$ with $\mathcal{L}_{\mathrm{EXP}}$ or $\mathcal{L}_{\mathrm{LOG}}$ in Eq. (12), we obtain two new methods for learning with MCLs. We explain the advantage of $\mathcal{L}_{\mathrm{EXP}}$ and $\mathcal{L}_{\mathrm{LOG}}$ over $\mathcal{L}'_{\mathrm{MAE}}$ by closely examining their gradients:

$$\frac{\partial \mathcal{L}'_{\mathrm{MAE}}}{\partial \theta} = -\sum\nolimits_{j \notin \bar{Y}_i} \nabla_\theta\, p_\theta(j|x_i),$$

$$\frac{\partial \mathcal{L}_{\mathrm{EXP}}}{\partial \theta} = -w_{\mathrm{EXP}} \sum\nolimits_{j \notin \bar{Y}_i} \nabla_\theta\, p_\theta(j|x_i),$$

$$\frac{\partial \mathcal{L}_{\mathrm{LOG}}}{\partial \theta} = -w_{\mathrm{LOG}} \sum\nolimits_{j \notin \bar{Y}_i} \nabla_\theta\, p_\theta(j|x_i),$$

where $w_{\mathrm{EXP}} = \exp\big(-\sum_{j \notin \bar{Y}_i} p_\theta(j|x_i)\big)$ and $w_{\mathrm{LOG}} = \big(\sum_{j \notin \bar{Y}_i} p_\theta(j|x_i)\big)^{-1}$. From the gradients, we can clearly observe that $\mathcal{L}'_{\mathrm{MAE}}$ basically treats each example equally, while $\mathcal{L}_{\mathrm{EXP}}$ and $\mathcal{L}_{\mathrm{LOG}}$ give larger weights to difficult examples. Concretely, if $\sum_{j \notin \bar{Y}_i} p_\theta(j|x_i)$ is small, both $w_{\mathrm{EXP}}$ and $w_{\mathrm{LOG}}$ would be large.
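As a hedged illustration of this weighting behavior (function names are ours, not from the paper's released code), $\mathcal{L}'_{\mathrm{MAE}}$ and the two surrogates can be written directly in terms of the predicted probabilities; the inequalities $1 - z \leq \exp(-z)$ and $1 - z \leq -\log z$ make both surrogates upper bounds of $\mathcal{L}'_{\mathrm{MAE}}$:

```python
import math

def softmax(f):
    # p_theta(.|x): softmax over the k class scores
    m = max(f)
    exps = [math.exp(v - m) for v in f]
    total = sum(exps)
    return [e / total for e in exps]

def noncomp_mass(p, Ybar):
    # sum of predicted probabilities over all non-complementary labels
    return sum(p[j] for j in range(len(p)) if j not in Ybar)

def mae_prime(p, Ybar):
    # L'_MAE(f(x), Ybar) = 1 - sum_{j not in Ybar} p_theta(j|x)
    return 1.0 - noncomp_mass(p, Ybar)

def exp_loss(p, Ybar):
    # L_EXP: upper bound of L'_MAE via 1 - z <= exp(-z)
    return math.exp(-noncomp_mass(p, Ybar))

def log_loss(p, Ybar):
    # L_LOG: upper bound of L'_MAE via 1 - z <= -log z
    return -math.log(noncomp_mass(p, Ybar))
```

For a uniform prediction over $k = 4$ classes with $\bar{Y} = \{0\}$, the non-complementary mass is 0.75, giving $\mathcal{L}'_{\mathrm{MAE}} = 0.25$ while the surrogates are larger; as the mass shrinks (a harder example), the surrogate values and their gradient weights grow.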
In other words, $\mathcal{L}_{\mathrm{EXP}}$ and $\mathcal{L}_{\mathrm{LOG}}$ pay more attention to hard examples, i.e., those whose sum of predicted confidences over all non-complementary labels is small.

5. Experiments

In this section, we conduct extensive experiments to evaluate the performance of our proposed approaches, including the two wrappers, the unbiased risk estimator with various loss functions, and the two upper-bound surrogate loss functions.

Datasets. We use five widely used benchmark datasets, MNIST (LeCun et al., 1998), Kuzushiji-MNIST (Clanuwat et al., 2018), Fashion-MNIST (Xiao et al., 2017), 20Newsgroups (Lang, 1995), and CIFAR-10 (Krizhevsky et al., 2009), and four datasets from the UCI repository (Blake & Merz, 1998). We use four base models: a linear model, an MLP model ($d$-500-$k$), ResNet (34 layers) (He et al., 2016), and DenseNet (22 layers) (Huang et al., 2017). Detailed descriptions of these datasets and the corresponding base models are provided in Appendix E.1. To generate MCLs, we instantiate $p(s) = \binom{k}{s}/(2^k - 2)$, $\forall s \in \{1, \ldots, k-1\}$, which means $p(s)$ is the ratio of the number of label sets of size $s$ to the number of all possible label sets. For each instance $x$, we first randomly sample $s$ from $p(s)$, and then uniformly and randomly sample a complementary label set $\bar{Y}$ of size $s$ (i.e., $p(\bar{Y}) = 1/\binom{k-1}{s}$).

Approaches. We absorb five ordinary complementary-label learning approaches in the two wrappers (introduced

Table 2. Classification accuracy (mean±std) of each algorithm on the four UCI datasets using a linear model for 5 trials. The best performance among all the approaches is highlighted in boldface. In addition, we record whether the performance of our approach (the best of EXP and LOG) is statistically superior/inferior to each comparing algorithm on each dataset (paired t-test at the 0.05 significance level).
| Group | Approach | Yeast | Texture | Dermatology | Synthetic Control |
| --- | --- | --- | --- | --- | --- |
| Upper-bound Losses | EXP | 54.94±1.56% | 97.51±0.09% | 98.89±0.37% | 27.87±5.13% |
| Upper-bound Losses | LOG | **60.11±1.93%** | 98.88±0.43% | **99.46±1.14%** | **90.73±4.41%** |
| Bounded Losses | MAE | 33.07±0.37% | 85.29±7.93% | 85.39±2.58% | 23.50±2.44% |
| Bounded Losses | MSE | 58.17±1.52% | 97.59±0.16% | 97.84±1.21% | 34.20±8.69% |
| Bounded Losses | GCE | 57.56±1.56% | 97.25±0.31% | 97.53±1.81% | 23.67±3.10% |
| Bounded Losses | PHuber-CE | 55.54±1.03% | 94.89±3.28% | 95.14±2.41% | 24.71±3.18% |
| Unbounded Loss | CCE | 49.50±3.58% | 92.08±1.15% | 83.19±3.65% | 63.47±6.91% |
| Decomposition before Shuffle | GA | 27.91±5.02% | 90.93±1.34% | 36.05±9.79% | 18.12±1.74% |
| Decomposition before Shuffle | NN | 32.73±3.59% | 96.29±0.39% | 61.49±6.83% | 55.12±4.43% |
| Decomposition before Shuffle | FREE | 35.50±2.79% | 94.36±1.08% | 86.30±5.62% | 76.95±3.26% |
| Decomposition before Shuffle | PC | 53.89±3.53% | 92.68±0.81% | 96.27±3.07% | 72.63±5.86% |
| Decomposition before Shuffle | Forward | 58.15±1.54% | **98.95±0.17%** | 99.37±0.85% | 38.77±6.06% |
| Decomposition after Shuffle | GA | 28.21±1.53% | 83.66±2.27% | 42.05±7.94% | 25.46±1.28% |
| Decomposition after Shuffle | NN | 36.04±2.24% | 93.91±0.40% | 62.54±9.19% | 59.80±5.14% |
| Decomposition after Shuffle | FREE | 43.47±1.36% | 93.94±0.72% | 86.22±6.07% | 73.33±2.17% |
| Decomposition after Shuffle | PC | 54.58±2.57% | 94.19±1.21% | 95.73±3.33% | 69.53±9.01% |
| Decomposition after Shuffle | Forward | 59.46±1.16% | 97.65±0.32% | 99.03±1.33% | 43.57±5.83% |
| Partial Label (Convex Formulation) | CLPL | 55.39±1.21% | 92.07±0.88% | 99.42±0.54% | 63.57±5.46% |

Table 3. Classification accuracy (mean±std) of each algorithm on the four benchmark datasets using a linear model for 5 trials. The best performance among all the approaches is highlighted in boldface. In addition, we record whether the performance of our approach (the best of EXP and LOG) is statistically superior/inferior to each comparing algorithm on each dataset (paired t-test at the 0.05 significance level).
| Group | Approach | MNIST | Kuzushiji | Fashion | 20Newsgroups |
| --- | --- | --- | --- | --- | --- |
| Upper-bound Losses | EXP | **92.67±0.11%** | 64.23±0.33% | **84.56±0.25%** | 81.72±0.39% |
| Upper-bound Losses | LOG | 92.58±0.09% | **68.89±0.25%** | 84.42±0.16% | **84.06±0.57%** |
| Bounded Losses | MAE | 92.66±0.12% | 64.03±0.19% | 84.50±0.16% | 79.68±1.40% |
| Bounded Losses | MSE | 92.64±0.13% | 64.51±0.55% | 84.53±0.20% | 81.55±0.52% |
| Bounded Losses | GCE | 92.66±0.12% | 64.44±0.17% | 84.44±0.15% | 81.78±0.60% |
| Bounded Losses | PHuber-CE | 92.02±0.07% | 63.81±0.75% | 83.76±0.22% | 73.52±1.04% |
| Unbounded Loss | CCE | 88.23±0.19% | 62.27±0.84% | 80.25±0.29% | 63.78±0.79% |
| Decomposition before Shuffle | GA | 85.51±0.26% | 55.61±0.24% | 78.64±0.33% | 76.64±0.62% |
| Decomposition before Shuffle | NN | 88.09±0.16% | 60.54±0.23% | 80.68±0.07% | 76.00±0.37% |
| Decomposition before Shuffle | FREE | 89.35±0.14% | 65.21±0.45% | 81.22±0.11% | 68.34±0.72% |
| Decomposition before Shuffle | PC | 88.21±0.23% | 62.76±0.40% | 80.60±0.18% | 66.91±1.20% |
| Decomposition before Shuffle | Forward | 92.57±0.05% | 63.51±0.22% | 84.38±0.20% | 74.69±1.14% |
| Decomposition after Shuffle | GA | 83.16±0.22% | 56.31±0.42% | 73.37±0.10% | 66.14±0.79% |
| Decomposition after Shuffle | NN | 88.79±0.26% | 63.19±0.12% | 79.77±0.14% | 66.35±0.53% |
| Decomposition after Shuffle | FREE | 89.02±0.22% | 64.18±0.18% | 80.11±0.04% | 66.16±0.60% |
| Decomposition after Shuffle | PC | 87.76±0.17% | 61.64±0.38% | 80.58±0.17% | 65.64±0.81% |
| Decomposition after Shuffle | Forward | 92.54±0.04% | 63.69±0.14% | 84.37±0.17% | 71.98±3.41% |
| Partial Label (Convex Formulation) | CLPL | 81.85±0.27% | 55.31±0.23% | 77.26±0.10% | 81.48±0.45% |

in Section 4.1): GA, NN, and FREE (Ishida et al., 2019), PC (Ishida et al., 2017), and Forward (Yu et al., 2018). We also use an unbounded loss, CCE, and four bounded losses, MAE, MSE, GCE (Zhang & Sabuncu, 2018), and PHuber-CE (Menon et al., 2020), in our empirical estimator Eq. (11). Besides, the two upper-bound loss functions LOG and EXP are inserted into Eq. (12). In addition, we also compare with a representative partial label learning approach, CLPL (Cour et al., 2011). For all the approaches, we adopt the same base model for a fair comparison. Learning rate and weight decay are selected from $\{10^{-6}, 10^{-5}, \ldots, 10^{-1}\}$. We implement our approach using PyTorch1, and use the Adam (Kingma & Ba, 2015) optimization method with mini-batch size set to 256 and number of epochs set to 250.
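The MCL generation procedure described above — draw the set size $s$ with probability $p(s) = \binom{k}{s}/(2^k - 2)$, then draw a uniform size-$s$ subset of the $k-1$ incorrect labels — can be sketched as follows (an illustrative sketch, not the authors' released code):

```python
import random
from math import comb

def sample_mcl(y, k, rng):
    """Sample a set of multiple complementary labels for an example
    with true label y, following p(s) = C(k, s) / (2^k - 2)."""
    sizes = list(range(1, k))                    # s ranges over {1, ..., k-1}
    weights = [comb(k, s) for s in sizes]        # p(s) proportional to C(k, s)
    s = rng.choices(sizes, weights=weights)[0]
    incorrect = [c for c in range(k) if c != y]  # candidate complementary labels
    return set(rng.sample(incorrect, s))         # uniform size-s subset
```

Every returned set is nonempty, excludes the true label, and never covers all $k$ classes, so it always lies in $\bar{\mathcal{Y}}$.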
Hyperparameters for all the approaches are selected so as to maximize the accuracy on a validation set (10% of the training set) of complementarily labeled data. All the experiments are conducted on NVIDIA Tesla V100 GPUs.

¹ www.pytorch.org

Table 4. Classification accuracy (mean ± std) of each algorithm on the five benchmark datasets using neural networks for 5 trials. The best performance among all the approaches is highlighted in boldface. In addition, a mark indicates whether the performance of our approach (the best of EXP and LOG) is statistically superior/inferior to the comparing algorithm on each dataset (paired t-test at the 0.05 significance level).

| Approach | MNIST | Kuzushiji | Fashion | CIFAR-10 R | CIFAR-10 D | 20Newsgroups |
|---|---|---|---|---|---|---|
| **Upper-bound Losses** | | | | | | |
| EXP | 97.80 ± 0.06% | 88.25 ± 0.28% | 88.07 ± 0.19% | 72.49 ± 0.84% | 75.53 ± 0.58% | 77.22 ± 1.22% |
| LOG | 97.86 ± 0.13% | 88.24 ± 0.08% | 88.36 ± 0.26% | 75.38 ± 0.34% | 75.80 ± 0.62% | 79.46 ± 0.94% |
| **Bounded Losses** | | | | | | |
| MAE | 97.81 ± 0.04% | 88.11 ± 0.40% | 88.13 ± 0.23% | 65.57 ± 4.08% | 68.24 ± 5.84% | 49.83 ± 4.01% |
| MSE | 96.84 ± 0.08% | 84.97 ± 0.23% | 86.14 ± 0.04% | 63.58 ± 1.19% | 70.89 ± 0.81% | 72.19 ± 0.59% |
| GCE | 96.62 ± 0.08% | 85.02 ± 0.26% | 87.03 ± 0.20% | 68.40 ± 1.05% | 71.54 ± 0.83% | 74.96 ± 0.47% |
| PHuber-CE | 95.00 ± 0.36% | 80.66 ± 0.32% | 85.52 ± 0.18% | 59.64 ± 1.21% | 66.49 ± 0.67% | 62.63 ± 2.32% |
| **Unbounded Loss** | | | | | | |
| CCE | 88.64 ± 0.50% | 67.86 ± 1.01% | 80.97 ± 0.23% | 18.01 ± 0.63% | 44.94 ± 1.20% | 54.96 ± 0.38% |
| **Decomposition before Shuffle** | | | | | | |
| GA | 96.36 ± 0.05% | 84.35 ± 0.22% | 85.59 ± 0.30% | 69.05 ± 0.83% | 65.38 ± 1.40% | 79.06 ± 0.57% |
| NN | 96.70 ± 0.08% | 82.21 ± 0.36% | 86.29 ± 0.10% | 63.85 ± 0.74% | 64.80 ± 1.28% | 76.81 ± 0.44% |
| FREE | 88.55 ± 0.38% | 70.32 ± 0.80% | 81.17 ± 0.36% | 32.02 ± 1.69% | 39.22 ± 0.43% | 61.22 ± 1.24% |
| PC | 92.74 ± 0.17% | 73.18 ± 0.59% | 83.32 ± 0.28% | 43.16 ± 2.21% | 49.53 ± 1.18% | 65.15 ± 2.05% |
| Forward | 97.67 ± 0.04% | 87.65 ± 0.24% | 88.08 ± 0.24% | 71.92 ± 1.09% | 71.30 ± 1.16% | 77.19 ± 0.76% |
| **Decomposition after Shuffle** | | | | | | |
| GA | 92.08 ± 0.22% | 74.64 ± 0.67% | 79.73 ± 0.19% | 53.12 ± 0.97% | 56.51 ± 0.89% | 63.37 ± 1.16% |
| NN | 92.47 ± 0.14% | 73.88 ± 0.63% | 82.99 ± 0.13% | 36.79 ± 0.78% | 53.78 ± 0.92% | 65.15 ± 0.73% |
| FREE | 88.99 ± 0.39% | 70.09 ± 0.74% | 81.74 ± 0.23% | 15.16 ± 2.22% | 47.45 ± 0.98% | 50.86 ± 1.56% |
| PC | 92.94 ± 0.05% | 68.60 ± 1.32% | 82.46 ± 0.26% | 33.16 ± 0.92% | 52.23 ± 0.88% | 64.32 ± 0.86% |
| Forward | 97.49 ± 0.08% | 86.47 ± 0.39% | 87.56 ± 0.14% | 72.16 ± 0.97% | 75.23 ± 1.02% | 79.35 ± 0.82% |

Loss Comparison. Figure 1 shows the mean and standard deviation of test accuracy over 5 trials, for the bounded loss functions MAE, MSE, GCE, and PHuber-CE and the unbounded loss function CCE used in our empirical risk estimator Eq. (11). We also record the mean and standard deviation of training accuracy (the training set is evaluated with ordinary labels) over 5 trials, and put the results in Appendix E.2. As can be seen from Figure 1, all the bounded losses are significantly better than the unbounded loss CCE in our formulation. This observation clearly accords with our discussion of the over-fitting issue in Section 4.3. In addition, MAE achieves performance comparable to the other bounded losses in most cases, while it is sometimes inferior to them due to its optimization issue (Zhang & Sabuncu, 2018). Both the advantage and the disadvantage of MAE motivate us to use the upper-bound loss functions EXP and LOG for improving the classification performance.

Performance Comparison. Table 2, Table 3, and Table 4 show the experimental results of the different approaches using a linear model or neural networks on the four UCI datasets and the other five benchmark datasets. In Table 4, CIFAR-10 R and CIFAR-10 D mean that we use ResNet and DenseNet on CIFAR-10, respectively. Note that CLPL is a convex approach for partial label learning, which is specially designed for a linear model; hence CLPL does not appear in Table 4. From the three tables, we can find that, equipped with the two wrappers Decomposition before Shuffle and Decomposition after Shuffle, ordinary complementary-label learning approaches work well for learning with MCLs.
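The statistical comparisons reported in the tables rest on a paired t-test over the per-trial accuracies. As a minimal sketch (our own, not the paper's code): with 5 trials the statistic has 4 degrees of freedom, whose two-sided 5% critical value is about 2.776:

```python
import math

def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i].

    Assumes the trials are paired (same seeds/splits) and the
    differences are not all identical (sample variance > 0).
    """
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Two-sided 0.05 critical value of Student's t with n - 1 = 4 dof.
T_CRIT = 2.776

def significantly_better(ours, theirs):
    """True iff `ours` is statistically superior at the 0.05 level."""
    return paired_t_statistic(ours, theirs) > T_CRIT
```

A symmetric check with the arguments swapped gives the "statistically inferior" mark.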
However, they are significantly outperformed by the upper-bound losses in most cases, which also achieve the best performance among all the approaches on various benchmark datasets. In addition, we also study the case where the size of each complementary label set s̄ is fixed at j (i.e., p(|s̄| = j) = 1) while increasing j from 1 to k − 2. The corresponding experimental results are provided in Appendix E.3, which show that the classification accuracy of our approaches increases as j increases. This observation is clearly in accordance with our derived estimation error bound (Theorem 4), as the estimation error decreases as j increases.

6. Conclusion

In this paper, we propose a novel problem setting called learning with multiple complementary labels (MCLs), which is a generalization of complementary-label learning (Ishida et al., 2017; 2019; Yu et al., 2018). To solve this learning problem, we first design two wrappers that enable us to use arbitrary complementary-label learning approaches for learning with MCLs. However, we find that the supervision information that MCLs hold is conceptually diluted after decomposition. Therefore, we further propose an unbiased risk estimator for learning with MCLs, which processes each set of MCLs as a whole. We then theoretically derive an estimation error bound, which guarantees learning consistency. Although our risk estimator does not rely on specific models or loss functions, we show that bounded losses are generally better than unbounded losses in our empirical risk estimator. In addition, we improve the risk estimator into minimizing properly chosen upper bounds for practical implementation. Extensive experiments demonstrate the effectiveness of the proposed approaches.
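For concreteness, the fixed-size setting in the ablation above corresponds to drawing, for each example, exactly j complementary labels uniformly from the k − 1 incorrect classes. A minimal sketch of that sampling step (our own illustration; names are hypothetical):

```python
import random

def draw_mcls(true_label, k, j, rng=random):
    """Sample a set of j complementary labels for one example:
    j distinct classes the example does NOT belong to, drawn
    uniformly without replacement from the k - 1 incorrect classes."""
    incorrect = [c for c in range(k) if c != true_label]
    return set(rng.sample(incorrect, j))
```

As j grows toward k − 2, each set rules out more classes and pins the true label down more tightly, which is consistent with the estimation error bound shrinking as j increases.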
Acknowledgements

This research was supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-RP-2019-0013), National Satellite of Excellence in Trustworthy Software Systems (Award No: NSOE-TSS2019-01), and NTU. BH was partially supported by the Early Career Scheme (ECS) through the Research Grants Council of Hong Kong under Grant No. 22200720, an HKBU Tier-1 Start-up Grant, and an HKBU CSD Start-up Grant. GN and MS were supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan.

References

Bao, H., Niu, G., and Sugiyama, M. Classification from pairwise similarity and unlabeled data. In ICML, pp. 452–461, 2018.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3(11):463–482, 2002.

Blake, C. L. and Merz, C. J. UCI repository of machine learning databases, 1998. URL http://archive.ics.uci.edu/ml/index.php.

Chapelle, O., Schölkopf, B., and Zien, A. Semi-Supervised Learning. MIT Press, 2006.

Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., and Ha, D. Deep learning for classical Japanese literature. arXiv preprint arXiv:1812.01718, 2018.

Cour, T., Sapp, B., and Taskar, B. Learning from partial labels. JMLR, 12(5):1501–1536, 2011.

du Plessis, M. C., Niu, G., and Sugiyama, M. Analysis of learning from positive and unlabeled data. In NeurIPS, pp. 703–711, 2014.

du Plessis, M. C., Niu, G., and Sugiyama, M. Convex formulation for learning from positive and unlabeled data. In ICML, pp. 1386–1394, 2015.

Elkan, C. and Noto, K. Learning classifiers from only positive and unlabeled data. In KDD, pp. 213–220, 2008.

Feng, L. and An, B. Leveraging latent label distributions for partial label learning. In IJCAI, pp. 2107–2113, 2018.

Feng, L. and An, B. Partial label learning with self-guided retraining. In AAAI, pp. 3542–3549, 2019a.

Feng, L. and An, B. Partial label learning by semantic difference maximization. In IJCAI, pp. 2294–2300, 2019b.

Ghosh, A., Kumar, H., and Sastry, P. Robust loss functions under label noise for deep neural networks. In AAAI, 2017.

Han, B., Yao, J., Niu, G., Zhou, M., Tsang, I., Zhang, Y., and Sugiyama, M. Masking: A new perspective of noisy supervision. In NeurIPS, pp. 5836–5846, 2018a.

Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pp. 8527–8537, 2018b.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR, pp. 4700–4708, 2017.

Ishida, T., Niu, G., Hu, W., and Sugiyama, M. Learning from complementary labels. In NeurIPS, pp. 5644–5654, 2017.

Ishida, T., Niu, G., and Sugiyama, M. Binary classification from positive-confidence data. In NeurIPS, pp. 5917–5928, 2018.

Ishida, T., Niu, G., Menon, A. K., and Sugiyama, M. Complementary-label learning for arbitrary losses and models. In ICML, pp. 2971–2980, 2019.

Kaneko, T., Sato, I., and Sugiyama, M. Online multiclass classification based on prediction margin for partial feedback. arXiv preprint arXiv:1902.01056, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.

Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, pp. 1674–1684, 2017.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Lang, K. Newsweeder: Learning to filter netnews. In ICML, 1995.

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, Y.-F. and Liang, D.-M. Safe semi-supervised learning: A brief introduction. Frontiers of Computer Science, 13(4):669–676, 2019.

Lu, N., Niu, G., Menon, A. K., and Sugiyama, M. On the minimal supervision for training any binary classifier from only unlabeled data. In ICLR, 2019.

Lu, N., Zhang, T., Niu, G., and Sugiyama, M. Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, 2020.

Menon, A., Van Rooyen, B., Ong, C. S., and Williamson, B. Learning from corrupted binary labels via class-probability estimation. In ICML, pp. 125–134, 2015.

Menon, A. K., Rawat, A. S., Reddi, S. J., and Kumar, S. Can gradient clipping mitigate label noise? In ICLR, 2020.

Miyato, T., Maeda, S.-i., Koyama, M., and Ishii, S. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. TPAMI, 41(8):1979–1993, 2018.

Niu, G., Jitkrittum, W., Dai, B., Hachiya, H., and Sugiyama, M. Squared-loss mutual information regularization: A novel information-theoretic approach to semi-supervised learning. In ICML, pp. 10–18, 2013.

Rezaei, M., Yang, H., and Meinel, C. Recurrent generative adversarial network for learning imbalanced medical image semantic segmentation. Multimedia Tools and Applications, pp. 1–20, 2019.

Sakai, T., du Plessis, M. C., Niu, G., and Sugiyama, M. Semi-supervised classification based on classification from positive and unlabeled data. In ICML, pp. 2998–3006, 2017.

Sakai, T., Niu, G., and Sugiyama, M. Semi-supervised AUC optimization based on positive-unlabeled learning. MLJ, 107(4):767–794, 2018.

Wang, X., Kodirov, E., Hua, Y., and Robertson, N. M. Improving MAE against CCE under label noise. arXiv preprint arXiv:1903.12141, 2019.

Wei, H., Feng, L., Chen, X., and An, B. Combating noisy labels by agreement: A joint training method with co-regularization. In CVPR, 2020.

Xia, X., Liu, T., Wang, N., Han, B., Gong, C., Niu, G., and Sugiyama, M. Are anchor points really indispensable in label-noise learning? In NeurIPS, pp. 6835–6846, 2019.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Yu, X., Liu, T., Gong, M., and Tao, D. Learning with biased complementary labels. In ECCV, pp. 68–83, 2018.

Zhang, M.-L. and Yu, F. Solving the partial label learning problem: An instance-based approach. In IJCAI, pp. 4048–4054, 2015.

Zhang, T. Statistical analysis of some multi-category large margin classification methods. JMLR, 5(10):1225–1251, 2004.

Zhang, Z. and Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, pp. 8778–8788, 2018.

Zhou, Z. A brief introduction to weakly supervised learning. National Science Review, 5(1):44–53, 2018.

Zhu, X. and Goldberg, A. B. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1–130, 2009.