The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Generative-Discriminative Complementary Learning

Yanwu Xu,1 Mingming Gong,1 Junxiang Chen,1 Tongliang Liu,2 Kun Zhang,3 Kayhan Batmanghelich1
1Department of Biomedical Informatics, University of Pittsburgh, {yanwuxu, mig73, juc91, kayhan}@pitt.edu
2UBTECH Sydney AI Centre, School of Computer Science, tongliang.liu@sydney.edu.au
3Department of Philosophy, Carnegie Mellon University, kunz1@cmu.edu

Abstract

The majority of state-of-the-art deep learning methods are discriminative approaches, which model the conditional distribution of labels given input features. The success of such approaches heavily depends on high-quality labeled instances, which are not easy to obtain, especially as the number of candidate classes increases. In this paper, we study the complementary learning problem. Unlike ordinary labels, complementary labels are easy to obtain because an annotator only needs to provide a yes/no answer to a randomly chosen candidate class for each instance. We propose a generative-discriminative complementary learning method that estimates the ordinary labels by modeling both the conditional (discriminative) and instance (generative) distributions. Our method, which we call Complementary Conditional GAN (CCGAN), improves the accuracy of predicting ordinary labels and is able to generate high-quality instances in spite of weak supervision. In addition to extensive empirical studies, we theoretically show that our model can retrieve the true conditional distribution from complementarily-labeled data.

Introduction

Deep supervised learning has achieved great success in various applications such as visual recognition (Krizhevsky, Sutskever, and Hinton 2012; He et al. 2015) and natural language processing (Kim 2014). Despite the effectiveness of supervised classifiers, acquiring labeled data is often expensive and time-consuming. As a result, learning from weak supervision has been studied extensively in recent decades, including but not limited to semi-supervised learning (Kingma et al. 2014), multi-instance learning (Zhou et al. 2012), learning from side information (Hoffman, Gupta, and Darrell 2016), and learning from data with noisy labels (Natarajan et al. 2013; Liu and Tao 2016; Xia et al. 2019; Cheng et al. 2017).

In this paper, we consider a recently proposed weakly-supervised classification scenario, i.e., learning from complementary labels (Ishida et al. 2017; Yu et al. 2018). Unlike an ordinary label, a complementary label specifies a class that an input instance does not belong to. Given an instance from a class, it is laborious to choose the correct class label from many candidate classes, especially when the number of classes is relatively large or the annotator is not familiar with the characteristics of all candidate classes. However, it is less demanding and inexpensive to choose one of the incorrect classes as a complementary label for an instance. For example, when an annotator is labeling an image containing an animal that she has never seen before, she can easily identify that the animal does not belong to the usual animal classes she sees in daily life (e.g., it is not a dog). In the medical field, a doctor may not be able to identify the exact disease type given the symptoms.
However, he/she can easily provide complementary labels denoting disease types that the patient does not have.

Existing complementary learning methods modify the ordinary classification loss functions to enable learning from complementary labels. (Ishida et al. 2017) proposed a method that provides a consistent estimate of the classifier from complementarily-labeled data when the loss function satisfies a particular symmetric condition. However, this method only allows classification loss functions with certain non-convex binary losses for one-versus-all and pairwise comparison. Later, (Yu et al. 2018) proposed to use the forward loss correction technique (Patrini et al. 2017), which learns the conditional distribution $P_{Y|X}$ from complementary labels, where $X$ denotes the input features and $Y$ denotes the labels. (Ishida et al. 2018) derived an unbiased estimator of the true classification risk with arbitrary loss functions from complementarily-labeled data.

To clarify the differences between learning with ordinary and complementary labels, we define the notion of effective sample size, i.e., the number of ordinarily-labeled instances that carry the same amount of information as a given number of complementarily-labeled instances. Since complementary labels are weak labels, they carry only partial information about the ordinary labels; for instance, with $K = 10$ classes, a single complementary label rules out only one of the ten candidates. Hence, the effective sample size $n_l$ for complementary learning is much smaller than the given sample size $n$ (i.e., $n_l \ll n$). Current methods for learning with complementary labels therefore need a relatively large training set to ensure low variance when predicting ordinary labels. Although $n_l$ is small under complementary learning settings, we can still use all $n$ samples to estimate the instance distribution $P_X$. However, current complementary methods focus on modeling the conditional $P_{Y|X}$ and thus fail to account for the information hidden in $P_X$, which is essential in complementary learning. To improve the prediction performance, we propose a generative-discriminative complementary learning approach that learns both $P_{Y|X}$ and $P_{X|Y}$ in a unified framework. Our main contributions can be summarized as follows:

- We propose a Complementary Conditional Generative Adversarial Net (CCGAN), which simultaneously learns $P_{Y|X}$ and $P_{X|Y}$ from complementary labels. Because the estimate of $P_{X|Y}$ benefits from $P_X$, it provides constraints on $P_{Y|X}$ and helps reduce its estimation variance.
- Theoretically, we show that our CCGAN model is guaranteed to learn $P_{X|Y}$ from complementarily-labeled data.
- Empirically, we conduct comprehensive experiments on benchmark datasets, including MNIST, CIFAR10, CIFAR100, and VGG Face, demonstrating that our model gives accurate classification predictions and generates high-quality images. The code is at https://github.com/xuyanwu/Complementary GAN.

Related Works

Generative Adversarial Nets. Generative Adversarial Nets (GANs) are a class of implicit generative models learned by adversarial training (Goodfellow et al. 2014). With the development of new network architectures (e.g., (Brock, Donahue, and Simonyan 2019)) and stabilizing techniques (e.g., (Miyato et al. 2018)), GANs generate high-quality images that are indistinguishable from real ones. Conditional GANs (CGANs) (Mirza and Osindero 2014) extend GANs to generate images given specific labels, which can be used to model the class conditional $P_{X|Y}$ (e.g., AC-GAN (Odena, Olah, and Shlens 2017), Projection cGAN (Miyato and Koyama 2018), and TAC-GAN (Gong et al. 2019)).
However, training CGANs requires ordinary labels for the images, which are not available under the complementary learning setting. To the best of our knowledge, our proposed CCGAN is the first conditional GAN trained with complementary labels. The works most related to ours are the robust conditional GAN approaches that aim to learn a conditional GAN from labels corrupted by random noise (Thekumparampil et al. 2018; Kaneko, Ushiku, and Harada 2018). However, by utilizing complementary labels, our method generates better-quality images and makes more accurate predictions.

Semi-Supervised Learning. Under the semi-supervised learning setting, we are provided a relatively small number of labeled data and plenty of unlabeled data. The basic assumption of semi-supervised methods is that the knowledge of $P_X$ gained from unlabeled data carries useful information for inferring $P_{Y|X}$. This principle has been implemented in various forms, such as co-training (Blum and Mitchell 1998) and generative modeling (Odena 2016; Kumar, Sattigeri, and Fletcher 2017). Inspired by the commonality between complementary learning and semi-supervised learning, i.e., that more data are available to estimate $P_X$ than $P_{Y|X}$, we propose to make use of $P_X$ to help infer $P_{Y|X}$ in complementary learning.

Background

In this section, we first introduce the concept of learning from so-called complementary labels. Then, we discuss a state-of-the-art discriminative complementary learning approach (Yu et al. 2018), which is the most relevant to our method.

Problem Setup. Let two random variables $X$ and $Y$ denote the features and the labels, respectively. The goal of discriminative learning is to infer a decision function (classifier) from an independent and identically distributed training set $\{(x_i, y_i)\}_{i=1}^{n} \subset \mathcal{X} \times \mathcal{Y}$ drawn from an unknown joint distribution $P_{XY}$, where $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{1, \dots, K\}$. The optimal function $f^*$ can be learned by minimizing the expected risk $R(f) = \mathbb{E}_{(X,Y)\sim P_{XY}}[\ell(f(X), Y)]$, where $\mathbb{E}$ denotes the expectation and $\ell$ denotes a classification loss function. Because $P_{XY}$ is unknown, we usually approximate $R(f)$ by its empirical estimate $R_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i)$.

In the complementary learning setting, for each sample $x$ we are given only a complementary label $\bar{y} \in \mathcal{Y} \setminus \{y\}$, which specifies a class that $x$ does not belong to. That is to say, our goal is to learn an $f$ that minimizes the classification risk $R(f)$ from complementarily-labeled data $\{(x_i, \bar{y}_i)\}_{i=1}^{n} \subset \mathcal{X} \times \mathcal{Y}$ drawn from an unknown distribution $P_{X\bar{Y}}$, where $\bar{Y}$ denotes the random variable for complementary labels. The ordinary loss function $\ell(\cdot, \cdot)$ cannot be used, since we do not have access to the ordinary labels (the $y_i$'s). In the following, we explain how discriminative learning can be extended to such scenarios.

Discriminative Complementary Learning. Existing Discriminative Complementary Learning (DCL) methods modify the ordinary classification loss function $\ell$ into a complementary classification loss $\bar{\ell}$ to provide a consistent estimate of $f$. Various loss functions have been considered in the literature, such as one-vs-all ramp/sigmoid losses (Ishida et al. 2017), pairwise-comparison ramp/sigmoid losses (Ishida et al. 2017), and the cross-entropy loss (Yu et al. 2018). Here we briefly review a recent method that modifies the cross-entropy loss for deep learning with complementary labels (Yu et al. 2018). The general idea is to view the ordinary label $Y$ as a latent random variable.
Suppose the classifier has the form $f(X) = \arg\max_{i \in [K]} g_i(X)$, where $g_i(X)$ is an estimate of $P(Y = i \mid X)$. The loss function for complementary labels is defined as $\bar{\ell}(f(X), \bar{Y}) = \ell(M^\top g, \bar{Y})$, where $g = (g_1(X), \dots, g_K(X))^\top$ and $M$ is the transition matrix satisfying

$$P(\bar{Y} = j \mid X) = \sum_{i \neq j} \underbrace{P(\bar{Y} = j \mid Y = i)}_{M_{ij}} P(Y = i \mid X). \quad (1)$$

(Ishida et al. 2017; 2018) assumed the uniform setting, in which $M$ takes $0$ on the diagonal and $\frac{1}{K-1}$ off the diagonal. (Yu et al. 2018) relaxed this assumption by allowing other off-diagonal values and proposed a method to estimate $M$ from data. It has been shown in (Yu et al. 2018) that the classifier $\bar{f}_n$ that minimizes the empirical estimate of $R(f)$, i.e.,

$$\bar{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \bar{\ell}(f(x_i), \bar{y}_i), \quad (2)$$

converges to the optimal classifier $f^*$ as $n \to \infty$.

Proposed Method

In this section, we present the motivation and details of our generative-discriminative complementary learning method. First, we demonstrate why generative modeling is valuable for learning from complementary labels. Second, we present our Complementary Conditional GAN (CCGAN) model, which is trained on complementarily-labeled data, and provide theoretical guarantees. Finally, we discuss several practical factors that are crucial for training our model reliably.

Motivation. Existing discriminative complementary learning approaches are guaranteed to yield optimal classifiers given a sufficiently large sample size. However, due to the uncertainty introduced by the complementary labels, the effective sample size is much smaller than the sample size $n$. If we had access to samples with ordinary labels $\{(x_i, y_i)\}_{i=1}^{n}$, we could learn the classifier $f_n$ by minimizing $R_n(f)$. Since knowing the ordinary label is equivalent to having all $K-1$ complementary labels, we could also learn $f_n$ from ordinary labels by minimizing the empirical risk

$$R'_n(f) = \frac{1}{n(K-1)}\sum_{i=1}^{n}\sum_{k=1}^{K-1} \bar{\ell}(f(x_i), \bar{y}_{ik}), \quad (3)$$

where $\bar{y}_{ik}$ is the $k$-th complementary label for the $i$-th example. In practice, since we only have one complementary label per instance, we minimize $\bar{R}_n(f)$ as in Eq. (2) rather than $R'_n(f)$. Note that $\bar{R}_n(f)$ approximates $R'_n(f)$ by randomly picking one complementary label for each example, which implies that the effective sample size is roughly $n/(K-1)$. In other words, although every instance is given a complementary label, the accuracy of a classifier learned by minimizing $\bar{R}_n$ is close to that of a classifier learned from $n/(K-1)$ examples with ordinary labels.

Because the effective sample size is usually much smaller than the actual sample size, complementary learning resembles semi-supervised learning, where only a small proportion of instances are associated with ordinary labels. In semi-supervised learning, $P_X$ can be estimated from more (unlabeled) samples than $P_{Y|X}$, which requires labels to estimate. Therefore, modeling $P_X$ is beneficial because it allows us to take advantage of unlabeled data. This justifies the motivation for introducing a generative term in complementary learning. A natural way to utilize $P_X$ is to model the class conditional $P_{X|Y}$: $P_X$ imposes an indirect constraint on $P_{X|Y}$, since $P_X = \int P(X \mid Y = y) P(y)\, dy$. Therefore, a more accurate estimate of $P_X$ will improve the estimate of $P_{X|Y}$, and thus of $P_{Y|X}$.
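Before moving to the GAN-based model, it may help to see the discriminative baseline in code. The following is a minimal PyTorch sketch of the forward-corrected complementary cross-entropy of Eqs. (1)-(2) under the uniform assumption on $M$; the function names are our own, not those of any released implementation.

```python
import torch
import torch.nn.functional as F

def uniform_transition_matrix(num_classes: int) -> torch.Tensor:
    """Uniform M: zero on the diagonal, 1/(K-1) elsewhere."""
    K = num_classes
    M = torch.full((K, K), 1.0 / (K - 1))
    M.fill_diagonal_(0.0)
    return M

def complementary_ce_loss(logits: torch.Tensor,
                          comp_labels: torch.Tensor,
                          M: torch.Tensor) -> torch.Tensor:
    """Forward-corrected cross-entropy: push M^T g toward the
    observed complementary label, cf. Eq. (1)."""
    g = F.softmax(logits, dim=1)   # g_i estimates P(Y = i | X)
    q_bar = g @ M                  # (M^T g)_j = sum_i M_ij P(Y = i | X)
    return F.nll_loss(torch.log(q_bar + 1e-8), comp_labels)
```

For a batch of size $B$, `logits` has shape $(B, K)$ and `comp_labels` holds the observed complementary class indices; averaging this loss over the training set corresponds to the empirical risk in Eq. (2).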
Complementary Conditional GAN (CCGAN). Given the recent advances in generative modeling with (conditional) GANs, we propose to model $P_{X|Y}$ with a conditional GAN. A conditional GAN learns a function $G(Z, Y)$ that generates samples from a conditional distribution $Q_{X|Y}$, where a neural network parameterizes the generator function and $Z$ is a random sample drawn from a canonical distribution $P_Z$. To learn the parameters, we can minimize a divergence between $Q_{XY}$ and $P_{XY}$ by solving the following optimization:

$$\min_G \max_D \ \mathbb{E}_{(X,Y)\sim P_{XY}}[\varphi(D(X, Y))] + \mathbb{E}_{Z\sim P_Z, Y\sim P_Y}[\varphi(1 - D(G(Z, Y), Y))], \quad (4)$$

where $\varphi$ is a function of choice and $D$ is the discriminator. However, the conditional GAN framework cannot be directly used for our purpose, for two reasons: 1) the first term in Eq. (4) cannot be evaluated directly, because we do not have access to the ordinary labels; 2) the conditional GAN only generates $X$'s and does not infer the ordinary labels. A straightforward solution would be to generate $(x, y)$ pairs from the learned conditional GAN model and to train a separate classifier on the generated data. However, such a two-step solution results in sub-optimal performance.

To enable generative-discriminative complementary learning, we propose a Complementary Conditional GAN (CCGAN) by extending the TAC-GAN (Gong et al. 2019) framework to deal with complementarily-labeled data. The model structures of GAN, TAC-GAN, and our CCGAN are shown in Figure 1. TAC-GAN decomposes the joint distributions as $P_{XY} = P_{Y|X} P_X$ and $Q_{XY} = Q_{Y|X} Q_X$ and matches the conditional and marginal distributions separately. The marginals $P_X$ and $Q_X$ are matched using the adversarial loss (Goodfellow et al. 2014), and $P_{Y|X}$ and $Q_{Y|X}$ are matched by sharing a classifier with probabilistic outputs. However, $P_{Y|X}$ is not accessible in the complementary setting, since the ordinary labels are not observed; hence $P_{Y|X}$ and $Q_{Y|X}$ cannot be directly matched as in TAC-GAN. Fortunately, we can make use of the relation between $P_{Y|X}$ and $P_{\bar{Y}|X}$ (Eq. (1)) and propose a new loss matching $P_{Y|X}$ and $Q_{Y|X}$ in the complementary setting.

Figure 1: Model structures of the original GAN, TAC-GAN, and our CCGAN.

Specifically, we learn our CCGAN using the following objective:

$$\min_{G,C}\ \max_{D,C^{mi}}\ \underbrace{\mathbb{E}_{X\sim P_X}[\varphi(D(X))] + \mathbb{E}_{Z\sim P_Z, Y\sim P_Y}[\varphi(1 - D(G(Z, Y)))]}_{(a)} + \underbrace{\mathbb{E}_{(X,\bar{Y})\sim P_{X\bar{Y}}}[\ell(\bar{Y}, \bar{C}(X))]}_{(b)} + \underbrace{\mathbb{E}_{Z\sim P_Z, Y\sim P_Y}[\ell(Y, C(G(Z, Y)))]}_{(c)} - \mathbb{E}_{Z\sim P_Z, Y\sim P_Y}[\ell(Y, C^{mi}(G(Z, Y)))], \quad (5)$$

where $\ell$ is the cross-entropy loss, $C$ is a function modeled by a neural network with a final softmax layer producing class-probability outputs, $\bar{C}(X) = M^\top C(X)$, and $C^{mi}$ is another neural-network function with class-probability outputs. From the objective, we can see that our method naturally combines generative and discriminative components in a unified framework. Specifically, component (b) performs pure discriminative complementary learning on the complementarily-labeled data (it only learns $C$), while components (a) and (c) perform generative and discriminative learning simultaneously (they learn both $G$ and $C$). The three components in Eq. (5) correspond to the following three divergences: 1) component (a) corresponds to the Jensen-Shannon divergence between $P_X$ and $Q_X$; 2) component (b) corresponds to the KL divergence between $P_{\bar{Y}|X}$ and $Q'_{\bar{Y}|X}$; and 3) component (c) corresponds to the KL divergence between $Q_{Y|X}$ and $Q'_{Y|X}$, where $Q'_{Y|X}$ is the conditional distribution of ordinary labels given features modeled by $C$, and $Q'_{\bar{Y}|X}$ is the conditional distribution of complementary labels given features implied by $Q'_{Y|X}$ through the relation $Q'_{\bar{Y}|X} = M^\top Q'_{Y|X}$.
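To make Eq. (5) concrete, here is a hedged PyTorch sketch of how one training step might assemble the objective. We substitute the standard non-saturating binary cross-entropy for the generic $\varphi$; the network stubs (`G`, `D`, `C`, `C_mi`) and helper names are assumptions of ours rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def ccgan_step_losses(x_real, ybar_real, G, D, C, C_mi, M, prior_y, z_dim):
    """Assemble the terms of Eq. (5) for one minibatch. Returns two losses:
    one minimized w.r.t. D and C_mi, one minimized w.r.t. G and C."""
    B = x_real.size(0)
    z = torch.randn(B, z_dim)
    y_fake = torch.multinomial(prior_y, B, replacement=True)  # Y ~ P_Y
    x_fake = G(z, y_fake)
    ones, zeros = torch.ones(B, 1), torch.zeros(B, 1)

    # (a) adversarial terms matching P_X and Q_X
    loss_d = F.binary_cross_entropy_with_logits(D(x_real), ones) \
           + F.binary_cross_entropy_with_logits(D(x_fake.detach()), zeros)
    loss_g_adv = F.binary_cross_entropy_with_logits(D(x_fake), ones)

    # (b) complementary loss on real data: Cbar(x) = M^T softmax(C(x))
    q_bar = F.softmax(C(x_real), dim=1) @ M
    loss_comp = F.nll_loss(torch.log(q_bar + 1e-8), ybar_real)

    # (c) ordinary-label loss on generated data, tying C to G
    loss_cls_fake = F.cross_entropy(C(x_fake), y_fake)

    # twin classifier C_mi (TAC-GAN): C_mi fits the fake data while G
    # plays against it, hence the opposite signs
    loss_cmi = F.cross_entropy(C_mi(x_fake.detach()), y_fake)
    loss_g_mi = -F.cross_entropy(C_mi(x_fake), y_fake)

    return loss_d + loss_cmi, loss_g_adv + loss_comp + loss_cls_fake + loss_g_mi
```

In an actual training loop, the first returned loss would be minimized over the parameters of $D$ and $C^{mi}$ and the second over those of $G$ and $C$, realizing the min-max structure of Eq. (5).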
The following theorem demonstrates that minimizing these three divergences in our objective effectively reduces the divergence between $Q_{XY}$ and $P_{XY}$.

Theorem 1. Let $P_{XY}$ and $Q_{XY}$ denote the data distribution and the distribution implied by our model, respectively. Let $Q'_{Y|X}$ ($Q'_{\bar{Y}|X}$) denote the conditional distribution of ordinary (complementary) labels given features induced by the parametric model $C$. If $M$ is full rank, we have

$$d_{TV}(P_{XY}, Q_{XY}) \le 2 c_1 \sqrt{d_{JS}(P_X, Q_X)} + c_2 \left( \sqrt{d_{KL}(P_{\bar{Y}|X}, Q'_{\bar{Y}|X})} + \sqrt{d_{KL}(Q_{Y|X}, Q'_{Y|X})} \right), \quad (6)$$

where $d_{TV}$ is the total variation distance, $d_{JS}$ is the Jensen-Shannon divergence, $d_{KL}$ is the KL divergence, and $c_1$ and $c_2$ are two constants.

A proof of Theorem 1 is provided in Section S1 of the supplementary file. An illustrative figure showing the relations between the quantities in Theorem 1 is provided in Section S2 of the supplementary file.

Practical Considerations

Estimating the Prior $P_Y$. In our CCGAN model, we need to sample ordinary labels $y$ from the prior distribution $P_Y$, which must be estimated from complementary labels. Let $P_{\bar{Y}} = [P_{\bar{Y}}(\bar{Y}=1), \dots, P_{\bar{Y}}(\bar{Y}=K)]^\top$ be the vector of complementary-label probabilities and $P_Y = [P_Y(Y=1), \dots, P_Y(Y=K)]^\top$ the vector of true-label probabilities. We estimate $P_Y$ by solving the following optimization:

$$\min_{\hat{P}_Y} \ \| P_{\bar{Y}} - M^\top \hat{P}_Y \|^2, \quad \text{s.t.} \ \| \hat{P}_Y \|_1 = 1 \ \text{and} \ \hat{P}_Y[i] \ge 0. \quad (7)$$

This is a standard quadratic programming (QP) problem and can easily be solved with a QP solver.

Estimating $M$. If the annotator may choose to assign either an ordinary label or a complementary label to each instance, the matrix $M$ is unknown because of the possibly non-uniform selection of complementary labels. (Yu et al. 2018) provided an anchor-based method to estimate $M$, and we follow the same technique; please refer to (Yu et al. 2018) for details.

Incorporating Unlabeled Data. In practice, we may have access to additional unlabeled data. We can readily incorporate such data to improve the estimation of the first term in Eq. (5), which further improves the learning of $G$ through the second term in Eq. (5) and eventually improves the classification performance.

Experiments

To demonstrate the effectiveness of our method, we present a number of experiments examining its different aspects. After introducing the implementation details, we evaluate our method on MNIST (LeCun and Cortes 2010), CIFAR10 and CIFAR100 (Krizhevsky, Nair, and Hinton), and VGGFACE2 (Cao et al. 2018). We compare the classification accuracy of our CCGAN with the state-of-the-art Discriminative Complementary Learning (DCL) method (Yu et al. 2018) and show the capability of CCGAN to generate good-quality class-conditional images from complementarily-labeled data. In addition, ablation studies on MNIST give a more detailed analysis of our method. We also report the Inception Score and Fréchet Inception Distance (FID) to measure the generative performance of our model; the results are shown in Section S3.

Implementation Details

Label Generation. All four datasets have ordinary class labels, which allows us to generate complementary labels to evaluate our method. Following the procedure in (Ishida et al. 2017), the label for each image is obtained by randomly picking a candidate class and asking the labeler to answer a yes/no question. In this case, the candidate classes are uniformly assigned to each image, and therefore the transition matrix $M$ satisfies $M_{ij} = 1/(K-1)$ for $i \neq j$ and $M_{ij} = 0$ for $i = j$. A minimal sketch of this sampling scheme follows.
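This is a small PyTorch sketch of the uniform complementary-label sampling described above; the function name is our own.

```python
import torch

def draw_complementary_labels(y: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Replace each ordinary label y_i by a class drawn uniformly from the
    other K-1 classes, i.e., M_ij = 1/(K-1) for i != j and M_ii = 0."""
    offsets = torch.randint(1, num_classes, y.shape)  # uniform over {1, ..., K-1}
    return (y + offsets) % num_classes                # shifted, so never equal to y
```

For example, with `num_classes = 10`, a batch of MNIST labels is mapped to complementary labels distributed uniformly over the nine incorrect digits.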
In practice, data are usually biased, and annotators tend to make biased choices based on their experience; thus the transition matrix $M$ may itself be biased (Yu et al. 2018). For uniform $M$, we assume $M$ is known. For biased $M$, we consider both the case where the true $M$ is given and the case where $M$ must be estimated at training time. Note that when generating the complementary data, $M$ is always known.

Training Details. We implemented our CCGAN model in PyTorch and trained it end-to-end: the classifier and the GAN discriminator share the convolutional layers from bottom to neck, except for the final fully-connected softmax layer and the mutual-information learner. To train CCGAN, we optimized the full objective in Eq. (5) using Adam (Kingma and Ba 2014) with learning rate $2 \times 10^{-4}$, $\beta_1 = 0.0$, and $\beta_2 = 0.999$ for both the $D$ and $G$ networks, taking 2 steps of $D$ for every step of $G$ in each iteration, for 10,000 iterations in total (a sketch of this configuration appears after the MNIST results below). To train the baseline DCL model, we applied the same training strategy as (Yu et al. 2018) for all datasets. For the additional VGGFACE2 dataset, we applied the same training settings as for CIFAR100. We adopted data augmentation for all datasets except MNIST: we first resized all images to 32x32 resolution, employed random crops of size 28x28, and then applied zero-padding to return the images to 32x32 resolution.

MNIST. We first evaluate our model on MNIST, a handwritten-digit recognition dataset containing 60K training images and 10K testing images of size 32x32. We chose LeNet-5 as the network structure for the DCL method and for the $C$ network in our CCGAN, and employed the DCGAN network (Radford, Metz, and Chintala 2015) as the backbone of our CCGAN. Due to the simplicity of MNIST, the accuracy of complementary learning is close to that of learning with ordinary labels if we use all 60K training samples. Therefore, we sampled a subset of 6K images as our basic sampling set $S$ for training. In the experiments, we evaluate all methods under different sample sizes. Specifically, we randomly re-sampled subsets of $r_l \times 6\text{K}$ samples, where $r_l = 0.1, 0.2, \dots, 1$, and trained all methods on these subsets. Classification accuracy was evaluated on the 10K test set. We report results under the following settings: 1) we use only samples with complementary labels, ignoring all ordinary labels, to train our CCGAN and the baseline DCL; 2) we also train an ordinary classifier for which all labeled data are provided with ordinary labels (Oracle). This classifier is trained with the strongest supervision possible, representing the best achievable classification performance.

Figure 2: Test accuracy on (a) the MNIST dataset and (b) the CIFAR10 dataset. The x-axis represents the proportion $r_l$ of labeled data in the training set $S$.

The results are shown in Figure 2 (a). Our CCGAN outperforms DCL under all sample sizes, and the gap widens as the sample size shrinks, where our method outperforms DCL by a large margin. The results demonstrate that generative-discriminative modeling is advantageous over purely discriminative modeling for complementary learning. Figure 3 (a) shows the generated images from TAC-GAN (Oracle) and our CCGAN. Our CCGAN generates high-quality digit images, suggesting that CCGAN is able to learn $P_{X|Y}$ very well from complementarily-labeled data.
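As referenced in the Training Details, the following is a minimal sketch of the augmentation pipeline, the optimizer configuration, and the 2:1 update schedule; the stand-in networks and placeholder losses are simplifications of ours, not the released code.

```python
import torch
from torch import nn
from torchvision import transforms

# Augmentation (all datasets except MNIST): resize to 32x32,
# random-crop to 28x28, then zero-pad back to 32x32.
augment = transforms.Compose([
    transforms.Resize(32),
    transforms.RandomCrop(28),
    transforms.Pad(2),          # 28 + 2 + 2 = 32
    transforms.ToTensor(),
])

# Stand-in networks; the real D and C share convolutional layers (see text).
G = nn.Linear(128, 32 * 32)
D = nn.Linear(32 * 32, 1)

# Adam with lr = 2e-4, beta1 = 0.0, beta2 = 0.999 for both D and G.
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.0, 0.999))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.0, 0.999))

for it in range(10_000):        # 10,000 iterations in total
    for _ in range(2):          # two D updates ...
        opt_d.zero_grad()
        d_loss = -D(torch.randn(64, 32 * 32)).mean()  # placeholder loss
        d_loss.backward()
        opt_d.step()
    opt_g.zero_grad()           # ... followed by one G update
    g_loss = -D(G(torch.randn(64, 128))).mean()       # placeholder loss
    g_loss.backward()
    opt_g.step()
```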
CIFAR10. We next evaluate our method on the CIFAR10 dataset, which consists of 10 classes of 32x32 RGB images, with 60K training samples and 10K test samples. We deploy ResNet18 (He et al. 2015) as the structure of the $C$ network in our model. Since training GANs on the CIFAR10 dataset is unstable, we utilize the recent conditional structure BigGAN (Brock, Donahue, and Simonyan 2018) as the CCGAN backbone. Unless otherwise mentioned, the remaining experiments use the same settings. We evaluate all methods following the same procedure used for MNIST. The results are shown in Figure 2 (b). Again, our method consistently outperforms DCL across sample sizes. Figure 3 (b) shows the generated images from TAC-GAN (Oracle) and our CCGAN; CCGAN successfully learns the appearance of each class from complementary labels.

Figure 3: Synthetic results from TAC-GAN (Oracle) and our CCGAN on (a) MNIST, (b) CIFAR10, (c) CIFAR100, and (d) VGGFACE100.

| Dataset | Method | $r_l$=0.2 | 0.4 | 0.6 | 0.8 | 1.0 |
|---|---|---|---|---|---|---|
| VGGFACE100 | Ordinary label (Oracle) | 0.673 | 0.804 | 0.870 | 0.891 | 0.917 |
| VGGFACE100 | DCL | 0.378 | 0.685 | 0.802 | 0.849 | 0.884 |
| VGGFACE100 | CCGAN | 0.447 | 0.728 | 0.822 | 0.865 | 0.896 |
| CIFAR100 | Ordinary label (Oracle) | 0.439 | 0.804 | 0.870 | 0.891 | 0.917 |
| CIFAR100 | DCL | 0.252 | 0.452 | 0.561 | 0.609 | 0.651 |
| CIFAR100 | CCGAN | 0.320 | 0.520 | 0.571 | 0.632 | 0.660 |

Table 1: Test accuracy on the VGGFACE100 and CIFAR100 datasets.

CIFAR100 and VGGFACE100. Finally, we evaluate our method on CIFAR100 and VGGFACE2. Unlike CIFAR10, the CIFAR100 dataset contains 100 classes, with 500 training images per class on average and 10,000 testing images over the 100 classes in total. VGGFACE2 is a large-scale face recognition dataset whose images have large variations in pose, age, illumination, ethnicity, and profession. The number of images per person (class) varies from 87 to 843, with an average of 362 images per person. We randomly sampled 100 classes to construct a dataset for evaluating our method, selecting 80% of the data as the training set $S$ and the remaining 20% as the testing set. Since our CCGAN model can only generate fixed-size images, we re-scaled all training images to 32x32. Because the number of classes is relatively large, the effective labeled sample size is approximately $n/99$, where $n$ is the total sample size. With such limited supervision, neither DCL nor our CCGAN converges. We therefore applied the complementary-label generation approach in (Yu et al. 2018), which assumes that only a small subset of candidate classes can be chosen as complementary labels. Specifically, we randomly selected 10 candidate classes as the potential complementary labels for each class and assigned them uniform probabilities. We used the same evaluation procedure as for MNIST and CIFAR10. The classification accuracy is reported in Table 1. Our method outperforms DCL by 5% when the proportion of labeled data is smaller than 0.3 and is slightly better than DCL when the proportion is larger than 0.5. Figure 3 (c) shows the generated images from TAC-GAN (Oracle) and our CCGAN; CCGAN generates images that are visually similar to the real images for each person.

Biased M Training. Following (Yu et al. 2018), we also evaluate the biased transition matrix $M$ setting. At training time, we test two cases: 1) the true $M$ used for generating the data is known; 2) $M$ cannot be acquired and must be estimated. For unknown $M$, we follow the same settings as (Yu et al. 2018) and apply the same anchor-based method to estimate $M$.
The other training settings are the same as in the experiments above. The results are shown in Table 2.

| Dataset | Method | True $M$: $r_l$=0.2 | 0.6 | 1.0 | Estimated $M$: $r_l$=0.2 | 0.6 | 1.0 |
|---|---|---|---|---|---|---|---|
| MNIST | DCL | 0.675 | 0.866 | 0.880 | 0.563 | 0.787 | 0.894 |
| MNIST | CCGAN | 0.839 | 0.908 | 0.918 | 0.773 | 0.837 | 0.916 |
| CIFAR10 | DCL | 0.413 | 0.658 | 0.724 | 0.282 | 0.624 | 0.713 |
| CIFAR10 | CCGAN | 0.559 | 0.767 | 0.815 | 0.440 | 0.740 | 0.757 |
| CIFAR100 | DCL | 0.2814 | 0.582 | 0.663 | 0.176 | 0.381 | 0.574 |
| CIFAR100 | CCGAN | 0.320 | 0.621 | 0.664 | 0.206 | 0.445 | 0.589 |
| VGGFACE100 | DCL | 0.461 | 0.769 | 0.863 | 0.161 | 0.660 | 0.836 |
| VGGFACE100 | CCGAN | 0.533 | 0.805 | 0.866 | 0.174 | 0.681 | 0.850 |

Table 2: Test accuracy on MNIST, CIFAR10, CIFAR100, and VGGFACE100 when $M$ is biased.

Figure 4: Test accuracy. The x-axis denotes the number of assigned complementary labels per image; a classifier trained on ordinarily-labeled data serves as the Oracle.

Ablation Study

Here we conduct ablation studies on MNIST to examine the details of our approach and validate possible extensions.

Multiple Labels. In this experiment, we use an intuitive strategy to verify the effectiveness of generative modeling for complementary learning. In ordinary supervised learning, discriminative models are usually preferred over generative models because estimating the high-dimensional $P_{X|Y}$ is difficult. To demonstrate the importance of generative modeling in complementary learning, we assign multiple complementary labels to each image and observe how the performance changes with the number of complementary labels. The classification accuracy is shown in Figure 4. The accuracies of our CCGAN and DCL both increase with the number of complementary labels. When the number of complementary labels per image is large, DCL performs better than our CCGAN because the supervision information is sufficient. However, in practice, the number of complementary labels per instance is typically small, usually one. In this case, the advantage of generative modeling is evident, as demonstrated by the superior performance of our CCGAN over DCL.

Semi-Supervised Learning. In practice, we might have easier access to unlabeled data, which can be incorporated into our model to perform semi-supervised complementary learning. On the MNIST dataset, we used the additional 90% of the data as unlabeled data to improve the estimation of the first term in our objective, Eq. (5). We denote this semi-supervised method as Semi-supervised Complementary Conditional GAN (SCCGAN). The classification accuracy w.r.t. different proportions of labeled data is shown in Figure 5. SCCGAN further improves the accuracy over CCGAN thanks to the incorporation of unlabeled data.

Figure 5: Test accuracy of SCCGAN.

Conclusion

We studied the limitation of complementary learning as a weakly-supervised learning problem, in which the effective supervised information is much smaller than the sample size. To address this problem, we proposed a generative-discriminative model that learns a better data distribution as a strategy to boost the performance of the classifier. We built a conditional GAN model (CCGAN) that learns a generative model conditioned on ordinary class labels from complementarily-labeled data, unifying generative and discriminative modeling in one framework. Our method shows superior classification performance on several datasets, including MNIST, CIFAR10, CIFAR100, and VGGFACE100. Our model also generates high-quality synthetic images by utilizing complementarily-labeled data. In addition, we provided a theoretical analysis showing that our model can recover the true conditional distribution from complementarily-labeled data.
Acknowledgments

This work was partially supported by NIH Award Number 1R01HL141813-01, NSF 1839332 Tripod+X, SAP SE, and Australian Research Council Projects DP-180103424 and DE190101473. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research. We are also grateful for the computational resources provided by Pittsburgh Supercomputing grant number TG-ASC170024.

References

Blum, A., and Mitchell, T. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 92-100. ACM.

Brock, A.; Donahue, J.; and Simonyan, K. 2018. Large scale GAN training for high fidelity natural image synthesis. CoRR abs/1809.11096.

Brock, A.; Donahue, J.; and Simonyan, K. 2019. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations.

Cao, Q.; Shen, L.; Xie, W.; Parkhi, O. M.; and Zisserman, A. 2018. VGGFace2: A dataset for recognising faces across pose and age. In International Conference on Automatic Face and Gesture Recognition.

Cheng, J.; Liu, T.; Ramamohanarao, K.; and Tao, D. 2017. Learning with bounded instance- and label-dependent label noise. arXiv abs/1709.03768.

Gong, M.; Xu, Y.; Li, C.; Zhang, K.; and Batmanghelich, K. 2019. Twin auxiliary classifiers GAN. arXiv preprint arXiv:1907.02690.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, 2672-2680. Curran Associates, Inc.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep residual learning for image recognition. CoRR abs/1512.03385.

Hoffman, J.; Gupta, S.; and Darrell, T. 2016. Learning with side information through modality hallucination. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 826-834.

Ishida, T.; Niu, G.; Hu, W.; and Sugiyama, M. 2017. Learning from complementary labels. In Advances in Neural Information Processing Systems 30, 5639-5649. Curran Associates, Inc.

Ishida, T.; Niu, G.; Menon, A. K.; and Sugiyama, M. 2018. Complementary-label learning for arbitrary losses and models.

Kaneko, T.; Ushiku, Y.; and Harada, T. 2018. Label-noise robust generative adversarial networks. arXiv preprint arXiv:1811.11165.

Kim, Y. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746-1751.

Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.

Kingma, D. P.; Mohamed, S.; Jimenez Rezende, D.; and Welling, M. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27, 3581-3589. Curran Associates, Inc.

Krizhevsky, A.; Nair, V.; and Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research).

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097-1105.
Kumar, A.; Sattigeri, P.; and Fletcher, T. 2017. Semi-supervised learning with GANs: Manifold invariance with improved inference. In Advances in Neural Information Processing Systems, 5534-5544.

LeCun, Y., and Cortes, C. 2010. MNIST handwritten digit database.

Liu, T., and Tao, D. 2016. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(3):447-461.

Mirza, M., and Osindero, S. 2014. Conditional generative adversarial nets. CoRR abs/1411.1784.

Miyato, T., and Koyama, M. 2018. cGANs with projection discriminator. CoRR abs/1802.05637.

Miyato, T.; Kataoka, T.; Koyama, M.; and Yoshida, Y. 2018. Spectral normalization for generative adversarial networks. CoRR abs/1802.05957.

Natarajan, N.; Dhillon, I. S.; Ravikumar, P. K.; and Tewari, A. 2013. Learning with noisy labels. In Advances in Neural Information Processing Systems 26, 1196-1204. Curran Associates, Inc.

Odena, A.; Olah, C.; and Shlens, J. 2017. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 2642-2651. PMLR.

Odena, A. 2016. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583.

Patrini, G.; Rozza, A.; Krishna Menon, A.; Nock, R.; and Qu, L. 2017. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1944-1952.

Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434.

Thekumparampil, K. K.; Khetan, A.; Lin, Z.; and Oh, S. 2018. Robustness of conditional GANs to noisy labels. In Advances in Neural Information Processing Systems, 10271-10282.

Xia, X.; Liu, T.; Wang, N.; Han, B.; Gong, C.; Niu, G.; and Sugiyama, M. 2019. Are anchor points really indispensable in label-noise learning? CoRR abs/1906.00189.

Yu, X.; Liu, T.; Gong, M.; and Tao, D. 2018. Learning with biased complementary labels. In ECCV (1), volume 11205 of Lecture Notes in Computer Science, 69-85. Springer.

Zhou, Z.-H.; Zhang, M.-L.; Huang, S.-J.; and Li, Y.-F. 2012. Multi-instance multi-label learning. Artificial Intelligence 176(1):2291-2320.