# Robustness of conditional GANs to noisy labels

Kiran Koshy Thekumparampil, Ashish Khetan, Zinan Lin, Sewoong Oh

University of Illinois at Urbana-Champaign, Carnegie Mellon University

We study the problem of learning conditional generators from noisy labeled samples, where the labels are corrupted by random noise. A standard training of conditional GANs will not only produce samples with wrong labels, but also generate poor quality samples. We consider two scenarios, depending on whether the noise model is known or not. When the distribution of the noise is known, we introduce a novel architecture which we call Robust Conditional GAN (RCGAN). The main idea is to corrupt the label of the generated sample before feeding it to the adversarial discriminator, forcing the generator to produce samples with clean labels. This approach of passing through a matching noisy channel is justified by accompanying multiplicative approximation bounds between the loss of the RCGAN and the distance between the clean real distribution and the generator distribution. This shows that the proposed approach is robust when used with a carefully chosen discriminator architecture, known as the projection discriminator. When the distribution of the noise is not known, we provide an extension of our architecture, which we call RCGAN-U, that learns the noise model simultaneously while training the generator. We show experimentally on the MNIST and CIFAR-10 datasets that both approaches consistently improve upon baseline approaches, and RCGAN-U closely matches the performance of RCGAN.

## 1 Introduction

Conditional generative adversarial networks (GANs) have been widely successful in several applications including improving image quality, semi-supervised learning, reinforcement learning, category transformation, style transfer, image de-noising, compression, in-painting, and super-resolution [30, 13, 49, 36, 26, 58].
The goal of training a conditional GAN is to generate samples from distributions satisfying certain conditioning on some correlated features. Concretely, given samples from the joint distribution of a data point $x$ and a label $y$, we want to learn to generate samples from the true conditional distribution of the real data $P_{X|Y}$. A canonical conditional GAN studied in the literature is the case of a discrete label $y$ [30, 36, 35, 32]. Significant progress has been made in this setting, which is typically evaluated on the quality of the conditional samples. These evaluations include measuring inception scores and intra Fréchet inception distances, visual inspection on downstream tasks such as category morphing and super resolution [32], and faithfulness of the samples as measured by how accurately we can infer the class that generated the sample [36].

Author emails are thekump2@illinois.edu, ashish.khetan09@gmail.com, zinanl@andrew.cmu.edu, and swoh@illinois.edu. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC). 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We study the problem of training conditional GANs with noisy discrete labels. By noisy labels, we refer to a setting where the label $y$ for each example in the training set is randomly corrupted. Such noise can result from an adversary deliberately corrupting the data [7] or from human errors in crowdsourced label collection [12, 18]. This can be modeled as a random process, where a clean data point $x \in \mathcal{X}$ and its label $y \in [m]$ are drawn from a joint distribution $P_{X,Y}$ with $m$ classes. For each data point, the label is corrupted by passing through a noisy channel represented by a row-stochastic confusion matrix $C \in \mathbb{R}^{m \times m}$ defined as $C_{ij} \triangleq \mathbb{P}(\tilde{Y} = j \mid Y = i)$.
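The noisy channel described above can be simulated in a few lines. The following is a minimal sketch, not the authors' code: the 3-class matrix `C`, the sample size, and the helper name `corrupt_labels` are illustrative assumptions, and classes are indexed from 0 rather than from 1.

```python
import numpy as np

def corrupt_labels(y, C, rng):
    """Draw a noisy label for each clean label y[k] from row C[y[k], :]."""
    m = C.shape[0]
    return np.array([rng.choice(m, p=C[label]) for label in y])

# Hypothetical 3-class confusion matrix; every row sums to one (row-stochastic).
# C[i, j] plays the role of P(noisy label = j | clean label = i).
C = np.array([[0.80, 0.10, 0.10],
              [0.20, 0.70, 0.10],
              [0.05, 0.15, 0.80]])

rng = np.random.default_rng(0)
y_clean = rng.integers(0, 3, size=6000)   # clean labels, uniform over classes
y_noisy = corrupt_labels(y_clean, C, rng)

# Empirical estimate of the channel from (clean, noisy) pairs.
C_hat = np.zeros_like(C)
np.add.at(C_hat, (y_clean, y_noisy), 1.0)
C_hat /= C_hat.sum(axis=1, keepdims=True)
print(np.round(C_hat, 2))   # should be close to C, up to sampling error
```

With enough samples the empirical matrix `C_hat` approaches `C`, which is the sense in which historical data can give access to the confusion matrix.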
This defines a joint distribution for the data point $x$ and a noisy label $\tilde{y}$: $\tilde{P}_{X,\tilde{Y}}$. If we train a standard conditional GAN on noisy samples, then it solves the following optimization:

$$\min_{G \in \mathcal{G}} \max_{D \in \mathcal{F}} V(G, D) = \mathbb{E}_{(x,\tilde{y}) \sim \tilde{P}_{X,\tilde{Y}}}\big[\phi(D(x, \tilde{y}))\big] + \mathbb{E}_{z \sim N,\, y \sim \tilde{P}_{\tilde{Y}}}\big[\phi(1 - D(G(z; y), y))\big], \tag{1}$$

where $\phi$ is a function of choice, $D$ and $G$ are the discriminator and the generator, optimized over function classes $\mathcal{F}$ and $\mathcal{G}$ of our choice respectively, and $N$ is the distribution of the latent random vector. For typical choices of $\phi$, for example $\log(\cdot)$, and large enough function classes $\mathcal{G}$ and $\mathcal{F}$, the optimal conditional generator learns to generate samples from $\tilde{P}_{X|\tilde{Y}}$, the corrupted conditional distribution. In other words, it generates samples $X$ from classes other than the one it is conditioned on. As the learned distribution exhibits such a bias, we call this naive approach the Biased GAN.

Under this setting, there is a fundamental question of interest: can we design a novel conditional GAN that can generate samples from the true conditional distribution $P_{X|Y}$, even when trained on noisy samples? Several aspects of this problem make it challenging and interesting. First, the performance of such a robust GAN should depend on how noisy the channel $C$ is. If $C$ is rank-deficient, for instance, then there are multiple distributions that result in the same distribution after the corruption, and hence no reliable learning of the true distribution is possible. We would ideally want a theoretical guarantee that shows such a trade-off between $C$ and the robustness of GANs. Next, when the noise is from errors in crowdsourced labels, we might have some access to the confusion matrix $C$ from historical data. In other cases, such as adversarial corruption, we might not have any information about $C$. We want to provide robust solutions to both. Finally, an important practical challenge in this setting is to correct the noisy labels in the training data.
We address all such variations in our approaches and make the following contributions.

Our contributions. We introduce two architectures to train conditional GANs with noisy samples. First, when we have knowledge of the confusion matrix $C$, we propose RCGAN (Robust Conditional GAN) in Section 2. We first prove that minimizing the RCGAN loss recovers the clean distribution $P_{X|Y}$ (Theorem 2), under certain conditions on the class $\mathcal{F}$ of discriminators we optimize over (Assumption 1). We show that such a condition on $\mathcal{F}$ is also necessary, as without it, the training loss can be arbitrarily small while the generated distribution can be far from the real one (Theorem 4). The assumption leads to our particular choice of the discriminator in RCGAN, called the projection discriminator [32], which satisfies all the conditions (Remark 1). Finally, we provide a finite sample generalization bound showing that the loss minimized in training RCGAN does generalize, and results in the learned distribution being close to the clean conditional distribution $P_{X|Y}$ (Theorem 3). Experimental results on benchmark datasets confirm that RCGAN is robust against noisy samples, and improves significantly over the naive Biased GAN.

Secondly, when we do not have access to $C$, we propose RCGAN-U (RCGAN with Unknown noise distribution) in Section 4. We provide experimental results showing that performance gains similar to those of RCGAN can be achieved. Finally, we showcase the practical use of the learned conditional GANs by using them to fix the noisy labels in the training data. Numerical experiments confirm that the RCGAN framework provides a more robust approach to correcting the noisy labels, compared to the state-of-the-art methods that rely only on discriminators.

Related work. Two popular training methods for generative models are variational auto-encoders [22] and adversarial training [14].
The adversarial training approach has made significant advances in several applications of practical interest. [37, 2, 5] propose new architectures that significantly improve the training on practical image datasets. [58, 16] propose new architectures to transfer the style of one image to another domain. [26, 43] show how to enhance a given image with a learned generator, by enhancing the resolution or making it more realistic. [27, 50] show how to generate videos, and [51, 1] demonstrate that 3-dimensional models can be generated from adversarial training. [23] proposes a new architecture encoding causal structures in conditional GANs. [42] introduces the state-of-the-art conditional independence tester. In a different direction, several recent approaches showcase how the manifold learned by adversarial training can be used to solve inverse problems [9, 57, 53].

Conditional GANs have been proposed as a successful tool for various applications, including class conditional image generation [36], image-to-image translation [21], and image generation from text [38, 55]. Most conditional GANs incorporate the class information by naively concatenating it to the input or to the feature vector at some middle layer [30, 13, 38, 55]. AC-GAN [36] creates an auxiliary classifier to incorporate class information. The projection discriminator GAN [32] takes an inner product between the embedded class vector and the feature vector. A recent work [31], which proposes spectral normalization, shows that high quality image generation on the 1000-class ILSVRC2012 dataset [39] can be achieved using a projection conditional discriminator.

Robustness of (unconditional) GANs against adversarial or random noise has recently been studied in [10, 52]. [52] studies an adversarial attack that perturbs the discriminator output. The proposed architecture of RCGAN is inspired by the closely related work on AmbientGAN in [10].
AmbientGAN is a general framework addressing any corruption on the image itself (not necessarily just the labels). Given corrupted samples with a known corruption, AmbientGAN applies that corruption to the output of the generator before feeding it to the discriminator. Motivated by the success of AmbientGAN in de-noising, we propose RCGAN. An important distinction is that we make specific architectural choices guided by our theoretical analysis, which give a significant gain in practice (Appendix J). Under the scenario of interest with noisy labels, we provide sharp analyses for both the population loss and the finite sample loss. Such sharp characterizations do not exist for the more general AmbientGAN scenarios. Further, our RCGAN-U does not require knowledge of the confusion matrix, departing from the AmbientGAN approach.

Learning classifiers from noisy labels is a closely related problem. Recently, [34, 20] proposed a theoretically motivated classifier which minimizes a modified loss in the presence of noisy labels and showed improvement over the robust classifiers [29, 45, 46]. [47] proposed adding noise to the classifier output to match the noise distribution.

Notation. For a vector, $\|x\|_p = (\sum_i |x_i|^p)^{1/p}$ is the $\ell_p$-norm. For a matrix, let $|||A|||_p = \max_{\|x\|_p = 1} \|Ax\|_p$ denote the operator norm. Then $|||A|||_\infty = \max_i \sum_j |A_{ij}|$, $|||A|||_1 = \max_j \sum_i |A_{ij}|$, and $|||A|||_2 = \sigma_{\max}(A)$, the maximum singular value. $\mathbf{1}$ is the all-ones vector and $I$ is the identity matrix. $[n] = \{1, \ldots, n\}$. For a vector $x \in \mathbb{R}^n$, $x_i$ ($i \in [n]$) is its $i$-th coordinate.

## 2 Our first architecture: RCGAN

Training a conditional GAN with noisy samples results in a biased generator. We propose the Robust Conditional GAN (RCGAN) architecture, which has the following pre-processing, discriminator update, and generator update steps. We assume in this section that the confusion matrix $C$ is known (and the marginal $P_Y$ can easily be inferred), and address the case of unknown $C$ in Section 4.
[Figure 1 diagram omitted; its two labeled components are the permutation regularizer and the adversarial loss.]

Figure 1: The output $x$ of the conditional generator $G$ is paired with a noisy label $\tilde{y}$ corrupted by the channel $C$. The discriminator $D$ estimates whether a given labeled sample is coming from the real data $(x_{\text{real}}, y_{\text{real}})$ or generated data $(x, y)$. The permutation regularizer $h^*$ is pre-trained on real data.

Pre-processing: We train a classifier $h^*$ to predict the noisy label $\tilde{y}$ given $x$ under a loss $\ell$, that is, $h^* \in \arg\min_{h \in \mathcal{H}} \mathbb{E}_{(x,\tilde{y}) \sim \tilde{P}_{X,\tilde{Y}}}[\ell(h(x), \tilde{y})]$, where $\mathcal{H}$ is a parametric family of classifiers (typically neural networks) and $\tilde{P}_{X,\tilde{Y}}$ is the joint distribution of real $x$ and the corresponding real noisy $\tilde{y}$.

D-step: We train on the following adversarial loss. In the second term below, $y$ is generated according to $P_Y$, and the corresponding noisy labels are generated by corrupting $y$ according to the conditional distribution $C_y$, which is the $y$-th row of the confusion matrix (assumed to be known):

$$\max_{D \in \mathcal{F}} \; \mathbb{E}_{(x,\tilde{y}) \sim \tilde{P}_{X,\tilde{Y}}}\big[\phi(D(x, \tilde{y}))\big] + \mathbb{E}_{z \sim N,\; y \sim P_Y,\; \tilde{y}|y \sim C_y}\big[\phi(1 - D(G(z; y), \tilde{y}))\big],$$

where $P_Y$ is the true marginal distribution of the labels, $N$ is the distribution of the latent random vector, and $\mathcal{F}$ is a family of discriminators.

G-step: We train on the following loss with some $\lambda > 0$:

$$\min_{G \in \mathcal{G}} \; \mathbb{E}_{z \sim N,\; y \sim P_Y,\; \tilde{y}|y \sim C_y}\big[\phi(1 - D(G(z; y), \tilde{y})) + \lambda\, \ell(h^*(G(z; y)), y)\big], \tag{2}$$

where $\mathcal{G}$ is a family of generators. The idea of using auxiliary classifiers has been used to improve the quality of the image and the stability of the training, for example in the auxiliary classifier GAN (AC-GAN) [36], and to improve the quality of clustering in the latent space [33]. We propose an auxiliary classifier $h^*$ that mitigates a permutation error, which we empirically identified in a naive implementation of our idea with no regularizers.

Permutation regularizer (controlled by $\lambda$). Permutation error occurs if, when asked to produce samples from a target class, the trained generator produces samples dominantly from a single class that is different from the target class.
We propose a regularizer $h^*$, which predicts the noisy label $\tilde{y}$. As long as the confusion matrix is diagonally dominant, which is a necessary condition for identifiability, this regularizer encourages the correct permutation of the labels. More regularizers could potentially provide additional robustness, and we discuss one such regularizer (similar to the InfoGAN loss [11]) in Appendix K.

Theoretical motivation for RCGAN. When $\lambda = 0$, we get the standard conditional GAN update steps, albeit one which tries to minimize the discriminator loss between the noisy real distribution $\tilde{P}$ and the distribution $\tilde{Q}$ of the generator when the label is passed through the same noisy channel parameterized by $C$. The main idea of RCGAN is to minimize a certain divergence between noisy real data and noisy generated data. For example, the choice of bounded functions $\mathcal{F} = \{D : \mathcal{X} \times [m] \to [0, 1]\}$ and the identity map $\phi(a) = a$ leads to a total variation minimization: the loss minimized in the G-step is the total variation $d_{TV}(\tilde{P}, \tilde{Q}) \triangleq \sup_{S \subseteq \mathcal{X} \times [m]} \{\tilde{P}(S) - \tilde{Q}(S)\}$ between the two distributions with corrupted labels, up to some scaling and some shift. If we choose $\mathcal{F} = \{D : \mathcal{X} \times [m] \to [0, 1]\}$ and $\phi(a) = \log(a)$, then we are minimizing the Jensen-Shannon divergence $d_{JS}(\tilde{P} \| \tilde{Q}) \triangleq (1/2)\, d_{KL}(\tilde{P} \,\|\, (\tilde{P} + \tilde{Q})/2) + (1/2)\, d_{KL}(\tilde{Q} \,\|\, (\tilde{P} + \tilde{Q})/2)$, where $d_{KL}(\cdot \| \cdot)$ denotes the Kullback-Leibler divergence. The following theorem provides approximation guarantees for some common divergence measures over a noisy channel, justifying our proposed practical approach. We refer to Appendix B for a proof.

Theorem 1. Let $P_{X,Y}$ and $Q_{X,Y}$ be two distributions on $\mathcal{X} \times [m]$. Let $\tilde{P}_{X,\tilde{Y}}$ and $\tilde{Q}_{X,\tilde{Y}}$ be the corresponding distributions when samples from $P$ and $Q$ are passed through the noisy channel given by the confusion matrix $C \in \mathbb{R}^{m \times m}$ (as defined in Section 1). If $C$ is full-rank, we get

$$d_{TV}(\tilde{P}, \tilde{Q}) \;\le\; d_{TV}(P, Q) \;\le\; |||C^{-1}|||_\infty \, d_{TV}(\tilde{P}, \tilde{Q}), \quad \text{and} \tag{3}$$

$$d_{JS}(\tilde{P} \| \tilde{Q}) \;\le\; d_{JS}(P \| Q) \;\le\; |||C^{-1}|||_\infty \sqrt{8\, d_{JS}(\tilde{P} \| \tilde{Q})}. \tag{4}$$

To interpret this theorem, let $Q$ denote the distribution of the generator. The theorem implies that when the noisy generator distribution $\tilde{Q}$ becomes close to the noisy real distribution $\tilde{P}$ in total variation or in Jensen-Shannon divergence, then the generator distribution $Q$ must be close to the distribution of real data $P$ in the same metric. This justifies the use of the proposed architecture RCGAN. In practice, we minimize the sample divergence of the two distributions, instead of the population divergence analyzed in the above theorem. However, these standard divergences are known to not generalize in training GANs [3]. To this end, we provide in Section 3 analyses on neural network distances, which are known to generalize, and provide finite sample bounds.

## 3 Theoretical Analysis of RCGAN

It was shown in [3] that the standard GAN losses of Jensen-Shannon divergence and Wasserstein distance both fail to generalize with a finite number of samples. On the other hand, more recent advances in analyzing GANs in [56, 6, 4] show promising generalization bounds by either assuming Lipschitz conditions on the generator model or by restricting the analysis to certain classes of distributions. Under those assumptions, where the JS divergence generalizes, Theorem 1 justifies the use of the proposed RCGAN. However, those require the distribution to be Gaussian, a mixture of Gaussians, or the output of a neural network generator, for example in [4]. In this section, we provide analyses of RCGAN on a distance that generalizes without any assumptions on the distribution of the real data, as proven in [3]: the neural network distance. Formally, consider a class of real-valued functions $\mathcal{F}$ and a function $\phi : [0, 1] \to \mathbb{R}$ which is either convex or concave. The neural network distance is defined as

$$d_{\mathcal{F},\phi}(P, Q) \triangleq \sup_{D \in \mathcal{F}} \; \mathbb{E}_{(x,y) \sim P}\big[\phi(D(x, y))\big] + \mathbb{E}_{(x,y) \sim Q}\big[\phi(1 - D(x, y))\big] - \mu_\phi, \tag{5}$$

where $P$ is the distribution of the real data, $Q$ is that of the generated data, and $\mu_\phi$ is the constant correction term that ensures $d_{\mathcal{F},\phi}(P, P) = 0$. We further assume that $\mathcal{F}$ includes the three constant functions $D(x, y) = 0$, $D(x, y) = 1/2$, and $D(x, y) = 1$, in order to ensure that $d_{\mathcal{F},\phi}(P, Q) \ge 0$ and $d_{\mathcal{F},\phi}(P, P) = 0$, as shown in Lemma 1 in the Appendix.

The proposed RCGAN with $\lambda = 0$ approximately minimizes the neural network distance $d_{\mathcal{F},\phi}(\tilde{P}, \tilde{Q})$ between the two corrupted distributions. In practice, $\mathcal{F}$ is a parametric family of functions from a specific neural network architecture that the designer has chosen. In theory, we aim to identify how the choice of the class $\mathcal{F}$ provides the desired approximation bounds similar to those in Theorem 1, but for neural network distances. This analysis leads to the choice of the projection discriminator [32] to be used in RCGAN (Remark 1). On the other hand, we show in Theorem 4 that an inappropriate choice of the discriminator architecture can cause non-approximation. Further, we provide the sample complexity of the approximation bounds in Theorem 3.

We refer to the un-regularized version with $\lambda = 0$ as simply RCGAN. In this section, we focus on a class of loss functions called Integral Probability Metrics (IPM), where $\phi(x) = x$ [44]. This is a popular choice of loss in GANs in practice [48, 2, 8] and in analyses [4]. We write the induced neural network distance as $d_{\mathcal{F}}(P, Q)$, dropping the $\phi$ in the notation.

### 3.1 Approximation bounds for neural network distances

We define an operation $\circ$ over a matrix $T \in \mathbb{R}^{m \times m}$ and a class $\mathcal{F}$ of functions $\mathcal{X} \times [m] \to \mathbb{R}$ as

$$T \circ \mathcal{F} \triangleq \Big\{ g(x, y) = \sum_{\tilde{y} \in [m]} T_{y\tilde{y}}\, f(x, \tilde{y}) \;\Big|\; f \in \mathcal{F} \Big\}. \tag{6}$$

This makes it convenient to represent the neural network distance corrupted by noise with a confusion matrix $C \in \mathbb{R}^{m \times m}$, where $C_{y\tilde{y}}$ is the probability that a label $y$ is corrupted to $\tilde{y}$. Formally, it follows from (5) and (6) that $d_{\mathcal{F}}(\tilde{P}, \tilde{Q}) = d_{C \circ \mathcal{F}}(P, Q)$. We refer to Appendix F for a proof.
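The identity $d_{\mathcal{F}}(\tilde{P}, \tilde{Q}) = d_{C \circ \mathcal{F}}(P, Q)$ and the total variation bound (3) can be checked numerically on a small discrete example. The sketch below is only an illustration under stated assumptions: the finite domain of 6 points, the 4 classes, the uniform flipping channel with $\alpha = 0.7$, and all variable names are hypothetical choices, not part of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, m = 6, 4        # hypothetical sizes: |X| = 6 data points, m = 4 classes
alpha = 0.7          # label accuracy of an assumed uniform flipping channel

# Row-stochastic confusion matrix: C[y, y_tilde] = P(noisy = y_tilde | clean = y).
C = np.full((m, m), (1.0 - alpha) / (m - 1))
np.fill_diagonal(C, alpha)

def random_joint(rng):
    """Random joint distribution over (x, y), stored as an (n_x, m) array."""
    p = rng.random((n_x, m))
    return p / p.sum()

def corrupt(P):
    """P_tilde(x, y_tilde) = sum_y P(x, y) * C[y, y_tilde]."""
    return P @ C

def tv(P, Q):
    """sup_S {P(S) - Q(S)}, which equals half the l1 distance."""
    return 0.5 * np.abs(P - Q).sum()

P, Q = random_joint(rng), random_joint(rng)
P_t, Q_t = corrupt(P), corrupt(Q)

# Identity behind Eq. (6): E_{P_tilde}[f] = E_P[C o f] for any function f,
# where (C o f)(x, y) = sum_{y_tilde} C[y, y_tilde] * f(x, y_tilde).
f = rng.random((n_x, m))
C_f = f @ C.T
lhs = (P_t * f).sum()
rhs = (P * C_f).sum()

# Sandwich of Eq. (3), using the max-row-sum norm of C^{-1}.
inv_norm = np.abs(np.linalg.inv(C)).sum(axis=1).max()
print(tv(P_t, Q_t), tv(P, Q), inv_norm * tv(P_t, Q_t))
```

The three printed numbers should appear in non-decreasing order, mirroring the lower and upper bounds of Theorem 1 for this particular pair of distributions.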
For $d_{\mathcal{F}}(\tilde{P}, \tilde{Q})$ to be a good approximation of $d_{\mathcal{F}}(P, Q)$, we show that the following condition is sufficient.

Assumption 1. We assume that the class of discriminator functions $\mathcal{F}$ can be decomposed into three parts $\mathcal{F} = \{f_1 + f_2 + c \mid f_1 \in \mathcal{F}_1, f_2 \in \mathcal{F}_2\}$ such that $c \in \mathbb{R}$ is any constant and $\mathcal{F}_1$ satisfies the inclusion condition:

$$T \circ \mathcal{F}_1 \subseteq \mathcal{F}_1, \tag{7}$$

for all $|||T|||_\infty \triangleq \max_i \sum_j |T_{ij}| = 1$; and $\mathcal{F}_2$ satisfies the label invariance condition: there exists a class $\mathcal{F}_2'$ of functions over only $x$, such that

$$\mathcal{F}_2 = \big\{ \alpha\, g(x, y) \;\big|\; g(x, y) = f(x), \text{ for any } f(x) \in \mathcal{F}_2', \text{ and } \alpha \in [0, 1] \big\}. \tag{8}$$

We discuss the necessity and practical implications of this assumption in Section 3.2, and give examples satisfying these assumptions in Remark 1 and Appendix C. Notice that a trivial class with a single constant zero function satisfies both the inclusion and label invariance conditions. For example, we can choose $c = 0$ and also choose to set either $\mathcal{F}_1 = \{f(x, y) = 0\}$ or $\mathcal{F}_2 = \{f(x, y) = 0\}$, in which case $\mathcal{F}$ only needs to satisfy one of the two conditions in Assumption 1. The flexibility that we gain by allowing the set addition $\mathcal{F}_1 + \mathcal{F}_2$ is critical in applying these conditions to practical discriminators, especially in proving Remark 1. Note that in the inclusion condition in Eq. 7, we require the condition to hold for the whole max-norm bounded set $\{T : \max_i \sum_j |T_{ij}| = 1\}$. The reason a weaker condition over all row-stochastic matrices, $\{T : \sum_j T_{ij} = 1\}$, does not suffice is that in order to prove the upper bound in Eq. 9, we need to apply the invariance condition to $|||C^{-1}|||_\infty^{-1}\, C^{-1} \circ \mathcal{F}$. This matrix $|||C^{-1}|||_\infty^{-1}\, C^{-1}$ is not row-stochastic, but is still max-norm bounded.

We first show that Assumption 1 is sufficient for approximability of the neural network distance from corrupted samples. For two distributions $P_{X,Y}$ and $Q_{X,Y}$ on $\mathcal{X} \times [m]$, let $\tilde{P}_{X,\tilde{Y}}$ and $\tilde{Q}_{X,\tilde{Y}}$ be the corresponding corrupted distributions respectively, where the label $Y$ is passed through the noisy channel defined by the confusion matrix $C \in \mathbb{R}^{m \times m}$, i.e. $\tilde{P}(x, \tilde{y}) = \sum_y P(x, y)\, C_{y,\tilde{y}}$.

Theorem 2. If a class of functions $\mathcal{F}$ satisfies Assumption 1, then

$$d_{\mathcal{F}}(\tilde{P}, \tilde{Q}) \;\le\; d_{\mathcal{F}}(P, Q) \;\le\; |||C^{-1}|||_\infty \, d_{\mathcal{F}}(\tilde{P}, \tilde{Q}), \tag{9}$$

where we follow the convention that $|||C^{-1}|||_\infty = \infty$ if $C$ is not full rank.

We refer to Appendix F for a proof. This gives a sharp characterization of how two distances are related: the one we can minimize in training RCGAN (i.e. $d_{\mathcal{F}}(\tilde{P}, \tilde{Q})$) and the true measure of closeness (i.e. $d_{\mathcal{F}}(P, Q)$). Although the latter cannot be directly evaluated or minimized, RCGAN approximately minimizes the true neural network distance $d_{\mathcal{F}}(P, Q)$ as desired. The lower bound proves a special case of the data-processing inequality: two random variables from $P$ and $Q$ get closer in neural network distance when passed through a stochastic transformation. The upper bound puts a limit on how much closer $\tilde{P}$ and $\tilde{Q}$ can get, depending on the noise level. This fundamental trade-off is captured by $|||C^{-1}|||_\infty$. In the noiseless case where $C$ is the identity matrix, we have $|||C^{-1}|||_\infty = 1$ and we recover the trivial fact that the two distances are equal. At the other extreme, if $C$ is rank deficient, we use the convention that $|||C^{-1}|||_\infty = \infty$ and the two distances can be arbitrarily different. The approximation factor of $|||C^{-1}|||_\infty$ captures how much the space $\mathcal{F}$ can shrink under the noise $C$. This coincides with Theorem 1, where a similar trade-off was identified for the TV distance. In Remark 3 in Appendix D, we show that these bounds cannot be tightened for general $P$, $Q$, and $\mathcal{F}$.

Theorem 2 shows that (i) RCGAN can learn the true conditional distribution, justifying its use; and (ii) the performance of RCGAN is determined by how noisy the samples are, via $|||C^{-1}|||_\infty$. There are still two loose ends. First, does a practical implementation of the RCGAN architecture satisfy the inclusion and/or label invariance assumptions? Secondly, in practice we cannot minimize $d_{\mathcal{F}}(\tilde{P}, \tilde{Q})$ as we only have a finite number of samples. How much do we lose in this finite sample regime?
We give precise answers to each question in the following two sections.

### 3.2 Inclusion and label invariance assumptions

For RCGAN, we propose a popular state-of-the-art discriminator for conditional GANs known as the projection discriminator [32], parametrized by $V \in \mathbb{R}^{m \times d_V}$, $v \in \mathbb{R}^{d_v}$, and $\theta \in \mathbb{R}^{d_\theta}$:

$$D_{V,v,\theta}(x, y) = \text{vec}(y)^T V \psi(x; \theta) + v^T \psi'(x; \theta), \tag{10}$$

where $\psi(x; \theta) \in \mathbb{R}^{d_V}$ and $\psi'(x; \theta) \in \mathbb{R}^{d_v}$ are vector-valued parametric functions for some integers $d_V$, $d_v$, and $\text{vec}(y)^T = [\mathbb{I}_{y=1}, \ldots, \mathbb{I}_{y=m}]$. The first term satisfies the inclusion condition, as any operation with $T$ can be absorbed into $V$. The second term is label invariant as it does not depend on $y$. This is made precise in the following remark, whose proof is provided in Appendix G. Together with this remark, the approximability result in Theorem 2 justifies the use of projection discriminators in RCGAN, which we use in all our experiments.

Remark 1. The class of projection discriminators $\{D_{V,v,\theta}(x, y)\}_{V \in \mathcal{V}_1, v \in \mathcal{V}_2, \theta \in \Theta}$ defined in Eq. 10 satisfies Assumption 1 for any $\psi$, $\psi'$, and $\Theta$, if

$$\mathcal{V}_1 = \big\{ V \in \mathbb{R}^{m \times d_V} \;\big|\; \max_i |V_{ij}| \le 1 \text{ for all } j \in [d_V] \big\}, \quad \text{and} \quad \mathcal{V}_2 = \big\{ v \in \mathbb{R}^{d_v} \;\big|\; \|v\| \le 1 \big\}.$$

Other choices of $\mathcal{V}_1$ and $\mathcal{V}_2$ are also possible. For example, $\mathcal{V}_1' = \{V \in \mathbb{R}^{m \times d_V} \mid \sum_j \max_i |V_{ij}| \le 1\}$ or $\mathcal{V}_1' = \{V \in \mathbb{R}^{m \times d_V} \mid |||V|||_\infty = \max_i \sum_j |V_{ij}| \le 1\}$ are also sufficient. We find the proposed choice of $\mathcal{V}_1$ easy to implement, as a column-wise $L_\infty$-norm normalization via projected gradient descent. We describe implementation details in Appendix L. In Appendix E, we show that Assumption 1 is also necessary.

### 3.3 Finite sample analysis

In practice, we do not have access to the probability distributions $\tilde{P}$ and $\tilde{Q}$. Instead, we observe a set of samples of a finite size $n$ from each of them. In training a GAN, we minimize the empirical neural network distance, $d_{\mathcal{F}}(\tilde{P}_n, \tilde{Q}_n)$, where $\tilde{P}_n$ and $\tilde{Q}_n$ denote the empirical distributions of $n$ samples.
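The discriminator class $\mathcal{F}$ over which this empirical distance is taken is, in RCGAN, the projection discriminator class of Eq. (10). As a rough sketch of why its first term satisfies the inclusion condition, the snippet below checks that applying a max-norm bounded $T$ is the same as replacing $V$ by $TV$, and that $TV$ stays inside $\mathcal{V}_1$. The random features stand in for $\psi(x; \theta)$, the sizes are arbitrary, and the function names are assumptions made for illustration; this is not the implementation described in Appendix L.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_V, n = 5, 8, 16     # hypothetical sizes: classes, feature width, inputs

# Stand-in for the feature map psi(x; theta) evaluated on n inputs.
psi = rng.standard_normal((n, d_V))

def project_V1(V):
    """Projection onto V_1 = {V : max_i |V_ij| <= 1 for every column j}."""
    return np.clip(V, -1.0, 1.0)

def first_term(V, psi):
    """Entry [k, y] equals vec(y)^T V psi(x_k), the first term of Eq. (10)."""
    return psi @ V.T

V = project_V1(2.0 * rng.standard_normal((m, d_V)))

# A signed matrix T with max-row-sum norm exactly one (not row-stochastic).
T = rng.standard_normal((m, m))
T /= np.abs(T).sum(axis=1, keepdims=True)

# (T o f)(x, y) = sum_{y_tilde} T[y, y_tilde] * f(x, y_tilde)
lhs = first_term(V, psi) @ T.T
# The same function is obtained by absorbing T into the weight matrix.
rhs = first_term(T @ V, psi)

print(np.allclose(lhs, rhs), np.abs(T @ V).max() <= 1.0)
```

Clipping entries to $[-1, 1]$ is one way to realize the column-wise $L_\infty$ constraint after a gradient step; the constraint on $v$ is left out here because it plays no role in the inclusion check.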
Inspired by the recent generalization results in [3], we show that this empirical distance minimization leads to a small $d_{\mathcal{F}}(P, Q)$, up to an additive error that vanishes with increasing sample size. As shown in [3], Lipschitz and bounded function classes are critical in achieving sample efficiency for GANs. We follow the same approach over a similar function class. Let

$$\mathcal{F}_{p,L} = \big\{ D_u(x, y) \in [0, 1] \;\big|\; D_u(x, y) \text{ is } L\text{-Lipschitz in } u \text{ and } u \in \mathcal{U} \subseteq \mathbb{R}^p \big\} \tag{11}$$

be a class of bounded functions with parameter $u \in \mathbb{R}^p$. We say that $\mathcal{F}$ is $L$-Lipschitz in $u$ if

$$|D_{u_1}(x, y) - D_{u_2}(x, y)| \le L \|u_1 - u_2\|, \quad \forall u_1, u_2 \in \mathcal{U},\; x \in \mathcal{X},\; y \in [m]. \tag{12}$$

Theorem 3. For any class $\mathcal{F}_{p,L}$ of bounded Lipschitz functions $D_u(x, y)$ satisfying Assumption 1, there exists a universal constant $c > 0$ such that

$$d_{\mathcal{F}_{p,L}}(\tilde{P}_n, \tilde{Q}_n) - \epsilon \;\le\; d_{\mathcal{F}_{p,L}}(P, Q) \;\le\; |||C^{-1}|||_\infty \big( d_{\mathcal{F}_{p,L}}(\tilde{P}_n, \tilde{Q}_n) + \epsilon \big), \tag{13}$$

with probability at least $1 - e^{-p}$ for any $\epsilon > 0$ and $n$ large enough, $n \ge (c\, p / \epsilon^2) \log(p L / \epsilon)$.

We refer to Appendix I for a proof. This justifies the proposed RCGAN, which minimizes $d_{\mathcal{F}}(\tilde{P}_n, \tilde{Q}_n)$, as it leads to the generator distribution $Q$ being close to the real distribution $P$ in neural network distance, $d_{\mathcal{F}}(P, Q)$. These bounds inherit the approximability of the population version from Theorem 2.

## 4 Our second architecture: RCGAN-U

In many real world scenarios the confusion matrix $C$ is unknown. We propose the RCGAN-Unknown (RCGAN-U) algorithm, which jointly estimates the real distribution $P$ and the noise model $C$. The pre-processing and D-steps of RCGAN-U are the same as those of RCGAN, assuming the current guess $M$ of the confusion matrix. As the G-step in (2) is not differentiable in $C$, we use the following reparameterized estimator of the loss, motivated by a similar technique in training classifiers from noisy labels:

$$\min_{G \in \mathcal{G},\, M \in \mathcal{C}} \; \mathbb{E}_{z \sim N,\; y \sim P_Y}\big[\phi_M(G(z; y), y, D) + \lambda\, \ell(h^*(G(z; y)), y)\big],$$

where $\mathcal{C}$ is the set of all transition matrices and $\phi_M(x, y, D) = \sum_{\tilde{y} \in [m]} M_{y\tilde{y}}\, \phi(1 - D(x, \tilde{y}))$.

## 5 Experiments

Implementation details are explained in Appendix L.
We consider one-coin based models, which are parameterized by their label accuracy probability $\alpha$. In this model, a sample with true label $y$ is flipped uniformly at random to a label $\tilde{y}$ in $[m] \setminus \{y\}$ with probability $1 - \alpha$. The entries of its confusion matrix $C$ will then be $C_{ii} = \alpha$ and $C_{i \ne j} = (1 - \alpha)/(m - 1)$, where $m$ is the number of classes. We call this model the uniform flipping model. Code to reproduce our experiments is available at https://github.com/POLane16/Robust-Conditional-GAN.

Baselines. The first baseline is the biased GAN, which is a conditional GAN applied directly to the noisy data. The loss is hence biased, and the true conditional distribution is not the optimal solution of this biased loss. The next natural baseline uses a de-biased classifier as the discriminator, motivated by the approach of [34] on learning classifiers from noisy labels. The main insight is to modify the loss function according to $C$, such that in expectation the loss matches that of the clean data. We refer to this approach as the unbiased GAN. Concretely, when training the discriminator, we propose the following (modified) de-biased loss:

$$\max_{D \in \mathcal{F}} \; \mathbb{E}_{(x,\tilde{y}) \sim \tilde{P}_{X,\tilde{Y}}}\Big[\sum_{y \in [m]} (C^{-1})_{\tilde{y}y}\, \phi(D(x, y))\Big] + \mathbb{E}_{z \sim N,\; y \sim P_Y}\big[\phi(1 - D(G(z; y), y))\big]. \tag{14}$$

This is unbiased, as the first term is equivalent to $\mathbb{E}_{(x,y) \sim P_{X,Y}}[\phi(D(x, y))]$, which is the standard GAN loss with clean samples. However, such de-biasing is sensitive to the condition number of $C$, and can become numerically unstable for noisy channels, as $C^{-1}$ then has large entries [20]. For both datasets, we use linear classifiers for the permutation regularizer of the RCGAN-U architecture.
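To make the uniform flipping model and the de-biased loss (14) concrete, the sketch below builds $C$ for a few values of $\alpha$, reports $|||C^{-1}|||_\infty$, and checks on random numbers that the first term of (14) matches the clean expectation. The choices $m = 10$, the list of $\alpha$ values, and the random table standing in for $\phi(D(x, y))$ are assumptions made for illustration only.

```python
import numpy as np

def uniform_flipping(m, alpha):
    """C[i, i] = alpha and C[i, j] = (1 - alpha) / (m - 1) for i != j."""
    C = np.full((m, m), (1.0 - alpha) / (m - 1))
    np.fill_diagonal(C, alpha)
    return C

def inf_norm(A):
    """Max-row-sum operator norm, written |||A|||_inf in the Notation paragraph."""
    return np.abs(A).sum(axis=1).max()

m = 10
alphas = [1.0, 0.8, 0.5, 0.2]
inv_norms = [inf_norm(np.linalg.inv(uniform_flipping(m, a))) for a in alphas]
for a, s in zip(alphas, inv_norms):
    print(f"alpha={a:.1f}  |||C^-1|||_inf={s:.2f}")

# Unbiasedness of the first term of (14) on a random discrete example.
rng = np.random.default_rng(0)
C = uniform_flipping(m, 0.5)
C_inv = np.linalg.inv(C)
P = rng.random((7, m))
P /= P.sum()                    # clean joint distribution over 7 points x m labels
phi_D = rng.random((7, m))      # stand-in for phi(D(x, y))
P_noisy = P @ C                 # noisy joint distribution

clean_term = (P * phi_D).sum()
# Weight each noisy sample (x, y_tilde) by sum_y (C^{-1})[y_tilde, y] * phi(D(x, y)).
debiased_term = (P_noisy * (phi_D @ C_inv.T)).sum()
print(np.isclose(clean_term, debiased_term), (C_inv < 0).any())
```

The norm grows as $\alpha$ decreases toward $1/m$, where $C$ becomes singular, and $C^{-1}$ contains negative weights; both effects are consistent with the instability of the unbiased GAN noted above.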
[Figure 2 plots omitted. Left panel: generator label accuracy versus noise in the real data $(1 - \alpha)$ for RCGAN+y, RCGAN, RCGAN-U, Unbiased, and Biased. Right panel: label recovery accuracy versus noise in the real data $(1 - \alpha)$ for the same five methods and the Unbiased classifier.]

Figure 2: Noisy MNIST dataset: Our RCGAN models consistently improve upon all competing baseline approaches in generator label accuracy (left). The trend continues in label recovery accuracy (right), where our proposed RCGAN-classifiers improve upon the unbiased classifier [34], which is one of the state-of-the-art approaches tailored for label recovery.

### 5.1 MNIST

We train five architectures on the MNIST dataset corrupted by the uniform flipping noise: RCGAN+y, RCGAN, RCGAN-U, unbiased GAN, and biased GAN. The RCGAN+y architecture is the same as RCGAN, except that the input to the first layer of its discriminator is concatenated with a one-hot representation of the label. We discuss our techniques to overcome the challenges involved in training RCGAN+y in Appendix L.

Conditional generators can be used to generate samples $x$ from a particular class $y$ among the classes they learned. We can then use a pre-trained classifier $f$ to compare $y$ to the true class of the sample, $f(x)$ (as perceived by the classifier $f$). We compare the generator label accuracy, defined as $\mathbb{E}_{y \sim P_Y,\, z \sim N}[\mathbb{I}_{\{y = f(G(z, y))\}}]$, in Figure 2, left panel. We generated 10k labels chosen uniformly at random and the corresponding conditional samples from the generators, and calculated the generator label accuracy using a CNN classifier pre-trained on the clean MNIST data to an accuracy of 99.2%. The proposed RCGAN significantly improves upon the competing baselines, and achieves almost perfect label accuracy down to a label accuracy of $\alpha = 0.3$ in the training data, which corresponds to a high noise level. RCGAN+y further improves upon RCGAN and attains very high accuracy even at $\alpha = 0.125$. The high accuracy of RCGAN-U suggests that robust training is possible without prior knowledge of the confusion matrix $C$.
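The generator label accuracy defined above is a simple Monte Carlo average. The following sketch shows the computation with toy stand-ins: `toy_generator` and `toy_classifier` are hypothetical placeholders for a trained conditional generator $G$ and a pre-trained classifier $f$, and do not correspond to the networks used in the experiments.

```python
import numpy as np

def generator_label_accuracy(G, f, m, n_samples, latent_dim, rng):
    """Monte Carlo estimate of E[ 1{y = f(G(z, y))} ] with y uniform over m classes."""
    y = rng.integers(0, m, size=n_samples)
    z = rng.standard_normal((n_samples, latent_dim))
    return float(np.mean(f(G(z, y)) == y))

m, latent_dim = 10, 4

def toy_generator(z, y, shift=0):
    """Toy stand-in: a one-hot 'image' of class (y + shift) mod m plus a constant offset."""
    return np.eye(m)[(y + shift) % m] + 0.01 * z[:, :1]

def toy_classifier(x):
    """Toy stand-in for a pre-trained classifier: predict the largest coordinate."""
    return x.argmax(axis=1)

rng = np.random.default_rng(0)
acc_clean = generator_label_accuracy(
    toy_generator, toy_classifier, m, 10_000, latent_dim, rng)
acc_permuted = generator_label_accuracy(
    lambda z, y: toy_generator(z, y, shift=1), toy_classifier, m, 10_000, latent_dim, rng)
print(acc_clean, acc_permuted)
```

The shifted generator illustrates the permutation error discussed in Section 2: its samples are perfectly classifiable, yet every one of them belongs to the wrong class, so its generator label accuracy is zero.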
As expected, the biased GAN has an accuracy of approximately $\alpha$, the label accuracy of its training data (that is, one minus the noise level $1 - \alpha$).

An immediate application of robust GANs is recovering the true labels of the noisy training data, which is an important and challenging problem in crowdsourcing. We propose a new meta-algorithm, which we call cGAN-label-recovery, which uses any conditional generator $G(z, y)$ trained on the noisy samples to estimate the true label of a sample $x$, denoted $\hat{y}$, using the following optimization:

$$\hat{y} \in \arg\min_{y \in [m]} \min_{z_y} \|G(z_y, y) - x\|_2^2. \tag{15}$$

In the right panel of Figure 2, we compare the label recovery accuracy of the meta-algorithm using the five conditional GANs, on 500 randomly chosen noisy training samples. This is also compared to a state-of-the-art method [34] for label recovery, which proposed minimizing an unbiased loss function given the noisy labels and the confusion matrix. This unbiased classifier was shown to outperform the robust classifiers [29, 45, 46] and can be used to predict the true label of the training examples. In Figure 5 of Appendix M, we show example images from all the generators.

### 5.2 CIFAR-10

In Figure 3, we show the inception score [40] and the label accuracy of the conditional generator for the four approaches: our proposed RCGAN and RCGAN-U, against the baselines Unbiased (Section 5) and Biased (Section 1) GANs, trained using CIFAR-10 images [24], while varying the label accuracy of the real data under the uniform flipping model. In RCGAN-U, even with the regularizer, the learned confusion matrix was a permuted version of the true $C$, possibly because a linear classifier might be too simple to classify CIFAR images. To combat this, we initialized the confusion matrix $M$ to be diagonally dominant (Appendix L).
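One way to picture a diagonally dominant initialization of $M$, together with the reparameterized loss $\phi_M$ of Section 4, is sketched below. The row-softmax parameterization, the constant `3.0`, and the function names are assumptions made for illustration; the parameterization actually used is described in Appendix L and may differ.

```python
import numpy as np

def row_softmax(logits):
    """Map unconstrained logits to a row-stochastic matrix (assumed parameterization)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def phi_M(M, d_scores, y, phi=lambda a: a):
    """phi_M(x, y, D) = sum_{y_tilde} M[y, y_tilde] * phi(1 - D(x, y_tilde)).

    d_scores[k, y_tilde] holds D(x_k, y_tilde) for the k-th generated sample.
    """
    return (M[y] * phi(1.0 - d_scores)).sum(axis=1)

m = 10
logits = 3.0 * np.eye(m)     # larger diagonal logits give a diagonally dominant M
M = row_softmax(logits)

rng = np.random.default_rng(0)
d_scores = rng.random((5, m))          # stand-in discriminator outputs in [0, 1)
y = rng.integers(0, m, size=5)         # clean labels fed to the generator
loss = phi_M(M, d_scores, y)
print(np.round(np.diag(M), 3), loss.shape)
```

Because $\phi_M$ is a weighted sum over all candidate noisy labels rather than a sample from row $M_y$, it is a smooth function of $M$, which is what allows $M$ to be updated by gradient steps alongside the generator.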
[Figure 3 plots. Left panel: inception score versus noise in the real data (1 − α). Right panel: generator label accuracy versus noise in the real data (1 − α). Curves: RCGAN-U, RCGAN, Unbiased, Biased.]

Figure 3: Noisy CIFAR-10 dataset: Our RCGAN (red) and RCGAN-U (blue) consistently improve upon Unbiased (magenta) and Biased (black) GANs trained on noisy CIFAR-10 in inception scores (left) and in generator label accuracy (right).

In the left panel of Figure 3, our RCGAN and RCGAN-U consistently achieve higher inception scores than the other two approaches. The Unbiased GAN is highly unstable and hence produces garbage images for large noise (Figure 6), possibly due to numerical instability of $|||C^{-1}|||$, as noted in [20]. This confirms that robust GANs not only produce images from the correct class, but also produce better quality images. In the right panel of Figure 3, we report the generator label accuracy (Section 5.1) on 1k samples generated by each GAN. We classify the generated images using a ResNet-110 model¹ trained to an accuracy of 92.3% on the noiseless CIFAR-10 dataset. The Biased GAN has significantly lower label accuracy, whereas the Unbiased GAN has a low inception score. In Figure 6 in Appendix M, we show example images from the three generators for the different flipping probabilities. We believe that the gain from using the proposed robust GANs will be larger when we train to higher accuracy with larger networks and extensive hyperparameter tuning, using the latest innovations in GAN architectures, for example [54, 28, 17, 19, 41].

6 Conclusion

Standard conditional GANs can be sensitive to noise in the labels of the training data. We propose two new architectures to make them robust, one requiring knowledge of the distribution of the noise and another which does not, and demonstrate the robustness on the benchmark datasets CIFAR-10 and MNIST.
We further showcase how the learned generator can be used to recover the corrupted labels in the training data, which can potentially be used in practical applications. The proposed architecture combines the noise adding idea of AmbientGAN [10], projection discriminator of [32], and regularizers similar to those in InfoGAN [11]. Inspired by AmbientGAN [10], the main idea is to pair the generator output image with a label that is passed through a noisy channel, before feeding to the discriminator. We justify this idea of noise adding by identifying a certain class of discriminators that have good generalization properties. In particular, we prove that projection discriminator, introduced in [32], has a good generalization property. We showcase that the proposed architecture, when trained with a regularizer, has superior robustness on benchmark datasets.

Acknowledgement

This work is supported by NSF awards CNS-1527754, CCF-1553452, CCF-1705007, RI-1815535 and Google Faculty Research Award. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC). This work is partially supported by the generous research credits on AWS cloud computing resources from Amazon.

¹ https://github.com/wenxinxu/resnet-in-tensorflow

References

[1] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Representation learning and adversarial generation of 3D point clouds. arXiv preprint arXiv:1707.02392, 2017.
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[3] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573, 2017.
[4] Yu Bai, Tengyu Ma, and Andrej Risteski. Approximability of discriminators implies diversity in GANs. arXiv preprint arXiv:1806.10586, 2018.
[5] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
[6] G Biau, B Cadre, M Sangnier, and U Tanielian. Some theoretical properties of GANs. arXiv preprint arXiv:1803.07819, 2018.
[7] Battista Biggio, Blaine Nelson, and Pavel Laskov. Support vector machines under adversarial label noise. In Asian Conference on Machine Learning, pages 97–112, 2011.
[8] Mikołaj Bińkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018.
[9] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.
[10] Ashish Bora, Eric Price, and Alexandros G Dimakis. AmbientGAN: Generative models from lossy measurements. In International Conference on Learning Representations (ICLR), 2018.
[11] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
[12] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28, 1979.
[13] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486–1494, 2015.
[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[15] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.
[16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
[17] Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard GAN. arXiv preprint arXiv:1807.00734, 2018.
[18] David R Karger, Sewoong Oh, and Devavrat Shah. Iterative learning for reliable crowdsourcing systems. In Advances in Neural Information Processing Systems, pages 1953–1961, 2011.
[19] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[20] Ashish Khetan, Zachary C Lipton, and Anima Anandkumar. Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577, 2017.
[21] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
[22] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[23] Murat Kocaoglu, Christopher Snyder, Alexandros G Dimakis, and Sriram Vishwanath. CausalGAN: Learning causal implicit generative models with adversarial training. arXiv preprint arXiv:1709.02023, 2017.
[24] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[25] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[26] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint, 2016.
[27] Xiaodan Liang, Lisa Lee, Wei Dai, and Eric P Xing. Dual motion GAN for future-flow embedded video prediction. arXiv preprint, 2017.
[28] Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. PacGAN: The power of two samples in generative adversarial networks. arXiv preprint arXiv:1712.04086, 2017.
[29] Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, and Philip S Yu. Building text classifiers using positive and unlabeled examples. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, pages 179–186. IEEE, 2003.
[30] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[31] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
[32] Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. arXiv preprint arXiv:1802.05637, 2018.
[33] Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, and Sreeram Kannan. ClusterGAN: Latent space clustering in generative adversarial networks. arXiv preprint arXiv:1809.03627, 2018.
[34] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems, pages 1196–1204, 2013.
[35] Anh Nguyen, Jason Yosinski, Yoshua Bengio, Alexey Dosovitskiy, and Jeff Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. arXiv preprint arXiv:1612.00005, 2016.
[36] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2016.
[37] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[38] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
[39] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[40] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[41] Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, and Jason D Lee. Solving approximate Wasserstein GANs to stationarity. arXiv preprint arXiv:1802.08249, 2018.
[42] Rajat Sen, Karthikeyan Shanmugam, Himanshu Asnani, Arman Rahimzamani, and Sreeram Kannan. Mimic and classify: A meta-algorithm for conditional independence testing. arXiv preprint arXiv:1806.09708, 2018.
[43] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russ Webb. Learning from simulated and unsupervised images through adversarial training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, page 6, 2017.
[44] Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG Lanckriet. On integral probability metrics, φ-divergences and binary classification. arXiv preprint arXiv:0901.2698, 2009.
[45] Guillaume Stempfel and Liva Ralaivola. Learning kernel perceptrons on noisy data using random projections. In International Conference on Algorithmic Learning Theory, pages 328–342. Springer, 2007.
[46] Guillaume Stempfel and Liva Ralaivola. Learning SVMs from sloppily labeled data. In International Conference on Artificial Neural Networks, pages 884–893. Springer, 2009.
[47] Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.
[48] Dougal J Sutherland, Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alex Smola, and Arthur Gretton. Generative models and model criticism via optimized maximum mean discrepancy. arXiv preprint arXiv:1611.04488, 2016.
[49] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
[50] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pages 613–621, 2016.
[51] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
[52] Zhi Xu, Chengtao Li, and Stefanie Jegelka. Robust GANs against dishonest adversaries. arXiv preprint arXiv:1802.09700, 2018.
[53] Raymond Yeh, Chen Chen, Teck Yian Lim, Mark Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with perceptual and contextual losses. arXiv preprint arXiv:1607.07539, 2016.
[54] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
[55] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE Int. Conf. Comput. Vision (ICCV), pages 5907–5915, 2017.
[56] Pengchuan Zhang, Qiang Liu, Dengyong Zhou, Tao Xu, and Xiaodong He. On the discrimination-generalization tradeoff in GANs. arXiv preprint arXiv:1711.02771, 2017.
[57] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
[58] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.