# Improved Techniques for Training GANs

Tim Salimans tim@openai.com
Ian Goodfellow ian@openai.com
Wojciech Zaremba woj@openai.com
Vicki Cheung vicki@openai.com
Alec Radford alec@openai.com
Xi Chen peter@openai.com

Abstract

We present a variety of new architectural features and training procedures that we apply to the generative adversarial networks (GANs) framework. Using our new techniques, we achieve state-of-the-art results in semi-supervised classification on MNIST, CIFAR-10 and SVHN. The generated images are of high quality as confirmed by a visual Turing test: our model generates MNIST samples that humans cannot distinguish from real data, and CIFAR-10 samples that yield a human error rate of 21.3%. We also present ImageNet samples with unprecedented resolution and show that our methods enable the model to learn recognizable features of ImageNet classes.

1 Introduction

Generative adversarial networks [1] (GANs) are a class of methods for learning generative models based on game theory. The goal of GANs is to train a generator network $G(z; \theta^{(G)})$ that produces samples from the data distribution, $p_{\text{data}}(x)$, by transforming vectors of noise $z$ as $x = G(z; \theta^{(G)})$. The training signal for $G$ is provided by a discriminator network $D(x)$ that is trained to distinguish samples from the generator distribution $p_{\text{model}}(x)$ from real data. The generator network $G$ in turn is then trained to fool the discriminator into accepting its outputs as being real.

Recent applications of GANs have shown that they can produce excellent samples [2, 3]. However, training GANs requires finding a Nash equilibrium of a non-convex game with continuous, high-dimensional parameters. GANs are typically trained using gradient descent techniques that are designed to find a low value of a cost function, rather than to find the Nash equilibrium of a game. When used to seek a Nash equilibrium, these algorithms may fail to converge [4].

In this work, we introduce several techniques intended to encourage convergence of the GANs game. These techniques are motivated by a heuristic understanding of the non-convergence problem. They lead to improved semi-supervised learning performance and improved sample generation. We hope that some of them may form the basis for future work, providing formal guarantees of convergence. All code and hyperparameters may be found at https://github.com/openai/improved-gan.
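To make the two-player game concrete, the following is a minimal sketch of the usual simultaneous gradient descent procedure, written in PyTorch. The toy data distribution, network sizes, and optimizer settings are illustrative assumptions, not the architecture used in the experiments; the actual code is in the repository above.

```python
# Minimal sketch of simultaneous gradient descent on the GAN game.
# The toy data source, network sizes, and hyperparameters below are
# illustrative assumptions, not the paper's experimental setup.
import torch
import torch.nn as nn

noise_dim, data_dim, batch = 16, 2, 64
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    x_real = torch.randn(batch, data_dim) + 3.0   # stand-in for p_data
    x_fake = G(torch.randn(batch, noise_dim))     # x = G(z)

    # D is trained to separate real data from generator samples.
    d_loss = bce(D(x_real), torch.ones(batch, 1)) \
           + bce(D(x_fake.detach()), torch.zeros(batch, 1))
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # G is trained to make D accept its outputs as real.
    g_loss = bce(D(x_fake), torch.ones(batch, 1))
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
```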
2 Related work

Several recent papers focus on improving the stability of training and the resulting perceptual quality of GAN samples [2, 3, 5, 6]. We build on some of these techniques in this work. For instance, we use some of the DCGAN architectural innovations proposed in Radford et al. [3], as discussed below. One of our proposed techniques, feature matching, discussed in Sec. 3.1, is similar in spirit to approaches that use maximum mean discrepancy [7, 8, 9] to train generator networks [10, 11]. Another of our proposed techniques, minibatch features, is based in part on ideas used for batch normalization [12], while our proposed virtual batch normalization is a direct extension of batch normalization.

One of the primary goals of this work is to improve the effectiveness of generative adversarial networks for semi-supervised learning (improving the performance of a supervised task, in this case, classification, by learning on additional unlabeled examples). Like many deep generative models, GANs have previously been applied to semi-supervised learning [13, 14], and our work can be seen as a continuation and refinement of this effort. In concurrent work, Odena [15] proposes to extend GANs to predict image labels like we do in Section 5, but without our feature matching extension (Section 3.1), which we found to be critical for obtaining state-of-the-art performance.

3 Toward Convergent GAN Training

Training GANs consists of finding a Nash equilibrium to a two-player non-cooperative game. Each player wishes to minimize its own cost function, $J^{(D)}(\theta^{(D)}, \theta^{(G)})$ for the discriminator and $J^{(G)}(\theta^{(D)}, \theta^{(G)})$ for the generator. A Nash equilibrium is a point $(\theta^{(D)}, \theta^{(G)})$ such that $J^{(D)}$ is at a minimum with respect to $\theta^{(D)}$ and $J^{(G)}$ is at a minimum with respect to $\theta^{(G)}$. Unfortunately, finding Nash equilibria is a very difficult problem. Algorithms exist for specialized cases, but we are not aware of any that are feasible to apply to the GAN game, where the cost functions are non-convex, the parameters are continuous, and the parameter space is extremely high-dimensional.

The idea that a Nash equilibrium occurs when each player has minimal cost seems to intuitively motivate the idea of using traditional gradient-based minimization techniques to minimize each player's cost simultaneously. Unfortunately, a modification to $\theta^{(D)}$ that reduces $J^{(D)}$ can increase $J^{(G)}$, and a modification to $\theta^{(G)}$ that reduces $J^{(G)}$ can increase $J^{(D)}$. Gradient descent thus fails to converge for many games. For example, when one player minimizes $xy$ with respect to $x$ and another player minimizes $-xy$ with respect to $y$, gradient descent enters a stable orbit, rather than converging to $x = y = 0$, the desired equilibrium point [16]. Previous approaches to GAN training have thus applied gradient descent on each player's cost simultaneously, despite the lack of guarantee that this procedure will converge. We introduce the following techniques that are heuristically motivated to encourage convergence.

3.1 Feature matching

Feature matching addresses the instability of GANs by specifying a new objective for the generator that prevents it from overtraining on the current discriminator. Instead of directly maximizing the output of the discriminator, the new objective requires the generator to generate data that matches the statistics of the real data, where we use the discriminator only to specify the statistics that we think are worth matching. Specifically, we train the generator to match the expected value of the features on an intermediate layer of the discriminator. This is a natural choice of statistics for the generator to match, since by training the discriminator we ask it to find those features that are most discriminative of real data versus data generated by the current model.

Letting $f(x)$ denote activations on an intermediate layer of the discriminator, our new objective for the generator is defined as $\left\| \mathbb{E}_{x \sim p_{\text{data}}} f(x) - \mathbb{E}_{z \sim p_z(z)} f(G(z)) \right\|_2^2$. The discriminator, and hence $f(x)$, are trained in the usual way. As with regular GAN training, the objective has a fixed point where $G$ exactly matches the distribution of training data. We have no guarantee of reaching this fixed point in practice, but our empirical results indicate that feature matching is indeed effective in situations where regular GAN training becomes unstable.
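In code, this objective amounts to matching minibatch averages of intermediate discriminator activations. The sketch below splits a toy discriminator so that $f(x)$ is exposed; the layer sizes are illustrative assumptions, not the architecture used here.

```python
# Minimal sketch of the feature matching objective; sizes are illustrative.
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(2, 64), nn.ReLU())   # intermediate features f(x)
head = nn.Linear(64, 1)                          # D(x) = head(f(x))

def feature_matching_loss(x_real, x_fake):
    # || E_{x~p_data} f(x) - E_z f(G(z)) ||_2^2, estimated on minibatches:
    # the generator matches expected features instead of maximizing D's output.
    return ((f(x_real).mean(dim=0) - f(x_fake).mean(dim=0)) ** 2).sum()

# Only the generator's parameters are stepped on this loss;
# f and head are trained with the ordinary discriminator objective.
```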
3.2 Minibatch discrimination

One of the main failure modes for GANs is for the generator to collapse to a parameter setting where it always emits the same point. When collapse to a single mode is imminent, the gradient of the discriminator may point in similar directions for many similar points. Because the discriminator processes each example independently, there is no coordination between its gradients, and thus no mechanism to tell the outputs of the generator to become more dissimilar to each other. Instead, all outputs race toward a single point that the discriminator currently believes is highly realistic. After collapse has occurred, the discriminator learns that this single point comes from the generator, but gradient descent is unable to separate the identical outputs. The gradients of the discriminator then push the single point produced by the generator around space forever, and the algorithm cannot converge to a distribution with the correct amount of entropy.

An obvious strategy to avoid this type of failure is to allow the discriminator to look at multiple data examples in combination, and perform what we call minibatch discrimination. The concept of minibatch discrimination is quite general: any discriminator model that looks at multiple examples in combination, rather than in isolation, could potentially help avoid collapse of the generator. In fact, the successful application of batch normalization in the discriminator by Radford et al. [3] is well explained from this perspective. So far, however, we have restricted our experiments to models that explicitly aim to identify generator samples that are particularly close together.

One successful specification for modelling the closeness between examples in a minibatch is as follows: Let $f(x_i) \in \mathbb{R}^A$ denote a vector of features for input $x_i$, produced by some intermediate layer in the discriminator. We then multiply the vector $f(x_i)$ by a tensor $T \in \mathbb{R}^{A \times B \times C}$, which results in a matrix $M_i \in \mathbb{R}^{B \times C}$. We then compute the $L_1$-distance between the rows of the resulting matrix $M_i$ across samples $i \in \{1, 2, \ldots, n\}$ and apply a negative exponential (Fig. 1): $c_b(x_i, x_j) = \exp(-\|M_{i,b} - M_{j,b}\|_{L_1}) \in \mathbb{R}$.

Figure 1: Sketch of how minibatch discrimination works. Features $f(x_i)$ from sample $x_i$ are multiplied through a tensor $T$, and cross-sample distance is computed.

The output $o(x_i)$ of this minibatch layer for a sample $x_i$ is then defined as the sum of the $c_b(x_i, x_j)$'s to all other samples:

$$o(x_i)_b = \sum_{j=1}^{n} c_b(x_i, x_j) \in \mathbb{R}, \qquad o(x_i) = \left[ o(x_i)_1, o(x_i)_2, \ldots, o(x_i)_B \right] \in \mathbb{R}^B.$$

Next, we concatenate the output $o(x_i)$ of the minibatch layer with the intermediate features $f(x_i)$ that were its input, and we feed the result into the next layer of the discriminator. We compute these minibatch features separately for samples from the generator and from the training data. As before, the discriminator is still required to output a single number for each example indicating how likely it is to come from the training data: the task of the discriminator is thus effectively still to classify single examples as real data or generated data, but it is now able to use the other examples in the minibatch as side information. Minibatch discrimination allows us to generate visually appealing samples very quickly, and in this regard it is superior to feature matching (Section 6). Interestingly, however, feature matching was found to work much better if the goal is to obtain a strong classifier using the approach to semi-supervised learning described in Section 5.
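A minimal PyTorch sketch of this layer is given below. The dimensions $A$, $B$, $C$ and the initialization of $T$ are illustrative assumptions; note that the sum over $j$ here also includes $j = i$, which only adds the constant $\exp(0) = 1$.

```python
import torch
import torch.nn as nn

class MinibatchDiscrimination(nn.Module):
    """Sketch of the minibatch layer: A input features, B output features,
    each computed from C-dimensional rows of M_i (sizes are illustrative)."""
    def __init__(self, A, B, C):
        super().__init__()
        self.T = nn.Parameter(torch.randn(A, B * C) * 0.1)
        self.B, self.C = B, C

    def forward(self, feats):                 # feats: (n, A), rows are f(x_i)
        M = (feats @ self.T).view(-1, self.B, self.C)  # M_i in R^{B x C}
        diff = M.unsqueeze(0) - M.unsqueeze(1)         # (n, n, B, C)
        c = torch.exp(-diff.abs().sum(dim=3))          # c_b(x_i, x_j), (n, n, B)
        o = c.sum(dim=1)                               # o(x_i)_b = sum_j c_b(x_i, x_j)
        return torch.cat([feats, o], dim=1)            # concatenate with f(x_i)

# Usage: out = MinibatchDiscrimination(A=64, B=32, C=8)(features),
# fed into the next (e.g. final) layer of the discriminator.
```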
3.3 Historical averaging

When applying this technique, we modify each player's cost to include a term $\left\| \theta - \frac{1}{t} \sum_{i=1}^{t} \theta[i] \right\|^2$, where $\theta[i]$ is the value of the parameters at past time $i$. The historical average of the parameters can be updated in an online fashion, so this learning rule scales well to long time series. This approach is loosely inspired by the fictitious play [17] algorithm that can find equilibria in other kinds of games. We found that our approach was able to find equilibria of low-dimensional, continuous non-convex games, such as the minimax game with one player controlling $x$, the other player controlling $y$, and value function $(f(x) - 1)(y - 1)$, where $f(x) = x$ for $x < 0$ and $f(x) = x^2$ otherwise. For these same toy games, gradient descent fails by going into extended orbits that do not approach the equilibrium point.

3.4 One-sided label smoothing

Label smoothing, a technique from the 1980s recently independently re-discovered by Szegedy et al. [18], replaces the 0 and 1 targets for a classifier with smoothed values, like .9 or .1, and was recently shown to reduce the vulnerability of neural networks to adversarial examples [19].

Replacing positive classification targets with $\alpha$ and negative targets with $\beta$, the optimal discriminator becomes $D(x) = \frac{\alpha p_{\text{data}}(x) + \beta p_{\text{model}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)}$. The presence of $p_{\text{model}}$ in the numerator is problematic because, in areas where $p_{\text{data}}$ is approximately zero and $p_{\text{model}}$ is large, erroneous samples from $p_{\text{model}}$ have no incentive to move nearer to the data. We therefore smooth only the positive labels to $\alpha$, leaving negative labels set to 0.

3.5 Virtual batch normalization

Batch normalization greatly improves optimization of neural networks, and was shown to be highly effective for DCGANs [3]. However, it causes the output of a neural network for an input example $x$ to be highly dependent on several other inputs $x'$ in the same minibatch. To avoid this problem we introduce virtual batch normalization (VBN), in which each example $x$ is normalized based on the statistics collected on a reference batch of examples that are chosen once and fixed at the start of training, and on $x$ itself. The reference batch is normalized using only its own statistics. VBN is computationally expensive because it requires running forward propagation on two minibatches of data, so we use it only in the generator network.
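The sketch below gives a minimal version of VBN for fully connected activations, following the description above. Shapes, the scale/shift parameters, and the per-example weighting of $x$ in the statistics are illustrative assumptions; in a full generator the reference batch is forward-propagated through the network each step (the cost noted above), whereas here fixed reference activations are passed in directly, and normalizing the reference batch with its own statistics is not shown.

```python
import torch
import torch.nn as nn

class VirtualBatchNorm(nn.Module):
    """Sketch of VBN for (batch, features) activations. The reference batch
    is chosen once and fixed; shapes and init here are illustrative."""
    def __init__(self, num_features, ref_batch):
        super().__init__()
        self.register_buffer("ref", ref_batch)   # fixed reference batch
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))

    def forward(self, x, eps=1e-5):
        n = self.ref.size(0)
        # Per-example statistics over the reference batch plus x itself,
        # so x's normalization depends only on the fixed batch and on x.
        mu = (self.ref.sum(dim=0) + x) / (n + 1)          # (batch, features)
        sq = ((self.ref.unsqueeze(1) - mu) ** 2).sum(dim=0)
        var = (sq + (x - mu) ** 2) / (n + 1)
        return self.gamma * (x - mu) / torch.sqrt(var + eps) + self.beta
```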
4 Assessment of image quality

Generative adversarial networks lack an objective function, which makes it difficult to compare performance of different models. One intuitive metric of performance can be obtained by having human annotators judge the visual quality of samples [2]. We automate this process using Amazon Mechanical Turk (MTurk), using the web interface in Fig. 2 (live at http://infinite-chamber-35121.herokuapp.com/cifar-minibatch/), which we use to ask annotators to distinguish between generated data and real data. The resulting quality assessments of our models are described in Section 6.

Figure 2: Web interface given to annotators. Annotators are asked to distinguish computer generated images from real ones.

A downside of using human annotators is that the metric varies depending on the setup of the task and the motivation of the annotators. We also find that results change drastically when we give annotators feedback about their mistakes: by learning from such feedback, annotators are better able to point out the flaws in generated images, giving a more pessimistic quality assessment. The left column of Fig. 2 presents a screen from the annotation process, while the right column shows how we inform annotators about their mistakes.

As an alternative to human annotators, we propose an automatic method to evaluate samples, which we find to correlate well with human evaluation: We apply the Inception model¹ [20] to every generated image to get the conditional label distribution $p(y|x)$. Images that contain meaningful objects should have a conditional label distribution $p(y|x)$ with low entropy. Moreover, we expect the model to generate varied images, so the marginal $\int p(y|x = G(z))\,dz$ should have high entropy. Combining these two requirements, the metric that we propose is $\exp(\mathbb{E}_x \mathrm{KL}(p(y|x) \,\|\, p(y)))$, where we exponentiate results so the values are easier to compare. Our Inception score is closely related to the objective used for training generative models in CatGAN [14]: although we had less success using such an objective for training, we find it is a good metric for evaluation that correlates very well with human judgment. We find that it is important to evaluate the metric on a large enough number of samples (i.e. 50k), as part of this metric measures diversity.

¹We use the pretrained Inception model from http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz. Code to compute the Inception score with this model will be made available by the time of publication.

5 Semi-supervised learning

Consider a standard classifier for classifying a data point $x$ into one of $K$ possible classes. Such a model takes in $x$ as input and outputs a $K$-dimensional vector of logits $\{l_1, \ldots, l_K\}$ that can be turned into class probabilities by applying the softmax: $p_{\text{model}}(y = j|x) = \frac{\exp(l_j)}{\sum_{k=1}^{K} \exp(l_k)}$. In supervised learning, such a model is then trained by minimizing the cross-entropy between the observed labels and the model predictive distribution $p_{\text{model}}(y|x)$.

We can do semi-supervised learning with any standard classifier by simply adding samples from the GAN generator $G$ to our data set, labeling them with a new "generated" class $y = K + 1$, and correspondingly increasing the dimension of our classifier output from $K$ to $K + 1$. We may then use $p_{\text{model}}(y = K + 1 \mid x)$ to supply the probability that $x$ is fake, corresponding to $1 - D(x)$ in the original GAN framework. We can now also learn from unlabeled data, as long as we know that it corresponds to one of the $K$ classes of real data, by maximizing $\log p_{\text{model}}(y \in \{1, \ldots, K\}|x)$. Assuming half of our data set consists of real data and half of it is generated (this is arbitrary), our loss function for training the classifier then becomes

$$L = -\mathbb{E}_{x,y \sim p_{\text{data}}(x,y)}[\log p_{\text{model}}(y|x)] - \mathbb{E}_{x \sim G}[\log p_{\text{model}}(y = K + 1|x)] = L_{\text{supervised}} + L_{\text{unsupervised}},$$

where

$$L_{\text{supervised}} = -\mathbb{E}_{x,y \sim p_{\text{data}}(x,y)} \log p_{\text{model}}(y|x, y < K + 1)$$
$$L_{\text{unsupervised}} = -\left\{ \mathbb{E}_{x \sim p_{\text{data}}(x)} \log[1 - p_{\text{model}}(y = K + 1|x)] + \mathbb{E}_{x \sim G} \log p_{\text{model}}(y = K + 1|x) \right\},$$

where we have decomposed the total cross-entropy loss into our standard supervised loss function $L_{\text{supervised}}$ (the negative log probability of the label, given that the data is real) and an unsupervised loss $L_{\text{unsupervised}}$, which is in fact the standard GAN game-value, as becomes evident when we substitute $D(x) = 1 - p_{\text{model}}(y = K + 1|x)$ into the expression:

$$L_{\text{unsupervised}} = -\left\{ \mathbb{E}_{x \sim p_{\text{data}}(x)} \log D(x) + \mathbb{E}_{z \sim \text{noise}} \log(1 - D(G(z))) \right\}.$$

The optimal solution for minimizing both $L_{\text{supervised}}$ and $L_{\text{unsupervised}}$ is to have $\exp[l_j(x)] = c(x)p(y = j, x)\ \forall j < K + 1$ and $\exp[l_{K+1}(x)] = c(x)p_G(x)$, for some undetermined scaling function $c(x)$.
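A minimal sketch of these two losses in PyTorch is given below, assuming a classifier (not shown) that outputs $K + 1$ logits per example, with the last (0-based index $K$) serving as the generated class; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

K = 10  # number of real classes; the classifier outputs K + 1 logits

def log_p_fake(logits):
    # log p_model(y = K+1 | x) = l_{K+1} - logsumexp_k l_k
    return logits[:, K] - torch.logsumexp(logits, dim=1)

def semi_supervised_losses(logits_lab, labels, logits_unl, logits_gen):
    # L_supervised: cross-entropy over the K real classes only (y < K+1).
    l_sup = F.cross_entropy(logits_lab[:, :K], labels)
    # L_unsupervised: real unlabeled data should not be the generated class,
    # generator samples should be; this is the standard GAN game value with
    # D(x) = 1 - p_model(y = K+1 | x).
    l_unsup = -torch.log1p(-log_p_fake(logits_unl).exp()).mean() \
              - log_p_fake(logits_gen).mean()
    return l_sup, l_unsup
```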