# ENERGY-BASED GENERATIVE ADVERSARIAL NETWORKS

Published as a conference paper at ICLR 2017

Junbo Zhao, Michael Mathieu and Yann LeCun
Department of Computer Science, New York University
Facebook Artificial Intelligence Research
{jakezhao, mathieu, yann}@cs.nyu.edu

We introduce the Energy-based Generative Adversarial Network model (EBGAN), which views the discriminator as an energy function that attributes low energies to the regions near the data manifold and higher energies to other regions. Similar to the probabilistic GANs, a generator is seen as being trained to produce contrastive samples with minimal energies, while the discriminator is trained to assign high energies to these generated samples. Viewing the discriminator as an energy function allows the use of a wide variety of architectures and loss functionals in addition to the usual binary classifier with logistic output. Among them, we show one instantiation of the EBGAN framework that uses an auto-encoder architecture, with the energy being the reconstruction error, in place of the discriminator. We show that this form of EBGAN exhibits more stable behavior than regular GANs during training. We also show that a single-scale architecture can be trained to generate high-resolution images.

1 INTRODUCTION

1.1 ENERGY-BASED MODEL

The essence of the energy-based model (LeCun et al., 2006) is to build a function that maps each point of an input space to a single scalar, which is called "energy". The learning phase is a data-driven process that shapes the energy surface in such a way that the desired configurations get assigned low energies, while the incorrect ones are given high energies. Supervised learning falls into this framework: for each X in the training set, the energy of the pair (X, Y) takes low values when Y is the correct label and higher values for incorrect Y's.
Similarly, when modeling X alone within an unsupervised learning setting, lower energy is attributed to the data manifold. The term contrastive sample is often used to refer to a data point causing an energy pull-up, such as the incorrect Y's in supervised learning and points from low data density regions in unsupervised learning.

1.2 GENERATIVE ADVERSARIAL NETWORKS

Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) have led to significant improvements in image generation (Denton et al., 2015; Radford et al., 2015; Im et al., 2016; Salimans et al., 2016), video prediction (Mathieu et al., 2015) and a number of other domains. The basic idea of GAN is to simultaneously train a discriminator and a generator. The discriminator is trained to distinguish real samples of a dataset from fake samples produced by the generator. The generator uses input from an easy-to-sample random source, and is trained to produce fake samples that the discriminator cannot distinguish from real data samples. During training, the generator receives the gradient of the output of the discriminator with respect to the fake sample. In the original formulation of GAN in Goodfellow et al. (2014), the discriminator produces a probability and, under certain conditions, convergence occurs when the distribution produced by the generator matches the data distribution. From a game theory point of view, the convergence of a GAN is reached when the generator and the discriminator reach a Nash equilibrium.

1.3 ENERGY-BASED GENERATIVE ADVERSARIAL NETWORKS

In this work, we propose to view the discriminator as an energy function (or a contrast function) without explicit probabilistic interpretation. The energy function computed by the discriminator can be viewed as a trainable cost function for the generator. The discriminator is trained to assign low energy values to the regions of high data density, and higher energy values outside these regions.
Conversely, the generator can be viewed as a trainable parameterized function that produces samples in regions of the space to which the discriminator assigns low energy. While it is often possible to convert energies into probabilities through a Gibbs distribution (LeCun et al., 2006), the absence of normalization in this energy-based form of GAN provides greater flexibility in the choice of architecture of the discriminator and the training procedure. The probabilistic binary discriminator in the original formulation of GAN can be seen as one way among many to define the contrast function and loss functional, as described in LeCun et al. (2006) for the supervised and weakly supervised settings, and Ranzato et al. (2007) for unsupervised learning. We experimentally demonstrate this concept in the setting where the discriminator is an auto-encoder architecture and the energy is the reconstruction error. More details on the interpretation of EBGAN are provided in appendix B.

Our main contributions are summarized as follows:

- An energy-based formulation for generative adversarial training.
- A proof that under a simple hinge loss, when the system reaches convergence, the generator of EBGAN produces points that follow the underlying data distribution.
- An EBGAN framework with the discriminator using an auto-encoder architecture in which the energy is the reconstruction error.
- A set of systematic experiments to explore hyper-parameters and architectural choices that produce good results for both EBGANs and probabilistic GANs.
- A demonstration that the EBGAN framework can be used to generate reasonable-looking high-resolution images from the ImageNet dataset at 256 × 256 pixel resolution, without a multi-scale approach.

2 THE EBGAN MODEL

Let $p_{data}$ be the underlying probability density of the distribution that produces the dataset.
The generator $G$ is trained to produce a sample $G(z)$, for instance an image, from a random vector $z$, which is sampled from a known distribution $p_z$, for instance $\mathcal{N}(0, 1)$. The discriminator $D$ takes either real or generated images, and estimates the energy value $E \in \mathbb{R}$ accordingly, as explained later. For simplicity, we assume that $D$ produces non-negative values, but the analysis would hold as long as the values are bounded below.

2.1 OBJECTIVE FUNCTIONAL

The output of the discriminator goes through an objective functional in order to shape the energy function, attributing low energy to the real data samples and higher energy to the generated ("fake") ones. In this work, we use a margin loss, but many other choices are possible as explained in LeCun et al. (2006). Similarly to what has been done with the probabilistic GAN (Goodfellow et al., 2014), we use two different losses, one to train $D$ and the other to train $G$, in order to get better quality gradients when the generator is far from convergence.

Given a positive margin $m$, a data sample $x$ and a generated sample $G(z)$, the discriminator loss $\mathcal{L}_D$ and the generator loss $\mathcal{L}_G$ are formally defined by:

$$\mathcal{L}_D(x, z) = D(x) + [m - D(G(z))]^+ \quad (1)$$

$$\mathcal{L}_G(z) = D(G(z)) \quad (2)$$

where $[\cdot]^+ = \max(0, \cdot)$. Minimizing $\mathcal{L}_G$ with respect to the parameters of $G$ is similar to maximizing the second term of $\mathcal{L}_D$. It has the same minimum but non-zero gradients when $D(G(z)) \ge m$.

2.2 OPTIMALITY OF THE SOLUTION

In this section, we present a theoretical analysis of the system presented in section 2.1. We show that if the system reaches a Nash equilibrium, then the generator $G$ produces samples that are indistinguishable from the distribution of the dataset. This section is done in a non-parametric setting, i.e. we assume that $D$ and $G$ have infinite capacity.

Given a generator $G$, let $p_G$ be the density distribution of $G(z)$ where $z \sim p_z$. In other words, $p_G$ is the density distribution of the samples generated by $G$.
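As an illustration of equations 1 and 2 in section 2.1, the following minimal Python sketch evaluates both losses on scalar energies. The function names and the sample energy values are illustrative assumptions of ours and are not taken from the experiments; the energies stand in for the outputs D(x) and D(G(z)) of an actual discriminator.

```python
def discriminator_loss(d_real, d_fake, margin):
    """Equation 1: L_D = D(x) + max(0, m - D(G(z)))."""
    return d_real + max(0.0, margin - d_fake)


def generator_loss(d_fake):
    """Equation 2: L_G = D(G(z))."""
    return d_fake


def hinge_grad_wrt_fake(d_fake, margin):
    """Derivative of the hinge term [m - D(G(z))]^+ with respect to D(G(z))."""
    return -1.0 if d_fake < margin else 0.0


# Hypothetical energies: a well-reconstructed real sample and two fake samples.
m = 10.0
loss_inside_margin = discriminator_loss(0.5, 3.0, m)    # hinge term is active
loss_outside_margin = discriminator_loss(0.5, 12.0, m)  # hinge term is zero
print(loss_inside_margin, loss_outside_margin)
print(hinge_grad_wrt_fake(12.0, m), generator_loss(12.0))
```

Once the energy of a fake sample exceeds the margin, the hinge term contributes no gradient, whereas the generator loss in equation 2 still depends linearly on D(G(z)). This is the reason given above for training G with a separate loss.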
We define $V(G, D) = \int_{x,z} \mathcal{L}_D(x, z)\, p_{data}(x)\, p_z(z)\, dx\, dz$ and $U(G, D) = \int_z \mathcal{L}_G(z)\, p_z(z)\, dz$. We train the discriminator $D$ to minimize the quantity $V$ and the generator $G$ to minimize the quantity $U$. A Nash equilibrium of the system is a pair $(G^*, D^*)$ that satisfies:

$$V(G^*, D^*) \le V(G^*, D) \quad \forall D \quad (3)$$

$$U(G^*, D^*) \le U(G, D^*) \quad \forall G \quad (4)$$

Theorem 1. If $(G^*, D^*)$ is a Nash equilibrium of the system, then $p_{G^*} = p_{data}$ almost everywhere, and $V(G^*, D^*) = m$.

Proof. First we observe that

$$V(G^*, D) = \int_x p_{data}(x)\, D(x)\, dx + \int_z p_z(z)\, [m - D(G^*(z))]^+\, dz \quad (5)$$

$$= \int_x \left( p_{data}(x)\, D(x) + p_{G^*}(x)\, [m - D(x)]^+ \right) dx. \quad (6)$$

The analysis of the function $\varphi(y) = ay + b(m - y)^+$ (see lemma 1 in appendix A for details) shows:

(a) $D^*(x) \le m$ almost everywhere. To verify it, let us assume that there exists a set of non-zero measure such that $D^*(x) > m$. Let $\tilde{D}(x) = \min(D^*(x), m)$. Then $V(G^*, \tilde{D}) < V(G^*, D^*)$, which violates equation 3.

(b) The function $\varphi$ reaches its minimum in $m$ if $a < b$ and in $0$ otherwise. So $V(G^*, D)$ reaches its minimum when we replace $D^*(x)$ by these values. We obtain V (G , D ) = m Z x 1pdata(x)

$$\int_x \mathbb{1}_{p(x) > q(x)}\, (p(x) - q(x))\, dx \quad (15)$$

$$= \int_x \left(1 - \mathbb{1}_{p(x) \le q(x)}\right) (p(x) - q(x))\, dx \quad (16)$$

$$= \int_x p(x)\, dx - \int_x q(x)\, dx - \int_x \mathbb{1}_{p(x) \le q(x)}\, (p(x) - q(x))\, dx \quad (17)$$

So $\int_x \mathbb{1}_{p(x) > q(x)}\, (p(x) - q(x))\, dx = 0$ and since the term in the integral is always non-negative, $\mathbb{1}_{p(x) > q(x)}\, (p(x) - q(x)) = 0$ for almost all $x$. And $p(x) - q(x) = 0$ implies $\mathbb{1}_{p(x) > q(x)} = 0$, so $\mathbb{1}_{p(x) > q(x)} = 0$ almost everywhere. Therefore $\int_x \mathbb{1}_{p(x) > q(x)}\, dx = 0$, which completes the proof, given the hypothesis.

Proof of theorem 2. The sufficient conditions are obvious. The necessary condition on $G^*$ comes from theorem 1, and the necessary condition $D^*(x) \le m$ is from the proof of theorem 1. Let us now assume that $D^*(x)$ is not constant almost everywhere and find a contradiction. If it is not, then there exists a constant $C$ and a set $S$ of non-zero measure such that $\forall x \in S$, $D^*(x) \le C$ and $\forall x \notin S$, $D^*(x) > C$. In addition we can choose $S$ such that there exists a subset $S' \subset S$ of non-zero measure such that $p_{data}(x) > 0$ on $S'$ (because of the assumption in the footnote). We can build a generator $G_0$ such that $p_{G_0}(x) \le p_{data}(x)$ over $S$ and $p_{G_0}(x) < p_{data}(x)$ over $S'$. We compute

$$U(G^*, D^*) - U(G_0, D^*) = \int_x (p_{data} - p_{G_0})\, D^*(x)\, dx \quad (21)$$

$$= \int_x (p_{data} - p_{G_0})\, (D^*(x) - C)\, dx \quad (22)$$

$$= \int_S (p_{data} - p_{G_0})\, (D^*(x) - C)\, dx + \int_{\mathbb{R}^N \setminus S} (p_{data} - p_{G_0})\, (D^*(x) - C)\, dx \quad (23)$$

$$> 0 \quad (24)$$

which violates equation 4.

B APPENDIX: MORE INTERPRETATIONS ABOUT GANS AND ENERGY-BASED LEARNING

TWO INTERPRETATIONS OF GANS

GANs can be interpreted in two complementary ways. In the first interpretation, the key component is the generator, and the discriminator plays the role of a trainable objective function. Let us imagine that the data lies on a manifold. Until the generator produces samples that are recognized as being on the manifold, it gets a gradient indicating how to modify its output so it could approach the manifold. In such a scenario, the discriminator acts to punish the generator when it produces samples that are outside the manifold. This can be understood as a way to train the generator with a set of possible desired outputs (e.g.
the manifold) instead of a single desired output as in traditional supervised learning.

For the second interpretation, the key component is the discriminator, and the generator is merely trained to produce contrastive samples. We show in section 4.2 that by iteratively and interactively feeding contrastive samples, the generator enhances the semi-supervised learning performance of the discriminator (e.g. Ladder Network).

C APPENDIX: EXPERIMENT SETTINGS

MORE DETAILS ABOUT THE GRID SEARCH

For training both EBGANs and GANs for the grid search, we use the following setting:

- Batch normalization (Ioffe & Szegedy, 2015) is applied after each weight layer, except for the generator output layer and the discriminator input layer (Radford et al., 2015).
- Training images are scaled into the range [-1, 1]. Correspondingly, the generator output layer is followed by a Tanh function.
- ReLU is used as the non-linearity function.
- Initialization: the weights in $D$ are drawn from $\mathcal{N}(0, 0.002)$ and in $G$ from $\mathcal{N}(0, 0.02)$. The biases are initialized to 0.

We evaluate the models from the grid search by calculating a modified version of the inception score, $I = E_x\, \mathrm{KL}(p(y)\, \|\, p(y|x))$, where $x$ denotes a generated sample and $y$ is the label predicted by an MNIST classifier that is trained off-line using the entire MNIST training set. Two main changes were made to its original form: (i) we swap the order of the distribution pair; (ii) we omit the $e^{(\cdot)}$ operation. The modified score condenses the histograms in figure 2 and figure 3. It is also worth noting that although we inherit the name "inception score" from Salimans et al. (2016), the evaluation is not related to the "inception" model trained on the ImageNet dataset. The classifier is a regular 3-layer ConvNet trained on MNIST.

The generations shown in figure 4 are from the best GAN or EBGAN (obtaining the best $I$ score) from the grid search.
Their configurations are:

- figure 4(a): nLayerG=5, nLayerD=2, sizeG=1600, sizeD=1024, dropoutD=0, optimD=SGD, optimG=SGD, lr=0.01.
- figure 4(b): nLayerG=5, nLayerD=2, sizeG=800, sizeD=1024, dropoutD=0, optimD=ADAM, optimG=ADAM, lr=0.001, margin=10.
- figure 4(c): same as (b), with $\lambda_{PT} = 0.1$.

LSUN & CELEBA

We use a deep convolutional generator analogous to DCGAN's and a deep convolutional auto-encoder for the discriminator. The auto-encoder is composed of strided convolution modules in the feedforward pathway and fractional-strided convolution modules in the feedback pathway. We leave the usage of upsampling or switches-unpooling (Zhao et al., 2015) to future research. We also followed the guidance suggested by Radford et al. (2015) for training EBGANs. The configuration of the deep auto-encoder is:

- Encoder: (64)4c2s-(128)4c2s-(256)4c2s
- Decoder: (128)4c2s-(64)4c2s-(3)4c2s

where "(64)4c2s" denotes a convolution/deconvolution layer with 64 output feature maps and kernel size 4 with stride 2. The margin $m$ is set to 80 for LSUN and 20 for CelebA.

We built deeper models in both the 128 × 128 and 256 × 256 experiments, in a similar fashion to section 4.3.

128 × 128 model:

- Generator: (1024)4c-(512)4c2s-(256)4c2s-(128)4c2s-(64)4c2s-(64)4c2s-(3)3c
- Noise #planes: 100-64-32-16-8-4
- Encoder: (64)4c2s-(128)4c2s-(256)4c2s-(512)4c2s
- Decoder: (256)4c2s-(128)4c2s-(64)4c2s-(3)4c2s
- Margin: 40

256 × 256 model:

- Generator: (2048)4c-(1024)4c2s-(512)4c2s-(256)4c2s-(128)4c2s-(64)4c2s-(64)4c2s-(3)3c
- Noise #planes: 100-64-32-16-8-4-2
- Encoder: (64)4c2s-(128)4c2s-(256)4c2s-(512)4c2s
- Decoder: (256)4c2s-(128)4c2s-(64)4c2s-(3)4c2s
- Margin: 80

Note that we feed noise into every layer of the generator, where each noise component is initialized into a 4D tensor and concatenated with the current feature maps in the feature space. Such a strategy is also employed by Salimans et al. (2016).
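The layer notation used in this appendix, such as (64)4c2s, can be unpacked mechanically. The sketch below parses a specification string and traces the spatial size through the auto-encoder. The helper names, the padding of 1 and the 64-pixel input are assumptions made only for illustration; the text above does not state the padding, and the size formulas are the standard ones for strided and fractional-strided (transposed) convolutions.

```python
import re


def parse_spec(spec):
    """Parse '(64)4c2s-(128)4c2s' into (channels, kernel, stride) tuples.

    A layer without an 's' suffix, e.g. '(3)3c', is given stride 1.
    """
    layers = []
    for channels, kernel, stride in re.findall(r"\((\d+)\)(\d+)c(?:(\d+)s)?", spec):
        layers.append((int(channels), int(kernel), int(stride) if stride else 1))
    return layers


def conv_out(size, kernel, stride, padding=1):
    """Output size of a strided convolution (padding=1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding=1):
    """Output size of a fractional-strided (transposed) convolution."""
    return (size - 1) * stride - 2 * padding + kernel


def trace_sizes(spec, size, transposed=False):
    """Return the spatial size after each layer of the specification."""
    step = deconv_out if transposed else conv_out
    sizes = []
    for _, kernel, stride in parse_spec(spec):
        size = step(size, kernel, stride)
        sizes.append(size)
    return sizes


encoder = "(64)4c2s-(128)4c2s-(256)4c2s"
decoder = "(128)4c2s-(64)4c2s-(3)4c2s"
enc_sizes = trace_sizes(encoder, 64)  # hypothetical 64 x 64 input
dec_sizes = trace_sizes(decoder, enc_sizes[-1], transposed=True)
print(enc_sizes, dec_sizes)  # [32, 16, 8] [16, 32, 64]
```

Under these assumptions each 4c2s layer halves the resolution in the encoder and doubles it in the decoder, so the reconstruction has the same size as the input and the reconstruction error is well defined.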
D APPENDIX: SEMI-SUPERVISED LEARNING EXPERIMENT SETTING

BASELINE MODEL

As stated in section 4.2, we chose a bottom-layer-cost Ladder Network as our baseline model. Specifically, we use the same architecture as reported in both papers (Rasmus et al., 2015; Pezeshki et al., 2015), namely a fully-connected network of size 784-1000-500-250-250-250, with batch normalization and ReLU following each linear layer. To obtain a strong baseline, we tuned the weight of the reconstruction cost with values from the set {5000/784}, while fixing the weight on the classification cost to 1. In the meantime, we also tuned the learning rate with values {0.002, 0.001, 0.0005, 0.0002, 0.0001}. We adopted Adam as the optimizer with $\beta_1$ set to 0.5. The minibatch size was set to 100. All the experiments ran for 120,000 steps. We use the same learning rate decay mechanism as in the published papers, starting from two-thirds of the total steps (i.e., from step #80,000) to linearly decay the learning rate to 0. The result reported in section 4.2 was obtained with the best tuned setting: $\lambda_{L2} = 1000/784$, lr = 0.0002.

EBGAN-LN MODEL

We place the same Ladder Network architecture into our EBGAN framework and train this EBGAN-LN model the same way as we train the EBGAN auto-encoder model. For technical details, we started training the EBGAN-LN model from the margin value 16 and gradually decayed it to 0 within the first 60,000 steps. By then, we found that the reconstruction error of the real images had already become low and had reached the limitation of the architecture (the Ladder Network itself); besides, the generated images exhibit good quality (shown in figure 10). Thereafter we turned off training the generator but kept training the discriminator for another 120,000 steps. We set the initial learning rates to 0.0005 for the discriminator and 0.00025 for the generator. The other settings are kept consistent with the best baseline LN model.
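The two schedules described in this appendix can be sketched as follows. The learning rate schedule follows the description of the baseline (constant until step #80,000, then linear decay to 0 at step 120,000). For the margin, the text only says that it is gradually decayed from 16 to 0 within 60,000 steps, so the linear shape used here is an assumption, as are the function names.

```python
def linear_lr_decay(step, base_lr, decay_start, total_steps):
    """Constant learning rate until decay_start, then linear decay to 0."""
    if step <= decay_start:
        return base_lr
    if step >= total_steps:
        return 0.0
    return base_lr * (total_steps - step) / (total_steps - decay_start)


def margin_schedule(step, initial_margin=16.0, decay_steps=60000):
    """Margin decayed from initial_margin to 0 (linear shape is assumed)."""
    remaining = 1.0 - step / decay_steps
    return max(0.0, initial_margin * remaining)


# Baseline setting from the text: lr = 0.0002, decay from step 80,000 of 120,000.
for step in (0, 80000, 100000, 120000):
    print(step, linear_lr_decay(step, 0.0002, 80000, 120000))

# Margin at the start, half-way through and at the end of the decay window.
for step in (0, 30000, 60000):
    print(step, margin_schedule(step))
```

Passing the decay start as an explicit step count, rather than as a fraction of the total, keeps the boundary at step #80,000 exact.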
The learning rate decay started at step #120,000 (also two-thirds of the total steps).

OTHER DETAILS

- Notice that we used the 28 × 28 version (unpadded) of the MNIST dataset in the EBGAN-LN experiment. For the EBGAN auto-encoder grid search experiments, we used the zero-padded version, i.e., size 32 × 32. No noticeable difference has been found due to the zero-padding.
- We generally took the $\ell_2$ norm of the discrepancy between input and reconstruction for the loss term in the EBGAN auto-encoder model, as formally written in section 2.1. However, for the EBGAN-LN experiment, we followed the original implementation of the Ladder Network using a vanilla form of $\ell_2$ loss.
- Borrowed from Salimans et al. (2016), batch normalization is adopted without the learned parameter $\gamma$ but merely with a bias term $\beta$. It remains unknown whether this trick could affect learning in some non-negligible way, so our baseline model may not be a strict reproduction of the published models by Rasmus et al. (2015) and Pezeshki et al. (2015).

E APPENDIX: TIPS FOR SETTING A GOOD ENERGY MARGIN VALUE

It is crucial to set a proper energy margin value $m$ in the framework of EBGAN, from both theoretical and experimental perspectives. Here we provide a few tips:

- Delving into the formulation of the discriminator loss given by equation 1, we suggest a numerical balance between its two terms, which concern real and fake samples respectively. The second term is clearly bounded within $[0, m]$ (assuming the energy function $D(x)$ is non-negative). It is desirable to make the first term bounded in a similar range. In theory, the upper bound of the first term is essentially determined by (i) the capacity of $D$; (ii) the complexity of the dataset. In practice, for the EBGAN auto-encoder model, one can run $D$ (the auto-encoder) alone on the real sample dataset and monitor the loss.
When it converges, the resulting loss implies a rough limit on how well such a setting of $D$ is able to fit the dataset. This usually suggests a good starting point for a hyper-parameter search on $m$.
- An overly large $m$ results in training instability/difficulty, while an $m$ that is too small is prone to the mode-dropping problem. This property of $m$ is depicted in figure 9.
- One successful technique, as we introduced in appendix D, is to start from a large $m$ and gradually decay it to 0 as training proceeds. Unlike the feature matching semi-supervised learning technique proposed by Salimans et al. (2016), we show in figure 10 that not only does the EBGAN-LN model achieve a good semi-supervised learning performance, it also produces satisfactory generations.

Abstracting away from the practical experimental tips, the theoretical understanding of EBGAN in section 2.2 also provides some insight for setting a feasible $m$. For instance, as implied by Theorem 2, setting a large $m$ results in a broader range of $\gamma$ to which $D^*(x)$ may converge. Instability may come after an overly large $\gamma$ because it generates two strong gradients pointing in opposite directions, from the loss in equation 1, which would demand a more finicky optimization setting.

F APPENDIX: MORE GENERATION

LSUN AUGMENTED VERSION TRAINING

For the LSUN bedroom dataset, aside from the experiment on the whole images, we also train an EBGAN auto-encoder model based on dataset augmentation by cropping patches. All the patches are of size 64 × 64 and cropped from 96 × 96 original images. The generation is shown in figure 11.

COMPARISON OF EBGANS AND EBGAN-PTS

To further demonstrate how the pull-away term (PT) may influence EBGAN auto-encoder model training, we chose both the whole-image and augmented-patch versions of the LSUN bedroom dataset, together with the CelebA dataset, for some further experimentation. The comparisons of EBGAN and EBGAN-PT generation are shown in figure 12, figure 13 and figure 14.
Note that all comparison pairs adopt identical architectural and hyper-parameter settings as in section 4.3. The cost weight on the PT is set to 0.1.

Figure 9: Generation from the EBGAN auto-encoder model trained with different m settings. From top to bottom, m is set to 1, 2, 4, 6, 8, 12, 16, 32 respectively. The remaining settings are nLayerG=5, nLayerD=2, sizeG=1600, sizeD=1024, dropoutD=0, optimD=ADAM, optimG=ADAM, lr=0.001.

Figure 10: Generation from the EBGAN-LN model. The displayed generations are obtained by an identical experimental setting described in appendix D, with different random seeds. As we mentioned before, we used the unpadded version of the MNIST dataset (size 28 × 28) in the EBGAN-LN experiments.

Figure 11: Generation from augmented-patch version of the LSUN bedroom dataset. Left(a): DCGAN generation. Right(b): EBGAN-PT generation.

Figure 12: Generation from whole-image version of the LSUN bedroom dataset. Left(a): EBGAN. Right(b): EBGAN-PT.

Figure 13: Generation from augmented-patch version of the LSUN bedroom dataset. Left(a): EBGAN. Right(b): EBGAN-PT.

Figure 14: Generation from the CelebA dataset. Left(a): EBGAN. Right(b): EBGAN-PT.