# The Numerics of GANs

Lars Mescheder (Autonomous Vision Group, MPI Tübingen), lars.mescheder@tuebingen.mpg.de
Sebastian Nowozin (Machine Intelligence and Perception Group, Microsoft Research), sebastian.nowozin@microsoft.com
Andreas Geiger (Autonomous Vision Group, MPI Tübingen), andreas.geiger@tuebingen.mpg.de

## Abstract

In this paper, we analyze the numerics of common algorithms for training Generative Adversarial Networks (GANs). Using the formalism of smooth two-player games, we analyze the associated gradient vector field of GAN training objectives. Our findings suggest that the convergence of current algorithms suffers due to two factors: (i) the presence of eigenvalues of the Jacobian of the gradient vector field with zero real part, and (ii) eigenvalues with a large imaginary part. Using these findings, we design a new algorithm that overcomes some of these limitations and has better convergence properties. Experimentally, we demonstrate its superiority on training common GAN architectures and show convergence on GAN architectures that are known to be notoriously hard to train.

## 1 Introduction

Generative Adversarial Networks (GANs) [10] have been very successful in learning probability distributions. Since their first appearance, GANs have been successfully applied to a variety of tasks, including image-to-image translation [12], image super-resolution [13], image inpainting [27], domain adaptation [26], probabilistic inference [14, 9, 8] and many more.

While very powerful, GANs are known to be notoriously hard to train. The standard strategy for stabilizing training is to carefully design the model, either by adapting the architecture [21] or by selecting an easy-to-optimize objective function [23, 4, 11].

In this work, we examine the general problem of finding local Nash-equilibria of smooth games. We revisit the de-facto standard algorithm for finding such equilibrium points, simultaneous gradient ascent. We theoretically show that the main factors preventing the algorithm from converging are the presence of eigenvalues of the Jacobian of the associated gradient vector field with zero real part and eigenvalues with a large imaginary part. The presence of the latter is also one of the reasons that make saddle-point problems more difficult than local optimization problems. Utilizing these insights, we design a new algorithm that overcomes some of these problems. Experimentally, we show that our algorithm leads to stable training on many GAN architectures, including some that are known to be hard to train.

Our technique is orthogonal to strategies that try to make the GAN-game well-defined, e.g. by adding instance noise [24] or by using the Wasserstein-divergence [4, 11]: while these strategies try to ensure the existence of Nash-equilibria, our paper deals with their computation and the numerical difficulties that can arise in practice.

In summary, our contributions are as follows:

- We identify the main reasons why simultaneous gradient ascent often fails to find local Nash-equilibria.
- By utilizing these insights, we design a new, more robust algorithm for finding Nash-equilibria of smooth two-player games.
- We empirically demonstrate that our method enables stable training of GANs on a variety of architectures and divergence measures.
The proofs for the theorems in this paper can be found in the supplementary material.¹

¹The code for all experiments in this paper is available at https://github.com/LMescheder/TheNumericsOfGANs.

## 2 Background

In this section we first revisit the concept of Generative Adversarial Networks (GANs) from a divergence minimization point of view. We then introduce the concept of a smooth (non-convex) two-player game and define the terminology used in the rest of the paper. Finally, we describe simultaneous gradient ascent, the de-facto standard algorithm for finding Nash-equilibria of such games, and derive some of its properties.

### 2.1 Divergence Measures and GANs

Generative Adversarial Networks are best understood in the context of divergence minimization: assume we are given a divergence function $D$, i.e. a function that takes a pair of probability distributions as input, outputs an element from $[0, \infty]$ and satisfies $D(p, p) = 0$ for all probability distributions $p$. Moreover, assume we are given some target distribution $p_0$ from which we can draw i.i.d. samples and a parametric family of distributions $q_\theta$ that also allows us to draw i.i.d. samples. In practice, $q_\theta$ is usually implemented as a neural network that acts on a hidden code $z$ sampled from some known distribution and outputs an element from the target space. Our goal is to find $\theta$ that minimizes the divergence $D(p_0, q_\theta)$, i.e. we want to solve the optimization problem

$$\min_\theta D(p_0, q_\theta). \tag{1}$$

Most divergences that are used in practice can be represented in the following form [10, 16, 4]:

$$D(p, q) = \max_{f \in \mathcal{F}} \; \mathbb{E}_{x \sim q}\left[g_1(f(x))\right] - \mathbb{E}_{x \sim p}\left[g_2(f(x))\right] \tag{2}$$

for some function class $\mathcal{F}$ of functions $f : \mathcal{X} \to \mathbb{R}$ and convex functions $g_1, g_2 : \mathbb{R} \to \mathbb{R}$. Together with (1), this leads to mini-max problems of the form

$$\min_\theta \max_{f \in \mathcal{F}} \; \mathbb{E}_{x \sim q_\theta}\left[g_1(f(x))\right] - \mathbb{E}_{x \sim p_0}\left[g_2(f(x))\right]. \tag{3}$$

These divergences include the Jensen-Shannon divergence [10], all f-divergences [16], the Wasserstein divergence [4] and even the indicator divergence, which is 0 if $p = q$ and $\infty$ otherwise.

In practice, the function class $\mathcal{F}$ in (3) is approximated with a parametric family of functions, e.g. parameterized by a neural network. Of course, when minimizing the divergence w.r.t. this approximated family, we no longer minimize the correct divergence. However, it can be verified that taking any class of functions in (3) leads to a divergence function for appropriate choices of $g_1$ and $g_2$. Therefore, some authors call these divergence functions neural network divergences [5].

### 2.2 Smooth Two-Player Games

A differentiable two-player game is defined by two utility functions $f(\phi, \theta)$ and $g(\phi, \theta)$ defined over a common space $(\phi, \theta) \in \Omega_1 \times \Omega_2$. $\Omega_1$ corresponds to the possible actions of player 1, $\Omega_2$ corresponds to the possible actions of player 2. The goal of player 1 is to maximize $f$, whereas player 2 tries to maximize $g$. In the context of GANs, $\Omega_1$ is the set of possible parameter values for the generator, whereas $\Omega_2$ is the set of possible parameter values for the discriminator. We call a game a zero-sum game if $f = -g$. Note that the derivation of the GAN-game in Section 2.1 leads to a zero-sum game, whereas in practice people usually employ a variant of this formulation that is not a zero-sum game, for better convergence [10].

Our goal is to find a Nash-equilibrium of the game, i.e. a point $\bar x = (\bar\phi, \bar\theta)$ given by the two conditions

$$\bar\phi \in \operatorname*{argmax}_\phi f(\phi, \bar\theta) \quad\text{and}\quad \bar\theta \in \operatorname*{argmax}_\theta g(\bar\phi, \theta). \tag{4}$$

**Algorithm 1** Simultaneous Gradient Ascent (SimGA)

```
1: while not converged do
2:   v_phi   <- grad_phi f(theta, phi)
3:   v_theta <- grad_theta g(theta, phi)
4:   phi     <- phi + h * v_phi
5:   theta   <- theta + h * v_theta
6: end while
```
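To make Algorithm 1 concrete, here is a minimal runnable sketch (our illustration, not from the paper) on the bilinear zero-sum toy game $f(\phi, \theta) = \phi\,\theta$, $g = -f$, whose unique Nash-equilibrium is the origin:

```python
import numpy as np

# Toy zero-sum game: f(phi, theta) = phi * theta, g = -f.
# The unique Nash-equilibrium is (phi, theta) = (0, 0).
def grad_f(phi, theta):
    return theta           # d/dphi of phi * theta

def grad_g(phi, theta):
    return -phi            # d/dtheta of -(phi * theta)

phi, theta, h = 1.0, 1.0, 0.01
for _ in range(1000):
    # Simultaneous (not alternating) updates, as in Algorithm 1.
    v_phi, v_theta = grad_f(phi, theta), grad_g(phi, theta)
    phi, theta = phi + h * v_phi, theta + h * v_theta

print(phi**2 + theta**2)   # > 2: the iterates slowly spiral away from (0, 0)
```

For this game the iterates rotate around the equilibrium and, for any fixed step size $h > 0$, slowly spiral outward; Section 3 explains why.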
We call a point $(\bar\phi, \bar\theta)$ a local Nash-equilibrium if (4) holds in a local neighborhood of $(\bar\phi, \bar\theta)$.

Every differentiable two-player game defines a vector field

$$v(\phi, \theta) = \begin{pmatrix} \nabla_\phi f(\phi, \theta) \\ \nabla_\theta g(\phi, \theta) \end{pmatrix}. \tag{5}$$

We call $v$ the associated gradient vector field of the game defined by $f$ and $g$. For the special case of zero-sum two-player games, we have $g = -f$ and thus

$$v'(\phi, \theta) = \begin{pmatrix} \nabla^2_\phi f(\phi, \theta) & \nabla_{\phi,\theta} f(\phi, \theta) \\ -\nabla_{\phi,\theta} f(\phi, \theta)^{\mathsf T} & -\nabla^2_\theta f(\phi, \theta) \end{pmatrix}. \tag{6}$$

As a direct consequence, we have the following:

**Lemma 1.** For zero-sum games, $v'(x)$ is negative (semi-)definite if and only if $\nabla^2_\phi f(\phi, \theta)$ is negative (semi-)definite and $\nabla^2_\theta f(\phi, \theta)$ is positive (semi-)definite.

**Corollary 2.** For zero-sum games, $v'(\bar x)$ is negative semi-definite for any local Nash-equilibrium $\bar x$. Conversely, if $\bar x$ is a stationary point of $v(x)$ and $v'(\bar x)$ is negative definite, then $\bar x$ is a local Nash-equilibrium.

Note that Corollary 2 is not true for general two-player games.

### 2.3 Simultaneous Gradient Ascent

The de-facto standard algorithm for finding Nash-equilibria of general smooth two-player games is Simultaneous Gradient Ascent (SimGA), which was described in several works, for example in [22] and, more recently, in the context of GANs, in [16]. The idea is simple and is illustrated in Algorithm 1: we iteratively update the parameters of the two players by simultaneously applying gradient ascent to the utility functions of the two players. This can also be understood as applying the Euler method to the ordinary differential equation

$$\frac{\mathrm{d}}{\mathrm{d}t}\, x(t) = v(x(t)), \tag{7}$$

where $v(x)$ is the associated gradient vector field of the two-player game.

It can be shown that simultaneous gradient ascent converges locally to a Nash-equilibrium for a zero-sum game if the Hessian of both players is negative definite [16, 22] and the learning rate is small enough. Unfortunately, in the context of GANs the former condition is rarely met. We revisit the properties of simultaneous gradient ascent in Section 3 and also show a more subtle property, namely that even if the conditions for the convergence of simultaneous gradient ascent are met, it might require extremely small step sizes for convergence if the Jacobian of the associated gradient vector field has eigenvalues with large imaginary part.

Figure 1: How the eigenvalues of $A$ are projected into the unit circle and what causes problems: when discretizing the gradient flow with step size $h$, the eigenvalues of the Jacobian at a fixed point are projected into the unit ball along rays from 1. However, this is only possible if the eigenvalues lie in the left half-plane, and it requires extremely small step sizes $h$ if the eigenvalues are close to the imaginary axis. The proposed method moves the eigenvalues to the left in order to make the problem better posed, thus allowing the algorithm to converge for reasonable step sizes. (a) Illustration of how the eigenvalues are projected into the unit ball. (b) Example where $h$ has to be chosen extremely small. (c) Illustration of how our method alleviates the problem.

## 3 Convergence Theory

In this section, we analyze the convergence properties of the most common method for training GANs, simultaneous gradient ascent.² We show that two major failure causes for this algorithm are eigenvalues of the Jacobian of the associated gradient vector field with zero real part as well as eigenvalues with large imaginary part.

²A similar analysis of alternating gradient ascent, a popular alternative to simultaneous gradient ascent, can be found in the supplementary material.
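Before the formal analysis, note that the toy game from Section 2.2 already exhibits the first failure mode in its purest form: its Jacobian is a constant antisymmetric matrix with purely imaginary eigenvalues, so the discretized iteration leaves the unit ball for every step size. A small numpy check (our illustration, continuing the toy example):

```python
import numpy as np

# Jacobian of v(phi, theta) = (theta, -phi) for the game f(phi, theta) = phi * theta.
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

print(np.linalg.eigvals(A))   # [0.+1.j, 0.-1.j]: purely imaginary eigenvalues

# The eigenvalues of I + h * A have modulus sqrt(1 + h^2) > 1, so simultaneous
# gradient ascent cannot converge here for any step size h > 0.
for h in (0.5, 0.1, 0.001):
    print(h, np.abs(np.linalg.eigvals(np.eye(2) + h * A)))
```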
For our theoretical analysis, we start with the following classical theorem about the convergence of fixed-point iterations:

**Proposition 3.** Let $F : \Omega \to \Omega$ be a continuously differentiable function on an open subset $\Omega$ of $\mathbb{R}^n$ and let $\bar x \in \Omega$ be so that

1. $F(\bar x) = \bar x$, and
2. the absolute values of the eigenvalues of the Jacobian $F'(\bar x)$ are all smaller than 1.

Then there is an open neighborhood $U$ of $\bar x$ so that for all $x_0 \in U$, the iterates $F^{(k)}(x_0)$ converge to $\bar x$. The rate of convergence is at least linear. More precisely, the error $\|F^{(k)}(x_0) - \bar x\|$ is in $\mathcal{O}(|\lambda_{\max}|^k)$ for $k \to \infty$, where $\lambda_{\max}$ is the eigenvalue of $F'(\bar x)$ with the largest absolute value.

*Proof.* See [6], Proposition 4.4.1.

In numerics, we often consider functions of the form

$$F(x) = x + h\,G(x) \tag{8}$$

for some $h > 0$. Finding fixed points of $F$ is then equivalent to finding solutions of the nonlinear equation $G(x) = 0$. For $F$ as in (8), the Jacobian is given by

$$F'(x) = I + h\,G'(x). \tag{9}$$

Note that in general neither $F'(x)$ nor $G'(x)$ is symmetric; they can therefore have complex eigenvalues. The following lemma gives an easy condition under which a fixed point of $F$ as in (8) satisfies the conditions of Proposition 3.

**Lemma 4.** Assume that $A \in \mathbb{R}^{n \times n}$ only has eigenvalues with negative real part and let $h > 0$. Then the eigenvalues of the matrix $I + h\,A$ lie in the unit ball if and only if

$$h < \frac{1}{|\Re(\lambda)|}\,\frac{2}{1 + \left(\frac{\Im(\lambda)}{\Re(\lambda)}\right)^2} \tag{10}$$

for all eigenvalues $\lambda$ of $A$.

**Corollary 5.** If $v'(\bar x)$ only has eigenvalues with negative real part at a stationary point $\bar x$, then Algorithm 1 is locally convergent to $\bar x$ for $h > 0$ small enough.

Equation (10) shows that there are two major factors that determine the maximum possible step size $h$: (i) the maximum value of $|\Re(\lambda)|$ and (ii) the maximum value $q$ of $|\Im(\lambda)/\Re(\lambda)|$. Note that as $q$ goes to infinity, we have to choose $h$ in $\mathcal{O}(q^{-2})$, which can quickly become extremely small. This is visualized in Figure 1: if $G'(\bar x)$ has an eigenvalue with small absolute real part but large imaginary part, $h$ needs to be chosen extremely small to still achieve convergence. Moreover, even if we make $h$ small enough, most eigenvalues of $F'(\bar x)$ will be very close to 1, which by Proposition 3 leads to very slow convergence of the algorithm. This is in particular a problem for simultaneous gradient ascent on two-player games (in contrast to gradient ascent for local optimization), where the Jacobian $G'(\bar x)$ is not symmetric and can therefore have non-real eigenvalues.

## 4 Consensus Optimization

In this section, we derive the proposed method and analyze its convergence properties.

### 4.1 Derivation

Finding stationary points of the vector field $v(x)$ is equivalent to solving the equation $v(x) = 0$. In the context of two-player games this means solving the two equations

$$\nabla_\phi f(\phi, \theta) = 0 \quad\text{and}\quad \nabla_\theta g(\phi, \theta) = 0. \tag{11}$$

A simple strategy for finding such stationary points is to minimize $L(x) = \tfrac{1}{2}\|v(x)\|^2$. Unfortunately, this can result in unstable stationary points of $v$ or other local minima of $\tfrac{1}{2}\|v(x)\|^2$, and in practice we found that it did not work well. We therefore consider a modified vector field $w(x)$ that is as close as possible to the original vector field $v(x)$, but at the same time still minimizes $L(x)$ (at least locally). A sensible candidate for such a vector field is

$$w(x) = v(x) - \gamma \nabla L(x) \tag{12}$$

for some $\gamma > 0$. A simple calculation shows that the gradient $\nabla L(x)$ is given by

$$\nabla L(x) = v'(x)^{\mathsf T}\, v(x). \tag{13}$$

This vector field is the gradient vector field associated to the modified two-player game given by the two modified utility functions

$$\tilde f(\phi, \theta) = f(\phi, \theta) - \gamma L(\phi, \theta) \quad\text{and}\quad \tilde g(\phi, \theta) = g(\phi, \theta) - \gamma L(\phi, \theta). \tag{14}$$
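On the bilinear toy game from Section 2.2, $v'(x)^{\mathsf T} v(x) = (\phi, \theta)$, so the consensus field $w$ adds a pull toward the equilibrium. A minimal numpy sketch (ours, continuing the running example):

```python
import numpy as np

def v(x):
    """Gradient vector field of the toy game f(phi, theta) = phi * theta."""
    phi, theta = x
    return np.array([theta, -phi])

def v_jac(x):
    """Jacobian v'(x); constant for this bilinear game."""
    return np.array([[0.0, 1.0],
                     [-1.0, 0.0]])

x, h, gamma = np.array([1.0, 1.0]), 0.01, 1.0
for _ in range(2000):
    # Consensus update direction: w(x) = v(x) - gamma * v'(x)^T v(x), cf. (12), (13).
    w = v(x) - gamma * v_jac(x).T @ v(x)
    x = x + h * w

print(np.linalg.norm(x))   # close to 0 (about 3e-9): the iterates converge to (0, 0)
```

Where simultaneous gradient ascent spiraled outward, the same step size now contracts: the Jacobian of $w$ has eigenvalues $-\gamma \pm i$, which lie strictly in the left half-plane.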
The regularizer $L(\phi, \theta)$ encourages agreement between the two players. We therefore call the resulting algorithm Consensus Optimization (Algorithm 2).³ ⁴

³This algorithm requires backpropagation through the squared norm of the gradient with respect to the weights of the network. This is sometimes called double backpropagation and is, for example, supported by the deep learning frameworks TensorFlow [1] and PyTorch [19].

⁴As was pointed out by Ferenc Huszár in one of his blog posts on www.inference.vc, naively implementing this algorithm in a mini-batch setting leads to biased estimates of $L(x)$. However, the bias goes down linearly with the batch size, which justifies the usage of consensus optimization in a mini-batch setting. Alternatively, it is possible to debias the estimate by subtracting a multiple of the sample variance of the gradients; see the supplementary material for details.

**Algorithm 2** Consensus Optimization

```
1: while not converged do
2:   v_phi   <- grad_phi (f(theta, phi) - gamma * L(theta, phi))
3:   v_theta <- grad_theta (g(theta, phi) - gamma * L(theta, phi))
4:   phi     <- phi + h * v_phi
5:   theta   <- theta + h * v_theta
6: end while
```

### 4.2 Convergence

For analyzing convergence, we consider a more general algorithm than in Section 4.1, given by iteratively applying a function $F$ of the form

$$F(x) = x + h\,A(x)\,v(x) \tag{15}$$

for some step size $h > 0$ and an invertible matrix $A(x)$. Consensus optimization is a special case of this algorithm for $A(x) = I - \gamma\,v'(x)^{\mathsf T}$. We assume that $\tfrac{1}{\gamma}$ is not an eigenvalue of $v'(x)^{\mathsf T}$ for any $x$, so that $A(x)$ is indeed invertible.

**Lemma 6.** Assume $h > 0$ and $A(x)$ invertible for all $x$. Then $\bar x$ is a fixed point of (15) if and only if it is a stationary point of $v$. Moreover, if $\bar x$ is a stationary point of $v$, we have

$$F'(\bar x) = I + h\,A(\bar x)\,v'(\bar x). \tag{16}$$

**Lemma 7.** Let $A(x) = I - \gamma\,v'(x)^{\mathsf T}$ and assume that $v'(\bar x)$ is negative semi-definite and invertible.⁵ Then $A(\bar x)\,v'(\bar x)$ is negative definite.

⁵Note that $v'(\bar x)$ is usually not symmetric, and therefore it is possible that $v'(\bar x)$ is negative semi-definite and invertible but not negative definite.

As a consequence of Lemma 6 and Lemma 7, we can show local convergence of our algorithm to a local Nash-equilibrium:

**Corollary 8.** Let $v(x)$ be the associated gradient vector field of a two-player zero-sum game and $A(x) = I - \gamma\,v'(x)^{\mathsf T}$. If $\bar x$ is a local Nash-equilibrium, then there is an open neighborhood $U$ of $\bar x$ so that for all $x_0 \in U$, the iterates $F^{(k)}(x_0)$ converge to $\bar x$ for $h > 0$ small enough.

Our method solves the problem of eigenvalues of the Jacobian with (approximately) zero real part. As the next lemma shows, it also alleviates the problem of eigenvalues with a large imaginary-to-real-part quotient:

**Lemma 9.** Assume that $A \in \mathbb{R}^{n \times n}$ is negative semi-definite. Let $q(\gamma)$ be the maximum of $\tfrac{|\Im(\lambda)|}{|\Re(\lambda)|}$ (possibly infinite) over the eigenvalues $\lambda$ of $A - \gamma A^{\mathsf T} A$, where $\Re(\lambda)$ and $\Im(\lambda)$ denote the real and imaginary part of $\lambda$ respectively. Moreover, assume that $A$ is invertible with $|A v| \ge \rho |v|$ for some $\rho > 0$, and let

$$c = \min_{v \in S(\mathbb{C}^n)} \frac{\left|\bar v^{\mathsf T} (A + A^{\mathsf T})\, v\right|}{\left|\bar v^{\mathsf T} (A - A^{\mathsf T})\, v\right|}, \tag{17}$$

where $S(\mathbb{C}^n)$ denotes the unit sphere in $\mathbb{C}^n$. Then

$$q(\gamma) \le \frac{1}{c + 2\rho^2 \gamma}. \tag{18}$$

Lemma 9 shows that the imaginary-to-real-part quotient can be made arbitrarily small for an appropriate choice of $\gamma$. According to Proposition 3, this leads to better convergence properties near a local Nash-equilibrium.
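To illustrate the double backpropagation mentioned in footnote 3, here is a minimal PyTorch sketch of one consensus-optimization step. This is our illustration, not the authors' implementation: the function name `consensus_grads` and the arguments `gen_loss`/`disc_loss` (the negated utilities of the two players) and the parameter lists are hypothetical placeholders.

```python
import torch

def consensus_grads(gen_loss, disc_loss, gen_params, disc_params, gamma=10.0):
    """Descent directions for one consensus-optimization step (Algorithm 2).

    Assumes each loss depends on the parameters of both players, as is the
    case for the usual GAN objectives; gen_params and disc_params are lists.
    """
    params = gen_params + disc_params
    # v(x): gradients of each loss w.r.t. the respective player's parameters.
    # create_graph=True keeps the graph so L(x) can be differentiated again.
    v = list(torch.autograd.grad(gen_loss, gen_params, create_graph=True)) \
      + list(torch.autograd.grad(disc_loss, disc_params, create_graph=True))
    # Consensus regularizer L(x) = 0.5 * ||v(x)||^2.
    L = 0.5 * sum(g.pow(2).sum() for g in v)
    # Second backprop: gradient of L w.r.t. all parameters.
    reg = torch.autograd.grad(L, params)
    # Gradient of (loss + gamma * L) for each parameter, cf. Algorithm 2.
    return [g + gamma * r for g, r in zip(v, reg)]
```

Each returned direction $d$ would then be applied as $p \leftarrow p - h\,d$, or fed to an optimizer such as RMSProp as in Section 5.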
## 5 Experiments

**Mixture of Gaussians.** In our first experiment we evaluate our method on a simple 2D example where the goal is to learn a mixture of 8 Gaussians with standard deviations equal to $10^{-2}$ and modes uniformly distributed around the unit circle. While simplistic, algorithms for training GANs often fail to converge even on such simple examples without extensive fine-tuning of the architecture and hyperparameters [15].

For both the generator and critic we use fully connected neural networks with 4 hidden layers and 16 hidden units in each layer. For all layers, we use ReLU nonlinearities. We use a 16-dimensional Gaussian prior for the latent code $z$ and set up the game between the generator and critic using the utility functions as in [10]. To test our method, we run both SimGA and our method with RMSProp and a learning rate of $10^{-4}$ for 20000 steps. For our method, we use a regularization parameter of $\gamma = 10$.

The results produced by SimGA and our method after 0, 5000, 10000 and 20000 iterations are depicted in Figure 2. We see that while SimGA jumps around the modes of the distribution and fails to converge, our method converges smoothly to the target distribution (shown in red). Figure 3 shows the empirical distribution of the eigenvalues of the Jacobian of $v(x)$ and of the regularized vector field $w(x)$. It can be seen that near the Nash-equilibrium most eigenvalues are indeed very close to the imaginary axis and that the proposed modification of the vector field used in consensus optimization moves the eigenvalues to the left.

Figure 2: Comparison of (a) Simultaneous Gradient Ascent and (b) Consensus Optimization on a circular mixture of Gaussians. The images depict, from left to right, the resulting densities of the algorithm after 0, 5000, 10000 and 20000 iterations as well as the target density (in red).

Figure 3: Empirical distribution of eigenvalues before and after training using consensus optimization. The first column shows the distribution of the eigenvalues of the Jacobian $v'(x)$ of the unmodified vector field $v(x)$. The second column shows the eigenvalues of the Jacobian $w'(x)$ of the regularized vector field $w(x) = v(x) - \gamma \nabla L(x)$ used in consensus optimization. We see that $v'(x)$ has eigenvalues close to the imaginary axis near the Nash-equilibrium. As predicted theoretically, this is not the case for the regularized vector field $w(x)$. For visualization purposes, the real part of the spectrum of $w'(x)$ before training was clipped.

Figure 4: Samples generated from a model where both the generator and discriminator are given as in [21], but without batch normalization, for (a) cifar-10 and (b) celebA. For celebA, we also use a constant number of filters in each layer and add additional ResNet layers.

Figure 5: (a) and (b): Comparison of the generator and discriminator loss on a DC-GAN architecture with 3 convolutional layers trained on cifar-10 for consensus optimization (without batch normalization) and alternating gradient ascent (with batch normalization). We observe that while alternating gradient ascent leads to highly fluctuating losses, consensus optimization successfully stabilizes the training and makes the losses almost constant during training. (c): Comparison of the inception score over time, computed using 6400 samples. We see that on this architecture both methods have comparable rates of convergence and consensus optimization achieves slightly better end results.
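For concreteness, a minimal PyTorch sketch of the networks described above, reconstructed from the stated hyperparameters (this is our reading of the setup, not the authors' code; the helper `mlp` is our own):

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, width=16, depth=4):
    """Fully connected network with `depth` hidden ReLU layers of size `width`."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    return nn.Sequential(*layers, nn.Linear(d, out_dim))

generator = mlp(16, 2)   # maps the 16-dim Gaussian code z to a 2D sample
critic = mlp(2, 1)       # scalar critic used in the GAN utilities of [10]

z = torch.randn(64, 16)  # latent prior
fake = generator(z)      # a batch of samples from q_theta
# Training would apply Algorithm 2 with RMSProp, learning rate 1e-4, gamma = 10.
```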
**CIFAR-10 and celebA.** In our second experiment, we apply our method to the cifar-10 and celebA datasets, using a DC-GAN-like architecture [21] without batch normalization in the generator or the discriminator. For celebA, we additionally use a constant number of filters in each layer and add additional ResNet layers. These architectures are known to be hard to optimize using simultaneous (or alternating) gradient ascent [21, 4].

Figures 4a and 4b depict samples from the models trained with our method. We see that our method successfully trains the models, and we also observe that, unlike with alternating gradient ascent, the generator and discriminator losses remain almost constant during training; this is illustrated in Figure 5. For a quantitative evaluation, we also measured the inception score [23] over time (Figure 5c), showing that our method compares favorably to a DC-GAN trained with alternating gradient ascent. The improvement of consensus optimization over alternating gradient ascent is even more significant if we use 4 instead of 3 convolutional layers; see Figure 11 in the supplementary material for details. Additional experimental results can be found in the supplementary material.

## 6 Discussion

While we could prove local convergence of our method in Section 4, we believe that even more insight can be gained by examining global convergence properties. In particular, our analysis from Section 4 cannot explain why the generator and discriminator losses remain almost constant during training.

Our theoretical results assume the existence of a Nash-equilibrium. When we are trying to minimize an f-divergence and the dimensionality of the generator distribution is misspecified, such an equilibrium might not exist [3]. Nonetheless, we found that our method works well in practice, and we leave a closer theoretical investigation of this fact to future research.

In practice, our method can potentially make formerly unstable stationary points of the gradient vector field stable if the regularization parameter is chosen to be high, which may lead to poor solutions. We also found that our method becomes less stable for deeper architectures, which we attribute to the fact that the gradients can have very different scales in such architectures, so that the simple $L_2$-penalty from Section 4 needs to be rescaled accordingly.

Our method can be regarded as an approximation to the implicit Euler method for integrating the gradient vector field. It can be shown that the implicit Euler method has appealing stability properties [7] that can be translated into convergence theorems for local Nash-equilibria. However, the implicit Euler method requires the solution of a nonlinear equation in each iteration. Nonetheless, we believe that further progress can be made by finding better approximations to the implicit Euler method. An alternative interpretation is to view our method as a second-order method. We hence believe that further progress can be made by revisiting second-order optimization methods [2, 18] in the context of saddle point problems.
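To sketch the implicit-Euler connection (our derivation, not a claim of the paper): the implicit Euler step solves $x_{k+1} = x_k + h\,v(x_{k+1})$ for $x_{k+1}$; linearizing $v$ around $x_k$ gives

$$x_{k+1} \approx x_k + h\,\bigl(I - h\,v'(x_k)\bigr)^{-1} v(x_k) \approx x_k + h\,v(x_k) + h^2\,v'(x_k)\,v(x_k),$$

where the last step expands the inverse to first order in $h$. Consensus optimization instead takes the explicit step $x_k + h\,v(x_k) - h\gamma\,v'(x_k)^{\mathsf T} v(x_k)$. For the antisymmetric part of $v'(x_k)$, which is what drives the rotational dynamics, transposition flips the sign, so the consensus correction $-\gamma\,v'(x_k)^{\mathsf T} v(x_k)$ points in the same direction as the implicit correction $h\,v'(x_k)\,v(x_k)$, while avoiding the solution of a nonlinear system in each iteration.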
## 7 Related Work

Saddle point problems do not only arise in the context of training GANs; for example, the popular actor-critic models [20] in reinforcement learning are also special cases of saddle point problems. Finding a stable algorithm for training GANs is a long-standing problem and multiple solutions have been proposed. Unrolled GANs [15] unroll the optimization with respect to the critic, thereby giving the generator more informative gradients. Though unrolling the optimization was shown to stabilize training, it can be cumbersome to implement and it also results in a larger model. As was recently shown, the stability of GAN training can be improved by using objectives derived from the Wasserstein-1 distance (induced by the Kantorovich-Rubinstein norm) instead of f-divergences [4, 11]. While Wasserstein-GANs often provide a good solution for the stable training of GANs, they require keeping the critic optimal, which can be time-consuming and can in practice only be achieved approximately, thus violating the conditions for theoretical guarantees. Moreover, some methods like Adversarial Variational Bayes [14] explicitly prescribe the divergence measure to be used, making it impossible to apply Wasserstein-GANs. Other approaches that try to stabilize training design an easy-to-optimize architecture [23, 21] or make use of additional labels [23, 17]. In contrast to all the approaches described above, our work focuses on stabilizing training on a wide range of architectures and divergence functions.

## 8 Conclusion

In this work, starting from GAN objective functions, we analyzed the general difficulties of finding local Nash-equilibria in smooth two-player games. We pinpointed the major numerical difficulties that arise in current state-of-the-art algorithms and, using our insights, we presented a new algorithm for training generative adversarial networks. Our novel algorithm has favorable properties in theory and practice: from the theoretical viewpoint, we showed that it is locally convergent to a Nash-equilibrium even if the eigenvalues of the Jacobian are problematic. This is particularly interesting for games that arise in the context of GANs, where such problems are common. From the practical viewpoint, our algorithm can be used in combination with any GAN architecture whose objective can be formulated as a two-player game to stabilize the training. We demonstrated experimentally that our algorithm stabilizes the training and successfully combats training issues like mode collapse. We believe our work is a first step towards an understanding of the numerics of GAN training and more general deep learning objective functions.

## Acknowledgements

This work was supported by Microsoft Research through its PhD Scholarship Programme.

## References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016.

[2] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[3] Martín Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. CoRR, abs/1701.04862, 2017.

[4] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. CoRR, abs/1701.07875, 2017.

[5] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 224–232, 2017.

[6] Dimitri P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic Press, 2014.

[7] John Charles Butcher. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, 2016.

[8] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. CoRR, abs/1605.09782, 2016.
[9] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martín Arjovsky, Olivier Mastropietro, and Aaron C. Courville. Adversarially learned inference. CoRR, abs/1606.00704, 2016.

[10] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672–2680, 2014.

[11] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. CoRR, abs/1704.00028, 2017.

[12] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. CoRR, abs/1611.07004, 2016.

[13] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. CoRR, abs/1609.04802, 2016.

[14] Lars M. Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 2391–2400, 2017.

[15] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. CoRR, abs/1611.02163, 2016.

[16] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 271–279, 2016.

[17] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 2642–2651, 2017.

[18] Razvan Pascanu and Yoshua Bengio. Natural gradient revisited. CoRR, abs/1301.3584, 2013.

[19] Adam Paszke and Soumith Chintala. PyTorch, 2017.

[20] David Pfau and Oriol Vinyals. Connecting generative adversarial networks and actor-critic methods. CoRR, abs/1610.01945, 2016.

[21] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.

[22] Lillian J. Ratliff, Samuel Burden, and S. Shankar Sastry. Characterization and computation of local Nash equilibria in continuous games. In 51st Annual Allerton Conference on Communication, Control, and Computing, Allerton 2013, Allerton Park & Retreat Center, Monticello, IL, USA, October 2-4, 2013, pages 917–924, 2013.

[23] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2226–2234, 2016.

[24] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. CoRR, abs/1610.04490, 2016.

[25] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude, 2012.
[26] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. CoRR, abs/1702.05464, 2017.

[27] Raymond Yeh, Chen Chen, Teck-Yian Lim, Mark Hasegawa-Johnson, and Minh N. Do. Semantic image inpainting with perceptual and contextual losses. CoRR, abs/1607.07539, 2016.