A mean-field analysis of two-player zero-sum games

Carles Domingo-Enrich (Courant Institute of Mathematical Sciences, New York University, New York, NY; cd2754@nyu.edu), Samy Jelassi (Princeton University, Princeton, NJ; sjelassi@princeton.edu), Arthur Mensch (École Normale Supérieure, Paris, France; arthur.mensch@m4x.org), Grant Rotskoff (Courant Institute of Mathematical Sciences, New York University, New York, NY; rotskoff@cims.nyu.edu), Joan Bruna (Courant Institute of Mathematical Sciences & Center for Data Science, New York University, New York, NY; bruna@cims.nyu.edu)

Abstract

Finding Nash equilibria in two-player zero-sum continuous games is a central problem in machine learning, e.g. for training both GANs and robust models. The existence of pure Nash equilibria requires strong conditions which are not typically met in practice. Mixed Nash equilibria exist in greater generality and may be found using mirror descent, yet this approach does not scale to high dimensions. To address this limitation, we parametrize mixed strategies as mixtures of particles, whose positions and weights are updated using gradient descent-ascent. We study this dynamics as an interacting gradient flow over measure spaces endowed with the Wasserstein-Fisher-Rao metric. We establish global convergence to an approximate equilibrium for the related Langevin gradient descent-ascent dynamics. We prove a law of large numbers that relates particle dynamics to mean-field dynamics. Our method identifies mixed equilibria in high dimensions and is demonstrably effective for training mixtures of GANs.

1 Introduction

Multi-objective optimization problems arise in many fields, from economics to civil engineering.
Tasks that require optimizing multiple objectives have also become a routine part of many agent-based machine learning algorithms, including generative adversarial networks (Goodfellow et al., 2014), imaginative agents (Racanière et al., 2017), hierarchical reinforcement learning (Wayne and Abbott, 2014) and multi-agent reinforcement learning (Bu et al., 2008). It remains difficult not only to carry out the necessary optimization, but also to assess the optimality of a given solution. Multi-agent optimization is generally cast as finding equilibria in the space of strategies. The classic notion of equilibrium is due to Nash (Nash, 1951): a Nash equilibrium is a set of agent strategies for which no agent can unilaterally improve its loss value.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Pure Nash equilibria, in which each agent adopts a single strategy, provide a limited notion of optimality because they exist only under restrictive conditions. On the other hand, mixed Nash equilibria (MNE), where agents adopt a strategy from a probability distribution over the set of all strategies, exist in much greater generality (Glicksberg, 1952). Importantly, MNE exist for games with infinite-dimensional compact strategy spaces, in which each player observes a loss function that is continuous in its strategy. We encounter this setting in different game formulations of machine learning problems, like GANs (Goodfellow et al., 2014). Although MNE are guaranteed to exist, it is difficult to identify them. Indeed, worst-case complexity analyses have shown that without additional assumptions on the losses there is no efficient algorithm for finding a MNE, even in the case of two-player finite games (Daskalakis et al., 2009). Some recent progress has been made: Hsieh et al. (2019) proposed a mirror-descent algorithm with convergence guarantees, which is only approximately realizable in high dimension.

Contributions. Following Hsieh et al.
(2019), we formulate continuous two-player zero-sum games as a multi-agent optimization problem over the space of probability measures on strategies. We describe two gradient descent-ascent dynamics in this space, both involving a transport term. We show that the stationary points of a gradient descent-ascent flow with Langevin diffusion over the space of mixed strategies are approximate MNE. We analyse a gradient descent-ascent dynamics that jointly updates the positions and weights of two mixed strategies to converge to an exact MNE. This dynamics corresponds to a gradient descent-ascent flow over the space of measures endowed with a Wasserstein-Fisher-Rao (WFR) metric (Chizat, Peyré, et al., 2018). We discretize both dynamics in space and time to obtain implementable training algorithms, and provide mean-field consistency results on the discretization. We demonstrate numerically how both dynamics overcome the curse of dimensionality for finding MNE on synthetic games. On real data, we use WFR flows to train mixtures of GANs that explicitly discover data clusters while maintaining good performance.

2 Related work

Equilibria in continuous games. Most works that study convergence to equilibria in continuous games or GANs do not frame the problem in the infinite-dimensional space of measures, but in finite-dimensional spaces. That is because they either (i) restrict their attention to games with convexity-concavity assumptions, in which pure equilibria exist (Mertikopoulos et al., 2019; Lin et al., 2018; Nouiehed et al., 2019), or (ii) provide algorithms with convergence guarantees to local notions of equilibrium, such as stable fixed points, local Nash equilibria and local minimax points (Heusel et al., 2017; Adolphs et al., 2018; Mazumdar et al., 2019; Jin et al., 2019; Fiez et al., 2019; Balduzzi et al., 2018). Both approaches differ from ours, which gives global convergence guarantees without convexity assumptions.
Some works have studied approximate MNE in infinite-dimensional measure spaces. Arora et al. (2017) proved the existence of approximate MNE and studied the generalization properties of this approximate solution; their analysis, however, does not provide a constructive method to identify such a solution. In a more explicit setting, Grnarova et al. (2017) designed an online-learning algorithm for finding a MNE in GANs under the assumption that the discriminator is a single hidden layer neural network. Balandat et al. (2016) apply the dual averaging algorithm to the minimax problem and show that it recovers a MNE, but they provide neither a convergence rate nor a practical algorithm for learning MNE. Our framework holds without any assumption on the architectures of the discriminator and generator, and provides explicit algorithms with some convergence guarantees.

Mean-field view of nonlinear gradient descent. Our approach is closely related to the mean-field perspective on wide neural networks (Mei et al., 2018; Rotskoff and Vanden-Eijnden, 2018; Chizat and Bach, 2018; Sirignano and Spiliopoulos, 2019; Rotskoff, Jelassi, et al., 2019). These methods view training algorithms as approximations of Wasserstein gradient flows, which are dynamics on measures over the space of neurons. In our setting, a mixed strategy corresponds to a measure over the space of strategies.

Particle approaches for two-player games. Our theoretical work sheds new light on the results of Hsieh et al. (2019), and rigorously justifies important algorithmic modifications the authors introduced. Specifically, they give rates of convergence for infinite-dimensional mirror descent on measures (i.e. updating strategy weights but not their positions). The straightforward implementation of this algorithm performs poorly unless the dimension is low (Fig.
1), which is why they proposed an implementable two-timescale version, in which the inner loop is a transport-based sampling procedure closely related to our Algorithm 1. This implementable version is not studied theoretically, as the two-timescale structure hinders a thorough analysis. Our analysis includes transport on an equal footing with mirror descent updates.

3 Problem setup and mean-field dynamics

Notation. For a topological space X, we denote by P(X) the space of Borel probability measures on X, and by M_+(X) the space of Borel (positive) measures. For a given measure µ ∈ P(X) that is absolutely continuous with respect to the canonical Borel measure dx of X and has Radon-Nikodym derivative dµ/dx ∈ C(X), we define its differential entropy H(µ) = −∫ log(dµ/dx) dµ. For measures µ, ν ∈ P(X), W_2 denotes the 2-Wasserstein distance.

3.1 Lifting differentiable games to spaces of strategy distributions

Differentiable two-player zero-sum games. We recall the definition of a differentiable zero-sum game, and show how finding a mixed Nash equilibrium of such a game is equivalent to solving a bilinear game in the infinite-dimensional space of distributions over strategies. We will use gradient flow approaches to solve the lifted problem.

Definition 1. A two-player zero-sum game consists of a set of two players with parameters z = (x, y) ∈ Z = X × Y, where the players observe loss functions ℓ_1 : Z → R and ℓ_2 : Z → R that satisfy ℓ_1(x, y) + ℓ_2(x, y) = 0 for all (x, y) ∈ Z. We call ℓ := ℓ_1 = −ℓ_2 the loss of the game.

The compact finite-dimensional spaces of strategies X and Y are endowed with a certain distance function d (which we assume Euclidean in what follows; App. G.5 derives our results on arbitrary strategy manifolds). This allows us to define differentiable games, amenable to first-order optimization. We make the following mild assumption on the regularity of the losses and constraints (Glicksberg, 1952).

Assumption 1.
The parameter spaces X and Y are compact Riemannian manifolds without boundary, of dimensions d_x and d_y, embedded in R^{D_x} and R^{D_y} respectively. The loss ℓ is continuously differentiable and L-smooth with respect to each parameter. That is, for all x, x′ ∈ X and y, y′ ∈ Y,

‖∇_x ℓ(x, y) − ∇_x ℓ(x′, y′)‖_2 ≤ L(d(x, x′) + d(y, y′)),   ‖∇_y ℓ(x, y) − ∇_y ℓ(x′, y′)‖_2 ≤ L(d(x, x′) + d(y, y′)).

From pure to mixed Nash equilibria. Assuming that both players play simultaneously, a pure Nash equilibrium is a pair of strategies (x*, y*) ∈ X × Y such that, for all (x, y) ∈ X × Y, ℓ(x*, y) ≤ ℓ(x*, y*) ≤ ℓ(x, y*). Such points do not always exist in continuous games. In contrast, mixed Nash equilibria (MNE) are guaranteed to exist under Asm. 1 (Glicksberg, 1952). These are distributions (µ*_x, µ*_y) ∈ P(X) × P(Y) that are global saddle points of the expected loss L(µ_x, µ_y) := ∬ ℓ(x, y) dµ_x(x) dµ_y(y). Formally, for all (µ_x, µ_y) ∈ P(X) × P(Y),

L(µ*_x, µ_y) ≤ L(µ*_x, µ*_y) ≤ L(µ_x, µ*_y).   (1)

We quantify the accuracy of an estimate (µ̂_x, µ̂_y) of a MNE using the Nikaidô and Isoda (1955) error

NI(µ̂_x, µ̂_y) = sup_{µ_y ∈ P(Y)} L(µ̂_x, µ_y) − inf_{µ_x ∈ P(X)} L(µ_x, µ̂_y).   (2)

We track the evolution of this metric in our theoretical results (§4.2) and in our experiments. We obtain guarantees on finding ε-MNE (µ^ε_x, µ^ε_y), i.e. distribution pairs such that NI(µ^ε_x, µ^ε_y) ≤ ε.

3.2 Training dynamics on discrete mixtures of strategies

We study three different dynamics for solving (1). Let us first assume that the two players play finite mixtures of n strategies, µ_x = Σ_{i=1}^n w^i_x δ_{x^i} ∈ P(X) and µ_y = Σ_{i=1}^n w^i_y δ_{y^i} ∈ P(Y), where {x^i, y^i}_{i ∈ [1:n]} are the positions of the strategies and w^i_x, w^i_y ≥ 0 are their weights. In the simplest setting, these mixtures are assumed uniform, i.e. w^i_x = w^i_y = 1/n. Finding the best 2n strategies involves finding a saddle point of L(µ_x, µ_y) = (1/n²) Σ_{i,j} ℓ(x^i, y^j).

Algorithm 1 Langevin Descent-Ascent (L-DA)
1: Input: i.i.d. samples x^1_0, …, x^n_0 from µ_{x,0} ∈ P(X); i.i.d. samples y^1_0, …, y^n_0 from µ_{y,0} ∈ P(Y)
2: for t = 0, …, T do
3:   for i = 1, …, n do
4:     Sample W^i_t ∼ N(0, I);  x^i_{t+1} = x^i_t − (η/n) Σ_{j=1}^n ∇_x ℓ(x^i_t, y^j_t) + √(2ηβ^{-1}) W^i_t
5:     Sample W̄^i_t ∼ N(0, I);  y^i_{t+1} = y^i_t + (η/n) Σ_{j=1}^n ∇_y ℓ(x^j_t, y^i_t) + √(2ηβ^{-1}) W̄^i_t
6: Return µ^n_{x,T} = (1/n) Σ_{i=1}^n δ_{x^i_T},  µ^n_{y,T} = (1/n) Σ_{i=1}^n δ_{y^i_T}

Starting from random independent initial strategies x^i_0 ∼ µ_{x,0}, y^i_0 ∼ µ_{y,0}, we may hope that the gradient descent-ascent dynamics

dx^i_t/dt = −(1/n) Σ_{j=1}^n ∇_x ℓ(x^i_t, y^j_t),   dy^i_t/dt = (1/n) Σ_{j=1}^n ∇_y ℓ(x^j_t, y^i_t),   i ∈ [1:n]   (3)

finds such a saddle point. Yet this may fail in simple nonconvex-nonconcave games, as illustrated in App. G.2: the particle distributions collapse to a stationary point that is not a MNE. To mitigate this convergence problem, we analyse a perturbed dynamics analogous to Langevin gradient descent. Using the same initialization as in (3), we add a small amount of noise to the gradient dynamics and obtain the stochastic differential equations

dX^i_t = −(1/n) Σ_{j=1}^n ∇_x ℓ(X^i_t, Y^j_t) dt + √(2/β) dW^i_t,   dY^i_t = (1/n) Σ_{j=1}^n ∇_y ℓ(X^j_t, Y^i_t) dt + √(2/β) dW̄^i_t,   (4)

where W^i_t, W̄^i_t are independent Brownian motions. The discretization of (4) results in Alg. 1; it is similar to Alg. 4 in Hsieh et al. (2019). We propose a second alternative to (3) that updates both the positions and the weights of the particles, using relative updates for the weights. We will show that it enjoys better convergence properties in the mean-field limit:

dx^i_t/dt = −γ Σ_{j=1}^n w^j_{y,t} ∇_x ℓ(x^i_t, y^j_t),   dw^i_{x,t}/dt = α w^i_{x,t} (−Σ_{j=1}^n w^j_{y,t} ℓ(x^i_t, y^j_t) + K(t)),   (5)

and similarly for the y^i_t (flipping the sign of ℓ). The term K(t) := Σ_{k=1}^n Σ_{j=1}^n w^j_{y,t} w^k_{x,t} ℓ(x^k_t, y^j_t) keeps w_{x,t} in the simplex. We use uniform weights for initialization.
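Alg. 1 is the Euler-Maruyama discretization of the noisy dynamics (4) and is straightforward to implement. The following NumPy sketch illustrates it on Euclidean particles (the paper works on compact manifolds; the quadratic test loss and all names here are our illustrative stand-ins, not the paper's code):

```python
import numpy as np

def langevin_da(grad_x, grad_y, x, y, eta=0.05, beta=50.0, steps=2000, rng=None):
    """Alg. 1 (Langevin Descent-Ascent) sketch: n x-particles against n y-particles.

    x, y have shape (n, d); grad_x(xi, y) must return (1/n) sum_j grad_x l(xi, y^j),
    and grad_y(x, yi) must return (1/n) sum_j grad_y l(x^j, yi).
    """
    rng = np.random.default_rng() if rng is None else rng
    noise_scale = np.sqrt(2 * eta / beta)
    for _ in range(steps):
        gx = np.stack([grad_x(xi, y) for xi in x])  # descent direction per x-particle
        gy = np.stack([grad_y(x, yi) for yi in y])  # ascent direction per y-particle
        x = x - eta * gx + noise_scale * rng.standard_normal(x.shape)
        y = y + eta * gy + noise_scale * rng.standard_normal(y.shape)
    return x, y  # uniform-weight empirical measures over the rows

# Illustrative strongly convex-concave loss l(x, y) = |x|^2/2 - |y|^2/2 + x.y,
# whose unique equilibrium is the origin:
rng = np.random.default_rng(0)
x0, y0 = rng.standard_normal((50, 2)), rng.standard_normal((50, 2))
gx = lambda xi, y: xi + y.mean(axis=0)   # (1/n) sum_j (xi + y^j)
gy = lambda x, yi: -yi + x.mean(axis=0)  # (1/n) sum_j (x^j - yi)
x, y = langevin_da(gx, gy, x0, y0)       # particle clouds concentrate near 0
```

For this toy loss the particles settle into a Gibbs-like cloud around the unique equilibrium; larger β shrinks the cloud, mirroring the temperature trade-off discussed in §4.1.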
When γ = 0 and α = 1, only the weights are updated: this recovers the continuous-time version of the infinite-dimensional mirror descent studied by Hsieh et al. (2019). The Euler discretization of (5) results in Alg. 2.

Algorithm 2 Wasserstein-Fisher-Rao Descent-Ascent (WFR-DA)
1: Input: i.i.d. samples x^(1)_0, …, x^(n)_0 from ν_{x,0} ∈ P(X); i.i.d. samples y^(1)_0, …, y^(n)_0 from ν_{y,0} ∈ P(Y). Initial weights: w^(i)_x = 1, w^(i)_y = 1 for all i ∈ [1:n]
2: for t = 0, …, T do
3:   x^(i)_{t+1} = x^(i)_t − η Σ_{j=1}^n w^(j)_{y,t} ∇_x ℓ(x^(i)_t, y^(j)_t),  i ∈ [1:n]
4:   ŵ^(i)_{x,t+1} = w^(i)_{x,t} exp(−η Σ_{j=1}^n w^(j)_{y,t} ℓ(x^(i)_t, y^(j)_t));  w^(i)_{x,t+1} = ŵ^(i)_{x,t+1} / Σ_{j=1}^n ŵ^(j)_{x,t+1}
5:   y^(i)_{t+1} = y^(i)_t + η Σ_{j=1}^n w^(j)_{x,t} ∇_y ℓ(x^(j)_t, y^(i)_t),  i ∈ [1:n]
6:   ŵ^(i)_{y,t+1} = w^(i)_{y,t} exp(η Σ_{j=1}^n w^(j)_{x,t} ℓ(x^(j)_t, y^(i)_t));  w^(i)_{y,t+1} = ŵ^(i)_{y,t+1} / Σ_{j=1}^n ŵ^(j)_{y,t+1}
7: Return ν̄^n_{x,T} = (1/(T+1)) Σ_{t=0}^T Σ_{i=1}^n w^(i)_{x,t} δ_{x^(i)_t},  ν̄^n_{y,T} = (1/(T+1)) Σ_{t=0}^T Σ_{i=1}^n w^(i)_{y,t} δ_{y^(i)_t}

3.3 Training dynamics as gradient flows on measures

The three dynamics introduced above at the level of particles induce dynamics on the associated empirical probability measures. If {x^i_t, y^i_t}_{i ∈ [1:n]} is a solution of (3), then µ_x(t) = (1/n) Σ_{i=1}^n δ_{x^i_t} and µ_y(t) = (1/n) Σ_{i=1}^n δ_{y^i_t} are solutions of the Interacting Wasserstein Gradient Flow (IWGF) of L:

∂_t µ_x = div(µ_x ∇_x V_x(µ_y, x)),   µ_x(0) = (1/n) Σ_{i=1}^n δ_{x^i_0},
∂_t µ_y = −div(µ_y ∇_y V_y(µ_x, y)),   µ_y(0) = (1/n) Σ_{i=1}^n δ_{y^i_0}.   (6)

The derivation of (6) is provided in App. G.3. We use the notation V_x(µ_y, x) := δL/δµ_x (µ_x, µ_y)(x) = ∫ ℓ(x, y) dµ_y(y) and V_y(µ_x, y) := δL/δµ_y (µ_x, µ_y)(y) = ∫ ℓ(x, y) dµ_x(x) for the first variations of the functional L(µ_x, µ_y). Holding µ_y fixed, the evolution of µ_x is a Wasserstein gradient flow on L(·, µ_y) (Ambrosio et al., 2005). We interpret these PDEs in the weak sense, i.e. equality holds when integrating the measures against bounded continuous functions.
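A direct transcription of the WFR-DA scheme of Alg. 2 alternates a transport step on positions with a multiplicative (mirror-descent) step on weights. A minimal NumPy sketch for a generic loss, with gradients supplied by the caller (all names and defaults are ours, not the paper's code):

```python
import numpy as np

def wfr_da(loss, grad_x, grad_y, x, y, eta=0.05, steps=500):
    """Alg. 2 (WFR-DA) sketch: particles x[i], y[j] with simplex weights wx, wy.

    Positions follow weighted gradient descent-ascent (Wasserstein part); weights
    follow normalized multiplicative updates (Fisher-Rao / mirror-descent part).
    """
    n, m = len(x), len(y)
    wx, wy = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    for _ in range(steps):
        L = np.array([[loss(xi, yj) for yj in y] for xi in x])  # L[i, j] = l(x^i, y^j)
        gx = np.stack([sum(w * grad_x(xi, yj) for w, yj in zip(wy, y)) for xi in x])
        gy = np.stack([sum(w * grad_y(xj, yi) for w, xj in zip(wx, x)) for yi in y])
        new_wx = wx * np.exp(-eta * (L @ wy))   # descent player: down-weight bad strategies
        new_wy = wy * np.exp(eta * (L.T @ wx))  # ascent player: up-weight good strategies
        x, y = x - eta * gx, y + eta * gy
        wx, wy = new_wx / new_wx.sum(), new_wy / new_wy.sum()
    return x, wx, y, wy
```

With zero gradients (fixed positions) this reduces to mirror descent on the simplices; with the weights frozen at 1/n it reduces to plain descent-ascent, matching the two limiting regimes discussed in the text.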
The distributions µ_x(t) = (1/n) Σ_{i=1}^n δ_{X^i_t} and µ_y(t) = (1/n) Σ_{i=1}^n δ_{Y^i_t}, where {X^i, Y^i}_{i ∈ [1:n]} are solutions of (4), follow an Entropy-Regularized Interacting Wasserstein Gradient Flow (ERIWGF):

∂_t µ_x = div(µ_x ∇_x V_x(µ_y, x)) + β^{-1} Δ_x µ_x,   µ_x(0) = (1/n) Σ_{i=1}^n δ_{x^i_0},
∂_t µ_y = −div(µ_y ∇_y V_y(µ_x, y)) + β^{-1} Δ_y µ_y,   µ_y(0) = (1/n) Σ_{i=1}^n δ_{y^i_0}.   (7)

The derivation of (7) is provided in Lemma 10. It is a system of coupled nonlinear Fokker-Planck equations, which are the Kolmogorov forward equations of the SDE (4). They correspond to the IWGF of the entropy-regularized loss L_β(µ_x, µ_y) := L(µ_x, µ_y) + β^{-1}(H(µ_y) − H(µ_x)). Finally, if {x^i, y^i, w^i_x, w^i_y}_{i ∈ [1:n]} solve (5), then µ_x(t) = Σ_{i=1}^n w^i_{x,t} δ_{x^i_t} and µ_y(t) = Σ_{i=1}^n w^i_{y,t} δ_{y^i_t} solve the Interacting Wasserstein-Fisher-Rao Gradient Flow (IWFRGF) of L:

∂_t µ_x = γ div(µ_x ∇_x V_x(µ_y, x)) − α µ_x (V_x(µ_y, x) − L(µ_x, µ_y)),   µ_x(0) = Σ_{i=1}^n w^i_{x,0} δ_{x^i_0},
∂_t µ_y = −γ div(µ_y ∇_y V_y(µ_x, y)) + α µ_y (V_y(µ_x, y) − L(µ_x, µ_y)),   µ_y(0) = Σ_{i=1}^n w^i_{y,0} δ_{y^i_0}.   (8)

The derivation of (8) is provided in App. A and Lemma 11. The Wasserstein-Fisher-Rao or Hellinger-Kantorovich metric (Chizat, Peyré, et al., 2015; Kondratyev et al., 2016; Gallouët and Monsaingeon, 2016) is a metric on the space of positive measures M_+(X), induced by a lifting to the space P(X × R_+) of the form ν ↦ µ = ∫_{R_+} w dν(·, w). If we keep ν_y fixed, the first equation in (8) is a Wasserstein-Fisher-Rao gradient flow (slightly modified by the term α µ_x L(µ_x, µ_y), which constrains µ_x to P(X)). The term −α µ_x (V_x(µ_y, x) − L(µ_x, µ_y)), which also arises in entropic mirror descent, allows mass to teleport from bad strategies to better ones at finite cost by moving along the weight coordinate. Wasserstein-Fisher-Rao gradient flows have been used by Chizat (2019), Rotskoff, Jelassi, et al. (2019), and Liero et al. (2018) in the context of optimization.
Initialization of (6), (7) and (8) may also be done with the measures µ_{x,0} and µ_{y,0} from which the {x^i_0}, {y^i_0} are sampled, in which case the measures µ_x(t) and µ_y(t) are not discrete and follow the mean-field dynamics. In §4.3 we link the dynamics starting from discrete realizations to the mean-field dynamics.

4 Convergence analysis

We establish convergence results for the entropy-regularized dynamics and the WFR dynamics.

4.1 Convergence of the entropy-regularized Wasserstein dynamics

The following theorem characterizes the stationary points of the entropy-regularized dynamics.

Theorem 1. Suppose that Asm. 1 holds, that ℓ ∈ C²(X × Y) and that the initial measures µ_{x,0}, µ_{y,0} have densities in L¹(X), L¹(Y). If a solution (µ_x(t), µ_y(t)) of the ERIWGF (7) converges in time, it must converge to the pair (µ̂_x, µ̂_y) whose densities (ρ_x, ρ_y) are the unique fixed point of

ρ_x(x) = (1/Z_x) exp(−β ∫ ℓ(x, y) dµ̂_y(y)),   ρ_y(y) = (1/Z_y) exp(β ∫ ℓ(x, y) dµ̂_x(x)).   (9)

(µ̂_x, µ̂_y) is an ε-Nash equilibrium of the game given by L when β ≥ (4/ε) log(2 ((1 − V_δ)/V_δ) (2K_ℓ/ε − 1)), where K_ℓ := max_{x,y} ℓ(x, y) − min_{x,y} ℓ(x, y) is the length of the range of ℓ, δ := ε/(2 Lip(ℓ)) and V_δ is a lower bound on the volume of a ball of radius δ in X and Y.

The proof is in App. C. Theorem 1 characterizes the stationary points of the ERIWGF but does not provide a guarantee of convergence in time. It implies that if the dynamics (7) converges in time, the limit is an ε-Nash equilibrium of L, with ε = O(1/β) (disregarding log factors). The dynamics (7) corresponds to a McKean-Vlasov process on the joint probability measure µ_x ⊗ µ_y. While convergence to stationary solutions of such processes has been studied in the Euclidean case (Eberle et al., 2019), those results would only guarantee convergence for temperatures β^{-1} ≳ Lip(ℓ) in our setup, which is not strong enough to certify convergence to arbitrary ε-NE.
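On a discretized game, the stationary pair (9) can be approximated by a damped fixed-point iteration on the two Gibbs densities. A small sketch under our own assumptions (the finite grid, the damping scheme, and all names are illustrative additions, not the paper's method):

```python
import numpy as np

def entropic_equilibrium(A, beta=1.0, tau=0.5, iters=300, mu_x0=None, mu_y0=None):
    """Damped fixed-point iteration for (9) on a finite grid.

    A[i, j] plays the role of l(x_i, y_j); a fixed point satisfies
    mu_x prop. to exp(-beta * A @ mu_y) and mu_y prop. to exp(beta * A.T @ mu_x).
    """
    nx, ny = A.shape
    mu_x = np.full(nx, 1.0 / nx) if mu_x0 is None else np.asarray(mu_x0, float)
    mu_y = np.full(ny, 1.0 / ny) if mu_y0 is None else np.asarray(mu_y0, float)
    for _ in range(iters):
        gx = np.exp(-beta * (A @ mu_y)); gx /= gx.sum()   # Gibbs response of the min player
        mu_x = (1 - tau) * mu_x + tau * gx                # damped update
        gy = np.exp(beta * (A.T @ mu_x)); gy /= gy.sum()  # Gibbs response of the max player
        mu_y = (1 - tau) * mu_y + tau * gy
    return mu_x, mu_y
```

For a matching-pennies grid (A = [[1, -1], [-1, 1]]) with moderate β, the iteration recovers the uniform equilibrium from generic interior initializations; as β grows, the fixed point of (9) approaches an ε-MNE with ε = O(1/β), matching Theorem 1.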
There is a trade-off between setting a low temperature (β ≫ 1), which yields an ε-Nash equilibrium with small ε but possibly slow or no convergence, and setting a high temperature, which has the opposite effect. Linear-potential Fokker-Planck equations (which we recover when the two players are decoupled) indeed converge exponentially with rate e^{−λ_β t} for all β, but with λ_β decreasing exponentially in β for nonconvex potentials (Markowich and Villani, 1999, sec. 5). Entropic regularization also biases the dynamics towards measures with full support, and hence precludes convergence to sparse equilibria even when they exist. This problem does not arise in the WFR dynamics.

4.2 Analysis of the Wasserstein-Fisher-Rao dynamics

Theorem 2 states that, at a certain time t_0, the time-averaged measures of the solution (ν_x, ν_y) of (8) are an ε-MNE, where ε can be made arbitrarily small by adjusting the constants γ, α of the dynamics. We define ν̄_x(t) = (1/t) ∫_0^t ν_x(s) ds and ν̄_y(t) = (1/t) ∫_0^t ν_y(s) ds, where ν_x and ν_y are solutions of (8).

Theorem 2. Let ε > 0 be arbitrary. Suppose that ν_{x,0}, ν_{y,0} are such that their Radon-Nikodym derivatives with respect to the Borel measures of X, Y are lower-bounded by e^{−K'_x}, e^{−K'_y} respectively. For any δ ∈ (0, 1/2), there exists a constant C_{δ,X,Y,K'_x,K'_y} > 0 depending on the dimensions of X, Y, their curvatures and K'_x, K'_y, such that if γ/α < 1 and γ/α ≤ (ε/C_{δ,X,Y,K'_x,K'_y})^{2+1/δ}, then

NI(ν̄_x(t_0), ν̄_y(t_0)) ≤ ε,   where t_0 = (αγ)^{−1/2}.

The proof (App. D) builds on the convergence properties of continuous-time mirror descent and closely follows the proof of Theorem 3.8 of Chizat (2019). We make the dependence of C_{δ,X,Y,K'_x,K'_y} on the dimensions of the manifolds and the properties of the loss ℓ explicit. Notice that Theorem 2 ensures convergence towards an ε-Nash equilibrium of the non-regularized game.
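The role of time averaging in Theorem 2 is easy to see in the γ = 0 limit, where the dynamics reduces to mirror descent-ascent on the simplices and the NI error (2) can be computed exactly. A sketch on a small finite game (the 2×2 payoff matrix and step sizes are our toy choices):

```python
import numpy as np

def ni_error(A, wx, wy):
    """NI error (2) for a finite game with A[i, j] = l(x_i, y_j): the sup and inf
    over mixed strategies are attained at pure strategies."""
    return np.max(A.T @ wx) - np.min(A @ wy)

def averaged_mirror_da(A, eta=0.02, steps=20000):
    """Discrete-time mirror descent-ascent with time averaging (gamma = 0 case)."""
    nx, ny = A.shape
    wx, wy = np.full(nx, 1.0 / nx), np.full(ny, 1.0 / ny)
    avg_x, avg_y = np.zeros(nx), np.zeros(ny)
    for _ in range(steps):
        new_wx = wx * np.exp(-eta * (A @ wy)); new_wx /= new_wx.sum()
        new_wy = wy * np.exp(eta * (A.T @ wx)); new_wy /= new_wy.sum()
        wx, wy = new_wx, new_wy
        avg_x += wx; avg_y += wy
    return avg_x / steps, avg_y / steps

# 2x2 zero-sum game whose mixed equilibrium puts weight (2/5, 3/5) on both sides:
A = np.array([[2.0, -1.0], [-1.0, 1.0]])
wx_bar, wy_bar = averaged_mirror_da(A)   # ni_error(A, wx_bar, wy_bar) is small
```

The last iterates of such dynamics typically cycle around the equilibrium, while the NI error of the time-averaged iterates decays, which is exactly the kind of guarantee Theorem 2 provides in continuous time.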
Following Chizat (2019), it is possible to replace the regularity assumption on the initial measures ν_{x,0}, ν_{y,0} by a singular initialization, at the expense of using O(exp(d)) particles. This result is not a convergence result for the measures themselves, but for the value of the NI error. Notice that it involves time averaging and a finite horizon. Similar results are common for mirror descent in convex games (Juditsky et al., 2011), albeit in the discrete-time setting. Theorem 2 does not capture the benefits of transport, as it regards it as a perturbation of mirror descent (which corresponds to γ = 0). When targeting a small error ε, we need to set γ ≪ α because of the bound on γ/α. In this regime, mirror descent is the main driver of the dynamics. Empirically, however, much higher ratios γ/α (i.e. increasing the importance of the transport term) result in better performance. A satisfying explanation of this phenomenon is still sought after, even in the simpler optimization setting (Chizat, 2019).

4.3 Convergence to mean-field

The following theorem (proof in App. F) links the empirical measures of the systems (4) and (5) to the solutions of the mean-field dynamics (7) and (8) respectively. It can be seen as a law of large numbers, and it shows that Alg. 1 and Alg. 2 approximate the mean-field dynamics studied in §4.1 and §4.2.

Theorem 3. (i) Let µ^n_x = (1/n) Σ_{i=1}^n δ_{X^(i)} ∈ C([0, T], P(X)), µ^n_y = (1/n) Σ_{i=1}^n δ_{Y^(i)} ∈ C([0, T], P(Y)) be the empirical measures of a solution of (4) up to an arbitrary time T. Let µ_x ∈ C([0, T], P(X)), µ_y ∈ C([0, T], P(Y)) be a solution of the ERIWGF (7) with mean-field initial conditions µ_x(0) = µ_{x,0}, µ_y(0) = µ_{y,0}. Then

E[W²_2(µ^n_{x,t}, µ_{x,t}) + W²_2(µ^n_{y,t}, µ_{y,t})] → 0 and E[|NI(µ^n_{x,t}, µ^n_{y,t}) − NI(µ_{x,t}, µ_{y,t})|] → 0 as n → ∞,

uniformly over t ∈ [0, T]. NI is the Nikaidô-Isoda error defined in (2).

(ii) Let ν^n_x = Σ_{i=1}^n w^i_{x,t} δ_{X^(i)} ∈ C([0, T], P(X)), ν^n_y = Σ_{i=1}^n w^i_{y,t} δ_{Y^(i)} ∈ C([0, T], P(Y)) be the (projected) empirical measures of a solution of (5) up to an arbitrary time T. Let ν_x ∈ C([0, T], P(X)), ν_y ∈ C([0, T], P(Y)) be a solution of (8) with mean-field initial conditions ν_x(0) = ν_{x,0}, ν_y(0) = ν_{y,0}. Then

E[W²_2(ν^n_{x,t}, ν_{x,t}) + W²_2(ν^n_{y,t}, ν_{y,t})] → 0 and E[|NI(ν̄^n_{x,t}, ν̄^n_{y,t}) − NI(ν̄_{x,t}, ν̄_{y,t})|] → 0 as n → ∞,

uniformly over t ∈ [0, T], where ν̄_{x,t}, ν̄_{y,t}, ν̄^n_{x,t}, ν̄^n_{y,t} are the time-averaged measures, as in Theorem 2.

Figure 1: Nikaidô-Isoda errors for L-DA, WFR-DA and mirror descent, as a function of the problem dimension (1 to 32), for a nonconvex loss ℓ_a (left) and a convex loss ℓ_b (right), with 50×2 and 100×2 particles. L-DA and WFR-DA outperform mirror descent in large dimensions. Values averaged over 20 runs after 30000 iterations; error bars show standard deviation across runs.

Figure 2: Training mixtures of GANs on a synthetic mixture of Gaussians in 2D, with deep (overfitting) and shallow (underfitting) networks and 1, 3 or 5 generators (G) and discriminators (D). WFR-DA converges faster for models with a low number of parameters, and performs similarly for overparametrized models. Mixtures naturally perform a form of clustering of the data. Error bars show variance across 5 runs.

5 Numerical experiments

We show that the WFR and Langevin dynamics outperform mirror descent in high dimension on synthetic games. We then show the interest of using WFR-DA for training GANs. Code has been made available for reproducibility.

Figure 3: Training mixtures of GANs on CIFAR10 (FID curves for 1G 1D, 5G 1D W, 5G 2D W and 5G 2D WFR; CIFAR10 samples from a WFR flow on 5 generators and 2 discriminators; MNIST images for WFR with 2 generators and 2 discriminators). We compare the algorithm that updates the mixture weights and parameters (WFR-DA flow) with the algorithm that only updates parameters (W-DA flow). Using several discriminators and a WFR-DA flow brings more stable convergence.
Each generator tends to specialize in one type of images. Error bars show variance across 5 runs.

5.1 Polynomial games on spheres

We study two different games, with losses ℓ_a, ℓ_b : S^{d−1} × S^{d−1} → R of the form

ℓ_a(x, y) = xᵀA_0x + xᵀA_1y + yᵀA_2y + yᵀA_3(x²) + a_0ᵀx + a_1ᵀy,
ℓ_b(x, y) = xᵀA_0ᵀA_0x + xᵀA_1y − yᵀA_2ᵀA_2y + a_0ᵀx + a_1ᵀy,

where A_0, A_1, A_2, A_3, a_0, a_1 are matrices and vectors with components sampled from a normal distribution N(0, 1), and x² is the vector given by component-wise multiplication of x with itself. ℓ_b is a convex-concave loss on the sphere, while ℓ_a is not. We run Langevin Descent-Ascent (updates of positions) and WFR Descent-Ascent (updates of weights and positions), and compare them with mirror descent (updates of weights). We note that the computation of the NI error (2) entails solving two optimization problems on measures, or equivalently in parameter space. We solve each of them by performing 2000 gradient ascent runs with random uniform initialization and selecting the best final value. This gives a lower bound on the NI error which is precise enough for our purposes. We perform time averaging on the weights of mirror descent and WFR-DA, but not on the positions of WFR-DA, because that would incur an O(t) memory overhead.

Results. Mirror descent performs like WFR-DA in low dimensions, but suffers strongly from the curse of dimensionality (Fig. 1). On the other hand, algorithms that incorporate a transport term keep performing well in high dimensions. In particular, WFR-DA is consistently the algorithm with the lowest NI error. Notice that the errors in the n = 50 and n = 100 plots do not differ much, confirming that we reach a mean-field regime.

5.2 Training GAN mixtures

We now use WFR-DA to train mixtures of generator networks. We consider the Wasserstein-GAN (Arjovsky et al., 2017) setting. We seek to approximate a distribution P_data with a distribution G_x, defined as the push-forward of a noise distribution N(0, I) by a neural network g_x.
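Concretely, sampling from such a push-forward, and from the weighted mixture of push-forwards used below for mixtures of generators, looks as follows. The toy affine "generators" and all names are our illustrative stand-ins:

```python
import numpy as np

def sample_from_mixture(generators, weights, n_samples, noise_dim, rng=None):
    """Sample from a mixture of push-forwards: pick generator i with probability
    weights[i], then push a Gaussian noise vector eps ~ N(0, I) through it."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(generators), size=n_samples, p=weights)
    eps = rng.standard_normal((n_samples, noise_dim))
    return np.stack([generators[i](e) for i, e in zip(idx, eps)])

# Two toy "generators" covering two well-separated modes of a 1D target:
gens = [lambda e: e, lambda e: e + 10.0]
samples = sample_from_mixture(gens, [0.5, 0.5], 1000, 1, np.random.default_rng(0))
```

With a single generator, only one of the two modes could be covered by this toy model; the mixture covers both, which is the behaviour the GMM experiments below exploit.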
The discrepancy between P_data and G_x is estimated by a neural-network discriminator f_y, leading to the problem

min_x max_y ℓ(x, y) := E_{a ∼ P_data}[f_y(a)] − E_{ε ∼ N(0,I)}[f_y(g_x(ε))].

We lift this problem to the space of distributions over the parameters x and y (see App. G.4), which we represent through weighted discrete distributions Σ_{i=1}^p w^(i)_x δ_{x^(i)} and Σ_{j=1}^q w^(j)_y δ_{y^(j)}. We solve

min_{x^(i), w_x ∈ Δ_p} max_{y^(j), w_y ∈ Δ_q} Σ_{i,j} w^(i)_x w^(j)_y ℓ(x^(i), y^(j))

using Alg. 2, where Δ_q is the q-dimensional simplex. The optimal generation strategy corresponding to an equilibrium point ((x^(i))_i, w_x, (y^(j))_j, w_y) is then to randomly select a generator g_{x^(I)}, with I sampled from [p] with probabilities (w^(i)_x)_i, and use it to generate g_{x^(I)}(ε) with ε ∼ N(0, I). Training mixtures of generators has been proposed by Ghosh et al. (2018), with a tweaked discriminator loss. Our formulation only involves a lifting to the space of measures, and uses a new training algorithm.

Results on 2D GMMs. We first set P_data to be an 8-mode mixture of Gaussians in two dimensions. We use the original W-GAN loss, with weight clipping for the discriminators (f_{y^(j)})_j. We assess the benefit of using mixtures when a single generator g_{x^(i)} cannot fit P_data (single-layer MLP), and when it can (4-layer MLP). We report results in Fig. 2, measuring the log-likelihood of G_x under the GMM during training. The WFR dynamics is stable even with few particles. When training under-parametrized generators, using mixtures permits faster convergence (in terms of generator updates). In the over-parametrized setting, training a single generator or a mixture of generators performs similarly. WFR-DA is thus useful for training mixtures of simple generators. In this setting, each simple generator identifies modes in the training data, performing data clustering at no cost (Fig. 2, right).

Results on real data. We train a mixture of ResNet generators on CIFAR10 and MNIST. We replace the position updates in Alg.
2 by extrapolated Adam steps (Gidel et al., 2019) to achieve faster convergence, and perform a grid search over generator and discriminator learning rates. Convergence curves for the best learning rates are displayed in Fig. 3 (right), measuring the test FID (Heusel et al., 2017). With a sufficient number of generators and discriminators (G ≥ 5, D ≥ 2), the model trains as fast as a normal GAN. WFR-DA is thus stable and efficient even with a reasonable number of particles. Using the discretized WFR flow rather than the Wasserstein flow provides a slight improvement over updating parameters only. As with GMMs, each generator trained with WFR-DA becomes specialised in generating a fraction of the target data, thereby identifying clusters. These clusters could be used for unsupervised conditional generation of images.

6 Conclusions and future work

We have explored nonconvex-nonconcave, high-dimensional games from the perspective of optimal transport. As with nonconvex optimization, framing the problem in terms of measures provides geometric benefits, at the expense of moving into non-Euclidean metric spaces over measures. Our theoretical results establish approximate mean-field convergence for two setups, Langevin Descent-Ascent and WFR Descent-Ascent, and apply directly to GANs with mixtures of generators and discriminators. Despite the positive convergence guarantees, our results are qualitative in nature, i.e. without rates. In the entropic case, the unfavorable trade-off between temperature and convergence of the associated McKean-Vlasov scheme deserves further study, perhaps through log-Sobolev-type inequalities (Markowich and Villani, 1999). In the WFR case, we lack a local convergence analysis explaining the benefits of transport observed empirically, perhaps leveraging sharpness (Polyak-Łojasiewicz) results such as those in Chizat (2019) or Sanjabi et al. (2018).
Finally, in our GAN formulation, each generator is associated with a single particle in a high-dimensional product space of all network parameters, which does not scale to the large population sizes that would approximate the mean-field limit. A natural question is to understand to what extent our framework can be combined with specific choices of architecture, as recently studied in Lei et al. (2019).

Broader impact

We study algorithms designed to find equilibria in games, provide theoretical guarantees of convergence and test their performance empirically. Among other applications, our results give insight into training algorithms for generative adversarial networks (GANs), which are useful for many relevant tasks such as image generation, image-to-image or text-to-image translation, and video prediction. As always, we note that machine learning improvements like ours come in the form of "building machines to do X better". For a sufficiently malicious or ill-informed choice of X, such as surveillance or recidivism prediction, almost any progress in machine learning might indirectly lead to a negative outcome, and our work is not excluded from that.

Funding disclosure

C. Domingo-Enrich thanks J. De Dios Pont for conversations on the subject. This work is partially supported by the Alfred P. Sloan Foundation, NSF RI-1816753, NSF CAREER CIF 1845360, NSF CHS-1901091, Samsung Electronics, and the Institute for Advanced Study. The work of C. Domingo-Enrich is partially supported by the La Caixa Fellowship. The work of A. Mensch is supported by the European Research Council (ERC project NORIA). The work of G. Rotskoff is supported by the James S. McDonnell Foundation.

References

Adolphs, Leonard, Hadi Daneshmand, Aurelien Lucchi, and Thomas Hofmann (2018). "Local saddle point optimization: A curvature exploitation approach". In: arXiv:1805.05751.
Ambrosio, Luigi, Nicola Gigli, and Giuseppe Savaré (2005). Gradient flows: in metric spaces and in the space of probability measures.
Lectures in Mathematics ETH Zürich. Boston: Birkhäuser. 333 pp.
Arjovsky, Martin, Soumith Chintala, and Léon Bottou (2017). "Wasserstein GAN". In: arXiv:1701.07875.
Arora, Sanjeev, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang (2017). "Generalization and equilibrium in generative adversarial nets (GANs)". In: Proceedings of the International Conference on Machine Learning.
Balandat, Maximilian, Walid Krichene, Claire Tomlin, and Alexandre Bayen (2016). "Minimizing regret on reflexive Banach spaces and Nash equilibria in continuous zero-sum games". In: Advances in Neural Information Processing Systems.
Balduzzi, David, Sébastien Racanière, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel (2018). "The Mechanics of n-Player Differentiable Games". In: Proceedings of the International Conference on Machine Learning.
Bogachev, V. I., N. V. Krylov, and M. Röckner (2001). "On Regularity of Transition Probabilities and Invariant Measures of Singular Diffusions under Minimal Conditions". In: Communications in Partial Differential Equations 26.11-12, pp. 2037–2080.
Bu, Lucian, Robert Babu, Bart De Schutter, et al. (2008). "A comprehensive survey of multi-agent reinforcement learning". In: IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38.2, pp. 156–172.
Chirikjian, G. S. (2009). Stochastic models, information theory, and Lie groups. Applied and Numerical Harmonic Analysis. Birkhäuser.
Chizat, Lénaïc (2019). "Sparse Optimization on Measures with Over-parameterized Gradient Descent". In: arXiv:1907.10300v1.
Chizat, Lénaïc and Francis Bach (2018). "On the global convergence of gradient descent for over-parameterized models using optimal transport". In: Advances in Neural Information Processing Systems, pp. 3036–3046.
Chizat, Lénaïc, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard (2015). "Unbalanced Optimal Transport: Dynamic and Kantorovich Formulation". In: arXiv:1508.05216.
— (2018).
"An Interpolating Distance Between Optimal Transport and Fisher-Rao Metrics". In: Foundations of Computational Mathematics 18.1, pp. 1–44.
Daskalakis, Constantinos, Paul W. Goldberg, and Christos H. Papadimitriou (2009). "The complexity of computing a Nash equilibrium". In: SIAM Journal on Computing 39.1, pp. 195–259.
Eberle, Andreas, Arnaud Guillin, and Raphael Zimmer (2019). "Quantitative Harris-type theorems for diffusions and McKean-Vlasov processes". In: Transactions of the American Mathematical Society 371, pp. 7135–7173.
Fiez, Tanner, Benjamin Chasnov, and Lillian J. Ratliff (2019). "Convergence of Learning Dynamics in Stackelberg Games".
Gallouët, Thomas O. and Léonard Monsaingeon (2016). "A JKO Splitting Scheme for Kantorovich-Fisher-Rao Gradient Flows". In: SIAM J. Math. Analysis 49, pp. 1100–1130.
Ghosh, Arnab, Viveka Kulharia, Vinay Namboodiri, Philip H. S. Torr, and Puneet K. Dokania (2018). "Multi-Agent Diverse Generative Adversarial Networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Gidel, Gauthier, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, and Simon Lacoste-Julien (2019). "A variational inequality perspective on generative adversarial networks". In: International Conference on Learning Representations.
Glicksberg, Irving L. (1952). "A further generalization of the Kakutani fixed point theorem, with application to Nash equilibrium points". In: Proceedings of the American Mathematical Society 3.1, pp. 170–174.
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio (2014). "Generative adversarial nets". In: Advances in Neural Information Processing Systems.
Grnarova, Paulina, Kfir Y. Levy, Aurelien Lucchi, Thomas Hofmann, and Andreas Krause (2017). "An online learning approach to generative adversarial networks". In: arXiv:1706.03269.
Heusel, Martin, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter (2017).
"GANs trained by a two time-scale update rule converge to a local Nash equilibrium". In: Advances in Neural Information Processing Systems.
Hsieh, Ya-Ping, Chen Liu, and Volkan Cevher (2019). "Finding Mixed Nash Equilibria of Generative Adversarial Networks". In: Proceedings of the International Conference on Machine Learning.
Huang, Wen, Min Ji, Zhenxin Liu, and Yingfei Yi (2015). "Steady States of Fokker-Planck Equations: I. Existence". In: Journal of Dynamics and Differential Equations 27, pp. 721–742.
Jin, Chi, Praneeth Netrapalli, and Michael I. Jordan (2019). "Minmax optimization: Stable limit points of gradient descent ascent are locally optimal". In: arXiv:1902.00618.
Juditsky, Anatoli, Arkadi Nemirovski, and Claire Tauvel (2011). "Solving variational inequalities with stochastic mirror-prox algorithm". In: Stochastic Systems 1.1, pp. 17–58.
Kallenberg, O. (2002). Foundations of Modern Probability. Probability and Its Applications. Springer New York.
Kondratyev, Stanislav, Léonard Monsaingeon, Dmitry Vorotnikov, et al. (2016). "A new optimal transport distance on the space of finite Radon measures". In: Advances in Differential Equations 21.11/12, pp. 1117–1164.
Lacker, Daniel (2018). "Mean field games and interacting particle systems". Preprint.
Lei, Qi, Jason D. Lee, Alexandros G. Dimakis, and Constantinos Daskalakis (2019). "SGD Learns One-Layer Networks in WGANs". In: arXiv:1910.07030.
Liero, Matthias, Alexander Mielke, and Giuseppe Savaré (2018). "Optimal Entropy-Transport problems and a new Hellinger-Kantorovich distance between positive measures". In: Inventiones mathematicae 211.3, pp. 969–1117.
Lin, Qihang, Mingrui Liu, Hassan Rafique, and Tianbao Yang (2018). "Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality". In: arXiv:1810.10207.
Markowich, P. A. and C. Villani (1999). "On The Trend To Equilibrium For The Fokker-Planck Equation: An Interplay Between Physics And Functional Analysis".
In: Matematica Contemporanea (SBM) 19, pp. 1–29.
Mazumdar, Eric V., Michael I. Jordan, and S. Shankar Sastry (2019). "On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games". In: arXiv:1901.00838.
Mei, Song, Andrea Montanari, and Phan-Minh Nguyen (2018). "A mean field view of the landscape of two-layer neural networks". In: Proceedings of the National Academy of Sciences 115.33, E7665–E7671.
Mertikopoulos, Panayotis, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras (2019). "Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile". In: International Conference on Learning Representations.
Nash, John (1951). "Non-cooperative games". In: Annals of Mathematics, pp. 286–295.
Nikaidô, Hukukane and Kazuo Isoda (1955). "Note on Non-Cooperative Convex Games". In: Pacific Journal of Mathematics 5.Suppl. 1, pp. 807–815.
Nouiehed, Maher, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, and Meisam Razaviyayn (2019). "Solving a class of non-convex min-max games using iterative first order methods". In: Advances in Neural Information Processing Systems, pp. 14905–14916.
Posner, Edward C. (1975). "Random Coding Strategies for Minimum Entropy". In: IEEE Transactions on Information Theory 21.4, pp. 388–391.
Racanière, Sébastien, Théophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomenech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. (2017). "Imagination-augmented agents for deep reinforcement learning". In: Advances in Neural Information Processing Systems.
Rosen, J. B. (1965). "Existence and Uniqueness of Equilibrium Points for Concave N-Person Games". In: Econometrica 33.3, pp. 520–534.
Rotskoff, Grant, Samy Jelassi, Joan Bruna, and Eric Vanden-Eijnden (2019). "Global convergence of neuron birth-death dynamics". In: arXiv:1902.01843.
Rotskoff, Grant and Eric Vanden-Eijnden (2018).
"Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error". In: arXiv:1805.00915.
Sanjabi, Maziar, Meisam Razaviyayn, and Jason D. Lee (2018). "Solving non-convex non-concave min-max games under Polyak-Łojasiewicz condition". In: arXiv:1812.02878.
Sirignano, Justin and Konstantinos Spiliopoulos (2019). "Mean field analysis of neural networks: A central limit theorem". In: Stochastic Processes and their Applications.
Sznitman, Alain-Sol (1991). "Topics in propagation of chaos". In: École d'Été de Probabilités de Saint-Flour.
Wayne, Greg and L. F. Abbott (2014). "Hierarchical control using networks trained with higher-level forward models". In: Neural Computation 26.10, pp. 2163–2193.