# Learning Mean-Field Games

Xin Guo (University of California, Berkeley, xinguo@berkeley.edu), Anran Hu (University of California, Berkeley, anran_hu@berkeley.edu), Renyuan Xu (University of California, Berkeley, renyuanxu@berkeley.edu), Junzi Zhang (Stanford University, junziz@stanford.edu)

**Abstract.** This paper presents a general mean-field game (GMFG) framework for simultaneous learning and decision-making in stochastic games with a large population. It first establishes the existence of a unique Nash equilibrium (NE) for this GMFG, and shows that naively combining Q-learning with the fixed-point approach in classical MFGs yields unstable algorithms. It then proposes a Q-learning algorithm with Boltzmann policy (GMF-Q), together with an analysis of its convergence property and computational complexity. Experiments on repeated Ad auction problems demonstrate that the GMF-Q algorithm is efficient and robust in terms of convergence and learning accuracy. Moreover, its performance is superior in convergence, stability, and learning ability when compared with existing algorithms for multi-agent reinforcement learning.

## 1 Introduction

**Motivating example.** This paper is motivated by the following Ad auction problem for an advertiser. An Ad auction is a stochastic game on an Ad exchange platform among a large number of players, the advertisers. In between the time a web user requests a page and the time the page is displayed, usually within a millisecond, a Vickrey-type second-best-price auction is run to incentivize interested advertisers to bid for an Ad slot to display an advertisement. Each advertiser has limited information before each bid: first, her own valuation for a slot depends on an unknown conversion rate of clicks for the item; second, should she win the bid, she only learns the reward after the user's activities on the website are finished. In addition, she faces a budget constraint in this repeated auction. The question is: how should she bid in this online sequential repeated game when there is a large population of bidders competing on the Ad platform, with unknown distributions of the conversion of clicks and rewards?

Besides the Ad auction, there are many real-world problems involving a large number of players and unknown systems. Examples include massive multi-player online role-playing games [19], high-frequency trading [24], and the sharing economy [13].

**Our work.** Motivated by these problems, we consider a general framework of simultaneous learning and decision-making in stochastic games with a large population. We formulate a general mean-field game (GMFG) that incorporates action distributions and (randomized) relaxed policies, with unknown rewards and dynamics. This general framework can also be viewed as a generalized version of MFGs of McKean-Vlasov type [1], which is a different paradigm from the classical MFG. It is also beyond the scope of the existing Q-learning framework for Markov decision problems (MDPs) with unknown distributions, as an MDP is technically equivalent to a single-player stochastic game.

On the theory front, this general framework differs from all existing MFGs. We establish, under appropriate technical conditions, the existence and uniqueness of the Nash equilibrium (NE) for this GMFG. On the computational front, we show that naively combining Q-learning with the three-step fixed-point approach in classical MFGs yields unstable algorithms.
We then propose a Q-learning algorithm with Boltzmann policy (GMF-Q), establish its convergence property, and analyze its computational complexity. Finally, we apply this GMF-Q algorithm to the Ad auction problem, where it demonstrates efficiency and robustness in terms of convergence and learning. Moreover, its performance is superior to existing multi-agent reinforcement learning algorithms in convergence, stability, and learning accuracy.

**Related works.** On learning large population games with mean-field approximations, [39] focuses on inverse reinforcement learning for MFGs without decision-making, [40] studies an MARL problem with a first-order mean-field approximation term modeling the interaction between one player and all the other finite players, and [22] and [41] consider model-based adaptive learning for MFGs in specific models (e.g., linear-quadratic and oscillator games). More recently, [26] studies the local convergence of actor-critic algorithms on finite-time-horizon MFGs, and [34] proposes a policy-gradient-based algorithm and analyzes the so-called local NE for reinforcement learning in infinite-time-horizon MFGs. For learning large population games without mean-field approximation, see [14, 21] and the references therein. On the specific topic of learning auctions with a large number of advertisers, [6] and [20] explore reinforcement learning techniques to search for socially optimal solutions with real-world data, and [18] uses MFGs to model the auction system with unknown conversion of clicks within a Bayesian framework. However, none of these works consider the problem of simultaneous learning and decision-making in a general MFG framework. Neither do they establish the existence and uniqueness of the (global) NE, nor do they present model-free learning algorithms with complexity analysis and convergence to the NE. Note that, in principle, global results are harder to obtain than local results.

## 2 Framework of General MFG (GMFG)

### 2.1 Background: classical N-player Markovian game and MFG

Let us first recall the classical $N$-player game. There are $N$ players in a game. At each step $t$, the state of player $i$ ($=1,2,\dots,N$) is $s^i_t \in \mathcal{S} \subseteq \mathbb{R}^d$ and she takes an action $a^i_t \in \mathcal{A} \subseteq \mathbb{R}^p$. Here $d, p$ are positive integers, and $\mathcal{S}$ and $\mathcal{A}$ are compact (for example, finite) state and action spaces, respectively. Given the current state profile of the $N$ players $\boldsymbol{s}_t = (s^1_t, \dots, s^N_t) \in \mathcal{S}^N$ and the action $a^i_t$, player $i$ receives a reward $r^i(\boldsymbol{s}_t, a^i_t)$ and her state changes to $s^i_{t+1}$ according to a transition probability function $P^i(\boldsymbol{s}_t, a^i_t)$. A Markovian game further restricts the admissible policy/control for player $i$ to be of the form $a^i_t \sim \pi^i_t(\boldsymbol{s}_t)$. That is, $\pi^i_t : \mathcal{S}^N \to \mathcal{P}(\mathcal{A})$ maps each state profile $\boldsymbol{s} \in \mathcal{S}^N$ to a randomized action, with $\mathcal{P}(\mathcal{X})$ the space of probability measures on a space $\mathcal{X}$. The accumulated reward (a.k.a. the value function) for player $i$, given the initial state profile $\boldsymbol{s}$ and the policy profile sequence $\boldsymbol{\pi} := \{\boldsymbol{\pi}_t\}_{t=0}^{\infty}$ with $\boldsymbol{\pi}_t = (\pi^1_t, \dots, \pi^N_t)$, is then defined as
$$V^i(\boldsymbol{s}, \boldsymbol{\pi}) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r^i(\boldsymbol{s}_t, a^i_t) \,\Big|\, \boldsymbol{s}_0 = \boldsymbol{s}\right],$$
where $\gamma \in (0,1)$ is the discount factor, $a^i_t \sim \pi^i_t(\boldsymbol{s}_t)$, and $s^i_{t+1} \sim P^i(\boldsymbol{s}_t, a^i_t)$. The goal of each player is to maximize her value function over all admissible policy sequences. In general, this type of stochastic $N$-player game is notoriously hard to analyze, especially when $N$ is large [28].
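To make the definition of $V^i$ concrete, here is a small, self-contained sketch that estimates the discounted value by Monte Carlo simulation for a toy finite game. The reward, transition, and policy functions below are arbitrary placeholders for illustration only, not part of the paper's model.

```python
import numpy as np

# Hypothetical toy setup (not from the paper): N players, finite state/action spaces,
# and placeholder reward/transition/policy functions used only to illustrate V^i above.
rng = np.random.default_rng(0)
N, n_states, n_actions, gamma = 3, 4, 3, 0.9

def reward(i, state_profile, a_i):
    # placeholder for r^i(s_t, a^i_t); any bounded function works for the illustration
    return float(state_profile[i] == a_i)

def transition(i, state_profile, a_i):
    # placeholder sampler for s^i_{t+1} ~ P^i(s_t, a^i_t)
    return (state_profile[i] + a_i) % n_states

def policy(i, state_profile):
    # a randomized Markovian policy pi^i_t(s_t): here, uniform over actions
    return int(rng.integers(n_actions))

def value_estimate(i, s0, n_paths=2000, horizon=100):
    """Monte Carlo estimate of V^i(s, pi) = E[sum_t gamma^t r^i(s_t, a^i_t) | s_0 = s]."""
    total = 0.0
    for _ in range(n_paths):
        s = list(s0)
        ret, disc = 0.0, 1.0
        for _ in range(horizon):   # truncate the infinite horizon; tail is O(gamma^horizon)
            a = [policy(j, s) for j in range(N)]
            ret += disc * reward(i, s, a[i])
            s = [transition(j, s, a[j]) for j in range(N)]
            disc *= gamma
        total += ret
    return total / n_paths

print(value_estimate(0, s0=[0, 1, 2]))
```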
Mean-field game (MFG) theory, pioneered by [17] and [23] in the continuous setting and later developed in [4, 10, 16, 25, 33] for discrete settings, provides an ingenious and tractable aggregation approach to approximate the otherwise challenging $N$-player stochastic games. The basic idea of an MFG goes as follows. Assume all players are identical, indistinguishable, and interchangeable. When $N \to \infty$, one can view the limit of the other players' states $\boldsymbol{s}^{-i}_t = (s^1_t, \dots, s^{i-1}_t, s^{i+1}_t, \dots, s^N_t)$ as a population state distribution $\mu_t$ with
$$\mu_t(s) := \lim_{N \to \infty} \frac{\sum_{j=1, j \neq i}^{N} I_{s^j_t = s}}{N}.$$
(Here the indicator function $I_{s^j_t = s} = 1$ if $s^j_t = s$ and $0$ otherwise.) Due to the homogeneity of the players, one can then focus on a single (representative) player. That is, in an MFG, one may consider instead the following optimization problem:
$$\begin{aligned}
\text{maximize}_{\boldsymbol{\pi}} \quad & V(s, \boldsymbol{\pi}, \boldsymbol{\mu}) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, \mu_t) \,\Big|\, s_0 = s\right] \\
\text{subject to} \quad & s_{t+1} \sim P(s_t, a_t, \mu_t), \quad a_t \sim \pi_t(s_t, \mu_t),
\end{aligned}$$
where $\boldsymbol{\pi} := \{\pi_t\}_{t=0}^{\infty}$ denotes the policy sequence and $\boldsymbol{\mu} := \{\mu_t\}_{t=0}^{\infty}$ the distribution flow. In this MFG setting, at time $t$, after the representative player chooses her action $a_t$ according to some policy $\pi_t$, she will receive reward $r(s_t, a_t, \mu_t)$ and her state will evolve under a controlled stochastic dynamics of mean-field type $P(\cdot|s_t, a_t, \mu_t)$. Here the policy $\pi_t$ depends on both the current state $s_t$ and the current population state distribution $\mu_t$, so that $\pi_t : \mathcal{S} \times \mathcal{P}(\mathcal{S}) \to \mathcal{P}(\mathcal{A})$.

### 2.2 General MFG (GMFG)

In the classical MFG setting, the reward and the dynamics for each player are known. They depend only on $s_t$, the state of the player, $a_t$, the action of this particular player, and $\mu_t$, the population state distribution. In contrast, in the motivating auction example, the reward and the dynamics are unknown; they depend on the actions of all players, as well as on $s_t$ and $\mu_t$. We therefore define the following general MFG (GMFG) framework. At time $t$, after the representative player chooses her action $a_t$ according to some policy $\pi_t : \mathcal{S} \times \mathcal{P}(\mathcal{S}) \to \mathcal{P}(\mathcal{A})$, she will receive a reward $r(s_t, a_t, \mathcal{L}_t)$ and her state will evolve according to $P(\cdot|s_t, a_t, \mathcal{L}_t)$, where $r$ and $P$ are possibly unknown. The objective of the player is to solve the following control problem:
$$\begin{aligned}
\text{maximize}_{\boldsymbol{\pi}} \quad & V(s, \boldsymbol{\pi}, \boldsymbol{\mathcal{L}}) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, \mathcal{L}_t) \,\Big|\, s_0 = s\right] \\
\text{subject to} \quad & s_{t+1} \sim P(s_t, a_t, \mathcal{L}_t), \quad a_t \sim \pi_t(s_t, \mu_t).
\end{aligned} \tag{GMFG}$$
Here, $\boldsymbol{\mathcal{L}} := \{\mathcal{L}_t\}_{t=0}^{\infty}$, with $\mathcal{L}_t = \mathbb{P}_{s_t, a_t} \in \mathcal{P}(\mathcal{S} \times \mathcal{A})$ the joint distribution of the state and the action (i.e., the population state-action pair). $\mathcal{L}_t$ has marginal distribution $\alpha_t$ for the population action and $\mu_t$ for the population state. Notice that $\{\mathcal{L}_t\}_{t=0}^{\infty}$ could depend on time: an infinite-time-horizon MFG could still have a time-dependent NE solution, due to the mean information process (game interaction) in the MFG. This is fundamentally different from the theory of single-agent MDPs, where the optimal control, if it exists uniquely, is time independent in the infinite-time-horizon setting.

In this framework, we adopt the well-known Nash equilibrium (NE) for analyzing stochastic games.

**Definition 2.1 (NE for GMFGs).** In (GMFG), a player-population profile $(\boldsymbol{\pi}^*, \boldsymbol{\mathcal{L}}^*) := (\{\pi^*_t\}_{t=0}^{\infty}, \{\mathcal{L}^*_t\}_{t=0}^{\infty})$ is called an NE if

1. (Single player side) Fix $\boldsymbol{\mathcal{L}}^*$; then for any policy sequence $\boldsymbol{\pi} := \{\pi_t\}_{t=0}^{\infty}$ and any initial state $s \in \mathcal{S}$,
$$V(s, \boldsymbol{\pi}^*, \boldsymbol{\mathcal{L}}^*) \geq V(s, \boldsymbol{\pi}, \boldsymbol{\mathcal{L}}^*). \tag{2}$$
2. (Population side) $\mathbb{P}_{s_t, a_t} = \mathcal{L}^*_t$ for all $t \geq 0$, where $\{s_t, a_t\}_{t=0}^{\infty}$ is the dynamics under the policy sequence $\boldsymbol{\pi}^*$ starting from $s_0 \sim \mu^*_0$, with $a_t \sim \pi^*_t(s_t, \mu^*_t)$, $s_{t+1} \sim P(\cdot|s_t, a_t, \mathcal{L}^*_t)$, and $\mu^*_t$ being the population state marginal of $\mathcal{L}^*_t$.

The single player side condition captures the optimality of $\boldsymbol{\pi}^*$ when the population side is fixed.
The population side condition ensures the consistency of the solution: it guarantees that the state and action distribution flow of the single player matches the population state and action sequence $\boldsymbol{\mathcal{L}}^*$.

### 2.3 Example: GMFG for the repeated auction

Now consider the repeated Vickrey auction with a budget constraint from Section 1, and take a representative advertiser in the auction. Denote by $s_t \in \{0, 1, 2, \dots, s_{\max}\}$ the budget of this player at time $t$, where $s_{\max} \in \mathbb{N}^+$ is the maximum budget allowed on the Ad exchange with a unit bidding price. Denote by $a_t$ the bid price submitted by this player and by $\alpha_t$ the bidding (action) distribution of the population. The reward for this advertiser with bid $a_t$ and budget $s_t$ is
$$r_t = I_{\{w^M_t = 1\}}\Big[(v_t - a^M_t) - (1+\rho)\, I_{\{s_t < a^M_t\}}\,(a^M_t - s_t)\Big],$$
where $w^M_t = 1$ indicates that she wins the bid against the $M$ competing bids, $a^M_t$ is the corresponding second-best price she pays upon winning, $v_t$ is her conversion value, and $\rho$ is the penalty rate for overshooting the budget. That is, if this player does not win the bid, the budget remains the same. If she wins and has enough money to pay, her budget decreases from $s_t$ to $s_t - a^M_t$. However, if she wins but does not have enough money, her budget becomes $0$ after the payment and there is a penalty in the reward function. Note that in this game, both the rewards $r_t$ and the dynamics of $s_t$ are unknown a priori. In practice, one often modifies the dynamics of $s_{t+1}$ with a non-negative random budget fulfillment $\Delta(s_{t+1})$ after the auction clearing [11], such that $\hat{s}_{t+1} = s_{t+1} + \Delta(s_{t+1})$. Particular choices of $\Delta(s_{t+1})$ are given in the experiment section (Section 5).

## 3 Solution for GMFGs

We now establish the existence and uniqueness of the NE of (GMFG) by generalizing the classical fixed-point approach for MFGs to this GMFG setting (see [17] and [23] for the classical case). It consists of three steps.

**Step A.** Fix $\boldsymbol{\mathcal{L}} := \{\mathcal{L}_t\}_{t=0}^{\infty}$; then (GMFG) becomes a classical optimization problem. Indeed, with $\boldsymbol{\mathcal{L}}$ fixed, the population state distribution sequence $\boldsymbol{\mu} := \{\mu_t\}_{t=0}^{\infty}$ is also fixed, hence the space of admissible policies is reduced to the single-player case. Solving (GMFG) is now reduced to finding, over all admissible $\boldsymbol{\pi}_{\boldsymbol{\mathcal{L}}} = \{\pi_{t,\boldsymbol{\mathcal{L}}}\}_{t=0}^{\infty}$ with $\pi_{t,\boldsymbol{\mathcal{L}}} \in \Pi := \{\pi \mid \pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})\}$, a policy sequence $\boldsymbol{\pi}^*_{\boldsymbol{\mathcal{L}}} := \{\pi^*_{t,\boldsymbol{\mathcal{L}}}\}_{t=0}^{\infty}$ that maximizes
$$V(s, \boldsymbol{\pi}_{\boldsymbol{\mathcal{L}}}, \boldsymbol{\mathcal{L}}) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, \mathcal{L}_t) \,\Big|\, s_0 = s\right], \quad \text{subject to } s_{t+1} \sim P(s_t, a_t, \mathcal{L}_t), \; a_t \sim \pi_{t,\boldsymbol{\mathcal{L}}}(s_t).$$
Notice that with $\boldsymbol{\mathcal{L}}$ fixed, one can safely suppress the dependence on $\mu_t$ in the admissible policies. Moreover, given this fixed $\boldsymbol{\mathcal{L}}$ sequence and the solution $\boldsymbol{\pi}^*_{\boldsymbol{\mathcal{L}}} := \{\pi^*_{t,\boldsymbol{\mathcal{L}}}\}_{t=0}^{\infty}$, one can define a mapping from the fixed population distribution sequence $\boldsymbol{\mathcal{L}}$ to an arbitrarily chosen optimal randomized policy sequence:
$$\Gamma_1 : \{\mathcal{P}(\mathcal{S} \times \mathcal{A})\}_{t=0}^{\infty} \to \{\Pi\}_{t=0}^{\infty}, \qquad \boldsymbol{\pi}^*_{\boldsymbol{\mathcal{L}}} = \Gamma_1(\boldsymbol{\mathcal{L}}).$$
Note that this $\boldsymbol{\pi}^*_{\boldsymbol{\mathcal{L}}}$ sequence satisfies the single player side condition in Definition 2.1 for the population state-action pair sequence $\boldsymbol{\mathcal{L}}$. That is, $V(s, \boldsymbol{\pi}^*_{\boldsymbol{\mathcal{L}}}, \boldsymbol{\mathcal{L}}) \geq V(s, \boldsymbol{\pi}, \boldsymbol{\mathcal{L}})$ for any policy sequence $\boldsymbol{\pi} = \{\pi_t\}_{t=0}^{\infty}$ and any initial state $s \in \mathcal{S}$.

As in the MFG literature [17], a feedback regularity condition is needed for analyzing Step A.

**Assumption 1.** There exists a constant $d_1 \geq 0$ such that for any $\boldsymbol{\mathcal{L}}, \boldsymbol{\mathcal{L}}' \in \{\mathcal{P}(\mathcal{S} \times \mathcal{A})\}_{t=0}^{\infty}$,
$$D(\Gamma_1(\boldsymbol{\mathcal{L}}), \Gamma_1(\boldsymbol{\mathcal{L}}')) \leq d_1 \mathcal{W}_1(\boldsymbol{\mathcal{L}}, \boldsymbol{\mathcal{L}}'), \tag{4}$$
where
$$D(\boldsymbol{\pi}, \boldsymbol{\pi}') := \sup_{s \in \mathcal{S}} \mathcal{W}_1(\boldsymbol{\pi}(s), \boldsymbol{\pi}'(s)) = \sup_{s \in \mathcal{S}} \sup_{t \in \mathbb{N}} W_1(\pi_t(s), \pi'_t(s)), \qquad \mathcal{W}_1(\boldsymbol{\mathcal{L}}, \boldsymbol{\mathcal{L}}') := \sup_{t \in \mathbb{N}} W_1(\mathcal{L}_t, \mathcal{L}'_t), \tag{5}$$
and $W_1$ is the $\ell_1$-Wasserstein distance between probability measures [9, 31, 37].

**Step B.** Based on the analysis in Step A and $\boldsymbol{\pi}^*_{\boldsymbol{\mathcal{L}}} = \{\pi^*_{t,\boldsymbol{\mathcal{L}}}\}_{t=0}^{\infty}$, update the initial sequence $\boldsymbol{\mathcal{L}}$ to $\boldsymbol{\mathcal{L}}'$ following the controlled dynamics $P(\cdot|s_t, a_t, \mathcal{L}_t)$.
Accordingly, for any admissible policy sequence $\boldsymbol{\pi} \in \{\Pi\}_{t=0}^{\infty}$ and any joint population state-action pair sequence $\boldsymbol{\mathcal{L}} \in \{\mathcal{P}(\mathcal{S} \times \mathcal{A})\}_{t=0}^{\infty}$, define a mapping $\Gamma_2 : \{\Pi\}_{t=0}^{\infty} \times \{\mathcal{P}(\mathcal{S} \times \mathcal{A})\}_{t=0}^{\infty} \to \{\mathcal{P}(\mathcal{S} \times \mathcal{A})\}_{t=0}^{\infty}$ as follows:
$$\Gamma_2(\boldsymbol{\pi}, \boldsymbol{\mathcal{L}}) := \hat{\boldsymbol{\mathcal{L}}} = \{\mathbb{P}_{s_t, a_t}\}_{t=0}^{\infty}, \tag{6}$$
where $s_{t+1} \sim \mu_t P(\cdot \mid \cdot, a_t, \mathcal{L}_t)$, $a_t \sim \pi_t(s_t)$, $s_0 \sim \mu_0$, and $\mu_t$ is the population state marginal of $\mathcal{L}_t$.

One needs a standard assumption in this step.

**Assumption 2.** There exist constants $d_2, d_3 \geq 0$ such that for any admissible policy sequences $\boldsymbol{\pi}, \boldsymbol{\pi}^1, \boldsymbol{\pi}^2$ and joint distribution sequences $\boldsymbol{\mathcal{L}}, \boldsymbol{\mathcal{L}}^1, \boldsymbol{\mathcal{L}}^2$,
$$\mathcal{W}_1(\Gamma_2(\boldsymbol{\pi}^1, \boldsymbol{\mathcal{L}}), \Gamma_2(\boldsymbol{\pi}^2, \boldsymbol{\mathcal{L}})) \leq d_2 D(\boldsymbol{\pi}^1, \boldsymbol{\pi}^2), \tag{7}$$
$$\mathcal{W}_1(\Gamma_2(\boldsymbol{\pi}, \boldsymbol{\mathcal{L}}^1), \Gamma_2(\boldsymbol{\pi}, \boldsymbol{\mathcal{L}}^2)) \leq d_3 \mathcal{W}_1(\boldsymbol{\mathcal{L}}^1, \boldsymbol{\mathcal{L}}^2). \tag{8}$$
Assumption 2 can be reduced to Lipschitz continuity and boundedness of the transition dynamics $P$ (see the Appendix for more details).

**Step C.** Repeat Step A and Step B until $\boldsymbol{\mathcal{L}}'$ matches $\boldsymbol{\mathcal{L}}$. This step takes care of the population side condition. To ensure the convergence of the combined Step A and Step B, it suffices that $\Gamma : \{\mathcal{P}(\mathcal{S} \times \mathcal{A})\}_{t=0}^{\infty} \to \{\mathcal{P}(\mathcal{S} \times \mathcal{A})\}_{t=0}^{\infty}$, with $\Gamma(\boldsymbol{\mathcal{L}}) := \Gamma_2(\Gamma_1(\boldsymbol{\mathcal{L}}), \boldsymbol{\mathcal{L}})$, is a contraction under the $\mathcal{W}_1$ distance. Then, by the Banach fixed-point theorem and the completeness of the related metric spaces, there exists a unique NE of the GMFG. In summary, we have:

**Theorem 1 (Existence and uniqueness of the GMFG solution).** Given Assumptions 1 and 2, and assuming $d_1 d_2 + d_3 < 1$, there exists a unique NE of (GMFG).

## 4 RL Algorithms for (stationary) GMFGs

In this section, we design a computational algorithm for the GMFG. Since the reward and transition distributions are unknown, this amounts to simultaneously learning the system and finding the NE of the game. We focus on the case with finite state and action spaces, i.e., $|\mathcal{S}|, |\mathcal{A}| < \infty$, and look for stationary (time-independent) NEs. Accordingly, we abbreviate $\boldsymbol{\pi} := \{\pi\}_{t=0}^{\infty}$ and $\boldsymbol{\mathcal{L}} := \{\mathcal{L}\}_{t=0}^{\infty}$ as $\pi$ and $\mathcal{L}$, respectively. This stationarity property enables developing an appropriate time-independent Q-learning algorithm, suitable for an infinite-time-horizon game. The modification from the GMFG framework to this special stationary setting is straightforward and is left to Appendix B. Note that the assumptions guaranteeing the existence and uniqueness of GMFG solutions are slightly different between the stationary and non-stationary cases; for instance, compare (7)-(8) with (21)-(22).

The algorithm consists of two steps, parallel to Step A and Step B in Section 3.

**Step 1: Q-learning with stability for fixed $\mathcal{L}$.** With $\mathcal{L}$ fixed, the problem becomes a standard learning problem for an infinite-horizon MDP. We focus on the Q-learning algorithm [35, 32], which approximates value iteration by stochastic approximation. At each step, with the state $s$ and an action $a$, the system reaches state $s'$ according to the controlled dynamics, and the Q-function is updated according to
$$Q_{\mathcal{L}}(s, a) \leftarrow (1 - \beta_t(s, a))\, Q_{\mathcal{L}}(s, a) + \beta_t(s, a) \left[ r(s, a, \mathcal{L}) + \gamma \max_{a'} Q_{\mathcal{L}}(s', a') \right], \tag{9}$$
where the step size $\beta_t(s, a)$ can be chosen as (cf. [7])
$$\beta_t(s, a) = \begin{cases} \big(\#(s, a, t) + 1\big)^{-h}, & (s, a) = (s_t, a_t), \\ 0, & \text{otherwise}, \end{cases}$$
with $h \in (1/2, 1)$. Here $\#(s, a, t)$ is the number of times the pair $(s, a)$ has been visited up to time $t$. The algorithm then chooses the action $a$ based on $Q_{\mathcal{L}}$ with appropriate exploration strategies, including the $\epsilon$-greedy strategy. After obtaining the approximate $\hat{Q}_{\mathcal{L}}$, in order to retrieve an approximately optimal policy, it would be natural to define an argmax-e operator, so that actions with equal maximum Q-values have equal probabilities of being selected.
Unfortunately, the discontinuity and sensitivity of argmax-e could lead to an unstable algorithm (see Figure 4 for the corresponding naive Algorithm 2 in the Appendix).² Instead, we consider a Boltzmann policy based on the operator $\mathrm{softmax}_c : \mathbb{R}^n \to \mathbb{R}^n$, defined as
$$\mathrm{softmax}_c(x)_i = \frac{\exp(c x_i)}{\sum_{j=1}^{n} \exp(c x_j)}. \tag{10}$$
This operator is smooth and close to argmax-e (see Lemma 7 in the Appendix). Moreover, even though Boltzmann policies are not optimal, the difference between the Boltzmann policy and the optimal one can always be controlled by choosing the hyper-parameter $c$ in the softmax operator appropriately. Note that other smoothing operators (e.g., Mellowmax [2]) may also be considered in the future.

**Step 2: error control in updating $\mathcal{L}$.** Given the sub-optimality of the Boltzmann policy, one needs to characterize the difference between the optimal policy and the non-optimal ones. In particular, one can define the action gap between the best action and the second-best action in terms of the Q-value as
$$\delta_s(\mathcal{L}) := \max_{a^* \in \mathcal{A}} Q^*_{\mathcal{L}}(s, a^*) - \max_{a \notin \operatorname{argmax}_{a \in \mathcal{A}} Q^*_{\mathcal{L}}(s, a)} Q^*_{\mathcal{L}}(s, a) > 0.$$
The action gap is important for approximation algorithms [3], and is closely related to problem-dependent bounds for regret analysis in reinforcement learning and multi-armed bandits, and to advantage learning algorithms including A2C [27]. The problem is: in order for the learning algorithm to converge in terms of $\mathcal{L}$ (Theorem 2), one needs to ensure a definite separation between the optimal policy and the sub-optimal ones. This is problematic, as the infimum of $\delta_s(\mathcal{L})$ over an infinite number of $\mathcal{L}$ can be $0$. To address this, the population distribution at step $k$, say $\mathcal{L}_k$, needs to be projected onto a finite grid, called an $\epsilon$-net. The relation between the $\epsilon$-net and action gaps is as follows: for any $\epsilon > 0$, there exist a positive function $\phi(\epsilon)$ and an $\epsilon$-net $S_\epsilon := \{\mathcal{L}^{(1)}, \dots, \mathcal{L}^{(N_\epsilon)}\} \subseteq \mathcal{P}(\mathcal{S} \times \mathcal{A})$ with the properties that $\min_{i=1,\dots,N_\epsilon} d_{TV}(\mathcal{L}, \mathcal{L}^{(i)}) \leq \epsilon$ for any $\mathcal{L} \in \mathcal{P}(\mathcal{S} \times \mathcal{A})$, and that $\max_{a^* \in \mathcal{A}} Q^*_{\mathcal{L}^{(i)}}(s, a^*) - Q^*_{\mathcal{L}^{(i)}}(s, a) \geq \phi(\epsilon)$ for any $i = 1, \dots, N_\epsilon$, $s \in \mathcal{S}$, and any $a \notin \operatorname{argmax}_{a \in \mathcal{A}} Q^*_{\mathcal{L}^{(i)}}(s, a)$. Here the existence of $\epsilon$-nets is trivial due to the compactness of the probability simplex $\mathcal{P}(\mathcal{S} \times \mathcal{A})$, and the existence of $\phi(\epsilon)$ comes from the finiteness of the action set $\mathcal{A}$. In practice, $\phi(\epsilon)$ often takes the form $D\epsilon^\alpha$, with $D > 0$ and the exponent $\alpha > 0$ characterizing the decay rate of the action gaps.

Finally, to enable Q-learning, it is assumed that one has access to a population simulator (see [30, 38]). That is, for any policy $\pi \in \Pi$, given the current state $s \in \mathcal{S}$ and any population distribution $\mathcal{L}$, one can obtain the next state $s' \sim P(\cdot|s, \pi(s, \mu), \mathcal{L})$, a reward $r = r(s, \pi(s, \mu), \mathcal{L})$, and the next population distribution $\mathcal{L}' = \mathbb{P}_{s', \pi(s', \mu)}$. For brevity, we denote the simulator as $(s', r, \mathcal{L}') = \mathcal{G}(s, \pi, \mathcal{L})$. Here $\mu$ is the state marginal distribution of $\mathcal{L}$.

In summary, we propose the following Algorithm 1.

**Algorithm 1: Q-learning for GMFGs (GMF-Q)**

1. **Input:** initial $\mathcal{L}_0$, tolerance $\epsilon > 0$.
2. **for** $k = 0, 1, \dots$ **do**
3. Perform Q-learning for $T_k$ iterations to find the approximate Q-function $\hat{Q}_k(s, a) = \hat{Q}_{\mathcal{L}_k}(s, a)$ of an MDP with dynamics $P_{\mathcal{L}_k}(s'|s, a)$ and rewards $r_{\mathcal{L}_k}(s, a)$.
4. Compute $\pi_k \in \Pi$ with $\pi_k(s) = \mathrm{softmax}_c(\hat{Q}_k(s, \cdot))$.
5. Sample $s \sim \mu_k$ ($\mu_k$ is the population state marginal of $\mathcal{L}_k$), and obtain $\tilde{\mathcal{L}}_{k+1}$ from $\mathcal{G}(s, \pi_k, \mathcal{L}_k)$.
6. Find $\mathcal{L}_{k+1} = \mathrm{Proj}_{S_\epsilon}(\tilde{\mathcal{L}}_{k+1})$.
7. **end for**

Note that softmax is applied only at the end of each outer iteration, when a good approximation of the Q-function has been obtained. Within each outer iteration, for the MDP problem with fixed mean-field information, the standard Q-learning method is applied.
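To make the structure of Algorithm 1 concrete, here is a minimal, self-contained Python sketch of the GMF-Q loop for finite state and action spaces. It is not the authors' implementation: `simulator(s, a, L)` is an assumed population simulator that takes a concrete action (a slight simplification of the oracle $\mathcal{G}(s, \pi, \mathcal{L})$ described above), and the $\epsilon$-net projection is realized by a simple truncation, as suggested in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_c(q_row, c):
    # Boltzmann policy from equation (10): softmax_c(x)_i = exp(c x_i) / sum_j exp(c x_j)
    z = np.exp(c * (q_row - q_row.max()))      # shift for numerical stability
    return z / z.sum()

def gmf_q(simulator, n_states, n_actions, L0,
          n_outer=20, T_k=10_000, c=4.0, gamma=0.8, h=0.87, n_digits=4, eps_greedy=0.1):
    """Sketch of Algorithm 1 (GMF-Q). `simulator(s, a, L) -> (s_next, r, L_next)` is an
    assumed population simulator; L is represented as an |S| x |A| probability array."""
    L = np.asarray(L0, dtype=float)            # L_k in P(S x A)
    for _ in range(n_outer):
        # Step 3: inner Q-learning for the MDP induced by the fixed mean-field term L_k.
        Q = np.zeros((n_states, n_actions))
        counts = np.zeros((n_states, n_actions))
        mu = L.sum(axis=1)                     # population state marginal of L_k
        s = rng.choice(n_states, p=mu)
        for _ in range(T_k):
            if rng.random() < eps_greedy:      # epsilon-greedy exploration
                a = int(rng.integers(n_actions))
            else:
                a = int(Q[s].argmax())
            s_next, r, _ = simulator(s, a, L)  # mean-field term held fixed inside the loop
            counts[s, a] += 1
            beta = (counts[s, a] + 1.0) ** (-h)    # step size from [7], with h in (1/2, 1)
            Q[s, a] += beta * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
        # Step 4: Boltzmann policy pi_k(s) = softmax_c(Q_k(s, .)).
        # Step 5: one simulator call under pi_k to update the population distribution.
        s0 = rng.choice(n_states, p=mu)
        a0 = rng.choice(n_actions, p=softmax_c(Q[s0], c))
        _, _, L_tilde = simulator(s0, a0, L)
        # Step 6: projection onto an epsilon-net, here a simple truncation to n_digits.
        L = np.round(np.asarray(L_tilde, dtype=float), n_digits)
        L = L / L.sum()
    return Q, L
```

The default hyper-parameters above mirror the experimental choices reported in Section 5 ($c = 4.0$, $\gamma = 0.8$, $h = 0.87$, truncation to 4 digits); they are placeholders rather than prescriptions.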
² argmax-e is not continuous: let $x = (1, 1)$; then argmax-e$(x) = (1/2, 1/2)$. For any $\epsilon > 0$, let $y = (1, 1-\epsilon)$; then argmax-e$(y) = (1, 0)$.

Here $\mathrm{Proj}_{S_\epsilon}(\mathcal{L}) = \operatorname{argmin}_{\mathcal{L}^{(i)} \in \{\mathcal{L}^{(1)}, \dots, \mathcal{L}^{(N_\epsilon)}\}} d_{TV}(\mathcal{L}^{(i)}, \mathcal{L})$. For computational tractability, it suffices to choose $S_\epsilon$ as a truncation grid, so that the projection of $\tilde{\mathcal{L}}_k$ onto the $\epsilon$-net reduces to truncating $\tilde{\mathcal{L}}_k$ to a certain number of digits; in our experiment, the number of digits is chosen to be 4. The choices of the hyper-parameters $c$ and $T_k$ can be found in Lemma 8 and Theorem 2. In practice, the algorithm is rather robust with respect to these hyper-parameters. In the special case when the rewards $r_{\mathcal{L}}$ and transition dynamics $P(\cdot|s, a, \mathcal{L})$ are known, one can replace the Q-learning step in Algorithm 1 by value iteration, resulting in the GMF-V Algorithm 3 in the Appendix.

We next show the convergence of the GMF-Q algorithm (Algorithm 1) to an $\epsilon$-Nash equilibrium of (GMFG), with a complexity analysis.

**Theorem 2 (Convergence and complexity of GMF-Q).** Assume the same conditions as in Theorem 4 and Lemma 8 in the Appendix. For any tolerances $\epsilon, \delta > 0$, set $\delta_k = \delta/K_{\epsilon,\eta}$, $\epsilon_k = (k+1)^{-(1+\eta)}$ for some $\eta \in (0, 1]$ ($k = 0, \dots, K_{\epsilon,\eta} - 1$), $T_k = T_{M_{\mathcal{L}_k}}(\delta_k, \epsilon_k)$ (defined in Lemma 8 in the Appendix), and $c = \log(1/\epsilon)/\phi(\epsilon)$. Then with probability at least $1 - 2\delta$, $\mathcal{W}_1(\mathcal{L}_{K_{\epsilon,\eta}}, \mathcal{L}^*) \leq C\epsilon$. Moreover, the total number of iterations $T = \sum_{k=0}^{K_{\epsilon,\eta}-1} T_{M_{\mathcal{L}_k}}(\delta_k, \epsilon_k)$ is bounded by³
$$T = O\!\left(K_{\epsilon,\eta}^{\,1+\frac{4}{h}} \big(\log(K_{\epsilon,\eta}/\delta)\big)^{\frac{4}{h}+3}\right). \tag{11}$$
Here $K_{\epsilon,\eta} := 2\left\lceil \max\left\{ (\eta\epsilon/c)^{-1/\eta},\ \log_{d}\!\big(\epsilon/\max\{\operatorname{diam}(\mathcal{S})\operatorname{diam}(\mathcal{A}),\, c\}\big) \right\}\right\rceil + 1$ is the number of outer iterations, $h$ is the step-size exponent in Q-learning (defined in Lemma 8 in the Appendix), and the constant $C$ is independent of $\delta$, $\epsilon$, and $\eta$.

The proof of Theorem 2 in the Appendix relies on the Lipschitz continuity of the softmax operator [8], the closeness between softmax and argmax-e (Lemma 7 in the Appendix), and the complexity of Q-learning for the MDP (Lemma 8 in the Appendix).

## 5 Experiment: repeated auction game

In this section, we report the performance of the proposed GMF-Q algorithm. The objectives of the experiments are 1) to test the convergence, stability, and learning ability of GMF-Q in the GMFG setting, and 2) to compare GMF-Q with existing multi-agent reinforcement learning algorithms, including the IL and MF-Q algorithms. We take the GMFG framework for the repeated auction game from Section 2.3, where each advertiser learns to bid in the auction under a budget constraint.

**Parameters.** The model parameters are set as follows: $|\mathcal{S}| = |\mathcal{A}| = 10$, the overbidding penalty $\rho = 0.2$, the distribution of the conversion rate $v \sim \text{uniform}(\{1, 2, 3, 4\})$, and the competition intensity index $M = 5$. The random fulfillment is chosen as: if $s < s_{\max}$, $\Delta(s) = 1$ with probability $\tfrac{1}{2}$ and $\Delta(s) = 0$ with probability $\tfrac{1}{2}$; if $s = s_{\max}$, $\Delta(s) = 0$. The algorithm parameters are (unless otherwise specified): the temperature parameter $c = 4.0$, the discount factor $\gamma = 0.8$, the parameter $h$ from Lemma 8 in the Appendix set to $h = 0.87$, and the baseline number of inner iterations set to 2000. Recall that for GMF-Q, both $v$ and the transition dynamics $P$ for $s$ are unknown a priori. The 90%-confidence intervals are calculated from 20 sample paths.

**Performance evaluation in the GMFG setting.** Our experiments show that the GMF-Q algorithm is efficient and robust, and learns well.

**Convergence and stability of GMF-Q.** GMF-Q is efficient and robust.
First, GMF-Q converges after about 10 outer iterations; secondly, as the number of inner iterations increases, the error decreases (Figure 2); and finally, the convergence is robust with respect to both the number of states and the initial population distribution (Figure 3). In contrast, the Naive algorithm does not converge even with 10000 inner iterations, and the joint distribution $\mathcal{L}_t$ keeps fluctuating (Figure 4).

³ For instance, when $\eta = 1$, the bound simplifies accordingly. Note that this bound may not be tight.

**Table 1:** $\Delta Q$ with $T_k^{\text{GMF-V}} = 5000$.

| $T_k^{\text{GMF-Q}}$ | 1000 | 3000 | 5000 | 10000 |
|---|---|---|---|---|
| $\Delta Q$ | 0.21263 | 0.1294 | 0.10258 | 0.0989 |

*Figure 1: Q-tables: GMF-Q vs. GMF-V.*

**Learning accuracy of GMF-Q.** GMF-Q learns well. Its learning accuracy is tested against its special form GMF-V (Appendix G), with the latter assuming a known distribution of the conversion rate $v$ and known dynamics $P$ for the budget $s$. The relative $L^2$ distance between the Q-tables of these two algorithms is
$$\Delta Q := \frac{\|Q^{\text{GMF-V}} - Q^{\text{GMF-Q}}\|_2}{\|Q^{\text{GMF-V}}\|_2} = 0.098879.$$
This implies that GMF-Q learns the true GMFG solution with 90% accuracy using 10000 inner iterations. The heatmap in Figure 1(a) is the Q-table for the GMF-Q algorithm after 20 outer iterations, with $T_k^{\text{GMF-Q}} = 10000$ inner iterations within each outer iteration. The heatmap in Figure 1(b) is the Q-table for the GMF-V algorithm after 20 outer iterations, with $T_k^{\text{GMF-V}} = 5000$ inner iterations within each outer iteration.

**Comparison with existing algorithms for N-player games.** To test the effectiveness of GMF-Q for approximating $N$-player games, we next compare GMF-Q with the IL and MF-Q algorithms. The IL algorithm [36] considers $N$ independent players, each of whom solves a decentralized reinforcement learning problem while ignoring the other players in the system. The MF-Q algorithm [40] extends the Nash-Q learning algorithm for the $N$-player game introduced in [15], adds the aggregate action $\bar{a}^{-i} = \frac{\sum_{j \neq i} a^j}{N-1}$ of the opponents, and works for the class of games where the interactions are only through the average actions of the $N$ players.

*Figure 2: Convergence with different numbers of inner iterations.*

*Figure 3: Convergence with different numbers of states.*

*Figure 4: Fluctuations of the Naive algorithm (30 sample paths): (a) fluctuation in $\ell_\infty$; (b) fluctuation in $\ell_1$.*

*Figure 5: Learning accuracy based on $C(\boldsymbol{\pi})$: (a) $|\mathcal{S}| = |\mathcal{A}| = 10$, $N = 20$; (b) $|\mathcal{S}| = |\mathcal{A}| = 20$, $N = 20$; (c) $|\mathcal{S}| = |\mathcal{A}| = 10$, $N = 40$.*

**Performance metric.** We adopt the following metric to measure the difference between a given policy $\boldsymbol{\pi}$ and an NE (here $\epsilon_0 > 0$ is a safeguard, taken as $0.1$ in the experiments):
$$C(\boldsymbol{\pi}) = \frac{1}{N|\mathcal{S}|^N} \sum_{i=1}^{N} \sum_{\boldsymbol{s} \in \mathcal{S}^N} \frac{\max_{\pi^i} V^i(\boldsymbol{s}, (\boldsymbol{\pi}^{-i}, \pi^i)) - V^i(\boldsymbol{s}, \boldsymbol{\pi})}{\left|\max_{\pi^i} V^i(\boldsymbol{s}, (\boldsymbol{\pi}^{-i}, \pi^i))\right| + \epsilon_0}.$$
Clearly $C(\boldsymbol{\pi}) \geq 0$, and $C(\boldsymbol{\pi}^*) = 0$ if and only if $\boldsymbol{\pi}^*$ is an NE. The policy $\operatorname{argmax}_{\pi^i} V^i(\boldsymbol{s}, (\boldsymbol{\pi}^{-i}, \pi^i))$ is called the best response to $\boldsymbol{\pi}^{-i}$; a small evaluation sketch for this metric is given below. A similar metric without normalization has been adopted in [29]. Our experiments (Figure 5) show that GMF-Q is superior in terms of convergence rate, accuracy, and stability for approximating an $N$-player game: GMF-Q converges faster than IL and MF-Q, with the smallest error and the lowest variance, as the $\epsilon$-net improves the stability.
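As referenced above, the following is a minimal sketch of how the metric $C(\boldsymbol{\pi})$ could be evaluated for a small game. It is illustrative only: `value` is an assumed oracle for $V^i(\boldsymbol{s}, \cdot)$ (e.g., obtained by Monte Carlo rollouts or exact dynamic programming), and the best response is approximated by a search over a finite set of candidate policies rather than the full policy space.

```python
def exploitability_metric(value, joint_policy, candidate_policies, state_profiles, eps0=0.1):
    """Sketch of the metric C(pi).

    value(i, s, policies): assumed oracle returning V^i(s, policies).
    joint_policy: tuple (pi^1, ..., pi^N) of per-player policies.
    candidate_policies[i]: finite set of alternatives approximating the max over pi^i.
    state_profiles: iterable over the joint state profiles s in S^N.
    """
    N = len(joint_policy)
    profiles = list(state_profiles)
    total = 0.0
    for i in range(N):
        for s in profiles:
            v_cur = value(i, s, joint_policy)
            # approximate best response of player i, holding the other players fixed
            v_best = max(
                value(i, s, joint_policy[:i] + (pi_i,) + joint_policy[i + 1:])
                for pi_i in candidate_policies[i]
            )
            total += (v_best - v_cur) / (abs(v_best) + eps0)
    return total / (N * len(profiles))
```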
In the comparison of Figure 5, for instance, when $N = 20$, the IL algorithm converges with the largest error, 0.220; the error from MF-Q is 0.101, smaller than that of IL but still larger than that of GMF-Q; and GMF-Q converges with the lowest error, 0.065. Moreover, as $N$ increases, the error of GMF-Q decreases, while the errors of both MF-Q and IL increase significantly. As $|\mathcal{S}|$ and $|\mathcal{A}|$ increase, GMF-Q is robust with respect to this increase in dimensionality, while both MF-Q and IL clearly suffer from the increased dimensionality, with lower convergence rate and accuracy. Therefore, GMF-Q is more scalable than IL and MF-Q when the system is complex and the number of players $N$ is large.

## 6 Conclusion

This paper builds a GMFG framework for simultaneous learning and decision-making, establishes the existence and uniqueness of the NE, and proposes a Q-learning algorithm, GMF-Q, with convergence and complexity analysis. Experiments demonstrate the superior performance of GMF-Q.

**Acknowledgment.** We thank Haoran Tang for the insightful early discussion on stabilizing the Q-learning algorithm and for sharing the ideas of his work on soft Q-learning [12], which motivated our adoption of the softmax operator. We also thank the anonymous NeurIPS 2019 reviewers for their valuable suggestions.

## References

[1] B. Acciaio, J. Backhoff, and R. Carmona. Extended mean field control problems: stochastic maximum principle and transport perspective. arXiv preprint arXiv:1802.05754, 2018.

[2] K. Asadi and M. L. Littman. An alternative softmax operator for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 243-252, 2017.

[3] M. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos. Increasing the action gap: New operators for reinforcement learning. In AAAI Conference on Artificial Intelligence, pages 1476-1483, 2016.

[4] M. Benaim and J. Y. Le Boudec. A class of mean field interaction models for computer and communication systems. Performance Evaluation, 65(11-12):823-838, 2008.

[5] F. Bolley. Separability and completeness for the Wasserstein distance. Séminaire de Probabilités XLI, pages 371-377, 2008.

[6] H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo. Real-time bidding by reinforcement learning in display advertising. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 661-670. ACM, 2017.

[7] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1-25, 2003.

[8] B. Gao and L. Pavel. On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805, 2017.

[9] A. L. Gibbs and F. E. Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419-435, 2002.

[10] D. A. Gomes, J. Mohr, and R. R. Souza. Discrete time, finite state space mean field games. Journal de Mathématiques Pures et Appliquées, 93(3):308-328, 2010.

[11] R. Gummadi, P. Key, and A. Proutiere. Repeated auctions under budget constraints: Optimal bidding strategies and equilibria. In the Eighth Ad Auction Workshop, 2012.

[12] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.

[13] J. Hamari, M. Sjöklint, and A. Ukkonen. The sharing economy: Why people participate in collaborative consumption. Journal of the Association for Information Science and Technology, 67(9):2047-2059, 2016.

[14] P. Hernandez-Leal, B. Kartal, and M. E. Taylor.
Is multiagent deep reinforcement learning the answer or the question? A brief survey. arXiv preprint arXiv:1810.05587, 2018.

[15] J. Hu and M. P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039-1069, 2003.

[16] M. Huang and Y. Ma. Mean field stochastic games with binary action spaces and monotone costs. arXiv preprint arXiv:1701.06661, 2017.

[17] M. Huang, R. P. Malhamé, and P. E. Caines. Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6(3):221-252, 2006.

[18] K. Iyer, R. Johari, and M. Sundararajan. Mean field equilibria of dynamic auctions with learning. ACM SIGecom Exchanges, 10(3):10-14, 2011.

[19] S. H. Jeong, A. R. Kang, and H. K. Kim. Analysis of game bot's behavioral characteristics in social interaction networks of MMORPG. ACM SIGCOMM Computer Communication Review, 45(4):99-100, 2015.

[20] J. Jin, C. Song, H. Li, K. Gai, J. Wang, and W. Zhang. Real-time bidding with multi-agent reinforcement learning in display advertising. arXiv preprint arXiv:1802.09756, 2018.

[21] S. Kapoor. Multi-agent reinforcement learning: A report on challenges and approaches. arXiv preprint arXiv:1807.09427, 2018.

[22] A. C. Kizilkale and P. E. Caines. Mean field stochastic adaptive control. IEEE Transactions on Automatic Control, 58(4):905-920, 2013.

[23] J.-M. Lasry and P.-L. Lions. Mean field games. Japanese Journal of Mathematics, 2(1):229-260, 2007.

[24] C.-A. Lehalle and C. Mouzouni. A mean field game of portfolio trading and its consequences on perceived correlations. arXiv preprint arXiv:1902.09606, 2019.

[25] J. P. M. López. Discrete time mean field games: The short-stage limit. Journal of Dynamics & Games, 2(1):89-101, 2015.

[26] D. Mguni, J. Jennings, and E. M. de Cote. Decentralised learning in systems with many, many strategic agents. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[27] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

[28] C. H. Papadimitriou and T. Roughgarden. Computing equilibria in multi-player games. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 82-91, 2005.

[29] J. Pérolat, B. Piot, and O. Pietquin. Actor-critic fictitious play in simultaneous move multistage games. In International Conference on Artificial Intelligence and Statistics, 2018.

[30] J. Pérolat, F. Strub, B. Piot, and O. Pietquin. Learning Nash equilibrium for general-sum Markov games from batch data. arXiv preprint arXiv:1606.08718, 2016.

[31] G. Peyré and M. Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355-607, 2019.

[32] B. Recht. A tour of reinforcement learning: The view from continuous control. Annual Review of Control, Robotics, and Autonomous Systems, 2018.

[33] N. Saldi, T. Basar, and M. Raginsky. Markov Nash equilibria in mean-field games with discounted cost. SIAM Journal on Control and Optimization, 56(6):4256-4287, 2018.

[34] J. Subramanian and A. Mahajan. Reinforcement learning in stationary mean-field games. In 18th International Conference on Autonomous Agents and Multiagent Systems, pages 251-259, 2019.

[35] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[36] M. Tan. Multi-agent reinforcement learning: independent vs. cooperative agents.
In International Conference on Machine Learning, pages 330-337, 1993.

[37] C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

[38] H. T. Wai, Z. Yang, Z. Wang, and M. Hong. Multi-agent reinforcement learning via double averaging primal-dual optimization. In Advances in Neural Information Processing Systems, pages 9672-9683, 2018.

[39] J. Yang, X. Ye, R. Trivedi, H. Xu, and H. Zha. Deep mean field games for learning optimal behavior policy of large populations. arXiv preprint arXiv:1711.03156, 2017.

[40] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang. Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438, 2018.

[41] H. Yin, P. G. Mehta, S. P. Meyn, and U. V. Shanbhag. Learning in mean-field games. IEEE Transactions on Automatic Control, 59(3):629-644, 2013.