# Learning Mean-Field Games

Xin Guo (University of California, Berkeley, xinguo@berkeley.edu), Anran Hu (University of California, Berkeley, anran_hu@berkeley.edu), Renyuan Xu (University of California, Berkeley, renyuanxu@berkeley.edu), Junzi Zhang (Stanford University, junziz@stanford.edu)

**Abstract.** This paper presents a general mean-field game (GMFG) framework for simultaneous learning and decision-making in stochastic games with a large population. It first establishes the existence of a unique Nash equilibrium (NE) for this GMFG, and shows that naively combining Q-learning with the fixed-point approach in classical MFGs yields unstable algorithms. It then proposes a Q-learning algorithm with Boltzmann policy (GMF-Q), together with an analysis of its convergence property and computational complexity. Experiments on repeated Ad auction problems demonstrate that the GMF-Q algorithm is efficient and robust in terms of convergence and learning accuracy. Moreover, its performance is superior in convergence, stability, and learning ability when compared with existing algorithms for multi-agent reinforcement learning.

## 1 Introduction

**Motivating example.** This paper is motivated by the following Ad auction problem for an advertiser. An Ad auction is a stochastic game on an Ad exchange platform among a large number of players, the advertisers. In between the time a web user requests a page and the time the page is displayed, usually within a millisecond, a Vickrey-type second-best-price auction is run to incentivize interested advertisers to bid for an Ad slot to display an advertisement. Each advertiser has limited information before each bid: first, her own valuation for a slot depends on an unknown conversion rate of clicks for the item; second, should she win the bid, she only learns the reward after the user's activities on the website are finished. In addition, she faces a budget constraint in this repeated auction. The question is: how should she bid in this online sequential repeated game when there is a large population of bidders competing on the Ad platform, with unknown distributions of the conversion of clicks and rewards?

Besides the Ad auction, there are many real-world problems involving a large number of players and unknown systems. Examples include massive multi-player online role-playing games [19], high-frequency trading [24], and the sharing economy [13].

**Our work.** Motivated by these problems, we consider a general framework of simultaneous learning and decision-making in stochastic games with a large population. We formulate a general mean-field game (GMFG) that incorporates action distributions and (randomized) relaxed policies, with unknown rewards and dynamics. This general framework can also be viewed as a generalized version of MFGs of McKean-Vlasov type [1], which is a different paradigm from the classical MFG. It is also beyond the scope of the existing Q-learning framework for Markov decision problems (MDPs) with unknown distributions, as an MDP is technically equivalent to a single-player stochastic game.

On the theory front, this general framework differs from all existing MFGs. We establish, under appropriate technical conditions, the existence and uniqueness of the Nash equilibrium (NE) for this GMFG. On the computational front, we show that naively combining Q-learning with the three-step fixed-point approach in classical MFGs yields unstable algorithms.
We then propose a Q-learning algorithm with Boltzmann policy (GMF-Q), establish its convergence property, and analyze its computational complexity. Finally, we apply this GMF-Q algorithm to the Ad auction problem, where it demonstrates efficiency and robustness in terms of convergence and learning. Moreover, its performance is superior to existing multi-agent reinforcement learning algorithms in convergence, stability, and learning accuracy.

**Related works.** On learning large population games with mean-field approximations, [39] focuses on inverse reinforcement learning for MFGs without decision-making, [40] studies an MARL problem with a first-order mean-field approximation term modeling the interaction between one player and all the other finite players, and [22] and [41] consider model-based adaptive learning for MFGs in specific models (e.g., linear-quadratic and oscillator games). More recently, [26] studies the local convergence of actor-critic algorithms on finite-time-horizon MFGs, and [34] proposes a policy-gradient-based algorithm and analyzes the so-called local NE for reinforcement learning in infinite-time-horizon MFGs. For learning large population games without mean-field approximation, see [14, 21] and the references therein. On the specific topic of learning auctions with a large number of advertisers, [6] and [20] explore reinforcement learning techniques to search for socially optimal solutions with real-world data, and [18] uses MFGs to model the auction system with unknown conversion of clicks within a Bayesian framework. However, none of these works consider the problem of simultaneous learning and decision-making in a general MFG framework. Neither do they establish the existence and uniqueness of the (global) NE, nor do they present model-free learning algorithms with complexity analysis and convergence to the NE. Note that, in principle, global results are harder to obtain than local results.

## 2 Framework of General MFG (GMFG)

### 2.1 Background: classical N-player Markovian game and MFG

Let us first recall the classical $N$-player game. There are $N$ players in a game. At each step $t$, the state of player $i$ ($=1,2,\dots,N$) is $s^i_t \in \mathcal{S} \subseteq \mathbb{R}^d$ and she takes an action $a^i_t \in \mathcal{A} \subseteq \mathbb{R}^p$. Here $d, p$ are positive integers, and $\mathcal{S}$ and $\mathcal{A}$ are compact (for example, finite) state and action spaces, respectively. Given the current state profile of the $N$ players $\boldsymbol{s}_t = (s^1_t, \dots, s^N_t) \in \mathcal{S}^N$ and the action $a^i_t$, player $i$ receives a reward $r^i(\boldsymbol{s}_t, a^i_t)$ and her state changes to $s^i_{t+1}$ according to a transition probability function $P^i(\boldsymbol{s}_t, a^i_t)$. A Markovian game further restricts the admissible policy/control for player $i$ to be of the form $a^i_t \sim \pi^i_t(\boldsymbol{s}_t)$. That is, $\pi^i_t : \mathcal{S}^N \to \mathcal{P}(\mathcal{A})$ maps each state profile $\boldsymbol{s} \in \mathcal{S}^N$ to a randomized action, with $\mathcal{P}(\mathcal{X})$ the space of probability measures on a space $\mathcal{X}$. The accumulated reward (a.k.a. the value function) for player $i$, given the initial state profile $\boldsymbol{s}$ and the policy profile sequence $\boldsymbol{\pi} := \{\boldsymbol{\pi}_t\}_{t=0}^{\infty}$ with $\boldsymbol{\pi}_t = (\pi^1_t, \dots, \pi^N_t)$, is then defined as
$$V^i(\boldsymbol{s}, \boldsymbol{\pi}) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r^i(\boldsymbol{s}_t, a^i_t) \,\Big|\, \boldsymbol{s}_0 = \boldsymbol{s}\right],$$
where $\gamma \in (0,1)$ is the discount factor, $a^i_t \sim \pi^i_t(\boldsymbol{s}_t)$, and $s^i_{t+1} \sim P^i(\boldsymbol{s}_t, a^i_t)$. The goal of each player is to maximize her value function over all admissible policy sequences. In general, this type of stochastic $N$-player game is notoriously hard to analyze, especially when $N$ is large [28].
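To make the definition of $V^i$ concrete, here is a small, self-contained sketch that estimates the discounted value by Monte Carlo simulation for a toy finite game. The reward, transition, and policy functions below are arbitrary placeholders for illustration only, not part of the paper's model.

```python
import numpy as np

# Hypothetical toy setup (not from the paper): N players, finite state/action spaces,
# and placeholder reward/transition/policy functions used only to illustrate V^i above.
rng = np.random.default_rng(0)
N, n_states, n_actions, gamma = 3, 4, 3, 0.9

def reward(i, state_profile, a_i):
    # placeholder for r^i(s_t, a^i_t); any bounded function works for the illustration
    return float(state_profile[i] == a_i)

def transition(i, state_profile, a_i):
    # placeholder sampler for s^i_{t+1} ~ P^i(s_t, a^i_t)
    return (state_profile[i] + a_i) % n_states

def policy(i, state_profile):
    # a randomized Markovian policy pi^i_t(s_t): here, uniform over actions
    return int(rng.integers(n_actions))

def value_estimate(i, s0, n_paths=2000, horizon=100):
    """Monte Carlo estimate of V^i(s, pi) = E[sum_t gamma^t r^i(s_t, a^i_t) | s_0 = s]."""
    total = 0.0
    for _ in range(n_paths):
        s = list(s0)
        ret, disc = 0.0, 1.0
        for _ in range(horizon):   # truncate the infinite horizon; tail is O(gamma^horizon)
            a = [policy(j, s) for j in range(N)]
            ret += disc * reward(i, s, a[i])
            s = [transition(j, s, a[j]) for j in range(N)]
            disc *= gamma
        total += ret
    return total / n_paths

print(value_estimate(0, s0=[0, 1, 2]))
```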
Mean-field game (MFG) theory, pioneered by [17] and [23] in the continuous setting and later developed in [4, 10, 16, 25, 33] for discrete settings, provides an ingenious and tractable aggregation approach to approximate the otherwise challenging $N$-player stochastic games. The basic idea of an MFG goes as follows. Assume all players are identical, indistinguishable, and interchangeable. When $N \to \infty$, one can view the limit of the other players' states $\boldsymbol{s}^{-i}_t = (s^1_t, \dots, s^{i-1}_t, s^{i+1}_t, \dots, s^N_t)$ as a population state distribution $\mu_t$ with
$$\mu_t(s) := \lim_{N \to \infty} \frac{\sum_{j=1, j \neq i}^{N} I_{s^j_t = s}}{N}.$$
(Here the indicator function $I_{s^j_t = s} = 1$ if $s^j_t = s$ and $0$ otherwise.) Due to the homogeneity of the players, one can then focus on a single (representative) player. That is, in an MFG, one may consider instead the following optimization problem:
$$\begin{aligned}
\text{maximize}_{\boldsymbol{\pi}} \quad & V(s, \boldsymbol{\pi}, \boldsymbol{\mu}) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, \mu_t) \,\Big|\, s_0 = s\right] \\
\text{subject to} \quad & s_{t+1} \sim P(s_t, a_t, \mu_t), \quad a_t \sim \pi_t(s_t, \mu_t),
\end{aligned}$$
where $\boldsymbol{\pi} := \{\pi_t\}_{t=0}^{\infty}$ denotes the policy sequence and $\boldsymbol{\mu} := \{\mu_t\}_{t=0}^{\infty}$ the distribution flow. In this MFG setting, at time $t$, after the representative player chooses her action $a_t$ according to some policy $\pi_t$, she will receive reward $r(s_t, a_t, \mu_t)$ and her state will evolve under a controlled stochastic dynamics of mean-field type $P(\cdot|s_t, a_t, \mu_t)$. Here the policy $\pi_t$ depends on both the current state $s_t$ and the current population state distribution $\mu_t$, so that $\pi_t : \mathcal{S} \times \mathcal{P}(\mathcal{S}) \to \mathcal{P}(\mathcal{A})$.

### 2.2 General MFG (GMFG)

In the classical MFG setting, the reward and the dynamics for each player are known. They depend only on $s_t$, the state of the player, $a_t$, the action of this particular player, and $\mu_t$, the population state distribution. In contrast, in the motivating auction example, the reward and the dynamics are unknown; they depend on the actions of all players, as well as on $s_t$ and $\mu_t$. We therefore define the following general MFG (GMFG) framework. At time $t$, after the representative player chooses her action $a_t$ according to some policy $\pi_t : \mathcal{S} \times \mathcal{P}(\mathcal{S}) \to \mathcal{P}(\mathcal{A})$, she will receive a reward $r(s_t, a_t, \mathcal{L}_t)$ and her state will evolve according to $P(\cdot|s_t, a_t, \mathcal{L}_t)$, where $r$ and $P$ are possibly unknown. The objective of the player is to solve the following control problem:
$$\begin{aligned}
\text{maximize}_{\boldsymbol{\pi}} \quad & V(s, \boldsymbol{\pi}, \boldsymbol{\mathcal{L}}) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, \mathcal{L}_t) \,\Big|\, s_0 = s\right] \\
\text{subject to} \quad & s_{t+1} \sim P(s_t, a_t, \mathcal{L}_t), \quad a_t \sim \pi_t(s_t, \mu_t).
\end{aligned} \tag{GMFG}$$
Here, $\boldsymbol{\mathcal{L}} := \{\mathcal{L}_t\}_{t=0}^{\infty}$, with $\mathcal{L}_t = \mathbb{P}_{s_t, a_t} \in \mathcal{P}(\mathcal{S} \times \mathcal{A})$ the joint distribution of the state and the action (i.e., the population state-action pair). $\mathcal{L}_t$ has marginal distribution $\alpha_t$ for the population action and $\mu_t$ for the population state. Notice that $\{\mathcal{L}_t\}_{t=0}^{\infty}$ could depend on time: an infinite-time-horizon MFG could still have a time-dependent NE solution, due to the mean information process (game interaction) in the MFG. This is fundamentally different from the theory of single-agent MDPs, where the optimal control, if it exists uniquely, is time independent in the infinite-time-horizon setting.

In this framework, we adopt the well-known Nash equilibrium (NE) for analyzing stochastic games.

**Definition 2.1 (NE for GMFGs).** In (GMFG), a player-population profile $(\boldsymbol{\pi}^*, \boldsymbol{\mathcal{L}}^*) := (\{\pi^*_t\}_{t=0}^{\infty}, \{\mathcal{L}^*_t\}_{t=0}^{\infty})$ is called an NE if

1. (Single player side) Fix $\boldsymbol{\mathcal{L}}^*$; then for any policy sequence $\boldsymbol{\pi} := \{\pi_t\}_{t=0}^{\infty}$ and any initial state $s \in \mathcal{S}$,
$$V(s, \boldsymbol{\pi}^*, \boldsymbol{\mathcal{L}}^*) \geq V(s, \boldsymbol{\pi}, \boldsymbol{\mathcal{L}}^*). \tag{2}$$
2. (Population side) $\mathbb{P}_{s_t, a_t} = \mathcal{L}^*_t$ for all $t \geq 0$, where $\{s_t, a_t\}_{t=0}^{\infty}$ is the dynamics under the policy sequence $\boldsymbol{\pi}^*$ starting from $s_0 \sim \mu^*_0$, with $a_t \sim \pi^*_t(s_t, \mu^*_t)$, $s_{t+1} \sim P(\cdot|s_t, a_t, \mathcal{L}^*_t)$, and $\mu^*_t$ being the population state marginal of $\mathcal{L}^*_t$.

The single player side condition captures the optimality of $\boldsymbol{\pi}^*$ when the population side is fixed.
The population side condition ensures the consistency of the solution: it guarantees that the state and action distribution flow of the single player matches the population state and action sequence $\boldsymbol{\mathcal{L}}^*$.

### 2.3 Example: GMFG for the repeated auction

Now consider the repeated Vickrey auction with a budget constraint from Section 1, and take a representative advertiser in the auction. Denote by $s_t \in \{0, 1, 2, \dots, s_{\max}\}$ the budget of this player at time $t$, where $s_{\max} \in \mathbb{N}^+$ is the maximum budget allowed on the Ad exchange with a unit bidding price. Denote by $a_t$ the bid price submitted by this player and by $\alpha_t$ the bidding (action) distribution of the population. The reward for this advertiser with bid $a_t$ and budget $s_t$ is
$$r_t = I_{\{w^M_t = 1\}}\Big[(v_t - a^M_t) - (1+\rho)\, I_{\{s_t < a^M_t\}}\,(a^M_t - s_t)\Big],$$
where $w^M_t = 1$ indicates that she wins the bid against the $M$ competing bids, $a^M_t$ is the corresponding second-best price she pays upon winning, $v_t$ is her conversion value, and $\rho$ is the penalty rate for overshooting the budget. That is, if this player does not win the bid, the budget remains the same. If she wins and has enough money to pay, her budget decreases from $s_t$ to $s_t - a^M_t$. However, if she wins but does not have enough money, her budget becomes $0$ after the payment and there is a penalty in the reward function. Note that in this game, both the rewards $r_t$ and the dynamics of $s_t$ are unknown a priori. In practice, one often modifies the dynamics of $s_{t+1}$ with a non-negative random budget fulfillment $\Delta(s_{t+1})$ after the auction clearing [11], such that $\hat{s}_{t+1} = s_{t+1} + \Delta(s_{t+1})$. Particular choices of $\Delta(s_{t+1})$ are given in the experiment section (Section 5).

## 3 Solution for GMFGs

We now establish the existence and uniqueness of the NE of (GMFG) by generalizing the classical fixed-point approach for MFGs to this GMFG setting (see [17] and [23] for the classical case). It consists of three steps.

**Step A.** Fix $\boldsymbol{\mathcal{L}} := \{\mathcal{L}_t\}_{t=0}^{\infty}$; then (GMFG) becomes a classical optimization problem. Indeed, with $\boldsymbol{\mathcal{L}}$ fixed, the population state distribution sequence $\boldsymbol{\mu} := \{\mu_t\}_{t=0}^{\infty}$ is also fixed, hence the space of admissible policies is reduced to the single-player case. Solving (GMFG) is now reduced to finding, over all admissible $\boldsymbol{\pi}_{\boldsymbol{\mathcal{L}}} = \{\pi_{t,\boldsymbol{\mathcal{L}}}\}_{t=0}^{\infty}$ with $\pi_{t,\boldsymbol{\mathcal{L}}} \in \Pi := \{\pi \mid \pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})\}$, a policy sequence $\boldsymbol{\pi}^*_{\boldsymbol{\mathcal{L}}} := \{\pi^*_{t,\boldsymbol{\mathcal{L}}}\}_{t=0}^{\infty}$ that maximizes
$$V(s, \boldsymbol{\pi}_{\boldsymbol{\mathcal{L}}}, \boldsymbol{\mathcal{L}}) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, \mathcal{L}_t) \,\Big|\, s_0 = s\right], \quad \text{subject to } s_{t+1} \sim P(s_t, a_t, \mathcal{L}_t), \; a_t \sim \pi_{t,\boldsymbol{\mathcal{L}}}(s_t).$$
Notice that with $\boldsymbol{\mathcal{L}}$ fixed, one can safely suppress the dependence on $\mu_t$ in the admissible policies. Moreover, given this fixed $\boldsymbol{\mathcal{L}}$ sequence and the solution $\boldsymbol{\pi}^*_{\boldsymbol{\mathcal{L}}} := \{\pi^*_{t,\boldsymbol{\mathcal{L}}}\}_{t=0}^{\infty}$, one can define a mapping from the fixed population distribution sequence $\boldsymbol{\mathcal{L}}$ to an arbitrarily chosen optimal randomized policy sequence:
$$\Gamma_1 : \{\mathcal{P}(\mathcal{S} \times \mathcal{A})\}_{t=0}^{\infty} \to \{\Pi\}_{t=0}^{\infty}, \qquad \boldsymbol{\pi}^*_{\boldsymbol{\mathcal{L}}} = \Gamma_1(\boldsymbol{\mathcal{L}}).$$
Note that this $\boldsymbol{\pi}^*_{\boldsymbol{\mathcal{L}}}$ sequence satisfies the single player side condition in Definition 2.1 for the population state-action pair sequence $\boldsymbol{\mathcal{L}}$. That is, $V(s, \boldsymbol{\pi}^*_{\boldsymbol{\mathcal{L}}}, \boldsymbol{\mathcal{L}}) \geq V(s, \boldsymbol{\pi}, \boldsymbol{\mathcal{L}})$ for any policy sequence $\boldsymbol{\pi} = \{\pi_t\}_{t=0}^{\infty}$ and any initial state $s \in \mathcal{S}$.

As in the MFG literature [17], a feedback regularity condition is needed for analyzing Step A.

**Assumption 1.** There exists a constant $d_1 \geq 0$ such that for any $\boldsymbol{\mathcal{L}}, \boldsymbol{\mathcal{L}}' \in \{\mathcal{P}(\mathcal{S} \times \mathcal{A})\}_{t=0}^{\infty}$,
$$D(\Gamma_1(\boldsymbol{\mathcal{L}}), \Gamma_1(\boldsymbol{\mathcal{L}}')) \leq d_1 \mathcal{W}_1(\boldsymbol{\mathcal{L}}, \boldsymbol{\mathcal{L}}'), \tag{4}$$
where
$$D(\boldsymbol{\pi}, \boldsymbol{\pi}') := \sup_{s \in \mathcal{S}} \mathcal{W}_1(\boldsymbol{\pi}(s), \boldsymbol{\pi}'(s)) = \sup_{s \in \mathcal{S}} \sup_{t \in \mathbb{N}} W_1(\pi_t(s), \pi'_t(s)), \qquad \mathcal{W}_1(\boldsymbol{\mathcal{L}}, \boldsymbol{\mathcal{L}}') := \sup_{t \in \mathbb{N}} W_1(\mathcal{L}_t, \mathcal{L}'_t), \tag{5}$$
and $W_1$ is the $\ell_1$-Wasserstein distance between probability measures [9, 31, 37].

**Step B.** Based on the analysis in Step A and $\boldsymbol{\pi}^*_{\boldsymbol{\mathcal{L}}} = \{\pi^*_{t,\boldsymbol{\mathcal{L}}}\}_{t=0}^{\infty}$, update the initial sequence $\boldsymbol{\mathcal{L}}$ to $\boldsymbol{\mathcal{L}}'$ following the controlled dynamics $P(\cdot|s_t, a_t, \mathcal{L}_t)$.
Accordingly, for any admissible policy sequence $\boldsymbol{\pi} \in \{\Pi\}_{t=0}^{\infty}$ and any joint population state-action pair sequence $\boldsymbol{\mathcal{L}} \in \{\mathcal{P}(\mathcal{S} \times \mathcal{A})\}_{t=0}^{\infty}$, define a mapping $\Gamma_2 : \{\Pi\}_{t=0}^{\infty} \times \{\mathcal{P}(\mathcal{S} \times \mathcal{A})\}_{t=0}^{\infty} \to \{\mathcal{P}(\mathcal{S} \times \mathcal{A})\}_{t=0}^{\infty}$ as follows:
$$\Gamma_2(\boldsymbol{\pi}, \boldsymbol{\mathcal{L}}) := \hat{\boldsymbol{\mathcal{L}}} = \{\mathbb{P}_{s_t, a_t}\}_{t=0}^{\infty}, \tag{6}$$
where $s_{t+1} \sim \mu_t P(\cdot \mid \cdot, a_t, \mathcal{L}_t)$, $a_t \sim \pi_t(s_t)$, $s_0 \sim \mu_0$, and $\mu_t$ is the population state marginal of $\mathcal{L}_t$.

One needs a standard assumption in this step.

**Assumption 2.** There exist constants $d_2, d_3 \geq 0$ such that for any admissible policy sequences $\boldsymbol{\pi}, \boldsymbol{\pi}^1, \boldsymbol{\pi}^2$ and joint distribution sequences $\boldsymbol{\mathcal{L}}, \boldsymbol{\mathcal{L}}^1, \boldsymbol{\mathcal{L}}^2$,
$$\mathcal{W}_1(\Gamma_2(\boldsymbol{\pi}^1, \boldsymbol{\mathcal{L}}), \Gamma_2(\boldsymbol{\pi}^2, \boldsymbol{\mathcal{L}})) \leq d_2 D(\boldsymbol{\pi}^1, \boldsymbol{\pi}^2), \tag{7}$$
$$\mathcal{W}_1(\Gamma_2(\boldsymbol{\pi}, \boldsymbol{\mathcal{L}}^1), \Gamma_2(\boldsymbol{\pi}, \boldsymbol{\mathcal{L}}^2)) \leq d_3 \mathcal{W}_1(\boldsymbol{\mathcal{L}}^1, \boldsymbol{\mathcal{L}}^2). \tag{8}$$
Assumption 2 can be reduced to Lipschitz continuity and boundedness of the transition dynamics $P$ (see the Appendix for more details).

**Step C.** Repeat Step A and Step B until $\boldsymbol{\mathcal{L}}'$ matches $\boldsymbol{\mathcal{L}}$. This step takes care of the population side condition. To ensure the convergence of the combined Step A and Step B, it suffices that $\Gamma : \{\mathcal{P}(\mathcal{S} \times \mathcal{A})\}_{t=0}^{\infty} \to \{\mathcal{P}(\mathcal{S} \times \mathcal{A})\}_{t=0}^{\infty}$, with $\Gamma(\boldsymbol{\mathcal{L}}) := \Gamma_2(\Gamma_1(\boldsymbol{\mathcal{L}}), \boldsymbol{\mathcal{L}})$, is a contraction under the $\mathcal{W}_1$ distance. Then, by the Banach fixed-point theorem and the completeness of the related metric spaces, there exists a unique NE of the GMFG. In summary, we have:

**Theorem 1 (Existence and uniqueness of the GMFG solution).** Given Assumptions 1 and 2, and assuming $d_1 d_2 + d_3 < 1$, there exists a unique NE of (GMFG).

## 4 RL Algorithms for (stationary) GMFGs

In this section, we design a computational algorithm for the GMFG. Since the reward and transition distributions are unknown, this amounts to simultaneously learning the system and finding the NE of the game. We focus on the case with finite state and action spaces, i.e., $|\mathcal{S}|, |\mathcal{A}| < \infty$, and look for stationary (time-independent) NEs. Accordingly, we abbreviate $\boldsymbol{\pi} := \{\pi\}_{t=0}^{\infty}$ and $\boldsymbol{\mathcal{L}} := \{\mathcal{L}\}_{t=0}^{\infty}$ as $\pi$ and $\mathcal{L}$, respectively. This stationarity property enables developing an appropriate time-independent Q-learning algorithm, suitable for an infinite-time-horizon game. The modification from the GMFG framework to this special stationary setting is straightforward and is left to Appendix B. Note that the assumptions guaranteeing the existence and uniqueness of GMFG solutions are slightly different between the stationary and non-stationary cases; for instance, compare (7)-(8) with (21)-(22).

The algorithm consists of two steps, parallel to Step A and Step B in Section 3.

**Step 1: Q-learning with stability for fixed $\mathcal{L}$.** With $\mathcal{L}$ fixed, the problem becomes a standard learning problem for an infinite-horizon MDP. We focus on the Q-learning algorithm [35, 32], which approximates value iteration by stochastic approximation. At each step, with the state $s$ and an action $a$, the system reaches state $s'$ according to the controlled dynamics, and the Q-function is updated according to
$$Q_{\mathcal{L}}(s, a) \leftarrow (1 - \beta_t(s, a))\, Q_{\mathcal{L}}(s, a) + \beta_t(s, a) \left[ r(s, a, \mathcal{L}) + \gamma \max_{a'} Q_{\mathcal{L}}(s', a') \right], \tag{9}$$
where the step size $\beta_t(s, a)$ can be chosen as (cf. [7])
$$\beta_t(s, a) = \begin{cases} \big(\#(s, a, t) + 1\big)^{-h}, & (s, a) = (s_t, a_t), \\ 0, & \text{otherwise}, \end{cases}$$
with $h \in (1/2, 1)$. Here $\#(s, a, t)$ is the number of times the pair $(s, a)$ has been visited up to time $t$. The algorithm then chooses the action $a$ based on $Q_{\mathcal{L}}$ with appropriate exploration strategies, including the $\epsilon$-greedy strategy. After obtaining the approximate $\hat{Q}_{\mathcal{L}}$, in order to retrieve an approximately optimal policy, it would be natural to define an argmax-e operator, so that actions with equal maximum Q-values have equal probabilities of being selected.
Unfortunately, the discontinuity and sensitivity of argmax-e could lead to an unstable algorithm (see Figure 4 for the corresponding naive Algorithm 2 in the Appendix).² Instead, we consider a Boltzmann policy based on the operator $\mathrm{softmax}_c : \mathbb{R}^n \to \mathbb{R}^n$, defined as
$$\mathrm{softmax}_c(x)_i = \frac{\exp(c x_i)}{\sum_{j=1}^{n} \exp(c x_j)}. \tag{10}$$
This operator is smooth and close to argmax-e (see Lemma 7 in the Appendix). Moreover, even though Boltzmann policies are not optimal, the difference between the Boltzmann policy and the optimal one can always be controlled by choosing the hyper-parameter $c$ in the softmax operator appropriately. Note that other smoothing operators (e.g., Mellowmax [2]) may also be considered in the future.

**Step 2: error control in updating $\mathcal{L}$.** Given the sub-optimality of the Boltzmann policy, one needs to characterize the difference between the optimal policy and the non-optimal ones. In particular, one can define the action gap between the best action and the second-best action in terms of the Q-value as
$$\delta_s(\mathcal{L}) := \max_{a^* \in \mathcal{A}} Q^*_{\mathcal{L}}(s, a^*) - \max_{a \notin \operatorname{argmax}_{a \in \mathcal{A}} Q^*_{\mathcal{L}}(s, a)} Q^*_{\mathcal{L}}(s, a) > 0.$$
The action gap is important for approximation algorithms [3], and is closely related to problem-dependent bounds for regret analysis in reinforcement learning and multi-armed bandits, and to advantage learning algorithms including A2C [27]. The problem is: in order for the learning algorithm to converge in terms of $\mathcal{L}$ (Theorem 2), one needs to ensure a definite separation between the optimal policy and the sub-optimal ones. This is problematic, as the infimum of $\delta_s(\mathcal{L})$ over an infinite number of $\mathcal{L}$ can be $0$. To address this, the population distribution at step $k$, say $\mathcal{L}_k$, needs to be projected onto a finite grid, called an $\epsilon$-net. The relation between the $\epsilon$-net and action gaps is as follows: for any $\epsilon > 0$, there exist a positive function $\phi(\epsilon)$ and an $\epsilon$-net $S_\epsilon := \{\mathcal{L}^{(1)}, \dots, \mathcal{L}^{(N_\epsilon)}\} \subseteq \mathcal{P}(\mathcal{S} \times \mathcal{A})$ with the properties that $\min_{i=1,\dots,N_\epsilon} d_{TV}(\mathcal{L}, \mathcal{L}^{(i)}) \leq \epsilon$ for any $\mathcal{L} \in \mathcal{P}(\mathcal{S} \times \mathcal{A})$, and that $\max_{a^* \in \mathcal{A}} Q^*_{\mathcal{L}^{(i)}}(s, a^*) - Q^*_{\mathcal{L}^{(i)}}(s, a) \geq \phi(\epsilon)$ for any $i = 1, \dots, N_\epsilon$, $s \in \mathcal{S}$, and any $a \notin \operatorname{argmax}_{a \in \mathcal{A}} Q^*_{\mathcal{L}^{(i)}}(s, a)$. Here the existence of $\epsilon$-nets is trivial due to the compactness of the probability simplex $\mathcal{P}(\mathcal{S} \times \mathcal{A})$, and the existence of $\phi(\epsilon)$ comes from the finiteness of the action set $\mathcal{A}$. In practice, $\phi(\epsilon)$ often takes the form $D\epsilon^\alpha$, with $D > 0$ and the exponent $\alpha > 0$ characterizing the decay rate of the action gaps.

Finally, to enable Q-learning, it is assumed that one has access to a population simulator (see [30, 38]). That is, for any policy $\pi \in \Pi$, given the current state $s \in \mathcal{S}$ and any population distribution $\mathcal{L}$, one can obtain the next state $s' \sim P(\cdot|s, \pi(s, \mu), \mathcal{L})$, a reward $r = r(s, \pi(s, \mu), \mathcal{L})$, and the next population distribution $\mathcal{L}' = \mathbb{P}_{s', \pi(s', \mu)}$. For brevity, we denote the simulator as $(s', r, \mathcal{L}') = \mathcal{G}(s, \pi, \mathcal{L})$. Here $\mu$ is the state marginal distribution of $\mathcal{L}$.

In summary, we propose the following Algorithm 1.

**Algorithm 1: Q-learning for GMFGs (GMF-Q)**

1. **Input:** initial $\mathcal{L}_0$, tolerance $\epsilon > 0$.
2. **for** $k = 0, 1, \dots$ **do**
3. Perform Q-learning for $T_k$ iterations to find the approximate Q-function $\hat{Q}_k(s, a) = \hat{Q}_{\mathcal{L}_k}(s, a)$ of an MDP with dynamics $P_{\mathcal{L}_k}(s'|s, a)$ and rewards $r_{\mathcal{L}_k}(s, a)$.
4. Compute $\pi_k \in \Pi$ with $\pi_k(s) = \mathrm{softmax}_c(\hat{Q}_k(s, \cdot))$.
5. Sample $s \sim \mu_k$ ($\mu_k$ is the population state marginal of $\mathcal{L}_k$), and obtain $\tilde{\mathcal{L}}_{k+1}$ from $\mathcal{G}(s, \pi_k, \mathcal{L}_k)$.
6. Find $\mathcal{L}_{k+1} = \mathrm{Proj}_{S_\epsilon}(\tilde{\mathcal{L}}_{k+1})$.
7. **end for**

Note that softmax is applied only at the end of each outer iteration, when a good approximation of the Q-function has been obtained. Within each outer iteration, for the MDP problem with fixed mean-field information, the standard Q-learning method is applied.
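To make the structure of Algorithm 1 concrete, here is a minimal, self-contained Python sketch of the GMF-Q loop for finite state and action spaces. It is not the authors' implementation: `simulator(s, a, L)` is an assumed population simulator that takes a concrete action (a slight simplification of the oracle $\mathcal{G}(s, \pi, \mathcal{L})$ described above), and the $\epsilon$-net projection is realized by a simple truncation, as suggested in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_c(q_row, c):
    # Boltzmann policy from equation (10): softmax_c(x)_i = exp(c x_i) / sum_j exp(c x_j)
    z = np.exp(c * (q_row - q_row.max()))      # shift for numerical stability
    return z / z.sum()

def gmf_q(simulator, n_states, n_actions, L0,
          n_outer=20, T_k=10_000, c=4.0, gamma=0.8, h=0.87, n_digits=4, eps_greedy=0.1):
    """Sketch of Algorithm 1 (GMF-Q). `simulator(s, a, L) -> (s_next, r, L_next)` is an
    assumed population simulator; L is represented as an |S| x |A| probability array."""
    L = np.asarray(L0, dtype=float)            # L_k in P(S x A)
    for _ in range(n_outer):
        # Step 3: inner Q-learning for the MDP induced by the fixed mean-field term L_k.
        Q = np.zeros((n_states, n_actions))
        counts = np.zeros((n_states, n_actions))
        mu = L.sum(axis=1)                     # population state marginal of L_k
        s = rng.choice(n_states, p=mu)
        for _ in range(T_k):
            if rng.random() < eps_greedy:      # epsilon-greedy exploration
                a = int(rng.integers(n_actions))
            else:
                a = int(Q[s].argmax())
            s_next, r, _ = simulator(s, a, L)  # mean-field term held fixed inside the loop
            counts[s, a] += 1
            beta = (counts[s, a] + 1.0) ** (-h)    # step size from [7], with h in (1/2, 1)
            Q[s, a] += beta * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
        # Step 4: Boltzmann policy pi_k(s) = softmax_c(Q_k(s, .)).
        # Step 5: one simulator call under pi_k to update the population distribution.
        s0 = rng.choice(n_states, p=mu)
        a0 = rng.choice(n_actions, p=softmax_c(Q[s0], c))
        _, _, L_tilde = simulator(s0, a0, L)
        # Step 6: projection onto an epsilon-net, here a simple truncation to n_digits.
        L = np.round(np.asarray(L_tilde, dtype=float), n_digits)
        L = L / L.sum()
    return Q, L
```

The default hyper-parameters above mirror the experimental choices reported in Section 5 ($c = 4.0$, $\gamma = 0.8$, $h = 0.87$, truncation to 4 digits); they are placeholders rather than prescriptions.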
² argmax-e is not continuous: let $x = (1, 1)$; then argmax-e$(x) = (1/2, 1/2)$. For any $\epsilon > 0$, let $y = (1, 1-\epsilon)$; then argmax-e$(y) = (1, 0)$.

Here $\mathrm{Proj}_{S_\epsilon}(\mathcal{L}) = \operatorname{argmin}_{\mathcal{L}^{(i)} \in \{\mathcal{L}^{(1)}, \dots, \mathcal{L}^{(N_\epsilon)}\}} d_{TV}(\mathcal{L}^{(i)}, \mathcal{L})$. For computational tractability, it suffices to choose $S_\epsilon$ as a truncation grid, so that the projection of $\tilde{\mathcal{L}}_k$ onto the $\epsilon$-net reduces to truncating $\tilde{\mathcal{L}}_k$ to a certain number of digits; in our experiment, the number of digits is chosen to be 4. The choices of the hyper-parameters $c$ and $T_k$ can be found in Lemma 8 and Theorem 2. In practice, the algorithm is rather robust with respect to these hyper-parameters. In the special case when the rewards $r_{\mathcal{L}}$ and transition dynamics $P(\cdot|s, a, \mathcal{L})$ are known, one can replace the Q-learning step in Algorithm 1 by value iteration, resulting in the GMF-V Algorithm 3 in the Appendix.

We next show the convergence of the GMF-Q algorithm (Algorithm 1) to an $\epsilon$-Nash equilibrium of (GMFG), with a complexity analysis.

**Theorem 2 (Convergence and complexity of GMF-Q).** Assume the same conditions as in Theorem 4 and Lemma 8 in the Appendix. For any tolerances $\epsilon, \delta > 0$, set $\delta_k = \delta/K_{\epsilon,\eta}$, $\epsilon_k = (k+1)^{-(1+\eta)}$ for some $\eta \in (0, 1]$ ($k = 0, \dots, K_{\epsilon,\eta} - 1$), $T_k = T_{M_{\mathcal{L}_k}}(\delta_k, \epsilon_k)$ (defined in Lemma 8 in the Appendix), and $c = \log(1/\epsilon)/\phi(\epsilon)$. Then with probability at least $1 - 2\delta$, $\mathcal{W}_1(\mathcal{L}_{K_{\epsilon,\eta}}, \mathcal{L}^*) \leq C\epsilon$. Moreover, the total number of iterations $T = \sum_{k=0}^{K_{\epsilon,\eta}-1} T_{M_{\mathcal{L}_k}}(\delta_k, \epsilon_k)$ is bounded by³
$$T = O\!\left(K_{\epsilon,\eta}^{\,1+\frac{4}{h}} \big(\log(K_{\epsilon,\eta}/\delta)\big)^{\frac{4}{h}+3}\right). \tag{11}$$
Here $K_{\epsilon,\eta} := 2\left\lceil \max\left\{ (\eta\epsilon/c)^{-1/\eta},\ \log_{d}\!\big(\epsilon/\max\{\operatorname{diam}(\mathcal{S})\operatorname{diam}(\mathcal{A}),\, c\}\big) \right\}\right\rceil + 1$ is the number of outer iterations, $h$ is the step-size exponent in Q-learning (defined in Lemma 8 in the Appendix), and the constant $C$ is independent of $\delta$, $\epsilon$, and $\eta$.

The proof of Theorem 2 in the Appendix relies on the Lipschitz continuity of the softmax operator [8], the closeness between softmax and argmax-e (Lemma 7 in the Appendix), and the complexity of Q-learning for the MDP (Lemma 8 in the Appendix).

## 5 Experiment: repeated auction game

In this section, we report the performance of the proposed GMF-Q algorithm. The objectives of the experiments are 1) to test the convergence, stability, and learning ability of GMF-Q in the GMFG setting, and 2) to compare GMF-Q with existing multi-agent reinforcement learning algorithms, including the IL and MF-Q algorithms. We take the GMFG framework for the repeated auction game from Section 2.3, where each advertiser learns to bid in the auction under a budget constraint.

**Parameters.** The model parameters are set as follows: $|\mathcal{S}| = |\mathcal{A}| = 10$, the overbidding penalty $\rho = 0.2$, the distribution of the conversion rate $v \sim \text{uniform}(\{1, 2, 3, 4\})$, and the competition intensity index $M = 5$. The random fulfillment is chosen as: if $s < s_{\max}$, $\Delta(s) = 1$ with probability $\tfrac{1}{2}$ and $\Delta(s) = 0$ with probability $\tfrac{1}{2}$; if $s = s_{\max}$, $\Delta(s) = 0$. The algorithm parameters are (unless otherwise specified): the temperature parameter $c = 4.0$, the discount factor $\gamma = 0.8$, the parameter $h$ from Lemma 8 in the Appendix set to $h = 0.87$, and the baseline number of inner iterations set to 2000. Recall that for GMF-Q, both $v$ and the transition dynamics $P$ for $s$ are unknown a priori. The 90%-confidence intervals are calculated from 20 sample paths.

**Performance evaluation in the GMFG setting.** Our experiments show that the GMF-Q algorithm is efficient and robust, and learns well.

**Convergence and stability of GMF-Q.** GMF-Q is efficient and robust.
First, GMF-Q converges after about 10 outer iterations; secondly, as the number of inner iterations increases, the error decreases (Figure 2); and finally, the convergence is robust with respect to both the number of states and the initial population distribution (Figure 3). In contrast, the Naive algorithm does not converge even with 10000 inner iterations, and the joint distribution $\mathcal{L}_t$ keeps fluctuating (Figure 4).

³ For instance, when $\eta = 1$, the bound simplifies accordingly. Note that this bound may not be tight.

**Table 1:** $\Delta Q$ with $T_k^{\text{GMF-V}} = 5000$.

| $T_k^{\text{GMF-Q}}$ | 1000 | 3000 | 5000 | 10000 |
|---|---|---|---|---|
| $\Delta Q$ | 0.21263 | 0.1294 | 0.10258 | 0.0989 |

*Figure 1: Q-tables: GMF-Q vs. GMF-V.*

**Learning accuracy of GMF-Q.** GMF-Q learns well. Its learning accuracy is tested against its special form GMF-V (Appendix G), with the latter assuming a known distribution of the conversion rate $v$ and known dynamics $P$ for the budget $s$. The relative $L^2$ distance between the Q-tables of these two algorithms is
$$\Delta Q := \frac{\|Q^{\text{GMF-V}} - Q^{\text{GMF-Q}}\|_2}{\|Q^{\text{GMF-V}}\|_2} = 0.098879.$$
This implies that GMF-Q learns the true GMFG solution with 90% accuracy using 10000 inner iterations. The heatmap in Figure 1(a) is the Q-table for the GMF-Q algorithm after 20 outer iterations, with $T_k^{\text{GMF-Q}} = 10000$ inner iterations within each outer iteration. The heatmap in Figure 1(b) is the Q-table for the GMF-V algorithm after 20 outer iterations, with $T_k^{\text{GMF-V}} = 5000$ inner iterations within each outer iteration.

**Comparison with existing algorithms for N-player games.** To test the effectiveness of GMF-Q for approximating $N$-player games, we next compare GMF-Q with the IL and MF-Q algorithms. The IL algorithm [36] considers $N$ independent players, each of whom solves a decentralized reinforcement learning problem while ignoring the other players in the system. The MF-Q algorithm [40] extends the Nash-Q learning algorithm for the $N$-player game introduced in [15], adds the aggregate action $\bar{a}^{-i} = \frac{\sum_{j \neq i} a^j}{N-1}$ of the opponents, and works for the class of games where the interactions are only through the average actions of the $N$ players.

*Figure 2: Convergence with different numbers of inner iterations.*

*Figure 3: Convergence with different numbers of states.*

*Figure 4: Fluctuations of the Naive algorithm (30 sample paths): (a) fluctuation in $\ell_\infty$; (b) fluctuation in $\ell_1$.*

*Figure 5: Learning accuracy based on $C(\boldsymbol{\pi})$: (a) $|\mathcal{S}| = |\mathcal{A}| = 10$, $N = 20$; (b) $|\mathcal{S}| = |\mathcal{A}| = 20$, $N = 20$; (c) $|\mathcal{S}| = |\mathcal{A}| = 10$, $N = 40$.*

**Performance metric.** We adopt the following metric to measure the difference between a given policy $\boldsymbol{\pi}$ and an NE (here $\epsilon_0 > 0$ is a safeguard, taken as $0.1$ in the experiments):
$$C(\boldsymbol{\pi}) = \frac{1}{N|\mathcal{S}|^N} \sum_{i=1}^{N} \sum_{\boldsymbol{s} \in \mathcal{S}^N} \frac{\max_{\pi^i} V^i(\boldsymbol{s}, (\boldsymbol{\pi}^{-i}, \pi^i)) - V^i(\boldsymbol{s}, \boldsymbol{\pi})}{\left|\max_{\pi^i} V^i(\boldsymbol{s}, (\boldsymbol{\pi}^{-i}, \pi^i))\right| + \epsilon_0}.$$
Clearly $C(\boldsymbol{\pi}) \geq 0$, and $C(\boldsymbol{\pi}^*) = 0$ if and only if $\boldsymbol{\pi}^*$ is an NE. The policy $\operatorname{argmax}_{\pi^i} V^i(\boldsymbol{s}, (\boldsymbol{\pi}^{-i}, \pi^i))$ is called the best response to $\boldsymbol{\pi}^{-i}$; a small evaluation sketch for this metric is given below. A similar metric without normalization has been adopted in [29]. Our experiments (Figure 5) show that GMF-Q is superior in terms of convergence rate, accuracy, and stability for approximating an $N$-player game: GMF-Q converges faster than IL and MF-Q, with the smallest error and the lowest variance, as the $\epsilon$-net improves the stability.
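As referenced above, the following is a minimal sketch of how the metric $C(\boldsymbol{\pi})$ could be evaluated for a small game. It is illustrative only: `value` is an assumed oracle for $V^i(\boldsymbol{s}, \cdot)$ (e.g., obtained by Monte Carlo rollouts or exact dynamic programming), and the best response is approximated by a search over a finite set of candidate policies rather than the full policy space.

```python
def exploitability_metric(value, joint_policy, candidate_policies, state_profiles, eps0=0.1):
    """Sketch of the metric C(pi).

    value(i, s, policies): assumed oracle returning V^i(s, policies).
    joint_policy: tuple (pi^1, ..., pi^N) of per-player policies.
    candidate_policies[i]: finite set of alternatives approximating the max over pi^i.
    state_profiles: iterable over the joint state profiles s in S^N.
    """
    N = len(joint_policy)
    profiles = list(state_profiles)
    total = 0.0
    for i in range(N):
        for s in profiles:
            v_cur = value(i, s, joint_policy)
            # approximate best response of player i, holding the other players fixed
            v_best = max(
                value(i, s, joint_policy[:i] + (pi_i,) + joint_policy[i + 1:])
                for pi_i in candidate_policies[i]
            )
            total += (v_best - v_cur) / (abs(v_best) + eps0)
    return total / (N * len(profiles))
```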
In the comparison of Figure 5, for instance, when $N = 20$, the IL algorithm converges with the largest error, 0.220; the error from MF-Q is 0.101, smaller than that of IL but still larger than that of GMF-Q; and GMF-Q converges with the lowest error, 0.065. Moreover, as $N$ increases, the error of GMF-Q decreases, while the errors of both MF-Q and IL increase significantly. As $|\mathcal{S}|$ and $|\mathcal{A}|$ increase, GMF-Q is robust with respect to this increase in dimensionality, while both MF-Q and IL clearly suffer from the increased dimensionality, with lower convergence rate and accuracy. Therefore, GMF-Q is more scalable than IL and MF-Q when the system is complex and the number of players $N$ is large.

## 6 Conclusion

This paper builds a GMFG framework for simultaneous learning and decision-making, establishes the existence and uniqueness of the NE, and proposes a Q-learning algorithm, GMF-Q, with convergence and complexity analysis. Experiments demonstrate the superior performance of GMF-Q.

**Acknowledgment.** We thank Haoran Tang for the insightful early discussion on stabilizing the Q-learning algorithm and for sharing the ideas of his work on soft Q-learning [12], which motivated our adoption of the softmax operator. We also thank the anonymous NeurIPS 2019 reviewers for their valuable suggestions.

## References

[1] B. Acciaio, J. Backhoff, and R. Carmona. Extended mean field control problems: stochastic maximum principle and transport perspective. arXiv preprint arXiv:1802.05754, 2018.

[2] K. Asadi and M. L. Littman. An alternative softmax operator for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 243-252, 2017.

[3] M. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos. Increasing the action gap: New operators for reinforcement learning. In AAAI Conference on Artificial Intelligence, pages 1476-1483, 2016.

[4] M. Benaim and J. Y. Le Boudec. A class of mean field interaction models for computer and communication systems. Performance Evaluation, 65(11-12):823-838, 2008.

[5] F. Bolley. Separability and completeness for the Wasserstein distance. Séminaire de Probabilités XLI, pages 371-377, 2008.

[6] H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo. Real-time bidding by reinforcement learning in display advertising. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 661-670. ACM, 2017.

[7] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1-25, 2003.

[8] B. Gao and L. Pavel. On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805, 2017.

[9] A. L. Gibbs and F. E. Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419-435, 2002.

[10] D. A. Gomes, J. Mohr, and R. R. Souza. Discrete time, finite state space mean field games. Journal de Mathématiques Pures et Appliquées, 93(3):308-328, 2010.

[11] R. Gummadi, P. Key, and A. Proutiere. Repeated auctions under budget constraints: Optimal bidding strategies and equilibria. In the Eighth Ad Auction Workshop, 2012.

[12] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.

[13] J. Hamari, M. Sjöklint, and A. Ukkonen. The sharing economy: Why people participate in collaborative consumption. Journal of the Association for Information Science and Technology, 67(9):2047-2059, 2016.

[14] P. Hernandez-Leal, B. Kartal, and M. E. Taylor.
Is multiagent deep reinforcement learning the answer or the question? A brief survey. arXiv preprint arXiv:1810.05587, 2018.

[15] J. Hu and M. P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039-1069, 2003.

[16] M. Huang and Y. Ma. Mean field stochastic games with binary action spaces and monotone costs. arXiv preprint arXiv:1701.06661, 2017.

[17] M. Huang, R. P. Malhamé, and P. E. Caines. Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6(3):221-252, 2006.

[18] K. Iyer, R. Johari, and M. Sundararajan. Mean field equilibria of dynamic auctions with learning. ACM SIGecom Exchanges, 10(3):10-14, 2011.

[19] S. H. Jeong, A. R. Kang, and H. K. Kim. Analysis of game bot's behavioral characteristics in social interaction networks of MMORPG. ACM SIGCOMM Computer Communication Review, 45(4):99-100, 2015.

[20] J. Jin, C. Song, H. Li, K. Gai, J. Wang, and W. Zhang. Real-time bidding with multi-agent reinforcement learning in display advertising. arXiv preprint arXiv:1802.09756, 2018.

[21] S. Kapoor. Multi-agent reinforcement learning: A report on challenges and approaches. arXiv preprint arXiv:1807.09427, 2018.

[22] A. C. Kizilkale and P. E. Caines. Mean field stochastic adaptive control. IEEE Transactions on Automatic Control, 58(4):905-920, 2013.

[23] J.-M. Lasry and P.-L. Lions. Mean field games. Japanese Journal of Mathematics, 2(1):229-260, 2007.

[24] C.-A. Lehalle and C. Mouzouni. A mean field game of portfolio trading and its consequences on perceived correlations. arXiv preprint arXiv:1902.09606, 2019.

[25] J. P. M. López. Discrete time mean field games: The short-stage limit. Journal of Dynamics & Games, 2(1):89-101, 2015.

[26] D. Mguni, J. Jennings, and E. M. de Cote. Decentralised learning in systems with many, many strategic agents. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[27] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

[28] C. H. Papadimitriou and T. Roughgarden. Computing equilibria in multi-player games. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 82-91, 2005.

[29] J. Pérolat, B. Piot, and O. Pietquin. Actor-critic fictitious play in simultaneous move multistage games. In International Conference on Artificial Intelligence and Statistics, 2018.

[30] J. Pérolat, F. Strub, B. Piot, and O. Pietquin. Learning Nash equilibrium for general-sum Markov games from batch data. arXiv preprint arXiv:1606.08718, 2016.

[31] G. Peyré and M. Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355-607, 2019.

[32] B. Recht. A tour of reinforcement learning: The view from continuous control. Annual Review of Control, Robotics, and Autonomous Systems, 2018.

[33] N. Saldi, T. Basar, and M. Raginsky. Markov Nash equilibria in mean-field games with discounted cost. SIAM Journal on Control and Optimization, 56(6):4256-4287, 2018.

[34] J. Subramanian and A. Mahajan. Reinforcement learning in stationary mean-field games. In 18th International Conference on Autonomous Agents and Multiagent Systems, pages 251-259, 2019.

[35] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[36] M. Tan. Multi-agent reinforcement learning: independent vs. cooperative agents.
In International Conference on Machine Learning, pages 330-337, 1993.

[37] C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

[38] H. T. Wai, Z. Yang, Z. Wang, and M. Hong. Multi-agent reinforcement learning via double averaging primal-dual optimization. In Advances in Neural Information Processing Systems, pages 9672-9683, 2018.

[39] J. Yang, X. Ye, R. Trivedi, H. Xu, and H. Zha. Deep mean field games for learning optimal behavior policy of large populations. arXiv preprint arXiv:1711.03156, 2017.

[40] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang. Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438, 2018.

[41] H. Yin, P. G. Mehta, S. P. Meyn, and U. V. Shanbhag. Learning in mean-field games. IEEE Transactions on Automatic Control, 59(3):629-644, 2013.