# Learning Regularized Monotone Graphon Mean-Field Games

Fengzhuo Zhang¹, Vincent Y. F. Tan¹, Zhaoran Wang², Zhuoran Yang³
¹National University of Singapore  ²Northwestern University  ³Yale University
fzzhang@u.nus.edu, vtan@nus.edu.sg, zhaoranwang@gmail.com, zhuoranyang.work@gmail.com

Abstract

This paper studies two fundamental problems in regularized Graphon Mean-Field Games (GMFGs). First, we establish the existence of a Nash Equilibrium (NE) of any λ-regularized GMFG (for λ ≥ 0). This result relies on weaker conditions than those in previous works for analyzing both unregularized GMFGs (λ = 0) and λ-regularized MFGs, which are special cases of GMFGs. Second, we propose provably efficient algorithms to learn the NE in weakly monotone GMFGs, motivated by Lasry and Lions [2007]. Previous literature either only analyzed continuous-time algorithms or required extra conditions to analyze discrete-time algorithms. In contrast, we design a discrete-time algorithm and derive its convergence rate solely under weakly monotone conditions. Furthermore, we develop and analyze the action-value function estimation procedure used during the online learning process, which is absent from existing algorithms for monotone GMFGs; this procedure serves as a sub-module of our optimization algorithm. The efficiency of the designed algorithm is corroborated by empirical evaluations.

1 Introduction

In Multi-Agent Reinforcement Learning (MARL), the sizes of the state and action spaces grow exponentially in the number of agents. This is known as the curse of many agents [Sonu et al., 2017, Wang et al., 2020] and restricts the applicability of MARL to large-scale scenarios. The Mean-Field Game (MFG) has thus been proposed to mitigate this problem by exploiting a homogeneity assumption [Huang et al., 2006, Lasry and Lions, 2007], and it has achieved tremendous success in many real-world applications [Cousin et al., 2011, Achdou et al., 2020]. However, the homogeneity assumption is an impediment when modeling scenarios in which the agents are heterogeneous. GMFGs, as extensions of MFGs, have thus been proposed to model the behaviors of heterogeneous agents while ameliorating the curse of many agents [Parise and Ozdaglar, 2019, Carmona et al., 2022].

Despite the empirical successes of the Graphon Mean-Field Game (GMFG) [Aurell et al., 2022a], its theoretical understanding is lacking. First, sufficient conditions for the existence of a Nash Equilibrium (NE) in regularized GMFGs have not been established. Most works only address the existence of the NE in unregularized GMFGs. However, regularization is employed in practical implementations for improved exploration and robustness [Geist et al., 2019]. Moreover, previous works prove the existence of the NE in regularized MFGs, a special case of GMFGs, only under a contraction condition, which is overly restrictive for real-world applications. Second, the analysis of discrete-time algorithms for monotone GMFGs is lacking. Most existing works design provably efficient discrete-time algorithms only under contraction conditions, as shown in Table 1. Complementarily, previous works on monotone GMFGs either only derive convergence rates for continuous-time algorithms, which ignore the discretization error, or require extra conditions (e.g., potential games) to analyze discrete-time algorithms.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
Table 1: Comparison of GMFG and MFG learning algorithms

| | Condition | No population manipulation | Online playing | Heterogeneity | Discrete-time algorithm | Convergence rate |
|---|---|---|---|---|---|---|
| Anahtarci et al. [2022] | Contraction | No | No | No | Yes | Yes |
| Xie et al. [2021a] | Contraction | No | Yes | No | Yes | Yes |
| Zaman et al. [2022] | Contraction | No | Yes | No | Yes | Yes |
| Yardim et al. [2022] | Contraction | Yes | Yes | No | Yes | Yes |
| Perrin et al. [2020] | Monotone | No | No | No | No | Yes |
| Geist et al. [2021] | Potential & Monotone | No | No | No | Yes | Yes |
| Perolat et al. [2021] | Monotone | Yes | Yes | Yes | No | No |
| Fabian et al. [2022] | Monotone | Yes | Yes | Yes | No | No |
| Our work | Monotone | Yes | Yes | Yes | Yes | Yes |

In this paper, we first consider GMFGs in full generality, i.e., without any contraction or monotone conditions. The goal is to establish the existence of the NE of the regularized GMFG in this general setting. Then we focus on monotone GMFGs motivated by Lasry and Lions [2007]. We aim to learn the unique NE from the online interactions of all agents, both with and without an action-value function oracle. When the oracle is absent, the action-value functions must be estimated from the data of sampled agents generated in the online game.

In the analysis, we have to overcome difficulties that arise from both the existence problem and the NE learning problem. First, the proof of the existence of the NE in a regularized GMFG involves constructing topological spaces and operators related to the NE on which fixed-point theorems are applicable. However, the direct construction of such spaces and operators for GMFGs with uncountably many agents is challenging. Second, the design and analysis of a discrete-time NE learning algorithm require subtle exploitation of the monotone condition. Unlike continuous-time algorithms with infinitesimal step sizes, a discrete-time algorithm additionally requires appropriately chosen step sizes to guarantee that the iterates evolve as desired. This guarantee originates from a delicate use of the monotone condition in the optimization procedure.

To address the difficulty in the existence problem, we construct a regularized MFG from the regularized GMFG and show that the NE of the constructed game can be converted into an NE of the original game, thus mitigating the difficulty of having an uncountable number of agents. To handle the difficulty in the NE learning problem, we design the Monotone GMFG Policy Mirror Descent (Mono GMFG-PMD) algorithm, which iteratively implements policy mirror descent for each player. We show that this procedure decreases the KL divergence between the iterate and the NE, and that this decrease is related to the gap appearing in the weakly monotone condition. When the action-value function oracle is absent, we also design and analyze action-value function estimation algorithms that serve as a sub-module of the optimization procedure.

Main Contributions: We first establish the existence of the NE of the λ-regularized GMFG with λ ≥ 0, assuming only Lipschitzness of the graphons and continuity of the transition kernels and reward functions. Our result relaxes the Lipschitzness assumption on the transition kernels and rewards required in previous works on unregularized GMFGs and the contraction condition in the literature on regularized MFGs [Cui and Koeppl, 2021a]. Then we design and analyze the Mono GMFG-PMD algorithm. Using an action-value function oracle, the convergence rate of Mono GMFG-PMD is proved to be O(T^{-1/2}) after T iterations.
Without the oracle, the convergence rate includes an additional O(K^{-1/2} + N^{-1}) term that arises from sampling N agents and collecting data from K episodes, reflecting the generalization error and the approximation error of the estimation algorithm. As shown in Table 1, our algorithm can be implemented from the online interactions of agents and does not require manipulation of the distribution flow. Detailed explanations of the properties listed in the columns of Table 1 are provided in Appendix A. Our result for Mono GMFG-PMD provides the first convergence rate for discrete-time algorithms in monotone GMFGs.

2 Related Works

MFGs were proposed by Huang et al. [2006] and Lasry and Lions [2007] to model the interactions among a set of homogeneous agents. In recent years, learning the NE of MFGs formulated as discrete-time Markov Decision Processes (MDPs) has attracted a lot of interest. There is a large body of work that designs and analyzes algorithms for MFGs under contraction conditions [Anahtarci et al., 2019, 2022, Cui and Koeppl, 2021a, Xie et al., 2021a, Zaman et al., 2022, Yardim et al., 2022]. Typically, these works design reinforcement learning algorithms to approximate the contraction operators in MFGs, and the NE is learned by iteratively applying these operators. In contrast, another line of work focuses on MFGs under monotone conditions. Motivated by Lasry and Lions [2007], the transition kernels in these works are independent of the mean fields. Perrin et al. [2020] proposed and analyzed the continuous-time fictitious play algorithm, which dynamically weighs the past mean fields and the best-response policies to derive new mean fields and policies. With an additional potential assumption, Geist et al. [2021] derived the convergence rate of the discrete-time fictitious play algorithm. Perolat et al. [2021] then proposed the continuous-time policy mirror descent algorithm but only proved asymptotic convergence, i.e., consistency. In summary, there is no convergence rate result for any discrete-time algorithm for MFGs under the monotone condition. In addition, the relationship between the contraction conditions and the monotone conditions is not clear from existing works; the two sets of conditions complement each other.

To capture the potential heterogeneity among agents, GMFGs were proposed by Parise and Ozdaglar [2019] in the static setting as an extension of MFGs. The heterogeneous interactions among agents are represented by graphons. Aurell et al. [2022b] and Caines and Huang [2019] then extended GMFGs to the continuous-time setting, where the existence and uniqueness of the NE were established. Vasal et al. [2020] formulated discrete-time GMFGs and provided a way to compute the NE with master equations. Given the exact underlying graphon values, Cui and Koeppl [2021b] proposed algorithms to learn the NE of GMFGs under the contraction condition by modifying MFG learning algorithms. Fabian et al. [2022] considered monotone GMFGs and proposed a continuous-time policy mirror descent algorithm to learn the NE. However, only consistency was established in the latter two works.

Notation: Let [N] := {1, ..., N}. Given a measurable space (Ω, F), we denote the collection of all measures and of all probability measures on (Ω, F) as M(Ω) and Δ(Ω), respectively. For a metric space (X, d), we use C(X) and Cb(X) to denote the set of all continuous functions and the set of all bounded continuous functions on X, respectively.
For a measurable space (X, F) and two distributions P, Q ∈ Δ(X), the total variation distance between them is defined as $\mathrm{TV}(P, Q) := \sup_{A \in F} |P(A) - Q(A)|$. A sequence of measures {µ_n} on X is said to converge weakly to a measure µ if $\int_X g(x)\,\mu_n(dx) \to \int_X g(x)\,\mu(dx)$ for all g ∈ Cb(X).

3 Preliminaries

3.1 Graphon Mean-Field Games

We consider a GMFG that is defined through a tuple (I, S, A, H, P, r, W, µ_1). The horizon (or length) of the game is denoted as H. A GMFG has infinitely many players, each corresponding to a point α ∈ I = [0, 1]. The state and action spaces are the same for all players and are denoted as S ⊆ R^{d_s} and A ⊆ R^{d_a}, respectively. The interaction among players is captured by graphons. Graphons are symmetric functions that map [0, 1]² to [0, 1]; symmetry here means that W(α, β) = W(β, α) for all α, β ∈ [0, 1]. We denote the space of graphons as $\mathcal{W} = \{W : [0,1]^2 \to [0,1] \mid W \text{ is symmetric}\}$. The set of graphons of the game is W = {W_h}_{h=1}^H with W_h ∈ $\mathcal{W}$.

The state transition and reward of each player are influenced by the collective behavior of all the other players through an aggregate z ∈ M(S). At time h ∈ [H], we denote the state distribution of player β ∈ I as µ^β_h ∈ Δ(S). The aggregate for player α ∈ I with the underlying graphon W_h ∈ $\mathcal{W}$ is then defined as
$$ z^\alpha_h = \int_0^1 W_h(\alpha, \beta)\, \mu^\beta_h \, d\beta. \tag{1} $$
The transition kernels P = {P_h}_{h=1}^H of the game are functions P_h : S × A × M(S) → Δ(S) for all h ∈ [H]. At time h, if player α takes action a^α_h ∈ A at state s^α_h ∈ S, her state transitions according to s^α_{h+1} ~ P_h(· | s^α_h, a^α_h, z^α_h). The reward functions are denoted as r_h : S × A × M(S) → [0, 1] for all h ∈ [H]. We note that the players in a GMFG are heterogeneous: different players will, in general, receive different aggregates from the other players. The distribution µ_1 ∈ Δ(S) is the initial state distribution of all the players.

A policy for a player α is π^α = {π^α_h}_{h=1}^H ∈ Π^H, where π^α_h : S → Δ(A) selects actions based only on the current state, and Π is the set of all such one-step policies. A policy for all the players π^I ∈ Π̃ = (Π^H)^I is the collection of the policies of the individual players, i.e., π^I = {π^α}_{α∈I}. In the following, we denote the state distributions of all the players at time h, and the state distributions of all the players at all times (the distribution flow), respectively as µ^I_h = {µ^α_h}_{α∈I} and µ^I = {µ^I_h}_{h=1}^H ∈ Δ(S)^{H×I}. Eqn. (1) shows that the aggregate z^α_h is a function of µ^I_h and W_h, so to make this dependence explicit, we also write it as z^α_h(µ^I_h, W_h).

We consider the entropy-regularized GMFG. It has been shown that such regularization makes policy gradient algorithms converge faster [Shani et al., 2020, Cen et al., 2022]. In this game, the reward of each player is the sum of the original reward and the negative entropy of the policy multiplied by a constant. In a λ-regularized GMFG (λ ≥ 0), the value function and the action-value function of player α with policy π^α on the MDP induced by the distribution flow µ^I are defined as
$$ V^{\lambda,\alpha}_h(s, \pi^\alpha, \mu^I) = \mathbb{E}_{\pi^\alpha}\bigg[ \sum_{t=h}^{H} r_t\big(s^\alpha_t, a^\alpha_t, z^\alpha_t(\mu^I_t, W_t)\big) - \lambda \log \pi^\alpha_t(a^\alpha_t \mid s^\alpha_t) \,\bigg|\, s^\alpha_h = s \bigg], \tag{2} $$
$$ Q^{\lambda,\alpha}_h(s, a, \pi^\alpha, \mu^I) = r_h\big(s, a, z^\alpha_h(\mu^I_h, W_h)\big) + \mathbb{E}_{\pi^\alpha}\big[ V^{\lambda,\alpha}_{h+1}(s^\alpha_{h+1}, \pi^\alpha, \mu^I) \,\big|\, s^\alpha_h = s, a^\alpha_h = a \big] \tag{3} $$
for all h ∈ [H], where the expectation $\mathbb{E}_{\pi^\alpha}$ is taken with respect to the stochastic process induced by implementing policy π^α on the MDP induced by µ^I. The cumulative reward of player α is defined as $J^{\lambda,\alpha}(\pi^\alpha, \mu^I) = \mathbb{E}_{\mu_1}[V^{\lambda,\alpha}_1(s, \pi^\alpha, \mu^I)]$.
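Since the aggregate in Eqn. (1) drives both the transitions and the rewards, a short numerical sketch may help fix ideas. The snippet below is our own illustration, not an object from the paper: the function name, the uniform grid of players, and the exponential-type graphon are all assumed for concreteness. It approximates z^α_h for finitely many players by a Riemann sum.

```python
import numpy as np

def aggregates(graphon, mu, alphas):
    """Approximate z^alpha_h = integral_0^1 W_h(alpha, beta) mu^beta_h d(beta)  (Eqn. (1))
    by a Riemann sum over a finite grid of players.

    graphon : vectorized function W(alpha, beta) with values in [0, 1]
    mu      : (num_players, num_states) array; row j is the state distribution
              mu^{alphas[j]}_h of the player at position alphas[j]
    alphas  : (num_players,) increasing grid of player labels in [0, 1]
    Returns a (num_players, num_states) array whose i-th row approximates z^{alphas[i]}_h.
    """
    d_beta = 1.0 / len(alphas)                      # uniform grid spacing
    W = graphon(alphas[:, None], alphas[None, :])   # pairwise graphon values W(alpha_i, alpha_j)
    return (W @ mu) * d_beta                        # graphon-weighted combination of neighbors' states

# Example: 50 players on a uniform grid, interacting through an exponential-type graphon
# (an illustrative stand-in only, not the exp-graphon used in Section 6).
alphas = (np.arange(50) + 0.5) / 50
mu_h = np.full((50, 4), 0.25)                       # every player uniform over 4 states
z_h = aggregates(lambda a, b: np.exp(-np.abs(a - b)), mu_h, alphas)
```

Note that for a constant graphon W(α, β) = p and a common state distribution µ_h, every row of the output equals p·µ_h, which matches the degenerate MFG aggregate discussed in Section 3.2 below.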
Then the notion of an NE is defined as follows.

Definition 1. An NE of the λ-regularized GMFG is a pair (π^{*,I}, µ^{*,I}) ∈ Π̃ × Δ(S)^{H×I} that satisfies:
(i) (Player rationality) $J^{\lambda,\alpha}(\pi^{*,\alpha}, \mu^{*,I}) = \max_{\tilde\pi^\alpha \in \Pi^H} J^{\lambda,\alpha}(\tilde\pi^\alpha, \mu^{*,I})$ for all α ∈ I, up to a zero-measure set on I with respect to the Lebesgue measure.
(ii) (Distribution consistency) The distribution flow µ^{*,I} is equal to the distribution flow induced by implementing the policy π^{*,I}.

Similar to the NE of finite-player games, the NE of the λ-regularized GMFG requires that the policy of each player is optimal. However, in GMFGs, the optimality is with respect to µ^{*,I}.

3.2 Mean-Field Games

As an important subclass of GMFGs, an MFG corresponds to a GMFG with a constant graphon, i.e., W(α, β) = p for all α, β ∈ I. MFGs involve infinitely many homogeneous players. All the players employ the same policy and thus share the same distribution flow. The aggregate in Eqn. (1) degenerates to $z^\alpha_h = \int_0^1 p\, \mu^\beta_h \, d\beta = p\, \mu_h$ for all α ∈ I, where µ_h is the state distribution of a representative player. Thus, an MFG is denoted by a tuple (S̃, Ã, H, P̃, r̃, µ̃_1). The state space, the action space, and the horizon are respectively denoted as S̃, Ã, and H. Here, the transition kernels P̃ = {P̃_h}_{h=1}^H are functions P̃_h : S̃ × Ã × Δ(S̃) → Δ(S̃), and the reward functions are r̃_h : S̃ × Ã × Δ(S̃) → [0, 1] for all h ∈ [H]. In the MFG, all the players adopt the same policy π = {π_h}_{h=1}^H with π_h ∈ Π. The value function and the action-value function of the λ-regularized MFG with the underlying distribution flow µ = {µ_h}_{h=1}^H ∈ Δ(S̃)^H can be defined analogously to Eqns. (2) and (3) as
$$ \tilde V^{\lambda}_h(s, \pi, \mu) = \mathbb{E}_{\pi}\bigg[ \sum_{t=h}^{H} \tilde r_t(s_t, a_t, \mu_t) - \lambda \log \pi_t(a_t \mid s_t) \,\bigg|\, s_h = s \bigg], $$
$$ \tilde Q^{\lambda}_h(s, a, \pi, \mu) = \tilde r_h(s, a, \mu_h) + \mathbb{E}_{\pi}\big[ \tilde V^{\lambda}_{h+1}(s_{h+1}, \pi, \mu) \,\big|\, s_h = s, a_h = a \big] $$
for all h ∈ [H]. The cumulative reward is $\tilde J^{\lambda}(\pi, \mu) = \mathbb{E}_{\tilde\mu_1}[\tilde V^{\lambda}_1(s, \pi, \mu)]$. The notion of an NE can be derived analogously as follows.

Definition 2. An NE of the λ-regularized MFG is a pair (π*, µ*) ∈ Π^H × Δ(S̃)^H that satisfies:
(i) (Player rationality) $\tilde J^{\lambda}(\pi^*, \mu^*) = \max_{\tilde\pi \in \Pi^H} \tilde J^{\lambda}(\tilde\pi, \mu^*)$.
(ii) (Distribution consistency) The distribution flow µ* is equal to the distribution flow induced by the policy π*.

Remark 1. Compared with Definition 1, the definition of an NE in an MFG only involves the policy and the distribution flow of a single representative player, since the agents in an MFG are homogeneous.

4 Existence of NEs in Regularized GMFGs and MFGs

We now state some assumptions used to establish the existence of an NE for λ-regularized GMFGs.

Assumption 1. The state space S is compact, and the action space A is finite.

This assumption imposes rather mild constraints on S and A. In real-world applications, the states are usually physical quantities and thus reside in a compact set. For the action space, many deep reinforcement learning algorithms discretize potentially continuous action sets into finite sets [Lowe et al., 2017, Mordatch and Abbeel, 2018].

Assumption 2. The graphons W_h for h ∈ [H] are continuous functions.

A stronger version of this assumption (Lipschitz continuity) is widely adopted in works on GMFGs [Cui and Koeppl, 2021b, Fabian et al., 2022]. It helps us to establish the continuity of the transition kernels and rewards with respect to the players.

Assumption 3. For all h ∈ [H], the reward function r_h(s, a, z) is continuous on S × A × M(S); that is, if (s_n, a_n, z_n) → (s, a, z) as n → ∞, then r_h(s_n, a_n, z_n) → r_h(s, a, z). The transition kernel P_h(· | s, a, z) is weakly continuous on S × A × M(S); that is, if (s_n, a_n, z_n) → (s, a, z) as n → ∞, then P_h(· | s_n, a_n, z_n) → P_h(· | s, a, z) weakly.
This assumption states the continuity of the model, i.e., of the transition kernels and the rewards, as functions of the state, action, and aggregate. We note that the Lipschitz continuity assumptions on the model in previous works imply that our assumption is satisfied [Cui and Koeppl, 2021b, Fabian et al., 2022]. Next, we state the existence result for regularized GMFGs.

Theorem 1. Under Assumptions 1, 2 and 3, for all λ ≥ 0, the λ-regularized GMFG (I, S, A, H, P, r, W, µ_1) admits an NE (π^I, µ^I) ∈ Π̃ × Δ(S)^{H×I}.

This theorem strengthens the previous existence results in Cui and Koeppl [2021b] and Fabian et al. [2022] in two respects. First, our assumptions are weaker: these two existing works require a finite state space and the Lipschitz continuity of the model, whereas we only need a compact state space and the continuity of the model. Second, their results only hold for the unregularized case (λ = 0), whereas ours holds for any λ ≥ 0. In the proof of Theorem 1, we construct an MFG from the given GMFG and show that an NE of the constructed MFG can be converted into an NE of the GMFG. Then we prove the existence of an NE in the constructed regularized MFG. Our existence result for the regularized MFG is also a significant addition to the MFG literature.

Remark 2. Although we show that an NE of the constructed MFG can be converted into an NE of the GMFG, this proof does not imply that GMFGs form a subclass of, or are equivalent to, MFGs. This is because we have only demonstrated a relationship between the NEs of these two games; the exact realizations of the GMFG and the conceptually constructed MFG may differ. That is, the sample paths of these two games, which include the realizations of the states, actions, and rewards of all the players, may not be the same.

We next state the assumption needed for the MFG.

Assumption 4. The MFG (S̃, Ã, H, P̃, r̃, µ̃_1) satisfies the following:
(i) The state space S̃ is compact, and the action space Ã is finite.
(ii) The reward functions r̃_h(s, a, µ) for h ∈ [H] are continuous on S̃ × Ã × Δ(S̃). The transition kernels are weakly continuous on S̃ × Ã × Δ(S̃); that is, if (s_n, a_n, µ_n) → (s, a, µ) as n → ∞, then P̃_h(· | s_n, a_n, µ_n) → P̃_h(· | s, a, µ) weakly.

Then the existence of the NE is stated as follows.

Theorem 2. Under Assumption 4, the λ-regularized MFG (S̃, Ã, H, P̃, r̃, µ̃_1) admits an NE (π, µ) ∈ Π^H × Δ(S̃)^H for any λ ≥ 0.

Our result in Theorem 2 imposes weaker conditions than previous works [Cui and Koeppl, 2021a, Anahtarci et al., 2022] to guarantee the existence of an NE. These existing works prove the existence of the NE by assuming a contraction property and the finiteness of the state space. They also require a strong Lipschitz assumption [Anahtarci et al., 2022], in which the Lipschitz constants of the model must be sufficiently small. In contrast, we only require the continuity assumption in Theorem 2. This is due to our analysis of the operator associated with the regularized MFG and the use of the Kakutani fixed-point theorem [Guide, 2006].

5 Learning the NE of Monotone GMFGs

In this section, we focus on GMFGs with transition kernels that are independent of the aggregate z, i.e., P_h : S × A → Δ(S) for h ∈ [H]. This model is motivated by the seminal work of Lasry and Lions [2007], in which the state evolution of a continuous-time MFG is characterized by the Fokker-Planck equation; the form of the Fokker-Planck equation makes the state transition of each player independent of the other players. This model is also widely adopted in the discrete-time GMFG and MFG literature [Fabian et al., 2022, Perrin et al., 2020, Perolat et al., 2021].
5.1 Monotone GMFG

We first generalize the notion of monotonicity from the multi-population MFGs in Perolat et al. [2021] to GMFGs.

Definition 3 (Weakly Monotone Condition). A GMFG is said to be weakly monotone if for any ρ^I, ρ̃^I ∈ Δ(S × A)^I and their marginalizations on the states µ^I, µ̃^I ∈ Δ(S)^I, we have
$$ \int_0^1 \int_{S} \sum_{a \in A} \big( \rho^\alpha(s, a) - \tilde\rho^\alpha(s, a) \big) \Big( r_h\big(s, a, z^\alpha_h(\mu^I, W_h)\big) - r_h\big(s, a, z^\alpha_h(\tilde\mu^I, W_h)\big) \Big) \, ds \, d\alpha \le 0 $$
for all h ∈ [H], where W_h is the underlying graphon. The GMFG is strictly weakly monotone if the inequality is strict whenever ρ^I ≠ ρ̃^I.

Given two MDPs induced by the distribution flows µ^I of π^I and µ̃^I of π̃^I, the weakly monotone condition states that the total reward at each stage h ∈ [H] does not decrease when the policies are swapped. This condition has two important implications. The first is the uniqueness of the NE.

Proposition 1. Under Assumptions 1, 2, and 3, a strictly weakly monotone λ-regularized GMFG has a unique NE for any λ ≥ 0, up to a zero-measure set on I with respect to the Lebesgue measure.

In the following, we denote this unique NE as (π^{*,I}, µ^{*,I}), and we aim to learn this NE. The second implication concerns the relationship between the cumulative rewards of two policies.

Proposition 2. If a λ-regularized GMFG satisfies the weakly monotone condition, then for any two policies π^I, π̃^I ∈ Π̃ and their induced distribution flows µ^I, µ̃^I, we have
$$ \int_0^1 \Big[ J^{\lambda,\alpha}(\pi^\alpha, \mu^I) + J^{\lambda,\alpha}(\tilde\pi^\alpha, \tilde\mu^I) - J^{\lambda,\alpha}(\tilde\pi^\alpha, \mu^I) - J^{\lambda,\alpha}(\pi^\alpha, \tilde\mu^I) \Big] \, d\alpha \le 0. $$
If the λ-regularized GMFG satisfies the strictly weakly monotone condition, then the inequality is strict whenever π^I ≠ π̃^I.

Proposition 2 shows that, given two policies, we can improve the cumulative rewards on the MDPs induced by these policies by swapping the policies. This implies an important property of the NE (π^{*,I}, µ^{*,I}). Since π^{*,I} is optimal on the MDP induced by µ^{*,I}, we have $\int_0^1 J^{\lambda,\alpha}(\pi^{*,\alpha}, \mu^{*,I}) \, d\alpha \ge \int_0^1 J^{\lambda,\alpha}(\pi^\alpha, \mu^{*,I}) \, d\alpha$ for any π^I ∈ Π̃. Proposition 2 then shows that $\int_0^1 J^{\lambda,\alpha}(\pi^{*,\alpha}, \mu^I) \, d\alpha \ge \int_0^1 J^{\lambda,\alpha}(\pi^\alpha, \mu^I) \, d\alpha$ for any policy π^I and the distribution flow µ^I it induces. That is, the NE policy gains cumulative rewards no less than any policy π^I on the MDP induced by π^I itself. This observation motivates the design of our NE learning algorithm.

5.2 Policy Mirror Descent Algorithm for Monotone GMFG

In this section, we introduce the algorithm to learn the NE, called Mono GMFG-PMD, whose pseudo-code is outlined in Algorithm 1.

Algorithm 1 Mono GMFG-PMD
1: Initialize π^α_{1,h}(· | s) = Unif(A) for all s ∈ S, h ∈ [H], and α ∈ I.
2: for t = 1, 2, ..., T do
3:   Compute the action-value function Q̂^{λ,α}_h(s, a, π^α_t, µ^I_t) for all α ∈ I and h ∈ [H], where µ^I_t is the distribution flow induced by π^I_t.
4:   π̂^α_{t+1,h}(a | s) ∝ π^α_{t,h}(a | s)^{1-λη_t} exp(η_t Q̂^{λ,α}_h(s, a, π^α_t, µ^I_t)) for all α ∈ I and h ∈ [H].
5:   π^α_{t+1,h}(· | s) = (1 - β_t) π̂^α_{t+1,h}(· | s) + β_t Unif(A).
6: end for
7: Output π^I = Unif(π^I_{[1:T]}).

The algorithm consists of three main steps. The first step estimates the action-value function (Line 3). We need to evaluate the action-value function of a policy on the MDP induced by itself. This estimate can be obtained for each player independently by playing π^I_t several times; we assume access to a sub-module for this step and quantify the estimation error in our analysis. The second step is the policy mirror descent update (Line 4). Given λη_t < 1, this step can be equivalently formulated as
$$ \hat\pi^\alpha_{t+1,h}(\cdot \mid s) = \operatorname*{argmax}_{p \in \Delta(A)} \ \eta_t \Big[ \big\langle \hat Q^{\lambda,\alpha}_h(s, \cdot, \pi^\alpha_t, \mu^I_t),\, p \big\rangle - \lambda R(p) \Big] - D_{\mathrm{KL}}\big( p \,\big\|\, \pi^\alpha_{t,h}(\cdot \mid s) \big) \quad \forall\, s \in S, $$
where R(p) = ⟨p, log p⟩ is the negative entropy function. This step aims to improve the performance of the policy π^I_t on the MDP it induces.
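As a minimal illustration of Lines 4-5 for a single player in a tabular setting, the following sketch is our own: the function name, the array shapes, and the numerical-stability trick are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def pmd_update(pi_t, q_hat, eta, lam, beta):
    """One Mono GMFG-PMD iteration (Lines 4-5 of Algorithm 1) for a single
    player and a single time step h, in a tabular setting.

    pi_t  : (num_states, num_actions) current policy pi_{t,h}(a | s)
    q_hat : (num_states, num_actions) estimated action-value Q-hat^{lambda}_h(s, a)
    eta   : step size eta_t (requires lam * eta < 1)
    lam   : entropy-regularization coefficient lambda
    beta  : mixing weight beta_t with the uniform policy
    """
    num_actions = pi_t.shape[1]
    # Line 4: discount the old policy by (1 - lam*eta) in log-space and add eta * Q-hat.
    logits = (1.0 - lam * eta) * np.log(pi_t + 1e-12) + eta * q_hat
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    pi_hat = np.exp(logits)
    pi_hat /= pi_hat.sum(axis=1, keepdims=True)        # normalize over actions
    # Line 5: mix with the uniform policy to encourage exploration.
    return (1.0 - beta) * pi_hat + beta / num_actions
```

In an implementation, this update would be applied for every sampled player and every time step h in each iteration t, with the Q-values supplied either by the oracle of Section 5.3 or by the estimator of Algorithm 2.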
Intuitively, since the NE policy π^{*,I} performs better than π^I_t on the MDP induced by µ^I_t, as shown in Section 5.1, the improved policy π^I_{t+1} should be closer to π^{*,I} than π^I_t is. The third step mixes the current policy with the uniform policy (Line 5) to encourage exploration.

Mono GMFG-PMD differs from the previous NE learning algorithms for monotone GMFGs in Perolat et al. [2021] and Fabian et al. [2022] in three ways. First, Mono GMFG-PMD is designed to learn the NE of the λ-regularized GMFG with λ > 0, whereas the other algorithms learn the NE of unregularized GMFGs. As a result, our policy improvement (Line 4) discounts the previous policy as (π^α_{t,h})^{1-λη_t}, whereas the other algorithms retain π^α_{t,h}. Second, our algorithm is discrete-time and thus amenable to practical implementation, whereas the other provably efficient algorithms evolve in continuous time. Finally, Mono GMFG-PMD encourages exploration in Line 5, which is important for the theoretical analysis; such a step is missing in the other algorithms.

5.3 Theoretical Analysis for Mono GMFG-PMD with an Estimation Oracle

This section provides a theoretical analysis of the Mono GMFG-PMD algorithm given an action-value function oracle in Line 3, i.e., Q̂^{λ,α}_h = Q^{λ,α}_h. We denote the unique NE of the λ-regularized GMFG as (π^{*,I}, µ^{*,I}). For any policy π^I, we measure its distance to the NE policy π^{*,I} using
$$ D(\pi^I) = \int_0^1 \sum_{h=1}^{H} \mathbb{E}_{\mu^{*,\alpha}_h} \Big[ D_{\mathrm{KL}}\big( \pi^{*,\alpha}_h(\cdot \mid s^\alpha_h) \,\big\|\, \pi^\alpha_h(\cdot \mid s^\alpha_h) \big) \Big] \, d\alpha. $$
This metric measures the KL divergence between the policy π^I and the NE policy, weighted by the NE distribution flow µ^{*,I}.

Theorem 3. Assume that the GMFG is strictly weakly monotone and that we have an action-value function oracle. Let η_t = η = O(T^{-1/2}) and β_t = β = O(T^{-1}) in Mono GMFG-PMD. Then, for any λ > 0, the output of Mono GMFG-PMD converges to the NE in the metric D at a rate of order
$$ O\bigg( \frac{\lambda \log^2 T}{\sqrt{T}} \bigg). $$

Theorem 3 provides the first convergence rate for a discrete-time algorithm on strictly weakly monotone GMFGs under mild assumptions. In contrast, Perolat et al. [2021] and Fabian et al. [2022] only consider continuous-time policy mirror descent, which is difficult to implement in practice, and only provide asymptotic consistency results. Geist et al. [2021] derive exploitability results for a fictitious play algorithm but require a potential structure and the Lipschitzness of the reward function.

Our proof of Theorem 3 mainly exploits the properties of the NE discussed in Section 5.1. Concretely, we use the fact that the policy mirror descent step reduces the distance between the policy iterate and the NE as
$$ D(\pi^I_{t+1}) \le D(\pi^I_t) + \int_0^1 \Big[ J^{\lambda,\alpha}(\pi^\alpha_t, \mu^I_t) - J^{\lambda,\alpha}(\pi^{*,\alpha}, \mu^I_t) \Big] \, d\alpha, $$
where the integral is non-positive by the property of the NE, so the policy iterate becomes closer to the NE. However, the discretization error and the influence of the exploration step (Line 5) also appear, requiring additional care to show that D(π^I_t), in general, decreases.

Algorithm 2 Estimation of the Action-Value Function
1: Sample N players {i/N}_{i=1}^N ⊂ [0, 1].
2: The i-th sampled player implements π^{b,i}_t for i ∈ [N], and the other players implement π^I_t.
3: Collect data {(s^i_{τ,h}, a^i_{τ,h}, r^i_{τ,h})}_{i,τ,h=1}^{N,K,H} of the sampled players from K independent episodes.
4: Initialize V̂^{λ,i}_{H+1}(s, a) = 0 for all s ∈ S, a ∈ A, and i ∈ [N].
5: for time h = H, ..., 1 do
6:   for player i = 1, ..., N (in parallel) do
7:     Q̂^{λ,i}_h = argmin_{f ∈ F_h} Σ_{τ=1}^K ( f(s^i_{τ,h}, a^i_{τ,h}) - r^i_{τ,h} - V̂^{λ,i}_{h+1}(s^i_{τ,h+1}) )².
8:     V̂^{λ,i}_h(s) = ⟨ Q̂^{λ,i}_h(s, ·), π^{i/N}_{t,h}(· | s) ⟩ - λ R( π^{i/N}_{t,h}(· | s) ).
9:   end for
10: end for
11: Output { Q̂^{λ,i}_h }_{i,h=1}^{N,H}.
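The sketch below illustrates the backward recursion of Algorithm 2 for one sampled player. It is our own illustration under simplifying assumptions: the function class F_h is taken, for concreteness, to be linear in a generic feature map `phi`, and all names and data layouts are hypothetical rather than the paper's implementation.

```python
import numpy as np

def fit_q_functions(data, pi, lam, num_actions, phi):
    """Backward fitted-Q evaluation for a single sampled player, mirroring the
    recursion in Lines 5-10 of Algorithm 2 with a linear function class.

    data : list of length H; data[h] = (S, A, R, S_next), arrays of the K
           transitions observed at step h
    pi   : list of length H; pi[h](s) returns the length-num_actions vector
           pi_{t,h}(. | s) of the evaluated policy
    lam  : regularization strength lambda
    phi  : feature map phi(s, a) -> 1-D numpy array
    """
    H = len(data)
    weights = [None] * H                          # weights[h] parametrizes Q-hat at step h

    def q_value(h, s, a):
        return float(phi(s, a) @ weights[h])

    def v_value(h, s):
        if h >= H:                                # Line 4: V-hat_{H+1} = 0
            return 0.0
        # Line 8: <Q-hat(s, .), pi_h(. | s)> - lam * R(pi_h(. | s)), with R the negative entropy.
        p = pi[h](s)
        q = np.array([q_value(h, s, a) for a in range(num_actions)])
        return float(p @ q - lam * np.sum(p * np.log(p + 1e-12)))

    for h in reversed(range(H)):                  # h = H-1, ..., 0 (i.e., steps H, ..., 1)
        S, A, R, S_next = data[h]
        # Line 7: least-squares regression of r + V-hat_{h+1}(s') onto features of (s, a).
        y = np.asarray(R) + np.array([v_value(h + 1, s) for s in S_next])
        X = np.stack([phi(s, a) for s, a in zip(S, A)])
        weights[h], *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights
```

In Algorithm 2 this recursion runs in parallel over the N sampled players, each using its own behavior-policy data while evaluating its own policy π^{i/N}_t.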
5.4 Theoretical Analysis for Mono GMFG-PMD with General Function Classes

In this section, we remove the requirement of oracle access to the action-value function and instead estimate it in Line 3 of Mono GMFG-PMD using general function classes. We consider the action-value function class F = F_1 × ··· × F_H, where F_h ⊆ {f : S × A → [0, (H - h + 1)(1 + λ log|A|)]} is the class of action-value functions at time h ∈ [H]. We then estimate the action-value functions using Algorithm 2. This algorithm involves two main steps. The first involves data collection (Line 3): we assign policies to players and collect data from their interactions. We let the N sampled players implement behavior policies {π^{b,i}_t}_{i=1}^N, which can differ from {π^{i/N}_t}_{i=1}^N. This does not change the aggregate z^α_h(µ^I_h, W_h) for any α ∈ I, since only a zero-measure set of players change their policies. The second is the action-value function estimation (Lines 7 and 8): the action-value function is selected based on the previous value function, and the value function is then updated from the derived estimate. This can be implemented in parallel for all the sampled players. We highlight that the estimation analysis cannot directly leverage results from general non-parametric regression [Wainwright, 2019], since the response variable V̂^{λ,i}_{h+1}(s^i_{τ,h+1}) is not independent of s^i_{τ,h+1} in our setting.

Assumption 5 (Realizability). For any policy π^I ∈ Π̃ and the induced distribution flow µ^I, we have Q^{λ,α}_h(·, ·, π^α, µ^I) ∈ F_h for all h ∈ [H].

This assumption ensures that we can find the nominal action-value function in the function class. For a policy π ∈ Π^H and a function f : S × A → R, we define the operator
$$ (\mathcal{T}^\pi_h f)(s, a) = \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)} \Big[ \big\langle f(s', \cdot), \pi_{h+1}(\cdot \mid s') \big\rangle - \lambda R\big( \pi_{h+1}(\cdot \mid s') \big) \Big]. $$
For a policy π^I and the induced distribution flow µ^I, we have $Q^{\lambda,\alpha}_h(s, a, \pi^\alpha, \mu^I) = r_h(s, a, z^\alpha_h) + (\mathcal{T}^{\pi^\alpha}_h Q^{\lambda,\alpha}_{h+1})(s, a)$.

Assumption 6 (Completeness). For any policy π^I ∈ Π̃ and the induced distribution flow µ^I, we have that for all f ∈ F_{h+1}, $r_h(\cdot, \cdot, z^\alpha_h(\mu^I_h, W_h)) + (\mathcal{T}^{\pi^\alpha}_h f) \in F_h$ for all α ∈ I and h ∈ [H - 1].

This completeness assumption ensures that the estimates from F also satisfy the relationship between the nominal action-value functions through $\mathcal{T}^{\pi^\alpha}_h$. Such realizability and completeness assumptions are widely adopted in the off-policy evaluation and offline reinforcement learning literature [Uehara et al., 2022, Xie et al., 2021b].

Assumption 7. The reward functions {r_h}_{h=1}^H are Lipschitz in z, i.e., |r_h(s, a, z) - r_h(s, a, z′)| ≤ L_r ‖z - z′‖_1 for all s ∈ S, a ∈ A, and h ∈ [H]. The graphons W_h for h ∈ [H] are Lipschitz continuous, i.e., there exists a constant L_W > 0 (depending only on W = {W_h}_{h=1}^H) such that |W_h(α, β) - W_h(α′, β′)| ≤ L_W(|α - α′| + |β - β′|) for all α, α′, β, β′ ∈ [0, 1] and h ∈ [H].

The Lipschitzness assumption is common in works on GMFGs [Parise and Ozdaglar, 2019, Carmona et al., 2022, Cui and Koeppl, 2021b]. It allows us to approximate the action-value function of any player by that of a nearby sampled player.
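To see why a nearby sampled player is a good proxy, the following back-of-the-envelope bound (our own illustration, not a lemma quoted from the paper) combines the two parts of Assumption 7. For α ∈ [0, 1] and its nearest sampled label α_i = i/N (so that |α - α_i| ≤ 1/N), and since each µ^β_h is a probability measure,
$$ \big\| z^\alpha_h(\mu^I_h, W_h) - z^{\alpha_i}_h(\mu^I_h, W_h) \big\|_1 \;\le\; \int_0^1 \big| W_h(\alpha, \beta) - W_h(\alpha_i, \beta) \big| \, d\beta \;\le\; L_W |\alpha - \alpha_i| \;\le\; \frac{L_W}{N}, $$
and hence $|r_h(s, a, z^\alpha_h) - r_h(s, a, z^{\alpha_i}_h)| \le L_r L_W / N$ for every (s, a). This per-step reward gap of order O(N^{-1}) is what ultimately drives the approximation-error term in Theorem 4 below.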
We denote the state distributions of player i induced by the behavior policy π^{b,i}_t as µ^{b,i}_t. We then require the behavior policies {π^{b,i}_t}_{i=1}^N to satisfy the following requirement.

Figure 1: Simulation results for the Beach Bar problem with (a) SBM graphons and (b) exp-graphons.

Assumption 8. For any t ∈ [T], the behavior policies explore sufficiently. More precisely, for any policy π ∈ Π^H and its induced distributions µ ∈ Δ(S)^H, we have $\sup_{s \in S, a \in A} \pi_h(a \mid s) / \pi^{b,i}_{t,h}(a \mid s) \le C_1$ and $\sup_{s \in S} d\mu_h(s) / d\mu^{b,i}_{t,h}(s) \le C_2$ for all h ∈ [H] and i ∈ [N], where C_1, C_2 > 0 are constants.

This assumption guarantees that the behavior policies explore the actions that may be adopted by π^I_t and π^{*,I}. Such an assumption is widely adopted in offline reinforcement learning and off-policy evaluation works [Uehara et al., 2022, Xie et al., 2021b].

Theorem 4. Assume that the GMFG is weakly monotone and that Assumptions 5, 6, 7, and 8 hold. Let η_t = η = O(T^{-1/2}) and β_t = β = O(T^{-1}) in Algorithm 1 (Mono GMFG-PMD). Then, with probability at least 1 - δ, Algorithms 1 and 2 converge to the NE in the metric D at a rate of order
$$ O\bigg( \frac{\lambda \log^2 T}{\sqrt{T}} + \frac{C_1 C_2 H^{3/2} B_H^2}{\lambda \sqrt{K}} \log\frac{T N H \, \mathcal{N}(5 B_H / K, \mathcal{F}_{[H]})}{\delta} + \frac{H \log T}{N} \bigg), $$
where B_H = H(1 + λ log|A|), and $\mathcal{N}(5 B_H / K, \mathcal{F}_{[H]}) = \max_{h \in [H]} \mathcal{N}(5 B_H / K, \mathcal{F}_h)$ is the ℓ∞-covering number of the function class.

The error in Theorem 4 consists of both an optimization error and an estimation error. The optimization error corresponds to the first term, which also appears in Theorem 3. The estimation error consists of the generalization error and the approximation error, in the second and third terms, respectively. When the function class F is finite, the second term scales as O(K^{-1/2}); it originates from the fact that we estimate the action-value function by minimizing the empirical error instead of its population counterpart. The approximation error scales as O(N^{-1}); it originates from the fact that the action-value function of player α is approximated by that of the sampled player nearest to α. To learn a policy that is at most ε > 0 far from the NE, we can set T = O(ε^{-2}), K = O(ε^{-2}), and N = O(ε^{-1}), which in total requires TK = O(ε^{-4}) episodes of online play.

6 Experiments

In this section, we conduct experiments to corroborate our theoretical findings. We run different algorithms on the Beach Bar problem [Perrin et al., 2020, Fabian et al., 2022]. The underlying graphons are set to Stochastic Block Model (SBM) graphons and exp-graphons. The details of the experiments are deferred to Appendix B. Since the NEs of these games are not available in closed form, we adopt the exploitability to measure the proximity between a policy and the NE. For a policy π^I and its induced distribution flow µ^I, the exploitability for the λ-regularized GMFG is defined as
$$ \mathrm{Exploit}(\pi^I) = \int_0^1 \Big[ \max_{\tilde\pi \in \Pi^H} J^{\lambda,\alpha}(\tilde\pi, \mu^I) - J^{\lambda,\alpha}(\pi^\alpha, \mu^I) \Big] \, d\alpha. $$
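In a small tabular instance with a known model, the integrand above can be evaluated by backward induction: the best-response value of the λ-regularized MDP satisfies a soft Bellman recursion, while the player's own value follows from policy evaluation. The sketch below is our own illustration of this computation for a single player; the tensor layout, the names, and the assumption that the player's aggregate-dependent rewards are precomputed are hypothetical choices, not the paper's evaluation code.

```python
import numpy as np
from scipy.special import logsumexp

def exploitability_gap(P, r, pi, mu1, lam):
    """Per-player gap max_{pi'} J^{lam,alpha}(pi', mu^I) - J^{lam,alpha}(pi^alpha, mu^I)
    for one player alpha in a finite (tabular) instance with a known model.

    P   : (H, S, A, S) transition kernels P_h(s' | s, a) (independent of the aggregate)
    r   : (H, S, A) rewards r_h(s, a, z^alpha_h) with the player's aggregate already baked in
    pi  : (H, S, A) the player's policy pi^alpha_h(a | s)
    mu1 : (S,) initial state distribution
    lam : regularization strength lambda > 0
    """
    H = r.shape[0]
    v_star = np.zeros(r.shape[1])     # optimal lambda-regularized value (best response)
    v_pi = np.zeros(r.shape[1])       # value of the player's own policy
    for h in reversed(range(H)):
        q_star = r[h] + np.einsum('sap,p->sa', P[h], v_star)
        v_star = lam * logsumexp(q_star / lam, axis=1)             # soft Bellman optimality step
        q_pi = r[h] + np.einsum('sap,p->sa', P[h], v_pi)
        v_pi = np.sum(pi[h] * (q_pi - lam * np.log(pi[h] + 1e-12)), axis=1)
    return float(mu1 @ (v_star - v_pi))

# Exploit(pi^I) is then approximated by averaging exploitability_gap over a grid of
# sampled players, each with rewards computed from its own aggregate z^alpha_h.
```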
First, the experimental results demonstrate the necessity of modelling the heterogeneity of agents. Figure 1 shows the performance degradation incurred by approximating a GMFG with an MFG. Here we let the agents play in the GMFG with constant graphons W_h(α, β) = p for p ∈ {0, 0.5, 1}, with oracle access to the action-value function. We observe that this approximation results in gross errors when learning the NEs of GMFGs with non-constant graphons. Second, the experiments show that algorithms designed for unregularized GMFGs cannot learn the NE of a regularized GMFG. We implement a discrete-time version of the algorithm in Fabian et al. [2022]; the results, marked Unreg PMD, show that the exploitability first decreases and then increases. In line with the discussion in Section 5.2, this originates from retaining too much gradient knowledge from the previous iterates π^I_t: the policy gradient is largely correct in the first several iterations, but the accumulated past knowledge causes the policy to deviate in later iterations. In contrast, our algorithm discounts the past knowledge as (π^I_t)^{1-λη_t}. Finally, the results indicate the influence of the action-value function estimation. In the experiments, we run our algorithm with (N, K) = (5, 300), (10, 100), and (10, 300). Figure 1 shows that the algorithm with N = 10, K = 300 achieves a smaller error than the algorithms with N = 10, K = 100 and with N = 5, K = 300. This is in agreement with Theorem 4.

7 Conclusion

In this paper, we focused on two fundamental problems of λ-regularized GMFGs. First, we established the existence of the NE; this result substantially weakens the conditions required in previous works. Second, we proposed and analyzed provably efficient NE learning algorithms for weakly monotone GMFGs motivated by Lasry and Lions [2007]. The convergence rate of Mono GMFG-PMD constitutes the first performance guarantee for a discrete-time algorithm in monotone GMFGs without extra conditions. We leave the derivation of lower bounds for this problem to future work.

Acknowledgments and Disclosure of Funding

Fengzhuo Zhang and Vincent Tan acknowledge funding by the Singapore Data Science Consortium (SDSC) Dissertation Research Fellowship, the Singapore Ministry of Education Academic Research Fund (AcRF) Tier 2 under grant number A-8000423-00-00, and AcRF Tier 1 under grant numbers A-8000980-00-00 and A-8000189-01-00. Zhaoran Wang acknowledges National Science Foundation (Awards 2048075, 2008827, 2015568, 1934931), Simons Institute (Theory of Reinforcement Learning), Amazon, J.P. Morgan, and Two Sigma for their support.

References

Y. Achdou, P. Cardaliaguet, F. Delarue, A. Porretta, F. Santambrogio, Y. Achdou, and M. Laurière. Mean field games and applications: Numerical aspects. Mean Field Games: Cetraro, Italy 2019, pages 249-307, 2020.

B. Anahtarci, C. D. Kariksiz, and N. Saldi. Fitted Q-learning in mean-field games. arXiv preprint arXiv:1912.13309, 2019.

B. Anahtarci, C. D. Kariksiz, and N. Saldi. Q-learning in regularized mean-field games. Dynamic Games and Applications, pages 1-29, 2022.

A. Aurell, R. Carmona, G. Dayanıklı, and M. Laurière. Finite state graphon games with applications to epidemics. Dynamic Games and Applications, 12(1):49-81, 2022a.

A. Aurell, R. Carmona, and M. Lauriere. Stochastic graphon games: II. The linear-quadratic case. Applied Mathematics & Optimization, 85(3):1-33, 2022b.

D. Bertsekas and S. E. Shreve. Stochastic Optimal Control: The Discrete-Time Case, volume 5. Athena Scientific, 1996.

Q. Cai, Z. Yang, C. Jin, and Z. Wang. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pages 1283-1294. PMLR, 2020.

P. E. Caines and M. Huang. Graphon mean field games and the GMFG equations: ε-Nash equilibria. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 286-292. IEEE, 2019.

R. Carmona, D. B. Cooney, C. V. Graves, and M. Lauriere. Stochastic graphon games: I. The static case. Mathematics of Operations Research, 47(1):750-778, 2022.

S. Cen, C. Cheng, Y. Chen, Y. Wei, and Y. Chi. Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 70(4):2563-2578, 2022.

A. Cousin, S. Crépey, O. Guéant, D. Hobson, M. Jeanblanc, J. Lasry, J. Laurent, P. Lions, P. Tankov, and O. Guéant. Mean field games and applications. Paris-Princeton Lectures on Mathematical Finance 2010, pages 205-266, 2011.

K. Cui and H. Koeppl. Approximately solving mean field games via entropy-regularized deep reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1909-1917. PMLR, 2021a.
K. Cui and H. Koeppl. Learning graphon mean field games and approximate Nash equilibria. In International Conference on Learning Representations, 2021b.

C. Fabian, K. Cui, and H. Koeppl. Learning sparse graphon mean field games. arXiv preprint arXiv:2209.03880, 2022.

M. Geist, B. Scherrer, and O. Pietquin. A theory of regularized Markov decision processes. In International Conference on Machine Learning, pages 2160-2169. PMLR, 2019.

M. Geist, J. Pérolat, M. Laurière, R. Elie, S. Perrin, O. Bachem, R. Munos, and O. Pietquin. Concave utility reinforcement learning: The mean-field game viewpoint. arXiv preprint arXiv:2106.03787, 2021.

A. H. Guide. Infinite Dimensional Analysis. Springer, 2006.

L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression, volume 1. Springer, 2002.

K. Hinderer. Foundations of Non-stationary Dynamic Programming with Discrete Time Parameter, 1970.

M. Huang, R. P. Malhamé, and P. E. Caines. Large population stochastic dynamic games: Closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. 2006.

H.-J. Langen. Convergence of dynamic programming models. Mathematics of Operations Research, 6(4):493-512, 1981.

J. Lasry and P. Lions. Mean field games. Japanese Journal of Mathematics, 2(1):229-260, 2007.

R. Lowe, Y. I. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems, 30, 2017.

I. Mordatch and P. Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

F. Parise and A. Ozdaglar. Graphon games. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 457-458, 2019.

J. Perolat, S. Perrin, R. Elie, M. Laurière, G. Piliouras, M. Geist, K. Tuyls, and O. Pietquin. Scaling up mean field games with online mirror descent. arXiv preprint arXiv:2103.00623, 2021.

S. Perrin, J. Pérolat, M. Laurière, M. Geist, R. Elie, and O. Pietquin. Fictitious play for mean field games: Continuous time analysis and applications. Advances in Neural Information Processing Systems, 33:13199-13213, 2020.

L. Shani, Y. Efroni, and S. Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5668-5675, 2020.

E. Sonu, Y. Chen, and P. Doshi. Decision-theoretic planning under anonymity in agent populations. Journal of Artificial Intelligence Research, 59:725-770, 2017.

M. Uehara, C. Shi, and N. Kallus. A review of off-policy evaluation in reinforcement learning. arXiv preprint arXiv:2212.06355, 2022.

D. Vasal, R. K. Mishra, and S. Vishwanath. Master equation of discrete time graphon mean field games and teams. arXiv preprint arXiv:2001.05633, 2020.

M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

L. Wang, Z. Yang, and Z. Wang. Breaking the curse of many agents: Provable mean embedding Q-iteration for mean-field reinforcement learning. In International Conference on Machine Learning, pages 10092-10103. PMLR, 2020.

Q. Xie, Z. Yang, Z. Wang, and A. Minca. Learning while playing in mean-field games: Convergence and optimality. In International Conference on Machine Learning, pages 11436-11447. PMLR, 2021a.
T. Xie, C. Cheng, N. Jiang, P. Mineiro, and A. Agarwal. Bellman-consistent pessimism for offline reinforcement learning. Advances in Neural Information Processing Systems, 34:6683-6694, 2021b.

B. Yardim, S. Cayci, M. Geist, and N. He. Policy mirror ascent for efficient and independent learning in mean field games. arXiv preprint arXiv:2212.14449, 2022.

M. A. uz Zaman, A. Koppel, S. Bhatt, and T. Basar. Oracle-free reinforcement learning in mean-field games along a single sample path. arXiv preprint arXiv:2208.11639, 2022.