# Learning in Congestion Games with Bandit Feedback

Qiwen Cui¹ (qwcui@cs.washington.edu), Zhihan Xiong¹ (zhihanx@cs.washington.edu), Maryam Fazel² (mfazel@uw.edu), Simon S. Du¹ (ssdu@cs.washington.edu)
¹ Paul G. Allen School of Computer Science & Engineering, University of Washington
² Department of Electrical & Computer Engineering, University of Washington
Equal contribution. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

In this paper, we investigate Nash-regret minimization in congestion games, a class of games with benign theoretical structure and broad real-world applications. We first propose a centralized algorithm based on the optimism-in-the-face-of-uncertainty principle for congestion games with (semi-)bandit feedback, and obtain finite-sample guarantees. We then propose a decentralized algorithm via a novel combination of the Frank-Wolfe method and G-optimal design. By exploiting the structure of the congestion game, we show that the sample complexity of both algorithms depends only polynomially on the number of players and the number of facilities, and not on the size of the action set, which can be exponentially large in the number of facilities. We further define a new problem class, Markov congestion games, which allows us to model non-stationarity in congestion games. We propose a centralized algorithm for Markov congestion games whose sample complexity again has only polynomial dependence on all relevant problem parameters, and no dependence on the size of the action set.

1 Introduction

The Nash equilibrium (NE) is a widely adopted concept in the game theory community, used to describe the behavior of multi-agent systems with selfish players [Roughgarden, 2010]. At a Nash equilibrium, no player has an incentive to change her own strategy unilaterally, which implies that it is a steady state of the game dynamics. For a general-sum game, computing a Nash equilibrium is PPAD-hard [Daskalakis, 2013] and the query complexity is exponential in the number of players [Rubinstein, 2016]. To help address these issues, a natural approach is to consider games with special structure. In this paper, we focus on congestion games.

Congestion games are general-sum games with facilities (resources) shared among noncooperative players [Rosenthal, 1973]. During the game, each player decides which combination of facilities to utilize, and popular facilities become congested, which results in a possibly higher cost for each user. One example of a congestion game is the routing game [Fotakis et al., 2002], where each player needs to travel from a given starting point to a destination point through shared routes. The routes are represented as a traffic graph and the facilities are the edges. Each player decides which path to take, and the more players use the same edge, the longer the travel time on that edge becomes. Congestion games also have wide applications in electrical grids [Ibars et al., 2010], internet routing [Al-Kashoash et al., 2017] and rate allocation [Johari and Tsitsiklis, 2004]. In many real-world scenarios, players only receive (semi-)bandit feedback, i.e., players know only the payoff of the facilities they choose. This kind of learning under uncertainty has been widely studied in bandits and reinforcement learning for the single-agent setting, while theoretical understanding of the multi-agent case is still largely missing.
There are two types of algorithms for multi-agent systems, namely centralized algorithms and decentralized algorithms. For centralized algorithms, there exists a central authority that can control, and receive feedback from, all players in the game. Thanks to this global coordination, centralized algorithms usually enjoy favorable performance. On the other hand, such a central authority may not always be available in practice, so people turn to decentralized algorithms, in which each player makes decisions individually and only observes her own feedback. However, decentralized algorithms are vulnerable to non-stationarity, because each player is making decisions in a non-stationary environment as other players' strategies change [Zhang et al., 2021a]. In this paper, we study both centralized and decentralized algorithms for congestion games with bandit feedback, and we provide motivating scenarios for both in Section 1.2.

The main challenge in designing algorithms for $m$-player congestion games with bandit feedback is the curse of the exponential action set: the number of actions can be exponential in the number of facilities $F$, because every subset of facilities can be an action. As a result, an efficient algorithm should have sample complexity polynomial in $m$ and $F$, with no dependence on the size of the action space. One closely related type of general-sum game is the potential game, in which the change in each individual's payoff resulting from a strategy modification can be quantified by a common potential function. It is well known that all congestion games are potential games, and that each potential game has an equivalent congestion game formulation [Monderer and Shapley, 1996]. However, existing algorithms designed for potential games all have sample complexity scaling at least linearly in the number of actions [Leonardos et al., 2021, Ding et al., 2022], which is inefficient for congestion games. This motivates the following question:

*Can we design provably sample-efficient centralized and decentralized learning algorithms for congestion games with bandit feedback?*

We provide an affirmative answer to this question. To be precise, we use Nash-regret minimization (formally defined in Section 3) as our objective for learning in congestion games. This regret-like objective commonly appears in the online learning and reinforcement learning literature [Orabona, 2019, Ding et al., 2022, Liu et al., 2021], which focuses on finite-time analysis and cumulative rewards throughout the learning process rather than asymptotic behavior. In general, a sublinear Nash regret implies best-iterate convergence, meaning that the algorithm has reached an approximate Nash equilibrium at least once, while the converse does not hold. We highlight our contributions below and compare our results with previous algorithms in Table 1. We prove sublinear Nash regret for all of our algorithms (the last four rows of Table 1). In Table 1, sample complexity refers to the number of samples required to reach best-iterate convergence to an $\epsilon$-approximate Nash equilibrium, and the results are obtained by the standard online-to-batch conversion as in Section 3.1 of Jin et al. [2018].

1.1 Main Novelties and Contributions

1. Centralized algorithm for congestion games. We adapt the principle of optimism in the face of uncertainty from stochastic bandits to ensure sufficient exploration in congestion games.
We begin with congestion games with semi-bandit feedback, in which each player observes the reward of every facility in her chosen action. Instead of estimating action rewards as in stochastic multi-armed bandits, we estimate the facility rewards directly, which removes the dependence on the size of the action space. We then consider congestion games with bandit feedback, in which each player only observes her overall reward. In this setting, we borrow ideas from linear bandits to estimate the reward function and analyze the algorithm. The algorithm is provably sample-efficient in both cases.

2. Decentralized algorithm for congestion games. Our decentralized algorithm is a Frank-Wolfe method with exploration, in which each player only observes her own actions and rewards. To explore efficiently in the congestion game, we utilize a G-optimal design allocation for bandit feedback and a specific distribution for semi-bandit feedback. As a result, the sample complexity does not depend on the number of actions. In addition, the smoothness parameter of the potential function with respect to the $L_1$ norm does not depend on the number of actions, which is exploited by the Frank-Wolfe method. With these two algorithmic designs tailored to congestion games, we give the first decentralized algorithm, for both semi-bandit and bandit feedback, whose guarantees have no dependence on the size of the action space.

Table 1: Comparison of algorithms for congestion games in terms of sample complexity and Nash regret. IPPG stands for independent projected policy gradient, IPGA stands for independent policy gradient ascent, I denotes the semi-bandit feedback setting and II denotes the bandit feedback setting. Bandit feedback is assumed for algorithms from previous work. Here, $A_i$ is the size of player $i$'s action space, $m$ is the number of players, $A_{\max} = \max_{i\in[m]} A_i$, $F$ is the number of facilities and $T$ is the number of samples collected. Our algorithms are Nash-UCB (I/II) and Frank-Wolfe with Exploration (I/II).

| Algorithm | Sample complexity | Nash regret | Decentralized |
| --- | --- | --- | --- |
| Nash-VI [Liu et al., 2021] | $(\prod_{i=1}^m A_i)F/\epsilon^2$ | $\sqrt{(\prod_{i=1}^m A_i)FT}$ | No |
| V-learning [Jin et al., 2021a] | $A_{\max}F/\epsilon^2$ (CCE) | NA | Yes |
| IPPG [Leonardos et al., 2021] | $A_{\max}mF/\epsilon^6$ | NA | Yes |
| IPGA [Ding et al., 2022] | $A_{\max}^2 m^3F^5/\epsilon^5$ | $mF^{4/3}A_{\max}T^{4/5}$ | Yes |
| Nash-UCB I (ours) | $mF^2/\epsilon^2$ | $F\sqrt{mT}$ | No |
| Nash-UCB II (ours) | $m^2F^3/\epsilon^2$ | $mF^{3/2}\sqrt{T}$ | No |
| Frank-Wolfe with Exploration I (ours) | $m^{12}F^9/\epsilon^6$ | $m^2F^{3/2}T^{5/6}$ | Yes |
| Frank-Wolfe with Exploration II (ours) | $m^{12}F^{12}/\epsilon^6$ | $m^2F^2T^{5/6}$ | Yes |

3. Centralized algorithm for independent Markov congestion games. We extend the formulation of congestion games to a Markov setting and propose the independent Markov congestion game (IMCG), in which each facility has its own internal state and state transitions happen independently across facilities. In Section 1.2, we give examples that fit this model. By utilizing techniques from factored MDPs, we extend our centralized algorithms for congestion games to efficiently solve IMCGs, with both semi-bandit and bandit feedback.

1.2 Motivating Examples

We provide an example here to motivate our proposed models. See Section 3 for the formal definitions of (semi-)bandit feedback and (Markov) congestion games, and Appendix A for additional examples.

Example 1 (Routing Games). In a routing game, there are multiple players in a traffic graph travelling from starting points to destination points, and the facilities are the edges (roads). The cost of each edge is the waiting time, which depends on the number of players using that edge.
Centralized algorithm for routing games: Imagine that each player uses Google Maps to navigate. Then Google Maps can serve as a center that knows the starting and destination points of all players, as well as real-time feedback on the waiting time each player experiences on each edge. Google Maps itself also has an incentive to assign paths according to a Nash equilibrium strategy: each player then finds that deviating from the navigation brings no benefit, and thus sticks to the app.

Decentralized algorithm for routing games: Consider the case where players still use Google Maps but, due to privacy concerns or limited bandwidth, only use the offline version, which has access only to the information of each single user. Then Google Maps needs a decentralized algorithm so that it can still assign a Nash equilibrium strategy to each user after repeated plays.

Markov routing games: In a Markov routing game, the time cost of each edge changes between timesteps, which is a more accurate model of the real world. For instance, some roads are prone to car accidents, which result in an increased cost at the next timestep, and the chance of an accident also depends on the number of players currently using that edge. This is modeled by the Markovian facility-state transitions in independent Markov congestion games.

2 Related Work

Potential Games. Potential games are general-sum games that admit a common potential function quantifying the changes in individuals' payoffs [Monderer and Shapley, 1996]. The algorithmic game theory community has studied how different dynamics converge to the Nash equilibrium, e.g., best-response dynamics [Durand, 2018, Swenson et al., 2018] and no-regret dynamics [Heliou et al., 2017, Cheung and Piliouras, 2020], though these works usually provide only asymptotic convergence, in either the full-information or the bandit-feedback setting. Recently, the reinforcement learning community has studied Markov potential games with bandit feedback, whose results can be applied to standard potential games; see the Markov games paragraph below for more details.

Congestion Games. Congestion games were introduced in the seminal work of Rosenthal [1973], and Monderer and Shapley [1996] later established a close connection between congestion games and potential games. Congestion games are divided into atomic and non-atomic congestion games, depending on whether each player contributes a discrete unit of congestion or an infinitesimal amount. Many papers consider non-atomic congestion games with non-decreasing cost functions, which imply a convex potential function [Roughgarden and Tardos, 2004]. We consider the more difficult atomic congestion games, where the potential function can be non-convex. For the online non-atomic case, Krichene et al. [2015] consider the partial-information setting and provide convergence in the sense of Cesàro means. Kleinberg et al. [2009] and Krichene et al. [2014] show that some no-regret online learning algorithms asymptotically converge to the Nash equilibrium. Chen and Lu [2015, 2016] are two closely related works that consider bandit feedback in atomic congestion games and provide non-asymptotic convergence. However, they still assume a convex potential function, and their sample complexity depends exponentially on the number of facilities, which is far from ideal.

Markov Games. Markov games have been widely studied since the seminal work of Shapley [1953]. Recently, the topic has received much attention due to advances in reinforcement learning theory. Liu et al.
[2021] provides a centralized algorithm for learning the Nash equilibrium in general-sum Markov games, and [Jin et al., 2021a, Song et al., 2021] provide decentralized algorithms for learning the (coarse) correlated equilibrium. One closely related line of research is on Markov potential games [Leonardos et al., 2021, Zhang et al., 2021b, Fox et al., 2021, Cen et al., 2022, Ding et al., 2022]. However, applying their algorithms to congestion games leads to explicit dependence on the number of actions, which would be exponentially worse than our algorithms. See Table 1 for comparisons. Our independent Markov congestion game is motivated by the state-based potential games studied in Marden [2012] and Macua et al. [2018], and its transition kernel is closely related to that of factored MDPs, for which single-agent algorithms are studied in [Osband and Van Roy, 2014, Chen et al., 2020, Xu and Tewari, 2020, Tian et al., 2020, Rosenberg and Mansour, 2021].

Learning in Games. Different from our paper, the traditional game theory literature on learning in games mainly considers players' asymptotic behavior [Leslie and Collins, 2005, Cominetti et al., 2010, Coucheney et al., 2015]. In early work, Leslie [2004] investigates actor-critic and Q-learning algorithms in games with bandit feedback and their connection to best-response dynamics. Leslie and Collins [2005] propose the individual Q-learning algorithm and show that it converges to the NE almost surely in two-player zero-sum games, and Leslie and Collins [2006] study learning the NE from the perspective of a fictitious-play-like process. Later, Cominetti et al. [2010] consider payoff-based learning rules and show convergence to the NE in traffic games, while another payoff-based learning model for continuous games is developed in Bervoets et al. [2020]. Coucheney et al. [2015] derive penalty-regulated dynamics and propose a corresponding learning algorithm that converges to the NE in potential games with bandit feedback. Bravo et al. [2018] show that in monotone games with bandit feedback, as long as all players use some no-regret learning algorithm, the dynamics converge to the NE, and an improved analysis of the same derivative-free algorithm is given in Drusvyatskiy et al. [2022]. In contrast, our learning objective focuses on finite-time cumulative rewards, which is more widely used in the current multi-agent reinforcement learning literature [Ding et al., 2022, Liu et al., 2021].

3 Preliminaries

General-sum Matrix Games. We consider the model of general-sum matrix games, defined by the tuple $G = (\{A_i\}_{i=1}^m, R)$, where $m$ is the number of players, $A_i$ is the action space of player $i$, and $R(\cdot\mid a)$ is the reward distribution on $[0, r_{\max}]^m$ with mean $r(a)$. Let $A = A_1\times\cdots\times A_m$ be the joint action space and denote its elements by $a = (a_1,\ldots,a_m)\in A$. After all players take actions $a\in A$, a reward vector $r\sim R(\cdot\mid a)$ is sampled and player $i$ receives reward $r_i\in[0, r_{\max}]$ with mean $r_i(a)$. Each player's objective is to maximize her own reward. A general policy $\pi$ is a vector in $\Delta(A)$, the probability simplex over the action space $A$. A product policy $\pi = (\pi_1,\ldots,\pi_m)$ is a tuple in $\Delta(A_1)\times\cdots\times\Delta(A_m)$, where $a = (a_1,\ldots,a_m)\sim\pi$ means that each $a_i\sim\pi_i$ is drawn independently. The value of policy $\pi$ for player $i$ is $V_i^\pi = \mathbb{E}_{a\sim\pi}[r_i(a)]$.

Nash Equilibrium and Nash Regret. Given a general policy $\pi$, let $\pi_{-i}$ be the marginal joint policy of players $1,\ldots,i-1,i+1,\ldots,m$.
Then, the best response of player $i$ under policy $\pi$ is $\pi_i^\dagger = \operatorname{argmax}_{\mu\in\Delta(A_i)} V_i^{\mu,\pi_{-i}}$ and the corresponding value is $V_i^{\dagger,\pi_{-i}} := V_i^{\pi_i^\dagger,\pi_{-i}}$. Our goal is to find an approximate Nash equilibrium of the matrix game, defined below.

Definition 1. A product policy $\pi$ is an $\epsilon$-approximate Nash equilibrium if $\max_{i}\big(V_i^{\dagger,\pi_{-i}} - V_i^{\pi}\big)\le\epsilon$.

An $\epsilon$-approximate Nash equilibrium can be obtained by achieving sublinear Nash regret, which is defined below. See Section 3 in Ding et al. [2022] for a more detailed discussion.

Definition 2. With $\pi^k$ being the policy at the $k$-th episode, the Nash regret after $K$ episodes is defined as
$$\text{Nash-Regret}(K) = \sum_{k=1}^{K}\max_{i\in[m]}\Big(V_i^{\dagger,\pi^k_{-i}} - V_i^{\pi^k}\Big).$$

Remark 1. If we replace $\max_{i\in[m]}$ by $\sum_{i=1}^m$ in the definition of Nash regret, the single-step Nash regret at episode $k$ becomes the Nikaido-Isoda (NI) function evaluated at $\pi^k$, which is a popular objective for equilibrium computation [Nikaidô and Isoda, 1955, Raghunathan et al., 2019]. Replacing $\max_{i\in[m]}$ by $\sum_{i=1}^m$ multiplies our regret bounds by a factor of $m$, and our conclusions are not affected.

Potential Games. A potential game is a general-sum game for which there exists a potential function $\Phi : \Delta(A)\to[0,\Phi_{\max}]$ such that for any player $i\in[m]$ and policies $\pi_i,\pi_i',\pi_{-i}$,
$$\Phi(\pi_i,\pi_{-i}) - \Phi(\pi_i',\pi_{-i}) = V_i^{\pi_i,\pi_{-i}} - V_i^{\pi_i',\pi_{-i}}.$$
We can immediately see that a policy maximizing the potential function is a Nash equilibrium.

Congestion Games. A congestion game is defined by $G = \big(\mathcal{F}, \{A_i\}_{i=1}^m, \{R^f\}_{f\in\mathcal{F}}\big)$, where $\mathcal{F} = [F]$ is called the facility set and $R^f(\cdot\mid n)$ is the reward distribution of facility $f$, supported on $[0,1]$ with mean $r^f(n)$ for $n\in[m]$. Each action $a_i\in A_i$ is a subset of $\mathcal{F}$ (i.e., $a_i\subseteq\mathcal{F}$). Suppose the joint action chosen by all players is $a\in A$; then a random reward $r^f\sim R^f(\cdot\mid n^f(a))$ is sampled for each facility $f$, where $n^f(a) = \sum_{i=1}^m \mathbf{1}\{f\in a_i\}$ is the number of players using facility $f$. The reward collected by player $i$ is $r_i = \sum_{f\in a_i} r^f$, with mean $r_i(a) = \sum_{f\in a_i} r^f(n^f(a)) \in [0, F]$.

Connection to Potential Games [Monderer and Shapley, 1996]. As a special class of potential games, all congestion games admit the potential function $\Phi(a) = \sum_{f\in\mathcal{F}}\sum_{j=1}^{n^f(a)} r^f(j)$. To see this, one can verify that $\Phi(a_i, a_{-i}) - \Phi(a_i', a_{-i}) = r_i(a_i, a_{-i}) - r_i(a_i', a_{-i})$ holds. Then, defining $\Phi(\pi) = \mathbb{E}_{a\sim\pi}[\Phi(a)]$, we have $\Phi(\pi_i,\pi_{-i}) - \Phi(\pi_i',\pi_{-i}) = V_i^{\pi_i,\pi_{-i}} - V_i^{\pi_i',\pi_{-i}}$.

Types of Feedback. There are in general two types of reward feedback in congestion games, semi-bandit feedback and bandit feedback, both of which are reasonable in different scenarios. With semi-bandit feedback, after taking her action, player $i$ observes the reward $r^f$ of every facility $f\in a_i$; with bandit feedback, player $i$ only observes the total reward $r_i = \sum_{f\in a_i} r^f$, with no knowledge of the individual $r^f$. In this paper, we address both types, with more focus on bandit feedback, as the techniques carry over directly to semi-bandit feedback.
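As a concrete illustration of these definitions (our own minimal sketch, not code from the paper), the following Python snippet builds a toy congestion game with a hypothetical mean facility reward function and numerically checks the potential-function identity $\Phi(a_i, a_{-i}) - \Phi(a_i', a_{-i}) = r_i(a_i, a_{-i}) - r_i(a_i', a_{-i})$.

```python
# Minimal sketch (not from the paper): a toy 2-player congestion game with a
# hypothetical mean facility reward, checking the potential-function identity.
from itertools import product

F = 3                                   # facilities {0, 1, 2}
m = 2                                   # players
r_f = lambda f, n: 1.0 / (n * (f + 1))  # hypothetical mean reward r^f(n), decreasing in congestion n
actions = [frozenset(s) for s in [{0}, {1}, {0, 2}, {1, 2}]]  # each action is a subset of facilities

def n_f(a, f):                          # number of players using facility f under joint action a
    return sum(f in a_i for a_i in a)

def reward(i, a):                       # r_i(a) = sum_{f in a_i} r^f(n^f(a))
    return sum(r_f(f, n_f(a, f)) for f in a[i])

def potential(a):                       # Phi(a) = sum_f sum_{j=1}^{n^f(a)} r^f(j)
    return sum(r_f(f, j) for f in range(F) for j in range(1, n_f(a, f) + 1))

# Changing player i's action changes Phi exactly by the change in r_i.
for a in product(actions, repeat=m):
    for i in range(m):
        for a_i_new in actions:
            a_new = tuple(a_i_new if j == i else a[j] for j in range(m))
            assert abs((potential(a_new) - potential(a)) - (reward(i, a_new) - reward(i, a))) < 1e-12
print("Potential property verified on all joint actions.")
```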
4 Centralized Algorithms for Congestion Games

In this section, we introduce two centralized algorithms for congestion games, one for semi-bandit feedback and one for bandit feedback. We will see that both achieve sublinear Nash regret with polynomial dependence on both $m$ and $F$.

4.1 Algorithm for Semi-bandit Feedback

Summarized in Algorithm 1, Nash upper confidence bound (Nash-UCB) for congestion games is developed based on optimism in the face of uncertainty. In particular, the algorithm estimates the reward matrices optimistically in line 4, computes a Nash equilibrium policy of the resulting game in line 5, and then follows this policy. For convenience, we define the empirical counter $N^{k,f}(n) = \sum_{k'=1}^{k}\mathbf{1}\{n^f(a^{k'}) = n\}$ and $\iota = 2\log(4(m+1)K/\delta)$. Then, the reward estimator for facility $f$ and the bonus term are defined as
$$\hat r^{k,f}(n) = \frac{\sum_{k'=1}^{k} r^{k',f}\,\mathbf{1}\{n^f(a^{k'}) = n\}}{N^{k,f}(n)\vee 1}, \qquad b^{k,r}_i(a) = \sum_{f\in a_i}\sqrt{\frac{\iota}{N^{k,f}(n^f(a))\vee 1}}, \tag{1}$$
where $r^{k,f}\in[0,1]$ is the random reward realization of $r^f(n^f(a^k))$. Naturally, the reward estimator for player $i$ is $\hat r^k_i(a) = \sum_{f\in a_i}\hat r^{k,f}(n^f(a))$.

Algorithm 1: Nash-UCB for Congestion Games
1: Input: $\epsilon$, accuracy parameter for Nash equilibrium computation
2: for episode $k = 1,\ldots,K$ do
3:   for player $i = 1,\ldots,m$ do
4:     $\bar Q^k_i(a) \leftarrow \hat r^k_i(a) + b^{k,r}_i(a)$ for all $a\in A$
5:   $\pi^k \leftarrow \epsilon\text{-NASH}\big(\bar Q^k_1(\cdot),\ldots,\bar Q^k_m(\cdot)\big)$ (Algorithm 2)
6:   Take action $a^k\sim\pi^k$ and observe rewards $r^{k,f}$
7:   Update the reward estimators $\hat r^k_i$ and bonus terms $b^{k,r}_i$

Algorithm 1 is motivated by the Nash-VI algorithm of [Liu et al., 2021], combined with a deliberate use of the special reward structure of congestion games. Moreover, notice that a matrix game with reward functions $\bar Q^k_1(\cdot),\ldots,\bar Q^k_m(\cdot)$ forms a potential game (see Lemma 1). As a result, in line 5 we can efficiently compute an $\epsilon$-approximate Nash equilibrium $\pi^k$ of that matrix game using Algorithm 2 (see Lemma 2). Algorithm 2 is a simple greedy procedure: in each round it modifies the policy of the single player whose modification increases the potential function the most. In addition, Algorithm 2 always outputs a deterministic product policy.

Algorithm 2: $\epsilon$-approximate Nash Equilibrium for Potential Games
1: Input: $\epsilon$, accuracy parameter; full-information potential game $(\{A_i\}_{i=1}^m, \{r_i\}_{i=1}^m)$ such that $r_i\in[0,r_{\max}]$ for all $i\in[m]$
2: Initialize: $\pi^1 = a^1$, an arbitrary deterministic product policy
3: for round $k = 1,\ldots,\lceil m r_{\max}/\epsilon\rceil$ do
4:   for player $i = 1,\ldots,m$ do
5:     $\Delta_i = \max_{a_i\in A_i} r_i(a_i,\pi^k_{-i}) - r_i(\pi^k)$
6:     $a^{k+1}_i = \operatorname{argmax}_{a_i\in A_i} r_i(a_i,\pi^k_{-i}) - r_i(\pi^k)$
7:   if $\max_{i\in[m]}\Delta_i \le \epsilon$ then
8:     return $\pi^k$
9:   $j = \operatorname{argmax}_{i\in[m]}\Delta_i$
10:  $\pi^{k+1}(j) = a^{k+1}_j$, $\pi^{k+1}(i) = \pi^k(i)$ for all $i\neq j$

4.2 Algorithm for Bandit Feedback

When the players only receive bandit feedback, estimating $\hat r^{k,f}$ directly for each $f\in\mathcal{F}$ is no longer feasible. However, notice that the reward function $r_i(a) = \sum_{f\in a_i} r^f(n^f(a))$ can be seen as an inner product between a vector determined by the action $a$ and a vector determined by the reward functions $r^f(\cdot)$. Therefore, under bandit feedback, we can treat the problem as a linear bandit and use ridge regression to build a reward estimator $\bar r^k_i$ and a corresponding bonus term $b^{k,r}$, whose index $i$ is dropped since it is the same for all players. The new algorithm uses these two terms in place of $\hat r^k_i$ and $b^{k,r}_i$ in line 4 of Algorithm 1. In particular, define $\theta\in[0,1]^d$ with $d = mF$ to be the vector such that $r^f(n) = \theta_{n+m(f-1)}$. Meanwhile, for player $i\in[m]$, define $A_i : A\to\{0,1\}^d$ to be the vector-valued function such that $[A_i(a)]_j = \mathbf{1}\{j = n+m(f-1),\, f\in a_i,\, n = n^f(a)\}$. In other words, $A_i(a)$ is a 0-1 vector whose nonzero entries are exactly at the indices of $\theta$ that represent $r^f(n)$ for $f\in a_i$ and $n = n^f(a)$. Now, with these definitions, the reward function can be written as $r_i(a) = \langle A_i(a), \theta\rangle$.
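To make this linear parameterization concrete, the following minimal Python sketch (our illustration, not code from the paper) constructs a random $\theta$, the feature map $A_i(a)$, and checks the identity $r_i(a) = \langle A_i(a), \theta\rangle$ on a small example; all names and the specific indexing convention are ours.

```python
# Minimal sketch (our illustration): the linear-bandit view r_i(a) = <A_i(a), theta>.
import numpy as np

m, F = 3, 4                              # players and facilities
rng = np.random.default_rng(0)
theta = rng.uniform(size=m * F)          # theta[(n-1) + m*f] = r^f(n), using 0-indexed facilities here
r_f = lambda f, n: theta[(n - 1) + m * f]    # mean facility reward r^f(n), f in 0..F-1, n in 1..m

def n_f(a, f):                           # congestion count n^f(a)
    return sum(f in a_i for a_i in a)

def feature(i, a):                       # A_i(a): ones exactly at the entries of theta used by player i
    x = np.zeros(m * F)
    for f in a[i]:
        x[(n_f(a, f) - 1) + m * f] = 1.0
    return x

def reward(i, a):                        # r_i(a) = sum_{f in a_i} r^f(n^f(a))
    return sum(r_f(f, n_f(a, f)) for f in a[i])

a = ({0, 1}, {1, 3}, {1})                # a joint action: one facility subset per player
for i in range(m):
    assert np.isclose(reward(i, a), feature(i, a) @ theta)
print("r_i(a) = <A_i(a), theta> verified for all players.")
```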
Then, we build the reward estimator and the bonus term through ridge regression and a corresponding confidence bound:
$$\bar r^k_i(a) = \big\langle A_i(a), \hat\theta^k\big\rangle, \qquad b^{k,r}(a) = \max_{i\in[m]}\big\|A_i(a)\big\|_{(V^k)^{-1}}\sqrt{\beta_k}, \tag{2}$$
where $\hat\theta^k = (V^k)^{-1}\sum_{k'=1}^{k-1}\sum_{i=1}^m A_i(a^{k'})\,r^{k'}_i$, $V^k = I + \sum_{k'=1}^{k-1}\sum_{i=1}^m A_i(a^{k'})A_i(a^{k'})^\top$, and $\sqrt{\beta_k} = \sqrt{Fd\log(1 + mk/F) + F\iota}$. Note that we cannot bound the sum of these bonus terms by directly applying the elliptical potential lemma; we instead prove a variant of it in Lemma 4.

4.3 Regret Analysis

The Nash regret bounds for the two versions of Algorithm 1 are formally presented in Theorem 1. The proof details are deferred to Appendix C.

Theorem 1. Let $\epsilon = 1/K$. For congestion games with semi-bandit feedback, by running Algorithm 1 with the reward estimator and bonus term in (1), with probability at least $1-\delta$ we can achieve
$$\text{Nash-Regret}(K) \le \widetilde O\big(F\sqrt{mK}\big).$$
Furthermore, if we only have bandit feedback, then by running Algorithm 1 with the reward estimator and bonus term in (2), with probability at least $1-\delta$ we can achieve
$$\text{Nash-Regret}(K) \le \widetilde O\big(mF^{3/2}\sqrt{K}\big).$$

Remark 2. Since each action is a subset of $\mathcal{F}$, the size of each player's action space can be $2^F$. As a result, directly applying Nash-VI [Liu et al., 2021] leads to a regret bound exponential in $F$.

Remark 3. Note that we assume $r^f\in[0,1]$, which implies $r_i\in[0,F]$ for each player $i\in[m]$.

5 Decentralized Algorithms for Congestion Games

In this section, we present a decentralized algorithm for congestion games. Due to limited space, we only introduce the bandit-feedback version, as in Section 4.2; the algorithmic details for the semi-bandit feedback setting are deferred to Appendix D.3. We will show that under both settings, even though each player only observes her own actions and rewards, our decentralized algorithm still enjoys sublinear Nash regret with polynomial dependence on $m$ and $F$.

We first define the vector-valued function $\phi_i : A_i\to\{0,1\}^{F_i}$ to be the feature map of player $i$ such that $[\phi_i(a_i)]_f = \mathbf{1}\{f\in a_i\}$ for $a_i\in A_i$ and $f\in\bigcup_{a_i\in A_i} a_i$. Here, $F_i$ is the size of $\bigcup_{a_i\in A_i} a_i\subseteq\mathcal{F}$, and we immediately have $F_i\le F$ for any $i\in[m]$.

The core idea of our algorithm is that a Nash equilibrium can be found by reaching a stationary point of the potential function, since all congestion games are potential games. The UCB-like algorithms used in the centralized setting are not applicable here, because their policy computation requires the value functions of all players (e.g., line 5 of Algorithm 1), which are not available in the decentralized setting. Summarized in Algorithm 3, the decentralized algorithm is developed based on the Frank-Wolfe method and has the following three major components.

Gradient Estimator. In line 7, the algorithm builds the estimator $\widehat\nabla^k_i\Phi$ defined in (4) using the $\tau$ reward samples collected in line 5. Here, $\widehat\nabla^k_i\Phi$ estimates the gradient of the potential function $\Phi$ with respect to the policy $\pi^k_i$. Recall that for a congestion game we have $\Phi(a) = \sum_{f\in\mathcal{F}}\sum_{j=1}^{n^f(a)} r^f(j)$ and $\Phi(\pi) = \mathbb{E}_{a\sim\pi}[\Phi(a)]$.

Algorithm 3: Frank-Wolfe with Exploration for Congestion Games
1: Input: $\gamma, \nu$, mixture weights; $\pi^1_i$, initial policy
2: Initialize: $\rho_i$, the G-optimal design for player $i$, defined in (5)
3: for episode $k = 1,\ldots,K$ do
4:   for round $t = 1,\ldots,\tau$ do
5:     Each player takes action $a^{k,t}_i\sim\pi^k_i$ and observes reward $r^{k,t}_i$
6:   for player $i = 1,\ldots,m$ do
7:     Compute $\widehat\nabla^k_i\Phi(a_i)$ by the formula in (4) for all $a_i\in A_i$
8:     Compute $\widetilde\pi^{k+1}_i \in \operatorname{argmax}_{\pi_i\in\Delta(A_i)}\big\langle\pi_i, \widehat\nabla^k_i\Phi\big\rangle$
9:     Update $\pi^{k+1}_i \leftarrow (1-\gamma)\big(\nu\widetilde\pi^{k+1}_i + (1-\nu)\pi^k_i\big) + \gamma\rho_i$
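As a runnable companion to Algorithm 3 (our own sketch, not the paper's code), the snippet below shows the structure of one player's update in an episode: collect $\tau$ samples from the fixed policy, form the gradient estimate following (4), which is derived next, take the Frank-Wolfe step toward the best vertex, and mix with the exploration design. The helper `sample_round` (playing one round of the game and returning this player's action index and realized reward) and the precomputed design `rho` are hypothetical placeholders.

```python
# Minimal sketch (our illustration): one player's episode-k update in
# Frank-Wolfe with Exploration. `sample_round` and `rho` are hypothetical inputs.
import numpy as np

def fw_update(pi, phi, rho, sample_round, tau, gamma, nu):
    """pi: current policy over |A_i| actions; phi: |A_i| x F_i 0/1 feature matrix."""
    # Covariance of the current policy's features, Sigma_i^k = E_{a_i ~ pi}[phi(a_i) phi(a_i)^T],
    # assumed invertible (the gamma * rho mixture in the update keeps it well conditioned).
    sigma = phi.T @ np.diag(pi) @ phi
    sigma_inv = np.linalg.inv(sigma)

    # Collect tau samples from the fixed policy and average phi(a_i^{k,t}) * r_i^{k,t}.
    z = np.zeros(phi.shape[1])
    for _ in range(tau):
        a_idx, r = sample_round(pi)               # a_idx ~ pi, r = this player's realized reward
        z += phi[a_idx] * r
    theta_hat = sigma_inv @ (z / tau)             # regression estimate of theta_i(pi^k)

    grad_hat = phi @ theta_hat                    # estimated gradient, one entry per action
    pi_tilde = np.zeros_like(pi)                  # Frank-Wolfe vertex: best response to grad_hat
    pi_tilde[np.argmax(grad_hat)] = 1.0

    # Mix: small step toward the vertex, plus forced exploration via the design rho.
    return (1 - gamma) * (nu * pi_tilde + (1 - nu) * pi) + gamma * rho
```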
Then we can define $\nabla_i\Phi := \nabla_{\pi_i}\Phi$ as a vector of dimension $|A_i|$. For the component indexed by some $a_i\in A_i$, we can see that
$$\Phi(\pi) = \pi_i(a_i)\,\mathbb{E}_{a_{-i}\sim\pi_{-i}}[r_i(a_i, a_{-i})] + \text{const},$$
where const does not depend on $\pi_i(a_i)$. Therefore, we have
$$\nabla_i\Phi(a_i) = \mathbb{E}_{a_{-i}\sim\pi_{-i}}[r_i(a_i, a_{-i})] = \mathbb{E}_{a_{-i}\sim\pi_{-i}}\Big[\sum_{f\in a_i} r^f\big(n^f(a_i, a_{-i})\big)\Big] = \big\langle \phi_i(a_i), \theta_i(\pi)\big\rangle, \tag{3}$$
where $[\theta_i(\pi)]_f = \mathbb{E}_{a_{-i}\sim\pi_{-i}}\big[r^f\big(n^f(a_{-i})+1\big)\big]$. Meanwhile, the mean of the $t$-th reward that player $i$ receives at episode $k$ satisfies
$$\mathbb{E}\big[r^{k,t}_i \mid a^{k,t}\big] = r_i(a^{k,t}) = \sum_{f\in a^{k,t}_i} r^f\big(n^f(a^{k,t})\big) = \big\langle \phi_i(a^{k,t}_i), \theta^{k,t}_i(a^{k,t}_{-i})\big\rangle,$$
where $[\theta^{k,t}_i(a^{k,t}_{-i})]_f = r^f\big(n^f(a^{k,t}_{-i})+1\big)$ and its mean is $[\theta_i(\pi^k)]_f$. Therefore, we can use linear regression to estimate $\theta_i(\pi^k)$. In particular, we have $\hat\theta^k_i(\pi^k) = \frac{1}{\tau}\sum_{t=1}^{\tau}\big(\Sigma^k_i\big)^{-1}\phi_i(a^{k,t}_i)\,r^{k,t}_i$, with the covariance matrix $\Sigma^k_i = \mathbb{E}_{a_i\sim\pi^k_i}\big[\phi_i(a_i)\phi_i(a_i)^\top\big]$. Then, we have the unbiased gradient estimate
$$\widehat\nabla^k_i\Phi(a_i) = \big\langle \phi_i(a_i), \hat\theta^k_i(\pi^k)\big\rangle = \frac{1}{\tau}\sum_{t=1}^{\tau}\phi_i(a_i)^\top\big(\Sigma^k_i\big)^{-1}\phi_i(a^{k,t}_i)\,r^{k,t}_i. \tag{4}$$

Remark 4. One difference between Algorithm 3 (decentralized) and Algorithm 1 (centralized) is that in the decentralized algorithm, each player is required to play the same policy $\tau$ times before an update is applied. An episode is thus defined, for convenience, as the time period during which the players' policies are fixed. We make this artificial design mainly to control the variance of the gradient estimator $\widehat\nabla^k_i\Phi(a_i)$. However, we conjecture that with a more careful design and analysis, it should be possible to improve Algorithm 3 so that only one sample is required per episode [Zhang et al., 2020].

G-optimal Design. In lines 8 and 9, the algorithm performs the standard Frank-Wolfe update and mixes the updated policy with an exploration policy $\rho_i$, defined as the G-optimal design allocation for the features $\{\phi_i(a_i)\}_{a_i\in A_i}$. To be specific,
$$\rho_i = \operatorname*{argmin}_{\lambda\in\Delta(A_i)}\ \max_{a_i\in A_i}\ \big\|\phi_i(a_i)\big\|^2_{\left(\mathbb{E}_{a_i'\sim\lambda}[\phi_i(a_i')\phi_i(a_i')^\top]\right)^{-1}}. \tag{5}$$
Here $\rho_i$ guarantees that $\Sigma^k_i$ is invertible, and the variance of $\widehat\nabla^k_i\Phi(a_i) = \langle\phi_i(a_i), \hat\theta^k_i(\pi^k)\rangle$ depends only on $F$ instead of the size of the action space (Lemma 9), because by the famous Kiefer-Wolfowitz theorem we have $\max_{a_i\in A_i}\|\phi_i(a_i)\|^2_{\left(\mathbb{E}_{a_i'\sim\rho_i}[\phi_i(a_i')\phi_i(a_i')^\top]\right)^{-1}} = F_i \le F$ [Lattimore and Szepesvári, 2020].

Frank-Wolfe Update. Finally, we emphasize that it is crucial to use the Frank-Wolfe update because it is compatible with the $L_1$ norm, and we can show that $\Phi$ is $mF$-smooth with respect to the $L_1$ norm (Lemma 11). In contrast, its smoothness with respect to the $L_2$ norm depends on the size of the action space.

Before the game starts, each player $i$ can compute her $\rho_i$ based on her own action set $A_i$. During the game, all players only have access to their own actions and rewards, which means that Algorithm 3 is fully decentralized. The Nash regret bound for this algorithm is formally stated in Theorem 2, and the proof details are given in Appendices D.1 and D.2.

Theorem 2. Let $T = K\tau$. For congestion games with bandit feedback, by running Algorithm 3 with the gradient estimator $\widehat\nabla^k_i\Phi$ in (4) and the exploration distribution $\rho_i$ in (5), if $K \ge 2Fm$, then with probability at least $1-\delta$ we have
$$\text{Nash-Regret}(T) := \sum_{k=1}^{K}\tau\max_{i\in[m]}\Big(V_i^{\dagger,\pi^k_{-i}} - V_i^{\pi^k}\Big) \le \widetilde O\big(m^2F^2T^{5/6} + m^3F^3T^{2/3}\big).$$
For congestion games with semi-bandit feedback, by running Algorithm 3 with the gradient estimator $\widetilde\nabla^k_i\Phi(a_i)$ and the exploration distribution $\rho_i$ defined in Appendix D.3, if $K \ge 2Fm$, then with probability at least $1-\delta$ we have
$$\text{Nash-Regret}(T) \le \widetilde O\big(m^2F^{3/2}T^{5/6} + m^3F^2T^{2/3}\big).$$
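The design $\rho_i$ in (5) need not be computed in closed form. By the Kiefer-Wolfowitz theorem, G-optimality is equivalent to D-optimality, so the classic Frank-Wolfe (Fedorov-Wynn) iteration on the log-determinant objective gives an approximate design. The sketch below is our own illustration of that standard procedure under the assumption that the feature vectors span the ambient space; `g_optimal_design` and the toy features are ours, not the paper's implementation.

```python
# Minimal sketch (our illustration): approximate G-optimal design via the classic
# Frank-Wolfe / Fedorov-Wynn iteration on the equivalent D-optimal (log-det) objective.
# Assumes the feature vectors span R^d; otherwise one should work within their span.
import numpy as np

def g_optimal_design(phi, tol=1e-3, max_iter=10_000):
    """phi: |A_i| x d matrix whose rows are the features phi_i(a_i). Returns a design lambda
    with max_a ||phi(a)||^2_{V(lambda)^{-1}} close to d (the Kiefer-Wolfowitz optimum)."""
    n, d = phi.shape
    lam = np.full(n, 1.0 / n)                      # start from the uniform design
    for _ in range(max_iter):
        V = phi.T @ (lam[:, None] * phi)           # V(lambda) = sum_a lambda(a) phi(a) phi(a)^T
        V_inv = np.linalg.inv(V)
        norms = np.einsum('ad,de,ae->a', phi, V_inv, phi)  # ||phi(a)||^2_{V^{-1}} for every action
        a_star, g = np.argmax(norms), np.max(norms)
        if g <= d * (1 + tol):                     # Kiefer-Wolfowitz: the optimum attains g = d
            break
        gamma = (g / d - 1) / (g - 1)              # exact line-search step toward e_{a_star}
        lam = (1 - gamma) * lam
        lam[a_star] += gamma
    return lam

# Toy usage with hypothetical subset features [phi(a_i)]_f = 1{f in a_i}:
if __name__ == "__main__":
    actions = [{0}, {1}, {2}, {0, 1}, {1, 2}]
    d = 3
    phi = np.array([[1.0 if f in a else 0.0 for f in range(d)] for a in actions])
    print(np.round(g_optimal_design(phi), 3))
```

6 Extension to Independent Markov Congestion Games

In this section, we propose and analyze a Markov extension of congestion games, called independent Markov congestion games (IMCGs).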
6.1 Problem Formulation

General-sum Markov Games. A finite-horizon, time-inhomogeneous, tabular general-sum Markov game is defined by $M = (S, \{A_i\}_{i=1}^m, H, P, R, s_1)$, where $S$ is the state space, $m$ is the number of players, $A_i$ is the action space of player $i$, $A = A_1\times\cdots\times A_m$ is the joint action space, $H$ is the time horizon, $s_1$ is the initial state, $P = (P_1, P_2, \ldots, P_H)$ with $P_h\in[0,1]^{S\times A\times S}$ the transition kernel at timestep $h$, and $R = \{R_h(\cdot\mid s_h, a_h)\}_{h=1}^H$ with $R_h(\cdot\mid s_h, a_h)$ the reward distribution on $[0, r_{\max}]^m$ with mean $r_h(s_h, a_h)\in[0, r_{\max}]^m$ at timestep $h\in[H]$. At timestep $h$, all players choose their actions simultaneously and a reward vector $r_h\sim R_h(\cdot\mid s_h, a_h)$ is sampled, where $s_h$ is the current state and $a_h = (a_{h,1}, a_{h,2}, \ldots, a_{h,m})$ is the joint action. Each player $i$ receives reward $r_{h,i}$ and the state transits to $s_{h+1}\sim P_h(\cdot\mid s_h, a_h)$. The objective of each player is to maximize her own total reward. We assume that the initial state $s_1$ is fixed, and an episode is defined as running $H$ steps from $s_1$, as is common for episodic MDPs.

A (Markov) policy $\pi$ is a collection of $H$ functions $\{\pi_h : S\to\Delta(A)\}_{h=1}^H$, each of which maps a state to a distribution over the joint action space. $\pi$ is a product policy if $\pi_h(\cdot\mid s)$ is a product policy for each $(h, s)\in[H]\times S$. The value function and Q-value function of player $i$ at timestep $h$ under policy $\pi$ are defined as
$$V^\pi_{h,i}(s) = \mathbb{E}_\pi\Big[\sum_{h'=h}^{H} r_{h',i}(s_{h'}, a_{h'}) \,\Big|\, s_h = s\Big], \qquad Q^\pi_{h,i}(s, a) = \mathbb{E}_\pi\Big[\sum_{h'=h}^{H} r_{h',i}(s_{h'}, a_{h'}) \,\Big|\, s_h = s, a_h = a\Big].$$
Best responses and Nash regret are defined similarly to those for matrix games. In particular, given a policy $\pi$, player $i$'s best-response policy is $\pi^\dagger_{h,i}(\cdot\mid s) = \operatorname{argmax}_{\mu\in\Delta(A_i)} V^{\mu,\pi_{-i}}_{h,i}(s)$ and the corresponding value function is denoted by $V^{\dagger,\pi_{-i}}_{h,i}$.

Definition 3. With $\pi^k$ being the policy at the $k$-th episode, the Nash regret after $K$ episodes is defined as
$$\text{Nash-Regret}(K) = \sum_{k=1}^{K}\max_{i\in[m]}\Big(V^{\dagger,\pi^k_{-i}}_{1,i}(s_1) - V^{\pi^k}_{1,i}(s_1)\Big).$$

Independent Markov Congestion Games. A general-sum Markov game is an independent Markov congestion game (IMCG) if there exist a facility set $\mathcal{F}$ such that $a_i\subseteq\mathcal{F}$ for any $a_i\in A_i$, a state space $S = \prod_{f\in\mathcal{F}} S^f$, a set of facility reward distributions $\{R^f_h\}_{h\in[H], f\in\mathcal{F}}$ such that if the joint action at $s_h$ is $a$, then $r_{h,i} = \sum_{f\in a_i} r^f_h$, where $r^f_h\sim R^f_h(\cdot\mid s_h, n^f(a))$ is supported on $[0,1]$ with mean $r^f_h(s_h, n^f(a))$, and a set of transition matrices $\{P^f_h\}_{h\in[H], f\in\mathcal{F}}$ such that $P_h(s'\mid s, a) = \prod_{f\in\mathcal{F}} P^f_h(s'^f\mid s^f, n^f(a))$. In other words, at each timestep $h$ and state $s\in S$, the players are playing a congestion game. Meanwhile, each facility has its own state and independent state transition, which depends only on its current state and the number of players using that facility. This transition kernel can be viewed as a special case of that of factored MDPs [Szita and Lőrincz, 2009]. Like congestion games, IMCGs admit two types of feedback, semi-bandit and bandit; in this paper, we consider both.
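To illustrate the factored transition structure, the short sketch below (our illustration, with arbitrary placeholder kernels) samples the next joint state facility by facility, so the full kernel $P_h(s'\mid s,a) = \prod_f P^f_h(s'^f\mid s^f, n^f(a))$ never has to be represented over the exponentially large joint state space. The container `P_f` and helper `step` are hypothetical names.

```python
# Minimal sketch (our illustration): sampling an IMCG transition facility by facility.
# P_f[h][f] is a placeholder per-facility kernel of shape (|S_f|, m+1, |S_f|):
# P_f[h][f][s_f, n, s_f'] = P^f_h(s_f' | s_f, n), with n the number of players on facility f.
import numpy as np

def n_f(a, f):
    return sum(f in a_i for a_i in a)

def step(rng, P_f, h, s, a):
    """s: tuple of per-facility states; a: tuple of facility subsets (one per player)."""
    s_next = []
    for f, s_f in enumerate(s):
        probs = P_f[h][f][s_f, n_f(a, f)]          # distribution over facility f's next state
        s_next.append(rng.choice(len(probs), p=probs))
    return tuple(s_next)                            # joint next state, assembled independently

# Toy usage with F = 2 facilities, 2 states each, m = 2 players, H = 1:
rng = np.random.default_rng(0)
m, F, S_f, H = 2, 2, 2, 1
P_f = rng.dirichlet(np.ones(S_f), size=(H, F, S_f, m + 1))   # random placeholder kernels
print(step(rng, P_f, h=0, s=(0, 1), a=({0}, {0, 1})))
```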
6.2 Theoretical Guarantee

Summarized in Algorithm 5, our centralized algorithm for IMCGs naturally extends Nash-UCB (Algorithm 1) by incorporating transition kernel estimators, corresponding bonus terms, and a backward Bellman update. The key idea is to utilize the independent transition structure to remove the dependence on the exponentially large state space $S = \prod_{f\in\mathcal{F}} S^f$; we tackle this issue by adapting techniques from factored MDPs [Chen et al., 2020]. The algorithmic details for both types of feedback are deferred to Appendix E. The Nash regret bounds for the two versions of Algorithm 5 are stated in Theorem 3, and the proof details are deferred to Appendix F.

Theorem 3. For independent Markov congestion games with semi-bandit feedback, by running the centralized Algorithm 5, with probability at least $1-\delta$ we can achieve $\text{Nash-Regret}(K) \le \widetilde O(\cdot)$. Furthermore, if we only have bandit feedback, then by running Algorithm 5 with the reward estimator and bonus term in (12) and (13), with probability at least $1-\delta$ we can achieve $\text{Nash-Regret}(K) \le \widetilde O(\cdot)$.

The corresponding regret bound in [Liu et al., 2021] is $\widetilde O\big(\sqrt{H^3 S^2(\prod_{i=1}^m A_i)T}\big)$, where both $A_i$ and $S = \prod_{f\in\mathcal{F}}|S^f|$ can be exponential in $F$. Our bounds have polynomial dependence on all the parameters.

7 Conclusion

In this paper, we study sample-efficient learning in congestion games by utilizing their special reward structure. We propose both centralized and decentralized algorithms for congestion games with two types of feedback, all achieving sample complexities only polynomial in the number of facilities. To the best of our knowledge, each of them is the first sample-efficient learning algorithm for congestion games in its own setting. We further define the independent Markov congestion game (IMCG) as a natural extension of the congestion game to the Markov setting, together with a sample-efficient centralized algorithm for both types of feedback. One promising future direction is to find a sample-efficient decentralized algorithm that is also no-regret from each player's own perspective; in other words, diminishing regret is guaranteed for a player running this algorithm even though other players may use policies from different algorithms. Another important future direction is to find sample-efficient centralized/decentralized algorithms that can explicitly output an approximate Nash equilibrium policy.

Acknowledgements

We sincerely thank Jing Dong for pointing out a mistake in the initial draft of this paper. This work was supported in part by NSF TRIPODS II-DMS 2023166, NSF CCF 2007036, NSF IIS 2110170, NSF DMS 2134106, NSF CCF 2212261, NSF IIS 2143493, and NSF CCF 2019844.

References

Hayder AA Al-Kashoash, Maryam Hafeez, and Andrew H Kemp. Congestion control for 6LoWPAN networks: A game theoretic framework. IEEE Internet of Things Journal, 4(3):760–771, 2017.
Yu Bai and Chi Jin. Provable self-play algorithms for competitive reinforcement learning. In International Conference on Machine Learning, pages 551–560. PMLR, 2020.
Sebastian Bervoets, Mario Bravo, and Mathieu Faure. Learning with minimal information in continuous games. Theoretical Economics, 15(4):1471–1508, 2020.
Mario Bravo, David Leslie, and Panayotis Mertikopoulos. Bandit learning in concave N-person games. Advances in Neural Information Processing Systems, 31, 2018.
Shicong Cen, Fan Chen, and Yuejie Chi. Independent natural policy gradient methods for potential games: Finite-time global convergence with entropy regularization. arXiv preprint arXiv:2204.05466, 2022.
Po-An Chen and Chi-Jen Lu. Playing congestion games with bandit feedbacks. In AAMAS, pages 1721–1722, 2015.
Po-An Chen and Chi-Jen Lu. Generalized mirror descents in congestion games. Artificial Intelligence, 241:217–243, 2016.
Xiaoyu Chen, Jiachen Hu, Lihong Li, and Liwei Wang. Efficient reinforcement learning in factored MDPs with application to constrained RL. arXiv preprint arXiv:2008.13319, 2020.
Yun Kuen Cheung and Georgios Piliouras. Chaos, extremism and optimism: Volume analysis of learning in games.
Advances in Neural Information Processing Systems, 33:9039–9049, 2020.
Roberto Cominetti, Emerson Melo, and Sylvain Sorin. A payoff-based learning procedure and its application to traffic games. Games and Economic Behavior, 70(1):71–83, 2010.
Pierre Coucheney, Bruno Gaujal, and Panayotis Mertikopoulos. Penalty-regulated dynamics and robust learning procedures in games. Mathematics of Operations Research, 40(3):611–633, 2015.
Constantinos Daskalakis. On the complexity of approximating a Nash equilibrium. ACM Transactions on Algorithms (TALG), 9(3):1–35, 2013.
Dongsheng Ding, Chen-Yu Wei, Kaiqing Zhang, and Mihailo R. Jovanović. Independent policy gradient for large-scale Markov potential games: Sharper rates, function approximation, and game-agnostic convergence, 2022.
Dmitriy Drusvyatskiy, Maryam Fazel, and Lillian J Ratliff. Improved rates for derivative-free gradient play in strongly monotone games. In Proc. IEEE Conference on Decision and Control, 2022.
Stéphane Durand. Analysis of Best Response Dynamics in Potential Games. PhD thesis, Université Grenoble Alpes, 2018.
Dimitris Fotakis, Spyros Kontogiannis, Elias Koutsoupias, Marios Mavronicolas, and Paul Spirakis. The structure and complexity of Nash equilibria for a selfish routing game. In International Colloquium on Automata, Languages, and Programming, pages 123–134. Springer, 2002.
Roy Fox, Stephen McAleer, Will Overman, and Ioannis Panageas. Independent natural policy gradient always converges in Markov potential games. arXiv preprint arXiv:2110.10614, 2021.
Amélie Heliou, Johanne Cohen, and Panayotis Mertikopoulos. Learning with bandit feedback in potential games. Advances in Neural Information Processing Systems, 30, 2017.
Christian Ibars, Monica Navarro, and Lorenza Giupponi. Distributed demand management in smart grid with a congestion game. In 2010 First IEEE International Conference on Smart Grid Communications, pages 495–500. IEEE, 2010.
Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? Advances in Neural Information Processing Systems, 31, 2018.
Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. V-learning – a simple, efficient, decentralized algorithm for multiagent RL. arXiv preprint arXiv:2110.14555, 2021a.
Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. V-learning – a simple, efficient, decentralized algorithm for multiagent RL, 2021b.
Ramesh Johari and John N Tsitsiklis. Efficiency loss in a network resource allocation game. Mathematics of Operations Research, 29(3):407–435, 2004.
Robert Kleinberg, Georgios Piliouras, and Éva Tardos. Multiplicative updates outperform generic no-regret learning in congestion games. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pages 533–542, 2009.
Walid Krichene, Benjamin Drighès, and Alexandre Bayen. On the convergence of no-regret learning in selfish routing. In International Conference on Machine Learning, pages 163–171. PMLR, 2014.
Walid Krichene, Benjamin Drighès, and Alexandre M Bayen. Online learning of Nash equilibria in congestion games. SIAM Journal on Control and Optimization, 53(2):1056–1081, 2015.
Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
Stefanos Leonardos, Will Overman, Ioannis Panageas, and Georgios Piliouras. Global convergence of multi-agent policy gradient in Markov potential games, 2021.
David S Leslie. Reinforcement learning in games. PhD thesis, University of Bristol, 2004.
David S Leslie and Edmund J Collins.
Individual Q-learning in normal form games. SIAM Journal on Control and Optimization, 44(2):495–514, 2005.
David S Leslie and Edmund J Collins. Generalised weakened fictitious play. Games and Economic Behavior, 56(2):285–298, 2006.
Qinghua Liu, Tiancheng Yu, Yu Bai, and Chi Jin. A sharp analysis of model-based reinforcement learning with self-play. In International Conference on Machine Learning, pages 7001–7010. PMLR, 2021.
Sergio Valcarcel Macua, Javier Zazo, and Santiago Zazo. Learning parametric closed-loop policies for Markov potential games. arXiv preprint arXiv:1802.00899, 2018.
Jason R Marden. State based potential games. Automatica, 48(12):3075–3088, 2012.
Dov Monderer and Lloyd S Shapley. Potential games. Games and Economic Behavior, 14(1):124–143, 1996.
Hukukane Nikaidô and Kazuo Isoda. Note on non-cooperative convex games. Pacific Journal of Mathematics, 5(S1):807–815, 1955.
Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.
Ian Osband and Benjamin Van Roy. Near-optimal reinforcement learning in factored MDPs. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
Arvind Raghunathan, Anoop Cherian, and Devesh Jha. Game theoretic optimization via gradient-based Nikaido-Isoda function. In International Conference on Machine Learning, pages 5291–5300. PMLR, 2019.
Aviv Rosenberg and Yishay Mansour. Oracle-efficient regret minimization in factored MDPs with unknown structure. Advances in Neural Information Processing Systems, 34, 2021.
Robert W Rosenthal. A class of games possessing pure-strategy Nash equilibria. International Journal of Game Theory, 2(1):65–67, 1973.
Tim Roughgarden. Algorithmic game theory. Communications of the ACM, 53(7):78–86, 2010.
Tim Roughgarden and Éva Tardos. Bounding the inefficiency of equilibria in nonatomic congestion games. Games and Economic Behavior, 47(2):389–403, 2004.
Aviad Rubinstein. Settling the complexity of computing approximate two-player Nash equilibria. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 258–265. IEEE, 2016.
Lloyd S Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.
Ziang Song, Song Mei, and Yu Bai. When can we learn general-sum Markov games with a large number of players sample-efficiently? arXiv preprint arXiv:2110.04184, 2021.
Brian Swenson, Ryan Murray, and Soummya Kar. On best-response dynamics in potential games. SIAM Journal on Control and Optimization, 56(4):2734–2767, 2018.
István Szita and András Lőrincz. Optimistic initialization and greediness lead to polynomial time learning in factored MDPs. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1001–1008, 2009.
Yi Tian, Jian Qian, and Suvrit Sra. Towards minimax optimal reinforcement learning in factored Markov decision processes. Advances in Neural Information Processing Systems, 33:19896–19907, 2020.
Ziping Xu and Ambuj Tewari. Reinforcement learning in factored MDPs: Oracle-efficient algorithms and tighter regret bounds for the non-episodic setting. Advances in Neural Information Processing Systems, 33:18226–18236, 2020.
Kaiqing Zhang, Zhuoran Yang, and Tamer Basar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control, pages 321–384, 2021a.
Mingrui Zhang, Zebang Shen, Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. One sample stochastic Frank-Wolfe. In International Conference on Artificial Intelligence and Statistics, pages 4012–4023. PMLR, 2020.
Runyu Zhang, Zhaolin Ren, and Na Li. Gradient play in stochastic games: Stationary points, convergence, and sample complexity. arXiv preprint arXiv:2106.00198, 2021b.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [No]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]