SAMPLE-EFFICIENT MULTI-AGENT RL: AN OPTIMIZATION PERSPECTIVE

Published as a conference paper at ICLR 2024

Nuoya Xiong (IIIS, Tsinghua University) xiongny20@mails.tsinghua.edu.cn
Zhihan Liu (Northwestern University) zhihanliu2027@u.northwestern.edu
Zhaoran Wang (Northwestern University) zhaoranwang@gmail.com
Zhuoran Yang (Yale University) zhuoran.yang@yale.edu
Equal contribution.

ABSTRACT

We study multi-agent reinforcement learning (MARL) for general-sum Markov Games (MGs) under general function approximation. In order to identify a minimal assumption for sample-efficient learning, we introduce a novel complexity measure called the Multi-Agent Decoupling Coefficient (MADC) for general-sum MGs. Using this measure, we propose the first unified algorithmic framework that ensures sample efficiency in learning Nash Equilibrium, Coarse Correlated Equilibrium, and Correlated Equilibrium for both model-based and model-free MARL problems with low MADC. We also show that our algorithm provides sublinear regret comparable to existing works. Moreover, our algorithm only requires an equilibrium-solving oracle and an oracle that solves regularized supervised learning, and thus avoids solving constrained optimization problems within data-dependent constraints (Jin et al., 2020a; Wang et al., 2023) or executing sampling procedures with complex multi-objective optimization problems (Foster et al., 2023). Moreover, the model-free version of our algorithm is the first provably efficient model-free algorithm for learning Nash equilibrium of general-sum MGs.

1 INTRODUCTION

Multi-agent reinforcement learning (MARL) has achieved remarkable empirical successes in solving complicated games involving sequential and strategic decision-making across multiple agents (Vinyals et al., 2019; Brown & Sandholm, 2018; Silver et al., 2016). These achievements have catalyzed many research efforts focusing on developing efficient MARL algorithms in a theoretically principled manner. Specifically, a multi-agent system is typically modeled as a general-sum Markov Game (MG) (Littman, 1994), with the primary aim of efficiently discerning a certain equilibrium notion among multiple agents from data collected via online interactions. Some popular equilibrium notions include Nash equilibrium (NE), correlated equilibrium (CE), and coarse correlated equilibrium (CCE).

However, multi-agent general-sum Markov Games (MGs) bring forth various challenges. In particular, practical applications typically involve a large state space. Such a challenge necessitates the use of function approximation as an effective way to extract the essential features of RL problems and avoid dealing directly with the large state space. Yet, adopting function approximation in a general-sum MG brings about additional complexities not found in single-agent RL or a zero-sum MG. Many prevailing studies on single-agent RL or two-agent zero-sum MGs with function approximation leverage the special relationships between the optimal policy and the optimal value function (Jin et al., 2021a; Du et al., 2021; Zhong et al., 2022; Jin et al., 2022; Huang et al., 2021; Xiong et al., 2022). In particular, in single-agent RL, the optimal policy is the greedy policy with respect to the optimal value function, whereas in a two-agent zero-sum MG, the Nash equilibrium is obtained by solving a minimax estimation problem based on the optimal value function.
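For concreteness, the two structural facts used by these single-agent and zero-sum approaches can be written schematically as follows (generic notation, not tied to the definitions introduced later in this paper):

```latex
% Single-agent RL: the optimal policy is greedy with respect to the optimal Q-function.
\pi^\star(s) \in \arg\max_{a \in \mathcal{A}} Q^\star(s, a).

% Two-player zero-sum MG: the Nash equilibrium (\mu^\star, \nu^\star) attains the saddle
% point of a minimax problem defined by the optimal value function.
V^\star(s) = \max_{\mu \in \Delta(\mathcal{A}_1)} \min_{\nu \in \Delta(\mathcal{A}_2)}
  \mathbb{E}_{a \sim \mu,\ b \sim \nu}\bigl[Q^\star(s, a, b)\bigr].
```

In both cases a single known object, the optimal value function, determines the policy of interest.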
Contrastingly, in a general-sum MG, individual agents possess distinct value functions, and thus there exists no unified optimal value function that characterizes the equilibrium behavior. Moreover, unlike a zero-sum MG, a general-sum MG can admit diverse equilibrium notions, where each corresponds to a set of policies. Consequently, methodologies developed for single-agent RL or zero-sum MGs cannot be directly extended to general-sum MGs. Recently, several works propose sample-efficient RL algorithms for general-sum MGs. In particular, Chen et al. (2022b); Foster et al. (2023) propose model-based algorithms for learning NE/CCE/CE based on multi-agent extensions of the Estimation-to-Decision algorithm (Foster et al., 2021), and they establish regret upper bounds in terms of complexity metrics that extend Decision-Estimation Coefficient (Foster et al., 2021) to MGs. In addition, Wang et al. (2023) study model-free RL for general-sum MGs with the general function approximation. They focus on developing a decentralized and no-regret algorithm that finds a CCE. Thus, it seems unclear how to design a provably sample-efficient MARL algorithm for NE/CCE/CE for general-sum MGs in a model-free manner. Furthermore, motivated by the recent development in single-agent RL (Jin et al., 2021a; Du et al., 2021; Zhong et al., 2022; Foster et al., 2021; Liu et al., 2023), we aim to develop a unified algorithmic framework for MARL that covers both model-free and model-based approaches. Thus, we aim to address the following questions: Can we design a unified algorithmic framework for general-sum MGs such that (i) it is provably sample-efficient in learning NE/CCE/CE in the context of the function approximation and (ii) it covers both model-free and model-based MARL approaches? In this paper, we provide an affirmative answer to the above questions. Specifically, we propose a unified algorithmic framework named Multi-Agent Maximize-to-EXplore (MAMEX) for generalsum MGs with the general function approximation. MAMEX extends the framework of Maximizeto-Explore (Liu et al., 2023) to general-sum MGs by employing it together with an equilibrium solver for general-sum normal-form games defined over the policy space. Maximize-to-Explore (MEX) is a class of RL algorithms for single-agent MDP and two-agent zerosum MGs where each new policy is updated by solving an optimization problem involving a hypothesis f, which can be regarded as the action-value function in the model-free version and the transition model in the model-based version. The optimization objective of MEX contains two terms (a) the optimal value with respect to the hypothesis f and (b) a loss function computed from data that quantifies how far f is from being the true hypothesis. Here, the term (a) reflects the planning part of online RL and leverages the fact that the optimal policy is uniquely characterized by the given hypothesis. On the other hand, the term (b), which can be the mean-squared Bellman error or log-likelihood function, reflects the estimation part of online RL. By optimizing the sum of (a) and (b) over the space of hypotheses without any data-dependent constraints, MEX balances exploitation with exploration in the context of the function approximation. However, the first term in MEX s optimization objective leverages the fact that the optimal policy can be uniquely constructed from the optimal value function or the true model, using a greedy step or dynamic programming. 
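As a rough sketch, the MEX update described above can be written as a single unconstrained objective over the hypothesis class (schematic form only; $\eta > 0$ is the trade-off parameter and $L^{k-1}$ is the data-dependent loss, with the exact construction given in Liu et al. (2023)):

```latex
% Schematic single-agent MEX update in episode k:
f^k \in \arg\max_{f \in \mathcal{F}}
  \Bigl\{ \underbrace{V_{1,f}(\rho)}_{\text{(a) planning}}
        \;-\; \underbrace{\eta \, L^{k-1}(f)}_{\text{(b) estimation}} \Bigr\},
\qquad \pi^k = \text{the optimal policy induced by } f^k .
```

The term (a) favors hypotheses with large optimal values (optimism), while the term (b) anchors the chosen hypothesis to the data.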
Such a nice property cannot be extended to general-sum MGs, where the relationship between the equilibrium policies and value function is more complicated, and each agent has its own value function. As a result, it is impractical to construct a single-objective optimization problem in the style of MEX over the hypothesis space for general-sum MGs. Instead of optimizing over the spaces of hypotheses, MAMEX optimizes over the policy space. Specifically, in each iteration, MAMEX updates the joint policy of all agents by solving for a desired equilibrium (NE/CCE/CE) of a normal-form game, where the pure strategies are a class of joint policies of the n agents, e.g., the class of deterministic joint policies. Besides, for each pure strategy of this normal form game, the corresponding payoff function is obtained by solving a regularized optimization problem over the hypothesis space a la MEX. Thus, policy updates in MAMEX involve the following two steps: (i) For each pure strategy π, construct the payoff function V i(π) for each agent i by solving an unconstrained and regularized optimization problem; (ii) Compute the NE/CCE/CE of the normal-form game over the space of pure strategies with payoff functions {Vi(π)}n i=1, where n is the number of agents. Published as a conference paper at ICLR 2024 The implementation of MAMEX only requires an oracle for solving a single-objective and unconstrained optimization problem and an oracle for solving NE/CCE/CE of a normal-form game. Compared to existing works that either solve constrained optimization subproblems within datadependent constraints (Wang et al., 2023), or complex multi-objective or minimax optimization subproblems (Foster et al., 2023; Chen et al., 2022b), MAMEX is more amenable to practical implementations. Furthermore, step (i) of MAMEX resembles MEX, which enables both model-free and model-based instantiations. We prove that MAMEX is provably sample-efficient in a rich class of general-sum MGs. To this end, we introduce a novel complexity measure named Multi-Agent Decoupling Coefficient (MADC) to capture the exploration-exploitation tradeoff in MARL. Compared to the decoupling coefficient and its variants (Dann et al., 2021; Agarwal & Zhang, 2022; Zhong et al., 2022) proposed for the singleagent setting, MADC characterize the hardness of exploration in MGs in terms of the discrepancy between the out-of-sample prediction error and the in-sample training error incurred by minimizing a discrepancy function ℓon the historical data. MADC is defined based on the intuition that if a hypothesis attains a small training error on a well-explored dataset, it would also incur a small prediction error. When the MADC of an MG instance is small, achieving a small training error ensures a small prediction error, and thus exploration is relatively easy. We prove that MAMEX achieves a sublinear regret for learning NE/CCE/CE in classes with small MADCs, which includes multi-agent counterparts of models with low Bellman eluder dimensions (Jin et al., 2021a; 2022; Huang et al., 2021), Bilinear Classes (Du et al., 2021), and models with low witness ranks (Sun et al., 2019; Huang et al., 2021). When specialized to specific members within these classes, MAMEX yields comparable regret upper bounds to existing works. Our Contributions. In summary, our contributions are two-fold. 
First, we provide a unified algorithmic framework named Multi-Agent Maximize-to-EXplore (MAMEX) for both model-free and model-based MARL, which is sample-efficient in finding the NE/CCE/CE in general-sum MGs with small MADCs. Moreover, MAMEX leverages an equilibrium-solving oracle for normal-form games defined over a class of joint policies for policy updates, and a single-objective optimization procedure that solves for the payoff functions of these normal-form games. To our best knowledge, the model-free version of MAMEX is the first model-free algorithm for general-sum MGs that learns all three equilibria NE, CCE, and CE with sample efficiency. Second, we introduce a complexity measure, Multi-Agent Decoupling Coefficient (MADC), to quantify the hardness of exploration in a general-sum MG in the context of the function approximation. The class of MGs with low MADCs includes a rich class of MG instances, such as multi-agent counterparts of models with low Bellman eluder dimensions (Jin et al., 2021a; 2022; Huang et al., 2021), Bilinear Classes (Du et al., 2021), and models with low witness ranks (Sun et al., 2019; Huang et al., 2021). When specialized to specific MG instances in these classes, we achieve comparable regret upper bounds to existing works. Related Works. Our paper is closely related to the prior research on Markov Games and MARL with the function approximation. A comprehensive summary of the related literature is in A. 2 MODELS AND PRELIMINARIES 2.1 MARKOV GAMES For clarity, certain mathematical notations are provided in Appendix B. General-Sum Markov Games In this work, we consider general-sum Markov Games (MGs) in the episodic setting, which is denoted by a tuple (S, H, A, {r(i) h }i [n],h [H], {Ph}h [H], ρ), where n is the number of agents, H is the length of one episode, S is the state set, and A = n i=1Ai is the joint action set. Here, Ai is the action set of the agent i. Moreover, r(i) h : S A 7 R is the known reward function1 of the agent i at step h, Ph : S A (S) is the transition kernel at the h-th step, and ρ (S) is the distribution of the initial state s1. We assume the n agents 1Our results can be extended to the unknown stochastic reward case (Agarwal & Zhang, 2022; Zhong et al., 2022). Note that learning the transition kernel is more difficult than learning the reward. Published as a conference paper at ICLR 2024 observe the same state at each step and each agent i chooses an action within its own action set Ai simultaneously. In each episode, starting from s1 p0, for each h [H], the agents choose their joint action ah A in state sh, where ah = (a(1) h , . . . , a(n) h ). Then, each agent i receives its own reward r(i) h (sh, ah), and the game move to the next state sh+1 Ph(sh+1 | sh, ah). Moreover, we assume PH h=1 r(i) h (sh, ah) [0, R] for any possible state-action sequences for some 1 R H. In MGs, the agents policy can be stochastic and correlated. To capture such a property, we introduce the notion of pure policy and joint policy as follows. For each agent i, its local (Markov) policy maps a state s to a distribution over the local action space Ai. We let Πpur h,i {π : S 7 (Ai)} denote a subset of the agent i s local policies for step h [H], which is called the set of Markov pure policies. We then define Πpur i = h [H]Πpur h,i. We assume the agent i s policy is a random variable taking values in Πpur i . Specifically, let ω Ωbe the random seed. 
The random policy $\pi^{(i)} = \{\pi^{(i)}_h\}_{h\in[H]}$ for the agent $i$ contains $H$ mappings $\pi^{(i)}_h : \Omega \mapsto \Pi^{\text{pur}}_{h,i}$ such that $\pi^{(i)}_h(\omega) \in \Pi^{\text{pur}}_{h,i}$ is a pure policy. To execute $\pi^{(i)}$, the agent $i$ first samples a random seed $\omega \in \Omega$, and then follows the policy $\pi^{(i)}_h(\omega)$ for all $h \in [H]$. The joint policy $\pi$ of the $n$ agents is a set of policies $\{\pi^{(i)}\}_{i=1}^n$ whose randomness is coupled through the joint random seed $\omega = (\omega_1, \ldots, \omega_n)$. In other words, $\{\pi^{(i)}_h(\omega)\}_{i\in[n]} \in \otimes_{i=1}^n \Pi^{\text{pur}}_{h,i}$ are random policies of the $n$ agents whose randomness is correlated by the random seed $\omega$. Equivalently, we can regard $\pi$ as a random variable over $\otimes_{i=1}^n \Pi^{\text{pur}}_{i}$. Furthermore, a special class of joint policies is the product policy, where each agent executes its own policy independently. In other words, we have $\omega = (\omega_1, \ldots, \omega_n)$, where $\omega_1, \ldots, \omega_n$ are independent, and each $\pi^{(i)}$ depends on $\omega_i$ only. We let $\pi_h(a \mid s)$ denote the probability of taking action $a$ in state $s$ at step $h$. As a result, we have $\pi_h(a \mid s) = \prod_{i=1}^n \pi^{(i)}_h(a^{(i)} \mid s)$ for any product policy $\pi$.

Furthermore, using the notion of pure policy and joint policy, we can equivalently view the MG as a normal-form game over $\Pi^{\text{pur}} = \otimes_{i=1}^n \Pi^{\text{pur}}_i$. That is, each pure policy can be viewed as a pure strategy of the normal-form game, and each joint policy can be viewed as a mixed strategy. Such a view is without loss of generality, because we can choose $\Pi^{\text{pur}}_i$ to be the set of all possible deterministic policies of the agent $i$. Meanwhile, using a general $\Pi^{\text{pur}}_i$, we can also incorporate parametric policies as the pure policies, e.g., log-linear policies (Xie et al., 2021; Yuan et al., 2022; Cayci et al., 2021).

The value function $V^{(i),\pi}_h$ is the expected cumulative reward received by the agent $i$ from step $h$ to step $H$ when all the agents follow a joint policy $\pi$, which is defined as
$$V^{(i),\pi}_h(s) = \mathbb{E}_{\pi}\Bigl[\textstyle\sum_{h'=h}^{H} r^{(i)}_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s\Bigr].$$
We let $V^{(i),\pi}(\rho) = \mathbb{E}_{s\sim\rho}[V^{(i),\pi}_1(s)]$ denote the agent $i$'s expected cumulative reward within the whole episode. Besides, the corresponding Q-function (action-value function) can be written as
$$Q^{(i),\pi}_h(s, a) = \mathbb{E}_{\pi}\Bigl[\textstyle\sum_{h'=h}^{H} r^{(i)}_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s, a_h = a\Bigr]. \qquad (2.1)$$

For a joint policy $\pi$ and any agent $i$, we let $\pi^{(-i)}$ denote the joint policy excluding the agent $i$. Given $\pi^{(-i)}$, the best response of the agent $i$ is defined as $\pi^{(i),\dagger} = \arg\max_{\nu \in \Delta(\Pi^{\text{pur}}_i)} V^{(i),\nu\times\pi^{(-i)}}(\rho)$, which is a random policy of the agent $i$ that maximizes its expected reward when the other agents follow $\pi^{(-i)}$. Besides, we denote $\mu^{(i),\pi} = (\pi^{(i),\dagger}, \pi^{(-i)})$.

Online Learning and Solution Concepts. We focus on three common equilibrium notions in game theory: Nash Equilibrium (NE), Coarse Correlated Equilibrium (CCE), and Correlated Equilibrium (CE). First, an NE of a game is a product policy such that no individual player can improve its expected cumulative reward by unilaterally deviating from its local policy.

Definition 2.1 ($\varepsilon$-Nash Equilibrium). A product policy $\pi$ is an $\varepsilon$-Nash Equilibrium if $V^{(i),\mu^{(i),\pi}}(\rho) \le V^{(i),\pi}(\rho) + \varepsilon$ for all $i \in [n]$, where $\mu^{(i),\pi} = (\pi^{(i),\dagger}, \pi^{(-i)})$ and $\pi^{(i),\dagger}$ is the best response policy with respect to $\pi^{(-i)}$.

In other words, a product policy $\pi$ is an $\varepsilon$-Nash Equilibrium if and only if
$$\max_{i\in[n]} \Bigl\{ \max_{\nu \in \Delta(\Pi^{\text{pur}}_i)} V^{(i),\nu\times\pi^{(-i)}}(\rho) - V^{(i),\pi}(\rho) \Bigr\} \le \varepsilon.$$

In this work, we design algorithms for the online and self-play setting. That is, we control the joint policy of all agents, interact with the environment over $K$ episodes, and aim to learn the desired equilibrium notion from bandit feedback. To this end, let $\pi^k$ denote the joint policy that the agents execute in the $k$-th episode, $k \in [K]$.
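Since the MG can be viewed as a normal-form game over pure policies, the best-response gap in Definition 2.1 can be computed exactly in a small matrix game. The following sketch uses a hypothetical two-agent game with hand-picked payoffs (not an object from the paper) to illustrate the quantity $\max_{\nu} V^{(i),\nu\times\pi^{(-i)}}(\rho) - V^{(i),\pi}(\rho)$.

```python
import numpy as np

# Hypothetical 2-agent normal-form game with two pure policies per agent.
# U[i][a, b] plays the role of agent i's payoff V^{(i),(a,b)}(rho) when agent 1
# uses pure policy a and agent 2 uses pure policy b (numbers are made up).
U = [np.array([[2.0, 0.0],
               [0.0, 1.0]]),      # agent 1
     np.array([[1.0, 0.0],
               [0.0, 2.0]])]      # agent 2

def nash_gaps(p1, p2):
    """Best-response gaps of the product policy (p1, p2), one per agent."""
    v1 = p1 @ U[0] @ p2            # V^{(1),pi}(rho)
    v2 = p1 @ U[1] @ p2            # V^{(2),pi}(rho)
    gap1 = np.max(U[0] @ p2) - v1  # max_nu V^{(1), nu x pi^{(-1)}} - V^{(1),pi}
    gap2 = np.max(p1 @ U[1]) - v2  # max_nu V^{(2), nu x pi^{(-2)}} - V^{(2),pi}
    return gap1, gap2

# The uniform product policy is a 0.25-NE here (gaps 0.25, 0.25), while the
# pure profile (0, 0) is an exact NE (both gaps are 0).
print(nash_gaps(np.array([0.5, 0.5]), np.array([0.5, 0.5])))
print(nash_gaps(np.array([1.0, 0.0]), np.array([1.0, 0.0])))
```

The Nash-regret in Definition 2.2 below is the sum of exactly these per-agent gaps over the $K$ deployed policies.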
We define the Nash-regret as the cumulative suboptimality across all agents with respect to NE.

Definition 2.2 (Nash-Regret). For all $k \in [K]$, let $\pi^k$ denote the product policy deployed in the $k$-th episode. The Nash-regret is defined as
$$\mathrm{Reg}_{\mathrm{NE}}(K) = \sum_{k=1}^{K}\sum_{i=1}^{n} \bigl( V^{(i),\mu^{(i),\pi^k}}(\rho) - V^{(i),\pi^k}(\rho) \bigr).$$

By replacing NE with CCE and CE, we can define the CCE-regret and CE-regret in a similar way. The detailed definitions are provided in Appendix C. We note that the definitions of NE, CCE, and CE align with those defined on the normal-form game over the space of pure policies. That is, each agent $i$'s pure strategy is a pure policy $\pi^{(i)} \in \Pi^{\text{pur}}_i$, and the payoff of the agent $i$ when the mixed strategy is $\pi$ is given by $V^{(i),\pi}(\rho)$.

2.2 FUNCTION APPROXIMATION

To handle the large state space in MARL, we assume access to a hypothesis class $\mathcal F$, which captures the Q-functions in the model-free setting and the transition kernel in the model-based setting.

Model-Based Function Approximation. In the model-based setting, the hypothesis class $\mathcal F$ contains the model (transition kernel) of MGs. Specifically, we let $P_f = \{P_{1,f}, \ldots, P_{H,f}\}$ denote the transition kernel parameterized by $f \in \mathcal F$. When the model parameter is $f$ and the joint policy is $\pi$, we denote the value function and Q-function of the agent $i$ at the $h$-th step as $V^{(i),\pi}_{h,f}(s)$ and $Q^{(i),\pi}_{h,f}(s,a)$, respectively. We have the Bellman equation
$$Q^{(i),\pi}_{h,f}(s,a) = r^{(i)}_h(s,a) + \mathbb{E}_{s'\sim P_{h,f}(\cdot\mid s,a)}\bigl[V^{(i),\pi}_{h+1,f}(s')\bigr].$$

Model-Free Function Approximation. In the model-free setting, we let $\mathcal F = \otimes_{i=1}^n \mathcal F^{(i)} = \otimes_{i=1}^n (\otimes_{h=1}^H \mathcal F^{(i)}_h)$ be a class of Q-functions of the $n$ agents, where $\mathcal F^{(i)}_h = \{f^{(i)}_h : \mathcal S\times\mathcal A \mapsto \mathbb R\}$ is a class of Q-functions of the agent $i$ at the $h$-th step. For any $f \in \mathcal F$, we denote $Q^{(i)}_{h,f}(s,a) = f^{(i)}_h(s,a)$ for all $i\in[n]$ and $h\in[H]$. Meanwhile, for any joint policy $\pi$ and any $f\in\mathcal F$, we define $V^{(i),\pi}_{h,f}(s) = \mathbb{E}_{a\sim\pi_h(\cdot\mid s)}[f^{(i)}_h(s,a)] = \langle f^{(i)}_h(s,\cdot), \pi_h(\cdot\mid s)\rangle_{\mathcal A}$. For any joint policy $\pi$, agent $i$, and step $h$, we define the Bellman operator $\mathcal T^{(i),\pi}_h$ by letting
$$\bigl(\mathcal T^{(i),\pi}_h(f_{h+1})\bigr)(s,a) = r^{(i)}_h(s,a) + \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\bigl\langle f_{h+1}(s',\cdot), \pi_{h+1}(\cdot\mid s')\bigr\rangle_{\mathcal A}, \qquad \forall f \in \mathcal F^{(i)}. \qquad (2.2)$$
Note that the Bellman operator depends on the index $i$ of the agent because the reward functions of the agents are different. Such a definition is an extension of the Bellman evaluation operator in the single-agent setting (Puterman, 2014) to multi-agent MGs. By definition, $\{Q^{(i),\pi}_h\}$ defined in (2.1) is the fixed point of $\mathcal T^{(i),\pi}_h$, i.e., $Q^{(i),\pi}_h = \mathcal T^{(i),\pi}_h(Q^{(i),\pi}_{h+1})$ for all $h\in[H]$.

For both the model-based and the model-free settings, we impose the realizability assumption, which requires that the hypothesis space $\mathcal F$ is sufficiently expressive such that it contains the true transition model or the true Q-functions. Besides, for the model-free setting, we also require that the hypothesis classes be closed with respect to the Bellman operator.

Assumption 2.3 (Realizability and Completeness). For the model-based setting, we assume the true transition model $f^*$ lies in the hypothesis class $\mathcal F$. Besides, for the model-free setting, for any pure policy $\pi$ and any $i\in[n]$, we assume that $Q^{(i),\pi} \in \mathcal F^{(i)}$ and $\mathcal T^{(i),\pi}_h \mathcal F^{(i)}_{h+1} \subseteq \mathcal F^{(i)}_h$ for all $h\in[H]$.

Covering Number and Bracketing Number. When a function class $\mathcal F$ is infinite, the $\delta$-covering number $\mathcal N_{\mathcal F}(\delta)$ and the $\delta$-bracketing number $\mathcal B_{\mathcal F}(\delta)$ serve as surrogates of the cardinality of $\mathcal F$. We put the definitions in Appendix C.2.
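To make the Bellman evaluation operator in (2.2) concrete, the sketch below evaluates $(\mathcal T^{(i),\pi}_h f_{h+1})(s,a)$ exactly in a small tabular instance. All arrays (transition kernel, reward, policy, Q-function) are hypothetical stand-ins, not objects defined in the text.

```python
import numpy as np

S, A = 3, 2                       # small tabular toy: 3 states, 2 joint actions
rng = np.random.default_rng(0)

P_h = rng.dirichlet(np.ones(S), size=(S, A))   # P_h(s' | s, a), shape (S, A, S)
r_h = rng.uniform(0.0, 1.0, size=(S, A))       # agent i's reward r_h^{(i)}(s, a)
pi_next = rng.dirichlet(np.ones(A), size=S)    # pi_{h+1}(a | s'), shape (S, A)
f_next = rng.uniform(0.0, 1.0, size=(S, A))    # candidate Q-function f_{h+1}(s', a)

def bellman_target(f_next):
    """(T_h^{(i),pi} f_{h+1})(s,a) = r_h(s,a) + E_{s'}[<f_{h+1}(s',.), pi_{h+1}(.|s')>]."""
    v_next = (f_next * pi_next).sum(axis=1)    # <f_{h+1}(s',.), pi_{h+1}(.|s')>_A
    return r_h + P_h @ v_next                  # shape (S, A)

target = bellman_target(f_next)
# The true Q-function Q_h^{(i),pi} is the fixed point of this map: applying it with
# f_{h+1} = Q_{h+1}^{(i),pi} returns Q_h^{(i),pi} (cf. the discussion below (2.2)).
print(target.shape)   # (3, 2)
```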
Multi-Agent Decoupling Coefficient. Now we introduce a key complexity measure, the multi-agent decoupling coefficient (MADC), which captures the hardness of exploration in MARL. Such a notion is an extension of the decoupling coefficient (Dann et al., 2021) to general-sum MGs.

Definition 2.4 (Multi-Agent Decoupling Coefficient). The Multi-Agent Decoupling Coefficient of an MG is defined as the smallest constant $d_{\mathrm{MADC}} \ge 1$ such that for any $i \in [n]$, $\mu > 0$, $\{f^k\}_{k\in[K]} \subseteq \mathcal F^{(i)}$, and $\{\pi^k\}_{k\in[K]} \subseteq \Pi^{\text{pur}}$, the following inequality holds:
$$\underbrace{\sum_{k=1}^{K}\bigl(V^{(i),\pi^k}_{f^k}(\rho) - V^{(i),\pi^k}(\rho)\bigr)}_{\text{prediction error}} \;\le\; \underbrace{\frac{1}{\mu}\sum_{k=1}^{K}\sum_{s=1}^{k-1}\ell^{(i),s}(f^k,\pi^k)}_{\text{training error}} \;+\; \underbrace{\mu\, d_{\mathrm{MADC}} + 6\, d_{\mathrm{MADC}} H}_{\text{gap}}, \qquad (2.3)$$
where we define $V^{(i),\pi^k}_{f^k}(\rho) = \mathbb{E}_{s_1\sim\rho}[V^{(i),\pi^k}_{1,f^k}(s_1)]$, and $\ell^{(i),s}(f^k,\pi^k)$ is a discrepancy function that measures the inconsistency between $f^k$ and $\pi^k$ on the historical data. The specific definitions of $\{\ell^{(i),s}\}_{i\in[n],s\in[K-1]}$ under the model-free and model-based settings are given in (2.4) and (2.5).

Model-Free RL. In the model-free setting, for $\{\pi^k\}_{k\in[K]} \subseteq \Pi^{\text{pur}}$ in (2.3), the discrepancy function $\ell^{(i),s}(f,\pi)$ for $\pi \in \Pi^{\text{pur}}$ is defined as
$$\ell^{(i),s}(f,\pi) = \sum_{h=1}^{H} \mathbb{E}_{(s_h,a_h)\sim\pi^s}\Bigl[\bigl((f_h - \mathcal T^{(i),\pi}_h(f_{h+1}))(s_h,a_h)\bigr)^2\Bigr], \qquad \forall f\in\mathcal F^{(i)},\; s\in[K]. \qquad (2.4)$$
That is, $\ell^{(i),s}(f,\pi)$ measures agent $i$'s mean-squared Bellman error for evaluating $\pi$, when the trajectory is sampled by letting all agents follow the policy $\pi^s$.

Model-Based RL. We choose the discrepancy function $\ell^{(i),s}$ in (2.3) as
$$\ell^{(i),s}(f^k,\pi^k) = \sum_{h=1}^{H} \mathbb{E}_{(s_h,a_h)\sim\pi^s}\Bigl[ D^2_{\mathrm H}\bigl( P_{h,f^k}(\cdot\mid s_h,a_h) \,\big\|\, P_{h,f^*}(\cdot\mid s_h,a_h) \bigr) \Bigr], \qquad (2.5)$$
where $D_{\mathrm H}$ denotes the Hellinger distance and $\mathbb{E}_{(s_h,a_h)\sim\pi^s}$ means that the expectation is taken with respect to the randomness of the trajectory induced by $\pi^s$ on the true model $f^*$. Intuitively, it represents the expected in-sample distance between the model $f^k$ and the true model $f^*$.

Note that the discrepancy between $f^k$ and $\pi^k$ in (2.3) is summed over $s \in [k-1]$. Thus, in both the model-free and model-based settings, the training error can be viewed as the in-sample error of $f^k$ on the historical data collected before the $k$-th episode. Thus, for an MG with a finite MADC, the prediction error is small whenever the training error is small. Specifically, when the training error is $\mathcal O(K^{\alpha})$ for some $\alpha \in (0,2)$, then by choosing a proper $\mu$, the prediction error grows as $\mathcal O(\sqrt{K^{\alpha} d_{\mathrm{MADC}}}) = o(K)$. In other words, as $K$ increases, the average prediction error decays to zero. In single-agent RL, when we adopt an optimistic algorithm, the prediction error serves as an upper bound of the regret (Dann et al., 2021; Zhong et al., 2022; Jin et al., 2021a). Therefore, by quantifying how the prediction error is related to the training error, the MADC can be used to characterize the hardness of exploration in MARL.

Compared to the decoupling coefficient and its variants for the single-agent MDP or the two-player zero-sum MG (Dann et al., 2021; Agarwal & Zhang, 2022; Zhong et al., 2022; Xiong et al., 2022), the MADC selects the policy $\pi^k$ in a different way. In the single-agent setting, the policy $\pi^k$ is always selected as the greedy policy of $f^k$, hence $V^{\pi^k}_{1,f^k}(\rho)$ is equivalent to the optimal value function. In the zero-sum MG, the policy pair $\pi^k$ is always selected as the Nash policy and the best response (Xiong et al., 2022). On the contrary, in our definition, the policy $\pi^k$ is not necessarily the greedy policy of $f^k$. In fact, $\{\pi^k\}_{k\in[K]}$ can be any pure policy sequence that is unrelated to $\{f^k\}_{k\in[K]}$.

Assumption 2.5 (Finite MADC). We assume that the MADC of the general-sum MG of interest is finite, denoted by $d_{\mathrm{MADC}}$.
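The discrepancy functions (2.4) and (2.5) are expectations under the state-action distribution induced by an earlier policy $\pi^s$, so in a small tabular instance they can be evaluated exactly from occupancy measures. The sketch below does this for the model-based discrepancy (2.5); all arrays (true and estimated kernels, policy) are hypothetical stand-ins, and the Hellinger distance uses the convention $D^2_{\mathrm H}(P\|Q) = \tfrac12\sum_x(\sqrt{P(x)}-\sqrt{Q(x)})^2$.

```python
import numpy as np

S, A, H = 3, 2, 4
rng = np.random.default_rng(1)
P_true = rng.dirichlet(np.ones(S), size=(H, S, A))   # true kernel P_{h,f*}(s'|s,a)
P_est = rng.dirichlet(np.ones(S), size=(H, S, A))    # candidate model P_{h,f^k}(s'|s,a)
pi_s = rng.dirichlet(np.ones(A), size=(H, S))        # earlier pure policy pi^s_h(a|s)
rho = np.full(S, 1.0 / S)                            # initial state distribution

def occupancies(P, pi):
    """d_h(s, a): state-action distribution of pi under the model P, for h = 1..H."""
    d_state = rho.copy()
    out = []
    for h in range(H):
        d_sa = d_state[:, None] * pi[h]              # d_h(s) * pi_h(a|s)
        out.append(d_sa)
        d_state = np.einsum('sa,sap->p', d_sa, P[h]) # next-state marginal
    return out

def hellinger_sq(p, q):
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2, axis=-1)

# Model-based discrepancy (2.5): sum_h E_{(s_h,a_h) ~ pi^s}[ D_H^2(P_est || P_true) ].
d = occupancies(P_true, pi_s)                        # trajectories come from the true model
ell = sum((d[h] * hellinger_sq(P_est[h], P_true[h])).sum() for h in range(H))
print(ell)
```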
As we will show in Appendix D, the class of MGs with low MADCs includes a rich class of MG instances, including multi-agent counterparts of models with low Bellman eluder dimensions (Jin et al., 2021a; 2022; Huang et al., 2021), Bilinear Classes (Du et al., 2021), and models with low witness ranks (Sun et al., 2019; Huang et al., 2021).

3 ALGORITHM AND RESULTS

In this section, we first introduce a unified algorithmic framework called Multi-Agent Maximize-to-EXplore (MAMEX). Then, we present the regret and sample complexity upper bounds of MAMEX, showing that both the model-free and model-based versions of MAMEX are sample-efficient for learning NE/CCE/CE under general function approximation.

3.1 ALGORITHM

Algorithm 1 Multi-Agent Maximize-to-EXplore (MAMEX)
1: Input: Hypothesis class $\mathcal F$, parameter $\eta > 0$, and an equilibrium solving oracle EQ.
2: for $k = 1, 2, \ldots, K$ do
3:   Compute $V^k_i(\pi)$ defined in (3.1) for all $\pi \in \Pi^{\text{pur}}$ and all $i \in [n]$.
4:   Compute the NE/CCE/CE of the normal-form game defined on $\Pi^{\text{pur}}$ with payoff functions $\{V^k_i(\pi)\}_{i=1}^n$: $\pi^k \leftarrow \mathrm{EQ}(V^k_1, V^k_2, \ldots, V^k_n)$.
5:   Sample a pure joint policy $\zeta^k \sim \pi^k$, and collect a trajectory $\{s^k_h, a^k_h\}_{h\in[H]}$ following $\zeta^k$.
6:   Update $\{L^{(i),k}\}_{i=1}^n$ according to (3.2) (model-free) or (3.3) (model-based).
7: end for

In this subsection, we provide the MAMEX algorithm for multi-agent RL under general function approximation, which extends the MEX algorithm (Liu et al., 2023) to general-sum MGs. Recall that the definitions of NE/CCE/CE of general-sum MGs coincide with those defined on the normal-form game whose pure strategies are the pure policies in $\Pi^{\text{pur}}$. Thus, if we knew the payoffs $\{V^{(i),\pi}(\rho)\}_{i\in[n]}$ for all $\pi \in \Pi^{\text{pur}}$, we could directly compute the desired NE/CCE/CE given an equilibrium solving oracle for the normal-form game. However, each $V^{(i),\pi}(\rho)$ is unknown and has to be estimated from data via online learning. Thus, in a nutshell, MAMEX is an iterative algorithm that consists of the following two steps:

(a) Policy evaluation: For each $k\in[K]$, construct an estimator $V^k_i(\pi)$ of $V^{(i),\pi}(\rho)$ for each pure policy $\pi\in\Pi^{\text{pur}}$ and each agent $i\in[n]$, based on the historical data collected in the previous $k-1$ episodes. Here, the policy evaluation subproblem can be solved in both a model-free and a model-based fashion.

(b) Equilibrium finding: Compute an equilibrium (NE/CCE/CE) of the normal-form game over the space of pure policies with the estimated payoff functions $\{V^k_i(\pi)\}_{i=1}^n$. The joint policy returned by the equilibrium finding step is then executed in the next episode to generate a new trajectory.

By the algorithmic design, to strike a balance between exploration and exploitation, it is crucial to construct $\{V^k_i(\pi)\}_{i=1}^n$ in a way that promotes exploration. To this end, we solve a regularized optimization problem over the hypothesis class $\mathcal F^{(i)}$ to obtain $V^k_i(\pi)$, where the objective function balances exploration with exploitation. We introduce the details of MAMEX as follows.

Policy Evaluation. For each $k\in[K]$, before the $k$-th episode, we have collected $k-1$ trajectories $\tau^{1:k-1} = \cup_{t=1}^{k-1}\{s^t_1, a^t_1, r^t_1, \ldots, s^t_H, a^t_H, r^t_H\}$. For any $i\in[n]$, $\pi\in\Pi^{\text{pur}}$ and $f\in\mathcal F^{(i)}$ (for ease of notation, under the model-based setting we denote $\mathcal F^{(i)} = \mathcal F$ for all agents $i\in[n]$), we can define a data-dependent discrepancy function $L^{(i),k-1}(f,\pi,\tau^{1:k-1})$. Such a function measures the in-sample error of the hypothesis $f$ with respect to a policy $\pi$, evaluated on the historical data $\tau^{1:k-1}$. The specific form of such a function differs under the model-free and model-based settings.
In particular, as we will show in (3.2) and (3.3) below, under the model-free setting, $L^{(i),k-1}(f,\pi,\tau^{1:k-1})$ is constructed based on the mean-squared Bellman error with respect to the Bellman operator $\mathcal T^{(i),\pi}_h$ in (2.2), while under the model-based setting, $L^{(i),k-1}(f,\pi,\tau^{1:k-1})$ is constructed based on the negative log-likelihood loss. Then, for each $\pi\in\Pi^{\text{pur}}$ and $i\in[n]$, we define $V^k_i(\pi)$ as
$$V^k_i(\pi) = \sup_{f\in\mathcal F^{(i)}} \Bigl\{ \widehat V^{(i),\pi,k}(f) := \underbrace{V^{(i),\pi}_f(\rho)}_{\text{(a)}} - \underbrace{\eta\, L^{(i),k-1}(f,\pi,\tau^{1:k-1})}_{\text{(b)}} \Bigr\}. \qquad (3.1)$$

Equilibrium Finding. Afterwards, the algorithm utilizes the equilibrium oracle EQ (Line 4 of Algorithm 1) to compute an equilibrium (NE/CCE/CE) of the normal-form game over $\Pi^{\text{pur}}$ with payoff functions $\{V^k_i(\pi)\}_{i=1}^n$. The solution of the equilibrium oracle is a mixed strategy $\pi^k$, i.e., a probability distribution over $\Pi^{\text{pur}}$. Finally, we sample a random pure policy $\zeta^k$ from $\pi^k$ and execute $\zeta^k$ in the $k$-th episode to generate a new trajectory. See Algorithm 1 for the details of MAMEX. Here, we implicitly assume that $\Pi^{\text{pur}}$ is finite for ease of presentation; for example, $\Pi^{\text{pur}}$ can be the set of all deterministic policies. When $\Pi^{\text{pur}}$ is infinite, we can replace $\Pi^{\text{pur}}$ by a $1/K$-cover of $\Pi^{\text{pur}}$ with respect to the distance $d^{(i)}(\pi^{(i)}, \tilde\pi^{(i)}) = \max_{s\in\mathcal S}\|\pi^{(i)}(\cdot\mid s) - \tilde\pi^{(i)}(\cdot\mid s)\|_1$.

Furthermore, the objective $\widehat V^{(i),\pi,k}(f)$ in (3.1) is the sum of (a) the value function $V^{(i),\pi}_f(\rho)$ of $\pi$ under the hypothesis $f$ and (b) a regularization term $-\eta\, L^{(i),k-1}(f,\pi,\tau^{1:k-1})$, and the payoff function $V^k_i(\pi)$ is obtained by solving a maximization problem over $\mathcal F^{(i)}$. The two terms (a) and (b) represent the exploration and exploitation objectives, respectively, and the parameter $\eta>0$ controls the trade-off between them. To see this, consider the case where we only have the term (b) in the objective function. In the model-based setting, (3.1) then reduces to the maximum likelihood estimation (MLE) of the model $f$ given the historical data $\tau^{1:k-1}$, and the policy $\pi^k$ returned by Line 4 is the equilibrium policy computed from the MLE model. Thus, without term (a) in $\widehat V^{(i),\pi,k}(f)$, the algorithm only performs exploitation. In addition to fitting the model, the term (a) encourages the algorithm to find a model with a large value function under the given policy $\pi$, which promotes exploration. Under the model-free setting, only having term (b) reduces to least-squares policy evaluation (LSPE) (Sutton & Barto, 2018), and thus term (b) also performs exploitation only.

Comparison with Single-Agent MEX (Liu et al., 2023). When reduced to the single-agent MDP, MAMEX can be further simplified to the single-agent MEX algorithm (Liu et al., 2023). In particular, when $n=1$, equilibrium finding reduces to maximizing the function defined in (3.1) over single-agent policies, i.e., $\max_{\pi}\max_{f\in\mathcal F} \widehat V^{\pi,k}(f)$. By exchanging the order of the two maximizations, we obtain an optimization problem over the hypothesis class $\mathcal F$, which recovers the single-agent MEX (Liu et al., 2023). In contrast, in general-sum MGs, the equilibrium policy can no longer be obtained by a single-objective optimization problem. Hence, it is not viable to directly extend MEX to optimize over the hypothesis space in MARL. Instead, MAMEX solves an optimization problem over $\mathcal F^{(i)}$ in the style of MEX for each pure policy $\pi\in\Pi^{\text{pur}}$, and then computes the NE/CCE/CE of the normal-form game over the space of pure policies.
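Putting the two steps together, the main loop of Algorithm 1 has the following shape. This is a minimal schematic, assuming a finite set of joint pure policies and finite per-agent hypothesis classes; `value_under_hypothesis`, `empirical_loss`, `solve_equilibrium`, and `rollout` are hypothetical user-supplied oracles (not functions defined in the paper), standing in for $V^{(i),\pi}_f(\rho)$, $L^{(i),k-1}$, the oracle EQ, and environment interaction, respectively.

```python
import numpy as np

def mamex(pure_policies, hypothesis_classes, eta, K,
          value_under_hypothesis, empirical_loss, solve_equilibrium, rollout,
          rng=np.random.default_rng()):
    """Schematic MAMEX loop (Algorithm 1). Returns the last mixed strategy pi^K."""
    n = len(hypothesis_classes)
    history = []                                  # trajectories tau^{1:k-1}
    for k in range(1, K + 1):
        # (i) Policy evaluation: optimistic payoff V_i^k(pi) via the objective (3.1).
        payoffs = np.zeros((n, len(pure_policies)))
        for i in range(n):
            for j, pi in enumerate(pure_policies):
                payoffs[i, j] = max(
                    value_under_hypothesis(f, pi, agent=i)
                    - eta * empirical_loss(f, pi, history, agent=i)
                    for f in hypothesis_classes[i]
                )
        # (ii) Equilibrium finding: NE/CCE/CE of the normal-form game with payoffs V_i^k.
        mixed = solve_equilibrium(payoffs)        # distribution over joint pure policies
        idx = rng.choice(len(pure_policies), p=mixed)
        history.append(rollout(pure_policies[idx]))   # execute zeta^k, record the trajectory
    return mixed
```

The only optimization over the hypothesis class is the unconstrained, single-objective inner maximum, which is what distinguishes MAMEX from approaches that maintain data-dependent confidence sets.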
Comparison with Existing MARL Algorithms with Function Approximation. Previous RL algorithms for MGs with general function approximation usually require solving minimax optimization (Chen et al., 2022b; Zhan et al., 2022a; Foster et al., 2023) or constrained optimization subproblems within data-dependent constraints (Wang et al., 2023). In comparison, the optimization subproblems of MAMEX are single-objective and do not involve data-dependent constraints, and thus seem easier to implement. For example, in practice, the inner problem can be solved by a regularized version of TD learning (Liu et al., 2023), and the outer equilibrium finding can be realized by any fast method for computing equilibria (Hart & Mas-Colell, 2000; Anagnostides et al., 2022).

In the following, we instantiate the empirical discrepancy function $L^{(i),k-1}$ for both the model-free setting and the model-based setting.

Model-Free Algorithm. Under the model-free setting, we define the empirical discrepancy function $L$ as follows. For any $h\in[H]$ and $k\in[K]$, let $\xi^k_h = \{s^k_h, a^k_h, s^k_{h+1}\}$. For any $i\in[n]$, $\pi\in\Pi^{\text{pur}}$ and $f\in\mathcal F^{(i)}$, we define
$$L^{(i),k-1}(f,\pi,\tau^{1:k-1}) = \sum_{h=1}^{H}\Bigl[\sum_{j=1}^{k-1} l^{(i)}_h(\xi^j_h, f, f, \pi) - \inf_{f'_h\in\mathcal F^{(i)}_h}\sum_{j=1}^{k-1} l^{(i)}_h(\xi^j_h, f', f, \pi)\Bigr], \qquad (3.2)$$
where $l^{(i)}_h(\xi^j_h, f, g, \pi) = \bigl(f_h(s^j_h, a^j_h) - r^{(i)}_h(s^j_h, a^j_h) - \langle g_{h+1}(s^j_{h+1}, \cdot),\, \pi_{h+1}(\cdot\mid s^j_{h+1})\rangle_{\mathcal A}\bigr)^2$ is the squared Bellman error involving $f_h$ and $g_{h+1}$. In Lemma E.1, we show that $L^{(i),k-1}(f,\pi,\tau^{1:k-1})$ is an upper bound of $\sum_{s=1}^{k-1}\ell^{(i),s}(f,\pi)$, where $\ell^{(i),s}$ is defined in (2.4). Thus, the function $L^{(i),k-1}$ can be used to control the training error in the definition of the MADC.

Model-Based Algorithm. For the model-based setting, we define $L^{(i),k-1}$ as the negative log-likelihood:
$$L^{(i),k-1}(f,\pi,\tau^{1:k-1}) = -\sum_{h=1}^{H}\sum_{j=1}^{k-1}\log P_{h,f}(s^j_{h+1}\mid s^j_h, a^j_h). \qquad (3.3)$$
As we will show in Lemma E.3, the function $L^{(i),k-1}$ can be used to control the training error in (2.3), where $\ell^{(i),s}$ is defined in (2.5).

3.2 THEORETICAL RESULTS

In this subsection, we present our main theoretical results and show that MAMEX (Algorithm 1) is sample-efficient for learning NE/CCE/CE in the context of general function approximation.

Theorem 3.1. Let the discrepancy function $\ell^{(i),s}$ in (2.3) be defined as in (2.4) and (2.5) for the model-free and model-based settings, respectively. Suppose Assumptions 2.3 and 2.5 hold. By setting $K \ge 16$ and $\eta = 4/\sqrt{K-1}$, with probability at least $1-\delta$, the regret of Algorithm 1 after $K$ episodes is upper bounded by
$$\mathrm{Reg}_{\mathrm{NE,CCE,CE}}(K) \le \widetilde{\mathcal O}\bigl( n H \sqrt{K\,\Upsilon_{\mathcal F,\delta}} + n\, d_{\mathrm{MADC}} \sqrt K + n\, d_{\mathrm{MADC}} H \bigr),$$
where $\widetilde{\mathcal O}(\cdot)$ hides absolute constants and polylogarithmic terms in $H$ and $K$, and $\Upsilon_{\mathcal F,\delta}$ is a term that quantifies the complexity of the hypothesis class $\mathcal F$. In particular, we have $\Upsilon_{\mathcal F,\delta} = R^2\log(\max_{i\in[n]}\mathcal N_{\mathcal F^{(i)}}(1/K)\cdot|\Pi^{\text{pur}}|/\delta)$ in the model-free setting and $\Upsilon_{\mathcal F,\delta} = \log(\mathcal B_{\mathcal F}(1/K)/\delta)$ in the model-based setting.

Theorem 3.1 shows that MAMEX achieves a regret sublinear in $K$ for learning NE/CCE/CE, where the multiplicative factor depends polynomially on the number of agents $n$ and the horizon $H$. Thus, MAMEX is sample-efficient in the context of general function approximation. Moreover, the regret depends on the complexity of the hypothesis class via two quantities: the MADC $d_{\mathrm{MADC}}$, which captures the inherent challenge of exploring the dynamics of the MG, and the quantity $\Upsilon_{\mathcal F,\delta}$, which characterizes the complexity of estimating the true hypothesis $f^*$ based on data. To be more specific, in the model-free setting, since we need to evaluate each pure policy, $\Upsilon_{\mathcal F,\delta}$ contains $\log|\Pi^{\text{pur}}|$ due to uniform concentration.
When reduced to the tabular setting, we can choose Πpur to be the set of deterministic policies, and both ΥF,δ and d MADC are polynomials of |S| and |A|. Furthermore, when specialized to tractable special cases with function approximation and some special pure policy class such as log-linear policy class Cayci et al. (2021), we show in D that Theorem D.8 yields regret upper bounds comparable to existing works. Moreover, using the standard online-to-batch techniques, we can transform the regret bound into a sample complexity result. We defer the details to E.3. 4 CONCLUSION In this paper, we study multi-player general-sum MGs under the general function approximation. We propose a unified algorithmic framework MAMEX for both model-free and model-based RL problems with the general function approximation. Compared with previous works that either solve constrained optimization subproblems within data-dependent sub-level sets (Wang et al., 2023), or complex multi-objective minimax optimization subproblems (Chen et al., 2022b; Foster et al., 2023), the implementation of MAMEX requires only an oracle for solving a single-objective unconstrained optimization problem with an equilibrium oracle of a normal-form game, thus being more amenable to empirical implementation. Moreover, we introduce a complexity measure MADC to capture the exploration-exploitation tradeoff for general-sum MGs. We prove that MAMEX is provably sampleefficient in learning NE/CCE/CE on RL problems with small MADCs, which covers a rich class of MG models. When specialized to the special examples with small MADCs, the regret of MAMEX is comparable to existing algorithms that are designed for specific MG subclasses. Published as a conference paper at ICLR 2024 ACKNOWLEDGEMENTS Zhaoran Wang acknowledges National Science Foundation (Awards 2048075, 2008827, 2015568, 1934931), Simons Institute (Theory of Reinforcement Learning), Amazon, J.P. Morgan, and Two Sigma for their supports. Zhuoran Yang acknowledges Simons Institute (Theory of Reinforcement Learning) for its support. Yasin Abbasi-Yadkori, D avid P al, and Csaba Szepesv ari. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. Alekh Agarwal and Tong Zhang. Model-based rl with optimistic posterior sampling: Structural conditions and sample complexity. In Advances in Neural Information Processing Systems, 2022. Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638 1646. PMLR, 2014. Ioannis Anagnostides, Gabriele Farina, Christian Kroer, Andrea Celli, and Tuomas Sandholm. Faster no-regret learning dynamics for extensive-form correlated and coarse correlated equilibria. ar Xiv preprint ar Xiv:2202.05446, 2022. Yu Bai and Chi Jin. Provable self-play algorithms for competitive reinforcement learning. In International conference on machine learning, pp. 551 560. PMLR, 2020. Yu Bai, Chi Jin, and Tiancheng Yu. Near-optimal reinforcement learning with self-play. Advances in neural information processing systems, 33:2159 2170, 2020. Yu Bai, Chi Jin, Huan Wang, and Caiming Xiong. Sample-efficient learning of stackelberg equilibria in general-sum games. Advances in Neural Information Processing Systems, 34:25799 25811, 2021. Noam Brown and Tuomas Sandholm. Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418 424, 2018. 
Semih Cayci, Niao He, and Rayadurgam Srikant. Linear convergence of entropy-regularized natural policy gradient with linear function approximation. ar Xiv preprint ar Xiv:2106.04096, 2021. Fan Chen, Yu Bai, and Song Mei. Partially observable rl with b-stability: Unified structural condition and sharp sample-efficient algorithms. ar Xiv preprint ar Xiv:2209.14990, 2022a. Fan Chen, Song Mei, and Yu Bai. Unified algorithms for rl with decision-estimation coefficients: No-regret, pac, and reward-free learning. ar Xiv preprint ar Xiv:2209.11745, 2022b. Zixiang Chen, Chris Junchi Li, Angela Yuan, Quanquan Gu, and Michael I Jordan. A general framework for sample-efficient function approximation in reinforcement learning. ar Xiv preprint ar Xiv:2209.15634, 2022c. Zixiang Chen, Dongruo Zhou, and Quanquan Gu. Almost optimal algorithms for two-player zerosum linear mixture markov games. In International Conference on Algorithmic Learning Theory, pp. 227 261. PMLR, 2022d. Qiwen Cui, Kaiqing Zhang, and Simon S Du. Breaking the curse of multiagents in a large state space: Rl in markov games with independent linear function approximation. ar Xiv preprint ar Xiv:2302.03673, 2023. Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. 2008. Christoph Dann, Mehryar Mohri, Tong Zhang, and Julian Zimmert. A provably efficient model-free posterior sampling method for episodic reinforcement learning. Advances in Neural Information Processing Systems, 34:12040 12051, 2021. Published as a conference paper at ICLR 2024 Constantinos Daskalakis, Noah Golowich, and Kaiqing Zhang. The complexity of markov equilibrium in stochastic games. ar Xiv preprint ar Xiv:2204.03991, 2022. Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in rl. In International Conference on Machine Learning, pp. 2826 2836. PMLR, 2021. Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. ar Xiv preprint ar Xiv:2112.13487, 2021. Dylan J Foster, Dean P Foster, Noah Golowich, and Alexander Rakhlin. On the complexity of multi-agent decision making: From learning in games to partial monitoring. ar Xiv preprint ar Xiv:2305.00684, 2023. Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127 1150, 2000. Baihe Huang, Jason D Lee, Zhaoran Wang, and Zhuoran Yang. Towards general function approximation in zero-sum markov games. In International Conference on Learning Representations, 2021. Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low bellman rank are pac-learnable. In International Conference on Machine Learning, pp. 1704 1713. PMLR, 2017. Chi Jin, Sham Kakade, Akshay Krishnamurthy, and Qinghua Liu. Sample-efficient reinforcement learning of undercomplete pomdps. Advances in Neural Information Processing Systems, 33: 18530 18539, 2020a. Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137 2143. PMLR, 2020b. Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms. Advances in Neural Information Processing Systems, 34, 2021a. Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. 
V-learning a simple, efficient, decentralized algorithm for multiagent rl. ar Xiv preprint ar Xiv:2110.14555, 2021b. Chi Jin, Qinghua Liu, and Tiancheng Yu. The power of exploiter: Provable multi-agent rl in large state spaces. In International Conference on Machine Learning, pp. 10251 10279. PMLR, 2022. Avetik Karagulyan. Sampling with the Langevin Monte-Carlo. Ph D thesis, Institut polytechnique de Paris, 2021. Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien P erolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. Advances in neural information processing systems, 30, 2017. Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pp. 157 163. Elsevier, 1994. Qinghua Liu, Tiancheng Yu, Yu Bai, and Chi Jin. A sharp analysis of model-based reinforcement learning with self-play. In International Conference on Machine Learning, pp. 7001 7010. PMLR, 2021. Qinghua Liu, Alan Chung, Csaba Szepesv ari, and Chi Jin. When is partially observable reinforcement learning not scary? ar Xiv preprint ar Xiv:2204.08967, 2022a. Qinghua Liu, Csaba Szepesv ari, and Chi Jin. Sample-efficient reinforcement learning of partially observable markov games. ar Xiv preprint ar Xiv:2206.01315, 2022b. Published as a conference paper at ICLR 2024 Zhihan Liu, Miao Lu, Wei Xiong, Han Zhong, Hao Hu, Shenao Zhang, Sirui Zheng, Zhuoran Yang, and Zhaoran Wang. One objective to rule them all: A maximization objective fusing estimation and planning for exploration. ar Xiv preprint ar Xiv:2305.18258, 2023. Luke Marris, Paul Muller, Marc Lanctot, Karl Tuyls, and Thore Graepel. Multi-agent training beyond zero-sum with correlated equilibrium meta-solvers. In International Conference on Machine Learning, pp. 7480 7491. PMLR, 2021. Chengzhuo Ni, Yuda Song, Xuezhou Zhang, Chi Jin, and Mengdi Wang. Representation learning for general-sum low-rank markov games. ar Xiv preprint ar Xiv:2210.16976, 2022. Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems, 26, 2013. David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484 489, 2016. Ziang Song, Song Mei, and Yu Bai. When can we learn general-sum markov games with a large number of players sample-efficiently? ar Xiv preprint ar Xiv:2110.04184, 2021. Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based rl in contextual decision processes: Pac bounds and exponential improvements over model-free approaches. In Conference on learning theory, pp. 2898 2933. PMLR, 2019. Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. Yi Tian, Yuanhao Wang, Tiancheng Yu, and Suvrit Sra. Online learning in unknown markov games. In International conference on machine learning, pp. 10279 10288. PMLR, 2021. Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Micha el Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. 
Nature, 575(7782):350 354, 2019. Ruosong Wang, Ruslan Salakhutdinov, and Lin F Yang. Provably efficient reinforcement learning with general value function approximation. ar Xiv preprint ar Xiv:2005.10804, 2020a. Ruosong Wang, Russ R Salakhutdinov, and Lin Yang. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Advances in Neural Information Processing Systems, 33:6123 6135, 2020b. Yining Wang, Ruosong Wang, Simon S Du, and Akshay Krishnamurthy. Optimism in reinforcement learning with generalized linear function approximation. ar Xiv preprint ar Xiv:1912.04136, 2019. Yuanhao Wang, Qinghua Liu, Yu Bai, and Chi Jin. Breaking the curse of multiagency: Provably efficient decentralized multi-agent rl with function approximation. ar Xiv preprint ar Xiv:2302.06606, 2023. Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Online reinforcement learning in stochastic games. Advances in Neural Information Processing Systems, 30, 2017. Michael P Wellman. Methods for empirical game-theoretic analysis. In AAAI, volume 980, pp. 1552 1556, 2006. Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneousmove markov games using function approximation and correlated equilibrium. In Conference on learning theory, pp. 3674 3682. PMLR, 2020. Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 34:6683 6694, 2021. Published as a conference paper at ICLR 2024 Wei Xiong, Han Zhong, Chengshuai Shi, Cong Shen, and Tong Zhang. A self-play posterior sampling algorithm for zero-sum markov games. In International Conference on Machine Learning, pp. 24496 24523. PMLR, 2022. Rui Yuan, Simon S Du, Robert M Gower, Alessandro Lazaric, and Lin Xiao. Linear convergence of natural policy gradient methods with log-linear policies. ar Xiv preprint ar Xiv:2210.01400, 2022. Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent bellman error. In International Conference on Machine Learning, pp. 10978 10989. PMLR, 2020. Andrea Zanette, Martin J Wainwright, and Emma Brunskill. Provable benefits of actor-critic methods for offline reinforcement learning. Advances in neural information processing systems, 34: 13626 13640, 2021. Wenhao Zhan, Jason D Lee, and Zhuoran Yang. Decentralized optimistic hyperpolicy mirror descent: Provably no-regret learning in markov games. ar Xiv preprint ar Xiv:2206.01588, 2022a. Wenhao Zhan, Masatoshi Uehara, Wen Sun, and Jason D Lee. Pac reinforcement learning for predictive state representations. ar Xiv preprint ar Xiv:2207.05738, 2022b. Kaiqing Zhang, Sham Kakade, Tamer Basar, and Lin Yang. Model-based multi-agent rl in zero-sum markov games with near-optimal sample complexity. Advances in Neural Information Processing Systems, 33:1166 1178, 2020. Yulai Zhao, Yuandong Tian, Jason D Lee, and Simon S Du. Provably efficient policy optimization for two-player zero-sum markov games. ar Xiv preprint ar Xiv:2102.08903, 2021. Han Zhong, Wei Xiong, Sirui Zheng, Liwei Wang, Zhaoran Wang, Zhuoran Yang, and Tong Zhang. A posterior sampling framework for interactive decision making. ar Xiv preprint ar Xiv:2211.01962, 2022. Published as a conference paper at ICLR 2024 Supplementary Material 1 Introduction 1 2 Models and Preliminaries 3 2.1 Markov Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
3 2.2 Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3 Algorithm and Results 7 3.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.2 Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 4 Conclusion 9 A Related Work 16 B Notation 17 C Additional Definitions 17 C.1 NE/CCE/CE-Regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 C.2 Covering Number and Bracketing Number . . . . . . . . . . . . . . . . . . . . . . 18 D Relationships between MADC and Tractable RL Problems 18 D.1 Model-Free MARL Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 D.2 Model-Based RL Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 E Proof of Main Results 23 E.1 Proof of Model-Free Version of Theorem 3.1 . . . . . . . . . . . . . . . . . . . . 23 E.2 Proof of Model-Based Version of Theorem 3.1 . . . . . . . . . . . . . . . . . . . . 27 E.3 Sample Complexity Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 F Proof of Theorems and Lemmas 30 F.1 Proof of Theorem D.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 F.2 Proof of Theorem D.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 F.3 Proof of Theorem D.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 F.4 Proof of Theorem D.12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 F.5 Proof of Theorem D.14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 F.6 Proof of Lemma E.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 F.7 Proof of Lemma E.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 F.8 Proof of Lemma E.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 F.9 Proof of Corollary E.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Published as a conference paper at ICLR 2024 G Technical Tools 43 H Computational Efficiency 43 Published as a conference paper at ICLR 2024 A RELATED WORK Markov Games Markov Game (MG) (Littman, 1994) is a popular model of multi-agent reinforcement learning, which generalizes the Markov decision process to multiple agents. A series of recent works design the sample-efficient algorithm for two-agent zero-sum games (Wei et al., 2017; Zhang et al., 2020; Xie et al., 2020; Bai et al., 2020; Bai & Jin, 2020; Bai et al., 2021; Zhao et al., 2021; Huang et al., 2021; Jin et al., 2022; Chen et al., 2022b;d). For instance, Bai & Jin (2020) provide a sample-efficient algorithm in an episodic MG based on optimistic value iteration. Xie et al. (2020); Chen et al. (2022d) mainly focus on zero-sum MGs with a linear structure. Huang et al. (2021); Jin et al. (2022); Chen et al. (2022b) further consider the two-player zero-sum MGs under general function approximation, and provide algorithms with a sublinear regret. Another line of research focuses on general-sum MGs with multiple players (Jin et al., 2020a; Liu et al., 2021; Tian et al., 2021; Jin et al., 2021b; Song et al., 2021; Liu et al., 2022b; Daskalakis et al., 2022; Zhan et al., 2022a; Cui et al., 2023; Wang et al., 2023). Some of previous works (Liu et al., 2021; Tian et al., 2021; Liu et al., 2022b) consider learning all three equilibrium notions NE, CCE, and CE and their regret or sample complexity results are exponential in the number of agents. 
To break this exponential curse, some existing works propose decentralized algorithms for learning CCE or CE rather than NE (Jin et al., 2021b; Daskalakis et al., 2022; Zhan et al., 2022a; Cui et al., 2023; Wang et al., 2023). Moreover, there are also some works to study how to learn the equilibrium from a practical perspective. The EGTA (Wellman, 2006) uses a graph to represent the deviation for all profiles and identify the Nash equilibrium and whether a profile is relatively stable. Note if we want to evaluate all profiles and construct the whole graph, we need many samples to estimate the payoff functions for each profile, and then identify all the deviations and construct the graph. The author also mentions that one may apply machine learning techniques to fit a payoff function over the entire profile space given the available data. In fact, we use the function approximation technique to derive an estimate of all the profiles without estimating all the profiles. Thus, our approach can be considered as a learning approach to evaluate all the profiles. Some previous works (Lanctot et al., 2017; Marris et al., 2021) propose a practical algorithm PSRO to compute the equilibrium in Markov Games, which also needs an equilibrium-solving oracle to learn the equilibrium. To be more specific, the PSRO learns the equilibrium in the following way: At first, every player chooses a uniform policy as their strategy. The algorithm then calculates the equilibrium by a meta-solver and trains an oracle that outputs the best response πi of the equilibrium for player i. After that, the algorithm adds πi into the strategy space of the player i. Last, the algorithm simulates all the new joint policy and construct a new normal-form game for the next iteration. However, in each iteration, it should simulate all the new joint policies and estimate the return. Consequently, the sample complexity increases exponentially as the iteration rounds increase. Different from PSRO, MAMEX utilizes the function approximation technique to the value function. The precise characterization of the structure of the value function can help us to evaluate the policy without actually simulating the environment with more samples. To be more specific, at each round, instead of simulating the environment and getting a Monte-Carlo return of each joint policy, MAMEX only needs to solve a regularized optimization problem over the function space of the value function. The solution of the optimization problem is used to be a payoff for the normal-form game. Since solving this optimization subproblem does not need to additional samples, MAMEX bypasses the requirement for exponential samples to simulate the environment and estimate the value for each joint policy π. This characteristic enhances its sample efficiency in comparison to PSRO. MARL with Function Approximation There are many papers working on multi-player generalsum MGs with the function approximation (Zhan et al., 2022a; Ni et al., 2022; Chen et al., 2022b; Wang et al., 2023; Cui et al., 2023; Foster et al., 2023) that build upon previous works for function approximation in the single-agent setting (Jiang et al., 2017; Sun et al., 2019; Jin et al., 2020b; Wang et al., 2020b; Dann et al., 2021; Du et al., 2021; Jin et al., 2021a; Foster et al., 2021; Chen et al., 2022c; Agarwal & Zhang, 2022; Zhong et al., 2022; Liu et al., 2023). In recent years, Xiong Published as a conference paper at ICLR 2024 et al. 
(2022) consider the multi-agent decoupling coefficient in the two-player zero-sum MGs, and provide the posterior sampling algorithm. However, unlike a zero-sum MG, a general-sum MG can have various equilibrium concepts, each of which aligns with a specific set of policies. Hence, their definition of the multi-agent decoupling coefficient cannot be extended to the general-sum setting. Chen et al. (2022b) and Foster et al. (2023) generalize the complexity measure Decision Estimation Coefficient (DEC), and learn the equilibria in model-based general-sum MGs. Ni et al. (2022) provide both a model-based algorithm and a model-free algorithm for the low-rank MGs. Some previous works (Zhan et al., 2022a; Wang et al., 2023; Cui et al., 2023) provide model-free algorithms that learn CCE and CE with polynomial sample complexity. Compared to their works, this paper provides a unified algorithmic framework for both model-free and model-based MARL problems, which learns NE/CCE/CE efficiently under general function approximation and provides comparable regret to existing works. In particular, our work provides the first model-free algorithm for learning NE/CCE/CE of general-sum MGs in the context of the general function approximation. For n sets F1, , Fn, we let n i=1Fi denote F1 Fn. For a set A, we denote (A) as a set of probability distributions over A. For a vector x Rn, we denote x 1 = Pn i=1 |xi|, x 2 = p Pn i=1 x2 i and x = maxn i=1 |xi|. For a function f : X 7 Y, we denote f = supx X |f(x)| as the infinity norm. For two functions f, g : A 7 R, we denote f, g A = Ea A[f(x)g(x)] as the inner product with respect to the set A. For a Hilbert space V and f, g V, we denote f, g V as the inner product defined in the Hilbert space V, and f V is the norm defined in Hilbert space V. For two distributions over P, Q (X), the Hellinger distance is defined as D2 H(P Q) = 1 d P(x)/d Q(x) 1)2]. For a vector x Rd, the softmax mapping is denoted by Softmax(x) Rd with Softmax(x) C ADDITIONAL DEFINITIONS C.1 NE/CCE/CE-REGRET In the following, we provide the definitions of Coarse Correlated Equilibrium (CCE), Correlated Equilibrium (CE), and the corresponding regret CCE-regret and CE-regret. A Coarse Correlated Equilibrium is a joint policy π such that no agent can achieve higher rewards by only changing its local policy. Compared with a NE, a CCE allows different agents to be correlated, while NE only considers product policies. Definition C.1 (ε-Coarse Correlated Equilibrium). A joint policy π is a ε-Coarse Correlated Equilibrium if V (i),µ(i),π(ρ) V (i),π(ρ) + ε for all i [n]. Here, the definition of ε-CCE is similar to that of an ε-NE. But here π is a joint policy, i.e., the randomness of the local policies of the n agents can be coupled together. As a result, CCE is a more general equilibrium notion than NE. Similarly, we can define the CCE-regret, which represents the cumulative suboptimality across all agents with respect to CCE. Definition C.2 (CCE-Regret). For all k [K], let πk denote the joint policy that is deployed in the k-th episode, then the CCE-regret is defined as Reg CCE(K) = V (i),µ(i),πk (ρ) V (i),πk(ρ) . Last, the Correlated Equilibrium has been extensively studied in previous works for MARL (Jin et al., 2020a; Chen et al., 2022b; Cui et al., 2023; Wang et al., 2023). To introduce the concept of CE, we need first to introduce the strategy modification. A strategy modification for the i-th agent is a mapping ϕi : Πpur i Πpur i . 
Given any random policy π, the best strategy modification for i-th agent is defined as arg maxϕi Eυ π[V ϕi(υ(i)) υ( i)(ρ)]. A CE is a joint policy π such that no agent can achieve higher rewards by only changing its local policy through strategic modification. Definition C.3 (ε-Correlated Equilibrium). A joint policy π is a ε-Correlated Equilibrium if maxϕi Eυ π[V ϕi(υ(i)) υ( i)(ρ)] V π(ρ) + ε for any agent i [n]. Published as a conference paper at ICLR 2024 We can similarly define CE-regret as the sum of suboptimality terms with respect to CE. Definition C.4 (CE-Regret). For any k [K], let πk denote the joint policy that is deployed in the k-th episode, the CE-regret is defined as Reg CE(K) = max ϕi Eυ πk V (i),ϕi(υ(i)) υ( i)(ρ) V (i),πk(ρ) . Compared to the NE/CCE regret, the strategy modification of one agent in CE can be correlated to the policies of other agents. Instead, the best response is independent of the other agents. C.2 COVERING NUMBER AND BRACKETING NUMBER When a function class F is infinite, the δ-covering number and the δ-bracketing number serve as surrogates of the cardinality of F. Intuitively, the δ-covering number is the minimum number of balls of radius δ required to cover a set. Definition C.5 (δ-Covering Number). The δ-covering number of a function class F with respect to distance metric d, denoted as NF(δ, d), is the minimum integer q satisfying the following property: there exists a subset F F with |F | = q such that for any f1 F we can find f2 F with d(f1, f2) δ. To simplify the notation, we write NF(δ, ) as NF(δ). Definition C.6 (δ-Bracketing Number). A δ-bracket of size N is a bracket {gi 1, gi 2}N i=1, where gi 1 and gi 2 are functions mapping any policy π and trajectory τ to R, such that for all i [N], π Π we have gi 1(π, ) gi 2(π, ) δ. Also, for any f F, there must exist an i [N] such that gi 1(π, τH) Pπ f (τH) gi 2(π, τH) for all possible τH and π. The δ-bracketing number of F, denoted by BF(δ), is the minimum size of a δ-bracket. D RELATIONSHIPS BETWEEN MADC AND TRACTABLE RL PROBLEMS In this section, we show that the class of MGs with finite MADCs contains a rich class of models. Thus, when applied to these concrete MARL models, Theorem 3.1 shows that MAMEX learns NE/CCE/CE with provable sample efficiency. In the sequel, we instantiate the discrepancy function ℓ(i),s for both model-free and model-based MARL, and introduce some concrete general-sum MG models that satisfy Assumption 2.5. D.1 MODEL-FREE MARL PROBLEMS Now we provide function classes with small MADCs including multi-agent counterparts of models with low Bellman eluder dimensions (Jin et al., 2021a; Huang et al., 2021) and Bilinear Classes (Du et al., 2021). Then, we introduce some concrete examples in these members and show that the regret upper bound of MAMEX in Theorem 3.1, when specialized to these special cases, are comparable to existing works. Multi-Agent Bellman Eluder Dimension Recently, Jin et al. (2021a) introduce a model-free complexity measure called Bellman Eluder dimension (BE dimension)and show that function classes with low BE dimensions contain a wide range of RL problems such as linear MDP (Jin et al., 2020b), kernel MDP (Jin et al., 2021a) and function classes with low eluder dimension (Wang et al., 2020a). In this subsection, we extend the notion of BE dimension to MARL. First, we introduce the definition of ε-independence between distributions and the concept of distribution eluder dimension. Definition D.1 (ε-Independent Distributions). 
Let G be a function class on X, and υ, µ1, , µn are probability distributions over X. We called υ is ε-independent of {µ1 µn} with respect to G if there exists a function g G such that p Pn i=1(Eµi[g])2 ε and |Eυ[g]| > ε. By this definition, if ν is ε-dependent of {µ1, , µn}, then whenever we have p Pn i=1(Eµi[g])2 ε for some g G, we also have |Eυ[g]| ε. Published as a conference paper at ICLR 2024 Definition D.2 (Distribution Eluder Dimension). Let G be a function class on X and D be a family of probability measures over X. The distributional eluder dimension dim DE(G, D, ε) is the length of the longest sequence ρ1, , ρn D such that there exists ε ε where ρi is ε -independent of {ρ1, , ρi 1} for all i [n]. In other words, distributional eluder dimension dim DE(G, D, ε) is the length of the longest sequences of distributions in D such that each element is ε -independent of its predicessors with respect to G, from some ε ϵ. Such a notion generalizes the standard eluder dimension Russo & Van Roy (2013) to the distributional setting. When we set D to be the set of Dirac measures {δx( )}x X , the distributional eluder dimension dim DE(G G, D, ε) reduces to the standard eluder dimension introduced in Russo & Van Roy (2013). Here, G G = {g1 g2 : g1, g2 G}. For any agent i and any pure policy π Πpur, we denote the function class of the Bellman residual as F(i),π h = {fh T (i),πfh+1 | f F(i)}. Now we introduce the definition of the multi-agent BE dimension with respect to a class of distributions. Definition D.3 (Multi-Agent Bellman Eluder Dimension). Let D = {Dh}h [H] be a set of H classes of distributions over S A, one for each step of an episode. The multi-agent Bellman eluder (BE) dimension with respect to D is defined as dim MABE(F, D, ε) = max h [H] max i [n] π Πpur F(i),π h , Dh, ε o . (D.1) In other words, the multi-agent BE dimension is defined as the maximum of the distribution eluder dimensions with respect to Dh, based on the agent-specific Bellman residue classes S π Πpur F(i),π h . Compared with the BE dimension for single-agent RL (Jin et al., 2021a), the multi-agent version takes the maximum over the agent index i [n], and the function class involves the union of the function class F(i),π h for all π Πpur. In comparison, leveraging the facts that the optimal policy is the greedy policy of the optimal value function, and that the optimal value function is the fixed point of the Bellman optimality operator, it suffices to only consider residues of the Bellman optimality operator in the definition of single-agent BE dimension. In contrast, for general-sum MGs, finding the desired equilibrium policies is not a single-objective policy optimization problem, and the notion of the Bellman optimality operator is not well-defined. As a result, to extend the concept of Bellman eluder dimension to general-sum MGs, in the function class, we take into account F(i),π h for all π Πpur, which correspond to evaluating the performance of all the pure policies. Besides, in (D.1), we also take the maximum over all agents i [n] and all steps h [H], which aligns with the definition of single-agent BE dimension. Furthermore, in the definition of multi-agent BE dimension, we need to specify a set of distributions D = {Dh}h [H] over S A. We consider two classes. First, let D = {D ,h}h [H] denote a class of probability measures over S A with D ,h = {δ(s,a)( ) | (s, a) S A}, which contains all the Dirac measures that put mass one to a state-action pair at step h. 
Second, given the set of pure policies Πpur, we let DΠ = {DΠ,h}h [H] denote a class of probability measures induced Πpur as follows. For any π Πpur, when all the agents follow π on the true MG model, they generate a Markov chain {sh, ah}h [H] whose joint distribution is determined by π, denoted by Pπ. Then, for any h [H], we define DΠ,h = {ρ (S A) | ρ( ) = Pπ((sh, ah) = ), π Πpur}, i.e., DΠ,h denotes the collection of all marginal distributions of (sh, ah) induced by pure policies. In the following, to simplify the notation, we denote dim MABE(F, ε) = min dim MABE(F, D , ε), dim MABE(F, DΠ, ε) . (D.2) The following theorem shows that, when F satisfies realizability and completeness (Assumption 2.3), for a general-sum MG with a finite multi-agent BE dimension given by (D.2), its multi-agent decoupling coefficient (Definition 2.4) is also bounded. In other words, Assumption 2.5 holds any general-sum MG model with a low multi-agent BE dimension. As a result, the class of MGs with finite multi-agent BE dimensions is a subclass of MGs with finite multi-agent decoupling coefficients. Theorem D.4 (Low Multi-Agent BE Dimension Low MADC). Let K any integer and let F be a hypothesis class under the model-free setting, i.e., a class of Q-functions. Assume that F satisfy the realizability and completeness condition specified in Assumption 2.3. Suppose that F has a finite multi-agent BE dimension d = dim MABE(F, 1/K), then with the discrepancy function ℓ(i),s given Published as a conference paper at ICLR 2024 in (2.4), the multi-agent decoupling coefficient of F satisfies d MADC = O(d H log K), where O( ) omits absolute constants. Proof. See F.1 for a detailed proof. Combining Theorem 3.1 and Theorem D.4, we obtain that MAMEX achieves a sublinear e O(nd H K +nd H2 +n HR2 K log ΥF,δ) regret for function classes with a finite multi-agent BE dimension d. It remains to see that that function classes with low multi-agent BE dimensions contain a wide range of RL problems. To this end, we prove that if the eluder dimension (Russo & Van Roy, 2013) of the function class F(i) h is small for all h [H] and i [n], then F = n i=1( H h=1F(i) h ) has a low multi-agent BE dimension. Function classes with finite eluder dimension contains linear, generalized linear, and kernel functions (Russo & Van Roy, 2013), and thus contains a wide rage of MG models. On these MG problems, the model-free version of MAMEX achieve sample efficiency provably. Theorem D.5. Suppose F satisfies Assumption 2.3. For any i [n] and h [H], let dim E(F(i) h , ε) denote the eluder dimension of F(i) h , which is a special case of the distributional eluder dimension introduced in Definition D.2. That is, dim E(F(i) h , ε) is equal to dim DE(F(i) h F(i) h , D , ε), where F(i) h F(i) h = {g: g = f1 f2, f1, f2 F(i) h } and D contains the class of dirac measures on S A. Then, the multi-agent BE dimension defined in (D.2) satisfy dim MABE(F, ε) max h [H] max i [n] dim E(F(i) h , ε). Proof. See F.2 for a detailed proof. Multi-Agent Bilinear Classes Bilinear Classes (Du et al., 2021) consists of MDP models where the Bellman error admits a bilienar structure. On these models, Du et al. (2021) propose online RL algorithms that are provably sample-efficient. Thus, Bilinear Classes is a family of tractable MDP models with general function approximation. In the sequel, we extend Bilinear Classes to generalsum MGs and show that such an extension covers some notable special cases studied in the existing works. 
Then, we prove that multi-agent Bilinear Classes have a small MADC, thus satisfying the Assumption 2.5. Therefore, when applied to these problems, MAMEX provably achieves sample efficiency. Definition D.6 (Multi-Agent Bilinear Classes). Let V be a Hilbert space and let , V and V denote the inner product and norm on V. Given a multi-agent general-sum MG with a hypothesis class F satisfying Assumption 2.3, it belongs to multi-agent Bilinear Classes if there exist H functions {W (i) h : F(i) Πpur 7 V}H h=1 for each agent i [n] and {Xh : Πpur 7 V}H h=1 such that the Bellman error of each agent i can be factorized using W (i) h and Xh. That is, for each i [n], f F(i), h [H], π, π Πpur, we have E(sh,ah) π fh(sh, ah) r(i) h (sh, ah) Es Ph(s |sh,ah) fh+1(s , ), πh+1( | s ) A = W (i) h (f, π) W (i) h (f (i),µ(i),π, µ(i),π), Xh(π ) where µ(i),π = (π(i), , π( i)) is the best response for i-th agent given that the other agents all follow π. Here, the function f (i),µ(i),π is the fixed point of T (i),µ(i),π, i.e., f (i),µ(i),π h = T (i),µ(i),πf µ(i),π h+1 . (D.4) Moreover, we require that {W (i) h , Xh}h [H] satisfy a regularity condition sup π Πpur,h [H] Xh(π) V 1, sup i [n],f F(i),π Πpur,h [H] W (i) h (f, π) V BW , (D.5) where BW is a constant. Published as a conference paper at ICLR 2024 In this definition, for any π Πpur and f F(i), fh(sh, ah) r(i) h (sh, ah) Es Ph(s |sh,ah) fh+1(s , ), πh+1( | s ) A is the Bellman error of f at (sh, ah) for evaluating policy π on behalf of agent i. On the left-hand side of (D.3), we evaluate such a Bellman error with respect to the distribution induced by another policy π . Equation (D.3) shows that this error can be factorized into the inner product between W (i) h and X(i) h , where both W (i) h only involves (f, π) while X(i) h only involves π . Thus, multi-agent Bilinear Classes specifies a family of Markov games whose Bellman error satisfies a factorization property. Furthermore, recall that the best response π(i), = maxν (Πpur i ) V ν,π( i) is attained at some pure policy, thus we have µ(i),π Πpur. Under Assumption 2.3, the fixed point f (i),µ(i),π in (D.4) is guaranteed to exist and belongs to F. We define Xh = {Xh(π) : f F, π Πpur} and X = SH h=1 Xh. The complexity of the multiagent bilinear class essentially is determined by the complexity of the Hilbert space V. To allow V be infinite-dimensional, we introduce the notion of information gain, which characterizes the intrinsic complexity of V in terms of exploration. Definition D.7 (Information Gain). Suppose V is a Hilbert space and X V. For ε > 0 and integer K > 0, the information gain γK(ε, X) is defined by γK(ε, X) = max x1, ,x K X log det The following theorem shows that multi-agent Bilinear Classes with small information gain have low MADCs. Theorem D.8 (Multi-Agent Bilinear Classes Low MADC). For a general-sum MG in the multiagent bilinear class with a hypothesis class F, let γK(ε, X) = PH h=1 γK(ε, Xh) be the information gain. Then, Assumption 2.5 holds with the discrepancy function ℓ(i),s given in (2.4). In particular, we have d MADC max 1, 8R2 γK(1/(KB2 W ), X) , where BW is given in (D.5) and R (0, H] is an upper bound on PH h rh. Proof. See F.3 for a detailed proof. Now we introduce some concrete members of multi-agent Bilinear Classes, which are general-sum MGs with linear function approximation. In single-agent RL, linear Bellman complete MDPs (Wang et al., 2019) assume that the MDP model satisfies the Bellman completeness condition with respect to linear Q-functions. 
We can extend such a model to general-sum MGs. Example D.9 (Linear Bellman Complete MGs). We say a Markov Game is a linear Bellman complete MG of dimension d, if for any step h [H] there exists a known feature ϕh : S A 7 Rd with ϕh(s, a) 1 for all (s, a) S A such that Assumption 2.3 holds for linear functions of ϕh. In other words, the Markov game satisfies Assumption 2.3 with F(i) h {ϕ h θ | θ Rd, θ 2 dθ}for all i [n] and h [H], where dθ > 0 is a parameter. It is easy to see that Linear Bellman complete MGs belong to multi-agent Bilinear Classes by choosing Xh(π) = Eπ[ϕ(sh, ah)] Rd, W (i) h (f, π) = θf,h w(i) f,h, where θf,h satisfies that f(sh, ah) = θ f,hϕh(sh, ah), and w(i) f,h satisfies that3 (w(i) f,h) ϕh(sh, ah) = r(i) h (sh, ah) + Es Ph( |sh,ah) fh+1(s , ), πh+1( | s ) A = T (i),π(fh+1) F(i) h . 3If there are multiple θ satisfying the requirement, we can break the tie arbitrarily. Published as a conference paper at ICLR 2024 Then, we have Xh V = {ϕ Rd : ϕ 2 1} for all h [H] and BW = 2 d. It can be shown that the logarithm of 1/K-covering number of F is log(NF(1/K)) = e O(d), and the information gain can bounded by γK(1/B2 W K, X) = h=1 γK(1/B2 W K, Xh) h=1 γK(1/4d K, Xh) = e O(Hd), where e O omits absolute constants and logarithmic factors (Du et al., 2021; Wang et al., 2020b). Thus, by Theorem 3.1, MAMEX achieves a e O(nd HR2 K log |Πpur| + nd H2) regret. For the single-agent setting, comparing to the state-of-the-art e O(d H K) regret when R = 1 (Zanette et al., 2020; Chen et al., 2022c), our result matches their results in terms of d, H and K with an extra factor |Πpur| in the logarithmic term. Note that when the pure policy set of i-th agent is selected as some particular policy classes such as log-linear policy Πpur h,i = {πϑ : πϑ( | s) = Softmax(ϑ ψ(s, )), ϑ 2 1, ψ( , ) 1, ϑ Rdπ}, we can select a cover by bΘ = {bϑ : bϑi = ϑi/ε ε, ϑ 2 1, ϑ Rdπ}. Zanette et al. (2021) prove that the logarithm of cardinality of the induced covering {πϑ : ϑ bΘ}H is bounded by e O(n Hdπ), and then MAMEX provides a e O((nd+n2dπ)H2R2 K +nd H2) regret. In particular, as one of the examples of Linear Bellman Complete MGs, Xie et al. (2020) consider a similar linear structure for two-player zero-sum games. Example D.10 (Zero-Sum Linear MGs (Xie et al., 2020)). In a zero-sum linear MG, for each (s, a, b) S A B and h [H], we have reward rh(s, a, b) [0, 1], and there is a known feature map ϕ : S A B Rd, H known vectors θh Rd and a vector of d unknown measures µh = {µh,d }d [d] on S such that ϕ( , , ) 2 1, θh 2 rh(s, a, b) = ϕ(s, a, b) θh, Ph( | s, a, b) = ϕ(s, a, b) µh( ). Zero-sum linear MGs is a special case of linear Bellman complete MG with two players and dθ = 2H d, and our algorithm provides a e O(d H3 K log(|Πpur|)) regret by choosing R = H and the fact that log N (i) F (1/K) = e O(d). The previous work provides a e O(d3/2H2 K) sublinear regret (Xie et al., 2020) and a Ω(d H3/2 K) information-theoretic lower bound (Chen et al., 2022d) for zero-sum linear MGs. Thus, our regret matches the lower bound in terms of d, has a higher order in H compared to Xie et al. (2020) and an extra factor log |Πpur|. Again, we can adopt the class of log-linear policies with a policy cover, which leads to log |Πpur| = e O(dπ). Thus, MAMEX yields a e O((d H3 + dπH4) D.2 MODEL-BASED RL PROBLEMS Sun et al. (2019) provide a complexity measure witness rank to characterize the exploration hardness of the model-based RL problems. In the following, we extend the notion of the witness rank to MARL. 
Example D.11 (Multi-Agent Witness Rank). Let V = {Vh : S A S 7 [0, 1]}h [H] denote a class of discriminators and let F be a hypothesis class such that the true model, denoted by f , belongs to F. We say a multi-agent witness rank of a general-sum MG is at most d, if for any model f F and any policy π Πpur there exist mappings {Xh : Πpur Rd}H h=1 and {Wh : F Rd}H h=1 max v Vh E(sh,ah) π[(Es Ph,f ( |sh,ah) Es Ph,f ( |sh,ah))v(sh, ah, s )] W (i) h (f), Xh(π) , κwit E(sh,ah) π[(Es Ph,f ( |sh,ah) Es Ph,f ( |sh,ah))V (i),π h+1,f(s )] W (i) h (f), Xh(π) (D.7) for all h [H], where κwit is a parameter. Here, V (i),π h+1,f is the value function of π associated with agent i under model f. Moreover, these mappings satisfy the following regularity condition: sup h [H],π Πpur Xh(π) 1, sup h [H],f F,i [n] W (i) h (f) BW . Published as a conference paper at ICLR 2024 Compared with the single-agent witness rank (Sun et al., 2019), the policy π in the mapping Xh(π) and the expectation E(sh,ah) π in (D.6) and (D.7) can be an arbitrary pure policy instead of the optimal policy πf of the model f. This stricter assumption is essential for general-sum MGs because we are interested in various equilibrium notions and each equilibrium can be non-unique. The following theorem shows that model classes with small multi-agent witness ranks have small MADCs. Theorem D.12 (Multi-Agent Witness Rank Low MADC). Let F be a class of general-sum MGs whose multi-agent witness rank is no more than d. Then, for any f F, we have d MADC = e O(Hd/κ2 wit), where d MADC is the multi-agent decoupling coefficient of f . Proof. See F.4 for detailed proof. This theorem shows that the multi-agent decoupling coefficient is upper bounded by the multi-agent witness rank, which shows that the class of MG models with a finite multi-agent decoupling coefficient contains models with a finite multi-agent witness rank. Hence, many concrete MG models such as the multi-agent version of factor MDP and linear kernel MDP all have finite multi-agent decoupling coefficients. Therefore, applying Theorem 3.1 to models with a finite Multi-Agent witness rank, the model-based version of MAMEX achieves a e O(n Hd K/κ2 wit + n H K) regret with witness rank d. Note that for the model-based RL problems, our regret does not have the term log(|Πpur|), because the discrepancy function ℓ(i),s in 2.5 is independent with πk. When applying our results to the single-agent setting, Theorem D.12 provides a similar regret result as in previous works (Sun et al., 2019; Zhong et al., 2022). Another example of model-based RL problems is the linear mixture MG (Chen et al., 2022d), which assumes that the transition kernel P(s | s, a) is a linear combination of d feature mappings {ϕi(s , s, a)}i [d], i.e. P(s | s, a) = Pd i=1 θiϕi(s , s, a), where a is a joint action. Example D.13 (Multi-Agent Linear Mixture MGs). We call one general-sum MG is a linear mixture MG with dimension d, if there exist h vectors {θh Rd}h [H] and a known feature ϕ(s | s, a) Rd such that θh 2 d and Ph(s | s, a) = θh, ϕ(s | s, a) for any state-action pair (s , s, a) S S A. The following theorem shows that a linear mixture general-sum MG has a finite multi-agent decoupling coefficient. Thus, MAMEX can be readily applied to these models with sample efficiency. Theorem D.14 (Multi-Agent Linear Mixture MGs Low MADC). For a linear mixture MG with dimension d, we have d MADC = e O(d HR4), where R is an upper bound on PH h=1 rh. Proof. See F.5 for a detailed proof. Chen et al. 
(2022d) provides a minimax-optimal e O(d H K) regret for two-player zero-sum MGs for rh [0, 1]. Now choose Fh = {θh Rd}. Combining with Theorem D.14 and Theorem 3.1, and the fact that log(BF(1/K)) = e O(Hd) (Liu et al., 2022a), MAMEX achieves a e O(nd H5 K + nd H4) regret, where we set R = H. Compared with their regret upper bound, when applying our result to two-player zero-sum MGs by choosing n = 2, the leading term of our regret e O(d H5 K) matches the minimax-optimal result in terms of d and K but with an extra multiplicative factor H2. E PROOF OF MAIN RESULTS E.1 PROOF OF MODEL-FREE VERSION OF THEOREM 3.1 Proof. We first consider learning Nash equilibrium and coarse correlated equilibrium. NE/CCE First, by Assumption 2.3, for any pure joint policy υ, there exists a function f (i),υ F(i) satisfies that it has no Bellman error with Bellman operator T (i),υ for any pure joint policy υ, i.e. T (i),υ h f (i),υ h+1 = f (i),υ h . (E.1) Published as a conference paper at ICLR 2024 Hence, {f (i),υ h }h [H] is the Q-function of the agent i when all agents follow the policy υ. Thus, we have V (i),υ f (i),υ(ρ) = Es1 ρ,a υ(s1)[f (i),υ 1 (s, a)] = Es1 ρ,a υ(s1)[Q(i),υ 1 (s, a)] = V (i),υ(ρ). (E.2) Also, denote bf (i),υ = arg supf F(i) b V υ i (f) as the optimal function with respect to the regularized value b V (i),π(f) for the pure joint policy π and agent i. Now we have Eυ πk h V (i),υ b f (i),υ(ρ) ηL(i),k 1( bf (i),υ, υ, τ 1:k 1) i = Eυ πk h sup f F(i) b V (i),υ(f) i max υ(i) Πpur Eυ υ(i) π( i),k h sup f F(i) b V (i),υ(f) i . (E.3) The inequality holds because of the property of Nash Equilibrium or Coarse Correlated Equilibrium. Then, since the best response π(i),k, is a pure policy, we have max υ(i) Πpur Eυ υ(i) π( i),k h sup f F(i) b V (i),υ(f) i Eυ π(i),k, π( i),k h sup f F(i) b V (i),υ(f) i = Eυ π(i),k, π( i),k h b V (i),υ(f (i),υ) i Eυ µ(i),πk h V (i),υ f (i),υ(ρ) ηL(i),k 1(f (i),υ, υ, τ 1:k 1) i , (E.4) where υ Πpur i , µ(i),πk = (π(i),k, , π( i),k) and π(i),k, is the best response given the action of other agents π( i),k. Thus, combining (E.3) and (E.4), we can derive Eυ µ(i),πk h V (i),υ f (i),υ(ρ) i Eυ πk h V (i),υ b f (i),υ(ρ) i ηEυ µ(i),πk h L(i),k 1(f (i),υ, υ, τ 1:k 1) i ηEυ πk h L(i),k 1( bf (i),υ, υ, τ 1:k 1) i . (E.5) Now we provide the concentration lemma, which shows that the empirical discrepancy function L(i),k(f, π, τ 1:k) is an estimate of the true discrepancy function Pk 1 s=0 ℓ(i),s(f, π). Lemma E.1 (Concentration Lemma). For any k [K] pure joint policy π, and {ζs}k 1 s=1 Π that be executed in Algorithm 1 in the first k 1 episodes, with probability at least 1 δ, L(i),k 1(f, π, τ 1:k 1) 1 s=0 ℓ(i),s(f, π) where εconc = max{O(HR2 log(HK maxi [n] NF(i)(1/K)|Πpur|/δ)), H} and ℓ(i),s(f, π) = h=1 E(sh,ah) ζs h h ((fh T (i),π h fh+1)(sh, ah))2i . Proof. See F.6 for a detailed proof. In other words, if we define the event as L(i),k(f, π, τ 1:k) 1 s=0 ℓ(i),s(f, π) εconc, f F(i), π Πpur, k [K] we have Pr{E1} 1 δ. Note that the εconc contains log(|Πpur|/δ) in the logarithmic term, which arises from our policy-search style algorithm. Lemma E.2 (Optimal Concentration Lemma). For all index i [n], all π Πpur and function f (i),π F(i) such that T (i),πf (i),π = f (i),π, with probability at least 1 δ, we have L(i),k(f (i),π, π, τ 1:k) εconc. Published as a conference paper at ICLR 2024 Proof. See F.7 for a detailed proof. In other words, if we define the event as E2 = { i [n], π Πpur, L(i),k(f (i),π, π, τ 1:k) εconc}, we have Pr{E2} 1 δ. 
Lemma E.2 shows that the empirical discrepancy function L(i),k(f, π, τ 1:k) is small if the function f and the policy π are consistent, i.e. f = f (i),π. Now by (E.5) and Lemma E.2, for any i [n], under the event E2, Eυ µ(i),πk h V (i),υ(ρ) i Eυ πk h V (i),υ(ρ) i = Eυ µ(i),πk h V (i),υ f (i),υ(ρ) i Eυ πk h V (i),υ(ρ) i = Eυ µ(i),πk h V (i),υ f (i),υ(ρ) i Eυ πk h V (i),υ b f (i),υ(ρ) i +Eυ πk h V (i),υ b f (i),υ(ρ) i Eυ πk h V (i),υ(ρ) i . By (E.5) and Lemma E.2, under event E2, (a) can be bounded by (a) ηEυ µ(i),πk h L(i),k 1(f (i),υ, υ, τ 1:k 1) i ηEυ πk h L(i),k 1( bf (i),υ, υ, τ 1:k 1) i (E.6) ηεconc ηEυ πk h L(i),k 1( bf (i),υ, υ, τ 1:k 1) i . (E.7) Now by Assumption 2.5, on the events E1 and E2 we have V (i),µ(i),πk (ρ) V (i),πk(ρ) Eυ µ(i),πk h V (i),υ(ρ) i Eυ πk h V (i),υ(ρ) i ηεconc ηEυ πk h L(i),k 1( bf (i),υ, υ, τ 1:k 1) i + Eυ πk h V (i),υ b f (i),υ(ρ) i Eυ πk h V (i),υ(ρ) i . (E.8) Now since bf (i),υ = arg maxf F(i) h V (i),υ f (ρ) ηL(i),k 1(f, υ, τ 1:k 1) i is the optimal function with respect to the regularized value, under the event E2 we have V (i),υ b f (i),υ(ρ) ηL(i),k 1( bf (i),υ, υ, τ 1:k 1) V (i),υ f (i),υ(ρ) ηL(i),k 1(f (i),υ, υ, τ 1:k 1), then we have ηL(i),k 1( bf (i),υ, υ, τ 1:k 1) 0 and by η 1, ηL(i),k 1( bf (i),υ, υ, τ 1:k 1) V (i),υ b f (i),υ(ρ) V (i),υ f (i),υ(ρ) + ηL(i),k 1(f (i),υ, υ, τ 1:k 1) R + ηεconc 2εconc, where the last inequality follows the Lemma E.2. If we define L(i),k 1 2εconc ( bf (i),υ, υ, τ 1:k 1) = L(i),k 1( bf (i),υ, υ, τ 1:k 1) I{ηL(i),k 1( bf (i),υ, υ, τ 1:k 1) 2εconc} and the event as E3 = n i [n], υ Πpur, L(i),k 1 2εconc ( bf (i),υ, υ, τ 1:k 1) = L(i),k 1( bf (i),υ, υ, τ 1:k 1) o , we will have E3 E2. Since the policy ζk that algorithm executes is sampled from πk, then the sequence {Yk}K k=1 that is defined by Yk = Eυ πk h V (i),υ b f (i),υ(ρ) V (i),υ(ρ) ηL(i),k 1 2εconc ( bf (i),υ, υ, τ 1:k 1) i b f (i),ζk (ρ) V (i),υ(ρ) ηL(i),k 1 2εconc ( bf (i),ζk, ζk, τ 1:k 1) Published as a conference paper at ICLR 2024 is a martingale difference sequence. Now by Azuma-Hoeffding s inequality and Yk R+2εconc 3εconc, with probability at least 1 δ we have h Eυ πk h V (i),υ b f (i),υ(ρ) V (i),υ(ρ) ηL(i),k 1 2εconc ( bf (i),υ, υ, τ 1:k 1) i b f (i),ζk (ρ) V (i),υ(ρ) ηL(i),k 1 2εconc ( bf (i),ζk, ζk, τ 1:k 1) i O(εconc Define the event E4 as the (E.9) holds. Now by choosing η K and taking the union bound over the event E1, E2, E3 and E4, with probability at least 1 4δ, we can get ηεconc ηEυ πk h L(i),k 1( bf (i),υ, υ, τ 1:k 1) i + Eυ πk h V (i),υ b f(i),υ(ρ) i Eυ πk h V (i),υ(ρ) i ηεconc ηEυ πk h L(i),k 1 2εconc ( bf (i),υ, υ, τ 1:k 1) i + Eυ πk h V (i),υ b f(i),υ(ρ) i Eυ πk h V (i),υ(ρ) i ηεconc ηL(i),k 1( bf (i),ζk, ζk, τ 1:k 1) + V (i),ζk b f(i),ζk (ρ) V (i),ζk(ρ) + e O(nεconc The first inequality holds because of Eq (E.8). The equality in the second line holds under Lemma E.2 (event E3 E2). The second inequality is derived from Azuma-Hoeffding s inequality (event E4). Now using Lemma E.1 and MADC assumption, we can get s=0 ℓ(i),s(f, ζk) b f (i),ζk (ρ) V (i),ζk(ρ) + 4n nµ d MADC + 6d MADCH + 4n The second inequality uses Assumption 2.5. Now the regret can be bounded by K d MADC + 6d MADCH + 4n Kεconc + O(nεconc K + nd MADCH + nd MADC Hence, we complete the proof by noting that εconc = e O(HR2 log ΥF,δ). CE By changing the best response to the strategy modification, we can derive a proof for Correlated Equilibrium (CE). We simplify the notation of strategy modification as ϕi(υ(i)) υ( i) as ϕi(υ). 
Now we have Eυ πk h V (i),υ b f (i),υ(ρ) ηL(i),k 1( bf (i),υ, υ, τ 1:k 1) i = Eυ πk h sup f F(i) b V (i),υ(f) i = max ϕi Eυ πk h sup f F(i) b V (i),ϕi(υ(i)) υ( i)(f) i . (E.11) The second equality holds because of the property of Correlated Equilibrium. Now we have max ϕi Eυ πk h sup f F(i) b V (i),ϕi(υ(i)) υ( i)(f) i max ϕi Eυ πk h V (i),ϕi(υ) f (i),ϕi(υ)(ρ) ηL(i),k 1(f (i),ϕi(υ), ϕi(υ), τ 1:k 1) i max ϕi Eυ πk h V (i),ϕi(υ) f (i),ϕi(υ)(ρ) ηεconc) i . (E.12) Published as a conference paper at ICLR 2024 The first equality holds by f (i),ϕi(υ) F(i) in (E.1), and the last inequality is derived from Lemma E.2 and ϕi(υ) is a pure joint policy. Then, by combining (E.11) and (E.12), we can get max ϕi Eυ πk h V (i),ϕi(υ) f (i),ϕi(υ)(ρ) i Eυ πk h V (i),υ b f (i),υ(ρ) i ηεconc ηEυ πk h L(i),k 1( bf (i),υ, υ, τ 1:k 1) i . Hence, we can upper bound the regret of the agent i at k-th episode as max ϕi Eυ πk h V (i),ϕi(υ(i)) υ( i)(ρ) i Eυ πk h V (i),υ(ρ) i = max ϕi Eυ πk h V (i),ϕi(υ) f (i),ϕi(υ)(ρ) i Eυ πk h V (i),υ(ρ) i = max ϕi Eυ πk h V (i),ϕi(υ) f (i),ϕi(υ)(ρ) i Eυ πk h V (i),υ b f (i),υ(ρ) i + Eυ πk h V (i),υ b f (i),υ(ρ) i Eυ πk h V (i),υ(ρ) i ηεconc ηEυ πk L(i),k 1( bf (i),υ, υ, τ 1:k 1) + Eυ πk h V (i),υ b f (i),υ(ρ) i Eυ πk h V (i),υ(ρ) i . The rest of the proof is the same as in NE/CCE after (E.7). E.2 PROOF OF MODEL-BASED VERSION OF THEOREM 3.1 Proof. We first consider NE/CCE. NE/CCE Denote bf (i),π = arg supf F b V π i (f) as the optimal model with respect to the regularized value b V (i),π(f). Since for model-based RL problems, the empirical discrepancy function L(f, π, τ) and ℓ(i),s(f, π) is independent with policy π, we simplify it as L(f, τ) and ℓ(i),s(f). Then, from the definition of regularized value function b V (i),π(f), we have Eυ πk h V (i),υ b f (i),υ(ρ) ηL(i),k 1( bf (i),υ, τ 1:k 1) i = Eυ πk h sup f F b V (i),υ(f) i max υ(i) Πpur Eυ υ(i) π( i),k h sup f F b V (i),υ(f) i . (E.13) The inequality holds by the fact that πk is the NE/CCE of the regularized value function b V (i),π(f). Now since the best response π(i),k, is a pure policy, we have max υ(i) Πpur Eυ υ(i) π( i),k h sup f F b V (i),υ(f) i Eυ π(i),k, π( i),k h sup f F b V (i),υ(f) i Eυ µ(i),πk h V (i),υ f (ρ) ηL(i),k 1(f , τ 1:k 1) i . (E.14) Thus, by combining E.13 and E.14, we have Eυ µ(i),πk h V (i),υ f (ρ) i Eυ πk h V (i),υ b f (i),υ(ρ) i ηL(i),k 1(f , τ 1:k 1) ηEυ πk h L(i),k 1( bf (i),υ, τ 1:k 1) i . (E.15) Now we provide our concentration lemma for model-based RL problems. Lemma E.3 (Concentration Lemma for Model-Based RL Problems). With probability at least 1 δ, for any k [K], f F, for the executed policy {ζs}k 1 s=1 in Algorithm 1, we have L(i),k 1(f , τ 1:k 1) L(i),k 1(f, τ 1:k 1) s=1 ℓ(i),s(f) + κconc, (E.16) where κconc = max{2H log HBF(1/K) δ , H}, where BF(1/K) is the 1/K-bracketing number of the model class F. We also define the event E5 as the situation when (E.16) holds. Published as a conference paper at ICLR 2024 Proof. See F.8 for detailed proof. By Lemma E.3, for any i [n], Eυ µ(i),πk h V (i),υ(ρ) i Eυ πk h V (i),υ(ρ) i = Eυ µ(i),πk h V (i),υ f (ρ) i Eυ πk h V (i),υ b f(i),υ(ρ) i +Eυ πk h V (i),υ b f(i),υ(ρ) i Eυ πk h V (i),υ(ρ) i . (E.17) Now substitute into equation (E.15), (a) ηL(i),k 1(f , τ 1:k 1) ηEυ πk h L(i),k 1( bf (i),υ, τ 1:k 1) i = Eυ πk h ηL(i),k 1(f , τ 1:k 1) ηL(i),k 1( bf (i),υ, τ 1:k 1) i . 
(E.18) Hence, combining with (E.17) and (E.18), we can get Eυ µ(i),πk h V (i),υ(ρ) i Eυ πk h V (i),υ(ρ) i (a) + Eυ πk h V (i),υ b f (i),υ(ρ) i Eυ πk h V (i),υ(ρ) i Eυ πk h ηL(i),k 1(f , τ 1:k 1) ηL(i),k 1( bf (i),υ, τ 1:k 1) + V (i),υ b f (i),υ(ρ) V (i),υ(ρ) i . By summing over k [K] and I [n], the regret can be obtained by k=1 Eυ πk h ηL(i),k 1(f , τ 1:k 1) ηL(i),k 1( bf (i),υ, τ 1:k 1) + V (i),υ b f(i),υ(ρ) V (i),υ(ρ) i . Now we want to use Azuma-Hoeffding s inequality to transform υ πk to executed policy ζk. To achieve this goal, note that by Lemma E.3, under event E5, we have L(i),k 1(f , τ 1:k 1) L(i),k 1( bf (i),υ, τ 1:k 1) κconc. (E.21) Moreover, since bf (i),υ achieves the maximum value of the regularized value function b V (i),π(f) = V (i),υ f (ρ) L(i),k 1(f , τ 1:k 1), we have L(i),k 1(f , τ 1:k 1) L(i),k 1( bf (i),υ, τ 1:k 1) Eυ µ(i),πk h V (i),υ f (ρ) i Eυ πk h V (i),υ f (i),υ(ρ) i Thus, if we define L(i),υ ε = L(i),k 1(f , τ 1:k 1) L(i),k 1( bf (i),υ, τ 1:k 1) I n |L(i),k 1(f , τ 1:k 1) L(i),k 1( bf (i),υ, τ 1:k 1)| ε o , we can have |L(i),υ κconc| κconc is bounded under event E5. Then, with probability at least 1 δ, L(i),υ κconc = L(i),k 1(f , τ 1:k 1) L(i),k 1( bf (i),υ, τ 1:k 1). Then, we can apply Azuma-Hoeffding s inequality to transform the expectation to the executed policy ζk. k=1 L(i),ζk κconc k=1 Eυ πk h L(i),υ κconc i = O(κconc log K). (E.22) Published as a conference paper at ICLR 2024 Now by taking the union bound of Azuma-Hoeffding s inequality and event E5, with probability at least 1 2δ, k=1 Eυ πk h ηL(i),k 1(f , τ 1:k 1) ηL(i),k 1( bf (i),υ, τ 1:k 1) + V (i),υ b f(i),υ(ρ) V (i),υ(ρ) i k=1 Eυ πk h ηLυ κconc + V (i),υ b f(i),υ(ρ) V (i),υ(ρ) i ηLζk κconc + V (i),ζk b f(i),ζk (ρ) V (i),ζk(ρ) + e O(nκconc), where the first inequality holds by (E.20), the equality holds under event E5, and the last inequality holds by (E.22). Then, by Lemma E.3, under event E5, we have ηL(i),k 1(f , τ 1:k 1) ηL(i),k 1( bf (i),ζk, τ 1:k 1) + V (i),υ b f (i),ζk (ρ) V (i),ζk(ρ) s=1 ℓ(i),s( bf (i),ζk) + ηκconc + V (i),υ b f (i),ζk (ρ) V (i),ζk(ρ) Then, by Assumption 2.5, (b) can be further upper bounded by s=1 ℓ(i),s( bf (i),ζk) + ηκconc + V (i),υ b f (i),ζk (ρ) V (i),ζk(ρ) nηKκconc + n η d MADC + 6nd MADCH = e O(nκconc K + nd MADC K + nd MADCH). The first inequality holds by Lemma E.3. The last equality holds by η = 4/ K. Finally, the regret can be bounded by Reg(K) (b) + e O(nκconc) = e O(nκconc K + nd MADC K + nd MADCH). Thus, we complete the proof by noting that κconc = O(H) Correlated Equilibrium Similar to model-free problems, we only need to replace the best response with strategy modification. Eυ πk h V (i),υ b f (i),υ(ρ) ηL(i),k 1( bf (i),υ, τ 1:k 1) i = Eυ πk h sup f F b V (i),υ(f) i = max ϕi Eυ πk h sup f F b V (i),ϕi(υ)(f) i . The last equality uses the property that πk is a CE with respect to the payoff function supf F b V (i),υ(f). Then, since f F, we can further derive Eυ πk h V (i),υ b f (i),υ(ρ) ηL(i),k 1( bf (i),υ, τ 1:k 1) i max ϕi Eυ πk h b V (i),ϕi(υ)(f ) i = max ϕi Eυ πk h V (i),ϕi(υ) f (ρ) ηL(i),k 1(f , τ 1:k 1) i . (E.23) The second equality holds by the property of CE. Thus, we have max ϕi Eυ πk[V (i),ϕi(υ) f (ρ)] Eυ πk[V (i),υ b f (i),υ(ρ)] ηL(i),k 1(f , τ 1:k 1) ηEυ πk L(i),k 1( bf (i),υ, τ 1:k 1). 
(E.24) Published as a conference paper at ICLR 2024 Hence, combining with (E.23) and (E.24), we can upper bound the regret of the agent i at k-th episode as max ϕi Eυ πk h V (i),ϕi(υ(i)) υ( i)(ρ) i Eυ πk h V (i),υ(ρ) i = max ϕi Eυ πk h V (i),ϕi(υ) f (i),ϕi(υ)(ρ) i Eυ πk h V (i),υ(ρ) i = max ϕi Eυ πk h V (i),ϕi(υ) f (i),ϕi(υ)(ρ) i Eυ πk h V (i),υ b f (i),υ(ρ) i + Eυ πk h V (i),υ b f (i),υ(ρ) i Eυ πk h V (i),υ(ρ) i ηεconc ηEυ πk L(i),k 1( bf (i),υ, υ, τ 1:k 1) + Eυ πk h V (i),υ b f (i),υ(ρ) i Eυ πk h V (i),υ(ρ) i . The rest of the proof is the same as NE/CCE after (E.15). E.3 SAMPLE COMPLEXITY RESULTS we can utilize the standard online-to-batch techniques to transform the regret result in Theorem 3.1 into the sample complexity result. Corollary E.4. Under the same setting as in Theorem 3.1, with probability at least 1 δ, when K e O n2H2 + n2d2 MADCΥ2 F,δ ε 2 , if we output the mixture policy πout = Unif({πk}k [K]), the output policy πout is a ε-{NE, CCE, CE}. Proof. See F.9 for the proof. Corollary E.4 shows that MAMEX is sample-efficient for learning all three equilibria of general-sum MGs under general function approximation. F PROOF OF THEOREMS AND LEMMAS F.1 PROOF OF THEOREM D.4 Proof. The proof follows Proposition 3 in Dann et al. (2021). First, we provide the following lemma in Dann et al. (2021). Lemma F.1. For any positive real number sequence x1, , xn, we have Pn i=1 xi p Pn i=1 ix2 i p Now denote Eε = dim MABE(F, ε). We fix i [n], and ignore both h and i for simplicity. Also denote bes,k h = Eπs[ϕt] and es,k h = bes,k h I{bes,k h > ε}, where ϕt = (I T (i),πk h )fh F(i) h . We initialize K buckets B0 h, , BK 1 h , and we want to add element ek,k h for k [K] into these buckets one by one. The rule for adding elements is as follows: If ek,k h = 0, we do not add it to any buckets. Otherwise we go through all buckets from B0 h to BK 1 h , and add ek,k h to Bi h whenever X s t 1,s Bi h (es,k h )2 < (ek,k h )2. Now assume we add ek,k h into the bucket bk h. Then, for all 1 i bk h 1, we have (ek,k h )2 P s k 1,s Bi h(es,k h )2. Thus, s=1 (es,k h )2 s t 1,s Bi h k=1 bk h(ek,k h )2. (F.1) Now note that by the definition of ε-independent sequence, for the measures in Bi h {πk1, , πkj}, πk is a ε -independent from all predecessors πk1, , πkj 1 such that ε > ε. (We can choose Published as a conference paper at ICLR 2024 ε = ek,k h c for enough small c such that q P s t 1,s Bi h(es,k h )2 ε and ε > ε by ek,k h > ε.) Thus, from the definition of BE dimension, the size of each bucket cannot exceed Eε. Now by Jensen s inequality, we can get k=1 bk h(ek,k h )2 = i=1 i|Bi h| es,s h |Bi h| where the last inequality uses the fact that |Bi h| Eε. Let xi = P s Bi h(es,s h ). By Lemma F.1, we have 1 Eε(1 + log K) Hence, combining (F.1), (F.2) and (F.3), we can get Eε(1 + log K) k=1 bk h(ek,k h )2 !1/2 Eε(1 + log K) s=1 (es,k h )2 !1/2 Now by the definition es,k h = bes,k h I{bes,k h > ε} and the fact that |B0 h| Eε, we can have k=1 bek,k h HKε + HKε + min{HEε, HK} + Then, by (F.3), we can further bounded it by k=1 bek,k h HKε + min{HEε, HK} + Eε(1 + log K) s=1 (es,k h )2 !1/2 HKε + min{HEε, HK} + EεH(1 + log K) s=1 (es,k h )2 !1/2 The last inequality uses the Jensen s inequality Now we can use a similar technique in Xie et al. (2021). Define (r h)(i)(s, a) = f k h(s, a) Es Ph( |s,a) f k h+1(s , ), πk h+1( | s ) . 
Then, we have Es1 ρ f k 1 (s1, πk 1(s1)) = Eπk f k h(sh, πk h(sh)) f k h+1(sh+1, πk h+1(sh+1)) # f k h(sh, πk h(sh)) Es Ph( |s,a) f k h+1(s , ), πk h+1( | s ) # h=1 (r h)(i)(s, a) Hence, we can rewrite the regret of the k-th episode as Es1 ρ h (f k 1 (s1, πk 1(s1)) V (i),πk(s1)) i = Eπk h=1 ((r h)(i)(s, a) r(i) h (s, a)) Published as a conference paper at ICLR 2024 The last inequality uses the fact that Then, substitute into the definition of (r h)(i), we can get Es1 ρ h (f k 1 (s1, πk 1(s1)) V (i),πk(s1)) i h=1 (f k h(s, a) Es Ph( |s,a) f k h+1(s , ), πk h+1( | s ) r(i) h (s, a)) h=1 (I T (i),πk h=1 bek,k h , where the first equality holds by the definition of Mk, the second equality holds by decomposing the value function to the expected cumulative sum of the reward function, and the last equality is derived by the definition of bek,k h . Now we can get h=1 bek,k h HKε + min{HEε, HK} + (EεH 2 log K) s=1 (es,k h )2 !1/2 The last inequality holds by (F.4). Now by the definition es,k h bes,k h = Eπs[(I T (i),πk h )(fh)] and the basic inequality ab µa + b/µ for µ > 0, we can derive HKε + min{HEε, HK} + µ (EεH 2 log K) + 1 Eπs h (I T (i),πk h )(fh) i 2 HKε + min{HEε, HK} + µ (EεH 2 log K) + 1 s=1 Eπs I T (i),πk h )(fh) 2 . The last inequality holds by (E[X])2 E[X2]. Thus, by choosing ε = 1/K, we can derive Reg(K) H + HE1/K + µ(E1/KH 2 log K) + 1 s=1 Eπs I T (i),πk 6d MADCH + µd MADC + 1 s=1 Eπs I T (i),πk h )(fh) 2 , where d MADC = max{2E1/KH log K, 1} = O(2dim MABE(F, 1/K)H log K). The last inequality uses the fact that log K 1 and H 1. F.2 PROOF OF THEOREM D.5 Proof. For any policy π and i [n], assume δz1, , δzm is an ε-independent sequence with respect to S π Πpur F(i),π h = S π Πpur(I T (i),π)F, where δz1, δz2, , δzm D , i.e. δzi is a Dirichlet probability measure over S A that δzi = δ(s,a)( ). Then, for each j [m], there exist function f j F(i) and policy πj Π such that |(I T (i),πj)fj(zj)| > ε and q Pj 1 p=1 |(f p h T (i),πpf p h+1)(zp)|2 ε. Define gj h = T (i),πjf j h+1, by Assumption 2.3, we have gj h F(i) h Fh. Thus, |(f j h gj h)(zj)| > ε and q Pj p=1 |(f p h gp h)(zp)|2 < ε. Thus, by the definition of eluder dimension, we have m dim E(Fh, ε). Hence, for all i and policy π, dim E(Fh, ε) m max h [H] max i [n] dim DE π Πpur F(i),π h , Dh, , ε which concludes the proof. Published as a conference paper at ICLR 2024 F.3 PROOF OF THEOREM D.8 Proof. First, by the elliptical potential lemma introduced in Lemma G.3, if we define Λk,h = εI + Pk 1 s=1 xk,hx T k,h, for any {xk,h}K k=1 Xh we have h=1 min n 1, xk,h 2 Λ 1 k,h h=1 2 log det k=1 xk,hx T k,h = 2γK(ε, X). (F.7) Now denote Σk,h = εI + Pk 1 s=1 Xh(πs)Xh(πs)T . Similar to Section F.1, define (r h)(i)(s, a) = f k h(s, a) Es Ph( |s,a) f k h+1(s , ), πk h+1( | s ) [ 1, 1], then we can have Es1 ρ h f k 1 (s1, πk 1(s1)) V (i),πk(s1) i = Eπk h=1 ((r h)(i)(s, a) r(i) h (s, a)) Then, we can substitute the definition of (r h)(i) and derive Es1 ρ h f k 1 (s1, πk 1(s1)) V (i),πk(s1) i h=1 (f k h(s, a) Es Ph( |s,a) f k h+1(s , ), πk h+1( | s ) r(i) h (s, a)) h=1 min W (i) h (f k, πk) W (i) h (f µ(i),πk , µ(i),πk), Xh(πk) V Then, by min{x, 2R} 2R min{x, 1}, we have k=1 Es1 ρ h (f k 1 (s1, πk 1(s1)) V (i),πk(s1)) i h=1 min W (i) h (f k, πk) W (i) h (f µ(i),πk , µ(i),πk), Xh(πk) V h=1 min W (i) h (f k, πk) W (i) h (f µ(i),πk , µ(i),πk), Xh(πk) V ( Xh(πk) Σ 1 k,h 1 ( Xh(πk) Σ 1 k,h > 1 The last inequality is because 1 = I{E} + I{ E} for any event. 
Now we decompose the (F.8) into two terms A + B, where h=1 min W (i) h (f k, πk) W (i) h (f µ(i),πk , µ(i),πk), Xh(πk) V ( Xh(πk) Σ 1 k,h 1 h=1 min W (i) h (f k, πk) W (i) h (f µ(i),πk , µ(i),πk), Xh(πk) V ( Xh(πk) Σ 1 k,h > 1 Now we bound A and B respectively. For A, we can use Cauchy s inequality and get W (i) h (f k, πk) W (i) h (f µ(i),πk , µ(i),πk) Σk,h Xh(πk) Σ 1 k,h I n Xh(πk) Σ 1 k,h 1 o W (i) h (f k, πk) W (i) h (f µ(i),πk , µ(i),πk) Σk,h min n Xh(πk) Σ 1 k,h , 1 o . The first inequality holds by Cauchy s inequality that | X, Y | X Σ Y Σ 1. Published as a conference paper at ICLR 2024 Now by the definition Σk,h = εI + Pk 1 s=1 Xh(πs)Xh(πs)T , we expand the term W (i) h (f k, πk) W (i) h (f µ(i),πk , µ(i),πk) Σk,h as W (i) h (f k, πk) W (i) h (f µ(i),πk , µ(i),πk) Σk,h ε W (i) h (f k, πk) W (i) h (f µ(i),πk , µ(i),πk) 2 W (i) h (f k, πk) W (i) h (f µ(i),πk , µ(i),πk), Xh(πs) 2 #1/2 W (i) h (f k, πk) W (i) h (f µ(i),πk , µ(i),πk), Xh(πs) 2 #1/2 The last inequality holds by b. Then, we can get h=1 W (i) h (f k, πk) W (i) h (f µ(i),πk , µ(i),πk) Σk,h min n Xh(πk) Σ 1 k,h, 1 o s=1 | W (i) h (f k, πk) W (i) h (f µ(i),πk , µ(i),πk), Xh(πs) |2 #1/2 min n Xh(πk) Σ 1 k,h, 1 o A1 + A2. (F.12) where A1 and A2 are defined as follows: h=1 min n Xh(πk) 2 Σ 1 k,h, 1 o!1/2 s=1 | W (i) h (f k, πk) W (i) h (f µ(i),πk , µ(i),πk), Xh(πs) |2 !1/2 h=1 min{ Xh(πk) 2 Σ 1 k,h, 1} Now we bound A1 and A2 respectively. First, for A1, using (F.7), we have 4εKHB2 W 2γK(ε, X). Then, for A2, we have s=1 | W (i) h (f k, πk) W (i) h (f µ(i),πk , µ(i),πk), Xh(πs) |2 !1/2 p s=1 ℓ(i),s(f k, πk) The equality holds by the definition of ℓ(i),s and the definition of multi-agent bilinear class (D.3). Then, since ab aµ + b/µ for any µ > 0, we can further derive A2 2Rµ 2γK(ε, X) + 1 2Rµ s=1 ℓ(i),s(f k, πk). Now by adding A1 and A2 and combining with (F.11) and (F.12), we can finally get A 2R(A1 + A2) q 4εKHB2 W 8RγK(ε, X) + µ 8R2γK(ε, X) + 1 s=1 ℓ(i),s(f k, πk) 32RB2 W εHK + γK(ε, X) + µ 8R2γK(ε, X) + 1 s=1 ℓ(i),s(f k, πk). Published as a conference paper at ICLR 2024 Now we have complete the bound of A. For B, by (F.7), since I{x > 1} min{1, x2}, we know that h=1 I n Xh(πk) Σ 1 k,h > 1 o h=1 min n 1, xk,h 2 Σ 1 k,h o 2γK(ε, X). (F.13) Thus, by the definition of B in (F.10), we can derive h=1 I n Xh(πk) Σ 1 k,h > 1 o 2γK(ε, X). Now note that Reg(K) A + B, then by choosing ε = 1/32RKB2 W and d MADC = max{1, 8R2γK(ε, X)} we can derive Reg(K) A + B 32RB2 W εHK + 3γK(ε, X) + µ 8R2γK(ε, X) + 1 s=1 ℓ(i),s(f k, πk) = H + 3γK(ε, X) + µ 8R2γK(ε, X) + 1 s=1 ℓ(i),s(f k, πk) 6d MADCH + µd MADC + 1 s=1 ℓ(i),s(f k, πk). The last inequality uses the fact that d MADC 1, H 1. Hence, we complete the proof. F.4 PROOF OF THEOREM D.12 Proof. In this subsection, we give a detailed proof of Theorem D.12. First, similar to the performance difference lemma in Jiang et al. (2017), we have Es1 ρ h V (i),πk 1,f (s1) V (i),πk = Eπk h Q(i),πk 1,f (s1, a1) i Eπk h=1 r(i) h (sh, ah) h,f (sh, ah) r(i) h (sh, ah) Q(i),πk h+1,f(sh+1, ah+1) # The last equality holds by splitting the term. 
Now, since Eπk h Q(i),πk h+1,f(sh+1, ah+1) i = Eπk h V (i),πk h+1,f (sh+1) i = Eπk h Esh+1 Ph,f ( |sh,ah) h V (i),πk h+1,f (sh+1) ii , we can rewrite (F.14) as Es1 ρ h V (i),πk 1,f (s1) V (i),πk h,f (sh, ah) r(i) h (sh, ah) Esh+1 Ph,f ( |sh,ah)V (i),πk h+1,f (sh+1) # h=1 (Esh+1 Ph,fk ( |sh,ah) Esh+1 Ph,f ( |sh,ah)) h V (i),πk h+1,f k(sh+1) i# Published as a conference paper at ICLR 2024 Then, combining (F.16) and the definition of multi-agent witness rank (D.7), we can derive k=1 Es1 ρ h V (i),πk 1,fk (s1) V (i),πk h=1 min R, 1 κwit | Wh(f k), Xh(πk) | h=1 min R, 1 κwit | Wh(f k), Xh(πk) | I n Xh(πk) Σ 1 k,h 1 o + I n Xh(πk) Σ 1 k,h 1 o . (F.17) Now note that K X k=1 min n 1, Xh(πk) 2 Σ 1 k,h o 2d log ε + K and I{x > 1} min{1, x2}, we can derive h=1 I{ Xh(πk) Σ 1 k,h > 1} D(ε)H. (F.18) Then, combining (F.17) and (F.18), we can get k=1 Es1 ρ h V (i),πk 1,fk (s1) V (i),πk h=1 min 1, 1 κwit | Wh(f k), Xh(πk) | I{ Xh(πk) Σ 1 k,h 1} + I{ Xh(πk) Σ 1 k,h > 1} h=1 min 1, 1 κwit | Wh(f k), Xh(πk) | I{ Xh(πk) Σ 1 k,h 1} + D(ε)HR 1 κwit Wh(f k) Σk,h min 1, Xh(πk) 2 Σ 1 k,h +D(ε)HR. (F.19) The last inequality uses the Cauchy s inequality X, Y X A Y A 1 and the fact that x I{x 1} min{1, x2}. Further, by the definition of Σk,h, we decompose the first term as ε Wh(f k) 2 2 + s=1 | Wh(f k), Xh(πs) |2 #1/2 min n 1, Xh(πk) Σ 1 k,h s=1 | Wh(f k), Xh(πs) |2 #1/2 min n 1, Xh(πk) Σ 1 k,h The second inequality is derived by the inequality Wh(f k) BW and b. Now sum over k [K] and h [H], we can get s=1 | Wh(f k), Xh(πs) |2 #1/2 min n 1, Xh(πk) Σ 1 k,h εBW min n 1, Xh(πk) Σ 1 k,h s=1 | Wh(f k), Xh(πs) |2 #1/2 min n 1, Xh(πk) Σ 1 k,h | {z } (Y ) Published as a conference paper at ICLR 2024 First, we try to give an upper bound for (X). By Cauchy s inequality and (F.18), we can derive h=1 min n 1, Xh(πk) 2 Σ 1 k,h HKεB2 W D(ε)H HKεB2 W κ2 wit + D(ε)H. (F.21) On the other hand, for (Y), we can bound it using Cauchy s inequality that P (P a a) (P b b), (Y ) 1 κwit s=1 | Wh(f k), Xh(πs) |2 ! K X h=1 min n 1, Xh(πk) 2 Σ 1 k,h v u u t D(ε)H s=1 | Wh(f k), Xh(πs) |2 ! The last inequality holds by the definition of D(ε) in F.18. Now by the definition of multi-agent witness rank D.6, we note that | Wh(f k), Xh(πs) |2 max v Vh E(sh,ah) π[(Esh+1 Ph,fk ( |sh,ah) Esh+1 Ph,f ( |sh,ah))v(sh, ah, sh+1)] 2 max v Vh E(sh,ah) πs (Esh+1 Ph,fk ( |sh,ah) Esh+1 Ph,f ( |sh,ah))v(sh, ah, sh+1) 2 E(sh,ah) πs max v Vh (Esh+1 Ph,fk ( |sh,ah) Esh+1 Ph,f ( |sh,ah))v(sh, ah, sh+1) 2 The last two inequalities use Jensen s inequality. Hence, by the definition of total variation distance, we can get | Wh(f k), Xh(πs) |2 TV Ph,f k( | sh, ah), Ph,f ( | sh, ah) 2 (F.22) 2D2 H Ph,f k( | sh, ah), Ph,f ( | sh, ah) , (F.23) where the TV( , ) denotes the total variation distance and DH denotes the Hellinger divergence. The inequality (F.22) holds by the fact that v(sh, ah, sh+1) [0, 1], and the (F.23) holds by the relationship between TV distance and Hellinger distance. Then, we can substitute the inequality (F.23) and get (Y ) 1 κwit v u u t D(ε)H s=1 E(sh,ah) πs2D2 H Ph,f k( | sh, ah), Ph,f ( | sh, ah) ! s=1 E(sh,ah) πs D2 H Ph,f k( | sh, ah), Ph,f ( | sh, ah) ! (F.24) Hence, combining (F.19), (F.20), (F.21) and (F.24), we can get k=1 Es1 ρ h V (i),πk 1,f k (s1) V (i),πk R A + D(ε)HR R(X + Y ) + D(ε)HR HKRεB2 W /κ2 wit + D(ε)HR + µR2 2D(ε)H s=1 E(sh,ah) πs D2 H Ph,f k( | sh, ah), Ph,f ( | sh, ah) ! 
Published as a conference paper at ICLR 2024 Now by the definition of ℓ(i),s of the model-based problem in (2.5), choosing ε = κ2 wit/HKB2 W and d MADC = 2R2D(ε)H κ2 wit , we can get Reg(K) 6d MADCH + µ d MADC + 1 s=1 ℓ(i),s(f k, πk) complete the proof by D(κ2 wit/HKB2 W ) = e O(d). F.5 PROOF OF THEOREM D.14 Proof. First, we fix an index i [n]. Similar to Section F.4, we can get k=1 Es1 ρ h V (i),πk 1,f k (s1) V (i),πk h=1 (Q(i),πk h,f k (sh, ah) r(i) h (sh, ah) Esh+1 Ph,f ( |sh,ah)V (i),πk h+1,f k(sh+1, ah+1)) h=1 (Esh+1 Ph,fk ( |sh,ah) Esh+1 Ph,f ( |sh,ah))[V (i),πk h+1,f k(sh+1)] h=1 (θh,f k θ h)T Eπk Z S ϕh(s | s, a)V (i),πk h+1,f k(s )ds , where the last equality is because of the property of the linear mixture MG. Now we denote Wh(f) = R(θh,f θ h) (F.26) Xh(f, π) = Eπ S ϕh(s | s, a)V (i),π h+1,f(s )ds Then, we have Wh(f) 2 d, Xh(f, π) 1 and k=1 Es1 ρ h V (i),πk 1,f k (s1) V (i),πk h=1 min{ Wh(f k), Xh(f k, πk) , R}. Now similar to Section F.4, if we replace Xh(πk) to Xh(f k, πk), from (F.21) and (F.24) with BW = 2 d R we can get k=1 Es1 ρ h V (i),πk 1,f k (s1) V (i),πk HKRε4d R2 + D(ε)HR + µR4 2D(ε)H + 1 µR2 s=1 Wh(f k), Xh(f s, πs) 2 ! where D(ε) = 2d log ε+K ε . Moreover, by (F.26) and (F.27), note that Wh(f k), Xh(f s, πs) = (θh,f k θ h)T Eπs Z S ϕh(s | s, a)V (i),πs h+1,f s(s )ds = Eπs h (Esh+1 Ph,fk ( |sh,ah) Esh+1 Ph,f ( |sh,ah)[V (i),πs h+1,f s(sh+1)] i Eπs h 2 V (i),πs h+1,f s( ) d TV(Ph,f k( | sh, ah) Ph,f ( | sh, ah)) i 2RDH(Ph,f k( | sh, ah) Ph,f ( | sh, ah)) i . Published as a conference paper at ICLR 2024 Hence, from (F.28) and Jensen s inequality that (E[X])2 E[X2], we can have k=1 Es1 ρ h V (i),πk 1,f k (s1) V (i),πk HKRε4d R2 + D(ε)HR + µR4 2D(ε)H s=1 E(sh,ah) πs 8R2D2 H Ph,f k( | sh, ah), Ph,f ( | sh, ah) ! By the definition of discrepancy function ℓ(i),s in (2.5), and choosing ε = 1/HKd, d MADC = HR4D(1/HKd) = e O(Hd R4), we can derive Reg(K) 4R3 + D(1/HKd)HR + µd MADC + 1 s=1 ℓ(i),s(f k, πk) 6d MADCH + µd MADC + 1 s=1 ℓ(i),s(f k, πk). Hence, we complete the proof. F.6 PROOF OF LEMMA E.1 Proof. The proof is modified from Zhong et al. (2022). Define Wj,h be the filtration induced by {sk 1, ak 1, r(i),k 1 , , sk H, ak H, r(i),k H }j 1 k=1. First, for h [H], i [n], f F(i) and π Π, we define the random variable Y (i) j (h, f, ζk) = fh(sj h, aj h) r(i) h (sj h, aj h) fh+1(sj h+1 ), ζk h+1( | sj h+1) 2 h (f)(sj h, aj h) r(i) h (sj h, aj h) fh+1(sj h+1 ), ζk h+1( | sj h+1) 2 . By taking conditional expectation of Yj with respect to aj h, sj h, we can get E[Y (i) j (h, f, ζk) | Wj,h] = Esh,ah ζj[(fh T (i),ζk h (f))(sh, ah)]2 E[(Y (i) j (h, f, ζk))2 | Wj,h] 2R2E[Y (i) j (h, f, ζk) | Wj,h], where 3R |fh(sj h, aj h) r(i) h (sj h, aj h) fh+1(sj h+1 ), ζk h+1( | sj h+1) )| is the constant upper bound. Denote Zj = Y (i) j (h, f, ζk) Esh+1[Y (i) j (h, f, ζk) | Wj,h] with |Zj| 4R2. 
By the Freedman inequality, for any 0 < η < 1 4R, with probability at least 1 δ, j=1 Var[Y (i) j (h, f, ζk) | Wj,h] + log(1/δ) j=1 E[(Y (i) j (h, f, ζk))2 | Wj,h] + log(1/δ) j=1 2R2E[Y (i) j (h, f, ζk) | Wj,h] + log(1/δ) By choosing η = min 2R q Pk j=1 E[Y (i) j (h,f,ζk)|Wj,h] , we will have j=1 E[Y (i) j (h, f, ζk) | Wj,h] log(1/δ) + R2 log(1/δ) Published as a conference paper at ICLR 2024 Similarly, if we apply the Freedman s inequality with Pk j=1 Zj, with probability at least 1 2δ, j=1 E[Y (i) j (h, f, ζk) | Wj,h] log(1/δ) + R2 log(1/δ) Denote the ρ-covering set of F(i) as CF(i)(ρ), then for any f F(i), ζ Πpur i , there exists a pair ef CF(i)(ρ) such that fh(sh, ah) r(i) h (sh, ah) fh+1(sh+1, ), ζh+1( | sh+1) efh(sh, ah) r(i) h (sh, ah) efh+1(sh+1, ), ζh+1( | sh+1) 3ρ for all (sh, ah, sh+1) S A S. Now by taking a union bound over CF(i)(ρ), we have that with probability at least 1 δ, for all ef CF(i)(ρ), j=1 e Y (i) j (h, ef, ζ) j=1 E[e Y (i) j (h, ef, ζ) | Wj,h] j=1 E[e Y (i) j (h, ef, ζ) | Wj,h]ι + R2ι where ι = 2 log(HK|CF(i)(ρ)|/δ) 2 log(HKNF(i)(ρ)). Now note that for all f F(i), ζ Πpur i , we have j=0 Y (i) j (h, f, ζ) j=0 (fh(sj h, aj h) r(i) h (sj h, aj h) fh+1(sj h+1, ), ζh+1( | sj h+1) )2 (T (i),ζ h (f)(sj h, aj h) r(i) h (sj h, aj h) fh+1(sj h+1, ), ζh+1( | sj h+1) )2 j=0 (fh(sj h, aj h) r(i) h (sj h, aj h) fh+1(sj h+1, ), ζh+1( | sj h+1) )2 inf f h F(i) h (T (i),ζk h (f )(sj h, aj h) r(i) h (sj h, aj h) fh+1(sj h+1, ), ζh+1( | sj h+1) )2 = L(i),k 1(f, ζ, τ 1:k 1). Then, by (F.29) we can get j=0 E[e Y (i) j (h, ef, ζ) | Wj,h] 4L(i),k 1( ef, ζ, τ 1:k 1) + O(HR2ι). Now similar to (Jin et al., 2021a), by the definition of ρ-covering number, for any k [K], f F(i) and ζ Πpur i , j=0 E[Y (i) j (h, f, ζ) | Wj,h] 4L(i),k 1(f, ζ, τ 1:k 1) + O(HR2ι + HRkρ). Now since sj h, aj h ζj, we can have j=0 ℓj,(i)(f, ζk) = j=0 E[Y (i) j (h, f, ζk) | Wj,h] 4L(i),k 1(f, ζk, τ 1:k 1) + O(HR2ι + HRkρ). We complete the proof by choosing ρ = 1/K and choose εconc = O(HR2ι+HRkρ) = O(HR2ι). Published as a conference paper at ICLR 2024 F.7 PROOF OF LEMMA E.2 Proof. First, for any f F(i) and π Πpur we define the random variable Q(i) j (h, f, π) = (fh(sj h, aj h) r(i) h (sj h, aj h) f h+1(sj h+1, ), πh+1( | sj h+1) )2 (f h(sj h, aj h) r(i) h (sj h, aj h) f h+1(sj h+1, ), πh+1( | sj h+1) )2. Then, by similar derivations in Lemma E.1, we can get E[Q(i) j (h, f, π) | Wj,h] = Esh,ah ζj[(fh T (i),π(f ))(sh, ah)]2 0, E[(Q(i) j (h, f, π))2 | Wj,h] 2R2E[Q(i) j (h, f, π) | Wj,h]. Then, by Freedman s inequality, with probability at least 1 δ, for all elements in ef CF(i)(ρ), we have j=0 e Q(i) j (h, ef, π) j=0 E[ e Q(i) j (h, ef, π) | Wj,h] j=0 Esh+1[ e Q(i) j (h, ef, π) | Wj,h]ι + R2ι then we can have j=0 e Q(i) j (h, ef, π) O(R2ι). Thus, by the definition of CF(i)(ρ), for all f F(i) and π Πpur i , we have j=0 Q(i) j (h, f, π) O(R2ι + Rkρ). L(i),k(f, π) = j=0 Q(i) j (h, f, π) O(HR2ι + HRkρ) = O(HR2ι). Thus, we complete the proof. F.8 PROOF OF LEMMA E.3 Proof. For simplicity, we first assume F is a finite class. Given a model f F and h [H], we define Xj h,f = log Ph,f (sj h+1|sj h,aj h) Ph,f (sj h+1|sj h,aj h) . Thus, L(i),k(f , τ 1:k) L(i),k(f, τ 1:k) = j=1 Xj h,f. (F.30) Now we define the filtration Gj as Gj = σ({s1 h, a1 h, , sj h, aj h}). Then, by Lemma G.1 for all f F, with probability at least 1 δ, we have j=1 Xj h, f + log(H|F|/δ). 
Published as a conference paper at ICLR 2024 Now we decompose the first term at the right side as v u u tlog Ph,f(sj h+1 | sj h, aj h) Ph,f (sj h+1 | sj h, aj h) = E(sj h,aj h) πj Esh+1 Ph,f ( |sj h,aj h) v u u t Ph,f(sj h+1 | sj h, aj h) Ph,f (sj h+1 | sj h, aj h) = E(sj h,aj h) πj Ph,f(sj h+1 | sj h, aj h)Ph,f (sj h+1 | sj h, aj h)dsj h+1 2E(sj h,aj h) πj[D2 H(Ph,f(sj h+1 | sj h, aj h) Ph,f (sj h+1 | sj h, aj h))]. Now by the inequality log x x 1, we have 2E(sj h,aj h) πj[D2 H(Ph,f(sj h+1 | sj h, aj h) Ph,f (sj h+1 | sj h, aj h))] 1 + log(H|F|/δ) 1 2E(sj h,aj h) πj h D2 H(Ph,f(sj h+1 | sj h, aj h) Ph,f (sj h+1 | sj h, aj h)) i + log(H|F|/δ). Sum over h [H] with (F.30), we can complete the proof by j=1 ℓ(i),j(f) + κconc, where κconc = H log(H|F|/δ). For infinite model classes F, we can use 1/K-bracketing number BF(1/K) to replace the cardinality |F| (Liu et al., 2022a; Zhong et al., 2022; Zhan et al., 2022b). F.9 PROOF OF COROLLARY E.4 Proof. We provide the proof for NE. The proof for CCE/CE are the same by replacing the NEregret to the CCE/CE-regret. By taking the minimum of the index of agents rather than adding them together, we can modify the proof of Theorem 3.1 and derive, with probability at least 1 δ, k=1 max i [n] V (i),µ(i),πk (ρ) V (i),πk(ρ) ! K + d MADCH Hence, by choosing K = e O (H2Υ2 F,δ + d2 MADC) ε 2 + d MADCH ε 1 with ε < 1, we have V (i),µ(i),πout(ρ) V (i),πout(ρ) k=1 max i [n] V (i),µ(i),πk (ρ) V (i),πk(ρ) ! where the second inequality holds from πout = Unif({πk}k [K]). Hence, πout is a ε-NE. Published as a conference paper at ICLR 2024 G TECHNICAL TOOLS We provide the following lemma to complete the proof of model-based RL problems. The detailed proof can be found in (Foster et al., 2021). Lemma G.1. For any real-valued random variable sequence {Xk}k [K] adapted to a filtration {Gk}k [K], with probability at least 1 δ, for any k [K], we can have s=1 log E[exp( Xs) | Fs 1] + log(1/δ). In the next lemma, we introduce the Freedman s inequality, which has been commonly used in previous RL algorithms. (Jin et al., 2021b; Chen et al., 2022c; Zhong et al., 2022) Lemma G.2 (Freedman s Inequality (Agarwal et al., 2014)). Let {Zk}k [K] be a martingale difference sequence that adapted to filtration {Fk}k [K]. If |Zk| R for all k [K], then for η (0, 1 R), with probability at least 1 δ, we can have k=1 E[X2 k | Fk 1] + log(1/δ) The next elliptical potential lemma is first introduced in the linear bandit literature (Dani et al., 2008; Abbasi-Yadkori et al., 2011) and then applied to the RL problems with Bilinear Classes (Du et al., 2021) and the general function approximation (Chen et al., 2022a; Zhong et al., 2022). Lemma G.3 (Elliptical Potential Lemma). Let {xk}K k=1 be a sequence of real-valued vector, i.e. xk Rd for any k [K]. Then, if we define Λi = εI + PK k=1 xkx T k , we can get that k=1 min n 1, xi 2 Λ 1 i o 2 log det(ΛK+1) k=1 xkx T k Proof. The proof is provided in Lemma 11 of (Abbasi-Yadkori et al., 2011). H COMPUTATIONAL EFFICIENCY Since the normal-form game is defined over the pure policy space, the size of this game can be exponentially large, which makes the algorithm not computationally efficient. Actually, the computational complexity of this oracle depends on the type of equilibrium. For CCE, the oracle can be approximately implemented by mirror descent (Wang et al., 2023), where the time of calculating V (π) depends on the number of iterations. 
Hence, the computational complexity of this oracle depends only on the number of iterations rather than on the size of the pure policy space |Πpur|. To be more specific, by replacing the optimistic value function V_i^{(t),π} in Algorithm 4 (DOPMD) of [1] with the regularized payoff function in 3.1, one can obtain an approximate CCE by mirror descent via an FTRL analysis. The sampling procedure in Line 3 of DOPMD can be implemented by Langevin dynamics (Karagulyan, 2021) as long as the gradient of the regularized payoff function is available. However, computing a Nash equilibrium is PPAD-complete, so MAMEX is not computationally efficient for learning NE.
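To make the preceding discussion of the CCE oracle concrete, here is a minimal, self-contained sketch (it is not the paper's DOPMD-based procedure, nor MAMEX itself; the helper names hedge_cce and cce_gap, the payoff matrices A and B, and the iteration budget T are assumptions introduced only for this illustration) of computing an approximate CCE of a finite two-player normal-form game by running a Hedge (multiplicative-weights) no-regret update for each player and averaging the joint play. The accuracy of the returned distribution is governed by the number of no-regret iterations, in the same spirit as the mirror-descent implementation discussed above.

```python
import numpy as np

def hedge_cce(A, B, T=5000):
    """Approximate CCE of a two-player general-sum normal-form game.
    A and B are the payoff matrices of players 1 and 2 (entries in [0, 1]).
    Each player runs Hedge against the other's mixed play; the time-averaged
    joint distribution of play is returned."""
    m, k = A.shape
    eta1, eta2 = np.sqrt(np.log(m) / T), np.sqrt(np.log(k) / T)
    s1, s2 = np.zeros(m), np.zeros(k)      # cumulative expected payoffs per action
    joint = np.zeros((m, k))
    for _ in range(T):
        x = np.exp(eta1 * (s1 - s1.max())); x /= x.sum()   # player 1's Hedge mixture
        y = np.exp(eta2 * (s2 - s2.max())); y /= y.sum()   # player 2's Hedge mixture
        joint += np.outer(x, y)
        s1 += A @ y                         # payoff of each pure action vs. y
        s2 += B.T @ x                       # payoff of each pure action vs. x
    return joint / T

def cce_gap(A, B, joint):
    """Largest gain any player can obtain by a unilateral pure deviation,
    i.e. the epsilon for which `joint` is an epsilon-CCE."""
    gap1 = (A @ joint.sum(axis=0)).max() - (joint * A).sum()
    gap2 = (joint.sum(axis=1) @ B).max() - (joint * B).sum()
    return max(gap1, gap2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B = rng.uniform(size=(4, 5)), rng.uniform(size=(4, 5))
    D = hedge_cce(A, B, T=20000)
    print("CCE gap:", cce_gap(A, B, D))     # shrinks as T grows
```

By the standard no-regret-to-CCE argument, the gap reported by cce_gap decays at roughly a 1/√T rate as the iteration budget grows, which is why the cost of this oracle is measured in iterations rather than in enumerations of equilibria.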