# MDPGT: Momentum-Based Decentralized Policy Gradient Tracking

Zhanhong Jiang¹, Xian Yeow Lee², Sin Yong Tan², Kai Liang Tan², Aditya Balu², Young M. Lee¹, Chinmay Hegde³, Soumik Sarkar²

¹Johnson Controls Inc., 507 East Michigan St, Milwaukee, WI 53202; ²Iowa State University, Ames, IA 50010; ³New York University, 6 MetroTech Center, Brooklyn, NY 11201

{zhanhong.jiang, young.m.lee}@jci.com, {xylee, tsyong98, kailiang, baditya, soumiks}@iastate.edu, {chinmay.h}@nyu.edu

## Abstract

We propose a novel policy gradient method for multi-agent reinforcement learning that leverages two different variance-reduction techniques and does not require large batches over iterations. Specifically, we propose a momentum-based decentralized policy gradient tracking (MDPGT) method in which a new momentum-based variance reduction technique is used to approximate the local policy gradient surrogate with importance sampling, and an intermediate parameter is adopted to track two consecutive policy gradient surrogates. MDPGT provably achieves the best available sample complexity of $O(N^{-1}\epsilon^{-3})$ for converging to an $\epsilon$-stationary point of the global average of $N$ local performance functions (possibly non-concave). This outperforms the state-of-the-art sample complexity in decentralized model-free reinforcement learning, and when initialized with a single trajectory, the sample complexity matches that obtained by existing decentralized policy gradient methods. We further validate the theoretical claim for the Gaussian policy function. When the required error tolerance $\epsilon$ is small enough, MDPGT leads to a linear speed-up, a property previously established in decentralized stochastic optimization but not yet for reinforcement learning. Lastly, we provide empirical results on a multi-agent reinforcement learning benchmark environment to support our theoretical findings.

## Introduction

Multi-agent reinforcement learning (MARL) is an emerging topic that has been explored in both theoretical (Nguyen et al. 2014; Zhang et al. 2018; Qu et al. 2019; Zhang et al. 2021b) and empirical settings (Helou, Kalathil, and Xie 2020; Mukherjee, Bai, and Chakrabortty 2020; Zhou et al. 2020). Several appealing applications of MARL can be found in (Zhang, Yang, and Başar 2019; Nguyen, Nguyen, and Nahavandi 2020) and the references therein. While MARL can primarily be cast into two categories, cooperative (Li, Chen, and Chen 2020; Wang et al. 2020; Li et al. 2020) and competitive (Chen et al. 2020), our focus is on the cooperative setting; see (Wei et al. 2021) for details on the competitive setting. Cooperative MARL is typically modeled as a networked multi-agent Markov decision process (MDP) (Zhang, Yang, and Başar 2018; Chu, Chinchali, and Katti 2020; Zhang et al. 2018) in which the agents share a centralized reward function (Simões, Lau, and Reis 2020; Ackermann et al. 2019). In practice, however, this is not necessarily the case; a more general yet challenging scenario is that agents have heterogeneous reward functions. Inherently, the ultimate goal in such a cooperative MARL setting is for agents to maximize the global average of local long-term returns.
To address this problem, various algorithms have been proposed, including distributed learning (Arslan and Yüksel 2016; Nguyen and Mukhopadhyay 2017) and distributed actor-critic (Li et al. 2020; Ryu, Shin, and Park 2020). More recent works have established finite-sample analyses for decentralized batch MARL (Zhang et al. 2021b) and leveraged advances in the analysis of descent-ascent algorithms (Lu et al. 2021). These preliminary attempts have facilitated the theoretical understanding of cooperative MARL by showing explicit sample complexity bounds, which match that of standard (vanilla) stochastic gradient descent (SGD). Additionally, recent works (Huang et al. 2020; Xu, Gao, and Gu 2019) in centralized RL have revealed that with simple variance reduction techniques, this sample complexity can be reduced to $O(\epsilon^{-3})$ for reaching an $\epsilon$-stationary point (i.e., $\mathbb{E}[\|\nabla J(x)\|] \le \epsilon$, where $J$ is the return function and $x \in \mathbb{R}^d$ is the decision variable to be optimized), which is regarded as the best complexity in decentralized optimization (Das et al. 2020; Karimireddy et al. 2020). However, no similar matching bounds have yet been reported in the decentralized (cooperative MARL) setting. Hence, this motivates the question:

*Can we achieve a sample complexity of $O(\epsilon^{-3})$ in decentralized MARL via variance reduction?*

In this paper, we answer this question affirmatively by proposing a variance-reduced policy gradient tracking approach, MDPGT (Algorithm 1), and analyzing it in Theorem 1. Additionally, we propose a variation (based on a different initialization) that achieves the state-of-the-art (SOTA) sample complexity for decentralized MARL (Zhang et al. 2021b; Lu et al. 2021). See Table 1 for SOTA comparisons. Specifically:

1. We propose MDPGT, in which we use a stochastic policy gradient surrogate, a convex combination of the vanilla stochastic policy gradient and an importance sampling-based stochastic recursive gradient algorithm (SARAH) (Nguyen et al. 2017), for the local gradient update. Instead of directly applying the stochastic policy gradient surrogate in the parameter update, an intermediate parameter is adopted to track the difference between two consecutive stochastic policy gradient surrogates. For smooth non-concave performance functions, we show that MDPGT with the mini-batch initialization converges to an $\epsilon$-stationary point in $O(N^{-1}\epsilon^{-3})$ gradient-based updates, which matches the best available known upper bounds (Huang et al. 2020).

2. We modify the initialization of the proposed algorithm MDPGT by using a single trajectory instead of a mini-batch of trajectories. Surprisingly, we find that only one trajectory results in a larger sampling complexity of $O(N^{-1}\epsilon^{-4})$, which is nonetheless the same as that obtained by the SOTA (Zhang et al. 2021b; Lu et al. 2021), with a linear speed-up when $\epsilon$ is sufficiently small. Additionally, when updating the policy parameter in MDPGT, the mini-batch size is $O(1)$ instead of being $\epsilon$-dependent (Xu, Gao, and Gu 2019; Qu et al. 2019), which can significantly improve practical efficiency.

3. To facilitate the theoretical understanding of MDPGT, we leverage a benchmark gridworld environment for numerical simulation and compare our proposed algorithm to a baseline decentralized policy gradient (DPG) and a momentum-based decentralized policy gradient (MDPG, described in the supplementary materials), which is a new variant created in this work for the purpose of empirical comparison. We show that our theoretical claims are valid based on the experiments.
**Related Works.** Most previous decentralized MARL papers (Zhang et al. 2018; Suttle et al. 2020; Li et al. 2020; Chen et al. 2020; Bono et al. 2018) tend to focus on convergence to the optimal return. Exceptions include (Qu et al. 2019), which proves non-asymptotic convergence rates with nonlinear function approximation using value propagation; this allows the number of stochastic gradient evaluations to be derived approximately. However, their algorithm involves a complex inner-outer loop structure and requires a mini-batch of size $K$, with $K$ being the number of iterations, which may not be practically implementable. Zhang et al. (2021b) obtain $O(\epsilon^{-4})$ for the cooperative setting by using gradient tracking (GT), a bias correction technique dedicated to decentralized optimization, but with several specifically imposed assumptions, such as stationary sample paths, which may not be realistic. Lu et al. (2021) also utilize GT but require dual parameter updates to achieve $O(\epsilon^{-4})$; our approach is different and simpler. In this context, we mention that centralized counterparts of MARL (Huang et al. 2020; Xu, Gao, and Gu 2019; Papini et al. 2018) have achieved a sample complexity of $O(\epsilon^{-3})$. However, in both Xu, Gao, and Gu (2019) and Papini et al. (2018), the mini-batch size is $\epsilon$-dependent, which is computationally more demanding than in (Huang et al. 2020) and in our proposed method. We provide additional discussion of related work in the supplementary materials.

## Preliminaries

We first formulate MARL, followed by an overview of variance reduction techniques and decentralized policy gradients.

### MARL Formulation

In this context, we consider a networked system involving multiple agents (say $N$) that collaboratively solve dynamic optimization problems. Specifically, the system is represented as a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{1, 2, \ldots, N\}$ is the vertex set and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the edge set. Throughout the paper, we assume that $\mathcal{G}$ is static and undirected, though in a few previous works (Lin et al. 2019; Suttle et al. 2020; Zhang et al. 2018) $\mathcal{G}$ could be directed. The goal of this work is to provide a rigorous theoretical analysis for our decentralized MARL algorithm, with the property of the graph not being the main focus. When a pair of agents $i$ and $j$ can communicate with each other, we have $(i, j) \in \mathcal{E}$. We also define the neighborhood of a specific agent $i$ as $Nb(i) \triangleq \{j \mid j \in \mathcal{V}, (i, j) \in \mathcal{E} \text{ or } j = i\}$. Only agents in $Nb(i)$ are able to communicate with agent $i$. We next present the definition of networked MARL on top of $\mathcal{G}$. With multiple agents, the networked Markov decision process is characterized by a tuple $(\mathcal{S}, \{\mathcal{A}^i\}_{i \in \mathcal{V}}, P, \{r^i\}_{i \in \mathcal{V}}, \mathcal{G}, \gamma)$, where $\mathcal{S}$ indicates a global state space shared by all agents in $\mathcal{V}$ with $|\mathcal{S}| < \infty$, $\mathcal{A}^i$ signifies the action space of agent $i$, and $\gamma \in (0, 1]$ is the discount factor. Moreover, in the cooperative MARL setting, the environment is driven by the joint action space instead of individual action spaces. Thus, $\mathcal{A} \triangleq \prod_{i \in \mathcal{V}} \mathcal{A}^i$ is defined as the joint action space over all agents in $\mathcal{V}$. $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ represents the probability function for transitioning from the current state to the next state. $r^i : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the local reward function of agent $i$, with $r^i \in [-R, R]$ for some $R > 0$. Additionally, states and actions are assumed to be globally observable, while the rewards are only locally observable.
Such an assumption corresponds to the definition of the cooperative MARL setting and has been generic in previous works (Zhang et al. 2018; Zhang, Yang, and Başar 2018; Zhang et al. 2021b). We next describe how agents behave in such an environment. Suppose that the current state of the environment is $s_k \in \mathcal{S}$, where $k$ is the time step. Each agent $i$ chooses its own action $a^i_k \in \mathcal{A}^i$ based on its local policy $\pi^i : \mathcal{S} \times \mathcal{A}^i \to [0, 1]$. For a parameterized policy, we write $\pi^i_{x^i}(s, a^i)$ for the probability of agent $i$ choosing action $a^i$ given the current state $s$, where $x^i \in \mathbb{R}^{d_i}$ is the policy parameter. Stacking all local policy parameters together yields $x = [(x^1)^\top, (x^2)^\top, \ldots, (x^N)^\top]^\top \in \mathbb{R}^{\sum_{i \in \mathcal{V}} d_i}$. Hence, the joint policy can be denoted as $\pi_x : \mathcal{S} \times \mathcal{A} \to [0, 1]$, where $\pi_x(s, a) \triangleq \prod_{i \in \mathcal{V}} \pi^i_{x^i}(s, a^i)$ and $a \in \mathcal{A}$. In this context, the decisions are decentralized due to locally observable rewards, locally evaluated policies, and locally executed actions. To simplify notation, we drop $x^i$ from $\pi^i_{x^i}$ and $x$ from $\pi_x$ for the local and joint policies, respectively, throughout the rest of the paper. With the joint policy $\pi$ and the state transition function $P$, the environment evolves from $s$ to $s'$ with probability $P(s'|s, a)$. Another assumption imposed in this paper on the policy function is that for all $i \in \mathcal{V}$, $s \in \mathcal{S}$, $a^i \in \mathcal{A}^i$, $\pi^i(s, a^i)$ is continuously differentiable w.r.t. $x^i \in \mathbb{R}^{d_i}$. Such an assumption will assist in characterizing the smoothness of the objective function.

| Method | Complexity |
| --- | --- |
| MBPG (Huang et al. 2020) | $O(\epsilon^{-3})$ |
| Value Prop (Qu et al. 2019) | $O(\epsilon^{-4})$ |
| DCPG (Zeng et al. 2020) | $O(\epsilon^{-4})$ |
| Safe-Dec-PG (Lu et al. 2021) | $O(\epsilon^{-4})$ |
| DFQI (Zhang et al. 2021b) | $O(\epsilon^{-4})$ |
| Dec-TD(0)+GT (Lin and Ling 2021) | N/A |
| MDPGT (ours) | $O(\epsilon^{-4})$ |
| MDPGT-MI (ours) | $O(\epsilon^{-3})$ |

Table 1: Comparisons between existing and proposed approaches. ¹ Complexity: sampling complexity for achieving $\mathbb{E}[\|\nabla J(x)\|] \le \epsilon$. ² Linear speed-up: if an algorithm has $O(1/\sqrt{K})$ convergence, then its sampling complexity for attaining an $O(\epsilon)$-accurate solution is $\epsilon^{-2}$; similarly, $O(1/\sqrt{NK})$ corresponds to $N^{-1}\epsilon^{-2}$, which is $N$ times faster than the former. Typically, $K$ has to satisfy a certain condition. ³ MDPGT-MI is MDPGT with mini-batch initialization; we use this notation to conveniently distinguish the two initialization approaches, and in the rest of the paper we still adopt MDPGT to unify them.

The goal for each agent is to learn a local policy $\pi^i$ such that the joint policy $\pi$ maximizes the global average of expected cumulative discounted rewards, i.e.,

$$\pi^* = \arg\max_{x \in \mathbb{R}^d} J(x) \triangleq \frac{1}{N} \sum_{i \in \mathcal{V}} \mathbb{E}\Big[\sum_{h=0}^{H-1} \gamma^h r^i_h\Big], \qquad (1)$$

where $H$ is the horizon and $d = \sum_{i \in \mathcal{V}} d_i$. Several works (Zhang et al. 2018; Qu et al. 2019; Zhang et al. 2021b; Lin et al. 2019; Suttle et al. 2020) have attempted to solve this optimization problem, leading to different algorithms. Since each agent only has access to local information, a communication protocol needs to be introduced in the system, as is done in decentralized optimization. With that, well-known centralized policy-based algorithms can be extended to MARL algorithms. Nevertheless, one issue that has not been sufficiently explored is the inherent policy gradient variance, which can be even more significant in MARL algorithms. Consequently, this work proposes novel MARL algorithms that investigate how to reduce the policy gradient variance during optimization.
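As a concrete illustration of the factored joint policy $\pi_x(s, a) = \prod_{i \in \mathcal{V}} \pi^i_{x^i}(s, a^i)$ above, the following minimal Python sketch (not from the paper; it assumes discrete actions and a hypothetical linear-softmax parameterization of each local policy) computes a joint action probability as the product of local action probabilities.

```python
import numpy as np

def local_policy_probs(x_i, state_feats):
    """Agent i's action distribution pi^i_{x^i}(.|s) under a linear-softmax head."""
    logits = x_i @ state_feats                # x_i: (num_actions_i, feat_dim)
    z = np.exp(logits - logits.max())         # numerically stable softmax
    return z / z.sum()

def joint_policy_prob(params, state_feats, joint_action):
    """pi_x(s, a) = prod_i pi^i_{x^i}(s, a^i) for the factored joint policy."""
    prob = 1.0
    for x_i, a_i in zip(params, joint_action):
        prob *= local_policy_probs(x_i, state_feats)[a_i]
    return prob

# Example: 3 agents, 4 discrete actions each, 8-dimensional global state features.
rng = np.random.default_rng(0)
params = [0.1 * rng.normal(size=(4, 8)) for _ in range(3)]
state = rng.normal(size=8)
print(joint_policy_prob(params, state, joint_action=(0, 2, 1)))
```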
### Variance Reduction and Bias Correction

In stochastic optimization, variance reduction techniques have been well studied and applied to both centralized and decentralized gradient-descent-type algorithms, such as SVRG (Johnson and Zhang 2013), SARAH (Nguyen et al. 2017), SPIDER (Fang et al. 2018), Hybrid-SARAH (Tran-Dinh et al. 2019), and STORM (Cutkosky and Orabona 2019). In another line of work, the GT technique (Pu and Nedić 2020; Sun, Daneshmand, and Scutari 2019) was proposed specifically for consensus-based decentralized optimization to improve the convergence rate by tracking and correcting each agent's locally aggregated gradients. In our work, we leverage both Hybrid-SARAH and GT to respectively reduce the policy gradient variance and correct the policy gradient bias in MARL, and thereby achieve the best convergence rate. Hybrid-SARAH uses a trade-off parameter to balance the effects of the vanilla stochastic gradient and SARAH. More details on these techniques are given in the supplementary materials. So far, we are not aware of existing results showing that SARAH or Hybrid-SARAH type variance reduction techniques are well suited for decentralized non-oblivious learning problems such as MARL. Consequently, the regular Hybrid-SARAH technique cannot be directly applied to MARL; we address this challenge in the sections below.

### Decentralized Policy Gradient

Given a time horizon $H$, we define a trajectory specific to agent $i$ as $\tau^i \triangleq \{s_0, a^i_0, \ldots, s_{H-1}, a^i_{H-1}\}$ under any stationary policy. Following the trajectory $\tau^i$, a cumulative discounted reward is given as $R^i(\tau^i) \triangleq \sum_{h=0}^{H-1} \gamma^h r^i_h$, such that an individual return can be obtained as

$$J^i(x^i) \triangleq \mathbb{E}_{\tau^i \sim p^i(\tau^i|x^i)}[R^i(\tau^i)] = \int R^i(\tau^i)\, p^i(\tau^i|x^i)\, d\tau^i, \qquad (2)$$

where $p^i(\tau^i|x^i)$ is the probability distribution over $\tau^i$, which admits the following expression given the initial distribution $\rho^i_0 = \rho^i(s_0)$. Without loss of generality, we assume that the initial distribution is identical for all agents, namely $\rho(s_0)$. Then we have

$$p^i(\tau^i|x^i) = \rho_0(s_0) \prod_{h=0}^{H-1} P(s_{h+1}|s_h, a^i_h)\, \pi^i(a^i_h|s_h). \qquad (3)$$

For each agent $i$, the goal is to find an optimal policy $\pi^i$ that maximizes the return $J^i(x^i)$. As discussed above, the underlying dynamic distribution results in a non-oblivious learning problem, which is even more significant in MARL. To resolve this issue, the decentralized policy gradient is a natural choice. As background for MARL, we next present how to arrive at the local stochastic policy gradient, which will help characterize the analysis of the proposed algorithms. Computing the gradient of $J^i(x^i)$ w.r.t. $x^i$ yields the following formula:

$$\nabla J^i(x^i) = \int R^i(\tau^i)\, \frac{\nabla p^i(\tau^i|x^i)}{p^i(\tau^i|x^i)}\, p^i(\tau^i|x^i)\, d\tau^i = \mathbb{E}_{\tau^i \sim p^i(\tau^i|x^i)}\big[\nabla \log p^i(\tau^i|x^i)\, R^i(\tau^i)\big]. \qquad (4)$$

In practice, $p^i(\tau^i|x^i)$ is typically unknown, so the exact full policy gradient for agent $i$ is difficult to obtain. Thus, similar to decentralized stochastic gradient descent (Jiang et al. 2017), we estimate the policy gradient by sampling a mini-batch of trajectories $\mathcal{B} = \{\tau^i_m\}_{m=1}^{|\mathcal{B}|}$ from the distribution $p^i(\tau^i|x^i)$ such that

$$\hat{\nabla} J^i(x^i) = \frac{1}{|\mathcal{B}|} \sum_{m \in \mathcal{B}} \nabla \log p^i(\tau^i_m|x^i)\, R^i(\tau^i_m). \qquad (5)$$

In addition, combining Eq. 3, we observe that $\nabla \log p^i(\tau^i_m|x^i)$ is independent of the transition probability $P$. Hence, Eq. 5 can be written as

$$\hat{\nabla} J^i(x^i) = \frac{1}{|\mathcal{B}|} \sum_{m \in \mathcal{B}} g^i(\tau^i_m|x^i), \quad g^i(\tau^i_m|x^i) = \Big(\sum_{h=0}^{H-1} \nabla_{x^i} \log \pi^i(a^{i,m}_h|s^m_h)\Big)\Big(\sum_{h=0}^{H-1} \gamma^h r^i_h(a^{i,m}_h, s^m_h)\Big). \qquad (6)$$

In the above equation, $g^i(\tau^i|x^i)$ is an unbiased estimate of $\nabla J^i(x^i)$, i.e., $\mathbb{E}[g^i(\tau^i|x^i)] = \nabla J^i(x^i)$. Some well-known policy gradient estimators can be obtained through Eq. 6, such as decentralized REINFORCE, which is the direct extension of its centralized version. We refer interested readers to (Huang et al. 2020) for more details.
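For illustration, the following minimal sketch (not the authors' code; it assumes discrete actions, a linear-softmax local policy, and a recorded trajectory of (state features, action, reward) tuples) implements the REINFORCE-style estimator $g^i(\tau^i|x^i)$ of Eq. 6 for a single agent.

```python
import numpy as np

def log_policy_grad(x_i, state_feats, action):
    """Gradient of log pi^i(a|s) w.r.t. x_i for a linear-softmax policy."""
    logits = x_i @ state_feats
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = -np.outer(probs, state_feats)   # -pi(b|s) * phi(s) for every action b
    grad[action] += state_feats            # +phi(s) for the action actually taken
    return grad

def local_policy_gradient(x_i, trajectory, gamma=0.99):
    """g^i(tau^i|x^i): (sum_h grad log pi^i(a_h|s_h)) * (sum_h gamma^h r_h)."""
    score_sum = np.zeros_like(x_i)
    disc_return = 0.0
    for h, (s_feats, a, r) in enumerate(trajectory):
        score_sum += log_policy_grad(x_i, s_feats, a)
        disc_return += (gamma ** h) * r
    return score_sum * disc_return
```

A mini-batch estimate as in Eq. 5 would simply average `local_policy_gradient` over $|\mathcal{B}|$ sampled trajectories.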
## Our Proposed Approach: MDPGT

### Hybrid Importance Sampling SARAH

In this subsection, we propose a hybrid importance sampling version of SARAH, termed HIS-SARAH, for decentralized policy gradient updates. First, we define the importance sampling weight (Metelli et al. 2020) as follows:

$$\upsilon(\tau|x', x) = \frac{p(\tau|x')}{p(\tau|x)} = \prod_{h=0}^{H-1} \frac{\pi_{x'}(a_h|s_h)}{\pi_x(a_h|s_h)}. \qquad (7)$$

As mentioned in the last section, due to the non-oblivious learning problem, $\mathbb{E}_{\tau \sim p(\tau|x)}[g(\tau|x) - g(\tau|x')] \neq \nabla J(x) - \nabla J(x')$. With Eq. 7, we have $\mathbb{E}_{\tau \sim p(\tau|x)}[g(\tau|x) - \upsilon(\tau|x', x)\, g(\tau|x')] = \nabla J(x) - \nabla J(x')$, which has been analyzed in (Huang et al. 2020) for centralized policy optimization methods and will be a key relationship in our proof. We denote by $u^i$ the stochastic policy gradient surrogate for agent $i$. Thus, applying Eq. 7 in a decentralized manner to Hybrid-SARAH (see the supplementary materials for its definition) gives the following update law at time step $k$:

$$u^i_k = \beta g^i(\tau^i_k|x^i_k) + (1-\beta)\big[u^i_{k-1} + g^i(\tau^i_k|x^i_k) - \upsilon^i(\tau^i_k|x^i_{k-1}, x^i_k)\, g^i(\tau^i_k|x^i_{k-1})\big]. \qquad (8)$$

The second term on the right-hand side of Eq. 8 differs from Eq. 13 in the supplementary materials only by the extra importance sampling weight. Intuitively, $\upsilon^i(\tau^i_k|x^i_{k-1}, x^i_k)$ resolves the non-stationarity in MARL and retains the variance reduction property that Hybrid-SARAH enjoys in supervised learning problems. Clearly, each $u^i_k$ is a conditionally biased estimator of $\nabla J^i(x^i_k)$; i.e., conditioned on the past, $\mathbb{E}[u^i_k] \neq \nabla J^i(x^i_k)$ in general. Nevertheless, it can be shown that $\mathbb{E}[u^i_k] = \mathbb{E}[\nabla J^i(x^i_k)]$, which implies that $u^i_k$ acts as a surrogate for the underlying exact full policy gradient. Therefore, $u^i_k$ will simply be called the stochastic policy gradient surrogate for the rest of the analysis. With Eq. 8 in hand, we are now ready to present the algorithmic framework.

**Algorithm 1: MDPGT**

- **Result:** $\bar{x}_K$ chosen uniformly at random from $\{x^i_k, i \in \mathcal{V}\}_{k=1}^{K}$
- **Input:** $x^i_0 = \bar{x}_0 \in \mathbb{R}^d$, $\eta \in \mathbb{R}_+$, $\beta \in (0, 1)$, $W \in \mathbb{R}^{N \times N}$, $v^i_0 = 0_d$, $u^i_{-1} = 0_d$, $K, |\mathcal{B}| \in \mathbb{Z}_+$, $k = 1$
- Initialize the local policy gradient surrogate by sampling a trajectory $\tau^i_0$ from $p^i(\tau^i|x^i_0)$: $u^i_0 = g^i(\tau^i_0|x^i_0)$; or by sampling a mini-batch of trajectories $\{\tau^{i,m}_0\}_{m=1}^{|\mathcal{B}|}$ from $p^i(\tau^i|x^i_0)$: $u^i_0 = \frac{1}{|\mathcal{B}|}\sum_{m=1}^{|\mathcal{B}|} g^i(\tau^{i,m}_0|x^i_0)$
- Initialize the local policy gradient tracker: $v^i_1 = \sum_{j \in Nb(i)} \omega_{ij} v^j_0 + u^i_0 - u^i_{-1}$
- Initialize the local estimate of the policy network parameter: $x^i_1 = \sum_{j \in Nb(i)} \omega_{ij}(x^j_0 + \eta v^j_1)$
- **while** $k < K$ **do**
  - **for** each agent $i$ **do**
    - Sample a trajectory $\tau^i_k$ from $p^i(\tau^i|x^i_k)$ and compute the local policy gradient surrogate $u^i_k$ using Eq. 8
    - Update the local policy gradient tracker: $v^i_{k+1} = \sum_{j \in Nb(i)} \omega_{ij} v^j_k + u^i_k - u^i_{k-1}$
    - Update the local estimate of the policy network parameters: $x^i_{k+1} = \sum_{j \in Nb(i)} \omega_{ij}(x^j_k + \eta v^j_{k+1})$
  - **end**
  - $k = k + 1$
- **end**

### Algorithmic Framework

We first present MDPGT (Algorithm 1), which uses only a single trajectory to initialize the policy gradient surrogate. This leads to significant randomness at the starting point of optimization due to the conditionally biased estimator property, but still retains the same sampling complexity as the SOTA of MARL. To obtain a better initialization, we also present an alternative that samples a mini-batch of trajectories from the distribution (the mini-batch option in Algorithm 1). Surprisingly, we will see that with a properly sized mini-batch initialization, the sampling complexity of our proposed approach matches the best result in centralized RL, which improves on the SOTA of MARL.
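To make the surrogate update concrete, the following minimal sketch (not the authors' code) implements Eq. 8 together with the importance weight of Eq. 7 for one agent. It assumes user-supplied functions `grad_fn(x, traj)`, returning $g^i(\tau|x)$ (e.g., the REINFORCE-style estimator sketched earlier), and `log_prob_fn(x, traj)`, returning $\sum_h \log \pi^i_x(a_h|s_h)$.

```python
import numpy as np

def importance_weight(log_prob_fn, x_prev, x_curr, traj):
    """Eq. (7): prod_h pi_{x_prev}(a_h|s_h) / pi_{x_curr}(a_h|s_h), in log space."""
    return np.exp(log_prob_fn(x_prev, traj) - log_prob_fn(x_curr, traj))

def his_sarah_update(u_prev, x_prev, x_curr, traj, beta, grad_fn, log_prob_fn):
    """Eq. (8): u_k = beta*g(x_k) + (1-beta)*[u_{k-1} + g(x_k) - w*g(x_{k-1})]."""
    g_curr = grad_fn(x_curr, traj)
    g_prev = grad_fn(x_prev, traj)   # same trajectory tau_k, previous parameters
    w = importance_weight(log_prob_fn, x_prev, x_curr, traj)
    return beta * g_curr + (1.0 - beta) * (u_prev + g_curr - w * g_prev)
```

Setting `beta = 1` recovers the vanilla stochastic policy gradient, while `beta = 0` recovers the importance-sampling SARAH recursion, mirroring the degenerate cases discussed below.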
A brief outline of Algorithm 1 is as follows. The initialization of the policy gradient surrogate $u^i_0$ is based on either a single trajectory sampled from $p^i(\tau^i|x^i_0)$ or a mini-batch of trajectories. Subsequently, the policy gradient tracker and the network parameters are initialized based on $u^i_0$. The core part of the algorithm consists of the individual updates of $u^i_k$, $v^i_k$, and $x^i_k$. By controlling the value of $\beta$ in Eq. 8, MDPGT degenerates to either the vanilla decentralized policy gradient ($\beta = 1$) or a decentralized version of SRVR-PG (Xu, Gao, and Gu 2019) ($\beta = 0$), both with the gradient tracking step. In our work, to emphasize the impact of the trade-off on the policy gradient surrogate, we keep $\beta \in (0, 1)$, which makes $\beta$ act much like the momentum coefficient in accelerated SGD algorithms (Singh et al. 2020). We emphasize that we are unaware of theoretical results for decentralized SRVR-PG; hence, the proof techniques presented in this paper can also be applied to that case. Another implication of Algorithm 1 is that at the beginning of each time step $k$, only one trajectory is required for computing the policy gradient, allowing the batch size to be independent of $\epsilon$, i.e., $O(1)$, where we omit the number of agents $N$ when considering the whole networked system.

## Theoretical Convergence

In this section, we present an analysis of MDPGT. Most of the assumptions below are mild and standard in the decentralized optimization and RL literature. Due to space limitations, we defer auxiliary lemmas and proofs to the supplementary materials.

**Assumption 1.** The gradient and Hessian matrix of the function $\log \pi^i(a^i|s)$ are bounded, i.e., there exist constants $C_g, C_h > 0$ such that $\|\nabla \log \pi^i(a^i|s)\| \le C_g$ and $\|\nabla^2 \log \pi^i(a^i|s)\| \le C_h$, for all $i \in \mathcal{V}$.

Note that we drop the subscript $x^i$ from $\pi^i$ for notational simplicity. We do not impose the bounded policy gradient assumption directly, though it can be derived from the above assumption, which has been adopted in centralized RL algorithms (Zhang et al. 2021a; Huang et al. 2020; Xu, Gao, and Gu 2019). Additionally, it also helps derive the smoothness of $J^i(x^i)$, which has typically been imposed as an assumption in the decentralized learning/optimization literature.

**Assumption 2.** The mixing matrix $W \in \mathbb{R}^{N \times N}$ is doubly stochastic and satisfies $\lambda \triangleq \|W - P\| \in [0, 1)$, where $\lambda$ signifies the second largest eigenvalue (in magnitude) of $W$, which measures the algebraic connectivity of the graph, $P = \frac{1}{N}\mathbf{1}\mathbf{1}^\top$, and $\mathbf{1}$ is a column vector with each entry being 1.

**Assumption 3.** The variance of the importance sampling weight $\upsilon^i(\tau^i|x_1, x_2)$ is bounded, i.e., there exists a constant $M > 0$ such that $\mathbb{V}(\upsilon^i(\tau^i|x_1, x_2)) \le M$, for any $x_1, x_2 \in \mathbb{R}^{d_i}$ and $\tau^i \sim p^i(\tau^i|x_1)$, for all $i \in \mathcal{V}$.

Assumption 2 is generic in most previous works on decentralized optimization, though this property has been relaxed in some works (Nedić and Olshevsky 2014). However, we are not aware of any existing work in MARL adopting such a relaxation, and its investigation can be of independent interest. Assumption 3 is specifically introduced for importance sampling-based methods. It is critical for constructing the relationship between $\mathbb{V}(\upsilon^i(\tau^i|x_1, x_2))$ and $\|x_1 - x_2\|^2$, through which the impact of the variance of importance sampling on the convergence can be explicitly quantified. Another typical assumption is the bounded variance of the stochastic gradient, i.e., $\mathbb{E}[\|g^i(\tau^i|x^i) - \nabla J^i(x^i)\|^2] \le \sigma_i^2$. However, in the MARL setting, such a result can be derived from Assumption 1, and we present the formal result in Lemma 1. In this context, we also define $\bar{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N} \sigma_i^2$; the explicit expression of $\bar{\sigma}^2$ is given in the supplementary materials.
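Before turning to the analysis, the following minimal sketch (not the authors' code) restates one MDPGT iteration of Algorithm 1 in stacked matrix form, where `W` is the doubly stochastic mixing matrix of Assumption 2 and `surrogate_fn(i, x_i)` is assumed to sample a local trajectory and return the HIS-SARAH surrogate $u^i_k$ of Eq. 8 for agent $i$.

```python
import numpy as np

def mdpgt_step(X, V, U_prev, W, eta, surrogate_fn):
    """One synchronous MDPGT iteration for all N agents.

    X, V, U_prev: (N, d) arrays holding x^i_k, v^i_k, and u^i_{k-1} row-wise.
    """
    N = X.shape[0]
    U = np.stack([surrogate_fn(i, X[i]) for i in range(N)])  # u^i_k via Eq. (8)
    V_next = W @ V + U - U_prev        # policy gradient tracker update
    X_next = W @ (X + eta * V_next)    # neighbor averaging of local ascent steps
    return X_next, V_next, U
```

Row $i$ of `W @ V` corresponds to $\sum_{j \in Nb(i)} \omega_{ij} v^j_k$, so each agent only needs messages from its neighbors; the dense matrix product is used here purely for compactness.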
### Main Results

We present the main results showing the convergence rate of MDPGT when it is initialized with a mini-batch of trajectories. We denote by $L > 0$ the smoothness constant and by $G > 0$ the upper bound of $\|g^i(\tau^i|x^i)\|$ for all $i \in \mathcal{V}$. We further define a constant $C_\upsilon > 0$ such that $\mathbb{V}(\upsilon^i(\tau^i|x_1, x_2)) \le C_\upsilon^2 \|x_1 - x_2\|^2$. The explicit expressions of these constants are derived in lemmas in the supplementary materials. Note that in our work, $G$ is not directly assumed, but instead derived from Assumption 1.

**Theorem 1.** Let Assumptions 1, 2, and 3 hold. Let the momentum coefficient be $\beta = \frac{96L^2 + 96G^2C_\upsilon}{N}\eta^2$. If MDPGT is initialized with a mini-batch of trajectories of size $|\mathcal{B}|$ and the step size satisfies

$$0 < \eta \le \min\left\{ \frac{(1-\lambda^2)^2}{\lambda\sqrt{12844L^2 + 9792G^2C_\upsilon}},\; \frac{\sqrt{N(1-\lambda^2)}\,\lambda}{31\sqrt{L^2 + G^2C_\upsilon}},\; \frac{1}{\sqrt{6(L^2 + G^2C_\upsilon)}} \right\},$$

then the output $\bar{x}_K$ satisfies, for all $K \ge 2$,

$$\mathbb{E}[\|\nabla J(\bar{x}_K)\|^2] \le \frac{4(J^* - J(\bar{x}_0))}{N|\mathcal{B}|\beta K} + \frac{8\beta\bar{\sigma}^2}{N} + \frac{34\lambda^2\|\nabla \mathbf{J}(\bar{x}_0)\|^2}{KN(1-\lambda^2)^3} + \frac{68\lambda^2\bar{\sigma}^2}{(1-\lambda^2)^3|\mathcal{B}|K} + \frac{204\lambda^2\beta^2\bar{\sigma}^2}{(1-\lambda^2)^3}, \qquad (9)$$

where $J^*$ is the upper bound of $J(x)$ and $\|\nabla \mathbf{J}(\bar{x}_0)\|^2 \triangleq \sum_{i=1}^{N} \|\nabla J^i(\bar{x}_0)\|^2$.

Theorem 1 shows that, as $K \to \infty$, MDPGT converges at a sublinear rate $O(1/K)$ to a steady-state error if $\eta$ and $\beta$ are selected properly, i.e.,

$$\mathbb{E}[\|\nabla J(\bar{x}_K)\|^2] \le \frac{8\beta\bar{\sigma}^2}{N} + \frac{204\lambda^2\beta^2\bar{\sigma}^2}{(1-\lambda^2)^3}. \qquad (10)$$

From Eq. 10, the steady-state error is determined by the number of agents, the variance of the stochastic policy gradient, and the spectral gap of the graph, $1 - \lambda$. Increasing the number of agents leads to a smaller error bound. Though different network topologies imply different error bounds, the higher-order term in $\beta$ reduces the impact of the spectral gap on the error bound. Another implication of Eq. 10 is that $\eta$ and $\beta$ can be reduced to make the steady-state error arbitrarily small, though this in turn slows the convergence. Surprisingly, even though the analysis relies on the bounded stochastic policy gradient derived from Assumption 1, the error bound in Eq. 10 depends only heavily on the variance, which is inherently consistent with most conclusions from the decentralized optimization literature obtained without the bounded stochastic gradient assumption. While $J^*$ is essentially determined by the upper bound of the reward $R$, in this context we still adopt the implicit $J^*$ for convenience. We next analyze the non-asymptotic behavior by appropriately choosing $\eta$, $\beta$, and $|\mathcal{B}|$.

**Corollary 1.** Let $\eta = \frac{N^{2/3}}{8LK^{1/3}}$, $\beta = \frac{DN^{1/3}}{64L^2K^{2/3}}$, and $|\mathcal{B}| = N^{2/3}$ in Theorem 1. We have

$$\mathbb{E}[\|\nabla J(\bar{x}_K)\|^2] \le \underbrace{\frac{256L^3D(J^* - J(\bar{x}_0)) + 2048L^4\bar{\sigma}^2 + D^2\bar{\sigma}^2}{8L^2D(NK)^{2/3}}}_{T_1} + \underbrace{\frac{34\lambda^2\|\nabla \mathbf{J}(\bar{x}_0)\|^2}{KN(1-\lambda^2)^3} + \frac{\lambda^2\bar{\sigma}^2(51D^2 + 69632N^{2/3}L^4)}{1024(1-\lambda^2)^3K^{4/3}L^4}}_{T_2}, \qquad (11)$$

provided that

$$K \ge \max\left\{ \frac{N^2D^{1.5}}{512L^3},\; \frac{29791\sqrt{N}(L^2 + G^2C_\upsilon)^{1.5}}{512L^3\lambda^3(1-\lambda^2)^{1.5}},\; \frac{(12844L^2 + 9792G^2C_\upsilon)^{1.5}N^2\lambda^3}{512L^3(1-\lambda^2)^6} \right\},$$

where $D = 96L^2 + 96G^2C_\upsilon$.

**Remark 1.** An implication of Corollary 1 is that at the early stage of optimization, before $T_1$ in Eq. 11 dominates, the complexity is tightly related to the algebraic connectivity of the network topology through $T_2$, which is measured by the spectral gap $1 - \lambda$. However, $T_2$ is of higher order in $1/K$, so as the optimization moves towards the later stage where $T_1$ dominates, the overall complexity becomes independent of the network topology. For ease of exposition, an immediate consequence of Corollary 1 is that when $K$ is sufficiently large, $\mathbb{E}[\|\nabla J(\bar{x}_K)\|^2] \le O((NK)^{-2/3})$.

Thus, for achieving $\mathbb{E}[\|\nabla J(\bar{x}_K)\|] \le \epsilon$, we use the relationship $\mathbb{E}[\|\nabla J(\bar{x}_K)\|] = \sqrt{(\mathbb{E}[\|\nabla J(\bar{x}_K)\|])^2} \le \sqrt{\mathbb{E}[\|\nabla J(\bar{x}_K)\|^2]} \le \epsilon$. Combining the last two inequalities yields the ultimate sampling complexity of $O(N^{-1}\epsilon^{-3})$, which exhibits a linear speed-up. More importantly, this is $N$ times smaller than the sampling complexity of the centralized momentum-based policy gradient methods (Huang et al. 2020) that run on a single node. However, we know from Corollary 1 that $K$ typically has to be large, which in the linear speed-up regime reduces $\eta$ and $\beta$.
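As a purely illustrative aid (not part of the paper), the helper below instantiates the Corollary 1 choices $\eta = \frac{N^{2/3}}{8LK^{1/3}}$, $\beta = \frac{DN^{1/3}}{64L^2K^{2/3}}$, and $|\mathcal{B}| = N^{2/3}$; the constants $L$, $G$, and $C_\upsilon$ are problem-dependent and would have to be estimated or bounded in practice.

```python
import math

def corollary1_hyperparams(N, K, L, G, C_upsilon):
    """Step size, momentum coefficient, and init batch size from Corollary 1."""
    D = 96.0 * L**2 + 96.0 * G**2 * C_upsilon
    eta = N ** (2.0 / 3.0) / (8.0 * L * K ** (1.0 / 3.0))
    beta = D * N ** (1.0 / 3.0) / (64.0 * L**2 * K ** (2.0 / 3.0))
    batch = max(1, math.ceil(N ** (2.0 / 3.0)))   # |B| = N^(2/3), rounded up
    return eta, beta, batch

# Example with 5 agents, 50k iterations, and hypothetical constants.
print(corollary1_hyperparams(N=5, K=50_000, L=10.0, G=5.0, C_upsilon=1.0))
```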
We also investigate a worse initialization with only a single trajectory sampled from $p^i(\tau^i|x^i_0)$. Without a mini-batch initialization, the eventual sampling complexity is $O(N^{-1}\epsilon^{-4})$ (see the results in the supplementary materials). Though the variance reduction techniques do not reduce the order in $\epsilon^{-1}$ compared to the SOTA approaches, the linear speed-up still makes the complexity $N$ times smaller than that in (Xu, Gao, and Gu 2019; Huang et al. 2020). Additionally, different from traditional decentralized learning problems, MARL exhibits more significant variances during optimization due to its non-oblivious characteristic. Using just a single trajectory for each agent to initialize can be a poor scheme, but the adopted variance reduction techniques can still maintain the SOTA sampling complexity in a decentralized setting. Please refer to the supplementary materials for the formal results and proofs.

**Implication for Gaussian Policy.** We study the sample complexity when the policy function $\pi^i(a^i|s)$ of each agent is explicitly a Gaussian distribution. For a bounded action space $\mathcal{A}^i \subseteq \mathbb{R}$, a Gaussian policy parameterized by $x^i$ is defined as

$$\pi^i(a^i|s) = \frac{1}{\sqrt{2\pi}\,\xi} \exp\left(-\frac{((x^i)^\top\phi^i(s) - a^i)^2}{2\xi^2}\right), \qquad (12)$$

where $\xi$ is a constant standard deviation parameter and $\phi^i(s) : \mathcal{S} \to \mathbb{R}^{d_i}$ is a mapping from the state space to the feature space. Thus, the following formal result can be obtained; a more formal analysis and proof can be found in the supplementary materials.

**Corollary 2.** Let $\pi^i(a^i|s)$ be the Gaussian distribution defined in Eq. 12 with $|a^i| \le C_a$ and $\|\phi^i(s)\| \le C_f$, where $C_a, C_f > 0$, and let $\eta$, $\beta$, $|\mathcal{B}|$ be defined as in Corollary 1. The sampling complexity of attaining the accuracy $\mathbb{E}[\|\nabla J(\bar{x}_K)\|] \le \epsilon$ is $O((1-\gamma)^{-4.5}N^{-1}\epsilon^{-3})$.

## Numerical Experiments and Results

To validate our proposed algorithm, we performed experiments on a cooperative navigation multi-agent environment that has been commonly used as a benchmark in several previous works (Qu et al. 2019; Zhang et al. 2018; Lu et al. 2021). Our platform for cooperative navigation is derived from the particle environment introduced by (Lowe et al. 2017). In our modification, all agents are initialized at random locations with a specific goal in a 2D grid world. Each agent observes the combined position and velocity of itself and all other agents. The agents are capable of moving up, down, left, or right, with the objective of navigating to their respective goals. The reward function of each agent is defined as the negative Euclidean distance of the agent to its goal. Additionally, a penalty of -1 is imposed whenever the agent collides with other agents. Each agent's policy is represented by a 3-layer dense neural network with 64 hidden units and tanh activation functions. The agents were trained for 50k episodes with a horizon of 50 steps and a discount factor of 0.99. For brevity, we present numerical results for only one environment setting with five agents. Additional results with different numbers of agents and a simplified environment, along with computing infrastructure details, are available in the supplementary materials; code to reproduce the results is also available at https://github.com/xylee95/MD-PGT.
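For concreteness, a minimal sketch of the per-agent reward just described (negative Euclidean distance to the goal, minus 1 per collision) is given below; the collision radius is an assumption, since the exact threshold used by the environment is not specified here.

```python
import numpy as np

def agent_reward(agent_pos, goal_pos, other_positions, collision_radius=0.1):
    """Negative distance to goal, with a -1 penalty for each colliding neighbor."""
    agent_pos = np.asarray(agent_pos, dtype=float)
    reward = -np.linalg.norm(agent_pos - np.asarray(goal_pos, dtype=float))
    for other in other_positions:
        if np.linalg.norm(agent_pos - np.asarray(other, dtype=float)) < collision_radius:
            reward -= 1.0
    return reward

# Example: an agent at (0.5, 0.5) heading to (1.0, 1.0), with one nearby agent.
print(agent_reward((0.5, 0.5), (1.0, 1.0), [(0.55, 0.5)]))
```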
Figure 1: Average rewards of MDPGT ($\beta = 0.5$), MDPG, and DPG in a cooperative navigation task for five agents (averaged across five random seeds). The solid lines denote the mean and the shaded regions denote the standard deviation of rewards.

**Efficacy of MDPGT.** Figure 1 illustrates the average training rewards obtained by the five agents in the cooperative navigation gridworld environment. As observed, both MDPG and MDPGT significantly outperform the baseline, denoted as DPG. Comparing MDPG with MDPGT, we observe that while both algorithms initially have similar performance, MDPGT begins to outperform MDPG around iteration 15k. Additionally, when we compare the standard deviation of the rewards, shown as shaded regions, we observe that the standard deviation of MDPGT is also smaller than that of MDPG. In summary, these results validate our theoretical findings that utilizing gradient tracking as a bias correction technique does improve the performance of the algorithm. Additionally, the improvement in sampling complexity over DPG is empirically evident from these results.

Figure 2: Ablation study illustrating the effect of various momentum coefficients $\beta$ on the performance of MDPGT for five agents in the cooperative navigation environment.

**Effect of Momentum Coefficient.** Next, we perform an additional ablation study to investigate the effect of the momentum coefficient $\beta$ on the performance of MDPGT. As shown in Figure 2, the choice of momentum coefficient does indeed have an effect on the performance. A low $\beta$ can induce a faster convergence rate, but at the cost of higher fluctuations in rewards, as seen for $\beta = 0.2$ and $0.3$. Conversely, a high $\beta$ value causes the surrogate to degenerate towards vanilla policy gradients and yields performance similar to DPG, which matches the implication of Eq. 10. Ultimately, for this environment, $\beta = 0.4$ and $0.5$ offer the best balance between convergence rate and the stability/variance of training. Hence, $\beta$ can be viewed as a hyper-parameter that can be tuned to trade off convergence speed against training stability.

Figure 3: Experiment results for five agents in the cooperative navigation environment comparing the effects of different network topologies; $\beta = 0.5$ for all experiments shown.

**Effect of Different Topologies.** Finally, we provide evidence confirming that our proposed method holds under various network topologies. To test our hypothesis, we train five agents in the same cooperative navigation environment using three different network topologies: fully-connected, ring, and bipartite. As seen in Figure 3, the five agents achieve similar rewards despite communicating via different network topologies. This validates our claim in Remark 1.

## Conclusions

This paper proposes a novel MARL algorithm that employs variance reduction techniques to reduce the sampling complexity of decentralized policy-based methods. Specifically, we developed the algorithmic framework and analyzed it in a principled manner. An importance sampling-based stochastic recursive momentum is presented as the policy gradient surrogate, which is taken as input to a policy gradient tracker.
Through theoretical analysis, we find that the proposed method improves the sampling efficiency in decentralized RL settings compared to SOTA methods. To the best of our knowledge, this is the first work that achieves the best available rate, $O(\epsilon^{-3})$, for generic (possibly non-concave) performance functions. Empirical results have shown the superiority of the proposed MDPGT over baseline decentralized policy gradient methods. Future research directions include: 1) testing on more complex decentralized environments to reveal potentially novel and interesting results; 2) extending the proposed method to model-based decentralized RL settings to further improve the sampling efficiency; and 3) theoretically analyzing the robustness of the proposed method under adversarial attacks.

## Acknowledgements

This work was partly supported by the National Science Foundation under grant numbers CNS-1932033, CNS-1845969, and CCF-2005804.

## References

Ackermann, J.; Gabler, V.; Osa, T.; and Sugiyama, M. 2019. Reducing overestimation bias in multi-agent domains using double centralized critics. arXiv preprint arXiv:1910.01465.

Arslan, G.; and Yüksel, S. 2016. Decentralized Q-learning for stochastic teams and games. IEEE Transactions on Automatic Control, 62(4): 1545-1558.

Bono, G.; Dibangoye, J. S.; Matignon, L.; Pereyron, F.; and Simonin, O. 2018. Cooperative multi-agent policy gradient. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 459-476. Springer.

Chen, B.; Xu, M.; Liu, Z.; Li, L.; and Zhao, D. 2020. Delay-Aware Multi-Agent Reinforcement Learning for Cooperative and Competitive Environments. arXiv e-prints, arXiv:2005.

Chu, T.; Chinchali, S.; and Katti, S. 2020. Multi-agent reinforcement learning for networked system control. arXiv preprint arXiv:2004.01339.

Cutkosky, A.; and Orabona, F. 2019. Momentum-based variance reduction in non-convex SGD. arXiv preprint arXiv:1905.10018.

Das, R.; Acharya, A.; Hashemi, A.; Sanghavi, S.; Dhillon, I. S.; and Topcu, U. 2020. Faster Non-Convex Federated Learning via Global and Local Momentum. arXiv preprint arXiv:2012.04061.

Fang, C.; Li, C. J.; Lin, Z.; and Zhang, T. 2018. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. arXiv preprint arXiv:1807.01695.

Helou, R. E.; Kalathil, D.; and Xie, L. 2020. Fully Decentralized Reinforcement Learning-based Control of Photovoltaics in Distribution Grids for Joint Provision of Real and Reactive Power. arXiv preprint arXiv:2008.01231.

Huang, F.; Gao, S.; Pei, J.; and Huang, H. 2020. Momentum-based policy gradient methods. In International Conference on Machine Learning, 4422-4433. PMLR.

Jiang, Z.; Balu, A.; Hegde, C.; and Sarkar, S. 2017. Collaborative deep learning in fixed topology networks. arXiv preprint arXiv:1706.07880.

Johnson, R.; and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26: 315-323.

Karimireddy, S. P.; Jaggi, M.; Kale, S.; Mohri, M.; Reddi, S. J.; Stich, S. U.; and Suresh, A. T. 2020. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606.

Li, M.-L.; Chen, S.; and Chen, J. 2020. Adaptive learning: A new decentralized reinforcement learning approach for cooperative multiagent systems. IEEE Access, 8: 99404-99421.

Li, W.; Jin, B.; Wang, X.; Yan, J.; and Zha, H. 2020. F2A2: Flexible Fully-decentralized Approximate Actor-critic for Cooperative Multi-agent Reinforcement Learning. arXiv preprint arXiv:2004.11145.
Lin, Q.; and Ling, Q. 2021. Decentralized TD(0) with Gradient Tracking. IEEE Signal Processing Letters.

Lin, Y.; Zhang, K.; Yang, Z.; Wang, Z.; Başar, T.; Sandhu, R.; and Liu, J. 2019. A communication-efficient multi-agent actor-critic algorithm for distributed reinforcement learning. In 2019 IEEE 58th Conference on Decision and Control (CDC), 5562-5567. IEEE.

Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; and Mordatch, I. 2017. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. Neural Information Processing Systems (NIPS).

Lu, S.; Zhang, K.; Chen, T.; Başar, T.; and Horesh, L. 2021. Decentralized Policy Gradient Descent Ascent for Safe Multi-Agent Reinforcement Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 8767-8775.

Metelli, A. M.; Papini, M.; Montali, N.; and Restelli, M. 2020. Importance Sampling Techniques for Policy Optimization. Journal of Machine Learning Research, 21(141): 1-75.

Mukherjee, S.; Bai, H.; and Chakrabortty, A. 2020. Model-free decentralized reinforcement learning control of distributed energy resources. In 2020 IEEE Power & Energy Society General Meeting (PESGM), 1-5. IEEE.

Nedić, A.; and Olshevsky, A. 2014. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3): 601-615.

Nguyen, D. T.; Yeoh, W.; Lau, H. C.; Zilberstein, S.; and Zhang, C. 2014. Decentralized multi-agent reinforcement learning in average-reward dynamic DCOPs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28.

Nguyen, L. M.; Liu, J.; Scheinberg, K.; and Takáč, M. 2017. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, 2613-2621. PMLR.

Nguyen, T.; and Mukhopadhyay, S. 2017. Selectively decentralized Q-learning. In 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 328-333. IEEE.

Nguyen, T. T.; Nguyen, N. D.; and Nahavandi, S. 2020. Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications. IEEE Transactions on Cybernetics, 50(9): 3826-3839.

Papini, M.; Binaghi, D.; Canonaco, G.; Pirotta, M.; and Restelli, M. 2018. Stochastic variance-reduced policy gradient. In International Conference on Machine Learning, 4026-4035. PMLR.

Pu, S.; and Nedić, A. 2020. Distributed stochastic gradient tracking methods. Mathematical Programming, 1-49.

Qu, C.; Mannor, S.; Xu, H.; Qi, Y.; Song, L.; and Xiong, J. 2019. Value propagation for decentralized networked deep multi-agent reinforcement learning. arXiv preprint arXiv:1901.09326.

Ryu, H.; Shin, H.; and Park, J. 2020. Multi-agent actor-critic with hierarchical graph attention network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 7236-7243.

Simões, D.; Lau, N.; and Reis, L. P. 2020. Multi-agent actor centralized-critic with communication. Neurocomputing, 390: 40-56.

Singh, N.; Data, D.; George, J.; and Diggavi, S. 2020. SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized Optimization. arXiv preprint arXiv:2005.07041.

Sun, Y.; Daneshmand, A.; and Scutari, G. 2019. Convergence rate of distributed optimization algorithms based on gradient tracking. arXiv preprint arXiv:1905.02637.

Suttle, W.; Yang, Z.; Zhang, K.; Wang, Z.; Başar, T.; and Liu, J. 2020. A multi-agent off-policy actor-critic algorithm for distributed reinforcement learning. IFAC-PapersOnLine, 53(2): 1549-1554.
Tran-Dinh, Q.; Pham, N. H.; Phan, D. T.; and Nguyen, L. M. 2019. Hybrid stochastic gradient descent algorithms for stochastic nonconvex optimization. arXiv preprint arXiv:1905.05920.

Wang, X.; Wang, C.; Li, X.; Leung, V. C.; and Taleb, T. 2020. Federated deep reinforcement learning for Internet of Things with decentralized cooperative edge caching. IEEE Internet of Things Journal, 7(10): 9441-9455.

Wei, C.-Y.; Lee, C.-W.; Zhang, M.; and Luo, H. 2021. Last-iterate Convergence of Decentralized Optimistic Gradient Descent/Ascent in Infinite-horizon Competitive Markov Games. arXiv preprint arXiv:2102.04540.

Xu, P.; Gao, F.; and Gu, Q. 2019. Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610.

Zeng, S.; Anwar, A.; Doan, T.; Romberg, J.; and Raychowdhury, A. 2020. A decentralized policy gradient approach to multi-task reinforcement learning. arXiv preprint arXiv:2006.04338.

Zhang, J.; Ni, C.; Yu, Z.; Szepesvari, C.; and Wang, M. 2021a. On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method. arXiv preprint arXiv:2102.08607.

Zhang, K.; Yang, Z.; and Başar, T. 2018. Networked multiagent reinforcement learning in continuous spaces. In 2018 IEEE Conference on Decision and Control (CDC), 2771-2776. IEEE.

Zhang, K.; Yang, Z.; and Başar, T. 2019. Decentralized Multi-Agent Reinforcement Learning with Networked Agents: Recent Advances. arXiv preprint arXiv:1912.03821.

Zhang, K.; Yang, Z.; Liu, H.; Zhang, T.; and Başar, T. 2018. Fully decentralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning, 5872-5881. PMLR.

Zhang, K.; Yang, Z.; Liu, H.; Zhang, T.; and Başar, T. 2021b. Finite-sample analysis for decentralized batch multi-agent reinforcement learning with networked agents. IEEE Transactions on Automatic Control.

Zhou, P.; Chen, X.; Liu, Z.; Braud, T.; Hui, P.; and Kangasharju, J. 2020. DRLE: Decentralized Reinforcement Learning at the Edge for Traffic Light Control. arXiv preprint arXiv:2009.01502.