Published in Transactions on Machine Learning Research (01/2023)

Finite-Time Analysis of Decentralized Single-Timescale Actor-Critic

Qijun Luo (qijunluo@link.cuhk.edu.cn)
School of Science and Engineering
Shenzhen Research Institute of Big Data (SRIBD)
The Chinese University of Hong Kong, Shenzhen
Shenzhen, China

Xiao Li (lixiao@cuhk.edu.cn)
School of Data Science
Shenzhen Institute of Artificial Intelligence and Robotics for Society (AIRS)
The Chinese University of Hong Kong, Shenzhen
Shenzhen, China

Reviewed on OpenReview: https://openreview.net/forum?id=KQRv0O8iW4

Decentralized Actor-Critic (AC) algorithms have been widely utilized for multi-agent reinforcement learning (MARL) and have achieved remarkable success. Despite this empirical success, the theoretical convergence properties of decentralized AC algorithms are largely unexplored. Most of the existing finite-time convergence results are derived based on either a double-loop update or a two-timescale step size rule, and this is the case even for the centralized AC algorithm in the single-agent setting. In practice, the single-timescale update is widely utilized, where actor and critic are updated in an alternating manner with step sizes of the same order. In this work, we study a decentralized single-timescale AC algorithm. Theoretically, using linear approximation for value and reward estimation, we show that the algorithm has sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-2})$ under Markovian sampling, which matches the optimal complexity attained with a double-loop implementation (here, $\tilde{\mathcal{O}}(\cdot)$ hides a logarithmic term). When we reduce to the single-agent setting, our result yields a new sample complexity bound for centralized AC using a single-timescale update scheme. Central to establishing our complexity results is the hidden smoothness of the optimal critic variable that we reveal. We also provide a local-action privacy-preserving version of our algorithm and its analysis. Finally, we conduct experiments to show the superiority of our algorithm over existing decentralized AC algorithms.

1 Introduction

Multi-agent reinforcement learning (MARL) (Littman, 1994; Vinyals et al., 2019) has been successful in various multi-agent systems, such as robotics (Lillicrap et al., 2015), autonomous driving (Yu et al., 2019), Go (Silver et al., 2017), etc. MARL has been extensively explored in the past decades; see, e.g., (Lowe et al., 2017; Omidshafiei et al., 2017; Zhang et al., 2021; Son et al., 2019; Espeholt et al., 2018; Rashid et al., 2018). These works either focus on the setting where a central controller is available, or assume a common reward function for all agents. Among the many cooperative MARL settings, the work (Zhang et al., 2018) proposed fully decentralized MARL with networked agents. In this setting, each agent maintains a private heterogeneous reward function, and agents can only access local/neighboring information through communicating with their neighboring agents on the network. The objective of all agents is then to jointly maximize the average long-term reward through interacting with the environment, which is modeled by a multi-agent Markov decision process (MDP). They proposed the decentralized Actor-Critic (AC) algorithm to solve this MARL problem, and showed its impressive performance.
However, the theoretical convergence properties of this class of decentralized AC algorithms are largely unexplored; see (Zhang et al., 2021) for a comprehensive survey. In this work, our goal is to establish finite-time convergence results under this fully decentralized MARL setting. We first review some recent progress on this line of research below.

Related works and motivations. The first fully decentralized AC algorithm with a provable convergence guarantee was proposed by (Zhang et al., 2018), who achieved asymptotic convergence results under two-timescale step sizes, which require the actor's step sizes to diminish at a faster rate than the critic's step sizes. Sample complexities of decentralized AC were established recently. In particular, (Chen et al., 2022) and (Hairi et al., 2022) independently proposed two communication-efficient decentralized AC algorithms with the optimal sample complexity of $\mathcal{O}(\varepsilon^{-2}\log(\varepsilon^{-1}))$ under the Markovian sampling scheme. Nevertheless, their analyses are based on a double-loop implementation, where each policy optimization step follows a nearly accurate critic optimization step (a.k.a. policy evaluation), i.e., solving the critic optimization subproblem to $\varepsilon$-accuracy. Such a double-loop scheme requires careful tuning of two additional hyper-parameters, namely the batch size and the inner loop size. In particular, the batch size and the inner loop size need to be of order $\mathcal{O}(\varepsilon^{-1})$ and $\mathcal{O}(\log(\varepsilon^{-1}))$, respectively, in order to achieve their sample complexity results. In practice, a single-loop algorithmic framework is often utilized, where one updates the actor and critic in an alternating manner by performing a constant number of iterations for each subproblem; see, e.g., (Schulman et al., 2017; Lowe et al., 2017; Lin et al., 2019; Zhang et al., 2020). The work (Zeng et al., 2022) proposed a new decentralized AC algorithm based on such a single-loop alternating update. However, they have to adopt a two-timescale step size rule to ensure convergence, which requires the actor's step sizes to diminish at a faster rate than the critic's step sizes. Due to this separation of the step sizes, the critic optimization subproblem is solved exactly only as the number of iterations tends to infinity. Such a restriction on the step sizes slows down the convergence of the algorithm. As a consequence, they only obtain a sub-optimal sample complexity of $\mathcal{O}(\varepsilon^{-5/2})$. In practice, most algorithms are implemented with a single-timescale step size rule, where the step sizes for the actor's and critic's updates are of the same order. Though there are some theoretical achievements for the single-timescale update in other areas such as TDC (Wang et al., 2021) and bi-level optimization (Chen et al., 2021a), similar theoretical understanding in the AC setting is largely unexplored. Indeed, even when reducing to the single-agent setting, the convergence property of the single-timescale AC algorithm is not well established. The works (Fu et al., 2021; Guo et al., 2021) established finite-time convergence results under a special single-timescale implementation, where they attained a sample complexity of $\mathcal{O}(\varepsilon^{-2})$. Their analysis is based on an algorithm where the critic optimization step is formulated as a least-squares temporal difference (LSTD) problem at each iteration, which requires sampling transition tuples $\mathcal{O}(\varepsilon^{-1})$ times to form the data matrix in the LSTD subproblem. Then, they solve the LSTD subproblem in closed form by inverting a large matrix.
Later, (Chen et al., 2021a) obtained the same sample complexity using TD(0) update for critic variables under i.i.d. sampling. Their analysis highly relies on the assumption that the Jacobian of the stationary distribution is Lipschitz continuous, which is not justified in their work. The above observations motivate us to ask the following question: Can we establish finite-time convergence result for decentralized AC algorithm with single-timescale step sizes rule?1 Main contributions. By answering this question positively, we have the following contributions: We design a decentralized AC algorithm, which employs a single-timescale step sizes rule and adopts Markovian sampling scheme. The proposed algorithm allows communication between agents for every Kc iterations with Kc being any integer lies in [1, O(ε 1 2 )], rather than communicating at each iteration as adopted by previous single-loop decentralized AC algorithms (Zeng et al., 2022; Zhang et al., 2018). Using linear approximation for value and reward estimation, we establish the finite-time convergence result for the proposed algorithm under standard assumptions. In particular, we show that the algorithm has a sample complexity of O(ε 2), which matches the optimal complexity up to a 1As convention in (Fu et al., 2021), when we use "single-timescale", it means we utilize a single-loop algorithmic framework with single-timescale step sizes rule. Published in Transactions on Machine Learning Research (01/2023) logarithmic term. In addition, we show that the logarithmic term hidden in the O can be removed under the i.i.d. sampling scheme. These convergence results are valid for all the above mentioned choices for Kc. To preserve privacy of local actions, we propose a variant of our algorithm which utilizes noisy local rewards for estimating global rewards. We show that such an algorithm will maintain the optimal sample complexity at the expense of communicating at each iteration. Our key technical result is to reveal the hidden smoothness of the optimal critic variable, so that we can derive a sufficient descent on the averaged critic s optimal gap under the single-timescale update. Consequently, we can resort to the classic convergence analysis for alternating optimization algorithms to establish the approximate ascent property of the overall optimization process, which leads to the final sample complexity results. We also designed a Lyapunov function to analyze the descent of the objective function with a single-timescale update under the decentralized setting. When we reduce to the non-decentralized case, i.e., the single-agent setting, our results yield new sample complexity guarantees for the classic centralized AC algorithm using a single-timescale update scheme. Discussion on a concurrent work. We note that there is a concurrent work (Olshevsky & Gharesifard, 2023) which also analyzes the single-timescale AC algorithm and achieves similar complexity results. Their analysis is based on the small gain theorem, which is different from ours. These two analysis frameworks provide useful insights for the AC algorithm from different perspectives. (Olshevsky & Gharesifard, 2023) shows that the coupled expression on the errors of actor and critic can be fit into a non-linear small gain theorem framework, which bounds the actor s error by desired order. Our analysis reveals the hidden smoothness of the optimal critic variable so that approximate descent on the critic s objective can be achieved. 
In addition, (Olshevsky & Gharesifard, 2023) considers the single-agent setting while our analysis deals with the more general decentralized setting. Moreover, (Olshevsky & Gharesifard, 2023) analyzes the i.i.d. sampling scheme, where the single agent is assumed to have access to transition tuples from the stationary distribution and the discounted state-visitation distribution. By contrast, our setting considers the practical Markovian sampling scheme, where the transition tuples come from the trajectory generated during the update of the agents.

2 Preliminary

In this section, we introduce the problem formulation and the policy gradient theorem, which serve as preliminaries for the analyzed decentralized AC algorithm. Suppose there are multiple agents aiming to independently optimize a common global objective, and each agent can communicate with its neighbors through a network. To model the topology, we define the graph as $\mathcal{G} = (\mathcal{N}, \mathcal{E})$, where $\mathcal{N}$ is the set of nodes with $|\mathcal{N}| = N$ and $\mathcal{E}$ is the set of edges with $|\mathcal{E}| = E$. In the graph, each node represents an agent, and each edge represents a communication link. The interaction between agents follows the networked multi-agent MDP.

2.1 Markov decision process

A networked multi-agent MDP is defined by a tuple $(\mathcal{G}, \mathcal{S}, \{\mathcal{A}^i\}_{i\in[N]}, \mathcal{P}, \{r^i\}_{i\in[N]}, \gamma)$. Here, $\mathcal{G}$ denotes the communication topology (the graph), $\mathcal{S}$ is the finite state space observed by all agents, and $\mathcal{A}^i$ represents the finite action space of agent $i$. Let $\mathcal{A} := \mathcal{A}^1 \times \cdots \times \mathcal{A}^N$ denote the joint action space and $\mathcal{P}(s'|s, a) : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ denote the transition probability from any state $s \in \mathcal{S}$ to any state $s' \in \mathcal{S}$ for any joint action $a \in \mathcal{A}$. $r^i : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the local reward function that determines the reward received by agent $i$ given the transition $(s, a)$; $\gamma \in [0, 1)$ is the discount factor. For simplicity, we will use $a := [a^1, \dots, a^N]$ to denote the joint action, and $\theta \in \mathbb{R}^{Nd_\theta}$ to denote the concatenation of all actors' parameters, with $\theta^i \in \mathbb{R}^{d_\theta}$. Here, without loss of generality, we assume that every agent has the same number of parameters for notational brevity. The MDP proceeds as follows. For a given state $s$, each agent makes its decision $a^i$ based on its policy, $a^i \sim \pi_{\theta^i}(\cdot|s)$. The state transitions to the next state $s'$ based on the joint action of all the agents: $s' \sim \mathcal{P}(\cdot|s, a)$. Then, each agent receives its own reward $r^i(s, a)$. For notational brevity and without loss of generality, we assume that the reward function is deterministic and does not depend on the next state. The stationary distribution induced by the policy $\pi_\theta$ and the transition kernel is denoted by $\mu_{\pi_\theta}(s)$. Our objective is to find a set of policies that maximizes the accumulated discounted mean reward received by the agents:

$$\theta^* = \arg\max_{\theta}\; J(\theta) := \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^k \bar{r}(s_k, a_k)\Big]. \qquad (1)$$

Here, $k$ represents the time step, and $\bar{r}(s_k, a_k) := \frac{1}{N}\sum_{i=1}^{N} r^i(s_k, a_k)$ is the mean reward among agents at time step $k$. The randomness of the expectation comes from the initial state distribution $\mu_0(s)$, the transition kernel $\mathcal{P}$, and the stochastic policies $\pi_{\theta^i}(\cdot|s)$.

2.2 Policy gradient Theorem

Under the discounted reward setting, the global state-value function, action-value function, and advantage function for the policy set $\theta$, state $s$, and action $a$ are defined as

$$V_{\pi_\theta}(s) := \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^k \bar{r}(s_k, a_k) \,\Big|\, s_0 = s\Big], \quad Q_{\pi_\theta}(s, a) := \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^k \bar{r}(s_k, a_k) \,\Big|\, s_0 = s, a_0 = a\Big], \quad A_{\pi_\theta}(s, a) := Q_{\pi_\theta}(s, a) - V_{\pi_\theta}(s). \qquad (2)$$
To maximize the objective function defined in (1), the policy gradient (Sutton et al., 2000) can be computed as follow θJ(θ) = Es dπθ ,a πθ 1 1 γ Aπθ(s, a)ψπθ(s, a) , where dπθ(s) := (1 γ) P k=0 γk P(sk = s) is the discounted state visitation distribution under policy πθ, and ψπθ(s, a) := log πθ(s, a) is the score function. Following the derivation of (Zhang et al., 2018), the policy gradient for each agent under discounted reward setting can be expressed as θi J(θ) = Es dπθ ,a πθ 1 1 γ Aπθ(s, a)ψπθi(s, ai) . (3) 3 Algorithms 3.1 Decentralized single-timescale Actor-Critic We introduce the decentralized single-timescale AC algorithm; see Algorithm 1. In the remaining parts of this section, we will explain the updates in the algorithm in details. In fully-decentralized MARL, each agent can only observe its local reward and action, while trying to maximize the global reward (mean reward) defined in (1). The decentralized AC algorithm solves the problem by updating actor and critic variables alternatively on an online trajectory. Specifically, we have N pairs of actor and critic. In order to maximize J(θ), each critic tries to estimate the global state-value function Vπθ(s) defined in (2). Then, each actor updates its policy parameter based on approximated policy gradient. We now provide more details about the algorithm. Critics update. We will use ωi Rdω to denote the ith critic s parameter and ω := 1 N PN i=1 ωi to represent the averaged parameter of critic. Each critic approximates the global value function as Vπθ(s) ˆVωi(s). Published in Transactions on Machine Learning Research (01/2023) Algorithm 1: Decentralized single-timescale AC (reward estimator version) 1: Initialize: Actor parameter θ0, critic parameter ω0, reward estimator parameter λ0, initial state s0. 2: for k = 0, , K 1 do 3: Option 1: i.i.d. sampling: 4: sk µθk(s), ak πθk( |sk), sk+1 P( |sk, ak). 5: Option 2: Markovian sampling: 6: ak πθk( |sk), sk+1 P( |sk, ak). 7: 8: Periodical consensus: Compute ωi k and λi k by (4) and (7). 9: 10: for i = 0, , N in parallel do 11: Reward estimator update: update λi k+1 by (8). 12: Critic update: Update ωi k+1 by (5). 13: Actor update: Update θi k+1 by (6). 14: end for 15: end for The critic s approximation error can be categorized into two parts, namely, the consensus error 1 N PN i=1 ωi ω , which measures how close the critics parameters are; and the approximation error ω ω (θ) , which measures the approximation quality of averaged critic. In order for critics to reach consensus, each critic exchanges its parameters with neighbors and perform the following update (PN j=1 W ijωj k if k mod Kc = 0 ωi k otherwise. (4) Here, Kc denotes the consensus frequency. The communication matrix W Rn n is usually determined artificially in practice and can be sparse, which means that the number of neighbors for each agent is much fewer than the total number of agents. Thus, the cost for each consensus step is usually much lower than a full synchronization over the network. The detailed requirements of matrix W will be discussed in Assumption 5. To reduce the approximation error, we will perform the local TD(0) update (Tsitsiklis & Van Roy, 1997) as ωi k+1 = ΠRω( ωi k + βkgi c(ξk, ωi k)), (5) where ξ := (s, a, s ) represents a transition tuple, gi c(ξ, ω) := δi(ξ, ω) ˆVω(s) is the update direction, δi(ξ, ω) := ri(s, a) + γ ˆVω(s ) ˆVω(s) is the local temporal difference error (TD-error). βk is the step size for critic at iteration k. 
ΠRω projects the parameter into a ball of radius of Rω containing the optimal solution, which will be explained when discussing Assumptions 1 and 2. Actors update. We will use stochastic gradient ascent to update the policy s parameter, which is calculated based on policy gradient theorem in (3). The advantage function Aπθ(s, a) can be estimated by δ(ξ, θ) := r(s, a) + γVπθ(s ) Vπθ(s), with a sampled from πθ( |s). However, to preserve the privacy of each agents, the local reward cannot be shared to other agents under the fully decentralized setting. Thus, the averaged reward r(sk, ak) is not directly attainable. To this end, we adopt the strategy proposed in (Zhang et al., 2018) to approximate the averaged reward. In particular, each agent i will have a local reward estimator with parameter λi Rdλ, which estimates the global averaged reward as r(sk, ak) ˆrλi(sk, ak). Thus, the update of the ith actor is given by θi k+1 = θi k + αkˆδ(ξk, ωi k+1, λi k+1)ψπθi k (sk, ai k), (6) Published in Transactions on Machine Learning Research (01/2023) Algorithm 2: Decentralized single-timescale AC (noisy reward version) 1: Initialize: Actor parameter θ0, critic parameter ω0, initial state s0. 2: for k = 0, , K 1 do 3: Option 1: i.i.d. sampling: 4: sk µθk(s), ak πθk( |sk), sk+1 P( |sk, ak). 5: Option 2: Markovian sampling: 6: ak πθk( |sk), sk+1 P( |sk, ak). 7: 8: Periodical consensus: Compute ωi k by (4). 9: 10: for i = 0, , N in parallel do 11: Global reward estimation: estimate rk(sk, ak) by (9). 12: Critic update: Update ωi k+1 by (5). 13: Actor update: Update θi k+1 by (10). 14: end for 15: end for where ˆδ(ξ, ω, λ) := ˆrλ(s, a) + γ ˆVω(s ) ˆVω(s) is the approximated advantage function. αk is the step size for actor s update at iteration k. Reward estimators update. Similar to critic, each reward estimator s approximation error can be decomposed into consensus error and the approximation error. For each local reward estimator, we perform the consensus step to minimize the consensus error as (PN j=1 W ijλj k if k mod Kc = 0 λi k otherwise. (7) To reduce the approximation error, we perform a local update of stochastic gradient descent. λi k+1 = ΠRλ( λi k + ηkgi r(ξk, λi k)), (8) where gi r(ξ, λ) := (ri(s, a) ˆrλ(s, a)) ˆrλ(s, a) is the update direction. ηk is the step size for reward estimator at iteration k. Note the calculation of gi r(ξ, λ) does not depend on the next state s ; we use ξ in (8) just for notation brevity. Similar to critic s update, ΠRλ projects the parameter into a ball of radius of Rλ containing the optimal solution. In our Algorithm 1, we will use the same order for αk, βk, and ηk and hence, our algorithm is in single-timescale. Linear approximation for analysis. In our analysis, we will use linear approximation for both critic and reward estimator variables, i.e. ˆVω(s) := ϕ(s)T ω; ˆrλ(s, a) := φ(s, a)T λ, where ϕ(s) : S Rdω and φ(s, a) : S A Rdλ are two feature mappings, whose property will be specified in the discussion of Assumption 1. Remarks on sampling scheme. Acquiring unbiased stochastic gradients for critic and actor variables requires sampling from µπθ and dπθ, respectively. However, in practical implementations, states are usually collected from an online trajectory (Markovian sampling), whose distribution is generally different from µπθ and dπθ. Such a distribution mismatch will inevitably cause biases during the update of critic and actor variables. One has to bound the corresponding error terms when analyzing the algorithm. 
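To make the per-iteration structure of Algorithm 1 concrete, the following is a minimal NumPy sketch of one iteration of the updates (4)-(8) under linear approximation. It only illustrates the update rules: the environment interaction is abstracted away, and the helper names (`phi`, `varphi`, `score`, `project`) are placeholders rather than part of any released implementation.

```python
import numpy as np

def project(x, radius):
    """Project x onto the Euclidean ball of the given radius."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def algorithm1_iteration(theta, omega, lam, W, transition, rewards,
                         phi, varphi, score, gamma, alpha, beta, eta,
                         R_omega, R_lambda, k, Kc):
    """One iteration of Algorithm 1 (sketch); the transition is already sampled.

    theta   : length-N list of actor parameters theta^i
    omega   : (N, d_omega) array of critic parameters, one row per agent
    lam     : (N, d_lambda) array of reward-estimator parameters, one row per agent
    W       : (N, N) doubly stochastic communication matrix
    transition = (s, a, s_next) with a the joint action; rewards[i] = r^i(s, a)
    phi(s), varphi(s, a)      : value / reward feature maps with norm <= 1
    score(i, theta_i, s, a_i) : local score function grad_{theta^i} log pi_{theta^i}(a_i|s)
    """
    s, a, s_next = transition
    N = omega.shape[0]
    f_v, f_v_next, f_r = phi(s), phi(s_next), varphi(s, a)

    # Periodic consensus steps (4) and (7): mix parameters with neighbors every Kc iterations.
    if k % Kc == 0:
        omega, lam = W @ omega, W @ lam

    for i in range(N):
        # Reward-estimator update (8): stochastic gradient step fitting the averaged reward.
        lam[i] = project(lam[i] + eta * (rewards[i] - f_r @ lam[i]) * f_r, R_lambda)

        # Critic update (5): local TD(0) step driven by the local reward r^i.
        td_err = rewards[i] + gamma * f_v_next @ omega[i] - f_v @ omega[i]
        omega[i] = project(omega[i] + beta * td_err * f_v, R_omega)

        # Actor update (6): ascent along the approximate policy gradient, using the
        # estimated global reward and the freshly updated critic (single-timescale order).
        adv_hat = f_r @ lam[i] + gamma * f_v_next @ omega[i] - f_v @ omega[i]
        theta[i] = theta[i] + alpha * adv_hat * score(i, theta[i], s, a[i])

    return theta, omega, lam
```

The noisy-reward variant (Algorithm 2) follows the same pattern, but replaces the reward-estimator step with the consensus-based global reward estimate (9) of Section 3.2.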
3.2 Variant for preserving local action Note that in Algorithm 1, the reward estimators need the knowledge of joint actions in order to estimate the global rewards. Inspired by (Chen et al., 2022), we further propose a variant of Algorithm 1 to preserve the privacy of local actions. It estimates the global rewards by communicating noisy local rewards. As a trade-off, the approach requires O(log(ε 1)) communication rounds for each iteration; see Algorithm 2. Published in Transactions on Machine Learning Research (01/2023) Let ri k represents ri k(sk, ak) for brevity. The reward estimation process goes as follow: for each agent i, we first produce a noisy local reward ri k = ri k(1 + z), with z N(0, σ2). Thus, the noise level is controlled by the variance σ2, which is chosen artificially. When the noise level σ2 increases, the local reward s privacy will be strengthen. In the meantime, the variance of the estimated global reward will increase. To estimate the global reward, each agent i first initialize the estimation as ri k,0 = ri k. Then, each agent i perform the following consensus step for Kr times, i.e. j=1 W ij ri k,l, l = 0, 1, , Kr 1. (9) The reward ri k,Kr will be used for estimating the global reward for agent i at kth iteration. As we will see, the error | ri t,l+1 1 N PN i=1 ri k| will converge to 0 linearly. Hence, to reduce the error to ε, we need Kr = O(log(ε 1)) rounds of communications for each iteration. Based on the estimated global reward, the ith actor s update is given by θi k+1 = θi k + αk( ri k,Kr + γ ˆVωi(s ) ˆVωi(s))ψπθi k (sk, ai k). (10) 4 Main results In this section, we first introduce the technical assumptions used for our analysis, which are standard in the literature. Then, we present the convergence results for both actor and critic variables. 4.1 Assumptions Assumption 1 (boundedness of rewards and feature vectors). The local rewards are uniformly bounded, i.e., there exists a positive constant rmax such that for all feasible (s, a) and i [N], we have | ri(s, a) | rmax. The norm of feature vectors are bounded such that for all s S, a A, ϕ(s) 1, φ(s, a) 1.2 Assumption 1 is standard and commonly adopted; see, e.g., (Bhandari et al., 2018; Xu et al., 2020; Zeng et al., 2022; Shen et al., 2020; Qiu et al., 2019). This assumption can be achieved via normalizing the feature vectors. Assumption 2 (sufficient exploration). There exists two positive constants λϕ, λφ such that for all policy πθ, the following two matrices are negative definite Aθ,ϕ := Es µθ(s)[ϕ(s)(γϕ(s )T ϕ(s)T )] Aθ,φ := Es µθ(s),a πθ( |s)[ φ(s, a)φ(s, a)T ], with λmax(Aθ,ϕ) λϕ, λmax(Aθ,φ) λφ, where λmax( ) represents the largest eigenvalue. The Assumption 2 characterizes a strong convexity-like property of critic and reward estimator s objective function, and thereby ensures sufficient decrease of the estimation error for each update. It will be satisfied when infθ,s,a πθ(a|s) c for all policy πθ, s S, a A with c being positive. Thus, it can be understood as an exploration assumption on policy πθ. (see Proposition 3.1 of (Olshevsky & Gharesifard, 2023) for more detail). This assumption is widely seen in analysis of AC algorithms; see, e.g. (Shen et al., 2020; Xu & Liang, 2021; Zeng et al., 2022). Together with Assumption 1, we can show that ω (θ) Rω := rmax λϕ , λ (θ) Rλ := rmax λφ , which justifies the projection step. In practice, one can estimate Rω and Rλ online; see Section 8.2 of (Bhandari et al., 2018) for one approach. We provide more details for the projection in Appendix C. 
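For the critic, this radius can be read off from the fixed-point characterization used in Appendix C. With $A_{\theta,\phi} := \mathbb{E}\big[\phi(s)(\gamma\phi(s') - \phi(s))^{\top}\big]$ and $b_{\theta,\phi} := \mathbb{E}\big[\phi(s)\,\bar{r}(s,a)\big]$, the optimal critic satisfies $A_{\theta,\phi}\,\omega^*(\theta) + b_{\theta,\phi} = 0$, and Assumption 2 implies $\|A_{\theta,\phi}^{-1}\| \le 1/\lambda_\phi$, so that

$$\|\omega^*(\theta)\| = \big\|A_{\theta,\phi}^{-1}\, b_{\theta,\phi}\big\| \le \frac{\big\|\mathbb{E}[\phi(s)\,\bar{r}(s,a)]\big\|}{\lambda_\phi} \le \frac{r_{\max}}{\lambda_\phi} = R_\omega,$$

where the last inequality uses Assumption 1; the bound $\|\lambda^*(\theta)\| \le r_{\max}/\lambda_\varphi = R_\lambda$ for the reward estimator is obtained in the same way.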
Assumption 3 (Lipschitz properties of policy). There exists constants Cψ, Lψ, Lπ such that for all policy parameter θ, θ , s S and a A, we have (1). |πθ(a|s) πθ (a|s)| Lπ θ θ ; (2). ψθ(s, a) ψθ (s, a) Lψ θ θ ; (3). ψθ(s, a) Cψ. 2Through out the paper, we will use to represent the Euclidean norm for vectors and Frobenius norm for matrices. Published in Transactions on Machine Learning Research (01/2023) Assumption 3 is common for analyzing policy-based algorithms; see, e.g., (Xu et al., 2019; Wu et al., 2020; Hairi et al., 2022). The assumption implies the smoothness of objective function J(θ). It holds for policy classes such as tabular softmax policy (Agarwal et al., 2020), Gaussian policy (Doya, 2000), and Boltzmann policy (Konda & Borkar, 1999). Assumption 4 (mixing of Markov chain). There exists constants κ > 0 and ρ (0, 1) such that sup s S d T V (P(sk |s0 = s, πθ), µθ) κρk, k. Assumption 4 is a standard assumption; see, e.g. (Bhandari et al., 2018; Wu et al., 2020; Xu et al., 2019). The assumption always holds for irreducible and aperiodic Markov chain. It ensures the geometric convergence of state to the stationary distribution. Assumption 5 (doubly stochastic weight matrix). The communication matrix W is doubly stochastic, i.e. each column/row sum up to 1. Moreover, the second largest singular value ν is smaller than 1. Assumption 5 is a common assumption in decentralized optimization and multi-agent reinforcement learning; see, e.g., (Sun et al., 2020; Chen et al., 2021b; 2022). It ensures the convergence of consensus error for critic and reward estimator variables. 4.2 Sample complexity for Algorithm 1 Theorem 1. Suppose Assumptions 1-5 hold. Consider the update of Algorithm 1 under Markovian sampling. Let αk = α K for some positive constant α, βk = C9 2λϕ αk, and ηk = C10 2λφ αk and Kc O(K1/4), where K is the total number of iterations. Then, we have i=1 E h ωi k ω (θk) 2i O log2 K i=1 E h θi J(θk) 2i O log2 K + O (εapp + εsp) , (11) where C9, C10 are positive constants defined in proof. The proof of Theorem 1 can be found in Appendix D.1. It establishes the iteration complexity of O(log2 K/ K), or equivalently, sample complexity of O(ε 2) for Algorithm 1. Note that actors, critics, and reward estimators use the step size of the same order. The rate matches the state-of-the-art sample complexity of decentralized AC algorithms up to a logarithmic term, which are implemented in double-loop fashion (Hairi et al., 2022; Chen et al., 2022). The approximation error is defined as εapp := max θ,a Es µθ Vπθ(s) ˆVω (θ)(s) 2 + r(s, a) ˆrλ (θ)(s, a) 2 . (12) The error εapp captures the approximation power of critic and reward estimator. When using function approximation, such an error is inevitable. Similar terms also appear in the literature (see, e.g., (Xu et al., 2020; Agarwal et al., 2020; Qiu et al., 2019)). εapp becomes zero in tabular case. The error εsp represents the mismatch between the discounted state visitation distribution dπθ and stationary distribution µπθ. It is defined as εsp := 4C2 θ logρ κ 1 + 1 By policy gradient theorem (3), the states should be sampled from discounted state visitation distribution in order to attain unbiased estimation of policy gradient. Nevertheless, the state distribution converges to stationary distribution µπθ due to Markov chain s mixing, which inevitably introduces the sampling error εsp. Similar terms also appear in (Zeng et al., 2022; Shen et al., 2020). When γ is close to 1, the error becomes small. 
This is because dπθ approaches to µπθ when γ goes to 1. In the literature, some works assume that sampling from dπθ is permitted, thus eliminate this error; see, e.g., (Chen et al., 2021a). Published in Transactions on Machine Learning Research (01/2023) Complexity result under i.i.d. sampling. Under the i.i.d. sampling scheme, state can be directly sampled from µπθ and dπθ. In this case, the logarithmic term caused by the Markovian mixing time, and the error εsp caused by the distribution mismatch, can be avoided. In this sense, one can attain the iteration complexity of O(1/ K), or equivalently, sample complexity of O(ε 2). 4.3 Sample complexity for Algorithm 2 Theorem 2. Suppose Assumptions 1-5 hold. Consider the update of Algorithm 2 under Markovian sampling. Let αk = α K for some positive constant α and βk = C9 2λϕ αk, Kr = log(K1/2), Kc O(K1/4), where K is the total number of iterations. Then, we have i=1 E ωi k ω (θk) 2 O log2 K i=1 E θi J(θk) 2 O log2 K + O(εc app + εsp), (13) where the constants are defined in proof. The proof of Theorem 2 can be found in Appendix D.2. It establishes the sample complexity of O(ε 2) for Algorithm 2. The εc app captures the approximation error of the critic variables, which is defined as εc app := max θ Es µθ Vπθ(s) ˆVω (θ)(s) 2 . The Algorithm 2 preserves the privacy of local actions and requires less parameters than Algorithm 1 since there is no reward estimator. The cost is that it needs to communicate O(log(ε 1)) times for each iteration. 4.4 Proof sketch We present the main elements for the proof of Theorem 1, which helps in understanding the difference between classical two-timescale/double-loop analysis and our single-timescale analysis. The proof of Theorem 2 follows the similar framework. Under Markovian sampling, it is possible to show the following inequality, which characterizes the ascent of the objective. E[J(θk+1)] J(θk) 2 E θi J(θk) 2 + αk 2 E gi a(ξk, ωi k+1, λi k+1) 2 8C2 ψαk E ω (θk) ωi k+1 2 4C2 ψαk E λ (θk) λi k+1 2i O(log2(K)α2 k) O((εapp + εsp)αk). (14) To analyze the errors of critic ω (θk) ωi k+1 2 and reward estimator λ (θk) λi k+1 2, the two-timescale analysis requires O(αk) < min{O(βk), O(ηk)} in order for these two errors to converge. The double-loop approach runs lower-level update for O(log(ε 1)) times with batch size O(ε 1) to drive these errors below ε and hence, they cannot allow inner loop size and bath size to be O(1) simultaneously. To obtain the convergence result for single-timescale update, the idea is to further upper bound these two lower-level errors by the quantity O(αk E gi a(ξk, ωi k+1, λi k+1) 2) (through a series of derivations), and then eliminate these errors by the ascent term αk 2 E gi a(ξk, ωi k+1, λi k+1) 2. We mainly focus on the analysis of critic s error through the proof sketch. The analysis for reward estimator s error follows similar procedure. We start by decomposing the error of critic as i=1 ωi k+1 ω (θk) 2 = i=1 ( ωi k+1 ωk+1 2 + ωk+1 ω (θk) 2). (15) Published in Transactions on Machine Learning Research (01/2023) The first term represents the consensus error, which can be bounded by the next lemma. Lemma 1. Suppose Assumptions 1 and 5 hold. Consider the sequence {ωi k} generated by Algorithm 1, then the following holds Qωk+1 ν k Kc 1 ω0 + 4 where ωk := [ω1 k, , ωN k ]T , Q := I 1 N 11T , ν (0, 1) is the second largest singular value of W. Based on Lemma 1 and follow the step size rule of Theorem 1, it is possible to show Qωk+1 2 = PN i=1 ωi k+1 ωk+1 2 = O(K2 c β2 k). 
Let Kc = O(β 1 2 k ), we have Qωk+1 2 = O(βk), which maintains the optimal rate. To analyze the second term in (15), we first construct the following Lyapunov function Vk := J(θk) + ωk ω (θk) 2 + λk λ (θk) 2. (16) Then, it remains to derive an approximate descent property of the term ωk ω (θk) 2 in (16). Towards that end, our key step lies in establishing the smoothness of the optimal critic variables shown in the next lemma. Lemma 2 (Smoothness of optimal critic). Suppose Assumptions 1-3 hold, under the update of Algorithm 1, there exists a positive constant Lω,2 such that for any policy parameter θ1, θ2, it holds that ω (θ1) ω (θ2) Lω,2 θ1 θ2 , This smoothness property is essential for achieving our O(1/ K) convergence rate. To the best of our knowledge, the smoothness of ω (θ) has not been justified in the literature. Equipped with Lemma 2, we are able to establish the following lemma. Lemma 3 (Error of critic). Under Assumptions 1-5, consider the update of Algorithm 1. Then, it holds that E[ ωk+1 ω (θk+1) 2] (1 + C9αk) ωk+1 ω (θk) 2 i=1 E[gi a(ξk, ωi k+1, λi k+1)] 2 + O(α2 k). (17) E[ ωk+1 ω (θk) 2] (1 2λϕβk) ωk ω (θk) 2 + CK1βkβk ZK + CK2αk ZKβk. (18) Here, ZK := min{z N+|κρz 1 min{αK, βK, ηK}}, C9, λϕ are constants specified in appendix, and CK1 and CK2 are of order O(log(K)) and O(log2(K)) respectively. Plug (18) into (17), we can establish the approximate descent property of ωk ω (θk) 2 in (16): E[ ωk+1 ω (θk+1) 2] (1 + C9αk)(1 2λϕβk) ωk ω (θk) 2 i=1 E[gi a(ξk, ωi k+1, λi k+1)] 2 + O(CK1βkβk ZK + CK2αk ZKβk). (19) Finally, plugging (14), (17), and (19) into (16) gives the ascent of the Lyapunov function, which leads to our convergence result through steps of standard arguments. Remarks on update step. In Algorithms 1 and 2, the actor and critic update once for each iteration. This update scheme can be generalized to the case where actor and critic update arbitrary number of constant steps without affecting the order of the sample complexity. In particular, suppose that actor updates Ca steps per iteration, and let gi a,k be the actor s update direction at iteration k. The bounds (14) and (19) become E[J(θk+1)] J(θk) 2 E θi J(θk) 2 + αk 2 E gi a,k 2 8C2 ψαk E ω (θk) ωi k+1 2 4C2 ψαk E λ (θk) λi k+1 2 O(C2 a log2(K)α2 k) O((εapp + εsp)αk) Published in Transactions on Machine Learning Research (01/2023) Figure 1: Averaged reward versus sample complexity and communication complexity. The vertical axis is the averaged reward over all the agents. The result is averaged over 10 Monte Carlo runs. E[ ωk+1 ω (θk+1) 2] (1 + C9αk)(1 2λϕβk) ωk ω (θk) 2 i=1 E[gi a,k] 2 + O(CK1βkβk ZK + Ca CK2αk ZKβk), where we replace the norm bound αk gi a(ξk, ωi k+1, λi k+1) = O(αk) with gi a,k and apply Cauchy-Schwartz inequality: gi a,k Ca gi a(ξk, ωi k+1, λi k+1) . When Ca is a constant that is not related to K, these two bounds recovers (14) and (19). Hence, we can follow exactly the same proof procedure and obtain the O(ε 2) sample complexity result as before. When critic update Cc > 1 steps per iteration, the expected temporal difference error will decrease for each step by controlling step size, so that the bound in (19) still holds. Thus, updating critic for multiple steps will not affect the sample complexity. 4.5 Convergence of single-timescale decentralized NAC The natural Actor-Critic (NAC) (Peters & Schaal, 2008) is a popular variant of AC algorithm, which enjoys the convergence to a global optimum (with compatible function approximation error) instead of a local stationary point. 
While our main focus is the convergence of the single-timescale AC algorithm, we find that the proof technique can be directly extended to establish the global convergence of single-timescale decentralized NAC. For reference, we design such an algorithm and provide its convergence result in Appendix E as a by-product of our single-timescale AC s analysis. To the best of our knowledge, this is the first convergence result of single-timescale NAC. However, our analysis only establishes a O(ε 6) rate for the algorithm. This result is sub-optimal compared with the existing best complexity of O(ε 3) (Chen et al., 2022), which is based on the double-loop implementation. The main reason for the sub-optimality is that in comparison with the double-loop update, the critic variables under the single-timescale update will inevitably converge slower due to the change of the actor s parameter in each iteration. Based on the classical NAC s analysis, the slower convergence of critic variables will result in a worse convergence rate of the optimality gap. Please refer to Appendix E for more discussions on the sub-optimality. 5 Numerical results 5.1 Experiment setting We adopt the grounded communication environment proposed in (Mordatch & Abbeel, 2018). Our task consists of N agents and the corresponding N landmarks inhabited in a two-dimension world, where each agent can observe the relative position of other agents and landmarks. For every discrete time step, agents take actions to move along certain directions, and receive their rewards. Agents are rewarded based on the distance to their own landmark, and penalized if they collide with other agents. The objective is to maximize Published in Transactions on Machine Learning Research (01/2023) the long-term averaged reward over all agents. Since we focus on decentralized setting, each agent shall not know the target landmark of others, i.e., the reward function of others. To exchange information, each agent is allowed to send their local information via a fixed communication link. Through all the experiments, the agent number N is set to be 5, and the discount factor γ is set to be 0.95. 5.2 Comparison with existing decentralized AC algorithms In this section, we compare the proposed algorithm with existing decentralized AC algorithms under the cooperative MARL setting (Chen et al., 2022; Zeng et al., 2022) in terms of sample complexity and communication complexity. In the sequel, we refer Algorithm 1 as "SDAC-re" and Algorithm 2 as "SDAC-noi" (see Appendix 2). The algorithm proposed in (Chen et al., 2022) is referred as "DLDAC", which is based on double-loop implementation. The algorithm proposed in (Zeng et al., 2022) is denoted by "TDAC-re", which is based on two-timescale step size implementation. For comparison, we also implement a noisy reward version of "TDAC-re" and denote it by "TDAC-noi". Comparison to double-loop decentralized AC. For "SDAC-re" and "SDAC-noi", we set αk = 0.01(k + 1) 0.5, βk = 0.1(k + 1) 0.5, ηk = 0.1(k + 1) 0.5, Kc = 5, σ = 0.5, Kr = 2. For "DLDAC", we fix Tc = 50, T c = 10, T = 5, Nc = 10, N = 100, σ = 0.1 3, which is adopted by their paper (see comparisons under different hyper-parameters in Appendix A). We set α = 0.01, β = 0.1 for "DLDAC" since we observe that larger step sizes will result in divergence. We have to mention that such a inner loop size Tc = 50 in "DLDAC" is not necessarily consistent with the theory of a double-loop algorithm, in which the loop size should be proportional to O(ε 1). 
The sample complexity and communication complexity results are shown in Figure 1. For the sample complexity, "SDAC-noi" enjoys a faster convergence compared with "DLDAC". In terms of communication complexity, "DLDAC" achieves better performance as it applies mini-batch technique and thereby requires less communication rounds when using the same amount of samples. Such a mini-batch approach can also be adopted to our proposed algorithms. Thus, we implement a mini-batch version of our proposed algorithms, which we refer as "SDAC-noi-batch" and "SDAC-re-batch", respectively. We set 10 as the batch size for actor, critic, and reward estimator. We can see that by applying mini-batch update, these two variants achieve significantly better communication complexity compared with "DLDAC". This is because our algorithm updates actor for more times compared with "DLDAC" under the same communication rounds. Figure 2: Comparison between the proposed algorithms and two-timescale decentralized AC algorithms (Zeng et al., 2022). The results are averaged over 10 Monte Carlo runs. Comparison with two-timescale decentralized AC. We fix Kc = 1, Kr = 5 for this experiment. We set αk = 0.01(k + 1) 0.5, βk = 0.1(k + 1) 0.5, and ηk = 0.1(k + 1) 0.5 for "SDAC-re" and "SDAC-noi"; we set αk = 0.01(k + 1) 0.6, βk = 0.1(k + 1) 0.4, and ηk = 0.1(k + 1) 0.4 for "TDAC-re" and "TDAC-noi". The sample complexity is presented in Figure 2. We can observe that the convergence speed of "SDAC-noi" is 3Note that we adopt the notations in (Chen et al., 2022). Here, Tc is the inner loop size, T c is the communication number for each outer loop, T is the communication number for reward consensus, N is the batch size for actor s update, and Nc is the batch size for critic s update. Published in Transactions on Machine Learning Research (01/2023) slightly better than that the two-timescale counterpart "TDAC-noi". In addition, when using reward estimator for the global reward estimation, we see that "SDAC-re" has much more stable convergence behavior than "TDAC-re", and achieves significantly higher rewards. Figure 3: Ablation study on the consensus periods. The results are averaged over 10 Monte Carlo runs. 5.3 Ablation study on different choices of Kc We compare the performance of "SDAC-noi" under different choices of consensus periods Kc. In particular, we set αk = 0.01(k + 1) 0.5, βk = 0.1(k + 1) 0.5, Kr = 2, σ = 0.5 and examine the consensus periods Kc of 1, 5, 10, and 20, respectively. The corresponding sample complexity and communication complexity results are summarized in Figure 3. Evidently, in terms of sample complexity, the convergence becomes slower and relatively unstable as the consensus period Kc increases. Therefore, when the communication cost is low, choosing a small Kc will yield a better performance. We also plot the communication complexity under the consensus periods of 1 and 5. We can see that the communication complexity of "cons-5" outperforms "cons-1" after 12 103 communications. Thus, when the communication cost is expensive and high averaged reward is required, one may use large Kc and run the algorithm for a relatively large number of iterations. 6 Conclusion and future direction In this paper, we studied the convergence of fully decentralized AC algorithm under practical singletimescale update. We showed that the algorithm will maintain the optimal sample complexity of O(ε 2) and is communication efficient. We also proposed a variant to preserve the privacy of local actions by communicating noisy rewards. 
Extensive simulation results demonstrate the superiority of our algorithms empirical performance over existing decentralized AC algorithms. However, directly extending our singletimescale AC s analysis technique to single-timescale NAC will result in a sub-optimal sample complexity. We leave the study on improving the convergence rate and design a more efficient single-timescale NAC algorithm as promising future directions. Acknowledgement The authors would like to thank the Action Editor and anonymous reviewers for their detailed and constructive comments, which have helped greatly to improve the quality and presentation of the manuscript. X. Li was partially supported by the National Natural Science Foundation of China (NSFC) under Grant No. 12201534 and 72150002, by the Shenzhen Science and Technology Program under Grant No. RCBS20210609103708017, and by the Shenzhen Institute of Artificial Intelligence and Robotics for Society (AIRS) under Grant No. AC01202101108. Published in Transactions on Machine Learning Research (01/2023) Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Ar Xiv:1908.00261, 2019. Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in markov decision processes. In Conference on Learning Theory (COLT), pp. 64 66, 2020. Jonathan Baxter and Peter L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319 350, 2001. Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory (COLT), pp. 1691 1692, 2018. Tianyi Chen, Yuejiao Sun, and Wotao Yin. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021a. URL https://openreview.net/forum?id= OItv P2-i9j. Ziyi Chen, Yi Zhou, and Rongrong Chen. Multi-agent off-policy td learning: Finite-time analysis with near-optimal sample complexity and communication complexity. ar Xiv preprint ar Xiv:2103.13147, 2021b. Ziyi Chen, Yi Zhou, Rong-Rong Chen, and Shaofeng Zou. Sample and communication-efficient decentralized actor-critic algorithms with finite-time analysis. In International Conference on Machine Learning, pp. 3794 3834. PMLR, 2022. Kenji Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219 245, 2000. Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, pp. 1407 1416. PMLR, 2018. Zuyue Fu, Zhuoran Yang, and Zhaoran Wang. Single-timescale actor-critic provably finds globally optimal policy. In International Conference on Learning Representations, 2021. URL https://openreview.net/ forum?id=pq ZV_sr UVm K. Hongyi Guo, Zuyue Fu, Zhuoran Yang, and Zhaoran Wang. Decentralized single-timescale actor-critic on zero-sum two-player stochastic games. In International Conference on Machine Learning, pp. 3899 3909. PMLR, 2021. FNU Hairi, Jia Liu, and Songtao Lu. Finite-time convergence and sample complexity of multi-agent actor-critic reinforcement learning with average reward. 
In International Conference on Learning Representations, 2022. Sham M Kakade. A natural policy gradient. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 1531 1538, 2002. Vijaymohan R Konda and Vivek S Borkar. Actor-critic type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94 123, 1999. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. ar Xiv preprint ar Xiv:1509.02971, 2015. Yixuan Lin, Kaiqing Zhang, Zhuoran Yang, Zhaoran Wang, Tamer Başar, Romeil Sandhu, and Ji Liu. A communication-efficient multi-agent actor-critic algorithm for distributed reinforcement learning. In 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 5562 5567, 2019. Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pp. 157 163. Elsevier, 1994. Published in Transactions on Machine Learning Research (01/2023) Yanli Liu, Kaiqing Zhang, Tamer Basar, and Wotao Yin. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33: 7624 7636, 2020. Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, Open AI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 30, 2017. Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. Alex Olshevsky and Bahman Gharesifard. A small gain analysis of single timescale actor critic. To appear in SIAM Journal on Control and Optimization, 2023, 2023. Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P How, and John Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In International Conference on Machine Learning, pp. 2681 2690. PMLR, 2017. Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180 1190, 2008. Shuang Qiu, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. On the finite-time convergence of actor-critic algorithm. In Optimization Foundations for Reinforcement Learning Workshop at Advances in Neural Information Processing Systems (Neur IPS), 2019. Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 4295 4304. PMLR, 2018. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. Han Shen, Kaiqing Zhang, Mingyi Hong, and Tianyi Chen. Asynchronous advantage actor critic: Nonasymptotic analysis and linear speedup. Ar Xiv:2012.15511, 2020. David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354 359, 2017. Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 5887 5896. 
PMLR, 2019. Jun Sun, Gang Wang, Georgios B Giannakis, Qinmin Yang, and Zaiyue Yang. Finite-sample analysis of decentralized temporal-difference learning with linear function approximation. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 4485 4495, 2020. Richard S Sutton, David A Mc Allester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 1057 1063, 2000. John N Tsitsiklis and Benjamin Van Roy. Analysis of temporal-diffference learning with function approximation. In Advances in neural information processing systems (NIPS), pp. 1075 1081, 1997. Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350 354, 2019. Published in Transactions on Machine Learning Research (01/2023) Yue Wang, Shaofeng Zou, and Yi Zhou. Non-asymptotic analysis for two time-scale TDC with general smooth function approximation. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 9747 9758. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/ 50e207ab6946b5d78b377ae0144b9e07-Paper.pdf. Yue Frank Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite-time analysis of two time-scale actor-critic methods. Advances in Neural Information Processing Systems, 33:17617 17628, 2020. Pan Xu, Felicia Gao, and Quanquan Gu. An improved convergence analysis of stochastic variance-reduced policy gradient. In Proc. International Conference on Uncertainty in Artificial Intelligence (UAI), 2019. Tengyu Xu and Yingbin Liang. Sample complexity bounds for two timescale value-based reinforcement learning algorithms. In International Conference on Artificial Intelligence and Statistics, pp. 811 819. PMLR, 2021. Tengyu Xu, Zhe Wang, and Yingbin Liang. Improving sample complexity bounds for (natural) actor-critic algorithms. In Proc. Advances in Neural Information Processing Systems (Neur IPS), volume 33, 2020. Tengyu Xu, Zhuoran Yang, Zhaoran Wang, and Yingbin Liang. Doubly robust off-policy actor-critic: Convergence and optimality. Ar Xiv:2102.11866, 2021. Chao Yu, Xin Wang, Xin Xu, Minjie Zhang, Hongwei Ge, Jiankang Ren, Liang Sun, Bingcai Chen, and Guozhen Tan. Distributed multiagent coordinated learning for autonomous driving in highways based on dynamic coordination graphs. IEEE Transactions on Intelligent Transportation Systems, 21(2):735 748, 2019. Siliang Zeng, Tianyi Chen, Alfredo Garcia, and Mingyi Hong. Learning to coordinate in multi-agent systems: A coordinated actor-critic algorithm and finite-time guarantees. In Learning for Dynamics and Control Conference, pp. 278 290. PMLR, 2022. Haifeng Zhang, Weizhe Chen, Zeren Huang, Minne Li, Yaodong Yang, Weinan Zhang, and Jun Wang. Bi-level actor-critic for multi-agent coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7325 7332, 2020. Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Basar. Fully decentralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning, pp. 5872 5881. PMLR, 2018. Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Başar. 
Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383, 2019.

Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control, pp. 321-384, 2021.

Contents
1 Introduction
2 Preliminary
2.1 Markov decision process
2.2 Policy gradient Theorem
3 Algorithms
3.1 Decentralized single-timescale Actor-Critic
3.2 Variant for preserving local action
4 Main results
4.1 Assumptions
4.2 Sample complexity for Algorithm 1
4.3 Sample complexity for Algorithm 2
4.4 Proof sketch
4.5 Convergence of single-timescale decentralized NAC
5 Numerical results
5.1 Experiment setting
5.2 Comparison with existing decentralized AC algorithms
5.3 Ablation study on different choices of Kc
6 Conclusion and future direction
A Additional simulation results
B Auxiliary lemmas
C Supporting lemmas
C.1 Error of critic
C.2 Error of reward estimator
C.3 Consensus error
C.4 Error of actor
D Proof of main results
D.1 Proof of Theorem 1
D.2 Proof of Theorem 2
E Natural Actor-Critic variant and its convergence
E.1 Decentralized natural Actor-Critic
E.2 Convergence of natural Actor-Critic
E.3 Proof of Theorem 3

(a) Different actor's batch sizes. (b) Different critic's batch sizes. (c) Different loop sizes.
Figure 4: Comparison between the proposed algorithms and the double-loop decentralized AC algorithm that uses mini-batch update. The results are averaged over 10 Monte Carlo runs.
A Additional simulation results In this section, we provide more experiments which compare the proposed algorithms with double-loop based decentralized AC algorithm under different batch sizes and inner loop sizes. 1. Actor s batch size. We fix Tc = 50, T c = 10, Nc = 10, 4 which is adopted by (Chen et al., 2022). We examine values of N in {10, 50, 100}. The results are in Figure 4a. We observe that the best choice of actor s batch size N is 50, and the proposed "SDAC-noi" converges faster than it in terms of sample complexity. 2. Critic s batch size. We fix Tc = 50, T c = 10, N = 100, which is adopted by (Chen et al., 2022). We examine values of Nc in {2, 10, 50}. The results are shown in Figure 4b. As we can see, "DLDAC" with smaller critic s batch sizes can achieve better sample complexity, indicating that the variance of critic s update is relatively small and the mini-batch update is not needed for this task. Our proposed "SDAC-noi" achieves better convergence compared with the double-loop decentralized AC under different choices of Nc. 4Note that we adopt the notations in (Chen et al., 2022). Here, Tc is the inner loop size, T c is the communication number for each outer loop, N is the batch size for actor s update, and Nc is the batch size for critic s update. Published in Transactions on Machine Learning Research (01/2023) 3. Inner loop size. We fix T c = 10, N = 100, Nc = 10, which is adopted by (Chen et al., 2022). We examine values of Tc in {5, 20}. The results are shown in Figure 4c. We can see that the proposed "SDAC-noi" enjoys a better convergence in terms of sample complexity. B Auxiliary lemmas In this section, we provide some auxiliary lemmas, which serves as the preliminary for the proof of main theorems and lemmas. The following two lemmas present the Lipschitz properties of the objective function and value function. Lemma 4 ((Zhang et al., 2019), Lemma 3.2). Suppose Assumption 3 holds, then there exists a positive constant L such that for any policy parameter θ1 and θ2, we have J(θ1) J(θ2) L θ1 θ2 . Lemma 5 ((Shen et al., 2020), Lemma 4). Suppose Assumption 3 holds, for any policy parameter θ1, θ2 and s S, there exits a positive constant LV such that Vπθ1(s) LV |Vπθ1(s) Vπθ2(s)| LV θ1 θ2 . The next lemma shows that the stationary distribution is Lipschitz continuous with respect to policy. Lemma 6 ((Wu et al., 2020), Lemma B.1). For any policy parameter θ1 and θ2, it holds that d T V (µθ1, µθ2) |A|Lπ(logρ κ 1 + (1 ρ) 1) θ1 θ2 d T V (µθ1 πθ1, µθ2 πθ2) |A|Lπ(1 + logρ κ 1 + (1 ρ) 1) θ1 θ2 d T V (µθ1 πθ1 P, µθ2 πθ2 P) |A|Lπ(1 + logρ κ 1 + (1 ρ) 1) θ1 θ2 . We will define Lµ := |A|Lπ(logρ κ 1 + (1 ρ) 1) for the proof of main theorems and lemmas. The following lemma characterizes the geometric mixing of the Markov chain. Lemma 7 ((Shen et al., 2020), Lemma 1). Suppose Assumption 4 holds, then there exists κ > 0, ρ [0, 1] such that for any policy parameter θ we have sup s0 S d T V (P((sk, ak, sk+1) |s0, πθ), µθ πθ P) κρk, where µθ is the stationary distribution induced by πθ and transition kernel P( |s, a). The next lemma bounds the error of the discounted state-visitation distribution and the stationary distribution. Lemma 8 ((Shen et al., 2020), Lemma 2). Suppose Assumption 4 holds, then for any policy parameter θ, there exists κ > 0, ρ [0, 1] such that d T V (dπθ, µπθ) 2(logρ κ 1 + 1 1 ρ)(1 γ). The following lemma bounds the total variation distance between state distribution under a fixed policy and that under an updating policy. 
The lemma is used for analyzing the sampling error. Lemma 9. Consider the Markov chain: sk z θk z ak z P sk z+1 θk z+1 ak z+1 θk 1 ak 1 P sk θk ak P sk+1. Also consider the auxiliary Markov chain with fixed policy: sk z θk z ak z P sk z+1 θk z ak z+1 θk z ak 1 P sk θk z ak P sk+1. Let ξk := (sk, ak, sk+1) be sampled from chain 1, and ξk := ( sk, ak, sk+1) be sampled from chain 2. Then we have d T V (P(ξk |θk z, sk z+1), P( ξk |θk z, sk z+1)) 1 m=0 |A|Lπ θk m θk z . Published in Transactions on Machine Learning Research (01/2023) d T V (P(ξk ), P( ξk )) a A |P(sk = ds, ak = a, sk+1 = ds ) P( sk = ds, ak = a, sk+1 = ds )| a A |P(sk = ds, ak = a) P( sk = ds, ak = a)| Z s S P(sk+1 = ds |sk = ds, ak = a) a A |P(sk = ds, ak = a) P( sk = ds, ak = a)| a A |P(sk = ds)πθk(a|ds) P( sk = ds)πθk z(a|ds)| a A |P(sk = ds)πθk(a|ds) P(sk = ds)πθk z(a|ds)| a A |P(sk = ds)πθk z(a|ds) P( sk = ds)πθk z(a|ds)| s S |A|Lπ θk θk z P(sk = ds) s S |P(sk = ds) P( sk = ds)| X a A πθk z(a|ds) 2|A|Lπ θk θk z + d T V (P(sk ), P( sk )). (20) The second term can be bounded as d T V (P(sk ), P( sk )) s S |P(sk = ds) P( sk = ds)| s S P(sk 1 = ds, ak 1 = a, sk = ds ) P( sk 1 = ds, ak 1 = a, sk = ds )| s S |P(sk 1 = ds, ak 1 = a, sk = ds ) P( sk 1 = ds, ak 1 = a, sk = ds )| = d T V (P(ξk 1 ), P( ξk 1 )). (21) Combined (20) and (21), we obtain d T V (P(ξk ), P( ξk )) d T V (P(ξk 1 ), P( ξk 1 )) + 1 2|A|Lπ|θk θk z . Sum over z 1 steps, we obtain d T V (P(ξk ), P( ξk )) d T V (P(ξk z ), P( ξk z )) + 1 m=0 |A|Lπ θk m θk z m=0 |A|Lπ θk m θk z . Next, we present some mathematical facts that are useful in our analysis. Published in Transactions on Machine Learning Research (01/2023) Lemma 10 ((Chen et al., 2021b), Lemma F.3). For a doubly stochastic matrix W RN N and the difference matrix Q := I 1 N 11T , it holds that for any matrix H RN N, W k H νk QH , where ν is the second largest singular value of W. Lemma 11 (descent lemma in high dimension). Consider the mapping F : Rn Rm. If there exists a positive constant L such that F(x) F(y) F L x y , x, y dom(F), (22) then the following holds F(y) F(x) F(x)(y x) L Proof. Observe that (22) directly implies the smoothness of each entry Fi: Fi(x) Fi(y) F(x) F(y) F L x y . Define zi(x, y) := Fi(y) Fi(x) Fi(x)T (y x). F(y) F(x) F(x)(y x) = i=1 zi(x, y)2 where the inequality follows the descent lemma. Lemma 12 (Lipschitz property of multiplication). Suppose f(x) and g(x) are two functions bounded by Cf and Cg, and are Lfand Lg-Lipschitz continuous, then f(x)g(x) is Cf Lg + Cg Lf-Lipschitz continuous. f(x1)g(x1) f(x2)g(x2) = f(x1)g(x1) f(x1)g(x2) + f(x1)g(x2) f(x2)g(x2) f(x1) g(x1) g(x2) + f(x1) f(x2) g(x2) (Cf Lg + Cg Lf) x1 x2 . Lemma 13 (invertible property of matrix). If a square matrix A satisfies limt At = 0, or equivalently, |λ(A)| < 1, then I A is invertible. (I A) lim t i=0 At = lim t [ = I lim t At+1 Since I is invertible, by the rank inequality rank(AB) min(rank(A), rank(B)), I A and limt Pt i=0 At will be full rank and thereby invertible. Published in Transactions on Machine Learning Research (01/2023) C Supporting lemmas Before proceeding to the analysis of critic variables, we justify the uniqueness of fix point for critic and reward estimator variables under the update (5) and (8), respectively. Define the following notations Aθ,ϕ := E[ϕ(s)(γϕ(s )T ϕ(s)T )], (23) Aθ,φ := E[φ(s, a)φ(s, a)T ], bθ,ϕ := E[ϕ(s) r(s, a)], bθ,φ := E[φ(s, a) r(s, a)], with expectation taken from s µθ(s), a πθ, s P. 
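To make the quantities in (23) concrete, the following minimal sketch (the chain, feature map, and averaged reward are synthetic assumptions, not the paper's setting) forms Monte Carlo estimates of $A_{\theta,\phi}$ and $b_{\theta,\phi}$ from transitions generated under a fixed policy and solves the associated linear system, which is exactly the fixed point discussed next.

```python
# Illustrative sketch only: estimate A_{theta,phi} = E[phi(s)(gamma*phi(s') - phi(s))^T]
# and b_{theta,phi} = E[phi(s) * r_bar(s, a)] from sampled transitions, then solve
# the fixed point A w + b = 0. Environment, features, and rewards are assumptions.
import numpy as np

rng = np.random.default_rng(1)
S, d, gamma = 20, 4, 0.95
Phi = rng.normal(size=(S, d))
Phi /= np.maximum(1.0, np.linalg.norm(Phi, axis=1, keepdims=True))  # ||phi(s)|| <= 1
P = rng.random((S, S))
P /= P.sum(axis=1, keepdims=True)        # transition matrix induced by the fixed policy
r_bar = rng.random(S)                    # averaged reward (kept state-only for brevity)

n_samples, s = 50_000, 0
A_hat, b_hat = np.zeros((d, d)), np.zeros(d)
for _ in range(n_samples):               # a long trajectory approximates s ~ mu_theta
    s_next = rng.choice(S, p=P[s])
    A_hat += np.outer(Phi[s], gamma * Phi[s_next] - Phi[s])
    b_hat += Phi[s] * r_bar[s]
    s = s_next
A_hat /= n_samples
b_hat /= n_samples

# Exact counterparts for comparison (available here because the chain is known).
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu = np.abs(mu) / np.abs(mu).sum()
A_exact = (Phi * mu[:, None]).T @ (gamma * P @ Phi - Phi)
b_exact = Phi.T @ (mu * r_bar)

print("omega* from samples:", np.round(-np.linalg.solve(A_hat, b_hat), 3))
print("omega* exact       :", np.round(-np.linalg.solve(A_exact, b_exact), 3))
```

In the algorithm itself no such matrix is ever formed; the critic recursion tracks this fixed point with stochastic updates, and the analysis in Section C.1 quantifies how fast.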
The optimal critic and reward estimator variables given policy θ will satisfy Aθ,ϕω (θ) + bθ,ϕ = 0; Aθ,φλ (θ) + bθ,φ = 0. By Assumption 2, Aθ,ϕ and Aθ,φ are negative definite with largest eigenvalue λϕ and λφ, which ensures the unique solution ω (θ) = A 1 θ,ϕbθ,ϕ; λ (θ) = A 1 θ,φbθ,φ. Let Rω := rmax λϕ , Rλ := rmax λφ . Then the norm of optimal solutions will be bounded as ω (θ) Rω, λ (θ) Rλ, which justifies the projection step of the Algorithm 1. In practice, the knowledge of λϕ and λφ may not be available. One can estimate projection radius online using the methods proposed in Section 8.2 of (Bhandari et al., 2018). We slightly abuse the notation by overwriting Vπθ as Vθ. To study the error of critic, we introduce the following notations (crf. ξ := (s, a, s )) δi(ξ, θ) := ri(s, a) + γVθ(s ) Vθ(s) δ(ξ, θ) := r(s, a) + γVθ(s ) Vθ(s) δ(ξ, ω) := r(s, a) + γϕ(s )T ω ϕ(s)T ω ˆδ(ξ, ω, λ) := φ(s, a)T λ + γϕ(s )T ω ϕ(s)T ω, (24) For the ease of expression, we further define gi a(ξ, ω, λ) := ˆδ(ξ, ω, λ)ψθi(s, ai), gi c(ξ, ω) := δi(ξ, ω)ϕ(s), gc(ξ, ω) := δ(ξ, ω)ϕ(s), gc(θ, ω) := Eξ µθ[ gc(ξ, ω)]. (25) C.1 Error of critic The following lemmas and propositions serves as the preliminary for establishing the approximate descent property of the critic variables optimal gap. Proposition 1 (Lipschitz continuity of ω (θ) (Wu et al., 2020)). Suppose Assumptions 1, 2, 3, and 4 hold, then there exists a positive constant Lω such that for any θ1, θ2 RNdθ, we have ω (θ1) ω (θ2) Lω θ1 θ2 . Lemma 14 (smoothness of stationary distribution). For any θ, θ Rd, there exists a positive constant Lµ,2 such that µθ(s) µθ (s) Lµ,2 θ θ . The proof of this Lemma consists of two main steps: 1) Derive the expression of the gradient and 2) establish that the gradient is Lipschitz continuous. For the first part, we follow the main idea in (Baxter & Bartlett, 2001). Proof. For a given policy πθ, we define the transition probability Pθ(s|s ) := P a πθ(a|s )P(s|s , a). By the Assumption 4, there exists a stationary distribution µθ(s) which satisfies for all state s Published in Transactions on Machine Learning Research (01/2023) s S µθ(s )Pθ(s|s ) (26) Define the following notations µθ := [µθ(s1), µθ(s2), , µθ(sn)]T R|S| 1 Pθ(s) := [Pθ(s|s1), Pθ(s|s2), , Pθ(s|sn)]T R|S| 1 P(θ) := [Pθ(s1), Pθ(s2), , Pθ(sn)] R|S| |S| µθ := [ µθ(s1), µθ(s2), , µθ(sn)] Rdθ |S| Pθ(s) := [ Pθ(s|s1), Pθ(s|s2), , Pθ(s|sn)] Rdθ |S| Upon taking derivative with respect to θ on both sides of (26), we have s S µθ(s )Pθ(s|s ) + µθ(s ) θPθ(s|s ) = µθPθ(s) + Pθ(s)µθ (27) (27) can be written in compact form as µθ = µθP(θ) + [ Pθ(s1)µθ, , Pθ(sn)µθ] (28) Therefore, we have [ Pθ(s1)µθ, , Pθ(sn)µθ] = µθ(I P(θ)) = µθ(I (P(θ) eµT θ )), where the second inequality is due to µθe = (µθe) = 1 = 0. We now show that I (P(θ) eµT θ ) is invertible. The first step is to show limt (P(θ) eµT θ )t = 0. Let P, µ represent P(θ), µθ for simplicity, we first show (P eµT )t = P t P t 1eµT by induction. Observe that when t = 1, this is trivially satisfied. Suppose the equality holds for t = k, then (P eµT )k+1 = (P k P k 1eµT )P (P k P k 1eµT )eµT = P k+1 P k 1eµT P keµT + P k 1(eµT )2 = P k+1 P keµT , where the second equality is due to (26) such that eµT P = eµT and the last equality is due to µT e = 1. Therefore, we have lim t (P(θ) eµT θ )t = lim t (P(θ)t P(θ)t 1eµT θ ) = eµT θ eµT θ = 0, which together with Lemma 13 justifies that I (P(θ) eµT θ ) is invertible. Thus, we have µθ = (I (P(θ) eµT θ )) 1[ Pθ(s1)µθ, , Pθ(sn)µθ]. 
(29) Published in Transactions on Machine Learning Research (01/2023) We will utilize Lemma 12 to prove the Lipschitz property of µθ. We first show the Lipschitz continuous of the first term. Let A(θ) to represent I (P(θ) eµT θ ), then we have A(θ1) A(θ2) = P(θ1) P(θ2) + e(µθ2 µθ1)T P(θ1) P(θ2) + e(µθ2 µθ1)T a A (πθ1(a|s ) πθ2(a|s ))P(s|s , a)|2 + p |S| µθ2 µθ1 a A |(πθ1(a|s ) πθ2(a|s ))P(s|s , a)|)2 + p |S| µθ2 µθ1 s S |A|2L2π θ1 θ2 2 X s S P(s|s , a)2 + p |S|Lµ θ1 θ2 |S|(|A|Lπ + Lµ) θ1 θ2 . where the second inequality uses triangle inequality. The last inequality is due to Lipschitz continuous of the policy specified in Assumption 3, and Lipschitz continuous of µθ implied by Lemma 5. To see that A 1(θ) is Lipschitz continuous and bounded, observe that A 1(θ1) A 1(θ2) = A 1(θ2)(A(θ2) A(θ1))A 1(θ1) A 1(θ2) A 1(θ1) A(θ2) A(θ1) |S|(|A|Lπ + Lµ) A 1(θ2) A 1(θ1) θ2 θ1 , (30) where the first inequality uses Cauchy-Schwartz inequality, and the last inequality uses the Lipschitz continuous of A(θ) in (30). Since A(θ) is bounded, A 1(θ) is also bounded (due to invertibility), which justifies that the first term in (29) is Lipschitz continuous and bounded. We now consider the second term in (29). For any state s Pθ1(s)µθ1 Pθ2(s)µθ2 = Pθ1(s)(µθ1 µθ2) + ( Pθ1(s) Pθ2(s))µθ2 Pθ1(s)(µθ1 µθ2) + ( Pθ1(s) Pθ2(s))µθ2 Pθ1(s) µθ1 µθ2 + Pθ1(s) Pθ2(s) µθ2 a A πθ1(a|s )P(s|s , a) Lµ θ1 θ2 a A ( πθ1(a|s ) πθ2(a|s ))P(s|s , a) |S||A|(CπLµ + Lπ) θ1 θ2 , which justifies the Lipschitz continuous of Pθ(s)µθ. Define B(θ) := [ Pθ(s1)µθ, , Pθ(sn)µθ], we have B(θ1) B(θ2) |S|3/2|A|(CπLµ + Lπ) θ1 θ2 . Since µθ = A 1(θ)B(θ), with A 1(θ) and B(θ) being Lipschitz continuous and bounded. Therefore, according to Lemma 12, there exists a positive constant Lµ,2 which satisfies µθ1 µθ2 Lµ,2 θ1 θ2 . Proposition 2 (restatement of Lemma 2, Lipschitz continuity of θω (θ) (Chen et al., 2021a)). Suppose Assumptions 1-4 holds, then there exists a positive constant Lω,2 such that θω (θ1) θω (θ2) F Lω,2 θ1 θ2 . Published in Transactions on Machine Learning Research (01/2023) Proof. The proof follows the derivation of Proposition 8 of (Chen et al., 2021a). However, they make assumption that µθ(s) is Lipschitz continuous, which we have justified in Lemma 14. We present the proof for the completeness. We have ω (θ) = A 1 θ,ϕbθ,ϕ, where Aθ,ϕ is defined in (23). The Jacobian of ω (θ) can be calculated as θω (θ) = θ(A 1 θ,ϕbθ,ϕ) = A 1 θ,ϕ( θAθ,ϕ)A 1 θ,ϕbθ,ϕ Aθ,ϕ( θbθ,ϕ). (31) We can utilize Lemma 12 to show the Lipschitz continuity of ω (θ). We have to verify the Lipschitz continuity and boundedness of A 1 θ,ϕ, bθ,ϕ, θAθ,ϕ, and θbθ,ϕ. The Lipschitz continuity and boundedness of A 1 θ,ϕ has been shown in (30. Let b1 and b2 represent bθ1,ϕ, bθ2,ϕ, we have b1 b2 = E[ r(s, a, s )ϕ(s)] E[r( s, a, s )ϕ( s)] sup s,a,s r(s, a, s )ϕ(s) P((s, a, s )) P(( s, a, s )) T V rmax P((s, a, s )) P(( s, a, s )) T V 2|A|Lπ(1 + logρ κ 1 + (1 ρ) 1 θ1 θ2 , where the last inequality follows Lemma 6. We now analyze θAθ,ϕ. We first define A(s, s ) := ϕ(s)(γϕ(s ) ϕ(s))T , b(s, a, s ) := r(s, a, s )ϕ(s). s,a,s µθ(s)πθ(a|s)P(s |s, a)A(s, s ) s,a,s [ θµθ(s)πθ(a|s)P(s |s, a)A(s, s ) + µθ θπθ(a|s)P(s |s, a)A(s, s )] . By Lemma 14 and Lemma 6, and Assumption 3, µθ(s), πθ(a|s), θµθ(s), θπθ(a|s) are Lipschitz continuous and bounded. Therefore, θAθ,ϕ is Lipschitz and bounded. Finally, we analyze θbθ,ϕ by following the same technique. s,a,s µθ(s)πθ(a|s)P(s |s, a)b(s, a, s ) s,a,s [ θµθ(s)πθ(a|s)P(s |s, a)b(s, a, s ) + µθ(s) θπθ(a|s)P(s |s, a)b(s, a, s )] . 
By Lemma 14 and Lemma 6, and Assumption 3, µθ(s), πθ(a|s), θµθ(s), θπθ(a|s) are Lipschitz continuous and bounded. Thus, θbθ,ϕ is bounded and Lipschitz continuous. We have shown the Lipschitz continuity and boundedness of A 1 θ,ϕ, bθ,ϕ, θAθ,ϕ, and θbθ,ϕ. Therefore, by applying Lemma 12, we conclude that there exists a positive constant Lω,2 such that θω (θ) in (31) is Lω,2-Lipschitz continuous. Lemma 15 (descent of critic s optimal gap (Markovian sampling)). Under Assumptions 1-5, with ωk+1 generated by Algorithm 1 given ωk and θk under Markovian sampling, then the following holds E[ ωk+1 ω (θk+1) 2|θk] 1 + 4LωNαk + L2 ω,2 2 C2 θN p E ωk+1 ω (θk) 2 L2 ω,2 2 C2 θN + L2 ωC2 θN i=1 E[gi a(ξk, ωi k+1, λi k+1)] 2. (32) Published in Transactions on Machine Learning Research (01/2023) E[ ωk+1 ω (θk) 2|θk] (1 2λϕβk)E ωk ω (θk) 2 + CK1βkβk ZK + CK2αk ZKβk. (33) where the constants are defined as CK1 := 4C2CδZK + C2 δ , CK2 := 4C1CθZK + 2C3CθZ2 K + C8, ZK := min{z N+|κρz 1 min{αK, βK, ηK}}. Proof. We begin with the optimality gap of averaged critic variables ωk+1 ω (θk+1) 2 = ωk+1 ω (θk) + ω (θk) ω (θk+1) 2 = ωk+1 ω (θk) 2 + ω (θk) ω (θk+1) 2 + 2 ωk+1 ω (θk), ω (θk) ω (θk+1) ωk+1 ω (θk) 2 + NL2 ωC2 θα2 k + 2 ωk+1 ω (θk), ω (θk)T (θk θk+1) + 2 ωk+1 ω (θk), ω (θk) ω (θk+1) ω (θk)T (θk θk+1) , (34) where the inequality is based on the Lipschitz of ω (θ) implied by Proposition 1 ω (θk) ω (θk+1) 2 L2 ω θk θk+1 2, θk θk+1 2 = i=1 αkˆδ(ξk, ωi k, λi k)ψθi k(sk, ai k) 2 Nα2 k C2 θ, (35) with Cθ := CδCψ, and Cδ is defined in (39). The third term in (34) can be bounded as ωk+1 ω (θk), ω (θk)T (θk θk+1) ωk+1 ω (θk) ω (θk)T (θk θk+1) ω (θk) ωk+1 ω (θk) θk θk+1 Lω ωk+1 ω (θk) θk θk+1 i=1 Lωαk ωk+1 ω (θk) gi a(ξk, ωi k+1, λi k+1) i=1 (2Lωαk ωk+1 ω (θk) 2 + αk 8 gi a(ξk, ωi k+1, λi k+1) 2), (36) where the second inequality follows Proposition 1, the third inequality uses triangle inequality, and the last inequality uses Young s inequality. The last term in (34) can be bounded as E ωk+1 ω (θk), ω (θk) ω (θk+1) ω (θk)T (θk θk+1) dθE ωk+1 ω (θk) θk+1 θk 2 dθE ωk+1 ω (θk) 2 θk+1 θk 2 + L2 ω,2 4 θk+1 θk 2 dθNC2 θα2 k E ωk+1 ω (θk) 2 + L2 ω,2 4 NC2 θα2 k. (37) The first inequality uses Lemma 11. The second inequality is induced by Young s inequality. The last inequality follows (35). Plug (36) and (37) into (34) will yield (32). Published in Transactions on Machine Learning Research (01/2023) We now prove (33). By the critic update rule, we have (crf. gc(θ, ω) := Eξ µθ[ gc(ξ, ω)].) E ωk+1 ω (θk) 2 = E ΠRω( ωk + βk gc(ξk, ωk)) ΠRωω (θk) 2 (i) E ωk + βk gc(ξ, ωk) ω (θk) 2 = ωk ω (θk) 2 + β2 k E gc(ξk, ωk) 2 + 2βk E[ ωk ω (θk), E[ gc(ξk, ωk)] ] (ii) ωk ω (θk) 2 + β2 k C2 δ + 2βk ωk ω (θk), E[ gc(ξk, ωk)] = ωk ω (θk) 2 + β2 k C2 δ + 2βk ωk ω (θk), gc(θk, ωk) + 2βk E ωk ω (θk), E[ gc(ξk, ωk)] gc(θk, ωk) , (38) where (i) is due to the non-expansiveness of projection to convex set, and (ii) follows gc(ξ, ω) |r(s, a) + γϕ(s )T ω ϕ(s)T ω| rmax + (1 + γ)Rω := Cδ. (39) The product in the third term in (38) can be bounded as ωk ω (θk), gc(θk, ωk) = ωk ω (θk), Eξ µθk [ gc(ξk, ωk)] = ωk ω (θk), Eξ µθk [ gc(ξk, ωk) gc(θk, ω (θk))] = βk ωk ω (θk), Eξ µθk [ϕ(s)(γϕ(s ) ϕ(s))T |θk]( ωk ω (θk)) = βk ωk ω (θk), Aθk,ϕ( ωk ω (θk)) λϕβk ωk ω (θk) 2. (40) Here the first equality is due to critic s optimality condition gc(θk, ω (θk)) = Eξk µθk [ gc(ξk, ω (θk))|θk] = 0. The last inequality uses the negative definiteness of Aθk,ϕ of Assumption 2. 
Plug (40) to (38) gives E ωk+1 ω (θk) 2 (1 2λϕβk) ωk ω (θk) 2 + β2 k C2 δ + 2βk ωk ω (θk), E[ gc(ξk, ωk)] gc(θk, ωk) . (41) We now bound the last term in (41). By Lemma 16, for any z N+, we have E ωk ω (θk), gc(ξk, ωk) gc(θk, ωk) C1E θk θk z + C2E ωk ωk z + C3 m=0 E θk m θk z + C8κρz 1 n=1 E θk n+1 θk n + C2 n=1 E ωk n+1 ωk n n=1 E θk m n+1 θk m n + C8κρz 1 n=1 αk n + 2C2Cδ n=1 βk n + C3Cθ n=1 αk m n + C8κρz 1 (ii) 2C1Cθzαk z + 2C2Cδzβk z + C3Cθz(z 1)αk z + C8κρz 1, (42) where the (i) uses triangle inequality, (ii) uses the non-increasing property of step sizes. Let z = ZK, we have (crf. ZK := min{z N+|κρz 1 min{αK, βK, ηK}}) E ωk ω (θk), gc(ξk, ωk) gc(θk, ωk) 2C1CθZKαk ZK + 2C2CδZKβk ZK + C3CθZ2 Kαk ZK + C8αk ZK. (43) Published in Transactions on Machine Learning Research (01/2023) Plug (43) into (41) will yield ωk+1 ω (θk) 2 (1 2λϕβk) ωk ω (θk) 2 + C2 δ β2 k + 4C1CθZKαk ZK + 4C2CδZKβk ZK + 2C3CθZ2 Kαk ZK + 2C8αk ZK. By defining CK1 := 4C2CδZK + C2 δ , CK2 := 4C1CθZK + 2C3CθZ2 K + C8, we complete the proof. Lemma 16. Consider the sequence generated by Algorithm 1, for any z N+, we have E ωk ω (θk), gc(ξk, ωk) gc(θk, ωk) C1 θk θk z + C2 ωk ωk z m=0 θk m θk z + C8κρz 1, (44) where C1 := 4RωCδ|A|Lπ(1 + logρ κ 1 + (1 ρ) 1) + 2CδLω, C2 := 4(1 + γ)Rω + 2Cδ, C3 := 4RωCδ|A|Lπ, C8 := 8RωCδ. Basically, this lemma shows that the term on the left hand side of (44) is of order O(αk + βk). Proof. Consider the Markov chain since timestep k z: sk z θk z ak z P sk z+1 θk z+1 ak z+1 θk 1 ak 1 P sk θk ak P sk+1. Also consider the auxiliary Markov chain with fixed policy since timestep k z: sk z θk z ak z P sk z+1 θk z ak z+1 θk z ak 1 P sk θk z ak P sk+1. Throughout the proof of this lemma, we will use θ, θ , ω, ω , ξ, ξ as shorthand notations of θk, θk z, ωk, ωk z, ξk, ξk. For the ease of expression, define 1(ξ, θ, ω) := ω ω (θ), gc(ξ, ω) gc(θ, ω) . Therefore, we have ωk ω (θk), gc(ξk, ωk) gc(θk, ωk) = 1(ξ, θ, ω) = 1(ξ, θ, ω) 1(ξ, θ , ω) | {z } I1 + 1(ξ, θ , ω) 1(ξ, θ , ω ) | {z } I2 + 1(ξ, θ , ω ) 1( ξ, θ , ω ) | {z } I3 + 1( ξ, θ , ω ) | {z } I4 I1 can be expressed as I1 = ω ω (θ), gc(ξ, ω) gc(θ, ω) ω ω (θ ), gc(ξ, ω) gc(θ , ω) = ω ω (θ), gc(ξ, ω) gc(θ, ω) ω ω (θ), gc(ξ, ω) gc(θ , ω) + ω (θ) ω (θ ), gc(ξ, ω) gc(θ , ω) ω ω (θ) gc(θ , ω) gc(θ, ω) + ω (θ) ω (θ ) gc(ξ, ω) gc(θ , ω) . (46) The first term can be bounded as ω ω (θ) gc(θ , ω) gc(θ, ω) 2Rω Eξ µ θ[ gc(ξ, ω)] Eξ µθ[ gc(ξ, ω)] 4Rω sup ξ gc(ξ, ω) d T V (µ θ π θ P, µθ πθ P) 4RωCδd T V (µ θ π θ P, µθ πθ P) 4RωCδ|A|Lπ(1 + logρ κ 1 + (1 ρ) 1) θ θ , (47) Published in Transactions on Machine Learning Research (01/2023) where the first inequality is due to ω Rω induced by the projection step of critic s update. The third inequality is due to gc(ξ, ω) Cδ, and the last inequality follows Lemma 6. By the Lipschitz conitinuous of ω (θ) proposed in Proposition 1, the second term in (46) can be bounded as ω (θ) ω (θ ) gc(ξ, ω) gc(θ, ω) Lω θ θ gc(ξ, ω) gc(θ, ω) 2CδLω θ θ (48) Plug (47) and (48) into (46), we can bound I1 as I1 (4RωCδ|A|Lπ(1 + logρ κ 1 + (1 ρ) 1) + 2CδLω) θ θ . (49) I2 can be decomposed by I2 = ω ω (θ ), gc(ξ, ω) gc(θ , ω) ω ω (θ ), gc(ξ, ω ) gc(θ , ω ) = ω ω (θ ), gc(ξ, ω) gc(θ , ω) ω ω (θ ), gc(ξ, ω) gc(θ , ω) + ω ω (θ ), gc(ξ, ω) gc(ξ, ω ) gc(θ , ω) + gc(θ , ω ) = ω ω (θ ), gc(ξ, ω) gc(θ , ω) + ω ω (θ ), gc(ξ, ω) gc(ξ, ω ) gc(θ , ω) + gc(θ , ω ) 2Cδ ω ω (θ ) + ω ω (θ ), gc(ξ, ω) gc(ξ, ω ) gc(θ , ω) + gc(θ , ω ) . 
The last term can be bounded as ω ω (θ ), gc(ξ, ω) gc(ξ, ω ) gc(θ , ω) + gc(θ , ω ) ω ω (θ ) ( gc(ξ, ω) gc(ξ, ω ) + gc(θ , ω ) gc(θ , ω) ) 2Rω( gc(ξ, ω) gc(ξ, ω ) + gc(θ , ω ) gc(θ , ω) ) 2Rω( gc(ξ, ω) gc(ξ, ω ) + Eξ µθ gc(ξ, ω ) gc(ξ, ω) ) 4Rω(1 + γ) ω ω , (50) where the first inequality applies Cauchy-Schwartz inequality and triangle inequality, the second inequality follows the projection of each critic step. The last inequality is due to gc(ξ, ω) gc(ξ, ω ) = ϕ(s)(γϕ(s )T ( ω ω ) ϕ(s)T ( ω ω )) γ ϕ(s )T ( ω ω ) + ϕ(s)T ( ω ω ) (1 + γ) ω ω . Thus, I2 can be bounded as I2 (4(1 + γ)Rω + 2Cδ) ω ω . (51) We bound I3 as E[I3|θ , sk z+1] = E[ 1(ξ, θ , ω ) 1( ξ, θ , ω )|θ , sk z+1] 2 sup ξ | 1(ξ, θ , ω )| d T V (P(ξ |θ , sk z+1), P( ξ |θ , sk z+1)) 8RωCδd T V (P(ξ |θ , sk z+1), P( ξ |θ , sk z+1)) m=0 θk m θk z . (52) Here, the second inequality is due to 1(ξ, θ , ω ) ω ω (θ ) gc(ξ, ω ) gc(θ , ω ) 4RωCδ, and the last inequality is according to Lemma 9. Published in Transactions on Machine Learning Research (01/2023) We now bound I4 E[I4|θ , ω , sk+z 1] = E[ 1( ξ, θ , ω )|θ , ω , sk z+1] sup ξ | 1(ξ, θ , ω )| P(ξ |θ , sk z+1) µθ πθ P 8RωCδd T V (P( x |θ , st z+1), µθ πθ P) 8RωCδκρz 1, (53) where the last inequality follows Lemma 7. Plug (49), (51), (52), and (53) into (45), we get E[ 1(ξ, θ, ω)] (4RωCδ|A|Lπ(1 + logρ κ 1 + (1 ρ) 1) + 2CδLω)E θk θk z + (4(1 + γ)Rω + 2Cδ)E ωk ωk z + (4RωCδ|A|Lπ) m=0 E θk m θk z + 8RωCδκρz 1, which completes the proof. C.2 Error of reward estimator The analysis for the error of reward estimator is similar to critic. To see this, we only need to change gc(ξ, ω) into gr(ξ, λ) := (r(s, a) φ(s, a)T λ)φ(s, a) to recover most of the proofs. Lemma 17 and 18 are the counter parts of Lemma 15 and 16 for reward estimator. Lemma 17. Suppose Assumptions 1 , 2, 3, 4 hold, with λk+1 generated by Algorithm 1 given λk and θk under Markovian sampling, then the following holds E[ λk+1 λ (θk+1) 2|θk] 1 + 4LλNαk + L2 λ,2 2 C2 θN p E λk+1 λ (θk) 2 L2 λ,2 2 C2 θN + L2 λC2 θN i=1 E[gi a(ξk, λi k+1, λi k+1)] 2. (54) E[ λk+1 λ (θk) 2|θk] (1 2ηkλφ) λk λ (θk) 2 + CK3ηkηk ZK + CK4ηkαk ZK, (55) where CK3 := 4C6CλZK + C2 λ, CK4 := 4C5CθZK + 2C7CθZ2 K + C8, ZK := min{z N+|κρz 1 min{αK, βK, ηK}}, Cλ := rmax + Rλ maxs,a,λ (r(s, a) λT φ(s, a)φ(s, a)) . Lemma 18. Consider the sequence generated by Algorithm 1, for any z N+, we have E[ λk λ (θ), gr(ξk, λk) gr(θk, λk) ] C5 θk θk z + C6 λk λk z m=0 θk m θk z + C8κρz 1, (56) where C5 := 4RλCλ|A|Lπ(1 + logρ κ 1 + (1 ρ) 1) + 2CλLλ, C6 := 4Rλ + 2Cλ, C7 := 4RλCλ|A|Lπ, C8 := 8RλCλ. C.3 Consensus error Lemma 19 (restatement of Lemma 1, bound of consensus error). Define the matrix representation of critics and reward estimators parameters as ωk := [ω1 k, , ωN k ]T , λk := [λ1 k, , λN k ]T . Let Q := I 1 N 11T , then Published in Transactions on Machine Learning Research (01/2023) the consensus error can be expressed as PN i=1 ωi k ωk 2 = Qωk 2, PN i=1 λi k λk 2 = Qλk 2. Suppose Asssumption 5 holds. Let {ωi k}k, {λi k}k be the sequence generated by the Algorithm 1, then for any k 1, the following inequalities hold Qωk ν k Kc 1 ω0 + 4 Qλk ν k Kc 1 λ0 + 4 where ν (0, 1) is the second largest singular value of W. Proof. We will prove the bound in (57) for the critic variables. The analysis for reward estimator in (58) follows the same routine. To simplify the notation, we will use gi k to represent gi c(ξk, ωi k) throughout the proof of this lemma. We also use ei k to represent the projection update ei k := ΠRω(ωi k βkgi k) (ωi k βkgi k). 
Define gk := 1 N PN i=1 gi k; ek := 1 N PN i=1 ei k, and the corresponding matrix exressions as (g1 k)T , ... (g N k )T (e1 k)T , ... (e N k )T According to the update rule of critic variables, the following equalities holds ( Wωk βk Gk + Ek, if k mod Kc = 0 ωk βk Gk + Ek, otherwise. (59) To bound the consensus error, We first bound the consensus error of critic s update as i=1 gi k gk 2 (i) i=1 2 gi k 2 + 2 gk 2 2 i=1 ei t et 2 i=1 2 ei k 2 + 2 ek 2 (ii) i=1 2β2 k gi k 2 + 2β2 k gk 2 2βk where (i) is due to gi k Cδ; (ii) is ensured by the convexity of the projection set. We now study the consensus error of critic variables. Let k = k Kc Kc. By the update rule in (59), we have Qωk = QWωk 1 βk 1QGk 1 + QEk 1 = WQωk 1 βk 1QGk 1 + QEk 1 t=k Kc βt W k 1 t QGt + k Kc W k 1 t QEt, (62) where the second equality is due to the doubly stochasticity of matrix W implied by Assumption 5: QW = W 1 N 11T W = W 1 N W11T = WQ. The last equality is indicated by the update rule that ωk 1 = ωk Kc t=k Kc βt Gt + Expand the recursion in (62), we have Qωk = W k Kc Qω0 t=0 W ctβt QGt + t=0 W ct QEt, Published in Transactions on Machine Learning Research (01/2023) where ct := k 1 t Kc . Therefore, the kth iteration s consensus error can be expressed as t=k βt QGt + = W k Kc Qω0 t=0 W ctβt QGt + t=0 W ct QEt. (63) Take norm on the each side of (63) and apply triangle inequality, we get Qωk W k Kc Qω0 + t=0 βt W ct QGt + t=0 W ct QEt (i) ν k Kc ω0 + t=0 βtνct Gt + (ii) ν k Kc ω0 + 4 (iii) ν k Kc 1 ω0 + 4 where (i) inequality uses Lemma 10 and the fact that the spectral of Q is less than 1; (ii) is due to (60) and (61); (iii) uses the fact that k Kc k Kc 1 and k 1 t Kc 1. Thus, the proof for (57) is completed. The proof of (58) follows a similar procedure, we leave it as an exercise to reader. C.4 Error of actor The following lemma characterizes the sampling error of actor. Lemma 20. Consider the sequence generated by Algorithm 1, for any z 1 we have Eξ µθk [δ(ξ, θk)ψθi k(sk, ai k)] E[δ(ξk, θk)ψθi k(sk, ai k)] 2Cθκρz 1 + C12 m=0 θk m θk z + C13 θk θk z + C14 θi k θi k z , (64) where C12 := 2Cθ|A|Lπ, C13 := |A|L(logρ κ 1 + (1 ρ) 1)Cθ + 2(1 + γ)LV , C14 := 2CδLψ. Proof. Consider the Markov chain since timestep k z: sk z θk z ak z P sk z+1 θk z+1 ak z+1 θk 1 ak 1 P sk θk ak P sk+1. Also consider the auxiliary Markov chain with fixed policy since timestep k z: sk z θk z ak z P sk z+1 θk z ak z+1 θk z ak 1 P sk θk z ak P sk+1. Throughout the proof of this lemma, we wil use ψθi to represent ψθi(sk, ai k) for brevity. We define the following notation for the ease of discussion 3(ξ, θ) := Eξ µθ[δ(ξ, θ)ψθi] δ(ξ, θ)ψθi]. Then our objective is to bound E[ 3(ξk, θk) | θk z]. Published in Transactions on Machine Learning Research (01/2023) We decompose 3(ξk, θk) by applying triangle inequality 3(ξk, θk) 3(ξk, θk) 3(ξk, θk z) | {z } I1 + 3(ξk, θk z) 3( ξk, θk z) | {z } I2 + 3( ξk, θk z) | {z } I3 We apply triangle inequality again to bound I1 as I1 δ(ξk, θk z)ψθi k z δ(ξk, θk)ψθi k | {z } I(1) 1 + Eξ µθk [δ(ξ, θk)ψθi k] Eξ µθk z [δ(ξ, θk z)ψθi k z] | {z } I(1) 1 can be bounded as I(1) 1 = δ(ξk, θk z)ψθi k z δ(ξk, θk)ψθi k δ(ξk, θk z)ψθi k z δ(ξk, θk)ψθi k z + δ(ξk, θk)ψθi k z δ(ξk, θk)ψθi k |γ(Vθk z(s ) Vθk(s )) + (Vθk z(s) Vθk z(s ))|ψi k z + δ(ξk, θk)ψθi k z δ(ξk, θk)ψθi k (1 + γ)LV θk θk z + δ(ξk, θk)ψθi k z δ(ξk, θk)ψθi k (1 + γ)LV θk θk z + CδLψ θi k θi k z , (67) where the second last inequality follows the Lipschitz continuous of value function in Lemma 5, and the last inequality uses Lipschitz continuous of ψθi. 
I(2) 1 can be bounded as I(2) 1 = Eξ µθk [δ(ξ, θk)ψθi k] Eξ µθk z [δ(ξ, θk z)ψθi k z] = Eξ µθk [δ(ξ, θk z)ψθi k z] Eξ µθk z [δ(ξ, θk z)ψθi k z] + Eξ µθk [δ(ξ, θk)ψθi k δ(ξ, θk z)ψθi k z] |A|L(logρ κ 1 + (1 ρ) 1)Cθ θk θk z + Eξ µθk [δ(ξ, θk)ψθi k δ(ξ, θk z)ψθi k z] |A|L(logρ κ 1 + (1 ρ) 1)Cθ θk θk z + (1 + γ)LV θk θk z + CδLψ θi k θi k z , (68) where the first inequality applies Lemma 6, and the last inequality uses the derivation in (67). Combine (67) and (68), we have I1 |A|L(logρ κ 1 + (1 ρ) 1)Cθ θk θk z + 2(1 + γ)LV θk θk z + 2CδLψ θi k θi k z (69) Published in Transactions on Machine Learning Research (01/2023) We now bound I2 as E[I2] = E δ( ξk, θk z)ψi θk z δ(ξk, θk z)ψi θk z 2 sup ξ δ(ξ, θk z)ψθi k z d T V (P( ξk |θk z, sk z), P(ξk |θk z, sk z)) m=0 |A|Lπ θk m θk z , (70) where the last inequality follows Lemma 9. I3 can be bounded as I3 = E Eξ µθk z [δ(ξ, θk z)ψi k z] δ( ξk, θk zψi θk z) 2 sup ξ δ(ξ, θk z)ψi θk z d T V (P( ξ |θk z, sk z), µθk z πθk z P) 2Cθκρz 1, (71) where the last inequality follows Lemma 7. Plug (69), (70), and (71), we have Eξ µθk [δ(ξ, θk)ψθi k(sk, ai k)] E[δ(ξk, θk)ψθi k(sk, ai k)] 2Cθκρz 1 + 2CδLψ θi k θi k z + 2Cθ m=0 |A|Lπ θk m θk z + (|A|L(logρ κ 1 + (1 ρ) 1)Cθ + 2(1 + γ)LV ) θk θk z , which completes the proof. D Proof of main results D.1 Proof of Theorem 1 Let θk RNdθ be the stack of actors parameter at timestep k. By Lemma 4, we have E [J(θk+1)] J(θk) E [ J(θk), θk+1 θk ] L 2 θk+1 θk 2 i=1 E θi J(θk), θi k+1 θi k L i=1 θi k+1 θi k 2 i=1 E αk θi J(θk), gi a(ξk, ωi k+1, λi k+1) L i=1 E gi a(ξk, ωi k+1, λi k+1) 2 2 θi J(θk) 2 + αk 2 E gi a(ξk, ωi k+1, λi k+1) 2 2 θi J(θk) E gi a(ξk, ωi k+1, λi k+1) 2i L 2 NC2 θα2 k, (72) where the expectation is taken over ξk under Markovian sampling. The last inequality is due to gi a(ξk, ωi k+1, λi k+1) = ˆδ(ξk, ωi k, λi k)ψθi k(sk, ai k) CδCψ := Cθ. (73) Published in Transactions on Machine Learning Research (01/2023) For brevity, we will use ψθi k to represent ψθi k(sk, ai k). The gradient bias can be bounded as θi J(θk) E gi a(ξk, ωi k+1, λi k+1)|ωi k+1, λi k+1 2 4 θi J(θk) E h δ(ξk, θk)ψθi k +4 E h (δ(ξk, θk) δ(ξk, ω (θk)))ψθi k + 4 E h ( δ(ξk, ω (θk)) δ(ξk, ωi k+1))ψθi k +4 E h ( δ(ξk, ωi k+1) ˆδ(ξk, ωi k+1, λi k+1))ψθi k where the inequality uses a + b + c + c 2 4 a 2 + 4 b 2 + 4 c 2 + 4 d 2. We bound I1 as I1 = θi J(θk) E h δ(ξk, θk)ψθi k|θk i 2 h δ(ξ, θk)ψθi k|θk i E h δ(ξk, θk)ψθi k|θk i 2 h δ(ξ, θk)ψθi k|θk i Eξ µθk h δ(ξ, θk)ψθi k|θk i 2 +2 Eξ µθ h δ(ξ, θk)ψθi k|θk i E h δ(ξk, θk)ψθi k|θk i 2 I(2) 1 (75) From now on, we will use ξ dθ to denote s dπθ, a π( |s), s P for notational simplicity. I1 is the sampling error under perfect value function estimation of critic. It can be bounded as E h I(1) 1 |θk i = θi J(θk) E h δ(ξk, θk)ψθi k|θk i 2 h δ(ξ, θk)ψθi k|θk i Eξ µθk h δ(ξ, θk)ψθi k|θk i 2 δ(ξ, θk)ψθi k d T V (µθk πθk P, dθk πθk P) (i) (2Cθd T V (µθk, dθk))2 (ii) 16C2 θ(logρ κ 1 + 1 where (i) uses (73); (ii) follows Lemma 8. Define εsp := 4C2 θ(logρ κ 1 + 1 ρ)2(1 γ)2, then we have I(1) 1 4εsp. (76) By Lemma 20, I(2) 1 can be bounded as 2Cθκρz 1 + C12 m=0 θk m θk z + C13 θk θk z + C14 θi k θi k z 2Cθκρz 1 + C12 n=1 θk m n+1 θk m + C13 n=1 θk n+1 θk n + C14 n=1 θi k n+1 θi k n 2Cθκρz 1 + C12NCθ z(z + 1) 2 αk z + C13Nz Cθαk z + C14z Cθαk z 16C2 θκ2ρ2z 2 + 2C2 12C2 θz2α2 k z + 4C2 13N 2z2C2 θα2 k z + 4C2 14z2C2 θα2 k z, (77) Published in Transactions on Machine Learning Research (01/2023) where the second inequality uses triangle inequality, and the last inequality applies (a + b + c + d)2 4a2 + 4b2 + 4c2 + 4d2. 
Let z = ZK := min{z N+|κρz 1 min{αK, βK, ηK}}. Then we have I(2) 1 CK5α2 k ZK, (78) where we define CK5 := 16C2 θ + 2C2 12C2 θZ2 K + 4C2 13N 2Z2 KC2 θ + 4C2 14Z2 KC2 θ. Thus, we have I1 4εsp + CK5α2 k ZK. (79) The term I2 describes the approximation quality of linear function class, it can be bounded as I2 = E h (δ(ξk, θk) δ(ξk, ω (θk)))ψθi k (i) E δ(ξk, θk) δ(ξk, ω (θk)) 2 ψθi k (ii) C2 ψE h γ Vπθk (sk+1) ˆVω (θk)(sk+1) + Vπθk (sk) ˆVω (θk)(sk) i (iii) 2C2 ψ γ2E Vθk(sk+1) ˆVω (θk)(sk+1) 2 + E Vθk(sk) ˆVω (θk)(sk) 2 (iiii) 2C2 ψ(1 + γ2)εc app 4C2 ψεc app. (80) where (i) applies Cauchy Schwarz inequality and triangle inequality; (ii) is due to ψθi k Cψ, which is ensured by Assumption 3; (iii) uses |a + b|2 2|a|2 + 2|b|2; (iiii) follows the definition of the critic s approximation error: εc app := max θ Vπθ(s) ˆVω (θ)(s) 2 . (81) I3 captures the error of critic s estimator, which can be bounded as E[I3] = E h δ(ξk, ω (θk)) δ(ξk, ωi k+1) ψθi k E δ(ξk, ω (θk)) δ(ξk, ωi k+1) 2 ψθi k C2 ψE h γϕ(sk+1)T ω (θk) ωi k+1 ϕ(sk)T ω (θk) ωi k+1 2i C2 ψ 2E h γϕ(sk+1)T ω (θk) ωi k+1 2i + 2E h ϕ(sk)T ω (θk) ωi k+1 2i C2 ψ 2γ2E h ϕ(sk+1) 2 ω (θk) ωi k+1 2i + 2E h ϕ(sk) 2 ω (θk) ωi k+1 2i 2C2 ψ(1 + γ2) ω (θk) ωi k+1 2 4C2 ψ ω (θk) ωi k+1 2 , (82) where the last inequality is due to ϕ(s) 1, which is specified by Assumption 1. I4 characterizes the error of reward estimator, which can be bounded as E[I4] = E h δ(ξk, ωi k+1) ˆδ(ξk, ωi k+1, λi k+1) ψθi k|λi k+1 i 2 E δ(ξk, ωi k+1) ˆδ(ξk, ωi k+1, λi k+1) 2 ψθi k C2 ψE h r(sk, ak) φ(sk, ak)T λi k+1 2 |λi k+1 i C2 ψ 2E h r(sk, ak) φ(sk, ak)T λ (θk) 2i + 2E h φ(sk, ak)T λ (θk) φ(sk, ak)T λi k+1 2 |λi k+1 i 2C2 ψεr app + 2C2 ψ λ (θk) λi k+1 2 , (83) Published in Transactions on Machine Learning Research (01/2023) where the εr app in the last inequality is the approximation error of reward estimator, which is defined as εr app := max θ,a Es µθ h r(s, a) ˆrλ (θ)(s, a) 2i . Combining (79), (80), (82), and (83) gives us the bound of the gradient bias error as θi F(θk) E[gi a(ξk, ωi k+1, λi k+1)] 2 16(εsp + C2 ψεapp) + 16C2 ψ ω (θk) ωi k+1 2 + 8C2 ψ λ (θk) λi k+1 2 + 4CK5α2 k ZK. (84) Plug (84) into (72), we get E[J(θk+1)] J(θk) 2 E θi J(θk) 2 + αk 2 E gi a(ξk, ωi k+1, λi k+1) 2 8C2 ψαk E ω (θk) ωi k+1 2 4C2 ψαk E λ (θk) λi k+1 2 2 NC2 θα2 k 2NCK5α2 k ZK 8(εsp + C2 ψεapp)Nαk. (85) Consider the Lyapunov function Vk := J(θk) + ωk ω (θk) 2 + λk λ (θk) 2. (86) The difference between two Lyapunov functions will be E[Vk+1] E[Vk] = E[J(θk)] E[J(θk+1)] + E ωk+1 ω (θk+1) 2 E ωk ω (θk) 2 + E λk+1 λ (θk) 2 E λk λ (θk) 2 2 θi J(θk) 2 αk 2 E gi a(ξk, ωi k+1) 2 + 2NCK5αk ZK + L 2 NC2 θα2 k + 8(εsp + C2 ψεapp)Nαk i=1 8C2 ψαk E ω (θk) ωi k+1 2 + E ωk+1 ω (θk+1) 2 E ωk ω (θk) 2 i=1 4C2 ψαk E λ (θk) λi k+1 2 + E λk+1 λ (θk+1) 2 E λk λ (θk) 2 The first two terms of I5 can be bounded as i=1 8C2 ψαk E ω (θk) ωk+1 + ωk+1 ωi k+1 2 + E ωk+1 ω (θk+1) 2 i=1 8C2 ψαk E ωk+1 ωi k+1 2 + 8C2 ψαk E ωk+1 ω (θk) 2 + E ωk+1 ω (θk+1) 2 8C2 ψαk E ωk+1 ω (θk) 2 + E ωk+1 ω (θk+1) 2 + αk Mk1 1 + 4LωNαk + 8C2 ψαk + L2 ω,2 2 C2 θN p E ωk+1 ω (θk) 2 L2 ω,2C2 θN 2 + L2 ωC2 θN E gi a(ξk, ωi k+1, λi k+1) 2 + αk Mk1, (88) Published in Transactions on Machine Learning Research (01/2023) where the equality is due to ω (θk) ωk+1, ωk+1 ωi k+1 = ω (θk) ωk+1, ωk+1 ωk+1 = 0. The first inequality follows the Lemma 19, where Mk1 is defined as Mk1 := ν 2k Kc 2 ω0 2 + 16NC2 δ NCδν k Kc 1 k X The last inequality follows (32) in Lemma 15. 
Plug (88) into (87), and define C9 := min n c | 4LωNαk + 8C2 ψαk + L2 ω,2 2 C2 θN dθα2 k cαk o , we get I5 (1 + C9αk)E ωk+1 ω (θk) 2 + L2 ω,2C2 θN 2 2 + L2 ωC2 θN i=1 E[gi a(ξk, ωi k+1, λi k+1)] 2 + αk Mk1 [(1 + C9αk)(1 2λϕβk) 1]E ωk ω (θk) 2 + (1 + C9αk)(CK1βkβk ZK + CK2βkαk ZK) L2 ω,2C2 θN 2 + L2 ωC2 θN E gi a ξk, ωi k+1, λi k+1 2 + αk Mk1, (89) where the last inequality follows (33) in Lemma 15. By letting βk = C9 2λϕ αk, we can ensure (1 + C9αk)(1 2λϕβk) < 1. Therefore, I5 can be bounded as i=1 E[gi a(ξk, ωi k+1, λi k+1)] 2 + αk Mk1 + L2 ω,2C2 θN 2 2 + L2 ωC2 θN + (1 + C9αk)(CK1βkβk ZK + CK2βkαk ZK). (90) By applying Lemma 17 and following the similar procedure, we can bound I6 as i=1 E[gi a(ξk, ωi k+1, λi k+1)] 2 + αk Mk2 + L2 λ,2C2 θN 2 2 + L2 λC2 θN + (1 + C10αk)(CK3ηkηk ZK + CK4ηkαk ZK). (91) with ηk = C10 2λφ αk. C10 and Mk2 are defined as c | 4LλNαk + 4C2 ψαk + L2 λ,2 2 C2 δ N p Mk2 := ν 2k Kc 2 λ0 2 + 16NC2 λ NCλν k Kc 1 k X Published in Transactions on Machine Learning Research (01/2023) Plug (90) and (91) into (87), we have E[Vk+1] E[Vk] 2 θi J(θk) 2 + (Mk1 + Mk2)αk + (1 + C9αk)(CK1βkβk ZK + CK2βkαk ZK) + (1 + C10αk)(CK3ηkηk ZK + CK4ηkαk ZK) 2 NC2 θ + C11 α2 k + 8(εsp + C2 ψεapp N)αk, (93) where C11 := C2 θN( L2 ω,2+Lλ,2 2 + L2 ω + L2 λ). By letting αk = α K for some positive constant α, and recall βk = C9 2λϕ αk, ηk = C10 2λφ αk, we can telescope (93) as i=1 E θi J(θk) 2 2E[V0] Kαk + 16(εsp + C2 ψεapp N) + 2 k=0 (Mk1 + Mk2) + (2 + 2C9αk)(CK1 βk αk βk ZK + CK2 βk αk αk ZK) + (2 + 2C10αk)(CK3 ηk αk ηk ZK + CK4 ηk αk αk ZK) + LNC2 θ + 2C11 αk. (94) The summation of Mk1 can be bounded as ν 2k Kc 2 ω0 2 + 16NC2 δ NCδν k Kc 1 k X ν 2k Kc 2 ω0 2 + 16NC2 δ β2 k NCδν k Kc 1βk ν 2k Kc 2 ω0 2 + 16NC2 δ β2 k K2 c ν2(1 ν)2 + 8 NCδν k Kc 1βk Kc ν(1 ν) ν2(1 ν2/Kc)2 + 16NC2 δ β2 k K2 c K ν2(1 ν)2 + 8 NCδβk Kc (1 νKc)ν(1 ν) = O(β2 k K2 c ) = O( where the second equality is according to the step size choice. (i) is due to z=0 νz 1 Kc 1 ν(1 ν). (ii) is due to PK k=0 ν k Kc 1 = 1 ν(1 ν1/Kc). The last equality uses Kc = O(K1/4). By following similar arguments, we can show that PK k=0 Mk2 = O( K). Therefore, the third term in (94) is of order O( 1 Finally, by noticing CK1 = O(log 1 αk ), CK2 = O(log2 1 αk ), CK3 = O(log 1 αk ), CK4 = O(log2 1 αk ), we obtain the desired iteration complexity of O( 1 K ), or equivalently, the sample complexity of O(ε 2). D.2 Proof of Theorem 2 Define the update of actor i using the noisy reward as gi a(ξk, ωi k+1) := ri k,Kr(sk, ak) + γϕ(s )T ωi k+1 ϕ(s)T ωi k+1. (96) Published in Transactions on Machine Learning Research (01/2023) Following the derivation of (72), we have E[J(θk+1] J(θk) 2 θi J(θk) 2 + αk 2 E[gi a(ξk, ωi k+1)] 2 2 θi J(θk) E[gi a(ξk, ωi k+1)] 2i L 2 NC2 θα2 k. (97) Similarly to the proof of Theorem 1, the gradient bias term can be decomposed as as θi J(θk) E[gi a(ξk, ωi k+1)] 2 4 θi J(θk) E[δ(ξk, θk)ψθi k] 2 | {z } I1 + 4 E[(δ(ξk, θk) δ(ξk, ω (θk)))ψθi k] 2 | {z } I2 + 4 E[( δ(ξk, ω (θk)) δ(ξk, ωi k+1))ψθi k] 2 | {z } I3 + 4 E[( rk(sk, ak) rk,Kr(sk, ak))ψθi k] 2 | {z } I4 I1, I2, I3 can be bounded following the derivation of (84), (80), and (82), respectively. Plug these bounds into (97), we have E[J(θk+1)] J(θk) 2 E θi J(θk) 2 + αk 2 E gi a(ξk, ωi k+1) 2 8C2 ψαk E ω (θk) ωi k+1 2 2 C2 ψ rk(sk, ak) ri k,Kr(sk, ak) 2 L 2 NC2 θα2 k 2NCK5α2 k ZK 8(εsp + C2 ψεr app)Nαk. (99) Define rk,Kr := [r1 k,Kr, , r N k,Kr]T . 
The reward bias can be bounded as i=1 rk(sk, ak) ri k,Kr(sk, ak) 2 = Q rk,Kr 2 = QW Kr rk,0(sk, ak) 2 ν2Kr rk,0(sk, ak) 2 ri k,0(sk, ak) rk(sk, ak) 2 + rk(sk, ak) 2 ν2Kr N(σ2 + rmax), (100) where σ2 is the variance of the reward noise. Let Kr = 1 2 logν αk and define C15 := σ2 + r2 max. Plug (100) back to (99), we have E[J(θk+1)] J(θk) 2 E θi J(θk) 2 + αk 2 E gi a(ξk, ωi k+1) 2 8C2 ψαk E ω (θk) ωi k+1 2 2 (C15 + C2 θL)α2 k 2NCK5α2 k ZK 8(εsp + C2 ψεr app)Nαk. Consider the Lyapunov function Vk := J(θk) + ωk ω (θk) 2. Published in Transactions on Machine Learning Research (01/2023) The difference between two Lyapunov functions is E[Vk+1] E[Vk] 2 θi J(θk) 2 αk 2 E gi a(ξk, ωi k+1) 2 2 C16α2 k 2NCK5α2 k ZK 8(εsp + C2 ψεr app)Nαk i=1 8C2 ψαk E ω (θk) ωi k+1 2 + E ωk+1 ω (θk+1) 2 E ωk ω (θk) 2 I5 can be bounded by following the derivation of (90). Thus, we have E[Vk+1] E[Vk] 2 θi J(θk) 2 + N 2 C16α2 k 2NCK5α2 k ZK 8(εsp + C2 ψεr app)Nαk + (1 + C9αk)(CK1βkβk ZK + CK2βkαk ZK) + Mk1αk, (101) where C16 := C15 + C2 θL + L2 ω,2C2 θN2 Telescoping (101), we have i=1 E θi J(θk) 2 2E[V0] Kαk + 16(εsp + C2 ψεr app N) + 2 k=0 Mk1 + C16αk + (1 + C9αk) CK1 βk αk βk ZK + CK2 βk αk αk ZK The term 2 K PK k=0 Mk1 has been bounded in (95). Let αk = α K for some positive constant α, βk = C9 2λϕ αk will yield the desired rate. E Natural Actor-Critic variant and its convergence In this section, we propose a natural Actor-Critic variant of Algorithm 1, where the approach of calculating the natural policy graident under the decentralized setting is mainly inspired by (Chen et al., 2022). We show that the gradient norm square of such an algorithm will converge with the optimal sample complexity of e O(ε 3). Moreover, the algorithm will converge to the global optimum with the sample complexity of e O(ε 6). In the rest of this section, we first explain the update of the algorithm, and then prove its convergence. E.1 Decentralized natural Actor-Critic The natural policy gradient (NPG) algorithm (Kakade, 2002) can be viewed as a preconditioned policy gradient algorithm, which updates as follow: θk+1 = θk αk F(θk) J(θk), (102) where F(θ) := Es dπθ ,a πθ ψθ(s, a)ψθ(s, a)T is the Fisher information matrix (FIM). The natural Actor Critic (NAC) uses the critic variable to estimate the gradient. The main challenge for implementing NAC lies in the estimation of the matrix-vector product F(θk) J(θk), especially under the decentralized setting. The work (Chen et al., 2022) proposes to solve the following subproblem in order to estimate the product in a decentralized way: h(θk) = arg min h fθk(h) := 1 2h T F(θk)h J(θk)T h. (103) Published in Transactions on Machine Learning Research (01/2023) Algorithm 3: Decentralized single-timescale NAC 1: Initialize: Actor parameter θ0, critic parameter ω0, reward estimator parameter λ0, initial state s0, natural policy gradient estimation hk,0. 2: for k = 0, , K 1 do 3: Option 1: i.i.d. sampling: 4: sk µθk( ), ak πθk( |sk), sk+1 P( |sk, ak). 5: Option 2: Markovian sampling: 6: ak πθk( |sk), sk+1 P( |sk, ak). 7: 8: Periodical consensus: Compute ωi k and λi k by (4) and (7). 9: 10: for i = 0, , N in parallel do 11: Reward estimator update: Update λi k+1 by (8). 12: Critic update: Update ωi k+1 by (5). 13: Actor update: 14: Collect Na transition samples based on Markovian/i.i.d sampling. 15: for k = 1, , Ka do 16: Estimate zk ,n, n [Na] using (104). 17: Update hk,k +1 by (106). 18: end for 19: Update θi k+1 by (107). 
20: end for 21: end for Such a problem can be solved by using (stochastic) gradient descent, where the gradient is calculated by F(θk)h J(θk). For the centralized setting, the gradient w.r.t. each agent can be approximated as 1 Na PNa n=1 ψi θk(sn, ai n)ψθk(sn, an)T h gi a(ξn, ωk+1, λk+1). However, when considering the decentralized setting, the term zn := ψθk(sn, an)T h = PN i=1 ψi θk(sn, an)T hi is not accessible for each agent. To approximate this value under the decentralized setting, agents compute zi n,0 := ψi θk(sn, an)T hi locally and then perform the following communication step for Kz steps: zi n,k +1 = j=1 W ijzi n,k , n [Na], k = 0, , Kz 1. (104) As we will see, the value Nzi n,k converges to zn linearly. Thus, the gradient of the subproblem (103) for agent i can be approximated as: e f i θk(hk,k ) := N n=1 ψi θk(sn, ai n)zi n,Kz gi a(ξn, ωk+1, λk+1). (105) Then, each agent i performs the following update for Ka steps to estimate the natural policy gradient direction: hi k,k +1 = ΠCh(hi k,k ϱe f i θk(hk,k )), (106) where ϱ is a positive constant step size. Since the norm of optimal direction is bounded by Ch := λmax(F(θ) 1)Cθ, we project the vector into a ball of norm Ch for each update. Finally, we perform the approximate natural policy gradient step as: θi k+1 = θi k αkhi k,Ka. (107) E.2 Convergence of natural Actor-Critic In this section, we establish the sample complexity of Algorithm 3. We first introduce an additional assumption. Published in Transactions on Machine Learning Research (01/2023) Assumption 6. (invertible FIM) There exists a positive constant λF such that for all policy θ, λmin(F(θ)) λF . Assumption 6 ensures that F(θ) is positive definite so that the problem (103) is strongly convex for all policy. Such an assumption is also adopted by (Chen et al., 2022; Xu et al., 2021; Liu et al., 2020). We now show the sample complexity of the Algroithm 3 in terms of gradient norm and the global optimality gap. To keep the analysis concise, we will consider the i.i.d. sampling scheme where we can directly sample transition tuples (s, a, s ) from the stationary distribution µπθ. Extending the analysis to the Markovian sampling scheme essentially follows the similar technique as in AC s analysis, which introduces an additional O(log(ε 1)) error terms caused by Markov chain mixing, and an error of order O( 1 1 γ ) due to the mismatch between µπθ and dπθ. Theorem 3. Suppose Assumptions 1-6 hold. Consider the update of Algorithm 3 under the i.i.d. sampling. Let αk = α K for some positive constant α, βk = C9 2λϕ αk, ϱ 1 2C2 ψ , Na = O( K), Ka = O(log(K1/2)), Kc = O(log(K1/4)). Then, the following hold i=1 E θi J(θk) 2 O 1 + O(εapp + εsp) (108) k=0 J(θ ) J(θk) O 1 K1/4 + O(εapp + εsp + εactor). (109) The error εapp and εsp are defined in (12) and (4.2), respectively. The error εactor is referred as "compatible function approximation error", which is defined as: εactor := max θ min d Es dπθ ,a πθ[(ψθ(s, a)T d Aπθ(s, a))2]. Such an error captures the expressivity of the policy parameterization class: it measures the error of approximating Aπθ(s, a) using ψθ(s, a) as feature. The error becomes 0 when using the softmax-tabular parameterization; see more discussions in Section 6 of (Agarwal et al., 2019). Based on Theorem 3, Algorithm 3 needs K = O(ε 2) iterations to achieve ε-error for gradient norm square, and thus attains the sample complexity of KNa Ka = e O(ε 3), which matches the best existing sample complexity of NAC (Xu et al., 2020; Chen et al., 2022). 
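To illustrate how the inner loop (104)-(106) behaves, the following sketch runs it on synthetic data. The network size, feature dimensions, mixing matrix, step size, and projection radius are assumptions chosen only for illustration, and the score features $\psi^i$ and local gradients $g^i_a$ are random stand-ins for the quantities produced by Algorithm 3. The sketch checks that the decentralized iterate approaches the centralized solution of the subproblem (103).

```python
# Illustrative sketch of the inner loop (104)-(106) on synthetic data; all
# dimensions and hyperparameters below are assumptions, not the paper's values.
import numpy as np

rng = np.random.default_rng(3)
N, d_i, Na, Kz, Ka, rho_step, C_h = 5, 3, 200, 30, 200, 0.5, 10.0
d = N * d_i

W = np.zeros((N, N))                    # doubly stochastic mixing matrix (ring graph)
for i in range(N):
    W[i, i], W[i, (i - 1) % N], W[i, (i + 1) % N] = 0.5, 0.25, 0.25

psi = rng.normal(size=(Na, N, d_i)) / np.sqrt(d_i)   # stand-in for psi^i_theta(s_n, a^i_n)
g_a = rng.normal(size=(Na, N, d_i))                  # stand-in for g^i_a(xi_n, ., .)

h = np.zeros((N, d_i))                               # h^i_{k,0}, one block per agent
for _ in range(Ka):
    # (104): z^i_{n,0} = psi^i_n^T h^i, followed by K_z rounds of gossip over agents.
    z = np.einsum('nij,ij->ni', psi, h)              # shape (Na, N)
    for _ in range(Kz):
        z = z @ W.T                                  # z^i <- sum_j W_ij z^j
    # (105): local estimate of the subproblem gradient, using N * z^i ~ psi_n^T h.
    grad = np.einsum('ni,nij->ij', N * z, psi) / Na - g_a.mean(axis=0)
    # (106): projected gradient step, each agent staying in a ball of radius C_h.
    h = h - rho_step * grad
    norms = np.linalg.norm(h, axis=1, keepdims=True)
    h *= np.minimum(1.0, C_h / np.maximum(norms, 1e-12))

# Centralized reference: h* solves F h = g_bar with F the empirical Fisher matrix.
Psi = psi.reshape(Na, d)
F = Psi.T @ Psi / Na
g_bar = g_a.reshape(Na, d).mean(axis=0)
h_star = np.linalg.solve(F, g_bar)
rel_err = np.linalg.norm(h.reshape(d) - h_star) / np.linalg.norm(h_star)
print(f"relative error of decentralized h vs centralized h*: {rel_err:.2e}")
```

In this inner loop the only quantities communicated between neighbors are the scalars $z^i_{n,k'}$, which is what keeps the natural gradient estimation decentralized.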
In terms of the global optimality gap, the algorithm requires K = O(ε 4) iterations to achieve ε-error, and thus has the sample complexity of KNa Ka = e O(ε 6). Such a sample complexity is worse than the best existing sample complexity of e O(ε 3) (Xu et al., 2020; Chen et al., 2022). We now explain the gap for the sub-optimal sample complexity. Mimicking the analysis of (Chen et al., 2022) allows to establish the following inequality: k=0 J (θ ) E[J(θk)] O i=1 E[ θi J(θk) 2] i=1 E ωi k ω (θk) While our analysis can obtain the iteration complexity of O( 1 K ) for J(θk) 2, we can only achieve O( 1 K1/4 ) iteration complexity for critic s error ωk ω (θk) . This is because our algorithm uses single-timescale update, where the critic s error inevitably converges slower than that of double-loop based algorithms which have O( 1 K ) complexity for the critic s error at each iteration. Therefore, the sample complexity in terms of global optimality gap of our single-timescale NAC is dominated by this critic s error term, resulting in the final complexity of e O(ε 6). Nevertheness, the bound (110) is not necessarily tight. We leave the research on the tight bound of single-timescale NAC as a future work. Published in Transactions on Machine Learning Research (01/2023) E.3 Proof of Theorem 3 By Lemma 4, we have E[J(θk+1)] J(θk) E θJ(θk), θk+1 θk L 2 θk+1 θk 2 (i) αk E θJ(θk), hk L 2 NC2 hα2 k = αk E θJ(θk), F(θk) 1ga(ξk, ωk+1, λk+1) + αk E θJ(θk), hk F(θk) 1ga(ξk, ωk+1, λk+1) L 2 NC2 hα2 k (ii) = αk E F(θk) 1/2 θJ(θk), F(θk) 1/2ga(ξk, ωk+1, λk+1) + αk E θJ(θk), hk F(θk) 1ga(ξk, ωk+1, λk+1) L 2 NC2 hα2 k 2 F(θk) 1/2 θJ(θk) 2 + αk 2 F(θk) 1/2E[ga(ξk, ωk+1, λk+1)] 2 2 F(θk) 1/2 θJ(θk) F(θk) 1/2E[ga(ξk, ωk+1, λk+1)] 2 + αk E θJ(θk), hk F(θk) 1ga(ξk, ωk+1, λk+1) L 2 NC2 hα2 k 4 C 2 ψ θJ(θk) 2 + αk 2 λ 1 F E[ga(ξk, ωk+1, λk+1)] 2 2 λ 1 F θJ(θk) E[ga(ξk, ωk+1, λk+1)] 2 | {z } I1 4αk C2 ψ E[hk] F(θk) 1E[ga(ξk, ωk+1, λk+1)] 2 | {z } I2 2 NC2 hα2 k, (111) where (i) is due to θi k+1 θi k Ch := λF Cθ. Note that we use hi k to represent hi k,Ka for simplifying the notation. (ii) uses decomposition of positive definite (PD) matrix. Specifically, let A be PD matrix, then by eigenvalue decomposition, A = V ΛV T for some orthonormal matrix V . Define A 1/2 := V Λ 1/2V T , then x, A 1y = A 1/2x, A 1/2y for any x and y. (iii) uses C 2 ψ λ(F(θ) 1) λ 1 F and Young s inequality. I1 represents the error of gradient bias, which we have bounded when analyzing the error of AC. By (84), we have 16(εsp + C2 ψεapp) + 16C2 ψ ω (θk) ωi k+1 2 + 8C2 ψ λ (θk) λi k+1 2 . (112) To bound I2, we need to bound the error of hk,k . We start with the gradient bias when estimating hk,k . Define fk,k (hk,k ) := F(θk)hk,k E[ga(ξk, ωi k+1, λi k+1)], then it is easy to see that fk,k (hk,k ) is the unbiased gradient of the following problem 1 2h T k,k F(θk)hk,k E[ga(ξk, ωi k+1, λi k+1)]T hk,k . Define the following notation for the ease of expression: b f i k,k (hk,k ) := 1 n=1 ψθi k(sn, ai n)ψθk(sn, an)T hk,k gi a(ξk,k , ωi k+1, λi k+1) b fk,k (hk,k ) := [b f 1 k,k (hk,k ), , b f N k,k (hk,k )] e f i k,k (hk,k ) := N n=1 ψθi k(sn, ai n)zi n,Kz gi a(ξk,k , ωi k+1, λi k+1) e fk,k (hk,k ) := [e f 1 k,k (hk,k ), , e f N k,k (hk,k )]. Published in Transactions on Machine Learning Research (01/2023) We now analyze the error at outer-loop iteration k. For notational simplicity, we omit the subscript k for the prementioned notations, e.g. we use b f i k (hk ), b fk (hk ), e f i k (hk ), e fk (hk ) to represent the above notations, respectively. 
fk (hk ) e fk (hk ) 2 2 fk (hk ) b fk (hk ) 2 | {z } I3 +2 b fk (hk ) e fk (hk ) 2 | {z } I4 I3 can be bounded as Na ψθ(sn, an)ψθ(sn, an)T F(θ))hk 2 Na ψθ(sn, an)ψθ(sn, an)T F(θ)) 2C2 h Na C4 ψC2 h. (113) I4 can be bounded as ψθi(sn, ai n) n=1 Nzi n,Kz ψθ(sn, an)T hk n=1 zi n,Kz zn,Kz 2 n=1 QW Kzzn,0 2 n=1 νKz zn,0 2 NC4 ψC2 hνKz. (114) Let Kz = min{c N+|νc 4 Na N }, then Kz = O(log 1 Na ). Combine (113) and (114) gives us fk (hk ) e fk (hk ) 2 4C4 ψC2 h Na . We now analyze the error of hk,k . Note that we omit the subscript k here for simplifying notation. Define h = arg min h fθ(h) := h T F(θ)h := Eξ µθ[ga(ξ, ω, λ)]T h. (115) It is easy to see that the function on the RHS is strongly convex, since F(θ) is positive definite w.r.t. h. We bound the optimal gap by E hk +1 h 2 = E hk ϱe fk (hk ) h 2 = E hk h 2 2ϱE hk h , e fk (hk ) + ϱ2 e fk (hk ) 2 E hk h 2 2ϱE hk h , fk (hk ) + 2ϱE hk h , fk (hk ) e fk (hk ) + 2ϱ2 fk (hk ) 2 + 2ϱ2 e fk (hk ) fk (hk ) 2 (i) (1 ϱλF )E hk h 2 2ϱ(fk (hk ) f ) + 2ϱE hk h , fk (hk ) e fk (hk ) + 2ϱ2 fk (hk ) 2 + 2ϱ2 e fk (hk ) fk (hk ) 2 Published in Transactions on Machine Learning Research (01/2023) (ii) (1 ϱλF )E hk h 2 2ϱ(1 2ϱC2 ψ)(fk (hk ) f ) + 2ϱE hk h , fk (hk ) e fk (hk ) + 2ϱ2 e fk (hk ) fk (hk ) 2 (iii) (1 ϱλF )E hk h 2 + 2ϱE hk h , fk (hk ) e fk (hk ) + 2ϱ2 e fk (hk ) fk (hk ) 2 (iiii) (1 ϱλF 2 )E hk h 2 + ( 2ϱ λF + 2ϱ2) e fk (hk ) fk (hk ) 2, where f is the optimal value of f(h) defined in (115), and the inequality follows the property of λF -strongly convex function: f(h2) f(h1) + f(h1), h2 h2 + λF 2 h1 h2 2, h1, h2. (ii) uses the PL condition implied by λF -strong convexity: f(h ) f(h) 1 2λF f(h) 2, h. (iii) is due to step size rule that ϱ 1 2C2 ψ . (iiii) applies Young s inequality. Use the above induction, we have E h Ka h 2 (1 ϱλF 2 )Ka h0 h 2 + λF + 2ϱ2) f Ka t(h Ka) e f Ka(h Ka) 2 4C2 h(1 ϱλF 2 )Ka + ( 4ϱ λF )C4 ψC2 h 4 Na . Let Ka = min{c N+|4C2 h(1 ϱλF 2 )c = ( 4ϱ λF )C4 ψC2 h 1 Na }, then Ka = O(log( 1 Na )). Define C18 := ϱλ2 F + 16ϱ λF )C4 ψC2 h, we have I2 = E h Ka h 2 2C18 Plug (112) and (116) back to (111), we have E[J(θk+1)] J(θk) 4 C 2 ψ θi J(θk) 2 + αk 2 λF F(θk) 1E[gi a(ξk, ωi k+1, λi k+1)] 2 + αk C2 ψ 2C18 + 8λ 1 F (εsp + C2 ψεapp) + 8λ 1 F C2 ψ ω (θk) ωi k+1 2 + 4λ 1 F C2 ψ λ (θk) λi k+1 2] Consider the Lyapunov function Vk = J(θk) + λ 1 F ( ωk ω (θk) 2 + λk λ (θk) 2). The difference of the Lyapunov function is E[Vk+1] E[Vk] = E[J(θk)] E[J(θk+1)] + λ 1 F (E ωk+1 ω (θk+1) 2 E ωk ω (θk) 2 + E λk+1 λ (θk+1) 2 E λk λ (θk) 2) 4 C 2 ψ E θi J(θk) 2 + αk 2 λF F(θk) 1E[gi a(ξk, ωi k+1, λi k+1)] 2 + αk C2 ψ 2C18 i=1 8C2 ψαk E ω (θk) ωi k+1 2 + E ωk+1 ω (θk+1) 2 E ωk ω (θk) 2 # i=1 4C2 ψαk E λ (θk) λi k+1 2 + E λk+1 λ (θk+1) 2 E λk λ (θk) 2 # | {z } I6 + 8Nλ 1 F (εsp + C2 ψεapp). (117) Published in Transactions on Machine Learning Research (01/2023) By following the similar procedures through (87) to (91), we can bound I5 and I6 as I5 (1 + C19αk)C2 δ β2 k + αk i=1 E F(θk) 1gi a(ξk, ωi k+1, λi k+1) 2 + αk Mk1 + C20α2 k (118) I6 (1 + C21αk)C2 λη2 k + αk i=1 E F(θk) 1gi a(ξk, ωi k+1, λi k+1) 2 + αk Mk2 + C22α2 k, (119) where C19, C20, C21, C22 are some positive constants. Plug (118) and (119) back to (117), we have E[Vk+1] E[Vk] 4 C 2 ψ E θi J(θk) 2 + αk C2 ψ 2C18 Na + O(α2 k + β2 k + η2 k) + (Mk1 + Mk2)αk + O(εsp + εapp)αk]. (120) By telescoping (120), we can get i=1 E θi J(θk) 2 4C2 ψV0 Kαk + O(εsp + εapp) + 8C2 ψC18 Na + O(αk + β2 k αk + η2 k αk ) + 4C2 ψ 1 K k=0 (Mk1 + Mk2) By (95), Mk1 + Mk2 = O( 1 K ) when Kc O(K1/4). 
Therefore, let C, α be some positive constants. Set K , βk = C9 2λϕ αk, ηk = C10 2λφ αk, we obtain the desired result of (108). We now prove (109). Let Eθ denote the expectation over s dπθ , a πθ ( |s). By the smoothness of log πθ(a|s), we have Eθ [log πθk+1(a|s) log πθk(a|s)] αk Eθ [ψθk(s, a)T hk] Lψα2 k 2 C2 h αk Eθ [ψθk(s, a)T (hk h (θk))] + αk Eθ [ψθk(s, a)T h (θk) Aθk(s, a)] + αk Eθ [Aθk(s, a)] Lψα2 k 2 C2 h αk Cψ hk h (θk) αk εactor + αk(J(θ ) J(θk)) Lψα2 k 2 C2 h. By telescoping the above inequality and rearranging terms, we have k=1 (J(θ ) J(θk)) 1 Kαk Eθ [log πK(a|s) log π0(a|s)] + εactor k=1 Cψ hk h (θk) + 1 The term hk h (θk) hk F(θk) 1E[ga(ξk, ωk+1, λk+1] + E[ga(ξk, ωk+1, λk+1] F 1 J(θk) . Since by the (116) and (84), these two terms are of order O( 1 N 1/2 a ) and O( ωk ωk+1 + εapp), respectively, we conclude that hk h (θk) is of order O( ωk ω (θk) + εapp). By following the step size rule as suggested by Theorem 3, we obtain the desired result.