# Finite-Time Analysis of Single-Timescale Actor-Critic

Xuyang Chen (National University of Singapore, chenxuyang@u.nus.edu)
Lin Zhao (National University of Singapore, elezhli@nus.edu.sg)

Abstract: Actor-critic methods have achieved significant success in many challenging applications. However, their finite-time convergence is still poorly understood in the most practical single-timescale form. Existing works on analyzing single-timescale actor-critic have been limited to i.i.d. sampling or the tabular setting for simplicity. We investigate the more practical online single-timescale actor-critic algorithm on continuous state space, where the critic assumes linear function approximation and updates with a single Markovian sample per actor step. Previous analyses have been unable to establish convergence in such a challenging scenario. We demonstrate that the online single-timescale actor-critic method provably finds an $\epsilon$-approximate stationary point with $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under standard assumptions, which can be further improved to $\mathcal{O}(\epsilon^{-2})$ under i.i.d. sampling. Our novel framework systematically evaluates and controls the error propagation between the actor and critic. It offers a promising approach for analyzing other single-timescale reinforcement learning algorithms as well.

1 Introduction

Actor-critic (AC) methods have achieved great success in solving many challenging reinforcement learning (RL) problems [17, 20, 24]. AC updates the actor (i.e., the policy) using the estimated policy gradient (PG), which is a function of the Q-value under the policy. Meanwhile, it employs a bootstrapping critic to estimate the Q-value, which often helps reduce variance and accelerates convergence in practice.

Despite the empirical success, the non-asymptotic convergence analysis of AC in its most practical single-timescale form remains underexplored. A large body of existing works considers double-loop variants, where the critic runs many steps to accurately estimate the Q-value for a given actor [37, 16, 27, 2]. This leads to a decoupled convergence analysis of the critic and the actor, which involves a policy evaluation sub-problem in the inner loop and a perturbed gradient descent in the outer loop. Its finite-time convergence is relatively well understood [16, 37, 35]. Nevertheless, the double-loop setting is mainly for ease of analysis and is barely adopted in practice. Since it requires an accurate critic estimate, it is typically sample inefficient. In fact, it is unclear whether an inner loop of accurate policy evaluation is really necessary, since it only corresponds to one transient policy among many iterations.

Another body of works considers the (single-loop) two-timescale variants [31, 9, 36, 13], where the actor and the critic are updated simultaneously in each iteration with stepsizes of different timescales. The actor stepsize is typically smaller than that of the critic, with their ratio converging to zero as the iteration number goes to infinity. Hence, the actor is updated much more slowly than the critic. The two-timescale design allows the critic to approximate the desired Q-value in an asymptotic way, which enables a decoupled convergence analysis of the actor and the critic. This variant is occasionally
adopted to improve learning stability. However, it is still considered inefficient, as the actor update is artificially slowed down.

Table 1: Comparison with related single-timescale actor-critic algorithms

| Reference  | State Space | Reward     | Actor Sampling | Critic Sampling | Sample Complexity |
|------------|-------------|------------|----------------|-----------------|-------------------|
| [21]       | Finite      | Discounted | i.i.d.         | i.i.d.          | $\mathcal{O}(\epsilon^{-2})$ |
| [8]        | Infinite    | Discounted | i.i.d.         | i.i.d.          | $\mathcal{O}(\epsilon^{-2})$ |
| This paper | Infinite    | Average    | Markovian      | Markovian       | $\widetilde{\mathcal{O}}(\epsilon^{-2})$ |

In this paper, we consider the more practical single-timescale AC algorithm, which is the one introduced in many works in the literature, as well as in [25], as a classic AC algorithm. In single-timescale AC, the stepsizes of the critic and the actor diminish at the same timescale. Unlike the aforementioned variants, which have specialized designs aimed at simplifying the convergence analysis, the analysis of single-timescale AC presents a greater challenge. Due to the substantial errors in critic estimation and the close coupling between the parallel critic and actor updates, the algorithm is more prone to unstable error propagation. It remains unclear under what condition the errors will converge to zero.

To study its finite-time convergence, we consider the challenging undiscounted time-average reward formulation [25, 31, 37], which consists of three parallel updates: the (time-average) reward estimator, the critic, and the actor. We keep track of the reward estimation error, the critic error, and the policy gradient norm (which measures the actor error) by deriving an implicit bound for each of them. They are then analyzed altogether as an interconnected system, inspired by [21], to establish convergence simultaneously. Specifically, we identify a (constant) ratio between the actor stepsize and the critic stepsize, below which all three errors diminish to zero, despite the inaccurate estimation in all three updates (reward estimation, critic, actor).

1.1 Main Contributions

We summarise our main contributions as follows.

We provide a finite-time analysis for single-timescale AC under Markovian sampling and prove an $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity, where $\widetilde{\mathcal{O}}(\cdot)$ hides additional logarithmic terms. We further show that this sample complexity can be improved to $\mathcal{O}(\epsilon^{-2})$ under i.i.d. sampling, which matches the state-of-the-art performance of SGD on general non-convex optimization problems. Our proof clearly shows that the additional logarithmic term under Markovian sampling is introduced by the mixing time of the underlying Markov chain.

Our result compares favorably to existing works on single-timescale AC. To our knowledge, the only other results on single-timescale AC in the general MDP (Markov decision process) case are from [8] and [21], both of which obtain a sample complexity of $\mathcal{O}(\epsilon^{-2})$ under the discounted reward setting. However, both [8] and [21] considered i.i.d. sampling, where the transition tuples are independently sampled from the stationary distribution and the discounted state-action visitation distribution, respectively. In this paper, we consider the more practical Markovian sampling, where the transition tuples are generated from a single trajectory (see Table 1). Furthermore, [8] follows an explicit Lyapunov analysis that leaves a bias term in the critic bound, which is only eliminated in the actor analysis. Therefore, their proof framework cannot show the convergence of the critic. We instead give a neat proof framework that guarantees convergence for both the critic and the actor.
Additionally, [21] only considered the tabular case (finite state-action space), whereas we allow the state space to be infinite (see Table 1). It is worth emphasizing that moving from a finite state space to an infinite state space takes significant and non-trivial effort in the analysis. The analysis in [21] concatenates all state-action pairs to create a finite-dimensional feature matrix, which becomes impossible in the infinite state space scenario. Consequently, their analysis techniques and established results are not applicable in our context.

Technically, we develop a new analysis framework that can establish the finite-time convergence of single-timescale AC under the general setting. The existing analyses for double-loop AC [37] and two-timescale AC [31] hinge on decoupling the analysis of the actor and the critic, which typically establishes the convergence of the critic first and then the actor [37, 31, 8]. We instead investigate the evolution of the coupled estimation errors of the time-average reward, the critic, and the policy gradient norm altogether as an interconnected system, in a much less conservative way. We emphasize that our analysis framework includes the discounted setting as a simple special case, where the interconnected system is only two-dimensional without the time-average reward estimation.

1.2 Related Work

Policy gradient methods. Policy gradient methods [26, 25] learn a parameterized policy, constituting a departure from the value-based approach [28, 32, 40]. The asymptotic convergence of policy gradient methods has been well established in [29, 26, 3, 14] via stochastic approximation methods [6]. Some recent works have shown that PG methods can find the global optimum for particular classes of problems, such as the LQR [11, 19] and the tabular case [1]. Under the general function approximation setting, the finite-time convergence of PG methods was analyzed in [1, 38, 33, 34]. Specifically, [1] established the finite-time convergence of PG methods under both tabular policy parameterizations and general parametric policy classes. [38] showed that a variant of PG methods can attain an $\epsilon$-accurate stationary point with a sample complexity of $\mathcal{O}(\epsilon^{-2})$, where Monte-Carlo sampling is adopted to obtain an unbiased estimate of the policy gradient. [33, 34] studied variance-reduced and accelerated PG methods.

Actor-critic methods. The AC algorithm was initially proposed by [15]. Later, [14] extended it to the natural AC algorithm. The asymptotic convergence of AC algorithms has been well established in [14, 5, 7, 39] under various settings. Many recent works focus on the finite-time convergence of AC methods. Under the double-loop setting, [37] established the global convergence of AC methods for solving the linear quadratic regulator (LQR). [27] studied the global convergence of AC methods with both the actor and the critic parameterized by neural networks. [16] studied the finite-time local convergence of a few AC variants with linear function approximation. Under the two-timescale setting, [31] established finite-time local convergence to a stationary point with an $\widetilde{\mathcal{O}}(\epsilon^{-2.5})$ sample complexity under the undiscounted time-average reward setting. [36] studied both local and global convergence for two-timescale (natural) AC, with $\widetilde{\mathcal{O}}(\epsilon^{-2.5})$ and $\widetilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity, respectively, under the discounted accumulated reward; their algorithm collects multiple samples to update the critic.
[13] proposed a two-timescale stochastic approximation algorithm for bilevel optimization, which was subsequently employed in the context of two-timescale AC. [9] established the global convergence of two-timescale AC methods for solving the LQR, where only a single sample is used to update the critic in each iteration. Under the single-timescale setting, [12] considered a least-squares temporal difference (LSTD) update for the critic and obtained the optimal policy within the energy-based policy class for both linear function approximation and nonlinear function approximation using neural networks. [41] studied single-timescale AC for the LQR. In addition, [8] and [21] considered single-timescale AC in general MDP cases, which have been reviewed and compared in Section 1.1.

Notation. We use non-bold letters to denote scalars and lower- and upper-case bold letters to denote vectors and matrices, respectively. Without further specification, we write $x_n = \mathcal{O}(y_n)$ if there exists an absolute positive constant $C$ such that $x_n \le C y_n$ for two sequences $\{x_n\}$ and $\{y_n\}$. We use $\widetilde{\mathcal{O}}(\cdot)$ to additionally hide logarithmic factors. The total variation distance between two probability measures $\mu$ and $\nu$ is defined by $d_{TV}(\mu, \nu) := \frac{1}{2}\int_{\mathcal{X}} |\mu(dx) - \nu(dx)|$. In addition, we use $\mathbb{P}$ to denote a generic probability of some random event.

2 Preliminaries

In this section, we review the basics of the Markov decision process, the policy gradient algorithm, and single-timescale AC with linear function approximation.

2.1 Markov decision process

We consider a standard Markov decision process (MDP) characterized by $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r)$, where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. We consider a finite action space $|\mathcal{A}| < \infty$, whereas the state space can be either a finite set or an (unbounded) real vector space $\mathcal{S} \subseteq \mathbb{R}^n$. $\mathcal{P}(s_{t+1}|s_t, a_t) \in [0, 1]$ denotes the transition kernel. We consider a bounded reward $r: \mathcal{S} \times \mathcal{A} \to [-U_r, U_r]$, which is a function of the state $s$ and action $a$. A policy $\pi_\theta(\cdot|s)$ parameterized by $\theta$ is defined as a mapping from a given state to a probability distribution over actions. The RL problem under consideration aims to find a policy $\pi_\theta$ that maximizes the infinite-horizon time-average reward [26, 25, 37, 31], which is given by
$$J(\theta) := \lim_{T\to\infty} \mathbb{E}_\theta\!\left[\frac{\sum_{t=0}^{T-1} r(s_t, a_t)}{T}\right] = \mathbb{E}_{s\sim\mu_\theta,\, a\sim\pi_\theta}[r(s, a)],$$
where the expectation $\mathbb{E}_\theta$ is over the Markov chain under the policy $\pi_\theta$, and $\mu_\theta$ denotes the stationary state distribution induced by $\pi_\theta$. The existence of the stationary distribution is guaranteed by the uniform ergodicity of the underlying MDP, which is a common assumption. Hereafter, we refer to $J(\theta)$ as the time-average reward (or, exchangeably, the performance function), which can be evaluated as the expected reward over the stationary distribution $\mu_\theta$ and the policy $\pi_\theta$. The state-value function evaluates the overall rewards starting from a state $s$ and following policy $\pi_\theta$ thereafter, which is defined as
$$V_\theta(s) := \mathbb{E}_\theta\!\left[\sum_{t=0}^{\infty} \big(r(s_t, a_t) - J(\theta)\big) \,\Big|\, s_0 = s\right],$$
where the action follows the policy $a_t \sim \pi_\theta(\cdot|s_t)$ and the next state comes from the transition kernel $s_{t+1} \sim \mathcal{P}(\cdot|s_t, a_t)$. Similarly, we define the action-value (Q-value) function to evaluate the overall rewards starting from $s$, taking action $a$, and following policy $\pi_\theta$ thereafter:
$$Q_\theta(s, a) = \mathbb{E}_\theta\!\left[\sum_{t=0}^{\infty} \big(r(s_t, a_t) - J(\theta)\big) \,\Big|\, s_0 = s,\, a_0 = a\right] \overset{(i)}{=} r(s, a) - J(\theta) + \mathbb{E}[V_\theta(s')],$$
where the expectation in (i) is taken over $s' \sim \mathcal{P}(\cdot|s, a)$.
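To make these average-reward quantities concrete, the following Python sketch computes $J(\theta)$, the differential value function $V_\theta$, and $Q_\theta$ for a hypothetical two-state, two-action MDP under a fixed policy, by solving for the stationary distribution and the associated Poisson (differential Bellman) equation. The MDP, policy, and all numbers are made-up illustrations and are not part of the analysis.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers made up for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']: transition kernel
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[1.0, 0.0],                  # r[s, a]: bounded reward
              [0.5, 2.0]])
pi = np.array([[0.7, 0.3],                 # pi[s, a]: a fixed policy pi_theta(a|s)
               [0.4, 0.6]])

P_pi = np.einsum('sa,sap->sp', pi, P)      # state-to-state kernel under pi
r_pi = np.einsum('sa,sa->s', pi, r)        # expected one-step reward under pi

# Stationary distribution mu_theta: left eigenvector of P_pi for eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()

J = float(mu @ r_pi)                       # time-average reward J(theta)

# Differential value function: (I - P_pi) V = r_pi - J, pinned down by mu @ V = 0.
n = len(mu)
A = np.vstack([np.eye(n) - P_pi, mu])
b = np.append(r_pi - J, 0.0)
V = np.linalg.lstsq(A, b, rcond=None)[0]

Q = r - J + P @ V                          # Q(s,a) = r(s,a) - J + E_{s'}[V(s')]
print(f"J = {J:.4f}, V = {V}, Q =\n{Q}")
```

The constraint $\mu^\top V = 0$ pins down the additive constant, since the differential value function is only defined up to a constant shift.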
2.2 Policy gradient theorem

The policy gradient theorem [26] provides an analytic expression for the gradient of the performance function $J(\theta)$ with respect to the policy parameter $\theta$:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s\sim\mu_\theta,\, a\sim\pi_\theta}\big[Q_\theta(s, a)\, \nabla_\theta \log \pi_\theta(a|s)\big]. \quad (1)$$
Evaluating this gradient requires the Q-value corresponding to the current policy $\pi_\theta$. REINFORCE [29] is a Monte Carlo-based episodic algorithm, which uses all the rewards collected along the sample trajectory (that is, the differential return) as an approximation of the true Q-value. Note that for any function $b: \mathcal{S} \to \mathbb{R}$ that is independent of the action, we have
$$\sum_{a\in\mathcal{A}} b(s)\, \nabla\pi_\theta(a|s) = b(s)\, \nabla\Big(\sum_{a\in\mathcal{A}} \pi_\theta(a|s)\Big) = b(s)\, \nabla 1 = 0.$$
Therefore, the policy gradient theorem can be written equivalently as
$$\nabla J(\theta) = \mathbb{E}_{s\sim\mu_\theta,\, a\sim\pi_\theta}\big[(Q_\theta(s, a) - b(s))\, \nabla_\theta \log \pi_\theta(a|s)\big],$$
where $b(s)$ is called the baseline function. A popular choice of baseline is the state-value function, which leads to the following advantage-based policy gradient
$$\nabla_\theta J(\theta) = \mathbb{E}_{s\sim\mu_\theta,\, a\sim\pi_\theta}\big[(Q_\theta(s, a) - V_\theta(s))\, \nabla_\theta \log \pi_\theta(a|s)\big],$$
where $Q_\theta(s, a) - V_\theta(s)$ is known as the advantage function. This is REINFORCE with baseline [29]. The baseline function can help reduce variance. However, like all Monte Carlo-based methods, it can still suffer from high variance and thus learns slowly. In addition, it is inconvenient to implement the algorithm online for continuing tasks [25]. The AC algorithm instead employs a bootstrapping critic to estimate the Q-value. We describe the classic single-timescale AC in the next subsection.

2.3 The single-timescale actor-critic algorithm

We consider the classic single-sample single-timescale AC method, where the critic is bootstrapping and uses a single sampled reward to update in each iteration. This directly accommodates online learning for continuing tasks. We consider the following linear function approximation of the state-value function:
$$\widehat{V}_\theta(s; \omega) = \phi(s)^\top \omega,$$
where $\phi(\cdot): \mathcal{S} \to \mathbb{R}^d$ is a known feature mapping satisfying $\|\phi(\cdot)\| \le 1$. To drive $\widehat{V}_\theta(s; \omega)$ towards its true value $V_\theta(s)$, the semi-gradient TD(0) update is applied to estimate the linear coefficient $\omega$ (hereafter referred to as the critic):
$$\omega_{t+1} = \omega_t + \beta_t\big[r_t - J(\theta) + \phi(s_{t+1})^\top\omega_t - \phi(s_t)^\top\omega_t\big]\phi(s_t), \quad (2)$$
where $\beta_t$ is the stepsize of the critic $\omega$ and $r_t := r(s_t, a_t)$. Since $J(\theta)$ is unknown, the time-average reward setting introduces an additional estimator $\eta$ to estimate it. Hereafter, we simply refer to $\eta$ as the reward estimator. The temporal difference error can then be defined as
$$\delta_t := r_t - \eta_t + \phi(s_{t+1})^\top\omega_t - \phi(s_t)^\top\omega_t.$$
The update rules for the reward estimator and the critic are given by
$$\eta_{t+1} = \eta_t + \gamma_t(r_t - \eta_t), \qquad \omega_{t+1} = \omega_t + \beta_t\delta_t\phi(s_t),$$
where $\gamma_t$ is the stepsize of the reward estimator $\eta_t$. Since $\delta_t$ is an approximation of the advantage function, similar to REINFORCE with baseline, the corresponding update rule for the actor can be written as
$$\theta_{t+1} = \theta_t + \alpha_t\delta_t\nabla_\theta\log\pi_{\theta_t}(a_t|s_t),$$
where $\alpha_t$ is the actor stepsize. The above-described AC is summarized in Algorithm 1, which is introduced in [25] as a classic online one-step AC algorithm. Algorithm 1 can be efficiently implemented for continuing tasks due to its online nature.

Algorithm 1 Single-timescale Actor-Critic
1: Input: initial actor parameter $\theta_0$, initial critic parameter $\omega_0$, initial reward estimator $\eta_0$, stepsizes $\alpha_t$ for the actor, $\beta_t$ for the critic, and $\gamma_t$ for the reward estimator.
2: Draw $s_0$ from some initial distribution
3: for $t = 0, 1, 2, \dots, T-1$ do
4:   Take action $a_t \sim \pi_{\theta_t}(\cdot|s_t)$
5:   Observe the next state $s_{t+1} \sim \mathcal{P}(\cdot|s_t, a_t)$ and the reward $r_t = r(s_t, a_t)$
6:   $\delta_t = r_t - \eta_t + \phi(s_{t+1})^\top\omega_t - \phi(s_t)^\top\omega_t$
7:   $\eta_{t+1} = \eta_t + \gamma_t(r_t - \eta_t)$
8:   $\omega_{t+1} = \Pi_{U_\omega}(\omega_t + \beta_t\delta_t\phi(s_t))$
9:   $\theta_{t+1} = \theta_t + \alpha_t\delta_t\nabla_\theta\log\pi_{\theta_t}(a_t|s_t)$
10: end for

Note that "single-timescale" refers to the fact that the stepsizes $\alpha_t$, $\beta_t$, $\gamma_t$ are only constantly proportional to each other. In addition, this is a single-sample algorithm, since only one sample is needed for the update in each iteration. We remark that Algorithm 1 is more common in practice than the double-loop variants. In Line 8 of Algorithm 1, a projection $\Pi_{U_\omega}$ is introduced to keep the critic norm bounded by $U_\omega$, which is widely adopted in the literature [31, 37, 36, 8] for analysis. The projection can be handled easily; it is relaxed using its non-expansive property in our analysis.

3 Main Results

We first present several standard assumptions that are common in the literature on analyzing AC with linear function approximation [12, 35, 31, 8, 21]. Insights into these conditions and connections with relevant works are also discussed.

3.1 Assumptions

Taking the expectation of $\omega_{t+1}$ in (2) with respect to the stationary distribution, we have, for any given $\omega_t$,
$$\mathbb{E}_\theta[\omega_{t+1}|\omega_t] = \omega_t + \beta_t(b_\theta + A_\theta\omega_t), \quad (3)$$
$$A_\theta := \mathbb{E}_{(s,a,s')}\big[\phi(s)(\phi(s') - \phi(s))^\top\big], \quad (4)$$
$$b_\theta := \mathbb{E}_{(s,a)}\big[(r(s, a) - J(\theta))\phi(s)\big], \quad (5)$$
where $s \sim \mu_\theta(\cdot)$, $a \sim \pi_\theta(\cdot|s)$, and $s' \sim \mathcal{P}(\cdot|s, a)$ is the subsequent state of $(s, a)$. It can be easily shown [25] that the TD limiting point $\omega^*(\theta)$ satisfies
$$b_\theta + A_\theta\omega^*(\theta) = 0. \quad (6)$$
Note that $A_\theta$ reflects the exploration of the policy. To see this, note that without sufficient exploration, $A_\theta$ can be rank deficient and (6) can be unsolvable. Consequently, the critic update (2) will not converge. Hence, the following assumption is made to guarantee the problem's solvability.

Assumption 3.1 (Exploration). For any $\theta$, the matrix $A_\theta$ defined in (4) is negative definite and its maximum eigenvalue is upper bounded by $-\lambda$ for some $\lambda > 0$.

Assumption 3.1 is commonly adopted in analyzing TD learning with linear function approximation [4, 42, 31, 23, 8, 21]. In particular, Assumption 3.1 holds if the policy $\pi_\theta$ can explore all state-action pairs in the tabular case [21]. In addition, with this assumption, we can choose $U_\omega = 2U_r/\lambda$ so that all $\omega^*(\theta)$ lie within the projection radius $U_\omega$, because $\|b_\theta\| \le 2U_r$ and $\|A_\theta^{-1}\| \le \lambda^{-1}$. This justifies the projection operator introduced in Line 8 of Algorithm 1.

Assumption 3.2 (Uniform ergodicity). For any $\theta$, denote by $\mu_\theta(\cdot)$ the stationary distribution induced by the policy $\pi_\theta(\cdot|s)$ and the transition probability measure $\mathcal{P}(\cdot|s, a)$. For a Markov chain generated by the policy $\pi_\theta$ and the transition kernel $\mathcal{P}$, there exist $m > 0$ and $\rho \in (0, 1)$ such that
$$d_{TV}\big(\mathbb{P}(s_\tau \in \cdot \mid s_0 = s),\, \mu_\theta(\cdot)\big) \le m\rho^\tau, \quad \forall \tau \ge 0,\, s \in \mathcal{S}.$$

Assumption 3.2 assumes the Markov chain mixes geometrically, which is implied by uniform ergodicity. It is commonly employed to characterize the noise induced by Markovian sampling in RL algorithms [4, 42, 31, 8, 21].

Assumption 3.3 (Lipschitz continuity of policy). Let $\pi_\theta(a|s)$ be a policy parameterized by $\theta \in \mathbb{R}^d$. There exist positive constants $B$, $L_l$, and $L_\pi$ such that for any $\theta, \theta_1, \theta_2 \in \mathbb{R}^d$, $s \in \mathcal{S}$, and $a \in \mathcal{A}$, it holds that: (a) $\|\nabla\log\pi_\theta(a|s)\| \le B$; (b) $\|\nabla\log\pi_{\theta_1}(a|s) - \nabla\log\pi_{\theta_2}(a|s)\| \le L_l\|\theta_1 - \theta_2\|$; (c) $|\pi_{\theta_1}(a|s) - \pi_{\theta_2}(a|s)| \le L_\pi\|\theta_1 - \theta_2\|$.

Assumption 3.3 is standard in the literature on policy gradient methods [22, 42, 38, 34, 31, 8, 21]. It holds for many policy classes, such as the Gaussian policy [10], the Boltzmann policy [15], and the tabular softmax policy [1].
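To connect Algorithm 1 with the assumptions above, the following Python sketch implements its three coupled updates with a Boltzmann (softmax) policy, which satisfies Assumption 3.3, and a projected linear critic. The environment interface (`env.reset`, `env.step`) and the choice of actor features are assumptions made purely for illustration; this is a minimal sketch, not the paper's reference implementation.

```python
import numpy as np

def single_timescale_ac(env, phi, n_actions, T, alpha, beta, gamma, U_omega, d_actor=None):
    """Minimal sketch of Algorithm 1: one Markovian sample per iteration, with the
    reward estimator, critic, and actor updated in parallel (single-timescale).
    Assumed interfaces: env.reset() -> state, env.step(a) -> (next_state, reward);
    phi is the critic feature map with ||phi(s)|| <= 1."""
    s = env.reset()
    d = phi(s).shape[0]
    d_actor = d if d_actor is None else d_actor
    omega = np.zeros(d)                      # critic parameter
    theta = np.zeros((n_actions, d_actor))   # actor parameter (softmax policy)
    eta = 0.0                                # estimate of the average reward J(theta)

    def x(state):                            # actor features (assumption: reuse phi)
        return phi(state)[:d_actor]

    def policy(state):
        logits = theta @ x(state)
        p = np.exp(logits - logits.max())
        return p / p.sum()

    for t in range(T):
        p = policy(s)
        a = np.random.choice(n_actions, p=p)
        s_next, r = env.step(a)

        # Line 6: TD error with the average-reward baseline eta.
        delta = r - eta + phi(s_next) @ omega - phi(s) @ omega

        # Lines 7-9: reward-estimator, projected critic, and actor updates.
        eta += gamma * (r - eta)
        omega += beta * delta * phi(s)
        norm = np.linalg.norm(omega)
        if norm > U_omega:                   # projection onto the ball of radius U_omega
            omega *= U_omega / norm
        grad_log_pi = np.outer(-p, x(s))     # gradient of log softmax w.r.t. theta
        grad_log_pi[a] += x(s)
        theta += alpha * delta * grad_log_pi

        s = s_next
    return theta, omega, eta
```

In the single-timescale regime analyzed below, one would call this with constant stepsizes of the same order, e.g., `alpha = c / np.sqrt(T)` and `beta = gamma = 1 / np.sqrt(T)` for a constant ratio `c`.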
Assumption 3.4. For any $\theta, \theta' \in \mathbb{R}^d$, there exists a constant $L_\mu$ such that $\|\mu_\theta - \mu_{\theta'}\| \le L_\mu\|\theta - \theta'\|$, where $\mu_\theta(s)$ is the stationary distribution under the policy $\pi_\theta$.

Assumption 3.4 is introduced in [8] to show the smoothness of the optimal critic $\omega^*(\theta)$, which is critical to guarantee the convergence of single-timescale AC. This assumption holds for the finite state-action space setting [18].

3.2 Finite-Time Analysis

We define the following uniform upper bound on the linear function approximation error of the critic:
$$\epsilon_{app} := \sup_\theta \sqrt{\mathbb{E}_{s\sim\mu_\theta}\big(\phi(s)^\top\omega^*(\theta) - V_\theta(s)\big)^2}.$$
The error $\epsilon_{app}$ is zero if $V_\theta$ is indeed a linear function for every $\theta$. Naturally, it can be expected that the learning errors of Algorithm 1 depend on $\epsilon_{app}$. We define the following integer $\tau_T$, which depends on the total number of iterations $T$ and will be useful in the statement of the theorems:
$$\tau_T := \min\Big\{i \ge 0 \,\Big|\, m\rho^{i-1} \le \tfrac{1}{\sqrt T}\Big\},$$
where $m, \rho$ are the constants defined in Assumption 3.2. Accordingly, we can choose $\tau_T = \big\lceil \big(\log(m\rho^{-1}) + \tfrac{1}{2}\log T\big)/\log(\rho^{-1}) \big\rceil = \mathcal{O}(\log T)$ so that $m\rho^{\tau_T - 1} \le \frac{1}{\sqrt T}$. The integer $\tau_T$ represents the mixing time of an ergodic Markov chain and will be used to control the Markovian noise in the analysis.

We quantify the learning errors by defining $y_t := \eta_t - J(\theta_t)$, which is the difference between the reward estimator and the true time-average reward $J(\theta_t)$ at time $t$. For the critic, we define $z_t := \omega_t - \omega_t^*$ with $\omega_t^* := \omega^*(\theta_t)$ to measure the error between the critic and its target value at iteration $t$. The following two theorems summarize our main results.

Theorem 3.5 (Markovian sampling). Consider Algorithm 1 with $\alpha_t = \alpha = \frac{c}{\sqrt T}$, $\beta_t = \beta = \frac{1}{\sqrt T}$, $\gamma_t = \gamma = \frac{1}{\sqrt T}$, where $c$ is a constant depending on problem parameters. Suppose Assumptions 3.1-3.4 hold. Then, for $T \ge 2\tau_T$, we have
$$\frac{1}{T-\tau_T}\sum_{t=\tau_T}^{T-1} \mathbb{E}\,y_t^2 = \mathcal{O}\Big(\frac{\log^2 T}{\sqrt T}\Big) + \mathcal{O}(\epsilon_{app}), \qquad \frac{1}{T-\tau_T}\sum_{t=\tau_T}^{T-1} \mathbb{E}\,\|z_t\|^2 = \mathcal{O}\Big(\frac{\log^2 T}{\sqrt T}\Big) + \mathcal{O}(\epsilon_{app}),$$
$$\frac{1}{T-\tau_T}\sum_{t=\tau_T}^{T-1} \mathbb{E}\,\|\nabla J(\theta_t)\|^2 = \mathcal{O}\Big(\frac{\log^2 T}{\sqrt T}\Big) + \mathcal{O}(\epsilon_{app}).$$

We defer the interpretation of the above results for a moment and first present the analysis results under i.i.d. sampling for better comparison. The major difference of i.i.d. sampling from Markovian sampling is that at the $t$-th iteration, the state $s_t$ is sampled from the stationary distribution $\mu_{\theta_t}$ instead of from the evolving Markov chain (see Algorithm 2 in Appendix E). The i.i.d. sampling simplifies the analysis in that many Markovian noise terms effectively reduce to zero. This leads to a sample complexity bound that is tighter than the Markovian one by up to logarithmic factors.

Theorem 3.6 (i.i.d. sampling). Consider Algorithm 2 (see Appendix E) with $\alpha_t = \alpha = \frac{c}{\sqrt T}$, $\beta_t = \beta = \frac{1}{\sqrt T}$, $\gamma_t = \gamma = \frac{1}{\sqrt T}$, where $c$ is a constant depending on problem parameters. Suppose Assumptions 3.1-3.4 hold. Then, for $T \ge 2\tau_T$, we have
$$\frac{1}{T-\tau_T}\sum_{t=\tau_T}^{T-1} \mathbb{E}\,y_t^2 = \mathcal{O}\Big(\frac{1}{\sqrt T}\Big) + \mathcal{O}(\epsilon_{app}), \qquad \frac{1}{T-\tau_T}\sum_{t=\tau_T}^{T-1} \mathbb{E}\,\|z_t\|^2 = \mathcal{O}\Big(\frac{1}{\sqrt T}\Big) + \mathcal{O}(\epsilon_{app}),$$
$$\frac{1}{T-\tau_T}\sum_{t=\tau_T}^{T-1} \mathbb{E}\,\|\nabla J(\theta_t)\|^2 = \mathcal{O}\Big(\frac{1}{\sqrt T}\Big) + \mathcal{O}(\epsilon_{app}).$$

If the critic approximation error $\epsilon_{app}$ is zero, the reward estimator, the critic, and the actor estimation errors all diminish at a sub-linear rate of $\widetilde{\mathcal{O}}(T^{-1/2})$. The additional logarithmic term hidden by $\widetilde{\mathcal{O}}(\cdot)$ is incurred by the mixing time of the Markov chain, which can be removed under i.i.d. sampling. It also hides polynomials of all other problem parameters. These are explicitly characterized in the proofs up to the last step of analyzing the overall interconnected error propagation system, and one can easily track the dependence on all parameters if needed. Here we focus on the dependence on the iteration number for ease of presentation.
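For a rough sense of how mild the mixing-time factor is, the short sketch below evaluates the definition of $\tau_T$ above for a hypothetical pair of constants $(m, \rho)$; the numbers are made up and the snippet only illustrates the $\mathcal{O}(\log T)$ growth.

```python
import math

def tau_T(m: float, rho: float, T: int) -> int:
    """Smallest integer i >= 0 with m * rho**(i - 1) <= 1 / sqrt(T)."""
    i = 0
    while m * rho ** (i - 1) > 1.0 / math.sqrt(T):
        i += 1
    return i

# Hypothetical mixing constants; tau_T grows only logarithmically in T.
m, rho = 2.0, 0.9
for T in [10**3, 10**4, 10**5, 10**6]:
    print(T, tau_T(m, rho, T))
```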
To put the results into perspective, note that $\mathcal{O}(T^{-1/2})$ is the rate one would obtain from stochastic gradient descent (SGD) on general non-convex functions with unbiased gradient updates. In terms of sample complexity, obtaining an $\epsilon$-approximate stationary point takes $\widetilde{\mathcal{O}}(\epsilon^{-2})$ samples under Markovian sampling (Algorithm 1) and $\mathcal{O}(\epsilon^{-2})$ samples under i.i.d. sampling (Algorithm 2), which matches the state-of-the-art performance of SGD on non-convex optimization problems.

The obtained sample complexities are superior to those of other AC variants. Notably, [16] provided finite-time convergence for the double-loop variant with an $\mathcal{O}(\epsilon^{-4})$ sample complexity, and [31] analyzed the two-timescale variant, yielding an $\widetilde{\mathcal{O}}(\epsilon^{-2.5})$ sample complexity. The sample complexity gap is intrinsic to their inefficient usage of data. In the double-loop setting, the critic starts over to estimate the Q-value of an intermediate policy in each inner loop, ignoring the fact that consecutive Q-values can be similar given a relatively minor policy update. The two-timescale setting artificially slows down the actor update by adopting an actor stepsize that decays faster than that of the critic. The single-timescale approach updates the critic and the actor in parallel with proportional stepsizes and thus learns more efficiently.

Moreover, our result matches the $\mathcal{O}(\epsilon^{-2})$ sample complexity of policy gradient methods such as REINFORCE [2, 22] under i.i.d. sampling. It was previously found in [31] that there is a sample complexity gap between Algorithm 1 with two-timescale stepsizes and (variance-reduced) REINFORCE [22]. In this paper, we close this gap by providing a single-timescale analysis of Algorithm 1, which shows that this practical single-timescale AC can attain the same sample complexity as REINFORCE.

3.3 Proof Sketch

The main challenge in the finite-time analysis lies in the fact that the estimation errors of the time-average reward, the critic, and the policy gradient are strongly coupled. To overcome this difficulty, we view the propagation of these errors as an interconnected system and analyze them holistically. To better appreciate the advantage of our analysis framework over the decoupled methods traditionally adopted in analyzing double-loop and two-timescale variants, we sketch the main proof steps of Theorem 3.5 in the following. We also highlight the key challenges and the techniques developed correspondingly. All supporting lemmas mentioned below can be found in the Appendix. We first derive implicit (coupled) upper bounds for the reward estimation error $y_t$, the critic error $z_t$, and the policy gradient $\nabla J(\theta_t)$, respectively. Then, we solve a system of inequalities to establish the finite-time convergence.

Step 1: Reward estimation error analysis. Using the reward estimator update rule (Line 7 of Algorithm 1), we decompose the reward estimation error as
$$y_{t+1}^2 = (1 - 2\gamma_t)y_t^2 + 2\gamma_t y_t\big(r_t - J(\theta_t)\big) + 2y_t\big(J(\theta_t) - J(\theta_{t+1})\big) + \big(J(\theta_t) - J(\theta_{t+1}) + \gamma_t(r_t - \eta_t)\big)^2. \quad (7)$$
The second term on the right-hand side of (7) is a bias caused by the Markovian sampling, which is characterized in Lemma C.1. As shown in Lemma E.1, this bias reduces to 0 under i.i.d. sampling after taking the expectation. The third term captures the variation of the moving target $J(\theta_t)$. The double-loop variant of AC runs a policy evaluation sub-problem in the inner loop for each target $J(\theta_t)$ to estimate the policy gradient accurately. This easily ensures the monotonic improvement of $J(\theta_t)$ and consequently the convergence.
The two-timescale variant utilizes the additional property $\lim_{t\to\infty}\alpha_t/\beta_t = 0$ to annihilate this term and consequently admits a decoupled analysis. In the case of single-timescale AC, we do not have the aforementioned special algorithm designs and properties to ease the analysis. Instead, we utilize the smoothness of $J(\theta)$ (see Lemma B.2) and derive an implicit upper bound for this term as a function of the norms of $y_t$ and $\nabla J(\theta_t)$. This bound will be combined with the implicit bounds derived in Steps 2 and 3 below to establish the non-asymptotic convergence altogether. The last term in (7) reflects the variance of the reward estimation, which is bounded by $\mathcal{O}(\gamma_t)$.

Step 2: Critic error analysis. Using the critic update rule (Line 8 of Algorithm 1), we decompose the squared error as (we neglect the projection for the time being for ease of comprehension; the complete analysis can be found in the appendix)
$$\|z_{t+1}\|^2 = \|z_t\|^2 + 2\beta_t\langle z_t, \bar g(\omega_t, \theta_t)\rangle + 2\beta_t\Psi(O_t, \omega_t, \theta_t) + 2\beta_t\langle z_t, \Delta g(O_t, \eta_t, \theta_t)\rangle + 2\langle z_t, \omega_t^* - \omega_{t+1}^*\rangle + \big\|\beta_t\big(g(O_t, \omega_t, \theta_t) + \Delta g(O_t, \eta_t, \theta_t)\big) + \omega_t^* - \omega_{t+1}^*\big\|^2, \quad (8)$$
where $O_t := (s_t, a_t, s_{t+1})$ is the tuple of observations and the definitions of $g$, $\bar g$, $\Delta g$, and $\Psi$ can be found in (12) and (13) in Appendix A. Without diving into the detailed definitions, here we focus on illustrating the high-level insights of our proof. First of all, the second term on the right-hand side of (8) can be bounded by $-2\lambda\beta_t\|z_t\|^2$ under Assumption 3.1. It provides an explicit characterization of how sufficient exploration helps the convergence of learning. The third term is a Markovian noise, which is further bounded implicitly in Lemma C.3. For the i.i.d. sampling case, as shown in Lemma E.1, this bias reduces to 0 after taking the expectation. The fourth term is caused by the inaccurate reward and critic estimations, which can be bounded by the norms of $y_t$ and $z_t$. The fifth term tracks both the critic estimation error $z_t$ and the difference between the drifting critic targets $\omega_t^*$. Similar to Step 1, the double-loop approach bounds this term by relying on the accurate policy evaluation sub-problem in the inner loop for each target $\omega_t^*$, whereas the two-timescale approach ensures its convergence by additionally requiring $\lim_{t\to\infty}\alpha_t/\beta_t = 0$. In contrast, we establish an implicit upper bound for this term as a function of $y_t$ and $z_t$ by utilizing the smoothness of the optimal critic proved in Lemma B.4. Finally, the last term reflects the variances of the various estimations, which is bounded by $\mathcal{O}(\beta_t)$.

Step 3: Policy gradient norm analysis. Using the actor update rule (Line 9 of Algorithm 1) and the smoothness of $J(\theta)$ (see Lemma B.2), we derive
$$\|\nabla J(\theta_t)\|^2 \le \frac{1}{\alpha_t}\big(J(\theta_{t+1}) - J(\theta_t)\big) + \Theta(O_t, \theta_t) - \big\langle\nabla J(\theta_t), \Delta h(O_t, \eta_t, \omega_t, \theta_t)\big\rangle - \big\langle\nabla J(\theta_t), \mathbb{E}_{O'_t}[h'(O'_t, \theta_t)]\big\rangle + \frac{L_J}{2}\alpha_t\big\|\delta_t\nabla\log\pi_{\theta_t}(a_t|s_t)\big\|^2, \quad (9)$$
where $O'_t$ is a shorthand for an independent sample from the stationary distribution, $s\sim\mu_{\theta_t}$, $a\sim\pi_{\theta_t}$, $s'\sim\mathcal{P}(\cdot|s, a)$, the functions $\Delta h$ and $h'$ are defined in (12), $\Theta$ is defined in (13), and $L_J$ is a constant. The first term on the right-hand side of (9) compares the actor's performance between consecutive updates, which can be bounded via Abel summation by parts. The second term is a noise term introduced by Markovian sampling, which is characterized in Lemma C.6. Again, as proven in Lemma E.1, this bias reduces to 0 under i.i.d. sampling after taking the expectation. The third term is an error introduced by the inaccurate estimation of both the time-average reward and the critic.
This term was directly bounded to zero under both the double-loop setting and the two-timescale setting due to their particular algorithm designs, which enables a decoupled analysis. We control this term by providing an implicit bound depending on $y_t$, $z_t$, and $\nabla J(\theta_t)$. The fourth term comes from the linear function approximation error. The last term captures the variance of the stochastic gradient update, which is bounded by $\mathcal{O}(\alpha_t)$.

Step 4: Interconnected iteration system analysis. Taking expectations of (7), (8), and (9) and summing each from $\tau_T$ to $T-1$, we obtain the following system of inequalities in terms of $Y_T$, $Z_T$, $G_T$:
$$Y_T := \frac{1}{T-\tau_T}\sum_{t=\tau_T}^{T-1}\mathbb{E}\,y_t^2 \le \mathcal{O}\Big(\frac{\log^2 T}{\sqrt T}\Big) + l_1\sqrt{Y_T G_T},$$
$$Z_T := \frac{1}{T-\tau_T}\sum_{t=\tau_T}^{T-1}\mathbb{E}\,\|z_t\|^2 \le \mathcal{O}\Big(\frac{\log^2 T}{\sqrt T}\Big) + \mathcal{O}(\epsilon_{app}) + l_2\sqrt{Y_T Z_T} + l_3\sqrt{Z_T(2Y_T + 8Z_T)},$$
$$G_T := \frac{1}{T-\tau_T}\sum_{t=\tau_T}^{T-1}\mathbb{E}\,\|\nabla J(\theta_t)\|^2 \le \mathcal{O}\Big(\frac{\log^2 T}{\sqrt T}\Big) + \mathcal{O}(\epsilon_{app}) + l_4\sqrt{G_T(2Y_T + 8Z_T)},$$
where $l_1, l_2, l_3, l_4$ are positive constants. By solving the above system of inequalities, we further prove that if $l_1\big(1 + 2l_4^2 + 8l_4^2(2l_2^2 + l_3)\big) \le 1$ and $16 l_3 \le 1$, then $Y_T$, $Z_T$, $G_T$ converge at a rate of $\mathcal{O}(\frac{\log^2 T}{\sqrt T})$. This condition can be easily satisfied by choosing the stepsize ratio $c$ to be smaller than a threshold identified in Equation (28). This completes the proof. The above proof applies to i.i.d. sampling straightforwardly, with the corresponding terms pointed out in the steps above reducing to 0 in the analysis. The additional proof can be found in Lemma E.1.

4 Conclusion and Discussion

In this paper, we establish a finite-time analysis for single-timescale AC with Markovian sampling. Our work compares favorably to existing works in terms of analyzing online learning and considering a continuous state space. We develop a series of lemmas that characterize the propagation of errors, and we establish their convergence simultaneously by solving a system of nonlinear inequalities. The proposed framework is general and can be applied to analyze other single-timescale stochastic approximation algorithms. Our future work includes further considering continuous action space problems and developing new proof techniques that require fewer assumptions.

Acknowledgements

This work was supported by the Singapore Ministry of Education Tier 1 Academic Research Funds (A0009030-00-00, 22-5460-A0001). The authors would like to thank Yue Wu and Quanquan Gu for their timely help in clarifying the proof of their seminal work on the finite-time analysis of two-timescale actor-critic.

References

[1] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. In Conference on Learning Theory, pages 64-66. PMLR, 2020.
[2] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1-76, 2021.
[3] Jonathan Baxter and Peter L Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319-350, 2001.
[4] Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory, pages 1691-1692. PMLR, 2018.
[5] Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor-critic algorithms. Automatica, 45(11):2471-2482, 2009.
[6] Vivek S Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48. Springer, 2009.
[7] Dotan Di Castro and Ron Meir. A convergent online single time scale actor critic algorithm. The Journal of Machine Learning Research, 11:367-410, 2010.
[8] Tianyi Chen, Yuejiao Sun, and Wotao Yin. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. Advances in Neural Information Processing Systems, 34:25294-25307, 2021.
[9] Xuyang Chen, Jingliang Duan, Yingbin Liang, and Lin Zhao. Global convergence of two-timescale actor-critic for solving linear quadratic regulator. arXiv preprint arXiv:2208.08744, 2022.
[10] Kenji Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219-245, 2000.
[11] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467-1476. PMLR, 2018.
[12] Zuyue Fu, Zhuoran Yang, and Zhaoran Wang. Single-timescale actor-critic provably finds globally optimal policy. arXiv preprint arXiv:2008.00483, 2020.
[13] Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1):147-180, 2023.
[14] Sham M Kakade. A natural policy gradient. Advances in Neural Information Processing Systems, 14, 2001.
[15] Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Advances in Neural Information Processing Systems, 12, 1999.
[16] Harshat Kumar, Alec Koppel, and Alejandro Ribeiro. On the sample complexity of actor-critic method for reinforcement learning with function approximation. arXiv preprint arXiv:1910.08412, 2019.
[17] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
[18] Qijun Luo and Xiao Li. Finite-time analysis of fully decentralized single-timescale actor-critic. arXiv preprint arXiv:2206.05733, 2022.
[19] Dhruv Malik, Ashwin Pananjady, Kush Bhatia, Koulik Khamaru, Peter Bartlett, and Martin Wainwright. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2916-2925. PMLR, 2019.
[20] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928-1937. PMLR, 2016.
[21] Alex Olshevsky and Bahman Gharesifard. A small gain analysis of single timescale actor critic. SIAM Journal on Control and Optimization, 61(2):980-1007, 2023.
[22] Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, and Marcello Restelli. Stochastic variance-reduced policy gradient. In International Conference on Machine Learning, pages 4026-4035. PMLR, 2018.
[23] Shuang Qiu, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. On finite-time convergence of actor-critic algorithm. IEEE Journal on Selected Areas in Information Theory, 2(2):652-664, 2021.
[24] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354-359, 2017.
[25] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[26] Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.
[27] Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019.
[28] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279-292, 1992.
[29] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229-256, 1992.
[30] Yue Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite time analysis of two time-scale actor critic methods. arXiv preprint arXiv:2005.01350, 2020.
[31] Yue Frank Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite-time analysis of two time-scale actor-critic methods. Advances in Neural Information Processing Systems, 33:17617-17628, 2020.
[32] Huaqing Xiong, Lin Zhao, Yingbin Liang, and Wei Zhang. Finite-time analysis for double Q-learning. In Advances in Neural Information Processing Systems (NeurIPS) (spotlight), 2020.
[33] Pan Xu, Felicia Gao, and Quanquan Gu. Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610, 2019.
[34] Pan Xu, Felicia Gao, and Quanquan Gu. An improved convergence analysis of stochastic variance-reduced policy gradient. In Uncertainty in Artificial Intelligence, pages 541-551. PMLR, 2020.
[35] Tengyu Xu, Zhe Wang, and Yingbin Liang. Improving sample complexity bounds for (natural) actor-critic algorithms. Advances in Neural Information Processing Systems, 33:4358-4369, 2020.
[36] Tengyu Xu, Zhe Wang, and Yingbin Liang. Non-asymptotic convergence analysis of two time-scale (natural) actor-critic algorithms. arXiv preprint arXiv:2005.03557, 2020.
[37] Zhuoran Yang, Yongxin Chen, Mingyi Hong, and Zhaoran Wang. Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. Advances in Neural Information Processing Systems, 32, 2019.
[38] Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Basar. Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization, 58(6):3586-3612, 2020.
[39] Shangtong Zhang, Bo Liu, Hengshuai Yao, and Shimon Whiteson. Provably convergent two-timescale off-policy actor-critic with function approximation. In International Conference on Machine Learning, pages 11204-11213. PMLR, 2020.
[40] Lin Zhao, Huaqing Xiong, and Yingbin Liang. Faster non-asymptotic convergence for double Q-learning. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[41] Mo Zhou and Jianfeng Lu. Single timescale actor-critic method to solve the linear quadratic regulator with convergence guarantees. Journal of Machine Learning Research, 24(222):1-34, 2023.
[42] Shaofeng Zou, Tengyu Xu, and Yingbin Liang. Finite-sample analysis for SARSA with linear function approximation. Advances in Neural Information Processing Systems, 32, 2019.

Table of Contents
A Notation
B Preliminary Lemmas
C Proof of Main Theorem
  C.1 Step 1: Reward estimation error analysis
  C.2 Step 2: Critic error analysis
  C.3 Step 3: Policy gradient norm analysis
  C.4 Step 4: Interconnected iteration system analysis
D Proof of Supporting Lemmas
E IID Sample Analysis

A Notation

We make use of the following auxiliary Markov chain to deal with the Markovian noise.

Auxiliary Markov chain:
$$s_{t-\tau} \xrightarrow{\theta_{t-\tau}} a_{t-\tau} \xrightarrow{\mathcal{P}} s_{t-\tau+1} \xrightarrow{\theta_{t-\tau}} \tilde a_{t-\tau+1} \xrightarrow{\mathcal{P}} \tilde s_{t-\tau+2} \xrightarrow{\theta_{t-\tau}} \tilde a_{t-\tau+2} \xrightarrow{\mathcal{P}} \cdots \xrightarrow{\mathcal{P}} \tilde s_t \xrightarrow{\theta_{t-\tau}} \tilde a_t \xrightarrow{\mathcal{P}} \tilde s_{t+1}. \quad (10)$$

For reference, we also show the original Markov chain.

Original Markov chain:
$$s_{t-\tau} \xrightarrow{\theta_{t-\tau}} a_{t-\tau} \xrightarrow{\mathcal{P}} s_{t-\tau+1} \xrightarrow{\theta_{t-\tau+1}} a_{t-\tau+1} \xrightarrow{\mathcal{P}} s_{t-\tau+2} \xrightarrow{\theta_{t-\tau+2}} a_{t-\tau+2} \xrightarrow{\mathcal{P}} \cdots \xrightarrow{\mathcal{P}} s_t \xrightarrow{\theta_t} a_t \xrightarrow{\mathcal{P}} s_{t+1}. \quad (11)$$

In the sequel, we denote by $\widetilde O_t := (\tilde s_t, \tilde a_t, \tilde s_{t+1})$ the tuple generated from the auxiliary Markov chain in (10), while $O_t := (s_t, a_t, s_{t+1})$ denotes the tuple generated from the original Markov chain in (11). We define the following functions, which help decompose the errors and simplify the presentation:
$$\Delta g(O, \eta, \theta) := [J(\theta) - \eta]\,\phi(s),$$
$$g(O, \omega, \theta) := \big[r(s, a) - J(\theta) + (\phi(s') - \phi(s))^\top\omega\big]\phi(s),$$
$$\bar g(\omega, \theta) := \mathbb{E}_{(s,a,s')\sim(\mu_\theta,\pi_\theta,\mathcal{P})}\big[[r(s, a) - J(\theta) + (\phi(s') - \phi(s))^\top\omega]\,\phi(s)\big],$$
$$\Delta h(O, \eta, \omega, \theta) := \big(J(\theta) - \eta + (\phi(s') - \phi(s))^\top(\omega - \omega^*(\theta))\big)\nabla\log\pi_\theta(a|s), \quad (12)$$
$$h'(O, \theta) := \big((\phi(s')^\top\omega^*(\theta) - V_\theta(s')) - (\phi(s)^\top\omega^*(\theta) - V_\theta(s))\big)\nabla\log\pi_\theta(a|s),$$
$$h(O, \theta) := \big(r(s, a) - J(\theta) + \phi(s')^\top\omega^*(\theta) - \phi(s)^\top\omega^*(\theta)\big)\nabla\log\pi_\theta(a|s).$$

We also define the following functions, which characterize the Markovian noise:
$$\Phi(O, \eta, \theta) := (\eta - J(\theta))(r(s, a) - J(\theta)),$$
$$\Psi(O, \omega, \theta) := \langle\omega - \omega^*_\theta,\ g(O, \omega, \theta) - \bar g(\omega, \theta)\rangle,$$
$$\Theta(O, \theta) := \langle\nabla J(\theta),\ \mathbb{E}_{O'}[h(O', \theta)] - h(O, \theta)\rangle, \quad (13)$$
$$\Xi(O, \omega, \theta) := \langle\omega - \omega^*_\theta,\ (\nabla\omega^*_\theta)^\top\big(\mathbb{E}_{O'}[h(O', \theta)] - h(O, \theta)\big)\rangle,$$
where $O'$ is a shorthand for an independent sample from the stationary distribution, $s\sim\mu_\theta$, $a\sim\pi_\theta$, $s'\sim\mathcal{P}$. Define $U_\delta := 2U_r + 2U_\omega$ so that $|\delta_t| \le U_\delta$, where $\delta_t$ comes from Line 6 in Algorithm 1. Note that from Assumption 3.3, we have $\|\delta\nabla\log\pi_\theta\| \le G := U_\delta B$.

B Preliminary Lemmas

Lemma B.1 ([31], Lemma C.4). For any $\theta_1, \theta_2$, we have $|J(\theta_1) - J(\theta_2)| \le L_J\|\theta_1 - \theta_2\|$, where $L_J = 2U_r|\mathcal{A}|L_\pi\big(1 + \lceil\log_\rho m^{-1}\rceil + \frac{1}{1-\rho}\big)$.

Lemma B.2 ([38], Lemma 3.2). For the performance function $J(\theta)$, there exists a constant $L_J > 0$ such that for all $\theta_1, \theta_2 \in \mathbb{R}^d$, it holds that
$$\|\nabla J(\theta_1) - \nabla J(\theta_2)\| \le L_J\|\theta_1 - \theta_2\|, \quad (14)$$
which further implies
$$J(\theta_2) \ge J(\theta_1) + \langle\nabla J(\theta_1), \theta_2 - \theta_1\rangle - \frac{L_J}{2}\|\theta_1 - \theta_2\|^2, \quad (15)$$
$$J(\theta_2) \le J(\theta_1) + \langle\nabla J(\theta_1), \theta_2 - \theta_1\rangle + \frac{L_J}{2}\|\theta_1 - \theta_2\|^2. \quad (16)$$

Lemma B.3 ([31], Proposition 4.4). There exists a constant $L_* > 0$ such that $\|\omega^*(\theta_1) - \omega^*(\theta_2)\| \le L_*\|\theta_1 - \theta_2\|$ for all $\theta_1, \theta_2 \in \mathbb{R}^d$, where $L_* = (2\lambda^{-2}U_r + 3\lambda^{-1}U_r)|\mathcal{A}|L_\pi\big(1 + \lceil\log_\rho m^{-1}\rceil + \frac{1}{1-\rho}\big)$.

Lemma B.4 ([8], Proposition 8). For any $\theta_1, \theta_2 \in \mathbb{R}^d$, we have $\|\nabla\omega^*(\theta_1) - \nabla\omega^*(\theta_2)\| \le L_s\|\theta_1 - \theta_2\|$, where $L_s$ is a positive constant.

Lemma B.5 ([42], [31]). For any $\theta_1$ and $\theta_2$, it holds that
$$d_{TV}(\mu_{\theta_1}, \mu_{\theta_2}) \le |\mathcal{A}|L_\pi\Big(\lceil\log_\rho m^{-1}\rceil + \frac{1}{1-\rho}\Big)\|\theta_1 - \theta_2\|,$$
$$d_{TV}(\mu_{\theta_1}\otimes\pi_{\theta_1},\ \mu_{\theta_2}\otimes\pi_{\theta_2}) \le |\mathcal{A}|L_\pi\Big(1 + \lceil\log_\rho m^{-1}\rceil + \frac{1}{1-\rho}\Big)\|\theta_1 - \theta_2\|,$$
$$d_{TV}(\mu_{\theta_1}\otimes\pi_{\theta_1}\otimes\mathcal{P},\ \mu_{\theta_2}\otimes\pi_{\theta_2}\otimes\mathcal{P}) \le |\mathcal{A}|L_\pi\Big(1 + \lceil\log_\rho m^{-1}\rceil + \frac{1}{1-\rho}\Big)\|\theta_1 - \theta_2\|.$$

Lemma B.6 ([31], Lemma B.2). Given time indexes $t$ and $\tau$ such that $t \ge \tau > 0$, consider the auxiliary Markov chain in (10). Conditioning on $s_{t-\tau+1}$ and $\theta_{t-\tau}$, we have
$$d_{TV}\big(\mathbb{P}(s_{t+1}\in\cdot),\ \mathbb{P}(\tilde s_{t+1}\in\cdot)\big) \le d_{TV}\big(\mathbb{P}(O_t\in\cdot),\ \mathbb{P}(\widetilde O_t\in\cdot)\big),$$
$$d_{TV}\big(\mathbb{P}(O_t\in\cdot),\ \mathbb{P}(\widetilde O_t\in\cdot)\big) = d_{TV}\big(\mathbb{P}((s_t, a_t)\in\cdot),\ \mathbb{P}((\tilde s_t, \tilde a_t)\in\cdot)\big),$$
$$d_{TV}\big(\mathbb{P}((s_t, a_t)\in\cdot),\ \mathbb{P}((\tilde s_t, \tilde a_t)\in\cdot)\big) \le d_{TV}\big(\mathbb{P}(s_t\in\cdot),\ \mathbb{P}(\tilde s_t\in\cdot)\big) + \frac{1}{2}|\mathcal{A}|L_\pi\,\mathbb{E}[\|\theta_t - \theta_{t-\tau}\|].$$

C Proof of Main Theorem

C.1 Step 1: Reward estimation error analysis

In this subsection, we establish an implicit bound for the reward estimator.

Lemma C.1. For any $t \ge \tau > 0$, we have
$$\mathbb{E}[\Phi(O_t, \eta_t, \theta_t)] \le 4U_r L_J\,\mathbb{E}\|\theta_t - \theta_{t-\tau}\| + 2U_r\,\mathbb{E}|\eta_t - \eta_{t-\tau}| + 2U_r^2|\mathcal{A}|L_\pi\sum_{i=t-\tau}^{t}\mathbb{E}\|\theta_i - \theta_{t-\tau}\| + 4U_r^2 m\rho^{\tau-1}.$$

Theorem C.2. Choose $\alpha_t = \frac{c}{\sqrt T}$ and $\beta_t = \gamma_t = \frac{1}{\sqrt T}$. Then we have
$$Y_T \le \mathcal{O}\Big(\frac{\log^2 T}{\sqrt T}\Big) + cG\sqrt{Y_T G_T}. \quad (17)$$

Proof.
From the update rule of reward estimator in Line 7 of Algorithm 1, we have ηt+1 J(θt+1) = ηt J(θt) + J(θt) J(θt+1) + γt(rt ηt) Then we have y2 t+1 = (yt + J(θt) J(θt+1) + γt(rt ηt))2 y2 t + 2yt(J(θt) J(θt+1)) + 2γtyt(rt ηt) + 2(J(θt) J(θt+1))2 + 2γ2 t (rt ηt)2 = (1 2γt)y2 t + 2γtyt(rt J(θt)) + 2yt(J(θt) J(θt+1)) + 2(J(θt) J(θt+1))2 + 2γ2 t (rt ηt)2. Taking expectation up to st+1 (the whole trajectory), rearranging and summing from τT to T 1, we have t=τT E[y2 t ] 1 2γt E(y2 t y2 t+1) t=τT E[yt(rt J(θt))] 1 γt E[yt(J(θt) J(θt+1)] 1 γt E[(J(θt) J(θt+1))2] t=τT γt E[(rt ηt)2] For term I1, from Abel summation by parts, we have 1 2γt (y2 t y2 t+1) t=τT +1 y2 t ( 1 2γt 1 2γt 1 ) + 1 2γτt y2 τt 1 γT 1 y2 T 2U 2 r γT 1 = 2U 2 r For term I2, from Lemma C.1, we have E[yt(rt J(θt))] 4Ur LJ θt θt τ + 2Ur|ηt ηt τ| + 2U 2 r |A|Lπ i=t τ E θi θt τ + 4U 2 r mρτ 1 4Ur LJGταt τ + 4U 2 r τγt τ + 2U 2 r |A|Lπτ(τ + 1)Gαt τ + 4U 2 r mρτ 1 (4Ur LJGτ + 2U 2 r |A|LπGτ(τ + 1))αt τ + 4U 2 r τγt τ + 4U 2 r mρτ 1. Choose τ = τT , we have t=τT E[yt(rt J(θt))] (4Ur LJGτT + 2U 2 r |A|LπGτT (τT + 1)) + 4U 2 r τT t=τT γt + 4U 2 r = (4Ur LJGτT + 2U 2 r |A|LπGτT (τT + 1) + 4U 2 r τT + 4U 2 r )T τT For I3, if yt > 0, from (15), we have yt(J(θt) J(θt+1)) yt(LJ 2 θt θt+1 2 + J(θt), θt θt+1 ) LJ Ur θt θt+1 2 + |yt| θt θt+1 J(θt) . If yt 0, from (16), we have yt(J(θt) J(θt+1)) yt( LJ 2 θt θt+1 2 + J(θt), θt θt+1 ) LJ Ur θt θt+1 2 + |yt| θt θt+1 J(θt) . Overall, we get 1 γt E[yt(J(θt) J(θt+1))] 1 γt E[LJ Ur θt θt+1 2 + |yt| θt θt+1 J(θt) ] t=τT E[c LJ Ur G2αt + c G|yt| J(θt) ] c LJ Ur G2 T τT t=τT Ey2 t ) 1 2 ( t=τT E J(θt) 2) 1 2 . For term I4, we have 1 γt E[(J(θt) J(θt+1))2] 1 γt L2 JE θt θt+1 2 1 γt L2 JG2α2 t = L2 JG2c2 T τT For term I5, we have t=τT γt E[(rt J(θt))2] t=τT 4U 2 r γt = 4U 2 r T τT Therefore, we get t=τT E[y2 t ] (4Ur LJGτT + 2U 2 r |A|LπGτT (τT + 1) + 4U 2 r (τT + 2) + c2G2(LJ Ur + L2 J))T τT t=τT Ey2 t ) 1 2 ( t=τT E J(θt) 2) 1 2 . Since τT = O(log T), we have T for large T. Then we get t=τT E[y2 t ] (4Ur LJGτT + 2U 2 r |A|LπGτT (τT + 1) + 4U 2 r (τT + 3) + c2G2(LJ Ur + L2 J)) 1 + c G( 1 T τT t=τT Ey2 t ) 1 2 ( 1 T τT t=τT E J(θt) 2) 1 2 T ) + c G( 1 T τT t=τT Ey2 t ) 1 2 ( 1 T τT t=τT E J(θt) 2) 1 2 . Thus we finish the proof. C.2 Step 2: Critic error analysis In this subsection, we will establish an implicit upper bound for critic. Lemma C.3. For any t τ > 0, we have E[Ψ(Ot, ωt, θt)] C1 θt θt τ + 6Uδ ωt ωt τ + U 2 δ |A|LπGτ(τ + 1)αt τ + 2U 2 δ mρτ 1, where C1 = 2U 2 δ |A|Lπ(1 + logρ m 1 + 1 1 ρ) + 2UδLJ + 2UδL . Lemma C.4. Given the definition of Ξ(Ot, ωt, θt), for any t τ > 0, we have E[Ξ(Ot, ωt, θt)] C2 θt θt τ + 2UδB ωt ωt τ + 2U 2 δ B|A|LπGτ(τ + 1)αt τ + 4U 2 δ Bmρτ 1. where C2 := 3BU 2 δ |A|Lπ(1 + logρ m 1 + 1 1 ρ) + 3U 2 δ Ll + 8UδBL . Theorem C.5. Choose αt = c T , βt = γt = 1 T , we have ZT O(log2 T T ) + O(ϵapp) + 2 YT ZT + 2c BL ZT (2YT + 8ZT ). (18) Proof. From the update rule of critic in Line 8 of Algorithm 1, we have ωt+1 ω t+1 = ΠUω(ωt + βtδtϕ(st)) ω t+1 = ΠUω(ωt + βtδtϕ(st)) ΠUω(ω t+1) ωt + βtδtϕ(st) ω t+1 = ωt + βt(g(Ot, ωt, θt) + g(Ot, ηt, θt)) ω t+1 = ωt ω t + βt(g(Ot, ωt, θt) + g(Ot, ηt, θt)) + ω t ω t+1 . 
Therefore, we have zt+1 2 = zt + βt(g(Ot, ωt, θt) + g(Ot, ηt, θt)) + ω t ω t+1 2 = zt 2 + 2βt zt, g(Ot, ωt, θt) + 2βt zt, g(Ot, ηt, θt) + 2 zt, ω t ω t+1 + βt(g(Ot, ωt, θt) + g(Ot, ηt, θt)) + ω t ω t+1 2 = zt 2 + 2βt zt, g(ωt, θt) + 2βtΨ(Ot, ωt, θt) + 2βt zt, g(Ot, ηt, θt) + 2 zt, ω t ω t+1 + βt(g(Ot, ωt, θt) + g(Ot, ηt, θt)) + ω t ω t+1 2 zt 2 + 2βt zt, g(ωt, θt) + 2βtΨ(Ot, ωt, θt) + 2βt zt, g(Ot, ηt, θt) + 2 zt, ω t ω t+1 + 2U 2 δ β2 t + 2 ω t ω t+1 2. Note that we have zt, g(ωt, θt) = zt, g(ωt, θt) g(ω t , θt) = zt, E[(ϕ(s ) ϕ(s)) (ωt ω t )ϕ(s)] = z t E[ϕ(s)(ϕ(s ) ϕ(s)) ]zt = z t Azt λ zt 2, where the first equality is due to the fact that g(ω t , θt) = 0 and the last inequality follows from Assumption 3.1. Taking expectation up to st+1, we have E zt+1 2 E zt 2 + 2βt E zt, g(ωt, θt) + 2βt EΨ(Ot, ωt, θt) + 2βt E zt, g(Ot, ηt, θt) + 2E zt, ω t ω t+1 + 2U 2 δ β2 t + 2E ω t ω t+1 2 (1 2λβt)E zt 2 + 2βt EΨ(Ot, ωt, θt) + 2βt E zt, g(Ot, ηt, θt) + 2E zt, ω t ω t+1 + 2U 2 δ β2 t + 2E ω t ω t+1 2 (1 2λβt)E zt 2 + 2βt EΨ(Ot, ωt, θt) + 2βt E zt, g(Ot, ηt, θt) + 2E zt, ω t ω t+1 + ( ω t ) (θt+1 θt) + 2E zt, ( ω t ) (θt θt+1) + 2U 2 δ β2 t + 2E ω t ω t+1 2 (1) (1 2λβt)E zt 2 + 2βt EΨ(Ot, ωt, θt) + 2βt E zt |yt| + Ls E zt θt+1 θt 2 + 2αt E zt, ( ω t ) δt log πθt(at|st) + 2U 2 δ β2 t + 2L2 E θt θt+1 2 (1 2λβt)E zt 2 + 2βt EΨ(Ot, ωt, θt) + 2βt q 2 E zt 2 θt+1 θt 2 + Ls 2 E θt+1 θt 2 + 2U 2 δ β2 t + 2L2 G2α2 t + 2αt E zt, ( ω t ) δt log πθt(at|st) (1 2λβt)E zt 2 + 2βt EΨ(Ot, ωt, θt) + 2βt q E zt 2 + Ls G2 2 α2 t E zt 2 + 2U 2 δ β2 t + (2L2 + Ls 2 )G2α2 t + 2αt E zt, ( ω t ) δt log πθt(at|st) (2) (1 λβt)E zt 2 + 2βt EΨ(Ot, ωt, θt) + 2βt q + 2U 2 δ β2 t + (2L2 + Ls 2 )G2α2 t + 2αt E zt, ( ω t ) δt log πθt(at|st) (19) where (1) follows from the Ls-smoothness of ω in Lemma B.4; (2) uses Ls G2 2 α2 t λβt for large T. For term E zt, ( ω t ) δt log πθt(at|st) , we have E zt, ( ω t ) δt log πθt(at|st) = E zt, ( ω t ) ( h(Ot, ηt, ωt, θt) h(Ot, θt)) = E zt, ( ω t ) h(Ot, ηt, ωt, θt) + E zt, ( ω t ) (EO t[h(O t, θt)] h(Ot, θt) EO t[h(O t, θt)]) = E[Ξ(Ot, ωt, θt)] E zt, ( ω t ) h(Ot, ηt, ωt, θt) E zt, ( ω t ) EO t[h(O t, θt)] (20) Note that from Cauchy-Schwartz inequality and L is the Lipschitz constant of ω in Lemma B.3, we have E zt, ( ω t ) h(Ot, ηt, ωt, θt) BL p 2Ey2 t + 8E zt 2. (21) Furthermore, it holds that EO h (O, θ) 2 = EO ((ϕ(s ) ω Vθ(s )) (ϕ(s) ω Vθ(s))) log πθ(a|s) 2 EO [B2((ϕ(s ) ω Vθ(s )) (ϕ(s) ω Vθ(s)))2] EO [2B2(ϕ(s ) ω Vθ(s ))2 + (ϕ(s) ω Vθ(s))2] = 4B2EO [(ϕ(s) ω Vθ(s))2] = 4B2ϵ2 app. Therefore, we have zt, ( ω t ) EO [ h (O , θt)] UδL p EO [ h (Ot, θt)] 2 EO h (Ot, θt) 2 2BUδL ϵapp. (22) Substituting (21) and (22) into (20) yields E zt, ( ω t ) δt log πθt(at|st) EΞ(Ot, ωt, θt) + 2BUδL ϵapp 2Ey2 t + 8E zt 2. (23) Plugging (23) into (19), we have E zt+1 2 (1 λβt)E zt 2 + 2βt EΨ(Ot, ωt, θt) + 2αt EΞ(Ot, ωt, θt) E zt 2 + 2BL αt p 2Ey2 t + 8E zt 2 + 2U 2 δ β2 t + (2L2 + Ls 2 )G2α2 t + 4BUδαtϵapp. Rearranging and summing from τT to T gives 1 βt (E zt 2 E zt+1 2) t=τT EΨ(Ot, ωt, θt) t=τT EΞ(Ot, ωt, θt) 2Ey2 t + 8E zt 2 t=τT (2U 2 δ βt + c(2L2 + Ls 2 )G2αt + 4c BUδϵapp). In the sequel, we will tackle I1, I2, I3, I4, I5 respectively. For term I1, from Abel summation by parts, we have 1 βt (E zt 2 E zt+1 2) t=τT +1 ( 1 βt 1 βt 1 )E zt 2 + 1 βτT E zτT 2 1 βT 1 E z T 2 t=τT +1 ( 1 βt 1 βt 1 ) + 1 βτT ) where the inequality is due to E zt 2 U 2 δ and discard the last term. 
For term I2, from Lemma C.3, choose τ = τT , we have EΨ(Ot, ωt, θt) C1 θt θt τT + U 2 δ |A|LπGτT (τT + 1)αt τT + 2U 2 δ mρτT 1 + 6Uδ ωt ωt τT k=t τT Gαk + U 2 δ |A|LπGτT (τT + 1)αt τT + 2U 2 δ k=t τT Uδβk (C1GτT + U 2 δ |A|LπGτT (τT + 1))αt τT + 2U 2 δ T + 6U 2 δ τT βt τT . Then we get T =τT EΨ(Ot, ωt, θt) T =τT ((C1GτT + U 2 δ |A|LπGτT (τT + 1))αt + 2U 2 δ T + 6U 2 δ τT βt). For term I3, from Lemma C.4, choose τ = τT , we have E[Ξ(Ot, ωt, θt)] C2 θt θt τ + 2UδB ωt ωt τ + 2U 2 δ B|A|LπGτ(τ + 1)αt τ + 4U 2 δ Bmρτ 1 k=t τT Gαk + 2UδB k=t τT Uδβk + 2U 2 δ B|A|LπGτ(τ + 1)αt τ + 4U 2 δ BmρτT 1 (C2GτT + 2U 2 δ B|A|LπGτT (τT + 1))αt + 2U 2 δ BτT βt + 4U 2 δ B Therefore, we have t=τT EΞ(Ot, ωt, θt) t=τT ((C2GτT + 2U 2 δ B|A|LπGτT (τT + 1))αt + 2U 2 δ BτT βt + 4U 2 δ B For term I4 and I5, from Cauchy-Schwartz inequality, we have t=τT Ey2 t ) 1 2 ( t=τT E zt 2) 1 2 , t=τT E zt 2) 1 2 (2 t=τT Ey2 t + 8 t=τT E zt 2) 1 2 . Overall, we get t=τT E zt 2 2( t=τT Ey2 t ) 1 2 ( t=τT E zt 2) 1 2 t=τt E zt 2) 1 2 (2 t=τT Ey2 t + 8 t=τT E zt 2) 1 2 T =τT ((C1GτT + U 2 δ |A|LπGτT (τT + 1))αt + 2U 2 δ T + 6U 2 δ τT βt) t=τT ((C2GτT + 2U 2 δ B|A|LπGτT (τT + 1))αt + 2U 2 δ BτT βt + 4U 2 δ B t=τT (2U 2 δ βt + c(2L2 + Ls 2 )G2αt + 4c BUδϵapp). Therefore, we have t=τT Ey2 t ) 1 2 ( 1 T τT t=τT E zt 2) 1 2 t=τt E zt 2) 1 2 (2 1 T τT t=τT Ey2 t + 8 1 T τT t=τT E zt 2) 1 2 T + 2(C1GτT + U 2 δ |A|LπGτT (τT + 1))αt + 4U 2 δ T + 12U 2 δ τT βt + 2c(C2GτT + 2U 2 δ B|A|LπGτT (τT + 1))αt + 4c U 2 δ BτT βt + 8c U 2 δ B + 2U 2 δ βt + c(2L2 + Ls 2 )G2αt + 4c BUδϵapp) T ) + O(ϵapp) + 2 t=τT Ey2 t ) 1 2 ( 1 T τT t=τT E zt 2) 1 2 t=τt E zt 2) 1 2 (2 1 T τT t=τT Ey2 t + 8 1 T τT t=τT E zt 2) 1 2 , where (1) follows from τT = O(log T) so that T τT 1 2T for large T. Therefore, we have ZT O(log2 T T ) + O(ϵapp) + 2 YT ZT + 2c BL ZT (2YT + 8ZT ), which completes the proof. C.3 Step 3: Policy gradient norm analysis In this subsection, we will establish an implicit upper bound for policy gradient norm. Lemma C.6. For any t τ > 0, it holds that E[Θ(Ot, θt)] D1τ(τ + 1)Gα + D2mρτ 1, where D1 = max{UδBLJ + 3LJLh, 2UδBLJ|A|Lπ} and D2 = 4UδBLJ. Theorem C.7. We have GT O(log2 T T ) + O(ϵapp) + B p GT (2YT + 8ZT ). (25) Proof. From the update rule of actor in Line 9 of Algorithm 1 and 15, we have J(θt+1) J(θt) + J(θt), θt+1 θt LJ = J(θt) + J(θt), δt log πθt(at|st) LJ 2 α2 t δt log πθt(at|st) 2 = J(θt) + αt J(θt), h(Ot, ηt, ωt, θt) + αt J(θt), h(Ot, θt) 2 α2 t δt log πθt(at|st) 2 = J(θt) + αt J(θt), h(Ot, ηt, ωt, θt) αtΘ(Ot, θt) + αt J(θt), EO [h(O , θt)] LJ 2 α2 t δt log πθt(at|st) 2 = J(θt) + αt J(θt), h(Ot, ηt, ωt, θt) αtΘ(Ot, θt) + αt J(θt) 2 + αt J(θt), EO [ h (O , θt)] LJ 2 α2 t δt log πθt(at|st) 2, where the last equality is due to the fact EO [h(O , θ) h (O , θ)] = EO [(r(s, a) J(θ) + Vθ(s ) Vθ(s)) log πθ(a|s)] = J(θ). Rearranging the above inequality and taking expectation, we have E J(θt) 2 1 αt (E[J(θt+1) J(θt)]) E J(θt), h(Ot, ηt, ωt, θt) + E[Θ(Ot, θt)] E J(θt), EO [ h (O , θt)] + LJ 2 αt E δt log πθt(at|st) 2. Note that from Cauchy-Schwartz inequality, we have E J(θt), h(Ot, ηt, ωt, θt) B p E J(θt) 2 q 2Ey2 t + 8E zt 2. From Lemma C.6 and choosing τ = τT , we have E[Θ(Ot, θt)] D1(τT + 1) k=t τT +1 E θk θk 1 + D2mρτT 1 D1(τT + 1)G k=t τT αk + D2mρτT 1 GD1(τT + 1)2αt τT + D2 1 It has been shown that EO h (O, θ) 2 = EO ((ϕ(s ) ω Vθ(s )) (ϕ(s) ω Vθ(s))) log πθ(a|s) 2 EO [B2((ϕ(s ) ω Vθ(s )) (ϕ(s) ω Vθ(s)))2] EO [2B2(ϕ(s ) ω Vθ(s ))2 + (ϕ(s) ω Vθ(s))2] = 4B2EO [(ϕ(s) ω Vθ(s))2] = 4B2ϵ2 app. 
Therefore, we have J(θt), EO [ h (O , θt)] LJ p EO [ h (Ot, θt)] 2 EO h (Ot, θt) 2 2BLJϵapp, where we use J(θ) LJ which comes from Lemma B.1. Plugging the three terms yields E J(θt) 2 1 αt (E[J(θt+1)] E[J(θt)]) + B p E J(θt) 2 q 2Ey2 t + 8E zt 2 + 2BLJϵapp + GD1(τT + 1)2αt τT + D2 1 Summing over t from τT to T 1 gives t=τT E J(θt) 2 1 αt (E[J(θt+1) E[J(θt)]) E J(θt) 2 q 2Ey2 t + 8E zt 2 + (GD1(τT + 1)2 + D2)T τT T + 2BLJϵapp(T τT ). For term I1, from Abel summation by parts, we have 1 αt (E[J(θt+1) E[J(θt)]) t=τT +1 ( 1 αt 1 1 αt )E[J(θt)] E[J(θτT )] 1 ατT + 1 αT 1 E[J(θT )] t=τT +1 ( 1 αt 1 αt 1 )Ur + 1 ατT Ur + 1 αT 1 Ur Overall, we have t=τT E J(θt) 2 2Ur T + (GD1(τT + 1)2 + D2)T τT T + 2BLJϵapp(T τT ) E J(θt) 2 q 2Ey2 t + 8E zt 2 T + (GD1(τT + 1)2 + D2)T τT T + 2BLJϵapp(T τT ) t=τT E J(θt) 2) 1 2 (2 t=τT Ey2 t + 8 t=τT E zt 2) 1 2 . Therefore, we get c + GD1(τT + 1)2 + D2) 1 T + 2BLJϵapp + B p GT (2YT + 8ZT ) T ) + O(ϵapp) + B p GT (2YT + 8ZT ), which concludes the proof. C.4 Step 4: Interconnected iteration system analysis In this subsection, we perform an interconnected iteration system analysis to prove Theorem 3.5. Proof of Theorem 3.5. Proof. Combining (17), (18), and (25), we have YT O(log2 T T ) + c G p ZT O(log2 T T ) + O(ϵapp) + 2 YT ZT + 2c BL ZT (2YT + 8ZT ), GT O(log2 T T ) + O(ϵapp) + B p GT (2YT + 8ZT ). l3 := 2c BL Then we have YT O(log2 T ZT O(log2 T T ) + O(ϵapp) + l2 p YT ZT + l3 p ZT (2YT + 8ZT ), GT O(log2 T T ) + O(ϵapp) + l4 p GT (2YT + 8ZT ). For GT , we get GT O(log2 T T ) + O(ϵapp) + 1 2GT + l2 4(YT + 4ZT ), GT O(log2 T T ) + O(ϵapp) + 2l2 4(YT + 4ZT ). (26) For ZT , we have ZT O(log2 T T ) + O(ϵapp) + 1 4ZT + l2 2YT + 4l3ZT + l3 If it satisfies 4l3 1 4, we further have ZT O(log2 T T ) + O(ϵapp) + (2l2 2 + l3)YT . (27) For YT , we get YT O(log2 T 2 (YT + GT ). Plugging (26) and (27) into the above inequality gives YT O(log2 T T ) + O(ϵapp) + l1 2 (YT + 2l2 4YT + 8l2 4ZT ) T ) + O(ϵapp) + l1 2 (YT + 2l2 4YT + 8l2 4(2l2 2 + l3)YT ) T ) + O(ϵapp) + l1 2 (1 + 2l2 4 + 8l2 4(2l2 2 + l3))YT . Therefore, if l1(1 + 2l2 4 + 8l2 4(2l2 2 + l3) 1, we have YT O(log2 T T ) + O(ϵapp). Overall, we require l1(1 + 2l2 4 + 8l2 4(2l2 2 + l3)) 1. According to the definition of l1, l2, l3, l4, we have c G(1 + 2B2 + 8B2( 8 Thus we choose c = min{ λ 32BL , λ2 G(λ2 + 3B2λ2 + 64B2)}, (28) which satisfies the above two inequalities. Therefore, we have YT = O(log2 T T ) + O(ϵapp), and consequently, ZT = O(log2 T T ) + O(ϵapp), GT = O(log2 T T ) + O(ϵapp). Thus we conclude our proof. D Proof of Supporting Lemmas The following three lemmas only deal with the Markovian noise, which are originally proved in [31] and updated in [30]. We include the proof with slight modifications for proving Theorem 3.5. Proof of Lemma C.1. Proof. We will divide the proof of this lemma into four steps. Step 1: show that for any θ1, θ2, η, O = (s, a, s ), we have |Φ(O, η, θ1) Φ(O, η, θ2)| 4Ur LJ θ1 θ2 . (29) By the definition of Φ(O, η, θ) in (13), we have |Φ(O, η, θ1) Φ(O, θ, θ2)| = |(η J(θ1))(r J(θ1)) (η J(θ2))(r J(θ2))| |(η J(θ1))(r J(θ1)) (η J(θ1))(r J(θ2))| + |(η J(θ1))(r J(θ2)) (η J(θ2))(r J(θ2))| 4Ur|J(θ1) J(θ2)| 4Ur LJ θ1 θ2 . Step 2: show that for any θ, η1, η2, O, we have |Φ(O, η1, θ) Φ(O, η2, θ) 2Ur|η1 η2|. (30) By definition, we have |Φ(O, η1, θ) Φ(O, η2, θ)| = |(η1 J(θ))(r J(θ)) (η2 J(θ))(r J(θ))| 2Ur|η1 η2|. Step 3: show that for original tuple Ot and the auxiliary tuple e Ot, conditioned on st τ 1 and θt τ, we have |E[Φ(Ot, ηt τ, θt τ) E[Φ( e Ot, ηt τ, θt τ)]| 2U 2 r |A|Lπ k=t τ E θk θt τ . 
D Proof of Supporting Lemmas

The following three lemmas deal only with the Markovian noise; they were originally proved in [31] and updated in [30]. We include the proofs with slight modifications for proving Theorem 3.5.

Proof of Lemma C.1. We divide the proof of this lemma into four steps.

Step 1: show that for any $\theta_1, \theta_2, \eta$ and tuple $O = (s, a, s')$, we have
\[
|\Phi(O, \eta, \theta_1) - \Phi(O, \eta, \theta_2)| \le 4U_r L_J\|\theta_1 - \theta_2\|. \tag{29}
\]
By the definition of $\Phi(O, \eta, \theta)$ in (13), we have
\begin{align*}
|\Phi(O, \eta, \theta_1) - \Phi(O, \eta, \theta_2)|
&= |(\eta - J(\theta_1))(r - J(\theta_1)) - (\eta - J(\theta_2))(r - J(\theta_2))| \\
&\le |(\eta - J(\theta_1))(r - J(\theta_1)) - (\eta - J(\theta_1))(r - J(\theta_2))| + |(\eta - J(\theta_1))(r - J(\theta_2)) - (\eta - J(\theta_2))(r - J(\theta_2))| \\
&\le 4U_r|J(\theta_1) - J(\theta_2)| \le 4U_r L_J\|\theta_1 - \theta_2\|.
\end{align*}

Step 2: show that for any $\theta, \eta_1, \eta_2, O$, we have
\[
|\Phi(O, \eta_1, \theta) - \Phi(O, \eta_2, \theta)| \le 2U_r|\eta_1 - \eta_2|. \tag{30}
\]
By definition, we have
\[
|\Phi(O, \eta_1, \theta) - \Phi(O, \eta_2, \theta)| = |(\eta_1 - J(\theta))(r - J(\theta)) - (\eta_2 - J(\theta))(r - J(\theta))| \le 2U_r|\eta_1 - \eta_2|.
\]

Step 3: show that for the original tuple $O_t$ and the auxiliary tuple $\widetilde O_t$, conditioned on $s_{t-\tau+1}$ and $\theta_{t-\tau}$, we have
\[
\bigl|\mathbb{E}[\Phi(O_t, \eta_{t-\tau}, \theta_{t-\tau})] - \mathbb{E}[\Phi(\widetilde O_t, \eta_{t-\tau}, \theta_{t-\tau})]\bigr| \le 2U_r^2|\mathcal{A}|L_\pi\sum_{k=t-\tau}^{t}\mathbb{E}\|\theta_k - \theta_{t-\tau}\|. \tag{31}
\]
By definition, we have
\[
\mathbb{E}[\Phi(O_t, \eta_{t-\tau}, \theta_{t-\tau})] - \mathbb{E}[\Phi(\widetilde O_t, \eta_{t-\tau}, \theta_{t-\tau})]
= (\eta_{t-\tau} - J(\theta_{t-\tau}))\,\mathbb{E}\bigl[r(s_t, a_t) - r(\widetilde s_t, \widetilde a_t)\bigr].
\]
By the definition of the total variation norm, we have
\[
\mathbb{E}\bigl[r(s_t, a_t) - r(\widetilde s_t, \widetilde a_t)\bigr] \le 2U_r\, d_{TV}\bigl(\mathbb{P}(O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mathbb{P}(\widetilde O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau})\bigr). \tag{32}
\]
By Lemma B.6, we get
\begin{align*}
d_{TV}\bigl(\mathbb{P}(O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mathbb{P}(\widetilde O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau})\bigr)
&= d_{TV}\bigl(\mathbb{P}((s_t, a_t) \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mathbb{P}((\widetilde s_t, \widetilde a_t) \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau})\bigr) \\
&\le d_{TV}\bigl(\mathbb{P}(s_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mathbb{P}(\widetilde s_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau})\bigr) + \frac12|\mathcal{A}|L_\pi\mathbb{E}\|\theta_t - \theta_{t-\tau}\| \\
&\le d_{TV}\bigl(\mathbb{P}(O_{t-1} \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mathbb{P}(\widetilde O_{t-1} \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau})\bigr) + \frac12|\mathcal{A}|L_\pi\mathbb{E}\|\theta_t - \theta_{t-\tau}\|.
\end{align*}
Repeating the above argument from $t$ down to $t-\tau+1$, we have
\[
d_{TV}\bigl(\mathbb{P}(O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mathbb{P}(\widetilde O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau})\bigr) \le \frac12|\mathcal{A}|L_\pi\sum_{k=t-\tau}^{t}\mathbb{E}\|\theta_k - \theta_{t-\tau}\|. \tag{33}
\]
Plugging (33) into (32), we have
\[
\bigl|\mathbb{E}[\Phi(O_t, \eta_{t-\tau}, \theta_{t-\tau})] - \mathbb{E}[\Phi(\widetilde O_t, \eta_{t-\tau}, \theta_{t-\tau})]\bigr| \le 2U_r^2|\mathcal{A}|L_\pi\sum_{k=t-\tau}^{t}\mathbb{E}\|\theta_k - \theta_{t-\tau}\|.
\]

Step 4: show that conditioned on $s_{t-\tau+1}$ and $\theta_{t-\tau}$, we have
\[
\mathbb{E}[\Phi(\widetilde O_t, \eta_{t-\tau}, \theta_{t-\tau})] \le 4U_r^2 m\rho^{\tau-1}. \tag{34}
\]
Note that, by definition, we have $\mathbb{E}[\Phi(O^*_{t-\tau}, \eta_{t-\tau}, \theta_{t-\tau})\,|\,\theta_{t-\tau}] = 0$, where $O^*_{t-\tau} = (s^*_{t-\tau}, a^*_{t-\tau}, s^*_{t-\tau+1})$ is the tuple generated by $s^*_{t-\tau} \sim \mu_{\theta_{t-\tau}}$, $a^*_{t-\tau} \sim \pi_{\theta_{t-\tau}}$, $s^*_{t-\tau+1} \sim \mathcal{P}$. From the uniform ergodicity in Assumption 3.2, it follows that
\[
d_{TV}\bigl(\mathbb{P}(\widetilde s_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mu_{\theta_{t-\tau}}\bigr) \le m\rho^{\tau-1}.
\]
Then we have
\begin{align*}
\mathbb{E}[\Phi(\widetilde O_t, \eta_{t-\tau}, \theta_{t-\tau})]
&= \mathbb{E}[\Phi(\widetilde O_t, \eta_{t-\tau}, \theta_{t-\tau}) - \Phi(O^*_{t-\tau}, \eta_{t-\tau}, \theta_{t-\tau})] \\
&= \mathbb{E}\bigl[(\eta_{t-\tau} - J(\theta_{t-\tau}))\bigl(r(\widetilde s_t, \widetilde a_t) - r(s^*_{t-\tau}, a^*_{t-\tau})\bigr)\bigr] \\
&\le 4U_r^2\, d_{TV}\bigl(\mathbb{P}(\widetilde O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mu_{\theta_{t-\tau}}\otimes\pi_{\theta_{t-\tau}}\otimes\mathcal{P}\bigr) \le 4U_r^2 m\rho^{\tau-1}.
\end{align*}
Combining (29), (30), (31), and (34), we have
\begin{align*}
\mathbb{E}[\Phi(O_t, \eta_t, \theta_t)]
&= \mathbb{E}[\Phi(O_t, \eta_t, \theta_t) - \Phi(O_t, \eta_t, \theta_{t-\tau})] + \mathbb{E}[\Phi(O_t, \eta_t, \theta_{t-\tau}) - \Phi(O_t, \eta_{t-\tau}, \theta_{t-\tau})] \\
&\quad + \mathbb{E}[\Phi(O_t, \eta_{t-\tau}, \theta_{t-\tau}) - \Phi(\widetilde O_t, \eta_{t-\tau}, \theta_{t-\tau})] + \mathbb{E}[\Phi(\widetilde O_t, \eta_{t-\tau}, \theta_{t-\tau})] \\
&\le 4U_r L_J\|\theta_t - \theta_{t-\tau}\| + 2U_r|\eta_t - \eta_{t-\tau}| + 2U_r^2|\mathcal{A}|L_\pi\sum_{i=t-\tau}^{t}\mathbb{E}\|\theta_i - \theta_{t-\tau}\| + 4U_r^2 m\rho^{\tau-1},
\end{align*}
which concludes the proof.

Proof of Lemma C.3. We divide the proof of this lemma into four steps.

Step 1: show that for any $\theta_1, \theta_2, \omega$ and tuple $O = (s, a, s')$, we have
\[
|\Psi(O, \omega, \theta_1) - \Psi(O, \omega, \theta_2)| \le C_1\|\theta_1 - \theta_2\|, \tag{35}
\]
where $C_1 = 2U_\delta^2|\mathcal{A}|L_\pi\bigl(1 + \lceil\log_\rho m^{-1}\rceil + \frac{1}{1-\rho}\bigr) + 2U_\delta L_J + 2U_\delta L_*$. By the definition of $\Psi(O, \omega, \theta)$ in (13), we have
\begin{align*}
|\Psi(O, \omega, \theta_1) - \Psi(O, \omega, \theta_2)|
&= \bigl|\langle\omega - \omega^*_1, g(O, \omega, \theta_1) - \bar g(\omega, \theta_1)\rangle - \langle\omega - \omega^*_2, g(O, \omega, \theta_2) - \bar g(\omega, \theta_2)\rangle\bigr| \\
&\le \underbrace{\bigl|\langle\omega - \omega^*_1, g(O, \omega, \theta_1) - \bar g(\omega, \theta_1)\rangle - \langle\omega - \omega^*_1, g(O, \omega, \theta_2) - \bar g(\omega, \theta_2)\rangle\bigr|}_{I_1} \\
&\quad + \underbrace{\bigl|\langle\omega - \omega^*_1, g(O, \omega, \theta_2) - \bar g(\omega, \theta_2)\rangle - \langle\omega - \omega^*_2, g(O, \omega, \theta_2) - \bar g(\omega, \theta_2)\rangle\bigr|}_{I_2}.
\end{align*}
For term $I_1$, we have
\begin{align*}
I_1 &\le \bigl|\langle\omega - \omega^*_1, g(O, \omega, \theta_1) - g(O, \omega, \theta_2)\rangle\bigr| + \bigl|\langle\omega - \omega^*_1, \bar g(\omega, \theta_1) - \bar g(\omega, \theta_2)\rangle\bigr| \\
&= \bigl|\langle\omega - \omega^*_1, \phi(s)\bigl(J(\theta_1) - J(\theta_2)\bigr)\rangle\bigr| + \bigl|\langle\omega - \omega^*_1, \bar g(\omega, \theta_1) - \bar g(\omega, \theta_2)\rangle\bigr| \\
&\le 2U_\omega L_J\|\theta_1 - \theta_2\| + 2U_\omega\|\bar g(\omega, \theta_1) - \bar g(\omega, \theta_2)\| \\
&\le 2U_\omega L_J\|\theta_1 - \theta_2\| + 2U_\omega\cdot 2U_\delta\, d_{TV}\bigl(\mu_{\theta_1}\otimes\pi_{\theta_1}\otimes\mathcal{P},\ \mu_{\theta_2}\otimes\pi_{\theta_2}\otimes\mathcal{P}\bigr) \\
&\le 2U_\omega L_J\|\theta_1 - \theta_2\| + 2U_\delta^2\, d_{TV}\bigl(\mu_{\theta_1}\otimes\pi_{\theta_1}\otimes\mathcal{P},\ \mu_{\theta_2}\otimes\pi_{\theta_2}\otimes\mathcal{P}\bigr) \\
&\le \Bigl(2U_\delta L_J + 2U_\delta^2|\mathcal{A}|L_\pi\bigl(1 + \lceil\log_\rho m^{-1}\rceil + \tfrac{1}{1-\rho}\bigr)\Bigr)\|\theta_1 - \theta_2\|,
\end{align*}
where we use the fact that $U_\delta = 2U_r + 2U_\omega$ and the last inequality comes from Lemma B.5. For term $I_2$, from the Cauchy-Schwarz inequality, we have
\[
I_2 = \bigl|\langle\omega^*_1 - \omega^*_2, g(O, \omega, \theta_2) - \bar g(\omega, \theta_2)\rangle\bigr| \le 2U_\delta\|\omega^*_1 - \omega^*_2\| \le 2U_\delta L_*\|\theta_1 - \theta_2\|.
\]
Combining the results for $I_1$ and $I_2$, we get
\[
|\Psi(O, \omega, \theta_1) - \Psi(O, \omega, \theta_2)| \le C_1\|\theta_1 - \theta_2\|,
\]
where $C_1 = 2U_\delta^2|\mathcal{A}|L_\pi\bigl(1 + \lceil\log_\rho m^{-1}\rceil + \frac{1}{1-\rho}\bigr) + 2U_\delta L_J + 2U_\delta L_*$.

Step 2: show that for any $\theta, \omega_1, \omega_2$ and tuple $O = (s, a, s')$, we have
\[
|\Psi(O, \omega_1, \theta) - \Psi(O, \omega_2, \theta)| \le 6U_\delta\|\omega_1 - \omega_2\|. \tag{36}
\]
By definition, we have
\begin{align*}
|\Psi(O, \omega_1, \theta) - \Psi(O, \omega_2, \theta)|
&= \bigl|\langle\omega_1 - \omega^*, g(O, \omega_1, \theta) - \bar g(\omega_1, \theta)\rangle - \langle\omega_2 - \omega^*, g(O, \omega_2, \theta) - \bar g(\omega_2, \theta)\rangle\bigr| \\
&\le \bigl|\langle\omega_1 - \omega^*, g(O, \omega_1, \theta) - \bar g(\omega_1, \theta)\rangle - \langle\omega_1 - \omega^*, g(O, \omega_2, \theta) - \bar g(\omega_2, \theta)\rangle\bigr| \\
&\quad + \bigl|\langle\omega_1 - \omega^*, g(O, \omega_2, \theta) - \bar g(\omega_2, \theta)\rangle - \langle\omega_2 - \omega^*, g(O, \omega_2, \theta) - \bar g(\omega_2, \theta)\rangle\bigr| \\
&\le 2U_\omega\bigl\|\bigl(g(O, \omega_1, \theta) - g(O, \omega_2, \theta)\bigr) - \bigl(\bar g(\omega_1, \theta) - \bar g(\omega_2, \theta)\bigr)\bigr\| + 2U_\delta\|\omega_1 - \omega_2\| \\
&\le 6U_\delta\|\omega_1 - \omega_2\|,
\end{align*}
where the last inequality is due to
\[
\|g(O, \omega_1, \theta) - g(O, \omega_2, \theta)\| = \bigl|(\phi(s') - \phi(s))^\top(\omega_1 - \omega_2)\bigr| \le 2\|\omega_1 - \omega_2\|,
\qquad
\|\bar g(\omega_1, \theta) - \bar g(\omega_2, \theta)\| \le 2\|\omega_1 - \omega_2\|,
\]
and $2U_\omega \le U_\delta$.

Step 3: show that for the tuples $O_t = (s_t, a_t, s_{t+1})$ and $\widetilde O_t = (\widetilde s_t, \widetilde a_t, \widetilde s_{t+1})$, conditioned on $s_{t-\tau+1}$ and $\theta_{t-\tau}$, we have
\[
\mathbb{E}[\Psi(O_t, \omega_{t-\tau}, \theta_{t-\tau}) - \Psi(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau})] \le U_\delta^2|\mathcal{A}|L_\pi\sum_{k=t-\tau}^{t}\mathbb{E}\|\theta_k - \theta_{t-\tau}\|. \tag{37}
\]
By the definition of the total variation norm, we have
\begin{align*}
\mathbb{E}[\Psi(O_t, \omega_{t-\tau}, \theta_{t-\tau}) - \Psi(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau})]
&= \mathbb{E}\bigl[\langle\omega_{t-\tau} - \omega^*_{t-\tau}, g(O_t, \omega_{t-\tau}, \theta_{t-\tau}) - g(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau})\rangle\bigr] \\
&\le 2U_\delta^2\, d_{TV}\bigl(\mathbb{P}(O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mathbb{P}(\widetilde O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau})\bigr) \\
&\overset{(1)}{\le} U_\delta^2|\mathcal{A}|L_\pi\sum_{k=t-\tau}^{t}\mathbb{E}\|\theta_k - \theta_{t-\tau}\|
\le U_\delta^2|\mathcal{A}|L_\pi G\tau(\tau+1)\alpha_{t-\tau},
\end{align*}
where $(1)$ follows from (33).

Step 4: show that conditioned on $s_{t-\tau+1}$ and $\theta_{t-\tau}$,
\[
\mathbb{E}[\Psi(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau})] \le 2U_\delta^2 m\rho^{\tau-1}. \tag{38}
\]
From the definition of $\Psi(O, \omega, \theta)$, we have $\mathbb{E}[\Psi(O^*_{t-\tau}, \omega_{t-\tau}, \theta_{t-\tau})\,|\,s_{t-\tau+1}, \theta_{t-\tau}] = 0$, where $O^*_{t-\tau}$ is the tuple generated by $s^*_{t-\tau} \sim \mu_{\theta_{t-\tau}}$, $a^*_{t-\tau} \sim \pi_{\theta_{t-\tau}}$, $s^*_{t-\tau+1} \sim \mathcal{P}$. From Assumption 3.2, we have
\[
d_{TV}\bigl(\mathbb{P}(\widetilde s_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mu_{\theta_{t-\tau}}\bigr) \le m\rho^{\tau-1}.
\]
Then it holds that
\begin{align*}
\mathbb{E}[\Psi(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau})]
&= \mathbb{E}[\Psi(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau}) - \Psi(O^*_{t-\tau}, \omega_{t-\tau}, \theta_{t-\tau})] \\
&= \mathbb{E}\bigl[\langle\omega_{t-\tau} - \omega^*_{t-\tau}, g(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau}) - g(O^*_{t-\tau}, \omega_{t-\tau}, \theta_{t-\tau})\rangle\bigr] \\
&\le 4U_\omega U_\delta\, d_{TV}\bigl(\mathbb{P}(\widetilde O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mu_{\theta_{t-\tau}}\otimes\pi_{\theta_{t-\tau}}\otimes\mathcal{P}\bigr) \\
&\le 2U_\delta^2\, d_{TV}\bigl(\mathbb{P}(\widetilde O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mu_{\theta_{t-\tau}}\otimes\pi_{\theta_{t-\tau}}\otimes\mathcal{P}\bigr) \\
&= 2U_\delta^2\, d_{TV}\bigl(\mathbb{P}((\widetilde s_t, \widetilde a_t) \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mu_{\theta_{t-\tau}}\otimes\pi_{\theta_{t-\tau}}\bigr) \\
&= 2U_\delta^2\, d_{TV}\bigl(\mathbb{P}(\widetilde s_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mu_{\theta_{t-\tau}}\bigr) \le 2U_\delta^2 m\rho^{\tau-1}.
\end{align*}
Combining (35), (36), (37), and (38), we have
\begin{align*}
\mathbb{E}[\Psi(O_t, \omega_t, \theta_t)]
&= \mathbb{E}[\Psi(O_t, \omega_t, \theta_t) - \Psi(O_t, \omega_t, \theta_{t-\tau})] + \mathbb{E}[\Psi(O_t, \omega_t, \theta_{t-\tau}) - \Psi(O_t, \omega_{t-\tau}, \theta_{t-\tau})] \\
&\quad + \mathbb{E}[\Psi(O_t, \omega_{t-\tau}, \theta_{t-\tau}) - \Psi(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau})] + \mathbb{E}[\Psi(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau})] \\
&\le C_1\|\theta_t - \theta_{t-\tau}\| + 6U_\delta\|\omega_t - \omega_{t-\tau}\| + U_\delta^2|\mathcal{A}|L_\pi G\tau(\tau+1)\alpha_{t-\tau} + 2U_\delta^2 m\rho^{\tau-1},
\end{align*}
where $C_1 = 2U_\delta^2|\mathcal{A}|L_\pi\bigl(1 + \lceil\log_\rho m^{-1}\rceil + \frac{1}{1-\rho}\bigr) + 2U_\delta(L_J + L_*)$. This concludes the proof.

Proof of Lemma C.4. We divide the proof of this lemma into four steps.

Step 1: show that
\[
\Xi(O_t, \omega_t, \theta_t) - \Xi(O_t, \omega_t, \theta_{t-\tau}) \le (3U_\delta L_h + 2U_\delta BL_*)\|\theta_t - \theta_{t-\tau}\|. \tag{39}
\]
Since $\Xi(O, \omega, \theta) = \langle\omega - \omega^*_\theta, \mathbb{E}_{O'}[h(O', \theta)] - h(O, \theta)\rangle$, we write $\mathbb{E}_\theta[h(O', \theta)] := \mathbb{E}_{O'}[h(O', \theta)]$, where $\mathbb{E}_\theta$ is shorthand for $\mathbb{E}_{O'\sim(\mu_\theta, \pi_\theta, \mathcal{P})}$. In the following, we show that each term in $\Xi(O_t, \omega_t, \theta_t)$ is Lipschitz in $\theta$. The term $\omega_t$ does not depend on $\theta$, and the term $\omega^*_t := \omega^*(\theta_t)$ is $L_*$-Lipschitz. For the term $h(O, \theta)$, with $\delta(O, \theta) := r(s, a) - J(\theta) + (\phi(s') - \phi(s))^\top\omega^*_\theta$ so that $h(O, \theta) = \delta(O, \theta)\nabla\log\pi_\theta(a|s)$, we have
\begin{align*}
\|h(O, \theta_1) - h(O, \theta_2)\|
&\le \|\delta(O, \theta_1)\nabla\log\pi_{\theta_1}(a|s) - \delta(O, \theta_1)\nabla\log\pi_{\theta_2}(a|s)\| + \|\delta(O, \theta_1)\nabla\log\pi_{\theta_2}(a|s) - \delta(O, \theta_2)\nabla\log\pi_{\theta_2}(a|s)\| \\
&\le U_\delta L_l\|\theta_1 - \theta_2\| + B|\delta(O, \theta_1) - \delta(O, \theta_2)| \\
&\le U_\delta L_l\|\theta_1 - \theta_2\| + B\bigl(|J(\theta_1) - J(\theta_2)| + \|\phi(s') - \phi(s)\|\,\|\omega^*(\theta_1) - \omega^*(\theta_2)\|\bigr) \\
&\le (U_\delta L_l + 2BL_*)\|\theta_1 - \theta_2\| + B\bigl|\mathbb{E}_{s\sim\mu_{\theta_1},\,a\sim\pi_{\theta_1}}[r(s, a)] - \mathbb{E}_{s\sim\mu_{\theta_2},\,a\sim\pi_{\theta_2}}[r(s, a)]\bigr| \\
&\le (U_\delta L_l + 2BL_*)\|\theta_1 - \theta_2\| + 2BU_r\, d_{TV}\bigl(\mu_{\theta_1}\otimes\pi_{\theta_1},\ \mu_{\theta_2}\otimes\pi_{\theta_2}\bigr) \\
&\le \Bigl(U_\delta L_l + 2BL_* + 2BU_r|\mathcal{A}|L_\pi\bigl(1 + \lceil\log_\rho m^{-1}\rceil + \tfrac{1}{1-\rho}\bigr)\Bigr)\|\theta_1 - \theta_2\|.
\end{align*}
Hence $h(O, \theta)$ is $L_h$-Lipschitz, where $L_h$ denotes the coefficient above. For the term $\mathbb{E}_\theta[h(O', \theta)]$, we have
\begin{align*}
\|\mathbb{E}_{\theta_1}[h(O', \theta_1)] - \mathbb{E}_{\theta_2}[h(O', \theta_2)]\|
&\le \|\mathbb{E}_{\theta_1}[h(O', \theta_1)] - \mathbb{E}_{\theta_1}[h(O', \theta_2)]\| + \|\mathbb{E}_{\theta_1}[h(O', \theta_2)] - \mathbb{E}_{\theta_2}[h(O', \theta_2)]\| \\
&\le \mathbb{E}_{\theta_1}\bigl[\|h(O', \theta_1) - h(O', \theta_2)\|\bigr] + \|\mathbb{E}_{\theta_1}[h(O', \theta_2)] - \mathbb{E}_{\theta_2}[h(O', \theta_2)]\| \\
&\le L_h\|\theta_1 - \theta_2\| + 2BU_r\, d_{TV}\bigl(\mu_{\theta_1}\otimes\pi_{\theta_1},\ \mu_{\theta_2}\otimes\pi_{\theta_2}\bigr) \\
&\le \Bigl[L_h + 2BU_r|\mathcal{A}|L_\pi\bigl(1 + \lceil\log_\rho m^{-1}\rceil + \tfrac{1}{1-\rho}\bigr)\Bigr]\|\theta_1 - \theta_2\|
\le 2L_h\|\theta_1 - \theta_2\|.
\end{align*}
Consequently, $\omega_t - \omega^*_t$ is $U_\delta$-bounded and $L_*$-Lipschitz, while $h(O, \theta) - \mathbb{E}_\theta[h(O', \theta)]$ is $3L_h$-Lipschitz and $2U_\delta B$-bounded.
By the triangle inequality, we have
\[
\Xi(O_t, \omega_t, \theta_t) - \Xi(O_t, \omega_t, \theta_{t-\tau}) \le (3U_\delta L_h + 2U_\delta BL_*)\|\theta_t - \theta_{t-\tau}\| \le C_2\|\theta_t - \theta_{t-\tau}\|,
\]
where $C_2 := 3BU_\delta^2|\mathcal{A}|L_\pi\bigl(1 + \lceil\log_\rho m^{-1}\rceil + \frac{1}{1-\rho}\bigr) + 3U_\delta^2 L_l + 8U_\delta BL_*$.

Step 2: show that
\[
\Xi(O, \omega_1, \theta) - \Xi(O, \omega_2, \theta) \le 2U_\delta B\|\omega_1 - \omega_2\|. \tag{40}
\]
Indeed, we have
\[
\Xi(O, \omega_1, \theta) - \Xi(O, \omega_2, \theta) = \langle\omega_1 - \omega_2, \mathbb{E}_{O'}[h(O', \theta)] - h(O, \theta)\rangle \le 2U_\delta B\|\omega_1 - \omega_2\|.
\]

Step 3: show that for the tuples $O_t = (s_t, a_t, s_{t+1})$ and $\widetilde O_t = (\widetilde s_t, \widetilde a_t, \widetilde s_{t+1})$, conditioned on $s_{t-\tau+1}$ and $\theta_{t-\tau}$, we have
\[
\mathbb{E}[\Xi(O_t, \omega_{t-\tau}, \theta_{t-\tau}) - \Xi(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau})] \le 2U_\delta^2 B|\mathcal{A}|L_\pi\sum_{k=t-\tau}^{t}\mathbb{E}\|\theta_k - \theta_{t-\tau}\|. \tag{41}
\]
By the definition of $\Xi(O, \omega, \theta)$, we have
\begin{align}
\mathbb{E}[\Xi(O_t, \omega_{t-\tau}, \theta_{t-\tau}) - \Xi(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau})]
&= \mathbb{E}\bigl[\langle\omega_{t-\tau} - \omega^*_{t-\tau}, h(\widetilde O_t, \theta_{t-\tau}) - h(O_t, \theta_{t-\tau})\rangle\bigr] \notag\\
&= \mathbb{E}\bigl[\langle\omega_{t-\tau} - \omega^*_{t-\tau}, h(\widetilde O_t, \theta_{t-\tau})\rangle\bigr] - \mathbb{E}\bigl[\langle\omega_{t-\tau} - \omega^*_{t-\tau}, h(O_t, \theta_{t-\tau})\rangle\bigr] \notag\\
&\le 4U_\delta^2 B\, d_{TV}\bigl(\mathbb{P}(O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mathbb{P}(\widetilde O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau})\bigr), \tag{42}
\end{align}
where the inequality comes from the definition of the total variation distance. The total variation distance between $O_t$ and $\widetilde O_t$ has been bounded in (33). Plugging (33) into (42), we get
\[
\mathbb{E}[\Xi(O_t, \omega_{t-\tau}, \theta_{t-\tau}) - \Xi(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau})]
\le 2U_\delta^2 B|\mathcal{A}|L_\pi\sum_{k=t-\tau}^{t}\mathbb{E}\|\theta_k - \theta_{t-\tau}\|
\le 2U_\delta^2 B|\mathcal{A}|L_\pi G\tau(\tau+1)\alpha_{t-\tau}.
\]

Step 4: show that conditioned on $s_{t-\tau+1}$ and $\theta_{t-\tau}$, we have
\[
\mathbb{E}[\Xi(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau})] \le 4U_\delta^2 B m\rho^{\tau-1}. \tag{43}
\]
It can be shown that
\begin{align*}
\mathbb{E}[\Xi(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau})]
&\overset{(1)}{=} \mathbb{E}[\Xi(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau}) - \Xi(O^*_t, \omega_{t-\tau}, \theta_{t-\tau})] \\
&= \mathbb{E}\bigl[\langle\omega_{t-\tau} - \omega^*_{t-\tau}, h(O^*_t, \theta_{t-\tau})\rangle - \langle\omega_{t-\tau} - \omega^*_{t-\tau}, h(\widetilde O_t, \theta_{t-\tau})\rangle\bigr] \\
&\overset{(2)}{\le} 4U_\delta^2 B\, d_{TV}\bigl(\mathbb{P}(\widetilde O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mu_{\theta_{t-\tau}}\otimes\pi_{\theta_{t-\tau}}\otimes\mathcal{P}\bigr),
\end{align*}
where $(1)$ is due to the fact that $O^*_t$ is drawn from the stationary distribution, which satisfies $\mathbb{E}[\Xi(O^*_t, \omega_{t-\tau}, \theta_{t-\tau})] = 0$, and $(2)$ follows from the definition of the total variation distance. From Assumption 3.2, we know that $d_{TV}(\mathbb{P}(\widetilde s_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mu_{\theta_{t-\tau}}) \le m\rho^{\tau-1}$. We also have the fact that $\mathbb{P}(\widetilde O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}) = \mathbb{P}(\widetilde s_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau})\otimes\pi_{\theta_{t-\tau}}\otimes\mathcal{P}$. Therefore, we have
\[
\mathbb{E}[\Xi(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau})] \le 4U_\delta^2 B m\rho^{\tau-1}.
\]
Combining (39)-(43), we can decompose the Markovian bias as
\begin{align*}
\mathbb{E}[\Xi(O_t, \omega_t, \theta_t)]
&= \mathbb{E}[\Xi(O_t, \omega_t, \theta_t) - \Xi(O_t, \omega_t, \theta_{t-\tau})] + \mathbb{E}[\Xi(O_t, \omega_t, \theta_{t-\tau}) - \Xi(O_t, \omega_{t-\tau}, \theta_{t-\tau})] \\
&\quad + \mathbb{E}[\Xi(O_t, \omega_{t-\tau}, \theta_{t-\tau}) - \Xi(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau})] + \mathbb{E}[\Xi(\widetilde O_t, \omega_{t-\tau}, \theta_{t-\tau})] \\
&\le C_2\|\theta_t - \theta_{t-\tau}\| + 2U_\delta B\|\omega_t - \omega_{t-\tau}\| + 2U_\delta^2 B|\mathcal{A}|L_\pi G\tau(\tau+1)\alpha_{t-\tau} + 4U_\delta^2 B m\rho^{\tau-1}.
\end{align*}
Thus we conclude our proof.

Proof of Lemma C.6. We divide the proof of this lemma into three steps.

Step 1: show that
\[
|\Theta(O_t, \theta_t) - \Theta(O_t, \theta_{t-\tau})| \le (2U_\delta BL_J + 3L_J L_h)\|\theta_t - \theta_{t-\tau}\|, \tag{44}
\]
where $L_h = U_\delta L_l + 2BL_* + 2BU_r|\mathcal{A}|L_\pi\bigl(1 + \lceil\log_\rho m^{-1}\rceil + \frac{1}{1-\rho}\bigr)$ as in the proof of Lemma C.4. Since $\Theta(O, \theta) = \langle\nabla J(\theta), \mathbb{E}_\theta[h(O', \theta)] - h(O, \theta)\rangle$, we show that each term in $\Theta(O, \theta)$ is Lipschitz in $\theta$. For the term $\nabla J(\theta)$, Lemma B.3 shows that it is $L_J$-Lipschitz. For the terms $h(O, \theta)$ and $\mathbb{E}_\theta[h(O', \theta)]$, with $\delta(O, \theta) := r(s, a) - J(\theta) + (\phi(s') - \phi(s))^\top\omega^*_\theta$ as before, the same argument as in the proof of Lemma C.4 gives
\[
\|h(O, \theta_1) - h(O, \theta_2)\| \le L_h\|\theta_1 - \theta_2\|
\quad\text{and}\quad
\|\mathbb{E}_{\theta_1}[h(O', \theta_1)] - \mathbb{E}_{\theta_2}[h(O', \theta_2)]\| \le 2L_h\|\theta_1 - \theta_2\|.
\]
Consequently, $\nabla J(\theta)$ is $L_J$-bounded and $L_J$-Lipschitz, while $h(O, \theta) - \mathbb{E}_\theta[h(O', \theta)]$ is $3L_h$-Lipschitz and $2U_\delta B$-bounded.
By the triangle inequality, we have
\[
|\Theta(O_t, \theta_t) - \Theta(O_t, \theta_{t-\tau})| \le (2U_\delta BL_J + 3L_J L_h)\|\theta_t - \theta_{t-\tau}\|.
\]

Step 2: show that for $t \ge \tau > 0$, we have
\[
\bigl|\mathbb{E}[\Theta(O_t, \theta_{t-\tau}) - \Theta(\widetilde O_t, \theta_{t-\tau})]\bigr| \le 2U_\delta BL_J|\mathcal{A}|L_\pi\sum_{k=t-\tau}^{t}\mathbb{E}\|\theta_k - \theta_{t-\tau}\|. \tag{45}
\]
By the definition of $\Theta(O, \theta)$, we have
\begin{align}
\bigl|\mathbb{E}[\Theta(O_t, \theta_{t-\tau}) - \Theta(\widetilde O_t, \theta_{t-\tau})]\bigr|
&= \bigl|\mathbb{E}[\langle\nabla J(\theta_{t-\tau}), h(\widetilde O_t, \theta_{t-\tau}) - h(O_t, \theta_{t-\tau})\rangle]\bigr| \notag\\
&= \bigl|\mathbb{E}[\langle\nabla J(\theta_{t-\tau}), h(\widetilde O_t, \theta_{t-\tau})\rangle] - \mathbb{E}[\langle\nabla J(\theta_{t-\tau}), h(O_t, \theta_{t-\tau})\rangle]\bigr| \notag\\
&\le 4U_\delta BL_J\, d_{TV}\bigl(\mathbb{P}(O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mathbb{P}(\widetilde O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau})\bigr), \tag{46}
\end{align}
where the inequality comes from the definition of the total variation distance. The total variation distance between $O_t$ and $\widetilde O_t$ has been bounded in (33). Plugging (33) into (46), we get
\[
\bigl|\mathbb{E}[\Theta(O_t, \theta_{t-\tau}) - \Theta(\widetilde O_t, \theta_{t-\tau})]\bigr| \le 2U_\delta BL_J|\mathcal{A}|L_\pi\sum_{k=t-\tau}^{t}\mathbb{E}\|\theta_k - \theta_{t-\tau}\|.
\]

Step 3: show that for $t \ge \tau > 0$, we have
\[
\bigl|\mathbb{E}[\Theta(\widetilde O_t, \theta_{t-\tau}) - \Theta(O^*_t, \theta_{t-\tau})]\bigr| \le 4U_\delta BL_J m\rho^{\tau-1}. \tag{47}
\]
From the definition of $\Theta(O, \theta)$, we have
\begin{align*}
\bigl|\mathbb{E}[\Theta(\widetilde O_t, \theta_{t-\tau}) - \Theta(O^*_t, \theta_{t-\tau})]\bigr|
&= \bigl|\mathbb{E}[\langle\nabla J(\theta_{t-\tau}), h(O^*_t, \theta_{t-\tau})\rangle - \langle\nabla J(\theta_{t-\tau}), h(\widetilde O_t, \theta_{t-\tau})\rangle]\bigr| \\
&\le 4U_\delta BL_J\, d_{TV}\bigl(\mathbb{P}(\widetilde O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mu_{\theta_{t-\tau}}\otimes\pi_{\theta_{t-\tau}}\otimes\mathcal{P}\bigr),
\end{align*}
where the inequality is due to the definition of the total variation distance. From Assumption 3.2, we know that $d_{TV}(\mathbb{P}(\widetilde s_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}),\ \mu_{\theta_{t-\tau}}) \le m\rho^{\tau-1}$, and we also have $\mathbb{P}(\widetilde O_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau}) = \mathbb{P}(\widetilde s_t \in \cdot\,|\,s_{t-\tau+1}, \theta_{t-\tau})\otimes\pi_{\theta_{t-\tau}}\otimes\mathcal{P}$. Therefore, we have
\[
\bigl|\mathbb{E}[\Theta(\widetilde O_t, \theta_{t-\tau}) - \Theta(O^*_t, \theta_{t-\tau})]\bigr| \le 4U_\delta BL_J m\rho^{\tau-1}.
\]
Combining (44), (45), and (47), we can decompose the Markovian bias as
\begin{align*}
\mathbb{E}[\Theta(O_t, \theta_t)]
&= \mathbb{E}[\Theta(O_t, \theta_t) - \Theta(O_t, \theta_{t-\tau})] + \mathbb{E}[\Theta(O_t, \theta_{t-\tau}) - \Theta(\widetilde O_t, \theta_{t-\tau})] \\
&\quad + \mathbb{E}[\Theta(\widetilde O_t, \theta_{t-\tau}) - \Theta(O^*_t, \theta_{t-\tau})] + \mathbb{E}[\Theta(O^*_t, \theta_{t-\tau})],
\end{align*}
where $\widetilde O_t$ is from the auxiliary Markov chain defined in (10) and $O^*_t$ is from the stationary distribution, which satisfies $\mathbb{E}[\Theta(O^*_t, \theta_{t-\tau})] = 0$. Then we have
\begin{align*}
\mathbb{E}[\Theta(O_t, \theta_t)]
&\le (2U_\delta BL_J + 3L_J L_h)\mathbb{E}\|\theta_t - \theta_{t-\tau}\| + 2U_\delta BL_J|\mathcal{A}|L_\pi\sum_{k=t-\tau}^{t}\mathbb{E}\|\theta_k - \theta_{t-\tau}\| + 4U_\delta BL_J m\rho^{\tau-1} \\
&\le (2U_\delta BL_J + 3L_J L_h)\sum_{k=t-\tau+1}^{t}\mathbb{E}\|\theta_k - \theta_{k-1}\| + 2U_\delta BL_J|\mathcal{A}|L_\pi\tau\sum_{j=t-\tau+1}^{t}\mathbb{E}\|\theta_j - \theta_{j-1}\| + 4U_\delta BL_J m\rho^{\tau-1} \\
&\le D_1(1+\tau)\sum_{k=t-\tau+1}^{t}\mathbb{E}\|\theta_k - \theta_{k-1}\| + D_2 m\rho^{\tau-1}
\le D_1(\tau+1)^2 G\alpha_{t-\tau} + D_2 m\rho^{\tau-1},
\end{align*}
where $D_1 = \max\{2U_\delta BL_J + 3L_J L_h,\ 2U_\delta BL_J|\mathcal{A}|L_\pi\}$ and $D_2 = 4U_\delta BL_J$. Thus we conclude the proof.

E IID Sampling Analysis

Algorithm 2 Single-timescale Actor-Critic (i.i.d. sampling)
1: Input: initial actor parameter θ0, initial critic parameter ω0, initial reward estimator η0, stepsize αt for the actor, βt for the critic, and γt for the reward estimator.
2: for t = 0, 1, 2, ..., T − 1 do
3:   Sample st ∼ µθt
4:   Take the action at ∼ πθt(·|st)
5:   Observe the next state s′t ∼ P(·|st, at) and the reward rt = r(st, at)
6:   δt = rt − ηt + φ(s′t)⊤ωt − φ(st)⊤ωt
7:   ηt+1 = ηt + γt(rt − ηt)
8:   ωt+1 = ΠUω(ωt + βtδtφ(st))
9:   θt+1 = θt + αtδt∇θ log πθt(at|st)
10: end for

Note that under i.i.d. sampling in Algorithm 2, we denote by $s_t$ the sample drawn from the stationary distribution $\mu_{\theta_t}$ and by $s'_t$ the subsequent state generated by the transition kernel, $s'_t \sim \mathcal{P}(\cdot|s_t, a_t)$. Correspondingly, we redefine the observation tuple as $O_t = (s_t, a_t, s'_t)$ (in contrast to $O_t = (s_t, a_t, s_{t+1})$ in the Markovian sampling case). This modification decouples $O_t$ from $O_{t+1}$, since $s_{t+1}$ in the tuple $O_{t+1}$ is a fresh sample from the stationary distribution rather than a state inherited from $O_t$. This intuitively explains why the Markovian noise vanishes under i.i.d. sampling.
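Below is a minimal, runnable Python sketch of the update loop in Algorithm 2, instantiated on a small synthetic tabular MDP with a softmax policy and one-hot features purely for illustration. The random MDP, the feature map, the projection radius, and the stepsize constants are hypothetical placeholders rather than the paper's setting (which is a continuous state space with general linear function approximation); the sketch only mirrors the order of the updates in lines 3-9.

import numpy as np

# Illustrative sketch of Algorithm 2 on a random tabular MDP (all constants hypothetical).
rng = np.random.default_rng(0)
S, A = 5, 3                                     # number of states / actions
P = rng.dirichlet(np.ones(S), size=(S, A))      # transition kernel P(s'|s,a)
R = rng.uniform(0, 1, size=(S, A))              # reward table r(s,a)
phi = np.eye(S)                                 # one-hot features phi(s), ||phi|| <= 1

def policy(theta, s):
    logits = theta[s]                           # theta is (S, A): tabular softmax policy
    p = np.exp(logits - logits.max())
    return p / p.sum()

def stationary_dist(theta):
    # State stationary distribution mu_theta of the chain induced by pi_theta.
    pi_mat = np.vstack([policy(theta, s) for s in range(S)])
    Ppi = np.einsum('sa,sap->sp', pi_mat, P)
    evals, evecs = np.linalg.eig(Ppi.T)
    mu = np.abs(np.real(evecs[:, np.argmax(np.real(evals))]))
    return mu / mu.sum()

T = 20_000
alpha, beta, gamma = 0.02, 0.04, 0.04           # single timescale: constant ratio alpha/beta
U_omega = 10.0                                  # projection radius for the critic
theta = np.zeros((S, A)); omega = np.zeros(S); eta = 0.0

for t in range(T):
    s = rng.choice(S, p=stationary_dist(theta))          # line 3: s_t ~ mu_theta (i.i.d.)
    pi_s = policy(theta, s)
    a = rng.choice(A, p=pi_s)                             # line 4: a_t ~ pi_theta(.|s_t)
    s_next = rng.choice(S, p=P[s, a])                     # line 5: s'_t ~ P(.|s_t, a_t)
    r = R[s, a]
    delta = r - eta + phi[s_next] @ omega - phi[s] @ omega  # line 6: TD error
    eta += gamma * (r - eta)                              # line 7: reward-rate estimator
    omega += beta * delta * phi[s]                        # line 8: critic TD step ...
    if np.linalg.norm(omega) > U_omega:                   # ... with projection onto the ball
        omega *= U_omega / np.linalg.norm(omega)
    grad_log_pi = -pi_s
    grad_log_pi[a] += 1.0                                 # grad of log softmax w.r.t. theta[s]
    theta[s] += alpha * delta * grad_log_pi               # line 9: actor policy-gradient step

print(f"estimated average reward eta = {eta:.3f}")

Drawing $s_t \sim \mu_{\theta_t}$ afresh at every iteration (line 3) is what decouples consecutive tuples; in the Markovian variant one would instead continue the trajectory by taking the next tuple's state to be the observed $s'_t$.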
Lemma E.1. Under i.i.d. sampling, we have
\[
\mathbb{E}[\Phi(O_t, \eta_t, \theta_t)] = 0, \quad \mathbb{E}[\Psi(O_t, \omega_t, \theta_t)] = 0, \quad \mathbb{E}[\Theta(O_t, O'_t, \theta_t)] = 0, \quad \mathbb{E}[\Xi(O_t, \omega_t, \theta_t)] = 0.
\]

Proof. Note that the expectation is taken over all the random variables. We use $O_t$ to denote the tuple $(s_t, a_t, s'_t)$ and $v_{0:t}$ to denote the sequence $(s_0, a_0, s'_0), (s_1, a_1, s'_1), \ldots, (s_t, a_t, s'_t)$. By the definitions in (13), it can be shown that
\[
\mathbb{E}[\Phi(O_t, \eta_t, \theta_t)] = \mathbb{E}_{v_{0:t}}[\Phi(O_t, \eta_t, \theta_t)] = \mathbb{E}_{v_{0:t-1}}\bigl[\mathbb{E}\bigl[(\eta_t - J(\theta_t))(r_t - J(\theta_t))\,\big|\,v_{0:t-1}\bigr]\bigr],
\]
where the second equality is due to the law of total expectation. Once $v_{0:t-1}$ is given, $\eta_t$ and $J(\theta_t)$ are no longer random. It holds that
\begin{align*}
\mathbb{E}[\Phi(O_t, \eta_t, \theta_t)]
&= \mathbb{E}_{v_{0:t-1}}\bigl[\mathbb{E}\bigl[(\eta_t - J(\theta_t))(r_t - J(\theta_t))\,\big|\,v_{0:t-1}\bigr]\bigr] \\
&= \mathbb{E}_{v_{0:t-1}}\bigl[(\eta_t - J(\theta_t))\,\mathbb{E}_{O_t}\bigl[r_t - J(\theta_t)\,\big|\,v_{0:t-1}\bigr]\bigr] = 0,
\end{align*}
where the last equality is due to $\mathbb{E}_{O_t}[r_t - J(\theta_t)\,|\,v_{0:t-1}] = 0$ under i.i.d. sampling. By a similar argument, we have
\begin{align*}
\mathbb{E}[\Psi(O_t, \omega_t, \theta_t)]
&= \mathbb{E}_{v_{0:t}}\bigl[\langle\omega_t - \omega^*_t, g(O_t, \omega_t, \theta_t) - \bar g(\omega_t, \theta_t)\rangle\bigr] \\
&= \mathbb{E}_{v_{0:t-1}}\bigl[\mathbb{E}\bigl[\langle\omega_t - \omega^*_t, g(O_t, \omega_t, \theta_t) - \bar g(\omega_t, \theta_t)\rangle\,\big|\,v_{0:t-1}\bigr]\bigr] \\
&= \mathbb{E}_{v_{0:t-1}}\bigl[\langle\omega_t - \omega^*_t, \mathbb{E}_{O_t}\bigl[g(O_t, \omega_t, \theta_t) - \bar g(\omega_t, \theta_t)\,\big|\,v_{0:t-1}\bigr]\rangle\bigr] = 0,
\end{align*}
where we use the fact that $\mathbb{E}_{O_t}[g(O_t, \omega_t, \theta_t) - \bar g(\omega_t, \theta_t)\,|\,v_{0:t-1}] = 0$. Similarly, we have
\begin{align*}
\mathbb{E}[\Theta(O_t, O'_t, \theta_t)]
&= \mathbb{E}_{v_{0:t}}\bigl[\langle\nabla J(\theta_t), \mathbb{E}_{O'_t}[h(O'_t, \theta_t)] - h(O_t, \theta_t)\rangle\bigr] \\
&= \mathbb{E}_{v_{0:t-1}}\bigl[\mathbb{E}\bigl[\langle\nabla J(\theta_t), \mathbb{E}_{O'_t}[h(O'_t, \theta_t)] - h(O_t, \theta_t)\rangle\,\big|\,v_{0:t-1}\bigr]\bigr] \\
&= \mathbb{E}_{v_{0:t-1}}\bigl[\langle\nabla J(\theta_t), \mathbb{E}_{O_t}\bigl[\mathbb{E}_{O'_t}[h(O'_t, \theta_t)] - h(O_t, \theta_t)\,\big|\,v_{0:t-1}\bigr]\rangle\bigr] = 0,
\end{align*}
where we use the fact that $O_t$ and $O'_t$ have the same distribution under i.i.d. sampling. The proof of $\mathbb{E}[\Xi(O_t, \omega_t, \theta_t)] = 0$ follows from the same argument, which concludes the proof.

Proof of Theorem 3.6. The proof follows similarly to the Markovian sampling case. Specifically, all the Markovian noise terms (see the definitions in (13)) present in the former analysis reduce to zero after taking expectations; the detailed statements and proofs are given in Lemma E.1. Replacing Lemma C.1, Lemma C.3, and Lemma C.6 with Lemma E.1, we obtain the desired $O(T^{-1/2})$ convergence rate and thus an $O(\epsilon^{-2})$ sample complexity accordingly.
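As a quick sanity check on how these rates translate into the sample complexities claimed in the main text (reading $\epsilon$-stationarity as the averaged squared gradient norm dropping below $\epsilon$, up to the non-diminishing approximation error), note that under i.i.d. sampling
\[
\min_{0\le t< T}\mathbb{E}\|\nabla J(\theta_t)\|^2 \;\le\; G_T \;=\; O\bigl(T^{-1/2}\bigr) + O(\epsilon_{\mathrm{app}}) \;\le\; \epsilon + O(\epsilon_{\mathrm{app}})
\quad\text{once}\quad T = \Omega\bigl(\epsilon^{-2}\bigr),
\]
and each iteration of Algorithm 2 consumes a single transition tuple, giving the $O(\epsilon^{-2})$ sample complexity; in the Markovian case the rate $O(\log^2 T/\sqrt{T})$ yields $\widetilde O(\epsilon^{-2})$ in the same way, with the logarithmic factor coming from the mixing time $\tau_T$.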