# On Convergence of Gradient Expected Sarsa(λ)

Long Yang¹, Gang Zheng¹, Yu Zhang¹, Qian Zheng², Pengfei Li¹, Gang Pan¹*

¹College of Computer Science and Technology, Zhejiang University, China. ²School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.
¹{yanglong, gang zheng, hzzhangyu, pfl, gpan}@zju.edu.cn; ²zhengqian@ntu.edu.sg
*Corresponding author.

## Abstract

We study the convergence of Expected Sarsa(λ) with function approximation. We show that the off-line estimate (multi-step bootstrapping) extension of Expected Sarsa(λ) is unstable for off-policy learning. Furthermore, based on the convex-concave saddle-point framework, we propose a convergent Gradient Expected Sarsa(λ) (GES(λ)) algorithm. The theoretical analysis shows that the proposed GES(λ) converges to the optimal solution at a linear convergence rate under the true-gradient setting. Furthermore, we develop a Lyapunov-function technique to investigate how the step-size influences the finite-time performance of GES(λ); this technique can potentially be generalized to other gradient temporal-difference algorithms. Finally, our experiments verify the effectiveness of GES(λ). For detailed proofs, please refer to https://arxiv.org/pdf/2012.07199.pdf.

## Introduction

Tabular Expected Sarsa(λ) (with importance sampling) is one of the widely used methods for off-policy evaluation in reinforcement learning (RL), whose goal is to estimate the value function of a given target policy from data generated by a behavior policy. Because of high-dimensional state spaces, a standard approach is to estimate the value function with a linear function instead of a table (Sutton and Barto 2018). There is very little literature studying Expected Sarsa(λ) with function approximation for off-policy learning. To the best of our knowledge, Sutton and Barto (2018) (Section 12.9) first extend the off-line estimate (multi-step bootstrapping) to Expected Sarsa(λ) with linear function approximation. Unfortunately, as Sutton and Barto (2018) point out intuitively, this off-line approach may be unstable, i.e., their way of extending Expected Sarsa(λ) with linear function approximation still lacks a provable convergence guarantee, which is undesirable for RL. It is critical to identify the inherent cause of this instability in Expected Sarsa(λ), which not only complements existing off-policy learning methods but also provides inspiration for designing a stable algorithm. Thus, extending Expected Sarsa(λ) with linear function approximation for off-policy evaluation is a fundamental theoretical topic in RL, including: 1) how to characterize the instability of off-line Expected Sarsa(λ) with linear function approximation; 2) how to derive a convergent algorithm; 3) what convergence rate Expected Sarsa(λ) with linear function approximation can reach. We focus on these questions in this paper.

### Our Main Works

To address the above problems, we propose Theorem 1, which gives a necessary and sufficient condition for the stability of the off-line update of Expected Sarsa(λ) with linear function approximation. Theorem 1 requires the key matrix (defined in (10)) to have eigenvalues with negative real parts.
Unfortunately, due to the discrepancy between the behavior policy and the target policy, the off-line Expected Sarsa(λ) suggested by Sutton and Barto (2018) may not satisfy the condition of Theorem 1, i.e., their scheme may be unstable. We then use a classic counterexample to verify that this instability indeed arises in the off-line update of Expected Sarsa(λ) with linear function approximation; see Example 1. Furthermore, to obtain a stable algorithm, we derive an on-line Gradient Expected Sarsa(λ) (GES(λ)) algorithm. Theorem 2 shows that the proposed GES(λ) learns the optimal solution at a linear convergence rate under a true gradient. Finally, inspired by Srikant and Ying (2019), Wang et al. (2019), and Gupta et al. (2019), we develop a Lyapunov-function technique to establish Theorem 3, which characterizes the relationship between the finite-time performance of GES(λ) and the step-size. The result shows that the upper bound on the error consists of two parts: the first depends on both the step-size and the number of samples and decays geometrically as the number of samples increases, while the second is determined only by the step-size and is independent of the number of samples. Additionally, the technique used to prove Theorem 3 can potentially be generalized to other GTD algorithms.

### Notations

We use $\mathrm{Spec}(A)$ to denote the eigenvalues of a matrix $A\in\mathbb{C}^{p\times p}$, i.e., $\mathrm{Spec}(A)=\{\lambda_1,\ldots,\lambda_p\}$, where each $\lambda_i$ is a root of the characteristic polynomial $p(\lambda)=\det(A-\lambda I)$. We use $\mathbb{C}^{-}$ to denote the set of complex numbers with negative real parts, i.e., $\mathbb{C}^{-}:=\{c\in\mathbb{C}:\mathrm{Re}(c)<0\}$. Let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ be the minimum and maximum eigenvalues of the matrix $A$, respectively. We use $\|A\|_{\mathrm{op}}$ to denote the operator norm of a matrix $A$; furthermore, if $A$ is a symmetric real matrix, then $\|A\|_{\mathrm{op}}=\max_{1\le i\le p}\{|\lambda_i|\}$. $\kappa(A)=\frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}$ is the condition number of $A$. We write $A\succ 0$ for a positive definite matrix $A$. For a function $f(x):\mathbb{R}^{p}\to\mathbb{R}$, $\nabla^{2}f(x)$ denotes its Hessian matrix, and its convex conjugate $f^{*}(y):\mathbb{R}^{p}\to\mathbb{R}$ is defined as $f^{*}(y)=\sup_{x\in\mathbb{R}^{p}}\{y^{\top}x-f(x)\}$.

**Fact 1** ((Rockafellar 1970; Kakade et al. 2009)). Let $f(\cdot)$ be $\alpha$-strongly convex and $\beta$-smooth, i.e., $f(\cdot)-\frac{\alpha}{2}\|\cdot\|^{2}_{2}$ is convex and $\|\nabla f(u)-\nabla f(v)\|\le\beta\|u-v\|$. If $0<\alpha\le\beta$, then the following hold: (I) $f^{*}$ is $\frac{1}{\alpha}$-smooth and $\frac{1}{\beta}$-strongly convex; (II) $\nabla f^{*}=(\nabla f)^{-1}$ and $\nabla f=(\nabla f^{*})^{-1}$.

## Preliminary

Reinforcement learning (RL) is formalized as a Markov decision process (MDP), described by the tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},P,R,\gamma)$: $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space; $P:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to[0,1]$, where $p^{a}_{ss'}=P(S_t=s'\mid S_{t-1}=s,A_{t-1}=a)$ is the probability of transitioning from $s$ to $s'$ under action $a$; $R(\cdot,\cdot):\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is the expected reward function; $\gamma\in(0,1)$ is the discount factor. A policy $\pi$ assigns to each state a probability distribution over $\mathcal{A}$; we use $\pi(a|s)$ to denote the probability of playing $a$ in state $s$. Let $\{S_t,A_t,R_{t+1}\}_{t\ge0}$ be generated by a given policy $\pi$; its state-action value function is $q^{\pi}(s,a)=\mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty}\gamma^{t}R_{t+1}\mid S_0=s,A_0=a\big]$, where $\mathbb{E}_{\pi}[\cdot\mid\cdot]$ is the conditional expectation over actions selected according to $\pi$. Let $\mathcal{B}^{\pi}:\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}\to\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ be the Bellman operator:
$$\mathcal{B}^{\pi}:q\mapsto R^{\pi}+\gamma P^{\pi}q,\qquad(1)$$
where $P^{\pi}\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|\times|\mathcal{S}||\mathcal{A}|}$ and $R^{\pi}\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, with entries
$$[P^{\pi}]_{(s,a),(s',a')}=p^{a}_{ss'}\,\pi(a'|s'),\qquad [R^{\pi}]_{(s,a)}=R(s,a).$$
It is well known that $q^{\pi}$ is the unique fixed point of $\mathcal{B}^{\pi}$, i.e., $\mathcal{B}^{\pi}q^{\pi}=q^{\pi}$, which is known as the Bellman equation.
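To make the matrix notation above concrete, here is a minimal sketch (not from the paper) that builds $P^{\pi}$ and $R^{\pi}$ for a toy two-state, two-action MDP with made-up numbers and checks the Bellman equation $\mathcal{B}^{\pi}q^{\pi}=q^{\pi}$ numerically.

```python
import numpy as np

# Toy MDP (all numbers are illustrative assumptions, not the paper's example).
n_s, n_a = 2, 2
gamma = 0.9

P = np.zeros((n_s, n_a, n_s))            # P[s, a, s'] = p(s' | s, a)
P[0, 0], P[0, 1] = [1.0, 0.0], [0.0, 1.0]
P[1, 0], P[1, 1] = [1.0, 0.0], [0.0, 1.0]
R = np.array([[0.0, 1.0], [0.5, 0.0]])   # R[s, a] = expected reward
pi = np.array([[0.4, 0.6], [0.7, 0.3]])  # pi[s, a] = pi(a | s), target policy

# State-action transition matrix: [P^pi]_{(s,a),(s',a')} = p(s'|s,a) * pi(a'|s').
P_pi = np.einsum("sat,tb->satb", P, pi).reshape(n_s * n_a, n_s * n_a)
R_vec = R.reshape(n_s * n_a)

def bellman_op(q):
    """B^pi q = R^pi + gamma * P^pi q, cf. Eq. (1)."""
    return R_vec + gamma * P_pi @ q

# q^pi is the unique fixed point of B^pi: q^pi = (I - gamma P^pi)^{-1} R^pi.
q_pi = np.linalg.solve(np.eye(n_s * n_a) - gamma * P_pi, R_vec)
assert np.allclose(bellman_op(q_pi), q_pi)   # Bellman equation holds
```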
### Off-Policy Evaluation

Let us consider the following trajectory $\mathcal{T}$ generated by a policy $\mu$: $\mathcal{T}=\{S_0=s,A_0=a,\ldots,S_t,A_t,R_{t+1},\ldots\}$, where $A_t\sim\mu(\cdot|S_t)$ and $S_{t+1}\sim P(\cdot|S_t,A_t)$. In RL, the task of off-policy evaluation is to estimate the value function of a target policy $\pi$ via data generated by another policy $\mu$ (called the behavior policy), where $\mu\ne\pi$.

**Assumption 1.** The Markov chain induced by the behavior policy $\mu$ is ergodic, i.e., there exists a stationary distribution $\xi(\cdot,\cdot)$ over $\mathcal{S}\times\mathcal{A}$ such that, for any $(S_0,A_0)\in\mathcal{S}\times\mathcal{A}$,
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n}P(S_k=s,A_k=a\mid S_0,A_0)=\xi(s,a)>0.\qquad(2)$$

The ergodicity of the behavior policy $\mu$ is a standard assumption in off-policy learning (Bertsekas 2012), and it implies that each state-action pair can be visited under $\mu$. In this paper, we use $\Xi$ to denote the diagonal matrix whose diagonal entries are $\xi(s,a)$, i.e., $\Xi=\mathrm{diag}\{\ldots,\xi(s,a),\ldots\}$.

### Temporal Difference (TD) Learning

TD learning updates the value function as follows, for $t\ge0$:
$$Q(S_t,A_t)\leftarrow Q(S_t,A_t)+\alpha_t\delta_t,\qquad(3)$$
where $Q(\cdot,\cdot)$ is an estimate of the state-action value function, $\alpha_t$ is the step-size, and $\delta_t$ is the TD error. Writing $Q_t=Q(S_t,A_t)$, if $\delta_t$ is the expected TD error
$$\delta^{ES}_{t}=R_{t+1}+\gamma\sum_{a\in\mathcal{A}}\pi(a|S_{t+1})Q(S_{t+1},a)-Q_t,\qquad(4)$$
then update (3) is Expected Sarsa.

**Expected Sarsa(λ).** Sutton and Barto (2018)¹ propose a multi-step TD learning method that extends Expected Sarsa to a λ-return version: for each $t\ge0$,
$$G^{\lambda}_{t}=Q_t+\sum_{k=t}^{\infty}(\gamma\lambda)^{k-t}\prod_{i=t+1}^{k}\frac{\pi(A_i|S_i)}{\mu(A_i|S_i)}\,\delta^{ES}_{k},\qquad(5)$$
where $\delta^{ES}_{k}$ is the expected TD error. For convenience, we write $\rho_i=\frac{\pi(A_i|S_i)}{\mu(A_i|S_i)}$ and $\rho_{t:k}=\prod_{i=t}^{k}\rho_i$, with the convention that an empty product equals one (so $\rho_{t+1:t}=1$). Finally, we introduce the λ-operator $\mathcal{B}^{\pi}_{\lambda}$, which gives a high-level view of the iteration (5):
$$\mathcal{B}^{\pi}_{\lambda}:q\mapsto q+\mathbb{E}_{\mu}\Big[\sum_{k=t}^{\infty}(\lambda\gamma)^{k-t}\delta^{ES}_{k}\,\rho_{t+1:k}\Big]\qquad(6)$$
$$=q+(I-\lambda\gamma P^{\pi})^{-1}(\mathcal{B}^{\pi}q-q),\qquad(7)$$
where $\mathcal{B}^{\pi}$ is defined in (1). Due to space limitations, we provide the derivation of (7) from (6) in Appendix A.2.

¹It is noteworthy that the λ-return version of Expected Sarsa in Section 12.9 of Sutton and Barto (2018) is limited to the linear function approximation case; Eq. (5) extends it to the general case.

### Linear Function Approximation

TD learning (3) requires a very large table to store the estimated value function $Q(\cdot,\cdot)$ when $|\mathcal{S}|$ is very large, which makes tabular TD learning prohibitively expensive for high-dimensional RL. We therefore use a parametric function $Q_{\theta}(\cdot,\cdot)$ to approximate $q^{\pi}(s,a)$, i.e., $q^{\pi}(s,a)\approx\phi^{\top}(s,a)\theta=:Q_{\theta}(s,a)$, where $\theta\in\mathbb{R}^{p}$ is the parameter to be learned, $\phi(s,a)=(\varphi_1(s,a),\varphi_2(s,a),\ldots,\varphi_p(s,a))^{\top}$, and each $\varphi_i:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$. Furthermore, $Q_{\theta}$ can be written in matrix form as $Q_{\theta}=\Phi\theta\approx q^{\pi}$, where $\Phi$ is the matrix whose rows are the state-action feature vectors $\phi^{\top}(s,a)$. In this paper, we mainly consider extending the λ-return of Expected Sarsa (5) with linear function approximation.

## Off-Line Gradient Expected Sarsa(λ)

In this section, we use a counterexample to show that extending Expected Sarsa(λ) with linear function approximation via the off-line estimate is unstable for off-policy learning.

### Off-Line Update

Sutton and Barto (2018) provide a way to extend (5) with linear function approximation as follows:
$$\theta_{t+1}=\theta_t+\alpha_t\big(G^{\lambda}_{t}-Q_{\theta}(S_t,A_t)\big)\nabla Q_{\theta}(S_t,A_t)=\theta_t+\alpha_t\sum_{k=t}^{\infty}(\gamma\lambda)^{k-t}\delta^{ES}_{k,\theta}\,\rho_{t+1:k}\,\phi_t,\qquad(8)$$
where $\alpha_t$ is the step-size, $\phi_t=:\phi(S_t,A_t)$, and $\delta^{ES}_{k,\theta}=R_{k+1}+\gamma\theta^{\top}_{t}\mathbb{E}_{\pi}[\phi(S_{k+1},\cdot)]-\theta^{\top}_{t}\phi_k$. Furthermore, we can write the expected update of (8) as
$$\mathbb{E}_{\mu}[\theta_{t+1}]=\theta_t+\alpha_t(A\theta_t+b),\qquad(9)$$
where
$$A=\mathbb{E}_{\mu}\Big[\sum_{k=t}^{\infty}(\gamma\lambda)^{k-t}\rho_{t+1:k}\,\phi_t\big(\gamma\mathbb{E}_{\pi}[\phi(S_{k+1},\cdot)]-\phi_k\big)^{\top}\Big]=\Phi^{\top}\Xi(I-\gamma\lambda P^{\pi})^{-1}(\gamma P^{\pi}-I)\Phi,\qquad(10)$$
$$b=\mathbb{E}_{\mu}\Big[\sum_{k=t}^{\infty}(\gamma\lambda)^{k-t}\rho_{t+1:k}\,\phi_t R_{k+1}\Big]=\Phi^{\top}\Xi(I-\gamma\lambda P^{\pi})^{-1}R^{\pi}.\qquad(11)$$
If the iteration (9) converges to a certain point $\theta^{\star}$, then $\theta^{\star}$ satisfies the following linear system:
$$A\theta^{\star}+b=0.\qquad(12)$$
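When the model is known, the closed forms in (10)-(11) are straightforward to assemble. The helper below is our own sketch (not the authors' code) and simply evaluates those two expressions for given $\Phi$, $\xi$, $P^{\pi}$, and reward vector.

```python
import numpy as np

def offline_key_matrices(Phi, xi, P_pi, R, gamma, lam):
    """Closed-form A and b of Eqs. (10)-(11) for a known model.

    Phi  : (n, p) feature matrix, one row per state-action pair
    xi   : (n,)   stationary distribution of the behavior policy
    P_pi : (n, n) state-action transition matrix under the target policy
    R    : (n,)   expected rewards
    """
    n = Phi.shape[0]
    resolvent = np.linalg.inv(np.eye(n) - gamma * lam * P_pi)
    A = Phi.T @ np.diag(xi) @ resolvent @ (gamma * P_pi - np.eye(n)) @ Phi
    b = Phi.T @ np.diag(xi) @ resolvent @ R
    return A, b

# The expected off-line update (9) is theta <- theta + alpha * (A @ theta + b),
# and its TD fixed point (12) is theta_star = np.linalg.solve(A, -b)
# whenever A is non-singular.
```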
Such a $\theta^{\star}$ satisfying (12) is called the TD fixed point.

### Stability Criteria

Following Sutton et al. (2016) and Ghosh and Bellemare (2020), we formalize the stability of the iteration (9) in the next definition.

**Definition 1 (Stability).** The update rule (9) is stable if $\theta_k$ converges to the point $\theta^{\star}$ satisfying (12) for any initial $\theta_0$.

**Theorem 1 (Stability Criteria).** Under Assumption 1, the off-line update (9) is stable if and only if all eigenvalues of the matrix $A$ in (10) have negative real parts, i.e.,
$$\mathrm{Spec}(A)\subseteq\mathbb{C}^{-}.\qquad(13)$$

We provide its proof in Appendix B. Theorem 1 gives a sufficient and necessary condition (13) that guarantees the stability of iteration (8). Unfortunately, for off-policy learning, the matrix $A$ in (10) cannot guarantee that condition (13) holds, which implies that the iteration (8) may be divergent and unstable. We now use the following example (Touati et al. 2018) to illustrate the instability of iteration (8).

Figure 1: Counterexample, two-state MDP: behavior policy $\mu(\mathrm{right}\mid\cdot)=0.5$ and target policy $\pi(\mathrm{right}\mid\cdot)=1$.

**Example 1.** Consider the MDP in Figure 1. We assign the features $\{(1,0)^{\top},(2,0)^{\top},(0,1)^{\top},(0,2)^{\top}\}$ to the state-action pairs $\{(s_1,\mathrm{right}),(s_2,\mathrm{right}),(s_1,\mathrm{left}),(s_2,\mathrm{left})\}$. From the dynamics shown in Figure 1, we have
$$P^{\pi}=\begin{pmatrix}0&1&0&0\\0&1&0&0\\1&0&0&0\\1&0&0&0\end{pmatrix},\qquad \Phi=\begin{pmatrix}1&0\\2&0\\0&1\\0&2\end{pmatrix}.$$
Then, according to (10), $A=\Phi^{\top}\Xi(I-\gamma\lambda P^{\pi})^{-1}(\gamma P^{\pi}-I)\Phi$ is a lower-triangular matrix whose diagonal entries, and hence eigenvalues, are $\frac{6\gamma-\gamma\lambda-5}{2(1-\gamma\lambda)}$ and $-\frac{5}{2}$. For any initial $\theta_0=(\theta_{0,1},\theta_{0,2})^{\top}$, let $\mathbb{E}[\theta_{t+1}]=:(\theta_{t+1,1},\theta_{t+1,2})^{\top}$ denote the expectation of iteration (8); according to (9), the first component of $\mathbb{E}[\theta_{t+1}\mid\theta_t]$ is
$$\theta_{t+1,1}=\theta_{0,1}\prod_{i=0}^{t}\Big(1+\alpha_i\,\frac{6\gamma-\gamma\lambda-5}{2(1-\gamma\lambda)}\Big).$$
For any $\lambda\in(0,1)$, if $\gamma\in\big(\frac{5}{6-\lambda},1\big)$, then $\frac{6\gamma-\gamma\lambda-5}{2(1-\gamma\lambda)}$ is a positive scalar, which implies that $A$ cannot have all of its eigenvalues in $\mathbb{C}^{-}$. Furthermore, if the step-sizes satisfy $\sum_{i\ge0}\alpha_i=\infty$, then²
$$|\theta_{t+1,1}|=|\theta_{0,1}|\prod_{i=0}^{t}\Big(1+\alpha_i\,\frac{6\gamma-\gamma\lambda-5}{2(1-\gamma\lambda)}\Big)\to\infty,\qquad(14)$$
which implies that extending Expected Sarsa(λ) with linear function approximation via the off-line estimate (8) is unstable for off-policy learning.

²Eq. (14) is a direct consequence of the following standard fact from calculus: let $p_i=1+a_i$ with $a_i>0$; if $\sum_{i=1}^{\infty}a_i=+\infty$, then $\prod_{i=1}^{\infty}p_i=\prod_{i=1}^{\infty}(1+a_i)=+\infty$.
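The instability in Example 1 can be checked numerically. The sketch below is our own construction: the stationary distribution of $\mu$ is assumed uniform over the four state-action pairs (so the exact eigenvalues differ from the displayed ones by a positive scaling, without changing their signs), and rewards are assumed to be zero so that $b=0$.

```python
import numpy as np

gamma, lam, alpha = 0.99, 0.9, 0.1       # gamma > 5/(6 - lambda), so unstable
Phi = np.array([[1., 0.], [2., 0.], [0., 1.], [0., 2.]])
# Rows/columns ordered as {(s1,right),(s2,right),(s1,left),(s2,left)}:
# "right" leads to s2, "left" leads to s1, and pi always picks "right".
P_pi = np.array([[0., 1., 0., 0.],
                 [0., 1., 0., 0.],
                 [1., 0., 0., 0.],
                 [1., 0., 0., 0.]])
Xi = np.diag([0.25] * 4)                 # assumed uniform stationary distribution

A = Phi.T @ Xi @ np.linalg.inv(np.eye(4) - gamma * lam * P_pi) \
        @ (gamma * P_pi - np.eye(4)) @ Phi
print(np.linalg.eigvals(A))              # one eigenvalue has a positive real part

theta = np.array([1.0, 1.0])             # expected off-line update (9) with b = 0
for _ in range(2000):                    # (zero rewards assumed)
    theta = theta + alpha * (A @ theta)
print(theta)                             # the first component blows up
```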
## On-Line Gradient Expected Sarsa(λ)

The above discussion of the instability for off-policy learning shows that we should abandon the off-line update (8). In this section, we provide a convergent on-line algorithm, Gradient Expected Sarsa(λ) (GES(λ)), based on the popular TD-fixed-point method. The TD-fixed-point method (Sutton et al. 2009a; Bertsekas 2011; Dann et al. 2014) is widely used for policy evaluation and focuses on finding a value function satisfying
$$\Phi\theta=\Pi\mathcal{B}^{\pi}_{\lambda}\Phi\theta,\qquad(15)$$
where $\Pi=\Phi(\Phi^{\top}\Xi\Phi)^{-1}\Phi^{\top}\Xi$. It has been shown that if the projected Bellman operator $\Pi\mathcal{B}^{\pi}_{\lambda}$ has a fixed point $\theta^{\star}$, then it is unique (Lagoudakis and Parr 2003; Bertsekas 2011), and such a fixed point $\theta^{\star}$ also satisfies the linear system (12). Instead of value iteration with the projected Bellman operator $\Pi\mathcal{B}^{\pi}_{\lambda}$, we derive the algorithm from the mean square projected Bellman equation (MSPBE) (Sutton et al. 2009a):
$$\min_{\theta}\ \mathrm{MSPBE}(\theta,\lambda)=:\min_{\theta}\ \frac{1}{2}\big\|\Phi\theta-\Pi\mathcal{B}^{\pi}_{\lambda}(\Phi\theta)\big\|^{2}_{\Xi}=\min_{\theta}\ \frac{1}{2}\|A\theta+b\|^{2}_{M^{-1}},\qquad(16)$$
where $\|x\|^{2}_{\Xi}=x^{\top}\Xi x$ is a weighted norm and $M=\Phi^{\top}\Xi\Phi$. We provide the derivation of (16) in Appendix C.1.

Since computing the matrix inverse $M^{-1}$ is very expensive, it is too costly to apply the stochastic gradient method to problem (16) directly. Let
$$g(\omega)=\frac{1}{2}\|\omega\|^{2}_{M}-b^{\top}\omega,\qquad(17)$$
$$\Psi(\theta,\omega)=(A\theta+b)^{\top}\omega-\frac{1}{2}\|\omega\|^{2}_{M}=\theta^{\top}A^{\top}\omega-g(\omega).$$
According to Liu et al. (2015), the original problem (16) is equivalent to the convex-concave saddle-point problem
$$\min_{\theta}\max_{\omega}\ \{\Psi(\theta,\omega)\}.\qquad(18)$$

**Proposition 1.** If $(\theta^{\star},\omega^{\star})$ is the solution of problem (18), then $\theta^{\star}$ is the solution of the original problem (16), i.e., $\theta^{\star}=\arg\min_{\theta}\mathrm{MSPBE}(\theta,\lambda)$.

We provide the proof of Proposition 1 in Appendix C.2. Proposition 1 shows that the solution of (16) is contained in problem (18). A gradient update is a natural way to solve problem (18) (ascending in $\omega$ and descending in $\theta$):
$$\omega_{t+1}=\omega_t+\beta_t(A\theta_t+b-M\omega_t),\qquad(19)$$
$$\theta_{t+1}=\theta_t-\alpha_t A^{\top}\omega_t,\qquad(20)$$
where $\alpha_t,\beta_t$ are step-sizes, $t\ge0$. However, since $A$, $b$, and $M$ are expectations and the transition probabilities are not available in practice, a practical approach is to use unbiased estimators of them. Let $e_{-1}=0$, $\rho_t=\frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}$, $e_t=\lambda\gamma\rho_t e_{t-1}+\phi_t$, $\hat b_t=R_{t+1}e_t$, $\hat A_t=e_t\big(\gamma\mathbb{E}_{\pi}[\phi(S_{t+1},\cdot)]-\phi_t\big)^{\top}$, and $\hat M_t=\phi_t\phi^{\top}_{t}$. According to Theorem 9 of Maei (2011), we have
$$\mathbb{E}_{\mu}[\hat A_t]=A,\qquad\mathbb{E}_{\mu}[\hat b_t]=b,\qquad\mathbb{E}_{\mu}[\hat M_t]=M.\qquad(21)$$
Replacing the expectations in (19) and (20) with the corresponding unbiased estimates, we define the stochastic on-line implementation of (19) and (20) as
$$\omega_{t+1}=\omega_t+\beta_t(\hat A_t\theta_t+\hat b_t-\hat M_t\omega_t),\qquad(22)$$
$$\theta_{t+1}=\theta_t-\alpha_t\hat A^{\top}_{t}\omega_t.\qquad(23)$$
We provide more details in Algorithm 1.

**Algorithm 1: Gradient Expected Sarsa(λ) (GES(λ))**
1. Initialization: $\omega_0=0$, $\theta_0=0$, $\alpha_0>0$, $\beta_0>0$, $T\in\mathbb{N}$.
2. $e_{-1}=0$
3. for $t=0$ to $T$ do
4. &nbsp;&nbsp;Observe $\{S_t,A_t,R_{t+1},S_{t+1},A_{t+1}\}\sim\mu$
5. &nbsp;&nbsp;$\rho_t=\frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}$
6. &nbsp;&nbsp;$e_t=\lambda\gamma\rho_t e_{t-1}+\phi_t$
7. &nbsp;&nbsp;$\delta_t=R_{t+1}+\gamma\theta^{\top}_{t}\mathbb{E}_{\pi}[\phi(S_{t+1},\cdot)]-\theta^{\top}_{t}\phi_t$
8. &nbsp;&nbsp;$\omega_{t+1}=\omega_t+\beta_t(e_t\delta_t-\phi_t\phi^{\top}_{t}\omega_t)$
9. &nbsp;&nbsp;$\theta_{t+1}=\theta_t-\alpha_t\big(\gamma\mathbb{E}_{\pi}[\phi(S_{t+1},\cdot)]-\phi_t\big)e^{\top}_{t}\omega_t$
10. end for
11. Output: $\{\theta_t,\omega_t\}^{T}_{t=1}$
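For concreteness, here is a minimal sketch of the update loop in lines 5-9 of Algorithm 1. The transition generator `run` and the expected next feature $\mathbb{E}_{\pi}[\phi(S_{t+1},\cdot)]$ are placeholders that the caller must supply, and the function name and signature are ours, not the paper's.

```python
import numpy as np

def ges_lambda(run, T, p, gamma=0.99, lam=0.9, alpha=0.01, beta=0.02):
    """Minimal sketch of the GES(lambda) update loop (lines 5-9 of Algorithm 1).

    `run` is assumed to yield transitions sampled under the behavior policy as
    tuples (phi_t, rho_t, reward, expected_next_phi), where
      phi_t             = phi(S_t, A_t)                       shape (p,)
      rho_t             = pi(A_t | S_t) / mu(A_t | S_t)
      expected_next_phi = E_pi[ phi(S_{t+1}, .) ]             shape (p,)
    """
    theta, omega, e = np.zeros(p), np.zeros(p), np.zeros(p)   # e_{-1} = 0
    for _, (phi_t, rho_t, r, exp_phi_next) in zip(range(T), run):
        e = lam * gamma * rho_t * e + phi_t                              # line 6
        delta = r + gamma * theta @ exp_phi_next - theta @ phi_t         # line 7
        omega = omega + beta * (e * delta - phi_t * (phi_t @ omega))     # line 8
        theta = theta - alpha * (gamma * exp_phi_next - phi_t) * (e @ omega)  # line 9
    return theta, omega
```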
## Finite-Time Performance Analysis

In this section, we focus on the finite-time performance of GES(λ). Theorem 2 shows that GES(λ) with a true gradient converges at a linear rate under a concrete step-size. Furthermore, to investigate how the step-size influences the finite-time performance, we establish Theorem 3. We develop a Lyapunov-function technique (Gupta et al. 2019) to prove Theorem 3; before presenting the main result, we provide the motivation and some necessary details of the Lyapunov function. Throughout this paper, we make two additional standard assumptions, which are widely used in reinforcement learning (Wang et al. 2017; Bhandari et al. 2018; Xu et al. 2019).

**Assumption 2 (Boundedness of Feature Map, Reward).** The features $\{\phi_t\}_{t\ge0}$ are uniformly bounded by $\phi_{\max}$. The reward function is uniformly bounded by $R_{\max}$. The importance sampling ratio $\rho_t$ is uniformly bounded by $\rho_{\max}$.

**Assumption 3 (Solvability of the Problem).** The matrix $A$ is non-singular and $\mathrm{rank}(\Phi)=p$.

As claimed by Xu et al. (2019), Assumption 2 can be ensured by normalizing the feature maps $\{\phi_t\}_{t\ge1}$ and holds when $\mu(\cdot|s)$ is non-degenerate for all $s\in\mathcal{S}$. Besides, Assumption 2 implies the boundedness of the estimators $\hat A_t$, $\hat M_t$, and $\hat b_t$; due to space limitations, we provide more details and discussion in Remark 5 (see Appendix D.1). Assumption 3 requires the non-singularity of the matrix $A$, which implies that the optimal parameter $\theta^{\star}=-A^{-1}b$ is well defined. The feature matrix $\Phi$ having linearly independent columns implies that the matrix $M$ is non-singular.

### Linear Convergence Rate under the True Gradient

We consider the first-order optimality conditions of problem (18): the optimal solution $(\theta^{\star},\omega^{\star})$ satisfies
$$\nabla_{\theta}\Psi(\theta^{\star},\omega^{\star})=A^{\top}\omega^{\star}=0,\qquad\nabla_{\omega}\Psi(\theta^{\star},\omega^{\star})=-\nabla g(\omega^{\star})+A\theta^{\star}=0.\qquad(24)$$
According to Fact 1 and condition (24), we have $\omega^{\star}=(\nabla g)^{-1}(A\theta^{\star})=\nabla g^{*}(A\theta^{\star})$, which implies that $\omega^{\star}$ can be represented by $\theta^{\star}$; thus, we mainly focus on the performance of $\{\theta_t\}_{t\ge1}$.

**Theorem 2.** Let $\{(\theta_t,\omega_t)\}_{t\ge0}$ be generated by (19)-(20). Let $\tilde\theta_t=\|\theta_t-\theta^{\star}\|^{2}_{2}$, $\tilde\omega_t=\|\omega_t-\nabla g^{*}(A\theta_t)\|^{2}_{2}$, $\nu=\frac{2\kappa^{2}(A)\kappa(M)\lambda_{\max}(A)}{\lambda_{\min}(M)}$, and $D_t=\nu\tilde\theta_t+\tilde\omega_t$. If we choose the step-sizes
$$\alpha=\frac{\lambda_{\min}(M)}{\big(\lambda_{\max}(M)+\lambda_{\min}(M)\big)\Big(\frac{\lambda^{2}_{\max}(A)}{\lambda_{\min}(M)}+\nu\lambda_{\max}(A)\Big)},\qquad\beta=\frac{2}{\lambda_{\max}(M)+\lambda_{\min}(M)},$$
then, under Assumptions 1-3, we have
$$\mathbb{E}[D_{t+1}]\le\Big(1-\frac{1}{12\kappa^{3}(M)\kappa^{4}(A)}\Big)\mathbb{E}[D_t].\qquad(25)$$
Furthermore,
$$\mathbb{E}[\|\theta_t-\theta^{\star}\|^{2}_{2}]\le\frac{1}{\nu}\Big(1-\frac{1}{12\kappa^{3}(M)\kappa^{4}(A)}\Big)^{t}\mathbb{E}[D_0].\qquad(26)$$

**Remark 1.** We provide its proof in Appendix C.3. Recall that $D_t=\nu\tilde\theta_t+\tilde\omega_t\ge\nu\tilde\theta_t$, which implies $\mathbb{E}[\|\theta_t-\theta^{\star}\|^{2}_{2}]\le\frac{1}{\nu}\mathbb{E}[D_t]$. The factor $\big(1-\frac{1}{12\kappa^{3}(M)\kappa^{4}(A)}\big)\in(0,1)$ implies that GES(λ) produces a sequence $\{\theta_t\}_{t\ge0}$ that converges to the optimal solution at a linear convergence rate. After some simple algebra, with a computational cost of order $O\big(\max\{1,\lambda_{\max}(A)\}\,\kappa^{3}(M)\kappa^{4}(A)\log\frac{1}{\delta}\big)$, the output of Algorithm 1 is close to $(\theta^{\star},\omega^{\star})$ in the sense that $\mathbb{E}[\|\theta_t-\theta^{\star}\|^{2}_{2}]\le\delta^{2}$ and $\mathbb{E}[\|\omega_t-\omega^{\star}\|^{2}_{2}]\le\delta^{2}$.

**Remark 2.** Theorem 2 provides a concrete step-size that depends on some unknown parameters. As suggested by Du et al. (2017), Touati et al. (2018), and Voloshin et al. (2019), we can use the Monte Carlo method to estimate these unknown parameters.
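The behavior described by Theorem 2 is easy to visualize in the true-gradient setting. The sketch below runs the deterministic iteration (19)-(20) on randomly generated stand-ins for $A$, $b$, and $M$ (our assumptions, not quantities from the paper), with hand-picked small constant step-sizes rather than the exact values prescribed by Theorem 2, and prints the geometrically decaying error.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
# Random stand-ins (assumptions, not the paper's quantities):
A = rng.normal(size=(p, p)) / p + 2 * np.eye(p)      # non-singular A
B_ = rng.normal(size=(p, p)) / np.sqrt(p)
M = B_ @ B_.T + np.eye(p)                            # M = Phi^T Xi Phi is PD
b = rng.normal(size=p)
theta_star = np.linalg.solve(A, -b)                  # A theta* + b = 0, Eq. (12)

alpha, beta = 0.01, 0.1   # hand-picked constants, not Theorem 2's exact values
theta, omega = np.zeros(p), np.zeros(p)
for t in range(10001):
    omega = omega + beta * (A @ theta + b - M @ omega)   # Eq. (19)
    theta = theta - alpha * (A.T @ omega)                # Eq. (20)
    if t % 2000 == 0:
        print(t, np.linalg.norm(theta - theta_star))     # shrinks geometrically
```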
### A Further Analysis of the Lyapunov Function for the Stochastic Gradient

In this section, we propose Theorem 3, which characterizes the relationship between the performance of GES(λ) and the step-size. The proof of Theorem 3 involves a novel Lyapunov-function technique; we start by presenting the motivation behind this Lyapunov function. Let $H=A^{\top}M^{-1}A$ and $L=2A$. Let $Q_1$ and $Q_2$ be the solutions of the following Lyapunov equations:
$$H^{\top}Q_1+Q_1H=I,\qquad M^{\top}Q_2+Q_2M=I.\qquad(27)$$
Since $-H$ and $-M$ are Hurwitz matrices, the solutions of the linear system (27) always exist (Lakshminarayanan et al. 2018; Srikant and Ying 2019). Let $Q$ be the block-diagonal matrix
$$Q=\frac{1}{p_1+p_2}\,\mathrm{diag}\big(p_1Q_1,\ p_2Q_2\big),\qquad p_1=\|Q_1A\|_{\mathrm{op}}\|Q_1\|_{\mathrm{op}},\quad p_2=\|Q_2M^{-1}AL\|_{\mathrm{op}}\|Q_2\|_{\mathrm{op}}.$$
Finally, define $\varrho_t$ and $z_t$ as
$$\varrho_t=\omega_t-M^{-1}A\theta_t,\qquad z_t=(\theta_t-\theta^{\star},\ \varrho_t-\varrho^{\star})^{\top},\qquad(28)$$
where $\varrho^{\star}=\omega^{\star}-M^{-1}A\theta^{\star}$. The Lyapunov function $L(z_t)$ is
$$L(z_t)=z^{\top}_{t}Qz_t.\qquad(29)$$

**Motivation of the Lyapunov function.** We consider the expected one-step differences of the iteration (22)-(23):
$$\frac{1}{\alpha}\mathbb{E}[\theta_{t+1}-\theta_t\mid\mathcal{Y}_{t-\tau}]=-\mathbb{E}[\hat A^{\top}_{t}\omega_t\mid\mathcal{Y}_{t-\tau}],\qquad(30)$$
$$\frac{1}{\beta}\mathbb{E}[\omega_{t+1}-\omega_t\mid\mathcal{Y}_{t-\tau}]=\mathbb{E}[\hat A_t\theta_t+\hat b_t-\hat M_t\omega_t\mid\mathcal{Y}_{t-\tau}],\qquad(31)$$
where $\mathcal{Y}_{t-\tau}=\{\theta_{t-\tau},\omega_{t-\tau},X_{t-\tau}\}$ and the sequence $X_t=:\{S_0,A_0,S_1,A_1,\ldots,S_t,A_t\}$ denotes the Markov chain generated by Algorithm 1. The expectation is conditioned on information sufficiently far in the past of the underlying Markov chain. Approximating the left-hand sides of (30)-(31) by derivatives, we obtain the ordinary differential equation (ODE)
$$\dot\theta(t)=-A^{\top}\omega(t),\qquad\frac{\alpha}{\beta}\,\dot\omega(t)=A\theta(t)+b-M\omega(t),\qquad(32)$$
where $\theta(t),\omega(t)\in\mathbb{R}^{p}$ in (32) are functions defined on the continuous time interval $(0,\infty)$. The update rule (22)-(23) can be viewed as a discretization of the ODE (32), which is known as a singular perturbation ODE (Chapter 11 of Khalil and Grizzle (2002)); our goal is to provide a non-asymptotic analysis of GES(λ) around the asymptotically stable equilibrium of the ODE (32). According to Khalil and Grizzle (2002), the following Lyapunov function $L(t)$ is widely used as a stability criterion for the ODE (32):
$$L(a,t)=a\big(\omega(t)-M^{-1}A\theta(t)\big)^{\top}Q_2\big(\omega(t)-M^{-1}A\theta(t)\big)+(1-a)\,\theta^{\top}(t)Q_1\theta(t),\qquad(33)$$
where $a\in(0,1)$. Our $L(z_t)$ in (29) can be seen as a discretization of $L(t)$ in (33) after a proper choice of $a$, which inspires the construction of the Lyapunov function $L(z_t)$.

**Lemma 1.** Under Assumptions 1-3, there exists a positive scalar $\tau$ such that, for all $t\ge\tau$,
$$\mathbb{E}[L(z_{t+1})]-\mathbb{E}[L(z_t)]\le-\alpha\big(\kappa_1-\beta\kappa_2\big)\mathbb{E}[L(z_t)]+2\alpha^{2}\zeta^{2}\lambda_{\max}(Q)\|\tilde c_{b}\|^{2},\qquad(34)$$
where the constants $\kappa_1$, $\kappa_2$, $\zeta=:C_1+\frac{\beta}{\alpha}C_2$, and $\tilde c_{b}$ are defined in Appendix D.

Finally, since $\mathbb{E}[\|z_t\|^{2}_{2}]\le(\lambda_{\min}(Q))^{-1}\mathbb{E}[L(z_t)]$, applying the result of (34) recursively yields Theorem 3.

**Theorem 3.** Let $\eta_1=4\zeta^{2}\tau^{2}\big(\|z_0\|_{2}+\|\tilde c_{b}\|\big)^{2}+\|z_0\|^{2}_{2}$ and $\eta_2=\frac{2\kappa(Q)\zeta^{2}\lambda_{\max}(Q)\|\tilde c_{b}\|^{2}}{\kappa_1-\beta\kappa_2}$. Under Assumptions 1-3, there exists a positive scalar $\tau$ such that, for all $t\ge\tau$,
$$\mathbb{E}[\|z_t\|^{2}_{2}]\le\alpha^{2}\eta_1\Big(1-\frac{\alpha(\kappa_1-\beta\kappa_2)}{\lambda_{\max}(Q)}\Big)^{t-\tau}+\alpha\eta_2.$$

**Remark 3.** Recall that $z_t=(\theta_t-\theta^{\star},\varrho_t-\varrho^{\star})^{\top}$; then $\mathbb{E}[\|\theta_t-\theta^{\star}\|^{2}_{2}]\le\mathbb{E}[\|z_t\|^{2}_{2}]$, which implies that the expected mean square error of $\theta_t-\theta^{\star}$ is also upper bounded by the result of Theorem 3. Furthermore, after a sufficient total computational cost, Algorithm 1 outputs a $\theta_t$ that is close to the optimal solution $\theta^{\star}$ in the sense that $\mathbb{E}[\|\theta_t-\theta^{\star}\|^{2}_{2}]\le O(\tau\delta)$.

**Remark 4.** Theorem 3 shows that the upper bound on the error consists of two parts: the first depends on both the step-size and the number of samples, and it decays geometrically as the number of iterations increases; the second is determined only by the step-size and is independent of the number of iterations.
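As a sanity check of the construction in (27)-(29), the sketch below solves the two Lyapunov equations with SciPy under our reading of (27) (the signs in (27) were reconstructed from context); $A$ and $M$ are random stand-ins, not the paper's matrices.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(1)
p = 4
A = rng.normal(size=(p, p)) / p + 2 * np.eye(p)     # assumed non-singular
B_ = rng.normal(size=(p, p)) / p
M = B_ @ B_.T + np.eye(p)                           # M = Phi^T Xi Phi, PD

H = A.T @ np.linalg.solve(M, A)
# scipy's solve_continuous_lyapunov(a, q) solves a X + X a^H = q.
Q1 = solve_continuous_lyapunov(H.T, np.eye(p))      # H^T Q1 + Q1 H = I
Q2 = solve_continuous_lyapunov(M.T, np.eye(p))      # M^T Q2 + Q2 M = I

# Both solutions are symmetric positive definite, so L(z) = z^T Q z with a
# block-diagonal Q built from Q1 and Q2 is a valid Lyapunov function.
for Qi in (Q1, Q2):
    assert np.all(np.linalg.eigvalsh((Qi + Qi.T) / 2) > 0)
```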
Table 1: Comparison of GTD-family algorithms over the performance measure $\mathbb{E}\|\theta_T-\theta^{\star}\|^{2}_{2}$.

| Algorithm | Step-size | Convergence Rate | TD Fixed Point |
|---|---|---|---|
| TD(0) (Nathaniel et al. 2015) | $\alpha_t=O(t^{-\eta})$, $\eta\in(0,1)$ | $O(1/T)$ | $\Phi^{\top}\Xi(\gamma P^{\mu}-I)\Phi\theta=-b$ |
| TD(0) (Dalal et al. 2018a) | $\sum_{t=1}^{\infty}\alpha_t=\infty$ | $O\big(e^{-\frac{\lambda}{2}T^{1-\eta}}+\frac{1}{T^{\eta}}\big)$ | $\Phi^{\top}\Xi(\gamma P^{\mu}-I)\Phi\theta=-b$ |
| TD(0) (Lakshminarayanan et al. 2018) | constant | $O(1/T)$ | $\Phi^{\top}\Xi(\gamma P^{\mu}-I)\Phi\theta=-b$ |
| GTD(0) (Dalal et al. 2018b) | $\sum_{t=1}^{\infty}\alpha_t=\infty$, $\beta_t$ | sublinear, never reaches $O(1/T)$ | $\Phi^{\top}\Xi(\gamma P^{\pi}-I)\Phi\theta=-b$ |
| GTD(0)/GTD2/TDC (Dalal et al. 2020) | $\alpha_t=\frac{1}{t^{\eta_1}}$, $\beta_t=\frac{1}{t^{\eta_2}}$, $0<\eta_2<\eta_1<1$ | similar to (Dalal et al. 2018b) | $\Phi^{\top}\Xi(\gamma P^{\pi}-I)\Phi\theta=-b$ |
| GTB(λ) (Touati et al. 2018) | $\alpha_t,\beta_t=O(1/\sqrt{t})$ | $O(1/T)$ | $\Phi^{\top}\Xi(I-\gamma\lambda P^{\mu})^{-1}(\gamma P^{\pi}-I)\Phi\theta=-b$ |
| SARSA (Zou et al. 2019) | $\alpha_t=O(\frac{1}{t})$ | $O\big(\frac{\log^{3}(T)}{T}\big)$ | $\Phi^{\top}\Xi(\gamma P^{\mu}-I)\Phi\theta=-b$ |
| TDC (Xu et al. 2019) | $\max\{\alpha_t\log(\frac{1}{\alpha_t}),\alpha_t\}\le\min\{\|\theta_0-\theta^{\star}\|^{2}_{2}t^{-1},C\}$ | linear | $\Phi^{\top}\Xi(\gamma P^{\pi}-I)\Phi\theta=-b$ |
| TDC (Kaledin et al. 2020) | $\alpha_t=\frac{1}{t^{v}}$, $\beta_t=\frac{1}{t}$, $v\in(0,1)$ | $O(1/T)$ | $\Phi^{\top}\Xi(\gamma P^{\pi}-I)\Phi\theta=-b$ |
| GES(λ) (Theorem 3) | $\alpha=$ constant | $O(\alpha)$ | $\Phi^{\top}\Xi(I-\gamma\lambda P^{\pi})^{-1}(\gamma P^{\pi}-I)\Phi\theta=-b$ |

## Related Works

In this section, we review existing finite-time performance results for GTD algorithms with respect to $\|\theta_T-\theta^{\star}\|^{2}_{2}$. Although the asymptotic convergence of the GTD family has been established in (Sutton et al. 2009a,b; Maei 2011), such results hold only in the limit as the number of iterations increases to infinity, and they provide no information about the convergence rate. This is the main reason why we focus on the finite-time performance with respect to $\|\theta_T-\theta^{\star}\|^{2}_{2}$. It is noteworthy that Liu et al. (2015) first introduced the primal-dual gap error to measure the convergence of GTD algorithms; we provide a discussion of finite-time primal-dual gap error analysis in Appendix E.

Nathaniel et al. (2015) prove that TD(0) (Sutton 1988) converges at $O(1/T)$ with the step-size $\alpha_t=O(1/t^{\eta})$, $\eta\in(0,1)$. Later, Dalal et al. (2018a) further explore the properties of TD(0); they prove that the convergence rate of TD(0) achieves $O\big(e^{-\frac{\lambda}{2}T^{1-\eta}}+\frac{1}{T^{\eta}}\big)$, but it never reaches $O(1/T)$, where $\eta\in(0,1)$ and $\lambda$ is the minimum eigenvalue of the matrix $A+A^{\top}$. Lakshminarayanan et al. (2018) show that TD(0) converges at $O(1/T)$ with a more relaxed step-size than the works of (Nathaniel et al. 2015; Dalal et al. 2018a): it only requires a constant step-size. Recently, Dalal et al. (2018b) prove that the GTD(0) family of algorithms (Sutton et al. 2009b) converges at a sublinear rate, parameterized by $\kappa\in(0,1)$, that never reaches $O(1/T)$. A very similar convergence rate appears in (Dalal et al. 2020), which considers TDC and GTD2. Touati et al. (2018) propose GTB(λ)/GRetrace(λ) and prove that their convergence rate reaches $O(1/T)$. Zou et al. (2019) show that SARSA with linear function approximation converges at the rate of $O\big(\frac{\log^{3}(T)}{T}\big)$. Recently, Kaledin et al. (2020) further develop two-timescale stochastic approximation with Markovian noise, and they show that TDC converges at a rate of $O(1/T)$ if $\alpha_t=\frac{1}{t^{v}}$, $\beta_t=\frac{1}{t}$, $v\in(0,1)$. Theorem 3 shows that our GES(λ) achieves a convergence rate of the same order under a suitable choice of step-size. Although Xu et al. (2019) prove that TDC converges at a linear convergence rate, they require a fussy blockwise diminishing step-size condition: $\max\{\alpha_t\log(\frac{1}{\alpha_t}),\alpha_t\}\le\min\{\|\theta_0-\theta^{\star}\|^{2}_{2}t^{-1},C\}$, $\alpha_t=O(\frac{1}{(t+1)^{\eta_1}})$, $\beta_t=O(\frac{1}{(t+1)^{\eta_2}})$, $0<\eta_2<\eta_1<1$, where $C$ is a constant. Apparently, our Theorem 2 requires a simpler step-size condition than Xu et al. (2019). It is also noteworthy that our Theorem 2 does not require the additional projection step (which is unnecessary in practice) that appears in (Xu et al. 2019). Significantly, the finite-time analyses of (Dalal et al. 2018a; Lakshminarayanan et al. 2018) require the additional assumption that all the samples used to update the function parameters are i.i.d. In this paper, we remove this condition and achieve a better result than theirs.

## Experiments

In this section, we test the capacity of GES(λ) for off-policy evaluation in three typical domains: Mountain Car, Baird Star (Baird 1995), and the Two-State MDP (Touati et al. 2018). We compare GES(λ) with three state-of-the-art algorithms, GQ(λ) (Maei and Sutton 2010), ABQ(ζ) (Mahmood et al. 2017b), and GTB(λ) (Touati et al. 2018), over two typical measurements: MSPBE and mean square error (MSE). We choose these three algorithms as baselines since they all learn via the expected TD error $\delta^{ES}_{t}$, the same as GES(λ).

Figure 2: Comparison between a constant step-size and the decaying step-size $\alpha_t=O(1/\sqrt{t})$ (Mountain Car and Baird Star).

Figure 3: MSPBE and MSE comparison on Mountain Car.

### Feature Map and Parameters

Recall the states and actions of Mountain Car: $\mathcal{S}=\{(\mathrm{Velocity},\mathrm{Position})\}=[-0.07,0.07]\times[-1.2,0.6]$ and $\mathcal{A}=\{\mathrm{left},\mathrm{neutral},\mathrm{right}\}$. In this example, if $\mathrm{Velocity}>0$, we use the behavior policy $\mu=(\frac{1}{100},\frac{1}{100},\frac{98}{100})$ and target policy $\pi=(\frac{1}{10},\frac{1}{10},\frac{8}{10})$; otherwise $\mu=(\frac{98}{100},\frac{1}{100},\frac{1}{100})$ and $\pi=(\frac{8}{10},\frac{1}{10},\frac{1}{10})$. Since the state space is continuous, we use an open-source tile-coding software³ to extract state features. We set the number of tilings to 4, and there are no white-noise features. The performance is averaged over 5 runs, and each run contains 5000 episodes. As suggested by Sutton and Barto (2018), we set all initial parameters to 0, which is optimistic and encourages extensive exploration.

³http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/RLtoolkit/tilecoding.html

The Baird Star is an episodic seven-state MDP with two actions: a dashed action and a solid action. In this example, we set the behavior policy $\mu(\mathrm{dashed}\mid\cdot)=\frac{6}{7}$, $\mu(\mathrm{solid}\mid\cdot)=\frac{1}{7}$ and the target policy $\pi(\mathrm{solid}\mid\cdot)=1$. We choose the feature matrix
$$\Phi=\begin{pmatrix}2I_{7\times7}&\mathbf{1}_{7\times1}&\mathbf{0}_{7\times8}\\ \mathbf{0}_{7\times8}&2I_{7\times7}&\mathbf{1}_{7\times1}\end{pmatrix},$$
where $\mathbf{0}$ denotes a matrix whose elements are all 0, and $\mathbf{1}_{7\times1}$ denotes a vector whose elements are all 1.
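The Baird Star feature matrix described above can be assembled directly; a small sketch follows (the row ordering by action is our assumption).

```python
import numpy as np

# Phi = [[2*I_{7x7}, 1_{7x1}, 0_{7x8}],
#        [0_{7x8},   2*I_{7x7}, 1_{7x1}]], one row per state-action pair.
I7, ones, zeros = np.eye(7), np.ones((7, 1)), np.zeros((7, 8))
Phi = np.block([[2 * I7, ones, zeros],
                [zeros, 2 * I7, ones]])
print(Phi.shape)   # (14, 16)
```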
The dynamics of the Two-State MDP are presented in Example 1. We set $\lambda=0.99$ and $\gamma=0.99$ in all experiments. The MSPBE/MSE distribution is computed over the combinations of step-sizes $(\alpha_t,\frac{\beta_t}{\alpha_t})\in\{0.1\cdot2^{-j}\mid j=10,9,\ldots,1,0\}^{2}$.

### Effect of Step-size

Figure 2 compares the empirical MSPBE performance of a constant step-size with that of the decaying step-size $O(1/\sqrt{t})$. The results in Figure 2 illustrate that GES(λ) with a properly chosen constant step-size converges significantly faster than learning with the step-size $O(1/\sqrt{t})$, which also supports Theorem 2: learning with a proper constant step-size can reach a very fast rate.

Figure 4: MSPBE and MSE comparison on Baird Star.

Figure 5: MSPBE and MSE comparison on Two-State MDP.

### Comparison of Empirical MSPBE and MSE

In this section, we use the empirical $\mathrm{MSPBE}=\frac{1}{2}\|\hat b+\hat A\theta\|^{2}_{\hat M^{-1}}$ to evaluate performance, where $\hat A$, $\hat b$, and $\hat M$ are evaluated via their unbiased estimators by the Monte Carlo method with 5000 episodes; our implementation of MSPBE is inspired by (Touati et al. 2018). Besides, we also compare performance with the common empirical MSE: $\mathrm{MSE}=\|\Phi\theta-q^{\pi}\|^{2}_{\Xi}/\|q^{\pi}\|^{2}_{\Xi}$, where $q^{\pi}$ is estimated by simulating the target policy $\pi$ and averaging the discounted cumulative rewards over trajectories. The combination of step-sizes for MSE is the same as for the empirical MSPBE above. The results in Figures 3 to 5 show that our GES(λ) learns faster and achieves better performance than GQ(λ), ABQ(ζ), and GTB(λ). Moreover, GES(λ) converges with a lower variance. In the Two-State MDP and Baird Star experiments, GES(λ) outperforms the baselines only slightly. This is because both the Two-State MDP and Baird Star are relatively easy; many gradient TD learning methods can learn a convergent result. The advantage of GES(λ) over the baselines becomes more significant in the Mountain Car domain, which shows that GES(λ) is more robust than the baselines in the more difficult task.
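For reference, here is a sketch of how the empirical MSPBE above can be computed from Monte Carlo estimates $\hat A$, $\hat b$, and $\hat M$ built from the unbiased estimators in (21); the ridge regularizer is our addition to guard against a singular $\hat M$.

```python
import numpy as np

def empirical_mspbe(theta, A_hat, b_hat, M_hat, ridge=1e-8):
    """Empirical MSPBE = 0.5 * ||b_hat + A_hat @ theta||^2 in the M_hat^{-1} norm."""
    v = b_hat + A_hat @ theta
    # small ridge term (our assumption) keeps the solve well posed
    return 0.5 * v @ np.linalg.solve(M_hat + ridge * np.eye(len(v)), v)
```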
## Conclusion

We propose GES(λ), which extends Expected Sarsa(λ) with linear function approximation. We prove that GES(λ) learns the optimal solution at a linear convergence rate, which is comparable to extensive GTD algorithms. The primal-dual gap error of GES(λ) matches the best-known theoretical results, while requiring a simpler step-size condition. Finally, we conduct experiments to verify the effectiveness of GES(λ).

## Acknowledgements

This work was supported by the Natural Science Foundation of China (No. 61925603) and Zhejiang Lab (No. 2019KC0AD02). The authors would like to thank Ahmed Touati for providing a number of valuable comments that helped us polish the paper.

## References

Baird, L. 1995. Residual algorithms: Reinforcement learning with function approximation. In ICML, 30-37.
Bertsekas, D. P. 2011. Temporal difference methods for general projected equations. IEEE Transactions on Automatic Control 56(9): 2128-2139.
Bertsekas, D. P. 2012. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, Belmont, MA.
Bhandari, J.; Russo, D.; Singal, R.; et al. 2018. A finite time analysis of temporal difference learning with linear function approximation. In COLT.
Dalal, G.; Szorenyi, B.; Thoppe, G.; and Mannor, S. 2018a. Finite sample analyses for TD(0) with function approximation. In AAAI.
Dalal, G.; Szorenyi, B.; Thoppe, G.; and Mannor, S. 2018b. Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning. In COLT.
Dalal, G.; Szorenyi, B.; Thoppe, G.; et al. 2020. A Tale of Two-Timescale Reinforcement Learning with the Tightest Finite-Time Bound. In AAAI.
Dann, C.; Neumann, G.; Peters, J.; et al. 2014. Policy evaluation with temporal differences: A survey and comparison. JMLR 15(1): 809-883.
Du, S. S.; Chen, J.; Li, L.; Xiao, L.; and Zhou, D. 2017. Stochastic variance reduction methods for policy evaluation. In ICML.
Du, S. S.; and Hu, W. 2019. Linear convergence of the primal-dual gradient method for convex-concave saddle point problems without strong convexity. In AISTATS.
Geist, M.; and Scherrer, B. 2014. Off-policy learning with eligibility traces: A survey. JMLR.
Ghosh, D.; and Bellemare, M. G. 2020. Representations for Stable Off-Policy Reinforcement Learning. In ICML.
Gupta, H.; Srikant, R.; Ying, L.; et al. 2019. Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. In NeurIPS, 4704-4713.
Kakade, S.; Shalev-Shwartz, S.; Tewari, A.; et al. 2009. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization.
Kaledin, M.; Moulines, E.; Naumov, A.; Tadic, V.; and Wai, H.-T. 2020. Finite time analysis of linear two-timescale stochastic approximation with Markovian noise. In COLT.
Khalil, H. K.; and Grizzle, J. W. 2002. Nonlinear Systems, volume 3. Prentice Hall, Upper Saddle River, NJ.
Lagoudakis, M. G.; and Parr, R. 2003. Least-squares policy iteration. JMLR 4: 1107-1149.
Lakshminarayanan, C.; Szepesvári, C.; et al. 2018. Linear Stochastic Approximation: How Far Does Constant Step-Size and Iterate Averaging Go? In AISTATS.
Liu, B.; Liu, J.; Ghavamzadeh, M.; Mahadevan, S.; and Petrik, M. 2015. Finite-Sample Analysis of Proximal Gradient TD Algorithms. In UAI.
Maei, H. R. 2011. Gradient temporal-difference learning algorithms. Ph.D. thesis, University of Alberta, Edmonton, Alberta.
Maei, H. R.; and Sutton, R. S. 2010. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, volume 1, 91-96.
Mahmood, A. R.; Yu, H.; Sutton, R. S.; et al. 2017b. Multi-step off-policy learning without importance sampling ratios.
Nathaniel, K.; Prashanth, L.; et al. 2015. On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. In ICML.
Nemirovski, A.; Juditsky, A.; Lan, G.; and Shapiro, A. 2009. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19(4): 1574-1609.
Precup, D.; Sutton, R. S.; and Singh, S. P. 2000. Eligibility Traces for Off-Policy Policy Evaluation. In ICML, 759-766.
Rockafellar, R. T. 1970. Convex Analysis. Princeton University Press.
Rubinstein, R. Y.; and Kroese, D. P. 2016. Simulation and the Monte Carlo Method, volume 10. John Wiley & Sons.
Srikant, R.; and Ying, L. 2019. Finite-Time Error Bounds For Linear Stochastic Approximation and TD Learning. In COLT.
Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3(1): 9-44.
Sutton, R. S.; and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge.
Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press.
Sutton, R. S.; Maei, H. R.; Precup, D.; Bhatnagar, S.; Silver, D.; Szepesvári, C.; and Wiewiora, E. 2009a. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In ICML.
Sutton, R. S.; Maei, H. R.; Szepesvári, C.; et al. 2009b. A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation. In NeurIPS.
Sutton, R. S.; Mahmood, A. R.; White, M.; et al. 2016. An emphatic approach to the problem of off-policy temporal-difference learning. JMLR 17(1): 2603-2631.
Thomas, P. S. 2015. Safe reinforcement learning. Ph.D. thesis, University of Massachusetts Libraries.
Touati, A.; Bacon, P.-L.; Precup, D.; and Vincent, P. 2018. Convergent Tree-Backup and Retrace with Function Approximation. In ICML.
Voloshin, C.; Le, H. M.; Jiang, N.; and Yue, Y. 2019. Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning. arXiv preprint arXiv:1911.06854.
Wang, G.; Li, B.; Giannakis, G. B.; et al. 2019. A multistep Lyapunov approach for finite-time analysis of biased stochastic approximation. arXiv preprint arXiv:1909.04299.
Wang, Y.; Chen, W.; Liu, Y.; Ma, Z.-M.; and Liu, T.-Y. 2017. Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting. In NeurIPS.
Xu, T.; Zou, S.; Liang, Y.; et al. 2019. Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. In NeurIPS.
Zou, S.; Xu, T.; Liang, Y.; et al. 2019. Finite-sample analysis for SARSA with linear function approximation. In NeurIPS.