# Temporal Difference Learning as Gradient Splitting

Rui Liu¹, Alex Olshevsky²

¹Division of Systems Engineering, Boston University, Boston, MA, USA. ²Department of ECE and Division of Systems Engineering, Boston University, Boston, MA, USA. Correspondence to: Rui Liu. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

**Abstract.** Temporal difference learning with linear function approximation is a popular method to obtain a low-dimensional approximation of the value function of a policy in a Markov Decision Process. We provide an interpretation of this method in terms of a splitting of the gradient of an appropriately chosen function. As a consequence of this interpretation, convergence proofs for gradient descent can be applied almost verbatim to temporal difference learning. Beyond giving a fuller explanation of why temporal difference learning works, this interpretation also yields improved convergence times. We consider the setting with $1/\sqrt{T}$ step-size, where previous comparable finite-time convergence bounds for temporal difference learning had a multiplicative factor of $1/(1-\gamma)$ in front of the bound, with $\gamma$ being the discount factor. We show that a minor variation on TD learning which estimates the mean of the value function separately has a convergence time where $1/(1-\gamma)$ only multiplies an asymptotically negligible term.

## 1. Introduction

Reinforcement learning is a basic machine learning paradigm which concerns learning optimal policies in Markov Decision Processes (MDPs). It has been applied to many challenging practical problems, such as autonomous driving (Chen et al., 2015), robotics (Gu et al., 2017), bidding and advertising (Jin et al., 2018), and games (Silver et al., 2016). An important problem in reinforcement learning is to estimate the value function of a given policy, often referred to as the policy evaluation problem. Temporal difference (TD) learning, originally proposed by Sutton (1988), is one of the most widely used policy evaluation algorithms. TD uses differences in predictions over successive time steps to drive the learning process, with the prediction at any given time step updated via a carefully chosen step-size to bring it closer to the prediction of the same quantity at the next time step.

Despite its simple implementation, theoretical analysis of TD can be involved. This is particularly true when TD methods are applied to problems with large state spaces by maintaining an approximation to the value function. Precise conditions for the asymptotic convergence of TD with linear function approximation were established by viewing TD as a stochastic approximation for solving a suitable Bellman equation in Tsitsiklis & Van Roy (1997).

Before the last few years, there have been few non-asymptotic analyses of TD methods. The first non-asymptotic bounds for TD(0) with linear function approximation were given by Korda & La (2015), obtaining an exponential convergence rate for the centered variant of TD(0) when the underlying Markov chain mixes fast. However, some issues with the proofs of Korda & La (2015) were listed by the subsequent work of Narayanan & Szepesvári (2017). In Lakshminarayanan & Szepesvari (2018), it was shown that TD algorithms with a problem-independent constant step size and iterate averaging achieve a problem-dependent error that decays as $O(1/t)$ with the number of iterations $t$.
Convergence rates in probability with an $O(1/t)$ step-size were provided by Dalal et al. (2018). Both the analyses of Dalal et al. (2018) and Lakshminarayanan & Szepesvari (2018) assume that the samples used by the algorithm are i.i.d. rather than a trajectory in the underlying Markov chain. For the Markov chain observation model, Bhandari et al. (2018) provide an $O(1/\sqrt{T})$ convergence rate with a step-size that scales as $1/\sqrt{T}$, and an $O((\log t)/t)$ convergence rate with step-size $O(1/t)$, for the projected TD algorithm. The constant factors in the latter bounds depend on $1/(1-\gamma)$, where $\gamma$ is the discount factor; this scaling is one of the things we will be studying in this paper.

A number of papers also work on algorithms related to and inspired by the classic TD algorithm in the setting with Markovian sampling. Srikant & Ying (2019) give finite-time bounds for TD algorithms with linear function approximation and a constant step-size. The two time-scale TD with gradient correction algorithm under Markovian sampling and linear function approximation is discussed by Xu et al. (2019b) and shown to converge as fast as $O((\log t)/t^{2/3})$. A method called TD-AMSGrad under linear function approximation is studied by Xiong et al. (2020); with a constant step size, TD-AMSGrad converges to a neighborhood of the global optimum at a rate of $O(1/t)$, and with a diminishing step size, it converges exactly to the global optimum at a rate of $O((\log t)/t)$. Xu et al. (2019a) analyze variance-reduced TD and show that it attains a reduced bias error compared to classic TD.

In this paper, we will study the convergence of TD(0) and TD(λ) with linear function approximation under Markovian observations. Our main contribution is to provide an interpretation of temporal difference learning: we show how to view it as a splitting (a term we will define later) of the gradient of an appropriately chosen quadratic form. As a consequence of this interpretation, it is possible to apply convergence proofs for gradient descent almost verbatim to temporal difference learning.

The convergence-time bounds we obtain this way improve on existing results. In particular, we study step-sizes of $1/\sqrt{T}$, which are typically recommended because the resulting error bounds do not depend on the inverse eigenvalues of the matrices involved in the linear approximation, which can be quite large; by contrast, methods that achieve faster than $O(1/\sqrt{T})$ decay have performance guarantees that scale with these same eigenvalues. We provide a minor variation on TD(0) for which we obtain a convergence rate that scales as $O\big((1/(1-\gamma))^2/T\big) + \tilde{O}(1/\sqrt{T})$, with the constant in the $\tilde{O}(\cdot)$ term not blowing up as $\gamma \to 1$. We will also explain why a factor of $1/(1-\gamma)^2$ multiplying the asymptotically negligible $O(1/T)$ term, as here, is unavoidable.

## 2. Preliminaries

In this section, we describe the basics of MDPs and TD learning methods. While all this material is standard and available in textbooks (e.g., Sutton & Barto (2018)), it is necessary to standardize notation and make our presentation self-contained.

### 2.1. Markov Decision Processes

We consider a discounted reward MDP described by a 5-tuple $(S, A, P, r, \gamma)$, where $S = [n] = \{1, 2, \ldots, n\}$ is the finite state space, $A$ is the finite action space, $P = (P(s'|s,a))_{s,s' \in S, a \in A}$ are the transition probabilities, $r = (r(s,s'))_{s,s' \in S}$ are rewards which are determined deterministically by the state transition pair $(s,s')$, and $\gamma \in (0,1)$ is the discount factor.
The (stationary) policy to be evaluated is a mapping $\mu: S \times A \to [0,1]$, where $\mu(s,a)$ is the probability of selecting action $a$ when in state $s$, and $\sum_{a \in A} \mu(s,a) = 1$ for all states $s \in S$. We adopt the shorthand $s_t$ for the state at step $t$, $a_t$ for the action taken at step $t$, and $r_{t+1} = r(s_t, s_{t+1})$.

The value function of the policy $\mu$, denoted $V^\mu: S \to \mathbb{R}$, is defined as
$$V^\mu(s) = \mathbb{E}_{\mu,s}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right],$$
where $\mathbb{E}_{\mu,s}[\cdot]$ indicates that $s$ is the initial state and the actions are chosen according to $\mu$. The immediate reward vector $R^\mu: S \to \mathbb{R}$ is defined as
$$R^\mu(s) = \mathbb{E}_{\mu,s}[r_1] = \sum_{s' \in S} \sum_{a \in A} \mu(s,a) P(s'|s,a) r(s,s').$$
For the remainder of the paper, we will be fixing the policy $\mu$; consequently, we can talk about the probability transition matrix $P^\mu$ defined as
$$P^\mu(s,s') = \sum_{a \in A} \mu(s,a) P(s'|s,a).$$
In the following, we will treat $V^\mu$ and $R^\mu$ as vectors in $\mathbb{R}^n$, and treat $P^\mu$ as a matrix in $\mathbb{R}^{n \times n}$. It is well known that $V^\mu$ satisfies the Bellman equation (Sutton & Barto, 2018): defining the Bellman operator $T^\mu: \mathbb{R}^n \to \mathbb{R}^n$ as
$$(T^\mu V^\mu)(s) = \sum_{s'=1}^{n} P^\mu(s,s')\big(r(s,s') + \gamma V^\mu(s')\big) \quad \text{for } s \in [n],$$
we can then write the Bellman equation as $T^\mu V^\mu = V^\mu$.

Next, we state some standard assumptions from the literature. The first assumption is on the underlying Markov chain.

**Assumption 1.** The Markov chain whose transition matrix is the matrix $P^\mu$ is irreducible and aperiodic.

Under this assumption, the Markov decision process induced by the policy $\mu$ is ergodic with a unique stationary distribution $\pi = (\pi_1, \pi_2, \ldots, \pi_n)$, a row vector whose entries are positive and sum to 1. It also holds that $\pi_{s'} = \lim_{t \to \infty} (P^\mu)^t(s,s')$ for any two states $s, s' \in [n]$. Note that we are using $\pi$ to denote the stationary distribution of $P^\mu$, and not the policy (which is denoted by $\mu$). We will use the notation $r_{\max}$ to denote an upper bound on the rewards; more formally, $r_{\max}$ is a real number such that $|r(s,s')| \le r_{\max}$ for all $s, s' \in [n]$. Since the number of states is finite, such an $r_{\max}$ always exists.

We next introduce some notation that will make our analysis more concise. For a symmetric positive definite matrix $A \in \mathbb{R}^{n \times n}$, we define the inner product $\langle x, y \rangle_A = x^T A y$ and the associated norm $\|x\|_A = \sqrt{x^T A x}$. Let $D = \operatorname{diag}(\pi_1, \ldots, \pi_n)$ denote the diagonal matrix whose elements are given by the entries of the stationary distribution $\pi$. Given value functions $V$ and $V'$ on the state space $S$, we define
$$\langle V, V' \rangle_D = V^T D V' = \sum_{s \in S} \pi_s V(s) V'(s),$$
and the associated norm $\|V\|_D^2 = V^T D V = \sum_{s \in S} \pi_s V(s)^2$.

Finally, we define the Dirichlet seminorm, which is often called the Dirichlet form in the Markov chain literature (Diaconis et al., 1996); we follow here the notation of (Ollivier, 2018). The Dirichlet seminorm depends both on the transition matrix $P$ and the invariant measure $\pi$:
$$\|V\|_{\text{Dir}}^2 = \frac{1}{2} \sum_{s,s' \in S} \pi_s P(s,s') \big(V(s') - V(s)\big)^2.$$
It is easy to see that, as a consequence of Assumption 1, $\|V\|_{\text{Dir}} = 0$ if and only if $V$ is a multiple of the all-ones vector. Similarly, we introduce the $k$-step Dirichlet seminorm, defined as
$$\|V\|_{\text{Dir},k}^2 = \frac{1}{2} \sum_{s,s' \in S} \pi_s P^k(s,s') \big(V(s') - V(s)\big)^2.$$

### 2.2. Policy Evaluation, Temporal Difference Learning, and Linear Function Approximation

Policy evaluation refers to the problem of estimating the value function $V^\mu$ for a given stationary policy $\mu$. If the size of the state space is large, computing $V^\mu(s)$ for all states $s$ may be prohibitively expensive. A standard remedy is to use a low-dimensional approximation $V^\mu_\theta$ of $V^\mu$ in the classical TD algorithm, as in Sutton (1988); Sutton & Barto (2018). For brevity, we omit the superscript $\mu$ throughout from now on.
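To make these definitions concrete, the following minimal sketch (ours, not from the paper) sets up a small synthetic chain $P^\mu$ with expected immediate reward vector $R^\mu$, computes the stationary distribution, solves the Bellman equation for $V^\mu$, and evaluates $\|\cdot\|_D$ and $\|\cdot\|_{\text{Dir},k}$; the helper names and the numbers are our own illustration.

```python
import numpy as np

# Small synthetic example (not from the paper): a 3-state chain induced by a
# fixed policy, its expected immediate reward vector, and a discount factor.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
R = np.array([1.0, 0.0, 2.0])
gamma = 0.9

# Stationary distribution pi: the left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()
D = np.diag(pi)

# The value function solves the Bellman equation V = R + gamma * P V.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

def d_norm_sq(x):
    # ||x||_D^2 = sum_s pi_s x(s)^2
    return x @ D @ x

def dirichlet_seminorm_sq(x, k=1):
    # ||x||_{Dir,k}^2 = (1/2) sum_{s,s'} pi_s P^k(s,s') (x(s') - x(s))^2
    Pk = np.linalg.matrix_power(P, k)
    diff = x[None, :] - x[:, None]          # diff[s, s'] = x(s') - x(s)
    return 0.5 * np.sum(pi[:, None] * Pk * diff ** 2)

print(d_norm_sq(V), dirichlet_seminorm_sq(V))
print(dirichlet_seminorm_sq(np.ones(3)))    # zero: the seminorm vanishes on constants
```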
The classical TD(0) algorithm with function approximation $V_\theta$ starts with an arbitrary value of the parameters $\theta_0$; upon observing the $t$-th transition $s_t \to s'_t$, it computes the scalar-valued temporal-difference error
$$\delta_t = r(s_t, s'_t) + \gamma V_{\theta_t}(s'_t) - V_{\theta_t}(s_t),$$
and updates the parameter vector as
$$\theta_{t+1} = \theta_t + \alpha_t \delta_t \nabla V_{\theta_t}(s_t). \tag{1}$$
Here $\nabla V_{\theta_t}(s_t)$ denotes the gradient of the function $V_\theta(s_t)$ with respect to $\theta$ evaluated at $\theta = \theta_t$, and $\alpha_t$ is the step size. Intuitively, updating in the direction $\delta_t \nabla V_{\theta_t}(s_t)$ moves $V_{\theta_t}(s_t)$ closer to the bootstrapped value $r(s_t, s'_t) + \gamma V_{\theta_t}(s'_t)$.

We will be considering the TD(0) algorithm with a linear function approximation $V_\theta$ defined as
$$V_\theta(s) = \sum_{l=1}^{K} \theta_l \phi_l(s) \quad \forall s \in S,$$
for a given set of $K$ feature vectors $\phi_l: S \to \mathbb{R}$, $l \in [K]$. For each state $s$, we define the vector $\phi(s)$ which stacks up the features of $s$ as $\phi(s) = (\phi_1(s), \phi_2(s), \ldots, \phi_K(s))^T \in \mathbb{R}^K$. Finally, $\Phi \in \mathbb{R}^{n \times K}$ is defined to be the matrix $\Phi = [\phi_1, \ldots, \phi_K]$. We thus have that $V_\theta(s) = \theta^T \phi(s)$, and the approximate TD(0) update becomes
$$\theta_{t+1} = \theta_t + \alpha_t \big(r(s_t, s'_t) + \gamma \theta_t^T \phi(s'_t) - \theta_t^T \phi(s_t)\big) \phi(s_t). \tag{2}$$

Next we state a common assumption on the feature vectors, which requires that the features used for approximation are linearly independent (Tsitsiklis & Van Roy, 1997; Bhandari et al., 2018).

**Assumption 2.** The matrix $\Phi$ has full column rank, i.e., the feature vectors $\{\phi_1, \ldots, \phi_K\}$ are linearly independent. Additionally, we also assume that $\|\phi(s)\|_2^2 \le 1$ for all $s \in S$.

It is always possible to make sure this assumption holds. If the norm bound is not satisfied, the standard approach is to normalize the feature vectors so that it is. If the matrix $\Phi$ does not have full column rank, one can simply omit enough feature vectors so that it does.

It is well known that under Assumptions 1-2, as well as an additional assumption on the decay of the step-sizes $\alpha_t$, temporal difference learning converges almost surely; furthermore, its limit is the fixed point of a certain projected Bellman equation (Tsitsiklis & Van Roy, 1997). Henceforth we will use $\theta^*$ to denote this fixed point.

It is convenient to introduce the notation
$$g_t(\theta) = \big(r(s_t, s'_t) + \gamma \phi(s'_t)^T \theta - \phi(s_t)^T \theta\big) \phi(s_t)$$
for the direction taken by TD(0) at time $t$. Note that $g_t(\theta)$ is a scalar multiple of $\phi(s_t)$, the feature vector of the state encountered at time $t$. Furthermore, $\bar{g}(\theta)$ will denote the average of $g_t(\theta)$ when the state is sampled according to the stationary distribution:
$$\bar{g}(\theta) = \sum_{s,s' \in S} \pi(s) P(s,s') \big(r(s,s') + \gamma \phi(s')^T \theta - \phi(s)^T \theta\big) \phi(s).$$
Naturally, it can be seen (see Tsitsiklis & Van Roy (1997)) that
$$\bar{g}(\theta^*) = 0. \tag{3}$$

### 2.3. Eligibility Traces

We will also study a larger class of algorithms, denoted by TD(λ) and parameterized by $\lambda \in [0,1]$, that contains as a special case the TD(0) algorithm discussed above. While TD(0) makes parameter updates in the direction of the (scaled) last feature vector $g_t(\theta_t)$, the TD(λ) algorithm maintains the eligibility trace
$$z_t = \sum_{k=-\infty}^{t} (\gamma\lambda)^{t-k} \phi(s_k),$$
which is a geometrically weighted average of the feature vectors at all previously visited states, and takes a step in the direction of $\delta_t z_t$. In practice, the sum will start at $k = 0$ (or some other finite time); however, parts of the analysis are done with the sum starting at negative infinity because many of the results are much simpler in this setting, and doing so introduces only an exponentially decaying error term.
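As a concrete illustration, here is a minimal simulation sketch (ours, not the authors' code) of the update (2) together with the eligibility-trace variant of Section 2.3; the function name, the reward-matrix interface, and the choice to start the trace at time 0 are our own assumptions.

```python
import numpy as np

def td_linear(P, r, Phi, gamma, lam, alpha, T, seed=0):
    """Sketch of TD(lambda) with linear function approximation; lam = 0 gives TD(0), Eq. (2).

    P   : n x n transition matrix of the chain induced by the fixed policy mu.
    r   : n x n matrix with r[s, s2] = r(s, s').
    Phi : n x K feature matrix whose row s is phi(s)^T.
    """
    rng = np.random.default_rng(seed)
    n, K = Phi.shape
    theta = np.zeros(K)
    z = np.zeros(K)                        # eligibility trace, started at time 0
    s = rng.integers(n)
    for t in range(T):
        s_next = rng.choice(n, p=P[s])     # one transition of the Markov chain
        # temporal-difference error: delta_t = r(s_t, s'_t) + gamma V_theta(s'_t) - V_theta(s_t)
        delta = r[s, s_next] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        z = gamma * lam * z + Phi[s]       # geometric weighting of visited features
        theta = theta + alpha * delta * z  # for lam = 0 this is exactly Eq. (2)
        s = s_next
    return theta
```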
It is shown in Tsitsiklis & Van Roy (1997) that, subject to Assumptions 1-2 and appropriate decay of the step-sizes, TD(λ) converges with probability one, and its limit is the fixed point of a certain projected and averaged Bellman equation. We will denote this limit by $\theta^*_\lambda$.

### 2.4. Markov Chain Observation Model

In this paper, we are interested in TD in the setting where the data is collected from a single sample path of a Markov chain. Our final assumption is that the Markov chain mixes at a uniform geometric rate.

**Assumption 3.** There are constants $m > 0$ and $\rho \in (0,1)$ such that
$$\sup_{s \in S} d_{TV}\big(P^t(s, \cdot), \pi\big) \le m \rho^t \quad \forall t \in \mathbb{N}_0,$$
where $d_{TV}(P, Q)$ denotes the total-variation distance between probability measures $P$ and $Q$. In addition, the initial distribution of $s_0$ is the steady-state distribution $\pi$, so that $(s_0, s_1, \ldots)$ is a stationary sequence.

Under Assumption 1, i.e., for irreducible and aperiodic Markov chains, the uniform mixing assumption always holds (Levin & Peres, 2017). It is worth noting that the assumption that $s_0$ is drawn from the stationary distribution is primarily made to make the analysis and results tidier: given the uniform mixing assumption, one can apply the analysis after the Markov chain is close to its steady state.

## 3. Temporal Difference Learning as Gradient Splitting

All existing analyses of temporal difference learning proceed by comparing it, either explicitly or implicitly, to the expected update, usually referred to as the mean-path update; for TD(0), this is
$$\theta_{t+1} = \theta_t + \alpha_t \bar{g}(\theta_t).$$
Stochastic approximation (Robbins & Monro, 1951) is a common tool to make this comparison. Generally, one wants to argue that the mean-path TD update brings $\theta_t$ closer to its final value $\theta^*$. The first theoretical analysis of TD(0) in Tsitsiklis & Van Roy (1997) proceeded based on the observation that $\bar{g}(\theta)$ forms a positive angle with $\theta^* - \theta$, that is,
$$\bar{g}(\theta)^T (\theta^* - \theta) > 0. \tag{4}$$
An explicit version of this inequality was used in (Bhandari et al., 2018), where it is stated as Lemma 3:
$$\bar{g}(\theta)^T (\theta^* - \theta) \ge (1-\gamma) \|V_\theta - V_{\theta^*}\|_D^2. \tag{5}$$
Our main result is an interpretation of the quantity $\bar{g}(\theta)$ which explains why such an inequality holds, as well as allows us to derive stronger results. To do this, we first introduce the concept of a gradient splitting.

**Definition 1.** Let $A$ be a symmetric positive semi-definite matrix. A linear function $h(\theta) = B(\theta - a)$ is called a gradient splitting of the quadratic $f(\theta) = (\theta - a)^T A (\theta - a)$ if $B + B^T = 2A$.

Note that whenever we state that $h(\theta)$ is a splitting of the gradient $\nabla f(\theta)$, this presumes that $h(\theta)$ is a linear function of $\theta$ while $f(\theta)$ is a quadratic. To the best of our knowledge, the concept is introduced here for the first time. We next explain why it is useful.

### 3.1. Gradient Splitting and Gradient Descent

Observe first that, as one should expect from the name, $(1/2) \nabla f(\theta)$ is a splitting of the gradient of $f$, since $\frac{1}{2} \nabla f(\theta) = A(\theta - a)$. Of course, it is far from the only splitting, since there are many $B$ that satisfy $B + B^T = 2A$. In particular, $B$ may be non-symmetric. For example, one can take $B$ to be equal to the strictly upper triangular part of $2A$ plus the diagonal of $A$.

The key property of splittings that makes them useful is the following.

**Proposition 1.** Suppose $h(\theta)$ is a splitting of the gradient of $f(\theta)$. Then
$$(\theta_1 - \theta_2)^T \big(h(\theta_1) - h(\theta_2)\big) = \frac{1}{2} (\theta_1 - \theta_2)^T \big(\nabla f(\theta_1) - \nabla f(\theta_2)\big).$$

*Proof.* Indeed,
$$(\theta_1 - \theta_2)^T \big(h(\theta_1) - h(\theta_2)\big) = (\theta_1 - \theta_2)^T B (\theta_1 - \theta_2) = \frac{1}{2}(\theta_1 - \theta_2)^T B (\theta_1 - \theta_2) + \frac{1}{2}(\theta_1 - \theta_2)^T B^T (\theta_1 - \theta_2) = (\theta_1 - \theta_2)^T A (\theta_1 - \theta_2) = \frac{1}{2}(\theta_1 - \theta_2)^T \big(\nabla f(\theta_1) - \nabla f(\theta_2)\big). \qquad \square$$
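The following small numerical check (ours) illustrates Definition 1 and Proposition 1: it builds a random positive semi-definite $A$, forms the non-symmetric splitting mentioned above, and verifies the inner-product identity. The matrices and points are arbitrary test data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric positive semi-definite A and center a.
M = rng.standard_normal((4, 4))
A = M @ M.T
a = rng.standard_normal(4)

# Non-symmetric splitting: strictly upper triangular part of 2A plus diag(A).
B = 2.0 * np.triu(A, k=1) + np.diag(np.diag(A))
assert np.allclose(B + B.T, 2.0 * A)              # Definition 1

grad_f = lambda th: 2.0 * A @ (th - a)            # gradient of f(th) = (th-a)^T A (th-a)
h = lambda th: B @ (th - a)                       # the splitting h(th) = B(th - a)

# Proposition 1: inner products with theta_1 - theta_2 agree (up to the factor 1/2).
t1, t2 = rng.standard_normal(4), rng.standard_normal(4)
lhs = (t1 - t2) @ (h(t1) - h(t2))
rhs = 0.5 * (t1 - t2) @ (grad_f(t1) - grad_f(t2))
assert np.isclose(lhs, rhs)
```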
Thus, while $h(\theta)$ may be quite different from $\nabla f(\theta)$, the difference disappears once one looks at the inner products considered in Proposition 1. A particular consequence of Proposition 1 can be obtained by plugging in $\theta_1 = a$, the global minimizer of $f(\theta)$. In that case, $\nabla f(a) = 0$ and $h(a) = 0$ as well, and we obtain that for all $\theta$,
$$(a - \theta)^T h(\theta) = \frac{1}{2} (a - \theta)^T \nabla f(\theta). \tag{6}$$
Thus the splitting $h(\theta)$ has the exact same angle with the direction to the optimal solution $a - \theta$ as the true gradient. Most analyses of gradient descent on convex functions are ultimately based on the observation that gradient descent makes progress towards the optimal solution because it has a positive inner product with the direction to optimality. As a consequence of this discussion, the same argument can be applied to gradient splittings.

### 3.2. Our Main Contribution

We now come back to temporal difference learning. To analyze TD learning, it is tempting to see if we can write the TD(0) and TD(λ) updates as gradient descent on some appropriately chosen function. Unfortunately, it is well known (and easy to see) that this cannot work. Indeed, in the TD(0) case, it is possible to express the average direction $\bar{g}(\theta)$ as $\bar{g}(\theta) = -B(\theta - \theta^*)$, and in some cases the matrix $B$ is not symmetric; this linear map cannot be the gradient of anything, since the non-symmetry of $B$ would contradict equality of mixed partial derivatives (see (Maei, 2011)).

Our main results show that the temporal difference direction can, however, be viewed as a splitting of the gradient of an appropriately chosen function.

**Theorem 1.** Suppose Assumptions 1-2 hold. Then, in the TD(0) update, $-\bar{g}(\theta)$ is a splitting of the gradient of the quadratic
$$f(\theta) = (1-\gamma) \|V_\theta - V_{\theta^*}\|_D^2 + \gamma \|V_\theta - V_{\theta^*}\|_{\text{Dir}}^2.$$

**Theorem 2.** Suppose Assumptions 1-2 hold. Then, in the TD(λ) update, the negative of the expected update $\mathbb{E}[\delta_t z_t]$ is a splitting of the gradient of the quadratic
$$f^{(\lambda)}(\theta) = (1 - \gamma\kappa) \|V_\theta - V_{\theta^*_\lambda}\|_D^2 + (1-\lambda) \sum_{m=0}^{+\infty} \lambda^m \gamma^{m+1} \|V_\theta - V_{\theta^*_\lambda}\|_{\text{Dir},m+1}^2,$$
where $\kappa = (1-\lambda)/(1-\gamma\lambda)$.

The proofs of these theorems can be found in the supplementary information. Assumptions 1 and 2 are not particularly crucial: they are used only to be able to define the stationary distribution $\pi$ and the unique fixed point $\theta^*$.

These results provide some insight into why temporal difference learning works. Indeed, there is no immediate reason why the bootstrapped update of Eq. (2) should produce a reasonable answer, and it is well known that the version with nonlinear approximation in Eq. (1) can diverge (see Tsitsiklis & Van Roy (1997)). Convergence analyses of TD learning rely on Eq. (4), but the proof of this equation from (Tsitsiklis & Van Roy, 1997) does not yield a conceptual reason for why it should hold. The previous two theorems provide such a conceptual reason. It turns out that TD(0) and TD(λ) are, on average, attempting to minimize the functions $f(\theta)$ and $f^{(\lambda)}(\theta)$ defined in those theorems, by moving in the direction of a gradient splitting. Moreover, the functions $f(\theta)$ and $f^{(\lambda)}(\theta)$ are plainly convex (they are positive linear combinations of convex quadratics), so that Equation (6) immediately explains why Eq. (4) holds.

These theorems are inspired by the recent preprint (Ollivier, 2018). It is shown there that, if $P$ is reversible, then $-\bar{g}(\theta)$ is exactly the gradient of the function $f(\theta)$, even in the case when the function approximation is nonlinear. Theorem 1 may be viewed as a way to generalize this observation to the non-reversible case for the case of linear function approximation.
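To make Theorem 1 concrete, the following sketch (ours, on a random synthetic chain in the linear-approximation setting of Section 2.2) writes $-\bar{g}(\theta) = B(\theta - \theta^*)$ with $B = \Phi^T D (I - \gamma P)\Phi$ and checks numerically that $B + B^T = 2A$, where $A$ is the matrix of the quadratic $f$ in Theorem 1. The closed-form expressions for $B$ and for the Dirichlet quadratic form are our own algebra, not quoted from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, gamma = 5, 2, 0.9

# Random chain induced by the policy, its stationary distribution, and features.
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))]); pi /= pi.sum()
D = np.diag(pi)
Phi = rng.standard_normal((n, K))

# Mean TD(0) direction: gbar(theta) = -B (theta - theta*), with
B = Phi.T @ D @ (np.eye(n) - gamma * P) @ Phi

# Matrix A of the quadratic f(theta) = (1-gamma)||V_th - V_th*||_D^2
#                                      + gamma ||V_th - V_th*||_Dir^2,
# written as (theta - theta*)^T A (theta - theta*).
Dir = D - 0.5 * (D @ P + P.T @ D)      # quadratic form of the Dirichlet seminorm
A = Phi.T @ ((1.0 - gamma) * D + gamma * Dir) @ Phi

# Theorem 1 (splitting property): B + B^T = 2A, so -gbar splits grad f.
assert np.allclose(B + B.T, 2.0 * A)
```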
## 4. Consequences

We now discuss several consequences. These will all be along the lines of improved convergence guarantees. Indeed, as we mentioned in the previous section, viewing TD learning as gradient splitting allows us to take existing results for gradient descent and port them almost verbatim to the temporal difference setting. In the main body of the paper, we focus on TD(0); the case of TD(λ) is discussed in the supplementary information.

As mentioned earlier, existing analyses of TD(0) rely on Eq. (4) as well as its refinement Eq. (5). However, as a consequence of Proposition 1, we can actually write out explicitly the inner product between the mean TD(0) direction $\bar{g}(\theta)$ and the direction to optimality $\theta^* - \theta$.

**Corollary 1.** For any $\theta \in \mathbb{R}^K$,
$$(\theta^* - \theta)^T \bar{g}(\theta) = (1-\gamma) \|V_\theta - V_{\theta^*}\|_D^2 + \gamma \|V_\theta - V_{\theta^*}\|_{\text{Dir}}^2.$$

*Proof.* Indeed, we can use Eq. (3) to argue that
$$(\theta^* - \theta)^T \bar{g}(\theta) = (\theta^* - \theta)^T \big((-\bar{g}(\theta^*)) - (-\bar{g}(\theta))\big) = \frac{1}{2} (\theta^* - \theta)^T \big(\nabla f(\theta^*) - \nabla f(\theta)\big), \tag{7}$$
where the last step follows by Theorem 1 and Proposition 1; here $f(\theta)$ is the function from Theorem 1. However, for any quadratic function $q(\theta) = (\theta - a)^T P (\theta - a)$ where $P$ is a symmetric matrix, we have that
$$(a - \theta)^T \big(\nabla q(a) - \nabla q(\theta)\big) = 2 q(\theta).$$
Applying this to the function $f(\theta)$ in Eq. (7), we complete the proof. $\square$

This corollary should be contrasted with Eq. (4) and Eq. (5). It is clearly a strengthening of those equations. More importantly, it is an equality, whereas Eq. (4) and Eq. (5) are inequalities. We thus see that the average TD(0) direction makes more progress in the direction of the optimal solution than the previously available bounds suggest.

In the remainder of the paper, we will use Corollary 1 to obtain improved convergence times for TD(0); also, a natural generalization of that corollary, which appeals to Theorem 2 instead of Theorem 1, results in improved convergence times for TD(λ), as explained in the supplementary information.

We focus on a particular property which is natural in this context: scaling with the discount factor $\gamma$. Indeed, as we discussed in the introduction, an undesirable feature of some of the existing analyses of temporal difference learning is that they scale multiplicatively with $1/(1-\gamma)$. It is easy to see why this should be so: it is natural to rely on Eq. (5) or its variations, and as $\gamma \to 1$, that equation guarantees smaller and smaller progress towards the limit. Unfortunately, it is natural to set the discount factor close to 1 in order to avoid focusing on the short-term behavior of the policy. But now we can instead rely on Corollary 1, and this corollary suggests that as $\gamma \to 1$, the inner product between the expected TD(0) direction $\bar{g}(\theta)$ and the direction to the optimal solution $\theta^* - \theta$ will be lower bounded by $\gamma \|V_\theta - V_{\theta^*}\|_{\text{Dir}}^2$. A difficulty, however, is that the Dirichlet seminorm can be zero even when applied to a nonzero vector. We next discuss the results we are able to derive with this approach.

### 4.1. Improved Error Bounds

As mentioned earlier, a nice consequence of the gradient splitting interpretation is that we can apply the existing proofs for gradient descent almost verbatim to gradient splittings. In particular, temporal difference learning when the states are sampled i.i.d. could be analyzed by simply following existing analyses of noisy gradient descent.
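Since Corollary 1 is an exact identity, it is easy to sanity-check numerically. The sketch below (ours, on a random synthetic chain, with the expected immediate reward vector drawn at random for the test) computes $\bar{g}(\theta)$, the TD(0) fixed point $\theta^*$, and both sides of the identity; the closed-form expression for $\bar{g}$ is our own algebra from its definition in Section 2.2.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, gamma = 5, 2, 0.8
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))]); pi /= pi.sum()
D = np.diag(pi)
Phi = rng.standard_normal((n, K))
R = rng.standard_normal(n)                 # expected immediate rewards R(s)

def gbar(theta):
    # gbar(theta) = Phi^T D (R + gamma P Phi theta - Phi theta)
    return Phi.T @ D @ (R + gamma * P @ Phi @ theta - Phi @ theta)

# TD(0) fixed point theta*, i.e., the solution of gbar(theta*) = 0.
theta_star = np.linalg.solve(Phi.T @ D @ (np.eye(n) - gamma * P) @ Phi, Phi.T @ D @ R)

theta = rng.standard_normal(K)
w = Phi @ (theta - theta_star)             # V_theta - V_theta*
d_sq = w @ D @ w
dir_sq = 0.5 * np.sum(pi[:, None] * P * (w[None, :] - w[:, None]) ** 2)

# Corollary 1: (theta* - theta)^T gbar(theta) = (1-gamma)||.||_D^2 + gamma||.||_Dir^2
assert np.isclose((theta_star - theta) @ gbar(theta),
                  (1 - gamma) * d_sq + gamma * dir_sq)
```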
However, under our Markov observation model, it is not true that the samples are i.i.d.; rather, we proceed by modifying the analysis of so-called Markov chain gradient descent, analyzed in the papers (Sun et al., 2018; Johansson et al., 2010), where, to minimize the function $F(x) = \sum_i f_i(x)$, we have access to samples $f_{i_1}(\cdot), f_{i_2}(\cdot), \ldots$, with the sequence $i_k$ following a Markov chain.

One issue is the choice of step-size. The existing literature on temporal difference learning contains a range of possible step-sizes, from $O(1/t)$ to $O(1/\sqrt{T})$ (see (Bhandari et al., 2018; Dalal et al., 2018; Lakshminarayanan & Szepesvari, 2018)). A step-size that scales as $O(1/\sqrt{T})$ is often preferred because, for faster decaying step-sizes, performance will scale with the inverse of the smallest eigenvalue of $\Phi^T D \Phi$ or a related quantity, and these can be quite small.¹ This is not the case, however, for a step-size that decays like $O(1/\sqrt{T})$.

We will be using the standard notation $\bar{\theta}_T$ to denote the running average of the iterates $\theta_t$. We will be considering the projected TD(0)² update
$$\theta_{t+1} = \text{Proj}_\Theta\big(\theta_t + \alpha_t g_t(\theta_t)\big), \tag{8}$$
where $\Theta$ is a convex set containing the optimal solution $\theta^*$. Moreover, we will assume that the norm of every element in $\Theta$ is at most $R_\theta$. Setting $G = r_{\max} + 2R_\theta$, we have the following error bound.

**Corollary 2.** Suppose Assumptions 1-3 hold. Suppose further that $(\theta_t)_{t \ge 0}$ is generated by the projected TD algorithm of Eq. (8) with $\theta^* \in \Theta$ and $\alpha_0 = \cdots = \alpha_T = 1/\sqrt{T}$. Then
$$\mathbb{E}\Big[(1-\gamma)\|V_{\theta^*} - V_{\bar{\theta}_T}\|_D^2 + \gamma \|V_{\theta^*} - V_{\bar{\theta}_T}\|_{\text{Dir}}^2\Big] \le \frac{\|\theta^* - \theta_0\|_2^2 + G^2\big(9 + 12\,\tau^{\text{mix}}(1/\sqrt{T})\big)}{2\sqrt{T}}, \tag{9}$$
where $\tau^{\text{mix}}$ is standard notation for the mixing time of the Markov chain:
$$\tau^{\text{mix}}(\epsilon) = \min\{t \in \mathbb{N}, t \ge 1 \mid m\rho^t \le \epsilon\}.$$

¹For example, $\lambda$ in the theorems of Dalal et al. (2018), $\rho_d$ in the theorems of Lakshminarayanan & Szepesvari (2018), and $\omega$ in Theorem 3 of Bhandari et al. (2018).

²The challenges of analyzing Markovian samples and the reason for using a projected step are discussed in Section 8 of Bhandari et al. (2018).

The proof is available in the supplementary information. The bound is very similar to the standard bounds for SGD, with the exception of the $\tau^{\text{mix}}$ term. That term arises in the analysis of Markov chain gradient descent (Sun et al., 2018; Johansson et al., 2010). Informally, in Markov chain gradient descent, one has to wait $\tau^{\text{mix}}$ iterations to obtain a new independent sample of the gradient, which is why $\tau^{\text{mix}}$ enters the bound multiplicatively.

We next compare this bound to the existing literature. The closest comparison is Theorem 3(a) in (Bhandari et al., 2018), which shows that
$$\mathbb{E}\Big[\|V_{\theta^*} - V_{\bar{\theta}_T}\|_D^2\Big] \le \frac{\|\theta^* - \theta_0\|_2^2 + G^2\big(9 + 12\,\tau^{\text{mix}}(1/\sqrt{T})\big)}{2\sqrt{T}\,(1-\gamma)}. \tag{10}$$
Corollary 2 is stronger than this, because this bound can be derived from Corollary 2 by ignoring the second term on the left-hand side of Eq. (9). Moreover, we next argue that Corollary 2 is stronger in an interesting way, in that it offers a new insight on the behavior of temporal difference learning.

Observe that the upper bound of Eq. (10) blows up as $\gamma \to 1$. On the other hand, we can simply ignore the first term on the left-hand side of Corollary 2 to obtain
$$\mathbb{E}\Big[\|V_{\theta^*} - V_{\bar{\theta}_T}\|_{\text{Dir}}^2\Big] \le \frac{\|\theta^* - \theta_0\|_2^2 + G^2\big(9 + 12\,\tau^{\text{mix}}(1/\sqrt{T})\big)}{2\gamma\sqrt{T}}. \tag{11}$$
In particular, we see that $\mathbb{E}\big[\|V_{\theta^*} - V_{\bar{\theta}_T}\|_{\text{Dir}}^2\big]$ does not blow up as $\gamma \to 1$. To understand this, recall that the Dirichlet seminorm is equal to zero if and only if it is applied to a multiple of the all-ones vector. Consequently, $\|V\|_{\text{Dir}}$ is properly thought of as a way to measure the norm of the projection of $V$ onto $\mathbf{1}^\perp$.
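For completeness, here is a minimal sketch (ours) of the projected, averaged TD(0) scheme of Eq. (8) with the constant step-size $1/\sqrt{T}$; we take $\Theta$ to be a Euclidean ball of radius $R_\theta$, which is one convenient choice and not necessarily the set used in the paper's analysis.

```python
import numpy as np

def projected_averaged_td0(P, r, Phi, gamma, T, R_theta, seed=0):
    """Sketch of projected TD(0), Eq. (8), with step size 1/sqrt(T) and iterate averaging."""
    rng = np.random.default_rng(seed)
    n, K = Phi.shape
    alpha = 1.0 / np.sqrt(T)
    # start the chain in steady state, as in Assumption 3
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))]); pi /= pi.sum()
    s = rng.choice(n, p=pi)
    theta = np.zeros(K)
    theta_avg = np.zeros(K)
    for t in range(T):
        s_next = rng.choice(n, p=P[s])
        delta = r[s, s_next] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        theta = theta + alpha * delta * Phi[s]       # step in the direction g_t(theta_t)
        nrm = np.linalg.norm(theta)
        if nrm > R_theta:                            # Proj_Theta onto the ball of radius R_theta
            theta *= R_theta / nrm
        theta_avg += (theta - theta_avg) / (t + 1)   # running average bar{theta}_T
        s = s_next
    return theta_avg
```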
We therefore obtain the punchline of this section: the error of (averaged and projected) temporal difference learning, projected onto $\mathbf{1}^\perp$, does not blow up as the discount factor approaches 1.

There are scenarios where this is already interesting. For example, if TD(0) is used as a subroutine for policy evaluation, its output will be used in a policy improvement step, which is clearly unaffected by adding a multiple of the all-ones vector to the value function. Similarly, Proposition 4 of (Ollivier, 2018) shows that the bias in the policy gradient computed from an approximation $\hat{V}$ to the true value function $V$ can be bounded solely in terms of $\|V - \hat{V}\|_{\text{Dir}}^2$ (multiplied by a factor that depends on how the policies are parameterized).

It is natural to wonder whether the dependence on $1/(1-\gamma)$ can be removed completely from bounds on the performance of temporal difference learning (not just in terms of the projection onto $\mathbf{1}^\perp$). We address this next.

### 4.2. Mean-adjusted Temporal Difference Learning

Unfortunately, it is easy to see that the dependence on $1/(1-\gamma)$ in error bounds for temporal difference learning cannot be entirely removed. We next give an informal sketch explaining why this is so. We consider the case where samples $s, s'$ are i.i.d. with probability $\pi_s P(s,s')$ rather than coming from the Markov chain model, since this only makes the estimation problem easier.

**Estimating the mean of the value function.** Let us denote by $V$ the true value function; because it is the fixed point of the Bellman operator, $V = R + \gamma P V$, we have that
$$V = (I - \gamma P)^{-1} R = \left(\sum_{m=0}^{\infty} \gamma^m P^m\right) R.$$
Define $\bar{V} = \pi^T V$; then
$$\bar{V} = \pi^T V = \pi^T \left(\sum_{m=0}^{\infty} \gamma^m P^m\right) R. \tag{12}$$
Under i.i.d. sampling, what we have are samples of a random variable $\bar{R}$ which takes the value $r(s,s')$ with probability $\pi_s P(s,s')$. From Eq. (12), we have that $(1-\gamma)\bar{V} = \mathbb{E}[\bar{R}]$. From $T$ samples of the scalar random variable $\bar{R}$, the best estimate $\hat{R}_T$ will satisfy $\mathbb{E}\big[(\hat{R}_T - \mathbb{E}[\bar{R}])^2\big] = \Omega(1/T)$ in the worst case. This implies that the best estimator $\hat{\bar{V}}$ of $\bar{V}$ will satisfy $\mathbb{E}\big[(\hat{\bar{V}} - \bar{V})^2\big] = \Omega\big((1/(1-\gamma)^2)/T\big)$.

To summarize, the squared error in estimating just the mean of the value function will already scale with $1/(1-\gamma)$. If we consider, e.g., $\Phi$ to be the identity matrix, in which case $V_{\theta^*}$ is just equal to the true value function, it can easily be seen that it is not possible to estimate $V_{\theta^*}$ with an error that does not scale with $1/(1-\gamma)$.

**A better scaling with the discount factor.** Note, however, that the previous discussion implied that a term like $(1/(1-\gamma)^2)/T$ in a bound on the squared error is unavoidable. But with a $1/\sqrt{T}$ step-size, the error will in general decay more slowly, as $1/\sqrt{T}$, as in Corollary 2. Is it possible to derive a bound where the only scaling with $1/(1-\gamma)$ is in the asymptotically negligible $O(1/T)$ term?

As we show next, this is indeed possible. The algorithm is very natural given the discussion in the previous subsection: we run projected and averaged TD(0) and estimate the mean of the value function separately, adjusting the outcome of TD(0) to have the right mean in the end. Building on Corollary 2, the idea is that the mean will have expected squared error that scales with $(1/(1-\gamma)^2)/T$, while the temporal difference method will estimate the projection onto $\mathbf{1}^\perp$ without blowing up as $\gamma \to 1$. The pseudocode of the algorithm is given next as Algorithm 1, and Corollary 3 bounds its performance. Note that, in contrast to the bounds in the last subsection, Corollary 3 bounds the error to the true value function $V$ directly. This is a more natural bound for this algorithm, which tries to directly match the mean of the true value function.

**Algorithm 1: Mean-adjusted TD(0)**

1. Initialize $A_0 = 0$, $s_0 \sim \pi$, and some initial condition $\theta_0$.
2. For $t = 0$ to $T-1$ do:
3. $\quad$ Projected TD(0) update: $\theta_{t+1} = \mathrm{Proj}_\Theta\big(\theta_t + \alpha_t g_t(\theta_t)\big)$.
4. $\quad$ Keep track of the average reward: $A_{t+1} = \dfrac{t A_t + r_{t+1}}{t+1}$.
5. End for.
6. Set $\hat{\bar{V}}_T = \dfrac{A_T}{1-\gamma}$.
7. Output $\bar{V}_T = V_{\bar{\theta}_T} + \big(\hat{\bar{V}}_T - \pi^T V_{\bar{\theta}_T}\big)\mathbf{1}$.
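A direct translation of Algorithm 1 into code might look as follows (a sketch, ours; as before, we take $\Theta$ to be a Euclidean ball of radius $R_\theta$, which is our own choice of projection set).

```python
import numpy as np

def mean_adjusted_td0(P, r, Phi, gamma, T, R_theta, seed=0):
    """Sketch of Algorithm 1 (mean-adjusted TD(0)) with step size 1/sqrt(T)."""
    rng = np.random.default_rng(seed)
    n, K = Phi.shape
    alpha = 1.0 / np.sqrt(T)
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))]); pi /= pi.sum()

    theta, theta_avg, A = np.zeros(K), np.zeros(K), 0.0
    s = rng.choice(n, p=pi)                       # line 1: s_0 ~ pi
    for t in range(T):
        s_next = rng.choice(n, p=P[s])
        reward = r[s, s_next]
        # line 3: projected TD(0) update
        delta = reward + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        theta = theta + alpha * delta * Phi[s]
        nrm = np.linalg.norm(theta)
        if nrm > R_theta:
            theta *= R_theta / nrm
        # line 4: running average of the rewards, A_{t+1} = (t A_t + r_{t+1}) / (t+1)
        A = (t * A + reward) / (t + 1)
        theta_avg += (theta - theta_avg) / (t + 1)   # running average bar{theta}_T
        s = s_next
    V_hat = A / (1.0 - gamma)                     # line 6: estimate of the mean value
    V_td = Phi @ theta_avg                        # V_{bar{theta}_T}
    return V_td + (V_hat - pi @ V_td) * np.ones(n)   # line 7: adjust the mean
```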
**Corollary 3.** Suppose that $(\theta_t)_{t \ge 0}$ and $\bar{V}_T$ are generated by Algorithm 1 with step-sizes $\alpha_0 = \cdots = \alpha_T = 1/\sqrt{T}$. Suppose further that $\Theta$ is a convex set that contains $\theta^*$. Let $t_0$ be the largest integer which satisfies $t_0 \le 2\tau^{\text{mix}}\big(\frac{1}{2(t_0+1)}\big)$. Then, as long as $T \ge t_0$, we will have
$$\mathbb{E}\Big[\|\bar{V}_T - V\|_D^2\Big] \le \mathbb{E}\Big[\|V_{\theta^*} - V\|_D^2\Big] + \frac{r_{\max}^2 \, \tau^{\text{mix}}\big(\frac{1}{2(T+1)}\big)}{(1-\gamma)^2 (T+1)} + \min\left\{\frac{1}{1-\gamma},\, r(P)\right\} \cdot \frac{\|\theta^* - \theta_0\|_2^2 + G^2\big(1 + \tau^{\text{mix}}(1/\sqrt{T})\big)}{\sqrt{T}}.$$
Here $r(P)$ is the inverse spectral gap of the additive reversibilization of the transition matrix $P$, defined as follows:
$$r(P) = \frac{1}{1 - \lambda_2(Q)},$$
where $\lambda_2(Q)$ is the second-largest eigenvalue of the matrix $Q$, in turn defined as
$$Q = \frac{P + P^*}{2},$$
where, finally, $P^*$ denotes the transition matrix of the reversed Markov chain, i.e., $P^*(j|i) = \frac{\pi(j)}{\pi(i)} P(i|j)$.

Let us parse the bound of Corollary 3. The bound has three terms. The first term is just the difference between the limit of TD(0) and the true value function; such a term is inevitable in any TD(0)-based method that compares its performance to the true value function (rather than to $V_{\theta^*}$). The second term comes from the error in mean estimation; as described earlier, scaling with $(1/(1-\gamma))^2/T$ is inevitable here. The multiplicative factor of $\tau^{\text{mix}}$ is present because, due to the Markovian sampling, it can take $\tau^{\text{mix}}$ steps to obtain an independent sample of the mean. Finally, the last term of the bound of Corollary 3 is the one that scales as $\tilde{O}(1/\sqrt{T})$; compared to it, the second term is negligible for large $T$. Crucially, this term does not blow up as $\gamma \to 1$, so that the only blowup occurs in the asymptotically negligible second term. Indeed, observe that while $1/(1-\gamma)$ appears in the third term, it appears as a minimum of $1/(1-\gamma)$ and a quantity that depends on the matrix $P$, so that it does not blow up as $\gamma \to 1$.

Note that, unlike the previous bounds discussed in this paper, this bound does depend on an eigenvalue gap associated with (a function of) the matrix $P$. However, this dependence is in such a way that it only helps: when $1/(1-\gamma)$ is small, there is no dependence on the eigenvalue gap, and it is only when $\gamma \to 1$ that performance saturates at something depending on $P$.

## 5. Conclusion

We have provided an interpretation of temporal difference learning in terms of a splitting of gradient descent. As a consequence of this interpretation, analyses of gradient descent can apply to temporal difference learning almost verbatim. We have exploited this interpretation to observe that temporal difference methods learn the projection of the value function onto $\mathbf{1}^\perp$ without any blowup as $\gamma \to 1$; by contrast, previous work tended to have error bounds that scaled with $1/(1-\gamma)$. While, as we explain, it is not possible to remove the dependence on $1/(1-\gamma)$ in general, we provide an error bound for a simple modification of TD(0) where the only dependence on $1/(1-\gamma)$ is in the asymptotically negligible term.

An open problem might be to improve the scaling of the bounds we have obtained in this paper with $P$. Our focus has been on scaling with $1/(1-\gamma)$, but one could further ask what dependence on the transition matrix $P$ is optimal.
It is natural to wonder, for example, whether the $r(P)$ factor in the last term of Corollary 3, which measures how the performance saturates as $\gamma \to 1$, could be improved. Typically, error bounds in this setting scale with the spectral gap of $P$, which can be much smaller than $r(P)$. We thus do not believe our bound is tight.

More broadly, if temporal difference learning is a splitting of gradient descent, this opens up several possibilities for future work. For example, one might wonder whether there are other, more attractive, ways to split gradient descent amenable to a bootstrapped interpretation. Alternatively, instead of splitting gradient descent, one might attempt to split mirror descent whenever there are further constraints on the parameter $\theta$.

## References

Aldous, D. and Fill, J. Reversible Markov chains and random walks on graphs, 1995.

Bhandari, J., Russo, D., and Singal, R. A finite time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory, pp. 1691-1692, 2018.

Chen, C., Seff, A., Kornhauser, A., and Xiao, J. DeepDriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2722-2730, 2015.

Dalal, G., Szörényi, B., Thoppe, G., and Mannor, S. Finite sample analysis for TD(0) with linear function approximation. In The Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Diaconis, P. and Saloff-Coste, L. Logarithmic Sobolev inequalities for finite Markov chains. The Annals of Applied Probability, 6(3):695-750, 1996.

Gu, S., Holly, E., Lillicrap, T., and Levine, S. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3389-3396. IEEE, 2017.

Jin, J., Song, C., Li, H., Gai, K., Wang, J., and Zhang, W. Real-time bidding with multi-agent reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 2193-2201, 2018.

Johansson, B., Rabi, M., and Johansson, M. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157-1170, 2010.

Korda, N. and La, P. On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. In International Conference on Machine Learning, pp. 626-634, 2015.

Lakshminarayanan, C. and Szepesvari, C. Linear stochastic approximation: How far does constant step-size and iterate averaging go? In International Conference on Artificial Intelligence and Statistics, pp. 1347-1355, 2018.

Levin, D. A. and Peres, Y. Markov Chains and Mixing Times, volume 107. American Mathematical Society, 2017.

Maei, H. R. Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta, 2011.

Narayanan, C. and Szepesvári, C. Finite time bounds for temporal difference learning with function approximation: Problems with some state-of-the-art results. Technical report, 2017.

Ollivier, Y. Approximate temporal difference learning is a gradient descent for reversible policies. arXiv preprint arXiv:1805.00869, 2018.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400-407, 1951.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.
Srikant, R. and Ying, L. Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory, pp. 2803-2830, 2019.

Sun, T., Sun, Y., and Yin, W. On Markov chain gradient descent. In Advances in Neural Information Processing Systems, pp. 9896-9905, 2018.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Tsitsiklis, J. N. and Van Roy, B. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

Xiong, H., Xu, T., Liang, Y., and Zhang, W. Non-asymptotic convergence of Adam-type reinforcement learning algorithms under Markovian sampling. arXiv preprint arXiv:2002.06286, 2020.

Xu, T., Wang, Z., Zhou, Y., and Liang, Y. Reanalysis of variance reduced temporal difference learning. In International Conference on Learning Representations, 2019a.

Xu, T., Zou, S., and Liang, Y. Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. In Advances in Neural Information Processing Systems, pp. 10633-10643, 2019b.