Journal of Machine Learning Research 15 (2014) 289-333. Submitted 9/11; Revised 4/13; Published 1/14.

Off-policy Learning With Eligibility Traces: A Survey

Matthieu Geist (matthieu.geist@supelec.fr)
IMS-MaLIS Research Group & UMI 2958 (Georgia Tech-CNRS), Supélec, 2 rue Edouard Belin, 57070 Metz, France

Bruno Scherrer (bruno.scherrer@inria.fr)
MAIA project-team, INRIA Lorraine, 615 rue du Jardin Botanique, 54600 Villers-lès-Nancy, France

Editor: Ronald Parr

Abstract

In the framework of Markov Decision Processes, we consider linear off-policy learning, that is, the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated by some other policy. We briefly review the on-policy learning algorithms of the literature (gradient-based and least-squares-based), adopting a unified algorithmic view. Then, we highlight a systematic approach for adapting them to off-policy learning with eligibility traces. This leads to some known algorithms (off-policy LSTD(λ), LSPE(λ), TD(λ), TDC/GQ(λ)) and suggests new extensions (off-policy FPKF(λ), BRM(λ), gBRM(λ), GTD2(λ)). We describe a comprehensive algorithmic derivation of all algorithms in a recursive and memory-efficient form, discuss their known convergence properties, and illustrate their relative empirical behavior on Garnet problems. Our experiments suggest that the most standard algorithms perform the best: on- and off-policy LSTD(λ)/LSPE(λ), and TD(λ) if the feature space dimension is too large for a least-squares approach.

Keywords: reinforcement learning, value function estimation, off-policy learning, eligibility traces

1. Introduction

We consider the problem of learning a linear approximation of the value function of some fixed policy in a Markov Decision Process (MDP) framework.
This study is performed in the most general situation, where learning must be done from a single trajectory possibly generated by some other policy, also known as off-policy learning. Given samples, well-known methods for estimating a value function are temporal difference (TD) learning and Monte Carlo (Sutton and Barto, 1998). TD learning with eligibility traces (Sutton and Barto, 1998), known as TD(λ), constitutes a nice bridge between both approaches; by controlling the bias/variance trade-off (Kearns and Singh, 2000), their use can significantly speed up learning. When the value function is approximated through a linear architecture, the depth λ of the eligibility traces is also known to control the quality of approximation (Tsitsiklis and Van Roy, 1997). Overall, the use of these traces often plays an important practical role.

© 2014 Matthieu Geist and Bruno Scherrer.

Table 1: Taxonomy of linearly parameterized estimators for value function approximation (Geist and Pietquin, 2013).

| | gradient-based | least-squares-based |
| --- | --- | --- |
| bootstrapping | TD (Sutton and Barto, 1998) | FPKF (Choi and Van Roy, 2006) |
| residual | gBRM (Baird, 1995) | BRM (Engel, 2005; Geist and Pietquin, 2010b) |
| projected fixed point | TDC/GTD2 (Sutton et al., 2009) | LSTD (Bradtke and Barto, 1996); LSPE (Nedić and Bertsekas, 2003) |

There has been a significant amount of research on parametric linear approximation of the value function, without eligibility traces (in the on- or off-policy case). We follow the taxonomy proposed by Geist and Pietquin (2013), briefly recalled in Table 1 and further developed in Section 2. Value function approximators can be categorized depending on the cost function they minimize (based on bootstrapping, on Bellman residual minimization, or on a projected fixed-point approach) and on how it is minimized (gradient descent or linear least-squares). Most of these algorithms have been extended to take eligibility traces into account, in the on-policy case.
Works on extending these eligibility-trace approaches to off-policy learning are scarcer. They are summarized in Table 2. The first motivation of this article is to argue that it is conceptually simple to extend all the algorithms of Table 1 so that they can be applied to the off-policy setting and use eligibility traces. While this allows re-deriving existing algorithms, it also leads to new candidate algorithms (marked as new in Table 2). The second motivation of this work is to discuss the subtle differences between these intimately related algorithms, and to provide some comparative insights on their empirical behavior (a topic that, to our knowledge, has not been considered in the literature, even in the simplest on-policy and no-trace situation).

Table 2: Surveyed off-policy eligibility-trace approaches. Algorithms with a reference have been published before; algorithms marked (new) are introduced here.

| | gradient-based | least-squares-based |
| --- | --- | --- |
| bootstrapping | off-policy TD(λ) (Bertsekas and Yu, 2009) | off-policy FPKF(λ) (new) |
| residual | off-policy gBRM(λ) (new) | off-policy BRM(λ) (new) |
| projected fixed point | GQ(λ), a.k.a. off-policy TDC(λ) (Maei and Sutton, 2010); off-policy GTD2(λ) (new) | off-policy LSTD(λ) (Yu, 2010a); off-policy LSPE(λ) (Yu, 2010a) |

The rest of this article is organized as follows. Section 2 introduces the background of Markov Decision Processes, describes the state-of-the-art algorithms for learning without eligibility traces, and gives the fundamental idea for extending these methods to the off-policy situation with eligibility traces. Section 3 details this extension for the least-squares approaches: the resulting algorithms are formalized, we derive recursive and memory-efficient formulae for their implementation (this allows online learning without loss of generality, especially since half of these algorithms are recursive by their very definition), and we discuss their convergence properties.
Section 4 does the same job for stochastic gradient approaches, which offer a smaller computational cost (linear per update, instead of quadratic). Last but not least, Section 5 describes an empirical comparison and Section 6 concludes.

2. Background

We consider a Markov Decision Process (MDP), that is, a tuple $\{S, A, P, R, \gamma\}$ in which $S$ is a finite state space identified with $\{1, 2, \dots, N\}$, $A$ a finite action space, $P \in \mathcal{P}(S)^{S \times A}$ the set of transition probabilities, $R \in \mathbb{R}^{S \times A}$ the reward function and $\gamma$ the discount factor. A mapping $\pi \in \mathcal{P}(A)^S$ is called a policy. For any policy $\pi$, let $P^\pi$ be the corresponding stochastic transition matrix, and $R^\pi$ the vector of mean rewards when following $\pi$, that is, of components $E_{a|\pi,s}[R(s,a)]$. The value $V^\pi(s)$ of state $s$ for a policy $\pi$ is the expected discounted cumulative reward starting in state $s$ and then following the policy $\pi$:
$$V^\pi(s) = E_\pi\left[\sum_{i=0}^{\infty} \gamma^i r_i \,\middle|\, s_0 = s\right],$$
where $E_\pi$ denotes the expectation over trajectories induced by policy $\pi$. The value function satisfies the (linear) Bellman equation:
$$\forall s, \quad V^\pi(s) = E_{s',a|s,\pi}\left[R(s,a) + \gamma V^\pi(s')\right].$$
It can be rewritten as the fixed point of the Bellman evaluation operator: $V^\pi = T^\pi V^\pi$, where for all $V$, $T^\pi V = R^\pi + \gamma P^\pi V$. In this article, we are interested in learning an approximation of this value function $V^\pi$ under some constraints. First, we assume our approximation to be linearly parameterized: $\forall s, \hat{V}_\theta(s) = \theta^T \phi(s)$, with $\theta \in \mathbb{R}^p$ being the parameter vector and $\phi(s) \in \mathbb{R}^p$ the feature vector in state $s$. This notably encompasses the tabular case (exact representation of the value function). Also, we want to estimate the value function $V^\pi$ (or equivalently the associated parameter $\theta$) from a single finite trajectory generated using a possibly different behavior policy $\pi_0$.¹ Let $\mu_0$ be the stationary distribution of the stochastic matrix $P_0 = P^{\pi_0}$ of the behavior policy $\pi_0$ (we assume it exists and is unique). Let $D_0$ be the diagonal matrix whose elements are $(\mu_0(s_i))_{1 \le i \le N}$.
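As a concrete illustration, in this finite setting the Bellman equation can be solved directly as a linear system; the 3-state chain below (transition matrix and rewards) is invented for illustration:

```python
import numpy as np

# The value function of a fixed policy pi solves the linear Bellman equation
# V = R_pi + gamma * P_pi V, i.e. V = (I - gamma * P_pi)^{-1} R_pi.
gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])   # stochastic transition matrix P^pi (invented)
R_pi = np.array([1.0, 0.0, 2.0])     # mean reward vector R^pi (invented)

V = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# V is the unique fixed point of the Bellman operator T^pi V = R_pi + gamma * P_pi V
assert np.allclose(R_pi + gamma * P_pi @ V, V)
```

This exact solve is only possible when $P^\pi$ and $R^\pi$ are known and $N$ is small; the algorithms surveyed here estimate $V^\pi$ from sampled transitions instead.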
Let $\Phi$ be the matrix of feature vectors: $\Phi = [\phi(1) \dots \phi(N)]^T$.

¹ This can easily be extended to multiple finite trajectories.

As we consider a linear approximation, the considered value functions belong to the space spanned by $\Phi$. The projection $\Pi_0$ onto this hypothesis space with respect to the $\mu_0$-weighted quadratic norm, which will be central to the understanding of the algorithms, has the following closed form:
$$\Pi_0 = \Phi (\Phi^T D_0 \Phi)^{-1} \Phi^T D_0.$$
If $\pi_0$ is different from $\pi$, this is called an off-policy setting. Notice that all algorithms considered in this paper use this $\Pi_0$ projection operator, that is, the projection according to the observed data.² It would certainly be interesting to consider the projection according to the stationary distribution of $\pi$, the (unobserved) target policy: this would reduce off-policy learning to on-policy learning. However, this would require re-weighting samples according to the stationary distribution of the target policy $\pi$, which is unknown and probably as difficult to estimate as the value function itself.

2.1 Standard Algorithms For On-policy Learning Without Traces

We now review existing on-policy linearly parameterized temporal difference learning algorithms (see Table 1). In this case, the behavior and target policies are the same, so we omit the subscript 0 for the policy ($\pi$) and the projection ($\Pi$). We assume that a trajectory $(s_1, a_1, r_1, s_2, \dots, s_i, a_i, r_i, s_{i+1}, \dots, s_n, a_n, r_n, s_{n+1})$ sampled according to the policy $\pi$ is available, and we will explain how to compute the $i$th iterate for several algorithms. For all $j \le i$, let us introduce the empirical Bellman operator at step $j$:
$$\hat{T}_j : V \mapsto r_j + \gamma V(s_{j+1}),$$
so that $\hat{T}_j V$ is an unbiased estimate of $TV(s_j)$. Projected fixed-point approaches aim at finding the fixed point of the operator obtained by composing the projection onto the hypothesis space with the Bellman operator.
In other words, they search for the fixed point $\hat{V}_\theta = \Pi T \hat{V}_\theta$, $\Pi$ being the just-introduced projection operator. Solving the following fixed-point problem with a least-squares approach,
$$\theta_i = \operatorname{argmin}_{\omega \in \mathbb{R}^p} \sum_{j=1}^{i} \left( \hat{T}_j \hat{V}_{\theta_i} - \hat{V}_\omega(s_j) \right)^2,$$
corresponds to the Least-Squares Temporal Differences (LSTD) algorithm of Bradtke and Barto (1996). Recently, Sutton et al. (2009) proposed two algorithms reaching the same objective, Temporal Difference with gradient Correction (TDC) and Gradient Temporal Difference 2 (GTD2), by performing a stochastic gradient descent on the function
$$\theta \mapsto \left\| \hat{V}_\theta - \Pi T \hat{V}_\theta \right\|^2,$$
which is minimal (and equal to 0) when $\hat{V}_\theta = \Pi T \hat{V}_\theta$.

² As far as we know, there are two notable exceptions. Precup et al. (2001) propose an algorithm that updates parameters according to full trajectories (not according to transitions, as all approaches to be reviewed next). Therefore, the distribution weighting the projection operator is the one of the starting states of these trajectories, instead of the one induced by the behavior policy. Another work that moves in a different direction is the off-policy approach of Kolter (2011): samples are weighted such that the projection operator composed with the Bellman operator is non-expansive; this is weaker than projecting according to the stationary distribution of the target policy, but offers some guarantees. In this article, we consider only the $\Pi_0$ projection.

A related approach consists in building a recursive algorithm that repeatedly mimics the iteration $\hat{V}_{\theta_i} \approx \Pi T \hat{V}_{\theta_{i-1}}$. In practice, we aim at minimizing
$$\sum_{j=1}^{i} \left( \hat{T}_j \hat{V}_{\theta_{i-1}} - \hat{V}_\omega(s_j) \right)^2.$$
Performing the minimization exactly through a least-squares method leads to the Least-Squares Policy Evaluation (LSPE) algorithm of Bertsekas and Ioffe (1996). If this minimization is approximated by a stochastic gradient descent, this leads to the classical Temporal Difference (TD) algorithm (Sutton and Barto, 1998).
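As an illustration of the TD algorithm just mentioned, here is a minimal sketch of the tabular TD(0) update; the 2-state chain, step size and run length are invented for illustration:

```python
import numpy as np

# Tabular TD(0): after each transition (s, r, s'), take a stochastic-gradient step
#   theta <- theta + alpha * (r + gamma * theta^T phi(s') - theta^T phi(s)) * phi(s).
gamma, alpha = 0.9, 0.1
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])           # invented 2-state chain
R = np.array([1.0, 0.0])             # deterministic reward per state
Phi = np.eye(2)                      # tabular features (exact representation)

rng = np.random.default_rng(6)
theta = np.zeros(2)
s = 0
for _ in range(20_000):
    s_next = rng.choice(2, p=P[s])
    delta = R[s] + gamma * theta @ Phi[s_next] - theta @ Phi[s]   # TD error
    theta = theta + alpha * delta * Phi[s]                        # gradient step
    s = s_next

V_true = np.linalg.solve(np.eye(2) - gamma * P, R)   # exact solution (5.5, 4.5)
assert np.max(np.abs(theta - V_true)) < 0.5
```

With a constant step size the iterates fluctuate around the exact value function, hence the loose tolerance; a decaying step size would give convergence.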
Bootstrapping approaches consist in treating value function approximation after seeing the $i$th transition as a supervised learning problem, by replacing the unobserved values $V^\pi(s_j)$ at states $s_j$ by some estimate computed from the trajectory up to the transition $(s_j, s_{j+1})$, the best such estimate being $\hat{T}_j \hat{V}_{\theta_{j-1}}$. This amounts to minimizing the following function:
$$\sum_{j=1}^{i} \left( \hat{T}_j \hat{V}_{\theta_{j-1}} - \hat{V}_\omega(s_j) \right)^2. \quad (1)$$
Choi and Van Roy (2006) proposed the Fixed-Point Kalman Filter (FPKF), a least-squares variation of TD that minimizes exactly the function of Equation (1). If the minimization is approximated by a stochastic gradient descent, this gives again the classical TD algorithm (Sutton and Barto, 1998). Finally, residual approaches aim at minimizing the distance between the value function and its image through the Bellman operator, $\|V - TV\|^2_{\mu_0}$. Based on a trajectory, this suggests minimizing
$$\sum_{j=1}^{i} \left( \hat{T}_j \hat{V}_\omega - \hat{V}_\omega(s_j) \right)^2,$$
which is a surrogate of the objective $\|V - TV\|^2_{\mu_0}$ that is biased (Antos et al., 2006). This cost function was originally proposed by Baird (1995), who minimized it using a stochastic gradient approach (this algorithm being referred to here as gBRM, for gradient-based Bellman Residual Minimization). Both the parametric Gaussian Process Temporal Differences (GPTD) algorithm of Engel (2005) and the linear Kalman Temporal Differences (KTD) algorithm of Geist and Pietquin (2010b) can be shown to minimize the above cost using a least-squares approach, and are thus the very same algorithm, which we will refer to as BRM (for Bellman Residual Minimization) in the remainder of this paper.³ To sum up, it thus appears that after the $i$th transition has been observed, the above-mentioned algorithms behave according to the following pattern: move from $\theta_{i-1}$ to $\theta_i$ towards the minimum of
$$\omega \mapsto \sum_{j=1}^{i} \left( \hat{T}_j \hat{V}_\xi - \hat{V}_\omega(s_j) \right)^2, \quad (2)$$
either through a least-squares approach or a stochastic gradient descent.
Each of the algorithms mentioned above is obtained by substituting $\theta_i$, $\theta_{i-1}$, $\theta_{j-1}$ or $\omega$ for $\xi$.

³ Note that this is only true in the linear case. GPTD and KTD were both introduced in a more general setting: GPTD is non-parametric and KTD is motivated by the goal of handling nonlinearities.

2.2 Towards Off-policy Learning With Traces

It is now easy to preview, at least at a high level, how one may extend the previously described algorithms so that they can deal with eligibility traces and off-policy learning.

Eligibility Traces. The idea of eligibility traces amounts to looking for the fixed point of the following variation of the Bellman operator (Bertsekas and Tsitsiklis, 1996):
$$\forall V \in \mathbb{R}^S, \quad T^\lambda V = (1 - \lambda) \sum_{k=0}^{\infty} \lambda^k T^{k+1} V,$$
which makes a geometric average with parameter $\lambda \in (0, 1)$ of the powers of the original Bellman operator $T$. Clearly, any fixed point of $T$ is a fixed point of $T^\lambda$ and vice versa. After some simple algebra, one can see that:
$$T^\lambda V = (I - \lambda\gamma P)^{-1} \left( R + (1 - \lambda)\gamma P V \right) = V + (I - \lambda\gamma P)^{-1} \left( R + \gamma P V - V \right). \quad (3)$$
This leads to the following well-known temporal difference expression in some state $s$:
$$T^\lambda V(s) = V(s) + E_\pi\left[ \sum_{k=i}^{\infty} (\gamma\lambda)^{k-i} \left( r_k + \gamma V(s_{k+1}) - V(s_k) \right) \,\middle|\, s_i = s \right] = V(s) + \sum_{k=i}^{\infty} (\gamma\lambda)^{k-i} \delta_{ik}(s),$$
where we recall that $E_\pi$ means that the expectation is taken according to the target policy $\pi$, and where
$$\delta_{ik}(s) = E_\pi\left[ r_k + \gamma V(s_{k+1}) - V(s_k) \,\middle|\, s_i = s \right]$$
is the expected temporal difference (Sutton and Barto, 1998). With $\lambda = 0$, we recover the Bellman evaluation equation. With $\lambda = 1$, this is the definition of the value function as the expected discounted cumulative reward: $T^1 V(s) = E_\pi[\sum_{k=i}^{\infty} \gamma^{k-i} r_k \,|\, s_i = s]$.

Off-policy Learning. As before, we assume that we are given a trajectory $(s_1, a_1, r_1, s_2, \dots, s_j, a_j, r_j, s_{j+1}, \dots, s_n, a_n, r_n, s_{n+1})$, except that it may now be generated from some behavior policy possibly different from the target policy $\pi$ whose value we want to estimate. We are going to describe how to compute the $i$th iterate for several algorithms.
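The closed form of Equation (3) can be checked numerically; the 3-state chain below is invented for illustration:

```python
import numpy as np

# The lambda-averaged Bellman operator in the closed form of Equation (3):
#   T^lam V = (I - lam*gamma*P)^{-1} (R + (1 - lam)*gamma*P V).
gamma, lam = 0.9, 0.6
P = np.array([[0.2, 0.8, 0.0],
              [0.0, 0.2, 0.8],
              [0.8, 0.0, 0.2]])    # invented stochastic matrix of the target policy
R = np.array([0.0, 1.0, -1.0])

def T_lambda(V):
    return np.linalg.solve(np.eye(3) - lam * gamma * P,
                           R + (1.0 - lam) * gamma * P @ V)

V_pi = np.linalg.solve(np.eye(3) - gamma * P, R)   # fixed point of T
assert np.allclose(T_lambda(V_pi), V_pi)           # hence also a fixed point of T^lam

# The second form of Equation (3) agrees: T^lam V = V + (I - lam*gamma*P)^{-1}(R + gamma*P V - V)
V = np.array([1.0, 2.0, 3.0])
alt = V + np.linalg.solve(np.eye(3) - lam * gamma * P, R + gamma * P @ V - V)
assert np.allclose(T_lambda(V), alt)
```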
For any $k \ge i$, unbiased estimates of the temporal difference terms $\delta_{ik}(s_i)$ can be computed through importance sampling (Ripley, 1987). Indeed, for all $(s, a)$, let us introduce the following weight:
$$\rho(s, a) = \frac{\pi(a|s)}{\pi_0(a|s)}.$$
In our trajectory context, for any $j$ and $k$, write
$$\rho_j^k = \prod_{l=j}^{k} \rho_l \quad \text{with} \quad \rho_l = \rho(s_l, a_l),$$
with the convention that $\rho_j^k = 1$ if $k < j$. With these notations,
$$\hat{\delta}_{ik} = \rho_i^k \hat{T}_k V - \rho_i^{k-1} V(s_k)$$
is an unbiased estimate of $\delta_{ik}(s_i)$, from which we may build an estimate $\hat{T}^\lambda_{j,i} V$ of $T^\lambda V(s_j)$ (we will describe this construction separately for the least-squares and the stochastic gradient approaches, as they slightly differ). Then, by replacing the empirical operator $\hat{T}_j$ in Equation (2) by $\hat{T}^\lambda_{j,i}$, we get the general pattern for off-policy trace-based algorithms: move from $\theta_{i-1}$ to $\theta_i$ towards the minimum of
$$\omega \mapsto \sum_{j=1}^{i} \left( \hat{T}^\lambda_{j,i} \hat{V}_\xi - \hat{V}_\omega(s_j) \right)^2, \quad (4)$$
either through a least-squares approach or a stochastic gradient descent, after having instantiated $\xi = \theta_i$, $\theta_{i-1}$, $\theta_{j-1}$ or $\omega$. This process, including in particular the precise definition of the empirical operator $\hat{T}^\lambda_{j,i}$, will be further developed in the next two sections.⁴ Since they are easier to derive, we begin by focusing on least-squares algorithms (right column of Table 2) in Section 3. Then, Section 4 focuses on stochastic gradient-based algorithms (left column of Table 2).

3. Least-squares Extensions To Eligibility Traces And Off-policy Learning

First, we consider the least-squares solution to the problem described in Equation (4). At their $i$th step, the algorithms that we are about to describe will compute the parameter $\theta_i$ by exactly solving the following problem:
$$\theta_i = \operatorname{argmin}_{\omega \in \mathbb{R}^p} \sum_{j=1}^{i} \left( \hat{T}^\lambda_{j,i} \hat{V}_\xi - \hat{V}_\omega(s_j) \right)^2,$$
where we define the following empirical truncated approximation of $T^\lambda$:
$$\hat{T}^\lambda_{j,i} : V \in \mathbb{R}^S \mapsto V(s_j) + \sum_{k=j}^{i} (\gamma\lambda)^{k-j} \hat{\delta}_{jk} = V(s_j) + \sum_{k=j}^{i} (\gamma\lambda)^{k-j} \left( \rho_j^k \hat{T}_k V - \rho_j^{k-1} V(s_k) \right).$$
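The importance-sampling correction underlying these estimates can be checked numerically; in this minimal sketch, the behavior and target policies over three actions (for a single state) are invented for illustration:

```python
import numpy as np

# With rho(a) = pi(a|s) / pi_0(a|s), an expectation under the target policy pi
# can be estimated from actions drawn under the behavior policy pi_0.
rng = np.random.default_rng(3)
pi_0 = np.array([0.5, 0.3, 0.2])   # behavior policy (invented)
pi = np.array([0.1, 0.3, 0.6])     # target policy (invented)
f = np.array([1.0, -2.0, 4.0])     # arbitrary function of the action

actions = rng.choice(3, size=200_000, p=pi_0)
rho = pi[actions] / pi_0[actions]
estimate = np.mean(rho * f[actions])

# The weighted empirical mean matches E_pi[f] = pi . f up to sampling noise.
assert abs(estimate - pi @ f) < 0.1
```

Note that although the correction is unbiased, the weights inflate the variance, which is exactly the bias/variance tension that the trace depth λ mediates.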
Though different definitions of this operator might lead to practical implementations, note that $\hat{T}^\lambda_{j,i}$ only uses samples observed before time $i$: this very feature, shared by all existing works in the literature, will enable us to derive recursive and low-memory algorithms. Recall that a linear parameterization is chosen here, $\hat{V}_\xi(s_i) = \xi^T \phi(s_i)$. We adopt the following notations: $\phi_i = \phi(s_i)$, $\Delta\phi_i = \phi_i - \gamma \rho_i \phi_{i+1}$ and $\tilde{\rho}_j^{k-1} = (\gamma\lambda)^{k-j} \rho_j^{k-1}$.

⁴ Note that we let the empirical operator $\hat{T}^\lambda_{j,i}$ depend on the index $j$ of the sample (as before) but also on the step $i$ of the algorithm. This will be particularly useful for the derivation of the recursive and memory-efficient least-squares algorithms that we present in the next section.

The generic cost function to be solved is therefore:
$$\theta_i = \operatorname{argmin}_{\omega \in \mathbb{R}^p} J(\omega; \xi) \quad \text{with} \quad J(\omega; \xi) = \sum_{j=1}^{i} \left( \phi_j^T \xi + \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \xi \right) - \phi_j^T \omega \right)^2. \quad (5)$$

Before deriving existing and new least-squares algorithms, as announced, some technical lemmas are required. The first lemma allows computing directly the inverse of a rank-one perturbed matrix.

Lemma 1 (Sherman-Morrison). Assume that $A$ is an invertible $n \times n$ matrix and that $u, v \in \mathbb{R}^n$ are two vectors satisfying $1 + v^T A^{-1} u \neq 0$. Then:
$$(A + u v^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}.$$

The next lemma is simply a rewriting of nested sums. However, it is quite important here, as it allows stepping from the operator $\hat{T}^\lambda_{j,i}$ (an operator that depends on the future of $s_j$, and is thus acausal), corresponding to the forward view of eligibility traces, to the recursion over parameters using eligibility traces (which depends only on past samples), corresponding to the backward view of eligibility traces. In other words, the forward view is a theoretical way of mixing backups that shifts parametrically (through the choice of λ) from the standard Bellman operator to the Monte Carlo one. However, it cannot be implemented easily, as it requires knowing the future states.
On the other hand, the backward view, which is equivalent (see notably Lemma 2 and Proposition 6), is a more mechanistic and convenient viewpoint that allows performing the same updates using solely information gathered in the states encountered in the past. See Sutton and Barto (1998, Chapter 7) for further discussion of the backward/forward views.

Lemma 2. Let $f \in \mathbb{R}^{N \times N}$ and $n \le N$. We have:
$$\sum_{i=1}^{n} \sum_{j=i}^{n} f(i, j) = \sum_{i=1}^{n} \sum_{j=1}^{i} f(j, i).$$

We are now ready to mechanically derive the off-policy algorithms LSTD(λ), LSPE(λ), FPKF(λ) and BRM(λ). This is what we do in the following subsections.

3.1 Off-policy LSTD(λ)

The Least-Squares Temporal Difference algorithm, which directly computes a fixed point of the projected Bellman operator, was originally introduced in the no-trace and on-policy case by Bradtke and Barto (1996). It was extended to eligibility traces by Boyan (1999), to off-policy learning without traces (through state-action value function approximation) by Lagoudakis and Parr (2003), and to off-policy learning with traces by Yu (2010a). The off-policy LSTD(λ) algorithm corresponds to instantiating Problem (5) with $\xi = \theta_i$:
$$\theta_i = \operatorname{argmin}_{\omega \in \mathbb{R}^p} \sum_{j=1}^{i} \left( \phi_j^T \theta_i + \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \theta_i \right) - \phi_j^T \omega \right)^2.$$
This can be solved by zeroing the gradient with respect to $\omega$:
$$\theta_i = \left( \sum_{j=1}^{i} \phi_j \phi_j^T \right)^{-1} \sum_{j=1}^{i} \phi_j \left( \phi_j^T \theta_i + \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \theta_i \right) \right), \quad \text{that is,} \quad 0 = \sum_{j=1}^{i} \sum_{k=j}^{i} \phi_j \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \theta_i \right),$$
which, through Lemma 2, is equivalent to:
$$0 = \sum_{j=1}^{i} \left( \sum_{k=1}^{j} \tilde{\rho}_k^{j-1} \phi_k \right) \left( \rho_j r_j - \Delta\phi_j^T \theta_i \right).$$
Introducing the (importance-based) eligibility vector $z_j$:
$$z_j = \sum_{k=1}^{j} \tilde{\rho}_k^{j-1} \phi_k = \sum_{k=1}^{j} (\gamma\lambda)^{j-k} \left( \prod_{m=k}^{j-1} \rho_m \right) \phi_k = \gamma\lambda\rho_{j-1} z_{j-1} + \phi_j, \quad (6)$$
one obtains the following batch estimate:
$$\theta_i = \left( \sum_{j=1}^{i} z_j \Delta\phi_j^T \right)^{-1} \sum_{j=1}^{i} z_j \rho_j r_j = (A_i)^{-1} b_i, \quad (7)$$
$$\text{with} \quad A_i = \sum_{j=1}^{i} z_j \Delta\phi_j^T \quad \text{and} \quad b_i = \sum_{j=1}^{i} z_j \rho_j r_j. \quad (8)$$
Thanks to Lemma 1, the inverse $M_i = (A_i)^{-1}$ can be computed recursively:
$$M_i = \left( \sum_{j=1}^{i} z_j \Delta\phi_j^T \right)^{-1} = M_{i-1} - \frac{M_{i-1} z_i \Delta\phi_i^T M_{i-1}}{1 + \Delta\phi_i^T M_{i-1} z_i}.$$
This can be used to derive a recursive estimate:
$$\theta_i = \left( \sum_{j=1}^{i} z_j \Delta\phi_j^T \right)^{-1} \sum_{j=1}^{i} z_j \rho_j r_j = \left( M_{i-1} - \frac{M_{i-1} z_i \Delta\phi_i^T M_{i-1}}{1 + \Delta\phi_i^T M_{i-1} z_i} \right) \left( \sum_{j=1}^{i-1} z_j \rho_j r_j + z_i \rho_i r_i \right) = \theta_{i-1} + \frac{M_{i-1} z_i}{1 + \Delta\phi_i^T M_{i-1} z_i} \left( \rho_i r_i - \Delta\phi_i^T \theta_{i-1} \right).$$
Writing the gain $K_i = \frac{M_{i-1} z_i}{1 + \Delta\phi_i^T M_{i-1} z_i}$, this gives Algorithm 1.

Algorithm 1: Off-policy LSTD(λ)
- Initialization: initialize vector $\theta_0$ and matrix $M_0$; set $z_0 = 0$.
- For i = 1, 2, ...:
  - Observe $\phi_i$, $r_i$, $\phi_{i+1}$.
  - Update traces: $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$.
  - Update parameters: $K_i = \frac{M_{i-1} z_i}{1 + \Delta\phi_i^T M_{i-1} z_i}$; $\theta_i = \theta_{i-1} + K_i (\rho_i r_i - \Delta\phi_i^T \theta_{i-1})$; $M_i = M_{i-1} - K_i (M_{i-1}^T \Delta\phi_i)^T$.

This algorithm was proposed and analyzed by Yu (2010a). The author proves the following result: if the behavior policy $\pi_0$ induces an irreducible Markov chain and chooses with positive probability any action that may be chosen by the target policy $\pi$, and if the compound (linear) operator $\Pi_0 T^\lambda$ has a unique fixed point, then off-policy LSTD(λ) converges to it almost surely.⁵ Formally, it converges to the solution $\theta^*$ of the so-called projected fixed-point equation:
$$V_{\theta^*} = \Pi_0 T^\lambda V_{\theta^*}. \quad (9)$$
Using the expression of the projection $\Pi_0$ and the form of the Bellman operator in Equation (3), it can be seen that $\theta^*$ satisfies $A\theta^* = b$ (see Yu, 2010a for details), where
$$A = \Phi^T D_0 (I - \gamma P)(I - \lambda\gamma P)^{-1} \Phi \quad \text{and} \quad b = \Phi^T D_0 (I - \lambda\gamma P)^{-1} R. \quad (10)$$
The core of the analysis of Yu (2010a) consists in showing that $\frac{1}{i} A_i$ and $\frac{1}{i} b_i$, defined in Equation (8), respectively converge to $A$ and $b$ almost surely. Through Equation (7), this implies the convergence of $\theta_i$ to $\theta^*$.

3.2 Off-policy LSPE(λ)

The Least-Squares Policy Evaluation algorithm, which computes iteratively the fixed point of the projected Bellman operator, was originally introduced by Bertsekas and Ioffe (1996) and first analyzed in an on-policy context by Nedić and Bertsekas (2003). Its extension to off-policy learning with traces was briefly mentioned by Yu (2010a).
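The LSTD(λ) recursion of Algorithm 1 can be rendered concretely. In the following minimal sketch, the trajectory (features, importance weights, rewards) is synthetic, and the regularized initialization $M_0 = I$ is an implementation choice, not part of the algorithm as analyzed; under it, the recursion tracks the regularized batch solution exactly:

```python
import numpy as np

# Recursive off-policy LSTD(lambda): M_i tracks (I + A_i)^{-1} by Sherman-Morrison,
# so theta_i = M_i b_i is maintained without ever inverting a matrix online.
gamma, lam, p, n = 0.9, 0.5, 2, 100
rng = np.random.default_rng(2)
phis = rng.standard_normal((n + 1, p))      # phi_1 ... phi_{n+1} (synthetic)
rhos = rng.uniform(0.5, 1.5, size=n)        # importance weights rho_i (synthetic)
rewards = rng.standard_normal(n)

theta = np.zeros(p)
M = np.eye(p)                               # M_0 = (eps * I)^{-1} with eps = 1
z = np.zeros(p)
A_sum, b_sum = np.zeros((p, p)), np.zeros(p)
rho_prev = 1.0                              # convention rho_0 = 1
for i in range(n):
    z = gamma * lam * rho_prev * z + phis[i]            # trace update, Eq. (6)
    dphi = phis[i] - gamma * rhos[i] * phis[i + 1]      # Delta-phi_i
    K = M @ z / (1.0 + dphi @ M @ z)                    # gain K_i
    theta = theta + K * (rhos[i] * rewards[i] - dphi @ theta)
    M = M - np.outer(K, dphi @ M)                       # Sherman-Morrison update
    A_sum += np.outer(z, dphi)                          # batch A_i of Eq. (8)
    b_sum += z * rhos[i] * rewards[i]                   # batch b_i of Eq. (8)
    rho_prev = rhos[i]

# The recursion reproduces the (regularized) batch solution of Equation (7).
assert np.allclose(theta, np.linalg.solve(np.eye(p) + A_sum, b_sum),
                   rtol=1e-4, atol=1e-6)
```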
The off-policy LSPE(λ) algorithm corresponds to instantiating $\xi = \theta_{i-1}$ in Problem (5):
$$\theta_i = \operatorname{argmin}_{\omega \in \mathbb{R}^p} \sum_{j=1}^{i} \left( \phi_j^T \theta_{i-1} + \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \theta_{i-1} \right) - \phi_j^T \omega \right)^2.$$

⁵ It is not always the case that $\Pi_0 T^\lambda$ has a unique fixed point; see Tsitsiklis and Van Roy (1997) for a counter-example.

This can be solved by zeroing the gradient with respect to $\omega$:
$$\theta_i = \left( \sum_{j=1}^{i} \phi_j \phi_j^T \right)^{-1} \sum_{j=1}^{i} \phi_j \left( \phi_j^T \theta_{i-1} + \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \theta_{i-1} \right) \right) = \theta_{i-1} + \left( \sum_{j=1}^{i} \phi_j \phi_j^T \right)^{-1} \sum_{j=1}^{i} \sum_{k=j}^{i} \phi_j \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \theta_{i-1} \right).$$
Using Lemma 2 and the definition of the eligibility vector $z_j$ in Equation (6), we get:
$$\theta_i = \theta_{i-1} + \left( \sum_{j=1}^{i} \phi_j \phi_j^T \right)^{-1} \sum_{j=1}^{i} \left( \sum_{k=1}^{j} \tilde{\rho}_k^{j-1} \phi_k \right) \left( \rho_j r_j - \Delta\phi_j^T \theta_{i-1} \right) = \theta_{i-1} + \left( \sum_{j=1}^{i} \phi_j \phi_j^T \right)^{-1} \sum_{j=1}^{i} z_j \left( \rho_j r_j - \Delta\phi_j^T \theta_{i-1} \right).$$
Define the matrix $N_i$ as follows:
$$N_i = \left( \sum_{j=1}^{i} \phi_j \phi_j^T \right)^{-1} = N_{i-1} - \frac{N_{i-1} \phi_i \phi_i^T N_{i-1}}{1 + \phi_i^T N_{i-1} \phi_i}, \quad (11)$$
where the second equality follows from Lemma 1. Let $A_i$ and $b_i$ be defined as in the LSTD description in Equation (8). For clarity, we restate their definitions along with their recursive forms:
$$A_i = \sum_{j=1}^{i} z_j \Delta\phi_j^T = A_{i-1} + z_i \Delta\phi_i^T, \qquad b_i = \sum_{j=1}^{i} z_j \rho_j r_j = b_{i-1} + z_i \rho_i r_i.$$
Then, it can be seen that the LSPE(λ) update is:
$$\theta_i = \theta_{i-1} + N_i (b_i - A_i \theta_{i-1}).$$
The overall computation is provided in Algorithm 2. This algorithm, (briefly) mentioned by Yu (2010a), generalizes the LSPE(λ) algorithm of Bertsekas and Ioffe (1996) to off-policy learning. With respect to LSTD(λ), which computes $\theta_i = (A_i)^{-1} b_i$ at each iteration as stated in Equation (7), LSPE(λ) is fundamentally recursive (as it is based on an iterated fixed-point relation). Along with the almost sure convergence of $\frac{1}{i} A_i$ and $\frac{1}{i} b_i$ to $A$ and $b$ defined in Equation (10), it can be shown that $i N_i$ converges to $N = (\Phi^T D_0 \Phi)^{-1}$ (see for instance Nedić and Bertsekas, 2003), so that, asymptotically, LSPE(λ) behaves as:
$$\theta_i = \theta_{i-1} + N(b - A\theta_{i-1}) = Nb + (I - NA)\theta_{i-1},$$

Algorithm 2: Off-policy LSPE(λ)
- Initialization: initialize vector $\theta_0$ and matrix $N_0$; set $z_0 = 0$, $A_0 = 0$ and $b_0 = 0$.
- For i = 1, 2, ...:
  - Observe $\phi_i$, $r_i$, $\phi_{i+1}$.
  - Update traces: $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$.
  - Update parameters: $N_i = N_{i-1} - \frac{N_{i-1} \phi_i \phi_i^T N_{i-1}}{1 + \phi_i^T N_{i-1} \phi_i}$; $A_i = A_{i-1} + z_i \Delta\phi_i^T$; $b_i = b_{i-1} + z_i \rho_i r_i$; $\theta_i = \theta_{i-1} + N_i (b_i - A_i \theta_{i-1})$.

Equivalently, using the definitions of $\Pi_0$, $A$ and $b$ from Equation (10) and of $T^\lambda$ from Equation (3), this asymptotic behavior reads:
$$V_{\theta_i} = \Phi\theta_i = \Phi N b + \Phi(I - NA)\theta_{i-1} = \Pi_0 T^\lambda V_{\theta_{i-1}}. \quad (12)$$
The behavior of this sequence depends on whether the spectral radius of $\Pi_0 T^\lambda$ is smaller than 1 or not. Thus, the analyses of Yu (2010a) and of Nedić and Bertsekas (2003) (for the convergence of $N_i$) imply the following convergence result: under the assumptions required for the convergence of off-policy LSTD(λ), and under the additional assumption that the operator $\Pi_0 T^\lambda$ has a spectral radius smaller than 1 (so that it is contracting), LSPE(λ) also converges almost surely to the fixed point of the compound operator $\Pi_0 T^\lambda$. There are two sufficient conditions that ensure such a desired contraction property. The first one is on-policy learning, as considered by Nedić and Bertsekas (2003) when they derived the first convergence proof of (on-policy) LSPE(λ). When the behavior policy $\pi_0$ is different from the target policy $\pi$, a sufficient condition for contraction is that λ be close enough to 1; indeed, when λ tends to 1, the spectral radius of $T^\lambda$ tends to zero and can potentially balance an expansion of the projection $\Pi_0$. In the off-policy case, when γ is sufficiently large, a small value of λ can make $\Pi_0 T^\lambda$ expansive (see Tsitsiklis and Van Roy 1997 for an example in the case λ = 0), and off-policy LSPE(λ) will then diverge. Finally, Equations (9) and (12) show that when λ = 1, LSTD(λ) and LSPE(λ) asymptotically coincide (as $T^1 V$ does not depend on $V$).

3.3 Off-policy FPKF(λ)

The Fixed-Point Kalman Filter algorithm is a bootstrapped recursive least-squares approach to value function approximation, originally introduced by Choi and Van Roy (2006). Its extensions to eligibility traces and to off-policy learning are new.
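Before detailing FPKF(λ), the LSPE(λ) iteration of Algorithm 2 can be rendered concretely; the trajectory below is synthetic and the initialization $N_0 = I$ (a unit regularizer) is an implementation choice, not taken from the analysis:

```python
import numpy as np

# Off-policy LSPE(lambda): A_i, b_i are the same statistics as in LSTD(lambda),
# N_i approximates (sum_j phi_j phi_j^T)^{-1} by Sherman-Morrison, and
# theta_i = theta_{i-1} + N_i (b_i - A_i theta_{i-1}).
gamma, lam, p, n = 0.9, 0.5, 2, 100
rng = np.random.default_rng(7)
phis = rng.standard_normal((n + 1, p))      # synthetic features
rhos = rng.uniform(0.5, 1.5, size=n)        # synthetic importance weights
rewards = rng.standard_normal(n)

theta = np.zeros(p)
N = np.eye(p)                               # N_0 = I (regularized start)
A = np.zeros((p, p))
b = np.zeros(p)
z = np.zeros(p)
rho_prev = 1.0
for i in range(n):
    z = gamma * lam * rho_prev * z + phis[i]             # trace z_i
    Nphi = N @ phis[i]
    N = N - np.outer(Nphi, Nphi) / (1.0 + phis[i] @ Nphi)   # Sherman-Morrison
    A = A + np.outer(z, phis[i] - gamma * rhos[i] * phis[i + 1])
    b = b + z * rhos[i] * rewards[i]
    theta = theta + N @ (b - A @ theta)                  # LSPE(lambda) update
    rho_prev = rhos[i]

assert np.all(np.isfinite(theta))
```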
The off-policy FPKF(λ) algorithm corresponds to instantiating $\xi = \theta_{j-1}$ in Problem (5):
$$\theta_i = \operatorname{argmin}_{\omega \in \mathbb{R}^p} \sum_{j=1}^{i} \left( \phi_j^T \theta_{j-1} + \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \theta_{j-1} \right) - \phi_j^T \omega \right)^2.$$
This can be solved by zeroing the gradient with respect to $\omega$:
$$\theta_i = N_i \sum_{j=1}^{i} \phi_j \left( \phi_j^T \theta_{j-1} + \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \theta_{j-1} \right) \right),$$
where $N_i$ is the matrix introduced for LSPE(λ) in Equation (11). For clarity, we restate its definition and its recursive form:
$$N_i = \left( \sum_{j=1}^{i} \phi_j \phi_j^T \right)^{-1} = N_{i-1} - \frac{N_{i-1} \phi_i \phi_i^T N_{i-1}}{1 + \phi_i^T N_{i-1} \phi_i}. \quad (13)$$
Using Lemma 2, one obtains:
$$\theta_i = N_i \left( \sum_{j=1}^{i} \phi_j \phi_j^T \theta_{j-1} + \sum_{j=1}^{i} \sum_{k=1}^{j} \tilde{\rho}_k^{j-1} \phi_k \left( \rho_j r_j - \Delta\phi_j^T \theta_{k-1} \right) \right).$$
With respect to the previously described algorithms, the difficulty here is that the right-hand side depends on all the previous parameter vectors $\theta_{k-1}$ for $1 \le k \le i$. Using the symmetry of the dot product, $\Delta\phi_j^T \theta_{k-1} = \theta_{k-1}^T \Delta\phi_j$, it is possible to write a recursive algorithm by introducing the trace matrix $Z_j$ that integrates the subsequent values of $\theta_k$ as follows:
$$Z_j = \sum_{k=1}^{j} \tilde{\rho}_k^{j-1} \phi_k \theta_{k-1}^T = \gamma\lambda\rho_{j-1} Z_{j-1} + \phi_j \theta_{j-1}^T.$$
With this notation we obtain:
$$\theta_i = N_i \left( \sum_{j=1}^{i} \phi_j \phi_j^T \theta_{j-1} + \sum_{j=1}^{i} \left( z_j \rho_j r_j - Z_j \Delta\phi_j \right) \right).$$
Using Equation (13) and a few algebraic manipulations, we end up with:
$$\theta_i = \theta_{i-1} + N_i \left( z_i \rho_i r_i - Z_i \Delta\phi_i \right).$$
This is the parameter update provided in Algorithm 3.

Algorithm 3: Off-policy FPKF(λ)
- Initialization: initialize vector $\theta_0$ and matrix $N_0$; set $z_0 = 0$ and $Z_0 = 0$.
- For i = 1, 2, ...:
  - Observe $\phi_i$, $r_i$, $\phi_{i+1}$.
  - Update traces: $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$; $Z_i = \gamma\lambda\rho_{i-1} Z_{i-1} + \phi_i \theta_{i-1}^T$.
  - Update parameters: $N_i = N_{i-1} - \frac{N_{i-1} \phi_i \phi_i^T N_{i-1}}{1 + \phi_i^T N_{i-1} \phi_i}$; $\theta_i = \theta_{i-1} + N_i (z_i \rho_i r_i - Z_i \Delta\phi_i)$.

Like LSPE(λ), this algorithm is fundamentally recursive. However, its overall behavior is quite different. As discussed for LSPE(λ), $i N_i$ can be shown to tend asymptotically to $N = (\Phi^T D_0 \Phi)^{-1}$, and the FPKF(λ) iterates eventually resemble:
$$\theta_i = \theta_{i-1} + \frac{1}{i} N \left( z_i \rho_i r_i - Z_i \Delta\phi_i \right).$$
The term in parentheses is a random component (that only depends on the previous transitions) and $\frac{1}{i}$ acts as a learning coefficient that asymptotically tends to 0.
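Algorithm 3 can be rendered concretely as follows; the trajectory is synthetic and the initialization $N_0 = I$ (a unit regularizer) is an implementation choice:

```python
import numpy as np

# Off-policy FPKF(lambda): the vector trace z_i and the matrix trace Z_i (which
# integrates past parameter vectors) are updated jointly, and N_i is maintained
# by Sherman-Morrison, as in Algorithm 3.
gamma, lam, p, n = 0.9, 0.5, 2, 100
rng = np.random.default_rng(4)
phis = rng.standard_normal((n + 1, p))      # synthetic features
rhos = rng.uniform(0.5, 1.5, size=n)        # synthetic importance weights
rewards = rng.standard_normal(n)

theta = np.zeros(p)
N = np.eye(p)                               # N_0 = I (regularized start)
z = np.zeros(p)
Z = np.zeros((p, p))
rho_prev = 1.0
for i in range(n):
    z = gamma * lam * rho_prev * z + phis[i]                    # z_i
    Z = gamma * lam * rho_prev * Z + np.outer(phis[i], theta)   # Z_i uses theta_{i-1}
    Nphi = N @ phis[i]
    N = N - np.outer(Nphi, Nphi) / (1.0 + phis[i] @ Nphi)       # Sherman-Morrison
    dphi = phis[i] - gamma * rhos[i] * phis[i + 1]              # Delta-phi_i
    theta = theta + N @ (z * rhos[i] * rewards[i] - Z @ dphi)
    rho_prev = rhos[i]

assert np.all(np.isfinite(theta))
```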
In other words, FPKF(λ) has a stochastic approximation flavor. In particular, one can see FPKF(0) as a stochastic approximation of LSPE(0). Indeed, asymptotically, FPKF(0) performs the update
$$\theta_i = \theta_{i-1} + \frac{1}{i} N \left( \rho_i \phi_i r_i - \phi_i \Delta\phi_i^T \theta_{i-1} \right),$$
and one can notice that $\rho_i \phi_i r_i$ and $\phi_i \Delta\phi_i^T$ are samples of the quantities $b$ and $A$ to which $\frac{1}{i} b_i$ and $\frac{1}{i} A_i$ converge in LSPE(0). When λ > 0, the situation is less clear, up to the fact that, since $T^1 V$ does not depend on $V$, we expect FPKF to asymptotically behave like LSTD and LSPE when λ tends to 1. Due to its much more involved form (notably the trace matrix $Z_j$, which integrates all the values $\theta_k$ from the start), it does not seem easy to provide a guarantee for FPKF(λ), even in the on-policy case. To our knowledge, there does not exist any proof of convergence for stochastic approximation algorithms in the off-policy case with traces, and a related result for FPKF(λ) thus seems difficult.⁶ Based on the above-mentioned relation between FPKF(0) and LSPE(0) and the experiments we have run (see Section 5), we conjecture that off-policy FPKF(λ) has the same asymptotic behavior as LSPE(λ). We leave the formal study of this algorithm for future work.

⁶ An analysis of TD(λ), with a simplifying assumption that forces the algorithm to stay bounded, is given by Yu (2010a). An analysis of GQ(λ) is provided by Maei and Sutton (2010), with an assumption on the second moment of the traces which, as explained in Proposition 2 of Yu (2010a), does not hold in general. A full analysis of these algorithms thus remains to be done. See also Sections 4.1 and 4.2.

3.4 Off-policy BRM(λ)

The Bellman Residual Minimization algorithm is a least-squares approach that directly minimizes the Bellman residual. The off-policy BRM(λ) algorithm corresponds to instantiating $\xi = \omega$ in Problem (5):
$$\theta_i = \operatorname{argmin}_{\omega \in \mathbb{R}^p} \sum_{j=1}^{i} \left( \phi_j^T \omega + \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \omega \right) - \phi_j^T \omega \right)^2 = \operatorname{argmin}_{\omega \in \mathbb{R}^p} \sum_{j=1}^{i} \left( \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \omega \right) \right)^2.$$
Define
$$\psi_{j,i} = \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \Delta\phi_k \quad \text{and} \quad z_{j,i} = \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \rho_k r_k.$$
This yields the following batch estimate:
$$\theta_i = \operatorname{argmin}_{\omega \in \mathbb{R}^p} \sum_{j=1}^{i} \left( z_{j,i} - \psi_{j,i}^T \omega \right)^2 = (\bar{A}_i)^{-1} \bar{b}_i, \quad (14)$$
$$\text{with} \quad \bar{A}_i = \sum_{j=1}^{i} \psi_{j,i} \psi_{j,i}^T \quad \text{and} \quad \bar{b}_i = \sum_{j=1}^{i} \psi_{j,i} z_{j,i}.$$
The transformation of this batch estimate into a recursive update rule is somewhat tedious (it involves three trace variables), and the details are deferred to Appendix A for clarity. The resulting BRM(λ) method is provided in Algorithm 4. Note that at each step, this algorithm involves the inversion of a 2×2 matrix (involving the 2×2 identity matrix $I_2$), an inversion that admits a straightforward analytical solution. The computational complexity of an iteration of BRM(λ) is thus $O(p^2)$ (as for the preceding least-squares algorithms). GPTD and KTD, which are close to BRM, have also been extended with some trace mechanism; however, GPTD(λ) (Engel, 2005), KTD(λ) (Geist and Pietquin, 2010a) and the just-described BRM(λ) are different algorithms.⁷ Briefly, GPTD(λ) is very close to LSTD(λ) and KTD(λ) uses a different Bellman operator.⁸ As BRM(λ) builds a linear system whose solution is updated recursively, it resembles LSTD(λ). However, the system it builds is different. The following theorem, proved in Appendix B, partially characterizes the behavior of BRM(λ) and its potential limit.⁹

Theorem 3. Assume that the stochastic matrix $P_0$ of the behavior policy is irreducible and has stationary distribution $\mu_0$. Further assume that there exists a coefficient $\beta < 1$ such that
$$\forall (s, a), \quad \lambda\gamma\rho(s, a) \le \beta. \quad (15)$$

⁷ GPTD(λ) is not exactly a generalization of GPTD, as it does not reduce to it when λ = 0. It is rather a technical variation that bridges a gap with the Monte Carlo approach.

⁸ The corresponding loss is $\left( \hat{T}^0_{j,i} \hat{V}_\omega - \hat{V}_\omega(s_j) + \gamma\lambda \left( \hat{T}^1_{j+1,i} \hat{V}_\omega - \hat{V}_\omega(s_{j+1}) \right) \right)^2$. With λ = 0 it gives $\hat{T}^0_{j,i}$ and with λ = 1 it provides $\hat{T}^1_{j,i}$.

⁹ Our proof is similar to that of Proposition 4 of Bertsekas and Yu (2009).
The overall arguments are the following: Equation (15) implies that the traces can be truncated at some depth l, whose influence on the potential limit of the algorithm vanishes when l tends to . For all l, the l-truncated version of the algorithm can easily be analyzed through the ergodic theorem for Markov chains. Making l tend to allows tying the convergence of the original arguments to that of the truncated version. Eventually, the formula for the limit of the truncated algorithm is computed and one derives the limit. Geist and Scherrer Algorithm 4: Off-policy BRM(λ) Initialization; Initialize vector θ0 and matrix C0 ; Set y0 = 0, D0 = 0 and z0 = 0; for i = 1, 2, . . . do Observe φi, ri, φi+1; Pre-update traces ; yi = (γλρi 1)2yi 1 + 1 ; Ui = yi φi + γλρi 1 yi Di 1 γλρi 1 yi Di 1 T ; Vi = yi φi + γλρi 1 yi Di 1 γλρi 1 yi Di 1 T ; Wi = yiρri + γλρi 1 yi zi 1 γλρi 1 yi zi 1 T ; Update parameters ; θi = θi 1 + Ci 1Ui (I2 + Vi Ci 1Ui) 1 (Wi Viθi 1) ; Ci = Ci 1 Ci 1Ui (I2 + Vi Ci 1Ui) 1 Vi Ci 1 ; Post-update traces ; Di = (γλρi 1)Di 1 + φiyi ; zi = (γλρi 1)zi 1 + riρiyi ; i bi respectively converge almost surely to A = ΦT D γDP γP T D + γ2D + S(I γP) + (I γP T )ST Φ b = ΦT (I γP T )QT D + S Rπ where we wrote: D = diag (I (λγ)2 P T ) 1µ0 , Q = (I λγP) 1, D = diag P T (I (λγ)2 P T ) 1µ0 , S = λγ(DP γD )Q, and where P is the matrix whose coordinates are pss = P a π(a|s)ρ(s, a)P(s |s, a). Then, the BRM(λ) algorithm converges with probability 1 to A 1 b. The assumption given by Equation (15) trivially holds in the on-policy case (in which ρ(s, a) = 1 for all (s, a)) and in the off-policy case when λγ is sufficiently small with respect to the mismatch between policies. Note in particular that this result implies the almost sure convergence of the GPTD/KTD algorithms in the on-policy and no-trace case, a question that was still open in the literature.10 The matrix P, which is in general not a stochastic 10. See for instance the conclusion of Engel (2005). 
matrix, can have a spectral radius bigger than 1; Equation (15) ensures that $(\lambda\gamma)^2 P$ has a spectral radius smaller than β, so that the diagonal matrices appearing in Theorem 3 are well defined. Removing the assumption of Equation (15) does not seem easy: by tuning λγ maliciously, one may force the spectral radius of $(\lambda\gamma)^2 P$ to be as close to 1 as one wants, which would make A and b diverge. Though the quantity $A^{-1} b$ may compensate for these divergences, our current proof technique cannot account for this situation, and a related analysis constitutes possible future work. The fundamental idea behind the Bellman Residual approach is to address the computation of the fixed point of $T^\lambda$ differently from the previous methods. Instead of computing the projected fixed point as in Equation (9), one considers the following over-determined system:

$$\Phi\theta \approx (I - \lambda\gamma P)^{-1}\big(R + (1-\lambda)\gamma P \Phi\theta\big) \quad \text{(by Equation (3))}$$
$$\Leftrightarrow \quad \Phi\theta \approx QR + (1-\lambda)\gamma P Q \Phi\theta \quad \Leftrightarrow \quad \Psi\theta \approx QR \quad \text{with } \Psi = \Phi - (1-\lambda)\gamma P Q \Phi,$$

and solves it in a least-squares sense, that is by computing $\theta = A^{-1} b$ with $A = \Psi^\top \Psi$ and $b = \Psi^\top Q R$. One of the motivations for this approach is that, as opposed to the matrix A of LSTD/LSPE/FPKF, here A is invertible for all values of λ, and one can always guarantee a finite error bound with respect to the best projection (Schoknecht, 2002; Yu and Bertsekas, 2008; Scherrer, 2010). If the goal of BRM(λ) is to compute A and b from samples, what it actually computes (the limits characterized in Theorem 3) will in general be biased, because the estimation is based on a single trajectory.11 Such a bias adds an uncontrolled variance term (Antos et al., 2006) to A and b; an interesting consequence is that the estimated matrix is always non-singular.12 More precisely, there are two sources of bias in the estimation: one results from the non Monte-Carlo evaluation (the fact that λ < 1) and the other from the use of the correlated importance sampling factors (as soon as one considers off-policy learning).
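The $O(p^2)$ per-iteration cost of the recursive least-squares algorithms above, including the 2×2 inversion of Algorithm 4, rests on the Sherman-Morrison-Woodbury identity applied to a low-rank correction of the current matrix. A minimal sketch of the rank-2 case, with NumPy and a function name of our choosing:

```python
import numpy as np

def woodbury_update(C, U, V):
    """Given C = A^{-1} (p x p) and a rank-2 correction A' = A + U V^T,
    with U and V of size p x 2, return A'^{-1} while inverting only the
    2 x 2 matrix I_2 + V^T C U (Sherman-Morrison-Woodbury identity)."""
    K = np.linalg.inv(np.eye(U.shape[1]) + V.T @ C @ U)
    return C - C @ U @ K @ V.T @ C
```

For moderate p this costs $O(p^2)$ per step, as stated for BRM(λ), instead of the $O(p^3)$ of a fresh inversion.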
The interested reader may check that in the on-policy case, and when λ tends to 1, the limits of Theorem 3 coincide with A and b. However, in the strictly off-policy case, taking λ = 1 does not prevent the bias due to the correlated importance sampling factors. While we have argued that LSTD/LSPE/FPKF should asymptotically coincide when λ = 1, we see here that BRM should generally differ in an off-policy situation.

4. Stochastic Gradient Extensions To Eligibility Traces And Off-policy Learning

We have just provided a systematic derivation of all least-squares algorithms for learning with eligibility traces in an off-policy manner. When the number of features p is very large, the $O(p^2)$ complexity involved by a least-squares approach may be prohibitive. In such a situation, a natural alternative is to consider an approach based on a stochastic gradient descent of the objective function of interest (Bottou and Bousquet, 2011; Sutton et al., 2009; Maei and Sutton, 2010). In this section, we describe a systematic derivation of stochastic gradient algorithms for learning in an off-policy manner with eligibility traces. The principle followed is the same as for the least-squares approaches: we shall instantiate the algorithmic pattern of Equation (4) by choosing the value of ξ, and update the parameter so as to move towards the minimum of J(θi, ξ) using a stochastic gradient descent. To make the pattern of Equation (4) precise, we need to define the empirical approximate operator we use.

11. It is possible to remove the bias when λ = 0 by using double samples. However, in the case where λ > 0, removing the bias seems much more difficult.
12. The matrix $A = \Psi^\top \Psi$ is by construction positive definite, and the limit characterized in Theorem 3 equals it plus a positive term (the variance term), and is thus also positive definite.
We will consider the untruncated operators $\hat T^\lambda_{i,n}$ (written in the following $\hat T^\lambda_i$, with a slight abuse of notation):

$$\hat T^\lambda_i V = V(s_i) + \sum_{j=i}^n (\gamma\lambda)^{j-i} \big( \rho_i^j \hat T_j V - \rho_i^{j-1} V(s_j) \big) \quad (16)$$

where n is the total length of the trajectory. It should be noted that the algorithmic derivations in this section will be a little more involved than in the least-squares case. First, by instantiating ξ = θi, the pattern given in Equation (4) is actually a fixed-point problem on which one cannot directly perform a stochastic gradient descent; this issue will be addressed in Section 4.2 through the introduction of an auxiliary objective function, following the approach originally proposed by Sutton et al. (2009). A second difficulty is the following: the just introduced empirical operator $\hat T^\lambda_i$ depends on the full trajectory after step i (on the future of the process), and is for this reason usually called a forward view estimate. Though it would be possible, in principle, to implement a gradient descent based on this forward view, it would be neither memory nor time efficient. Thus, we will follow a usual trick of the literature by deriving recursive algorithms based on a backward view estimate that is equivalent to the forward view in expectation. To do so, we will repeatedly use the following identity, which highlights the fact that the estimate $\hat T^\lambda_i V$ can be written as a forward recursion:

Lemma 4 Let $\hat T^\lambda_i$ be the operator defined in Equation (16) and let $V \in \mathbb{R}^S$. We have $\hat T^\lambda_i V = \rho_i r_i + \gamma \rho_i (1-\lambda) V(s_{i+1}) + \gamma\lambda\rho_i \hat T^\lambda_{i+1} V$.

Proof Using notably the identity $\rho_i^j = \rho_i \rho_{i+1}^j$, we have:
$$\hat T^\lambda_i V = V(s_i) + \sum_{j=i}^n (\gamma\lambda)^{j-i} \big( \rho_i^j \hat T_j V - \rho_i^{j-1} V(s_j) \big) = V(s_i) + \rho_i \hat T_i V - V(s_i) + \gamma\lambda\rho_i \sum_{j=i+1}^n (\gamma\lambda)^{j-i-1} \big( \rho_{i+1}^j \hat T_j V - \rho_{i+1}^{j-1} V(s_j) \big) = \rho_i \hat T_i V + \gamma\lambda\rho_i \big( \hat T^\lambda_{i+1} V - V(s_{i+1}) \big).$$
The claim follows since $\hat T_i V = r_i + \gamma V(s_{i+1})$.

To sum up, the recipe that we are about to use to derive off-policy gradient learning algorithms based on eligibility traces consists of the following steps: 1.
write the empirical generic cost function of Equation (4) with the untruncated Bellman operator of Equation (16);

2. instantiate ξ and derive the gradient-based update rule (with some additional work for ξ = θi, see Section 4.2);

3. turn the forward view into an equivalent (in expectation) backward view.

The next subsections detail the precise derivation of the algorithms.

4.1 Off-policy TD(λ)

The Temporal-Difference algorithm (Sutton and Barto, 1998) is a gradient-based bootstrap approach for value function approximation. Because it is the simplest, we begin by considering this bootstrap approach, that is, by instantiating ξ = θ_{j-1}. The cost function to be minimized is therefore:

$$\sum_{j=1}^i \big( \hat T^\lambda_j \hat V_{\theta_{j-1}} - \hat V_\omega(s_j) \big)^2.$$

Minimized with a stochastic gradient descent, the related update rule is ($\alpha_i$ being a standard learning rate, and recalling that $\hat V_\omega(s_i) = \omega^\top \phi(s_i) = \omega^\top \phi_i$):

$$\theta_i = \theta_{i-1} - \frac{\alpha_i}{2} \nabla_\omega \big( \hat T^\lambda_i \hat V_{\theta_{i-1}} - \hat V_\omega(s_i) \big)^2 \Big|_{\omega=\theta_{i-1}} = \theta_{i-1} + \alpha_i \phi_i \big( \hat T^\lambda_i \hat V_{\theta_{i-1}} - \hat V_{\theta_{i-1}}(s_i) \big). \quad (17)$$

At this point, one could notice that the exact same update rule would have been obtained by instantiating ξ = θ_{i-1}. This was to be expected: as only the last term of the sum is considered for the update, we have j = i, and therefore ξ = θ_{i-1} = θ_{j-1}. Equation (17) makes use of a λ-TD error defined as $\delta^\lambda_i(\omega) = \hat T^\lambda_i \hat V_\omega - \hat V_\omega(s_i)$. For convenience, let also $\delta_i$ be the standard (off-policy) TD error, defined as $\delta_i(\omega) = \delta^{\lambda=0}_i(\omega) = \rho_i \hat T_i \hat V_\omega - \hat V_\omega(s_i) = \rho_i \big( r_i + \gamma \hat V_\omega(s_{i+1}) \big) - \hat V_\omega(s_i)$. The λ-TD error can be expressed as a forward recursion:

Lemma 5 Let $\delta^\lambda_i$ be the λ-TD error and $\delta_i$ be the standard TD error. Then for all ω, $\delta^\lambda_i(\omega) = \delta_i(\omega) + \gamma\lambda\rho_i \delta^\lambda_{i+1}(\omega)$.

Proof This is a corollary of Lemma 4:
$$\hat T^\lambda_i \hat V_\omega = \rho_i r_i + \gamma\rho_i(1-\lambda) \hat V_\omega(s_{i+1}) + \gamma\lambda\rho_i \hat T^\lambda_{i+1} \hat V_\omega$$
$$\Rightarrow \hat T^\lambda_i \hat V_\omega - \hat V_\omega(s_i) = \rho_i r_i + \gamma\rho_i \hat V_\omega(s_{i+1}) - \hat V_\omega(s_i) + \gamma\lambda\rho_i \big( \hat T^\lambda_{i+1} \hat V_\omega - \hat V_\omega(s_{i+1}) \big)$$
$$\Rightarrow \delta^\lambda_i(\omega) = \delta_i(\omega) + \gamma\lambda\rho_i \delta^\lambda_{i+1}(\omega).$$

Therefore, we get the following update rule: $\theta_i = \theta_{i-1} + \alpha_i \phi_i \delta^\lambda_i(\theta_{i-1})$ with $\delta^\lambda_i(\theta_{i-1}) = \delta_i(\theta_{i-1}) + \gamma\lambda\rho_i \delta^\lambda_{i+1}(\theta_{i-1})$.
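As a sanity check of Lemma 5, one can verify numerically on a finite trajectory (with the convention that the λ-TD error vanishes beyond the horizon) that the backward recursion reproduces the forward-view definition. The data below are arbitrary:

```python
import numpy as np

# Check of Lemma 5: with delta^lam_i = sum_{j>=i} (gamma*lam)^{j-i} rho_i^{j-1} delta_j,
# the recursion delta^lam_i = delta_i + gamma*lam*rho_i*delta^lam_{i+1} holds.
rng = np.random.default_rng(1)
n, gamma, lam = 20, 0.95, 0.7
delta = rng.normal(size=n)        # arbitrary per-step off-policy TD errors delta_i
rho = rng.uniform(0.5, 1.5, n)    # arbitrary importance-sampling ratios rho_i

# forward (direct) computation of the lambda-TD errors
fwd = np.zeros(n)
for i in range(n):
    w = 1.0                        # running product rho_i ... rho_{j-1}
    for j in range(i, n):
        fwd[i] += (gamma * lam) ** (j - i) * w * delta[j]
        w *= rho[j]

# backward recursion of Lemma 5
bwd = np.zeros(n)
acc = 0.0                          # delta^lam vanishes past the horizon
for i in reversed(range(n)):
    acc = delta[i] + gamma * lam * rho[i] * acc
    bwd[i] = acc
```

Both computations agree to machine precision on any such trajectory.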
The key idea is now to find some backward recursion that, in expectation, when the Markov chain has reached its steady-state distribution μ0, provides the same result as the forward recursion. Such a backward recursion is given by the following result.

Proposition 6 Let $z_i$ be the eligibility vector, defined by the recursion $z_i = \phi_i + \gamma\lambda\rho_{i-1} z_{i-1}$. For all ω, we have $\mathbb{E}_{\mu_0}[\phi_i \delta^\lambda_i(\omega)] = \mathbb{E}_{\mu_0}[z_i \delta_i(\omega)]$.

Proof For clarity, we omit the dependence on ω and write below $\delta_i$ (resp. $\delta^\lambda_i$) for $\delta_i(\omega)$ (resp. $\delta^\lambda_i(\omega)$). The result relies on successive applications of Lemma 5. We have:
$$\mathbb{E}_{\mu_0}[\phi_i \delta^\lambda_i] = \mathbb{E}_{\mu_0}[\phi_i (\delta_i + \gamma\lambda\rho_i \delta^\lambda_{i+1})] = \mathbb{E}_{\mu_0}[\phi_i \delta_i] + \gamma\lambda\,\mathbb{E}_{\mu_0}[\phi_i \rho_i \delta^\lambda_{i+1}].$$
Moreover, we have that $\mathbb{E}_{\mu_0}[\phi_i \rho_i \delta^\lambda_{i+1}] = \mathbb{E}_{\mu_0}[\phi_{i-1} \rho_{i-1} \delta^\lambda_i]$, as the expectation is taken with respect to the stationary distribution. Therefore:
$$\mathbb{E}_{\mu_0}[\phi_i \delta^\lambda_i] = \mathbb{E}_{\mu_0}[\phi_i \delta_i] + \gamma\lambda\,\mathbb{E}_{\mu_0}[\phi_{i-1} \rho_{i-1} \delta^\lambda_i] = \mathbb{E}_{\mu_0}[\phi_i \delta_i] + \gamma\lambda\,\mathbb{E}_{\mu_0}[\phi_{i-1} \rho_{i-1} (\delta_i + \gamma\lambda\rho_i \delta^\lambda_{i+1})] = \mathbb{E}_{\mu_0}\big[\delta_i \big( \phi_i + \gamma\lambda\rho_{i-1}\phi_{i-1} + (\gamma\lambda)^2 \rho_{i-1}\rho_{i-2}\phi_{i-2} + \dots \big)\big] = \mathbb{E}_{\mu_0}[\delta_i z_i].$$

This suggests replacing Equation (17) by the following update rule, $\theta_i = \theta_{i-1} + \alpha_i z_i \delta_i(\theta_{i-1})$, which is equivalent in expectation when the Markov chain has reached its steady state. This is summarized in Algorithm 5. This algorithm was first proposed in the tabular case by Precup et al. (2000) (who call it per-decision importance sampling). An off-policy TD(λ) algorithm (with function approximation) was proposed by Precup et al. (2001), but it differs significantly from the algorithm just described, since it updates parameters based on full episodic trajectories rather than on the current transition. Algorithm 5 was actually first proposed much more recently by Bertsekas and Yu (2009). An important issue for the analysis of this algorithm is the fact that the trace $z_i$ may have an infinite variance, due to importance sampling (Yu, 2010b, Section 3.1).
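In code, Algorithm 5 amounts to only a few lines per transition. Here is a minimal NumPy sketch (the function name and the synthetic sample interface are ours): each sample carries the feature vector φi, the importance weight ρi = π(ai|si)/π0(ai|si), the reward ri and the next feature vector φi+1.

```python
import numpy as np

def off_policy_td_lambda(samples, p, gamma=0.95, lam=0.4, alpha=0.01):
    """One pass of off-policy TD(lambda) (Algorithm 5).
    samples: iterable of (phi, rho, r, phi_next) tuples."""
    theta, z = np.zeros(p), np.zeros(p)
    rho_prev = 1.0  # z_0 = 0, so the first factor is irrelevant
    for phi, rho, r, phi_next in samples:
        z = gamma * lam * rho_prev * z + phi                         # trace update
        delta = rho * (r + gamma * phi_next @ theta) - phi @ theta   # off-policy TD error
        theta = theta + alpha * z * delta                            # parameter update
        rho_prev = rho
    return theta
```

With ρi = 1 for all i, this reduces to standard on-policy TD(λ); the per-transition cost is O(p).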
As far as we know, the only existing analysis of off-policy TD(λ) (as provided in Algorithm 5) uses an additional constraint which forces the parameters to stay bounded: after each parameter update, the resulting parameter vector is projected onto some predefined compact set. This analysis is performed by Yu (2010b, Section 4.1). Under the standard assumptions of stochastic approximation and most of the assumptions required for the on-policy TD(λ) algorithm, assuming moreover that $\Pi_0 T^\lambda$ is a contraction (which we recall to hold for a large enough λ) and that the predefined compact set used to project the parameter vector is a large enough ball containing the fixed point of $\Pi_0 T^\lambda$, the constrained version of off-policy TD(λ) converges to this fixed point, and therefore to the same solution as off-policy LSTD(λ), LSPE(λ) and FPKF(λ). We refer to Yu (2010b, Section 4.1) for further details. An analysis of the unconstrained version of off-policy TD(λ) described in Algorithm 5 is an interesting topic for future research.

Algorithm 5: Off-policy TD(λ)
Initialization: initialize vector θ0; set z0 = 0;
for i = 1, 2, . . . do
  Observe φi, ri, φi+1;
  Update traces: $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$;
  Update parameters: $\theta_i = \theta_{i-1} + \alpha_i z_i \big( \rho_i r_i + \gamma\rho_i \phi_{i+1}^\top \theta_{i-1} - \phi_i^\top \theta_{i-1} \big)$;

4.2 Off-policy TDC(λ) And Off-policy GTD2(λ)

The Temporal Difference with gradient Correction and Gradient Temporal Differences 2 algorithms were introduced by Sutton et al. (2009) as gradient descent approaches minimizing the norm of the difference between the value function and its image through the projected Bellman operator (they are therefore projected fixed-point approaches). Maei and Sutton (2010) extended TDC to off-policy learning with traces, calling the resulting algorithm GQ(λ). This corresponds (for all algorithms and extensions) to the case ξ = θi, considered in this section.
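For reference, the per-transition updates that the derivation below arrives at (summarized in Algorithm 6) can be sketched as follows. All names are ours, the two-timescale step sizes are kept constant for brevity, and the trace and TD-error conventions match those of Algorithm 5:

```python
import numpy as np

def gq_lambda_step(theta, w, z, rho_prev, phi, rho, r, phi_next,
                   gamma=0.95, lam=0.4, alpha=0.01, beta=0.1):
    """One transition of off-policy TDC(lambda), also known as GQ(lambda)."""
    z = gamma * lam * rho_prev * z + phi                        # eligibility trace
    delta = rho * (r + gamma * phi_next @ theta) - phi @ theta  # off-policy TD error
    corr = gamma * rho * (1.0 - lam) * phi_next * (z @ w)       # gradient-correction term
    theta = theta + alpha * (z * delta - corr)
    # auxiliary weights, updated on the faster timescale (beta >> alpha)
    delta_new = rho * (r + gamma * phi_next @ theta) - phi @ theta
    w = w + beta * (z * delta_new - phi * (phi @ w))
    return theta, w, z
```

In a learning loop, `rho_prev` is simply the importance weight of the previous transition (1 before the first one).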
Following the general pattern, at step i we would like to come up with a new parameter θi that moves (from θ_{i-1}) closer to the minimum of the function

$$\omega \mapsto J(\omega, \theta_i) = \sum_{j=1}^i \big( \hat T^\lambda_j \hat V_{\theta_i} - \hat V_\omega(s_j) \big)^2.$$

This problem is tricky since the function to minimize contains what we want to compute, θi, as a parameter. For this reason we cannot directly perform a stochastic gradient descent on the right-hand side. Instead, we will consider an alternative (but equivalent) formulation of the projected fixed-point minimization, $\theta = \operatorname{argmin}_\omega \|V_\omega - \Pi_0 T^\lambda V_\omega\|^2$, and will move from θ_{i-1} to θi by making one step of gradient descent on an estimate of the function $\theta \mapsto \|V_\theta - \Pi_0 T^\lambda V_\theta\|^2$. With the following vectorial notations:

$$\hat V_\omega = \begin{pmatrix} \hat V_\omega(s_1) & \dots & \hat V_\omega(s_i) \end{pmatrix}^\top, \quad \hat T^\lambda \hat V_\omega = \begin{pmatrix} \hat T^\lambda_1 \hat V_\omega & \dots & \hat T^\lambda_i \hat V_\omega \end{pmatrix}^\top, \quad \hat\Phi = \begin{pmatrix} \phi(s_1) & \dots & \phi(s_i) \end{pmatrix}^\top, \quad \hat\Pi_0 = \hat\Phi (\hat\Phi^\top \hat\Phi)^{-1} \hat\Phi^\top,$$

we consider the following objective function:

$$J(\theta) = \big\| \hat V_\theta - \hat\Pi_0 \hat T^\lambda \hat V_\theta \big\|^2 = \big( \hat V_\theta - \hat T^\lambda \hat V_\theta \big)^\top \hat\Pi_0 \big( \hat V_\theta - \hat T^\lambda \hat V_\theta \big) = \Big( \sum_{j=1}^i \delta^\lambda_j(\theta) \phi_j \Big)^\top \big( \hat\Phi^\top \hat\Phi \big)^{-1} \Big( \sum_{j=1}^i \delta^\lambda_j(\theta) \phi_j \Big).$$

This is the derivation followed by Sutton et al. (2009) in the case λ = 0 and by Maei and Sutton (2010) in the case λ > 0 (and off-policy learning). Let us introduce the following notation:

$$g^\lambda_j = \nabla \hat T^\lambda_j \hat V_\theta. \quad (18)$$

Note that since we consider a linear approximation, this quantity does not depend on θ. Noticing that $\nabla \delta^\lambda_j(\theta) = -(\phi_j - g^\lambda_j)$, we can compute the gradient of J(θ):

$$-\frac{1}{2} \nabla J(\theta) = \Big( \sum_{j=1}^i (\phi_j - g^\lambda_j) \phi_j^\top \Big) \big( \hat\Phi^\top \hat\Phi \big)^{-1} \Big( \sum_{j=1}^i \delta^\lambda_j(\theta) \phi_j \Big). \quad (19)$$

Expanding $(\phi_j - g^\lambda_j)\phi_j^\top = \phi_j\phi_j^\top - g^\lambda_j\phi_j^\top$ and using $\big(\sum_j \phi_j\phi_j^\top\big)(\hat\Phi^\top\hat\Phi)^{-1} = I$, this gradient rewrites as

$$-\frac{1}{2} \nabla J(\theta) = \sum_{j=1}^i \delta^\lambda_j(\theta)\phi_j - \Big( \sum_{j=1}^i g^\lambda_j \phi_j^\top \Big) \big( \hat\Phi^\top \hat\Phi \big)^{-1} \Big( \sum_{j=1}^i \delta^\lambda_j(\theta)\phi_j \Big).$$

Let $w_i(\theta)$ be a quasi-stationary estimate of the last part, which can be recognized as the solution of a least-squares problem (a regression of the λ-TD errors $\delta^\lambda_j$ on the features φj):

$$w_i(\theta) = \big( \hat\Phi^\top \hat\Phi \big)^{-1} \sum_{j=1}^i \delta^\lambda_j(\theta)\phi_j = \operatorname*{argmin}_{w} \sum_{j=1}^i \big( \phi_j^\top w - \delta^\lambda_j(\theta) \big)^2.$$

The identification with the above least-squares solution suggests using the following stochastic gradient descent to form the quasi-stationary estimate: $w_i = w_{i-1} + \beta_i \phi_i \big( \delta^\lambda_i(\theta_{i-1}) - \phi_i^\top w_{i-1} \big)$. This update rule makes use of the λ-TD error, defined through a forward view.
As for the previous algorithm, we can use Proposition 6 to obtain the following backward view update rule, which is equivalent (in expectation, when the Markov chain reaches its steady state):

$$w_i = w_{i-1} + \beta_i \big( z_i \delta_i(\theta_{i-1}) - \phi_i (\phi_i^\top w_{i-1}) \big). \quad (20)$$

Using this quasi-stationary estimate, the gradient can be approximated as:

$$-\frac{1}{2}\nabla J(\theta) \approx \sum_{j=1}^i \delta^\lambda_j(\theta)\phi_j - \Big( \sum_{j=1}^i g^\lambda_j \phi_j^\top \Big) w_i.$$

Therefore, a stochastic gradient descent gives the following update rule for the parameter vector θ:

$$\theta_i = \theta_{i-1} + \alpha_i \big( \delta^\lambda_i(\theta_{i-1}) \phi_i - g^\lambda_i \phi_i^\top w_i \big). \quad (21)$$

Once again, the forward view term $\delta^\lambda_i(\theta_{i-1})\phi_i$ can be turned into a backward view by using Proposition 6. There remains to work on the term $g^\lambda_i \phi_i^\top$. First, one can notice that the term $g^\lambda_i$ satisfies a forward recursion.

Lemma 7 We have $g^\lambda_i = \gamma\rho_i(1-\lambda)\phi_{i+1} + \gamma\lambda\rho_i g^\lambda_{i+1}$.

Proof This result is simply obtained by applying the gradient (with respect to θ) to the forward recursion of $\hat T^\lambda_i \hat V_\theta$ provided in Lemma 4.

Using this, the term $g^\lambda_i \phi_i^\top$ can be worked out similarly to the term $\delta^\lambda_i(\theta_{i-1})\phi_i$.

Proposition 8 Let $z_i$ be the eligibility vector defined in Proposition 6. We have $\mathbb{E}_{\mu_0}[g^\lambda_i \phi_i^\top] = \mathbb{E}_{\mu_0}[\gamma\rho_i(1-\lambda)\phi_{i+1} z_i^\top]$.

Proof The proof is similar to that of Proposition 6. Writing $b_i = \gamma\rho_i(1-\lambda)\phi_{i+1}$ and $\eta_i = \gamma\lambda\rho_i$, we have
$$\mathbb{E}_{\mu_0}[g^\lambda_i \phi_i^\top] = \mathbb{E}_{\mu_0}[(b_i + \eta_i g^\lambda_{i+1})\phi_i^\top] = \mathbb{E}_{\mu_0}[b_i \phi_i^\top] + \mathbb{E}_{\mu_0}[\eta_{i-1}(b_i + \eta_i g^\lambda_{i+1})\phi_{i-1}^\top] = \dots = \mathbb{E}_{\mu_0}[b_i z_i^\top].$$

Using this result and Proposition 6, it is natural to replace Equation (21) by an update based on a backward recursion:

$$\theta_i = \theta_{i-1} + \alpha_i \big( z_i \delta_i(\theta_{i-1}) - \gamma\rho_i(1-\lambda)\phi_{i+1}(z_i^\top w_{i-1}) \big). \quad (22)$$

Last but not least, for the estimate $w_i$ to be indeed quasi-stationary, the learning rates should satisfy the following condition (in addition to the classical conditions): $\lim_{i\to\infty} \alpha_i / \beta_i = 0$. Equations (22) and (20) define the off-policy TDC(λ) algorithm, summarized in Algorithm 6. It was originally proposed by Maei and Sutton (2010) under the name GQ(λ). We call it off-policy TDC(λ) to highlight the fact that it is the extension of the original TDC algorithm of Sutton et al.
(2009) to off-policy learning with traces. One can observe (to our knowledge, this was never mentioned in the literature before) that when λ = 1, the learning rule of TDC(1) reduces to that of TD(1). Maei and Sutton (2010) show that the algorithm converges with probability 1 to the same solution as the LSTD(λ) algorithm (that is, to $\theta = A^{-1} b$) under some technical assumptions. Contrary to off-policy TD(λ), this algorithm does not require $\Pi_0 T^\lambda$ to be a contraction in order to be convergent. Unfortunately, one of the assumptions made in the analysis, requiring that the traces $z_i$ have uniformly bounded second moments, is restrictive, since in an off-policy setting the traces $z_i$ may easily have an infinite variance (unless the behavior policy is really close to the target policy), as noted by Yu (2010a).13 A full proof of convergence thus still remains to be done.

Algorithm 6: Off-policy TDC(λ), also known as GQ(λ)
Initialization: initialize vectors θ0 and w0; set z0 = 0;
for i = 1, 2, . . . do
  Observe φi, ri, φi+1;
  Update traces: $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$;
  Update parameters:
  $\theta_i = \theta_{i-1} + \alpha_i \big( z_i(\rho_i r_i + \gamma\rho_i\phi_{i+1}^\top\theta_{i-1} - \phi_i^\top\theta_{i-1}) - \gamma\rho_i(1-\lambda)\phi_{i+1}(z_i^\top w_{i-1}) \big)$;
  $w_i = w_{i-1} + \beta_i \big( z_i(\rho_i r_i + \gamma\rho_i\phi_{i+1}^\top\theta_i - \phi_i^\top\theta_i) - \phi_i(\phi_i^\top w_{i-1}) \big)$;

Using the same principle (performing a stochastic gradient descent to minimize J(θ)), an alternative to TDC, the GTD2 algorithm, was derived by Sutton et al. (2009) in the λ = 0 case. As far as we know, it has never been extended to off-policy learning with traces; we do it now. Notice that, given the derivation of GQ(λ), obtaining this algorithm is pretty straightforward. To do so, we can start back from Equation (19):

$$-\frac{1}{2}\nabla J(\theta) = \Big( \sum_{j=1}^i (\phi_j - g^\lambda_j)\phi_j^\top \Big) \big( \hat\Phi^\top\hat\Phi \big)^{-1} \Big( \sum_{j=1}^i \delta^\lambda_j(\theta)\phi_j \Big) \approx \Big( \sum_{j=1}^i (\phi_j - g^\lambda_j)\phi_j^\top \Big) w_i.$$

This suggests the following alternative update rule (based on a forward recursion):

$$\theta_i = \theta_{i-1} + \alpha_i (\phi_i - g^\lambda_i)\phi_i^\top w_i.$$

13. See also Randhawa and Juneja (2004).
Using Proposition 8, it is natural to use the following alternative update rule, based on a backward recursion:

$$\theta_i = \theta_{i-1} + \alpha_i \big( \phi_i(\phi_i^\top w_{i-1}) - \gamma\rho_i(1-\lambda)\phi_{i+1}(z_i^\top w_{i-1}) \big).$$

The update of $w_i$ remains the same, and put together this gives off-policy GTD2(λ), summarized in Algorithm 7. The analysis of this new algorithm constitutes a potential topic for future research.

Algorithm 7: Off-policy GTD2(λ)
Initialization: initialize vectors θ0 and w0; set z0 = 0;
for i = 1, 2, . . . do
  Observe φi, ri, φi+1;
  Update traces: $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$;
  Update parameters:
  $\theta_i = \theta_{i-1} + \alpha_i \big( \phi_i(\phi_i^\top w_{i-1}) - \gamma\rho_i(1-\lambda)\phi_{i+1}(z_i^\top w_{i-1}) \big)$;
  $w_i = w_{i-1} + \beta_i \big( z_i(\rho_i r_i + \gamma\rho_i\phi_{i+1}^\top\theta_i - \phi_i^\top\theta_i) - \phi_i(\phi_i^\top w_{i-1}) \big)$;

4.3 Off-policy gBRM(λ)

The algorithm proposed by Baird (1995) minimizes the Bellman residual using a gradient-based approach, in the no-trace and on-policy case. We extend it to eligibility traces and to off-policy learning, which corresponds to instantiating ξ = ω. The cost function to be minimized is then:

$$\sum_{j=1}^i \big( \hat T^\lambda_j \hat V_\omega - \hat V_\omega(s_j) \big)^2.$$

Following the negative of the gradient of the last term leads to the following update rule:

$$\theta_i = \theta_{i-1} - \frac{\alpha_i}{2}\nabla_\omega\big( \hat T^\lambda_i \hat V_\omega - \hat V_\omega(s_i) \big)^2\Big|_{\omega=\theta_{i-1}} = \theta_{i-1} - \alpha_i\,\nabla_\omega\big( \hat T^\lambda_i \hat V_\omega - \hat V_\omega(s_i) \big)\Big|_{\omega=\theta_{i-1}}\big( \hat T^\lambda_i \hat V_{\theta_{i-1}} - \hat V_{\theta_{i-1}}(s_i) \big) = \theta_{i-1} + \alpha_i \big( \phi_i - g^\lambda_i \big)\delta^\lambda_i(\theta_{i-1}),$$

recalling the notation $g^\lambda_i = \nabla \hat T^\lambda_i \hat V_\omega$ first defined in Equation (18). As usual, this update involves a forward view, which we are going to turn into a backward view. The term $\phi_i\delta^\lambda_i$ can be handled thanks to Proposition 6. The term $g^\lambda_i\delta^\lambda_i$ is more difficult, as it is the product of two forward views (until now, we only considered the product of a forward view with a non-recursive term). This can be done thanks to the following original relation (the proof being somewhat tedious, it is deferred to Appendix C):

Proposition 9 Write $g^\lambda_i = \nabla_\omega \hat T^\lambda_i \hat V_\omega$ and define $c_i = 1 + (\gamma\lambda\rho_{i-1})^2 c_{i-1}$, $\zeta_i = \gamma\rho_i(1-\lambda)\phi_{i+1}c_i + \gamma\lambda\rho_{i-1}\zeta_{i-1}$ and $d_i = \delta_i c_i + \gamma\lambda\rho_{i-1} d_{i-1}$. We have that

$$\mathbb{E}_{\mu_0}[\delta^\lambda_i g^\lambda_i] = \mathbb{E}_{\mu_0}\big[\delta_i\zeta_i + d_i\gamma\rho_i(1-\lambda)\phi_{i+1} - \delta_i\gamma\rho_i(1-\lambda)\phi_{i+1}c_i\big].$$
This result (together with Proposition 6) suggests updating the parameters as follows:

$$\theta_i = \theta_{i-1} + \alpha_i \big( \delta_i(z_i + \gamma\rho_i(1-\lambda)\phi_{i+1}c_i - \zeta_i) - d_i\gamma\rho_i(1-\lambda)\phi_{i+1} \big).$$

This gives the off-policy gBRM(λ) algorithm, depicted in Algorithm 8. One can observe that gBRM(1) is equivalent to TD(1) (and thus also to TDC(1), cf. the comment before the description of Algorithm 6). The analysis of this new algorithm is left for future research.

5. Empirical Study

This section aims at empirically comparing the surveyed algorithms. As they only address the policy evaluation problem, we compare the algorithms in their ability to perform policy evaluation (no control, no policy optimization); however, they may straightforwardly be used in an approximate policy iteration scheme (Bertsekas and Tsitsiklis, 1996; Munos, 2003). In order to assess their quality, we consider finite problems where the exact value function can be computed. More precisely, we consider Garnet problems (Archibald et al., 1995), which are a class of randomly constructed finite MDPs. They do not correspond to any specific application, but are totally abstract while remaining representative of the kind of MDP that might be encountered in practice. In our experiments, a Garnet is parameterized by 4 parameters and is written $G(n_S, n_A, b, p)$: $n_S$ is the number of states, $n_A$ is the number of actions, b

Algorithm 8: Off-policy gBRM(λ)
Initialization: initialize vector θ0; set z0 = 0, d0 = 0, c0 = 0, ζ0 = 0;
for i = 1, 2, . . .
do
  Observe φi, ri, φi+1;
  Update traces: $z_i = \phi_i + \gamma\lambda\rho_{i-1}z_{i-1}$; $c_i = 1 + (\gamma\lambda\rho_{i-1})^2 c_{i-1}$; $\zeta_i = \gamma\rho_i(1-\lambda)\phi_{i+1}c_i + \gamma\lambda\rho_{i-1}\zeta_{i-1}$; $d_i = (\rho_i r_i + \gamma\rho_i\phi_{i+1}^\top\theta_{i-1} - \phi_i^\top\theta_{i-1})c_i + \gamma\lambda\rho_{i-1}d_{i-1}$;
  Update parameters: $\theta_i = \theta_{i-1} + \alpha_i\big( (\rho_i r_i + \gamma\rho_i\phi_{i+1}^\top\theta_{i-1} - \phi_i^\top\theta_{i-1})(z_i + \gamma\rho_i(1-\lambda)\phi_{i+1}c_i - \zeta_i) - d_i\gamma\rho_i(1-\lambda)\phi_{i+1} \big)$;

is a branching factor specifying how many next states are possible for each state-action pair (these b states are chosen uniformly at random, and transition probabilities are set by sampling b − 1 uniform random cut points between 0 and 1), and p is the number of features (for function approximation). The reward is state-dependent: for a given randomly generated Garnet problem, the reward for each state is uniformly sampled between 0 and 1. Features are chosen randomly: Φ is an $n_S \times p$ feature matrix whose components are sampled uniformly between 0 and 1. The discount factor γ is set to 0.95 in all experiments. We consider two types of problems, "small" and "big", respectively corresponding to the instances G(30, 2, 2, 8) and G(100, 4, 3, 20). We also consider two types of learning: on-policy and off-policy. In the on-policy setting, for each Garnet a policy π to be evaluated is randomly generated (by sampling $n_A - 1$ random cut points between 0 and 1 for each state), and trajectories (to be used for learning) are sampled according to this same policy. In the off-policy setting, the policy π to be evaluated is randomly generated in the same way, but trajectories are sampled according to a different (similarly randomly generated) behavior policy π0. For all algorithms, we choose θ0 = 0. For the least-squares algorithms (LSTD, LSPE, FPKF and BRM), we set the initial matrices (M0, N0, C0) to $10^3 I$ (the higher this value, the more negligible its effect on estimates; we observed that this parameter did not play a crucial role in practice). We run a first set of experiments in order to set all other parameters (eligibility factor and learning rates).
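A Garnet instance as described above can be generated in a few lines; a minimal sketch (the function name and return convention are ours, following the construction in the text):

```python
import numpy as np

def make_garnet(n_s, n_a, b, p, seed=0):
    """Sketch of a Garnet problem G(n_s, n_a, b, p) (Archibald et al., 1995).
    Returns (P, R, Phi): transition kernel, state rewards, feature matrix."""
    rng = np.random.default_rng(seed)
    P = np.zeros((n_s, n_a, n_s))
    for s in range(n_s):
        for a in range(n_a):
            succ = rng.choice(n_s, size=b, replace=False)  # b possible next states
            cuts = np.sort(rng.random(b - 1))              # b-1 uniform cut points
            P[s, a, succ] = np.diff(np.concatenate(([0.0], cuts, [1.0])))
    R = rng.random(n_s)          # state-dependent reward, uniform in [0, 1)
    Phi = rng.random((n_s, p))   # random feature matrix, uniform in [0, 1)
    return P, R, Phi
```

For instance, `make_garnet(30, 2, 2, 8)` produces a "small" instance and `make_garnet(100, 4, 3, 20)` a "big" one.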
We use the following schedule for the learning rates:

$$\alpha_i = \alpha_0 \frac{\alpha_c}{\alpha_c + i} \quad\text{and}\quad \beta_i = \beta_0 \left( \frac{\beta_c}{\beta_c + i} \right)^{\frac{2}{3}}.$$

More precisely, we generate 30 problems (MDPs and policies) for each possible combination small/big × on-policy/off-policy (leading to four cases). For each problem, we generate one trajectory of length $10^4$ using the behavioral policy (which is the randomly generated target policy in the on-policy case and the behavior policy in the off-policy case), to be used by all algorithms. For each meta-parameter, we consider the following ranges of values: λ ∈ {0, 0.4, 0.7, 0.9, 1}, α0 ∈ {10^-2, 10^-1, 10^0}, αc ∈ {10^1, 10^2, 10^3}, β0 ∈ {10^-2, 10^-1, 10^0} and βc ∈ {10^1, 10^2, 10^3}. Then, we compute the parameter estimates for all algorithms instantiated with each possible combination of the meta-parameters. This gives for each combination a family $\theta_{i,d}$, with i the number of transitions encountered in the trajectory of the d-th problem. Finally, for each case, for all problems and each algorithm, we choose the combination of meta-parameters which minimizes the average error on the last one-tenth of the averaged (over all problems) learning curves (we do this to reduce the sensitivity to the initialization and the transient behavior). Formally, we pick the set of parameters that minimizes the quantity

$$\sum_{d=1}^{30} \sum_{i=9\cdot 10^3}^{10^4} \big\| \Phi\theta_{i,d} - V^{\pi_d} \big\|_2.$$

We provide the empirical results of this first set of experiments in Tables 3 to 6. As a complement, we detail in Figure 1 the sensitivity of all algorithms with respect to the main parameter λ that controls the eligibility traces (averaged over the 30 problems, with the best global meta-parameters for each choice of λ). We comment on these results below.

        λ     α0     αc     β0     βc     err
LSTD    1.0                               2.07
LSPE    1.0                               2.07
FPKF    1.0                               2.07
BRM     1.0                               2.07
TD      1.0   10^-2  10^3                 2.06
gBRM    1.0   10^-2  10^3                 2.06
TDC     1.0   10^-2  10^3   10^-2  10^1   2.06
GTD2    1.0   10^-2  10^3   10^-1  10^2   2.05

Table 3: Small problem (G(30, 2, 2, 8)), on-policy learning (π = π0).
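These schedules can be written down directly; a small sketch (function names are ours, and the grouping of the 2/3 exponent follows our reading of the formula above, chosen so that the two-timescale condition $\alpha_i/\beta_i \to 0$ of Section 4.2 holds):

```python
def alpha_schedule(i, alpha0, alpha_c):
    # alpha_i = alpha_0 * alpha_c / (alpha_c + i), a standard Robbins-Monro decay
    return alpha0 * alpha_c / (alpha_c + i)

def beta_schedule(i, beta0, beta_c):
    # beta_i = beta_0 * (beta_c / (beta_c + i)) ** (2/3): a slower decay, so that
    # alpha_i / beta_i -> 0, as required for the quasi-stationarity of w_i
    return beta0 * (beta_c / (beta_c + i)) ** (2.0 / 3.0)
```

With these choices, $\alpha_i / \beta_i$ behaves like $i^{-1/3}$ and indeed vanishes as i grows.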
Table 3 shows the best global meta-parameters over the 30 considered instances (one trajectory per instance) of a small Garnet problem in an on-policy setting, as well as the related efficiency. Numerically, all methods provide equivalent performance (the slight difference in error is not statistically significant, given the variance of the estimates). All methods use the same eligibility factor (λ = 1), leading to a Monte Carlo estimate, to reach their best performance. Figure 1 (top, left) shows that this choice of λ does matter, and that BRM, gBRM and FPKF are more sensitive to a good choice of the eligibility factor. Table 4 shows the best global meta-parameters over the 30 considered instances (one trajectory per instance) of a big Garnet problem in an on-policy setting, as well as the related performance. These results are consistent with those of the small problem in the on-policy setting (with rather different meta-parameters, apart from the eligibility factor). Here again, the algorithms need the highest value of λ to perform best, except TDC and GTD2, which nevertheless take a high value of λ.

        λ     α0     αc     β0     βc     err
LSTD    1.0                               1.20
LSPE    1.0                               1.20
FPKF    1.0                               1.20
BRM     1.0                               1.20
TD      1.0   10^-1  10^1                 1.25
gBRM    1.0   10^-1  10^1                 1.25
TDC     0.9   10^-1  10^2   10^-1  10^2   1.21
GTD2    0.9   10^-1  10^2   10^-2  10^3   1.22

Table 4: Big problem (G(100, 4, 3, 20)), on-policy learning (π = π0).

Figure 1 (top, right) suggests that as the problem's size grows, the role of the eligibility factor gets more prominent (but with a similar behavior).

        λ     α0     αc     β0     βc     err
LSTD    0.4                               3.69
LSPE    0.4                               3.69
FPKF    0.7                               4.74
BRM     0.0                               4.42
TD      0.4   10^-1  10^2                 3.85
gBRM    0.0   10^-2  10^1                 10.42
TDC     0.4   10^-1  10^1   10^-2  10^1   7.81
GTD2    0.4   10^-1  10^3   10^-2  10^1   4.53

Table 5: Small problem (G(30, 2, 2, 8)), off-policy learning (π ≠ π0).

Table 5 reports the best meta-parameters in an off-policy setting for a small problem (still for 30 instances).
Regarding the least-squares methods, LSTD and LSPE get the best results, whereas FPKF and BRM suffer more from the off-policy aspect. Regarding the gradient methods, TD's performance is good (it is close to that of LSTD/LSPE and better than BRM/FPKF), followed closely by GTD2. TDC and gBRM lead to the worst results. All algorithms use a small or intermediate value of the eligibility factor. Increasing λ would reduce the bias, but the performance would suffer from the variance due to importance sampling, as also shown in Figure 1 (bottom, left). Eventually, Table 6 shows the meta-parameters and performance in the most difficult situation: the off-policy setting of the big problem. These results are consistent with the off-policy results of the small problem, summarized in Table 5. LSTD and LSPE are the most efficient least-squares algorithms and choose the smallest possible value λ = 0. FPKF's and BRM's performances deteriorate (significantly for the latter). TD behaves very well and GTD2 follows closely. The performances of TDC and gBRM are the worst. Figure 1 (bottom, right) is similar to that of the small problem. It shows that TD (with a good learning rate) is quite stable, in particular more so than LSTD/LSPE.

Figure 1: Sensitivity of the performance of the algorithms (y-axis, in logarithmic scale) with respect to the eligibility trace parameter λ (x-axis), for LSTD, LSPE, FPKF, BRM, TD, gBRM, TDC and GTD2. Left: small problem (G(30, 2, 2, 8)); right: big problem (G(100, 4, 3, 20)). Top: on-policy learning (π = π0); bottom: off-policy learning (π ≠ π0).
The main goal of the series of experiments we have just described was to choose reasonable values for the meta-parameters. We have also used these experiments to briefly comment on the relative performance of the algorithms, but this is not statistically significant, as it was based on a few (random) problems on which the meta-parameters have been optimized. Though we will see that the general behavior of the algorithms is globally consistent with what we have seen so far, the series of experiments that we are about to describe aims at providing such a statistically significant performance comparison. For each situation (small and big problems, on- and off-policy), we fix the meta-parameters to the previously reported values and compare the algorithms on several new instances of the problems. The results are reported in Figures 2 to 5. For each of the 4 problems, we randomly generate 100 instances (MDP and policy to be evaluated). For each such problem, we generate a trajectory of length $10^5$; then, all algorithms learn using this very trajectory. On each figure, we report the average performance (left), measured as the difference between the true value function (computed from the model) and the currently estimated one, $\|V^\pi - \Phi\theta\|_2$, as well as the associated standard deviation (right).

        λ     α0     αc     β0     βc     err
LSTD    0.0                               3.76
LSPE    0.0                               3.86
FPKF    0.7                               4.80
BRM     1.0                               10.05
TD      0.4   10^-1  10^1                 2.96
gBRM    0.0   10^-2  10^1                 10.50
TDC     0.0   10^-1  10^1   10^-2  10^1   8.65
GTD2    0.0   10^-1  10^3   10^-2  10^1   4.41

Table 6: Big problem (G(100, 4, 3, 20)), off-policy learning (π ≠ π0).

Figure 2: Performance for small problems (G(30, 2, 2, 8)), on-policy learning (π = π0) (left: average error; right: standard deviation).

We begin by discussing the results in the on-policy setting.
Figure 2 compares all algorithms on 100 randomly generated small problems (that is, each run corresponds to different dynamics, reward function, features and evaluated policy), the meta-parameters being those provided in Table 3. All least-squares approaches provide the best results and are bunched together; this was to be expected, as they all use λ equal to 1. The gBRM, TD and TDC algorithms provide the same results (being equivalent with the choice λ = 1); they are slower than GTD2, which is itself slower than the least-squares algorithms. Figure 3 compares the algorithms on 100 randomly generated big problems, the meta-parameters being those provided in Table 4. These results are similar to those of the small problems, except that TDC now has a different (and slower) behavior, due to the different choice of the eligibility factor (λ = 0.9). GTD2 is still the best gradient-based algorithm.

Figure 3: Performance for big problems (G(100, 4, 3, 20)), on-policy learning (π = π0) (left: average error, right: standard deviation).

Figure 4: Performance for small problems (G(30, 2, 2, 8)), off-policy learning (π ≠ π0) (left: average error, right: standard deviation).

We now consider the off-policy setting. Figure 4 provides the average performance and standard deviation of the algorithms (the meta-parameters being those of Table 5) on 100 small problems. Once again, we can see that LSTD/LSPE provide the best results. The two other least-squares methods (FPKF and BRM) are overtaken by the gradient-based TD algorithm, which follows LSTD/LSPE closely.
GTD2 is a little slower and TDC is the slowest algorithm. Figure 5 provides the same data for the big problems (with the meta-parameters of Table 6). These results are similar to those of the small problems in the off-policy setting, except that TD is even closer to LSTD/LSPE (but requires the choice of a learning rate).

Figure 5: Performance for big problems (G(100, 4, 3, 20)), off-policy learning (π ≠ π0) (left: average error, right: standard deviation).

Summary. Overall, our experiments suggest that the two best algorithms are LSTD and LSPE, since they converge much faster in all situations with less parameter tuning. The gradient-based TD algorithm globally displays a good behavior and constitutes a good alternative when the number p of features is too big for least-squares methods to be implemented. Though some new algorithms/extensions show interesting results (FPKF(λ) is consistently better than the state-of-the-art FPKF by Choi and Van Roy 2006, and gBRM works well in the on-policy setting), most of the other algorithms do not seem to be empirically competitive with the trio LSTD/LSPE/TD, especially in off-policy situations. In particular, the algorithms introduced specifically for the off-policy setting (TDC/GTD2) are much slower than TD in the off-policy case (though GTD2 is faster in the on-policy experiments, yet with more parameter tuning). Moreover, the condition required for the good behavior of LSPE, FPKF and TD (the contraction of Π0T^λ) does not seem to be very restrictive in practice (at least for the Garnet problems we considered): though it is possible to build specific pathological examples where these algorithms diverge, this never happened in our experiments.14
6. Conclusion And Future Work

We have considered least-squares and gradient-based algorithms for value estimation in an MDP context. Starting from the on-policy case with no trace, we have recalled that several algorithms (LSTD, LSPE, FPKF and BRM for least-squares approaches; TD, gBRM and TDC/GTD2 for gradient-based approaches) fall into a common algorithmic pattern, Equation (2). Substituting the original Bellman operator with an operator that deals with traces and off-policy samples naturally leads to the state-of-the-art off-policy trace-based versions of LSTD, LSPE, TD and TDC, and suggests natural extensions of FPKF, BRM, gBRM and GTD2. This way, we surveyed many known and new off-policy eligibility-trace-based algorithms for policy evaluation. We have explained how to derive recursive (memory- and time-efficient) implementations of all these algorithms and discussed their known convergence properties, including an original analysis of BRM(λ) for sufficiently small λ, which implies the previously unknown convergence of GPTD/KTD. Interestingly, it appears that the analysis of off-policy trace-based stochastic gradient algorithms under mild assumptions is still an open problem: the only currently known analysis of TD (Yu, 2010a) applies only to a constrained version of the algorithm, and that of TDC (Maei and Sutton, 2010) relies on an assumption on the boundedness of the second moment of the traces that is restrictive (Yu, 2010a). Filling this theoretical gap, as well as providing complete analyses for the other gradient algorithms and for FPKF(λ) and BRM(λ), constitutes important future work.

14. A preliminary version of this article (Scherrer and Geist, 2011) contains such examples, and also an example where an adversarial choice of λ leads to the divergence of LSTD(λ).
Finally, we have illustrated and compared the behavior of these algorithms; this constitutes the first exhaustive empirical comparison of linear methods.15 Overall, our study suggests that even if the use of eligibility traces generally improves the efficiency of all algorithms, LSTD and LSPE consistently provide the best estimates; and in situations where the computational cost is prohibitive for a least-squares approach (when the number p of features is large), TD probably constitutes the best alternative.

15. To our knowledge, there does not even exist any work reporting and comparing empirical results of LSTD(0) and FPKF(0).

Appendix A. Derivation Of The Recursive Formulas For BRM(λ)

We here detail the derivation of off-policy BRM(λ). We will need two technical lemmas. The first one is the Woodbury matrix identity, which generalizes the Sherman-Morrison formula (given in Lemma 1).

Lemma 10 (Woodbury) Let A, U, C and V be matrices of compatible sizes. Then:

    (A + UCV)^{-1} = A^{-1} − A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}.

The second lemma is a rewriting of imbricated sums:

Lemma 11 Let f : N × N × N → R and n ∈ N. We have:

    Σ_{i=1}^n Σ_{j=i}^n Σ_{k=i}^n f(i,j,k) = Σ_{i=1}^n Σ_{j=1}^i Σ_{k=1}^j f(k,i,j) + Σ_{i=1}^n Σ_{j=1}^{i−1} Σ_{k=1}^j f(k,j,i).

As stated in Equation (14), we have the following batch estimate for BRM(λ):

    θ_i = argmin_{ω ∈ R^p} Σ_{j=1}^i (z_{j→i} − ψ_{j→i}^T ω)² = (Ã_i)^{-1} b̃_i,

where ψ_{j→i} = Σ_{k=j}^i ρ̃_j^{k−1} φ̄_k and z_{j→i} = Σ_{k=j}^i ρ̃_j^{k−1} ρ_k r_k, with Ã_i = Σ_{j=1}^i ψ_{j→i} ψ_{j→i}^T and b̃_i = Σ_{j=1}^i ψ_{j→i} z_{j→i}.

To obtain a recursive formula, these two sums have to be reworked through Lemma 11. Let us first focus on the latter:

    b̃_i = Σ_{j=1}^i ψ_{j→i} z_{j→i}
        = Σ_{j=1}^i ( Σ_{k=j}^i ρ̃_j^{k−1} φ̄_k ) ( Σ_{m=j}^i ρ̃_j^{m−1} ρ_m r_m )
        = Σ_{j=1}^i Σ_{k=1}^j Σ_{m=1}^k ρ̃_m^{j−1} φ̄_j ρ̃_m^{k−1} ρ_k r_k + Σ_{j=1}^i Σ_{k=1}^{j−1} Σ_{m=1}^k ρ̃_m^{k−1} φ̄_k ρ̃_m^{j−1} ρ_j r_j.

Writing y_k = Σ_{m=1}^k (ρ̃_m^{k−1})² = 1 + (γλρ_{k−1})² y_{k−1}, we have that, for k ≤ j:

    Σ_{m=1}^k ρ̃_m^{j−1} ρ̃_m^{k−1} = ρ̃_k^{j−1} y_k.

Therefore:

    b̃_i = Σ_{j=1}^i Σ_{k=1}^j ρ̃_k^{j−1} y_k φ̄_j ρ_k r_k + Σ_{j=1}^i Σ_{k=1}^{j−1} ρ̃_k^{j−1} y_k φ̄_k ρ_j r_j.

With the following notations:

    z_j = Σ_{k=1}^j ρ̃_k^{j−1} y_k ρ_k r_k = γλρ_{j−1} z_{j−1} + ρ_j r_j y_j,
    D_j = Σ_{k=1}^j ρ̃_k^{j−1} y_k φ̄_k = γλρ_{j−1} D_{j−1} + y_j φ̄_j,

and with the convention that z_0 = 0 and D_0 = 0, one can write:

    b̃_i = Σ_{j=1}^i ( φ̄_j ρ_j r_j y_j + γλρ_{j−1} ( φ̄_j z_{j−1} + ρ_j r_j D_{j−1} ) ).
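As a quick sanity check of the derivation above, the recursive definitions of y_j, z_j and D_j can be compared numerically with their batch-sum definitions (the random data, the explicit ρ̃ helper, and the 1-based indexing convention are our test scaffolding):

```python
import numpy as np

rng = np.random.default_rng(1)
T, p = 12, 3
gamma, lam = 0.9, 0.7
rho = rng.uniform(0.5, 1.5, size=T + 1)
phibar = rng.normal(size=(T + 1, p))   # stands for phi_k - gamma rho_k phi_{k+1}
r = rng.normal(size=T + 1)

def rho_tilde(j, k):
    # rho_tilde_j^{k-1} = (lam*gamma)^(k-j) * prod_{m=j}^{k-1} rho_m
    return (lam * gamma) ** (k - j) * np.prod(rho[j:k])

# batch definition of y_k (entry 0 unused; indices run from 1 to T)
y_batch = [np.nan] + [sum(rho_tilde(m, k) ** 2 for m in range(1, k + 1))
                      for k in range(1, T + 1)]

# recursive definitions, as in the derivation
y, z, D = np.zeros(T + 1), np.zeros(T + 1), np.zeros((T + 1, p))
for k in range(1, T + 1):
    y[k] = 1.0 + (gamma * lam * rho[k - 1]) ** 2 * y[k - 1]
    z[k] = gamma * lam * rho[k - 1] * z[k - 1] + rho[k] * r[k] * y[k]
    D[k] = gamma * lam * rho[k - 1] * D[k - 1] + y[k] * phibar[k]

print(max(abs(y[k] - y_batch[k]) for k in range(1, T + 1)))  # numerically zero
```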
Similarly, one can show that:

    Ã_i = Σ_{j=1}^i ψ_{j→i} ψ_{j→i}^T = Σ_{j=1}^i ( φ̄_j φ̄_j^T y_j + γλρ_{j−1} ( φ̄_j D_{j−1}^T + D_{j−1} φ̄_j^T ) ).

Writing

    u_j = √(y_j) φ̄_j,    v_j = (γλρ_{j−1} / √(y_j)) D_{j−1},

and I_2 the 2 × 2 identity matrix, we have:

    Ã_i = Σ_{j=1}^i ( (u_j + v_j)(u_j + v_j)^T − v_j v_j^T ) = Ã_{i−1} + U_i V_i,

with U_i = ( u_i + v_i   −v_i ) and V_i = ( u_i + v_i   v_i )^T. We can apply the Woodbury identity given in Lemma 10 with C = I_2: writing C_i = (Ã_i)^{-1},

    C_i = (Ã_{i−1} + U_i I_2 V_i)^{-1} = C_{i−1} − C_{i−1} U_i (I_2 + V_i C_{i−1} U_i)^{-1} V_i C_{i−1}.

The other sum can also be reworked:

    b̃_i = b̃_{i−1} + φ̄_i ρ_i r_i y_i + γλρ_{i−1} ( D_{i−1} ρ_i r_i + φ̄_i z_{i−1} ) = b̃_{i−1} + U_i W_i,

with W_i = ( √(y_i) ρ_i r_i + γλρ_{i−1} z_{i−1} / √(y_i) ;  γλρ_{i−1} z_{i−1} / √(y_i) ). Finally, the recursive BRM(λ) estimate can be computed as follows:

    θ_i = C_i b̃_i = θ_{i−1} + C_{i−1} U_i (I_2 + V_i C_{i−1} U_i)^{-1} (W_i − V_i θ_{i−1}).

This gives BRM(λ) as provided in Algorithm 4.

Appendix B. Proof Of Theorem 3: Convergence Of BRM(λ)

The proof of Theorem 3 follows the general idea of that of Proposition 4 of Bertsekas and Yu (2009). It is done in two steps. First, we argue that the limit of the sequence is linked to that of an alternative algorithm for which one cuts the traces at a certain depth l. Then, we show that for every depth l this alternative algorithm converges almost surely, we explicitly compute its limit, and we make l tend to infinity to obtain the limit of BRM(λ). We will only show that (1/i) Ã_i tends to A; the argument is similar for (1/i) b̃_i → b. Consider the following l-truncated version of the algorithm, based on the following alternative traces (we here limit the memory of the traces to a size l):

    y_{k,l} = Σ_{m=max(1,k−l+1)}^k (ρ̃_m^{k−1})²,
    D_{j,l} = Σ_{k=max(1,j−l+1)}^j ρ̃_k^{j−1} y_{k,l} φ̄_k,

and update the following matrix:

    Ã_{i,l} = Ã_{i−1,l} + φ̄_i φ̄_i^T y_{i,l} + γλρ_{i−1} ( φ̄_i D_{i−1,l}^T + D_{i−1,l} φ̄_i^T ).

The assumption in Equation (15) implies that ρ̃_i^{j−1} ≤ β^{j−i}; therefore, it can be seen that for all k:

    |y_{k,l} − y_k| = Σ_{m=1}^{max(0,k−l)} (ρ̃_m^{k−1})² ≤ Σ_{m=1}^{max(0,k−l)} β^{2(k−m)} ≤ β^{2l} / (1 − β²) = ϵ1(l),

where ϵ1(l) tends to 0 when l tends to infinity.
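The truncation error ϵ1(l) can also be checked numerically: when ρ̃_m^{k−1} ≤ β^{k−m}, the gap between y_k and its l-truncated version stays below β^{2l}/(1 − β²). A sketch (β is taken as λγ times the largest sampled ratio, mirroring the role of the assumption in Equation (15); the data is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
T, gamma, lam = 40, 0.9, 0.6
rho = rng.uniform(0.3, 1.2, size=T + 1)
beta = lam * gamma * rho.max()            # guarantees rho_tilde <= beta^(k-m)

def rho_tilde(m, k):
    return (lam * gamma) ** (k - m) * np.prod(rho[m:k])

def y_full(k):
    return sum(rho_tilde(m, k) ** 2 for m in range(1, k + 1))

def y_trunc(k, l):
    # memory of the trace limited to depth l
    return sum(rho_tilde(m, k) ** 2 for m in range(max(1, k - l + 1), k + 1))

l = 8
gaps = [abs(y_full(k) - y_trunc(k, l)) for k in range(1, T + 1)]
bound = beta ** (2 * l) / (1 - beta ** 2)
print(max(gaps) <= bound)  # True
```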
Similarly, using the fact that y_k ≤ 1/(1 − β²) and writing K = max_{s,s'} ‖φ(s) − γφ(s')‖, one has for all j:

    ‖D_{j,l} − D_j‖ ≤ Σ_{k=1}^{max(0,j−l)} ρ̃_k^{j−1} y_k ‖φ̄_k‖ + Σ_{k=max(1,j−l+1)}^j ρ̃_k^{j−1} |y_{k,l} − y_k| ‖φ̄_k‖
                   ≤ Σ_{k=1}^{max(0,j−l)} β^{j−k} K / (1 − β²) + Σ_{k=max(1,j−l+1)}^j β^{j−k} ϵ1(l) K
                   ≤ (β^l / (1 − β)) (1 / (1 − β²)) K + (1 / (1 − β)) (β^{2l} / (1 − β²)) K = ϵ2(l),

where ϵ2(l) also tends to 0. Then, it can be seen that:

    ‖Ã_{i,l} − Ã_i‖ ≤ ‖Ã_{i−1,l} − Ã_{i−1}‖ + ‖φ̄_i φ̄_i^T‖ |y_{i,l} − y_i| + 2β ‖φ̄_i‖ ‖D_{i−1,l} − D_{i−1}‖
                   ≤ ‖Ã_{i−1,l} − Ã_{i−1}‖ + K² ϵ1(l) + 2βK ϵ2(l),

and, by induction on i, one obtains ‖Ã_{i,l} − Ã_i‖ ≤ i ϵ(l), where ϵ(l) = K² ϵ1(l) + 2βK ϵ2(l) tends to 0 when l tends to infinity. This implies that lim sup_i ‖Ã_{i,l}/i − Ã_i/i‖ ≤ ϵ(l). In other words, lim_i Ã_i/i and lim_l lim_i Ã_{i,l}/i are equal if the latter exists. In the remainder of the proof, we show that the latter limit indeed exists and we compute it explicitly.

Let us fix some l and consider the sequence (Ã_{i,l}/i). At some index i, y_{i,l} depends only on the last l samples, while D_{i,l} depends on the same samples and on the last l values of y_{j,l}, thus on the last 2l samples. It is then natural to view the computation of Ã_{i,l}, which is based on y_{i,l}, D_{i−1,l} and φ̄_i = φ_i − γρ_i φ_{i+1}, as being related to a Markov chain of which the states are the 2l + 1 consecutive states of the original chain (s_{i−2l}, ..., s_i, s_{i+1}). Write E0 the expectation with respect to its stationary distribution. By the Markov chain Ergodic Theorem, we have with probability 1:

    lim_i Ã_{i,l}/i = E0[ φ̄_{2l} φ̄_{2l}^T y_{2l,l} + λγρ_{2l−1} ( φ̄_{2l} D_{2l−1,l}^T + D_{2l−1,l} φ̄_{2l}^T ) ].   (23)

Let us now explicitly compute this expectation. Write x_i the indicator vector (of which the kth coordinate equals 1 when the state at time i is k, and 0 otherwise). One then has φ_i = Φ^T x_i.
Let us first look at the left part of the above limit:

    E0[ φ̄_{2l} φ̄_{2l}^T y_{2l,l} ]
      = E0[ (φ_{2l} − γρ_{2l} φ_{2l+1})(φ_{2l} − γρ_{2l} φ_{2l+1})^T y_{2l,l} ]
      = E0[ Φ^T (x_{2l} − γρ_{2l} x_{2l+1})(x_{2l} − γρ_{2l} x_{2l+1})^T Φ Σ_{m=l+1}^{2l} (λγ)^{2(2l−m)} (ρ_m^{2l−1})² ]
      = Φ^T { Σ_{m=l+1}^{2l} (λγ)^{2(2l−m)} E0[ (ρ_m^{2l−1})² (x_{2l} − γρ_{2l} x_{2l+1})(x_{2l} − γρ_{2l} x_{2l+1})^T ] } Φ
      = Φ^T { Σ_{m=l+1}^{2l} (λγ)^{2(2l−m)} E0[ X_{m,2l,2l} − γ X_{m,2l,2l+1} − γ X_{m,2l+1,2l} + γ² X_{m,2l+1,2l+1} ] } Φ,

where we used the definition ρ̃_j^{k−1} = (λγ)^{k−j} ρ_j^{k−1} and the notation X_{m,i,j} = ρ_m^{i−1} ρ_m^{j−1} x_i x_j^T. To finish the computation, we will mainly rely on the following lemma.

Lemma 12 (Some Identities) Let P̃ be the matrix of which the coordinates are p̃_{ss'} = Σ_a π(s,a) ρ(s,a) T(s,a,s'), which is in general not a stochastic matrix. Let µ0 be the stationary distribution of the behavior policy π0. Write D̃_i = diag( (P̃^T)^i µ0 ). Then:

    for m ≤ i,        E0[X_{m,i,i}] = D̃_{i−m},
    for m ≤ i ≤ j,    E0[X_{m,i,j}] = D̃_{i−m} P^{j−i},
    for m ≤ j ≤ i,    E0[X_{m,i,j}] = (P^T)^{i−j} D̃_{j−m}.

Proof We first observe that:

    E0[X_{m,i,i}] = E0[(ρ_m^{i−1})² x_i x_i^T] = E0[(ρ_m^{i−1})² diag(x_i)] = diag( E0[(ρ_m^{i−1})² x_i] ).

To provide the identity, we will thus simply prove by induction that E0[(ρ_m^{i−1})² x_i] = (P̃^T)^{i−m} µ0. For i = m, we have E0[x_m] = µ0. Now suppose the relation holds for i and let us prove it for i + 1. Write F_i the realization of the process until time i. We have:

    E0[(ρ_m^i)² x_{i+1}] = E0[ E0[(ρ_m^i)² x_{i+1} | F_i] ] = E0[ (ρ_m^{i−1})² E0[(ρ_i)² x_{i+1} | F_i] ].

Recalling that s_i is the state at time i and x_i is the indicator vector corresponding to s_i, one has for all s':

    E0[(ρ_i)² x_{i+1}(s') | F_i] = Σ_a π0(s_i,a) ρ(s_i,a)² T(s_i,a,s') = Σ_a π(s_i,a) ρ(s_i,a) T(s_i,a,s') = [P̃^T x_i](s').

As this is true for all s', we deduce that E0[(ρ_i)² x_{i+1} | F_i] = P̃^T x_i and

    E0[(ρ_m^i)² x_{i+1}] = E0[(ρ_m^{i−1})² P̃^T x_i] = P̃^T E0[(ρ_m^{i−1})² x_i] = P̃^T (P̃^T)^{i−m} µ0 = (P̃^T)^{i−m+1} µ0,

which concludes the proof by induction. Let us consider the next identity.
For i ≤ j,

    E0[ρ_m^{i−1} ρ_m^{j−1} x_i x_j^T] = E0[ E0[ρ_m^{i−1} ρ_m^{j−1} x_i x_j^T | F_i] ]
      = E0[ (ρ_m^{i−1})² x_i E0[ρ_i^{j−1} x_j^T | F_i] ]
      = E0[ (ρ_m^{i−1})² x_i x_i^T P^{j−i} ]
      = diag( (P̃^T)^{i−m} µ0 ) P^{j−i}.

Eventually, the last identity is obtained by considering X_{m,j,i} = X_{m,i,j}^T.

Thus, coming back to our calculus,

    E0[ φ̄_{2l} φ̄_{2l}^T y_{2l,l} ]
      = Φ^T { Σ_{m=l+1}^{2l} (λγ)^{2(2l−m)} ( D̃_{2l−m} − γ D̃_{2l−m} P − γ P^T D̃_{2l−m} + γ² D̃_{2l+1−m} ) } Φ
      = Φ^T ( D^l − γ D^l P − γ P^T D^l + γ² D'^l ) Φ,   (24)

with D^l = Σ_{j=0}^{l−1} (λγ)^{2j} D̃_j and D'^l = Σ_{j=0}^{l−1} (λγ)^{2j} D̃_{j+1}.

Similarly, the second term on the right side of Equation (23) satisfies:

    E0[ ρ_{2l−1} D_{2l−1,l} φ̄_{2l}^T ]
      = E0[ ρ_{2l−1} ( Σ_{k=l}^{2l−1} ρ̃_k^{2l−2} y_{k,l} φ̄_k ) φ̄_{2l}^T ]
      = E0[ Σ_{k=l}^{2l−1} (λγ)^{2l−1−k} ρ_k^{2l−1} ( Σ_{m=k−l+1}^k (ρ̃_m^{k−1})² ) Φ^T (x_k − γρ_k x_{k+1})(x_{2l} − γρ_{2l} x_{2l+1})^T Φ ]
      = Φ^T { Σ_{k=l}^{2l−1} (λγ)^{2l−1−k} Σ_{m=k−l+1}^k (λγ)^{2(k−m)} E0[ ρ_m^{2l−1} ρ_m^{k−1} (x_k − γρ_k x_{k+1})(x_{2l} − γρ_{2l} x_{2l+1})^T ] } Φ
      = Φ^T { Σ_{k=l}^{2l−1} (λγ)^{2l−1−k} Σ_{m=k−l+1}^k (λγ)^{2(k−m)} E0[ X_{m,k,2l} − γ X_{m,k+1,2l} − γ X_{m,k,2l+1} + γ² X_{m,k+1,2l+1} ] } Φ
      = Φ^T { Σ_{k=l}^{2l−1} (λγ)^{2l−1−k} Σ_{m=k−l+1}^k (λγ)^{2(k−m)} ( D̃_{k−m} P^{2l−k} − γ D̃_{k+1−m} P^{2l−k−1} − γ D̃_{k−m} P^{2l+1−k} + γ² D̃_{k+1−m} P^{2l−k} ) } Φ
      = Φ^T { Σ_{k=l}^{2l−1} (λγ)^{2l−1−k} Σ_{m=k−l+1}^k (λγ)^{2(k−m)} ( D̃_{k−m} P − γ D̃_{k+1−m} ) P^{2l−1−k} (I − γP) } Φ
      = Φ^T { Σ_{k=l}^{2l−1} (λγ)^{2l−1−k} ( D^l P − γ D'^l ) P^{2l−1−k} (I − γP) } Φ
      = Φ^T ( D^l P − γ D'^l ) Q_l (I − γP) Φ,

with Q_l = Σ_{j=0}^{l−1} (λγP)^j. Gathering this and Equation (24), we see that the limit of Ã_{i,l}/i expressed in Equation (23) equals:

    Φ^T [ D^l − γ D^l P − γ P^T D^l + γ² D'^l + λγ ( (D^l P − γ D'^l) Q_l (I − γP) + (I − γP^T) Q_l^T (P^T D^l − γ D'^l) ) ] Φ.

When l tends to infinity, Q_l tends to Q = (I − λγP)^{-1}. The assumption of Equation (15) ensures that (λγ)² P̃ has spectral radius smaller than 1, and thus, when l tends to infinity, D^l tends to D = diag( (I − (λγ)² P̃^T)^{-1} µ0 ) and D'^l to D' = diag( P̃^T (I − (λγ)² P̃^T)^{-1} µ0 ). In other words, lim_l lim_i Ã_{i,l}/i exists with probability 1 and equals:

    Φ^T [ D − γDP − γP^T D + γ² D' + λγ ( (DP − γD') Q (I − γP) + (I − γP^T) Q^T (P^T D − γD') ) ] Φ.

Eventually, this shows that lim_i Ã_i/i exists with probability 1 and shares the same value.
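The identities of Lemma 12 can be verified by brute-force enumeration on a tiny MDP: summing µ0(s0) · Π π0 T over all state-action paths, weighted by the squared importance ratios, must reproduce (P̃^T)^n µ0. A sketch (the random two-state MDP is arbitrary test data; only the first identity is checked):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
S, A, n = 2, 2, 3                       # n plays the role of i - m
T = rng.dirichlet(np.ones(S), size=(S, A))      # T[s, a] = P(.|s, a)
pi0 = rng.dirichlet(np.ones(A), size=S)         # behavior policy
pi = rng.dirichlet(np.ones(A), size=S)          # target policy
rho = pi / pi0                                  # importance ratios

# P0: behavior transition matrix; mu0: its stationary distribution
P0 = np.einsum('sa,sax->sx', pi0, T)
w, v = np.linalg.eig(P0.T)
mu0 = np.real(v[:, np.argmax(np.real(w))])
mu0 /= mu0.sum()

# Ptilde_{ss'} = sum_a pi(s,a) rho(s,a) T(s,a,s')
Pt = np.einsum('sa,sa,sax->sx', pi, rho, T)

# brute force: all paths of length n under the behavior policy,
# each weighted by the product of squared ratios along the path
lhs = np.zeros(S)
for s0 in range(S):
    for actions in product(range(A), repeat=n):
        for states in product(range(S), repeat=n):
            path = (s0,) + states
            prob, weight = mu0[s0], 1.0
            for t in range(n):
                s, a, s2 = path[t], actions[t], path[t + 1]
                prob *= pi0[s, a] * T[s, a, s2]
                weight *= rho[s, a] ** 2
            lhs[path[-1]] += prob * weight

rhs = np.linalg.matrix_power(Pt.T, n) @ mu0
print(np.allclose(lhs, rhs))  # True
```

The key cancellation is the one used in the proof: π0 ρ² = π ρ, so each enumeration step applies exactly P̃^T.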
A similar reasoning allows one to show that lim_i b̃_i/i exists and equals

    Φ^T ( (I − γP^T) Q^T D + λγ (DP − γD') Q ) R^π.

Appendix C. Proof Of Proposition 9

To prove Proposition 9, we need the following technical lemma.

Lemma 13 Let (α_i) and (β_i) be two forward recursions defined as

    α_i = a_i + η_i α_{i+1}   and   β_i = b_i + η_i β_{i+1}.

Assume that for any function f we have E[f(a_i, b_i, η_i)] = E[f(a_{i−1}, b_{i−1}, η_{i−1})] (this is typically true if the index i refers to a state sampled according to some stationary distribution, which is the case we are interested in). Let also u_i, v_i and w_i be the backward recursions defined as

    w_i = 1 + η_{i−1}² w_{i−1},
    u_i = a_i w_i + η_{i−1} u_{i−1},
    v_i = b_i w_i + η_{i−1} v_{i−1}.

Then, we have E[α_i β_i] = E[a_i v_i + b_i u_i − a_i b_i w_i].

Proof The proof looks like that of Proposition 6, but is a little more involved. A key equality, to be applied repeatedly, is:

    α_i β_i = (a_i + η_i α_{i+1})(b_i + η_i β_{i+1}) = a_i β_i + b_i α_i + η_i² α_{i+1} β_{i+1} − a_i b_i.

Another equality, also to be used repeatedly, makes use of the stationarity assumption: for any k ≥ 0 we have

    E[ ( Π_{j=0}^k η_{i−j}² ) α_{i+1} β_{i+1} ] = E[ ( Π_{j=1}^{k+1} η_{i−j}² ) α_i β_i ].

These two identities can be used to work the term of interest:

    E[α_i β_i] = E[(a_i + η_i α_{i+1})(b_i + η_i β_{i+1})]
               = E[a_i β_i] + E[b_i α_i] + E[η_i² α_{i+1} β_{i+1}] − E[a_i b_i]
               = E[a_i β_i] + E[b_i α_i] − E[a_i b_i] + E[η_{i−1}² (a_i + η_i α_{i+1})(b_i + η_i β_{i+1})]
               = E[a_i (1 + η_{i−1}²) β_i] + E[b_i (1 + η_{i−1}²) α_i] − E[a_i b_i (1 + η_{i−1}²)] + E[(η_{i−1} η_i)² α_{i+1} β_{i+1}].

This process can be repeated, giving

    E[α_i β_i] = E[ (a_i β_i + b_i α_i − a_i b_i)(1 + η_{i−1}² + (η_{i−1} η_{i−2})² + ...) ].

We have that w_i = 1 + η_{i−1}² w_{i−1} = 1 + η_{i−1}² + (η_{i−1} η_{i−2})² + ..., therefore

    E[α_i β_i] = E[a_i w_i β_i] + E[b_i w_i α_i] − E[a_i b_i w_i].

We can work on the first term:

    E[a_i w_i β_i] = E[a_i w_i (b_i + η_i β_{i+1})] = E[a_i w_i b_i] + E[a_{i−1} w_{i−1} η_{i−1} (b_i + η_i β_{i+1})]
                  = E[ b_i ( a_i w_i + η_{i−1} a_{i−1} w_{i−1} + η_{i−1} η_{i−2} a_{i−2} w_{i−2} + ... ) ] = E[b_i u_i].

The work on the second term is symmetric: E[b_i w_i α_i] = E[a_i v_i]. This finishes proving the result.

The proof of Proposition 9 is a simple application of the preceding technical lemma.
By Lemma 5, we have that

    δ_i^λ = δ_i + γλρ_i δ_{i+1}^λ,

which matches the first forward recursion of Lemma 13 with α_i = δ_i^λ, a_i = δ_i and η_i = γλρ_i. By Lemma 7, we have that

    g_i^λ = γρ_i (1 − λ) φ_{i+1} + γλρ_i g_{i+1}^λ,

which matches the second forward recursion with β_i = g_i^λ and b_i = γρ_i (1 − λ) φ_{i+1}. The result is then a direct application of Lemma 13.

References

A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. In Conference on Learning Theory (COLT), 2006.

T. Archibald, K. McKinnon, and L. Thomas. On the generation of Markov decision processes. Journal of the Operational Research Society, 46:354-361, 1995.

L.C. Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning (ICML), 1995.

D.P. Bertsekas and S. Ioffe. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical report, MIT, 1996.

D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

D.P. Bertsekas and H. Yu. Projected equation methods for approximate solution of large linear systems. Journal of Computational and Applied Mathematics, 227:27-50, 2009.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In S. Sra, S. Nowozin, and S.J. Wright, editors, Optimization for Machine Learning, pages 351-368. MIT Press, 2011.

J.A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2-3):233-246, 2002.

S.J. Bradtke and A.G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33-57, 1996.

D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems, 16:207-239, 2006.

Y. Engel. Algorithms and Representations for Reinforcement Learning. PhD thesis, Hebrew University, 2005.

M. Geist and O. Pietquin. Eligibility traces through colored noises.
In IEEE International Conference on Ultra Modern Telecommunications and Control Systems (ICUMT), 2010a.

M. Geist and O. Pietquin. Kalman temporal differences. Journal of Artificial Intelligence Research (JAIR), 39:483-532, 2010b.

M. Geist and O. Pietquin. Algorithmic survey of parametric value function approximation. IEEE Transactions on Neural Networks and Learning Systems, 24(6):845-867, 2013.

M. Kearns and S. Singh. Bias-variance error bounds for temporal difference updates. In Conference on Learning Theory (COLT), 2000.

J.Z. Kolter. The fixed points of off-policy TD. In Advances in Neural Information Processing Systems (NIPS), 2011.

M.G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003.

H.R. Maei and R.S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Conference on Artificial General Intelligence (AGI), 2010.

R. Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning (ICML), 2003.

A. Nedić and D.P. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13:79-110, 2003.

D. Precup, R.S. Sutton, and S.P. Singh. Eligibility traces for off-policy policy evaluation. In International Conference on Machine Learning (ICML), 2000.

D. Precup, R.S. Sutton, and S. Dasgupta. Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning (ICML), 2001.

R.S. Randhawa and S. Juneja. Combining importance sampling and temporal difference control variates to simulate Markov chains. ACM Transactions on Modeling and Computer Simulation, 14(1):1-30, 2004.

B.D. Ripley. Stochastic Simulation. Wiley & Sons, 1987.

B. Scherrer. Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view.
In International Conference on Machine Learning (ICML), 2010.

B. Scherrer and M. Geist. Recursive least-squares learning with eligibility traces. In European Workshop on Reinforcement Learning (EWRL), 2011.

R. Schoknecht. Optimality of reinforcement learning algorithms with linear function approximation. In Advances in Neural Information Processing Systems (NIPS), 2002.

R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). MIT Press, 1998.

R.S. Sutton, H.R. Maei, D. Precup, S. Bhatnagar, D. Silver, Cs. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning (ICML), 2009.

J.N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

H. Yu. Convergence of least-squares temporal difference methods under general conditions. In International Conference on Machine Learning (ICML), 2010a.

H. Yu. Least squares temporal difference methods: An analysis under general conditions. Technical Report C-2010-39, University of Helsinki, September 2010b.

H. Yu and D.P. Bertsekas. New error bounds for approximations from projected linear equations. Technical Report C-2008-43, Dept. of Computer Science, University of Helsinki, July 2008.