Journal of Machine Learning Research 15 (2014) 289-333. Submitted 9/11; Revised 4/13; Published 1/14.

Off-policy Learning With Eligibility Traces: A Survey

Matthieu Geist (matthieu.geist@supelec.fr)
IMS-MaLIS Research Group & UMI 2958 (Georgia Tech-CNRS), Supélec, 2 rue Edouard Belin, 57070 Metz, France

Bruno Scherrer (bruno.scherrer@inria.fr)
MAIA project-team, INRIA Lorraine, 615 rue du Jardin Botanique, 54600 Villers-lès-Nancy, France

Editor: Ronald Parr

Abstract

In the framework of Markov Decision Processes, we consider linear off-policy learning, that is, the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated by some other policy. We briefly review the on-policy learning algorithms of the literature (gradient-based and least-squares-based), adopting a unified algorithmic view. Then, we highlight a systematic approach for adapting them to off-policy learning with eligibility traces. This leads to some known algorithms (off-policy LSTD(λ), LSPE(λ), TD(λ), TDC/GQ(λ)) and suggests new extensions (off-policy FPKF(λ), BRM(λ), gBRM(λ), GTD2(λ)). We describe a comprehensive algorithmic derivation of all algorithms in a recursive and memory-efficient form, discuss their known convergence properties, and illustrate their relative empirical behavior on Garnet problems. Our experiments suggest that the most standard algorithms perform the best: on- and off-policy LSTD(λ)/LSPE(λ), and TD(λ) if the feature space dimension is too large for a least-squares approach.

Keywords: reinforcement learning, value function estimation, off-policy learning, eligibility traces

1. Introduction

We consider the problem of learning a linear approximation of the value function of some fixed policy in a Markov Decision Process (MDP) framework.
This study is performed in the most general situation, where learning must be done from a single trajectory possibly generated by some other policy, also known as off-policy learning. Given samples, well-known methods for estimating a value function are temporal difference (TD) learning and Monte Carlo (Sutton and Barto, 1998). TD learning with eligibility traces (Sutton and Barto, 1998), known as TD(λ), constitutes a nice bridge between both approaches; by controlling the bias/variance trade-off (Kearns and Singh, 2000), their use can significantly speed up learning. When the value function is approximated through a linear architecture, the depth λ of the eligibility traces is also known to control the quality of approximation (Tsitsiklis and Van Roy, 1997). Overall, the use of these traces often plays an important practical role.

© 2014 Matthieu Geist and Bruno Scherrer.

Table 1: Taxonomy of linearly parameterized estimators for value function approximation (Geist and Pietquin, 2013).

| | gradient-based | least-squares-based |
| --- | --- | --- |
| bootstrapping | TD (Sutton and Barto, 1998) | FPKF (Choi and Van Roy, 2006) |
| residual | gBRM (Baird, 1995) | BRM (Engel, 2005; Geist and Pietquin, 2010b) |
| projected fixed point | TDC/GTD2 (Sutton et al., 2009) | LSTD (Bradtke and Barto, 1996); LSPE (Nedić and Bertsekas, 2003) |

There has been a significant amount of research on parametric linear approximation of the value function, without eligibility traces (in the on- or off-policy case). We follow the taxonomy proposed by Geist and Pietquin (2013), briefly recalled in Table 1 and further developed in Section 2. Value function approximators can be categorized depending on the cost function they minimize (based on bootstrapping, on Bellman residual minimization, or on a projected fixed-point approach) and on how it is minimized (gradient descent or linear least-squares). Most of these algorithms have been extended to take eligibility traces into account, in the on-policy case.
Works on extending these eligibility-trace approaches to off-policy learning are scarcer. They are summarized in Table 2. The first motivation of this article is to argue that it is conceptually simple to extend all the algorithms of Table 1 so that they can be applied to the off-policy setting and use eligibility traces. While this allows re-deriving existing algorithms, it also leads to new candidate algorithms (marked as new in Table 2). The second motivation of this work is to discuss the subtle differences between these intimately related algorithms, and to provide some comparative insights on their empirical behavior (a topic that, to our knowledge, has not been considered in the literature, even in the simplest on-policy and no-trace situation).

Table 2: Surveyed off-policy eligibility-trace approaches. Algorithms with a reference have been published before; algorithms marked (new) are introduced here.

| | gradient-based | least-squares-based |
| --- | --- | --- |
| bootstrapping | off-policy TD(λ) (Bertsekas and Yu, 2009) | off-policy FPKF(λ) (new) |
| residual | off-policy gBRM(λ) (new) | off-policy BRM(λ) (new) |
| projected fixed point | GQ(λ), a.k.a. off-policy TDC(λ) (Maei and Sutton, 2010); off-policy GTD2(λ) (new) | off-policy LSTD(λ) (Yu, 2010a); off-policy LSPE(λ) (Yu, 2010a) |

The rest of this article is organized as follows. Section 2 introduces the background of Markov Decision Processes, describes the state-of-the-art algorithms for learning without eligibility traces, and gives the fundamental idea for extending these methods to the off-policy situation with eligibility traces. Section 3 details this extension for the least-squares approaches: the resulting algorithms are formalized, we derive recursive and memory-efficient formulae for their implementation (this allows online learning without loss of generality, especially since half of these algorithms are recursive by their very definition), and we discuss their convergence properties.
Section 4 does the same job for stochastic gradient approaches, which offer a smaller computational cost (linear per update, instead of quadratic). Last but not least, Section 5 describes an empirical comparison and Section 6 concludes.

2. Background

We consider a Markov Decision Process (MDP), that is, a tuple $\{S, A, P, R, \gamma\}$ in which $S$ is a finite state space identified with $\{1, 2, \dots, N\}$, $A$ a finite action space, $P \in \mathcal{P}(S)^{S \times A}$ the set of transition probabilities, $R \in \mathbb{R}^{S \times A}$ the reward function and $\gamma$ the discount factor. A mapping $\pi \in \mathcal{P}(A)^S$ is called a policy. For any policy $\pi$, let $P^\pi$ be the corresponding stochastic transition matrix, and $R^\pi$ the vector of mean rewards when following $\pi$, that is, of components $E_{a|\pi,s}[R(s,a)]$. The value $V^\pi(s)$ of state $s$ for a policy $\pi$ is the expected discounted cumulative reward starting in state $s$ and then following the policy $\pi$:
$$V^\pi(s) = E_\pi\left[\sum_{i=0}^{\infty} \gamma^i r_i \,\middle|\, s_0 = s\right],$$
where $E_\pi$ denotes the expectation over trajectories induced by policy $\pi$. The value function satisfies the (linear) Bellman equation:
$$\forall s, \quad V^\pi(s) = E_{s',a|s,\pi}\left[R(s,a) + \gamma V^\pi(s')\right].$$
It can be rewritten as the fixed point of the Bellman evaluation operator: $V^\pi = T^\pi V^\pi$, where for all $V$, $T^\pi V = R^\pi + \gamma P^\pi V$. In this article, we are interested in learning an approximation of this value function $V^\pi$ under some constraints. First, we assume our approximation to be linearly parameterized: $\forall s, \hat{V}_\theta(s) = \theta^T \phi(s)$, with $\theta \in \mathbb{R}^p$ being the parameter vector and $\phi(s) \in \mathbb{R}^p$ the feature vector in state $s$. This notably encompasses the tabular case (exact representation of the value function). Also, we want to estimate the value function $V^\pi$ (or equivalently the associated parameter $\theta$) from a single finite trajectory generated using a possibly different behavior policy $\pi_0$.¹ Let $\mu_0$ be the stationary distribution of the stochastic matrix $P_0 = P^{\pi_0}$ of the behavior policy $\pi_0$ (we assume it exists and is unique). Let $D_0$ be the diagonal matrix whose elements are $(\mu_0(s_i))_{1 \le i \le N}$.
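As a concrete illustration, in this finite setting the Bellman equation can be solved directly as a linear system; the 3-state chain below (transition matrix and rewards) is invented for illustration:

```python
import numpy as np

# The value function of a fixed policy pi solves the linear Bellman equation
# V = R_pi + gamma * P_pi V, i.e. V = (I - gamma * P_pi)^{-1} R_pi.
gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])   # stochastic transition matrix P^pi (invented)
R_pi = np.array([1.0, 0.0, 2.0])     # mean reward vector R^pi (invented)

V = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# V is the unique fixed point of the Bellman operator T^pi V = R_pi + gamma * P_pi V
assert np.allclose(R_pi + gamma * P_pi @ V, V)
```

This exact solve is only possible when $P^\pi$ and $R^\pi$ are known and $N$ is small; the algorithms surveyed here estimate $V^\pi$ from sampled transitions instead.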
Let $\Phi$ be the matrix of feature vectors: $\Phi = [\phi(1) \dots \phi(N)]^T$.

¹ This can easily be extended to multiple finite trajectories.

As we consider a linear approximation, the considered value functions belong to the space spanned by $\Phi$. The projection $\Pi_0$ onto this hypothesis space with respect to the $\mu_0$-weighted quadratic norm, which will be central to the understanding of the algorithms, has the following closed form:
$$\Pi_0 = \Phi (\Phi^T D_0 \Phi)^{-1} \Phi^T D_0.$$
If $\pi_0$ is different from $\pi$, this is called an off-policy setting. Notice that all algorithms considered in this paper use this $\Pi_0$ projection operator, that is, the projection according to the observed data.² It would certainly be interesting to consider the projection according to the stationary distribution of $\pi$, the (unobserved) target policy: this would reduce off-policy learning to on-policy learning. However, this would require re-weighting samples according to the stationary distribution of the target policy $\pi$, which is unknown and probably as difficult to estimate as the value function itself.

2.1 Standard Algorithms For On-policy Learning Without Traces

We now review existing on-policy linearly parameterized temporal difference learning algorithms (see Table 1). In this case, the behavior and target policies are the same, so we omit the subscript 0 for the policy ($\pi$) and the projection ($\Pi$). We assume that a trajectory $(s_1, a_1, r_1, s_2, \dots, s_i, a_i, r_i, s_{i+1}, \dots, s_n, a_n, r_n, s_{n+1})$ sampled according to the policy $\pi$ is available, and we will explain how to compute the $i$th iterate for several algorithms. For all $j \le i$, let us introduce the empirical Bellman operator at step $j$:
$$\hat{T}_j : V \mapsto r_j + \gamma V(s_{j+1}),$$
so that $\hat{T}_j V$ is an unbiased estimate of $TV(s_j)$. Projected fixed-point approaches aim at finding the fixed point of the operator obtained by composing the projection onto the hypothesis space with the Bellman operator.
In other words, they search for the fixed point $\hat{V}_\theta = \Pi T \hat{V}_\theta$, $\Pi$ being the just-introduced projection operator. Solving the following fixed-point problem with a least-squares approach,
$$\theta_i = \operatorname{argmin}_{\omega \in \mathbb{R}^p} \sum_{j=1}^{i} \left( \hat{T}_j \hat{V}_{\theta_i} - \hat{V}_\omega(s_j) \right)^2,$$
corresponds to the Least-Squares Temporal Differences (LSTD) algorithm of Bradtke and Barto (1996). Recently, Sutton et al. (2009) proposed two algorithms reaching the same objective, Temporal Difference with gradient Correction (TDC) and Gradient Temporal Difference 2 (GTD2), by performing a stochastic gradient descent on the function
$$\theta \mapsto \left\| \hat{V}_\theta - \Pi T \hat{V}_\theta \right\|^2,$$
which is minimal (and equal to 0) when $\hat{V}_\theta = \Pi T \hat{V}_\theta$.

² As far as we know, there are two notable exceptions. Precup et al. (2001) propose an algorithm that updates parameters according to full trajectories (not according to transitions, as all approaches to be reviewed next). Therefore, the distribution weighting the projection operator is the one of the starting states of these trajectories, instead of the one induced by the behavior policy. Another work that moves in a different direction is the off-policy approach of Kolter (2011): samples are weighted such that the projection operator composed with the Bellman operator is non-expansive; this is weaker than projecting according to the stationary distribution of the target policy, but offers some guarantees. In this article, we consider only the $\Pi_0$ projection.

A related approach consists in building a recursive algorithm that repeatedly mimics the iteration $\hat{V}_{\theta_i} \approx \Pi T \hat{V}_{\theta_{i-1}}$. In practice, we aim at minimizing
$$\sum_{j=1}^{i} \left( \hat{T}_j \hat{V}_{\theta_{i-1}} - \hat{V}_\omega(s_j) \right)^2.$$
Performing the minimization exactly through a least-squares method leads to the Least-Squares Policy Evaluation (LSPE) algorithm of Bertsekas and Ioffe (1996). If this minimization is approximated by a stochastic gradient descent, this leads to the classical Temporal Difference (TD) algorithm (Sutton and Barto, 1998).
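As an illustration of the TD algorithm just mentioned, here is a minimal sketch of the tabular TD(0) update; the 2-state chain, step size and run length are invented for illustration:

```python
import numpy as np

# Tabular TD(0): after each transition (s, r, s'), take a stochastic-gradient step
#   theta <- theta + alpha * (r + gamma * theta^T phi(s') - theta^T phi(s)) * phi(s).
gamma, alpha = 0.9, 0.1
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])           # invented 2-state chain
R = np.array([1.0, 0.0])             # deterministic reward per state
Phi = np.eye(2)                      # tabular features (exact representation)

rng = np.random.default_rng(6)
theta = np.zeros(2)
s = 0
for _ in range(20_000):
    s_next = rng.choice(2, p=P[s])
    delta = R[s] + gamma * theta @ Phi[s_next] - theta @ Phi[s]   # TD error
    theta = theta + alpha * delta * Phi[s]                        # gradient step
    s = s_next

V_true = np.linalg.solve(np.eye(2) - gamma * P, R)   # exact solution (5.5, 4.5)
assert np.max(np.abs(theta - V_true)) < 0.5
```

With a constant step size the iterates fluctuate around the exact value function, hence the loose tolerance; a decaying step size would give convergence.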
Bootstrapping approaches consist in treating value function approximation after seeing the $i$th transition as a supervised learning problem, by replacing the unobserved values $V^\pi(s_j)$ at states $s_j$ by some estimate computed from the trajectory up to the transition $(s_j, s_{j+1})$, the best such estimate being $\hat{T}_j \hat{V}_{\theta_{j-1}}$. This amounts to minimizing the following function:
$$\sum_{j=1}^{i} \left( \hat{T}_j \hat{V}_{\theta_{j-1}} - \hat{V}_\omega(s_j) \right)^2. \quad (1)$$
Choi and Van Roy (2006) proposed the Fixed-Point Kalman Filter (FPKF), a least-squares variation of TD that minimizes exactly the function of Equation (1). If the minimization is approximated by a stochastic gradient descent, this gives again the classical TD algorithm (Sutton and Barto, 1998). Finally, residual approaches aim at minimizing the distance between the value function and its image through the Bellman operator, $\|V - TV\|^2_{\mu_0}$. Based on a trajectory, this suggests minimizing
$$\sum_{j=1}^{i} \left( \hat{T}_j \hat{V}_\omega - \hat{V}_\omega(s_j) \right)^2,$$
which is a surrogate of the objective $\|V - TV\|^2_{\mu_0}$ that is biased (Antos et al., 2006). This cost function was originally proposed by Baird (1995), who minimized it using a stochastic gradient approach (this algorithm being referred to here as gBRM, for gradient-based Bellman Residual Minimization). Both the parametric Gaussian Process Temporal Differences (GPTD) algorithm of Engel (2005) and the linear Kalman Temporal Differences (KTD) algorithm of Geist and Pietquin (2010b) can be shown to minimize the above cost using a least-squares approach, and are thus the very same algorithm, which we will refer to as BRM (for Bellman Residual Minimization) in the remainder of this paper.³ To sum up, it thus appears that after the $i$th transition has been observed, the above-mentioned algorithms behave according to the following pattern: move from $\theta_{i-1}$ to $\theta_i$ towards the minimum of
$$\omega \mapsto \sum_{j=1}^{i} \left( \hat{T}_j \hat{V}_\xi - \hat{V}_\omega(s_j) \right)^2, \quad (2)$$
either through a least-squares approach or a stochastic gradient descent.
Each of the algorithms mentioned above is obtained by substituting $\theta_i$, $\theta_{i-1}$, $\theta_{j-1}$ or $\omega$ for $\xi$.

³ Note that this is only true in the linear case. GPTD and KTD were both introduced in a more general setting: GPTD is non-parametric and KTD is motivated by the goal of handling nonlinearities.

2.2 Towards Off-policy Learning With Traces

It is now easy to preview, at least at a high level, how one may extend the previously described algorithms so that they can deal with eligibility traces and off-policy learning.

Eligibility Traces. The idea of eligibility traces amounts to looking for the fixed point of the following variation of the Bellman operator (Bertsekas and Tsitsiklis, 1996):
$$\forall V \in \mathbb{R}^S, \quad T^\lambda V = (1 - \lambda) \sum_{k=0}^{\infty} \lambda^k T^{k+1} V,$$
which makes a geometric average with parameter $\lambda \in (0, 1)$ of the powers of the original Bellman operator $T$. Clearly, any fixed point of $T$ is a fixed point of $T^\lambda$ and vice versa. After some simple algebra, one can see that:
$$T^\lambda V = (I - \lambda\gamma P)^{-1} \left( R + (1 - \lambda)\gamma P V \right) = V + (I - \lambda\gamma P)^{-1} \left( R + \gamma P V - V \right). \quad (3)$$
This leads to the following well-known temporal difference expression in some state $s$:
$$T^\lambda V(s) = V(s) + E_\pi\left[ \sum_{k=i}^{\infty} (\gamma\lambda)^{k-i} \left( r_k + \gamma V(s_{k+1}) - V(s_k) \right) \,\middle|\, s_i = s \right] = V(s) + \sum_{k=i}^{\infty} (\gamma\lambda)^{k-i} \delta_{ik}(s),$$
where we recall that $E_\pi$ means that the expectation is taken according to the target policy $\pi$, and where
$$\delta_{ik}(s) = E_\pi\left[ r_k + \gamma V(s_{k+1}) - V(s_k) \,\middle|\, s_i = s \right]$$
is the expected temporal difference (Sutton and Barto, 1998). With $\lambda = 0$, we recover the Bellman evaluation equation. With $\lambda = 1$, this is the definition of the value function as the expected discounted cumulative reward: $T^1 V(s) = E_\pi[\sum_{k=i}^{\infty} \gamma^{k-i} r_k \,|\, s_i = s]$.

Off-policy Learning. As before, we assume that we are given a trajectory $(s_1, a_1, r_1, s_2, \dots, s_j, a_j, r_j, s_{j+1}, \dots, s_n, a_n, r_n, s_{n+1})$, except that it may now be generated from some behavior policy possibly different from the target policy $\pi$ whose value we want to estimate. We are going to describe how to compute the $i$th iterate for several algorithms.
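The closed form of Equation (3) can be checked numerically; the 3-state chain below is invented for illustration:

```python
import numpy as np

# The lambda-averaged Bellman operator in the closed form of Equation (3):
#   T^lam V = (I - lam*gamma*P)^{-1} (R + (1 - lam)*gamma*P V).
gamma, lam = 0.9, 0.6
P = np.array([[0.2, 0.8, 0.0],
              [0.0, 0.2, 0.8],
              [0.8, 0.0, 0.2]])    # invented stochastic matrix of the target policy
R = np.array([0.0, 1.0, -1.0])

def T_lambda(V):
    return np.linalg.solve(np.eye(3) - lam * gamma * P,
                           R + (1.0 - lam) * gamma * P @ V)

V_pi = np.linalg.solve(np.eye(3) - gamma * P, R)   # fixed point of T
assert np.allclose(T_lambda(V_pi), V_pi)           # hence also a fixed point of T^lam

# The second form of Equation (3) agrees: T^lam V = V + (I - lam*gamma*P)^{-1}(R + gamma*P V - V)
V = np.array([1.0, 2.0, 3.0])
alt = V + np.linalg.solve(np.eye(3) - lam * gamma * P, R + gamma * P @ V - V)
assert np.allclose(T_lambda(V), alt)
```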
For any $k \ge i$, unbiased estimates of the temporal difference terms $\delta_{ik}(s_i)$ can be computed through importance sampling (Ripley, 1987). Indeed, for all $(s, a)$, let us introduce the following weight:
$$\rho(s, a) = \frac{\pi(a|s)}{\pi_0(a|s)}.$$
In our trajectory context, for any $j$ and $k$, write
$$\rho_j^k = \prod_{l=j}^{k} \rho_l \quad \text{with} \quad \rho_l = \rho(s_l, a_l),$$
with the convention that $\rho_j^k = 1$ if $k < j$. With these notations,
$$\hat{\delta}_{ik} = \rho_i^k \hat{T}_k V - \rho_i^{k-1} V(s_k)$$
is an unbiased estimate of $\delta_{ik}(s_i)$, from which we may build an estimate $\hat{T}^\lambda_{j,i} V$ of $T^\lambda V(s_j)$ (we will describe this construction separately for the least-squares and the stochastic gradient approaches, as they slightly differ). Then, by replacing the empirical operator $\hat{T}_j$ in Equation (2) by $\hat{T}^\lambda_{j,i}$, we get the general pattern for off-policy trace-based algorithms: move from $\theta_{i-1}$ to $\theta_i$ towards the minimum of
$$\omega \mapsto \sum_{j=1}^{i} \left( \hat{T}^\lambda_{j,i} \hat{V}_\xi - \hat{V}_\omega(s_j) \right)^2, \quad (4)$$
either through a least-squares approach or a stochastic gradient descent, after having instantiated $\xi = \theta_i$, $\theta_{i-1}$, $\theta_{j-1}$ or $\omega$. This process, including in particular the precise definition of the empirical operator $\hat{T}^\lambda_{j,i}$, will be further developed in the next two sections.⁴ Since they are easier to derive, we begin by focusing on least-squares algorithms (right column of Table 2) in Section 3. Then, Section 4 focuses on stochastic gradient-based algorithms (left column of Table 2).

3. Least-squares Extensions To Eligibility Traces And Off-policy Learning

First, we consider the least-squares solution to the problem described in Equation (4). At their $i$th step, the algorithms that we are about to describe will compute the parameter $\theta_i$ by exactly solving the following problem:
$$\theta_i = \operatorname{argmin}_{\omega \in \mathbb{R}^p} \sum_{j=1}^{i} \left( \hat{T}^\lambda_{j,i} \hat{V}_\xi - \hat{V}_\omega(s_j) \right)^2,$$
where we define the following empirical truncated approximation of $T^\lambda$:
$$\hat{T}^\lambda_{j,i} : V \in \mathbb{R}^S \mapsto V(s_j) + \sum_{k=j}^{i} (\gamma\lambda)^{k-j} \hat{\delta}_{jk} = V(s_j) + \sum_{k=j}^{i} (\gamma\lambda)^{k-j} \left( \rho_j^k \hat{T}_k V - \rho_j^{k-1} V(s_k) \right).$$
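The importance-sampling correction underlying these estimates can be checked numerically; in this minimal sketch, the behavior and target policies over three actions (for a single state) are invented for illustration:

```python
import numpy as np

# With rho(a) = pi(a|s) / pi_0(a|s), an expectation under the target policy pi
# can be estimated from actions drawn under the behavior policy pi_0.
rng = np.random.default_rng(3)
pi_0 = np.array([0.5, 0.3, 0.2])   # behavior policy (invented)
pi = np.array([0.1, 0.3, 0.6])     # target policy (invented)
f = np.array([1.0, -2.0, 4.0])     # arbitrary function of the action

actions = rng.choice(3, size=200_000, p=pi_0)
rho = pi[actions] / pi_0[actions]
estimate = np.mean(rho * f[actions])

# The weighted empirical mean matches E_pi[f] = pi . f up to sampling noise.
assert abs(estimate - pi @ f) < 0.1
```

Note that although the correction is unbiased, the weights inflate the variance, which is exactly the bias/variance tension that the trace depth λ mediates.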
Though different definitions of this operator might lead to practical implementations, note that $\hat{T}^\lambda_{j,i}$ only uses samples observed before time $i$: this very feature, shared by all existing works in the literature, will enable us to derive recursive and low-memory algorithms. Recall that a linear parameterization is chosen here, $\hat{V}_\xi(s_i) = \xi^T \phi(s_i)$. We adopt the following notations: $\phi_i = \phi(s_i)$, $\Delta\phi_i = \phi_i - \gamma \rho_i \phi_{i+1}$ and $\tilde{\rho}_j^{k-1} = (\gamma\lambda)^{k-j} \rho_j^{k-1}$.

⁴ Note that we let the empirical operator $\hat{T}^\lambda_{j,i}$ depend on the index $j$ of the sample (as before) but also on the step $i$ of the algorithm. This will be particularly useful for the derivation of the recursive and memory-efficient least-squares algorithms that we present in the next section.

The generic cost function to be solved is therefore:
$$\theta_i = \operatorname{argmin}_{\omega \in \mathbb{R}^p} J(\omega; \xi) \quad \text{with} \quad J(\omega; \xi) = \sum_{j=1}^{i} \left( \phi_j^T \xi + \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \xi \right) - \phi_j^T \omega \right)^2. \quad (5)$$

Before deriving existing and new least-squares algorithms, as announced, some technical lemmas are required. The first lemma allows computing directly the inverse of a rank-one perturbed matrix.

Lemma 1 (Sherman-Morrison). Assume that $A$ is an invertible $n \times n$ matrix and that $u, v \in \mathbb{R}^n$ are two vectors satisfying $1 + v^T A^{-1} u \neq 0$. Then:
$$(A + u v^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}.$$

The next lemma is simply a rewriting of nested sums. However, it is quite important here, as it allows stepping from the operator $\hat{T}^\lambda_{j,i}$ (an operator that depends on the future of $s_j$, and is thus acausal), corresponding to the forward view of eligibility traces, to the recursion over parameters using eligibility traces (which depends only on past samples), corresponding to the backward view of eligibility traces. In other words, the forward view is a theoretical way of mixing backups that shifts parametrically (through the choice of λ) from the standard Bellman operator to the Monte Carlo one. However, it cannot be implemented easily, as it requires knowing the future states.
On the other hand, the backward view, which is equivalent (see notably Lemma 2 and Proposition 6), is a more mechanistic and convenient viewpoint that allows performing the same updates using solely information gathered in the states encountered in the past. See Sutton and Barto (1998, Chapter 7) for further discussion of the backward/forward views.

Lemma 2. Let $f \in \mathbb{R}^{N \times N}$ and $n \le N$. We have:
$$\sum_{i=1}^{n} \sum_{j=i}^{n} f(i, j) = \sum_{i=1}^{n} \sum_{j=1}^{i} f(j, i).$$

We are now ready to mechanically derive the off-policy algorithms LSTD(λ), LSPE(λ), FPKF(λ) and BRM(λ). This is what we do in the following subsections.

3.1 Off-policy LSTD(λ)

The Least-Squares Temporal Difference algorithm, which directly computes a fixed point of the projected Bellman operator, was originally introduced in the no-trace and on-policy case by Bradtke and Barto (1996). It was extended to eligibility traces by Boyan (1999), to off-policy learning without traces (through state-action value function approximation) by Lagoudakis and Parr (2003), and to off-policy learning with traces by Yu (2010a). The off-policy LSTD(λ) algorithm corresponds to instantiating Problem (5) with $\xi = \theta_i$:
$$\theta_i = \operatorname{argmin}_{\omega \in \mathbb{R}^p} \sum_{j=1}^{i} \left( \phi_j^T \theta_i + \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \theta_i \right) - \phi_j^T \omega \right)^2.$$
This can be solved by zeroing the gradient with respect to $\omega$:
$$\theta_i = \left( \sum_{j=1}^{i} \phi_j \phi_j^T \right)^{-1} \sum_{j=1}^{i} \phi_j \left( \phi_j^T \theta_i + \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \theta_i \right) \right), \quad \text{that is,} \quad 0 = \sum_{j=1}^{i} \sum_{k=j}^{i} \phi_j \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \theta_i \right),$$
which, through Lemma 2, is equivalent to:
$$0 = \sum_{j=1}^{i} \left( \sum_{k=1}^{j} \tilde{\rho}_k^{j-1} \phi_k \right) \left( \rho_j r_j - \Delta\phi_j^T \theta_i \right).$$
Introducing the (importance-based) eligibility vector $z_j$:
$$z_j = \sum_{k=1}^{j} \tilde{\rho}_k^{j-1} \phi_k = \sum_{k=1}^{j} (\gamma\lambda)^{j-k} \left( \prod_{m=k}^{j-1} \rho_m \right) \phi_k = \gamma\lambda\rho_{j-1} z_{j-1} + \phi_j, \quad (6)$$
one obtains the following batch estimate:
$$\theta_i = \left( \sum_{j=1}^{i} z_j \Delta\phi_j^T \right)^{-1} \sum_{j=1}^{i} z_j \rho_j r_j = (A_i)^{-1} b_i, \quad (7)$$
$$\text{with} \quad A_i = \sum_{j=1}^{i} z_j \Delta\phi_j^T \quad \text{and} \quad b_i = \sum_{j=1}^{i} z_j \rho_j r_j. \quad (8)$$
Thanks to Lemma 1, the inverse $M_i = (A_i)^{-1}$ can be computed recursively:
$$M_i = \left( \sum_{j=1}^{i} z_j \Delta\phi_j^T \right)^{-1} = M_{i-1} - \frac{M_{i-1} z_i \Delta\phi_i^T M_{i-1}}{1 + \Delta\phi_i^T M_{i-1} z_i}.$$
This can be used to derive a recursive estimate:
$$\theta_i = \left( \sum_{j=1}^{i} z_j \Delta\phi_j^T \right)^{-1} \sum_{j=1}^{i} z_j \rho_j r_j = \left( M_{i-1} - \frac{M_{i-1} z_i \Delta\phi_i^T M_{i-1}}{1 + \Delta\phi_i^T M_{i-1} z_i} \right) \left( \sum_{j=1}^{i-1} z_j \rho_j r_j + z_i \rho_i r_i \right) = \theta_{i-1} + \frac{M_{i-1} z_i}{1 + \Delta\phi_i^T M_{i-1} z_i} \left( \rho_i r_i - \Delta\phi_i^T \theta_{i-1} \right).$$
Writing the gain $K_i = \frac{M_{i-1} z_i}{1 + \Delta\phi_i^T M_{i-1} z_i}$, this gives Algorithm 1.

Algorithm 1: Off-policy LSTD(λ)
- Initialization: initialize vector $\theta_0$ and matrix $M_0$; set $z_0 = 0$.
- For i = 1, 2, ...:
  - Observe $\phi_i$, $r_i$, $\phi_{i+1}$.
  - Update traces: $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$.
  - Update parameters: $K_i = \frac{M_{i-1} z_i}{1 + \Delta\phi_i^T M_{i-1} z_i}$; $\theta_i = \theta_{i-1} + K_i (\rho_i r_i - \Delta\phi_i^T \theta_{i-1})$; $M_i = M_{i-1} - K_i (M_{i-1}^T \Delta\phi_i)^T$.

This algorithm was proposed and analyzed by Yu (2010a). The author proves the following result: if the behavior policy $\pi_0$ induces an irreducible Markov chain and chooses with positive probability any action that may be chosen by the target policy $\pi$, and if the compound (linear) operator $\Pi_0 T^\lambda$ has a unique fixed point, then off-policy LSTD(λ) converges to it almost surely.⁵ Formally, it converges to the solution $\theta^*$ of the so-called projected fixed-point equation:
$$V_{\theta^*} = \Pi_0 T^\lambda V_{\theta^*}. \quad (9)$$
Using the expression of the projection $\Pi_0$ and the form of the Bellman operator in Equation (3), it can be seen that $\theta^*$ satisfies $A\theta^* = b$ (see Yu, 2010a for details), where
$$A = \Phi^T D_0 (I - \gamma P)(I - \lambda\gamma P)^{-1} \Phi \quad \text{and} \quad b = \Phi^T D_0 (I - \lambda\gamma P)^{-1} R. \quad (10)$$
The core of the analysis of Yu (2010a) consists in showing that $\frac{1}{i} A_i$ and $\frac{1}{i} b_i$, defined in Equation (8), respectively converge to $A$ and $b$ almost surely. Through Equation (7), this implies the convergence of $\theta_i$ to $\theta^*$.

3.2 Off-policy LSPE(λ)

The Least-Squares Policy Evaluation algorithm, which computes iteratively the fixed point of the projected Bellman operator, was originally introduced by Bertsekas and Ioffe (1996) and first analyzed in an on-policy context by Nedić and Bertsekas (2003). Its extension to off-policy learning with traces was briefly mentioned by Yu (2010a).
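The LSTD(λ) recursion of Algorithm 1 can be rendered concretely. In the following minimal sketch, the trajectory (features, importance weights, rewards) is synthetic, and the regularized initialization $M_0 = I$ is an implementation choice, not part of the algorithm as analyzed; under it, the recursion tracks the regularized batch solution exactly:

```python
import numpy as np

# Recursive off-policy LSTD(lambda): M_i tracks (I + A_i)^{-1} by Sherman-Morrison,
# so theta_i = M_i b_i is maintained without ever inverting a matrix online.
gamma, lam, p, n = 0.9, 0.5, 2, 100
rng = np.random.default_rng(2)
phis = rng.standard_normal((n + 1, p))      # phi_1 ... phi_{n+1} (synthetic)
rhos = rng.uniform(0.5, 1.5, size=n)        # importance weights rho_i (synthetic)
rewards = rng.standard_normal(n)

theta = np.zeros(p)
M = np.eye(p)                               # M_0 = (eps * I)^{-1} with eps = 1
z = np.zeros(p)
A_sum, b_sum = np.zeros((p, p)), np.zeros(p)
rho_prev = 1.0                              # convention rho_0 = 1
for i in range(n):
    z = gamma * lam * rho_prev * z + phis[i]            # trace update, Eq. (6)
    dphi = phis[i] - gamma * rhos[i] * phis[i + 1]      # Delta-phi_i
    K = M @ z / (1.0 + dphi @ M @ z)                    # gain K_i
    theta = theta + K * (rhos[i] * rewards[i] - dphi @ theta)
    M = M - np.outer(K, dphi @ M)                       # Sherman-Morrison update
    A_sum += np.outer(z, dphi)                          # batch A_i of Eq. (8)
    b_sum += z * rhos[i] * rewards[i]                   # batch b_i of Eq. (8)
    rho_prev = rhos[i]

# The recursion reproduces the (regularized) batch solution of Equation (7).
assert np.allclose(theta, np.linalg.solve(np.eye(p) + A_sum, b_sum),
                   rtol=1e-4, atol=1e-6)
```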
The off-policy LSPE(λ) algorithm corresponds to instantiating $\xi = \theta_{i-1}$ in Problem (5):
$$\theta_i = \operatorname{argmin}_{\omega \in \mathbb{R}^p} \sum_{j=1}^{i} \left( \phi_j^T \theta_{i-1} + \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \theta_{i-1} \right) - \phi_j^T \omega \right)^2.$$

⁵ It is not always the case that $\Pi_0 T^\lambda$ has a unique fixed point; see Tsitsiklis and Van Roy (1997) for a counter-example.

This can be solved by zeroing the gradient with respect to $\omega$:
$$\theta_i = \left( \sum_{j=1}^{i} \phi_j \phi_j^T \right)^{-1} \sum_{j=1}^{i} \phi_j \left( \phi_j^T \theta_{i-1} + \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \theta_{i-1} \right) \right) = \theta_{i-1} + \left( \sum_{j=1}^{i} \phi_j \phi_j^T \right)^{-1} \sum_{j=1}^{i} \sum_{k=j}^{i} \phi_j \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \theta_{i-1} \right).$$
Using Lemma 2 and the definition of the eligibility vector $z_j$ in Equation (6), we get:
$$\theta_i = \theta_{i-1} + \left( \sum_{j=1}^{i} \phi_j \phi_j^T \right)^{-1} \sum_{j=1}^{i} \left( \sum_{k=1}^{j} \tilde{\rho}_k^{j-1} \phi_k \right) \left( \rho_j r_j - \Delta\phi_j^T \theta_{i-1} \right) = \theta_{i-1} + \left( \sum_{j=1}^{i} \phi_j \phi_j^T \right)^{-1} \sum_{j=1}^{i} z_j \left( \rho_j r_j - \Delta\phi_j^T \theta_{i-1} \right).$$
Define the matrix $N_i$ as follows:
$$N_i = \left( \sum_{j=1}^{i} \phi_j \phi_j^T \right)^{-1} = N_{i-1} - \frac{N_{i-1} \phi_i \phi_i^T N_{i-1}}{1 + \phi_i^T N_{i-1} \phi_i}, \quad (11)$$
where the second equality follows from Lemma 1. Let $A_i$ and $b_i$ be defined as in the LSTD description in Equation (8). For clarity, we restate their definitions along with their recursive forms:
$$A_i = \sum_{j=1}^{i} z_j \Delta\phi_j^T = A_{i-1} + z_i \Delta\phi_i^T, \qquad b_i = \sum_{j=1}^{i} z_j \rho_j r_j = b_{i-1} + z_i \rho_i r_i.$$
Then, it can be seen that the LSPE(λ) update is:
$$\theta_i = \theta_{i-1} + N_i (b_i - A_i \theta_{i-1}).$$
The overall computation is provided in Algorithm 2. This algorithm, (briefly) mentioned by Yu (2010a), generalizes the LSPE(λ) algorithm of Bertsekas and Ioffe (1996) to off-policy learning. With respect to LSTD(λ), which computes $\theta_i = (A_i)^{-1} b_i$ at each iteration as stated in Equation (7), LSPE(λ) is fundamentally recursive (as it is based on an iterated fixed-point relation). Along with the almost sure convergence of $\frac{1}{i} A_i$ and $\frac{1}{i} b_i$ to $A$ and $b$ defined in Equation (10), it can be shown that $i N_i$ converges to $N = (\Phi^T D_0 \Phi)^{-1}$ (see for instance Nedić and Bertsekas, 2003), so that, asymptotically, LSPE(λ) behaves as:
$$\theta_i = \theta_{i-1} + N(b - A\theta_{i-1}) = Nb + (I - NA)\theta_{i-1},$$

Algorithm 2: Off-policy LSPE(λ)
- Initialization: initialize vector $\theta_0$ and matrix $N_0$; set $z_0 = 0$, $A_0 = 0$ and $b_0 = 0$.
- For i = 1, 2, ...:
  - Observe $\phi_i$, $r_i$, $\phi_{i+1}$.
  - Update traces: $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$.
  - Update parameters: $N_i = N_{i-1} - \frac{N_{i-1} \phi_i \phi_i^T N_{i-1}}{1 + \phi_i^T N_{i-1} \phi_i}$; $A_i = A_{i-1} + z_i \Delta\phi_i^T$; $b_i = b_{i-1} + z_i \rho_i r_i$; $\theta_i = \theta_{i-1} + N_i (b_i - A_i \theta_{i-1})$.

Equivalently, using the definitions of $\Pi_0$, $A$ and $b$ from Equation (10) and of $T^\lambda$ from Equation (3), this asymptotic behavior reads:
$$V_{\theta_i} = \Phi\theta_i = \Phi N b + \Phi(I - NA)\theta_{i-1} = \Pi_0 T^\lambda V_{\theta_{i-1}}. \quad (12)$$
The behavior of this sequence depends on whether the spectral radius of $\Pi_0 T^\lambda$ is smaller than 1 or not. Thus, the analyses of Yu (2010a) and of Nedić and Bertsekas (2003) (for the convergence of $N_i$) imply the following convergence result: under the assumptions required for the convergence of off-policy LSTD(λ), and under the additional assumption that the operator $\Pi_0 T^\lambda$ has a spectral radius smaller than 1 (so that it is contracting), LSPE(λ) also converges almost surely to the fixed point of the compound operator $\Pi_0 T^\lambda$. There are two sufficient conditions that ensure such a desired contraction property. The first one is on-policy learning, as considered by Nedić and Bertsekas (2003) when they derived the first convergence proof of (on-policy) LSPE(λ). When the behavior policy $\pi_0$ is different from the target policy $\pi$, a sufficient condition for contraction is that λ be close enough to 1; indeed, when λ tends to 1, the spectral radius of $T^\lambda$ tends to zero and can potentially balance an expansion of the projection $\Pi_0$. In the off-policy case, when γ is sufficiently large, a small value of λ can make $\Pi_0 T^\lambda$ expansive (see Tsitsiklis and Van Roy 1997 for an example in the case λ = 0), and off-policy LSPE(λ) will then diverge. Finally, Equations (9) and (12) show that when λ = 1, LSTD(λ) and LSPE(λ) asymptotically coincide (as $T^1 V$ does not depend on $V$).

3.3 Off-policy FPKF(λ)

The Fixed-Point Kalman Filter algorithm is a bootstrapped recursive least-squares approach to value function approximation, originally introduced by Choi and Van Roy (2006). Its extensions to eligibility traces and to off-policy learning are new.
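Before detailing FPKF(λ), the LSPE(λ) iteration of Algorithm 2 can be rendered concretely; the trajectory below is synthetic and the initialization $N_0 = I$ (a unit regularizer) is an implementation choice, not taken from the analysis:

```python
import numpy as np

# Off-policy LSPE(lambda): A_i, b_i are the same statistics as in LSTD(lambda),
# N_i approximates (sum_j phi_j phi_j^T)^{-1} by Sherman-Morrison, and
# theta_i = theta_{i-1} + N_i (b_i - A_i theta_{i-1}).
gamma, lam, p, n = 0.9, 0.5, 2, 100
rng = np.random.default_rng(7)
phis = rng.standard_normal((n + 1, p))      # synthetic features
rhos = rng.uniform(0.5, 1.5, size=n)        # synthetic importance weights
rewards = rng.standard_normal(n)

theta = np.zeros(p)
N = np.eye(p)                               # N_0 = I (regularized start)
A = np.zeros((p, p))
b = np.zeros(p)
z = np.zeros(p)
rho_prev = 1.0
for i in range(n):
    z = gamma * lam * rho_prev * z + phis[i]             # trace z_i
    Nphi = N @ phis[i]
    N = N - np.outer(Nphi, Nphi) / (1.0 + phis[i] @ Nphi)   # Sherman-Morrison
    A = A + np.outer(z, phis[i] - gamma * rhos[i] * phis[i + 1])
    b = b + z * rhos[i] * rewards[i]
    theta = theta + N @ (b - A @ theta)                  # LSPE(lambda) update
    rho_prev = rhos[i]

assert np.all(np.isfinite(theta))
```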
The off-policy FPKF(λ) algorithm corresponds to instantiating $\xi = \theta_{j-1}$ in Problem (5):
$$\theta_i = \operatorname{argmin}_{\omega \in \mathbb{R}^p} \sum_{j=1}^{i} \left( \phi_j^T \theta_{j-1} + \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \theta_{j-1} \right) - \phi_j^T \omega \right)^2.$$
This can be solved by zeroing the gradient with respect to $\omega$:
$$\theta_i = N_i \sum_{j=1}^{i} \phi_j \left( \phi_j^T \theta_{j-1} + \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \theta_{j-1} \right) \right),$$
where $N_i$ is the matrix introduced for LSPE(λ) in Equation (11). For clarity, we restate its definition and its recursive form:
$$N_i = \left( \sum_{j=1}^{i} \phi_j \phi_j^T \right)^{-1} = N_{i-1} - \frac{N_{i-1} \phi_i \phi_i^T N_{i-1}}{1 + \phi_i^T N_{i-1} \phi_i}. \quad (13)$$
Using Lemma 2, one obtains:
$$\theta_i = N_i \left( \sum_{j=1}^{i} \phi_j \phi_j^T \theta_{j-1} + \sum_{j=1}^{i} \sum_{k=1}^{j} \tilde{\rho}_k^{j-1} \phi_k \left( \rho_j r_j - \Delta\phi_j^T \theta_{k-1} \right) \right).$$
With respect to the previously described algorithms, the difficulty here is that the right-hand side depends on all the previous parameter vectors $\theta_{k-1}$ for $1 \le k \le i$. Using the symmetry of the dot product, $\Delta\phi_j^T \theta_{k-1} = \theta_{k-1}^T \Delta\phi_j$, it is possible to write a recursive algorithm by introducing the trace matrix $Z_j$ that integrates the subsequent values of $\theta_k$ as follows:
$$Z_j = \sum_{k=1}^{j} \tilde{\rho}_k^{j-1} \phi_k \theta_{k-1}^T = \gamma\lambda\rho_{j-1} Z_{j-1} + \phi_j \theta_{j-1}^T.$$
With this notation we obtain:
$$\theta_i = N_i \left( \sum_{j=1}^{i} \phi_j \phi_j^T \theta_{j-1} + \sum_{j=1}^{i} \left( z_j \rho_j r_j - Z_j \Delta\phi_j \right) \right).$$
Using Equation (13) and a few algebraic manipulations, we end up with:
$$\theta_i = \theta_{i-1} + N_i \left( z_i \rho_i r_i - Z_i \Delta\phi_i \right).$$
This is the parameter update provided in Algorithm 3.

Algorithm 3: Off-policy FPKF(λ)
- Initialization: initialize vector $\theta_0$ and matrix $N_0$; set $z_0 = 0$ and $Z_0 = 0$.
- For i = 1, 2, ...:
  - Observe $\phi_i$, $r_i$, $\phi_{i+1}$.
  - Update traces: $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$; $Z_i = \gamma\lambda\rho_{i-1} Z_{i-1} + \phi_i \theta_{i-1}^T$.
  - Update parameters: $N_i = N_{i-1} - \frac{N_{i-1} \phi_i \phi_i^T N_{i-1}}{1 + \phi_i^T N_{i-1} \phi_i}$; $\theta_i = \theta_{i-1} + N_i (z_i \rho_i r_i - Z_i \Delta\phi_i)$.

Like LSPE(λ), this algorithm is fundamentally recursive. However, its overall behavior is quite different. As discussed for LSPE(λ), $i N_i$ can be shown to tend asymptotically to $N = (\Phi^T D_0 \Phi)^{-1}$, and the FPKF(λ) iterates eventually resemble:
$$\theta_i = \theta_{i-1} + \frac{1}{i} N \left( z_i \rho_i r_i - Z_i \Delta\phi_i \right).$$
The term in parentheses is a random component (that only depends on the previous transitions) and $\frac{1}{i}$ acts as a learning coefficient that asymptotically tends to 0.
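Algorithm 3 can be rendered concretely as follows; the trajectory is synthetic and the initialization $N_0 = I$ (a unit regularizer) is an implementation choice:

```python
import numpy as np

# Off-policy FPKF(lambda): the vector trace z_i and the matrix trace Z_i (which
# integrates past parameter vectors) are updated jointly, and N_i is maintained
# by Sherman-Morrison, as in Algorithm 3.
gamma, lam, p, n = 0.9, 0.5, 2, 100
rng = np.random.default_rng(4)
phis = rng.standard_normal((n + 1, p))      # synthetic features
rhos = rng.uniform(0.5, 1.5, size=n)        # synthetic importance weights
rewards = rng.standard_normal(n)

theta = np.zeros(p)
N = np.eye(p)                               # N_0 = I (regularized start)
z = np.zeros(p)
Z = np.zeros((p, p))
rho_prev = 1.0
for i in range(n):
    z = gamma * lam * rho_prev * z + phis[i]                    # z_i
    Z = gamma * lam * rho_prev * Z + np.outer(phis[i], theta)   # Z_i uses theta_{i-1}
    Nphi = N @ phis[i]
    N = N - np.outer(Nphi, Nphi) / (1.0 + phis[i] @ Nphi)       # Sherman-Morrison
    dphi = phis[i] - gamma * rhos[i] * phis[i + 1]              # Delta-phi_i
    theta = theta + N @ (z * rhos[i] * rewards[i] - Z @ dphi)
    rho_prev = rhos[i]

assert np.all(np.isfinite(theta))
```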
In other words, FPKF(λ) has a stochastic approximation flavor. In particular, one can see FPKF(0) as a stochastic approximation of LSPE(0). Indeed, asymptotically, FPKF(0) performs the update
$$\theta_i = \theta_{i-1} + \frac{1}{i} N \left( \rho_i \phi_i r_i - \phi_i \Delta\phi_i^T \theta_{i-1} \right),$$
and one can notice that $\rho_i \phi_i r_i$ and $\phi_i \Delta\phi_i^T$ are samples of the quantities $b$ and $A$ to which $\frac{1}{i} b_i$ and $\frac{1}{i} A_i$ converge in LSPE(0). When λ > 0, the situation is less clear, up to the fact that, since $T^1 V$ does not depend on $V$, we expect FPKF to asymptotically behave like LSTD and LSPE when λ tends to 1. Due to its much more involved form (notably the trace matrix $Z_j$, which integrates all the values $\theta_k$ from the start), it does not seem easy to provide a guarantee for FPKF(λ), even in the on-policy case. To our knowledge, there does not exist any proof of convergence for stochastic approximation algorithms in the off-policy case with traces, and a related result for FPKF(λ) thus seems difficult.⁶ Based on the above-mentioned relation between FPKF(0) and LSPE(0) and the experiments we have run (see Section 5), we conjecture that off-policy FPKF(λ) has the same asymptotic behavior as LSPE(λ). We leave the formal study of this algorithm for future work.

⁶ An analysis of TD(λ), with a simplifying assumption that forces the algorithm to stay bounded, is given by Yu (2010a). An analysis of GQ(λ) is provided by Maei and Sutton (2010), with an assumption on the second moment of the traces which, as explained in Proposition 2 of Yu (2010a), does not hold in general. A full analysis of these algorithms thus remains to be done. See also Sections 4.1 and 4.2.

3.4 Off-policy BRM(λ)

The Bellman Residual Minimization algorithm is a least-squares approach that directly minimizes the Bellman residual. The off-policy BRM(λ) algorithm corresponds to instantiating $\xi = \omega$ in Problem (5):
$$\theta_i = \operatorname{argmin}_{\omega \in \mathbb{R}^p} \sum_{j=1}^{i} \left( \phi_j^T \omega + \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \omega \right) - \phi_j^T \omega \right)^2 = \operatorname{argmin}_{\omega \in \mathbb{R}^p} \sum_{j=1}^{i} \left( \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \left( \rho_k r_k - \Delta\phi_k^T \omega \right) \right)^2.$$
Define
$$\psi_{j,i} = \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \Delta\phi_k \quad \text{and} \quad z_{j,i} = \sum_{k=j}^{i} \tilde{\rho}_j^{k-1} \rho_k r_k.$$
This yields the following batch estimate:
$$\theta_i = \operatorname{argmin}_{\omega \in \mathbb{R}^p} \sum_{j=1}^{i} \left( z_{j,i} - \psi_{j,i}^T \omega \right)^2 = (\bar{A}_i)^{-1} \bar{b}_i, \quad (14)$$
$$\text{with} \quad \bar{A}_i = \sum_{j=1}^{i} \psi_{j,i} \psi_{j,i}^T \quad \text{and} \quad \bar{b}_i = \sum_{j=1}^{i} \psi_{j,i} z_{j,i}.$$
The transformation of this batch estimate into a recursive update rule is somewhat tedious (it involves three trace variables), and the details are deferred to Appendix A for clarity. The resulting BRM(λ) method is provided in Algorithm 4. Note that at each step, this algorithm involves the inversion of a 2×2 matrix (involving the 2×2 identity matrix $I_2$), an inversion that admits a straightforward analytical solution. The computational complexity of an iteration of BRM(λ) is thus $O(p^2)$ (as for the preceding least-squares algorithms). GPTD and KTD, which are close to BRM, have also been extended with some trace mechanism; however, GPTD(λ) (Engel, 2005), KTD(λ) (Geist and Pietquin, 2010a) and the just-described BRM(λ) are different algorithms.⁷ Briefly, GPTD(λ) is very close to LSTD(λ) and KTD(λ) uses a different Bellman operator.⁸ As BRM(λ) builds a linear system whose solution is updated recursively, it resembles LSTD(λ). However, the system it builds is different. The following theorem, proved in Appendix B, partially characterizes the behavior of BRM(λ) and its potential limit.⁹

Theorem 3. Assume that the stochastic matrix $P_0$ of the behavior policy is irreducible and has stationary distribution $\mu_0$. Further assume that there exists a coefficient $\beta < 1$ such that
$$\forall (s, a), \quad \lambda\gamma\rho(s, a) \le \beta. \quad (15)$$

⁷ GPTD(λ) is not exactly a generalization of GPTD, as it does not reduce to it when λ = 0. It is rather a technical variation that bridges a gap with the Monte Carlo approach.

⁸ The corresponding loss is $\left( \hat{T}^0_{j,i} \hat{V}_\omega - \hat{V}_\omega(s_j) + \gamma\lambda \left( \hat{T}^1_{j+1,i} \hat{V}_\omega - \hat{V}_\omega(s_{j+1}) \right) \right)^2$. With λ = 0 it gives $\hat{T}^0_{j,i}$ and with λ = 1 it provides $\hat{T}^1_{j,i}$.

⁹ Our proof is similar to that of Proposition 4 of Bertsekas and Yu (2009).
The overall arguments are the following: Equation (15) implies that the traces can be truncated at some depth l, whose influence on the potential limit of the algorithm vanishes when l tends to . For all l, the l-truncated version of the algorithm can easily be analyzed through the ergodic theorem for Markov chains. Making l tend to allows tying the convergence of the original arguments to that of the truncated version. Eventually, the formula for the limit of the truncated algorithm is computed and one derives the limit. Geist and Scherrer Algorithm 4: Off-policy BRM(λ) Initialization; Initialize vector θ0 and matrix C0 ; Set y0 = 0, D0 = 0 and z0 = 0; for i = 1, 2, . . . do Observe φi, ri, φi+1; Pre-update traces ; yi = (γλρi 1)2yi 1 + 1 ; Ui = yi φi + γλρi 1 yi Di 1 γλρi 1 yi Di 1 T ; Vi = yi φi + γλρi 1 yi Di 1 γλρi 1 yi Di 1 T ; Wi = yiρri + γλρi 1 yi zi 1 γλρi 1 yi zi 1 T ; Update parameters ; θi = θi 1 + Ci 1Ui (I2 + Vi Ci 1Ui) 1 (Wi Viθi 1) ; Ci = Ci 1 Ci 1Ui (I2 + Vi Ci 1Ui) 1 Vi Ci 1 ; Post-update traces ; Di = (γλρi 1)Di 1 + φiyi ; zi = (γλρi 1)zi 1 + riρiyi ; i bi respectively converge almost surely to A = ΦT D γDP γP T D + γ2D + S(I γP) + (I γP T )ST Φ b = ΦT (I γP T )QT D + S Rπ where we wrote: D = diag (I (λγ)2 P T ) 1µ0 , Q = (I λγP) 1, D = diag P T (I (λγ)2 P T ) 1µ0 , S = λγ(DP γD )Q, and where P is the matrix whose coordinates are pss = P a π(a|s)ρ(s, a)P(s |s, a). Then, the BRM(λ) algorithm converges with probability 1 to A 1 b. The assumption given by Equation (15) trivially holds in the on-policy case (in which ρ(s, a) = 1 for all (s, a)) and in the off-policy case when λγ is sufficiently small with respect to the mismatch between policies. Note in particular that this result implies the almost sure convergence of the GPTD/KTD algorithms in the on-policy and no-trace case, a question that was still open in the literature.10 The matrix P, which is in general not a stochastic 10. See for instance the conclusion of Engel (2005). 
matrix, can have a spectral radius bigger than 1; Equation (15) ensures that $(\lambda\gamma)^2 P$ has a spectral radius smaller than β, so that the diagonal matrices appearing in Theorem 3 are well defined. Removing the assumption of Equation (15) does not seem easy: by tuning λγ maliciously, one may force the spectral radius of $(\lambda\gamma)^2 P$ to be as close to 1 as one wants, which would make A and b diverge. Though the quantity $A^{-1} b$ may compensate for these divergences, our current proof technique cannot account for this situation, and a related analysis constitutes possible future work. The fundamental idea behind the Bellman Residual approach is to address the computation of the fixed point of $T^\lambda$ differently from the previous methods. Instead of computing the projected fixed point as in Equation (9), one considers the following over-determined system:

$$\Phi\theta \approx (I - \lambda\gamma P)^{-1}\big(R + (1-\lambda)\gamma P \Phi\theta\big) \quad \text{(by Equation (3))}$$
$$\Leftrightarrow \quad \Phi\theta \approx QR + (1-\lambda)\gamma P Q \Phi\theta \quad \Leftrightarrow \quad \Psi\theta \approx QR \quad \text{with } \Psi = \Phi - (1-\lambda)\gamma P Q \Phi,$$

and solves it in a least-squares sense, that is by computing $\theta = A^{-1} b$ with $A = \Psi^\top \Psi$ and $b = \Psi^\top Q R$. One of the motivations for this approach is that, as opposed to the matrix A of LSTD/LSPE/FPKF, here A is invertible for all values of λ, and one can always guarantee a finite error bound with respect to the best projection (Schoknecht, 2002; Yu and Bertsekas, 2008; Scherrer, 2010). If the goal of BRM(λ) is to compute A and b from samples, what it actually computes (the limits characterized in Theorem 3) will in general be biased, because the estimation is based on a single trajectory.11 Such a bias adds an uncontrolled variance term (Antos et al., 2006) to A and b; an interesting consequence is that the estimated matrix is always non-singular.12 More precisely, there are two sources of bias in the estimation: one results from the non Monte-Carlo evaluation (the fact that λ < 1) and the other from the use of the correlated importance sampling factors (as soon as one considers off-policy learning).
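The $O(p^2)$ per-iteration cost of the recursive least-squares algorithms above, including the 2×2 inversion of Algorithm 4, rests on the Sherman-Morrison-Woodbury identity applied to a low-rank correction of the current matrix. A minimal sketch of the rank-2 case, with NumPy and a function name of our choosing:

```python
import numpy as np

def woodbury_update(C, U, V):
    """Given C = A^{-1} (p x p) and a rank-2 correction A' = A + U V^T,
    with U and V of size p x 2, return A'^{-1} while inverting only the
    2 x 2 matrix I_2 + V^T C U (Sherman-Morrison-Woodbury identity)."""
    K = np.linalg.inv(np.eye(U.shape[1]) + V.T @ C @ U)
    return C - C @ U @ K @ V.T @ C
```

For moderate p this costs $O(p^2)$ per step, as stated for BRM(λ), instead of the $O(p^3)$ of a fresh inversion.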
The interested reader may check that in the on-policy case, and when λ tends to 1, the limits of Theorem 3 coincide with A and b. However, in the strictly off-policy case, taking λ = 1 does not prevent the bias due to the correlated importance sampling factors. While we have argued that LSTD/LSPE/FPKF should asymptotically coincide when λ = 1, we see here that BRM should generally differ in an off-policy situation.

4. Stochastic Gradient Extensions To Eligibility Traces And Off-policy Learning

We have just provided a systematic derivation of all least-squares algorithms for learning with eligibility traces in an off-policy manner. When the number of features p is very large, the $O(p^2)$ complexity involved by a least-squares approach may be prohibitive. In such a situation, a natural alternative is to consider an approach based on a stochastic gradient descent of the objective function of interest (Bottou and Bousquet, 2011; Sutton et al., 2009; Maei and Sutton, 2010). In this section, we describe a systematic derivation of stochastic gradient algorithms for learning in an off-policy manner with eligibility traces. The principle followed is the same as for the least-squares approaches: we shall instantiate the algorithmic pattern of Equation (4) by choosing the value of ξ, and update the parameter so as to move towards the minimum of J(θi, ξ) using a stochastic gradient descent. To make the pattern of Equation (4) precise, we need to define the empirical approximate operator we use.

11. It is possible to remove the bias when λ = 0 by using double samples. However, in the case where λ > 0, removing the bias seems much more difficult.
12. The matrix $A = \Psi^\top \Psi$ is by construction positive definite, and the limit characterized in Theorem 3 equals it plus a positive term (the variance term), and is thus also positive definite.
We will consider the untruncated operators $\hat T^\lambda_{i,n}$ (written in the following $\hat T^\lambda_i$, with a slight abuse of notation):

$$\hat T^\lambda_i V = V(s_i) + \sum_{j=i}^n (\gamma\lambda)^{j-i} \big( \rho_i^j \hat T_j V - \rho_i^{j-1} V(s_j) \big) \quad (16)$$

where n is the total length of the trajectory. It should be noted that the algorithmic derivations in this section will be a little more involved than in the least-squares case. First, by instantiating ξ = θi, the pattern given in Equation (4) is actually a fixed-point problem on which one cannot directly perform a stochastic gradient descent; this issue will be addressed in Section 4.2 through the introduction of an auxiliary objective function, following the approach originally proposed by Sutton et al. (2009). A second difficulty is the following: the just introduced empirical operator $\hat T^\lambda_i$ depends on the full trajectory after step i (on the future of the process), and is for this reason usually called a forward view estimate. Though it would be possible, in principle, to implement a gradient descent based on this forward view, it would be neither memory nor time efficient. Thus, we will follow a usual trick of the literature by deriving recursive algorithms based on a backward view estimate that is equivalent to the forward view in expectation. To do so, we will repeatedly use the following identity, which highlights the fact that the estimate $\hat T^\lambda_i V$ can be written as a forward recursion:

Lemma 4 Let $\hat T^\lambda_i$ be the operator defined in Equation (16) and let $V \in \mathbb{R}^S$. We have $\hat T^\lambda_i V = \rho_i r_i + \gamma \rho_i (1-\lambda) V(s_{i+1}) + \gamma\lambda\rho_i \hat T^\lambda_{i+1} V$.

Proof Using notably the identity $\rho_i^j = \rho_i \rho_{i+1}^j$, we have:
$$\hat T^\lambda_i V = V(s_i) + \sum_{j=i}^n (\gamma\lambda)^{j-i} \big( \rho_i^j \hat T_j V - \rho_i^{j-1} V(s_j) \big) = V(s_i) + \rho_i \hat T_i V - V(s_i) + \gamma\lambda\rho_i \sum_{j=i+1}^n (\gamma\lambda)^{j-i-1} \big( \rho_{i+1}^j \hat T_j V - \rho_{i+1}^{j-1} V(s_j) \big) = \rho_i \hat T_i V + \gamma\lambda\rho_i \big( \hat T^\lambda_{i+1} V - V(s_{i+1}) \big).$$
The claim follows since $\hat T_i V = r_i + \gamma V(s_{i+1})$.

To sum up, the recipe that we are about to use to derive off-policy gradient learning algorithms based on eligibility traces consists of the following steps: 1.
write the empirical generic cost function of Equation (4) with the untruncated Bellman operator of Equation (16);

2. instantiate ξ and derive the gradient-based update rule (with some additional work for ξ = θi, see Section 4.2);

3. turn the forward view into an equivalent (in expectation) backward view.

The next subsections detail the precise derivation of the algorithms.

4.1 Off-policy TD(λ)

The Temporal-Difference algorithm (Sutton and Barto, 1998) is a gradient-based bootstrap approach for value function approximation. Because it is the simplest, we begin by considering this bootstrap approach, that is, by instantiating ξ = θ_{j-1}. The cost function to be minimized is therefore:

$$\sum_{j=1}^i \big( \hat T^\lambda_j \hat V_{\theta_{j-1}} - \hat V_\omega(s_j) \big)^2.$$

Minimized with a stochastic gradient descent, the related update rule is ($\alpha_i$ being a standard learning rate, and recalling that $\hat V_\omega(s_i) = \omega^\top \phi(s_i) = \omega^\top \phi_i$):

$$\theta_i = \theta_{i-1} - \frac{\alpha_i}{2} \nabla_\omega \big( \hat T^\lambda_i \hat V_{\theta_{i-1}} - \hat V_\omega(s_i) \big)^2 \Big|_{\omega=\theta_{i-1}} = \theta_{i-1} + \alpha_i \phi_i \big( \hat T^\lambda_i \hat V_{\theta_{i-1}} - \hat V_{\theta_{i-1}}(s_i) \big). \quad (17)$$

At this point, one could notice that the exact same update rule would have been obtained by instantiating ξ = θ_{i-1}. This was to be expected: as only the last term of the sum is considered for the update, we have j = i, and therefore ξ = θ_{i-1} = θ_{j-1}. Equation (17) makes use of a λ-TD error defined as $\delta^\lambda_i(\omega) = \hat T^\lambda_i \hat V_\omega - \hat V_\omega(s_i)$. For convenience, let also $\delta_i$ be the standard (off-policy) TD error, defined as $\delta_i(\omega) = \delta^{\lambda=0}_i(\omega) = \rho_i \hat T_i \hat V_\omega - \hat V_\omega(s_i) = \rho_i \big( r_i + \gamma \hat V_\omega(s_{i+1}) \big) - \hat V_\omega(s_i)$. The λ-TD error can be expressed as a forward recursion:

Lemma 5 Let $\delta^\lambda_i$ be the λ-TD error and $\delta_i$ be the standard TD error. Then for all ω, $\delta^\lambda_i(\omega) = \delta_i(\omega) + \gamma\lambda\rho_i \delta^\lambda_{i+1}(\omega)$.

Proof This is a corollary of Lemma 4:
$$\hat T^\lambda_i \hat V_\omega = \rho_i r_i + \gamma\rho_i(1-\lambda) \hat V_\omega(s_{i+1}) + \gamma\lambda\rho_i \hat T^\lambda_{i+1} \hat V_\omega$$
$$\Rightarrow \hat T^\lambda_i \hat V_\omega - \hat V_\omega(s_i) = \rho_i r_i + \gamma\rho_i \hat V_\omega(s_{i+1}) - \hat V_\omega(s_i) + \gamma\lambda\rho_i \big( \hat T^\lambda_{i+1} \hat V_\omega - \hat V_\omega(s_{i+1}) \big)$$
$$\Rightarrow \delta^\lambda_i(\omega) = \delta_i(\omega) + \gamma\lambda\rho_i \delta^\lambda_{i+1}(\omega).$$

Therefore, we get the following update rule: $\theta_i = \theta_{i-1} + \alpha_i \phi_i \delta^\lambda_i(\theta_{i-1})$ with $\delta^\lambda_i(\theta_{i-1}) = \delta_i(\theta_{i-1}) + \gamma\lambda\rho_i \delta^\lambda_{i+1}(\theta_{i-1})$.
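As a sanity check of Lemma 5, one can verify numerically on a finite trajectory (with the convention that the λ-TD error vanishes beyond the horizon) that the backward recursion reproduces the forward-view definition. The data below are arbitrary:

```python
import numpy as np

# Check of Lemma 5: with delta^lam_i = sum_{j>=i} (gamma*lam)^{j-i} rho_i^{j-1} delta_j,
# the recursion delta^lam_i = delta_i + gamma*lam*rho_i*delta^lam_{i+1} holds.
rng = np.random.default_rng(1)
n, gamma, lam = 20, 0.95, 0.7
delta = rng.normal(size=n)        # arbitrary per-step off-policy TD errors delta_i
rho = rng.uniform(0.5, 1.5, n)    # arbitrary importance-sampling ratios rho_i

# forward (direct) computation of the lambda-TD errors
fwd = np.zeros(n)
for i in range(n):
    w = 1.0                        # running product rho_i ... rho_{j-1}
    for j in range(i, n):
        fwd[i] += (gamma * lam) ** (j - i) * w * delta[j]
        w *= rho[j]

# backward recursion of Lemma 5
bwd = np.zeros(n)
acc = 0.0                          # delta^lam vanishes past the horizon
for i in reversed(range(n)):
    acc = delta[i] + gamma * lam * rho[i] * acc
    bwd[i] = acc
```

Both computations agree to machine precision on any such trajectory.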
The key idea is now to find some backward recursion that, in expectation, when the Markov chain has reached its steady-state distribution μ0, provides the same result as the forward recursion. Such a backward recursion is given by the following result.

Proposition 6 Let $z_i$ be the eligibility vector, defined by the recursion $z_i = \phi_i + \gamma\lambda\rho_{i-1} z_{i-1}$. For all ω, we have $\mathbb{E}_{\mu_0}[\phi_i \delta^\lambda_i(\omega)] = \mathbb{E}_{\mu_0}[z_i \delta_i(\omega)]$.

Proof For clarity, we omit the dependence on ω and write below $\delta_i$ (resp. $\delta^\lambda_i$) for $\delta_i(\omega)$ (resp. $\delta^\lambda_i(\omega)$). The result relies on successive applications of Lemma 5. We have:
$$\mathbb{E}_{\mu_0}[\phi_i \delta^\lambda_i] = \mathbb{E}_{\mu_0}[\phi_i (\delta_i + \gamma\lambda\rho_i \delta^\lambda_{i+1})] = \mathbb{E}_{\mu_0}[\phi_i \delta_i] + \gamma\lambda\,\mathbb{E}_{\mu_0}[\phi_i \rho_i \delta^\lambda_{i+1}].$$
Moreover, we have that $\mathbb{E}_{\mu_0}[\phi_i \rho_i \delta^\lambda_{i+1}] = \mathbb{E}_{\mu_0}[\phi_{i-1} \rho_{i-1} \delta^\lambda_i]$, as the expectation is taken with respect to the stationary distribution. Therefore:
$$\mathbb{E}_{\mu_0}[\phi_i \delta^\lambda_i] = \mathbb{E}_{\mu_0}[\phi_i \delta_i] + \gamma\lambda\,\mathbb{E}_{\mu_0}[\phi_{i-1} \rho_{i-1} \delta^\lambda_i] = \mathbb{E}_{\mu_0}[\phi_i \delta_i] + \gamma\lambda\,\mathbb{E}_{\mu_0}[\phi_{i-1} \rho_{i-1} (\delta_i + \gamma\lambda\rho_i \delta^\lambda_{i+1})] = \mathbb{E}_{\mu_0}\big[\delta_i \big( \phi_i + \gamma\lambda\rho_{i-1}\phi_{i-1} + (\gamma\lambda)^2 \rho_{i-1}\rho_{i-2}\phi_{i-2} + \dots \big)\big] = \mathbb{E}_{\mu_0}[\delta_i z_i].$$

This suggests replacing Equation (17) by the following update rule, $\theta_i = \theta_{i-1} + \alpha_i z_i \delta_i(\theta_{i-1})$, which is equivalent in expectation when the Markov chain has reached its steady state. This is summarized in Algorithm 5. This algorithm was first proposed in the tabular case by Precup et al. (2000) (who call it per-decision importance sampling). An off-policy TD(λ) algorithm (with function approximation) was proposed by Precup et al. (2001), but it differs significantly from the algorithm just described, since it updates parameters based on full episodic trajectories rather than on the current transition. Algorithm 5 was actually first proposed much more recently by Bertsekas and Yu (2009). An important issue for the analysis of this algorithm is the fact that the trace $z_i$ may have an infinite variance, due to importance sampling (Yu, 2010b, Section 3.1).
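In code, Algorithm 5 amounts to only a few lines per transition. Here is a minimal NumPy sketch (the function name and the synthetic sample interface are ours): each sample carries the feature vector φi, the importance weight ρi = π(ai|si)/π0(ai|si), the reward ri and the next feature vector φi+1.

```python
import numpy as np

def off_policy_td_lambda(samples, p, gamma=0.95, lam=0.4, alpha=0.01):
    """One pass of off-policy TD(lambda) (Algorithm 5).
    samples: iterable of (phi, rho, r, phi_next) tuples."""
    theta, z = np.zeros(p), np.zeros(p)
    rho_prev = 1.0  # z_0 = 0, so the first factor is irrelevant
    for phi, rho, r, phi_next in samples:
        z = gamma * lam * rho_prev * z + phi                         # trace update
        delta = rho * (r + gamma * phi_next @ theta) - phi @ theta   # off-policy TD error
        theta = theta + alpha * z * delta                            # parameter update
        rho_prev = rho
    return theta
```

With ρi = 1 for all i, this reduces to standard on-policy TD(λ); the per-transition cost is O(p).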
As far as we know, the only existing analysis of off-policy TD(λ) (as provided in Algorithm 5) uses an additional constraint which forces the parameters to stay bounded: after each parameter update, the resulting parameter vector is projected onto some predefined compact set. This analysis is performed by Yu (2010b, Section 4.1). Under the standard assumptions of stochastic approximation and most of the assumptions required for the on-policy TD(λ) algorithm, assuming moreover that $\Pi_0 T^\lambda$ is a contraction (which we recall to hold for a large enough λ) and that the predefined compact set used to project the parameter vector is a large enough ball containing the fixed point of $\Pi_0 T^\lambda$, the constrained version of off-policy TD(λ) converges to this fixed point, and therefore to the same solution as off-policy LSTD(λ), LSPE(λ) and FPKF(λ). We refer to Yu (2010b, Section 4.1) for further details. An analysis of the unconstrained version of off-policy TD(λ) described in Algorithm 5 is an interesting topic for future research.

Algorithm 5: Off-policy TD(λ)
Initialization: initialize vector θ0; set z0 = 0;
for i = 1, 2, . . . do
  Observe φi, ri, φi+1;
  Update traces: $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$;
  Update parameters: $\theta_i = \theta_{i-1} + \alpha_i z_i \big( \rho_i r_i + \gamma\rho_i \phi_{i+1}^\top \theta_{i-1} - \phi_i^\top \theta_{i-1} \big)$;

4.2 Off-policy TDC(λ) And Off-policy GTD2(λ)

The Temporal Difference with gradient Correction and Gradient Temporal Differences 2 algorithms were introduced by Sutton et al. (2009) as gradient descent approaches minimizing the norm of the difference between the value function and its image through the projected Bellman operator (they are therefore projected fixed-point approaches). Maei and Sutton (2010) extended TDC to off-policy learning with traces, calling the resulting algorithm GQ(λ). This corresponds (for all algorithms and extensions) to the case ξ = θi, considered in this section.
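For reference, the per-transition updates that the derivation below arrives at (summarized in Algorithm 6) can be sketched as follows. All names are ours, the two-timescale step sizes are kept constant for brevity, and the trace and TD-error conventions match those of Algorithm 5:

```python
import numpy as np

def gq_lambda_step(theta, w, z, rho_prev, phi, rho, r, phi_next,
                   gamma=0.95, lam=0.4, alpha=0.01, beta=0.1):
    """One transition of off-policy TDC(lambda), also known as GQ(lambda)."""
    z = gamma * lam * rho_prev * z + phi                        # eligibility trace
    delta = rho * (r + gamma * phi_next @ theta) - phi @ theta  # off-policy TD error
    corr = gamma * rho * (1.0 - lam) * phi_next * (z @ w)       # gradient-correction term
    theta = theta + alpha * (z * delta - corr)
    # auxiliary weights, updated on the faster timescale (beta >> alpha)
    delta_new = rho * (r + gamma * phi_next @ theta) - phi @ theta
    w = w + beta * (z * delta_new - phi * (phi @ w))
    return theta, w, z
```

In a learning loop, `rho_prev` is simply the importance weight of the previous transition (1 before the first one).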
Following the general pattern, at step i we would like to come up with a new parameter θi that moves (from θ_{i-1}) closer to the minimum of the function

$$\omega \mapsto J(\omega, \theta_i) = \sum_{j=1}^i \big( \hat T^\lambda_j \hat V_{\theta_i} - \hat V_\omega(s_j) \big)^2.$$

This problem is tricky since the function to minimize contains what we want to compute, θi, as a parameter. For this reason we cannot directly perform a stochastic gradient descent on the right-hand side. Instead, we will consider an alternative (but equivalent) formulation of the projected fixed-point minimization, $\theta = \operatorname{argmin}_\omega \|V_\omega - \Pi_0 T^\lambda V_\omega\|^2$, and will move from θ_{i-1} to θi by making one step of gradient descent on an estimate of the function $\theta \mapsto \|V_\theta - \Pi_0 T^\lambda V_\theta\|^2$. With the following vectorial notations:

$$\hat V_\omega = \begin{pmatrix} \hat V_\omega(s_1) & \dots & \hat V_\omega(s_i) \end{pmatrix}^\top, \quad \hat T^\lambda \hat V_\omega = \begin{pmatrix} \hat T^\lambda_1 \hat V_\omega & \dots & \hat T^\lambda_i \hat V_\omega \end{pmatrix}^\top, \quad \hat\Phi = \begin{pmatrix} \phi(s_1) & \dots & \phi(s_i) \end{pmatrix}^\top, \quad \hat\Pi_0 = \hat\Phi (\hat\Phi^\top \hat\Phi)^{-1} \hat\Phi^\top,$$

we consider the following objective function:

$$J(\theta) = \big\| \hat V_\theta - \hat\Pi_0 \hat T^\lambda \hat V_\theta \big\|^2 = \big( \hat V_\theta - \hat T^\lambda \hat V_\theta \big)^\top \hat\Pi_0 \big( \hat V_\theta - \hat T^\lambda \hat V_\theta \big) = \Big( \sum_{j=1}^i \delta^\lambda_j(\theta) \phi_j \Big)^\top \big( \hat\Phi^\top \hat\Phi \big)^{-1} \Big( \sum_{j=1}^i \delta^\lambda_j(\theta) \phi_j \Big).$$

This is the derivation followed by Sutton et al. (2009) in the case λ = 0 and by Maei and Sutton (2010) in the case λ > 0 (and off-policy learning). Let us introduce the following notation:

$$g^\lambda_j = \nabla \hat T^\lambda_j \hat V_\theta. \quad (18)$$

Note that since we consider a linear approximation, this quantity does not depend on θ. Noticing that $\nabla \delta^\lambda_j(\theta) = -(\phi_j - g^\lambda_j)$, we can compute the gradient of J(θ):

$$-\frac{1}{2} \nabla J(\theta) = \Big( \sum_{j=1}^i (\phi_j - g^\lambda_j) \phi_j^\top \Big) \big( \hat\Phi^\top \hat\Phi \big)^{-1} \Big( \sum_{j=1}^i \delta^\lambda_j(\theta) \phi_j \Big). \quad (19)$$

Expanding $(\phi_j - g^\lambda_j)\phi_j^\top = \phi_j\phi_j^\top - g^\lambda_j\phi_j^\top$ and using $\big(\sum_j \phi_j\phi_j^\top\big)(\hat\Phi^\top\hat\Phi)^{-1} = I$, this gradient rewrites as

$$-\frac{1}{2} \nabla J(\theta) = \sum_{j=1}^i \delta^\lambda_j(\theta)\phi_j - \Big( \sum_{j=1}^i g^\lambda_j \phi_j^\top \Big) \big( \hat\Phi^\top \hat\Phi \big)^{-1} \Big( \sum_{j=1}^i \delta^\lambda_j(\theta)\phi_j \Big).$$

Let $w_i(\theta)$ be a quasi-stationary estimate of the last part, which can be recognized as the solution of a least-squares problem (a regression of the λ-TD errors $\delta^\lambda_j$ on the features φj):

$$w_i(\theta) = \big( \hat\Phi^\top \hat\Phi \big)^{-1} \sum_{j=1}^i \delta^\lambda_j(\theta)\phi_j = \operatorname*{argmin}_{w} \sum_{j=1}^i \big( \phi_j^\top w - \delta^\lambda_j(\theta) \big)^2.$$

The identification with the above least-squares solution suggests using the following stochastic gradient descent to form the quasi-stationary estimate: $w_i = w_{i-1} + \beta_i \phi_i \big( \delta^\lambda_i(\theta_{i-1}) - \phi_i^\top w_{i-1} \big)$. This update rule makes use of the λ-TD error, defined through a forward view.
As for the previous algorithm, we can use Proposition 6 to obtain the following backward view update rule, which is equivalent (in expectation, when the Markov chain reaches its steady state):

$$w_i = w_{i-1} + \beta_i \big( z_i \delta_i(\theta_{i-1}) - \phi_i (\phi_i^\top w_{i-1}) \big). \quad (20)$$

Using this quasi-stationary estimate, the gradient can be approximated as:

$$-\frac{1}{2}\nabla J(\theta) \approx \sum_{j=1}^i \delta^\lambda_j(\theta)\phi_j - \Big( \sum_{j=1}^i g^\lambda_j \phi_j^\top \Big) w_i.$$

Therefore, a stochastic gradient descent gives the following update rule for the parameter vector θ:

$$\theta_i = \theta_{i-1} + \alpha_i \big( \delta^\lambda_i(\theta_{i-1}) \phi_i - g^\lambda_i \phi_i^\top w_i \big). \quad (21)$$

Once again, the forward view term $\delta^\lambda_i(\theta_{i-1})\phi_i$ can be turned into a backward view by using Proposition 6. There remains to work on the term $g^\lambda_i \phi_i^\top$. First, one can notice that the term $g^\lambda_i$ satisfies a forward recursion.

Lemma 7 We have $g^\lambda_i = \gamma\rho_i(1-\lambda)\phi_{i+1} + \gamma\lambda\rho_i g^\lambda_{i+1}$.

Proof This result is simply obtained by applying the gradient (with respect to θ) to the forward recursion of $\hat T^\lambda_i \hat V_\theta$ provided in Lemma 4.

Using this, the term $g^\lambda_i \phi_i^\top$ can be worked out similarly to the term $\delta^\lambda_i(\theta_{i-1})\phi_i$.

Proposition 8 Let $z_i$ be the eligibility vector defined in Proposition 6. We have $\mathbb{E}_{\mu_0}[g^\lambda_i \phi_i^\top] = \mathbb{E}_{\mu_0}[\gamma\rho_i(1-\lambda)\phi_{i+1} z_i^\top]$.

Proof The proof is similar to that of Proposition 6. Writing $b_i = \gamma\rho_i(1-\lambda)\phi_{i+1}$ and $\eta_i = \gamma\lambda\rho_i$, we have
$$\mathbb{E}_{\mu_0}[g^\lambda_i \phi_i^\top] = \mathbb{E}_{\mu_0}[(b_i + \eta_i g^\lambda_{i+1})\phi_i^\top] = \mathbb{E}_{\mu_0}[b_i \phi_i^\top] + \mathbb{E}_{\mu_0}[\eta_{i-1}(b_i + \eta_i g^\lambda_{i+1})\phi_{i-1}^\top] = \dots = \mathbb{E}_{\mu_0}[b_i z_i^\top].$$

Using this result and Proposition 6, it is natural to replace Equation (21) by an update based on a backward recursion:

$$\theta_i = \theta_{i-1} + \alpha_i \big( z_i \delta_i(\theta_{i-1}) - \gamma\rho_i(1-\lambda)\phi_{i+1}(z_i^\top w_{i-1}) \big). \quad (22)$$

Last but not least, for the estimate $w_i$ to be indeed quasi-stationary, the learning rates should satisfy the following condition (in addition to the classical conditions): $\lim_{i\to\infty} \alpha_i / \beta_i = 0$. Equations (22) and (20) define the off-policy TDC(λ) algorithm, summarized in Algorithm 6. It was originally proposed by Maei and Sutton (2010) under the name GQ(λ). We call it off-policy TDC(λ) to highlight the fact that it is the extension of the original TDC algorithm of Sutton et al.
(2009) to off-policy learning with traces. One can observe (to our knowledge, this was never mentioned in the literature before) that when λ = 1, the learning rule of TDC(1) reduces to that of TD(1). Maei and Sutton (2010) show that the algorithm converges with probability 1 to the same solution as the LSTD(λ) algorithm (that is, to $\theta = A^{-1} b$) under some technical assumptions. Contrary to off-policy TD(λ), this algorithm does not require $\Pi_0 T^\lambda$ to be a contraction in order to be convergent. Unfortunately, one of the assumptions made in the analysis, requiring that the traces $z_i$ have uniformly bounded second moments, is restrictive, since in an off-policy setting the traces $z_i$ may easily have an infinite variance (unless the behavior policy is really close to the target policy), as noted by Yu (2010a).13 A full proof of convergence thus still remains to be done.

Algorithm 6: Off-policy TDC(λ), also known as GQ(λ)
Initialization: initialize vectors θ0 and w0; set z0 = 0;
for i = 1, 2, . . . do
  Observe φi, ri, φi+1;
  Update traces: $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$;
  Update parameters:
  $\theta_i = \theta_{i-1} + \alpha_i \big( z_i(\rho_i r_i + \gamma\rho_i\phi_{i+1}^\top\theta_{i-1} - \phi_i^\top\theta_{i-1}) - \gamma\rho_i(1-\lambda)\phi_{i+1}(z_i^\top w_{i-1}) \big)$;
  $w_i = w_{i-1} + \beta_i \big( z_i(\rho_i r_i + \gamma\rho_i\phi_{i+1}^\top\theta_i - \phi_i^\top\theta_i) - \phi_i(\phi_i^\top w_{i-1}) \big)$;

Using the same principle (performing a stochastic gradient descent to minimize J(θ)), an alternative to TDC, the GTD2 algorithm, was derived by Sutton et al. (2009) in the λ = 0 case. As far as we know, it has never been extended to off-policy learning with traces; we do it now. Notice that, given the derivation of GQ(λ), obtaining this algorithm is pretty straightforward. To do so, we can start back from Equation (19):

$$-\frac{1}{2}\nabla J(\theta) = \Big( \sum_{j=1}^i (\phi_j - g^\lambda_j)\phi_j^\top \Big) \big( \hat\Phi^\top\hat\Phi \big)^{-1} \Big( \sum_{j=1}^i \delta^\lambda_j(\theta)\phi_j \Big) \approx \Big( \sum_{j=1}^i (\phi_j - g^\lambda_j)\phi_j^\top \Big) w_i.$$

This suggests the following alternative update rule (based on a forward recursion):

$$\theta_i = \theta_{i-1} + \alpha_i (\phi_i - g^\lambda_i)\phi_i^\top w_i.$$

13. See also Randhawa and Juneja (2004).
Using Proposition 8, it is natural to use the following alternative update rule, based on a backward recursion:

$$\theta_i = \theta_{i-1} + \alpha_i \big( \phi_i(\phi_i^\top w_{i-1}) - \gamma\rho_i(1-\lambda)\phi_{i+1}(z_i^\top w_{i-1}) \big).$$

The update of $w_i$ remains the same, and put together this gives off-policy GTD2(λ), summarized in Algorithm 7. The analysis of this new algorithm constitutes a potential topic for future research.

Algorithm 7: Off-policy GTD2(λ)
Initialization: initialize vectors θ0 and w0; set z0 = 0;
for i = 1, 2, . . . do
  Observe φi, ri, φi+1;
  Update traces: $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$;
  Update parameters:
  $\theta_i = \theta_{i-1} + \alpha_i \big( \phi_i(\phi_i^\top w_{i-1}) - \gamma\rho_i(1-\lambda)\phi_{i+1}(z_i^\top w_{i-1}) \big)$;
  $w_i = w_{i-1} + \beta_i \big( z_i(\rho_i r_i + \gamma\rho_i\phi_{i+1}^\top\theta_i - \phi_i^\top\theta_i) - \phi_i(\phi_i^\top w_{i-1}) \big)$;

4.3 Off-policy gBRM(λ)

The algorithm proposed by Baird (1995) minimizes the Bellman residual using a gradient-based approach, in the no-trace and on-policy case. We extend it to eligibility traces and to off-policy learning, which corresponds to instantiating ξ = ω. The cost function to be minimized is then:

$$\sum_{j=1}^i \big( \hat T^\lambda_j \hat V_\omega - \hat V_\omega(s_j) \big)^2.$$

Following the negative of the gradient of the last term leads to the following update rule:

$$\theta_i = \theta_{i-1} - \frac{\alpha_i}{2}\nabla_\omega\big( \hat T^\lambda_i \hat V_\omega - \hat V_\omega(s_i) \big)^2\Big|_{\omega=\theta_{i-1}} = \theta_{i-1} - \alpha_i\,\nabla_\omega\big( \hat T^\lambda_i \hat V_\omega - \hat V_\omega(s_i) \big)\Big|_{\omega=\theta_{i-1}}\big( \hat T^\lambda_i \hat V_{\theta_{i-1}} - \hat V_{\theta_{i-1}}(s_i) \big) = \theta_{i-1} + \alpha_i \big( \phi_i - g^\lambda_i \big)\delta^\lambda_i(\theta_{i-1}),$$

recalling the notation $g^\lambda_i = \nabla \hat T^\lambda_i \hat V_\omega$ first defined in Equation (18). As usual, this update involves a forward view, which we are going to turn into a backward view. The term $\phi_i\delta^\lambda_i$ can be handled thanks to Proposition 6. The term $g^\lambda_i\delta^\lambda_i$ is more difficult, as it is the product of two forward views (until now, we only considered the product of a forward view with a non-recursive term). This can be done thanks to the following original relation (the proof being somewhat tedious, it is deferred to Appendix C):

Proposition 9 Write $g^\lambda_i = \nabla_\omega \hat T^\lambda_i \hat V_\omega$ and define $c_i = 1 + (\gamma\lambda\rho_{i-1})^2 c_{i-1}$, $\zeta_i = \gamma\rho_i(1-\lambda)\phi_{i+1}c_i + \gamma\lambda\rho_{i-1}\zeta_{i-1}$ and $d_i = \delta_i c_i + \gamma\lambda\rho_{i-1} d_{i-1}$. We have that

$$\mathbb{E}_{\mu_0}[\delta^\lambda_i g^\lambda_i] = \mathbb{E}_{\mu_0}\big[\delta_i\zeta_i + d_i\gamma\rho_i(1-\lambda)\phi_{i+1} - \delta_i\gamma\rho_i(1-\lambda)\phi_{i+1}c_i\big].$$
This result (together with Proposition 6) suggests updating the parameters as follows:

$$\theta_i = \theta_{i-1} + \alpha_i \big( \delta_i(z_i + \gamma\rho_i(1-\lambda)\phi_{i+1}c_i - \zeta_i) - d_i\gamma\rho_i(1-\lambda)\phi_{i+1} \big).$$

This gives the off-policy gBRM(λ) algorithm, depicted in Algorithm 8. One can observe that gBRM(1) is equivalent to TD(1) (and thus also to TDC(1), cf. the comment before the description of Algorithm 6). The analysis of this new algorithm is left for future research.

5. Empirical Study

This section aims at empirically comparing the surveyed algorithms. As they only address the policy evaluation problem, we compare the algorithms in their ability to perform policy evaluation (no control, no policy optimization); however, they may straightforwardly be used in an approximate policy iteration scheme (Bertsekas and Tsitsiklis, 1996; Munos, 2003). In order to assess their quality, we consider finite problems where the exact value function can be computed. More precisely, we consider Garnet problems (Archibald et al., 1995), which are a class of randomly constructed finite MDPs. They do not correspond to any specific application, but are totally abstract while remaining representative of the kind of MDP that might be encountered in practice. In our experiments, a Garnet is parameterized by 4 parameters and is written $G(n_S, n_A, b, p)$: $n_S$ is the number of states, $n_A$ is the number of actions, b

Algorithm 8: Off-policy gBRM(λ)
Initialization: initialize vector θ0; set z0 = 0, d0 = 0, c0 = 0, ζ0 = 0;
for i = 1, 2, . . .
do
  Observe φi, ri, φi+1;
  Update traces: $z_i = \phi_i + \gamma\lambda\rho_{i-1}z_{i-1}$; $c_i = 1 + (\gamma\lambda\rho_{i-1})^2 c_{i-1}$; $\zeta_i = \gamma\rho_i(1-\lambda)\phi_{i+1}c_i + \gamma\lambda\rho_{i-1}\zeta_{i-1}$; $d_i = (\rho_i r_i + \gamma\rho_i\phi_{i+1}^\top\theta_{i-1} - \phi_i^\top\theta_{i-1})c_i + \gamma\lambda\rho_{i-1}d_{i-1}$;
  Update parameters: $\theta_i = \theta_{i-1} + \alpha_i\big( (\rho_i r_i + \gamma\rho_i\phi_{i+1}^\top\theta_{i-1} - \phi_i^\top\theta_{i-1})(z_i + \gamma\rho_i(1-\lambda)\phi_{i+1}c_i - \zeta_i) - d_i\gamma\rho_i(1-\lambda)\phi_{i+1} \big)$;

is a branching factor specifying how many next states are possible for each state-action pair (these b states are chosen uniformly at random, and transition probabilities are set by sampling b − 1 uniform random cut points between 0 and 1), and p is the number of features (for function approximation). The reward is state-dependent: for a given randomly generated Garnet problem, the reward for each state is uniformly sampled between 0 and 1. Features are chosen randomly: Φ is an $n_S \times p$ feature matrix whose components are sampled uniformly between 0 and 1. The discount factor γ is set to 0.95 in all experiments. We consider two types of problems, "small" and "big", respectively corresponding to the instances G(30, 2, 2, 8) and G(100, 4, 3, 20). We also consider two types of learning: on-policy and off-policy. In the on-policy setting, for each Garnet a policy π to be evaluated is randomly generated (by sampling $n_A - 1$ random cut points between 0 and 1 for each state), and trajectories (to be used for learning) are sampled according to this same policy. In the off-policy setting, the policy π to be evaluated is randomly generated in the same way, but trajectories are sampled according to a different (similarly randomly generated) behavior policy π0. For all algorithms, we choose θ0 = 0. For the least-squares algorithms (LSTD, LSPE, FPKF and BRM), we set the initial matrices (M0, N0, C0) to $10^3 I$ (the higher this value, the more negligible its effect on estimates; we observed that this parameter did not play a crucial role in practice). We run a first set of experiments in order to set all other parameters (eligibility factor and learning rates).
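A Garnet instance as described above can be generated in a few lines; a minimal sketch (the function name and return convention are ours, following the construction in the text):

```python
import numpy as np

def make_garnet(n_s, n_a, b, p, seed=0):
    """Sketch of a Garnet problem G(n_s, n_a, b, p) (Archibald et al., 1995).
    Returns (P, R, Phi): transition kernel, state rewards, feature matrix."""
    rng = np.random.default_rng(seed)
    P = np.zeros((n_s, n_a, n_s))
    for s in range(n_s):
        for a in range(n_a):
            succ = rng.choice(n_s, size=b, replace=False)  # b possible next states
            cuts = np.sort(rng.random(b - 1))              # b-1 uniform cut points
            P[s, a, succ] = np.diff(np.concatenate(([0.0], cuts, [1.0])))
    R = rng.random(n_s)          # state-dependent reward, uniform in [0, 1)
    Phi = rng.random((n_s, p))   # random feature matrix, uniform in [0, 1)
    return P, R, Phi
```

For instance, `make_garnet(30, 2, 2, 8)` produces a "small" instance and `make_garnet(100, 4, 3, 20)` a "big" one.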
We use the following schedule for the learning rates:

$$\alpha_i = \alpha_0 \frac{\alpha_c}{\alpha_c + i} \quad\text{and}\quad \beta_i = \beta_0 \left( \frac{\beta_c}{\beta_c + i} \right)^{\frac{2}{3}}.$$

More precisely, we generate 30 problems (MDPs and policies) for each possible combination small/big × on-policy/off-policy (leading to four cases). For each problem, we generate one trajectory of length $10^4$ using the behavioral policy (which is the randomly generated target policy in the on-policy case and the behavior policy in the off-policy case), to be used by all algorithms. For each meta-parameter, we consider the following ranges of values: λ ∈ {0, 0.4, 0.7, 0.9, 1}, α0 ∈ {10^-2, 10^-1, 10^0}, αc ∈ {10^1, 10^2, 10^3}, β0 ∈ {10^-2, 10^-1, 10^0} and βc ∈ {10^1, 10^2, 10^3}. Then, we compute the parameter estimates for all algorithms instantiated with each possible combination of the meta-parameters. This gives for each combination a family $\theta_{i,d}$, with i the number of transitions encountered in the trajectory of the d-th problem. Finally, for each case, for all problems and each algorithm, we choose the combination of meta-parameters which minimizes the average error on the last one-tenth of the averaged (over all problems) learning curves (we do this to reduce the sensitivity to the initialization and the transient behavior). Formally, we pick the set of parameters that minimizes the quantity

$$\sum_{d=1}^{30} \sum_{i=9\cdot 10^3}^{10^4} \big\| \Phi\theta_{i,d} - V^{\pi_d} \big\|_2.$$

We provide the empirical results of this first set of experiments in Tables 3 to 6. As a complement, we detail in Figure 1 the sensitivity of all algorithms with respect to the main parameter λ that controls the eligibility traces (averaged over the 30 problems, with the best global meta-parameters for each choice of λ). We comment on these results below.

        λ     α0     αc     β0     βc     err
LSTD    1.0                               2.07
LSPE    1.0                               2.07
FPKF    1.0                               2.07
BRM     1.0                               2.07
TD      1.0   10^-2  10^3                 2.06
gBRM    1.0   10^-2  10^3                 2.06
TDC     1.0   10^-2  10^3   10^-2  10^1   2.06
GTD2    1.0   10^-2  10^3   10^-1  10^2   2.05

Table 3: Small problem (G(30, 2, 2, 8)), on-policy learning (π = π0).
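These schedules can be written down directly; a small sketch (function names are ours, and the grouping of the 2/3 exponent follows our reading of the formula above, chosen so that the two-timescale condition $\alpha_i/\beta_i \to 0$ of Section 4.2 holds):

```python
def alpha_schedule(i, alpha0, alpha_c):
    # alpha_i = alpha_0 * alpha_c / (alpha_c + i), a standard Robbins-Monro decay
    return alpha0 * alpha_c / (alpha_c + i)

def beta_schedule(i, beta0, beta_c):
    # beta_i = beta_0 * (beta_c / (beta_c + i)) ** (2/3): a slower decay, so that
    # alpha_i / beta_i -> 0, as required for the quasi-stationarity of w_i
    return beta0 * (beta_c / (beta_c + i)) ** (2.0 / 3.0)
```

With these choices, $\alpha_i / \beta_i$ behaves like $i^{-1/3}$ and indeed vanishes as i grows.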
Table 3 shows the best global meta-parameters over the 30 considered instances (one trajectory per instance) of a small Garnet problem in an on-policy setting, as well as the related efficiency. Numerically, all methods provide equivalent performance (the slight difference in error is not statistically significant, given the variance of the estimates). All methods use the same eligibility factor (λ = 1), leading to a Monte Carlo estimate, to reach their best performance. Figure 1 (top, left) shows that this choice of λ does matter, and that BRM, gBRM and FPKF are more sensitive to a good choice of the eligibility factor. Table 4 shows the best global meta-parameters over the 30 considered instances (one trajectory per instance) of a big Garnet problem in an on-policy setting, as well as the related performance. These results are consistent with those of the small problem in the on-policy setting (with rather different meta-parameters, apart from the eligibility factor). Here again, the algorithms need the highest value of λ to perform best, except TDC and GTD2, which nevertheless take a high value of λ.

        λ     α0     αc     β0     βc     err
LSTD    1.0                               1.20
LSPE    1.0                               1.20
FPKF    1.0                               1.20
BRM     1.0                               1.20
TD      1.0   10^-1  10^1                 1.25
gBRM    1.0   10^-1  10^1                 1.25
TDC     0.9   10^-1  10^2   10^-1  10^2   1.21
GTD2    0.9   10^-1  10^2   10^-2  10^3   1.22

Table 4: Big problem (G(100, 4, 3, 20)), on-policy learning (π = π0).

Figure 1 (top, right) suggests that as the problem's size grows, the role of the eligibility factor gets more prominent (but with a similar behavior).

        λ     α0     αc     β0     βc     err
LSTD    0.4                               3.69
LSPE    0.4                               3.69
FPKF    0.7                               4.74
BRM     0.0                               4.42
TD      0.4   10^-1  10^2                 3.85
gBRM    0.0   10^-2  10^1                 10.42
TDC     0.4   10^-1  10^1   10^-2  10^1   7.81
GTD2    0.4   10^-1  10^3   10^-2  10^1   4.53

Table 5: Small problem (G(30, 2, 2, 8)), off-policy learning (π ≠ π0).

Table 5 reports the best meta-parameters in an off-policy setting for a small problem (still for 30 instances).
Regarding the least-squares methods, LSTD and LSPE get the best results, whereas FPKF and BRM suffer more from the off-policy aspect. Regarding the gradient methods, TD's performance is good (it is close to that of LSTD/LSPE and better than BRM/FPKF), followed closely by GTD2. TDC and gBRM lead to the worst results. All algorithms use a small or intermediate value of the eligibility factor. Increasing λ would reduce the bias, but the performance would suffer from the variance due to importance sampling, as also shown in Figure 1 (bottom, left). Eventually, Table 6 shows the meta-parameters and performance in the most difficult situation: the off-policy setting of the big problem. These results are consistent with the off-policy results of the small problem, summarized in Table 5. LSTD and LSPE are the most efficient least-squares algorithms and choose the smallest possible value λ = 0. FPKF's and BRM's performances deteriorate (significantly for the latter). TD behaves very well and GTD2 follows closely. The performances of TDC and gBRM are the worst. Figure 1 (bottom, right) is similar to that of the small problem. It shows that TD (with a good learning rate) is quite stable, in particular more so than LSTD/LSPE.

Figure 1: Sensitivity of the performance of the algorithms (y-axis, in logarithmic scale) with respect to the eligibility trace parameter λ (x-axis), for LSTD, LSPE, FPKF, BRM, TD, gBRM, TDC and GTD2. Left: small problem (G(30, 2, 2, 8)); right: big problem (G(100, 4, 3, 20)). Top: on-policy learning (π = π0); bottom: off-policy learning (π ≠ π0).
The main goal of the series of experiments we have just described was to choose reasonable values for the meta-parameters. We have also used these experiments to briefly comment on the relative performance of the algorithms, but this is not statistically significant, as it was based on a few (random) problems on which the meta-parameters have been optimized. Though we will see that the general behavior of the algorithms is globally consistent with what we have seen so far, the series of experiments that we are about to describe aims at providing such a statistically significant performance comparison. For each situation (small and big problems, on- and off-policy), we fix the meta-parameters to the previously reported values and compare the algorithms on several new instances of the problems. The results are reported in Figures 2 to 5. For each of the 4 problems, we randomly generate 100 instances (MDP and policy to be evaluated). For each such problem, we generate a trajectory of length $10^5$; then, all algorithms learn using this very trajectory. On each figure, we report the average performance (left), measured as the difference between the true value function (computed from the model) and the currently estimated one, $\|V^\pi - \Phi\theta\|_2$, as well as the associated standard deviation (right).

        λ     α0     αc     β0     βc     err
LSTD    0.0                               3.76
LSPE    0.0                               3.86
FPKF    0.7                               4.80
BRM     1.0                               10.05
TD      0.4   10^-1  10^1                 2.96
gBRM    0.0   10^-2  10^1                 10.50
TDC     0.0   10^-1  10^1   10^-2  10^1   8.65
GTD2    0.0   10^-1  10^3   10^-2  10^1   4.41

Table 6: Big problem (G(100, 4, 3, 20)), off-policy learning (π ≠ π0).

Figure 2: Performance for small problems (G(30, 2, 2, 8)), on-policy learning (π = π0) (left: average error; right: standard deviation).

We begin by discussing the results in the on-policy setting.
Figure 2 compares all algorithms on 100 randomly generated small problems (that is, each run corresponds to different dynamics, reward function, features and evaluated policy), the meta-parameters being those provided in Table 3. All least-squares approaches provide the best results and are bunched together; this was to be expected, as they all use λ equal to 1. The gBRM, TD and TDC algorithms provide the same results (being equivalent with the choice λ = 1); they are slower than GTD2, which is itself slower than the least-squares algorithms. Figure 3 compares the algorithms on 100 randomly generated big problems, the meta-parameters being those provided in Table 4. These results are similar to those of the small problems, except that TDC now has a different (and slower) behavior, due to the different choice of the eligibility factor (λ = 0.9). GTD2 is still the best gradient-based algorithm.

Figure 3: Performance for big problems (G(100, 4, 3, 20)), on-policy learning (π = π0) (left: average error, right: standard deviation).

Figure 4: Performance for small problems (G(30, 2, 2, 8)), off-policy learning (π ≠ π0) (left: average error, right: standard deviation).

We now consider the off-policy setting. Figure 4 provides the average performance and standard deviation of the algorithms (the meta-parameters being those of Table 5) on 100 small problems. Once again, we can see that LSTD/LSPE provide the best results. The two other least-squares methods (FPKF and BRM) are overtaken by the gradient-based TD algorithm, which follows LSTD/LSPE closely.
GTD2 is a little slower and TDC is the slowest algorithm. Figure 5 provides the same data for the big problems (with the meta-parameters of Table 6). These results are similar to those of the small problems in the off-policy setting, except that TD is even closer to LSTD/LSPE (but requires the choice of a learning rate).

Figure 5: Performance for big problems (G(100, 4, 3, 20)), off-policy learning (π ≠ π0) (left: average error, right: standard deviation).

Summary. Overall, our experiments suggest that the two best algorithms are LSTD and LSPE, since they converge much faster in all situations with less parameter tuning. The gradient-based TD algorithm globally displays a good behavior and constitutes a good alternative when the number p of features is too big for least-squares methods to be implemented. Though some new algorithms/extensions show interesting results (FPKF(λ) is consistently better than the state-of-the-art FPKF by Choi and Van Roy 2006, and gBRM works well in the on-policy setting), most of the other algorithms do not seem to be empirically competitive with the trio LSTD/LSPE/TD, especially in off-policy situations. In particular, the algorithms introduced specifically for the off-policy setting (TDC/GTD2) are much slower than TD in the off-policy case (though GTD2 is faster in the on-policy experiments, yet with more parameter tuning). Moreover, the condition required for the good behavior of LSPE, FPKF and TD (the contraction of Π0T^λ) does not seem to be very restrictive in practice (at least for the Garnet problems we considered): though it is possible to build specific pathological examples where these algorithms diverge, this never happened in our experiments.14
6. Conclusion And Future Work

We have considered least-squares and gradient-based algorithms for value estimation in an MDP context. Starting from the on-policy case with no trace, we have recalled that several algorithms (LSTD, LSPE, FPKF and BRM for least-squares approaches; TD, gBRM and TDC/GTD2 for gradient-based approaches) fall into a common algorithmic pattern, Equation (2). Substituting the original Bellman operator with an operator that deals with traces and off-policy samples naturally leads to the state-of-the-art off-policy trace-based versions of LSTD, LSPE, TD and TDC, and suggests natural extensions of FPKF, BRM, gBRM and GTD2. This way, we surveyed many known and new off-policy eligibility-trace-based algorithms for policy evaluation. We have explained how to derive recursive (memory- and time-efficient) implementations of all these algorithms and discussed their known convergence properties, including an original analysis of BRM(λ) for sufficiently small λ, which implies the previously unknown convergence of GPTD/KTD. Interestingly, it appears that the analysis of off-policy trace-based stochastic gradient algorithms under mild assumptions is still an open problem: the only currently known analysis of TD (Yu, 2010a) applies only to a constrained version of the algorithm, and that of TDC (Maei and Sutton, 2010) relies on an assumption on the boundedness of the second moment of the traces that is restrictive (Yu, 2010a). Filling this theoretical gap, as well as providing complete analyses for the other gradient algorithms and for FPKF(λ) and BRM(λ), constitutes important future work.

14. A preliminary version of this article (Scherrer and Geist, 2011) contains such examples, and also an example where an adversarial choice of λ leads to the divergence of LSTD(λ).
Finally, we have illustrated and compared the behavior of these algorithms; this constitutes the first exhaustive empirical comparison of linear methods.15 Overall, our study suggests that even if the use of eligibility traces generally improves the efficiency of all algorithms, LSTD and LSPE consistently provide the best estimates; and in situations where the computational cost is prohibitive for a least-squares approach (when the number p of features is large), TD probably constitutes the best alternative.

15. To our knowledge, there does not even exist any work reporting and comparing empirical results of LSTD(0) and FPKF(0).

Appendix A. Derivation Of The Recursive Formulas For BRM(λ)

We here detail the derivation of off-policy BRM(λ). We will need two technical lemmas. The first one is the Woodbury matrix identity, which generalizes the Sherman-Morrison formula (given in Lemma 1).

Lemma 10 (Woodbury) Let A, U, C and V be matrices of compatible sizes. Then:

    (A + UCV)^{-1} = A^{-1} − A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}.

The second lemma is a rewriting of imbricated sums:

Lemma 11 Let f : N × N × N → R and n ∈ N. We have:

    Σ_{i=1}^n Σ_{j=i}^n Σ_{k=i}^n f(i,j,k) = Σ_{i=1}^n Σ_{j=1}^i Σ_{k=1}^j f(k,i,j) + Σ_{i=1}^n Σ_{j=1}^{i−1} Σ_{k=1}^j f(k,j,i).

As stated in Equation (14), we have the following batch estimate for BRM(λ):

    θ_i = argmin_{ω ∈ R^p} Σ_{j=1}^i (z_{j→i} − ψ_{j→i}^T ω)² = (Ã_i)^{-1} b̃_i,

where ψ_{j→i} = Σ_{k=j}^i ρ̃_j^{k−1} φ̄_k and z_{j→i} = Σ_{k=j}^i ρ̃_j^{k−1} ρ_k r_k, with Ã_i = Σ_{j=1}^i ψ_{j→i} ψ_{j→i}^T and b̃_i = Σ_{j=1}^i ψ_{j→i} z_{j→i}.

To obtain a recursive formula, these two sums have to be reworked through Lemma 11. Let us first focus on the latter:

    b̃_i = Σ_{j=1}^i ψ_{j→i} z_{j→i}
        = Σ_{j=1}^i ( Σ_{k=j}^i ρ̃_j^{k−1} φ̄_k ) ( Σ_{m=j}^i ρ̃_j^{m−1} ρ_m r_m )
        = Σ_{j=1}^i Σ_{k=1}^j Σ_{m=1}^k ρ̃_m^{j−1} φ̄_j ρ̃_m^{k−1} ρ_k r_k + Σ_{j=1}^i Σ_{k=1}^{j−1} Σ_{m=1}^k ρ̃_m^{k−1} φ̄_k ρ̃_m^{j−1} ρ_j r_j.

Writing y_k = Σ_{m=1}^k (ρ̃_m^{k−1})² = 1 + (γλρ_{k−1})² y_{k−1}, we have that, for k ≤ j:

    Σ_{m=1}^k ρ̃_m^{j−1} ρ̃_m^{k−1} = ρ̃_k^{j−1} y_k.

Therefore:

    b̃_i = Σ_{j=1}^i Σ_{k=1}^j ρ̃_k^{j−1} y_k φ̄_j ρ_k r_k + Σ_{j=1}^i Σ_{k=1}^{j−1} ρ̃_k^{j−1} y_k φ̄_k ρ_j r_j.

With the following notations:

    z_j = Σ_{k=1}^j ρ̃_k^{j−1} y_k ρ_k r_k = γλρ_{j−1} z_{j−1} + ρ_j r_j y_j,
    D_j = Σ_{k=1}^j ρ̃_k^{j−1} y_k φ̄_k = γλρ_{j−1} D_{j−1} + y_j φ̄_j,

and with the convention that z_0 = 0 and D_0 = 0, one can write:

    b̃_i = Σ_{j=1}^i ( φ̄_j ρ_j r_j y_j + γλρ_{j−1} ( φ̄_j z_{j−1} + ρ_j r_j D_{j−1} ) ).
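As a quick sanity check of the derivation above, the recursive definitions of y_j, z_j and D_j can be compared numerically with their batch-sum definitions (the random data, the explicit ρ̃ helper, and the 1-based indexing convention are our test scaffolding):

```python
import numpy as np

rng = np.random.default_rng(1)
T, p = 12, 3
gamma, lam = 0.9, 0.7
rho = rng.uniform(0.5, 1.5, size=T + 1)
phibar = rng.normal(size=(T + 1, p))   # stands for phi_k - gamma rho_k phi_{k+1}
r = rng.normal(size=T + 1)

def rho_tilde(j, k):
    # rho_tilde_j^{k-1} = (lam*gamma)^(k-j) * prod_{m=j}^{k-1} rho_m
    return (lam * gamma) ** (k - j) * np.prod(rho[j:k])

# batch definition of y_k (entry 0 unused; indices run from 1 to T)
y_batch = [np.nan] + [sum(rho_tilde(m, k) ** 2 for m in range(1, k + 1))
                      for k in range(1, T + 1)]

# recursive definitions, as in the derivation
y, z, D = np.zeros(T + 1), np.zeros(T + 1), np.zeros((T + 1, p))
for k in range(1, T + 1):
    y[k] = 1.0 + (gamma * lam * rho[k - 1]) ** 2 * y[k - 1]
    z[k] = gamma * lam * rho[k - 1] * z[k - 1] + rho[k] * r[k] * y[k]
    D[k] = gamma * lam * rho[k - 1] * D[k - 1] + y[k] * phibar[k]

print(max(abs(y[k] - y_batch[k]) for k in range(1, T + 1)))  # numerically zero
```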
Similarly, one can show that:

    Ã_i = Σ_{j=1}^i ψ_{j→i} ψ_{j→i}^T = Σ_{j=1}^i ( φ̄_j φ̄_j^T y_j + γλρ_{j−1} ( φ̄_j D_{j−1}^T + D_{j−1} φ̄_j^T ) ).

Writing

    u_j = √(y_j) φ̄_j,    v_j = (γλρ_{j−1} / √(y_j)) D_{j−1},

and I_2 the 2 × 2 identity matrix, we have:

    Ã_i = Σ_{j=1}^i ( (u_j + v_j)(u_j + v_j)^T − v_j v_j^T ) = Ã_{i−1} + U_i V_i,

with U_i = ( u_i + v_i   −v_i ) and V_i = ( u_i + v_i   v_i )^T. We can apply the Woodbury identity given in Lemma 10 with C = I_2: writing C_i = (Ã_i)^{-1},

    C_i = (Ã_{i−1} + U_i I_2 V_i)^{-1} = C_{i−1} − C_{i−1} U_i (I_2 + V_i C_{i−1} U_i)^{-1} V_i C_{i−1}.

The other sum can also be reworked:

    b̃_i = b̃_{i−1} + φ̄_i ρ_i r_i y_i + γλρ_{i−1} ( D_{i−1} ρ_i r_i + φ̄_i z_{i−1} ) = b̃_{i−1} + U_i W_i,

with W_i = ( √(y_i) ρ_i r_i + γλρ_{i−1} z_{i−1} / √(y_i) ;  γλρ_{i−1} z_{i−1} / √(y_i) ). Finally, the recursive BRM(λ) estimate can be computed as follows:

    θ_i = C_i b̃_i = θ_{i−1} + C_{i−1} U_i (I_2 + V_i C_{i−1} U_i)^{-1} (W_i − V_i θ_{i−1}).

This gives BRM(λ) as provided in Algorithm 4.

Appendix B. Proof Of Theorem 3: Convergence Of BRM(λ)

The proof of Theorem 3 follows the general idea of that of Proposition 4 of Bertsekas and Yu (2009). It is done in two steps. First, we argue that the limit of the sequence is linked to that of an alternative algorithm for which one cuts the traces at a certain depth l. Then, we show that for every depth l this alternative algorithm converges almost surely, we explicitly compute its limit, and we make l tend to infinity to obtain the limit of BRM(λ). We will only show that (1/i) Ã_i tends to A; the argument is similar for (1/i) b̃_i → b. Consider the following l-truncated version of the algorithm, based on the following alternative traces (we here limit the memory of the traces to a size l):

    y_{k,l} = Σ_{m=max(1,k−l+1)}^k (ρ̃_m^{k−1})²,
    D_{j,l} = Σ_{k=max(1,j−l+1)}^j ρ̃_k^{j−1} y_{k,l} φ̄_k,

and update the following matrix:

    Ã_{i,l} = Ã_{i−1,l} + φ̄_i φ̄_i^T y_{i,l} + γλρ_{i−1} ( φ̄_i D_{i−1,l}^T + D_{i−1,l} φ̄_i^T ).

The assumption in Equation (15) implies that ρ̃_i^{j−1} ≤ β^{j−i}; therefore, it can be seen that for all k:

    |y_{k,l} − y_k| = Σ_{m=1}^{max(0,k−l)} (ρ̃_m^{k−1})² ≤ Σ_{m=1}^{max(0,k−l)} β^{2(k−m)} ≤ β^{2l} / (1 − β²) = ϵ1(l),

where ϵ1(l) tends to 0 when l tends to infinity.
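The truncation error ϵ1(l) can also be checked numerically: when ρ̃_m^{k−1} ≤ β^{k−m}, the gap between y_k and its l-truncated version stays below β^{2l}/(1 − β²). A sketch (β is taken as λγ times the largest sampled ratio, mirroring the role of the assumption in Equation (15); the data is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
T, gamma, lam = 40, 0.9, 0.6
rho = rng.uniform(0.3, 1.2, size=T + 1)
beta = lam * gamma * rho.max()            # guarantees rho_tilde <= beta^(k-m)

def rho_tilde(m, k):
    return (lam * gamma) ** (k - m) * np.prod(rho[m:k])

def y_full(k):
    return sum(rho_tilde(m, k) ** 2 for m in range(1, k + 1))

def y_trunc(k, l):
    # memory of the trace limited to depth l
    return sum(rho_tilde(m, k) ** 2 for m in range(max(1, k - l + 1), k + 1))

l = 8
gaps = [abs(y_full(k) - y_trunc(k, l)) for k in range(1, T + 1)]
bound = beta ** (2 * l) / (1 - beta ** 2)
print(max(gaps) <= bound)  # True
```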
Similarly, using the fact that y_k ≤ 1/(1 − β²) and writing K = max_{s,s'} ‖φ(s) − γφ(s')‖, one has for all j:

    ‖D_{j,l} − D_j‖ ≤ Σ_{k=1}^{max(0,j−l)} ρ̃_k^{j−1} y_k ‖φ̄_k‖ + Σ_{k=max(1,j−l+1)}^j ρ̃_k^{j−1} |y_{k,l} − y_k| ‖φ̄_k‖
                   ≤ Σ_{k=1}^{max(0,j−l)} β^{j−k} K / (1 − β²) + Σ_{k=max(1,j−l+1)}^j β^{j−k} ϵ1(l) K
                   ≤ (β^l / (1 − β)) (1 / (1 − β²)) K + (1 / (1 − β)) (β^{2l} / (1 − β²)) K = ϵ2(l),

where ϵ2(l) also tends to 0. Then, it can be seen that:

    ‖Ã_{i,l} − Ã_i‖ ≤ ‖Ã_{i−1,l} − Ã_{i−1}‖ + ‖φ̄_i φ̄_i^T‖ |y_{i,l} − y_i| + 2β ‖φ̄_i‖ ‖D_{i−1,l} − D_{i−1}‖
                   ≤ ‖Ã_{i−1,l} − Ã_{i−1}‖ + K² ϵ1(l) + 2βK ϵ2(l),

and, by induction on i, one obtains ‖Ã_{i,l} − Ã_i‖ ≤ i ϵ(l), where ϵ(l) = K² ϵ1(l) + 2βK ϵ2(l) tends to 0 when l tends to infinity. This implies that lim sup_i ‖Ã_{i,l}/i − Ã_i/i‖ ≤ ϵ(l). In other words, lim_i Ã_i/i and lim_l lim_i Ã_{i,l}/i are equal if the latter exists. In the remainder of the proof, we show that the latter limit indeed exists and we compute it explicitly.

Let us fix some l and consider the sequence (Ã_{i,l}/i). At some index i, y_{i,l} depends only on the last l samples, while D_{i,l} depends on the same samples and on the last l values of y_{j,l}, thus on the last 2l samples. It is then natural to view the computation of Ã_{i,l}, which is based on y_{i,l}, D_{i−1,l} and φ̄_i = φ_i − γρ_i φ_{i+1}, as being related to a Markov chain of which the states are the 2l + 1 consecutive states of the original chain (s_{i−2l}, ..., s_i, s_{i+1}). Write E0 the expectation with respect to its stationary distribution. By the Markov chain Ergodic Theorem, we have with probability 1:

    lim_i Ã_{i,l}/i = E0[ φ̄_{2l} φ̄_{2l}^T y_{2l,l} + λγρ_{2l−1} ( φ̄_{2l} D_{2l−1,l}^T + D_{2l−1,l} φ̄_{2l}^T ) ].   (23)

Let us now explicitly compute this expectation. Write x_i the indicator vector (of which the kth coordinate equals 1 when the state at time i is k, and 0 otherwise). One then has φ_i = Φ^T x_i.
Let us first look at the left part of the above limit:

    E0[ φ̄_{2l} φ̄_{2l}^T y_{2l,l} ]
      = E0[ (φ_{2l} − γρ_{2l} φ_{2l+1})(φ_{2l} − γρ_{2l} φ_{2l+1})^T y_{2l,l} ]
      = E0[ Φ^T (x_{2l} − γρ_{2l} x_{2l+1})(x_{2l} − γρ_{2l} x_{2l+1})^T Φ Σ_{m=l+1}^{2l} (λγ)^{2(2l−m)} (ρ_m^{2l−1})² ]
      = Φ^T { Σ_{m=l+1}^{2l} (λγ)^{2(2l−m)} E0[ (ρ_m^{2l−1})² (x_{2l} − γρ_{2l} x_{2l+1})(x_{2l} − γρ_{2l} x_{2l+1})^T ] } Φ
      = Φ^T { Σ_{m=l+1}^{2l} (λγ)^{2(2l−m)} E0[ X_{m,2l,2l} − γ X_{m,2l,2l+1} − γ X_{m,2l+1,2l} + γ² X_{m,2l+1,2l+1} ] } Φ,

where we used the definition ρ̃_j^{k−1} = (λγ)^{k−j} ρ_j^{k−1} and the notation X_{m,i,j} = ρ_m^{i−1} ρ_m^{j−1} x_i x_j^T. To finish the computation, we will mainly rely on the following lemma.

Lemma 12 (Some Identities) Let P̃ be the matrix of which the coordinates are p̃_{ss'} = Σ_a π(s,a) ρ(s,a) T(s,a,s'), which is in general not a stochastic matrix. Let µ0 be the stationary distribution of the behavior policy π0. Write D̃_i = diag( (P̃^T)^i µ0 ). Then:

    for m ≤ i,        E0[X_{m,i,i}] = D̃_{i−m},
    for m ≤ i ≤ j,    E0[X_{m,i,j}] = D̃_{i−m} P^{j−i},
    for m ≤ j ≤ i,    E0[X_{m,i,j}] = (P^T)^{i−j} D̃_{j−m}.

Proof We first observe that:

    E0[X_{m,i,i}] = E0[(ρ_m^{i−1})² x_i x_i^T] = E0[(ρ_m^{i−1})² diag(x_i)] = diag( E0[(ρ_m^{i−1})² x_i] ).

To provide the identity, we will thus simply prove by induction that E0[(ρ_m^{i−1})² x_i] = (P̃^T)^{i−m} µ0. For i = m, we have E0[x_m] = µ0. Now suppose the relation holds for i and let us prove it for i + 1. Write F_i the realization of the process until time i. We have:

    E0[(ρ_m^i)² x_{i+1}] = E0[ E0[(ρ_m^i)² x_{i+1} | F_i] ] = E0[ (ρ_m^{i−1})² E0[(ρ_i)² x_{i+1} | F_i] ].

Recalling that s_i is the state at time i and x_i is the indicator vector corresponding to s_i, one has for all s':

    E0[(ρ_i)² x_{i+1}(s') | F_i] = Σ_a π0(s_i,a) ρ(s_i,a)² T(s_i,a,s') = Σ_a π(s_i,a) ρ(s_i,a) T(s_i,a,s') = [P̃^T x_i](s').

As this is true for all s', we deduce that E0[(ρ_i)² x_{i+1} | F_i] = P̃^T x_i and

    E0[(ρ_m^i)² x_{i+1}] = E0[(ρ_m^{i−1})² P̃^T x_i] = P̃^T E0[(ρ_m^{i−1})² x_i] = P̃^T (P̃^T)^{i−m} µ0 = (P̃^T)^{i−m+1} µ0,

which concludes the proof by induction. Let us consider the next identity.
For i ≤ j,

    E0[ρ_m^{i−1} ρ_m^{j−1} x_i x_j^T] = E0[ E0[ρ_m^{i−1} ρ_m^{j−1} x_i x_j^T | F_i] ]
      = E0[ (ρ_m^{i−1})² x_i E0[ρ_i^{j−1} x_j^T | F_i] ]
      = E0[ (ρ_m^{i−1})² x_i x_i^T P^{j−i} ]
      = diag( (P̃^T)^{i−m} µ0 ) P^{j−i}.

Eventually, the last identity is obtained by considering X_{m,j,i} = X_{m,i,j}^T.

Thus, coming back to our calculus,

    E0[ φ̄_{2l} φ̄_{2l}^T y_{2l,l} ]
      = Φ^T { Σ_{m=l+1}^{2l} (λγ)^{2(2l−m)} ( D̃_{2l−m} − γ D̃_{2l−m} P − γ P^T D̃_{2l−m} + γ² D̃_{2l+1−m} ) } Φ
      = Φ^T ( D^l − γ D^l P − γ P^T D^l + γ² D'^l ) Φ,   (24)

with D^l = Σ_{j=0}^{l−1} (λγ)^{2j} D̃_j and D'^l = Σ_{j=0}^{l−1} (λγ)^{2j} D̃_{j+1}.

Similarly, the second term on the right side of Equation (23) satisfies:

    E0[ ρ_{2l−1} D_{2l−1,l} φ̄_{2l}^T ]
      = E0[ ρ_{2l−1} ( Σ_{k=l}^{2l−1} ρ̃_k^{2l−2} y_{k,l} φ̄_k ) φ̄_{2l}^T ]
      = E0[ Σ_{k=l}^{2l−1} (λγ)^{2l−1−k} ρ_k^{2l−1} ( Σ_{m=k−l+1}^k (ρ̃_m^{k−1})² ) Φ^T (x_k − γρ_k x_{k+1})(x_{2l} − γρ_{2l} x_{2l+1})^T Φ ]
      = Φ^T { Σ_{k=l}^{2l−1} (λγ)^{2l−1−k} Σ_{m=k−l+1}^k (λγ)^{2(k−m)} E0[ ρ_m^{2l−1} ρ_m^{k−1} (x_k − γρ_k x_{k+1})(x_{2l} − γρ_{2l} x_{2l+1})^T ] } Φ
      = Φ^T { Σ_{k=l}^{2l−1} (λγ)^{2l−1−k} Σ_{m=k−l+1}^k (λγ)^{2(k−m)} E0[ X_{m,k,2l} − γ X_{m,k+1,2l} − γ X_{m,k,2l+1} + γ² X_{m,k+1,2l+1} ] } Φ
      = Φ^T { Σ_{k=l}^{2l−1} (λγ)^{2l−1−k} Σ_{m=k−l+1}^k (λγ)^{2(k−m)} ( D̃_{k−m} P^{2l−k} − γ D̃_{k+1−m} P^{2l−k−1} − γ D̃_{k−m} P^{2l+1−k} + γ² D̃_{k+1−m} P^{2l−k} ) } Φ
      = Φ^T { Σ_{k=l}^{2l−1} (λγ)^{2l−1−k} Σ_{m=k−l+1}^k (λγ)^{2(k−m)} ( D̃_{k−m} P − γ D̃_{k+1−m} ) P^{2l−1−k} (I − γP) } Φ
      = Φ^T { Σ_{k=l}^{2l−1} (λγ)^{2l−1−k} ( D^l P − γ D'^l ) P^{2l−1−k} (I − γP) } Φ
      = Φ^T ( D^l P − γ D'^l ) Q_l (I − γP) Φ,

with Q_l = Σ_{j=0}^{l−1} (λγP)^j. Gathering this and Equation (24), we see that the limit of Ã_{i,l}/i expressed in Equation (23) equals:

    Φ^T [ D^l − γ D^l P − γ P^T D^l + γ² D'^l + λγ ( (D^l P − γ D'^l) Q_l (I − γP) + (I − γP^T) Q_l^T (P^T D^l − γ D'^l) ) ] Φ.

When l tends to infinity, Q_l tends to Q = (I − λγP)^{-1}. The assumption of Equation (15) ensures that (λγ)² P̃ has spectral radius smaller than 1, and thus, when l tends to infinity, D^l tends to D = diag( (I − (λγ)² P̃^T)^{-1} µ0 ) and D'^l to D' = diag( P̃^T (I − (λγ)² P̃^T)^{-1} µ0 ). In other words, lim_l lim_i Ã_{i,l}/i exists with probability 1 and equals:

    Φ^T [ D − γDP − γP^T D + γ² D' + λγ ( (DP − γD') Q (I − γP) + (I − γP^T) Q^T (P^T D − γD') ) ] Φ.

Eventually, this shows that lim_i Ã_i/i exists with probability 1 and shares the same value.
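The identities of Lemma 12 can be verified by brute-force enumeration on a tiny MDP: summing µ0(s0) · Π π0 T over all state-action paths, weighted by the squared importance ratios, must reproduce (P̃^T)^n µ0. A sketch (the random two-state MDP is arbitrary test data; only the first identity is checked):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
S, A, n = 2, 2, 3                       # n plays the role of i - m
T = rng.dirichlet(np.ones(S), size=(S, A))      # T[s, a] = P(.|s, a)
pi0 = rng.dirichlet(np.ones(A), size=S)         # behavior policy
pi = rng.dirichlet(np.ones(A), size=S)          # target policy
rho = pi / pi0                                  # importance ratios

# P0: behavior transition matrix; mu0: its stationary distribution
P0 = np.einsum('sa,sax->sx', pi0, T)
w, v = np.linalg.eig(P0.T)
mu0 = np.real(v[:, np.argmax(np.real(w))])
mu0 /= mu0.sum()

# Ptilde_{ss'} = sum_a pi(s,a) rho(s,a) T(s,a,s')
Pt = np.einsum('sa,sa,sax->sx', pi, rho, T)

# brute force: all paths of length n under the behavior policy,
# each weighted by the product of squared ratios along the path
lhs = np.zeros(S)
for s0 in range(S):
    for actions in product(range(A), repeat=n):
        for states in product(range(S), repeat=n):
            path = (s0,) + states
            prob, weight = mu0[s0], 1.0
            for t in range(n):
                s, a, s2 = path[t], actions[t], path[t + 1]
                prob *= pi0[s, a] * T[s, a, s2]
                weight *= rho[s, a] ** 2
            lhs[path[-1]] += prob * weight

rhs = np.linalg.matrix_power(Pt.T, n) @ mu0
print(np.allclose(lhs, rhs))  # True
```

The key cancellation is the one used in the proof: π0 ρ² = π ρ, so each enumeration step applies exactly P̃^T.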
A similar reasoning allows one to show that lim_i b̃_i/i exists and equals

    Φ^T ( (I − γP^T) Q^T D + λγ (DP − γD') Q ) R^π.

Appendix C. Proof Of Proposition 9

To prove Proposition 9, we need the following technical lemma.

Lemma 13 Let (α_i) and (β_i) be two forward recursions defined as

    α_i = a_i + η_i α_{i+1}   and   β_i = b_i + η_i β_{i+1}.

Assume that for any function f we have E[f(a_i, b_i, η_i)] = E[f(a_{i−1}, b_{i−1}, η_{i−1})] (this is typically true if the index i refers to a state sampled according to some stationary distribution, which is the case we are interested in). Let also u_i, v_i and w_i be the backward recursions defined as

    w_i = 1 + η_{i−1}² w_{i−1},
    u_i = a_i w_i + η_{i−1} u_{i−1},
    v_i = b_i w_i + η_{i−1} v_{i−1}.

Then, we have E[α_i β_i] = E[a_i v_i + b_i u_i − a_i b_i w_i].

Proof The proof looks like that of Proposition 6, but is a little more involved. A key equality, to be applied repeatedly, is:

    α_i β_i = (a_i + η_i α_{i+1})(b_i + η_i β_{i+1}) = a_i β_i + b_i α_i + η_i² α_{i+1} β_{i+1} − a_i b_i.

Another equality, also to be used repeatedly, makes use of the stationarity assumption: for any k ≥ 0 we have

    E[ ( Π_{j=0}^k η_{i−j}² ) α_{i+1} β_{i+1} ] = E[ ( Π_{j=1}^{k+1} η_{i−j}² ) α_i β_i ].

These two identities can be used to work the term of interest:

    E[α_i β_i] = E[(a_i + η_i α_{i+1})(b_i + η_i β_{i+1})]
               = E[a_i β_i] + E[b_i α_i] + E[η_i² α_{i+1} β_{i+1}] − E[a_i b_i]
               = E[a_i β_i] + E[b_i α_i] − E[a_i b_i] + E[η_{i−1}² (a_i + η_i α_{i+1})(b_i + η_i β_{i+1})]
               = E[a_i (1 + η_{i−1}²) β_i] + E[b_i (1 + η_{i−1}²) α_i] − E[a_i b_i (1 + η_{i−1}²)] + E[(η_{i−1} η_i)² α_{i+1} β_{i+1}].

This process can be repeated, giving

    E[α_i β_i] = E[ (a_i β_i + b_i α_i − a_i b_i)(1 + η_{i−1}² + (η_{i−1} η_{i−2})² + ...) ].

We have that w_i = 1 + η_{i−1}² w_{i−1} = 1 + η_{i−1}² + (η_{i−1} η_{i−2})² + ..., therefore

    E[α_i β_i] = E[a_i w_i β_i] + E[b_i w_i α_i] − E[a_i b_i w_i].

We can work on the first term:

    E[a_i w_i β_i] = E[a_i w_i (b_i + η_i β_{i+1})] = E[a_i w_i b_i] + E[a_{i−1} w_{i−1} η_{i−1} (b_i + η_i β_{i+1})]
                  = E[ b_i ( a_i w_i + η_{i−1} a_{i−1} w_{i−1} + η_{i−1} η_{i−2} a_{i−2} w_{i−2} + ... ) ] = E[b_i u_i].

The work on the second term is symmetric: E[b_i w_i α_i] = E[a_i v_i]. This finishes proving the result.

The proof of Proposition 9 is a simple application of the preceding technical lemma.
By Lemma 5, we have that

    δ_i^λ = δ_i + γλρ_i δ_{i+1}^λ,

which matches the first forward recursion of Lemma 13 with α_i = δ_i^λ, a_i = δ_i and η_i = γλρ_i. By Lemma 7, we have that

    g_i^λ = γρ_i (1 − λ) φ_{i+1} + γλρ_i g_{i+1}^λ,

which matches the second forward recursion with β_i = g_i^λ and b_i = γρ_i (1 − λ) φ_{i+1}. The result is then a direct application of Lemma 13.

References

A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. In Conference on Learning Theory (COLT), 2006.

T. Archibald, K. McKinnon, and L. Thomas. On the generation of Markov decision processes. Journal of the Operational Research Society, 46:354-361, 1995.

L.C. Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning (ICML), 1995.

D.P. Bertsekas and S. Ioffe. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical report, MIT, 1996.

D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

D.P. Bertsekas and H. Yu. Projected equation methods for approximate solution of large linear systems. Journal of Computational and Applied Mathematics, 227:27-50, 2009.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In S. Sra, S. Nowozin, and S.J. Wright, editors, Optimization for Machine Learning, pages 351-368. MIT Press, 2011.

J.A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2-3):233-246, 2002.

S.J. Bradtke and A.G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33-57, 1996.

D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems, 16:207-239, 2006.

Y. Engel. Algorithms and Representations for Reinforcement Learning. PhD thesis, Hebrew University, 2005.

M. Geist and O. Pietquin. Eligibility traces through colored noises.
In IEEE International Conference on Ultra Modern Telecommunications and Control Systems (ICUMT), 2010a.

M. Geist and O. Pietquin. Kalman temporal differences. Journal of Artificial Intelligence Research (JAIR), 39:483-532, 2010b.

M. Geist and O. Pietquin. Algorithmic survey of parametric value function approximation. IEEE Transactions on Neural Networks and Learning Systems, 24(6):845-867, 2013.

M. Kearns and S. Singh. Bias-variance error bounds for temporal difference updates. In Conference on Learning Theory (COLT), 2000.

J.Z. Kolter. The fixed points of off-policy TD. In Advances in Neural Information Processing Systems (NIPS), 2011.

M.G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003.

H.R. Maei and R.S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Conference on Artificial General Intelligence (AGI), 2010.

R. Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning (ICML), 2003.

A. Nedić and D.P. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13:79-110, 2003.

D. Precup, R.S. Sutton, and S.P. Singh. Eligibility traces for off-policy policy evaluation. In International Conference on Machine Learning (ICML), 2000.

D. Precup, R.S. Sutton, and S. Dasgupta. Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning (ICML), 2001.

R.S. Randhawa and S. Juneja. Combining importance sampling and temporal difference control variates to simulate Markov chains. ACM Transactions on Modeling and Computer Simulation, 14(1):1-30, 2004.

B.D. Ripley. Stochastic Simulation. Wiley & Sons, 1987.

B. Scherrer. Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view.
In International Conference on Machine Learning (ICML), 2010.

B. Scherrer and M. Geist. Recursive least-squares learning with eligibility traces. In European Workshop on Reinforcement Learning (EWRL), 2011.

R. Schoknecht. Optimality of reinforcement learning algorithms with linear function approximation. In Advances in Neural Information Processing Systems (NIPS), 2002.

R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). MIT Press, 1998.

R.S. Sutton, H.R. Maei, D. Precup, S. Bhatnagar, D. Silver, Cs. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning (ICML), 2009.

J.N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

H. Yu. Convergence of least-squares temporal difference methods under general conditions. In International Conference on Machine Learning (ICML), 2010a.

H. Yu. Least squares temporal difference methods: An analysis under general conditions. Technical Report C-2010-39, University of Helsinki, September 2010b.

H. Yu and D.P. Bertsekas. New error bounds for approximations from projected linear equations. Technical Report C-2008-43, Dept. of Computer Science, University of Helsinki, July 2008.