# Incremental Truncated LSTD

Clement Gehring, MIT CSAIL, Cambridge, MA 02139, USA, gehring@csail.mit.edu
Yangchen Pan and Martha White, Indiana University, Bloomington, IN 47405, USA, {yangpan, martha}@indiana.edu

Balancing between computational efficiency and sample efficiency is an important goal in reinforcement learning. Temporal difference (TD) learning algorithms stochastically update the value function, with linear time complexity in the number of features, whereas least-squares temporal difference (LSTD) algorithms are sample efficient but can be quadratic in the number of features. In this work, we develop an efficient incremental low-rank LSTD(λ) algorithm that progresses towards the goal of better balancing computation and sample efficiency. The algorithm reduces the computation and storage complexity to the number of features times the chosen rank parameter, while summarizing past samples efficiently to nearly obtain the sample efficiency of LSTD. We derive a simulation bound on the solution given by truncated low-rank approximation, illustrating a bias-variance trade-off dependent on the choice of rank. We demonstrate that the algorithm effectively balances computational complexity and sample efficiency for policy evaluation in a benchmark task and a high-dimensional energy allocation domain.

1 Introduction

Value function approximation is a central goal in reinforcement learning. A common approach to learning the value function is to minimize the mean-squared projected Bellman error, with dominant approaches generally split into stochastic temporal difference (TD) methods and least-squares temporal difference (LSTD) methods. TD learning [Sutton, 1988] requires only O(d) computation and storage per step for d features, but can be sample inefficient [Bradtke and Barto, 1996; Boyan, 1999; Geramifard and Bowling, 2006] because a sample is used only once for a stochastic update. Nonetheless, for practical incremental updating, particularly with high-dimensional features, it remains a dominant approach. On the other end of the spectrum, LSTD [Bradtke and Barto, 1996] algorithms summarize all past data into a linear system, and are more sample efficient than TD [Bradtke and Barto, 1996; Boyan, 1999; Geramifard and Bowling, 2006], but at the cost of higher computational and storage complexity. Several algorithms have been proposed to tackle these practical issues,¹ including iLSTD [Geramifard and Bowling, 2006], iLSTD(λ) [Geramifard et al., 2007], sigma-point policy iteration [Bowling and Geramifard, 2008], random projections [Ghavamzadeh et al., 2010], experience replay strategies [Lin, 1993; Prashanth et al., 2013] and forgetful LSTD [van Seijen and Sutton, 2015]. Practical incremental LSTD strategies typically consist of using the system as a model [Geramifard and Bowling, 2006; Geramifard et al., 2007; Bowling and Geramifard, 2008], similar to experience replay, or using random projections to reduce the size of the system [Ghavamzadeh et al., 2010]. To date, however, none seem to take advantage of the fact that the LSTD system is likely low-rank, due to dependent features [Bertsekas, 2007], small numbers of samples [Kolter and Ng, 2009; Ghavamzadeh et al., 2010], or principal subspaces or highways in the environment [Keller et al., 2006]. In this work, we propose t-LSTD, a novel incremental low-rank LSTD(λ) algorithm, to further bridge the gap between computation and sample efficiency.
The key advantage of using a low-rank approximation is to direct approximation error to the less significant parts of the system. For the original linear system with d features and corresponding d × d matrix, we incrementally maintain a truncated rank-r singular value decomposition (SVD), which reduces storage to significantly smaller d × r matrices and computation to O(dr + r³). In addition to these practical computational gains, this approach has several key benefits. First, it exploits the fact that the linear system likely has redundancies, reducing computation and storage without sacrificing much accuracy. Second, the resulting solution is better conditioned, as truncating singular values is a form of regularization. Regularization strategies have proven effective for stability [Bertsekas, 2007; Farahmand et al., 2008; Kolter and Ng, 2009; Farahmand, 2011]; however, unlike these previous approaches, the truncated SVD also reduces the size of the system. Third, like iLSTD, it provides a close approximation to the system, but with storage complexity reduced to O(dr) instead of O(d²) and a more intuitive toggle r to balance computation and approximation. Finally, the approach is more promising for tracking and for control, because previous samples can be efficiently down-weighted in O(r) and the solution can be computed in O(dr) time, enabling every-step updating.

¹ A somewhat orthogonal strategy is to sub-select features before applying LSTD [Keller et al., 2006]. Feature selection is an important topic on its own; we therefore focus exploration on direct approximations of the LSTD system itself.

To better investigate the merit of low-rank approximations for LSTD, we first derive a new simulation bound for low-rank approximations, highlighting the bias-variance trade-off given by this form of regularization. We then empirically investigate the rank properties of the system in a benchmark task (Mountain Car) with common feature representations (tile coding and RBFs), to explore the validity of using low-rank approximation in reinforcement learning. Finally, we demonstrate the efficacy of t-LSTD for value function approximation in this domain as well as in a high-dimensional energy allocation domain.

2 Problem formulation

We assume the agent interacts with and receives reward from an environment formalized by a Markov decision process $(S, A, \mathrm{Pr}, r, \gamma)$, where $S$ is the set of states, with $n = |S|$; $A$ is the set of actions; $\mathrm{Pr} : S \times A \times S \to [0, 1]$ is the transition probability function; $r : S \times A \times S \to \mathbb{R}$ is the reward function, where $\mathrm{Pr}(s, a, s')$ is the probability of transitioning from state $s$ into state $s'$ when taking action $a$, receiving reward $r(s, a, s')$; and $\gamma \in [0, 1]$ is the discount rate. For a policy $\pi : S \times A \to [0, 1]$, where $\sum_{a \in A} \pi(s, a) = 1$ for all $s \in S$, define the matrix $P_\pi \in \mathbb{R}^{n \times n}$ as $P_\pi(s, s') = \sum_{a \in A} \pi(s, a)\,\mathrm{Pr}(s, a, s')$ and the vector $r_\pi \in \mathbb{R}^n$ as the average rewards from each state under $\pi$. The value of a state $s_t$ is the expected discounted sum of future rewards, assuming actions are selected according to $\pi$:

$$V_\pi(s_t) = r_\pi(s_t) + \gamma \sum_{s_{t+1}} P_\pi(s_t, s_{t+1}) V_\pi(s_{t+1}).$$

Value function learning using linear function approximation can be expressed as a linear system [Bradtke and Barto, 1996]: $A w = b$, for

$$A = X^\top D (I - \gamma\lambda P_\pi)^{-1} (I - \gamma P_\pi) X, \qquad b = X^\top D (I - \gamma\lambda P_\pi)^{-1} r_\pi,$$

where each row of $X \in \mathbb{R}^{n \times d}$ corresponds to the features for a state; $D$ is a diagonal matrix with the stationary distribution of $\pi$ on the diagonal; and $\lambda$ is the trace parameter for the $\lambda$-return.
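To make the system above concrete, the following is a minimal numpy sketch (an illustration added here, not code from the paper) that forms A and b exactly for a small, fully specified MDP and solves for w. The function name `lstd_system` and its argument names are assumptions made for this example; in practice $P_\pi$, $r_\pi$ and the stationary distribution are unknown, and the system must be estimated from samples, as described next.

```python
import numpy as np

def lstd_system(X, P, R, d_stat, gamma, lam):
    """Closed-form LSTD(lambda) system for a fully known MDP (illustrative sketch).

    X: n x d feature matrix; P: n x n state-transition matrix under the policy;
    R: length-n expected rewards; d_stat: stationary distribution of the policy.
    Returns A, b and the solution w of A w = b; the value estimate is X @ w.
    """
    n = X.shape[0]
    D = np.diag(d_stat)
    M = np.linalg.inv(np.eye(n) - gamma * lam * P)      # (I - gamma*lambda*P)^{-1}
    A = X.T @ D @ M @ (np.eye(n) - gamma * P) @ X
    b = X.T @ D @ M @ R
    w = np.linalg.lstsq(A, b, rcond=None)[0]
    return A, b, w

# Tiny usage example: a deterministic 3-state cycle with random features.
rng = np.random.default_rng(0)
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
X = rng.normal(size=(3, 2))
R = np.array([0.0, 0.0, 1.0])            # expected reward from the third state
d_stat = np.full(3, 1.0 / 3.0)           # uniform stationary distribution of this cycle
A, b, w = lstd_system(X, P, R, d_stat, gamma=0.9, lam=0.5)
```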
For action-value function approximation, the system is the same, but with state-action features in $X$. These matrices are approximated from a sampled trajectory $s_0, a_0, r_1, s_1, a_1, \ldots, s_{T-1}, a_{T-1}, r_T, s_T$ as

$$A_T = \frac{1}{T}\sum_{t=0}^{T-1} z_t (x_t - \gamma x_{t+1})^\top, \qquad b_T = \frac{1}{T}\sum_{t=0}^{T-1} z_t r_{t+1},$$

for eligibility trace $z_t = \sum_{i=0}^{t} (\gamma\lambda)^{t-i} x_i$. There are several strategies to solve this system incrementally. A standard approach is to use TD and its variants, which stochastically update $w$ with each new sample as $w \leftarrow w + \alpha_t \delta_t z_t$, with TD error $\delta_t = r_{t+1} + \gamma x_{t+1}^\top w - x_t^\top w$. The LSTD algorithms instead incrementally approximate these matrices or the corresponding system. For example, the original LSTD algorithm [Bradtke and Barto, 1996] incrementally maintains $A_t^{-1}$ using the matrix inversion lemma, so that on each step the new solution $w = A_t^{-1} b_t$ can be computed. We iteratively update and solve this system by maintaining a low-rank approximation to $A_t$ directly.

Any matrix $A \in \mathbb{R}^{d \times d}$ has a singular value decomposition (SVD) $A = U \Sigma V^\top$, where $\Sigma \in \mathbb{R}^{d \times d}$ is a diagonal matrix of the singular values of $A$ and $U, V \in \mathbb{R}^{d \times d}$ are orthonormal matrices: $U^\top U = I = V^\top V$ and $U U^\top = I = V V^\top$. With this decomposition, for full-rank $A$, the inverse of $A$ is computed simply by inverting the singular values, to get $w = V \Sigma^{-1} U^\top b$. In many cases, however, the rank of $A$ may be smaller than $d$, giving $d - \mathrm{rank}(A)$ singular values that are zero. Further, we can approximate $A$ by dropping (i.e., zeroing) some number of the smallest singular values, to obtain a rank-r approximation. Correspondingly, columns of $U$ and $V$ are dropped, reducing the size of these matrices to d × r. The further we reduce the dimension, the more practical for efficient incremental updating; there is, however, clearly a trade-off in terms of accuracy of the solution. We first investigate the theoretical properties of using a low-rank approximation to $A_t$ and then present the incremental t-LSTD algorithm.

3 Characterizing the low-rank approximation

Low-rank approximations provide an efficient approach to obtaining stable solutions for linear systems. The approach is particularly well motivated for our resource-constrained setting because of the classical Eckart-Young-Mirsky theorem [Eckart and Young, 1936; Mirsky, 1960], which states that the optimal rank-r approximation to a matrix under any unitarily invariant norm (e.g., Frobenius norm, spectral norm, nuclear norm) is the truncated singular value decomposition. In addition to this nice property, which facilitates the development of an efficient approximate LSTD algorithm, the truncated SVD can be viewed as a form of regularization [Hansen, 1986], improving the stability of the solution. To see why the truncated SVD regularizes the solution, consider the solution to the linear system

$$w = A^\dagger b = V \Sigma^\dagger U^\top b = \sum_{i=1}^{\mathrm{rank}(A)} \frac{u_i^\top b}{\sigma_i}\, v_i$$

for ordered singular values $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_{\mathrm{rank}(A)} > \sigma_{\mathrm{rank}(A)+1} = 0, \ldots, \sigma_d = 0$. Here $A^\dagger$ is the pseudo-inverse of $A$, with $\Sigma^\dagger = \mathrm{diag}(\sigma_1^{-1}, \ldots, \sigma_{\mathrm{rank}(A)}^{-1}, 0, \ldots, 0)$ composed of the inverses of the non-zero singular values. For very small, but still non-zero, $\sigma_i$, the outer product $v_i u_i^\top$ will be scaled by a large number; this will often correspond to highly overfitting the observed samples and a high-variance estimate. A common practice is to regularize $w$ with $\eta \|w\|^2$ for regularization weight $\eta > 0$, modifying the multiplier from $\sigma_i^{-1}$ to $\sigma_i / (\sigma_i^2 + \eta)$, because

$$w = (A^\top A + \eta I)^{-1} A^\top b = V (\Sigma^2 + \eta I)^{-1} \Sigma\, U^\top b.$$

The regularization reduces variance but introduces bias controlled by $\eta$; for $\eta = 0$, we obtain the unbiased solution.
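Both regularization routes are easy to state in code: the Tikhonov multiplier $\sigma_i/(\sigma_i^2 + \eta)$ just derived, and the rank-r truncation developed next. The sketch below is an added illustration (not code from the paper); the sanity check simply confirms the SVD form of the ridge solution against the normal-equations form.

```python
import numpy as np

def solve_truncated(A, b, r):
    """Rank-r truncated-SVD solution: w_r = sum_{i<=r} (u_i^T b / sigma_i) v_i."""
    U, s, Vt = np.linalg.svd(A)
    return Vt[:r].T @ ((U[:, :r].T @ b) / s[:r])

def solve_ridge(A, b, eta):
    """Tikhonov-regularized solution: multipliers sigma_i / (sigma_i^2 + eta)."""
    U, s, Vt = np.linalg.svd(A)
    return Vt.T @ ((s / (s ** 2 + eta)) * (U.T @ b))

# Sanity check: the SVD form of the ridge solution matches the normal equations.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
b = rng.normal(size=6)
eta = 0.1
w_ridge = np.linalg.solve(A.T @ A + eta * np.eye(6), A.T @ b)
assert np.allclose(solve_ridge(A, b, eta), w_ridge)
w_trunc = solve_truncated(A, b, r=4)     # rank-4 solution, for comparison
```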
Thresholding the smallest singular values is similar: by retaining only the top r singular values,

$$w_r = V\,\mathrm{diag}(\sigma_1^{-1}, \ldots, \sigma_r^{-1}, 0, \ldots, 0)\, U^\top b = \sum_{i=1}^{r} \frac{u_i^\top b}{\sigma_i}\, v_i,$$

we introduce bias, but reduce variance, because the size of $\sigma_r^{-1}$ can be controlled by the choice of $r < \mathrm{rank}(A)$. To characterize the bias-variance trade-off, we bound the difference between the true solution, $w^*$, and the approximate rank-r solution at time $t$, $w_{t,r}$. We use a similar analysis to the one used for regularized LSTD [Bertsekas, 2007, Proposition 6.3.4]. This previous bound does not easily extend, because in regularized LSTD the singular values are scaled up, maintaining the information in the singular vectors (i.e., no columns are dropped from $U$ or $V$). We bound the loss incurred by dropping singular vectors using insights from work on ill-posed systems.

The following is a simple but realistic assumption for ill-posed systems [Hansen, 1990]. The assumption states that $u_i^\top b$ shrinks faster than $\sigma_i^p$, where $p$ specifies the smoothness of the solution $w^*$ and is related to the smoothness parameter for the Hilbert space setting [Groetsch, 1984, Cor. 1.2.7].

Assumption 1: The linear system defined by $A = U \Sigma V^\top$ and $b$ satisfies the discrete Picard condition: for some $p > 1$,

$$|u_i^\top b| \le \sigma_i^{\,p} \quad \text{for } i = 1, \ldots, \mathrm{rank}(A), \qquad |u_i^\top b| \le \sigma_{\mathrm{rank}(A)}^{\,p} \quad \text{for } i = \mathrm{rank}(A)+1, \ldots, d.$$

Assumption 2: As $t \to \infty$, the sample average $A_t$ converges to the true $A$. This assumption can be satisfied under typical technical assumptions (see [Tsitsiklis and Van Roy, 1997]).

We write the SVDs as $A = U \Sigma V^\top$ and $A_t = \hat U \hat\Sigma \hat V^\top$, where, to avoid cluttered notation, we do not explicitly subscript with $t$. Further, though the singular values are unique, there is a space of equivalent singular vectors, up to sign changes and multiplication by rotation matrices. We assume that, among the space of equivalent SVDs of $A_t$, the most similar singular vectors for each singular value are chosen between $A$ and $A_t$. This avoids uniqueness issues without losing generality, because we only conceptually compare the SVDs of $A$ and $A_t$; the proof does not rely on practically obtaining this matching SVD.

Theorem 1 (Bias-variance trade-off of the rank-r approximation). Let $A_{t,r} = \hat U \hat\Sigma_r \hat V^\top$ be the approximated $A$ after $t$ samples, truncated to rank $r$, i.e., with singular values $r+1, \ldots, d$ zeroed. Let $w^* = A^\dagger b$ and $w_{t,r} = A_{t,r}^\dagger b_t$. Under Assumptions 1 and 2, the error of the rank-r weights relative to the true weights $w^*$ is bounded as follows:

$$\|w_{t,r} - w^*\|_2 \;\le\; \frac{1}{\hat\sigma_r}\,\|b_t - A_t w^*\|_2 \;+\; (d-r)\,\epsilon(t) \;+\; \underbrace{(d-r)\,\sigma_r^{\,p-1}}_{\text{bias}}$$

for a function $\epsilon : \mathbb{N} \to [0, \infty)$ with $\epsilon(t) \to 0$ as $t \to \infty$; the explicit form of $\epsilon(t)$, which depends on the differences between the singular values and singular vectors of $A_t$ and $A$, is given in the appendix.

A detailed proof is provided in an appendix [Gehring and White, 2015]. The key step is to split the error into two terms: approximation error due to a finite number of samples $t$, and bias due to the choice of $r < d$. The second part is then bounded using the discrete Picard condition, to ensure that the magnitude of $u_j^\top b$ does not dominate the error, and by adding and subtracting terms to express the error in terms of differences between $A$ and $A_t$. Because $A_t$ converges to $A$, we can see that $\epsilon(t)$ converges to zero, because the differences between the singular values and singular vectors of $A_t$ and those of $A$ converge to zero.

Remark 1: Notice that with no truncation the bias term disappears, but the first term could be very large, because $\hat\sigma_r = \hat\sigma_d$ could be very small (and often is for systems studied to date, including in the experiments below).
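Remark 1 can be illustrated numerically. The following added sketch (with arbitrarily chosen dimensions, decay rate and noise levels, none of which come from the paper) builds a system satisfying the discrete Picard condition exactly, perturbs $A$ and $b$ to mimic finite-sample estimates $A_t$ and $b_t$, and reports the error of the rank-r solution: for small r the truncation bias typically dominates, while for r close to d the tiny estimated singular values amplify the noise, as the first term of Theorem 1 suggests.

```python
import numpy as np

rng = np.random.default_rng(1)
d, p = 100, 2.0

# A system satisfying the discrete Picard condition |u_i^T b| <= sigma_i^p exactly.
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
sigma = 0.9 ** np.arange(d)                       # geometrically decaying singular values
A = U @ np.diag(sigma) @ V.T
b = U @ (sigma ** p)                              # so that u_i^T b = sigma_i^p
w_star = V @ (sigma ** (p - 1))                   # exact solution A^+ b

# Noisy estimates standing in for the finite-sample A_t and b_t.
A_t = A + 1e-4 * rng.normal(size=(d, d))
b_t = b + 1e-4 * rng.normal(size=d)

Ut, st, Vtt = np.linalg.svd(A_t)
for r in (5, 20, 50, 100):
    w_r = Vtt[:r].T @ ((Ut[:, :r].T @ b_t) / st[:r])   # rank-r truncated solution
    print(r, np.linalg.norm(w_r - w_star) / np.linalg.norm(w_star))
```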
In fact, previous work on finite-sample analysis of LSTD uses an unbiased estimate, and the bound suffers from an inverse relationship to the smallest eigenvalue of $X^\top X$ (see [Lazaric et al., 2010, Lemma 3], [Ghavamzadeh et al., 2010; Tagorti and Scherrer, 2015]). Here, we avoid such a potentially large constant in the bound at the expense of an additional bias term determined by the choice of r. Lasso-TD [Ghavamzadeh and Lazaric, 2011] similarly avoids such a dependence, using $\ell_1$ regularization; to the best of our knowledge, however, there does not yet exist an efficient incremental Lasso-TD algorithm. A future goal is to use the above bound to obtain a finite-sample bound for t-LSTD(λ), using the most up-to-date analysis by Tagorti and Scherrer [2015] and the more general techniques for linear systems introduced by Pires and Szepesvari [2012].

Remark 2: The discrete Picard condition could be relaxed to an average discrete Picard condition, where $|u_i^\top b|$ on average decays like $\sigma_i^p$, with a bound on the variance of this ratio. The assumption above, however, simplifies the analysis and much more clearly illustrates the importance of the decay of $u_i^\top b$ for obtaining stable LSTD solutions.

4 Incremental low-rank LSTD(λ) algorithm

We have shown that a low-rank approximation to $A_t$ is effective for computing the solution to LSTD from $t$ samples. However, the computational complexity of explicitly computing $A_t$ from samples and then performing an SVD is O(d³), which is not feasible for most settings. In this section, we propose an algorithm that incrementally computes a low-rank singular value decomposition of $A_t$ from samples, with significantly improved storage O(dr) and computational complexity O(dr + r³), which can be further reduced to O(dr) using mini-batches of size r.

To maintain a low-rank approximation to $A_t$ incrementally, we need to update the SVD with new samples. With each new $x_t$, we add the rank-one matrix $z_t (x_t - \gamma x_{t+1})^\top$ to $A_t$. Consequently, we can take advantage of recent advances for fast low-rank SVD updates [Brand, 2006], with some specialized computational improvements for our setting. Algorithm 1 summarizes the generic incremental update for t-LSTD, which can use mini-batches or update on each step, depending on the choice of the mini-batch size k. Due to space constraints, the detailed pseudo-code for the SVD updates is left out, but detailed code and explanations will be published online. The basics of the SVD update follow from previous work [Brand, 2006], but our implementation offers some optimizations specific to the LSTD case. By maintaining the SVD incrementally, we do not need to explicitly maintain $A_t$; therefore, storage is reduced to the size of the truncated singular vector matrices, which is O(dr).

Algorithm 1: t-LSTD(λ) using incremental SVD

    // Input: rank r and mini-batch size k
    // (with differing update-svd for k = 1 and k > 1)
    U ← [], V ← [], Σ ← 0, b ← 0, z ← 0, i ← 0, t ← 1, Z ← 0_{d×k}, D ← 0_{d×k}
    x ← the initial observation
    repeat
        Take action according to π, observe x′ and reward r
        β ← 1/(t + k)
        z ← γλ z + x
        d ← β (x − γ x′)
        Z_{:,i} ← z,  D_{:,i} ← d
        b ← (1 − β) b + β z r
        i ← i + 1
        if i ≥ k then
            // returns U, V ∈ R^{d×r} and diagonal Σ ∈ R^{r×r}
            U, Σ, V ← update-svd(U, (1 − β) Σ, V, √β Z, √β D, r)
            Z ← 0_{d×k}, D ← 0_{d×k}, i ← 0, t ← t + k
        end if
        w ← V Σ† U^⊤ b        // O(dr) time
        x ← x′
    until agent done interacting with environment

To maintain O(dr) computational complexity, matrix and vector multiplications need to be carefully ordered. For example, to compute w, first $\tilde b = U^\top b$ is computed in O(dr), then $\Sigma^\dagger \tilde b$ is computed in O(r), and finally that vector is multiplied by V in O(dr).
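The core of Algorithm 1 is the update-svd step. As a concrete illustration, here is a minimal numpy sketch of a Brand-style low-rank SVD update of the kind that step could perform; the function name, its interface, and the self-check at the end are assumptions made for this example, not the authors' released implementation.

```python
import numpy as np

def update_svd(U, S, V, Z, D, r):
    """Rank-k update of a truncated SVD, following Brand (2006) (illustrative sketch).

    Given A ~= U diag(S) V^T, return a rank-r truncated SVD of A + Z D^T.
    In Algorithm 1, the (1 - beta) rescaling is folded into S, and sqrt(beta)
    into the columns of Z and D, before calling this routine.
    """
    # Split the update into components inside and orthogonal to the current bases.
    m = U.T @ Z                          # r x k
    P, Rp = np.linalg.qr(Z - U @ m)      # d x k basis for the new left directions
    n = V.T @ D
    Q, Rq = np.linalg.qr(D - V @ n)

    # Small (r + k) x (r + k) core matrix; re-diagonalizing it costs O((r + k)^3).
    size = len(S) + Z.shape[1]
    K = np.zeros((size, size))
    K[:len(S), :len(S)] = np.diag(S)
    K += np.vstack([m, Rp]) @ np.vstack([n, Rq]).T
    Uk, Sk, Vkt = np.linalg.svd(K)

    # Rotate the extended bases and truncate back to rank r.
    U_new = np.hstack([U, P]) @ Uk[:, :r]
    V_new = np.hstack([V, Q]) @ Vkt[:r].T
    return U_new, Sk[:r], V_new

# Self-check: accumulating genuinely low-rank updates, the truncated SVD stays exact.
rng = np.random.default_rng(0)
d, true_rank, r, k = 40, 6, 8, 4
B, C = rng.normal(size=(d, true_rank)), rng.normal(size=(d, true_rank))
U, S, V = np.zeros((d, 0)), np.zeros(0), np.zeros((d, 0))
A_dense = np.zeros((d, d))
for _ in range(25):
    Z = B @ rng.normal(size=(true_rank, k))
    D = C @ rng.normal(size=(true_rank, k))
    A_dense += Z @ D.T
    U, S, V = update_svd(U, S, V, Z, D, r)
print(np.allclose(U @ np.diag(S) @ V.T, A_dense))
```

With a sketch like this, the remaining operations in Algorithm 1 stay within the stated budget: the (1 − β) down-weighting touches only the r singular values, and the solution w = V Σ† U^⊤ b is recovered in O(dr) by multiplying from the right.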
For k = 1 (update on each step), the O(r³) computation arises from a re-diagonalization and the multiplication of the resulting orthonormal matrices. For mini-batches of size k = r, we can get further computational improvements by amortizing costs across r steps, to obtain a total amortized complexity of O(dr), losing the r³ term. As an additional benefit, unlike previous incremental LSTD algorithms, we maintain normalized $A_t$ and $b_t$ by incorporating the term β. On each step, we use

$$A_{t+1} = \tfrac{1}{t+1}\left(t A_t + z_t d_t^\top\right) = (1 - \beta_t) A_t + \beta_t z_t d_t^\top$$

for $\beta_t = \tfrac{1}{t+1}$. The multiplication of $A_t$ by $1 - \beta_t$ requires only O(r) computation, because $(1 - \beta_t) U_r \Sigma_r V_r^\top = U_r \left((1 - \beta_t) \Sigma_r\right) V_r^\top$. Multiplying the full A matrix by $1 - \beta_t$, on the other hand, would require O(d²) computation, which is prohibitive. Further, $\beta_t$ can be selected to obtain a running average, as in Algorithm 1, or more generally can be set to any $\beta_t \in (0, 1)$. For example, to improve tracking, β can be chosen as a constant, to weight more recent samples more highly in the value function estimate.

5 Experiments

We empirically investigate t-LSTD for k = r in a benchmark domain and for k = 1 in an energy allocation domain.

Value function accuracy in benchmark domains: We first investigate the performance of t-LSTD in the Mountain Car benchmark. The goal in this setting is to carefully compare t-LSTD against the two extremes of TD and LSTD, and to evaluate its utility for balancing sample and computational complexity. We use two common feature representations: tile coding and radial basis function (RBF) coding. We set the policy to the commonly used energy-pumping policy, which picks actions by pushing along the current velocity. The true values are estimated by using rollouts from states chosen in a uniform 20x20 grid over the state space. The reported root mean squared error (RMSE) is computed between the estimated value functions and the rollout values. The tile coding representation has 1000 features, using 10 layers of 10x10 grids. The RBF representation has 1024 features, for a grid of 32x32 RBFs with width equal to 0.12 times the total range of the state space. We purposefully set the total number of features to be similar in both cases in order to keep the results comparable. We set the RBF widths to obtain good performance from LSTD. The other parameters (λ and step-size) are optimized for each algorithm.

Figure 1: The impact of the rank r on the RMSE between the true discounted returns and the learned value function in Mountain Car, (a) with tile coding and (b) with RBFs. We can see that large r are not necessary, with performance levelling off at r = 50. For high values of r and fewer samples, the error slightly increases, likely due to some instability with incremental updating and very small singular values.

Figure 2: RMSE between the true discounted returns and the learned value function in Mountain Car with RBFs, (a) versus samples and (b) versus runtime. For (a), we can see that with a significantly reduced r, t-LSTD can match LSTD and outperforms TD. This is the best setting for LSTD, where computation is not restricted and it can spend time processing samples. For (b), we provide the best scenario for TD, with unlimited samples. Once again, t-LSTD can almost match the performance of TD and significantly outperforms LSTD. Together, these graphs indicate that t-LSTD can balance between the two extremes. The reported results are for the best parameter settings for TD, and for r = 100 and λ = 0 for t-LSTD.
In the Mountain Car results, we use the mini-batch case with k = r and a discount of γ = 0.99. Results are averaged over 30 runs. Empirically, we observed that A has only a few large singular values, with the rest being small. This was observed in Mountain Car across a wide range of parameter choices for both tile coding and RBFs, hinting that A could be reasonably approximated with small rank. To investigate the effect of the rank in t-LSTD, we vary r and run t-LSTD on a fixed number of samples. In Figures 1 (a) and (b), we observe a graceful decay in the quality of the estimated value function as the rank is reduced, while achieving LSTD-level performance with as few as r = 50 for RBFs (d = 1024) and r = 300 for tile coding (d = 1000).

Given large enough rank and numerical precision, LSTD and t-LSTD should behave similarly. To verify this, in Figure 2 (a), we plot the learning curves of t-LSTD in the case where r is too small and the case where r is large enough, alongside LSTD and TD. As expected, we observe LSTD and t-LSTD to have near-identical learning curves for r = 100, while, for the smaller rank r = 30, we see the algorithm converge rapidly to an inferior solution. TD is less sample efficient and so converges more slowly than either.

Sample efficiency is an important property for an algorithm, but does not completely capture the needs of an engineer attempting to solve a domain. In many cases, the requirements tend to call for a balance between runtime and number of samples. In cases where a simulator is available, such as in game playing (e.g., Atari, chess, backgammon, go), samples are readily available and only computational cost matters. For this reason, we explore the performance of TD, LSTD, and t-LSTD when given unlimited data but limited CPU time. In Figure 2 (b), we plot the accuracy of the methods with respect to the computation time used. The algorithms are given access to varying amounts of samples: up to 8000 samples for TD and up to 4000 for t-LSTD and LSTD. The RMSE and time taken are monitored, after which the points are averaged to generate the plots comparing runtime to the error in the learned solution. These results show that TD, despite poor sample efficiency, outperforms LSTD for a given runtime, due to the computational efficiency of each update. This supports the trend of preferring TD over LSTD for large problems. We observe that t-LSTD achieves a runtime comparable to TD. Even though t-LSTD is computationally more costly than TD, its superior sample efficiency compensates. Furthermore, this infinite-sample-stream case is favorable to TD. In a scenario where data is obtained in real time, sacrificing sample efficiency for computational gains might leave TD idling occasionally, further reinforcing t-LSTD as a good alternative. These results indicate that t-LSTD offers an effective approach to balancing sample efficiency and computational efficiency, matching both TD and LSTD in their respective use cases, offering good performance when data is plentiful while still offering LSTD-like sample efficiency.

Value function accuracy in an energy domain: In this section, we demonstrate the performance of the fully incremental algorithm (k = 1) in a large energy allocation domain [Salas and Powell, 2013]. The focus in this experiment is to evaluate the practical utility of t-LSTD in an important application, versus realistic competitors: TD² and iLSTD. The goal of the agent in this domain is to maximize revenue and satisfy demand.
Each action vector is an allocation decision. Each state is a four-dimensional variable: the amount of energy in storage, the amount of renewable generation available, the market price of energy, and the demand to be satisfied. We use a provided near-optimal policy [Salas and Powell, 2013]. We set γ = 0.8. To approximate the value function, we use tile coding with 32 tilings, where each tiling contains 5 × 5 × 10 × 5 grids, resulting in 40,000 features; we also include a bias unit. We choose this representation because iLSTD is only computationally feasible for high-dimensional sparse representations. As before, extensive rollouts are computed from a subset of states to compute accurate estimates of the true values, which are then stored for comparison in the computation of the RMSE. Results were averaged over 30 runs.

We report results for several values of r for t-LSTD. We sweep the additional parameters in the other algorithms, including step-sizes for TD and iLSTD and m for iLSTD. We sweep a range of α₀ ∈ {2⁻¹¹, 2⁻¹⁰, 2⁻⁹, ..., 2⁻¹}, divided by the number of active features (which in this case is 26). Further, because iLSTD is unstable unless the step-size is decayed, we further sweep the decay schedule suggested by Geramifard and Bowling [2006], where N₀ is chosen from {10, 100, 1000}. To focus parameter sweeps on the step-size, which had much more effect for iLSTD, we set λ = 0.9 for all other algorithms, except for t-LSTD, for which we set λ = 1.0. We choose r ∈ {5, 20, 40, 60} and m ∈ {10, 20, 30, 40, 50}. We restrict the iLSTD parameters to a small set, since there are too many options and the optimal step-size would differ for different m values. Preliminary investigation indicated that m = 50 was large enough for iLSTD.

In this domain, under the common strategy of creating a large number of fixed features (tile coding or RBFs), t-LSTD is able to significantly take advantage of low-rank structure, learning more efficiently without incurring much computational cost. Figure 3 shows that t-LSTD performs well with a small r = 40 ≪ d = 40,000, and outperforms both TD and iLSTD. We highlight that iLSTD is one of the only practical competitors introduced for this setting: incremental learning with computational constraints. Even then, iLSTD is restrictive in that the feature representation must be sparse and its storage requirements are O(d²). Further, though it was reasonably robust to the choice of m, we found iLSTD was quite sensitive to the choice of step-size parameter. In fact, without a careful decay, we still encountered divergence issues.

The goal here was to investigate the performance of the simplest version of t-LSTD, with the fewest parameters and without optimizing thresholds, which were kept fixed at reasonable heuristics across all experiments. This choice does impact the learning curves of t-LSTD. For example, though t-LSTD has significantly faster early convergence, it is less smooth than either TD or iLSTD. This lack of smoothness could be due to not optimizing these parameters, and further because w_t is re-solved on each step. Beyond this vanilla implementation of t-LSTD, there are clear avenues to explore to more smoothly update w_t with the low-rank approximation to A. Nonetheless, even in its simplest form, t-LSTD provides an attractive alternative to TD, obtaining sample efficiency improvements without much additional computation and without the need to tune a step-size parameter.

² We also compared to true-online TD [van Seijen and Sutton, 2014], but it gave very similar performance; we therefore omit it.

Figure 3: RMSE of the value function in the energy allocation domain: (a) RMSE vs samples, rank comparison; (b) RMSE vs samples, algorithm comparison; (c) RMSE vs runtime. (a) The performance of t-LSTD improves as the rank increases; however, even for small r = 5, the algorithm still converges, with some bias. For r smaller than 5, the error was significantly worse. (b) With r = 40, t-LSTD converges to almost the same level as TD in significantly fewer steps. The best parameters are chosen for each algorithm: m = 50 for iLSTD and r = 40 for t-LSTD. (c) As before, we plot RMSE versus runtime, but now by selecting a scenario in between the extremes plotted in Figure 2. The number of samples per second is restricted to 25, meaning TD is sometimes idle waiting for more samples, while iLSTD (m = 50) and t-LSTD (r = 40) can be too slow to process all the samples. This plot further indicates the advantages of t-LSTD, which is more sample efficient than TD, scales better than iLSTD, and converges to a better solution.

6 Discussion and conclusion

This paper introduced an efficient value function approximation algorithm, called t-LSTD(λ), that maintains an incremental truncated singular value decomposition of the LSTD matrix. We systematically explored the validity of using low-rank approximations for LSTD, first by proving a simulation error bound for truncated low-rank LSTD solutions, and then, empirically, by examining an incremental truncated LSTD algorithm in two domains. We demonstrated the performance of t-LSTD in the benchmark domain, Mountain Car, exploring runtime properties and the effect of the small-rank approximation, and in a high-dimensional energy allocation domain, illustrating that t-LSTD enables a nice interpolation between the properties of TD and LSTD, and outperforms iLSTD.

There are several potential benefits of t-LSTD that we did not yet explore in this preliminary investigation. First, there are clear advantages to t-LSTD for tracking and control. As mentioned above, unlike previous LSTD algorithms, past samples for t-LSTD can be efficiently down-weighted with a β_t ∈ (0, 1). By enabling down-weighting, A_t is more strongly influenced by recent samples and so can better adapt in a non-stationary environment, such as for control. Another interesting avenue is to take advantage of t-LSTD for early learning, to improve sample efficiency, and then switch to TD to converge to an unbiased solution. Even for systems highly constrained in terms of storage and computation, an aggressively small r can still be useful for early learning. Further empirical investigation could give insight into when this switch could occur, depending on the choice of r. Finally, an important avenue for this new approach is to investigate the convergence properties of truncated incremental SVDs. The algorithm derivation requires only simple algebra and is clearly sound; however, to the best of our knowledge, the question of convergence under numerical stability and truncation of non-zero singular values remains open. The truncated incremental SVD has been shown to be practically useful on numerous occasions, such as for principal component analysis and partial least squares [Arora et al., 2012]. Moreover, there are some informal arguments (using randomized matrix theory) that even under truncation the SVD will re-orient [Brand, 2006].
This open question is an important next step in understanding t-LSTD and, more generally, incremental singular value decomposition algorithms for reinforcement learning.

References

[Arora et al., 2012] R. Arora, A. Cotter, K. Livescu, and N. Srebro. Stochastic optimization for PCA and PLS. In Annual Allerton Conference on Communication, Control, and Computing, 2012.
[Bertsekas, 2007] D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific Press, 2007.
[Bowling and Geramifard, 2008] M. Bowling and A. Geramifard. Sigma point policy iteration. In International Conference on Autonomous Agents and Multiagent Systems, 2008.
[Boyan, 1999] J. A. Boyan. Least-squares temporal difference learning. In International Conference on Machine Learning, 1999.
[Bradtke and Barto, 1996] Steven J. Bradtke and Andrew G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 1996.
[Brand, 2006] Matthew Brand. Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications, 2006.
[Eckart and Young, 1936] Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1936.
[Farahmand et al., 2008] A. M. Farahmand, M. Ghavamzadeh, and C. Szepesvari. Regularized policy iteration. In Advances in Neural Information Processing Systems, 2008.
[Farahmand, 2011] A. Farahmand. Regularization in Reinforcement Learning. PhD thesis, University of Alberta, 2011.
[Gehring and White, 2015] Clement Gehring and Martha White. Incremental truncated LSTD. CoRR, abs/1511.08495, 2015.
[Geramifard and Bowling, 2006] A. Geramifard and M. Bowling. Incremental least-squares temporal difference learning. In AAAI Conference on Artificial Intelligence, 2006.
[Geramifard et al., 2007] A. Geramifard, M. Bowling, and M. Zinkevich. iLSTD: Eligibility traces and convergence analysis. In Advances in Neural Information Processing Systems, 2007.
[Ghavamzadeh and Lazaric, 2011] M. Ghavamzadeh and A. Lazaric. Finite-sample analysis of Lasso-TD. In International Conference on Machine Learning, 2011.
[Ghavamzadeh et al., 2010] M. Ghavamzadeh, A. Lazaric, O. A. Maillard, and R. Munos. LSTD with random projections. In Advances in Neural Information Processing Systems, 2010.
[Groetsch, 1984] C. W. Groetsch. The Theory of Tikhonov Regularization for Fredholm Equations of the First Kind. Pitman Advanced Publishing Program, 1984.
[Hansen, 1986] P. C. Hansen. The truncated SVD as a method for regularization. BIT Numerical Mathematics, 1986.
[Hansen, 1990] Per Christian Hansen. The discrete Picard condition for discrete ill-posed problems. BIT Numerical Mathematics, 1990.
[Keller et al., 2006] P. W. Keller, S. Mannor, and D. Precup. Automatic basis function construction for approximate dynamic programming and reinforcement learning. In International Conference on Machine Learning, 2006.
[Kolter and Ng, 2009] J. Z. Kolter and A. Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In International Conference on Machine Learning, 2009.
[Lazaric et al., 2010] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite sample analysis of LSTD. In International Conference on Machine Learning, 2010.
[Lin, 1993] Long-Ji Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, CMU, 1993.
[Mirsky, 1960] L. Mirsky. Symmetric gauge functions and unitarily invariant norms. Quarterly Journal of Mathematics, 1960.
[Pires and Szepesvari, 2012] Bernardo Avila Pires and Csaba Szepesvari. Statistical linear estimation with penalized estimators: an application to reinforcement learning. In International Conference on Machine Learning, 2012.
[Prashanth et al., 2013] L. A. Prashanth, Nathaniel Korda, and Remi Munos. Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. arXiv, 2013.
[Salas and Powell, 2013] D. F. Salas and W. B. Powell. Benchmarking a Scalable Approximate Dynamic Programming Algorithm for Stochastic Control of Multidimensional Energy Storage Problems. Dept. of Operations Research and Financial Engineering, 2013.
[Sutton, 1988] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.
[Tagorti and Scherrer, 2015] Manel Tagorti and Bruno Scherrer. On the rate of convergence and error bounds for LSTD(λ). In International Conference on Machine Learning, 2015.
[Tsitsiklis and Van Roy, 1997] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 1997.
[van Seijen and Sutton, 2014] Harm van Seijen and Rich Sutton. True online TD(λ). In International Conference on Machine Learning, 2014.
[van Seijen and Sutton, 2015] H. van Seijen and R. S. Sutton. A deeper look at planning as learning from replay. In International Conference on Machine Learning, 2015.