# Limiting Extrapolation in Linear Approximate Value Iteration

Andrea Zanette, Institute for Computational and Mathematical Engineering, Stanford University, CA (zanette@stanford.edu)
Alessandro Lazaric, Facebook AI Research (lazaric@fb.com)
Mykel J. Kochenderfer, Department of Aeronautics and Astronautics, Stanford University, CA (mykel@stanford.edu)
Emma Brunskill, Department of Computer Science, Stanford University, CA (ebrun@cs.stanford.edu)

We study linear approximate value iteration (LAVI) with a generative model. While linear models may accurately represent the optimal value function using a few parameters, several empirical and theoretical studies show that the combination of least-squares projection with the Bellman operator may be expansive, thus leading LAVI to amplify errors over iterations and eventually diverge. We introduce an algorithm that approximates value functions by combining Q-values estimated at a set of anchor states. Our algorithm tries to balance the generalization and compactness of linear methods with the small amplification of errors typical of interpolation methods. We prove that if the features at any state can be represented as a convex combination of features at the anchor points, then errors are propagated linearly over iterations (instead of exponentially) and our method achieves a polynomial sample complexity bound in the horizon and the number of anchor points. These findings are confirmed in preliminary simulations on a number of simple problems where a traditional least-squares LAVI method diverges.

1 Introduction

Impressive empirical successes [Mni+13; Sil+16; Sil+17] in using deep neural networks in reinforcement learning (RL) often rely on sample-inefficient algorithms. Despite recent advances in the theoretical analysis of value-based batch RL with function approximation [MS08; ASM08; FSM10; YXW19; CJ19], designing provably sample-efficient approximate RL algorithms with function approximation remains an open challenge. In this paper, we study value iteration with linear approximation (LAVI for short). Linear function approximators represent action-value functions as the inner product between a weight vector $w$ and a $d$-dimensional feature map $\phi$ evaluated at each state-action pair, i.e., $\hat{Q}(s,a) = w^\top \phi(s,a)$. Linear models are common and powerful because they compactly represent functions with a small number of parameters, and therefore hold promise for requiring a small sample size to learn such functions. Unfortunately, it is well known that the Bellman operator combined with the projection onto a linear space in, e.g., $\ell_2$-norm, may result in an expansive operator. As a result, even when the features are expressive enough that the optimal state-action value function $Q^\star$ can be accurately represented (i.e., $Q^\star(s,a) \approx (w^\star)^\top \phi(s,a)$), combining linear function approximation with value iteration may lead to divergence [Bai95; TV96]. Munos [Mun05] derived bounds on the error propagation for general approximate value iteration (AVI), and later Munos and Szepesvári [MS08] proved finite-sample guarantees for fitted value iteration with a generative model, while sharper results can be found in [FSM10]. A key issue in AVI is that errors at one iteration may be amplified through the application of the Bellman operator and projection.
In the analysis of Munos and Szepesvári [MS08], this effect is captured by the inherent Bellman error, which measures how well the image through the Bellman operator of any function in the approximation space can be approximated within the space itself. Whenever the inherent Bellman error is unbounded, AVI may diverge.

In contrast to the amplification of errors of linear value function approximation, averagers [Gor95], such as barycentric interpolators [MM99], nearest-neighbors, and kernels [OS02], can reduce how errors are propagated through iterations. Averagers represent the value function at a state-action pair as an interpolation of its values at a finite set of anchor points. By interpolating instead of extrapolating, the function approximator is guaranteed to be a non-expansion in $\ell_\infty$-norm, and therefore the Bellman backup remains a contraction even after the projection onto the approximation space. Unfortunately, the number of anchor points needed to accurately represent the value function, and thus the number of parameters to learn, may scale exponentially with the input state dimension.

In this paper, we explore a new function approximator that tries to balance the compactness and generalization of linear methods, leading to sample efficiency at each iteration, while constraining the resulting expansion, as in averagers, thus providing a small amplification factor over iterations. Our algorithm estimates the Q-values at a set of anchor points and predicts the function at any other point by taking a combination of those values, while using a linear representation. We show that whenever the features generate a convex set, it is possible to avoid any error amplification and achieve a sample complexity that is polynomial in the number of anchor points and in the horizon. A related convexity assumption has been very recently used by Yang and Wang [YW19] to obtain the first algorithm with near-optimal sample complexity. Nonetheless, their result holds when the transition model $p$ admits a non-negative low-rank factorization in $\phi$, which also corresponds to a zero inherent Bellman error. In our analysis, we consider the far more general setting where the optimal state-action value function can be accurately approximated with a linear set of features. Note that this can be true even if the transition model does not admit a low-rank decomposition, as we illustrate in our simulation results. Furthermore, our result holds even when the inherent Bellman error is infinite. Unlike [YW19], we also provide a thorough discussion on how to select anchor points and propose a heuristic procedure to automatically create them. In our simulations we show that small levels of amplification can be achieved, and that our algorithm can effectively mitigate the divergence observed in some simple MDPs for least-squares AVI. This happens even when using identical feature representations, highlighting the benefit of bounding extrapolation by representing features as near-convex combinations of anchor features (versus fitting with $\ell_2$ or other common losses). Furthermore, we empirically show that small amplification factors can be obtained with relatively small sets of anchor points. We believe this work provides a first step towards designing sample-efficient algorithms that effectively balance per-iteration generalization and sample complexity against the amplification of errors through iterations for general linear action-value function solvers.
2 Preliminaries

We consider a fixed-horizon MDP $M = \langle \mathcal{S}, \mathcal{A}, p, r, H, \mu \rangle$ defined by a continuous state space $\mathcal{S}$, a discrete action space $\mathcal{A}$, a horizon $H$, an initial state distribution $\mu$, a transition model $p(s,a)$, and a reward model $r(s,a)$. We also denote by $R(s,a)$ the random reward, with expected value $r(s,a)$. A deterministic policy $\pi_t(s)$ is a mapping from a state and timestep to an action. The Q-value $Q^\pi_t(s,a)$ of a policy $\pi$ at a state-action-timestep $(s,a,t)$ is the expected return after taking action $a$ in $s$ at timestep $t$ and following $\pi$ afterwards, and $V^\pi_t(s) = Q^\pi_t(s, \pi_t(s))$. An optimal policy $\pi^\star$ maximizes the value function at any state and timestep, i.e., $\pi^\star_t = \arg\max_\pi V^\pi_t$. We use $V^\star_t$ and $Q^\star_t$ to denote the value functions corresponding to an optimal policy $\pi^\star$. We consider the so-called generative model setting, where $p$ and $r$ are unknown but a simulator can be queried at any state-action pair $(s,a)$ to obtain samples $s' \sim p(\cdot \mid s,a)$ and $R(s,a)$. As the generation of each sample may be expensive, the overall objective is to compute a near-optimal policy with as few samples as possible.

Approximate dynamic programming algorithms replace $p$ and $r$ with a finite number of simulator samples and can be used for high-dimensional or continuous spaces. Approximate value iteration (AVI), closely related to fitted value iteration, takes as input a regression algorithm $F$ and proceeds backward from horizon $H$ to $1$. At each timestep $t$, given the approximation $\hat{Q}^\star_{t+1}$, it queries the simulator $n$ times and obtains a set of tuples $\{(s_i, a_i, r_i, s'_i)\}_{i=1}^{n}$, used to construct a regression dataset $D_t = \{((s_i,a_i), y_i)\}_{i=1}^{n}$ with $y_i = r_i + \max_a \hat{Q}^\star_{t+1}(s'_i, a)$. AVI then computes $\hat{Q}^\star_t = F(D_t)$, returns the approximate optimal policy $\hat{\pi}^\star_t(s) = \arg\max_a \hat{Q}^\star_t(s,a)$, and proceeds to timestep $t-1$. A popular instance of AVI is to use linear regression to approximate Q-functions. We refer to this general scheme as linear AVI (LAVI). Let $\phi_t : \mathcal{S}\times\mathcal{A} \to \mathbb{R}^d$ be a feature mapping for timestep $t$. We define $\Phi_t = \{\phi \in \mathbb{R}^d : \exists\, s \in \mathcal{S},\, a \in \mathcal{A}_{t,s},\ \phi_t(s,a) = \phi\}$ as the subset of $\mathbb{R}^d$ obtained by evaluating $\phi_t$ at any state-action pair $(s,a)$. Any approximate action-value function $\hat{Q}_t$ is represented as a linear combination of weights $\hat{w}_t \in \mathbb{R}^d$ and features $\phi_t$ as $\hat{Q}_t(s,a) = \hat{w}_t^\top \phi_t(s,a)$, where $\hat{w}_t$ is usually computed by minimizing the $\ell_2$-loss on the dataset $D_t$. Linear function approximation requires only $O(d/\epsilon^2)$ samples to achieve an $\epsilon$ estimation error, independently of the size of $\mathcal{S}$ and $\mathcal{A}$. Nonetheless, at each timestep $t$ the combination of the $\ell_2$-loss minimization (i.e., $F$) with the application of the Bellman operator to the function computed at timestep $t+1$ may correspond to an expansive operation. In this case, errors at each iteration may be amplified and eventually lead LAVI to diverge.

3 Linear Approximate Value Iteration with Extrapolation Reduction

We introduce IER (Interpolation for Extrapolation Reduction), a novel approximation algorithm that interpolates Q-values at a set of anchor points. We study its prediction error and we analyze the sample complexity of the LAVI scheme obtained by executing IER backward from $H$ to $1$. At each timestep $t$, IER receives as input an estimate $\hat{Q}^\star_{t+1}$ of the action-value function at timestep $t+1$, the feature map $\phi_t$, and a set $\mathcal{K}_t \subseteq \mathcal{S}\times\mathcal{A}$ of $K_t$ anchor state-action pairs. IER first estimates $Q^\star_t(s_i,a_i)$ at each anchor point $(s_i,a_i) \in \mathcal{K}_t$ by repeatedly sampling from the simulator and using the approximation $\hat{Q}^\star_{t+1}$ to compute the backup values.
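For illustration, this first step can be sketched in Python as follows (our own sketch, not the authors' implementation): it assumes a generative-model interface `simulator(s, a)` returning a sampled `(reward, next_state)` pair and a callable `V_next(s)` implementing $\max_{a'} \hat{Q}^\star_{t+1}(s, a')$; both names are placeholders.

```python
import numpy as np

def estimate_anchor_values(anchor_points, simulator, V_next, n_supp):
    """Monte-Carlo estimates of Q*_t at the anchor points in K_t.

    anchor_points : list of (s, a) pairs (the set K_t)
    simulator     : simulator(s, a) -> (reward, next_state), one generative-model call
    V_next        : V_next(s) = max_a Qhat*_{t+1}(s, a), the estimate from the previous iteration
    n_supp        : number of samples per anchor point
    """
    values = np.zeros(len(anchor_points))
    for i, (s, a) in enumerate(anchor_points):
        backups = []
        for _ in range(n_supp):
            r, s_next = simulator(s, a)          # sample from the generative model
            backups.append(r + V_next(s_next))   # one-step Bellman backup
        values[i] = np.mean(backups)             # estimated anchor value Qhat*_{t,i}
    return values
```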
Formally, we define the anchor values as
$$\hat{Q}^\star_{t,i} = \frac{1}{n_{\text{supp}}} \sum_{j=1}^{n_{\text{supp}}} \Big( r_j + \hat{V}^\star_{t+1}(s'_j) \Big), \qquad \text{with } \hat{V}^\star_{t+1}(s') = \max_{a'} \hat{Q}^\star_{t+1}(s', a'), \qquad (1)$$
where the pairs $(r_j, s'_j)$ are the samples generated from the generative model at $(s_i,a_i)$ and $n_{\text{supp}}$ is the budget at each anchor point. Given these estimates, the approximation $\hat{Q}^\star_t(s,a)$ returned by IER at any state-action pair $(s,a)$ is obtained as a linear combination of the $\hat{Q}^\star_{t,i}$ values,
$$\hat{Q}^\star_t(s,a) = \sum_{i=1}^{K_t} \lambda^{\phi_t(s,a)}_i\, \hat{Q}^\star_{t,i}, \qquad (2)$$
where the interpolation vector $\lambda^{\phi_t(s,a)}_t \in \mathbb{R}^{K_t}$ is the solution to the optimization problem
$$\min_{\lambda^{\phi_t(s,a)}_t}\ \big\|\lambda^{\phi_t(s,a)}_t\big\|_1 \quad \text{subject to} \quad \phi_t(s,a) = \sum_{i=1}^{K_t} \lambda^{\phi_t(s,a)}_i\, \phi_t(s_i,a_i). \qquad (3)$$
As long as the image of the anchor points $\{\phi_t(s_i,a_i)\}_{i=1}^{K_t}$ spans $\mathbb{R}^d$, (3) admits a solution. This is a linear optimization program with linear constraints and it can be solved efficiently using standard techniques [BV04; NW06]. Notice that the weights $\lambda^{\phi_t(s,a)}_t$ change with $(s,a)$ and no positiveness constraint is enforced.
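The $\ell_1$-minimization in (3) can be cast as a standard linear program by splitting $\lambda$ into positive and negative parts. The following is a minimal Python sketch of this step (our own illustration using `scipy.optimize.linprog`, not the authors' implementation; the helper names are placeholders), together with the maximum $\ell_1$-norm of the weights over a finite set of candidate features, a quantity that plays a central role in the next subsection.

```python
import numpy as np
from scipy.optimize import linprog

def interpolation_weights(phi_query, anchor_features):
    """Solve program (3): min ||lam||_1  s.t.  sum_i lam_i * phi_t(s_i, a_i) = phi_t(s, a).

    phi_query       : d-dimensional feature of the query pair (s, a)
    anchor_features : (K, d) array, one row per anchor point
    """
    K, _ = anchor_features.shape
    A = anchor_features.T                      # d x K, columns are anchor features
    # Split lam = u - v with u, v >= 0; at the optimum sum(u + v) = ||lam||_1.
    c = np.ones(2 * K)
    A_eq = np.hstack([A, -A])                  # d x 2K equality constraints
    res = linprog(c, A_eq=A_eq, b_eq=phi_query, bounds=(0, None), method="highs")
    if not res.success:                        # query feature outside the span of the anchors
        raise ValueError("program (3) is infeasible for this query feature")
    return res.x[:K] - res.x[K:]               # interpolation vector lam

def max_l1_weight(candidate_features, anchor_features):
    """max ||lam^{phi_t(s,a)}||_1 over a finite set of candidate features
    (an empirical proxy for the amplification factor discussed in Sect. 3.1)."""
    return max(np.abs(interpolation_weights(phi, anchor_features)).sum()
               for phi in candidate_features)
```

At the optimum of the reformulated LP, for each coordinate at most one of $u_i, v_i$ is nonzero, so the objective value equals $\|\lambda\|_1$ and the reformulation is exact.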
3.1 Prediction Error and Sample Complexity of IER

In most problems, the optimal action-value function $Q^\star_t$ cannot be exactly represented by a low-dimensional inner product $w_t^\top \phi_t(\cdot,\cdot)$. The best approximator that can be expressed with the features $\phi_t$ and its associated approximation error are defined as
$$w^\star_t = \arg\min_w \big\| w^\top \phi_t(\cdot,\cdot) - Q^\star_t(\cdot,\cdot) \big\|_\infty, \qquad \epsilon^{\text{app}}_t = \min_w \big\| w^\top \phi_t(\cdot,\cdot) - Q^\star_t(\cdot,\cdot) \big\|_\infty, \qquad (4)$$
where $\|\cdot\|_\infty$ denotes the infinity norm, i.e., the maximum over state-action pairs in $\mathcal{S}\times\mathcal{A}$. Standard linear function approximation methods rather minimize the $\ell_2$-norm (i.e., least-squares) or a regularized version of it. We are interested in studying whether IER approaches the performance of $w^\star_t$. Before analyzing IER, we focus on its exact counterpart. We introduce $\widetilde{Q}^\star_t(s,a)$ as the interpolator obtained by combining the exact $Q^\star$-function evaluated at the anchor points,
$$\widetilde{Q}^\star_t(s,a) = \sum_{i=1}^{K_t} \lambda^{\phi_t(s,a)}_i\, Q^\star_t(s_i,a_i), \qquad (5)$$
where the vector $\lambda^{\phi_t(s,a)}$ is the solution of (3). We prove the following.

Lemma 1 (Error Bounds of $\widetilde{Q}^\star_t$). Let $\epsilon^{\text{app}}_t$ be the approximation error of the best linear model (Eq. 4). If $\epsilon^{\text{app}}_t = 0$, i.e., $Q^\star_t(s,a) = (w^\star_t)^\top \phi_t(s,a)$, then $\widetilde{Q}^\star_t(s,a) = (w^\star_t)^\top \phi_t(s,a)$. Otherwise the (exact) interpolator in Eq. 5 has an error
$$\max_{(s,a)\in\mathcal{S}\times\mathcal{A}} \big| \widetilde{Q}^\star_t(s,a) - Q^\star_t(s,a) \big| \le (1 + C_t)\,\epsilon^{\text{app}}_t, \qquad \text{where } C_t \stackrel{\text{def}}{=} \max_{(s,a)\in\mathcal{S}\times\mathcal{A}} \big\|\lambda^{\phi_t(s,a)}_t\big\|_1$$
is the amplification factor.

This result shows that the interpolation in (5) preserves the linearity of the model whenever the function evaluated at the anchor points is itself linear. Furthermore, the prediction error is at most a factor $(1+C_t)$ larger than that of the best approximator. The optimization program (3) plays a crucial role in obtaining both results. In particular, the constraint ensures that the linear structure is preserved, while the minimization over $\lambda^{\phi_t(s,a)}_t$ aims at controlling the amplification factor $C_t$. We now study the sample complexity of IER at timestep $t$ when an approximation of the optimal value function $V^\star_{t+1}$ at timestep $t+1$ is available (the proof and the definition of $\delta'$ are postponed to the supplementary material).

Lemma 2. Let $\epsilon^{\text{app}}_t$ be the error of the best linear model at timestep $t$ and $\hat{V}^\star_{t+1}$ be the approximation of $V^\star_{t+1}$ used in estimating the values at the anchor points in Eq. 1. Let $\epsilon^{\text{bias}}_{t+1} = \|\hat{V}^\star_{t+1} - V^\star_{t+1}\|_\infty$ be the prediction error of $\hat{V}^\star_{t+1}$. If IER is run with $K_t$ anchor points, then the prediction error of $\hat{Q}^\star_t$ is bounded as
$$\big\|\hat{Q}^\star_t - Q^\star_t\big\|_\infty \le \underbrace{\epsilon^{\text{est}}_t + (1 + C_t)\,\epsilon^{\text{app}}_t}_{\text{errors at timestep } t} + \underbrace{C_t\, \epsilon^{\text{bias}}_{t+1}}_{\text{propagation error}}$$
with probability at least $1 - \delta/H$, as long as $n_{\text{supp}} \ge \ln(2/\delta')/\big(2 (\epsilon^{\text{est}}_t)^2\big)$.

Lemma 2 shows that the prediction error of IER is bounded by three main components: an estimation error $\epsilon^{\text{est}}_t$ due to the noise in estimating the Q-values $\hat{Q}^\star_{t,i}$ at the anchor points, an approximation error $(1+C_t)\,\epsilon^{\text{app}}_t$ due to the linear model defined by the features $\phi_t$, and a propagation error $C_t\, \epsilon^{\text{bias}}_{t+1}$ due to the prediction error of $\hat{V}^\star_{t+1}$ at timestep $t+1$. The key result of this lemma is to illustrate how $C_t$ not only impacts the approximation error as in Lem. 1, but also determines how the errors of $\hat{V}^\star_{t+1}$ propagate from timestep $t+1$ to $t$. While for a standard least-squares method $C_t$ may be much larger than one, the approximator (2) with the interpolation vector obtained from (3) aims at minimizing the extrapolation and lowering $C_t$ as much as possible, while preserving the linearity of the representation. As discussed in Sect. 4, a suitable choice of the anchor points may significantly reduce the amplification factor by leveraging the additional degrees of freedom offered by choosing $K_t$ larger than $d$. In general, we may expect that the larger $K_t$, the smaller $C_t$. Nonetheless, the overall sample complexity of IER increases as $K_t n_{\text{supp}}$. This shows the need to trade off the number of anchor points (hence possibly higher variance) in exchange for better control on how errors get amplified. In this sense, Lem. 2 reveals a critical extrapolation-variance trade-off.

3.2 Sample Complexity of LAVIER

We analyze LAVIER (Linear Approximate Value Iteration with Extrapolation Reduction), obtained by running IER backward from timestep $H$ to $1$, and we derive a sample complexity upper bound to achieve a near-optimal policy. Under the assumption of bounded value functions $V^\star_t(s) \in [0,1]$ and bounded immediate reward random variables $R(s,a) \in [0,1]$, we obtain the following result.[1]

[1] This assumption is inspired by [JA18], who suggested this is a more expressive framework, as it allows some rewards to be substantially larger than others in terms of contributing to the final value function.

Theorem 1. Let $C_t \le C$ and $\epsilon^{\text{app}}_t \le \epsilon^{\text{app}}$ for all $t = 1,\dots,H$. If LAVIER is run with failure probability $\delta > 0$, precision $\epsilon > 0$, constant $\bar{C} \ge C^H$, and $n_{\text{tot}} \ge K H^5 \bar{C}^2 \ln(2KH/\delta)/\epsilon^2$ samples, then with probability $1-\delta$ LAVIER returns a policy $\hat{\pi}^\star$ such that
$$V^\star_1(s_0) - V^{\hat{\pi}^\star}_1(s_0) \le \underbrace{\epsilon}_{\text{est. error}} + \underbrace{O\!\big(H^2 \bar{C}\, \epsilon^{\text{app}}\big)}_{\text{app. error}}.$$

Algorithm 1 (LAVIER). Input: failure probability $\delta$, accuracy $\epsilon$, sets of anchor points $\{\mathcal{K}_t\}_{t=1,\dots,H}$, time horizon $H$, total amplification constant $\bar{C}$.
Set $\delta' = \delta/\big(\sum_{t=1}^{H} K_t\big)$, choose the per-anchor budget $n_{\text{supp}}$, and initialize $\hat{Q}^\star_{H+1}(\cdot) = 0$ (zero predictor at terminal states).
For $t = H$ down to $1$: call IER with parameters $(n_{\text{supp}}, \mathcal{K}_t, \hat{Q}^\star_{t+1}(\cdot))$ and obtain $\hat{Q}^\star_t(\cdot)$.
Return the policy $\hat{\pi}^\star_t(s) = \arg\max_{a\in\mathcal{A}_{t,s}} \hat{Q}^\star_t(\phi_t(s,a))$.

This bound decomposes the prediction error into two components: an estimation error due to the noise in the samples and an approximation error due to the features $\{\phi_t\}_t$ and the target functions $\{Q^\star_t\}_t$. Thm. 1 illustrates the impact of the amplification factor on the overall sample complexity and final error. If $C > 1$, then $C^H$ grows exponentially with the horizon. Furthermore, the error $\epsilon^{\text{app}}$ itself is amplified by $\bar{C}$, thus leading to an approximation error scaling exponentially with $H$. This result is not unexpected, as it confirms previous negative results showing how the extrapolation typical of linear models may lead the error to diverge over iterations [Bai95; TV97]. Nonetheless, if the amplification constant satisfies $C \le 1 + \tfrac{1}{H}$, then $C^H \le (1+\tfrac{1}{H})^H \le e$, which gives a polynomial sample complexity bound of order $O(KH^5/\epsilon^2)$ and a final error where the approximation error is only amplified by $H^2$.
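For concreteness, here is a minimal Python sketch of the backward loop in Algorithm 1, reusing the `estimate_anchor_values` and `interpolation_weights` helpers sketched earlier; the `simulator` and `actions` interfaces are placeholder assumptions rather than the authors' implementation.

```python
import numpy as np

def lavier(H, anchor_sets, feature_maps, actions, simulator, n_supp):
    """Backward LAVI loop of Algorithm 1 (sketch).

    anchor_sets[t]  : list of anchor (s, a) pairs K_t, for t = 1, ..., H
    feature_maps[t] : callable phi_t(s, a) -> numpy array of size d
    actions(s)      : finite set of actions available at state s
    simulator(s, a) : one generative-model call, returns (reward, next_state)
    """
    anchor_values = {}                                   # Qhat*_{t,i} for each timestep

    def Q_hat(t, s, a):
        if t > H:                                        # zero predictor beyond the horizon
            return 0.0
        anchor_feats = np.array([feature_maps[t](si, ai) for si, ai in anchor_sets[t]])
        lam = interpolation_weights(feature_maps[t](s, a), anchor_feats)   # program (3)
        return float(lam @ anchor_values[t])                               # Eq. (2)

    def V_hat(t, s):
        return max(Q_hat(t, s, a) for a in actions(s))

    for t in range(H, 0, -1):                            # t = H, H-1, ..., 1
        anchor_values[t] = estimate_anchor_values(
            anchor_sets[t], simulator, lambda s: V_hat(t + 1, s), n_supp)

    # greedy policy w.r.t. the estimated action-value functions
    return lambda t, s: max(actions(s), key=lambda a: Q_hat(t, s, a))
```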
While the condition $C \le 1 + 1/H$ does remove the divergence problem, it may still lead to a sample-inefficient algorithm. In fact, in order to achieve $C \approx 1$, we may need to take $K$ very large. This raises the fundamental question of whether low amplification error and low sample complexity can be obtained at the same time. In the next section, we first discuss how anchor points with small amplification $C$ can be efficiently constructed, while in Sect. 5 we empirically show how in some scenarios this can be achieved with a small number of anchor points $K$ and thus low sample complexity. Finally, we notice that when the features are chosen to be averagers, the interpolation scheme corresponds to a convex combination of anchor weights, thus corresponding to $C = 1$. As a result, Thm. 1 is also a sample complexity result for averagers [Gor95].

4 Anchor Points and Amplification Factor

While averagers attain $C = 1$, in general they may not generalize as well as linear models. Furthermore, averagers usually have poor sample complexity, as they may require a number of samples scaling exponentially with the dimension of the state-action space [see, e.g., Thm. 3 in OS02]. The aim of the minimization program (3) is to trade off the generalization capacity of linear models and their extrapolation, without compromising the overall sample complexity. The process of constructing a good set of anchor points can be seen as a form of experimental design. While in experimental optimal design the objective is to find a small number of points such that least-squares achieves a small prediction error, here the objective is to construct a set $\mathcal{K}_t$ such that the amplification factor $C_t$ is small. We have the following result.

Proposition 1. Let $\Phi(\mathcal{K}_t) = \{\phi_t(s_i,a_i) : (s_i,a_i)\in\mathcal{K}_t\}$ be the image of the anchor points through $\phi_t$. If the convex hull of $\Phi(\mathcal{K}_t)$ contains all the features in $\Phi_t$, i.e.,
$$\Phi_t \subseteq \Big\{\phi\in\mathbb{R}^d : \exists\, \lambda\in\mathbb{R}^{K_t},\ \phi = \sum_i \lambda_i\, \phi_t(s_i,a_i),\ \text{with } \lambda_i \ge 0 \text{ and } \textstyle\sum_i \lambda_i = 1\Big\},$$
then the amplification factor is $C_t \le 1$.

Under the condition of Prop. 1, prediction errors propagate linearly through timesteps. In general, it is not possible to provide a bound on $K_t$, as the number of anchor points needed to construct a convex hull containing $\Phi_t$ may vary widely depending on the structure of $\Phi_t$.[2]

[2] For instance, if $\Phi_t$ is a polyhedron in $\mathbb{R}^d$, $K_t$ may be as large as exponential in $d$.

If the convex hull is not known or it contains too many features, an approximate convex hull can be found by standard techniques, for example [GO17; Blu+17] or [SV16; HA13], and can still provably yield a linear propagation of the error if it is of sufficient quality (i.e., $C_t < 1 + 1/H$). Importantly, finding an approximate convex hull can be performed offline without accessing the generative model, as it only requires access to the mapping $\phi_t(\cdot,\cdot)$. Finally, since the algorithm solves the optimization program (3) during the learning phase (to compute the backup $\hat{V}^\star_{t+1}(s')$ at the sampled next state $s'$), the actual value of $\|\lambda^{\phi_t(s,a)}\|_1$ is computed, and therefore the algorithm can identify whether significant extrapolation is taking place and whether the set of anchor points $\mathcal{K}_t$ may need to be increased or adjusted. While we defer the adaptive construction of approximate convex hulls to future work, we propose a simple greedy heuristic, sketched below and detailed in the next paragraph, to construct a good set of anchor points before the learning process.
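A minimal Python sketch of this greedy construction follows, reusing `interpolation_weights` from the earlier sketch; the finite candidate set used to probe extrapolation, the arbitrary starting pair, and the `max_anchors` safeguard are our own illustrative choices rather than part of the authors' procedure.

```python
import numpy as np

def greedy_anchor_set(candidates, phi, C_target, max_anchors=200):
    """Greedily grow K_t until the (empirical) amplification factor is below C_target.

    candidates : finite list of (s, a) pairs used to probe extrapolation
    phi        : feature map phi_t(s, a) -> numpy array of size d
    C_target   : target amplification factor Cbar
    """
    anchors = [candidates[0]]                            # start from an arbitrary pair
    while len(anchors) < max_anchors:                    # safeguard on the set size
        anchor_feats = np.array([phi(s, a) for s, a in anchors])
        worst_pair, worst_norm = None, 0.0
        for s, a in candidates:
            try:
                l1 = np.abs(interpolation_weights(phi(s, a), anchor_feats)).sum()
            except ValueError:                           # feature not yet in the span
                l1 = np.inf
            if l1 > worst_norm:
                worst_pair, worst_norm = (s, a), l1
        if worst_norm <= C_target:                       # C(K_t) <= Cbar: done
            break
        anchors.append(worst_pair)
    return anchors
```

Because negative weights are allowed, the returned set does not need to form a convex hull of $\Phi_t$, consistently with the discussion above.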
Specifically, let $\bar{C}$ be a target amplification factor. At timestep $t$ we would like to find the smallest set $\mathcal{K}_t$ such that $C(\mathcal{K}_t) = \max_{s,a}\big\|\lambda^{\phi_t(s,a)}_t\big\|_1$ is below $\bar{C}$, where the interpolation vector $\lambda^{\phi_t(s,a)}_t$ is computed as in (3). As this problem may be NP-hard, we propose a sequential greedy scheme where anchor points are added to $\mathcal{K}_t$ until the condition is met. Starting with $\mathcal{K}_t$ containing a single arbitrary state-action pair $(s_1,a_1)$, if $C(\mathcal{K}_t) > \bar{C}$, we compute $(s,a) = \arg\max_{s,a}\big\|\lambda^{\phi_t(s,a)}_t\big\|_1$ and add it to $\mathcal{K}_t$. Notice that this process does not necessarily return a positive interpolation vector $\lambda^{\phi_t(s,a)}$, and thus $\hat{Q}^\star_t$ may not be a convex combination of the anchor values. This extra degree of freedom w.r.t. convex hulls may allow us to obtain a small amplification factor with fewer anchor points. Although we do not have theoretical guarantees on the number of anchor points $K = |\mathcal{K}_t|$ added through this heuristic process, we report experiments showing that it is possible to effectively obtain a small $C$, and thus a small prediction error, with few anchor points.

5 Numerical Simulations

We investigate the potential benefit of LAVIER over least-squares AVI (LS-AVI). Although LAVIER shares similarities with averagers, a fair comparison is difficult and outside the scope of this preliminary empirical study. In fact, in designing an averager, the choice of structure and parameters (e.g., the position of the points in a nearest-neighbor procedure) heavily affects the corresponding function class, i.e., the type of functions that can be accurately represented. As a result, any difference in performance would mostly depend on the different function class used by the averager and the linear model (i.e., $\phi$) used by LAVIER. The following MDPs are toy examples designed to investigate the differences between LAVIER and LS-AVI and to confirm our theoretical findings. The empirical results are obtained by averaging 100 simulations and are reported with 95% confidence intervals.

Figure 1: Left: Two-state MDP. Right: Prediction error for least-squares AVI and LAVIER.

Two-state MDP of Tsitsiklis and Van Roy. The first experiment focuses on how the interpolation scheme of IER may avoid divergence. The smallest known problem where least-squares approximation diverges is reported in [TV96; SB18]. This problem consists of a two-state Markov reward process (i.e., an MDP with only one action per state) plus a terminal state (Fig. 1). As there is only one possible policy, the approximation problem reduces to estimating its value function. The feature $\phi$ maps a state to a fixed real number, i.e., $\phi(\cdot)\in\mathbb{R}$, and there is only one weight to learn. For simplicity, we set the parameter of the MDP to $0.01$ and add zero-mean noise to all rewards, generated as $1/2 - \text{Ber}(1/2)$, where $\text{Ber}(\cdot)$ is a Bernoulli random variable. We study the approximation error at the leftmost state when each algorithm is run for a varying number of iterations $H$ and with 1000 samples at each timestep. The samples are generated uniformly from the left and middle nodes, which serve as anchor points. Fig. 1 shows that the error of the least-squares-based method rapidly diverges through iterations, while LAVIER is more robust and its error remains stable.

Figure 2: Left: Chain MDP. Right: Suboptimality of the learned policy at $s_1$, $V^\star(s_1) - V^{\widetilde{\pi}}(s_1)$.

Chain MDP. We now evaluate the quality of the anchor points returned by the heuristic method illustrated in Sect. 4.
In the chain MDP of Fig. 2, the agent starts in the leftmost state and the optimal policy is to always go right and collect the noisy reward in the rightmost state before the episode terminates. However, a small reward is present in the leftmost state, and settling for this reward yields a suboptimal policy. We define the feature $\phi(s,a) = [Q^\star(s,a), v(s,a)]$, where $v(s,a)\sim\text{Unif}(0,1)$ is a random number fixed for each simulation and $(s,a)$ pair. We run LS-AVI by sampling state-actions in the reachable space uniformly at random, while for LAVIER we compute an anchor set with $C \le 1.2$. Both algorithms use the same number of samples, and LAVIER splits the budget of samples uniformly over the anchor points to compute the anchor values. The length of the chain is $N = 50$, which is also the time horizon. We report the quality of the learned policy at $s_1$ and notice that LAVIER is consistently better than LS-AVI (see App. A for further experiments).

Figure 3: Left: MDP with a sequence of linear bandits with actions in 2 dimensions. Center: Example of the anchor points generated by the greedy heuristic algorithm. Right: Accuracy $V^\star(s_t) - \hat{V}^\star(s_t)$ as a function of the state.

Successive Linear Bandits. We consider an MDP defined as a sequence of linear bandit problems (Fig. 3), which is designed so that significant extrapolation occurs at each iteration. In this MDP, there are $N$ states $s_1,\dots,s_N$, augmented with a starting state $s_0$ and a terminal state. From the starting state $s_0$ there are two actions (left and right). The optimal policy is to take left and receive a reward of 10. The states $s_1,\dots,s_N$ are linear bandit problems, where each action gives a Gaussian noisy return with mean 0 and variance 1, and the state transitions deterministically from $s_i$ to $s_{i+1}$. This represents a sequence of linear bandits with no signal, i.e., the output is not correlated with the features and the learner only experiences noise, hence $V^\star_1(s_1) = 0$. The feature map $\phi_t(s,a) = \phi_a$ returns the features describing the action itself, and the solution $Q^\star_t(s,a) = 0$ is exactly representable by a zero weight vector. The solution is unique. The learner should estimate the value of $V^\star_1(s_1)$ accurately to infer the right action in state $s_0$. At each state $s_1,\dots,s_N$, we represent actions in $\mathbb{R}^2$ and we generate 100 actions by uniformly discretizing the unit circle. As the canonical vectors $e_1$ and $e_2$ are the most informative actions to estimate the reward associated with any other action (see [SLM14] for best-arm identification in linear bandits), we collect our samples from these two actions. The anchor points for LAVIER are chosen by our adaptive procedure for different values of the extrapolation coefficient $C\in\{1.05, 1.2, 1.5\}$. The extrapolation becomes more and more controlled as $C$ approaches one. Fig. 3 shows the performance at different states. For small values of $C$, LAVIER significantly outperforms LS-AVI. Furthermore, looking more closely into the rightmost states (i.e., the states that are updated at early iterations) reveals the extrapolation-variance tradeoff (see Fig. 5 in App. B for a zoomed version of the plot): a value of $C = 1.5$ ensures a more accurate estimate (due to less variance) in the first timesteps, but the curve then steeply diverges. By contrast, $C = 1.05$ initially has a poorer estimate, but that estimate remains far more stable with the horizon. We also report the support points selected by the algorithm. Although $C$ is small, only a few points are necessary.
In fact, we do not need to cover the circle with an approximate convex hull, and our procedure can, for example, flip the sign of the learned value without causing extrapolation (i.e., while keeping $C$ small). In Fig. 3 we also report the performance of LS-AVI. In this case, the divergence of the LS-AVI estimate is extreme, and it does not allow an accurate estimate of $V^\star(s_1) = 0$, yielding a policy that cannot identify the correct initial action. Furthermore, in this example we additionally evaluate Least-Squares Temporal Difference (LSTD) for off-policy prediction [SB18]. LSTD is not a policy optimization algorithm, but we can use it to evaluate the value of a policy that chooses, for example, a fixed action in every state of the chain. The training data for LSTD are identical to those of LS-AVI, i.e., samples from the canonical vectors $e_1$ and $e_2$. Despite collecting data along the informative directions $e_1$ and $e_2$, the LSTD solution is of increasingly poor quality as a function of the chain length.

6 Conclusion

Related work. Most of the literature on linear function approximation has focused on designing feature maps $\phi$ that can represent action-value functions well, by optimizing parameterized features (e.g., in deep networks or in [MMS05]), by an initial representation-learning phase that extracts features adapted to the structure of the MDP [MM07; Pet07; Bel+19], or by adding features to reduce the approximation error [Tos+17]. Unfortunately, accurately fitting value functions does not guarantee a small inherent Bellman error (IBE), and thus LAVI may still be very unstable. In this paper we assume $\phi$ has small approximation error but arbitrary IBE, and we focus on how to reduce the amplification factor at each iteration. Yang and Wang [YW19] recently studied the sample complexity of LAVI under the assumption that the transition model $p$ admits a non-negative low-rank factorization in the features $\phi$. In particular, they show that in this case the inherent Bellman error is zero, thus avoiding the amplification of errors through iterations of LAVI. In this paper, we consider the more general case where only the optimal action-value function should be accurately approximated in $\phi$, which may be true even when the transition model does not admit a low-rank decomposition. In fact, Thm. 1 holds even when the inherent Bellman error is infinite and shows that whenever the amplification factor $C$ is small, LAVIER can still achieve polynomial sample complexity. Yang and Wang [YW19] used the convexity condition in Prop. 1 to derive sample complexity guarantees in their setting, while in our case the same condition is used to control the amplification of errors. Furthermore, we notice that in LAVIER we only need to control the $\ell_1$-norm of the interpolation weights, which does not necessarily require any convexity assumption (see also the experiments). [YW19] introduced OPPQ-Learning and proved a near-optimal sample complexity bound of order $\widetilde{O}\big(K/(\epsilon^2(1-\gamma)^3)\big)$, where $K$ is the number of anchor points, which are assumed to be provided.[3] While this shows that both methods scale linearly with the number of anchor points, OPPQ-Learning enjoys a much better dependency on the horizon.
It remains an open question whether our analysis of LAVIER can be improved to match their bound, or whether the difference is the unavoidable price to pay for the more general setting we consider.[4]

[3] Yang and Wang [YW19] point out that the convexity assumption requires the number of features $d$ to scale with the number of anchor points $K$.
[4] We conjecture that the dependency on $H$ could be greatly improved using similar arguments as in [YW19], such as monotonicity, tighter concentration inequalities, and variance reduction.

Averagers pursue the same objective but take an extreme approach, where no extrapolation is allowed and Q-functions are approximated by interpolation of the values at a fixed set of anchor points [Gor95; Gor96; PP13; KKL03; MM99]. Unfortunately, such an approach may suffer from poor sample complexity [PP13; KKL03], as the number of anchor points may scale exponentially with the problem dimensionality. In LAVIER, we introduce a more explicit extrapolation-variance tradeoff, where the anchor points should be designed to avoid extrapolation only when/where it happens.

Future work. There are several directions for future investigation. AVI is a core building block of many exploration-exploitation algorithms [OVW16; Kum+18], and better LAVI may help in building sample-efficient online learning algorithms with function approximation. Another avenue of investigation is off-policy prediction with batch data. The mismatch between behavioural and target policies poses challenges similar to the error propagation of AVI. In order to control the extrapolation-variance tradeoff, one may need to penalize a non-uniform use of the samples (to reduce the variance), while the $\ell_1$-norm minimization objective may reduce the amount of extrapolation to the desired value.

Acknowledgments. This work was partially supported by a Total Innovation Fellowship. The authors are grateful to the reviewers for the high-quality reviews and helpful suggestions.

References

[ASM08] András Antos, Csaba Szepesvári, and Rémi Munos. "Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path". In: Machine Learning 71.1 (2008), pp. 89–129.
[Bai95] Leemon Baird. "Residual algorithms: Reinforcement learning with function approximation". In: International Conference on Machine Learning (ICML). 1995.
[Bel+19] Marc G. Bellemare et al. "A Geometric Perspective on Optimal Representations for Reinforcement Learning". In: CoRR abs/1901.11530 (2019). arXiv: 1901.11530. URL: http://arxiv.org/abs/1901.11530.
[Blu+17] Avrim Blum et al. Approximate Convex Hull of Data Streams. 2017. arXiv: 1712.04564 [cs.CG].
[BV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[CJ19] Jinglin Chen and Nan Jiang. "Information-Theoretic Considerations in Batch Reinforcement Learning". In: arXiv e-prints (May 2019). arXiv: 1905.00360 [cs.LG].
[FSM10] Amir-massoud Farahmand, Csaba Szepesvári, and Rémi Munos. "Error propagation for approximate policy and value iteration". In: Advances in Neural Information Processing Systems (NIPS). 2010.
[GO17] Robert Graham and Adam M. Oberman. Approximate Convex Hulls: Sketching the Convex Hull Using Curvature. 2017. arXiv: 1703.01350 [cs.CG].
[Gor95] Geoffrey J. Gordon. "Stable function approximation in dynamic programming". In: International Conference on Machine Learning (ICML). 1995, pp. 261–268.
[Gor96] Geoffrey J. Gordon. "Stable fitted reinforcement learning". In: Advances in Neural Information Processing Systems (NIPS). 1996.
[HA13] M. Zahid Hossain and M. Ashraful Amin. "On constructing approximate convex hull". In: American Journal of Computational Mathematics 3.1 (2013), p. 11.
[Hoe63] Wassily Hoeffding. "Probability inequalities for sums of bounded random variables". In: Journal of the American Statistical Association (1963).
[JA18] Nan Jiang and Alekh Agarwal. "Open Problem: The Dependence of Sample Complexity Lower Bounds on Planning Horizon". In: Conference on Learning Theory (COLT). 2018, pp. 3395–3398.
[KKL03] Sham M. Kakade, Michael Kearns, and John Langford. "Exploration in Metric State Spaces". In: International Conference on Machine Learning (ICML). 2003.
[Kum+18] Raksha Kumaraswamy et al. "Context-dependent upper-confidence bounds for directed exploration". In: Advances in Neural Information Processing Systems (NIPS). 2018.
[MM07] Sridhar Mahadevan and Mauro Maggioni. "Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes". In: Journal of Machine Learning Research 8 (Dec. 2007), pp. 2169–2231. ISSN: 1532-4435. URL: http://dl.acm.org/citation.cfm?id=1314498.1314570.
[MM99] Rémi Munos and Andrew W. Moore. "Barycentric interpolators for continuous space and time reinforcement learning". In: Advances in Neural Information Processing Systems (NIPS). 1999.
[MMS05] Ishai Menache, Shie Mannor, and Nahum Shimkin. "Basis Function Adaptation in Temporal Difference Reinforcement Learning". In: Annals of Operations Research 134.1 (Feb. 2005), pp. 215–238. ISSN: 1572-9338. DOI: 10.1007/s10479-005-5732-z. URL: https://doi.org/10.1007/s10479-005-5732-z.
[Mni+13] Volodymyr Mnih et al. "Playing Atari with Deep Reinforcement Learning". In: Advances in Neural Information Processing Systems (NIPS). 2013.
[MS08] Rémi Munos and Csaba Szepesvári. "Finite-time bounds for fitted value iteration". In: Journal of Machine Learning Research 9 (May 2008), pp. 815–857.
[Mun05] Rémi Munos. "Error bounds for approximate value iteration". In: AAAI Conference on Artificial Intelligence (AAAI). 2005.
[NW06] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer, 2006.
[OS02] Dirk Ormoneit and Saunak Sen. "Kernel-based reinforcement learning". In: Machine Learning 49.2-3 (2002), pp. 161–178.
[OVW16] Ian Osband, Benjamin Van Roy, and Zheng Wen. "Generalization and Exploration via Randomized Value Functions". In: International Conference on Machine Learning (ICML). 2016.
[Pet07] Marek Petrik. "An Analysis of Laplacian Methods for Value Function Approximation in MDPs". In: Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI). Hyderabad, India: Morgan Kaufmann Publishers Inc., 2007, pp. 2574–2579. URL: http://dl.acm.org/citation.cfm?id=1625275.1625690.
[PP13] Jason Pazis and Ronald Parr. "PAC Optimal Exploration in Continuous Space Markov Decision Processes". In: AAAI Conference on Artificial Intelligence (AAAI). Bellevue, Washington, 2013, pp. 774–781. URL: http://dl.acm.org/citation.cfm?id=2891460.2891568.
[SB18] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[Sil+16] David Silver et al. "Mastering the game of Go with deep neural networks and tree search". In: Nature 529.7587 (2016), p. 484.
[Sil+17] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550.7676 (2017), p. 354.
[SLM14] Marta Soare, Alessandro Lazaric, and Rémi Munos. "Best-arm identification in linear bandits". In: Advances in Neural Information Processing Systems (NIPS). 2014, pp. 828–836.
[SV16] Hossein Sartipizadeh and Tyrone L. Vincent. Computing the Approximate Convex Hull in High Dimensions. 2016. arXiv: 1603.04422 [cs.CG].
[Tos+17] Samuele Tosatto et al. "Boosted Fitted Q-Iteration". In: Proceedings of the 34th International Conference on Machine Learning. Ed. by Doina Precup and Yee Whye Teh. Vol. 70. Proceedings of Machine Learning Research. Sydney, Australia: PMLR, 2017, pp. 3434–3443. URL: http://proceedings.mlr.press/v70/tosatto17a.html.
[TV96] John N. Tsitsiklis and Benjamin Van Roy. "Feature-based methods for large scale dynamic programming". In: Machine Learning 22.1-3 (1996), pp. 59–94.
[TV97] John N. Tsitsiklis and Benjamin Van Roy. "Analysis of temporal-difference learning with function approximation". In: Advances in Neural Information Processing Systems (NIPS). 1997.
[YW19] Lin F. Yang and Mengdi Wang. "Sample-Optimal Parametric Q-Learning with Linear Transition Models". In: arXiv preprint arXiv:1902.04779 (2019).
[YXW19] Zhuoran Yang, Yuchen Xie, and Zhaoran Wang. "A Theoretical Analysis of Deep Q-Learning". In: CoRR abs/1901.00137 (2019). arXiv: 1901.00137. URL: http://arxiv.org/abs/1901.00137.