Journal of Machine Learning Research 17 (2016) 1-66. Submitted 1/13; Revised 12/15; Published 9/16

Regularized Policy Iteration with Nonparametric Function Spaces

Amir-massoud Farahmand (farahmand@merl.com), Mitsubishi Electric Research Laboratories (MERL), 201 Broadway, 8th Floor, Cambridge, MA 02139, USA
Mohammad Ghavamzadeh (ghavamza@adobe.com), Adobe Research, 321 Park Avenue, San Jose, CA 95110, USA
Csaba Szepesvári (szepesva@ualberta.ca), Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada
Shie Mannor (shie@ee.technion.ac.il), Department of Electrical Engineering, The Technion, Haifa 32000, Israel

Editor: Peter Auer

Abstract

We study two regularization-based approximate policy iteration algorithms, namely REG-LSPI and REG-BRM, to solve reinforcement learning and planning problems in discounted Markov Decision Processes with large state and finite action spaces. The core of these algorithms is the regularized extension of Least-Squares Temporal Difference (LSTD) learning and of Bellman Residual Minimization (BRM), which are used in the algorithms' policy evaluation steps. Regularization provides a convenient way to control the complexity of the function space to which the estimated value function belongs, and as a result enables us to work with rich nonparametric function spaces. We derive efficient implementations of our methods when the function space is a reproducing kernel Hilbert space. We analyze the statistical properties of REG-LSPI and provide an upper bound on the policy evaluation error and the performance loss of the policy returned by this method. Our bound shows the dependence of the loss on the number of samples, the capacity of the function space, and some intrinsic properties of the underlying Markov Decision Process. The dependence of the policy evaluation bound on the number of samples is minimax optimal.
This is the first work that provides such a strong guarantee for a nonparametric approximate policy iteration algorithm.¹

Keywords: reinforcement learning, approximate policy iteration, regularization, nonparametric method, finite-sample analysis

1. This work is an extension of the NIPS 2008 conference paper by Farahmand et al. (2009b).

©2016 Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, Shie Mannor.

1. Introduction

We study the approximate policy iteration (API) approach to finding a close-to-optimal policy in a Markov Decision Process (MDP), either in a reinforcement learning (RL) or in a planning scenario. The basis of API, which is explained in Section 3, is the policy iteration algorithm, which iteratively evaluates a policy (i.e., finds the value function of the policy; the policy evaluation step) and then improves it (i.e., computes the greedy policy with respect to (w.r.t.) the recently obtained value function; the policy improvement step). When the state space is large (e.g., a subset of $\mathbb{R}^d$ or a finite state space that has too many states to be exactly represented), the policy evaluation step cannot be performed exactly, and as a result the use of function approximation is inevitable. The appropriate choice of the function approximation method, however, is far from trivial. The best choice is problem-dependent, and it also depends on the number of samples in the input data. In this paper we propose a nonparametric regularization-based approach to API. This approach provides a flexible and easy-to-implement policy evaluation step for API.
The advantage of nonparametric methods over parametric methods is their flexibility: Whereas a parametric model, which has a fixed and finite parameterization, limits the range of functions that can be represented, irrespective of the number of samples, nonparametric models avoid such undue restrictions by increasing the power of the function approximation as necessary. Moreover, the regularization-based approach to nonparametrics is elegant and powerful: It has a simple algorithmic form, and the estimator achieves minimax optimal rates in a number of scenarios. Further discussion of and specific results about nonparametric methods, particularly in the supervised learning scenario, can be found in the books by Györfi et al. (2002) and Wasserman (2007). Nonparametric approaches to solving RL/planning problems have received some attention in the RL community. For instance, Petrik (2007); Mahadevan and Maggioni (2007); Parr et al. (2007); Mahadevan and Liu (2010); Geramifard et al. (2011); Farahmand and Precup (2012); Böhmer et al. (2013) and Milani Fard et al. (2013) suggest methods to generate data-dependent basis functions, to be used in general linear models. Ormoneit and Sen (2002) use a smoothing-kernel-based estimate of the model and then use value iteration to find the value function. Barreto et al. (2011, 2012) use a stochastic factorization trick to provide computationally efficient ways to scale up the approach of Ormoneit and Sen (2002). In the context of approximate value iteration, Ernst et al. (2005) consider growing ensembles of trees to approximate the value function. In addition, there have been some works in which regularization methods have been applied to RL/planning problems, e.g., Engel et al. (2005); Jung and Polani (2006); Loth et al. (2007); Farahmand et al. (2009a,b); Taylor and Parr (2009); Kolter and Ng (2009); Johns et al. (2010); Ghavamzadeh et al.
(2011); Farahmand (2011b); Avila Pires and Szepesvári (2012); Hoffman et al. (2012); Geist and Scherrer (2012). Nevertheless, most of these papers are algorithmic results and do not analyze the statistical properties of these methods (the exceptions are Farahmand et al. 2009a,b; Farahmand 2011b; Ghavamzadeh et al. 2011; Avila Pires and Szepesvári 2012). We compare these methods with ours in more detail in Sections 5.3.1 and 6. It is worth mentioning that one might use a regularized estimator alongside a feature generation approach to control the complexity of the function space induced by the features. An alternative to regularization for controlling the complexity of a function space is to use greedy algorithms, such as Matching Pursuit (Mallat and Zhang, 1993) and Orthogonal Matching Pursuit (Pati et al., 1993), to select features from a large set of features. Greedy algorithms have recently been developed for value function estimation by Johns (2010); Painter-Wakefield and Parr (2012); Farahmand and Precup (2012); Geramifard et al. (2013). We do not discuss these methods any further.

1.1 Contributions

The algorithmic contribution of this work is to introduce two regularization-based nonparametric approximate policy iteration algorithms, namely Regularized Least-Squares Policy Improvement (REG-LSPI) and Regularized Bellman Residual Minimization (REG-BRM). These are flexible methods that, upon the proper selection of their parameters, are sample efficient. Each of REG-BRM and REG-LSPI is formulated as two coupled regularized optimization problems (Section 4). As we argue in Section 4.1, having a regularized objective in both optimization problems is necessary for rich nonparametric function spaces.
Despite the unusual coupled formulation of the underlying optimization problems, we prove that the solutions can be computed in closed form when the estimated action-value function belongs to the family of reproducing kernel Hilbert spaces (RKHS) (Section 4.2). The theoretical contribution of this work (Section 5) is to analyze the statistical properties of REG-LSPI and to provide upper bounds on the policy evaluation error and the performance difference between the optimal policy and the policy returned by this method (Theorem 14). The result demonstrates the dependence of the bounds on the number of samples, the capacity of the function space to which the estimated action-value function belongs, and some intrinsic properties of the MDP. It turns out that the dependence of the policy evaluation error bound on the number of samples is minimax optimal. This paper, alongside its conference (Farahmand et al., 2009b) and dissertation (Farahmand, 2011b) versions, is the first work that analyzes a nonparametric regularized API algorithm and provides such a strong guarantee for it.

2. Background and Notation

In the first part of this section, we provide a brief summary of some of the concepts and definitions from the theory of MDPs and RL (Section 2.1). For more information, the reader is referred to Bertsekas and Shreve (1978); Bertsekas and Tsitsiklis (1996); Sutton and Barto (1998); Szepesvári (2010). In addition to this background on MDPs, we introduce the notation we use to denote function spaces and their corresponding norms (Section 2.2), as well as the considered learning problem (Section 2.3).

2.1 Markov Decision Processes

For a space $\Omega$ with a $\sigma$-algebra $\sigma_\Omega$, we define $\mathcal{M}(\Omega)$ as the set of all probability measures over $\sigma_\Omega$. We let $B(\Omega)$ denote the space of bounded measurable functions w.r.t. $\sigma_\Omega$, and we denote by $B(\Omega, L)$ the space of measurable functions bounded by $0 < L < \infty$.
Definition 1 A finite-action discounted MDP is a 4-tuple $(\mathcal{X}, \mathcal{A}, P, \gamma)$, where $\mathcal{X}$ is a measurable state space, $\mathcal{A}$ is a finite set of actions, $P : \mathcal{X} \times \mathcal{A} \to \mathcal{M}(\mathbb{R} \times \mathcal{X})$ is a mapping with domain $\mathcal{X} \times \mathcal{A}$, and $0 \le \gamma < 1$ is a discount factor. The mapping $P$ evaluated at $(x, a) \in \mathcal{X} \times \mathcal{A}$ gives a distribution over $\mathbb{R} \times \mathcal{X}$, which we shall denote by $P(\cdot, \cdot \,|\, x, a)$. We denote the marginals of $P$ by the overloaded symbol $P : \mathcal{X} \times \mathcal{A} \to \mathcal{M}(\mathcal{X})$, defined as $P(\cdot|x,a) = P_{x,a}(\cdot) = \int_{\mathbb{R}} P(dr, \cdot \,|\, x, a)$ (transition probability kernel), and by $\mathcal{R} : \mathcal{X} \times \mathcal{A} \to \mathcal{M}(\mathbb{R})$, defined as $\mathcal{R}(\cdot|x,a) = \int_{\mathcal{X}} P(\cdot, dy \,|\, x, a)$ (reward distribution).

An MDP together with an initial distribution $P_1$ of states encodes the laws governing the temporal evolution of a discrete-time stochastic process controlled by an agent as follows: The controlled process starts at time $t = 1$ with random initial state $X_1 \sim P_1$ (here and in what follows $X \sim Q$ denotes that the random variable $X$ is drawn from distribution $Q$). At stage $t$, action $A_t \in \mathcal{A}$ is selected by the agent controlling the process. In response, the pair $(R_t, X_{t+1})$ is drawn from $P(\cdot, \cdot \,|\, X_t, A_t)$, i.e., $(R_t, X_{t+1}) \sim P(\cdot, \cdot \,|\, X_t, A_t)$, where $R_t$ is the reward that the agent receives at time $t$ and $X_{t+1}$ is the state at time $t+1$. The process then repeats with the agent selecting action $A_{t+1}$, etc. In general, the agent can use all past states, actions, and rewards in deciding about its current action. However, for our purposes it will suffice to consider action-selection procedures, or policies, that select an action deterministically and time-invariantly, solely based on the current state:

Definition 2 (Deterministic Markov Stationary Policy) A measurable mapping $\pi : \mathcal{X} \to \mathcal{A}$ is called a deterministic Markov stationary policy, or just policy in short. Following a policy $\pi$ in an MDP means that at each time step $t$ it holds that $A_t = \pi(X_t)$.

Policy $\pi$ induces the transition probability kernel $P^\pi : \mathcal{X} \times \mathcal{A} \to \mathcal{M}(\mathcal{X} \times \mathcal{A})$ defined as follows: For a measurable subset $C$ of $\mathcal{X} \times \mathcal{A}$, let $(P^\pi)(C|x,a) \triangleq \int P(dy|x,a)\,\mathbb{I}\{(y, \pi(y)) \in C\}$.
The $m$-step transition probability kernels $(P^\pi)^m : \mathcal{X} \times \mathcal{A} \to \mathcal{M}(\mathcal{X} \times \mathcal{A})$ for $m = 2, 3, \dots$ are defined inductively by $(P^\pi)^m(C|x,a) \triangleq \int_{\mathcal{X}} P(dy|x,a)\,(P^\pi)^{m-1}(C|y, \pi(y))$. Also, given a probability transition kernel $P : \mathcal{X} \times \mathcal{A} \to \mathcal{M}(\mathcal{X} \times \mathcal{A})$, we define the right-linear operator $P : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X} \times \mathcal{A})$ by $(PQ)(x,a) \triangleq \int_{\mathcal{X} \times \mathcal{A}} P(dy, da'|x,a)\, Q(y, a')$. For a probability measure $\rho \in \mathcal{M}(\mathcal{X} \times \mathcal{A})$ and a measurable subset $C$ of $\mathcal{X} \times \mathcal{A}$, we define the left-linear operator $P : \mathcal{M}(\mathcal{X} \times \mathcal{A}) \to \mathcal{M}(\mathcal{X} \times \mathcal{A})$ by $(\rho P)(C) = \int \rho(dx, da)\, P(dy, da'|x,a)\, \mathbb{I}\{(y, a') \in C\}$.

To study MDPs, two auxiliary functions are of central importance: the value and the action-value functions of a policy $\pi$.

Definition 3 (Value Functions) For a policy $\pi$, the value function $V^\pi$ and the action-value function $Q^\pi$ are defined as follows: Let $(R_t;\, t \ge 1)$ be the sequence of rewards when the Markov chain is started from a state $X_1$ (or state-action $(X_1, A_1)$ for the action-value function) drawn from a positive probability distribution over $\mathcal{X}$ (or $\mathcal{X} \times \mathcal{A}$) and the agent follows policy $\pi$. Then,

$$V^\pi(x) \triangleq \mathbb{E}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} R_t \,\middle|\, X_1 = x \right] \quad \text{and} \quad Q^\pi(x,a) \triangleq \mathbb{E}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} R_t \,\middle|\, X_1 = x, A_1 = a \right].$$

It is easy to see that for any policy $\pi$, if the magnitude of the immediate expected reward $r(x,a) = \int r\, P(dr, dy|x,a)$ is uniformly bounded by $R_{\max}$, then the functions $V^\pi$ and $Q^\pi$ are bounded by $V_{\max} = Q_{\max} = R_{\max}/(1-\gamma)$, independent of the choice of $\pi$. For a discounted MDP, we define the optimal value and optimal action-value functions by $V^*(x) = \sup_\pi V^\pi(x)$ for all states $x \in \mathcal{X}$ and $Q^*(x,a) = \sup_\pi Q^\pi(x,a)$ for all state-action pairs $(x,a) \in \mathcal{X} \times \mathcal{A}$. We say that a policy $\pi$ is optimal if it achieves the best values in every state, i.e., if $V^\pi = V^*$. We say that a policy $\pi$ is greedy w.r.t. an action-value function $Q$ if $\pi(x) = \operatorname{argmax}_{a \in \mathcal{A}} Q(x,a)$ for all $x \in \mathcal{X}$. We define the function $\hat\pi(x; Q) \triangleq \operatorname{argmax}_{a \in \mathcal{A}} Q(x,a)$ (for all $x \in \mathcal{X}$) that returns a greedy policy of an action-value function $Q$ (if there exist multiple maximizers, a maximizer is chosen in an arbitrary deterministic manner).
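For a finite MDP, Definition 3 has a convenient closed form: $V^\pi$ is the unique fixed point of $V = r^\pi + \gamma P^\pi V$ (anticipating Definition 4), so it can be obtained by a linear solve. The following sketch uses a made-up 2-state, 2-action MDP purely for illustration; all numbers are assumptions, not taken from the paper.

```python
import numpy as np

gamma = 0.9
# Hypothetical toy MDP: P[a, x, y] = P(y | x, a), r[x, a] = expected reward.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.5, 0.5]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])

pi = np.array([0, 1])  # deterministic Markov stationary policy: state -> action

# Induced transition matrix P^pi and reward vector r^pi.
P_pi = np.array([P[pi[x], x] for x in range(2)])
r_pi = np.array([r[x, pi[x]] for x in range(2)])

# V^pi = (I - gamma P^pi)^{-1} r^pi, the unique fixed point of T^pi.
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Sanity check: V^pi satisfies the Bellman equation V^pi = r^pi + gamma P^pi V^pi.
assert np.allclose(V_pi, r_pi + gamma * P_pi @ V_pi)
```

As stated above, the result is bounded by $R_{\max}/(1-\gamma)$ regardless of the policy.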
Greedy policies are important because a greedy policy w.r.t. the optimal action-value function $Q^*$ is an optimal policy. Hence, knowing $Q^*$ is sufficient for behaving optimally (cf. Proposition 4.3 of Bertsekas and Shreve 1978).²

Definition 4 (Bellman Operators) For a policy $\pi$, the Bellman operators $T^\pi : B(\mathcal{X}) \to B(\mathcal{X})$ (for value functions) and $T^\pi : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X} \times \mathcal{A})$ (for action-value functions) are defined as

$$(T^\pi V)(x) \triangleq r(x, \pi(x)) + \gamma \int_{\mathcal{X}} V(y)\, P(dy|x, \pi(x)),$$
$$(T^\pi Q)(x,a) \triangleq r(x,a) + \gamma \int_{\mathcal{X}} Q(y, \pi(y))\, P(dy|x,a).$$

To avoid unnecessary clutter, we use the same symbol to denote both operators. However, this should not introduce any ambiguity: Given some expression involving $T^\pi$, one can always determine which operator $T^\pi$ means by looking at the type of function $T^\pi$ is applied to. It is known that the fixed point of the Bellman operator is the (action-)value function of the policy $\pi$, i.e., $T^\pi Q^\pi = Q^\pi$ and $T^\pi V^\pi = V^\pi$; see e.g., Proposition 4.2(b) of Bertsekas and Shreve (1978). We will also need to define the so-called Bellman optimality operators:

Definition 5 (Bellman Optimality Operators) The Bellman optimality operators $T^* : B(\mathcal{X}) \to B(\mathcal{X})$ (for value functions) and $T^* : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X} \times \mathcal{A})$ (for action-value functions) are defined as

$$(T^* V)(x) \triangleq \max_{a} \left[ r(x,a) + \gamma \int_{\mathcal{X}} V(y)\, P(dy|x,a) \right],$$
$$(T^* Q)(x,a) \triangleq r(x,a) + \gamma \int_{\mathcal{X}} \max_{a'} Q(y, a')\, P(dy|x,a).$$

Again, we use the same symbol to denote both operators; the previous comment that no ambiguity should arise because of this still applies. The Bellman optimality operators enjoy a fixed-point property similar to that of the Bellman operators. In particular, $T^* V^* = V^*$ and $T^* Q^* = Q^*$; see e.g., Proposition 4.2(a) of Bertsekas and Shreve (1978). The Bellman optimality operator thus provides a vehicle to compute the optimal action-value function and therefore to compute an optimal policy.

2. Measurability issues are dealt with in Section 9.5 of the same book.
In the case of finitely many actions, no additional condition is needed besides the obvious measurability assumptions on the immediate reward function and the transition kernel (Bertsekas and Shreve, 1978, Corollary 9.17.1), which we will assume from now on.

2.2 Norms and Function Spaces

In what follows we use $\mathcal{F} : \mathcal{X} \to \mathbb{R}$ to denote a subset of measurable functions. The exact specification of this set will be clear from the context. Further, we let $\mathcal{F}^{|\mathcal{A}|} : \mathcal{X} \times \mathcal{A} \to \mathbb{R}^{|\mathcal{A}|}$ be a subset of vector-valued measurable functions with the identification $\mathcal{F}^{|\mathcal{A}|} = \left\{ (Q_1, \dots, Q_{|\mathcal{A}|}) : Q_i \in \mathcal{F},\ i = 1, \dots, |\mathcal{A}| \right\}$. We shall use $\|Q\|_{p,\nu}$ to denote the $L_p(\nu)$-norm ($1 \le p < \infty$) of a measurable function $Q : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, i.e., $\|Q\|_{p,\nu}^p \triangleq \int_{\mathcal{X} \times \mathcal{A}} |Q(x,a)|^p\, d\nu(x,a)$. Let $z_{1:n}$ denote the $\mathcal{Z}$-valued sequence $(z_1, \dots, z_n)$. For $\mathcal{D}_n = z_{1:n}$, define the empirical norm of a function $f : \mathcal{Z} \to \mathbb{R}$ as

$$\|f\|_{p,\mathcal{D}_n}^p = \|f\|_{p,z_{1:n}}^p \triangleq \frac{1}{n} \sum_{i=1}^{n} |f(z_i)|^p. \qquad (1)$$

When there is no chance of confusion about $\mathcal{D}_n$, we may denote the empirical norm by $\|f\|_{p,n}$. Based on this definition, one may define $\|Q\|_{p,\mathcal{D}_n}$ with the choice of $\mathcal{Z} = \mathcal{X} \times \mathcal{A}$. Note that if $\mathcal{D}_n = (Z_i)_{i=1}^n$ is random with $Z_i \sim \nu$, the empirical norm is random too, and for any fixed function $f$, we have $\mathbb{E}\left[ \|f\|_{p,\mathcal{D}_n}^p \right] = \|f\|_{p,\nu}^p$. When $p = 2$, we simply use $\|\cdot\|_\nu$ and $\|\cdot\|_{\mathcal{D}_n}$.

2.3 Offline Learning Problem and Empirical Bellman Operators

We consider the offline learning scenario in which we are only given a batch of data³

$$\mathcal{D}_n = \{(X_1, A_1, R_1, X'_1), \dots, (X_n, A_n, R_n, X'_n)\}, \qquad (2)$$

with $X_i \sim \nu_{\mathcal{X}}$, $A_i \sim \pi_b(\cdot|X_i)$, and $(R_i, X'_i) \sim P(\cdot, \cdot\,|\,X_i, A_i)$ for $i = 1, \dots, n$. Here $\nu_{\mathcal{X}} \in \mathcal{M}(\mathcal{X})$ is a fixed distribution over the states and $\pi_b$ is the data generating behavior policy, which is a stochastic stationary Markov policy, i.e., given any state $x \in \mathcal{X}$, it assigns a probability distribution over $\mathcal{A}$. We shall also denote the common distribution underlying $(X_i, A_i)$ by $\nu \in \mathcal{M}(\mathcal{X} \times \mathcal{A})$.
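Equation (1) is straightforward to compute. The sketch below uses a hypothetical function $f$ and $\nu = \mathrm{Uniform}(0,1)$ (both invented for illustration) and also checks the stated identity $\mathbb{E}\big[\|f\|_{p,\mathcal{D}_n}^p\big] = \|f\|_{p,\nu}^p$: with $f(z) = 2z$ and $p = 2$, the population value is $\int_0^1 (2z)^2\,dz = 4/3$.

```python
import numpy as np

def empirical_norm(f, Z, p=2):
    """Empirical L_p norm over the sample z_{1:n}:
    ||f||_{p, D_n} = ( (1/n) * sum_i |f(z_i)|^p )^(1/p)."""
    vals = np.abs(np.array([f(z) for z in Z])) ** p
    return vals.mean() ** (1.0 / p)

rng = np.random.default_rng(0)
f = lambda z: 2.0 * z                   # a fixed function (illustrative choice)
Z = rng.uniform(0.0, 1.0, size=50_000)  # Z_i ~ nu = Uniform(0, 1)

# ||f||^2_{2, D_n} should be close to ||f||^2_{2, nu} = 4/3 for large n.
print(empirical_norm(f, Z) ** 2)
```

With $n = 50{,}000$ i.i.d. draws the squared empirical norm concentrates tightly around $4/3$.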
Samples $X'_i$ and $X_{i+1}$ may be sampled independently (we call this the "Planning scenario"), or may be coupled through $X'_i = X_{i+1}$ (the "RL scenario"). In the latter case the data comes from a single trajectory. Under either of these scenarios, we say that the data $\mathcal{D}_n$ meets the standard offline sampling assumption. We analyze the Planning scenario, where the states are independent, but one may also analyze dependent processes by considering mixing processes and using tools such as the independent blocks technique (Yu, 1994; Doukhan, 1994), as has been done by Antos et al. (2008b); Farahmand and Szepesvári (2012). The data set $\mathcal{D}_n$ allows us to define the so-called empirical Bellman operators, which can be thought of as empirical approximations to the true Bellman operators.

3. In what follows, when $\{\cdot\}$ is used in connection to a data set, we treat the set as an ordered multiset, where the ordering is given by the time indices of the data points.

Definition 6 (Empirical Bellman Operators) Let $\mathcal{D}_n$ be a data set as above. Define the ordered multiset $S_n = \{(X_1, A_1), \dots, (X_n, A_n)\}$. For a given fixed policy $\pi$, the empirical Bellman operator $\hat{T}^\pi : B(\mathcal{X} \times \mathcal{A}) \to \mathbb{R}^n$ is defined as

$$(\hat{T}^\pi Q)(X_i, A_i) \triangleq R_i + \gamma Q(X'_i, \pi(X'_i)), \qquad 1 \le i \le n.$$

Similarly, the empirical Bellman optimality operator $\hat{T}^* : B(\mathcal{X} \times \mathcal{A}) \to \mathbb{R}^n$ is defined as

$$(\hat{T}^* Q)(X_i, A_i) \triangleq R_i + \gamma \max_{a'} Q(X'_i, a'), \qquad 1 \le i \le n.$$

In words, the empirical Bellman operators take the $n$-element list $S_n$ and return an $n$-dimensional real-valued vector of single-sample estimates of the Bellman operators applied to the action-value function $Q$ at the selected points. It is easy to see that the empirical Bellman operators provide an unbiased estimate of the Bellman operators in the following sense: For any fixed bounded measurable deterministic function $Q : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, policy $\pi$, and $1 \le i \le n$, it holds that $\mathbb{E}\big[ (\hat{T}^\pi Q)(X_i, A_i) \,|\, X_i, A_i \big] = (T^\pi Q)(X_i, A_i)$ and $\mathbb{E}\big[ (\hat{T}^* Q)(X_i, A_i) \,|\, X_i, A_i \big] = (T^* Q)(X_i, A_i)$.

3.
Approximate Policy Iteration

The policy iteration algorithm computes a sequence of policies such that the new policy in the iteration is greedy w.r.t. the action-value function of the previous policy. This procedure requires one to compute the action-value function of the most recent policy (policy evaluation step), followed by the computation of the greedy policy (policy improvement step). In API, the exact, but infeasible, policy evaluation step is replaced by an approximate one. Thus, the skeleton of API methods is as follows: At the $k$th iteration and given a policy $\pi_k$, the API algorithm approximately evaluates $\pi_k$ to find a $Q_k$. The action-value function $Q_k$ is typically chosen to be such that $Q_k \approx T^{\pi_k} Q_k$, i.e., it is an approximate fixed point of $T^{\pi_k}$. The API algorithm then calculates the greedy policy w.r.t. the most recent action-value function to obtain a new policy $\pi_{k+1}$, i.e., $\pi_{k+1} = \hat\pi(\cdot\,; Q_k)$. The API algorithm continues by repeating this process, generating a sequence of policies and their corresponding approximate action-value functions $Q_0 \to \pi_1 \to Q_1 \to \pi_2 \to \cdots$.⁴

The success of an API algorithm hinges on the way the approximate policy evaluation step is implemented. Approximate policy evaluation is non-trivial for at least two reasons. First, policy evaluation is an inverse problem,⁵ so the underlying learning problem is unlike a standard supervised learning problem in which the data take the form of input-output pairs. The second problem is the off-policy sampling problem: The distribution of $(X_i, A_i)$ in the data samples (possibly generated by a behavior policy) is typically different from the distribution that would be induced if we followed the to-be-evaluated policy (i.e., the target policy). This causes a problem since the methods must be able to handle this mismatch of

4. In an actual API implementation, one does not need to compute $\pi_{k+1}$ for all states, which in fact is infeasible for large state spaces.
Instead, one uses $Q_k$ to compute $\pi_{k+1}$ at some select states, as required in the approximate policy evaluation step.

5. Given an operator $L : \mathcal{F} \to \mathcal{F}$, the inverse problem is the problem of solving $g = Lf$ for $f$ when $g$ is known. In the policy evaluation problem, $L = I - \gamma P^\pi$, $g(\cdot) = r(\cdot, \pi(\cdot))$, and $f = Q^\pi$.

distributions.⁶ In the rest of this section, we review the generic LSTD and BRM methods for approximate policy evaluation. We introduce our regularized versions of LSTD and BRM in Section 4.

3.1 Bellman Residual Minimization

The idea of BRM goes back at least to the work of Schweitzer and Seidmann (1985). It was later used in the RL community by Williams and Baird (1994) and Baird (1995). The basic idea of BRM comes from noticing that the action-value function is the unique fixed point of the Bellman operator: $Q^\pi = T^\pi Q^\pi$ (or similarly $V^\pi = T^\pi V^\pi$ for the value function). Whenever we replace $Q^\pi$ by an action-value function $Q$ different from $Q^\pi$, the fixed-point equation no longer holds, and we have a non-zero residual function $Q - T^\pi Q$. This quantity is called the Bellman residual of $Q$. The same is true for the Bellman optimality operator $T^*$. The BRM algorithm minimizes the norm of the Bellman residual of $Q$, which is called the Bellman error. It can be shown that if $\|Q - T^* Q\|$ is small, then the value function of the greedy policy w.r.t. $Q$, that is $V^{\hat\pi(\cdot\,;Q)}$, is also in some sense close to the optimal value function $V^*$; see e.g., Williams and Baird (1994); Munos (2003); Antos et al. (2008b); Farahmand et al. (2010), and Theorem 13 of this work. The BRM algorithm is defined as the procedure minimizing the following loss function:

$$L_{\mathrm{BRM}}(Q; \pi) \triangleq \|Q - T^\pi Q\|_\nu^2,$$

where $\nu$ is the distribution of state-actions in the input data.
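The sample-based surrogate for $T^\pi Q$ that the empirical loss below is built from is exactly Definition 6. A minimal sketch on a made-up continuous-state problem (the dynamics, behavior policy, target policy, and $Q$ below are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9

def step(x, a):
    """Sample (R, X') ~ P(., . | x, a) for a hypothetical stochastic MDP."""
    r = x + 0.1 * a + rng.normal(0.0, 0.1)
    x_next = np.clip(x + (0.1 if a == 1 else -0.1) + rng.normal(0.0, 0.05), 0.0, 1.0)
    return r, x_next

pi = lambda x: 1 if x < 0.5 else 0   # a fixed deterministic target policy
Q = lambda x, a: x * (a + 1.0)       # any fixed bounded action-value function

# D_n = {(X_i, A_i, R_i, X'_i)}: the behavior policy picks actions uniformly.
n = 1000
X = rng.uniform(0.0, 1.0, size=n)
A = rng.integers(0, 2, size=n)
data = [(x, a, *step(x, a)) for x, a in zip(X, A)]

# Empirical Bellman operator: (hat T^pi Q)(X_i, A_i) = R_i + gamma Q(X'_i, pi(X'_i)).
TQ_hat = np.array([r + gamma * Q(xn, pi(xn)) for (_, _, r, xn) in data])
assert TQ_hat.shape == (n,)   # one single-sample backup per data point
```

Each entry is an unbiased single-sample estimate of $(T^\pi Q)(X_i, A_i)$, conditioned on $(X_i, A_i)$.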
Using the empirical $L_2$-norm defined in (1) with the samples $\mathcal{D}_n$ defined in (2), and by replacing $(T^\pi Q)(X_t, A_t)$ with the empirical Bellman operator (Definition 6), the empirical estimate of $L_{\mathrm{BRM}}(Q;\pi)$ can be written as

$$\hat{L}_{\mathrm{BRM}}(Q; \pi, n) \triangleq \left\| Q - \hat{T}^\pi Q \right\|_{\mathcal{D}_n}^2 = \frac{1}{n} \sum_{t=1}^{n} \left[ Q(X_t, A_t) - \left( R_t + \gamma Q\big( X'_t, \pi(X'_t) \big) \right) \right]^2. \qquad (3)$$

Nevertheless, it is well-known that $\hat{L}_{\mathrm{BRM}}$ is not an unbiased estimate of $L_{\mathrm{BRM}}$ when the MDP is not deterministic (Lagoudakis and Parr, 2003; Antos et al., 2008b). To address this issue, Antos et al. (2008b) propose the modified BRM loss, a new empirical loss function with an extra de-biasing term. The idea of the modified BRM is to cancel the unwanted variance by introducing an auxiliary function $h$ and a new loss function

$$L_{\mathrm{BRM}}(Q, h; \pi) = L_{\mathrm{BRM}}(Q; \pi) - \|h - T^\pi Q\|_\nu^2, \qquad (4)$$

and approximating the action-value function $Q^\pi$ by solving

$$Q_{\mathrm{BRM}} = \operatorname*{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \sup_{h \in \mathcal{F}^{|\mathcal{A}|}} L_{\mathrm{BRM}}(Q, h; \pi), \qquad (5)$$

where the supremum comes from the negative sign of $\|h - T^\pi Q\|_\nu^2$. They have shown that optimizing the new loss function still makes sense and that the empirical version of this loss is unbiased. The min-max optimization problem (5) is equivalent to the following coupled (nested) optimization problems:

$$h(\cdot\,; Q) = \operatorname*{argmin}_{h \in \mathcal{F}^{|\mathcal{A}|}} \|h - T^\pi Q\|_\nu^2, \qquad Q_{\mathrm{BRM}} = \operatorname*{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \left[ \|Q - T^\pi Q\|_\nu^2 - \|h(\cdot\,; Q) - T^\pi Q\|_\nu^2 \right]. \qquad (6)$$

In practice, the norm $\|\cdot\|_\nu$ is replaced by the empirical norm $\|\cdot\|_{\mathcal{D}_n}$ and $T^\pi Q$ is replaced by its sample-based approximation $\hat{T}^\pi Q$, i.e.,

$$\hat{h}_n(\cdot\,; Q) = \operatorname*{argmin}_{h \in \mathcal{F}^{|\mathcal{A}|}} \left\| h - \hat{T}^\pi Q \right\|_{\mathcal{D}_n}^2, \qquad (7)$$
$$\hat{Q}_{\mathrm{BRM}} = \operatorname*{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \left[ \left\| Q - \hat{T}^\pi Q \right\|_{\mathcal{D}_n}^2 - \left\| \hat{h}_n(\cdot\,; Q) - \hat{T}^\pi Q \right\|_{\mathcal{D}_n}^2 \right]. \qquad (8)$$

From now on, whenever we refer to the BRM algorithm, we are referring to this modified BRM.

6. A number of works in the domain adaptation literature consider this scenario under the name of the covariate shift problem, see e.g., Ben-David et al. 2006; Mansour et al. 2009; Ben-David et al. 2010; Cortes et al. 2015.
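The bias of $\hat{L}_{\mathrm{BRM}}$ can be verified exactly on a tiny example: for a fixed $Q$ and stochastic transitions, $\mathbb{E}[\hat{L}_{\mathrm{BRM}}] - L_{\mathrm{BRM}}$ equals the average variance term $\gamma^2 \operatorname{Var}(Q(X') \,|\, x)$, which is the quantity the de-biasing term in (4) is designed to cancel. A sketch with a made-up single-action chain (so $Q = V$), computed by exact enumeration rather than sampling:

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state, single-action MDP with stochastic transitions.
P = np.array([[0.5, 0.5], [0.2, 0.8]])   # P[x, y]
r = np.array([1.0, 0.0])                 # deterministic reward r[x]
Q = np.array([0.3, 1.1])                 # an arbitrary fixed Q
nu = np.array([0.5, 0.5])                # sampling distribution over states

TQ = r + gamma * P @ Q                   # (T^pi Q)(x)
L_brm = nu @ (Q - TQ) ** 2               # population Bellman error

# Expected *empirical* loss, E[(Q(x) - (r(x) + gamma Q(X')))^2], X' ~ P(.|x),
# computed exactly by enumerating next states:
L_hat = nu @ np.array([P[x] @ (Q[x] - (r[x] + gamma * Q)) ** 2 for x in range(2)])

bias = L_hat - L_brm
# The gap equals gamma^2 Var(Q(X') | x), averaged over nu:
var_term = nu @ np.array([gamma**2 * (P[x] @ Q**2 - (P[x] @ Q) ** 2)
                          for x in range(2)])
assert bias > 0 and np.isclose(bias, var_term)
```

For a deterministic MDP the variance term vanishes and the naive estimate is unbiased, consistent with the statement above.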
3.2 Least-Squares Temporal Difference Learning

The Least-Squares Temporal Difference learning (LSTD) algorithm for policy evaluation was first proposed by Bradtke and Barto (1996); it was later used in an API procedure by Lagoudakis and Parr (2003), which was called Least-Squares Policy Iteration (LSPI). The original formulation of LSTD finds a solution to the fixed-point equation $Q = \Pi_\nu T^\pi Q$, where $\Pi_\nu$ is the simplified notation for the $\nu$-weighted projection operator onto the space of admissible functions $\mathcal{F}^{|\mathcal{A}|}$, i.e., $\Pi_\nu \equiv \Pi_\nu^{\mathcal{F}^{|\mathcal{A}|}} : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X} \times \mathcal{A})$ is defined by $\Pi_\nu^{\mathcal{F}^{|\mathcal{A}|}} Q = \operatorname{argmin}_{h \in \mathcal{F}^{|\mathcal{A}|}} \|h - Q\|_\nu^2$ for $Q \in B(\mathcal{X} \times \mathcal{A})$. We, however, use a different optimization-based formulation. The reason is that whenever $\nu$ is not the stationary distribution induced by $\pi$, the operator $\Pi_\nu T^\pi$ does not necessarily have a fixed point, whereas the optimization problem is always well-defined. We define the LSTD solution as the minimizer of the $L_2$-norm of the difference between $Q$ and $\Pi_\nu T^\pi Q$:

$$L_{\mathrm{LSTD}}(Q; \pi) \triangleq \|Q - \Pi_\nu T^\pi Q\|_\nu^2. \qquad (9)$$

The minimizer of $L_{\mathrm{LSTD}}(Q;\pi)$ is well-defined, and whenever $\nu$ is the stationary distribution of $\pi$ (i.e., on-policy sampling), the solution to this optimization problem is the same as the solution to $Q = \Pi_\nu T^\pi Q$.
The LSTD solution can therefore be written as the solution to the following coupled optimization problems:

$$h(\cdot\,; Q) = \operatorname*{argmin}_{h \in \mathcal{F}^{|\mathcal{A}|}} \|h - T^\pi Q\|_\nu^2, \qquad Q_{\mathrm{LSTD}} = \operatorname*{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \|Q - h(\cdot\,; Q)\|_\nu^2, \qquad (10)$$

where the first equation finds the projection of $T^\pi Q$ onto $\mathcal{F}^{|\mathcal{A}|}$, and the second one minimizes the distance between $Q$ and the projection. The corresponding empirical version based on the data set $\mathcal{D}_n$ is

$$\hat{h}_n(\cdot\,; Q) = \operatorname*{argmin}_{h \in \mathcal{F}^{|\mathcal{A}|}} \left\| h - \hat{T}^\pi Q \right\|_{\mathcal{D}_n}^2, \qquad (11)$$
$$\hat{Q}_{\mathrm{LSTD}} = \operatorname*{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \left\| Q - \hat{h}_n(\cdot\,; Q) \right\|_{\mathcal{D}_n}^2. \qquad (12)$$

For general spaces $\mathcal{F}^{|\mathcal{A}|}$, these optimization problems can be difficult to solve, but when $\mathcal{F}^{|\mathcal{A}|}$ is a linear subspace of $B(\mathcal{X} \times \mathcal{A})$, the minimization problem becomes computationally feasible. A comparison of BRM and LSTD is noteworthy. The population version of LSTD minimizes the distance between $Q$ and $\Pi_\nu T^\pi Q$, which is $\|Q - \Pi_\nu T^\pi Q\|_\nu^2$. Meanwhile, BRM minimizes another distance function: the distance between $Q$ and $T^\pi Q$ minus the distance between $T^\pi Q$ and its projection, i.e., $\|Q - T^\pi Q\|_\nu^2 - \|h(\cdot\,; Q) - T^\pi Q\|_\nu^2$. See Figure 1a for a pictorial presentation of these distances. When $\mathcal{F}^{|\mathcal{A}|}$ is linear, because of the Pythagorean theorem, the solution to the modified BRM (6) coincides with the LSTD solution (10) (Antos et al., 2008b).

Algorithm 1 Regularized Policy Iteration$(K, \hat{Q}^{(-1)}, \mathcal{F}^{|\mathcal{A}|}, J, \{(\lambda_{Q,n}^{(k)}, \lambda_{h,n}^{(k)})\}_{k=0}^{K-1})$
  // $K$: Number of iterations
  // $\hat{Q}^{(-1)}$: Initial action-value function
  // $\mathcal{F}^{|\mathcal{A}|}$: The action-value function space
  // $J$: The regularizer
  // $\{(\lambda_{Q,n}^{(k)}, \lambda_{h,n}^{(k)})\}_{k=0}^{K-1}$: The regularization coefficients
  for $k = 0$ to $K-1$ do
    $\pi_k(\cdot) \leftarrow \hat\pi(\cdot\,; \hat{Q}^{(k-1)})$
    Generate training samples $\mathcal{D}_n^{(k)}$
    $\hat{Q}^{(k)} \leftarrow$ REG-LSTD/BRM$(\pi_k, \mathcal{D}_n^{(k)}; \mathcal{F}^{|\mathcal{A}|}, J, \lambda_{Q,n}^{(k)}, \lambda_{h,n}^{(k)})$
  end for
  return $\hat{Q}^{(K-1)}$ and $\pi_K(\cdot) = \hat\pi(\cdot\,; \hat{Q}^{(K-1)})$

4. Regularized Policy Iteration Algorithms

In this section we introduce two Regularized Policy Iteration algorithms, which are instances of the generic API algorithm.
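The equivalence for linear $\mathcal{F}^{|\mathcal{A}|}$ is easy to check numerically: with a feature matrix $\Phi$, the empirical projection in (11) is $\hat{h}_n(\cdot\,;Q) = \Pi \hat{T}^\pi Q$ with $\Pi = \Phi(\Phi^\top\Phi)^{-1}\Phi^\top$, and the Pythagorean identity makes the empirical modified-BRM objective (8) and the LSTD objective (12) agree at every $Q$ in the span. A sketch with randomly generated synthetic features and rewards (pure illustration, not a real MDP):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 4
gamma = 0.9
Phi = rng.normal(size=(n, d))        # features of (X_i, A_i): a linear F^{|A|}
Phi_next = rng.normal(size=(n, d))   # features of (X'_i, pi(X'_i))
R = rng.normal(size=n)

# Empirical projection onto span(Phi) w.r.t. the empirical norm.
Pi = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)

def losses(w):
    Q, TQ = Phi @ w, R + gamma * Phi_next @ w   # Q and hat T^pi Q on the sample
    h = Pi @ TQ                                 # hat h_n(.; Q), cf. (11)
    brm_mod = np.mean((Q - TQ) ** 2) - np.mean((h - TQ) ** 2)  # objective (8)
    lstd = np.mean((Q - h) ** 2)                               # objective (12)
    return brm_mod, lstd

for _ in range(5):                   # the two objectives agree at every w
    w = rng.normal(size=d)
    brm_mod, lstd = losses(w)
    assert np.isclose(brm_mod, lstd)
```

The identity holds because $Q - \Pi \hat{T}^\pi Q$ lies in the span while $\Pi \hat{T}^\pi Q - \hat{T}^\pi Q$ is orthogonal to it, so the squared norms decompose exactly.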
These algorithms are built on regularized extensions of BRM (Section 3.1) and LSTD (Section 3.2) for the task of approximate policy evaluation. The pseudo-code of the Regularized Policy Iteration algorithms is shown in Algorithm 1. The algorithm receives $K$ (the number of API iterations), an initial action-value function $\hat{Q}^{(-1)}$, the function space $\mathcal{F}^{|\mathcal{A}|}$, the regularizer $J : \mathcal{F}^{|\mathcal{A}|} \to \mathbb{R}$, and a set of regularization coefficients $\{(\lambda_{Q,n}^{(k)}, \lambda_{h,n}^{(k)})\}_{k=0}^{K-1}$. Each iteration starts with a step of policy improvement, i.e., $\pi_k \leftarrow \hat\pi(\cdot\,; \hat{Q}^{(k-1)}) = \operatorname{argmax}_{a' \in \mathcal{A}} \hat{Q}^{(k-1)}(\cdot, a')$. For the first iteration ($k = 0$), one may ignore this step and provide an initial policy $\pi_0$ instead of $\hat{Q}^{(-1)}$. Afterwards, we have a data generating step: At each iteration $k = 0, \dots, K-1$, the agent follows the data generating policy $\pi_{b_k}$ to obtain $\mathcal{D}_n^{(k)} = \{(X_t^{(k)}, A_t^{(k)}, R_t^{(k)}, X_t'^{(k)})\}_{1 \le t \le n}$.

Figure 1: (a) The loss functions minimized by the original BRM, the modified BRM, and the LSTD methods. The function space $\mathcal{F}^{|\mathcal{A}|}$ is represented by the plane. The Bellman operator $T^\pi$ maps an action-value function $Q \in \mathcal{F}^{|\mathcal{A}|}$ to a function $T^\pi Q$. The function $T^\pi Q - \Pi_\nu T^\pi Q$ is orthogonal to $\mathcal{F}^{|\mathcal{A}|}$. The original BRM loss is $\|Q - T^\pi Q\|_\nu^2$ (solid line), the modified BRM loss is $\|Q - T^\pi Q\|_\nu^2 - \|T^\pi Q - \Pi_\nu T^\pi Q\|_\nu^2$ (the difference of two solid line segments; note the + and − symbols), and the LSTD loss is $\|Q - \Pi_\nu T^\pi Q\|_\nu^2$ (dashed line). LSTD and the modified BRM are equivalent for linear function spaces. (b) REG-LSTD and REG-BRM minimize regularized objective functions. Regularization makes the function $T^\pi Q - \Pi_\nu T^\pi Q$ non-orthogonal to $\mathcal{F}^{|\mathcal{A}|}$. The dashed ellipsoids represent the level sets defined by the regularization functional $J$.
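The outer loop of Algorithm 1 can be sketched with exact policy evaluation standing in for the REG-LSTD/BRM call; a real instantiation would replace `evaluate` with the regularized estimator and sampled data. The toy finite MDP below is invented for illustration.

```python
import numpy as np

gamma = 0.9
# Hypothetical finite MDP: P[a, x, y] = P(y | x, a), r[x, a] = expected reward.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.5, 0.5]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
n_states = 2

def evaluate(pi):
    """Policy evaluation step (here exact): Q^pi(x, a) = r(x, a) + gamma E[V^pi(X')].
    An API method replaces this with an approximate evaluation such as REG-LSTD."""
    P_pi = np.array([P[pi[x], x] for x in range(n_states)])
    r_pi = np.array([r[x, pi[x]] for x in range(n_states)])
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return r + gamma * np.einsum('axy,y->xa', P, V)

pi = np.zeros(n_states, dtype=int)   # initial policy pi_0
for k in range(20):                  # policy iteration loop
    Q = evaluate(pi)                 # (approximate) policy evaluation
    pi_next = Q.argmax(axis=1)       # policy improvement: greedy w.r.t. Q
    if np.array_equal(pi_next, pi):
        break
    pi = pi_next

# At convergence, pi is greedy w.r.t. its own action-value function.
assert np.array_equal(pi, evaluate(pi).argmax(axis=1))
```

With exact evaluation this is plain policy iteration; the approximation error introduced by the policy evaluation step is precisely what the analysis in Section 5 bounds.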
For the $k$th iteration of the algorithm, we use the training samples $\mathcal{D}_n^{(k)}$ to evaluate policy $\pi_k$. In practice, one might want to change $\pi_{b_k}$ at each iteration in such a way that the agent ultimately achieves a better performance. The relation between the performance and the choice of data samples, however, is complicated. For simplicity of analysis, in the rest of this work we assume that a fixed behavior policy is used in all iterations, i.e., $\pi_{b_k} = \pi_b$.⁷ This leads to $K$ independent data sets $\mathcal{D}_n^{(0)}, \dots, \mathcal{D}_n^{(K-1)}$. From now on, to avoid clutter, we use the symbols $\mathcal{D}_n, X_t, \dots$ instead of $\mathcal{D}_n^{(k)}, X_t^{(k)}, \dots$, with the understanding that each $\mathcal{D}_n$ in the various iterations refers to an independent set of data samples, which should be clear from the context. The approximate policy evaluation step is performed by REG-LSTD/BRM, which will be discussed shortly. REG-LSTD/BRM receives the policy $\pi_k$, the training samples $\mathcal{D}_n^{(k)}$, the function space $\mathcal{F}^{|\mathcal{A}|}$, the regularizer $J$, and the regularization coefficients $(\lambda_{Q,n}^{(k)}, \lambda_{h,n}^{(k)})$, and returns an estimate of the action-value function of policy $\pi_k$. This procedure repeats for $K$ iterations.

7. So we are in the so-called off-policy sampling scenario.

REG-BRM approximately evaluates policy $\pi_k$ by solving the following coupled optimization problems:

$$\hat{h}_n(\cdot\,; Q) = \operatorname*{argmin}_{h \in \mathcal{F}^{|\mathcal{A}|}} \left[ \left\| h - \hat{T}^{\pi_k} Q \right\|_{\mathcal{D}_n}^2 + \lambda_{h,n}^{(k)} J^2(h) \right], \qquad (13)$$
$$\hat{Q}^{(k)} = \operatorname*{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \left[ \left\| Q - \hat{T}^{\pi_k} Q \right\|_{\mathcal{D}_n}^2 - \left\| \hat{h}_n(\cdot\,; Q) - \hat{T}^{\pi_k} Q \right\|_{\mathcal{D}_n}^2 + \lambda_{Q,n}^{(k)} J^2(Q) \right], \qquad (14)$$

where $J : \mathcal{F}^{|\mathcal{A}|} \to \mathbb{R}$ is the regularization functional (or simply the regularizer or penalizer), and $\lambda_{h,n}^{(k)}, \lambda_{Q,n}^{(k)} > 0$ are regularization coefficients. The regularizer can be any pseudo-norm defined on $\mathcal{F}^{|\mathcal{A}|}$, and $\mathcal{D}_n$ is defined as in (2).⁸ The regularizer is often chosen such that the functions that we believe are more complex have larger values of $J$. The notion of complexity, however, is subjective and depends on the choice of $\mathcal{F}^{|\mathcal{A}|}$ and $J$.
Finally, note that we call $J(Q)$ the smoothness of $Q$, even though it might not coincide with the conventional derivative-based notions of smoothness. An example in which $J$ has a derivative-based interpretation is when the function space $\mathcal{F}^{|\mathcal{A}|}$ is a Sobolev space and the regularizer $J$ is defined as its corresponding norm. In this case, we are penalizing the weak derivatives of the estimate (Györfi et al., 2002; van de Geer, 2000). One can generalize the notion of smoothness beyond the usual derivative-based ones (cf. Chapter 1 of Triebel 2006) and define function spaces such as the family of Besov spaces (DeVore, 1998). The RKHS norm for shift-invariant and radial kernels can also be interpreted as a penalizer of the higher-frequency terms of the function (i.e., a low-pass filter; Evgeniou et al. 1999), so it effectively encourages smoother functions. The choice of kernel determines the frequency response of the filter. One may also use other data-dependent regularizers such as manifold regularization (Belkin et al., 2006) and Sample-based Approximate Regularization (Bachman et al., 2014). As a final example, for functions of the form $Q(x,a) = \sum_{i \ge 1} \phi_i(x,a) w_i$, if we choose a sparsity-inducing regularizer such as $J(Q) \triangleq \sum_{i \ge 1} |w_i|$ as the measure of smoothness, then a function that has a sparse representation in the dictionary $\{\phi_i\}_{i \ge 1}$ is, by definition, a smooth function, even though there is not necessarily any connection to derivative-based smoothness.

REG-LSTD approximately evaluates the policy $\pi_k$ by solving the following coupled optimization problems:

$$\hat{h}_n(\cdot\,; Q) = \operatorname*{argmin}_{h \in \mathcal{F}^{|\mathcal{A}|}} \left[ \left\| h - \hat{T}^{\pi_k} Q \right\|_{\mathcal{D}_n}^2 + \lambda_{h,n}^{(k)} J^2(h) \right], \qquad (15)$$
$$\hat{Q}^{(k)} = \operatorname*{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \left[ \left\| Q - \hat{h}_n(\cdot\,; Q) \right\|_{\mathcal{D}_n}^2 + \lambda_{Q,n}^{(k)} J^2(Q) \right]. \qquad (16)$$

Note that the difference between (7)-(8) (respectively, (11)-(12)) and (13)-(14) (respectively, (15)-(16)) is the addition of the regularizers $J^2(h)$ and $J^2(Q)$. Unlike the non-regularized case described in Section 3, the solutions of REG-BRM and REG-LSTD are not the same.
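To see how the coupled problems (15)-(16) untangle, consider the parametric special case of a linear space with $J$ the Euclidean norm of the weights. This is only a sketch of the structure, not the RKHS closed-form solution derived in Section 4.2. Substituting the ridge solution of (15), $\hat{h}_n(\cdot\,;Q) = A(R + \gamma \Phi' w)$ with smoothing matrix $A = \Phi(\Phi^\top\Phi + n\lambda_h I)^{-1}\Phi^\top$, into (16) leaves a single ridge problem in the weights of $Q$ (all data below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 6
gamma = 0.9
Phi = rng.normal(size=(n, d))        # features of (X_i, A_i)
Phi_next = rng.normal(size=(n, d))   # features of (X'_i, pi_k(X'_i))
R = rng.normal(size=n)
lam_h, lam_Q = 1e-2, 1e-2            # illustrative regularization coefficients

# Step (15): for Q = Phi w, the ridge projection of hat T^{pi_k} Q is
# h_n(.; Q) = A (R + gamma Phi_next w), with smoothing matrix A.
A = Phi @ np.linalg.solve(Phi.T @ Phi + n * lam_h * np.eye(d), Phi.T)

# Step (16): substituting h_n(.; Q) gives a ridge problem in w:
#   min_w (1/n) || (Phi - gamma A Phi_next) w - A R ||^2 + lam_Q ||w||^2.
B = Phi - gamma * A @ Phi_next
c = A @ R
w = np.linalg.solve(B.T @ B + n * lam_Q * np.eye(d), B.T @ c)

# Optimality check: the gradient of the objective vanishes at w.
grad = 2.0 / n * B.T @ (B @ w - c) + 2.0 * lam_Q * w
assert np.allclose(grad, 0.0, atol=1e-8)
```

The same substitute-then-solve structure underlies the RKHS case, where the representer theorem reduces both problems to finite-dimensional ones.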
As a result of the regularized projection in (13) and (15), the function $\hat h_n(\cdot\,;Q) - \hat T^{\pi_k} Q$ is not orthogonal to the function space $F^{|A|}$, even if $F^{|A|}$ is a linear space. Therefore, the Pythagorean theorem is not applicable anymore: $\|Q - \hat h_n(\cdot\,;Q)\|^2 \neq \|Q - \hat T^{\pi_k} Q\|^2 - \|\hat h_n(\cdot\,;Q) - \hat T^{\pi_k} Q\|^2$ (see Figure 1b). One may ask why we have regularization terms in both optimization problems, as opposed to only in the projection term (15) (similar to the Lasso-TD algorithm; Kolter and Ng 2009; Ghavamzadeh et al. 2011) or only in (16) (similar to Geist and Scherrer 2012; Avila Pires and Szepesvári 2012). We discuss this question in Section 4.1. Briefly speaking, for large function spaces such as the Sobolev spaces or an RKHS with a universal kernel, if we remove the regularization term in (15), the coupled optimization problems reduce to (unmodified) BRM, which is biased as discussed earlier; whereas if the regularization term in (16) is removed, the solution can be arbitrarily bad due to overfitting. Finally, note that the choice of the function space $F^{|A|}$, the regularizer $J$, and the regularization coefficients $\lambda_{Q,n}^{(k)}$ and $\lambda_{h,n}^{(k)}$ all affect the sample efficiency of the algorithms. If one knew $J(Q^\pi)$, the regularization coefficients could be chosen optimally. Nonetheless, the value of $J(Q^\pi)$ is often not known, so one has to use a model selection procedure to choose the best function space and regularization coefficients. The situation is similar to the problem of model selection in supervised learning (though the solutions are different). After developing some tools necessary for discussing this issue in Section 5, we return to the problem of choosing the regularization coefficients after Theorem 11 as well as in Section 6.

8. A pseudo-norm $J$ satisfies all properties of a norm except that $J(Q) = 0$ does not imply that $Q = 0$.

Remark 7 To the best of our knowledge, Antos et al.
(2008b) were the first to explicitly consider LSTD as the optimizer of the loss function (9). Their discussion was mainly to prove the equivalence of modified BRM (5) and LSTD when $F^{|A|}$ is a linear function space. In their work, the loss function is not used to derive any new algorithm. Farahmand et al. (2009b) used this loss function to develop the regularized variant of LSTD, (15)-(16). This loss function was later called the mean-square projected Bellman error by Sutton et al. (2009), and was used to derive the GTD2 and TDC algorithms.

4.1 Why Two Regularizers?

We discuss why using regularizers in both optimization problems (15) and (16) of REG-LSTD is necessary for large function spaces such as the Sobolev spaces and RKHS with universal kernels. Here we show that for large function spaces, depending on which regularization term we remove, either the coupled optimization problems reduce to the regularized variant of the unmodified BRM, which has a bias, or the solution can be arbitrarily bad. Let us focus on REG-LSTD for a given policy $\pi$. Assume that the function space $F^{|A|}$ is rich enough in the sense that it is dense in the space of continuous functions w.r.t. the supremum norm. This is satisfied by many large function spaces such as RKHS with universal kernels (Definition 4.52 of Steinwart and Christmann 2008) and the Sobolev spaces on compact domains. We consider what would happen if, instead of the current formulation of REG-LSTD (15)-(16), we only used a regularizer in either the first or the second optimization problem. We study each case separately. For notational simplicity, we omit the dependence on the iteration number $k$.

Case 1. In this case, we only regularize the empirical error $\|Q - \hat h_n(\cdot\,;Q)\|_{D_n}^2$, but we do not regularize the projection, i.e.,

$$\hat h_n(\cdot\,;Q) = \operatorname*{argmin}_{h \in F^{|A|}} \|h - \hat T^{\pi} Q\|_{D_n}^2,$$

$$\hat Q = \operatorname*{argmin}_{Q \in F^{|A|}} \left[ \|Q - \hat h_n(\cdot\,;Q)\|_{D_n}^2 + \lambda_{Q,n} J^2(Q) \right]. \qquad (17)$$
When the function space $F^{|A|}$ is rich enough, there exists a function $\hat h_n \in F^{|A|}$ that fits its target values at the data points $\{(X_i, A_i)\}_{i=1}^n$ perfectly, that is, $\hat h_n((X_i, A_i); Q) = (\hat T^\pi Q)(X_i, A_i)$ for $i = 1, \dots, n$.^9 Such a function is indeed a minimizer of the loss $\|h - \hat T^\pi Q\|_{D_n}^2$. The second optimization problem (17) then becomes

$$\hat Q = \operatorname*{argmin}_{Q \in F^{|A|}} \left[ \|Q - \hat T^\pi Q\|_{D_n}^2 + \lambda_{Q,n} J^2(Q) \right].$$

This is the regularized version of the original (i.e., unmodified) formulation of the BRM algorithm. As discussed in Section 3.1, the unmodified BRM algorithm is biased when the MDP is not deterministic. Adding a regularizer does not solve the biasedness problem of the unmodified BRM loss. So without regularizing the first optimization problem, the function $\hat h_n$ overfits to the noise, and as a result the whole algorithm becomes incorrect.

Case 2. In this case, we only regularize the empirical projection $\|h - \hat T^\pi Q\|_{D_n}^2$, but we do not regularize $\|Q - \hat h_n(\cdot\,;Q)\|_{D_n}^2$, i.e.,

$$\hat h_n(\cdot\,;Q) = \operatorname*{argmin}_{h \in F^{|A|}} \left[ \|h - \hat T^\pi Q\|_{D_n}^2 + \lambda_{h,n} J^2(h) \right],$$

$$\hat Q = \operatorname*{argmin}_{Q \in F^{|A|}} \|Q - \hat h_n(\cdot\,;Q)\|_{D_n}^2.$$

For a fixed $Q$, the first optimization problem is the standard regularized regression estimator with the regression function $\mathbb E\left[ (\hat T^\pi Q)(X, A) \mid X = x, A = a \right] = (T^\pi Q)(x, a)$. Therefore, if the function space $F^{|A|}$ is rich enough and we set the regularization coefficient $\lambda_{h,n}$ properly, $\|\hat h_n(\cdot\,;Q) - T^\pi Q\|_\nu$ and $\|\hat h_n(\cdot\,;Q) - T^\pi Q\|_{D_n}$ go to zero as the sample size grows (the rate of convergence depends on the complexity of the target function; cf. Lemma 15 and Theorem 16). So we can expect $\hat h_n(\cdot\,;Q)$ to get closer to $T^\pi Q$ as the sample size grows. For simplicity of the discussion, suppose that we are in the ideal situation where for any $Q$, we have $\hat h_n((x,a); Q) = (T^\pi Q)(x,a)$ for all $(x,a) \in \{(X_i, A_i)\}_{i=1}^n \cup \{(X_i', \pi(X_i'))\}_{i=1}^n$, that

9. To be more precise: First, for an $\varepsilon > 0$, we construct the continuous function $h_\varepsilon(z) = \sum_{Z_i \in \{(X_i,A_i)\}_{i=1}^n} \max\left\{1 - \frac{\|z - Z_i\|}{\varepsilon}, 0\right\} (\hat T^\pi Q)(Z_i)$.
We then use the denseness of the function space $F^{|A|}$ w.r.t. the supremum norm to argue that there exists $\bar h_\varepsilon \in F^{|A|}$ such that $\|\bar h_\varepsilon - h_\varepsilon\|_\infty$ is arbitrarily close to zero. So as $\varepsilon \to 0$, the value of the function $\bar h_\varepsilon$ is arbitrarily close to $\hat T^\pi Q$ at the data points. We then choose $\hat h_n(\cdot\,;Q) = \bar h_\varepsilon$. This construction is similar to Theorem 2 of Nadler et al. (2009). See also the argument in Case 2 for more detail.

is, we precisely know $T^\pi Q$ at all data points.^10 Substituting this $\hat h_n((x,a); Q)$ into the second optimization problem of this case, we get that we are solving the following optimization problem:

$$\hat Q = \operatorname*{argmin}_{Q \in F^{|A|}} \|Q - T^\pi Q\|_{D_n}^2. \qquad (19)$$

This is the Bellman error minimization problem. We do not have the biasedness problem here, as we have $T^\pi Q$ instead of $\hat T^\pi Q$ in the loss. Nonetheless, we face another problem: minimizing this empirical risk without controlling the complexity of the function space might lead to an overfitted solution, very similar to the same phenomenon in supervised learning. To see this more precisely, we first construct the continuous function

$$\bar Q_\varepsilon(z) = \sum_{Z_i \in \{(X_i,A_i)\}_{i=1}^n \cup \{(X_i',\pi(X_i'))\}_{i=1}^n} \max\left\{1 - \frac{\|z - Z_i\|}{\varepsilon}, 0\right\} Q^\pi(Z_i),$$

which for small enough $\varepsilon > 0$ has the property that $\|\bar Q_\varepsilon - T^\pi \bar Q_\varepsilon\|_{D_n}^2$ is zero, i.e., it is a minimizer of the empirical loss. Due to the denseness of $F^{|A|}$, we can find a $Q_\varepsilon \in F^{|A|}$ that is arbitrarily close to the continuous function $\bar Q_\varepsilon$. Therefore, for small enough $\varepsilon$, the function $Q_\varepsilon$ is a minimizer of (19), i.e., the value of $\|Q_\varepsilon - T^\pi Q_\varepsilon\|_{D_n}^2$ is zero. But $Q_\varepsilon$ is not a good approximation of $Q^\pi$, because $Q_\varepsilon$ consists of spikes in the $\varepsilon$-neighbourhood of the data points and is zero elsewhere. In other words, $Q_\varepsilon$ does not generalize well beyond the data points when $\varepsilon$ is chosen to be small. Of course, the solution is to control the complexity of $F^{|A|}$ so that spiky functions such as $Q_\varepsilon$ are not selected as the solution of the optimization problem. When we regularize both optimization problems, as we do in this work, neither of these problems occurs.
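The overfitting phenomenon of Case 2 is easy to reproduce numerically. The following sketch is our own illustration (not one of the paper's algorithms): it fits noisy targets by kernel least-squares with a narrow RBF kernel, once with a vanishing regularization coefficient and once with $\lambda > 0$. The unpenalized fit interpolates the noise, the analogue of the spiky $Q_\varepsilon$, while the penalized fit trades training error for a far smaller RKHS norm.

```python
import numpy as np

# Illustration of the Case 2 failure mode (our sketch, not the paper's
# algorithm): kernel least-squares on noisy targets with and without a
# ridge penalty. With lambda ~ 0 the fit interpolates the noise (the
# analogue of the spiky Q_eps); with lambda > 0 the fit is much smoother
# (smaller RKHS norm alpha' K alpha) at the price of some training error.
rng = np.random.default_rng(0)
n = 30
X = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(n)  # noisy targets

def rbf_gram(a, b, bw=0.02):
    # Gaussian (RBF) kernel Gram matrix for 1-d inputs
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bw ** 2))

K = rbf_gram(X, X)
alpha_interp = np.linalg.solve(K + 1e-10 * np.eye(n), y)   # lambda -> 0
alpha_ridge = np.linalg.solve(K + n * 0.1 * np.eye(n), y)  # lambda = 0.1

train_err_interp = np.mean((K @ alpha_interp - y) ** 2)    # ~ 0: overfits
train_err_ridge = np.mean((K @ alpha_ridge - y) ** 2)      # > 0
norm_interp = alpha_interp @ K @ alpha_interp              # RKHS norm^2
norm_ridge = alpha_ridge @ K @ alpha_ridge                 # much smaller
```

The interpolant achieves essentially zero empirical loss, exactly as $Q_\varepsilon$ does in the argument above, yet its RKHS norm (the quantity a regularizer would control) is far larger than that of the penalized solution.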
This argument applies to rich function spaces that can approximate any reasonably complex function (e.g., continuous functions) arbitrarily well. If the function space $F^{|A|}$ is much more limited, for example if it is a parametric function space, we may not need to regularize both optimization problems. An example of such an approach for parametric spaces has been analyzed by Avila Pires and Szepesvári (2012).

4.2 Closed-Form Solutions

In this section we provide closed-form solutions for (13)-(14) and (15)-(16) in two cases: 1) when $F^{|A|}$ is a finite-dimensional linear space and $J^2(\cdot)$ is defined as a weighted squared sum of the parameters describing the function (a setup similar to ridge regression; Hoerl and Kennard 1970), and 2) when $F^{|A|}$ is an RKHS and $J(\cdot)$ is the corresponding inner-product norm, i.e., $J^2(\cdot) = \|\cdot\|_{\mathcal H}^2$. Here we use a generic $\pi$ and $D_n$ instead of $\pi_k$ and $D_n^{(k)}$ at the kth iteration.

10. This is an ideal situation because 1) $\|\hat h_n - \hat T^\pi Q\|_\nu$ is equal to zero only asymptotically and not in the finite-sample regime, and 2) even if $\|\hat h_n - \hat T^\pi Q\|_\nu = 0$, it does not imply that $\hat h_n(x,a;Q) = (T^\pi Q)(x,a)$ almost surely on $X \times A$. Nonetheless, these simplifications are only in favour of the algorithm considered in this case, so for simplicity of the discussion we assume that they hold.

4.2.1 A Parametric Formulation for REG-BRM and REG-LSTD

In this section we consider the case when $h$ and $Q$ are both given as linear combinations of some basis functions:

$$h(\cdot) = \phi(\cdot)^\top u, \qquad Q(\cdot) = \phi(\cdot)^\top w, \qquad (20)$$

where $u, w \in \mathbb R^p$ are parameter vectors and $\phi(\cdot) \in \mathbb R^p$ is a vector of $p$ linearly independent basis functions defined over the space of state-action pairs.^11 These basis functions might be predefined (e.g., Fourier (Konidaris et al., 2011) or wavelets) or constructed data-dependently by one of the already-mentioned feature generation methods. We further assume that the regularization terms take the form

$$J^2(h) = u^\top \Psi u, \qquad J^2(Q) = w^\top \Psi w,$$
for some user-defined choice of a positive definite matrix $\Psi \in \mathbb R^{p \times p}$. A simple and common choice is $\Psi = I$. Define $\Phi, \Phi' \in \mathbb R^{n \times p}$ and $\mathbf r \in \mathbb R^n$ as follows:

$$\Phi = \left( \phi(Z_1), \dots, \phi(Z_n) \right)^\top, \qquad \Phi' = \left( \phi(Z_1'), \dots, \phi(Z_n') \right)^\top, \qquad \mathbf r = (R_1, \dots, R_n)^\top, \qquad (21)$$

with $Z_i = (X_i, A_i)$ and $Z_i' = (X_i', \pi(X_i'))$. The solution to REG-BRM is given by the following proposition.

Proposition 8 (Closed-form solution for REG-BRM) Under the setting of this section, the approximate action-value function returned by REG-BRM is $\hat Q(\cdot) = \phi(\cdot)^\top w^*$, where

$$w^* = \left[ B^\top B - \gamma^2 C^\top C + n\lambda_{Q,n}\Psi \right]^{-1} \left( B^\top + \gamma C^\top (\Phi A - I) \right) \mathbf r,$$

with $A = \left( \Phi^\top \Phi + n\lambda_{h,n}\Psi \right)^{-1}\Phi^\top$, $B = \Phi - \gamma\Phi'$, and $C = (\Phi A - I)\Phi'$.

Proof Using (20) and (21), we can rewrite (13)-(14) as

$$u^*(w) = \operatorname*{argmin}_{u \in \mathbb R^p} \frac{1}{n}\left[ \Phi u - (\mathbf r + \gamma\Phi' w) \right]^\top \left[ \Phi u - (\mathbf r + \gamma\Phi' w) \right] + \lambda_{h,n}\, u^\top \Psi u, \qquad (22)$$

$$w^* = \operatorname*{argmin}_{w \in \mathbb R^p} \frac{1}{n}\left[ \Phi w - (\mathbf r + \gamma\Phi' w) \right]^\top \left[ \Phi w - (\mathbf r + \gamma\Phi' w) \right] \qquad (23)$$
$$\qquad\qquad - \frac{1}{n}\left[ \Phi u^*(w) - (\mathbf r + \gamma\Phi' w) \right]^\top \left[ \Phi u^*(w) - (\mathbf r + \gamma\Phi' w) \right] + \lambda_{Q,n}\, w^\top \Psi w.$$

Taking the derivative of (22) w.r.t. $u$ and equating it to zero, we obtain $u^*$ as a function of $w$:

$$u^*(w) = \left( \Phi^\top \Phi + n\lambda_{h,n}\Psi \right)^{-1} \Phi^\top (\mathbf r + \gamma\Phi' w) = A(\mathbf r + \gamma\Phi' w). \qquad (24)$$

Plugging $u^*(w)$ from (24) into (23), taking the derivative w.r.t. $w$, and equating it to zero, we obtain the parameter vector $w^*$ as announced above.

The solution returned by REG-LSTD is given in the following proposition.

11. At the cost of using generalized inverses, everything in this section extends to the case when the basis functions are not linearly independent.

Proposition 9 (Closed-form solution for REG-LSTD) Under the setting of this section, the approximate action-value function returned by REG-LSTD is $\hat Q(\cdot) = \phi(\cdot)^\top w^*$, where

$$w^* = \left[ E^\top E + n\lambda_{Q,n}\Psi \right]^{-1} E^\top \Phi A\, \mathbf r,$$

with $A = \left( \Phi^\top \Phi + n\lambda_{h,n}\Psi \right)^{-1}\Phi^\top$ and $E = \Phi - \gamma\,\Phi A\,\Phi'$.

Proof Using (20) and (21), we can rewrite (15)-(16) as

$$u^*(w) = \operatorname*{argmin}_{u \in \mathbb R^p} \frac{1}{n}\left[ \Phi u - (\mathbf r + \gamma\Phi' w) \right]^\top \left[ \Phi u - (\mathbf r + \gamma\Phi' w) \right] + \lambda_{h,n}\, u^\top \Psi u, \qquad (25)$$

$$w^* = \operatorname*{argmin}_{w \in \mathbb R^p} \left\{ \frac{1}{n}\left[ \Phi w - \Phi u^*(w) \right]^\top \left[ \Phi w - \Phi u^*(w) \right] + \lambda_{Q,n}\, w^\top \Psi w \right\}. \qquad (26)$$

Similar to the parametric REG-BRM, we solve (25) and obtain $u^*(w)$, which is the same as (24). If we plug this $u^*(w)$ into (26), take the derivative w.r.t.
$w$, and find the minimizer, the parameter vector $w^*$ is as announced.

4.2.2 RKHS Formulation for REG-BRM and REG-LSTD

The class of reproducing kernel Hilbert spaces provides a flexible and powerful family of function spaces from which to choose $F^{|A|}$. An RKHS $\mathcal H$ of functions over $X \times A$ is defined by a positive definite kernel $k : (X \times A) \times (X \times A) \to \mathbb R$. With such a choice, we can use the corresponding squared RKHS norm $\|\cdot\|_{\mathcal H}^2$ as the regularizer $J^2(\cdot)$. REG-BRM with an RKHS function space $F^{|A|} = \mathcal H$ would be

$$\hat h_n(\cdot\,;Q) = \operatorname*{argmin}_{h \in F^{|A|}[=\mathcal H]} \|h - \hat T^\pi Q\|_{D_n}^2 + \lambda_{h,n}\|h\|_{\mathcal H}^2, \qquad (27)$$

$$\hat Q = \operatorname*{argmin}_{Q \in F^{|A|}[=\mathcal H]} \|Q - \hat T^\pi Q\|_{D_n}^2 - \|\hat h_n(\cdot\,;Q) - \hat T^\pi Q\|_{D_n}^2 + \lambda_{Q,n}\|Q\|_{\mathcal H}^2, \qquad (28)$$

and the coupled optimization problems for REG-LSTD are

$$\hat h_n(\cdot\,;Q) = \operatorname*{argmin}_{h \in F^{|A|}[=\mathcal H]} \|h - \hat T^\pi Q\|_{D_n}^2 + \lambda_{h,n}\|h\|_{\mathcal H}^2, \qquad (29)$$

$$\hat Q = \operatorname*{argmin}_{Q \in F^{|A|}[=\mathcal H]} \|Q - \hat h_n(\cdot\,;Q)\|_{D_n}^2 + \lambda_{Q,n}\|Q\|_{\mathcal H}^2. \qquad (30)$$

We can solve these coupled optimization problems by applying the generalized representer theorem for RKHS (Schölkopf et al., 2001). The result, stated in the next theorem, shows that the infinite-dimensional optimization problem defined on $F^{|A|} = \mathcal H$ boils down to a finite-dimensional problem whose dimension is twice the number of data points.

Theorem 10 Let $\tilde Z$ denote the vector $\tilde Z = (Z_1, \dots, Z_n, Z_1', \dots, Z_n')^\top$. Then the optimizer $\hat Q \in \mathcal H$ of (27)-(28) can be written as $\hat Q(\cdot) = \sum_{i=1}^{2n} \alpha_i k(\tilde Z_i, \cdot)$ for some values of $\boldsymbol\alpha \in \mathbb R^{2n}$. The same holds for the solution to (29)-(30). Further, the coefficient vectors can be obtained in the following form:

REG-BRM: $\quad \boldsymbol\alpha^{\mathrm{BRM}} = \left( C K_Q + n\lambda_{Q,n} I \right)^{-1} \left( D^\top + \gamma\, C_2^\top B^\top B \right) \mathbf r,$

REG-LSTD: $\quad \boldsymbol\alpha^{\mathrm{LSTD}} = \left( F^\top F K_Q + n\lambda_{Q,n} I \right)^{-1} F^\top E\, \mathbf r,$

where $\mathbf r = (R_1, \dots, R_n)^\top$ and the matrices $K_Q, B, C, C_2, D, E, F$ are defined as follows: $K_h \in \mathbb R^{n \times n}$ is defined by $[K_h]_{ij} = k(Z_i, Z_j)$, $1 \le i, j \le n$, and $K_Q \in \mathbb R^{2n \times 2n}$ by $[K_Q]_{ij} = k(\tilde Z_i, \tilde Z_j)$, $1 \le i, j \le 2n$. Let $C_1 = \left[\, I_{n\times n} \;\; 0_{n\times n} \,\right]$ and $C_2 = \left[\, 0_{n\times n} \;\; I_{n\times n} \,\right]$. Denote $D = C_1 - \gamma C_2$, $E = K_h(K_h + n\lambda_{h,n} I)^{-1}$, $F = C_1 - \gamma E C_2$, $B = K_h(K_h + n\lambda_{h,n} I)^{-1} - I$, and $C = D^\top D - \gamma^2 (B C_2)^\top (B C_2)$.

Proof See Appendix A.
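As a sanity check on the two closed forms, the sketch below (our illustration, with randomly generated features standing in for real data) computes the REG-LSTD coefficient vector of Theorem 10 for a linear kernel $k(z, z') = \phi(z)^\top \phi(z')$ and compares the resulting weight vector against the parametric closed form of Proposition 9 with $\Psi = I$; with a linear kernel and linearly independent features, the two formulations solve the same optimization problem and should agree.

```python
import numpy as np

# Cross-check (our sketch): RKHS REG-LSTD (Theorem 10) with a linear
# kernel reduces to parametric REG-LSTD (Proposition 9 with Psi = I).
rng = np.random.default_rng(2)
n, p, gamma = 60, 4, 0.9
Phi = rng.standard_normal((n, p))        # rows: phi(Z_i)
Phi_next = rng.standard_normal((n, p))   # rows: phi(Z'_i)
r = rng.standard_normal(n)               # rewards R_i (illustrative)
lam_h = lam_Q = 0.05

# --- RKHS solution (Theorem 10), linear kernel k(z,z') = phi(z)'phi(z')
Z_all = np.vstack([Phi, Phi_next])       # features of (Z_1..Z_n, Z'_1..Z'_n)
K_h = Phi @ Phi.T                        # n x n Gram matrix
K_Q = Z_all @ Z_all.T                    # 2n x 2n Gram matrix
C1 = np.hstack([np.eye(n), np.zeros((n, n))])
C2 = np.hstack([np.zeros((n, n)), np.eye(n)])
E = K_h @ np.linalg.inv(K_h + n * lam_h * np.eye(n))
F = C1 - gamma * E @ C2
alpha = np.linalg.solve(F.T @ F @ K_Q + n * lam_Q * np.eye(2 * n),
                        F.T @ E @ r)
w_rkhs = Z_all.T @ alpha                 # Q_hat(z) = phi(z)' w_rkhs

# --- parametric solution (Proposition 9, Psi = I)
A = np.linalg.solve(Phi.T @ Phi + n * lam_h * np.eye(p), Phi.T)
E_par = Phi - gamma * Phi @ A @ Phi_next
w_par = np.linalg.solve(E_par.T @ E_par + n * lam_Q * np.eye(p),
                        E_par.T @ Phi @ A @ r)
```

The agreement rests on the matrix identity $\Phi(\Phi^\top\Phi + n\lambda I)^{-1}\Phi^\top = K_h(K_h + n\lambda I)^{-1}$, which makes the smoothed fitted values of the inner problem coincide in the two formulations.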
5. Theoretical Analysis

In this section, we analyze the statistical properties of REG-LSPI and provide a finite-sample upper bound on the performance loss $\|Q^* - Q^{\pi_K}\|_{1,\rho}$. Here, $\pi_K$ is the policy greedy w.r.t. $\hat Q^{(K-1)}$ and $\rho$ is the performance evaluation measure. The distribution $\rho$ is chosen by the user and is often different from the sampling distribution $\nu$. Our study has two main parts. First, we analyze the policy evaluation error of REG-LSTD in Section 5.1. We suppose that, given any policy $\pi$, we obtain $\hat Q$ by solving (15)-(16) with $\pi_k$ in these equations replaced by $\pi$. Theorem 11 provides an upper bound on the Bellman error $\|\hat Q - T^\pi \hat Q\|_\nu$. We discuss the optimality of this upper bound for policy evaluation for some general classes of function spaces. We show that the result is not only optimal in its convergence rate, but also in its dependence on $J(Q^\pi)$. After that, in Section 5.2, we show how the Bellman errors of the policy evaluation procedure propagate through the API procedure (Theorem 13). The main result of this paper, which is an upper bound on the performance loss $\|Q^* - Q^{\pi_K}\|_{1,\rho}$, is stated as Theorem 14 in Section 5.3, followed by its discussion. We compare this work's statistical guarantee with those of some other papers in Section 5.3.1. To analyze the statistical performance of the REG-LSPI procedure, we make the following assumptions. We discuss their implications and possible relaxations after stating each of them.

Assumption A1 (MDP Regularity) The set of states $X$ is a compact subset of $\mathbb R^d$. The random immediate rewards $R_t \sim \mathcal R(\cdot|X_t, A_t)$ ($t = 1, 2, \dots$) as well as the expected immediate rewards $r(x,a)$ are uniformly bounded by $R_{\max}$, i.e., $|R_t| \le R_{\max}$ ($t = 1, 2, \dots$) and $\|r\|_\infty \le R_{\max}$.

Even though the algorithms were presented for a general measurable state space $X$, the theoretical results are stated for problems whose state space is a compact subset of $\mathbb R^d$. Generalizing Assumption A1 to other state spaces should be possible under certain regularity conditions.
One example could be any Polish space, i.e., a separable, completely metrizable topological space. Nevertheless, we do not investigate such generalizations here. The boundedness of the rewards is a reasonable assumption that can be replaced by a more relaxed condition such as sub-Gaussianity (Vershynin, 2012; van de Geer, 2000). This relaxation, however, increases the technicality of the proofs without adding much to the intuition. We remark on the compactness assumption after stating Assumption A4.

Assumption A2 (Sampling) At iteration $k$ of REG-LSPI (for $k = 0, \dots, K-1$), $n$ fresh independent and identically distributed (i.i.d.) samples are drawn from a distribution $\nu \in \mathcal M(X \times A)$, i.e., $D_n^{(k)} = \left\{ \left( Z_t^{(k)}, R_t^{(k)}, X_t'^{(k)} \right) \right\}_{t=1}^n$ with $Z_t^{(k)} = (X_t^{(k)}, A_t^{(k)}) \overset{\text{i.i.d.}}{\sim} \nu$ and $X_t'^{(k)} \sim P(\cdot|X_t^{(k)}, A_t^{(k)})$.

The i.i.d. requirement of Assumption A2 is primarily used to simplify the proofs. With much extra effort, these results can be extended to the case when the data samples belong to a single trajectory generated by a fixed policy. In the single-trajectory scenario, samples are not independent anymore, but under certain conditions on the Markov process, the process $(X_t, A_t)$ gradually forgets its past. One way to quantify this forgetting is through mixing processes. For these processes, tools such as the independent-blocks technique (Yu, 1994; Doukhan, 1994) or information-theoretical inequalities (Samson, 2000) can be used to carry out the analysis, as has been done by Antos et al. (2008b) in the API context, by Farahmand and Szepesvári (2012) for analyzing the regularized regression problem, and by Farahmand and Szepesvári (2011) in the context of model selection for RL problems. It is worthwhile to emphasize that we do not require the distribution $\nu$ to be known. The sampling distribution is also generally different from the distribution induced by the target policy $\pi_k$.
For example, the samples might be generated by drawing state samples from a given $\nu_X$ and choosing actions according to a behavior policy $\pi_b$, which is different from the policy being evaluated. So we are in the off-policy sampling setting. Moreover, changing $\nu$ at each iteration based on the previous iterations is a possibility with potential practical benefits, which has theoretical justifications in the context of imitation learning (Ross et al., 2011). For simplicity of the analysis, however, we assume that $\nu$ is fixed in all iterations. Finally, we note that the proofs work fine if we reuse the same data sets in all iterations. We comment on this later, after the proof of Theorem 11 in Appendix B.

Assumption A3 (Regularizer) Define two regularization functionals $J : B(X) \to \mathbb R$ and $J : B(X \times A) \to \mathbb R$ that are pseudo-norms on $F$ and $F^{|A|}$, respectively.^12 For all $Q \in F^{|A|}$ and $a \in A$, we have $J(Q(\cdot, a)) \le J(Q)$.

The regularizer $J(Q)$ measures the complexity of an action-value function $Q$. The functions that are more complex have larger values of $J(Q)$. We also need to define a related regularizer for the value functions $Q(\cdot, a)$ ($a \in A$). The latter regularizer is not explicitly used in the algorithm; it is only used in the analysis. This assumption imposes some mild restrictions on these regularization functionals. The condition that the regularizers be pseudo-norms is satisfied by many commonly-used regularizers such as the Sobolev norms, the RKHS norms, and the $\ell_2$-regularizer defined in Section 4.2.1 with a positive semidefinite choice of the matrix $\Psi$.

12. Note that here we are slightly abusing notation, as the same symbol is used for the regularizer over both $B(X)$ and $B(X \times A)$. However, this should not cause any confusion since in any specific expression the identity of the regularizer should always be clear from the context.
Moreover, the condition $J(Q(\cdot, a)) \le J(Q)$ essentially states that the complexity of $Q$ should upper bound the complexity of $Q(\cdot, a)$ for all $a \in A$. If the regularizer $J : B(X \times A) \to \mathbb R$ is derived from a regularizer $J : B(X) \to \mathbb R$ through $J(Q) = \left\| \left( J(Q(\cdot, a)) \right)_{a \in A} \right\|_p$ for some $p \in [1, \infty]$, then $J$ will satisfy the second part of the assumption. From a computational perspective, a natural choice for an RKHS is to choose $p = 2$ and to define $J^2(Q) = \sum_{a \in A} \|Q(\cdot, a)\|_{\mathcal H}^2$, for $\mathcal H$ being the RKHS defined on $X$.

Assumption A4 (Capacity of Function Space) For $R > 0$, let $F_R = \{f \in F : J(f) \le R\}$. There exist constants $C > 0$ and $0 < \alpha < 1$ such that for any $u, R > 0$ the following metric entropy condition is satisfied:

$$\log \mathcal N(u, F_R) \le C \left( \frac{R}{u} \right)^{2\alpha}.$$

This assumption characterizes the capacity of the ball with radius $R$ in $F$. The value of $\alpha$ is an essential quantity in our upper bounds. The metric entropy is precisely defined in Appendix G, but roughly speaking, it is the logarithm of the minimum number of balls with radius $u$ that are required to completely cover a ball with radius $R$ in $F$. This is a measure of the complexity of a function space, as it is more difficult to estimate a function when the metric entropy grows fast as $u$ decreases. As a simple example, when the function space is finite, we effectively need a good estimate of $|F|$ functions in order not to choose the wrong one. In this case, $\mathcal N(u, F_R)$ can be replaced by $|F|$, so $\alpha = 0$ and $C = \log |F|$. When the state space $X$ is finite and all functions are bounded by $Q_{\max}$, we have $\log \mathcal N(u, F_R) \le \log \mathcal N(u, F) = |X| \log\left( \frac{2Q_{\max}}{u} \right)$. This shows that the metric entropy for problems with finite state spaces grows much more slowly than what we consider here. Assumption A4 is suitable for large function spaces and is indeed satisfied for the Sobolev spaces and various RKHS. Refer to van de Geer (2000); Zhou (2002, 2003); Steinwart and Christmann (2008) for many examples. An alternative assumption would be a similar metric entropy condition for the balls in $F^{|A|}$ (instead of $F$).
This would slightly change a few steps of the proofs, but leave the results essentially the same. Moreover, it would make the requirement $J(Q(\cdot, a)) \le J(Q)$ in Assumption A3 unnecessary. Nevertheless, as results on the capacity of $F$ are more common in the statistical learning theory literature, we stick to the combination of Assumptions A3 and A4. The metric entropy here is defined w.r.t. the supremum norm. All proofs, except that of Lemma 23, only require the same bound to hold when the supremum norm is replaced by the more relaxed empirical $L_2$-norm; i.e., those results require that there exist constants $C > 0$ and $0 < \alpha < 1$ such that for any $u, R > 0$ and all $x_1, \dots, x_n \in X$, we have $\log \mathcal N_2(u, F_R, x_{1:n}) \le C \left( \frac{R}{u} \right)^{2\alpha}$. Of course, the metric entropy w.r.t. the supremum norm implies the one with the empirical norm. It is an interesting question to relax the supremum norm assumption in Lemma 23. We can now remark on the requirement that $X$ be compact (Assumption A1). We stated that requirement mainly because most of the metric entropy results in the literature are for compact spaces (one exception is Theorem 7.34 of Steinwart and Christmann (2008), which relaxes the compactness requirement by adding some assumptions on the tail of $\nu_X$ on $X$). So we could remove the compactness requirement from Assumption A1 and implicitly let Assumption A4 satisfy it, but we preferred to be explicit about it at the cost of a bit of redundancy in our set of assumptions.

Assumption A5 (Function Space Boundedness) The subset $F^{|A|} \subset B(X \times A; Q_{\max})$ is a separable and complete Carathéodory set with $R_{\max} \le Q_{\max} < \infty$.

Assumption A5 requires all the functions in $F^{|A|}$ to be bounded so that the solutions of the optimization problems (15)-(16) stay bounded. If they are not, they should be truncated, and thus the truncation argument should be used in the analysis; see, e.g., the proof of Theorem 21.1 of Györfi et al. (2002).
The truncation argument does not change the final result, but complicates the proof in several places, so we stick to the above assumption to avoid unnecessary clutter. Moreover, in order to avoid the measurability issues resulting from taking the supremum over an uncountable function space $F^{|A|}$, we require the space to be a separable and complete Carathéodory set (cf. Section 7.3 of Steinwart and Christmann 2008).

Assumption A6 (Function Approximation Property) The action-value function of any policy $\pi$ belongs to $F^{|A|}$, i.e., $Q^\pi \in F^{|A|}$.

This "no function approximation error" assumption is standard in analyzing regularization-based nonparametric methods. This assumption is realistic and is satisfied for rich function spaces such as an RKHS defined by a universal kernel, e.g., the Gaussian or exponential kernels (Section 4.6 of Steinwart and Christmann 2008). On the other hand, if the space is not large enough, we might have a function approximation error. The behavior of the function approximation error for certain classes of small RKHS has been discussed by Smale and Zhou (2003); Steinwart and Christmann (2008). We stick to this assumption as it simplifies many key steps in the proofs.

Assumption A7 (Expansion of Smoothness) For all $Q \in F^{|A|}$, there exist constants $0 \le L_R, L_P < \infty$, depending only on the MDP and $F^{|A|}$, such that for any policy $\pi$,

$$J(T^\pi Q) \le L_R + \gamma L_P\, J(Q).$$

We require the complexity of $T^\pi Q$ to be comparable to the complexity of $Q$ itself. In other words, we require that if $Q$ is smooth according to the regularizer $J$ of a function space $F^{|A|}$, it stays smooth after the application of the Bellman operator. We believe that this is a reasonable assumption for many classes of MDPs with sufficient stochasticity and when $F^{|A|}$ is rich enough. The intuition is that if the Bellman operator has a smoothing effect, the norm of $T^\pi Q$ does not blow up and the function can still be represented well within $F^{|A|}$.
Proposition 25 in Appendix F presents conditions under which Assumption A7 is satisfied for the so-called convolutional MDPs. Briefly speaking, the conditions are that 1) the transition probability kernel should have a finite gain (in the control-theoretic sense) in its frequency response, and 2) the reward function should be smooth according to the regularizer $J$. Of course, this is only one example of a class of problems for which this assumption holds.

5.1 Policy Evaluation Error

In this section, we focus on the kth iteration of REG-LSPI. To simplify the notation, we use $D_n = \{(Z_t, R_t, X_t')\}_{t=1}^n$ to refer to $D_n^{(k)}$. The policy $\pi_k$ depends on the data used in the earlier iterations, but since we use an independent set of samples $D_n^{(k)}$ for the kth iteration and $\pi_k$ is independent of $D_n^{(k)}$, we can safely ignore the randomness of $\pi_k$ by working on the probability space obtained by conditioning on $D_n^{(0)}, \dots, D_n^{(k-1)}$; i.e., the probability space used in the kth iteration is $(\Omega, \sigma_\Omega, \mathbb P_k)$ with $\mathbb P_k = \mathbb P\left\{ \cdot \mid D_n^{(0)}, \dots, D_n^{(k-1)} \right\}$. In order to avoid clutter, we do not use the conditional probability symbol. In the rest of this section, $\pi$ refers to a $\sigma(D_n^{(0)}, \dots, D_n^{(k-1)})$-measurable policy that is independent of $D_n$; $\hat Q$ and $\hat h_n(Q) = \hat h_n(\cdot\,;Q)$ refer to the solution to (15)-(16) when $\pi$, $\lambda_{h,n}$, and $\lambda_{Q,n}$ replace $\pi_k$, $\lambda_{h,n}^{(k)}$, and $\lambda_{Q,n}^{(k)}$ in that set of equations, respectively. The following theorem is the main result of this section and provides an upper bound on the statistical behavior of the policy evaluation procedure REG-LSTD.

Theorem 11 (Policy Evaluation) For any fixed policy $\pi$, let $\hat Q$ be the solution to the optimization problem (15)-(16) with the choice of

$$\lambda_{h,n} = \lambda_{Q,n} = \left[ \frac{1}{n\, J^2(Q^\pi)} \right]^{\frac{1}{1+\alpha}}.$$

If Assumptions A1-A7 hold, there exists $c(\delta) > 0$ such that for any $n \in \mathbb N$ and $0 < \delta < 1$, we have

$$\|\hat Q - T^\pi \hat Q\|_\nu^2 \le c(\delta)\, n^{-\frac{1}{1+\alpha}},$$

with probability at least $1 - \delta$.
Here $c(\delta)$ is equal to

$$c(\delta) = c_1 \left( 1 + (\gamma L_P)^2 \right) J^{\frac{2\alpha}{1+\alpha}}(Q^\pi) \ln(1/\delta) + c_2 \left( L_R^{\frac{2\alpha}{1+\alpha}} + L_R^2 \right) \left[ J(Q^\pi) \right]^{-\frac{2}{1+\alpha}}$$

for some constants $c_1, c_2 > 0$.

Theorem 11, which is proven in Appendix B, indicates how the number of samples and the difficulty of the problem, as characterized by $J(Q^\pi)$, $L_P$, and $L_R$, influence the policy evaluation error.^13 This upper bound provides some insights about the behavior of the REG-LSTD algorithm. To begin with, it shows that under the specified conditions, REG-LSTD is a consistent algorithm: as the number of samples increases, the Bellman error decreases and asymptotically converges to zero. This is due to the use of a nonparametric function space and the proper control of its complexity through regularization. A parametric function space, e.g., a linear function approximator with a fixed number of features, does not generally have a similar guarantee unless the value function happens to belong to the span of the features. Achieving consistency for parametric function spaces requires a careful choice of features, and might be difficult. On the other hand, a rich enough nonparametric function space, for example one defined by a universal kernel (cf. Assumption A6), ensures the consistency of the policy evaluation algorithm. This theorem, however, is much more powerful than a consistency result, as it also provides a finite-sample upper bound on the error. If the parameters of the REG-LSTD algorithm are selected properly, one may achieve the error upper bound of $O(n^{-1/(1+\alpha)})$. For the case of the Sobolev space $W^k(X)$, with $X$ an open Euclidean ball in $\mathbb R^d$ and $k > d/2$, one may choose $\alpha = d/(2k)$ to obtain the error upper bound of $O(n^{-2k/(2k+d)})$.^14 To study the upper bound a bit closer, let us focus on the special case of $\gamma = 0$.

13. Without loss of generality and for simplicity, we assume that $J(Q^\pi) > 0$.
For this choice of the discount factor, $T^\pi Q$ is equal to $r^\pi$ and $(\hat T^\pi Q)(X_i, A_i)$ is equal to $R_i$. One can see that the policy evaluation problem becomes a regression problem with the regression function $r^\pi$. The guarantee of this theorem is then on $\|\hat Q - T^\pi \hat Q\|_\nu^2 = \|\hat Q - r^\pi\|_\nu^2$, which is the usual squared error in the regression literature. Hence we have reduced the regression problem to a policy evaluation problem. Because of this reduction, any lower bound on regression is also a lower bound on the policy evaluation problem. It is well known that the convergence rate of $n^{-2k/(2k+d)}$ is asymptotically minimax optimal for regression estimation when the target function belongs to the Sobolev space $W^k(X)$, as well as some other smoothness classes with smoothness of order $k$; cf., e.g., Nussbaum (1999) for results on the Sobolev spaces, Stone (1982) for the closely related Hölder space $C^{p,\alpha}$, which with the choice of $k = p + \alpha$ ($k \in \mathbb N$ and $0 < \alpha \le 1$) has the same rate, and Tsybakov (2009) for several results on the minimax optimality of nonparametric estimators. More generally, the rate of $O(n^{-1/(1+\alpha)})$ is optimal too: for a regression function belonging to a function space $F$ with a packing entropy of the same form as the upper bound in Assumption A4, the rate $\Omega(n^{-1/(1+\alpha)})$ is a minimax lower bound (Yang and Barron, 1999), making the upper bound optimal. Comparing these lower bounds with the upper bound $O(n^{-1/(1+\alpha)})$ (or $O(n^{-2k/(2k+d)})$ for the Sobolev space) of this theorem indicates that the REG-LSTD algorithm has the optimal error rate as a function of the number of samples $n$, which is a remarkable result. Furthermore, to understand the fine behavior of the upper bound, beyond the dependence of the rate on $n$ and $\alpha$, we focus on the multiplicative term $c(\delta)$. Again we consider the special case of regression estimation, as it is the only case for which we have known lower bounds. With the choice of $\gamma = 0$, we have $Q^\pi = r^\pi$, so $J(Q^\pi) = J(r^\pi)$.
Moreover, since $T^\pi Q = r^\pi + \gamma P^\pi Q = r^\pi$ when $\gamma = 0$, we can choose $L_R = J(r^\pi)$ in Assumption A7. As a result, $c(\delta) = c_1 J^{\frac{2\alpha}{1+\alpha}}(r^\pi) \ln(1/\delta)$ for a constant $c_1 > 0$. We are interested in studying the dependence of the upper bound on $J(r^\pi)$. We study its behavior when the function space is the Sobolev space $W^k([0,1])$ and $J(\cdot)$ is the corresponding Sobolev norm. We choose $\alpha = 1/(2k)$ to get the $J^{\frac{2}{2k+1}}(r^\pi)$ dependence of $c(\delta)$. On the other hand, for the regression estimation problem within the subset $F_1^{|A|} = \left\{ Q : Q(\cdot, a) \in W^k([0,1]),\ J(Q) \le J(r^\pi),\ a \in A \right\}$ of this Sobolev space, the fine behavior of the asymptotic minimax rate is determined by the so-called Pinsker constant, whose dependence on $J$ is in fact $J^{\frac{2}{2k+1}}(r^\pi)$; cf., e.g., Nussbaum (1999, 1985); Golubev and Nussbaum (1990), or Section 3.1 of Tsybakov (2009).^15 Therefore, not only is the exponent of the rate optimal for this function space, but so is its multiplicative dependence on the smoothness $J(r^\pi)$. For function spaces other than this choice of Sobolev space (i.e., the general case of $\alpha$), we are not aware of any refined lower bound that indicates the optimality of $J^{\frac{2\alpha}{1+\alpha}}(r^\pi)$. We note that some available upper bounds for regression with comparable assumptions on the metric entropy have the same dependence on $J(r^\pi)$, e.g., Steinwart et al. (2009)^16 or Farahmand and Szepesvári (2012), whose result is for the regression setting with exponentially $\beta$-mixing input, but can also be shown for i.i.d. data. We conjecture that under our assumptions this dependence is optimal.

14. For examples of metric entropy results for the Sobolev spaces, refer to Section A.5.6 alongside Lemma 6.21 of Steinwart and Christmann (2008), or Theorem 2.4 of van de Geer (2000) for $X = [0,1]$, or Lemma 20.6 of Györfi et al. (2002) for $X = [0,1]^d$. Also, in this paper we use the notation $W^k(X)$ to refer to $W^{k,2}(X)$, the Sobolev space defined based on the $L_2$-norm of the weak derivatives.
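The rate algebra above is easy to verify mechanically. The small sketch below (ours, for illustration) checks that the exponent $1/(1+\alpha)$ in the squared-error bound of Theorem 11, specialized to the Sobolev space $W^k(X)$ with $X \subset \mathbb R^d$ via $\alpha = d/(2k)$, equals $2k/(2k+d)$, the minimax regression exponent discussed above.

```python
from fractions import Fraction

# Check of the rate algebra: the bound c(delta) * n^(-1/(1+alpha)),
# with alpha = d/(2k) for the Sobolev space W^k(X), X subset of R^d,
# decays as n^(-2k/(2k+d)).
def rate_exponent(alpha):
    """Exponent e such that the squared-error bound decays as n^(-e)."""
    return 1 / (1 + alpha)

def sobolev_rate_exponent(k, d):
    """Same exponent with the Sobolev-space choice alpha = d/(2k)."""
    return rate_exponent(Fraction(d, 2 * k))

# e.g. k = 2, d = 1 gives exponent 4/5, i.e. the bound decays as n^(-4/5).
```

For instance, with $\alpha = 1/(2k)$ (the one-dimensional Sobolev case used in the Pinsker-constant discussion), the exponent becomes $2k/(2k+1)$.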
One may note that the proper selection of the regularization coefficients to achieve the optimal rate requires the knowledge of an unknown quantity $J(Q^\pi)$. This, however, is not a major concern, as a proper model selection procedure finds parameters that result in a performance almost the same as the optimal one. We comment on this issue in more detail in Section 6. The proof of this theorem requires several auxiliary results, which are presented in the appendices, but the main idea behind the proof is as follows. Since $\|\hat{Q} - T^\pi \hat{Q}\|_\nu^2 \le 2\|\hat{Q} - \hat{h}_n(\cdot; \hat{Q})\|_\nu^2 + 2\|\hat{h}_n(\cdot; \hat{Q}) - T^\pi \hat{Q}\|_\nu^2$, we may upper bound the Bellman error by upper bounding each term on the right-hand side (RHS). One can see that for a fixed $Q$, the optimization problem (15) essentially solves a regularized least-squares regression problem, which leads to a small value of $\|\hat{h}_n(\cdot; Q) - T^\pi Q\|_\nu$ when there are enough samples and under proper conditions. The relation of the optimization problem (16) to $\|\hat{Q} - \hat{h}_n(\cdot; \hat{Q})\|_\nu$ is evident too. The difficulty, however, is that these two optimization problems are coupled: $\hat{h}_n(\cdot; \hat{Q})$ is a function of $\hat{Q}$, which itself is a function of $\hat{h}_n(\cdot; \hat{Q})$. Thus, the $Q$ appearing in (15) is not fixed, but is a random function $\hat{Q}$. The same is true for the other optimization problem as well. The coupling of the optimization problems makes the analysis more complicated than the usual supervised learning type of analysis. The dependencies between all the results that lead to the proof of Theorem 14 are depicted in Figure 2 in Appendix B. In order to obtain fast convergence rates, we use concepts and techniques from empirical process theory, such as the peeling device, the chaining technique, and the modulus of continuity of the empirical process; cf., e.g., van de Geer (2000).
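The decomposition above is just the elementary inequality $(a+b)^2 \le 2a^2 + 2b^2$ applied pointwise after the triangle inequality. A minimal numerical illustration, with stand-in vectors playing the roles of $\hat{Q}$, $\hat{h}_n(\cdot;\hat{Q})$, and $T^\pi\hat{Q}$ evaluated at sample points:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for Q_hat, h_hat = h_n(.; Q_hat), and (T^pi Q_hat) at 1000 sample points.
Q_hat = rng.normal(size=1000)
h_hat = rng.normal(size=1000)
TQ_hat = rng.normal(size=1000)

def sq_norm(f):
    # Empirical analogue of the nu-weighted squared L2 norm (uniform weights here).
    return float(np.mean(f ** 2))

lhs = sq_norm(Q_hat - TQ_hat)
rhs = 2 * sq_norm(Q_hat - h_hat) + 2 * sq_norm(h_hat - TQ_hat)
# ||Q - T Q||^2 <= 2 ||Q - h||^2 + 2 ||h - T Q||^2 always holds.
assert lhs <= rhs
```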
By focusing on the behavior of the empirical process over local subsets of the function space, these techniques allow us to study the deviations of the process in a more refined way than a global approach that studies the supremum of the empirical process over the whole function space. These techniques are crucial to obtaining a fast rate for large function spaces. We discuss them in more detail as we proceed in the proofs.
15. The Pinsker constant determines the effect of the noise variance too. We do not present such information in our bounds. Also note that most of the aforementioned results, except Golubev and Nussbaum (1990), consider a normal noise model, which is different from our bounded noise.
16. This is obtained by using Corollary 3 of Steinwart et al. (2009) after substituting $A_2(\lambda)$ by its upper bound $\lambda \|f\|_H^2$, which is valid whenever $f \in H$, as is the case here. This result can be used after one converts the metric entropy condition to a condition on the decay rate of the eigenvalues of a certain integral operator.
We mentioned earlier that one can actually reuse a single data set in all iterations. To keep the presentation clearer, we keep the current setup; the reason behind this can be explained better after the proof of Theorem 11. Note that from the convergence-rate point of view, the difference between reusing data or not is insignificant. If we have a batch of data of size $n$ and we divide it into $K$ chunks, using only one chunk per iteration of API, the rate would be $O((n/K)^{-\frac{1}{1+\alpha}})$. For finite, or slowly growing, $K$ this is essentially the same as $O(n^{-\frac{1}{1+\alpha}})$.
5.2 Error Propagation in API
Consider an API algorithm that generates the sequence $\hat{Q}^{(0)} \to \pi_1 \to \hat{Q}^{(1)} \to \pi_2 \to \cdots \to \hat{Q}^{(K-1)} \to \pi_K$, where $\pi_k$ is the greedy policy w.r.t. $\hat{Q}^{(k-1)}$ and $\hat{Q}^{(k)}$ is the approximate action-value function for policy $\pi_k$.
For the sequence $(\hat{Q}^{(k)})_{k=0}^{K-1}$, denote the Bellman Residual (BR) of the $k$th action-value function by
$\varepsilon_k^{BR} = \hat{Q}^{(k)} - T^{\pi_k} \hat{Q}^{(k)}. \quad (31)$
The goal of this section is to study the effect of the $\nu$-weighted $L_2$-norm of the Bellman residual sequence $(\varepsilon_k^{BR})_{k=0}^{K-1}$ on the performance loss $\|Q^* - Q^{\pi_K}\|_{1,\rho}$ of the resulting policy $\pi_K$. Because of the dynamical nature of the MDP, the performance loss $\|Q^* - Q^{\pi_K}\|_{p,\rho}$ depends on the difference between the sampling distribution $\nu$ and the future state-action distributions of the form $\rho P^{\pi_1} P^{\pi_2} \cdots$. The precise form of this dependence is formalized in Theorem 13, which is a slight modification of a result by Farahmand et al. (2010).17 Before stating the results, we define the following concentrability coefficients, which are used in a change of measure argument; see, e.g., Munos (2007); Antos et al. (2008b); Farahmand et al. (2010).
Definition 12 (Expected Concentrability of Future State-Action Distributions) Given the distributions $\rho, \nu \in M(\mathcal{X} \times \mathcal{A})$, an integer $m \ge 0$, and an arbitrary sequence of stationary policies $(\pi_m)_{m \ge 1}$, let $\rho P^{\pi_1} P^{\pi_2} \cdots P^{\pi_m} \in M(\mathcal{X} \times \mathcal{A})$ denote the future state-action distribution obtained when the first state-action pair is distributed according to $\rho$ and then we follow the sequence of policies $(\pi_k)_{k=1}^m$. Define the following concentrability coefficient:
$c_{PI_1,\rho,\nu}(m_1, m_2; \pi) \triangleq \left( \mathbb{E}\left[ \left| \frac{\mathrm{d}\left(\rho (P^{\pi^*})^{m_1} (P^{\pi})^{m_2}\right)}{\mathrm{d}\nu}(X, A) \right|^2 \right] \right)^{1/2}$, with $(X, A) \sim \nu$.
If the future state-action distribution $\rho (P^{\pi^*})^{m_1} (P^{\pi})^{m_2}$ is not absolutely continuous w.r.t. $\nu$, then we take $c_{PI_1,\rho,\nu}(m_1, m_2; \pi) = \infty$.
In order to compactly present our results, we define the following notation:
$a_k = \frac{(1-\gamma)\,\gamma^{K-k-1}}{1 - \gamma^{K+1}}, \qquad 0 \le k < K. \quad (32)$
17. The difference between these two results is in the way the norm of functions from the space $\mathcal{F}^{|\mathcal{A}|}$ is defined, which in turn corresponds to whether the distributions $\nu$ and $\rho$ are defined over the state space $\mathcal{X}$, as in Farahmand et al. (2010), or over the state-action space $\mathcal{X} \times \mathcal{A}$, as we define here. These differences do not change the general form of the proof.
See Theorem 3.2 in Chapter 3 of Farahmand (2011b) for the proof of the current result.
Theorem 13 (Error Propagation for API; Theorem 3 of Farahmand et al. 2010) Let $p \ge 1$ be a real number, $K$ be a positive integer, and $Q_{max} \le \frac{R_{max}}{1-\gamma}$. Then for any sequence $(\hat{Q}^{(k)})_{k=0}^{K-1} \subset B(\mathcal{X} \times \mathcal{A}, Q_{max})$ and the corresponding sequence $(\varepsilon_k^{BR})_{k=0}^{K-1}$ defined in (31), we have
$\|Q^* - Q^{\pi_K}\|_{p,\rho} \le \frac{2\gamma}{(1-\gamma)^2} \left[ \inf_{r \in [0,1]} C_{PI,\rho,\nu}^{\frac{1}{2p}}(K; r)\, \mathcal{E}^{\frac{1}{2p}}(\varepsilon_0^{BR}, \ldots, \varepsilon_{K-1}^{BR}; r) + \gamma^{\frac{K}{p}-1} R_{max} \right],$
where $\mathcal{E}(\varepsilon_0^{BR}, \ldots, \varepsilon_{K-1}^{BR}; r) = \sum_{k=0}^{K-1} a_k^{2r} \|\varepsilon_k^{BR}\|_{2p,\nu}^{2p}$ and
$C_{PI,\rho,\nu}(K; r) = \left(\frac{1-\gamma}{2}\right)^2 \sup_{\pi'_0, \ldots, \pi'_K} \sum_{k=0}^{K-1} a_k^{2(1-r)} \left[ \sum_{m \ge 0} \gamma^m \left( c_{PI_1,\rho,\nu}(K-k-1, m+1; \pi'_{k+1}) + c_{PI_1,\rho,\nu}(K-k, m; \pi'_k) \right) \right]^2 .$
For a better understanding of the intuition behind error propagation results in general, refer to Munos (2007); Antos et al. (2008b); Farahmand et al. (2010). The significance of this particular theorem and the ways it improves previous similar error propagation results, such as that of Antos et al. (2008b) (for API) and Munos (2007) (for AVI), is thoroughly discussed by Farahmand et al. (2010). We briefly comment on it in Section 5.3.
5.3 Performance Loss of REG-LSPI
In this section, we use the error propagation result (Theorem 13 in Section 5.2) together with the upper bound on the policy evaluation error (Theorem 11 in Section 5.1) to derive an upper bound on the performance loss $\|Q^* - Q^{\pi_K}\|_{1,\rho}$ of REG-LSPI. This is the main theoretical result of this work. Before stating the theorem, let us denote by $\hat{\Pi}(\mathcal{F}^{|\mathcal{A}|})$ the set of all policies that are greedy w.r.t. a member of $\mathcal{F}^{|\mathcal{A}|}$, i.e., $\hat{\Pi}(\mathcal{F}^{|\mathcal{A}|}) = \{\hat{\pi}(\cdot; Q) : Q \in \mathcal{F}^{|\mathcal{A}|}\}$.
Theorem 14 Let $(\hat{Q}^{(k)})_{k=0}^{K-1}$ be the solutions of the optimization problems (15)-(16) with the choice of $\lambda_{h,n}^{(k)} = \lambda_{Q,n}^{(k)} = \left[ \frac{1}{n\, J^2(Q^{\pi_k})} \right]^{\frac{1}{1+\alpha}}$. Let Assumptions A1-A5 hold; Assumptions A6 and A7 hold for any $\pi \in \hat{\Pi}(\mathcal{F}^{|\mathcal{A}|})$; and $\inf_{r \in [0,1]} C_{PI,\rho,\nu}(K; r) < \infty$.
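To make the weighting in Theorem 13 concrete, the following sketch (our own illustration, assuming the reconstruction of (32) above) computes the coefficients $a_k$ and checks the properties used in the discussion in Section 5.3: errors at later iterations receive geometrically larger weight, and the weights for $k = 0, \ldots, K-1$ sum to $(1-\gamma^K)/(1-\gamma^{K+1})$.

```python
def a(k, K, gamma):
    # Coefficient of Eq. (32): a_k = (1 - gamma) * gamma^(K - k - 1) / (1 - gamma^(K + 1)).
    return (1 - gamma) * gamma ** (K - k - 1) / (1 - gamma ** (K + 1))

K, gamma = 10, 0.9
weights = [a(k, K, gamma) for k in range(K)]

# Errors at later iterations (larger k) are weighted more heavily ...
assert all(w1 < w2 for w1, w2 in zip(weights, weights[1:]))
# ... and consecutive weights grow by a factor 1/gamma, i.e., the influence
# of an error decays geometrically with its distance from the last iteration.
assert all(abs(w2 / w1 - 1 / gamma) < 1e-12 for w1, w2 in zip(weights, weights[1:]))
# The weights over k = 0..K-1 sum to (1 - gamma^K) / (1 - gamma^(K+1)).
assert abs(sum(weights) - (1 - gamma ** K) / (1 - gamma ** (K + 1))) < 1e-12
```

This geometric decay is what underlies the remark after Theorem 14 that fewer samples might suffice in early iterations.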
Then there exists $C_{LSPI}(\delta, K; \rho, \nu)$ such that for any $n \in \mathbb{N}$ and $0 < \delta < 1$, we have
$\|Q^* - Q^{\pi_K}\|_{1,\rho} \le \frac{2\gamma}{(1-\gamma)^2} \left[ C_{LSPI}(\delta, K; \rho, \nu)\, n^{-\frac{1}{2(1+\alpha)}} + \gamma^{K-1} R_{max} \right],$
with probability at least $1-\delta$. In this theorem, the function $C_{LSPI}(\delta, K; \rho, \nu) = C_{LSPI}(\delta, K; \rho, \nu; L_R, L_P, \alpha, \beta, \gamma)$ is
$C_{LSPI}(\delta, K; \rho, \nu; L_R, L_P, \alpha, \beta, \gamma) = C_I^{\frac{1}{2}}(\delta) \inf_{r \in [0,1]} C_{PI,\rho,\nu}^{\frac{1}{2}}(K; r),$
with $C_I(\delta)$ being defined as
$C_I(\delta) = \sup_{\pi \in \hat{\Pi}(\mathcal{F}^{|\mathcal{A}|})} \left\{ c_1 \left[ 1 + (\gamma L_P)^2 \right] J^{\frac{2\alpha}{1+\alpha}}(Q^\pi) \ln\frac{K}{\delta} + c_2 \left( L_R^{\frac{2\alpha}{1+\alpha}} + L_R^2 \, [J(Q^\pi)]^{-\frac{2}{1+\alpha}} \right) \right\},$
in which $c_1, c_2 > 0$ are universal constants.
Proof Fix $0 < \delta < 1$. For each iteration $k = 0, \ldots, K-1$, invoke Theorem 11 with the confidence parameter $\delta/K$ and take the supremum over all policies to upper bound the Bellman residual error $\|\varepsilon_k^{BR}\|_\nu$ as
$\|\hat{Q}^{(k)} - T^{\pi_k} \hat{Q}^{(k)}\|_\nu^2 \le \sup_{\pi \in \hat{\Pi}(\mathcal{F}^{|\mathcal{A}|})} c\left(J(Q^\pi), L_R, L_P, \alpha, \beta, \gamma, \tfrac{\delta}{K}\right) n^{-\frac{1}{1+\alpha}},$
which holds with probability at least $1 - \frac{\delta}{K}$. Here $c(\cdot)$ is defined as in Theorem 11. For any $r \in [0,1]$, we have
$\mathcal{E}(\varepsilon_0^{BR}, \ldots, \varepsilon_{K-1}^{BR}; r) = \sum_{k=0}^{K-1} a_k^{2r} \|\varepsilon_k^{BR}\|_\nu^2 \le c\, n^{-\frac{1}{1+\alpha}} \sum_{k=0}^{K-1} a_k^{2r} = c\, n^{-\frac{1}{1+\alpha}} \left( \frac{1-\gamma}{1-\gamma^{K+1}} \right)^{2r} \frac{1 - (\gamma^{2r})^K}{1 - \gamma^{2r}},$
where we used the definition of $a_k$ (32). We then apply Theorem 13 with the choice of $p = 1$ to get that, with probability at least $1-\delta$,
$\|Q^* - Q^{\pi_K}\|_{1,\rho} \le \frac{2\gamma}{(1-\gamma)^2} \left[ C_{LSPI}(\delta, K; \rho, \nu)\, n^{-\frac{1}{2(1+\alpha)}} + \gamma^{K-1} R_{max} \right],$
with
$C_{LSPI}(\delta, K; \rho, \nu) = \sqrt{ \sup_{\pi \in \hat{\Pi}(\mathcal{F}^{|\mathcal{A}|})} c\left(J(Q^\pi), L_R, L_P, \alpha, \beta, \gamma, \tfrac{\delta}{K}\right) } \; \inf_{r \in [0,1]} C_{PI,\rho,\nu}^{\frac{1}{2}}(K; r).$
Theorem 14 upper bounds the performance loss and relates it to the number of samples $n$, the capacity of the function space quantified by $\alpha$, the number of iterations $K$, the concentrability coefficients, and some other properties of the MDP, such as $L_R$, $L_P$, and $\gamma$. This theorem indicates that the behavior of the upper bound as a function of the number of samples is $O(n^{-\frac{1}{2(1+\alpha)}})$. This upper bound is notable because of its minimax optimality, as discussed in detail after Theorem 11. The term $C_{LSPI}$ has two main components.
The first is $C_{PI,\rho,\nu}(\cdot; r)$, which describes the effect of the sampling distribution $\nu$ and the evaluation distribution $\rho$, as well as of the transition probability kernel of the MDP itself, on the performance loss. This term has been thoroughly discussed by Farahmand et al. (2010); briefly speaking, it indicates that $\nu$ and $\rho$ affect the performance through a weighted summation of $c_{PI_1,\rho,\nu}$ (Definition 12). The concentrability coefficient $c_{PI_1,\rho,\nu}$ is defined as the square root of the expected squared Radon-Nikodym derivative of the future state-action distributions, started from $\rho$, w.r.t. the sampling distribution $\nu$. This may be much tighter than previous results (e.g., Antos et al. 2008b) that depend on the supremum of the Radon-Nikodym derivative. One may also notice that Theorem 13 actually provides a stronger result than what is reported in Theorem 14: the effect of errors at earlier iterations on the performance loss decays geometrically. So one may potentially use fewer samples in the earlier iterations of REG-LSPI (or any other API algorithm) and still get the same guarantee on the performance loss. We ignore this effect to simplify the result. The other important term is $C_I$, which mainly describes the effect of $L_R$, $L_P$, and $\sup_{\pi \in \hat{\Pi}(\mathcal{F}^{|\mathcal{A}|})} J(Q^\pi)$ on the performance loss. These quantities depend on the MDP as well as on the function space $\mathcal{F}^{|\mathcal{A}|}$. If the function space is matched with the MDP, these quantities are small; otherwise they may even be infinite. Note that $C_I$ provides an upper bound on the constant in front of the REG-LSTD error by taking the supremum over all policies in $\hat{\Pi}(\mathcal{F}^{|\mathcal{A}|})$. This might be a conservative estimate, as the policies actually encountered are the rather restricted random sequence $\pi_0, \pi_1, \ldots, \pi_{K-1}$ generated by the REG-LSPI procedure.
One might expect that as the sequence $\hat{Q}^{(k-1)}$ converges to a neighbourhood of $Q^*$, the value function $Q^{\pi_k}$ of the greedy policy $\pi_k = \hat{\pi}(\cdot; \hat{Q}^{(k-1)})$, which is the policy being evaluated, converges to a neighbourhood of $Q^*$ too. Thus, under certain assumptions, one might be able to show that its smoothness $J(Q^{\pi_k})$, the quantity that appears in the upper bound of Theorem 11, belongs to a neighbourhood of $J(Q^*)$. If $J(Q^*)$ is small, the value of $J(Q^{\pi_k})$ in that neighbourhood can be smaller than $\sup_{\pi \in \hat{\Pi}(\mathcal{F}^{|\mathcal{A}|})} J(Q^\pi)$. We postpone the analysis of this finer structure of the problem to future work. Finally, we note that the optimality of the error bound for the policy evaluation task, as shown by Theorem 11, does not necessarily imply that the REG-LSPI algorithm has the optimal sample complexity rate for the corresponding RL/Planning problem as well. The reason is that it is possible to get close to the optimal policy, which is the ultimate goal in RL/Planning, even though the estimate of the action-value function is still inaccurate. To act optimally, it is sufficient to have an action-value function whose greedy policy is the same as the optimal policy. This can happen even if there is some error in the estimated action-value function. This is called the action-gap phenomenon and has been analyzed in the reinforcement learning context by Farahmand (2011a).
5.3.1 Comparison with Similar Statistical Guarantees
Theorem 14 might be compared with the results of Antos et al. (2008b), who introduced a BRM-based API procedure and studied its statistical properties; Lazaric et al. (2012), who analyzed LSPI with linear function approximators; Avila Pires and Szepesvári (2012), who studied a regularized variant of LSTD; and Ghavamzadeh et al. (2011), who analyzed the statistical properties of Lasso-TD. Although these results address different algorithms, comparing them with the results of this work is insightful.
We first focus on Antos et al. (2008b). Their simplified upper bound for $\|Q^* - Q^{\pi_K}\|_{1,\rho}$ is $C_{\rho,\nu}^{1/2} \sqrt{V_{\mathcal{F}} \log(n) + \ln(K/\delta)}\; n^{-1/4}$, in which $V_{\mathcal{F}}$ is the effective dimension of $\mathcal{F}$, defined based on the pseudo-dimension of sub-graphs of $\mathcal{F}$ and the so-called VC-crossing dimension of $\mathcal{F}$; and $C_{\rho,\nu}$ is a concentrability coefficient playing a role similar to our $C_{PI,\rho,\nu}(K; r)$. In contrast, our simplified upper bound is $C_{LSPI}(\delta)\, n^{-\frac{1}{2(1+\alpha)}}$, in which $C_{LSPI}(\delta)$ can roughly be factored into $C_{PI,\rho,\nu}^{\frac{1}{2}}(K; r)\, C_1(J(Q^\pi), L_R, L_P)\, \sqrt{\ln(K/\delta)}$. One important difference between these two results is that Antos et al. (2008b) considered parametric function spaces, which have a finite effective dimension $V_{\mathcal{F}}$, while this work considers nonparametric function spaces, which are essentially infinite dimensional. The way they use the parametric function space assumption is equivalent to assuming $\log \mathcal{N}_1(u, \mathcal{F}, x_{1:n}) \lesssim V_{\mathcal{F}} \log(\frac{1}{u})$, as opposed to the $\log \mathcal{N}(u, \mathcal{F}_B, x_{1:n}) \le C \left(\frac{R}{u}\right)^{2\alpha}$ of Assumption A4. Our assumption lets us describe the capacity of infinite dimensional function spaces $\mathcal{F}$. Disregarding this crucial difference, one may also note that our upper bound's dependence on the number of samples (i.e., $O(n^{-\frac{1}{2(1+\alpha)}})$) is much faster than theirs (i.e., $O(n^{-1/4})$). This is more noticeable when we apply our result to a finite dimensional function space, which can be done by letting $\alpha \to 0$ at a certain rate, to recover the error upper bound of $n^{-1/2}$.18 This improvement is mainly because of the more advanced techniques used in our analysis, i.e., the relative deviation tail inequality and the peeling device in this work, in contrast with the uniform deviation inequality of Antos et al. (2008b). The other difference is in the definition of the concentrability coefficients ($C_{PI,\rho,\nu}(K)$ vs. $C_{\rho,\nu}$). In Definition 12, we use the expectation of the Radon-Nikodym derivative of two distributions, while their definition uses the supremum of a similar quantity.
This can be a significant improvement in the multiplicative constant of the upper bound. For more information regarding this improvement, which can be used to improve the result of Antos et al. (2008b) too, refer to Farahmand et al. (2010). Lazaric et al. (2012) analyzed unregularized LSTD/LSPI specialized for linear function approximators with a finite number of basis functions (the parametric setting). Their rate of $O(n^{-1/2})$ for $\|V^* - V^{\pi_K}\|_{2,\rho}$ is faster than the rate in the work of Antos et al. (2008b), and is comparable to our rate for $\|Q^* - Q^{\pi_K}\|_{1,\rho}$ when $\alpha \to 0$. The difference of their work from ours is that they focus on a parametric class of function approximators, as opposed to the nonparametric class in this work. Moreover, because they formulate LSTD as a fixed-point problem, in contrast to this work and that of Antos et al. (2008b), their algorithm and results are only applicable to the on-policy sampling scenario. Avila Pires and Szepesvári (2012) studied a regularized version of LSTD in the parametric setting that works for both on-policy and off-policy sampling. Besides the difference in the class of function spaces from this work (parametric vs. nonparametric), another algorithmic difference is that they only use a regularizer for the projected Bellman error term, similar to (16), as opposed to using regularizers in both terms of REG-LSTD (15)-(16) (cf. Section 4.1). Also, the weight used in their loss function, the matrix $M$ in their paper, is not necessarily the one induced by data. Their result indicates an $O(n^{-1/2})$ rate for the projected Bellman error, which is comparable, though with some subtle differences, to Lazaric et al. (2012). It is remarkable that they separate the error bound analysis into deterministic and probabilistic parts.
18. For problems with a finite state space, we have $\log \mathcal{N}(u, \mathcal{F}_R) \le |\mathcal{X}| \log(\frac{Q_{max}}{u})$, so with a similar $\alpha \to 0$ argument, we get an $O(n^{-1/2})$ error upper bound (disregarding the logarithmic terms).
In the deterministic part, they use perturbation analysis to relate the loss to the error in the estimation of certain parameters used by the algorithm. In the probabilistic part, they provide upper bounds on the error in the estimation of those parameters. We conjecture that their proof technique, even though simple and elegant, cannot easily be extended to provide the right convergence rate for large function spaces, because the current analysis is based on a uniform bound on the error of a noisy matrix. Providing a tight uniform bound for a matrix (or operator) for large state spaces might be difficult or impossible to achieve. Ghavamzadeh et al. (2011) analyzed Lasso-TD, a policy evaluation algorithm that uses linear function approximators and enforces sparsity by $\ell_1$-regularization, and provided error upper bounds w.r.t. the empirical measure (or what they call the Markov design). Their error upper bound is $O([\|w\|_1^2 \log(p)]^{1/4} n^{-1/4})$, where $w$ is the weight vector describing the projection of $Q^\pi$ onto the span of the $p$ basis functions. With some extra assumptions on the Grammian of the basis functions, they obtain the faster rate of $O(\sqrt{\|w\|_0 \log(p)}\, n^{-1/2})$. These results indicate that by using a sparsity-inducing regularizer, the dependence of the error bound on the number of features becomes logarithmic. We conjecture that if one uses REG-LSTD with a linear function space (similar to Section 4.2.1) with $J^2(h) = \|u\|_1$ and $J^2(Q) = \|w\|_1$, the current analysis leads to the error upper bound $O(\|w\|_1^{1/2} n^{-1/4})$ with a logarithmic dependence on $p$. This result might be obtained using Corollary 5 of Zhang (2002) as Assumption A4. To get the faster rate of $O(n^{-1/2})$, one should make extra assumptions on the Grammian, as was done by Ghavamzadeh et al. (2011). We should emphasize that even with the choice of linear function approximators and $\ell_1$-regularization, REG-LSTD would not be the same algorithm as Lasso-TD, since REG-LSTD uses regularization in both optimization problems (15)-(16).
Also note that the error upper bound of Ghavamzadeh et al. (2011) is on the empirical norm $\|\cdot\|_{2,\mathcal{D}_n}$, as opposed to the norm $\|\cdot\|_{2,\nu}$, which is w.r.t. the measure $\nu$. This means that their result does not provide a generalization upper bound on the quality of the estimated value function over the whole state space, but only an upper bound on the training data. Comparing this work with its conference version (Farahmand et al., 2009b), we observe that the main difference in the theoretical guarantees is that the current results are for more general function spaces than the Sobolev spaces considered in the conference paper. Assumption A4 specifies the requirement on the capacity of the function space, which is satisfied not only by the Sobolev spaces (with the choice of $\alpha = d/(2k)$ for $W^k(\mathcal{X})$ with $\mathcal{X}$ being an open Euclidean ball in $\mathbb{R}^d$ and $k > d/2$; cf. Section A.5.6 alongside Lemma 6.21 of Steinwart and Christmann (2008), Theorem 2.4 of van de Geer (2000) for $\mathcal{X} = [0,1]$, or Lemma 20.6 of Györfi et al. (2002) for $\mathcal{X} = [0,1]^d$), but also by many other large function spaces, including several commonly-used RKHSs.
6. Conclusion and Future Work
We introduced two regularization-based API algorithms, namely REG-LSPI and REG-BRM, to solve RL/Planning problems with large state spaces. Our formulation was general and could incorporate many types of function spaces and regularizers. We specifically showed how these algorithms can be implemented efficiently when the function space is the span of a finite number of basis functions (parametric model) or an RKHS (nonparametric model). We then focused on the statistical properties of REG-LSPI and provided its performance loss upper bound (Theorem 14).
The error bound demonstrated the role of the sample size, the complexity of the function space to which the action-value function belongs (quantified by its metric entropy in Assumption A4), and intrinsic properties of the MDP, such as the behavior of the concentrability coefficients and the smoothness-expansion property of the Bellman operator (Definition 12 and Assumption A7). The result indicated that the dependence on the sample size for the task of policy evaluation is optimal. This work (and its conference (Farahmand et al., 2009b) and dissertation (Farahmand, 2011b) versions), alongside the work on the Regularized Fitted Q-Iteration algorithm (Farahmand et al., 2008, 2009a), is the first that addresses the statistical performance of a regularized RL algorithm. Nevertheless, there have been a few other works that also used regularization for RL/Planning problems, most often without analyzing their statistical properties. Jung and Polani (2006) studied adding regularization to BRM, but their solution is restricted to deterministic problems. The main contribution of that work was the development of fast incremental algorithms using the sparsification technique. $\ell_1$-regularization has been considered by Loth et al. (2007), who were similarly concerned with incremental implementations and computational efficiency. Xu et al. (2007) provided a kernel-based, but not regularized, formulation of LSPI. They used sparsification to provide basis functions for the LSTD procedure. Sparsification leads to the selection of only a subset of data points to be used as basis functions, and thus indirectly controls the complexity of the resulting function space. This should be contrasted with a regularization-based approach, in which the regularizer interacts with the empirical loss to jointly determine the subset of the function space to which the estimate belongs.
Kolter and Ng (2009) formulated an $\ell_1$-regularized fixed-point formulation of LSTD, which is called Lasso-TD by Ghavamzadeh et al. (2011), and provided a LARS-like algorithm (Efron et al., 2004) to compute the solutions. Johns et al. (2010) considered the same fixed-point formulation and cast it as a linear complementarity problem. The statistical properties of this $\ell_1$-regularized fixed-point formulation were studied by Ghavamzadeh et al. (2011), as discussed earlier. Lasso-TD has a fixed-point formulation, which looks different from our coupled optimization formulation (15)-(16), but under the on-policy sampling scenario, it is equivalent to a particular version of REG-LSTD: if we choose a fixed linear function approximator (parametric), use the $\ell_1$-norm in the projection optimization problem (15), but do not regularize optimization problem (16) (i.e., $\lambda_{Q,n} = 0$), we get Lasso-TD. Geist and Scherrer (2012) suggested a different algorithm where the projection is not regularized (i.e., $\lambda_{h,n} = 0$), but the optimization problem (16) is regularized with the $\ell_1$-norm of the parameter weights. The choice of only regularizing (16) is the same as in the algorithm introduced and analyzed by Avila Pires and Szepesvári (2012), except that the latter work uses the $\ell_2$-norm. Hoffman et al. (2012) introduced an algorithm similar to that of Geist and Scherrer (2012), with the difference that the projection optimization (15) uses the $\ell_2$-norm (so it is a mixed $\ell_1/\ell_2$-regularized algorithm). All these algorithms are parametric. Several TD-based algorithms and their regularized variants are discussed in a survey by Dann et al. (2014). Taylor and Parr (2009) unified several kernelized reinforcement learning algorithms, and showed the equivalence of kernelized value function approximators such as GPTD (Engel et al., 2005), the work of Xu et al.
(2007), and a few other methods with a model-based reinforcement learning algorithm that has certain regularization on the transition kernel estimator, the reward estimator, or both. Their result was obtained by considering two separate regularized regression problems: one that predicts the reward function given the current state, and another that predicts the next-state kernel values given the current-state ones. Their formulation is different from ours, which is stated as a coupled optimization problem in an RKHS. Similar to other kernel-based algorithms (e.g., SVMs, Gaussian Process Regression, splines, etc.), devising a computationally efficient implementation of REG-LSPI/BRM is important to ensure that it is a practical algorithm for large-scale problems. A naive implementation of these algorithms requires computation time of $O(n^3 K)$, which is prohibitive for large sample sizes. One possible workaround is to reduce the effective number of samples by the sparsification technique (Engel et al., 2005; Jung and Polani, 2006; Xu et al., 2007). Another is to use fast vector-matrix multiplication methods, which are used in iterative methods for matrix inversion, such as those based on the Fast Multipole Method (Beatson and Greengard, 1997) and the Fast Gauss Transform (Yang et al., 2004). These methods can reduce the computational cost of vector-matrix multiplication from $O(n^2)$ to $O(n \log n)$, which results in a computation time of $O(n^2 K \log n)$ for REG-LSPI/BRM, at the cost of some small, but controlled, numerical error. Another possibility is to use stochastic gradient-like algorithms, similar to the works of Liu et al. (2012); Qin et al. (2014). The use of stochastic gradient-like algorithms is especially appealing in the light of results such as Bottou and Bousquet (2008); Shalev-Shwartz and Srebro (2008), which analyze the tradeoff between the statistical error and the optimization error caused by the choice of optimization method.
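As an illustration of the first workaround (reducing the effective number of samples), the sketch below contrasts a full kernel ridge solve, whose cost scales as $O(n^3)$ in the $n \times n$ system, with a simple subset-of-data approximation that solves an $m \times m$ system for $m \ll n$. This is a generic stand-in for the sparsification techniques cited above, not those algorithms themselves; the bandwidth, regularization values, and target function are our own choices for illustration.

```python
import numpy as np

def gauss_kernel(A, B, sigma=0.1):
    # Gaussian kernel matrix between the row-sets A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
n, m, lam = 400, 60, 1e-4
X = rng.uniform(size=(n, 1))
y = np.sin(4 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=n)

# Full solve: an n x n linear system, O(n^3) time.
K = gauss_kernel(X, X)
beta = np.linalg.solve(K + n * lam * np.eye(n), y)
f_full = K @ beta

# Subset-of-data: keep m "inducing" samples, solve an m x m system instead.
idx = rng.choice(n, size=m, replace=False)
Km = gauss_kernel(X[idx], X[idx])
beta_m = np.linalg.solve(Km + m * lam * np.eye(m), y[idx])
f_sub = gauss_kernel(X, X[idx]) @ beta_m

# Both fits should track the underlying regression function reasonably well.
mse_full = float(np.mean((f_full - np.sin(4 * np.pi * X[:, 0])) ** 2))
mse_sub = float(np.mean((f_sub - np.sin(4 * np.pi * X[:, 0])) ** 2))
```

The point of the comparison is the cost of the linear solve, not statistical efficiency: the subset solution trades some accuracy for a cubic-in-$m$ rather than cubic-in-$n$ computation.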
They show that one might achieve lower generalization error by using a faster stochastic gradient-like algorithm, which processes more data points less accurately, rather than a slower but more accurate optimization algorithm, which can only process fewer data points. Designing scalable optimization algorithms for REG-LSPI/BRM is a topic for future work. An important issue in the successful application of any RL/Planning algorithm, including REG-LSPI and REG-BRM, is the proper choice of parameters. In REG-BRM and REG-LSTD we are faced with the choice of $\mathcal{F}^{|\mathcal{A}|}$ and the corresponding regularization parameters $\lambda_{Q,n}$ and $\lambda_{h,n}$. The proper choice of these parameters, however, depends on quantities that are not known, e.g., $J(Q^\pi)$ and the choice of $\mathcal{F}^{|\mathcal{A}|}$ that matches the MDP. This problem in the RL/Planning context has been addressed by Farahmand and Szepesvári (2011). They introduced a complexity-regularization-based model selection algorithm that allows one to design adaptive algorithms: algorithms that perform almost as well as one with prior knowledge of the best parameters. Another important question is how to extend these algorithms to deal with continuous-action MDPs. There are two challenges: computational and statistical. The computational challenge is finding the greedy action at each state in the policy improvement step. In general, this is an intractable optimization problem, which cannot be solved exactly or even with any suboptimality guarantee. To analyze this inexact policy improvement, some parts of the theory, especially the error propagation result, should be modified. Moreover, we also have a statistical challenge: one should specifically control the complexity of the policy space, as the complexity of $\{\max_{a \in \mathcal{A}} Q(\cdot, a) : Q \in \mathcal{F}^{|\mathcal{A}|}\}$ might be infinite even though $\mathcal{F}^{|\mathcal{A}|}$ has finite complexity (Antos et al., 2008a).
A properly modified algorithm might be similar to the continuous-action extension of Farahmand et al. (2015), an API algorithm that explicitly controls the complexity of the policy space. Finally, an open theoretical question is to characterize the properties of the MDP that determine the function space to which the action-value function belongs. A similar question is how the values of $L_P$ and $L_R$ in Assumption A7 are related to the intrinsic properties of the MDP. We partially addressed this question for convolutional MDPs, but the analysis of more general MDPs remains to be done.
Acknowledgments
We thank the members of the Reinforcement Learning and Artificial Intelligence (RLAI) research group at the University of Alberta and the Reasoning and Learning Lab at McGill University for fruitful discussions. We thank the anonymous reviewers and the editor for their helpful comments and suggestions, which improved the quality of the paper. We gratefully acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Alberta Innovates Centre for Machine Learning (AICML). Shie Mannor was partially supported by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 306638 (SUPREL).
Proofs and Auxiliary Results
In these appendices, we first prove Theorem 10, which provides the closed-form solutions for REG-LSTD and REG-BRM when the function space is an RKHS (Appendix A). We then attend to the proof of Theorem 11 (policy evaluation error for REG-LSTD). The main body of the proof of Theorem 11 is in Appendix B. To increase readability and flow, the proofs of some of the auxiliary and more technical results are postponed to Appendices C, D, and E. More specifically, we prove an extension of Theorem 21.1 of Györfi et al. (2002) in Appendix C (Lemma 15). We present a modified version of Theorem 10.2 of van de Geer (2000) in Appendix D. We then provide a covering number result in Appendix E (Lemma 20).
The reason we require these results will become clear in Appendix B. Finally, we introduce convolutional MDPs as an instance of problems that satisfy Assumption A7 (Appendix F). We would like to remark that the generic constants $c, c' > 0$ in the proofs, especially those related to the statistical guarantees, might change from line to line when their exact value is not important for the bound. These values are constant as a function of the important quantities of the upper bound (such as $n$, $\alpha$, $J(Q^\pi)$, etc.), but may depend on $Q_{max}$ or $|\mathcal{A}|$.
Appendix A. Proof of Theorem 10 (Closed-Form Solutions for RKHS Formulation of REG-LSTD/BRM)
Proof REG-BRM: First, notice that the optimization problem (28) can be written in the form $c_n(Q) + \lambda_{Q,n}\|Q\|_H^2 \xrightarrow{Q} \min!$ with an appropriately defined functional $c_n$.19 In order to apply the representer theorem (Schölkopf et al., 2001), we need to show that $c_n$ depends on $Q$ only through the data points $Z_1, Z'_1, \ldots, Z_n, Z'_n$. This is immediate for all the terms that define $c_n$ except the term that involves $\hat{h}_n(\cdot; Q)$. However, since $\hat{h}_n$ is defined as the solution to the optimization problem (27), calling on the representer theorem once again, we observe that $\hat{h}_n$ can be written in the form $\hat{h}_n(\cdot; Q) = \sum_{t=1}^{n} \beta_t k(Z_t, \cdot)$, where $\beta = (\beta_1, \ldots, \beta_n)^\top$ satisfies
$\beta = \operatorname*{argmin}_{\beta \in \mathbb{R}^n} \left[ \|K_h\beta - \hat{T}^\pi Q\|_n^2 + \lambda_{h,n}\, \beta^\top K_h\beta \right].$
Solving this minimization problem leads to $\beta = (K_h + n\lambda_{h,n} I)^{-1}(\hat{T}^\pi Q)$. In both equations $(\hat{T}^\pi Q)$ is viewed as the $n$-dimensional vector $\big( (\hat{T}^\pi Q)(Z_1), \ldots, (\hat{T}^\pi Q)(Z_n) \big)^\top = \big( R_1 + \gamma Q(Z'_1), \ldots, R_n + \gamma Q(Z'_n) \big)^\top$. Thus, $\beta$ depends on $Q$ only through $Q(Z'_1), \ldots, Q(Z'_n)$. Plugging this solution into (28), we get that $c_n(Q)$ indeed depends on $Q$ through $Q(Z_1), Q(Z'_1), \ldots, Q(Z_n), Q(Z'_n)$, and thus on the data points $Z_1, Z'_1, \ldots, Z_n, Z'_n$. The representer theorem then implies that the minimizer of $c_n(Q) + \lambda_{Q,n}\|Q\|_H^2$ can be written in the form $Q(\cdot) = \sum_{i=1}^{2n} \alpha_i k(\tilde{Z}_i, \cdot)$, where $\tilde{Z}_i = Z_i$ if $1 \le i \le n$ and $\tilde{Z}_i = Z'_{i-n}$ otherwise.
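The closed form for $\beta$ above is the ridge solution of the stated minimization. The following numerical sketch (our own notation, with a Gaussian kernel and random stand-in data) checks that $\beta = (K_h + n\lambda_{h,n} I)^{-1}(\hat{T}^\pi Q)$ zeroes the gradient of the objective $\|K_h\beta - \hat{T}^\pi Q\|_n^2 + \lambda_{h,n}\beta^\top K_h\beta$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam, gamma = 50, 1e-2, 0.99

# Stand-ins: state-action points Z_i, rewards R_i, and Q evaluated at next points Z'_i.
Z = rng.uniform(size=(n, 2))
R = rng.normal(size=n)
Q_next = rng.normal(size=n)   # plays the role of Q(Z'_i)
y = R + gamma * Q_next        # the vector (T-hat^pi Q)(Z_i) = R_i + gamma * Q(Z'_i)

# Gaussian kernel Gram matrix K_h over the points Z_i.
d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
K_h = np.exp(-d2)

# Closed-form coefficients from the ridge problem above.
beta = np.linalg.solve(K_h + n * lam * np.eye(n), y)

# Gradient of ||K_h b - y||_n^2 + lam * b' K_h b at beta must vanish:
# (2/n) K_h (K_h beta - y) + 2 lam K_h beta = (2/n) K_h [(K_h + n lam I) beta - y] = 0.
grad = (2.0 / n) * K_h @ (K_h @ beta - y) + 2 * lam * K_h @ beta
assert np.max(np.abs(grad)) < 1e-8
```

The same first-order argument, with the smoother matrix of this ridge step substituted into (28), is what yields the coupled system solved for $\alpha$ below.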
Let α = (α_1, . . . , α_n, α′_1, . . . , α′_n)^⊤. Using the reproducing kernel property of k, we get the optimization problem ‖C_1 K_Q α − (r + γ C_2 K_Q α)‖²_n − ‖B(r + γ C_2 K_Q α)‖²_n + λ_{Q,n} α^⊤ K_Q α → min_α. Solving this for α concludes the proof for REG-BRM.

REG-LSTD: The first part of the proof, showing that c_n depends on Q only through the data points Z_1, Z′_1, . . . , Z_n, Z′_n, is exactly the same as in the proof for REG-BRM. Thus, using the representer theorem, the minimizer of (30) can be written in the form Q(·) = Σ_{i=1}^{2n} α_i k(Z̃_i, ·), where Z̃_i = Z_i if i ≤ n and Z̃_i = Z′_{i−n} otherwise. Let α = (α_1, . . . , α_n, α′_1, . . . , α′_n)^⊤. Using the reproducing kernel property of k, we get the optimization problem ‖(C_1 − γE C_2) K_Q α − E r‖²_n + λ_{Q,n} α^⊤ K_Q α → min_α. Replacing C_1 − γE C_2 with F and solving for α concludes the proof.

19. Here f(Q) → min_Q! indicates that Q is a minimizer of f(Q).

Figure 2: Dependencies of results used to prove the statistical guarantee for REG-LSPI (Theorem 14). [Figure: a dependency graph showing that Theorem 14 (REG-LSPI) follows from Theorem 11 (REG-LSTD) and Theorem 13 (Error Propagation); Theorem 11 relies on Lemma 15, Lemma 18, and Lemma 19; these in turn rely on Theorem 16, Lemma 17, Lemma 20 (covering number), Lemma 22 (relative deviation), Lemma 23 (supremum of weighted sum), and Lemma 24 (modulus of continuity).]

Appendix B. Proof of Theorem 11 (Statistical Guarantee for REG-LSTD)

The goal of Theorem 11 is to provide a finite-sample upper bound on the Bellman error ‖Q̂ − T^π Q̂‖_ν for REG-LSTD, defined by the optimization problems (15) and (16). Since ‖Q̂ − T^π Q̂‖²_ν ≤ 2‖Q̂ − ĥ_n(·; Q̂)‖²_ν + 2‖ĥ_n(·; Q̂) − T^π Q̂‖²_ν, we may upper bound the Bellman error by upper bounding each term on the RHS. Recall from the discussion after Theorem 11 that the analysis is more complicated than in the conventional supervised learning setting because the corresponding optimization problems are coupled: ĥ_n(·; Q̂) is a function of Q̂, which is itself a function of ĥ_n(·; Q̂).
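As a concrete illustration of the inner regularized regression step, the closed form β* = (K_h + nλ_{h,n} I)^{−1}(T̂^π Q) derived above is ordinary kernel ridge regression. The following sketch is illustrative only: the function names, the RBF kernel choice, and the synthetic data are our own assumptions, not part of the paper. It solves the linear system with NumPy and checks numerically that the closed-form β* minimizes the regularized empirical objective.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    # Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def fit_inner_regression(Z, y, lam):
    """Closed-form solution of min_beta ||K beta - y||_n^2 + lam * beta' K beta,
    namely beta = (K + n * lam * I)^{-1} y (kernel ridge regression)."""
    n = len(Z)
    K = rbf_kernel(Z, Z)
    beta = np.linalg.solve(K + n * lam * np.eye(n), y)
    return beta, K

def objective(b, K, y, lam):
    # empirical squared loss (||.||_n^2 is a mean) plus RKHS penalty beta' K beta
    return ((K @ b - y) ** 2).mean() + lam * b @ K @ b

rng = np.random.default_rng(0)
Z = rng.uniform(-1.0, 1.0, size=(20, 2))   # stand-in for sampled (state, action) pairs
y = np.sin(Z[:, 0]) + 0.5 * Z[:, 1]        # stand-in for the targets R_i + gamma * Q(Z'_i)
lam = 0.1
beta, K = fit_inner_regression(Z, y, lam)

# The closed form should (numerically) minimize the objective: random
# perturbations of beta must not decrease it.
base = objective(beta, K, y, lam)
ok = all(objective(beta + 0.01 * rng.standard_normal(len(y)), K, y, lam) >= base - 1e-9
         for _ in range(5))
print(ok)
```

Setting the gradient of the objective to zero gives K[(1/n)(Kβ − y) + λβ] = 0, which is satisfied by (K + nλI)β = y; the perturbation check above exercises exactly this optimality.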
Theorem 11 is proven using Lemma 15, which upper bounds ‖ĥ_n(·; Q̂) − T^π Q̂‖_ν, and Lemma 19, which upper bounds ‖Q̂ − ĥ_n(·; Q̂)‖_ν. We also need to relate the smoothness J(Q̂) to the smoothness J(Qπ); Lemma 18 specifies this relation. The proofs of these lemmas themselves require further developments, which will be discussed when we encounter them. Figure 2 shows the dependencies between all the results that lead to the proof of Theorem 11 and consequently Theorem 14. The following lemma controls the error behavior resulting from the optimization problem (15). This lemma, which is a result on the error upper bound of a regularized regression estimator, is similar to Theorem 21.1 of Györfi et al. (2002) with two main differences: first, it holds uniformly over T^πQ (as opposed to a fixed function T^πQ); second, it holds for function spaces that satisfy a general metric entropy condition (as opposed to the special case of Sobolev spaces).

Lemma 15 (Convergence of ĥ_n(·; Q) to T^πQ) For any random Q ∈ F^{|A|}, let ĥ_n(Q) be defined according to (15). Under Assumptions A1–A5 and A7, there exist finite constants c_1, c_2 > 0 such that for any n ∈ N and 0 < δ < 1, we have ‖ĥ_n(·; Q) − T^πQ‖²_ν ≤ 4λ_{h,n} J²(T^πQ) + 2λ_{h,n} J²(Q) + c_1 (1 / (n λ^α_{h,n})) + c_2 ln(1/δ)/n with probability at least 1 − δ. Proof See Appendix C.

When we use this lemma to prove Theorem 11, the action-value function Q that appears in the bound is the result of the optimization problem defined in (16), that is Q̂, and so is random. Lemma 18, which we will prove later, provides a deterministic upper bound on the smoothness J(Q̂) of this random quantity. It turns out that to derive our main result, we need to know more about the behavior of the regularized regression estimator than what is shown in Lemma 15. In particular, we need an upper bound on the empirical error of the regularized regression estimator ĥ_n(·; Q) (cf. (33) below).
Moreover, we should bound the random smoothness J(ˆhn( ; Q)) by some deterministic quantities, which turns out to be a function of J(T πQ) and J(Q). Theorem 16 provides us with the required upper bounds. This theorem is a modification of Theorem 10.2 by van de Geer (2000), with two main differences: 1) It holds uniformly over Q and 2) ˆhn( ; Q) uses the same data Dn that is used to estimate Q itself. We introduce the following notation: Let w = (x, a, r, x ) and define the random variables wi = (Xi, Ai, Ri, X i) for 1 i n. The data set Dn would be {w1, . . . , wn}. For a measurable function g : X A R X R, let g 2 n = 1 n Pn i=1 |g(wi)|2. Consider the regularized least squares estimator: ˆhn( ; Q) = argmin h F|A| h h [Ri + γQ(X i, π(X i))] 2 n + λh,n J2(h) i , (33) which is the same as (15) with π replacing πk. Theorem 16 (Empirical error and smoothness of ˆhn( ; Q)) For a random function Q F|A|, let ˆhn( , Q) be defined according to (33). Suppose that Assumptions A1 A5 and A7 hold. Then there exist constants c1, c2 > 0, such that for any n N and 0 < δ < 1, we have ˆhn( ; Q) T πQ n c1 max Qmax (J(Q) + J(T πQ)) α 1+α ln(1/δ) λh,n J(T πQ) Regularized Policy Iteration with Nonparametric Function Spaces J(ˆhn( ; Q)) c2 max J(Q) + J(T πQ), Q1+α max with probability at least 1 δ. Proof See Appendix D. The following lemma, which is an immediate corollary of Theorem 16, indicates that with the proper choice of the regularization coefficient, the complexity of the regression function ˆhn( ; Q) is in the same order as the complexities of Q, T πQ, and Qπ. This result will be used in the proof of Lemma 21, which itself is used in the proof of Lemma 19. Lemma 17 (Smoothness of ˆhn( ; Q)) For a random Q F|A|, let ˆhn( ; Q) be the solution to the optimization problem (15) with the choice of regularization coefficient λh,n = 1 n J2(Qπ) Let Assumptions A1 A5 and A7 hold. 
Then, there exists a finite constant c > 0, depending on Qmax, such that for any n ∈ N and 0 < δ < 1, the upper bound J(ĥ_n(·; Q)) ≤ c (J(T^πQ) + J(Q) + J(Qπ) √(ln(1/δ))) holds with probability at least 1 − δ. Proof With the choice of λ_{h,n} = [1/(n J²(Qπ))]^{1/(1+α)}, Theorem 16 implies that there exist a finite constant c_1 > 0 as well as c_2 > 0, which depends on Qmax, such that for any n ∈ N and 0 < δ < 1, the inequality J(ĥ_n(·; Q)) ≤ c_1 max{ J(Q) + J(T^πQ), Q^{1+α}_max [1/(n J²(Qπ))]^{...} } ≤ c_2 (J(T^πQ) + J(Q) + J(Qπ) √(ln(1/δ))) holds with probability at least 1 − δ.

An intuitive understanding of this result might be gained if we consider ĥ_n(·; Qπ), which is the regression estimate for T^πQπ = Qπ. This lemma then indicates that the smoothness of ĥ_n(·; Qπ) is comparable to the smoothness of its target function Qπ. This is intuitive whenever the regularization coefficients are chosen properly. The following lemma relates J(Q̂) and J(T^πQ̂), which are random, to the complexity of the action-value function of the policy π, i.e., J(Qπ). This result is used in the proof of Theorem 11.

Lemma 18 (Smoothness of Q̂) Let Assumptions A1–A7 hold, and let Q̂ be the solution to (16) with the choice of λ_{h,n} = [1/(n J²(Qπ))]^{1/(1+α)}. Then, there exists a finite constant c > 0 such that for any n ∈ N and 0 < δ < e^{−1}, we have λ_{Q,n} J²(Q̂) ≤ λ_{Q,n} J²(Qπ) + c J^{2α/(1+α)}(Qπ) ln(1/δ) n^{−1/(1+α)} with probability at least 1 − δ. Proof By Assumption A6 we have Qπ ∈ F^{|A|}, so by the optimizer property of Q̂ (cf. (16)), we get λ_{Q,n} J²(Q̂) ≤ ‖Q̂ − ĥ_n(·; Q̂)‖²_{Dn} + λ_{Q,n} J²(Q̂) ≤ ‖Qπ − ĥ_n(·; Qπ)‖²_{Dn} + λ_{Q,n} J²(Qπ). (34) Since Qπ = T^πQπ, we have ‖Qπ − ĥ_n(·; Qπ)‖_{Dn} = ‖T^πQπ − ĥ_n(·; Qπ)‖_{Dn}. So Theorem 16 shows that with the choice of λ_{h,n} = [1/(n J²(Qπ))]^{1/(1+α)}, there exists a finite constant c_1 > 0 such that for any n ∈ N and 0 < δ < e^{−1} ≈ 0.3679, we have ‖Qπ − ĥ_n(·; Qπ)‖²_{Dn} ≤ c_1 (1 ∨ Q^{2(1+α)}_max) J^{2α/(1+α)}(Qπ) ln(1/δ) n^{−1/(1+α)}, (35) with probability at least 1 − δ. Chaining inequalities (34) and (35) finishes the proof.
The other main ingredient of the proof of Theorem 11 is an upper bound to ˆQ ˆhn( ; ˆQ) ν, which is closely related to the optimization problem (16). This task is done by Lemma 19. In the proof of this lemma, we call Lemma 21, which shall be stated and proven right after this result. Lemma 19 (Convergence of ˆQ ˆhn( ; ˆQ) ν) Let ˆQ be the solution to the set of coupled optimization problems (15) (16). Suppose that Assumptions A1 A7 hold. Then there exists a finite constant c > 0 such that for any n N and 0 < δ < 2e 1 and with the choice of λh,n = λQ,n = 1 n J2(Qπ) we have ˆQ ˆhn( ; ˆQ) 2 ν c (1 + γ2L2 P )α J 2α 1+α (Qπ) ln(1/δ) + L 2α 1+α R n 1 1+α , with probability at least 1 δ. Regularized Policy Iteration with Nonparametric Function Spaces Proof Decompose ˆQ ˆhn( ; ˆQ) 2 ν = I1,n + I2,n, 1 2I1,n = ˆQ ˆhn( ; ˆQ) 2 Dn + λQ,n J2( ˆQ), I2,n = ˆQ ˆhn( ; ˆQ) 2 ν I1,n. (36) In what follows, we upper bound each of these terms. I1,n: Use the optimizer property of ˆQ to get 1 2I1,n = ˆQ ˆhn( ; ˆQ) 2 Dn + λQ,n J2( ˆQ) Qπ ˆhn( ; Qπ) 2 Dn + λQ,n J2(Qπ). To upper bound Qπ ˆhn( ; Qπ) 2 Dn = T πQπ ˆhn( ; Qπ) 2 Dn, we evoke Theorem 16. For our choice of λQ,n, there exists a constant c1 > 0 such that for any n N and 0 < δ1 < 1, we have 1 2I1,n λQ,n J2(Qπ) + c1 J 2α 1+α (Qπ) ln(1/δ1) n 1 1+α , (37) with probability at least 1 δ1. I2,n: With our choice of λQ,n and λh,n, Lemma 21, which shall be proven later, indicates that there exist some finite constants c2, c3, c4 > 0 such that for any n N and finite J(Qπ), LR, and LP , and 0 < δ2 < 1, we have 2α 1+α R + [J(Qπ)] 2α 1+α [ln(1/δ2)] α 1+α n 1 1+α + c3 (1 + γ2L2 P )α n λα Q,n + c4 ln(1/δ2) with probability at least 1 δ2. For δ2 < e 1 and α 0, we have [ln(1/δ2)] α 1+α ln(1/δ2), and also 1 n λα Q,n = [J(Qπ)] 2α 1+α n 1 1+α [J(Qπ)] 2α 1+α n 1 1+α ln(1/δ2). (39) With the right choice of constants, ln(1/δ2) n can be absorbed into the other terms. Select δ1 = δ2 = δ/2. 
Inequalities (37), (38), and (39) imply that with the specified choice of λQ,n and λh,n, there exists a finite constant c5 > 0 such that for any 0 < δ < 2e 1, we have ˆQ ˆhn( ; ˆQ) 2 ν c5 (1 + γ2L2 P )α J 2α 1+α (Qπ) ln(1/δ) + L 2α 1+α R n 1 1+α , with probability at least 1 δ. To upper bound I2,n, defined in (36), we simultaneously apply the peeling device (cf. Section 5.3 of van de Geer 2000) on two different, but coupled, function spaces (one to which Farahmand, Ghavamzadeh, Szepesv ari, and Mannor ˆQ belongs and the other to which ˆhn( ; ˆQ) belongs). In each layer of peeling, we apply an exponential tail inequality to control the relative deviation of the empirical mean from the true mean (Lemma 22 in Appendix C). We also require a covering number result, which is stated as Lemma 20. The final result of this procedure is a tight upper bound on I2,n, as stated in Lemma 21. To prepare for the peeling argument, define the following subsets of F and F|A|: Fσ f : f F, J2(f) σ , F|A| σ n f : f F|A|, J2(f) σ o . g Q,h(x, a) j=1 I{a=aj} [Qj(x) hj(x)]2 . (40) To simplify the notation, we use z = (x, a) and Z = (X, A) in the rest of this section. Define Gσ1,σ2 as the space of g Q,h functions with J(Q) σ1 and J(h) σ2, i.e., Gσ1,σ2 n g Q,h : Rd A R; Q F|A| σ1 , h F|A| σ2 o . (41) The following lemma provides an upper bound on the covering numbers of Gσ1,σ2. Lemma 20 (Covering Number) Let Assumptions A3, A4, and A5 hold. Then there exists a constant c1 > 0, independent of σ1, σ2, α, Qmax, and |A|, such that for any u > 0 and all ((x1, a1), . . . , (xn, an)) X A, the empirical covering number of the class of functions Gσ1,σ2 defined in (41) w.r.t. the empirical norm 2,z1:n is upper bounded by log N2(u, Gσ1,σ2, (x, a)1:n) c1|A|1+αQ2α max (σα 1 + σα 2 ) u 2α. Proof See Appendix E. Next, we state and prove Lemma 21, which provides a high probability upper bound on I2,n. Lemma 21 Let I2,n be defined according to (36). 
Under Assumptions A1 A5 and A7 and with the choice of λh,n = λQ,n = 1 n J2(Qπ) there exist constants c1, c2, c3 > 0, such that for any n N, finite J(Qπ), LR, and LP , and δ > 0 we have 2α 1+α R + [J(Qπ)] 2α 1+α [ln(1/δ)] α 1+α n 1 1+α + c2 (1 + γ2L2 P )α n λα Q,n + c3 ln(1/δ) with probability at least 1 δ. Regularized Policy Iteration with Nonparametric Function Spaces Proof Let Z = (X, A) be a random variable with distribution ν that is independent from Dn. Without loss of generality, we assume that Qmax 1/2. We use the peeling device in conjunction with Lemmas 20 and 22 to obtain a tight high-probability upper bound on I2,n. Based on the definition of I2,n in (36) we have P {I2,n > t} = P E h g ˆQ,ˆhn( ; ˆQ)(Z)|Dn i 1 n Pn i=1 g ˆQ,ˆhn( ; ˆQ)(Zi) t + 2λQ,n J2( ˆQ) + E h g ˆQ,ˆhn( ; ˆQ)(Z)|Dn i > 1 To benefit from the peeling device, we relate the complexity of ˆhn( ; ˆQ) to the complexity of ˆQ. For a fixed δ1 > 0 and some constant c > 0, to be specified shortly, define the following event: A0 = n ω : J2(ˆhn( ; ˆQ)) c J2(T π ˆQ) + J2( ˆQ) + J2(Qπ) ln(1/δ1) o . Lemma 17 indicates that P {A0} 1 δ1, where the constant c here can be chosen to be three times of the squared value of the constant in the lemma. We have P {I2,n > t} = P I2,n > t, AC 0 + P {I2,n > t, A0} δ1 + P {I2,n > t, A0}, so we focus on upper bounding P {I2,n > t, A0}. Since ˆQ F|A|, there exists l N0 such that 2lt I{l =0} 2λQ,n J2( ˆQ) < 2l+1t. Fix l N0. For any Q F|A|, Assumption A7 relates J(T πQ) to J(Q): λQ,n J2(T πQ) 2 L2 R + γ2L2 P 2lt λQ,n Thus on the event A0, if ˆQ F|A| σl 1 where σl 1 = 2lt λQ,n , we also have ˆhn( ˆQ) F|A| σl 2 with σl 2 = c 2 L2 R + (1 + γ2L2 P ) 2lt + J2(Qπ) ln(1/δ1) . (43) Apply the peeling device on (42). 
Use (43) and note that if for an l N0 we have 2λQ,n J2( ˆQ) 2lt I{l =0}, we also have t + 2λQ,n J2( ˆQ) 2lt to get P {I2,n > t} = P I2,n > t, AC 0 + P {I2,n > t, A0} A0, 2lt I{l =0} 2λQ,n J2( ˆQ) < 2l+1t, E h g ˆQ,ˆhn( ; ˆQ)(Z)|Dn i 1 n Pn i=1 g ˆQ,ˆhn( ; ˆQ)(Zi) t + 2λQ,n J2( ˆQ) + E h g ˆQ,ˆhn( ; ˆQ)(Z)|Dn i > 1 sup g Q,h Gσl 1,σl 2 E [g Q,h(Z)|Dn] 1 n Pn i=1 g Q,h(Zi) 2lt + E [g Q,h(Z)|Dn] > 1 Let us study the behavior of the lth term of the above summation by verifying the conditions of Lemma 22 with the choice of ε = 1 2 and η = 2lt. Farahmand, Ghavamzadeh, Szepesv ari, and Mannor Condition (A1): Since all functions involved are bounded by Qmax, it is easy to see that |g Q,h(x, a)| P|A| j=1 I{a=aj} [Qj(x) hj(x)]2 4Q2 max. Therefore, K1, defined in Lemma 22, can be set to K1 = 4Q2 max. Condition (A2): We have E [Q(Z) h(Z)]2 2 4Q2 max E h [Q(Z) h(Z)]2i . Therefore, K2 can be set to K2 = 4Q2 max. Condition (A3): We should satisfy 2 4 nη 288 max{8Q2 max, 8Qmax}. Since η = 2lt t, it is sufficient to have in which c is a function of Qmax (we can choose c = 2 46082 Q4 max). Condition (A4): We shall verify that for ε 1 82lt, and σ1 = σl 1 and σ2 = σl 2, the following holds: 2 max{K1, 2K2} 16 max{K1,2K2} g Gσ1,σ2 : 1 i=1 g2(zi) 16ε ) !!1/2 du. (45) Notice that there exists a constant c > 0 such that for any u, ε > 0 g Gσ1,σ2 : 1 i=1 g2(zi) 16ε ) log N2 (u, Gσ1,σ2, z1:n) c (σα 1 + σα 2 )u 2α, (46) where we used Lemma 20 in the second inequality. Plug (46) into (45) with the choice of σ1 = σl 1 = 2lt λQ,n and σ2 = σl 2 = c[2(L2 R + (1 + γ2L2 P ) 2lt λQ,n ) + J2(Qπ) ln(1/δ1)]. Therefore, for some constant c = c (Qmax) > 0, the inequality + c 2 L2 R + (1 + γ2L2 P ) 2lt + J2(Qπ) ln(1/δ1) α implies (45). Because (a + b) 1 2 (a 1 2 + b 1 2 ) for non-negative a and b, it suffices to verify the following two conditions: (a) We shall verify that for ε 1 82lt, we have 2 Q,n (2lt) α Regularized Policy Iteration with Nonparametric Function Spaces for some c > 0. 
Substituting ε with 2lt, we see that it is enough if for some constant c > 0, t c 2l n λα Q,n . (D1) (b) We should verify that for ε 1 82lt, the following is satisfied: L2 R |{z} (b1) + (1 + γ2L2 P ) 2lt λQ,n | {z } (b2) + J2(Qπ) ln(1/δ1) | {z } (b3) for some c > 0. After some manipulations, we get that the previous inequality holds if the following three inequalities are satisfied: (b1) : t c 1 L 2α 1+α R n 1 1+α , (D2) (b2) : t c 2 (1 + γ2L2 P )α nλα Q,n , (D3) (b3) : t c 3 [J(Qπ)] 2α 1+α [ln(1/δ1)] α 1+α n 1 1+α , (D4) for some constants c 1, c 2, c 3 > 0. Fix δ > 0 and let δ1 = δ/2. Whenever (C1), (D1), (D2), (D3), and (D4) are satisfied, for some choice of constants c, c > 0 we have P {I2,n > t} δ 2) 128 2304 max{16Q4max, 4Q2max} 2 + c exp( c n t). Let the left-hand side be equal δ and solve for t. Considering all aforementioned conditions, we get that there exist constants c1, c2, c3 > 0 such that for any n N, finite J(Qπ), LR, and LP , and δ > 0, we have 2α 1+α R + [J(Qπ)] 2α 1+α [ln(1/δ)] α 1+α n 1 1+α + c2 (1 + γ2L2 P )α n λα Q,n + c3 ln(1/δ) with probability at least 1 δ. After developing these tools, we are ready to prove Theorem 11. Proof [Proof of Theorem 11] We want to show that ˆQ T π ˆQ ν is small. Since (15)-(16) minimize ˆhn( ; ˆQ) T π ˆQ ν and ˆQ ˆhn( ; ˆQ) ν, we upper bound ˆQ T π ˆQ ν in terms of these quantities as follows: ˆQ T π ˆQ 2 ν 2 ˆQ ˆhn( ; ˆQ) 2 ν + 2 ˆhn( ; ˆQ) T π ˆQ 2 Farahmand, Ghavamzadeh, Szepesv ari, and Mannor Let us upper bound each of these two terms in the RHS. Fix 0 < δ < 1. Bounding ˆhn( ; ˆQ) T π ˆQ ν: Lemma 15 indicates that there exist constants c1, c2 > 0 such that for any random ˆQ F|A| and any fixed n N, we have ˆhn( ; ˆQ) T π ˆQ 2 ν λh,n 2J2( ˆQ) + 4J2(T π ˆQ) + c1 1 nλα h,n + c2 ln(3/δ) with probability at least 1 δ/3. Note that T π ˆQ F|A| is implied by Assumption A7 and ˆQ F|A|. Because ˆQ is random itself, the terms J( ˆQ) and J(T π ˆQ) in the upper bound of (48) are also random. 
In order to upper bound them, we use Lemma 18, which states that upon the choice of λh,n = λQ,n = [ 1 n J2(Qπ)] 1 1+α , there exists a constant c3 > 0 such that for any n N, λh,n J2( ˆQ) = λQ,n J2( ˆQ) λQ,n J2(Qπ) + c3 J 2α 1+α (Qπ) n 1 1+α ln(3/δ) (49) holds with probability at least 1 δ/3. We use Assumption A7 to show that we have λh,n J2(T π ˆQ) 2λQ,n L2 R + 2(γLP )2 λQ,n J2(Qπ) + c3 J 2α 1+α (Qπ) n 1 1+α ln(3/δ) with the same probability. Plugging (49) and (50) into (48) and using the selected schedule for λQ,n and λh,n, we get ˆhn( ; ˆQ) T π ˆQ 2 ν " 2 + c1 + 8(γLP )2 J 2α 1+α (Qπ) + c3 2 + 8(γLP )2 J 2α 1+α (Qπ) ln(3/δ) + 8L2 R J 2 1+α (Qπ) + c2 ln(3/δ) with probability at least 1 2 3δ. By the proper choice of constants, the term c2n 1 ln(3/δ) can be absorbed into n 1 1+α ln(3/δ). Therefore, there exists a constant c4 > 0 such that ˆhn( ; ˆQ) T π ˆQ 2 c4 1 + (γLP )2 J 2α 1+α (Qπ) ln(1/δ) + 8L2 R [J(Qπ)] 2 1+α n 1 1+α , (51) with probability at least 1 2 3δ. Bounding ˆQ ˆhn( ; ˆQ) ν: With our choice of λQ,n and λh,n, Lemma 19 states that there exists a constant c5 > 0 such that for any n N, ˆQ ˆhn( ; ; ˆQ) 2 ν c5 (1 + γ2L2 P )αJ 2α 1+α (Qπ) ln(1/δ) + L 2α 1+α R n 1 1+α , (52) holds with probability at least 1 δ/3. Regularized Policy Iteration with Nonparametric Function Spaces Thus, inequality (47) alongside upper bounds (51) and (52) indicate that there exist constants c6, c7 > 0 such that for any n N and δ > 0, we have ˆQ T π ˆQ 2 ν c6 1 + (γLP )2 J 2α 1+α (Qπ) ln(1/δ) + c7 2α 1+α R + L2 R [J(Qπ)] 2 1+α with probability at least 1 δ. A careful study of the proof of Theorem 11 and the auxiliary results used in it reveals that one can indeed reuse a single data set in all iterations. Recall that at the kth iteration of an API procedure such as REG-LSPI, the policy π = πk is the greedy policy w.r.t. ˆQ(k 1), so it depends on earlier data sets. 
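The proof of Theorem 11 rests on the elementary decomposition ‖Q̂ − T^π Q̂‖²_ν ≤ 2‖Q̂ − ĥ_n(·; Q̂)‖²_ν + 2‖ĥ_n(·; Q̂) − T^π Q̂‖²_ν, an instance of (a + b)² ≤ 2a² + 2b². The following quick numerical sanity check of this inequality under a weighted (discrete) stand-in for the ν-norm is purely illustrative; the random functions and distribution below are our own choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def weighted_sq_norm(f, w):
    # ||f||_nu^2 with nu approximated by a discrete distribution of weights w
    return float(np.sum(w * f ** 2))

w = rng.dirichlet(np.ones(100))  # weights: nonnegative, sum to one
for _ in range(1000):
    # q, h, tq play the roles of Q-hat, h_n(.; Q-hat), and T^pi Q-hat
    q, h, tq = rng.standard_normal((3, 100))
    lhs = weighted_sq_norm(q - tq, w)
    rhs = 2 * weighted_sq_norm(q - h, w) + 2 * weighted_sq_norm(h - tq, w)
    assert lhs <= rhs + 1e-12  # (a + b)^2 <= 2 a^2 + 2 b^2, termwise under w
print("decomposition holds on all random draws")
```

Since q − tq = (q − h) + (h − tq), the inequality holds pointwise and hence under any nonnegative weighting, which is exactly how it is used in the proof.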
This implies that a function such as T^π Q̂ = T^{π̂(·; Q̂^{(k−1)})} Q̂ is random with two sources of randomness: one source is the data set used in the current iteration, which defines the empirical loss functions; this directly affects Q̂. The other source is π̂(·; Q̂^{(k−1)}), which depends on the data sets of the earlier iterations. When we assume that all data sets are independent of each other, the randomness of π does not cause any problem because we can work on the probability space conditioned on the data sets of the earlier iterations. Conditioned on that randomness, the policy π becomes a deterministic function. This is how we presented the statement of Theorem 11, by stating that π is fixed. Nonetheless, the proofs can handle the dependence with no change. Briefly speaking, the reason is that when we want to provide a high-probability upper bound on certain random quantities, we take the supremum over both Q̂ and T^π Q̂ and consider them as two separate functions, even though they are related through a random T^π operator. To see this more clearly, notice that in the proof of Lemma 15, which is used in the proof of this theorem, we define the function spaces G_l that choose the functions h, Q, and T^πQ separately. We then take the supremum over all functions in G_l. This means that for the probabilistic upper bound, the randomness of π in T^πQ becomes effectively irrelevant, as we are providing a guarantee that is uniform over G_l. In the proof of this theorem, we also use Lemma 19, which itself uses Theorem 16 and Lemma 21, which have a similar construct.

Appendix C. Proof of Lemma 15 (Convergence of ĥ_n(·; Q) to T^πQ)

The following lemma, quoted from Györfi et al. (2002), provides an exponential probability tail inequality for the relative deviation of the empirical mean from the true mean. A slightly modified version of this result was published as Theorem 2 of Kohler (2000). This result is used in the proofs of Lemmas 15 and 21. Lemma 22 (Theorem 19.3 of Györfi et al.
2002) Let Z, Z1, , Zn be independent and identically distributed random variables with values in Z. Let 0 < ε < 1 and η > 0. Assume that K1, K2 1 and let F be a permissible class of functions f : Z R with the following properties: (A2) E f(Z)2 K2E [f(Z)], Farahmand, Ghavamzadeh, Szepesv ari, and Mannor (A3) nε 1 ε η 288 max{2K1, 2K2}, (A4) For all z1, , zn Z and all δ η/8, nε(1 ε)δ 96 2 max{K1, 2K2} ε(1 ε)δ 16 max{K1,2K2} v u u tlog N2 u, n f F : 1 i=1 f2(zi) 16δ o , z1:n n Pn i=1 f(Zi) η + E [f(Z)] > ε 60 exp n η ε2(1 ε) 128 2304 max{K2 1, K2} Let us now turn to the proof of Lemma 15. This proof follows similar steps to the proof of Theorem 21.1 of Gy orfiet al. (2002). Proof [Proof of Lemma 15] Without loss of generality, assume that Qmax 1/2. Denote z = (x, a) and let Z = (X, A) ν, R R( |X, A), and X P( |X, A) be random variables that are independent of Dn = {(Xi, Ai, Ri, X i)}n i=1. Define the following error decomposition Z ˆhn(z; Q) T πQ(z) 2 dν(z) = E ˆhn(Z; Q) [R + γQ(X , π(X ))] 2 Dn E h T πQ(Z) [R + γQ(X , π(X ))] 2i = I1,n + I2,n, 1 2I1,n = 1 ˆhn(Zi; Q) [Ri + γQ(X i, π(X i))] 2 T πQ(Zi) [Ri + γQ(X i, π(X i))] 2 + λh,n J2(ˆhn( ; Q)) + J2(Q) + J2(T πQ) , I2,n = E ˆhn(Z; Q) ˆT πQ(Z) 2 T πQ(Z) ˆT πQ(Z) 2 Dn By the optimizer property of ˆhn( ; Q), we get the following upper bound by substituting ˆhn( ; Q) with T πQ F|A|: T πQ(Zi) ˆT πQ(Zi) 2 T πQ(Zi) ˆT πQ(Zi) 2 + λh,n J2(T πQ) + J2(Q) + J2(T πQ) # = 4λh,n J2(T πQ) + 2λh,n J2(Q). (53) Regularized Policy Iteration with Nonparametric Function Spaces We now turn to upper bounding P {I2,n > t}. Given a policy π and functions h, Q, Q F|A|, for w = (x, a, r, x ) define g : X A R X R as gh,Q,Q (w) = h(z) r + γQ(x , π(x )) 2 Q (z) r + γQ(x , π(x )) 2 . Note that gˆhn( ;Q),Q,T πQ is the function appearing in the definition of I2,n. Define the following function spaces for l = 0, 1, . . . : Gl gh,Q,Q : X A R X R : h, Q, Q F|A|; J2(h), J2(Q), J2(Q ) 2lt Denote W = (X, A, R, X ) and Wi = (Xi, Ai, Ri, X i). 
Apply the peeling device to get P {I2,n > t} h, Q F|A|, 2lt I{l =0} 2λh,n J2(h) + J2(Q) + J2(T πQ) < 2l+1t; s.t. E [gh,Q,T πQ(W)|Dn] 1 n Pn i=1 gh,Q,T πQ(Wi) t + 2λh,n (J2(h) + J2(Q) + J2(T πQ)) + E [gh,Q(W)|Dn] > 1 E [g(W)|Dn] 1 n Pn i=1 g(Wi) 2lt + E [g(W)|Dn] > 1 Here we used the simple fact that if 2λh,n J2(h) + J2(Q) + J2(T πQ) < 2l+1t, then J2(h), J2(Q), and J2(T πQ) are also less than 2lt λh,n , so gh,Q,T πQ Gl. We study the behavior of the lth term of the above summation by verifying the conditions of Lemma 22 similar to what we did in the proof of Lemma 21. It is easy to verify that (A1) and (A2) are satisfied with the choice of K1 = K2 = 4Q2 max. Condition (A3) is satisfied whenever t c1 for some constant c1 > 0 depending on Qmax (the constant can be set to c1 = 2 46082Q2 max). To verify condition (A4), we first require an upper bound on N2(u, Gl, w1:n) for any sequence w1:n. This can be done similar to the proof of Lemma 20: Denote Fl = {f : f F, J2(f) 2lt λh,n }. For gh1,Q1,T πQ1, gh2,Q2,T πQ2 Gl and any sequence w1:n we have i=1 |gh1,Q1,T πQ1(wi) gh2,Q2,T πQ2(wi)|2 12(2 + γ)2Q2 max 1 n h |h1(zi) h2(zi)|2 + 4γ2 Q1(x i, π(x i)) Q2(x i, π(x i)) 2 + |T πQ1(zi) T πQ2(zi)|2 i 12(2 + γ)2Q2 max 1 n h |h1(xi, a) h2(xi, a)|2 + 4γ2 Q1(x i, a) Q2(x i, a) 2 + |T πQ1(xi, a) T πQ2(xi, a)|2 i . Farahmand, Ghavamzadeh, Szepesv ari, and Mannor With the same covering set argument as in the proof of Lemma 20, we get that for any u > 0, 2|A|Qmax u, Gl, w1:n) N2(u, Fl, x1:n)|A| N2(u, Fl, x 1:n)|A| N2(u, Fl, x1:n)|A|. Evoke Assumption A4 to get log N2(u, Gl, w1:n) c(|A|, Qmax) 2lt Plugging this covering number result into condition (A4), one can verify that the condition is satisfied if t c2 nλα h,n , (55) for a constant c2 > 0, which is only a function of Qmax and |A|. Therefore, Lemma 22 indicates that P {I2,n > t} 60 l=0 exp n(2lt)(1/4)(1/2) 128 2304 max{16Q4max, 4Q2max} c3 exp( c4nt). (56) for some constants c3, c4 > 0. 
Combining (53), (54), (55), and (56), we find that there exist c5, c6 > 0 such that for any n N and 0 < δ < 1, we have ˆhn(Q) T πQ 2 ν 4λh,n J2(T πQ) + 2λh,n J2(Q) + c5 1 nλα h,n + c6 ln(1/δ) Here, c5 is only a function of Qmax and |A|, and c6 is a function of Qmax. Appendix D. Proof of Theorem 16 (Empirical Error and Smoothness of ˆhn( ; Q)) To prove Theorem 16, which is a modification of Theorem 10.2 by van de Geer (2000), we first need to modify and specialize Lemma 3.2 by van de Geer (2000) to be suitable to our problem. The modification is required because Q in (33) is a random function in F|A| as opposed to being a fixed function as in Theorem 10.2 of van de Geer (2000). Let us denote z = (x, a) Z = X A and Z = (x, a, R, X ) Z = X A R X with (R, X ) P( , |x, a). Let Dn denote the set {(xi, ai, Ri, X i)}n i=1 of independent random variables. We use zi to refer to (xi, ai) and Z i to refer to (xi, ai, Ri, X i). Let Pn be the probability measure that puts mass 1/n on z1, . . . , zn, i.e., Pn = 1 n Pn i=1 δzi, in which δz is the Dirac s delta function that puts a mass of 1 at z. Denote G : Z R and G : Z R3|A|, which is defined as the set G = n (Q, T πQ, 1) : Q F|A| o Regularized Policy Iteration with Nonparametric Function Spaces with 1 : X A R|A| being a bounded constant function (and not necessarily equal to 1). We use g to denote the supremum norm of functions in G. The supremum norm of vector-valued functions in G is defined by taking the supremum norm over the l -norm of each vector. Similarly, the supremum norm of (g, g ) G G is defined by (g, g ) max{ g , g }. For g G, we define g Pn [ 1 n Pn i=1 g2(zi)]1/2. To simplify the notation, we use the following definition of the inner product: Fix n N. Consider z1, . . . , zn as a set of points in Z, and a real-valued sequence w = (w1, . . . , wn). For a function g G, define w , g n 1 n Pn i=1 wig(zi). 
For any g = (Q, T πQ, 1) G , define the mapping W(g )(x, a, r, x ) : X A R X R by W(g )(x, a, r, x ) = r 1+γQ(x , π(x )) T πQ(x, a). For any fixed g G and i = 1, . . . , n, define the random variables Wi(g ) = W(g )(Z i) and let W(g ) denote the random vector [W1(g ) . . . Wn(g )] . Notice that Wi(g ) can be re-written as Wi(g ) = (Ri r(zi)) + γ(Q(X , π(X )) (P πQ)(zi)), thus for any fixed g , E [Wi(g )] = 0 (i = 1, . . . , n). For notational simplification, we use a b = max{a, b}. Lemma 23 (Modified Lemma 3.2 of van de Geer 2000) Fix the sequence (zi)n i=1 Z and let (Z i)n i=1 Z be the sequence of independent random variables defined as above. Assume that for some constants 0 < R L, it holds that supg G g Pn R, supg G g L, and |Ri| L (1 i n) almost surely. There exists a constant C such that for all 0 ε < δ satisfying ε 28L [log N (u, G G )]1/2du R sup (g,g ) G G i=1 Wi(g )g(zi) 4 exp n(δ ε)2 The main difference between this lemma and Lemma 3.2 of van de Geer (2000) is that the latter provides a maximal inequality for supg G 1 n Pn i=1 Wig(zi), with Wi being random variables that satisfy a certain exponential probability inequality, while our result is a maximal inequality for sup(g,g ) G G 1 n Pn i=1 Wi(g )g(zi), i.e., the random variables Wi(g ) are functions of an arbitrary g G . The current proof requires us to have a condition on the metric entropy w.r.t. the supremum norm (cf. (57)) instead of w.r.t. the empirical L2norm used in Lemma 3.2 of van de Geer (2000). The possibility of relaxing this requirement is an interesting question. We now prove this result. 
Farahmand, Ghavamzadeh, Szepesv ari, and Mannor Proof First, note that for any g1, g2 G, and g 1, g 2 G (with the identification of g with its corresponding Q and T πQ), we have i=1 Wi(g 1)g1(zi) Wi(g 2)g2(zi) = i=1 (Ri r(zi))(g1(zi) g2(zi)) + i=1 γ (Q1(X i, π(X i)) P πQ1(zi)) (Q2(X i, π(X i)) P πQ2(zi)) g1(zi) + i=1 γ(Q2(X i, π(X i)) P πQ2(zi))(g1(zi) g2(zi)) 2L g1 g2 Pn + γR [ Q1 Q2 + P πQ1 P πQ2 ] + 3γL g1 g2 Pn = (2 + 3γ)L g1 g2 Pn + γR Q1 Q2 + R T πQ1 T πQ2 , (58) where we used the boundedness assumptions, the definition of the supremum norm, the norm inequality 1 n Pn i=1 |g1(zi) g2(zi)| g1 g2 Pn, and the fact that |γQ(X i, π(X i)) γP πQ(zi)| = |r(zi) + γQ(X i, π(X i)) T πQ(zi)| (2 + γ)L 3L for any L-bounded Q and T πQ to get the inequality. We used P πQs P πQs 1 = γ 1 T πQs T πQs 1 to get the last equality. Let {(gs j, g j s)}Ns j=1 with Ns = N (2 s R, G G ) be a minimal 2 s R-covering of G G w.r.t. the supremum norm. For any (g, g ) G G , there exists a (gs, (Qs, T πQs, 1)) = (gs, g s) {(gs j, g j s)}Ns j=1 such that (g, g ) (gs, g s) 2 s R. This implies that both Qs Q and T πQs T πQ are smaller than 2 s R as well. Moreover, gs g Pn gs g 2 s R. By (58) we get i=1 Wi(g s)gs(zi) Wi(g )g(zi) [(2 + 3γ)L + (1 + γ)R](2 s R) (3 + 4γ)L(2 s R) Choose S = min{s 1 : 2 s ε 7RL}, which entails that for any (g, g ) G G , the covering set defined by {(g S j , g j S)}NS j=1 approximates the inner product of [g(z1) g(zn)] and W(g ) with an error less than ε. 
So it suffices to prove the exponential inequality for max j=1,...,NS i=1 Wi(g j S)g S j (zi) Regularized Policy Iteration with Nonparametric Function Spaces We use the chaining technique (e.g., see van de Geer 2000) as follows (we choose g0 = 0, so Wi(g 0)g0(zi) = 0 for all 1 i n): i=1 Wi(g S)g S(zi) = 1 Wi(g s)gs(zi) Wi(g s 1)gs 1(zi) = i=1 (Ri r(zi))(gs(zi) gs 1(zi)) + i=1 γ (Qs(X i, π(X i)) (P πQs)(zi)) (Qs 1(X i, π(X i)) (P πQs 1)(zi)) gs(zi) + i=1 γ(Qs 1(X i, π(X i)) (P πQs 1)(zi))(gs(zi) gs 1(zi)) Because each of these summations consists of bounded random variables with expectation zero, we may use Hoeffding s inequality alongside the union bound to upper bound them. To apply Hoeffding s inequality, we require an upper bound on the sum of squared values of random variables involved. To begin, we have |gs(zi) gs 1(zi)| = |gs(zi) g(zi)+g(zi) gs 1(zi)| 2 s R +2 (s 1)R = 3 2 s R. Similarly, both Qs Qs 1 and T πQs T πQs 1 are smaller than 3 2 s R. As a result, for the first term we get (Ri r(zi))(gs(zi) gs 1(zi)) 2 36(RL)22 2s. For the second term we have γ (Qs(X i, π(X i)) (P πQs)(zi)) (Qs 1(X i, π(X i)) (P πQs 1)(zi)) gs(zi) 2 2γ2 h Qs Qs 1 2 + γ 2 T πQs T πQs 1 2 2(1 + γ2)32(2 s R)2R2 36R42 2s, in which we used P πQs P πQs 1 = γ 1 T πQs T πQs 1 . And finally, γ(Qs 1(X i, π(X i)) (P πQs 1)(zi))(gs(zi) gs 1(zi)) 2 (3L)232(2 s R)2 = 92(RL)22 2s, where we used the fact that |γQ(X i, π(X i)) γP πQ(zi)| 3L for any L-bounded Q and T πQ. Let ηs be a sequence of positive real-valued numbers satisfying PS s=1 ηs 1. 
We continue the chaining argument by the use of the union bound and the fact that Ns Ns 1 N2 s to Farahmand, Ghavamzadeh, Szepesv ari, and Mannor sup (g,g ) G G i=1 Wi(g )g(zi) max j=1,...,NS i=1 Wi(g j S)g S j (zi) max (gs,g s) (gs 1,g s 1) i=1 (Ri r(zi))(gs(zi) gs 1(zi)) max (gs,g s) (gs 1,g s 1) i=1 γ Qs(X i, π(X i)) (P πQs)(zi) Qs 1(X i, π(X i)) (P πQs 1)(zi) gs(zi) max (gs,g s) (gs 1,g s 1) i=1 γ(Qs 1(X i, π(X i)) (P πQs 1)(zi))(gs(zi) gs 1(zi)) s=1 Ns Ns 1 exp 2(δ ε)2η2 sn 4 92(RL)22 2s + Ns Ns 1 exp 2(δ ε)2η2 sn 4 92R42 2s Ns Ns 1 exp 2(δ ε)2η2 sn 3 92(RL)22 2s s=1 3 exp 2 log Ns 2(δ ε)2η2 sn 4 92(RL)22 2s Here the max(gs,g s) is over the corresponding covering set {(gs j, g j s)}Ns j=1, which has Ns elements (and the same for s 1). Choose 6RL2 s(log Ns)1/2 n(δ ε) 2 s s Take C in (57) sufficiently large such that n(δ ε) 2 32 s=1 2 s[log N (2 s R, G G )]1/2 72 p 6 log 4 RL. (61) It can be shown that by this choice of ηs and the condition (61), we have PS s=1 ηs 1. From (60), we have log Ns n(δ ε)2η2 s 2 35(RL)22 2s , so P1 in (59) can be upper bounded as follows s=1 3 exp n(δ ε)2η2 s 2 35(RL)22 2s Regularized Policy Iteration with Nonparametric Function Spaces Since ηs 2 s s/8 too, we have s=1 exp n(δ ε)22 2ss 27 35(RL)22 2s s=1 exp n(δ ε)2s 3 exp n(δ ε)2 1 exp n(δ ε)2 27 35(RL)2 4 exp n(δ ε)2 where in the last inequality we used the assumption that n(δ ε) 72 6 log 4 RL (cf. (61)). One can show that (61) is satisfied if ε 28L [log N (u, G G )]1/2du 72 p 6 log 4 RL, so C can be chosen as C = 72 6 log 4. The following lemma, which is built on Lemma 23, is a result on the behavior of the modulus of continuity and will be used in the proof of Theorem 16. This lemma provides a high-probability upper bound on sup(g,g ) G G | W(g ) , g n| g 1 α Pn Jα(g,g ). Here J(g, g ) is a regularizer that is defined on G G and is a pseudo-norm. 
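The chaining argument above controls each level by applying Hoeffding's inequality to averages of bounded zero-mean random variables. As a standalone numerical sanity check of that tail bound, the sketch below (the sample size, threshold, and uniform distribution are our own illustrative choices) verifies that the empirical deviation frequency stays below the Hoeffding bound 2·exp(−2nε²/(b − a)²):

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, eps = 200, 2000, 0.1
b_minus_a = 1.0  # the variables below are bounded in [0, 1]
hoeffding = 2 * np.exp(-2 * n * eps ** 2 / b_minus_a ** 2)

deviations = 0
for _ in range(trials):
    x = rng.uniform(0.0, 1.0, size=n)   # bounded i.i.d. draws with mean 0.5
    if abs(x.mean() - 0.5) > eps:
        deviations += 1

# The observed frequency of large deviations should not exceed the bound.
print(deviations / trials <= hoeffding)
```

With these parameters the bound equals 2e^{−4} ≈ 0.037, while the observed frequency is far smaller, consistent with Hoeffding's inequality being loose but valid.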
This result is similar in spirit to Lemma 8.4 of van de Geer (2000), with two main differences. The first is that here we provide an upper bound on
$$\sup_{(g,g')\in G\times G'} \frac{|\langle W(g'),\, g\rangle_n|}{\|g\|_{P_n}^{1-\alpha}\, J^\alpha(g,g')},$$
whereas in Lemma 8.4 of van de Geer (2000) the upper bound is on $\frac{|\langle W, g\rangle_n|}{\|g\|_{P_n}^{1-\alpha}}$. The normalization by $\|g\|_{P_n}^{1-\alpha} J^\alpha(g,g')$ instead of $\|g\|_{P_n}^{1-\alpha}$ is important to get the right error bound in Theorem 16. The other crucial difference is that here the $W$ are random variables that are functions of $g' \in G'$, while the result of van de Geer (2000) is for $W$ independent of $g'$. The proof technique is inspired by Lemmas 5.13, 5.14, and 8.4 of van de Geer (2000).

Lemma 24 (Modulus of Continuity for Weighted Sums) Fix the sequence $(z_i)_{i=1}^n \subset Z$ and define $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{z_i}$. Let $(Z'_i)_{i=1}^n \subset Z$ be the sequence of independent random variables defined as before. Assume that for some constant $L > 0$, it holds that $\sup_{g\in G}\|g\|_{P_n} \le L$, $\sup_{g'\in G'}\|g'\|_\infty \le L$, and $|R_i| \le L$ ($1\le i\le n$) almost surely. Furthermore, suppose that there exist $0 < \alpha < 1$ and a finite constant $A$ such that for all $u, B > 0$,
$$\log N\Big(u,\, \big\{(g,g')\in G\times G' : J(g,g') \le B\big\}\Big) \le A\left(\frac{B}{u}\right)^{2\alpha}.$$
Then there exists a constant $c > 0$ such that for any $0 < \delta < 1$, we have
$$\sup_{(g,g')\in G\times G'} \frac{|\langle W(g'),\, g\rangle_n|}{\|g\|_{P_n}^{1-\alpha}\, J^\alpha(g,g')} \le c\, L^{1+\alpha}\sqrt{\frac{1+\log(1/\delta)}{n}}$$
with probability at least $1-\delta$.

Proof The proof uses double peeling, i.e., we peel on both $J(g,g')$ and $\|g\|_{P_n}$. Without loss of generality, we assume that $L \ge 1$. We use $c_1, c_2, \ldots > 0$ as constants. First, we start by peeling on $J(g,g')$:
$$\mathbb{P}\left( \sup_{(g,g')\in G\times G'} \frac{|\langle W(g'),\, g\rangle_n|}{\|g\|_{P_n}^{1-\alpha} J^\alpha(g,g')} \ge t \right) \le \sum_{s=0}^{\infty} \mathbb{P}\Bigg( \sup_{\substack{(g,g')\in G\times G' \\ 2^s \mathbb{I}\{s\neq 0\}\, \le\, J(g,g')\, <\, 2^{s+1}}} \frac{|\langle W(g'),\, g\rangle_n|}{\|g\|_{P_n}^{1-\alpha}} \ge \underbrace{t\cdot 2^{\alpha s}}_{\triangleq\, \tau_s} \Bigg).$$
Let us denote each term in the RHS by $\delta_s$. To upper bound $\delta_s$, notice that by assumption $\|g\|_{P_n} \le L$.
For each term, we peel again, this time on $\|g\|_{P_n}$, and apply Lemma 23:
$$\delta_s \le \sum_{r\ge 0} \mathbb{P}\Bigg( \sup_{\substack{(g,g')\in G\times G' \\ J(g,g') < 2^{s+1},\; 2^{-(r+1)}L < \|g\|_{P_n} \le 2^{-r}L}} \frac{|\langle W(g'),\, g\rangle_n|}{\|g\|_{P_n}^{1-\alpha}} \ge \tau_s \Bigg) \le \sum_{r\ge 0} \mathbb{P}\Bigg( \sup_{\substack{(g,g')\in G\times G' \\ J(g,g') < 2^{s+1},\; \|g\|_{P_n} \le 2^{-r}L}} |\langle W(g'),\, g\rangle_n| \ge \tau_s \big(2^{-(r+1)}L\big)^{1-\alpha} \Bigg)$$
$$\le \sum_{r\ge 0} 4\exp\!\left( -\frac{n\big[\tau_s\big(2^{-(r+1)}L\big)^{1-\alpha}\big]^2}{2^7 3^5\, (2^{-r}L)^2} \right) = \sum_{r\ge 0} 4\exp\!\left( -\frac{2^{2r\alpha}\, n\tau_s^2}{2^7 3^5\, 2^{2(1-\alpha)} L^{2\alpha}} \right) \le c_2 \exp\!\left( -\frac{c_1 n\tau_s^2}{L^{2(1+\alpha)}} \right), \tag{62}$$
where the last step uses $L \ge 1$. The second-to-last inequality holds only if the covering number condition in Lemma 23 is satisfied, which is the case whenever $\sqrt{n}\,\tau_s\big(2^{-(r+1)}L\big)^{1-\alpha}$ exceeds the corresponding entropy integral. Substituting $\tau_s = 2^{\alpha s} t$ (63) and solving the integral, we get that the condition is
$$\sqrt{n}\, t\, 2^{\alpha s}\big(2^{-(r+1)}L\big)^{1-\alpha} \ge C\left[ \sqrt{A}\,\big(2^{s+1}\big)^{\alpha}\big(2^{-r}L\big)^{1-\alpha} \vee 2^{-r}L \right],$$
which would be satisfied for
$$t \ge \frac{\sqrt{A}\, 2^{1+\alpha}}{\sqrt{n}} \vee \frac{2^{1-\alpha} C L^{1+\alpha}}{\sqrt{n}} = \frac{c_3\, L^{1+\alpha}}{\sqrt{n}}. \tag{64}$$
Plug (63) into (62) to get
$$\sum_{s=0}^{\infty} c_2 \exp\!\left( -\frac{c_1 n\, t^2\, 2^{2\alpha s}}{L^{2(1+\alpha)}} \right) \le c_4 \exp\!\left( -\frac{c_1 n\, t^2}{L^{2(1+\alpha)}} \right).$$
Solving for $\delta$, we have
$$t \le c_5\, L^{1+\alpha}\sqrt{\frac{\log(1/\delta)}{n}}$$
with probability at least $1-\delta$. This alongside the condition (64) leads to the desired result.

Let us turn to the proof of Theorem 16. The proof is similar to the proof of Theorem 10.2 by van de Geer (2000), but with the modifications necessary to obtain a high-probability upper bound that holds uniformly over $Q$. We discuss the differences in more detail after the proof.

Proof [Proof of Theorem 16] Recall that in the optimization problem, we use $w_i = (X_i, A_i, R_i, X'_i)$ ($i = 1,\dots,n$) to denote the $i$th element of the data set $D_n = \{(X_i,A_i,R_i,X'_i)\}_{i=1}^n$. Also, for a measurable function $f: X\times A\times \mathbb{R}\times X \to \mathbb{R}$, we denote $\|f\|_n^2 = \frac{1}{n}\sum_{i=1}^n |f(w_i)|^2$. We also let $(X,A)\sim\nu$, $R\sim \mathcal{R}(\cdot|X,A)$, and $X'\sim P(\cdot|X,A)$ be random variables that are independent of $D_n$. For any $Q \in F^{|A|}$ and the corresponding $T^\pi Q \in F^{|A|}$, define the mapping $W(Q, T^\pi Q, \mathbf{1}): X\times A\times\mathbb{R}\times X \to \mathbb{R}$ by
$$W(Q, T^\pi Q, \mathbf{1})(X,A,R,X') = R\cdot \mathbf{1} + \gamma\, Q(X',\pi(X')) - T^\pi Q(X,A),$$
in which $\mathbf{1} \in F^{|A|}$ is the constant function defined on $X\times A$ with the value of one. For any fixed $Q$ and $i=1,\dots,n$, define the random variables $W_i(Q) = W(Q,T^\pi Q,\mathbf{1})(X_i,A_i,R_i,X'_i)$, and let $W(Q)$ denote the random vector $[W_1(Q)\ \cdots\ W_n(Q)]^\top$.
Notice that $|W_i(Q)| \le 3Q_{\max}$, and we have $\mathbb{E}\left[W_i(Q)\,|\,Q\right] = 0$ ($i=1,\dots,n$). From the optimizer property of $\hat{h}_n = \hat{h}_n(\cdot\,; Q)$, we have
$$\Big\|\hat{h}_n(Q) - \big[R + \gamma Q(X',\pi(X'))\big]\Big\|_n^2 + \lambda_{h,n} J^2\big(\hat{h}_n(Q)\big) \le \Big\|T^\pi Q - \big[R + \gamma Q(X',\pi(X'))\big]\Big\|_n^2 + \lambda_{h,n} J^2(T^\pi Q).$$
After expanding and rearranging, we get
$$\big\|\hat{h}_n(Q) - T^\pi Q\big\|_n^2 + \lambda_{h,n} J^2\big(\hat{h}_n(Q)\big) \le 2\Big\langle W(Q),\, \hat{h}_n(Q) - T^\pi Q\Big\rangle_n + \lambda_{h,n} J^2(T^\pi Q). \tag{65}$$
We invoke Lemma 24 to upper bound $\big\langle W(Q), \hat{h}_n(Q) - T^\pi Q\big\rangle_n$. The function spaces $G$ and $G'$ in that lemma consist of functions $G: X\times A \to \mathbb{R}$ and $G': X\times A\times\mathbb{R}\times X \to \mathbb{R}^3$ with
$$G = \big\{ h - T^\pi Q : h, Q \in F^{|A|} \big\}, \qquad G' = \big\{ (Q,\, T^\pi Q,\, \mathbf{1}) : Q, T^\pi Q \in F^{|A|} \big\}.$$
All functions in $F^{|A|}$ are $Q_{\max}$-bounded, so the functions in $G$ and $G'$ are bounded by $2Q_{\max}$ and $(Q_{\max}, Q_{\max}, 1)$, respectively. Moreover, for any $g \in G$, $\frac{1}{n}\sum_{i=1}^n |g(X_i,A_i)|^2 \le 4Q_{\max}^2$. So by setting $L$ equal to $2Q_{\max}$ in that lemma, all boundedness conditions are satisfied.

Define $J(g,g') = J(h) + J(Q) + J(T^\pi Q)$ and denote $(G\times G')_B = \big\{(g,g') \in G\times G' : J(g,g') \le B\big\}$. Lemma 24 requires an upper bound on $\log N\big(u, (G\times G')_B\big)$. We relate the metric entropy of this space to that of $F_B = \{f\in F : J(f)\le B\}$, which is specified by Assumption A4. Notice that if $J(g,g')\le B$, each of $J(h)$, $J(Q)$, and $J(T^\pi Q)$ is also less than or equal to $B$. So we have
$$(G\times G')_B = \Big\{ \big(h - T^\pi Q,\; Q,\; T^\pi Q,\; \mathbf{1}\big) : h, Q, T^\pi Q \in F^{|A|},\; J(h)+J(Q)+J(T^\pi Q) \le B \Big\}$$
$$\subseteq \Big\{ h - T^\pi Q : h, T^\pi Q \in F^{|A|},\, J(h)+J(T^\pi Q)\le B \Big\} \times \Big\{ Q \in F^{|A|} : J(Q)\le B \Big\} \times \Big\{ T^\pi Q \in F^{|A|}: J(T^\pi Q)\le B \Big\} \times \{\mathbf{1}\}.$$
Because $J(\cdot)$ is a pseudo-norm, we have $J(h - T^\pi Q) \le J(h) + J(T^\pi Q)$, so
$$\Big\{ h - T^\pi Q : h, T^\pi Q \in F^{|A|},\, J(h)+J(T^\pi Q)\le B \Big\} \subseteq \Big\{ Q \in F^{|A|} : J(Q)\le B \Big\}.$$
As a result, $(G\times G')_B$ is a subset of the product space $\big\{Q\in F^{|A|}: J(Q)\le B\big\}^3$. Therefore, by the usual covering argument, we get that
$$\log N\big(u, (G\times G')_B\big) \le 3\log N\Big(u,\, \big\{Q\in F^{|A|}: J(Q)\le B\big\}\Big).$$
It is easy to see that for finite $|A|$, if $\log N\big(u, \{f\in F: J(f)\le B\}\big) \le C\big(\tfrac{B}{u}\big)^{2\alpha}$, then $\log N\big(u, \{(f_1,\dots,f_{|A|}) \in F^{|A|} : J(f_1,\dots,f_{|A|}) \le B\}\big) \le C_1 \big(\tfrac{B}{u}\big)^{2\alpha}$ (we benefit from the condition $J(Q(\cdot,a)) \le J(Q)$ in Assumption A3; the proof is similar to the proof of Lemma 20 in Appendix E). Here the constant $C_1$ depends on $|A|$. This, along with the previous inequality, shows that for some constant $A > 0$, we have
$$\log N\big(u, (G\times G')_B\big) \le A\left(\frac{B}{u}\right)^{2\alpha}.$$
We are ready to apply Lemma 24 to upper bound the inner product term in (65). Fix $\delta > 0$. To simplify the notation, denote $L_n = \|\hat{h}_n(Q) - T^\pi Q\|_n$, set $t_0 = \sqrt{\frac{1+\log(1/\delta)}{n}}$, and use $\hat{h}_n$ to refer to $\hat{h}_n(Q)$. There exists a constant $c > 0$ such that, with probability at least $1-\delta$, we have
$$L_n^2 + \lambda_{h,n} J^2(\hat{h}_n) \le 2c\, L^{1+\alpha} L_n^{1-\alpha}\big( J(\hat{h}_n) + J(Q) + J(T^\pi Q) \big)^{\alpha} t_0 + \lambda_{h,n} J^2(T^\pi Q). \tag{66}$$
Either the first term in the RHS is larger than the second one, or the second term is larger than the first. We analyze each case separately.

Case 1. $2c L^{1+\alpha} L_n^{1-\alpha}\big(J(\hat{h}_n)+J(Q)+J(T^\pi Q)\big)^\alpha t_0 \ge \lambda_{h,n}J^2(T^\pi Q)$. In this case we have
$$L_n^2 + \lambda_{h,n} J^2(\hat{h}_n) \le 4c\, L^{1+\alpha} L_n^{1-\alpha}\big( J(\hat{h}_n)+J(Q)+J(T^\pi Q) \big)^{\alpha} t_0. \tag{67}$$
Again, two cases might happen:

Case 1.a. $J(\hat{h}_n) > J(Q) + J(T^\pi Q)$: From (67) we have $L_n^2 \le 2^{2+\alpha} c\, L^{1+\alpha} L_n^{1-\alpha} J^\alpha(\hat{h}_n)\, t_0$. Solving for $L_n$, we get that
$$L_n \le 2^{\frac{2+\alpha}{1+\alpha}}\, c^{\frac{1}{1+\alpha}}\, L\, \big[J(\hat{h}_n)\big]^{\frac{\alpha}{1+\alpha}}\, t_0^{\frac{1}{1+\alpha}}.$$
From (67) we also have $\lambda_{h,n}J^2(\hat{h}_n) \le 2^{2+\alpha} c\, L^{1+\alpha}L_n^{1-\alpha}J^\alpha(\hat{h}_n)\, t_0$. Plugging in the recently obtained upper bound on $L_n$ and solving for $J(\hat{h}_n)$, we get that
$$J(\hat{h}_n) \le \frac{2^{2+\alpha} c\, L^{1+\alpha}\, t_0}{\lambda_{h,n}^{\frac{1+\alpha}{2}}}. \tag{68}$$
Substituting this in the upper bound on $L_n$, we get that
$$L_n \le \frac{2^{2+\alpha} c\, L^{1+\alpha}\, t_0}{\lambda_{h,n}^{\alpha/2}}. \tag{69}$$
Case 1.b. $J(\hat{h}_n) \le J(Q) + J(T^\pi Q)$: The upper bound on $J(\hat{h}_n)$ is obvious. From (67) we have $L_n^2 \le 2^{2+\alpha} c\, L^{1+\alpha}L_n^{1-\alpha}\big(J(Q)+J(T^\pi Q)\big)^\alpha t_0$. Solving for $L_n$, we obtain
$$L_n \le 2^{\frac{2+\alpha}{1+\alpha}}\, c^{\frac{1}{1+\alpha}}\, L\, \big(J(Q)+J(T^\pi Q)\big)^{\frac{\alpha}{1+\alpha}}\, t_0^{\frac{1}{1+\alpha}}. \tag{70}$$
Case 2. $2c L^{1+\alpha} L_n^{1-\alpha}\big(J(\hat{h}_n)+J(Q)+J(T^\pi Q)\big)^\alpha t_0 < \lambda_{h,n}J^2(T^\pi Q)$. In this case we have $L_n^2 + \lambda_{h,n}J^2(\hat{h}_n) \le 2\lambda_{h,n}J^2(T^\pi Q)$, which implies that
$$L_n \le \sqrt{2\lambda_{h,n}}\, J(T^\pi Q), \tag{71}$$
$$J(\hat{h}_n) \le \sqrt{2}\, J(T^\pi Q). \tag{72}$$
By (69), (70), and (71) for $L_n$, and by (68), (72), and the condition $J(\hat{h}_n)\le J(Q)+J(T^\pi Q)$ of Case 1.b for $J(\hat{h}_n)$, we have that for any fixed $0<\delta<1$, with probability at least $1-\delta$, the following inequalities hold:
$$\big\|\hat{h}_n(Q) - T^\pi Q\big\|_n \le \max\left\{ \frac{2^{2+\alpha} c\, L^{1+\alpha}\, t_0}{\lambda_{h,n}^{\alpha/2}},\;\; 2^{\frac{2+\alpha}{1+\alpha}} c^{\frac{1}{1+\alpha}} L \big(J(Q)+J(T^\pi Q)\big)^{\frac{\alpha}{1+\alpha}} t_0^{\frac{1}{1+\alpha}},\;\; \sqrt{2\lambda_{h,n}}\, J(T^\pi Q) \right\},$$
$$J\big(\hat{h}_n(Q)\big) \le \max\left\{ \frac{2^{2+\alpha} c\, L^{1+\alpha}\, t_0}{\lambda_{h,n}^{\frac{1+\alpha}{2}}},\;\; J(Q)+J(T^\pi Q),\;\; \sqrt{2}\, J(T^\pi Q) \right\}.$$
Comparing this proof with that of Theorem 10.2 by van de Geer (2000), we see that here we do not normalize the function space $G\times G'$ to ensure that $J(g,g')\le 1$ and then use their Lemma 8.4, which provides a high-probability upper bound on $\sup_{g\in G}\frac{|\langle W, g\rangle_n|}{\|g\|_{P_n}^{1-\alpha}}$. Instead, we directly apply Lemma 24, which upper bounds $\sup_{(g,g')\in G\times G'}\frac{|\langle W(g'),\, g\rangle_n|}{\|g\|_{P_n}^{1-\alpha}J^\alpha(g,g')}$, to the (unnormalized) function space $G\times G'$. If we went through the former approach, in which the normalization is global, the first term in the RHS of (66) would be $L_n^{1-\alpha}\big(J(\hat{h}_n)+J(Q)+J(T^\pi Q)\big)^{1+\alpha}t_0$ instead of the $L_n^{1-\alpha}\big(J(\hat{h}_n)+J(Q)+J(T^\pi Q)\big)^{\alpha}t_0$ of here, which is obtained by local normalization. This extra factor of $J(\hat{h}_n)+J(Q)+J(T^\pi Q)$ would prevent us from getting proper upper bounds on $L_n$ and $J(\hat{h}_n)$ in Case 1.a above. The reason the original proof does not work here is that $W(g')$ is a function of $g'\in G'$.

Appendix E. Proof of Lemma 20 (Covering Number of $G_{\sigma_1,\sigma_2}$)

Here we prove Lemma 20, which relates the covering number of $G_{\sigma_1,\sigma_2}$ to the covering numbers of $F_{\sigma_1}$ and $F_{\sigma_2}$.

Proof [Proof of Lemma 20] Let $g^{Q_1,h_1}, g^{Q_2,h_2} \in G_{\sigma_1,\sigma_2}$. By the definition of $G_{\sigma_1,\sigma_2}$ (41), the functions $Q_1$ and $h_1$ corresponding to $g^{Q_1,h_1}$ satisfy $Q_1 \in F^{|A|}_{\sigma_1}$ and $h_1 \in F^{|A|}_{\sigma_2}$ (and similarly for $Q_2$ and $h_2$). Set $z_i = (x_i,a_i)$. We have
$$\frac{1}{n}\sum_{i=1}^n \big|g^{Q_1,h_1}(z_i) - g^{Q_2,h_2}(z_i)\big|^2 = \frac{1}{n}\sum_{i=1}^n \Big[ \big(Q_1(z_i)-h_1(z_i)\big)^2 - \big(Q_2(z_i)-h_2(z_i)\big)^2 \Big]^2$$
$$\le 16 Q_{\max}^2\, \frac{1}{n}\sum_{i=1}^n \big[ (Q_1(z_i)-Q_2(z_i)) + (h_1(z_i) - h_2(z_i)) \big]^2 \le 32 Q_{\max}^2\, \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^{|A|}\Big[ \big(Q_{1,j}(x_i)-Q_{2,j}(x_i)\big)^2 + \big(h_{1,j}(x_i)-h_{2,j}(x_i)\big)^2 \Big].$$
Assumption A3 implies that for $Q_1, Q_2 \in F^{|A|}_{\sigma_1}$, the functions $Q_{1,j}, Q_{2,j}$ are in $F_{\sigma_1}$, and for $h_1, h_2 \in F^{|A|}_{\sigma_2}$, the functions $h_{1,j}, h_{2,j}$ are in $F_{\sigma_2}$, for all $j = 1,\dots,|A|$.
Therefore, the previous inequality shows that $u$-covers of $F_{\sigma_1}$ (for the $Q_j$) and of $F_{\sigma_2}$ (for the $h_j$), $j = 1,\dots,|A|$, w.r.t. the empirical norms on $x_{1:n}$ define an $8Q_{\max}\sqrt{|A|}\,u$-cover of $G_{\sigma_1,\sigma_2}$ w.r.t. the empirical norm on $z_{1:n}$. Thus,
$$N_2\Big(8Q_{\max}\sqrt{|A|}\,u,\; G_{\sigma_1,\sigma_2},\; (x,a)_{1:n}\Big) \le N_2\big(u, F_{\sigma_1}, x_{1:n}\big)^{|A|}\, N_2\big(u, F_{\sigma_2}, x_{1:n}\big)^{|A|}.$$
Assumption A4 then implies that for a constant $c_1$, independent of $u$, $|A|$, $Q_{\max}$, and $\alpha$, and for all $\big((x_1,a_1),\dots,(x_n,a_n)\big) \in X\times A$, we have
$$\log N_2\big(u, G_{\sigma_1,\sigma_2}, (x,a)_{1:n}\big) \le c_1\, |A|^{1+\alpha}\, Q_{\max}^{2\alpha}\, \big(\sigma_1^\alpha + \sigma_2^\alpha\big)\, u^{-2\alpha}.$$

Appendix F. Convolutional MDPs and Assumption A7

In this appendix, we show that Assumption A7 holds for a certain class of MDPs. This class consists of one-dimensional MDPs in which the increment of the next state $X'$ relative to the current state $X$ is a function of the chosen action only, i.e., $X' - X \sim W(\pi(X))$.

Proposition 25 Suppose that $X = [-\pi, \pi]$ is the unit circle, $F$ is the Sobolev space $W^k(X) = W^{k,2}(X)$, and $J(\cdot)$ is defined as the corresponding norm $\|\cdot\|_{W^{k,2}}$. For a function $f\in F$, let $\hat f(n)$ be the $n$th Fourier coefficient, i.e., $\hat f(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\, e^{-jnx}\,dx$. Consider the MDPs that have a convolutional transition probability kernel, that is, for any policy $\pi$ and $V\in F$, there exists $K_\pi(x,y) = K_\pi(x-y)$ such that
$$\int_X P(dy|x,\pi(x))\, V(y) = \int_X K_\pi(x-y)\, V(y)\,dy = K_\pi * V.$$
Moreover, assume that $K_\pi, V \in L^1(X)$. For a given policy $\pi$, let $r_\pi(x) = r(x,\pi(x))$ ($x\in X$). Assumption A7 is then satisfied with the choice of $L_R = \sup_\pi \|r_\pi\|_{W^{k,2}}$ and $L_P = \sup_\pi \max_n |\hat K_\pi(n)|$.

Proof By the convolution theorem, $\widehat{(K_\pi * V)}(n) = \hat K_\pi(n)\, \hat V(n)$. It is also known that for $V\in F$, we have $\|V\|_{W^{k,2}}^2 = \sum_{n=-\infty}^{\infty} \big(1+|n|^2\big)^k |\hat V(n)|^2$. Thus,
$$\|K_\pi * V\|_{W^{k,2}}^2 = \sum_{n=-\infty}^{\infty} \big(1+|n|^2\big)^k |\hat K_\pi(n)|^2 |\hat V(n)|^2 \le \Big[\max_n |\hat K_\pi(n)|^2\Big] \sum_{n=-\infty}^{\infty} \big(1+|n|^2\big)^k |\hat V(n)|^2 = \Big[\max_n |\hat K_\pi(n)|^2\Big]\, \|V\|_{W^{k,2}}^2.$$
Therefore, $\|T^\pi V\|_{W^{k,2}} \le \|r_\pi\|_{W^{k,2}} + \gamma \big[\max_n |\hat K_\pi(n)|\big] \|V\|_{W^{k,2}}$. Taking the supremum over all policies finishes the proof.

The interpretation of $\max_n |\hat K_\pi(n)|$ is the maximum gain of the linear filter $K_\pi$ that is applied to a value function $V$.
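The convolution-theorem bound in Proposition 25 can be checked numerically on a discretized circle. This is a minimal sketch; the kernel, value function, grid size, and smoothness order below are illustrative choices, not from the paper:

```python
import numpy as np

def sobolev_norm(fhat, k):
    """Discrete analogue of ||f||_{W^{k,2}} = sqrt(sum_n (1+|n|^2)^k |fhat(n)|^2)."""
    n = np.fft.fftfreq(len(fhat)) * len(fhat)  # integer frequencies
    return np.sqrt(np.sum((1 + np.abs(n) ** 2) ** k * np.abs(fhat) ** 2))

N, k = 256, 2
x = np.linspace(-np.pi, np.pi, N, endpoint=False)
V = np.cos(3 * x) + 0.5 * np.sin(x)      # a smooth "value function" (illustrative)
K = np.exp(-x ** 2)                      # a smooth convolution kernel (illustrative)
Vhat = np.fft.fft(V) / N                 # approximate Fourier coefficients of V
Khat = np.fft.fft(K) * (2 * np.pi / N)   # transfer function of the filter K

# Convolution theorem: (K * V)^hat(n) = Khat(n) * Vhat(n), so the Sobolev norm
# of K * V is bounded by the maximum filter gain times ||V||_{W^{k,2}}.
lhs = sobolev_norm(Khat * Vhat, k)
rhs = np.max(np.abs(Khat)) * sobolev_norm(Vhat, k)
assert lhs <= rhs + 1e-12
```

The inequality holds term by term in the frequency domain, which is exactly the argument of the proof: each Fourier coefficient of $V$ is scaled by at most the maximum gain of the filter.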
The gain here is explicitly written in the frequency domain.

Appendix G. The Metric Entropy and the Covering Number

Definition 26 (Definition 9.3 of Györfi et al. 2002) Let $\varepsilon > 0$, let $F$ be a set of real-valued functions defined on $X$, and let $\nu_X$ be a probability measure on $X$. Every finite collection $N_\varepsilon = \{f_1,\dots,f_{N_\varepsilon}\}$ of functions defined on $X$ with the property that for every $f\in F$ there is a function $f'\in N_\varepsilon$ such that $\|f - f'\|_{p,\nu_X} < \varepsilon$ is called an $\varepsilon$-cover of $F$ w.r.t. $\|\cdot\|_{p,\nu_X}$. Let $N(\varepsilon, F, \|\cdot\|_{p,\nu_X})$ be the size of the smallest $\varepsilon$-cover of $F$ w.r.t. $\|\cdot\|_{p,\nu_X}$. If no finite $\varepsilon$-cover exists, take $N(\varepsilon, F, \|\cdot\|_{p,\nu_X}) = \infty$. Then $N(\varepsilon, F, \|\cdot\|_{p,\nu_X})$ is called the $\varepsilon$-covering number of $F$, and $\log N(\varepsilon, F, \|\cdot\|_{p,\nu_X})$ is called the metric entropy of $F$ w.r.t. the same norm. The $\varepsilon$-covering number of $F$ w.r.t. the supremum norm is denoted by $N_\infty(\varepsilon, F)$. For $x_{1:n} = (x_1,\dots,x_n) \in X^n$, one may also define the empirical measure $\nu_{X,n}(A) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{x_i \in A\}$ for $A \subseteq X$. This leads to the empirical covering number of $F$ w.r.t. the empirical norm $\|\cdot\|_{p,x_{1:n}}$, denoted by $N_p(\varepsilon, F, x_{1:n})$. If $X_{1:n} = (X_1,\dots,X_n)$ is a sequence of random variables, the covering number $N_p(\varepsilon, F, X_{1:n})$ is a random variable too.

References

András Antos, Rémi Munos, and Csaba Szepesvári. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems (NIPS-20), pages 9–16. MIT Press, 2008a.

András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89–129, 2008b.

Bernardo Ávila Pires and Csaba Szepesvári. Statistical linear estimation with penalized estimators: an application to reinforcement learning. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Philip Bachman, Amir-massoud Farahmand, and Doina Precup. Sample-based approximate regularization.
In Proceedings of the 31st International Conference on Machine Learning (ICML), volume 32 of JMLR: W&CP, 2014.

Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning (ICML), pages 30–37. Morgan Kaufmann, 1995.

André M. S. Barreto, Doina Precup, and Joelle Pineau. Reinforcement learning using kernel-based stochastic factorization. In Advances in Neural Information Processing Systems (NIPS-24), pages 720–728, 2011.

André M. S. Barreto, Doina Precup, and Joelle Pineau. On-line reinforcement learning using incremental kernel-based stochastic factorization. In Advances in Neural Information Processing Systems (NIPS-25), pages 1484–1492, 2012.

Rick Beatson and Leslie Greengard. A short course on fast multipole methods. In Wavelets, Multilevel Methods and Elliptic PDEs, pages 1–37. Oxford University Press, 1997.

Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research (JMLR), 7:2399–2434, 2006.

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems (NIPS-19), pages 137–144. MIT Press, 2006.

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.

Dimitri P. Bertsekas and Steven E. Shreve. Stochastic Optimal Control: The Discrete-Time Case. Academic Press, 1978.

Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, and Klaus Obermayer.
Construction of approximation spaces for reinforcement learning. Journal of Machine Learning Research (JMLR), 14:2067–2118, 2013.

Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS-20), pages 161–168. MIT Press, 2008.

Steven J. Bradtke and Andrew G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.

Corinna Cortes, Mehryar Mohri, and Andrés Muñoz Medina. Adaptation algorithm and theory based on generalized discrepancy. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 169–178, 2015.

Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research (JMLR), 15:809–883, 2014.

Ronald A. DeVore. Nonlinear approximation. Acta Numerica, 7:51–150, 1998.

Paul Doukhan. Mixing: Properties and Examples, volume 85 of Lecture Notes in Statistics. Springer-Verlag, Berlin, 1994.

Bradley Efron, Trevor Hastie, Iain M. Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 201–208. ACM, 2005.

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research (JMLR), 6:503–556, 2005.

Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1–50, 1999.

Amir-massoud Farahmand. Action-gap phenomenon in reinforcement learning. In Advances in Neural Information Processing Systems (NIPS-24), pages 172–180. Curran Associates, Inc., 2011a.

Amir-massoud Farahmand. Regularization in Reinforcement Learning.
PhD thesis, University of Alberta, 2011b.

Amir-massoud Farahmand and Doina Precup. Value pursuit iteration. In Advances in Neural Information Processing Systems (NIPS-25), pages 1349–1357. Curran Associates, Inc., 2012.

Amir-massoud Farahmand and Csaba Szepesvári. Model selection in reinforcement learning. Machine Learning, 85(3):299–332, 2011.

Amir-massoud Farahmand and Csaba Szepesvári. Regularized least-squares regression: Learning from a β-mixing sequence. Journal of Statistical Planning and Inference, 142(2):493–505, 2012.

Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized fitted Q-iteration: Application to planning. In Recent Advances in Reinforcement Learning, 8th European Workshop (EWRL), volume 5323 of Lecture Notes in Computer Science, pages 55–68. Springer, 2008.

Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian Decision Problems. In Proceedings of the American Control Conference (ACC), pages 725–730, June 2009a.

Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized policy iteration. In Advances in Neural Information Processing Systems (NIPS-21), pages 441–448. MIT Press, 2009b.

Amir-massoud Farahmand, Rémi Munos, and Csaba Szepesvári. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems (NIPS-23), pages 568–576, 2010.

Amir-massoud Farahmand, Doina Precup, Mohammad Ghavamzadeh, and André M. S. Barreto. Classification-based approximate policy iteration. IEEE Transactions on Automatic Control, 60(11):2989–2993, November 2015.

Matthieu Geist and Bruno Scherrer. ℓ1-penalized projected Bellman residual.
In Recent Advances in Reinforcement Learning, volume 7188 of Lecture Notes in Computer Science, pages 89–101. Springer Berlin Heidelberg, 2012.

Alborz Geramifard, Finale Doshi, Joshua Redding, Nicholas Roy, and Jonathan How. Online discovery of feature dependencies. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 881–888. ACM, 2011.

Alborz Geramifard, Thomas J. Walsh, Nicholas Roy, and Jonathan P. How. Batch iFDD: A scalable matching pursuit algorithm for solving MDPs. In 29th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 2013.

Mohammad Ghavamzadeh, Alessandro Lazaric, Rémi Munos, and Matthew Hoffman. Finite-sample analysis of Lasso-TD. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1177–1184. ACM, 2011.

Grigori K. Golubev and Michael Nussbaum. A risk bound in Sobolev class regression. The Annals of Statistics, 18(2):758–778, 1990.

László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York, 2002.

Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 1970.

Matthew W. Hoffman, Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Regularized least squares temporal difference learning with nested ℓ1 and ℓ2 penalization. In Recent Advances in Reinforcement Learning, volume 7188 of Lecture Notes in Computer Science, pages 102–114. Springer Berlin Heidelberg, 2012.

Jeff Johns. Basis Construction and Utilization for Markov Decision Processes using Graphs. PhD thesis, University of Massachusetts Amherst, 2010.

Jeff Johns, Christopher Painter-Wakefield, and Ronald Parr. Linear complementarity for regularized policy evaluation and improvement.
In Advances in Neural Information Processing Systems (NIPS-23), pages 1009–1017, 2010.

Tobias Jung and Daniel Polani. Least squares SVM for least squares TD learning. In Proceedings of the 17th European Conference on Artificial Intelligence (ECAI), pages 499–503, 2006.

Michael Kohler. Inequalities for uniform deviations of averages from expectations with applications to nonparametric regression. Journal of Statistical Planning and Inference, 89:1–23, 2000.

J. Zico Kolter and Andrew Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 521–528. ACM, 2009.

George Konidaris, Sarah Osentoski, and Philip Thomas. Value function approximation in reinforcement learning using the Fourier basis. In AAAI Conference on Artificial Intelligence, 2011.

Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research (JMLR), 4:1107–1149, 2003.

Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research (JMLR), 13:3041–3074, October 2012.

Bo Liu, Sridhar Mahadevan, and Ji Liu. Regularized off-policy TD-learning. In Advances in Neural Information Processing Systems (NIPS-25), pages 845–853, 2012.

Manuel Loth, Manuel Davy, and Philippe Preux. Sparse temporal difference learning using LASSO. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 352–359, 2007.

Sridhar Mahadevan and Bo Liu. Basis construction from power series expansions of value functions. In Advances in Neural Information Processing Systems (NIPS-23), pages 1540–1548, 2010.

Sridhar Mahadevan and Mauro Maggioni.
Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research (JMLR), 8:2169–2231, 2007.

Stéphane Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009.

Mahdi Milani Fard, Yuri Grinberg, Amir-massoud Farahmand, Joelle Pineau, and Doina Precup. Bellman error based feature generation using random projections on sparse spaces. In Advances in Neural Information Processing Systems (NIPS-26), pages 3030–3038, 2013.

Rémi Munos. Error bounds for approximate policy iteration. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 560–567, 2003.

Rémi Munos. Performance bounds in Lp norm for approximate value iteration. SIAM Journal on Control and Optimization, pages 541–561, 2007.

Boaz Nadler, Nathan Srebro, and Xueyuan Zhou. Semi-supervised learning with the graph Laplacian: The limit of infinite unlabelled data. In Advances in Neural Information Processing Systems (NIPS-22), pages 1330–1338, 2009.

Michael Nussbaum. Spline smoothing in regression models and asymptotic efficiency in L2. The Annals of Statistics, 13(3):984–997, 1985.

Michael Nussbaum. Minimax risk: Pinsker bound. In Encyclopedia of Statistical Sciences, volume 3, pages 451–460, 1999.

Dirk Ormoneit and Saunak Sen. Kernel-based reinforcement learning. Machine Learning, 49:161–178, 2002.

Christopher Painter-Wakefield and Ronald Parr. Greedy algorithms for sparse reinforcement learning. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Ronald Parr, Christopher Painter-Wakefield, Lihong Li, and Michael Littman.
Analyzing feature generation for value-function approximation. In Proceedings of the 24th International Conference on Machine Learning (ICML), pages 737–744. ACM, 2007.

Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers, pages 40–44, 1993.

Marek Petrik. An analysis of Laplacian methods for value function approximation in MDPs. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2574–2579, 2007.

Zhiwei Qin, Weichang Li, and Firdaus Janoos. Sparse reinforcement learning via convex optimization. In Proceedings of the 31st International Conference on Machine Learning (ICML), volume 32 of JMLR: W&CP, 2014.

Stéphane Ross, Geoffrey Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 627–635, 2011.

Paul-Marie Samson. Concentration of measure inequalities for Markov chains and φ-mixing processes. The Annals of Probability, 28(1):416–461, 2000.

Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In COLT'01/EuroCOLT'01: Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory, pages 416–426. Springer-Verlag, 2001.

Paul J. Schweitzer and Abraham Seidmann. Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110:568–582, 1985.

Shai Shalev-Shwartz and Nathan Srebro. SVM optimization: Inverse dependence on training set size.
In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 928–935. ACM, 2008.

Steve Smale and Ding-Xuan Zhou. Estimating the approximation error in learning theory. Analysis and Applications, 1(1):17–41, 2003.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008.

Ingo Steinwart, Don Hush, and Clint Scovel. Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009.

Charles J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4):1040–1053, 1982.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 993–1000. ACM, 2009.

Csaba Szepesvári. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010.

Gavin Taylor and Ronald Parr. Kernelized value function approximation for reinforcement learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 1017–1024. ACM, 2009.

Hans Triebel. Theory of Function Spaces III. Springer, 2006.

Alexander B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

Sara A. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, Theory and Applications, chapter 5, pages 210–268, 2012.

Larry Wasserman. All of Nonparametric Statistics.
Springer, 2007.

Ronald J. Williams and Leemon C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. In Proceedings of the Tenth Yale Workshop on Adaptive and Learning Systems, 1994.

Xin Xu, Dewen Hu, and Xicheng Lu. Kernel-based least squares policy iteration for reinforcement learning. IEEE Transactions on Neural Networks, 18:973–992, 2007.

Changjiang Yang, Ramani Duraiswami, and Larry Davis. Efficient kernel machines using the improved fast Gauss transform. In Advances in Neural Information Processing Systems (NIPS-17), pages 1561–1568. MIT Press, 2004.

Yuhong Yang and Andrew R. Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564–1599, 1999.

Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, January 1994.

Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research (JMLR), 2:527–550, 2002.

Ding-Xuan Zhou. The covering number in learning theory. Journal of Complexity, 18(3):739–767, 2002.

Ding-Xuan Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Transactions on Information Theory, 49:1743–1752, 2003.