Journal of Machine Learning Research 17 (2016) 1-66. Submitted 1/13; Revised 12/15; Published 9/16

Regularized Policy Iteration with Nonparametric Function Spaces

Amir-massoud Farahmand (farahmand@merl.com), Mitsubishi Electric Research Laboratories (MERL), 201 Broadway, 8th Floor, Cambridge, MA 02139, USA
Mohammad Ghavamzadeh (ghavamza@adobe.com), Adobe Research, 321 Park Avenue, San Jose, CA 95110, USA
Csaba Szepesvári (szepesva@ualberta.ca), Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada
Shie Mannor (shie@ee.technion.ac.il), Department of Electrical Engineering, The Technion, Haifa 32000, Israel

Editor: Peter Auer

Abstract

We study two regularization-based approximate policy iteration algorithms, namely REG-LSPI and REG-BRM, to solve reinforcement learning and planning problems in discounted Markov Decision Processes with large state and finite action spaces. The core of these algorithms is the regularized extension of Least-Squares Temporal Difference (LSTD) learning and of Bellman Residual Minimization (BRM), which are used in the algorithms' policy evaluation steps. Regularization provides a convenient way to control the complexity of the function space to which the estimated value function belongs, and as a result enables us to work with rich nonparametric function spaces. We derive efficient implementations of our methods when the function space is a reproducing kernel Hilbert space. We analyze the statistical properties of REG-LSPI and provide an upper bound on the policy evaluation error and the performance loss of the policy returned by this method. Our bound shows the dependence of the loss on the number of samples, the capacity of the function space, and some intrinsic properties of the underlying Markov Decision Process. The dependence of the policy evaluation bound on the number of samples is minimax optimal.
This is the first work that provides such a strong guarantee for a nonparametric approximate policy iteration algorithm.¹

Keywords: reinforcement learning, approximate policy iteration, regularization, nonparametric method, finite-sample analysis

1. This work is an extension of the NIPS 2008 conference paper by Farahmand et al. (2009b).

©2016 Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, Shie Mannor.

1. Introduction

We study the approximate policy iteration (API) approach to finding a close-to-optimal policy in a Markov Decision Process (MDP), either in a reinforcement learning (RL) or in a planning scenario. The basis of API, which is explained in Section 3, is the policy iteration algorithm, which iteratively evaluates a policy (i.e., finds the value function of the policy; the policy evaluation step) and then improves it (i.e., computes the greedy policy with respect to (w.r.t.) the recently obtained value function; the policy improvement step). When the state space is large (e.g., a subset of $\mathbb{R}^d$ or a finite state space that has too many states to be exactly represented), the policy evaluation step cannot be performed exactly, and as a result the use of function approximation is inevitable. The appropriate choice of the function approximation method, however, is far from trivial. The best choice is problem-dependent, and it also depends on the number of samples in the input data. In this paper we propose a nonparametric regularization-based approach to API. This approach provides a flexible and easy-to-implement policy evaluation step for API.
The advantage of nonparametric methods over parametric methods is their flexibility: Whereas a parametric model, which has a fixed and finite parameterization, limits the range of functions that can be represented, irrespective of the number of samples, nonparametric models avoid such undue restrictions by increasing the power of the function approximation as necessary. Moreover, the regularization-based approach to nonparametrics is elegant and powerful: It has a simple algorithmic form, and the estimator achieves minimax optimal rates in a number of scenarios. Further discussion of and specific results about nonparametric methods, particularly in the supervised learning scenario, can be found in the books by Györfi et al. (2002) and Wasserman (2007). Nonparametric approaches to solving RL/planning problems have received some attention in the RL community. For instance, Petrik (2007); Mahadevan and Maggioni (2007); Parr et al. (2007); Mahadevan and Liu (2010); Geramifard et al. (2011); Farahmand and Precup (2012); Böhmer et al. (2013) and Milani Fard et al. (2013) suggest methods to generate data-dependent basis functions, to be used in general linear models. Ormoneit and Sen (2002) use a smoothing-kernel-based estimate of the model and then use value iteration to find the value function. Barreto et al. (2011, 2012) use a stochastic factorization trick to provide computationally efficient ways to scale up the approach of Ormoneit and Sen (2002). In the context of approximate value iteration, Ernst et al. (2005) consider growing ensembles of trees to approximate the value function. In addition, there have been some works in which regularization methods have been applied to RL/planning problems, e.g., Engel et al. (2005); Jung and Polani (2006); Loth et al. (2007); Farahmand et al. (2009a,b); Taylor and Parr (2009); Kolter and Ng (2009); Johns et al. (2010); Ghavamzadeh et al.
(2011); Farahmand (2011b); Avila Pires and Szepesvári (2012); Hoffman et al. (2012); Geist and Scherrer (2012). Nevertheless, most of these papers are algorithmic results and do not analyze the statistical properties of these methods (the exceptions are Farahmand et al. 2009a,b; Farahmand 2011b; Ghavamzadeh et al. 2011; Avila Pires and Szepesvári 2012). We compare these methods with ours in more detail in Sections 5.3.1 and 6. It is worth mentioning that one might use a regularized estimator alongside a feature generation approach to control the complexity of the function space induced by the features. An alternative to regularization for controlling the complexity of a function space is to use greedy algorithms, such as Matching Pursuit (Mallat and Zhang, 1993) and Orthogonal Matching Pursuit (Pati et al., 1993), to select features from a large set of features. Greedy algorithms have recently been developed for value function estimation by Johns (2010); Painter-Wakefield and Parr (2012); Farahmand and Precup (2012); Geramifard et al. (2013). We do not discuss these methods any further.

1.1 Contributions

The algorithmic contribution of this work is to introduce two regularization-based nonparametric approximate policy iteration algorithms, namely Regularized Least-Squares Policy Improvement (REG-LSPI) and Regularized Bellman Residual Minimization (REG-BRM). These are flexible methods that, upon the proper selection of their parameters, are sample efficient. Each of REG-BRM and REG-LSPI is formulated as two coupled regularized optimization problems (Section 4). As we argue in Section 4.1, having a regularized objective in both optimization problems is necessary for rich nonparametric function spaces.
Despite the unusual coupled formulation of the underlying optimization problems, we prove that the solutions can be computed in closed form when the estimated action-value function belongs to the family of reproducing kernel Hilbert spaces (RKHS) (Section 4.2). The theoretical contribution of this work (Section 5) is to analyze the statistical properties of REG-LSPI and to provide upper bounds on the policy evaluation error and the performance difference between the optimal policy and the policy returned by this method (Theorem 14). The result demonstrates the dependence of the bounds on the number of samples, the capacity of the function space to which the estimated action-value function belongs, and some intrinsic properties of the MDP. It turns out that the dependence of the policy evaluation error bound on the number of samples is minimax optimal. This paper, alongside its conference (Farahmand et al., 2009b) and dissertation (Farahmand, 2011b) versions, is the first work that analyzes a nonparametric regularized API algorithm and provides such a strong guarantee for it.

2. Background and Notation

In the first part of this section, we provide a brief summary of some of the concepts and definitions from the theory of MDPs and RL (Section 2.1). For more information, the reader is referred to Bertsekas and Shreve (1978); Bertsekas and Tsitsiklis (1996); Sutton and Barto (1998); Szepesvári (2010). In addition to this background on MDPs, we introduce the notation we use to denote function spaces and their corresponding norms (Section 2.2), as well as the considered learning problem (Section 2.3).

2.1 Markov Decision Processes

For a space $\Omega$ with a $\sigma$-algebra $\sigma_\Omega$, we define $\mathcal{M}(\Omega)$ as the set of all probability measures over $\sigma_\Omega$. We let $B(\Omega)$ denote the space of bounded measurable functions w.r.t. $\sigma_\Omega$, and we denote by $B(\Omega, L)$ the space of measurable functions bounded by $0 < L < \infty$.
Definition 1 A finite-action discounted MDP is a 4-tuple $(\mathcal{X}, \mathcal{A}, P, \gamma)$, where $\mathcal{X}$ is a measurable state space, $\mathcal{A}$ is a finite set of actions, $P : \mathcal{X} \times \mathcal{A} \to \mathcal{M}(\mathbb{R} \times \mathcal{X})$ is a mapping with domain $\mathcal{X} \times \mathcal{A}$, and $0 \le \gamma < 1$ is a discount factor. The mapping $P$ evaluated at $(x, a) \in \mathcal{X} \times \mathcal{A}$ gives a distribution over $\mathbb{R} \times \mathcal{X}$, which we shall denote by $P(\cdot, \cdot \,|\, x, a)$. We denote the marginals of $P$ by the overloaded symbol $P : \mathcal{X} \times \mathcal{A} \to \mathcal{M}(\mathcal{X})$, defined as $P(\cdot|x,a) = P_{x,a}(\cdot) = \int_{\mathbb{R}} P(dr, \cdot \,|\, x, a)$ (transition probability kernel), and by $\mathcal{R} : \mathcal{X} \times \mathcal{A} \to \mathcal{M}(\mathbb{R})$, defined as $\mathcal{R}(\cdot|x,a) = \int_{\mathcal{X}} P(\cdot, dy \,|\, x, a)$ (reward distribution).

An MDP together with an initial distribution $P_1$ of states encodes the laws governing the temporal evolution of a discrete-time stochastic process controlled by an agent as follows: The controlled process starts at time $t = 1$ with random initial state $X_1 \sim P_1$ (here and in what follows $X \sim Q$ denotes that the random variable $X$ is drawn from distribution $Q$). At stage $t$, action $A_t \in \mathcal{A}$ is selected by the agent controlling the process. In response, the pair $(R_t, X_{t+1})$ is drawn from $P(\cdot, \cdot \,|\, X_t, A_t)$, i.e., $(R_t, X_{t+1}) \sim P(\cdot, \cdot \,|\, X_t, A_t)$, where $R_t$ is the reward that the agent receives at time $t$ and $X_{t+1}$ is the state at time $t+1$. The process then repeats with the agent selecting action $A_{t+1}$, etc. In general, the agent can use all past states, actions, and rewards in deciding about its current action. However, for our purposes it will suffice to consider action-selection procedures, or policies, that select an action deterministically and time-invariantly, solely based on the current state:

Definition 2 (Deterministic Markov Stationary Policy) A measurable mapping $\pi : \mathcal{X} \to \mathcal{A}$ is called a deterministic Markov stationary policy, or just policy in short. Following a policy $\pi$ in an MDP means that at each time step $t$ it holds that $A_t = \pi(X_t)$.

Policy $\pi$ induces the transition probability kernel $P^\pi : \mathcal{X} \times \mathcal{A} \to \mathcal{M}(\mathcal{X} \times \mathcal{A})$ defined as follows: For a measurable subset $C$ of $\mathcal{X} \times \mathcal{A}$, let $(P^\pi)(C|x,a) \triangleq \int P(dy|x,a)\,\mathbb{I}\{(y, \pi(y)) \in C\}$.
The $m$-step transition probability kernels $(P^\pi)^m : \mathcal{X} \times \mathcal{A} \to \mathcal{M}(\mathcal{X} \times \mathcal{A})$ for $m = 2, 3, \dots$ are defined inductively by $(P^\pi)^m(C|x,a) \triangleq \int_{\mathcal{X}} P(dy|x,a)\,(P^\pi)^{m-1}(C|y, \pi(y))$. Also, given a probability transition kernel $P : \mathcal{X} \times \mathcal{A} \to \mathcal{M}(\mathcal{X} \times \mathcal{A})$, we define the right-linear operator $P : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X} \times \mathcal{A})$ by $(PQ)(x,a) \triangleq \int_{\mathcal{X} \times \mathcal{A}} P(dy, da'|x,a)\, Q(y, a')$. For a probability measure $\rho \in \mathcal{M}(\mathcal{X} \times \mathcal{A})$ and a measurable subset $C$ of $\mathcal{X} \times \mathcal{A}$, we define the left-linear operator $P : \mathcal{M}(\mathcal{X} \times \mathcal{A}) \to \mathcal{M}(\mathcal{X} \times \mathcal{A})$ by $(\rho P)(C) = \int \rho(dx, da)\, P(dy, da'|x,a)\, \mathbb{I}\{(y, a') \in C\}$.

To study MDPs, two auxiliary functions are of central importance: the value and the action-value functions of a policy $\pi$.

Definition 3 (Value Functions) For a policy $\pi$, the value function $V^\pi$ and the action-value function $Q^\pi$ are defined as follows: Let $(R_t;\, t \ge 1)$ be the sequence of rewards when the Markov chain is started from a state $X_1$ (or state-action $(X_1, A_1)$ for the action-value function) drawn from a positive probability distribution over $\mathcal{X}$ (or $\mathcal{X} \times \mathcal{A}$) and the agent follows policy $\pi$. Then,

$$V^\pi(x) \triangleq \mathbb{E}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} R_t \,\middle|\, X_1 = x \right] \quad \text{and} \quad Q^\pi(x,a) \triangleq \mathbb{E}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} R_t \,\middle|\, X_1 = x, A_1 = a \right].$$

It is easy to see that for any policy $\pi$, if the magnitude of the immediate expected reward $r(x,a) = \int r\, P(dr, dy|x,a)$ is uniformly bounded by $R_{\max}$, then the functions $V^\pi$ and $Q^\pi$ are bounded by $V_{\max} = Q_{\max} = R_{\max}/(1-\gamma)$, independent of the choice of $\pi$. For a discounted MDP, we define the optimal value and optimal action-value functions by $V^*(x) = \sup_\pi V^\pi(x)$ for all states $x \in \mathcal{X}$ and $Q^*(x,a) = \sup_\pi Q^\pi(x,a)$ for all state-action pairs $(x,a) \in \mathcal{X} \times \mathcal{A}$. We say that a policy $\pi$ is optimal if it achieves the best values in every state, i.e., if $V^\pi = V^*$. We say that a policy $\pi$ is greedy w.r.t. an action-value function $Q$ if $\pi(x) = \operatorname{argmax}_{a \in \mathcal{A}} Q(x,a)$ for all $x \in \mathcal{X}$. We define the function $\hat\pi(x; Q) \triangleq \operatorname{argmax}_{a \in \mathcal{A}} Q(x,a)$ (for all $x \in \mathcal{X}$) that returns a greedy policy of an action-value function $Q$ (if there exist multiple maximizers, a maximizer is chosen in an arbitrary deterministic manner).
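For a finite MDP, Definition 3 has a convenient closed form: $V^\pi$ is the unique fixed point of $V = r^\pi + \gamma P^\pi V$ (anticipating Definition 4), so it can be obtained by a linear solve. The following sketch uses a made-up 2-state, 2-action MDP purely for illustration; all numbers are assumptions, not taken from the paper.

```python
import numpy as np

gamma = 0.9
# Hypothetical toy MDP: P[a, x, y] = P(y | x, a), r[x, a] = expected reward.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.5, 0.5]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])

pi = np.array([0, 1])  # deterministic Markov stationary policy: state -> action

# Induced transition matrix P^pi and reward vector r^pi.
P_pi = np.array([P[pi[x], x] for x in range(2)])
r_pi = np.array([r[x, pi[x]] for x in range(2)])

# V^pi = (I - gamma P^pi)^{-1} r^pi, the unique fixed point of T^pi.
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Sanity check: V^pi satisfies the Bellman equation V^pi = r^pi + gamma P^pi V^pi.
assert np.allclose(V_pi, r_pi + gamma * P_pi @ V_pi)
```

As stated above, the result is bounded by $R_{\max}/(1-\gamma)$ regardless of the policy.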
Greedy policies are important because a greedy policy w.r.t. the optimal action-value function $Q^*$ is an optimal policy. Hence, knowing $Q^*$ is sufficient for behaving optimally (cf. Proposition 4.3 of Bertsekas and Shreve 1978).²

Definition 4 (Bellman Operators) For a policy $\pi$, the Bellman operators $T^\pi : B(\mathcal{X}) \to B(\mathcal{X})$ (for value functions) and $T^\pi : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X} \times \mathcal{A})$ (for action-value functions) are defined as

$$(T^\pi V)(x) \triangleq r(x, \pi(x)) + \gamma \int_{\mathcal{X}} V(y)\, P(dy|x, \pi(x)),$$
$$(T^\pi Q)(x,a) \triangleq r(x,a) + \gamma \int_{\mathcal{X}} Q(y, \pi(y))\, P(dy|x,a).$$

To avoid unnecessary clutter, we use the same symbol to denote both operators. However, this should not introduce any ambiguity: Given some expression involving $T^\pi$, one can always determine which operator $T^\pi$ means by looking at the type of function $T^\pi$ is applied to. It is known that the fixed point of the Bellman operator is the (action-)value function of the policy $\pi$, i.e., $T^\pi Q^\pi = Q^\pi$ and $T^\pi V^\pi = V^\pi$; see e.g., Proposition 4.2(b) of Bertsekas and Shreve (1978). We will also need to define the so-called Bellman optimality operators:

Definition 5 (Bellman Optimality Operators) The Bellman optimality operators $T^* : B(\mathcal{X}) \to B(\mathcal{X})$ (for value functions) and $T^* : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X} \times \mathcal{A})$ (for action-value functions) are defined as

$$(T^* V)(x) \triangleq \max_{a} \left[ r(x,a) + \gamma \int_{\mathcal{X}} V(y)\, P(dy|x,a) \right],$$
$$(T^* Q)(x,a) \triangleq r(x,a) + \gamma \int_{\mathcal{X}} \max_{a'} Q(y, a')\, P(dy|x,a).$$

Again, we use the same symbol to denote both operators; the previous comment that no ambiguity should arise because of this still applies. The Bellman optimality operators enjoy a fixed-point property similar to that of the Bellman operators. In particular, $T^* V^* = V^*$ and $T^* Q^* = Q^*$; see e.g., Proposition 4.2(a) of Bertsekas and Shreve (1978). The Bellman optimality operator thus provides a vehicle to compute the optimal action-value function and therefore to compute an optimal policy.

2. Measurability issues are dealt with in Section 9.5 of the same book.
In the case of finitely many actions, no additional condition is needed besides the obvious measurability assumptions on the immediate reward function and the transition kernel (Bertsekas and Shreve, 1978, Corollary 9.17.1), which we will assume from now on.

2.2 Norms and Function Spaces

In what follows we use $\mathcal{F} : \mathcal{X} \to \mathbb{R}$ to denote a subset of measurable functions. The exact specification of this set will be clear from the context. Further, we let $\mathcal{F}^{|\mathcal{A}|} : \mathcal{X} \times \mathcal{A} \to \mathbb{R}^{|\mathcal{A}|}$ be a subset of vector-valued measurable functions with the identification $\mathcal{F}^{|\mathcal{A}|} = \left\{ (Q_1, \dots, Q_{|\mathcal{A}|}) : Q_i \in \mathcal{F},\ i = 1, \dots, |\mathcal{A}| \right\}$. We shall use $\|Q\|_{p,\nu}$ to denote the $L_p(\nu)$-norm ($1 \le p < \infty$) of a measurable function $Q : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, i.e., $\|Q\|_{p,\nu}^p \triangleq \int_{\mathcal{X} \times \mathcal{A}} |Q(x,a)|^p\, d\nu(x,a)$. Let $z_{1:n}$ denote the $\mathcal{Z}$-valued sequence $(z_1, \dots, z_n)$. For $\mathcal{D}_n = z_{1:n}$, define the empirical norm of a function $f : \mathcal{Z} \to \mathbb{R}$ as

$$\|f\|_{p,\mathcal{D}_n}^p = \|f\|_{p,z_{1:n}}^p \triangleq \frac{1}{n} \sum_{i=1}^{n} |f(z_i)|^p. \qquad (1)$$

When there is no chance of confusion about $\mathcal{D}_n$, we may denote the empirical norm by $\|f\|_{p,n}$. Based on this definition, one may define $\|Q\|_{p,\mathcal{D}_n}$ with the choice of $\mathcal{Z} = \mathcal{X} \times \mathcal{A}$. Note that if $\mathcal{D}_n = (Z_i)_{i=1}^n$ is random with $Z_i \sim \nu$, the empirical norm is random too, and for any fixed function $f$, we have $\mathbb{E}\left[ \|f\|_{p,\mathcal{D}_n}^p \right] = \|f\|_{p,\nu}^p$. When $p = 2$, we simply use $\|\cdot\|_\nu$ and $\|\cdot\|_{\mathcal{D}_n}$.

2.3 Offline Learning Problem and Empirical Bellman Operators

We consider the offline learning scenario in which we are only given a batch of data³

$$\mathcal{D}_n = \{(X_1, A_1, R_1, X'_1), \dots, (X_n, A_n, R_n, X'_n)\}, \qquad (2)$$

with $X_i \sim \nu_{\mathcal{X}}$, $A_i \sim \pi_b(\cdot|X_i)$, and $(R_i, X'_i) \sim P(\cdot, \cdot\,|\,X_i, A_i)$ for $i = 1, \dots, n$. Here $\nu_{\mathcal{X}} \in \mathcal{M}(\mathcal{X})$ is a fixed distribution over the states and $\pi_b$ is the data generating behavior policy, which is a stochastic stationary Markov policy, i.e., given any state $x \in \mathcal{X}$, it assigns a probability distribution over $\mathcal{A}$. We shall also denote the common distribution underlying $(X_i, A_i)$ by $\nu \in \mathcal{M}(\mathcal{X} \times \mathcal{A})$.
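Equation (1) is straightforward to compute. The sketch below uses a hypothetical function $f$ and $\nu = \mathrm{Uniform}(0,1)$ (both invented for illustration) and also checks the stated identity $\mathbb{E}\big[\|f\|_{p,\mathcal{D}_n}^p\big] = \|f\|_{p,\nu}^p$: with $f(z) = 2z$ and $p = 2$, the population value is $\int_0^1 (2z)^2\,dz = 4/3$.

```python
import numpy as np

def empirical_norm(f, Z, p=2):
    """Empirical L_p norm over the sample z_{1:n}:
    ||f||_{p, D_n} = ( (1/n) * sum_i |f(z_i)|^p )^(1/p)."""
    vals = np.abs(np.array([f(z) for z in Z])) ** p
    return vals.mean() ** (1.0 / p)

rng = np.random.default_rng(0)
f = lambda z: 2.0 * z                   # a fixed function (illustrative choice)
Z = rng.uniform(0.0, 1.0, size=50_000)  # Z_i ~ nu = Uniform(0, 1)

# ||f||^2_{2, D_n} should be close to ||f||^2_{2, nu} = 4/3 for large n.
print(empirical_norm(f, Z) ** 2)
```

With $n = 50{,}000$ i.i.d. draws the squared empirical norm concentrates tightly around $4/3$.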
Samples $X'_i$ and $X_{i+1}$ may be sampled independently (we call this the "Planning scenario"), or may be coupled through $X'_i = X_{i+1}$ (the "RL scenario"). In the latter case the data comes from a single trajectory. Under either of these scenarios, we say that the data $\mathcal{D}_n$ meets the standard offline sampling assumption. We analyze the Planning scenario, where the states are independent, but one may also analyze dependent processes by considering mixing processes and using tools such as the independent blocks technique (Yu, 1994; Doukhan, 1994), as has been done by Antos et al. (2008b); Farahmand and Szepesvári (2012). The data set $\mathcal{D}_n$ allows us to define the so-called empirical Bellman operators, which can be thought of as empirical approximations to the true Bellman operators.

3. In what follows, when $\{\cdot\}$ is used in connection to a data set, we treat the set as an ordered multiset, where the ordering is given by the time indices of the data points.

Definition 6 (Empirical Bellman Operators) Let $\mathcal{D}_n$ be a data set as above. Define the ordered multiset $S_n = \{(X_1, A_1), \dots, (X_n, A_n)\}$. For a given fixed policy $\pi$, the empirical Bellman operator $\hat{T}^\pi : B(\mathcal{X} \times \mathcal{A}) \to \mathbb{R}^n$ is defined as

$$(\hat{T}^\pi Q)(X_i, A_i) \triangleq R_i + \gamma Q(X'_i, \pi(X'_i)), \qquad 1 \le i \le n.$$

Similarly, the empirical Bellman optimality operator $\hat{T}^* : B(\mathcal{X} \times \mathcal{A}) \to \mathbb{R}^n$ is defined as

$$(\hat{T}^* Q)(X_i, A_i) \triangleq R_i + \gamma \max_{a'} Q(X'_i, a'), \qquad 1 \le i \le n.$$

In words, the empirical Bellman operators take the $n$-element list $S_n$ and return an $n$-dimensional real-valued vector of single-sample estimates of the Bellman operators applied to the action-value function $Q$ at the selected points. It is easy to see that the empirical Bellman operators provide an unbiased estimate of the Bellman operators in the following sense: For any fixed bounded measurable deterministic function $Q : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, policy $\pi$, and $1 \le i \le n$, it holds that $\mathbb{E}\big[ (\hat{T}^\pi Q)(X_i, A_i) \,|\, X_i, A_i \big] = (T^\pi Q)(X_i, A_i)$ and $\mathbb{E}\big[ (\hat{T}^* Q)(X_i, A_i) \,|\, X_i, A_i \big] = (T^* Q)(X_i, A_i)$.

3.
Approximate Policy Iteration

The policy iteration algorithm computes a sequence of policies such that the new policy in the iteration is greedy w.r.t. the action-value function of the previous policy. This procedure requires one to compute the action-value function of the most recent policy (policy evaluation step), followed by the computation of the greedy policy (policy improvement step). In API, the exact, but infeasible, policy evaluation step is replaced by an approximate one. Thus, the skeleton of API methods is as follows: At the $k$th iteration and given a policy $\pi_k$, the API algorithm approximately evaluates $\pi_k$ to find a $Q_k$. The action-value function $Q_k$ is typically chosen to be such that $Q_k \approx T^{\pi_k} Q_k$, i.e., it is an approximate fixed point of $T^{\pi_k}$. The API algorithm then calculates the greedy policy w.r.t. the most recent action-value function to obtain a new policy $\pi_{k+1}$, i.e., $\pi_{k+1} = \hat\pi(\cdot\,; Q_k)$. The API algorithm continues by repeating this process, generating a sequence of policies and their corresponding approximate action-value functions $Q_0 \to \pi_1 \to Q_1 \to \pi_2 \to \cdots$.⁴

The success of an API algorithm hinges on the way the approximate policy evaluation step is implemented. Approximate policy evaluation is non-trivial for at least two reasons. First, policy evaluation is an inverse problem,⁵ so the underlying learning problem is unlike a standard supervised learning problem in which the data take the form of input-output pairs. The second problem is the off-policy sampling problem: The distribution of $(X_i, A_i)$ in the data samples (possibly generated by a behavior policy) is typically different from the distribution that would be induced if we followed the to-be-evaluated policy (i.e., the target policy). This causes a problem since the methods must be able to handle this mismatch of

4. In an actual API implementation, one does not need to compute $\pi_{k+1}$ for all states, which in fact is infeasible for large state spaces.
Instead, one uses $Q_k$ to compute $\pi_{k+1}$ at some select states, as required in the approximate policy evaluation step.

5. Given an operator $L : \mathcal{F} \to \mathcal{F}$, the inverse problem is the problem of solving $g = Lf$ for $f$ when $g$ is known. In the policy evaluation problem, $L = I - \gamma P^\pi$, $g(\cdot) = r(\cdot, \pi(\cdot))$, and $f = Q^\pi$.

distributions.⁶ In the rest of this section, we review the generic LSTD and BRM methods for approximate policy evaluation. We introduce our regularized versions of LSTD and BRM in Section 4.

3.1 Bellman Residual Minimization

The idea of BRM goes back at least to the work of Schweitzer and Seidmann (1985). It was later used in the RL community by Williams and Baird (1994) and Baird (1995). The basic idea of BRM comes from noticing that the action-value function is the unique fixed point of the Bellman operator: $Q^\pi = T^\pi Q^\pi$ (or similarly $V^\pi = T^\pi V^\pi$ for the value function). Whenever we replace $Q^\pi$ by an action-value function $Q$ different from $Q^\pi$, the fixed-point equation no longer holds, and we have a non-zero residual function $Q - T^\pi Q$. This quantity is called the Bellman residual of $Q$. The same is true for the Bellman optimality operator $T^*$. The BRM algorithm minimizes the norm of the Bellman residual of $Q$, which is called the Bellman error. It can be shown that if $\|Q - T^* Q\|$ is small, then the value function of the greedy policy w.r.t. $Q$, that is $V^{\hat\pi(\cdot\,;Q)}$, is also in some sense close to the optimal value function $V^*$; see e.g., Williams and Baird (1994); Munos (2003); Antos et al. (2008b); Farahmand et al. (2010), and Theorem 13 of this work. The BRM algorithm is defined as the procedure minimizing the following loss function:

$$L_{\mathrm{BRM}}(Q; \pi) \triangleq \|Q - T^\pi Q\|_\nu^2,$$

where $\nu$ is the distribution of state-actions in the input data.
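The sample-based surrogate for $T^\pi Q$ that the empirical loss below is built from is exactly Definition 6. A minimal sketch on a made-up continuous-state problem (the dynamics, behavior policy, target policy, and $Q$ below are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9

def step(x, a):
    """Sample (R, X') ~ P(., . | x, a) for a hypothetical stochastic MDP."""
    r = x + 0.1 * a + rng.normal(0.0, 0.1)
    x_next = np.clip(x + (0.1 if a == 1 else -0.1) + rng.normal(0.0, 0.05), 0.0, 1.0)
    return r, x_next

pi = lambda x: 1 if x < 0.5 else 0   # a fixed deterministic target policy
Q = lambda x, a: x * (a + 1.0)       # any fixed bounded action-value function

# D_n = {(X_i, A_i, R_i, X'_i)}: the behavior policy picks actions uniformly.
n = 1000
X = rng.uniform(0.0, 1.0, size=n)
A = rng.integers(0, 2, size=n)
data = [(x, a, *step(x, a)) for x, a in zip(X, A)]

# Empirical Bellman operator: (hat T^pi Q)(X_i, A_i) = R_i + gamma Q(X'_i, pi(X'_i)).
TQ_hat = np.array([r + gamma * Q(xn, pi(xn)) for (_, _, r, xn) in data])
assert TQ_hat.shape == (n,)   # one single-sample backup per data point
```

Each entry is an unbiased single-sample estimate of $(T^\pi Q)(X_i, A_i)$, conditioned on $(X_i, A_i)$.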
Using the empirical $L_2$-norm defined in (1) with the samples $\mathcal{D}_n$ defined in (2), and by replacing $(T^\pi Q)(X_t, A_t)$ with the empirical Bellman operator (Definition 6), the empirical estimate of $L_{\mathrm{BRM}}(Q;\pi)$ can be written as

$$\hat{L}_{\mathrm{BRM}}(Q; \pi, n) \triangleq \left\| Q - \hat{T}^\pi Q \right\|_{\mathcal{D}_n}^2 = \frac{1}{n} \sum_{t=1}^{n} \left[ Q(X_t, A_t) - \left( R_t + \gamma Q\big( X'_t, \pi(X'_t) \big) \right) \right]^2. \qquad (3)$$

Nevertheless, it is well-known that $\hat{L}_{\mathrm{BRM}}$ is not an unbiased estimate of $L_{\mathrm{BRM}}$ when the MDP is not deterministic (Lagoudakis and Parr, 2003; Antos et al., 2008b). To address this issue, Antos et al. (2008b) propose the modified BRM loss, a new empirical loss function with an extra de-biasing term. The idea of the modified BRM is to cancel the unwanted variance by introducing an auxiliary function $h$ and a new loss function

$$L_{\mathrm{BRM}}(Q, h; \pi) = L_{\mathrm{BRM}}(Q; \pi) - \|h - T^\pi Q\|_\nu^2, \qquad (4)$$

and approximating the action-value function $Q^\pi$ by solving

$$Q_{\mathrm{BRM}} = \operatorname*{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \sup_{h \in \mathcal{F}^{|\mathcal{A}|}} L_{\mathrm{BRM}}(Q, h; \pi), \qquad (5)$$

where the supremum comes from the negative sign of $\|h - T^\pi Q\|_\nu^2$. They have shown that optimizing the new loss function still makes sense and that the empirical version of this loss is unbiased. The min-max optimization problem (5) is equivalent to the following coupled (nested) optimization problems:

$$h(\cdot\,; Q) = \operatorname*{argmin}_{h \in \mathcal{F}^{|\mathcal{A}|}} \|h - T^\pi Q\|_\nu^2, \qquad Q_{\mathrm{BRM}} = \operatorname*{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \left[ \|Q - T^\pi Q\|_\nu^2 - \|h(\cdot\,; Q) - T^\pi Q\|_\nu^2 \right]. \qquad (6)$$

In practice, the norm $\|\cdot\|_\nu$ is replaced by the empirical norm $\|\cdot\|_{\mathcal{D}_n}$ and $T^\pi Q$ is replaced by its sample-based approximation $\hat{T}^\pi Q$, i.e.,

$$\hat{h}_n(\cdot\,; Q) = \operatorname*{argmin}_{h \in \mathcal{F}^{|\mathcal{A}|}} \left\| h - \hat{T}^\pi Q \right\|_{\mathcal{D}_n}^2, \qquad (7)$$
$$\hat{Q}_{\mathrm{BRM}} = \operatorname*{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \left[ \left\| Q - \hat{T}^\pi Q \right\|_{\mathcal{D}_n}^2 - \left\| \hat{h}_n(\cdot\,; Q) - \hat{T}^\pi Q \right\|_{\mathcal{D}_n}^2 \right]. \qquad (8)$$

From now on, whenever we refer to the BRM algorithm, we are referring to this modified BRM.

6. A number of works in the domain adaptation literature consider this scenario under the name of the covariate shift problem, see e.g., Ben-David et al. 2006; Mansour et al. 2009; Ben-David et al. 2010; Cortes et al. 2015.
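The bias of $\hat{L}_{\mathrm{BRM}}$ can be verified exactly on a tiny example: for a fixed $Q$ and stochastic transitions, $\mathbb{E}[\hat{L}_{\mathrm{BRM}}] - L_{\mathrm{BRM}}$ equals the average variance term $\gamma^2 \operatorname{Var}(Q(X') \,|\, x)$, which is the quantity the de-biasing term in (4) is designed to cancel. A sketch with a made-up single-action chain (so $Q = V$), computed by exact enumeration rather than sampling:

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state, single-action MDP with stochastic transitions.
P = np.array([[0.5, 0.5], [0.2, 0.8]])   # P[x, y]
r = np.array([1.0, 0.0])                 # deterministic reward r[x]
Q = np.array([0.3, 1.1])                 # an arbitrary fixed Q
nu = np.array([0.5, 0.5])                # sampling distribution over states

TQ = r + gamma * P @ Q                   # (T^pi Q)(x)
L_brm = nu @ (Q - TQ) ** 2               # population Bellman error

# Expected *empirical* loss, E[(Q(x) - (r(x) + gamma Q(X')))^2], X' ~ P(.|x),
# computed exactly by enumerating next states:
L_hat = nu @ np.array([P[x] @ (Q[x] - (r[x] + gamma * Q)) ** 2 for x in range(2)])

bias = L_hat - L_brm
# The gap equals gamma^2 Var(Q(X') | x), averaged over nu:
var_term = nu @ np.array([gamma**2 * (P[x] @ Q**2 - (P[x] @ Q) ** 2)
                          for x in range(2)])
assert bias > 0 and np.isclose(bias, var_term)
```

For a deterministic MDP the variance term vanishes and the naive estimate is unbiased, consistent with the statement above.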
3.2 Least-Squares Temporal Difference Learning

The Least-Squares Temporal Difference learning (LSTD) algorithm for policy evaluation was first proposed by Bradtke and Barto (1996); it was later used in an API procedure by Lagoudakis and Parr (2003), which was called Least-Squares Policy Iteration (LSPI). The original formulation of LSTD finds a solution to the fixed-point equation $Q = \Pi_\nu T^\pi Q$, where $\Pi_\nu$ is the simplified notation for the $\nu$-weighted projection operator onto the space of admissible functions $\mathcal{F}^{|\mathcal{A}|}$, i.e., $\Pi_\nu \equiv \Pi_\nu^{\mathcal{F}^{|\mathcal{A}|}} : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X} \times \mathcal{A})$ is defined by $\Pi_\nu^{\mathcal{F}^{|\mathcal{A}|}} Q = \operatorname{argmin}_{h \in \mathcal{F}^{|\mathcal{A}|}} \|h - Q\|_\nu^2$ for $Q \in B(\mathcal{X} \times \mathcal{A})$. We, however, use a different optimization-based formulation. The reason is that whenever $\nu$ is not the stationary distribution induced by $\pi$, the operator $\Pi_\nu T^\pi$ does not necessarily have a fixed point, whereas the optimization problem is always well-defined. We define the LSTD solution as the minimizer of the $L_2$-norm of the difference between $Q$ and $\Pi_\nu T^\pi Q$:

$$L_{\mathrm{LSTD}}(Q; \pi) \triangleq \|Q - \Pi_\nu T^\pi Q\|_\nu^2. \qquad (9)$$

The minimizer of $L_{\mathrm{LSTD}}(Q;\pi)$ is well-defined, and whenever $\nu$ is the stationary distribution of $\pi$ (i.e., on-policy sampling), the solution to this optimization problem is the same as the solution to $Q = \Pi_\nu T^\pi Q$.
The LSTD solution can therefore be written as the solution to the following coupled optimization problems:

$$h(\cdot\,; Q) = \operatorname*{argmin}_{h \in \mathcal{F}^{|\mathcal{A}|}} \|h - T^\pi Q\|_\nu^2, \qquad Q_{\mathrm{LSTD}} = \operatorname*{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \|Q - h(\cdot\,; Q)\|_\nu^2, \qquad (10)$$

where the first equation finds the projection of $T^\pi Q$ onto $\mathcal{F}^{|\mathcal{A}|}$, and the second one minimizes the distance between $Q$ and the projection. The corresponding empirical version based on the data set $\mathcal{D}_n$ is

$$\hat{h}_n(\cdot\,; Q) = \operatorname*{argmin}_{h \in \mathcal{F}^{|\mathcal{A}|}} \left\| h - \hat{T}^\pi Q \right\|_{\mathcal{D}_n}^2, \qquad (11)$$
$$\hat{Q}_{\mathrm{LSTD}} = \operatorname*{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \left\| Q - \hat{h}_n(\cdot\,; Q) \right\|_{\mathcal{D}_n}^2. \qquad (12)$$

For general spaces $\mathcal{F}^{|\mathcal{A}|}$, these optimization problems can be difficult to solve, but when $\mathcal{F}^{|\mathcal{A}|}$ is a linear subspace of $B(\mathcal{X} \times \mathcal{A})$, the minimization problem becomes computationally feasible. A comparison of BRM and LSTD is noteworthy. The population version of LSTD minimizes the distance between $Q$ and $\Pi_\nu T^\pi Q$, which is $\|Q - \Pi_\nu T^\pi Q\|_\nu^2$. Meanwhile, BRM minimizes another distance function: the distance between $Q$ and $T^\pi Q$ minus the distance between $T^\pi Q$ and its projection, i.e., $\|Q - T^\pi Q\|_\nu^2 - \|h(\cdot\,; Q) - T^\pi Q\|_\nu^2$. See Figure 1a for a pictorial presentation of these distances. When $\mathcal{F}^{|\mathcal{A}|}$ is linear, because of the Pythagorean theorem, the solution to the modified BRM (6) coincides with the LSTD solution (10) (Antos et al., 2008b).

Algorithm 1 Regularized Policy Iteration$(K, \hat{Q}^{(-1)}, \mathcal{F}^{|\mathcal{A}|}, J, \{(\lambda_{Q,n}^{(k)}, \lambda_{h,n}^{(k)})\}_{k=0}^{K-1})$
  // $K$: Number of iterations
  // $\hat{Q}^{(-1)}$: Initial action-value function
  // $\mathcal{F}^{|\mathcal{A}|}$: The action-value function space
  // $J$: The regularizer
  // $\{(\lambda_{Q,n}^{(k)}, \lambda_{h,n}^{(k)})\}_{k=0}^{K-1}$: The regularization coefficients
  for $k = 0$ to $K-1$ do
    $\pi_k(\cdot) \leftarrow \hat\pi(\cdot\,; \hat{Q}^{(k-1)})$
    Generate training samples $\mathcal{D}_n^{(k)}$
    $\hat{Q}^{(k)} \leftarrow$ REG-LSTD/BRM$(\pi_k, \mathcal{D}_n^{(k)}; \mathcal{F}^{|\mathcal{A}|}, J, \lambda_{Q,n}^{(k)}, \lambda_{h,n}^{(k)})$
  end for
  return $\hat{Q}^{(K-1)}$ and $\pi_K(\cdot) = \hat\pi(\cdot\,; \hat{Q}^{(K-1)})$

4. Regularized Policy Iteration Algorithms

In this section we introduce two Regularized Policy Iteration algorithms, which are instances of the generic API algorithm.
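The equivalence for linear $\mathcal{F}^{|\mathcal{A}|}$ is easy to check numerically: with a feature matrix $\Phi$, the empirical projection in (11) is $\hat{h}_n(\cdot\,;Q) = \Pi \hat{T}^\pi Q$ with $\Pi = \Phi(\Phi^\top\Phi)^{-1}\Phi^\top$, and the Pythagorean identity makes the empirical modified-BRM objective (8) and the LSTD objective (12) agree at every $Q$ in the span. A sketch with randomly generated synthetic features and rewards (pure illustration, not a real MDP):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 4
gamma = 0.9
Phi = rng.normal(size=(n, d))        # features of (X_i, A_i): a linear F^{|A|}
Phi_next = rng.normal(size=(n, d))   # features of (X'_i, pi(X'_i))
R = rng.normal(size=n)

# Empirical projection onto span(Phi) w.r.t. the empirical norm.
Pi = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)

def losses(w):
    Q, TQ = Phi @ w, R + gamma * Phi_next @ w   # Q and hat T^pi Q on the sample
    h = Pi @ TQ                                 # hat h_n(.; Q), cf. (11)
    brm_mod = np.mean((Q - TQ) ** 2) - np.mean((h - TQ) ** 2)  # objective (8)
    lstd = np.mean((Q - h) ** 2)                               # objective (12)
    return brm_mod, lstd

for _ in range(5):                   # the two objectives agree at every w
    w = rng.normal(size=d)
    brm_mod, lstd = losses(w)
    assert np.isclose(brm_mod, lstd)
```

The identity holds because $Q - \Pi \hat{T}^\pi Q$ lies in the span while $\Pi \hat{T}^\pi Q - \hat{T}^\pi Q$ is orthogonal to it, so the squared norms decompose exactly.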
These algorithms are built on regularized extensions of BRM (Section 3.1) and LSTD (Section 3.2) for the task of approximate policy evaluation. The pseudo-code of the Regularized Policy Iteration algorithms is shown in Algorithm 1. The algorithm receives $K$ (the number of API iterations), an initial action-value function $\hat{Q}^{(-1)}$, the function space $\mathcal{F}^{|\mathcal{A}|}$, the regularizer $J : \mathcal{F}^{|\mathcal{A}|} \to \mathbb{R}$, and a set of regularization coefficients $\{(\lambda_{Q,n}^{(k)}, \lambda_{h,n}^{(k)})\}_{k=0}^{K-1}$. Each iteration starts with a step of policy improvement, i.e., $\pi_k \leftarrow \hat\pi(\cdot\,; \hat{Q}^{(k-1)}) = \operatorname{argmax}_{a' \in \mathcal{A}} \hat{Q}^{(k-1)}(\cdot, a')$. For the first iteration ($k = 0$), one may ignore this step and provide an initial policy $\pi_0$ instead of $\hat{Q}^{(-1)}$. Afterwards, we have a data generating step: At each iteration $k = 0, \dots, K-1$, the agent follows the data generating policy $\pi_{b_k}$ to obtain $\mathcal{D}_n^{(k)} = \{(X_t^{(k)}, A_t^{(k)}, R_t^{(k)}, X_t'^{(k)})\}_{1 \le t \le n}$.

Figure 1: (a) The loss functions minimized by the original BRM, the modified BRM, and the LSTD methods. The function space $\mathcal{F}^{|\mathcal{A}|}$ is represented by the plane. The Bellman operator $T^\pi$ maps an action-value function $Q \in \mathcal{F}^{|\mathcal{A}|}$ to a function $T^\pi Q$. The function $T^\pi Q - \Pi_\nu T^\pi Q$ is orthogonal to $\mathcal{F}^{|\mathcal{A}|}$. The original BRM loss is $\|Q - T^\pi Q\|_\nu^2$ (solid line), the modified BRM loss is $\|Q - T^\pi Q\|_\nu^2 - \|T^\pi Q - \Pi_\nu T^\pi Q\|_\nu^2$ (the difference of two solid line segments; note the + and − symbols), and the LSTD loss is $\|Q - \Pi_\nu T^\pi Q\|_\nu^2$ (dashed line). LSTD and the modified BRM are equivalent for linear function spaces. (b) REG-LSTD and REG-BRM minimize regularized objective functions. Regularization makes the function $T^\pi Q - \Pi_\nu T^\pi Q$ non-orthogonal to $\mathcal{F}^{|\mathcal{A}|}$. The dashed ellipsoids represent the level sets defined by the regularization functional $J$.
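The outer loop of Algorithm 1 can be sketched with exact policy evaluation standing in for the REG-LSTD/BRM call; a real instantiation would replace `evaluate` with the regularized estimator and sampled data. The toy finite MDP below is invented for illustration.

```python
import numpy as np

gamma = 0.9
# Hypothetical finite MDP: P[a, x, y] = P(y | x, a), r[x, a] = expected reward.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.5, 0.5]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
n_states = 2

def evaluate(pi):
    """Policy evaluation step (here exact): Q^pi(x, a) = r(x, a) + gamma E[V^pi(X')].
    An API method replaces this with an approximate evaluation such as REG-LSTD."""
    P_pi = np.array([P[pi[x], x] for x in range(n_states)])
    r_pi = np.array([r[x, pi[x]] for x in range(n_states)])
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return r + gamma * np.einsum('axy,y->xa', P, V)

pi = np.zeros(n_states, dtype=int)   # initial policy pi_0
for k in range(20):                  # policy iteration loop
    Q = evaluate(pi)                 # (approximate) policy evaluation
    pi_next = Q.argmax(axis=1)       # policy improvement: greedy w.r.t. Q
    if np.array_equal(pi_next, pi):
        break
    pi = pi_next

# At convergence, pi is greedy w.r.t. its own action-value function.
assert np.array_equal(pi, evaluate(pi).argmax(axis=1))
```

With exact evaluation this is plain policy iteration; the approximation error introduced by the policy evaluation step is precisely what the analysis in Section 5 bounds.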
For the $k$th iteration of the algorithm, we use the training samples $\mathcal{D}_n^{(k)}$ to evaluate policy $\pi_k$. In practice, one might want to change $\pi_{b_k}$ at each iteration in such a way that the agent ultimately achieves a better performance. The relation between the performance and the choice of data samples, however, is complicated. For simplicity of analysis, in the rest of this work we assume that a fixed behavior policy is used in all iterations, i.e., $\pi_{b_k} = \pi_b$.⁷ This leads to $K$ independent data sets $\mathcal{D}_n^{(0)}, \dots, \mathcal{D}_n^{(K-1)}$. From now on, to avoid clutter, we use the symbols $\mathcal{D}_n, X_t, \dots$ instead of $\mathcal{D}_n^{(k)}, X_t^{(k)}, \dots$, with the understanding that each $\mathcal{D}_n$ in the various iterations refers to an independent set of data samples, which should be clear from the context. The approximate policy evaluation step is performed by REG-LSTD/BRM, which will be discussed shortly. REG-LSTD/BRM receives the policy $\pi_k$, the training samples $\mathcal{D}_n^{(k)}$, the function space $\mathcal{F}^{|\mathcal{A}|}$, the regularizer $J$, and the regularization coefficients $(\lambda_{Q,n}^{(k)}, \lambda_{h,n}^{(k)})$, and returns an estimate of the action-value function of policy $\pi_k$. This procedure repeats for $K$ iterations.

7. So we are in the so-called off-policy sampling scenario.

REG-BRM approximately evaluates policy $\pi_k$ by solving the following coupled optimization problems:

$$\hat{h}_n(\cdot\,; Q) = \operatorname*{argmin}_{h \in \mathcal{F}^{|\mathcal{A}|}} \left[ \left\| h - \hat{T}^{\pi_k} Q \right\|_{\mathcal{D}_n}^2 + \lambda_{h,n}^{(k)} J^2(h) \right], \qquad (13)$$
$$\hat{Q}^{(k)} = \operatorname*{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \left[ \left\| Q - \hat{T}^{\pi_k} Q \right\|_{\mathcal{D}_n}^2 - \left\| \hat{h}_n(\cdot\,; Q) - \hat{T}^{\pi_k} Q \right\|_{\mathcal{D}_n}^2 + \lambda_{Q,n}^{(k)} J^2(Q) \right], \qquad (14)$$

where $J : \mathcal{F}^{|\mathcal{A}|} \to \mathbb{R}$ is the regularization functional (or simply the regularizer or penalizer), and $\lambda_{h,n}^{(k)}, \lambda_{Q,n}^{(k)} > 0$ are regularization coefficients. The regularizer can be any pseudo-norm defined on $\mathcal{F}^{|\mathcal{A}|}$, and $\mathcal{D}_n$ is defined as in (2).⁸ The regularizer is often chosen such that the functions that we believe are more complex have larger values of $J$. The notion of complexity, however, is subjective and depends on the choice of $\mathcal{F}^{|\mathcal{A}|}$ and $J$.
Finally, note that we call $J(Q)$ the smoothness of $Q$, even though it might not coincide with the conventional derivative-based notions of smoothness. An example in which $J$ has a derivative-based interpretation is when the function space $\mathcal{F}^{|\mathcal{A}|}$ is a Sobolev space and the regularizer $J$ is defined as its corresponding norm. In this case, we are penalizing the weak derivatives of the estimate (Györfi et al., 2002; van de Geer, 2000). One can generalize the notion of smoothness beyond the usual derivative-based ones (cf. Chapter 1 of Triebel 2006) and define function spaces such as the family of Besov spaces (DeVore, 1998). The RKHS norm for shift-invariant and radial kernels can also be interpreted as a penalizer of the higher-frequency terms of the function (i.e., a low-pass filter; Evgeniou et al. 1999), so it effectively encourages smoother functions. The choice of kernel determines the frequency response of the filter. One may also use other data-dependent regularizers such as manifold regularization (Belkin et al., 2006) and Sample-based Approximate Regularization (Bachman et al., 2014). As a final example, for functions of the form $Q(x,a) = \sum_{i \ge 1} \phi_i(x,a) w_i$, if we choose a sparsity-inducing regularizer such as $J(Q) \triangleq \sum_{i \ge 1} |w_i|$ as the measure of smoothness, then a function that has a sparse representation in the dictionary $\{\phi_i\}_{i \ge 1}$ is, by definition, a smooth function, even though there is not necessarily any connection to derivative-based smoothness.

REG-LSTD approximately evaluates the policy $\pi_k$ by solving the following coupled optimization problems:

$$\hat{h}_n(\cdot\,; Q) = \operatorname*{argmin}_{h \in \mathcal{F}^{|\mathcal{A}|}} \left[ \left\| h - \hat{T}^{\pi_k} Q \right\|_{\mathcal{D}_n}^2 + \lambda_{h,n}^{(k)} J^2(h) \right], \qquad (15)$$
$$\hat{Q}^{(k)} = \operatorname*{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \left[ \left\| Q - \hat{h}_n(\cdot\,; Q) \right\|_{\mathcal{D}_n}^2 + \lambda_{Q,n}^{(k)} J^2(Q) \right]. \qquad (16)$$

Note that the difference between (7)-(8) (respectively, (11)-(12)) and (13)-(14) (respectively, (15)-(16)) is the addition of the regularizers $J^2(h)$ and $J^2(Q)$. Unlike the non-regularized case described in Section 3, the solutions of REG-BRM and REG-LSTD are not the same.
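To see how the coupled problems (15)-(16) untangle, consider the parametric special case of a linear space with $J$ the Euclidean norm of the weights. This is only a sketch of the structure, not the RKHS closed-form solution derived in Section 4.2. Substituting the ridge solution of (15), $\hat{h}_n(\cdot\,;Q) = A(R + \gamma \Phi' w)$ with smoothing matrix $A = \Phi(\Phi^\top\Phi + n\lambda_h I)^{-1}\Phi^\top$, into (16) leaves a single ridge problem in the weights of $Q$ (all data below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 6
gamma = 0.9
Phi = rng.normal(size=(n, d))        # features of (X_i, A_i)
Phi_next = rng.normal(size=(n, d))   # features of (X'_i, pi_k(X'_i))
R = rng.normal(size=n)
lam_h, lam_Q = 1e-2, 1e-2            # illustrative regularization coefficients

# Step (15): for Q = Phi w, the ridge projection of hat T^{pi_k} Q is
# h_n(.; Q) = A (R + gamma Phi_next w), with smoothing matrix A.
A = Phi @ np.linalg.solve(Phi.T @ Phi + n * lam_h * np.eye(d), Phi.T)

# Step (16): substituting h_n(.; Q) gives a ridge problem in w:
#   min_w (1/n) || (Phi - gamma A Phi_next) w - A R ||^2 + lam_Q ||w||^2.
B = Phi - gamma * A @ Phi_next
c = A @ R
w = np.linalg.solve(B.T @ B + n * lam_Q * np.eye(d), B.T @ c)

# Optimality check: the gradient of the objective vanishes at w.
grad = 2.0 / n * B.T @ (B @ w - c) + 2.0 * lam_Q * w
assert np.allclose(grad, 0.0, atol=1e-8)
```

The same substitute-then-solve structure underlies the RKHS case, where the representer theorem reduces both problems to finite-dimensional ones.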
As a result of the regularized projection in (13) and (15), the function $\hat h_n(\cdot\,;Q) - \hat T^{\pi_k} Q$ is not orthogonal to the function space $F^{|A|}$, even if $F^{|A|}$ is a linear space. Therefore, the Pythagorean theorem is not applicable anymore: $\|Q - \hat h_n(\cdot\,;Q)\|^2 \neq \|Q - \hat T^{\pi_k} Q\|^2 - \|\hat h_n(\cdot\,;Q) - \hat T^{\pi_k} Q\|^2$ (see Figure 1b). One may ask why we have regularization terms in both optimization problems, as opposed to only in the projection term (15) (similar to the Lasso-TD algorithm; Kolter and Ng 2009; Ghavamzadeh et al. 2011) or only in (16) (similar to Geist and Scherrer 2012; Avila Pires and Szepesvári 2012). We discuss this question in Section 4.1. Briefly speaking, for large function spaces such as the Sobolev spaces or an RKHS with a universal kernel, if we remove the regularization term in (15), the coupled optimization problems reduce to (unmodified) BRM, which is biased as discussed earlier; whereas if the regularization term in (16) is removed, the solution can be arbitrarily bad due to overfitting. Finally, note that the choice of the function space $F^{|A|}$, the regularizer $J$, and the regularization coefficients $\lambda_{Q,n}^{(k)}$ and $\lambda_{h,n}^{(k)}$ all affect the sample efficiency of the algorithms. If one knew $J(Q^\pi)$, the regularization coefficients could be chosen optimally. Nonetheless, the value of $J(Q^\pi)$ is often not known, so one has to use a model selection procedure to choose the best function space and regularization coefficients. The situation is similar to the problem of model selection in supervised learning (though the solutions are different). After developing some tools necessary for discussing this issue in Section 5, we return to the problem of choosing the regularization coefficients after Theorem 11 as well as in Section 6.

8. A pseudo-norm $J$ satisfies all properties of a norm except that $J(Q) = 0$ does not imply that $Q = 0$.

Remark 7 To the best of our knowledge, Antos et al.
(2008b) were the first to explicitly consider LSTD as the optimizer of the loss function (9). Their discussion was mainly to prove the equivalence of modified BRM (5) and LSTD when $F^{|A|}$ is a linear function space. In their work, the loss function is not used to derive any new algorithm. Farahmand et al. (2009b) used this loss function to develop the regularized variant of LSTD, (15)-(16). This loss function was later called the mean-square projected Bellman error by Sutton et al. (2009), and was used to derive the GTD2 and TDC algorithms.

4.1 Why Two Regularizers?

We discuss why using regularizers in both optimization problems (15) and (16) of REG-LSTD is necessary for large function spaces such as the Sobolev spaces and RKHS with universal kernels. Here we show that for large function spaces, depending on which regularization term we remove, either the coupled optimization problems reduce to the regularized variant of the unmodified BRM, which has a bias, or the solution can be arbitrarily bad. Let us focus on REG-LSTD for a given policy $\pi$. Assume that the function space $F^{|A|}$ is rich enough in the sense that it is dense in the space of continuous functions w.r.t. the supremum norm. This is satisfied by many large function spaces such as RKHS with universal kernels (Definition 4.52 of Steinwart and Christmann 2008) and the Sobolev spaces on compact domains. We consider what would happen if, instead of the current formulation of REG-LSTD (15)-(16), we only used a regularizer in either the first or the second optimization problem. We study each case separately. For notational simplicity, we omit the dependence on the iteration number $k$.

Case 1. In this case, we only regularize the empirical error $\|Q - \hat h_n(\cdot\,;Q)\|_{D_n}^2$, but we do not regularize the projection, i.e.,

$$\hat h_n(\cdot\,;Q) = \operatorname*{argmin}_{h \in F^{|A|}} \|h - \hat T^{\pi} Q\|_{D_n}^2,$$

$$\hat Q = \operatorname*{argmin}_{Q \in F^{|A|}} \left[ \|Q - \hat h_n(\cdot\,;Q)\|_{D_n}^2 + \lambda_{Q,n} J^2(Q) \right]. \qquad (17)$$
When the function space $F^{|A|}$ is rich enough, there exists a function $\hat h_n \in F^{|A|}$ that fits its target values at the data points $\{(X_i, A_i)\}_{i=1}^n$ perfectly, that is, $\hat h_n((X_i, A_i); Q) = (\hat T^\pi Q)(X_i, A_i)$ for $i = 1, \dots, n$.^9 Such a function is indeed a minimizer of the loss $\|h - \hat T^\pi Q\|_{D_n}^2$. The second optimization problem (17) then becomes

$$\hat Q = \operatorname*{argmin}_{Q \in F^{|A|}} \left[ \|Q - \hat T^\pi Q\|_{D_n}^2 + \lambda_{Q,n} J^2(Q) \right].$$

This is the regularized version of the original (i.e., unmodified) formulation of the BRM algorithm. As discussed in Section 3.1, the unmodified BRM algorithm is biased when the MDP is not deterministic. Adding a regularizer does not solve the biasedness problem of the unmodified BRM loss. So without regularizing the first optimization problem, the function $\hat h_n$ overfits to the noise, and as a result the whole algorithm becomes incorrect.

Case 2. In this case, we only regularize the empirical projection $\|h - \hat T^\pi Q\|_{D_n}^2$, but we do not regularize $\|Q - \hat h_n(\cdot\,;Q)\|_{D_n}^2$, i.e.,

$$\hat h_n(\cdot\,;Q) = \operatorname*{argmin}_{h \in F^{|A|}} \left[ \|h - \hat T^\pi Q\|_{D_n}^2 + \lambda_{h,n} J^2(h) \right],$$

$$\hat Q = \operatorname*{argmin}_{Q \in F^{|A|}} \|Q - \hat h_n(\cdot\,;Q)\|_{D_n}^2.$$

For a fixed $Q$, the first optimization problem is the standard regularized regression estimator with the regression function $\mathbb E\left[ (\hat T^\pi Q)(X, A) \mid X = x, A = a \right] = (T^\pi Q)(x, a)$. Therefore, if the function space $F^{|A|}$ is rich enough and we set the regularization coefficient $\lambda_{h,n}$ properly, $\|\hat h_n(\cdot\,;Q) - T^\pi Q\|_\nu$ and $\|\hat h_n(\cdot\,;Q) - T^\pi Q\|_{D_n}$ go to zero as the sample size grows (the rate of convergence depends on the complexity of the target function; cf. Lemma 15 and Theorem 16). So we can expect $\hat h_n(\cdot\,;Q)$ to get closer to $T^\pi Q$ as the sample size grows. For simplicity of the discussion, suppose that we are in the ideal situation where for any $Q$, we have $\hat h_n((x,a); Q) = (T^\pi Q)(x,a)$ for all $(x,a) \in \{(X_i, A_i)\}_{i=1}^n \cup \{(X_i', \pi(X_i'))\}_{i=1}^n$, that

9. To be more precise: First, for an $\varepsilon > 0$, we construct the continuous function $h_\varepsilon(z) = \sum_{Z_i \in \{(X_i,A_i)\}_{i=1}^n} \max\left\{1 - \frac{\|z - Z_i\|}{\varepsilon}, 0\right\} (\hat T^\pi Q)(Z_i)$.
We then use the denseness of the function space $F^{|A|}$ w.r.t. the supremum norm to argue that there exists $\bar h_\varepsilon \in F^{|A|}$ such that $\|\bar h_\varepsilon - h_\varepsilon\|_\infty$ is arbitrarily close to zero. So as $\varepsilon \to 0$, the value of the function $\bar h_\varepsilon$ is arbitrarily close to $\hat T^\pi Q$ at the data points. We then choose $\hat h_n(\cdot\,;Q) = \bar h_\varepsilon$. This construction is similar to Theorem 2 of Nadler et al. (2009). See also the argument in Case 2 for more detail.

is, we precisely know $T^\pi Q$ at all data points.^10 Substituting this $\hat h_n((x,a); Q)$ into the second optimization problem of this case, we get that we are solving the following optimization problem:

$$\hat Q = \operatorname*{argmin}_{Q \in F^{|A|}} \|Q - T^\pi Q\|_{D_n}^2. \qquad (19)$$

This is the Bellman error minimization problem. We do not have the biasedness problem here, as we have $T^\pi Q$ instead of $\hat T^\pi Q$ in the loss. Nonetheless, we face another problem: minimizing this empirical risk without controlling the complexity of the function space might lead to an overfitted solution, very similar to the same phenomenon in supervised learning. To see this more precisely, we first construct the continuous function

$$\bar Q_\varepsilon(z) = \sum_{Z_i \in \{(X_i,A_i)\}_{i=1}^n \cup \{(X_i',\pi(X_i'))\}_{i=1}^n} \max\left\{1 - \frac{\|z - Z_i\|}{\varepsilon}, 0\right\} Q^\pi(Z_i),$$

which for small enough $\varepsilon > 0$ has the property that $\|\bar Q_\varepsilon - T^\pi \bar Q_\varepsilon\|_{D_n}^2$ is zero, i.e., it is a minimizer of the empirical loss. Due to the denseness of $F^{|A|}$, we can find a $Q_\varepsilon \in F^{|A|}$ that is arbitrarily close to the continuous function $\bar Q_\varepsilon$. Therefore, for small enough $\varepsilon$, the function $Q_\varepsilon$ is a minimizer of (19), i.e., the value of $\|Q_\varepsilon - T^\pi Q_\varepsilon\|_{D_n}^2$ is zero. But $Q_\varepsilon$ is not a good approximation of $Q^\pi$, because $Q_\varepsilon$ consists of spikes in the $\varepsilon$-neighbourhood of the data points and is zero elsewhere. In other words, $Q_\varepsilon$ does not generalize well beyond the data points when $\varepsilon$ is chosen to be small. Of course, the solution is to control the complexity of $F^{|A|}$ so that spiky functions such as $Q_\varepsilon$ are not selected as the solution of the optimization problem. When we regularize both optimization problems, as we do in this work, neither of these problems occurs.
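The overfitting phenomenon of Case 2 is easy to reproduce numerically. The following sketch is our own illustration (not one of the paper's algorithms): it fits noisy targets by kernel least-squares with a narrow RBF kernel, once with a vanishing regularization coefficient and once with $\lambda > 0$. The unpenalized fit interpolates the noise, the analogue of the spiky $Q_\varepsilon$, while the penalized fit trades training error for a far smaller RKHS norm.

```python
import numpy as np

# Illustration of the Case 2 failure mode (our sketch, not the paper's
# algorithm): kernel least-squares on noisy targets with and without a
# ridge penalty. With lambda ~ 0 the fit interpolates the noise (the
# analogue of the spiky Q_eps); with lambda > 0 the fit is much smoother
# (smaller RKHS norm alpha' K alpha) at the price of some training error.
rng = np.random.default_rng(0)
n = 30
X = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(n)  # noisy targets

def rbf_gram(a, b, bw=0.02):
    # Gaussian (RBF) kernel Gram matrix for 1-d inputs
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bw ** 2))

K = rbf_gram(X, X)
alpha_interp = np.linalg.solve(K + 1e-10 * np.eye(n), y)   # lambda -> 0
alpha_ridge = np.linalg.solve(K + n * 0.1 * np.eye(n), y)  # lambda = 0.1

train_err_interp = np.mean((K @ alpha_interp - y) ** 2)    # ~ 0: overfits
train_err_ridge = np.mean((K @ alpha_ridge - y) ** 2)      # > 0
norm_interp = alpha_interp @ K @ alpha_interp              # RKHS norm^2
norm_ridge = alpha_ridge @ K @ alpha_ridge                 # much smaller
```

The interpolant achieves essentially zero empirical loss, exactly as $Q_\varepsilon$ does in the argument above, yet its RKHS norm (the quantity a regularizer would control) is far larger than that of the penalized solution.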
This argument applies to rich function spaces that can approximate any reasonably complex function (e.g., continuous functions) arbitrarily well. If the function space $F^{|A|}$ is much more limited, for example if it is a parametric function space, we may not need to regularize both optimization problems. An example of such an approach for parametric spaces has been analyzed by Avila Pires and Szepesvári (2012).

4.2 Closed-Form Solutions

In this section we provide closed-form solutions for (13)-(14) and (15)-(16) in two cases: 1) when $F^{|A|}$ is a finite-dimensional linear space and $J^2(\cdot)$ is defined as a weighted squared sum of the parameters describing the function (a setup similar to ridge regression; Hoerl and Kennard 1970), and 2) when $F^{|A|}$ is an RKHS and $J(\cdot)$ is the corresponding inner-product norm, i.e., $J^2(\cdot) = \|\cdot\|_{\mathcal H}^2$. Here we use a generic $\pi$ and $D_n$ instead of $\pi_k$ and $D_n^{(k)}$ at the kth iteration.

10. This is an ideal situation because 1) $\|\hat h_n - \hat T^\pi Q\|_\nu$ is equal to zero only asymptotically and not in the finite-sample regime, and 2) even if $\|\hat h_n - \hat T^\pi Q\|_\nu = 0$, it does not imply that $\hat h_n(x,a;Q) = (T^\pi Q)(x,a)$ almost surely on $X \times A$. Nonetheless, these simplifications are only in favour of the algorithm considered in this case, so for simplicity of the discussion we assume that they hold.

4.2.1 A Parametric Formulation for REG-BRM and REG-LSTD

In this section we consider the case when $h$ and $Q$ are both given as linear combinations of some basis functions:

$$h(\cdot) = \phi(\cdot)^\top u, \qquad Q(\cdot) = \phi(\cdot)^\top w, \qquad (20)$$

where $u, w \in \mathbb R^p$ are parameter vectors and $\phi(\cdot) \in \mathbb R^p$ is a vector of $p$ linearly independent basis functions defined over the space of state-action pairs.^11 These basis functions might be predefined (e.g., Fourier (Konidaris et al., 2011) or wavelets) or constructed data-dependently by one of the already-mentioned feature generation methods. We further assume that the regularization terms take the form

$$J^2(h) = u^\top \Psi u, \qquad J^2(Q) = w^\top \Psi w,$$
for some user-defined choice of a positive definite matrix $\Psi \in \mathbb R^{p \times p}$. A simple and common choice is $\Psi = I$. Define $\Phi, \Phi' \in \mathbb R^{n \times p}$ and $\mathbf r \in \mathbb R^n$ as follows:

$$\Phi = \left( \phi(Z_1), \dots, \phi(Z_n) \right)^\top, \qquad \Phi' = \left( \phi(Z_1'), \dots, \phi(Z_n') \right)^\top, \qquad \mathbf r = (R_1, \dots, R_n)^\top, \qquad (21)$$

with $Z_i = (X_i, A_i)$ and $Z_i' = (X_i', \pi(X_i'))$. The solution to REG-BRM is given by the following proposition.

Proposition 8 (Closed-form solution for REG-BRM) Under the setting of this section, the approximate action-value function returned by REG-BRM is $\hat Q(\cdot) = \phi(\cdot)^\top w^*$, where

$$w^* = \left[ B^\top B - \gamma^2 C^\top C + n\lambda_{Q,n}\Psi \right]^{-1} \left( B^\top + \gamma C^\top (\Phi A - I) \right) \mathbf r,$$

with $A = \left( \Phi^\top \Phi + n\lambda_{h,n}\Psi \right)^{-1}\Phi^\top$, $B = \Phi - \gamma\Phi'$, and $C = (\Phi A - I)\Phi'$.

Proof Using (20) and (21), we can rewrite (13)-(14) as

$$u^*(w) = \operatorname*{argmin}_{u \in \mathbb R^p} \frac{1}{n}\left[ \Phi u - (\mathbf r + \gamma\Phi' w) \right]^\top \left[ \Phi u - (\mathbf r + \gamma\Phi' w) \right] + \lambda_{h,n}\, u^\top \Psi u, \qquad (22)$$

$$w^* = \operatorname*{argmin}_{w \in \mathbb R^p} \frac{1}{n}\left[ \Phi w - (\mathbf r + \gamma\Phi' w) \right]^\top \left[ \Phi w - (\mathbf r + \gamma\Phi' w) \right] \qquad (23)$$
$$\qquad\qquad - \frac{1}{n}\left[ \Phi u^*(w) - (\mathbf r + \gamma\Phi' w) \right]^\top \left[ \Phi u^*(w) - (\mathbf r + \gamma\Phi' w) \right] + \lambda_{Q,n}\, w^\top \Psi w.$$

Taking the derivative of (22) w.r.t. $u$ and equating it to zero, we obtain $u^*$ as a function of $w$:

$$u^*(w) = \left( \Phi^\top \Phi + n\lambda_{h,n}\Psi \right)^{-1} \Phi^\top (\mathbf r + \gamma\Phi' w) = A(\mathbf r + \gamma\Phi' w). \qquad (24)$$

Plugging $u^*(w)$ from (24) into (23), taking the derivative w.r.t. $w$, and equating it to zero, we obtain the parameter vector $w^*$ as announced above.

The solution returned by REG-LSTD is given in the following proposition.

11. At the cost of using generalized inverses, everything in this section extends to the case when the basis functions are not linearly independent.

Proposition 9 (Closed-form solution for REG-LSTD) Under the setting of this section, the approximate action-value function returned by REG-LSTD is $\hat Q(\cdot) = \phi(\cdot)^\top w^*$, where

$$w^* = \left[ E^\top E + n\lambda_{Q,n}\Psi \right]^{-1} E^\top \Phi A\, \mathbf r,$$

with $A = \left( \Phi^\top \Phi + n\lambda_{h,n}\Psi \right)^{-1}\Phi^\top$ and $E = \Phi - \gamma\,\Phi A\,\Phi'$.

Proof Using (20) and (21), we can rewrite (15)-(16) as

$$u^*(w) = \operatorname*{argmin}_{u \in \mathbb R^p} \frac{1}{n}\left[ \Phi u - (\mathbf r + \gamma\Phi' w) \right]^\top \left[ \Phi u - (\mathbf r + \gamma\Phi' w) \right] + \lambda_{h,n}\, u^\top \Psi u, \qquad (25)$$

$$w^* = \operatorname*{argmin}_{w \in \mathbb R^p} \left\{ \frac{1}{n}\left[ \Phi w - \Phi u^*(w) \right]^\top \left[ \Phi w - \Phi u^*(w) \right] + \lambda_{Q,n}\, w^\top \Psi w \right\}. \qquad (26)$$

Similar to the parametric REG-BRM, we solve (25) and obtain $u^*(w)$, which is the same as (24). If we plug this $u^*(w)$ into (26), take the derivative w.r.t.
$w$, and find the minimizer, the parameter vector $w^*$ is as announced.

4.2.2 RKHS Formulation for REG-BRM and REG-LSTD

The class of reproducing kernel Hilbert spaces provides a flexible and powerful family of function spaces from which to choose $F^{|A|}$. An RKHS $\mathcal H$ of functions over $X \times A$ is defined by a positive definite kernel $k : (X \times A) \times (X \times A) \to \mathbb R$. With such a choice, we can use the corresponding squared RKHS norm $\|\cdot\|_{\mathcal H}^2$ as the regularizer $J^2(\cdot)$. REG-BRM with an RKHS function space $F^{|A|} = \mathcal H$ would be

$$\hat h_n(\cdot\,;Q) = \operatorname*{argmin}_{h \in F^{|A|}[=\mathcal H]} \|h - \hat T^\pi Q\|_{D_n}^2 + \lambda_{h,n}\|h\|_{\mathcal H}^2, \qquad (27)$$

$$\hat Q = \operatorname*{argmin}_{Q \in F^{|A|}[=\mathcal H]} \|Q - \hat T^\pi Q\|_{D_n}^2 - \|\hat h_n(\cdot\,;Q) - \hat T^\pi Q\|_{D_n}^2 + \lambda_{Q,n}\|Q\|_{\mathcal H}^2, \qquad (28)$$

and the coupled optimization problems for REG-LSTD are

$$\hat h_n(\cdot\,;Q) = \operatorname*{argmin}_{h \in F^{|A|}[=\mathcal H]} \|h - \hat T^\pi Q\|_{D_n}^2 + \lambda_{h,n}\|h\|_{\mathcal H}^2, \qquad (29)$$

$$\hat Q = \operatorname*{argmin}_{Q \in F^{|A|}[=\mathcal H]} \|Q - \hat h_n(\cdot\,;Q)\|_{D_n}^2 + \lambda_{Q,n}\|Q\|_{\mathcal H}^2. \qquad (30)$$

We can solve these coupled optimization problems by applying the generalized representer theorem for RKHS (Schölkopf et al., 2001). The result, stated in the next theorem, shows that the infinite-dimensional optimization problem defined on $F^{|A|} = \mathcal H$ boils down to a finite-dimensional problem whose dimension is twice the number of data points.

Theorem 10 Let $\tilde Z$ denote the vector $\tilde Z = (Z_1, \dots, Z_n, Z_1', \dots, Z_n')^\top$. Then the optimizer $\hat Q \in \mathcal H$ of (27)-(28) can be written as $\hat Q(\cdot) = \sum_{i=1}^{2n} \alpha_i k(\tilde Z_i, \cdot)$ for some values of $\boldsymbol\alpha \in \mathbb R^{2n}$. The same holds for the solution to (29)-(30). Further, the coefficient vectors can be obtained in the following form:

REG-BRM: $\quad \boldsymbol\alpha^{\mathrm{BRM}} = \left( C K_Q + n\lambda_{Q,n} I \right)^{-1} \left( D^\top + \gamma\, C_2^\top B^\top B \right) \mathbf r,$

REG-LSTD: $\quad \boldsymbol\alpha^{\mathrm{LSTD}} = \left( F^\top F K_Q + n\lambda_{Q,n} I \right)^{-1} F^\top E\, \mathbf r,$

where $\mathbf r = (R_1, \dots, R_n)^\top$ and the matrices $K_Q, B, C, C_2, D, E, F$ are defined as follows: $K_h \in \mathbb R^{n \times n}$ is defined by $[K_h]_{ij} = k(Z_i, Z_j)$, $1 \le i, j \le n$, and $K_Q \in \mathbb R^{2n \times 2n}$ by $[K_Q]_{ij} = k(\tilde Z_i, \tilde Z_j)$, $1 \le i, j \le 2n$. Let $C_1 = \left[\, I_{n\times n} \;\; 0_{n\times n} \,\right]$ and $C_2 = \left[\, 0_{n\times n} \;\; I_{n\times n} \,\right]$. Denote $D = C_1 - \gamma C_2$, $E = K_h(K_h + n\lambda_{h,n} I)^{-1}$, $F = C_1 - \gamma E C_2$, $B = K_h(K_h + n\lambda_{h,n} I)^{-1} - I$, and $C = D^\top D - \gamma^2 (B C_2)^\top (B C_2)$.

Proof See Appendix A.
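As a sanity check on the two closed forms, the sketch below (our illustration, with randomly generated features standing in for real data) computes the REG-LSTD coefficient vector of Theorem 10 for a linear kernel $k(z, z') = \phi(z)^\top \phi(z')$ and compares the resulting weight vector against the parametric closed form of Proposition 9 with $\Psi = I$; with a linear kernel and linearly independent features, the two formulations solve the same optimization problem and should agree.

```python
import numpy as np

# Cross-check (our sketch): RKHS REG-LSTD (Theorem 10) with a linear
# kernel reduces to parametric REG-LSTD (Proposition 9 with Psi = I).
rng = np.random.default_rng(2)
n, p, gamma = 60, 4, 0.9
Phi = rng.standard_normal((n, p))        # rows: phi(Z_i)
Phi_next = rng.standard_normal((n, p))   # rows: phi(Z'_i)
r = rng.standard_normal(n)               # rewards R_i (illustrative)
lam_h = lam_Q = 0.05

# --- RKHS solution (Theorem 10), linear kernel k(z,z') = phi(z)'phi(z')
Z_all = np.vstack([Phi, Phi_next])       # features of (Z_1..Z_n, Z'_1..Z'_n)
K_h = Phi @ Phi.T                        # n x n Gram matrix
K_Q = Z_all @ Z_all.T                    # 2n x 2n Gram matrix
C1 = np.hstack([np.eye(n), np.zeros((n, n))])
C2 = np.hstack([np.zeros((n, n)), np.eye(n)])
E = K_h @ np.linalg.inv(K_h + n * lam_h * np.eye(n))
F = C1 - gamma * E @ C2
alpha = np.linalg.solve(F.T @ F @ K_Q + n * lam_Q * np.eye(2 * n),
                        F.T @ E @ r)
w_rkhs = Z_all.T @ alpha                 # Q_hat(z) = phi(z)' w_rkhs

# --- parametric solution (Proposition 9, Psi = I)
A = np.linalg.solve(Phi.T @ Phi + n * lam_h * np.eye(p), Phi.T)
E_par = Phi - gamma * Phi @ A @ Phi_next
w_par = np.linalg.solve(E_par.T @ E_par + n * lam_Q * np.eye(p),
                        E_par.T @ Phi @ A @ r)
```

The agreement rests on the matrix identity $\Phi(\Phi^\top\Phi + n\lambda I)^{-1}\Phi^\top = K_h(K_h + n\lambda I)^{-1}$, which makes the smoothed fitted values of the inner problem coincide in the two formulations.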
5. Theoretical Analysis

In this section, we analyze the statistical properties of REG-LSPI and provide a finite-sample upper bound on the performance loss $\|Q^* - Q^{\pi_K}\|_{1,\rho}$. Here, $\pi_K$ is the policy greedy w.r.t. $\hat Q^{(K-1)}$ and $\rho$ is the performance evaluation measure. The distribution $\rho$ is chosen by the user and is often different from the sampling distribution $\nu$. Our study has two main parts. First, we analyze the policy evaluation error of REG-LSTD in Section 5.1. We suppose that, given any policy $\pi$, we obtain $\hat Q$ by solving (15)-(16) with $\pi_k$ in these equations replaced by $\pi$. Theorem 11 provides an upper bound on the Bellman error $\|\hat Q - T^\pi \hat Q\|_\nu$. We discuss the optimality of this upper bound for policy evaluation for some general classes of function spaces. We show that the result is not only optimal in its convergence rate, but also in its dependence on $J(Q^\pi)$. After that, in Section 5.2, we show how the Bellman errors of the policy evaluation procedure propagate through the API procedure (Theorem 13). The main result of this paper, which is an upper bound on the performance loss $\|Q^* - Q^{\pi_K}\|_{1,\rho}$, is stated as Theorem 14 in Section 5.3, followed by its discussion. We compare this work's statistical guarantee with those of some other papers in Section 5.3.1. To analyze the statistical performance of the REG-LSPI procedure, we make the following assumptions. We discuss their implications and possible relaxations after stating each of them.

Assumption A1 (MDP Regularity) The set of states $X$ is a compact subset of $\mathbb R^d$. The random immediate rewards $R_t \sim \mathcal R(\cdot|X_t, A_t)$ ($t = 1, 2, \dots$) as well as the expected immediate rewards $r(x,a)$ are uniformly bounded by $R_{\max}$, i.e., $|R_t| \le R_{\max}$ ($t = 1, 2, \dots$) and $\|r\|_\infty \le R_{\max}$.

Even though the algorithms were presented for a general measurable state space $X$, the theoretical results are stated for problems whose state space is a compact subset of $\mathbb R^d$. Generalizing Assumption A1 to other state spaces should be possible under certain regularity conditions.
One example could be any Polish space, i.e., a separable, completely metrizable topological space. Nevertheless, we do not investigate such generalizations here. The boundedness of the rewards is a reasonable assumption that can be replaced by a more relaxed condition such as sub-Gaussianity (Vershynin, 2012; van de Geer, 2000). This relaxation, however, increases the technicality of the proofs without adding much to the intuition. We remark on the compactness assumption after stating Assumption A4.

Assumption A2 (Sampling) At iteration $k$ of REG-LSPI (for $k = 0, \dots, K-1$), $n$ fresh independent and identically distributed (i.i.d.) samples are drawn from a distribution $\nu \in \mathcal M(X \times A)$, i.e., $D_n^{(k)} = \left\{ \left( Z_t^{(k)}, R_t^{(k)}, X_t'^{(k)} \right) \right\}_{t=1}^n$ with $Z_t^{(k)} = (X_t^{(k)}, A_t^{(k)}) \overset{\text{i.i.d.}}{\sim} \nu$ and $X_t'^{(k)} \sim P(\cdot|X_t^{(k)}, A_t^{(k)})$.

The i.i.d. requirement of Assumption A2 is primarily used to simplify the proofs. With much extra effort, these results can be extended to the case when the data samples belong to a single trajectory generated by a fixed policy. In the single-trajectory scenario, samples are not independent anymore, but under certain conditions on the Markov process, the process $(X_t, A_t)$ gradually forgets its past. One way to quantify this forgetting is through mixing processes. For these processes, tools such as the independent-blocks technique (Yu, 1994; Doukhan, 1994) or information-theoretical inequalities (Samson, 2000) can be used to carry out the analysis, as has been done by Antos et al. (2008b) in the API context, by Farahmand and Szepesvári (2012) for analyzing the regularized regression problem, and by Farahmand and Szepesvári (2011) in the context of model selection for RL problems. It is worthwhile to emphasize that we do not require the distribution $\nu$ to be known. The sampling distribution is also generally different from the distribution induced by the target policy $\pi_k$.
For example, the samples might be generated by drawing state samples from a given $\nu_X$ and choosing actions according to a behavior policy $\pi_b$, which is different from the policy being evaluated. So we are in the off-policy sampling setting. Moreover, changing $\nu$ at each iteration based on the previous iterations is a possibility with potential practical benefits, which has theoretical justifications in the context of imitation learning (Ross et al., 2011). For simplicity of the analysis, however, we assume that $\nu$ is fixed in all iterations. Finally, we note that the proofs work fine if we reuse the same data sets in all iterations. We comment on this later, after the proof of Theorem 11 in Appendix B.

Assumption A3 (Regularizer) Define two regularization functionals $J : B(X) \to \mathbb R$ and $J : B(X \times A) \to \mathbb R$ that are pseudo-norms on $F$ and $F^{|A|}$, respectively.^12 For all $Q \in F^{|A|}$ and $a \in A$, we have $J(Q(\cdot, a)) \le J(Q)$.

The regularizer $J(Q)$ measures the complexity of an action-value function $Q$. The functions that are more complex have larger values of $J(Q)$. We also need to define a related regularizer for the value functions $Q(\cdot, a)$ ($a \in A$). The latter regularizer is not explicitly used in the algorithm; it is only used in the analysis. This assumption imposes some mild restrictions on these regularization functionals. The condition that the regularizers be pseudo-norms is satisfied by many commonly-used regularizers such as the Sobolev norms, the RKHS norms, and the $\ell_2$-regularizer defined in Section 4.2.1 with a positive semidefinite choice of the matrix $\Psi$.

12. Note that here we are slightly abusing notation, as the same symbol is used for the regularizer over both $B(X)$ and $B(X \times A)$. However, this should not cause any confusion since in any specific expression the identity of the regularizer should always be clear from the context.
Moreover, the condition $J(Q(\cdot, a)) \le J(Q)$ essentially states that the complexity of $Q$ should upper bound the complexity of $Q(\cdot, a)$ for all $a \in A$. If the regularizer $J : B(X \times A) \to \mathbb R$ is derived from a regularizer $J : B(X) \to \mathbb R$ through $J(Q) = \left\| \left( J(Q(\cdot, a)) \right)_{a \in A} \right\|_p$ for some $p \in [1, \infty]$, then $J$ will satisfy the second part of the assumption. From a computational perspective, a natural choice for an RKHS is to choose $p = 2$ and to define $J^2(Q) = \sum_{a \in A} \|Q(\cdot, a)\|_{\mathcal H}^2$, for $\mathcal H$ being the RKHS defined on $X$.

Assumption A4 (Capacity of Function Space) For $R > 0$, let $F_R = \{f \in F : J(f) \le R\}$. There exist constants $C > 0$ and $0 < \alpha < 1$ such that for any $u, R > 0$ the following metric entropy condition is satisfied:

$$\log \mathcal N(u, F_R) \le C \left( \frac{R}{u} \right)^{2\alpha}.$$

This assumption characterizes the capacity of the ball with radius $R$ in $F$. The value of $\alpha$ is an essential quantity in our upper bounds. The metric entropy is precisely defined in Appendix G, but roughly speaking, it is the logarithm of the minimum number of balls with radius $u$ that are required to completely cover a ball with radius $R$ in $F$. This is a measure of the complexity of a function space, as it is more difficult to estimate a function when the metric entropy grows fast as $u$ decreases. As a simple example, when the function space is finite, we effectively need a good estimate of $|F|$ functions in order not to choose the wrong one. In this case, $\mathcal N(u, F_R)$ can be replaced by $|F|$, so $\alpha = 0$ and $C = \log |F|$. When the state space $X$ is finite and all functions are bounded by $Q_{\max}$, we have $\log \mathcal N(u, F_R) \le \log \mathcal N(u, F) = |X| \log\left( \frac{2Q_{\max}}{u} \right)$. This shows that the metric entropy for problems with finite state spaces grows much more slowly than what we consider here. Assumption A4 is suitable for large function spaces and is indeed satisfied for the Sobolev spaces and various RKHS. Refer to van de Geer (2000); Zhou (2002, 2003); Steinwart and Christmann (2008) for many examples. An alternative assumption would be a similar metric entropy condition for the balls in $F^{|A|}$ (instead of $F$).
This would slightly change a few steps of the proofs, but leave the results essentially the same. Moreover, it would make the requirement $J(Q(\cdot, a)) \le J(Q)$ in Assumption A3 unnecessary. Nevertheless, as results on the capacity of $F$ are more common in the statistical learning theory literature, we stick to the combination of Assumptions A3 and A4. The metric entropy here is defined w.r.t. the supremum norm. All proofs, except that of Lemma 23, only require the same bound to hold when the supremum norm is replaced by the more relaxed empirical $L_2$-norm; i.e., those results require that there exist constants $C > 0$ and $0 < \alpha < 1$ such that for any $u, R > 0$ and all $x_1, \dots, x_n \in X$, we have $\log \mathcal N_2(u, F_R, x_{1:n}) \le C \left( \frac{R}{u} \right)^{2\alpha}$. Of course, the metric entropy w.r.t. the supremum norm implies the one with the empirical norm. It is an interesting question to relax the supremum norm assumption in Lemma 23. We can now remark on the requirement that $X$ be compact (Assumption A1). We stated that requirement mainly because most of the metric entropy results in the literature are for compact spaces (one exception is Theorem 7.34 of Steinwart and Christmann (2008), which relaxes the compactness requirement by adding some assumptions on the tail of $\nu_X$ on $X$). So we could remove the compactness requirement from Assumption A1 and implicitly let Assumption A4 satisfy it, but we preferred to be explicit about it at the cost of a bit of redundancy in our set of assumptions.

Assumption A5 (Function Space Boundedness) The subset $F^{|A|} \subset B(X \times A; Q_{\max})$ is a separable and complete Carathéodory set with $R_{\max} \le Q_{\max} < \infty$.

Assumption A5 requires all the functions in $F^{|A|}$ to be bounded so that the solutions of the optimization problems (15)-(16) stay bounded. If they are not, they should be truncated, and thus the truncation argument should be used in the analysis; see, e.g., the proof of Theorem 21.1 of Györfi et al. (2002).
The truncation argument does not change the final result, but complicates the proof in several places, so we stick to the above assumption to avoid unnecessary clutter. Moreover, in order to avoid the measurability issues resulting from taking the supremum over an uncountable function space $F^{|A|}$, we require the space to be a separable and complete Carathéodory set (cf. Section 7.3 of Steinwart and Christmann 2008).

Assumption A6 (Function Approximation Property) The action-value function of any policy $\pi$ belongs to $F^{|A|}$, i.e., $Q^\pi \in F^{|A|}$.

This "no function approximation error" assumption is standard in analyzing regularization-based nonparametric methods. This assumption is realistic and is satisfied for rich function spaces such as an RKHS defined by a universal kernel, e.g., the Gaussian or exponential kernels (Section 4.6 of Steinwart and Christmann 2008). On the other hand, if the space is not large enough, we might have a function approximation error. The behavior of the function approximation error for certain classes of small RKHS has been discussed by Smale and Zhou (2003); Steinwart and Christmann (2008). We stick to this assumption as it simplifies many key steps in the proofs.

Assumption A7 (Expansion of Smoothness) For all $Q \in F^{|A|}$, there exist constants $0 \le L_R, L_P < \infty$, depending only on the MDP and $F^{|A|}$, such that for any policy $\pi$,

$$J(T^\pi Q) \le L_R + \gamma L_P\, J(Q).$$

We require the complexity of $T^\pi Q$ to be comparable to the complexity of $Q$ itself. In other words, we require that if $Q$ is smooth according to the regularizer $J$ of a function space $F^{|A|}$, it stays smooth after the application of the Bellman operator. We believe that this is a reasonable assumption for many classes of MDPs with sufficient stochasticity and when $F^{|A|}$ is rich enough. The intuition is that if the Bellman operator has a smoothing effect, the norm of $T^\pi Q$ does not blow up and the function can still be represented well within $F^{|A|}$.
Proposition 25 in Appendix F presents conditions under which Assumption A7 is satisfied for the so-called convolutional MDPs. Briefly speaking, the conditions are that 1) the transition probability kernel should have a finite gain (in the control-theoretic sense) in its frequency response, and 2) the reward function should be smooth according to the regularizer $J$. Of course, this is only one example of a class of problems for which this assumption holds.

5.1 Policy Evaluation Error

In this section, we focus on the kth iteration of REG-LSPI. To simplify the notation, we use $D_n = \{(Z_t, R_t, X_t')\}_{t=1}^n$ to refer to $D_n^{(k)}$. The policy $\pi_k$ depends on the data used in the earlier iterations, but since we use an independent set of samples $D_n^{(k)}$ for the kth iteration and $\pi_k$ is independent of $D_n^{(k)}$, we can safely ignore the randomness of $\pi_k$ by working on the probability space obtained by conditioning on $D_n^{(0)}, \dots, D_n^{(k-1)}$; i.e., the probability space used in the kth iteration is $(\Omega, \sigma_\Omega, \mathbb P_k)$ with $\mathbb P_k = \mathbb P\left\{ \cdot \mid D_n^{(0)}, \dots, D_n^{(k-1)} \right\}$. In order to avoid clutter, we do not use the conditional probability symbol. In the rest of this section, $\pi$ refers to a $\sigma(D_n^{(0)}, \dots, D_n^{(k-1)})$-measurable policy that is independent of $D_n$; $\hat Q$ and $\hat h_n(Q) = \hat h_n(\cdot\,;Q)$ refer to the solution to (15)-(16) when $\pi$, $\lambda_{h,n}$, and $\lambda_{Q,n}$ replace $\pi_k$, $\lambda_{h,n}^{(k)}$, and $\lambda_{Q,n}^{(k)}$ in that set of equations, respectively. The following theorem is the main result of this section and provides an upper bound on the statistical behavior of the policy evaluation procedure REG-LSTD.

Theorem 11 (Policy Evaluation) For any fixed policy $\pi$, let $\hat Q$ be the solution to the optimization problem (15)-(16) with the choice of

$$\lambda_{h,n} = \lambda_{Q,n} = \left[ \frac{1}{n\, J^2(Q^\pi)} \right]^{\frac{1}{1+\alpha}}.$$

If Assumptions A1-A7 hold, there exists $c(\delta) > 0$ such that for any $n \in \mathbb N$ and $0 < \delta < 1$, we have

$$\|\hat Q - T^\pi \hat Q\|_\nu^2 \le c(\delta)\, n^{-\frac{1}{1+\alpha}},$$

with probability at least $1 - \delta$.
Here $c(\delta)$ is equal to

$$c(\delta) = c_1 \left( 1 + (\gamma L_P)^2 \right) J^{\frac{2\alpha}{1+\alpha}}(Q^\pi) \ln(1/\delta) + c_2 \left( L_R^{\frac{2\alpha}{1+\alpha}} + L_R^2 \right) \left[ J(Q^\pi) \right]^{-\frac{2}{1+\alpha}}$$

for some constants $c_1, c_2 > 0$.

Theorem 11, which is proven in Appendix B, indicates how the number of samples and the difficulty of the problem, as characterized by $J(Q^\pi)$, $L_P$, and $L_R$, influence the policy evaluation error.^13 This upper bound provides some insights about the behavior of the REG-LSTD algorithm. To begin with, it shows that under the specified conditions, REG-LSTD is a consistent algorithm: as the number of samples increases, the Bellman error decreases and asymptotically converges to zero. This is due to the use of a nonparametric function space and the proper control of its complexity through regularization. A parametric function space, e.g., a linear function approximator with a fixed number of features, does not generally have a similar guarantee unless the value function happens to belong to the span of the features. Achieving consistency for parametric function spaces requires a careful choice of features, and might be difficult. On the other hand, a rich enough nonparametric function space, for example one defined by a universal kernel (cf. Assumption A6), ensures the consistency of the policy evaluation algorithm. This theorem, however, is much more powerful than a consistency result, as it also provides a finite-sample upper bound on the error. If the parameters of the REG-LSTD algorithm are selected properly, one may achieve the error upper bound of $O(n^{-1/(1+\alpha)})$. For the case of the Sobolev space $W^k(X)$, with $X$ an open Euclidean ball in $\mathbb R^d$ and $k > d/2$, one may choose $\alpha = d/(2k)$ to obtain the error upper bound of $O(n^{-2k/(2k+d)})$.^14 To study the upper bound a bit closer, let us focus on the special case of $\gamma = 0$.

13. Without loss of generality and for simplicity, we assume that $J(Q^\pi) > 0$.
For this choice of the discount factor, $T^\pi Q$ is equal to $r^\pi$ and $(\hat T^\pi Q)(X_i, A_i)$ is equal to $R_i$. One can see that the policy evaluation problem becomes a regression problem with the regression function $r^\pi$. The guarantee of this theorem is then on $\|\hat Q - T^\pi \hat Q\|_\nu^2 = \|\hat Q - r^\pi\|_\nu^2$, which is the usual squared error in the regression literature. Hence we have reduced the regression problem to a policy evaluation problem. Because of this reduction, any lower bound on regression is also a lower bound on the policy evaluation problem. It is well known that the convergence rate of $n^{-2k/(2k+d)}$ is asymptotically minimax optimal for regression estimation when the target function belongs to the Sobolev space $W^k(X)$, as well as some other smoothness classes with smoothness of order $k$; cf., e.g., Nussbaum (1999) for results on the Sobolev spaces, Stone (1982) for the closely related Hölder space $C^{p,\alpha}$, which with the choice of $k = p + \alpha$ ($k \in \mathbb N$ and $0 < \alpha \le 1$) has the same rate, and Tsybakov (2009) for several results on the minimax optimality of nonparametric estimators. More generally, the rate of $O(n^{-1/(1+\alpha)})$ is optimal too: for a regression function belonging to a function space $F$ with a packing entropy of the same form as the upper bound in Assumption A4, the rate $\Omega(n^{-1/(1+\alpha)})$ is a minimax lower bound (Yang and Barron, 1999), making the upper bound optimal. Comparing these lower bounds with the upper bound $O(n^{-1/(1+\alpha)})$ (or $O(n^{-2k/(2k+d)})$ for the Sobolev space) of this theorem indicates that the REG-LSTD algorithm has the optimal error rate as a function of the number of samples $n$, which is a remarkable result. Furthermore, to understand the fine behavior of the upper bound, beyond the dependence of the rate on $n$ and $\alpha$, we focus on the multiplicative term $c(\delta)$. Again we consider the special case of regression estimation, as it is the only case for which we have known lower bounds. With the choice of $\gamma = 0$, we have $Q^\pi = r^\pi$, so $J(Q^\pi) = J(r^\pi)$.
Moreover, since $T^\pi Q = r^\pi + \gamma P^\pi Q = r^\pi$ when $\gamma = 0$, we can choose $L_R = J(r^\pi)$ in Assumption A7. As a result, $c(\delta) = c_1 J^{\frac{2\alpha}{1+\alpha}}(r^\pi) \ln(1/\delta)$ for a constant $c_1 > 0$. We are interested in studying the dependence of the upper bound on $J(r^\pi)$. We study its behavior when the function space is the Sobolev space $W^k([0,1])$ and $J(\cdot)$ is the corresponding Sobolev norm. We choose $\alpha = 1/(2k)$ to get the $J^{\frac{2}{2k+1}}(r^\pi)$ dependence of $c(\delta)$. On the other hand, for the regression estimation problem within the subset $F_1^{|A|} = \left\{ Q : Q(\cdot, a) \in W^k([0,1]),\ J(Q) \le J(r^\pi),\ a \in A \right\}$ of this Sobolev space, the fine behavior of the asymptotic minimax rate is determined by the so-called Pinsker constant, whose dependence on $J$ is in fact $J^{\frac{2}{2k+1}}(r^\pi)$; cf., e.g., Nussbaum (1999, 1985); Golubev and Nussbaum (1990), or Section 3.1 of Tsybakov (2009).^15 Therefore, not only is the exponent of the rate optimal for this function space, but so is its multiplicative dependence on the smoothness $J(r^\pi)$. For function spaces other than this choice of Sobolev space (i.e., the general case of $\alpha$), we are not aware of any refined lower bound that indicates the optimality of $J^{\frac{2\alpha}{1+\alpha}}(r^\pi)$. We note that some available upper bounds for regression with comparable assumptions on the metric entropy have the same dependence on $J(r^\pi)$, e.g., Steinwart et al. (2009)^16 or Farahmand and Szepesvári (2012), whose result is for the regression setting with exponentially $\beta$-mixing input, but can also be shown for i.i.d. data. We conjecture that under our assumptions this dependence is optimal.

14. For examples of metric entropy results for the Sobolev spaces, refer to Section A.5.6 alongside Lemma 6.21 of Steinwart and Christmann (2008), or Theorem 2.4 of van de Geer (2000) for $X = [0,1]$, or Lemma 20.6 of Györfi et al. (2002) for $X = [0,1]^d$. Also, in this paper we use the notation $W^k(X)$ to refer to $W^{k,2}(X)$, the Sobolev space defined based on the $L_2$-norm of the weak derivatives.
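The rate algebra above is easy to verify mechanically. The small sketch below (ours, for illustration) checks that the exponent $1/(1+\alpha)$ in the squared-error bound of Theorem 11, specialized to the Sobolev space $W^k(X)$ with $X \subset \mathbb R^d$ via $\alpha = d/(2k)$, equals $2k/(2k+d)$, the minimax regression exponent discussed above.

```python
from fractions import Fraction

# Check of the rate algebra: the bound c(delta) * n^(-1/(1+alpha)),
# with alpha = d/(2k) for the Sobolev space W^k(X), X subset of R^d,
# decays as n^(-2k/(2k+d)).
def rate_exponent(alpha):
    """Exponent e such that the squared-error bound decays as n^(-e)."""
    return 1 / (1 + alpha)

def sobolev_rate_exponent(k, d):
    """Same exponent with the Sobolev-space choice alpha = d/(2k)."""
    return rate_exponent(Fraction(d, 2 * k))

# e.g. k = 2, d = 1 gives exponent 4/5, i.e. the bound decays as n^(-4/5).
```

For instance, with $\alpha = 1/(2k)$ (the one-dimensional Sobolev case used in the Pinsker-constant discussion), the exponent becomes $2k/(2k+1)$.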
One may note that the proper selection of the regularization coefficients to achieve the optimal rate requires the knowledge of an unknown quantity $J(Q^\pi)$. This, however, is not a major concern, as a proper model selection procedure finds parameters that result in a performance almost the same as the optimal one. We comment on this issue in more detail in Section 6. The proof of this theorem requires several auxiliary results, which are presented in the appendices, but the main idea behind the proof is as follows. Since $\|\hat{Q} - T^\pi \hat{Q}\|_\nu^2 \le 2\|\hat{Q} - \hat{h}_n(\cdot; \hat{Q})\|_\nu^2 + 2\|\hat{h}_n(\cdot; \hat{Q}) - T^\pi \hat{Q}\|_\nu^2$, we may upper bound the Bellman error by upper bounding each term on the right-hand side (RHS). One can see that for a fixed $Q$, the optimization problem (15) essentially solves a regularized least-squares regression problem, which leads to a small value of $\|\hat{h}_n(\cdot; Q) - T^\pi Q\|_\nu$ when there are enough samples and under proper conditions. The relation of the optimization problem (16) to $\|\hat{Q} - \hat{h}_n(\cdot; \hat{Q})\|_\nu$ is evident too. The difficulty, however, is that these two optimization problems are coupled: $\hat{h}_n(\cdot; \hat{Q})$ is a function of $\hat{Q}$, which itself is a function of $\hat{h}_n(\cdot; \hat{Q})$. Thus, the $Q$ appearing in (15) is not fixed, but is a random function $\hat{Q}$. The same is true for the other optimization problem as well. The coupling of the optimization problems makes the analysis more complicated than the usual supervised learning type of analysis. The dependencies between all the results that lead to the proof of Theorem 14 are depicted in Figure 2 in Appendix B. In order to obtain fast convergence rates, we use concepts and techniques from empirical process theory, such as the peeling device, the chaining technique, and the modulus of continuity of the empirical process; cf., e.g., van de Geer (2000).
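The decomposition above is just the elementary inequality $(a+b)^2 \le 2a^2 + 2b^2$ applied pointwise after the triangle inequality. A minimal numerical illustration, with stand-in vectors playing the roles of $\hat{Q}$, $\hat{h}_n(\cdot;\hat{Q})$, and $T^\pi\hat{Q}$ evaluated at sample points:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for Q_hat, h_hat = h_n(.; Q_hat), and (T^pi Q_hat) at 1000 sample points.
Q_hat = rng.normal(size=1000)
h_hat = rng.normal(size=1000)
TQ_hat = rng.normal(size=1000)

def sq_norm(f):
    # Empirical analogue of the nu-weighted squared L2 norm (uniform weights here).
    return float(np.mean(f ** 2))

lhs = sq_norm(Q_hat - TQ_hat)
rhs = 2 * sq_norm(Q_hat - h_hat) + 2 * sq_norm(h_hat - TQ_hat)
# ||Q - T Q||^2 <= 2 ||Q - h||^2 + 2 ||h - T Q||^2 always holds.
assert lhs <= rhs
```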
By focusing on the behavior of the empirical process over local subsets of the function space, these techniques allow us to study the deviations of the process in a more refined way than a global approach that studies the supremum of the empirical process over the whole function space. These techniques are crucial to obtaining a fast rate for large function spaces. We discuss them in more detail as we proceed in the proofs.
15. The Pinsker constant determines the effect of the noise variance too. We do not present such information in our bounds. Also note that most of the aforementioned results, except Golubev and Nussbaum (1990), consider a normal noise model, which is different from our bounded noise.
16. This is obtained by using Corollary 3 of Steinwart et al. (2009) after substituting $A_2(\lambda)$ by its upper bound $\lambda \|f\|_H^2$, which is valid whenever $f \in H$, as is the case here. This result can be used after one converts the metric entropy condition to a condition on the decay rate of the eigenvalues of a certain integral operator.
We mentioned earlier that one can actually reuse a single data set in all iterations. To keep the presentation clearer, we keep the current setup; the reason behind this can be explained better after the proof of Theorem 11. Note that from the convergence-rate point of view, the difference between reusing data or not is insignificant. If we have a batch of data of size $n$ and we divide it into $K$ chunks, using only one chunk per iteration of API, the rate would be $O((n/K)^{-\frac{1}{1+\alpha}})$. For finite, or slowly growing, $K$ this is essentially the same as $O(n^{-\frac{1}{1+\alpha}})$.
5.2 Error Propagation in API
Consider an API algorithm that generates the sequence $\hat{Q}^{(0)} \to \pi_1 \to \hat{Q}^{(1)} \to \pi_2 \to \cdots \to \hat{Q}^{(K-1)} \to \pi_K$, where $\pi_k$ is the greedy policy w.r.t. $\hat{Q}^{(k-1)}$ and $\hat{Q}^{(k)}$ is the approximate action-value function for policy $\pi_k$.
For the sequence $(\hat{Q}^{(k)})_{k=0}^{K-1}$, denote the Bellman Residual (BR) of the $k$th action-value function by
$\varepsilon_k^{BR} = \hat{Q}^{(k)} - T^{\pi_k} \hat{Q}^{(k)}. \quad (31)$
The goal of this section is to study the effect of the $\nu$-weighted $L_2$-norm of the Bellman residual sequence $(\varepsilon_k^{BR})_{k=0}^{K-1}$ on the performance loss $\|Q^* - Q^{\pi_K}\|_{1,\rho}$ of the resulting policy $\pi_K$. Because of the dynamical nature of the MDP, the performance loss $\|Q^* - Q^{\pi_K}\|_{p,\rho}$ depends on the difference between the sampling distribution $\nu$ and the future state-action distributions of the form $\rho P^{\pi_1} P^{\pi_2} \cdots$. The precise form of this dependence is formalized in Theorem 13, which is a slight modification of a result by Farahmand et al. (2010).17 Before stating the results, we define the following concentrability coefficients, which are used in a change of measure argument; see, e.g., Munos (2007); Antos et al. (2008b); Farahmand et al. (2010).
Definition 12 (Expected Concentrability of Future State-Action Distributions) Given the distributions $\rho, \nu \in M(\mathcal{X} \times \mathcal{A})$, an integer $m \ge 0$, and an arbitrary sequence of stationary policies $(\pi_m)_{m \ge 1}$, let $\rho P^{\pi_1} P^{\pi_2} \cdots P^{\pi_m} \in M(\mathcal{X} \times \mathcal{A})$ denote the future state-action distribution obtained when the first state-action pair is distributed according to $\rho$ and then we follow the sequence of policies $(\pi_k)_{k=1}^m$. Define the following concentrability coefficient:
$c_{PI_1,\rho,\nu}(m_1, m_2; \pi) \triangleq \left( \mathbb{E}\left[ \left| \frac{\mathrm{d}\left(\rho (P^{\pi^*})^{m_1} (P^{\pi})^{m_2}\right)}{\mathrm{d}\nu}(X, A) \right|^2 \right] \right)^{1/2}$, with $(X, A) \sim \nu$.
If the future state-action distribution $\rho (P^{\pi^*})^{m_1} (P^{\pi})^{m_2}$ is not absolutely continuous w.r.t. $\nu$, then we take $c_{PI_1,\rho,\nu}(m_1, m_2; \pi) = \infty$.
In order to compactly present our results, we define the following notation:
$a_k = \frac{(1-\gamma)\,\gamma^{K-k-1}}{1 - \gamma^{K+1}}, \qquad 0 \le k < K. \quad (32)$
17. The difference between these two results is in the way the norm of functions from the space $\mathcal{F}^{|\mathcal{A}|}$ is defined, which in turn corresponds to whether the distributions $\nu$ and $\rho$ are defined over the state space $\mathcal{X}$, as in Farahmand et al. (2010), or over the state-action space $\mathcal{X} \times \mathcal{A}$, as we define here. These differences do not change the general form of the proof.
See Theorem 3.2 in Chapter 3 of Farahmand (2011b) for the proof of the current result.
Theorem 13 (Error Propagation for API; Theorem 3 of Farahmand et al. 2010) Let $p \ge 1$ be a real number, $K$ be a positive integer, and $Q_{max} \le \frac{R_{max}}{1-\gamma}$. Then for any sequence $(\hat{Q}^{(k)})_{k=0}^{K-1} \subset B(\mathcal{X} \times \mathcal{A}, Q_{max})$ and the corresponding sequence $(\varepsilon_k^{BR})_{k=0}^{K-1}$ defined in (31), we have
$\|Q^* - Q^{\pi_K}\|_{p,\rho} \le \frac{2\gamma}{(1-\gamma)^2} \left[ \inf_{r \in [0,1]} C_{PI,\rho,\nu}^{\frac{1}{2p}}(K; r)\, \mathcal{E}^{\frac{1}{2p}}(\varepsilon_0^{BR}, \ldots, \varepsilon_{K-1}^{BR}; r) + \gamma^{\frac{K}{p}-1} R_{max} \right],$
where $\mathcal{E}(\varepsilon_0^{BR}, \ldots, \varepsilon_{K-1}^{BR}; r) = \sum_{k=0}^{K-1} a_k^{2r} \|\varepsilon_k^{BR}\|_{2p,\nu}^{2p}$ and
$C_{PI,\rho,\nu}(K; r) = \left(\frac{1-\gamma}{2}\right)^2 \sup_{\pi'_0, \ldots, \pi'_K} \sum_{k=0}^{K-1} a_k^{2(1-r)} \left[ \sum_{m \ge 0} \gamma^m \left( c_{PI_1,\rho,\nu}(K-k-1, m+1; \pi'_{k+1}) + c_{PI_1,\rho,\nu}(K-k, m; \pi'_k) \right) \right]^2 .$
For a better understanding of the intuition behind error propagation results in general, refer to Munos (2007); Antos et al. (2008b); Farahmand et al. (2010). The significance of this particular theorem and the ways it improves previous similar error propagation results, such as that of Antos et al. (2008b) (for API) and Munos (2007) (for AVI), is thoroughly discussed by Farahmand et al. (2010). We briefly comment on it in Section 5.3.
5.3 Performance Loss of REG-LSPI
In this section, we use the error propagation result (Theorem 13 in Section 5.2) together with the upper bound on the policy evaluation error (Theorem 11 in Section 5.1) to derive an upper bound on the performance loss $\|Q^* - Q^{\pi_K}\|_{1,\rho}$ of REG-LSPI. This is the main theoretical result of this work. Before stating the theorem, let us denote by $\hat{\Pi}(\mathcal{F}^{|\mathcal{A}|})$ the set of all policies that are greedy w.r.t. a member of $\mathcal{F}^{|\mathcal{A}|}$, i.e., $\hat{\Pi}(\mathcal{F}^{|\mathcal{A}|}) = \{\hat{\pi}(\cdot; Q) : Q \in \mathcal{F}^{|\mathcal{A}|}\}$.
Theorem 14 Let $(\hat{Q}^{(k)})_{k=0}^{K-1}$ be the solutions of the optimization problems (15)-(16) with the choice of $\lambda_{h,n}^{(k)} = \lambda_{Q,n}^{(k)} = \left[ \frac{1}{n\, J^2(Q^{\pi_k})} \right]^{\frac{1}{1+\alpha}}$. Let Assumptions A1-A5 hold; Assumptions A6 and A7 hold for any $\pi \in \hat{\Pi}(\mathcal{F}^{|\mathcal{A}|})$; and $\inf_{r \in [0,1]} C_{PI,\rho,\nu}(K; r) < \infty$.
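To make the weighting in Theorem 13 concrete, the following sketch (our own illustration, assuming the reconstruction of (32) above) computes the coefficients $a_k$ and checks the properties used in the discussion in Section 5.3: errors at later iterations receive geometrically larger weight, and the weights for $k = 0, \ldots, K-1$ sum to $(1-\gamma^K)/(1-\gamma^{K+1})$.

```python
def a(k, K, gamma):
    # Coefficient of Eq. (32): a_k = (1 - gamma) * gamma^(K - k - 1) / (1 - gamma^(K + 1)).
    return (1 - gamma) * gamma ** (K - k - 1) / (1 - gamma ** (K + 1))

K, gamma = 10, 0.9
weights = [a(k, K, gamma) for k in range(K)]

# Errors at later iterations (larger k) are weighted more heavily ...
assert all(w1 < w2 for w1, w2 in zip(weights, weights[1:]))
# ... and consecutive weights grow by a factor 1/gamma, i.e., the influence
# of an error decays geometrically with its distance from the last iteration.
assert all(abs(w2 / w1 - 1 / gamma) < 1e-12 for w1, w2 in zip(weights, weights[1:]))
# The weights over k = 0..K-1 sum to (1 - gamma^K) / (1 - gamma^(K+1)).
assert abs(sum(weights) - (1 - gamma ** K) / (1 - gamma ** (K + 1))) < 1e-12
```

This geometric decay is what underlies the remark after Theorem 14 that fewer samples might suffice in early iterations.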
Then there exists $C_{LSPI}(\delta, K; \rho, \nu)$ such that for any $n \in \mathbb{N}$ and $0 < \delta < 1$, we have
$\|Q^* - Q^{\pi_K}\|_{1,\rho} \le \frac{2\gamma}{(1-\gamma)^2} \left[ C_{LSPI}(\delta, K; \rho, \nu)\, n^{-\frac{1}{2(1+\alpha)}} + \gamma^{K-1} R_{max} \right],$
with probability at least $1-\delta$. In this theorem, the function $C_{LSPI}(\delta, K; \rho, \nu) = C_{LSPI}(\delta, K; \rho, \nu; L_R, L_P, \alpha, \beta, \gamma)$ is
$C_{LSPI}(\delta, K; \rho, \nu; L_R, L_P, \alpha, \beta, \gamma) = C_I^{\frac{1}{2}}(\delta) \inf_{r \in [0,1]} C_{PI,\rho,\nu}^{\frac{1}{2}}(K; r),$
with $C_I(\delta)$ being defined as
$C_I(\delta) = \sup_{\pi \in \hat{\Pi}(\mathcal{F}^{|\mathcal{A}|})} \left\{ c_1 \left[ 1 + (\gamma L_P)^2 \right] J^{\frac{2\alpha}{1+\alpha}}(Q^\pi) \ln\frac{K}{\delta} + c_2 \left( L_R^{\frac{2\alpha}{1+\alpha}} + L_R^2 \, [J(Q^\pi)]^{-\frac{2}{1+\alpha}} \right) \right\},$
in which $c_1, c_2 > 0$ are universal constants.
Proof Fix $0 < \delta < 1$. For each iteration $k = 0, \ldots, K-1$, invoke Theorem 11 with the confidence parameter $\delta/K$ and take the supremum over all policies to upper bound the Bellman residual error $\|\varepsilon_k^{BR}\|_\nu$ as
$\|\hat{Q}^{(k)} - T^{\pi_k} \hat{Q}^{(k)}\|_\nu^2 \le \sup_{\pi \in \hat{\Pi}(\mathcal{F}^{|\mathcal{A}|})} c\left(J(Q^\pi), L_R, L_P, \alpha, \beta, \gamma, \tfrac{\delta}{K}\right) n^{-\frac{1}{1+\alpha}},$
which holds with probability at least $1 - \frac{\delta}{K}$. Here $c(\cdot)$ is defined as in Theorem 11. For any $r \in [0,1]$, we have
$\mathcal{E}(\varepsilon_0^{BR}, \ldots, \varepsilon_{K-1}^{BR}; r) = \sum_{k=0}^{K-1} a_k^{2r} \|\varepsilon_k^{BR}\|_\nu^2 \le c\, n^{-\frac{1}{1+\alpha}} \sum_{k=0}^{K-1} a_k^{2r} = c\, n^{-\frac{1}{1+\alpha}} \left( \frac{1-\gamma}{1-\gamma^{K+1}} \right)^{2r} \frac{1 - (\gamma^{2r})^K}{1 - \gamma^{2r}},$
where we used the definition of $a_k$ (32). We then apply Theorem 13 with the choice of $p = 1$ to get that, with probability at least $1-\delta$,
$\|Q^* - Q^{\pi_K}\|_{1,\rho} \le \frac{2\gamma}{(1-\gamma)^2} \left[ C_{LSPI}(\delta, K; \rho, \nu)\, n^{-\frac{1}{2(1+\alpha)}} + \gamma^{K-1} R_{max} \right],$
with
$C_{LSPI}(\delta, K; \rho, \nu) = \sqrt{ \sup_{\pi \in \hat{\Pi}(\mathcal{F}^{|\mathcal{A}|})} c\left(J(Q^\pi), L_R, L_P, \alpha, \beta, \gamma, \tfrac{\delta}{K}\right) } \; \inf_{r \in [0,1]} C_{PI,\rho,\nu}^{\frac{1}{2}}(K; r).$
Theorem 14 upper bounds the performance loss and relates it to the number of samples $n$, the capacity of the function space quantified by $\alpha$, the number of iterations $K$, the concentrability coefficients, and some other properties of the MDP, such as $L_R$, $L_P$, and $\gamma$. This theorem indicates that the behavior of the upper bound as a function of the number of samples is $O(n^{-\frac{1}{2(1+\alpha)}})$. This upper bound is notable because of its minimax optimality, as discussed in detail after Theorem 11. The term $C_{LSPI}$ has two main components.
The first is $C_{PI,\rho,\nu}(\cdot; r)$, which describes the effect of the sampling distribution $\nu$ and the evaluation distribution $\rho$, as well as of the transition probability kernel of the MDP itself, on the performance loss. This term has been thoroughly discussed by Farahmand et al. (2010); briefly speaking, it indicates that $\nu$ and $\rho$ affect the performance through a weighted summation of $c_{PI_1,\rho,\nu}$ (Definition 12). The concentrability coefficient $c_{PI_1,\rho,\nu}$ is defined as the square root of the expected squared Radon-Nikodym derivative of the future state-action distributions, started from $\rho$, w.r.t. the sampling distribution $\nu$. This may be much tighter than previous results (e.g., Antos et al. 2008b) that depend on the supremum of the Radon-Nikodym derivative. One may also notice that Theorem 13 actually provides a stronger result than what is reported in Theorem 14: the effect of errors at earlier iterations on the performance loss decays geometrically. So one may potentially use fewer samples in the earlier iterations of REG-LSPI (or any other API algorithm) and still get the same guarantee on the performance loss. We ignore this effect to simplify the result. The other important term is $C_I$, which mainly describes the effect of $L_R$, $L_P$, and $\sup_{\pi \in \hat{\Pi}(\mathcal{F}^{|\mathcal{A}|})} J(Q^\pi)$ on the performance loss. These quantities depend on the MDP as well as on the function space $\mathcal{F}^{|\mathcal{A}|}$. If the function space is matched with the MDP, these quantities are small; otherwise they may even be infinite. Note that $C_I$ provides an upper bound on the constant in front of the REG-LSTD error by taking the supremum over all policies in $\hat{\Pi}(\mathcal{F}^{|\mathcal{A}|})$. This might be a conservative estimate, as the policies actually encountered are the rather restricted random sequence $\pi_0, \pi_1, \ldots, \pi_{K-1}$ generated by the REG-LSPI procedure.
One might expect that as the sequence $\hat{Q}^{(k-1)}$ converges to a neighbourhood of $Q^*$, the value function $Q^{\pi_k}$ of the greedy policy $\pi_k = \hat{\pi}(\cdot; \hat{Q}^{(k-1)})$, which is the policy being evaluated, converges to a neighbourhood of $Q^*$ too. Thus, under certain assumptions, one might be able to show that its smoothness $J(Q^{\pi_k})$, the quantity that appears in the upper bound of Theorem 11, belongs to a neighbourhood of $J(Q^*)$. If $J(Q^*)$ is small, the value of $J(Q^{\pi_k})$ in that neighbourhood can be smaller than $\sup_{\pi \in \hat{\Pi}(\mathcal{F}^{|\mathcal{A}|})} J(Q^\pi)$. We postpone the analysis of this finer structure of the problem to future work. Finally, we note that the optimality of the error bound for the policy evaluation task, as shown by Theorem 11, does not necessarily imply that the REG-LSPI algorithm has the optimal sample complexity rate for the corresponding RL/Planning problem as well. The reason is that it is possible to get close to the optimal policy, which is the ultimate goal in RL/Planning, even though the estimate of the action-value function is still inaccurate. To act optimally, it is sufficient to have an action-value function whose greedy policy is the same as the optimal policy. This can happen even if there is some error in the estimated action-value function. This is called the action-gap phenomenon and has been analyzed in the reinforcement learning context by Farahmand (2011a).
5.3.1 Comparison with Similar Statistical Guarantees
Theorem 14 might be compared with the results of Antos et al. (2008b), who introduced a BRM-based API procedure and studied its statistical properties; Lazaric et al. (2012), who analyzed LSPI with linear function approximators; Avila Pires and Szepesvári (2012), who studied a regularized variant of LSTD; and Ghavamzadeh et al. (2011), who analyzed the statistical properties of Lasso-TD. Although these results address different algorithms, comparing them with the results of this work is insightful.
We first focus on Antos et al. (2008b). Their simplified upper bound for $\|Q^* - Q^{\pi_K}\|_{1,\rho}$ is $C_{\rho,\nu}^{1/2} \sqrt{V_{\mathcal{F}} \log(n) + \ln(K/\delta)}\; n^{-1/4}$, in which $V_{\mathcal{F}}$ is the effective dimension of $\mathcal{F}$, defined based on the pseudo-dimension of sub-graphs of $\mathcal{F}$ and the so-called VC-crossing dimension of $\mathcal{F}$; and $C_{\rho,\nu}$ is a concentrability coefficient playing a role similar to our $C_{PI,\rho,\nu}(K; r)$. In contrast, our simplified upper bound is $C_{LSPI}(\delta)\, n^{-\frac{1}{2(1+\alpha)}}$, in which $C_{LSPI}(\delta)$ can roughly be factored into $C_{PI,\rho,\nu}^{\frac{1}{2}}(K; r)\, C_1(J(Q^\pi), L_R, L_P)\, \sqrt{\ln(K/\delta)}$. One important difference between these two results is that Antos et al. (2008b) considered parametric function spaces, which have a finite effective dimension $V_{\mathcal{F}}$, while this work considers nonparametric function spaces, which are essentially infinite dimensional. The way they use the parametric function space assumption is equivalent to assuming $\log \mathcal{N}_1(u, \mathcal{F}, x_{1:n}) \lesssim V_{\mathcal{F}} \log(\frac{1}{u})$, as opposed to the $\log \mathcal{N}(u, \mathcal{F}_B, x_{1:n}) \le C \left(\frac{R}{u}\right)^{2\alpha}$ of Assumption A4. Our assumption lets us describe the capacity of infinite dimensional function spaces $\mathcal{F}$. Disregarding this crucial difference, one may also note that our upper bound's dependence on the number of samples (i.e., $O(n^{-\frac{1}{2(1+\alpha)}})$) is much faster than theirs (i.e., $O(n^{-1/4})$). This is more noticeable when we apply our result to a finite dimensional function space, which can be done by letting $\alpha \to 0$ at a certain rate, to recover the error upper bound of $n^{-1/2}$.18 This improvement is mainly because of the more advanced techniques used in our analysis, i.e., the relative deviation tail inequality and the peeling device in this work, in contrast with the uniform deviation inequality of Antos et al. (2008b). The other difference is in the definition of the concentrability coefficients ($C_{PI,\rho,\nu}(K)$ vs. $C_{\rho,\nu}$). In Definition 12, we use the expectation of the Radon-Nikodym derivative of two distributions, while their definition uses the supremum of a similar quantity.
This can be a significant improvement in the multiplicative constant of the upper bound. For more information regarding this improvement, which can be used to improve the result of Antos et al. (2008b) too, refer to Farahmand et al. (2010). Lazaric et al. (2012) analyzed unregularized LSTD/LSPI specialized for linear function approximators with a finite number of basis functions (the parametric setting). Their rate of $O(n^{-1/2})$ for $\|V^* - V^{\pi_K}\|_{2,\rho}$ is faster than the rate in the work of Antos et al. (2008b), and is comparable to our rate for $\|Q^* - Q^{\pi_K}\|_{1,\rho}$ when $\alpha \to 0$. The difference of their work from ours is that they focus on a parametric class of function approximators, as opposed to the nonparametric class in this work. Moreover, because they formulate LSTD as a fixed-point problem, in contrast to this work and that of Antos et al. (2008b), their algorithm and results are only applicable to the on-policy sampling scenario. Avila Pires and Szepesvári (2012) studied a regularized version of LSTD in the parametric setting that works for both on-policy and off-policy sampling. Besides the difference in the class of function spaces from this work (parametric vs. nonparametric), another algorithmic difference is that they only use a regularizer for the projected Bellman error term, similar to (16), as opposed to using regularizers in both terms of REG-LSTD (15)-(16) (cf. Section 4.1). Also, the weight used in their loss function, the matrix $M$ in their paper, is not necessarily the one induced by data. Their result indicates an $O(n^{-1/2})$ rate for the projected Bellman error, which is comparable, though with some subtle differences, to Lazaric et al. (2012). It is remarkable that they separate the error bound analysis into deterministic and probabilistic parts.
18. For problems with a finite state space, we have $\log \mathcal{N}(u, \mathcal{F}_R) \le |\mathcal{X}| \log(\frac{Q_{max}}{u})$, so with a similar $\alpha \to 0$ argument, we get an $O(n^{-1/2})$ error upper bound (disregarding the logarithmic terms).
In the deterministic part, they use perturbation analysis to relate the loss to the error in the estimation of certain parameters used by the algorithm. In the probabilistic part, they provide upper bounds on the error in the estimation of those parameters. We conjecture that their proof technique, even though simple and elegant, cannot easily be extended to provide the right convergence rate for large function spaces, because the current analysis is based on a uniform bound on the error of a noisy matrix. Providing a tight uniform bound for a matrix (or operator) for large state spaces might be difficult or impossible to achieve. Ghavamzadeh et al. (2011) analyzed Lasso-TD, a policy evaluation algorithm that uses linear function approximators and enforces sparsity by $\ell_1$-regularization, and provided error upper bounds w.r.t. the empirical measure (or what they call the Markov design). Their error upper bound is $O([\|w\|_1^2 \log(p)]^{1/4} n^{-1/4})$, where $w$ is the weight vector describing the projection of $Q^\pi$ onto the span of the $p$ basis functions. With some extra assumptions on the Grammian of the basis functions, they obtain the faster rate of $O(\sqrt{\|w\|_0 \log(p)}\, n^{-1/2})$. These results indicate that by using a sparsity-inducing regularizer, the dependence of the error bound on the number of features becomes logarithmic. We conjecture that if one uses REG-LSTD with a linear function space (similar to Section 4.2.1) with $J^2(h) = \|u\|_1$ and $J^2(Q) = \|w\|_1$, the current analysis leads to the error upper bound $O(\|w\|_1^{1/2} n^{-1/4})$ with a logarithmic dependence on $p$. This result might be obtained using Corollary 5 of Zhang (2002) as Assumption A4. To get the faster rate of $O(n^{-1/2})$, one should make extra assumptions on the Grammian, as was done by Ghavamzadeh et al. (2011). We should emphasize that even with the choice of linear function approximators and $\ell_1$-regularization, REG-LSTD would not be the same algorithm as Lasso-TD, since REG-LSTD uses regularization in both optimization problems (15)-(16).
Also note that the error upper bound of Ghavamzadeh et al. (2011) is on the empirical norm $\|\cdot\|_{2,\mathcal{D}_n}$, as opposed to the norm $\|\cdot\|_{2,\nu}$, which is w.r.t. the measure $\nu$. This means that their result does not provide a generalization upper bound on the quality of the estimated value function over the whole state space, but only an upper bound on the training data. Comparing this work with its conference version (Farahmand et al., 2009b), we observe that the main difference in the theoretical guarantees is that the current results are for more general function spaces than the Sobolev spaces considered in the conference paper. Assumption A4 specifies the requirement on the capacity of the function space, which is satisfied not only by the Sobolev spaces (with the choice of $\alpha = d/(2k)$ for $W^k(\mathcal{X})$ with $\mathcal{X}$ being an open Euclidean ball in $\mathbb{R}^d$ and $k > d/2$; cf. Section A.5.6 alongside Lemma 6.21 of Steinwart and Christmann (2008), Theorem 2.4 of van de Geer (2000) for $\mathcal{X} = [0,1]$, or Lemma 20.6 of Györfi et al. (2002) for $\mathcal{X} = [0,1]^d$), but also by many other large function spaces, including several commonly-used RKHSs.
6. Conclusion and Future Work
We introduced two regularization-based API algorithms, namely REG-LSPI and REG-BRM, to solve RL/Planning problems with large state spaces. Our formulation was general and could incorporate many types of function spaces and regularizers. We specifically showed how these algorithms can be implemented efficiently when the function space is the span of a finite number of basis functions (parametric model) or an RKHS (nonparametric model). We then focused on the statistical properties of REG-LSPI and provided its performance loss upper bound (Theorem 14).
The error bound demonstrated the role of the sample size, the complexity of the function space to which the action-value function belongs (quantified by its metric entropy in Assumption A4), and intrinsic properties of the MDP, such as the behavior of the concentrability coefficients and the smoothness-expansion property of the Bellman operator (Definition 12 and Assumption A7). The result indicated that the dependence on the sample size for the task of policy evaluation is optimal. This work (and its conference (Farahmand et al., 2009b) and dissertation (Farahmand, 2011b) versions), alongside the work on the Regularized Fitted Q-Iteration algorithm (Farahmand et al., 2008, 2009a), is the first that addresses the statistical performance of a regularized RL algorithm. Nevertheless, there have been a few other works that also used regularization for RL/Planning problems, most often without analyzing their statistical properties. Jung and Polani (2006) studied adding regularization to BRM, but their solution is restricted to deterministic problems. The main contribution of that work was the development of fast incremental algorithms using the sparsification technique. $\ell_1$-regularization has been considered by Loth et al. (2007), who were similarly concerned with incremental implementations and computational efficiency. Xu et al. (2007) provided a kernel-based, but not regularized, formulation of LSPI. They used sparsification to provide basis functions for the LSTD procedure. Sparsification leads to the selection of only a subset of data points to be used as basis functions, and thus indirectly controls the complexity of the resulting function space. This should be contrasted with a regularization-based approach, in which the regularizer interacts with the empirical loss to jointly determine the subset of the function space to which the estimate belongs.
Kolter and Ng (2009) formulated an $\ell_1$-regularized fixed-point formulation of LSTD, which is called Lasso-TD by Ghavamzadeh et al. (2011), and provided a LARS-like algorithm (Efron et al., 2004) to compute the solutions. Johns et al. (2010) considered the same fixed-point formulation and cast it as a linear complementarity problem. The statistical properties of this $\ell_1$-regularized fixed-point formulation were studied by Ghavamzadeh et al. (2011), as discussed earlier. Lasso-TD has a fixed-point formulation, which looks different from our coupled optimization formulation (15)-(16), but under the on-policy sampling scenario, it is equivalent to a particular version of REG-LSTD: if we choose a fixed linear function approximator (parametric), use the $\ell_1$-norm in the projection optimization problem (15), but do not regularize optimization problem (16) (i.e., $\lambda_{Q,n} = 0$), we get Lasso-TD. Geist and Scherrer (2012) suggested a different algorithm where the projection is not regularized (i.e., $\lambda_{h,n} = 0$), but the optimization problem (16) is regularized with the $\ell_1$-norm of the parameter weights. The choice of only regularizing (16) is the same as in the algorithm introduced and analyzed by Avila Pires and Szepesvári (2012), except that the latter work uses the $\ell_2$-norm. Hoffman et al. (2012) introduced an algorithm similar to that of Geist and Scherrer (2012), with the difference that the projection optimization (15) uses the $\ell_2$-norm (so it is a mixed $\ell_1/\ell_2$-regularized algorithm). All these algorithms are parametric. Several TD-based algorithms and their regularized variants are discussed in a survey by Dann et al. (2014). Taylor and Parr (2009) unified several kernelized reinforcement learning algorithms, and showed the equivalence of kernelized value function approximators such as GPTD (Engel et al., 2005), the work of Xu et al.
(2007), and a few other methods with a model-based reinforcement learning algorithm that has certain regularization on the transition kernel estimator, the reward estimator, or both. Their result was obtained by considering two separate regularized regression problems: one that predicts the reward function given the current state, and another that predicts the next-state kernel values given the current-state ones. Their formulation is different from ours, which is stated as a coupled optimization problem in an RKHS. Similar to other kernel-based algorithms (e.g., SVMs, Gaussian Process Regression, splines, etc.), devising a computationally efficient implementation of REG-LSPI/BRM is important to ensure that it is a practical algorithm for large-scale problems. A naive implementation of these algorithms requires computation time of $O(n^3 K)$, which is prohibitive for large sample sizes. One possible workaround is to reduce the effective number of samples by the sparsification technique (Engel et al., 2005; Jung and Polani, 2006; Xu et al., 2007). Another is to use fast vector-matrix multiplication methods, which are used in iterative methods for matrix inversion, such as those based on the Fast Multipole Method (Beatson and Greengard, 1997) and the Fast Gauss Transform (Yang et al., 2004). These methods can reduce the computational cost of vector-matrix multiplication from $O(n^2)$ to $O(n \log n)$, which results in a computation time of $O(n^2 K \log n)$ for REG-LSPI/BRM, at the cost of some small, but controlled, numerical error. Another possibility is to use stochastic gradient-like algorithms, similar to the works of Liu et al. (2012); Qin et al. (2014). The use of stochastic gradient-like algorithms is especially appealing in the light of results such as Bottou and Bousquet (2008); Shalev-Shwartz and Srebro (2008), which analyze the tradeoff between the statistical error and the optimization error caused by the choice of optimization method.
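As an illustration of the first workaround (reducing the effective number of samples), the sketch below contrasts a full kernel ridge solve, whose cost scales as $O(n^3)$ in the $n \times n$ system, with a simple subset-of-data approximation that solves an $m \times m$ system for $m \ll n$. This is a generic stand-in for the sparsification techniques cited above, not those algorithms themselves; the bandwidth, regularization values, and target function are our own choices for illustration.

```python
import numpy as np

def gauss_kernel(A, B, sigma=0.1):
    # Gaussian kernel matrix between the row-sets A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
n, m, lam = 400, 60, 1e-4
X = rng.uniform(size=(n, 1))
y = np.sin(4 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=n)

# Full solve: an n x n linear system, O(n^3) time.
K = gauss_kernel(X, X)
beta = np.linalg.solve(K + n * lam * np.eye(n), y)
f_full = K @ beta

# Subset-of-data: keep m "inducing" samples, solve an m x m system instead.
idx = rng.choice(n, size=m, replace=False)
Km = gauss_kernel(X[idx], X[idx])
beta_m = np.linalg.solve(Km + m * lam * np.eye(m), y[idx])
f_sub = gauss_kernel(X, X[idx]) @ beta_m

# Both fits should track the underlying regression function reasonably well.
mse_full = float(np.mean((f_full - np.sin(4 * np.pi * X[:, 0])) ** 2))
mse_sub = float(np.mean((f_sub - np.sin(4 * np.pi * X[:, 0])) ** 2))
```

The point of the comparison is the cost of the linear solve, not statistical efficiency: the subset solution trades some accuracy for a cubic-in-$m$ rather than cubic-in-$n$ computation.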
They show that one might achieve lower generalization error by using a faster stochastic gradient-like algorithm, which processes more data points less accurately, rather than a slower but more accurate optimization algorithm, which can only process fewer data points. Designing scalable optimization algorithms for REG-LSPI/BRM is a topic for future work. An important issue in the successful application of any RL/Planning algorithm, including REG-LSPI and REG-BRM, is the proper choice of parameters. In REG-BRM and REG-LSTD we are faced with the choice of $\mathcal{F}^{|\mathcal{A}|}$ and the corresponding regularization parameters $\lambda_{Q,n}$ and $\lambda_{h,n}$. The proper choice of these parameters, however, depends on quantities that are not known, e.g., $J(Q^\pi)$ and the choice of $\mathcal{F}^{|\mathcal{A}|}$ that matches the MDP. This problem in the RL/Planning context has been addressed by Farahmand and Szepesvári (2011). They introduced a complexity-regularization-based model selection algorithm that allows one to design adaptive algorithms: algorithms that perform almost as well as one with prior knowledge of the best parameters. Another important question is how to extend these algorithms to deal with continuous-action MDPs. There are two challenges: computational and statistical. The computational challenge is finding the greedy action at each state in the policy improvement step. In general, this is an intractable optimization problem, which cannot be solved exactly or even with any suboptimality guarantee. To analyze this inexact policy improvement, some parts of the theory, especially the error propagation result, should be modified. Moreover, we also have a statistical challenge: one should specifically control the complexity of the policy space, as the complexity of $\{\max_{a \in \mathcal{A}} Q(\cdot, a) : Q \in \mathcal{F}^{|\mathcal{A}|}\}$ might be infinite even though $\mathcal{F}^{|\mathcal{A}|}$ has finite complexity (Antos et al., 2008a).
A properly modified algorithm might be similar to the continuous-action extension of Farahmand et al. (2015), an API algorithm that explicitly controls the complexity of the policy space. Finally, an open theoretical question is to characterize the properties of the MDP that determine the function space to which the action-value function belongs. A similar question is how the values of $L_P$ and $L_R$ in Assumption A7 are related to the intrinsic properties of the MDP. We partially addressed this question for convolutional MDPs, but the analysis of more general MDPs remains to be done.
Acknowledgments
We thank the members of the Reinforcement Learning and Artificial Intelligence (RLAI) research group at the University of Alberta and the Reasoning and Learning Lab at McGill University for fruitful discussions. We thank the anonymous reviewers and the editor for their helpful comments and suggestions, which improved the quality of the paper. We gratefully acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Alberta Innovates Centre for Machine Learning (AICML). Shie Mannor was partially supported by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 306638 (SUPREL).
Proofs and Auxiliary Results
In these appendices, we first prove Theorem 10, which provides the closed-form solutions for REG-LSTD and REG-BRM when the function space is an RKHS (Appendix A). We then attend to the proof of Theorem 11 (policy evaluation error for REG-LSTD). The main body of the proof of Theorem 11 is in Appendix B. To increase readability and flow, the proofs of some of the auxiliary and more technical results are postponed to Appendices C, D, and E. More specifically, we prove an extension of Theorem 21.1 of Györfi et al. (2002) in Appendix C (Lemma 15). We present a modified version of Theorem 10.2 of van de Geer (2000) in Appendix D. We then provide a covering number result in Appendix E (Lemma 20).
The reason we require these results will become clear in Appendix B. Finally, we introduce convolutional MDPs as an instance of problems that satisfy Assumption A7 (Appendix F). We would like to remark that the generic constants $c, c' > 0$ in the proofs, especially those related to the statistical guarantees, might change from line to line when their exact value is not important for the bound. These values are constant as a function of the important quantities of the upper bound (such as $n$, $\alpha$, $J(Q^\pi)$, etc.), but may depend on $Q_{max}$ or $|\mathcal{A}|$.
Appendix A. Proof of Theorem 10 (Closed-Form Solutions for RKHS Formulation of REG-LSTD/BRM)
Proof REG-BRM: First, notice that the optimization problem (28) can be written in the form $c_n(Q) + \lambda_{Q,n}\|Q\|_H^2 \xrightarrow{Q} \min!$ with an appropriately defined functional $c_n$.19 In order to apply the representer theorem (Schölkopf et al., 2001), we need to show that $c_n$ depends on $Q$ only through the data points $Z_1, Z'_1, \ldots, Z_n, Z'_n$. This is immediate for all the terms that define $c_n$ except the term that involves $\hat{h}_n(\cdot; Q)$. However, since $\hat{h}_n$ is defined as the solution to the optimization problem (27), calling on the representer theorem once again, we observe that $\hat{h}_n$ can be written in the form $\hat{h}_n(\cdot; Q) = \sum_{t=1}^{n} \beta_t k(Z_t, \cdot)$, where $\beta = (\beta_1, \ldots, \beta_n)^\top$ satisfies
$\beta = \operatorname*{argmin}_{\beta \in \mathbb{R}^n} \left[ \|K_h\beta - \hat{T}^\pi Q\|_n^2 + \lambda_{h,n}\, \beta^\top K_h\beta \right].$
Solving this minimization problem leads to $\beta = (K_h + n\lambda_{h,n} I)^{-1}(\hat{T}^\pi Q)$. In both equations $(\hat{T}^\pi Q)$ is viewed as the $n$-dimensional vector $\big( (\hat{T}^\pi Q)(Z_1), \ldots, (\hat{T}^\pi Q)(Z_n) \big)^\top = \big( R_1 + \gamma Q(Z'_1), \ldots, R_n + \gamma Q(Z'_n) \big)^\top$. Thus, $\beta$ depends on $Q$ only through $Q(Z'_1), \ldots, Q(Z'_n)$. Plugging this solution into (28), we get that $c_n(Q)$ indeed depends on $Q$ through $Q(Z_1), Q(Z'_1), \ldots, Q(Z_n), Q(Z'_n)$, and thus on the data points $Z_1, Z'_1, \ldots, Z_n, Z'_n$. The representer theorem then implies that the minimizer of $c_n(Q) + \lambda_{Q,n}\|Q\|_H^2$ can be written in the form $Q(\cdot) = \sum_{i=1}^{2n} \alpha_i k(\tilde{Z}_i, \cdot)$, where $\tilde{Z}_i = Z_i$ if $1 \le i \le n$ and $\tilde{Z}_i = Z'_{i-n}$ otherwise.
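The closed form for $\beta$ above is the ridge solution of the stated minimization. The following numerical sketch (our own notation, with a Gaussian kernel and random stand-in data) checks that $\beta = (K_h + n\lambda_{h,n} I)^{-1}(\hat{T}^\pi Q)$ zeroes the gradient of the objective $\|K_h\beta - \hat{T}^\pi Q\|_n^2 + \lambda_{h,n}\beta^\top K_h\beta$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam, gamma = 50, 1e-2, 0.99

# Stand-ins: state-action points Z_i, rewards R_i, and Q evaluated at next points Z'_i.
Z = rng.uniform(size=(n, 2))
R = rng.normal(size=n)
Q_next = rng.normal(size=n)   # plays the role of Q(Z'_i)
y = R + gamma * Q_next        # the vector (T-hat^pi Q)(Z_i) = R_i + gamma * Q(Z'_i)

# Gaussian kernel Gram matrix K_h over the points Z_i.
d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
K_h = np.exp(-d2)

# Closed-form coefficients from the ridge problem above.
beta = np.linalg.solve(K_h + n * lam * np.eye(n), y)

# Gradient of ||K_h b - y||_n^2 + lam * b' K_h b at beta must vanish:
# (2/n) K_h (K_h beta - y) + 2 lam K_h beta = (2/n) K_h [(K_h + n lam I) beta - y] = 0.
grad = (2.0 / n) * K_h @ (K_h @ beta - y) + 2 * lam * K_h @ beta
assert np.max(np.abs(grad)) < 1e-8
```

The same first-order argument, with the smoother matrix of this ridge step substituted into (28), is what yields the coupled system solved for $\alpha$ below.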
Let α = (α_1, . . . , α_n, α′_1, . . . , α′_n)^⊤. Using the reproducing kernel property of k, we get the optimization problem ‖C_1 K_Q α − (r + γ C_2 K_Q α)‖²_n − ‖B(r + γ C_2 K_Q α)‖²_n + λ_{Q,n} α^⊤ K_Q α → min_α. Solving this for α concludes the proof for REG-BRM.

REG-LSTD: The first part of the proof, showing that c_n depends on Q only through the data points Z_1, Z′_1, . . . , Z_n, Z′_n, is exactly the same as in the proof for REG-BRM. Thus, using the representer theorem, the minimizer of (30) can be written in the form Q(·) = Σ_{i=1}^{2n} α_i k(Z̃_i, ·), where Z̃_i = Z_i if i ≤ n and Z̃_i = Z′_{i−n} otherwise. Let α = (α_1, . . . , α_n, α′_1, . . . , α′_n)^⊤. Using the reproducing kernel property of k, we get the optimization problem ‖(C_1 − γE C_2) K_Q α − E r‖²_n + λ_{Q,n} α^⊤ K_Q α → min_α. Replacing C_1 − γE C_2 with F and solving for α concludes the proof.

19. Here f(Q) → min_Q! indicates that Q is a minimizer of f(Q).

Figure 2: Dependencies of results used to prove the statistical guarantee for REG-LSPI (Theorem 14). [Figure: a dependency graph showing that Theorem 14 (REG-LSPI) follows from Theorem 11 (REG-LSTD) and Theorem 13 (Error Propagation); Theorem 11 relies on Lemma 15, Lemma 18, and Lemma 19; these in turn rely on Theorem 16, Lemma 17, Lemma 20 (covering number), Lemma 22 (relative deviation), Lemma 23 (supremum of weighted sum), and Lemma 24 (modulus of continuity).]

Appendix B. Proof of Theorem 11 (Statistical Guarantee for REG-LSTD)

The goal of Theorem 11 is to provide a finite-sample upper bound on the Bellman error ‖Q̂ − T^π Q̂‖_ν for REG-LSTD, defined by the optimization problems (15) and (16). Since ‖Q̂ − T^π Q̂‖²_ν ≤ 2‖Q̂ − ĥ_n(·; Q̂)‖²_ν + 2‖ĥ_n(·; Q̂) − T^π Q̂‖²_ν, we may upper bound the Bellman error by upper bounding each term on the RHS. Recall from the discussion after Theorem 11 that the analysis is more complicated than in the conventional supervised learning setting because the corresponding optimization problems are coupled: ĥ_n(·; Q̂) is a function of Q̂, which is itself a function of ĥ_n(·; Q̂).
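As a concrete illustration of the inner regularized regression step, the closed form β* = (K_h + nλ_{h,n} I)^{−1}(T̂^π Q) derived above is ordinary kernel ridge regression. The following sketch is illustrative only: the function names, the RBF kernel choice, and the synthetic data are our own assumptions, not part of the paper. It solves the linear system with NumPy and checks numerically that the closed-form β* minimizes the regularized empirical objective.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    # Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def fit_inner_regression(Z, y, lam):
    """Closed-form solution of min_beta ||K beta - y||_n^2 + lam * beta' K beta,
    namely beta = (K + n * lam * I)^{-1} y (kernel ridge regression)."""
    n = len(Z)
    K = rbf_kernel(Z, Z)
    beta = np.linalg.solve(K + n * lam * np.eye(n), y)
    return beta, K

def objective(b, K, y, lam):
    # empirical squared loss (||.||_n^2 is a mean) plus RKHS penalty beta' K beta
    return ((K @ b - y) ** 2).mean() + lam * b @ K @ b

rng = np.random.default_rng(0)
Z = rng.uniform(-1.0, 1.0, size=(20, 2))   # stand-in for sampled (state, action) pairs
y = np.sin(Z[:, 0]) + 0.5 * Z[:, 1]        # stand-in for the targets R_i + gamma * Q(Z'_i)
lam = 0.1
beta, K = fit_inner_regression(Z, y, lam)

# The closed form should (numerically) minimize the objective: random
# perturbations of beta must not decrease it.
base = objective(beta, K, y, lam)
ok = all(objective(beta + 0.01 * rng.standard_normal(len(y)), K, y, lam) >= base - 1e-9
         for _ in range(5))
print(ok)
```

Setting the gradient of the objective to zero gives K[(1/n)(Kβ − y) + λβ] = 0, which is satisfied by (K + nλI)β = y; the perturbation check above exercises exactly this optimality.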
Theorem 11 is proven using Lemma 15, which upper bounds ‖ĥ_n(·; Q̂) − T^π Q̂‖_ν, and Lemma 19, which upper bounds ‖Q̂ − ĥ_n(·; Q̂)‖_ν. We also need to relate the smoothness J(Q̂) to the smoothness J(Qπ); Lemma 18 specifies this relation. The proofs of these lemmas themselves require further developments, which will be discussed when we encounter them. Figure 2 shows the dependencies between all the results that lead to the proof of Theorem 11 and consequently Theorem 14. The following lemma controls the error behavior resulting from the optimization problem (15). This lemma, which is a result on the error upper bound of a regularized regression estimator, is similar to Theorem 21.1 of Györfi et al. (2002) with two main differences: first, it holds uniformly over T^πQ (as opposed to a fixed function T^πQ); second, it holds for function spaces that satisfy a general metric entropy condition (as opposed to the special case of Sobolev spaces).

Lemma 15 (Convergence of ĥ_n(·; Q) to T^πQ) For any random Q ∈ F^{|A|}, let ĥ_n(Q) be defined according to (15). Under Assumptions A1–A5 and A7, there exist finite constants c_1, c_2 > 0 such that for any n ∈ N and 0 < δ < 1, we have ‖ĥ_n(·; Q) − T^πQ‖²_ν ≤ 4λ_{h,n} J²(T^πQ) + 2λ_{h,n} J²(Q) + c_1 (1 / (n λ^α_{h,n})) + c_2 ln(1/δ)/n with probability at least 1 − δ. Proof See Appendix C.

When we use this lemma to prove Theorem 11, the action-value function Q that appears in the bound is the result of the optimization problem defined in (16), that is Q̂, and so is random. Lemma 18, which we will prove later, provides a deterministic upper bound on the smoothness J(Q̂) of this random quantity. It turns out that to derive our main result, we need to know more about the behavior of the regularized regression estimator than what is shown in Lemma 15. In particular, we need an upper bound on the empirical error of the regularized regression estimator ĥ_n(·; Q) (cf. (33) below).
Moreover, we should bound the random smoothness J(ˆhn( ; Q)) by some deterministic quantities, which turns out to be a function of J(T πQ) and J(Q). Theorem 16 provides us with the required upper bounds. This theorem is a modification of Theorem 10.2 by van de Geer (2000), with two main differences: 1) It holds uniformly over Q and 2) ˆhn( ; Q) uses the same data Dn that is used to estimate Q itself. We introduce the following notation: Let w = (x, a, r, x ) and define the random variables wi = (Xi, Ai, Ri, X i) for 1 i n. The data set Dn would be {w1, . . . , wn}. For a measurable function g : X A R X R, let g 2 n = 1 n Pn i=1 |g(wi)|2. Consider the regularized least squares estimator: ˆhn( ; Q) = argmin h F|A| h h [Ri + γQ(X i, π(X i))] 2 n + λh,n J2(h) i , (33) which is the same as (15) with π replacing πk. Theorem 16 (Empirical error and smoothness of ˆhn( ; Q)) For a random function Q F|A|, let ˆhn( , Q) be defined according to (33). Suppose that Assumptions A1 A5 and A7 hold. Then there exist constants c1, c2 > 0, such that for any n N and 0 < δ < 1, we have ˆhn( ; Q) T πQ n c1 max Qmax (J(Q) + J(T πQ)) α 1+α ln(1/δ) λh,n J(T πQ) Regularized Policy Iteration with Nonparametric Function Spaces J(ˆhn( ; Q)) c2 max J(Q) + J(T πQ), Q1+α max with probability at least 1 δ. Proof See Appendix D. The following lemma, which is an immediate corollary of Theorem 16, indicates that with the proper choice of the regularization coefficient, the complexity of the regression function ˆhn( ; Q) is in the same order as the complexities of Q, T πQ, and Qπ. This result will be used in the proof of Lemma 21, which itself is used in the proof of Lemma 19. Lemma 17 (Smoothness of ˆhn( ; Q)) For a random Q F|A|, let ˆhn( ; Q) be the solution to the optimization problem (15) with the choice of regularization coefficient λh,n = 1 n J2(Qπ) Let Assumptions A1 A5 and A7 hold. 
Then, there exists a finite constant c > 0, depending on Qmax, such that for any n ∈ N and 0 < δ < 1, the upper bound J(ĥ_n(·; Q)) ≤ c (J(T^πQ) + J(Q) + J(Qπ) √(ln(1/δ))) holds with probability at least 1 − δ. Proof With the choice of λ_{h,n} = [1/(n J²(Qπ))]^{1/(1+α)}, Theorem 16 implies that there exist a finite constant c_1 > 0 as well as c_2 > 0, which depends on Qmax, such that for any n ∈ N and 0 < δ < 1, the inequality J(ĥ_n(·; Q)) ≤ c_1 max{ J(Q) + J(T^πQ), Q^{1+α}_max [1/(n J²(Qπ))]^{...} } ≤ c_2 (J(T^πQ) + J(Q) + J(Qπ) √(ln(1/δ))) holds with probability at least 1 − δ.

An intuitive understanding of this result might be gained if we consider ĥ_n(·; Qπ), which is the regression estimate for T^πQπ = Qπ. This lemma then indicates that the smoothness of ĥ_n(·; Qπ) is comparable to the smoothness of its target function Qπ. This is intuitive whenever the regularization coefficients are chosen properly. The following lemma relates J(Q̂) and J(T^πQ̂), which are random, to the complexity of the action-value function of the policy π, i.e., J(Qπ). This result is used in the proof of Theorem 11.

Lemma 18 (Smoothness of Q̂) Let Assumptions A1–A7 hold, and let Q̂ be the solution to (16) with the choice of λ_{h,n} = [1/(n J²(Qπ))]^{1/(1+α)}. Then, there exists a finite constant c > 0 such that for any n ∈ N and 0 < δ < e^{−1}, we have λ_{Q,n} J²(Q̂) ≤ λ_{Q,n} J²(Qπ) + c J^{2α/(1+α)}(Qπ) ln(1/δ) n^{−1/(1+α)} with probability at least 1 − δ. Proof By Assumption A6 we have Qπ ∈ F^{|A|}, so by the optimizer property of Q̂ (cf. (16)), we get λ_{Q,n} J²(Q̂) ≤ ‖Q̂ − ĥ_n(·; Q̂)‖²_{Dn} + λ_{Q,n} J²(Q̂) ≤ ‖Qπ − ĥ_n(·; Qπ)‖²_{Dn} + λ_{Q,n} J²(Qπ). (34) Since Qπ = T^πQπ, we have ‖Qπ − ĥ_n(·; Qπ)‖_{Dn} = ‖T^πQπ − ĥ_n(·; Qπ)‖_{Dn}. So Theorem 16 shows that with the choice of λ_{h,n} = [1/(n J²(Qπ))]^{1/(1+α)}, there exists a finite constant c_1 > 0 such that for any n ∈ N and 0 < δ < e^{−1} ≈ 0.3679, we have ‖Qπ − ĥ_n(·; Qπ)‖²_{Dn} ≤ c_1 (1 ∨ Q^{2(1+α)}_max) J^{2α/(1+α)}(Qπ) ln(1/δ) n^{−1/(1+α)}, (35) with probability at least 1 − δ. Chaining inequalities (34) and (35) finishes the proof.
The other main ingredient of the proof of Theorem 11 is an upper bound to ˆQ ˆhn( ; ˆQ) ν, which is closely related to the optimization problem (16). This task is done by Lemma 19. In the proof of this lemma, we call Lemma 21, which shall be stated and proven right after this result. Lemma 19 (Convergence of ˆQ ˆhn( ; ˆQ) ν) Let ˆQ be the solution to the set of coupled optimization problems (15) (16). Suppose that Assumptions A1 A7 hold. Then there exists a finite constant c > 0 such that for any n N and 0 < δ < 2e 1 and with the choice of λh,n = λQ,n = 1 n J2(Qπ) we have ˆQ ˆhn( ; ˆQ) 2 ν c (1 + γ2L2 P )α J 2α 1+α (Qπ) ln(1/δ) + L 2α 1+α R n 1 1+α , with probability at least 1 δ. Regularized Policy Iteration with Nonparametric Function Spaces Proof Decompose ˆQ ˆhn( ; ˆQ) 2 ν = I1,n + I2,n, 1 2I1,n = ˆQ ˆhn( ; ˆQ) 2 Dn + λQ,n J2( ˆQ), I2,n = ˆQ ˆhn( ; ˆQ) 2 ν I1,n. (36) In what follows, we upper bound each of these terms. I1,n: Use the optimizer property of ˆQ to get 1 2I1,n = ˆQ ˆhn( ; ˆQ) 2 Dn + λQ,n J2( ˆQ) Qπ ˆhn( ; Qπ) 2 Dn + λQ,n J2(Qπ). To upper bound Qπ ˆhn( ; Qπ) 2 Dn = T πQπ ˆhn( ; Qπ) 2 Dn, we evoke Theorem 16. For our choice of λQ,n, there exists a constant c1 > 0 such that for any n N and 0 < δ1 < 1, we have 1 2I1,n λQ,n J2(Qπ) + c1 J 2α 1+α (Qπ) ln(1/δ1) n 1 1+α , (37) with probability at least 1 δ1. I2,n: With our choice of λQ,n and λh,n, Lemma 21, which shall be proven later, indicates that there exist some finite constants c2, c3, c4 > 0 such that for any n N and finite J(Qπ), LR, and LP , and 0 < δ2 < 1, we have 2α 1+α R + [J(Qπ)] 2α 1+α [ln(1/δ2)] α 1+α n 1 1+α + c3 (1 + γ2L2 P )α n λα Q,n + c4 ln(1/δ2) with probability at least 1 δ2. For δ2 < e 1 and α 0, we have [ln(1/δ2)] α 1+α ln(1/δ2), and also 1 n λα Q,n = [J(Qπ)] 2α 1+α n 1 1+α [J(Qπ)] 2α 1+α n 1 1+α ln(1/δ2). (39) With the right choice of constants, ln(1/δ2) n can be absorbed into the other terms. Select δ1 = δ2 = δ/2. 
Inequalities (37), (38), and (39) imply that with the specified choice of λQ,n and λh,n, there exists a finite constant c5 > 0 such that for any 0 < δ < 2e 1, we have ˆQ ˆhn( ; ˆQ) 2 ν c5 (1 + γ2L2 P )α J 2α 1+α (Qπ) ln(1/δ) + L 2α 1+α R n 1 1+α , with probability at least 1 δ. To upper bound I2,n, defined in (36), we simultaneously apply the peeling device (cf. Section 5.3 of van de Geer 2000) on two different, but coupled, function spaces (one to which Farahmand, Ghavamzadeh, Szepesv ari, and Mannor ˆQ belongs and the other to which ˆhn( ; ˆQ) belongs). In each layer of peeling, we apply an exponential tail inequality to control the relative deviation of the empirical mean from the true mean (Lemma 22 in Appendix C). We also require a covering number result, which is stated as Lemma 20. The final result of this procedure is a tight upper bound on I2,n, as stated in Lemma 21. To prepare for the peeling argument, define the following subsets of F and F|A|: Fσ f : f F, J2(f) σ , F|A| σ n f : f F|A|, J2(f) σ o . g Q,h(x, a) j=1 I{a=aj} [Qj(x) hj(x)]2 . (40) To simplify the notation, we use z = (x, a) and Z = (X, A) in the rest of this section. Define Gσ1,σ2 as the space of g Q,h functions with J(Q) σ1 and J(h) σ2, i.e., Gσ1,σ2 n g Q,h : Rd A R; Q F|A| σ1 , h F|A| σ2 o . (41) The following lemma provides an upper bound on the covering numbers of Gσ1,σ2. Lemma 20 (Covering Number) Let Assumptions A3, A4, and A5 hold. Then there exists a constant c1 > 0, independent of σ1, σ2, α, Qmax, and |A|, such that for any u > 0 and all ((x1, a1), . . . , (xn, an)) X A, the empirical covering number of the class of functions Gσ1,σ2 defined in (41) w.r.t. the empirical norm 2,z1:n is upper bounded by log N2(u, Gσ1,σ2, (x, a)1:n) c1|A|1+αQ2α max (σα 1 + σα 2 ) u 2α. Proof See Appendix E. Next, we state and prove Lemma 21, which provides a high probability upper bound on I2,n. Lemma 21 Let I2,n be defined according to (36). 
Under Assumptions A1 A5 and A7 and with the choice of λh,n = λQ,n = 1 n J2(Qπ) there exist constants c1, c2, c3 > 0, such that for any n N, finite J(Qπ), LR, and LP , and δ > 0 we have 2α 1+α R + [J(Qπ)] 2α 1+α [ln(1/δ)] α 1+α n 1 1+α + c2 (1 + γ2L2 P )α n λα Q,n + c3 ln(1/δ) with probability at least 1 δ. Regularized Policy Iteration with Nonparametric Function Spaces Proof Let Z = (X, A) be a random variable with distribution ν that is independent from Dn. Without loss of generality, we assume that Qmax 1/2. We use the peeling device in conjunction with Lemmas 20 and 22 to obtain a tight high-probability upper bound on I2,n. Based on the definition of I2,n in (36) we have P {I2,n > t} = P E h g ˆQ,ˆhn( ; ˆQ)(Z)|Dn i 1 n Pn i=1 g ˆQ,ˆhn( ; ˆQ)(Zi) t + 2λQ,n J2( ˆQ) + E h g ˆQ,ˆhn( ; ˆQ)(Z)|Dn i > 1 To benefit from the peeling device, we relate the complexity of ˆhn( ; ˆQ) to the complexity of ˆQ. For a fixed δ1 > 0 and some constant c > 0, to be specified shortly, define the following event: A0 = n ω : J2(ˆhn( ; ˆQ)) c J2(T π ˆQ) + J2( ˆQ) + J2(Qπ) ln(1/δ1) o . Lemma 17 indicates that P {A0} 1 δ1, where the constant c here can be chosen to be three times of the squared value of the constant in the lemma. We have P {I2,n > t} = P I2,n > t, AC 0 + P {I2,n > t, A0} δ1 + P {I2,n > t, A0}, so we focus on upper bounding P {I2,n > t, A0}. Since ˆQ F|A|, there exists l N0 such that 2lt I{l =0} 2λQ,n J2( ˆQ) < 2l+1t. Fix l N0. For any Q F|A|, Assumption A7 relates J(T πQ) to J(Q): λQ,n J2(T πQ) 2 L2 R + γ2L2 P 2lt λQ,n Thus on the event A0, if ˆQ F|A| σl 1 where σl 1 = 2lt λQ,n , we also have ˆhn( ˆQ) F|A| σl 2 with σl 2 = c 2 L2 R + (1 + γ2L2 P ) 2lt + J2(Qπ) ln(1/δ1) . (43) Apply the peeling device on (42). 
Use (43) and note that if for an l N0 we have 2λQ,n J2( ˆQ) 2lt I{l =0}, we also have t + 2λQ,n J2( ˆQ) 2lt to get P {I2,n > t} = P I2,n > t, AC 0 + P {I2,n > t, A0} A0, 2lt I{l =0} 2λQ,n J2( ˆQ) < 2l+1t, E h g ˆQ,ˆhn( ; ˆQ)(Z)|Dn i 1 n Pn i=1 g ˆQ,ˆhn( ; ˆQ)(Zi) t + 2λQ,n J2( ˆQ) + E h g ˆQ,ˆhn( ; ˆQ)(Z)|Dn i > 1 sup g Q,h Gσl 1,σl 2 E [g Q,h(Z)|Dn] 1 n Pn i=1 g Q,h(Zi) 2lt + E [g Q,h(Z)|Dn] > 1 Let us study the behavior of the lth term of the above summation by verifying the conditions of Lemma 22 with the choice of ε = 1 2 and η = 2lt. Farahmand, Ghavamzadeh, Szepesv ari, and Mannor Condition (A1): Since all functions involved are bounded by Qmax, it is easy to see that |g Q,h(x, a)| P|A| j=1 I{a=aj} [Qj(x) hj(x)]2 4Q2 max. Therefore, K1, defined in Lemma 22, can be set to K1 = 4Q2 max. Condition (A2): We have E [Q(Z) h(Z)]2 2 4Q2 max E h [Q(Z) h(Z)]2i . Therefore, K2 can be set to K2 = 4Q2 max. Condition (A3): We should satisfy 2 4 nη 288 max{8Q2 max, 8Qmax}. Since η = 2lt t, it is sufficient to have in which c is a function of Qmax (we can choose c = 2 46082 Q4 max). Condition (A4): We shall verify that for ε 1 82lt, and σ1 = σl 1 and σ2 = σl 2, the following holds: 2 max{K1, 2K2} 16 max{K1,2K2} g Gσ1,σ2 : 1 i=1 g2(zi) 16ε ) !!1/2 du. (45) Notice that there exists a constant c > 0 such that for any u, ε > 0 g Gσ1,σ2 : 1 i=1 g2(zi) 16ε ) log N2 (u, Gσ1,σ2, z1:n) c (σα 1 + σα 2 )u 2α, (46) where we used Lemma 20 in the second inequality. Plug (46) into (45) with the choice of σ1 = σl 1 = 2lt λQ,n and σ2 = σl 2 = c[2(L2 R + (1 + γ2L2 P ) 2lt λQ,n ) + J2(Qπ) ln(1/δ1)]. Therefore, for some constant c = c (Qmax) > 0, the inequality + c 2 L2 R + (1 + γ2L2 P ) 2lt + J2(Qπ) ln(1/δ1) α implies (45). Because (a + b) 1 2 (a 1 2 + b 1 2 ) for non-negative a and b, it suffices to verify the following two conditions: (a) We shall verify that for ε 1 82lt, we have 2 Q,n (2lt) α Regularized Policy Iteration with Nonparametric Function Spaces for some c > 0. 
Substituting ε with 2lt, we see that it is enough if for some constant c > 0, t c 2l n λα Q,n . (D1) (b) We should verify that for ε 1 82lt, the following is satisfied: L2 R |{z} (b1) + (1 + γ2L2 P ) 2lt λQ,n | {z } (b2) + J2(Qπ) ln(1/δ1) | {z } (b3) for some c > 0. After some manipulations, we get that the previous inequality holds if the following three inequalities are satisfied: (b1) : t c 1 L 2α 1+α R n 1 1+α , (D2) (b2) : t c 2 (1 + γ2L2 P )α nλα Q,n , (D3) (b3) : t c 3 [J(Qπ)] 2α 1+α [ln(1/δ1)] α 1+α n 1 1+α , (D4) for some constants c 1, c 2, c 3 > 0. Fix δ > 0 and let δ1 = δ/2. Whenever (C1), (D1), (D2), (D3), and (D4) are satisfied, for some choice of constants c, c > 0 we have P {I2,n > t} δ 2) 128 2304 max{16Q4max, 4Q2max} 2 + c exp( c n t). Let the left-hand side be equal δ and solve for t. Considering all aforementioned conditions, we get that there exist constants c1, c2, c3 > 0 such that for any n N, finite J(Qπ), LR, and LP , and δ > 0, we have 2α 1+α R + [J(Qπ)] 2α 1+α [ln(1/δ)] α 1+α n 1 1+α + c2 (1 + γ2L2 P )α n λα Q,n + c3 ln(1/δ) with probability at least 1 δ. After developing these tools, we are ready to prove Theorem 11. Proof [Proof of Theorem 11] We want to show that ˆQ T π ˆQ ν is small. Since (15)-(16) minimize ˆhn( ; ˆQ) T π ˆQ ν and ˆQ ˆhn( ; ˆQ) ν, we upper bound ˆQ T π ˆQ ν in terms of these quantities as follows: ˆQ T π ˆQ 2 ν 2 ˆQ ˆhn( ; ˆQ) 2 ν + 2 ˆhn( ; ˆQ) T π ˆQ 2 Farahmand, Ghavamzadeh, Szepesv ari, and Mannor Let us upper bound each of these two terms in the RHS. Fix 0 < δ < 1. Bounding ˆhn( ; ˆQ) T π ˆQ ν: Lemma 15 indicates that there exist constants c1, c2 > 0 such that for any random ˆQ F|A| and any fixed n N, we have ˆhn( ; ˆQ) T π ˆQ 2 ν λh,n 2J2( ˆQ) + 4J2(T π ˆQ) + c1 1 nλα h,n + c2 ln(3/δ) with probability at least 1 δ/3. Note that T π ˆQ F|A| is implied by Assumption A7 and ˆQ F|A|. Because ˆQ is random itself, the terms J( ˆQ) and J(T π ˆQ) in the upper bound of (48) are also random. 
In order to upper bound them, we use Lemma 18, which states that upon the choice of λh,n = λQ,n = [ 1 n J2(Qπ)] 1 1+α , there exists a constant c3 > 0 such that for any n N, λh,n J2( ˆQ) = λQ,n J2( ˆQ) λQ,n J2(Qπ) + c3 J 2α 1+α (Qπ) n 1 1+α ln(3/δ) (49) holds with probability at least 1 δ/3. We use Assumption A7 to show that we have λh,n J2(T π ˆQ) 2λQ,n L2 R + 2(γLP )2 λQ,n J2(Qπ) + c3 J 2α 1+α (Qπ) n 1 1+α ln(3/δ) with the same probability. Plugging (49) and (50) into (48) and using the selected schedule for λQ,n and λh,n, we get ˆhn( ; ˆQ) T π ˆQ 2 ν " 2 + c1 + 8(γLP )2 J 2α 1+α (Qπ) + c3 2 + 8(γLP )2 J 2α 1+α (Qπ) ln(3/δ) + 8L2 R J 2 1+α (Qπ) + c2 ln(3/δ) with probability at least 1 2 3δ. By the proper choice of constants, the term c2n 1 ln(3/δ) can be absorbed into n 1 1+α ln(3/δ). Therefore, there exists a constant c4 > 0 such that ˆhn( ; ˆQ) T π ˆQ 2 c4 1 + (γLP )2 J 2α 1+α (Qπ) ln(1/δ) + 8L2 R [J(Qπ)] 2 1+α n 1 1+α , (51) with probability at least 1 2 3δ. Bounding ˆQ ˆhn( ; ˆQ) ν: With our choice of λQ,n and λh,n, Lemma 19 states that there exists a constant c5 > 0 such that for any n N, ˆQ ˆhn( ; ; ˆQ) 2 ν c5 (1 + γ2L2 P )αJ 2α 1+α (Qπ) ln(1/δ) + L 2α 1+α R n 1 1+α , (52) holds with probability at least 1 δ/3. Regularized Policy Iteration with Nonparametric Function Spaces Thus, inequality (47) alongside upper bounds (51) and (52) indicate that there exist constants c6, c7 > 0 such that for any n N and δ > 0, we have ˆQ T π ˆQ 2 ν c6 1 + (γLP )2 J 2α 1+α (Qπ) ln(1/δ) + c7 2α 1+α R + L2 R [J(Qπ)] 2 1+α with probability at least 1 δ. A careful study of the proof of Theorem 11 and the auxiliary results used in it reveals that one can indeed reuse a single data set in all iterations. Recall that at the kth iteration of an API procedure such as REG-LSPI, the policy π = πk is the greedy policy w.r.t. ˆQ(k 1), so it depends on earlier data sets. 
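The proof of Theorem 11 rests on the elementary decomposition ‖Q̂ − T^π Q̂‖²_ν ≤ 2‖Q̂ − ĥ_n(·; Q̂)‖²_ν + 2‖ĥ_n(·; Q̂) − T^π Q̂‖²_ν, an instance of (a + b)² ≤ 2a² + 2b². The following quick numerical sanity check of this inequality under a weighted (discrete) stand-in for the ν-norm is purely illustrative; the random functions and distribution below are our own choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def weighted_sq_norm(f, w):
    # ||f||_nu^2 with nu approximated by a discrete distribution of weights w
    return float(np.sum(w * f ** 2))

w = rng.dirichlet(np.ones(100))  # weights: nonnegative, sum to one
for _ in range(1000):
    # q, h, tq play the roles of Q-hat, h_n(.; Q-hat), and T^pi Q-hat
    q, h, tq = rng.standard_normal((3, 100))
    lhs = weighted_sq_norm(q - tq, w)
    rhs = 2 * weighted_sq_norm(q - h, w) + 2 * weighted_sq_norm(h - tq, w)
    assert lhs <= rhs + 1e-12  # (a + b)^2 <= 2 a^2 + 2 b^2, termwise under w
print("decomposition holds on all random draws")
```

Since q − tq = (q − h) + (h − tq), the inequality holds pointwise and hence under any nonnegative weighting, which is exactly how it is used in the proof.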
This implies that a function such as T^π Q̂ = T^{π̂(·; Q̂^{(k−1)})} Q̂ is random with two sources of randomness: one source is the data set used in the current iteration, which defines the empirical loss functions; this directly affects Q̂. The other source is π̂(·; Q̂^{(k−1)}), which depends on the data sets of the earlier iterations. When we assume that all data sets are independent of each other, the randomness of π does not cause any problem because we can work on the probability space conditioned on the data sets of the earlier iterations. Conditioned on that randomness, the policy π becomes a deterministic function. This is how we presented the statement of Theorem 11, by stating that π is fixed. Nonetheless, the proofs can handle the dependence with no change. Briefly speaking, the reason is that when we want to provide a high-probability upper bound on certain random quantities, we take the supremum over both Q̂ and T^π Q̂ and consider them as two separate functions, even though they are related through a random T^π operator. To see this more clearly, notice that in the proof of Lemma 15, which is used in the proof of this theorem, we define the function spaces G_l that choose the functions h, Q, and T^πQ separately. We then take the supremum over all functions in G_l. This means that for the probabilistic upper bound, the randomness of π in T^πQ becomes effectively irrelevant, as we are providing a guarantee that is uniform over G_l. In the proof of this theorem, we also use Lemma 19, which itself uses Theorem 16 and Lemma 21, which have a similar construct.

Appendix C. Proof of Lemma 15 (Convergence of ĥ_n(·; Q) to T^πQ)

The following lemma, quoted from Györfi et al. (2002), provides an exponential probability tail inequality for the relative deviation of the empirical mean from the true mean. A slightly modified version of this result was published as Theorem 2 of Kohler (2000). This result is used in the proofs of Lemmas 15 and 21. Lemma 22 (Theorem 19.3 of Györfi et al.
2002) Let Z, Z1, , Zn be independent and identically distributed random variables with values in Z. Let 0 < ε < 1 and η > 0. Assume that K1, K2 1 and let F be a permissible class of functions f : Z R with the following properties: (A2) E f(Z)2 K2E [f(Z)], Farahmand, Ghavamzadeh, Szepesv ari, and Mannor (A3) nε 1 ε η 288 max{2K1, 2K2}, (A4) For all z1, , zn Z and all δ η/8, nε(1 ε)δ 96 2 max{K1, 2K2} ε(1 ε)δ 16 max{K1,2K2} v u u tlog N2 u, n f F : 1 i=1 f2(zi) 16δ o , z1:n n Pn i=1 f(Zi) η + E [f(Z)] > ε 60 exp n η ε2(1 ε) 128 2304 max{K2 1, K2} Let us now turn to the proof of Lemma 15. This proof follows similar steps to the proof of Theorem 21.1 of Gy orfiet al. (2002). Proof [Proof of Lemma 15] Without loss of generality, assume that Qmax 1/2. Denote z = (x, a) and let Z = (X, A) ν, R R( |X, A), and X P( |X, A) be random variables that are independent of Dn = {(Xi, Ai, Ri, X i)}n i=1. Define the following error decomposition Z ˆhn(z; Q) T πQ(z) 2 dν(z) = E ˆhn(Z; Q) [R + γQ(X , π(X ))] 2 Dn E h T πQ(Z) [R + γQ(X , π(X ))] 2i = I1,n + I2,n, 1 2I1,n = 1 ˆhn(Zi; Q) [Ri + γQ(X i, π(X i))] 2 T πQ(Zi) [Ri + γQ(X i, π(X i))] 2 + λh,n J2(ˆhn( ; Q)) + J2(Q) + J2(T πQ) , I2,n = E ˆhn(Z; Q) ˆT πQ(Z) 2 T πQ(Z) ˆT πQ(Z) 2 Dn By the optimizer property of ˆhn( ; Q), we get the following upper bound by substituting ˆhn( ; Q) with T πQ F|A|: T πQ(Zi) ˆT πQ(Zi) 2 T πQ(Zi) ˆT πQ(Zi) 2 + λh,n J2(T πQ) + J2(Q) + J2(T πQ) # = 4λh,n J2(T πQ) + 2λh,n J2(Q). (53) Regularized Policy Iteration with Nonparametric Function Spaces We now turn to upper bounding P {I2,n > t}. Given a policy π and functions h, Q, Q F|A|, for w = (x, a, r, x ) define g : X A R X R as gh,Q,Q (w) = h(z) r + γQ(x , π(x )) 2 Q (z) r + γQ(x , π(x )) 2 . Note that gˆhn( ;Q),Q,T πQ is the function appearing in the definition of I2,n. Define the following function spaces for l = 0, 1, . . . : Gl gh,Q,Q : X A R X R : h, Q, Q F|A|; J2(h), J2(Q), J2(Q ) 2lt Denote W = (X, A, R, X ) and Wi = (Xi, Ai, Ri, X i). 
Apply the peeling device to get P {I2,n > t} h, Q F|A|, 2lt I{l =0} 2λh,n J2(h) + J2(Q) + J2(T πQ) < 2l+1t; s.t. E [gh,Q,T πQ(W)|Dn] 1 n Pn i=1 gh,Q,T πQ(Wi) t + 2λh,n (J2(h) + J2(Q) + J2(T πQ)) + E [gh,Q(W)|Dn] > 1 E [g(W)|Dn] 1 n Pn i=1 g(Wi) 2lt + E [g(W)|Dn] > 1 Here we used the simple fact that if 2λh,n J2(h) + J2(Q) + J2(T πQ) < 2l+1t, then J2(h), J2(Q), and J2(T πQ) are also less than 2lt λh,n , so gh,Q,T πQ Gl. We study the behavior of the lth term of the above summation by verifying the conditions of Lemma 22 similar to what we did in the proof of Lemma 21. It is easy to verify that (A1) and (A2) are satisfied with the choice of K1 = K2 = 4Q2 max. Condition (A3) is satisfied whenever t c1 for some constant c1 > 0 depending on Qmax (the constant can be set to c1 = 2 46082Q2 max). To verify condition (A4), we first require an upper bound on N2(u, Gl, w1:n) for any sequence w1:n. This can be done similar to the proof of Lemma 20: Denote Fl = {f : f F, J2(f) 2lt λh,n }. For gh1,Q1,T πQ1, gh2,Q2,T πQ2 Gl and any sequence w1:n we have i=1 |gh1,Q1,T πQ1(wi) gh2,Q2,T πQ2(wi)|2 12(2 + γ)2Q2 max 1 n h |h1(zi) h2(zi)|2 + 4γ2 Q1(x i, π(x i)) Q2(x i, π(x i)) 2 + |T πQ1(zi) T πQ2(zi)|2 i 12(2 + γ)2Q2 max 1 n h |h1(xi, a) h2(xi, a)|2 + 4γ2 Q1(x i, a) Q2(x i, a) 2 + |T πQ1(xi, a) T πQ2(xi, a)|2 i . Farahmand, Ghavamzadeh, Szepesv ari, and Mannor With the same covering set argument as in the proof of Lemma 20, we get that for any u > 0, 2|A|Qmax u, Gl, w1:n) N2(u, Fl, x1:n)|A| N2(u, Fl, x 1:n)|A| N2(u, Fl, x1:n)|A|. Evoke Assumption A4 to get log N2(u, Gl, w1:n) c(|A|, Qmax) 2lt Plugging this covering number result into condition (A4), one can verify that the condition is satisfied if t c2 nλα h,n , (55) for a constant c2 > 0, which is only a function of Qmax and |A|. Therefore, Lemma 22 indicates that P {I2,n > t} 60 l=0 exp n(2lt)(1/4)(1/2) 128 2304 max{16Q4max, 4Q2max} c3 exp( c4nt). (56) for some constants c3, c4 > 0. 
Combining (53), (54), (55), and (56), we find that there exist c5, c6 > 0 such that for any n N and 0 < δ < 1, we have ˆhn(Q) T πQ 2 ν 4λh,n J2(T πQ) + 2λh,n J2(Q) + c5 1 nλα h,n + c6 ln(1/δ) Here, c5 is only a function of Qmax and |A|, and c6 is a function of Qmax. Appendix D. Proof of Theorem 16 (Empirical Error and Smoothness of ˆhn( ; Q)) To prove Theorem 16, which is a modification of Theorem 10.2 by van de Geer (2000), we first need to modify and specialize Lemma 3.2 by van de Geer (2000) to be suitable to our problem. The modification is required because Q in (33) is a random function in F|A| as opposed to being a fixed function as in Theorem 10.2 of van de Geer (2000). Let us denote z = (x, a) Z = X A and Z = (x, a, R, X ) Z = X A R X with (R, X ) P( , |x, a). Let Dn denote the set {(xi, ai, Ri, X i)}n i=1 of independent random variables. We use zi to refer to (xi, ai) and Z i to refer to (xi, ai, Ri, X i). Let Pn be the probability measure that puts mass 1/n on z1, . . . , zn, i.e., Pn = 1 n Pn i=1 δzi, in which δz is the Dirac s delta function that puts a mass of 1 at z. Denote G : Z R and G : Z R3|A|, which is defined as the set G = n (Q, T πQ, 1) : Q F|A| o Regularized Policy Iteration with Nonparametric Function Spaces with 1 : X A R|A| being a bounded constant function (and not necessarily equal to 1). We use g to denote the supremum norm of functions in G. The supremum norm of vector-valued functions in G is defined by taking the supremum norm over the l -norm of each vector. Similarly, the supremum norm of (g, g ) G G is defined by (g, g ) max{ g , g }. For g G, we define g Pn [ 1 n Pn i=1 g2(zi)]1/2. To simplify the notation, we use the following definition of the inner product: Fix n N. Consider z1, . . . , zn as a set of points in Z, and a real-valued sequence w = (w1, . . . , wn). For a function g G, define w , g n 1 n Pn i=1 wig(zi). 
For any g = (Q, T πQ, 1) G , define the mapping W(g )(x, a, r, x ) : X A R X R by W(g )(x, a, r, x ) = r 1+γQ(x , π(x )) T πQ(x, a). For any fixed g G and i = 1, . . . , n, define the random variables Wi(g ) = W(g )(Z i) and let W(g ) denote the random vector [W1(g ) . . . Wn(g )] . Notice that Wi(g ) can be re-written as Wi(g ) = (Ri r(zi)) + γ(Q(X , π(X )) (P πQ)(zi)), thus for any fixed g , E [Wi(g )] = 0 (i = 1, . . . , n). For notational simplification, we use a b = max{a, b}. Lemma 23 (Modified Lemma 3.2 of van de Geer 2000) Fix the sequence (zi)n i=1 Z and let (Z i)n i=1 Z be the sequence of independent random variables defined as above. Assume that for some constants 0 < R L, it holds that supg G g Pn R, supg G g L, and |Ri| L (1 i n) almost surely. There exists a constant C such that for all 0 ε < δ satisfying ε 28L [log N (u, G G )]1/2du R sup (g,g ) G G i=1 Wi(g )g(zi) 4 exp n(δ ε)2 The main difference between this lemma and Lemma 3.2 of van de Geer (2000) is that the latter provides a maximal inequality for supg G 1 n Pn i=1 Wig(zi), with Wi being random variables that satisfy a certain exponential probability inequality, while our result is a maximal inequality for sup(g,g ) G G 1 n Pn i=1 Wi(g )g(zi), i.e., the random variables Wi(g ) are functions of an arbitrary g G . The current proof requires us to have a condition on the metric entropy w.r.t. the supremum norm (cf. (57)) instead of w.r.t. the empirical L2norm used in Lemma 3.2 of van de Geer (2000). The possibility of relaxing this requirement is an interesting question. We now prove this result. 
Farahmand, Ghavamzadeh, Szepesv ari, and Mannor Proof First, note that for any g1, g2 G, and g 1, g 2 G (with the identification of g with its corresponding Q and T πQ), we have i=1 Wi(g 1)g1(zi) Wi(g 2)g2(zi) = i=1 (Ri r(zi))(g1(zi) g2(zi)) + i=1 γ (Q1(X i, π(X i)) P πQ1(zi)) (Q2(X i, π(X i)) P πQ2(zi)) g1(zi) + i=1 γ(Q2(X i, π(X i)) P πQ2(zi))(g1(zi) g2(zi)) 2L g1 g2 Pn + γR [ Q1 Q2 + P πQ1 P πQ2 ] + 3γL g1 g2 Pn = (2 + 3γ)L g1 g2 Pn + γR Q1 Q2 + R T πQ1 T πQ2 , (58) where we used the boundedness assumptions, the definition of the supremum norm, the norm inequality 1 n Pn i=1 |g1(zi) g2(zi)| g1 g2 Pn, and the fact that |γQ(X i, π(X i)) γP πQ(zi)| = |r(zi) + γQ(X i, π(X i)) T πQ(zi)| (2 + γ)L 3L for any L-bounded Q and T πQ to get the inequality. We used P πQs P πQs 1 = γ 1 T πQs T πQs 1 to get the last equality. Let {(gs j, g j s)}Ns j=1 with Ns = N (2 s R, G G ) be a minimal 2 s R-covering of G G w.r.t. the supremum norm. For any (g, g ) G G , there exists a (gs, (Qs, T πQs, 1)) = (gs, g s) {(gs j, g j s)}Ns j=1 such that (g, g ) (gs, g s) 2 s R. This implies that both Qs Q and T πQs T πQ are smaller than 2 s R as well. Moreover, gs g Pn gs g 2 s R. By (58) we get i=1 Wi(g s)gs(zi) Wi(g )g(zi) [(2 + 3γ)L + (1 + γ)R](2 s R) (3 + 4γ)L(2 s R) Choose S = min{s 1 : 2 s ε 7RL}, which entails that for any (g, g ) G G , the covering set defined by {(g S j , g j S)}NS j=1 approximates the inner product of [g(z1) g(zn)] and W(g ) with an error less than ε. 
So it suffices to prove the exponential inequality for max j=1,...,NS i=1 Wi(g j S)g S j (zi) Regularized Policy Iteration with Nonparametric Function Spaces We use the chaining technique (e.g., see van de Geer 2000) as follows (we choose g0 = 0, so Wi(g 0)g0(zi) = 0 for all 1 i n): i=1 Wi(g S)g S(zi) = 1 Wi(g s)gs(zi) Wi(g s 1)gs 1(zi) = i=1 (Ri r(zi))(gs(zi) gs 1(zi)) + i=1 γ (Qs(X i, π(X i)) (P πQs)(zi)) (Qs 1(X i, π(X i)) (P πQs 1)(zi)) gs(zi) + i=1 γ(Qs 1(X i, π(X i)) (P πQs 1)(zi))(gs(zi) gs 1(zi)) Because each of these summations consists of bounded random variables with expectation zero, we may use Hoeffding s inequality alongside the union bound to upper bound them. To apply Hoeffding s inequality, we require an upper bound on the sum of squared values of random variables involved. To begin, we have |gs(zi) gs 1(zi)| = |gs(zi) g(zi)+g(zi) gs 1(zi)| 2 s R +2 (s 1)R = 3 2 s R. Similarly, both Qs Qs 1 and T πQs T πQs 1 are smaller than 3 2 s R. As a result, for the first term we get (Ri r(zi))(gs(zi) gs 1(zi)) 2 36(RL)22 2s. For the second term we have γ (Qs(X i, π(X i)) (P πQs)(zi)) (Qs 1(X i, π(X i)) (P πQs 1)(zi)) gs(zi) 2 2γ2 h Qs Qs 1 2 + γ 2 T πQs T πQs 1 2 2(1 + γ2)32(2 s R)2R2 36R42 2s, in which we used P πQs P πQs 1 = γ 1 T πQs T πQs 1 . And finally, γ(Qs 1(X i, π(X i)) (P πQs 1)(zi))(gs(zi) gs 1(zi)) 2 (3L)232(2 s R)2 = 92(RL)22 2s, where we used the fact that |γQ(X i, π(X i)) γP πQ(zi)| 3L for any L-bounded Q and T πQ. Let ηs be a sequence of positive real-valued numbers satisfying PS s=1 ηs 1. 
We continue the chaining argument by the use of the union bound and the fact that Ns Ns 1 N2 s to Farahmand, Ghavamzadeh, Szepesv ari, and Mannor sup (g,g ) G G i=1 Wi(g )g(zi) max j=1,...,NS i=1 Wi(g j S)g S j (zi) max (gs,g s) (gs 1,g s 1) i=1 (Ri r(zi))(gs(zi) gs 1(zi)) max (gs,g s) (gs 1,g s 1) i=1 γ Qs(X i, π(X i)) (P πQs)(zi) Qs 1(X i, π(X i)) (P πQs 1)(zi) gs(zi) max (gs,g s) (gs 1,g s 1) i=1 γ(Qs 1(X i, π(X i)) (P πQs 1)(zi))(gs(zi) gs 1(zi)) s=1 Ns Ns 1 exp 2(δ ε)2η2 sn 4 92(RL)22 2s + Ns Ns 1 exp 2(δ ε)2η2 sn 4 92R42 2s Ns Ns 1 exp 2(δ ε)2η2 sn 3 92(RL)22 2s s=1 3 exp 2 log Ns 2(δ ε)2η2 sn 4 92(RL)22 2s Here the max(gs,g s) is over the corresponding covering set {(gs j, g j s)}Ns j=1, which has Ns elements (and the same for s 1). Choose 6RL2 s(log Ns)1/2 n(δ ε) 2 s s Take C in (57) sufficiently large such that n(δ ε) 2 32 s=1 2 s[log N (2 s R, G G )]1/2 72 p 6 log 4 RL. (61) It can be shown that by this choice of ηs and the condition (61), we have PS s=1 ηs 1. From (60), we have log Ns n(δ ε)2η2 s 2 35(RL)22 2s , so P1 in (59) can be upper bounded as follows s=1 3 exp n(δ ε)2η2 s 2 35(RL)22 2s Regularized Policy Iteration with Nonparametric Function Spaces Since ηs 2 s s/8 too, we have s=1 exp n(δ ε)22 2ss 27 35(RL)22 2s s=1 exp n(δ ε)2s 3 exp n(δ ε)2 1 exp n(δ ε)2 27 35(RL)2 4 exp n(δ ε)2 where in the last inequality we used the assumption that n(δ ε) 72 6 log 4 RL (cf. (61)). One can show that (61) is satisfied if ε 28L [log N (u, G G )]1/2du 72 p 6 log 4 RL, so C can be chosen as C = 72 6 log 4. The following lemma, which is built on Lemma 23, is a result on the behavior of the modulus of continuity and will be used in the proof of Theorem 16. This lemma provides a high-probability upper bound on sup(g,g ) G G | W(g ) , g n| g 1 α Pn Jα(g,g ). Here J(g, g ) is a regularizer that is defined on G G and is a pseudo-norm. 
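The chaining argument above controls each level by applying Hoeffding's inequality to averages of bounded zero-mean random variables. As a standalone numerical sanity check of that tail bound, the sketch below (the sample size, threshold, and uniform distribution are our own illustrative choices) verifies that the empirical deviation frequency stays below the Hoeffding bound 2·exp(−2nε²/(b − a)²):

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, eps = 200, 2000, 0.1
b_minus_a = 1.0  # the variables below are bounded in [0, 1]
hoeffding = 2 * np.exp(-2 * n * eps ** 2 / b_minus_a ** 2)

deviations = 0
for _ in range(trials):
    x = rng.uniform(0.0, 1.0, size=n)   # bounded i.i.d. draws with mean 0.5
    if abs(x.mean() - 0.5) > eps:
        deviations += 1

# The observed frequency of large deviations should not exceed the bound.
print(deviations / trials <= hoeffding)
```

With these parameters the bound equals 2e^{−4} ≈ 0.037, while the observed frequency is far smaller, consistent with Hoeffding's inequality being loose but valid.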
This result is similar in spirit to Lemma 8.4 of van de Geer (2000), with two main differences. The first is that here we provide an upper bound on
$$\sup_{(g,g')\in G\times G'} \frac{|\langle W(g'),\, g\rangle_n|}{\|g\|_{P_n}^{1-\alpha}\, J^\alpha(g,g')},$$
whereas in Lemma 8.4 of van de Geer (2000) the upper bound is on $\frac{|\langle W, g\rangle_n|}{\|g\|_{P_n}^{1-\alpha}}$. The normalization by $\|g\|_{P_n}^{1-\alpha} J^\alpha(g,g')$ instead of $\|g\|_{P_n}^{1-\alpha}$ is important to get the right error bound in Theorem 16. The other crucial difference is that here the $W$ are random variables that are functions of $g' \in G'$, while the result of van de Geer (2000) is for $W$ independent of $g'$. The proof technique is inspired by Lemmas 5.13, 5.14, and 8.4 of van de Geer (2000).

Lemma 24 (Modulus of Continuity for Weighted Sums) Fix the sequence $(z_i)_{i=1}^n \subset Z$ and define $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{z_i}$. Let $(Z'_i)_{i=1}^n \subset Z$ be the sequence of independent random variables defined as before. Assume that for some constant $L > 0$, it holds that $\sup_{g\in G}\|g\|_{P_n} \le L$, $\sup_{g'\in G'}\|g'\|_\infty \le L$, and $|R_i| \le L$ ($1\le i\le n$) almost surely. Furthermore, suppose that there exist $0 < \alpha < 1$ and a finite constant $A$ such that for all $u, B > 0$,
$$\log N\Big(u,\, \big\{(g,g')\in G\times G' : J(g,g') \le B\big\}\Big) \le A\left(\frac{B}{u}\right)^{2\alpha}.$$
Then there exists a constant $c > 0$ such that for any $0 < \delta < 1$, we have
$$\sup_{(g,g')\in G\times G'} \frac{|\langle W(g'),\, g\rangle_n|}{\|g\|_{P_n}^{1-\alpha}\, J^\alpha(g,g')} \le c\, L^{1+\alpha}\sqrt{\frac{1+\log(1/\delta)}{n}}$$
with probability at least $1-\delta$.

Proof The proof uses double peeling, i.e., we peel on both $J(g,g')$ and $\|g\|_{P_n}$. Without loss of generality, we assume that $L \ge 1$. We use $c_1, c_2, \ldots > 0$ as constants. First, we start by peeling on $J(g,g')$:
$$\mathbb{P}\left( \sup_{(g,g')\in G\times G'} \frac{|\langle W(g'),\, g\rangle_n|}{\|g\|_{P_n}^{1-\alpha} J^\alpha(g,g')} \ge t \right) \le \sum_{s=0}^{\infty} \mathbb{P}\Bigg( \sup_{\substack{(g,g')\in G\times G' \\ 2^s \mathbb{I}\{s\neq 0\}\, \le\, J(g,g')\, <\, 2^{s+1}}} \frac{|\langle W(g'),\, g\rangle_n|}{\|g\|_{P_n}^{1-\alpha}} \ge \underbrace{t\cdot 2^{\alpha s}}_{\triangleq\, \tau_s} \Bigg).$$
Let us denote each term in the RHS by $\delta_s$. To upper bound $\delta_s$, notice that by assumption $\|g\|_{P_n} \le L$.
For each term, we peel again, this time on $\|g\|_{P_n}$, and apply Lemma 23:
$$\delta_s \le \sum_{r\ge 0} \mathbb{P}\Bigg( \sup_{\substack{(g,g')\in G\times G' \\ J(g,g') < 2^{s+1},\; 2^{-(r+1)}L < \|g\|_{P_n} \le 2^{-r}L}} \frac{|\langle W(g'),\, g\rangle_n|}{\|g\|_{P_n}^{1-\alpha}} \ge \tau_s \Bigg) \le \sum_{r\ge 0} \mathbb{P}\Bigg( \sup_{\substack{(g,g')\in G\times G' \\ J(g,g') < 2^{s+1},\; \|g\|_{P_n} \le 2^{-r}L}} |\langle W(g'),\, g\rangle_n| \ge \tau_s \big(2^{-(r+1)}L\big)^{1-\alpha} \Bigg)$$
$$\le \sum_{r\ge 0} 4\exp\!\left( -\frac{n\big[\tau_s\big(2^{-(r+1)}L\big)^{1-\alpha}\big]^2}{2^7 3^5\, (2^{-r}L)^2} \right) = \sum_{r\ge 0} 4\exp\!\left( -\frac{2^{2r\alpha}\, n\tau_s^2}{2^7 3^5\, 2^{2(1-\alpha)} L^{2\alpha}} \right) \le c_2 \exp\!\left( -\frac{c_1 n\tau_s^2}{L^{2(1+\alpha)}} \right), \tag{62}$$
where the last step uses $L \ge 1$. The second-to-last inequality holds only if the covering number condition in Lemma 23 is satisfied, which is the case whenever $\sqrt{n}\,\tau_s\big(2^{-(r+1)}L\big)^{1-\alpha}$ exceeds the corresponding entropy integral. Substituting $\tau_s = 2^{\alpha s} t$ (63) and solving the integral, we get that the condition is
$$\sqrt{n}\, t\, 2^{\alpha s}\big(2^{-(r+1)}L\big)^{1-\alpha} \ge C\left[ \sqrt{A}\,\big(2^{s+1}\big)^{\alpha}\big(2^{-r}L\big)^{1-\alpha} \vee 2^{-r}L \right],$$
which would be satisfied for
$$t \ge \frac{\sqrt{A}\, 2^{1+\alpha}}{\sqrt{n}} \vee \frac{2^{1-\alpha} C L^{1+\alpha}}{\sqrt{n}} = \frac{c_3\, L^{1+\alpha}}{\sqrt{n}}. \tag{64}$$
Plug (63) into (62) to get
$$\sum_{s=0}^{\infty} c_2 \exp\!\left( -\frac{c_1 n\, t^2\, 2^{2\alpha s}}{L^{2(1+\alpha)}} \right) \le c_4 \exp\!\left( -\frac{c_1 n\, t^2}{L^{2(1+\alpha)}} \right).$$
Solving for $\delta$, we have
$$t \le c_5\, L^{1+\alpha}\sqrt{\frac{\log(1/\delta)}{n}}$$
with probability at least $1-\delta$. This alongside the condition (64) leads to the desired result.

Let us turn to the proof of Theorem 16. The proof is similar to the proof of Theorem 10.2 by van de Geer (2000), but with the modifications necessary to obtain a high-probability upper bound that holds uniformly over $Q$. We discuss the differences in more detail after the proof.

Proof [Proof of Theorem 16] Recall that in the optimization problem, we use $w_i = (X_i, A_i, R_i, X'_i)$ ($i = 1,\dots,n$) to denote the $i$th element of the data set $D_n = \{(X_i,A_i,R_i,X'_i)\}_{i=1}^n$. Also, for a measurable function $f: X\times A\times \mathbb{R}\times X \to \mathbb{R}$, we denote $\|f\|_n^2 = \frac{1}{n}\sum_{i=1}^n |f(w_i)|^2$. We also let $(X,A)\sim\nu$, $R\sim \mathcal{R}(\cdot|X,A)$, and $X'\sim P(\cdot|X,A)$ be random variables that are independent of $D_n$. For any $Q \in F^{|A|}$ and the corresponding $T^\pi Q \in F^{|A|}$, define the mapping $W(Q, T^\pi Q, \mathbf{1}): X\times A\times\mathbb{R}\times X \to \mathbb{R}$ by
$$W(Q, T^\pi Q, \mathbf{1})(X,A,R,X') = R\cdot \mathbf{1} + \gamma\, Q(X',\pi(X')) - T^\pi Q(X,A),$$
in which $\mathbf{1} \in F^{|A|}$ is the constant function defined on $X\times A$ with the value of one. For any fixed $Q$ and $i=1,\dots,n$, define the random variables $W_i(Q) = W(Q,T^\pi Q,\mathbf{1})(X_i,A_i,R_i,X'_i)$, and let $W(Q)$ denote the random vector $[W_1(Q)\ \cdots\ W_n(Q)]^\top$.
Notice that $|W_i(Q)| \le 3Q_{\max}$, and we have $\mathbb{E}\left[W_i(Q)\,|\,Q\right] = 0$ ($i=1,\dots,n$). From the optimizer property of $\hat{h}_n = \hat{h}_n(\cdot\,; Q)$, we have
$$\Big\|\hat{h}_n(Q) - \big[R + \gamma Q(X',\pi(X'))\big]\Big\|_n^2 + \lambda_{h,n} J^2\big(\hat{h}_n(Q)\big) \le \Big\|T^\pi Q - \big[R + \gamma Q(X',\pi(X'))\big]\Big\|_n^2 + \lambda_{h,n} J^2(T^\pi Q).$$
After expanding and rearranging, we get
$$\big\|\hat{h}_n(Q) - T^\pi Q\big\|_n^2 + \lambda_{h,n} J^2\big(\hat{h}_n(Q)\big) \le 2\Big\langle W(Q),\, \hat{h}_n(Q) - T^\pi Q\Big\rangle_n + \lambda_{h,n} J^2(T^\pi Q). \tag{65}$$
We invoke Lemma 24 to upper bound $\big\langle W(Q), \hat{h}_n(Q) - T^\pi Q\big\rangle_n$. The function spaces $G$ and $G'$ in that lemma consist of functions $G: X\times A \to \mathbb{R}$ and $G': X\times A\times\mathbb{R}\times X \to \mathbb{R}^3$ with
$$G = \big\{ h - T^\pi Q : h, Q \in F^{|A|} \big\}, \qquad G' = \big\{ (Q,\, T^\pi Q,\, \mathbf{1}) : Q, T^\pi Q \in F^{|A|} \big\}.$$
All functions in $F^{|A|}$ are $Q_{\max}$-bounded, so the functions in $G$ and $G'$ are bounded by $2Q_{\max}$ and $(Q_{\max}, Q_{\max}, 1)$, respectively. Moreover, for any $g \in G$, $\frac{1}{n}\sum_{i=1}^n |g(X_i,A_i)|^2 \le 4Q_{\max}^2$. So by setting $L$ equal to $2Q_{\max}$ in that lemma, all boundedness conditions are satisfied.

Define $J(g,g') = J(h) + J(Q) + J(T^\pi Q)$ and denote $(G\times G')_B = \big\{(g,g') \in G\times G' : J(g,g') \le B\big\}$. Lemma 24 requires an upper bound on $\log N\big(u, (G\times G')_B\big)$. We relate the metric entropy of this space to that of $F_B = \{f\in F : J(f)\le B\}$, which is specified by Assumption A4. Notice that if $J(g,g')\le B$, each of $J(h)$, $J(Q)$, and $J(T^\pi Q)$ is also less than or equal to $B$. So we have
$$(G\times G')_B = \Big\{ \big(h - T^\pi Q,\; Q,\; T^\pi Q,\; \mathbf{1}\big) : h, Q, T^\pi Q \in F^{|A|},\; J(h)+J(Q)+J(T^\pi Q) \le B \Big\}$$
$$\subseteq \Big\{ h - T^\pi Q : h, T^\pi Q \in F^{|A|},\, J(h)+J(T^\pi Q)\le B \Big\} \times \Big\{ Q \in F^{|A|} : J(Q)\le B \Big\} \times \Big\{ T^\pi Q \in F^{|A|}: J(T^\pi Q)\le B \Big\} \times \{\mathbf{1}\}.$$
Because $J(\cdot)$ is a pseudo-norm, we have $J(h - T^\pi Q) \le J(h) + J(T^\pi Q)$, so
$$\Big\{ h - T^\pi Q : h, T^\pi Q \in F^{|A|},\, J(h)+J(T^\pi Q)\le B \Big\} \subseteq \Big\{ Q \in F^{|A|} : J(Q)\le B \Big\}.$$
As a result, $(G\times G')_B$ is a subset of the product space $\big\{Q\in F^{|A|}: J(Q)\le B\big\}^3$. Therefore, by the usual covering argument, we get that
$$\log N\big(u, (G\times G')_B\big) \le 3\log N\Big(u,\, \big\{Q\in F^{|A|}: J(Q)\le B\big\}\Big).$$
It is easy to see that for finite $|A|$, if $\log N\big(u, \{f\in F: J(f)\le B\}\big) \le C\big(\tfrac{B}{u}\big)^{2\alpha}$, then $\log N\big(u, \{(f_1,\dots,f_{|A|}) \in F^{|A|} : J(f_1,\dots,f_{|A|}) \le B\}\big) \le C_1 \big(\tfrac{B}{u}\big)^{2\alpha}$ (we benefit from the condition $J(Q(\cdot,a)) \le J(Q)$ in Assumption A3; the proof is similar to the proof of Lemma 20 in Appendix E). Here the constant $C_1$ depends on $|A|$. This, along with the previous inequality, shows that for some constant $A > 0$, we have
$$\log N\big(u, (G\times G')_B\big) \le A\left(\frac{B}{u}\right)^{2\alpha}.$$
We are ready to apply Lemma 24 to upper bound the inner product term in (65). Fix $\delta > 0$. To simplify the notation, denote $L_n = \|\hat{h}_n(Q) - T^\pi Q\|_n$, set $t_0 = \sqrt{\frac{1+\log(1/\delta)}{n}}$, and use $\hat{h}_n$ to refer to $\hat{h}_n(Q)$. There exists a constant $c > 0$ such that, with probability at least $1-\delta$, we have
$$L_n^2 + \lambda_{h,n} J^2(\hat{h}_n) \le 2c\, L^{1+\alpha} L_n^{1-\alpha}\big( J(\hat{h}_n) + J(Q) + J(T^\pi Q) \big)^{\alpha} t_0 + \lambda_{h,n} J^2(T^\pi Q). \tag{66}$$
Either the first term in the RHS is larger than the second one, or the second term is larger than the first. We analyze each case separately.

Case 1. $2c L^{1+\alpha} L_n^{1-\alpha}\big(J(\hat{h}_n)+J(Q)+J(T^\pi Q)\big)^\alpha t_0 \ge \lambda_{h,n}J^2(T^\pi Q)$. In this case we have
$$L_n^2 + \lambda_{h,n} J^2(\hat{h}_n) \le 4c\, L^{1+\alpha} L_n^{1-\alpha}\big( J(\hat{h}_n)+J(Q)+J(T^\pi Q) \big)^{\alpha} t_0. \tag{67}$$
Again, two cases might happen:

Case 1.a. $J(\hat{h}_n) > J(Q) + J(T^\pi Q)$: From (67) we have $L_n^2 \le 2^{2+\alpha} c\, L^{1+\alpha} L_n^{1-\alpha} J^\alpha(\hat{h}_n)\, t_0$. Solving for $L_n$, we get that
$$L_n \le 2^{\frac{2+\alpha}{1+\alpha}}\, c^{\frac{1}{1+\alpha}}\, L\, \big[J(\hat{h}_n)\big]^{\frac{\alpha}{1+\alpha}}\, t_0^{\frac{1}{1+\alpha}}.$$
From (67) we also have $\lambda_{h,n}J^2(\hat{h}_n) \le 2^{2+\alpha} c\, L^{1+\alpha}L_n^{1-\alpha}J^\alpha(\hat{h}_n)\, t_0$. Plugging in the recently obtained upper bound on $L_n$ and solving for $J(\hat{h}_n)$, we get that
$$J(\hat{h}_n) \le \frac{2^{2+\alpha} c\, L^{1+\alpha}\, t_0}{\lambda_{h,n}^{\frac{1+\alpha}{2}}}. \tag{68}$$
Substituting this in the upper bound on $L_n$, we get that
$$L_n \le \frac{2^{2+\alpha} c\, L^{1+\alpha}\, t_0}{\lambda_{h,n}^{\alpha/2}}. \tag{69}$$
Case 1.b. $J(\hat{h}_n) \le J(Q) + J(T^\pi Q)$: The upper bound on $J(\hat{h}_n)$ is obvious. From (67) we have $L_n^2 \le 2^{2+\alpha} c\, L^{1+\alpha}L_n^{1-\alpha}\big(J(Q)+J(T^\pi Q)\big)^\alpha t_0$. Solving for $L_n$, we obtain
$$L_n \le 2^{\frac{2+\alpha}{1+\alpha}}\, c^{\frac{1}{1+\alpha}}\, L\, \big(J(Q)+J(T^\pi Q)\big)^{\frac{\alpha}{1+\alpha}}\, t_0^{\frac{1}{1+\alpha}}. \tag{70}$$
Case 2. $2c L^{1+\alpha} L_n^{1-\alpha}\big(J(\hat{h}_n)+J(Q)+J(T^\pi Q)\big)^\alpha t_0 < \lambda_{h,n}J^2(T^\pi Q)$. In this case we have $L_n^2 + \lambda_{h,n}J^2(\hat{h}_n) \le 2\lambda_{h,n}J^2(T^\pi Q)$, which implies that
$$L_n \le \sqrt{2\lambda_{h,n}}\, J(T^\pi Q), \tag{71}$$
$$J(\hat{h}_n) \le \sqrt{2}\, J(T^\pi Q). \tag{72}$$
By (69), (70), and (71) for $L_n$, and by (68), (72), and the condition $J(\hat{h}_n)\le J(Q)+J(T^\pi Q)$ of Case 1.b for $J(\hat{h}_n)$, we have that for any fixed $0<\delta<1$, with probability at least $1-\delta$, the following inequalities hold:
$$\big\|\hat{h}_n(Q) - T^\pi Q\big\|_n \le \max\left\{ \frac{2^{2+\alpha} c\, L^{1+\alpha}\, t_0}{\lambda_{h,n}^{\alpha/2}},\;\; 2^{\frac{2+\alpha}{1+\alpha}} c^{\frac{1}{1+\alpha}} L \big(J(Q)+J(T^\pi Q)\big)^{\frac{\alpha}{1+\alpha}} t_0^{\frac{1}{1+\alpha}},\;\; \sqrt{2\lambda_{h,n}}\, J(T^\pi Q) \right\},$$
$$J\big(\hat{h}_n(Q)\big) \le \max\left\{ \frac{2^{2+\alpha} c\, L^{1+\alpha}\, t_0}{\lambda_{h,n}^{\frac{1+\alpha}{2}}},\;\; J(Q)+J(T^\pi Q),\;\; \sqrt{2}\, J(T^\pi Q) \right\}.$$
Comparing this proof with that of Theorem 10.2 by van de Geer (2000), we see that here we do not normalize the function space $G\times G'$ to ensure that $J(g,g')\le 1$ and then use their Lemma 8.4, which provides a high-probability upper bound on $\sup_{g\in G}\frac{|\langle W, g\rangle_n|}{\|g\|_{P_n}^{1-\alpha}}$. Instead, we directly apply Lemma 24, which upper bounds $\sup_{(g,g')\in G\times G'}\frac{|\langle W(g'),\, g\rangle_n|}{\|g\|_{P_n}^{1-\alpha}J^\alpha(g,g')}$, to the (unnormalized) function space $G\times G'$. If we went through the former approach, in which the normalization is global, the first term in the RHS of (66) would be $L_n^{1-\alpha}\big(J(\hat{h}_n)+J(Q)+J(T^\pi Q)\big)^{1+\alpha}t_0$ instead of the $L_n^{1-\alpha}\big(J(\hat{h}_n)+J(Q)+J(T^\pi Q)\big)^{\alpha}t_0$ of here, which is obtained by local normalization. This extra factor of $J(\hat{h}_n)+J(Q)+J(T^\pi Q)$ would prevent us from getting proper upper bounds on $L_n$ and $J(\hat{h}_n)$ in Case 1.a above. The reason the original proof does not work here is that $W(g')$ is a function of $g'\in G'$.

Appendix E. Proof of Lemma 20 (Covering Number of $G_{\sigma_1,\sigma_2}$)

Here we prove Lemma 20, which relates the covering number of $G_{\sigma_1,\sigma_2}$ to the covering numbers of $F_{\sigma_1}$ and $F_{\sigma_2}$.

Proof [Proof of Lemma 20] Let $g^{Q_1,h_1}, g^{Q_2,h_2} \in G_{\sigma_1,\sigma_2}$. By the definition of $G_{\sigma_1,\sigma_2}$ (41), the functions $Q_1$ and $h_1$ corresponding to $g^{Q_1,h_1}$ satisfy $Q_1 \in F^{|A|}_{\sigma_1}$ and $h_1 \in F^{|A|}_{\sigma_2}$ (and similarly for $Q_2$ and $h_2$). Set $z_i = (x_i,a_i)$. We have
$$\frac{1}{n}\sum_{i=1}^n \big|g^{Q_1,h_1}(z_i) - g^{Q_2,h_2}(z_i)\big|^2 = \frac{1}{n}\sum_{i=1}^n \Big[ \big(Q_1(z_i)-h_1(z_i)\big)^2 - \big(Q_2(z_i)-h_2(z_i)\big)^2 \Big]^2$$
$$\le 16 Q_{\max}^2\, \frac{1}{n}\sum_{i=1}^n \big[ (Q_1(z_i)-Q_2(z_i)) + (h_1(z_i) - h_2(z_i)) \big]^2 \le 32 Q_{\max}^2\, \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^{|A|}\Big[ \big(Q_{1,j}(x_i)-Q_{2,j}(x_i)\big)^2 + \big(h_{1,j}(x_i)-h_{2,j}(x_i)\big)^2 \Big].$$
Assumption A3 implies that for $Q_1, Q_2 \in F^{|A|}_{\sigma_1}$, the functions $Q_{1,j}, Q_{2,j}$ are in $F_{\sigma_1}$, and for $h_1, h_2 \in F^{|A|}_{\sigma_2}$, the functions $h_{1,j}, h_{2,j}$ are in $F_{\sigma_2}$, for all $j = 1,\dots,|A|$.
Therefore, the previous inequality shows that $u$-covers of $F_{\sigma_1}$ (for the $Q_j$) and of $F_{\sigma_2}$ (for the $h_j$), $j = 1,\dots,|A|$, w.r.t. the empirical norms on $x_{1:n}$ define an $8Q_{\max}\sqrt{|A|}\,u$-cover of $G_{\sigma_1,\sigma_2}$ w.r.t. the empirical norm on $z_{1:n}$. Thus,
$$N_2\Big(8Q_{\max}\sqrt{|A|}\,u,\; G_{\sigma_1,\sigma_2},\; (x,a)_{1:n}\Big) \le N_2\big(u, F_{\sigma_1}, x_{1:n}\big)^{|A|}\, N_2\big(u, F_{\sigma_2}, x_{1:n}\big)^{|A|}.$$
Assumption A4 then implies that for a constant $c_1$, independent of $u$, $|A|$, $Q_{\max}$, and $\alpha$, and for all $\big((x_1,a_1),\dots,(x_n,a_n)\big) \in X\times A$, we have
$$\log N_2\big(u, G_{\sigma_1,\sigma_2}, (x,a)_{1:n}\big) \le c_1\, |A|^{1+\alpha}\, Q_{\max}^{2\alpha}\, \big(\sigma_1^\alpha + \sigma_2^\alpha\big)\, u^{-2\alpha}.$$

Appendix F. Convolutional MDPs and Assumption A7

In this appendix, we show that Assumption A7 holds for a certain class of MDPs. This class consists of one-dimensional MDPs in which the increment of the next state $X'$ relative to the current state $X$ is a function of the chosen action only, i.e., $X' - X \sim W(\pi(X))$.

Proposition 25 Suppose that $X = [-\pi, \pi]$ is the unit circle, $F$ is the Sobolev space $W^k(X) = W^{k,2}(X)$, and $J(\cdot)$ is defined as the corresponding norm $\|\cdot\|_{W^{k,2}}$. For a function $f\in F$, let $\hat f(n)$ be the $n$th Fourier coefficient, i.e., $\hat f(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\, e^{-jnx}\,dx$. Consider the MDPs that have a convolutional transition probability kernel, that is, for any policy $\pi$ and $V\in F$, there exists $K_\pi(x,y) = K_\pi(x-y)$ such that
$$\int_X P(dy|x,\pi(x))\, V(y) = \int_X K_\pi(x-y)\, V(y)\,dy = K_\pi * V.$$
Moreover, assume that $K_\pi, V \in L^1(X)$. For a given policy $\pi$, let $r_\pi(x) = r(x,\pi(x))$ ($x\in X$). Assumption A7 is then satisfied with the choice of $L_R = \sup_\pi \|r_\pi\|_{W^{k,2}}$ and $L_P = \sup_\pi \max_n |\hat K_\pi(n)|$.

Proof By the convolution theorem, $\widehat{(K_\pi * V)}(n) = \hat K_\pi(n)\, \hat V(n)$. It is also known that for $V\in F$, we have $\|V\|_{W^{k,2}}^2 = \sum_{n=-\infty}^{\infty} \big(1+|n|^2\big)^k |\hat V(n)|^2$. Thus,
$$\|K_\pi * V\|_{W^{k,2}}^2 = \sum_{n=-\infty}^{\infty} \big(1+|n|^2\big)^k |\hat K_\pi(n)|^2 |\hat V(n)|^2 \le \Big[\max_n |\hat K_\pi(n)|^2\Big] \sum_{n=-\infty}^{\infty} \big(1+|n|^2\big)^k |\hat V(n)|^2 = \Big[\max_n |\hat K_\pi(n)|^2\Big]\, \|V\|_{W^{k,2}}^2.$$
Therefore, $\|T^\pi V\|_{W^{k,2}} \le \|r_\pi\|_{W^{k,2}} + \gamma \big[\max_n |\hat K_\pi(n)|\big] \|V\|_{W^{k,2}}$. Taking the supremum over all policies finishes the proof.

The interpretation of $\max_n |\hat K_\pi(n)|$ is the maximum gain of the linear filter $K_\pi$ that is applied to a value function $V$.
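The convolution-theorem bound in Proposition 25 can be checked numerically on a discretized circle. This is a minimal sketch; the kernel, value function, grid size, and smoothness order below are illustrative choices, not from the paper:

```python
import numpy as np

def sobolev_norm(fhat, k):
    """Discrete analogue of ||f||_{W^{k,2}} = sqrt(sum_n (1+|n|^2)^k |fhat(n)|^2)."""
    n = np.fft.fftfreq(len(fhat)) * len(fhat)  # integer frequencies
    return np.sqrt(np.sum((1 + np.abs(n) ** 2) ** k * np.abs(fhat) ** 2))

N, k = 256, 2
x = np.linspace(-np.pi, np.pi, N, endpoint=False)
V = np.cos(3 * x) + 0.5 * np.sin(x)      # a smooth "value function" (illustrative)
K = np.exp(-x ** 2)                      # a smooth convolution kernel (illustrative)
Vhat = np.fft.fft(V) / N                 # approximate Fourier coefficients of V
Khat = np.fft.fft(K) * (2 * np.pi / N)   # transfer function of the filter K

# Convolution theorem: (K * V)^hat(n) = Khat(n) * Vhat(n), so the Sobolev norm
# of K * V is bounded by the maximum filter gain times ||V||_{W^{k,2}}.
lhs = sobolev_norm(Khat * Vhat, k)
rhs = np.max(np.abs(Khat)) * sobolev_norm(Vhat, k)
assert lhs <= rhs + 1e-12
```

The inequality holds term by term in the frequency domain, which is exactly the argument of the proof: each Fourier coefficient of $V$ is scaled by at most the maximum gain of the filter.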
The gain here is explicitly written in the frequency domain.

Appendix G. The Metric Entropy and the Covering Number

Definition 26 (Definition 9.3 of Györfi et al. 2002) Let $\varepsilon > 0$, let $F$ be a set of real-valued functions defined on $X$, and let $\nu_X$ be a probability measure on $X$. Every finite collection $N_\varepsilon = \{f_1,\dots,f_{N_\varepsilon}\}$ of functions defined on $X$ with the property that for every $f\in F$ there is a function $f'\in N_\varepsilon$ such that $\|f - f'\|_{p,\nu_X} < \varepsilon$ is called an $\varepsilon$-cover of $F$ w.r.t. $\|\cdot\|_{p,\nu_X}$. Let $N(\varepsilon, F, \|\cdot\|_{p,\nu_X})$ be the size of the smallest $\varepsilon$-cover of $F$ w.r.t. $\|\cdot\|_{p,\nu_X}$. If no finite $\varepsilon$-cover exists, take $N(\varepsilon, F, \|\cdot\|_{p,\nu_X}) = \infty$. Then $N(\varepsilon, F, \|\cdot\|_{p,\nu_X})$ is called the $\varepsilon$-covering number of $F$, and $\log N(\varepsilon, F, \|\cdot\|_{p,\nu_X})$ is called the metric entropy of $F$ w.r.t. the same norm. The $\varepsilon$-covering number of $F$ w.r.t. the supremum norm is denoted by $N_\infty(\varepsilon, F)$. For $x_{1:n} = (x_1,\dots,x_n) \in X^n$, one may also define the empirical measure $\nu_{X,n}(A) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{x_i \in A\}$ for $A \subseteq X$. This leads to the empirical covering number of $F$ w.r.t. the empirical norm $\|\cdot\|_{p,x_{1:n}}$, denoted by $N_p(\varepsilon, F, x_{1:n})$. If $X_{1:n} = (X_1,\dots,X_n)$ is a sequence of random variables, the covering number $N_p(\varepsilon, F, X_{1:n})$ is a random variable too.

References

András Antos, Rémi Munos, and Csaba Szepesvári. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems (NIPS-20), pages 9–16. MIT Press, 2008a.

András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89–129, 2008b.

Bernardo Ávila Pires and Csaba Szepesvári. Statistical linear estimation with penalized estimators: an application to reinforcement learning. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Philip Bachman, Amir-massoud Farahmand, and Doina Precup. Sample-based approximate regularization.
In Proceedings of the 31st International Conference on Machine Learning (ICML), volume 32 of JMLR: W&CP, 2014.

Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning (ICML), pages 30–37. Morgan Kaufmann, 1995.

André M. S. Barreto, Doina Precup, and Joelle Pineau. Reinforcement learning using kernel-based stochastic factorization. In Advances in Neural Information Processing Systems (NIPS-24), pages 720–728, 2011.

André M. S. Barreto, Doina Precup, and Joelle Pineau. On-line reinforcement learning using incremental kernel-based stochastic factorization. In Advances in Neural Information Processing Systems (NIPS-25), pages 1484–1492, 2012.

Rick Beatson and Leslie Greengard. A short course on fast multipole methods. In Wavelets, Multilevel Methods and Elliptic PDEs, pages 1–37. Oxford University Press, 1997.

Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research (JMLR), 7:2399–2434, 2006.

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems (NIPS-19), pages 137–144. MIT Press, 2006.

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.

Dimitri P. Bertsekas and Steven E. Shreve. Stochastic Optimal Control: The Discrete-Time Case. Academic Press, 1978.

Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, and Klaus Obermayer.
Construction of approximation spaces for reinforcement learning. Journal of Machine Learning Research (JMLR), 14:2067–2118, 2013.

Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS-20), pages 161–168. MIT Press, 2008.

Steven J. Bradtke and Andrew G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.

Corinna Cortes, Mehryar Mohri, and Andrés Muñoz Medina. Adaptation algorithm and theory based on generalized discrepancy. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 169–178, 2015.

Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research (JMLR), 15:809–883, 2014.

Ronald A. DeVore. Nonlinear approximation. Acta Numerica, 7:51–150, 1998.

Paul Doukhan. Mixing: Properties and Examples, volume 85 of Lecture Notes in Statistics. Springer-Verlag, Berlin, 1994.

Bradley Efron, Trevor Hastie, Iain M. Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 201–208. ACM, 2005.

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research (JMLR), 6:503–556, 2005.

Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1–50, 1999.

Amir-massoud Farahmand. Action-gap phenomenon in reinforcement learning. In Advances in Neural Information Processing Systems (NIPS-24), pages 172–180. Curran Associates, Inc., 2011a.

Amir-massoud Farahmand. Regularization in Reinforcement Learning.
PhD thesis, University of Alberta, 2011b.

Amir-massoud Farahmand and Doina Precup. Value pursuit iteration. In Advances in Neural Information Processing Systems (NIPS-25), pages 1349–1357. Curran Associates, Inc., 2012.

Amir-massoud Farahmand and Csaba Szepesvári. Model selection in reinforcement learning. Machine Learning, 85(3):299–332, 2011.

Amir-massoud Farahmand and Csaba Szepesvári. Regularized least-squares regression: Learning from a β-mixing sequence. Journal of Statistical Planning and Inference, 142(2):493–505, 2012.

Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized fitted Q-iteration: Application to planning. In Recent Advances in Reinforcement Learning, 8th European Workshop (EWRL), volume 5323 of Lecture Notes in Computer Science, pages 55–68. Springer, 2008.

Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian Decision Problems. In Proceedings of the American Control Conference (ACC), pages 725–730, June 2009a.

Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized policy iteration. In Advances in Neural Information Processing Systems (NIPS-21), pages 441–448. MIT Press, 2009b.

Amir-massoud Farahmand, Rémi Munos, and Csaba Szepesvári. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems (NIPS-23), pages 568–576, 2010.

Amir-massoud Farahmand, Doina Precup, Mohammad Ghavamzadeh, and André M. S. Barreto. Classification-based approximate policy iteration. IEEE Transactions on Automatic Control, 60(11):2989–2993, November 2015.

Matthieu Geist and Bruno Scherrer. ℓ1-penalized projected Bellman residual.
In Recent Advances in Reinforcement Learning, volume 7188 of Lecture Notes in Computer Science, pages 89–101. Springer Berlin Heidelberg, 2012.

Alborz Geramifard, Finale Doshi, Joshua Redding, Nicholas Roy, and Jonathan How. Online discovery of feature dependencies. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 881–888. ACM, 2011.

Alborz Geramifard, Thomas J. Walsh, Nicholas Roy, and Jonathan P. How. Batch iFDD: A scalable matching pursuit algorithm for solving MDPs. In 29th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 2013.

Mohammad Ghavamzadeh, Alessandro Lazaric, Rémi Munos, and Matthew Hoffman. Finite-sample analysis of Lasso-TD. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1177–1184. ACM, 2011.

Grigori K. Golubev and Michael Nussbaum. A risk bound in Sobolev class regression. The Annals of Statistics, 18(2):758–778, 1990.

László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York, 2002.

Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 1970.

Matthew W. Hoffman, Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Regularized least squares temporal difference learning with nested ℓ1 and ℓ2 penalization. In Recent Advances in Reinforcement Learning, volume 7188 of Lecture Notes in Computer Science, pages 102–114. Springer Berlin Heidelberg, 2012.

Jeff Johns. Basis Construction and Utilization for Markov Decision Processes using Graphs. PhD thesis, University of Massachusetts Amherst, 2010.

Jeff Johns, Christopher Painter-Wakefield, and Ronald Parr. Linear complementarity for regularized policy evaluation and improvement.
In Advances in Neural Information Processing Systems (NIPS-23), pages 1009–1017, 2010.

Tobias Jung and Daniel Polani. Least squares SVM for least squares TD learning. In Proceedings of the 17th European Conference on Artificial Intelligence (ECAI), pages 499–503, 2006.

Michael Kohler. Inequalities for uniform deviations of averages from expectations with applications to nonparametric regression. Journal of Statistical Planning and Inference, 89:1–23, 2000.

J. Zico Kolter and Andrew Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 521–528. ACM, 2009.

George Konidaris, Sarah Osentoski, and Philip Thomas. Value function approximation in reinforcement learning using the Fourier basis. In AAAI Conference on Artificial Intelligence, 2011.

Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research (JMLR), 4:1107–1149, 2003.

Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research (JMLR), 13:3041–3074, October 2012.

Bo Liu, Sridhar Mahadevan, and Ji Liu. Regularized off-policy TD-learning. In Advances in Neural Information Processing Systems (NIPS-25), pages 845–853, 2012.

Manuel Loth, Manuel Davy, and Philippe Preux. Sparse temporal difference learning using LASSO. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 352–359, 2007.

Sridhar Mahadevan and Bo Liu. Basis construction from power series expansions of value functions. In Advances in Neural Information Processing Systems (NIPS-23), pages 1540–1548, 2010.

Sridhar Mahadevan and Mauro Maggioni.
Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research (JMLR), 8:2169–2231, 2007.

Stéphane Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009.

Mahdi Milani Fard, Yuri Grinberg, Amir-massoud Farahmand, Joelle Pineau, and Doina Precup. Bellman error based feature generation using random projections on sparse spaces. In Advances in Neural Information Processing Systems (NIPS-26), pages 3030–3038, 2013.

Rémi Munos. Error bounds for approximate policy iteration. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 560–567, 2003.

Rémi Munos. Performance bounds in Lp norm for approximate value iteration. SIAM Journal on Control and Optimization, pages 541–561, 2007.

Boaz Nadler, Nathan Srebro, and Xueyuan Zhou. Semi-supervised learning with the graph Laplacian: The limit of infinite unlabelled data. In Advances in Neural Information Processing Systems (NIPS-22), pages 1330–1338, 2009.

Michael Nussbaum. Spline smoothing in regression models and asymptotic efficiency in L2. The Annals of Statistics, 13(3):984–997, 1985.

Michael Nussbaum. Minimax risk: Pinsker bound. In Encyclopedia of Statistical Sciences, volume 3, pages 451–460, 1999.

Dirk Ormoneit and Saunak Sen. Kernel-based reinforcement learning. Machine Learning, 49:161–178, 2002.

Christopher Painter-Wakefield and Ronald Parr. Greedy algorithms for sparse reinforcement learning. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Ronald Parr, Christopher Painter-Wakefield, Lihong Li, and Michael Littman.
Analyzing feature generation for value-function approximation. In Proceedings of the 24th International Conference on Machine Learning (ICML), pages 737–744. ACM, 2007.

Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers, pages 40–44, 1993.

Marek Petrik. An analysis of Laplacian methods for value function approximation in MDPs. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2574–2579, 2007.

Zhiwei Qin, Weichang Li, and Firdaus Janoos. Sparse reinforcement learning via convex optimization. In Proceedings of the 31st International Conference on Machine Learning (ICML), volume 32 of JMLR: W&CP, 2014.

Stéphane Ross, Geoffrey Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 627–635, 2011.

Paul-Marie Samson. Concentration of measure inequalities for Markov chains and φ-mixing processes. The Annals of Probability, 28(1):416–461, 2000.

Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In COLT'01/EuroCOLT'01: Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory, pages 416–426. Springer-Verlag, 2001.

Paul J. Schweitzer and Abraham Seidmann. Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110:568–582, 1985.

Shai Shalev-Shwartz and Nathan Srebro. SVM optimization: Inverse dependence on training set size.
In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 928–935. ACM, 2008.

Steve Smale and Ding-Xuan Zhou. Estimating the approximation error in learning theory. Analysis and Applications, 1(1):17–41, 2003.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008.

Ingo Steinwart, Don Hush, and Clint Scovel. Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009.

Charles J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4):1040–1053, 1982.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 993–1000. ACM, 2009.

Csaba Szepesvári. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010.

Gavin Taylor and Ronald Parr. Kernelized value function approximation for reinforcement learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 1017–1024. ACM, 2009.

Hans Triebel. Theory of Function Spaces III. Springer, 2006.

Alexander B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

Sara A. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, Theory and Applications, chapter 5, pages 210–268, 2012.

Larry Wasserman. All of Nonparametric Statistics.
Springer, 2007.

Ronald J. Williams and Leemon C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. In Proceedings of the Tenth Yale Workshop on Adaptive and Learning Systems, 1994.

Xin Xu, Dewen Hu, and Xicheng Lu. Kernel-based least squares policy iteration for reinforcement learning. IEEE Transactions on Neural Networks, 18:973–992, 2007.

Changjiang Yang, Ramani Duraiswami, and Larry Davis. Efficient kernel machines using the improved fast Gauss transform. In Advances in Neural Information Processing Systems (NIPS-17), pages 1561–1568. MIT Press, 2004.

Yuhong Yang and Andrew R. Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564–1599, 1999.

Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, January 1994.

Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research (JMLR), 2:527–550, 2002.

Ding-Xuan Zhou. The covering number in learning theory. Journal of Complexity, 18(3):739–767, 2002.

Ding-Xuan Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Transactions on Information Theory, 49:1743–1752, 2003.