Journal of Machine Learning Research 17 (2016) 1-70. Submitted 3/13; Revised 12/14; Published 4/16

Practical Kernel-Based Reinforcement Learning

André M. S. Barreto (amsb@lncc.br), Laboratório Nacional de Computação Científica, Petrópolis, Brazil

Doina Precup (dprecup@cs.mcgill.ca) and Joelle Pineau (jpineau@cs.mcgill.ca), School of Computer Science, McGill University, Montreal, Canada

Editor: George Konidaris

Abstract

Kernel-based reinforcement learning (KBRL) stands out among approximate reinforcement learning algorithms for its strong theoretical guarantees. By casting the learning problem as a local kernel approximation, KBRL provides a way of computing a decision policy which converges to a unique solution and is statistically consistent. Unfortunately, the model constructed by KBRL grows with the number of sample transitions, resulting in a computational cost that precludes its application to large-scale or on-line domains. In this paper we introduce an algorithm that turns KBRL into a practical reinforcement learning tool. Kernel-based stochastic factorization (KBSF) builds on a simple idea: when a transition probability matrix is represented as the product of two stochastic matrices, one can swap the factors of the multiplication to obtain another transition matrix, potentially much smaller than the original, which retains some fundamental properties of its precursor. KBSF exploits this insight to compress the information contained in KBRL's model into an approximator of fixed size. This makes it possible to build an approximation that accounts for both the difficulty of the problem and the associated computational cost. KBSF's computational complexity is linear in the number of sample transitions, which is the best one can do without discarding data.
Moreover, the algorithm's simple mechanics allow for a fully incremental implementation that makes the amount of memory used independent of the number of sample transitions. The result is a kernel-based reinforcement learning algorithm that can be applied to large-scale problems in both off-line and on-line regimes. We derive upper bounds for the distance between the value functions computed by KBRL and KBSF using the same data. We also prove that it is possible to control the magnitude of the variables appearing in our bounds, which means that, given enough computational resources, we can make KBSF's value function as close as desired to the value function that would be computed by KBRL using the same set of sample transitions. The potential of our algorithm is demonstrated in an extensive empirical study in which KBSF is applied to difficult tasks based on real-world data. Not only does KBSF solve problems that had never been solved before, but it also significantly outperforms other state-of-the-art reinforcement learning algorithms on the tasks studied.

Keywords: reinforcement learning, dynamic programming, Markov decision processes, kernel-based approximation, stochastic factorization

©2016 André M. S. Barreto, Doina Precup, and Joelle Pineau.

1. Introduction

Reinforcement learning provides a conceptual framework with the potential to materialize a long-sought goal in artificial intelligence: the construction of situated agents that learn how to behave from direct interaction with the environment (Sutton and Barto, 1998). But such an endeavor does not come without its challenges; among them, extrapolating the field's basic machinery to large-scale domains has been a particularly persistent obstacle. It has long been recognized that virtually any real-world application of reinforcement learning must involve some form of approximation.
Given the mature stage of supervised-learning theory, and considering the multitude of approximation techniques available today, this realization may not come across as a particularly worrisome issue at first glance. However, it is well known that the sequential nature of the reinforcement learning problem renders the incorporation of function approximators non-trivial (Bertsekas and Tsitsiklis, 1996). Despite the difficulties, in the last two decades the collective effort of the reinforcement learning community has given rise to many reliable approximate algorithms (Szepesvári, 2010). Among them, Ormoneit and Sen's (2002) kernel-based reinforcement learning (KBRL) stands out for two reasons. First, unlike other approximation schemes, KBRL always converges to a unique solution. Second, KBRL is consistent in the statistical sense, meaning that adding more data improves the quality of the resulting policy and eventually leads to optimal performance. Unfortunately, the good theoretical properties of KBRL come at a price: since the model constructed by this algorithm grows with the number of sample transitions, the cost of computing a decision policy quickly becomes prohibitive as more data become available. Such a computational burden severely limits the applicability of KBRL. This may help explain why, in spite of its nice theoretical guarantees, kernel-based learning has not been widely adopted as a practical reinforcement learning tool. This paper presents an algorithm that can potentially change this situation. Kernel-based stochastic factorization (KBSF) builds on a simple idea: when a transition probability matrix is represented as the product of two stochastic matrices, one can swap the factors of the multiplication to obtain another transition matrix, potentially much smaller than the original, which retains some fundamental properties of its precursor (Barreto and Fragoso, 2011).
KBSF exploits this insight to compress the information contained in KBRL's model into an approximator of fixed size. Specifically, KBSF builds a model, whose size is independent of the number of sample transitions, which serves as an approximation of the model that would be constructed by KBRL. Since the size of the model becomes a parameter of the algorithm, KBSF essentially detaches the structure of KBRL's approximator from its configuration. This extra flexibility makes it possible to build an approximation that takes into account both the difficulty of the problem and the computational cost of finding a policy using the constructed model. KBSF's computational complexity is linear in the number of sample transitions, which is the best one can do without throwing data away. Moreover, we show in the paper that the amount of memory used by our algorithm is independent of the number of sample transitions. Put together, these two properties make it possible to apply KBSF to large-scale problems in both off-line and on-line regimes. To illustrate this possibility in practice, we present an extensive empirical study in which KBSF is applied to difficult control tasks based on real-world data, some of which had never been solved before. KBSF outperforms least-squares policy iteration and fitted Q-iteration on several off-line problems and SARSA on a difficult on-line task. We also show that KBSF is a sound algorithm from a theoretical point of view. Specifically, we derive results bounding the distance between the value function computed by our algorithm and the one computed by KBRL using the same data. We also prove that it is possible to control the magnitude of the variables appearing in our bounds, which means that we can make the difference between KBSF's and KBRL's solutions arbitrarily small. We start the paper by presenting some background material in Section 2.
Then, in Section 3, we introduce the stochastic-factorization trick, the insight underlying the development of our algorithm. KBSF itself is presented in Section 4. This section is divided in two parts, one theoretical and one practical. In Section 4.1 we present theoretical results showing that not only is the difference between KBSF's and KBRL's value functions bounded, but it can also be controlled. Section 4.2 brings experiments with KBSF on four reinforcement-learning problems: single and double pole-balancing, the HIV drug-schedule domain, and the epilepsy-suppression task. In Section 5 we introduce the incremental version of our algorithm, which can be applied to on-line problems. This section follows the same structure as Section 4, with theoretical results followed by experiments. Specifically, in Section 5.1 we extend the results of Section 4.1 to the on-line scenario, and in Section 5.2 we present experiments on the triple pole-balancing and helicopter tasks. In Section 6 we take a closer look at the approximation computed by KBSF and present a practical guide on how to configure our algorithm to solve a reinforcement learning problem. In Section 7 we summarize related work and situate KBSF in the context of kernel-based learning. Finally, in Section 8 we present the main conclusions regarding the current research and discuss some possibilities for future work. The paper has three appendices. Appendix A has the proofs of our theoretical results. The details of the experiments that were omitted from the main body of the text are described in Appendix B. In Appendix C we provide a table with the main symbols used in the paper, which can serve as a reference to facilitate reading. Parts of the material presented in this article have appeared before in two papers published in the Neural Information Processing Systems conference (NIPS; Barreto et al., 2011, 2012). The current manuscript is a substantial extension of the aforementioned works.
2. Background

We consider the standard framework of reinforcement learning, in which an agent interacts with an environment and tries to maximize the amount of reward collected in the long run (Sutton and Barto, 1998). The interaction between agent and environment happens at discrete time steps: at each instant t the agent occupies a state s^(t) ∈ S and must choose an action a from a finite set A. The sets S and A are called the state and action spaces, respectively. The execution of action a in state s^(t) moves the agent to a new state s^(t+1), where a new action must be selected, and so on. Each transition has a certain probability of occurrence and is associated with a reward r ∈ R. The goal of the agent is to find a policy π : S → A, that is, a mapping from states to actions, that maximizes the expected return. Here we define the return from time t as

R^(t) = r^(t+1) + γ r^(t+2) + γ² r^(t+3) + ⋯ = Σ_{i=1}^{∞} γ^{i−1} r^(t+i),   (1)

where r^(t+1) is the reward received at the transition from state s^(t) to state s^(t+1). The interaction of the agent with the environment may last forever (T = ∞) or until the agent reaches a terminal state (T < ∞); each sequence of interactions is usually referred to as an episode. The parameter γ ∈ [0, 1) is the discount factor, which determines the relative importance of individual rewards depending on how far in the future they are received.

2.1 Markov Decision Processes

As usual, we assume that the interaction between agent and environment can be modeled as a Markov decision process (MDP; Puterman, 1994). An MDP is a tuple M ≡ (S, A, P^a, R^a, γ), where P^a and R^a describe the dynamics of the task at hand. For each action a ∈ A, P^a(·|s) defines the next-state distribution upon taking action a in state s. The reward received at transition s →a s′ is given by R^a(s, s′), with |R^a(s, s′)| ≤ R_max < ∞. Usually, one is interested in the expected reward resulting from the execution of action a in state s, that is, r^a(s) = E_{s′∼P^a(·|s)}{R^a(s, s′)}.
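As a concrete illustration of the expected reward r^a(s), the short sketch below computes it for a toy finite MDP with one action; the numbers and names are ours, purely for illustration:

```python
import numpy as np

# Toy illustration of r^a(s) = E_{s'~P^a(.|s)}{R^a(s, s')} for a
# 3-state MDP with a single action (all numbers are made up).
P_a = np.array([[0.8, 0.2, 0.0],
                [0.1, 0.6, 0.3],
                [0.0, 0.0, 1.0]])   # next-state distributions P^a(.|s)
R_a = np.array([[1.0, 0.0, 0.0],
                [0.0, 2.0, 5.0],
                [0.0, 0.0, 0.0]])   # rewards R^a(s, s')

# Expected reward per state: weight each reward by its transition probability.
r_a = (P_a * R_a).sum(axis=1)
# r_a[0] = 0.8*1.0 = 0.8;  r_a[1] = 0.6*2.0 + 0.3*5.0 = 2.7;  r_a[2] = 0.0
```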
Once the interaction between agent and environment has been modeled as an MDP, a natural way of searching for an optimal policy is to resort to dynamic programming (Bellman, 1957). Central to the theory of dynamic programming is the concept of a value function. The value of state s under a policy π, denoted by V^π(s), is the expected return the agent will receive from s when following π, that is, V^π(s) = E_π{R^(t) | s^(t) = s} (here the expectation is over all possible sequences of rewards in (1) when the agent follows π). Similarly, the value of the state-action pair (s, a) under policy π is defined as Q^π(s, a) = E_{s′∼P^a(·|s)}{R^a(s, s′) + γV^π(s′)} = r^a(s) + γ E_{s′∼P^a(·|s)}{V^π(s′)}. The notion of value function makes it possible to impose a partial ordering over decision policies. In particular, a policy π′ is considered to be at least as good as another policy π if V^{π′}(s) ≥ V^π(s) for all s ∈ S. The goal of dynamic programming is to find an optimal policy π* that performs no worse than any other. It is well known that there always exists at least one such policy for a given MDP (Puterman, 1994). When there is more than one optimal policy, they all share the same value function V*. When both the state and action spaces are finite, an MDP can be represented in matrix form: each function P^a becomes a matrix P^a ∈ R^{|S|×|S|}, with p^a_ij = P^a(s_j|s_i), and each function r^a becomes a vector r^a ∈ R^{|S|}, where r^a_i = r^a(s_i). Similarly, V^π can be represented as a vector v^π ∈ R^{|S|} and Q^π can be seen as a matrix Q^π ∈ R^{|S|×|A|}.¹ When the MDP is finite, dynamic programming can be used to find an optimal decision policy π* ∈ A^{|S|} in time polynomial in the number of states |S| and actions |A| (Ye, 2011).

1. Throughout the paper we will use the conventional and matrix notations interchangeably, depending on the context. When using the latter, vectors will be denoted by small boldface letters and matrices will be denoted by capital boldface letters.
We will also use the same notation for all MDPs and associated components, distinguishing between them through the use of math accents. So, for example, if M̄ is an MDP, its transition functions and matrices will be referred to as P̄^a and P̄^a (the latter in boldface), its expected-reward functions and vectors will be denoted by r̄^a and r̄^a, its optimal decision policy will be π̄*, and so on (see Table 2).

Let v ∈ R^{|S|} and let Q ∈ R^{|S|×|A|}. Define the operator Γ : R^{|S|×|A|} → R^{|S|} such that ΓQ = v if and only if v_i = max_j q_ij for all i. Also, given an MDP M, define ∆ : R^{|S|} → R^{|S|×|A|} such that ∆v = Q if and only if q_ia = r^a_i + γ Σ_{j=1}^{|S|} p^a_ij v_j for all i and all a. The Bellman operator of the MDP M is given by T ≡ Γ∆. A fundamental result in dynamic programming states that, starting from v^(0) = 0, the expression v^(t) = Tv^(t−1) = ΓQ^(t) gives the optimal t-step value function, and as t → ∞ the vector v^(t) approaches v*. At any point, the optimal t-step policy can be obtained by selecting π^(t)_i ∈ argmax_j q^(t)_ij (Puterman, 1994). In contrast with dynamic programming, in reinforcement learning it is assumed that the MDP is unknown, and the agent must learn a policy based on transitions sampled from the environment. If the process of learning a decision policy is based on a fixed set of sample transitions, we call it batch reinforcement learning. On the other hand, in on-line reinforcement learning the computation of a decision policy takes place concomitantly with the collection of data (Sutton and Barto, 1998).

2.2 Kernel-Based Reinforcement Learning

Kernel-based reinforcement learning (KBRL) is a batch algorithm that uses a finite model approximation to solve a continuous MDP M ≡ (S, A, P^a, R^a, γ), where S ⊆ [0, 1]^{d_S} and d_S ∈ N⁺ is the dimension of the state space (Ormoneit and Sen, 2002). Let S^a ≡ {(s^a_k, r^a_k, ŝ^a_k) | k = 1, 2, ..., n_a} be the sample transitions associated with action a ∈ A, where s^a_k, ŝ^a_k ∈ S and r^a_k ∈ R.
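The operators Γ and ∆ and the iteration v^(t) = Tv^(t−1) can be sketched for a toy finite MDP as follows; the two-state example and all names are illustrative, not from the paper:

```python
import numpy as np

# Minimal sketch of the operators: delta implements ∆ (backup into Q),
# gamma_op implements Γ (row-wise max), and their composition is T = Γ∆.
def delta(v, P, r, gamma):
    """∆v = Q, with q_ia = r^a_i + γ Σ_j p^a_ij v_j."""
    return np.stack([r[a] + gamma * P[a] @ v for a in range(len(P))], axis=1)

def gamma_op(Q):
    """ΓQ = v, with v_i = max_a q_ia."""
    return Q.max(axis=1)

P = [np.array([[0.9, 0.1], [0.1, 0.9]]),   # dynamics under action 0
     np.array([[0.5, 0.5], [0.5, 0.5]])]   # dynamics under action 1
r = [np.array([0.0, 1.0]), np.array([0.2, 0.2])]
g = 0.9

v = np.zeros(2)                            # v(0) = 0
for _ in range(500):                       # v(t) = T v(t-1) converges to v*
    v = gamma_op(delta(v, P, r, g))
policy = delta(v, P, r, g).argmax(axis=1)  # greedy policy w.r.t. v*
```

Since T is a γ-contraction, 500 iterations leave v numerically indistinguishable from its fixed point.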
Let φ : R⁺ → R⁺ be a Lipschitz continuous function satisfying ∫₀¹ φ(x) dx = 1. Let k_τ(s, s′) be a kernel function defined as

k_τ(s, s′) = φ(‖s − s′‖ / τ),   (2)

where τ > 0 and ‖·‖ is a norm in R^{d_S} (for concreteness, the reader may think of k_τ(s, s′) as the Gaussian kernel, although the definition also encompasses other functions). Finally, define the normalized kernel function associated with action a as

κ^a_τ(s, s^a_i) = k_τ(s, s^a_i) / Σ_{j=1}^{n_a} k_τ(s, s^a_j).   (3)

KBRL uses (3) to build a finite MDP whose state space Ŝ is composed solely of the n = Σ_a n_a states ŝ^a_i. We assume without loss of generality that the action space A is ordered and the sampled states are ordered lexicographically, that is, ŝ^a_i < ŝ^b_j ⟺ a < b or (a = b and i < j); if a given state s ∈ S occurs more than once in the set of sample transitions, each occurrence will be treated as a distinct state in the finite MDP. The transition functions of KBRL's model, P̂^a : Ŝ × Ŝ → [0, 1], are given by

P̂^a(ŝ^b_i | s) = κ^a_τ(s, s^b_i) if a = b, and 0 otherwise,   (4)

where a, b ∈ A. Similarly, the reward functions of the MDP constructed by KBRL, R̂^a : Ŝ × Ŝ → R, are

R̂^a(s, ŝ^b_i) = r^a_i if a = b, and 0 otherwise.   (5)

Based on (4) and (5) we can define the transition matrices and expected-reward vectors of KBRL's MDP. The matrices P̂^a are derived directly from the definition of P̂^a(ŝ^b_i | s), replacing s with the sampled states ŝ^a_i (see Figure 2a). The vectors r̂^a are computed as follows. Let r ≡ [(r¹)⊤, (r²)⊤, ..., (r^{|A|})⊤]⊤ ∈ R^n, where the r^a ∈ R^{n_a} are the vectors composed of the sample rewards, that is, the ith element of r^a is the reward r^a_i from S^a. Since R̂^a(s, ŝ^b_i) does not depend on the start state s, we can write

r̂^a = P̂^a r.   (6)

KBRL's MDP is thus given by M̂ ≡ (Ŝ, A, P̂^a, r̂^a, γ). Once M̂ has been defined, one can use dynamic programming to compute its optimal value function V̂*.
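To make the construction concrete, the sketch below builds the blocks of KBRL's model for a single action from random sample transitions, using a Gaussian kernel; all names and data are our own illustrative choices, not taken from any released implementation:

```python
import numpy as np

# Sketch of KBRL's finite model restricted to one action a: given n_a
# sample transitions (s^a_k, r^a_k, s_hat^a_k), each row of P_hat holds
# the normalized kernel weights (3) evaluated at a sampled end state.
def gaussian(x):
    return np.exp(-x ** 2)

def kappa(s, starts, tau):
    """Normalized kernel kappa^a_tau(s, s^a_i) over the sampled start states."""
    k = gaussian(np.linalg.norm(starts - s, axis=1) / tau)
    return k / k.sum()

rng = np.random.default_rng(0)
starts = rng.random((5, 2))   # start states s^a_k, with d_S = 2
ends = rng.random((5, 2))     # end states s_hat^a_k
rewards = rng.random(5)       # sample rewards r^a_k

# Row i gives the transition distribution out of the sampled state s_hat^a_i.
P_hat = np.stack([kappa(e, starts, tau=0.5) for e in ends])
r_hat = P_hat @ rewards       # expected rewards, as in (6)
```

Each row of `P_hat` is a probability distribution over the n_a sampled end states, which is exactly what makes the model a finite MDP.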
Then, the value of any state-action pair of the continuous MDP can be determined as

Q̂(s, a) = Σ_{i=1}^{n_a} κ^a_τ(s, s^a_i) [ r^a_i + γ V̂*(ŝ^a_i) ],   (7)

where s ∈ S and a ∈ A. Ormoneit and Sen (2002) have shown that, if n_a → ∞ for all a ∈ A and the kernel's width τ shrinks at an admissible rate, the probability of choosing a suboptimal action based on Q̂(s, a) converges to zero (see their Theorem 4). As discussed, using dynamic programming one can compute the optimal value function of M̂ in time polynomial in the number of sample transitions n (which is also the number of states in M̂). However, since each application of the Bellman operator T̂ is O(n²|A|), the computational cost of such a procedure can easily become prohibitive in practice. Thus, the use of KBRL leads to a dilemma: on the one hand one wants as much data as possible to describe the dynamics of the task, but on the other hand the number of transitions should be small enough to allow for the numerical solution of the resulting model. In the following sections we describe a practical approach to weighing the relative importance of these two conflicting objectives.

3. Stochastic Factorization

A stochastic matrix has only nonnegative elements and each of its rows sums to 1. That said, we can introduce the concept that will serve as a cornerstone for the rest of the paper:

Definition 1 Given a stochastic matrix P ∈ R^{n×p}, the relation P = DK is called a stochastic factorization of P if D ∈ R^{n×m} and K ∈ R^{m×p} are also stochastic matrices. The integer m > 0 is the order of the factorization.

This mathematical construct has been explored before. For example, Cohen and Rothblum (1991) briefly discuss it as a special case of nonnegative matrix factorization, while Cutler and Breiman (1994) study slightly modified versions of stochastic factorization for statistical data analysis. However, in this paper we will focus on a useful property of this type of factorization that has only recently been noted (Barreto and Fragoso, 2011).
3.1 Stochastic-Factorization Trick

Let P ∈ R^{n×n} be a transition matrix, that is, a square stochastic matrix, and let P = DK be an order-m stochastic factorization. In this case, one can see the elements of D and K as probabilities of transitions between the states s_i and a set of m artificial states s̄_h. Specifically, the elements in each row of D can be interpreted as probabilities of transitions from the original states to the artificial states, while the rows of K can be seen as probabilities of transitions in the opposite direction. Under this interpretation, each element p_ij = Σ_{h=1}^{m} d_ih k_hj is the sum of the probabilities associated with m two-step transitions: from state s_i to each artificial state s̄_h and from these back to state s_j. In other words, p_ij is the accumulated probability of all possible paths from s_i to s_j with a stopover in one of the artificial states s̄_h. Following similar reasoning, it is not difficult to see that by swapping the factors of a stochastic factorization, that is, by switching from DK to KD, one obtains the transition probabilities between the artificial states s̄_h: P̄ = KD. If m < n, P̄ ∈ R^{m×m} will be a compact version of P. Figure 1 illustrates this idea for the case in which n = 3 and m = 2.

Figure 1: Reducing the dimension of a transition model from n = 3 states to m = 2 artificial states. Original states s_i are represented as big white circles; small black circles depict artificial states s̄_h; nonzero elements are marked. These figures have appeared before in the article by Barreto and Fragoso (2011).

The stochasticity of P̄ follows immediately from the same property of D and K. What is perhaps more surprising is the fact that this matrix shares some fundamental characteristics with the original matrix P.
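The trick itself is a two-line computation. Here is a toy instance mirroring the n = 3, m = 2 case of Figure 1; the matrices D and K below are made up for illustration:

```python
import numpy as np

# The stochastic-factorization trick: P = DK is n x n, and swapping the
# factors gives P_bar = KD, an m x m transition matrix (toy numbers, ours).
D = np.array([[1.0, 0.0],
              [0.3, 0.7],
              [0.0, 1.0]])        # n=3 original states -> m=2 artificial states
K = np.array([[0.8, 0.2, 0.0],
              [0.0, 0.2, 0.8]])   # artificial states -> original states

P = D @ K                         # 3 x 3 transition matrix
P_bar = K @ D                     # 2 x 2 compact version

# Both products are stochastic: rows are nonnegative and sum to 1.
```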
Specifically, it is possible to show that: (i) for each recurrent class in P there is a corresponding class in P̄ with the same period and, given some simple assumptions about the factorization, (ii) P is irreducible if and only if P̄ is irreducible and (iii) P is regular if and only if P̄ is regular (for details, see the article by Barreto and Fragoso, 2011). We will refer to this insight as the stochastic-factorization trick:

Given a stochastic factorization of a transition matrix, P = DK, swapping the factors of the factorization yields another transition matrix P̄ = KD, potentially much smaller than the original, which retains the basic topology and properties of P.

Given the strong connection between P ∈ R^{n×n} and P̄ ∈ R^{m×m}, the idea of replacing the former by the latter comes almost inevitably. The motivation for this would be, of course, to save computational resources when m < n. For example, Barreto and Fragoso (2011) have shown that it is possible to recover the stationary distribution of P through a linear transformation of the corresponding distribution of P̄. In this paper we will use the stochastic-factorization trick to reduce the computational cost of KBRL. The strategy will be to summarize the information contained in KBRL's MDP in a model of fixed size.

3.2 Reducing a Markov Decision Process

The idea of using stochastic factorization to reduce dynamic programming's computational requirements is straightforward: given factorizations of the transition matrices P^a, we can apply our trick to obtain a reduced MDP that will be solved in place of the original one. In the most general scenario, we would have one independent factorization P^a = D^a K^a for each action a ∈ A and then use P̄^a = K^a D^a instead of P^a. However, in the current work we will focus on the particular case in which there is a single matrix D, which will prove to be convenient both mathematically and computationally.
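With a single shared D, reducing an MDP amounts to one matrix product per action; a toy sketch (all numbers are ours) for a 2-action MDP with n = 3 and m = 2:

```python
import numpy as np

# Reducing a two-action MDP with a shared D: each reduced transition
# matrix P_bar^a = K^a D is m x m (toy numbers, purely illustrative).
D = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])                       # shared, 3 x 2
K = {0: np.array([[0.7, 0.3, 0.0],
                  [0.0, 0.3, 0.7]]),             # K^a for action 0
     1: np.array([[0.2, 0.8, 0.0],
                  [0.0, 0.8, 0.2]])}             # K^a for action 1

P_bar = {a: K[a] @ D for a in K}                 # two 2 x 2 stochastic matrices
```

The reduced MDP over the two artificial states can now be solved by standard dynamic programming in place of the original three-state MDP.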
Obviously, in order to apply the stochastic-factorization trick to an MDP, we have to first compute the matrices involved in the factorization. Unfortunately, such a procedure can be computationally demanding, exceeding the number of operations necessary to calculate v* (Vavasis, 2009; Barreto et al., 2014). Thus, in practice we may have to replace the exact factorizations P^a = DK^a with approximations P^a ≈ DK^a. The following proposition bounds the error in the value-function approximation resulting from the application of our trick to approximate stochastic factorizations:

Proposition 2 Let M ≡ (S, A, P^a, r^a, γ) be a finite MDP with |S| = n and 0 ≤ γ < 1. Let D ∈ R^{n×m} be a stochastic matrix and, for each a ∈ A, let K^a ∈ R^{m×n} be stochastic and let r̄^a be a vector in R^m. Define the MDP M̄ ≡ (S̄, A, P̄^a, r̄^a, γ), with |S̄| = m and P̄^a = K^a D. Then,

‖v* − ΓDQ̄*‖∞ ≤ ξ_v ≡ (1/(1 − γ)) max_a ‖r^a − D r̄^a‖∞ + (R̄_dif/(1 − γ)²) ( (γ/2) max_a ‖P^a − D K^a‖∞ + σ(D) ),   (8)

where

σ(D) = max_i (1 − max_j d_ij),   (9)

‖·‖∞ is the maximum norm, and R̄_dif = max_{a,i} r̄^a_i − min_{a,i} r̄^a_i.²

The proofs of most of our theoretical results are in Appendix A.1. We note that Proposition 2 is only valid for the maximum norm; in Appendix A.2 we derive another bound for the distance between v* and ΓDQ̄* which is valid for any norm. Our bound depends on two factors: the quality of the MDP's factorization, given by max_a ‖P^a − DK^a‖∞ and max_a ‖r^a − D r̄^a‖∞, and the level of stochasticity of D, measured by σ(D). When the MDP factorization is exact, we recover a computable version of Sorg and Singh's (2009) bound for soft homomorphisms (see (32)). On the other hand, when D is deterministic, that is, when all its nonzero elements are 1, expression (8) reduces to Whitt's (1978) classical result regarding state aggregation in dynamic programming. Finally, if we have exact deterministic factorizations, the right-hand side of (8) reduces to zero.

2. We recall that ‖·‖∞ induces the following norm over the space of matrices: ‖A‖∞ = max_i Σ_j |a_ij|.
This also makes sense, since in this case the stochastic-factorization trick gives rise to an exact homomorphism (Ravindran, 2004). Proposition 2 elucidates the basic mechanism through which one can use the stochastic-factorization trick to reduce the number of states in an MDP (and hence the computational cost of finding a policy using dynamic programming). One possible way to exploit this result is to see the computation of D, K^a, and r̄^a as an optimization problem in which the objective is to minimize some function of max_a ‖P^a − DK^a‖∞, max_a ‖r^a − D r̄^a‖∞, and possibly also σ(D) (Barreto et al., 2014). Note though that addressing the factorization problem as an optimization may be computationally infeasible when the dimension of the matrices P^a is large, which is exactly the case we are interested in here. To illustrate this point, we will draw a connection between stochastic factorization and a popular problem known in the literature as nonnegative matrix factorization (Paatero and Tapper, 1994; Lee and Seung, 1999). In a nonnegative matrix factorization the elements of D and K^a are greater than or equal to zero, but in general no stochasticity constraint is imposed. Cohen and Rothblum (1991) have shown that it is always possible to derive a stochastic factorization from a nonnegative factorization of a stochastic matrix, which formally characterizes the former as a particular case of the latter. Unfortunately, nonnegative factorization is hard: Vavasis (2009) has shown that a particular version of the problem is in fact NP-hard. Instead of solving the problem exactly, one can resort to an approximate nonnegative matrix factorization. However, since the number of states n in an MDP determines both the number of rows and the number of columns of the matrices P^a, even fast linear approximate methods run in O(n²) time, which is infeasible for large n (Barreto et al., 2014).
One can circumvent this computational obstacle by exploiting structure in the optimization problem or by resorting to heuristics. In another article on the subject we explore both these alternatives at length (Barreto et al., 2014). However, in this paper we adopt a different approach. Since the model M̂ built by KBRL is itself an approximation, instead of insisting on finding a near-optimal factorization for M̂ we apply our trick to avoid the construction of P̂^a and r̂^a. As will be seen, this is done by applying KBRL's own approximation scheme to M̂.

4. Kernel-Based Stochastic Factorization

In Section 2 we presented KBRL, an approximation framework for reinforcement learning whose main drawback is its high computational complexity. In Section 3 we discussed how the stochastic-factorization trick can in principle be useful to reduce an MDP, as long as one circumvents the computational burden imposed by the calculation of the matrices involved in the process. We now show that by combining these two approaches we get an algorithm that overcomes the computational limitations of its components. We call it kernel-based stochastic factorization, or KBSF for short. KBSF emerges from the application of the stochastic-factorization trick to KBRL's MDP M̂ (Barreto et al., 2011). Similarly to Ormoneit and Sen (2002), we start by defining a "mother kernel" φ̄ : R⁺ → R⁺. In Section 4.1 we list our assumptions regarding φ̄. Here, it suffices to note that, since our assumptions and Ormoneit and Sen's (2002) are not mutually exclusive, we can have φ̄ ≡ φ (by using the Gaussian function in both cases, for example). Let S̄ ≡ {s̄₁, s̄₂, ..., s̄_m} be a set of representative states. Analogously to (2) and (3), we define the kernel k̄_τ̄(s, s′) = φ̄(‖s − s′‖/τ̄) and its normalized version κ̄_τ̄(s, s̄_i) = k̄_τ̄(s, s̄_i)/Σ_{j=1}^{m} k̄_τ̄(s, s̄_j). We will use κ^a_τ to build the matrices K̄^a and κ̄_τ̄ to build the matrix D.
As shown in Figure 2a, KBRL's matrices P̂^a have a very specific structure, since only transitions ending in states ŝ^a_i ∈ S^a have a nonzero probability of occurrence. Suppose that we want to apply the stochastic-factorization trick to KBRL's MDP. Assuming that the matrices K̄^a have the same structure as P̂^a, when computing P̄^a = K̄^a D we only have to look at the sub-matrices of K̄^a and D corresponding to the n_a nonzero columns of K̄^a. We call these sub-matrices K̇^a ∈ R^{m×n_a} and Ḋ^a ∈ R^{n_a×m}. The strategy of KBSF is to fill out the matrices K̇^a and Ḋ^a with the elements

k̇^a_ij = κ^a_τ(s̄_i, s^a_j) and ḋ^a_ij = κ̄_τ̄(ŝ^a_i, s̄_j).   (10)

Note that, based on the Ḋ^a, one can easily recover D as D ≡ [(Ḋ¹)⊤, (Ḋ²)⊤, ..., (Ḋ^{|A|})⊤]⊤ ∈ R^{n×m}. Similarly, if we let K̇ ≡ [K̇¹, K̇², ..., K̇^{|A|}] ∈ R^{m×n}, then K̄^a ∈ R^{m×n} is matrix K̇ with all elements replaced by zeros except for those corresponding to matrix K̇^a (see Figures 2b and 2c for an illustration). It should thus be obvious that P̄^a = K̄^a D = K̇^a Ḋ^a. In order to conclude the construction of KBSF's MDP, we have to define the vectors of expected rewards r̄^a. As shown in expression (5), the reward functions of KBRL's MDP, R̂^a(s, s′), only depend on the ending state s′. Recalling the interpretation of the rows of K̇^a as transition probabilities from the representative states to the original ones, illustrated in Figure 1, it is clear that

r̄^a = K̇^a r^a = K̄^a r.   (11)

Therefore, the formal specification of KBSF's MDP is given by M̄ ≡ (S̄, A, K̇^a Ḋ^a, K̇^a r^a, γ) = (S̄, A, K̇^a Ḋ^a, K̄^a r, γ) = (S̄, A, P̄^a, r̄^a, γ). As discussed in Section 2.2, KBRL's approximation scheme can be interpreted as the derivation of a finite MDP. In this case, the sample transitions define both the finite state space Ŝ and the model's transition and reward functions. This means that the state space and the dynamics of KBRL's model are inexorably linked: except maybe for degenerate cases, changing one also changes the other. By defining a set of representative states, KBSF decouples the MDP's structure from its particular instantiation.
To see why this is so, note that, if we fix the representative states, different sets of sample transitions will give rise to different models. Conversely, the same set of transitions can generate different MDPs, depending on how the representative states are defined. A step-by-step description of KBSF is given in Algorithm 1. As one can see, KBSF is very simple to understand and to implement. It works as follows: first, the MDP M̄ is built as described above. Then, its action-value function Q̄* is determined through any dynamic programming algorithm. Finally, KBSF returns an approximation of v̂* (the optimal value function of KBRL's MDP) computed as ṽ = ΓDQ̄*. Based on ṽ, one can compute an approximation of KBRL's action-value function Q̂(s, a) by simply replacing V̂* with Ṽ in (7), that is,

Q̃(s, a) = Σ_{i=1}^{n_a} κ^a_τ(s, s^a_i) [ r^a_i + γ Ṽ(ŝ^a_i) ],   (12)

[Figure 2 appears here; see the caption below.]
[Figures 2b and 2c appear here: KBSF's sparse matrices $\dot{\mathbf{D}}$, $\bar{\mathbf{K}}^a$, and $\bar{\mathbf{K}}^b$, and KBSF's dense matrices $\dot{\mathbf{D}}^a$, $\dot{\mathbf{D}}^b$, $\dot{\mathbf{K}}^a$, and $\dot{\mathbf{K}}^b$.]

Figure 2: Matrices built by KBRL and KBSF for the case in which the original MDP has two actions, $a$ and $b$, and $n_a = 3$, $n_b = 2$, and $m = 2$.

In (12), $s \in S$ and $a \in A$. Note that $\tilde{V}(\hat{s}^a_i)$ corresponds to one specific entry of vector $\tilde{\mathbf{v}}$, whose index is given by $\sum_{b=0}^{a-1} n_b + i$, where we assume that $n_0 = 0$.

Algorithm 1 Batch KBSF
Input:
  $S^a = \{(s^a_k, r^a_k, \hat{s}^a_k) \mid k = 1, 2, \ldots, n_a\}$ for all $a \in A$  (sample transitions)
  $\bar{S} = \{\bar{s}_1, \bar{s}_2, \ldots, \bar{s}_m\}$  (set of representative states)
Output: $\tilde{\mathbf{v}} \approx \hat{\mathbf{v}}^*$
for each $a \in A$ do
  Compute matrix $\dot{\mathbf{D}}^a$: $\dot{d}^a_{ij} = \bar{\kappa}_{\bar{\tau}}(\hat{s}^a_i, \bar{s}_j)$
  Compute matrix $\dot{\mathbf{K}}^a$: $\dot{k}^a_{ij} = \kappa^a_\tau(\bar{s}_i, s^a_j)$
  Compute vector $\bar{\mathbf{r}}^a$: $\bar{r}^a_i = \sum_j \dot{k}^a_{ij} r^a_j$
  Compute matrix $\bar{\mathbf{P}}^a = \dot{\mathbf{K}}^a \dot{\mathbf{D}}^a$
Solve $\bar{M} \equiv (\bar{S}, A, \bar{\mathbf{P}}^a, \bar{\mathbf{r}}^a, \gamma)$, i.e., compute $\bar{\mathbf{Q}}^*$
Return $\tilde{\mathbf{v}} = \Gamma \dot{\mathbf{D}} \bar{\mathbf{Q}}^*$, where $\dot{\mathbf{D}} = [(\dot{\mathbf{D}}^1)^\top, (\dot{\mathbf{D}}^2)^\top, \ldots, (\dot{\mathbf{D}}^{|A|})^\top]^\top$

As shown in Algorithm 1, the key point of KBSF's mechanics is the fact that the matrices $\check{\mathbf{P}}^a = \dot{\mathbf{D}} \bar{\mathbf{K}}^a$ are never actually computed; instead, we directly solve the MDP $\bar{M}$ containing only $m$ states. This results in an efficient algorithm that requires only $O(nm|A|d_S + \hat{n}m^2|A|)$ operations and $O(\hat{n}m)$ bits to build a reduced version of KBRL's MDP, where $\hat{n} = \max_a n_a$. After the reduced model $\bar{M}$ has been constructed, KBSF's computational cost becomes a function of $m$ only. In particular, the cost of solving $\bar{M}$ through dynamic programming becomes polynomial in $m$ instead of $n$: while one application of $\hat{T}$, the Bellman operator of $\hat{M}$, is $O(n\hat{n}|A|)$, the computation of $\bar{T}$ is $O(m^2|A|)$. Therefore, KBSF's time and memory complexities are only linear in $n$.
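To make Algorithm 1's solution step concrete, here is a minimal sketch of ours (not the authors' code) that solves the reduced MDP $\bar{M}$ by value iteration, one of the dynamic-programming algorithms the text allows, and then maps $\bar{\mathbf{Q}}^*$ back through $\Gamma \dot{\mathbf{D}}$:

```python
import numpy as np

def solve_reduced_mdp(P_bars, r_bars, gamma, iters=2000, tol=1e-10):
    # Value iteration on the reduced m-state MDP; P_bars and r_bars are
    # dicts keyed by action, holding the m x m matrices P_bar^a and the
    # m-vectors r_bar^a produced by Algorithm 1.
    actions = sorted(P_bars)
    m = r_bars[actions[0]].shape[0]
    Q = np.zeros((m, len(actions)))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q_new = np.column_stack([r_bars[a] + gamma * P_bars[a] @ V
                                 for a in actions])
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q

def kbsf_value(D_blocks, Q):
    # tilde-v = Gamma D_dot Q_bar: stack the blocks D_dot^a, map the
    # reduced action values back to the n sampled end states, and take
    # the row-wise maximum over actions (the Gamma operator).
    D = np.vstack(D_blocks)        # n x m
    return (D @ Q).max(axis=1)     # length-n vector
```

Each application of the Bellman operator in the loop touches only $m \times m$ matrices, which is the source of the $O(m^2|A|)$ per-iteration cost mentioned above.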
We note that, in practice, KBSF's computational requirements can be reduced even further if one enforces the kernels $\kappa^a_\tau$ and $\bar{\kappa}_{\bar{\tau}}$ to be sparse. In particular, given a fixed $\bar{s}_i$, instead of computing $k_\tau(\bar{s}_i, s^a_j)$ for $j = 1, 2, \ldots, n_a$, one can evaluate the kernel on a prespecified neighborhood of $\bar{s}_i$ only. Assuming that $k_\tau(\bar{s}_i, s^a_j)$ is zero for all $s^a_j$ outside this region, one can avoid not only computing the kernel but also storing the resulting values (the same reasoning applies to the computation of $\bar{k}_{\bar{\tau}}(\hat{s}^a_i, \bar{s}_j)$ for a fixed $\hat{s}^a_i$).

4.1 Theoretical results

Since KBSF comes down to the solution of a finite MDP, it always converges to the same approximation $\tilde{\mathbf{v}}$, whose distance to KBRL's optimal value function $\hat{\mathbf{v}}^*$ is bounded by Proposition 2. Once $\tilde{\mathbf{v}}$ is available, the value of any state-action pair can be determined through (12). The following result generalizes Proposition 2 to the entire continuous state space $S$:

Proposition 3 Let $\hat{Q}$ be the value function computed by KBRL through (7) and let $\tilde{Q}$ be the value function computed by KBSF through (12). Then, for any $s \in S$ and any $a \in A$, $|\hat{Q}(s, a) - \tilde{Q}(s, a)| \le \gamma \xi_v$, with $\xi_v$ defined in (8).

Proof:

$|\hat{Q}(s, a) - \tilde{Q}(s, a)| = \left| \sum_{i=1}^{n_a} \kappa^a_\tau(s, s^a_i) \left[ r^a_i + \gamma \hat{V}(\hat{s}^a_i) \right] - \sum_{i=1}^{n_a} \kappa^a_\tau(s, s^a_i) \left[ r^a_i + \gamma \tilde{V}(\hat{s}^a_i) \right] \right| \le \gamma \sum_{i=1}^{n_a} \kappa^a_\tau(s, s^a_i) \left| \hat{V}(\hat{s}^a_i) - \tilde{V}(\hat{s}^a_i) \right| \le \gamma \sum_{i=1}^{n_a} \kappa^a_\tau(s, s^a_i) \, \xi_v = \gamma \xi_v$,

where the first inequality follows from the triangle inequality, the second results from the application of Proposition 2, and the final equality is a consequence of the fact that $\sum_{i=1}^{n_a} \kappa^a_\tau(s, s^a_i) = 1$, that is, the kernel weights define a convex combination.

Proposition 3 makes it clear that the approximation computed by KBSF depends crucially on $\xi_v$. In the remainder of this section we will show that, if the distances between sampled states and the respective nearest representative states are small enough, then we can make $\xi_v$ as small as desired by setting $\bar{\tau}$ to a sufficiently small value.
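The sparse-kernel optimization described at the start of this section can be sketched as follows. This is our own illustrative NumPy sketch, assuming a Gaussian-shaped kernel; it keeps only the $\mu$ largest kernel values per row, mirroring the nearest-neighbor scheme used later in the experiments.

```python
import numpy as np

def sparse_normalized_kernel(X, Y, tau, mu):
    # Evaluate the kernel between rows of X and Y, zero it outside each
    # row's mu largest values, then renormalize so rows sum to one.
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    K = np.exp(-dists / tau)
    # indices of the mu largest kernel values in each row
    idx = np.argpartition(-K, mu - 1, axis=1)[:, :mu]
    mask = np.zeros_like(K, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    K = np.where(mask, K, 0.0)
    return K / K.sum(axis=1, keepdims=True)
```

In a full implementation one would of course store only the nonzero entries (e.g., in a compressed sparse row format) rather than a dense masked matrix; the sketch above just shows the truncation-and-renormalization logic.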
4.1.1 General results

We assume that KBSF's kernel $\bar{\phi}(x): \mathbb{R}^+ \mapsto \mathbb{R}^+$ has the following properties:

(i) $\bar{\phi}(x) \ge \bar{\phi}(y)$ if $x < y$;

(ii) there exist constants $A_{\bar{\phi}} > 0$, $\lambda_{\bar{\phi}} \ge 1$, and $B_{\bar{\phi}} \ge 0$ such that $A_{\bar{\phi}} \exp(-x) \le \bar{\phi}(x) \le \lambda_{\bar{\phi}} A_{\bar{\phi}} \exp(-x)$ if $x \ge B_{\bar{\phi}}$.

Assumption (ii) implies that the function $\bar{\phi}$ is positive and will eventually decay exponentially. Note that we assume that $\bar{\phi}$ is greater than zero everywhere in order to guarantee that $\bar{\kappa}_{\bar{\tau}}$ is well defined for any value of $\bar{\tau}$. It should be straightforward to generalize our results for the case in which $\bar{\phi}$ has finite support by ensuring that, given sets of sample transitions $S^a$ and a set of representative states $\bar{S}$, $\bar{\tau}$ is such that, for any $\hat{s}^a_i \in S^a$, with $a \in A$, there is a $\bar{s}_j \in \bar{S}$ for which $\bar{k}_{\bar{\tau}}(\hat{s}^a_i, \bar{s}_j) > 0$ (note that this assumption is naturally satisfied by the sparse kernels used in some of the experiments; see Appendix B).

Let $rs: S \times \{1, 2, \ldots, m\} \mapsto \bar{S}$ be a function that orders the representative states according to their distance to a given state $s$, that is,

$rs(s, i) = \bar{s}_k \iff \bar{s}_k$ is the $i$th nearest representative state to $s$.   (13)

Define also $dist: S \times \{1, 2, \ldots, m\} \mapsto \mathbb{R}$ as

$dist(s, i) = \| s - rs(s, i) \|$.   (14)

We will show that, for any $\epsilon > 0$ and any $w \in \{1, 2, \ldots, m\}$, there is a $\delta > 0$ such that, if $\max_{a,i} dist(\hat{s}^a_i, w) < \delta$, then we can set $\bar{\tau}$ in order to guarantee that $\xi_v < \epsilon$. To show that, we will need the following two lemmas:

Lemma 4 For any $s^a_i \in S^a$ and any $\epsilon > 0$, there is a $\delta > 0$ such that $|\kappa^a_\tau(s, s^a_i) - \kappa^a_\tau(s', s^a_i)| < \epsilon$ if $\|s - s'\| < \delta$.

Lemma 5 Let $s \in S$, let $m > 1$, and assume there is a $w \in \{1, 2, \ldots, m-1\}$ such that $dist(s, w) < dist(s, w+1)$. Define

$W_w(s) \equiv \{k \mid \|s - \bar{s}_k\| \le dist(s, w)\}$ and $\bar{W}_w(s) \equiv \{1, 2, \ldots, m\} \setminus W_w(s)$.   (15)

Then, for any $\alpha > 0$, $\sum_{k \in \bar{W}_w(s)} \bar{\kappa}_{\bar{\tau}}(s, \bar{s}_k) < \alpha \sum_{k \in W_w(s)} \bar{\kappa}_{\bar{\tau}}(s, \bar{s}_k)$ for $\bar{\tau}$ sufficiently small.

Lemma 4 is basically a continuity argument: it shows that, for any fixed $s^a_i$, $|\kappa^a_\tau(s, s^a_i) - \kappa^a_\tau(s', s^a_i)| \to 0$ as $\|s - s'\| \to 0$.
Lemma 5 states that, if we order the representative states according to their distance to a fixed state $s$, and then partition them into two subsets, we can control the relative magnitude of the corresponding kernels' sums by adjusting the parameter $\bar{\tau}$ (we refer the reader to Appendix A.1 for details on how to set $\bar{\tau}$). Based on these two lemmas, we present the main result of this section:

Proposition 6 Let $w \in \{1, 2, \ldots, m\}$. For any $\epsilon > 0$, there is a $\delta > 0$ such that, if $\max_{a,i} dist(\hat{s}^a_i, w) < \delta$, then we can guarantee that $\xi_v < \epsilon$ by making $\bar{\tau}$ sufficiently small.

Proposition 6 tells us that, regardless of the specific reinforcement learning problem at hand, if the distances between sampled states $\hat{s}^a_i$ and the respective $w$ nearest representative states are small enough, then we can make KBSF's approximation of KBRL's value function as accurate as desired by setting $\bar{\tau}$ to a sufficiently small value (one can see how exactly to set $\bar{\tau}$ in the proof of the proposition). How small the maximum distance $\max_{a,i} dist(\hat{s}^a_i, w)$ should be depends on the particular choice of kernel $\bar{k}_{\bar{\tau}}$ and on the sets of sample transitions $S^a$. Note that a fixed number of representative states $m$ imposes a minimum possible value for $\max_{a,i} dist(\hat{s}^a_i, w)$, and if this value is not small enough decreasing $\bar{\tau}$ may actually hurt the approximation (this is easier to see if we consider that $w = 1$). The optimal value for $\bar{\tau}$ in this case is again context-dependent. As a positive flip side of this statement, we note that, even if $\max_{a,i} dist(\hat{s}^a_i, w) > \delta$, it might be possible to make $\xi_v < \epsilon$ by setting $\bar{\tau}$ appropriately. Therefore, rather than as a practical guide on how to configure KBSF, Proposition 6 should be seen as a theoretical argument showing that KBSF is a sound algorithm, in the sense that in the limit it recovers KBRL's solution.
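The ordering and distance functions $rs$ and $dist$ of Eqs. (13)-(14), which the results above rely on, can be sketched directly in NumPy (an illustrative helper of ours, with 0-based indexing in place of the paper's 1-based notation):

```python
import numpy as np

def rs_dist(s, rep_states):
    # Order the representative states by distance to s, mirroring
    # Eqs. (13)-(14): order[i-1] is the index of the ith nearest
    # representative state, and d[i-1] is its distance dist(s, i).
    d = np.linalg.norm(rep_states - s, axis=1)
    order = np.argsort(d, kind="stable")
    return order, d[order]
```

With this helper, $\max_{a,i} dist(\hat{s}^a_i, w)$ is obtained by evaluating `rs_dist` at every sampled state and taking the maximum of the $w$th returned distance.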
4.1.2 Error rate

In the previous section we deliberately refrained from making assumptions about the kernel $\bar{\kappa}_{\bar{\tau}}$ used by KBSF in order to make Proposition 6 as general as possible. In what follows we show that, by restricting the class of kernels used by our algorithm, we can derive stronger results regarding its behavior. In particular, we derive an upper bound for $\xi_v$ that shows how this approximation error depends on the variables of a given reinforcement learning problem. Our strategy will be to define an admissible kernel whose width is determined based on data. We will need the following assumption:

(iii) $|\phi(x) - \phi(y)| \le C_\phi |x - y|$, with $C_\phi \ge 0$.

Assumption (iii) simply states that the function $\phi$ used to construct the kernel $\kappa^a_\tau$ is Lipschitz continuous with constant $C_\phi$. This is actually part of Ormoneit and Sen's (2002) Assumption 4, and is explicitly listed here for convenience only.

We will now define an auxiliary function which will be used in the definition of our admissible kernel. To simplify the notation, let

$\varsigma^{a,i,j}_k \equiv |\kappa^a_\tau(\hat{s}^a_i, s^a_j) - \kappa^a_\tau(\bar{s}_k, s^a_j)|$.   (16)

Based on (15) and (16), we define the following function:

$F(\bar{\tau}, w \mid S^a, \bar{S}, \kappa^a_\tau, \bar{k}_{\bar{\tau}}) = \dfrac{R_{\max}}{(1-\gamma)^2} \left[ \max_{a,i} \sum_{k \in \bar{W}_w(\hat{s}^a_i)} \bar{\kappa}_{\bar{\tau}}(\hat{s}^a_i, \bar{s}_k) \sum_{j=1}^{n_a} \varsigma^{a,i,j}_k + \dfrac{1}{2} \max_{a,i} \left( 1 - \bar{\kappa}_{\bar{\tau}}(\hat{s}^a_i, rs(\hat{s}^a_i, 1)) \right) \right]$,   (17)

where $\bar{\tau} > 0$, $w \in \{1, 2, \ldots, m\}$, $R_{\max} = \max_{a,i} |r^a_i|$, and $rs$ and $\bar{W}_w$ are defined in (13) and (15), respectively. Note that the definition of $F$ makes clear its dependency on the representative states, sets of sample transitions, and kernels adopted. Lemma 5 implies that, as $\bar{\tau} \to 0$, $\bar{\kappa}_{\bar{\tau}}(\hat{s}^a_i, rs(\hat{s}^a_i, 1)) \to 1$ and $\sum_{k \in \bar{W}_w(\hat{s}^a_i)} \bar{\kappa}_{\bar{\tau}}(\hat{s}^a_i, \bar{s}_k) \to 0$ for all $a$, $i$, and $w$ (see equation (35) in Appendix A.1 for a clear picture). Thus, for any $w$, $F(\bar{\tau}, w) \to 0$ as $\bar{\tau} \to 0$. This leads to the following definition:

Definition 7 Given $\epsilon > 0$ and $w \in \{1, 2, \ldots, m\}$, an admissible kernel $\bar{\kappa}^{\epsilon,w}_{\bar{\tau}}$ is any kernel whose parameter $\bar{\tau}$ is such that $F(\bar{\tau}, w) < \epsilon$.
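Since $F(\bar{\tau}, w) \to 0$ as $\bar{\tau} \to 0$, an admissible width in the sense of Definition 7 can always be found by shrinking $\bar{\tau}$. A minimal sketch of this search (our own illustration; `F` stands for a caller-supplied implementation of the data-dependent function in Eq. (17)):

```python
def admissible_tau_bar(F, w, eps, tau0=1.0, shrink=0.5, max_steps=200):
    # Definition 7 in procedural form: halve tau_bar until
    # F(tau_bar, w) < eps. Because F(tau_bar, w) -> 0 as tau_bar -> 0,
    # the loop terminates for any well-behaved F.
    tau = tau0
    for _ in range(max_steps):
        if F(tau, w) < eps:
            return tau
        tau *= shrink
    raise ValueError("no admissible tau_bar found in max_steps halvings")
```

Any other monotone search (e.g., bisection on $\log \bar{\tau}$) would serve equally well; the point is only that admissibility is checkable from data.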
The definition above allows us to state the following proposition:

Proposition 8 Let $k_{\min} = \phi(d_S/\tau)$, where $\tau$ is the parameter used by KBRL's kernel $\kappa^a_\tau$. Given $\epsilon > 0$ and $w \in \{1, 2, \ldots, m\}$, suppose that KBSF is adopted with an admissible kernel $\bar{\kappa}^{\epsilon,w}_{\bar{\tau}}$. Then,

$\xi_v \le \dfrac{2 w C_\phi R_{\max}}{\tau k_{\min} (1-\gamma)^2} \max_{a,i} dist(\hat{s}^a_i, w) + \epsilon$,   (18)

where $C_\phi$ is the Lipschitz constant of function $\phi$ appearing in Assumption (iii), $R_{\max} = \max_{a,i} |r^a_i|$, and $\gamma \in [0, 1)$ is the discount factor of the underlying MDP.

Proposition 8 shows that $\xi_v$ is bounded by the maximum distance between a sampled state and the $w$th closest representative state, scaled by constants characterizing the MDP and the kernels used by KBSF. Assuming that $\epsilon$ and $w$ are fixed, among the quantities appearing in (18) we only have control over $\tau$ and $\max_{a,i} dist(\hat{s}^a_i, w)$ (the latter through the definition of the representative states). Regarding $\max_{a,i} dist(\hat{s}^a_i, w)$, the same observation made above applies here: given sample transitions $S^a$, a fixed value for $m < n$ imposes a lower bound on this term, and thus on the right-hand side of (18). As $m \to n$, this lower bound approaches some value $\epsilon'$, with $0 \le \epsilon' < \epsilon$. The fact that (18) is generally greater than zero even when $m = n$ reflects the fact that $\xi_v$ depends on $\sigma(\dot{\mathbf{D}})$, the level of stochasticity of $\dot{\mathbf{D}}$ (see (8) and (9)). Since Assumption (ii) implies that $\bar{k}_{\bar{\tau}}(s, s') > 0$ for all $s, s' \in S$, matrix $\dot{\mathbf{D}}$ will never become deterministic, no matter how small $\bar{\tau}$ is (see Figure 2b). If we replace $\bar{k}_{\bar{\tau}}$ by a kernel with finite support, then we can set $\epsilon = 0$ in Definition 7, and therefore in the right-hand side of (18). Needless to say, this does not mean that using a kernel with finite support will lead to better performance in practice.

As for the definition of $\tau$, we see that the right-hand side of (18) decreases as $\tau$ increases. This means that we can make the value function computed by KBSF arbitrarily close to the one computed by KBRL.
Note though that increasing $\tau$ also changes the model constructed by KBRL, and an excessively large value for this parameter makes the resulting approximation meaningless (see (4) and (5)). Similarly, we might be tempted to always set $w = 1$ in order to minimize the right-hand side of (18). Note however that a kernel $\bar{\kappa}^{\epsilon,w}_{\bar{\tau}}$ that is admissible for $w > 1$ may not be so for $w = 1$, in which case the bound would no longer be valid.

4.2 Empirical results

We now present a series of computational experiments designed to illustrate the behavior of KBSF in a variety of challenging domains. We start with a simple problem, the "puddle world," to show that KBSF is indeed capable of compressing the information contained in KBRL's model. We then move to more difficult tasks, and compare KBSF with other state-of-the-art reinforcement-learning algorithms. We start with two classical control tasks, single and double pole-balancing. Next we study two medically-related problems based on real data: the HIV drug schedule and epilepsy-suppression domains. All problems considered in this paper have a continuous state space and a finite number of actions, and were modeled as discounted tasks. The algorithms' results correspond to the performance of the greedy decision policy derived from the final value function computed. In all cases, the decision policies were evaluated on challenging test states from which the tasks cannot be easily solved. The details of the experiments are given in Appendix B.

4.2.1 Puddle world (proof of concept)

In order to show that KBSF is indeed capable of summarizing the information contained in KBRL's model, we use the puddle world task (Sutton, 1996). The puddle world is a simple two-dimensional problem in which the objective is to reach a goal region while avoiding two puddles along the way.
We implemented the task exactly as described by Sutton (1996), except that we used a discount factor of $\gamma = 0.99$ and evaluated the decision policies on a set of pre-defined test states surrounding the puddles (see Appendix B). The experiment was carried out as follows: first, we collected a set of $n$ sample transitions $(s^a_k, r^a_k, \hat{s}^a_k)$ using a random exploration policy (that is, a policy that selects actions uniformly at random). In the case of KBRL, this set of sample transitions defined the model used to approximate the value function. In order to define KBSF's model, the states $\hat{s}^a_k$ were grouped by the k-means algorithm into $m$ clusters and a representative state $\bar{s}_j$ was placed at the center of each resulting cluster (Kaufman and Rousseeuw, 1990). As for the kernels' widths, we varied both $\tau$ and $\bar{\tau}$ in the set $\{0.01, 0.1, 1\}$ (see Table 1). The results reported represent the best performance of the algorithms over 50 runs; that is, for each $n$ and each $m$ we picked the combination of parameters that generated the maximum average return. We use the following convention to refer to specific instances of each method: the first number enclosed in parentheses after an algorithm's name is $n$, the number of sample transitions used in the approximation, and the second one is $m$, the size of the model used to approximate the value function. Note that for KBRL $n$ and $m$ coincide.

In Figures 3a and 3b we observe the effect of fixing the number of transitions $n$ and varying the number of representative states $m$. As expected, KBSF's results improve as $m \to n$. More surprising is the fact that KBSF has essentially the same performance as KBRL using models one order of magnitude smaller. This indicates that KBSF is summarizing well the information contained in the data. Depending on the values of $n$ and $m$, such a compression may represent a significant reduction in the consumption of computational resources.
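The clustering step used to place the representative states can be sketched as a plain Lloyd's k-means in NumPy. This is our own minimal sketch; the experiments presumably use a standard k-means implementation, and the initialization and stopping rule here are simplifications.

```python
import numpy as np

def kmeans_representative_states(states, m, iters=50, seed=0):
    # Lloyd's algorithm: pick m sampled states as initial centers, then
    # alternate between assigning states to the nearest center and
    # moving each center to the mean of its cluster.
    rng = np.random.default_rng(seed)
    centers = states[rng.choice(len(states), size=m, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(states[:, None, :] - centers[None, :, :], axis=2),
            axis=1)
        new = np.array([states[labels == k].mean(axis=0)
                        if np.any(labels == k) else centers[k]
                        for k in range(m)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers
```

The returned centroids play the role of the representative states $\bar{s}_1, \ldots, \bar{s}_m$ fed to Algorithm 1.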
For example, by replacing KBRL(8000) with KBSF(8000, 100), we obtain a decrease of approximately 99.58% in the number of operations performed to find a policy, as shown in Figure 3b (the cost of constructing KBSF's MDP is included in all reported run times). In Figures 3c and 3d we fix $m$ and vary $n$. Observe in Figure 3c how KBRL and KBSF have similar performances, and both improve as $n$ increases. However, since KBSF is using a model of fixed size, its computational cost depends only linearly on $n$, whereas KBRL's cost grows roughly with $n^2 \hat{n}$. This explains the huge difference in the algorithms' run times shown in Figure 3d.

4.2.2 Single and double pole-balancing (comparison with LSPI)

We now evaluate how KBSF compares to other modern reinforcement learning algorithms on more difficult tasks. We first contrast our method with Lagoudakis and Parr's (2003) least-squares policy iteration algorithm (LSPI). Besides its popularity, LSPI is a natural candidate for such a comparison for three reasons: it also builds an approximator of fixed size out of a batch of sample transitions, it has good theoretical guarantees, and it has been successfully applied to several reinforcement learning tasks. We compare the performance of LSPI and KBSF on the pole-balancing task. Pole balancing has a long history as a benchmark problem because it represents a rich class of unstable systems (Michie and Chambers, 1968; Anderson, 1986; Barto et al., 1983). The objective in this problem is to apply forces to a wheeled cart moving along a limited track in order to keep one or more poles hinged to the cart from falling over. There are several variations of the task with different levels of difficulty; among them, balancing two poles side by side is particularly hard (Wieland, 1991). In this paper we compare LSPI and KBSF on both the single- and two-pole versions of the problem. We implemented the tasks using a realistic simulator described by Gomez (2003).
We refer the reader to Appendix B for details on the problems' configuration. The experiments were carried out as described in the previous section, with sample transitions collected by a random policy and then clustered by the k-means algorithm. In both versions of the pole-balancing task LSPI used the same data and approximation architectures as KBSF. To make the comparison with LSPI as fair as possible, we fixed the width of KBSF's kernel $\kappa^a_\tau$ at $\tau = 1$ and varied $\bar{\tau}$ in $\{0.01, 0.1, 1\}$ for both algorithms. Also, policy iteration was used to find a decision policy for the MDPs constructed by KBSF, and this algorithm was run for a maximum of 30 iterations, the same limit used for LSPI.

Figure 4 shows the results of LSPI and KBSF on the single and double pole-balancing tasks. We call attention to the fact that the version of the problems used here is significantly harder than the more commonly-used variants in which the decision policies are evaluated on a single state close to the origin. This is probably the reason why LSPI achieves a success rate of no more than 60% on the single pole-balancing task, as shown in Figure 4a.

[Figure 3 appears here, with four panels: (a) performance as a function of $m$ for KBSF(8000, $m$); (b) run time in seconds (log scale) as a function of $m$; (c) performance as a function of $n$ for KBSF($n$, 100); (d) run time (log scale) as a function of $n$.]

Figure 3: Results on the puddle-world task averaged over 50 runs. The algorithms were evaluated on a set of test states distributed over a region of the state space surrounding the puddles (details in Appendix B). The shadowed regions represent 99% confidence intervals.
In contrast, KBSF's decision policies are able to balance the pole in 90% of the attempts, on average, using as few as $m = 30$ representative states. The results of KBSF on the double pole-balancing task are even more impressive. As Wieland (1991) rightly points out, this version of the problem is considerably more difficult than its single-pole variant, and previous attempts to apply reinforcement-learning techniques to this domain resulted in disappointing performance (Gomez et al., 2006). As shown in Figure 4c, KBSF($10^6$, 200) is able to achieve a success rate of more than 80%. To put this number in perspective, recall that some of the test states are quite challenging, with the two poles inclined and falling in opposite directions.

The good performance of KBSF comes at a relatively low computational cost. A conservative estimate reveals that, were KBRL($10^6$) run on the same computer used for these experiments, we would have to wait for more than 6 months to see the results. KBSF($10^6$, 200) delivers a decision policy in less than 7 minutes. KBSF's computational cost also compares well with that of LSPI, as shown in Figures 4b and 4d. LSPI's policy-evaluation step involves the update and solution of a linear system of equations, which take $O(nm^2)$ and $O(m^3|A|^3)$ operations, respectively. In addition, the policy-update stage requires the definition of $\pi(\hat{s}^a_k)$ for all $n$ states in the set of sample transitions. In contrast, at each iteration KBSF only performs $O(m^3)$ operations to evaluate a decision policy and $O(m^2|A|)$ operations to update it.

4.2.3 HIV drug schedule domain (comparison with fitted Q-iteration)

We now compare KBSF with the fitted Q-iteration algorithm (Ernst et al., 2005; Antos et al., 2007; Munos and Szepesvári, 2008). Fitted Q-iteration is a conceptually simple method that also builds its approximation based solely on sample transitions. Here we adopt this algorithm with an ensemble of trees generated by Geurts et al.'s (2006) extra-trees algorithm.
This combination, which we call FQIT, generated the best results in the extensive empirical evaluation performed by Ernst et al. (2005). We chose FQIT for our comparisons because it has shown excellent performance on both benchmark and real-world reinforcement-learning tasks (Ernst et al., 2005, 2006). In all experiments reported in this paper we used FQIT with ensembles of 30 trees. As detailed in Appendix B, besides the number of trees, FQIT has three main parameters. Among them, the minimum number of elements required to split a node in the construction of the trees, denoted here by $\eta_{\min}$, has a particularly strong effect on both the algorithm's performance and computational cost. Thus, in our experiments we fixed FQIT's parameters at reasonable values selected based on preliminary experiments and only varied $\eta_{\min}$. The respective instances of the tree-based approach are referred to as FQIT($\eta_{\min}$).

We compare FQIT and KBSF on an important medical problem which we will refer to as the HIV drug schedule domain (Adams et al., 2004; Ernst et al., 2006). Typical HIV treatments use drug cocktails containing two types of medication: reverse transcriptase inhibitors (RTI) and protease inhibitors (PI). Despite the success of drug cocktails in maintaining low viral loads, there are several complications associated with their long-term use. This has attracted the interest of the scientific community to the problem of optimizing drug-scheduling strategies.
One strategy that has been receiving a lot of attention recently is structured treatment interruption (STI), in which patients undergo alternate cycles with and without the drugs. Although many successful STI treatments have been reported in the literature, as of now there is no consensus regarding the exact protocol that should be followed (Bajaria et al., 2004).

[Figure 4 appears here, with four panels: (a) fraction of successful episodes on single pole-balancing for LSPI($5 \times 10^4$, $m$) and KBSF($5 \times 10^4$, $m$); (b) run time (log scale) on single pole-balancing; (c) fraction of successful episodes on double pole-balancing for LSPI($10^6$, $m$) and KBSF($10^6$, $m$); (d) run time (log scale) on double pole-balancing.]

Figure 4: Results on the pole-balancing tasks, as a function of the number of representative states $m$, averaged over 50 runs. The values correspond to the fraction of episodes initiated from the test states in which the pole(s) could be balanced for 3000 steps (one minute of simulated time). The test sets were regular grids defined over the hypercube centered at the origin and covering 50% of the state-space axes in each dimension (see Appendix B). Shadowed regions represent 99% confidence intervals.

The scheduling of STI treatments can be seen as a sequential decision problem in which the actions correspond to the types of cocktail that should be administered to a patient (Ernst et al., 2006). To simplify the problem's formulation, it is assumed that RTI and PI drugs are administered at fixed amounts, reducing the actions to the four possible combinations of drugs: none, RTI only, PI only, or both. The goal is to minimize the viral load using as little medication as possible. Following Ernst et al.
(2006), we performed our experiments using a model that describes the interaction of the immune system with HIV. This model was developed by Adams et al. (2004) and has been identified and validated based on real clinical data. The resulting reinforcement learning task has a 6-dimensional continuous state space whose variables describe the overall condition of the patient. We formulated the problem exactly as proposed by Ernst et al. (2006; see Appendix B for details). The strategy used to generate the data also followed the protocol proposed by these authors, which we now briefly explain.

Starting from a batch of 6000 sample transitions generated by a random policy, each algorithm first computed an initial approximation of the problem's optimal value function. Based on this approximation, a 0.15-greedy policy was used to collect a second batch of 6000 transitions, which was merged with the first.³ This process was repeated for 10 rounds, resulting in a total of 60000 sample transitions. We varied FQIT's parameter $\eta_{\min}$ in the set $\{50, 100, 200\}$. For the experiments with KBSF, we fixed $\tau = \bar{\tau} = 1$ and varied $m$ in $\{2000, 4000, \ldots, 10000\}$ (in the rounds in which $m \ge n$ we simply used all states $\hat{s}^a_i$ as representative states). As discussed at the beginning of this section, it is possible to reduce KBSF's computational cost with the use of sparse kernels. In our experiments with the HIV drug schedule task, we only computed the $\mu = 2$ largest values of $k_\tau(\bar{s}_i, \cdot)$ and the $\bar{\mu} = 3$ largest values of $\bar{k}_{\bar{\tau}}(\hat{s}^a_i, \cdot)$ (see Appendix B.2). The representative states $\bar{s}_i$ were selected at random from the set of sampled states $\hat{s}^a_i$ (the reason for this will become clear shortly). Since in the current experiments the number of sample transitions $n$ was fixed, we will refer to the particular instances of our algorithm simply as KBSF($m$).

Before discussing the results, we point out that FQIT's performance on the HIV drug schedule task is very good, comparable to that of Adams et al.
's (2004) approach, which uses a model of the task. FQIT's results, along with KBSF's, are shown in Figure 5. As shown in Figure 5a, the performance of FQIT improves when $\eta_{\min}$ is decreased, as expected. In contrast, increasing the number of representative states $m$ does not have a strong impact on the quality of KBSF's solutions (in fact, in some cases the average return obtained by the resulting policies decreases slightly when $m$ grows). Overall, the performance of KBSF on the HIV drug schedule task is not nearly as impressive as on the previous problems. For example, even when using $m = 10000$ representative states, which corresponds to one sixth of the sampled states, KBSF is unable to reproduce the performance of FQIT with $\eta_{\min} = 50$.

3. As explained by Sutton and Barto (1998), an $\epsilon$-greedy policy selects the action with maximum value with probability $1 - \epsilon$, and with probability $\epsilon$ it picks an action uniformly at random.

[Figure 5 appears here, with four panels: (a) performance and (b) run time as a function of $m$; (c) performance and (d) run time as a function of the number of agents for KBSF(10000).]

Figure 5: Results on the HIV drug schedule task averaged over 50 runs. The STI policies were evaluated for 5000 days starting from a state representing a patient's unhealthy state (see Appendix B). The shadowed regions represent 99% confidence intervals.

On the other hand, when we look at Figure 5b, it is clear that the difference in the algorithms' performance is counterbalanced by a substantial difference in the associated computational costs. As an illustration, note that KBSF(10000) is 15 times faster than FQIT(100) and 20 times faster than FQIT(50).
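The iterated 0.15-greedy data-collection protocol described above can be sketched as follows. This is an illustrative sketch of ours: `step`, `reset`, and `greedy` are caller-supplied placeholders for the environment and the current greedy policy, not part of the authors' code.

```python
import numpy as np

def collect_data(step, reset, greedy, n_actions, rounds=10, batch=6000,
                 eps=0.15, rng=None):
    # Each round gathers `batch` transitions with an eps-greedy policy
    # and appends them to the growing data set, mirroring the protocol
    # of Ernst et al. (2006).
    if rng is None:
        rng = np.random.default_rng()
    data = []
    s = reset()
    for _ in range(rounds):
        for _ in range(batch):
            a = rng.integers(n_actions) if rng.random() < eps else greedy(s)
            s_next, reward = step(s, a)
            data.append((s, a, reward, s_next))
            s = s_next
        # in the full protocol one would rebuild the KBSF model from
        # `data` here and derive a new greedy policy for the next round
    return data
```

With `rounds=10` and `batch=6000` this yields the 60000 transitions mentioned in the text.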
This difference in the algorithms' run times is expected, since each iteration of FQIT involves the construction (or update) of an ensemble of trees, each one requiring at least $O(n \log(n/\eta_{\min}))$ operations, and the improvement of the current decision policy, which is $O(n|A|)$ (Geurts et al., 2006). As discussed before, KBSF's efficiency comes from the fact that its computational cost per iteration is independent of the number of sample transitions $n$.

Note that the fact that FQIT uses an ensemble of trees is both a blessing and a curse. If on the one hand this reduces the variance of the approximation, on the other hand it also increases the algorithm's computational cost (Geurts et al., 2006). Given the big gap between FQIT's and KBSF's time complexities, one may wonder if the latter can also benefit from averaging over several models. In order to verify this hypothesis, we implemented a very simple model-averaging strategy with KBSF: we trained several agents independently, using Algorithm 1 on the same set of sample transitions, and then put them together in a single "committee." In order to increase the variability within the committee of agents, instead of using k-means to determine the representative states $\bar{s}_j$ we simply selected them uniformly at random from the set of sampled states $\hat{s}^a_i$ (note that this has the extra benefit of reducing the method's overall computational cost). The actions selected by the committee of agents were determined through voting; that is, we simply picked the action chosen by the majority of agents, with ties broken randomly.

We do not claim that the approach described above is the best model-averaging strategy to be used with KBSF. However, it seems to be sufficient to boost the algorithm's performance considerably, as shown in Figure 5c. Note how KBSF already performs comparably to FQIT(50) when using only 5 agents in the committee.
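The committee's voting rule can be sketched as follows (an illustrative helper of ours; each agent is any callable mapping a state to an integer action):

```python
import numpy as np

def committee_action(agents, state, rng):
    # Majority vote over independently trained agents, with ties broken
    # uniformly at random, as described in the text.
    votes = np.array([agent(state) for agent in agents])
    counts = np.bincount(votes)
    winners = np.flatnonzero(counts == counts.max())
    return int(rng.choice(winners))
```
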
When this number is increased to 15, the expected return of KBSF's agents is considerably larger than that of the best FQIT agent, with only a small overlap between the 99% confidence intervals associated with the algorithms' results. The good performance of KBSF is even more impressive when we look at Figure 5d, which shows that even when using a committee of 30 agents this algorithm is faster than FQIT(200).

4.2.4 Epilepsy-suppression domain (comparison with LSPI and fitted Q-iteration)

We conclude our empirical evaluation of KBSF by using it to learn a neuro-stimulation policy for the treatment of epilepsy. It has been shown that the electrical stimulation of specific structures in the neural system at fixed frequencies can effectively suppress the occurrence of seizures (Durand and Bikson, 2001). Unfortunately, in vitro neuro-stimulation experiments suggest that fixed-frequency pulses are not equally effective across epileptic systems. Moreover, the long-term use of this treatment may potentially damage the patients' neural tissues. Therefore, it is desirable to develop neuro-stimulation policies that replace the fixed-stimulation regime with an adaptive scheme.

The search for efficient neuro-stimulation strategies can be seen as a reinforcement learning problem. Here we study this problem using a generative model developed by Bush et al. (2009) based on real data collected from epileptic rat hippocampus slices. Bush et al.'s model was shown to reproduce the seizure pattern of the original dynamical system and was later validated through the deployment of a learned treatment policy on a real brain slice (Bush and Pineau, 2009). The associated decision problem has a five-dimensional continuous state space and highly non-linear dynamics. At each time step the agent must choose whether or not to apply an electrical pulse.
The goal is to suppress seizures as much as possible while minimizing the total amount of stimulation needed to do so. The experiments were performed as described in Section 4.2.1, with a single batch of sample transitions collected by a policy that selects actions uniformly at random. Specifically, the random policy was used to collect 50 trajectories of length 10,000, resulting in a total of 500,000 sample transitions. We use as a baseline for our comparisons the already mentioned fixed-frequency stimulation policies usually adopted in in vitro clinical studies (Bush and Pineau, 2009). In particular, we considered policies that apply electrical pulses at frequencies of 0 Hz, 0.5 Hz, 1 Hz, and 1.5 Hz. We compare KBSF with LSPI and FQIT. For this task we ran both LSPI and KBSF with sparse kernels, that is, we only computed the kernels at the 6 nearest neighbors of a given state ($\mu = \bar{\mu} = 6$; see Appendix B.2 for details). This modification made it possible to use $m = 50{,}000$ representative states with KBSF. Since for LSPI the reduction in the computational cost was not very significant, we fixed $m = 50$ to keep its run time within reasonable bounds. Again, KBSF and LSPI used the same approximation architectures, with representative states defined by the k-means algorithm. We fixed $\tau = 1$ and varied $\bar{\tau}$ in $\{0.01, 0.1, 1\}$. FQIT was configured as described in the previous section, with the parameter $\eta_{\min}$ varying in $\{20, 30, ..., 200\}$. In general, we observed that the performance of the tree-based method improved with smaller values of $\eta_{\min}$, with an expected increase in the computational cost. Thus, in order to give an overall characterization of FQIT's performance, we only report the results obtained with the extreme values of $\eta_{\min}$. Figure 6 shows the results on the epilepsy-suppression task.
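The sparse-kernel scheme mentioned above, in which each kernel is evaluated only at the $\mu$ nearest neighbors of a state and set to zero elsewhere, can be sketched as follows. This is a simplified illustration assuming a Gaussian kernel; the function name and the normalization to a stochastic row are our choices, not the paper's implementation:

```python
import numpy as np

def sparse_kernel_row(x, centers, tau, mu):
    """Evaluate a Gaussian kernel k_tau only at the mu nearest centers;
    all other entries are zero, and the row is normalized to sum to 1."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    k = np.zeros(len(centers))
    nearest = np.argsort(d2)[:mu]                 # indices of the mu nearest centers
    k[nearest] = np.exp(-d2[nearest] / tau ** 2)  # kernel mass only on the neighbors
    return k / k.sum()

centers = np.array([[0.0], [1.0], [2.0], [10.0]])
row = sparse_kernel_row(np.array([0.1]), centers, tau=1.0, mu=2)
print(np.count_nonzero(row))  # 2 -> only the two nearest centers get mass
```

Restricting each row to $\mu$ non-zero entries is what makes large values of $m$ affordable, since the cost of building and multiplying the kernel matrices then scales with $\mu$ rather than with $m$.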
In order to obtain different compromises between the problem's two conflicting objectives, we varied the relative magnitude of the penalties associated with the occurrence of seizures and with the application of an electrical pulse (Bush et al., 2009; Bush and Pineau, 2009). Specifically, we fixed the latter at 1 and varied the former with values in $\{10, 20, 40\}$. This appears in the plots as subscripts next to the algorithms' names. As shown in Figure 6a, LSPI's policies seem to prioritize reduction of stimulation at the expense of higher seizure occurrence, which is clearly sub-optimal from a clinical point of view. FQIT(200) also performs poorly, with solutions representing no advance over the fixed-frequency stimulation strategies. In contrast, FQIT(20) and KBSF are both able to generate decision policies that are superior to the 1 Hz policy, which is the most efficient stimulation regime known to date in the clinical literature (Jerger and Schiff, 1995). However, as shown in Figure 6b, KBSF is able to do it at least 100 times faster than the tree-based method.

Figure 6: Results on the epilepsy-suppression problem averaged over 50 runs. (a) Performance, shown as the fraction of stimulation versus the fraction of seizures achieved by LSPI, FQIT(20), FQIT(200), and KBSF, with penalty magnitudes 10, 20, and 40 as subscripts; the lengths of the rectangles' edges represent 99% confidence intervals. (b) Run times in seconds, on a logarithmic scale (confidence intervals do not show up in logarithmic scale). The decision policies were evaluated on episodes of $10^5$ transitions starting from a fixed set of 10 test states drawn uniformly at random.

5. Incremental KBSF

As is clear from the previous section, one characteristic of KBSF that sets it apart from other methods is its low demand in terms of computational resources. Specifically, both the time and memory complexities of our algorithm are linear in the number of sample transitions $n$. In terms of the number of operations performed by the algorithm, this is the best one can do without discarding transitions. However, in terms of memory usage, it is possible to do even better. In this section we show how to build KBSF's approximation incrementally, without ever having access to the entire set of sample transitions at once. Besides reducing the memory complexity of the algorithm, this modification has the additional advantage of making KBSF suitable for on-line reinforcement learning.

In the batch version of KBSF, described in Section 4, the matrices $P^a$ and vectors $r^a$ are determined using all the transitions in the corresponding sets $S^a$. This has two undesirable consequences. First, the construction of the MDP $\bar{M}$ requires an amount of memory of $O(\hat{n}m)$, where $\hat{n} = \max_a n_a$. Although this is a significant improvement over KBRL's memory usage, which is lower bounded by $(\min_a n_a)^2 |A|$, in more challenging domains even a linear dependence on $\hat{n}$ may be impractical. Second, in the batch version of KBSF the only way to incorporate new data into the model $\bar{M}$ is to recompute the multiplication $P^a = K^a D^a$ for all actions $a$ for which there are new sample transitions available. Even if we ignore the issue with memory usage, this is clearly inefficient in terms of computation. In what follows we present an incremental version of KBSF that circumvents these important limitations (Barreto et al., 2012). We assume the same scenario considered in Section 4: there is a set of sample transitions $S^a = \{(s^a_k, r^a_k, \hat{s}^a_k) \mid k = 1, 2, ..., n_a\}$ associated with each action $a \in A$, where $s^a_k, \hat{s}^a_k \in S$ and $r^a_k \in \mathbb{R}$, and a set of representative states $\bar{S} = \{\bar{s}_1, \bar{s}_2, ..., \bar{s}_m\}$, with $\bar{s}_i \in S$.
Suppose now that we split the set of sample transitions $S^a$ into two subsets $S_1$ and $S_2$ such that $S_1 \cap S_2 = \emptyset$ and $S_1 \cup S_2 = S^a$ (we drop the superscript $a$ in the sets $S_1$ and $S_2$ to improve clarity). Without loss of generality, suppose that the sample transitions are indexed so that

$$S_1 \equiv \{(s^a_k, r^a_k, \hat{s}^a_k) \mid k = 1, 2, ..., n_1\} \quad \text{and} \quad S_2 \equiv \{(s^a_k, r^a_k, \hat{s}^a_k) \mid k = n_1 + 1, n_1 + 2, ..., n_1 + n_2 = n_a\}.$$

Let $P^{S_1}$ and $r^{S_1}$ be the matrix $P^a$ and vector $r^a$ computed by KBSF using only the $n_1$ transitions in $S_1$ (if $n_1 = 0$, we define $P^{S_1} = 0 \in \mathbb{R}^{m \times m}$ and $r^{S_1} = 0 \in \mathbb{R}^m$ for all $a \in A$). We want to compute $P^{S_1 \cup S_2}$ and $r^{S_1 \cup S_2}$ from $P^{S_1}$, $r^{S_1}$, and $S_2$, without using the set of sample transitions $S_1$. We start with the transition matrices $P^a$. We know that

$$p^{S_1}_{ij} = \sum_{t=1}^{n_1} k^a_{it} d^a_{tj} = \sum_{t=1}^{n_1} \frac{k_\tau(\bar{s}_i, s^a_t)}{\sum_{l=1}^{n_1} k_\tau(\bar{s}_i, s^a_l)} \frac{\bar{k}_{\bar{\tau}}(\hat{s}^a_t, \bar{s}_j)}{\sum_{l=1}^{m} \bar{k}_{\bar{\tau}}(\hat{s}^a_t, \bar{s}_l)} = \frac{1}{\sum_{l=1}^{n_1} k_\tau(\bar{s}_i, s^a_l)} \sum_{t=1}^{n_1} \frac{k_\tau(\bar{s}_i, s^a_t)\, \bar{k}_{\bar{\tau}}(\hat{s}^a_t, \bar{s}_j)}{\sum_{l=1}^{m} \bar{k}_{\bar{\tau}}(\hat{s}^a_t, \bar{s}_l)}.$$

To simplify the notation, define

$$z^{S_1}_i = \sum_{l=1}^{n_1} k_\tau(\bar{s}_i, s^a_l), \qquad z^{S_2}_i = \sum_{l=n_1+1}^{n_1+n_2} k_\tau(\bar{s}_i, s^a_l), \qquad \text{and} \qquad b^t_{ij} = \frac{k_\tau(\bar{s}_i, s^a_t)\, \bar{k}_{\bar{\tau}}(\hat{s}^a_t, \bar{s}_j)}{\sum_{l=1}^{m} \bar{k}_{\bar{\tau}}(\hat{s}^a_t, \bar{s}_l)},$$

with $t \in \{1, 2, ..., n_1 + n_2\}$. Then, we can write

$$p^{S_1 \cup S_2}_{ij} = \frac{1}{z^{S_1}_i + z^{S_2}_i} \left( \sum_{t=1}^{n_1} b^t_{ij} + \sum_{t=n_1+1}^{n_1+n_2} b^t_{ij} \right) = \frac{1}{z^{S_1}_i + z^{S_2}_i} \left( p^{S_1}_{ij} z^{S_1}_i + \sum_{t=n_1+1}^{n_1+n_2} b^t_{ij} \right).$$

Now, defining $b^{S_2}_{ij} = \sum_{t=n_1+1}^{n_1+n_2} b^t_{ij}$, we have the simple update rule:

$$p^{S_1 \cup S_2}_{ij} = \frac{1}{z^{S_1}_i + z^{S_2}_i} \left( b^{S_2}_{ij} + p^{S_1}_{ij} z^{S_1}_i \right). \tag{19}$$

We can apply similar reasoning to derive an update rule for the rewards $r^a_i$. We know that

$$r^{S_1}_i = \frac{1}{\sum_{l=1}^{n_1} k_\tau(\bar{s}_i, s^a_l)} \sum_{t=1}^{n_1} k_\tau(\bar{s}_i, s^a_t)\, r^a_t = \frac{1}{z^{S_1}_i} \sum_{t=1}^{n_1} k_\tau(\bar{s}_i, s^a_t)\, r^a_t.$$

Let $e^t_i = k_\tau(\bar{s}_i, s^a_t)\, r^a_t$, with $t \in \{1, 2, ..., n_1 + n_2\}$. Then,

$$r^{S_1 \cup S_2}_i = \frac{1}{z^{S_1}_i + z^{S_2}_i} \left( \sum_{t=1}^{n_1} e^t_i + \sum_{t=n_1+1}^{n_1+n_2} e^t_i \right) = \frac{1}{z^{S_1}_i + z^{S_2}_i} \left( z^{S_1}_i r^{S_1}_i + \sum_{t=n_1+1}^{n_1+n_2} e^t_i \right).$$
Defining $e^{S_2}_i = \sum_{t=n_1+1}^{n_1+n_2} e^t_i$, we have the following update rule:

$$r^{S_1 \cup S_2}_i = \frac{1}{z^{S_1}_i + z^{S_2}_i} \left( e^{S_2}_i + r^{S_1}_i z^{S_1}_i \right). \tag{20}$$

Since $b^{S_2}_{ij}$, $e^{S_2}_i$, and $z^{S_2}_i$ can be computed based on $S_2$ only, we can discard the sample transitions in $S_1$ after computing $P^{S_1}$ and $r^{S_1}$. In order to do that, we only have to keep the variables $z^{S_1}_i$. These variables can be stored in $|A|$ vectors $z^a \in \mathbb{R}^m$, resulting in a modest memory overhead. Note that we can apply the ideas above recursively, further splitting the sets $S_1$ and $S_2$ into subsets of smaller size. Thus, we have a fully incremental way of computing KBSF's MDP which requires almost no extra memory. Algorithm 2 shows a step-by-step description of how to update $\bar{M}$ based on a set of sample transitions. Using this method to update its model, KBSF's space complexity drops from $O(\hat{n}m)$ to $O(m^2)$. Since the amount of memory used by KBSF is now independent of $n$, it can process an arbitrary number of sample transitions (or, more precisely, the limit on the amount of data it can process is dictated by time only, not space). Instead of assuming that $S_1$ and $S_2$ are a partition of a fixed data set $S^a$, we can consider that $S_2$ was generated based on the policy learned by KBSF using the transitions in $S_1$. Thus, Algorithm 2 provides a flexible framework for integrating learning and planning within KBSF. Specifically, our algorithm can cycle between learning a model of the problem based on sample transitions, using such a model to derive a policy, and resorting to this policy to collect more data. Algorithm 3 shows a possible implementation of this framework. In order to distinguish it from its batch counterpart, we will call the incremental version of our algorithm iKBSF. iKBSF updates the model $\bar{M}$ and the value function $\bar{Q}$ at fixed intervals $t_m$ and $t_v$, respectively. When $t_m = t_v = n$, we recover the batch version of KBSF; when $t_m = t_v = 1$, we have a fully on-line method which stores no sample transitions.
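Update rules (19) and (20) translate directly into code. The sketch below, written with a Gaussian kernel and variable names of our choosing, folds a chunk of transitions for one action into the stored quantities $P$, $r$, and $z$; the final check illustrates the key property derived above, namely that processing the data in two chunks yields the same model as processing it all at once:

```python
import numpy as np

def gauss(x, y, tau=1.0):
    """Gaussian kernel k_tau, used here purely for illustration."""
    return np.exp(-np.sum((x - y) ** 2) / tau ** 2)

def increment(P, r, z, chunk, rep_states):
    """Fold a chunk of transitions (s, reward, s_next) for one action into
    (P, r, z) via update rules (19) and (20)."""
    m = len(rep_states)
    for i in range(m):
        z2 = sum(gauss(rep_states[i], s) for s, _, _ in chunk)   # z^{S2}_i
        b = np.zeros(m)
        e = 0.0
        for s, rew, s_next in chunk:
            kbar = np.array([gauss(s_next, sb) for sb in rep_states])
            w = gauss(rep_states[i], s)
            b += w * kbar / kbar.sum()                           # b^{S2}_{ij}
            e += w * rew                                         # e^{S2}_i
        P[i] = (b + P[i] * z[i]) / (z[i] + z2)                   # rule (19)
        r[i] = (e + r[i] * z[i]) / (z[i] + z2)                   # rule (20)
        z[i] += z2                                               # keep z^{S1}_i
    return P, r, z

rng = np.random.default_rng(0)
reps = rng.normal(size=(3, 2))
data = [(rng.normal(size=2), rng.normal(), rng.normal(size=2)) for _ in range(10)]

# One pass over all the data versus two passes over two halves:
P1, r1, z1 = increment(np.zeros((3, 3)), np.zeros(3), np.zeros(3), data, reps)
P2, r2, z2 = increment(np.zeros((3, 3)), np.zeros(3), np.zeros(3), data[:5], reps)
P2, r2, z2 = increment(P2, r2, z2, data[5:], reps)
print(np.allclose(P1, P2) and np.allclose(r1, r2))  # True
```

Only $P$, $r$, and the normalization vector $z$ survive between chunks; the transitions themselves can be discarded, which is exactly the memory saving exploited by iKBSF.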
Algorithm 3 allows for the inclusion of new representative states in the model $\bar{M}$. Using Algorithm 2 this is easy to do: given a new representative state $\bar{s}_{m+1}$, it suffices to set $z^a_{m+1} = 0$, $r^a_{m+1} = 0$, and $p^a_{m+1,j} = p^a_{j,m+1} = 0$ for $j = 1, 2, ..., m+1$ and all $a \in A$. Then, in the following applications of update rules (19) and (20), the dynamics of $\bar{M}$ will naturally reflect the existence of state $\bar{s}_{m+1}$.

Algorithm 2: Update KBSF's MDP
Input: $P^a$, $r^a$, $z^a$ for all $a \in A$ (current model); $S^a = \{(s^a_k, r^a_k, \hat{s}^a_k) \mid k = 1, 2, ..., n_a\}$ for all $a \in A$ (sample transitions)
Output: updated $\bar{M}$ and $z^a$
    for each $a \in A$:
        $n_a \leftarrow |S^a|$
        for $t = 1, ..., n_a$: $\bar{z}_t \leftarrow \sum_{l=1}^{m} \bar{k}_{\bar{\tau}}(\hat{s}^a_t, \bar{s}_l)$
        for $i = 1, 2, ..., m$:
            $z \leftarrow \sum_{t=1}^{n_a} k_\tau(\bar{s}_i, s^a_t)$
            for $j = 1, 2, ..., m$:
                $b \leftarrow \sum_{t=1}^{n_a} k_\tau(\bar{s}_i, s^a_t)\, \bar{k}_{\bar{\tau}}(\hat{s}^a_t, \bar{s}_j) / \bar{z}_t$
                $p_{ij} \leftarrow \frac{1}{z^a_i + z}(b + p_{ij} z^a_i)$    ▷ update transition probabilities
            $e \leftarrow \sum_{t=1}^{n_a} k_\tau(\bar{s}_i, s^a_t)\, r^a_t$
            $r_i \leftarrow \frac{1}{z^a_i + z}(e + r_i z^a_i)$    ▷ update rewards
            $z^a_i \leftarrow z^a_i + z$    ▷ update normalization factor

Algorithm 3: Incremental KBSF (iKBSF)
Input: $\bar{S} = \{\bar{s}_1, \bar{s}_2, ..., \bar{s}_m\}$ (set of representative states); $t_m$ (interval to update model); $t_v$ (interval to update value function)
Output: approximate value function $\tilde{Q}(s, a)$
    $P^a \leftarrow 0 \in \mathbb{R}^{m \times m}$, $r^a \leftarrow 0 \in \mathbb{R}^m$, $z^a \leftarrow 0 \in \mathbb{R}^m$, for all $a \in A$
    $\bar{Q} \leftarrow$ arbitrary matrix in $\mathbb{R}^{m \times |A|}$
    $s \leftarrow$ initial state; $a \leftarrow$ random action
    for $t \leftarrow 1, 2, ...$:
        Execute $a$ in $s$ and observe $r$ and $\hat{s}$
        $S^a \leftarrow S^a \cup \{(s, r, \hat{s})\}$
        if ($t \bmod t_m = 0$):    ▷ update model
            Add new representative states to $\bar{M}$ using $S^a$    ▷ this step is optional
            Update $\bar{M}$ and $z^a$ using Algorithm 2 and $S^a$
            $S^a \leftarrow \emptyset$ for all $a \in A$    ▷ discard transitions
        if ($t \bmod t_v = 0$): update $\bar{Q}$    ▷ update value function
        $s \leftarrow \hat{s}$
        Select $a$ based on $\tilde{Q}(s, a) = \sum_{i=1}^{m} \bar{\kappa}_{\bar{\tau}}(s, \bar{s}_i)\, \bar{q}_{ia}$

Note that the inclusion of new representative states does not destroy the information already in the model. This allows iKBSF to refine its approximation on the fly, as needed. One can think of several ways of detecting the need for new representative states.
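The zero-initialization used when a representative state is added amounts to a simple padding of the stored arrays. A schematic numpy version for a single action, with array names following the text, is:

```python
import numpy as np

def add_representative_state(P, r, z):
    """Grow the model from m to m+1 representative states by setting
    z_{m+1} = 0, r_{m+1} = 0, and p_{m+1,j} = p_{j,m+1} = 0, as in the text.
    Subsequent applications of update rules (19) and (20) fill in the new row."""
    m = P.shape[0]
    P_new = np.zeros((m + 1, m + 1))
    P_new[:m, :m] = P                      # existing dynamics are preserved
    return P_new, np.append(r, 0.0), np.append(z, 0.0)

P, r, z = np.full((2, 2), 0.5), np.ones(2), np.ones(2)
P, r, z = add_representative_state(P, r, z)
print(P.shape, r.shape)  # (3, 3) (3,)
```

Because the existing entries of $P$, $r$, and $z$ are untouched, growing the model never discards the information accumulated so far.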
A simple strategy, based on Proposition 6, is to impose a maximum allowed distance between a sampled state $\hat{s}^a_i$ and the nearest representative state, $\mathrm{dist}(\hat{s}^a_i, 1)$. So, anytime the agent encounters a new state $\hat{s}^a_i$ for which $\mathrm{dist}(\hat{s}^a_i, 1)$ is above a given threshold, $\hat{s}^a_i$ is added to the model as $\bar{s}_{m+1}$. In Section 5.2 we report experiments with iKBSF using this approach. Before that, though, we discuss the theoretical properties of the incremental version of our algorithm.

5.1 Theoretical results

As discussed, iKBSF does not need to store sample transitions to build its approximation. However, the computation of $\tilde{Q}(s, a)$ through (12) requires all the tuples $(s^a_i, r^a_i, \hat{s}^a_i)$ to be available. In some situations, it may be feasible to keep the transitions in order to compute $\tilde{Q}(s, a)$. However, if we want to use iKBSF to its full extent, we need a way of computing $\tilde{Q}(s, a)$ without using the sample transitions. This is why, upon reaching state $s$ at time step $t$, iKBSF selects the action to be performed based on

$$\tilde{Q}_t(s, a) = \sum_{i=1}^{m} \bar{\kappa}_{\bar{\tau}}(s, \bar{s}_i)\, \bar{Q}_t(\bar{s}_i, a), \tag{21}$$

where $\bar{Q}_t(\bar{s}_i, a)$ is the action-value function available to iKBSF at the $t$th iteration (see Algorithm 3). Note that we do not assume that iKBSF has computed the optimal value function of its current model $\bar{M}_t$; that is, it may be the case that $\bar{Q}_t(\bar{s}_i, a) \neq \bar{Q}^*_t(\bar{s}_i, a)$. Unfortunately, when we replace (12) with (21), Proposition 3 no longer applies. In this section we address this issue by deriving an upper bound for the difference between $\tilde{Q}_t(s, a)$ and $\hat{Q}_t(s, a)$, the action-value function that would be computed by KBRL using all the transitions processed by iKBSF up to time step $t$. In order to derive our bound, we assume that iKBSF uses a fixed set $\bar{S}$, meaning that no representative states are added to the model $\bar{M}$, and that it never stops refining its model, doing so at every iteration $t$ (i.e., $t_m = 1$ in Algorithm 3).
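For concreteness, the action-selection rule (21) is just a kernel-weighted average of the action-values stored at the representative states. A schematic version, assuming a normalized Gaussian kernel for $\bar{\kappa}_{\bar{\tau}}$ (the data below are illustrative, not from the paper):

```python
import numpy as np

def select_action(s, rep_states, Q_bar, tau_bar=1.0):
    """Greedy action from Q(s, a) = sum_i kappa_bar(s, s_bar_i) * Q_bar[i, a],
    as in equation (21); no sample transitions are needed."""
    k = np.exp(-np.sum((rep_states - s) ** 2, axis=1) / tau_bar ** 2)
    kappa = k / k.sum()                 # normalized kernel weights over s_bar_i
    q = kappa @ Q_bar                   # one interpolated value per action
    return int(np.argmax(q))

reps = np.array([[0.0], [1.0]])
Q_bar = np.array([[1.0, 0.0],          # near s_bar_1, action 0 looks best
                  [0.0, 1.0]])         # near s_bar_2, action 1 looks best
print(select_action(np.array([0.05]), reps, Q_bar))  # 0
```

Since only the $m \times |A|$ table $\bar{Q}$ and the $m$ representative states are consulted, the per-step cost of acting is independent of the number of transitions processed so far.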
We start by showing the following lemma:

Lemma 9. Let $M \equiv (S, A, P^a, r^a, \gamma)$ and $\tilde{M} \equiv (S, A, \tilde{P}^a, \tilde{r}^a, \gamma)$ be two finite MDPs. Then, for any $s \in S$ and any $a \in A$,

$$|Q^*(s, a) - \tilde{Q}^*(s, a)| \leq \frac{1}{1 - \gamma} \max_a \lVert r^a - \tilde{r}^a \rVert_\infty + \frac{\gamma}{2(1 - \gamma)^2} R_{\mathrm{dif}} \max_a \lVert P^a - \tilde{P}^a \rVert_\infty,$$

where $R_{\mathrm{dif}} = \max_{a,i} r^a_i - \min_{a,i} r^a_i$.

Lemma 9 provides an upper bound for the difference in the action-value functions of any two MDPs having the same state space $S$, action space $A$, and discount factor $\gamma$.⁴ Our strategy will be to use this result to bound the error introduced by the application of the stochastic-factorization trick in the context of iKBSF.

4. Strehl and Littman's (2008) Lemma 1 is similar to our result. Their bound is more general than ours, as it applies to any $Q^\pi$, but it is also slightly looser.

When $t_m = 1$, at any time step $t$ iKBSF has a model $\bar{M}_t$ built based on the $t$ transitions observed thus far. As shown in the beginning of this section, $\bar{M}_t$ exactly matches the model that would be computed by batch KBSF using the same data and the same set of representative states. Thus, we can think of the matrices $P^a_t$ and vectors $r^a_t$ available at the $t$th iteration of iKBSF as the result of the stochastic-factorization trick applied with matrices $D_t$ and $K^a_t$. Although iKBSF does not explicitly compute such matrices, they serve as a theoretical foundation on which to build our result.

Proposition 10. Suppose iKBSF is executed with a fixed set of representative states $\bar{S}$ using $t_m = 1$. Let $D_t$, $K^a_t$, and $r^a_t$ be the matrices and the vector (implicitly) computed by this algorithm at iteration $t$.
Then, if $s$ is the state encountered by iKBSF at time step $t$,

$$|\hat{Q}_t(s, a) - \tilde{Q}_t(s, a)| \leq \frac{1}{1 - \gamma} \max_a \lVert \hat{r}^a_t - D_t r^a_t \rVert_\infty + \frac{R_{\mathrm{dif},t}}{(1 - \gamma)^2} \left( \frac{\gamma}{2} \max_a \lVert \hat{P}^a_t - D_t K^a_t \rVert_\infty + \sigma(D_t) \right) + \epsilon_{Q_t}, \tag{22}$$

for any $a \in A$, where $\tilde{Q}_t$ is the value function computed by iKBSF at time step $t$ through (21), $\hat{Q}_t$ is the value function computed by KBRL through (7) based on the same data, $R_{\mathrm{dif},t} = \max_{a,i} r^a_{i,t} - \min_{a,i} r^a_{i,t}$, $\sigma(D_t) = \max_i (1 - \max_j d_{ij,t})$, and $\epsilon_{Q_t} = \max_{i,a} |\bar{Q}^*_t(\bar{s}_i, a) - \bar{Q}_t(\bar{s}_i, a)|$.

Proposition 10 shows that, at any time step $t$, the error in the action-value function computed by iKBSF is bounded above by the quality and the level of stochasticity of the stochastic factorization implicitly computed by the algorithm. The term $\epsilon_{Q_t}$ accounts for the possibility that iKBSF has not computed the optimal value function of its model at step $t$, either because $t_m \neq t_v$ or because the update of $\bar{Q}$ in Algorithm 3 is not done to completion (for example, one can apply the Bellman operator $\bar{T}$ a fixed number of times, stopping short of convergence). The restriction $t_m = 1$ is not strictly necessary if we are willing to compare $\tilde{Q}_t(s, a)$ with $\hat{Q}_{t'}(s, a)$, where $t'$ is the next time step scheduled for a model update. However, such a result would be somewhat circular, since the sample transitions used to build $\hat{Q}_{t'}(s, a)$ may depend on $\tilde{Q}_t(s, a)$. Finally, we note that, given the similarity between the right-hand sides of (8) and (22), it should be trivial to derive results analogous to Propositions 6 and 8 that also apply to iKBSF.

5.2 Empirical results

We now look at the empirical performance of the incremental version of KBSF. Following the structure of Section 4.2, we start with the puddle world task to show that iKBSF is indeed able to match the performance of batch KBSF without storing all sample transitions. Next we exploit the scalability of iKBSF to solve two difficult control tasks, triple pole-balancing and helicopter hovering.
We also compare iKBSF's performance with that of other reinforcement learning algorithms.

5.2.1 Puddle world (proof of concept)

We use the puddle world problem as a proof of concept (Sutton, 1996). In this first experiment we show that iKBSF is able to recover the model that would be computed by its batch counterpart. In order to do so, we applied Algorithm 3 to the puddle-world task using a random policy to select actions.

Figure 7: Results on the puddle-world task averaged over 50 runs, as a function of the number of sample transitions: (a) performance and (b) run times, for $t_m = t_v = \iota$ with $\iota \in \{1000, 2000, 4000, 8000\}$. KBSF used 100 representative states evenly distributed over the state space. Sample transitions were collected by a random policy. The agents were tested on two sets of states surrounding the puddles (see Appendix B).

Figure 7 shows the result of the experiment when we vary the parameters $t_m$ and $t_v$. Note that the case in which $t_m = t_v = 8000$ corresponds to the batch version of KBSF, whose results on the puddle world are shown in Figure 3. As expected, the performance of KBSF policies improves gradually as the algorithm goes through more sample transitions, and in general the intensity of the improvement is proportional to the amount of data processed. More important, the performance of the decision policies after all sample transitions have been processed is essentially the same for all values of $t_m$ and $t_v$, which confirms that iKBSF can be used as an instrument to circumvent KBSF's memory demand. Thus, if one has a batch of sample transitions that does not fit in the available memory, it is possible to split the data into chunks of smaller size and still get the same value-function approximation that would be computed if the entire data set were processed at once.
As shown in Figure 7b, there is only a small computational overhead associated with such a strategy (this results from unnormalizing and normalizing the elements of $P^a$ and $r^a$ multiple times through update rules (19) and (20)).

5.2.2 Triple pole-balancing (comparison with fitted Q-iteration)

As discussed in Section 4.2.2, the pole-balancing task has been addressed in several different versions, and among them simultaneously balancing two poles is particularly challenging (Wieland, 1991). Figures 4c and 4d show that the batch version of KBSF was able to satisfactorily solve the double pole-balancing task. In order to show the scalability of the incremental version of our algorithm, in this section we raise the bar, adding a third pole to the problem. We perform our simulations using the parameters usually adopted with the two-pole problem, with the extra pole having the same length and mass as the longer pole (Gomez, 2003; see Appendix B). This results in a difficult control problem with an 8-dimensional state space $S$. In our experiments with KBSF on the two-pole task we used 200 representative states and $10^6$ sample transitions collected by a random policy. Here we start our experiment with triple pole-balancing using exactly the same configuration, and then we let iKBSF refine its model $\bar{M}$ by incorporating more sample transitions through update rules (19) and (20). We also let iKBSF grow its model if necessary. Specifically, a new representative state is added to $\bar{M}$ on-line every time the agent encounters a sample state $\hat{s}^a_i$ for which $\bar{k}_{\bar{\tau}}(\hat{s}^a_i, \bar{s}_j) < 0.01$ for all $j \in \{1, 2, ..., m\}$. This corresponds to setting a maximum allowed distance from a sampled state to the closest representative state, $\max_{a,i} \mathrm{dist}(\hat{s}^a_i, 1)$. Given the poor performance of LSPI on the double pole-balancing task, shown in Figures 4c and 4d, on the three-pole version of the problem we only compare KBSF with FQIT.
We used FQIT with the same configuration adopted in Sections 4.2.3 and 4.2.4, with the parameter $\eta_{\min}$ varying in the set $\{10000, 1000, 100\}$. As for KBSF, the widths of the kernels were fixed at $\tau = 100$ and $\bar{\tau} = 1$, and sparse kernels were used ($\mu = 50$ and $\bar{\mu} = 10$). In order to show the benefits provided by the incremental version of our algorithm, we assumed that both KBSF and FQIT could store at most $10^6$ sample transitions in memory. In the case of iKBSF, this is not a problem, since we can always split the data into subsets of smaller size and process them incrementally. Here, we used Algorithm 3 with a 0.3-greedy policy, $t_m = t_v = 10^6$, and $n = 10^7$. In the case of FQIT, we have two options to circumvent the limited amount of memory available. The first one is to use a single batch of $10^6$ sample transitions. The other option is to use the initial batch of transitions to compute an approximation of the problem's value function, then use a 0.3-greedy policy induced by this approximation to collect a second batch, and so on. Here we show the performance of FQIT using both strategies. We first compare the performance of iKBSF with that of FQIT using a single batch of sample transitions. This is shown in Figures 8a and 8b. For reference, we also show the results of batch KBSF; that is, we show the performance of the policy that would be computed by our algorithm if we did not have a way of computing its approximation incrementally. As shown in Figure 8a, both FQIT and batch KBSF perform poorly in the triple pole-balancing task, with average success rates below 55%. These results suggest that the amount of data used by these algorithms is insufficient to describe the dynamics of the control task. Of course, we could give more sample transitions to FQIT and batch KBSF. Note however that, since they are batch-learning methods, there is an inherent limit on the amount of data that these algorithms can use to construct their approximation.
In contrast, the amount of memory required by iKBSF is independent of the number of sample transitions $n$. This fact, together with the fact that KBSF's computational complexity is only linear in $n$, allows our algorithm to process a large amount of data in reasonable time. This can be clearly observed in Figure 8b, which shows that iKBSF can build an approximation using $10^7$ sample transitions in under 20 minutes. As a reference for comparison, FQIT(1000) took an average of 1 hour and 18 minutes to process 10 times less data. As shown in Figure 8a, iKBSF's ability to process a large number of sample transitions allows our algorithm to achieve a success rate of approximately 80%. This is similar to the performance of batch KBSF on the two-pole version of the problem (cf. Figure 4).

Figure 8: Results on the triple pole-balancing task, as a function of the number of sample transitions $n$, averaged over 50 runs: (a) performance and (b) run times when FQIT uses a single batch of $10^6$ transitions; (c) performance and (d) run times when ten sets of $10^6$ transitions are used in sequence by the algorithm (see text for details). The values correspond to the fraction of episodes initiated from the test states in which the 3 poles could be balanced for 3000 steps (one minute of simulated time). The test sets were regular grids of 256 cells defined over the hypercube centered at the origin and covering 50% of the state-space axes in each dimension (see Appendix B for details). Shadowed regions represent 99% confidence intervals.

Figure 9: Number of representative states used by iKBSF on the triple pole-balancing task, as a function of the number of sample transitions. Results were averaged over 50 runs (99% confidence intervals are almost imperceptible in the figure).

The good performance of iKBSF on the triple pole-balancing task is especially impressive when we recall that the decision policies were evaluated on a set of test states representing all possible directions of inclination of the three poles. In order to achieve the same level of performance with KBSF, approximately 2 GB of memory would be necessary, even using sparse kernels, whereas iKBSF used less than 0.03 GB of memory. One may argue that the above comparison between FQIT and KBSF is not fair, since the latter used ten times the amount of data used by the former. Thus, in Figures 8c and 8d we show the results of FQIT using 10 batches of $10^6$ transitions, exactly the same number of transitions processed by iKBSF. Here we cannot compare iKBSF with FQIT(100) because the computational cost of the tree-based approach is prohibitively large (it would take over 4 days only to train a single agent, not counting the test phase). When we look at the other instances of the algorithm, we see two opposite trends. Surprisingly, the extra sample transitions actually made the performance of FQIT(10000) worse. On the other hand, FQIT(1000) performs significantly better using more data, though still not as well as iKBSF (both in terms of performance and computing time). To conclude, observe in Figure 9 how the number of representative states $m$ grows as a function of the number of sample transitions processed by iKBSF.
As expected, in the beginning of the learning process $m$ grows fast, reflecting the fact that some relevant regions of the state space have not been visited yet. As more and more data come in, the number of representative states starts to stabilize.

5.2.3 Helicopter hovering task (comparison with SARSA)

In the previous two sections we showed how iKBSF can be used to circumvent the inherent memory limitations of batch learning. We now show how our algorithm performs in a fully on-line regime. For that, we focus on a challenging reinforcement learning task in which the goal is to control an autonomous helicopter. Helicopters have unique control capabilities, such as low-speed flight and in-place hovering, that make them indispensable instruments in many contexts. Such flexibility comes at a price, though: it is widely recognized that a helicopter is significantly harder to control than a fixed-wing aircraft (Ng et al., 2003; Abbeel et al., 2007). Part of this difficulty is due to the complex dynamics of the helicopter, which are not only non-linear, noisy, and asymmetric, but also counterintuitive in some aspects (Ng et al., 2003). An additional complication of controlling an autonomous helicopter is the fact that a wrong action can easily lead to a crash, which is both dangerous and expensive. Thus, the usual practice is to first develop a model of the helicopter's dynamics and then use the model to design a controller (Ng et al., 2003). Here we use the model constructed by Abbeel et al. (2005) based on data collected on actual flights of an XCell Tempest helicopter (see Appendix B). The resulting reinforcement learning problem has a 12-dimensional state space whose variables represent the aircraft's position, orientation, and the corresponding velocities and angular velocities along each axis. In the version of the task considered here the goal is to keep the helicopter hovering as close as possible to a fixed position.
All episodes start at the target location, and at each time step the agent receives a negative reward proportional to the distance from the current state to the desired position. Because the tail rotor's thrust exerts a sideways force on the helicopter, the aircraft cannot be held stationary in the zero-cost state even in the absence of wind. The episode ends when the helicopter leaves the hover regime, that is, when any of the state variables exceeds pre-specified thresholds. The helicopter is controlled via a 4-dimensional continuous vector whose variables represent the longitudinal cyclic pitch, the latitudinal cyclic pitch, the tail rotor collective pitch, and the main rotor collective pitch. By adjusting the value of these variables the pilot can rotate the helicopter around its axes and control the thrust generated by the main rotor. Since KBSF was designed to deal with a finite number of actions, we discretized the set $A$ using 4 values per dimension, resulting in 256 possible actions. The details of the discretization process are given below. Here we compare iKBSF with the SARSA(λ) algorithm using tile coding for value function approximation (Rummery and Niranjan, 1994; Sutton, 1996; see Appendix B). We applied SARSA with $\lambda = 0.05$, a learning rate of 0.001, and 24 tilings containing $4^{12}$ tiles each. Except for $\lambda$, all the parameters were adjusted in a set of preliminary experiments in order to improve the performance of the SARSA agent. We also defined the action-space discretization based on SARSA's performance. In particular, instead of partitioning each dimension into equally sized intervals, we spread the break points unevenly along each axis in order to maximize the return obtained by the SARSA agent. The result of this process is described in Appendix B. The interaction of the SARSA agent with the helicopter hovering task was dictated by an ϵ-greedy policy. Initially we set $\epsilon = 1$, and at every 50,000 transitions the value of $\epsilon$ was decreased by 30%.
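The action-space discretization described above, with 4 values per control dimension, yields $4^4 = 256$ discrete actions. A sketch with hypothetical break points follows; the actual, unevenly spread values are given in Appendix B and are not reproduced here:

```python
import itertools

# Hypothetical break points for each of the four control variables
# (longitudinal cyclic, latitudinal cyclic, tail rotor, main rotor).
# These numbers are illustrative only, not the values used in the paper.
levels = [[-0.2, -0.05, 0.05, 0.2]] * 4

# Every combination of one value per dimension is one discrete action.
actions = list(itertools.product(*levels))
print(len(actions))  # 256 = 4^4 discrete actions
```

In the experiments the break points differ per dimension, so the list `levels` would hold four distinct 4-element lists; the Cartesian-product construction is the same.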
The iKBSF agent collected sample transitions using the same exploration regime. Based on the first batch of 50,000 transitions, $m = 500$ representative states were determined by the k-means algorithm. No representative states were added to iKBSF's model after that. Both the value function and the model were updated at fixed intervals of $t_v = t_m = 50{,}000$ transitions. We fixed $\tau = \bar{\tau} = 1$ and $\mu = \bar{\mu} = 4$. Figure 10 shows the results obtained by SARSA and KBSF on the helicopter hovering task. Note in Figure 10a how the average episode length increases abruptly at the points in which the value of $\epsilon$ is decreased. This is true for both SARSA and KBSF. Also, since the number of steps executed per episode increases over time, the interval in between such abrupt changes decreases in length, as expected. Finally, observe how the performance of both agents stabilizes after around 70,000 episodes, probably because at this point there is almost no exploration taking place anymore. When we compare KBSF and SARSA, it is clear that the former significantly outperforms the latter. Specifically, after the cut-point of 70,000 episodes, the KBSF agent executes approximately 2.25 times the number of steps performed by the SARSA agent before crashing. Looking at Figures 10a and 10b, one may argue at first that there is nothing surprising here: being a model-based algorithm, KBSF is more sample efficient than SARSA, but it is also considerably slower (Atkeson and Santamaria, 1997). Notice though that the difference between the run times of SARSA and KBSF shown in Figure 10b is in part a consequence of the good performance of the latter: since KBSF is able to control the helicopter for a larger number of steps, the corresponding episodes will obviously take longer. A better measure of the algorithms' computational cost can be seen in Figure 10c, which shows the average time taken by each method to perform one transition.
Observe how KBSF's computing time peaks at the points at which the model and the value function are updated. In the beginning KBSF's MDP changes considerably, and as a result the value-function updates take longer. As more data come in, the model starts to stabilize, accelerating the computation of Q (we warm-start policy iteration with the value function computed in the previous round). At this point, KBSF's computational cost per step is only slightly higher than SARSA's, even though the former computes a model of the environment while the latter directly updates the value-function approximation. To conclude, we note that our objective in this section was exclusively to show that KBSF can outperform a well-known on-line algorithm at a comparable computational cost. Therefore, we focused on the comparison of the algorithms rather than on obtaining the best possible performance on the task. Also, it is important to mention that more difficult versions of the helicopter task have been addressed in the literature, usually using domain knowledge in the configuration of the algorithms or to guide the collection of data (Ng et al., 2003; Abbeel et al., 2007). Since our focus here was on evaluating the on-line performance of KBSF, we addressed the problem in its purest form, without using any prior information to help the algorithms solve the task (see Asbah et al., 2013, for experiments on the helicopter hovering task with a more specialized version of KBSF).
Figure 10: Results on the helicopter hovering task, averaged over 50 runs: (a) performance, (b) run time, and (c) average time per step (time of an episode divided by the number of steps). The learned controllers were tested from a fixed state (see text for details). The shadowed regions represent 99% confidence intervals.

6.1 A closer look at KBSF's approximation

As outlined in Section 2, KBRL defines the probability of a transition from state $\hat{s}^b_i$ to state $\hat{s}^a_k$ as $\kappa^a_\tau(\hat{s}^b_i, s^a_k)$, where $a, b \in A$ (see Figure 2a). Note that the kernel $\kappa^a_\tau$ is computed with the initial state $s^a_k$, and not $\hat{s}^a_k$ itself. The intuition behind this is simple: since we know the transition $s^a_k \rightarrow \hat{s}^a_k$ has occurred before, the more similar $\hat{s}^b_i$ is to $s^a_k$, the more likely the transition $\hat{s}^b_i \rightarrow \hat{s}^a_k$ becomes (Ormoneit and Sen, 2002). From (10), it is clear that the computation of the matrices $\mathbf{K}^a$ performed by KBSF follows the same reasoning underlying the computation of KBRL's matrices $\hat{\mathbf{P}}^a$; in particular, $\kappa^a_\tau(\bar{s}_j, s^a_k)$ gives the probability of a transition from $\bar{s}_j$ to $\hat{s}^a_k$. However, when we look at matrix $\mathbf{D}$ things are slightly different: here, the probability of a transition from $\hat{s}^b_i$ to representative state $\bar{s}_j$ is given by $\bar{\kappa}_{\bar{\tau}}(\hat{s}^b_i, \bar{s}_j)$, a computation that involves $\bar{s}_j$ itself. If we were to strictly adhere to KBRL's logic when computing the transition probabilities to the representative states $\bar{s}_j$, the probability of transitioning from $\hat{s}^b_i$ to $\bar{s}_j$ upon executing action $a$ should be a function of $\hat{s}^b_i$ and a state $s$ from which we knew a transition $s \rightarrow \bar{s}_j$ had occurred. In this case we would end up with one matrix $\mathbf{D}^a$ for each action $a \in A$.
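The normalized kernels above are simple to compute; the following is a minimal sketch (function and variable names are ours), assuming a Gaussian mother kernel, of turning raw kernel values into the row-stochastic transition probabilities KBRL uses:

```python
import math

def transition_probs(s, origins, tau):
    """KBRL-style normalized kernel: kappa^a_tau(s, s^a_k) is the raw
    Gaussian kernel k_tau(s, s^a_k) divided by the sum of the kernel over
    all origin states of action a, so the probabilities sum to one."""
    raw = [math.exp(-sum((x - y) ** 2 for x, y in zip(s, o)) / tau ** 2)
           for o in origins]
    z = sum(raw)
    return [w / z for w in raw]
```

The normalization makes each row of the resulting transition matrix a convex combination, which is what licenses the probabilistic interpretation of the kernel values.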
Note though that this formulation of the method is not practical, because the computation of the matrices $\mathbf{D}^a$ would require a transition $(\cdot) \rightarrow \bar{s}_j$ for each $a \in A$ and each $\bar{s}_j \in \bar{S}$. Clearly, such a requirement is hard to fulfill even if we have a generative model available to generate sample transitions. In this section we take a closer look at KBSF's approximation, and provide two interpretations that support the way the matrices involved are built. We argue that KBSF can be seen as a kernel-based approximation of both the model and the value function computed by KBRL. Based on these interpretations we later discuss how KBSF can potentially be beneficial from a statistical point of view.

6.1.1 Approximating the model

We start by looking at how KBRL constructs its model. As shown in Figure 2a, for each action $a \in A$ the state $\hat{s}^b_i$ has an associated stochastic vector $\hat{\mathbf{p}}^a \in \mathbb{R}^{1 \times n}$ whose nonzero entries correspond to the kernel $\kappa^a_\tau(\hat{s}^b_i, \cdot)$ evaluated at $s^a_k$, $k = 1, 2, \ldots, n_a$. Since we are dealing with a continuous state space, it is possible to compute an analogous vector for any $s \in S$ and any $a \in A$. Focusing on the nonzero entries of $\hat{\mathbf{p}}^a$, we define the function $\hat{P}_{S^a}: S \mapsto \mathbb{R}^{1 \times n_a}$,

$$\hat{P}_{S^a}(s) = \hat{\mathbf{p}}^a, \quad \text{with } \hat{p}^a_i = \kappa^a_\tau(s, s^a_i) \text{ for } i = 1, 2, \ldots, n_a. \quad (23)$$

Clearly, full knowledge of the function $\hat{P}_{S^a}$ allows for an exact computation of KBRL's transition matrix $\hat{\mathbf{P}}^a$. Now suppose we do not know $\hat{P}_{S^a}$ and we want to compute an approximation of this function at the points $\hat{s}^a_i \in \hat{S}^a$, for all $a \in A$. Suppose further that we are only given a training set composed of $m$ pairs $(\bar{s}_j, \hat{P}_{S^a}(\bar{s}_j))$. One possible way of approaching this problem is to resort to kernel smoothing techniques. In this case, a particularly common choice is the so-called Nadaraya-Watson kernel-weighted estimator (Hastie et al., 2002, Chapter 6):

$$\frac{\sum_{j=1}^m \bar{k}_{\bar{\tau}}(s, \bar{s}_j)\, \hat{P}_{S^a}(\bar{s}_j)}{\sum_{j=1}^m \bar{k}_{\bar{\tau}}(s, \bar{s}_j)} = \sum_{j=1}^m \bar{\kappa}_{\bar{\tau}}(s, \bar{s}_j)\, \hat{P}_{S^a}(\bar{s}_j). \quad (24)$$

Contrasting the expression above with (10), we see that this is exactly how KBSF computes its approximation $\mathbf{D}\mathbf{K}^a \approx \hat{\mathbf{P}}^a$, with $\hat{P}_{S^a}$ evaluated at the points $\hat{s}^b_i \in \hat{S}^b$, $b = 1, 2, \ldots, |A|$. In this case, the $\bar{\kappa}_{\bar{\tau}}(\hat{s}^b_i, \bar{s}_j)$ are the elements of matrix $\mathbf{D}$, and $\hat{P}_{S^a}(\bar{s}_j)$ is the $j$th row of matrix $\mathbf{K}^a$. Thus, in some sense, KBSF uses KBRL's own kernel-approximation principle to compute a stochastic factorization of $\hat{M}$. One can easily extend the reasoning above to also include rewards by defining a function $\hat{M}_{S^a}: S \mapsto \mathbb{R}^{1 \times (n_a + 1)}$ such that $\hat{M}_{S^a}(s) = [\hat{P}_{S^a}(s), \hat{P}_{S^a}(s)\, \mathbf{r}^a]$. The exposition above also makes it clear that it is possible to build the matrices $\mathbf{D}^a$ using different sets of representative states $\bar{S}^a$ (this corresponds to having $|A|$ training sets composed of pairs $(\bar{s}^a_j, \hat{P}_{S^a}(\bar{s}^a_j))$). As long as all the sets $\bar{S}^a$ have the same cardinality $m$, the stochastic-factorization trick can still be applied. However, in order for the approximation computed by KBSF to make sense, we would have to have one matrix $\mathbf{D}^a$ for each action $a \in A$ (see in Figure 2b how matrix $\mathbf{D}$ is computed). Unfortunately, when different matrices $\mathbf{D}^a$ are used, our core theoretical result, Proposition 2, no longer applies. Although alternative error bounds are possible, as shown by Barreto (2014), they are in general significantly looser than (8). In any case, the extension of KBSF to use different sets of representative states $\bar{S}^a$ may be an interesting topic for future research.

6.1.2 Approximating the value function

We now turn our attention to the way KBRL computes its value function, and show an intuitive way to derive KBSF's update equations. Based on (4) and (5), we see that value iteration's update rule for KBRL's MDP is

$$\hat{Q}_{t+1}(\hat{s}^c_i, a) = \sum_{j=1}^{n_a} \kappa^a_\tau(\hat{s}^c_i, s^a_j) \left[ r^a_j + \gamma \max_b \hat{Q}_t(\hat{s}^a_j, b) \right], \quad (25)$$

where $a, b, c \in A$.
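The Nadaraya-Watson estimator of (24) is straightforward to implement; the following is a minimal sketch under our own naming, with a Gaussian mother kernel, estimating a vector-valued function from its values at the representative states:

```python
import math

def gaussian_kernel(s, s_bar, tau):
    """Unnormalized kernel k_tau(s, s_bar) = exp(-||s - s_bar||^2 / tau^2)."""
    d2 = sum((a - b) ** 2 for a, b in zip(s, s_bar))
    return math.exp(-d2 / tau ** 2)

def nadaraya_watson(s, rep_states, targets, tau):
    """Estimate f(s) as a convex combination of the targets f(s_bar_j),
    weighted by the normalized kernels kappa_tau(s, s_bar_j), as in (24)."""
    weights = [gaussian_kernel(s, sj, tau) for sj in rep_states]
    z = sum(weights)
    return [sum((w / z) * t[i] for w, t in zip(weights, targets))
            for i in range(len(targets[0]))]
```

Because the weights are normalized, the estimate is always a convex combination of the training targets, which is what preserves stochasticity when the targets are rows of a transition matrix.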
Note that at any time step $t$ the equation above can be used to compute an approximation of the $t$-step value function over the entire state space $S$; after convergence to $\hat{Q}$, equation (25) gives rise to (7). From a computational point of view, the potential problem with rule (25) is the fact that updating $\hat{Q}$ takes $O(n \hat{n} |A|)$ operations (recall that $n = \sum_a n_a$ and $\hat{n} = \max_a n_a$). KBRL makes it possible to compute an approximate solution for an MDP with continuous state space $S$ by only updating the value of a finite subset $\hat{S} \subset S$. It is reasonable to ask whether a similar strategy can come to the rescue when $\hat{S}$ is itself too large. Specifically, if the values of the states $\hat{s}^a_i \in \hat{S}$ can be approximated based on the values of a small set of states $\bar{S} \subset S$, with $|\bar{S}| = m$, then only the latter must be updated during dynamic programming's iterative process. Following this reasoning, we can replace $\hat{Q}_t(\hat{s}^a_j, b)$ with $\sum_{l=1}^m \bar{\kappa}_{\bar{\tau}}(\hat{s}^a_j, \bar{s}_l)\, \hat{Q}'_t(\bar{s}_l, b)$ and rewrite (25) as

$$\hat{Q}'_{t+1}(\bar{s}_i, a) = \sum_{j=1}^{n_a} \kappa^a_\tau(\bar{s}_i, s^a_j) \left[ r^a_j + \gamma \max_b \sum_{l=1}^m \bar{\kappa}_{\bar{\tau}}(\hat{s}^a_j, \bar{s}_l)\, \hat{Q}'_t(\bar{s}_l, b) \right] = \sum_{j=1}^{n_a} \kappa^a_\tau(\bar{s}_i, s^a_j) r^a_j + \gamma \sum_{j=1}^{n_a} \kappa^a_\tau(\bar{s}_i, s^a_j) \max_b \sum_{l=1}^m \bar{\kappa}_{\bar{\tau}}(\hat{s}^a_j, \bar{s}_l)\, \hat{Q}'_t(\bar{s}_l, b). \quad (26)$$

Updating $\hat{Q}'$ through (26) is $O(m \hat{n} |A|)$. This is already an improvement over the computational complexity of (25), but the dependence on $\hat{n}$ still precludes the use of (26) in scenarios involving many sample transitions, such as on-line problems, for example. Now, if we swap the positions of the max operator and the kernel $\bar{\kappa}_{\bar{\tau}}$ in (26), we can get rid of the dependence on the number of transitions, in the following way:

$$\bar{Q}_{t+1}(\bar{s}_i, a) = \sum_{j=1}^{n_a} \kappa^a_\tau(\bar{s}_i, s^a_j) r^a_j + \gamma \sum_{j=1}^{n_a} \kappa^a_\tau(\bar{s}_i, s^a_j) \sum_{l=1}^m \bar{\kappa}_{\bar{\tau}}(\hat{s}^a_j, \bar{s}_l) \max_b \bar{Q}_t(\bar{s}_l, b) = \bar{r}^a_i + \gamma \sum_{l=1}^m \bar{p}^a_{il} \max_b \bar{Q}_t(\bar{s}_l, b). \quad (27)$$

Updating $\bar{Q}$ through (27) takes only $O(m^2 |A|)$ operations. It is not difficult to see that (27) is value iteration's update rule for the MDP $\bar{M}$ defined by KBSF (see Algorithm 1).
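Update (27) is ordinary value iteration on an MDP with m states. A minimal sketch (our own naming), assuming the m×m transition matrices and m-vectors of rewards have already been assembled:

```python
def value_iteration_step(P, R, Q, gamma):
    """One application of update (27): Q[a][i] = R[a][i] +
    gamma * sum_l P[a][i][l] * max_b Q[b][l].  One sweep over all states
    and actions costs O(m^2 * |A|).
    P: dict action -> m x m row-stochastic matrix; R: dict action -> m rewards."""
    m = len(next(iter(R.values())))
    v = [max(Q[b][l] for b in Q) for l in range(m)]  # max_b Q_t(s_l, b)
    return {a: [R[a][i] + gamma * sum(P[a][i][l] * v[l] for l in range(m))
                for i in range(m)]
            for a in Q}
```

Iterating this step to convergence and then mapping the resulting values back to the sampled states via the kernels recovers the overall KBSF approximation described in the text.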
Summarizing, if we use a local kernel approximation in the update rule of KBRL's MDP, potentially introducing some error, we get (26), which is faster to compute than (25). If in addition we change the order in which the operations are applied, we get KBSF's update rule (27), which is in some sense a greater deviation from (25) than (26) is, but is even faster to compute. Therefore, KBSF can be interpreted as a computationally efficient way of applying kernel approximation, the basic principle behind KBRL, to KBRL itself.

6.1.3 Curse of dimensionality

In this paper we emphasized the role of KBSF as a technique to reduce KBRL's computational cost. However, it is equally important to ask whether our algorithm provides benefits from a statistical point of view. In particular, instead of trying to approximate KBRL's solution, it may be possible to compute better solutions using the same amount of data. Ormoneit and Sen (2002) showed that, in general, the number of sample transitions needed by KBRL to achieve a certain approximation accuracy grows exponentially with the dimension of the state space, a phenomenon usually referred to as the curse of dimensionality (Bellman, 1961). As with other methods, the only way to avoid such an exponential dependency is to exploit some sort of regularity in the problem's structure; paraphrasing Ormoneit and Sen (2002), one can only break the curse of dimensionality by incorporating prior knowledge into the approximation. In this section we argue that KBSF can be seen as a strategy to do so. In particular, the definition of the representative states can be seen as a practical mechanism to incorporate additional assumptions about the continuous MDP beyond the sample transitions. The fact that KBRL's sample complexity is exponential in $d_S$ is not surprising. As noted by Ernst et al.
(2005), KBRL can be seen as a particular case of the fitted Q-iteration algorithm, which solves a reinforcement learning problem by breaking it into a succession of supervised learning problems. If the MDP defined by KBRL is solved through value iteration, for example, each iteration of this algorithm can be seen as a non-parametric kernel-based regression (Barreto, 2014). It is well known that non-parametric kernel methods suffer from the curse of dimensionality (Stone, 1982; Györfi et al., 2002). As mentioned above, one can circumvent the curse of dimensionality associated with non-parametric methods by exploiting regularities of the learning problem (Györfi et al., 2002). For example, Kveton and Theocharous (2013) showed that, when the dynamics of the underlying MDP are factored, incorporating this knowledge into KBRL leads to a dramatic reduction in the number of sample transitions needed for learning a good decision policy. Another type of regularity arises when the intrinsic dimension of the state space is much smaller than $d_S$, that is, when the data lie close to a low-dimensional manifold. Farahmand et al. (2007) have shown that simple k-nearest-neighbor regression is manifold-adaptive, meaning that it naturally exploits this type of regularity in the approximation problem. Since KBRL's approximation scheme is very similar to k-nearest-neighbor regression, it is likely that this algorithm also has such a property. Another way to avoid the curse of dimensionality is to resort to parametric methods. Parametric methods circumvent the exponential growth of the number of samples by restricting the space of functions spanned by the approximator, which corresponds to making assumptions about the target function (Györfi et al., 2002). Since in KBRL the structure of the approximator is defined by the sampled states $\hat{s}^a_i$, one has little flexibility in controlling the induced function space (see (25)).
Based on the discussions in Sections 6.1.1 and 6.1.2, one can see that KBSF can be cast as a kernel-based approximation in which the structure of the approximator is defined by the representative states; that is, the representative states play the role of basis functions or features (this architecture is similar to that of radial basis function networks; see Chapter 7 of Györfi et al.'s book, 2002). It should be clear, then, that when the representative states are fixed KBSF can be seen as a parametric model, and thus as a potential tool for helping KBRL circumvent the curse of dimensionality. Parametric regression methods have a serious drawback, though: if the target function cannot be well approximated by any function in the space induced by the parametric model, the approximation error will be large regardless of the data. A popular strategy to overcome this issue without incurring unacceptable sample complexity is to resort to adaptive methods that use the available data to adjust the complexity of the approximator to the problem at hand (Farahmand and Szepesvári, 2011). In the case of KBRL, the complexity of the induced function space can be controlled via the kernel's width $\tau$ (Györfi et al., 2002). KBSF provides an alternative way of controlling the complexity of the approximator, through the parameter $m$ (this is akin to limiting the depth of a regression tree, for example; see Barreto's paper, 2014, for a detailed discussion). If we think in terms of the classical bias-variance analysis of statistical estimators, intuitively we are decreasing bias and increasing variance as $m \to n$, assuming that the representative states are systematically defined by a reasonable method; see Section 6.2.2 (Hastie et al., 2002). Note that, unlike $\tau$, $m$ has a clear effect on the method's computational cost. Therefore, KBSF provides both a way of determining the structure of the approximator and extra flexibility in controlling the complexity of the resulting model.
Of course, this does not automatically translate into reduced sample complexity. Even if the structural assumptions induced by KBSF arise in problems of interest, we must develop methods to configure the parameters of the algorithm appropriately (see Section 6.2.2). New theoretical results regarding the approximation properties of KBSF would probably be needed as well, since the current results were derived under the interpretation of our algorithm as a computational device to accelerate KBRL (and thus use its solution as a reference point). These might be interesting directions for future investigations.

6.2 Practical issues

During the execution of our experiments we observed several interesting facts about KBSF which are not immediate from its conceptual definition. In this section we share some of the lessons learned with the reader. We start by discussing the impact of deviating from the theoretical assumptions on the performance of our algorithm. We then present general guidelines on how to configure KBSF to solve reinforcement learning problems.

6.2.1 KBSF's applicability

The theoretical guarantees regarding KBRL's solution assume that the initial states $s^a_i$ in the transitions $(s^a_i, r^a_i, \hat{s}^a_i)$ are uniformly sampled from $S$ (Ormoneit and Sen, 2002, see Assumption 3). This is somewhat restrictive because it precludes the collection of data through direct interaction with the environment. Ormoneit and Sen conjectured that sampling the states $s^a_i$ from a uniform distribution is not strictly necessary, and indeed later Ormoneit and Glynn (2002) relaxed this assumption for the case in which KBRL is applied to an average-reward MDP. In this case, it is only required that the exploration policy used to collect data chooses all actions with positive probability. As described in Sections 4.2 and 5.2, in our computational experiments we collected data through an ϵ-greedy policy (in many cases with ϵ = 1).
The good performance of KBSF corroborates Ormoneit and Sen's conjecture and suggests that Ormoneit and Glynn's results can be generalized to the discounted-reward case, but more theoretical analysis is needed. Ormoneit and Sen (2002) also make some assumptions regarding the smoothness of the reward function and the transition kernel of the continuous MDP (Assumptions 1 and 2). Unfortunately, such assumptions are usually not verifiable in practice. Empirically, we observed that KBSF indeed performs better in problems with smooth dynamics, loosely speaking, problems in which a small perturbation in $s^a_i$ results in a small perturbation in $\hat{s}^a_i$, such as the pole-balancing task. In problems with rougher dynamics, like the epilepsy-suppression task, it is still possible to get good results with KBSF, but in this case it is necessary to use more representative states and narrower kernels (that is, smaller values for $\tau$). As a result, in problems of this type KBSF is less effective in reducing KBRL's computational cost.

6.2.2 KBSF's configuration

KBSF depends on two basic definitions: the set of representative states and the widths of the kernels. As discussed in Section 6.1, the former can be seen as the definition of the structure of the approximator, while the latter are parameters of the approximation. In what follows we discuss the impact of each of them on KBSF's performance. Definition of Representative States: In order to define the representative states we must determine their number, m, and their position in the state space. Both theory and practice indicate that KBSF's performance gets closer to KBRL's as m increases. Thus, a rule of thumb to define the number of representative states is to simply set m to the largest value allowed by the available computational resources (but see Section 6.1.3 for an alternative view).
As for the position of the representative states, looking at expression (24) we see that ideally the $\bar{s}_j$ would be such that the rows of the matrices $\mathbf{K}^a$ form a convex hull containing the rows of the corresponding $\hat{\mathbf{P}}^a$. However, it is easy to see that when $m < n$ such a set of states may not exist. Besides, even when it does exist, finding this set is not a trivial problem. Instead of insisting on finding representative states that allow for an exact representation of the matrices $\hat{\mathbf{P}}^a$, it sounds more realistic to content oneself with an approximate solution to this problem. Assuming the dynamics of the underlying MDP are reasonably smooth, one strategy is to guarantee that every sampled state $\hat{s}^a_i$ is close to at least one representative state $\bar{s}_j$. Since covering the entire state space is generally impractical, we want the $\bar{s}_j$ to reflect the distribution of the data, that is, we want the representative states to be in the regions of the state space where the data lie. This is why in our experiments we clustered the states $\hat{s}^a_i$ and used the clusters' centers as our representative states (this method is also used in the configuration of radial basis function networks, as discussed by Györfi et al., 2002). Despite its simplicity, this strategy usually results in good performance, as shown in Sections 4.2 and 5.2. In our experiments we clustered the data using the popular k-means algorithm, which minimizes the average squared distance from a given point to the center of the corresponding cluster. However, Proposition 6 states that KBSF's approximation error can be controlled by $\max_{a,i} \mathrm{dist}(\hat{s}^a_i, w)$, the maximum distance from a sampled state $\hat{s}^a_i$ to its $w$th nearest representative state. It is reasonable to ask whether KBSF's performance would improve if we used a clustering method that directly minimizes this quantity.
In the k-center clustering problem one seeks a set of k clusters that minimizes the maximum distance from a point to the corresponding cluster center (Gonzales, 1985). Thus, in this case we would be minimizing $\max_{a,i} \mathrm{dist}(\hat{s}^a_i, 1)$. Although finding an exact solution for this problem is NP-hard, there exist fast algorithms that compute good approximations in time linear in the number of points (Feder and Greene, 1988). In order to provide some intuition on how k-center clustering compares to k-means, we implemented an algorithm by Gonzales (1985) which is extremely simple: at each iteration one selects a new cluster center by simply picking the point whose distance to the closest cluster center is maximal. Although simple, this method provides a solution for the k-center problem whose cost is within a factor of two of the optimal solution, which is the best one can hope for in polynomial time (Feder and Greene, 1988). Figure 11 shows the performance of KBSF on the puddle-world task using different strategies to define the representative states. The experiment was carried out exactly as the one described in Section 4.2.1. In addition to the results provided by k-means and k-centers, we also show KBSF's performance when using representative states sampled uniformly at random and evenly distributed over the state space. As expected, the worst results correspond to the case in which representative states are picked at random. On the opposite extreme, representative states evenly distributed over S result in the best performance. Unfortunately, in more realistic scenarios it is impractical to cover the entire state space with representative states, and this is precisely why one may have to resort to strategies like k-means and k-centers. The results shown in Figure 11 suggest that k-centers clustering may indeed be a better choice than k-means, although the difference in the resulting performance is not very significant.
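The farthest-point heuristic described above can be implemented in a few lines; a minimal sketch (our naming), assuming a user-supplied distance function:

```python
def greedy_k_center(points, k, dist):
    """Gonzales's farthest-point heuristic for k-center clustering:
    repeatedly add as a new center the point whose distance to the nearest
    existing center is maximal.  Yields a 2-approximation of the optimal
    k-center cost using O(k * n) distance evaluations."""
    centers = [points[0]]                       # arbitrary first center
    d = [dist(p, centers[0]) for p in points]   # distance to nearest center
    while len(centers) < k:
        i = max(range(len(points)), key=d.__getitem__)
        centers.append(points[i])
        d = [min(d[j], dist(points[j], points[i])) for j in range(len(points))]
    return centers
```

Note that, unlike k-means, this procedure needs no iterative refinement: a single greedy pass already achieves the factor-of-two guarantee mentioned in the text.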
Figure 11: Comparison of different strategies to define the representative states (k-means, k-centers, evenly distributed, and random). Results on the puddle-world task averaged over 50 runs. The errors around the mean correspond to the 99% confidence intervals. See Figure 3 for details.

One advantage of k-center clustering is that it is very simple to implement and is also very fast. Besides, there are algorithms available that compute an approximation on-line, which allows their use with iKBSF (Charikar et al., 1997; Beygelzimer et al., 2006). Clustering methods are a natural choice for defining the representative states because they summarize the distribution of the data, but of course they are not the only way to solve the problem. Ideally, we want a set of representative states that summarizes the dynamics of the underlying dynamical system, and this may not reflect the spatial distribution of the data, regardless of the strategy used to collect transitions. As discussed in Section 6.1, the problem of defining representative states is in some sense akin to the problem of defining features in supervised learning (Guyon and Elisseeff, 2003). Such a connection emphasizes the difficulty of the problem and suggests that the best solution may be domain-dependent. On the other hand, as with the definition of features, the definition of representative states can be seen as an opportunity to incorporate prior knowledge about the domain of interest into the approximation model. For example, if one knows that some regions of the state space are more important than others, this information can be used to allocate more representative states to those regions. Similar reasoning applies to tasks in which the level of accuracy required from the decision policy varies across the state space.
Definition of Kernels' Widths: Given a well-defined strategy to select representative states, the use of KBSF requires the definition of two parameters, $\tau$ and $\bar{\tau}$. The kernels' widths $\tau$ and $\bar{\tau}$ may have a strong effect on KBSF's performance. To illustrate this point, we show in Figure 12 the results of this algorithm on the puddle-world task when $\tau$ and $\bar{\tau}$ are varied in the set {0.01, 0.1, 1} (these were the results used to generate Figure 3).

Figure 12: The impact of the kernels' widths on the performance of KBSF and KBRL. Results on the puddle-world task averaged over 50 runs. See Figure 3 for details. (a) Performance of KBSF(8000, ) for all nine combinations of $\tau, \bar{\tau} \in \{0.01, 0.1, 1\}$. (b) Performance of KBRL(8000); errors around the mean correspond to 99% confidence intervals:

    τ        Average return
    1        1.47 ± 0.42
    0.1      3.01 ± 0.08
    0.01     3.00 ± 0.08

Of course, the best combination of values for $\tau$ and $\bar{\tau}$ depends on the specific problem at hand and on the particular choice of kernels. Here we give some general advice as to how to set these parameters, based on both theory and practice. Since $\tau$ is the same parameter used by KBRL, it should decrease with the number of sample transitions $n$ at an admissible rate (see Ormoneit and Sen's Lemma 2, 2002). Analogously, Proposition 6 suggests that $\bar{\tau}$ should get smaller as $m \to n$ (details in the proof of the proposition in Appendix A). Empirically, we found that a simple strategy that usually facilitates the configuration of KBSF is to rescale the data so that all the variables have approximately the same magnitude, which corresponds to using a weighted norm in the computation of the kernels.
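The rescaling strategy just mentioned can be sketched as follows; this is our own minimal illustration (standardizing to zero mean and unit variance is one common choice, not necessarily the exact rescaling used in the experiments):

```python
import statistics

def standardize(data):
    """Rescale each state variable to roughly the same magnitude (zero
    mean, unit variance), so that the unweighted norm inside the kernel
    treats all dimensions comparably -- equivalent to a weighted norm."""
    cols = list(zip(*data))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constants
    return [[(x - m) / s for x, m, s in zip(row, means, stds)]
            for row in data]
```

After such a rescaling, a single pair of widths (τ, τ̄) applies sensibly to all state dimensions, which is what makes the coarse order-of-magnitude search described next feasible.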
Using this strategy we were able to obtain good results with KBSF on all problems by performing a coarse search in the space of parameters in which we only varied the order of magnitude of $\tau$ and $\bar{\tau}$ (see Table 1). Alternatively, one can fix $\tau$ and $\bar{\tau}$ and define the neighborhood used to compute $k_\tau(\bar{s}_j, \cdot)$ and $\bar{k}_{\bar{\tau}}(\hat{s}^a_i, \cdot)$. As explained in Appendix B.2, in some of our experiments we only computed $k_\tau(\bar{s}_j, \cdot)$ for the $\mu$ sampled states $s^a_i$ closest to $\bar{s}_j$, and only computed $\bar{k}_{\bar{\tau}}(\hat{s}^a_i, \cdot)$ for the $\bar{\mu}$ representative states closest to $\hat{s}^a_i$. When using this approach, a possible way of configuring KBSF is to set $\tau$ and $\bar{\tau}$ to sufficiently large values (so as to guarantee a minimum level of overlap between the kernels) and then adjust $\mu$ and $\bar{\mu}$. The advantage is that adjusting $\mu$ and $\bar{\mu}$ may be more intuitive than directly configuring $\tau$ and $\bar{\tau}$ (cf. Table 1).

7. Previous work

In our experiments we compared KBSF with KBRL, LSPI, fitted Q-iteration, and SARSA, both in terms of computational cost and in terms of the quality of the resulting decision policies. In this section we situate our algorithm in the broader context of approximate reinforcement learning. Approximation in reinforcement learning is an important topic that has generated a huge body of literature. For a broad overview of the subject, we refer the reader to the books by Sutton and Barto (1998), Bertsekas and Tsitsiklis (1996), and Szepesvári (2010). Here we will narrow our attention to kernel-based approximation techniques. We start by noting that the label "kernel-based" is used with two different meanings in the literature. On one side we have kernel smoothing techniques like KBRL and KBSF, which use local kernels essentially as a device to implement smooth instance-based approximation (Hastie et al., 2002). On the other side we have methods that use reproducing kernels to implicitly represent an inner product in a high-dimensional state space (Schölkopf and Smola, 2002).
Although these two frameworks can give rise to approximators with similar structures, they rest on different theoretical foundations. Since reproducing-kernel methods are less directly related to KBSF, we will only describe them briefly. We will then discuss the kernel smoothing approaches in more detail. The basic idea of reproducing-kernel methods is to apply the kernel trick in the context of reinforcement learning (Schölkopf and Smola, 2002). Roughly speaking, the approximation problem is rewritten in terms of inner products only, which are then replaced by a properly-defined kernel. This modification corresponds to mapping the problem to a high-dimensional feature space, resulting in more expressiveness of the function approximator. Perhaps the most natural way of applying the kernel trick in the context of reinforcement learning is to kernelize some formulation of the value-function approximation problem (Xu et al., 2005; Engel et al., 2005; Farahmand, 2011). Another alternative is to approximate the dynamics of an MDP using a kernel-based regression method (Rasmussen and Kuss, 2004; Taylor and Parr, 2009). Following a slightly different line of work, Bhat et al. (2012) propose to kernelize the linear programming formulation of dynamic programming. However, this method is not directly applicable to reinforcement learning, since it is based on the assumption that one has full knowledge of the MDP. A weaker assumption is to suppose that only the reward function is known and to focus on the approximation of the transition function. This is the approach taken by Grunewalder et al. (2012), who propose to embed the conditional distributions defining the transitions of an MDP into a Hilbert space induced by a reproducing kernel. We now turn our attention to kernel-smoothing techniques, which are more closely related to KBRL and KBSF. Kroemer and Peters (2011) propose to apply kernel density estimation to the problem of policy evaluation.
They call their method non-parametric dynamic programming (NPDP). If we use KBRL to compute the value function of a fixed policy, we see many similarities with NPDP, but also some important differences. Like KBRL, NPDP is statistically consistent. Unlike KBRL, which assumes a finite action space A and directly approximates the conditional density functions $P^a(s'|s)$, NPDP assumes that A is continuous and models the joint density $P(s, a, s')$. Kroemer and Peters (2011) showed that the value function of NPDP has a Nadaraya-Watson kernel-regression form. Not surprisingly, this is also the form of KBRL's solution if we fix the policy being evaluated (cf. equation (7)). In both cases, the coefficients of the kernel-based approximation are derived from the value function of the approximate MDP. The key difference is the way the transition matrices are computed in each algorithm. As shown in (4), the transition probabilities of KBRL's model are given by the kernel values themselves. In contrast, the computation of each element of NPDP's transition matrix requires an integration over the continuous state space S. In practice, this is done by numerical-integration techniques that may be very computationally demanding (see for example the experiments performed by Grunewalder et al., 2012). We directly compared NPDP with KBRL because both algorithms build a model whose number of states is dictated by the number of sample transitions n, and neither method explicitly attempts to keep n small. Since in this case each application of the Bellman operator is $O(n^2)$, these methods are not suitable for problems in which a large number of transitions is required, nor are they applicable to on-line reinforcement learning.5 There are, however, kernel-smoothing methods that try to avoid this computational issue, either by keeping n small or by executing a number of operations that grows only linearly with n.
These algorithms are directly comparable with KBSF. One of the first attempts to adapt KBRL to the on-line scenario was that of Jong and Stone (2006). Instead of collecting a batch of sample transitions before the learning process starts, the authors propose to grow such a set incrementally, based on an exploration policy derived from KBRL's current model. To avoid running a dynamic-programming algorithm to completion in between two transitions, which may not be computationally feasible, Jong and Stone (2006) resort to Moore and Atkeson's (1993) prioritized sweeping method to propagate the changes in the value function every time the model is modified. The idea of exploiting the interpretation of KBRL as the derivation of a finite MDP in order to use tabular exploration methods is insightful. However, it is not clear whether smart exploration is sufficient to overcome the computational difficulties arising from the fact that the size of the underlying model is inexorably linked to the number of sample transitions. For example, even using sparse kernels in their experiments, Jong and Stone (2006) had to fix an upper limit for the size of KBRL's model. In this case, once the number of sample transitions has reached the upper limit, all subsequent data must be ignored. Following the same line of work, Jong and Stone (2009) later proposed to guide KBRL's exploration of the state space using Brafman and Tennenholtz's (2003) R-MAX algorithm. In this new paper the authors address the issue with KBRL's scalability more aggressively. First, they show how to combine their approach with Dietterich's (2000) MAX-Q algorithm, allowing the decomposition of KBRL's MDP into a hierarchy of simpler models. While this can potentially reduce the computational burden of finding a policy, such a strategy transfers to the user the responsibility of identifying a useful decomposition of the task.
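Prioritized sweeping, which Jong and Stone (2006) use to avoid full dynamic-programming sweeps, can be sketched in a few lines for a known tabular model. This is our simplified rendition of Moore and Atkeson's (1993) idea, not the authors' exact implementation: states are updated in order of the magnitude of their expected value change.

```python
import heapq
import numpy as np

def prioritized_sweeping(P, R, gamma=0.9, theta=1e-6, max_updates=10000):
    """Propagate value changes through a tabular model (P: |A| x n x n,
    R: |A| x n) by always backing up the state with the largest error."""
    n_actions, n, _ = P.shape
    V = np.zeros(n)
    backup = lambda s: max(R[a, s] + gamma * P[a, s] @ V for a in range(n_actions))
    # seed the queue with every state's initial Bellman error (max-heap via negation)
    heap = [(-abs(backup(s) - V[s]), s) for s in range(n)]
    heapq.heapify(heap)
    for _ in range(max_updates):
        if not heap:
            break
        prio, s = heapq.heappop(heap)
        if -prio < theta:
            break                      # all remaining priorities are negligible
        V[s] = backup(s)
        # re-enqueue predecessors whose backed-up value may have changed
        for sp in range(n):
            if any(P[a, sp, s] > 0 for a in range(n_actions)):
                err = abs(backup(sp) - V[sp])
                if err > theta:
                    heapq.heappush(heap, (-err, sp))
    return V

# tiny deterministic chain 0 -> 1 -> 2 (absorbing), reward 1 in state 2
P = np.zeros((1, 3, 3))
P[0, 0, 1] = P[0, 1, 2] = P[0, 2, 2] = 1.0
R = np.zeros((1, 3))
R[0, 2] = 1.0
V = prioritized_sweeping(P, R, gamma=0.9)
```

On this chain the updates concentrate on the states actually affected by each change, which is the point of the method when the model is modified one transition at a time.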
A more practical approach is to combine KBRL with some stable form of value-function approximation. For that, Jong and Stone (2009) suggest the use of Gordon's (1995) averagers. As shown in Appendix A.2, this setting corresponds to a particular case of KBSF in which representative states are selected among the set of sampled states $\hat s^a_i$. It should be noted that, even when using temporal abstraction and function approximation, Jong and Stone's (2009) approach requires recomputing KBRL's transition probabilities at each new sample, which can be infeasible in reasonably large problems. Kveton and Theocharous (2012) propose a more practical algorithm to reduce KBRL's computational cost. Their method closely resembles the batch version of KBSF. As with our algorithm, Kveton and Theocharous's method defines a set of representative states $\bar s_i$ that give rise to a reduced MDP. The main difference in the construction of the models is that, instead of computing a similarity measure between each sampled state $\hat s^a_i$ and all representative states $\bar s_j$, their algorithm associates each $\hat s^a_i$ with a single $\bar s_j$, which comes down to computing a hard aggregation of the state space $\hat S$. Such an aggregation corresponds to having a matrix D with a single nonzero element per row. In fact, it is possible to rewrite Kveton and Theocharous's (2012) algorithm using KBSF's formalism. In this case, the elements of $D^a$ and $K^a$ would be defined as:

$$k^a_{ij} = \kappa^a_\tau(\bar s_i, rs(s^a_j, 1)) \quad\text{and}\quad d^a_{ij} = \bar\kappa_0(rs(\hat s^a_i, 1), \bar s_j), \qquad (28)$$

where $\bar\kappa_0$ is the normalized kernel induced by an infinitely narrow kernel $\bar k_0(s, s')$ whose value is greater than zero if and only if $s = s'$ (recall from Section 4.1 that $rs(s, 1)$ gives the closest representative state to $s$).

5. We note that, incidentally, all the reproducing-kernel methods discussed in this section also have a computational complexity super-linear in n.
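The difference between KBSF's soft matrix D and the hard aggregation just described can be illustrated directly. In the sketch below (our names and synthetic states), a normalized-kernel D approaches a one-nonzero-per-row D as the kernel width shrinks (cf. Lemma 5):

```python
import numpy as np

rng = np.random.default_rng(1)
sampled_states = rng.uniform(size=(6, 2))   # stand-ins for the states \hat s^a_i
rep_states = rng.uniform(size=(3, 2))       # representative states \bar s_j

def soft_D(states, reps, tau):
    """KBSF-style D: each row is a normalized (soft) kernel assignment of a
    sampled state to the representative states; computed in log-space for
    numerical stability at small tau."""
    d = np.linalg.norm(states[:, None, :] - reps[None, :, :], axis=2)
    logits = -(d / tau) ** 2
    logits -= logits.max(axis=1, keepdims=True)
    k = np.exp(logits)
    return k / k.sum(axis=1, keepdims=True)

def hard_D(states, reps):
    """Hard aggregation: a single 1 per row, at the nearest representative."""
    nearest = np.linalg.norm(states[:, None, :] - reps[None, :, :], axis=2).argmin(axis=1)
    D = np.zeros((len(states), len(reps)))
    D[np.arange(len(states)), nearest] = 1.0
    return D

# as tau shrinks, the soft assignment collapses onto the hard one
assert np.allclose(soft_D(sampled_states, rep_states, 1e-4),
                   hard_D(sampled_states, rep_states), atol=1e-6)
```

With a moderate width the rows of the soft D spread probability mass over several representative states, which is what distinguishes KBSF from a plain state aggregation.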
It is easy to see that we can make the matrix D computed by KBSF as close as desired to a hard aggregation by setting $\bar\tau$ to a sufficiently small value (see Lemma 5). More practically, we can simply plug (28) in place of (10) in Algorithm 1 to exactly recover Kveton and Theocharous's method. Note, though, that by replacing $\kappa^a_\tau(\bar s_i, s^a_j)$ with $\kappa^a_\tau(\bar s_i, rs(s^a_j, 1))$ in the computation of $K^a$, we would be in some sense deviating from KBRL's framework. To see why this is so, observe that if the representative states $\bar s_i$ are sampled from the set of states $\hat s^a_i$, the rows of the matrix $K^a$ computed by KBSF coincide with a subset of the rows of the corresponding KBRL matrix $\hat P^a$ (cf. (23)). However, this property is lost if one uses (28) instead of (10).6

8. Conclusion

This paper presented KBSF, a reinforcement learning algorithm that results from the application of the stochastic-factorization trick to KBRL. KBSF summarizes the information contained in KBRL's MDP in a model of fixed size. By doing so, our algorithm decouples the structure of the model from its configuration. This makes it possible to build an approximation which accounts for both the difficulty of the problem and the computational resources available. One of the main strengths of KBSF is its simplicity. As shown in the paper, its uncomplicated mechanics can be unfolded into two update rules that allow for a fully incremental version of the algorithm. This makes the amount of memory used by KBSF independent of the number of sample transitions. Therefore, with a few lines of code one has a reinforcement-learning algorithm that can be applied to large-scale problems, in both off-line and on-line regimes. KBSF is also a sound method from a theoretical point of view. As discussed, the distance between the value function computed by this algorithm and the one computed by KBRL
6.
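The stochastic-factorization trick summarized above can be checked numerically: if D and K are stochastic, both orderings of the product are valid transition matrices, the swapped one is much smaller, and the two share their nonzero spectrum. A hedged sketch with random factors of our own construction:

```python
import numpy as np

rng = np.random.default_rng(2)

def random_stochastic(rows, cols, rng):
    M = rng.random((rows, cols))
    return M / M.sum(axis=1, keepdims=True)   # rows sum to one

n, m = 50, 5                       # n sample transitions, m representative states
D = random_stochastic(n, m, rng)   # n x m
K = random_stochastic(m, n, rng)   # m x n

P_big = D @ K                      # the n x n model (KBRL's scale)
P_small = K @ D                    # the m x m model KBSF works with

# both products are valid transition matrices ...
assert np.allclose(P_big.sum(axis=1), 1.0)
assert np.allclose(P_small.sum(axis=1), 1.0)
# ... and DK and KD have the same nonzero eigenvalues, so the small model
# retains fundamental spectral properties of the large one
ev_big = np.sort(np.abs(np.linalg.eigvals(P_big)))
ev_small = np.sort(np.abs(np.linalg.eigvals(P_small)))
assert np.allclose(ev_big[-m:], ev_small, atol=1e-6)
```

Iterating with P_small costs O(m^2) per Bellman application instead of O(n^2), which is the source of KBSF's computational savings.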
Note that this observation only means that KBSF is closer to KBRL in this strict sense; in particular, it does not imply that Kveton and Theocharous's algorithm is not a principled method, nor does it mean that it will perform worse than KBSF.

is bounded by two factors: the quality and the level of stochasticity of the underlying stochastic factorization. We showed that both factors can be made arbitrarily small, which implies that, in theory, we can make KBSF's solution as close to KBRL's solution as desired. Theoretical guarantees do not always translate into good performance in practice, though, either because they are built upon unrealistic assumptions or because they do not account for procedural difficulties that arise in practical situations. To ensure that this is not the case with our algorithm, we presented an extensive empirical study in which KBSF was successfully applied to different problems, some of them quite challenging. We also presented general guidelines on how to configure KBSF to solve a reinforcement learning problem. For all the reasons listed above, we believe that KBSF has the potential to become a valuable resource for the solution of reinforcement learning problems. This is not to say that the subject has been exhausted. There are several possibilities for future research, some of which we now briefly discuss. From an algorithmic point of view, perhaps the most pressing demand is for more principled methods to define the representative states. Incidentally, this also opens up the possibility of an automated procedure to set the kernel widths $\bar\tau$ based solely on data. Taking the idea one step further, one can think of having one distinct $\bar\tau_i$ associated with each kernel $\bar\kappa_{\bar\tau}(\cdot, \bar s_i)$ (which would make the similarity between KBSF's approximation and radial basis function networks even stronger).
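The per-kernel-width variation suggested above is easy to state concretely: it amounts to normalized radial basis functions with one width per center. A small illustrative sketch (names and values are ours):

```python
import numpy as np

def normalized_rbf(s, centers, taus):
    """Normalized RBF weights with a distinct width tau_i per center, the
    variation on KBSF's kernel suggested in the text."""
    d = np.linalg.norm(centers - s, axis=1)
    k = np.exp(-(d / taus) ** 2)
    return k / k.sum()

centers = np.array([[0.0], [1.0], [2.0]])   # stand-ins for representative states
taus = np.array([0.3, 0.5, 1.0])            # one width per representative state
w = normalized_rbf(np.array([0.9]), centers, taus)
assert np.isclose(w.sum(), 1.0) and w.argmax() == 1
```

A wide tau_i lets a representative state claim influence over a larger region, so fitting the widths individually is one way to adapt the approximator's resolution to the data.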
Another important advance would be to endow iKBSF with more elaborate exploration strategies, maybe following the line of research initiated by Jong and Stone (2006, 2009). From a theoretical perspective, it may be desirable to detach KBSF from KBRL and analyze the former on its own. In this paper we emphasized the view of our algorithm as a tool to apply KBRL in practice. This view is clearly reflected in our theoretical results, which show how KBSF's solution deviates from KBRL's and how we can control such a deviation. But KBRL is itself an approximation, so perhaps it makes sense to analyze KBSF independently from its precursor. This possibility is particularly appealing when we look at KBSF as an intermediate approach between fully nonparametric and parametric models, as discussed in Section 6.1.3. Looking at KBSF against a broader context, a subject that deserves further investigation is the possibility of building an approximation based on multiple models. Model averaging is not inherently linked to KBSF, and in principle it can be used with virtually any reinforcement learning algorithm. However, KBSF's low computational cost makes it particularly amenable to this technique. Since our algorithm is significantly faster than any method whose complexity per iteration is a function of the number of sample transitions, we can afford to compute several approximations and still have a solution in comparable time (see Section 4.2.3 and Barreto, 2014). Understanding how we can randomize the construction of the individual models and to what extent this can improve the quality of the resulting decision policy is a matter of interest. We conclude by noting that KBSF represents one particular way in which the stochastic-factorization trick can be exploited in the context of reinforcement learning. In principle, any algorithm that builds a model based on sample transitions can resort to the same trick to leverage the use of the data.
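The model-averaging idea mentioned above can be sketched in its simplest form: solve several cheap models and average the resulting value functions. This is a generic sketch under our own assumptions, not the specific randomization scheme the paper has in mind:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=500):
    """Plain value iteration on a tabular model (P: |A| x n x n, R: |A| x n)."""
    n_actions, n, _ = P.shape
    V = np.zeros(n)
    for _ in range(iters):
        V = np.max([R[a] + gamma * P[a] @ V for a in range(n_actions)], axis=0)
    return V

def averaged_values(models, gamma=0.9):
    """Average the value functions of several (e.g., randomized) small models."""
    return np.mean([value_iteration(P, R, gamma) for P, R in models], axis=0)

rng = np.random.default_rng(4)

def random_model(rng, n_actions=2, n_states=4):
    P = rng.random((n_actions, n_states, n_states))
    P /= P.sum(axis=2, keepdims=True)
    return P, rng.random((n_actions, n_states))

models = [random_model(rng) for _ in range(3)]
V_avg = averaged_values(models)
```

Because each small model is cheap to solve, the ensemble costs only a constant factor more than a single approximation, which is the argument made in the text.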
The basic idea remains the same: instead of estimating the transition probabilities between every pair of states, one focuses on a small set of representative states whose values are propagated throughout the state space based on some notion of similarity. We believe that this general framework can potentially be materialized into a multitude of useful reinforcement learning algorithms.

Acknowledgments

The authors would like to thank Amir-massoud Farahmand and Yuri Grinberg for interesting discussions about approximation in reinforcement learning and related subjects. We also thank Branislav Kveton for insightful comments regarding the theory underlying kernel-based reinforcement learning, which were essential for the derivation of our Proposition 8. We are grateful to Keith Bush for making the epilepsy simulator available, and to Alicia Bendz and Ryan Primeau for helping in some of the computational experiments. Finally, we thank the anonymous reviewers for their valuable comments and helpful suggestions to improve the paper. Funding for this research was provided by the NSERC Discovery Grant program.

Appendix A. Theoretical Results

Proposition 2 Proof. Let $\check M \equiv (S, A, \check P^a, \check r^a, \gamma)$, with $\check P^a = DK^a$ and $\check r^a = D\bar r^a$. From the triangle inequality, we know that

$$\|v^* - \Gamma D\bar Q^*\|_\infty \le \|v^* - \check v^*\|_\infty + \|\check v^* - \Gamma D\bar Q^*\|_\infty, \qquad (29)$$

where $\check v^*$ is the optimal value function of $\check M$. Our strategy will be to bound $\|v^* - \check v^*\|_\infty$ and $\|\check v^* - \Gamma D\bar Q^*\|_\infty$. In order to find an upper bound for $\|v^* - \check v^*\|_\infty$, we apply Whitt's (1978) Theorem 3.1 and Corollary (b) of his Theorem 6.1, with all mappings between $M$ and $\check M$ taken to be identities, to obtain

$$\|v^* - \check v^*\|_\infty \le \frac{1}{1-\gamma}\max_a\|r^a - D\bar r^a\|_\infty + \frac{\gamma R_{dif}}{2(1-\gamma)^2}\max_a\|P^a - DK^a\|_\infty, \qquad (30)$$

where we used the fact that $\max_{a,i}\check r^a_i - \min_{a,i}\check r^a_i \le R_{dif}$. It remains to bound $\|\check v^* - \Gamma D\bar Q^*\|_\infty$. Since $\check r^a = D\bar r^a$ and $D\bar P^a = DK^aD = \check P^aD$ for all $a \in A$, the stochastic matrix D satisfies Sorg and Singh's (2009) definition of a soft homomorphism between $\check M$ and $\bar M$ (see equations (25)-(28) in their paper).
Applying Theorem 1 by the same authors, we know that

$$\|\Gamma(\check Q^* - D\bar Q^*)\|_\infty \le \frac{1}{1-\gamma}\sup_{i,t}\Big(1 - \max_j d_{ij}\Big)\delta^{(t)}_i, \qquad (31)$$

where $\delta^{(t)}_i = \max_{j:d_{ij}>0,\,k}\bar q^{(t)}_{jk} - \min_{j:d_{ij}>0,\,k}\bar q^{(t)}_{jk}$ and $\bar q^{(t)}_{jk}$ are elements of $\bar Q^{(t)}$, the optimal $t$-step action-value function of $\bar M$. Since $\|\Gamma\check Q^* - \Gamma D\bar Q^*\|_\infty \le \|\Gamma(\check Q^* - D\bar Q^*)\|_\infty$ and, for all $t > 0$, $\delta^{(t)}_i \le (1-\gamma)^{-1}(\max_{a,k}\bar r^a_k - \min_{a,k}\bar r^a_k)$, we can write

$$\|\check v^* - \Gamma D\bar Q^*\|_\infty \le \frac{R_{dif}}{(1-\gamma)^2}\max_i\Big(1 - \max_j d_{ij}\Big) = \frac{R_{dif}}{(1-\gamma)^2}\,\sigma(D). \qquad (32)$$

Substituting (30) and (32) back into (29), we obtain (8).

Lemma 4 Proof. Define the function

$$\psi^{a,i}_{\tau,s}(s') = \left|\frac{k_\tau(s, s^a_i)}{\sum_{j=1}^{n_a} k_\tau(s, s^a_j)} - \frac{k_\tau(s', s^a_i)}{\sum_{j=1}^{n_a} k_\tau(s', s^a_j)}\right| = \left|\frac{\phi(\|s - s^a_i\|/\tau)}{\sum_{j=1}^{n_a}\phi(\|s - s^a_j\|/\tau)} - \frac{\phi(\|s' - s^a_i\|/\tau)}{\sum_{j=1}^{n_a}\phi(\|s' - s^a_j\|/\tau)}\right|.$$

Since $\phi$ is continuous, it is obvious that $\psi^{a,i}_{\tau,s}(s')$ is also continuous in $s'$. The property follows from the fact that $\lim_{s' \to s}\psi^{a,i}_{\tau,s}(s') = 0$.

Lemma 5. Let $s \in S$, let $m > 1$, and assume there is a $w \in \{1, 2, ..., m-1\}$ such that $\mathrm{dist}(s, w) < \mathrm{dist}(s, w+1)$. Define $W_w(s) \equiv \{k \mid \|s - \bar s_k\| \le \mathrm{dist}(s, w)\}$ and $\bar W_w(s) \equiv \{1, 2, ..., m\} - W_w(s)$. Then, for any $\alpha > 0$, we can guarantee that

$$\sum_{k \in \bar W_w(s)}\bar\kappa_{\bar\tau}(s, \bar s_k) < \alpha\sum_{k \in W_w(s)}\bar\kappa_{\bar\tau}(s, \bar s_k) \qquad (33)$$

by making $\bar\tau < \varphi(s, w, m, \alpha)$, where

$$\varphi(s, w, m, \alpha) = \min(\varphi_1(s, w), \varphi_2(s, w, m, \alpha)), \qquad (34)$$

$$\varphi_1(s, w) = \begin{cases}\mathrm{dist}(s, w)/B_\phi & \text{if } B_\phi > 0,\\ \infty & \text{otherwise,}\end{cases}\qquad
\varphi_2(s, w, m, \alpha) = \begin{cases}\dfrac{\mathrm{dist}(s, w) - \mathrm{dist}(s, w+1)}{\ln\big(\alpha w/((m-w)\lambda_\phi)\big)} & \text{if } \dfrac{\alpha w}{(m-w)\lambda_\phi} < 1,\\ \infty & \text{otherwise.}\end{cases}$$

(We restate the lemma here showing explicitly how to define $\bar\tau$; this detail was omitted in the main body of the text to improve clarity.)

Proof. Expression (33) can be rewritten as

$$\frac{\sum_{k \in \bar W_w(s)}\bar k_{\bar\tau}(s, \bar s_k)}{\sum_{i=1}^m \bar k_{\bar\tau}(s, \bar s_i)} < \alpha\,\frac{\sum_{k \in W_w(s)}\bar k_{\bar\tau}(s, \bar s_k)}{\sum_{i=1}^m \bar k_{\bar\tau}(s, \bar s_i)} \iff \sum_{k \in \bar W_w(s)}\bar k_{\bar\tau}(s, \bar s_k) < \alpha\sum_{k \in W_w(s)}\bar k_{\bar\tau}(s, \bar s_k), \qquad (35)$$

which is equivalent to

$$\sum_{k \in \bar W_w(s)}\phi\big(\|s - \bar s_k\|/\bar\tau\big) < \alpha\sum_{k \in W_w(s)}\phi\big(\|s - \bar s_k\|/\bar\tau\big). \qquad (36)$$

Based on Assumption (i), we know that a sufficient condition for (36) to hold is

$$(m - w)\,\phi\big(\mathrm{dist}(s, w+1)/\bar\tau\big) < \alpha w\,\phi\big(\mathrm{dist}(s, w)/\bar\tau\big). \qquad (37)$$

Let $\beta = \alpha w/(m - w)$. If $\beta > 1$, then (37) is always true, regardless of the value of $\bar\tau$. We now show that, when $\beta \le 1$, it is always possible to set $\bar\tau$ in order to guarantee that (37) holds.
Let $z = \mathrm{dist}(s, w)$ and let $\delta = \mathrm{dist}(s, w+1) - z$. From Assumption (ii), we know that, if $B_\phi = 0$ or $\bar\tau < z/B_\phi$,

$$\frac{\phi((z+\delta)/\bar\tau)}{\phi(z/\bar\tau)} \le \frac{\lambda_\phi A_\phi \exp(-(z+\delta)/\bar\tau)}{A_\phi \exp(-z/\bar\tau)} = \lambda_\phi\,\frac{\exp(-(z+\delta)/\bar\tau)}{\exp(-z/\bar\tau)}.$$

Thus, in order for the result to follow, it suffices to show that

$$\frac{\exp(-(z+\delta)/\bar\tau)}{\exp(-z/\bar\tau)} < \frac{\beta}{\lambda_\phi}. \qquad (38)$$

We know that, since $\delta > 0$, if $\beta/\lambda_\phi \ge 1$ inequality (38) is true. Otherwise,

$$\frac{\exp(-(z+\delta)/\bar\tau)}{\exp(-z/\bar\tau)} < \frac{\beta}{\lambda_\phi} \iff \exp(-\delta/\bar\tau) < \frac{\beta}{\lambda_\phi} \iff \bar\tau < -\frac{\delta}{\ln(\beta/\lambda_\phi)}.$$

Thus, by taking $\bar\tau < -\delta/\ln(\beta/\lambda_\phi)$ if $B_\phi = 0$, or $\bar\tau < \min\big(-\delta/\ln(\beta/\lambda_\phi),\, z/B_\phi\big)$ otherwise, the result follows.

Proposition 6 Proof. From (6) and (11), we know that

$$\|\hat r^a - D\bar r^a\|_\infty = \|\hat P^a r^a - DK^a r^a\|_\infty = \|(\hat P^a - DK^a)r^a\|_\infty \le \|\hat P^a - DK^a\|_\infty\,\|r^a\|_\infty. \qquad (39)$$

Thus, plugging (39) back into (8), it is clear that there is a $\nu > 0$ such that $\xi_v < \epsilon$ if

$$\max_a\|\hat P^a - DK^a\|_\infty < \nu \qquad (40)$$

and

$$\max_i\Big(1 - \max_j d_{ij}\Big) < \nu. \qquad (41)$$

We start by showing that there is a $\delta > 0$ and a $\theta > 0$ such that expression (40) is true if $\max_{a,i}\mathrm{dist}(\hat s^a_i, w) < \delta$ and $\bar\tau < \theta$. Let $\check P^a = DK^a$, and let $\hat p^a_i \in \mathbb R^{1\times n_a}$ and $\check p^a_i \in \mathbb R^{1\times n_a}$ be the $i$th rows of $\hat P^a$ and $\check P^a$, respectively. Then,

$$\begin{aligned}
\|\hat p^a_i - \check p^a_i\|_1 &= \sum_{j=1}^{n_a}\Big|\hat p^a_{ij} - \sum_{k=1}^m d^a_{ik}k^a_{kj}\Big| = \sum_{j=1}^{n_a}\Big|\kappa^a_\tau(\hat s^a_i, s^a_j) - \sum_{k=1}^m \bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\kappa^a_\tau(\bar s_k, s^a_j)\Big|\\
&= \sum_{j=1}^{n_a}\Big|\sum_{k=1}^m \bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\kappa^a_\tau(\hat s^a_i, s^a_j) - \sum_{k=1}^m \bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\kappa^a_\tau(\bar s_k, s^a_j)\Big|\\
&= \sum_{j=1}^{n_a}\Big|\sum_{k=1}^m \bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\big[\kappa^a_\tau(\hat s^a_i, s^a_j) - \kappa^a_\tau(\bar s_k, s^a_j)\big]\Big|\\
&\le \sum_{j=1}^{n_a}\sum_{k=1}^m \bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\,\big|\kappa^a_\tau(\hat s^a_i, s^a_j) - \kappa^a_\tau(\bar s_k, s^a_j)\big|. \qquad (42)
\end{aligned}$$

Our strategy will be to show that, for any $a$, $i$, and $j$, there is a $\delta^{a,i,j} > 0$ and a $\theta^{a,i,j} > 0$ such that

$$\sum_{k=1}^m \bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\,\varsigma^{a,i,j}_k < \frac{\nu}{n_a} \qquad (43)$$

if $\mathrm{dist}(\hat s^a_i, w) < \delta^{a,i,j}$ and $\bar\tau < \theta^{a,i,j}$ (recall that $\varsigma^{a,i,j}_k$ had already been defined in (16)). To simplify the notation, we will use the superscript $z$ meaning "$a, i, j$". From Lemma 4 we know that there is a $\delta^z > 0$ such that $\varsigma^z_k < \nu/n_a$ if $\|\hat s^a_i - \bar s_k\| < \delta^z$. Let $W^z \equiv \{k \mid \|\hat s^a_i - \bar s_k\| < \delta^z\}$ and $\bar W^z \equiv \{1, 2, ..., m\} - W^z$.
Since we are assuming that $\mathrm{dist}(\hat s^a_i, w) < \delta^z$, we know that $W^z \ne \emptyset$. In this case, we can write:

$$\sum_{k=1}^m \bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\varsigma^z_k = \sum_{k \in W^z}\bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\varsigma^z_k + \sum_{k \in \bar W^z}\bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\varsigma^z_k.$$

Define

$$\varsigma^z_{min} = \begin{cases}\min_{k \in W^z}\{\varsigma^z_k \mid \varsigma^z_k > 0\} & \text{if } \max_{k \in W^z}\varsigma^z_k > 0,\\ 0 & \text{otherwise,}\end{cases}\qquad
\varsigma^z_{max} = \begin{cases}\max_{k \in \bar W^z}\varsigma^z_k & \text{if } |W^z| < m,\\ 0 & \text{otherwise.}\end{cases}$$

If $\varsigma^z_{max} = 0$, inequality (43) is necessarily true, since $\sum_{k \in W^z}\bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\varsigma^z_k \le \max_{k \in W^z}\varsigma^z_k < \nu/n_a$. We now turn to the case in which $\varsigma^z_{max} > 0$. Suppose first that $\varsigma^z_{min} = 0$. In this case, we have to show that there is a $\bar\tau$ that yields

$$\sum_{k \in \bar W^z}\bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\varsigma^z_k < \frac{\nu}{n_a}. \qquad (44)$$

A sufficient condition for (44) to be true is

$$\sum_{k \in \bar W^z}\bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k) < \frac{\nu}{n_a \varsigma^z_{max}} \iff \frac{1}{\sum_{j=1}^m \bar k_{\bar\tau}(\hat s^a_i, \bar s_j)}\sum_{k \in \bar W^z}\bar k_{\bar\tau}(\hat s^a_i, \bar s_k) < \frac{\nu}{n_a \varsigma^z_{max}}. \qquad (45)$$

Obviously, if $\varsigma^z_{max} \le \nu/n_a$ inequality (45) is always true, regardless of the value of $\bar\tau$. Otherwise, we can rewrite (45) as

$$\sum_{k \in \bar W^z}\bar k_{\bar\tau}(\hat s^a_i, \bar s_k) < \frac{\nu}{n_a \varsigma^z_{max}}\left(\sum_{j \in W^z}\bar k_{\bar\tau}(\hat s^a_i, \bar s_j) + \sum_{k \in \bar W^z}\bar k_{\bar\tau}(\hat s^a_i, \bar s_k)\right),$$

and, after a few algebraic manipulations, we obtain

$$\sum_{k \in \bar W^z}\bar k_{\bar\tau}(\hat s^a_i, \bar s_k) < \frac{\nu}{n_a \varsigma^z_{max} - \nu}\sum_{k \in W^z}\bar k_{\bar\tau}(\hat s^a_i, \bar s_k) \iff \sum_{k \in \bar W^z}\bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k) < \frac{\nu}{n_a \varsigma^z_{max} - \nu}\sum_{k \in W^z}\bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k). \qquad (46)$$

We can guarantee that (46) is true by applying Lemma 5. Before doing so, though, let us analyze the case in which $\varsigma^z_{min} > 0$. Define

$$\beta^z = \frac{\nu}{n_a}\left(\sum_{k \in W^z}\bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\varsigma^z_k\right)^{-1} - 1 \qquad (47)$$

(note that $\beta^z > 0$ because $\sum_{k \in W^z}\bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\varsigma^z_k < \nu/n_a$). In order for (43) to hold, we must show that there is a $\bar\tau$ that guarantees that

$$\sum_{k \in \bar W^z}\bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\varsigma^z_k - \beta^z\sum_{k \in W^z}\bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\varsigma^z_k < 0. \qquad (48)$$

A sufficient condition for (48) to hold is

$$\sum_{k \in \bar W^z}\bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k) < \frac{\beta^z \varsigma^z_{min}}{\varsigma^z_{max}}\sum_{k \in W^z}\bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k). \qquad (49)$$

Observe that expressions (46) and (49) only differ in the coefficient multiplying the right-hand side of the inequalities. Let

$$\alpha^z_1 = \begin{cases}\nu/(n_a \varsigma^z_{max} - \nu) & \text{if } \varsigma^z_{max} > \nu/n_a,\\ \infty & \text{otherwise,}\end{cases}\qquad
\alpha^z_2 = \begin{cases}\beta^z \varsigma^z_{min}/\varsigma^z_{max} & \text{if } \varsigma^z_{max} > 0 \text{ and } \varsigma^z_{min} > 0,\\ \infty & \text{otherwise.}\end{cases}$$

Let $\alpha^z < \min(\alpha^z_1, \alpha^z_2)$.
Then, if we make $\theta^z = \varphi(\hat s^a_i, |W^z|, m, \alpha^z)$, with $\varphi$ defined in (34), we can apply Lemma 5 to guarantee that (43) holds. Finally, if we let $\delta = \min_z \delta^z = \min_{a,i,j}\delta^{a,i,j}$ and $\theta = \min_z \theta^z = \min_{a,i,j}\theta^{a,i,j}$, we can guarantee that (43) is true for all $a$, $i$, and $j$, which implies that (40) is also true (see (42)). It remains to show that there is an $\omega > 0$ such that (41) is true if $\bar\tau < \omega$. Recalling that, for any $i$ and any $a$, $\max_j d^a_{ij} = \bar\kappa_{\bar\tau}(\hat s^a_i, rs(\hat s^a_i, 1))$, we want to show that

$$\bar k_{\bar\tau}(\hat s^a_i, rs(\hat s^a_i, 1)) > (1 - \nu)\left[\bar k_{\bar\tau}(\hat s^a_i, rs(\hat s^a_i, 1)) + \sum_{k=2}^m \bar k_{\bar\tau}(\hat s^a_i, rs(\hat s^a_i, k))\right],$$

which is equivalent to

$$(1 - \nu)\sum_{k=2}^m \bar k_{\bar\tau}(\hat s^a_i, rs(\hat s^a_i, k)) < \nu\,\bar k_{\bar\tau}(\hat s^a_i, rs(\hat s^a_i, 1)). \qquad (50)$$

If $\nu \ge 1$, inequality (50) is true regardless of the particular choice of $\bar\tau$. Otherwise, we can rewrite (50) as

$$\sum_{k=2}^m \bar k_{\bar\tau}(\hat s^a_i, rs(\hat s^a_i, k)) < \frac{\nu}{1-\nu}\,\bar k_{\bar\tau}(\hat s^a_i, rs(\hat s^a_i, 1)) \iff \sum_{k=2}^m \bar\kappa_{\bar\tau}(\hat s^a_i, rs(\hat s^a_i, k)) < \frac{\nu}{1-\nu}\,\bar\kappa_{\bar\tau}(\hat s^a_i, rs(\hat s^a_i, 1)). \qquad (51)$$

Let

$$\alpha = \begin{cases}\nu/(1-\nu) & \text{if } \nu < 1,\\ \infty & \text{otherwise.}\end{cases}$$

Then, if we make $\omega^{a,i} = \varphi(\hat s^a_i, 1, m, \alpha)$, with $\varphi$ defined in (34), we can resort to Lemma 5 to guarantee that (51) holds. As before, if we let $\omega = \min_{a,i}\omega^{a,i}$, we can guarantee that (41) is true. Finally, by making $\bar\tau < \min(\theta, \omega)$, the result follows.

Proposition 8 Proof. We start by plugging (39) into the definition of $\xi_v$, given in (8), to get:

$$\begin{aligned}
\xi_v &\le \frac{1}{1-\gamma}\max_a\|(\hat P^a - DK^a)r^a\|_\infty + \frac{R_{dif}}{(1-\gamma)^2}\left(\frac{\gamma}{2}\max_a\|\hat P^a - DK^a\|_\infty + \sigma(D)\right)\\
&\le \max_a\|\hat P^a - DK^a\|_\infty\left(\frac{\max_a\|r^a\|_\infty}{1-\gamma} + \frac{\gamma R_{dif}}{2(1-\gamma)^2}\right) + \sigma(D)\frac{R_{dif}}{(1-\gamma)^2}\\
&\le \max_a\|\hat P^a - DK^a\|_\infty\left(\frac{R_{max}}{1-\gamma} + \frac{\gamma R_{max}}{(1-\gamma)^2}\right) + 2\sigma(D)\frac{R_{max}}{(1-\gamma)^2}\\
&= \max_a\|\hat P^a - DK^a\|_\infty\frac{R_{max}}{(1-\gamma)^2} + 2\sigma(D)\frac{R_{max}}{(1-\gamma)^2}, \qquad (52)
\end{aligned}$$

where we used the fact that $R_{dif} \le 2R_{max}$, which follows trivially from the observation that $\min_{a,i}r^a_i \ge -R_{max}$ and $\max_{a,i}r^a_i \le R_{max}$. To simplify the notation, let $\bar R = R_{max}/(1-\gamma)^2$.
Then, from (42), the definition of $\varsigma^{a,i,j}_k$ in (16), and the definition of $\sigma(D)$ in (9), we can rewrite (52) as

$$\xi_v \le \max_{a,i}\sum_{j=1}^{n_a}\sum_{k=1}^m \bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\varsigma^{a,i,j}_k\,\bar R + 2\max_{a,i}\big[1 - \bar\kappa_{\bar\tau}(\hat s^a_i, rs(\hat s^a_i, 1))\big]\bar R = \max_{a,i}\sum_{k=1}^m \bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\sum_{j=1}^{n_a}\varsigma^{a,i,j}_k\,\bar R + 2\max_{a,i}\big[1 - \bar\kappa_{\bar\tau}(\hat s^a_i, rs(\hat s^a_i, 1))\big]\bar R.$$

Let $W_w(\hat s^a_i)$ and $\bar W_w(\hat s^a_i)$ be defined as in (15). Then,

$$\xi_v \le \max_{a,i}\sum_{k \in W_w(\hat s^a_i)}\bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\sum_{j=1}^{n_a}\varsigma^{a,i,j}_k\,\bar R + \underbrace{\max_{a,i}\sum_{k \in \bar W_w(\hat s^a_i)}\bar\kappa_{\bar\tau}(\hat s^a_i, \bar s_k)\sum_{j=1}^{n_a}\varsigma^{a,i,j}_k\,\bar R + 2\max_{a,i}\big[1 - \bar\kappa_{\bar\tau}(\hat s^a_i, rs(\hat s^a_i, 1))\big]\bar R}_{F(\bar\tau,\, w)}. \qquad (53)$$

We see that the two last terms of (53) coincide with the definition of $F(\bar\tau, w)$, given in (17). Thus, if we make sure that $\bar\kappa_{\bar\tau}$ is an admissible kernel $\bar\kappa^{\epsilon,w}_{\bar\tau}$ (see Definition 7), we can rewrite (53) as

$$\xi_v \le \max_{a,i}\sum_{k \in W_w(\hat s^a_i)}\bar\kappa^{\epsilon,w}_{\bar\tau}(\hat s^a_i, \bar s_k)\sum_{j=1}^{n_a}\varsigma^{a,i,j}_k\,\bar R + \epsilon \le w\max_{a,i}\max_{k \in W_w(\hat s^a_i)}\sum_{j=1}^{n_a}\varsigma^{a,i,j}_k\,\bar R + \epsilon = w\max_{a,i}\max_{k \in W_w(\hat s^a_i)}\sum_{j=1}^{n_a}\big|\kappa^a_\tau(\hat s^a_i, s^a_j) - \kappa^a_\tau(\bar s_k, s^a_j)\big|\,\bar R + \epsilon. \qquad (54)$$

Let $z^a_i = \sum_l k_\tau(\hat s^a_i, s^a_l)$ and $\bar z^a_k = \sum_l k_\tau(\bar s_k, s^a_l)$. Then, for fixed $a$, $i$, $j$, and $k$:

$$\begin{aligned}
\big|\kappa^a_\tau(\hat s^a_i, s^a_j) - \kappa^a_\tau(\bar s_k, s^a_j)\big| &= \left|\frac{k_\tau(\hat s^a_i, s^a_j)}{\sum_l k_\tau(\hat s^a_i, s^a_l)} - \frac{k_\tau(\bar s_k, s^a_j)}{\sum_l k_\tau(\bar s_k, s^a_l)}\right| = \left|\frac{k_\tau(\hat s^a_i, s^a_j)}{z^a_i} - \frac{k_\tau(\bar s_k, s^a_j)}{\bar z^a_k}\right| = \left|\frac{\bar z^a_k k_\tau(\hat s^a_i, s^a_j) - z^a_i k_\tau(\bar s_k, s^a_j)}{z^a_i \bar z^a_k}\right|\\
&= \left|\frac{\bar z^a_k k_\tau(\hat s^a_i, s^a_j) - z^a_i k_\tau(\bar s_k, s^a_j) + z^a_i k_\tau(\hat s^a_i, s^a_j) - z^a_i k_\tau(\hat s^a_i, s^a_j)}{z^a_i \bar z^a_k}\right|\\
&\le \frac{\big|k_\tau(\hat s^a_i, s^a_j) - k_\tau(\bar s_k, s^a_j)\big|}{\bar z^a_k} + \frac{k_\tau(\hat s^a_i, s^a_j)\,\big|\bar z^a_k - z^a_i\big|}{z^a_i \bar z^a_k}.
\end{aligned}$$
Summing over $j$, we obtain

$$\begin{aligned}
\sum_{j=1}^{n_a}\big|\kappa^a_\tau(\hat s^a_i, s^a_j) - \kappa^a_\tau(\bar s_k, s^a_j)\big| &\le \sum_{j=1}^{n_a}\frac{\big|k_\tau(\hat s^a_i, s^a_j) - k_\tau(\bar s_k, s^a_j)\big|}{\bar z^a_k} + \sum_{j=1}^{n_a}\frac{k_\tau(\hat s^a_i, s^a_j)\,\big|\bar z^a_k - z^a_i\big|}{z^a_i \bar z^a_k}\\
&= \frac{1}{\bar z^a_k}\sum_{j=1}^{n_a}\big|k_\tau(\hat s^a_i, s^a_j) - k_\tau(\bar s_k, s^a_j)\big| + \frac{\big|\bar z^a_k - z^a_i\big|}{\bar z^a_k}\\
&= \frac{1}{\bar z^a_k}\left(\sum_{j=1}^{n_a}\big|k_\tau(\hat s^a_i, s^a_j) - k_\tau(\bar s_k, s^a_j)\big| + \Big|\sum_{j=1}^{n_a}k_\tau(\bar s_k, s^a_j) - \sum_{j=1}^{n_a}k_\tau(\hat s^a_i, s^a_j)\Big|\right)\\
&\le \frac{2}{\sum_{l=1}^{n_a}k_\tau(\bar s_k, s^a_l)}\sum_{j=1}^{n_a}\big|k_\tau(\hat s^a_i, s^a_j) - k_\tau(\bar s_k, s^a_j)\big|. \qquad (55)
\end{aligned}$$

In order to derive our result, it suffices to replace (55) in (54). Before doing so, though, note that Assumption (iii) implies that

$$\big|k_\tau(s, s'') - k_\tau(s', s'')\big| = \left|\phi\!\left(\frac{\|s - s''\|}{\tau}\right) - \phi\!\left(\frac{\|s' - s''\|}{\tau}\right)\right| \le \frac{C_\phi}{\tau}\Big|\,\|s - s''\| - \|s' - s''\|\,\Big| \le \frac{C_\phi}{\tau}\|s - s'\|. \qquad (56)$$

Proceeding with the substitution mentioned above, we have

$$\begin{aligned}
\xi_v &\le 2w\max_{a,i}\max_{k \in W_w(\hat s^a_i)}\frac{1}{\sum_{l=1}^{n_a}k_\tau(\bar s_k, s^a_l)}\sum_{j=1}^{n_a}\big|k_\tau(\hat s^a_i, s^a_j) - k_\tau(\bar s_k, s^a_j)\big|\,\bar R + \epsilon\\
&\le 2w\max_{a,i}\max_{k \in W_w(\hat s^a_i)}\frac{1}{\sum_{l=1}^{n_a}k_\tau(\bar s_k, s^a_l)}\,\frac{C_\phi}{\tau}\,n_a\,\|\hat s^a_i - \bar s_k\|\,\bar R + \epsilon\\
&= \frac{2wC_\phi\bar R}{\tau}\max_{a,i}\max_{k \in W_w(\hat s^a_i)}\frac{n_a\,\|\hat s^a_i - \bar s_k\|}{\sum_{l=1}^{n_a}k_\tau(\bar s_k, s^a_l)} + \epsilon \le \frac{2wC_\phi\bar R}{\tau k_{min}}\max_{a,i}\max_{k \in W_w(\hat s^a_i)}\|\hat s^a_i - \bar s_k\| + \epsilon,
\end{aligned}$$

where in the last step we used the fact that $\sum_{l=1}^{n_a}k_\tau(\bar s_k, s^a_l) \ge n_a k_{min}$.

Lemma 9 Proof. Let $q^a, \tilde q^a \in \mathbb R^{|S|}$ be the $a$th columns of $Q^*$ and $\tilde Q^*$, respectively. Then,

$$\begin{aligned}
\|q^a - \tilde q^a\|_\infty &= \|r^a + \gamma P^a v^* - \tilde r^a - \gamma \tilde P^a \tilde v^*\|_\infty \le \|r^a - \tilde r^a\|_\infty + \gamma\|P^a v^* - \tilde P^a \tilde v^*\|_\infty\\
&= \|r^a - \tilde r^a\|_\infty + \gamma\|P^a v^* - \tilde P^a v^* + \tilde P^a v^* - \tilde P^a \tilde v^*\|_\infty\\
&\le \|r^a - \tilde r^a\|_\infty + \gamma\|(P^a - \tilde P^a)v^*\|_\infty + \gamma\|\tilde P^a(v^* - \tilde v^*)\|_\infty\\
&\le \|r^a - \tilde r^a\|_\infty + \gamma\|(P^a - \tilde P^a)v^*\|_\infty + \gamma\|v^* - \tilde v^*\|_\infty, \qquad (57)
\end{aligned}$$

where in the last step we used the fact that $\tilde P^a$ is stochastic, and thus $\|\tilde P^a v\|_\infty \le \|v\|_\infty$ for any $v$. We now provide a bound for $\|(P^a - \tilde P^a)v^*\|_\infty$. Let $A = P^a - \tilde P^a$. Then, for any $i$, $\sum_j a_{ij} = \sum_j(p^a_{ij} - \tilde p^a_{ij}) = \sum_j p^a_{ij} - \sum_j \tilde p^a_{ij} = 0$; that is, the elements in each row of $A$ sum to zero. Let $a^+_i$ be the sum of the positive elements in the $i$th row of $A$ and let $a^+_{max} = \max_i a^+_i$. It should be clear that $\|A\|_\infty = 2a^+_{max}$. Then, for any $i$,

$$\Big|\sum_j a_{ij}v^*_j\Big| \le \sum_{j:a_{ij}>0}a_{ij}v^*_{max} + \sum_{j:a_{ij}<0}a_{ij}v^*_{min} = a^+_i v^*_{max} - a^+_i v^*_{min} \le a^+_{max}(v^*_{max} - v^*_{min}) \le \frac{a^+_{max}}{1-\gamma}\big(r^a_{max} - r^a_{min}\big) \le \frac{a^+_{max}R_{dif}}{1-\gamma} = \frac{R_{dif}}{2(1-\gamma)}\|P^a - \tilde P^a\|_\infty, \qquad (58)$$

where we used the convention $v^*_{max} = \max_i v^*_i$ (and analogously for $v^*_{min}$).
As done in (30), we can resort to Whitt's (1978) Theorem 3.1 and Corollary (b) of his Theorem 6.1 to obtain a bound for $\|v^* - \tilde v^*\|_\infty$. Substituting such a bound and expression (58) in (57), we obtain

$$\begin{aligned}
\|q^a - \tilde q^a\|_\infty &\le \|r^a - \tilde r^a\|_\infty + \frac{\gamma R_{dif}}{2(1-\gamma)}\|P^a - \tilde P^a\|_\infty + \frac{\gamma}{1-\gamma}\left(\max_a\|r^a - \tilde r^a\|_\infty + \frac{\gamma R_{dif}}{2(1-\gamma)}\max_a\|P^a - \tilde P^a\|_\infty\right)\\
&\le \frac{1}{1-\gamma}\max_a\|r^a - \tilde r^a\|_\infty + \frac{\gamma R_{dif}}{2(1-\gamma)^2}\max_a\|P^a - \tilde P^a\|_\infty.
\end{aligned}$$

Proposition 10 Proof. Let $\check M_t \equiv (\hat S_t, A, \check P^a_t, \check r^a_t, \gamma)$, with $\check P^a_t = D_tK^a_t$ and $\check r^a_t = D_t\bar r^a_t$. From the triangle inequality, we know that

$$|\hat Q_t(s, a) - Q_t(s, a)| \le |\hat Q_t(s, a) - \check Q^*_t(s, a)| + |\check Q^*_t(s, a) - \tilde Q^*_t(s, a)| + |\tilde Q^*_t(s, a) - Q_t(s, a)|, \qquad (59)$$

where $\hat Q_t$ and $Q_t$ are defined in the proposition's statement, $\check Q^*_t$ is the optimal action-value function of $\check M_t$, and $\tilde Q^*_t(s, a) = \sum_{i=1}^m \bar\kappa_{\bar\tau}(s, \bar s_i)\bar Q^*_t(\bar s_i, a)$ (the reader will forgive a slight abuse of notation here, since in general $\tilde Q^*_t$ is not the optimal value function of any MDP). Our strategy will be to bound each term on the right-hand side of (59). Since $\hat M_t$ is the model constructed by KBRL using all the data seen by iKBSF up to time step $t$, state $s$ will correspond to one of the states $\hat s^b_i$ in this MDP. Thus, from (7), we see that $\hat Q_t(s, a) = \hat Q^*_t(\hat s^b_i, a)$ for some $i$ and some $b$. Therefore, applying Lemma 9 to $\hat M_t$ and $\check M_t$, we can write

$$|\hat Q_t(s, a) - \check Q^*_t(s, a)| \le \frac{1}{1-\gamma}\max_a\|\hat r^a_t - D_t\bar r^a_t\|_\infty + \frac{\gamma R_{dif,t}}{2(1-\gamma)^2}\max_a\|\hat P^a_t - D_tK^a_t\|_\infty. \qquad (60)$$

In order to bound $|\check Q^*_t(s, a) - \tilde Q^*_t(s, a)|$, we note that, since the information contained in the transition to state $s$ has been incorporated into iKBSF's model $\bar M_t$ at time $t$, $\tilde Q^*_t(s, a) = \sum_{i=1}^m d_{ti}\bar Q^*_t(\bar s_i, a)$ for any $a \in A$, where $d_{ti}$ is the element in the $t$th row and $i$th column of $D_t$ (see Figure 2b). In matrix form, we have $\tilde Q^*_t = D_t\bar Q^*_t$. As $D_t$ is a soft homomorphism between $\check M_t$ and $\bar M_t$, we can resort to Sorg and Singh's (2009) Theorem 1, as done in Proposition 2, to write:

$$|\check Q^*_t(s, a) - \tilde Q^*_t(s, a)| \le \frac{R_{dif,t}}{(1-\gamma)^2}\,\sigma(D_t) \qquad (61)$$

(see (31) and (32)).
Finally,

$$|\tilde Q^*_t(s, a) - Q_t(s, a)| = \Big|\sum_{i=1}^m \bar\kappa_{\bar\tau}(s, \bar s_i)\bar Q^*_t(\bar s_i, a) - \sum_{i=1}^m \bar\kappa_{\bar\tau}(s, \bar s_i)\bar Q_t(\bar s_i, a)\Big| \le \sum_{i=1}^m \bar\kappa_{\bar\tau}(s, \bar s_i)\big|\bar Q^*_t(\bar s_i, a) - \bar Q_t(\bar s_i, a)\big| \le \epsilon^{\bar Q}_t, \qquad (62)$$

where the last step follows from the fact that $\sum_{i=1}^m \bar\kappa_{\bar\tau}(s, \bar s_i)$ defines a convex combination. Substituting (60), (61), and (62) in (59), we obtain the desired bound.

A.2 Alternative error bound

In Section 3 we derived an upper bound for the approximation error introduced by the application of the stochastic-factorization trick. In this section we introduce another bound that has different properties. First, the bound is less applicable, because it depends on quantities that are usually unavailable in a practical situation (the fixed points of two contraction mappings). On the bright side, unlike the bound presented in Proposition 2, the new bound is valid for any norm. Also, it draws an interesting connection with an important class of approximators known as averagers (Gordon, 1995). We start by deriving a theoretical result that only applies to stochastic factorizations of order $n$. We then generalize this result to the case in which the factorizations are of order $m < n$.

Lemma 11. Let $M \equiv (S, A, P^a, r^a, \gamma)$ be a finite MDP with $|S| = n$ and $0 \le \gamma < 1$. Let $EL^a = P^a$ be $|A|$ stochastic factorizations of order $n$ and let $\bar r^a$ be vectors in $\mathbb R^n$ such that $E\bar r^a = r^a$ for all $a \in A$. Define the MDPs $\check M \equiv (S, A, L^a, \bar r^a, \gamma)$ and $\tilde M \equiv (S, A, \tilde P^a, \bar r^a, \gamma)$, with $\tilde P^a = L^aE$. Then,

$$\|v^* - \tilde TEv^*\| \le \tilde\xi_v \equiv \frac{2\gamma}{1-\gamma}\|v^* - u\| + \frac{\gamma(1+\gamma)}{1-\gamma}\|v^* - \check v^*\|, \qquad (63)$$

where $\|\cdot\|$ is a norm in $\mathbb R^n$ and $u$ is a vector in $\mathbb R^n$ such that $Eu = u$.

Proof. The Bellman operators of $M$, $\check M$, and $\tilde M$ are given by $T = \Gamma\Delta$, $\check T = \Gamma\check\Delta$, and $\tilde T = \Gamma\tilde\Delta$. Note that $q^a = r^a + \gamma P^av = E\bar r^a + \gamma EL^av = E(\bar r^a + \gamma L^av)$, where $q^a$ is the $a$th column of $Q$. Thus, $\Delta = E\check\Delta$. Since $E$ is stochastic, we can think of it as one of Gordon's (1995) averagers, given by $\mathcal A(v) = Ev$, and then resort to Theorem 4.1 by the same author to conclude that $\tilde TE = E\check T$. Therefore,7

$$Tv = \Gamma E\check\Delta v \quad\text{and}\quad \tilde TEv = E\Gamma\check\Delta v. \qquad (64)$$

7.
Interestingly, the effect of swapping the matrices $E$ and $L^a$ is to also swap the operators $\Gamma$ and $E$.

Using (64), it is easy to obtain the desired upper bound by resorting to the triangle inequality, the definition of a contraction map, and Denardo's (1967) Theorem 1:

$$\begin{aligned}
\|v^* - \tilde TEv^*\| &\le \gamma\|v^* - Ev^*\| \le \gamma\big(\|v^* - u\| + \|u - Ev^*\|\big) \le \gamma\big(\|v^* - u\| + \|u - \tilde v^*\|\big)\\
&\le \gamma\|v^* - u\| + \frac{\gamma}{1-\gamma}\|u - E\Gamma\check\Delta u\| \le \gamma\|v^* - u\| + \frac{\gamma}{1-\gamma}\|u - \Gamma\check\Delta u\|\\
&\le \gamma\|v^* - u\| + \frac{\gamma}{1-\gamma}\big(\|u - \check v^*\| + \|\check v^* - \Gamma\check\Delta u\|\big) \le \gamma\|v^* - u\| + \frac{\gamma}{1-\gamma}\big(\|u - \check v^*\| + \gamma\|\check v^* - u\|\big)\\
&= \gamma\|v^* - u\| + \frac{\gamma(1+\gamma)}{1-\gamma}\|u - \check v^*\| \le \gamma\|v^* - u\| + \frac{\gamma(1+\gamma)}{1-\gamma}\big(\|u - v^*\| + \|v^* - \check v^*\|\big)\\
&= \frac{\gamma - \gamma^2 + \gamma + \gamma^2}{1-\gamma}\|v^* - u\| + \frac{\gamma(1+\gamma)}{1-\gamma}\|v^* - \check v^*\| = \frac{2\gamma}{1-\gamma}\|v^* - u\| + \frac{\gamma(1+\gamma)}{1-\gamma}\|v^* - \check v^*\|,
\end{aligned}$$

where $\tilde v^*$ denotes the fixed point of $\tilde TE$ (note that, since $Eu = u$, $\tilde TEu = E\Gamma\check\Delta u$).

The derived upper bound depends on two fixed points: $u$, a fixed point of $E$, and $\check v^*$, the unique fixed point of $\check T = \Gamma\check\Delta$. Since the latter is defined by $\bar r^a$ and $L^a$, the bound is essentially a function of the factorization terms, as expected. Notice that the bound is valid for any norm and any fixed point of $E$ (we may think of $u$ as the closest vector to $v^*$ in $\mathbb R^n$ which satisfies this property). Notice also that the first term on the right-hand side of (63) is exactly the error bound derived in Gordon's (1995) Theorem 6.2. When $L^a = P^a$ and $\bar r^a = r^a$ for all $a \in A$, the operators $T$ and $\check T$ coincide, and hence the second term of (63) vanishes. This makes sense, since in this case $T = ET$; that is, the stochastic-factorization trick reduces to the averager $\mathcal A(v) = Ev$. As mentioned above, one of the assumptions of Lemma 11 is that the factorizations $EL^a = P^a$ are of order $n$. This is unfortunate, since the whole motivation behind the stochastic-factorization trick is to create an MDP with $m < n$ states. One way to obtain such a reduction is to suppose that matrix $E$ has $n - m$ columns with zeros only. Define $\mathcal E \subset \{1, 2, ..., n\}$ as the set of columns of $E$ with at least one nonzero element, and let $H$ be a matrix in $\mathbb R^{m\times n}$ such that $h_{ij} = 1$ if $j$ is the $i$th smallest element of $\mathcal E$ and $h_{ij} = 0$ otherwise.
The following proposition generalizes the previous result to a stochastic factorization of order $m$:

Proposition 12. Suppose the assumptions of Lemma 11 hold. Let $D = EH^\top$ and $K^a = HL^a$, and redefine $\bar r^a$ as $H\bar r^a$, with $H$ defined as described above. Define the MDP $\bar M \equiv (\bar S, A, \bar P^a, \bar r^a, \gamma)$, with $|\bar S| = m$ and $\bar P^a = K^aD$. Then, $\|v^* - \Gamma D\bar Q^*\| \le \tilde\xi_v$, with $\tilde\xi_v$ defined in (63).

Proof. Let $\bar q^a \in \mathbb R^m$ be the $a$th column of $\bar Q^*$. Then,

$$D\bar q^a = D\bar r^a + \gamma D\bar P^a\bar v^* = DH\bar r^a + \gamma DK^aD\bar v^* = EH^\top H\bar r^a + \gamma EH^\top HL^aEH^\top\bar v^* = E\bar r^a + \gamma E\tilde P^aH^\top\bar v^* = E\bar r^a + \gamma E\tilde P^a\tilde v^* = E\tilde q^a, \qquad (65)$$

where the equality $EH^\top H = E$ follows from the definition of $H$. The identity $\tilde P^aH^\top\bar v^* = \tilde P^a\tilde v^*$ is a consequence of the fact that the $i$th column of $\tilde P^a$ only contains zeros if $i \notin \mathcal E$: since the $i$th state of $\tilde M$ is then transient, the values of the recurrent states of $\tilde M$, which effectively define the multiplication $\tilde P^a\tilde v^*$, coincide with the values of the states of $\bar M$. Expression (65) then leads to $D\bar Q^* = E\tilde Q^*$. Also, since $E\tilde q^a = E\bar r^a + \gamma EL^aE\tilde v^* = r^a + \gamma P^aE\tilde v^*$, we know that $\Gamma E\tilde Q^* = \tilde TE\tilde v^*$. Putting these results together, we obtain $\|v^* - \Gamma D\bar Q^*\| = \|v^* - \Gamma E\tilde Q^*\| = \|v^* - \tilde TEv^*\|$, and Lemma 11 applies.

The derived bound can be generalized to the case of approximate stochastic factorizations through the triangle inequality, as done in (29). However, if one resorts to Whitt's (1978) results to bound the distance between $v^*$ and $\check v^*$, where $\check v^*$ is the optimal value function of $\check M \equiv (S, A, DK^a, D\bar r^a, \gamma)$, the compounded bound will no longer be valid for all norms, since (30) only holds for the infinity norm.

Appendix B. Details of the experiments

This appendix describes the details of the experiments omitted in the paper.

Puddle World: The puddle-world task was implemented as described by Sutton (1996), but here the task was modeled as a discounted problem with γ = 0.99.
All the transitions were associated with a zero reward, except those leading to the goal, which resulted in a reward of +5, and those ending inside one of the puddles, which led to a penalty of 10 times the distance to the puddle's nearest edge. If the agent did not reach the goal after 300 steps the episode was interrupted and considered a failure. The algorithms were evaluated on two sets of states distributed over disjoint regions of the state space surrounding the puddles. The first set was a 3 × 3 grid defined over [0.1, 0.3] × [0.3, 0.5], and the second one was composed of four states: {0.1, 0.3} × {0.9, 1.0}.

Pole Balancing: We implemented the simulator of the three versions of the pole-balancing task using the equations of motion and parameters given in the appendix of Gomez's (2003) PhD thesis. For the integration we used the 4th-order Runge-Kutta method with a time step of 0.01 seconds and actions chosen every 2 time steps. The problem was modeled as a discounted task with γ = 0.99. We considered the version of the task in which the angle between the pole and the vertical plane must be kept within [−36°, 36°]. In this formulation, an episode is interrupted and the agent gets a reward of −1 if the pole falls past a 36-degree angle or the cart reaches the boundaries of the track, located at ±2.4 m from its center. At all other steps the agent receives a reward of 0. In all versions of the problem an episode was considered a success if the pole(s) could be balanced for 3000 steps (one minute of simulated time). The test set was comprised of 81 states equally spaced in the region defined by [±1.2 m, ±1.2/5 m/s, ±18°, ±75°/s] for the single-pole case, and by [±1.2 m, ±1.2/5 m/s, ±18°, ±75°/s, ±18°, ±150°/s] for the double-pole version of the problem.
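The puddle-world reward just described can be sketched directly. We assume the standard geometry from Sutton (1996), two capsule-shaped puddles of radius 0.1 around line segments; the segment endpoints below are the commonly used values and should be checked against the original:

```python
import numpy as np

# capsules of radius 0.1 around two segments (standard puddle-world geometry)
PUDDLES = [((0.10, 0.75), (0.45, 0.75)),
           ((0.45, 0.40), (0.45, 0.80))]
RADIUS = 0.1

def dist_to_segment(p, a, b):
    p, a, b = map(np.asarray, (p, a, b))
    t = np.clip(np.dot(p - a, b - a) / np.dot(b - a, b - a), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * (b - a)))

def reward(state, reached_goal):
    """Reward as described above: 0 everywhere, +5 at the goal, and inside a
    puddle a penalty of 10x the distance to the puddle's nearest edge."""
    if reached_goal:
        return 5.0
    penalty = 0.0
    for a, b in PUDDLES:
        d = dist_to_segment(state, a, b)
        if d < RADIUS:
            penalty -= 10.0 * (RADIUS - d)   # depth into the puddle
    return penalty

assert reward((0.0, 0.0), False) == 0.0     # far from both puddles
assert reward((0.3, 0.75), False) < 0.0     # on the horizontal puddle's axis
```

States deep inside a puddle are penalized more than states near its edge, which is what shapes the policy around the obstacles.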
These values correspond to a hypercube centered at the origin and covering 50% of the state-space axes in each dimension (since the velocity of the cart and the angular velocity of the poles are theoretically not bounded, we defined the limits of these variables based on samples generated in simple preliminary experiments). For the triple pole-balancing task we performed our simulations using the parameters usually adopted with the two-pole version of the problem, but we added a third pole with the same length and mass as the longer pole (Gomez, 2003). In this case the decision policies were evaluated on a test set containing 256 states equally distributed in the region [±1.2 m, ±1.2/5 m/s, ±18°, ±75°/s, ±18°, ±150°/s, ±18°, ±75°/s].

HIV drug schedule: The HIV drug-schedule task was implemented using the system of ordinary differential equations (ODEs) given by Adams et al. (2004). Integration was carried out by the Euler method using a step size of 0.001, with actions selected every 5000 steps (corresponding to 5 days of simulated time). As suggested by Ernst et al. (2006), the problem was modeled as a discounted task with γ = 0.98. All other parameters of the task, as well as the protocol used for the numerical simulations, also followed the suggestions of the same authors. In particular, we assumed the existence of 30 patients who were monitored for 1000 days. During the monitoring period, the content of the drug cocktail administered to each patient could be changed at fixed intervals of 5 days. Thus, in a sample transition (sᵃᵢ, rᵃᵢ, ŝᵃᵢ): sᵃᵢ is the initial patient condition, a is one of the four types of cocktail to be administered for the next 5 days, ŝᵃᵢ is the patient condition 5 days later, and rᵃᵢ is a reward computed based on the amount of drug in the selected cocktail a and on the difference between the patient's condition from sᵃᵢ to ŝᵃᵢ (Ernst et al., 2006).
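The two integration schemes used above (4th-order Runge-Kutta for the pole-balancing dynamics, forward Euler for the HIV ODEs) are a few lines each. In this sketch the dynamics function takes only the state; the actual simulators also pass the current action into it:

```python
import numpy as np

def euler_step(f, x, dt):
    """One forward-Euler step for dx/dt = f(x), as used for the HIV ODEs
    (step size 0.001)."""
    return x + dt * f(x)

def rk4_step(f, x, dt):
    """One 4th-order Runge-Kutta step, as used for the pole-balancing tasks
    (time step 0.01 s)."""
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

# sanity check on dx/dt = -x, whose exact solution is exp(-t)
x = np.array([1.0])
for _ in range(200):                  # 200 steps of 0.01 -> t = 2
    x = rk4_step(lambda s: -s, x, 0.01)
assert abs(x[0] - np.exp(-2.0)) < 1e-8
```

RK4's per-step error is O(dt^5) versus Euler's O(dt^2), which is why the stiffer pole-balancing dynamics get the higher-order method while the HIV ODEs use a much smaller Euler step instead.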
The results reported in Section 4.2.3 correspond to the performance of the greedy policy induced by the value function computed by the algorithms using all available sample transitions. The decision policies (in this case STI treatments) were evaluated for 5000 days starting from an unhealthy state corresponding to a basin of attraction of the ODEs describing the problem's dynamics (see the papers by Adams et al. and Ernst et al.).

Epilepsy suppression: We used a generative model developed by Bush et al. (2009) to perform our experiments with the epilepsy suppression task. The model was generated based on labeled field potential recordings of five rat brain slices electrically stimulated at frequencies of 0.0 Hz, 0.5 Hz, 1.0 Hz, and 2.0 Hz. The data was used to construct a manifold embedding which in turn gave rise to the problem's state space. The objective is to minimize the occurrence of seizures using as little stimulation as possible; therefore, there is a negative reward associated with both events (see Section 4.2.4). Bush et al.'s generative model is publicly available as an environment for the RL-Glue package (Tanner and White, 2009). In our experiments the problem was modeled as a discounted task with γ = 0.99. The decision policies were evaluated on episodes of 10^5 transitions starting from a fixed set of 10 test states drawn uniformly at random from the problem's state space.

Helicopter hovering: In the experiments with the helicopter hovering task we used the simulator developed by Abbeel et al. (2005), which is available as an environment for the RL-Glue package (Tanner and White, 2009). The simulator was built based on data collected from two separate flights of an XCell Tempest helicopter. The data was used to adjust the parameters of an "acceleration prediction model", which is more accurate than the linear model normally adopted by industry. The objective in the problem is to keep the helicopter hovering as close as possible to a specific location.
Therefore, at each time step the agent gets a negative reward proportional to the distance from the target position. Since the problem's original action space is A ≡ [−1, 1]^4, we discretized each dimension using 4 break points distributed unevenly over [−1, 1]. We tried several possible discretizations and picked the one which resulted in the best performance of the SARSA agent (see Section 5.2.3). After this process, the problem's action space was redefined as A ≡ {−0.25, −0.05, +0.05, +0.25}^4. The problem was modeled as a discounted task with γ = 0.99. The decision policies were evaluated in episodes starting from the target position and ending when the helicopter crashed.

B.2 Algorithms

In all experiments, we used

φ(z) = φ̄(z) = exp(−z)     (66)

to define the kernels used by KBRL, LSPI, and KBSF. In the experiments involving a large number of sample transitions we used sparse kernels, that is, we only computed the μ largest values of k_τ(s̄_i, ·) and the μ̄ largest values of k̄_τ̄(ŝ_i^a, ·). In order to implement this feature, we used a KD-tree to find the μ (μ̄) nearest neighbors of s̄_i (ŝ_i^a) and only computed k_τ (k̄_τ̄) at these states (Bentley, 1975). The value of k_τ and k̄_τ̄ outside this neighborhood was truncated to zero (we used specialized data structures to avoid storing those entries). We now list a few details regarding the algorithms' implementations which were not described in the paper:

KBRL and KBSF: We used modified policy iteration to compute Q̂* (Puterman and Shin, 1978). The value function of a fixed policy π was approximated through value iteration using the stopping criterion described by Puterman (1994, Proposition 6.6.5) with ε = 10^−6. Table 1 shows the parameter values used by KBSF across the experiments.

LSPI: As explained above, LSPI used the kernel derived from (66) as its basis functions. Following Lagoudakis and Parr (2003), we adopted one block of basis functions for each action a ∈ A.
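The sparse-kernel computation described above can be sketched as follows. This is a minimal illustration assuming Euclidean distances and SciPy's cKDTree; the function name is ours, and the truncated entries are stored here in an ordinary dense array for clarity rather than the specialized structures mentioned in the text.

```python
import numpy as np
from scipy.spatial import cKDTree

def sparse_normalized_kernel(query_states, ref_states, tau, mu):
    """Row-stochastic kernel matrix in which only the mu nearest
    reference states of each query state receive nonzero weight.

    Uses phi(z) = exp(-z), as in equation (66); kernel values outside
    the mu-nearest-neighbor set are truncated to zero before the
    rows are normalized.
    """
    tree = cKDTree(ref_states)
    dists, idx = tree.query(query_states, k=mu)    # mu nearest neighbors
    weights = np.exp(-dists / tau)
    weights /= weights.sum(axis=1, keepdims=True)  # normalize each row
    K = np.zeros((len(query_states), len(ref_states)))
    np.put_along_axis(K, idx, weights, axis=1)     # scatter sparse rows
    return K
```

In a memory-conscious implementation the matrix K would be kept in sparse form, since each row has only mu nonzero entries.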
Singular value decomposition was used to avoid potential numerical instabilities in the system of linear equations constructed at each iteration of LSPI (Golub and Van Loan, 1993).

Fitted Q-iteration and extra trees: FQIT has four main parameters: the number of iterations, the number of trees composing the ensemble, the number of candidate cut-points evaluated during the generation of the trees, and the minimum number of elements required to split a node, denoted here ηmin. In general, increasing the first three improves performance, while ηmin is inversely related to the quality of the final value-function approximation. Our experiments indicate that the following configuration of FQIT usually results in good performance on the tasks considered in this paper: 50 iterations (with the structure of the trees fixed after the 10th one), an ensemble of 30 trees, and d_S candidate cut-points, where d_S is the dimension of the state space S. The parameter ηmin has a particularly strong effect on FQIT's performance and computational cost, and its best value seems to be problem-dependent. Therefore, in all of our experiments we fixed the parameters of FQIT as described above and only varied ηmin.

SARSA: We adopted the implementation of SARSA(λ) available in the RL-Glue package (Tanner and White, 2009). The algorithm uses gradient-descent temporal-difference learning to configure a tile-coding function approximator.
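For reference, the policy evaluation step mentioned above (value iteration for a fixed policy, stopped with the criterion of Puterman (1994, Proposition 6.6.5) and ε = 10^−6) can be sketched as below. This is a simplification relative to KBRL/KBSF, assuming the policy's transition matrix and reward vector are given explicitly; the function name is ours.

```python
import numpy as np

def evaluate_policy(P_pi, r_pi, gamma=0.99, epsilon=1e-6):
    """Value of a fixed policy pi by value iteration.

    Iterates v <- r_pi + gamma * P_pi v and stops when the sup-norm
    difference between successive iterates falls below
    epsilon * (1 - gamma) / (2 * gamma), which guarantees the result
    is within epsilon/2 of v^pi in sup norm (Puterman, 1994).
    """
    v = np.zeros(len(r_pi))
    threshold = epsilon * (1 - gamma) / (2 * gamma)
    while True:
        v_next = r_pi + gamma * P_pi @ v
        if np.max(np.abs(v_next - v)) < threshold:
            return v_next
        v = v_next
```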
Problem       Section  s̄_i      m                         τ                τ̄                μ    μ̄
Puddle        4.2.1    k-means  {10, 30, ..., 150}        {0.01, 0.1, 1}   {0.01, 0.1, 1}   —    —
Puddle        5.2.1    evenly   100                       {0.01, 0.1, 1}   {0.01, 0.1, 1}   —    —
Single pole   4.2.2    k-means  {10, 30, ..., 150}        1                {0.01, 0.1, 1}   —    —
Double pole   4.2.2    k-means  {20, 40, ..., 200}        1                {0.01, 0.1, 1}   —    —
Triple pole   5.2.2    on-line  on-line                   100              1                50   10
HIV           4.2.3    random   {2000, 4000, ..., 10000}  1                1                2    3
Epilepsy      4.2.4    k-means  50000                     1                {0.01, 0.1, 1}   6    6
Helicopter    5.2.3    k-means  500                       1                1                4    4

Table 1: Parameters used by KBSF in the computational experiments. The values marked with an asterisk (*) were determined by trial and error on preliminary tests. The remaining parameters were kept fixed from the start or were defined based on a very coarse search.

Appendix C. Table of Symbols

Table 2 shows the main symbols used in the paper. Auxiliary symbols and functions whose use is restricted to a specific part of the text are not listed in the table.

References

Pieter Abbeel, Varun Ganapathi, and Andrew Y. Ng. Learning vehicular dynamics, with application to modeling helicopters. In Advances in Neural Information Processing Systems (NIPS), 2005.

Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y. Ng. An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems (NIPS), 2007.

Brian M. Adams, Harvey T. Banks, Hee-Dae Kwon, and Hien T. Tran. Dynamic multidrug therapies for HIV: optimal and STI control approaches. Mathematical Biosciences and Engineering, 1(2):223–241, 2004.

Charles W. Anderson. Learning and Problem Solving with Multilayer Connectionist Systems. PhD thesis, Computer and Information Science, University of Massachusetts, 1986.

András Antos, Rémi Munos, and Csaba Szepesvári. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems (NIPS), pages 9–16, 2007.

Anwarissa Asbah, André M. S. Barreto, Clement Gehring, Joelle Pineau, and Doina Precup.
Reinforcement learning competition 2013: Controllability and kernel-based stochastic factorization. In Proceedings of the ICML Workshop on the Reinforcement Learning Competition, 2013.

Christopher G. Atkeson and Juan Carlos Santamaria. A comparison of direct and model-based reinforcement learning. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 3557–3564, 1997.

γ — Discount factor in [0, 1) — Sec. 2
A, a — Action space, generic action — Sec. 2
S — Generic state space — Sec. 2
S, d_S — Continuous state space, dimension of S — Sec. 2.2
S^a — Sample transitions associated with action a, S^a ≡ {(s_i^a, r_i^a, ŝ_i^a) | i = 1, 2, ..., n_a} — Sec. 2.2
s, s_i, s(t), s_i^a, ŝ_i^a — Generic state, ith state of an MDP, state occupied at time t, ith tuple in S^a — Sec. 2
r, r(t), r_i^a — Generic reward, reward received at time t, ith sampled reward in S^a — Sec. 2
n_a, n, n̂ — Number of sample transitions in S^a, n = Σ_a n_a, and n̂ = max_a n_a — Sec. 2.2
P^a, P^π, P_a, P_π — Transition functions associated with a or π and their matrix counterparts — Sec. 2.1
R^a, R^π, r_a, r_π — Reward functions associated with a or π and corresponding reward vectors — Sec. 2.1
M — MDP, M ≡ (S, A, P^a, R^a, γ) — Sec. 2.1
π, π* — Generic decision policy, optimal decision policy of M — Sec. 2
V^π, V*, v_π, v* — Value function of π, optimal value function of M, vector counterparts — Sec. 2.1
Q^π, Q*, Q_π, Q* — Action-value functions of π and M and matrix counterparts — Sec. 2.1
Γ — Operator: ΓQ = v, with v_i = max_j q_ij for all i — Sec. 2.1
Δ — Operator associated with M: Δv = Q, with q_ia = r_i^a + γ Σ_{j=1}^{|S|} p_ij^a v_j for all i, a — Sec. 2.1
T — Bellman operator of M, given by T ≡ ΓΔ — Sec. 2.1
φ, φ̄ — Non-increasing functions from R^+ to R^+ used to construct the kernels — Sec. 2.2, 4
k_τ, k̄_τ̄ — Kernel functions: k_τ(s, s') = φ(‖s − s'‖/τ) and k̄_τ̄(s, s') = φ̄(‖s − s'‖/τ̄) — Sec. 2.2, 4
τ, τ̄ — Scalars defining the widths of k_τ and k̄_τ̄ — Sec. 2.2, 4
κ_τ^a, κ̄_τ̄ — Normalized kernels: κ_τ^a(s, s_i^a) = k_τ(s, s_i^a) / Σ_{j=1}^{n_a} k_τ(s, s_j^a) and κ̄_τ̄(s, s̄_i) = k̄_τ̄(s, s̄_i) / Σ_{j=1}^{m} k̄_τ̄(s, s̄_j) — Sec. 2.2, 4
S̄ — Set of representative states, S̄ ≡ {s̄_1, s̄_2, ..., s̄_m} — Sec. 4
s̄_i — ith representative state — Sec. 4
m — Number of representative states — Sec. 4
D, D^a — Stochastic matrix in R^{n×m} and its n_a × m block associated with a — Sec. 4
K, K^a — Stochastic matrix in R^{m×n} and its dense version in R^{m×n_a} — Sec. 4
rs — Function: rs(s, i) = s̄_k, where s̄_k is the ith closest representative state to s — Sec. 4.1
dist — Function: dist(s, i) = ‖s − rs(s, i)‖ — Sec. 4.1

Table 2: List of main symbols used throughout the paper. We use the same notation to refer to all MDPs and the associated elements, and resort to math accents to distinguish between them. So, for example, if M̄ is an MDP, the associated elements are referred to as P̄^a, r̄^a, v̄*, Q̄*, π̄*, Δ̄, and T̄.

Seema H. Bajaria, Glenn Webb, and Denise E. Kirschner. Predicting differential responses to structured treatment interruptions during HAART. Bulletin of Mathematical Biology, 66(5):1093–1118, 2004.

André M. S. Barreto. Tree-based on-line reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2417–2423, 2014.

André M. S. Barreto and Marcelo D. Fragoso. Computing the stationary distribution of a finite Markov chain through stochastic factorization. SIAM Journal on Matrix Analysis and Applications, 32:1513–1523, 2011.

André M. S. Barreto, Doina Precup, and Joelle Pineau. Reinforcement learning using kernel-based stochastic factorization. In Advances in Neural Information Processing Systems (NIPS), pages 720–728, 2011.

André M. S. Barreto, Doina Precup, and Joelle Pineau. On-line reinforcement learning using incremental kernel-based stochastic factorization. In Advances in Neural Information Processing Systems (NIPS), pages 1484–1492, 2012.

André M. S. Barreto, Joelle Pineau, and Doina Precup. Policy iteration based on stochastic factorization. Journal of Artificial Intelligence Research, 50:763–803, 2014.

Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems.
IEEE Transactions on Systems, Man, and Cybernetics, 13:834–846, 1983.

Richard E. Bellman. Dynamic Programming. Princeton University Press, 1957.

Richard E. Bellman. Adaptive Control Processes. Princeton University Press, 1961.

Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.

Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

Alina Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest neighbor. In Proceedings of the International Conference on Machine Learning (ICML), pages 97–104, 2006.

Nikhil Bhat, Ciamac Moallemi, and Vivek Farias. Non-parametric approximate dynamic programming via the kernel method. In Advances in Neural Information Processing Systems (NIPS), 2012.

Ronen I. Brafman and Moshe Tennenholtz. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2003.

Keith Bush, Joelle Pineau, and Massimo Avoli. Manifold embeddings for model-based reinforcement learning of neurostimulation policies. In Proceedings of the ICML/UAI/COLT Workshop on Abstraction in Reinforcement Learning, 2009.

Keith Bush and Joelle Pineau. Manifold embeddings for model-based reinforcement learning under partial observability. In Advances in Neural Information Processing Systems (NIPS), pages 189–197, 2009.

Moses Charikar, Chandra Chekuri, Tomás Feder, and Rajeev Motwani. Incremental clustering and dynamic information retrieval. In Proceedings of the Annual ACM Symposium on Theory of Computing, pages 626–635, 1997.

Joel E. Cohen and Uriel G. Rothblum. Nonnegative ranks, decompositions and factorizations of nonnegative matrices. Linear Algebra and its Applications, 190:149–168, 1993.

Adele Cutler and Leo Breiman. Archetypal analysis. Technometrics, 36(4):338–347, 1994.

Eric V. Denardo.
Contraction mappings in the theory underlying dynamic programming. SIAM Review, 9(2):165–177, 1967.

Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

Dominique M. Durand and Marom Bikson. Suppression and control of epileptiform activity by electrical stimulation: a review. Proceedings of the IEEE, 89(7):1065–1082, 2001.

Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with Gaussian processes. In Proceedings of the International Conference on Machine Learning (ICML), pages 201–208, 2005.

Damien Ernst, Guy-Bart Stan, Jorge Gonçalves, and Louis Wehenkel. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the IEEE Conference on Decision and Control (CDC), 2006.

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.

Amir-massoud Farahmand. Regularization in reinforcement learning. PhD thesis, University of Alberta, 2011.

Amir-massoud Farahmand and Csaba Szepesvári. Model selection in reinforcement learning. Machine Learning, 85(3):299–332, 2011.

Amir-massoud Farahmand, Csaba Szepesvári, and Jean-Yves Audibert. Toward manifold-adaptive learning. In NIPS Workshop on Topology Learning, 2007.

Tomás Feder and Daniel Greene. Optimal algorithms for approximate clustering. In Proceedings of the Annual ACM Symposium on Theory of Computing, pages 434–444, 1988.

Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 2nd edition, 1993.

Faustino J. Gomez, Juergen Schmidhuber, and Risto Miikkulainen. Efficient non-linear control through neuroevolution.
In Proceedings of the European Conference on Machine Learning (ECML), pages 654–662, 2006.

Faustino J. Gomez. Robust non-linear control through neuroevolution. PhD thesis, The University of Texas at Austin, 2003. Technical Report AI-TR-03-303.

Teofilo Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.

Geoffrey Gordon. Stable function approximation in dynamic programming. Technical Report CMU-CS-95-103, Computer Science Department, Carnegie Mellon University, 1995.

Steffen Grünewälder, Guy Lever, Luca Baldassarre, Massimiliano Pontil, and Arthur Gretton. Modelling transition dynamics in MDPs with RKHS embeddings. In Proceedings of the International Conference on Machine Learning (ICML), pages 535–542, 2012.

Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, 2002.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2002.

Kristin Jerger and Steven Schiff. Periodic pacing an in vitro epileptic focus. Journal of Neurophysiology, (2):876–879, 1995.

Nicholas Jong and Peter Stone. Kernel-based models for reinforcement learning in continuous state spaces. In Proceedings of the International Conference on Machine Learning (ICML) Workshop on Kernel Machines and Reinforcement Learning, 2006.

Nicholas Jong and Peter Stone. Compositional models for reinforcement learning. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 644–659, 2009.

Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley and Sons, 1990.

Oliver B. Kroemer and Jan R. Peters. A non-parametric approach to dynamic programming.
In Advances in Neural Information Processing Systems (NIPS), pages 1719–1727, 2011.

Branislav Kveton and Georgios Theocharous. Kernel-based reinforcement learning on representative states. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 124–131, 2012.

Branislav Kveton and Georgios Theocharous. Structured kernel-based reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2013.

Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.

Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.

Donald Michie and Roger Chambers. BOXES: An experiment on adaptive control. Machine Intelligence 2, pages 125–133, 1968.

Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13:103–130, 1993.

Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.

Andrew Y. Ng, H. Jin Kim, Michael I. Jordan, and Shankar Sastry. Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 2003.

Dirk Ormoneit and Peter Glynn. Kernel-based reinforcement learning in average-cost problems. IEEE Transactions on Automatic Control, 47(10):1624–1636, 2002.

Dirk Ormoneit and Saunak Sen. Kernel-based reinforcement learning. Machine Learning, 49(2–3):161–178, 2002.

Pentti Paatero and Unto Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111–126, 1994.

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.

Martin L. Puterman and Moon C. Shin.
Modified policy iteration algorithms for discounted Markov decision problems. Management Science, 24(11):1127–1137, 1978.

Carl Edward Rasmussen and Malte Kuss. Gaussian processes in reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pages 751–759, 2004.

Balaraman Ravindran. An Algebraic Approach to Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts, 2004.

Gavin Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Engineering Department, Cambridge University, 1994.

Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. MIT Press, 2002.

Jonathan Sorg and Satinder Singh. Transfer via soft homomorphisms. In Autonomous Agents & Multiagent Systems/Agent Theories, Architectures, and Languages, pages 741–748, 2009.

Charles J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10:1040–1053, 1982.

Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems (NIPS), pages 1038–1044, 1996.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

Csaba Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.

Brian Tanner and Adam M. White. RL-Glue: Language-independent software for reinforcement-learning experiments. Journal of Machine Learning Research, 10:2133–2136, 2009.

Gavin Taylor and Ronald Parr. Kernelized value function approximation for reinforcement learning.
In Proceedings of the International Conference on Machine Learning (ICML), pages 1017–1024, 2009.

Stephen A. Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20:1364–1377, 2009.

Ward Whitt. Approximations of dynamic programs, I. Mathematics of Operations Research, 3(3):231–243, 1978.

Alexis P. Wieland. Evolving neural network controllers for unstable systems. In Proceedings of the International Joint Conference on Neural Networks, pages 667–673, 1991.

Xin Xu, Tao Xie, Dewen Hu, and Xicheng Lu. Kernel least-squares temporal difference learning. Information Technology, pages 54–63, 2005.

Yinyu Ye. The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36(4):593–603, 2011.