# Tree-Based On-Line Reinforcement Learning

André da Motta Salles Barreto
Laboratório Nacional de Computação Científica
Petrópolis, RJ, Brazil

Abstract

Fitted Q-iteration (FQI) stands out among reinforcement learning algorithms for its flexibility and ease of use. FQI can be combined with any regression method, and this choice determines the algorithm's statistical and computational properties. The combination of FQI with an ensemble of regression trees gives rise to an algorithm, FQIT, that is computationally efficient, scalable to high-dimensional spaces, and robust to noise. Despite its nice properties and good performance in practice, FQIT also has a limitation: the fact that an ensemble of trees must be constructed (or updated) at each iteration confines the algorithm to the batch scenario. This paper aims to address this specific issue. Based on a strategy recently proposed in the literature, called the stochastic-factorization trick, we propose a modification of FQIT that makes it fully incremental, and thus suitable for on-line learning. We call the resulting method tree-based stochastic factorization (TBSF). We derive upper bounds for the difference between the value functions computed by FQIT and TBSF, and also show in which circumstances the two approximations coincide. A series of computational experiments is presented to illustrate the properties of TBSF and to show its usefulness in practice, including a medical problem involving the treatment of patients infected with HIV.

1 Introduction

Batch reinforcement-learning algorithms learn a decision policy based on a fixed set of sample transitions. Among them, fitted Q-iteration (FQI) stands out for its flexibility and ease of use. Probably because of this, FQI has been adopted by researchers and practitioners, and today it is possible to find several reports of successful applications of the algorithm (Ernst, Geurts, and Wehenkel 2005; Riedmiller 2005; Ernst et al. 2006; Kalyanakrishnan and Stone 2007).

FQI can be combined with any regression method, and this choice determines the algorithm's statistical and computational properties. In their original work, Ernst, Geurts, and Wehenkel (2005) suggest the use of an ensemble of regression trees. The authors point out several advantages of these models, such as their computational efficiency, their scalability to high-dimensional spaces, and their robustness with respect to irrelevant variables, outliers, and noise. The combination of FQI with regression trees (FQIT) has also been extensively tested in practice, in general with very good results (Ernst, Geurts, and Wehenkel 2005; Ernst et al. 2006).

Despite its nice properties and good performance in practice, FQIT also has a limitation: the fact that an ensemble of trees must be constructed (or updated) at each iteration confines the algorithm to the batch scenario. This paper aims to address this specific issue. Based on a strategy recently proposed in the literature, called the stochastic-factorization trick, we propose a modification of FQIT that makes its computational complexity independent of the number of sample transitions. The resulting method, tree-based stochastic factorization (TBSF), is significantly faster than its precursor and can be used on-line. We derive TBSF and analyze it theoretically and empirically.
We prove that the distance between the approximations computed by FQIT and TBSF is bounded. We also show empirically that, since our algorithm can process more sample transitions in the same amount of time, it can match FQIT's performance using less computation.

2 Reinforcement Learning

We consider the basic framework of reinforcement learning: an agent interacts with an environment and selects actions in order to maximize the amount of reward received in the long run (Sutton and Barto 1998). As usual, we assume that this interaction can be modeled as a Markov decision process (MDP; Puterman 1994). An MDP is a tuple $M \equiv (S, A, P^a, r^a, \gamma)$, where $S$ is the state space and $A$ is the (finite) action set. For each action $a \in A$, $P^a(\cdot|s)$ is the next-state distribution upon taking action $a$ in state $s$. The reward received at transition $s \xrightarrow{a} s'$ is given by $R^a(s, s')$. Usually, one is interested in the expected reward resulting from the execution of $a$ in $s$, that is, $r^a(s) = \mathrm{E}_{s' \sim P^a(\cdot|s)}\{R^a(s, s')\}$. The discount factor $\gamma \in [0, 1)$ gives smaller weights to rewards received further in the future. The algorithm presented in this paper can be applied to MDPs with continuous and discrete state spaces, and in both cases the strategy will be to derive a discrete model. When both the state and action spaces are finite, an MDP can be represented in matrix form: each function $P^a$ becomes a matrix $\mathbf{P}^a \in \mathbb{R}^{|S| \times |S|}$, with $p^a_{ij} = P^a(s_j|s_i)$, and each function $r^a$ becomes a vector $\mathbf{r}^a \in \mathbb{R}^{|S|}$, with $r^a_i = r^a(s_i)$.¹

¹ Throughout the paper vectors will be denoted by lowercase boldface letters and matrices will be denoted by uppercase boldface letters.

The goal of the agent is to find an optimal policy, that is, a mapping $\pi: S \mapsto A$ that maximizes the expected sum of discounted rewards. When the MDP is finite, dynamic programming can be used to find an optimal policy $\pi^* \in A^{|S|}$ in polynomial time (Littman, Dean, and Kaelbling 1995). Starting from $Q_0 \equiv 0$, the value-iteration update rule,

$Q_{t+1}(s_i, a) = r^a(s_i) + \gamma \sum_{j=1}^{|S|} p^a_{ij} \max_{b \in A} Q_t(s_j, b)$, with $t \ge 0$,

gives the optimal $t$-step action-value function, from which the $t$-step policy can be obtained by selecting any $\pi^{(t)}(s_i) \in \mathrm{argmax}_a\, Q_t(s_i, a)$. As $t \to \infty$ the function $Q_t(s_i, a)$ approaches $Q^*(s_i, a)$, the value function associated with all optimal policies $\pi^*$.

In reinforcement learning it is assumed that the MDP is unknown, and the agent must learn a policy based on transitions sampled from the environment. If the process of learning a decision policy is based on a fixed set of sample transitions, we call it batch reinforcement learning. In on-line reinforcement learning the computation of a decision policy takes place concomitantly with the collection of data.

3 Fitted Q-Iteration

Fitted Q-iteration (FQI) is a batch reinforcement-learning algorithm proposed by Ernst, Geurts, and Wehenkel (2005) based on the works by Gordon (1995) and Ormoneit and Sen (2002). Let $S^a \equiv \{(s^a_i, r^a_i, \hat{s}^a_i)\}$, $i = 1, 2, ..., n_a$, be the sample transitions associated with action $a \in A$, where $s^a_i, \hat{s}^a_i \in S$ and $r^a_i \in \mathbb{R}$. FQI keeps an approximation $\hat{Q}: S \times A \mapsto \mathbb{R}$ which is successively updated in the following way. Starting from $\hat{Q}_0 \equiv 0$, at each iteration $t > 0$ a training set $H^t \equiv \{[(s^a_i, a), Q_t(s^a_i, a)]\}$ composed of $n = \sum_{a \in A} n_a$ input-output pairs is constructed, with $Q_t(s^a_i, a) = r^a_i + \gamma \max_b \hat{Q}_{t-1}(\hat{s}^a_i, b)$. Then, a regression procedure uses this training set to compute the next approximation, $\hat{Q}_t = \mathrm{fit}(H^t)$. The combination of the function approximator $\hat{Q}(s, a)$ with the regression method $\mathrm{fit}(\cdot)$ gives rise to different instances of FQI.
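To make the generic procedure concrete, the sketch below implements the FQI loop just described with a pluggable regressor. It is a minimal illustration, not the authors' code: the transition format, the hyperparameters, and the choice of scikit-learn's ExtraTreesRegressor as the $\mathrm{fit}(\cdot)$ step are assumptions made for the example.

```python
# Minimal sketch of generic fitted Q-iteration (FQI), assuming transitions are
# given as (state, action, reward, next_state) tuples with a finite action set.
# The regressor choice (ExtraTreesRegressor) is an illustrative assumption.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(transitions, actions, gamma=0.98, n_iterations=50):
    n = len(transitions)
    states = np.array([t[0] for t in transitions], dtype=float).reshape(n, -1)
    acts = np.array([t[1] for t in transitions], dtype=float).reshape(-1, 1)
    rewards = np.array([t[2] for t in transitions], dtype=float)
    next_states = np.array([t[3] for t in transitions], dtype=float).reshape(n, -1)
    X = np.hstack([states, acts])        # regression inputs: (s, a) pairs
    q_hat = None                         # corresponds to Q_hat_0 = 0
    for _ in range(n_iterations):
        if q_hat is None:
            targets = rewards            # first iteration: targets are the rewards
        else:
            # max_b Q_hat_{t-1}(s', b) evaluated at every sampled next state
            q_next = np.column_stack([
                q_hat.predict(np.hstack([next_states,
                                         np.full((n, 1), b, dtype=float)]))
                for b in actions])
            targets = rewards + gamma * q_next.max(axis=1)
        q_hat = ExtraTreesRegressor(n_estimators=30).fit(X, targets)
    return q_hat
```

Any other regression model exposing fit and predict could be substituted for the ensemble of trees, which is precisely the flexibility discussed above.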
Here we will focus on approximators of the form $\hat{Q}(s, a) = \sum_{b \in A} \sum_{j=1}^{n_b} \kappa^a(s, s^b_j)\, \hat{Q}(s^b_j, a)$, where $\kappa^a(s, s^b_j): S \times S \mapsto \mathbb{R}$ can be intuitively seen as a measure of the similarity between $s$ and $s^b_j$. We will assume that $\kappa^a(s, s^b_j) = 0$ if $a \neq b$, which simplifies the expression above to

$\hat{Q}(s, a) = \sum_{j=1}^{n_a} \kappa^a(s, s^a_j)\, \hat{Q}(s^a_j, a)$.   (1)

When $\hat{Q}_t(s, a)$ satisfies (1) it is fully defined by the values of the states $s^a_i$; hence, FQI's regression step reduces to updating these values:

$\hat{Q}_{t+1}(s^a_i, a) = Q_{t+1}(s^a_i, a) = r^a_i + \gamma \max_b \hat{Q}_t(\hat{s}^a_i, b)$.   (2)

Combining (2) and (1), we can see that

$\hat{Q}_t(\hat{s}^a_i, b) = \sum_{j=1}^{n_b} \kappa^b(\hat{s}^a_i, s^b_j) \left[ r^b_j + \gamma \max_c \hat{Q}_{t-1}(\hat{s}^b_j, c) \right]$.   (3)

It is well known that if

$\kappa^b(\hat{s}^a_i, s^b_j) \ge 0$ and $\sum_{j=1}^{n_b} \kappa^b(\hat{s}^a_i, s^b_j) = 1$,   (4)

then $\hat{Q}_t$ converges to a fixed point as $t \to \infty$ (Gordon 1995; Ormoneit and Sen 2002). This is easy to see if we recast FQI as the solution of a derived MDP, as follows. Let $\hat{S}$ be the set composed of the $n$ states $\hat{s}^a_i$. Define

$\hat{R}^a(\hat{s}^b_i, \hat{s}^c_j) = r^c_j$ and $\hat{P}^a(\hat{s}^c_j \,|\, \hat{s}^b_i) = \kappa^a(\hat{s}^b_i, s^c_j)$.   (5)

Comparing (3) and (5), it is easy to see that FQI is equivalent to value iteration in the MDP $\hat{M} \equiv (\hat{S}, A, \hat{P}^a, \hat{r}^a, \gamma) = (\hat{S}, A, \hat{\mathbf{P}}^a, \hat{\mathbf{r}}^a, \gamma)$, where $\hat{r}^a(\hat{s}^b_i) = \sum_{c \in A} \sum_{j=1}^{n_c} \kappa^a(\hat{s}^b_i, s^c_j)\, r^c_j = \sum_{j=1}^{n_a} \kappa^a(\hat{s}^b_i, s^a_j)\, r^a_j$.

Antos, Munos, and Szepesvári (2007) have analyzed the finite-sample performance of FQI for the case in which the class of policies is restricted and the regression procedure $\mathrm{fit}(\cdot)$ is a weighted least-squares method. Farahmand et al. (2009) provide a similar analysis for the scenario where the least-squares problem is regularized. Here we will focus on the version of FQI originally proposed by Ernst, Geurts, and Wehenkel (2005), which we discuss next.

Fitted Q-Iteration Using Trees

Ernst, Geurts, and Wehenkel (2005) suggest that the approximation $\hat{Q}(s, a)$ be represented by an ensemble of regression trees. Besides all the advantages discussed in Section 1, ensembles of trees can be defined in order to satisfy (1) and (4). Let $\mathcal{T}^a$ be a set of $|\mathcal{T}^a|$ trees $T^a_i$ associated with action $a \in A$. Each $T^a_i$ partitions the state space into $m^a_i$ sets $T^a_{ij}$. The value associated with partition $T^a_{ij}$ at iteration $t$ is simply the average of the values of the states $s^a_l$ that belong to $T^a_{ij}$, that is,

$\hat{v}_t(T^a_{ij}) = \dfrac{\sum_{l=1}^{n_a} \mathbb{I}\{s^a_l \in T^a_{ij}\}\, Q_t(s^a_l, a)}{\sum_{l=1}^{n_a} \mathbb{I}\{s^a_l \in T^a_{ij}\}}$,   (6)

where $\mathbb{I}\{\cdot\}$ is the indicator function, with $\mathbb{I}\{\text{true}\} = 1$ and $\mathbb{I}\{\text{false}\} = 0$. We define $|T^a_{ij}| \equiv \sum_{l=1}^{n_a} \mathbb{I}\{s^a_l \in T^a_{ij}\}$. Recall that, when using an ensemble of trees, the value of a state $\hat{s}^a_i$ is the average value of all the partitions $\hat{s}^a_i$ belongs to, that is,

$\hat{Q}_t(\hat{s}^a_i, b) = \dfrac{1}{|\mathcal{T}^b|} \sum_{l=1}^{|\mathcal{T}^b|} \sum_{u=1}^{m^b_l} \mathbb{I}\{\hat{s}^a_i \in T^b_{lu}\}\, \hat{v}_t(T^b_{lu}) = \dfrac{1}{|\mathcal{T}^b|} \sum_{l=1}^{|\mathcal{T}^b|} \sum_{u=1}^{m^b_l} \dfrac{\mathbb{I}\{\hat{s}^a_i \in T^b_{lu}\}}{|T^b_{lu}|} \sum_{j=1}^{n_b} \mathbb{I}\{s^b_j \in T^b_{lu}\}\, Q_t(s^b_j, b)$.   (7)

There are several methods available to build a regression tree $T^a$ based on a training set $H$ (see Ernst, Geurts, and Wehenkel's article, 2005, and references therein). If we assume that the structure of the trees is fixed throughout FQIT's iterations, then we can look at them as just a specific way of defining $\kappa^b(\hat{s}^a_i, s^b_j)$. Comparing (1), (2), and (7), we see that when using an ensemble of trees

$\kappa^b(\hat{s}^a_i, s^b_j) = \dfrac{1}{|\mathcal{T}^b|} \sum_{l=1}^{|\mathcal{T}^b|} \sum_{u=1}^{m^b_l} \dfrac{1}{|T^b_{lu}|}\, \mathbb{I}\{\hat{s}^a_i \in T^b_{lu} \text{ and } s^b_j \in T^b_{lu}\}$.   (8)

It is clear that $\kappa^b(\hat{s}^a_i, s^b_j) \ge 0$. Also, if $\hat{s}^a_i \in T^b_{lu}$, then $\sum_{j=1}^{n_b} \mathbb{I}\{\hat{s}^a_i \in T^b_{lu} \text{ and } s^b_j \in T^b_{lu}\} = |T^b_{lu}|$. Since there are $|\mathcal{T}^b|$ such summations, we have $\sum_{j=1}^{n_b} \kappa^b(\hat{s}^a_i, s^b_j) = 1$, and therefore $\kappa^b(\hat{s}^a_i, s^b_j)$ satisfies (4).
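As an illustration of how (8) translates into computation, the snippet below evaluates the tree-ensemble kernel for a pair of states. It assumes, purely for the example, that each tree is represented by a function mapping a state to the index of the leaf (partition) containing it; the helper tree_kernel and this representation are hypothetical conveniences, not part of FQIT's implementation.

```python
# Illustrative sketch of the tree-ensemble kernel in (8). Each tree is assumed
# to be a callable mapping a state to the index of the leaf that contains it.
from collections import Counter

def tree_kernel(ensemble_b, states_b, s_query, j):
    """kappa^b(s_query, s^b_j): average over trees of 1/|leaf| whenever s_query
    and the j-th sample state fall in the same leaf of that tree."""
    total = 0.0
    for leaf_of in ensemble_b:                     # one leaf-index function per tree
        leaves = [leaf_of(s) for s in states_b]    # leaf of every sample state
        sizes = Counter(leaves)                    # |T^b_{lu}| for each leaf
        if leaf_of(s_query) == leaves[j]:          # same partition in this tree
            total += 1.0 / sizes[leaves[j]]
    return total / len(ensemble_b)                 # divide by |T^b|
```

By construction the weights are non-negative and, whenever the query state falls in non-empty leaves, they sum to 1 over $j$, which is the condition (4) verified above.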
The fact that each FQI iteration involves solving a regression problem with $n$ training examples makes the method impractical for larger domains, depending on the choice of the regression method $\mathrm{fit}(\cdot)$ and the computational resources available. Moreover, even if $\hat{Q}$ is updated through a simple rule like (2), the computational complexity of each FQI iteration will necessarily depend on $n$, which clearly precludes the use of the algorithm as an on-line method. In the next section we discuss a possible way to circumvent these limitations.

4 Tree-Based Stochastic Factorization

In order to derive our algorithm we will resort to the stochastic-factorization trick, an idea introduced by Barreto and Fragoso (2011) and later explored by Barreto, Precup, and Pineau (2011; 2012) in reinforcement learning. Given a finite MDP $M \equiv (S, A, \mathbf{P}^a, \mathbf{r}^a, \gamma)$, let $\mathbf{P}^a = \mathbf{D}^a \mathbf{K}^a$ be $|A|$ factorizations in which $\mathbf{D}^a \in \mathbb{R}^{n \times m}$ and $\mathbf{K}^a \in \mathbb{R}^{m \times n}$ are stochastic matrices. Also, let $\mathbf{D}^a \bar{\mathbf{r}}^a = \mathbf{r}^a$ for all $a$. Then, if we swap the factors $\mathbf{D}^a$ and $\mathbf{K}^a$, we obtain another transition matrix $\bar{\mathbf{P}}^a = \mathbf{K}^a \mathbf{D}^a$ that retains some basic properties of $\mathbf{P}^a$ (Barreto and Fragoso 2011). The insight is that, instead of solving $M$, one can solve $\bar{M} \equiv (\bar{X}, A, \bar{\mathbf{P}}^a, \bar{\mathbf{r}}^a, \gamma)$, where $\bar{X}$ is a set of $m$ states that will be defined shortly. Based on the optimal value function of $\bar{M}$, $\bar{Q}^*$, an approximation of $Q^*$ can be computed as $\tilde{Q}(s_i, a) = \sum_{j=1}^{m} d^a_{ij}\, \bar{Q}^*(\bar{x}_j, a)$. When $m \ll n$, replacing $M$ with $\bar{M}$ significantly reduces memory usage and computing time.

The stochastic-factorization trick can be applied to any finite MDP. Since FQI is equivalent to the solution of a discrete MDP $\hat{M}$ when (1) and (4) hold, as discussed in Section 3, in principle the technique described above can be used with any approximator $\hat{Q}$ that has these properties. Note though that, depending on the choice of $\hat{Q}$, it is not clear how to efficiently compute $\mathbf{D}^a \mathbf{K}^a = \hat{\mathbf{P}}^a$ and $\mathbf{D}^a \bar{\mathbf{r}}^a = \hat{\mathbf{r}}^a$ (see (5)). One possible strategy is to select a set of $m$ representative states $\bar{s}_l \in S$ and use the approximation $\kappa^b(\hat{s}^a_i, s^b_j) \approx \sum_{l=1}^{m} \kappa(\hat{s}^a_i, \bar{s}_l)\, \kappa^b(\bar{s}_l, s^b_j)$ (Barreto, Precup, and Pineau 2011). Although this is a valid strategy in general, it only provides an approximate factorization of the MDP. Besides, one has to find a way to appropriately define the representative states $\bar{s}_l$. We now show that, in contrast, in the particular case in which $\hat{Q}$ is represented by an ensemble of trees, it is possible to determine an exact factorization of the MDP $\hat{M}$ without any extra computation.

Stochastic Factorization

In order to derive our algorithm we will need two assumptions: (i) FQIT builds the ensembles of trees at iteration $t = 1$ using any algorithm and then keeps the structure of the trees fixed for $t > 1$. This means that the partitions $T^a_{ij}$ remain the same throughout the execution of the algorithm, although their values $\hat{v}(T^a_{ij})$ may change. (ii) The number of partitions in the ensembles $\mathcal{T}^a$ is the same for all $a$, that is, $\sum_{l=1}^{|\mathcal{T}^a|} m^a_l = m$ for all $a \in A$.

As mentioned before, Assumption (i) is needed to guarantee that the approximations $\hat{Q}(\hat{s}^a_i, b)$ are updated through (2). Changing the structure of the trees from one iteration to another corresponds to defining a new MDP $\hat{M}_t$ at each iteration $t$, and thus the guarantee of convergence to a fixed point is lost. We will discuss how to relax Assumption (i) later. Assumption (ii) is necessary to guarantee that the MDP $\bar{M}$ is well defined, as will become clear shortly.

In order to facilitate the exposition, we will refer to $s^a_i$ as $s_j$, with $j = \sum_{b=1}^{a-1} n_b + i$ (and similarly for $r^a_i$ and $\hat{s}^a_i$).
Similarly, we will use $\bar{T}^a_k$ to refer to $T^a_{ij}$, with $k = \sum_{l=1}^{i-1} m^a_l + j$. Finally, we introduce the function $\mathrm{act}: \{1, 2, ..., n\} \mapsto \{1, 2, ..., |A|\}$ as $\mathrm{act}(i) = 1 + \max\{a \,|\, \sum_{b=1}^{a} n_b < i\}$, where the maximum of an empty set is taken to be zero; that is, $\mathrm{act}(i)$ is the action associated with $s_i$, $r_i$, and $\hat{s}_i$. Using these definitions and (5) and (8) we can write

$\hat{p}^a_{ij} = \sum_{l=1}^{m} \dfrac{1}{|\mathcal{T}^a|}\, \mathbb{I}\{\hat{s}_i \in \bar{T}^a_l\}\, \dfrac{1}{|\bar{T}^a_l|}\, \mathbb{I}\{\mathrm{act}(j) = a \text{ and } s_j \in \bar{T}^a_l\}$.

Thus, if we define the elements of $\mathbf{D}^a$ and $\mathbf{K}^a$ as

$d^a_{ij} = \dfrac{1}{|\mathcal{T}^a|}\, \mathbb{I}\{\hat{s}_i \in \bar{T}^a_j\}$ and $k^a_{ij} = \dfrac{1}{|\bar{T}^a_i|}\, \mathbb{I}\{\mathrm{act}(j) = a \text{ and } s_j \in \bar{T}^a_i\}$,   (9)

it is clear that $\mathbf{D}^a \mathbf{K}^a = \hat{\mathbf{P}}^a$ for all $a$. In order to have a complete factorization of the MDP $\hat{M}$, it remains to find vectors $\bar{\mathbf{r}}^a$ such that $\mathbf{D}^a \bar{\mathbf{r}}^a = \hat{\mathbf{r}}^a$ for all $a \in A$. We know from (5) that $\hat{r}^a(\hat{s}_i) = \sum_{j=1}^{n} \kappa^a(\hat{s}_i, s_j)\, r_j$, which implies that $\hat{\mathbf{r}}^a = \hat{\mathbf{P}}^a \mathbf{r}$ ($\mathbf{r} \in \mathbb{R}^n$ is the vector composed of the sampled rewards). Now, if we define $\bar{\mathbf{r}}^a = \mathbf{K}^a \mathbf{r}$, we can write $\hat{\mathbf{r}}^a = \hat{\mathbf{P}}^a \mathbf{r} = \mathbf{D}^a \mathbf{K}^a \mathbf{r} = \mathbf{D}^a \bar{\mathbf{r}}^a$, and we have an exact factorization of $\hat{M}$.

Once we have determined $\mathbf{D}^a$, $\mathbf{K}^a$, and $\bar{\mathbf{r}}^a$, we can apply the stochastic-factorization trick to the MDP $\hat{M}$. Let $\bar{\mathbf{P}}^a = \mathbf{K}^a \mathbf{D}^a$ for all $a \in A$. From (9), we have

$\bar{p}^a_{ij} = \dfrac{1}{|\mathcal{T}^a|\, |\bar{T}^a_i|} \sum_{l=1}^{n} \mathbb{I}\{\mathrm{act}(l) = a \text{ and } s_l \in \bar{T}^a_i \text{ and } \hat{s}_l \in \bar{T}^a_j\}$.   (10)

Thus, we simply count the number of transitions in $S^a$ that started in $\bar{T}^a_i$ and ended in $\bar{T}^a_j$, and divide it by the number of transitions that started in this partition. Since the states $s_i$ and $\hat{s}_i$ simultaneously belong to $|\mathcal{T}^a|$ partitions $\bar{T}^a_i$, in order to obtain $\bar{p}^a_{ij}$ this fraction must be further divided by $|\mathcal{T}^a|$. As for the rewards, from the definition of $\bar{\mathbf{r}}^a$ we have

$\bar{r}^a_i = \dfrac{1}{|\bar{T}^a_i|} \sum_{j=1}^{n} \mathbb{I}\{\mathrm{act}(j) = a \text{ and } s_j \in \bar{T}^a_i\}\, r_j$,   (11)

which is simply the average of the rewards $r_j$ associated with the states $s_j$ that belong to $\bar{T}^a_i$. We see an interesting inversion here: while in the MDP $\hat{M}$ the reward function $\hat{R}^a(\hat{s}_i, \hat{s}_j) = r_j$ is defined by the end state $\hat{s}_j$, as shown in (5), here $\bar{r}^a_i$ only depends on the initial states $s_i$.

Because of Assumption (ii), we know that $\bar{\mathbf{P}}^a \in \mathbb{R}^{m \times m}$ and $\bar{\mathbf{r}}^a \in \mathbb{R}^{m}$ for all $a \in A$. Thus, we can define the MDP $\bar{M} \equiv (\bar{X}, A, \bar{\mathbf{P}}^a, \bar{\mathbf{r}}^a, \gamma)$. Note that, unlike Barreto, Precup, and Pineau (2011), we do not explicitly define $\bar{X} \subset S$ as a set of representative states. In the particular case in which the trees in all ensembles $\mathcal{T}^a$ have the same structure, that is, $T^a_i$ and $T^b_i$ partition $S$ in the same way, we can think of the partitions $\bar{T}_i$ as the states of $\bar{M}$. In general, though, the states $\bar{x}_i \in \bar{X}$ are not subsets of $S$.

As discussed, when Assumption (i) holds, FQIT builds the MDP $\hat{M}$ at iteration $t = 1$ and then solves it iteratively through value iteration. Based on (10) and (11) we can define an algorithm that does the same using $\bar{M}$ instead of $\hat{M}$. At any iteration $t$, the value function $\hat{Q}_t(\hat{s}_i, a)$ computed by FQIT, given by (7), can be approximated as $\bar{Q}_t(\hat{s}_i, a) = \sum_{j=1}^{m} d^a_{ij}\, \bar{Q}_t(\bar{x}_j, a)$. More generally, from the definition of $\mathbf{D}^a$ in (9) we see that the action values of a state $s$ not necessarily in the sets $S^a$ can be computed as $\bar{Q}_t(s, a) = \frac{1}{|\mathcal{T}^a|} \sum_{j=1}^{m} \mathbb{I}\{s \in \bar{T}^a_j\}\, \bar{Q}_t(\bar{x}_j, a)$. We call the proposed algorithm tree-based stochastic factorization (TBSF).

One immediate advantage of replacing FQIT with TBSF is the potential reduction in computational cost. Let $\eta_{\min}$ denote the minimum number of points required to split a node during the construction of the trees (see Section 6). Each iteration of FQIT involves the construction (or update) of $|A|$ ensembles of trees, each one requiring at least $O(|\mathcal{T}^a| n_a \log(n_a / \eta_{\min}))$ operations, plus the improvement of the current decision policy, which is $O(n |A|)$. In contrast, TBSF executes $O(|\mathcal{T}^a| n_a \log(n_a / \eta_{\min}))$ operations only once per action, to build $\bar{\mathbf{P}}^a$ and $\bar{\mathbf{r}}^a$, and then performs only $O(m^2 |A|)$ operations in each iteration, assuming the worst-case scenario where the matrices $\bar{\mathbf{P}}^a$ are dense. As one can see, when $m \ll n$, replacing FQIT with TBSF can result in significant gains in terms of computation.
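As a concrete illustration of (10) and (11), the sketch below builds the compact transition matrix and reward vector for a single action by counting transitions between partitions. The representation of each tree as a function returning a global partition index in {0, ..., m-1} is a simplifying assumption made for the example, not the implementation used in the paper.

```python
# Hedged sketch of building the compact model in (10) and (11) for one action,
# assuming each tree maps a state to a *global* partition index (partitions of
# different trees get disjoint index ranges). These conventions are illustrative.
import numpy as np

def build_compact_model(transitions, trees, m):
    """transitions: list of (s, r, s_next) tuples collected for action a.
    trees: list of |T^a| leaf-index functions. Returns (P_bar, r_bar)."""
    counts = np.zeros((m, m))   # z_ij: samples starting in partition i, ending in j
    starts = np.zeros(m)        # |T_bar^a_i|: samples whose state lies in partition i
    rew_sum = np.zeros(m)       # sum of rewards of samples starting in partition i
    n_trees = len(trees)
    for s, r, s_next in transitions:
        start_parts = [leaf_of(s) for leaf_of in trees]
        end_parts = [leaf_of(s_next) for leaf_of in trees]
        for i in start_parts:
            starts[i] += 1.0
            rew_sum[i] += r
            for j in end_parts:
                counts[i, j] += 1.0
    P_bar = np.zeros((m, m))
    r_bar = np.zeros(m)
    nonempty = starts > 0
    # Eq. (10): divide the counts by |T^a| * |T_bar^a_i|; Eq. (11): average reward.
    P_bar[nonempty] = counts[nonempty] / (n_trees * starts[nonempty][:, None])
    r_bar[nonempty] = rew_sum[nonempty] / starts[nonempty]
    return P_bar, r_bar
```

The same routine would be called once per action; a single pass over the batch suffices, after which the transitions could in principle be discarded.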
There is, however, a stronger reason to adopt TBSF instead of FQIT. Suppose that at iteration $t$ a new batch of sample transitions $S^a_2 \equiv \{(s^a_i, r^a_i, \hat{s}^a_i)\}$ becomes available, with $a \in A$. One way of incorporating the new data into FQIT's approximation is to set $\hat{Q}_t(\hat{s}^a_i, a) = 0$ for all states $\hat{s}^a_i \in S^a_2$ and then proceed normally from there. However, since FQIT's mechanics require that all transitions used in the approximation remain available, as shown in (2), there is an inbuilt limit on the amount of data that this algorithm can process. This is why FQIT is inherently a batch-mode algorithm. In contrast, TBSF can extract the information contained in the sample transitions and then discard them, which means that it can also be used in the on-line regime.

Let $S_1 \equiv S^a$ and $S_2 \equiv S^a_2$ (we drop the superscript $a$ in the sets $S^a$ and $S^a_2$ to improve readability). Let $\bar{\mathbf{P}}^{S_1}$ and $\bar{\mathbf{r}}^{S_1}$ be the matrix $\bar{\mathbf{P}}^a$ and vector $\bar{\mathbf{r}}^a$ computed by TBSF through (10) and (11) using only the $n_a$ transitions in $S_1$. We assume that we have already discarded the data in $S_1$ and want to compute $\bar{\mathbf{P}}^{S_1 \cup S_2}$ and $\bar{\mathbf{r}}^{S_1 \cup S_2}$ from $\bar{\mathbf{P}}^{S_1}$, $\bar{\mathbf{r}}^{S_1}$, and $S_2$. To have an algorithm that is as general as possible, we consider the possibility that the ensembles change with the new data, either because new partitions are added to the trees already in the ensemble or because new trees are added altogether. We denote the two ensembles by $\mathcal{T}^{S_1}$ and $\mathcal{T}^{S_1 \cup S_2}$. We also define $m'_i$ as the number of partitions associated with the $i$th tree of $\mathcal{T}^{S_1 \cup S_2}$ and set $m' = \sum_{i=1}^{|\mathcal{T}^{S_1 \cup S_2}|} m'_i$. If $m' > m$, we set $\bar{r}^a_l = 0$ and $\bar{p}^a_{lj} = \bar{p}^a_{jl} = 0$ for $l = m+1, m+2, ..., m'$, $j = 1, 2, ..., m'$, and all $a \in A$.

We start by rewriting (10) as $\bar{p}^{S_1}_{ij} = z^{S_1}_{ij} / (|\mathcal{T}^{S_1}|\, z^{S_1}_i)$, where $z^{S_1}_i = \sum_{l=1}^{n_a} \mathbb{I}\{s_l \in \bar{T}^a_i\}$ and $z^{S_1}_{ij} = \sum_{l=1}^{n_a} \mathbb{I}\{s_l \in \bar{T}^a_i \text{ and } \hat{s}_l \in \bar{T}^a_j\}$ (note that $\mathrm{act}(l) = a$ for all $l$). Then, if we define the variables $z^{S_2}_i$ and $z^{S_2}_{ij}$ in an analogous way, we have

$\bar{p}^{S_1 \cup S_2}_{ij} = \dfrac{z^{S_1}_{ij} + z^{S_2}_{ij}}{|\mathcal{T}^{S_1 \cup S_2}|\, (z^{S_1}_i + z^{S_2}_i)}$.   (12)

In order to avoid keeping $z^{S_1}_{ij}$ for all $i, j$, we rewrite (12) as

$\bar{p}^{S_1 \cup S_2}_{ij} = \dfrac{\bar{p}^{S_1}_{ij}\, |\mathcal{T}^{S_1}|\, z^{S_1}_i + z^{S_2}_{ij}}{|\mathcal{T}^{S_1 \cup S_2}|\, (z^{S_1}_i + z^{S_2}_i)}$.   (13)

Similarly, if we define $b^{S_1}_i = \sum_{j=1}^{n_a} \mathbb{I}\{s_j \in \bar{T}^{S_1}_i\}\, r_j$ and proceed in the same way as above starting from (11), we arrive at the update rule

$\bar{r}^{S_1 \cup S_2}_i = \dfrac{\bar{r}^{S_1}_i\, z^{S_1}_i + b^{S_2}_i}{z^{S_1}_i + z^{S_2}_i}$.   (14)

Since $z^{S_2}_i$, $z^{S_2}_{ij}$, and $b^{S_2}_i$ are computed based on $S_2$ alone, and $|\mathcal{T}^{S_1 \cup S_2}|$ is readily available, in order to have an incremental algorithm we only have to keep $|\mathcal{T}^{S_1}|$ and $z^{S_1}_i$ for $i = 1, 2, ..., m$. Note that, when $\mathcal{T}^{S_1} = \mathcal{T}^{S_1 \cup S_2}$, the model computed through (13) and (14) is the same that would be built if the transitions in $S_1 \cup S_2$ were used all at once.

Algorithm 1 shows a generic step-by-step description of TBSF. As one can see, several variations of the algorithm are possible. If the variable $it_{\max}$ defined in line 3 is set to 1, TBSF reduces to a batch method; otherwise it is incremental. If $it_{\max} > 1$ and the distribution $\mu$ is updated based on $\bar{Q}$ (line 8), TBSF is an on-line method. If in addition $n = 1$, we have an algorithm that does not store sample transitions (line 4). Finally, if in line 5 new trees or new partitions are created when $it > 1$, we have an adaptive method (this is akin to relaxing Assumption (i) for FQIT). It is also trivial to modify the algorithm for the scenario where the sample transitions are given in advance.
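To illustrate how (13) and (14) keep the model up to date without storing past transitions, the sketch below updates the compact model of the previous example with a new batch, under the simplifying assumption that the ensemble structure does not change between batches; the conventions and helper names are again hypothetical.

```python
# Minimal sketch of the incremental update in (13)-(14), assuming the ensemble
# structure is unchanged between batches (T^{S1} = T^{S1 u S2}). Only the
# per-partition counters z_i and the model itself are kept between batches;
# the raw transitions can be discarded. Conventions follow the previous sketch.
import numpy as np

def incremental_update(P_bar, r_bar, z, new_transitions, trees):
    """P_bar, r_bar: current compact model; z: counters z_i from past batches.
    new_transitions: list of (s, r, s_next) for one action. Returns updated model."""
    m = len(z)
    n_trees = len(trees)
    z2 = np.zeros(m)            # z_i^{S2}
    z2_pair = np.zeros((m, m))  # z_ij^{S2}
    b2 = np.zeros(m)            # b_i^{S2}: reward sums for the new batch
    for s, r, s_next in new_transitions:
        start_parts = [leaf_of(s) for leaf_of in trees]
        end_parts = [leaf_of(s_next) for leaf_of in trees]
        for i in start_parts:
            z2[i] += 1.0
            b2[i] += r
            for j in end_parts:
                z2_pair[i, j] += 1.0
    denom = z + z2
    nonempty = denom > 0
    # Eq. (13): recover z_ij^{S1} as p_bar_ij * |T| * z_i, add new counts, renormalize.
    num = P_bar * (n_trees * z[:, None]) + z2_pair
    P_bar[nonempty] = num[nonempty] / (n_trees * denom[nonempty][:, None])
    # Eq. (14): weighted average of the old reward estimate and the new reward sums.
    r_bar[nonempty] = (r_bar[nonempty] * z[nonempty] + b2[nonempty]) / denom[nonempty]
    return P_bar, r_bar, denom  # denom holds the updated counters z_i
```

Only the counters $z_i$, the number of trees, and the current model need to be carried from one batch to the next, which is what makes this update suitable for line 6 of Algorithm 1.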
5 Theoretical Analysis

In this section we analyze some properties of TBSF. We will focus on the batch version of the algorithm, that is, the case in which $it_{\max} = 1$ in Algorithm 1. We will assume that Assumptions (i) and (ii) hold, and compare the approximation $\hat{Q}_t$ computed by FQIT after $t$ iterations with the approximation $\bar{Q}_t$ computed by TBSF after $t$ applications of the value-iteration update rule in $\bar{M}$ (line 7 of Algorithm 1).

Algorithm 1 TBSF
1: Input: $\mathrm{fit}$ (algorithm used to construct the trees); $\mu$ (distribution over $S \times A$)
2: Output: approximate value function $\bar{Q}$
3: for $it \leftarrow 1, 2, ..., it_{\max}$ do
4:     collect $n$ transitions based on $\mu$ and store them in $S^a$
5:     $\mathcal{T}^a \leftarrow \mathrm{fit}(S^a)$ for all $a \in A$   (construct or grow the trees)
6:     update $\bar{M}$ using (13) and (14)
7:     apply $t$ iterations of value iteration to $\bar{Q}$
8:     modify $\mu$ based on $\bar{Q}(s, a)$   (optional)
9:     $S^a \leftarrow \emptyset$ for all $a \in A$   (discard the transitions)

Our first result depends on the following assumption: (iii) the number of trees in the ensembles is the same, that is, $|\mathcal{T}^a| = |\mathcal{T}|$ for all $a \in A$. Note that the structure of the trees can still differ.

Let $\mathbf{r}^+$ be the vector whose elements are $r^+_i = \max_{a \in A} \bar{r}^a_i$. Analogously, let $\mathbf{r}^-$ be the vector with entries $r^-_i = \min_{a \in A} \bar{r}^a_i$. Let $R^+_{\max}$ be the sum of the $|\mathcal{T}|$ largest elements of $\mathbf{r}^+$ and let $R^-_{\min}$ be the sum of the $|\mathcal{T}|$ smallest elements of $\mathbf{r}^-$. Then, we can show the following:

Proposition 1. For $t > 0$,

$|\hat{Q}_t(\hat{s}_i, a) - \bar{Q}_t(\hat{s}_i, a)| \le \dfrac{1}{|\mathcal{T}|} \cdot \dfrac{\gamma - \gamma^t}{1 - \gamma} \left( R^+_{\max} - R^-_{\min} \right)$

for any $\hat{s}_i$ and any $a$.

The proof of Proposition 1 is in the supplementary material (Barreto 2014). Note that the derived bound corresponds to the maximum difference between the values of two states belonging to an MDP whose rewards are restricted to $[R^-_{\min}/|\mathcal{T}|,\; R^+_{\max}/|\mathcal{T}|]$. This reflects the special structure that the trees impose on the MDPs $\hat{M}$ and $\bar{M}$; in the general case the interval above would be replaced by $[\min_{a,i} \bar{r}^a_i,\; \max_{a,i} \bar{r}^a_i]$, which is a superset of its counterpart.

For the next results we will need an extra assumption: (iv) the $|\mathcal{T}|$ trees in the ensembles have the same structure, that is, $T^a_i$ and $T^b_i$ define the same partitioning of the state space, for any $i$, $a$, and $b$.

From the definition of $\mathbf{D}^a$ in (9) we know that Assumption (iv) implies that $\mathbf{D}^a = \mathbf{D}$ for all $a \in A$. In this case, the transformation computed by TBSF reduces to Sorg and Singh's (2009) concept of a soft homomorphism, and all results by these authors apply. In particular, we know that if, for any $i$, $\hat{s}_i \in \bar{T}_j$ and $\hat{s}_i \in \bar{T}_l$ imply that $\bar{\pi}^*(\bar{x}_j) = \bar{\pi}^*(\bar{x}_l)$, where $\bar{\pi}^*$ is the optimal policy of $\bar{M}$, then the value functions computed by FQIT and TBSF coincide (see Sorg and Singh's Corollary 2, 2009). As a consequence, if a single tree is used in the ensembles the difference $|\bar{Q}_t(\hat{s}_i, a) - \hat{Q}_t(\hat{s}_i, a)|$ is zero.

Still assuming that (iv) holds, we can show the following result. Let $\mathbf{r}^{\mathrm{dif}}$ be the vector whose elements are $r^{\mathrm{dif}}_i = \max_a \bar{r}^a_i - \min_a \bar{r}^a_i$. Let $R^{\mathrm{dif}}$ be the sum of the $|\mathcal{T}|$ largest elements of $\mathbf{r}^{\mathrm{dif}}$ and let $R^+_{\min}$ be the sum of the $|\mathcal{T}|$ smallest elements of $\mathbf{r}^+$. Then,

Proposition 2. For $t > 0$,

$0 \le \bar{Q}_t(\hat{s}_i, a) - \hat{Q}_t(\hat{s}_i, a) \le \dfrac{1}{|\mathcal{T}|} \left[ \gamma R^{\mathrm{dif}} + \dfrac{\gamma^2 - \gamma^t}{1 - \gamma} \left( R^+_{\max} - R^+_{\min} \right) \right]$

for any $\hat{s}_i$ and any $a$.

The proof of Proposition 2 is also in the supplementary material (Barreto 2014). Note that increasing the number $|\mathcal{T}|$ of trees in the ensembles has two effects on our bounds. On one hand it decreases the scalar $1/|\mathcal{T}|$ multiplying the right-hand side of the bounds; on the other hand it may also increase $R^+_{\max} - R^-_{\min}$, $R^{\mathrm{dif}}$, and $R^+_{\max} - R^+_{\min}$, since the number of partitions $\bar{T}_i$ grows. The overall effect of increasing $|\mathcal{T}|$ will therefore depend on the structure of the trees.
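For concreteness, the quantities appearing in the propositions can be computed directly from the compact reward vectors. The short helper below evaluates the right-hand side of Proposition 1 for given $\bar{\mathbf{r}}^a$, $\gamma$, $t$, and $|\mathcal{T}|$; it is an illustrative aid, not part of the paper.

```python
# Illustrative evaluation of the bound in Proposition 1 from the compact reward
# vectors r_bar^a (one per action), the discount gamma, the iteration t, and the
# number of trees. This helper is a sketch, not part of the original paper.
import numpy as np

def proposition1_bound(r_bar_per_action, n_trees, gamma, t):
    R = np.vstack(r_bar_per_action)                # shape (|A|, m)
    r_plus = R.max(axis=0)                         # r+_i = max_a r_bar^a_i
    r_minus = R.min(axis=0)                        # r-_i = min_a r_bar^a_i
    R_plus_max = np.sort(r_plus)[-n_trees:].sum()  # sum of the |T| largest r+_i
    R_minus_min = np.sort(r_minus)[:n_trees].sum() # sum of the |T| smallest r-_i
    return (1.0 / n_trees) * (gamma - gamma ** t) / (1.0 - gamma) \
        * (R_plus_max - R_minus_min)
```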
6 Experiments

We compare the performance of our algorithm with that of FQIT combined with Geurts, Ernst, and Wehenkel's extra-trees algorithm (2006). This combination generated the best results in the extensive empirical evaluation performed by Ernst, Geurts, and Wehenkel (2005). When combined with the extra-trees algorithm, FQIT has four parameters; in our experiments we used 200 iterations and $\dim(S)$ candidate cut points to build the trees, and varied $|\mathcal{T}|$ and $\eta_{\min}$ (during the construction of the trees, a node is split only if it contains at least $\eta_{\min}$ points; hence, this parameter can be seen as an indirect way of setting the number of partitions). We also combined TBSF with the extra-trees algorithm, and the same parameters were used to build the trees. Since the extra-trees algorithm does not allow precise control over the size of the trees, we used trees with the same structure for all actions. As in this case it does not make sense to build the trees based on $Q_1$, the criterion used to select the cut points was simply the difference in the cardinality $|\bar{T}_i|$ of the resulting partitions. To serve as a reference, we also evaluated a version of FQIT that used a fixed ensemble of trees constructed in the exact same way, referred to as FQIT-F.

We first use the mountain-car task as a proof of concept (Singh and Sutton 1996). In this case, an exploration policy that selects actions uniformly at random was used to collect a batch of $n$ sample transitions. The algorithms' results, shown in Figure 1, correspond to the performance of the greedy policy computed using this data.² As shown in the figure, the algorithms' performance improves with $|\mathcal{T}|$ and $n$, as expected. In general, TBSF's results are intermediate between FQIT's and FQIT-F's. The good performance of FQIT suggests that adapting the structure of the trees at each iteration may play an important role. This ability comes at a price, though: as shown in Figure 1, FQIT's run time is orders of magnitude greater than TBSF's. This opens the possibility of using the latter to process larger amounts of data, as we explore next.

² See the supplementary material for similar experiments with the puddle-world task (Barreto 2014).

[Figure 1: Results on the mountain-car task. All the algorithms used $\eta_{\min} = 30$ to build the trees. In the first two plots $n = 10000$ and in the last two $|\mathcal{T}| = 30$. Policies were evaluated on a set of 25 states evenly distributed over $[-1.00, 0.07] \times [-0.15, 0.02]$. Shadows represent one standard error over 50 runs.]

We revisit one of the most successful applications of FQIT: an important medical problem which we will refer to as the HIV domain (Adams et al. 2004; Ernst et al. 2006). Despite the effectiveness of drug cocktails in maintaining low HIV viral loads, there are several complications associated with their long-term use. This has attracted the interest of the scientific community to the problem of optimizing drug-scheduling strategies. The problem can be formulated as a reinforcement-learning task in which the actions correspond to the types of cocktail that should be administered.
The problem has a 6-dimensional state space and 4 actions. Following Ernst et al. (2006), we performed our experiments using a realistic model that describes the interaction of the immune system with HIV (Adams et al. 2004). We also adopted the protocol proposed by these authors to collect data: starting from a batch of 6000 transitions generated by a random policy, corresponding to the treatment of 30 patients, the algorithms computed an initial approximation of the problem's value function. Based on this approximation, a 0.15-greedy policy was used to collect a second batch of 6000 transitions, which was merged with the first. This process was repeated for 10 rounds. As one can see, this strategy for collecting transitions fits well with the on-line version of TBSF. Therefore, instead of merging the successive batches, we used Algorithm 1 and simply discarded the data after each round. Since in this case the difference between FQIT's and TBSF's computational cost is even bigger, we repeated the experiment with the latter assuming that we had more data available, first from 300 and then from 600 patients. In all other aspects FQIT and TBSF were configured as before.

The results of the experiment are shown in Figure 2. As one can see, when FQIT and TBSF use the same amount of data the former outperforms the latter by a large margin. This is expected, since TBSF was used with a fixed ensemble of trees built solely on the first batch of transitions. However, when the number of transitions used by TBSF increases, so does the quality of the resulting policies, which eventually become competitive with FQIT's. When using $\eta_{\min} = 10$ with the augmented datasets, TBSF is still faster than FQIT and produces policies of about the same quality.

[Figure 2: Results on the HIV domain, comparing a random policy, FQIT trained on data from 30 patients, and TBSF trained on data from 30, 300, and 600 patients; the vertical axis shows the return on a logarithmic scale. The algorithms were run with $|\mathcal{T}| = 30$; the number beside each rectangle is $\eta_{\min}$. Policies were evaluated for 5000 days starting from a state representing an unhealthy condition of the patients (Ernst et al. 2006). The length of the rectangles' edges represents one standard error over 50 runs.]

7 Conclusion

In this paper we applied the stochastic-factorization trick to FQIT. The resulting algorithm, TBSF, inherits many desirable properties of its precursor: it is simple to implement, flexible, scalable to high-dimensional spaces, and robust to noise. The main difference between FQIT and TBSF is in the way they construct their models: while the former builds an MDP whose states are the $n$ end states in the sample transitions, the latter constructs an MDP whose size is the number $m$ of partitions in the ensembles of trees. This has two consequences. First, when $m \ll n$, TBSF is orders of magnitude faster than FQIT. Second, TBSF can be used on-line.

We showed that the distance between the value functions computed by FQIT and TBSF is bounded. Empirically, we verified that FQIT tends to outperform the non-adaptive version of TBSF, especially if the latter is used on-line. However, since TBSF's computational cost is significantly lower than FQIT's, it can process more data in the same amount of time, which can result in better performance. Besides, in problems that are inherently on-line TBSF is a natural candidate to replace FQIT. An interesting direction for future research would be to study the adaptive version of TBSF.
Acknowledgments

The author would like to thank the anonymous reviewers for their comments and suggestions.

References

Adams, B.; Banks, H.; Kwon, H.; and Tran, H. 2004. Dynamic multidrug therapies for HIV: optimal and STI control approaches. Mathematical Biosciences and Engineering 1(2):223–241.

Antos, A.; Munos, R.; and Szepesvári, C. 2007. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems (NIPS), 9–16.

Barreto, A. M. S., and Fragoso, M. D. 2011. Computing the stationary distribution of a finite Markov chain through stochastic factorization. SIAM Journal on Matrix Analysis and Applications 32:1513–1523.

Barreto, A. M. S.; Precup, D.; and Pineau, J. 2011. Reinforcement learning using kernel-based stochastic factorization. In Advances in Neural Information Processing Systems (NIPS), 720–728.

Barreto, A. M. S.; Precup, D.; and Pineau, J. 2012. On-line reinforcement learning using incremental kernel-based stochastic factorization. In Advances in Neural Information Processing Systems (NIPS), 1484–1492.

Barreto, A. M. S. 2014. Tree-based on-line reinforcement learning: Supplementary material. Available on-line.

Ernst, D.; Stan, G.-B.; Gonçalves, J.; and Wehenkel, L. 2006. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the IEEE Conference on Decision and Control (CDC), 124–131.

Ernst, D.; Geurts, P.; and Wehenkel, L. 2005. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6:503–556.

Farahmand, A. M.; Ghavamzadeh, M.; Szepesvári, C.; and Mannor, S. 2009. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In Proceedings of the American Control Conference, 725–730.

Geurts, P.; Ernst, D.; and Wehenkel, L. 2006. Extremely randomized trees. Machine Learning 63(1):3–42.

Gordon, G. J. 1995. Stable function approximation in dynamic programming. In Proceedings of the International Conference on Machine Learning (ICML), 261–268.

Kalyanakrishnan, S., and Stone, P. 2007. Batch reinforcement learning in a complex domain. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, 650–657.

Littman, M. L.; Dean, T. L.; and Kaelbling, L. P. 1995. On the complexity of solving Markov decision problems. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 394–402.

Ormoneit, D., and Sen, S. 2002. Kernel-based reinforcement learning. Machine Learning 49(2–3):161–178.

Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.

Riedmiller, M. 2005. Neural fitted Q-iteration: first experiences with a data efficient neural reinforcement learning method. In Proceedings of the European Conference on Machine Learning (ECML), 317–328. Springer.

Singh, S. P., and Sutton, R. S. 1996. Reinforcement learning with replacing eligibility traces. Machine Learning 22(1–3):123–158.

Sorg, J., and Singh, S. 2009. Transfer via soft homomorphisms. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 741–748.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press.