# Tree-Based On-Line Reinforcement Learning

André da Motta Salles Barreto
Laboratório Nacional de Computação Científica
Petrópolis, RJ, Brazil

Abstract

Fitted Q-iteration (FQI) stands out among reinforcement learning algorithms for its flexibility and ease of use. FQI can be combined with any regression method, and this choice determines the algorithm's statistical and computational properties. The combination of FQI with an ensemble of regression trees gives rise to an algorithm, FQIT, that is computationally efficient, scalable to high-dimensional spaces, and robust to noise. Despite its nice properties and good performance in practice, FQIT also has a limitation: the fact that an ensemble of trees must be constructed (or updated) at each iteration confines the algorithm to the batch scenario. This paper aims to address this specific issue. Based on a strategy recently proposed in the literature, called the stochastic-factorization trick, we propose a modification of FQIT that makes it fully incremental, and thus suitable for on-line learning. We call the resulting method tree-based stochastic factorization (TBSF). We derive upper bounds for the difference between the value functions computed by FQIT and TBSF, and also show in which circumstances the two approximations coincide. A series of computational experiments is presented to illustrate the properties of TBSF and to show its usefulness in practice, including a medical problem involving the treatment of patients infected with HIV.

1 Introduction

Batch reinforcement-learning algorithms learn a decision policy based on a fixed set of sample transitions. Among them, fitted Q-iteration (FQI) stands out for its flexibility and ease of use. Probably because of this, FQI has been adopted by researchers and practitioners, and today it is possible to find several reports of successful applications of the algorithm (Ernst, Geurts, and Wehenkel 2005; Riedmiller 2005; Ernst et al. 2006; Kalyanakrishnan and Stone 2007).

FQI can be combined with any regression method, and this choice determines the algorithm's statistical and computational properties. In their original work, Ernst, Geurts, and Wehenkel (2005) suggest the use of an ensemble of regression trees. The authors point out several advantages of these models, such as their computational efficiency, their scalability to high-dimensional spaces, and their robustness with respect to irrelevant variables, outliers, and noise. The combination of FQI with regression trees (FQIT) has also been extensively tested in practice, in general with very good results (Ernst, Geurts, and Wehenkel 2005; Ernst et al. 2006).

Despite its nice properties and good performance in practice, FQIT also has a limitation: the fact that an ensemble of trees must be constructed (or updated) at each iteration confines the algorithm to the batch scenario. This paper aims to address this specific issue. Based on a strategy recently proposed in the literature, called the stochastic-factorization trick, we propose a modification of FQIT that makes its computational complexity independent of the number of sample transitions. The resulting method, tree-based stochastic factorization (TBSF), is significantly faster than its precursor and can be used on-line. We derive TBSF and analyze it theoretically and empirically.
We prove that the distance between the approximations computed by FQIT and TBSF is bounded. We also show empirically that, since our algorithm can process more sample transitions in the same amount of time, it can match FQIT's performance using less computation.

2 Reinforcement Learning

We consider the basic framework of reinforcement learning: an agent interacts with an environment and selects actions in order to maximize the amount of reward received in the long run (Sutton and Barto 1998). As usual, we assume that this interaction can be modeled as a Markov decision process (MDP; Puterman 1994). An MDP is a tuple $M \equiv (S, A, P^a, r^a, \gamma)$, where $S$ is the state space and $A$ is the (finite) action set. For each action $a \in A$, $P^a(\cdot|s)$ is the next-state distribution upon taking action $a$ in state $s$. The reward received at transition $s \xrightarrow{a} s'$ is given by $R^a(s, s')$. Usually, one is interested in the expected reward resulting from the execution of $a$ in $s$, that is, $r^a(s) = \mathrm{E}_{s' \sim P^a(\cdot|s)}\{R^a(s, s')\}$. The discount factor $\gamma \in [0, 1)$ gives smaller weights to rewards received further in the future. The algorithm presented in this paper can be applied to MDPs with continuous and discrete state spaces, and in both cases the strategy will be to derive a discrete model. When both the state and action spaces are finite, an MDP can be represented in matrix form: each function $P^a$ becomes a matrix $\mathbf{P}^a \in \mathbb{R}^{|S| \times |S|}$, with $p^a_{ij} = P^a(s_j|s_i)$, and each function $r^a$ becomes a vector $\mathbf{r}^a \in \mathbb{R}^{|S|}$, with $r^a_i = r^a(s_i)$.¹

¹ Throughout the paper vectors will be denoted by lowercase boldface letters and matrices will be denoted by uppercase boldface letters.

The goal of the agent is to find an optimal policy, that is, a mapping $\pi: S \mapsto A$ that maximizes the expected sum of discounted rewards. When the MDP is finite, dynamic programming can be used to find an optimal policy $\pi^* \in A^{|S|}$ in polynomial time (Littman, Dean, and Kaelbling 1995). Starting from $Q_0 \equiv 0$, the value-iteration update rule,

$Q_{t+1}(s_i, a) = r^a(s_i) + \gamma \sum_{j=1}^{|S|} p^a_{ij} \max_{b \in A} Q_t(s_j, b)$, with $t \ge 0$,

gives the optimal $t$-step action-value function, from which the $t$-step policy can be obtained by selecting any $\pi^{(t)}(s_i) \in \mathrm{argmax}_a\, Q_t(s_i, a)$. As $t \to \infty$ the function $Q_t(s_i, a)$ approaches $Q^*(s_i, a)$, the value function associated with all optimal policies $\pi^*$.

In reinforcement learning it is assumed that the MDP is unknown, and the agent must learn a policy based on transitions sampled from the environment. If the process of learning a decision policy is based on a fixed set of sample transitions, we call it batch reinforcement learning. In on-line reinforcement learning the computation of a decision policy takes place concomitantly with the collection of data.

3 Fitted Q-Iteration

Fitted Q-iteration (FQI) is a batch reinforcement-learning algorithm proposed by Ernst, Geurts, and Wehenkel (2005) based on the works by Gordon (1995) and Ormoneit and Sen (2002). Let $S^a \equiv \{(s^a_i, r^a_i, \hat{s}^a_i)\}$, $i = 1, 2, ..., n_a$, be the sample transitions associated with action $a \in A$, where $s^a_i, \hat{s}^a_i \in S$ and $r^a_i \in \mathbb{R}$. FQI keeps an approximation $\hat{Q}: S \times A \mapsto \mathbb{R}$ which is successively updated in the following way. Starting from $\hat{Q}_0 \equiv 0$, at each iteration $t > 0$ a training set $H^t \equiv \{[(s^a_i, a), Q_t(s^a_i, a)]\}$ composed of $n = \sum_{a \in A} n_a$ input-output pairs is constructed, with $Q_t(s^a_i, a) = r^a_i + \gamma \max_b \hat{Q}_{t-1}(\hat{s}^a_i, b)$. Then, a regression procedure uses this training set to compute the next approximation, $\hat{Q}_t = \mathrm{fit}(H^t)$. The combination of the function approximator $\hat{Q}(s, a)$ with the regression method $\mathrm{fit}(\cdot)$ gives rise to different instances of FQI.
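To make the generic procedure concrete, the sketch below implements the FQI loop just described with a pluggable regressor. It is a minimal illustration, not the authors' code: the transition format, the hyperparameters, and the choice of scikit-learn's ExtraTreesRegressor as the $\mathrm{fit}(\cdot)$ step are assumptions made for the example.

```python
# Minimal sketch of generic fitted Q-iteration (FQI), assuming transitions are
# given as (state, action, reward, next_state) tuples with a finite action set.
# The regressor choice (ExtraTreesRegressor) is an illustrative assumption.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(transitions, actions, gamma=0.98, n_iterations=50):
    n = len(transitions)
    states = np.array([t[0] for t in transitions], dtype=float).reshape(n, -1)
    acts = np.array([t[1] for t in transitions], dtype=float).reshape(-1, 1)
    rewards = np.array([t[2] for t in transitions], dtype=float)
    next_states = np.array([t[3] for t in transitions], dtype=float).reshape(n, -1)
    X = np.hstack([states, acts])        # regression inputs: (s, a) pairs
    q_hat = None                         # corresponds to Q_hat_0 = 0
    for _ in range(n_iterations):
        if q_hat is None:
            targets = rewards            # first iteration: targets are the rewards
        else:
            # max_b Q_hat_{t-1}(s', b) evaluated at every sampled next state
            q_next = np.column_stack([
                q_hat.predict(np.hstack([next_states,
                                         np.full((n, 1), b, dtype=float)]))
                for b in actions])
            targets = rewards + gamma * q_next.max(axis=1)
        q_hat = ExtraTreesRegressor(n_estimators=30).fit(X, targets)
    return q_hat
```

Any other regression model exposing fit and predict could be substituted for the ensemble of trees, which is precisely the flexibility discussed above.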
Here we will focus on approximators of the form $\hat{Q}(s, a) = \sum_{b \in A} \sum_{j=1}^{n_b} \kappa^a(s, s^b_j)\, \hat{Q}(s^b_j, a)$, where $\kappa^a(s, s^b_j): S \times S \mapsto \mathbb{R}$ can be intuitively seen as a measure of the similarity between $s$ and $s^b_j$. We will assume that $\kappa^a(s, s^b_j) = 0$ if $a \neq b$, which simplifies the expression above to

$\hat{Q}(s, a) = \sum_{j=1}^{n_a} \kappa^a(s, s^a_j)\, \hat{Q}(s^a_j, a)$.   (1)

When $\hat{Q}_t(s, a)$ satisfies (1) it is fully defined by the values of the states $s^a_i$; hence, FQI's regression step reduces to updating these values:

$\hat{Q}_{t+1}(s^a_i, a) = Q_{t+1}(s^a_i, a) = r^a_i + \gamma \max_b \hat{Q}_t(\hat{s}^a_i, b)$.   (2)

Combining (2) and (1), we can see that

$\hat{Q}_t(\hat{s}^a_i, b) = \sum_{j=1}^{n_b} \kappa^b(\hat{s}^a_i, s^b_j) \left[ r^b_j + \gamma \max_c \hat{Q}_{t-1}(\hat{s}^b_j, c) \right]$.   (3)

It is well known that if

$\kappa^b(\hat{s}^a_i, s^b_j) \ge 0$ and $\sum_{j=1}^{n_b} \kappa^b(\hat{s}^a_i, s^b_j) = 1$,   (4)

then $\hat{Q}_t$ converges to a fixed point as $t \to \infty$ (Gordon 1995; Ormoneit and Sen 2002). This is easy to see if we recast FQI as the solution of a derived MDP, as follows. Let $\hat{S}$ be the set composed of the $n$ states $\hat{s}^a_i$. Define

$\hat{R}^a(\hat{s}^b_i, \hat{s}^c_j) = r^c_j$ and $\hat{P}^a(\hat{s}^c_j \,|\, \hat{s}^b_i) = \kappa^a(\hat{s}^b_i, s^c_j)$.   (5)

Comparing (3) and (5), it is easy to see that FQI is equivalent to value iteration in the MDP $\hat{M} \equiv (\hat{S}, A, \hat{P}^a, \hat{r}^a, \gamma) = (\hat{S}, A, \hat{\mathbf{P}}^a, \hat{\mathbf{r}}^a, \gamma)$, where $\hat{r}^a(\hat{s}^b_i) = \sum_{c \in A} \sum_{j=1}^{n_c} \kappa^a(\hat{s}^b_i, s^c_j)\, r^c_j = \sum_{j=1}^{n_a} \kappa^a(\hat{s}^b_i, s^a_j)\, r^a_j$.

Antos, Munos, and Szepesvári (2007) have analyzed the finite-sample performance of FQI for the case in which the class of policies is restricted and the regression procedure $\mathrm{fit}(\cdot)$ is a weighted least-squares method. Farahmand et al. (2009) provide a similar analysis for the scenario where the least-squares problem is regularized. Here we will focus on the version of FQI originally proposed by Ernst, Geurts, and Wehenkel (2005), which we discuss next.

Fitted Q-Iteration Using Trees

Ernst, Geurts, and Wehenkel (2005) suggest that the approximation $\hat{Q}(s, a)$ be represented by an ensemble of regression trees. Besides all the advantages discussed in Section 1, ensembles of trees can be defined in order to satisfy (1) and (4). Let $\mathcal{T}^a$ be a set of $|\mathcal{T}^a|$ trees $T^a_i$ associated with action $a \in A$. Each $T^a_i$ partitions the state space into $m^a_i$ sets $T^a_{ij}$. The value associated with partition $T^a_{ij}$ at iteration $t$ is simply the average of the values of the states $s^a_l$ that belong to $T^a_{ij}$, that is,

$\hat{v}_t(T^a_{ij}) = \dfrac{\sum_{l=1}^{n_a} \mathbb{I}\{s^a_l \in T^a_{ij}\}\, Q_t(s^a_l, a)}{\sum_{l=1}^{n_a} \mathbb{I}\{s^a_l \in T^a_{ij}\}}$,   (6)

where $\mathbb{I}\{\cdot\}$ is the indicator function, with $\mathbb{I}\{\text{true}\} = 1$ and $\mathbb{I}\{\text{false}\} = 0$. We define $|T^a_{ij}| \equiv \sum_{l=1}^{n_a} \mathbb{I}\{s^a_l \in T^a_{ij}\}$. Recall that, when using an ensemble of trees, the value of a state $\hat{s}^a_i$ is the average value of all the partitions $\hat{s}^a_i$ belongs to, that is,

$\hat{Q}_t(\hat{s}^a_i, b) = \dfrac{1}{|\mathcal{T}^b|} \sum_{l=1}^{|\mathcal{T}^b|} \sum_{u=1}^{m^b_l} \mathbb{I}\{\hat{s}^a_i \in T^b_{lu}\}\, \hat{v}_t(T^b_{lu}) = \dfrac{1}{|\mathcal{T}^b|} \sum_{l=1}^{|\mathcal{T}^b|} \sum_{u=1}^{m^b_l} \dfrac{\mathbb{I}\{\hat{s}^a_i \in T^b_{lu}\}}{|T^b_{lu}|} \sum_{j=1}^{n_b} \mathbb{I}\{s^b_j \in T^b_{lu}\}\, Q_t(s^b_j, b)$.   (7)

There are several methods available to build a regression tree $T^a$ based on a training set $H$ (see Ernst, Geurts, and Wehenkel's article, 2005, and references therein). If we assume that the structure of the trees is fixed throughout FQIT's iterations, then we can look at them as just a specific way of defining $\kappa^b(\hat{s}^a_i, s^b_j)$. Comparing (1), (2), and (7), we see that when using an ensemble of trees

$\kappa^b(\hat{s}^a_i, s^b_j) = \dfrac{1}{|\mathcal{T}^b|} \sum_{l=1}^{|\mathcal{T}^b|} \sum_{u=1}^{m^b_l} \dfrac{1}{|T^b_{lu}|}\, \mathbb{I}\{\hat{s}^a_i \in T^b_{lu} \text{ and } s^b_j \in T^b_{lu}\}$.   (8)

It is clear that $\kappa^b(\hat{s}^a_i, s^b_j) \ge 0$. Also, if $\hat{s}^a_i \in T^b_{lu}$, then $\sum_{j=1}^{n_b} \mathbb{I}\{\hat{s}^a_i \in T^b_{lu} \text{ and } s^b_j \in T^b_{lu}\} = |T^b_{lu}|$. Since there are $|\mathcal{T}^b|$ such summations, we have $\sum_{j=1}^{n_b} \kappa^b(\hat{s}^a_i, s^b_j) = 1$, and therefore $\kappa^b(\hat{s}^a_i, s^b_j)$ satisfies (4).
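As an illustration of how (8) translates into computation, the snippet below evaluates the tree-ensemble kernel for a pair of states. It assumes, purely for the example, that each tree is represented by a function mapping a state to the index of the leaf (partition) containing it; the helper tree_kernel and this representation are hypothetical conveniences, not part of FQIT's implementation.

```python
# Illustrative sketch of the tree-ensemble kernel in (8). Each tree is assumed
# to be a callable mapping a state to the index of the leaf that contains it.
from collections import Counter

def tree_kernel(ensemble_b, states_b, s_query, j):
    """kappa^b(s_query, s^b_j): average over trees of 1/|leaf| whenever s_query
    and the j-th sample state fall in the same leaf of that tree."""
    total = 0.0
    for leaf_of in ensemble_b:                     # one leaf-index function per tree
        leaves = [leaf_of(s) for s in states_b]    # leaf of every sample state
        sizes = Counter(leaves)                    # |T^b_{lu}| for each leaf
        if leaf_of(s_query) == leaves[j]:          # same partition in this tree
            total += 1.0 / sizes[leaves[j]]
    return total / len(ensemble_b)                 # divide by |T^b|
```

By construction the weights are non-negative and, whenever the query state falls in non-empty leaves, they sum to 1 over $j$, which is the condition (4) verified above.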
The fact that each FQI iteration involves solving a regression problem with $n$ training examples makes the method impractical for larger domains, depending on the choice of the regression method $\mathrm{fit}(\cdot)$ and the computational resources available. Moreover, even if $\hat{Q}$ is updated through a simple rule like (2), the computational complexity of each FQI iteration will necessarily depend on $n$, which clearly precludes the use of the algorithm as an on-line method. In the next section we discuss a possible way to circumvent these limitations.

4 Tree-Based Stochastic Factorization

In order to derive our algorithm we will resort to the stochastic-factorization trick, an idea introduced by Barreto and Fragoso (2011) and later explored by Barreto, Precup, and Pineau (2011; 2012) in reinforcement learning. Given a finite MDP $M \equiv (S, A, \mathbf{P}^a, \mathbf{r}^a, \gamma)$, let $\mathbf{P}^a = \mathbf{D}^a \mathbf{K}^a$ be $|A|$ factorizations in which $\mathbf{D}^a \in \mathbb{R}^{n \times m}$ and $\mathbf{K}^a \in \mathbb{R}^{m \times n}$ are stochastic matrices. Also, let $\mathbf{D}^a \bar{\mathbf{r}}^a = \mathbf{r}^a$ for all $a$. Then, if we swap the factors $\mathbf{D}^a$ and $\mathbf{K}^a$, we obtain another transition matrix $\bar{\mathbf{P}}^a = \mathbf{K}^a \mathbf{D}^a$ that retains some basic properties of $\mathbf{P}^a$ (Barreto and Fragoso 2011). The insight is that, instead of solving $M$, one can solve $\bar{M} \equiv (\bar{X}, A, \bar{\mathbf{P}}^a, \bar{\mathbf{r}}^a, \gamma)$, where $\bar{X}$ is a set of $m$ states that will be defined shortly. Based on the optimal value function of $\bar{M}$, $\bar{Q}^*$, an approximation of $Q^*$ can be computed as $\tilde{Q}(s_i, a) = \sum_{j=1}^{m} d^a_{ij}\, \bar{Q}^*(\bar{x}_j, a)$. When $m \ll n$, replacing $M$ with $\bar{M}$ significantly reduces memory usage and computing time.

The stochastic-factorization trick can be applied to any finite MDP. Since FQI is equivalent to the solution of a discrete MDP $\hat{M}$ when (1) and (4) hold, as discussed in Section 3, in principle the technique described above can be used with any approximator $\hat{Q}$ that has these properties. Note though that, depending on the choice of $\hat{Q}$, it is not clear how to efficiently compute $\mathbf{D}^a \mathbf{K}^a = \hat{\mathbf{P}}^a$ and $\mathbf{D}^a \bar{\mathbf{r}}^a = \hat{\mathbf{r}}^a$ (see (5)). One possible strategy is to select a set of $m$ representative states $\bar{s}_l \in S$ and use the approximation $\kappa^b(\hat{s}^a_i, s^b_j) \approx \sum_{l=1}^{m} \kappa(\hat{s}^a_i, \bar{s}_l)\, \kappa^b(\bar{s}_l, s^b_j)$ (Barreto, Precup, and Pineau 2011). Although this is a valid strategy in general, it only provides an approximate factorization of the MDP. Besides, one has to find a way to appropriately define the representative states $\bar{s}_l$. We now show that, in contrast, in the particular case in which $\hat{Q}$ is represented by an ensemble of trees, it is possible to determine an exact factorization of the MDP $\hat{M}$ without any extra computation.

Stochastic Factorization

In order to derive our algorithm we will need two assumptions: (i) FQIT builds the ensembles of trees at iteration $t = 1$ using any algorithm and then keeps the structure of the trees fixed for $t > 1$. This means that the partitions $T^a_{ij}$ remain the same throughout the execution of the algorithm, although their values $\hat{v}(T^a_{ij})$ may change. (ii) The number of partitions in the ensembles $\mathcal{T}^a$ is the same for all $a$, that is, $\sum_{l=1}^{|\mathcal{T}^a|} m^a_l = m$ for all $a \in A$.

As mentioned before, Assumption (i) is needed to guarantee that the approximations $\hat{Q}(\hat{s}^a_i, b)$ are updated through (2). Changing the structure of the trees from one iteration to another corresponds to defining a new MDP $\hat{M}_t$ at each iteration $t$, and thus the guarantee of convergence to a fixed point is lost. We will discuss how to relax Assumption (i) later. Assumption (ii) is necessary to guarantee that the MDP $\bar{M}$ is well defined, as will become clear shortly.

In order to facilitate the exposition, we will refer to $s^a_i$ as $s_j$, with $j = \sum_{b=1}^{a-1} n_b + i$ (and similarly for $r^a_i$ and $\hat{s}^a_i$).
Similarly, we will use $\bar{T}^a_k$ to refer to $T^a_{ij}$, with $k = \sum_{l=1}^{i-1} m^a_l + j$. Finally, we introduce the function $\mathrm{act}: \{1, 2, ..., n\} \mapsto \{1, 2, ..., |A|\}$ as $\mathrm{act}(i) = 1 + \max\{a \,|\, \sum_{b=1}^{a} n_b < i\}$, where the maximum of an empty set is taken to be zero; that is, $\mathrm{act}(i)$ is the action associated with $s_i$, $r_i$, and $\hat{s}_i$. Using these definitions and (5) and (8) we can write

$\hat{p}^a_{ij} = \sum_{l=1}^{m} \dfrac{1}{|\mathcal{T}^a|}\, \mathbb{I}\{\hat{s}_i \in \bar{T}^a_l\}\, \dfrac{1}{|\bar{T}^a_l|}\, \mathbb{I}\{\mathrm{act}(j) = a \text{ and } s_j \in \bar{T}^a_l\}$.

Thus, if we define the elements of $\mathbf{D}^a$ and $\mathbf{K}^a$ as

$d^a_{ij} = \dfrac{1}{|\mathcal{T}^a|}\, \mathbb{I}\{\hat{s}_i \in \bar{T}^a_j\}$ and $k^a_{ij} = \dfrac{1}{|\bar{T}^a_i|}\, \mathbb{I}\{\mathrm{act}(j) = a \text{ and } s_j \in \bar{T}^a_i\}$,   (9)

it is clear that $\mathbf{D}^a \mathbf{K}^a = \hat{\mathbf{P}}^a$ for all $a$. In order to have a complete factorization of the MDP $\hat{M}$, it remains to find vectors $\bar{\mathbf{r}}^a$ such that $\mathbf{D}^a \bar{\mathbf{r}}^a = \hat{\mathbf{r}}^a$ for all $a \in A$. We know from (5) that $\hat{r}^a(\hat{s}_i) = \sum_{j=1}^{n} \kappa^a(\hat{s}_i, s_j)\, r_j$, which implies that $\hat{\mathbf{r}}^a = \hat{\mathbf{P}}^a \mathbf{r}$ ($\mathbf{r} \in \mathbb{R}^n$ is the vector composed of the sampled rewards). Now, if we define $\bar{\mathbf{r}}^a = \mathbf{K}^a \mathbf{r}$, we can write $\hat{\mathbf{r}}^a = \hat{\mathbf{P}}^a \mathbf{r} = \mathbf{D}^a \mathbf{K}^a \mathbf{r} = \mathbf{D}^a \bar{\mathbf{r}}^a$, and we have an exact factorization of $\hat{M}$.

Once we have determined $\mathbf{D}^a$, $\mathbf{K}^a$, and $\bar{\mathbf{r}}^a$, we can apply the stochastic-factorization trick to the MDP $\hat{M}$. Let $\bar{\mathbf{P}}^a = \mathbf{K}^a \mathbf{D}^a$ for all $a \in A$. From (9), we have

$\bar{p}^a_{ij} = \dfrac{1}{|\mathcal{T}^a|\, |\bar{T}^a_i|} \sum_{l=1}^{n} \mathbb{I}\{\mathrm{act}(l) = a \text{ and } s_l \in \bar{T}^a_i \text{ and } \hat{s}_l \in \bar{T}^a_j\}$.   (10)

Thus, we simply count the number of transitions in $S^a$ that started in $\bar{T}^a_i$ and ended in $\bar{T}^a_j$, and divide it by the number of transitions that started in this partition. Since the states $s_i$ and $\hat{s}_i$ simultaneously belong to $|\mathcal{T}^a|$ partitions $\bar{T}^a_i$, in order to obtain $\bar{p}^a_{ij}$ this fraction must be further divided by $|\mathcal{T}^a|$. As for the rewards, from the definition of $\bar{\mathbf{r}}^a$ we have

$\bar{r}^a_i = \dfrac{1}{|\bar{T}^a_i|} \sum_{j=1}^{n} \mathbb{I}\{\mathrm{act}(j) = a \text{ and } s_j \in \bar{T}^a_i\}\, r_j$,   (11)

which is simply the average of the rewards $r_j$ associated with the states $s_j$ that belong to $\bar{T}^a_i$. We see an interesting inversion here: while in the MDP $\hat{M}$ the reward function $\hat{R}^a(\hat{s}_i, \hat{s}_j) = r_j$ is defined by the end state $\hat{s}_j$, as shown in (5), here $\bar{r}^a_i$ only depends on the initial states $s_i$.

Because of Assumption (ii), we know that $\bar{\mathbf{P}}^a \in \mathbb{R}^{m \times m}$ and $\bar{\mathbf{r}}^a \in \mathbb{R}^{m}$ for all $a \in A$. Thus, we can define the MDP $\bar{M} \equiv (\bar{X}, A, \bar{\mathbf{P}}^a, \bar{\mathbf{r}}^a, \gamma)$. Note that, unlike Barreto, Precup, and Pineau (2011), we do not explicitly define $\bar{X} \subset S$ as a set of representative states. In the particular case in which the trees in all ensembles $\mathcal{T}^a$ have the same structure, that is, $T^a_i$ and $T^b_i$ partition $S$ in the same way, we can think of the partitions $\bar{T}_i$ as the states of $\bar{M}$. In general, though, the states $\bar{x}_i \in \bar{X}$ are not subsets of $S$.

As discussed, when Assumption (i) holds, FQIT builds the MDP $\hat{M}$ at iteration $t = 1$ and then solves it iteratively through value iteration. Based on (10) and (11) we can define an algorithm that does the same using $\bar{M}$ instead of $\hat{M}$. At any iteration $t$, the value function $\hat{Q}_t(\hat{s}_i, a)$ computed by FQIT, given by (7), can be approximated as $\bar{Q}_t(\hat{s}_i, a) = \sum_{j=1}^{m} d^a_{ij}\, \bar{Q}_t(\bar{x}_j, a)$. More generally, from the definition of $\mathbf{D}^a$ in (9) we see that the action values of a state $s$ not necessarily in the sets $S^a$ can be computed as $\bar{Q}_t(s, a) = \frac{1}{|\mathcal{T}^a|} \sum_{j=1}^{m} \mathbb{I}\{s \in \bar{T}^a_j\}\, \bar{Q}_t(\bar{x}_j, a)$. We call the proposed algorithm tree-based stochastic factorization (TBSF).

One immediate advantage of replacing FQIT with TBSF is the potential reduction in computational cost. Let $\eta_{\min}$ denote the minimum number of points required to split a node during the construction of the trees (see Section 6). Each iteration of FQIT involves the construction (or update) of $|A|$ ensembles of trees, each one requiring at least $O(|\mathcal{T}^a| n_a \log(n_a / \eta_{\min}))$ operations, plus the improvement of the current decision policy, which is $O(n |A|)$. In contrast, TBSF executes $O(|\mathcal{T}^a| n_a \log(n_a / \eta_{\min}))$ operations only once per action, to build $\bar{\mathbf{P}}^a$ and $\bar{\mathbf{r}}^a$, and then performs only $O(m^2 |A|)$ operations in each iteration, assuming the worst-case scenario where the matrices $\bar{\mathbf{P}}^a$ are dense. As one can see, when $m \ll n$, replacing FQIT with TBSF can result in significant gains in terms of computation.
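As a concrete illustration of (10) and (11), the sketch below builds the compact transition matrix and reward vector for a single action by counting transitions between partitions. The representation of each tree as a function returning a global partition index in {0, ..., m-1} is a simplifying assumption made for the example, not the implementation used in the paper.

```python
# Hedged sketch of building the compact model in (10) and (11) for one action,
# assuming each tree maps a state to a *global* partition index (partitions of
# different trees get disjoint index ranges). These conventions are illustrative.
import numpy as np

def build_compact_model(transitions, trees, m):
    """transitions: list of (s, r, s_next) tuples collected for action a.
    trees: list of |T^a| leaf-index functions. Returns (P_bar, r_bar)."""
    counts = np.zeros((m, m))   # z_ij: samples starting in partition i, ending in j
    starts = np.zeros(m)        # |T_bar^a_i|: samples whose state lies in partition i
    rew_sum = np.zeros(m)       # sum of rewards of samples starting in partition i
    n_trees = len(trees)
    for s, r, s_next in transitions:
        start_parts = [leaf_of(s) for leaf_of in trees]
        end_parts = [leaf_of(s_next) for leaf_of in trees]
        for i in start_parts:
            starts[i] += 1.0
            rew_sum[i] += r
            for j in end_parts:
                counts[i, j] += 1.0
    P_bar = np.zeros((m, m))
    r_bar = np.zeros(m)
    nonempty = starts > 0
    # Eq. (10): divide the counts by |T^a| * |T_bar^a_i|; Eq. (11): average reward.
    P_bar[nonempty] = counts[nonempty] / (n_trees * starts[nonempty][:, None])
    r_bar[nonempty] = rew_sum[nonempty] / starts[nonempty]
    return P_bar, r_bar
```

The same routine would be called once per action; a single pass over the batch suffices, after which the transitions could in principle be discarded.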
There is, however, a stronger reason to adopt TBSF instead of FQIT. Suppose that at iteration $t$ a new batch of sample transitions $S^a_2 \equiv \{(s^a_i, r^a_i, \hat{s}^a_i)\}$ becomes available, with $a \in A$. One way of incorporating the new data into FQIT's approximation is to set $\hat{Q}_t(\hat{s}^a_i, a) = 0$ for all states $\hat{s}^a_i \in S^a_2$ and then proceed normally from there. However, since FQIT's mechanics require that all transitions used in the approximation remain available, as shown in (2), there is an inbuilt limit on the amount of data that this algorithm can process. This is why FQIT is inherently a batch-mode algorithm. In contrast, TBSF can extract the information contained in the sample transitions and then discard them, which means that it can also be used in the on-line regime.

Let $S_1 \equiv S^a$ and $S_2 \equiv S^a_2$ (we drop the superscript $a$ in the sets $S^a$ and $S^a_2$ to improve readability). Let $\bar{\mathbf{P}}^{S_1}$ and $\bar{\mathbf{r}}^{S_1}$ be the matrix $\bar{\mathbf{P}}^a$ and vector $\bar{\mathbf{r}}^a$ computed by TBSF through (10) and (11) using only the $n_a$ transitions in $S_1$. We assume that we have already discarded the data in $S_1$ and want to compute $\bar{\mathbf{P}}^{S_1 \cup S_2}$ and $\bar{\mathbf{r}}^{S_1 \cup S_2}$ from $\bar{\mathbf{P}}^{S_1}$, $\bar{\mathbf{r}}^{S_1}$, and $S_2$. To have an algorithm that is as general as possible, we consider the possibility that the ensembles change with the new data, either because new partitions are added to the trees already in the ensemble or because new trees are added altogether. We denote the two ensembles by $\mathcal{T}^{S_1}$ and $\mathcal{T}^{S_1 \cup S_2}$. We also define $m'_i$ as the number of partitions associated with the $i$th tree of $\mathcal{T}^{S_1 \cup S_2}$ and set $m' = \sum_{i=1}^{|\mathcal{T}^{S_1 \cup S_2}|} m'_i$. If $m' > m$, we set $\bar{r}^a_l = 0$ and $\bar{p}^a_{lj} = \bar{p}^a_{jl} = 0$ for $l = m+1, m+2, ..., m'$, $j = 1, 2, ..., m'$, and all $a \in A$.

We start by rewriting (10) as $\bar{p}^{S_1}_{ij} = z^{S_1}_{ij} / (|\mathcal{T}^{S_1}|\, z^{S_1}_i)$, where $z^{S_1}_i = \sum_{l=1}^{n_a} \mathbb{I}\{s_l \in \bar{T}^a_i\}$ and $z^{S_1}_{ij} = \sum_{l=1}^{n_a} \mathbb{I}\{s_l \in \bar{T}^a_i \text{ and } \hat{s}_l \in \bar{T}^a_j\}$ (note that $\mathrm{act}(l) = a$ for all $l$). Then, if we define the variables $z^{S_2}_i$ and $z^{S_2}_{ij}$ in an analogous way, we have

$\bar{p}^{S_1 \cup S_2}_{ij} = \dfrac{z^{S_1}_{ij} + z^{S_2}_{ij}}{|\mathcal{T}^{S_1 \cup S_2}|\, (z^{S_1}_i + z^{S_2}_i)}$.   (12)

In order to avoid keeping $z^{S_1}_{ij}$ for all $i, j$, we rewrite (12) as

$\bar{p}^{S_1 \cup S_2}_{ij} = \dfrac{\bar{p}^{S_1}_{ij}\, |\mathcal{T}^{S_1}|\, z^{S_1}_i + z^{S_2}_{ij}}{|\mathcal{T}^{S_1 \cup S_2}|\, (z^{S_1}_i + z^{S_2}_i)}$.   (13)

Similarly, if we define $b^{S_1}_i = \sum_{j=1}^{n_a} \mathbb{I}\{s_j \in \bar{T}^{S_1}_i\}\, r_j$ and proceed in the same way as above starting from (11), we arrive at the update rule

$\bar{r}^{S_1 \cup S_2}_i = \dfrac{\bar{r}^{S_1}_i\, z^{S_1}_i + b^{S_2}_i}{z^{S_1}_i + z^{S_2}_i}$.   (14)

Since $z^{S_2}_i$, $z^{S_2}_{ij}$, and $b^{S_2}_i$ are computed based on $S_2$ alone, and $|\mathcal{T}^{S_1 \cup S_2}|$ is readily available, in order to have an incremental algorithm we only have to keep $|\mathcal{T}^{S_1}|$ and $z^{S_1}_i$ for $i = 1, 2, ..., m$. Note that, when $\mathcal{T}^{S_1} = \mathcal{T}^{S_1 \cup S_2}$, the model computed through (13) and (14) is the same that would be built if the transitions in $S_1 \cup S_2$ were used all at once.

Algorithm 1 shows a generic step-by-step description of TBSF. As one can see, several variations of the algorithm are possible. If the variable $it_{\max}$ defined in line 3 is set to 1, TBSF reduces to a batch method; otherwise it is incremental. If $it_{\max} > 1$ and the distribution $\mu$ is updated based on $\bar{Q}$ (line 8), TBSF is an on-line method. If in addition $n = 1$, we have an algorithm that does not store sample transitions (line 4). Finally, if in line 5 new trees or new partitions are created when $it > 1$, we have an adaptive method (this is akin to relaxing Assumption (i) for FQIT). It is also trivial to modify the algorithm for the scenario where the sample transitions are given in advance.
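To illustrate how (13) and (14) keep the model up to date without storing past transitions, the sketch below updates the compact model of the previous example with a new batch, under the simplifying assumption that the ensemble structure does not change between batches; the conventions and helper names are again hypothetical.

```python
# Minimal sketch of the incremental update in (13)-(14), assuming the ensemble
# structure is unchanged between batches (T^{S1} = T^{S1 u S2}). Only the
# per-partition counters z_i and the model itself are kept between batches;
# the raw transitions can be discarded. Conventions follow the previous sketch.
import numpy as np

def incremental_update(P_bar, r_bar, z, new_transitions, trees):
    """P_bar, r_bar: current compact model; z: counters z_i from past batches.
    new_transitions: list of (s, r, s_next) for one action. Returns updated model."""
    m = len(z)
    n_trees = len(trees)
    z2 = np.zeros(m)            # z_i^{S2}
    z2_pair = np.zeros((m, m))  # z_ij^{S2}
    b2 = np.zeros(m)            # b_i^{S2}: reward sums for the new batch
    for s, r, s_next in new_transitions:
        start_parts = [leaf_of(s) for leaf_of in trees]
        end_parts = [leaf_of(s_next) for leaf_of in trees]
        for i in start_parts:
            z2[i] += 1.0
            b2[i] += r
            for j in end_parts:
                z2_pair[i, j] += 1.0
    denom = z + z2
    nonempty = denom > 0
    # Eq. (13): recover z_ij^{S1} as p_bar_ij * |T| * z_i, add new counts, renormalize.
    num = P_bar * (n_trees * z[:, None]) + z2_pair
    P_bar[nonempty] = num[nonempty] / (n_trees * denom[nonempty][:, None])
    # Eq. (14): weighted average of the old reward estimate and the new reward sums.
    r_bar[nonempty] = (r_bar[nonempty] * z[nonempty] + b2[nonempty]) / denom[nonempty]
    return P_bar, r_bar, denom  # denom holds the updated counters z_i
```

Only the counters $z_i$, the number of trees, and the current model need to be carried from one batch to the next, which is what makes this update suitable for line 6 of Algorithm 1.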
5 Theoretical Analysis

In this section we analyze some properties of TBSF. We will focus on the batch version of the algorithm, that is, the case in which $it_{\max} = 1$ in Algorithm 1. We will assume that Assumptions (i) and (ii) hold, and compare the approximation $\hat{Q}_t$ computed by FQIT after $t$ iterations with the approximation $\bar{Q}_t$ computed by TBSF after $t$ applications of the value-iteration update rule in $\bar{M}$ (line 7 of Algorithm 1).

Algorithm 1 TBSF
1: Input: $\mathrm{fit}$ (algorithm used to construct the trees); $\mu$ (distribution over $S \times A$)
2: Output: approximate value function $\bar{Q}$
3: for $it \leftarrow 1, 2, ..., it_{\max}$ do
4:     collect $n$ transitions based on $\mu$ and store them in $S^a$
5:     $\mathcal{T}^a \leftarrow \mathrm{fit}(S^a)$ for all $a \in A$   (construct or grow the trees)
6:     update $\bar{M}$ using (13) and (14)
7:     apply $t$ iterations of value iteration to $\bar{Q}$
8:     modify $\mu$ based on $\bar{Q}(s, a)$   (optional)
9:     $S^a \leftarrow \emptyset$ for all $a \in A$   (discard the transitions)

Our first result depends on the following assumption: (iii) the number of trees in the ensembles is the same, that is, $|\mathcal{T}^a| = |\mathcal{T}|$ for all $a \in A$. Note that the structure of the trees can still differ.

Let $\mathbf{r}^+$ be the vector whose elements are $r^+_i = \max_{a \in A} \bar{r}^a_i$. Analogously, let $\mathbf{r}^-$ be the vector with entries $r^-_i = \min_{a \in A} \bar{r}^a_i$. Let $R^+_{\max}$ be the sum of the $|\mathcal{T}|$ largest elements of $\mathbf{r}^+$ and let $R^-_{\min}$ be the sum of the $|\mathcal{T}|$ smallest elements of $\mathbf{r}^-$. Then, we can show the following:

Proposition 1. For $t > 0$,

$|\hat{Q}_t(\hat{s}_i, a) - \bar{Q}_t(\hat{s}_i, a)| \le \dfrac{1}{|\mathcal{T}|} \cdot \dfrac{\gamma - \gamma^t}{1 - \gamma} \left( R^+_{\max} - R^-_{\min} \right)$

for any $\hat{s}_i$ and any $a$.

The proof of Proposition 1 is in the supplementary material (Barreto 2014). Note that the derived bound corresponds to the maximum difference between the values of two states belonging to an MDP whose rewards are restricted to $[R^-_{\min}/|\mathcal{T}|,\; R^+_{\max}/|\mathcal{T}|]$. This reflects the special structure that the trees impose on the MDPs $\hat{M}$ and $\bar{M}$; in the general case the interval above would be replaced by $[\min_{a,i} \bar{r}^a_i,\; \max_{a,i} \bar{r}^a_i]$, which is a superset of its counterpart.

For the next results we will need an extra assumption: (iv) the $|\mathcal{T}|$ trees in the ensembles have the same structure, that is, $T^a_i$ and $T^b_i$ define the same partitioning of the state space, for any $i$, $a$, and $b$.

From the definition of $\mathbf{D}^a$ in (9) we know that Assumption (iv) implies that $\mathbf{D}^a = \mathbf{D}$ for all $a \in A$. In this case, the transformation computed by TBSF reduces to Sorg and Singh's (2009) concept of a soft homomorphism, and all results by these authors apply. In particular, we know that if, for any $i$, $\hat{s}_i \in \bar{T}_j$ and $\hat{s}_i \in \bar{T}_l$ imply that $\bar{\pi}^*(\bar{x}_j) = \bar{\pi}^*(\bar{x}_l)$, where $\bar{\pi}^*$ is the optimal policy of $\bar{M}$, then the value functions computed by FQIT and TBSF coincide (see Sorg and Singh's Corollary 2, 2009). As a consequence, if a single tree is used in the ensembles the difference $|\bar{Q}_t(\hat{s}_i, a) - \hat{Q}_t(\hat{s}_i, a)|$ is zero.

Still assuming that (iv) holds, we can show the following result. Let $\mathbf{r}^{\mathrm{dif}}$ be the vector whose elements are $r^{\mathrm{dif}}_i = \max_a \bar{r}^a_i - \min_a \bar{r}^a_i$. Let $R^{\mathrm{dif}}$ be the sum of the $|\mathcal{T}|$ largest elements of $\mathbf{r}^{\mathrm{dif}}$ and let $R^+_{\min}$ be the sum of the $|\mathcal{T}|$ smallest elements of $\mathbf{r}^+$. Then,

Proposition 2. For $t > 0$,

$0 \le \bar{Q}_t(\hat{s}_i, a) - \hat{Q}_t(\hat{s}_i, a) \le \dfrac{1}{|\mathcal{T}|} \left[ \gamma R^{\mathrm{dif}} + \dfrac{\gamma^2 - \gamma^t}{1 - \gamma} \left( R^+_{\max} - R^+_{\min} \right) \right]$

for any $\hat{s}_i$ and any $a$.

The proof of Proposition 2 is also in the supplementary material (Barreto 2014). Note that increasing the number $|\mathcal{T}|$ of trees in the ensembles has two effects on our bounds. On one hand it decreases the scalar $1/|\mathcal{T}|$ multiplying the right-hand side of the bounds; on the other hand it may also increase $R^+_{\max} - R^-_{\min}$, $R^{\mathrm{dif}}$, and $R^+_{\max} - R^+_{\min}$, since the number of partitions $\bar{T}_i$ grows. The overall effect of increasing $|\mathcal{T}|$ will therefore depend on the structure of the trees.
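For concreteness, the quantities appearing in the propositions can be computed directly from the compact reward vectors. The short helper below evaluates the right-hand side of Proposition 1 for given $\bar{\mathbf{r}}^a$, $\gamma$, $t$, and $|\mathcal{T}|$; it is an illustrative aid, not part of the paper.

```python
# Illustrative evaluation of the bound in Proposition 1 from the compact reward
# vectors r_bar^a (one per action), the discount gamma, the iteration t, and the
# number of trees. This helper is a sketch, not part of the original paper.
import numpy as np

def proposition1_bound(r_bar_per_action, n_trees, gamma, t):
    R = np.vstack(r_bar_per_action)                # shape (|A|, m)
    r_plus = R.max(axis=0)                         # r+_i = max_a r_bar^a_i
    r_minus = R.min(axis=0)                        # r-_i = min_a r_bar^a_i
    R_plus_max = np.sort(r_plus)[-n_trees:].sum()  # sum of the |T| largest r+_i
    R_minus_min = np.sort(r_minus)[:n_trees].sum() # sum of the |T| smallest r-_i
    return (1.0 / n_trees) * (gamma - gamma ** t) / (1.0 - gamma) \
        * (R_plus_max - R_minus_min)
```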
6 Experiments

We compare the performance of our algorithm with that of FQIT combined with Geurts, Ernst, and Wehenkel's extra-trees algorithm (2006). This combination generated the best results in the extensive empirical evaluation performed by Ernst, Geurts, and Wehenkel (2005). When combined with the extra-trees algorithm, FQIT has four parameters; in our experiments we used 200 iterations and $\dim(S)$ candidate cut points to build the trees, and varied $|\mathcal{T}|$ and $\eta_{\min}$ (during the construction of the trees, a node is split only if it contains at least $\eta_{\min}$ points; hence, this parameter can be seen as an indirect way of setting the number of partitions). We also combined TBSF with the extra-trees algorithm, and the same parameters were used to build the trees. Since the extra-trees algorithm does not allow precise control over the size of the trees, we used trees with the same structure for all actions. As in this case it does not make sense to build the trees based on $Q_1$, the criterion used to select the cut points was simply the difference in the cardinality $|\bar{T}_i|$ of the resulting partitions. To serve as a reference, we also evaluated a version of FQIT that used a fixed ensemble of trees constructed in the exact same way, referred to as FQIT-F.

We first use the mountain-car task as a proof of concept (Singh and Sutton 1996). In this case, an exploration policy that selects actions uniformly at random was used to collect a batch of $n$ sample transitions. The algorithms' results, shown in Figure 1, correspond to the performance of the greedy policy computed using this data.² As shown in the figure, the algorithms' performance improves with $|\mathcal{T}|$ and $n$, as expected. In general, TBSF's results are intermediate between FQIT's and FQIT-F's. The good performance of FQIT suggests that adapting the structure of the trees at each iteration may play an important role. This ability comes at a price, though: as shown in Figure 1, FQIT's run time is orders of magnitude greater than TBSF's. This opens the possibility of using the latter to process larger amounts of data, as we explore next.

² See the supplementary material for similar experiments with the puddle-world task (Barreto 2014).

[Figure 1: Results on the mountain-car task. All the algorithms used $\eta_{\min} = 30$ to build the trees. In the first two plots $n = 10000$ and in the last two $|\mathcal{T}| = 30$. Policies were evaluated on a set of 25 states evenly distributed over $[-1.00, 0.07] \times [-0.15, 0.02]$. Shadows represent one standard error over 50 runs.]

We revisit one of the most successful applications of FQIT: an important medical problem which we will refer to as the HIV domain (Adams et al. 2004; Ernst et al. 2006). Despite the effectiveness of drug cocktails in maintaining low HIV viral loads, there are several complications associated with their long-term use. This has attracted the interest of the scientific community to the problem of optimizing drug-scheduling strategies. The problem can be formulated as a reinforcement-learning task in which the actions correspond to the types of cocktail that should be administered.
The problem has a 6-dimensional state space and 4 actions. Following Ernst et al. (2006), we performed our experiments using a realistic model that describes the interaction of the immune system with HIV (Adams et al. 2004). We also adopted the protocol proposed by these authors to collect data: starting from a batch of 6000 transitions generated by a random policy, corresponding to the treatment of 30 patients, the algorithms computed an initial approximation of the problem's value function. Based on this approximation, a 0.15-greedy policy was used to collect a second batch of 6000 transitions, which was merged with the first. This process was repeated for 10 rounds. As one can see, this strategy for collecting transitions fits well with the on-line version of TBSF. Therefore, instead of merging the successive batches, we used Algorithm 1 and simply discarded the data after each round. Since in this case the difference between FQIT's and TBSF's computational cost is even bigger, we repeated the experiment with the latter assuming that we had more data available, first from 300 and then from 600 patients. In all other aspects FQIT and TBSF were configured as before.

The results of the experiment are shown in Figure 2. As one can see, when FQIT and TBSF use the same amount of data the former outperforms the latter by a large margin. This is expected, since TBSF was used with a fixed ensemble of trees built solely on the first batch of transitions. However, when the number of transitions used by TBSF increases, so does the quality of the resulting policies, which eventually become competitive with FQIT's. When using $\eta_{\min} = 10$ with the augmented datasets, TBSF is still faster than FQIT and produces policies of about the same quality.

[Figure 2: Results on the HIV domain, comparing a random policy, FQIT trained on data from 30 patients, and TBSF trained on data from 30, 300, and 600 patients; the vertical axis shows the return on a logarithmic scale. The algorithms were run with $|\mathcal{T}| = 30$; the number beside each rectangle is $\eta_{\min}$. Policies were evaluated for 5000 days starting from a state representing an unhealthy condition of the patients (Ernst et al. 2006). The length of the rectangles' edges represents one standard error over 50 runs.]

7 Conclusion

In this paper we applied the stochastic-factorization trick to FQIT. The resulting algorithm, TBSF, inherits many desirable properties of its precursor: it is simple to implement, flexible, scalable to high-dimensional spaces, and robust to noise. The main difference between FQIT and TBSF is in the way they construct their models: while the former builds an MDP whose states are the $n$ end states in the sample transitions, the latter constructs an MDP whose size is the number $m$ of partitions in the ensembles of trees. This has two consequences. First, when $m \ll n$, TBSF is orders of magnitude faster than FQIT. Second, TBSF can be used on-line.

We showed that the distance between the value functions computed by FQIT and TBSF is bounded. Empirically, we verified that FQIT tends to outperform the non-adaptive version of TBSF, especially if the latter is used on-line. However, since TBSF's computational cost is significantly lower than FQIT's, it can process more data in the same amount of time, which can result in better performance. Besides, in problems that are inherently on-line TBSF is a natural candidate to replace FQIT. An interesting direction for future research would be to study the adaptive version of TBSF.
Acknowledgments

The author would like to thank the anonymous reviewers for their comments and suggestions.

References

Adams, B.; Banks, H.; Kwon, H.; and Tran, H. 2004. Dynamic multidrug therapies for HIV: optimal and STI control approaches. Mathematical Biosciences and Engineering 1(2):223–241.

Antos, A.; Munos, R.; and Szepesvári, C. 2007. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems (NIPS), 9–16.

Barreto, A. M. S., and Fragoso, M. D. 2011. Computing the stationary distribution of a finite Markov chain through stochastic factorization. SIAM Journal on Matrix Analysis and Applications 32:1513–1523.

Barreto, A. M. S.; Precup, D.; and Pineau, J. 2011. Reinforcement learning using kernel-based stochastic factorization. In Advances in Neural Information Processing Systems (NIPS), 720–728.

Barreto, A. M. S.; Precup, D.; and Pineau, J. 2012. On-line reinforcement learning using incremental kernel-based stochastic factorization. In Advances in Neural Information Processing Systems (NIPS), 1484–1492.

Barreto, A. M. S. 2014. Tree-based on-line reinforcement learning: Supplementary material. Available on-line.

Ernst, D.; Stan, G.-B.; Gonçalves, J.; and Wehenkel, L. 2006. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the IEEE Conference on Decision and Control (CDC), 124–131.

Ernst, D.; Geurts, P.; and Wehenkel, L. 2005. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6:503–556.

Farahmand, A. M.; Ghavamzadeh, M.; Szepesvári, C.; and Mannor, S. 2009. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In Proceedings of the American Control Conference, 725–730.

Geurts, P.; Ernst, D.; and Wehenkel, L. 2006. Extremely randomized trees. Machine Learning 63(1):3–42.

Gordon, G. J. 1995. Stable function approximation in dynamic programming. In Proceedings of the International Conference on Machine Learning (ICML), 261–268.

Kalyanakrishnan, S., and Stone, P. 2007. Batch reinforcement learning in a complex domain. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, 650–657.

Littman, M. L.; Dean, T. L.; and Kaelbling, L. P. 1995. On the complexity of solving Markov decision problems. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 394–402.

Ormoneit, D., and Sen, S. 2002. Kernel-based reinforcement learning. Machine Learning 49(2–3):161–178.

Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.

Riedmiller, M. 2005. Neural fitted Q-iteration: first experiences with a data efficient neural reinforcement learning method. In Proceedings of the European Conference on Machine Learning (ECML), 317–328. Springer.

Singh, S. P., and Sutton, R. S. 1996. Reinforcement learning with replacing eligibility traces. Machine Learning 22(1–3):123–158.

Sorg, J., and Singh, S. 2009. Transfer via soft homomorphisms. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 741–748.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press.