# Multi-step Greedy Reinforcement Learning Algorithms

Manan Tomar*¹, Yonathan Efroni*², Mohammad Ghavamzadeh³

*Equal contribution. ¹Facebook AI Research, Menlo Park, USA; ²Technion, Haifa, Israel; ³Google Research, Mountain View, USA. Correspondence to: Manan Tomar. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

Multi-step greedy policies have been extensively used in model-based reinforcement learning (RL), both when a model of the environment is available (e.g., in the game of Go) and when it is learned. In this paper, we explore their benefits in model-free RL, when employed using multi-step dynamic programming algorithms: κ-Policy Iteration (κ-PI) and κ-Value Iteration (κ-VI). These methods iteratively compute the next policy (κ-PI) and value function (κ-VI) by solving a surrogate decision problem with a shaped reward and a smaller discount factor. We derive model-free RL algorithms based on κ-PI and κ-VI in which the surrogate problem can be solved by any discrete or continuous action RL method, such as DQN and TRPO. We identify the importance of a hyper-parameter that controls the extent to which the surrogate problem is solved and suggest a way to set this parameter. When evaluated on a range of Atari and MuJoCo benchmark tasks, our results indicate that for the right range of κ, our algorithms outperform DQN and TRPO. This shows that our multi-step greedy algorithms are general enough to be applied over any existing RL algorithm and can significantly improve its performance.

## 1. Introduction

Reinforcement learning (RL) algorithms solve sequential decision-making problems through repeated interaction with the environment. By incorporating deep neural networks into RL algorithms, the field has recently witnessed remarkable empirical success (e.g., Mnih et al., 2015; Lillicrap et al., 2015; Levine et al., 2016; Silver et al., 2017). Much of this success has been achieved by model-free RL algorithms, such as Q-learning and policy gradients. These algorithms are known to suffer from high variance in their estimations (Greensmith et al., 2004) and to have difficulties in handling function approximation (e.g., Thrun & Schwartz, 1993; Baird, 1995; Van Hasselt et al., 2016; Lu et al., 2018). These issues are intensified in decision problems with a long horizon, i.e., when the discount factor, γ, is large.

Although using smaller values of γ addresses the discount-factor-dependent issues and leads to more stable algorithms (Petrik & Scherrer, 2009; Jiang et al., 2015), it does not come for free: the algorithm may return a biased solution, i.e., it may not converge to an optimal (or even good) solution of the original decision problem (the one with the larger value of γ).

Efroni et al. (2018a) recently proposed another approach to mitigate the γ-dependent instabilities in RL, in which they study multi-step greedy versions of the well-known dynamic programming (DP) algorithms Policy Iteration (PI) and Value Iteration (VI) (Bertsekas & Tsitsiklis, 1996). They also proposed an alternative formulation of the multi-step greedy policy, called the κ-greedy policy, and studied the convergence of the resulting PI and VI algorithms: κ-PI and κ-VI. These algorithms iteratively solve a γκ-discounted decision problem, whose reward has been shaped by the solution of the decision problem at the previous iteration.
Unlike the biased solution obtained by solving the decision problem with a smaller value of γ (discussed above), by iteratively solving decision problems with a shorter γκ horizon, the κ-PI and κ-VI algorithms can converge to an optimal policy of the original decision problem.

In this paper, we derive model-free RL algorithms based on the κ-greedy formulation of multi-step greedy policies. As mentioned earlier, the main component of this formulation is (approximately) solving a surrogate decision problem with a shaped reward and a smaller discount factor. Our algorithms build on κ-PI and κ-VI, and solve the surrogate decision problem with the popular deep RL algorithms Deep Q-Network (DQN) (Mnih et al., 2015) and Trust Region Policy Optimization (TRPO) (Schulman et al., 2015). We call the resulting algorithms κ-PI-DQN, κ-VI-DQN, κ-PI-TRPO, and κ-VI-TRPO, and empirically evaluate and compare them with DQN, TRPO, and Generalized Advantage Estimation (GAE) (Schulman et al., 2016) on the Atari (Bellemare et al., 2013) and MuJoCo (Todorov et al., 2012) benchmarks. Our results indicate that for the right range of κ, our algorithms outperform DQN and TRPO. This suggests that the performance of these two deep RL algorithms can be improved by using them as solvers within the multi-step greedy PI and VI schemes.

Moreover, our results indicate that the success of our algorithms depends on a number of non-trivial design choices. In particular, we identify the importance of a hyper-parameter that controls the extent to which the surrogate decision problem is solved, and use the theory of multi-step greedy DP to derive a recipe for setting this parameter. We show the advantage of using hard over soft updates, verifying the theory in Efroni et al. (2018b, Thm. 1). By hard and soft updates, we refer to fully solving the surrogate MDP in a model-free manner and then evaluating the resulting policy (policy improvement and evaluation steps are separated) vs. changing the policy at each iteration (policy improvement and evaluation steps are concurrent: each improvement is followed by an evaluation). We also establish a connection between our multi-step greedy algorithms and GAE. In particular, we show that our κ-PI-TRPO algorithm coincides with GAE, and that GAE can be obtained by minor modifications to κ-PI-TRPO. Finally, we show the advantage of using our multi-step greedy algorithms over lowering the discount factor in the DQN (value-based) and TRPO (policy-based) algorithms. Our results indicate that while lowering the discount factor is detrimental to performance, our multi-step greedy algorithms indeed improve over DQN and TRPO.

## 2. Preliminaries

In this paper, we assume that the agent's interaction with the environment is modeled as a discrete-time γ-discounted Markov Decision Process (MDP), defined by M_γ = (S, A, P, R, γ, μ), where S and A are the state and action spaces; P ≡ P(s'|s, a) is the transition kernel; R ≡ r(s, a) is the reward function with maximum value R_max; γ ∈ (0, 1) is the discount factor; and μ is the initial state distribution. Let π : S → P(A) be a stationary Markovian policy, where P(A) is a probability distribution over the set A. The value function of a policy π at a state s ∈ S is defined as V^π(s) ≜ E[Σ_{t≥0} γ^t r(s_t, a_t) | s_0 = s, π], where the expectation is over all the randomness in the policy, dynamics, and rewards. Similarly, the action-value function of π is defined as Q^π(s, a) = E[Σ_{t≥0} γ^t r(s_t, a_t) | s_0 = s, a_0 = a, π].
Since the rewards are bounded by R_max, both the V and Q functions are bounded by V_max = R_max/(1 − γ). An optimal policy π* is a policy with maximum value at every state. We call the value of π* the optimal value, and define it as V*(s) = max_π V^π(s), ∀s ∈ S. We denote by Q*(s, a) the state-action value function of π*, and recall that the relation V*(s) = max_a Q*(s, a) holds for all s.

The algorithms by which we solve an MDP (i.e., obtain an optimal policy) are mainly based on two popular DP algorithms: Policy Iteration (PI) and Value Iteration (VI). While VI relies on iteratively applying the optimal Bellman operator T to the current value function V (Eq. 1), PI relies on (iteratively) calculating the 1-step greedy policy π_{1-step} w.r.t. the value function of the current policy V (Eq. 2): for all s ∈ S, we have

(T V)(s) = max_{a ∈ A} E[r(s_0, a) + γ V(s_1) | s_0 = s],   (1)

π_{1-step}(s) ∈ argmax_{a ∈ A} E[r(s_0, a) + γ V(s_1) | s_0 = s].   (2)

It is known that T is a γ-contraction w.r.t. the max-norm whose unique fixed point is V*, and that the 1-step greedy policy w.r.t. V* is an optimal policy π*. In practice, the state space is often large, and thus we can only approximately compute Eqs. 1 and 2, which results in approximate PI (API) and approximate VI (AVI) algorithms. These approximation errors then propagate through the iterations of the API and AVI algorithms. However, it has been shown that this (propagated) error can be controlled (Munos, 2003; 2005; Farahmand et al., 2010), and after N steps, the algorithms approximately converge to a solution π_N whose difference from the optimal value is bounded (see, e.g., Scherrer, 2014 for API):

ρ(π*) − ρ(π_N) ≤ C δ / (1 − γ)² + γ^N V_max.   (3)

In Eq. 3, the scalar ρ(π) = E_{s∼μ}[V^π(s)] is the expected value function at the initial state (note that the LHS of Eq. 3 is the ℓ₁-norm of V* − V^{π_N} w.r.t. the initial state distribution μ), δ represents the per-iteration error, and C upper-bounds the mismatch between the sampling distribution and the distribution according to which the final value function is evaluated (μ in Eq. 3), and depends heavily on the dynamics. Finally, the second term on the RHS of Eq. 3 is the error due to the initial values of the policy/value and decays with the number of iterations N.

## 3. κ-Greedy Policy: κ-PI and κ-VI Algorithms

The optimal Bellman operator T (Eq. 1) and the 1-step greedy policy π_{1-step} (Eq. 2) can be generalized to their multi-step versions. The most straightforward form of this generalization is realized by replacing T and π_{1-step} with the h-optimal Bellman operator and the h-step greedy policy (i.e., a lookahead of horizon h), respectively. This is done by substituting the 1-step return in Eqs. 1 and 2, r(s_0, a) + γV(s_1), with the h-step return, Σ_{t=0}^{h−1} γ^t r(s_t, a_t) + γ^h V(s_h), and computing the maximum over the actions a_0, ..., a_{h−1}, instead of just a_0 (Bertsekas & Tsitsiklis, 1996). Efroni et al. (2018a) proposed an alternative form of the multi-step optimal Bellman operator and greedy policy, called the κ-optimal Bellman operator, T_κ, and the κ-greedy policy, π_κ, for κ ∈ [0, 1], i.e.,

(T_κ V)(s) = max_π E[Σ_{t≥0} (γκ)^t r_t(κ, V) | s_0 = s, π],   (4)

π_κ(s) ∈ argmax_π E[Σ_{t≥0} (γκ)^t r_t(κ, V) | s_0 = s, π],   (5)

for all s ∈ S. In Eqs. 4 and 5, the shaped reward r_t(κ, V) w.r.t. the value function V is defined as

r_t(κ, V) ≜ r_t + γ(1 − κ) V(s_{t+1}).   (6)

It can be shown that the κ-greedy policy w.r.t. the value function V is the optimal policy w.r.t. a κ-weighted geometric average of all future h-step returns (from h = 0 to ∞). This can be interpreted as a TD(λ) (Sutton & Barto, 2018) analogue for policy improvement (see Efroni et al., 2018a, Sec. 6); the important difference is that TD(λ) is used for policy evaluation, not for policy improvement.
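To make Eq. 6 concrete, the following minimal sketch (with hypothetical toy numbers) evaluates the shaped reward at its two extremes: for κ = 0, together with the zero effective discount γκ, it reduces to the 1-step backup of Eqs. 1 and 2, while for κ = 1 the original reward and the full discount γ are recovered.

```python
# A minimal sketch of the shaped reward in Eq. 6 (toy values are hypothetical).
def shaped_reward(r_t, v_next, gamma, kappa):
    """Eq. 6: r_t(kappa, V) = r_t + gamma * (1 - kappa) * V(s_{t+1})."""
    return r_t + gamma * (1.0 - kappa) * v_next

gamma, r_t, v_next = 0.99, 1.0, 10.0
print(shaped_reward(r_t, v_next, gamma, kappa=0.0))  # 10.9: the 1-step backup target of Eqs. 1-2
print(shaped_reward(r_t, v_next, gamma, kappa=1.0))  # 1.0: original reward, discounted at gamma*1
print(shaped_reward(r_t, v_next, gamma, kappa=0.5))  # 5.95: in between, discounted at gamma*0.5
```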
It is easy to see that solving Eqs. 4 and 5 is equivalent to solving a surrogate γκ-discounted MDP with the shaped reward r_t(κ, V), which we denote by M_{γκ}(V) throughout the paper. The optimal value and optimal policy of the surrogate MDP M_{γκ}(V) are T_κ V and the κ-greedy policy π_κ, respectively.

Using the notions of the κ-optimal Bellman operator, T_κ, and the κ-greedy policy, π_κ, Efroni et al. (2018a) derived the κ-PI and κ-VI algorithms, whose pseudo-code is shown in Algorithms 1 and 2. κ-PI iteratively (i) evaluates the value V^{π_i} of the current policy π_i, and (ii) sets the new policy, π_{i+1}, to the κ-greedy policy w.r.t. the value of the current policy V^{π_i}, by solving Eq. 5. On the other hand, κ-VI repeatedly applies the T_κ operator to the current value function V_i (solves Eq. 4) to obtain the next value function, V_{i+1}, and returns the κ-greedy policy w.r.t. the final value V_{N_κ}. Note that for κ = 0, the κ-optimal Bellman operator and the κ-greedy policy are equivalent to their 1-step counterparts, defined by Eqs. 1 and 2, which indicates that κ-PI and κ-VI are generalizations of the seminal PI and VI algorithms.

Algorithm 1: κ-Policy Iteration
1: Initialize: κ ∈ [0, 1], π_0, N_κ
2: for i = 0, 1, ..., N_κ − 1 do
3:   V^{π_i} = E[Σ_{t≥0} γ^t r_t | π_i]
4:   π_{i+1} ∈ argmax_π E[Σ_{t≥0} (γκ)^t r_t(κ, V^{π_i}) | π]
5: end for
6: Return π_{N_κ}

Algorithm 2: κ-Value Iteration
1: Initialize: κ ∈ [0, 1], V_0, N_κ
2: for i = 0, 1, ..., N_κ − 1 do
3:   V_{i+1} = max_π E[Σ_{t≥0} (γκ)^t r_t(κ, V_i) | π]
4: end for
5: π_{N_κ} ∈ argmax_π E[Σ_{t≥0} (γκ)^t r_t(κ, V_{N_κ}) | π]
6: Return π_{N_κ}

It has been shown that both PI and VI converge to the optimal value with an exponential rate that depends on the discount factor γ, i.e., ‖V* − V^{π_N}‖_∞ ≤ O(γ^N) (see, e.g., Bertsekas & Tsitsiklis, 1996). Analogously, Efroni et al. (2018a) showed that κ-PI and κ-VI converge with the faster exponential rate ξ = γ(1 − κ)/(1 − γκ) ≤ γ, i.e., ‖V* − V^{π_{N_κ}}‖_∞ ≤ O(ξ^{N_κ}), with the cost that each iteration of these algorithms is computationally more expensive than that of PI and VI. Finally, we state the following property of κ-PI and κ-greedy policies that we use in our κ-PI and κ-VI based RL algorithms described in Section 4:

Asymptotic performance depends on κ. Efroni et al. (2018b, Thm. 5) proved the following bound on the performance of κ-PI, similar to the one in Eq. 3 for API:

ρ(π*) − ρ(π_{N_κ}) ≤ C_κ δ_κ / (1 − γ)² [asymptotic term] + ξ^{N_κ} V_max [decaying term],   (7)

where δ_κ and C_κ are quantities analogous to δ and C in Eq. 3. Note that while the second term on the RHS of Eq. 7 decays with N_κ, the first one is independent of N_κ.

## 4. κ-PI and κ-VI based RL Algorithms

As described in Section 3, implementing κ-PI and κ-VI requires iteratively solving a γκ-discounted surrogate MDP with a shaped reward. If a model of the problem is given, the surrogate MDP can be solved using a DP algorithm (see Efroni et al., 2018a, Sec. 7). When a model is not available, we should approximately solve the surrogate MDP using a model-free RL algorithm. In this paper, we focus on the latter case and propose RL algorithms inspired by κ-PI and κ-VI. In our algorithms, we use the model-free RL algorithms DQN (Mnih et al., 2015) and TRPO (Schulman et al., 2015) (for discrete and continuous action problems, respectively) as subroutines for estimating a κ-greedy policy (Line 4 in Alg. 1, κ-PI, and Line 5 in Alg. 2, κ-VI) and an optimal value of the surrogate MDP (Line 3 in Alg. 2, κ-VI). We refer to the resulting algorithms as κ-PI-DQN, κ-VI-DQN, κ-PI-TRPO, and κ-VI-TRPO.
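As a concrete illustration of Algorithm 2, the following is a minimal tabular sketch of κ-VI under stated assumptions: a small synthetic MDP with a known model (P and R below are hypothetical random arrays), in which the γκ-discounted surrogate problem of Eqs. 4-6 is solved exactly by an inner standard value iteration. In the model-free algorithms described next, DQN or TRPO plays the role of this inner solver.

```python
# A minimal tabular sketch of kappa-VI (Algorithm 2); P and R are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, kappa = 5, 3, 0.99, 0.8
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']: transition kernel
R = rng.uniform(0, 1, size=(S, A))           # r(s, a)

def solve_surrogate(V, n_inner=500):
    """Solve M_{gamma*kappa}(V) exactly: returns (T_kappa V)(s) and the
    kappa-greedy policy w.r.t. V (Eqs. 4-5)."""
    shaped_R = R + gamma * (1.0 - kappa) * (P @ V)    # Eq. 6, in expectation over s'
    W = np.zeros(S)
    for _ in range(n_inner):                          # standard VI on the surrogate MDP
        W = np.max(shaped_R + gamma * kappa * (P @ W), axis=1)
    pi = np.argmax(shaped_R + gamma * kappa * (P @ W), axis=1)
    return W, pi

V = np.zeros(S)
for i in range(10):              # kappa-VI outer loop: V_{i+1} = T_kappa V_i
    V, _ = solve_surrogate(V)
_, pi = solve_surrogate(V)       # kappa-greedy policy w.r.t. the final value (Alg. 2, Line 5)
print("kappa-greedy policy:", pi)
```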
In order to have an efficient implementation of our κ-PI and κ-VI based algorithms, the main question to answer is how a fixed number of samples T should be allocated to the different parts of the κ-PI and κ-VI algorithms. More precisely, how shall we set N_κ ∈ ℕ, i.e., the total number of iterations of our algorithms, and how do we determine the number of samples used to solve the surrogate MDP at each iteration? To answer these questions, we devise a heuristic approach based on the theory of the κ-PI and κ-VI algorithms, and in particular on Eq. 7. Since N_κ only appears explicitly in the second term on the RHS of Eq. 7, an appropriate choice of N_κ is such that C_κ δ_κ/(1 − γ)² ≈ ξ^{N_κ} V_max. Note that setting N_κ to a higher value would not significantly improve the performance, because the asymptotic term in Eq. 7 is independent of N_κ. In practice, since δ_κ and C_κ are unknown, we set N_κ to satisfy the following equality:

ξ^{N_κ} = C_FA,   (8)

where C_FA is a hyper-parameter that depends on the final accuracy we aim for. For example, if our goal is to obtain 90% accuracy, we would set C_FA = 0.1, which results in N_{κ=0.99} ≈ 4 and N_{κ=0.5} ≈ 115, for γ = 0.99. Our experimental results in Section 5 suggest that this approach leads to a reasonable choice for the total number of iterations N_κ. It is important to note the following facts: 1) as we increase κ, we expect fewer iterations to be needed for κ-PI and κ-VI to converge to a good policy, and 2) the effective horizon of the surrogate MDP that κ-PI and κ-VI solve at each iteration, 1/(1 − γκ), increases with κ.

Lastly, we need to determine the number of samples used in each iteration of our κ-PI and κ-VI based algorithms. We allocate an equal number of samples per iteration, denoted by T_κ. Since the total number of samples, T, is known beforehand, we set the number of samples per iteration to

T_κ = T / N_κ.   (9)

In the rest of the paper, we first derive our DQN-based and TRPO-based algorithms in Sections 4.1 and 4.2. It is important to note that for κ = 1, our algorithms reduce to DQN and TRPO. We then conduct a set of experiments with these algorithms in Sections 5.1 and 5.2, in which we carefully study the effect of κ and N_κ (or, equivalently, of the hyper-parameter C_FA defined by Eq. 8) on their performance.
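As a sanity check of this recipe, the short snippet below computes N_κ and T_κ from Eqs. 8 and 9, reading Eq. 8 as ξ^{N_κ} = C_FA with ξ = γ(1 − κ)/(1 − γκ) from Section 3; rounding up to an integer is our own choice. It reproduces the example values quoted above.

```python
# A small sketch of the sample-allocation recipe of Eqs. 8-9 (rounding is an assumption).
import math

def allocate(total_samples, gamma, kappa, c_fa):
    xi = gamma * (1.0 - kappa) / (1.0 - gamma * kappa)   # convergence rate xi from Section 3
    n_kappa = math.ceil(math.log(c_fa) / math.log(xi))   # Eq. 8: smallest N with xi^N <= C_FA
    t_kappa = total_samples // n_kappa                   # Eq. 9: equal samples per iteration
    return n_kappa, t_kappa

print(allocate(10**6, gamma=0.99, kappa=0.99, c_fa=0.1))  # (4, 250000): log ratio ~3.3, i.e. N ~ 4
print(allocate(10**6, gamma=0.99, kappa=0.5,  c_fa=0.1))  # (116, 8620): log ratio ~115.1, i.e. N ~ 115
```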
### 4.1. κ-PI-DQN and κ-VI-DQN Algorithms

Algorithm 3 presents the pseudo-code of κ-PI-DQN. Due to space constraints, we report the detailed pseudo-code in Appendix A.1 (Alg. 5). We use four neural networks in this algorithm: two to represent the Q-function of the original MDP (with discount factor γ and reward r), Q_φ (Q-network) and Q'_φ (target network), and two for the Q-function of the surrogate MDP, Q_θ (Q-network) and Q'_θ (target network). In the policy improvement step, we use DQN to solve the γκ-discounted surrogate MDP with the shaped reward r_j(κ, V_φ) = r_j + γ(1 − κ)V_φ(s_{j+1}), i.e., M_{γκ}(V_φ), where V_φ ≈ V^{π_{i−1}} and is computed as V_φ(s) = Q_φ(s, argmax_a Q'_θ(s, a)). The output of DQN is (approximately) the optimal Q-function of M_{γκ}(V_φ), and thus the new policy π_i, which is the (approximate) κ-greedy policy w.r.t. V_φ, is given by π_i(·) = argmax_a Q'_θ(·, a). In the policy evaluation step, we use off-policy TD(0) to evaluate the Q-function of the current policy π_i, i.e., Q_φ ≈ Q^{π_i}. Although what is needed is an estimate of the value function of the current policy, V_φ ≈ V^{π_i}, we chose to evaluate its Q-function, because the available data (the transitions stored in the replay buffer) is off-policy, and unlike the value function, the Q-function of a fixed policy can be easily evaluated with this type of data using off-policy TD(0).

Algorithm 3: κ-PI-DQN
1: Initialize replay buffer D; Q-networks Q_θ, Q_φ; target networks Q'_θ, Q'_φ;
2: for i = 0, ..., N_κ − 1 do
3:   # Policy Improvement
4:   for t = 1, ..., T_κ do
5:     Act by an ε-greedy policy w.r.t. Q_θ(s_t, a); observe r_t, s_{t+1}; and store (s_t, a_t, r_t, s_{t+1}) in D;
6:     Sample a batch {(s_j, a_j, r_j, s_{j+1})}_{j=1}^N from D;
7:     Update θ using DQN with
8:       {(s_j, a_j, r_j(κ, V_φ), s_{j+1})}_{j=1}^N, where
9:       V_φ(s_{j+1}) = Q_φ(s_{j+1}, π_{i−1}(s_{j+1})) and
10:      π_{i−1}(·) ∈ argmax_a Q'_θ(·, a);
11:    Copy θ to θ' occasionally (θ' ← θ);
12:   end for
13:   # Policy Evaluation of π_i(s) ∈ argmax_a Q'_θ(s, a)
14:   for t = 1, ..., T_κ do
15:     Sample a batch {(s_j, a_j, r_j, s_{j+1})}_{j=1}^N from D;
16:     Update φ using this data and off-policy TD(0) to estimate the Q-function of the current policy π_i;
17:     Copy φ to φ' occasionally (φ' ← φ);
18:   end for
19: end for

Remark 1. In the policy improvement phase, π_{i−1} is computed as the argmax of Q'_θ(·, a) (Line 10), a quantity that is not constant and is (slowly) changing during this phase. Addressing this issue requires using an additional target network that is set to Q_θ only at the end of each improvement step, and whose argmax is used to compute π_{i−1} throughout the improvement step of the next iteration. We tested using this additional network in our experiments, but it did not improve the performance, and thus we decided to report the algorithm without it.

We report the pseudo-code of κ-VI-DQN in Appendix A.1 (Alg. 6). Note that κ-VI simply repeats V ← T_κ V and computes T_κ V, which is the optimal value of the surrogate MDP M_{γκ}(V). In κ-VI-DQN, we repeatedly solve M_{γκ}(V) by DQN and use its (approximately) optimal Q-function to shape the reward of the next iteration. The algorithm uses three neural networks: two to solve the surrogate MDP by DQN, Q_θ (Q-network) and Q'_θ (target network), and one to store its optimal Q-function for use in the shaped reward of the next iteration, Q_φ. Let Q*_{γκ,V} and V*_{γκ,V} be the optimal Q and V functions of M_{γκ}(V). Then, we have max_a Q*_{γκ,V}(s, a) = V*_{γκ,V}(s) = (T_κ V)(s), where the first equality is by definition (Sec. 2) and the second one holds since T_κ V is the optimal value of M_{γκ}(V) (Sec. 3). Therefore, in κ-VI-DQN, we shape the reward at each iteration by max_a Q_φ(s, a), where Q_φ is the output of the DQN from the previous iteration, i.e., max_a Q_φ(s, a) ≈ T_κ V_{i−1}.

Figure 1: Training performance of the naive baseline N_κ = T and of κ-PI-DQN and κ-VI-DQN for C_FA = 0.05 on Breakout (top) and Space Invaders (bottom). See Appendix A.2 for performance w.r.t. different C_FA values.
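The following sketch shows how the shaped minibatch targets of the policy-improvement phase (Lines 6-10 of Algorithm 3) can be computed, assuming the Q-networks have already been evaluated on the batch. The array names, the terminal-state masking, and the explicit γκ bootstrap term are our own illustrative choices, not details prescribed by the pseudo-code.

```python
# A sketch of the shaped DQN targets in kappa-PI-DQN's improvement phase (names hypothetical).
import numpy as np

def kappa_dqn_targets(r, q_phi_next, q_theta_target_next, gamma, kappa, done):
    """r: (B,) rewards; q_phi_next, q_theta_target_next: (B, A) Q-values at s_{j+1}
    from Q_phi (original MDP) and Q'_theta (surrogate target net); done: (B,) terminal flags."""
    a_greedy = np.argmax(q_theta_target_next, axis=1)              # pi_{i-1}(s_{j+1}), Line 10
    v_phi = q_phi_next[np.arange(len(r)), a_greedy]                # V_phi(s_{j+1}), Line 9
    shaped_r = r + gamma * (1.0 - kappa) * v_phi * (1.0 - done)    # Eq. 6 / Line 8
    # Standard DQN bootstrap on the surrogate MDP, whose discount factor is gamma*kappa:
    return shaped_r + gamma * kappa * np.max(q_theta_target_next, axis=1) * (1.0 - done)
```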
### 4.2. κ-PI-TRPO and κ-VI-TRPO Algorithms

Algorithm 4 presents the pseudo-code of κ-PI-TRPO (a detailed pseudo-code is reported in Appendix B.1, Alg. 7). Recall that TRPO iteratively updates the current policy using its return and an estimate of its value function. At each iteration i of κ-PI-TRPO: 1) we use the estimate of the value of the current policy, V_φ ≈ V^{π_{i−1}}, to calculate the return R(κ, V_φ) and the estimate V_θ of the value function of the surrogate MDP M_{γκ}(V_φ); 2) we use R(κ, V_φ) and V_θ to compute the new policy π_i (TRPO style); and 3) we estimate the value of the new policy, V_φ ≈ V^{π_i}, in the original MDP (with discount factor γ and reward r). The algorithm uses three neural networks: one for the value function of the original MDP, V_φ, one for the value function of the surrogate MDP, V_θ, and one for the policy, π_ψ.

Algorithm 4: κ-PI-TRPO
1: Initialize V-networks V_θ and V_φ; policy network π_ψ;
2: for i = 0, ..., N_κ − 1 do
3:   for t = 1, ..., T_κ do
4:     Simulate the current policy for M steps and calculate the following two returns for all steps j: R_j(κ, V_φ) = Σ_{t=j}^{M} (γκ)^{t−j} r_t(κ, V_φ) and R_j = Σ_{t=j}^{M} γ^{t−j} r_t;
5:     Update θ by minimizing the batch loss function L_{V_θ} = (1/N) Σ_{j=1}^{N} (V_θ(s_j) − R_j(κ, V_φ))²;
6:     # Policy Improvement
7:     Update ψ using TRPO and the batch {(R_j(κ, V_φ), V_θ(s_j))}_{j=1}^{N};
8:   end for
9:   # Policy Evaluation
10:  Update φ by minimizing the batch loss function L_{V_φ} = (1/N) Σ_{j=1}^{N} (V_φ(s_j) − R_j)²;
11: end for

We report the pseudo-code of κ-VI-TRPO in Appendix B.1, Alg. 8. As previously noted, κ-VI iteratively solves the surrogate MDP and uses its optimal value T_κ V_{i−1} to shape the reward of the surrogate MDP in the next iteration. In κ-VI-TRPO, we solve the surrogate MDP M_{γκ}(V_{i−1} = V_φ) with TRPO until its policy converges to the optimal policy of M_{γκ}(V_{i−1} = V_φ) and its value V_θ converges to T_κ V_{i−1} = T_κ V_φ. We then replace V_i with V_θ (V_i = V_φ ← V_θ) and repeat this process.

## 5. Experimental Results

In our experiments, we specifically focus on answering the following questions:

1. Does the performance of DQN and TRPO improve when they are used as κ-greedy solvers in κ-PI and κ-VI? Is there a performance tradeoff w.r.t. κ?

2. Following κ-PI and κ-VI, our DQN and TRPO implementations of these algorithms devote a significant number of samples T_κ to each iteration. Is this needed, or does the naive choice of T_κ = 1, or equivalently N_κ = T, work just as well for all values of κ?

Figure 2: Training performance of the naive baseline N_κ = T and of κ-PI-TRPO and κ-VI-TRPO for C_FA = 0.2 on Walker (top) and Ant (bottom). See Appendix B.2 for performance w.r.t. different C_FA values. Each iteration corresponds to roughly 1000 environment samples, and thus the total number of training samples is 2 million.

We choose to test our κ-DQN and κ-TRPO algorithms on the Atari and MuJoCo benchmarks, respectively. Both sets of algorithms use standard setups, including the Adam optimizer for gradient-based updates, a discount factor of 0.99 across all tasks, target Q-value networks in the case of κ-DQN, and an entropy regularizer with a coefficient of 0.01 in the case of κ-TRPO. We run our experiments for multiple values of κ between zero and one, which roughly follow a logarithmic scale. Note that, by the definition of our κ-greedy algorithms, setting κ = 1.0 reduces them to the base DQN and TRPO in all experiments; this forms one of the two baselines we consider in this work. The second baseline uses our κ-greedy algorithms with a fixed N_κ = T; thus, independent of the value of κ, the surrogate MDP is solved for a single time-step per iteration. Below, we describe the experiments and results in further detail. The implementation details for the κ-DQN and κ-TRPO cases are provided in Appendix A, Table 3 and Appendix B, Table 4, respectively.

### 5.1. κ-PI-DQN and κ-VI-DQN Experiments

In this section, we empirically analyze the performance of the κ-PI-DQN and κ-VI-DQN algorithms on the Atari domains Breakout, Space Invaders, Seaquest, Enduro, Beam Rider, and Qbert (Bellemare et al., 2013). We start by performing an ablation over three values of the hyper-parameter C_FA ∈ {0.001, 0.05, 0.2} on the Breakout domain. The value of C_FA sets the total number of iterations N_κ (Eq. 8) and the number of samples per iteration T_κ (Eq. 9).
The total number of samples is set to T ≈ 10⁶. This value represents the number of samples after which our DQN-based algorithms approximately converge. For each value of C_FA, we test κ-PI-DQN and κ-VI-DQN for several values of κ. In both algorithms, the best performance was obtained with C_FA = 0.05. Therefore, C_FA is set to 0.05 for all our experiments with the other Atari domains.

Figure 1 shows the training performance of κ-PI-DQN and κ-VI-DQN on Breakout and Space Invaders for the best value, C_FA = 0.05, as well as for the naive baseline T_κ = 1, or equivalently N_κ = T. The results on Breakout for the other values of C_FA and the training plots for all other Atari domains (with C_FA = 0.05) are reported in Appendices A.2 and A.3, respectively. Table 1 shows the final training performance of κ-PI-DQN and κ-VI-DQN on the Atari domains with C_FA = 0.05. Note that the scores reported in Table 1 are the actual returns on the Atari domains, while the vertical axis in the plots of Figure 1 corresponds to a scaled return. We plot the scaled return, since this makes it easier to reproduce our results using the OpenAI Baselines codebase (Hill et al., 2018).

Table 1: The final training performance of κ-PI-DQN and κ-VI-DQN on the Atari domains, for the hyper-parameter C_FA = 0.05. The values are reported with a 95% confidence interval across 10 random runs (empirical mean ± 1.96 × empirical standard deviation / √n, n = 10). The best scores are in bold and multiple bold values for a domain denote an insignificant statistical difference between them.

| Domain | Alg. | best κ | κ = 0 | DQN, κ = 1 | N_κ = T, best κ (κ-PI) |
|---|---|---|---|---|---|
| Breakout | κ-PI | 223 (±7), κ=0.84 | 154 (±3) | 134 (±4) | 170 (±2), κ=0.68 |
| | κ-VI | 181 (±7), κ=0.68 | 174 (±5) | | |
| Space Invaders | κ-PI | 755 (±23), κ=0.84 | 613 (±20) | 656 (±17) | 700 (±21), κ=0.98 |
| | κ-VI | 712 (±25), κ=0.92 | 687 (±32) | | |
| Seaquest | κ-PI | 5159 (±292), κ=0.92 | 2612 (±238) | 3099 (±191) | 3897 (±218), κ=0.68 |
| | κ-VI | 3253 (±402), κ=0.84 | 2680 (±382) | | |
| Enduro | κ-PI | 533 (±12), κ=0.84 | 478 (±10) | 224 (±110) | 535 (±13), κ=0.68 |
| | κ-VI | 486 (±23), κ=0.84 | 443 (±90) | | |
| Beam Rider | κ-PI | 3849 (±110), κ=1.0 | 3103 (±279) | 3849 (±110) | 3849 (±110), κ=1.0 |
| | κ-VI | 4277 (±269), κ=0.84 | 3714 (±437) | | |
| Qbert | κ-PI | 8157 (±265), κ=0.84 | 6719 (±520) | 7258 (±385) | 7968 (±218), κ=0.98 |
| | κ-VI | 8060 (±158), κ=0.84 | 7563 (±398) | | |

Table 2: The final training performance of κ-PI-TRPO and κ-VI-TRPO on the MuJoCo domains, for the hyper-parameter C_FA = 0.2. The values are reported with a 95% confidence interval across 10 random runs (empirical mean ± 1.96 × empirical standard deviation / √n, n = 10). The best scores are in bold and multiple bold values for a domain denote an insignificant statistical difference between them.

| Domain | Alg. | best κ | κ = 0 | TRPO, κ = 1 | N_κ = T, best κ (κ-PI) | GAE, λ_best |
|---|---|---|---|---|---|---|
| Walker | κ-PI | 1438 (±188), κ=0.68 | 1371 (±192) | 594 (±99) | 1082 (±110), κ=0.0 | 1601 (±190), λ=0.0 |
| | κ-VI | 954 (±88), κ=0.68 | 830 (±262) | | | |
| Ant | κ-PI | 1377 (±183), κ=0.68 | 1006 (±106) | −19 (±1) | 1090 (±99), κ=0.68 | 1094 (±139), λ=0.0 |
| | κ-VI | 2998 (±264), κ=0.68 | 1879 (±128) | | | |
| Half Cheetah | κ-PI | 1334 (±151), κ=0.36 | 907 (±176) | −18 (±87) | 1195 (±218), κ=0.36 | 1322 (±213), λ=0.36 |
| | κ-VI | 1447 (±346), κ=0.36 | 1072 (±30) | | | |
| Humanoid Standup | κ-PI | 72604 (±1219), κ=0.99 | 52936 (±1529) | 68143 (±1031) | 71331 (±1149), κ=0.98 | 71932 (±2122), λ=0.98 |
| | κ-VI | 72821 (±908), κ=0.99 | 51148 (±1377) | | | |
| Swimmer | κ-PI | 107 (±12), κ=1.0 | 42 (±3) | 107 (±12) | 107 (±12), κ=1.0 | 103 (±13), λ=1.0 |
| | κ-VI | 114 (±15), κ=1.0 | 46 (±1) | | | |
| Hopper | κ-PI | 1486 (±324), κ=0.68 | 1012 (±263) | 1142 (±141) | 1434 (±129), κ=0.98 | 1600 (±134), λ=0.84 |
| | κ-VI | 1069 (±76), κ=0.92 | 531 (±125) | | | |
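For completeness, the confidence intervals quoted in the captions of Tables 1 and 2 (empirical mean ± 1.96 × empirical standard deviation / √n, with n = 10 runs) can be computed as follows; the scores below are hypothetical final returns from 10 runs, and whether the paper uses the sample or population standard deviation is not specified.

```python
# A sketch of the 95% confidence-interval computation used in Tables 1 and 2.
import numpy as np

def ci95(scores):
    scores = np.asarray(scores, dtype=float)
    half_width = 1.96 * scores.std() / np.sqrt(len(scores))  # std() with ddof=0 is an assumption
    return scores.mean(), half_width

print(ci95([210, 225, 231, 219, 224, 228, 215, 230, 221, 226]))  # ~ (222.9, 3.9)
```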
Our results exhibit that both κ-PI-DQN and κ-VI-DQN improve the performance of DQN (κ = 1). Moreover, they show that setting N_κ = T leads to a clear degradation of the final training performance on all of the domains except Enduro, where it reaches approximately the same score. Although the performance degrades, the results for N_κ = T are still better than those of DQN.

### 5.2. κ-PI-TRPO and κ-VI-TRPO Experiments

In this section, we empirically analyze the performance of the κ-PI-TRPO and κ-VI-TRPO algorithms on the MuJoCo (Todorov et al., 2012) based OpenAI Gym domains Walker2d-v2, Ant-v2, HalfCheetah-v2, HumanoidStandup-v2, Swimmer-v2, and Hopper-v2 (Brockman et al., 2016). As in Section 5.1, we start by performing an ablation over the hyper-parameter C_FA ∈ {0.001, 0.05, 0.2} on the Walker domain. We set the total number of iterations to 2000, with each iteration consisting of 1000 samples. Thus, the total number of samples is T ≈ 2 × 10⁶. This is the number of samples after which our TRPO-based algorithms approximately converge. For each value of C_FA, we test κ-PI-TRPO and κ-VI-TRPO for several values of κ. In both algorithms, the best performance was obtained with C_FA = 0.2, and thus we set C_FA = 0.2 in our experiments with the other MuJoCo domains.

Figure 2 shows the training performance of κ-PI-TRPO and κ-VI-TRPO on the Walker and Ant domains for the best value, C_FA = 0.2, as well as for the naive baseline T_κ = 1, or equivalently N_κ = T. The results on Walker for the other C_FA values and for the other MuJoCo domains (with C_FA = 0.2) are reported in Appendices B.2 and B.3, respectively. Table 2 shows the final training performance of κ-PI-TRPO and κ-VI-TRPO on the MuJoCo domains with C_FA = 0.2. The results exhibit that both κ-PI-TRPO and κ-VI-TRPO yield better performance than TRPO (κ = 1). Furthermore, they show that the algorithms with C_FA = 0.2 perform better than with N_κ = T on three out of the six domains (Walker, Ant, and Half Cheetah) and equally well on the remaining three (Humanoid Standup, Swimmer, and Hopper).

## 6. Discussion and Related Work

Comparison with GAE: There is a close connection between the Generalized Advantage Estimation (GAE) algorithm (Schulman et al., 2016) and κ-PI. In GAE, the policy is updated using the gradient

E_{s∼d_π}[ Σ_{t≥0} (γλ)^t δ^V_t ∇ log π(a_t|s_t) ],   (10)

where δ^V_t = r_t + γV(s_{t+1}) − V(s_t) is the TD error and d_π is the occupancy measure of policy π. Eq. 10 can be interpreted as a gradient in a γλ-discounted MDP with shaped reward δ^V, which we refer to as M_{γλ}(δ^V). As noted in Efroni et al. (2018a, Sec. 6), an optimal policy of the MDP M_{γλ}(δ^V) is also optimal in M_{γκ}(V) with κ = λ. This means that this policy is the κ-greedy policy w.r.t. V.

Comparing the two algorithms, we notice that GAE is conceptually similar to κ-PI-TRPO with T_κ = 1, or equivalently N_κ = T. In fact, in the case of T_κ = 1, we can obtain a pseudo-code of GAE by removing Line 5 of Algorithm 4 (there is no need to estimate the value of the surrogate MDP) and replacing r_t(κ, V_φ) with the TD error of the original MDP in Line 4.

In Section 5.2, we compared the empirical performance of GAE with that of κ-PI-TRPO and κ-VI-TRPO (see Figure 2 and Table 2). The results show that κ-PI-TRPO and κ-VI-TRPO perform better than or on par with GAE. Moreover, we observe that in most domains, the performance of GAE is equivalent to that of κ-PI-TRPO with N_κ = T. This is in accordance with the description above connecting GAE to our naive baseline implementation. We report all GAE results in Appendix B.3.
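The following short check illustrates this connection numerically, under the assumption of a trajectory that terminates at step M (so V(s_M) = 0) and of the truncated sums used in Algorithm 4 and GAE: with λ = κ, the (γλ)-weighted sum of TD errors equals the κ-return R_j(κ, V) minus the baseline V(s_j), so the two quantities induce the same TRPO update. All arrays are hypothetical random data.

```python
# Numerical check: GAE advantages with lambda = kappa equal R_j(kappa, V) - V(s_j).
import numpy as np

rng = np.random.default_rng(1)
M, gamma, kappa = 20, 0.99, 0.7
r = rng.uniform(0, 1, M)                  # r_0 .. r_{M-1}
V = np.append(rng.uniform(0, 1, M), 0.0)  # V(s_0) .. V(s_{M-1}), with V(s_M) = 0 (terminal)

# kappa-return of Algorithm 4: R_j = sum_{t>=j} (gamma*kappa)^{t-j} [r_t + gamma*(1-kappa)*V(s_{t+1})]
R = np.zeros(M)
acc = 0.0
for j in reversed(range(M)):
    acc = (r[j] + gamma * (1.0 - kappa) * V[j + 1]) + gamma * kappa * acc
    R[j] = acc

# GAE advantage with lambda = kappa: A_j = sum_{t>=j} (gamma*lambda)^{t-j} delta_t
delta = r + gamma * V[1:] - V[:-1]
A = np.zeros(M)
acc = 0.0
for j in reversed(range(M)):
    acc = delta[j] + gamma * kappa * acc
    A[j] = acc

print(np.allclose(A, R - V[:-1]))  # True: the two estimators agree up to the baseline V(s_j)
```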
Remark 2 (GAE Implementation). In the OpenAI implementation of GAE, the value network is updated w.r.t. the target Σ_t (γλ)^t r_t, whereas in the GAE paper (Schulman et al., 2016), Σ_t γ^t r_t is used as the target. We chose the latter form in our implementation of GAE, to be in accord with the paper.

Lowering the Discount Factor in DQN and TRPO: To show the advantage of the κ-PI and κ-VI based algorithms over simply lowering the discount factor γ, we test the performance of the vanilla DQN and TRPO algorithms with values of γ lower than the one used previously (i.e., γ = 0.99). As evident from Figure 3, only in the Ant domain does this approach result in improved performance (for γ = 0.68). Even there, the performance of κ-PI-TRPO, and especially κ-VI-TRPO, surpasses that of TRPO with the lower value of γ = 0.68. In Breakout, Space Invaders, and Walker, the performance of DQN and TRPO worsens or remains unchanged when we lower the discount factor (DQN and TRPO do not benefit from lowering γ), while our κ-PI and κ-VI based algorithms perform better when lowering the value of κ (note that our algorithms reduce to DQN and TRPO for κ = 1).

Figure 3: Lowering the discount factor γ for the Atari domains Breakout and Space Invaders and the MuJoCo domains Walker-v2 and Ant-v2, respectively.

Remark 3. While we observe better performance for smaller γ values in some MuJoCo domains, e.g., γ = 0.68 in Ant, lowering γ always results in inferior performance on the Atari domains. This is due to the fact that a number of MuJoCo domains, such as Ant, are inherently short-horizon decision problems, and thus their performance does not degrade (and sometimes even improves) when the discount factor is lowered. On the other hand, the Atari problems are generally not short-horizon, and thus their performance degrades with lower γ.

## 7. Conclusion and Future Work

In this paper, we studied the use of multi-step greedy policies in model-free RL and showed that in most problems, the algorithms derived from this formulation achieve better performance than their single-step counterparts. We adopted the κ-greedy formulation of multi-step greedy policies (Efroni et al., 2018a) and derived four model-free RL algorithms. The main component of the policy and value iteration algorithms derived from this formulation, κ-PI and κ-VI, is solving a surrogate decision problem with a shaped reward and a smaller discount factor. Our algorithms use the popular deep RL algorithms DQN and TRPO to solve the surrogate decision problem, and thus we refer to them as κ-PI-DQN, κ-VI-DQN, κ-PI-TRPO, and κ-VI-TRPO. We empirically evaluated our proposed algorithms and compared them with DQN, TRPO, and GAE on the Atari and MuJoCo benchmarks. Our experiments show that for a large range of κ, our algorithms perform better than DQN and TRPO. Furthermore, we proposed a recipe for allocating the total sample budget to the evaluation and improvement phases of our algorithms, and empirically demonstrated the importance of this allocation. We also showed how GAE can be derived by minor modifications to κ-PI-TRPO, and thus is itself a κ-greedy RL algorithm. Finally, we showed the advantage of the multi-step greedy formulation over lowering the discount factor in DQN and TRPO. Our results indicate that while the performance of DQN and TRPO degrades with a lower discount factor, our multi-step greedy algorithms improve over DQN and TRPO.

An interesting future direction would be to use other multi-step greedy formulations (Bertsekas & Tsitsiklis, 1996; Bertsekas, 2018; Efroni et al., 2018a; Sun et al., 2018; Shani et al., 2019) to derive model-free RL algorithms.
Another direction is to use multi-step greediness in model-based RL (e.g., Kumar et al., 2016; Talvitie, 2017; Luo et al., 2018; Janner et al., 2019) and solve the surrogate decision problem with an approximate model. We conjecture that in this case one may set κ, or more generally the planning horizon, as a function of the quality of the approximate model: gradually increasing κ as the approximate model gets closer to the real one. We leave the theoretical and empirical study of this problem for future work. Finally, we believe that using an adaptive κ would greatly improve the performance of our proposed algorithms. We leave verifying this, and determining how κ should change as a function of errors in gradient and value estimation, for future work.

## References

Baird, L. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30-37, 1995.

Bellemare, M., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.

Bertsekas, D. Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. Journal of Automatica Sinica, 6(1):1-31, 2018.

Bertsekas, D. and Tsitsiklis, J. Neuro-dynamic programming, volume 5. 1996.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Efroni, Y., Dalal, G., Scherrer, B., and Mannor, S. Beyond the one-step greedy approach in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, 2018a.

Efroni, Y., Dalal, G., Scherrer, B., and Mannor, S. Multiple-step greedy policies in approximate and online reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5238-5247, 2018b.

Farahmand, A., Szepesvári, C., and Munos, R. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, pp. 568-576, 2010.

Greensmith, E., Bartlett, P., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471-1530, 2004.

Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. Stable Baselines. GitHub repository, https://github.com/hill-a/stable-baselines, 2018.

Janner, M., Fu, J., Zhang, M., and Levine, S. When to trust your model: Model-based policy optimization. arXiv preprint arXiv:1906.08253, 2019.

Jiang, N., Kulesza, A., Singh, S., and Lewis, R. The dependence of effective planning horizon on model accuracy. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, pp. 1181-1189, 2015.

Kumar, V., Todorov, E., and Levine, S. Optimal control with learned local models: Application to dexterous manipulation. In International Conference on Robotics and Automation, pp. 378-383, 2016.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(1):1334-1373, 2016.

Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
Lu, T., Schuurmans, D., and Boutilier, C. Non-delusional Q-learning and value iteration. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 9971-9981, 2018.

Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Munos, R. Error bounds for approximate policy iteration. In Proceedings of the International Conference on Machine Learning, pp. 560-567, 2003.

Munos, R. Error bounds for approximate value iteration. In Proceedings of the National Conference on Artificial Intelligence, pp. 1006-1011, 2005.

Petrik, M. and Scherrer, B. Biasing approximate dynamic programming with a lower discount factor. In Advances in Neural Information Processing Systems, pp. 1265-1272, 2009.

Scherrer, B. Approximate policy iteration schemes: a comparison. In International Conference on Machine Learning, pp. 1314-1322, 2014.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897, 2015.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations, 2016.

Shani, L., Efroni, Y., and Mannor, S. Exploration conscious reinforcement learning revisited. In International Conference on Machine Learning, pp. 5680-5689, 2019.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

Sun, W., Gordon, G., Boots, B., and Bagnell, J. Dual policy iteration. In Advances in Neural Information Processing Systems, pp. 7059-7069, 2018.

Sutton, R. and Barto, A. Reinforcement learning: An introduction. 2018.

Talvitie, E. Self-correcting models for model-based reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Thrun, S. and Schwartz, A. Issues in using function approximation for reinforcement learning. In Proceedings of the Connectionist Models Summer School, pp. 255-263, 1993.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, pp. 5026-5033, 2012.

Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In AAAI Conference on Artificial Intelligence, 2016.