# Dual Policy Iteration

Wen Sun¹, Geoffrey J. Gordon¹, Byron Boots², and J. Andrew Bagnell³
¹School of Computer Science, Carnegie Mellon University, USA
²College of Computing, Georgia Institute of Technology, USA
³Aurora Innovation, USA
{wensun, ggordon, dbagnell}@cs.cmu.edu, bboots@cc.gatech.edu

A novel class of Approximate Policy Iteration (API) algorithms has recently demonstrated impressive practical performance (e.g., ExIt [1], AlphaGo-Zero [2]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., tree search) that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved via guidance from the reactive policy. In this work we study this class of Dual Policy Iteration (DPI) strategies in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using learned local models, and provides a theoretically sound way of unifying model-free and model-based RL approaches with unknown dynamics. We demonstrate the efficacy of our approach on various continuous control Markov Decision Processes.

## 1 Introduction

Approximate Policy Iteration (API) [3, 4, 5, 6, 7], including conservative API (CPI) [5], API driven by learned critics [8], and gradient-based API with stochastic policies [9, 10, 11, 12], has played a central role in Reinforcement Learning (RL) for decades and has motivated many modern practical RL algorithms. Several existing API methods [4, 5] can provide both local optimality guarantees and global guarantees under strong assumptions about the way samples are generated (e.g., access to a reset distribution that is similar to the optimal policy's state distribution). However, most modern practical API algorithms rely on myopic random exploration (e.g., REINFORCE-type policy gradient [13] or ε-greedy). Sample inefficiency due to random exploration can cause even sophisticated RL methods to perform worse than simple black-box optimization with random search in parameter space [14].

Recently, a new class of API algorithms, which we call Dual Policy Iteration (DPI), has begun to emerge. These algorithms follow a richer strategy for improving the policy, with two policies under consideration at any time during training: a reactive policy, usually learned by some form of function approximation, used for generating samples and deployed at test time, and an intermediate policy that can only be constructed or accessed during training, used as an expert policy to guide the improvement of the reactive policy. For example, ExIt [1] maintains and updates a UCT-based policy [15] as an intermediate expert. ExIt then updates the reactive policy by directly imitating the tree-based policy, which we expect to be better than the reactive policy as it involves a multi-step lookahead search. AlphaGo-Zero [2] employs a similar strategy to achieve super-human performance at the ancient game of Go.
The key difference that distinguishes ExIt and AlphaGo-Zero from previous API methods is that they leverage models to perform systematic forward search: the policy resulting from forward search acts as an expert and directly informs the improvement direction for the reactive policy. Hence the reactive policy improves by imitation instead of trial-and-error reinforcement learning. This strategy often provides better sample efficiency in practice compared to algorithms that simply rely on locally random search (e.g., AlphaGo-Zero abandons REINFORCE from AlphaGo [16]).

In this work we provide a general framework for synthesizing and analyzing DPI by considering a particular alternating optimization strategy, with different optimization approaches each forming a new family of approximate policy iteration methods. We additionally consider the extension to the RL setting with unknown dynamics. For example, we construct a simple instance of our framework, where the intermediate expert is computed from Model-Based Optimal Control (MBOC) locally around the reactive policy, and the reactive policy in turn is updated incrementally under the guidance of MBOC. The resulting algorithm iteratively learns a local dynamics model, applies MBOC to compute a locally optimal policy, and then updates the reactive policy by imitation, achieving larger policy improvement per iteration than classic API methods. The instantiation shares a similar spirit with previous work from the robotics and control literature, including [17, 18] and Guided Policy Search (GPS) [19] (and its variants, e.g., [20, 21, 22]), i.e., using local MBOC to speed up learning global policies.

To evaluate our approach, we demonstrate our algorithm on discrete MDPs and continuous control tasks. We show that by integrating local model-based search with learned local dynamics into policy improvement via an imitation learning-style update, our algorithm is substantially more sample-efficient than classic API algorithms such as CPI [5], as well as more recent actor-critic baselines [23], albeit at the cost of slower computation per iteration due to the model-based search. We also apply the framework to a robust policy optimization setting [24, 25] where the goal is to learn a single policy that generalizes across environments.

In summary, the major practical difference between DPI and many modern practical RL approaches is that instead of relying on random exploration, the DPI framework integrates local model learning, local model-based search for advanced exploration, and an imitation learning-style policy improvement, to improve the policy in a more systematic way. We also provide a general convergence analysis to support our empirical findings. Although our analysis is similar to CPI's, it has a key difference: as long as MBOC succeeds, we can guarantee a larger policy improvement than CPI at each iteration. Our analysis is general enough to provide theoretical intuition for previously successful practical DPI algorithms such as Expert Iteration (ExIt) [1]. We also analyze how predictive error from a learned local model mildly affects policy improvement, and show that a locally accurate dynamics model, i.e., a model that accurately predicts next states under the current policy's state-action distribution, is enough for improving the current policy.
We believe our analysis of local model predictive error versus local policy improvement can shed light on further development of model-based RL approaches with learned local models.

In summary, DPI operates in the middle of two extremes: (1) API-type methods that update policies locally (e.g., first-order methods like policy gradient and CPI), and (2) global model-based optimization, where one attempts to learn a global model and perform model-based search. First-order methods achieve only small policy improvement per iteration, while learning a global model suffers from greater model bias and requires a dataset that covers the entire state space. DPI instead learns a local model, which allows us to leverage the power of model-based optimization to locally improve the reactive policy.

## 2 Preliminaries

A discounted infinite-horizon Markov Decision Process (MDP) is defined as $(\mathcal{S}, \mathcal{A}, P, c, \rho_0, \gamma)$. Here, $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, and $P$ is the transition dynamics: $P(s'|s, a)$ is the probability of transitioning to state $s'$ from state $s$ by taking action $a$. We write $P_{s,a}$ as shorthand for $P(\cdot|s, a)$. We denote by $c(s, a)$ the cost of taking action $a$ in state $s$. Finally, $\rho_0$ is the initial distribution of states, and $\gamma \in (0, 1)$ is the discount factor. Throughout this paper, we assume that we know the form of the cost function $c(s, a)$, but the transition dynamics $P$ are unknown.

We define a stochastic policy $\pi$ such that for any state $s \in \mathcal{S}$, $\pi(\cdot|s)$ outputs a distribution over the action space. The distribution of states at time step $t$, induced by running the policy $\pi$ up to and including $t$, is defined for all $s_t$ as $d_t^\pi(s_t) = \sum_{\{s_i, a_i\}_{i \le t-1}} \rho_0(s_0) \prod_{i=0}^{t-1} \pi(a_i|s_i) P(s_{i+1}|s_i, a_i)$, where by definition $d_0^\pi(s) = \rho_0(s)$ for any $\pi$. The state visitation distribution is $d^\pi(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t d_t^\pi(s)$. We denote by $d^\pi \pi$ the joint state-action distribution, $d^\pi(s, a) = d^\pi(s)\, \pi(a|s)$. We define the value function $V^\pi(s)$, the state-action value function $Q^\pi(s, a)$, and the objective function $J(\pi)$ as:
$$V^\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t c(s_t, a_t) \,\Big|\, s_0 = s\Big], \quad Q^\pi(s, a) = c(s, a) + \gamma\, \mathbb{E}_{s' \sim P_{s,a}}[V^\pi(s')], \quad J(\pi) = \mathbb{E}_{s \sim \rho_0}[V^\pi(s)].$$
With $V^\pi$ and $Q^\pi$, the advantage function $A^\pi(s, a)$ is defined as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$. As we work in the cost setting, in the rest of the paper we refer to $A^\pi$ as the disadvantage function. The goal is to learn a single stationary policy that minimizes $J(\pi)$: $\pi^* = \arg\min_{\pi \in \Pi} J(\pi)$.

For two distributions $P_1$ and $P_2$, $D_{TV}(P_1, P_2)$ denotes the total variation distance, which is related to the $L_1$ norm as $D_{TV}(P_1, P_2) = \|P_1 - P_2\|_1 / 2$ (for a finite probability space), and $D_{KL}(P_1, P_2) = \int_x P_1(x) \log(P_1(x)/P_2(x))\, dx$ denotes the KL divergence. We introduce the Performance Difference Lemma (PDL) [5], which will be used extensively in this work:

Lemma 2.1 For any two policies $\pi$ and $\pi'$, we have: $J(\pi) - J(\pi') = \frac{1}{1-\gamma}\, \mathbb{E}_{(s,a) \sim d^\pi \pi}\big[A^{\pi'}(s, a)\big]$.

## 3 Dual Policy Iteration

We propose an alternating optimization framework inspired by the PDL (Lemma 2.1). Consider the min-max optimization framework:
$$\min_{\pi \in \Pi} \max_{\eta \in \Pi} \; \mathbb{E}_{s \sim d^\eta}\, \mathbb{E}_{a \sim \eta(\cdot|s)}\big[-A^{\pi}(s, a)\big].$$
It is not hard to see that the unique Nash equilibrium of the above objective is $(\pi, \eta) = (\pi^*, \pi^*)$. The above min-max suggests a general strategy, which we call Dual Policy Iteration (DPI): alternately fix one policy and update the other. Mapping to previous practical DPI algorithms [1, 2], $\pi$ stands for the fast reactive policy and $\eta$ corresponds to the tree search policy. For notation purposes, we use $\pi_n$ and $\eta_n$ to represent the two policies at the $n$-th iteration.
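To make these definitions concrete, the short script below numerically checks the PDL (Lemma 2.1), which underlies both the min-max objective above and the updates in the next subsections. The tabular setup, variable names, and closed-form solves are illustrative choices for this sketch, not part of the paper.

```python
# A minimal numerical check of the Performance Difference Lemma (Lemma 2.1)
# on a small random tabular MDP; all names here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over next states
c = rng.uniform(size=(nS, nA))                  # cost c(s, a)
rho0 = np.ones(nS) / nS                         # initial state distribution rho_0

def random_policy():
    return rng.dirichlet(np.ones(nA), size=nS)  # pi[s] is a distribution over actions

def value(pi):
    """Solve V^pi = c_pi + gamma * P_pi V^pi exactly."""
    c_pi = (pi * c).sum(axis=1)
    P_pi = np.einsum('sa,san->sn', pi, P)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, c_pi)

def visitation(pi):
    """d^pi(s) = (1 - gamma) * sum_t gamma^t d_t^pi(s), in closed form."""
    P_pi = np.einsum('sa,san->sn', pi, P)
    return (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho0)

pi, pi_prime = random_policy(), random_policy()
V, V_prime = value(pi), value(pi_prime)
Q_prime = c + gamma * P @ V_prime               # Q^{pi'}(s, a)
A_prime = Q_prime - V_prime[:, None]            # disadvantage A^{pi'}(s, a)

lhs = rho0 @ V - rho0 @ V_prime                 # J(pi) - J(pi')
rhs = (visitation(pi)[:, None] * pi * A_prime).sum() / (1 - gamma)
print(lhs, rhs)                                 # the two quantities agree
```

The two printed quantities agree up to floating-point error, which is exactly the identity the updates in Sec. 3.1 and 3.2 rely on.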
Below we introduce one instance of DPI for settings with unknown models (hence no tree search). We first describe how to compute $\eta_n$ from a given reactive policy $\pi_n$ (Sec. 3.1), and then describe how to update $\pi_n$ to $\pi_{n+1}$ via imitating $\eta_n$ (Sec. 3.2).

### 3.1 Updating $\eta$ with MBOC using Learned Local Models

Given $\pi_n$, the objective function for $\eta$ becomes:
$$\max_{\eta \in \Pi} \; \mathbb{E}_{s \sim d^\eta}\, \mathbb{E}_{a \sim \eta(\cdot|s)}\big[-A^{\pi_n}(s, a)\big].$$
From the PDL we can see that updating $\eta$ this way is equivalent to finding the optimal policy: $\arg\max_\eta (J(\pi_n) - J(\eta)) \equiv \arg\min_\eta J(\eta)$, regardless of what $\pi_n$ is. As directly minimizing $J(\eta)$ is as hard as the original problem, we update $\eta$ locally by constraining it to a trust region around $\pi_n$:
$$\min_\eta J(\eta), \quad \text{s.t.}\;\; \mathbb{E}_{s \sim d^{\pi_n}}\big[D_{TV}(\eta(\cdot|s), \pi_n(\cdot|s))\big] \le \alpha. \tag{1}$$
To solve the constrained optimization problem in Eq. 1, we propose to learn $P_{s,a}$ and use it with any off-the-shelf model-based optimal control algorithm. Moreover, thanks to the trust region, we can simply learn a local dynamics model, under the state-action distribution $d^{\pi_n}\pi_n$. We denote the optimal solution to the above constrained optimization (Eq. 1) under the real model $P_{s,a}$ as $\eta_n^*$. Note that, by the definition of optimality, $\eta_n^*$ must perform at least as well as $\pi_n$: $J(\pi_n) - J(\eta_n^*) \ge \Delta_n(\alpha)$, where $\Delta_n(\alpha) \ge 0$ is the performance gain of $\eta_n^*$ over $\pi_n$. When the trust region expands, i.e., $\alpha$ increases, $\Delta_n(\alpha)$ approaches the performance difference between the optimal policy and $\pi_n$.

To perform MBOC, we learn a locally accurate model, i.e., a model $\hat{P}$ that is close to $P$ under the state-action distribution induced by $\pi_n$: we seek a model $\hat{P}$ such that $\mathbb{E}_{(s,a) \sim d^{\pi_n}\pi_n}\big[D_{TV}(\hat{P}_{s,a}, P_{s,a})\big]$ is small. Optimizing $D_{TV}$ directly is hard, but note that, by Pinsker's inequality, $D_{KL}(P_{s,a}, \hat{P}_{s,a}) \ge 2\, D_{TV}(\hat{P}_{s,a}, P_{s,a})^2$, which indicates that we can optimize a surrogate loss defined by a KL divergence:
$$\arg\min_{\hat{P} \in \mathcal{P}}\, \mathbb{E}_{s \sim d^{\pi_n}, a \sim \pi_n(\cdot|s)}\big[D_{KL}(P_{s,a}, \hat{P}_{s,a})\big] = \arg\min_{\hat{P} \in \mathcal{P}}\, \mathbb{E}_{s \sim d^{\pi_n}, a \sim \pi_n(\cdot|s), s' \sim P_{s,a}}\big[-\log \hat{P}_{s,a}(s')\big], \tag{2}$$
where $\mathcal{P}$ denotes the model class. Hence we reduce the local model fitting problem to a classic maximum likelihood estimation (MLE) problem, where the training data $\{s, a, s'\}$ can easily be collected by executing $\pi_n$ on the real system (i.e., $P_{s,a}$). As we will show later, to ensure policy improvement, we only need the learned model to perform well under $d^{\pi_n}\pi_n$ (i.e., there is no mismatch between training and testing distributions, as there would be for global model learning). For later analysis purposes, we denote by $\hat{P}$ the MLE in Eq. 2 and assume $\hat{P}$ is $\delta$-optimal under $d^{\pi_n}\pi_n$:
$$\mathbb{E}_{(s,a) \sim d^{\pi_n}\pi_n}\big[D_{TV}(\hat{P}_{s,a}, P_{s,a})\big] \le \delta, \tag{3}$$
where $\delta \in \mathbb{R}^+$ is controlled by the complexity of the model class $\mathcal{P}$ and by the amount of training data we sample using $\pi_n$, and can be analyzed by standard supervised learning theory.

After obtaining a locally accurate model $\hat{P}$, we solve Eq. 1 using any existing stochastic MBOC solver. Assume the MBOC solver returns an optimal policy $\eta_n$ under the estimated model $\hat{P}$ subject to the trust region:
$$\eta_n = \arg\min_\eta J(\eta), \quad \text{s.t.}\;\; s_{t+1} \sim \hat{P}_{s_t, a_t}, \;\; \mathbb{E}_{s \sim d^{\pi_n}}\big[D_{TV}(\eta(\cdot|s), \pi_n(\cdot|s))\big] \le \alpha. \tag{4}$$
At this point, a natural question is: if $\eta_n$ is computed by an MBOC solver under $\hat{P}$, by how much can $\eta_n$ outperform $\pi_n$ when executed under the real dynamics $P$? Recall that the performance gain of the real optimal solution $\eta_n^*$ (optimal under $P$) over $\pi_n$ is denoted $\Delta_n(\alpha)$. The following theorem quantifies the performance of $\eta_n$ relative to $\pi_n$ in terms of the learned local model's predictive error $\delta$:

Theorem 3.1 Assume $\hat{P}_{s,a}$ satisfies Eq. 3 and $\eta_n$ is the output of an MBOC solver for the optimization problem defined in Eq. 4. Then $J(\eta_n) - J(\pi_n) \le -\Delta_n(\alpha) + O(\cdot)$, where the additional error term scales with the model error $\delta$, with one component of order $1/(1-\gamma)$ and another of order $\gamma/(1-\gamma)^2$.

The proof of the above theorem can be found in Appendix A.2. Theorem 3.1 indicates that when the model is locally accurate, i.e., $\delta$ is small (e.g., $\mathcal{P}$ is rich and we have enough data from $d^{\pi_n}\pi_n$), so that the error term is small, and there exists a locally optimal solution that is significantly better than the current policy $\pi_n$ (i.e., $\Delta_n(\alpha) \in \mathbb{R}^+$ is large), then the OC solver with the learned model $\hat{P}$ finds a nearly locally optimal solution $\eta_n$ that outperforms $\pi_n$. With a better $\eta_n$, we are now ready to improve $\pi_n$ via imitating $\eta_n$.
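To make the model-fitting step (Eq. 2) concrete, here is a minimal sketch that collects on-policy transition triples by rolling out $\pi_n$ and fits a Gaussian dynamics model by maximum likelihood. The linear-Gaussian model class and the `env.reset()`/`env.step()` interface are assumptions made for illustration; any model class $\mathcal{P}$ with an MLE procedure could be substituted.

```python
# A minimal sketch of the local model-fitting step (Eq. 2): run the current
# reactive policy pi_n on the real system, collect (s, a, s') triples, and fit
# a Gaussian dynamics model by maximum likelihood. The environment interface
# and the linear-Gaussian model class are illustrative assumptions.
import numpy as np

def collect_on_policy_data(env, pi_n, n_steps):
    """Roll out pi_n on the real dynamics P to sample from d^{pi_n} pi_n."""
    S, A, S_next = [], [], []
    s = env.reset()
    for _ in range(n_steps):
        a = pi_n(s)                              # a ~ pi_n(.|s)
        s_next, done = env.step(a)               # assumed (next_state, done) interface
        S.append(s); A.append(a); S_next.append(s_next)
        s = env.reset() if done else s_next
    return np.array(S), np.array(A), np.array(S_next)

def fit_gaussian_dynamics(S, A, S_next, reg=1e-6):
    """MLE for s' ~ N(A s + B a + c, Sigma): least squares + residual covariance."""
    X = np.hstack([S, A, np.ones((len(S), 1))])  # features [s, a, 1]
    W, *_ = np.linalg.lstsq(X.T @ X + reg * np.eye(X.shape[1]),
                            X.T @ S_next, rcond=None)
    residuals = S_next - X @ W
    Sigma = residuals.T @ residuals / len(S) + reg * np.eye(S.shape[1])
    A_hat = W[:S.shape[1]].T                     # state matrix
    B_hat = W[S.shape[1]:-1].T                   # action matrix
    c_hat = W[-1]                                # bias term
    return A_hat, B_hat, c_hat, Sigma            # hat{P}_{s,a} = N(A s + B a + c, Sigma)
```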
### 3.2 Updating $\pi$ via Imitating $\eta$

Given $\eta_n$, we compute $\pi_{n+1}$ by the following constrained optimization procedure:
$$\min_{\pi \in \Pi}\, \mathbb{E}_{s \sim d^{\pi_n}}\, \mathbb{E}_{a \sim \pi(\cdot|s)}\big[A^{\eta_n}(s, a)\big], \quad \text{s.t.}\;\; \mathbb{E}_{s \sim d^{\pi_n}}\big[D_{TV}(\pi(\cdot|s), \pi_n(\cdot|s))\big] \le \beta. \tag{5}$$
Note that the key difference between Eq. 5 and the classic API policy improvement procedure is that we use $\eta_n$'s disadvantage function $A^{\eta_n}$, i.e., we are performing imitation learning by treating $\eta_n$ as an expert in this iteration [26, 27]. We can solve Eq. 5 by converting it to a supervised learning problem such as cost-sensitive classification [5], sampling states and actions from $\pi_n$ and estimating $A^{\eta_n}$ via rolling out $\eta_n$, subject to an $L_1$ constraint. Note that a CPI-like update approximately solves the above constrained problem as well:
$$\pi_{n+1} = (1-\beta)\, \pi_n + \beta\, \pi^\star, \quad \text{where } \pi^\star = \arg\min_{\pi \in \Pi}\, \mathbb{E}_{s \sim d^{\pi_n}}\, \mathbb{E}_{a \sim \pi(\cdot|s)}\big[A^{\eta_n}(s, a)\big]. \tag{6}$$
Note that $\pi_{n+1}$ satisfies the constraint, as $D_{TV}(\pi_{n+1}(\cdot|s), \pi_n(\cdot|s)) \le \beta$ for all $s$. Intuitively, the update in Eq. 6 can be understood as first solving the objective to obtain $\pi^\star$ without considering the constraint, and then moving $\pi_n$ towards $\pi^\star$ until the boundary of the constraint is reached.

### 3.3 DPI: Combining Updates on $\eta$ and $\pi$

In summary, assuming MBOC is used for Eq. 1, DPI operates iteratively: given $\pi_n$,

1. Fit the MLE $\hat{P}$ on states and actions from $d^{\pi_n}\pi_n$ (Eq. 2).
2. $\eta_n \leftarrow \text{MBOC}(\hat{P})$, subject to the trust region $\mathbb{E}_{s \sim d^{\pi_n}}[D_{TV}(\eta, \pi_n)] \le \alpha$ (Eq. 4).
3. Update $\pi$ to $\pi_{n+1}$ by imitating $\eta_n$, subject to the trust region $\mathbb{E}_{s \sim d^{\pi_n}}[D_{TV}(\pi, \pi_n)] \le \beta$ (Eq. 5).

The above framework shows how $\pi$ and $\eta$ are tied together to guide each other's improvement: the first step corresponds to classic MLE under $\pi_n$'s state-action distribution $d^{\pi_n}\pi_n$; the second step corresponds to model-based policy search around $\pi_n$ ($\hat{P}$ is only locally accurate); the third step corresponds to updating $\pi$ by imitating $\eta$ (i.e., imitation). Note that in practice the MBOC solver (e.g., a second-order optimization method, as we use in our practical algorithm below) can be computationally expensive and slow (e.g., tree search in ExIt and AlphaGo-Zero), but once $\hat{P}$ is provided, MBOC does not require additional samples from the real system.
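The following tabular sketch illustrates step 3 of the loop with the conservative update of Eq. 6: $\pi^\star$ is greedy with respect to the expert disadvantage $A^{\eta_n}$, and the mixture automatically satisfies the trust region of Eq. 5. The tabular policies and random disadvantage values are illustrative only.

```python
# A minimal, tabular illustration of the conservative update in Eq. 6 and of
# the fact that the mixture keeps D_TV(pi_{n+1}(.|s), pi_n(.|s)) <= beta.
import numpy as np

rng = np.random.default_rng(1)
nS, nA, beta = 6, 4, 0.1
pi_n = rng.dirichlet(np.ones(nA), size=nS)      # current reactive policy
A_eta = rng.normal(size=(nS, nA))               # A^{eta_n}(s, a) from the MBOC step

# pi_star = argmin_pi E_{s ~ d^{pi_n}} E_{a ~ pi(.|s)}[A^{eta_n}(s, a)]:
# in the tabular / realizable case this is the per-state greedy policy.
pi_star = np.zeros_like(pi_n)
pi_star[np.arange(nS), A_eta.argmin(axis=1)] = 1.0

pi_next = (1 - beta) * pi_n + beta * pi_star    # Eq. 6

tv = 0.5 * np.abs(pi_next - pi_n).sum(axis=1)   # D_TV(pi_{n+1}(.|s), pi_n(.|s))
assert np.all(tv <= beta + 1e-12)               # the trust region in Eq. 5 holds
```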
Connections to previous work. The above framework generalizes several previous works from API and IL. (a) In the limit $\alpha \to 0$, we recover CPI (assuming we optimize $\pi$ with Eq. 6), i.e., there is no attempt to search for a better policy using model-based optimization. (b) Mapping to ExIt, our $\eta_n$ plays the role of the tree-based policy, our $\pi_n$ plays the role of the apprentice policy, and MBOC plays the role of forward search. (c) When an optimal expert policy $\pi^*$ is available during, and only during, training, we can set every $\eta_n$ to be $\pi^*$, and DPI then recovers a previous IL algorithm, AGGREVATED [27].

## 4 Analysis of Policy Improvement

We provide a general convergence analysis for DPI. The trust region constraints in Eq. 1 and Eq. 5 tightly couple MBOC and policy improvement, and are the key to ensuring monotonic improvement and achieving larger per-iteration policy improvement than existing API methods. Define $\mathbb{A}_n(\pi_{n+1})$ as the disadvantage of $\pi_{n+1}$ against $\eta_n$ under $d^{\pi_n}$:
$$\mathbb{A}_n(\pi_{n+1}) = \mathbb{E}_{s \sim d^{\pi_n}}\, \mathbb{E}_{a \sim \pi_{n+1}(\cdot|s)}\big[A^{\eta_n}(s, a)\big].$$
Note that $\mathbb{A}_n(\pi_{n+1})$ is at least non-positive (if $\pi$ and $\eta$ come from the same function class, or $\pi$'s policy class is rich enough to include $\eta$), since we could simply set $\pi_{n+1}$ to $\eta_n$, in which case $\mathbb{A}_n(\pi_{n+1}) = 0$; hence we can hope that the IL procedure (Eq. 5) finds a policy $\pi_{n+1}$ that achieves $\mathbb{A}_n(\pi_{n+1}) < 0$ (i.e., local improvement over $\eta_n$). The question we want to answer is: by how much does the performance of $\pi_{n+1}$ improve over $\pi_n$ when we solve the two trust-region optimization procedures detailed in Eq. 1 and Eq. 5? Following Theorem 4.1 of [5], we define $\varepsilon = \max_s \big|\mathbb{E}_{a \sim \pi_{n+1}(\cdot|s)}[A^{\eta_n}(s, a)]\big|$, which measures the maximum possible one-step improvement over $\eta_n$. The following theorem states the performance improvement:

Theorem 4.1 Solve Eq. 1 to get $\eta_n$ and Eq. 5 to get $\pi_{n+1}$. The improvement of $\pi_{n+1}$ over $\pi_n$ is:
$$J(\pi_{n+1}) - J(\pi_n) \le \frac{\beta\, \varepsilon}{(1-\gamma)^2} - \frac{|\mathbb{A}_n(\pi_{n+1})|}{1-\gamma} - \Delta_n(\alpha). \tag{7}$$

The proof of Theorem 4.1 is provided in Appendix A.3. When $\beta$ is small, we are guaranteed to find a policy $\pi_{n+1}$ whose total cost decreases by roughly $\Delta_n(\alpha) + |\mathbb{A}_n(\pi_{n+1})|/(1-\gamma)$ compared to $\pi_n$. Note that classic CPI's per-iteration improvement [5, 12] only contains a term with the same meaning and magnitude as the second term on the RHS of Eq. 7. Hence DPI improves over CPI by introducing the extra term $\Delta_n(\alpha)$, and the improvement can be substantial when there exists a locally optimal policy $\eta_n$ that is much better than the current reactive policy $\pi_n$. This $\Delta_n(\alpha)$ comes from the explicit introduction of a model-based search into the training loop, which does not exist in classic API methods. From a practical point of view, modern MBOC solvers are usually second-order methods, while API methods are usually first-order (e.g., REINFORCE and CPI). Hence it is reasonable to expect $\Delta_n(\alpha)$ itself to be larger than API's per-iteration policy improvement. Connecting back to ExIt and AlphaGo-Zero in the model-based setting, $\Delta_n(\alpha)$ stands for the improvement of the tree-based policy over the current deep-network reactive policy. In ExIt and AlphaGo-Zero, the tree-based policy $\eta_n$ performs fixed-depth forward search followed by rolling out $\pi_n$ (i.e., bottoming up with $V^{\pi_n}(s)$), which ensures the expert $\eta_n$ outperforms $\pi_n$.

When $|\Delta_n(\alpha)|$ and $|\mathbb{A}_n(\pi_{n+1})|$ are both small, we can guarantee that $\pi_n$ and $\eta_n$ are good policies, under the stronger assumption that the initial distribution $\rho_0$ happens to be a good distribution (e.g., close to $d^{\pi^*}$), and the realizability assumption
$$\min_{\pi \in \Pi}\, \mathbb{E}_{s \sim d^{\pi_n}}\, \mathbb{E}_{a \sim \pi(\cdot|s)}\big[A^{\eta_n}(s, a)\big] = \mathbb{E}_{s \sim d^{\pi_n}}\big[\min_{a \in \mathcal{A}} A^{\eta_n}(s, a)\big]$$
holds. We show in Appendix A.4 that under the realizability assumption, the suboptimality $J(\pi_n) - J(\pi^*)$ is bounded by a term that scales with $\max_s \big(d^{\pi^*}(s)/\rho_0(s)\big)$ and with factors of order $1/(\beta(1-\gamma)^2)$ and $1/(\beta(1-\gamma))$ in the two gap quantities above. The term $\max_s \big(d^{\pi^*}(s)/\rho_0(s)\big)$ measures the distribution mismatch between the initial distribution $\rho_0$ and the optimal policy $\pi^*$, and appears in previous API algorithms such as CPI [5] and PSDP [4]. A $\rho_0$ that is closer to $d^{\pi^*}$ (e.g., letting experts reset the agent's initial position when possible) ensures a better global performance guarantee. CPI considers a setting where a good reset distribution $\nu$ (different from $\rho_0$) is available; DPI can leverage such a reset distribution by replacing $\rho_0$ with $\nu$ during training.

In summary, we can expect larger per-iteration policy improvement from DPI than from CPI (and from TRPO, which has per-iteration policy improvement similar to CPI's), thanks to the introduction of local model-based search. The final performance bound of the learned policy is on par with CPI and PSDP.

## 5 An Instance of DPI

In this section, we dive into the details of each update step of DPI and suggest one practical instance of DPI that can be used in continuous control settings.
We denote by $T$ the maximum possible horizon,¹ by $\mathcal{S} \subseteq \mathbb{R}^{d_s}$ the state space, and by $\mathcal{A} \subseteq \mathbb{R}^{d_a}$ the action space. We work with parameterized policies: we parameterize the policy $\pi$ as $\pi(\cdot|s; \theta)$ for any $s \in \mathcal{S}$ (e.g., a neural network with parameters $\theta$), and parameterize $\eta$ by a sequence of time-varying linear-Gaussian policies $\eta = \{\eta_t\}_{1 \le t \le T}$, where $\eta_t(\cdot|s) = \mathcal{N}(K_t s + k_t, P_t)$ with control gain $K_t \in \mathbb{R}^{d_a \times d_s}$, bias term $k_t \in \mathbb{R}^{d_a}$, and covariance $P_t \in \mathbb{R}^{d_a \times d_a}$. We use $\{K_t, k_t, P_t\}_{1 \le t \le T}$ to represent the collection of the parameters of all the linear-Gaussian policies across the entire horizon. One approximation we make here is to replace the policy divergence measure $D_{TV}(\pi_n, \eta)$ (note that total variation distance is symmetric) with the KL divergence $D_{KL}(\eta, \pi_n)$, which allows us to leverage Natural Gradient [11, 10].² To summarize, $\pi_n$ and $\eta_n$ are short for $\pi(\cdot|\cdot; \theta_n)$ and $\eta_n = \{\mathcal{N}(K_t s + k_t, P_t)\}_t$, respectively. Below we first describe how to compute $\eta_n$ given $\pi_n$ (Sec. 5.1), and then describe how to update $\pi$ via imitating $\eta_n$ using Natural Gradient (Sec. 5.2).

### 5.1 Updating $\eta$ with MBOC using Learned Time-Varying Linear Models

Algorithm 1 AGGREVATED-GPS
1: Input: parameters $\alpha \in \mathbb{R}^+$, $\beta \in \mathbb{R}^+$.
2: Initialize $\pi_0$.
3: for $n = 0$ to ... do
4:   Execute $\pi_n$ to generate a set of trajectories.
5:   Fit local linear dynamics $\hat{P}$ (Eq. 8) using the $\{s_t, a_t, s_{t+1}\}$ triples collected in step 4.
6:   Solve the min-max in Eq. 9 subject to $\hat{P}$ to obtain $\eta_n$ and form the disadvantage $A^{\eta_n}$.
7:   Compute $\pi_{n+1}$ by NGD (Eq. 12).
8: end for

We explain here how to find $\eta_n$ given $\pi_n$ using MBOC. In our implementation, we use Linear Quadratic Gaussian (LQG) optimal control [28] as the black-box optimal control solver. We learn a sequence of time-varying linear-Gaussian transition models to represent $\hat{P}$:
$$s_{t+1} \sim \mathcal{N}(A_t s_t + B_t a_t + c_t, \Sigma_t), \quad \forall t \in [1, T], \tag{8}$$
where $A_t, B_t, c_t, \Sigma_t$ can be learned using classic linear regression techniques on a dataset $\{s_t, a_t, s_{t+1}\}$ collected by executing $\pi_n$ on the real system. Although the dynamics $P(s, a)$ may be complicated over the entire space, linear dynamics can approximate the dynamics well locally (after all, our theorem only requires $\hat{P}$ to have low predictive error under $d^{\pi_n}\pi_n$). Next, to find a locally optimal policy $\eta$ under linear-Gaussian transitions (i.e., Eq. 4), we add the KL constraint to the objective with Lagrange multiplier $\mu$ and form an equivalent min-max problem:
$$\min_\eta \max_{\mu \in \mathbb{R}^+}\; \mathbb{E}\Big[\sum_{t=1}^{T} \gamma^{t-1} c(s_t, a_t)\Big] + \mu\Big(\sum_{t=1}^{T} \gamma^{t-1}\, \mathbb{E}_{s \sim d_t^\eta}\big[D_{KL}(\eta(\cdot|s), \pi_n(\cdot|s))\big] - \alpha\Big), \tag{9}$$
where $\mu$ is the Lagrange multiplier; this can be solved by alternately updating $\eta$ and $\mu$ [19]. For a fixed $\mu$, using the derivation from [19] and ignoring terms that do not depend on $\eta$, Eq. 9 can be written as:
$$\min_\eta\; \mathbb{E}\Big[\sum_{t=1}^{T} \gamma^{t-1}\big(c(s_t, a_t)/\mu - \log \pi_n(a_t|s_t)\big)\Big] + \sum_{t=1}^{T} \gamma^{t-1}\, \mathbb{E}_{s \sim d_t^\eta}\big[\bar{H}(\eta(\cdot|s))\big], \tag{10}$$
where $\bar{H}(\eta(\cdot|s)) = \sum_a \eta(a|s) \ln \eta(a|s)$ is the negative entropy. Hence the above formulation can be understood as using a new cost function $c'(s_t, a_t) = c(s_t, a_t)/\mu - \log \pi_n(a_t|s_t)$, together with an entropy regularization on $\eta$. It is well known in the optimal control literature that when $c'$ is quadratic and the dynamics are linear, the optimal sequence of linear-Gaussian policies for the objective in Eq. 10 can be found exactly by a Dynamic Programming (DP) based approach, the Linear Quadratic Regulator (LQR) [28]. Given a dataset $\{(s_t, a_t), c'(s_t, a_t)\}$ collected while executing $\pi_n$, we can fit a quadratic approximation of $c'(s, a)$ [29, 19]. With a quadratic approximation of $c'$ and linear dynamics, we solve Eq. 10 for $\eta$ exactly by LQR [29]. Once we obtain $\eta$, we go back to Eq. 9 and update the Lagrange multiplier $\mu$, for example by projected gradient ascent [30]. Upon convergence, LQR gives us a sequence of time-dependent linear-Gaussian policies $\eta$ together with a sequence of analytic quadratic cost-to-go functions $Q_t(s, a)$ and quadratic disadvantage functions $A_t^{\eta_n}(s, a)$, for all $t \in [T]$.

¹ Note that $T$ is the maximum possible horizon, which could be long. Hence, we still want to output a single policy $\pi$, especially when the policy is parameterized by complicated non-linear function approximators like deep nets.

² A small $D_{KL}$ implies a small $D_{TV}$: by Pinsker's inequality, both $D_{KL}(q, p)$ and $D_{KL}(p, q)$ are at least $2\, D_{TV}(p, q)^2$.
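For concreteness, the sketch below shows a standard finite-horizon LQR backward pass of the kind that serves as the inner solver once the dynamics (Eq. 8) and a quadratic approximation of $c'$ are in hand. It is undiscounted and omits the KL/entropy terms of Eq. 10 for brevity, and the array layout is an illustrative assumption rather than the paper's implementation; the maximum-entropy variant used with Eq. 10 follows the same recursion, with the Gaussian policy covariance given by the inverse of the action-action block of $Q_t$.

```python
# A sketch of a finite-horizon LQR backward pass: given time-varying linear
# dynamics s' = A_t s + B_t a + c_t and quadratic stage costs
# 0.5 [s; a]^T C_t [s; a] + q_t^T [s; a], it returns time-varying affine
# controllers a_t = K_t s_t + k_t.
import numpy as np

def lqr_backward(A, B, c, C, q, T, ds):
    Ks, ks = [None] * T, [None] * T
    V = np.zeros((ds, ds))                          # quadratic cost-to-go term
    v = np.zeros(ds)                                # linear cost-to-go term
    for t in reversed(range(T)):
        F = np.hstack([A[t], B[t]])                 # [A_t  B_t]
        Q = C[t] + F.T @ V @ F                      # quadratic term of Q_t(s, a)
        qv = q[t] + F.T @ (V @ c[t] + v)            # linear term of Q_t(s, a)
        Quu, Qus, qu = Q[ds:, ds:], Q[ds:, :ds], qv[ds:]
        Ks[t] = -np.linalg.solve(Quu, Qus)          # K_t
        ks[t] = -np.linalg.solve(Quu, qu)           # k_t
        # plug a = K s + k back in to get the cost-to-go at time t
        V = Q[:ds, :ds] + Qus.T @ Ks[t]
        v = qv[:ds] + Qus.T @ ks[t]
        V = 0.5 * (V + V.T)                         # keep V symmetric
    return Ks, ks
```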
### 5.2 Updating $\pi$ via Imitating $\eta$ using Natural Gradient

Performing a second-order Taylor expansion of the KL constraint $\mathbb{E}_{s \sim d^{\pi_n}}[D_{KL}(\pi_n(\cdot|s), \pi(\cdot|s; \theta))]$ around $\theta_n$ [11, 10], we get the following constrained optimization problem:
$$\min_\theta\, \mathbb{E}_{s \sim d^{\pi_n}}\big[\mathbb{E}_{a \sim \pi(\cdot|s; \theta)}[A^{\eta_n}(s, a)]\big], \quad \text{s.t.}\;\; (\theta - \theta_n)^T F_{\theta_n} (\theta - \theta_n) \le \beta, \tag{11}$$
where $F_{\theta_n}$ is the Hessian of the KL constraint $\mathbb{E}_{s \sim d^{\pi_n}}[D_{KL}(\pi_n, \pi_\theta)]$ (i.e., the Fisher information matrix), measured at $\theta_n$. Denoting the objective (i.e., the first term in Eq. 11) by $L_n(\theta)$ and writing $\nabla_n = \nabla_\theta L_n(\theta)|_{\theta = \theta_n}$, we can optimize $\theta$ by performing natural gradient descent (NGD):
$$\theta_{n+1} = \theta_n - \mu\, F_{\theta_n}^{-1} \nabla_n, \quad \text{where } \mu = \sqrt{\beta / \big(\nabla_n^T F_{\theta_n}^{-1} \nabla_n\big)}. \tag{12}$$
The specific $\mu$ above ensures the KL constraint is satisfied. More details about the imitation update on $\pi$ can be found in Appendix B.3.

Summary. If we regard $\eta$ as an expert, NGD here is similar to natural gradient AGGREVATED, a differentiable IL approach [27]. We summarize the procedures presented in Sec. 5.1 and 5.2 in Alg. 1, which we name AGGREVATED-GPS, reflecting the fact that we use MBOC to Guide Policy Search [19, 21] via an AGGREVATED-type update. At every iteration, we run $\pi_n$ on $P$ to gather samples. We estimate time-dependent local linear dynamics $\hat{P}$ and then leverage an OC solver (e.g., LQR) to solve the Lagrangian in Eq. 9 to compute $\eta_n$ and $A^{\eta_n}$. We then perform NGD to update $\pi$ to $\pi_{n+1}$.
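The natural gradient step in Eq. 12 can be sketched as follows, using conjugate gradient so that only Fisher-vector products are needed. Here `fisher_vector_product` (e.g., a Hessian-vector product of the sampled KL) and the flattened parameter vector are assumed interfaces, not code from the paper.

```python
# A sketch of the natural gradient step in Eq. 12. The step size
# mu = sqrt(beta / (grad^T F^{-1} grad)) places the update on the boundary of
# the trust region (theta - theta_n)^T F (theta - theta_n) <= beta.
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v) = F v."""
    x, r, p = np.zeros_like(g), g.copy(), g.copy()
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        step = rs / (p @ Fp)
        x += step * p
        r -= step * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def natural_gradient_step(theta_n, grad, fvp, beta):
    nat_grad = conjugate_gradient(fvp, grad)            # F^{-1} grad
    mu = np.sqrt(beta / max(grad @ nat_grad, 1e-12))    # step size from the trust region
    return theta_n - mu * nat_grad                      # Eq. 12 (descent on cost)
```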
### 5.3 Additional Related Works

The most closely related work with respect to Alg. 1 is Guided Policy Search (GPS) for unknown dynamics [19] and its variants (e.g., [20, 21, 22]). GPS (including its variants) demonstrates that model-based optimal control approaches can be used to speed up the training of policies parameterized by rich non-linear function approximators (e.g., deep networks) in large-scale applications. While Alg. 1 at a high level follows GPS's iterative procedure of alternating reactive policy improvement and MBOC, the main difference between Alg. 1 and the GPS family is the update procedure for the reactive policy. Classic GPS, including the mirror descent version, phrases the update of the reactive policy as behavior cloning, i.e., given an expert policy $\eta$, one performs $\min_\theta D_{KL}(d^\eta \eta \,\|\, d^{\pi_\theta} \pi_\theta)$.³ Note that our approach to updating $\pi$ is fundamentally on-policy, i.e., we generate samples from $\pi$. Moreover, we update $\pi$ by performing policy iteration against $\eta$, i.e., $\pi$ approximately acts greedily with respect to $A^\eta$, which results in a key difference: if we limit the power of MBOC, i.e., set the trust region size $\alpha$ in the MBOC step to zero in both DPI and GPS, then our approach reduces to CPI and thus improves to local optimality. GPS and its variants, by contrast, have no ability to improve the reactive policy in that setting.

³ See Line 3 of Alg. 2 in [21], where in principle behavior cloning on $\pi$ uses samples from the expert $\eta$ (i.e., off-policy samples). We note, however, that in actual implementations some variants of GPS tend to swap the order of $\eta$ and $\pi$ inside the KL, often resulting in an on-policy sampling strategy (e.g., [22]). We also note that the Mirror Descent interpretation and analysis used to explain GPS's convergence [21] implies that the correct way to perform the projection is to minimize the reverse KL, i.e., $\arg\min_{\pi \in \Pi} D_{KL}(d^\pi \pi \,\|\, d^\eta \eta)$. This in turn matches the DPI intuition: one should attempt to find a policy $\pi$ that is similar to $\eta$ under the state distribution of $\pi$ itself.

## 6 Experiments

We tested our approach on several MDPs: (1) a set of random discrete MDPs (Garnet problems [7]), (2) Cartpole balancing [31], (3) Helicopter Aerobatics (Hover and Funnel) [32], and (4) Swimmer, Hopper, and Half-Cheetah from the MuJoCo physics simulator [33]. The goals of these experiments are: (a) to experimentally verify that using $A^\eta$ from the intermediate expert computed by model-based search for policy improvement is more sample-efficient than using $A^\pi$; and (b) to show that our approach can be applied to robust policy search and can outperform existing approaches [25].

[Figure 1: Performance (mean and standard error of cumulative cost, in log2 scale on the y-axis) versus number of episodes ($n$ on the x-axis). Panels: (a) Discrete MDP, (b) Cart-Pole, (c) Helicopter Hover, (d) Helicopter Funnel, (e) Swimmer, (g) Half-Cheetah.]

### 6.1 Comparison to CPI on Discrete MDPs

Following [7], we randomly create ten discrete MDPs with 1000 states and 5 actions. Unlike the techniques we introduced in Sec. 5.2 for continuous settings, here we use the conservative update shown in Eq. 6 to update the reactive policy, where each $\pi_n$ is a linear classifier trained using regression-based cost-sensitive classification on samples from $d^{\pi_n}$ [5]. The feature vector $\phi(s)$ for each state is a binary encoding of the state ($\phi(s) \in \mathbb{R}^{\log_2(|\mathcal{S}|)}$). We maintain the estimated transition model $\hat{P}$ in a tabular representation. The policy $\eta$ is also in a tabular representation (hence the expert and the reactive policy use different feature representations) and is computed using exact value iteration (VI) under $\hat{P}$ and $c'(s, a)$ (hence we name our approach here AGGREVATED-VI). The setup and the conservative update implementation are detailed in Appendix B.1. Fig. 1a reports the statistical performance of our approach and CPI over the 10 random discrete MDPs. Our approach is more sample-efficient than CPI, although we observed that it is slower than CPI per iteration because we run VI under the learned model. We tune $\beta$, and neither CPI nor our approach uses line search on $\beta$. The major difference between AGGREVATED-VI and CPI here is that we use $A^\eta$ instead of $A^\pi$ to update $\pi$.
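For the discrete experiments, the expert-computation step can be sketched as exact value iteration under the learned tabular model with the reshaped cost; the array layout below is an illustrative assumption, and the returned disadvantage is the model-based estimate used in place of rollouts of $\eta_n$.

```python
# A minimal sketch of computing the expert in the discrete setting: discounted
# value iteration under the learned tabular model P_hat with the reshaped cost
# c_prime, returning a greedy expert eta_n and its disadvantage A^{eta_n}.
import numpy as np

def value_iteration(P_hat, c_prime, gamma, iters=2000, tol=1e-8):
    """P_hat: (S, A, S) transition table; c_prime: (S, A) cost table."""
    nS, nA, _ = P_hat.shape
    V = np.zeros(nS)
    for _ in range(iters):
        Q = c_prime + gamma * P_hat @ V         # Q(s, a) = c'(s, a) + gamma E_{s'~P_hat}[V(s')]
        V_new = Q.min(axis=1)                   # cost setting: minimize over actions
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    Q = c_prime + gamma * P_hat @ V
    eta = np.zeros((nS, nA))
    eta[np.arange(nS), Q.argmin(axis=1)] = 1.0  # greedy expert eta_n under P_hat
    A_eta = Q - V[:, None]                      # model-based disadvantage A^{eta_n}(s, a)
    return eta, A_eta
```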
### 6.2 Comparison to Actor-Critic in Continuous Settings

We compare against TRPO-GAE [23] on a set of continuous control tasks. The setup is detailed in Appendix B.4. TRPO-GAE is an actor-critic-like approach where both actor and critic are updated using trust-region optimization. We use a two-layer neural network to represent $\pi$, which is updated by natural gradient descent. We use LQR as the underlying MBOC solver and name our approach AGGREVATED-ILQR. Fig. 1 (b-g) shows the comparison between our method and TRPO-GAE over a set of continuous control tasks (confidence intervals are computed from 20 random trials). As we can see, our method is significantly more sample-efficient than TRPO-GAE, albeit slower per iteration as we perform MBOC. The major difference between our approach and TRPO-GAE is that we use $A^\eta$ while TRPO-GAE uses $A^\pi$ for the policy update. Note that both $A^\eta$ and $A^\pi$ are computed using rollouts from $\pi$: the difference is that our approach uses the rollouts to learn local dynamics and then analytically estimates $A^\eta$ via MBOC, while TRPO-GAE estimates $A^\pi$ directly from the rollouts. Overall, our approach converges faster than TRPO-GAE (i.e., uses fewer samples), which again indicates the benefit of using $A^\eta$ in policy improvement.

### 6.3 Application to Robust Policy Optimization

One application of our approach is robust policy optimization [34], where we have multiple training environments that are all potentially different from, but similar to, the testing environments. The goal is to train a single reactive policy using the training environments and deploy the policy on a test environment without any further training. Previous work suggests that a policy that optimizes all the training models simultaneously is stable and robust at test time [24, 25], as the training environments together act as "regularization" to avoid overfitting and provide generalization.

More formally, assume we have $M$ training environments. At iteration $n$ with $\pi_n$, we execute $\pi_n$ on the $i$-th environment, generate samples, fit local models, and call the MBOC associated with the $i$-th environment to compute $\eta_n^i$, for all $i \in [M]$. With $A^{\eta_n^i}$ for all $i \in [M]$, we treat all training environments equally and form the objective $L_n(\theta) = \sum_{i=1}^{M} \mathbb{E}_{s \sim d^{\pi_n}}\big[\mathbb{E}_{a \sim \pi(\cdot|s; \theta)}[A^{\eta_n^i}(s, a)]\big]$. We update $\theta_n$ to $\theta_{n+1}$ by NGD on $L_n(\theta)$. Intuitively, we update $\pi$ by imitating all the $\eta_n^i$ simultaneously.

[Figure 2: Performance (mean, in log scale on the y-axis) versus episodes ($n$ on the x-axis) in robust control. Panels: (a) Cart-Pole, (b) Helicopter Funnel.]

We consider two simulation tasks, cartpole balancing and helicopter funnel. For each task, we create ten environments by varying the physical parameters (e.g., the mass of the helicopter, the mass and length of the pole). We use seven of the environments for training and the remaining three for testing. We compare our algorithm against TRPO, which can be regarded as a model-free, natural gradient version of the first-order algorithm proposed in [25]. We also ran our algorithm on a single, randomly picked training environment, while still testing on the test environments; this variant is denoted "non-robust" in Fig. 2. Fig. 2 summarizes the comparison between our approach and the baselines. Similar to the trend we saw in the previous section, our approach is more sample-efficient in the robust policy optimization setup as well. It is interesting to see that the "non-robust" approach fails to converge further, which illustrates the overfitting phenomenon: the learned policy overfits to one particular training environment.
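A schematic of the robust update: the per-environment imitation gradients are summed and a single natural gradient step (as sketched in Sec. 5.2 above) updates the shared reactive policy. Here `grad_imitation` and `fisher_vector_product` are hypothetical helpers standing in for the quantities described in the text.

```python
# A schematic of the robust-training update in Sec. 6.3: the objective
# L_n(theta) is a sum over the M training environments, each with its own MBOC
# expert eta_n^i, so its gradient is the sum of per-environment imitation
# gradients. natural_gradient_step is the routine sketched in Sec. 5.2.

def robust_update(theta_n, per_env_disadvantages, grad_imitation,
                  fisher_vector_product, beta):
    """per_env_disadvantages: [A^{eta_n^i} for each training environment i]."""
    grad = sum(grad_imitation(theta_n, A_i) for A_i in per_env_disadvantages)
    fvp = lambda v: fisher_vector_product(theta_n, v)
    return natural_gradient_step(theta_n, grad, fvp, beta)
```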
## 7 Conclusion

We present and analyze Dual Policy Iteration, a framework that alternately computes a non-reactive policy via more advanced and systematic search, and updates a reactive policy via imitating the non-reactive one. Recent algorithms that have been successful in practice, like AlphaGo-Zero and ExIt, are subsumed by the DPI framework. We then provide a simple instance of DPI for RL with unknown dynamics, which integrates local model fitting, local model-based search, and reactive policy improvement via imitating the teacher, i.e., the nearly locally optimal policy resulting from model-based search. We show theoretically that integrating model-based search and imitation into policy improvement can result in larger policy improvement at each step. We also experimentally demonstrate improved sample efficiency compared to strong baselines.

Our work also opens some new problems. In theory, the performance improvement during one call of optimal control with the locally accurate model depends on a term that scales quadratically with the effective horizon $1/(1-\gamma)$. We believe the dependence on the horizon can be brought down by leveraging system identification methods that focus on multi-step prediction [35, 36]. On the practical side, our specific implementation has some limitations due to the choice of LQG as the underlying OC algorithm. LQG-based methods usually require the dynamics and cost functions to be somewhat smooth so that they can be locally approximated by polynomials. We also found that LQG planning horizons must be relatively short, as the approximation error from the polynomials will likely compound over the horizon. We plan to explore the possibility of learning non-linear dynamics and using more advanced non-linear optimal control techniques such as Model Predictive Control (MPC) for more sophisticated control tasks.

Acknowledgements. We thank Sergey Levine for valuable discussion. WS is supported in part by Office of Naval Research contract N000141512365.

## References

[1] Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. arXiv preprint arXiv:1705.08439, 2017.
[2] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
[3] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-dynamic programming: an overview. In Decision and Control, 1995, Proceedings of the 34th IEEE Conference on, volume 1, pages 560-564. IEEE, 1995.
[4] J. Andrew Bagnell, Sham M. Kakade, Jeff G. Schneider, and Andrew Y. Ng. Policy search by dynamic programming. In Advances in Neural Information Processing Systems, pages 831-838, 2004.
[5] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, 2002.
[6] Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Analysis of a classification-based policy iteration algorithm. In ICML - 27th International Conference on Machine Learning, pages 607-614. Omnipress, 2010.
[7] Bruno Scherrer. Approximate policy iteration schemes: a comparison. In International Conference on Machine Learning, pages 1314-1322, 2014.
[8] Gavin A. Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering, 1994.
[9] Jonathan Baxter and Peter L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319-350, 2001.
[10] J. Andrew Bagnell and Jeff Schneider. Covariant policy search. IJCAI, 2003.
[11] Sham Kakade. A natural policy gradient. NIPS, 2002.
[12] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, pages 1889-1897, 2015.
[13] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
[14] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
[15] Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In European Conference on Machine Learning, pages 282-293. Springer, 2006.
[16] David Silver et al. Mastering the game of go with deep neural networks and tree search. Nature.
[17] Christopher G. Atkeson. Using local trajectory optimizers to speed up global optimization in dynamic programming. In Advances in Neural Information Processing Systems, pages 663-670, 1994.
[18] Christopher G. Atkeson and Jun Morimoto. Nonparametric representation of policies and value functions: A trajectory-based approach. In Advances in Neural Information Processing Systems, pages 1643-1650, 2003.
[19] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pages 1071-1079, 2014.
[20] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334-1373, 2016.
[21] William H. Montgomery and Sergey Levine. Guided policy search via approximate mirror descent. In Advances in Neural Information Processing Systems, pages 4008-4016, 2016.
[22] William Montgomery, Anurag Ajay, Chelsea Finn, Pieter Abbeel, and Sergey Levine. Reset-free guided policy search: efficient deep reinforcement learning with stochastic initial states. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 3373-3380. IEEE, 2017.
[23] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[24] J. Andrew Bagnell and Jeff G. Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In Robotics and Automation, 2001, Proceedings 2001 ICRA, IEEE International Conference on, volume 2, pages 1615-1620. IEEE, 2001.
[25] Christopher G. Atkeson. Efficient robust policy optimization. In American Control Conference (ACC), 2012, pages 5220-5227. IEEE, 2012.
[26] Stephane Ross and J. Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.
[27] Wen Sun, Arun Venkatraman, Geoffrey J. Gordon, Byron Boots, and J. Andrew Bagnell. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. ICML, 2017.
[28] Huibert Kwakernaak and Raphael Sivan. Linear optimal control systems, volume 1. Wiley-Interscience, New York, 1972.
[29] Brian D. Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.
[30] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.
[31] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.
[32] Pieter Abbeel, Varun Ganapathi, and Andrew Y. Ng. Learning vehicular dynamics, with application to modeling helicopters. In NIPS, pages 1-8, 2005.
[33] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026-5033. IEEE, 2012.
[34] Kemin Zhou, John Comstock Doyle, Keith Glover, et al. Robust and Optimal Control, volume 40. Prentice Hall, New Jersey, 1996.
[35] Arun Venkatraman, Martial Hebert, and J. Andrew Bagnell. Improving multi-step prediction of learned time series models. AAAI, 2015.
[36] Wen Sun, Arun Venkatraman, Byron Boots, and J. Andrew Bagnell. Learning to filter with predictive state inference machines. In ICML, 2016.
[37] Stephane Ross and Drew Bagnell. Agnostic system identification for model-based reinforcement learning. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1703-1710, 2012.
[38] Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.
[39] Alex Gorodetsky, Sertac Karaman, and Youssef Marzouk. Efficient high-dimensional stochastic optimal motion control using tensor-train decomposition. In Proceedings of Robotics: Science and Systems, Rome, Italy, July 2015.
[40] C. Finn, M. Zhang, J. Fu, X. Tan, Z. McCarthy, E. Scharff, and S. Levine. Guided policy search code implementation, 2016. Software available from rll.berkeley.edu/gps.