# Sample Efficient Path Integral Control under Uncertainty

Yunpeng Pan, Evangelos A. Theodorou, and Michail Kontitsis
Autonomous Control and Decision Systems Laboratory, Institute for Robotics and Intelligent Machines, School of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA 30332
{ypan37,evangelos.theodorou,kontitsis}@gatech.edu

We present a data-driven optimal control framework that is derived using the path integral (PI) control approach. We find iterative control laws analytically, without a priori policy parameterization, based on a probabilistic representation of the learned dynamics model. The proposed algorithm operates in a forward-backward manner, which differentiates it from other PI-related methods that perform forward sampling to find optimal controls. Our method uses significantly fewer samples to find analytic control laws than other approaches within the PI control family, which rely on extensive sampling from given dynamics models or on trials on physical systems in a model-free fashion. In addition, the learned controllers can be generalized to new tasks without re-sampling, based on the compositionality theory for the linearly-solvable optimal control framework. We provide experimental results on three different tasks and comparisons with state-of-the-art model-based methods to demonstrate the efficiency and generalizability of the proposed framework.

## 1 Introduction

Stochastic optimal control (SOC) is a general and powerful framework with applications in many areas of science and engineering. Despite its broad applicability, however, solving SOC problems remains challenging for systems in high-dimensional continuous state-action spaces. Various function approximation approaches to optimal control are available [1, 2], but they are usually sensitive to model uncertainty. Over the last decade, SOC based on an exponential transformation of the value function has demonstrated remarkable applicability in solving real-world control and planning problems. In control theory the exponential transformation of the value function was introduced in [3, 4]. In the recent decade it has been explored in terms of path integral interpretations and theoretical generalizations [5, 6, 7, 8], discrete-time formulations [9], and scalable RL/control algorithms [10, 11, 12, 13, 14]. The resulting stochastic optimal control frameworks are known as Path Integral (PI) control for continuous time, Kullback-Leibler (KL) control for discrete time, or more generally Linearly Solvable Optimal Control [9, 15].

One of the most attractive characteristics of PI control is that optimal control problems can be solved by forward sampling of Stochastic Differential Equations (SDEs). While sampling SDEs is more scalable than numerically solving partial differential equations, it still suffers from the curse of dimensionality when performed in a naive fashion. One way to circumvent this problem is to parameterize policies [10, 11, 14] and then perform optimization with sampling. However, in this case one has to impose the structure of the policy a priori, therefore restricting the possible optimal control solutions to the assumed parameterization. In addition, the optimized policy parameters cannot be generalized to new tasks. In general, model-free PI policy search approaches require a large number of samples from trials performed on real physical systems.
The issue of sample inefficiency further restricts the applicability of PI control methods to physical systems with unknown or partially known dynamics. Motivated by the aforementioned limitations, in this paper we introduce a sample efficient, model-based approach to PI control. Different from existing PI control approaches, our method combines the benefits of PI control theory [5, 6, 7] and probabilistic model-based reinforcement learning methodologies [16, 17]. The main characteristics of our approach are summarized as follows:

- It extends the PI control theory [5, 6, 7] to the case of uncertain systems. The structural constraint is enforced between the control cost and the uncertainty of the learned dynamics, which can be viewed as a generalization of previous work [5, 6, 7].
- Different from parameterized PI controllers [10, 11, 14, 8], we find an analytic control law without any policy parameterization.
- Rather than keeping a fixed control cost weight [5, 6, 7, 10, 18], or ignoring the constraint between control authority and noise level [11], in this work the control cost weight is adapted based on the explicit uncertainty of the learned dynamics model.
- The algorithm operates in a different manner from existing PI-related methods that perform forward sampling [5, 6, 7, 10, 18, 11, 12, 14, 8]. More precisely, our method performs successive deterministic approximate inference and backward computation of the optimal control law.
- The proposed model-based approach is significantly more sample efficient than sampling-based PI control [5, 6, 7, 18]. In the RL setting our method is comparable to state-of-the-art RL methods [17, 19] in terms of sample and computational efficiency.
- Thanks to the linearity of the backward Chapman-Kolmogorov PDE, the learned controllers can be generalized to new tasks without re-sampling by constructing composite controllers. In contrast, most policy search and trajectory optimization methods [10, 11, 14, 17, 19, 20, 21, 22] find policy parameters that cannot be generalized.

## 2 Iterative Path Integral Control for a Class of Uncertain Systems

### 2.1 Problem formulation

We consider a nonlinear stochastic system described by the following differential equation

$$dx = \big(f(x) + G(x)u\big)dt + B\,d\omega, \qquad (1)$$

with state $x \in \mathbb{R}^n$, control $u \in \mathbb{R}^m$, and standard Brownian motion noise $\omega \in \mathbb{R}^p$ with variance $\Sigma_\omega$. $f(x)$ is the unknown drift term (passive dynamics), $G(x) \in \mathbb{R}^{n\times m}$ is the control matrix and $B \in \mathbb{R}^{n\times p}$ is the diffusion matrix. Given some previous control $u^{old}$, we seek the optimal control correction term $\delta u$ such that the total control is $u = u^{old} + \delta u$. The original system becomes

$$dx = \big(f(x) + G(x)(u^{old} + \delta u)\big)dt + B\,d\omega = \underbrace{\big(f(x) + G(x)u^{old}\big)}_{\tilde f(x,\,u^{old})}dt + G(x)\delta u\,dt + B\,d\omega.$$

In this work we assume that the dynamics under the previous control can be represented by a Gaussian process (GP) such that

$$\tilde f^{GP}(x) = \tilde f(x, u^{old})dt + B\,d\omega, \qquad (2)$$

where $\tilde f^{GP}$ is the GP representation of the biased drift term $\tilde f$ under the previous control. Now the original dynamical system (1) can be represented as follows

$$dx = \tilde f^{GP} + G\,\delta u\,dt, \qquad \tilde f^{GP} \sim \mathcal{GP}(\mu_f, \Sigma_f), \qquad (3)$$

where $\mu_f, \Sigma_f$ are the predictive mean and covariance functions, respectively. For the GP model we use a prior of zero mean and covariance function $K(x_i, x_j) = \sigma_s^2\exp\big(-\tfrac{1}{2}(x_i - x_j)^T W (x_i - x_j)\big) + \delta_{ij}\sigma_\omega^2$, with $\sigma_s, \sigma_\omega, W$ the hyper-parameters. $\delta_{ij}$ is the Kronecker symbol that is one iff $i = j$ and zero otherwise. Samples of $\tilde f^{GP}$ can be drawn using a vector of i.i.d. Gaussian variables $\Omega$:

$$\tilde f^{GP} = \mu_f + L_f\Omega, \qquad (4)$$

where $L_f$ is obtained by Cholesky factorization such that $\Sigma_f = L_f L_f^T$. Note that in general $\Omega$ is an infinite-dimensional vector and we can use the same sample to represent uncertainty during learning [23]. Without loss of generality we assume $\Omega$ to be standard zero-mean Brownian motion.
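To make the GP model concrete, the following sketch shows how a predictive distribution over the state transition and a sample $\tilde f^{GP} = \mu_f + L_f\Omega$ could be computed with the kernel defined above. It is a minimal, hypothetical implementation: the function names, the shared hyper-parameters across output dimensions, the single query state, and the diagonal $\Sigma_f$ are simplifications for illustration, not the exact model used in the paper.

```python
import numpy as np

def kernel(Xa, Xb, sigma_s, W):
    # K(x_i, x_j) = sigma_s^2 * exp(-0.5 (x_i - x_j)^T W (x_i - x_j))
    d = Xa[:, None, :] - Xb[None, :, :]
    return sigma_s**2 * np.exp(-0.5 * np.einsum('abi,ij,abj->ab', d, W, d))

def gp_predict(X, dX, x_query, sigma_s, sigma_w, W):
    """Predictive mean/covariance of the state transition at a single query state.

    X: (N, n) training states, dX: (N, n) observed transitions. One GP per output
    dimension, sharing hyper-parameters here for brevity; outputs are modeled as
    independent, so Sigma_f is diagonal.
    """
    K = kernel(X, X, sigma_s, W) + sigma_w**2 * np.eye(len(X))
    k_star = kernel(x_query[None, :], X, sigma_s, W)        # (1, N)
    mu_f = (k_star @ np.linalg.solve(K, dX)).ravel()        # predictive mean, (n,)
    var = float(sigma_s**2 - k_star @ np.linalg.solve(K, k_star.T))
    Sigma_f = max(var, 1e-12) * np.eye(len(mu_f))           # diagonal predictive covariance
    return mu_f, Sigma_f

def sample_fGP(mu_f, Sigma_f, rng=np.random.default_rng(0)):
    """Draw a sample f_GP = mu_f + L_f * Omega with Sigma_f = L_f L_f^T, cf. (4)."""
    L_f = np.linalg.cholesky(Sigma_f + 1e-9 * np.eye(len(mu_f)))  # jitter for stability
    return mu_f + L_f @ rng.standard_normal(len(mu_f))
```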
For the rest of the paper we use simplified notation with subscripts indicating the time step. The discrete-time representation of the system is

$$x_{t+dt} = x_t + \mu_{f_t} + G_t\,\delta u_t\,dt + L_{f_t}\Omega_t\sqrt{dt},$$

and the conditional probability of $x_{t+dt}$ given $x_t$ and $\delta u_t$ is a Gaussian $p(x_{t+dt}|x_t, \delta u_t) = \mathcal{N}(\mu_{t+dt}, \Sigma_{t+dt})$, where $\mu_{t+dt} = x_t + \mu_{f_t} + G_t\,\delta u_t\,dt$ and $\Sigma_{t+dt} = \Sigma_{f_t}$. In this paper we consider a finite-horizon stochastic optimal control problem

$$J(x_0) = \mathbb{E}\Big[q(x_T) + \int_{0}^{T} L(x_t, \delta u_t)\,dt\Big],$$

where the immediate cost is defined as $L(x_t, \delta u_t) = q(x_t) + \tfrac{1}{2}\delta u_t^T R_t\,\delta u_t$, and $q(x_t) = (x_t - x_t^d)^T Q (x_t - x_t^d)$ is a quadratic state cost, where $x_t^d$ is the desired state. $R_t = R(x_t)$ is a state-dependent positive definite weight matrix. Next we present the linearized Hamilton-Jacobi-Bellman equation for this class of optimal control problems.

### 2.2 Linearized Hamilton-Jacobi-Bellman equation for uncertain dynamics

At each iteration the goal is to find the optimal control update $\delta u_t$ that minimizes the value function

$$V(x_t, t) = \min_{\delta u_t}\mathbb{E}\Big[\int_{t}^{t+dt} L(x_t, \delta u_t)\,dt + V(x_t + dx_t, t + dt)\,\Big|\,x_t\Big]. \qquad (5)$$

(5) is the Bellman equation. By approximating the integral for a small $dt$ and applying Itô's rule we obtain the Hamilton-Jacobi-Bellman (HJB) equation (the detailed derivation is skipped):

$$-\partial_t V_t = \min_{\delta u_t}\Big(q_t + \tfrac{1}{2}\delta u_t^T R_t\,\delta u_t + (\mu_{f_t} + G_t\delta u_t)^T\nabla_x V_t + \tfrac{1}{2}\mathrm{Tr}(\Sigma_{f_t}\nabla_{xx} V_t)\Big).$$

To find the optimal control update, we take the gradient of the expression inside the parentheses with respect to $\delta u_t$ and set it to zero. This yields $\delta u_t = -R_t^{-1} G_t^T\nabla_x V_t$. Inserting this expression into the HJB equation yields the following nonlinear, second-order PDE

$$-\partial_t V_t = q_t + (\nabla_x V_t)^T\mu_{f_t} - \tfrac{1}{2}(\nabla_x V_t)^T G_t R_t^{-1} G_t^T\nabla_x V_t + \tfrac{1}{2}\mathrm{Tr}(\Sigma_{f_t}\nabla_{xx} V_t). \qquad (6)$$

In order to solve the above PDE we use the exponential transformation of the value function $V_t = -\lambda\log\Psi_t$, where $\Psi_t = \Psi(x_t)$ is called the desirability of $x_t$. The corresponding partial derivatives are $\partial_t V_t = -\frac{\lambda}{\Psi_t}\partial_t\Psi_t$, $\nabla_x V_t = -\frac{\lambda}{\Psi_t}\nabla_x\Psi_t$ and $\nabla_{xx} V_t = \frac{\lambda}{\Psi_t^2}\nabla_x\Psi_t\nabla_x\Psi_t^T - \frac{\lambda}{\Psi_t}\nabla_{xx}\Psi_t$. Inserting these terms into (6) results in

$$\frac{\lambda}{\Psi_t}\partial_t\Psi_t = q_t - \frac{\lambda}{\Psi_t}(\nabla_x\Psi_t)^T\mu_{f_t} - \frac{\lambda^2}{2\Psi_t^2}(\nabla_x\Psi_t)^T G_t R_t^{-1} G_t^T\nabla_x\Psi_t + \frac{\lambda}{2\Psi_t^2}\mathrm{Tr}\big((\nabla_x\Psi_t)^T\Sigma_{f_t}\nabla_x\Psi_t\big) - \frac{\lambda}{2\Psi_t}\mathrm{Tr}(\nabla_{xx}\Psi_t\Sigma_{f_t}).$$

The quadratic terms in $\nabla_x\Psi_t$ cancel out under the assumption $\lambda G_t R_t^{-1} G_t^T = \Sigma_{f_t}$. This constraint is different from existing works in path integral control [5, 6, 7, 10, 18, 8], where the constraint is enforced between the additive noise covariance and the control authority, more precisely $\lambda G_t R_t^{-1} G_t^T = B\Sigma_\omega B^T$. The new constraint enables an adaptive update of the control cost weight based on the explicit uncertainty of the learned dynamics. In contrast, most existing works use a fixed control cost weight [5, 6, 7, 10, 18, 12, 14, 8]. This condition also leads to more exploration (more aggressive control) under high uncertainty and less exploration when the dynamics are more certain. Given the aforementioned assumption, the above PDE simplifies to

$$\partial_t\Psi_t = \frac{1}{\lambda}q_t\Psi_t - \mu_{f_t}^T\nabla_x\Psi_t - \frac{1}{2}\mathrm{Tr}(\nabla_{xx}\Psi_t\Sigma_{f_t}), \qquad (7)$$

subject to the terminal condition $\Psi_T = \exp(-\frac{1}{\lambda}q_T)$. The resulting Chapman-Kolmogorov PDE (7) is linear. In general, solving (7) analytically is intractable for nonlinear systems and cost functions. We apply the Feynman-Kac formula, which gives a probabilistic representation of the solution of the linear PDE (7):

$$\Psi_t = \lim_{dt\to 0}\int p(\tau_t|x_t)\exp\Big(-\frac{1}{\lambda}\sum_{j=t}^{T} q_j\,dt\Big)\Psi_T\,d\tau_t, \qquad (8)$$

where $\tau_t$ is the state trajectory from time $t$ to $T$.
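To see what (8) means computationally, the sketch below estimates $\Psi_t$ by Monte Carlo: roll out the dynamics forward from $x_t$ (here with $\delta u = 0$), accumulate the exponentiated path cost, and average. This is a hedged illustration of how sampling-based PI methods evaluate (8), not the method proposed in this paper; `dynamics_step`, `state_cost`, and `terminal_cost` are placeholder callables.

```python
import numpy as np

def estimate_desirability(x_t, dynamics_step, state_cost, terminal_cost,
                          lam, dt, horizon, num_samples=1000,
                          rng=np.random.default_rng(0)):
    """Monte Carlo estimate of Psi_t in (8) by forward sampling.

    dynamics_step(x, rng) -> next state sampled from the (learned) dynamics,
    state_cost(x)         -> running cost q(x),
    terminal_cost(x)      -> terminal cost q(x_T).
    """
    psi_estimates = np.empty(num_samples)
    for i in range(num_samples):
        x = np.array(x_t, dtype=float)
        running_cost = 0.0
        for _ in range(horizon):                  # simulate tau_t from t to T
            running_cost += state_cost(x) * dt    # accumulate q_j * dt
            x = dynamics_step(x, rng)
        # exp(-(1/lam) * sum_j q_j dt) * Psi_T, with Psi_T = exp(-(1/lam) q_T)
        psi_estimates[i] = np.exp(-(running_cost + terminal_cost(x)) / lam)
    return psi_estimates.mean()
```

Sampling-based PI methods then obtain the control from path-cost-weighted averages over such rollouts, which is the family of approaches reviewed next.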
The optimal control is then obtained as

$$G_t\,\delta\hat u_t = -G_t R_t^{-1} G_t^T(\nabla_x V_t) = \lambda G_t R_t^{-1} G_t^T\frac{\nabla_x\Psi_t}{\Psi_t} = \Sigma_{f_t}\frac{\nabla_x\Psi_t}{\Psi_t} \;\Rightarrow\; \hat u_t = u^{old}_t + \delta\hat u_t = u^{old}_t + G_t^{-1}\Sigma_{f_t}\frac{\nabla_x\Psi_t}{\Psi_t}, \qquad (9)$$

where the structural constraint $\lambda G_t R_t^{-1} G_t^T = \Sigma_{f_t}$ is used in the last two equalities. Rather than computing $\nabla_x\Psi_t$ and $\Psi_t$, the optimal control $\hat u_t$ can also be approximated based on the path costs of sampled trajectories. Next we briefly review some of the existing approaches.

### 2.3 Related works

According to path integral control theory [5, 6, 7, 10, 18, 8], the stochastic optimal control problem becomes the problem of approximating the path integral (8). This problem can be solved by forward sampling of the uncontrolled ($u = 0$) SDE (1). The optimal control $\hat u_t$ is approximated based on the path costs of sampled trajectories; the computation of optimal controls therefore becomes a forward process. More precisely, when the control and noise act in the same subspace, the optimal control can be evaluated as the weighted average of the noise, $\hat u_t = \mathbb{E}_{p(\tau_t|x_t)}[d\omega_t]$, where the probability of a trajectory is

$$p(\tau_t|x_t) = \frac{\exp\big(-\frac{1}{\lambda}S(\tau_t|x_t)\big)}{\int\exp\big(-\frac{1}{\lambda}S(\tau_t|x_t)\big)d\tau_t},$$

and $S(\tau_t|x_t)$ is the path cost computed by forward sampling. However, these approaches require a large number of samples from a given dynamics model, or extensive trials on physical systems when applied in model-free reinforcement learning settings. In order to improve sample efficiency, a nonparametric approach was developed by representing the desirability $\Psi_t$ in terms of linear operators in a reproducing kernel Hilbert space (RKHS) [12]. As a model-free approach, it allows sample re-use but relies on numerical methods to estimate the gradient of the desirability, i.e., $\nabla_x\Psi_t$, which can be computationally expensive. On the other hand, computing the analytic expressions of the path integral embedding is intractable and requires exact knowledge of the system dynamics. Furthermore, the control approximation is based on samples from the uncontrolled dynamics, which is usually not sufficient for highly nonlinear or underactuated systems.

Another class of PI-related methods is based on policy parameterization. Notable approaches include PI2 [10], PI2-CMA [11], PI-REPS [14] and the recently developed state-dependent PI [8]. The limitations of these methods are: 1) they do not take into account model uncertainty in the passive dynamics $f(x)$; 2) the imposed policy parameterizations restrict the optimal control solutions; 3) the optimized policy parameters cannot be generalized to new tasks. A brief comparison of some of these methods can be found in Table 1. Motivated by the challenge of combining sample efficiency and generalizability, next we introduce a probabilistic model-based approach to compute the optimal control (9) analytically.

| | PI [5, 6, 7], iterative PI [18] | PI2 [10], PI2-CMA [11] | PI-REPS [14] | State feedback PI [8] | Our method |
|---|---|---|---|---|---|
| Structural constraint | $\lambda G_t R_t^{-1} G_t^T = B\Sigma_\omega B^T$ | same as PI | same as PI | same as PI | $\lambda G R^{-1} G^T = \Sigma_f$ |
| Dynamics model | model-based | model-free | model-based | model-based | GP model-based |
| Policy parameterization | No | Yes | Yes | Yes | No |

Table 1: Comparison with some notable and recent path integral-related approaches.

## 3 Proposed Approach

### 3.1 Analytic path integral control: a forward-backward scheme

In order to derive the proposed framework, we first learn the function $\tilde f^{GP}(x_t) = \tilde f(x, u^{old})dt + B\,d\omega$ from sampled data. Learning the continuous mapping from state to state transition can be viewed as inference with the goal of inferring the state transition $d\tilde x_t = \tilde f^{GP}(x_t)$. The kernel function has been defined in Sec. 2.1 and can be interpreted as a similarity measure between random variables.
More specifically, if the training inputs $x_i$ and $x_j$ are close to each other in the kernel space, their outputs $d\tilde x_i$ and $d\tilde x_j$ are highly correlated. Given a sequence of states $\{x_0, \ldots, x_T\}$ and the corresponding state transitions $\{d\tilde x_0, \ldots, d\tilde x_T\}$, the posterior distribution can be obtained by conditioning the joint prior distribution on the observations. In this work we make the standard assumption of independent outputs (no correlation between output dimensions). To propagate the GP-based dynamics over a trajectory of time horizon $T$ we employ the moment matching approach [24, 17] to compute the predictive distribution. Given an input distribution over the state $\mathcal{N}(\mu_t, \Sigma_t)$, the predictive distribution over the state at $t + dt$ can be approximated as a Gaussian $p(x_{t+dt}) \approx \mathcal{N}(\mu_{t+dt}, \Sigma_{t+dt})$ such that

$$\mu_{t+dt} = \mu_t + \mu_{f_t}, \qquad \Sigma_{t+dt} = \Sigma_t + \Sigma_{f_t} + \mathrm{COV}[x_t, d\tilde x_t] + \mathrm{COV}[d\tilde x_t, x_t]. \qquad (10)$$

The above formulation is used to approximate the one-step transition probabilities over the trajectory. Details regarding the moment matching method can be found in [24, 17]. All mean and variance terms can be computed analytically. The hyper-parameters $\sigma_s, \sigma_\omega, W$ are learned by maximizing the log-likelihood of the training outputs given the inputs [25].

Given the approximation of the transition probability (10), we now introduce a Bayesian nonparametric formulation of path integral control based on the probabilistic representation of the dynamics. First we perform approximate inference (forward propagation) to obtain the Gaussian belief (predictive mean and covariance of the state) over the trajectory. Since the exponential transformation of the state cost $\exp(-\frac{1}{\lambda}q(x)dt)$ is an unnormalized Gaussian $\mathcal{N}\big(x^d, \frac{\lambda}{2dt}Q^{-1}\big)$, we can evaluate the following integral analytically

$$\int\mathcal{N}(\mu_j, \Sigma_j)\exp\Big(-\frac{1}{\lambda}q_j\,dt\Big)dx_j = \Big|I + \frac{2dt}{\lambda}\Sigma_j Q\Big|^{-\frac{1}{2}}\exp\Big(-\frac{dt}{\lambda}(\mu_j - x^d_j)^T Q\Big(I + \frac{2dt}{\lambda}\Sigma_j Q\Big)^{-1}(\mu_j - x^d_j)\Big), \qquad (11)$$

for $j = t+dt, \ldots, T$. Thus, given the boundary condition $\Psi_T = \exp(-\frac{1}{\lambda}q_T)$ and the predictive distribution at the final step $\mathcal{N}(\mu_T, \Sigma_T)$, we can evaluate the one-step backward desirability $\Psi_{T-dt}$ analytically using the expression (11). More generally we use the following recursive rule

$$\Psi_{j-dt} = \Phi(x_j, \Psi_j) = \int\mathcal{N}(\mu_j, \Sigma_j)\exp\Big(-\frac{1}{\lambda}q_j\,dt\Big)\Psi_j\,dx_j, \qquad (12)$$

for $j = t+dt, \ldots, T-dt$. Since we use deterministic approximate inference based on (10) instead of explicitly sampling from the corresponding SDE, we approximate the conditional distribution $p(x_j|x_{j-dt})$ by the Gaussian predictive distribution $\mathcal{N}(\mu_j, \Sigma_j)$. Therefore the path integral becomes

$$\Psi_t = \int p(\tau_t|x_t)\exp\Big(-\frac{1}{\lambda}\sum_{j=t}^{T}q_j\,dt\Big)\Psi_T\,d\tau_t \approx \int\!\cdots\!\int\mathcal{N}(\mu_{T-dt}, \Sigma_{T-dt})\exp\Big(-\frac{1}{\lambda}q_{T-dt}\,dt\Big)\underbrace{\int\mathcal{N}(\mu_T, \Sigma_T)\exp\Big(-\frac{1}{\lambda}q_T\,dt\Big)\Psi_T\,dx_T}_{\Psi_{T-dt}}\,dx_{T-dt}\cdots = \int\mathcal{N}(\mu_{t+dt}, \Sigma_{t+dt})\exp\Big(-\frac{1}{\lambda}q_{t+dt}\,dt\Big)\Psi_{t+dt}\,dx_{t+dt} = \Phi(x_{t+dt}, \Psi_{t+dt}). \qquad (13)$$

We evaluate the desirability $\Psi_t$ backward in time by successive application of the above recursive expression. The optimal control law $\hat u_t$ (9) also requires the gradient of the desirability function with respect to the state, which can be computed backward in time as well. For simplicity we denote the function $\Phi(x_j, \Psi_j)$ by $\Phi_j$. Thus we compute the gradient of the recursive expression (13) as

$$\nabla_x\Psi_{j-dt} = \Psi_j\nabla_x\Phi_j + \Phi_j\nabla_x\Psi_j, \qquad (14)$$

where $j = t+dt, \ldots, T-dt$. Given the expression in (11), we compute the gradient terms in (14) as $\nabla_x\Phi_j = \frac{d\Phi_j}{dp(x_j)}\frac{dp(x_j)}{dx_t}$, where $\frac{d\Phi_j}{dp(x_j)}$ consists of the partial derivatives $\frac{\partial\Phi_j}{\partial\mu_j}$ and $\frac{\partial\Phi_j}{\partial\Sigma_j}$, obtained by differentiating (11) with respect to $\mu_j$ and $\Sigma_j$, and $\frac{dp(x_j)}{dx_t}$ is propagated along the trajectory by the chain rule

$$\frac{d\mu_j}{dx_t} = \frac{\partial\mu_j}{\partial\mu_{j-dt}}\frac{d\mu_{j-dt}}{dx_t} + \frac{\partial\mu_j}{\partial\Sigma_{j-dt}}\frac{d\Sigma_{j-dt}}{dx_t}, \qquad \frac{d\Sigma_j}{dx_t} = \frac{\partial\Sigma_j}{\partial\mu_{j-dt}}\frac{d\mu_{j-dt}}{dx_t} + \frac{\partial\Sigma_j}{\partial\Sigma_{j-dt}}\frac{d\Sigma_{j-dt}}{dx_t}.$$

The term $\nabla_x\Psi_{T-dt}$ is computed similarly. The partial derivatives $\frac{\partial\mu_j}{\partial\mu_{j-dt}}$, $\frac{\partial\mu_j}{\partial\Sigma_{j-dt}}$, $\frac{\partial\Sigma_j}{\partial\mu_{j-dt}}$, $\frac{\partial\Sigma_j}{\partial\Sigma_{j-dt}}$ can be computed analytically as in [17]. We compute all gradients with this scheme, without resorting to numerical methods (finite differences, etc.).
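The forward-backward computation of the desirability described above can be summarized in a short sketch. The code below is a simplified illustration under stated assumptions: the forward pass propagates only the mean and covariance through a user-supplied one-step GP prediction (the cross-covariance terms of the full moment-matching update (10), the control term, and the gradient recursion (14) are omitted), the boundary value $\Psi_T$ is evaluated at the predictive mean, and the Gaussian integral of (11) is used as reconstructed here. The function name `gp_step` is a placeholder, not the paper's implementation.

```python
import numpy as np

def forward_pass(mu0, Sigma0, gp_step, H):
    """Propagate Gaussian beliefs N(mu_j, Sigma_j) along the trajectory, cf. (10).

    gp_step(mu, Sigma) -> (mu_f, Sigma_f): one-step GP prediction of the transition.
    Cross-covariance terms of the full moment-matching update are omitted here.
    """
    mus, Sigmas = [mu0], [Sigma0]
    for _ in range(H):
        mu_f, Sigma_f = gp_step(mus[-1], Sigmas[-1])
        mus.append(mus[-1] + mu_f)             # mu_{j+dt} = mu_j + mu_f
        Sigmas.append(Sigmas[-1] + Sigma_f)    # Sigma_{j+dt} ~ Sigma_j + Sigma_f
    return mus, Sigmas

def gaussian_cost_integral(mu, Sigma, x_d, Q, lam, dt):
    """Evaluate int N(mu, Sigma) exp(-q(x) dt / lam) dx analytically, cf. (11)."""
    n = len(mu)
    A = np.eye(n) + (2.0 * dt / lam) * Sigma @ Q
    e = mu - x_d
    quad = e @ Q @ np.linalg.solve(A, e)
    return np.linalg.det(A) ** -0.5 * np.exp(-(dt / lam) * quad)

def backward_desirability(mus, Sigmas, x_d, Q, lam, dt):
    """Backward recursion for Psi_j, cf. (12)-(13), over the belief trajectory.

    With Gaussian beliefs each Psi_j is a scalar, so every backward step simply
    multiplies by the analytic integral above, starting from the boundary Psi_T.
    """
    H = len(mus) - 1
    e_T = mus[-1] - x_d
    psi = np.exp(-(e_T @ Q @ e_T) / lam)       # Psi_T evaluated at the mean
    psis = [psi]
    for j in range(H, 0, -1):                  # j = T, T-dt, ..., t+dt
        psi = gaussian_cost_integral(mus[j], Sigmas[j], x_d, Q, lam, dt) * psi
        psis.append(psi)
    return psis[::-1]                          # Psi_t, ..., Psi_T
```

In the full algorithm, the gradient $\nabla_x\Psi_t$ is propagated backward alongside $\Psi_t$ via (14), which this sketch omits.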
Given $\Psi_t$ and $\nabla_x\Psi_t$, the optimal control takes the analytic form of (9). Since $\Psi_t$ and $\nabla_x\Psi_t$ are explicit functions of $x_t$, the resulting control law is essentially different from the feedforward controls in sampling-based path integral control frameworks [5, 6, 7, 10, 18] as well as from the parameterized state feedback PI control policies [14, 8]. Notice that at the current time step $t$ we update the control sequence $\hat u_{t,\ldots,T}$ using the presented forward-backward scheme. Only $\hat u_t$ is applied to the system to move to the next step, while the controls $\hat u_{t+dt,\ldots,T}$ are used for the control updates at future steps. The transition sample recorded at each time step is incorporated to update the GP model of the dynamics. A summary of the proposed algorithm is shown in Algorithm 1.

Algorithm 1: Sample efficient path integral control under uncertain dynamics

1: Initialization: apply random controls $\hat u_{0,\ldots,T}$ to the physical system (1), record data.
2: repeat
3:   for t = 0 : T do
4:     Incorporate the transition sample to learn the GP dynamics model.
5:     repeat
6:       Approximate inference for predictive distributions using $u^{old}_{t,\ldots,T} = \hat u_{t,\ldots,T}$, see (10).
7:       Backward computation of optimal control updates $\delta\hat u_{t,\ldots,T}$, see (13), (14), (9).
8:       Update optimal controls $\hat u_{t,\ldots,T} = u^{old}_{t,\ldots,T} + \delta\hat u_{t,\ldots,T}$.
9:     until convergence
10:    Apply the optimal control $\hat u_t$ to the system. Move one step forward and record data.
11:  end for
12: until task learned

### 3.2 Generalization to unlearned tasks without sampling

In this section we describe how to generalize the learned controllers to new (unlearned) tasks without any interaction with the real system. The proposed approach is based on the compositionality theory [26] in linearly solvable optimal control (LSOC). We use superscripts to denote the indexes of previously learned tasks. First we define a distance measure between the new target $x^d$ and the old targets $x^{d^k}$, $k = 1, \ldots, K$, i.e., a Gaussian kernel

$$\omega^k = \exp\Big(-\frac{1}{2}(x^d - x^{d^k})^T P\,(x^d - x^{d^k})\Big), \qquad (15)$$

where $P$ is a diagonal matrix (kernel width). The composite terminal cost $\tilde q(x_T)$ for the new task becomes

$$\tilde q(x_T) = -\lambda\log\frac{\sum_{k=1}^{K}\omega^k\exp\big(-\frac{1}{\lambda}q^k(x_T)\big)}{\sum_{k=1}^{K}\omega^k}, \qquad (16)$$

where $q^k(x_T)$ is the terminal cost of the $k$-th old task. For conciseness we define a normalized distance measure $\bar\omega^k = \frac{\omega^k}{\sum_{k=1}^{K}\omega^k}$, which can be interpreted as a probability weight. Based on (16), the composite terminal desirability for the new task is a linear combination of the $\Psi^k_T$:

$$\exp\Big(-\frac{1}{\lambda}\tilde q(x_T)\Big) = \sum_{k=1}^{K}\bar\omega^k\Psi^k_T. \qquad (17)$$

Since $\Psi^k_t$ is the solution to the linear Chapman-Kolmogorov PDE (7), the linear combination of desirabilities (17) holds everywhere from $t$ to $T$ as long as it holds on the boundary (terminal time step). Therefore we obtain the composite control

$$\tilde u_t = \sum_{k=1}^{K}\frac{\omega^k\Psi^k_t}{\sum_{k'=1}^{K}\omega^{k'}\Psi^{k'}_t}\hat u^k_t. \qquad (18)$$

The composite control law in (18) is essentially different from an interpolating control law [26]. It enables sample-free controllers constructed from controllers learned for different tasks. This scheme cannot be adopted in policy search or trajectory optimization methods such as [10, 11, 14, 17, 19, 20, 21, 22]. Alternatively, generalization can be achieved by imposing task-dependent policies [27]. However, this approach might restrict the choice of optimal controls given the assumed structure of the control policy. A short sketch of the composition is given below.
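As a concrete illustration, the sketch below composes a control for a new target from K previously learned controllers using (15) and (18). It assumes each learned controller can report its desirability $\Psi^k_t$ and control $\hat u^k_t$ at the current state; the interface (the `Psi_k` and `u_k` arrays) is hypothetical.

```python
import numpy as np

def task_weights(x_d_new, x_d_old, P):
    """Gaussian-kernel weights between the new target and the old targets, cf. (15)."""
    w = np.empty(len(x_d_old))
    for k, x_dk in enumerate(x_d_old):
        e = x_d_new - x_dk
        w[k] = np.exp(-0.5 * e @ P @ e)
    return w

def composite_control(w, Psi_k, u_k):
    """Composite control for the new task, cf. (18).

    w:     (K,)   task weights from task_weights()
    Psi_k: (K,)   desirability Psi^k_t of each learned controller at the current state
    u_k:   (K, m) control u^k_t of each learned controller at the current state
    """
    mix = w * Psi_k
    mix = mix / mix.sum()          # omega^k Psi^k_t / sum_k' omega^k' Psi^k'_t
    return mix @ u_k               # weighted combination of the K controls

# Hypothetical usage with K = 3 learned controllers and a 2-dimensional control:
P = np.eye(4)
x_d_new = np.array([0.5, 0.0, 1.0, 0.0])
x_d_old = [np.zeros(4), np.array([1.0, 0.0, 1.0, 0.0]), np.array([0.5, 0.5, 1.0, 0.0])]
w = task_weights(x_d_new, x_d_old, P)
u_tilde = composite_control(w, Psi_k=np.array([0.8, 0.6, 0.9]),
                            u_k=np.array([[0.1, 0.0], [0.3, -0.2], [0.2, 0.1]]))
```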
## 4 Experiments and Analysis

We consider three simulated RL tasks: cart-pole (CP) swing-up, double pendulum on a cart (DPC) swing-up, and PUMA-560 robotic arm reaching. The CP and DPC systems consist of a cart and a single/double-link pendulum, and the tasks are to swing up the single/double-link pendulum from the initial position (pointing down). Both CP and DPC are under-actuated systems with only one control acting on the cart. PUMA-560 is a 3D robotic arm with 12 state dimensions and 6 degrees of freedom, with 6 actuators on the joints. The task is to steer the end-effector to the desired position and orientation.

In order to demonstrate the performance, we compare the proposed control framework with three related methods: iterative path integral control [18] with a known dynamics model, PILCO [17], and PDDP [19]. Iterative path integral control is a sampling-based stochastic control method; it is based on importance sampling using a controlled diffusion process rather than the passive dynamics used in standard path integral control [5, 6, 7]. Iterative PI control is used as a baseline with a given dynamics model. PILCO is a model-based policy search method that features state-of-the-art data efficiency in terms of the number of trials required to learn a task. PILCO requires an external optimizer (such as BFGS) for policy improvement. PDDP is a Gaussian belief space trajectory optimization approach; it performs dynamic programming based on local approximations of the learned dynamics and value function. Both PILCO and PDDP are applied with unknown dynamics. In this work we do not compare our method with model-free PI-related approaches such as [10, 11, 12, 14], since these methods would certainly cost more samples than model-based methods such as PILCO and PDDP. The reason for choosing these two methods for comparison is that our method adopts a similar model learning scheme, while other state-of-the-art methods, such as [20], are based on a different model.

In experiment 1 we demonstrate the sample efficiency of our method on the CP and DPC tasks. For both tasks we choose T = 1.2 and dt = 0.02 (60 time steps per rollout). Iterative PI [18] with a given dynamics model uses $10^3$/$10^4$ (CP/DPC) sample rollouts per iteration and 500 iterations at each time step. We initialize PILCO and the proposed method by collecting 2/6 sample rollouts (corresponding to 120/360 transition samples) for the CP/DPC tasks, respectively. At each trial (on the true dynamics model), we use 1 sample rollout for PILCO and our method. PDDP uses 4/5 rollouts (corresponding to 240/300 transition samples) for initialization as well as at each trial for the CP/DPC tasks. Fig. 1 shows the results in terms of $\Psi_T$ and computational time. For both tasks our method shows higher desirability (lower terminal state cost) at each trial, which indicates higher sample efficiency for task learning. This is mainly because our method performs online re-optimization at each time step, whereas the other two methods do not use this scheme. However, we assume that partial information about the dynamics (the G matrix) is given; PILCO and PDDP perform optimization on entirely unknown dynamics. In many robotic systems G corresponds to the inverse of the inertia matrix, which can be identified from data as well. In terms of computational efficiency, our method outperforms PILCO since we compute the optimal control update analytically, while PILCO solves large-scale nonlinear optimization problems to obtain policy parameters. Our method is more computationally expensive than PDDP because PDDP seeks locally optimal controls that rely on linear approximations, while our method is a global optimal control approach.
Despite the relatively higher computational burden compared with PDDP, our method offers reasonable efficiency in terms of the time required to reach the baseline performance.

In experiment 2 we demonstrate the generalizability of the learned controllers to new tasks using the composite control law (18) on the PUMA-560 system. We use T = 2 and dt = 0.02 (100 time steps per rollout). First we learn 8 independent controllers using Algorithm 1. The target postures are shown in Fig. 2. For all tasks we initialize with 3 sample rollouts and use 1 sample at each trial. The blue bars in Fig. 2b show the desirabilities $\Psi_T$ after 3 trials. Next we use the composite law (18) to construct controllers without re-sampling from the 7 other controllers learned using Algorithm 1. For instance, the composite controller for task #1 is found as

$$\tilde u^1_t = \sum_{k=2}^{8}\frac{\omega^k\Psi^k_t}{\sum_{k'=2}^{8}\omega^{k'}\Psi^{k'}_t}\hat u^k_t.$$

The performance comparison of the composite controllers with the controllers learned from trials is shown in Fig. 2. It can be seen that the composite controllers give performance close to that of the independently learned controllers. The compositionality theory [26] generally does not apply to policy search methods and trajectory optimizers such as PILCO, PDDP, and other recent methods [20, 21, 22]. Our method benefits from the compositionality of control laws, which can be applied for multi-task control without re-sampling.

Figure 1: Comparison in terms of sample efficiency and computational efficiency for (a) cart-pole and (b) double pendulum on a cart swing-up tasks. Left subfigures show the terminal desirability $\Psi_T$ (for PILCO and PDDP, $\Psi_T$ is computed using terminal state costs) at each trial. Right subfigures show computational time (in minutes) at each trial.

Figure 2: Results for the PUMA-560 tasks. (a) The 8 tasks tested in this experiment; each number indicates a corresponding target posture. (b) Comparison of the controllers learned independently from trials and the composite controllers obtained without sampling. Each composite controller is obtained via (18) from the 7 other independent controllers learned from trials.

## 5 Conclusion and Discussion

We presented an iterative learning control framework that can find optimal controllers under uncertain dynamics using a very small number of samples. This approach is closely related to the family of path integral (PI) control algorithms. Our method is based on a forward-backward optimization scheme, which differs significantly from current PI-related approaches. Moreover, it combines the attractive characteristics of probabilistic model-based reinforcement learning and linearly solvable optimal control theory; these characteristics include sample efficiency, optimality and generalizability. By iteratively updating the control laws based on a probabilistic representation of the learned dynamics, our method demonstrated encouraging performance compared to state-of-the-art model-based methods. In addition, our method showed promising potential for multi-task control based on the compositionality of learned controllers.
Besides the assumed structural constraint between the control cost weight and the uncertainty of the passive dynamics, the major limitation is that we have not taken into account the uncertainty in the control matrix G. Future work will focus on further generalization of this framework and applications to real systems.

## Acknowledgments

This research is supported by NSF NRI-1426945.

## References

[1] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-dynamic programming (optimization and neural computation series, 3). Athena Scientific, 7:15-23, 1996.
[2] A.G. Barto, W. Powell, J. Si, and D.C. Wunsch. Handbook of learning and approximate dynamic programming. 2004.
[3] W.H. Fleming. Exit probabilities and optimal stochastic control. Applied Math. Optim, 9:329-346, 1971.
[4] W.H. Fleming and H.M. Soner. Controlled Markov processes and viscosity solutions. Applications of Mathematics. Springer, New York, 1st edition, 1993.
[5] H.J. Kappen. Linear theory for control of nonlinear stochastic systems. Phys Rev Lett, 95:200201, 2005.
[6] H.J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 11:P11011, 2005.
[7] H.J. Kappen. An introduction to stochastic control theory, path integrals and reinforcement learning. AIP Conference Proceedings, 887(1), 2007.
[8] S. Thijssen and H.J. Kappen. Path integral control and state-dependent feedback. Phys. Rev. E, 91:032104, Mar 2015.
[9] E. Todorov. Efficient computation of optimal actions. Proceedings of the National Academy of Sciences, 106(28):11478-11483, 2009.
[10] E. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement learning. The Journal of Machine Learning Research, 11:3137-3181, 2010.
[11] F. Stulp and O. Sigaud. Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 281-288. ACM, 2012.
[12] K. Rawlik, M. Toussaint, and S. Vijayakumar. Path integral control by reproducing kernel Hilbert space embedding. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI 13, pages 1628-1634, 2013.
[13] Y. Pan and E. Theodorou. Nonparametric infinite horizon Kullback-Leibler stochastic control. In 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 1-8. IEEE, 2014.
[14] V. Gómez, H.J. Kappen, J. Peters, and G. Neumann. Policy search for path integral control. In Machine Learning and Knowledge Discovery in Databases, pages 482-497. Springer, 2014.
[15] K. Dvijotham and E. Todorov. Linearly solvable optimal control. Reinforcement learning and approximate dynamic programming for feedback control, pages 119-141, 2012.
[16] M.P. Deisenroth, G. Neumann, and J. Peters. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1-142, 2013.
[17] M. Deisenroth, D. Fox, and C. Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:75-90, 2015.
[18] E. Theodorou and E. Todorov. Relative entropy and free energy dualities: Connections to path integral and KL control. In 51st IEEE Conference on Decision and Control, pages 1466-1473, 2012.
[19] Y. Pan and E. Theodorou. Probabilistic differential dynamic programming. In Advances in Neural Information Processing Systems (NIPS), pages 1907-1915, 2014.
[20] S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems (NIPS), pages 1071-1079, 2014.
[21] S. Levine and V. Koltun. Learning complex neural network policies with trajectory optimization. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 829-837, 2014.
[22] J. Schulman, S. Levine, P. Moritz, M.I. Jordan, and P. Abbeel. Trust region policy optimization. arXiv preprint arXiv:1502.05477, 2015.
[23] P. Hennig. Optimal reinforcement learning for Gaussian systems. In Advances in Neural Information Processing Systems (NIPS), pages 325-333, 2011.
[24] J. Quiñonero-Candela, A. Girard, J. Larsen, and C.E. Rasmussen. Propagation of uncertainty in Bayesian kernel models - application to multiple-step ahead forecasting. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003.
[25] C.K.I. Williams and C.E. Rasmussen. Gaussian processes for machine learning. MIT Press, 2006.
[26] E. Todorov. Compositionality of optimal control laws. In Advances in Neural Information Processing Systems (NIPS), pages 1856-1864, 2009.
[27] M.P. Deisenroth, P. Englert, J. Peters, and D. Fox. Multi-task policy search for robotics. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), 2014.