# Adaptive Smoothing for Path Integral Control

Journal of Machine Learning Research 21 (2020) 1-37. Submitted 9/18; Revised 12/19; Published 9/20.

Dominik Thalmeier, d.thalmeier@science.ru.nl, Radboud University Nijmegen, Nijmegen, The Netherlands

Hilbert J. Kappen, b.kappen@science.ru.nl, Radboud University Nijmegen, Nijmegen, The Netherlands

Simone Totaro, simone.totaro@gmail.com, Universitat Pompeu Fabra, Barcelona, Spain

Vicenç Gómez, vicen.gomez@upf.edu, Universitat Pompeu Fabra, Barcelona, Spain

Editor: Benjamin Recht

Abstract

In path integral control problems a representation of an optimally controlled dynamical system can be computed formally and can serve as a guidepost to learn a parametrized policy. The Path Integral Cross-Entropy (PICE) method tries to exploit this, but is hampered by poor sample efficiency. We propose a model-free algorithm called ASPIC (Adaptive Smoothing of Path Integral Control) that applies an inf-convolution to the cost function to speed up convergence of policy optimization. We identify PICE as the infinite-smoothing limit of this technique and show that the sample efficiency problems of PICE disappear for finite levels of smoothing. For zero smoothing, ASPIC becomes a greedy optimization of the cost, which is the standard approach in current reinforcement learning. ASPIC adapts the smoothing parameter to keep the variance of the gradient estimator at a predefined level, independently of the number of samples. We show analytically and empirically that intermediate levels of smoothing are optimal, which renders the new method superior to both PICE and direct cost optimization.

Keywords: Path Integral Control, Entropy-Regularization, Cost Smoothing

1. Introduction

How to choose an optimal action? For noisy dynamical systems, stochastic optimal control theory provides a framework to answer this question.
Optimal control is framed as an optimization problem: find the control that minimizes an expected cost function. For non-linear dynamical systems that are continuous in time and space, this problem is in general hard.

©2020 Dominik Thalmeier, Hilbert J. Kappen, Simone Totaro, Vicenç Gómez. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-624.html.

A method that has proven to work well is to introduce a parametrized policy such as a neural network (Mnih et al., 2015; Levine et al., 2016; Duan et al., 2016; François-Lavet et al., 2018) and greedily optimize the expected cost using gradient descent (Williams, 1992; Peters and Schaal, 2008; Schulman et al., 2015; Heess et al., 2017). To achieve a robust decrease of the expected cost it is important to ensure that in each step the updated policy stays in the proximity of the old policy (Duan et al., 2016). This can be achieved by enforcing a trust region constraint (Peters et al., 2010; Schulman et al., 2015) or by using adaptive regularization that penalizes strong deviations of the new policy from the old policy (Heess et al., 2017). However, the applicability of these methods is limited, as in each iteration of the algorithm, samples from the controlled system have to be computed, either from a simulator or from a real system. We want to increase the convergence rate of policy optimization to reduce the number of simulations needed. To this end we consider path integral control problems (Kappen, 2005; Todorov, 2009; Kappen et al., 2012), which offer an alternative approach to direct cost optimization, and explore whether this allows us to speed up policy optimization.
This class of control problems permits arbitrary non-linear dynamics and state cost, but requires a linear dependence of the control on the dynamics and a quadratic control cost (Kappen, 2005; Bierkens and Kappen, 2014; Thijssen and Kappen, 2015). These restrictions allow us to obtain an explicit expression for the probability density of optimally controlled system trajectories. Through this, an information-theoretical measure of the deviation of the current control policy from the optimal control can be calculated. The Path Integral Cross-Entropy (PICE) method (Kappen and Ruiz, 2016) proposes to use this measure as a pseudo-objective for policy optimization. However, there is as yet no comparative study on whether PICE actually offers an advantage over direct cost optimization; and, in its original form (Kappen and Ruiz, 2016), the PICE method does not scale well to complex problems because the PICE gradient is hard to estimate if the current controller is not close enough to the optimal control (Ruiz and Kappen, 2017). Furthermore, the PICE method was introduced with standard gradient descent and does not use trust regions to ensure robust updates, which have been shown to be effective for policy optimization (Duan et al., 2016). In this work we propose and study a new kind of smoothing technique for the cost function that allows us to interpolate between the optimization of the direct cost and the PICE objective. Optimizing this smoothed cost using a trust-region-based method yields an approach that is efficient and does not suffer from the feasibility issues of PICE. Our work is based on recently proposed smoothing techniques to speed up convergence in deep neural networks (Chaudhari et al., 2018). We adapt this smoothing technique to path integral control problems. In contrast to Chaudhari et al.
(2018), smoothing for path integral control problems can be solved analytically, and we obtain an expression for the gradient that can be computed directly from Monte Carlo samples. The strength of smoothing is regulated by a parameter. Remarkably, this parameter can be determined independently of the number of samples. In the limits of this smoothing parameter we recover the PICE method for infinitely strong smoothing and direct cost optimization for zero smoothing, respectively. As in Chaudhari et al. (2018), the minimum of the smoothed cost, and thus the optimal control policy, remains the same for all levels of smoothing. We provide a theoretical argument why smoothing is expected to speed up optimization and conduct numerical experiments on different control tasks, which show this accelerative effect in practice. For this we develop an algorithm called ASPIC (Adaptive Smoothing for Path Integral Control) that uses cost smoothing to speed up policy optimization. The algorithm adjusts the smoothing parameter in each step to keep the variance of the gradient estimator at a predefined level. To ensure robust updates of the policy, ASPIC enforces a trust region constraint; similar to Schulman et al. (2015), this is achieved with natural gradient updates and an adaptive stepsize. Like other policy-gradient-based methods (Williams, 1992; Peters and Schaal, 2008; Schulman et al., 2015; Heess et al., 2017), ASPIC is model-free. Many policy optimization algorithms update the control policy based on a direct optimization of the cost; examples are Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) and Path-Integral Relative Entropy Policy Search (PIREPS) (Gómez et al., 2014), where the latter was developed specifically for path integral control problems. The main novelty of this work is the application to path integral control problems of the idea of smoothing, as introduced in Chaudhari et al. (2018).
This technique outperforms direct cost optimization, achieving faster convergence rates with only a negligible amount of computational overhead.

2. Path Integral Control Problems

Consider the (multivariate) dynamical system

\dot{x}_t = f(x_t, t) + g(x_t, t) \left( u(x_t, t) + \xi_t \right),    (1)

with initial condition x_0. The control policy is implemented by the control function u(x, t), which is additive to the white noise \xi_t with variance \nu \, dt. Given a control function u and a time horizon T, this dynamical system induces a probability distribution p_u(\tau) over state trajectories \tau = \{x_t \mid t : 0 < t \le T\} with initial condition x_0. We define the regularized expected cost

C(p_u) = \langle V(\tau) \rangle_{p_u} + \gamma KL(p_u \| p_0),    (2)

with V(\tau) = \int_0^T V(x_t, t) \, dt, where the strength of the regularization KL(p_u \| p_0) is controlled by the parameter \gamma. The Kullback-Leibler divergence KL(p_u \| p_0) assigns high cost to controls u that bring the probability distribution p_u far away from the uncontrolled dynamics p_0, for which u(x_t, t) = 0. We can also rewrite the regularizer KL(p_u \| p_0) directly in terms of the control function u by using the Girsanov theorem (compare Thijssen and Kappen (2015)). The regularization then takes the form of a quadratic control cost,

KL(p_u \| p_0) = \left\langle \int_0^T \left( \tfrac{1}{2} u(x_t,t)^T u(x_t,t) + u(x_t,t)^T \xi_t \right) dt \right\rangle_{p_u} = \left\langle \int_0^T \tfrac{1}{2} u(x_t,t)^T u(x_t,t) \, dt \right\rangle_{p_u},

where we used that \langle u(x_t,t)^T \xi_t \rangle_{p_u} = 0. This shows that the regularization KL(p_u \| p_0) assigns higher cost to large values of the controller u. The path integral control problem is to find the optimal control function u^* that minimizes the regularized cost C(p_u):

u^* = \arg\min_u C(p_u).    (3)

For a more complete introduction to path integral control problems, see Thijssen and Kappen (2015); Kappen and Ruiz (2016).

2.1 Direct Cost Optimization Using Gradient Descent

A standard approach to find an optimal control function is to introduce a parametrized controller u_\theta(x_t, t) (Williams, 1992; Schulman et al., 2015; Heess et al., 2017).
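The dynamics (1) and the quadratic control cost can be simulated with a straightforward Euler-Maruyama scheme. The following sketch is our own illustration, not the paper's code: the function names, the elementwise treatment of g, and the omission of the noise-scaling constant in the control cost are assumptions.

```python
import numpy as np

def rollout(f, g, u, x0, T, dt, nu, rng):
    """Euler-Maruyama simulation of x_dot = f(x,t) + g(x,t) (u(x,t) + xi_t),
    where xi_t is white noise with variance nu*dt per time increment.

    Returns the state trajectory and the sampled quadratic control cost
    int_0^T 0.5 * u^T u dt; averaging the latter over rollouts estimates
    the KL regularizer of equation (2) (up to the noise scaling)."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    traj = [x.copy()]
    control_cost = 0.0
    for k in range(int(T / dt)):
        t = k * dt
        uc = np.atleast_1d(u(x, t))
        # xi * dt has variance nu * dt, so xi has std sqrt(nu / dt)
        xi = rng.normal(0.0, np.sqrt(nu / dt), size=uc.shape)
        x = x + (f(x, t) + g(x, t) * (uc + xi)) * dt
        control_cost += 0.5 * float(uc @ uc) * dt
        traj.append(x.copy())
    return np.array(traj), control_cost
```

With the uncontrolled policy u = 0 this draws samples from p_0; plugging in a parametrized controller draws samples from p_u.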
This parametrizes the path probabilities p_{u_\theta} and allows us to optimize the expected cost C(p_{u_\theta}) (2) using stochastic gradient descent on the cost function:

\nabla_\theta C(p_{u_\theta}) = \left\langle S^\gamma_{p_{u_\theta}}(\tau) \, \nabla_\theta \log p_{u_\theta}(\tau) \right\rangle_{p_{u_\theta}},

with the stochastic cost S^\gamma_{p_{u_\theta}}(\tau) := V(\tau) + \gamma \log \frac{p_{u_\theta}(\tau)}{p_0(\tau)} (see Appendix A for details).

2.2 The Cross-Entropy Method for Path Integral Control Problems

An alternative approach to direct cost optimization was introduced as the PICE method in Kappen and Ruiz (2016). It uses the fact that we can obtain an expression for p_{u^*}, the probability density of state trajectories induced by a system with the optimal controller u^*: p_{u^*} = \arg\min_{p_u} C(p_u), with C(p_u) given by equation (2). Finding p_{u^*} is an optimization problem over the space of all probability distributions p_u that are induced by the controlled dynamical system (1). It has been shown (Bierkens and Kappen, 2014; Thijssen and Kappen, 2015) that we can solve this by replacing the minimization over p_u with a minimization over all path probability distributions p:

p_{u^*} \equiv p^* := \arg\min_p C(p) = \arg\min_p \left( \langle V(\tau) \rangle_p + \gamma KL(p \| p_0) \right) = \frac{1}{Z} p_0(\tau) \exp\left( -\frac{1}{\gamma} V(\tau) \right),    (5)

with the normalization constant Z = \left\langle \exp\left( -\frac{1}{\gamma} V(\tau) \right) \right\rangle_{p_0}. Note that this is not a trivial statement, as we now take the minimum also over non-Markovian processes with non-Gaussian noise. The PICE algorithm (Kappen and Ruiz, 2016) takes advantage of the existence of this explicit expression for the density of optimally controlled trajectories p_{u^*}. PICE does not directly optimize the expected cost; instead it minimizes the KL-divergence KL(p^* \| p_{u_\theta}), which measures the deviation of a parametrized distribution p_{u_\theta} from the optimal one p^*. Although direct cost optimization and PICE are different methods, their global minimum is the same if the parametrization of u_\theta can express the optimal control u^* = u_{\theta^*}.
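The score-function form of this gradient can be illustrated on a toy problem where a "trajectory" is a single Gaussian sample whose mean is the policy parameter. This is a hypothetical minimal sketch: the Gaussian model and all names are ours, not the paper's.

```python
import numpy as np

def direct_cost_gradient(theta, V, gamma, n_samples=200_000, sigma=1.0, seed=0):
    """Monte Carlo estimate of grad_theta C(p_{u_theta}) for a toy model:
    a 'trajectory' is one sample tau ~ N(theta, sigma^2) (controlled path
    distribution) and p_0 = N(0, sigma^2) (uncontrolled distribution).

    Implements < S_gamma(tau) * grad_theta log p_theta(tau) >, with the
    stochastic cost S_gamma(tau) = V(tau) + gamma * log(p_theta/p_0)."""
    rng = np.random.default_rng(seed)
    tau = rng.normal(theta, sigma, n_samples)
    # log N(tau; theta, sigma^2) - log N(tau; 0, sigma^2)
    log_ratio = tau * theta / sigma**2 - theta**2 / (2 * sigma**2)
    S = V(tau) + gamma * log_ratio
    score = (tau - theta) / sigma**2  # grad_theta log N(tau; theta, sigma^2)
    return float(np.mean(S * score))
```

For gamma = 0 and V(tau) = tau^2 this reduces to the plain score-function (REINFORCE-style) estimator of grad_theta E[tau^2] = 2*theta.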
The parameters \theta^* of the optimal controller are found using gradient descent:

\nabla_\theta KL(p^* \| p_{u_\theta}) = -\frac{1}{Z_{p_{u_\theta}}} \left\langle \exp\left( -\frac{1}{\gamma} S^\gamma_{p_{u_\theta}}(\tau) \right) \nabla_\theta \log p_{u_\theta}(\tau) \right\rangle_{p_{u_\theta}},    (6)

where Z_{p_{u_\theta}} := \left\langle \exp\left( -\frac{1}{\gamma} S^\gamma_{p_{u_\theta}}(\tau) \right) \right\rangle_{p_{u_\theta}}.

That PICE uses the optimal density as a guidepost for the policy optimization might give it an advantage compared to direct cost optimization. In practice, however, this method only works properly if the initial guess of the controller u_\theta does not deviate too much from the optimal control, as a high value of KL(p^* \| p_{u_\theta}) leads to a high variance of the gradient estimator and results in bootstrapping problems for the algorithm (Ruiz and Kappen, 2017; Thalmeier et al., 2016). In the next section we introduce a method that interpolates between direct cost optimization and the PICE method. This allows us to take advantage of the analytical solution for the optimal density without being hampered by the same bootstrapping problems as PICE.

3. Interpolating Between the Two Methods: Smoothing Stochastic Control Problems

Cost function smoothing was recently introduced as a way to speed up optimization of neural networks (Chaudhari et al., 2018): optimization of a general cost function f(\theta) can be sped up by smoothing f(\theta) using an inf-convolution with a distance kernel d(\theta', \theta).^1 The smoothed function

J_\alpha(\theta) = \inf_{\theta'} \left( \alpha d(\theta', \theta) + f(\theta') \right)    (7)

preserves the global minima of the function f(\theta). Chaudhari et al. (2018) showed that gradient descent optimization on J_\alpha(\theta) instead of f(\theta) may significantly speed up convergence. For that, the authors used a stochastic optimal control interpretation of the smoothing process of the cost function. In particular, they viewed the smoothing process as the solution to a non-viscous Hamilton-Jacobi partial differential equation. In this work, we want to use this accelerative effect to find the optimal parametrization of the controller u_\theta. Therefore, we smooth the cost function C(p_{u_\theta}) as a function of the parameters \theta.
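The variance problem behind PICE's bootstrapping issue can be made concrete through the self-normalized weights exp(-S/gamma)/Z of the gradient estimator (6): when the sampled costs spread widely relative to gamma, almost all weight collapses onto a single rollout. A minimal sketch, with our own function names:

```python
import numpy as np

def pice_gradient_weights(S, gamma):
    """Self-normalized weights exp(-S_i/gamma) / Z of the PICE gradient
    estimator, computed stably by subtracting the max exponent."""
    a = -np.asarray(S) / gamma
    w = np.exp(a - a.max())
    return w / w.sum()

def effective_sample_size(w):
    """ESS = 1 / sum_i w_i^2: roughly how many rollouts actually
    contribute to the weighted gradient estimate."""
    return 1.0 / float(np.sum(w ** 2))
```

When the controller is far from optimal (costs spread over many multiples of gamma), the effective sample size approaches 1, which is the degeneracy reported by Ruiz and Kappen (2017).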
As C(p_{u_\theta}) = \langle V(\tau) \rangle_{p_{u_\theta}} + \gamma KL(p_{u_\theta} \| p_0) is a functional on the space of probability distributions p_{u_\theta}, the natural distance^2 is the KL-divergence KL(p_{u_{\theta'}} \| p_{u_\theta}). So we replace

f(\theta) \to C(p_{u_\theta}), \qquad d(\theta', \theta) \to KL(p_{u_{\theta'}} \| p_{u_\theta})

1. This is a generalized description. Chaudhari et al. (2018) used d(\theta', \theta) = |\theta' - \theta|^2.
2. Remark: strictly speaking the KL is not a distance, but a directed divergence.

and obtain the smoothed cost J_\alpha(\theta) as

J_\alpha(\theta) = \inf_{\theta'} \left( \alpha KL(p_{u_{\theta'}} \| p_{u_\theta}) + C(p_{u_{\theta'}}) \right) = \inf_{\theta'} \left( \alpha KL(p_{u_{\theta'}} \| p_{u_\theta}) + \gamma KL(p_{u_{\theta'}} \| p_0) + \langle V(\tau) \rangle_{p_{u_{\theta'}}} \right).    (8)

Note the different roles of \alpha and \gamma: \alpha penalizes the deviation of p_{u_{\theta'}} from p_{u_\theta}, while \gamma penalizes the deviation of p_{u_{\theta'}} from the uncontrolled dynamics p_0.

3.1 Computing the Smoothed Cost and its Gradient

The smoothed cost J_\alpha is expressed as a minimization problem that has to be solved. Here we show that for path integral control problems this can be done analytically. To do this we first show that we can replace \inf_{\theta'} \to \inf_p, and then solve the minimization over p analytically. We replace the minimization over \theta' by a minimization over p in two steps: first we state an assumption that allows us to replace \inf_{\theta'} \to \inf_{u'}, and then we prove that for path integral control problems we can replace \inf_{u'} \to \inf_p. We assume that for every u_\theta and any \alpha > 0, the minimizer \theta^*_{\alpha,\theta} over the parameter space,

\theta^*_{\alpha,\theta} := \arg\min_{\theta'} \left( \alpha KL(p_{u_{\theta'}} \| p_{u_\theta}) + C(p_{u_{\theta'}}) \right),    (9)

is the parametrization of the minimizer u^*_{\alpha,\theta} over the function space,

u^*_{\alpha,\theta} := \arg\min_{u'} \left( \alpha KL(p_{u'} \| p_{u_\theta}) + C(p_{u'}) \right),

such that u^*_{\alpha,\theta} \equiv u_{\theta^*_{\alpha,\theta}}. We call this assumption full parametrization. It is sufficient for full parametrization if u_\theta(x, t) is a universal function approximator with a fully observable state space x and the time t as input, although this may be difficult to achieve in practice. With this assumption we can replace \inf_{\theta'} \to \inf_{u'}.
Analogously we replace \inf_{u'} \to \inf_p: in Appendix B we prove that for path integral control problems the minimizer u^*_{\alpha,\theta} over the function space induces the minimizer p^*_{\alpha,\theta} over the space of probability distributions,

p^*_{\alpha,\theta} := \arg\min_{p'} \left( \alpha KL(p' \| p_{u_\theta}) + C(p') \right),    (10)

such that p^*_{\alpha,\theta} \equiv p_{u^*_{\alpha,\theta}}. This step is similar to the derivation of equation (5) in Section 2.2, but now we have added the additional term \alpha KL(p' \| p_{u_\theta}). Hence, given a path integral control problem and a controller u_\theta that satisfies full parametrization, we can replace \inf_{\theta'} \to \inf_p and equation (8) becomes

J_\alpha(\theta) = \inf_p \left( \alpha KL(p \| p_{u_\theta}) + \gamma KL(p \| p_0) + \langle V(\tau) \rangle_p \right).    (11)

This can be solved directly: first we compute the minimizer (see Appendix C for details)

p^*_{\alpha,\theta}(\tau) = \frac{1}{Z^\alpha_{p_{u_\theta}}} p_{u_\theta}(\tau) \exp\left( -\frac{1}{\gamma + \alpha} S^\gamma_{p_{u_\theta}}(\tau) \right)    (12)

with the normalization constant Z^\alpha_{p_{u_\theta}} = \left\langle \exp\left( -\frac{1}{\gamma + \alpha} S^\gamma_{p_{u_\theta}}(\tau) \right) \right\rangle_{p_{u_\theta}}. We plug this back into equation (11) and get an expression for the smoothed cost,

J_\alpha(\theta) = -(\gamma + \alpha) \log \left\langle \exp\left( -\frac{1}{\gamma + \alpha} S^\gamma_{p_{u_\theta}}(\tau) \right) \right\rangle_{p_{u_\theta}},    (13)

and its gradient (for details see Appendix D),

\nabla_\theta J_\alpha(\theta) = -\frac{\alpha}{Z^\alpha_{p_{u_\theta}}} \left\langle \exp\left( -\frac{1}{\gamma + \alpha} S^\gamma_{p_{u_\theta}}(\tau) \right) \nabla_\theta \log p_{u_\theta}(\tau) \right\rangle_{p_{u_\theta}},    (14)

both of which can be estimated using samples from the distribution p_{u_\theta}.

3.2 PICE, Direct Cost Optimization and Risk Sensitivity as Limiting Cases of Smoothed Cost Optimization

The smoothed cost and its gradient depend on the two parameters \alpha and \gamma, which come from the smoothing equation (7) and the definition of the control problem (2), respectively. Although at first glance the two parameters seem to play a similar role, they change different properties of the smoothed cost J_\alpha(\theta) when they are varied. In the expression for the smoothed cost (13), the parameter \alpha only appears in the sum \gamma + \alpha. Varying it changes the effect of the smoothing but leaves the optimum \theta^* = \arg\min_\theta J_\alpha(\theta) of the smoothed cost invariant (see Appendix E). We therefore call \alpha the smoothing parameter.
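The smoothed cost (13) and the weights of the tilted density (12) reduce, on a finite set of sampled stochastic costs, to a numerically stable log-mean-exp. A sketch under our own naming; the max-subtraction trick is a standard stabilization, not part of the paper:

```python
import numpy as np

def smoothed_cost_and_weights(S, alpha, gamma):
    """Sample-based estimate of J_alpha = -(gamma+alpha) * log < exp(-S/(gamma+alpha)) >
    from an array of stochastic costs S (one entry per rollout), together with
    the normalized weights of the minimizer p*_{alpha,theta} of equation (12)."""
    S = np.asarray(S, dtype=float)
    a = -S / (gamma + alpha)
    m = a.max()
    log_mean_exp = m + np.log(np.mean(np.exp(a - m)))  # stable log-mean-exp
    J = -(gamma + alpha) * log_mean_exp
    w = np.exp(a - m)
    w /= w.sum()
    return J, w
```

The two limits discussed in the text are visible numerically: for very large alpha the estimate approaches the plain average of S (direct cost), while for gamma + alpha -> 0 it approaches the minimum sampled cost (a soft-min).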
The larger \alpha, the weaker the smoothing; in the limiting case \alpha \to \infty, smoothing is turned off, as we can see from equation (13): for very large \alpha, the exponential and the logarithmic function linearise, J_\alpha(\theta) \to C(p_{u_\theta}), and we recover direct cost optimization. In the limiting case \alpha \to 0, we recover the PICE method: the optimizer p^*_{\alpha,\theta} becomes equal to the optimal density p^*, and the gradient of the smoothed cost (14) becomes proportional to the PICE gradient (6):

\lim_{\alpha \to 0} \frac{1}{\alpha} \nabla_\theta J_\alpha(\theta) = \nabla_\theta KL(p^* \| p_{u_\theta}).

Varying \gamma changes the control problem and thus its optimal solution. For \gamma \to 0, the control cost becomes zero. In this case the cost consists only of the state cost and arbitrarily large controls are allowed. We get

J_\alpha(\theta) = -\alpha \log \left\langle \exp\left( -\frac{1}{\alpha} V(\tau) \right) \right\rangle_{p_{u_\theta}}.

This expression is identical to the risk-sensitive control cost proposed by Fleming and Sheu (2002); Fleming and McEneaney (1995); van den Broek et al. (2010). Thus, for \gamma = 0, the smoothing parameter \alpha controls the risk-sensitivity, resulting in risk-seeking objectives for \alpha > 0 and risk-avoiding objectives for \alpha < 0. In the limiting case \gamma \to \infty, the problem becomes trivial; the optimally controlled dynamics becomes equal to the uncontrolled dynamics: p^* \to p_0, see equation (5), and u^* \to 0.

If both parameters \alpha and \gamma are small, the problem is hard (see Ruiz and Kappen (2017); Thalmeier et al. (2016)), as many samples are needed to estimate the smoothed cost. The problem becomes feasible if either \alpha or \gamma is increased. Increasing \gamma, however, changes the control problem, while increasing \alpha weakens the effect of smoothing. In the remainder of this article we analyze, first theoretically in Section 4 and then numerically in Section 6, the effect that a finite \alpha > 0 has on the iterative optimization of the control u_\theta for a fixed value of \gamma.

4. The Effect of Cost Function Smoothing on Policy Optimization

We introduced smoothing as a way to speed up policy optimization compared to a direct optimization of the cost.
In this section we analyze policy optimization with and without smoothing and show analytically how smoothing can speed up policy optimization. To simplify notation, we overload p_{u_\theta} \to \theta, so that we write C(p_{u_\theta}) \to C(\theta) and KL(p_{u_{\theta'}} \| p_{u_\theta}) \to KL(\theta' \| \theta). We use a trust region constraint to robustly optimize the policy (compare Peters et al. (2010); Schulman et al. (2015); Gómez et al. (2014)). There are two options. On the one hand, we can directly optimize the cost C:

Definition 1 We define the direct update with stepsize E as an update \theta \to \theta' with \theta' = \Theta^C_E(\theta) and

\Theta^C_E(\theta) := \arg\min_{\theta' \,:\, KL(\theta' \| \theta) \le E} C(\theta').

The direct update results in the minimal cost that can be achieved after one single update. We define the optimal one-step cost

C^*_E(\theta) := \min_{\theta' \,:\, KL(\theta' \| \theta) \le E} C(\theta').

On the other hand, we can optimize the smoothed cost J_\alpha:

Definition 2 We define the smoothed update with stepsize E as an update \theta \to \theta' with \theta' = \Theta^{J_\alpha}_E(\theta) and

\Theta^{J_\alpha}_E(\theta) := \arg\min_{\theta' \,:\, KL(\theta' \| \theta) \le E} J_\alpha(\theta').    (15)

While a direct update achieves the minimal cost that can be achieved after a single update, we show below that a smoothed update can result in a faster cost reduction if more than one update step is performed.

Definition 3 We define the optimal two-step update \theta \to \Theta^* \to \Theta^{**} as an update that results in the lowest cost that can be achieved with a two-step update \theta \to \theta' \to \theta'' with fixed stepsizes E and E', respectively:

(\Theta^*, \Theta^{**}) := \arg\min_{\theta', \theta'' \,:\, KL(\theta'' \| \theta') \le E', \; KL(\theta' \| \theta) \le E} C(\theta''),

and the corresponding optimal two-step cost

C^*_{E,E'}(\theta) := \min_{\theta' \,:\, KL(\theta' \| \theta) \le E} \; \min_{\theta'' \,:\, KL(\theta'' \| \theta') \le E'} C(\theta'') = \min_{\theta' \,:\, KL(\theta' \| \theta) \le E} C\left( \Theta^C_{E'}(\theta') \right).    (16)

Figure 1 illustrates how such an optimal two-step update leads to a faster decrease of the cost than two consecutive direct updates.
Theorem 1 Statement 1: For all E, \alpha there exists an E' such that a smoothed update with stepsize E followed by a direct update with stepsize E' is an optimal two-step update:

\Theta^* = \Theta^{J_\alpha}_E(\theta), \qquad \Theta^{**} = \Theta^C_{E'}(\Theta^*), \qquad C(\Theta^{**}) = C^*_{E,E'}(\theta).

The size of the second step, E', is a function of \theta and \alpha. Statement 2: E' is monotonically decreasing in \alpha.

While it is evident from equation (16) that the second step of the optimal two-step update must be a direct update, the statement that the first step is a smoothed update is non-trivial. We prove this and Statement 2 in Appendix F. Direct updates are myopic: they do not take into account successive steps and are thus suboptimal when more than one update is needed. Smoothed updates, on the other hand, as we see in Theorem 1, anticipate a subsequent step and minimize the cost that results from this two-step update. Hence smoothed updates favour a greater cost reduction in the future over maximal cost reduction in the current step. The strength of this anticipatory effect depends on the smoothing strength, which is controlled by the smoothing parameter \alpha: for large \alpha, smoothing is weak and the size E' of the anticipated second step becomes small. Figure 1(B) illustrates that in this case, when E' becomes small, smoothed updates become more similar to direct updates. In the limiting case \alpha \to \infty, the difference between smoothed and direct updates vanishes completely, as J_\alpha(\theta) \to C(\theta) (see Section 3.2). We expect that, due to this anticipatory effect, iterating smoothed updates also leads to a faster decrease of the cost than iterating direct updates when multiple update steps are performed. We confirm this by numerical studies. Furthermore, we expect this accelerating effect of smoothing to be stronger for smaller values of \alpha. On the other hand, as we discuss in the next section, for smaller values of \alpha it is harder to accurately perform the smoothed updates. Therefore we expect optimal performance at an intermediate value of \alpha.
Based on this, we build an algorithm in the next section that aims to accelerate policy optimization by cost function smoothing.

Figure 1: Illustration of optimal two-step updates compared with two consecutive direct updates. Illustrated is a two-dimensional cost landscape C(\theta) parametrized by \theta. Dark colors represent low cost, while light colors represent high cost. Green dots indicate the optimal two-step update \theta \to \Theta^* \to \Theta^{**}, while red dots indicate two consecutive direct updates \theta \to \theta' \to \theta'' with \theta' = \Theta^C_E(\theta) and \theta'' = \Theta^C_{E'}(\theta'). The dashed circles indicate trust regions. \theta', \theta'' and \Theta^{**} are the minimizers of the cost in the trust regions around \theta, \theta' and \Theta^*, respectively. \Theta^* is chosen such that the cost C(\Theta^{**}) after the subsequent direct update is minimized. In both panels, the final cost after an optimal two-step update, C(\Theta^{**}), is smaller than the final cost after two direct updates, C(\theta''). (A) Equal sizes of the update steps, E = E'. (B) When the size of the second step becomes small, E' \ll E, the smoothed update \theta \to \Theta^* becomes more similar to the direct update \theta \to \theta'.

5. Numerical Method

In this section we develop an algorithm that takes a parametrized control function u_\theta with initial parameters \theta_0 and updates these parameters in each iteration n using smoothed updates.

5.1 Smoothed and Direct Updates Using Natural Gradients

So far we have specified the smoothed updates \theta_{n+1} = \Theta^{J_\alpha}_E(\theta_n) (15) in an abstract manner and left open how to perform this optimization step. To compute an explicit expression we introduce a Lagrange multiplier \beta and express the constrained optimization (15) as an unconstrained optimization:

\theta_{n+1} = \arg\min_{\theta'} J_\alpha(\theta') + \beta KL(\theta' \| \theta_n).    (17)

Following Schulman et al. (2015), we assume that the trust region size E is small. For small E \ll 1 we get \beta \gg 1 and can expand J_\alpha(\theta') to first and KL(\theta' \| \theta_n) to second order (see Appendix G for details).
This gives

\theta_{n+1} = \theta_n - \beta^{-1} F^{-1} \nabla_{\theta'} J_\alpha(\theta') \big|_{\theta' = \theta_n},    (18)

a natural gradient update with the Fisher matrix F = \nabla_{\theta'} \nabla_{\theta'}^T KL(\theta' \| \theta_n) \big|_{\theta' = \theta_n} (we use the conjugate gradient method to approximately compute the natural gradient for high-dimensional parameter spaces; see Appendix J or Schulman et al. (2015) for details). The parameter \beta is determined using a line search such that^3

KL(\theta_n \| \theta_{n+1}) = E.    (19)

Note that for direct updates this derivation is the same; just replace J_\alpha by C.

5.2 Reliable Gradient Estimation Using Adaptive Smoothing

To compute smoothed updates using equation (18) we need the gradient of the smoothed cost. We assume full parametrization and use equation (14), which can be estimated using N weighted samples drawn from the distribution p_{u_\theta}:

\nabla_\theta J_\alpha(\theta) \approx -\alpha \sum_{i=1}^N w_i \nabla_\theta \log p_{u_\theta}(\tau^i),    (20)

with weights given by

w_i = \frac{1}{Z} \exp\left( -\frac{1}{\gamma + \alpha} S^\gamma_{p_{u_\theta}}(\tau^i) \right), \qquad Z = \sum_{i=1}^N \exp\left( -\frac{1}{\gamma + \alpha} S^\gamma_{p_{u_\theta}}(\tau^i) \right).

The variance of this estimator depends sensitively on the entropy of the weights,

H_N(w) = -\sum_{i=1}^N w_i \log(w_i).

If the entropy is low, the total weight is concentrated on a few particles. This results in a poor gradient estimator where only a few of the particles actually contribute. This concentration depends on the smoothing parameter \alpha: for small \alpha, the weights are concentrated on a few samples, resulting in a low weight entropy and thus a high variance of the gradient estimator. As small \alpha corresponds to strong smoothing, we want \alpha to be as small as possible, but large enough to allow a reliable gradient estimation. Therefore, we set a bound on the weight entropy H_N(w). To get a bound that is independent of the number of samples N, we use that in the limit N \to \infty the weight entropy is monotonically related to the KL-divergence KL(p^*_{\alpha,\theta} \| p_{u_\theta}):

KL(p^*_{\alpha,\theta} \| p_{u_\theta}) = \lim_{N \to \infty} \left( \log N - H_N(w) \right)

3. For practical reasons, we reverse the arguments of the KL-divergence, since it is easier to estimate it from samples drawn from the first argument. For very small values, the KL is approximately symmetric in its arguments.
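The weight-entropy criterion lends itself to a simple search over the smoothing parameter. The following sketch is our own illustration (the function names, the grid-based search, and the bound symbol `delta` are assumptions; the paper uses a line search): it picks the smallest alpha on a grid whose estimated KL, log N - H_N(w), stays below the bound.

```python
import numpy as np

def weight_entropy(w):
    """H_N(w) = -sum_i w_i log w_i for normalized weights."""
    w = w[w > 0]
    return -float(np.sum(w * np.log(w)))

def adapt_alpha(S, gamma, delta, alpha_grid):
    """Return the smallest alpha on the grid whose weight-based KL estimate,
    log N - H_N(w), satisfies the smoothing bound delta."""
    S = np.asarray(S, dtype=float)
    N = len(S)
    for alpha in sorted(alpha_grid):
        a = -S / (gamma + alpha)
        w = np.exp(a - a.max())
        w /= w.sum()
        if np.log(N) - weight_entropy(w) <= delta:
            return alpha
    return max(alpha_grid)  # fall back to the weakest smoothing on the grid
```

Because the KL estimate decreases monotonically in alpha (Appendix H), the first grid value that satisfies the bound is the smallest feasible one.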
Also, the equality in (19) differs from Schulman et al. (2015), which optimizes a value function within the trust region, i.e., KL(\theta_n \| \theta_{n+1}) \le E.

(see Appendix I for a proof). This provides a method for choosing \alpha independently of the number of samples: we set the constraint KL(p^*_{\alpha,\theta} \| p_{u_\theta}) \le \delta and determine the smallest \alpha that satisfies this condition using a line search. Large values of \delta correspond to small values of \alpha (see Appendix H) and therefore to strong smoothing; we thus call \delta the smoothing strength.

5.3 Formulating a Model-Free Algorithm

We can compute the gradient (20) and the KL-divergence while treating the dynamical system as a black box. For this we write the probability distribution p_{u_\theta} over trajectories \tau as a Markov process.

For any E' > 0, evaluating the infimum in the definition of J_\alpha(\theta) at \theta' = \Theta^C_{E'}(\theta) gives the upper bound

J_\alpha(\theta) \le C\left( \Theta^C_{E'}(\theta) \right) + \alpha KL\left( \Theta^C_{E'}(\theta) \| \theta \right).

Further, as \Theta^C_{E'}(\theta) lies in the trust region \{\theta' : KL(\theta' \| \theta) \le E'\}, we have KL(\Theta^C_{E'}(\theta) \| \theta) \le E', so we can write

C\left( \Theta^C_{E'}(\theta) \right) + \alpha KL\left( \Theta^C_{E'}(\theta) \| \theta \right) \le C\left( \Theta^C_{E'}(\theta) \right) + \alpha E' \quad \Rightarrow \quad J_\alpha(\theta) \le C\left( \Theta^C_{E'}(\theta) \right) + \alpha E'.

Next we minimize both sides of this inequality within the trust region \{\theta' : KL(\theta' \| \theta) \le E\}. We use that J_\alpha\left( \Theta^{J_\alpha}_E(\theta) \right) = \min_{\theta' : KL(\theta' \| \theta) \le E} J_\alpha(\theta') to obtain

J_\alpha\left( \Theta^{J_\alpha}_E(\theta) \right) \le \min_{\theta' : KL(\theta' \| \theta) \le E} \left( C\left( \Theta^C_{E'}(\theta') \right) + \alpha E' \right).    (26)

Now we use Lemma 2 and rewrite the left-hand side of this inequality:

J_\alpha\left( \Theta^{J_\alpha}_E(\theta) \right) = C\left( \Theta^C_{E'}\left( \Theta^{J_\alpha}_E(\theta) \right) \right) \Big|_{E'=E'_\alpha(\theta)} + \alpha E'_\alpha(\theta),

with E'_\alpha(\theta) := E_\alpha\left( \Theta^{J_\alpha}_E(\theta) \right). Plugging this back into (26) we get

C\left( \Theta^C_{E'}\left( \Theta^{J_\alpha}_E(\theta) \right) \right) \Big|_{E'=E'_\alpha(\theta)} + \alpha E'_\alpha(\theta) \le \min_{\theta' : KL(\theta' \| \theta) \le E} \left( C\left( \Theta^C_{E'}(\theta') \right) + \alpha E' \right).

As this inequality holds for any E' > 0, we can plug E' = E'_\alpha(\theta) into the right-hand side and obtain

C\left( \Theta^C_{E'}\left( \Theta^{J_\alpha}_E(\theta) \right) \right) \Big|_{E'=E'_\alpha(\theta)} + \alpha E'_\alpha(\theta) \le \min_{\theta' : KL(\theta' \| \theta) \le E} C\left( \Theta^C_{E'}(\theta') \right) \Big|_{E'=E'_\alpha(\theta)} + \alpha E'_\alpha(\theta).

We subtract \alpha E'_\alpha(\theta) on both sides:

C\left( \Theta^C_{E'}\left( \Theta^{J_\alpha}_E(\theta) \right) \right) \Big|_{E'=E'_\alpha(\theta)} \le \min_{\theta' : KL(\theta' \| \theta) \le E} C\left( \Theta^C_{E'}(\theta') \right) \Big|_{E'=E'_\alpha(\theta)}.

Using equation (16) gives

C\left( \Theta^C_{E'}\left( \Theta^{J_\alpha}_E(\theta) \right) \right) \Big|_{E'=E'_\alpha(\theta)} \le C^*_{E,E'}(\theta) \Big|_{E'=E'_\alpha(\theta)},

which concludes the proof.
F.3 Proof of Statement 2

Here we show that E' = E'_\alpha(\theta) is a monotonically decreasing function of \alpha. E'_\alpha(\theta) is given by

E'_\alpha(\theta) = E_\alpha\left( \Theta^{J_\alpha}_E(\theta) \right) = KL\left( \theta^*_{\alpha,\theta'} \| \theta' \right) \Big|_{\theta' = \Theta^{J_\alpha}_E(\theta)}.

Moreover,

\alpha KL\left( \theta^*_{\alpha,\theta'} \| \theta' \right) + C\left( \theta^*_{\alpha,\theta'} \right) \Big|_{\theta' = \Theta^{J_\alpha}_E(\theta)} = \inf_{\theta''} \left( \alpha KL(\theta'' \| \theta') + C(\theta'') \right) \Big|_{\theta' = \Theta^{J_\alpha}_E(\theta)} = \min_{\theta' : KL(\theta' \| \theta) \le E} \inf_{\theta''} \left( \alpha KL(\theta'' \| \theta') + C(\theta'') \right).

For convenience we introduce a shorthand notation for the minimizers:

\theta_\alpha := \Theta^{J_\alpha}_E(\theta), \qquad \theta^*_\alpha := \theta^*_{\alpha,\theta'} \big|_{\theta' = \Theta^{J_\alpha}_E(\theta)}.

We compare \alpha_1 \ge 0 with E'_{\alpha_1}(\theta) := KL(\theta^*_{\alpha_1} \| \theta_{\alpha_1}) and \alpha_2 \ge 0 with E'_{\alpha_2}(\theta) := KL(\theta^*_{\alpha_2} \| \theta_{\alpha_2}), and assume that E'_{\alpha_1}(\theta) < E'_{\alpha_2}(\theta). We show that from this it follows that \alpha_1 \ge \alpha_2.

Proof As (\theta^*_{\alpha_1}, \theta_{\alpha_1}) minimizes \alpha_1 KL(\theta'' \| \theta') + C(\theta''), we have

\alpha_1 KL(\theta^*_{\alpha_1} \| \theta_{\alpha_1}) + C(\theta^*_{\alpha_1}) \le \alpha_1 KL(\theta^*_{\alpha_2} \| \theta_{\alpha_2}) + C(\theta^*_{\alpha_2}) \quad \Rightarrow \quad \alpha_1 E'_{\alpha_1}(\theta) + C(\theta^*_{\alpha_1}) \le \alpha_1 E'_{\alpha_2}(\theta) + C(\theta^*_{\alpha_2}),

and analogously for \alpha_2:

\alpha_2 KL(\theta^*_{\alpha_1} \| \theta_{\alpha_1}) + C(\theta^*_{\alpha_1}) \ge \alpha_2 KL(\theta^*_{\alpha_2} \| \theta_{\alpha_2}) + C(\theta^*_{\alpha_2}) \quad \Rightarrow \quad \alpha_2 E'_{\alpha_1}(\theta) + C(\theta^*_{\alpha_1}) \ge \alpha_2 E'_{\alpha_2}(\theta) + C(\theta^*_{\alpha_2}).

With E'_{\alpha_1}(\theta) < E'_{\alpha_2}(\theta) we get

\alpha_1 \ge \frac{C(\theta^*_{\alpha_1}) - C(\theta^*_{\alpha_2})}{E'_{\alpha_2}(\theta) - E'_{\alpha_1}(\theta)} \ge \alpha_2.

We showed that from E'_{\alpha_1}(\theta) < E'_{\alpha_2}(\theta) it follows that \alpha_1 \ge \alpha_2, which proves that E'_\alpha(\theta) is monotonically decreasing in \alpha.

Appendix G. Smoothed Updates for Small Update Steps E

We want to compute equation (17) for small E, which corresponds to large \beta. Assuming a smooth dependence of p_{u_\theta} on \theta, bounding KL(\theta' \| \theta_n) to a very small value allows us to do a Taylor expansion, which we truncate at second order:

\arg\min_{\theta'} J_\alpha(\theta') + \beta KL(\theta' \| \theta_n) \approx \arg\min_{\theta'} (\theta' - \theta_n)^T \nabla_{\theta'} J_\alpha(\theta') \big|_{\theta'=\theta_n} + \tfrac{1}{2} (\theta' - \theta_n)^T (H + \beta F)(\theta' - \theta_n) = \theta_n - \beta^{-1} F^{-1} \nabla_{\theta'} J_\alpha(\theta') \big|_{\theta'=\theta_n} + O(\beta^{-2}),

with

H = \nabla_{\theta'} \nabla_{\theta'}^T J_\alpha(\theta') \big|_{\theta'=\theta_n}, \qquad F = \nabla_{\theta'} \nabla_{\theta'}^T KL(\theta' \| \theta_n) \big|_{\theta'=\theta_n}.

See also Martens (2014). We used that E \ll 1 implies \beta \gg 1. With this, the term \beta F dominates over the Hessian H, and thus the Hessian no longer appears in the update equation. This defines a natural gradient update with stepsize \beta^{-1}.

Appendix H. KL(p^*_{\alpha,\theta} \| p_{u_\theta}) is Monotonic in \alpha

Now we show that KL(p^*_{\alpha,\theta} \| p_{u_\theta}) is a monotonic function of \alpha.
We differentiate

KL(p^*_{\alpha,\theta} \| p_{u_\theta}) = \left\langle \ln \frac{p^*_{\alpha,\theta}}{p_{u_\theta}} \right\rangle_{p^*_{\alpha,\theta}}

with respect to \alpha. From equation (12),

\frac{p^*_{\alpha,\theta}(\tau)}{p_{u_\theta}(\tau)} = \frac{1}{Z^\alpha_{p_{u_\theta}}} \exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau) \right), \qquad Z^\alpha_{p_{u_\theta}} = \left\langle \exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau) \right) \right\rangle_{p_{u_\theta}},

so that

\ln \frac{p^*_{\alpha,\theta}(\tau)}{p_{u_\theta}(\tau)} = -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau) - \ln Z^\alpha_{p_{u_\theta}}.

Differentiating the density ratio gives

\frac{\partial}{\partial \alpha} \frac{p^*_{\alpha,\theta}}{p_{u_\theta}} = \frac{1}{(\gamma+\alpha)^2} \frac{p^*_{\alpha,\theta}}{p_{u_\theta}} \left( S^\gamma_{p_{u_\theta}}(\tau) - \left\langle S^\gamma_{p_{u_\theta}} \right\rangle_{p^*_{\alpha,\theta}} \right),

where we used \frac{\partial}{\partial \alpha} \ln Z^\alpha_{p_{u_\theta}} = \frac{1}{(\gamma+\alpha)^2} \left\langle S^\gamma_{p_{u_\theta}} \right\rangle_{p^*_{\alpha,\theta}}. Since \left\langle \frac{\partial}{\partial \alpha} p^*_{\alpha,\theta} \right\rangle = 0, only the first term contributes, and we get

\frac{\partial}{\partial \alpha} KL(p^*_{\alpha,\theta} \| p_{u_\theta}) = \frac{1}{(\gamma+\alpha)^2} \left\langle \left( S^\gamma_{p_{u_\theta}}(\tau) - \left\langle S^\gamma_{p_{u_\theta}} \right\rangle_{p^*_{\alpha,\theta}} \right) \left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau) - \ln Z^\alpha_{p_{u_\theta}} \right) \right\rangle_{p^*_{\alpha,\theta}} = -\frac{1}{(\gamma+\alpha)^3} \mathrm{Var}_{p^*_{\alpha,\theta}}\left[ S^\gamma_{p_{u_\theta}}(\tau) \right] \le 0.

Therefore KL(p^*_{\alpha,\theta} \| p_{u_\theta}) is a monotonically decreasing function of \alpha.

Appendix I. Proof of the Equivalence of Weight Entropy and KL-Divergence

We want to show that

\lim_{N\to\infty} \left( \log N - H_N(w) \right) = \lim_{N\to\infty} \left( \log N + \sum_{i=1}^N w_i \log w_i \right) = KL(p^*_{\alpha,\theta} \| p_{u_\theta}),

where the samples \tau^i are drawn from p_{u_\theta} and the weights are given by

w_i = \frac{\exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau^i) \right)}{\sum_{j=1}^N \exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau^j) \right)}.

Writing \log N + \sum_i w_i \log w_i = \sum_i w_i \log(N w_i) and noting that

N w_i = \frac{\exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau^i) \right)}{\frac{1}{N}\sum_{j=1}^N \exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau^j) \right)},

we can replace the empirical average \frac{1}{N}\sum_j by the expectation \langle \cdot \rangle_{p_{u_\theta}} in the limit N \to \infty:

\lim_{N\to\infty} \sum_i w_i \log(N w_i) = \left\langle \frac{\exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau) \right)}{\left\langle \exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau) \right) \right\rangle_{p_{u_\theta}}} \log \frac{\exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau) \right)}{\left\langle \exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau) \right) \right\rangle_{p_{u_\theta}}} \right\rangle_{p_{u_\theta}}.

Using equation (12), this gives

\left\langle \frac{p^*_{\alpha,\theta}(\tau)}{p_{u_\theta}(\tau)} \log \frac{p^*_{\alpha,\theta}(\tau)}{p_{u_\theta}(\tau)} \right\rangle_{p_{u_\theta}} = \left\langle \log \frac{p^*_{\alpha,\theta}(\tau)}{p_{u_\theta}(\tau)} \right\rangle_{p^*_{\alpha,\theta}} = KL(p^*_{\alpha,\theta} \| p_{u_\theta}).

Appendix J.
Inversion of the Fisher Matrix

We compute an approximation to the natural gradient g_f = F^{-1} g by approximately solving the linear equation F g_f = g using truncated conjugate gradient, with the standard gradient g and the Fisher matrix F = \nabla_{\theta'} \nabla_{\theta'}^T KL(p_{u_{\theta'}} \| p_{u_{\theta_n}}) (see Appendix G). We use an efficient way to compute the Fisher-vector product F y (Schulman et al., 2015) using an automatic differentiation package: first, for each rollout i and timepoint t, the symbolic expression for the gradient of the KL multiplied by a vector y is computed,

a_{i,t}(\theta_{n+1}) = y^T \nabla_{\theta_{n+1}} \log \frac{\pi_{\theta_n}(a^i_t \mid t, x^i_t)}{\pi_{\theta_{n+1}}(a^i_t \mid t, x^i_t)}.

Then we differentiate this scalar quantity once more, sum over all times, and average over the samples. This gives the Fisher-vector product F y.
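The truncated conjugate gradient solve only needs a Fisher-vector-product routine, never the matrix F itself. A minimal sketch of standard CG under our own naming; the paper's implementation details (truncation length, damping) may differ:

```python
import numpy as np

def conjugate_gradient(Fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g for the natural gradient x = F^{-1} g,
    given only a Fisher-vector-product routine Fvp(v) -> F v."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x, with x = 0 initially
    p = r.copy()          # search direction
    rs = float(r @ r)
    for _ in range(iters):
        Fp = Fvp(p)
        step = rs / float(p @ Fp)
        x += step * p
        r -= step * Fp
        rs_new = float(r @ r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

For an n-dimensional problem, exact convergence takes at most n iterations; truncating earlier trades accuracy of the natural gradient for compute, which is the point of the approximation.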