# Adaptive Smoothing for Path Integral Control

Journal of Machine Learning Research 21 (2020) 1-37. Submitted 9/18; Revised 12/19; Published 9/20.

Dominik Thalmeier, d.thalmeier@science.ru.nl, Radboud University Nijmegen, Nijmegen, The Netherlands

Hilbert J. Kappen, b.kappen@science.ru.nl, Radboud University Nijmegen, Nijmegen, The Netherlands

Simone Totaro, simone.totaro@gmail.com, Universitat Pompeu Fabra, Barcelona, Spain

Vicenç Gómez, vicen.gomez@upf.edu, Universitat Pompeu Fabra, Barcelona, Spain

Editor: Benjamin Recht

Abstract

In path integral control problems a representation of an optimally controlled dynamical system can be computed formally and can serve as a guidepost to learn a parametrized policy. The Path Integral Cross-Entropy (PICE) method tries to exploit this, but is hampered by poor sample efficiency. We propose a model-free algorithm called ASPIC (Adaptive Smoothing of Path Integral Control) that applies an inf-convolution to the cost function to speed up convergence of policy optimization. We identify PICE as the infinite-smoothing limit of this technique and show that the sample efficiency problems of PICE disappear for finite levels of smoothing. For zero smoothing, ASPIC becomes a greedy optimization of the cost, which is the standard approach in current reinforcement learning. ASPIC adapts the smoothing parameter to keep the variance of the gradient estimator at a predefined level, independently of the number of samples. We show analytically and empirically that intermediate levels of smoothing are optimal, which renders the new method superior to both PICE and direct cost optimization.

Keywords: Path Integral Control, Entropy-Regularization, Cost Smoothing

1. Introduction

How to choose an optimal action? For noisy dynamical systems, stochastic optimal control theory provides a framework to answer this question.
Optimal control is framed as an optimization problem: find the control that minimizes an expected cost function. For non-linear dynamical systems that are continuous in time and space, this problem is in general hard.

©2020 Dominik Thalmeier, Hilbert J. Kappen, Simone Totaro, Vicenç Gómez. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-624.html.

A method that has proven to work well is to introduce a parametrized policy such as a neural network (Mnih et al., 2015; Levine et al., 2016; Duan et al., 2016; François-Lavet et al., 2018) and greedily optimize the expected cost using gradient descent (Williams, 1992; Peters and Schaal, 2008; Schulman et al., 2015; Heess et al., 2017). To achieve a robust decrease of the expected cost it is important to ensure that in each step the updated policy stays in the proximity of the old policy (Duan et al., 2016). This can be achieved by enforcing a trust region constraint (Peters et al., 2010; Schulman et al., 2015) or by using adaptive regularization that penalizes strong deviations of the new policy from the old policy (Heess et al., 2017). However, the applicability of these methods is limited, as in each iteration of the algorithm, samples from the controlled system have to be computed, either from a simulator or from a real system. We want to increase the convergence rate of policy optimization to reduce the number of simulations needed. To this end we consider path integral control problems (Kappen, 2005; Todorov, 2009; Kappen et al., 2012), which offer an alternative approach to direct cost optimization, and explore whether this allows us to speed up policy optimization.
This class of control problems permits arbitrary non-linear dynamics and state cost, but requires a linear dependence of the control on the dynamics and a quadratic control cost (Kappen, 2005; Bierkens and Kappen, 2014; Thijssen and Kappen, 2015). These restrictions allow us to obtain an explicit expression for the probability density of optimally controlled system trajectories. Through this, an information-theoretical measure of the deviation of the current control policy from the optimal control can be calculated. The Path Integral Cross-Entropy (PICE) method (Kappen and Ruiz, 2016) proposes to use this measure as a pseudo-objective for policy optimization. However, there is as yet no comparative study on whether PICE actually offers an advantage over direct cost optimization; and, in its original form (Kappen and Ruiz, 2016), the PICE method does not scale well to complex problems because the PICE gradient is hard to estimate if the current controller is not close enough to the optimal control (Ruiz and Kappen, 2017). Furthermore, the PICE method was introduced with standard gradient descent and does not use trust regions to ensure robust updates, which have been shown to be effective for policy optimization (Duan et al., 2016). In this work we propose and study a new kind of smoothing technique for the cost function that allows us to interpolate between the optimization of the direct cost and the PICE objective. Optimizing this smoothed cost using a trust-region-based method yields an approach that is efficient and does not suffer from the feasibility issues of PICE. Our work is based on recently proposed smoothing techniques to speed up convergence in deep neural networks (Chaudhari et al., 2018). We adapt this smoothing technique to path integral control problems. In contrast to Chaudhari et al.
(2018), smoothing for path integral control problems can be solved analytically, and we obtain an expression for the gradient that can be computed directly from Monte Carlo samples. The strength of smoothing is regulated by a parameter. Remarkably, this parameter can be determined independently of the number of samples. In the limits of this smoothing parameter we recover the PICE method for infinitely strong smoothing and direct cost optimization for zero smoothing, respectively. As in Chaudhari et al. (2018), the minimum of the smoothed cost, and thus the optimal control policy, remains the same for all levels of smoothing. We provide a theoretical argument why smoothing is expected to speed up optimization and conduct numerical experiments on different control tasks, which show this accelerative effect in practice. For this we develop an algorithm called ASPIC (Adaptive Smoothing for Path Integral Control) that uses cost smoothing to speed up policy optimization. The algorithm adjusts the smoothing parameter in each step to keep the variance of the gradient estimator at a predefined level. To ensure robust updates of the policy, ASPIC enforces a trust region constraint; similar to Schulman et al. (2015), this is achieved with natural gradient updates and an adaptive stepsize. Like other policy-gradient-based methods (Williams, 1992; Peters and Schaal, 2008; Schulman et al., 2015; Heess et al., 2017), ASPIC is model-free. Many policy optimization algorithms update the control policy based on a direct optimization of the cost; examples are Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) and Path-Integral Relative Entropy Policy Search (PIREPS) (Gómez et al., 2014), where the latter was developed specifically for path integral control problems. The main novelty of this work is the application to path integral control problems of the idea of smoothing, as introduced in Chaudhari et al. (2018).
This technique outperforms direct cost optimization, achieving faster convergence rates with only a negligible amount of computational overhead.

2. Path Integral Control Problems

Consider the (multivariate) dynamical system

\dot{x}_t = f(x_t, t) + g(x_t, t) \left( u(x_t, t) + \xi_t \right),    (1)

with initial condition x_0. The control policy is implemented by the control function u(x, t), which is additive to the white noise \xi_t with variance \nu \, dt. Given a control function u and a time horizon T, this dynamical system induces a probability distribution p_u(\tau) over state trajectories \tau = \{x_t \mid t : 0 < t \le T\} with initial condition x_0. We define the regularized expected cost

C(p_u) = \langle V(\tau) \rangle_{p_u} + \gamma KL(p_u \| p_0),    (2)

with V(\tau) = \int_0^T V(x_t, t) \, dt, where the strength of the regularization KL(p_u \| p_0) is controlled by the parameter \gamma. The Kullback-Leibler divergence KL(p_u \| p_0) assigns high cost to controls u that bring the probability distribution p_u far away from the uncontrolled dynamics p_0, for which u(x_t, t) = 0. We can also rewrite the regularizer KL(p_u \| p_0) directly in terms of the control function u by using the Girsanov theorem (compare Thijssen and Kappen (2015)). The regularization then takes the form of a quadratic control cost,

KL(p_u \| p_0) = \left\langle \int_0^T \left( \tfrac{1}{2} u(x_t,t)^T u(x_t,t) + u(x_t,t)^T \xi_t \right) dt \right\rangle_{p_u} = \left\langle \int_0^T \tfrac{1}{2} u(x_t,t)^T u(x_t,t) \, dt \right\rangle_{p_u},

where we used that \langle u(x_t,t)^T \xi_t \rangle_{p_u} = 0. This shows that the regularization KL(p_u \| p_0) assigns higher cost to large values of the controller u. The path integral control problem is to find the optimal control function u^* that minimizes the regularized cost C(p_u):

u^* = \arg\min_u C(p_u).    (3)

For a more complete introduction to path integral control problems, see Thijssen and Kappen (2015); Kappen and Ruiz (2016).

2.1 Direct Cost Optimization Using Gradient Descent

A standard approach to find an optimal control function is to introduce a parametrized controller u_\theta(x_t, t) (Williams, 1992; Schulman et al., 2015; Heess et al., 2017).
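The dynamics (1) and the quadratic control cost can be simulated with a straightforward Euler-Maruyama scheme. The following sketch is our own illustration, not the paper's code: the function names, the elementwise treatment of g, and the omission of the noise-scaling constant in the control cost are assumptions.

```python
import numpy as np

def rollout(f, g, u, x0, T, dt, nu, rng):
    """Euler-Maruyama simulation of x_dot = f(x,t) + g(x,t) (u(x,t) + xi_t),
    where xi_t is white noise with variance nu*dt per time increment.

    Returns the state trajectory and the sampled quadratic control cost
    int_0^T 0.5 * u^T u dt; averaging the latter over rollouts estimates
    the KL regularizer of equation (2) (up to the noise scaling)."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    traj = [x.copy()]
    control_cost = 0.0
    for k in range(int(T / dt)):
        t = k * dt
        uc = np.atleast_1d(u(x, t))
        # xi * dt has variance nu * dt, so xi has std sqrt(nu / dt)
        xi = rng.normal(0.0, np.sqrt(nu / dt), size=uc.shape)
        x = x + (f(x, t) + g(x, t) * (uc + xi)) * dt
        control_cost += 0.5 * float(uc @ uc) * dt
        traj.append(x.copy())
    return np.array(traj), control_cost
```

With the uncontrolled policy u = 0 this draws samples from p_0; plugging in a parametrized controller draws samples from p_u.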
This parametrizes the path probabilities p_{u_\theta} and allows us to optimize the expected cost C(p_{u_\theta}) (2) using stochastic gradient descent on the cost function:

\nabla_\theta C(p_{u_\theta}) = \left\langle S^\gamma_{p_{u_\theta}}(\tau) \, \nabla_\theta \log p_{u_\theta}(\tau) \right\rangle_{p_{u_\theta}},

with the stochastic cost S^\gamma_{p_{u_\theta}}(\tau) := V(\tau) + \gamma \log \frac{p_{u_\theta}(\tau)}{p_0(\tau)} (see Appendix A for details).

2.2 The Cross-Entropy Method for Path Integral Control Problems

An alternative approach to direct cost optimization was introduced as the PICE method in Kappen and Ruiz (2016). It uses the fact that we can obtain an expression for p_{u^*}, the probability density of state trajectories induced by a system with the optimal controller u^*: p_{u^*} = \arg\min_{p_u} C(p_u), with C(p_u) given by equation (2). Finding p_{u^*} is an optimization problem over the space of all probability distributions p_u that are induced by the controlled dynamical system (1). It has been shown (Bierkens and Kappen, 2014; Thijssen and Kappen, 2015) that we can solve this by replacing the minimization over p_u with a minimization over all path probability distributions p:

p_{u^*} \equiv p^* := \arg\min_p C(p) = \arg\min_p \left( \langle V(\tau) \rangle_p + \gamma KL(p \| p_0) \right) = \frac{1}{Z} p_0(\tau) \exp\left( -\frac{1}{\gamma} V(\tau) \right),    (5)

with the normalization constant Z = \left\langle \exp\left( -\frac{1}{\gamma} V(\tau) \right) \right\rangle_{p_0}. Note that this is not a trivial statement, as we now take the minimum also over non-Markovian processes with non-Gaussian noise. The PICE algorithm (Kappen and Ruiz, 2016) takes advantage of the existence of this explicit expression for the density of optimally controlled trajectories p_{u^*}. PICE does not directly optimize the expected cost; instead it minimizes the KL-divergence KL(p^* \| p_{u_\theta}), which measures the deviation of a parametrized distribution p_{u_\theta} from the optimal one p^*. Although direct cost optimization and PICE are different methods, their global minimum is the same if the parametrization of u_\theta can express the optimal control u^* = u_{\theta^*}.
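The score-function form of this gradient can be illustrated on a toy problem where a "trajectory" is a single Gaussian sample whose mean is the policy parameter. This is a hypothetical minimal sketch: the Gaussian model and all names are ours, not the paper's.

```python
import numpy as np

def direct_cost_gradient(theta, V, gamma, n_samples=200_000, sigma=1.0, seed=0):
    """Monte Carlo estimate of grad_theta C(p_{u_theta}) for a toy model:
    a 'trajectory' is one sample tau ~ N(theta, sigma^2) (controlled path
    distribution) and p_0 = N(0, sigma^2) (uncontrolled distribution).

    Implements < S_gamma(tau) * grad_theta log p_theta(tau) >, with the
    stochastic cost S_gamma(tau) = V(tau) + gamma * log(p_theta/p_0)."""
    rng = np.random.default_rng(seed)
    tau = rng.normal(theta, sigma, n_samples)
    # log N(tau; theta, sigma^2) - log N(tau; 0, sigma^2)
    log_ratio = tau * theta / sigma**2 - theta**2 / (2 * sigma**2)
    S = V(tau) + gamma * log_ratio
    score = (tau - theta) / sigma**2  # grad_theta log N(tau; theta, sigma^2)
    return float(np.mean(S * score))
```

For gamma = 0 and V(tau) = tau^2 this reduces to the plain score-function (REINFORCE-style) estimator of grad_theta E[tau^2] = 2*theta.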
The parameters \theta^* of the optimal controller are found using gradient descent:

\nabla_\theta KL(p^* \| p_{u_\theta}) = -\frac{1}{Z_{p_{u_\theta}}} \left\langle \exp\left( -\frac{1}{\gamma} S^\gamma_{p_{u_\theta}}(\tau) \right) \nabla_\theta \log p_{u_\theta}(\tau) \right\rangle_{p_{u_\theta}},    (6)

where Z_{p_{u_\theta}} := \left\langle \exp\left( -\frac{1}{\gamma} S^\gamma_{p_{u_\theta}}(\tau) \right) \right\rangle_{p_{u_\theta}}.

That PICE uses the optimal density as a guidepost for the policy optimization might give it an advantage compared to direct cost optimization. In practice, however, this method only works properly if the initial guess of the controller u_\theta does not deviate too much from the optimal control, as a high value of KL(p^* \| p_{u_\theta}) leads to a high variance of the gradient estimator and results in bootstrapping problems for the algorithm (Ruiz and Kappen, 2017; Thalmeier et al., 2016). In the next section we introduce a method that interpolates between direct cost optimization and the PICE method. This allows us to take advantage of the analytical solution for the optimal density without being hampered by the same bootstrapping problems as PICE.

3. Interpolating Between the Two Methods: Smoothing Stochastic Control Problems

Cost function smoothing was recently introduced as a way to speed up optimization of neural networks (Chaudhari et al., 2018): optimization of a general cost function f(\theta) can be sped up by smoothing f(\theta) using an inf-convolution with a distance kernel d(\theta', \theta).^1 The smoothed function

J_\alpha(\theta) = \inf_{\theta'} \left( \alpha d(\theta', \theta) + f(\theta') \right)    (7)

preserves the global minima of the function f(\theta). Chaudhari et al. (2018) showed that gradient descent optimization on J_\alpha(\theta) instead of f(\theta) may significantly speed up convergence. For that, the authors used a stochastic optimal control interpretation of the smoothing process of the cost function. In particular, they viewed the smoothing process as the solution to a non-viscous Hamilton-Jacobi partial differential equation. In this work, we want to use this accelerative effect to find the optimal parametrization of the controller u_\theta. Therefore, we smooth the cost function C(p_{u_\theta}) as a function of the parameters \theta.
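The variance problem behind PICE's bootstrapping issue can be made concrete through the self-normalized weights exp(-S/gamma)/Z of the gradient estimator (6): when the sampled costs spread widely relative to gamma, almost all weight collapses onto a single rollout. A minimal sketch, with our own function names:

```python
import numpy as np

def pice_gradient_weights(S, gamma):
    """Self-normalized weights exp(-S_i/gamma) / Z of the PICE gradient
    estimator, computed stably by subtracting the max exponent."""
    a = -np.asarray(S) / gamma
    w = np.exp(a - a.max())
    return w / w.sum()

def effective_sample_size(w):
    """ESS = 1 / sum_i w_i^2: roughly how many rollouts actually
    contribute to the weighted gradient estimate."""
    return 1.0 / float(np.sum(w ** 2))
```

When the controller is far from optimal (costs spread over many multiples of gamma), the effective sample size approaches 1, which is the degeneracy reported by Ruiz and Kappen (2017).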
As C(p_{u_\theta}) = \langle V(\tau) \rangle_{p_{u_\theta}} + \gamma KL(p_{u_\theta} \| p_0) is a functional on the space of probability distributions p_{u_\theta}, the natural distance^2 is the KL-divergence KL(p_{u_{\theta'}} \| p_{u_\theta}). So we replace

f(\theta) \to C(p_{u_\theta}), \qquad d(\theta', \theta) \to KL(p_{u_{\theta'}} \| p_{u_\theta})

1. This is a generalized description. Chaudhari et al. (2018) used d(\theta', \theta) = |\theta' - \theta|^2.
2. Remark: strictly speaking the KL is not a distance, but a directed divergence.

and obtain the smoothed cost J_\alpha(\theta) as

J_\alpha(\theta) = \inf_{\theta'} \left( \alpha KL(p_{u_{\theta'}} \| p_{u_\theta}) + C(p_{u_{\theta'}}) \right) = \inf_{\theta'} \left( \alpha KL(p_{u_{\theta'}} \| p_{u_\theta}) + \gamma KL(p_{u_{\theta'}} \| p_0) + \langle V(\tau) \rangle_{p_{u_{\theta'}}} \right).    (8)

Note the different roles of \alpha and \gamma: \alpha penalizes the deviation of p_{u_{\theta'}} from p_{u_\theta}, while \gamma penalizes the deviation of p_{u_{\theta'}} from the uncontrolled dynamics p_0.

3.1 Computing the Smoothed Cost and its Gradient

The smoothed cost J_\alpha is expressed as a minimization problem that has to be solved. Here we show that for path integral control problems this can be done analytically. To do this we first show that we can replace \inf_{\theta'} \to \inf_p, and then solve the minimization over p analytically. We replace the minimization over \theta' by a minimization over p in two steps: first we state an assumption that allows us to replace \inf_{\theta'} \to \inf_{u'}, and then we prove that for path integral control problems we can replace \inf_{u'} \to \inf_p. We assume that for every u_\theta and any \alpha > 0, the minimizer \theta^*_{\alpha,\theta} over the parameter space,

\theta^*_{\alpha,\theta} := \arg\min_{\theta'} \left( \alpha KL(p_{u_{\theta'}} \| p_{u_\theta}) + C(p_{u_{\theta'}}) \right),    (9)

is the parametrization of the minimizer u^*_{\alpha,\theta} over the function space,

u^*_{\alpha,\theta} := \arg\min_{u'} \left( \alpha KL(p_{u'} \| p_{u_\theta}) + C(p_{u'}) \right),

such that u^*_{\alpha,\theta} \equiv u_{\theta^*_{\alpha,\theta}}. We call this assumption full parametrization. It is sufficient for full parametrization if u_\theta(x, t) is a universal function approximator with a fully observable state space x and the time t as input, although this may be difficult to achieve in practice. With this assumption we can replace \inf_{\theta'} \to \inf_{u'}.
Analogously we replace \inf_{u'} \to \inf_p: in Appendix B we prove that for path integral control problems the minimizer u^*_{\alpha,\theta} over the function space induces the minimizer p^*_{\alpha,\theta} over the space of probability distributions,

p^*_{\alpha,\theta} := \arg\min_{p'} \left( \alpha KL(p' \| p_{u_\theta}) + C(p') \right),    (10)

such that p^*_{\alpha,\theta} \equiv p_{u^*_{\alpha,\theta}}. This step is similar to the derivation of equation (5) in Section 2.2, but now we have added the additional term \alpha KL(p' \| p_{u_\theta}). Hence, given a path integral control problem and a controller u_\theta that satisfies full parametrization, we can replace \inf_{\theta'} \to \inf_p and equation (8) becomes

J_\alpha(\theta) = \inf_p \left( \alpha KL(p \| p_{u_\theta}) + \gamma KL(p \| p_0) + \langle V(\tau) \rangle_p \right).    (11)

This can be solved directly: first we compute the minimizer (see Appendix C for details)

p^*_{\alpha,\theta}(\tau) = \frac{1}{Z^\alpha_{p_{u_\theta}}} p_{u_\theta}(\tau) \exp\left( -\frac{1}{\gamma + \alpha} S^\gamma_{p_{u_\theta}}(\tau) \right)    (12)

with the normalization constant Z^\alpha_{p_{u_\theta}} = \left\langle \exp\left( -\frac{1}{\gamma + \alpha} S^\gamma_{p_{u_\theta}}(\tau) \right) \right\rangle_{p_{u_\theta}}. We plug this back into equation (11) and get an expression for the smoothed cost,

J_\alpha(\theta) = -(\gamma + \alpha) \log \left\langle \exp\left( -\frac{1}{\gamma + \alpha} S^\gamma_{p_{u_\theta}}(\tau) \right) \right\rangle_{p_{u_\theta}},    (13)

and its gradient (for details see Appendix D),

\nabla_\theta J_\alpha(\theta) = -\frac{\alpha}{Z^\alpha_{p_{u_\theta}}} \left\langle \exp\left( -\frac{1}{\gamma + \alpha} S^\gamma_{p_{u_\theta}}(\tau) \right) \nabla_\theta \log p_{u_\theta}(\tau) \right\rangle_{p_{u_\theta}},    (14)

both of which can be estimated using samples from the distribution p_{u_\theta}.

3.2 PICE, Direct Cost Optimization and Risk Sensitivity as Limiting Cases of Smoothed Cost Optimization

The smoothed cost and its gradient depend on the two parameters \alpha and \gamma, which come from the smoothing equation (7) and the definition of the control problem (2), respectively. Although at first glance the two parameters seem to play a similar role, they change different properties of the smoothed cost J_\alpha(\theta) when they are varied. In the expression for the smoothed cost (13), the parameter \alpha only appears in the sum \gamma + \alpha. Varying it changes the effect of the smoothing but leaves the optimum \theta^* = \arg\min_\theta J_\alpha(\theta) of the smoothed cost invariant (see Appendix E). We therefore call \alpha the smoothing parameter.
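The smoothed cost (13) and the weights of the tilted density (12) reduce, on a finite set of sampled stochastic costs, to a numerically stable log-mean-exp. A sketch under our own naming; the max-subtraction trick is a standard stabilization, not part of the paper:

```python
import numpy as np

def smoothed_cost_and_weights(S, alpha, gamma):
    """Sample-based estimate of J_alpha = -(gamma+alpha) * log < exp(-S/(gamma+alpha)) >
    from an array of stochastic costs S (one entry per rollout), together with
    the normalized weights of the minimizer p*_{alpha,theta} of equation (12)."""
    S = np.asarray(S, dtype=float)
    a = -S / (gamma + alpha)
    m = a.max()
    log_mean_exp = m + np.log(np.mean(np.exp(a - m)))  # stable log-mean-exp
    J = -(gamma + alpha) * log_mean_exp
    w = np.exp(a - m)
    w /= w.sum()
    return J, w
```

The two limits discussed in the text are visible numerically: for very large alpha the estimate approaches the plain average of S (direct cost), while for gamma + alpha -> 0 it approaches the minimum sampled cost (a soft-min).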
The larger \alpha, the weaker the smoothing; in the limiting case \alpha \to \infty, smoothing is turned off, as we can see from equation (13): for very large \alpha, the exponential and the logarithmic function linearise, J_\alpha(\theta) \to C(p_{u_\theta}), and we recover direct cost optimization. In the limiting case \alpha \to 0, we recover the PICE method: the optimizer p^*_{\alpha,\theta} becomes equal to the optimal density p^*, and the gradient of the smoothed cost (14) becomes proportional to the PICE gradient (6):

\lim_{\alpha \to 0} \frac{1}{\alpha} \nabla_\theta J_\alpha(\theta) = \nabla_\theta KL(p^* \| p_{u_\theta}).

Varying \gamma changes the control problem and thus its optimal solution. For \gamma \to 0, the control cost becomes zero. In this case the cost consists only of the state cost and arbitrarily large controls are allowed. We get

J_\alpha(\theta) = -\alpha \log \left\langle \exp\left( -\frac{1}{\alpha} V(\tau) \right) \right\rangle_{p_{u_\theta}}.

This expression is identical to the risk-sensitive control cost proposed by Fleming and Sheu (2002); Fleming and McEneaney (1995); van den Broek et al. (2010). Thus, for \gamma = 0, the smoothing parameter \alpha controls the risk-sensitivity, resulting in risk-seeking objectives for \alpha > 0 and risk-avoiding objectives for \alpha < 0. In the limiting case \gamma \to \infty, the problem becomes trivial; the optimally controlled dynamics becomes equal to the uncontrolled dynamics: p^* \to p_0, see equation (5), and u^* \to 0.

If both parameters \alpha and \gamma are small, the problem is hard (see Ruiz and Kappen (2017); Thalmeier et al. (2016)), as many samples are needed to estimate the smoothed cost. The problem becomes feasible if either \alpha or \gamma is increased. Increasing \gamma, however, changes the control problem, while increasing \alpha weakens the effect of smoothing. In the remainder of this article we analyze, first theoretically in Section 4 and then numerically in Section 6, the effect that a finite \alpha > 0 has on the iterative optimization of the control u_\theta for a fixed value of \gamma.

4. The Effect of Cost Function Smoothing on Policy Optimization

We introduced smoothing as a way to speed up policy optimization compared to a direct optimization of the cost.
In this section we analyze policy optimization with and without smoothing and show analytically how smoothing can speed up policy optimization. To simplify notation, we overload p_{u_\theta} \to \theta, so that we write C(p_{u_\theta}) \to C(\theta) and KL(p_{u_{\theta'}} \| p_{u_\theta}) \to KL(\theta' \| \theta). We use a trust region constraint to robustly optimize the policy (compare Peters et al. (2010); Schulman et al. (2015); Gómez et al. (2014)). There are two options. On the one hand, we can directly optimize the cost C:

Definition 1 We define the direct update with stepsize E as an update \theta \to \theta' with \theta' = \Theta^C_E(\theta) and

\Theta^C_E(\theta) := \arg\min_{\theta' \,:\, KL(\theta' \| \theta) \le E} C(\theta').

The direct update results in the minimal cost that can be achieved after one single update. We define the optimal one-step cost

C^*_E(\theta) := \min_{\theta' \,:\, KL(\theta' \| \theta) \le E} C(\theta').

On the other hand, we can optimize the smoothed cost J_\alpha:

Definition 2 We define the smoothed update with stepsize E as an update \theta \to \theta' with \theta' = \Theta^{J_\alpha}_E(\theta) and

\Theta^{J_\alpha}_E(\theta) := \arg\min_{\theta' \,:\, KL(\theta' \| \theta) \le E} J_\alpha(\theta').    (15)

While a direct update achieves the minimal cost that can be achieved after a single update, we show below that a smoothed update can result in a faster cost reduction if more than one update step is performed.

Definition 3 We define the optimal two-step update \theta \to \Theta^* \to \Theta^{**} as an update that results in the lowest cost that can be achieved with a two-step update \theta \to \theta' \to \theta'' with fixed stepsizes E and E', respectively:

(\Theta^*, \Theta^{**}) := \arg\min_{\theta', \theta'' \,:\, KL(\theta'' \| \theta') \le E', \; KL(\theta' \| \theta) \le E} C(\theta''),

and the corresponding optimal two-step cost

C^*_{E,E'}(\theta) := \min_{\theta' \,:\, KL(\theta' \| \theta) \le E} \; \min_{\theta'' \,:\, KL(\theta'' \| \theta') \le E'} C(\theta'') = \min_{\theta' \,:\, KL(\theta' \| \theta) \le E} C\left( \Theta^C_{E'}(\theta') \right).    (16)

Figure 1 illustrates how such an optimal two-step update leads to a faster decrease of the cost than two consecutive direct updates.
Theorem 1 Statement 1: For all E, \alpha there exists an E' such that a smoothed update with stepsize E followed by a direct update with stepsize E' is an optimal two-step update:

\Theta^* = \Theta^{J_\alpha}_E(\theta), \qquad \Theta^{**} = \Theta^C_{E'}(\Theta^*), \qquad C(\Theta^{**}) = C^*_{E,E'}(\theta).

The size of the second step, E', is a function of \theta and \alpha. Statement 2: E' is monotonically decreasing in \alpha.

While it is evident from equation (16) that the second step of the optimal two-step update must be a direct update, the statement that the first step is a smoothed update is non-trivial. We prove this and Statement 2 in Appendix F. Direct updates are myopic: they do not take into account successive steps and are thus suboptimal when more than one update is needed. Smoothed updates, on the other hand, as we see in Theorem 1, anticipate a subsequent step and minimize the cost that results from this two-step update. Hence smoothed updates favour a greater cost reduction in the future over maximal cost reduction in the current step. The strength of this anticipatory effect depends on the smoothing strength, which is controlled by the smoothing parameter \alpha: for large \alpha, smoothing is weak and the size E' of the anticipated second step becomes small. Figure 1(B) illustrates that in this case, when E' becomes small, smoothed updates become more similar to direct updates. In the limiting case \alpha \to \infty, the difference between smoothed and direct updates vanishes completely, as J_\alpha(\theta) \to C(\theta) (see Section 3.2). We expect that, due to this anticipatory effect, iterating smoothed updates also leads to a faster decrease of the cost than iterating direct updates when multiple update steps are performed. We confirm this by numerical studies. Furthermore, we expect this accelerating effect of smoothing to be stronger for smaller values of \alpha. On the other hand, as we discuss in the next section, for smaller values of \alpha it is harder to accurately perform the smoothed updates. Therefore we expect optimal performance at an intermediate value of \alpha.
Based on this, we build an algorithm in the next section that aims to accelerate policy optimization by cost function smoothing.

Figure 1: Illustration of optimal two-step updates compared with two consecutive direct updates. Illustrated is a two-dimensional cost landscape C(\theta) parametrized by \theta. Dark colors represent low cost, while light colors represent high cost. Green dots indicate the optimal two-step update \theta \to \Theta^* \to \Theta^{**}, while red dots indicate two consecutive direct updates \theta \to \theta' \to \theta'' with \theta' = \Theta^C_E(\theta) and \theta'' = \Theta^C_{E'}(\theta'). The dashed circles indicate trust regions. \theta', \theta'' and \Theta^{**} are the minimizers of the cost in the trust regions around \theta, \theta' and \Theta^*, respectively. \Theta^* is chosen such that the cost C(\Theta^{**}) after the subsequent direct update is minimized. In both panels, the final cost after an optimal two-step update, C(\Theta^{**}), is smaller than the final cost after two direct updates, C(\theta''). (A) Equal sizes of the update steps, E = E'. (B) When the size of the second step becomes small, E' \ll E, the smoothed update \theta \to \Theta^* becomes more similar to the direct update \theta \to \theta'.

5. Numerical Method

In this section we develop an algorithm that takes a parametrized control function u_\theta with initial parameters \theta_0 and updates these parameters in each iteration n using smoothed updates.

5.1 Smoothed and Direct Updates Using Natural Gradients

So far we have specified the smoothed updates \theta_{n+1} = \Theta^{J_\alpha}_E(\theta_n) (15) in an abstract manner and left open how to perform this optimization step. To compute an explicit expression we introduce a Lagrange multiplier \beta and express the constrained optimization (15) as an unconstrained optimization:

\theta_{n+1} = \arg\min_{\theta'} J_\alpha(\theta') + \beta KL(\theta' \| \theta_n).    (17)

Following Schulman et al. (2015), we assume that the trust region size E is small. For small E \ll 1 we get \beta \gg 1 and can expand J_\alpha(\theta') to first and KL(\theta' \| \theta_n) to second order (see Appendix G for details).
This gives

\theta_{n+1} = \theta_n - \beta^{-1} F^{-1} \nabla_{\theta'} J_\alpha(\theta') \big|_{\theta' = \theta_n},    (18)

a natural gradient update with the Fisher matrix F = \nabla_{\theta'} \nabla_{\theta'}^T KL(\theta' \| \theta_n) \big|_{\theta' = \theta_n} (we use the conjugate gradient method to approximately compute the natural gradient for high-dimensional parameter spaces; see Appendix J or Schulman et al. (2015) for details). The parameter \beta is determined using a line search such that^3

KL(\theta_n \| \theta_{n+1}) = E.    (19)

Note that for direct updates this derivation is the same; just replace J_\alpha by C.

5.2 Reliable Gradient Estimation Using Adaptive Smoothing

To compute smoothed updates using equation (18) we need the gradient of the smoothed cost. We assume full parametrization and use equation (14), which can be estimated using N weighted samples drawn from the distribution p_{u_\theta}:

\nabla_\theta J_\alpha(\theta) \approx -\alpha \sum_{i=1}^N w_i \nabla_\theta \log p_{u_\theta}(\tau^i),    (20)

with weights given by

w_i = \frac{1}{Z} \exp\left( -\frac{1}{\gamma + \alpha} S^\gamma_{p_{u_\theta}}(\tau^i) \right), \qquad Z = \sum_{i=1}^N \exp\left( -\frac{1}{\gamma + \alpha} S^\gamma_{p_{u_\theta}}(\tau^i) \right).

The variance of this estimator depends sensitively on the entropy of the weights,

H_N(w) = -\sum_{i=1}^N w_i \log(w_i).

If the entropy is low, the total weight is concentrated on a few particles. This results in a poor gradient estimator where only a few of the particles actually contribute. This concentration depends on the smoothing parameter \alpha: for small \alpha, the weights are concentrated on a few samples, resulting in a low weight entropy and thus a high variance of the gradient estimator. As small \alpha corresponds to strong smoothing, we want \alpha to be as small as possible, but large enough to allow a reliable gradient estimation. Therefore, we set a bound on the weight entropy H_N(w). To get a bound that is independent of the number of samples N, we use that in the limit N \to \infty the weight entropy is monotonically related to the KL-divergence KL(p^*_{\alpha,\theta} \| p_{u_\theta}):

KL(p^*_{\alpha,\theta} \| p_{u_\theta}) = \lim_{N \to \infty} \left( \log N - H_N(w) \right)

3. For practical reasons, we reverse the arguments of the KL-divergence, since it is easier to estimate it from samples drawn from the first argument. For very small values, the KL is approximately symmetric in its arguments.
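The weight-entropy criterion lends itself to a simple search over the smoothing parameter. The following sketch is our own illustration (the function names, the grid-based search, and the bound symbol `delta` are assumptions; the paper uses a line search): it picks the smallest alpha on a grid whose estimated KL, log N - H_N(w), stays below the bound.

```python
import numpy as np

def weight_entropy(w):
    """H_N(w) = -sum_i w_i log w_i for normalized weights."""
    w = w[w > 0]
    return -float(np.sum(w * np.log(w)))

def adapt_alpha(S, gamma, delta, alpha_grid):
    """Return the smallest alpha on the grid whose weight-based KL estimate,
    log N - H_N(w), satisfies the smoothing bound delta."""
    S = np.asarray(S, dtype=float)
    N = len(S)
    for alpha in sorted(alpha_grid):
        a = -S / (gamma + alpha)
        w = np.exp(a - a.max())
        w /= w.sum()
        if np.log(N) - weight_entropy(w) <= delta:
            return alpha
    return max(alpha_grid)  # fall back to the weakest smoothing on the grid
```

Because the KL estimate decreases monotonically in alpha (Appendix H), the first grid value that satisfies the bound is the smallest feasible one.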
Also, the equality in (19) differs from Schulman et al. (2015), which optimizes a value function within the trust region, i.e., KL(\theta_n \| \theta_{n+1}) \le E.

(see Appendix I for a proof). This provides a method for choosing \alpha independently of the number of samples: we set the constraint KL(p^*_{\alpha,\theta} \| p_{u_\theta}) \le \delta and determine the smallest \alpha that satisfies this condition using a line search. Large values of \delta correspond to small values of \alpha (see Appendix H) and therefore to strong smoothing; we thus call \delta the smoothing strength.

5.3 Formulating a Model-Free Algorithm

We can compute the gradient (20) and the KL-divergence while treating the dynamical system as a black box. For this we write the probability distribution p_{u_\theta} over trajectories \tau as a Markov process.

For any E' > 0, evaluating the infimum in the definition of J_\alpha(\theta) at \theta' = \Theta^C_{E'}(\theta) gives the upper bound

J_\alpha(\theta) \le C\left( \Theta^C_{E'}(\theta) \right) + \alpha KL\left( \Theta^C_{E'}(\theta) \| \theta \right).

Further, as \Theta^C_{E'}(\theta) lies in the trust region \{\theta' : KL(\theta' \| \theta) \le E'\}, we have KL(\Theta^C_{E'}(\theta) \| \theta) \le E', so we can write

C\left( \Theta^C_{E'}(\theta) \right) + \alpha KL\left( \Theta^C_{E'}(\theta) \| \theta \right) \le C\left( \Theta^C_{E'}(\theta) \right) + \alpha E' \quad \Rightarrow \quad J_\alpha(\theta) \le C\left( \Theta^C_{E'}(\theta) \right) + \alpha E'.

Next we minimize both sides of this inequality within the trust region \{\theta' : KL(\theta' \| \theta) \le E\}. We use that J_\alpha\left( \Theta^{J_\alpha}_E(\theta) \right) = \min_{\theta' : KL(\theta' \| \theta) \le E} J_\alpha(\theta') to obtain

J_\alpha\left( \Theta^{J_\alpha}_E(\theta) \right) \le \min_{\theta' : KL(\theta' \| \theta) \le E} \left( C\left( \Theta^C_{E'}(\theta') \right) + \alpha E' \right).    (26)

Now we use Lemma 2 and rewrite the left-hand side of this inequality:

J_\alpha\left( \Theta^{J_\alpha}_E(\theta) \right) = C\left( \Theta^C_{E'}\left( \Theta^{J_\alpha}_E(\theta) \right) \right) \Big|_{E'=E'_\alpha(\theta)} + \alpha E'_\alpha(\theta),

with E'_\alpha(\theta) := E_\alpha\left( \Theta^{J_\alpha}_E(\theta) \right). Plugging this back into (26) we get

C\left( \Theta^C_{E'}\left( \Theta^{J_\alpha}_E(\theta) \right) \right) \Big|_{E'=E'_\alpha(\theta)} + \alpha E'_\alpha(\theta) \le \min_{\theta' : KL(\theta' \| \theta) \le E} \left( C\left( \Theta^C_{E'}(\theta') \right) + \alpha E' \right).

As this inequality holds for any E' > 0, we can plug E' = E'_\alpha(\theta) into the right-hand side and obtain

C\left( \Theta^C_{E'}\left( \Theta^{J_\alpha}_E(\theta) \right) \right) \Big|_{E'=E'_\alpha(\theta)} + \alpha E'_\alpha(\theta) \le \min_{\theta' : KL(\theta' \| \theta) \le E} C\left( \Theta^C_{E'}(\theta') \right) \Big|_{E'=E'_\alpha(\theta)} + \alpha E'_\alpha(\theta).

We subtract \alpha E'_\alpha(\theta) on both sides:

C\left( \Theta^C_{E'}\left( \Theta^{J_\alpha}_E(\theta) \right) \right) \Big|_{E'=E'_\alpha(\theta)} \le \min_{\theta' : KL(\theta' \| \theta) \le E} C\left( \Theta^C_{E'}(\theta') \right) \Big|_{E'=E'_\alpha(\theta)}.

Using equation (16) gives

C\left( \Theta^C_{E'}\left( \Theta^{J_\alpha}_E(\theta) \right) \right) \Big|_{E'=E'_\alpha(\theta)} \le C^*_{E,E'}(\theta) \Big|_{E'=E'_\alpha(\theta)},

which concludes the proof.
F.3 Proof of Statement 2

Here we show that E' = E'_\alpha(\theta) is a monotonically decreasing function of \alpha. E'_\alpha(\theta) is given by

E'_\alpha(\theta) = E_\alpha\left( \Theta^{J_\alpha}_E(\theta) \right) = KL\left( \theta^*_{\alpha,\theta'} \| \theta' \right) \Big|_{\theta' = \Theta^{J_\alpha}_E(\theta)}.

Moreover,

\alpha KL\left( \theta^*_{\alpha,\theta'} \| \theta' \right) + C\left( \theta^*_{\alpha,\theta'} \right) \Big|_{\theta' = \Theta^{J_\alpha}_E(\theta)} = \inf_{\theta''} \left( \alpha KL(\theta'' \| \theta') + C(\theta'') \right) \Big|_{\theta' = \Theta^{J_\alpha}_E(\theta)} = \min_{\theta' : KL(\theta' \| \theta) \le E} \inf_{\theta''} \left( \alpha KL(\theta'' \| \theta') + C(\theta'') \right).

For convenience we introduce a shorthand notation for the minimizers:

\theta_\alpha := \Theta^{J_\alpha}_E(\theta), \qquad \theta^*_\alpha := \theta^*_{\alpha,\theta'} \big|_{\theta' = \Theta^{J_\alpha}_E(\theta)}.

We compare \alpha_1 \ge 0 with E'_{\alpha_1}(\theta) := KL(\theta^*_{\alpha_1} \| \theta_{\alpha_1}) and \alpha_2 \ge 0 with E'_{\alpha_2}(\theta) := KL(\theta^*_{\alpha_2} \| \theta_{\alpha_2}), and assume that E'_{\alpha_1}(\theta) < E'_{\alpha_2}(\theta). We show that from this it follows that \alpha_1 \ge \alpha_2.

Proof As (\theta^*_{\alpha_1}, \theta_{\alpha_1}) minimizes \alpha_1 KL(\theta'' \| \theta') + C(\theta''), we have

\alpha_1 KL(\theta^*_{\alpha_1} \| \theta_{\alpha_1}) + C(\theta^*_{\alpha_1}) \le \alpha_1 KL(\theta^*_{\alpha_2} \| \theta_{\alpha_2}) + C(\theta^*_{\alpha_2}) \quad \Rightarrow \quad \alpha_1 E'_{\alpha_1}(\theta) + C(\theta^*_{\alpha_1}) \le \alpha_1 E'_{\alpha_2}(\theta) + C(\theta^*_{\alpha_2}),

and analogously for \alpha_2:

\alpha_2 KL(\theta^*_{\alpha_1} \| \theta_{\alpha_1}) + C(\theta^*_{\alpha_1}) \ge \alpha_2 KL(\theta^*_{\alpha_2} \| \theta_{\alpha_2}) + C(\theta^*_{\alpha_2}) \quad \Rightarrow \quad \alpha_2 E'_{\alpha_1}(\theta) + C(\theta^*_{\alpha_1}) \ge \alpha_2 E'_{\alpha_2}(\theta) + C(\theta^*_{\alpha_2}).

With E'_{\alpha_1}(\theta) < E'_{\alpha_2}(\theta) we get

\alpha_1 \ge \frac{C(\theta^*_{\alpha_1}) - C(\theta^*_{\alpha_2})}{E'_{\alpha_2}(\theta) - E'_{\alpha_1}(\theta)} \ge \alpha_2.

We showed that from E'_{\alpha_1}(\theta) < E'_{\alpha_2}(\theta) it follows that \alpha_1 \ge \alpha_2, which proves that E'_\alpha(\theta) is monotonically decreasing in \alpha.

Appendix G. Smoothed Updates for Small Update Steps E

We want to compute equation (17) for small E, which corresponds to large \beta. Assuming a smooth dependence of p_{u_\theta} on \theta, bounding KL(\theta' \| \theta_n) to a very small value allows us to do a Taylor expansion, which we truncate at second order:

\arg\min_{\theta'} J_\alpha(\theta') + \beta KL(\theta' \| \theta_n) \approx \arg\min_{\theta'} (\theta' - \theta_n)^T \nabla_{\theta'} J_\alpha(\theta') \big|_{\theta'=\theta_n} + \tfrac{1}{2} (\theta' - \theta_n)^T (H + \beta F)(\theta' - \theta_n) = \theta_n - \beta^{-1} F^{-1} \nabla_{\theta'} J_\alpha(\theta') \big|_{\theta'=\theta_n} + O(\beta^{-2}),

with

H = \nabla_{\theta'} \nabla_{\theta'}^T J_\alpha(\theta') \big|_{\theta'=\theta_n}, \qquad F = \nabla_{\theta'} \nabla_{\theta'}^T KL(\theta' \| \theta_n) \big|_{\theta'=\theta_n}.

See also Martens (2014). We used that E \ll 1 implies \beta \gg 1. With this, the term \beta F dominates over the Hessian H, and thus the Hessian no longer appears in the update equation. This defines a natural gradient update with stepsize \beta^{-1}.

Appendix H. KL(p^*_{\alpha,\theta} \| p_{u_\theta}) is Monotonic in \alpha

Now we show that KL(p^*_{\alpha,\theta} \| p_{u_\theta}) is a monotonic function of \alpha.
We differentiate

KL(p^*_{\alpha,\theta} \| p_{u_\theta}) = \left\langle \ln \frac{p^*_{\alpha,\theta}}{p_{u_\theta}} \right\rangle_{p^*_{\alpha,\theta}}

with respect to \alpha. From equation (12),

\frac{p^*_{\alpha,\theta}(\tau)}{p_{u_\theta}(\tau)} = \frac{1}{Z^\alpha_{p_{u_\theta}}} \exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau) \right), \qquad Z^\alpha_{p_{u_\theta}} = \left\langle \exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau) \right) \right\rangle_{p_{u_\theta}},

so that

\ln \frac{p^*_{\alpha,\theta}(\tau)}{p_{u_\theta}(\tau)} = -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau) - \ln Z^\alpha_{p_{u_\theta}}.

Differentiating the density ratio gives

\frac{\partial}{\partial \alpha} \frac{p^*_{\alpha,\theta}}{p_{u_\theta}} = \frac{1}{(\gamma+\alpha)^2} \frac{p^*_{\alpha,\theta}}{p_{u_\theta}} \left( S^\gamma_{p_{u_\theta}}(\tau) - \left\langle S^\gamma_{p_{u_\theta}} \right\rangle_{p^*_{\alpha,\theta}} \right),

where we used \frac{\partial}{\partial \alpha} \ln Z^\alpha_{p_{u_\theta}} = \frac{1}{(\gamma+\alpha)^2} \left\langle S^\gamma_{p_{u_\theta}} \right\rangle_{p^*_{\alpha,\theta}}. Since \left\langle \frac{\partial}{\partial \alpha} p^*_{\alpha,\theta} \right\rangle = 0, only the first term contributes, and we get

\frac{\partial}{\partial \alpha} KL(p^*_{\alpha,\theta} \| p_{u_\theta}) = \frac{1}{(\gamma+\alpha)^2} \left\langle \left( S^\gamma_{p_{u_\theta}}(\tau) - \left\langle S^\gamma_{p_{u_\theta}} \right\rangle_{p^*_{\alpha,\theta}} \right) \left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau) - \ln Z^\alpha_{p_{u_\theta}} \right) \right\rangle_{p^*_{\alpha,\theta}} = -\frac{1}{(\gamma+\alpha)^3} \mathrm{Var}_{p^*_{\alpha,\theta}}\left[ S^\gamma_{p_{u_\theta}}(\tau) \right] \le 0.

Therefore KL(p^*_{\alpha,\theta} \| p_{u_\theta}) is a monotonically decreasing function of \alpha.

Appendix I. Proof of the Equivalence of Weight Entropy and KL-Divergence

We want to show that

\lim_{N\to\infty} \left( \log N - H_N(w) \right) = \lim_{N\to\infty} \left( \log N + \sum_{i=1}^N w_i \log w_i \right) = KL(p^*_{\alpha,\theta} \| p_{u_\theta}),

where the samples \tau^i are drawn from p_{u_\theta} and the weights are given by

w_i = \frac{\exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau^i) \right)}{\sum_{j=1}^N \exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau^j) \right)}.

Writing \log N + \sum_i w_i \log w_i = \sum_i w_i \log(N w_i) and noting that

N w_i = \frac{\exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau^i) \right)}{\frac{1}{N}\sum_{j=1}^N \exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau^j) \right)},

we can replace the empirical average \frac{1}{N}\sum_j by the expectation \langle \cdot \rangle_{p_{u_\theta}} in the limit N \to \infty:

\lim_{N\to\infty} \sum_i w_i \log(N w_i) = \left\langle \frac{\exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau) \right)}{\left\langle \exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau) \right) \right\rangle_{p_{u_\theta}}} \log \frac{\exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau) \right)}{\left\langle \exp\left( -\frac{1}{\gamma+\alpha} S^\gamma_{p_{u_\theta}}(\tau) \right) \right\rangle_{p_{u_\theta}}} \right\rangle_{p_{u_\theta}}.

Using equation (12), this gives

\left\langle \frac{p^*_{\alpha,\theta}(\tau)}{p_{u_\theta}(\tau)} \log \frac{p^*_{\alpha,\theta}(\tau)}{p_{u_\theta}(\tau)} \right\rangle_{p_{u_\theta}} = \left\langle \log \frac{p^*_{\alpha,\theta}(\tau)}{p_{u_\theta}(\tau)} \right\rangle_{p^*_{\alpha,\theta}} = KL(p^*_{\alpha,\theta} \| p_{u_\theta}).

Appendix J.
Inversion of the Fisher Matrix

We compute an approximation to the natural gradient g_f = F^{-1} g by approximately solving the linear equation F g_f = g using truncated conjugate gradient, with the standard gradient g and the Fisher matrix F = \nabla_{\theta'} \nabla_{\theta'}^T KL(p_{u_{\theta'}} \| p_{u_{\theta_n}}) (see Appendix G). We use an efficient way to compute the Fisher-vector product F y (Schulman et al., 2015) using an automatic differentiation package: first, for each rollout i and timepoint t, the symbolic expression for the gradient of the KL multiplied by a vector y is computed,

a_{i,t}(\theta_{n+1}) = y^T \nabla_{\theta_{n+1}} \log \frac{\pi_{\theta_n}(a^i_t \mid t, x^i_t)}{\pi_{\theta_{n+1}}(a^i_t \mid t, x^i_t)}.

Then we differentiate this scalar quantity once more, sum over all times, and average over the samples. This gives the Fisher-vector product F y.
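The truncated conjugate gradient solve only needs a Fisher-vector-product routine, never the matrix F itself. A minimal sketch of standard CG under our own naming; the paper's implementation details (truncation length, damping) may differ:

```python
import numpy as np

def conjugate_gradient(Fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g for the natural gradient x = F^{-1} g,
    given only a Fisher-vector-product routine Fvp(v) -> F v."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x, with x = 0 initially
    p = r.copy()          # search direction
    rs = float(r @ r)
    for _ in range(iters):
        Fp = Fvp(p)
        step = rs / float(p @ Fp)
        x += step * p
        r -= step * Fp
        rs_new = float(r @ r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

For an n-dimensional problem, exact convergence takes at most n iterations; truncating earlier trades accuracy of the natural gradient for compute, which is the point of the approximation.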