# The Differentiable Cross-Entropy Method

Brandon Amos (1), Denis Yarats (1, 2). (1) Facebook AI Research, (2) New York University. Correspondence to: Brandon Amos.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract. We study the cross-entropy method (CEM) for the non-convex optimization of a continuous and parameterized objective function and introduce a differentiable variant that enables us to differentiate the output of CEM with respect to the objective function's parameters. In the machine learning setting this brings CEM inside of the end-to-end learning pipeline, where this has otherwise been impossible. We show applications in a synthetic energy-based structured prediction task and in non-convex continuous control. In the control setting we show how to embed optimal action sequences into a lower-dimensional space. DCEM enables us to fine-tune CEM-based controllers with policy optimization.

1. Introduction

Recent work in the machine learning community has shown how optimization procedures can create new building blocks for the end-to-end machine learning pipeline (Gould et al., 2016; Johnson et al., 2016; Amos et al., 2017; Amos & Kolter, 2017; Domke, 2012; Metz et al., 2016; Finn et al., 2017; Zhang et al., 2019; Belanger et al., 2017; Rusu et al., 2018; Srinivas et al., 2018; Amos et al., 2018; Agrawal et al., 2019a). In this paper we focus on the setting of optimizing an unconstrained, non-convex, and continuous objective function $f_\theta(x) : \mathbb{R}^n \to \mathbb{R}$ as

$$\hat{x} := \arg\min_x f_\theta(x), \qquad (1)$$

where we assume $\hat{x}$ is unique and that $f_\theta$ is parameterized by $\theta \in \Theta$ and has inputs $x \in \mathbb{R}^n$. If it exists, some (sub-)derivative $\nabla_\theta \hat{x}$ is useful in the machine learning setting to make the output of the optimization procedure end-to-end learnable. For example, $\theta$ could parameterize a predictive model that generates outcomes conditional on $x$. End-to-end learning in these settings can be done by defining a loss function $\mathcal{L}$ on top of $\hat{x}$ and taking gradient steps $\nabla_\theta \mathcal{L}$. If $f_\theta$ were convex this gradient is easy to analyze and compute when it exists and is unique (Gould et al., 2016; Johnson et al., 2016; Amos et al., 2017; Amos & Kolter, 2017). Analyzing and computing a derivative through the non-convex argmin in eq. (1) is not as easy and is challenging in theory and practice: the derivative may not exist or may be uninformative in theory, it might not be unique, and even if it does exist, the numerical solver being used to compute the solution may not find a global or even local optimum of $f_\theta$.

One promising direction to sidestep these issues is to approximate the argmin operation with an explicit optimization procedure that is interpreted as just another compute graph and unrolled through, i.e. seen as a sequence of differentiable computations. This is most commonly done with gradient descent, as in Domke (2012); Metz et al. (2016); Finn et al. (2017); Belanger et al. (2017); Rusu et al. (2018); Srinivas et al. (2018); Foerster et al. (2018); Zhang et al. (2019). This approximation adds definition and structure to an otherwise ill-defined desideratum, at the cost of biasing the gradients and enabling the learning procedure to over-fit to the hyper-parameters of the optimization algorithm, such as the number of gradient steps or the learning rate.
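The unrolling idea described above can be made concrete in a few lines of PyTorch. The sketch below is illustrative and not the paper's code: the function `unrolled_argmin`, the toy quadratic objective, and the hyper-parameter values are our own choices to show how gradients flow from a loss on the approximate argmin back into the objective's parameters.

```python
import torch

def unrolled_argmin(f, theta, y0, n_steps=10, lr=0.1):
    """Approximate argmin_y f(y; theta) with unrolled gradient descent.
    Every inner step stays on the autograd tape, so a loss defined on the
    returned iterate can be differentiated w.r.t. theta."""
    y = y0
    for _ in range(n_steps):
        # create_graph=True keeps the higher-order terms needed for d(y_hat)/d(theta)
        (g,) = torch.autograd.grad(f(y, theta).sum(), y, create_graph=True)
        y = y - lr * g
    return y

# toy usage: theta parameterizes a quadratic, the outer loss pulls its argmin toward 1
theta = torch.tensor(0.0, requires_grad=True)
f = lambda y, th: (y - th) ** 2
y0 = torch.zeros(1, requires_grad=True)
y_hat = unrolled_argmin(f, theta, y0)
loss = (y_hat - 1.0).pow(2).sum()
loss.backward()   # theta.grad now holds d loss / d theta through the unrolled solver
```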
In this paper we show how to use the cross-entropy method (CEM) (Rubinstein, 1997; De Boer et al., 2005) to approximate the derivative through an unconstrained, non-convex, and continuous argmin. CEM for optimization is a zeroth-order optimizer and works by generating a sequence of samples from the objective function. We show a simple and computationally negligible way of making CEM differentiable, which we call DCEM, by using the smooth top-k operation from Amos et al. (2019). This also brings CEM into the end-to-end learning process in scenarios such as control, where there is otherwise a disconnection between the objective that is being learned and the objective that is induced by deploying CEM on top of those models.

We first study DCEM in a simple non-convex energy-based learning setting for regression. We contrast using unrolled gradient descent and DCEM for optimizing over a SPEN (Belanger & McCallum, 2016). We show that unrolling through gradient descent in this setting over-fits to the number of gradient steps taken and that DCEM generates a more reasonable energy surface.

We next focus on using DCEM in the context of non-convex continuous control as a differentiable policy class that is end-to-end learnable. This setting is especially interesting as vanilla CEM is the state-of-the-art method for solving the control optimization problem with neural network transition dynamics, as in Chua et al. (2018); Hafner et al. (2018). We show that DCEM is useful for embedding action sequences into a lower-dimensional space to make solving the control optimization process significantly less computationally and memory expensive. This controller induces a differentiable policy class parameterized by the model-based components. DCEM is a solution to the objective mismatch problem in model-based control (Lambert et al., 2020), which is the issue that arises when training model-based components with the objective of maximizing the data likelihood but then using those components for the objective of control. We use PPO (Schulman et al., 2017) to fine-tune the model-based components, demonstrating that it is possible to use standard policy learning for model-based RL components in addition to maximum-likelihood fitting.

2. Background and Related Work

2.1. Differentiable optimization-based modeling in machine learning

Optimization-based modeling is a way of integrating specialized operations and domain knowledge into end-to-end machine learning pipelines, typically in the form of a parameterized argmin operation. Convex, constrained, and continuous optimization problems, e.g. as in Gould et al. (2016); Johnson et al. (2016); Amos et al. (2017); Amos & Kolter (2017); Agrawal et al. (2019a), capture many standard layers as special cases and can be differentiated through by applying the implicit function theorem to a set of optimality conditions from convex optimization theory, such as the KKT conditions. Non-convex and continuous optimization problems, e.g. as in Domke (2012); Belanger & McCallum (2016); Metz et al. (2016); Finn et al. (2017); Belanger et al. (2017); Rusu et al. (2018); Srinivas et al. (2018); Foerster et al. (2018); Amos et al. (2018); Pedregosa (2016); Jenni & Favaro (2018); Rajeswaran et al. (2019); Zhang et al. (2019), are more difficult to differentiate through.
Differentiation is typically done by unrolling gradient descent or by applying the implicit function theorem to some set of optimality conditions, sometimes forming a locally convex approximation to the larger non-convex problem. Unrolling gradient descent is the most common way: it approximates the argmin operation with gradient descent for the forward pass and interprets the operations as another compute graph for the backward pass that can all be differentiated through. In contrast to these works, we show how continuous and non-convex argmin operations can also be approximated with the cross-entropy method.

2.2. Embedding domains for optimization problems

Oftentimes the solution space of high-dimensional optimization problems has structural properties that an optimizer can exploit to find a better solution, or to find the solution quicker, than an otherwise naïve optimizer. Meta-learning approaches such as LEO (Rusu et al., 2018) and CAVIA (Zintgraf et al., 2019) turn the optimization problem for adaptation in a high-dimensional parameter space into a lower-dimensional latent embedded optimization problem. In the context of Bayesian optimization this has been explored with random feature embeddings, hand-coded embeddings, and auto-encoder-learned embeddings (Antonova et al., 2019; Oh et al., 2018; Calandra et al., 2016; Wang et al., 2016; Garnett et al., 2013; Ben Salem et al., 2019; Kirschner et al., 2019). Luo et al. (2018) and Gómez-Bombarelli et al. (2018) turn discrete search problems for architecture search and molecular design, respectively, into embedded continuous optimization problems. We show that DCEM is another reasonable way of learning an embedded domain for exploiting the structure in, and efficiently solving, larger optimization problems, with the significant advantage that the latent space is directly learned to be optimized over as part of the end-to-end learning pipeline.

2.3. RL and Control

High-dimensional non-convex optimization problems that have a lot of structure in the solution space naturally arise in the control setting, where the controller seeks to optimize the same objective in the same controlled dynamical system from different starting states. This has been investigated in, e.g., planning (Ichter et al., 2018; Ichter & Pavone, 2019; Mukadam et al., 2018; Kurutach et al., 2018; Srinivas et al., 2018; Yu et al., 2019; Lynch et al., 2019) and policy distillation (Wang & Ba, 2019). Chandak et al. (2019) shows how to learn an action space for model-free learning, and Co-Reyes et al. (2018); Antonova et al. (2019) embed action sequences with a VAE. There has also been a lot of work on learning reasonable latent state space representations (Tasfi & Capretz, 2018; Zhang et al., 2018; Gelada et al., 2019; Miladinović et al., 2019) that may have structure imposed to make them more controllable (Watter et al., 2015; Banijamali et al., 2017; Ghosh et al., 2018; Anand et al., 2019; Levine et al., 2019; Singh et al., 2019). In contrast to these works, we learn how to encode action sequences directly with DCEM instead of auto-encoding the sequences. This has the advantages of 1) never requiring an expensive expert's solution to the control optimization problem, 2) potentially being able to surpass the performance of an expert controller that uses the full action space, and 3) being end-to-end learnable through the controller for the purpose of finding a latent space of sequences that DCEM is good at searching over.
Another direction the RL and control communities have been pursuing is the combination of model-based and model-free methods by differentiating through model-based components. Bansal et al. (2017) does this with Bayesian optimization and locally linear models. Okada et al. (2017); Pereira et al. (2018) make path integral control (Theodorou et al., 2010) differentiable. Agrawal et al. (2019b) considers a class of convex controllers and differentiates through them with Agrawal et al. (2019a). Amos et al. (2018) proposes differentiable MPC and only does imitation learning on the cartpole and pendulum tasks with known or lightly-parameterized dynamics; in contrast, we are able to 1) scale our differentiable controller up to the cheetah and walker tasks, 2) use neural network dynamics inside of our controller, and 3) backpropagate a policy loss through the output of our controller and into the internal components.

3. The Differentiable Cross-Entropy Method

The cross-entropy method (CEM) (Rubinstein, 1997; De Boer et al., 2005) is an algorithm to solve optimization problems in the form of eq. (1). CEM is an iterative and zeroth-order solver that uses a sequence of parametric sampling distributions $g_\phi$ defined over the domain $\mathbb{R}^n$, such as Gaussians. Given a sampling distribution $g_\phi$, the hyper-parameters of CEM are the number of candidate points sampled in each iteration $N$, the number of elite candidates $k$ used to fit the new sampling distribution, and the number of iterations $T$. The iterates of CEM are the parameters $\phi$ of the sampling distribution. CEM starts with an initial sampling distribution $g_{\phi_1}(X)$ over $\mathbb{R}^n$, and in each iteration $t$ generates $N$ samples from the domain $[X_{t,i}]_{i=1}^{N} \sim g_{\phi_t}(\cdot)$, evaluates the function at those points $v_{t,i} := f_\theta(X_{t,i})$, and re-fits the sampling distribution to the top-$k$ samples by solving the maximum-likelihood problem

$$\phi_{t+1} := \arg\max_\phi \sum_i \mathbb{1}\{v_{t,i} \le \pi(v_t)_k\} \log g_\phi(X_{t,i}), \qquad (2)$$

where the indicator $\mathbb{1}\{P\}$ is 1 if $P$ is true and 0 otherwise, $g_\phi(X)$ is the likelihood of $X$ under the distribution $g_\phi$, and $\pi(x)$ sorts $x \in \mathbb{R}^n$ in ascending order so that $\pi(x)_1 \le \pi(x)_2 \le \ldots \le \pi(x)_n$. (CEM's name comes from eq. (2) more generally optimizing the cross-entropy measure between two distributions.) We can then map from the final distribution $g_{\phi_{T+1}}$ back to the domain by taking its mean, i.e. $\hat{x} := \mathbb{E}[g_{\phi_{T+1}}(\cdot)]$. In some settings, the best sample can instead be returned as $\hat{x}$.

[Figure 1. The limited multi-label (LML) polytope $\mathcal{L}_{n,k}$ from Amos et al. (2019) is the set of points in the unit $n$-hypercube with coordinates that sum to $k$. $\mathcal{L}_{n,1}$ is the $(n-1)$-simplex. The $\mathcal{L}_{3,1}$ and $\mathcal{L}_{3,2}$ polytopes (triangles) are on the left in blue. The $\mathcal{L}_{4,2}$ polytope (an octahedron) is on the right. This polytope is also referred to as the knapsack polytope or capped simplex.]

Proposition 1. For multivariate isotropic Gaussian sampling distributions we have that $\phi = \{\mu, \sigma^2\}$ and eq. (2) has a closed-form solution given by the sample mean and variance of the top-$k$ samples as $\mu_{t+1} := \frac{1}{k} \sum_{i \in \mathcal{I}_t} X_{t,i}$ and $\sigma^2_{t+1} := \frac{1}{k} \sum_{i \in \mathcal{I}_t} (X_{t,i} - \mu_{t+1})^2$, where the top-$k$ indexing set is $\mathcal{I}_t := \{i : v_{t,i} \le \pi(v_t)_k\}$.

This is well-known and is discussed in, e.g., Friedman et al. (2001). We present it here to make the connections between CEM and DCEM clearer.
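Proposition 1 translates into a short routine. The following is a minimal sketch of vanilla CEM with an isotropic Gaussian sampling distribution and a hard top-k refit, not the released dcem package; the toy objective and the default hyper-parameters are illustrative.

```python
import torch

def cem(f, mu0, sigma0, n_samples=100, n_elite=10, n_iter=10):
    """Vanilla CEM (eq. 2 / Prop. 1): sample, evaluate, and re-fit an isotropic
    Gaussian to the hard top-k (lowest-cost) samples; return the final mean."""
    mu, sigma = mu0.clone(), sigma0.clone()
    for _ in range(n_iter):
        X = mu + sigma * torch.randn(n_samples, *mu.shape)   # sample N candidates
        v = f(X)                                              # evaluate f at each sample
        _, elite_idx = v.topk(n_elite, largest=False)         # hard top-k: k lowest values
        elite = X[elite_idx]
        mu = elite.mean(dim=0)                                # closed-form refit (Prop. 1)
        sigma = elite.std(dim=0, unbiased=False)
    return mu

# usage on a simple non-convex objective over R^2
f = lambda X: (X.norm(dim=-1) - 1.0).pow(2) + 0.1 * torch.sin(5 * X[..., 0])
x_hat = cem(f, mu0=torch.zeros(2), sigma0=torch.ones(2))
```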
Differentiating through CEM's output with respect to the objective function's parameters, i.e. computing $\nabla_\theta \hat{x}$, is useful, e.g., to bring CEM into the end-to-end learning process in cases where there is otherwise a disconnection between the objective that is being learned and the objective that is induced by deploying CEM on top of those models. In the vanilla form presented above, the top-$k$ operation in eq. (2) makes $\hat{x}$ non-differentiable with respect to $\theta$. The function samples can usually be differentiated through with some estimator (Mohamed et al., 2019) such as the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014; Titsias & Lázaro-Gredilla, 2014), which we use in all of our experiments. The top-$k$ operation can be made differentiable by replacing it with a soft version (Martins & Kreutzer, 2017; Malaviya et al., 2018; Amos et al., 2019), or by using a stochastic oracle (Brookes & Listgarten, 2018). Here we use the Limited Multi-Label Projection (LML) layer (Amos et al., 2019), which is a Bregman projection of points from $\mathbb{R}^n$ onto the LML polytope shown in fig. 1 and defined by

$$\mathcal{L}_{n,k} := \{p \in \mathbb{R}^n \mid 0 \le p \le 1 \ \text{and} \ 1^\top p = k\}. \qquad (3)$$

The LML polytope is the set of points in the unit $n$-hypercube with coordinates that sum to $k$ and is useful for modeling in multi-label and top-$k$ settings. If $n$ is implied by the context we will leave it out and write $\mathcal{L}_k$. We propose a temperature-scaled LML variant to project onto the interior of the LML polytope with

$$\Pi_{\mathcal{L}_k}(x/\tau) := \arg\min_{0 < y < 1} \; -x^\top y - H_b(y) \quad \text{subject to} \quad 1^\top y = k, \qquad (4)$$

where $\tau > 0$ is the temperature parameter and $H_b(y) := -\sum_i \left[ y_i \log y_i + (1 - y_i) \log(1 - y_i) \right]$ is the binary entropy function. We introduce the hyper-parameter $\tau$ to show how DCEM captures CEM as a special case as $\tau \to 0^+$. Equation (4) is a convex optimization layer and can be solved in a negligible amount of time with a GPU-amenable bracketing method on the univariate dual, as described in Amos et al. (2019). The derivative $\nabla_x \Pi_{\mathcal{L}_k}(x/\tau)$ necessary for backpropagation can be easily computed by implicitly differentiating the KKT optimality conditions, as described in Amos et al. (2019).

We can use the LML layer to make a soft and differentiable version of eq. (2) as

$$\phi_{t+1} := \arg\max_\phi \sum_i \mathcal{I}_{t,i} \log g_\phi(X_{t,i}) \quad \text{subject to} \quad \mathcal{I}_t = \Pi_{\mathcal{L}_k}(-v_t/\tau), \qquad (5)$$

where projecting the negated values gives the lowest-cost samples the most weight. This is now a maximum weighted likelihood estimation problem (Markatou et al., 1997; 1998; Wang, 2001; Hu & Zidek, 2002), which still admits an analytic closed-form solution in many cases, e.g. for the natural exponential family (De Boer et al., 2005). Thus using the soft top-$k$ operation, with the reparameterization trick, e.g., on the samples from $g_\phi$, results in a differentiable variant of CEM that we call DCEM and summarize in alg. 1. We usually also normalize the values in each iteration to help separate the scaling of the values from the temperature parameter.

Proposition 2. The temperature-scaled LML layer $\Pi_{\mathcal{L}_k}(x/\tau)$ approaches the hard top-$k$ operation as $\tau \to 0^+$ when all components of $x$ are unique.

We prove this in app. A with the KKT conditions of eq. (4). The only difference between CEM and DCEM is the soft top-$k$ operation, thus when the soft top-$k$ operation approaches the hard top-$k$ operation, we can conclude:

Corollary 1. DCEM becomes CEM as $\tau \to 0^+$.

Proposition 3. With an isotropic Gaussian sampling distribution, the maximum weighted likelihood update in eq. (5) becomes $\mu_{t+1} := \frac{1}{k} \sum_i \mathcal{I}_{t,i} X_{t,i}$ and $\sigma^2_{t+1} := \frac{1}{k} \sum_i \mathcal{I}_{t,i} (X_{t,i} - \mu_{t+1})^2$, where the soft top-$k$ indexing set is $\mathcal{I}_t := \Pi_{\mathcal{L}_k}(-v_t/\tau)$.

This is well-known and is discussed in, e.g., De Boer et al. (2005), and can be proved by differentiating eq. (5).
Corollary 2. Prop. 3 captures Prop. 1 as $\tau \to 0^+$.

Algorithm 1: DCEM(f_θ, g_φ, φ_1; τ, N, k, T). DCEM minimizes a parameterized objective function f_θ and is differentiable w.r.t. θ. Each DCEM iteration samples from the distribution g_φ, starting with φ_1. DCEM enables the derivative of E[g_{φ_{T+1}}(·)] with respect to θ to be computed by differentiating all of the iterative operations.
  for t = 1 to T do
    [X_{t,i}]_{i=1}^N ~ g_{φ_t}(·)   ▷ Sample N points from the domain; differentiate with reparameterization.
    v_{t,i} = f_θ(X_{t,i})   ▷ Evaluate the objective function at those points.
    I_t = Π_{L_k}(−v_t/τ)   ▷ Compute the soft top-k projection of the values with eq. (4); implicitly differentiate.
    Update φ_{t+1} by solving the maximum weighted likelihood problem in eq. (5).
  end for
  return E[g_{φ_{T+1}}(·)]
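To make Algorithm 1 concrete, below is a self-contained PyTorch sketch under simplifying assumptions: it uses an isotropic Gaussian sampling distribution, and it replaces the exact LML projection (and its implicit differentiation of the KKT conditions, as in github.com/locuslab/lml) with a bisection on the dual variable followed by a single Newton-step re-attachment so the gradient approximately matches the implicit one. The names `soft_topk` and `dcem`, and the default hyper-parameters, are ours rather than the released implementation's.

```python
import torch

def soft_topk(x, k, tau=1.0, n_bisect=60):
    """Temperature-scaled LML-style projection (eq. 4): weights in (0,1) summing
    to k that up-weight the largest entries of x (DCEM feeds it -v/tau)."""
    z = x / tau
    with torch.no_grad():
        # KKT conditions give y_i = sigmoid(z_i - nu), with nu set so sum(y) = k.
        lo, hi = z.min() - 20.0, z.max() + 20.0
        for _ in range(n_bisect):
            nu = (lo + hi) / 2
            if torch.sigmoid(z - nu).sum() > k:
                lo = nu
            else:
                hi = nu
        nu_star = (lo + hi) / 2
    # One Newton-step "re-attachment": the value is unchanged at the solution, and
    # the gradient w.r.t. x approximates implicit differentiation of the KKT system.
    s = torch.sigmoid(z - nu_star)
    nu = nu_star + (s.sum() - k) / (s * (1 - s)).sum().clamp_min(1e-12)
    return torch.sigmoid(z - nu)

def dcem(f, mu0, sigma0, n_samples=100, n_elite=10, n_iter=10, tau=1.0):
    """Sketch of Algorithm 1 with an isotropic Gaussian g_phi: reparameterized
    sampling plus the soft top-k weights of eq. (5) keep the returned mean
    differentiable w.r.t. any parameters inside f."""
    mu, sigma = mu0, sigma0
    for _ in range(n_iter):
        eps = torch.randn(n_samples, *mu.shape)
        X = mu + sigma * eps                     # reparameterized samples from g_phi
        v = f(X)                                 # objective values (costs)
        v = (v - v.mean()) / (v.std() + 1e-8)    # normalize the values, as in the paper
        w = soft_topk(-v, k=n_elite, tau=tau)    # soft top-k of the lowest costs
        w = w.unsqueeze(-1)
        mu = (w * X).sum(dim=0) / n_elite        # weighted refit (Prop. 3)
        sigma = ((w * (X - mu) ** 2).sum(dim=0) / n_elite).sqrt()
    return mu                                    # E[g_{phi_{T+1}}], differentiable
```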
4. Applications

4.1. Energy-Based Learning

Energy-based learning for regression and classification estimates the conditional probability $P(y|x)$ of an output $y \in \mathcal{Y}$ given an input $x \in \mathcal{X}$ with a parameterized energy function $E_\theta(y|x) : \mathcal{Y} \times \mathcal{X} \to \mathbb{R}$ such that $P(y|x) \propto \exp\{-E_\theta(y|x)\}$. Predictions are made by solving the optimization problem

$$\hat{y} := \arg\min_y E_\theta(y|x). \qquad (6)$$

Historically, linear energy functions have been well-studied, e.g. in Taskar et al. (2005); LeCun et al. (2006), as they make eq. (6) easier to solve and analyze. More recently, non-convex energy functions that are parameterized by neural networks are being explored, a popular example being Structured Prediction Energy Networks (SPENs) (Belanger & McCallum, 2016), which propose to model $E_\theta$ with neural networks. Belanger et al. (2017) does supervised learning of SPENs by approximating eq. (6) with gradient descent that is then unrolled for $T$ steps, i.e. by starting with some $y_0$, making gradient updates $y_{t+1} := y_t - \gamma \nabla_y E_\theta(y_t|x)$, resulting in an output $\hat{y} := y_T$, defining a loss function $\mathcal{L}$ on top of $\hat{y}$, and doing learning with gradient updates $\nabla_\theta \mathcal{L}$ that go through the inner gradient steps.

In this context we can alternatively use DCEM to approximate eq. (6). One potential consideration when training deep energy-based models with approximations to eq. (6) is the impact and bias that the approximation is going to have on the energy surface. We note that gradient descent, e.g., may cause the energy surface to overfit to the number of gradient steps, so that the output of the approximate inference procedure isn't even a local minimum of the energy surface. One potential advantage of DCEM is that the output is more likely to be near a local minimum of the energy surface, so that, e.g., more test-time iterations can be used to refine the solution. We empirically illustrate the impact of the optimizer choice on a synthetic example in sect. 5.1.

4.2. Control and Reinforcement Learning

Our main application focus is in the continuous control setting, where we show how to use DCEM to learn a latent control space that is easier to solve over than the original problem and that induces a differentiable policy class allowing parts of the controller to be fine-tuned with auxiliary policy or imitation losses.

We are interested in controlling discrete-time dynamical systems with continuous state-action spaces. Let $H$ be the horizon length of the controller and $\mathcal{U}^H$ be the space of control sequences over this horizon length, e.g. $\mathcal{U}$ could be a multi-dimensional real space or box therein and $\mathcal{U}^H$ could be the Cartesian product of those spaces representing the sequence of controls over $H$ timesteps. We are interested in repeatedly solving the control optimization problem

$$\hat{u}_{1:H} := \arg\min_{u_{1:H} \in \mathcal{U}^H} \sum_{t=1}^{H} C_t(x_t, u_t) \quad \text{subject to} \quad x_1 = x_{\mathrm{init}}, \quad x_{t+1} = f^{\mathrm{trans}}(x_t, u_t), \qquad (7)$$

where we are in an initial system state $x_{\mathrm{init}}$ governed by deterministic system transition dynamics $f^{\mathrm{trans}}$, and wish to find the optimal sequence of actions $\hat{u}_{1:H}$ such that we find a valid trajectory $\{x_{1:H}, u_{1:H}\}$ that optimizes the cost $C_t(x_t, u_t)$. (We omit some explicit variables from the argmin operator when they can be inferred from the context.) Equation (7) can be seen as an instance of eq. (1) by moving the rollout of the dynamics into the cost function. Typically these controllers are used for receding horizon control (Mayne & Michalska, 1990), where only the first action $u_1$ is deployed on the real system, a new state is obtained from the system, and eq. (7) is solved again from the new initial state. In this case we can say the controller induces a policy $\pi(x_{\mathrm{init}}) := \hat{u}_1$ (we also omit the dependency of $\hat{u}_1$ on $x_{\mathrm{init}}$) that solves eq. (7) and depends on the cost and transition dynamics, and potential parameters therein. In all of the cases we consider, $f^{\mathrm{trans}}$ is deterministic, but may be approximated by a stochastic model for learning. Some model-based reinforcement learning settings consider cases where $f^{\mathrm{trans}}$ and $C$ are parameterized and potentially used in conjunction with another policy class. For sufficiently complex dynamical systems, eq. (7) is computationally expensive and numerically unstable to solve, and is rife with sub-optimal local minima.

The cross-entropy method is the state-of-the-art method for solving eq. (7) with neural network transitions $f^{\mathrm{trans}}$ (Chua et al., 2018; Hafner et al., 2018). CEM in this context samples full action sequences and refines the samples towards ones that solve the control problem. Hafner et al. (2018) uses CEM with 1000 samples in each iteration for 10 iterations with a horizon length of 12. This requires 1000 · 10 · 12 = 120,000 evaluations (!) of the transition dynamics to predict the control to be taken given a system state, and the transition dynamics may use a deep recurrent architecture as in Hafner et al. (2018) or an ensemble of models as in Chua et al. (2018). One comparison point here is that a model-free neural network policy takes a single evaluation for this prediction, albeit sometimes with a larger neural network.

The first application we show of DCEM in the continuous control setting is to learn a latent action space $\mathcal{Z}$ with a parameterized decoder $f^{\mathrm{dec}}_\theta : \mathcal{Z} \to \mathcal{U}^H$ that maps back up to the space of optimal action sequences, which we illustrate in fig. 8. For simplicity starting out, assume that the dynamics and cost functions are known (and perhaps even the ground-truth) and that the only problem is to estimate the decoder in isolation, although we will show later that these assumptions can be relaxed.
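Concretely, the objective that CEM evaluates in this setting, and that the embedded variant in eq. (8) below evaluates through the decoder, is just a rollout of the dynamics folded into the cost. The sketch below is illustrative: the names `control_cost`, `embedded_cost`, and the toy linear dynamics, quadratic cost, and MLP decoder are assumptions for exposition, not the paper's models.

```python
import torch
from torch import nn

def control_cost(u_seq, x_init, f_trans, cost):
    """Eq. (7) as an instance of eq. (1): fold the dynamics rollout into the
    objective so it is a plain function of the action sequence u_{1:H}."""
    x, total = x_init, 0.0
    for t in range(u_seq.shape[0]):
        total = total + cost(x, u_seq[t])
        x = f_trans(x, u_seq[t])
    return total

def embedded_cost(z, x_init, f_dec, f_trans, cost):
    """With a decoder f_dec: Z -> U^H, the embedded objective C(z; x_init) is the
    same rollout applied to the decoded action sequence."""
    return control_cost(f_dec(z), x_init, f_trans, cost)

# toy usage: linear dynamics, quadratic cost, and a small MLP decoder (all illustrative)
H, n_x, n_u, n_z = 20, 4, 1, 2
A, B = 0.9 * torch.eye(n_x), torch.randn(n_x, n_u)
f_trans = lambda x, u: x @ A.T + u @ B.T
cost = lambda x, u: x.pow(2).sum(-1) + 0.1 * u.pow(2).sum(-1)
mlp = nn.Sequential(nn.Linear(n_z, 64), nn.ReLU(), nn.Linear(64, H * n_u))
f_dec = lambda z: mlp(z).view(H, n_u)
print(embedded_cost(torch.zeros(n_z), torch.ones(n_x), f_dec, f_trans, cost))
```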
The motivation for having such a latent space and decoder is that, across the millions of times eq. (7) is solved for the same dynamical system with the same cost, the solution space of optimal action sequences $\hat{u}_{1:H} \in \mathcal{U}^H$ has an extremely large amount of spatial (over $\mathcal{U}$) and temporal (over time in $\mathcal{U}^H$) structure that is being ignored by CEM on the full space. The space of optimal action sequences only contains the knowledge of the trajectories that matter for solving the task at hand, such as different parts of an optimal gait, and not irrelevant control sequences. We argue that CEM over the full action space wastes a lot of computation considering irrelevant action sequences, and we show that these can be ignored by learning a latent space of more reasonable candidate solutions that we search over instead.

Given a decoder, the control optimization problem in eq. (7) can then be transformed into an optimization problem over $\mathcal{Z}$ as

$$\hat{z} := \arg\min_{z \in \mathcal{Z}} \; C_\theta(z; x_{\mathrm{init}}) := \sum_{t=1}^{H} C_t(x_t, u_t) \quad \text{subject to} \quad x_1 = x_{\mathrm{init}}, \quad x_{t+1} = f^{\mathrm{trans}}(x_t, u_t), \quad u_{1:H} = f^{\mathrm{dec}}_\theta(z), \qquad (8)$$

which is still a challenging non-convex optimization problem that searches over a decoder's input space to find the optimal control sequence. Equation (8) can be seen as an instance of eq. (1) by moving the decoder and the rollout of the dynamics into the cost function $C_\theta(z; x_{\mathrm{init}})$, and it can be solved with CEM or DCEM. We notationally leave out the dependence of $\hat{z}$ on $x_{\mathrm{init}}$ and $\theta$.

Algorithm 2: Learning an embedded control space with DCEM.
  Fixed Inputs: Dynamics f^trans, per-step state-action cost C_t(x_t, u_t) that induces C_θ(z; x_init), horizon H, full control space U^H, distribution over initial states D.
  Learned Inputs: Decoder f^dec_θ.
  while not converged do
    Sample initial state x_init ~ D.
    ẑ = argmin_{z ∈ Z} C_θ(z; x_init)   ▷ Solve the embedded control problem eq. (8) with DCEM.
    grad-update(∇_θ C_θ(ẑ))   ▷ Update the decoder to improve the controller's cost.
  end while

We propose in alg. 2 to use DCEM to approximately solve eq. (8) and then learn the decoder directly to optimize the performance of eq. (7). Every time we solve eq. (8) with DCEM and obtain an optimal latent representation $\hat{z}$ along with the induced trajectory $\{x_t, u_t\}$, we can take a gradient step to push down the resulting cost of that trajectory with $\nabla_\theta C_\theta(\hat{z})$, which goes through the DCEM process that uses the decoder to generate samples to obtain $\hat{z}$. The DCEM machinery behind this is not necessary if a reasonable local minimum is consistently found, as this is an instance of min-differentiation (Rockafellar & Wets, 2009, Theorem 10.13), but in practice this breaks down in non-convex cases when the minimum cannot be consistently found. DCEM helps by providing the derivative information $\nabla_\theta \hat{z}$. Antonova et al. (2019); Wang & Ba (2019) solve related problems in this space and we discuss them in sect. 2.3.

Learning an action embedding also requires derivatives through the transition dynamics and cost functions to compute $\nabla_\theta C_\theta(\hat{z})$, even if the ground-truth dynamics are being used. This gives the latent space the knowledge of how the control cost will change as the decoder's parameters change. DCEM in this setting also induces a differentiable policy class $\pi_\theta(x_{\mathrm{init}}) := u_1 = f^{\mathrm{dec}}_\theta(\hat{z})_1$. This enables a policy or imitation loss $\mathcal{J}$ to be defined on the policy so that the parts of the controller (decoder, cost, and transition dynamics) can be fine-tuned with gradient information from $\nabla_\theta \mathcal{J}$. In theory the same approach could be used with CEM on the full optimization problem in eq. (7). For realistic problems without modification this is intractable and memory-intensive, as it would require storing and backpropagating through every sampled trajectory, although as a future direction we note that it may be possible to delete some of the low-influence trajectories to help overcome this.

5. Experiments

Our experiments demonstrate applications of the differentiable cross-entropy method in structured prediction, control, and reinforcement learning. Section 5.1 illustrates a synthetic regression structured prediction task where gradient descent learns a counter-intuitive energy surface while DCEM retains the minimum.
Section 5.2 shows how DCEM can embed control optimization problems when the ground-truth model is known or unknown, and we show that PPO (Schulman et al., 2017) can help improve the embedded controller. Our PyTorch (Paszke et al., 2019) source code is openly available at github.com/facebookresearch/dcem and uses the PyTorch LML implementation from github.com/locuslab/lml to compute eq. (4).

5.1. Unrolling optimizers for regression and structured prediction

In this section we briefly explore the impact of the inner optimizer on the energy surface of a SPEN as discussed in sect. 4.1. For illustrative purposes we consider a simple unidimensional regression task where the ground-truth data is generated from $f(x) := x \sin(x)$ for $x \in [0, 2\pi]$. We model $P(y|x) \propto \exp\{-E_\theta(y|x)\}$ with a single neural network $E_\theta$ and make predictions $\hat{y}$ by solving the optimization problem in eq. (6). Given the ground-truth output $y^\star$, we use the loss $\mathcal{L}(\hat{y}, y^\star) := \|\hat{y} - y^\star\|_2^2$ and take gradient steps of this loss to shape the energy landscape.

We consider approximating eq. (6) with unrolled gradient descent and with DCEM using Gaussian sampling distributions. Both of these are trained to take 10 optimizer steps: we use an inner learning rate of 0.1 for gradient descent, and for DCEM we use 10 iterations with 100 samples per iteration and 10 elite candidates, with a temperature of 1. For both algorithms we start the initial iterate at $y_0 := 0$. We show in app. B that both of these models attain the same loss on the training dataset but, since this is a unidimensional regression task, we can visualize the entire energy surfaces over the joint input-output space in fig. 2.

[Figure 2 (left panel: Unrolled Gradient Descent; right panel: Differentiable CEM). We trained an energy-based model with unrolled gradient descent and DCEM for 1D regression onto the black target function. Each method unrolls through 10 optimizer steps. The contour surfaces show the (normalized/log-scaled) energy surfaces, highlighting that unrolled gradient descent models can overfit to the number of gradient steps. The lighter colors show areas of lower energy.]

This shows that gradient descent has learned to adapt from the initial $y_0$ position to the final position by descending along the function's surface, as we would expect, but there is no reason why the energy surface should be a local minimum around the last iterate $\hat{y} := y_{10}$. The energy surface learned by DCEM captures local minima around the regression target, as the sequence of Gaussian iterates is able to capture a more global view of the function landscape and needs to focus in on a minimum of it for regression. We show ablations in app. B from training with 10 inner iterations and then evaluating with a different number of iterations, and show that gradient descent quickly steps away from making reasonable predictions.

Discussion. Other tricks could be used to force the output to be at a local minimum with gradient descent, such as using multiple starting points or randomizing the number of gradient descent steps taken; our intention here is to highlight this behavior in the vanilla case. DCEM is also susceptible to overfitting to the hyper-parameters behind it in similar, albeit less obvious, ways.
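The outer training step for this section can be sketched as follows, assuming the `dcem` routine sketched in sect. 3 above is in scope. The energy network architecture and the optimizer here are illustrative stand-ins; the inner-solver hyper-parameters mirror the ones stated above.

```python
import torch
from torch import nn

# Illustrative energy network E_theta(y | x); the paper's exact architecture is in its appendix.
energy = nn.Sequential(nn.Linear(2, 128), nn.Tanh(), nn.Linear(128, 128), nn.Tanh(), nn.Linear(128, 1))
opt = torch.optim.Adam(energy.parameters(), lr=1e-3)

def train_step(x, y_star):
    # Inner problem (eq. 6): y_hat = argmin_y E_theta(y | x), approximated with DCEM
    # (10 iterations, 100 samples, 10 elites, tau = 1, initial iterate at 0, as in sect. 5.1).
    def f(Y):  # Y: (n_samples, 1) candidate outputs for a single x
        inp = torch.cat([x.expand(Y.shape[0], 1), Y], dim=-1)
        return energy(inp).squeeze(-1)
    y_hat = dcem(f, mu0=torch.zeros(1), sigma0=torch.ones(1),
                 n_samples=100, n_elite=10, n_iter=10, tau=1.0)
    loss = (y_hat - y_star).pow(2).sum()   # L(y_hat, y*) = ||y_hat - y*||^2
    opt.zero_grad()
    loss.backward()                        # gradients flow back through the DCEM iterates
    opt.step()
    return loss.item()
```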
5.2. Control

5.2.1. Starting Simple: Embedding the Cartpole's Action Space

We first show that it is possible to learn an embedded control space as discussed in sect. 4.2 in an isolated setting. We use the standard cartpole dynamical system from Barto et al. (1983) with a continuous state-action space. We assume that the ground-truth dynamics and cost are known and use the differentiable ground-truth dynamics and cost implemented in PyTorch from Amos et al. (2018). This isolates the learning problem to only learning the embedding, so that we can study what this is doing without the additional complications that arise from exploration, estimating the dynamics, learning a policy, and other non-stationarities. We show experiments with these assumptions relaxed in sect. 5.2.2.

We use DCEM and alg. 2 to learn a 2-dimensional latent space $\mathcal{Z} := [0, 1]^2$ that maps back up to the full control space $\mathcal{U}^H := [0, 1]^H$, where we focus on horizons of length $H := 20$. For DCEM over the embedded space we use 10 iterations with 100 samples in each iteration and 10 elite candidates, again with a temperature of 1. We show in app. C that we are able to recover the performance of an expert CEM controller that uses an order of magnitude more samples. Figure 3 shows a visualization of what the CEM and embedded DCEM iterates look like when solving the control optimization problem from the same initial system state. CEM spends a lot of evaluations on sequences in the control space that are unlikely to be optimal, such as ones that bifurcate between the boundaries of the control space at every timestep, while our embedded space is able to learn more reasonable proposals.

[Figure 3. Visualization of the samples that CEM and DCEM generate to solve the cartpole task starting from the same initial system state. The plots starting at the top-left show that CEM initially starts with no temporal knowledge over the control space, whereas embedded DCEM's latent space generates a more feasible distribution over control sequences to consider in each iteration. Embedded DCEM uses an order of magnitude fewer samples and is able to generate a better solution to the control problem. The contours on the bottom show the controller's cost surface C(z) from eq. (8) for the initial state; the lighter colors show regions with lower costs.]

5.2.2. Scaling Up to Continuous Locomotion

Next we relax the assumptions of having known transition dynamics and reward and show that we can learn a latent control space on top of a learned model on the cheetah.run and walker.walk continuous locomotion tasks from the DeepMind control suite (Tassa et al., 2018), using the MuJoCo physics engine (Todorov et al., 2012). We then fine-tune the policy induced by the embedded controller with PPO (Schulman et al., 2017), sending the policy loss directly back into the reward and latent embedding modules underlying the controller. Videos of our trained models are available at https://sites.google.com/view/diff-cross-entropy-method.

We start with a state-of-the-art model-based RL approach by noting that the PlaNet (Hafner et al., 2018) recurrent state space model (RSSM) is a reasonable architecture for proprioceptive-based control in addition to pixel-based control. We show the graphical model we use in fig. 8, which maintains deterministic hidden states $h_t$ and stochastic (proprioceptive) system observations $x_t$ and rewards $r_t$. We model transitions as $h_{t+1} = f^{\mathrm{trans}}_\theta(h_t, x_t)$, observations with $x_t \sim f^{\mathrm{odec}}_\theta(h_t)$, rewards with $r_t = f^{\mathrm{rew}}_\theta(h_t, x_t)$, and map from the latent action space to action sequences with $u_{1:T} = f^{\mathrm{dec}}_\theta(z)$.
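The decoder's interface is the only part DCEM relies on: a batch of latent samples in, a batch of bounded action sequences out. The module below is an illustrative stand-in; the paper's actual decoder architecture and sizes are described in its appendix.

```python
import torch
from torch import nn

class ActionDecoder(nn.Module):
    """Illustrative stand-in for f_dec: Z -> U^H (not the paper's architecture)."""
    def __init__(self, n_z=32, horizon=12, n_u=6, hidden=256):
        super().__init__()
        self.horizon, self.n_u = horizon, n_u
        self.net = nn.Sequential(
            nn.Linear(n_z, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * n_u),
        )

    def forward(self, z):                      # z: (N, n_z) latent CEM/DCEM samples
        u = self.net(z).view(-1, self.horizon, self.n_u)
        return torch.tanh(u)                   # squash into the bounded control space
```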
We follow the online training procedure of Hafner et al. (2018) to initialize all of the models except for the action decoder $f^{\mathrm{dec}}_\theta$, using approximately 2M timesteps. We then use a variant of alg. 2 to learn $f^{\mathrm{dec}}_\theta$ to embed the action space for control with DCEM, which we also do online while updating the models. We describe the full training process in app. D.

Our DCEM controller induces a differentiable policy class $\pi_\theta(x_{\mathrm{init}})$, where $\theta$ are the parameters of the models that impact the actions that the controller selects. We then use PPO to define a loss on top of this policy class and fine-tune the components (the decoder and reward module) so that they improve the episode reward rather than the maximum-likelihood fit of observed trajectories. We chose PPO because we thought it would be able to fine-tune the policy with just a few updates, since the policy starts at a reasonable point, but this did not turn out to be the case, and in the future other policy optimizers can be explored. We implement this by making our DCEM controller the policy in the PPO implementation by Kostrikov (2018). We provide more details behind our training procedure in app. D.

We evaluate our controllers on 100 test episodes each, and the rewards in fig. 4 show that DCEM is almost (but not exactly) able to recover the performance of doing CEM over the full action space while using an order of magnitude fewer trajectory samples (1,000 vs 10,000). PPO fine-tuning helps bridge the gap between the performances.

[Figure 4 (Latent DCEM and Latent DCEM+PPO). We evaluated our final models by running 100 episodes each on the cheetah and walker tasks. CEM over the full action space uses 10,000 trajectories for control at each time step while embedded DCEM samples only 1,000 trajectories. DCEM almost recovers the performance of CEM over the full action space, and PPO fine-tuning of the model-based components helps bridge the gap.]

Discussion. Future directions for DCEM in the control setting will help bring efficiency and policy-based fine-tuning to model-based reinforcement learning. Much more analysis and experimentation is necessary to achieve this, as we faced many issues getting the model-based cheetah and walker tasks to work that did not arise in the ground-truth cartpole task. We discuss this more in app. D. We also did not focus on the sample complexity of our algorithms while getting these proof-of-concept experiments working. Other reasonable baselines on this task could involve distilling the controller into a model-free policy and then doing search on top of that policy, as done in POPLIN (Wang & Ba, 2019).

6. Conclusions and Future Directions

We have shown how to differentiate through the cross-entropy method and have brought CEM into the end-to-end learning pipeline. Beyond further explorations in the energy-based learning and control contexts we showed here, DCEM can be used anywhere gradient descent is unrolled. We find this especially promising for meta-learning, where it can build on LEO (Rusu et al., 2018) or CAVIA (Zintgraf et al., 2019). Inspired by DCEM, other more powerful sampling-based optimizers could be made differentiable in the same way, potentially optimizers that leverage gradient-based information in the inner optimization steps (Sekhon & Mebane, 1998; Theodorou et al., 2010; Stulp & Sigaud, 2012; Maheswaranathan et al., 2018) or that also learn the hyper-parameters of structured optimizers (Li & Malik, 2016; Volpp et al., 2019; Chen et al., 2017).
Acknowledgments We thank David Belanger, Roberto Calandra, Yinlam Chow, Rob Fergus, Mohammad Ghavamzadeh, Edward Grefenstette, Shubhanshu Shekhar, and Zoltán Szabó for insightful discussions, and the anonymous reviewers for many useful suggestions and improvements to this paper. We acknowledge the Python community (Van Rossum & Drake Jr, 1995; Oliphant, 2007) for developing the core set of tools that enabled this work, including Py Torch (Paszke et al., 2019), Hydra (Yadan, 2019), Jupyter (Kluyver et al., 2016), Matplotlib (Hunter, 2007), seaborn (Waskom et al., 2018), numpy (Oliphant, 2006; Van Der Walt et al., 2011), pandas (Mc Kinney, 2012), and Sci Py (Jones et al., 2014). Agrawal, A., Amos, B., Barratt, S., Boyd, S., Diamond, S., and Kolter, J. Z. Differentiable convex optimization layers. In Advances in neural information processing systems, pp. 9562 9574, 2019a. Agrawal, A., Barratt, S., Boyd, S., and Stellato, B. Learning convex optimization control policies. ar Xiv preprint ar Xiv:1912.09529, 2019b. Amos, B. and Kolter, J. Z. Optnet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 136 145. JMLR. org, 2017. Amos, B., Xu, L., and Kolter, J. Z. Input convex neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 146 155. JMLR. org, 2017. Amos, B., Jimenez, I., Sacks, J., Boots, B., and Kolter, J. Z. Dif- ferentiable mpc for end-to-end planning and control. In Advances in Neural Information Processing Systems, pp. 8289 8300, 2018. Amos, B., Koltun, V., and Kolter, J. Z. The limited multi-label projection layer. ar Xiv preprint ar Xiv:1906.08707, 2019. Anand, A., Racah, E., Ozair, S., Bengio, Y., Côté, M.-A., and Hjelm, R. D. Unsupervised state representation learning in atari. ar Xiv preprint ar Xiv:1906.08226, 2019. Antonova, R., Rai, A., Li, T., and Kragic, D. Bayesian optimization in variational latent spaces with dynamic compression. ar Xiv preprint ar Xiv:1907.04796, 2019. Banijamali, E., Shu, R., Ghavamzadeh, M., Bui, H., and Ghodsi, A. Robust locally-linear controllable embedding. ar Xiv preprint ar Xiv:1710.05373, 2017. Bansal, S., Calandra, R., Xiao, T., Levine, S., and Tomlin, C. J. Goal-driven dynamics learning via Bayesian optimization. In IEEE Conference on Decision and Control (CDC), pp. 5168 5173, 2017. Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, pp. 834 846, 1983. The Differentiable Cross-Entropy Method Belanger, D. and Mc Callum, A. Structured prediction energy networks. In International Conference on Machine Learning, pp. 983 992, 2016. Belanger, D., Yang, B., and Mc Callum, A. End-to-end learning for structured prediction energy networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 429 439. JMLR. org, 2017. Ben Salem, M., Bachoc, F., Roustant, O., Gamboa, F., and Tomaso, L. Sequential dimension reduction for learning features of expensive black-box functions. working paper or preprint, February 2019. URL https://hal.archives-ouvertes. fr/hal-01688329. Brookes, D. H. and Listgarten, J. Design by adaptive sampling. ar Xiv preprint ar Xiv:1810.03714, 2018. Calandra, R., Peters, J., Rasmussen, C. E., and Deisenroth, M. P. Manifold gaussian processes for regression. In 2016 International Joint Conference on Neural Networks (IJCNN), pp. 3338 3345. 
IEEE, 2016. Chandak, Y., Theocharous, G., Kostas, J., Jordan, S., and Thomas, P. S. Learning action representations for reinforcement learning. ar Xiv preprint ar Xiv:1902.00183, 2019. Chen, Y., Hoffman, M. W., Colmenarejo, S. G., Denil, M., Lilli- crap, T. P., Botvinick, M., and de Freitas, N. Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 748 756. JMLR. org, 2017. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using rnn encoder-decoder for statistical machine translation. ar Xiv preprint ar Xiv:1406.1078, 2014. Chua, K., Calandra, R., Mc Allister, R., and Levine, S. Deep re- inforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754 4765, 2018. Co-Reyes, J. D., Liu, Y., Gupta, A., Eysenbach, B., Abbeel, P., and Levine, S. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings. ar Xiv preprint ar Xiv:1806.02813, 2018. De Boer, P.-T., Kroese, D. P., Mannor, S., and Rubinstein, R. Y. A tutorial on the cross-entropy method. Annals of operations research, 134(1):19 67, 2005. Domke, J. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pp. 318 326, 2012. Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126 1135. JMLR. org, 2017. Foerster, J., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., and Mordatch, I. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and Multi Agent Systems, pp. 122 130. In- ternational Foundation for Autonomous Agents and Multiagent Systems, 2018. Friedman, J., Hastie, T., and Tibshirani, R. The elements of statis- tical learning, volume 1. Springer series in statistics New York, 2001. Garnett, R., Osborne, M. A., and Hennig, P. Active learning of linear embeddings for gaussian processes. ar Xiv preprint ar Xiv:1310.6740, 2013. Gelada, C., Kumar, S., Buckman, J., Nachum, O., and Bellemare, M. G. Deepmdp: Learning continuous latent space models for representation learning. ar Xiv preprint ar Xiv:1906.02736, 2019. Ghosh, D., Gupta, A., and Levine, S. Learning actionable rep- resentations with goal-conditioned policies. ar Xiv preprint ar Xiv:1811.07819, 2018. Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández- Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera Iparraguirre, J., Hirzel, T. D., Adams, R. P., and Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 4(2):268 276, 2018. Gould, S., Fernando, B., Cherian, A., Anderson, P., Cruz, R. S., and Guo, E. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. ar Xiv preprint ar Xiv:1607.05447, 2016. Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. ar Xiv preprint ar Xiv:1811.04551, 2018. Hu, F. and Zidek, J. V. The weighted likelihood. Canadian Journal of Statistics, 30(3):347 371, 2002. Hunter, J. D. Matplotlib: A 2d graphics environment. Computing in science & engineering, 9(3):90, 2007. Ichter, B. and Pavone, M. 
Robot motion planning in learned latent spaces. IEEE Robotics and Automation Letters, 4(3):2407 2414, 2019. Ichter, B., Harrison, J., and Pavone, M. Learning sampling distri- butions for robot motion planning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7087 7094. IEEE, 2018. Jenni, S. and Favaro, P. Deep bilevel learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 618 633, 2018. Johnson, M., Duvenaud, D. K., Wiltschko, A., Adams, R. P., and Datta, S. R. Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pp. 2946 2954, 2016. Jones, E., Oliphant, T., and Peterson, P. {Sci Py}: Open source scientific tools for {Python}. 2014. Kingma, D. P. and Welling, M. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. Kirschner, J., Mutn y, M., Hiller, N., Ischebeck, R., and Krause, A. Adaptive and safe bayesian optimization in high dimensions via one-dimensional subspaces. ar Xiv preprint ar Xiv:1902.03229, 2019. The Differentiable Cross-Entropy Method Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B. E., Bus- sonnier, M., Frederic, J., Kelley, K., Hamrick, J. B., Grout, J., Corlay, S., et al. Jupyter notebooks-a publishing format for reproducible computational workflows. In ELPUB, pp. 87 90, 2016. Kostrikov, I. Pytorch implementations of reinforcement learn- ing algorithms. https://github.com/ikostrikov/ pytorch-a2c-ppo-acktr-gail, 2018. Kurutach, T., Tamar, A., Yang, G., Russell, S. J., and Abbeel, P. Learning plannable representations with causal infogan. In Advances in Neural Information Processing Systems, pp. 8733 8744, 2018. Lambert, N., Amos, B., Yadan, O., and Calandra, R. Objective mis- match in model-based reinforcement learning. ar Xiv preprint ar Xiv:2002.04523, 2020. Le Cun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F. A tutorial on energy-based learning. Predicting structured data, 1 (0), 2006. Levine, N., Chow, Y., Shu, R., Li, A., Ghavamzadeh, M., and Bui, H. Prediction, consistency, curvature: Representation learning for locally-linear control. ar Xiv preprint ar Xiv:1909.01506, 2019. Li, K. and Malik, J. Learning to optimize. ar Xiv preprint ar Xiv:1606.01885, 2016. Luo, R., Tian, F., Qin, T., Chen, E., and Liu, T.-Y. Neural architec- ture optimization. In Advances in neural information processing systems, pp. 7816 7827, 2018. Lynch, C., Khansari, M., Xiao, T., Kumar, V., Tompson, J., Levine, S., and Sermanet, P. Learning latent plans from play. ar Xiv preprint ar Xiv:1903.01973, 2019. Maheswaranathan, N., Metz, L., Tucker, G., Choi, D., and Sohl-Dickstein, J. Guided evolutionary strategies: Augmenting random search with surrogate gradients. ar Xiv preprint ar Xiv:1806.10230, 2018. Malaviya, C., Ferreira, P., and Martins, A. F. Sparse and con- strained attention for neural machine translation. ar Xiv preprint ar Xiv:1805.08241, 2018. Markatou, M., Basu, A., and Lindsay, B. Weighted likelihood esti- mating equations: The discrete case with applications to logistic regression. Journal of Statistical Planning and Inference, 57(2): 215 232, 1997. Markatou, M., Basu, A., and Lindsay, B. G. Weighted likelihood equations with bootstrap root search. Journal of the American Statistical Association, 93(442):740 750, 1998. Martins, A. F. and Kreutzer, J. Learning what s easy: Fully differ- entiable neural easy-first taggers. 
In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 349 362, 2017. Mayne, D. Q. and Michalska, H. Receding horizon control of nonlinear systems. IEEE Transactions on automatic control, 35 (7):814 824, 1990. Mc Kinney, W. Python for data analysis: Data wrangling with Pandas, Num Py, and IPython. " O Reilly Media, Inc.", 2012. Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. Unrolled generative adversarial networks. Co RR, abs/1611.02163, 2016. URL http://arxiv.org/abs/1611.02163. Miladinovi c, Ð., Gondal, M. W., Schölkopf, B., Buhmann, J. M., and Bauer, S. Disentangled state space representations. ar Xiv preprint ar Xiv:1906.03255, 2019. Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. Monte carlo gradient estimation in machine learning. ar Xiv preprint ar Xiv:1906.10652, 2019. Mukadam, M., Dong, J., Yan, X., Dellaert, F., and Boots, B. Continuous-time gaussian process motion planning via probabilistic inference. The International Journal of Robotics Research, 37(11):1319 1340, 2018. Oh, C., Gavves, E., and Welling, M. Bock: Bayesian optimiza- tion with cylindrical kernels. ar Xiv preprint ar Xiv:1806.01619, 2018. Okada, M., Rigazio, L., and Aoshima, T. Path integral networks: End-to-end differentiable optimal control. ar Xiv preprint ar Xiv:1706.09597, 2017. Oliphant, T. E. A guide to Num Py, volume 1. Trelgol Publishing Oliphant, T. E. Python for scientific computing. Computing in Science & Engineering, 9(3):10 20, 2007. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems, pp. 8026 8037, 2019. Pedregosa, F. Hyperparameter optimization with approximate gradient. ar Xiv preprint ar Xiv:1602.02355, 2016. Pereira, M., Fan, D. D., An, G. N., and Theodorou, E. Mpc- inspired neural network policies for sequential decision making. ar Xiv preprint ar Xiv:1802.05803, 2018. Rajeswaran, A., Finn, C., Kakade, S., and Levine, S. Meta-learning with implicit gradients. ar Xiv preprint ar Xiv:1909.04630, 2019. Rao, C. R. Convexity properties of entropy functions and analysis of diversity. Lecture Notes-Monograph Series, pp. 68 77, 1984. Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic back- propagation and approximate inference in deep generative models. ar Xiv preprint ar Xiv:1401.4082, 2014. Rockafellar, R. T. and Wets, R. J.-B. Variational analysis, volume 317. Springer Science & Business Media, 2009. Rubinstein, R. Y. Optimization of computer simulation models with rare events. European Journal of Operational Research, 99(1):89 112, 1997. Rusu, A. A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., and Hadsell, R. Meta-learning with latent embedding optimization. ar Xiv preprint ar Xiv:1807.05960, 2018. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. The Differentiable Cross-Entropy Method Sekhon, J. S. and Mebane, W. R. Genetic optimization using derivatives. Political Analysis, 7:187 210, 1998. Singh, S., Richards, S. M., Sindhwani, V., Slotine, J.-J. E., and Pavone, M. Learning stabilizable nonlinear dynamics with contraction-based regularization. ar Xiv preprint ar Xiv:1907.13122, 2019. Srinivas, A., Jabri, A., Abbeel, P., Levine, S., and Finn, C. Uni- versal planning networks. ar Xiv preprint ar Xiv:1804.00645, 2018. Stulp, F. 
and Sigaud, O. Path integral policy improvement with covariance matrix adaptation. ar Xiv preprint ar Xiv:1206.4621, 2012. Tasfi, N. and Capretz, M. Dynamic planning networks. ar Xiv preprint ar Xiv:1812.11240, 2018. Taskar, B., Chatalbashev, V., Koller, D., and Guestrin, C. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd international conference on Machine learning, pp. 896 903. ACM, 2005. Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. Deepmind control suite. ar Xiv preprint ar Xiv:1801.00690, 2018. Theodorou, E., Buchli, J., and Schaal, S. A generalized path integral control approach to reinforcement learning. The Journal of Machine Learning Research, 11:3137 3181, 2010. Titsias, M. and Lázaro-Gredilla, M. Doubly stochastic variational bayes for non-conjugate inference. In International conference on machine learning, pp. 1971 1979, 2014. Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026 5033. IEEE, 2012. Van Der Walt, S., Colbert, S. C., and Varoquaux, G. The numpy ar- ray: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22, 2011. Van Rossum, G. and Drake Jr, F. L. Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam, 1995. Volpp, M., Fröhlich, L., Doerr, A., Hutter, F., and Daniel, C. Meta- learning acquisition functions for bayesian optimization. ar Xiv preprint ar Xiv:1904.02642, 2019. Wang, S. X. Maximum weighted likelihood estimation. Ph D thesis, University of British Columbia, 2001. Wang, T. and Ba, J. Exploring model-based planning with policy networks. ar Xiv preprint ar Xiv:1906.08649, 2019. Wang, Z., Hutter, F., Zoghi, M., Matheson, D., and de Feitas, N. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55: 361 387, 2016. Waskom, M., Botvinnik, O., O Kane, D., Hobson, P., Ostblom, J., Lukauskas, S., Gemperline, D. C., Augspurger, T., Halchenko, Y., Cole, J. B., Warmenhoven, J., de Ruiter, J., Pye, C., Hoyer, S., Vanderplas, J., Villalba, S., Kunter, G., Quintero, E., Bachant, P., Martin, M., Meyer, K., Miles, A., Ram, Y., Brunner, T., Yarkoni, T., Williams, M. L., Evans, C., Fitzgerald, C., Brian, and Qalieh, A. mwaskom/seaborn: v0.9.0 (july 2018), July 2018. URL https://doi.org/10.5281/zenodo.1313201. Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pp. 2746 2754, 2015. Yadan, O. Hydra - a framework for elegantly configuring complex applications. Github, 2019. URL https://github.com/ facebookresearch/hydra. Yu, T., Shevchuk, G., Sadigh, D., and Finn, C. Unsupervised visuomotor control through distributional planning networks. ar Xiv preprint ar Xiv:1902.05542, 2019. Zhang, M., Vikram, S., Smith, L., Abbeel, P., Johnson, M. J., and Levine, S. Solar: Deep structured latent representations for model-based reinforcement learning. ar Xiv preprint ar Xiv:1808.09105, 2018. Zhang, Y., Hare, J., and Prügel-Bennett, A. Deep set prediction networks. ar Xiv preprint ar Xiv:1906.06565, 2019. Zintgraf, L., Shiarli, K., Kurin, V., Hofmann, K., and Whiteson, S. Fast context adaptation via meta-learning. In International Conference on Machine Learning, pp. 7693 7702, 2019.