# Regularizing Trajectory Optimization with Denoising Autoencoders

Rinu Boney (Aalto University & Curious AI, rinu.boney@aalto.fi), Norman Di Palo (Sapienza University of Rome, normandipalo@gmail.com), Mathias Berglund (Curious AI), Alexander Ilin (Aalto University & Curious AI), Juho Kannala (Aalto University), Antti Rasmus (Curious AI), Harri Valpola (Curious AI)

*Equal contribution, rest in alphabetical order.*

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

## Abstract

Trajectory optimization using a learned model of the environment is one of the core elements of model-based reinforcement learning. This procedure often suffers from exploiting inaccuracies of the learned model. We propose to regularize trajectory optimization by means of a denoising autoencoder that is trained on the same trajectories as the model of the environment. We show that the proposed regularization leads to improved planning with both gradient-based and gradient-free optimizers. We also demonstrate that using regularized trajectory optimization leads to rapid initial learning in a set of popular motor control tasks, which suggests that the proposed approach can be a useful tool for improving sample efficiency.

## 1 Introduction

State-of-the-art reinforcement learning (RL) often requires a large number of interactions with the environment to learn even relatively simple tasks [11]. It is generally believed that model-based RL can provide better sample efficiency [9, 2, 5], but showing this in practice has been challenging. In this paper, we propose a way to improve planning in model-based RL and show that it can lead to improved performance and better sample efficiency.

In model-based RL, planning is done by computing the expected result of a sequence of future actions using an explicit model of the environment. Model-based planning has been demonstrated to be efficient in many applications where the model (a simulator) can be built from first principles. For example, model-based control is widely used in robotics and has been used to solve challenging tasks such as human locomotion [34, 35] and dexterous in-hand manipulation [21].

In many applications, however, we do not have the luxury of an accurate simulator of the environment. Firstly, building even an approximate simulator can be very costly, even for processes whose dynamics are well understood. Secondly, it can be challenging to align the state of an existing simulator with the state of the observed process in order to plan. Thirdly, the environment is often non-stationary due to, for example, hardware failures in robotics, changes of the input feed, and deactivation of materials in industrial process control. Thus, learning the model of the environment is the only viable option in many applications, and learning needs to be done on a live system. Since many real-world systems are very complex, we are likely to need powerful function approximators, such as deep neural networks, to learn the dynamics of the environment.

However, planning with a learned (and therefore inaccurate) model of the environment is very difficult in practice. The process of optimizing the sequence of future actions to maximize the expected return (which we call trajectory optimization) can easily exploit the inaccuracies of the model and suggest a very unreasonable plan which produces highly over-optimistic predicted rewards.
This optimization process works similarly to adversarial attacks [1, 13, 33, 7] where the input of a trained model is modified to achieve the desired output. In fact, a more efficient trajectory optimizer is more likely to fall into this trap. This can arguably be the reason why gradient-based optimization (which is very efficient at for example learning the models) has not been widely used for trajectory optimization. In this paper, we study this adversarial effect of model-based planning in several environments and show that it poses a problem particularly in high-dimensional control spaces. We also propose to remedy this problem by regularizing trajectory optimization using a denoising autoencoder (DAE) [37]. The DAE is trained to denoise trajectories that appeared in the past experience and in this way the DAE learns the distribution of the collected trajectories. During trajectory optimization, we use the denoising error of the DAE as a regularization term that is subtracted from the maximized objective function. The intuition is that the denoising error will be large for trajectories that are far from the training distribution, signaling that the dynamics model predictions will be less reliable as it has not been trained on such data. Thus, a good trajectory has to give a high predicted return and it can be only moderately novel in the light of past experience. In the experiments, we demonstrate that the proposed regularization significantly diminishes the adversarial effect of trajectory optimization with learned models. We show that the proposed regularization works well with both gradient-free and gradient-based optimizers (experiments are done with cross-entropy method [3] and Adam [14]) in both open-loop and closed-loop control. We demonstrate that improved trajectory optimization translates to excellent results in early parts of training in standard motor-control tasks and achieve competitive performance after a handful of interactions with the environment. 2 Model-Based Reinforcement Learning In this section, we explain the basic setup of model-based RL and present the notation used. At every time step t, the environment is in state st, the agent performs action at, receives reward rt = r(st, at) and the environment transitions to new state st+1 = f(st, at). The agent acts based on the observations ot = o(st) which is a function of the environment state. In a fully observable Markov decision process (MDP), the agent observes full state ot = st. In a partially observable Markov decision process (POMDP), the observation ot does not completely reveal st. The goal of the agent is select actions {a0, a1, . . .} so as to maximize the return, which is the expected cumulative reward E [P t=0 r(st, at)]. In the model-based approach, the agent builds the dynamics model of the environment (forward model). For a fully observable environment, the forward model can be a fully-connected neural network trained to predict the state transition from time t to t + 1: st+1 = fθ(st, at) . (1) In partially observable environments, the forward model can be a recurrent neural network trained to directly predict the future observations based on past observations and actions: ot+1 = fθ(o0, a0, . . . , ot, at) . (2) In this paper, we assume access to the reward function and that it can be computed from the agent observations, that is rt = r(ot, at). At each time step t, the agent uses the learned forward model to plan the sequence of future actions {at, . . . 
At each time step $t$, the agent uses the learned forward model to plan the sequence of future actions $\{a_t, \ldots, a_{t+H}\}$ so as to maximize the expected cumulative future reward

$$G(a_t, \ldots, a_{t+H}) = \mathbb{E}\left[\sum_{\tau=t}^{t+H} r(o_\tau, a_\tau)\right],$$

$$a_t, \ldots, a_{t+H} = \arg\max G(a_t, \ldots, a_{t+H}).$$

This process is called trajectory optimization. The agent uses the learned model of the environment to compute the objective function $G(a_t, \ldots, a_{t+H})$: the model (1) or (2) is unrolled $H$ steps into the future using the current plan $\{a_t, \ldots, a_{t+H}\}$.

Algorithm 1: End-to-end model-based reinforcement learning
- Collect data $D$ with a random policy.
- For each episode:
  - Train the dynamics model $f_\theta$ using $D$.
  - For each time step $t$ until the episode is over:
    - Optimize the trajectory $\{a_t, o_{t+1}, \ldots, a_{t+H}, o_{t+H+1}\}$.
    - Implement the first action $a_t$ and get the new observation $o_{t+1}$.
  - Add the data $\{(o_1, a_1, \ldots, a_T, o_T)\}$ from the last episode to $D$.

The optimized sequence of actions from trajectory optimization can be directly applied to the environment (open-loop control). It can also be provided as a suggestion to a human operator, with the possibility for the human to change the plan (human-in-the-loop). Open-loop control is challenging because the dynamics model has to be able to make accurate long-range predictions. An approach which works better in practice is to take only the first action of the optimized trajectory and then replan at each step (closed-loop control). Thus, in closed-loop control, we account for possible modeling errors and feedback from the environment. In the control literature, this flavor of model-based RL is called model-predictive control (MPC) [22, 30, 16, 24].

The typical sequence of steps performed in model-based RL is: 1) collect data, 2) train the forward model $f_\theta$, 3) interact with the environment using MPC (this involves trajectory optimization at every time step), 4) store the data collected during the last interaction and continue from step 2. The procedure is outlined in Algorithm 1.

## 3 Regularized Trajectory Optimization

### 3.1 Problem with using learned models for planning

In this paper, we focus on the inner loop of model-based RL, which is trajectory optimization using a learned forward model $f_\theta$. Potential inaccuracies of the trained model cause substantial difficulties for the planning process. Rather than optimizing what really happens, planning can easily end up exploiting the weaknesses of the predictive model. Planning is effectively an adversarial attack against the agent's own forward model. This results in a wide gap between expectations based on the model and what actually happens.

We demonstrate this problem using a simple industrial process control benchmark from [28]. The problem is to control a continuous nonlinear reactor by manipulating three valves which control the flows in two feeds and one output stream. Further details of the process and the control problem are given in Appendix A.

The task considered in [28] is to change the product rate of the process from 100 to 130 kmol/h. Fig. 1a shows how this task can be performed using the set of PI controllers proposed in [28]. We trained a forward model of the process using a recurrent neural network (2) and data collected by implementing the PI control strategy for a set of randomly generated targets. We then optimized the trajectory for the considered task using gradient-based optimization, which produced the results in Fig. 1b. One can see that the proposed control signals change abruptly and that the trajectory imagined by the model deviates significantly from reality. For example, the pressure constraint (of max 300 kPa) is violated. This example demonstrates how planning can easily exploit the weaknesses of the predictive model.
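To illustrate how an unregularized gradient-based optimizer can interact with a learned model in the way just described, the following is a minimal sketch of trajectory optimization with Adam: the model is unrolled $H$ steps and the action sequence is updated to maximize the predicted return $G$. The functions `forward_model` and `reward_fn` and all hyperparameters are assumptions for illustration; `reward_fn` is assumed to be differentiable with respect to its tensor inputs.

```python
import torch

def optimize_trajectory(forward_model, reward_fn, s0, action_dim,
                        horizon=25, iters=100, lr=0.05):
    """Gradient-based trajectory optimization: maximize the predicted return G
    by backpropagating through the unrolled learned model."""
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        s = s0
        predicted_return = 0.0
        for t in range(horizon):
            a = torch.tanh(actions[t])           # keep actions in a bounded range
            predicted_return = predicted_return + reward_fn(s, a)
            s = forward_model(s, a)              # imagined next state
        loss = -predicted_return                 # maximize G by minimizing -G
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.tanh(actions).detach()
```

Without a regularizer, nothing constrains the optimized actions to stay where the model is accurate, which is exactly the failure mode shown in Fig. 1b; Section 3.2 adds the missing term.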
### 3.2 Regularizing Trajectory Optimization with Denoising Autoencoders

We propose to regularize trajectory optimization with a denoising autoencoder (DAE). The idea is that we want to reward familiar trajectories and penalize unfamiliar ones, because the model is likely to make larger errors for the unfamiliar ones. This can be achieved by adding a regularization term to the objective function:

$$G_{\text{reg}} = G + \alpha \log p(o_t, a_t, \ldots, o_{t+H}, a_{t+H}), \qquad (3)$$

where $p(o_t, a_t, \ldots, o_{t+H}, a_{t+H})$ is the probability of observing the given trajectory in the past experience and $\alpha$ is a tuning hyperparameter.

Figure 1: Open-loop planning for a continuous nonlinear two-phase reactor from [28]. Subfigures: (a) multiloop PI control, (b) no regularization, (c) DAE regularization. Three subplots in each subfigure show three measured variables (solid lines): product rate [kmol/h], pressure [kPa] and A in the purge [mole %]. The black curves represent the model's imagination, while the red curves represent reality if those controls are applied in open-loop mode. The targets for the variables are shown with dashed lines. The fourth (lower right) subplot shows the three manipulated variables [%]: the valve for feed 1 (blue), the valve for feed 2 (red) and the valve for stream 3 (green).

Figure 2: Example fragment of a computational graph used during trajectory optimization in an MDP. Here, the window size is $w = 1$, that is, the DAE penalty term is $c_1 = \lVert g([s_1, a_1]) - [s_1, a_1] \rVert^2$.

In practice, instead of using the joint probability of the whole trajectory, we use marginal probabilities over short windows of size $w$:

$$G_{\text{reg}} = G + \alpha \sum_{\tau} \log p(x_\tau), \qquad (4)$$

where $x_\tau = \{o_\tau, a_\tau, \ldots, o_{\tau+w}, a_{\tau+w}\}$ is a short window of the optimized trajectory.

Suppose we want to find the optimal sequence of actions by maximizing (4) with a gradient-based optimization procedure. We can compute the gradients $\partial G_{\text{reg}} / \partial a_i$ by backpropagation in a computational graph in which the trained forward model is unrolled into the future (see Fig. 2). In such a backpropagation-through-time procedure, one needs to compute the gradient of the regularization term with respect to the actions $a_i$, which involves the term

$$\nabla_{x_\tau} \log p(x_\tau), \qquad (5)$$

where we denote by $x_\tau$ the concatenated vector of observations $o_\tau, \ldots, o_{\tau+w}$ and actions $a_\tau, \ldots, a_{\tau+w}$ over a window of size $w$. Thus, to enable a regularized gradient-based optimization procedure, we need a means to compute $\nabla_{x_\tau} \log p(x_\tau)$.

In order to evaluate $\log p(x_\tau)$ (or its derivative), one needs to train a separate model $p(x_\tau)$ of the past experience, which is a task of unsupervised learning. In principle, any probabilistic model can be used for that. In this paper, we propose to regularize trajectory optimization with a denoising autoencoder (DAE), which does not build an explicit probabilistic model $p(x_\tau)$ but rather learns to approximate the derivative of the log probability density. The theory of denoising [23, 27] states that the optimal denoising function $g(\tilde{x})$ (for zero-mean Gaussian corruption) is given by

$$g(\tilde{x}) = \tilde{x} + \sigma_n^2 \nabla_{\tilde{x}} \log p(\tilde{x}),$$

where $p(\tilde{x})$ is the probability density function of data $\tilde{x}$ corrupted with noise and $\sigma_n$ is the standard deviation of the Gaussian corruption. Thus, the DAE-denoised signal minus the original gives the gradient of the log-probability of the data distribution convolved with a Gaussian:

$$\nabla_{\tilde{x}} \log p(\tilde{x}) = \frac{g(\tilde{x}) - \tilde{x}}{\sigma_n^2}.$$

Assuming $\nabla_x \log p(\tilde{x}) \approx \nabla_x \log p(x)$, the chain rule yields

$$\frac{\partial}{\partial a_i} \log p(x_\tau) \propto \left( g(x_\tau) - x_\tau \right)^{\!\top} \frac{\partial x_\tau}{\partial a_i}. \qquad (6)$$

Using $\nabla_x \log p(\tilde{x})$ instead of $\nabla_x \log p(x)$ can behave better in practice because it is similar to replacing $p(x)$ with its Parzen window estimate [36].

In automatic differentiation software, this gradient can be computed by subtracting the penalty term $\lVert g(x_\tau) - x_\tau \rVert^2$ from $G$ and stopping the gradient propagation through $g$. In practice, stopping the gradient through $g$ did not yield any benefits in our experiments compared to simply subtracting the penalty term $\lVert g(x_\tau) - x_\tau \rVert^2$ from the cumulative reward, so we used the simple penalty term in our experiments. This kind of regularization can also easily be used with gradient-free optimization methods such as the cross-entropy method (CEM) [3].
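The following is a minimal sketch of the proposed regularizer, under the same PyTorch assumptions as the earlier sketches: a DAE $g$ is trained to denoise windows $x_\tau$ of past observation-action pairs corrupted with Gaussian noise, and during planning its denoising error $\lVert g(x_\tau) - x_\tau \rVert^2$ is subtracted (scaled by $\alpha$) from the predicted return. The architecture, noise level and weight $\alpha$ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """DAE g(x) over concatenated (observation, action) windows x_tau."""
    def __init__(self, window_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(window_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, window_dim),
        )

    def forward(self, x):
        return self.net(x)

def train_dae(dae, windows, sigma=0.1, epochs=100, lr=1e-3):
    """Train g to reconstruct clean windows from Gaussian-corrupted ones."""
    opt = torch.optim.Adam(dae.parameters(), lr=lr)
    for _ in range(epochs):
        noisy = windows + sigma * torch.randn_like(windows)
        loss = ((dae(noisy) - windows) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return dae

def dae_penalty(dae, trajectory_windows):
    """Denoising error of a planned trajectory; large for unfamiliar windows."""
    recon = dae(trajectory_windows)
    return ((recon - trajectory_windows) ** 2).sum()

# During planning the regularized objective is then
#   G_reg = G - alpha * dae_penalty(dae, windows_of_planned_trajectory),
# which can be maximized with Adam (backpropagating through g) or used to
# score candidate action sequences in a gradient-free optimizer such as CEM.
```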
Our goal is to tackle high-dimensional problems with expressive dynamics models. Neural networks tend to fare better than many other techniques at modeling high-dimensional distributions. However, using a neural network (or any other flexible parameterized model) to estimate the input distribution poses a dilemma: the regularizing network, which is supposed to keep planning from exploiting the inaccuracies of the dynamics model, will itself have weaknesses which planning could then exploit. Clearly, the DAE will also have inaccuracies, but planning will not exploit them because, unlike most other density models, a DAE develops an explicit model of the gradient of the log probability density. The effect of adding DAE regularization in the industrial process control benchmark discussed in the previous section is shown in Fig. 1c.

### 3.3 Related work

Several methods have been proposed for planning with learned dynamics models. Locally linear time-varying models [17, 19], Gaussian processes [8, 15] and mixtures of Gaussians [29] are data-efficient but have problems scaling to high-dimensional environments. Recently, deep neural networks have been successfully applied to model-based RL. Nagabandi et al. [24] use deep neural networks as dynamics models in model-predictive control to achieve good performance, and then show how model-based RL can be fine-tuned with a model-free approach to achieve even better performance. Chua et al. [5] introduce PETS, a method to improve model-based performance by estimating and propagating uncertainty with an ensemble of networks and sampling techniques. They demonstrate how their approach can beat several recent model-based and model-free techniques. Clavera et al. [6] combine model-based RL and meta-learning with MB-MPO, training a policy to quickly adapt to slightly different learned dynamics models, thus enabling faster learning.

Levine and Koltun [20] and Kumar et al. [17] use a KL divergence penalty between action distributions to stay close to the training distribution. Similar bounds are also used to stabilize training of policy gradient methods [31, 32]. While such a KL penalty bounds the evolution of action distributions, the proposed method also bounds the familiarity of states, which could be important in high-dimensional state spaces. While penalizing unfamiliar states also penalizes exploration, it allows for more controlled and efficient exploration. Exploration is out of the scope of this paper but was studied in [10], where a non-zero optimum of the proposed DAE penalty was used as an intrinsic reward to alternate between familiarity and exploration.
## 4 Experiments on Motor Control

We show the effect of the proposed regularization for control in standard MuJoCo environments available in [4]: Cartpole, Reacher, Pusher, Half-cheetah and Ant. See the description of the environments in Appendix B. As the baseline we use the Probabilistic Ensembles with Trajectory Sampling (PETS) model from [5], which achieves the best reported results on all the considered tasks except for Ant.

Figure 3: Visualization of trajectory optimization at timestep t = 50 (cumulative reward as a function of planning iterations, for trajectory optimization with CEM and with Adam, each with and without DAE regularization). Each row uses the same model but different optimization methods. The models are obtained after 5 episodes of end-to-end training. Top row: Cartpole environment. Bottom row: Half-cheetah environment. The red lines denote the rewards predicted by the model (imagination) and the black lines denote the true rewards obtained when applying the sequence of optimized actions (reality). For a low-dimensional action space (Cartpole), trajectory optimizers do not exploit inaccuracies of the dynamics model and hence DAE regularization does not affect the performance noticeably. For a higher-dimensional action space (Half-cheetah), gradient-based optimization without any regularization easily exploits inaccuracies of the dynamics model, but DAE regularization is able to prevent this. The effect is less pronounced with gradient-free optimization but still noticeable.

The PETS model consists of an ensemble of probabilistic neural networks and uses particle-based trajectory sampling to regularize trajectory optimization. We re-implemented the PETS model using the code provided by the authors as a reference.

### 4.1 Regularized trajectory optimization with models trained with PETS

In MPC, the innermost loop is open-loop control, which is then turned into closed-loop control by taking in new observations and replanning after each action. Fig. 3 illustrates the adversarial effect during open-loop trajectory optimization and how DAE regularization mitigates it. In the Cartpole environment, the learned model is very good already after a few episodes of data and trajectory optimization stays within the data distribution. As there is no problem to begin with, regularization does not improve the results. In the Half-cheetah environment, trajectory optimization manages to exploit the inaccuracies of the model, which is particularly apparent with gradient-based Adam. DAE regularization improves both optimizers, but the effect is much stronger with Adam. The problem is exacerbated in closed-loop control, since optimization continues from the solution achieved at the previous time step, effectively iterating more per action.
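For reference, here is a minimal sketch of CEM-based trajectory optimization with the optional DAE penalty. The helpers `rollout_return` (predicted return of a candidate action sequence under the learned model) and `dae_penalty_fn` (denoising error of the corresponding imagined trajectory) are assumptions along the lines of the earlier sketches; the population size, elite count and iteration budget are illustrative, not the PETS settings.

```python
import torch

def cem_plan(rollout_return, horizon, action_dim, pop=400, n_elites=40,
             iters=5, alpha=1.0, dae_penalty_fn=None):
    """Cross-entropy method over action sequences, optionally DAE-regularized."""
    mean = torch.zeros(horizon, action_dim)
    std = 0.5 * torch.ones(horizon, action_dim)
    with torch.no_grad():
        for _ in range(iters):
            # Sample candidate action sequences around the current distribution.
            candidates = mean + std * torch.randn(pop, horizon, action_dim)
            candidates = candidates.clamp(-1.0, 1.0)
            scores = torch.stack([rollout_return(c) for c in candidates])
            if dae_penalty_fn is not None:
                # G_reg = G - alpha * denoising error of the imagined trajectory.
                penalties = torch.stack([dae_penalty_fn(c) for c in candidates])
                scores = scores - alpha * penalties
            # Refit the sampling distribution to the best candidates.
            elite = candidates[scores.topk(n_elites).indices]
            mean, std = elite.mean(dim=0), elite.std(dim=0)
    return mean  # in MPC, only the first action mean[0] is executed
```

In closed-loop use, the returned plan is also a natural warm start for planning at the next time step, which is the behavior referred to above as iterating more per action.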
We demonstrate how regularization can improve closed-loop trajectory optimization in the Half-cheetah environment. We first train three PETS models for 300 episodes using the best hyperparameters reported in [5]. We then evaluate the performance of the three models on five episodes using four different trajectory optimizers: 1) the cross-entropy method (CEM), which was used during training of the PETS models, 2) Adam, 3) CEM with DAE regularization and 4) Adam with DAE regularization. The results, averaged across the three models and the five episodes, are presented in Table 1.

Table 1: Comparison of PETS with CEM and Adam optimizers in Half-cheetah. A dash indicates that planning leads to unstable simulator states.

| Optimizer | CEM | CEM + DAE | Adam | Adam + DAE |
|---|---|---|---|---|
| Average Return | 10955 ± 2865 | 12967 ± 3216 | – | 12796 ± 2716 |

We first note that planning with Adam fails completely without regularization: the proposed actions lead to unstable states of the simulator. Using Adam with DAE regularization fixes this problem, and the obtained results are better than with the CEM method originally used in PETS. CEM appears to regularize trajectory optimization, but not as effectively as CEM+DAE. These closed-loop results are consistent with the open-loop results in Fig. 3.

### 4.2 End-to-end training with regularized trajectory optimization

In the following experiments, we study the performance of end-to-end training with different trajectory optimizers used during training. Our agent learns according to Algorithm 1. Since the environments are fully observable, we use a feedforward neural network as in (1) to model the dynamics of the environment. Unlike PETS, we do not use an ensemble of probabilistic networks as the forward model. We use a single probabilistic network which predicts the mean and variance of the next state (assuming a Gaussian distribution) given the current state and action. Although we only use the mean prediction, we found that also training to predict the variance improves the stability of training.

For all environments, we use a dynamics model with the same architecture: three hidden layers of size 200 with the Swish non-linearity [26]. Similar to prior work, we train the dynamics model to predict the difference between $s_{t+1}$ and $s_t$ instead of predicting $s_{t+1}$ directly. We train the dynamics model for 100 or more epochs (see Appendix C) after every episode. This is a larger number of updates compared to the five epochs used in [5]. We found that the increased number of updates has a large effect on the performance of a single probabilistic model but a smaller effect on the ensemble of models used in PETS. This effect is shown in Fig. 6.

For the denoising autoencoder, we use the same architecture as for the dynamics model. The state-action pairs from past episodes are corrupted with zero-mean Gaussian noise and the DAE is trained to denoise them. Important hyperparameters used in our experiments are reported in Appendix C. For DAE-regularized trajectory optimization we used either CEM or Adam as the optimizer.
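As a concrete sketch of the dynamics model described above (a single probabilistic network with three hidden layers of width 200 and the Swish non-linearity, predicting a Gaussian over the state difference), the code below shows one way this could look in PyTorch; the log-variance clamping bounds and loss details are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)

class ProbabilisticDynamics(nn.Module):
    """Predicts mean and log-variance of the state difference s_{t+1} - s_t."""
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), Swish(),
            nn.Linear(hidden, hidden), Swish(),
            nn.Linear(hidden, hidden), Swish(),
            nn.Linear(hidden, 2 * state_dim),   # mean and log-variance of the delta
        )

    def forward(self, state, action):
        out = self.net(torch.cat([state, action], dim=-1))
        mean_delta, log_var = out.chunk(2, dim=-1)
        return mean_delta, log_var.clamp(-10.0, 2.0)

def gaussian_nll(model, states, actions, next_states):
    """Negative log-likelihood of the observed state differences."""
    mean_delta, log_var = model(states, actions)
    target = next_states - states
    return (((target - mean_delta) ** 2) * torch.exp(-log_var) + log_var).mean()
```

Only the mean prediction (`states + mean_delta`) would be used during planning, but training the variance head as well was found to stabilize training, as noted above.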
The learning progress of the compared algorithms is presented in Fig. 4. Note that we report the average return across different seeds, not the maximum return seen so far as was done in [5].² In Cartpole, all methods converge to the maximum cumulative reward, but the proposed method converges the fastest. In the Cartpole environment, we also compare to a method which uses Gaussian processes (GP) as the dynamics model (the algorithm denoted GP-E in [5], which considers only the expectation of the next-state prediction). The implementation of the GP algorithm was obtained from the code provided by [5]. Interestingly, our algorithm also surpasses this GP baseline, which is known to be a sample-efficient method widely used for the control of simple systems. In Reacher, the proposed method converges to the same asymptotic performance as PETS, but faster. In Pusher, all algorithms perform similarly. In Half-cheetah and Ant, the proposed method shows very good sample efficiency and very rapid initial learning. The agent learns an effective running gait in only a couple of episodes.³

Figure 4: Results of our experiments on the five benchmark environments, in comparison to PETS [5]. We show the return obtained in each episode. All results are averaged across 5 seeds, with the shaded area representing the standard deviation. PETS is a recent state-of-the-art model-based RL algorithm, and GP-based (Gaussian process) control algorithms are well known to be sample-efficient and are extensively used for the control of simple systems.

The results demonstrate that denoising regularization is effective for both gradient-free and gradient-based planning, with gradient-based planning performing the best. The proposed algorithm learns faster than PETS in the initial phase of training. It also achieves performance that is competitive with popular model-free algorithms such as DDPG, as reported in [5]. However, the performance of the proposed method does not improve after the initial 10 episodes, so it does not reach the asymptotic performance of PETS (see the results for PETS on Half-cheetah after 300 episodes in Table 1). This result is evidence of the importance of exploration: the DAE regularization essentially penalizes exploration, which can harm asymptotic performance in complex environments. In PETS, CEM leaves some noise in the trajectories, which might help to obtain better asymptotic performance. The result presented in Appendix E provides some evidence that at least part of the problem is a lack of exploration.

We also compare the performance of our method with Model-Based Meta Policy Optimization (MB-MPO) [6], an approach that combines the benefits of model-based RL and meta-learning: the algorithm trains a policy using simulations generated by an ensemble of models learned from data. Meta-learning with Model-Agnostic Meta-Learning (MAML) [12] allows this policy to quickly adapt to the various learned dynamics, and hence to adapt quickly in the real environment. In Fig. 5 we compare our method to MB-MPO and the other model-based methods included in [6]. This experiment is done in the Half-cheetah environment with shorter episodes (200 timesteps) in order to compare with the results reported in [6]. The results show that our method learns faster than MB-MPO.

Figure 5: Comparison to MB-MPO [6], MB-TRPO [18] and MB-MPC [24] on Half-cheetah with episode length 200 (average return as a function of episodes, for Adam + DAE reg. (ours), CEM + DAE reg. (ours), MB-MPO, MB-MPC and MB-TRPO). We plot the average return over the last 20 episodes. Our results are averaged across 3 seeds, with the shaded area representing the standard deviation. Note that the comparison numbers are taken from [6] and the results for the first 20 episodes are not reported.

² Because of the different metric used, the PETS results presented in this paper may appear worse than in [5]. However, we verified that our implementation of PETS obtains similar results to [5] for the metric used in [5].

³ Videos of our agents during training can be found at https://sites.google.com/view/regularizing-mbrl-with-dae/home.
## 5 Discussion

In recent years, a lot of effort has been put into making deep reinforcement learning algorithms more sample-efficient, and thus adaptable to real-world scenarios. Model-based reinforcement learning has shown promising results, obtaining sample efficiency that can be orders of magnitude better than model-free counterparts, but these methods have often suffered from sub-optimal performance for a number of reasons. As noted in the recent literature [24, 5], out-of-distribution errors and model overfitting are common sources of performance degradation when using complex function approximators. In this work we demonstrated how to tackle this problem using regularized trajectory optimization. Our experiments demonstrate that the proposed solution can improve the performance of model-based reinforcement learning.

While trajectory optimization is a key component of model-based RL, there are clearly several other issues which need to be tackled in complex environments:

- Local minima in trajectory optimization. There can be multiple trajectories that are reasonable solutions, while in-between trajectories can be very bad. For example, stepping with either the right or the left foot can work, but a trajectory in between the two will not. We tackled this issue by trying multiple initializations, which worked for the considered environments, but better techniques will be needed for more complex environments.
- The planning horizon problem. In the presented experiments, the planning procedure did not care about what happens after the planning horizon. This was not a problem for the considered environments due to their dense, well-shaped rewards. Other solutions, such as value functions, multiple time scales or hierarchical planning, are required for problems with sparser rewards. All of these are compatible with model-based RL.
- Open-loop vs. closed-loop control (compounding errors). The implicit planning assumption of trajectory optimization is open-loop control. However, MPC only takes the first action and then replans (closed-loop control). If the outcome is uncertain (e.g., due to a stochastic environment or an imperfect forward model), this can lead to overly pessimistic controls.
- Local optima of the policy. This is the well-known exploration-exploitation dilemma. If the model has never seen data of alternative trajectories, it may predict their consequences incorrectly and never try them (because in-between trajectories can be genuinely worse). Good trajectory optimization (exploitation) can harm long-term performance because it reduces exploration, but we believe that it is better to add explicit exploration. With model-based RL, intrinsically motivated exploration is a particularly interesting option because it is possible to balance exploration against the expected cost. This is particularly important in hazardous environments where safe exploration is needed.
- High-dimensional input spaces. Sensory systems like cameras, lidars and microphones can produce vast amounts of data, and it is infeasible to plan based on detailed predictions at a low level such as pixels. Also, predictive models of pixels may miss the relevant state.
- Changing environments. All the considered environments were static, but real-world systems keep changing. Online learning and similar techniques are needed to keep track of the changing environment.

Still, model-based RL is an attractive approach, and not only due to its sample efficiency. Compared to model-free approaches, model-based learning makes safe exploration and the addition of known constraints or first-principles models much easier.
We believe that the proposed method can be a viable solution for real-world control tasks, especially where safe exploration is of high importance. We are currently working on applying the proposed methods to real-world problems, such as assisting the operators of complex industrial processes and controlling autonomous mobile machines.

## Acknowledgments

We would like to thank Jussi Sainio, Jari Rosti and Isabeau Prémont-Schwarz for their valuable contributions to the experiments on industrial process control.

## References

[1] Naveed Akhtar and Ajmal Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6:14410–14430, 2018.
[2] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866, 2017.
[3] Zdravko I. Botev, Dirk P. Kroese, Reuven Y. Rubinstein, and Pierre L'Ecuyer. The cross-entropy method for optimization. In Handbook of Statistics, volume 31, pages 35–59. Elsevier, 2013.
[4] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[5] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems 31, pages 4759–4770, 2018.
[6] Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214, 2018.
[7] Nilesh Dalvi, Pedro Domingos, Mausam, Sumit Sanghai, and Deepak Verma. Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 99–108, 2004.
[8] Marc Deisenroth and Carl E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 465–472, 2011.
[9] Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1–2):1–142, 2013.
[10] Norman Di Palo and Harri Valpola. Improving model-based control and active exploration with reconstruction uncertainty optimization. arXiv preprint arXiv:1812.03955, 2018.
[11] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1329–1338, 2016.
[12] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1126–1135, 2017.
[13] Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
[14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[15] Jonathan Ko, Daniel J. Klein, Dieter Fox, and Dirk Haehnel. Gaussian processes and reinforcement learning for identification and control of an autonomous blimp. In Robotics and Automation, 2007 IEEE International Conference on, pages 742–747. IEEE, 2007.
[16] Basil Kouvaritakis and Mark Cannon. Non-linear Predictive Control: Theory and Practice. IET, 2001.
[17] Vikash Kumar, Emanuel Todorov, and Sergey Levine. Optimal control with learned local models: Application to dexterous manipulation. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 378–383. IEEE, 2016.
[18] Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018.
[19] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems 27, pages 1071–1079, 2014.
[20] Sergey Levine and Vladlen Koltun. Guided policy search. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 1–9, 2013.
[21] Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan online, learn offline: Efficient learning and exploration via model-based control. arXiv preprint arXiv:1811.01848, 2018.
[22] David Q. Mayne, James B. Rawlings, Christopher V. Rao, and Pierre O. M. Scokaert. Constrained model predictive control: Stability and optimality. Automatica, 36(6):789–814, 2000.
[23] K. Miyasawa. An empirical Bayes estimator of the mean of a normal population. Bulletin of the International Statistical Institute, 38(181–188):1–2, 1961.
[24] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.
[25] Vitchyr Pong*, Shixiang Gu*, Murtaza Dalal, and Sergey Levine. Temporal difference models: Model-free deep RL for model-based control. In International Conference on Learning Representations, 2018.
[26] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
[27] Martin Raphan and Eero P. Simoncelli. Least squares estimation without priors or supervision. Neural Computation, 23(2):374–420, 2011.
[28] N. Lawrence Ricker. Model predictive control of a continuous, nonlinear, two-phase reactor. Journal of Process Control, 3(2):109–123, 1993.
[29] Cédric Rommel, Frédéric Bonnans, Pierre Martinon, and Baptiste Gregorutti. Gaussian mixture penalty for trajectory optimization problems. Journal of Guidance, Control, and Dynamics, pages 1–6, 2019.
[30] John Rossiter. Model-Based Predictive Control: A Practical Approach. CRC Press, 2003.
[31] John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1889–1897, 2015.
[32] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[33] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[34] Yuval Tassa, Tom Erez, and Emanuel Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4906–4913. IEEE, 2012.
[35] Yuval Tassa, Nicolas Mansard, and Emo Todorov. Control-limited differential dynamic programming. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 1168–1175. IEEE, 2014.
[36] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
[37] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.