Calibrated Model-Based Deep Reinforcement Learning

Ali Malik*1, Volodymyr Kuleshov*1,2, Jiaming Song1, Danny Nemer2, Harlan Seymour2, Stefano Ermon1

Abstract

Estimates of predictive uncertainty are important for accurate model-based planning and reinforcement learning. However, predictive uncertainties, especially ones derived from modern deep learning systems, can be inaccurate and impose a bottleneck on performance. This paper explores which uncertainties are needed for model-based reinforcement learning and argues that good uncertainties must be calibrated, i.e. their probabilities should match the empirical frequencies of predicted events. We describe a simple way to augment any model-based reinforcement learning agent with a calibrated model and show that doing so consistently improves planning, sample complexity, and exploration. On the HALFCHEETAH MuJoCo task, our system achieves state-of-the-art performance using 50% fewer samples than the current leading approach. Our findings suggest that calibration can improve the performance of model-based reinforcement learning with minimal computational and implementation overhead.

*Equal contribution. 1Department of Computer Science, Stanford University, USA. 2Afresh Technologies, San Francisco, USA. Correspondence to: Ali Malik, Volodymyr Kuleshov.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

1. Introduction

Methods for accurately assessing predictive uncertainty are important components of modern decision-making systems. Probabilistic methods have been used to improve the safety, interpretability, and performance of decision-making agents in various domains, including medicine (Saria, 2018), robotics (Chua et al., 2018; Buckman et al., 2018), and operations research (Van Roy et al., 1997).

In model-based reinforcement learning, a setting in which an agent learns a model of the world from past experience and uses it to plan future decisions, capturing uncertainty in the agent's model is particularly important (Deisenroth & Rasmussen, 2011). Planning with a probabilistic model improves performance and sample complexity, especially when representing the model using a deep neural network (Rajeswaran et al., 2016; Chua et al., 2018).

Figure 1. Modern model-based planning algorithms with probabilistic models can over-estimate their confidence (purple distribution) and overlook dangerous outcomes (e.g., a collision). We show how to endow agents with a calibrated world model that accurately captures true uncertainty (green distribution) and improves planning in high-stakes scenarios like autonomous driving or industrial optimisation.

Despite their importance in decision-making, predictive uncertainties can be unreliable, especially when derived from deep neural networks (Guo et al., 2017a). Although several modern approaches, such as deep ensembles (Lakshminarayanan et al., 2017b) and approximations of Bayesian inference (Gal & Ghahramani, 2016a;b; Gal et al., 2017), provide uncertainties from deep neural networks, these methods suffer from shortcomings that reduce their effectiveness for planning (Kuleshov et al., 2018).

In this paper, we study which uncertainties are needed in model-based reinforcement learning and argue that good predictive uncertainties must be calibrated, i.e. their probabilities should match the empirical frequencies of predicted events.
We propose a simple way to augment any model-based reinforcement learning algorithm with a calibrated model by adapting recent advances in uncertainty estimation for deep neural networks (Kuleshov et al., 2018). We complement our approach with diagnostic tools, best practices, and intuition on how to apply calibration in reinforcement learning.

We validate our approach on benchmarks for contextual bandits and continuous control (Li et al., 2010; Todorov et al., 2012), as well as on a planning problem in inventory management (Van Roy et al., 1997). Our results show that calibration consistently improves the cumulative reward and the sample complexity of model-based agents, and also enhances their ability to balance exploration and exploitation in contextual bandit settings. Most interestingly, on the HALFCHEETAH task, our system achieves state-of-the-art performance, using 50% fewer samples than the previous leading approach (Chua et al., 2018). Our results suggest that calibrated uncertainties have the potential to improve model-based reinforcement learning algorithms with minimal computational and implementation overhead.

Contributions. In summary, this paper adapts recent advances in uncertainty estimation for deep neural networks to reinforcement learning and proposes a simple way to improve any model-based algorithm with calibrated uncertainties. We explain how this technique improves the accuracy of planning and the ability of agents to balance exploration and exploitation. Our method consistently improves performance on several reinforcement learning tasks, including contextual bandits, inventory management, and continuous control.1

1 Our code is available at https://github.com/ermongroup/CalibratedModelBasedRL

2. Background

2.1. Model-Based Reinforcement Learning

Let $S$ and $A$ denote (possibly continuous) state and action spaces in a Markov decision process $(S, A, T, r)$, and let $\Pi$ denote the set of all stationary stochastic policies $\pi : S \to P(A)$ that choose actions in $A$ given states in $S$. The successor state $s'$ for a given action $a$ taken from the current state $s$ is drawn from the dynamics function $T(s'|s, a)$. We work in the $\gamma$-discounted infinite-horizon setting, and we will use an expectation with respect to a policy $\pi \in \Pi$ to denote an expectation with respect to the trajectory it generates: $\mathbb{E}_\pi[r(s, a)] \triangleq \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, where $s_0 \sim p_0$, $a_t \sim \pi(\cdot|s_t)$, and $s_{t+1} \sim T(\cdot|s_t, a_t)$ for $t \geq 0$. Here $p_0$ is the initial distribution over states and $r(s_t, a_t)$ is the reward at time $t$. Typically, $S$, $A$, and $\gamma$ are known, while the dynamics model $T(s'|s, a)$ and the reward function $r(s, a)$ are not known explicitly. This work focuses on model-based reinforcement learning, in which the agent learns an approximate model $\hat{T}(s'|s, a)$ of the world from samples obtained by interacting with the environment and uses this model to plan its future decisions.

Probabilistic Models. This paper focuses on probabilistic dynamics models $\hat{T}(s'|s, a)$ that take a current state $s \in S$ and action $a \in A$, and output a probability distribution over future states $s'$. We represent the output distribution over next states, $\hat{T}(\cdot|s, a)$, as a cumulative distribution function $F_{s,a} : S \to [0, 1]$, which is defined for both discrete and continuous $S$.
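To make the CDF representation concrete, the sketch below shows a probabilistic dynamics model that outputs an independent Gaussian (and hence a CDF $F_{s,a}$) per next-state dimension. This is only an illustration under our own assumptions; the class name, architecture, and clamping range are not taken from the paper's code.

```python
# A minimal sketch (not the paper's code): a neural dynamics model that maps
# (state, action) to a factored Gaussian over the next state, exposed as a CDF.
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianDynamicsModel(nn.Module):
    """Probabilistic model T_hat(s' | s, a) with one Gaussian per state dimension."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, state_dim)
        self.log_std_head = nn.Linear(hidden, state_dim)

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> Normal:
        h = self.net(torch.cat([state, action], dim=-1))
        mean = self.mean_head(h)
        std = self.log_std_head(h).clamp(-5.0, 2.0).exp()  # keep std in a sane range
        return Normal(mean, std)

    def cdf(self, state, action, next_state) -> torch.Tensor:
        """F_{s,a}(s'): per-dimension CDF values in [0, 1], used later for calibration."""
        return self.forward(state, action).cdf(next_state)

# Usage: train by maximizing the Gaussian log-likelihood of observed transitions,
# then evaluate model.cdf(...) on held-out data to assess calibration.
model = GaussianDynamicsModel(state_dim=3, action_dim=1)
dist = model(torch.zeros(1, 3), torch.zeros(1, 1))  # a Normal over the next state
```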
2.2. Calibration, Sharpness, and Proper Scoring Rules

A key desirable property of probabilistic forecasts is calibration. Intuitively, a transition model $\hat{T}(s'|s, a)$ is calibrated if, whenever it assigns a probability of 0.8 to an event, such as a state transition $(s, a, s')$, that transition occurs about 80% of the time.

Formally, for a discrete state space $S$ and when $s, a, s'$ are i.i.d. realizations of random variables $S, A, S' \sim P$, we say that a transition model $\hat{T}$ is calibrated if

$$P(S' = s' \mid \hat{T}(S' = s' \mid S, A) = p) = p$$

for all $s' \in S$ and $p \in [0, 1]$.

When $S$ is a continuous state space, calibration is defined using quantiles as $P(S' \leq F_{S,A}^{-1}(p)) = p$ for all $p \in [0, 1]$, where $F_{s,a}^{-1}(p) = \inf\{y : p \leq F_{s,a}(y)\}$ is the quantile function associated with the CDF $F_{s,a}$ over future states $s'$ (Gneiting et al., 2007). A multivariate extension can be found in Kuleshov et al. (2018).

Note that calibration alone is not enough for a model to be good. For example, a model that assigns the same average probability to every transition may be calibrated, but it will not be useful. Good models also need to be sharp: intuitively, their probabilities should be maximally certain, i.e. close to 0 or 1.

Proper Scoring Rules. In the statistics literature, probabilistic forecasts are typically assessed using proper scoring rules (Murphy, 1973; Dawid, 1984). An example is the Brier score $L(p, q) = (p - q)^2$, defined over two Bernoulli distributions with natural parameters $p, q \in [0, 1]$. Crucially, any proper scoring rule decomposes precisely into a calibration term and a sharpness term (Murphy, 1973):

Proper Scoring Rule = Calibration + Sharpness + const.

Most loss functions for probabilistic forecasts over both discrete and continuous variables are proper scoring rules (Gneiting & Raftery, 2007). Hence, calibration and sharpness are precisely the two sufficient properties of a good forecast.

2.3. Recalibration

Most predictive models are not calibrated out-of-the-box (Niculescu-Mizil & Caruana, 2005). However, given an arbitrary pre-trained forecaster $H : X \to (Y \to [0, 1])$ that outputs CDFs $F$, we may train an auxiliary model $R : [0, 1] \to [0, 1]$ such that the forecasts $R \circ F$ are calibrated in the limit of enough data. This recalibration procedure applies to any probabilistic regression model and does not worsen the original forecasts from $H$ when measured using a proper scoring rule (Kuleshov & Ermon, 2017). When $S$ is discrete, a popular choice of $R$ is Platt scaling (Platt et al., 1999); Kuleshov et al. (2018) extend Platt scaling to continuous variables. Either of these methods can be used within our framework.
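As a concrete illustration of recalibration for continuous variables, the sketch below fits $R$ in the spirit of Kuleshov et al. (2018): it maps predicted CDF values on a held-out calibration set to their empirical frequencies with an isotonic regression. The use of scikit-learn and the function names are our assumptions, not the authors' implementation.

```python
# A minimal recalibration sketch in the spirit of Kuleshov et al. (2018); the
# library choice and helper names are our assumptions, not the paper's code.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_recalibrator(cdf_values: np.ndarray) -> IsotonicRegression:
    """Fit R: [0,1] -> [0,1] so that R(F(y)) is (approximately) calibrated.

    cdf_values[i] = F_i(y_i), the predicted CDF evaluated at the observed outcome,
    computed on a held-out calibration set.
    """
    p = np.sort(cdf_values)
    # Empirical frequency: fraction of calibration points with F(y) <= p.
    empirical = np.arange(1, len(p) + 1) / len(p)
    recalibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    recalibrator.fit(p, empirical)
    return recalibrator

def recalibrated_cdf(recalibrator, raw_cdf_value) -> np.ndarray:
    """Apply R o F: map raw predicted CDF values to calibrated probabilities."""
    return recalibrator.predict(np.atleast_1d(raw_cdf_value))

# Example: an overconfident forecaster has CDF values piled up near 0 and 1;
# the fitted R stretches them back towards the uniform distribution.
rng = np.random.default_rng(0)
R = fit_recalibrator(np.clip(rng.normal(0.5, 0.15, size=5000), 0, 1))
print(recalibrated_cdf(R, [0.1, 0.5, 0.9]))
```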
3. What Uncertainties Do We Need in Model-Based Reinforcement Learning?

In model-based reinforcement learning, probabilistic models improve the performance and sample complexity of planning algorithms (Rajeswaran et al., 2016; Chua et al., 2018); this naturally raises the question of what constitutes a good probabilistic model.

3.1. Calibration vs. Sharpness Trade-Off

A natural way of assessing the quality of a probabilistic model is via a proper scoring rule (Murphy, 1973; Gneiting et al., 2007). As discussed in Section 2, any proper scoring rule decomposes into a calibration and a sharpness term. Hence, these are precisely the two qualities we should seek. Crucially, not all probabilistic predictions with the same proper score are equal: some are better calibrated, and others are sharper. There is a natural trade-off between these terms.

In this paper, we argue that this trade-off plays an important role when specifying probabilistic models in reinforcement learning. Specifically, it is much better to be calibrated than sharp, and calibration significantly impacts the performance of model-based algorithms. Recalibration methods (Platt et al., 1999; Kuleshov et al., 2018) allow us to ensure that a model is calibrated, and thus improve reinforcement learning agents.

3.2. Importance of Calibration for Decision-Making

In order to explain the importance of calibration, we provide some intuitive examples and then prove a formal statement.

Intuition. Consider a simple MDP with two states $s_{good}$ and $s_{bad}$. The former has a high reward $r(s_{good}) = 1$ and the latter has a low reward $r(s_{bad}) = -1$.

First, calibration helps us better estimate expected rewards. Consider the expected reward $\hat{r}$ from taking action $a$ in $s_{good}$ under the model. It is given by $\hat{r} = -1 \cdot \hat{T}(s_{bad}|s_{good}, a) + 1 \cdot \hat{T}(s_{good}|s_{good}, a)$. If the true transition probability is $T(s_{good}|s_{good}, a) = 80\%$ but our model $\hat{T}$ predicts 60%, then in the long run the average reward from $a$ in $s_{good}$ will not equal $\hat{r}$; incorrectly estimating the reward will in turn cause us to choose sub-optimal actions.

Similarly, suppose that the model is over-confident and $\hat{T}(s_{good}|s_{good}, a) = 0$; intuitively, we may decide that it is not useful to try $a$ in $s_{good}$, as it leads to $s_{bad}$ with 100% probability. This is an instance of the classical exploration-exploitation problem; many approaches to this problem (such as the UCB family of algorithms) rely on accurate confidence bounds and are likely to benefit from calibrated uncertainties that more accurately reflect the true probability of transitioning to a particular state.

Expectations Under Calibrated Models. More concretely, we can formalise our intuition about the accuracy of expectations via the following statement for discrete variables; see the Appendix for more details.

Lemma 1. Let $Q(Y|X)$ be a calibrated model over two discrete variables $X, Y \sim P$, such that $P(Y = y \mid Q(Y = y \mid X) = p) = p$. Then any expectation of a function $G(Y)$ is the same under $P$ and $Q$:

$$\mathbb{E}_{y \sim P(Y)}[G(y)] = \mathbb{E}_{x \sim P(X),\, y \sim Q(Y|X=x)}[G(y)]. \quad (1)$$

In model-based reinforcement learning, we take expectations in order to compute the expected reward of a sequence of decisions. A calibrated model will allow us to estimate these more accurately.

4. Calibrated Model-Based Reinforcement Learning

In Algorithm 1, we present a simple procedure that augments a model-based reinforcement learning algorithm with an extra step that ensures the calibration of its transition model. Algorithm 1 effectively corresponds to standard model-based reinforcement learning with the addition of Step 4, in which we train a recalibrator $R$ such that $R \circ \hat{T}$ is calibrated. The subroutine CALIBRATE can be an instance of Platt scaling, for discrete $S$, or the method of Kuleshov et al. (2018), when $S$ is continuous (see Algorithm 2 in the appendix). In the rest of this section, we describe best practices for applying this method.

Algorithm 1 Calibrated Model-Based Reinforcement Learning
Input: Initial transition model $\hat{T} : S \times A \to P(S)$ and initial dataset of state transitions $D = \{(s_t, a_t), s_{t+1}\}_{t=1}^N$.
Repeat until a sufficient level of performance is reached:
1. Run the agent and collect a dataset of state transitions $D_{new} \leftarrow$ EXECUTEPLANNING($\hat{T}$). Gather all experience data $D \leftarrow D \cup D_{new}$.
2. Let $D_{train}, D_{cal} \leftarrow$ PARTITIONDATA($D$) be the training and calibration sets, respectively.
3. Train a transition model $\hat{T} \leftarrow$ TRAINMODEL($D_{train}$).
4. Train the recalibrator $R \leftarrow$ CALIBRATE($\hat{T}, D_{cal}$).
5. Let $\hat{T} \leftarrow R \circ \hat{T}$ be the new, recalibrated transition model.

Diagnostic Tools. An essential tool for visualising the calibration of predicted CDFs $F_1, \ldots, F_N$ is the reliability curve (Gneiting et al., 2007). This plot displays the empirical frequency of points in a given interval relative to the predicted fraction of points in that interval. Formally, we choose $m$ thresholds $0 \leq p_1 \leq \cdots \leq p_m \leq 1$ and, for each threshold $p_j$, compute the empirical frequency

$$\hat{p}_j = |\{y_t : F_t(y_t) \leq p_j,\ t = 1, \ldots, N\}| / N.$$

Plotting $\{(p_j, \hat{p}_j)\}$ gives us a sense of the calibration of the model (see Figure 2), with a straight line corresponding to perfect calibration. An equivalent, alternative visualisation is to plot a histogram of the probability integral transform $\{F_t(y_t)\}_{t=1}^N$ and check whether it looks like a uniform distribution (Gneiting et al., 2007). These visualisations can be quantified by defining the calibration loss2 of a model,

$$\mathrm{cal}(F_1, y_1, \ldots, F_N, y_N) = \sum_{j=1}^{m} (\hat{p}_j - p_j)^2, \quad (2)$$

as the sum of the squared residuals $(\hat{p}_j - p_j)^2$. These diagnostics should be evaluated on unseen data, distinct from both the training and calibration sets, as this may reveal signs of overfitting.

2 This is the calibration term in the two-component decomposition of the Brier score.

4.1. Applications to Deep Reinforcement Learning

Although deep neural networks can significantly improve model-based planning algorithms (Higuera et al., 2018; Chua et al., 2018), their estimates of predictive uncertainty are often inaccurate (Guo et al., 2017a; Kuleshov et al., 2018).

Variational Dropout. One popular approach to deriving uncertainty estimates from deep neural networks involves dropout. Taking the mean and the variance of dropout samples leads to a principled Gaussian approximation of the posterior predictive distribution of a Bayesian neural network (in regression) (Gal & Ghahramani, 2016a). To use Algorithm 1, we may instantiate CALIBRATE with the method of Kuleshov et al. (2018) and pass it the predictive Gaussian derived from the dropout samples. More generally, our method can be naturally applied on top of any probabilistic model without any need to modify or retrain this model.
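As a small illustration of the diagnostics described above, the sketch below computes the reliability-curve points $(p_j, \hat{p}_j)$ and the calibration loss of Equation (2) from held-out CDF values. It is our own example, not the paper's code.

```python
# A small sketch of the diagnostics above: reliability-curve points and the
# calibration loss of Eq. (2), computed from held-out predicted CDF values.
import numpy as np

def reliability_curve(cdf_values: np.ndarray, m: int = 10):
    """cdf_values[t] = F_t(y_t) on held-out data; returns (p_j, p_hat_j) pairs."""
    thresholds = np.linspace(0.0, 1.0, m + 1)             # 0 <= p_1 <= ... <= p_m <= 1
    empirical = np.array([(cdf_values <= p).mean() for p in thresholds])
    return thresholds, empirical

def calibration_loss(cdf_values: np.ndarray, m: int = 10) -> float:
    """Sum of squared residuals (p_hat_j - p_j)^2, as in Eq. (2)."""
    p, p_hat = reliability_curve(cdf_values, m)
    return float(np.sum((p_hat - p) ** 2))

# A perfectly calibrated forecaster has uniform F_t(y_t) values, so the
# reliability curve hugs the diagonal and the loss is near zero.
rng = np.random.default_rng(0)
print(calibration_loss(rng.uniform(size=10_000)))
```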
5. The Benefits of Calibration in Model-Based Reinforcement Learning

Next, we examine specific ways in which Algorithm 1 can improve model-based reinforcement learning agents.

5.1. Model-Based Planning

The first benefit of a calibrated model is enabling more accurate planning using standard algorithms such as value iteration or model predictive control (Sutton & Barto, 2018). Each of these methods involves estimates of future reward. For example, value iteration performs the update

$$V'(s) \leftarrow \mathbb{E}_{a \sim \pi(\cdot|s)}\left[\sum_{s' \in S} \hat{T}(s'|s, a)\left(r(s') + \gamma V(s')\right)\right].$$

Crucially, this algorithm requires accurate estimates of the expected reward $\sum_{s' \in S} \hat{T}(s'|s, a) r(s')$. Similarly, online planning algorithms involve computing the expected reward of a finite sequence of actions, which has a similar form.

If the model is miscalibrated, then the predicted distribution $\hat{T}(s'|s, a)$ will not accurately reflect the true distribution of states that the agent will encounter in the real world. As a result, planning performed in the model will be inaccurate.

More formally, let us define the value of a policy as $V(\pi) = \mathbb{E}_{s \sim \sigma_\pi}[V(s)]$, where $\sigma_\pi$ is the stationary distribution of the Markov chain induced by $\pi$. Let $V'(\pi)$ be an estimate of the value of $\pi$ in a second MDP in which we replaced the transition dynamics by a calibrated model $\hat{T}$ learned from data. Then, the following holds.

Theorem 1. Let $(S, A, T, r)$ be a discrete MDP and let $\pi$ be a stochastic policy over this MDP. The value $V(\pi)$ of policy $\pi$ under the true dynamics $T$ is equal to the value $V'(\pi)$ of the policy under any set of calibrated dynamics $\hat{T}$.

Effectively, having a calibrated model makes it possible to compute accurate expectations of rewards, which in turn provides accurate estimates of the values of states and policies. Accurately estimating the value of a policy makes it easier to choose the best one by planning.

5.2. Balancing Exploration and Exploitation

Balancing exploration and exploitation successfully is a fundamental challenge for many reinforcement learning (RL) algorithms. A large family of algorithms tackle this problem using notions of uncertainty or confidence to guide their exploration process. For example, upper confidence bound (UCB; Auer et al., 2002) algorithms pick the action with the highest upper bound on its reward confidence interval. When the outputs of the algorithm are uncalibrated, the confidence intervals might provide unreliable upper confidence bounds, resulting in suboptimal performance. For example, in a two-arm bandit problem, if the model under-estimates the reward of the best arm with high confidence, that arm's upper confidence bound will be low, and it will not be selected. More generally, UCB-style methods need uncertainty estimates that are on the same order of magnitude so that arms can be compared against each other; calibration helps ensure that.

Table 1. Performance of calibrated/uncalibrated LinUCB on a variety of datasets, averaged over 10 trials. The calibrated algorithm (CalLinUCB) does better on all non-synthetic datasets (bottom four rows) and has similar performance on the synthetic datasets (top two rows).

Dataset      LinUCB            CalLinUCB         Optimal
Linear       1209.8 ± 12.1     1210.3 ± 12.1     1231.8
Beta         1176.3 ± 11.9     1174.6 ± 12.0     1202.3
Mushroom     1429.4 ± 154.0    1676.1 ± 164.1    3122.0
Covertype    558.14 ± 3.5      677.8 ± 5.0       1200.0
Adult        131.3 ± 1.2       198.9 ± 4.7       1200.0
Census       207.6 ± 1.7       603.7 ± 3.8       1200.0

6. Experiments

We evaluate our calibrated model-based reinforcement learning method on several different environments and algorithms, including contextual bandits, inventory management, and continuous control for robotics.

6.1. Balancing Exploration and Exploitation

To test the effect of calibration on exploration/exploitation, we look at the contextual multi-armed bandit problem (Li et al., 2010). At each timestep, an agent is shown a context vector $x$ and must pick an arm $a \in A$ from a finite set $A$. After picking an arm, the agent receives a reward $r_{a,x}$ which depends both on the arm picked and on the context vector shown to the agent. The agent's goal over time is to learn the relationship between the context vector and the reward gained from each arm, so that it can pick the arm with the highest expected reward at each timestep.

Setup. For our experiments, we focus on the LinUCB algorithm (Li et al., 2010), a well-known instantiation of the UCB approach to contextual bandits. LinUCB assumes a linear relationship between the context vector and the expected reward of an arm: for each arm $a \in A$, there is an unknown coefficient vector $\theta_a$ such that $\mathbb{E}[r_{a,x}] = x^\top \theta_a$. LinUCB learns a predictive distribution over this reward using Bayesian ridge regression, in which $\theta_a$ has a Gaussian posterior $N(\hat{\theta}_a, \hat{\Sigma}_a)$. The posterior predictive distribution is also Gaussian, with mean $x^\top \hat{\theta}_a$ and standard deviation $\sqrt{x^\top \hat{\Sigma}_a^{-1} x}$. Thus, the algorithm picks the arm with the highest $\alpha$-quantile, given by $x^\top \hat{\theta}_a + \Phi^{-1}(\alpha)\sqrt{x^\top \hat{\Sigma}_a^{-1} x}$, where $\Phi^{-1}$ is the standard normal quantile function. We apply the recalibration scheme in Algorithm 1 of Kuleshov et al. (2018) to these predicted Gaussian distributions.
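The sketch below illustrates one way this quantile-based arm selection could incorporate a recalibrator such as the isotonic $R$ from the earlier sketch: the calibrated $\alpha$-quantile is obtained by looking up the raw probability level that $R$ maps to $\alpha$. This interpretation, the helper names, and the grid inversion are our assumptions, not the authors' CalLinUCB code.

```python
# Illustrative sketch (not the authors' code) of quantile-based arm selection
# with a recalibrated Gaussian posterior predictive, in the spirit of CalLinUCB.
import numpy as np
from scipy.stats import norm

def calibrated_quantile(mean, std, alpha, recalibrator=None):
    """alpha-quantile of N(mean, std), optionally corrected by a recalibrator R.

    R maps raw predicted probabilities to empirical ones; to obtain a calibrated
    alpha-quantile we look up the raw level p with R(p) ~= alpha and take the
    Gaussian quantile at p.  With recalibrator=None this reduces to plain LinUCB.
    """
    if recalibrator is None:
        p = alpha
    else:
        grid = np.linspace(1e-3, 1 - 1e-3, 999)
        p = grid[np.argmin(np.abs(recalibrator.predict(grid) - alpha))]
    return mean + std * norm.ppf(p)

def pick_arm(x, posteriors, alpha=0.9, recalibrators=None):
    """posteriors[a] = (theta_hat, Sigma_hat) per arm; returns the chosen arm index."""
    scores = []
    for a, (theta, Sigma) in enumerate(posteriors):
        mean = float(x @ theta)
        std = float(np.sqrt(x @ np.linalg.inv(Sigma) @ x))  # as in the text above
        R = None if recalibrators is None else recalibrators[a]
        scores.append(calibrated_quantile(mean, std, alpha, R))
    return int(np.argmax(scores))
```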
Data. We evaluate the calibrated version (CalLinUCB) and uncalibrated version (LinUCB) of the LinUCB algorithm both on synthetic data that satisfies the linearity assumption of the algorithm and on real UCI datasets from Li et al. (2010). We run the tests on 2000 examples over 10 trials and compute the average cumulative reward.

Figure 2. Top: Performance of CalLinUCB and LinUCB on the UCI covertype dataset. Bottom: Calibration curves of the LinUCB algorithms on the covertype dataset.

Results. We expect the LinUCB algorithm to already be calibrated on the synthetic linear data, since the model is well-specified, implying no difference in performance between CalLinUCB and LinUCB. On the real UCI datasets, however, the linear assumption might not hold, resulting in miscalibrated estimates of the expected reward. In Table 1, we can see that there is indeed no significant difference in performance between the CalLinUCB and LinUCB algorithms on the synthetic linear dataset, where they both perform optimally. On the UCI datasets, however, we see a noticeable improvement with CalLinUCB on almost all tasks, suggesting that recalibration aids exploration/exploitation in a setting where the model is misspecified. Note that both CalLinUCB and LinUCB perform below the optimum on these datasets, implying that linear models are not expressive enough in general for these tasks.

Analysis. To get a sense of the effect of calibration on the model's confidence estimates, we can plot the predicted reward, with 90% confidence intervals, that the algorithm expected for a chosen arm $a$ at timestep $t$, and compare this prediction with the true observed reward. Specifically, we look at the timesteps where the CalLinUCB algorithm picked the optimal action but the LinUCB algorithm did not, and examine both algorithms' beliefs about the predicted reward of these actions. An example of this plot can be seen in Figure 4 in the appendix. A key takeaway is that the uncalibrated algorithm systematically underestimates the expected reward of the optimal action and overestimates the expected reward of the action it chose instead, resulting in suboptimal actions. The calibrated model does not suffer from this defect, and thus performs better on the task.

6.2. Model-Based Planning

6.2.1. Inventory Management

Our first model-based planning task is inventory management (Van Roy et al., 1997). A decision-making agent controls the inventory of a perishable good in a store. Each day, the agent orders items into the store; if the agent under-orders, the store runs out of stock; if the agent over-orders, perishable items are lost due to spoilage. Perishable inventory management systems have the potential to positively impact the environment by minimizing food waste and enabling a more effective use of resources (Vermeulen et al., 2012).

Model. We formalize perishable inventory management for one item using a Markov decision process $(S, A, P, r)$. States $s \in S$ are tuples $(d_s, q_s)$ consisting of a calendar day $d_s$ and an inventory state $q_s \in \mathbb{Z}^L$, where $L \geq 1$ is the item shelf-life. Each component $(q_s)_l$ indicates the number of units in the store that expire in $l$ days; the total inventory level is $t_s = \sum_{l=1}^{L}(q_s)_l$. Transition probabilities $P$ are defined as follows: each day sees a random demand of $D(s) \in \mathbb{Z}$ units and sales of $\min(t_s, D(s))$ units, sampled at random from all the units in the inventory; at the end of the state transition, the shelf-life of the remaining items is decreased by one (spoiled items are recorded and thrown away). Actions $a \in A \subseteq \mathbb{Z}$ correspond to orders: the store receives $a$ items with a shelf life of $L$ before entering the next state $s'$. In our experiments, we choose the reward $r$ to be the negative sum of waste and unmet demand due to stock-outs.
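The following sketch simulates one day's transition of this inventory MDP under our reading of the model above; the data structure, the timing of the order arrival, and the random-sales rule are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of one day's transition in the perishable-inventory MDP as we
# read it above; data structures and the random-sales rule are illustrative.
import numpy as np

def inventory_step(q: np.ndarray, order: int, demand: int, rng: np.random.Generator):
    """q[l] = units expiring in l+1 days (length L). Returns (q_next, waste, stockout)."""
    q = q.copy()
    q[-1] += order                          # ordered items arrive with full shelf-life L
    sales = min(int(q.sum()), demand)       # cannot sell more than is in stock
    stockout = demand - sales               # unmet demand
    for _ in range(sales):                  # sold units drawn at random from inventory
        in_stock = np.flatnonzero(q > 0)
        q[rng.choice(in_stock)] -= 1
    waste = int(q[0])                       # unsold items expiring today are thrown away
    q_next = np.append(q[1:], 0)            # remaining shelf-lives decrease by one
    return q_next, waste, stockout

rng = np.random.default_rng(0)
q = np.array([2, 0, 3])                     # L = 3: two units expire tomorrow, etc.
print(inventory_step(q, order=4, demand=5, rng=rng))
```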
Data. We use the Corporación Favorita Kaggle dataset, which consists of historical sales from a supermarket chain in Ecuador. We experiment on the 100 highest-selling items and use data from 2014-01-01 to 2016-05-31 for training and data from 2016-06-01 to 2016-08-31 for testing.

Algorithms. We learn a probabilistic model $\hat{M} : S \to (\mathbb{R} \to [0, 1])$ of the demand $D(s')$ in a future state $s'$ based on information available in the present state $s$. Specifically, we train a Bayesian DenseNet (Huang et al., 2017) to predict sales on each of the next five days based on features from the current day (sales serve as a proxy for demand). We use autoregressive features from the past four days; 7-, 14-, and 28-day rolling means of historical sales; binary indicators for the day of the week and the week of the year; and sine and cosine features over the number of days elapsed in the year. The Bayesian DenseNet has five layers of 128 hidden units with a dropout rate of 0.5 and parametric ReLU nonlinearities. We use variational dropout (Gal & Ghahramani, 2016b) to compute probabilistic forecasts from the model.

We use our learned distribution over $D(s')$ to perform online planning on the test set using model predictive control (MPC) learned on the training set. Specifically, we sample 5,000 random trajectories over a 5-step horizon and choose the first action of the trajectory with the highest expected reward under the model. We estimate the expected reward of each trajectory using 300 Monte Carlo samples from the model. We also compare the planning approach to a simple heuristic rule that always sets the inventory to $1.5 \cdot \mathbb{E}[D(s')]$, i.e. the expected demand multiplied by a small safety factor.

Table 2. Performance of calibrated model planning on an inventory management task. Calibration significantly improves cumulative reward. Numbers are in units, averaged over ten trials.

              Calibrated   Uncalibrated   Heuristic
Shipped       332,150      319,692        338,011
Wasted        7,466        3,148          13,699
Stockouts     9,327        17,358         11,817
% Waste       2.2%         1.0%           4.1%
% Stockouts   2.8%         5.4%           3.5%
Reward        -16,793      -20,506        -25,516

Results. We evaluate the agent within the inventory management MDP; the demand $D(s)$ is instantiated with the historical sales on test day $d_s$ (which the agent did not observe). We measure total cumulative waste and stockouts over the 100 items in the dataset, and we report them as a fraction of the total number of units shipped to the store. Table 2 shows that calibration improves the total cumulative reward by 14%. The calibrated model incurs waste and out-of-stock ratios of 2.2% and 2.8%, respectively, compared to 1.0% and 5.4% for the uncalibrated one; the uncalibrated model's errors are skewed towards a smaller waste, even though the objective function penalizes waste and stock-outs equally. The heuristic has ratios of 4.1% and 3.5%.
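To make the MPC procedure described above concrete, here is a minimal sketch: sample random order sequences, score each by Monte Carlo expected reward under a demand model, and execute the first action of the best sequence. The Gaussian demand model and the simplified reward below are illustrative stand-ins (not the learned Bayesian DenseNet), and the sample counts are reduced from the paper's 5,000 trajectories and 300 Monte Carlo samples for brevity.

```python
# A minimal MPC sketch for the inventory task: random shooting over order
# sequences, scored by Monte Carlo expected reward.  The demand model and the
# reward below are illustrative stand-ins, not the paper's learned model.
import numpy as np

rng = np.random.default_rng(0)

def sample_demand(horizon, n_samples):            # stand-in for the learned model
    return np.maximum(rng.normal(10.0, 3.0, size=(n_samples, horizon)), 0.0)

def trajectory_reward(stock, orders, demands):    # negative unmet demand (spoilage omitted)
    total = 0.0
    for order, demand in zip(orders, demands):
        stock += order
        sold = min(stock, demand)
        total -= (demand - sold)                  # stock-out penalty
        stock -= sold
    return total

def mpc_action(stock, action_space, horizon=5, n_traj=500, n_mc=50):
    best_action, best_value = action_space[0], -np.inf
    for _ in range(n_traj):
        orders = rng.choice(action_space, size=horizon)            # random plan
        demands = sample_demand(horizon, n_mc)
        value = np.mean([trajectory_reward(stock, orders, d) for d in demands])
        if value > best_value:                                     # keep the best plan
            best_action, best_value = orders[0], value
    return best_action                                             # execute first action only

print(mpc_action(stock=5.0, action_space=np.arange(0, 21)))
```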
6.2.2. MuJoCo Environments

Our second model-based planning task is continuous control from OpenAI Gym (Brockman et al., 2016) and the MuJoCo robotics simulation environment (Todorov et al., 2012). Here the agent chooses torque controls, given observation states (e.g. locations and velocities of joints), that maximize the expected return. These environments are standard benchmark tasks for deep reinforcement learning.

Setup. We consider calibrating the probabilistic ensemble dynamics model proposed in Chua et al. (2018). In this approach, the agent learns an ensemble of probabilistic neural networks (PE) that captures the environment dynamics $s_{t+1} \sim \hat{f}(s_t, a_t)$, which is used for model-based planning with model predictive control. The policy and ensemble model are then updated in an iterative fashion. Chua et al. (2018) introduce several strategies for particle-based state propagation, including trajectory sampling with bootstrapped models (PE-TS) and distribution sampling (PE-DS), which samples from a multimodal distribution as follows:

$$s_{t+1} \sim \mathcal{N}\left(\mathbb{E}[s^p_{t+1}], \mathrm{Var}[s^p_{t+1}]\right), \quad s^p_{t+1} \sim \hat{f}(s_t, a_t). \quad (4)$$

PE-TS and PE-DS achieve the highest sample efficiency among the methods proposed in Chua et al. (2018).

To calibrate the model, we add a final sigmoid recalibration layer to the sampling procedure in PE-DS at each step. This logistic layer is applied separately per output state dimension and serves as the recalibrator $R$. It is trained using the procedure described in Algorithm 2, after every trial, on a separate calibration set, using a cross-entropy loss.

We consider three continuous control environments from Chua et al. (2018).3 For model learning and model-based planning, we follow the training procedure and hyperparameters in Chua et al. (2018), as described in https://github.com/kchua/handful-of-trials. We also compare our method against Soft Actor-Critic (SAC; Haarnoja et al., 2018), which is one of the state-of-the-art model-free reinforcement learning algorithms. We use the final convergence reward of SAC as a criterion for the highest possible reward achievable in each task (although reaching it may require orders of magnitude more samples from the environment).

3 We omitted the reacher environment because the reference papers did not have SAC results for it.

Results. One of the most important criteria for evaluating reinforcement learning algorithms is sample complexity, i.e., the number of interactions with the environment needed to reach a certain high expected return. We compare the sample complexities of SAC, PE-DS, and calibrated PE-DS in Figure 3. Compared to the model-free SAC method, both model-based methods use far fewer samples from the environment to reach the convergence performance of SAC. Moreover, our recalibrated PE-DS method compares favorably to PE-DS on all three environments. Notably, the calibrated PE-DS method outperforms PE-DS by a significant margin on the HalfCheetah environment, reaching near-optimal performance at only around 180k timesteps. To our knowledge, the calibrated PE-DS is the most efficient method on these environments in terms of sample complexity.

Analysis. In Figure 5 in the appendix, we visualise the 1-step prediction accuracy for action dimension zero in the Cartpole environment for both PE-DS and calibrated PE-DS. This figure shows that the calibrated PE-DS model is more accurate, has tighter uncertainty bounds, and is better calibrated, especially in earlier trials. Interestingly, we also observe a superior expected return for calibrated PE-DS in earlier trials in Figure 3, suggesting that being calibrated is correlated with improvements in model-based prediction and planning.
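The sketch below shows one way the per-dimension sigmoid (logistic) recalibration layer described in the setup above could be implemented and trained with a cross-entropy loss on held-out CDF values. The choice of training targets (empirical frequencies of the raw CDF values) and the layer shapes are our assumptions, not the authors' code.

```python
# Illustrative sketch (not the authors' code) of a per-dimension sigmoid
# recalibration layer R, trained with cross-entropy on held-out CDF values.
import torch
import torch.nn as nn

class SigmoidRecalibrator(nn.Module):
    """Maps raw CDF values p in [0,1] to sigmoid(a * p + b), with separate
    parameters (a, b) for every output state dimension."""

    def __init__(self, state_dim: int):
        super().__init__()
        self.a = nn.Parameter(torch.ones(state_dim))
        self.b = nn.Parameter(torch.zeros(state_dim))

    def forward(self, raw_cdf: torch.Tensor) -> torch.Tensor:  # (batch, state_dim)
        return torch.sigmoid(self.a * raw_cdf + self.b)

def train_recalibrator(recal, raw_cdf, target_freq, epochs=200, lr=1e-2):
    """raw_cdf[i, d] = F_d(y_i) on the calibration set; target_freq[i, d] is the
    empirical frequency of that value.  Binary cross-entropy pulls R(p) towards
    the targets, analogous to the recalibration step after every trial."""
    opt = torch.optim.Adam(recal.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(recal(raw_cdf), target_freq)
        loss.backward()
        opt.step()
    return recal
```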
7. Discussion

Limitations. A potential failure mode for our method arises when not all forecasts come from the same family of distributions; this can lead to calibrated but diffuse confidence intervals. Another limitation of the method is its scalability to high-dimensional spaces. In our work, the uncalibrated forecasts were fully factored and could be recalibrated component-wise. For non-factored distributions, recalibration is computationally intractable and requires approximations such as those developed for multi-class classification (Zadrozny & Elkan, 2002). Finally, it is possible that uncalibrated forecasts are still effective if they induce a model that correctly ranks the agent's actions in terms of their expected reward (even when the estimates of the reward themselves are incorrect).

Figure 3. Performance on different control tasks. The calibrated algorithm does at least as well as, and often much better than, the uncalibrated models. Plots show the maximum reward obtained so far, averaged over 10 trials. Standard error is displayed as the shaded areas.

Extensions to Safety. Calibration also plays an important role in the domain of RL safety (Berkenkamp et al., 2017). When the agent is planning its next action, if it determines that the 90% confidence interval of the predicted next state lies in a safe area but this confidence is miscalibrated, then the agent has a higher chance of entering a failure state.

8. Related Work

Model-based Reinforcement Learning. Model-based RL is effective in low-data and/or high-stakes regimes such as robotics (Chua et al., 2018), dialogue systems (Singh et al., 2000), education (Rollinson & Brunskill, 2015), scientific discovery (McIntire et al., 2016), and conservation planning (Ermon et al., 2012). A major challenge of model-based RL is model bias, which is being addressed by solutions such as model ensembles (Clavera et al., 2018; Kurutach et al., 2018; Depeweg et al., 2016; Chua et al., 2018) or by combining model-based with model-free approaches (Buckman et al., 2018).

Calibration. Two of the most widely used calibration procedures are Platt scaling (Platt et al., 1999) and isotonic regression (Niculescu-Mizil & Caruana, 2005). They can be extended from binary to multi-class classification (Zadrozny & Elkan, 2002), to structured prediction (Kuleshov & Liang, 2015), and to regression (Kuleshov et al., 2018). Calibration has recently been studied in the context of deep neural networks (Guo et al., 2017b; Gal et al., 2017; Lakshminarayanan et al., 2017a), identifying important shortcomings in their uncertainties.

Probabilistic Forecasting. Calibration has been studied extensively in statistics (Murphy, 1973; Dawid, 1984) as a criterion for evaluating forecasts (Gneiting & Raftery, 2007), including from a Bayesian perspective (Dawid, 1984). Recent studies on calibration have focused on applications in weather forecasting (Gneiting & Raftery, 2005) and have led to implementations in forecasting systems (Raftery et al., 2005). Gneiting et al. (2007) introduced a number of definitions of calibration for continuous variables, complementing early work on classification (Murphy, 1973).
9. Conclusion

Probabilistic models of the environment can significantly improve the performance of reinforcement learning agents, but proper uncertainty quantification is crucial for planning and for managing exploration/exploitation trade-offs. We demonstrated a general recalibration technique that can be combined with most model-based reinforcement learning algorithms to improve performance. Our approach incurs minimal computational overhead and empirically improves performance across a range of tasks.

Acknowledgments

This research was supported by NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), AFOSR (FA9550-19-1-0024), Amazon AWS, and Lam Research.

References

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Berkenkamp, F., Turchetta, M., Schoellig, A., and Krause, A. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems 30, pp. 908–918. Curran Associates, Inc., 2017.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8234–8244, 2018.

Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems 31, pp. 4759–4770. Curran Associates, Inc., 2018.

Clavera, I., Rothfuss, J., Schulman, J., Fujita, Y., Asfour, T., and Abbeel, P. Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214, 2018.

Dawid, A. P. Present position and potential developments: Some personal views: Statistical theory: The prequential approach. Journal of the Royal Statistical Society. Series A (General), 147:278–292, 1984.

Deisenroth, M. and Rasmussen, C. E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472, 2011.

Depeweg, S., Hernández-Lobato, J. M., Doshi-Velez, F., and Udluft, S. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. arXiv preprint arXiv:1605.07127, 2016.

Ermon, S., Conrad, J., Gomes, C. P., and Selman, B. Playing games against nature: Optimal policies for renewable resource allocation. 2012.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16), 2016a.

Gal, Y. and Ghahramani, Z. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1019–1027, 2016b.

Gal, Y., Hron, J., and Kendall, A. Concrete dropout. In Advances in Neural Information Processing Systems, pp. 3581–3590, 2017.

Gneiting, T. and Raftery, A. E. Weather forecasting with ensemble methods. Science, 310(5746):248–249, 2005.
Gneiting, T. and Raftery, A. E. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.

Gneiting, T., Balabdaoui, F., and Raftery, A. E. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268, 2007.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. CoRR, abs/1706.04599, 2017a.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017b.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

Higuera, J. C. G., Meger, D., and Dudek, G. Synthesizing neural network controllers with probabilistic model-based reinforcement learning. arXiv preprint arXiv:1803.02291, 2018.

Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269. IEEE, 2017.

Kuleshov, V. and Ermon, S. Estimating uncertainty online against an adversary. In AAAI, pp. 2110–2116, 2017.

Kuleshov, V. and Liang, P. Calibrated structured prediction. In Advances in Neural Information Processing Systems (NIPS), 2015.

Kuleshov, V., Fenner, N., and Ermon, S. Accurate uncertainties for deep learning using calibrated regression. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2796–2804. PMLR, 2018.

Kull, M. and Flach, P. Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 68–85. Springer, 2015.

Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474, 2017a.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413, 2017b.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW '10), pp. 661–670. ACM, 2010.

McIntire, M., Ratner, D., and Ermon, S. Sparse Gaussian processes for Bayesian optimization. In UAI, 2016.

Murphy, A. H. A new vector partition of the probability score. Journal of Applied Meteorology, 12(4):595–600, 1973.

Niculescu-Mizil, A. and Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pp. 625–632, 2005.
Platt, J. et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review, 133(5):1155–1174, 2005.

Rajeswaran, A., Ghotra, S., Ravindran, B., and Levine, S. EPOpt: Learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283, 2016.

Rollinson, J. and Brunskill, E. From predictive models to instructional policies. International Educational Data Mining Society, 2015.

Saria, S. Individualized sepsis treatment using reinforcement learning. Nature Medicine, 24(11):1641–1642, 2018.

Singh, S. P., Kearns, M. J., Litman, D. J., and Walker, M. A. Reinforcement learning for spoken dialogue systems. In Advances in Neural Information Processing Systems, pp. 956–962, 2000.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033. IEEE, 2012.

Van Roy, B., Bertsekas, D. P., Lee, Y., and Tsitsiklis, J. N. A neuro-dynamic programming approach to retailer inventory management. In Proceedings of the 36th IEEE Conference on Decision and Control, volume 4, pp. 4052–4057. IEEE, 1997.

Vermeulen, S. J., Campbell, B. M., and Ingram, J. S. Climate change and food systems. Annual Review of Environment and Resources, 37, 2012.

Zadrozny, B. and Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In International Conference on Knowledge Discovery and Data Mining (KDD), pp. 694–699, 2002.