# Model-Based Active Exploration

Pranav Shyam 1, Wojciech Jaśkowski 1, Faustino Gomez 1

1 NNAISENSE, Lugano, Switzerland. Correspondence to: Pranav Shyam.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Efficient exploration is an unsolved problem in Reinforcement Learning which is usually addressed by reactively rewarding the agent for fortuitously encountering novel situations. This paper introduces an efficient active exploration algorithm, Model-Based Active eXploration (MAX), which uses an ensemble of forward models to plan to observe novel events. This is carried out by optimizing agent behaviour with respect to a measure of novelty derived from the Bayesian perspective of exploration, which is estimated using the disagreement between the futures predicted by the ensemble members. We show empirically that in semi-random discrete environments where directed exploration is critical to make progress, MAX is at least an order of magnitude more efficient than strong baselines. MAX scales to high-dimensional continuous environments where it builds task-agnostic models that can be used for any downstream task.

1. Introduction

Efficient exploration in large, high-dimensional environments is an unsolved problem in Reinforcement Learning (RL). Current exploration methods (Osband et al., 2016; Bellemare et al., 2016; Houthooft et al., 2016; Pathak et al., 2017) are reactive: the agent accidentally observes something novel and then decides to obtain more information about it. Further exploration in the vicinity of the novel state is typically carried out through an exploration bonus or intrinsic motivation reward, which has to be unlearned once the novelty has worn off, making exploration inefficient, a problem we refer to as over-commitment. However, exploration can also be active, where the agent seeks out novelty based on its own internal estimate of which action sequences will lead to interesting transitions.

This approach is inherently more powerful than reactive exploration, but it requires a method to predict the consequences of actions and their degree of novelty. The problem can be formulated optimally in the Bayesian setting, where the novelty of a given state transition can be measured by the disagreement between the next-state predictions made by probable models of the environment. This paper introduces Model-Based Active eXploration (MAX), an efficient algorithm, based on this principle, that approximates the idealized distribution using an ensemble of learned forward dynamics models. The algorithm identifies learnable unknowns, or uncertainty, representing novelty in the environment by measuring the amount of conflict between the predictions of the constituent models. It then constructs exploration policies that resolve those conflicts by visiting the relevant area. Unlearnable unknowns, or risk, such as random noise in the environment, do not interfere with this process, since noise manifests as confusion among all models rather than as a conflict between them. In discrete environments, novelty can be evaluated using the Jensen-Shannon Divergence (JSD) between the predicted next-state distributions of the models in the ensemble. In continuous environments, computing the JSD is intractable, so MAX instead uses the functionally equivalent Jensen-Rényi Divergence based on Rényi quadratic entropy (Rényi, 1961).
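To make the distinction between conflict and confusion concrete, the following minimal sketch (our own Python illustration, not code from the MAX repository) compares the generalized Jensen-Shannon Divergence, computed as the entropy of the mean prediction minus the mean entropy (the decomposition developed in Section 2.3), for two hypothetical ensembles of categorical next-state predictions: one facing pure noise and one facing a genuinely unknown transition.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a categorical distribution (natural log)."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))

def jsd(dists):
    """Generalized Jensen-Shannon Divergence of a set of categorical
    distributions: entropy of the mean minus mean of the entropies."""
    dists = np.asarray(dists, dtype=float)
    return entropy(dists.mean(axis=0)) - np.mean([entropy(p) for p in dists])

# Hypothetical ensemble predictions over 4 possible next states.
# Risk (noise): every model predicts the same broad distribution.
noisy = [[0.25, 0.25, 0.25, 0.25]] * 4
# Novelty: the models confidently disagree about which state comes next.
novel = [[0.97, 0.01, 0.01, 0.01],
         [0.01, 0.97, 0.01, 0.01],
         [0.01, 0.01, 0.97, 0.01],
         [0.01, 0.01, 0.01, 0.97]]

print(f"JSD under noise:   {jsd(noisy):.3f}")  # ~0.0: confusion, not conflict
print(f"JSD under novelty: {jsd(novel):.3f}")  # large: the models conflict
```

Noise leaves the disagreement near zero even though every model is uncertain, whereas genuine novelty produces a large divergence; this is the signal MAX seeks out.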
While MAX can be used in conjunction with conventional policy learning to maximize external reward, this paper focuses on pure exploration: exploration disregarding, or in the absence of, external reward, followed by exploitation (Bubeck et al., 2009). This setup is more natural in situations where it is useful to perform task-agnostic exploration and learn models that can later be exploited for multiple tasks, including those that are not known a priori. Experiments in the discrete domain show that MAX is significantly more efficient than reactive exploration techniques which use exploration bonuses or posterior sampling, while also strongly suggesting that MAX copes with risk. In the high-dimensional continuous Ant Maze environment, MAX reaches the far end of a U-shaped maze in just 40 episodes (12k steps), while reactive baselines are only around the mid-way point after the same time. In the Half Cheetah environment, the data collected by MAX leads to superior performance versus the data collected by reactive baselines when exploited using model-based RL.

Code: https://github.com/nnaisense/max

2. Model-Based Active Exploration

The key idea behind our approach to active exploration in the environment, or the external Markov Decision Process (MDP), is to use a surrogate or exploration MDP in which the novelty of transitions can be estimated before they have actually been encountered by the agent in the environment. The next section provides the formal context for the conceptual foundation of this work.

2.1. Problem Setup

Consider the environment, or the external MDP, represented as the tuple (S, A, t*, r, ρ0), where S is the state space, A is the action space, t* is the unknown transition function S × A × S → [0, ∞), specifying the probability density p(s'|s, a, t*) of the next state s' given the current state s and the action a, r : S × A → R is the reward function, and ρ0 : S → [0, ∞) is the probability density function of the initial state. Let T be the space of all possible transition functions and P(T) be a probability distribution over transition functions that captures the current belief of how the environment works, with corresponding density function p(t).

The objective of pure exploration is to efficiently accumulate information about the environment, irrespective of r. This is equivalent to learning an accurate model of the transition function t* while minimizing the number of state transitions φ, belonging to the transition space Φ, required to do so, where φ = (s, a, s') and s' is the state resulting from action a being taken in state s. Pure exploration can be defined as an iterative process, where in each iteration an exploration policy π : S × A → [0, ∞), specifying a density p(a|s, π), is used to collect information about areas of the environment that have not been explored up to that iteration. To learn such exploration policies, there needs to be a method to evaluate any given policy at each iteration.

2.2. Utility of an Exploration Policy

In the standard RL setting, a policy would be learned to take actions that maximize some function of the external reward received from the environment according to r, i.e., the return. Because pure active exploration does not care about r, and t* is unknown, the amount of new information conveyed about the environment by the state transitions that could be caused by an exploration policy has to be used as the learning signal instead.
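For intuition, the belief P(T) can be made concrete in a small tabular setting. The sketch below is an illustrative assumption of ours (the paper itself represents the belief with an ensemble of learned neural network models, Section 2.3): it keeps an independent Dirichlet distribution over the next-state probabilities of every state-action pair and samples candidate transition functions t ∼ P(T|D) from it. The following paragraphs quantify how much a single observed transition changes such a belief.

```python
import numpy as np

class DirichletBelief:
    """Tabular belief P(T|D): an independent Dirichlet over next-state
    probabilities for every (state, action) pair (illustrative only)."""

    def __init__(self, n_states, n_actions, prior=1.0):
        # Uniform belief before any data has been observed.
        self.alpha = np.full((n_states, n_actions, n_states), prior)

    def update(self, s, a, s_next):
        """Condition the belief on an observed transition phi = (s, a, s')."""
        self.alpha[s, a, s_next] += 1.0

    def sample_transition_function(self, rng):
        """Draw one transition function t ~ P(T|D); t[s, a] is p(.|s, a, t)."""
        return np.array([[rng.dirichlet(self.alpha[s, a])
                          for a in range(self.alpha.shape[1])]
                         for s in range(self.alpha.shape[0])])

rng = np.random.default_rng(0)
belief = DirichletBelief(n_states=5, n_actions=2)
belief.update(s=1, a=0, s_next=2)           # one observed transition
t = belief.sample_transition_function(rng)  # a probable model of the environment
print(t[1, 0])                              # p(s'|s=1, a=0, t)
```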
From the Bayesian perspective, this can be captured by the KL-divergence between P(T), the (prior) distribution over transition functions before a particular transition φ, and P(T|φ), the posterior distribution after φ has occurred. This quantity is commonly referred to as the Information Gain, abbreviated as IG(φ) for a transition φ:

IG(s, a, s') = IG(φ) = D_KL(P(T|φ) ‖ P(T)).   (1)

The utility can be understood as the extra number of bits needed to specify the posterior relative to the prior, effectively the number of bits of information that was gathered about the external MDP. Given IG(φ), it is now possible to compute the utility of the exploration policy, IG(π), which is the expected utility over the transitions when π is used:

IG(π) = E_{φ∼P(Φ|π)} [IG(φ)],   (2)

which can be expanded into (see Appendix A):

IG(π) = E_{t∼P(T)} E_{(s,a)∼P(S,A|π,t)} [u(s, a)],   (3)

where

u(s, a) = ∫_T ∫_S IG(s, a, s') p(s'|s, a, t) p(t) ds' dt.   (4)

It turns out that (see Appendix B):

u(s, a) = JSD{P(S|s, a, t) | t ∼ P(T)},   (5)

where JSD is the Jensen-Shannon Divergence, which captures the amount of disagreement present in a space of distributions. Hence, the utility of the state-action pair, u(s, a), is the disagreement, in terms of JSD, among the next-state distributions given s and a of all possible transition functions, weighted by their probability. Since u depends only on the prior P(T), the novelty of potential transitions can be calculated without having to actually effect them in the external MDP.

The internal exploration MDP can then be defined as (S, A, t̄, u, δ(s_c)), where the sets S and A are identical to those of the external MDP, and the transition function t̄ is defined such that

p(s'|s, a, t̄) := E_{t∼P(T)} [p(s'|s, a, t)],   (6)

which can be interpreted as a different sample of P(T) being drawn at each state transition. The utility u is to the exploration MDP what r is to the external MDP, so that maximizing u results in the optimal exploration policy at that iteration, just as maximizing r results in the optimal policy for the corresponding task. Finally, the initial state distribution density is set to the Dirac delta function δ(s_c), so that the initial state of the exploration MDP is always the current state s_c of the agent in the environment.

It is important to understand that the prior P(T) is used twice in the exploration MDP:

1. To specify the state-action joint distribution as per Equation 3. Each member t in the prior P(T) determines a distribution P(S, A|π, t) over the set of possible state-action pairs that can result from sequentially executing actions according to π starting in s_c.

2. To obtain the utility for a particular transition as per Equation 5. Each state-action pair (s, a) in the above P(S, A|π, t), according to the transition functions from P(T), forms a set of predicted next-state distributions {P(S|s, a, t) | t ∼ P(T)}. The JSD of this set is u(s, a).
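Continuing the hypothetical Dirichlet belief from the earlier sketch, the snippet below computes IG(φ) of Equation 1 as the KL divergence between the posterior and prior Dirichlet factors of the single (s, a) pair touched by the transition (SciPy is assumed for the log-gamma and digamma functions; this is our illustration, not the paper's code).

```python
import numpy as np
from scipy.special import digamma, gammaln

def dirichlet_kl(alpha_post, alpha_prior):
    """KL divergence D_KL(Dir(alpha_post) || Dir(alpha_prior))."""
    a, b = np.asarray(alpha_post, float), np.asarray(alpha_prior, float)
    return (gammaln(a.sum()) - gammaln(b.sum())
            - np.sum(gammaln(a)) + np.sum(gammaln(b))
            + np.sum((a - b) * (digamma(a) - digamma(a.sum()))))

def information_gain(alpha, s, a, s_next):
    """IG(phi) = D_KL(P(T|phi) || P(T)) for a transition phi = (s, a, s'),
    under the factorized Dirichlet belief of the previous sketch. Only the
    (s, a) factor of the belief changes, so the KL reduces to that factor."""
    prior = alpha[s, a]
    posterior = prior.copy()
    posterior[s_next] += 1.0
    return dirichlet_kl(posterior, prior)

alpha = np.ones((5, 2, 5))                           # uniform belief, 5-state MDP
print(information_gain(alpha, s=1, a=0, s_next=2))   # information gain in nats
```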
2.3. Bootstrap Ensemble Approximation

The prior P(T) can be approximated using a bootstrap ensemble of N learned transition functions, or models, that are each trained independently using different subsets of the history D, consisting of the state transitions experienced by the agent while exploring the external MDP (Efron, 2012). Therefore, while P(T) is uniform when the agent is initialized, it is thereafter conditioned on the agent's history D, so that the general form of the prior is P(T|D). For generalizations that are warranted by the observed data, there is a good chance that the models make similar predictions. If a generalization is not warranted by the observed data, then the models could disagree owing to their exposure to different parts of the data distribution. Since the ensemble contains models that were trained to accurately approximate the data, these models have higher probability densities than a random model. Hence, even a relatively small ensemble can approximate the true distribution P(T|D) (Lakshminarayanan et al., 2017). Recent work suggests that it is possible to do so even in high-dimensional state and action spaces (Kurutach et al., 2018; Chua et al., 2018).

Using an N-model ensemble {t_1, t_2, ..., t_N} approximating the prior, and assuming that all models fit the data equally well, the dynamics of the exploration MDP can be approximated by randomly selecting one of the N models with equal probability at each transition. Therefore, to approximate u(s, a), the JSD in Equation 5 can be expanded as (see Appendix B):

u(s, a) = JSD{P(S|s, a, t) | t ∼ P(T)} = H( E_{t∼P(T)} [P(S|s, a, t)] ) − E_{t∼P(T)} [H(P(S|s, a, t))],   (7)

where H(·) denotes the entropy of the distribution. Equation 7 can be summarized as the difference between the entropy of the average and the average entropy, and it can be approximated by averaging samples from the ensemble:

u(s, a) ≈ H( (1/N) Σ_{i=1}^{N} P(S|s, a, t_i) ) − (1/N) Σ_{i=1}^{N} H(P(S|s, a, t_i)).   (8)

Figure 1. Algorithm performance on the randomized Chain environment: (a) Chain environment of length 10; (b) 50-state chain; (c) varying chain lengths; (d) stochastic trap. For the first 3 episodes, marked by the vertical dotted line, actions were chosen at random (as warm-up). Each line corresponds to the median of 100 runs (seeds) in (b) and 5 runs in (c) and (d). The shaded area spans the 25th and 75th percentiles.

2.4. Large Continuous State Spaces

For continuous state spaces, S ⊆ R^d, the next-state distribution P(S|s, a, t_i) is generally parameterized, typically using a multivariate Gaussian N_i(μ_i, Σ_i) with mean vector μ_i and covariance matrix Σ_i. With this, evaluating Equation 8 is intractable, as it involves estimating the entropy of a mixture of Gaussians, which has no analytical solution. This problem can be circumvented by replacing the Shannon entropy with the Rényi entropy (Rényi, 1961) and using the corresponding Jensen-Rényi Divergence (JRD). The Rényi entropy of a random variable X with density p is defined as

H_α(X) = (1 / (1 − α)) ln ∫ p^α(x) dx

for a given order α ≥ 0, of which the Shannon entropy is the special case obtained as α tends to 1. When α = 2, the resulting quadratic Rényi entropy has a closed-form solution for a mixture of N Gaussians (Wang et al., 2009). Therefore, the Jensen-Rényi Divergence JRD{N_i(μ_i, Σ_i) | i = 1, ..., N} can be calculated with

JRD{N_i(μ_i, Σ_i) | i = 1, ..., N} = −ln( (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} D(N_i, N_j) ) − (1/(2N)) Σ_{i=1}^{N} ln |Σ_i| − c,   (9)

where c = d ln(2)/2 and

D(N_i, N_j) = (1 / sqrt(|Σ|)) exp( −(1/2) Δᵀ Σ⁻¹ Δ ),

with Σ = Σ_i + Σ_j and Δ = μ_j − μ_i.

Equation 9 measures the divergence among the predictions of the models based on some combination of their means and variances. However, when the models are learned, the parameters concentrate around their true values at different rates, and environments can greatly differ in the amount of noise they contain. On the one hand, if the environment is completely deterministic, exploration effort could be wasted in precisely matching the small predicted variances of all the models. On the other hand, ignoring the variance in a noisy environment could result in poor exploration.
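Equation 9 is straightforward to implement for an ensemble of diagonal-covariance Gaussian predictions. The following is a minimal NumPy sketch of that closed form as reconstructed above (our own illustration, not the authors' implementation); the temperature parameter introduced next rescales the variances that are fed into this computation.

```python
import numpy as np

def jensen_renyi_divergence(means, variances):
    """Jensen-Rényi Divergence (order 2) of N Gaussians with diagonal
    covariances, following the closed form of Equation 9.

    means, variances: arrays of shape (N, d)."""
    means, variances = np.asarray(means, float), np.asarray(variances, float)
    n, d = means.shape

    # Pairwise terms D(N_i, N_j) = |S|^{-1/2} exp(-0.5 * delta^T S^{-1} delta),
    # with S = Sigma_i + Sigma_j (diagonal) and delta = mu_j - mu_i.
    s = variances[:, None, :] + variances[None, :, :]     # (N, N, d)
    delta = means[None, :, :] - means[:, None, :]          # (N, N, d)
    log_pair = -0.5 * np.sum(np.log(s) + delta ** 2 / s, axis=-1)

    # Mixture term: -ln( (1/N^2) * sum_ij D(N_i, N_j) ).
    mixture_term = -(np.log(np.exp(log_pair).sum()) - 2.0 * np.log(n))

    # Average member term and the constant c = d * ln(2) / 2.
    member_term = np.sum(np.log(variances)) / (2.0 * n)
    c = 0.5 * d * np.log(2.0)
    return mixture_term - member_term - c

# Example: 4 hypothetical ensemble predictions for a 3-dimensional next state.
mu = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [0.5, 0.5, 0.5]])
var = np.full((4, 3), 0.05)
print(jensen_renyi_divergence(mu, var))  # the utility u(s, a) under Equation 9
```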
To inject such prior knowledge into the system, an optional temperature parameter λ ∈ [0, 1] that modulates the sensitivity of Equation 9 with respect to the variances was introduced. Since the outputs of parametric non-linear models, such as neural networks, are unbounded, it is common to use variance bounds for model and numerical stability. Using the upper bound Σ_U, the variances can be re-scaled with λ:

Σ̂_i = Σ_U − λ(Σ_U − Σ_i),   ∀ i = 1, ..., N.

In this paper, λ was fixed to 0.1 for all continuous environments.

2.5. The MAX Algorithm

Algorithm 1 presents MAX in high-level pseudo-code. MAX is, essentially, a model-based RL algorithm with exploration as its objective. At each step, a fresh exploration policy is learned to maximize its return in the exploration MDP, a procedure which is generically specified as SOLVE(Exploration MDP). The policy then acts in the external MDP to collect new data, which is used to train the ensemble, yielding the approximate posterior. This posterior is then used as the approximate prior in the subsequent exploration step. Note that a transition function is drawn from T for each transition in the exploration MDP. In practice, training the model ensemble and optimizing the policy can be performed at a fixed frequency to reduce the computational cost.

Algorithm 1 MODEL-BASED ACTIVE EXPLORATION
  Initialize: transition dataset D with a random policy
  Initialize: model ensemble T = {t_1, t_2, ..., t_N}
  repeat
    while episode not complete do
      Exploration MDP ← (S, A, Uniform{T}, u, δ(s_c))
      π ← SOLVE(Exploration MDP)
      a_c ← π(s_c); act in environment: s'_c ∼ P(S|s_c, a_c, t*)
      D ← D ∪ {(s_c, a_c, s'_c)}
      Train t_i on D for each t_i in T
    end while
  until computation budget exhausted

3. Experiments

3.1. Discrete Environment

A randomized version of the Chain environment (Figure 1a), proposed by Osband et al. (2016) and designed to be hard to explore, which is a simplified generalization of RiverSwim (Strehl & Littman, 2005), was used to evaluate MAX. Starting in the second state (state 1) of an L-state chain, the agent can move either left or right at each step. An episode lasts L + 9 steps, after which the agent is reset to the start. The agent is first given 3 warm-up episodes during which actions are chosen randomly. Trying to move outside the chain results in staying in place. The agent is rewarded only for staying in the edge states: 0.001 and 1 for the leftmost and the rightmost state, respectively. To make the problem harder (e.g., not solvable by the policy always-go-right), the effect of each action was randomly swapped so that in approximately half of the states, going RIGHT results in a leftward transition and vice-versa. Unless stated otherwise, L = 50 was used, so that exploring the environment fully and reaching the far right states is unlikely using random exploration. The probability of the latter decreases exponentially with L. Therefore, in order to explore efficiently, an agent needs to exploit the structure of the environment.
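For reference, the randomized Chain environment can be reconstructed from the description above in a few lines. The sketch below is our reading of that description (the per-state swap probability of 0.5 and the exact timing of the edge-state rewards are assumptions, not details taken from the authors' code):

```python
import numpy as np

class RandomizedChain:
    """Minimal sketch of the randomized Chain environment described above."""

    def __init__(self, length=50, seed=0):
        rng = np.random.default_rng(seed)
        self.length = length
        # In roughly half of the states the effect of LEFT/RIGHT is swapped.
        self.swapped = rng.random(length) < 0.5
        self.horizon = length + 9                  # episode lasts L + 9 steps
        self.reset()

    def reset(self):
        self.state, self.steps = 1, 0              # start in the second state
        return self.state

    def step(self, action):                        # action: 0 = LEFT, 1 = RIGHT
        direction = 1 if action == 1 else -1
        if self.swapped[self.state]:
            direction = -direction
        # Trying to move outside the chain results in staying in place.
        self.state = int(np.clip(self.state + direction, 0, self.length - 1))
        self.steps += 1
        if self.state == 0:
            reward = 0.001                         # small reward at the left edge
        elif self.state == self.length - 1:
            reward = 1.0                           # large reward at the right edge
        else:
            reward = 0.0
        done = self.steps >= self.horizon
        return self.state, reward, done
```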
MAX was compared to two exploration methods based on the optimism in the face of uncertainty principle (Kaelbling et al., 1996): Exploration Bonus DQN (EB-DQN; Bellemare et al., 2016) and Bootstrapped DQN (Boot-DQN; Osband et al., 2016). Both algorithms employ the sample-efficient DQN algorithm (Mnih et al., 2015). Bootstrapped DQN is claimed to be better than state-of-the-art approaches to exploration via dithering (ε-greedy), optimism and posterior sampling (Osband et al., 2016). Both of them are reactive, since they do not explicitly seek new transitions but, upon finding one, prioritize frequenting it. Note that these baselines are fundamentally any-time-optimal RL algorithms which minimize cumulative regret by trading off exploration and exploitation in each action. For the chain environment, MAX used Monte-Carlo Tree Search to find open-loop exploration policies (see Appendix C for details). The hyper-parameters of both baseline methods were tuned with grid search.

Figure 1b shows the percentage of explored transitions as a function of training episodes for all the methods. MAX explores 100% of the transitions in around 15 episodes, while the baseline methods reach 40% in 60 episodes. Figure 1c shows the exploration progress curves for MAX when the chain length was varied from 20 to 100 in intervals of 5. To see if MAX can distinguish between environment risk and uncertainty, the left-most state (state 0) of the Chain environment was modified to be a stochastic trap state (see Appendix C). Although MAX slowed down as a consequence, it still managed to explore the transitions, as Figure 1d shows.

3.2. Continuous Environments

To evaluate MAX in the high-dimensional continuous setting, two environments based on MuJoCo (Todorov et al., 2012), Ant Maze and Half Cheetah, were considered. The exploration performance was measured directly for Ant Maze, and indirectly in the case of Half Cheetah.

Figure 2. Performance of MAX exploration on the Ant Maze task: (a) Ant Maze environment; (b) maze exploration performance; (c)-(f) maze coverage after 300, 600, 3600 and 12000 steps. (a) shows the environment used. Results presented in (b) show that active methods (MAX and TVAX) are significantly quicker in exploring the maze compared to reactive methods (JDRX and PERX), with MAX being the quickest. (c)-(f) visualize the maze exploration by MAX across 8 runs. Chronological order of positions within an episode is encoded with the color spectrum, going from yellow (earlier in the episode) to blue (later in the episode).

Figure 3. MAX on Half Cheetah tasks: (a) running task performance; (b) flipping task performance; (c) average performance. The grey dashed horizontal line shows the average performance of an oracle model-free policy trained for 200k (10x more) steps by SAC in the environment, directly using the corresponding task-specific reward function. Notice that exploring the dynamics for the flipping task is more difficult than for the running task, as evidenced by the performance of the random baseline. Overall, active methods are quicker and better explorers compared to the reactive ones in this task. Each curve is the mean of 8 runs.

MAX was compared to four other exploration methods that lack at least one feature of MAX:

1. Trajectory Variance Active Exploration (TVAX): an active exploration method that defines transition utilities as the per-timestep variance in sampled trajectories, in contrast to the per-state JSD between next-state predictions used in MAX.

2. Jensen-Rényi Divergence Reactive Exploration (JDRX): a reactive counterpart of MAX, which learns the exploration policy directly from the experience collected so far, without planning in the exploration MDP.

3. Prediction Error Reactive Exploration (PERX): a commonly used reactive exploration method (e.g. in Pathak et al.
(2017)), which uses the mean prediction error of the ensemble as transition utility.

4. Random exploration policy.

In the Ant Maze (see Figure 2a), exploration performance was measured directly as the fraction of the U-shaped maze that the agent visited during exploration. In Half Cheetah, exploration performance was evaluated by measuring the usefulness of the learned model ensemble when exploiting it to perform two downstream tasks: running and flipping. For both environments, Gaussian noise of N(0, 0.02) was added to the states to introduce stochasticity in the dynamics. Appendix D details the setup.

Models were probabilistic deep neural networks trained with a negative log-likelihood loss to predict the next-state distributions in the form of multivariate Gaussians with diagonal covariance matrices. Soft Actor-Critic (SAC; Haarnoja et al., 2018) was used to learn both pure exploration and task-specific policies. The maximum entropy framework used in SAC is particularly well-suited to model-based RL, as it uses an objective that both improves policy robustness, which hinders adversarial model exploitation, and yields multi-modal policies, which could mitigate the negative effects of planning with inaccurate models. Exploration policies were regularly trained with SAC from scratch, with the utilities re-calculated using the latest models, to avoid over-commitment. The first phase of training involved only a fixed dataset containing the transitions experienced by the agent so far (the agent history D). For the active methods (MAX and TVAX), this was followed by an additional phase where the policies were updated using data generated exclusively from the imaginary exploration MDP, which is the key feature distinguishing active from reactive exploration.

The results for Ant Maze and Half Cheetah are presented in Figures 2 and 3, respectively. For Half Cheetah, an additional baseline, obtained by training an agent with model-free SAC using the task-specific reward in the environment, is included. In both cases, the active exploration methods (MAX and TVAX) outperform the reactive ones (JDRX and PERX). Due to the noisy dynamics, PERX performs poorly. Among the two active methods, MAX is noticeably better, as it uses a principled trajectory sampling and utility evaluation technique, in contrast to the TVAX baseline, which cannot distinguish risk from uncertainty. It is important to notice that the running task is easy, since random exploration is sufficient, which is not the case for flipping, where good performance requires directed active exploration.

None of the methods were able to learn task-oriented planning in Ant Maze, even with larger models and longer training times than were used to obtain the reported results. The Ant Maze environment is more complex than Half Cheetah, and obtaining good performance in downstream tasks using only the learned models is difficult due to other confounding factors, such as the compounding errors that arise in long-horizon planning. Hence, exploration performance was measured simply as the fraction of the maze that the agent explored. The inferior results of the baseline methods suggest that this task is non-trivial.
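The ensemble members used in these continuous experiments are described above as probabilistic networks that output a diagonal Gaussian over the next state and are trained with a negative log-likelihood loss. A minimal PyTorch sketch of one such member is given below; the layer sizes, variance bounds and state/action dimensions are placeholders of ours, not the configuration from the paper's repository.

```python
import torch
import torch.nn as nn

class ProbabilisticDynamicsModel(nn.Module):
    """One ensemble member: predicts a diagonal Gaussian over the next state."""

    def __init__(self, state_dim, action_dim, hidden=256,
                 log_var_min=-8.0, log_var_max=2.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * state_dim),      # mean and log-variance
        )
        # Variance bounds for numerical stability (values are assumptions).
        self.log_var_min, self.log_var_max = log_var_min, log_var_max

    def forward(self, state, action):
        mean, log_var = self.net(torch.cat([state, action], dim=-1)).chunk(2, dim=-1)
        log_var = torch.clamp(log_var, self.log_var_min, self.log_var_max)
        return mean, log_var

    def nll_loss(self, state, action, next_state):
        """Gaussian negative log-likelihood of the next state (up to a constant)."""
        mean, log_var = self(state, action)
        return 0.5 * (log_var + (next_state - mean) ** 2 / log_var.exp()).sum(-1).mean()

# Hypothetical usage: each of the N ensemble members is trained independently
# on its own bootstrap sample of the transition dataset D.
model = ProbabilisticDynamicsModel(state_dim=27, action_dim=8)
s, a, s_next = torch.randn(32, 27), torch.randn(32, 8), torch.randn(32, 27)
loss = model.nll_loss(s, a, s_next)
loss.backward()
```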
The evolution of the uncertainty landscape over the state space of the environment when MAX is employed is visualized in Figure 4 for the Continuous Mountain Car environment. In the first exploration episode, the agent takes a spiral path through the state space (Figure 4c): MAX was able to drive the car up and down the sides of the valley to develop enough velocity to reach the mountain top without any external reward. In subsequent episodes, it carves out more spiral paths in-between the previous ones (Figure 4d).

Figure 4. Illustration of MAX exploration in the Continuous Mountain Car environment after 100, 150, 220 and 800 steps (panels (a)-(d)). Each plot shows the state space of the agent, discretized as a 2D grid. The color indicates the average uncertainty of a state over all actions. The dotted lines represent the trajectories of the agent.

4. Discussion

An agent is meta-stable if it is sub-optimal and, at the same time, has a policy that prevents it from gaining the experience necessary to improve itself (Watkins, 1989). Simply put, a policy can get stuck in a local optimum and not be able to get out of it. In simple cases, undirected exploration techniques (Thrun, 1992), such as adding random noise to the actions of the policy, might be sufficient to break out of meta-stability. If the environment is ergodic, then reactive strategies that use exploration bonuses can also resolve meta-stability. But active exploration of the form presented in this paper can, in principle, break free of any type of meta-stability.

Model-based RL promises to be significantly more efficient and more general than model-free RL methods. However, it suffers from model-bias (Deisenroth & Rasmussen, 2011): in certain regions of the state space, the models could deviate significantly from the external MDP. Model-bias can have many causes, such as improper generalization or poor exploration. A strong policy search method could then exploit such degeneracy, resulting in over-optimistic policies that fail in the environment. Thorough exploration is one way to potentially mitigate this issue. If learning certain aspects of the environment is difficult, it will manifest itself as disagreement in the ensemble. MAX would collect more data about those aspects to improve the quality of the models, thereby limiting adversarial exploitation by the policy. Since model-based RL does not have an inherent mechanism to explore, MAX could be considered an important addition to the model-based RL framework rather than merely an application of it.

Limitations. The derivation in Section 2 makes the assumption that the utility of a policy is the average utility of the probable transitions when the policy is used. However, encountering a subset of those transitions and training the models can change the utility of the remaining transitions, thereby affecting the utility of the policy. This second-order effect was not considered in the derivation. In the Chain environment, for example, this effect leads to the agent planning to loop between pairs of uncertain states rather than visiting many different uncertain states. MAX is also less computationally efficient than the baselines used in the paper, as it trades off computational efficiency for data efficiency, as is common in model-based algorithms.

5. Related Work

Our work is inspired by the framework developed in Schmidhuber (1997; 2002), in which two adversarial reward-maximizing modules called the left brain and the right brain bet on the outcomes of experimental protocols or algorithms they have collectively generated and agreed upon.
Each brain is intrinsically motivated to outwit or surprise the other by proposing an experiment such that the other agrees on the experimental protocol but disagrees on the predicted outcome. After having executed the action sequence protocol approved by both brains, the surprised loser pays a reward to the winner in a zero-sum game. MAX greatly simplifies this previous active exploration framework, distilling certain essential aspects. Two or more predictive models that may compute different hypotheses about the consequences of the actions of the agent, given observations, are still used. However, there is only one reward maximizer or RL machine, which is separate from the predictive models.

The information provided by an experiment was first analytically measured by Lindley (1956) in the form of expected information gain in the Shannon sense (Shannon, 1948). Fedorov (1972) also proposed a theory of optimal resource allocation during experimentation. By the 1990s, information gain was used as an intrinsic reward for reinforcement learning systems (Storck et al., 1995). Even earlier, intrinsic reward signals were based on the prediction errors of a predictive model (Schmidhuber, 1991a) and on the learning progress of a predictive model (Schmidhuber, 1991b). Thrun (1992) introduced the notions of directed and undirected exploration in RL.

Optimal Bayesian experimental design (Chaloner & Verdinelli, 1995) is a framework for efficiently performing sequential experiments that uncover a phenomenon. However, the approach is usually restricted to linear models with Gaussian assumptions. Busetto et al. (2009) proposed an optimal experimental design framework for model selection of nonlinear biochemical systems using expected information gain, where they solve for the posterior using the Fokker-Planck equation. In Model-Based Interval Estimation (Wiering, 1999), the uncertainty in the transition function captured by a surrogate model is used to boost the Q-values of actions. In the context of Active Learning, McCallum & Nigam (1998) proposed using the Jensen-Shannon Divergence between the predictions of a committee of classifiers to identify the most useful sample to be labelled next among a pool of unlabelled samples. Singh et al. (2005) developed an intrinsic motivation framework inspired by neuroscience using prediction errors. Itti & Baldi (2009) presented the surprise formulation used in Equation 1 and demonstrated a strong correlation between surprise and human attention. At a high level, MAX can be seen as a form of Bayesian Optimization (Snoek et al., 2012) adapted for exploration in RL, which employs an inner search-based optimization during planning.

Curiosity has also been studied extensively from the perspective of developmental robotics (Oudeyer, 2018). Schmidhuber (2009) suggested a general form of learning progress as compression progress, which can be used as an extra intrinsic reward for curious RL systems. Following these, Sun et al. (2011) developed an optimal Bayesian framework for curiosity-driven exploration using learning progress. After proving that Information Gain is additive in expectation, a dynamic programming-based algorithm was proposed to maximize Information Gain. Experiments, however, were limited to small tabular MDPs with a Dirichlet prior on transition probabilities.
A similar Bayesian-inspired, hypothesis-resolving, model-based RL exploration algorithm was proposed by Hester & Stone (2012) and shown to outperform prediction-error-based and other intrinsic motivation methods. In contrast to MAX, their planning uses the mean prediction of a model ensemble to optimize a disagreement-based utility measure, which is augmented with an additional state-distance bonus. Still & Precup (2012) derived an exploration-exploitation trade-off in an attempt to maximize the predictive power of the agent. Mohamed & Rezende (2015) combined Variational Inference and Deep Learning to form an objective based on mutual information to approximate agent empowerment. In comparison to our method, Houthooft et al. (2016) presented a Bayesian approach to evaluate the value of experience, taking a reactive approach. However, they also used Bayesian Neural Networks to maintain a belief over environment dynamics, and the information gain to bias the policy search with an intrinsic reward component. Variational Inference was used to approximate the prior-posterior KL divergence. Bellemare et al. (2016) derived a notion of pseudo-count for estimating state visitation frequency in high-dimensional spaces. They then transformed this into a form of exploration bonus that is maximized using DQN. Osband et al. (2016) proposed Bootstrapped DQN, which was used as a baseline. Pathak et al. (2017) used inverse models to avoid learning anything that the agent cannot control, to reduce risk, and the prediction error in a latent space to perform reactive exploration. A large-scale study of curiosity-driven exploration (Burda et al., 2019) found that curiosity is correlated with the actual objectives of many environments, and reported that using random features mitigates some of the non-stationarity implicit in methods based on curiosity. Eysenbach et al. (2018) demonstrated the power of optimizing policy diversity in the absence of a reward function for developing skills which could then be exploited.

Model-based RL has long been touted as the cure for the sample inefficiency problems of modern RL (Schmidhuber, 1990; Sutton, 1991; Deisenroth & Rasmussen, 2011). Yet learning accurate models of high-dimensional environments and exploiting them appropriately in downstream tasks is still an active area of research. Recently, Kurutach et al. (2018) and Chua et al. (2018) have shown the potential of model-based RL when combined with Deep Learning in high-dimensional environments. In particular, this work was inspired by Chua et al. (2018), who combined probabilistic models with novel trajectory sampling techniques using particles to obtain better approximations of the returns in the environment. Concurrently with this work, Pathak et al. (2019) also showed the advantages of using an ensemble of models for exploring complex high-dimensional environments, including on a real robot.

6. Conclusion

This paper introduced MAX, a model-based RL algorithm for pure exploration. It can distinguish between learnable and unlearnable unknowns and search for policies that actively seek learnable unknowns in the environment. MAX provides the means to use an ensemble of models for the simulation and evaluation of an exploration policy. The quality of the exploration policy can therefore be directly optimized without actual interaction with the environment. Experiments in hard-to-explore discrete and high-dimensional continuous environments indicate that MAX is a powerful generic exploration method.
Acknowledgements

We would like to thank Jürgen Schmidhuber, Jan Koutník, Garrett Andersen, Christian Osendorfer, Timon Willi, Bas Steunebrink, Simone Pozzoli, Nihat Engin Toklu, Rupesh Kumar Srivastava and Mirek Strupl for their assistance and everyone at NNAISENSE for being part of a conducive research environment.

References

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying Count-Based Exploration and Intrinsic Motivation. In Advances in Neural Information Processing Systems, pp. 1471-1479, 2016.

Bubeck, S., Munos, R., and Stoltz, G. Pure Exploration in Multi-Armed Bandits Problems. In International Conference on Algorithmic Learning Theory, pp. 23-37. Springer, 2009.

Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros, A. A. Large-Scale Study of Curiosity-Driven Learning. In International Conference on Learning Representations, 2019.

Busetto, A. G., Ong, C. S., and Buhmann, J. M. Optimized Expected Information Gain for Nonlinear Dynamical Systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 97-104. ACM, 2009.

Chaloner, K. and Verdinelli, I. Bayesian Experimental Design: A Review. Statistical Science, pp. 273-304, 1995.

Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. arXiv preprint arXiv:1805.12114, 2018.

Deisenroth, M. and Rasmussen, C. E. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proceedings of the 28th International Conference on Machine Learning, pp. 465-472, 2011.

Efron, B. Bayesian Inference and the Parametric Bootstrap. Annals of Applied Statistics, 6(4):1971-1997, 2012. doi: 10.1214/12-AOAS571.

Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is All You Need: Learning Skills without a Reward Function. arXiv preprint arXiv:1802.06070, 2018.

Fedorov, V. Theory of Optimal Experiments Designs. Academic Press, 1972.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. International Conference on Machine Learning (ICML), 2018.

Hester, T. and Stone, P. Intrinsically Motivated Model Learning for Developing Curious Robots. In Development and Learning and Epigenetic Robotics (ICDL), 2012 IEEE International Conference on, pp. 1-6. IEEE, 2012.

Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. VIME: Variational Information Maximizing Exploration. In Advances in Neural Information Processing Systems, pp. 1109-1117, 2016.

Itti, L. and Baldi, P. Bayesian Surprise Attracts Human Attention. Vision Research, 49(10):1295-1306, 2009.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. Model-Ensemble Trust-Region Policy Optimization. arXiv preprint arXiv:1802.10592, 2018.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. In Advances in Neural Information Processing Systems, pp. 6402-6413, 2017.

Lindley, D. V. On a Measure of the Information Provided by an Experiment. The Annals of Mathematical Statistics, pp. 986-1005, 1956.

McCallum, A. K. and Nigam, K. Employing EM and Pool-Based Active Learning for Text Classification.
In Proceedings of the International Conference on Machine Learning (ICML), pp. 359-367. Citeseer, 1998.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-Level Control Through Deep Reinforcement Learning. Nature, 518(7540):529, 2015.

Mohamed, S. and Rezende, D. J. Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning. In Advances in Neural Information Processing Systems, pp. 2125-2133, 2015.

Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep Exploration via Bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026-4034, 2016.

Oudeyer, P.-Y. Computational Theories of Curiosity-Driven Learning. arXiv preprint arXiv:1802.10546, 2018.

Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-Driven Exploration by Self-Supervised Prediction. In Proceedings of the 34th International Conference on Machine Learning, 2017.

Pathak, D., Gandhi, D., and Gupta, A. Self-Supervised Exploration via Disagreement. In Proceedings of the 36th International Conference on Machine Learning, 2019.

Rényi, A. On Measures of Entropy and Information. Technical report, Hungarian Academy of Sciences, Budapest, Hungary, 1961.

Schmidhuber, J. Making the World Differentiable: On Using Fully Recurrent Self-Supervised Neural Networks for Dynamic Reinforcement Learning and Planning in Non-Stationary Environments. Technical Report FKI-126-90 (revised), Institut für Informatik, Technische Universität München, November 1990.

Schmidhuber, J. A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Controllers. In Meyer, J. A. and Wilson, S. W. (eds.), Proceedings of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pp. 222-227. MIT Press/Bradford Books, 1991a.

Schmidhuber, J. Curious Model-Building Control Systems. In Proceedings of the International Joint Conference on Neural Networks, Singapore, volume 2, pp. 1458-1463. IEEE Press, 1991b.

Schmidhuber, J. What's Interesting? Technical Report IDSIA-35-97, IDSIA, 1997.

Schmidhuber, J. Exploring the Predictable. In Ghosh, A. and Tsutsui, S. (eds.), Advances in Evolutionary Computing, pp. 579-612. Springer, 2002.

Schmidhuber, J. Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes. In Pezzulo, G., Butz, M. V., Sigaud, O., and Baldassarre, G. (eds.), Anticipatory Behavior in Adaptive Learning Systems: From Psychological Theories to Artificial Cognitive Systems, volume 5499 of LNCS, pp. 48-76. Springer, 2009.

Shannon, C. E. A Mathematical Theory of Communication (Parts I and II). Bell System Technical Journal, XXVII:379-423, 1948.

Singh, S. P., Barto, A. G., and Chentanez, N. Intrinsically Motivated Reinforcement Learning. In Advances in Neural Information Processing Systems, pp. 1281-1288, 2005.

Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian Optimization of Machine Learning Algorithms. In Advances in Neural Information Processing Systems, pp. 2951-2959, 2012.

Still, S. and Precup, D. An Information-Theoretic Approach to Curiosity-Driven Reinforcement Learning. Theory in Biosciences, 131(3):139-148, 2012.

Storck, J., Hochreiter, S., and Schmidhuber, J. Reinforcement Driven Information Acquisition in Non-Deterministic Environments.
In Proceedings of the International Conference on Artificial Neural Networks, Paris, volume 2, pp. 159-164. Citeseer, 1995.

Strehl, A. L. and Littman, M. L. A Theoretical Analysis of Model-Based Interval Estimation. In Proceedings of the 22nd International Conference on Machine Learning, pp. 856-863. ACM, 2005.

Sun, Y., Gomez, F., and Schmidhuber, J. Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments. In International Conference on Artificial General Intelligence, pp. 41-51. Springer, 2011.

Sutton, R. S. Reinforcement Learning Architectures for Animats. In From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior, pp. 288-296, 1991.

Thrun, S. B. Efficient Exploration in Reinforcement Learning. Technical report, 1992.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A Physics Engine for Model-Based Control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026-5033. IEEE, 2012.

Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. Closed-Form Jensen-Rényi Divergence for Mixture of Gaussians and Applications to Group-Wise Shape Registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 648-655. Springer, 2009.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, 1989.

Wiering, M. A. Explorations in Efficient Reinforcement Learning. PhD thesis, University of Amsterdam, 1999.