# optimistic_active_exploration_of_dynamical_systems__ca1d4382.pdf

Optimistic Active Exploration of Dynamical Systems

Bhavya Sukhija1 Lenart Treven1 Cansu Sancaktar2 Sebastian Blaes2

Stelian Coros1 Andreas Krause1

ETH Zürich 1 MPI for Intelligent Systems2 {sukhijab,trevenl,scoros,krausea}@ethz.ch {cansu.sancaktar,sebastian.blae}@tuebingen.mpg.de

Reinforcement learning algorithms commonly seek to optimize policies for solving one particular task. How should we explore an unknown dynamical system such that the estimated model globally approximates the dynamics and allows us to solve multiple downstream tasks in a zero-shot manner? In this paper, we address this challenge, by developing an algorithm OPAX for active exploration. OPAX uses well-calibrated probabilistic models to quantify the epistemic uncertainty about the unknown dynamics. It optimistically w.r.t. to plausible dynamics maximizes the information gain between the unknown dynamics and state observations. We show how the resulting optimization problem can be reduced to an optimal control problem that can be solved at each episode using standard approaches. We analyze our algorithm for general models, and, in the case of Gaussian process dynamics, we give a first-of-its-kind sample complexity bound and show that the epistemic uncertainty converges to zero. In our experiments, we compare OPAX with other heuristic active exploration approaches on several environments. Our experiments show that OPAX is not only theoretically sound but also performs well for zero-shot planning on novel downstream tasks.

1 Introduction

Most reinforcement learning (RL) algorithms are designed to maximize cumulative rewards for a single task at hand. Particularly, model-based RL algorithms, such as (Chua et al., 2018; Kakade et al., 2020; Curi et al., 2020), excel in efficiently exploring the dynamical system as they direct the exploration in regions with high rewards. However, due to the directional bias, their underlying learned dynamics model fails to generalize in other areas of the state-action space. While this is sufficient if only one control task is considered, it does not scale to the setting where the system is used to perform several tasks, i.e., under the same dynamics optimized for different reward functions. As a result, when presented with a new reward function, they often need to relearn a policy from scratch, requiring many interactions with the system, or employ multi-task (Zhang and Yang, 2021) or transfer learning (Weiss et al., 2016) methods. Traditional control approaches such as trajectory optimization (Biagiotti and Melchiorri, 2008) and model-predictive control (García et al., 1989) assume knowledge of the system s dynamics. They leverage the dynamics model to solve an optimal control problem for each task. Moreover, in the presence of an accurate model, important system properties such as stability and sensitivity can also be studied. Hence, knowing an accurate dynamics model bears many practical benefits. However, in many real-world settings, obtaining a model using just physics first principles is very challenging. A promising approach is to leverage data for learning the dynamics, i.e., system identification or active learning. To this end, the key question we investigate in this work is: how should we interact with the system to learn its dynamics efficiently?

While active learning for regression and classification tasks is well-studied, active learning in RL is much less understood. In particular, active learning methods that yield strong theoretical and practical results, generally query data points based on information-theoretic criteria (Krause et al., 2008; Settles, 2009; Balcan et al., 2010; Hanneke et al., 2014; Chen et al., 2015). In the context of

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

dynamical systems, this requires querying arbitrary transitions (Berkenkamp et al., 2017; Mehta et al., 2021). However, in most cases, querying a dynamical system at any state-action pair is unrealistic. Rather, we can only execute policies on the real system and observe the resulting trajectories. Accordingly, an active learning algorithm for RL needs to suggest policies that are informative for learning the dynamics. This is challenging since it requires planning with unknown dynamics.

Contributions In this paper, we introduce a new algorithm, Optimistic Active e Xploration (OPAX), designed to actively learn nonlinear dynamics within continuous state-action spaces. During each episode, OPAX plans an exploration policy to gather the most information possible about the system. It learns a statistical dynamics model that can quantify its epistemic uncertainty and utilizes this uncertainty for planning. The planned trajectory targets state-action pairs where the model s epistemic uncertainty is high, which naturally encourages exploration. In light of unknown dynamics, OPAX uses an optimistic planner that picks policies that optimistically yield maximal information. We show that this optimism paradigm plays a crucial role in studying the theoretical properties of OPAX. Moreover, we provide a general convergence analysis for OPAX and prove convergence to the true dynamics for Gaussian process (GP) dynamics models. Theoretical guarantees for active learning in RL exist for a limited class of systems (Simchowitz et al., 2018; Wagenmaker and Jamieson, 2020; Mania et al., 2020), but lack for a more general and practical class of dynamics (Chakraborty et al., 2023; Wagenmaker et al., 2023). We are, to the best of our knowledge, the first to give convergence guarantees for a rich class of nonlinear dynamical systems.

We evaluate OPAX on several simulated robotic tasks with state dimensions ranging from two to 58. The empirical results provide validation for our theoretical conclusions, showing that OPAX consistently delivers strong performance across all tested environments. Finally, we provide an efficient implementation1 of OPAX in JAX (Bradbury et al., 2018).

2 Problem Setting

We study an unknown discrete-time dynamical system f , with state x X Rdx and control inputs u U Rdu. xk+1 = f (xk, uk) + wk. (1)

Here, wk represents the stochasticity of the system for which we assume wk i.i.d. N 0, σ2I (Assumption 3). Most common approaches in control, such as trajectory optimization and model-predictive control (MPC), assume that the dynamics model f is known and leverages the model to control the system state. Given a cost function c : X U R, such approaches formulate and solve an optimal control problem to obtain a sequence of control inputs that drive the system s state

arg min u0:T 1 Ew0:T 1

t=0 c(xt, ut)

xt+1 = f (xt, ut) + wt 0 t T.

Moreover, if the dynamics are known, many important characteristics of the system such as stability, and robustness (Khalil, 2015) can be studied. However, in many real-world scenarios, an accurate dynamics model f is not available. Accordingly, in this work, we consider the problem of actively learning the dynamics model from data a.k.a. system identification (Åström and Eykhoff, 1971). Specifically, we are interested in devising a cost-agnostic algorithm that focuses solely on learning the dynamics model. Once a good model is learned, it can be used for solving different downstream tasks by varying the cost function in Equation (2).

We study an episodic setting, with episodes n = 1, . . . , N. At the beginning of the episode n, we deploy an exploratory policy πn, chosen from a policy space Π for a horizon of T on the system. Next, we obtain trajectory τn = (xn,0, . . . , xn,T ), which we save to a dataset of transitions Dn = {(zn,i = (xn,i, πn(xn,i)), yn,i = xn,i+1)0 i<T }. We use the collected data to learn an estimate µn of f . To this end, the goal of this work is to propose an algorithm Alg, that at each episode n leverages the data acquired thus far, i.e., D1:n 1 to determine a policy πn Π for the next step of data collection, that is, Alg(D1:n 1, n) πn. The proposed algorithm should be consistent, i.e., µn(z) f (z) for n for all z R, where R is the reachability set defined as

R = {z Z | (π Π, t T), s.t., p(zt = z|π, f ) > 0},

1https://github.com/lasgroup/opax

and efficient w.r.t. rate of convergence of µn to f .

To devise such an algorithm, we take inspiration from Bayesian experiment design (Chaloner and Verdinelli, 1995). In the Bayesian setting, given a prior over f , a natural objective for active exploration is the mutual information (Lindley, 1956) between f and observations y Dn.

Definition 1 (Mutual Information, Cover and Thomas (2006)). The mutual information between f and its noisy measurements y Dn for points in Dn, where y Dn is the concatenation of (y Dn,i)i<T is defined as,

F(Dn) := I (f ; y Dn) = H (y Dn) H (y Dn | f ) , (3)

where H is the Shannon differential entropy.

The mutual information quantifies the reduction in entropy of f conditioned on the observations. Hence, maximizing the mutual information w.r.t. the dataset Dn leads to the maximal entropy reduction of our prior. Accordingly, a natural objective for active exploration in RL can be the mutual information between f and the collected transitions over a budget of N episodes, i.e., I (f ; y D1:N ). This requires maximizing the mutual information over a sequence of policies, which is a challenging planning problem even in settings where the dynamics are known (Mutny et al., 2023). A common approach is to greedily pick a policy that maximizes the information gain conditioned on the previous observations at each episode:

max π Π Eτ π [I (f τ π; yτ π | D1:n 1)] . (4)

Here f τ π = (f (zn,0), . . . , f (zn,T 1)), yτ π = (yn,0, . . . , yn,T 1), τ π is the trajectory under the policy π, and the expectation is taken w.r.t. the process noise w.

Interpretation in frequentist setting While information gain is Bayesian in nature (requires a prior over f ), it also has a frequentist interpretation. In particular, later in Section 3 we relate it to the epistemic uncertainty of the learned model. Accordingly, while this notion of information gain stems from Bayesian literature, we can use it to motivate our objective in both Bayesian and frequentist settings.

2.1 Assumptions In this work, we learn a probabilistic model of the function f from data. Moreover, at each episode n, we learn the mean estimator µn(x, u) and the epistemic uncertainty σn(x, u), which quantifies our uncertainty on the mean prediction. To this end, we use Bayesian models such as Gaussian processes (GPs, Rasmussen and Williams, 2005) or Bayesian neural networks (BNNs, Wang and Yeung, 2020). More generally, we assume our model is well-calibrated:

Definition 2 (All-time calibrated statistical model of f , Rothfuss et al. (2023)). Let, z = (x, u) and Z := X U. An all-time calibrated statistical model of the function f is a sequence (µn, σn, βn(δ))n 0, such that

Pr ( z Z, l {1, . . . , dx} , n N : |µn,l(z) fl(z)| βn(δ)σn,l(z)) 1 δ

Here µn,l and σn,l are the l-th element in the vector valued functions µn and σn respectively. The scalar function, βn(δ) R 0 quantifies the width of the 1 δ confidence intervals. We assume w.l.o.g. that βn monotonically increases with n, and that σn,l(z) σmax for all z Z, n 0, and l {1, . . . , dx}.

Assumption 1 (Well calibration assumption). Our learned model is an all-time-calibrated statistical model of f , i.e., there exists a sequence of (βn(δ))n 0 such that our model satisfies the wellcalibration condition, c.f., Definition 2.

This is a natural assumption on our modeling. It states that we can make a mean prediction and also quantify how far it is off from the true one with high probability. A GP model satisfies this requirement for a very rich class of functions, c.f., Lemma 3. For BNNs, calibration methods (Kuleshov et al., 2018) are often used and perform very well in practice. Next, we make a simple continuity assumption on our function f .

Assumption 2 (Lipschitz Continuity). The dynamics model f and our epistemic uncertainty prediction σn are Lf and Lσ Lipschitz continuous, respectively. Moreover, we define Π to be the policy class of Lπ Lipschitz continuous functions.

The Lipschitz continuity assumption on f is quite common in control theory (Khalil, 2015) and learning literature (Curi et al., 2020; Pasztor et al., 2021; Sussex et al., 2023). Furthermore, the Lipschitz continuity of σn also holds for GPs with common kernels such as the linear or radial basis function (RBF) kernel (Rothfuss et al., 2023).

Finally, we reiterate the assumption of the system s stochasticity.

Assumption 3 (Process noise distribution). The process noise is i.i.d. Gaussian with variance σ2, i.e., wk i.i.d N(0, σ2I).

We focus on the setting where w is homoscedastic for simplicity. However, our framework can also be applied to the more general heteroscedastic and sub-Gaussian case (c.f., Theorem 2).

3 Optimistic Active Exploration

In this section, we propose our optimistic active exploration (OPAX) algorithm. The algorithm consists of two main contributions: (i) First we reformulate the objective in Equation (4) to a simple optimal control problem, which suggests policies that visit states with high epistemic uncertainty. (ii) We leverage the optimistic planner introduced by Curi et al. (2020) to efficiently plan a policy under unknown dynamics. Moreover, we show that the optimistic planner is crucial in giving theoretical guarantees for the algorithm.

3.1 Optimal Exploration Objective The objective in Equation (4) is still difficult and expensive to solve in general. However, since in this work, we consider Gaussian noise, c.f., Assumption 3, we can simplify this further.

Lemma 1 (Information gain is upper bounded by sum of epistemic uncertainties). Let y = f (z)+w, with w N(0, σ2I) and let σn 1 be the epistemic uncertainty after episode n 1. Then the following holds for all n 1 and dataset D1:n 1,

I (f τ π; yτ π | D1:n 1) 1

1 + σ2 n 1,j(zt)

We prove Lemma 1 in Appendix A. The information gain is non-negative (Cover and Thomas, 2006). Therefore, if the right-hand side of Equation (5) goes to zero, the left-hand side goes to zero as well. Lemma 1 relates the information gain to the model epistemic uncertainty. Therefore, it gives a tractable objective that also has a frequentist interpretation - collect points with the highest epistemic uncertainty. We can use it to plan a trajectory at each episode n, by solving the following optimal control problem:

π n = argmax π Π Jn(π) = argmax π Π Eτ π

1 + σ2 n 1,j(xt, π(xt))

xt+1 = f (xt, π(xt)) + wt. The problem in Equation (6) is closely related to previous literature in active exploration for RL. For instance, some works consider different geometries such as the sum of epistemic uncertainties (Pathak et al. (2019); Sekar et al. (2020), c.f., appendix C for more detail).

3.2 Optimistic Planner The optimal control problem in Equation (6) requires knowledge of the dynamics f for planning, however, f is unknown. A common choice is to use the mean estimator µn 1 in Equation (6) instead of f for planning (Buisson-Fenet et al., 2020). However, in general, using the mean estimator is susceptible to model biases (Chua et al., 2018) and is provably optimal only in the case of linear systems (Simchowitz and Foster, 2020). To this end, we propose using an optimistic planner, as suggested in Curi et al. (2020), instead. Accordingly, given the mean estimator µn 1 and the epistemic uncertainty σn 1, we solve the following optimal control problem

πn, ηn = argmax π Π,η Ξ Jn(π, η) = argmax π Π,η Ξ Eτ π,η

1 + σ2 n 1,j(ˆxt, π(ˆxt))

ˆxt+1 = µn 1(ˆxt, π(ˆxt)) + βn 1(δ)σn 1(ˆxt, π(ˆxt))η(ˆxt) + wt,

OPAX: OPTIMISTIC ACTIVE EXPLORATION

Init: Aleatoric uncertainty σ, Probability δ, Statistical model (µ0, σ0, β0(δ)) for episode n = 1, . . . , N do

πn = argmax π Π max η Ξ E

1 + σ2 n 1,j(xt, π(xt))

Prepare policy

Dn ROLLOUT(πn) Collect measurements Update (µn, σn, βn(δ)) D1:n Update model

where Ξ is the space of policies η : X [ 1, 1]dx. Therefore, we use the policy η to hallucinate (pick) transitions that give us the most information. Overall, the resulting formulation corresponds to a simple optimal control problem with a larger action space, i.e., we increase the action space by another dx dimension. A natural consequence of Assumption 1 is that Jn(π n) Jn(πn, ηn) with high probability (c.f., Corollary 1 in Appendix A). That is by solving Equation (7), we get an optimistic estimate on Equation (6). Intuitively, the policy πn that OPAX suggests, behaves optimistically with respect to the information gain at each episode.

4 Theoretical Results

We theoretically analyze the convergence properties of OPAX. We first study the regret of planning under unknown dynamics. Specifically, since we cannot evaluate the optimal exploration policy from eq. (6) and use the optimistic one, i.e., eq. (7) instead, we incur a regret. We show that due to the optimism in the face of uncertainty paradigm, we can give sample complexity bounds for the Bayesian and frequentist settings. All the proofs are presented in Appendix A.

Lemma 2 (Regret of optimistic planning under unknown dynamics). Let Assumption 1 hold. Furthermore, define Jn,k(πn, ηn, x) as

Jn,k(πn, ηn, x) = Eτ πn,ηn

1 + σ2 n 1,j(ˆxt, πn(ˆxt))

s.t. ˆxt+1 = µn 1(ˆxt, πn(ˆxt)) + βn 1(δ)σn 1(ˆxt, πn(ˆxt))ηn(ˆxt) + wt and ˆx0 = x.

Then, for all n 1, with probability at least 1 δ,

Jn(π n) Jn(πn)

t=0 Eτ πn Jn,t+1(πn, ηn, x t+1) Jn,t+1(πn, ηn, xt+1) ,

with xt+1 = f (xt, πn(xt)) + wt,

and x t+1 = µn 1(xt, πn(xt)) + βn 1(δ)σn 1(xt, πn(xt))ηn(xt) + wt.

Lemma 2 gives a bound on the regret of planning optimistically under unknown dynamics. The regret is proportional to the difference in the expected returns for xt and x t. Note, xt x t βn(δ)σn 1(xt 1, πn(xt 1)). Hence, when we have low uncertainty in our predictions, planning optimistically suffers smaller regret. Next, we leverage Lemma 2 to give a sample complexity bound for the Bayesian and frequentist setting.

Bayesian Setting We start by introducing a measure of model complexity as defined by Curi et al. (2020).

MCN(f ) := max D1,...,DN Z X

z Dn σn 1(z) 2 2 . (8)

This complexity measure captures the difficulty of learning f given N trajectories. Mainly, the more complicated f , the larger the epistemic uncertainties σn, and in turn, the larger corresponding MCN(f ). Moreover, if the model complexity measure is sublinear in N, i.e. MCN(f )/N 0 for N , then the epistemic uncertainties also converge to zero in the limit, which implies

convergence to the true function f . We present our main theoretical result, in terms of the model complexity measure.

Theorem 1. Let Assumption 1 and 3 hold. Then, for all N 1, with probability at least 1 δ,

max π Π Eτ π [I (f τ π; yτ π | D1:N 1)] O

Theorem 1 relates the maximum expected information gain at iteration N to the model complexity of our problem. For deterministic systems, the expectation w.r.t. τ π is redundant. The bound in Equation (9) depends on the Lipschitz constants, planning horizon, and dimensionality of the state space (captured in βN and MCN(f )). If the right-hand side is monotonically decreasing with N, Theorem 1 guarantees that the information gain at episode N is also shrinking with N, and the algorithm is converging. Empirically, Pathak et al. (2019) show that the epistemic uncertainties go to zero as more data is acquired. In general, deriving a worst-case bound on the model complexity is a challenging and active open research problem. However, in the case of GPs, convergence results can be shown for a very rich class of functions. We show this in the following for the frequentist setting.

Frequentist Setting with Gaussian Process Models We extend our analysis to the frequentist kernelized setting, where f resides in a Reproducing Kernel Hilbert Space (RKHS) of vector-valued functions.

Assumption 4. We assume that the functions f j , j {1, . . . , dx} lie in a RKHS with kernel k and have a bounded norm B, that is f Hdx k,B, with Hdx k,B = {f | fj k B, j = 1, . . . , dx}.

In this setting, we model the posterior mean and epistemic uncertainty of the vector-valued function f with µn(z) = [µn,j(z)]j dx, and σn(z) = [σn,j(z)]j dx, where,

µn,j(z) = k n (z)(Kn + σ2I) 1yj 1:n,

σ2 n,j(z) = k(x, x) k n (z)(Kn + σ2I) 1kn(x), (10)

Here, yj 1:n corresponds to the noisy measurements of f j , i.e., the observed next state from the transitions dataset D1:n, kn = [k(z, zi)]i n T , zi D1:n, and Kn = [k(zi, zl)]i,l n T , zi, zl D1:n is the data kernel matrix. It is known that if f satisfies Assumption 4, then Equation (10) yields well-calibrated confidence intervals, i.e., that Assumption 1 is satisfied.

Lemma 3 (Well calibrated confidence intervals for RKHS, Rothfuss et al. (2023)). Let f Hdx k,B. Suppose µn and σn are the posterior mean and variance of a GP with kernel k, c.f., Equation (10). There exists βn(δ), for which the tuple (µn, σn, βn(δ)) satisfies Assumption 1 w.r.t. function f .

Theorem 2 presents our convergence guarantee for the kernelized case to the T-step reachability set R for the policy class π Π. In particular, R is defined as

R = {z Z | (π Π, t T), s.t., p(zt = z|π, f ) > 0}

There are two key differences from Theorem 1; (i) we can derive an upper bound on the epistemic uncertainties σn, and (ii) we can bound the model complexity MCN(f ), with the maximum information gain of kernel k introduced by Srinivas et al. (2012), defined as

γN(k) = max D1,...,DN;|Dn| T 1 2 log det(I + σ 2KN).

Theorem 2. Let Assumption 3 and 4 hold, Then, for all N 1, with probability at least 1 δ,

max π Π Eτ π

1 2σ2 N,j(z)

If we relax noise Assumption 3 to σ-sub Gaussian. Then, if Assumption 2 holds, we have for all N 1, with probability at least 1 δ,

max π Π Eτ π

1 2σ2 N,j(z)

βT NT 3/2 r

Moreover, if γN(k) = O (polylog(N)), then for all z R, and 1 j dx,

σN,j(z) a.s. 0 for N . (13)

We only state Theorem 2 for the expected epistemic uncertainty along the trajectory at iteration N. For deterministic systems, the expectation is redundant and for stochastic systems, we can leverage concentration inequalities to give a bound without the expectation (see Appendix A for more detail).

For the Gaussian noise case, we obtain a tighter bound by leveraging the change of measure inequality from Kakade et al. (2020, Lemma C.2.) (c.f., Lemma 6 in Appendix A for more detail). In the more general case of sub-Gaussian noise, we cannot use the same analysis. To this end, we use the Lipschitz continuity assumptions (Assumption 2) similar to Curi et al. (2020). This results in comparing the deviation between two trajectories under the same policy and dynamics but different initial states (see Lemma 2). For many systems (even linear) this can grow exponentially in the horizon T. Accordingly, we obtain a βT N term in our bound (Equation (12)). Nonetheless, for cases where the RKHS is of a kernel with maximum information gain γN(k) = O (polylog(N)), we can give sample complexity bounds and an almost sure convergence result in the reachable set R (Equation (13)). Kernels such as the RBF kernel or the linear kernel (kernel with a finite-dimensional feature map ϕ(x)) have maximum information gain which grows polylogarithmically with n (Vakili et al. (2021)). Therefore, our convergence guarantees hold for a very rich class of functions. The exponential dependence of our bound on T imposes the restriction on the kernel class. For the case of Gaussian noise, we can include a richer class of kernels, such as Matèrn.

In addition to the convergence results above, we also give guarantees on the zero-shot performance of OPAX in Appendix A.5.

5 Experiments

We evaluate OPAX on the Pendulum-v1 and Mountain Car environment from the Open AI gym benchmark suite (Brockman et al., 2016), on the Reacher, Swimmer, and Cheetah from the deep mind control suite (Tassa et al., 2018), and a high-dimensional simulated robotic manipulation task introduced by Li et al. (2020). See Appendix B for more details on the experimental setup.

Baselines We implement four baselines for comparisons. To show the benefit of our intrinsic reward, we compare OPAX to (1) a random exploration policy (RANDOM) which randomly samples actions from the action space. As we discuss in Section 3 our choice of objective in Equation (6) is in essence similar to the one proposed by Pathak et al. (2019) and Sekar et al. (2020). Therefore, in our experiments, we compare the optimistic planner with other planning approaches. Moreover, most work on active exploration either uses the mean planner or does not specify the planner (c.f., Section 6). We use the most common planners: (2) mean (MEAN-AE), and (3) trajectory sampling (TS-1) scheme proposed in Chua et al. (2018) (PETS-AE) as our baselines. The mean planner simply uses the mean estimate µn of the well-calibrated model. This is also used in Buisson-Fenet et al. (2020). Finally, we compare OPAX to (4) H-UCRL (Curi et al., 2020), a single-task model-based RL algorithm. We investigate the following three aspects: (i) how fast does active exploration reduce model s epistemic uncertainty σn with increasing n, (ii) can we solve downstream tasks with OPAX, and (iii) does OPAX scale to high-dimensional and challenging object manipulation tasks? For our experiments, we use GPs and probabilistic ensembles (PE, Lakshminarayanan et al. (2017)) for modeling the dynamics. For the planning, we either the soft actor-critic (SAC, Haarnoja et al. (2018)) policy optimizer, which takes simulated trajectories from our learned model to train a policy, or MPC with the i CEM optimizer (Pinneri et al., 2021).

How fast does active exploration reduce the epistemic uncertainty? For this experiment, we consider the Pendulum-v1 environment. We sample transitions at random from the pendulum s reachable state-action space and evaluate our model s epistemic uncertainty for varying episodes and baselines. We model the dynamics with both GPs and PE. We depict the result in Figure 1. We conclude that the RANDOM agent is slower in reducing the uncertainty compared to other active exploration methods for both GP and PE models. In particular, from the experiment, we empirically validate Theorem 2 for the GP case and also conclude that empirically even when using PE models, we find convergence of epistemic uncertainty. Moreover, we notice for the PE case that OPAX reaches smaller uncertainties slightly faster than MEAN-AE and PETS-AE. We believe this is due to the additional exploration induced by the optimistic planner.

0 5 10 15 Episodes

Pendulum-v1 GP - uncertainty

0 10 20 Episodes

Pendulum-v1 - uncertainty

PETS-AE Mean-AE Op Ax Random

Figure 1: Reduction in maximum epistemic uncertainty in reachable state-action space for the Pendulum-v1 environment over 10 different random seeds. We evaluate OPAX with both GPs and PE and plot the mean performance with two standard error confidence intervals. For both, active exploration reduces epistemic uncertainty faster compared to random exploration. All active exploration baselines perform well for the GP case, whereas for the PE case OPAX gives slightly lower uncertainties.

Can the model learnt through OPAX solve downstream tasks? We use OPAX and other active exploration baselines to actively learn a dynamics model and then evaluate the learned model on downstream tasks. We consider several tasks, (i) Pendulum-v1 swing up, (ii) Pendulum-v1 keep down (keep the pendulum at the stable equilibria), (iii) Mountain Car, (iv) Reacher - go to target, (v) Swimmer - go to target, (vi) Swimmer - go away from target (quickly go away from the target position), (vii) Cheetah - run forward, (viii) Cheetah - run backward. For all tasks, we consider PEs, except for (i) where we also use GPs. Furthermore, for the Mountain Car and Reacher, we give a reward once the goal is reached. Since this requires long-term planning, we use a SAC policy for these tasks. We use MPC with i CEM for the remaining tasks. We also train H-UCRL on tasks (i) with GPs, and (ii), (iii), (iv), (v), (vii) with PEs. We report the best performance across all episodes.

To make a fair comparison, we use the following evaluation procedure; first, we perform active exploration for each episode on the environment, and then after every few episodes we use the mean estimate µn to evaluate our learned model on the downstream tasks.

Figure 2 shows that all active exploration variants perform considerably better than the RANDOM agent. In particular, for the Mountain Car, the RANDOM agent is not able to solve the task. Moreover, PETS-AE performs slightly worse than the other exploration baselines in this environment. In general, we notice that OPAX always performs well and is able to achieve H-UCRL s performance on all the tasks for which H-UCRL is trained. However, on tasks that are new/unseen for H-UCRL, active exploration algorithms outperform H-UCRL. From this experiment, we conclude two things (1) apart from providing theoretical guarantees, the model learned through OPAX also performs well in downstream tasks, and (2) active exploration agents generalize well to downstream tasks, whereas H-UCRL performs considerably worse on new/unseen tasks. We believe this is because, unlike active exploration agents, task-specific model-based RL agents only explore the regions of the state-action space that are relevant to the task at hand.

Figure 3: Fetch Pick & Place Construction environment.

Does OPAX scale to high-dimensional and challenging object manipulation tasks? To answer this question, we consider the Fetch Pick & Place Construction environment (Li et al., 2020). We again use the active exploration agents to learn a model and then evaluate the success rate of the learned model in three challenging downstream tasks: (i) Pick & Place, (ii) Throw, and (iii) Flip (see Figure 4). The environment contains a 7-Do F robot arm and four 6-Do F blocks that can be manipulated. In total, the state space is 58-dimensional. The 4-dimensional actions control the end-effector of the robot in Cartesian space as well as the opening/closing of the gripper. We compare OPAX to PETS-AE, MEAN-AE, a random policy as well as CEE-US (Sancaktar et al., 2022). CEE-US is a model-based active exploration algorithm, for which Sancaktar et al. (2022) reports state-of-the-art performance compared to several other active exploration methods. In all three tasks, OPAX is at least on par with the best-performing baselines, including CEE-US. We run OPAX and all baselines with the same architecture and hyperparameter settings. See Appendix B for more details.

PETS-AE Mean-AE Op Ax Random H-UCRL

0 10 20 0.0

104Mountain Car (1)

Pendulum-v1 - keep down (4)

103 Pendulum-v1 GP - swing up (2)

102 Swimmer - go to target (5)

0 50 100 Episodes

103 Reacher - go to target (3)

0 200 400 Episodes

102 Cheetah - run forward (6)

Downstream tasks where H-UCRL is trained

103 Pendulum-v1 - swing up (7)

0 50 100 150 3

102 Swimmer - go away from target (8)

0 200 400 Episodes

102 Cheetah - run backward (9)

New downstream tasks

Figure 2: We evaluate the downstream performance of our agents over 10 different random seeds and plot the mean performance with two standard error confidence intervals. For all the environments we use PE as models, except plot (1), for which we use a GP model (see plot (2) in the figure above). For tasks (1)-(6), we also train H-UCRL, a model-based RL algorithm. Tasks (7)-(9) are new/unseen for H-UCRL. From the Figure, we conclude that (i) compared to other active exploration baselines, OPAX constantly performs well and is on par with H-UCRL, and (ii) on the new/unseen tasks the active exploration baselines and OPAX outperform H-UCRL by a large margin.

6 Related Work

System identification is a broadly studied topic (Åström and Eykhoff, 1971; Schoukens and Ljung, 2019; Schön et al., 2011; Ziemann et al., 2022; Ziemann and Tu, 2022). However, system identification from the perspective of experiment design for nonlinear systems is much less understood (Chiuso and Pillonetto, 2019). Most methods formulate the identification task through the maximization of intrinsic rewards. Common choices of intrinsic rewards are (i) model prediction error or Curiosity (Schmidhuber, 1991; Pathak et al., 2017), (ii) novelty of transitions (Stadie et al., 2015), and (iii) diversity of skills (Eysenbach et al., 2018).

A popular choice for intrinsic rewards is mutual information or entropy (Jain et al., 2018; Buisson Fenet et al., 2020; Shyam et al., 2019; Pathak et al., 2019; Sekar et al., 2020). Jain et al. (2018) propose an approach to maximize the information gain greedily wrt the immediate next transition, i.e., one-step greedy, whereas Buisson-Fenet et al. (2020) consider planning full trajectories. Shyam et al. (2019); Pathak et al. (2019); Sekar et al. (2020) and Sancaktar et al. (2022) consider general Bayesian models, such as BNNs, to represent a probabilistic distribution for the learned model. Shyam et al. (2019) propose using the information gain of the model with respect to observed transition as the intrinsic reward. To this end, they learn an ensemble of Gaussian neural networks and represent the distribution over models with a Gaussian mixture model (GMM). A similar approach is also proposed in Pathak et al. (2019); Sekar et al. (2020); Sancaktar et al. (2022). The main difference between Shyam et al. (2019) and Pathak et al. (2019) lies in how they represent mutual information. Moreover, Pathak et al. (2019) use the model s epistemic uncertainty, that is the

0 100 200 300 Training Iteration

100 Pick & Place 4 Objects

0 100 200 300 Training Iteration

40 Throw 4 Objects

0 100 200 300 Training Iteration

Flip 4 Objects

Random Mean-AE PETS-AE Op Ax CEE-US

Figure 4: Success rates for pick & place, throwing and flipping tasks with four objects in the Fetch Pick & Place Construction environment for OPAX and baselines. We evaluate task performance via planning zero-shot with models learned using different exploration strategies. We report performance on three independent seeds. OPAX is on par with the best-performing baselines in all tasks.

disagreement between the ensemble models as an intrinsic reward. Sekar et al. (2020) link the model disagreement (epistemic uncertainty) reward to maximizing mutual information and demonstrate state-of-the-art performance on several high-dimensional tasks. Similarly, Sancaktar et al. (2022), use the disagreement in predicted trajectories of an ensemble of neural networks to direct exploration. Since trajectories can diverge due to many factors beyond just the model epistemic uncertainty, e.g., aleatoric noise, this approach is restricted to deterministic systems and susceptible to systems with unstable equilibria. Our approach is the most similar to Pathak et al. (2019); Sekar et al. (2020) since we also propose the model epistemic uncertainty as the intrinsic reward for planning. However, we thoroughly and theoretically motivate this choice of reward from a Bayesian experiment design perspective. Furthermore, we induce additional exploration in OPAX through our optimistic planner and rigorously study the theoretical properties of the proposed methods. On the contrary, most of the prior work discussed above either uses the mean planner (MEAN-AE) or does not discuss the planner thoroughly or provide any theoretical results. In general, theoretical guarantees for active exploration algorithms are rather immature (Chakraborty et al., 2023; Wagenmaker et al., 2023) and mostly restrictive to a small class of systems (Simchowitz et al., 2018; Tarbouriech et al., 2020; Wagenmaker and Jamieson, 2020; Mania et al., 2020). To the best of our knowledge, we are the first to give convergence guarantees for a rich class of nonlinear systems.

While our work focuses on the active learning of dynamics, there are numerous works that study exploration in the context of reward-free RL (Jin et al., 2020; Kaufmann et al., 2021; Wagenmaker et al., 2022; Chen et al., 2022). However, most methods in this setting give guarantees for special classes of MDPs (Jin et al., 2020; Kaufmann et al., 2021; Wagenmaker et al., 2022; Qiu et al., 2021; Chen et al., 2022) and result in practical algorithms. On the contrary, we focus on solely learning the dynamics. While a good dynamics model may be used for zero-shot planning, it also exhibits more relevant knowledge about the system such as its stability or sensitivity to external effects. Furthermore, our proposed method is not only theoretically sound but also practical.

7 Conclusion

We present OPAX, a novel model-based RL algorithm for the active exploration of unknown dynamical systems. Taking inspiration from Bayesian experiment design, we provide a comprehensive explanation for using model epistemic uncertainty as an intrinsic reward for exploration. By leveraging the optimistic in the face of uncertainty paradigm, we put forth first-of-their-kind theoretical results on the convergence of active exploration agents in reinforcement learning. Specifically, we study convergence properties of general Bayesian models, such as BNNs. For the frequentist case of RKHS dynamics, we established sample complexity bounds and convergence guarantees for OPAX for a rich class of functions. We evaluate the efficacy of OPAX across various RL environments with state space dimensions from two to 58. The empirical results corroborate our theoretical findings, as OPAX displays systematic and effective exploration across all tested environments and exhibits strong performance in zero-shot planning for new downstream tasks.

Acknowledgments and Disclosure of Funding

We would like to thank Jonas Hübotter for the insightful discussions and his feedback on this work. Furthermore, we also thank Alex Hägele, Parnian Kassraie, Scott Sussex, and Dominik Baumann for their feedback.

This project has received funding from the Swiss National Science Foundation under NCCR Automation, grant agreement 51NF40 180545, and the Microsoft Swiss Joint Research Center.

Balcan, M.-F., Hanneke, S., and Vaughan, J. W. (2010). The true sample complexity of active learning. Machine learning, 80:111 139.

Berkenkamp, F., Turchetta, M., Schoellig, A., and Krause, A. (2017). Safe model-based reinforcement learning with stability guarantees. Neur IPS, 30.

Biagiotti, L. and Melchiorri, C. (2008). Trajectory Planning for Automatic Machines and Robots. Springer Publishing Company, Incorporated, 1st edition.

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., Vander Plas, J., Wanderman-Milne, S., and Zhang, Q. (2018). JAX: composable transformations of Python+Num Py programs.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). Openai gym. ar Xiv preprint ar Xiv:1606.01540.

Buisson-Fenet, M., Solowjow, F., and Trimpe, S. (2020). Actively learning gaussian process dynamics. In Bayen, A. M., Jadbabaie, A., Pappas, G., Parrilo, P. A., Recht, B., Tomlin, C., and Zeilinger, M., editors, L4DC.

Chakraborty, S., Bedi, A., Koppel, A., Wang, M., Huang, F., and Manocha, D. (2023). STEERING : Stein information directed exploration for model-based reinforcement learning. pages 3949 3978.

Chaloner, K. and Verdinelli, I. (1995). Bayesian experimental design: A review. Statistical science, pages 273 304.

Chen, J., Modi, A., Krishnamurthy, A., Jiang, N., and Agarwal, A. (2022). On the statistical efficiency of reward-free exploration in non-linear rl. Advances in Neural Information Processing Systems, 35:20960 20973.

Chen, Y., Hassani, S. H., Karbasi, A., and Krause, A. (2015). Sequential information maximization: When is greedy near-optimal? In COLT.

Chiuso, A. and Pillonetto, G. (2019). System identification: A machine learning perspective. Annual Review of Control, Robotics, and Autonomous Systems, 2:281 304.

Chowdhury, S. R. and Gopalan, A. (2017). On kernelized multi-armed bandits. In ICML, pages 844 853. PMLR.

Chua, K., Calandra, R., Mc Allister, R., and Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Neur IPS.

Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience.

Curi, S., Berkenkamp, F., and Krause, A. (2020). Efficient model-based reinforcement learning through optimistic policy search and planning. Neur IPS, 33:14156 14170.

Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. (2018). Diversity is all you need: Learning skills without a reward function. ar Xiv preprint ar Xiv:1802.06070.

García, C. E., Prett, D. M., and Morari, M. (1989). Model predictive control: Theory and practice - a survey. Automatica, pages 335 348.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, pages 1861 1870.

Hanneke, S. et al. (2014). Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131 309.

Jain, A., Nghiem, T., Morari, M., and Mangharam, R. (2018). Learning and control using gaussian processes. In ICCPS, pages 140 149.

Jin, C., Krishnamurthy, A., Simchowitz, M., and Yu, T. (2020). Reward-free exploration for reinforcement learning. In ICML, pages 4870 4879. PMLR.

Kakade, S., Krishnamurthy, A., Lowrey, K., Ohnishi, M., and Sun, W. (2020). Information theoretic regret bounds for online nonlinear control. Neur IPS, 33:15312 15325.

Kaufmann, E., Ménard, P., Domingues, O. D., Jonsson, A., Leurent, E., and Valko, M. (2021). Adaptive reward-free exploration. In ALT, pages 865 891. PMLR.

Khalil, H. K. (2015). Nonlinear control, volume 406. Pearson New York.

Krause, A., Singh, A., and Guestrin, C. (2008). Near-optimal sensor placements in gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9(2).

Kuleshov, V., Fenner, N., and Ermon, S. (2018). Accurate uncertainties for deep learning using calibrated regression. In ICML, pages 2796 2804. PMLR.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. Neur IPS, 30.

Li, R., Jabri, A., Darrell, T., and Agrawal, P. (2020). Towards practical multi-object manipulation using relational reinforcement learning. In ICRA, pages 4051 4058. IEEE.

Lindley, D. V. (1956). On a Measure of the Information Provided by an Experiment. The Annals of Mathematical Statistics, pages 986 1005.

Mania, H., Jordan, M. I., and Recht, B. (2020). Active learning for nonlinear system identification with guarantees. ar Xiv preprint ar Xiv:2006.10277.

Mehta, V., Paria, B., Schneider, J., Ermon, S., and Neiswanger, W. (2021). An experimental design perspective on model-based reinforcement learning. ar Xiv preprint ar Xiv:2112.05244.

Mutny, M., Janik, T., and Krause, A. (2023). Active exploration via experiment design in markov chains. In AISTATS.

Pasztor, B., Bogunovic, I., and Krause, A. (2021). Efficient model-based multi-agent mean-field reinforcement learning. ar Xiv preprint ar Xiv:2107.04050.

Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. (2017). Curiosity-driven exploration by self-supervised prediction. In ICML, pages 2778 2787. PMLR.

Pathak, D., Gandhi, D., and Gupta, A. (2019). Self-supervised exploration via disagreement. In ICML, pages 5062 5071. PMLR.

Pinneri, C., Sawant, S., Blaes, S., Achterhold, J., Stueckler, J., Rolinek, M., and Martius, G. (2021). Sample-efficient cross-entropy method for real-time planning. In CORL, Proceedings of Machine Learning Research, pages 1049 1065.

Qiu, S., Ye, J., Wang, Z., and Yang, Z. (2021). On reward-free rl with kernel and neural function approximations: Single-agent mdp and markov game. In ICML, pages 8737 8747. PMLR.

Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.

Åström, K. and Eykhoff, P. (1971). System identification - a survey. Automatica, 7(2):123 162.

Rothfuss, J., Sukhija, B., Birchler, T., Kassraie, P., and Krause, A. (2023). Hallucinated adversarial control for conservative offline policy evaluation. UAI.

Sancaktar, C., Blaes, S., and Martius, G. (2022). Curious exploration via structured world models yields zero-shot object manipulation. In Neur IPS 35 (Neur IPS 2022).

Schmidhuber, J. (1991). A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, pages 222 227.

Schön, T. B., Wills, A., and Ninness, B. (2011). System identification of nonlinear state-space models. Automatica, pages 39 49.

Schoukens, J. and Ljung, L. (2019). Nonlinear system identification: A user-oriented road map. IEEE Control Systems Magazine, 39(6):28 99.

Sekar, R., Rybkin, O., Daniilidis, K., Abbeel, P., Hafner, D., and Pathak, D. (2020). Planning to explore via self-supervised world models. In ICML, pages 8583 8592. PMLR.

Settles, B. (2009). Active learning literature survey.

Shyam, P., Ja skowski, W., and Gomez, F. (2019). Model-based active exploration. In ICML, pages 5779 5788. PMLR.

Simchowitz, M. and Foster, D. (2020). Naive exploration is optimal for online LQR. In ICML, Proceedings of Machine Learning Research, pages 8937 8948. PMLR.

Simchowitz, M., Mania, H., Tu, S., Jordan, M. I., and Recht, B. (2018). Learning without mixing: Towards a sharp analysis of linear system identification. In COLT, pages 439 473. PMLR.

Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. W. (2012). Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory.

Stadie, B. C., Levine, S., and Abbeel, P. (2015). Incentivizing exploration in reinforcement learning with deep predictive models. ar Xiv preprint ar Xiv:1507.00814.

Sussex, S., Makarova, A., and Krause, A. (2023). Model-based causal bayesian optimization. In ICLR.

Tarbouriech, J., Shekhar, S., Pirotta, M., Ghavamzadeh, M., and Lazaric, A. (2020). Active model estimation in markov decision processes. In UAI, pages 1019 1028. PMLR.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. (2018). Deepmind control suite. ar Xiv preprint ar Xiv:1801.00690.

Vakili, S., Khezeli, K., and Picheny, V. (2021). On information gain and regret bounds in gaussian process bandits. In AISTATS.

Wagenmaker, A. and Jamieson, K. (2020). Active learning for identification of linear dynamical systems. In COLT, pages 3487 3582. PMLR.

Wagenmaker, A., Shi, G., and Jamieson, K. (2023). Optimal exploration for model-based rl in nonlinear systems. ar Xiv preprint ar Xiv:2306.09210.

Wagenmaker, A. J., Chen, Y., Simchowitz, M., Du, S., and Jamieson, K. (2022). Reward-free rl is no harder than reward-aware rl in linear markov decision processes. In ICML, pages 22430 22456. PMLR.

Wang, H. and Yeung, D.-Y. (2020). A survey on bayesian deep learning. ACM computing surveys (csur), pages 1 37.

Weiss, K., Khoshgoftaar, T. M., and Wang, D. (2016). A survey of transfer learning. Journal of Big data, pages 1 40.

Zhang, Y. and Yang, Q. (2021). A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, pages 5586 5609.

Ziemann, I. and Tu, S. (2022). Learning with little mixing. Neur IPS, 35:4626 4637.

Ziemann, I. M., Sandberg, H., and Matni, N. (2022). Single trajectory nonparametric learning of nonlinear dynamics. In COLT, pages 3333 3364. PMLR.

Contents of Appendix

A Proofs for section 4 16 A.1 Proof of Lemma 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.2 Analyzing regret of optimistic planning . . . . . . . . . . . . . . . . . . . . . . . 18 A.3 Proof for general Bayesian models . . . . . . . . . . . . . . . . . . . . . . . . . . 20 A.4 Proof of GP results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 A.5 Zero-shot guarantees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

B Experiment Details 27 B.1 Environment Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 B.2 OPAX in the High-dimensional Fetch Pick & Place Environment . . . . . . . . . . 29

C Study of exploration intrinsic rewards 32

A Proofs for section 4

We first prove some key properties of our active exploration objective in Equation (6). Then, we prove Theorem 1 which holds for general Bayesian models, and finally we prove Theorem 2, which guarantees convergence for the frequentist setting where the dynamics are modeled using a GP.

Lemma 4 (Properties of OPAX s objective). Let Assumption 1 and 2 hold, then the following is true for all n 0,

1 + σ2 n 1,j(xt, π(xt))

1 2 sup π Π,η Ξ E

1 + σ2 n 1,j(xt, π(xt))

2T 2d2 x log2 1 + σ2 max σ2

1 + σ2 n,j(z)

1 + σ2 n,j(z )

! dxσmax Lσ

σ2 z z . (3)

where σmax = supz Z;i 0;1 j dx σi,j(z).

Proof. The positivity of the reward follows from the positive definiteness of the epistemic uncertainty σn 1,j. For (2), the following holds

1 + σ2 n 1,j(xt, π(xt))

j=1 Tdx log2

1 + σ2 n 1,j(xt, π(xt))

j=1 Tdx log2 1 + σ2 max σ2

T 2d2 x log2 1 + σ2 max σ2

From hereon, let Jmax = 1/2T 2d2 x log2 1 + σ 2σ2 max .

Finally, we show that this reward is Lipschitz continuous.

|σ2 n,j(z) σ2 n,j(z )| = |σn,j(z)σn,j(z) σn,j(z)σn,j(z ) + σn,j(z)σn,j(z ) σn,j(z )σn,j(z )|

Lσ z z σn,j(z) + Lσ z z σn,j(z )

2σmax Lσ z z .

1 + σ2 n,j(z)

1 + σ2 n,j(z )

σ2 n,j(z) σ2 n,j(z ) σ2

1 + σ2 n,j(z )

1 + | σ2 n,j(z) σ2 n,j(z ) | σ2

σ2 n,j(z) σ2 n,j(z ) (*)

Where (*) is true because for all x 0, log(1 + x) x.

Corollary 1 (OPAX gives an optimistic estimate on Equation (6)). Let Assumption 1 hold and π n denote the solution to Equation (6) and Jn(π n) the resulting objective. Similarly, let πn and ηn be the solution to Equation (7) and Jn(πn, ηn) the corresponding value of the objective. Then with probability at least 1 δ we have for every episode n {1, . . . , N}: Jn(π n) Jn(πn, ηn).

Proof. Follows directly from Assumption 1.

A.1 Proof of Lemma 2

Lemma 5 (Difference in Policy performance). Let Jr,k(π, xk) = Eτ π h PT 1 t=k r(xt, π(xt)) i and

Ar,k(π, x, a) = Eτ π [r(x, a) + Jr,k+1(π, x ) Jr,k(π, x)] with x = f (x, a) + w. For simplicity we refer to Jr,0(π, x0) = Jr(π, x0). The following holds for all x0 X:

Jr(π , x0) Jr(π, x0) = Eτ π

t=0 Ar,t(π, x t, π (x t))

Jr(π , x0) = Eτ π

t=0 r(x t, π (x t))

= Eτ π [r(x0, π (x0)) + Jr,1(π , x 1)]

= Eτ π [r(x0, π (x0)) + Jr,1(π, x 1) + Jr,1(π , x 1) Jr,1(π, x 1)]

= Eτ π [r(x0, π (x0)) + Jr,1(π, x 1) Jr(π, x0)]

+ Jr(π, x0) + Eτ π [Jr,1(π , x 1) Jr,1(π, x 1)]

= Eτ π [Ar,0(π, x0, π (x0))] + Jr(π, x0) + Eτ π [Jr,1(π , x1) Jr,1(π, x1)] Therefore we obtain Jr(π , x0) Jr(π, x0) = Eτ π [A0(π, x0, π (x0))] + Eτ π [Jr,1(π , x 1) Jr,1(π, x 1)] . Using the same argument for Jr,1, Jr,2, ..., Jr,T 1 and that Jr,T (π, x) = 0 for all π Π and x X completes the proof.

Assume a policy π is fixed and dynamics are of the form: x = µn(x, π(x)) + βn(δ)σ(x, π(x))u + w. (14)

Here u [ 1, 1]dx. Furthermore, assume that the associated running rewards do not depend on u, that is, r(xt), and let η Ξ denote the policy, i.e., η : X [ 1, 1]dx.

Corollary 2. The following holds for all x0 X and policy π:

Jr(π, η , x0) Jr(π, η, x0) = Eτ η

t=0 Jr,t+1(π, η, x t+1) Jr,t+1(π, η, xt+1)

with xt+1 = µn(x t, π(x t)) + βn(δ)σ(x t, π(x t))η(x t) + wt, and x t+1 = µn(x t, π(x t)) + βn(δ)σ(x t, π(x t))η (x t) + wt.

Proof. From Lemma 5 we have

Jr(π, η , x0) Jr(π, η, x0) = Eτ η

t=0 Ar,t(η, x t, η (x t))

Furthermore, Eτ η [Ar,t(η, x t, η (x t))] = Eτ η r(x t) + Jr,t+1(π, η, x t+1) Jr,t(π, η, x t)

= Eτ η r(x t) + Jr,t+1(π, η, x t+1) r(x t) Jr,t+1(π, η, xt+1)

= Eτ η Jr,t+1(π, η, x t+1) Jr,t+1(π, η, xt+1) .

Proof of Lemma 2. From Assumption 1 we know that with probability at least 1 δ there exists a η such that f (z) = µn(z) + βn(δ)σ(z) η(x) for all z Z.

Jn(π n) Jn(πn) Jn(πn, ηn) Jn(πn) (Corollary 1) = Jn(πn, ηn) Jn(πn, η)

t=0 Jn,t+1(πn, ηn, x t+1) Jn,t+1(πn, ηn, xt+1)

(Corollary 2)

t=0 Jn,t+1(πn, ηn, x t+1) Jn,t+1(πn, ηn, xt+1)

(Expectation wrt πn under true dynamics f ) with xt+1 = f (xt, πn(xt)) + wt,

and x t+1 = µn 1(xt, πn(xt)) + βn 1(δ)σn 1(xt, πn(xt))ηn(xt) + wt.

A.2 Analyzing regret of optimistic planning

In the following, we analyze the regret of optimistic planning for both σ-Gaussian noise and σ-sub Gaussian noise case. We start with the Gaussian case.

Lemma 6 (Absolute expectation Difference Under Two Gaussians (Lemma C.2. Kakade et al. (2020))). For Gaussian distribution N(µ1, σ2I) and N(µ2, σ2I), and for any (appropriately measurable) positive function g, it holds that:

|Ez N1[g(z)] Ez N2[g(z)]| min µ1 µ2

Ez N1[g2(z)]

|Ez N1[g(z)] Ez N2[g(z)]| = Ez N1

Ez N1 [g2(z)]

v u u t Ez N1

Ez N1 [g2(z)]

v u u t Ez N1

Ez N1[g2(z)] min µ1 µ2

(Lemma C.2. Kakade et al. (2020))

Corollary 3 (Regret of optimistic planning for Gaussian noise). Let π n, πn denote the solution to Equation (6) and Equation (7) respectively, and z n,t, zn,t the corresponding state-action pairs visited during their respective trajectories. Furthermore, let Assumption 1 - 3 hold. Then, the following is true for all n 0, t [0, T 1], with probability at least 1 δ

Jn(π n) Jn(πn) O

(1 + dx)βn 1(δ) σn 1(zn,t)

Proof. For simplicity, define gn(x) = Jn,t+1(πn, ηn, x). Note since wt N(0, σ2I) (Assumption 3), we have that x t+1 and xt+1 are also Gaussians. Therefore, we can leverage Lemma 6.

Eτ πn Jn,t+1(πn, ηn, x t+1) Jn,t+1(πn, ηn, xt+1) = E gn(x t+1) gn(xt+1)

E[g2n(xt+1)] min

( x t+1 xt+1

( x t+1 xt+1

. (Lemma 4)

Furthermore, x t+1 xt+1 = µn 1(xt, πn(xt)) + βn 1(δ)σn 1(xt, πn(xt))ηn(xt) f (xt, πn(xt))

µn 1(xt, πn(xt)) f (xt, πn(xt)) + βn 1(δ) σn 1(xt, πn(xt)) ηn(xt)

dx)βn 1(δ) σn 1(xt, πn(xt)) . (Assumption 1) Next, we use Lemma 2

Jn(π n) Jn(πn) Eτ πn

t=0 Jn,t+1(πn, ηn, x t+1) Jn,t+1(πn, ηn, xt+1)

Jmax min (1 + dx)βn 1(δ) σn 1(xt, πn(xt))

(1 + dx)βn 1(δ) σn 1(xt, πn(xt))

(1 + dx)βn 1(δ) σn 1(xt, πn(xt))

Lemma 7 (Regret of planning optimistically for sub-Gaussian noise). Let π n, πn denote the solution to Equation (6) and Equation (7) respectively, and z n,t, zn,t the corresponding state-action pairs visited during their respective trajectories. Furthermore, let Assumption 1 and 2 hold, and relax Assumption 3 to σ-sub Gaussian noise. Then, the following is true for all n 0 with probability at least 1 δ

Jn(π n) Jn(πn) O

LT 1 σ βT n 1(δ)TEτ πn

t=0 σn 1,j(zn,t)

Proof. Curi et al. (2020, Lemma 5) bound the regret with the sum of epistemic uncertainties for Lipschitz continuous reward functions, under Assumption 1 and 2 for sub-Gaussian noise (c.f., Rothfuss et al. (2023, Theorem 3.5) for a more rigorous derivation). For the active exploration setting, the reward in episode n + 1 is

1 + σ2 n 1,j(z)

We show in Lemma 4 that our choice of exploration reward is Lipschitz continuous. Thus, can use the regret bound from Curi et al. (2020).

Compared to the Gaussian case, σ-sub Gaussian noise has the additional exponential dependence on the horizon T, i.e., the βT n term. This follows from the analysis through Lipschitz continuity. Moreover, as we show in Lemma 2, the regret of planning optimistically is proportional to the change in value under the same optimistic dynamics and policy, but different initial states. The Lipschitz continuity property of our objective allows us to relate the difference in values to the discrepancy in the trajectories. Even for linear systems, trajectories under the same dynamics and policy but different initial states can deviate exponentially in the horizon.

A.3 Proof for general Bayesian models

In this section, we analyze the information gain for general Bayesian models and prove Theorem 1.

Theorem 3 (Entropy of a RV with finite second moment is upper bounded by Gaussian entropy (Theorem 8.6.5 Cover and Thomas (2006))). Let the random vector x Rn have covariance K = E xx (i.e., Kij = E [xixj] , 1 i, j n). Then

2 log((2πe)n|K|)

with equality if and only if x N(µ, K) for µ = E [x].

Lemma 8 (Monotonocity of information gain). Let τ π denote the trajectory induced by the policy π. Then, the following is true for all n 0, policies π

ED1:n [I (f τ π; yτ π | D1:n)] ED1:n 1 [I (f τ π; yτ π | D1:n 1)] .

ED1:n [I (f τ π; yτ π | D1:n 1) I (f τ π; yτ π | D1:n)] = ED1:n [H (yτ π | D1:n 1) H (yτ π | f τ π, D1:n 1) (H (yτ π | D1:n) H (yτ π | f τ π, D1:n))] = ED1:n [H (yτ π | D1:n 1) H (yτ π | D1:n)] + ED1:n [H (yτ π | f τ π) H (yτ π | f τ π)] 0 (information never hurts)

A direct consequence of Lemma 8 is the following corollary.

Corollary 4 (Information gain at N is less than the average gain till N). Let τ π denote the trajectory induced by the policy π. Then, the following is true for all N 1, policies π,

ED1:N 1 [I (f τ π; yτ π | D1:N 1)] 1

n=1 ED1:n 1 [I (f τ π; yτ π | D1:n 1)] .

Next, we prove Lemma 1, which is central to our proposed active exploration objective in Equation (6).

Proof of Lemma 1. Let yτ π = {yt}T 1 t=0 = {f t + wt}T 1 t=0 , where f t = f (zt). Furthermore, denote with Σn(f 0:T 1) the covariance of f 0:T 1.

I (f τ π; yτ π | D1:n) = I f 0:T 1; y0:T 1 | D1:n

= H (y0:T 1 | D1:n) H y0:T 1 | f 0:T 1, D1:n

2 log σ2I + Σn(f 0:T 1) 1

2 log σ2I (Theorem 3)

2 log diag I + σ 2Σn(f 0:T 1) (Hadamard s inequality)

1 + σ2 n,j(zt)

We can leverage the result from Lemma 1 to bound the average mutual information with the sum of epistemic uncertainties.

Lemma 9 (Average information gain is less than sum of average epistemic uncertainties). Let Assumption 3 hold and denote with πN be the solution of Equation (4). Then, for all N 1 and dataset D1:N the following is true

n=1 ED1:n 1,τ πn [I (f τ πN ; yτ πN | D1:n 1)]

n=1 ED1:n 1,τ π n

1 + σ2 n,j(z n,t) σ2

where z n,t are the state-action tuples visited by the solution of Equation (6), i.e., π n.

n=1 ED1:n 1,τ πN [I (f τ πN ; yτ πN | D1:n 1)]

n=1 ED1:n 1,τ πN

1 + σ2 n,j(zt)

n=1 ED1:n 1

max π Π Eτ π

1 + σ2 n,j(zt)

n=1 ED1:n 1,τ πn

1 + σ2 n,j(z n,t) σ2

Here (1) follows from the tower property. Note that the second expectation in (1) is wrt τ π conditioned on a realization of D1:n 1, where the conditioning is captured in the epistemic uncertainty σn( ).

We use the results from above, to prove Theorem 1.

Proof of theorem 1. Let πn denote the solution to Equation (4) at iteration n 1. We first relate the information gain from OPAX to the information gain of πn.

ED1:N 1,τ πN [I (f τ πN ; yτ πN | D1:N 1)]

n=1 ED1:n 1,τ πn [I (f τ πn; yτ πn | D1:n 1)] (Corollary 4)

n=1 ED1:n 1,τ πn

1 + σ2 n 1,j(z n,t) σ2

1 + σ2 n 1,j(zn,t)

+ Jn(π n) Jn(πn)

n=1 ED1:n 1

1 + σ2 n 1,j(zn,t)

βn 1(δ)TEτπn

t=0 σn 1(zn,t) 2

(Corollary 3)

In summary, the maximum expected mutual information at episode N is less than the mutual information of OPAX and the sum of model epistemic uncertainties. Crucial to the proof is the regret

bound for optimistic planning from Corollary 3.

n=1 ED1:n 1

1 + σ2 n 1,j(zn,t)

Tβn 1(δ)Eτπn

t=0 σn 1(zn,t) 2

n=1 ED1:n 1

1 + σ2 n 1,j(zn,t)

Tβn 1(δ)Eτπn

t=0 σn 1(zn,t) 2

n=1 ED1:n 1

j=1 log 1 + σn 1,j(zn,t)

Tβn 1(δ)Eτπn

t=0 σn 1(zn,t) 2

n=1 ED1:n 1

σn 1,j(zn,t)

Tβn 1(δ)Eτπn

t=0 σn 1(zn,t) 2

(log(1 + x) x for x 0.)

n=1 ED1:n 1

Tβn 1(δ)Eτπn

t=0 σn 1(zn,t) 2

Above, we show that the maximum expected mutual information can be upper bounded with the sum of epistemic uncertainties for the states OPAX visits during learning. Finally, we further upper bound this with the model complexity measure.

n=1 ED1:n 1

βn 1(δ)TEτπn

t=0 σn 1(zn,t) 2

n=1 (Tβn 1(δ))Eτπn

t=0 σn 1(zn,t) 2

v u u t TNED1:N

t=0 ||σn(zn,t)||2 2

βN(δ)T 3/2 r

Theorem 1 gives a bound on the maximum expected mutual information w.r.t. the model complexity. We can use concentration inequalities such as Markov, to give a high probability bound on the

information gain. In particular, we have for all ϵ > 0

Pr (I (f τ πN ; yτ πN | D1:N 1) ϵ) ED1:N 1,τ πN I f τ πN ; yτ πN | D1:N 1

A.4 Proof of GP results This section presents our results for the frequentist setting where the dynamics are modeled using GPs. Since the information gain has no meaning in the frequentist setting, we study the epistemic uncertainty of the GP models.

Corollary 5 (Monotonicity of the variance). For all n 0, and policies π the following is true.

1 + σ2 N 1,j(zt)

1 + σ2 n 1,j(zt)

Proof. Follows directly due to the monotonicity of GP posterior variance.

Next, we prove that the trajectory of Equation (6) at iteration n is upper-bounded with the maximum information gain.

Lemma 10. Let Assumption 2 - 4 hold Then, for all N 1, with probability at least 1 δ, we have

max π Π Eτπ

1 + σ2 N,j(zt)

O βN(δ)T 3/2 rγN

Moreover, relax noise Assumption 3 to σ-sub Gaussian. Then, for all N 1, with probability at least 1 δ, we have

max π Π Eτπ

1 + σ2 N,j(zt)

O LT σβT N(δ)T 3/2 rγN

Proof. Gaussian noise case: Let z n,t denote the state-action pair at time t for the trajectory of Equation (6) at iteration n 1 and π n the corresponding policy.

1 + σ2 N,j(z N,t) σ2

1 + σ2 n,j(z N,t) σ2

(Corollary 5)

1 + σ2 n,j(z n,t) σ2

(By definition of π n)

βN(δ)T 3/2 r

(See proof of Theorem 1)

O βN(δ)T 3/2 rγN

(Curi et al., 2020, Lemma 17)

Sub-Gaussian noise case: The only difference between the Gaussian and sub-Gaussian case is the regret term (c.f., Corollary 3 and Lemma 7). In particular, the regret for the sub-Gaussian case leverages the Lipschitz continuity properties of the system (Assumption 2). This results in an exponential dependence on the horizon for our bound. We refer the reader to Curi et al. (2020); Rothfuss et al. (2023) for a more detailed derivation.

Lemma 10 gives a sample complexity bound that holds for a richer class of kernels. Moreover, for GP models, βN γN (Chowdhury and Gopalan, 2017). Therefore, for kernels, where lim N γ2 N/N 0, we can show convergence (for the Gaussian case). We summarize bounds on γN from Vakili et al. (2021) in Table 1.

Table 1: Maximum information gain bounds for common choice of kernels. Kernel k(x, x ) γn

Linear x x O (d log(n))

RBF e x x 2

2l2 O logd+1(n)

Matèrn 1 Γ(ν)2ν 1

O n d 2ν+d log 2ν 2ν+d (n)

From hereon, we focus on deriving the results for the case of Gaussian noise case. All our results can be easily extended for the sub-Gaussian setting by considering its corresponding bound.

Lemma 11. The following is true for all N 0 and policies π Π,

1 2σ2 N,j(z)

1 + σ2 N,j(zt)

with Cσ = σmax log(1+σ 2σmax).

1 + σ2 N,j(zt)

1 2σ2 N,j(zt)

, (Curi et al., 2020, Lemma. 15)

1 2σ2 N,j(z)

Corollary 6. Let Assumption 2 and 4 hold, and relax noise Assumption 3 to σ-sub Gaussian. Then, for all N 1, with probability at least 1 δ, we have

max π Π Eτπ

1 2σ2 N,j(z)

O βN(δ)T 3/2 rγN

Lemma 12. Let Assumption 2 and 4 hold, and relax noise Assumption 3 to σ-sub Gaussian. Furthermore, assume lim N β2 N(δ)γN(k)/N 0. Then for all N 1, z R, and 1 j dx, with probability at least 1 δ, we have

σn,j(z) a.s. 0 for n .

Proof. We first show that the expected epistemic uncertainty along a trajectory converges to zero almost surely. Then we leverage this result to show almost sure convergence of all trajectories induced

by π Π. To this end, let Sn = Eτπ n

h PT 1 t=0 Pdx j=1 1 2 log 1 + σ2 n,j(z n,t) σ2 i for all n 0. So far we have,

Pr Sn O βN(δ)T 3/2 rγn

Consider a sequence {δn}n 0 such that limn δn = 0, and limn βn(δn)T 3/2p γn

n 0. Note, for GP models with lim N β2 N(δ)γN(k)/N 0, such a sequence of δn exists (Chowdhury and Gopalan, 2017). Consider any ϵ > 0 and let N (ϵ) be the smallest integer such that

O βN (ϵ)(δ)T 3/2 rγN (ϵ)

Then, we have

n=0 Pr(Sn > ϵ) =

n=0 Pr(Sn > ϵ) +

n=N (ϵ) Pr(Sn > ϵ)

n=0 Pr(Sn > ϵ) +

n=N (ϵ) δn.

Note, since limn δn = 0, we have

n=N (ϵ) δn < .

In particular, P n=0 Pr(Sn > ϵ) < for all ϵ > 0. Therefore, we obtain

Sn a.s. 0 for n .

Define the random variable V = limn PT 1 t=0 Pdx j=1 1 2 log 1 + σ2 n,j(z n,t) σ2 , with z n,t τ

and τ τ π n. V represents the sum of epistemic uncertainties of a random trajectory induced by the sequence of policies {πn}n 0. Note V 0, therefore we apply Markov s inequality. Moreover, for all ϵ > 0, we have

Pr (V > ϵ) E[V ]

Hence, we have Pr (V = 0) = 1 = V a.s. 0 for n

Accordingly, we get for all π Π.

1 + σ2 n,j(zt)

Assume there exists a z R, such that for some ϵ, σ2 n,j(z) > ϵ for all n 0. Since, z R, there exists a t and π Π such that p(zt = z|π, f ) > 0. This implies that Pr(z τπ) > 0. However, from Equation (15), we have that σ2 n,j(z) 0 for N almost surely, which is a contradiction.

Finally, we leverage the results from above to prove Theorem 2.

Proof of Theorem 2. The proof follows directly from Corollary 6 and Lemma 12.

A.5 Zero-shot guarantees

In this section, we give guarantees on the zero-shot performance of OPAX for a bounded cost function. We focus this section on the case of Gaussian noise. However, a similar analysis can be performed for the sub-Gaussian case and Lipschitz continuous costs. Since the analysis for both cases is similar, we only present the Gaussian case with bounded costs here.

Corollary 7. Consider the following optimal control problem

arg min π Π Jc(π, f ) = arg min π Π Eτ π

t=0 c(xt, π(xt))

xt+1 = f (xt, π(xt)) + wt 0 t T,

with bounded and positive costs. Then we have for all policies π with probability at least 1 δ

Jc(π, ηP ) Jc(π) O

(1 + dx)βn 1(δ) σn 1(zt)

where Jc(π, ηP ) = maxη Ξ Jc(π, η).

Proof. From Corollary 2 we get

Jc(π, ηP ) Jc(π) = Eτ π

t=0 Jr,t+1(π, ηP , x P t+1) Jr,t+1(π, ηP , xt+1)

with xt+1 = f (xt, π(xt)) + wt, and x P t+1 = µn(xt, π(xt)) + βn(δ)σ(xt, π(xt))ηP (xt) + wt. Furthermore, the cost is positive and bounded. Therefore,

J2 c (π, η, x) T 2c2 max,

for all x, η, π. Accordingly, we can now use the same analysis as in Lemma 2 and get

Jc(π, ηP ) Jc(π) O

(1 + dx)βn 1(δ) σn 1(zt)

Lemma 13. Consider the control problem in Equation (16) and let Assumption 3 and 4 hold. Furthermore, assume for every ϵ > 0, there exists a finite integer n such that

n n ; β 3/2 n (δ)T 11/4 γn(k)

and denote with ˆπn the minimax optimal policy, i.e., the solution to minπ Π maxη Ξ Jc(π, η). Then for all n n , we have probability at least 1 δ, Jc(ˆπN) Jc(π ) O(ϵ).

Proof of Zero-shot performance. In Theorem 2 we give a rate at which the maximum uncertainty along a trajectory decreases:

max π Π Eτ π max z τ π 1 2 σN,j(z) 2 O

Combining this with Corollary 7 we get

Jc(π, ηP ) Jc(π) O

(1 + dx)βn 1(δ) σn 1(zt)

O T 2βn 1(δ)Eτ πn max z τ πn σn 1(z)

Eτ πn max z τ πn T 4β2 n 1(δ) σn 1(z) 2 !

β3n T 11/2 r

β 3/2 n (δ)T 11/4 γn(k)

= O(ϵ). ( n n )

Hence, we have that for each policy π, our upper bound maxη Jc(π, ηP ) is ϵ precise, i.e.,

max η Ξ Jc(π, η) Jc(π) O(ϵ), π Π. (18)

We leverage this to prove optimality for the minimax solution. For the sake of contradiction, assume that Jc(ˆπn) > Jc(π ) + O(ϵ). (19)

Then we have,

max η Ξ Jc(π , η) min π Π max η Ξ Jc(π, η)

= Jc(ˆπn, ˆηP ) = max η Ξ Jc(ˆπn, η)

Jc(ˆπn) > Jc(π ) + O(ϵ) (Equation (19))

Jc(π , η ,P ) (Equation (18)) = max η Ξ Jc(π , η).

This is a contradiction, which completes the proof.

Lemma 13 shows that OPAX also results in nearly-optimal zero-shot performance. The convergence criteria in Equation (17) is satisfied for kernels k that induce a very rich class of RKHS (c.f., Table 1).

B Experiment Details

The environment details and hyperparameters used for our experiments are presented in this section. In Appendix B.2 we discuss the experimental setup of the Fetch Pick & Construction environment (Li et al., 2020) in more detail.

B.1 Environment Details

Table 2 lists the rewards used for the different environments. We train the dynamics model after each episode of data collection. For training, we fix the number of epochs to determine the number of gradient steps. Furthermore, for computational reasons, we upper bound the number of gradient steps by maximum number of gradient steps . The hyperparameters for the model-based agent are presented in Table 3. Furthermore, we present the i CEM hyperparameters in Table 4. For more detail on the i CEM hyperparameters see Pinneri et al. (2021).

Table 2: Downstream task rewards for the environments presented in Section 5.

Environment Reward rt

Pendulum-v1 - swing up θ2 t + 0.1 θt + 0.001u2 t Pendulum-v1 - keep down (θt π)2 + 0.1 θt + 0.001u2 t Mountain Car 0.1u2 t + 100(1{xt xgoal}) Reacher - go to target 100(1{||xt xtarget||2 0.05} Swimmer - go to target ||xt xtarget||2 Swimmer - go away from target ||xt xtarget||2 Cheetah - run forward vforward,t Cheetah - run backward vforward,t

Table 3: Hyperparameters for results in Section 5.

Hyperparameters Pendulum-v1 - GP Pendulum-v1 Mountain Car Reacher Swimmer Cheetah

Action repeat N/A N/A N/A N/A 4 4 Exploration horizon 100 200 200 50 1000 1000 Downstream task horizon 200 200 200 50 1000 1000 Hidden layers N/A 2 2 2 4 4 Neurons per layers N/A 256 128 256 256 256 Number of ensembles N/A 7 5 5 5 5 Batch size N/A 64 64 64 64 64 Learning rate 0.1 5 10 4 5 10 4 10 3 5 10 4 5 10 5 Number of epochs 50 50 50 50 50 50 Maximum number of gradient steps N/A 5000 5000 6000 7500 7500 βn 2.0 2.0 1.0 1.0 2.0 2.0

Model-based SAC optimizer For the reacher and Mountain Car environment we use scarce rewards (c.f., Table 2), which require long-term planning. Therefore, a receding horizon MPC approach is less suitable for these tasks. Accordingly, we use a policy network which we train using SAC (Haarnoja et al., 2018). Moreover, our model-based SAC uses transitions simulated through the learned model to train a policy. Accordingly, given a dataset of transitions from the true environment, we sample P initial states from the dataset. For each of the states, we simulate a trajectory of H steps using our learned model and the SAC policy. We collect the simulated transitions into a simulation data buffer DSIM, which we then use to perform G gradient steps as suggested by Haarnoja et al. (2018) to train the policy. The algorithm is summarized below, and we provide the SAC hyperparameters in Table 5.

Table 4: Parameters of i CEM optimizer for experiments in Section 5.

Hyperparameters Pendulum-v1 - GP Pendulum-v1 Swimmer Cheetah

Number of samples P 500 500 250 200 Horizon H 20 20 15 10 Size of elite-set K 50 50 25 20 Colored-noise exponent β 0.25 0.25 0.25 0.25 Number of particles 10 10 10 10 CEM-iterations 10 10 5 5 Fraction of elites reused ξ 0.3 0.3 0.3 0.3

Model-based SAC

Init: Stastistical Model M = (µn, σn, βn(δ)), Dataset of transitions Dn, initial policy π0, Model Rollout steps H DSIM Initialize simulated transitions buffer for Training steps k = 1, . . . , K do

x0,1:P Dn Sample P initial states from buffer

for Initial state x0 x0,1:P do

DSIM DSIM MODELROLLOUT(πk, x0, H, M) Collect simulated transitions

Perform G gradient updates on πk as proposed in Haarnoja et al. (2018) using DSIM.

Table 5: Parameters of model-based SAC optimizer for experiments in Section 5.

Parameters Mountain Car Reacher

Discount factor 0.99 0.99 Learning rate actor 5 10 4 5 10 4

Learning rate critic 5 10 4 5 10 4

Learning rate entropy coefficient 5 10 4 5 10 4 Actor architecture [64, 64] [250, 250] Critic architecture [256, 256] [250, 250] Batch size 32 64 Gradient steps G 350 500 Simulation horizon H 4 10 Initial state sample size P 500 2000 Total SAC training steps K 35 350

B.2 OPAX in the High-dimensional Fetch Pick & Place Environment

B.2.1 Environment and Model Details In our experiments, we use an extension of the Fetch Pick & Place environment to multiple objects as proposed in Li et al. (2020) and further modified in Sancaktar et al. (2022) with the addition of a large table. This is a compositional object manipulation environment with an end-effector-controlled robot arm. The robot actions u R4 correspond to the gripper movement in Cartesian coordinates and the gripper opening/closing. The robot state xrobot R10 contains positions and velocities of the end-effector as well as the gripper-state (open/close) and gripper-velocity. Each object s state xobj R12 is given by its position, orientation (in Euler angles), and linear and angular velocities.

We follow the free-play paradigm as used in CEE-US (Sancaktar et al., 2022). At the beginning of free play, an ensemble of world models is randomly initialized with an empty replay buffer. At each iteration of free play, a certain number of rollouts (here: 20 rollouts with 100 timesteps each) are collected and then added to the replay buffer. Afterwards, the models in the ensemble are trained for a certain number of epochs on the collected data so far. Note that unlike in the original proposal by Sancaktar et al. (2022), we use Multilayer Perceptrons (MLP) as world models instead of Graph Neural Networks (GNN), for the sake of reducing run-time. This corresponds to the ablation MLP + i CEM presented in Sancaktar et al. (2022), which was shown to outperform all the baselines other than CEE-US with GNNs. As we are interested in exploring the difference in performance via injection of optimism into active exploration, we use the computationally less heavy MLPs in our work. This is reflected in the downstream task performance we report compared to the original CEE-US with GNNs. Details for the environment and models are summarized in Table 6.

After the free-play phase, we use the trained models to solve downstream tasks zero-shot via modelbased planning with i CEM. We test for the tasks pick & place, throwing and flipping with 4 objects. The rewards used for these tasks are the same as presented in Sancaktar et al. (2022).

Table 6: Environment and model settings used for the experiment results shown in Figure 4.

(a) Fetch Pick & Place settings.

Parameter Value

Episode Length 100 Train Model Every 20 Ep. Action Dim. 4 Robot/Agent State Dim. 10 Object Dynamic State Dim. 12 Number of Objects 4

(b) MLP model settings.

Parameter Value

Network Size 3 256 Activation function Si LU Ensemble Size 5 Optimizer ADAM Batch Size 256 Epochs 50 Learning Rate 0.0001 Weight decay 5 10 5 Weight Initialization Truncated Normal Normalize Input Yes Normalize Output Yes Predict Delta Yes

B.2.2 OPAX Heuristic Variant In the case of Fetch Pick & Place with an high-dimensional observation space, we implement a heuristic of OPAX. Note that as Fetch Pick & Place is a deterministic environment without noise, we only model epistemic uncertainty.

OPAX (Heuristic Variant)

Input: Ensemble {fi}K k=1, ϵ 1 for n 1, . . . , Nmax do

Solve optimal control problem till convergence for the system: xt+1 = fjt(xt, ut).

u 0:T 1, j 0:T 1 = argmax u0:T 1,j0:T 1

i=1 log ϵ2 + σ2 n,i(xt, ut) Estimate

σn,i(xt, ut) with ensemble disagreement.

Rollout u 0:T1 on the system and collect data Dn. Update models {fi}K k=1.

B.2.3 Controller Parameters The set of default hyperparameters used for the i CEM controller are presented in Table 7, as they are used during the intrinsic phase for OPAX, CEE-US, and other baselines. The controller settings used for the extrinsic phase are given in Table 8. For more information on the hyperparameters and the i CEM algorithm, we refer the reader to Pinneri et al. (2021).

Table 7: Base settings for i CEM as they are used in the intrinsic phase. Same settings are used for all methods.

Parameter Value

Number of samples P 128 Horizon H 30 Size of elite-set K 10 Colored-noise exponent β 3.5 CEM-iterations 3 Noise strength σinit 0.5 Momentum α 0.1 use_mean_actions Yes shift_elites Yes keep_elites Yes Fraction of elites reused ξ 0.3 Cost along trajectory sum

Table 8: i CEM hyperparameters used for zero-shot generalization in the extrinsic phase. Any settings not specified here are the same as the general settings given in Table 7.

Task Controller Parameters Horizon Colored-noise exponent use_mean_actions Noise strength Cost Along h β σinit Trajectory

Pick & Place 30 3.5 Yes 0.5 best Throwing 35 2.0 Yes 0.5 sum Flipping 30 3.5 No 0.5 sum

C Study of exploration intrinsic rewards

0 5 10 15 10 2

Pendulum-v1 GP - uncertainty

0 5 10 15 2.0

103 Pendulum-v1 GP - swing up

0 5 10 15 20 25 Episodes

104 Mountain Car

0 100 200 300 400 500 Episodes

102 Cheetah - run forward

Op Ax Op Ax-sum

Figure 5: Comparison of OPAX between intrinsic reward proposed in Equation (6) and the sum of epistemic uncertainties proposed by (Pathak et al., 2019), i.e., OPAX-SUM. For both choices of intrinsic rewards, OPAX reduces the epistemic uncertainty and performs well on downstream tasks.

The intrinsic reward suggested in Equation (6) takes the log of the model epistemic uncertainty. Another common choice for the intrinsic reward is the epistemic uncertainty or model disagreement without the log (Pathak et al., 2019; Sekar et al., 2020). In the following Lemma, we show that these rewards are closely related.

Lemma 14. Let σmax = supz Z;i 0;1 j dx σi,j(z) and σ > 0. Then for all i 0 and j {1, . . . , dx}

σ2 i,j(z) σmax log(1 + σ 2σmax) log(1 + σ 2σ2 i,j(z)) σ 2σmax log(1 + σ 2σmax)σ2 i,j(z).

Proof. Curi et al. (2020) derive the first inequality on the left. The second inequality follows since log(1 + x) x for all x 0.

Due to this close relation between the two objectives, our theoretical findings also apply to the intrinsic reward proposed by (Pathak et al., 2019). Moreover, empirically we notice in Figure 5 that OPAX performs similarly when the sum of epistemic uncertainties is used instead of the objective in Equation (6). However, our objective can naturally be extended to the case of heteroscedastic aleatoric noise, since it trades off the ratio of epistemic and aleatoric uncertainty.