# Goal-conditioned Offline Planning from Curious Exploration

Marco Bagatella, ETH Zürich & Max Planck Institute for Intelligent Systems, Tübingen, Germany (mbagatella@ethz.ch). Georg Martius, University of Tübingen & Max Planck Institute for Intelligent Systems, Tübingen, Germany (georg.martius@uni-tuebingen.de).

Curiosity has established itself as a powerful exploration strategy in deep reinforcement learning. Notably, leveraging expected future novelty as intrinsic motivation has been shown to efficiently generate exploratory trajectories, as well as a robust dynamics model. We consider the challenge of extracting goal-conditioned behavior from the products of such unsupervised exploration techniques, without any additional environment interaction. We find that conventional goal-conditioned reinforcement learning approaches for extracting a value function and policy fall short in this difficult offline setting. By analyzing the geometry of optimal goal-conditioned value functions, we relate this issue to a specific class of estimation artifacts in learned values. In order to mitigate their occurrence, we propose to combine model-based planning over learned value landscapes with a graph-based value aggregation scheme. We show how this combination can correct both local and global artifacts, obtaining significant improvements in zero-shot goal-reaching performance across diverse simulated environments.

1 Introduction

While the standard paradigm in reinforcement learning (RL) is aimed at maximizing a specific reward signal, the focus of the field has gradually shifted towards the pursuit of generally capable agents [1, 20, 40]. This work focuses on a promising framework for learning diverse behaviors with minimal supervision, by decomposing the problem into two consecutive phases. The first phase, referred to as exploratory [28, 43] or intrinsic [38], consists of unsupervised exploration of an environment, aiming to collect diverse experiences. These can be leveraged during the second phase, called exploitatory or extrinsic, to distill a family of specific behaviors without further environment interaction. This paradigm unifies several existing frameworks, including zero-shot deep reinforcement learning [28, 43], Explore2Offline [23], ExORL [52] and model-based zero-shot planning [38]. An established technique for the intrinsic phase is curiosity-based exploration, which leverages some formulation of surprise as a self-supervised reward signal. In practice, this is often done by estimating the uncertainty of the forward dynamics predictions of a learned model [49]. At the end of the exploratory phase, this method produces a sequence of collected trajectories, as well as a trained dynamics model. At its core, our work focuses on the subsequent extrinsic phase, and is concerned with extracting strong goal-conditioned behavior from these two products, without further access to the environment.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Figure 1: This work focuses on extracting goal-conditioned behavior from exploratory trajectories $\mathcal{D}$ and a trained dynamics model $\hat{P}$. We propose to learn a value function through goal-conditioned RL, plan with zero-order trajectory optimization to mitigate local estimation artifacts, and aggregate value estimates along a graph in order to correct global estimation artifacts.
In its simplest form, this corresponds to learning to control the agent's future state distribution in order to achieve a given goal or, more concretely, to visit a particular state. Existing methods tackle this problem without additional learning through model-based online planning [38], or rely on learning an amortized policy, either during exploration [28, 43] or in a fully offline manner [23, 52]. The former class of approaches requires the definition of a hand-crafted cost function, without which long-horizon, sparse tasks would remain out of reach. On the other hand, the latter class does not rely on a well-shaped cost function, but may suffer from myopic behavior in the absence of explicit planning, and from estimation errors in the learned value function. As we do not assume access to external task-specific cost functions, we first investigate the effectiveness of goal-conditioned RL algorithms for learning a value function. In doing so, we empirically extend the results obtained by Yarats et al. [52] to previously unstudied model-based and goal-conditioned settings. We thus confirm that RL methods not involving offline corrections can be competitive when trained on offline exploratory data. Additionally, through a formal analysis of the geometry of optimal goal-conditioned value functions, we pinpoint a geometric feature of inaccurate value landscapes, which we refer to as T-local optima. After showing the detrimental effects of such learning artifacts on the performance of goal-conditioned RL, we propose to mitigate them through a novel combination of model-based planning with a practical graph-based aggregation scheme for learned value estimates. Crucially, we show how these two components address the same estimation issue on different scales, that is, respectively, locally and globally. We provide a graphical overview in Figure 1 for a simple maze environment.

The contributions of this work can be organized as follows:
- We provide an empirical evaluation of goal-conditioned reinforcement learning algorithms when trained on the outcomes of an unsupervised exploration phase.
- We pinpoint a class of estimation artifacts in learned goal-conditioned value functions, which explains the suboptimality of naive, model-free reinforcement learning approaches in this setting.
- We propose to combine model-based planning with graph-based value aggregation to address the occurrence of both local and global value artifacts.
- We provide the first fully offline method for achieving goals after a curious exploration phase.

After relating our work to existing literature, we evaluate goal-conditioned RL methods when learning from curious exploration (Section 4). We then pinpoint a class of estimation artifacts that are not compatible with optimal goal-conditioned value functions and analyze them through the lens of T-local optima (Section 5). Equipped with these insights, we show how our practical method can mitigate such artifacts (Section 6). A compact description of the resulting method, highlighting its intrinsic and extrinsic phases, can be found in Appendix H.

2 Related Work

Unsupervised Exploration The challenge of exploration has been at the core of reinforcement learning research since its inception.
Within deep reinforcement learning, successful strategies are often derived from count-based [2], prediction-based [41] or direct exploration techniques [10], with the first two traditionally relying on intrinsic motivation [7] computed from state visitation distributions or prediction errors (i.e., surprise), respectively. Depending on the entity on which prediction error is computed, methods can focus on estimating forward dynamics [32, 38, 43, 44], inverse dynamics [33] or possibly random features of the state space [3]. Another important distinction falls between methods that retroactively compute surprise [33] or methods that compute future expected disagreement [38, 43], and proactively seek out uncertain areas in the state space. Overall, such techniques compute an additional intrinsic reward signal, which can be summed to the extrinsic environment reward signal, or used alone for fully unsupervised exploration. In this work, we build upon efficient exploration methods that are proactive, and rely on prediction disagreement of a forward dynamics model [38, 43, 44]. Nevertheless, our method could be easily applicable in conjunction with any aforementioned strategy, as long as it can produce exploratory data, and a dynamics model, which could potentially also be learned offline. Zero-shot Frameworks Unsupervised exploration techniques are capable of efficiently sampling diverse trajectories, without extrinsic reward signals. As a result, they are a promising source of data to learn flexible behaviors. Several works in the literature have proposed that a task-agnostic unsupervised exploration phase be followed by a second phase, optimizing a task-specific reward function only provided a posteriori. This has been formalized as zero-shot reinforcement learning [9, 47, 48]. Notably, this framework conventionally forbids task-specific fine-tuning, explicit planning or computation during the second phase. On the applied side, agents endowed with learned world models [28, 43] have shown promise for zero-shot generalization without additional learning or planning. Other works are less strict in their constraints: Sancaktar et al. [38] also propose an intrinsic exploration phase followed by zero-shot generalization, but the latter relies on task-specific online planning. Furthermore, two concurrent works [23, 52] study the application of offline deep reinforcement learning for the second phase. However, these methods are not designed for goalconditioned tasks, and thus require an external task-specific reward function to perform relabeling during training. Our framework fits within this general research direction, in that it also involves taskagnostic and task-specific components. The initial phase is task-agnostic: we rely on unsupervised exploration, and train a universal value estimator [40] completely offline, without access to an external task-specific reward signal. In the second phase, we perform model-based planning via zero-order trajectory optimization to control the agent. We remark that, while this requires significant computation, existing algorithms are fast enough to allow real-time inference [35]. Goal-Conditioned Reinforcement Learning The pursuit of generally capable agents has been a ubiquitous effort in artificial intelligence research [36, 42, 45]. Goal-conditioned RL represents a practical framework for solving multiple tasks, by describing each task through a goal embedding, which can then condition reward and value functions. 
Deep RL methods can then be used to estimate values by providing the goal as an additional input to their (often neural) function approximators [40]. A ubiquitous technique in these settings is that of goal relabeling [1], which substitutes a trajectory's goal with one sampled a posteriori. Goal-conditioned methods have been shown to be successful in the offline setting [46], often when coupled with negative sampling methods [5], but have been mostly evaluated on standard offline datasets. An alternative line of work frames the problem as supervised learning instead [11], and shows connections to the original reinforcement learning objective [51]. Additionally, several works have explored the combination of goal-conditioned learning with model-based planning [4, 25, 31]. This paper studies the natural application of goal-conditioned RL after an unsupervised exploration phase, which allows a straightforward definition of the multiple tasks to be learned. Moreover, our main formal insights in Section 5 build on top of common assumptions in goal-conditioned RL.

Graph-based Value Estimation Several existing works rely on graph abstractions in order to retrieve global structure and address value estimation issues. In particular, Eysenbach et al. [12] and Savinov et al. [39] propose high-level subgoal selection procedures based on shortest paths between states sampled from the replay buffer. Another promising approach proposes to build a graph during training, and to leverage its structure when distilling a policy [30]. In contrast, our method relies on discrete structures only at inference, and directly integrates these structures into model-based planning, thus removing the need for subgoal selection routines.

3 Preliminaries

3.1 Background and Notation

The environments of interest can be modeled as Markov decision processes (MDPs) $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, P, R, \mu_0, \gamma\}$, where $\mathcal{S}$ and $\mathcal{A}$ are (possibly continuous) state and action spaces, $P : \mathcal{S} \times \mathcal{A} \to \Omega(\mathcal{S})$ is the transition probability distribution¹, $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function, $\mu_0 \in \Omega(\mathcal{S})$ is an initial state distribution, and $\gamma$ is a discount factor. Goal-conditioned reinforcement learning augments this conventional formulation by sampling a goal $g \sim p_G$, where $p_G$ is a distribution over a goal space $\mathcal{G}$. While goal abstraction is possible, for simplicity of notation we consider $\mathcal{G} \subseteq \mathcal{S}$ (i.e., goal-conditioned behavior corresponds to reaching given states). The reward function is then defined as $R(s, a; g) = R(s; g) = \mathbb{1}_{d(s,g) \le \epsilon}$ for some $\epsilon \ge 0$ and under some distance metric $d(\cdot, \cdot)$. We additionally assume that episodes terminate when a goal is achieved; the maximum undiscounted return within an episode is thus one. It is now possible to define a goal-conditioned policy $\pi : \mathcal{S} \times \mathcal{G} \to \Omega(\mathcal{A})$ and its value function $V^\pi(s_0; g) = \mathbb{E}_{P, \pi} \sum_{t=0}^{\infty} \gamma^t R(s_t; g)$, with $s_t \sim P(\cdot \mid s_{t-1}, a_{t-1})$ and $a_t \sim \pi(s_t; g)$. The objective of a goal-conditioned RL agent can then be expressed as finding a stochastic policy $\pi^* = \arg\max_\pi \mathbb{E}_{g \sim p_G,\, s \sim \mu_0} V^\pi(s; g)$. When learning from curious exploration, we additionally assume the availability of a set $\mathcal{D} = \{(s_0, a_0, s_1, a_1, \dots, s_{T_i})_i\}_{i=1}^N$ of $N$ exploratory state-action trajectories of possibly different lengths $T_i$, as well as access to an estimate of the probabilistic transition function $\hat{P} \approx P$, i.e., a dynamics model.

¹We use $\Omega(\mathcal{S})$ to represent the space of probability distributions over $\mathcal{S}$.
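To make the reward and termination convention above concrete, the following is a minimal sketch in Python, assuming a Euclidean distance for $d$ and an illustrative threshold $\epsilon$; the function names are illustrative and not taken from the paper's codebase.

```python
import numpy as np

def goal_reward(state, goal, eps=0.0):
    """Sparse goal-conditioned reward R(s; g) = 1 if d(s, g) <= eps, else 0 (Euclidean d)."""
    return float(np.linalg.norm(np.asarray(state) - np.asarray(goal)) <= eps)

def episode_return(states, goal, gamma=0.99, eps=0.05):
    """Discounted return of a state trajectory under the sparse reward.

    The episode terminates as soon as the goal is achieved, so the maximum
    undiscounted return is one, matching the convention in Section 3.1.
    """
    ret = 0.0
    for t, s in enumerate(states):
        r = goal_reward(s, goal, eps)
        ret += gamma ** t * r
        if r > 0.0:  # termination on goal achievement
            break
    return ret
```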
3.2 Data Collection and Experimental Framework

As our contributions are motivated by empirical observations, we also introduce our experimental pipeline early on. Algorithms are evaluated in four MDP instantiations, namely maze_large and kitchen from D4RL [14], fetch_push from gymnasium-robotics [13] and an adapted pinpad environment [18]. These environments were chosen to drastically differ in dynamics, dimensionality, and task duration. In particular, maze_large involves navigation to distant goals in a 2D maze by controlling the acceleration of a point mass, kitchen requires controlling a 9-DOF robot through diverse tasks, including manipulating a kettle and operating switches, fetch_push is a manipulation task controlled through operational space control, and pinpad requires navigating a room and pressing a sequence of buttons in the right order. For each environment, we collect 200k exploratory transitions by curious exploration, corresponding to 2 hours of real-time interaction at 30 Hz, which is an order of magnitude less than existing benchmarks [14, 24]. We stress that this amount of data is realistically collectible in the real world with minimal supervision, thus providing a strong benchmark for data-efficient learning of general agents. Moreover, the nature of the data is significantly different from that of standard offline datasets, which often contain high-quality demonstrations [14] that can be distilled into policies through naive imitation. The environment's dynamics are estimated through an ensemble of stochastic neural networks, similar to the one proposed in Chua et al. [8]. The intrinsic exploration signal is ensemble disagreement [38, 43, 49], as described in Appendix I.2. All algorithms are evaluated by success rate, that is, the ratio of test episodes in which the commanded goal is achieved. Details can be found in Appendix I.

4 Value Learning from Curious Exploration

Curious exploration produces a learned dynamics model $\hat{P}$ and a set $\mathcal{D}$ of trajectories, lacking both goal and reward labels. In order to achieve arbitrary (and potentially distant) goals at inference time, and considering that a task-specific cost function is not externally provided, it is natural to resort to learning and optimizing a value function. This section evaluates established methods for value learning and studies the quality of their estimates, thus motivating the analysis and method proposed in Sections 5 and 6. A standard approach to train a goal-conditioned value function from unlabeled data involves two steps. The first step consists of a goal-relabeling procedure, which associates each training sample with a goal selected a posteriori, and computes a self-supervised reward signal from it, as briefly explained in Section 4.1. The second step consists of extracting a goal-conditioned value function by applying off-policy RL algorithms to the relabeled data, as analyzed in Section 4.2. While this work focuses on data collected through curious exploration, the methods described can also be applied to general offline goal-conditioned reinforcement learning settings, for which we provide a minimal study in Appendix F.

4.1 Relabeling Curious Exploration through Self-supervised Temporal Distances

We follow the framework proposed by Tian et al. [46] to relabel exploration trajectories, and compute self-supervised reward signals by leveraging temporal distances, without assuming access to external task-specific reward functions. Training samples $(s_t, a_t, s_{t+1})$ are sampled uniformly from the replay buffer $\mathcal{D}$, and need to be assigned a goal $g$ and a reward scalar $r$. Each training sample is assigned a positive goal with probability $p_g$, and a negative goal otherwise. Positive goals are sampled as future states from the same trajectory: $g = s_{t+\tau}$, where $\tau \sim \mathrm{Geo}(p_{\mathrm{geo}})$ follows a geometric distribution with parameter $p_{\mathrm{geo}}$. Negative goals are instead sampled uniformly from the entire dataset, while Tian et al. [46] rely on prior knowledge of the semantics of the state space to extract challenging goals. Rewards for all state-goal pairs are simply computed using the reward function $r = R(s_{t+1}; g) = \mathbb{1}_{s_{t+1} = g}$: the reward is non-zero only for positive goals sampled with $\tau = 1$. This corresponds to a minorizer for the class of reward functions $R(s_{t+1}; g) = \mathbb{1}_{d(s_{t+1}, g) \le \epsilon}$, as setting $\epsilon = 0$ effectively removes the necessity to define a specific distance function $d$ over state and goal spaces. We remark that, while its extreme sparsity prevents its optimization via zero-shot model-based planning, it remains functional as a reward signal for training a value function in hindsight.
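For concreteness, the following is a minimal sketch of this relabeling scheme, assuming trajectories are stored as dictionaries with T+1 states and T actions; the default values of $p_g$ and $p_{\mathrm{geo}}$ follow Table S3, and all names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def relabel(trajectories, rng, p_g=0.75, p_geo=0.2):
    """Sample one relabeled training tuple (s_t, a_t, s_{t+1}, goal, reward).

    Positive goals are future states from the same trajectory, with the offset
    tau drawn from a geometric distribution; negative goals are random states
    from the whole dataset. The reward is 1 only when the very next state is
    the goal (tau = 1), i.e. R(s_{t+1}; g) = 1[s_{t+1} = g].
    """
    # pick a random transition (s_t, a_t, s_{t+1})
    traj = trajectories[rng.integers(len(trajectories))]
    states, actions = traj["states"], traj["actions"]
    t = rng.integers(len(actions))
    s, a, s_next = states[t], actions[t], states[t + 1]

    if rng.random() < p_g:
        # positive goal: future state of the same trajectory, tau ~ Geo(p_geo)
        tau = min(rng.geometric(p_geo), len(states) - 1 - t)
        goal = states[t + tau]
        reward = float(tau == 1)
    else:
        # negative goal: a random state from a randomly chosen trajectory
        other = trajectories[rng.integers(len(trajectories))]
        goal = other["states"][rng.integers(len(other["states"]))]
        reward = 0.0
    return s, a, s_next, goal, reward
```

A tuple produced this way (with `rng = np.random.default_rng()`) can be pushed directly into the replay buffer consumed by any of the off-policy algorithms evaluated in Section 4.2.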
4.2 Value Estimation and Suboptimality

Now that sampling and relabeling procedures have been defined, we focus our attention on which methods can effectively retrieve a goal-conditioned value function $V(s; g)$. In particular, we evaluate off-policy actor-critic algorithms, which are in principle capable of extracting the optimal value function (and policy) from trajectories generated by a suboptimal policy (such as the one used for exploration). We thus ask the following questions:

1. Which off-policy actor-critic algorithm is most suitable for learning a goal-conditioned value function, purely offline, from data collected through unsupervised exploration?
2. How close to optimality is the learned value function?

To answer the first question, we consider several off-policy actor-critic algorithms, and evaluate the learned value function through the success rate of an agent which seeks to optimize it. The choice of algorithms should adhere to the offline setting, as well as to the availability of an estimated dynamics model $\hat{P}$. We thus evaluate TD3 [15] as a default model-free actor-critic algorithm, and compare its performance to CRR [50], a state-of-the-art offline method [16]. We also evaluate two model-based methods, namely MBPO [19] and MOPO [53], the latter of which introduces an uncertainty-based penalty for offline learning. Further value-learning baselines are presented in Appendix C. All methods have access to the same offline exploration dataset $\mathcal{D}$; model-based methods additionally query the dynamics model $\hat{P}$ in order to generate training trajectories. Algorithm-specific hyperparameters were tuned separately for each method through grid search, and kept constant across environments. Each algorithm is allowed the same number of gradient steps (sufficient to ensure convergence for all methods). We implement two strategies for optimizing the value functions: querying the jointly learned actor network, and employing model-based planning through zero-order trajectory optimization (e.g., CEM [37]/iCEM [35]) with the learned model $\hat{P}$ to optimize n-step lookahead value estimates [46].

Figure 2: Performance of actor networks (blue) and model-based planning (orange) on value functions trained through various actor-critic algorithms from the same curious exploration datasets.
Average success rates over 9 seeds are presented with 90% simple bootstrap confidence intervals in Figure 2. To answer the first question, we consider the performance of the best optimizer (often model-based planning, in orange) for each of the value learning methods. We observe that simpler actor-critic algorithms (i.e., TD3) are overall competitive with the remaining value learning methods. MBPO, MOPO and CRR are not consistently better in the considered environments, despite involving offline corrections or relying on the learned model ˆP to generate training trajectories. We hypothesize that this phenomenon can be attributed, at least partially, to the exploratory nature of the data distribution, as exemplified in Figure 3. Additionally, we note that our findings and hypothesis are consistent with those reported by Yarats et al. [52] for non-goal-conditioned tasks. We then turn to examining the performance gap between actor networks and model-based planning. We observe a tendency for model-based planning to outperform actor networks, with the largest improvements in fetch_push, as all learned policies similarly fail in this environment. While this stands true for almost all considered algorithms, CRR poses an exception to the success of model-based planning for two environments, which we hypothesize is due to its reliance on weighted behavior cloning rather than on backpropagation through a learned critic. Figure 3: 200k transitions from the exploration dataset D (left), and from the standard offline dataset provided by D4RL (right). Unsupervised exploration leads to a more uniform state coverage, and eases out-of-distribution issues. We remark that it is possible to show that, under common assumptions, a policy which exactly optimizes an optimal goal-conditioned value function is guaranteed to achieve its goals (see Appendix A.2). The low success rate of actor networks, assuming their objective is well-optimized, is thus a symptom of suboptimality for learned value functions. In Section 5, we argue how this suboptimality can be partially traced back to a class of estimation artifacts; moreover, we show how model-based planning is intrinsically more robust to such artifacts than actor networks (see Section 6.1). Due to its strong empirical performance, TD3 is adopted for value learning through the rest of this work. We however remark that the analysis and the method proposed in Sections 5 and 6 remain valid for each of the RL methods considered. Further details on methods and results presented in this section are reported in Appendix I. 5 Pathologies of Goal-Conditioned Value Functions The performance gap between model-based planning and actor networks described in the previous section is a symptom of estimation issues in learned value functions. In this section, we pinpoint a particular class of geometric features which are not compatible with optimal goal-conditioned value functions. Such features represent a significant failure case for an agent optimizing a learned value function, as they can effectively trap it within a manifold of the state space. Intuitively, this happens when the estimated value of the current state exceeds that of all other reachable states, despite the goal not being achieved. We now formalize this intuition by defining such geometric features as T -local optima. We then show that T -local optima are not present in optimal goal-conditioned value functions, and are therefore estimation artifacts. 
This justifies the method proposed in Section 6, which is designed to mitigate this class of artifacts. Let us define the set operator $\mathcal{T} : \mathcal{P}(\mathcal{S}) \to \mathcal{P}(\mathcal{S})$² that returns the set of states reachable in one step from a given initial set of states $S_0 \subseteq \mathcal{S}$:

$$\mathcal{T}(S_0) = \bigcup_{s_0 \in S_0} \{s' \in \mathcal{S} \mid \exists a \in \mathcal{A} : P(s' \mid s_0, a) > 0\}. \quad (1)$$

²$\mathcal{P}(\cdot)$ represents a power set.

We can now formally introduce a class of geometric features, which we refer to as T-local optima:

Definition 5.1. A state $s^\star \in \mathcal{S}$ is a T-local optimum point for a value function $V(s)$ if and only if
$$\max_{s' \in \mathcal{S}} V(s') > V(s^\star) > \max_{s' \in \mathcal{T}(\{s^\star\}) \setminus \{s^\star\}} V(s').$$

Intuitively, a T-local optimum is a hill in the value landscape over the state space, where proximity is defined by the transition function of the MDP. The first inequality ensures that $s^\star$ does not maximize the value function globally. The second inequality guarantees that, if there is an action $\bar{a} \in \mathcal{A} : P(s^\star \mid s^\star, \bar{a}) = 1$, this action would be chosen by greedily optimizing the value of the next state (i.e., $\bar{a} = \arg\max_{a \in \mathcal{A}} \int_{\mathcal{S}} V(s') P(s' \mid s^\star, a)\, \mathrm{d}s'$), thus trapping the agent in $s^\star$. By extending the previous definition to the multistep case, we can also conveniently define depth as the minimum number of steps required to reach a state with a higher value (i.e., to escape the T-local optimum).

Definition 5.2. A T-local optimum point $s^\star$ has depth $k \ge 1$ if $k$ is the smallest integer such that
$$V(s^\star) < \max_{s' \in \mathcal{T}^{k+1}(\{s^\star\})} V(s').$$

Intuitively, a T-local optimum of depth $k$ dominates the value of states reachable in up to $k$ steps.

Figure 4: Learned value as a function of x-y coordinates in maze_large. Choosing actions that maximize the value of the next state traps the agent in a corner, i.e., a T-local optimum point.

These definitions prepare the introduction of a core observation. Within the setting of goal-conditioned RL, T-local optima are not an intrinsic feature of the value landscape, but rather an estimation artifact.

Proposition 5.3. Given a goal $g \in \mathcal{G}$, an optimal goal-conditioned value function $V^*(s; g)$ has no T-local optima.

The proof follows directly from Bellman optimality of the value function under standard assumptions of goal-conditioned RL introduced in Section 3.1. We report it in its entirety in Appendix A.1. We remark that, in principle, showing that a state $s$ is a T-local optimum point requires evaluating the value function over the support of the next-state distribution, which would require access to the true dynamics of the MDP. Moreover, exhaustive evaluation is unfeasible in continuous state spaces. However, one can bypass these issues by instead estimating, rather than proving, the presence of T-local optima. Given a candidate T-local optimum point $s$, $V(\cdot; g)$ can be evaluated on (i) random states $s_r$ sampled from the exploration data $\mathcal{D}$ instead of the state space $\mathcal{S}$, and (ii) samples $s'$ from the next-state distribution, accessed through the dynamics model $\hat{P}(s' \mid s, a)$ with $a \sim \mathcal{U}(\mathcal{A})$. The maximum operators in Definition 5.1 can then be estimated through sampling; if both inequalities hold, $s$ can be estimated to be a T-local optimum point.

Having defined T-local optima, and shown that they are learning artifacts, we now investigate their impact on goal-reaching performance. We train $n = 30$ bootstrapped value estimators for all environments, and for each of them we generate trajectories through model-based planning with horizon $H$. We then estimate the occurrence of T-local optima of depth $k \ge H$ by testing each state along each trajectory, as described above; a minimal sketch of this check is given below. This is reported in Figure 5, separately for successful and unsuccessful trajectories.
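The following is a minimal sketch of such a sampling-based check, assuming a batched value estimator and a batched learned dynamics model; the sample sizes, action bounds, and the `action_dim` attribute are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def is_estimated_t_local_optimum(s, goal, value_fn, model, dataset_states, rng,
                                 n_random=256, n_actions=256,
                                 action_low=-1.0, action_high=1.0):
    """Sampling-based check for the two inequalities of Definition 5.1.

    value_fn(states, goal) returns per-state value estimates and
    model(states, actions) samples next states; both are assumed batched.
    Random dataset states stand in for the state space, and uniformly sampled
    actions approximate the one-step reachable set T({s}).
    """
    v_s = float(value_fn(s[None], goal)[0])

    # (i) does some sampled state have a higher value? (first inequality)
    idx = rng.integers(len(dataset_states), size=n_random)
    not_global_max = np.max(value_fn(dataset_states[idx], goal)) > v_s

    # (ii) does s dominate its estimated one-step neighborhood? (second inequality)
    action_dim = model.action_dim  # assumed attribute of the learned model
    actions = rng.uniform(action_low, action_high, size=(n_actions, action_dim))
    next_states = model(np.repeat(s[None], n_actions, axis=0), actions)
    dominates_neighbors = v_s > np.max(value_fn(next_states, goal))

    return bool(not_global_max and dominates_neighbors)
```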
We observe that, in most environments, trajectories which avoid T-local optima are generally more likely to succeed; this tendency fades in environments which only require short-term planning (e.g., fetch_push). We note that these estimates are sample-based, and thus noisy by nature, especially in low-data regimes; we therefore expand on this analysis in Appendix B. For illustrative purposes, we also visualize a T-local optimum in the value landscape over maze_large in Figure 4.

Figure 5: Occurrence of T-local optima of depth $k \ge H$ along successful and unsuccessful trajectories generated through model-based planning with horizon $H = 15$, estimated via sampling. A horizontal bar marks the median.

6 Mitigating Local and Global Value Artifacts

Having shown that T-local optima are estimation artifacts, we now focus on categorizing and mitigating their occurrence. A first observation, which confirms the results in Section 4, is that model-based planning is robust to T-local optima of small depth. This is further argued in Section 6.1. However, this does not address the occurrence of global estimation artifacts (i.e., T-local optima of large depth). We thus propose a graph-based value aggregation scheme for this purpose in Section 6.2. Finally, in Section 6.3 we empirically show how combining these two components can address estimation artifacts both locally and globally, and consistently improves goal-reaching performance.

6.1 Model-based Planning for Local Value Artifacts

We will now provide further arguments for the effectiveness of model-based planning over inaccurate value landscapes, supported by Figure 2. Model-based planning is a model-predictive control (MPC) algorithm, which finds an action sequence $(a_0, \dots, a_{H-1})$ by optimizing the RL objective through an H-step value estimate for a given horizon $H$, dynamics model $\hat{P}$, state $s_0$, and goal $g$:

$$(a_0, \dots, a_{H-1}) = \arg\max_{(a_0, \dots, a_{H-1}) \in \mathcal{A}^H} \sum_{i=0}^{H-1} \gamma^i R(s_i; g) + \gamma^H V(s_H; g) \quad \text{with } s_{i+1} \sim \hat{P}(\cdot \mid s_i, a_i), \quad (2)$$

and executes the first action $a_0$. We remark that in our setting, due to the extreme sparsity of $R(s; g) = \mathbb{1}_{s = g}$ in continuous spaces, the objective in Equation 2 simplifies to the maximization of $V(s_H; g)$ alone. For the same reason, it is easy to show that an actor network trained through the policy gradient is optimizing the same objective as model-based planning with horizon $H = 1$:

$$a_0 = \arg\max_{a_0 \in \mathcal{A}} Q(s_0, a_0; g) = \arg\max_{a_0 \in \mathcal{A}} R(s_0; g) + \gamma V(s_1; g) = \arg\max_{a_0 \in \mathcal{A}} V(s_1; g), \quad (3)$$

with $s_1 \sim P(\cdot \mid s_0, a_0)$. Under certain assumptions, such as exact dynamics estimation (i.e., $\hat{P} = P$), exact optimization of Equation 2, and the existence of self-loops in the MDP, it is possible to show that (i) model-based planning from a T-local optimum point of depth $k$ is guaranteed to fail if $H \le k$, and (ii) model-based planning is guaranteed to reach a state with a higher value if the horizon $H$ is greater than the depth of the largest T-local optima over the value landscape. For a complete argument, see Appendix A.3. As a consequence, actor networks can be trapped in all local optima, while model-based planning with horizon $H > 1$ can potentially escape local optima of depth $k < H$, and is therefore potentially more robust to such estimation artifacts, as shown empirically in Figure 2.
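As a concrete illustration, the following is a minimal sketch of zero-order planning over a learned value landscape in the spirit of Equation 2, using the sparse-reward simplification (only the terminal value $V(s_H; g)$ is scored). It implements a plain CEM loop rather than the iCEM variant used in the experiments, assumes batched `model` and `value_fn` callables, and all names and default parameters are illustrative.

```python
import numpy as np

def plan_action(s0, goal, model, value_fn, rng, horizon=15, iters=3,
                pop=400, elite_frac=0.1, action_low=-1.0, action_high=1.0):
    """Cross-entropy method over open-loop action sequences.

    Candidates are rolled out through the learned model and scored by the
    terminal value V(s_H; g) alone, as in the sparse-reward case of Eq. 2.
    Returns the first action of the fitted elite distribution (MPC-style).
    """
    action_dim = model.action_dim  # assumed attribute of the learned model
    mean = np.zeros((horizon, action_dim))
    std = 0.5 * np.ones((horizon, action_dim))
    n_elite = max(1, int(elite_frac * pop))

    for _ in range(iters):
        # sample candidate action sequences and clip them to the action bounds
        seqs = np.clip(mean + std * rng.standard_normal((pop, horizon, action_dim)),
                       action_low, action_high)
        # roll out every candidate through the learned dynamics model
        states = np.repeat(s0[None], pop, axis=0)
        for t in range(horizon):
            states = model(states, seqs[:, t])
        scores = value_fn(states, goal)            # terminal value V(s_H; g)
        elite = seqs[np.argsort(scores)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]  # execute only the first action, then replan
```

In a receding-horizon loop, only `mean[0]` is executed before replanning from the next observed state, matching the MPC scheme described above.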
6.2 Graph-based Aggregation for Global Value Artifacts

While we have shown that model-based planning is robust to local artifacts, the correction of global estimation artifacts, which span beyond the effective planning horizon $H$, remains a significant challenge. We propose to address it by grounding the value function in states observed during exploration, such that long-horizon value estimates are computed by aggregating short-horizon (i.e., local) estimates. Given the current state $s \in \mathcal{S}$ and the goal $g \in \mathcal{G}$, we build a sparse graph by sampling vertices from the empirical state distribution $\mathcal{D}_s$ of the replay buffer, and connecting them with edges weighted by the learned goal-conditioned value $\hat{V}$:

$$G = (\mathcal{V}, \mathcal{E}) \quad \text{with } \mathcal{V} \sim \mathcal{D}_s \cup \{s, g\} \text{ and } \mathcal{E} = \{(s', s'', \hat{V}(s'; s'')) \mid (s', s'') \in \mathcal{V} \times \mathcal{V}\}. \quad (4)$$

Crucially, we then prune the graph by removing all edges that encode long-horizon estimates, i.e., edges whose weight is below a threshold $V_{\min}$. In practice, we set $V_{\min}$ to the maximum value which still ensures connectivity between the agent's state $s$ and the goal $g$ along the graph. This can be computed efficiently, as detailed in Appendix D. Once the graph has been constructed and pruned, we simply compute value estimates by minimizing the product of value estimates along all paths to the goal:

$$\tilde{V}(s; g) = \min_{(s_0, \dots, s_T)} \prod_{i=0}^{T-1} \hat{V}(s_i; s_{i+1}), \quad \text{where } (s_0, \dots, s_T) \text{ is a path between } s \text{ and } g \text{ on } G. \quad (5)$$

This value aggregation scheme differentiates itself from existing methods [12] in its vertex sampling phase, which relies on density estimation to address skewed state distributions, and in its pruning phase, which does not depend on hyperparameters. Crucially, the final value aggregate can be directly adapted as a cost function for model-based planning, and is not used as a proxy for subgoal selection. These differences and their consequences are described extensively in Appendix D. In practice, this scheme can mitigate global T-local optima by replacing long-horizon value estimates with an aggregation of short-horizon estimates. Such short-horizon estimates have been shown to be generally more accurate, and useful in containing value overestimation issues [12], which can be considered a generalization of T-local optima.

For reference, we now provide a visual intuition for the effect of the graph-based value aggregation scheme we propose. We highlight this in maze_large, as it is prone to global estimation artifacts, and allows straightforward visualization. The blue column in Figure 6 plots value functions $\{\hat{V}(s; g_i)\}_{i=0}^2$ learned through TD3 against x-y coordinates for three different goals. The green column reproduces the same plots, but estimates values through graph-based value aggregation, as described in this section, resulting in corrected estimates $\{\tilde{V}(s; g_i)\}_{i=0}^2$. We further compute expert trajectories for all three goals, in red, and plot $\hat{V}$ and $\tilde{V}$ along them, resulting in the three plots on the right. When comparing value estimates with optimal values ($V^*$, in orange, computed in closed form), we observe that graph-based aggregation results in more accurate estimates. Crucially, it is able to address global estimation artifacts, which result in a bleeding effect for $\hat{V}(s; g_0)$ (on the left), or in large hills in $\hat{V}(s_t; g_0)$ (on the right, in blue).

Figure 6: Learned values $\hat{V}(s; g_i)$ as a function of x-y coordinates in maze_large (blue column). Graph-based value aggregation results in corrected estimates $\tilde{V}(s; g_i)$ (green column). Values along optimal trajectories are compared to the optimal value functions $V^*(s_t; g_i)$ on the right.
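The following is a minimal sketch of the graph construction and connectivity-preserving pruning described above. It assumes a batched estimator `value_fn(starts, goals)` for $\hat{V}$, uses networkx for the graph search (an implementation choice not specified in the paper), and aggregates local estimates as the value product along a single best path (a shortest path under $-\log \hat{V}$ weights); the exact path aggregation of Equation 5 and the density-based vertex sampling of Appendix D are simplified away here.

```python
import numpy as np
import networkx as nx

def aggregate_value(s, goal, nodes, value_fn, n_levels=20):
    """Graph-based aggregation of a learned goal-conditioned value estimate.

    Builds a dense directed graph over sampled states plus {s, goal}, prunes
    edges whose value falls below the largest threshold V_min that keeps s and
    goal connected, and aggregates local estimates along one path as a product.
    """
    points = np.concatenate([nodes, s[None], goal[None]], axis=0)
    n = len(points)
    src, dst = n - 2, n - 1

    # pairwise local estimates V_hat(s'; s'') for all ordered pairs of vertices
    ii, jj = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    vals = value_fn(points[ii.ravel()], points[jj.ravel()]).reshape(n, n)
    vals = np.clip(vals, 1e-6, 1.0)

    def pruned_graph(v_min):
        g = nx.DiGraph()
        g.add_nodes_from(range(n))
        for i in range(n):
            for j in range(n):
                if i != j and vals[i, j] >= v_min:
                    # -log turns a product of values into an additive path cost
                    g.add_edge(i, j, weight=-np.log(vals[i, j]))
        return g

    # V_min: the largest pruning threshold that keeps s and goal connected
    best_v_min = 0.0
    for v_min in np.linspace(0.0, 1.0, n_levels):
        if nx.has_path(pruned_graph(v_min), src, dst):
            best_v_min = v_min

    graph = pruned_graph(best_v_min)
    path = nx.shortest_path(graph, src, dst, weight="weight")
    cost = nx.path_weight(graph, path, weight="weight")
    return float(np.exp(-cost))  # product of local values along the chosen path
```

The resulting scalar can directly replace $\hat{V}(s_H; g)$ as the terminal score inside the planner sketched in Section 6.1.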
Figure 7: Performance of an actor network (TD3) against model-based planning without (TD3+MBP) and with graph-based value aggregation (TD3+MBP+Aggregation). Model-based planning on aggregated values addresses both local and global estimation artifacts, and performs well across diverse environments.

6.3 Experimental Evaluation

Having evaluated diverse value learning algorithms, and described a class of harmful estimation artifacts, we now set out to evaluate how such artifacts can be addressed both locally and globally. Figure 7 compares the performance of actor networks with that of model-based planning, with and without graph-based value aggregation. We observe that model-based planning improves performance in most tasks, as it addresses T-local optima of limited depth. However, it fails to correct global artifacts in long-horizon tasks (i.e., maze_large and pinpad). In this case, we observe that graph-based value aggregation increases success rates significantly. Furthermore, we observe that our method, which combines these two components, consistently achieves strong performance across diverse environments. In each plot, a dashed line represents the performance of model-based planning with an external sparse cost (for some $\epsilon > 0$). On one hand, its strong performance in environments with shorter optimal trajectories confirms the accuracy of the learned model. On the other, its failure in long-horizon tasks shows the limitation of planning in the absence of a learned value function, which has been an important motivation for this work.

7 Discussion

This work is concerned with the challenge of achieving goals after an unsupervised exploration phase, without any further interaction with the environment. In this challenging setting, we observe that simple actor-critic methods are surprisingly competitive with model-based or offline algorithms, but their actor networks are often suboptimal due to estimation artifacts in the learned value function. After characterizing these artifacts by leveraging the goal-conditioned nature of the problem, we show how model-based planning can address this problem locally, how graph-based value aggregation can apply global corrections, and how their combination works consistently across diverse environments.

Limitations With respect to simply querying actor networks, this method comes at a cost. In particular, it increases computational requirements and risks introducing further artifacts due to model exploitation, or to the discretization involved in graph-based aggregation (see Appendix D). Furthermore, the proposed method is only applied at inference and, while it's able to mitigate learning artifacts, it does not directly correct them by updating the value estimator. Nevertheless, within our settings, we find the impact of these limitations to be, in practice, modest. Our method represents a strong practical approach for goal-reaching after curious exploration, and sheds light on interesting properties of both offline and goal-conditioned problems. An exciting future avenue would further leverage these formal insights, possibly through a tighter integration of value learning and value correction.
We thus believe that further exploring the interplay between a theoretically sound analysis of the properties of goal-conditioned value function, and the design of practical algorithms, represents a rich research opportunity. Acknowledgments and Disclosure of Funding We thank Andreas Krause, Nico Gürtler, Pierre Schumacher, Núria Armengol Urpí and Cansu Sancaktar for their help throughout the project, and the anonymous reviewers for their valuable feedback. Marco Bagatella is supported by the Max Planck ETH Center for Learning Systems. Georg Martius is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 Project number 390727645. We acknowledge the support from the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039B). [1] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. Mc Grew, J. Tobin, O. Pieter Abbeel, and W. Zaremba. Hindsight experience replay. Advances in Neural Information Processing Systems, 2017. [2] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. Advances in Neural Information Processing Systems, 2016. [3] Y. Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2019. [4] H. Charlesworth and G. Montana. Plangan: Model-based planning with sparse rewards and multiple goals. In Advances in Neural Information Processing Systems, 2020. [5] Y. Chebotar, K. Hausman, Y. Lu, T. Xiao, D. Kalashnikov, J. Varley, A. Irpan, B. Eysenbach, R. C. Julian, C. Finn, et al. Actionable models: Unsupervised offline reinforcement learning of robotic skills. In International Conference on Machine Learning, 2021. [6] X. Chen, A. Ghadirzadeh, T. Yu, J. Wang, A. Y. Gao, W. Li, L. Bin, C. Finn, and C. Zhang. Lapo: Latent-variable advantage-weighted policy optimization for offline reinforcement learning. Advances in Neural Information Processing Systems, 2022. [7] N. Chentanez, A. Barto, and S. Singh. Intrinsically motivated reinforcement learning. Advances in Neural Information Processing Systems, 2004. [8] K. Chua, R. Calandra, R. Mc Allister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in Neural Information Processing Systems, 2018. [9] P. Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613 624, 1993. [10] A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune. First return, then explore. Nature, 590(7847):580 586, 2021. [11] S. Emmons, B. Eysenbach, I. Kostrikov, and S. Levine. Rvs: What is essential for offline RL via supervised learning? In International Conference on Learning Representations, 2022. [12] B. Eysenbach, R. R. Salakhutdinov, and S. Levine. Search on the replay buffer: Bridging planning and reinforcement learning. Advances in Neural Information Processing Systems, 2019. [13] Farama-Foundation. Gymnasium-robotics. https://github.com/ Farama-Foundation/Gymnasium-Robotics, 2023. [14] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4rl: Datasets for deep data-driven reinforcement learning, 2020. [15] S. Fujimoto, H. Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, 2018. [16] N. Gürtler, S. Blaes, P. Kolev, F. Widmaier, M. Wuthrich, S. Bauer, B. Schölkopf, and G. Martius. 
Benchmarking offline reinforcement learning on real-robot hardware. In The Eleventh International Conference on Learning Representations, 2023. [17] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018. [18] D. Hafner, K.-H. Lee, I. Fischer, and P. Abbeel. Deep hierarchical planning from pixels. In Advances in Neural Information Processing Systems, 2022. [19] M. Janner, J. Fu, M. Zhang, and S. Levine. When to trust your model: Model-based policy optimization. Advances in Neural Information Processing Systems, 2019. [20] L. P. Kaelbling. Learning to achieve goals. In IJCAI, volume 2, pages 1094 8, 1993. [21] D. P. K. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. [22] I. Kostrikov, A. Nair, and S. Levine. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022. [23] N. Lambert, M. Wulfmeier, W. Whitney, A. Byravan, M. Bloesch, V. Dasagi, T. Hertweck, and M. Riedmiller. The challenges of exploration for offline reinforcement learning. ar Xiv preprint ar Xiv:2201.11861, 2022. [24] M. Laskin, D. Yarats, H. Liu, K. Lee, A. Zhan, K. Lu, C. Cang, L. Pinto, and P. Abbeel. URLB: Unsupervised reinforcement learning benchmark. In Deep RL Workshop Neur IPS 2021, 2021. [25] J. Li, C. Tang, M. Tomizuka, and W. Zhan. Hierarchical planning through goal-conditioned offline reinforcement learning. IEEE Robotics and Automation Letters, 2022. [26] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016. [27] K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch. Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control. In International Conference on Learning Representations (ICLR), 2019. [28] R. Mendonca, O. Rybkin, K. Daniilidis, D. Hafner, and D. Pathak. Discovering and achieving goals via world models. Advances in Neural Information Processing Systems, 2021. [29] R. Mendonca, S. Bahl, and D. Pathak. Alan: Autonomously exploring robotic agents in the real world. ar Xiv preprint ar Xiv:2302.06604, 2023. [30] L. Mezghani, S. Sukhbaatar, P. Bojanowski, A. Lazaric, and K. Alahari. Learning goalconditioned policies offline with self-supervised reward shaping. In 6th Annual Conference on Robot Learning, 2022. [31] S. Nasiriany, V. Pong, S. Lin, and S. Levine. Planning with goal-conditioned policies. Advances in Neural Information Processing Systems, 32, 2019. [32] P.-Y. Oudeyer, F. Kaplan, and V. V. Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265 286, 2007. [33] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by selfsupervised prediction. In International Conference on Machine Learning, 2017. [34] L. Pineda, B. Amos, A. Zhang, N. O. Lambert, and R. Calandra. Mbrl-lib: A modular library for model-based reinforcement learning. Arxiv, 2021. [35] C. Pinneri, S. Sawant, S. Blaes, J. Achterhold, J. Stueckler, M. Rolinek, and G. Martius. Sampleefficient cross-entropy method for real-time planning. In Conference on Robot Learning, 2021. [36] S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-maron, M. 
Giménez, Y. Sulsky, J. Kay, J. T. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y. Chen, R. Hadsell, O. Vinyals, M. Bordbar, and N. de Freitas. A generalist agent. Transactions on Machine Learning Research, 2022. [37] R. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and computing in applied probability, 1(2):127 190, 1999. [38] C. Sancaktar, S. Blaes, and G. Martius. Curious exploration via structured world models yields zero-shot object manipulation. In Advances in Neural Information Processing Systems. Curran Associates, Inc., Dec. 2022. [39] N. Savinov, A. Dosovitskiy, and V. Koltun. Semi-parametric topological memory for navigation. In International Conference on Learning Representations, 2018. [40] T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In International Conference on Machine Learning, 2015. [41] J. Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international Conference on simulation of adaptive behavior: From animals to animats, pages 222 227, 1991. [42] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604 609, 2020. [43] R. Sekar, O. Rybkin, K. Daniilidis, P. Abbeel, D. Hafner, and D. Pathak. Planning to explore via self-supervised world models. In International Conference on Machine Learning, 2020. [44] B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. In International Conference on Learning Representations, 2016. [45] O. E. L. Team, A. Stooke, A. Mahajan, C. Barros, C. Deck, J. Bauer, J. Sygnowski, M. Trebacz, M. Jaderberg, M. Mathieu, et al. Open-ended learning leads to generally capable agents. ar Xiv preprint ar Xiv:2107.12808, 2021. [46] S. Tian, S. Nair, F. Ebert, S. Dasari, B. Eysenbach, C. Finn, and S. Levine. Model-based visual planning with self-supervised functional distances. In International Conference on Learning Representations, 2021. [47] A. Touati and Y. Ollivier. Learning one representation to optimize all rewards. Advances in Neural Information Processing Systems, 2021. [48] A. Touati, J. Rapin, and Y. Ollivier. Does zero-shot reinforcement learning exist? In 3rd Offline RL Workshop: Offline RL as a Launchpad , 2022. [49] M. Vlastelica, S. Blaes, C. Pinneri, and G. Martius. Risk-averse zero-order trajectory optimization. In 5th Annual Conference on Robot Learning, 2021. [50] Z. Wang, A. Novikov, K. Zolna, J. S. Merel, J. T. Springenberg, S. E. Reed, B. Shahriari, N. Siegel, C. Gulcehre, N. Heess, et al. Critic regularized regression. Advances in Neural Information Processing Systems, 2020. [51] R. Yang, Y. Lu, W. Li, H. Sun, M. Fang, Y. Du, X. Li, L. Han, and C. Zhang. Rethinking goalconditioned supervised learning and its connection to offline RL. In International Conference on Learning Representations, 2022. [52] D. Yarats, D. Brandfonbrener, H. Liu, M. Laskin, P. Abbeel, A. Lazaric, and L. Pinto. Don t change the algorithm, change the data: Exploratory data for offline reinforcement learning. In ICLR 2022 Workshop on Generalizable Policy Learning in Physical World, 2022. [53] T. Yu, G. Thomas, L. Yu, S. Ermon, J. Y. Zou, S. Levine, C. Finn, and T. Ma. Mopo: Modelbased offline policy optimization. 
Advances in Neural Information Processing Systems, 2020.

Supplementary Material to "Goal-conditioned Offline Planning from Curious Exploration"

A.1 Proposition 5.3

Let us consider an arbitrary goal $g \in \mathcal{G}$ and the optimal goal-conditioned value function $V^*(s; g)$. Proposition 5.3 states that $V^*(s; g)$ has no T-local optima. The proof follows directly from the Bellman optimality condition, assuming that (1) $R(s; g) = \mathbb{1}_{d(s,g) \le \epsilon}$ for some $\epsilon \ge 0$ and distance function $d(\cdot, \cdot)$, and that (2) the episode terminates when the goal is achieved. For a candidate state $\bar{s}$, let us assume that the first inequality of the condition for being a T-local optimum point is met: $\max_{s' \in \mathcal{S}} V^*(s'; g) > V^*(\bar{s}; g)$.³ It is now sufficient to show that the second inequality (i.e., $V^*(\bar{s}; g) > \max_{s' \in \mathcal{T}(\{\bar{s}\}) \setminus \{\bar{s}\}} V^*(s'; g)$) cannot hold, and therefore $\bar{s}$ cannot be a T-local optimum point. According to Bellman optimality, we have that

$$V^*(\bar{s}; g) = R(\bar{s}; g) + \gamma(\bar{s}; g) \max_{a \in \mathcal{A}} \mathbb{E}_{s' \sim P(\cdot \mid a, \bar{s})} V^*(s'; g), \quad (S6)$$

where we use $\gamma(s; g) = \gamma(1 - R(s; g))$: the factor $(1 - R(s; g))$ acts as the termination signal, enforcing that value propagation stops when the goal is reached (and the episode ends). If $R(\bar{s}; g) = 1$, then $V^*(\bar{s}; g) = 1$, which would break the first inequality, since $\max_{s' \in \mathcal{S}} V^*(s'; g) \le 1$. This implies that $R(\bar{s}; g) = 0$. As a consequence, Equation S6 simplifies to

$$V^*(\bar{s}; g) = \gamma \max_{a \in \mathcal{A}} \mathbb{E}_{s' \sim P(\cdot \mid a, \bar{s})} V^*(s'; g) \le \gamma \max_{s' \in \mathcal{T}(\{\bar{s}\})} V^*(s'; g). \quad (S7)$$

We can now differentiate two cases. If $\max_{s' \in \mathcal{T}(\{\bar{s}\})} V^*(s'; g) = 0$ (i.e., the goal cannot be reached), then we also have that $V^*(\bar{s}; g) = \max_{s' \in \mathcal{T}(\{\bar{s}\}) \setminus \{\bar{s}\}} V^*(s'; g) = 0$, and $\bar{s}$ is not a T-local optimum point. If the goal is instead reachable from at least one state in the next-state distribution (i.e., $\max_{s' \in \mathcal{T}(\{\bar{s}\})} V^*(s'; g) > 0$), we have that

$$V^*(\bar{s}; g) \le \gamma(\bar{s}; g) \max_{s' \in \mathcal{T}(\{\bar{s}\})} V^*(s'; g) < \max_{s' \in \mathcal{T}(\{\bar{s}\})} V^*(s'; g) = \max_{s' \in \mathcal{T}(\{\bar{s}\}) \setminus \{\bar{s}\}} V^*(s'; g).$$

The second inequality follows from the goal-conditioned assumption that $0 < \gamma < 1$, and the final equality is due to the fact that $V^*(\bar{s}; g) < \max_{s' \in \mathcal{T}(\{\bar{s}\})} V^*(s'; g)$ implies that $V^*(\bar{s}; g) \neq \max_{s' \in \mathcal{T}(\{\bar{s}\})} V^*(s'; g)$, so that removing $\bar{s}$ from the set does not change the maximum. Thus, $V^*(\bar{s}; g) > \max_{s' \in \mathcal{T}(\{\bar{s}\}) \setminus \{\bar{s}\}} V^*(s'; g)$ does not hold, and $\bar{s}$ is not a T-local optimum point.

³While a supremum operator $\sup(\cdot)$ would be required in open domains, for simplicity we use $\max(\cdot)$ throughout this work.

A.2 Optimization of Optimal Goal-conditioned Value Functions

Section 4 argues that, assuming that the objective for actor networks is well optimized, the failure of naive actor-critic algorithms is due to estimation errors in the critic network. Accordingly, we show that, under common assumptions, a state-space trajectory which optimizes an optimal goal-conditioned value function $V^*(s; g)$ is guaranteed to achieve its goal $g \in \mathcal{G}$. Furthermore, minimization of the actor loss results in an actor network which produces this exact trajectory. Given a goal $g \in \mathcal{G}$ and an initial state $s_0 \in \mathcal{S}$, we make the following assumptions⁴:

1. $V^*(s_0; g) > 0$, i.e., the goal can be reached from the initial state.
2. For all $s \in \mathcal{S}, a \in \mathcal{A}$, there exists $s' \in \mathcal{S}$ such that $P(s' \mid s, a) = 1$, i.e., the MDP transitions deterministically.

⁴On top of those listed in Section 3.1.

We now consider the state-space trajectory $(s_i)_{i=0}^T$ which optimizes $V^*$ under the dynamics of the MDP, that is

$$s_{i+1} = \arg\max_{s' \in \mathcal{T}(\{s_i\})} V^*(s'; g) \quad \forall\, 0 \le i < T. \quad (S9)$$

We will show that, for a certain finite $T$, $s_T$ achieves the goal, i.e., $R(s_T; g) = 1$ and $R(s_i; g) = 0$ for all $0 \le i < T$. We note that, since the episode terminates when the goal is achieved and $\gamma(s; g) = \gamma(1 - R(s; g))$, Bellman optimality implies that $R(s_T; g) = 1 \iff V^*(s_T; g) = 1$, as shown in Equation S6.
From Equation S6 we can then derive

$$V^*(s_i; g) = R(s_i; g) + \gamma \max_{a \in \mathcal{A}} \mathbb{E}_{s' \sim P(\cdot \mid a, s_i)} V^*(s'; g) = \gamma \max_{a \in \mathcal{A}} \mathbb{E}_{s' \sim P(\cdot \mid a, s_i)} V^*(s'; g) = \gamma \max_{s' \in \mathcal{T}(\{s_i\})} V^*(s'; g) = \gamma V^*(s_{i+1}; g),$$

where the third equality follows from deterministic transitions, and the last from the objective in Equation S9. By applying this equality recursively, we can write $V^*(s_i; g) = \gamma^{-i} V^*(s_0; g)$. After $T = \lceil \log_\gamma V^*(s_0; g) \rceil$ steps,

$$V^*(s_T; g) = \gamma^{-T} V^*(s_0; g) \ge 1. \quad (S11)$$

Since $V^*(\cdot; g) \le 1$, it follows that $V^*(s_T; g) = 1$ and $s_T$ achieves the goal. An actor network is trained to output

$$\pi(s_i; g) = \arg\max_{a \in \mathcal{A}} Q^*(s_i, a; g) = \arg\max_{a \in \mathcal{A}} R(s_i; g) + \gamma \mathbb{E}_{s' \sim P(\cdot \mid a, s_i)} V^*(s'; g) = \arg\max_{a \in \mathcal{A}} \mathbb{E}_{s' \sim P(\cdot \mid a, s_i)} V^*(s'; g).$$

Under deterministic dynamics, the optimal policy therefore produces the trajectory which optimizes the objective in Equation S9, and thus reaches the goal.

A.3 Model-based Planning and T-local Optima

Section 5 pinpoints T-local optima as a class of estimation artifacts in learned value functions. Following Section 6.1, we now present two properties arising from the interaction between model-based planning and a suboptimal value landscape, depending on the depth of T-local optima and the planning horizon.

First, we show that, given a goal $g \in \mathcal{G}$, model-based planning with horizon $H$ from a T-local optimum point $s^\star$ of depth $k \ge H$ fails to achieve the goal. We make the assumptions presented in Section 3.1, and additionally assume:

1. $\hat{P} = P$, i.e., exact dynamics estimation.
2. Model-based planning optimizes its objective exactly (see Equation S13).
3. For every state $s \in \mathcal{S}$ and goal $g \in \mathcal{G}$, if $V(s; g) \ge \max_{s' \in \mathcal{T}(\{s\})} V(s'; g)$, then $\exists a \in \mathcal{A} : P(s \mid s, a) = 1$. This implies the existence of self-loops for a set of states including T-local optimum points.

We recall from Section 6.1 that, in this setting, model-based planning optimizes the following objective:

$$a_0^{H-1} = \arg\max_{a_0^{H-1} \in \mathcal{A}^H} V(s_H; g) \quad \text{with } s_{i+1} \sim \hat{P}(\cdot \mid s_i, a_i), \quad (S13)$$

where $a_0^{H-1}$ denotes the action sequence $(a_0, \dots, a_{H-1})$. We note that, if $s_0 = s^\star$ is a T-local optimum point, then by assumption (3) an action $\bar{a}$ is available such that $P(s^\star \mid s^\star, \bar{a}) = 1$. If $a_i = \bar{a}$ for $0 \le i < H$, then the objective in Equation S13 simply evaluates to $V(s^\star; g)$. Any other action sequence will reach a state within the H-step future state distribution $\mathcal{T}^H(\{s^\star\})$, and will evaluate to a lower or equal value as per the definition of T-local optima of depth $k \ge H$: $V(s^\star; g) \ge \max_{s' \in \mathcal{T}^H(\{s^\star\})} V(s'; g)$. If ties are broken favorably⁵, an MPC model-based planning algorithm will therefore execute $\bar{a}$, which deterministically transitions to the T-local optimum point at each step. Thus, the goal is never reached.

Second, we show that, if the planning horizon $H$ is greater than the depth of all T-local optima over the state space $\mathcal{S}$, open-loop execution of the action sequence returned through model-based planning successfully optimizes the value function (i.e., reaches a state with a higher value). This property holds under additional assumptions, namely:

4. For all $s \in \mathcal{S}, a \in \mathcal{A}$, there exists $s' \in \mathcal{S}$ such that $P(s' \mid s, a) = 1$, i.e., the MDP transitions deterministically.
5. The action sequence optimizing the objective for model-based planning (Equation S13) is executed in its entirety before replanning, i.e., open-loop execution of the plan.
6. $V(s_0; g) < \max_{s' \in \mathcal{S}} V(s'; g)$, i.e., the initial state does not maximize the value function.

Since there are no T-local optima of depth $k \ge H$, there must exist a state $s_1 \in \mathcal{T}^{H'}(\{s_0\})$ such that $V(s_1; g) > V(s_0; g)$ and $H' \le H$.
Now, let us consider the state reachable in up to $H$ steps with the highest value:

$$V(s_1; g) = \max_{H' \le H} \max_{s' \in \mathcal{T}^{H'}(\{s_0\})} V(s'; g). \quad (S14)$$

As shown in Equation S13, model-based planning only optimizes over states that can be reached in exactly $H$ steps. It is thus sufficient to show that $s_1 \in \mathcal{T}^H(\{s_0\})$, as this would ensure that

$$V(s_0; g) < V(s_1; g) \le \max_{s_H \in \mathcal{T}^H(\{s_0\})} V(s_H; g) = \max_{a_0^{H-1} \in \mathcal{A}^H} V(s_H; g) \quad \text{with } s_{i+1} \sim \hat{P}(\cdot \mid s_i, a_i).$$

Thus, open-loop execution of the action sequence $a_0^{H-1}$ optimizing Equation S13 would reach a state with a higher value (in this case, $s_H$). We recall that $s_1 \in \mathcal{T}^{H'}(\{s_0\})$ for some $H' \le H$. We will now show that $s_1 \in \mathcal{T}^H(\{s_0\})$ by contradiction. Let us assume that $s_1 \notin \mathcal{T}^H(\{s_0\})$, which implies $H' < H$. We thus have that

$$V(s_1; g) = \max_{H'' \le H} \max_{s' \in \mathcal{T}^{H''}(\{s_0\})} V(s'; g) \ge \max_{s' \in \mathcal{T}^{H'+1}(\{s_0\})} V(s'; g) \ge \max_{s' \in \mathcal{T}(\{s_1\})} V(s'; g),$$

since $\mathcal{T}(\{s_1\}) \subseteq \mathcal{T}^{H'+1}(\{s_0\})$ and $H' + 1 \le H$. By assumption (3), there thus exists an action $a \in \mathcal{A}$ with $P(s_1 \mid s_1, a) = 1$, so that $s_1 \in \mathcal{T}^{H'+1}(\{s_0\}) \subseteq \dots \subseteq \mathcal{T}^H(\{s_0\})$, contradicting the assumption. Hence $s_1 \in \mathcal{T}^H(\{s_0\})$, which concludes the argument.

In terms of inference speed, querying an actor network runs at more than 100 Hz, while model-based planning with and without aggregation runs at 3 Hz and 0.3 Hz, respectively. An optimized implementation would bring significant speedups, e.g., through precomputation and parallelization of graph search, which could address the significant slowdown.

Table S3: Hyperparameters for value learning.

| Hyperparameter | Value |
|---|---|
| # of critic layers | 2 |
| # of units per critic layer | 512 |
| # of actor layers | 2 |
| # of units per actor layer | 512 |
| Batch size | 512 |
| Critic learning rate | 1e-5 |
| Actor learning rate | 1e-5 |
| Activation function | ReLU |
| Polyak coefficient | 0.995 |
| Target noise | 0.2 |
| Target noise clipping threshold | 0.5 |
| Policy delay | 2 |
| Policy squashing | True |
| p_g | 0.75 |
| p_geo | 0.2 |
| (MBPO) # updates before rollouts | 500 |
| (MBPO) # of rollouts | 5000 |
| (MBPO) Rollout horizon | 5 |
| (MBPO) Buffer capacity | 1M |
| (MBPO) Real data ratio | 0.0 |
| (MOPO) Penalty coefficient λ | 0.1 |
| (CRR) β | 1.0 |
| (CRR) # of action samples | 4 |
| (CRR) Advantage type | mean |
| (CRR) Weight type | exponential |
| (CRR) Maximum weight | 20.0 |

Table S4: Hyperparameters for model-based planning.

| Hyperparameter | Value |
|---|---|
| # of iterations | 3 |
| Population size | 400 |
| Elite ratio | 0.01 |
| α | 0.1 |
| Population decay | False |
| Colored noise exponent | 3.0 |
| Fraction of kept elites | 0.3 |

J Open Challenges and Negative Results

Our method provides insights on how to mitigate the occurrence of estimation artifacts in learned value functions. However, it does not correct them by finetuning the neural value estimator. This possibility is interesting, as it would bring no computational overhead, and was broadly explored. TD finetuning with CQL-like penalties on the value of T-local optima was attempted, but unsuccessful. In this context, we found such correction strategies to be sensitive to hyperparameters, and to often result in overly pessimistic estimates. We, however, stress the significance of direct value correction as an exciting direction for future research.

K Numerical Results

Table S5 reports success rates from Figures 2 and 7 (main paper) in numerical form.
Table S5: Average success rates with 90% bootstrap confidence intervals from Figures 2 and 7 (main paper); each cell reports lower bound < mean < upper bound.

| Method | maze_large | pinpad | fetch_push | kitchen |
|---|---|---|---|---|
| CRR | 0.095 < 0.173 < 0.253 | 0.027 < 0.055 < 0.088 | 0.083 < 0.083 < 0.083 | 0.253 < 0.331 < 0.412 |
| MBPO | 0.000 < 0.000 < 0.000 | 0.017 < 0.028 < 0.040 | 0.083 < 0.083 < 0.083 | 0.111 < 0.143 < 0.174 |
| MOPO | 0.023 < 0.049 < 0.079 | 0.033 < 0.050 < 0.072 | 0.083 < 0.083 < 0.083 | 0.015 < 0.046 < 0.079 |
| TD3 | 0.079 < 0.144 < 0.206 | 0.244 < 0.267 < 0.291 | 0.083 < 0.083 < 0.083 | 0.079 < 0.110 < 0.142 |
| CRR+MBP | 0.000 < 0.031 < 0.063 | 0.283 < 0.321 < 0.361 | 0.152 < 0.190 < 0.231 | 0.174 < 0.206 < 0.238 |
| MBPO+MBP | 0.024 < 0.060 < 0.098 | 0.043 < 0.068 < 0.091 | 0.638 < 0.719 < 0.791 | 0.174 < 0.222 < 0.285 |
| MOPO+MBP | 0.000 < 0.025 < 0.098 | 0.013 < 0.025 < 0.038 | 0.824 < 0.884 < 0.935 | 0.365 < 0.444 < 0.523 |
| TD3+MBP | 0.206 < 0.301 < 0.412 | 0.190 < 0.232 < 0.274 | 0.638 < 0.693 < 0.745 | 0.333 < 0.379 < 0.428 |
| TD3+MBP+Aggregation | 0.627 < 0.705 < 0.814 | 0.314 < 0.410 < 0.500 | 0.587 < 0.647 < 0.708 | 0.301 < 0.365 < 0.428 |
| MBP+Sparse | 0.030 | 0.100 | 0.730 | 0.430 |