# Goal-conditioned Offline Planning from Curious Exploration

Marco Bagatella, ETH Zürich & Max Planck Institute for Intelligent Systems, Tübingen, Germany (mbagatella@ethz.ch). Georg Martius, University of Tübingen & Max Planck Institute for Intelligent Systems, Tübingen, Germany (georg.martius@uni-tuebingen.de).

Curiosity has established itself as a powerful exploration strategy in deep reinforcement learning. Notably, leveraging expected future novelty as intrinsic motivation has been shown to efficiently generate exploratory trajectories, as well as a robust dynamics model. We consider the challenge of extracting goal-conditioned behavior from the products of such unsupervised exploration techniques, without any additional environment interaction. We find that conventional goal-conditioned reinforcement learning approaches for extracting a value function and policy fall short in this difficult offline setting. By analyzing the geometry of optimal goal-conditioned value functions, we relate this issue to a specific class of estimation artifacts in learned values. In order to mitigate their occurrence, we propose to combine model-based planning over learned value landscapes with a graph-based value aggregation scheme. We show how this combination can correct both local and global artifacts, obtaining significant improvements in zero-shot goal-reaching performance across diverse simulated environments.

1 Introduction

While the standard paradigm in reinforcement learning (RL) is aimed at maximizing a specific reward signal, the focus of the field has gradually shifted towards the pursuit of generally capable agents [1, 20, 40]. This work focuses on a promising framework for learning diverse behaviors with minimal supervision, by decomposing the problem into two consecutive phases. The first phase, referred to as exploratory [28, 43] or intrinsic [38], consists of unsupervised exploration of an environment, aiming to collect diverse experiences. These can be leveraged during the second phase, called exploitatory or extrinsic, to distill a family of specific behaviors without further environment interaction. This paradigm unifies several existing frameworks, including zero-shot deep reinforcement learning [28, 43], Explore2Offline [23], ExORL [52] and model-based zero-shot planning [38]. An established technique for the intrinsic phase is curiosity-based exploration, which leverages some formulation of surprise as a self-supervised reward signal. In practice, this is often done by estimating the uncertainty of the forward dynamics predictions of a learned model [49]. At the end of the exploratory phase, this method produces a sequence of collected trajectories, as well as a trained dynamics model. At its core, our work focuses on the subsequent extrinsic phase, and is concerned with extracting strong goal-conditioned behavior from these two products, without further access to the environment.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Figure 1: This work focuses on extracting goal-conditioned behavior from exploratory trajectories $\mathcal{D}$ and a trained dynamics model $\hat{P}$. We propose to learn a value function through goal-conditioned RL, plan with zero-order trajectory optimization to mitigate local estimation artifacts, and aggregate value estimates along a graph in order to correct global estimation artifacts.
In its simplest form, this corresponds to learning to control the agent's future state distribution in order to achieve a given goal or, more concretely, to visit a particular state. Existing methods tackle this problem without additional learning through model-based online planning [38], or rely on learning an amortized policy, either during exploration [28, 43] or in a fully offline manner [23, 52]. The former class of approaches requires the definition of a hand-crafted cost function, without which long-horizon, sparse tasks would remain out of reach. On the other hand, the latter class does not rely on a well-shaped cost function, but may suffer from myopic behavior in the absence of explicit planning, and from estimation errors in the learned value function. As we do not assume access to external task-specific cost functions, we first investigate the effectiveness of goal-conditioned RL algorithms for learning a value function. In doing so, we empirically extend the results obtained by Yarats et al. [52] to previously unstudied model-based and goal-conditioned settings. We thus confirm that RL methods not involving offline corrections can be competitive when trained on offline exploratory data. Additionally, through a formal analysis of the geometry of optimal goal-conditioned value functions, we pinpoint a geometric feature of inaccurate value landscapes, which we refer to as T-local optima. After showing the detrimental effects of such learning artifacts on the performance of goal-conditioned RL, we propose to mitigate them through a novel combination of model-based planning with a practical graph-based aggregation scheme for learned value estimates. Crucially, we show how these two components address the same estimation issue on different scales, that is, respectively, locally and globally. We provide a graphical overview in Figure 1 for a simple maze environment.

The contributions of this work can be organized as follows:
- We provide an empirical evaluation of goal-conditioned reinforcement learning algorithms when trained on the outcomes of an unsupervised exploration phase.
- We pinpoint a class of estimation artifacts in learned goal-conditioned value functions, which explains the suboptimality of naive, model-free reinforcement learning approaches in this setting.
- We propose to combine model-based planning with graph-based value aggregation to address the occurrence of both local and global value artifacts.
- We provide the first fully offline method for achieving goals after a curious exploration phase.

After relating our work to existing literature, we evaluate goal-conditioned RL methods when learning from curious exploration (Section 4). We then pinpoint a class of estimation artifacts that are not compatible with optimal goal-conditioned value functions and analyze them through the lens of T-local optima (Section 5). Equipped with these insights, we show how our practical method can mitigate such artifacts (Section 6). A compact description of the resulting method, highlighting its intrinsic and extrinsic phases, can be found in Appendix H.

2 Related Work

Unsupervised Exploration The challenge of exploration has been at the core of reinforcement learning research since its inception.
Within deep reinforcement learning, successful strategies are often derived from count-based [2], prediction-based [41] or direct exploration techniques [10], with the first two traditionally relying on intrinsic motivation [7] computed from state visitation distributions or prediction errors (i.e., surprise), respectively. Depending on the entity on which prediction error is computed, methods can focus on estimating forward dynamics [32, 38, 43, 44], inverse dynamics [33] or possibly random features of the state space [3]. Another important distinction falls between methods that retroactively compute surprise [33] or methods that compute future expected disagreement [38, 43], and proactively seek out uncertain areas in the state space. Overall, such techniques compute an additional intrinsic reward signal, which can be summed to the extrinsic environment reward signal, or used alone for fully unsupervised exploration. In this work, we build upon efficient exploration methods that are proactive, and rely on prediction disagreement of a forward dynamics model [38, 43, 44]. Nevertheless, our method could be easily applicable in conjunction with any aforementioned strategy, as long as it can produce exploratory data, and a dynamics model, which could potentially also be learned offline. Zero-shot Frameworks Unsupervised exploration techniques are capable of efficiently sampling diverse trajectories, without extrinsic reward signals. As a result, they are a promising source of data to learn flexible behaviors. Several works in the literature have proposed that a task-agnostic unsupervised exploration phase be followed by a second phase, optimizing a task-specific reward function only provided a posteriori. This has been formalized as zero-shot reinforcement learning [9, 47, 48]. Notably, this framework conventionally forbids task-specific fine-tuning, explicit planning or computation during the second phase. On the applied side, agents endowed with learned world models [28, 43] have shown promise for zero-shot generalization without additional learning or planning. Other works are less strict in their constraints: Sancaktar et al. [38] also propose an intrinsic exploration phase followed by zero-shot generalization, but the latter relies on task-specific online planning. Furthermore, two concurrent works [23, 52] study the application of offline deep reinforcement learning for the second phase. However, these methods are not designed for goalconditioned tasks, and thus require an external task-specific reward function to perform relabeling during training. Our framework fits within this general research direction, in that it also involves taskagnostic and task-specific components. The initial phase is task-agnostic: we rely on unsupervised exploration, and train a universal value estimator [40] completely offline, without access to an external task-specific reward signal. In the second phase, we perform model-based planning via zero-order trajectory optimization to control the agent. We remark that, while this requires significant computation, existing algorithms are fast enough to allow real-time inference [35]. Goal-Conditioned Reinforcement Learning The pursuit of generally capable agents has been a ubiquitous effort in artificial intelligence research [36, 42, 45]. Goal-conditioned RL represents a practical framework for solving multiple tasks, by describing each task through a goal embedding, which can then condition reward and value functions. 
Deep RL methods can then be used to estimate values by providing the goal as an additional input to their (often neural) function approximators [40]. A ubiquitous technique in these settings is that of goal relabeling [1], which substitutes a trajectory's goal with one sampled a posteriori. Goal-conditioned methods have been shown to be successful in the offline setting [46], often when coupled with negative sampling methods [5], but have been mostly evaluated on standard offline datasets. An alternative line of work frames the problem as supervised learning instead [11], and shows connections to the original reinforcement learning objective [51]. Additionally, several works have explored the combination of goal-conditioned learning with model-based planning [4, 25, 31]. This paper studies the natural application of goal-conditioned RL after an unsupervised exploration phase, which allows a straightforward definition of the multiple tasks to be learned. Moreover, our main formal insights in Section 5 build on top of common assumptions in goal-conditioned RL.

Graph-based Value Estimation Several existing works rely on graph abstractions in order to retrieve global structure and address value estimation issues. In particular, Eysenbach et al. [12] and Savinov et al. [39] propose high-level subgoal selection procedures based on shortest paths between states sampled from the replay buffer. Another promising approach proposes to build a graph during training, and to leverage its structure when distilling a policy [30]. In contrast, our method relies on discrete structures only at inference, and directly integrates these structures into model-based planning, thus removing the need for subgoal selection routines.

3 Preliminaries

3.1 Background and Notation

The environments of interest can be modeled as Markov decision processes (MDPs) $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, P, R, \mu_0, \gamma\}$, where $\mathcal{S}$ and $\mathcal{A}$ are (possibly continuous) state and action spaces, $P : \mathcal{S} \times \mathcal{A} \to \Omega(\mathcal{S})$ is the transition probability distribution¹, $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function, $\mu_0 \in \Omega(\mathcal{S})$ is an initial state distribution, and $\gamma$ is a discount factor. Goal-conditioned reinforcement learning augments this conventional formulation by sampling a goal $g \sim p_G$, where $p_G$ is a distribution over a goal space $\mathcal{G}$. While goal abstraction is possible, for simplicity of notation we consider $\mathcal{G} \subseteq \mathcal{S}$ (i.e., goal-conditioned behavior corresponds to reaching given states). The reward function is then defined as $R(s, a; g) = R(s; g) = \mathbb{1}_{d(s,g) \le \epsilon}$ for some $\epsilon \ge 0$ and under some distance metric $d(\cdot, \cdot)$. We additionally assume that episodes terminate when a goal is achieved; the maximum undiscounted return within an episode is thus one. It is now possible to define a goal-conditioned policy $\pi : \mathcal{S} \times \mathcal{G} \to \Omega(\mathcal{A})$ and its value function $V^\pi(s_0; g) = \mathbb{E}_{P, \pi} \sum_{t=0}^{\infty} \gamma^t R(s_t; g)$, with $s_t \sim P(\cdot \mid s_{t-1}, a_{t-1})$ and $a_t \sim \pi(s_t; g)$. The objective of a goal-conditioned RL agent can then be expressed as finding a stochastic policy $\pi^* = \arg\max_\pi \mathbb{E}_{g \sim p_G,\, s \sim \mu_0} V^\pi(s; g)$. When learning from curious exploration, we additionally assume the availability of a set $\mathcal{D} = \{(s_0, a_0, s_1, a_1, \dots, s_{T_i})_i\}_{i=1}^N$ of $N$ exploratory state-action trajectories of possibly different lengths $T_i$, as well as access to an estimate of the probabilistic transition function $\hat{P} \approx P$, i.e., a dynamics model.

¹We use $\Omega(\mathcal{S})$ to represent the space of probability distributions over $\mathcal{S}$.
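To make the reward and termination convention above concrete, the following is a minimal sketch in Python, assuming a Euclidean distance for $d$ and an illustrative threshold $\epsilon$; the function names are illustrative and not taken from the paper's codebase.

```python
import numpy as np

def goal_reward(state, goal, eps=0.0):
    """Sparse goal-conditioned reward R(s; g) = 1 if d(s, g) <= eps, else 0 (Euclidean d)."""
    return float(np.linalg.norm(np.asarray(state) - np.asarray(goal)) <= eps)

def episode_return(states, goal, gamma=0.99, eps=0.05):
    """Discounted return of a state trajectory under the sparse reward.

    The episode terminates as soon as the goal is achieved, so the maximum
    undiscounted return is one, matching the convention in Section 3.1.
    """
    ret = 0.0
    for t, s in enumerate(states):
        r = goal_reward(s, goal, eps)
        ret += gamma ** t * r
        if r > 0.0:  # termination on goal achievement
            break
    return ret
```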
3.2 Data Collection and Experimental Framework

As our contributions are motivated by empirical observations, we also introduce our experimental pipeline early on. Algorithms are evaluated in four MDP instantiations, namely maze_large and kitchen from D4RL [14], fetch_push from gymnasium-robotics [13] and an adapted pinpad environment [18]. These environments were chosen to drastically differ in dynamics, dimensionality, and task duration. In particular, maze_large involves navigation to distant goals in a 2D maze by controlling the acceleration of a point mass, kitchen requires controlling a 9-DOF robot through diverse tasks, including manipulating a kettle and operating switches, fetch_push is a manipulation task controlled through operational space control, and pinpad requires navigating a room and pressing a sequence of buttons in the right order. For each environment, we collect 200k exploratory transitions by curious exploration, corresponding to 2 hours of real-time interaction at 30 Hz, which is an order of magnitude less than existing benchmarks [14, 24]. We stress that this amount of data is realistically collectible in the real world with minimal supervision, thus providing a strong benchmark for data-efficient learning of general agents. Moreover, the nature of the data is significantly different from that of standard offline datasets, which often contain high-quality demonstrations [14] that can be distilled into policies through naive imitation. The environment's dynamics are estimated through an ensemble of stochastic neural networks, similar to the one proposed in Chua et al. [8]. The intrinsic exploration signal is ensemble disagreement [38, 43, 49], as described in Appendix I.2. All algorithms are evaluated by success rate, that is, the ratio of test episodes in which the commanded goal is achieved. Details can be found in Appendix I.

4 Value Learning from Curious Exploration

Curious exploration produces a learned dynamics model $\hat{P}$ and a set $\mathcal{D}$ of trajectories, lacking both goal and reward labels. In order to achieve arbitrary (and potentially distant) goals at inference time, and considering that a task-specific cost function is not externally provided, it is natural to resort to learning and optimizing a value function. This section evaluates established methods for value learning and studies the quality of their estimates, thus motivating the analysis and method proposed in Sections 5 and 6. A standard approach to train a goal-conditioned value function from unlabeled data involves two steps. The first step consists of a goal-relabeling procedure, which associates each training sample with a goal selected a posteriori, and computes a self-supervised reward signal from it, as briefly explained in Section 4.1. The second step consists of extracting a goal-conditioned value function by applying off-policy RL algorithms to the relabeled data, as analyzed in Section 4.2. While this work focuses on data collected through curious exploration, the methods described can also be applied to general offline goal-conditioned reinforcement learning settings, for which we provide a minimal study in Appendix F.

4.1 Relabeling Curious Exploration through Self-supervised Temporal Distances

We follow the framework proposed by Tian et al. [46] to relabel exploration trajectories, and compute self-supervised reward signals by leveraging temporal distances, without assuming access to external task-specific reward functions. Training samples $(s_t, a_t, s_{t+1})$ are sampled uniformly from the replay buffer $\mathcal{D}$, and need to be assigned a goal $g$ and a reward scalar $r$. Each training sample is assigned a positive goal with probability $p_g$, and a negative goal otherwise. Positive goals are sampled as future states from the same trajectory: $g = s_{t+\tau}$, where $\tau \sim \mathrm{Geo}(p_{\mathrm{geo}})$ follows a geometric distribution with parameter $p_{\mathrm{geo}}$. Negative goals are instead sampled uniformly from the entire dataset, while Tian et al. [46] rely on prior knowledge of the semantics of the state space to extract challenging goals. Rewards for all state-goal pairs are simply computed using the reward function $r = R(s_{t+1}; g) = \mathbb{1}_{s_{t+1} = g}$: the reward is non-zero only for positive goals sampled with $\tau = 1$. This corresponds to a minorizer for the class of reward functions $R(s_{t+1}; g) = \mathbb{1}_{d(s_{t+1}, g) \le \epsilon}$, as setting $\epsilon = 0$ effectively removes the necessity to define a specific distance function $d$ over state and goal spaces. We remark that, while its extreme sparsity prevents its optimization via zero-shot model-based planning, it remains functional as a reward signal for training a value function in hindsight.
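For concreteness, the following is a minimal sketch of this relabeling scheme, assuming trajectories are stored as dictionaries with T+1 states and T actions; the default values of $p_g$ and $p_{\mathrm{geo}}$ follow Table S3, and all names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def relabel(trajectories, rng, p_g=0.75, p_geo=0.2):
    """Sample one relabeled training tuple (s_t, a_t, s_{t+1}, goal, reward).

    Positive goals are future states from the same trajectory, with the offset
    tau drawn from a geometric distribution; negative goals are random states
    from the whole dataset. The reward is 1 only when the very next state is
    the goal (tau = 1), i.e. R(s_{t+1}; g) = 1[s_{t+1} = g].
    """
    # pick a random transition (s_t, a_t, s_{t+1})
    traj = trajectories[rng.integers(len(trajectories))]
    states, actions = traj["states"], traj["actions"]
    t = rng.integers(len(actions))
    s, a, s_next = states[t], actions[t], states[t + 1]

    if rng.random() < p_g:
        # positive goal: future state of the same trajectory, tau ~ Geo(p_geo)
        tau = min(rng.geometric(p_geo), len(states) - 1 - t)
        goal = states[t + tau]
        reward = float(tau == 1)
    else:
        # negative goal: a random state from a randomly chosen trajectory
        other = trajectories[rng.integers(len(trajectories))]
        goal = other["states"][rng.integers(len(other["states"]))]
        reward = 0.0
    return s, a, s_next, goal, reward
```

A tuple produced this way (with `rng = np.random.default_rng()`) can be pushed directly into the replay buffer consumed by any of the off-policy algorithms evaluated in Section 4.2.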
4.2 Value Estimation and Suboptimality

Now that sampling and relabeling procedures have been defined, we focus our attention on which methods can effectively retrieve a goal-conditioned value function $V(s; g)$. In particular, we evaluate off-policy actor-critic algorithms, which are in principle capable of extracting the optimal value function (and policy) from trajectories generated by a suboptimal policy (such as the one used for exploration). We thus ask the following questions:

1. Which off-policy actor-critic algorithm is most suitable for learning a goal-conditioned value function, purely offline, from data collected through unsupervised exploration?
2. How close to optimality is the learned value function?

To answer the first question, we consider several off-policy actor-critic algorithms, and evaluate the learned value function through the success rate of an agent which seeks to optimize it. The choice of algorithms should adhere to the offline setting, as well as to the availability of an estimated dynamics model $\hat{P}$. We thus evaluate TD3 [15] as a default model-free actor-critic algorithm, and compare its performance to CRR [50], a state-of-the-art offline method [16]. We also evaluate two model-based methods, namely MBPO [19] and MOPO [53], the latter of which introduces an uncertainty-based penalty for offline learning. Further value-learning baselines are presented in Appendix C. All methods have access to the same offline exploration dataset $\mathcal{D}$; model-based methods additionally query the dynamics model $\hat{P}$ in order to generate training trajectories. Algorithm-specific hyperparameters were tuned separately for each method through grid search, and kept constant across environments. Each algorithm is allowed the same number of gradient steps (sufficient to ensure convergence for all methods). We implement two strategies for optimizing the value functions: querying the jointly learned actor network, and employing model-based planning through zero-order trajectory optimization (e.g., CEM [37]/iCEM [35]) with the learned model $\hat{P}$ to optimize n-step lookahead value estimates [46].

Figure 2: Performance of actor networks (blue) and model-based planning (orange) on value functions trained through various actor-critic algorithms from the same curious exploration datasets.
Average success rates over 9 seeds are presented with 90% simple bootstrap confidence intervals in Figure 2. To answer the first question, we consider the performance of the best optimizer (often model-based planning, in orange) for each of the value learning methods. We observe that simpler actor-critic algorithms (i.e., TD3) are overall competitive with the remaining value learning methods. MBPO, MOPO and CRR are not consistently better in the considered environments, despite involving offline corrections or relying on the learned model ˆP to generate training trajectories. We hypothesize that this phenomenon can be attributed, at least partially, to the exploratory nature of the data distribution, as exemplified in Figure 3. Additionally, we note that our findings and hypothesis are consistent with those reported by Yarats et al. [52] for non-goal-conditioned tasks. We then turn to examining the performance gap between actor networks and model-based planning. We observe a tendency for model-based planning to outperform actor networks, with the largest improvements in fetch_push, as all learned policies similarly fail in this environment. While this stands true for almost all considered algorithms, CRR poses an exception to the success of model-based planning for two environments, which we hypothesize is due to its reliance on weighted behavior cloning rather than on backpropagation through a learned critic. Figure 3: 200k transitions from the exploration dataset D (left), and from the standard offline dataset provided by D4RL (right). Unsupervised exploration leads to a more uniform state coverage, and eases out-of-distribution issues. We remark that it is possible to show that, under common assumptions, a policy which exactly optimizes an optimal goal-conditioned value function is guaranteed to achieve its goals (see Appendix A.2). The low success rate of actor networks, assuming their objective is well-optimized, is thus a symptom of suboptimality for learned value functions. In Section 5, we argue how this suboptimality can be partially traced back to a class of estimation artifacts; moreover, we show how model-based planning is intrinsically more robust to such artifacts than actor networks (see Section 6.1). Due to its strong empirical performance, TD3 is adopted for value learning through the rest of this work. We however remark that the analysis and the method proposed in Sections 5 and 6 remain valid for each of the RL methods considered. Further details on methods and results presented in this section are reported in Appendix I. 5 Pathologies of Goal-Conditioned Value Functions The performance gap between model-based planning and actor networks described in the previous section is a symptom of estimation issues in learned value functions. In this section, we pinpoint a particular class of geometric features which are not compatible with optimal goal-conditioned value functions. Such features represent a significant failure case for an agent optimizing a learned value function, as they can effectively trap it within a manifold of the state space. Intuitively, this happens when the estimated value of the current state exceeds that of all other reachable states, despite the goal not being achieved. We now formalize this intuition by defining such geometric features as T -local optima. We then show that T -local optima are not present in optimal goal-conditioned value functions, and are therefore estimation artifacts. 
This justifies the method proposed in Section 6, which is designed to mitigate this class of artifacts. Let us define the set operator $\mathcal{T} : \mathcal{P}(\mathcal{S}) \to \mathcal{P}(\mathcal{S})$² that returns the set of states reachable in one step from a given initial set of states $S_0 \subseteq \mathcal{S}$:

$$\mathcal{T}(S_0) = \bigcup_{s_0 \in S_0} \{s' \in \mathcal{S} \mid \exists a \in \mathcal{A} : P(s' \mid s_0, a) > 0\}. \quad (1)$$

²$\mathcal{P}(\cdot)$ represents a power set.

We can now formally introduce a class of geometric features, which we refer to as T-local optima:

Definition 5.1. A state $s^\star \in \mathcal{S}$ is a T-local optimum point for a value function $V(s)$ if and only if
$$\max_{s' \in \mathcal{S}} V(s') > V(s^\star) > \max_{s' \in \mathcal{T}(\{s^\star\}) \setminus \{s^\star\}} V(s').$$

Intuitively, a T-local optimum is a hill in the value landscape over the state space, where proximity is defined by the transition function of the MDP. The first inequality ensures that $s^\star$ does not maximize the value function globally. The second inequality guarantees that, if there is an action $\bar{a} \in \mathcal{A} : P(s^\star \mid s^\star, \bar{a}) = 1$, this action would be chosen by greedily optimizing the value of the next state (i.e., $\bar{a} = \arg\max_{a \in \mathcal{A}} \int_{\mathcal{S}} V(s') P(s' \mid s^\star, a)\, \mathrm{d}s'$), thus trapping the agent in $s^\star$. By extending the previous definition to the multistep case, we can also conveniently define depth as the minimum number of steps required to reach a state with a higher value (i.e., to escape the T-local optimum).

Definition 5.2. A T-local optimum point $s^\star$ has depth $k \ge 1$ if $k$ is the smallest integer such that
$$V(s^\star) < \max_{s' \in \mathcal{T}^{k+1}(\{s^\star\})} V(s').$$

Intuitively, a T-local optimum of depth $k$ dominates the value of states reachable in up to $k$ steps.

Figure 4: Learned value as a function of x-y coordinates in maze_large. Choosing actions that maximize the value of the next state traps the agent in a corner, i.e., a T-local optimum point.

These definitions prepare the introduction of a core observation. Within the setting of goal-conditioned RL, T-local optima are not an intrinsic feature of the value landscape, but rather an estimation artifact.

Proposition 5.3. Given a goal $g \in \mathcal{G}$, an optimal goal-conditioned value function $V^*(s; g)$ has no T-local optima.

The proof follows directly from Bellman optimality of the value function under standard assumptions of goal-conditioned RL introduced in Section 3.1. We report it in its entirety in Appendix A.1. We remark that, in principle, showing that a state $s$ is a T-local optimum point requires evaluating the value function over the support of the next-state distribution, which would require access to the true dynamics of the MDP. Moreover, exhaustive evaluation is unfeasible in continuous state spaces. However, one can bypass these issues by instead estimating, rather than proving, the presence of T-local optima. Given a candidate T-local optimum point $s$, $V(\cdot; g)$ can be evaluated on (i) random states $s_r$ sampled from the exploration data $\mathcal{D}$ instead of the state space $\mathcal{S}$, and (ii) samples $s'$ from the next-state distribution, accessed through the dynamics model $\hat{P}(s' \mid s, a)$ with $a \sim \mathcal{U}(\mathcal{A})$. The maximum operators in Definition 5.1 can then be estimated through sampling; if both inequalities hold, $s$ can be estimated to be a T-local optimum point.

Having defined T-local optima, and shown that they are learning artifacts, we now investigate their impact on goal-reaching performance. We train $n = 30$ bootstrapped value estimators for all environments, and for each of them we generate trajectories through model-based planning with horizon $H$. We then estimate the occurrence of T-local optima of depth $k \ge H$ by testing each state along each trajectory, as described above; a minimal sketch of this check is given below. This is reported in Figure 5, separately for successful and unsuccessful trajectories.
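The following is a minimal sketch of such a sampling-based check, assuming a batched value estimator and a batched learned dynamics model; the sample sizes, action bounds, and the `action_dim` attribute are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def is_estimated_t_local_optimum(s, goal, value_fn, model, dataset_states, rng,
                                 n_random=256, n_actions=256,
                                 action_low=-1.0, action_high=1.0):
    """Sampling-based check for the two inequalities of Definition 5.1.

    value_fn(states, goal) returns per-state value estimates and
    model(states, actions) samples next states; both are assumed batched.
    Random dataset states stand in for the state space, and uniformly sampled
    actions approximate the one-step reachable set T({s}).
    """
    v_s = float(value_fn(s[None], goal)[0])

    # (i) does some sampled state have a higher value? (first inequality)
    idx = rng.integers(len(dataset_states), size=n_random)
    not_global_max = np.max(value_fn(dataset_states[idx], goal)) > v_s

    # (ii) does s dominate its estimated one-step neighborhood? (second inequality)
    action_dim = model.action_dim  # assumed attribute of the learned model
    actions = rng.uniform(action_low, action_high, size=(n_actions, action_dim))
    next_states = model(np.repeat(s[None], n_actions, axis=0), actions)
    dominates_neighbors = v_s > np.max(value_fn(next_states, goal))

    return bool(not_global_max and dominates_neighbors)
```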
We observe that, in most environments, trajectories which avoid T-local optima are generally more likely to succeed; this tendency fades in environments which only require short-term planning (e.g., fetch_push). We note that these estimates are sample-based, and thus noisy by nature, especially in low-data regimes; we therefore expand on this analysis in Appendix B. For illustrative purposes, we also visualize a T-local optimum in the value landscape over maze_large in Figure 4.

Figure 5: Occurrence of T-local optima of depth $k \ge H$ along successful and unsuccessful trajectories generated through model-based planning with horizon $H = 15$, estimated via sampling. A horizontal bar marks the median.

6 Mitigating Local and Global Value Artifacts

Having shown that T-local optima are estimation artifacts, we now focus on categorizing and mitigating their occurrence. A first observation, which confirms the results in Section 4, is that model-based planning is robust to T-local optima of small depth. This is further argued in Section 6.1. However, this does not address the occurrence of global estimation artifacts (i.e., T-local optima of large depth). We thus propose a graph-based value aggregation scheme for this purpose in Section 6.2. Finally, in Section 6.3 we empirically show how combining these two components can address estimation artifacts both locally and globally, and consistently improves goal-reaching performance.

6.1 Model-based Planning for Local Value Artifacts

We will now provide further arguments for the effectiveness of model-based planning over inaccurate value landscapes, supported by Figure 2. Model-based planning is a model-predictive control (MPC) algorithm, which finds an action sequence $(a_0, \dots, a_{H-1})$ by optimizing the RL objective through an H-step value estimate for a given horizon $H$, dynamics model $\hat{P}$, state $s_0$, and goal $g$:

$$(a_0, \dots, a_{H-1}) = \arg\max_{(a_0, \dots, a_{H-1}) \in \mathcal{A}^H} \sum_{i=0}^{H-1} \gamma^i R(s_i; g) + \gamma^H V(s_H; g) \quad \text{with } s_{i+1} \sim \hat{P}(\cdot \mid s_i, a_i), \quad (2)$$

and executes the first action $a_0$. We remark that in our setting, due to the extreme sparsity of $R(s; g) = \mathbb{1}_{s = g}$ in continuous spaces, the objective in Equation 2 simplifies to the maximization of $V(s_H; g)$ alone. For the same reason, it is easy to show that an actor network trained through the policy gradient is optimizing the same objective as model-based planning with horizon $H = 1$:

$$a_0 = \arg\max_{a_0 \in \mathcal{A}} Q(s_0, a_0; g) = \arg\max_{a_0 \in \mathcal{A}} R(s_0; g) + \gamma V(s_1; g) = \arg\max_{a_0 \in \mathcal{A}} V(s_1; g), \quad (3)$$

with $s_1 \sim P(\cdot \mid s_0, a_0)$. Under certain assumptions, such as exact dynamics estimation (i.e., $\hat{P} = P$), exact optimization of Equation 2, and the existence of self-loops in the MDP, it is possible to show that (i) model-based planning from a T-local optimum point of depth $k$ is guaranteed to fail if $H \le k$, and (ii) model-based planning is guaranteed to reach a state with a higher value if the horizon $H$ is greater than the depth of the largest T-local optima over the value landscape. For a complete argument, see Appendix A.3. As a consequence, actor networks can be trapped in all local optima, while model-based planning with horizon $H > 1$ can potentially escape local optima of depth $k < H$, and is therefore potentially more robust to such estimation artifacts, as shown empirically in Figure 2.
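As a concrete illustration, the following is a minimal sketch of zero-order planning over a learned value landscape in the spirit of Equation 2, using the sparse-reward simplification (only the terminal value $V(s_H; g)$ is scored). It implements a plain CEM loop rather than the iCEM variant used in the experiments, assumes batched `model` and `value_fn` callables, and all names and default parameters are illustrative.

```python
import numpy as np

def plan_action(s0, goal, model, value_fn, rng, horizon=15, iters=3,
                pop=400, elite_frac=0.1, action_low=-1.0, action_high=1.0):
    """Cross-entropy method over open-loop action sequences.

    Candidates are rolled out through the learned model and scored by the
    terminal value V(s_H; g) alone, as in the sparse-reward case of Eq. 2.
    Returns the first action of the fitted elite distribution (MPC-style).
    """
    action_dim = model.action_dim  # assumed attribute of the learned model
    mean = np.zeros((horizon, action_dim))
    std = 0.5 * np.ones((horizon, action_dim))
    n_elite = max(1, int(elite_frac * pop))

    for _ in range(iters):
        # sample candidate action sequences and clip them to the action bounds
        seqs = np.clip(mean + std * rng.standard_normal((pop, horizon, action_dim)),
                       action_low, action_high)
        # roll out every candidate through the learned dynamics model
        states = np.repeat(s0[None], pop, axis=0)
        for t in range(horizon):
            states = model(states, seqs[:, t])
        scores = value_fn(states, goal)            # terminal value V(s_H; g)
        elite = seqs[np.argsort(scores)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]  # execute only the first action, then replan
```

In a receding-horizon loop, only `mean[0]` is executed before replanning from the next observed state, matching the MPC scheme described above.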
6.2 Graph-based Aggregation for Global Value Artifacts

While we have shown that model-based planning is robust to local artifacts, the correction of global estimation artifacts, which span beyond the effective planning horizon $H$, remains a significant challenge. We propose to address it by grounding the value function in states observed during exploration, such that long-horizon value estimates are computed by aggregating short-horizon (i.e., local) estimates. Given the current state $s \in \mathcal{S}$ and the goal $g \in \mathcal{G}$, we build a sparse graph by sampling vertices from the empirical state distribution $\mathcal{D}_s$ of the replay buffer, and connecting them with edges weighted by the learned goal-conditioned value $\hat{V}$:

$$G = (\mathcal{V}, \mathcal{E}) \quad \text{with } \mathcal{V} \sim \mathcal{D}_s \cup \{s, g\} \text{ and } \mathcal{E} = \{(s', s'', \hat{V}(s'; s'')) \mid (s', s'') \in \mathcal{V} \times \mathcal{V}\}. \quad (4)$$

Crucially, we then prune the graph by removing all edges that encode long-horizon estimates, i.e., edges whose weight is below a threshold $V_{\min}$. In practice, we set $V_{\min}$ to the maximum value which still ensures connectivity between the agent's state $s$ and the goal $g$ along the graph. This can be computed efficiently, as detailed in Appendix D. Once the graph has been constructed and pruned, we simply compute value estimates by minimizing the product of value estimates along all paths to the goal:

$$\tilde{V}(s; g) = \min_{(s_0, \dots, s_T)} \prod_{i=0}^{T-1} \hat{V}(s_i; s_{i+1}), \quad \text{where } (s_0, \dots, s_T) \text{ is a path between } s \text{ and } g \text{ on } G. \quad (5)$$

This value aggregation scheme differentiates itself from existing methods [12] in its vertex sampling phase, which relies on density estimation to address skewed state distributions, and in its pruning phase, which does not depend on hyperparameters. Crucially, the final value aggregate can be directly adapted as a cost function for model-based planning, and is not used as a proxy for subgoal selection. These differences and their consequences are described extensively in Appendix D. In practice, this scheme can mitigate global T-local optima by replacing long-horizon value estimates with an aggregation of short-horizon estimates. Such short-horizon estimates have been shown to be generally more accurate, and useful in containing value overestimation issues [12], which can be considered a generalization of T-local optima.

For reference, we now provide a visual intuition for the effect of the graph-based value aggregation scheme we propose. We highlight this in maze_large, as it is prone to global estimation artifacts, and allows straightforward visualization. The blue column in Figure 6 plots value functions $\{\hat{V}(s; g_i)\}_{i=0}^2$ learned through TD3 against x-y coordinates for three different goals. The green column reproduces the same plots, but estimates values through graph-based value aggregation, as described in this section, resulting in corrected estimates $\{\tilde{V}(s; g_i)\}_{i=0}^2$. We further compute expert trajectories for all three goals, in red, and plot $\hat{V}$ and $\tilde{V}$ along them, resulting in the three plots on the right. When comparing value estimates with optimal values ($V^*$, in orange, computed in closed form), we observe that graph-based aggregation results in more accurate estimates. Crucially, it is able to address global estimation artifacts, which result in a bleeding effect for $\hat{V}(s; g_0)$ (on the left), or in large hills in $\hat{V}(s_t; g_0)$ (on the right, in blue).

Figure 6: Learned values $\hat{V}(s; g_i)$ as a function of x-y coordinates in maze_large (blue column). Graph-based value aggregation results in corrected estimates $\tilde{V}(s; g_i)$ (green column). Values along optimal trajectories are compared to the optimal value functions $V^*(s_t; g_i)$ on the right.
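The following is a minimal sketch of the graph construction and connectivity-preserving pruning described above. It assumes a batched estimator `value_fn(starts, goals)` for $\hat{V}$, uses networkx for the graph search (an implementation choice not specified in the paper), and aggregates local estimates as the value product along a single best path (a shortest path under $-\log \hat{V}$ weights); the exact path aggregation of Equation 5 and the density-based vertex sampling of Appendix D are simplified away here.

```python
import numpy as np
import networkx as nx

def aggregate_value(s, goal, nodes, value_fn, n_levels=20):
    """Graph-based aggregation of a learned goal-conditioned value estimate.

    Builds a dense directed graph over sampled states plus {s, goal}, prunes
    edges whose value falls below the largest threshold V_min that keeps s and
    goal connected, and aggregates local estimates along one path as a product.
    """
    points = np.concatenate([nodes, s[None], goal[None]], axis=0)
    n = len(points)
    src, dst = n - 2, n - 1

    # pairwise local estimates V_hat(s'; s'') for all ordered pairs of vertices
    ii, jj = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    vals = value_fn(points[ii.ravel()], points[jj.ravel()]).reshape(n, n)
    vals = np.clip(vals, 1e-6, 1.0)

    def pruned_graph(v_min):
        g = nx.DiGraph()
        g.add_nodes_from(range(n))
        for i in range(n):
            for j in range(n):
                if i != j and vals[i, j] >= v_min:
                    # -log turns a product of values into an additive path cost
                    g.add_edge(i, j, weight=-np.log(vals[i, j]))
        return g

    # V_min: the largest pruning threshold that keeps s and goal connected
    best_v_min = 0.0
    for v_min in np.linspace(0.0, 1.0, n_levels):
        if nx.has_path(pruned_graph(v_min), src, dst):
            best_v_min = v_min

    graph = pruned_graph(best_v_min)
    path = nx.shortest_path(graph, src, dst, weight="weight")
    cost = nx.path_weight(graph, path, weight="weight")
    return float(np.exp(-cost))  # product of local values along the chosen path
```

The resulting scalar can directly replace $\hat{V}(s_H; g)$ as the terminal score inside the planner sketched in Section 6.1.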
Figure 7: Performance of an actor network (TD3) against model-based planning without (TD3+MBP) and with graph-based value aggregation (TD3+MBP+Aggregation). Model-based planning on aggregated values addresses both local and global estimation artifacts, and performs well across diverse environments.

6.3 Experimental Evaluation

Having evaluated diverse value learning algorithms, and described a class of harmful estimation artifacts, we now set out to evaluate how such artifacts can be addressed both locally and globally. Figure 7 compares the performance of actor networks with that of model-based planning, with and without graph-based value aggregation. We observe that model-based planning improves performance in most tasks, as it addresses T-local optima of limited depth. However, it fails to correct global artifacts in long-horizon tasks (i.e., maze_large and pinpad). In this case, we observe that graph-based value aggregation increases success rates significantly. Furthermore, we observe that our method, which combines these two components, consistently achieves strong performance across diverse environments. In each plot, a dashed line represents the performance of model-based planning with an external sparse cost (for some $\epsilon > 0$). On one hand, its strong performance in environments with shorter optimal trajectories confirms the accuracy of the learned model. On the other, its failure in long-horizon tasks shows the limitation of planning in the absence of a learned value function, which has been an important motivation for this work.

7 Discussion

This work is concerned with the challenge of achieving goals after an unsupervised exploration phase, without any further interaction with the environment. In this challenging setting, we observe that simple actor-critic methods are surprisingly competitive with model-based or offline algorithms, but their actor networks are often suboptimal due to estimation artifacts in the learned value function. After characterizing these artifacts by leveraging the goal-conditioned nature of the problem, we show how model-based planning can address this problem locally, how graph-based value aggregation can apply global corrections, and how their combination works consistently across diverse environments.

Limitations With respect to simply querying actor networks, this method comes at a cost. In particular, it increases computational requirements and risks introducing further artifacts due to model exploitation, or to the discretization involved in graph-based aggregation (see Appendix D). Furthermore, the proposed method is only applied at inference and, while it's able to mitigate learning artifacts, it does not directly correct them by updating the value estimator. Nevertheless, within our settings, we find the impact of these limitations to be, in practice, modest. Our method represents a strong practical approach for goal-reaching after curious exploration, and sheds light on interesting properties of both offline and goal-conditioned problems. An exciting future avenue would further leverage these formal insights, possibly through a tighter integration of value learning and value correction.
We thus believe that further exploring the interplay between a theoretically sound analysis of the properties of goal-conditioned value function, and the design of practical algorithms, represents a rich research opportunity. Acknowledgments and Disclosure of Funding We thank Andreas Krause, Nico Gürtler, Pierre Schumacher, Núria Armengol Urpí and Cansu Sancaktar for their help throughout the project, and the anonymous reviewers for their valuable feedback. Marco Bagatella is supported by the Max Planck ETH Center for Learning Systems. Georg Martius is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 Project number 390727645. We acknowledge the support from the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039B). [1] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. Mc Grew, J. Tobin, O. Pieter Abbeel, and W. Zaremba. Hindsight experience replay. Advances in Neural Information Processing Systems, 2017. [2] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. Advances in Neural Information Processing Systems, 2016. [3] Y. Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2019. [4] H. Charlesworth and G. Montana. Plangan: Model-based planning with sparse rewards and multiple goals. In Advances in Neural Information Processing Systems, 2020. [5] Y. Chebotar, K. Hausman, Y. Lu, T. Xiao, D. Kalashnikov, J. Varley, A. Irpan, B. Eysenbach, R. C. Julian, C. Finn, et al. Actionable models: Unsupervised offline reinforcement learning of robotic skills. In International Conference on Machine Learning, 2021. [6] X. Chen, A. Ghadirzadeh, T. Yu, J. Wang, A. Y. Gao, W. Li, L. Bin, C. Finn, and C. Zhang. Lapo: Latent-variable advantage-weighted policy optimization for offline reinforcement learning. Advances in Neural Information Processing Systems, 2022. [7] N. Chentanez, A. Barto, and S. Singh. Intrinsically motivated reinforcement learning. Advances in Neural Information Processing Systems, 2004. [8] K. Chua, R. Calandra, R. Mc Allister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in Neural Information Processing Systems, 2018. [9] P. Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613 624, 1993. [10] A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune. First return, then explore. Nature, 590(7847):580 586, 2021. [11] S. Emmons, B. Eysenbach, I. Kostrikov, and S. Levine. Rvs: What is essential for offline RL via supervised learning? In International Conference on Learning Representations, 2022. [12] B. Eysenbach, R. R. Salakhutdinov, and S. Levine. Search on the replay buffer: Bridging planning and reinforcement learning. Advances in Neural Information Processing Systems, 2019. [13] Farama-Foundation. Gymnasium-robotics. https://github.com/ Farama-Foundation/Gymnasium-Robotics, 2023. [14] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4rl: Datasets for deep data-driven reinforcement learning, 2020. [15] S. Fujimoto, H. Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, 2018. [16] N. Gürtler, S. Blaes, P. Kolev, F. Widmaier, M. Wuthrich, S. Bauer, B. Schölkopf, and G. Martius. 
Benchmarking offline reinforcement learning on real-robot hardware. In The Eleventh International Conference on Learning Representations, 2023. [17] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018. [18] D. Hafner, K.-H. Lee, I. Fischer, and P. Abbeel. Deep hierarchical planning from pixels. In Advances in Neural Information Processing Systems, 2022. [19] M. Janner, J. Fu, M. Zhang, and S. Levine. When to trust your model: Model-based policy optimization. Advances in Neural Information Processing Systems, 2019. [20] L. P. Kaelbling. Learning to achieve goals. In IJCAI, volume 2, pages 1094 8, 1993. [21] D. P. K. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. [22] I. Kostrikov, A. Nair, and S. Levine. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022. [23] N. Lambert, M. Wulfmeier, W. Whitney, A. Byravan, M. Bloesch, V. Dasagi, T. Hertweck, and M. Riedmiller. The challenges of exploration for offline reinforcement learning. ar Xiv preprint ar Xiv:2201.11861, 2022. [24] M. Laskin, D. Yarats, H. Liu, K. Lee, A. Zhan, K. Lu, C. Cang, L. Pinto, and P. Abbeel. URLB: Unsupervised reinforcement learning benchmark. In Deep RL Workshop Neur IPS 2021, 2021. [25] J. Li, C. Tang, M. Tomizuka, and W. Zhan. Hierarchical planning through goal-conditioned offline reinforcement learning. IEEE Robotics and Automation Letters, 2022. [26] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016. [27] K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch. Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control. In International Conference on Learning Representations (ICLR), 2019. [28] R. Mendonca, O. Rybkin, K. Daniilidis, D. Hafner, and D. Pathak. Discovering and achieving goals via world models. Advances in Neural Information Processing Systems, 2021. [29] R. Mendonca, S. Bahl, and D. Pathak. Alan: Autonomously exploring robotic agents in the real world. ar Xiv preprint ar Xiv:2302.06604, 2023. [30] L. Mezghani, S. Sukhbaatar, P. Bojanowski, A. Lazaric, and K. Alahari. Learning goalconditioned policies offline with self-supervised reward shaping. In 6th Annual Conference on Robot Learning, 2022. [31] S. Nasiriany, V. Pong, S. Lin, and S. Levine. Planning with goal-conditioned policies. Advances in Neural Information Processing Systems, 32, 2019. [32] P.-Y. Oudeyer, F. Kaplan, and V. V. Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265 286, 2007. [33] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by selfsupervised prediction. In International Conference on Machine Learning, 2017. [34] L. Pineda, B. Amos, A. Zhang, N. O. Lambert, and R. Calandra. Mbrl-lib: A modular library for model-based reinforcement learning. Arxiv, 2021. [35] C. Pinneri, S. Sawant, S. Blaes, J. Achterhold, J. Stueckler, M. Rolinek, and G. Martius. Sampleefficient cross-entropy method for real-time planning. In Conference on Robot Learning, 2021. [36] S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-maron, M. 
Giménez, Y. Sulsky, J. Kay, J. T. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y. Chen, R. Hadsell, O. Vinyals, M. Bordbar, and N. de Freitas. A generalist agent. Transactions on Machine Learning Research, 2022. [37] R. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and computing in applied probability, 1(2):127 190, 1999. [38] C. Sancaktar, S. Blaes, and G. Martius. Curious exploration via structured world models yields zero-shot object manipulation. In Advances in Neural Information Processing Systems. Curran Associates, Inc., Dec. 2022. [39] N. Savinov, A. Dosovitskiy, and V. Koltun. Semi-parametric topological memory for navigation. In International Conference on Learning Representations, 2018. [40] T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In International Conference on Machine Learning, 2015. [41] J. Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international Conference on simulation of adaptive behavior: From animals to animats, pages 222 227, 1991. [42] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604 609, 2020. [43] R. Sekar, O. Rybkin, K. Daniilidis, P. Abbeel, D. Hafner, and D. Pathak. Planning to explore via self-supervised world models. In International Conference on Machine Learning, 2020. [44] B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. In International Conference on Learning Representations, 2016. [45] O. E. L. Team, A. Stooke, A. Mahajan, C. Barros, C. Deck, J. Bauer, J. Sygnowski, M. Trebacz, M. Jaderberg, M. Mathieu, et al. Open-ended learning leads to generally capable agents. ar Xiv preprint ar Xiv:2107.12808, 2021. [46] S. Tian, S. Nair, F. Ebert, S. Dasari, B. Eysenbach, C. Finn, and S. Levine. Model-based visual planning with self-supervised functional distances. In International Conference on Learning Representations, 2021. [47] A. Touati and Y. Ollivier. Learning one representation to optimize all rewards. Advances in Neural Information Processing Systems, 2021. [48] A. Touati, J. Rapin, and Y. Ollivier. Does zero-shot reinforcement learning exist? In 3rd Offline RL Workshop: Offline RL as a Launchpad , 2022. [49] M. Vlastelica, S. Blaes, C. Pinneri, and G. Martius. Risk-averse zero-order trajectory optimization. In 5th Annual Conference on Robot Learning, 2021. [50] Z. Wang, A. Novikov, K. Zolna, J. S. Merel, J. T. Springenberg, S. E. Reed, B. Shahriari, N. Siegel, C. Gulcehre, N. Heess, et al. Critic regularized regression. Advances in Neural Information Processing Systems, 2020. [51] R. Yang, Y. Lu, W. Li, H. Sun, M. Fang, Y. Du, X. Li, L. Han, and C. Zhang. Rethinking goalconditioned supervised learning and its connection to offline RL. In International Conference on Learning Representations, 2022. [52] D. Yarats, D. Brandfonbrener, H. Liu, M. Laskin, P. Abbeel, A. Lazaric, and L. Pinto. Don t change the algorithm, change the data: Exploratory data for offline reinforcement learning. In ICLR 2022 Workshop on Generalizable Policy Learning in Physical World, 2022. [53] T. Yu, G. Thomas, L. Yu, S. Ermon, J. Y. Zou, S. Levine, C. Finn, and T. Ma. Mopo: Modelbased offline policy optimization. 
Advances in Neural Information Processing Systems, 2020.

Supplementary Material to "Goal-conditioned Offline Planning from Curious Exploration"

A.1 Proposition 5.3

Let us consider an arbitrary goal $g \in \mathcal{G}$ and the optimal goal-conditioned value function $V^*(s; g)$. Proposition 5.3 states that $V^*(s; g)$ has no T-local optima. The proof follows directly from the Bellman optimality condition, assuming that (1) $R(s; g) = \mathbb{1}_{d(s,g) \le \epsilon}$ for some $\epsilon \ge 0$ and distance function $d(\cdot, \cdot)$, and that (2) the episode terminates when the goal is achieved. For a candidate state $\bar{s}$, let us assume that the first inequality of the condition for being a T-local optimum point is met: $\max_{s' \in \mathcal{S}} V^*(s'; g) > V^*(\bar{s}; g)$.³ It is now sufficient to show that the second inequality (i.e., $V^*(\bar{s}; g) > \max_{s' \in \mathcal{T}(\{\bar{s}\}) \setminus \{\bar{s}\}} V^*(s'; g)$) cannot hold, and therefore $\bar{s}$ cannot be a T-local optimum point. According to Bellman optimality, we have that

$$V^*(\bar{s}; g) = R(\bar{s}; g) + \gamma(\bar{s}; g) \max_{a \in \mathcal{A}} \mathbb{E}_{s' \sim P(\cdot \mid a, \bar{s})} V^*(s'; g), \quad (S6)$$

where we use $\gamma(s; g) = \gamma(1 - R(s; g))$: the factor $(1 - R(s; g))$ acts as the termination signal, enforcing that value propagation stops when the goal is reached (and the episode ends). If $R(\bar{s}; g) = 1$, then $V^*(\bar{s}; g) = 1$, which would break the first inequality, since $\max_{s' \in \mathcal{S}} V^*(s'; g) \le 1$. This implies that $R(\bar{s}; g) = 0$. As a consequence, Equation S6 simplifies to

$$V^*(\bar{s}; g) = \gamma \max_{a \in \mathcal{A}} \mathbb{E}_{s' \sim P(\cdot \mid a, \bar{s})} V^*(s'; g) \le \gamma \max_{s' \in \mathcal{T}(\{\bar{s}\})} V^*(s'; g). \quad (S7)$$

We can now differentiate two cases. If $\max_{s' \in \mathcal{T}(\{\bar{s}\})} V^*(s'; g) = 0$ (i.e., the goal cannot be reached), then we also have that $V^*(\bar{s}; g) = \max_{s' \in \mathcal{T}(\{\bar{s}\}) \setminus \{\bar{s}\}} V^*(s'; g) = 0$, and $\bar{s}$ is not a T-local optimum point. If the goal is instead reachable from at least one state in the next-state distribution (i.e., $\max_{s' \in \mathcal{T}(\{\bar{s}\})} V^*(s'; g) > 0$), we have that

$$V^*(\bar{s}; g) \le \gamma(\bar{s}; g) \max_{s' \in \mathcal{T}(\{\bar{s}\})} V^*(s'; g) < \max_{s' \in \mathcal{T}(\{\bar{s}\})} V^*(s'; g) = \max_{s' \in \mathcal{T}(\{\bar{s}\}) \setminus \{\bar{s}\}} V^*(s'; g).$$

The second inequality follows from the goal-conditioned assumption that $0 < \gamma < 1$, and the final equality is due to the fact that $V^*(\bar{s}; g) < \max_{s' \in \mathcal{T}(\{\bar{s}\})} V^*(s'; g)$ implies that $V^*(\bar{s}; g) \neq \max_{s' \in \mathcal{T}(\{\bar{s}\})} V^*(s'; g)$, so that removing $\bar{s}$ from the set does not change the maximum. Thus, $V^*(\bar{s}; g) > \max_{s' \in \mathcal{T}(\{\bar{s}\}) \setminus \{\bar{s}\}} V^*(s'; g)$ does not hold, and $\bar{s}$ is not a T-local optimum point.

³While a supremum operator $\sup(\cdot)$ would be required in open domains, for simplicity we use $\max(\cdot)$ throughout this work.

A.2 Optimization of Optimal Goal-conditioned Value Functions

Section 4 argues that, assuming that the objective for actor networks is well optimized, the failure of naive actor-critic algorithms is due to estimation errors in the critic network. Accordingly, we show that, under common assumptions, a state-space trajectory which optimizes an optimal goal-conditioned value function $V^*(s; g)$ is guaranteed to achieve its goal $g \in \mathcal{G}$. Furthermore, minimization of the actor loss results in an actor network which produces this exact trajectory. Given a goal $g \in \mathcal{G}$ and an initial state $s_0 \in \mathcal{S}$, we make the following assumptions⁴:

1. $V^*(s_0; g) > 0$, i.e., the goal can be reached from the initial state.
2. For all $s \in \mathcal{S}, a \in \mathcal{A}$, there exists $s' \in \mathcal{S}$ such that $P(s' \mid s, a) = 1$, i.e., the MDP transitions deterministically.

⁴On top of those listed in Section 3.1.

We now consider the state-space trajectory $(s_i)_{i=0}^T$ which optimizes $V^*$ under the dynamics of the MDP, that is

$$s_{i+1} = \arg\max_{s' \in \mathcal{T}(\{s_i\})} V^*(s'; g) \quad \forall\, 0 \le i < T. \quad (S9)$$

We will show that, for a certain finite $T$, $s_T$ achieves the goal, i.e., $R(s_T; g) = 1$ and $R(s_i; g) = 0$ for all $0 \le i < T$. We note that, since the episode terminates when the goal is achieved and $\gamma(s; g) = \gamma(1 - R(s; g))$, Bellman optimality implies that $R(s_T; g) = 1 \iff V^*(s_T; g) = 1$, as shown in Equation S6.
From Equation S6 we can then derive

$$V^*(s_i; g) = R(s_i; g) + \gamma \max_{a \in \mathcal{A}} \mathbb{E}_{s' \sim P(\cdot \mid a, s_i)} V^*(s'; g) = \gamma \max_{a \in \mathcal{A}} \mathbb{E}_{s' \sim P(\cdot \mid a, s_i)} V^*(s'; g) = \gamma \max_{s' \in \mathcal{T}(\{s_i\})} V^*(s'; g) = \gamma V^*(s_{i+1}; g),$$

where the third equality follows from deterministic transitions, and the last from the objective in Equation S9. By applying this equality recursively, we can write $V^*(s_i; g) = \gamma^{-i} V^*(s_0; g)$. After $T = \lceil \log_\gamma V^*(s_0; g) \rceil$ steps,

$$V^*(s_T; g) = \gamma^{-T} V^*(s_0; g) \ge 1. \quad (S11)$$

Since $V^*(\cdot; g) \le 1$, it follows that $V^*(s_T; g) = 1$ and $s_T$ achieves the goal. An actor network is trained to output

$$\pi(s_i; g) = \arg\max_{a \in \mathcal{A}} Q^*(s_i, a; g) = \arg\max_{a \in \mathcal{A}} R(s_i; g) + \gamma \mathbb{E}_{s' \sim P(\cdot \mid a, s_i)} V^*(s'; g) = \arg\max_{a \in \mathcal{A}} \mathbb{E}_{s' \sim P(\cdot \mid a, s_i)} V^*(s'; g).$$

Under deterministic dynamics, the optimal policy therefore produces the trajectory which optimizes the objective in Equation S9, and thus reaches the goal.

A.3 Model-based Planning and T-local Optima

Section 5 pinpoints T-local optima as a class of estimation artifacts in learned value functions. Following Section 6.1, we now present two properties arising from the interaction between model-based planning and a suboptimal value landscape, depending on the depth of T-local optima and the planning horizon.

First, we show that, given a goal $g \in \mathcal{G}$, model-based planning with horizon $H$ from a T-local optimum point $s^\star$ of depth $k \ge H$ fails to achieve the goal. We make the assumptions presented in Section 3.1, and additionally assume:

1. $\hat{P} = P$, i.e., exact dynamics estimation.
2. Model-based planning optimizes its objective exactly (see Equation S13).
3. For every state $s \in \mathcal{S}$ and goal $g \in \mathcal{G}$, if $V(s; g) \ge \max_{s' \in \mathcal{T}(\{s\})} V(s'; g)$, then $\exists a \in \mathcal{A} : P(s \mid s, a) = 1$. This implies the existence of self-loops for a set of states including T-local optimum points.

We recall from Section 6.1 that, in this setting, model-based planning optimizes the following objective:

$$a_0^{H-1} = \arg\max_{a_0^{H-1} \in \mathcal{A}^H} V(s_H; g) \quad \text{with } s_{i+1} \sim \hat{P}(\cdot \mid s_i, a_i), \quad (S13)$$

where $a_0^{H-1}$ denotes the action sequence $(a_0, \dots, a_{H-1})$. We note that, if $s_0 = s^\star$ is a T-local optimum point, then by assumption (3) an action $\bar{a}$ is available such that $P(s^\star \mid s^\star, \bar{a}) = 1$. If $a_i = \bar{a}$ for $0 \le i < H$, then the objective in Equation S13 simply evaluates to $V(s^\star; g)$. Any other action sequence will reach a state within the H-step future state distribution $\mathcal{T}^H(\{s^\star\})$, and will evaluate to a lower or equal value as per the definition of T-local optima of depth $k \ge H$: $V(s^\star; g) \ge \max_{s' \in \mathcal{T}^H(\{s^\star\})} V(s'; g)$. If ties are broken favorably⁵, an MPC model-based planning algorithm will therefore execute $\bar{a}$, which deterministically transitions to the T-local optimum point at each step. Thus, the goal is never reached.

Second, we show that, if the planning horizon $H$ is greater than the depth of all T-local optima over the state space $\mathcal{S}$, open-loop execution of the action sequence returned through model-based planning successfully optimizes the value function (i.e., reaches a state with a higher value). This property holds under additional assumptions, namely:

4. For all $s \in \mathcal{S}, a \in \mathcal{A}$, there exists $s' \in \mathcal{S}$ such that $P(s' \mid s, a) = 1$, i.e., the MDP transitions deterministically.
5. The action sequence optimizing the objective for model-based planning (Equation S13) is executed in its entirety before replanning, i.e., open-loop execution of the plan.
6. $V(s_0; g) < \max_{s' \in \mathcal{S}} V(s'; g)$, i.e., the initial state does not maximize the value function.

Since there are no T-local optima of depth $k \ge H$, there must exist a state $s_1 \in \mathcal{T}^{H'}(\{s_0\})$ such that $V(s_1; g) > V(s_0; g)$ and $H' \le H$.
Now, let us consider the state reachable in up to $H$ steps with the highest value:

$$V(s_1; g) = \max_{H' \le H} \max_{s' \in \mathcal{T}^{H'}(\{s_0\})} V(s'; g). \quad (S14)$$

As shown in Equation S13, model-based planning only optimizes over states that can be reached in exactly $H$ steps. It is thus sufficient to show that $s_1 \in \mathcal{T}^H(\{s_0\})$, as this would ensure that

$$V(s_0; g) < V(s_1; g) \le \max_{s_H \in \mathcal{T}^H(\{s_0\})} V(s_H; g) = \max_{a_0^{H-1} \in \mathcal{A}^H} V(s_H; g) \quad \text{with } s_{i+1} \sim \hat{P}(\cdot \mid s_i, a_i).$$

Thus, open-loop execution of the action sequence $a_0^{H-1}$ optimizing Equation S13 would reach a state with a higher value (in this case, $s_H$). We recall that $s_1 \in \mathcal{T}^{H'}(\{s_0\})$ for some $H' \le H$. We will now show that $s_1 \in \mathcal{T}^H(\{s_0\})$ by contradiction. Let us assume that $s_1 \notin \mathcal{T}^H(\{s_0\})$, which implies $H' < H$. We thus have that

$$V(s_1; g) = \max_{H'' \le H} \max_{s' \in \mathcal{T}^{H''}(\{s_0\})} V(s'; g) \ge \max_{s' \in \mathcal{T}^{H'+1}(\{s_0\})} V(s'; g) \ge \max_{s' \in \mathcal{T}(\{s_1\})} V(s'; g),$$

since $\mathcal{T}(\{s_1\}) \subseteq \mathcal{T}^{H'+1}(\{s_0\})$ and $H' + 1 \le H$. By assumption (3), there thus exists an action $a \in \mathcal{A}$ with $P(s_1 \mid s_1, a) = 1$, so that $s_1 \in \mathcal{T}^{H'+1}(\{s_0\}) \subseteq \dots \subseteq \mathcal{T}^H(\{s_0\})$, contradicting the assumption. Hence $s_1 \in \mathcal{T}^H(\{s_0\})$, which concludes the argument.

In terms of inference speed, querying an actor network runs at more than 100 Hz, while model-based planning with and without aggregation runs at 3 Hz and 0.3 Hz, respectively. An optimized implementation would bring significant speedups, e.g., through precomputation and parallelization of graph search, which could address the significant slowdown.

Table S3: Hyperparameters for value learning.

| Hyperparameter | Value |
|---|---|
| # of critic layers | 2 |
| # of units per critic layer | 512 |
| # of actor layers | 2 |
| # of units per actor layer | 512 |
| Batch size | 512 |
| Critic learning rate | 1e-5 |
| Actor learning rate | 1e-5 |
| Activation function | ReLU |
| Polyak coefficient | 0.995 |
| Target noise | 0.2 |
| Target noise clipping threshold | 0.5 |
| Policy delay | 2 |
| Policy squashing | True |
| p_g | 0.75 |
| p_geo | 0.2 |
| (MBPO) # updates before rollouts | 500 |
| (MBPO) # of rollouts | 5000 |
| (MBPO) Rollout horizon | 5 |
| (MBPO) Buffer capacity | 1M |
| (MBPO) Real data ratio | 0.0 |
| (MOPO) Penalty coefficient λ | 0.1 |
| (CRR) β | 1.0 |
| (CRR) # of action samples | 4 |
| (CRR) Advantage type | mean |
| (CRR) Weight type | exponential |
| (CRR) Maximum weight | 20.0 |

Table S4: Hyperparameters for model-based planning.

| Hyperparameter | Value |
|---|---|
| # of iterations | 3 |
| Population size | 400 |
| Elite ratio | 0.01 |
| α | 0.1 |
| Population decay | False |
| Colored noise exponent | 3.0 |
| Fraction of kept elites | 0.3 |

J Open Challenges and Negative Results

Our method provides insights on how to mitigate the occurrence of estimation artifacts in learned value functions. However, it does not correct them by finetuning the neural value estimator. This possibility is interesting, as it would bring no computational overhead, and was broadly explored. TD finetuning with CQL-like penalties on the value of T-local optima was attempted, but unsuccessful. In this context, we found such correction strategies to be sensitive to hyperparameters, and to often result in overly pessimistic estimates. We, however, stress the significance of direct value correction as an exciting direction for future research.

K Numerical Results

Table S5 reports success rates from Figures 2 and 7 (main paper) in numerical form.
Table S5: Average success rates with 90% bootstrap confidence intervals from Figures 2 and 7 (main paper); each cell reports lower bound < mean < upper bound.

| Method | maze_large | pinpad | fetch_push | kitchen |
|---|---|---|---|---|
| CRR | 0.095 < 0.173 < 0.253 | 0.027 < 0.055 < 0.088 | 0.083 < 0.083 < 0.083 | 0.253 < 0.331 < 0.412 |
| MBPO | 0.000 < 0.000 < 0.000 | 0.017 < 0.028 < 0.040 | 0.083 < 0.083 < 0.083 | 0.111 < 0.143 < 0.174 |
| MOPO | 0.023 < 0.049 < 0.079 | 0.033 < 0.050 < 0.072 | 0.083 < 0.083 < 0.083 | 0.015 < 0.046 < 0.079 |
| TD3 | 0.079 < 0.144 < 0.206 | 0.244 < 0.267 < 0.291 | 0.083 < 0.083 < 0.083 | 0.079 < 0.110 < 0.142 |
| CRR+MBP | 0.000 < 0.031 < 0.063 | 0.283 < 0.321 < 0.361 | 0.152 < 0.190 < 0.231 | 0.174 < 0.206 < 0.238 |
| MBPO+MBP | 0.024 < 0.060 < 0.098 | 0.043 < 0.068 < 0.091 | 0.638 < 0.719 < 0.791 | 0.174 < 0.222 < 0.285 |
| MOPO+MBP | 0.000 < 0.025 < 0.098 | 0.013 < 0.025 < 0.038 | 0.824 < 0.884 < 0.935 | 0.365 < 0.444 < 0.523 |
| TD3+MBP | 0.206 < 0.301 < 0.412 | 0.190 < 0.232 < 0.274 | 0.638 < 0.693 < 0.745 | 0.333 < 0.379 < 0.428 |
| TD3+MBP+Aggregation | 0.627 < 0.705 < 0.814 | 0.314 < 0.410 < 0.500 | 0.587 < 0.647 < 0.708 | 0.301 < 0.365 < 0.428 |
| MBP+Sparse | 0.030 | 0.100 | 0.730 | 0.430 |