Published as a conference paper at ICLR 2025

OGBENCH: BENCHMARKING OFFLINE GOAL-CONDITIONED RL

Seohong Park¹, Kevin Frans¹, Benjamin Eysenbach², Sergey Levine¹
¹University of California, Berkeley  ²Princeton University
seohong@berkeley.edu

Offline goal-conditioned reinforcement learning (GCRL) is a major problem in reinforcement learning (RL) because it provides a simple, unsupervised, and domain-agnostic way to acquire diverse behaviors and representations from unlabeled data without rewards. Despite the importance of this setting, we lack a standard benchmark that can systematically evaluate the capabilities of offline GCRL algorithms. In this work, we propose OGBench, a new, high-quality benchmark for algorithms research in offline goal-conditioned RL. OGBench consists of 8 types of environments, 85 datasets, and reference implementations of 6 representative offline GCRL algorithms. We have designed these challenging and realistic environments and datasets to directly probe different capabilities of algorithms, such as stitching, long-horizon reasoning, and the ability to handle high-dimensional inputs and stochasticity. While representative algorithms may rank similarly on prior benchmarks, our experiments reveal stark strengths and weaknesses in these different capabilities, providing a strong foundation for building new algorithms. Project page: https://seohong.me/projects/ogbench

Figure 1: OGBench Overview (panels: antsoccer, visual-maze, cube, scene, puzzle, powderworld, humanoidmaze). OGBench provides a variety of state- and pixel-based locomotion, manipulation, and drawing tasks that are designed to exercise diverse challenges in offline goal-conditioned RL, such as stitching, long-horizon reasoning, and stochastic control.

1 MOTIVATION

Why offline goal-conditioned reinforcement learning (RL)? The enduring trend in modern machine learning is to simplify domain-specific assumptions and scale up the data.
In computer vision and natural language processing, the strongest general-purpose models are trained via simple unsupervised objectives on raw, unlabeled data, such as next-token prediction, contrastive learning, and masked auto-encoding. What analogous paradigm could enable data-driven unsupervised learning for reinforcement learning? Ideally, such a framework should be able to produce from data a generalist policy that can be directly queried or adapted to solve a variety of downstream tasks, much like how generative language models trained via next-token prediction can be easily adapted to everyday tasks.

We posit that a natural analogy to data-driven unsupervised learning in RL is offline goal-conditioned RL (GCRL). The objective of offline goal-conditioned RL is fully unsupervised, remarkably simple, and requires no domain knowledge: it merely aims to learn to reach any state from any other state in the dataset in the fewest number of steps. However, mastering this simple objective is exceptionally difficult: the agent needs to not only acquire diverse skills to efficiently navigate the state space, but also develop a deep, complete understanding of the underlying world and dataset. As a result, offline goal-conditioned RL yields a highly capable general-purpose multi-task policy as well as rich, useful representations that can be adapted to solve a variety of downstream tasks (Ghosh et al., 2023; Kim et al., 2024). Indeed, owing to this simplicity and generality, interest in (offline) goal-conditioned RL has recently surged, to the extent that a standalone workshop on goal-conditioned RL was held at a machine learning conference.¹

Why a new benchmark?
Despite the importance of and increasing interest in offline goal-conditioned RL, we currently lack a standard benchmark that can systematically assess the capabilities of offline GCRL algorithms, such as the ability to stitch, perform long-horizon reasoning, and handle stochasticity. Prior works in offline goal-conditioned RL (Eysenbach et al., 2022; Ma et al., 2022; Park et al., 2023; Myers et al., 2024) have mainly used either existing datasets for standard offline RL tasks without modification (e.g., D4RL (Fu et al., 2020)), relatively simple goal-conditioned tasks (e.g., Fetch (Plappert et al., 2018)), or tasks tailored to demonstrate the individual abilities of the proposed methods. This often results in limited evaluation: prior works frequently evaluate their multi-task policies only on a single task (when using datasets not originally designed for offline GCRL), or learn relatively simple behaviors. While some prior tasks are tailored to evaluate individual properties of offline GCRL, such as stitching or generalization (Yang et al., 2023; Ghugare et al., 2024), we lack a comprehensive, standardized benchmark that exhaustively assesses the various properties of offline GCRL algorithms with diverse, challenging tasks.

Therefore, in this work we introduce the Offline Goal-Conditioned RL Benchmark (OGBench) (Figure 1). The primary goals of this benchmark are to facilitate algorithms research in offline goal-conditioned RL and to provide a set of complex tasks that can unlock the potential of offline GCRL. Our benchmark introduces 8 types of environments and 85 datasets across robotic locomotion, robotic manipulation, and drawing, and provides well-tuned reference implementations of 6 representative offline GCRL methods. These datasets, tasks, and implementations are carefully designed.
Tasks are designed so that complex behaviors can naturally emerge when they are successfully solved, and so that they pose diverse algorithmic challenges in offline GCRL, such as goal stitching, stochastic control, long-horizon reasoning, and more. Dataset and task difficulties are carefully adjusted to highlight stark contrasts between algorithms across multiple criteria. The entire benchmark is designed to minimize unnecessary computational overhead and maximize usability, such that any researcher can easily iterate on and evaluate new ideas. We believe OGBench serves as a solid foundation for developing ideal algorithms for goal-conditioned and unsupervised RL from data.

2 PROBLEM SETTING

The offline goal-conditioned RL problem is defined by a controlled Markov process $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mu, p)$ (i.e., a Markov decision process (MDP) without rewards) and an unlabeled dataset $\mathcal{D}$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ denotes the action space, $\mu(s) \in \Delta(\mathcal{S})$² denotes the initial state distribution, and $p(s' \mid s, a)\colon \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ denotes the transition dynamics function. Here, $\Delta(X)$ denotes the set of probability distributions defined on a set $X$. The dataset $\mathcal{D} = \{\tau^{(n)}\}_{n \in \{1, 2, \dots, N\}}$ consists of unlabeled trajectories $\tau^{(n)} = (s^{(n)}_0, a^{(n)}_0, s^{(n)}_1, \dots, s^{(n)}_{T_n})$.

The objective of offline goal-conditioned RL is to learn to reach any state from any other state in the minimum number of time steps. Formally, offline GCRL aims to learn a goal-conditioned policy $\pi(a \mid s, g)\colon \mathcal{S} \times \mathcal{S} \to \Delta(\mathcal{A})$ that maximizes the objective $\mathbb{E}_{\tau \sim p(\tau \mid g)}\left[\sum_{t=0}^{T} \gamma^t \delta_g(s_t)\right]$ for all $g \in \mathcal{S}$, where $T \in \mathbb{N}$ denotes the episode horizon, $\gamma \in (0, 1)$ denotes the discount factor, $p(\tau \mid g)$ denotes the trajectory distribution given by $p(\tau \mid g) = \mu(s_0)\pi(a_0 \mid s_0, g)p(s_1 \mid s_0, a_0) \cdots p(s_T \mid s_{T-1}, a_{T-1})$, and $\delta_g(\cdot)$ denotes the Dirac delta function³ at $g$. Note that we use the entire state space as the goal space (i.e., a goal is simply a full state, not part of the state like only the x-y position of the agent).
This choice makes the objective fully unsupervised, making it suitable for domain-agnostic training from unlabeled data. Our goal in this paper is to propose a new benchmark in offline GCRL: that is, formally speaking, to specify the dynamics and dataset for each task we introduce.

¹https://goal-conditioned-rl.github.io/2023/
²We denote placeholder variables in gray throughout the paper.
³In a discrete MDP, $\delta_g(s)$ is equal to an indicator function $\mathbb{1}_{\{g\}}(s)$. In a continuous MDP, it is technically not well-defined in its current form. It can be made precise using measure-theoretic notation or distribution theory, but we choose to avoid them for simplicity.

Table 1: Properties of benchmark tasks. We summarize the properties of benchmark tasks commonly used in prior works (above the line) and our OGBench tasks (below the line).

| Benchmark | Task Type¹ | Longest Task² | # Subtasks³ | Test Stitching?⁴ | Have Stoch. Tasks?⁵ | Support Pixels?⁶ | Multi-Goal?⁷ | Dependency⁸ |
|---|---|---|---|---|---|---|---|---|
| D4RL AntMaze | Loco. | 400 | - | | | | | MuJoCo |
| D4RL Kitchen | Manip. | 250 | 4 | | | | | MuJoCo |
| Roboverse | Manip. | 100 | 1-2 | | | | | PyBullet |
| Fetch | Manip. | 50 | 1 | | | | | MuJoCo |
| PointMaze (ours) | Loco. | 600 | - | | | | | MuJoCo |
| AntMaze (ours) | Loco. | 1000 | - | | | | | MuJoCo |
| HumanoidMaze (ours) | Loco. | 3000 | - | | | | | MuJoCo |
| AntSoccer (ours) | Loco. | 1000 | - | | | | | MuJoCo |
| Cube (ours) | Manip. | 400 | 1-4 | | | | | MuJoCo |
| Scene (ours) | Manip. | 400 | 2-8 | | | | | MuJoCo |
| Puzzle (ours) | Manip. | 800 | 2-24 | | | | | MuJoCo |
| Powderworld (ours) | Draw. | 100 | - | | | | | NumPy |

¹ Environment type (locomotion, manipulation, or drawing).
² The (approximate) minimum number of environment steps to solve the longest task.
³ The number of atomic behaviors (e.g., pick-and-place) in each manipulation task.
⁴ Does it contain tasks that require goal stitching?
⁵ Does it contain tasks with stochastic dynamics?
⁶ Does it support pixel-based observations?
⁷ Does it use multiple goals for evaluation?
⁸ The main dependency of the benchmark.

3 HOW HAVE PRIOR WORKS BENCHMARKED OFFLINE GCRL?
While many excellent offline GCRL algorithms have been proposed, the community currently lacks a standardized way to evaluate their performance, unlike other fields in RL (Brockman et al., 2016; Tassa et al., 2018; Fu et al., 2020; Terry et al., 2021). The tasks used by prior works in offline GCRL often provide only limited evaluation, for several reasons. For example, many works directly use tasks from existing offline RL benchmarks that were not necessarily designed for offline goal-conditioned RL, such as D4RL AntMaze and Kitchen (Fu et al., 2020). However, since these tasks are designed for single-task offline RL, such works often evaluate their multi-task policies only on the single, original goal (Eysenbach et al., 2022; Park et al., 2023; Zheng et al., 2024b; Myers et al., 2024), which results in limited evaluation. Some employ online GCRL tasks provided by Plappert et al. (2018) (e.g., Fetch tasks) with policy-collected datasets (Ma et al., 2022; Yang et al., 2023), but these tasks are mostly atomic (e.g., a single pick-and-place) and do not sufficiently address various challenges in offline GCRL, such as long-horizon reasoning and goal stitching. While Roboverse (Fang et al., 2022; Zheng et al., 2024a) provides pixel-based manipulation tasks, its tasks are still relatively atomic, and it requires installing multiple fragmented dependencies, making it less approachable for researchers. Some prior works construct bespoke tasks and datasets to individually study specific features of their algorithms (e.g., stitching (Ma et al., 2022; Ghugare et al., 2024; Wang et al., 2024) and stochasticity (Myers et al., 2024)), but it is often not entirely clear how the algorithms compare to one another, or whether the new capabilities of new methods (e.g., stitching) come at the cost of other capabilities. These limitations of previous evaluation tasks have motivated us to create a new benchmark.
In this work, we introduce a set of diverse tasks that cover various challenges in offline GCRL, enabling a much more thorough and multi-faceted evaluation than previous tasks. We summarize the properties of the previous tasks and our new tasks in Table 1 and refer to Appendix E for further discussion of related work.

4 OVERVIEW OF OGBENCH

We now introduce our benchmark, the Offline Goal-Conditioned RL Benchmark (OGBench). The primary goal of this benchmark is to provide tasks and datasets that unlock the full potential of offline goal-conditioned RL. To this end, we pose diverse challenges in offline GCRL throughout the benchmark, in such a way that researchers can easily test and iterate on algorithmic ideas, and that complex, intriguing behaviors can naturally emerge when successful. OGBench consists of 8 types of environments, 85 datasets, and reference implementations of 6 representative offline GCRL algorithms. In the following sections, we first describe the challenges in offline GCRL (Section 5) and then outline our core design philosophies (Section 6). We next introduce the tasks and datasets (Section 7) and present the benchmarking results of the current algorithms (Section 8).

5 CHALLENGES IN OFFLINE GOAL-CONDITIONED RL

Offline goal-conditioned RL, despite its simplicity, is a challenging problem. Here, we discuss the major challenges in offline GCRL, which motivate the design choices in our benchmark tasks.

(1) Learning from suboptimal, unstructured data: An ideal offline GCRL algorithm should be able to learn an effective multi-task policy from diverse and suboptimal data. This is especially important, considering that suboptimal (yet diverse) data is much cheaper to collect than curated, expert datasets (Lynch et al., 2019), and that the very use of large, diverse, unstructured data is one of the foundations for the success of modern machine learning.
Reflecting this challenge, we provide datasets with high diversity and varying suboptimality in this benchmark to challenge the capabilities of offline GCRL algorithms.

(2) Goal stitching: Another important challenge is to stitch the initial and final states of different trajectories to learn more diverse behaviors. We call this goal stitching. Goal stitching is different from regular stitching in offline RL, which applies only when the dataset is suboptimal. Unlike regular stitching, goal stitching applies even when the dataset consists only of optimal, expert trajectories, because we can often acquire more diverse goal-reaching behaviors by stitching multiple trajectories together, regardless of their optimality. For instance, an agent can stitch two atomic pick-and-place behaviors to sequentially move two objects in a single episode, even when the dataset does not contain any double pick-and-place behaviors. Goal stitching is crucial for learning diverse behaviors in many real-world applications with high behavioral diversity and large state spaces. In our benchmark, we introduce many tasks with large state spaces to assess the ability to stitch goals.

(3) Long-horizon reasoning: Long-horizon reasoning refers to the capability of navigating from a starting state to a goal state that is many steps away. This challenge is important in many real-world tasks like autonomous driving or assembly, which may require several hours of continuous control or achieving dozens of subtasks. To substantially challenge the long-horizon reasoning ability of offline GCRL methods, we introduce tasks that are more than 5 times longer than previously used ones in terms of both episode length and the number of subtasks (Table 1).

(4) Handling stochasticity: Another prominent challenge in offline GCRL is the ability to deal with stochastic environments.
Correctly handling stochasticity is very important in practice, because virtually any real-world environment is stochastic due to partial observability. Yet, many works in offline GCRL assume deterministic dynamics to exploit the metric structure of temporal distances (Wang et al., 2023) or to enable hierarchical control (Park et al., 2023), at the cost of being optimistically biased in stochastic environments (Wang et al., 2023; Park et al., 2023). Correctly handling environment stochasticity while fully exploiting the recursive subgoal structure of GCRL remains an open problem. Since most previous tasks used to evaluate offline GCRL methods have deterministic dynamics (Table 1), we introduce several challenging tasks with stochastic dynamics in this benchmark.

6 DESIGN PRINCIPLES

Next, we discuss the design principles underlying our benchmark tasks. The tasks are intended to exercise the major challenges in offline GCRL described in the previous section, while providing a set of high-quality tasks that serve not only as a toolkit for algorithms research and evaluation, but also as a platform for vividly illustrating the capabilities of offline GCRL with compelling and complex domains.

(1) Realistic and exciting tasks: The tasks should be realistic yet exciting, while posing diverse challenges in offline goal-conditioned RL. Imagine a robot arm watching random movements of a puzzle and then solving it zero-shot at test time, a humanoid robot navigating through a labyrinth, or an agent painting cool pictures using different types of brushes. In this benchmark, we design new tasks such that these kinds of exciting behaviors can naturally emerge when an RL agent properly stitches different trajectory segments together (up to 24; see Table 1) or successfully generalizes.
At the same time, we make sure our tasks exhaustively cover the major challenges in offline goal-conditioned RL, such as long-horizon reasoning, stochastic control, and combinatorial generalization via goal stitching (Section 5).

(2) Appropriate difficulty: The tasks and datasets should have appropriate levels of difficulty to properly evaluate different algorithms; in other words, they should be of high quality for benchmarking. No matter how intricate or compelling a task is, it will fail to provide a useful signal for benchmarking if it is too easy, too hard, unsolvable from the given dataset, or does not clearly distinguish between more and less effective methods. In this work, we carefully curate and adjust the difficulty of each task and dataset such that they can provide effective guidance for algorithms research. For some tasks, we provide multiple versions with varying difficulty, all the way up to tasks that are difficult to solve with current methods, so that the same benchmark can continue to be used to develop new methods in the future. Our rule of thumb is to have, for each type of task, at least one task where the current state-of-the-art offline GCRL method achieves a success rate of 20-30%. This ensures that the task is solvable from the dataset while leaving significant room for improvement.

(3) Controllable datasets: The benchmark should provide tools to control and adjust datasets for scientific research and ablation studies. Verifying the effectiveness of algorithms in real-world problems is surely important. However, for algorithms research, it is equally, if not more, important to provide analysis tools that enable a rigorous, scientific understanding of challenges and algorithms.
Hence, instead of employing fixed, human-collected data, which does not always provide clear benchmarking signals for algorithms research, we choose to focus on simulated environments and synthetic datasets over which we have full control, and we provide tools to reproduce and adjust them easily. We note that many algorithmic ideas in RL that have made a major impact, such as DQN (Mnih et al., 2013), PPO (Schulman et al., 2017), and CQL (Kumar et al., 2020), were originally developed in simulated environments. Even in natural language processing, studies on synthetic, controlled datasets have revealed the mechanisms and limitations of language models with scientific evidence (Allen-Zhu & Li, 2023a), and these insights have transferred to real scenarios (Allen-Zhu & Li, 2023b). We demonstrate how such controllability of datasets reveals challenges and design principles in offline GCRL in Section 8.2.

(4) Minimal compute requirements: The tasks should be designed to minimize unnecessary computational overhead so that as many researchers as possible, including those from small labs and underprivileged backgrounds, can quickly iterate on their new algorithmic ideas. This does not mean that the tasks should be easy (indeed, some of our tasks are very challenging (but solvable)!); it means the benchmark should focus mainly on algorithmic challenges (e.g., not requiring high-resolution image processing). In our benchmark, we provide both state- and pixel-based observations whenever possible, and minimize the size of image observations (up to 64×64×3) to reduce the computational burden. Moreover, we carefully adjust colors, transparency, and lighting for image-based tasks to enable pixel-based control without high-resolution images or multiple views.

(5) High code quality: The reference implementations should be very clean and well-tuned so that researchers can directly build their ideas on our implementations, and the benchmark should be very easy to set up.
Our benchmark environments depend only on MuJoCo (Todorov et al., 2012) and do not require any other dependencies (Table 1). For the reference implementations, we minimize the number of file dependencies for each algorithm, largely following the spirit of the single-file implementations of recent RL libraries (Huang et al., 2022; Tarasov et al., 2023), while maintaining a minimal amount of additional modularity. We also extensively test and tune different design choices and hyperparameters for each offline GCRL algorithm, to the degree that several methods achieve even better performance than originally reported on previous benchmarks (Table 3).

7 ENVIRONMENTS, TASKS, AND DATASETS

We now introduce the environments, tasks, and datasets in our benchmark. They can be broadly categorized into three groups: locomotion, manipulation, and drawing. We provide a separate validation dataset for each dataset, and most tasks support both state- and pixel-based observations. Videos are available at https://seohong.me/projects/ogbench.

Evaluation. Each task in OGBench is accompanied by five pre-defined state-goal pairs for evaluation (Appendix H). Performance is measured by the average success rate across the five evaluation goals. For each pre-defined state-goal pair, we perform multiple rollouts with slightly randomized initial and goal states. In each evaluation episode, a goal $g \in \mathcal{S}$ (which is simply another state; see Section 2) is given to the agent, and the episode immediately terminates when the agent reaches the goal. Each task has its own goal success criterion, which we describe in Appendix G.1.

7.1 LOCOMOTION TASKS

We provide four types of locomotion environments, PointMaze, AntMaze, HumanoidMaze, and AntSoccer, with diverse variants. These environments are designed to test the agent's long-horizon and hierarchical reasoning abilities. They are based on the MuJoCo simulator (Todorov et al., 2012).
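Before turning to the individual environments, the evaluation protocol described above (five pre-defined state-goal pairs per task, several rollouts each with slightly randomized initial and goal states, early termination on success) can be sketched as follows. This is a minimal sketch with a hypothetical `env`/`policy` API, not OGBench's actual interface; the names `evaluate`, `env.reset`, and `env.step` and their signatures are assumptions for illustration.

```python
import numpy as np

def evaluate(env, policy, task_goals, num_rollouts=10, max_steps=1000, seed=0):
    """Average success rate over pre-defined state-goal pairs.

    Hypothetical API: env.reset(task_id=..., seed=...) -> (obs, goal) with a
    slightly randomized initial/goal state, and env.step(action) ->
    (obs, done, success). The episode terminates as soon as the goal is reached.
    """
    rng = np.random.default_rng(seed)
    successes = []
    for task_id in range(len(task_goals)):        # e.g., five state-goal pairs
        for _ in range(num_rollouts):
            obs, goal = env.reset(task_id=task_id, seed=int(rng.integers(2**31)))
            success = False
            for _ in range(max_steps):
                action = policy(obs, goal)        # goal-conditioned policy
                obs, done, success = env.step(action)
                if done or success:               # terminate on goal or timeout
                    break
            successes.append(float(success))
    return float(np.mean(successes))              # success rate in [0, 1]
```

The reported benchmark numbers are this success rate averaged further over random seeds.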
PointMaze (pointmaze), AntMaze (antmaze), and HumanoidMaze (humanoidmaze). Maze navigation is one of the most widely used tasks for benchmarking offline GCRL algorithms. We provide three different types of maze navigation tasks: PointMaze, which involves controlling a 2-D point mass; AntMaze, which involves controlling a quadrupedal Ant agent with 8 degrees of freedom (DoF); and HumanoidMaze, which involves controlling a much more complex 21-DoF Humanoid agent. The aim of these tasks is to control the agent to reach a goal location in the given maze. The agent must learn both high-level maze navigation and low-level locomotion skills that involve high-dimensional control, purely from diverse offline trajectories.

In our benchmark, we substantially extend the original PointMaze and AntMaze tasks proposed by D4RL (Fu et al., 2020). Unlike the original D4RL tasks, which only support Point and Ant agents and do not challenge stitching⁴, stochasticity, or pixel-based control, we support Humanoid control, pixel-based observations, and multi-goal evaluation, while providing more challenging and diverse types of mazes and datasets. The supported maze types are as follows:

medium: This is the smallest maze, with the same layout as the original medium maze in D4RL.

large: This is a larger maze, with the same layout as the original large maze in D4RL.

giant: This is the largest maze, twice the size of large. It has the same size as the previous antmaze-ultra maze by Jiang et al. (2023), but its layout is more challenging and contains longer paths that require up to 3000 environment steps (in the case of Humanoid). This maze is designed to substantially challenge the long-horizon reasoning capability of the agent.

teleport: This maze is specially designed to challenge the agent's ability to handle environment stochasticity.
It has the same size as large, but contains multiple stochastic teleporters. If the agent enters a black hole, it is immediately sent to a randomly chosen white hole. However, since one of the three white holes is a dead end, there is always a risk in taking a teleporter. The agent therefore must learn to avoid the black holes, without being optimistically biased by lucky outcomes.

On these mazes, we collect datasets with a low-level directional policy trained via SAC (Haarnoja et al., 2018) and a high-level waypoint controller. For each maze type, we provide three types of datasets that pose different kinds of challenges (the figures below show example trajectories in these datasets):

navigate: This is the standard dataset, collected by a noisy expert policy that navigates the maze by repeatedly reaching randomly sampled goals.

stitch: This dataset is designed to challenge the agent's stitching ability. It consists of short goal-reaching trajectories, where the length of each trajectory is at most 4 cell units. Hence, the agent must be able to stitch multiple trajectories (up to 8) together to complete the tasks.

explore: This dataset is designed to test whether the agent can learn navigation skills from extremely low-quality (yet high-coverage) data. It consists of random exploratory trajectories, collected by commanding the low-level policy with random directions re-sampled every 10 steps, with a large amount of action noise.

⁴See Ghugare et al. (2024) for the details about this point.

We provide two types of observation modalities:

States: This is the default setting, where the agent has access to the full low-dimensional state representation, including its current x-y position.

Pixels (visual): This requires pure pixel-based control, where the agent only receives 64×64×3 RGB images rendered from a third-person camera viewpoint. Following Park et al.
(2023), we color the floor to enable the agent to infer its location from the images, obviating the need for a potentially expensive memory component. However, unlike Park et al. (2023), which additionally provides proprioceptive information, we do not provide any low-dimensional state information like joint angles; the agent must learn purely from image observations.

AntSoccer (antsoccer). To provide a more diverse type of locomotion task beyond simple maze navigation, we introduce a new locomotion task, AntSoccer. This task involves controlling an Ant agent to dribble a soccer ball. It is inspired by the quadruped-fetch task in the DeepMind Control Suite (Tassa et al., 2018). AntSoccer is significantly harder than AntMaze because the agent must also carefully control the ball while navigating the environment. We provide two maze types: arena, which is an open space without walls, and medium, which is the same maze as the medium one in AntMaze. For datasets, we provide navigate and stitch. The navigate datasets consist of trajectories where the agent repeatedly approaches the ball and dribbles it to random locations. The stitch datasets consist of two different types of trajectories, maze navigation without the ball and dribbling with the ball near the agent, so that stitching is required to complete the full task. AntSoccer only supports state-based observations.

7.2 MANIPULATION TASKS

We provide a manipulation suite with three types of robotic manipulation tasks, Cube, Scene, and Puzzle, with diverse difficulties and complexities. They are designed to test the agent's object manipulation, sequential generalization, and combinatorial generalization abilities. These environments are based on the MuJoCo simulator (Todorov et al., 2012) and a 6-DoF UR5e robot arm (Zakka et al., 2022). On these tasks, we provide "play"-style datasets (play) (Lynch et al., 2019) collected by non-Markovian expert policies with temporally correlated noise.
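One common way to generate such temporally correlated noise is an AR(1) (Ornstein-Uhlenbeck-style) process, where each step's perturbation mixes the previous one with fresh Gaussian noise. This is an illustrative sketch of the general technique, an assumption on our part, not necessarily the exact noise process used by the OGBench scripted policies:

```python
import numpy as np

def correlated_noise(n_steps, action_dim, sigma=0.2, beta=0.9, seed=0):
    """AR(1) action noise: noise[t] = beta * noise[t-1] + scaled fresh noise.

    beta controls the correlation length (beta=0 gives i.i.d. noise); the
    sqrt(1 - beta**2) factor keeps the stationary standard deviation at sigma,
    so changing beta changes smoothness without changing noise magnitude.
    """
    rng = np.random.default_rng(seed)
    noise = np.zeros((n_steps, action_dim))
    for t in range(1, n_steps):
        noise[t] = (beta * noise[t - 1]
                    + np.sqrt(1 - beta**2) * sigma * rng.standard_normal(action_dim))
    return noise
```

Adding such noise to an expert's actions yields smooth, sustained deviations (the "play"-style wandering) rather than independent per-step jitter.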
To support more diverse types of research (e.g., dataset ablation studies), we additionally provide noisier datasets (noisy) collected by Markovian expert policies with uncorrelated Gaussian noise, which we describe in Appendix D.

For all manipulation tasks, we support both state-based observations and pixel-based observations with 64×64×3 RGB camera images. For pixel observations, we adjust colors and make the arm transparent to ensure full observability. The transparent arm in the figure might appear challenging, but the colors and transparency are carefully adjusted to minimize difficulties in visual perception, to the extent that some methods achieve even better performance with pixels (see Table 2).

Cube (cube). This task involves pick-and-place manipulation of cube blocks, whose goal is to control a robot arm to arrange cubes into designated configurations. We provide four variants, single, double, triple, and quadruple, with different numbers (1-4) of cubes. We provide "play"-style datasets collected by a scripted policy that repeatedly picks a random block and places it in other random locations or on another block. At test time, the agent is given goal configurations that require moving, stacking, swapping, or permuting cube blocks. Hence, the agent must learn not only generalizable multi-object pick-and-place behaviors from unstructured random trajectories in the dataset, but also long-term plans to achieve the tasks (e.g., permuting blocks requires non-trivial sequential and logical reasoning).

Figure: unlock drawer, open drawer, put cube in drawer, close drawer.

Scene (scene). This task is designed to challenge the sequential, long-horizon reasoning capabilities of the agent.
It involves manipulating diverse everyday objects, such as a cube block, a window, a drawer, and two button locks, where pressing a button toggles the lock status of the corresponding object (the drawer or window). We provide "play"-style datasets collected by scripted policies that randomly interact with these objects. At test time, the agent is commanded to arrange the objects into a desired configuration. Evaluation tasks require a significant degree of sequential reasoning: for instance, some tasks require unlocking the drawer, opening it, putting the cube in the drawer, and closing it again (see the figure above), and the longest task involves eight atomic behaviors. Hence, the agent must be able to plan and sequentially combine learned manipulation skills.

Puzzle (puzzle). This task is designed to test the combinatorial generalization abilities of the agent. It requires solving the Lights Out puzzle⁵ with a robot arm. The puzzle consists of a two-dimensional array of buttons (e.g., a 4×6 grid), where pressing a button toggles the colors of the pressed button and the buttons adjacent to it (typically four, except on the edges and corners; see videos). The goal is to achieve a desired configuration of colors (e.g., turning all the buttons blue) by pressing an appropriate combination of buttons. Since these buttons are implemented in the MuJoCo simulator, the agent must control a robot arm to physically press the buttons. We provide four levels of difficulty, 3x3, 4x4, 4x5, and 4x6, with different grid sizes. The datasets are collected by a scripted policy that randomly presses buttons in arbitrary sequences. Given the enormous state space of this task (with up to 2²⁴ = 16,777,216 distinct button states), the agent must achieve combinatorial generalization while mastering low-level continuous control.
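The Lights Out toggle rule described above can be sketched in a few lines. This is purely illustrative of the abstract button logic; in the actual benchmark the buttons are physical objects in MuJoCo that the arm must press, and the function name `press` is our own:

```python
def press(board, row, col):
    """Toggle the pressed button and its orthogonal neighbors on a Lights Out
    board (list of lists of 0/1). Edge and corner presses toggle fewer cells,
    since out-of-bounds neighbors are skipped."""
    rows, cols = len(board), len(board[0])
    new = [r[:] for r in board]
    for dr, dc in [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]:
        r, c = row + dr, col + dc
        if 0 <= r < rows and 0 <= c < cols:
            new[r][c] ^= 1  # XOR flips the button's color
    return new
```

Because each cell is binary, a 4×6 grid has 2²⁴ reachable configurations, which is where the state-space count above comes from; pressing the same button twice cancels out, so solving a goal configuration amounts to finding the right *set* of buttons to press.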
Some evaluation tasks in the hardest puzzle require pressing more than 20 buttons, which also substantially challenges the long-horizon reasoning capabilities of the agent. This might sound very challenging (and it is!), but we provide different levels and enough data to ensure that they provide meaningful research signals and are solvable (see the results in Section 8). 7.3 DRAWING TASKS Powderworld (powderworld). To provide more diverse tasks beyond robotic locomotion or manipulation, we introduce a drawing task, Powderworld (Frans & Isola, 2023) [6], which presents unique challenges with extremely high intrinsic dimensionality. The goal of Powderworld is to draw a target picture on a 32×32 grid using different types of powder brushes, where each powder brush has a distinct physical property corresponding to a unique element. For example, the sand brush falls down and piles up, and the fire brush burns combustible elements like plant. We provide three versions of tasks, easy, medium, and hard, with different numbers of available elements (2, 5, and 8 elements, respectively). The datasets are collected by a random policy that keeps drawing arbitrary shapes with random brushes. This Powderworld task poses unique challenges that are distinct from the other tasks in the benchmark. First, the agent must deal with the high intrinsic dimensionality of the states, which presents a substantial challenge in representation learning. Second, since the transitions of powder elements are mostly stochastic and unpredictable, the agent must be able to correctly handle environment stochasticity. Third, the agent must achieve a high degree of generalization and sequential reasoning through a deep understanding of the physics, in order to complete symmetrical, orderly test-time drawing tasks from random, chaotic data.
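As a loose illustration of the element-specific dynamics described above (not the actual Powderworld implementation, whose rules are richer and partly stochastic), a toy "falling sand" update on a small grid might look like:

```python
import numpy as np

EMPTY, SAND = 0, 1

def step_sand(grid: np.ndarray) -> np.ndarray:
    """Toy update: each sand cell falls one row if the cell below is empty.
    (Illustrative only; real Powderworld also models interactions between
    elements, such as fire burning plant.)"""
    out = grid.copy()
    n_rows, n_cols = grid.shape
    for r in range(n_rows - 2, -1, -1):  # scan bottom-up: one-cell fall per step
        for c in range(n_cols):
            if out[r, c] == SAND and out[r + 1, c] == EMPTY:
                out[r + 1, c], out[r, c] = SAND, EMPTY
    return out

grid = np.zeros((3, 3), dtype=np.int8)
grid[0, 1] = SAND                  # one grain at the top of the middle column
grid = step_sand(step_sand(grid))  # after two steps the grain reaches the floor
assert grid[2, 1] == SAND and grid.sum() == 1
```

Even this deterministic toy hints at why the state has high intrinsic dimensionality: the final picture depends on the entire history of brush strokes, not just the most recent one.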
[5] https://en.wikipedia.org/wiki/Lights_Out_(game)
[6] Play here: https://kvfrans.com/powder/
(Figure 2 panels: success rates (%) on state- and pixel-based tasks, grouped into the Locomotion, Manipulation, Exploratory, Stitching, Stochastic, and Drawing categories.)
Figure 2: Benchmarking offline GCRL methods. We report the performances of six offline GCRL methods (GCBC, GCIVL, GCIQL, QRL, CRL, and HIQL), aggregated by different dataset categories (see Table 6 for the category list). The results are averaged over the tasks in each category, and then over 8 seeds (4 seeds for pixel-based tasks). Error bars denote 95% bootstrap confidence intervals. See Table 2 for the full results. In general, HIQL (a method that involves hierarchical policy extraction) tends to achieve strong performance across the board. Among the non-hierarchical methods, CRL tends to work best in locomotion tasks, and GCIVL and GCIQL tend to work best in the others.
We now present and discuss the benchmarking results of existing offline goal-conditioned RL algorithms on OGBench. 8.1 ALGORITHMS We benchmark six representative offline GCRL algorithms: goal-conditioned behavioral cloning (GCBC) (Lynch et al., 2019; Ghosh et al., 2021), goal-conditioned implicit {V, Q}-learning (GCIVL and GCIQL) (Kostrikov et al., 2022; Park et al., 2023), quasimetric RL (QRL) (Wang et al., 2023), contrastive RL (CRL) (Eysenbach et al., 2022), and hierarchical implicit Q-learning (HIQL) (Park et al., 2023). GCBC is the simplest goal-conditioned behavioral cloning method. GCIVL and GCIQL are offline GCRL algorithms that approximate the optimal value function using expectile regression (Newey & Powell, 1987). QRL is a non-traditional GCRL method that fits a quasimetric value function with a dual objective. CRL is a one-step RL algorithm that fits a Monte Carlo value function via contrastive learning and performs one-step policy improvement.
HIQL is a hierarchical RL method that extracts a two-level hierarchical policy from a single GCIVL value function. For benchmarking, we perform a similar amount of hyperparameter tuning for each method to ensure a fair comparison. We refer the reader to Appendices F and G for the details. 8.2 BENCHMARKING RESULTS AND Q&AS We present the full benchmarking results in Table 2 in Appendix B. Figure 2 summarizes the results by showing performances grouped by different task categories. Performance is measured by the average (binary) success rate on the five test-time goals of each task. We train the agents for 1M gradient steps (500K for pixel-based tasks) and average the results over 8 seeds (4 seeds for pixel-based tasks). We discuss the results through Q&As (we refer to Appendix B for more Q&As). Q: Which method works best in general? A: No single method dominates the others across all categories in Figure 2. However, HIQL (a method that involves hierarchical policy extraction) tends to achieve particularly strong performance among the benchmarked methods, especially in locomotion and visual manipulation tasks. Among the non-hierarchical methods, CRL tends to work best in locomotion tasks and GCIQL tends to work best in manipulation tasks. In the drawing tasks, GCIVL performs the best. Q: There seem to be a lot of datasets. What should I use for my research? A: For general offline GCRL algorithms research, we recommend starting with the more regular datasets, such as antmaze-{large, giant}-navigate, humanoidmaze-medium-navigate, cube-{single, double}-play, scene-play, and puzzle-3x3-play. From there, depending on the performance on these tasks, try harder versions of them or more challenging tasks, such as humanoidmaze-giant, antsoccer, puzzle-{4x4, 4x5, 4x6}, and powderworld. We also provide more specialized datasets that pose specific challenges in offline GCRL (Section 5).
For stitching, try the stitch datasets in the locomotion suite as well as complex manipulation tasks that require stitching (e.g., puzzle). For long-horizon reasoning, consider humanoidmaze-giant, which has the longest episode length, and puzzle-4x6, which has the most semantic steps. For stochastic control, try antmaze-teleport, which is specifically designed to challenge optimistically biased methods, and powderworld, which has unpredictable, stochastic dynamics. For learning from highly suboptimal data, consider antmaze-explore as well as the noisy datasets in the manipulation suite, which feature high suboptimality and high coverage. 9 RESEARCH OPPORTUNITIES In this section and Appendix C, we discuss potential research ideas and open questions. Be the first to solve unsolved tasks! While all environments in OGBench have at least one variant that current methods can solve to some degree, there are still a number of challenging tasks on which no existing method achieves non-trivial performance, such as humanoidmaze-giant, cube-triple, puzzle-4x5, powderworld-hard, and more. We ensure that sufficient data is available for these tasks (estimated from the amount needed to solve their easier versions). We invite researchers to take on these challenges and push the limits of offline GCRL with better algorithms. How can we develop a policy that generalizes well at test time? In our experiments, we found hierarchical RL methods (e.g., HIQL) to work especially well in several tasks. Among several potential explanations, we hypothesize that this is mainly because hierarchical RL reduces learning complexity by having two policies specialized for different subproblems, which makes both policies generalize better at evaluation time. After all, test-time generalization is known to be one of the major bottlenecks in offline RL (Park et al., 2024a). But are hierarchies really necessary to achieve good test-time generalization?
Can we develop a non-hierarchical method that enjoys the same benefit by exploiting the subgoal structure of offline GCRL? This would be especially beneficial, not just because such a method would be simpler, but also because it could yield better, unified representations that can serve as a foundation model for fine-tuning. Can we develop a method that works well across all categories? Our benchmarking results reveal that no method consistently performs best across the board. HIQL tends to achieve strong performance but struggles in pixel-based locomotion and state-based manipulation. GCIQL shows strong performance in state-based manipulation but struggles in locomotion. CRL exhibits the opposite trend: it excels in locomotion but underperforms in manipulation. Is there a way to combine only the strengths of these methods to achieve the best performance across all types of tasks? In this work, we introduced OGBench, a new benchmark designed to advance algorithms research in offline goal-conditioned RL. With the experimental results, we now revisit the very first question posed in this paper: Why offline goal-conditioned RL? We hypothesize that offline goal-conditioned RL holds significant potential as a recipe for general-purpose RL pre-training, as it yields richer and more diverse behaviors than (arguably more prevalent) generative pre-training, such as behavioral cloning and next-token prediction. Our experiments, albeit preliminary, show that even current offline GCRL algorithms can to some extent acquire effective policies for exceptionally long-horizon tasks entirely from sparse rewards, using data that is highly suboptimal. These results are not limited to toy domains, but show up across a range of realistic simulated robotics and game-like settings in our benchmark.
While generative objectives might capture the data distribution, offline GCRL can learn policies that actually achieve complex outcomes (such as beating a puzzle game) that could not be achieved simply by copying random data. However, current offline GCRL algorithms also have limitations. As shown in our results, they often struggle with long-horizon, high-dimensional tasks and those that require stitching, and no single method consistently outperforms the others across all tasks. This suggests that we have not yet found the ideal algorithm that can fully realize the promise of offline GCRL as general-purpose RL pre-training. The first step toward finding such an algorithm is to set up a solid benchmark that sufficiently challenges the limits of offline GCRL from diverse perspectives. We believe OGBench provides this foundation and will lead to the development of performant, scalable offline GCRL algorithms that enable building foundation models for general-purpose behaviors. ACKNOWLEDGMENTS We thank Kevin Zakka for providing the initial codebase for the manipulation environments and helping with MuJoCo implementations, Vivek Myers for providing a JAX-based QRL implementation, and Colin Li, along with the members of the RAIL lab, for helpful discussions. This work was partly supported by the Korea Foundation for Advanced Studies (KFAS), the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 2146752, ONR under N00014-20-1-2383 and N00014-22-1-2773, and Qualcomm. This research used the Savio computational cluster resource provided by the Berkeley Research Computing program at UC Berkeley. REPRODUCIBILITY STATEMENT We provide the full implementation details in Appendix G. We provide the code as well as the exact command-line flags to reproduce the entire benchmark table, datasets, and expert policies at https://github.com/seohongpark/ogbench. REFERENCES Zeyuan Allen-Zhu and Yuanzhi Li.
Physics of language models: Part 1, context-free grammar. arXiv, abs/2305.13673, 2023a.
Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manipulation. arXiv, abs/2309.14402, 2023b.
Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Neural Information Processing Systems (NeurIPS), 2017.
Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv, abs/1607.06450, 2016.
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/jax-ml/jax.
G. Brockman, Vicki Cheung, Ludwig Pettersson, J. Schneider, John Schulman, Jie Tang, and W. Zaremba. OpenAI Gym. arXiv, abs/1606.01540, 2016.
Elliot Chane-Sane, Cordelia Schmid, and Ivan Laptev. Goal-conditioned reinforcement learning with imagined subgoals. In International Conference on Machine Learning (ICML), 2021.
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems (RSS), 2023.
Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. In Neural Information Processing Systems (NeurIPS), 1992.
Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning (ICML), 2018.
Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine. Search on the replay buffer: Bridging planning and reinforcement learning.
In Neural Information Processing Systems (NeurIPS), 2019.
Benjamin Eysenbach, Tianjun Zhang, Ruslan Salakhutdinov, and Sergey Levine. Contrastive learning as goal-conditioned reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2022.
Kuan Fang, Patrick Yin, Ashvin Nair, and Sergey Levine. Planning to practice: Efficient online fine-tuning by composing goals in latent space. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022.
Kevin Frans and Phillip Isola. Powderworld: A platform for understanding generalization via rich task distributions. In International Conference on Learning Representations (ICLR), 2023.
Justin Fu, Aviral Kumar, Ofir Nachum, G. Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv, abs/2004.07219, 2020.
Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2021.
Dibya Ghosh, Abhishek Gupta, Ashwin Reddy, Justin Fu, Coline Devin, Benjamin Eysenbach, and Sergey Levine. Learning to reach goals via iterated supervised learning. In International Conference on Learning Representations (ICLR), 2021.
Dibya Ghosh, Chethan Bhateja, and Sergey Levine. Reinforcement learning from passive data via latent intentions. In International Conference on Machine Learning (ICML), 2023.
Raj Ghugare, Matthieu Geist, Glen Berseth, and Benjamin Eysenbach. Closing the gap between TD learning and supervised learning: A generalisation point of view. In International Conference on Learning Representations (ICLR), 2024.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML), 2018.
Joey Hejna, Jensen Gao, and Dorsa Sadigh.
Distance weighted supervised learning for offline interaction data. In International Conference on Machine Learning (ICML), 2023.
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv, abs/1606.08415, 2016.
Christopher Hoang, Sungryull Sohn, Jongwook Choi, Wilka Carvalho, and Honglak Lee. Successor feature landmarks for long-horizon goal-conditioned reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2021.
Mineui Hong, Minjae Kang, and Songhwai Oh. Diffused task-agnostic milestone planner. In Neural Information Processing Systems (NeurIPS), 2023.
Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G.M. Araújo. CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research (JMLR), 23(274):1-18, 2022.
Zhiao Huang, Fangchen Liu, and Hao Su. Mapping state space using landmarks for universal goal reaching. In Neural Information Processing Systems (NeurIPS), 2019.
Zhengyao Jiang, Tianjun Zhang, Michael Janner, Yueying Li, Tim Rocktäschel, Edward Grefenstette, and Yuandong Tian. Efficient planning in a compact latent action space. In International Conference on Learning Representations (ICLR), 2023.
Leslie Pack Kaelbling. Learning to achieve goals. In International Joint Conference on Artificial Intelligence (IJCAI), 1993.
Junsu Kim, Younggyo Seo, and Jinwoo Shin. Landmark-guided subgoal generation in hierarchical reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2021.
Junsu Kim, Seohong Park, and Sergey Levine. Unsupervised-to-online reinforcement learning. arXiv, abs/2408.14785, 2024.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
Ilya Kostrikov, Ashvin Nair, and Sergey Levine.
Offline reinforcement learning with implicit Q-learning. In International Conference on Learning Representations (ICLR), 2022.
Aviral Kumar, Aurick Zhou, G. Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2020.
Jinning Li, Chen Tang, Masayoshi Tomizuka, and Wei Zhan. Hierarchical planning through goal-conditioned offline reinforcement learning. IEEE Robotics and Automation Letters (RA-L), 7(4):10216-10223, 2022.
Bo Liu, Yihao Feng, Qiang Liu, and Peter Stone. Metric residual network for sample efficient goal-conditioned reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI), 2023.
Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, and Pierre Sermanet. Learning latent plans from play. In Conference on Robot Learning (CoRL), 2019.
Yecheng Jason Ma, Jason Yan, Dinesh Jayaraman, and Osbert Bastani. How far I'll go: Offline goal-conditioned reinforcement learning via f-advantage regression. In Neural Information Processing Systems (NeurIPS), 2022.
Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. VIP: Towards universal visual reward and representation via value-implicit pre-training. In International Conference on Learning Representations (ICLR), 2023.
Zhuang Ma and Michael Collins. Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning. arXiv, abs/1312.5602, 2013.
Vivek Myers, Chongyi Zheng, Anca Dragan, Sergey Levine, and Benjamin Eysenbach. Learning temporal distances: Contrastive successor features can provide a metric structure for decision-making.
In International Conference on Machine Learning (ICML), 2024.
Soroush Nasiriany, Vitchyr H. Pong, Steven Lin, and Sergey Levine. Planning with goal-conditioned policies. In Neural Information Processing Systems (NeurIPS), 2019.
Whitney Newey and James L. Powell. Asymmetric least squares estimation and testing. Econometrica, 55:819-847, 1987.
Seohong Park, Dibya Ghosh, Benjamin Eysenbach, and Sergey Levine. HIQL: Offline goal-conditioned RL with latent states as actions. In Neural Information Processing Systems (NeurIPS), 2023.
Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is value learning really the main bottleneck in offline RL? In Neural Information Processing Systems (NeurIPS), 2024a.
Seohong Park, Tobias Kreiman, and Sergey Levine. Foundation policies with Hilbert representations. In International Conference on Machine Learning (ICML), 2024b.
Seohong Park, Oleh Rybkin, and Sergey Levine. METRA: Scalable unsupervised RL with metric-aware abstraction. In International Conference on Learning Representations (ICLR), 2024c.
Seohong Park, Qiyang Li, and Sergey Levine. Flow Q-learning. arXiv, abs/2502.02538, 2025.
Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv, abs/1910.00177, 2019.
Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In International Conference on Machine Learning (ICML), 2007.
Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv, abs/1802.09464, 2018.
Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation.
In International Conference on Learning Representations (ICLR), 2018.
Tom Schaul, Dan Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning (ICML), 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv, abs/1707.06347, 2017.
Harshit Sikchi, Rohan Chitnis, Ahmed Touati, Alborz Geramifard, Amy Zhang, and Scott Niekum. Score models for offline goal-conditioned reinforcement learning. In International Conference on Learning Representations (ICLR), 2024.
Denis Tarasov, Alexander Nikulin, Dmitry Akimov, Vladislav Kurenkov, and Sergey Kolesnikov. CORL: Research-oriented deep offline reinforcement learning library. In Neural Information Processing Systems (NeurIPS), 2023.
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy P. Lillicrap, and Martin A. Riedmiller. DeepMind Control Suite. arXiv, abs/1801.00690, 2018.
Jordan Terry, Benjamin Black, Nathaniel Grammel, Mario Jayakumar, Ananth Hari, Ryan Sullivan, Luis S Santos, Clemens Dieffendahl, Caroline Horsch, Rodrigo Perez-Vicente, et al. PettingZoo: Gym for multi-agent reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2021.
Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012.
Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv, abs/2407.17032, 2024.
Mianchu Wang, Rui Yang, Xi Chen, and Meng Fang. GOPlan: Goal-conditioned offline reinforcement learning by planning with learned models.
Transactions on Machine Learning Research (TMLR), 2024.
Tongzhou Wang and Phillip Isola. Improved representation of asymmetrical distances with interval quasimetric embeddings. arXiv, abs/2211.15120, 2022.
Tongzhou Wang, Antonio Torralba, Phillip Isola, and Amy Zhang. Optimal goal-reaching reinforcement learning via quasimetric learning. In International Conference on Machine Learning (ICML), 2023.
Rui Yang, Yiming Lu, Wenzhe Li, Hao Sun, Meng Fang, Yali Du, Xiu Li, Lei Han, and Chongjie Zhang. Rethinking goal-conditioned supervised learning and its connection to offline RL. In International Conference on Learning Representations (ICLR), 2022.
Rui Yang, Yong Lin, Xiaoteng Ma, Haotian Hu, Chongjie Zhang, and T. Zhang. What is essential for unseen goal generalization of offline goal-conditioned RL? In International Conference on Machine Learning (ICML), 2023.
Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning (CoRL), 2019.
Kevin Zakka, Yuval Tassa, and MuJoCo Menagerie Contributors. MuJoCo Menagerie: A collection of high-quality simulation models for MuJoCo, 2022. URL http://github.com/google-deepmind/mujoco_menagerie.
Zilai Zeng, Ce Zhang, Shijie Wang, and Chen Sun. Goal-conditioned predictive coding for offline reinforcement learning. In Neural Information Processing Systems (NeurIPS), 2023.
Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems (RSS), 2023.
Chongyi Zheng, Benjamin Eysenbach, Homer Walke, Patrick Yin, Kuan Fang, Ruslan Salakhutdinov, and Sergey Levine. Stabilizing contrastive RL: Techniques for offline goal reaching. In International Conference on Learning Representations (ICLR), 2024a.
Chongyi Zheng, Ruslan Salakhutdinov, and Benjamin Eysenbach. Contrastive difference predictive coding. In International Conference on Learning Representations (ICLR), 2024b.

Table 2: Full benchmark table. We report each method's average (binary) success rate (%) across the five test-time goals on each task. The results are averaged over 8 seeds (4 seeds for pixel-based tasks), and we report the standard deviations after the ± sign. Numbers at or above 95% of the best value in the row are highlighted in bold.

| Task | GCBC | GCIVL | GCIQL | QRL | CRL | HIQL |
|---|---|---|---|---|---|---|
| pointmaze-medium-navigate-v0 | 9 ±6 | 63 ±6 | 53 ±8 | **82 ±5** | 29 ±7 | **79 ±5** |
| pointmaze-large-navigate-v0 | 29 ±6 | 45 ±5 | 34 ±3 | **86 ±9** | 39 ±7 | 58 ±5 |
| pointmaze-giant-navigate-v0 | 1 ±2 | 0 ±0 | 0 ±0 | **68 ±7** | 27 ±10 | 46 ±9 |
| pointmaze-teleport-navigate-v0 | 25 ±3 | **45 ±3** | 24 ±7 | 4 ±4 | 24 ±6 | 18 ±4 |
| pointmaze-medium-stitch-v0 | 23 ±18 | 70 ±14 | 21 ±9 | **80 ±12** | 0 ±1 | 74 ±6 |
| pointmaze-large-stitch-v0 | 7 ±5 | 12 ±6 | 31 ±2 | **84 ±15** | 0 ±0 | 13 ±6 |
| pointmaze-giant-stitch-v0 | 0 ±0 | 0 ±0 | 0 ±0 | **50 ±8** | 0 ±0 | 0 ±0 |
| pointmaze-teleport-stitch-v0 | 31 ±9 | **44 ±2** | 25 ±3 | 9 ±5 | 4 ±3 | 34 ±4 |
| antmaze-medium-navigate-v0 | 29 ±4 | 72 ±8 | 71 ±4 | 88 ±3 | **95 ±1** | **96 ±1** |
| antmaze-large-navigate-v0 | 24 ±2 | 16 ±5 | 34 ±4 | 75 ±6 | 83 ±4 | **91 ±2** |
| antmaze-giant-navigate-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 14 ±3 | 16 ±3 | **65 ±5** |
| antmaze-teleport-navigate-v0 | 26 ±3 | 39 ±3 | 35 ±5 | 35 ±5 | **53 ±2** | 42 ±3 |
| antmaze-medium-stitch-v0 | 45 ±11 | 44 ±6 | 29 ±6 | 59 ±7 | 53 ±6 | **94 ±1** |
| antmaze-large-stitch-v0 | 3 ±3 | 18 ±2 | 7 ±2 | 18 ±2 | 11 ±2 | **67 ±5** |
| antmaze-giant-stitch-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | **2 ±2** |
| antmaze-teleport-stitch-v0 | 31 ±6 | **39 ±3** | 17 ±2 | 24 ±5 | 31 ±4 | 36 ±2 |
| antmaze-medium-explore-v0 | 2 ±1 | 19 ±3 | 13 ±2 | 1 ±1 | 3 ±2 | **37 ±10** |
| antmaze-large-explore-v0 | 0 ±0 | **10 ±3** | 0 ±0 | 0 ±0 | 0 ±0 | 4 ±5 |
| antmaze-teleport-explore-v0 | 2 ±1 | 32 ±2 | 7 ±3 | 2 ±2 | 20 ±2 | **34 ±15** |
| humanoidmaze-medium-navigate-v0 | 8 ±2 | 24 ±2 | 27 ±2 | 21 ±8 | 60 ±4 | **89 ±2** |
| humanoidmaze-large-navigate-v0 | 1 ±0 | 2 ±1 | 2 ±1 | 5 ±1 | 24 ±4 | **49 ±4** |
| humanoidmaze-giant-navigate-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 1 ±0 | 3 ±2 | **12 ±4** |
| humanoidmaze-medium-stitch-v0 | 29 ±5 | 12 ±2 | 12 ±3 | 18 ±2 | 36 ±2 | **88 ±2** |
| humanoidmaze-large-stitch-v0 | 6 ±3 | 1 ±1 | 0 ±0 | 3 ±1 | 4 ±1 | **28 ±3** |
| humanoidmaze-giant-stitch-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | **3 ±2** |
| antsoccer-arena-navigate-v0 | 5 ±1 | 47 ±3 | 50 ±2 | 8 ±2 | 23 ±2 | **58 ±2** |
| antsoccer-medium-navigate-v0 | 2 ±0 | 4 ±1 | 7 ±1 | 2 ±2 | 3 ±1 | **13 ±2** |
| antsoccer-arena-stitch-v0 | **24 ±8** | 21 ±3 | 2 ±0 | 1 ±1 | 1 ±0 | 15 ±1 |
| antsoccer-medium-stitch-v0 | 2 ±1 | 1 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | **4 ±1** |
| visual-antmaze-medium-navigate-v0 | 11 ±2 | 22 ±2 | 11 ±1 | 0 ±0 | **94 ±1** | **93 ±4** |
| visual-antmaze-large-navigate-v0 | 4 ±0 | 5 ±1 | 4 ±1 | 0 ±0 | **84 ±1** | 53 ±9 |
| visual-antmaze-giant-navigate-v0 | 0 ±0 | 1 ±1 | 0 ±0 | 0 ±0 | **47 ±2** | 6 ±4 |
| visual-antmaze-teleport-navigate-v0 | 5 ±1 | 8 ±1 | 6 ±1 | 6 ±3 | **48 ±2** | 37 ±2 |
| visual-antmaze-medium-stitch-v0 | 67 ±4 | 6 ±2 | 2 ±0 | 0 ±0 | 69 ±2 | **87 ±2** |
| visual-antmaze-large-stitch-v0 | 24 ±3 | 1 ±1 | 0 ±0 | 1 ±1 | 11 ±3 | **28 ±2** |
| visual-antmaze-giant-stitch-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| visual-antmaze-teleport-stitch-v0 | 32 ±3 | 1 ±1 | 1 ±0 | 1 ±2 | 32 ±6 | **37 ±4** |
| visual-antmaze-medium-explore-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| visual-antmaze-large-explore-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| visual-antmaze-teleport-explore-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 1 ±0 | **19 ±8** |
| visual-humanoidmaze-medium-navigate-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | **1 ±0** | 0 ±0 |
| visual-humanoidmaze-large-navigate-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| visual-humanoidmaze-giant-navigate-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| visual-humanoidmaze-medium-stitch-v0 | **1 ±0** | 0 ±0 | 0 ±0 | 0 ±0 | **1 ±0** | 0 ±0 |
| visual-humanoidmaze-large-stitch-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| visual-humanoidmaze-giant-stitch-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| cube-single-play-v0 | 6 ±2 | 53 ±4 | **68 ±6** | 5 ±1 | 19 ±2 | 15 ±3 |
| cube-double-play-v0 | 1 ±1 | 36 ±3 | **40 ±5** | 1 ±0 | 10 ±2 | 6 ±2 |
| cube-triple-play-v0 | 1 ±1 | 1 ±0 | 3 ±1 | 0 ±0 | **4 ±1** | 3 ±1 |
| cube-quadruple-play-v0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| scene-play-v0 | 5 ±1 | 42 ±4 | **51 ±4** | 5 ±1 | 19 ±2 | 38 ±3 |
| puzzle-3x3-play-v0 | 2 ±0 | 6 ±1 | **95 ±1** | 1 ±0 | 3 ±1 | 12 ±2 |
| puzzle-4x4-play-v0 | 0 ±0 | 13 ±2 | **26 ±3** | 0 ±0 | 0 ±0 | 7 ±2 |
| puzzle-4x5-play-v0 | 0 ±0 | 7 ±1 | **14 ±1** | 0 ±0 | 1 ±0 | 4 ±1 |
| puzzle-4x6-play-v0 | 0 ±0 | 10 ±2 | **12 ±1** | 0 ±0 | 4 ±1 | 3 ±1 |
| visual-cube-single-play-v0 | 5 ±1 | 60 ±5 | 30 ±5 | 41 ±15 | 31 ±15 | **89 ±0** |
| visual-cube-double-play-v0 | 1 ±1 | 10 ±2 | 1 ±1 | 5 ±0 | 2 ±1 | **39 ±2** |
| visual-cube-triple-play-v0 | 15 ±2 | 14 ±2 | 15 ±1 | 16 ±1 | 17 ±2 | **21 ±0** |
| visual-cube-quadruple-play-v0 | 8 ±1 | 0 ±0 | 7 ±1 | 5 ±1 | 4 ±1 | **14 ±1** |
| visual-scene-play-v0 | 12 ±2 | 25 ±3 | 12 ±2 | 10 ±1 | 11 ±2 | **49 ±4** |
| visual-puzzle-3x3-play-v0 | 0 ±0 | 21 ±1 | 1 ±2 | 1 ±1 | 0 ±0 | **73 ±8** |
| visual-puzzle-4x4-play-v0 | 10 ±1 | **60 ±5** | 16 ±4 | 0 ±0 | 10 ±6 | **60 ±41** |
| visual-puzzle-4x5-play-v0 | 5 ±2 | **17 ±1** | 7 ±2 | 0 ±0 | 6 ±1 | 13 ±9 |
| visual-puzzle-4x6-play-v0 | 2 ±1 | **15 ±1** | 2 ±1 | 0 ±0 | 3 ±1 | 9 ±6 |
| powderworld-easy-play-v0 | 0 ±0 | **99 ±1** | 93 ±5 | 12 ±2 | 22 ±5 | 33 ±9 |
| powderworld-medium-play-v0 | 1 ±1 | **50 ±4** | 16 ±5 | 3 ±1 | 1 ±1 | 22 ±14 |
| powderworld-hard-play-v0 | 0 ±0 | **4 ±3** | 0 ±0 | 0 ±0 | 0 ±0 | 1 ±1 |

Table 3: How good are our reference implementations? Our implementations generally achieve better performance than previously reported ones.

D4RL antmaze-large-diverse-v2:
| Method | Previously Reported Performance | Ours |
|---|---|---|
| GCBC | 20 (Park et al., 2023) | 41 ±7 |
| GCIVL | 51 (Park et al., 2023) | 64 ±8 |
| GCIQL | 30 (Zeng et al., 2023) | 64 ±10 |
| QRL | 52 (Zheng et al., 2024b) [7] | 37 ±14 |
| CRL | 54 (Eysenbach et al., 2022) | 79 ±6 |
| HIQL | 88 (Park et al., 2023) | 87 ±3 |

D4RL antmaze-large-play-v2:
| Method | Previously Reported Performance | Ours |
|---|---|---|
| GCBC | 23 (Park et al., 2023) | 39 ±4 |
| GCIVL | 57 (Park et al., 2023) | 58 ±8 |
| GCIQL | 40 (Zeng et al., 2023) | 55 ±11 |
| QRL | 53 (Zheng et al., 2024b) [7] | 38 ±8 |
| CRL | 49 (Eysenbach et al., 2022) | 74 ±4 |
| HIQL | 86 (Park et al., 2023) | 87 ±2 |

A LIMITATIONS
While OGBench covers a number of challenges in offline goal-conditioned RL, such as long-horizon reasoning, goal stitching, and stochastic control, there exist other challenges that our benchmark does not address. For example, all OGBench tasks assume that the environment dynamics remain the same between the training and evaluation environments. Also, although several OGBench tasks (e.g., Cube, Puzzle, and Powderworld) require unseen goal generalization to some degree, our tasks do not specifically test visual generalization to entirely new objects.
Finally, we have made several trade-offs to reduce computational cost and to focus the benchmark on algorithms research, at the expense of some realism (e.g., the use of the transparent arm in manipulation environments, the use of synthetic (yet fully controllable) datasets, etc.). Nonetheless, we believe OGBench can spur the development of performant offline GCRL algorithms, which can then help researchers develop scalable data-driven unsupervised RL pre-training methods for real-world tasks. B ADDITIONAL RESULTS AND Q&AS We present the full benchmarking results in Table 2, and provide additional Q&As in this section. Q: Which methods are good at goal stitching? A: To see this, we can compare the performances on the navigate and stitch datasets from the same locomotion task in Table 2. The results suggest that, as expected, full RL-based methods like HIQL (i.e., methods that fit the optimal value function Q*) are better at stitching than one-step RL methods like CRL (i.e., methods that fit the behavioral value function Q^β). For example, in visual locomotion tasks, the relative performance between HIQL and CRL is reversed on the stitch datasets (Figure 2). Q: Which methods are good at handling stochasticity? A: For this, we can compare the performances on the large and teleport mazes in Table 2, which have the same maze size, but only the latter involves stochastic transitions that incur risk. Table 2 shows that value-only methods like HIQL and QRL (i.e., methods that do not have a separate Q function), which are optimistically biased in stochastic environments, struggle relatively more in the stochastic teleport tasks. In contrast, CRL is generally robust to environment stochasticity, likely because it fits a Monte Carlo value function. Q: Which methods are good at handling pixel-based observations?
A: Although state-based and pixel-based observations generally provide the same amount of information, several methods struggle to handle image observations due to additional representational challenges. We can gauge how well a method addresses these representational challenges by comparing its performances on corresponding state- and pixel-based tasks. Table 2 shows that CRL is notably robust to the difference in input modalities, likely because it is based on a pure representation learning objective. HIQL also achieves strong performance in pixel-based tasks, especially visual manipulation tasks. However, these methods are still not perfect at handling image observations; for example, HIQL achieves relatively weak performance on image drawing tasks. We suspect this is due to the difficulty of learning low-dimensional subgoal representations from states of high intrinsic dimensionality.

Q: How good are our reference implementations?

A: We compare the performance of our reference implementations with previously reported numbers on one of the most commonly used tasks in prior work, D4RL antmaze-large (Fu et al., 2020). Table 3 shows the comparison results, with the corresponding numbers taken from the prior works (Eysenbach et al., 2022; Park et al., 2023; Zeng et al., 2023; Zheng et al., 2024b).⁷ The results suggest that

⁷ We note that Zheng et al. (2024b) use a different evaluation scheme based on the maximum performance over evaluation epochs. We report the average performance over the last three evaluation epochs (see Appendix G).

Table 4: Do not use single-goal evaluation! Using only a single state-goal pair (a common practice when using D4RL tasks for offline GCRL) can potentially lead to inaccurate conclusions about offline GCRL methods. OGBench always uses multi-goal evaluation. See how the rank between GCIQL and QRL is reversed with multi-goal evaluation on the same antmaze-large maze.
Dataset (evaluation scheme)                           GCBC   GCIVL  GCIQL  QRL    CRL    HIQL
D4RL antmaze-large-diverse-v2 (single-goal)           41±7   64±8   64±10  37±14  79±6   87±3
D4RL antmaze-large-play-v2 (single-goal)              39±4   58±8   55±11  38±8   74±4   87±2
OGBench antmaze-large-navigate-v0 (multi-goal, ours)  24±2   16±5   34±4   75±6   83±4   91±2

our implementations generally achieve better performance than previously reported results, sometimes significantly surpassing them (e.g., CRL).

Q: Why can't I just use D4RL Ant Maze instead of the OGBench one?

A: D4RL Ant Maze is an excellent task for benchmarking offline RL algorithms. However, it is limited for benchmarking offline goal-conditioned RL algorithms because it only involves a single, fixed state-goal pair, and the datasets are tailored to this specific task (Fu et al., 2020). In contrast, OGBench supports multi-goal evaluation (and provides much more diverse types of tasks and datasets!). To empirically demonstrate this difference, we compare the benchmarking results on D4RL antmaze-large-{diverse, play} and OGBench antmaze-large-navigate in Table 4. The table suggests that single-goal evaluation is indeed limited and potentially prone to inaccurate conclusions: for example, see how the ranking between GCIQL and QRL is reversed with multi-goal evaluation on the same antmaze-large maze. Moreover, the performance differences between methods are more pronounced in OGBench Ant Maze, showing that OGBench provides clearer research signals.

Q: Have you found any insights on data collection for offline GCRL?

Figure 3: Datasets must be noisy enough. (The figure plots the success rate (%) of GCIQL on cube-single and puzzle-3x3 as a function of the expert action noise σ ∈ {0, 0.01, 0.03, 0.1, 0.3}.)

A: One of the main features of OGBench is that every task is accompanied by a reproducible and controllable data-generation script. Here, we show one example of how this controllability can lead to practical insights and raise open research questions.
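As a minimal, hypothetical sketch of the knob such a script exposes (the `expert_action` callable and list-valued actions here are illustrative placeholders, not OGBench's actual API):

```python
import random

def noisy_expert_actions(expert_action, observations, sigma):
    """Apply i.i.d. Gaussian noise of scale sigma to each expert action.

    `expert_action` maps an observation to a list of action values; sigma is
    the ablated noise level that controls the resulting state coverage.
    """
    return [[a + random.gauss(0.0, sigma) for a in expert_action(obs)]
            for obs in observations]
```

Setting `sigma=0` reproduces the noiseless expert exactly; larger values trade per-trajectory optimality for broader state coverage.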
In Figure 3, we ablate the strength of the Gaussian action noise σ added to expert actions on two manipulation tasks (cube-single-noisy and puzzle-3x3-noisy) and measure how this affects performance. The results are quite striking: they show that having the right amount of noise (i.e., state coverage) is crucial for achieving good performance. For example, the performance drops from 99% to 6% if there is no noise in expert actions, even on the most basic cube pick-and-place task. This suggests that we may need to prioritize coverage much more than optimality when collecting datasets for offline GCRL in the real world as well, and that failing to do so may lead to (surprising) failures in learning. Like action noise, we believe there are many other important properties of datasets that significantly affect performance, and that our fully transparent, controllable data-generation scripts can facilitate such scientific studies.

C MORE CONCRETE RESEARCH QUESTIONS

Here, we list additional, more concrete research questions that researchers may use as a starting point for research in offline GCRL:

Why is Point Maze so hard? Table 2 shows that Point Maze is surprisingly hard, sometimes even harder than Ant Maze for some methods. Why is this the case? Moreover, only in Point Maze does QRL significantly outperform the other methods. What causes this difference, and are there any insights we can take from these results?

How should we train subgoal representations? Somewhat surprisingly, HIQL struggles much more with state-based observations than with pixel-based observations on manipulation tasks. We suspect this is related to subgoal representations, given that HIQL uses an additional learning signal from the policy loss to further train subgoal representations only in pixel-based environments (which we found does not help in state-based environments).
HIQL uses a value function-based subgoal representation (Appendix F), but is there a better, more stable way to learn subgoal representations for hierarchical RL and planning?

Do we really need the full power of RL? While learning the optimal $Q^*$ function is in principle better than learning the behavioral $Q^\beta$ function, CRL (which fits $Q^\beta$) significantly outperforms GCIQL (which fits $Q^*$) on locomotion tasks. Why is this the case? Is it a problem with expectile regression in GCIQL or with temporal difference learning itself? In contrast, in manipulation environments, the results suggest the opposite: GCIQL is much better than CRL. Does this mean we do need $Q^*$ in these tasks? Or can they be solved even with $Q^\beta$ if we use a better behavioral value learning technique than the binary NCE objective in CRL?

Why can't we use random goals when training policies? When training goal-conditioned policies, we found that it is usually better to sample (policy) goals only from the future states in the current trajectory (except on stitch or explore datasets; see Table 10). The fact that this works better even in Scene and Puzzle (which require goal stitching) is a bit surprising, because it means that the policy can still perform goal stitching to some degree even without being explicitly trained on the test-time state-goal pairs. At the same time, it is rather unsatisfying, because this ability to stitch goals entirely depends on the seemingly magical generalization capabilities of neural networks. Is there a way to train a goal-conditioned policy with random goals while maintaining performance, so that it can perform goal stitching in a principled manner?

How can we combine expressive policies with GCRL? In Appendix D, we show that current offline GCRL methods often struggle with datasets collected by non-Markovian policies in manipulation environments.
Handling non-Markovian trajectory data is indeed one of the major challenges in behavioral cloning, for which many recent BC-based methods have been proposed (Zhao et al., 2023; Chi et al., 2023). How can we incorporate these recent advances in behavioral cloning into offline GCRL?

D ADDITIONAL DATASETS

For the manipulation tasks (Cube, Scene, and Puzzle), in addition to the main play datasets, we additionally provide noisy datasets that can be useful for other types of research (e.g., ablation studies on datasets, comparing performances on non-Markovian and Markovian datasets, etc.). The main difference is that the play datasets are collected by open-loop, non-Markovian expert policies with temporally correlated noise, while the noisy datasets are collected by closed-loop, Markovian expert policies with larger, uncorrelated Gaussian noise. Hence, the play datasets generally look more natural than the noisy datasets, but the latter have higher state coverage (videos).

Table 5 shows the full benchmark results on both the play and noisy datasets in the manipulation suite. The results suggest that the performances on these two dataset types are mostly similar, but several methods struggle to handle the narrower and non-Markovian trajectories in the play datasets (e.g., GCIQL almost perfectly solves cube-single-noisy but struggles on cube-single-play).

E PRIOR WORK IN GOAL-CONDITIONED RL

The problem of reaching any goal from any state has long been considered one of the central problems in reinforcement learning and sequential decision making (Kaelbling, 1993; Schaul et al., 2015; Andrychowicz et al., 2017), owing to its unsupervised nature, simplicity, and generality. Goal-conditioned RL has many unique features that distinguish it from other (multi-task) RL problems, such as the presence of recursive subgoal structures, metric structures, and probabilistic interpretations.
These intriguing properties have led to the development of diverse families of online and offline GCRL algorithms based on hindsight relabeling (Andrychowicz et al., 2017), hierarchical learning (Dayan & Hinton, 1992; Chane-Sane et al., 2021; Li et al., 2022; Park et al., 2023), planning (Savinov et al., 2018; Eysenbach et al., 2019; Nasiriany et al., 2019; Huang et al., 2019; Hoang et al., 2021; Kim et al., 2021; Wang et al., 2024), metric learning (Wang et al., 2023; Park et al., 2024b; Myers et al., 2024), dual optimization (Ma et al., 2022; 2023; Sikchi et al., 2024), weighted behavioral cloning (Yang et al., 2022; 2023; Hejna et al., 2023), generative modeling (Zeng et al., 2023; Hong et al., 2023), and contrastive learning (Eysenbach et al., 2022; Zheng et al., 2024a;b).

Table 5: Full benchmarking results on additional noisy manipulation datasets. The table shows the performances on both the play and noisy datasets in the manipulation suite. We report each method's average (binary) success rate (%) across the five test-time goals on each task. The results are averaged over 8 seeds (4 seeds for pixel-based tasks), and we report the standard deviations after the ± sign. Numbers at or above 95% of the best value in the row are highlighted in bold.

Dataset                          GCBC   GCIVL  GCIQL  QRL    CRL    HIQL
cube (play):
cube-single-play-v0              6±2    53±4   68±6   5±1    19±2   15±3
cube-double-play-v0              1±1    36±3   40±5   1±0    10±2   6±2
cube-triple-play-v0              1±1    1±0    3±1    0±0    4±1    3±1
cube-quadruple-play-v0           0±0    0±0    0±0    0±0    0±0    0±0
cube (noisy):
cube-single-noisy-v0             8±3    71±9   99±1   25±6   38±2   41±6
cube-double-noisy-v0             1±1    14±3   23±3   3±1    2±1    2±1
cube-triple-noisy-v0             1±1    9±1    2±1    1±0    3±1    2±1
cube-quadruple-noisy-v0          0±0    0±0    0±0    0±0    0±0    0±0
scene (play):
scene-play-v0                    5±1    42±4   51±4   5±1    19±2   38±3
scene (noisy):
scene-noisy-v0                   1±1    26±5   26±2   9±2    1±1    25±4
puzzle (play):
puzzle-3x3-play-v0               2±0    6±1    95±1   1±0    3±1    12±2
puzzle-4x4-play-v0               0±0    13±2   26±3   0±0    0±0    7±2
puzzle-4x5-play-v0               0±0    7±1    14±1   0±0    1±0    4±1
puzzle-4x6-play-v0               0±0    10±2   12±1   0±0    4±1    3±1
puzzle (noisy):
puzzle-3x3-noisy-v0              1±0    42±19  94±3   0±0    30±6   51±11
puzzle-4x4-noisy-v0              0±0    20±3   29±7   0±0    0±0    16±4
puzzle-4x5-noisy-v0              0±0    19±0   19±0   0±0    3±2    5±1
puzzle-4x6-noisy-v0              0±0    17±2   18±2   0±0    6±3    2±1
visual-cube (play):
visual-cube-single-play-v0       5±1    60±5   30±5   41±15  31±15  89±0
visual-cube-double-play-v0       1±1    10±2   1±1    5±0    2±1    39±2
visual-cube-triple-play-v0       15±2   14±2   15±1   16±1   17±2   21±0
visual-cube-quadruple-play-v0    8±1    0±0    7±1    5±1    4±1    14±1
visual-cube (noisy):
visual-cube-single-noisy-v0      14±3   75±3   48±3   10±5   39±30  99±0
visual-cube-double-noisy-v0      5±1    17±4   22±2   6±2    6±3    59±3
visual-cube-triple-noisy-v0      16±1   18±1   12±1   9±4    16±1   23±2
visual-cube-quadruple-noisy-v0   9±0    0±0    2±2    0±0    8±2    12±8
visual-scene (play):
visual-scene-play-v0             12±2   25±3   12±2   10±1   11±2   49±4
visual-scene (noisy):
visual-scene-noisy-v0            13±2   23±2   12±4   2±0    15±2   50±1
visual-puzzle (play):
visual-puzzle-3x3-play-v0        0±0    21±1   1±2    1±1    0±0    73±8
visual-puzzle-4x4-play-v0        10±1   60±5   16±4   0±0    10±6   60±41
visual-puzzle-4x5-play-v0        5±2    17±1   7±2    0±0    6±1    13±9
visual-puzzle-4x6-play-v0        2±1    15±1   2±1    0±0    3±1    9±6
visual-puzzle (noisy):
visual-puzzle-3x3-noisy-v0       1±1    20±0   26±4   0±0    1±1    70±6
visual-puzzle-4x4-noisy-v0       7±3    47±3   49±7   0±0    6±2    84±4
visual-puzzle-4x5-noisy-v0       6±1    14±10  19±0   0±0    7±1    14±10
visual-puzzle-4x6-noisy-v0       2±1    12±8   17±1   0±0    2±1    14±2
In this work, we also consider the problem of offline goal-conditioned RL; however, instead of proposing a new algorithm, we introduce a new benchmark and reference implementations to facilitate and advance algorithms research in offline GCRL.

F OFFLINE GCRL ALGORITHMS

In this section, we describe the six offline GCRL methods used for benchmarking in detail. We first define four goal-sampling distributions that correspond to the current state, uniform future states, geometric future states, and random states, respectively:

- $p^\mathcal{D}_\mathrm{cur}(g \mid s)$ denotes the Dirac delta distribution at $s$ (i.e., $g$ is always set to $s$).
- $p^\mathcal{D}_\mathrm{traj}(g \mid s)$ denotes the uniform future state distribution, defined as follows: assuming $s = s_t$ in a trajectory $\tau = (s_0, a_0, s_1, \ldots, s_T)$,⁸ we sample an index $k$ from the uniform distribution $\mathrm{Unif}(\min(t+1, T-1), T-1)$ (inclusive) and set $g = s_k$.
- $p^\mathcal{D}_\mathrm{geom}(g \mid s)$ denotes the truncated geometric future state distribution, defined as follows: assuming $s = s_t$ in a trajectory $\tau = (s_0, a_0, s_1, \ldots, s_T)$, we sample an index $k$ from the geometric distribution $\mathrm{Geom}(1 - \gamma)$ (whose support starts from $1$) and set $g = s_{\min(t+k, T-1)}$.
- $p^\mathcal{D}_\mathrm{rand}(g)$ denotes the uniform state distribution over the dataset $\mathcal{D}$.

Additionally, $p^\mathcal{D}_\mathrm{mixed}(g \mid s)$ denotes a mixture of these four goal-sampling distributions with a mixture ratio defined by hyperparameters, $p^\mathcal{D}(\cdot)$ simply denotes the uniform distribution over the dataset, and we sometimes use $p^\mathcal{D}_\cdot(\cdot \mid s, a)$ instead of $p^\mathcal{D}_\cdot(\cdot \mid s)$ to denote the distribution conditioned on a state-action pair.

⁸ If there are multiple such $(\tau, t)$ tuples in the dataset, consider the uniform mixture of them.

Goal-conditioned behavioral cloning (GCBC). GCBC (Lynch et al., 2019; Ghosh et al., 2021) simply performs behavioral cloning using future states in the same trajectory as goals. GCBC maximizes the following objective to train a goal-conditioned policy $\pi(a \mid s, g)$:
$$J_\mathrm{GCBC}(\pi) = \mathbb{E}_{(s,a) \sim p^\mathcal{D}(s,a),\, g \sim p^\mathcal{D}_\mathrm{traj}(g \mid s)}[\log \pi(a \mid s, g)]. \quad (1)$$

Goal-conditioned implicit {V, Q}-learning (GCIVL and GCIQL). GCIVL and GCIQL are goal-conditioned variants of implicit Q-learning (IQL) (Kostrikov et al., 2022), an offline RL algorithm that fits the optimal value functions ($V^*$ or $Q^*$) using expectile regression (Newey & Powell, 1987). GCIQL is a straightforward goal-conditioned variant of IQL, and GCIVL is the $V$-only variant introduced by Park et al. (2023). GCIVL fits a value function $V(s, g)$ by minimizing the following loss:

$$\mathcal{L}_\mathrm{GCIVL}(V) = \mathbb{E}_{(s, s') \sim p^\mathcal{D}(s, s'),\, g \sim p^\mathcal{D}_\mathrm{mixed}(g \mid s)}\left[\ell^2_\kappa\left(r(s, g) + \gamma \bar{V}(s', g) - V(s, g)\right)\right], \quad (2)$$

where $r(s, g) = \mathbb{1}_{\{g\}}(s) - 1$ denotes the $(-1, 0)$-sparse goal-conditioned reward function, $\bar{V}$ denotes the target value function (Mnih et al., 2013), and $\ell^2_\kappa(x) = |\kappa - \mathbb{1}(x < 0)| x^2$ denotes the expectile loss with expectile $\kappa$. GCIQL fits both $V(s, g)$ and $Q(s, a, g)$ by jointly minimizing the following losses:

$$\mathcal{L}^V_\mathrm{GCIQL}(V) = \mathbb{E}_{(s,a) \sim p^\mathcal{D}(s,a),\, g \sim p^\mathcal{D}_\mathrm{mixed}(g \mid s)}\left[\ell^2_\kappa\left(\bar{Q}(s, a, g) - V(s, g)\right)\right], \quad (3)$$
$$\mathcal{L}^Q_\mathrm{GCIQL}(Q) = \mathbb{E}_{(s,a,s') \sim p^\mathcal{D}(s,a,s'),\, g \sim p^\mathcal{D}_\mathrm{mixed}(g \mid s)}\left[\left(r(s, g) + \gamma V(s', g) - Q(s, a, g)\right)^2\right], \quad (4)$$

where $\bar{Q}$ denotes the target Q function (Mnih et al., 2013). We note that GCIVL is optimistically biased in stochastic environments, whereas GCIQL is unbiased (Kostrikov et al., 2022; Park et al., 2023).

To extract a policy from the learned value functions, we can use either advantage-weighted regression (AWR) (Peters & Schaal, 2007; Peng et al., 2019) or behavior-constrained deep deterministic policy gradient (DDPG+BC) (Fujimoto & Gu, 2021). GCIVL uses the following value-only variant of the AWR objective (Park et al., 2023):

$$J^V_\mathrm{AWR}(\pi) = \mathbb{E}_{(s,a,s') \sim p^\mathcal{D}(s,a,s'),\, g \sim p^\mathcal{D}_\mathrm{mixed}(g \mid s)}\left[e^{\alpha(V(s', g) - V(s, g))} \log \pi(a \mid s, g)\right], \quad (5)$$

where $\alpha$ is the temperature hyperparameter.
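A scalar sketch of the expectile loss $\ell^2_\kappa$ used above (a hypothetical standalone version; the reference implementations apply it elementwise to batched TD errors):

```python
def expectile_loss(x, kappa):
    """Asymmetric squared loss: weights positive errors by kappa and
    negative errors by 1 - kappa. With kappa > 0.5, the minimizer
    approaches an upper expectile, which is how GCIVL/GCIQL approximate
    a maximum over actions without an explicit max."""
    weight = kappa if x >= 0 else 1.0 - kappa
    return weight * x * x
```

For example, with κ = 0.9, an underestimation error of the same magnitude is penalized nine times more heavily than an overestimation error, biasing the value function toward high-value (near-optimal) backups.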
In our experiments, GCIQL mainly uses the following DDPG+BC objective, which is known to perform better than AWR (Park et al., 2024a):

$$J_\mathrm{DDPG+BC}(\pi) = \mathbb{E}_{(s,a) \sim p^\mathcal{D}(s,a),\, g \sim p^\mathcal{D}_\mathrm{mixed}(g \mid s)}\left[Q(s, \pi_\mu(s, g), g) + \alpha \log \pi(a \mid s, g)\right], \quad (6)$$

where $\pi_\mu(s, g) = \mathbb{E}_{a \sim \pi(a \mid s, g)}[a]$. In discrete-action environments, GCIQL uses the following Q version of AWR:

$$J^Q_\mathrm{AWR}(\pi) = \mathbb{E}_{(s,a,s') \sim p^\mathcal{D}(s,a,s'),\, g \sim p^\mathcal{D}_\mathrm{mixed}(g \mid s)}\left[e^{\alpha(Q(s, a, g) - V(s, g))} \log \pi(a \mid s, g)\right]. \quad (7)$$

In practice, we use standard double-value learning for GCIVL and GCIQL (Park et al., 2023) and Q normalization for DDPG+BC (Fujimoto & Gu, 2021).

Quasimetric RL (QRL). QRL (Wang et al., 2023) is a goal-conditioned value learning algorithm based on quasimetric learning, where a quasimetric is an asymmetric metric. In deterministic environments, the shortest path length between two states, $d^*(s, g)$, is equivalent to the negated optimal undiscounted goal-conditioned value function under the $(-1, 0)$-sparse reward function (Wang et al., 2023): $V^*(s, g) = -d^*(s, g)$. The main idea of QRL is to explicitly leverage the quasimetric property (i.e., the triangle inequality) of shortest path lengths, namely $d^*(s, w) + d^*(w, g) \geq d^*(s, g)$ for any $s, w, g \in \mathcal{S}$, by modeling it with a quasimetric network architecture like MRN (Liu et al., 2023) or IQE (Wang & Isola, 2022). Concretely, QRL solves the following constrained optimization problem:

$$\max_d\ \mathbb{E}_{s \sim p^\mathcal{D}(s),\, g \sim p^\mathcal{D}_\mathrm{rand}(g)}[d(s, g)] \quad (8)$$
$$\text{s.t.}\ \mathbb{E}_{(s, s') \sim p^\mathcal{D}(s, s')}\left[(d(s, s') - 1)^2\right] \leq \varepsilon^2, \quad (9)$$

where $d(s, g)$ is a quasimetric distance function (e.g., an IQE network) and $\varepsilon$ is a hyperparameter that controls the strength of the constraint. To extract a policy from the value function $V(s, g) = -d(s, g)$, QRL uses the value-only AWR loss (Equation (5)) in discrete-action MDPs.
In continuous-action MDPs, QRL additionally fits a latent dynamics model $f(\phi(s), a): \mathcal{Z} \times \mathcal{A} \to \mathcal{Z}$ (Wang et al., 2023), where $\mathcal{Z}$ denotes a latent space and $\phi(s): \mathcal{S} \to \mathcal{Z}$ denotes the representation function used in the quasimetric distance function, namely $d(s, g) = d(\phi(s), \phi(g))$ (e.g., $\phi$ is the interval representation function in IQE). The dynamics loss is as follows:

$$\mathcal{L}_\mathrm{dyn}(f) = \mathbb{E}_{(s,a,s') \sim p^\mathcal{D}(s,a,s')}\left[d(\phi(s'), f(\phi(s), a)) + d(f(\phi(s), a), \phi(s'))\right], \quad (10)$$

where QRL jointly trains both $d$ and $f$ without stop-gradients. Based on the dynamics model $f$, QRL maximizes the following DDPG+BC-like objective:

$$J_\mathrm{DDPG+BC}(\pi) = \mathbb{E}_{(s,a) \sim p^\mathcal{D}(s,a),\, g \sim p^\mathcal{D}_\mathrm{mixed}(g \mid s)}\left[-d(f(\phi(s), \pi_\mu(s, g)), \phi(g)) + \alpha \log \pi(a \mid s, g)\right], \quad (11)$$

where we use the same notation as in Equation (6). In practice, we employ a softplus loss shaping for the quasimetric loss and delta prediction for the dynamics model, as in Wang et al. (2023).

Contrastive RL (CRL). CRL (Eysenbach et al., 2022) is a one-step GCRL algorithm that first trains a Monte Carlo goal-conditioned value function using contrastive learning and then performs a one-step policy improvement. CRL maximizes the following binary NCE objective (Ma & Collins, 2018) with respect to $f(s, a, g): \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$:

$$J_\mathrm{CRL}(f) = \mathbb{E}_{(s,a) \sim p^\mathcal{D}(s,a),\, g \sim p^\mathcal{D}_\mathrm{geom}(g \mid s, a),\, g' \sim p^\mathcal{D}_\mathrm{rand}(g')}[\log \sigma(f(s, a, g)) + \log(1 - \sigma(f(s, a, g')))], \quad (12)$$

where $\sigma: \mathbb{R} \to (0, 1)$ denotes the sigmoid function. The optimal solution to this objective is $f(s, a, g) = \log(p^\mathcal{D}_\mathrm{geom}(g \mid s, a) / p^\mathcal{D}_\mathrm{rand}(g))$. Given the equivalence between the geometric future goal distribution ($p^\mathcal{D}_\mathrm{geom}$) and the Monte Carlo goal-conditioned Q function ($Q^\mathrm{MC}$) under the $(0, 1)$-sparse reward function $r(s, g) = \mathbb{1}_{\{g\}}(s)$ (Eysenbach et al., 2022), we get the following relation: $f(s, a, g) = \log Q^\mathrm{MC}(s, a, g) + C(g)$, where $C$ is a function that depends only on $g$.
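A scalar sketch of the binary NCE objective in Equation (12), written as a loss to minimize for one positive pair $(s, a, g)$ and one negative pair $(s, a, g')$ (an illustrative standalone version, not the batched reference implementation):

```python
import math

def sigmoid(x):
    """Logistic sigmoid mapping a critic score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_nce_loss(f_pos, f_neg):
    """Negative binary NCE objective: the critic is pushed to score a
    goal sampled from the discounted future (f_pos) above a goal sampled
    uniformly from the dataset (f_neg)."""
    return -(math.log(sigmoid(f_pos)) + math.log(1.0 - sigmoid(f_neg)))
```

At the optimum, the critic score recovers the log density ratio between the geometric future-goal distribution and the random-goal distribution, which is what ties the contrastive objective to the Monte Carlo value function.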
In practice, $f$ is parameterized as $f(s, a, g) = \phi(s, a)^\top \psi(g) / \sqrt{d}$ with $\phi: \mathcal{S} \times \mathcal{A} \to \mathcal{Z} = \mathbb{R}^d$ and $\psi: \mathcal{S} \to \mathcal{Z} = \mathbb{R}^d$ (note that this inner-product parameterization is universal (Park et al., 2024c)), and we use the future goals of the other states in the same batch as $g'$. We also employ the double-value learning technique for $f$ (Eysenbach et al., 2022). For policy extraction, in continuous-action MDPs, we employ DDPG+BC (Equation (6)) using $f$ in place of $Q$. In discrete-action MDPs, we use AWR (Equation (5)) using $(f, f^V)$ in place of $(Q, V)$, where we additionally train a contrastive value function $f^V(s, g): \mathcal{S} \times \mathcal{S} \to \mathbb{R}$ using

$$J^V_\mathrm{CRL}(f^V) = \mathbb{E}_{s \sim p^\mathcal{D}(s),\, g \sim p^\mathcal{D}_\mathrm{geom}(g \mid s),\, g' \sim p^\mathcal{D}_\mathrm{rand}(g')}[\log \sigma(f^V(s, g)) + \log(1 - \sigma(f^V(s, g')))], \quad (13)$$

with a similar inner-product parameterization for $f^V$.

Hierarchical implicit Q-learning (HIQL). HIQL (Park et al., 2023) is an offline GCRL algorithm that extracts two policies from a single goal-conditioned value function. HIQL first trains GCIVL (Equation (2)) with a parameterized value function defined as $V(s, g) = V(s, \phi(s, g))$, where $\phi: \mathcal{S} \times \mathcal{S} \to \mathcal{Z}$ serves as a (state-dependent) subgoal representation function. Based on the GCIVL value function, HIQL extracts a high-level policy $\pi^h: \mathcal{S} \times \mathcal{S} \to \Delta(\mathcal{Z})$ and a low-level policy $\pi^\ell: \mathcal{S} \times \mathcal{Z} \to \Delta(\mathcal{A})$ with the following AWR-like objectives:

$$J^h_\mathrm{HIQL}(\pi^h) = \mathbb{E}_{(s_t, s_{t+k}) \sim p^\mathcal{D},\, g \sim p^\mathcal{D}_\mathrm{mixed}(g \mid s_t)}\left[e^{\alpha(V(s_{t+k}, g) - V(s_t, g))} \log \pi^h(\phi(s_t, s_{t+k}) \mid s_t, g)\right],$$
$$J^\ell_\mathrm{HIQL}(\pi^\ell) = \mathbb{E}_{(s_t, a_t, s_{t+1}, s_{t+k}) \sim p^\mathcal{D}}\left[e^{\alpha(V(s_{t+1}, s_{t+k}) - V(s_t, s_{t+k}))} \log \pi^\ell(a_t \mid s_t, \phi(s_t, s_{t+k}))\right],$$

where we omit the arguments in $p^\mathcal{D}$ and $k$ denotes a hyperparameter corresponding to the subgoal step. For simplicity, we ignore some edge cases in the objectives above (e.g., when $t + k$ exceeds the trajectory boundary, in which case we truncate); we refer to Park et al. (2023) or our code for the full details.
Intuitively, the high-level policy predicts the representation of the optimal k-step subgoal, and the low-level policy predicts the optimal action based on the predicted subgoal representation. In practice, following Park et al. (2023), we use the double-value learning technique, normalize the output of $\phi$, and allow gradients to flow from the low-level AWR loss into $\phi$ (only) in pixel-based environments.

G IMPLEMENTATION DETAILS

We provide the full implementation details in this section. We release the code, as well as the exact command-line flags to reproduce the entire benchmark table, datasets, and expert policies, at https://github.com/seohongpark/ogbench.

G.1 TASKS AND DATASETS

In this section, we provide further information about our tasks and datasets. We provide the basic specifications of the environments and datasets in Tables 7 and 8.

Locomotion tasks. For OGBench Ant Maze and Ant Soccer, we adopt the Ant model from D4RL Ant Maze (Fu et al., 2020) (which is based on the Ant in OpenAI Gym (Brockman et al., 2016; Towers et al., 2024), but with a more restricted joint range) and the soccer ball model from the DeepMind Control suite (Tassa et al., 2018). For Humanoid Maze, we adopt the Humanoid model from the DeepMind Control suite (Tassa et al., 2018).

To collect datasets, we train expert low-level directional (Ant Maze and Humanoid Maze) or goal-reaching (Ant Soccer) policies using SAC with dense reward functions for 400K (Ant Maze), 40M (Humanoid Maze), or 12M (Ant Soccer) steps. For Point Maze, we use a scripted directional expert policy. When collecting datasets, we add Gaussian noise with a standard deviation of 0.5 (pointmaze), 1.0 (explore), or 0.2 (others) to expert actions.

The success criteria for the locomotion tasks are based only on the distance between the agent (or the ball in Ant Soccer) and the goal location. In particular, joint positions are not considered when determining success, following previous works (Fu et al., 2020; Park et al., 2023).
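This position-only success check can be sketched as follows (the tolerance radius here is an illustrative placeholder, not the benchmark's actual value):

```python
import math

def is_success(agent_xy, goal_xy, radius=0.5):
    """Goal-reaching check for locomotion tasks: only the x-y position of
    the agent (or the ball in Ant Soccer) is compared against the goal;
    joint configurations are ignored entirely."""
    dx = agent_xy[0] - goal_xy[0]
    dy = agent_xy[1] - goal_xy[1]
    return math.hypot(dx, dy) <= radius
```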
Manipulation tasks. We adopt the UR5e robot arm and Robotiq 2F-85 gripper models from MuJoCo Menagerie (Zakka et al., 2022), and the drawer, window, and button box models from Meta-World (Yu et al., 2019). The robot is end-effector controlled with a 5-D action space, whose dimensions correspond to the displacements in the x position, y position, z position, gripper yaw, and gripper opening.

In all manipulation tasks, we place invisible walls to prevent objects from moving into areas beyond the robot arm's reach. These invisible walls also prevent the cube objects from moving outside the camera viewpoint in pixel-based manipulation environments. However, since some blind spots still exist in visual-scene even with the walls, we further filter out such rare cases from the trajectories to completely prevent ambiguous camera observations.

In Puzzle, not every button configuration is reachable from the initial state: while the 3×3, 4×5, and 4×6 puzzles do have this full-reachability property, the 4×4 puzzle does not. This can be seen by computing the rank of the nm × nm button effect matrix over 𝔽₂ (the field with two elements), where n and m denote the numbers of rows and columns, respectively. In our tasks, we ensure that every test-time goal is solvable. Also, we note that the maximum value of the minimum number of button presses needed to reach one state from another in each puzzle is 9 (3×3 puzzle), 7 (4×4 puzzle), 20 (4×5 puzzle), or 24 (4×6 puzzle). Each puzzle environment contains at least one evaluation goal that requires this maximum number of presses.

The play datasets are collected by open-loop, non-Markovian scripted policies, and the noisy datasets are collected by closed-loop, Markovian scripted policies. For the play datasets, we add temporally correlated action noise to the expert actions to enhance state coverage.
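The 𝔽₂ rank computation mentioned above for Puzzle can be sketched as follows, assuming a Lights-Out-style button effect in which each press toggles the pressed button and its orthogonal neighbors (an assumption on our part, though it matches the rank argument in the text):

```python
def lights_out_rank(n, m):
    """Rank over F2 of the nm x nm button effect matrix for an n x m grid,
    where pressing button (i, j) toggles itself and its orthogonal neighbors.
    Each matrix row is stored as a bitmask over the nm buttons."""
    rows = []
    for i in range(n):
        for j in range(m):
            mask = 0
            for di, dj in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < m:
                    mask |= 1 << (a * m + b)
            rows.append(mask)
    # Gaussian elimination over F2 (XOR is addition mod 2).
    rank = 0
    for col in range(n * m):
        pivot = next((r for r in range(rank, len(rows))
                      if (rows[r] >> col) & 1), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank and (rows[r] >> col) & 1:
                rows[r] ^= rows[rank]
        rank += 1
    return rank
```

Under this toggle pattern, the 3×3 effect matrix is full rank (every configuration reachable), while the 4×4 matrix is rank-deficient, matching the reachability claim above.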
For the noisy datasets, we first sample the degree of action noise at the beginning of each episode and collect a trajectory with the chosen amount of (time-independent) Gaussian action noise. This ensures high coverage while retaining a sufficient number of near-optimal trajectories.

The success criteria for the manipulation tasks are based only on the object configurations; the arm pose is not considered when determining success. For cubes, only the distances between their current positions and the goal positions are considered; their orientations are ignored.

Drawing tasks. We modify the original Powderworld environment (Frans & Isola, 2023) to make it offline and goal-conditioned. We also re-implement Powderworld (originally implemented in PyTorch) in NumPy to remove the dependency on PyTorch. We provide three versions of Powderworld tasks: powderworld-easy uses two elements (plant and stone), powderworld-medium uses five elements (sand, water, fire, plant, and stone), and powderworld-hard uses eight elements (sand, water, fire, plant, stone, gas, wood, and ice).

An action in Powderworld corresponds to drawing a 4×4-sized square with a specific element brush on the 32×32-sized board. Since naïvely implementing this atomic action would require up to 512-dimensional discrete actions, we split it into three sequential 8-dimensional actions that correspond to element selection, x-coordinate selection, and y-coordinate selection. To ensure full observability, we add three additional dimensions containing information about the currently selected element and x coordinate to the original 32×32×3-dimensional image, which results in a 32×32×6-dimensional observation space. When the agent selects an invalid action (which can only happen in powderworld-{easy, medium}, which have fewer than 8 elements), the environment instead uses a randomly sampled valid action.
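The three-step factorization can be illustrated with a hypothetical index layout (the actual ordering in OGBench may differ): the 512 = 8 × 8 × 8 atomic draw actions decompose into an element choice, an x choice, and a y choice, each 8-way.

```python
def split_atomic_action(atomic):
    """Decompose an atomic draw action in [0, 512) into three sequential
    8-way discrete actions: element, x-coordinate, and y-coordinate."""
    assert 0 <= atomic < 512  # 8 elements x 8 x-positions x 8 y-positions
    element = atomic // 64
    x = (atomic // 8) % 8
    y = atomic % 8
    return element, x, y

def merge_actions(element, x, y):
    """Inverse of split_atomic_action."""
    return element * 64 + x * 8 + y
```

This keeps each of the three sequential action spaces 8-dimensional instead of requiring a single 512-dimensional one.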
The datasets are collected by a scripted policy that randomly draws squares and lines or fills the entire board with randomly selected brushes. With a probability of 0.5, it performs a random action (i.e., places a random element at a randomly sampled position).

For the success criterion of evaluation goals, we use the following procedure to allow for some tolerance: for each pixel in the goal image, we check whether the current image has a matching pixel shifted by up to one pixel in any direction. We then compute the error as the number of pixels that do not match, and consider the task successful if the error is below a certain threshold.

G.2 SINGLE-TASK VARIANTS

OGBench also supports standard (i.e., non-goal-conditioned) offline RL by providing single-task variants of the locomotion and manipulation tasks. To convert a goal-conditioned task into a standard reward-maximizing task, we fix an evaluation goal and relabel the dataset with a semi-sparse reward function. This semi-sparse reward is defined as the negative of the number of unaccomplished subtasks in the current state, and the episode immediately terminates when the agent completes all subtasks of the target evaluation goal. In locomotion environments, rewards are always −1 or 0, as there are no separate subtasks. In manipulation environments, rewards range between −n_task and 0, where n_task denotes the number of subtasks (e.g., in puzzle-4x6, n_task = 24, as there are 24 buttons).

Each locomotion and manipulation task in OGBench provides five single-task variants that correspond to the five evaluation goals (Appendix H), resulting in a total of 410 single-task tasks. They are named with the suffix singletask-task[n] (e.g., scene-play-singletask-task2-v0), where [n] denotes a number between 1 and 5 (inclusive). Among the five tasks in each environment, the most representative one is chosen as the default task and is aliased by the suffix singletask without a task number.
For example, in cube-double, the second task (standard double pick-and-place; see Figure 6) is set as the default task, and cube-double-play-singletask-v0 and cube-double-play-singletask-task2-v0 refer to the same task. Default tasks can be useful in various ways: for instance, one may report performance only on default tasks to reduce the computational burden, or treat the default tasks as a training set for tuning hyperparameters while using the other four tasks as a validation set. We provide the list of default tasks in Table 9. While we do not provide a separate benchmarking result on the single-task environments, a benchmarking table of several representative offline RL algorithms on 50 tasks can be found in the work by Park et al. (2025).

G.3 ORACLE REPRESENTATION VARIANTS

OGBench also provides oracle representation variants of the locomotion and manipulation tasks, denoted by the suffix oraclerep (e.g., antmaze-large-navigate-oraclerep-v0). These tasks provide low-dimensional oracle goal representations that contain only the information relevant to the goal success criterion: the x-y coordinates of the agent (or the ball in antsoccer) in locomotion environments, and the positions of the cubes and the states of the objects in manipulation environments. The oraclerep tasks reduce the burden of goal representation learning, potentially helping diagnose the bottlenecks in goal-conditioned RL algorithms. We do not provide a separate benchmarking result for the oracle representation variants.

G.4 METHODS

Our implementations of the six offline GCRL algorithms (GCBC, GCIVL, GCIQL, QRL, CRL, and HIQL) are based on JAX (Bradbury et al., 2018). In our benchmark, each run typically takes 2-5 hours (state-based tasks) or 5-12 hours (pixel-based tasks) on an A5000 GPU, depending on the task and algorithm.
For benchmarking, we periodically evaluate the performance (goal success rate in percentage) of each agent on each test-time goal with 50 rollouts every 100K steps, and report the average success rate across the last three evaluation epochs (i.e., at 800K, 900K, and 1M steps for state-based tasks, and at 300K, 400K, and 500K steps for pixel-based tasks). That is, the performance of each agent is averaged over 750 rollouts (3 evaluation epochs × 5 test-time goals × 50 rollouts). While we use a relatively large number of evaluation rollouts for robustness, researchers can adjust this number (e.g., to 20 episodes per test-time goal) to reduce the computational burden.

We provide the full list of common hyperparameters in Table 10. We find that methods are more sensitive to policy extraction hyperparameters (e.g., the BC coefficient in DDPG+BC) (Park et al., 2024a), and report these in a separate table (Table 11). Specifically, for each method, we use the same value learning hyperparameters across the benchmark except for the discount factor γ (Table 10), but individually tune the policy extraction hyperparameters (e.g., the AWR α and DDPG+BC α) for each dataset category (Tables 10 and 11).

We apply layer normalization (Ba et al., 2016) to the value networks, but not to the policy networks. In pixel-based environments, we use a smaller version of the IMPALA encoder (Espeholt et al., 2018). We use random-crop image augmentation (with a probability of 0.5) for pixel-based manipulation tasks, but not for pixel-based locomotion or drawing tasks, as we find it to be helpful mainly on manipulation tasks. In pixel-based environments, we do not apply frame stacking for simplicity, as we find it does not necessarily improve performance on most tasks, including Visual Ant Maze (although we believe the performance on Visual Humanoid Maze could be further improved with frame stacking).
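The random-crop augmentation can be sketched as a pad-then-crop operation (a common variant of this augmentation; the padding amount and mode here are illustrative assumptions, not the exact reference implementation):

```python
import numpy as np

def random_crop(img, pad=4, rng=None):
    """Pad an H x W x C image by `pad` pixels on each side (edge padding),
    then crop back to the original size at a random offset."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = int(rng.integers(0, 2 * pad + 1))
    left = int(rng.integers(0, 2 * pad + 1))
    return padded[top:top + h, left:left + w]

def maybe_augment(img, p=0.5, rng=None):
    """Apply the crop with probability p (0.5 in the benchmark)."""
    rng = np.random.default_rng() if rng is None else rng
    return random_crop(img, rng=rng) if rng.random() < p else img
```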
For policies, we parameterize the action distribution as a Gaussian with unit variance. We find that using a fixed standard deviation is especially important for DDPG+BC. During evaluation, we use the deterministic mean of the learned Gaussian policy. However, in Powderworld, which has a discrete action space, we use a stochastic policy with a temperature of 0.3 (i.e., we divide the action logits by 0.3), as this additional stochasticity helps prevent the agent from getting stuck in certain states.

Table 6: Dataset categories. We list the dataset categories used to aggregate results in Figure 2. Note that some datasets or tasks (e.g., Point Maze) do not belong to any of these aggregation categories (e.g., because they are too simple or too specialized).

Locomotion (states): antmaze-{medium, large, giant}-navigate-v0, humanoidmaze-{medium, large, giant}-navigate-v0, antsoccer-{arena, medium}-navigate-v0
Locomotion (pixels): visual-antmaze-{medium, large, giant}-navigate-v0, visual-humanoidmaze-{medium, large, giant}-navigate-v0
Manipulation (states): cube-{single, double, triple, quadruple}-play-v0, scene-play-v0, puzzle-{3x3, 4x4, 4x5, 4x6}-play-v0
Manipulation (pixels): visual-cube-{single, double, triple, quadruple}-play-v0, visual-scene-play-v0, visual-puzzle-{3x3, 4x4, 4x5, 4x6}-play-v0
Drawing (pixels): powderworld-{easy, medium, hard}-play-v0
Stitching (states): antmaze-{medium, large, giant}-stitch-v0, humanoidmaze-medium-stitch-v0,
humanoidmaze-large-stitch-v0, humanoidmaze-giant-stitch-v0, antsoccer-arena-stitch-v0, antsoccer-medium-stitch-v0
Stitching (pixels): visual-antmaze-{medium, large, giant}-stitch-v0, visual-humanoidmaze-{medium, large, giant}-stitch-v0
Exploratory (states): antmaze-{medium, large}-explore-v0
Exploratory (pixels): visual-antmaze-{medium, large}-explore-v0
Stochastic (states): antmaze-teleport-navigate-v0
Stochastic (pixels): visual-antmaze-teleport-navigate-v0

Table 7: Environment specifications. See Table 8 for the dataset specifications. Note that the episode lengths of datasets and environments can be different. Each row lists the environment, state (or observation) dimension, action dimension, and maximum episode length.

pointmaze-{medium, large, giant, teleport}-v0: 2, 2, 1000
antmaze-{medium, large, giant, teleport}-v0: 29, 8, 1000
humanoidmaze-{medium, large}-v0: 69, 21, 2000
humanoidmaze-giant-v0: 69, 21, 4000
antsoccer-{arena, medium}-v0: 42, 8, 1000
visual-antmaze-{medium, large, giant, teleport}-v0: 64×64×3, 8, 1000
visual-humanoidmaze-{medium, large}-v0: 64×64×3, 21, 2000
visual-humanoidmaze-giant-v0: 64×64×3, 21, 4000
cube-single-v0: 28, 5, 200
cube-double-v0: 37, 5, 500
cube-triple-v0: 46, 5, 1000
cube-quadruple-v0: 55, 5, 1000
scene-v0: 40, 5, 750
puzzle-3x3-v0: 55, 5, 500
puzzle-4x4-v0: 83, 5, 500
puzzle-4x5-v0: 99, 5, 1000
puzzle-4x6-v0: 115, 5, 1000
visual-cube-single-v0: 64×64×3, 5, 200
visual-cube-double-v0: 64×64×3, 5, 500
visual-cube-triple-v0: 64×64×3, 5, 1000
visual-cube-quadruple-v0: 64×64×3, 5, 1000
visual-scene-v0: 64×64×3, 5, 750
visual-puzzle-{3x3, 4x4}-v0: 64×64×3, 5, 500
visual-puzzle-{4x5, 4x6}-v0: 64×64×3, 5, 1000
powderworld-{easy, medium, hard}-v0: 32×32×6, 8 (discrete), 500

Table 8: Dataset specifications. See Table 7 for the environment specifications. Note that the episode lengths of datasets and environments can be different. Each row lists the dataset, # transitions, # episodes, and data episode length.

pointmaze-{medium, large, teleport}-navigate-v0: 1M, 1000, 1000
pointmaze-giant-navigate-v0: 1M, 500, 2000
pointmaze-{medium, large, giant, teleport}-stitch-v0: 1M, 5000, 200
antmaze-{medium, large, teleport}-navigate-v0: 1M, 1000, 1000
antmaze-giant-navigate-v0: 1M, 500, 2000
antmaze-{medium, large, giant, teleport}-stitch-v0: 1M, 5000, 200
antmaze-{medium, large, teleport}-explore-v0: 5M, 10000, 500
humanoidmaze-{medium, large}-navigate-v0: 2M, 1000, 2000
humanoidmaze-giant-navigate-v0: 4M, 1000, 4000
humanoidmaze-{medium, large}-stitch-v0: 2M, 5000, 400
humanoidmaze-giant-stitch-v0: 4M, 10000, 400
antsoccer-arena-navigate-v0: 1M, 1000, 1000
antsoccer-medium-navigate-v0: 4M, 4000, 1000
antsoccer-arena-stitch-v0: 1M, 5000, 200
antsoccer-medium-stitch-v0: 4M, 8000, 500
visual-antmaze and visual-humanoidmaze datasets: identical to their state-based counterparts
cube-{single, double}-{play, noisy}-v0: 1M, 1000, 1000
cube-triple-{play, noisy}-v0: 3M, 3000, 1000
cube-quadruple-{play, noisy}-v0: 5M, 5000, 1000
scene-{play, noisy}-v0: 1M, 1000, 1000
puzzle-{3x3, 4x4}-{play, noisy}-v0: 1M, 1000, 1000
puzzle-4x5-{play, noisy}-v0: 3M, 3000, 1000
puzzle-4x6-{play, noisy}-v0: 5M, 5000, 1000
visual-cube, visual-scene, and visual-puzzle datasets: identical to their state-based counterparts
powderworld-easy-play-v0: 1M, 1000, 1000
powderworld-medium-play-v0: 3M, 3000, 1000
powderworld-hard-play-v0: 5M, 5000, 1000

Table 9: Designated default tasks for single-task environments. For single-task (singletask) variants, each environment provides five tasks corresponding to the five evaluation goals, with the most representative one chosen as the default task.

task1: all pointmaze, antmaze, humanoidmaze, visual-antmaze, and visual-humanoidmaze environments
task4: antsoccer-{arena, medium}-v0, puzzle-{3x3, 4x4}-v0, visual-puzzle-{3x3, 4x4}-v0
task2: all cube and visual-cube environments, scene-v0, visual-scene-v0, puzzle-{4x5, 4x6}-v0, visual-puzzle-{4x5, 4x6}-v0

Table 10: Common hyperparameters.
Learning rate: 0.0003
Optimizer: Adam (Kingma & Ba, 2015)
# gradient steps: 1000000 (states), 500000 (pixels)
Minibatch size: 1024 (states), 256 (pixels)
MLP dimensions: (512, 512, 512)
Nonlinearity: GELU (Hendrycks & Gimpel, 2016)
Target smoothing coefficient: 0.005
Discount factor γ: 0.995 ({antmaze, pointmaze}-giant, humanoidmaze), 0.99 (others)
Image augmentation probability: 0.5 (pixel-based manipulation), 0 (others)
GCIVL expectile κ: 0.9
GCIQL expectile κ: 0.9
QRL quasimetric: IQE (Wang & Isola, 2022)
QRL latent dimension: 512 (64 components × 8-dimensional latents)
QRL margin ϵ: 0.05
CRL latent dimension: 512
HIQL expectile κ: 0.7
HIQL subgoal step k: 100 (humanoidmaze), 25 (other locomotion), 10 (others)
HIQL subgoal representation dimension: 10
Policy goal ratio (p^D_cur, p^D_traj, p^D_geom, p^D_rand) for p^D_mixed: (0, 0.5, 0, 0.5) (stitch), (0, 0, 0, 1) (explore), (0, 1, 0, 0) (others)
Value goal ratio (p^D_cur, p^D_traj, p^D_geom, p^D_rand) for p^D_mixed: (0.2, 0, 0.5, 0.3)

Table 11: Hyperparameters for policy extraction. Each cell indicates the policy extraction method and its α value (i.e., the temperature (AWR) or the BC coefficient (DDPG+BC)).
Values are identical for all datasets within each group below, and are listed in the order (GCIVL, GCIQL, QRL, CRL, HIQL):

pointmaze (navigate, stitch, incl. teleport): AWR 10.0, DDPG 0.003, DDPG 0.0003, DDPG 0.03, AWR 3.0
antmaze and visual-antmaze (navigate, stitch, incl. teleport): AWR 10.0, DDPG 0.3, DDPG 0.003, DDPG 0.1, AWR 3.0
antmaze and visual-antmaze (explore, incl. teleport): AWR 10.0, DDPG 0.01, DDPG 0.001, DDPG 0.003, AWR 10.0
humanoidmaze and visual-humanoidmaze (navigate, stitch): AWR 10.0, DDPG 0.1, DDPG 0.001, DDPG 0.1, AWR 3.0
antsoccer (navigate, stitch): AWR 10.0, DDPG 0.1, DDPG 0.003, DDPG 0.3, AWR 3.0
cube, scene, puzzle, and their visual variants (play): AWR 10.0, DDPG 1.0, DDPG 0.3, DDPG 3.0, AWR 3.0
cube, scene, puzzle, and their visual variants (noisy): AWR 10.0, DDPG 0.03, DDPG 0.03, DDPG 0.1, AWR 3.0
powderworld (play): AWR 3.0 for all five methods

H EVALUATION GOALS AND PER-GOAL BENCHMARKING RESULTS

Each task in OGBench provides five evaluation goals. We depict them in Figures 4 to 10, and provide the full per-goal evaluation results in Tables 12 to 24, which share the same format as Table 2.

Figure 4: Point Maze, Ant Maze, and Humanoid Maze goals ({pointmaze, antmaze, humanoidmaze}-{medium, large, giant} and {pointmaze, antmaze}-teleport).

Figure 5: Ant Soccer goals (antsoccer-arena and antsoccer-medium).

Figure 6: Cube goals. cube-single: task1 horizontal, task2 vertical1, task3 vertical2, task4 diagonal1, task5 diagonal2. cube-double: task1 single-pnp, task2 double-pnp1, task3 double-pnp2, task5 stack. cube-triple: task1 single-pnp, task2 triple-pnp, task3 pnp-from-stack, task4 cycle, task5 stack. cube-quadruple: task1 double-pnp, task2 quadruple-pnp, task3 pnp-from-square, task4 cycle, task5 stack.

Figure 7: Scene goals. task2 unlock-and-lock, task3 rearrange-medium, task4 put-in-drawer, task5 rearrange-hard.

Figure 8: Puzzle goals (puzzle-3x3 and puzzle-4x4).

Figure 9: Puzzle goals (puzzle-4x5 and puzzle-4x6).
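As a rough sketch of the two policy extraction objectives that the α values in Table 11 parameterize, the toy code below follows the standard forms of AWR and DDPG+BC (as discussed in Park et al., 2024a). The clipping constant, the squared-error BC term, and all variable names are our own assumptions for illustration; actual implementations may differ in details such as Q normalization.

```python
import numpy as np

# Toy per-sample quantities for a batch of (state, action) pairs.
adv = np.array([-0.5, 0.2, 1.0])        # advantage A(s, a) of dataset actions
log_prob = np.array([-1.2, -0.8, -0.3]) # log π(a|s) of dataset actions
q_pred = np.array([0.1, 0.4, 0.9])      # Q(s, π(s)) for the policy's own actions
bc_mse = np.array([0.3, 0.1, 0.05])     # ||π(s) - a||² to dataset actions

def awr_loss(adv, log_prob, alpha):
    # Advantage-weighted regression: behavioral cloning weighted by
    # exp(α·A), with clipping (an assumed constant) for stability.
    # Here, α is the "temperature" reported in Table 11.
    weights = np.minimum(np.exp(alpha * adv), 100.0)
    return -(weights * log_prob).mean()

def ddpg_bc_loss(q_pred, bc_mse, alpha):
    # DDPG+BC: maximize Q under the policy, plus a BC regularizer whose
    # strength is the "BC coefficient" α reported in Table 11.
    return -q_pred.mean() + alpha * bc_mse.mean()
```

With α = 0, AWR reduces to plain behavioral cloning and DDPG+BC to unconstrained Q maximization; larger α values keep the extracted policy closer to the dataset behavior.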
Published as a conference paper at ICLR 2025 task1 plant task2 stone task3 square task4 four-squares task5 mosaic powderworld-easy task1 squares task2 water-plant task3 sandpile task4 two-rooms task5 elements powderworld-medium task1 bubbles task2 firework task3 three-rooms task4 four-squares task5 ice-plant powderworld-hard Figure 10: Powderworld goals. Published as a conference paper at ICLR 2025 Table 12: Full results on Point Maze. Environment Type Dataset Type Dataset Task GCBC GCIVL GCIQL QRL CRL HIQL pointmaze-medium-navigate-v0 task1 30 27 88 16 97 4 100 0 20 6 99 1 task2 3 2 95 10 76 29 94 17 45 25 87 7 task3 5 5 37 28 10 28 23 20 30 4 55 13 task4 0 1 2 2 0 0 94 14 28 29 82 12 task5 4 3 92 7 79 6 97 8 24 13 70 10 overall 9 6 63 6 53 8 82 5 29 7 79 5 pointmaze-large-navigate-v0 task1 63 11 76 23 86 14 95 8 42 27 83 13 task2 1 2 0 0 0 0 100 0 31 24 2 7 task3 10 7 98 5 83 8 40 50 78 7 88 10 task4 20 18 0 0 0 0 96 7 24 14 72 19 task5 52 17 53 20 0 0 96 7 20 10 46 16 overall 29 6 45 5 34 3 86 9 39 7 58 5 pointmaze-giant-navigate-v0 task1 1 3 0 0 0 0 98 7 6 15 0 0 task2 1 4 0 0 0 0 92 16 28 10 72 17 task3 0 0 0 1 0 0 68 27 9 5 32 11 task4 0 0 0 0 0 0 66 20 64 17 60 22 task5 5 12 0 0 0 0 19 32 29 28 66 20 overall 1 2 0 0 0 0 68 7 27 10 46 9 pointmaze-teleport-navigate-v0 task1 1 2 33 12 0 1 0 0 3 3 5 5 task2 4 6 49 2 39 14 8 10 30 23 6 6 task3 50 4 46 5 31 19 2 5 26 6 39 9 task4 33 13 49 4 42 13 12 16 40 11 24 11 task5 38 6 48 4 9 9 1 2 20 15 17 8 overall 25 3 45 3 24 7 4 4 24 6 18 4 pointmaze-medium-stitch-v0 task1 21 29 76 14 56 24 94 13 0 0 77 14 task2 32 35 79 23 26 19 81 34 0 0 61 23 task3 33 34 69 16 0 0 66 29 2 3 82 13 task4 0 0 41 37 0 0 68 32 0 0 92 6 task5 29 37 84 11 22 22 92 9 0 0 59 9 overall 23 18 70 14 21 9 80 12 0 1 74 6 pointmaze-large-stitch-v0 task1 8 13 0 1 56 11 100 1 0 0 3 5 task2 0 0 0 0 0 0 74 37 0 0 0 0 task3 26 28 60 29 98 4 74 23 0 0 59 25 task4 0 0 0 0 0 0 88 32 0 0 1 4 task5 0 0 0 0 0 0 85 22 0 0 0 0 overall 7 5 12 6 31 2 84 15 0 0 13 
6 pointmaze-giant-stitch-v0 task1 0 0 0 0 0 0 99 2 0 0 0 0 task2 0 0 0 0 0 0 80 27 0 0 0 0 task3 0 0 0 0 0 0 3 5 0 0 0 0 task4 0 0 0 0 0 0 63 23 0 0 0 0 task5 0 0 0 0 0 0 4 8 0 0 0 0 overall 0 0 0 0 0 0 50 8 0 0 0 0 pointmaze-teleport-stitch-v0 task1 28 20 34 14 0 0 0 0 0 0 24 13 task2 13 15 41 8 12 14 7 7 0 0 23 11 task3 48 8 50 5 47 2 15 13 0 0 46 9 task4 40 16 50 6 46 5 19 12 8 8 46 5 task5 29 16 48 5 21 7 1 3 13 13 31 10 overall 31 9 44 2 25 3 9 5 4 3 34 4 Published as a conference paper at ICLR 2025 Table 13: Full results on Ant Maze. Environment Type Dataset Type Dataset Task GCBC GCIVL GCIQL QRL CRL HIQL antmaze-medium-navigate-v0 task1 35 9 81 10 63 9 93 2 97 1 94 2 task2 21 7 85 5 78 8 90 5 95 2 97 1 task3 24 6 60 13 71 8 86 6 92 3 96 2 task4 28 7 42 25 59 12 83 4 94 5 96 2 task5 37 10 92 3 85 7 88 8 96 2 96 2 overall 29 4 72 8 71 4 88 3 95 1 96 1 antmaze-large-navigate-v0 task1 6 3 16 12 21 6 71 15 91 3 93 3 task2 16 4 5 6 25 7 77 7 62 14 78 9 task3 65 4 49 18 80 5 94 2 91 2 96 2 task4 14 3 2 2 19 6 64 8 85 11 94 2 task5 18 4 5 2 26 9 67 9 85 3 94 3 overall 24 2 16 5 34 4 75 6 83 4 91 2 antmaze-giant-navigate-v0 task1 0 0 0 0 0 0 1 2 2 2 47 10 task2 0 0 0 0 0 0 17 5 21 10 74 5 task3 0 0 0 0 0 0 14 8 5 5 55 7 task4 0 0 0 0 0 0 18 6 35 9 69 5 task5 1 1 1 1 1 1 18 5 16 10 82 4 overall 0 0 0 0 0 0 14 3 16 3 65 5 antmaze-teleport-navigate-v0 task1 17 5 35 5 26 5 31 6 35 5 37 5 task2 51 5 41 5 58 8 47 22 92 3 66 8 task3 22 3 36 8 31 5 35 6 47 4 37 5 task4 25 5 45 3 33 5 33 6 50 2 30 2 task5 14 6 38 6 26 9 28 8 44 3 41 8 overall 26 3 39 3 35 5 35 5 53 2 42 3 antmaze-medium-stitch-v0 task1 70 33 76 13 17 12 43 20 43 10 92 2 task2 65 19 80 4 22 16 61 12 46 14 94 3 task3 21 15 16 12 41 9 72 29 46 17 95 2 task4 1 2 0 0 32 9 80 9 53 19 93 2 task5 70 33 47 20 34 14 41 18 75 8 95 3 overall 45 11 44 6 29 6 59 7 53 6 94 1 antmaze-large-stitch-v0 task1 2 2 23 9 0 0 7 5 1 1 85 5 task2 0 0 0 0 0 0 10 5 4 4 24 16 task3 15 14 69 6 37 10 73 8 43 11 94 3 task4 0 0 0 0 0 0 1 1 5 
5 70 8 task5 0 0 0 0 0 0 1 1 1 2 60 9 overall 3 3 18 2 7 2 18 2 11 2 67 5 antmaze-giant-stitch-v0 task1 0 0 0 0 0 0 0 0 0 0 0 1 task2 0 0 0 0 0 0 0 0 0 0 5 5 task3 0 0 0 0 0 0 0 0 0 0 0 0 task4 0 0 0 0 0 0 0 0 0 0 3 3 task5 0 0 0 0 0 0 2 2 0 0 0 1 overall 0 0 0 0 0 0 0 0 0 0 2 2 antmaze-teleport-stitch-v0 task1 21 13 39 7 12 4 22 6 30 6 44 5 task2 39 12 44 6 18 7 22 6 30 4 42 3 task3 34 12 36 8 18 4 25 7 23 11 26 4 task4 46 6 44 4 18 5 24 9 38 4 26 4 task5 16 14 33 6 17 6 26 5 32 7 40 6 overall 31 6 39 3 17 2 24 5 31 4 36 2 antmaze-medium-explore-v0 task1 3 6 10 8 12 6 1 1 2 2 29 17 task2 1 2 74 9 53 8 1 1 8 6 84 10 task3 1 2 0 0 0 0 3 5 4 6 18 24 task4 0 0 0 0 0 0 0 0 0 0 0 0 task5 3 4 10 6 0 0 1 1 2 2 52 27 overall 2 1 19 3 13 2 1 1 3 2 37 10 antmaze-large-explore-v0 task1 0 0 37 12 1 1 0 0 0 1 1 3 task2 0 0 0 0 0 0 0 0 0 0 0 0 task3 0 0 12 6 1 1 0 0 1 1 18 24 task4 0 0 0 0 0 0 0 0 0 0 0 0 task5 0 0 0 0 0 0 0 0 0 0 0 0 overall 0 0 10 3 0 0 0 0 0 0 4 5 antmaze-teleport-explore-v0 task1 2 2 32 4 0 1 0 0 2 1 32 11 task2 0 0 2 4 8 6 0 1 5 4 33 17 task3 4 3 48 3 13 8 4 4 47 6 34 16 task4 2 2 47 5 14 8 4 4 16 12 37 19 task5 4 2 31 2 2 1 3 3 28 5 34 14 overall 2 1 32 2 7 3 2 2 20 2 34 15 Published as a conference paper at ICLR 2025 Table 14: Full results on Humanoid Maze. 
Environment Type Dataset Type Dataset Task GCBC GCIVL GCIQL QRL CRL HIQL humanoidmaze humanoidmaze-medium-navigate-v0 task1 4 1 22 5 23 6 12 7 84 3 95 2 task2 8 4 42 8 49 6 25 8 80 5 96 2 task3 12 3 15 3 12 6 25 10 43 11 79 6 task4 2 1 0 0 1 0 16 7 5 5 75 6 task5 12 4 40 8 51 8 29 12 87 7 97 1 overall 8 2 24 2 27 2 21 8 60 4 89 2 humanoidmaze-large-navigate-v0 task1 1 1 6 2 3 2 3 2 36 11 67 4 task2 0 0 0 0 0 0 0 0 0 0 2 3 task3 3 1 6 2 5 2 17 6 54 17 88 3 task4 2 1 0 0 1 1 4 2 23 11 42 11 task5 1 1 1 1 1 1 2 1 6 4 47 10 overall 1 0 2 1 2 1 5 1 24 4 49 4 humanoidmaze-giant-navigate-v0 task1 0 0 0 0 0 0 0 0 1 1 13 7 task2 0 0 1 1 1 1 2 1 9 5 35 11 task3 0 0 0 0 0 0 0 0 2 2 11 4 task4 0 0 0 0 0 0 0 0 3 2 2 2 task5 1 1 0 0 1 1 2 1 1 1 2 2 overall 0 0 0 0 0 0 1 0 3 2 12 4 humanoidmaze-medium-stitch-v0 task1 20 7 13 3 12 3 6 5 27 7 84 5 task2 49 12 7 2 8 5 13 4 37 7 94 2 task3 24 8 25 3 20 7 30 6 40 4 86 4 task4 3 2 1 1 2 2 18 5 28 7 86 4 task5 49 8 16 3 18 7 22 2 49 5 90 4 overall 29 5 12 2 12 3 18 2 36 2 88 2 humanoidmaze-large-stitch-v0 task1 3 4 2 1 1 1 0 0 0 0 21 5 task2 0 0 0 0 0 0 0 0 0 0 5 2 task3 20 11 3 2 1 1 16 7 13 3 84 4 task4 2 1 1 1 0 1 1 1 4 1 19 4 task5 2 2 1 1 0 0 0 0 3 1 12 2 overall 6 3 1 1 0 0 3 1 4 1 28 3 humanoidmaze-giant-stitch-v0 task1 0 0 0 0 0 0 0 0 0 0 1 2 task2 0 0 1 1 0 0 1 1 0 0 12 6 task3 0 0 0 0 0 0 0 0 0 0 2 2 task4 0 0 0 0 0 0 0 0 0 0 1 1 task5 0 0 0 0 1 1 1 1 0 1 0 1 overall 0 0 0 0 0 0 0 0 0 0 3 2 Table 15: Full results on Ant Soccer. 
Environment Type Dataset Type Dataset Task GCBC GCIVL GCIQL QRL CRL HIQL antsoccer-arena-navigate-v0 task1 7 3 61 4 60 6 12 2 32 5 67 4 task2 10 3 45 5 56 4 8 4 27 4 59 4 task3 3 2 62 6 63 6 12 2 28 5 76 4 task4 3 1 23 5 28 6 3 2 11 2 30 3 task5 3 1 42 6 42 7 3 3 15 4 56 4 overall 5 1 47 3 50 2 8 2 23 2 58 2 antsoccer-medium-navigate-v0 task1 9 3 17 5 29 5 8 8 13 4 45 6 task2 0 0 0 0 0 0 0 0 0 0 0 0 task3 0 0 0 0 0 1 1 1 0 1 2 1 task4 0 0 0 0 1 1 1 1 1 1 3 1 task5 1 1 3 1 4 3 1 2 2 1 13 5 overall 2 0 4 1 7 1 2 2 3 1 13 2 antsoccer-arena-stitch-v0 task1 73 5 37 4 6 2 2 1 2 2 24 2 task2 36 19 13 4 2 1 2 2 1 0 14 4 task3 6 15 34 9 1 1 1 1 0 0 20 3 task4 7 12 11 3 0 0 0 0 0 0 7 2 task5 0 0 12 3 1 1 0 0 0 0 12 5 overall 24 8 21 3 2 0 1 1 1 0 15 1 antsoccer-medium-stitch-v0 task1 10 7 4 2 0 0 0 0 0 0 21 6 task2 0 0 0 0 0 0 0 0 0 0 0 0 task3 0 0 0 0 0 0 0 0 0 0 0 0 task4 0 0 0 0 0 0 0 0 0 0 0 0 task5 0 0 0 0 0 0 0 0 0 0 0 1 overall 2 1 1 0 0 0 0 0 0 0 4 1 Published as a conference paper at ICLR 2025 Table 16: Full results on Visual Ant Maze. 
Environment Type Dataset Type Dataset Task GCBC GCIVL GCIQL QRL CRL HIQL visual-antmaze visual-antmaze-medium-navigate-v0 task1 17 6 30 7 16 3 0 0 92 2 90 4 task2 8 2 21 6 7 2 0 0 94 2 92 7 task3 17 1 24 5 16 4 0 0 98 1 94 4 task4 12 2 21 3 9 2 0 0 94 2 94 2 task5 4 2 16 5 6 2 0 0 94 2 94 5 overall 11 2 22 2 11 1 0 0 94 1 93 4 visual-antmaze-large-navigate-v0 task1 3 1 7 2 4 3 0 0 78 5 60 10 task2 4 3 4 1 2 1 0 0 80 3 28 9 task3 4 2 6 2 4 1 1 1 90 3 85 10 task4 4 2 5 3 6 1 0 1 88 3 46 7 task5 4 2 5 1 4 2 0 0 83 2 44 10 overall 4 0 5 1 4 1 0 0 84 1 53 9 visual-antmaze-giant-navigate-v0 task1 0 0 0 0 0 0 0 0 17 2 2 1 task2 1 1 2 1 1 1 0 0 73 9 12 8 task3 0 0 0 0 0 0 0 0 22 6 2 3 task4 0 1 0 1 0 0 0 0 47 5 4 2 task5 1 1 2 3 1 1 0 1 77 5 13 11 overall 0 0 1 1 0 0 0 0 47 2 6 4 visual-antmaze-teleport-navigate-v0 task1 2 2 6 1 2 1 3 2 32 3 32 5 task2 6 3 9 3 9 2 6 4 73 8 40 6 task3 9 1 12 3 9 2 10 4 47 3 33 1 task4 10 2 10 2 8 3 6 4 50 4 44 5 task5 1 1 3 1 3 1 4 2 36 5 33 7 overall 5 1 8 1 6 1 6 3 48 2 37 2 visual-antmaze-medium-stitch-v0 task1 80 4 0 1 0 0 0 0 33 4 75 8 task2 90 4 1 2 0 0 0 0 69 5 85 7 task3 69 18 15 6 8 1 0 0 88 1 92 1 task4 1 1 7 4 3 1 0 1 70 12 88 4 task5 97 1 6 3 1 1 0 0 85 5 93 1 overall 67 4 6 2 2 0 0 0 69 2 87 2 visual-antmaze-large-stitch-v0 task1 26 11 0 0 0 0 0 0 6 1 36 5 task2 0 0 0 0 0 0 0 0 2 1 3 2 task3 73 14 3 2 0 0 2 2 36 10 87 6 task4 7 5 1 1 0 0 1 1 8 1 7 4 task5 11 5 0 0 0 0 0 0 5 2 6 1 overall 24 3 1 1 0 0 1 1 11 3 28 2 visual-antmaze-giant-stitch-v0 task1 0 0 0 0 0 0 0 0 0 0 0 0 task2 1 2 0 0 0 0 0 0 0 0 1 1 task3 0 0 0 0 0 0 0 0 0 0 0 0 task4 0 0 0 0 0 0 0 0 0 0 0 0 task5 0 0 0 0 0 0 0 0 0 1 0 0 overall 0 0 0 0 0 0 0 0 0 0 0 0 visual-antmaze-teleport-stitch-v0 task1 37 4 2 2 1 1 0 0 20 5 36 5 task2 36 3 2 1 1 1 1 1 40 9 38 3 task3 17 6 2 1 2 1 3 4 32 9 36 5 task4 39 9 1 1 0 0 2 3 45 7 37 6 task5 29 1 1 1 1 1 1 1 22 9 38 5 overall 32 3 1 1 1 0 1 2 32 6 37 4 visual-antmaze-medium-explore-v0 task1 0 0 0 0 0 0 0 0 0 0 0 0 task2 0 0 0 0 
0 0 0 0 0 0 0 0 task3 0 0 0 0 0 0 0 0 0 0 1 2 task4 0 0 0 0 0 0 0 0 0 0 0 0 task5 0 0 0 0 0 0 0 0 0 0 0 0 overall 0 0 0 0 0 0 0 0 0 0 0 0 visual-antmaze-large-explore-v0 task1 0 0 0 0 0 0 0 0 0 0 0 0 task2 0 0 0 0 0 0 0 0 0 0 0 0 task3 0 0 0 0 0 0 0 0 0 0 0 0 task4 0 0 0 0 0 0 0 0 0 0 0 0 task5 0 0 0 0 0 0 0 0 0 0 0 0 overall 0 0 0 0 0 0 0 0 0 0 0 0 visual-antmaze-teleport-explore-v0 task1 0 0 0 0 0 0 0 0 0 0 0 0 task2 0 0 0 0 0 0 0 0 0 0 0 0 task3 0 0 0 1 0 1 0 0 3 1 38 8 task4 0 0 0 0 0 0 0 0 0 0 28 13 task5 0 0 0 0 0 0 0 0 2 2 27 18 overall 0 0 0 0 0 0 0 0 1 0 19 8 Published as a conference paper at ICLR 2025 Table 17: Full results on Visual Humanoid Maze. Environment Type Dataset Type Dataset Task GCBC GCIVL GCIQL QRL CRL HIQL visual-humanoidmaze visual-humanoidmaze-medium-navigate-v0 task1 0 0 0 0 0 0 0 0 0 0 0 0 task2 0 0 0 0 0 0 0 0 0 0 0 0 task3 0 0 0 0 0 0 0 0 2 1 0 1 task4 0 0 0 0 0 0 0 0 0 0 0 0 task5 0 0 0 0 0 0 0 0 3 2 0 1 overall 0 0 0 0 0 0 0 0 1 0 0 0 visual-humanoidmaze-large-navigate-v0 task1 0 0 0 0 0 0 0 0 0 0 0 0 task2 0 0 0 0 0 0 0 0 0 0 0 0 task3 0 0 0 0 0 0 0 0 0 0 0 0 task4 0 0 0 0 0 0 0 0 0 0 0 0 task5 0 0 0 0 0 0 0 0 0 0 0 0 overall 0 0 0 0 0 0 0 0 0 0 0 0 visual-humanoidmaze-giant-navigate-v0 task1 0 0 0 0 0 0 0 0 0 0 0 0 task2 0 0 0 0 0 0 0 0 0 0 0 0 task3 0 0 0 0 0 0 0 0 0 0 0 0 task4 0 0 0 0 0 0 0 0 0 0 0 0 task5 0 0 0 0 0 0 0 0 0 0 0 0 overall 0 0 0 0 0 0 0 0 0 0 0 0 visual-humanoidmaze-medium-stitch-v0 task1 0 0 0 0 0 0 0 0 0 0 0 0 task2 0 0 0 0 0 0 0 0 0 0 0 0 task3 3 1 0 0 0 0 0 0 3 2 1 2 task4 0 0 0 0 0 0 0 0 0 0 0 0 task5 1 1 0 0 0 0 0 0 0 0 0 0 overall 1 0 0 0 0 0 0 0 1 0 0 0 visual-humanoidmaze-large-stitch-v0 task1 0 0 0 0 0 0 0 0 0 0 0 0 task2 0 0 0 0 0 0 0 0 0 0 0 0 task3 0 0 0 0 0 0 0 0 0 0 0 0 task4 0 0 0 0 0 0 0 0 0 0 0 0 task5 0 0 0 0 0 0 0 0 0 0 0 0 overall 0 0 0 0 0 0 0 0 0 0 0 0 visual-humanoidmaze-giant-stitch-v0 task1 0 0 0 0 0 0 0 0 0 0 0 0 task2 0 0 0 0 0 0 0 0 0 0 0 0 task3 0 0 0 0 0 0 0 0 0 0 0 0 task4 0 0 0 0 
0 0 0 0 0 0 0 0 task5 0 0 0 0 0 0 0 0 0 0 0 0 overall 0 0 0 0 0 0 0 0 0 0 0 0 Table 18: Full results on Cube. Environment Type Dataset Type Dataset Task GCBC GCIVL GCIQL QRL CRL HIQL cube-single-play-v0 task1 7 3 57 6 71 9 6 2 20 6 15 5 task2 5 2 51 6 71 6 5 2 20 4 16 5 task3 7 3 55 6 70 6 4 1 21 6 16 3 task4 4 2 50 4 61 8 4 2 16 3 14 5 task5 4 2 52 6 67 7 4 3 15 3 13 4 overall 6 2 53 4 68 6 5 1 19 2 15 3 cube-double-play-v0 task1 6 3 58 5 74 8 6 3 30 7 22 6 task2 0 0 51 6 55 11 0 0 9 2 4 3 task3 0 0 42 7 45 7 0 0 6 1 3 2 task4 0 0 7 2 4 3 0 0 0 0 1 1 task5 0 0 21 1 23 6 0 0 3 1 2 1 overall 1 1 36 3 40 5 1 0 10 2 6 2 cube-triple-play-v0 task1 5 4 3 1 13 3 1 1 19 5 12 6 task2 0 0 0 0 0 0 0 0 2 1 0 0 task3 0 0 0 0 0 0 0 0 1 1 0 0 task4 0 0 0 0 0 0 0 0 0 0 0 0 task5 0 0 0 0 0 0 0 0 0 0 0 0 overall 1 1 1 0 3 1 0 0 4 1 3 1 cube-quadruple-play-v0 task1 0 0 0 0 0 0 0 0 0 0 0 0 task2 0 0 0 0 0 0 0 0 0 0 0 0 task3 0 0 0 0 0 0 0 0 0 0 0 0 task4 0 0 0 0 0 0 0 0 0 0 0 0 task5 0 0 0 0 0 0 0 0 0 0 0 0 overall 0 0 0 0 0 0 0 0 0 0 0 0 cube-single-noisy-v0 task1 5 3 71 12 100 1 17 12 39 4 48 6 task2 7 5 70 11 100 0 23 8 39 7 39 7 task3 1 1 67 10 99 1 4 3 36 7 41 7 task4 16 5 76 10 98 2 47 14 36 4 36 6 task5 12 5 70 12 100 0 37 9 42 5 44 9 overall 8 3 71 9 99 1 25 6 38 2 41 6 cube-double-noisy-v0 task1 7 3 53 11 64 8 16 5 9 5 10 3 task2 0 0 10 4 16 4 0 0 0 0 1 0 task3 0 0 1 1 6 4 0 0 0 0 1 1 task4 0 0 4 2 11 3 0 1 0 0 0 1 task5 0 0 4 2 20 4 0 0 0 0 0 1 overall 1 1 14 3 23 3 3 1 2 1 2 1 cube-triple-noisy-v0 task1 6 3 44 7 8 2 5 2 13 6 8 3 task2 0 0 0 0 1 1 0 0 0 0 0 0 task3 0 0 2 1 0 0 0 0 0 0 0 0 task4 0 0 0 0 0 0 0 0 0 0 0 0 task5 0 0 0 0 0 0 0 0 0 0 0 0 overall 1 1 9 1 2 1 1 0 3 1 2 1 cube-quadruple-noisy-v0 task1 0 0 0 0 0 0 0 0 0 0 0 0 task2 0 0 0 0 0 0 0 0 0 0 0 0 task3 0 0 0 0 0 0 0 0 0 0 0 0 task4 0 0 0 0 0 0 0 0 0 0 0 0 task5 0 0 0 0 0 0 0 0 0 0 0 0 overall 0 0 0 0 0 0 0 0 0 0 0 0 Published as a conference paper at ICLR 2025 Table 19: Full results on Scene. 
| Dataset | Task | GCBC | GCIVL | GCIQL | QRL | CRL | HIQL |
|---|---|---|---|---|---|---|---|
| scene-play-v0 | task1 | 18 ±7 | 75 ±5 | 93 ±4 | 19 ±4 | 49 ±7 | 40 ±4 |
| | task2 | 1 ±1 | 62 ±8 | 82 ±8 | 1 ±1 | 12 ±4 | 40 ±9 |
| | task3 | 2 ±1 | 64 ±7 | 72 ±10 | 1 ±1 | 26 ±8 | 36 ±5 |
| | task4 | 3 ±2 | 7 ±4 | 8 ±3 | 5 ±2 | 5 ±2 | 55 ±5 |
| | task5 | 0 ±0 | 2 ±1 | 1 ±1 | 0 ±1 | 1 ±1 | 20 ±5 |
| | overall | 5 ±1 | 42 ±4 | 51 ±4 | 5 ±1 | 19 ±2 | 38 ±3 |
| scene-noisy-v0 | task1 | 6 ±3 | 60 ±11 | 50 ±5 | 39 ±10 | 5 ±4 | 68 ±5 |
| | task2 | 0 ±0 | 42 ±11 | 52 ±13 | 2 ±1 | 0 ±0 | 29 ±6 |
| | task3 | 0 ±0 | 27 ±6 | 28 ±5 | 1 ±1 | 0 ±1 | 17 ±6 |
| | task4 | 0 ±0 | 3 ±3 | 0 ±0 | 3 ±3 | 0 ±0 | 10 ±8 |
| | task5 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 2 ±2 |
| | overall | 1 ±1 | 26 ±5 | 26 ±2 | 9 ±2 | 1 ±1 | 25 ±4 |

Table 20: Full results on Puzzle.

| Dataset | Task | GCBC | GCIVL | GCIQL | QRL | CRL | HIQL |
|---|---|---|---|---|---|---|---|
| puzzle-3x3-play-v0 | task1 | 5 ±1 | 17 ±4 | 99 ±2 | 3 ±2 | 11 ±3 | 29 ±4 |
| | task2 | 2 ±1 | 4 ±2 | 96 ±3 | 0 ±0 | 2 ±1 | 11 ±3 |
| | task3 | 1 ±1 | 3 ±1 | 95 ±1 | 0 ±0 | 1 ±1 | 7 ±3 |
| | task4 | 1 ±1 | 3 ±1 | 91 ±3 | 0 ±0 | 2 ±1 | 5 ±1 |
| | task5 | 1 ±0 | 2 ±1 | 94 ±2 | 0 ±0 | 2 ±1 | 8 ±3 |
| | overall | 2 ±0 | 6 ±1 | 95 ±1 | 1 ±0 | 3 ±1 | 12 ±2 |
| puzzle-4x4-play-v0 | task1 | 0 ±0 | 17 ±4 | 42 ±7 | 1 ±1 | 1 ±1 | 10 ±3 |
| | task2 | 0 ±0 | 13 ±5 | 2 ±1 | 0 ±0 | 0 ±0 | 9 ±4 |
| | task3 | 0 ±0 | 12 ±3 | 40 ±5 | 0 ±0 | 0 ±0 | 7 ±3 |
| | task4 | 0 ±0 | 11 ±3 | 23 ±5 | 0 ±0 | 0 ±1 | 6 ±2 |
| | task5 | 0 ±0 | 10 ±4 | 23 ±5 | 0 ±0 | 0 ±0 | 5 ±2 |
| | overall | 0 ±0 | 13 ±2 | 26 ±3 | 0 ±0 | 0 ±0 | 7 ±2 |
| puzzle-4x5-play-v0 | task1 | 1 ±1 | 33 ±6 | 71 ±5 | 0 ±1 | 6 ±2 | 17 ±5 |
| | task2 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 1 ±0 |
| | task3–5 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | overall | 0 ±0 | 7 ±1 | 14 ±1 | 0 ±0 | 1 ±0 | 4 ±1 |
| puzzle-4x6-play-v0 | task1 | 0 ±0 | 43 ±8 | 52 ±6 | 0 ±0 | 12 ±5 | 12 ±5 |
| | task2 | 0 ±0 | 7 ±4 | 5 ±3 | 0 ±0 | 7 ±3 | 2 ±1 |
| | task3–5 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | overall | 0 ±0 | 10 ±2 | 12 ±1 | 0 ±0 | 4 ±1 | 3 ±1 |
| puzzle-3x3-noisy-v0 | task1 | 4 ±2 | 89 ±8 | 100 ±0 | 1 ±1 | 76 ±8 | 67 ±10 |
| | task2 | 0 ±0 | 42 ±29 | 88 ±9 | 0 ±0 | 26 ±9 | 54 ±11 |
| | task3 | 0 ±0 | 26 ±20 | 99 ±1 | 0 ±0 | 15 ±9 | 43 ±12 |
| | task4 | 0 ±0 | 23 ±20 | 94 ±3 | 0 ±0 | 12 ±9 | 41 ±13 |
| | task5 | 0 ±0 | 31 ±19 | 88 ±5 | 0 ±0 | 18 ±6 | 47 ±15 |
| | overall | 1 ±0 | 42 ±19 | 94 ±3 | 0 ±0 | 30 ±6 | 51 ±11 |
| puzzle-4x4-noisy-v0 | task1 | 0 ±0 | 51 ±10 | 49 ±9 | 0 ±0 | 0 ±0 | 19 ±5 |
| | task2 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 16 ±5 |
| | task3 | 0 ±0 | 34 ±4 | 61 ±14 | 0 ±0 | 0 ±0 | 17 ±6 |
| | task4 | 0 ±0 | 9 ±3 | 23 ±10 | 0 ±0 | 0 ±0 | 14 ±5 |
| | task5 | 0 ±0 | 8 ±4 | 14 ±9 | 0 ±0 | 0 ±0 | 12 ±4 |
| | overall | 0 ±0 | 20 ±3 | 29 ±7 | 0 ±0 | 0 ±0 | 16 ±4 |
| puzzle-4x5-noisy-v0 | task1 | 0 ±0 | 97 ±1 | 97 ±2 | 0 ±0 | 16 ±9 | 21 ±5 |
| | task2 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 1 ±1 |
| | task3 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 1 ±1 |
| | task4–5 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | overall | 0 ±0 | 19 ±0 | 19 ±0 | 0 ±0 | 3 ±2 | 5 ±1 |
| puzzle-4x6-noisy-v0 | task1 | 0 ±0 | 80 ±8 | 86 ±7 | 0 ±0 | 28 ±13 | 8 ±4 |
| | task2 | 0 ±0 | 3 ±2 | 1 ±1 | 0 ±0 | 1 ±1 | 1 ±1 |
| | task3–5 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | overall | 0 ±0 | 17 ±2 | 18 ±2 | 0 ±0 | 6 ±3 | 2 ±1 |

Table 21: Full results on Visual Cube.

| Dataset | Task | GCBC | GCIVL | GCIQL | QRL | CRL | HIQL |
|---|---|---|---|---|---|---|---|
| visual-cube-single-play-v0 | task1 | 12 ±4 | 70 ±3 | 42 ±12 | 68 ±9 | 47 ±20 | 93 ±1 |
| | task2 | 4 ±4 | 65 ±9 | 44 ±11 | 35 ±34 | 40 ±20 | 93 ±3 |
| | task3 | 6 ±5 | 49 ±2 | 24 ±9 | 41 ±27 | 33 ±17 | 84 ±5 |
| | task4 | 0 ±0 | 60 ±13 | 21 ±6 | 32 ±10 | 18 ±16 | 84 ±5 |
| | task5 | 0 ±0 | 55 ±6 | 20 ±7 | 30 ±8 | 16 ±10 | 88 ±2 |
| | overall | 5 ±1 | 60 ±5 | 30 ±5 | 41 ±15 | 31 ±15 | 89 ±0 |
| visual-cube-double-play-v0 | task1 | 4 ±3 | 44 ±8 | 6 ±4 | 20 ±3 | 7 ±4 | 91 ±1 |
| | task2 | 0 ±0 | 0 ±1 | 0 ±0 | 2 ±2 | 0 ±0 | 54 ±5 |
| | task3 | 0 ±1 | 0 ±0 | 0 ±0 | 2 ±0 | 0 ±0 | 40 ±6 |
| | task4 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | task5 | 0 ±0 | 4 ±2 | 0 ±0 | 0 ±0 | 0 ±0 | 11 ±4 |
| | overall | 1 ±1 | 10 ±2 | 1 ±1 | 5 ±0 | 2 ±1 | 39 ±2 |
| visual-cube-triple-play-v0 | task1 | 73 ±8 | 68 ±8 | 76 ±6 | 81 ±3 | 85 ±12 | 98 ±1 |
| | task2 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 1 ±1 |
| | task3 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 7 ±3 |
| | task4–5 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | overall | 15 ±2 | 14 ±2 | 15 ±1 | 16 ±1 | 17 ±2 | 21 ±0 |
| visual-cube-quadruple-play-v0 | task1 | 42 ±4 | 1 ±2 | 36 ±7 | 23 ±5 | 20 ±4 | 66 ±7 |
| | task2 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | task3 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 1 ±1 |
| | task4 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | task5 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 1 ±1 |
| | overall | 8 ±1 | 0 ±0 | 7 ±1 | 5 ±1 | 4 ±1 | 14 ±1 |
| visual-cube-single-noisy-v0 | task1 | 12 ±4 | 89 ±4 | 81 ±6 | 18 ±29 | 43 ±41 | 100 ±1 |
| | task2 | 17 ±6 | 48 ±14 | 0 ±0 | 0 ±0 | 28 ±13 | 100 ±1 |
| | task3 | 6 ±2 | 77 ±4 | 90 ±3 | 28 ±27 | 42 ±41 | 99 ±1 |
| | task4 | 14 ±2 | 82 ±2 | 35 ±11 | 2 ±1 | 43 ±35 | 100 ±1 |
| | task5 | 20 ±10 | 77 ±2 | 33 ±7 | 2 ±1 | 37 ±27 | 99 ±1 |
| | overall | 14 ±3 | 75 ±3 | 48 ±3 | 10 ±5 | 39 ±30 | 99 ±0 |
| visual-cube-double-noisy-v0 | task1 | 20 ±5 | 70 ±8 | 70 ±5 | 27 ±9 | 24 ±11 | 98 ±2 |
| | task2 | 2 ±2 | 5 ±3 | 14 ±2 | 2 ±4 | 3 ±3 | 87 ±9 |
| | task3 | 2 ±1 | 5 ±5 | 16 ±6 | 0 ±0 | 2 ±2 | 68 ±10 |
| | task4 | 1 ±1 | 3 ±1 | 0 ±0 | 0 ±0 | 0 ±0 | 13 ±9 |
| | task5 | 0 ±0 | 0 ±0 | 7 ±2 | 1 ±1 | 0 ±0 | 30 ±4 |
| | overall | 5 ±1 | 17 ±4 | 22 ±2 | 6 ±2 | 6 ±3 | 59 ±3 |
| visual-cube-triple-noisy-v0 | task1 | 80 ±7 | 90 ±5 | 62 ±6 | 44 ±21 | 78 ±7 | 99 ±1 |
| | task2 | 0 ±1 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 2 ±2 |
| | task3 | 0 ±1 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 14 ±11 |
| | task4 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | task5 | 0 ±1 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | overall | 16 ±1 | 18 ±1 | 12 ±1 | 9 ±4 | 16 ±1 | 23 ±2 |
| visual-cube-quadruple-noisy-v0 | task1 | 46 ±2 | 2 ±1 | 10 ±9 | 2 ±2 | 39 ±10 | 60 ±41 |
| | task2 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | task3 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 2 ±2 |
| | task4–5 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | overall | 9 ±0 | 0 ±0 | 2 ±2 | 0 ±0 | 8 ±2 | 12 ±8 |

Table 22: Full results on Visual Scene.

| Dataset | Task | GCBC | GCIVL | GCIQL | QRL | CRL | HIQL |
|---|---|---|---|---|---|---|---|
| visual-scene-play-v0 | task1 | 59 ±7 | 84 ±4 | 56 ±4 | 44 ±6 | 52 ±6 | 80 ±6 |
| | task2 | 0 ±0 | 24 ±8 | 1 ±1 | 2 ±2 | 1 ±1 | 81 ±7 |
| | task3 | 0 ±0 | 16 ±8 | 0 ±0 | 0 ±0 | 0 ±0 | 61 ±11 |
| | task4 | 2 ±1 | 0 ±0 | 3 ±4 | 2 ±1 | 1 ±1 | 20 ±8 |
| | task5 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 3 ±2 |
| | overall | 12 ±2 | 25 ±3 | 12 ±2 | 10 ±1 | 11 ±2 | 49 ±4 |
| visual-scene-noisy-v0 | task1 | 64 ±9 | 76 ±6 | 49 ±22 | 8 ±2 | 70 ±9 | 91 ±4 |
| | task2 | 0 ±0 | 14 ±7 | 2 ±2 | 0 ±0 | 0 ±0 | 69 ±5 |
| | task3 | 0 ±0 | 24 ±5 | 7 ±2 | 0 ±0 | 2 ±2 | 82 ±6 |
| | task4 | 0 ±1 | 0 ±0 | 0 ±0 | 0 ±0 | 1 ±1 | 8 ±5 |
| | task5 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | overall | 13 ±2 | 23 ±2 | 12 ±4 | 2 ±0 | 15 ±2 | 50 ±1 |

Table 23: Full results on Visual Puzzle.
| Dataset | Task | GCBC | GCIVL | GCIQL | QRL | CRL | HIQL |
|---|---|---|---|---|---|---|---|
| visual-puzzle-3x3-play-v0 | task1 | 1 ±1 | 97 ±2 | 5 ±9 | 3 ±3 | 2 ±1 | 98 ±1 |
| | task2 | 0 ±0 | 1 ±1 | 0 ±0 | 0 ±0 | 0 ±0 | 73 ±10 |
| | task3 | 0 ±0 | 0 ±1 | 0 ±0 | 0 ±0 | 0 ±0 | 64 ±9 |
| | task4 | 0 ±0 | 2 ±3 | 0 ±0 | 0 ±0 | 0 ±0 | 63 ±9 |
| | task5 | 0 ±0 | 3 ±2 | 0 ±0 | 0 ±0 | 0 ±0 | 66 ±12 |
| | overall | 0 ±0 | 21 ±1 | 1 ±2 | 1 ±1 | 0 ±0 | 73 ±8 |
| visual-puzzle-4x4-play-v0 | task1 | 11 ±3 | 86 ±7 | 18 ±5 | 0 ±0 | 10 ±7 | 70 ±47 |
| | task2 | 19 ±5 | 8 ±5 | 28 ±12 | 0 ±0 | 19 ±13 | 38 ±32 |
| | task3 | 8 ±3 | 80 ±4 | 15 ±4 | 0 ±0 | 8 ±7 | 66 ±44 |
| | task4 | 6 ±2 | 65 ±9 | 10 ±5 | 0 ±0 | 7 ±5 | 66 ±44 |
| | task5 | 6 ±2 | 61 ±8 | 9 ±3 | 0 ±0 | 4 ±4 | 60 ±41 |
| | overall | 10 ±1 | 60 ±5 | 16 ±4 | 0 ±0 | 10 ±6 | 60 ±41 |
| visual-puzzle-4x5-play-v0 | task1 | 22 ±8 | 86 ±4 | 31 ±8 | 0 ±0 | 31 ±6 | 66 ±44 |
| | task2 | 1 ±1 | 0 ±0 | 1 ±1 | 0 ±0 | 0 ±1 | 0 ±0 |
| | task3–5 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | overall | 5 ±2 | 17 ±1 | 7 ±2 | 0 ±0 | 6 ±1 | 13 ±9 |
| visual-puzzle-4x6-play-v0 | task1 | 12 ±4 | 66 ±3 | 10 ±3 | 0 ±0 | 12 ±5 | 34 ±23 |
| | task2 | 0 ±1 | 8 ±2 | 2 ±2 | 0 ±0 | 2 ±1 | 8 ±6 |
| | task3–5 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | overall | 2 ±1 | 15 ±1 | 2 ±1 | 0 ±0 | 3 ±1 | 9 ±6 |
| visual-puzzle-3x3-noisy-v0 | task1 | 4 ±6 | 100 ±1 | 98 ±2 | 0 ±0 | 7 ±7 | 100 ±0 |
| | task2 | 0 ±0 | 0 ±1 | 5 ±7 | 0 ±0 | 0 ±0 | 64 ±11 |
| | task3 | 0 ±0 | 0 ±0 | 2 ±2 | 0 ±0 | 0 ±0 | 55 ±3 |
| | task4 | 0 ±0 | 0 ±0 | 5 ±4 | 0 ±0 | 0 ±0 | 61 ±8 |
| | task5 | 0 ±0 | 0 ±0 | 19 ±10 | 0 ±0 | 0 ±0 | 71 ±13 |
| | overall | 1 ±1 | 20 ±0 | 26 ±4 | 0 ±0 | 1 ±1 | 70 ±6 |
| visual-puzzle-4x4-noisy-v0 | task1 | 6 ±2 | 90 ±9 | 85 ±3 | 0 ±0 | 4 ±3 | 98 ±2 |
| | task2 | 16 ±7 | 1 ±1 | 1 ±1 | 0 ±0 | 14 ±1 | 44 ±20 |
| | task3 | 4 ±3 | 88 ±4 | 77 ±6 | 0 ±0 | 4 ±4 | 95 ±2 |
| | task4 | 2 ±1 | 36 ±10 | 47 ±17 | 0 ±0 | 2 ±1 | 91 ±2 |
| | task5 | 4 ±2 | 20 ±9 | 34 ±13 | 0 ±0 | 5 ±4 | 94 ±1 |
| | overall | 7 ±3 | 47 ±3 | 49 ±7 | 0 ±0 | 6 ±2 | 84 ±4 |
| visual-puzzle-4x5-noisy-v0 | task1 | 30 ±6 | 72 ±48 | 96 ±2 | 0 ±0 | 33 ±5 | 72 ±48 |
| | task2–5 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | overall | 6 ±1 | 14 ±10 | 19 ±0 | 0 ±0 | 7 ±1 | 14 ±10 |
| visual-puzzle-4x6-noisy-v0 | task1 | 10 ±2 | 61 ±41 | 82 ±4 | 0 ±0 | 9 ±7 | 56 ±7 |
| | task2 | 0 ±0 | 1 ±1 | 2 ±1 | 0 ±0 | 0 ±0 | 12 ±10 |
| | task3–5 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | overall | 2 ±1 | 12 ±8 | 17 ±1 | 0 ±0 | 2 ±1 | 14 ±2 |

Table 24: Full results on Powderworld.

| Dataset | Task | GCBC | GCIVL | GCIQL | QRL | CRL | HIQL |
|---|---|---|---|---|---|---|---|
| powderworld-easy-play-v0 | task1 | 1 ±1 | 99 ±1 | 95 ±4 | 25 ±8 | 40 ±6 | 62 ±5 |
| | task2 | 0 ±0 | 96 ±4 | 93 ±3 | 12 ±4 | 43 ±10 | 54 ±10 |
| | task3 | 0 ±0 | 100 ±0 | 94 ±8 | 8 ±7 | 15 ±5 | 41 ±26 |
| | task4 | 0 ±0 | 100 ±0 | 92 ±8 | 8 ±5 | 10 ±3 | 8 ±6 |
| | task5 | 0 ±0 | 100 ±1 | 93 ±5 | 7 ±3 | 2 ±3 | 2 ±2 |
| | overall | 0 ±0 | 99 ±1 | 93 ±5 | 12 ±2 | 22 ±5 | 33 ±9 |
| powderworld-medium-play-v0 | task1 | 0 ±0 | 81 ±12 | 7 ±14 | 2 ±3 | 0 ±1 | 26 ±20 |
| | task2 | 3 ±3 | 28 ±14 | 0 ±0 | 9 ±6 | 2 ±3 | 16 ±12 |
| | task3 | 2 ±3 | 99 ±2 | 72 ±14 | 4 ±5 | 2 ±2 | 60 ±37 |
| | task4 | 0 ±0 | 39 ±17 | 0 ±1 | 0 ±0 | 0 ±0 | 5 ±4 |
| | task5 | 0 ±0 | 2 ±2 | 0 ±0 | 0 ±0 | 0 ±0 | 1 ±2 |
| | overall | 1 ±1 | 50 ±4 | 16 ±5 | 3 ±1 | 1 ±1 | 22 ±14 |
| powderworld-hard-play-v0 | task1 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | task2 | 0 ±0 | 4 ±3 | 0 ±0 | 0 ±0 | 0 ±0 | 1 ±2 |
| | task3 | 0 ±0 | 4 ±4 | 0 ±0 | 0 ±0 | 0 ±0 | 2 ±2 |
| | task4 | 0 ±0 | 12 ±14 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 |
| | task5 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±0 | 0 ±1 |
| | overall | 0 ±0 | 4 ±3 | 0 ±0 | 0 ±0 | 0 ±0 | 1 ±1 |
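Every row in the tables above follows the same layout: a task label followed by six (mean, std) pairs in the fixed column order GCBC, GCIVL, GCIQL, QRL, CRL, HIQL. A minimal sketch of a parser for this row format is shown below; the function `parse_row` is an illustrative helper for reading these tables, not part of the OGBench API.

```python
# Illustrative helper (not part of OGBench): parse one flattened
# result row, e.g. "task1 7 3 57 6 71 9 6 2 20 6 15 5", into a
# mapping from algorithm name to its (mean, std) success rate.

ALGOS = ["GCBC", "GCIVL", "GCIQL", "QRL", "CRL", "HIQL"]

def parse_row(row: str) -> dict:
    """Return {"task": name, algo: (mean, std), ...} for one table row."""
    name, *tokens = row.split()
    nums = [int(t) for t in tokens]
    assert len(nums) == 2 * len(ALGOS), f"unexpected row shape: {row}"
    # Even indices are means, odd indices are standard deviations.
    pairs = list(zip(nums[0::2], nums[1::2]))
    return {"task": name, **dict(zip(ALGOS, pairs))}

# Example: the cube-single-play-v0 task1 row from Table 18.
row = parse_row("task1 7 3 57 6 71 9 6 2 20 6 15 5")
# row["GCIQL"] is (71, 9), i.e. a success rate of 71 ±9.
```

Pairing means with their deviations this way makes it easy to load all of the appendix tables into a single dataframe for further analysis.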