# How Does Goal Relabeling Improve Sample Efficiency?

Sirui Zheng¹, Chenjia Bai², Zhuoran Yang³, Zhaoran Wang¹

¹Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208, USA. ²Shanghai Artificial Intelligence Laboratory. ³Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA. Correspondence to: Sirui Zheng.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Hindsight experience replay and goal relabeling are successful in reinforcement learning (RL) since they enable agents to learn from failures. Despite their successes, we lack a theoretical understanding, such as (i) why hindsight experience replay improves sample efficiency and (ii) how to design a relabeling method that achieves sample efficiency. To this end, we construct an example to show the information-theoretic improvement in sample efficiency achieved by goal relabeling. Our example reveals that goal relabeling can enhance sample efficiency and exploit the rich information in observations through better hypothesis elimination. Based on these insights, we develop an RL algorithm called GOALIVE. To analyze the sample complexity of GOALIVE, we introduce a complexity measure, the goal-conditioned Bellman-Eluder (GOAL-BE) dimension, which characterizes the sample complexity of goal-conditioned RL problems. Compared to the Bellman-Eluder dimension, the goal-conditioned version offers an exponential improvement in the best case. To the best of our knowledge, our work provides the first characterization of the theoretical improvement in sample efficiency achieved by goal relabeling.

1. Introduction

Numerous RL problems encountered in real-world scenarios involve sparse rewards and vast state spaces (Silver et al., 2017; Savinov et al., 2018; Riedmiller et al., 2018), making them challenging to solve due to the lack of meaningful feedback. A widely used technique to address these challenges is hindsight experience replay and goal relabeling (Andrychowicz et al., 2017). Such algorithms typically employ a goal-dependent reward. In each iteration, the agent updates the value function with respect to a relabeled goal rather than the target goal. Numerous empirical studies have demonstrated the effectiveness of hindsight experience replay and goal relabeling in various scenarios (Andrychowicz et al., 2017; Pong et al., 2018; Fang et al., 2019; Colas et al., 2019; Li et al., 2020; Pitis et al., 2020; Zhang et al., 2020).

Despite its empirical success, the theoretical understanding of hindsight experience replay and goal relabeling is still elusive. Empirical studies have employed various methods for these techniques; however, these methods lack theoretical guarantees (Andrychowicz et al., 2017; Pong et al., 2018; Nair et al., 2018; Pitis et al., 2020). Additionally, the design of provably efficient algorithms incorporating goal relabeling remains unclear, as existing research on efficient exploration (Jiang et al., 2017; Cai et al., 2019; Jin et al., 2020b) does not incorporate goal relabeling. While prior research has produced sample-efficient algorithms for multitask RL and reward-free RL (Jin et al., 2020a; Lu et al., 2022; Cheng et al., 2022), which share similarities with goal-conditioned RL, these results do not fully capture the benefits of goal relabeling.
Consequently, we aim to enhance the theoretical understanding of goal relabeling by addressing the following questions. First, do hindsight experience replay and goal relabeling provably improve sample efficiency? Second, if the answer is positive, how does goal relabeling contribute to this improvement? Third, can we design a principled goal-conditioned algorithm that provably achieves better sample efficiency for problems with sparse rewards and large state spaces? Our contributions address these questions and can be summarized as follows.

For the first question, we provide an affirmative answer via a constructive proof. We show in Section 3 that there exist MDPs for which model-free algorithms using goal-conditioned value functions achieve polynomial sample complexity, whereas any model-free method given a value function class without multiple goals incurs sample complexity exponential in the horizon. In other words, using goal-conditioned value functions exponentially improves the sample complexity in the best case. To the best of our knowledge, this is the first result illustrating the information-theoretic improvement achieved by using goal-conditioned value functions in model-free algorithms.

For the second question, we show in Section 3.2 that goal relabeling allows the algorithm to better utilize the information in the observations. By incorporating goal-conditioned value functions into model-free algorithms, we can evaluate multiple Bellman errors with the rewards corresponding to multiple goals. Such a procedure allows us to eliminate more hypotheses using the value function of the same state-action pair with respect to different goals, which is more sample efficient.

For the third question, we propose a novel algorithm called GOAL-conditioned Optimism-Led Iterative Value-function Elimination (GOALIVE) in Section 4 based on the above intuition. By using goal relabeling, the agent can obtain rich feedback from the observations even when the reward with respect to the target goal is sparse. Additionally, we employ general function approximation to handle large state spaces. We show that the relabeling method employed in existing algorithms (Andrychowicz et al., 2017; Li et al., 2020; Pitis et al., 2020) can be regarded as a modification of our algorithm. To analyze the proposed algorithm, we introduce a new complexity measure called the GOAL-conditioned Bellman-Eluder (GOAL-BE) dimension in Section 5.1, which characterizes the complexity of goal-conditioned RL problems. We demonstrate that the sample complexity of GOALIVE can be upper-bounded in terms of the GOAL-BE dimension. We prove that the proposed complexity measure is not larger than the original Bellman-Eluder dimension, which shows that our definition is more general. We also show in Section 5 that the GOAL-BE dimension can be exponentially smaller than the original one in the best case. To the best of our knowledge, this is the first study to develop a provably efficient goal-conditioned algorithm and explain the improvement achieved by goal relabeling.

1.1. Related Work

Our work is closely related to the line of research on provably efficient exploration in the function approximation setting (Jiang et al., 2017; Jin et al., 2020b; Cai et al., 2019; Du et al., 2021; Uehara et al., 2021; Zhang et al., 2022; Liu et al., 2024).
This line of research typically considers exploration for a single task and is not readily applicable to scenarios involving multiple goals (Jiang et al., 2017; Jin et al., 2020b; Cai et al., 2019). Sun et al. (2019) observed that model-free algorithms can be inefficient since they fail to exploit the information in observations. Our work shows that using goal-conditioned value functions improves the sample efficiency of model-free algorithms. The study of efficient exploration in multitask settings (Lu et al., 2022; Cheng et al., 2022) is more closely related to our work, as it considers multiple goals. However, these studies cannot explain the benefits of using multitask methods for solving a single task. The research on reward-free exploration (Jin et al., 2020a; Wang et al., 2020) also aligns with our work, as both aim to develop exploration strategies that do not rely on the original reward. Nonetheless, these studies employ model-based algorithms, whereas model-free algorithms are more prevalent in empirical work.

Our work is inspired by the empirical findings on hindsight experience replay and goal relabeling (Andrychowicz et al., 2017; Pong et al., 2018; Fang et al., 2019; Colas et al., 2019; Li et al., 2020; Pitis et al., 2020; Zhang et al., 2020). We provide a theoretical explanation for the improvement in sample efficiency achieved through goal relabeling.

Notations. We define $[n] = \{1, \ldots, n\}$ when $n$ is an integer. For a set $\mathcal{F}$, we denote by $|\mathcal{F}|$ the cardinality of $\mathcal{F}$.

2. Preliminary

In this paper, we focus on goal-conditioned reinforcement learning, where the goal serves as a parameter of the reward function. For example, the reward function can be designed as an indicator function that takes the goal as input and assigns non-zero values only when the final state matches the specified goal. More specifically, we consider an episodic MDP $(\mathcal{S}, \mathcal{A}, \mathcal{G}, H, P^\star, r^\star)$ with a state space $\mathcal{S} \subseteq \mathbb{R}^d$, an action space $\mathcal{A}$, a goal space $\mathcal{G}$, a horizon $H$, transition kernels $P^\star = \{P^\star_h\}_{h=1}^H$, and known reward functions $r^\star = \{r^\star_h\}_{h=1}^H$. We assume that the reward functions are bounded and deterministic, that is, $r^\star_h \in [0, 1]$ for all $h \in [H]$. The agent iteratively interacts with the environment as follows. At the beginning of each episode, the agent determines a policy $\pi = \{\pi_h\}_{h=1}^H$, where $\pi_h \colon \mathcal{S} \times \mathcal{G} \to \Delta(\mathcal{A})$ for any $h \in [H]$. Without loss of generality, we assume that the initial state is fixed to $s_{\mathrm{init}} \in \mathcal{S}$ across all episodes. At the $h$-th step, the agent receives a state $s_h$ and takes an action $a_h \sim \pi_h(\cdot \mid s_h, g^\star)$, where $g^\star$ is the target goal. Subsequently, the agent receives a reward $r^\star_h(s_h, a_h, g^\star)$ and the next state $s_{h+1} \sim P^\star_{h+1}(\cdot \mid s_h, a_h)$. The episode ends after the agent receives the last state $s_{H+1}$. For a given policy $\pi = \{\pi_h\}_{h=1}^H$, we define the value function $V^\pi_h$ and the Q-function $Q^\pi_h$ for any $h \in [H]$ and $g \in \mathcal{G}$ as

$$V^\pi_h(s, g) := \mathbb{E}_{\pi, P^\star}\Big[\sum_{i=h}^{H} r^\star_i(s_i, a_i, g) \,\Big|\, s_h = s\Big], \qquad Q^\pi_h(s, a, g) := \mathbb{E}_{\pi, P^\star}\Big[\sum_{i=h}^{H} r^\star_i(s_i, a_i, g) \,\Big|\, s_h = s, a_h = a\Big]. \quad (1)$$

Here the expectation $\mathbb{E}_{\pi, P^\star}[\cdot]$ in (1) is taken with respect to $s_{i+1} \sim P^\star_i(\cdot \mid s_i, a_i)$ and $a_i \sim \pi_i(\cdot \mid s_i, g)$ for $i \in \{h, h+1, \ldots, H\}$. For convenience, we define $V^\pi_{H+1}(s, g) = 0$ for any state $s \in \mathcal{S}$ and policy $\pi$. We then define the goal-conditioned Bellman operator $\mathcal{T}_h$ by

$$\mathcal{T}_h f(s, a, g) := r^\star_h(s, a, g) + \mathbb{E}_{s' \sim P^\star_h}\big[\max_{a'} f(s', a', g)\big],$$

so that $Q^\star_h(s, a, g) = \mathcal{T}_h Q^\star_{h+1}(s, a, g)$. For simplicity, we define the expected total reward $J(\pi, g)$ as $J(\pi, g) = V^\pi_1(s_{\mathrm{init}}, g)$.

The objective of goal-conditioned RL is to find a policy $\pi$ that maximizes the expected total reward with regard to the target goal $g^\star$. Specifically, we define the optimal Q-function $Q^\star_h$ and the optimal value function $V^\star_h$ as $Q^\star_h(s, a, g) = \max_\pi Q^\pi_h(s, a, g)$ and $V^\star_h(s, g) = \max_\pi V^\pi_h(s, g)$ for any $(s, a, g) \in \mathcal{S} \times \mathcal{A} \times \mathcal{G}$. Correspondingly, we define the optimal goal-conditioned policy $\pi^\star$ as the policy that satisfies $Q^\star_h(s, a, g) = Q^{\pi^\star}_h(s, a, g)$ for any $(s, a, g) \in \mathcal{S} \times \mathcal{A} \times \mathcal{G}$.

In this paper, we consider RL with value function approximation. Formally, the learner is given a hypothesis class $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$, where $\mathcal{F}_h \subseteq (\mathcal{S} \times \mathcal{A} \times \mathcal{G} \mapsto [0, H - h + 1])$ is the set of approximators of $Q^\star_h$ defined in (1). We also define $f_{H+1} \equiv 0$ for convenience. For a hypothesis $f = \{f_h\}_{h=1}^H$, we define the goal-conditioned average Bellman error by

$$\mathcal{E}(f, \pi, h, g) := \mathbb{E}_\pi\big[f_h(s_h, a_h, g) - r^\star_h(s_h, a_h, g) - \max_{a' \in \mathcal{A}} f_{h+1}(s_{h+1}, a', g)\big], \qquad \mathcal{E}(f, \pi, h) := \max_{g \in \mathcal{G}} |\mathcal{E}(f, \pi, h, g)|. \quad (2)$$

By the Bellman equation, we have $\mathcal{E}(Q^\star, \pi, h) = 0$ for every policy $\pi$ and $h \in [H]$.

An important tool in our analysis is the Bellman-Eluder dimension, which characterizes the complexity of an RL problem when implementing model-free algorithms. In the following, we introduce its definition, beginning with the concept of $\epsilon$-independence between distributions.

Definition 2.1 ($\epsilon$-independence between distributions). Let $\mathcal{F}$ be a function class defined on $\mathcal{X}$, and $\nu, \mu_1, \ldots, \mu_n$ be probability measures over $\mathcal{X}$. We say $\nu$ is $\epsilon$-independent of $\{\mu_1, \mu_2, \ldots, \mu_n\}$ with respect to $\mathcal{F}$ if there exists $f \in \mathcal{F}$ such that

$$\sqrt{\sum_{i=1}^{n} \big(\mathbb{E}_{x \sim \mu_i}[f(x)]\big)^2} \le \epsilon, \quad \text{but} \quad \big|\mathbb{E}_{x \sim \nu}[f(x)]\big| > \epsilon.$$

Intuitively, $\nu$ being independent of $\{\mu_1, \ldots, \mu_n\}$ means that there exists a hypothesis $f$ whose loss is small on all of the distributions $\{\mu_i\}_{i=1}^n$ but rather large on $\nu$. Jin et al. (2021a) propose a dimension based on this notion of independence.

Definition 2.2 (Distributional Eluder dimension). Let $\mathcal{F}$ be a function class defined on $\mathcal{X}$, and $\Pi$ be a family of probability measures over $\mathcal{X}$. The distributional Eluder dimension $\dim_{\mathrm{DE}}(\mathcal{F}, \Pi, \epsilon)$ is the length of the longest sequence $\{\rho_1, \ldots, \rho_n\} \subseteq \Pi$ such that there exists $\epsilon' \ge \epsilon$ where $\rho_i$ is $\epsilon'$-independent of $\{\rho_1, \ldots, \rho_{i-1}\}$ for all $i \in [n]$.

Based on Definitions 2.1 and 2.2, Jin et al. (2021a) propose the BE dimension, which characterizes the complexity of an RL problem solved by a model-free algorithm with general function approximation in single-task settings.

Definition 2.3 (Bellman-Eluder (BE) dimension). Let $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$, where $\mathcal{F}_h \subseteq (\mathcal{S} \times \mathcal{A} \mapsto [0, H - h + 1])$. Let $(I - \mathcal{T}_h)\mathcal{F} := \{f_h - \mathcal{T}_h f_{h+1} : f \in \mathcal{F}\}$ be the set of Bellman residuals induced by $\mathcal{F}$ at step $h$, and $\mathcal{D} = \{\mathcal{D}_h\}_{h=1}^H$ be a collection of $H$ probability measure families over $\mathcal{S} \times \mathcal{A}$. The $\epsilon$-Bellman-Eluder dimension of $\mathcal{F}$ with respect to $\mathcal{D}$ is defined as

$$\dim_{\mathrm{BE}}(\mathcal{F}, \mathcal{D}, \epsilon) := \max_{h \in [H]} \dim_{\mathrm{DE}}\big((I - \mathcal{T}_h)\mathcal{F}, \mathcal{D}_h, \epsilon\big).$$

We remark that this definition is not directly applicable to goal-conditioned RL. The hypothesis class $\mathcal{F}$ in Definition 2.3 does not take the goal $g$ as input, so it remains unclear how to incorporate the goal space within the given definition.
3. Mitigating Information Loss with Goal-conditioned Value Functions

In this section, we present an example demonstrating that goal-conditioned model-free algorithms can exponentially improve sample efficiency compared with model-free algorithms without multiple goals. To the best of our knowledge, this is the first result illustrating the information-theoretic improvement achieved by using a goal-conditioned value function. We introduce the example in Section 3.1 and provide the analysis in Section 3.2.

3.1. Example

We consider an episodic MDP with horizon $H > 3$ and set $d = H - 2$. The state space is $\mathcal{S} = \{1, 2\}^d$, and the action space is $\mathcal{A} = \{1, 2\}$. Our transition class contains $2^d$ transitions, each of which is uniquely indexed by a path of length $d$, $p = (p_1, p_2, \ldots, p_d)$ with $p_i \in \{1, 2\}$. All the transitions are the same in the first $H - 1$ steps. For $h \le H - 2$, when we take action $a$ in the state $s_h = (s_{h,1}, \ldots, s_{h,h}, \ldots, s_{h,d})$ at the $h$-th step, we transit to $s_{h+1} = (s_{h,1}, \ldots, a, \ldots, s_{h,d})$, that is, the $h$-th coordinate of the state is replaced by $a$. The transition at the $(H-1)$-th step does not depend on the action $a_{H-1}$. In the transition indexed by $p = (p_1, p_2, \ldots, p_d)$, when we select action $a$ in $s_{H-1} = (s_{H-1,1}, \ldots, s_{H-1,d})$, we transit to $s_H = \big(\mathbf{1}\{s_{H-1,1} = p_1\}, \ldots, \mathbf{1}\{s_{H-1,d} = p_d\}\big)$. The goal space is $\mathcal{G} = \{0, 1\}^d$, and the goal-conditioned reward is $r^\star_h(s, a, g) = 0$ for $h \in [H-1]$ and $r^\star_H(s, a, g) = \mathbf{1}\{s = g\}$. We fix the initial state as $s_1 = [1]^d$. We want to obtain the optimal policy with respect to the reward $\{r^\star_h(s, a, g^\star)\}_{h=1}^H$, where $g^\star = [1]^d$. A graphical illustration of the setting is provided in Figure 1.

[Figure 1: An example of the MDP construction in Section 3, with d = 2 and H = 4. All models are deterministic, and each model is uniquely indexed by a sequence of actions p. We use p⋆ to denote the index of the true model; here p⋆ = (2, 1). The last transition is designed such that the agent always lands in a state that contains a 0 unless it follows the path p⋆.]

3.2. Sample Efficiency of Model-free Algorithms

In this subsection, we use the provided example to show that goal-conditioned model-free algorithms can be exponentially more sample-efficient than algorithms that do not incorporate goals. We first introduce the definition of model-free algorithms, then the value function class we use, and finally the analysis showing that goal-conditioned algorithms exponentially improve the sample efficiency in the given example.

Definition 3.1 (Goal-conditioned Model-free Algorithm). Given a function class $\mathcal{F} \colon (\mathcal{S} \times \mathcal{A} \times \mathcal{G}) \to \mathbb{R}$, define the original $\mathcal{F}$-profile $\Phi_{\mathcal{F}, g^\star} \colon \mathcal{S} \to \mathbb{R}^{|\mathcal{F}| \times |\mathcal{A}|}$ by $\Phi_{\mathcal{F}, g^\star}(s) := [f(s, a, g^\star)]_{f \in \mathcal{F}, a \in \mathcal{A}}$. An algorithm is model-free using $\mathcal{F}$ without multiple goals if it accesses $s$ exclusively through $\Phi_{\mathcal{F}, g^\star}(s)$ for all $s \in \mathcal{S}$ during its entire execution. We also define the goal-conditioned $\mathcal{F}$-profile $\Phi_{\mathcal{F}, \mathcal{G}} \colon \mathcal{S} \to \mathbb{R}^{|\mathcal{F}| \times |\mathcal{A}| \times |\mathcal{G}|}$ by $\Phi_{\mathcal{F}, \mathcal{G}}(s) := [f(s, a, g)]_{f \in \mathcal{F}, a \in \mathcal{A}, g \in \mathcal{G}}$. An algorithm is goal-conditioned model-free using $\mathcal{F}$ and $\mathcal{G}$ if it accesses $s$ exclusively through $\Phi_{\mathcal{F}, \mathcal{G}}(s)$ for all $s \in \mathcal{S}$ during its entire execution.

Definition 3.1 is a modification of Definition 1 in Sun et al. (2019). In the definition, $\mathcal{G}$ is the space of goals, and $f(s, a, g)$ is the action value of $(s, a)$ when we are interested in the reward indexed by the goal $g$. We modified Definition 1 in Sun et al. (2019) to incorporate goals.

Value Function Class. We consider solving this MDP with model-free algorithms. In the following, we introduce the value function class we employ. For a hypothesis $p = (p_1, \ldots, p_d) \in \{1, 2\}^d$, we define $Q^p_h(s, a, g)$ as the optimal action-value function of $(s, a)$ with respect to the transition $p$ and the reward indexed by $g$. It is obvious that $Q^p_H(s, a, g) = \mathbf{1}\{s = g\}$. For $h < H$, $Q^p_h(s, a, g)$ is defined as follows. For $s_h = (s_{h,1}, \ldots, s_{h,d})$ and $p = (p_1, \ldots, p_d)$, we define

$$f_{h, \tilde d}(s_h, a, p) = \begin{cases} \mathbf{1}\{s_{h, \tilde d} = p_{\tilde d}\} & \text{when } \tilde d < h, \\ \mathbf{1}\{a = p_{\tilde d}\} & \text{when } \tilde d = h, \end{cases}$$

for $\tilde d \le h$. By the definition of the setting, $f_{h, \tilde d}(s_h, a, p)$ is the $\tilde d$-th component of $s_H$ in the transition indexed by $p$ when we take action $a$ in the state $s_h$. Therefore, when we denote by $Q^p_h$ the optimal value function in the MDP with the transition indexed by $p$, we have

$$Q^p_h(s_h, a, g) = \prod_{\tilde d = 1}^{\min\{h, d\}} \mathbf{1}\big\{f_{h, \tilde d}(s_h, a, p) = g_{\tilde d}\big\},$$

where $g = (g_1, \ldots, g_d)$. Intuitively, $Q^p_h$ represents the feasibility of reaching the goal state $g$ from the state-action pair $(s_h, a)$. We then define $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$ and $\mathcal{F}_{g^\star} = \mathcal{F}_{g^\star, 1} \times \cdots \times \mathcal{F}_{g^\star, H}$, where

$$\mathcal{F}_h := \big\{Q^p_h(\cdot, \cdot, \cdot) \mid p \in \{1, 2\}^d\big\}, \qquad \mathcal{F}_{g^\star, h} := \big\{Q^p_h(\cdot, \cdot, g^\star) \mid p \in \{1, 2\}^d\big\}. \quad (3)$$

By definition, $\mathcal{F}$ contains goal-conditioned value functions, while $\mathcal{F}_{g^\star}$ only contains value functions with respect to $g^\star$. In the example above, we have the following lemma, which shows that goal-conditioned value functions exponentially improve the sample efficiency of model-free algorithms.

Lemma 3.2 (Exponential Improvement of Goal-conditioned Model-free Algorithms). Fix $\delta, \epsilon \in (0, 1]$. In the given example, any model-free algorithm without multiple goals that uses $\mathcal{F}_{g^\star}$ in (3) as the value function class and uses $o(2^H)$ trajectories outputs a policy $\widehat\pi$ with $V^{\widehat\pi}(s_1, g^\star) \le V^\star(s_1, g^\star) - 1/2$ with probability at least $1/3$. Meanwhile, there exists a goal-conditioned model-free algorithm using $\mathcal{F}$ in (3) as the value function class (Algorithm 1) that, with probability $1 - \delta$, outputs $\widehat\pi$ satisfying $V^{\widehat\pi}(s_1, g^\star) \ge V^\star(s_1, g^\star) - \epsilon$ using at most $\mathrm{poly}(H, 1/\epsilon, \log(1/\delta))$ trajectories for any transition in this family.

Proof. See Appendix A for a detailed proof.

Interpretation. Our observation is that goal-conditioned model-free algorithms are more effective at utilizing the rich information in the data than original model-free algorithms. In a goal-conditioned algorithm, we can obtain nonzero rewards even when the reward with respect to the target goal is sparse, since we can use the rewards with respect to different goals for hypothesis elimination. As a result, we can eliminate more hypotheses using Bellman errors with respect to different goals, which improves the sample efficiency. In Definition 3.1, a model-free algorithm without multiple goals accesses the state $s$ only through the values concerning the target goal $g^\star$. Therefore, it can only eliminate hypotheses using the Bellman error with respect to $g^\star$, which can be zero for most of the hypotheses when the reward is sparse. By using the rewards and value functions with respect to multiple goals, we obtain denser feedback, leading to more effective hypothesis elimination.

To illustrate the above idea, we revisit the example in Section 3.1. We consider the case where the true transition is $p^\star = (2, 1)$ and the target goal is $g^\star = (1, 1)$. For the trajectory $\tau = (s_1, a_1, \ldots, s_4, a_4, s_5)$, we calculate the Bellman error and provide it in Table 1. As shown in Table 1, since the reward with respect to the target goal $g^\star$ is sparse, the Bellman error of most of the hypotheses is zero if we only consider the target goal $g^\star$. Therefore, we can only eliminate one hypothesis using the Bellman error with respect to $g^\star$. Meanwhile, we can obtain nonzero rewards and eliminate all the wrong hypotheses using the rewards and Bellman errors with respect to $g = (1, 0)$.
Therefore, we can achieve an exponential improvement in sample efficiency using goal relabeling.

Rationale for Focusing Our Analysis on Model-Free Algorithms. In the above analysis, we focus on model-free algorithms as they are widely used in existing works that apply goal relabeling (Andrychowicz et al., 2017; Li et al., 2020; Pitis et al., 2020). Sun et al. (2019) show that model-based algorithms are more sample efficient as they leverage more supervision. However, model-based algorithms necessitate an additional, often complex, planning phase to derive an optimal policy. This phase, particularly in environments with complex dynamics, introduces substantial computational costs and a heightened risk of error accumulation. Errors in estimating the dynamics can magnify during the planning phase, potentially resulting in significantly suboptimal policies. In contrast, goal-conditioned model-free algorithms avoid the complexities of modeling the dynamics and subsequent planning, choosing instead to directly approximate the value function. This direct approach reduces the risk of error propagation inherent in planning, providing a potentially more robust solution in scenarios where accurately modeling the dynamics is challenging. Consequently, even though the goal-conditioned value function's theoretical complexity may match or surpass that of the dynamics, the practical advantages of operational efficiency, robustness, and applicability make goal-conditioned model-free methods a compelling choice in complex environments. Therefore, model-free algorithms are more popular in real-life applications, making it crucial to theoretically analyze the improvements achieved by goal relabeling in model-free algorithms.

4. Algorithm: Goal-conditioned Optimism-Led Iterative Value-function Elimination

In Section 3, we construct an example showing that using goal-conditioned value functions can exponentially improve the sample efficiency in the best case. It is natural to ask how to design an algorithm that provably achieves this improved sample efficiency. In Section 3.2 and Table 1, we show that goal-conditioned value functions improve the sample efficiency because they make the feedback denser and allow better hypothesis elimination. Based on this idea, we propose an algorithm called GOAL-conditioned Optimism-Led Iterative Value-function Elimination (GOALIVE), which is a modification of the algorithm OLIVE proposed in Jiang et al. (2017). At a high level, our algorithm eliminates hypotheses $Q_h \in \mathcal{F}_h$ using the average Bellman error and the data we have collected. After elimination, we choose the exploration policy $\pi^t$ by selecting the most optimistic hypothesis with respect to the target goal $g^\star$ and the corresponding greedy policy.

| Hypothesis | $g = (0,0)$ | $g = (0,1)$ | $g = (1,0)$ | $g = (1,1) = g^\star$ | Eliminated (all goals) | Eliminated (only $g^\star$) |
|---|---|---|---|---|---|---|
| $p = (1,1)$ | 1 | 0 | 1 | 0 | Yes | No |
| $p = (1,2)$ | 0 | 1 | 1 | 0 | Yes | No |
| $p = (2,1) = p^\star$ | 0 | 0 | 0 | 0 | No | No |
| $p = (2,2)$ | 0 | 0 | 1 | 1 | Yes | Yes |

Table 1. Comparison between using multiple goals and using only the target goal $g^\star$ for hypothesis elimination. The table presents the Bellman error $Q^p_3(s_3, a_3, g) - r^\star_3(s_3, a_3, g) - Q^p_4(s_4, a_4, g)$ for each hypothesis $p$ and goal $g$, where $s_3 = (2, 2)$ and $s_4 = (1, 0)$; the true hypothesis is $p^\star = (2, 1)$ and the target goal is $g^\star = (1, 1)$. The table demonstrates that all incorrect hypotheses can be eliminated after goal relabeling with the given data, whereas only one hypothesis can be eliminated using the Bellman error with respect to the target goal $g^\star$.
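The computation behind Table 1 can be reproduced in a few lines. The Python sketch below (ours, for illustration only) builds the hypothesis class $Q^p$ on the $d = 2$, $H = 4$ instance, evaluates the per-goal Bellman residuals on the observed transition $s_3 = (2,2) \to s_4 = (1,0)$, and reports which hypotheses each relabeling strategy eliminates.

```python
from itertools import product

d = 2
goals = list(product((0, 1), repeat=d))        # G = {0, 1}^d
hypotheses = list(product((1, 2), repeat=d))   # transition indices p
g_star, p_star = (1, 1), (2, 1)                # target goal and true model

def q_last(s, g):
    # Q^p_H(s, a, g) = 1{s = g}; it does not depend on a or p.
    return 1.0 if tuple(s) == tuple(g) else 0.0

def q_penultimate(s, a, p, g):
    # Q^p_{H-1}: at step H-1 the action is ignored, and under hypothesis p the
    # next state is (1{s_1 = p_1}, ..., 1{s_d = p_d}).
    predicted = tuple(1 if s[i] == p[i] else 0 for i in range(d))
    return 1.0 if predicted == tuple(g) else 0.0

# Observed transition from the trajectory of Table 1: s_3 = (2, 2), s_4 = (1, 0).
s3, a3, s4 = (2, 2), 1, (1, 0)
r3 = 0.0  # rewards are zero before the last step

eliminated_all, eliminated_target = set(), set()
for p in hypotheses:
    for g in goals:
        residual = q_penultimate(s3, a3, p, g) - r3 - q_last(s4, g)
        if abs(residual) > 0.5:              # nonzero Bellman error: p is inconsistent
            eliminated_all.add(p)
            if g == g_star:
                eliminated_target.add(p)

print("eliminated with all goals :", sorted(eliminated_all))      # every p except (2, 1)
print("eliminated with only g*   :", sorted(eliminated_target))   # only (2, 2)
```

Running the sketch recovers exactly the last two columns of Table 1: relabeling with every goal rules out all three wrong hypotheses from a single transition, while the target goal alone rules out only one.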
Algorithm 1: GOAL-conditioned Optimism-Led Iterative Value-function Elimination (GOALIVE)

1: Input: hypothesis class $\mathcal{F}$, elimination thresholds $\zeta_{\mathrm{act}}$ and $\zeta_{\mathrm{elim}}$, numbers of episodes $n_{\mathrm{act}}$ and $n_{\mathrm{elim}}$.
2: Initialize: $\mathcal{B}^0 \leftarrow \mathcal{F}$, $\mathcal{D}^t_h \leftarrow \emptyset$ for all $h$ and $t$.
3: for iteration $t = 1, 2, \ldots$ do
4: Choose policy $\pi^t = \pi_{f^t}$, where $f^t = \mathrm{argmax}_{f \in \mathcal{B}^{t-1}} f(s_1, \pi_f(s_1), g^\star)$.
5: Execute $\pi^t$ for $n_{\mathrm{act}}$ episodes and update $\mathcal{D}^t_h$ to include the new $(s_h, a_h, s_{h+1})$ tuples.
6: Estimate $\widehat{\mathcal{E}}(f^t, \pi^t, h, g^\star)$ for all $h \in [H]$, where $\widehat{\mathcal{E}}(f, \pi^t, h, g) := \frac{1}{|\mathcal{D}^t_h|} \sum_{(s, a, s') \in \mathcal{D}^t_h} \ell_1[s, a, s', g, f_h, r_h, f_{h+1}]$ and $\ell_1$ is defined in (4).
7: if $\sum_{h=1}^H \widehat{\mathcal{E}}(f^t, \pi^t, h, g^\star) \le H \zeta_{\mathrm{act}}$ then
8: Terminate and output $\pi^t$.
9: end if
10: Pick any $h \in [H]$ for which $\widehat{\mathcal{E}}(f^t, \pi^t, h, g^\star) \ge \zeta_{\mathrm{act}}$ and set $\mathcal{D}^t_h = \emptyset$.
11: Execute $\pi^t$ for $n_{\mathrm{elim}}$ episodes and update $\mathcal{D}^t_h$ to include the new $(s_h, a_h, s_{h+1})$ tuples.
12: Estimate $\widehat{\mathcal{E}}(f, \pi^t, h)$ for all $f \in \mathcal{F}$, where $\widehat{\mathcal{E}}(f, \pi^t, h) := \max_{g \in \mathcal{G}} \big| \frac{1}{|\mathcal{D}^t_h|} \sum_{(s, a, s') \in \mathcal{D}^t_h} \ell_1[s, a, s', g, f_h, r_h, f_{h+1}] \big|$.
13: Update $\mathcal{B}^t = \{ f \in \mathcal{B}^{t-1} : \widehat{\mathcal{E}}(f, \pi^t, h) \le \zeta_{\mathrm{elim}} \}$.
14: end for

The pseudocode of GOALIVE is given in Algorithm 1. In each iteration, our algorithm performs three main steps.

Optimistic planning: we compute the most optimistic hypothesis $f^t$ with regard to the target goal $g^\star$ and choose $\pi^t$ to be its greedy policy (Line 4). We then use $\pi^t$ to interact with the environment (Line 5).

Computing the Bellman error: we evaluate the average Bellman error of $f^t$ under $\pi^t$ (Line 6). We activate the elimination phase if the Bellman error is large, and otherwise return $\pi^t$ (Lines 7-8).

Eliminating hypotheses with large Bellman error: we pick a step $h \in [H]$ at which the estimated Bellman error exceeds the activation threshold $\zeta_{\mathrm{act}}$ (Line 10), and eliminate all hypotheses in the candidate set whose Bellman error at step $h$ exceeds the elimination threshold $\zeta_{\mathrm{elim}}$ (Line 13).

We highlight that we evaluate the average Bellman error with regard to all the goals in the goal space, instead of only $g^\star$, which allows better hypothesis elimination.

Connection with Existing Algorithms. The main difference between our algorithm and the original OLIVE proposed in Jiang et al. (2017) is that we use goal relabeling when evaluating the Bellman error. More specifically, for hypothesis $f^t$, step $h$, and policy $\pi^t$, we use

$$g^t_h = \operatorname*{argmax}_{g \in \mathcal{G}} \Big| \sum_{(s, a, s') \in \mathcal{D}_h} \ell_1[s, a, s', g, f_h, r_h, f_{h+1}] \Big|, \quad \text{where} \quad \ell_1[s, a, s', g, f_h, r_h, f_{h+1}] = f_h(s, a, g) - r_h(s, a, g) - \max_{a' \in \mathcal{A}} f_{h+1}(s', a', g), \quad (4)$$

for hypothesis elimination, instead of only using the target goal $g^\star$. Intuitively, goal relabeling allows us to eliminate more hypotheses in the elimination phase and improves the sample efficiency compared with algorithms that do not use multiple goals. The idea of using the Bellman error for goal relabeling has also appeared in previous work (Zhang et al., 2020); our analysis provides a theoretical justification for this relabeling method. We remark that we use the average Bellman error to address stochastic dynamics. In general, Equation (4) is difficult to compute, as it requires iterating over the goal space. Existing algorithms (Andrychowicz et al., 2017; Li et al., 2020; Pitis et al., 2020) use the state space as the goal space: they view the states in the data as achieved goals and minimize the Bellman error with respect to the achieved goals instead of all possible goals. However, this sacrifices sample efficiency, since evaluating the Bellman error on multiple goals could eliminate more hypotheses. Therefore, their algorithms can be viewed as modifications of Algorithm 1 that favor computational efficiency over sample efficiency.
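The following Python sketch (ours, with illustrative helper names) contrasts the relabeling rule of Equation (4), which searches the whole goal space for the goal with the largest empirical Bellman residual, with the achieved-goal heuristic used by the HER-style methods cited above; `achieved_goal` is an assumed mapping from states to goals, and `f` is an assumed mapping from step indices to value functions with `f[H + 1]` the zero function.

```python
import numpy as np

def relabeled_goal_eq4(f, h, transitions, goals, reward, actions):
    """Goal relabeling in the spirit of Eq. (4): return the goal with the largest
    |empirical average Bellman residual| of hypothesis f at step h on the data."""
    def abs_avg_residual(g):
        vals = [f[h](s, a, g) - reward[h](s, a, g)
                - max(f[h + 1](s2, a2, g) for a2 in actions)
                for (s, a, s2) in transitions]
        return abs(float(np.mean(vals)))
    return max(goals, key=abs_avg_residual)

def relabeled_goals_her_style(transitions, achieved_goal):
    """HER-style relabeling (Andrychowicz et al., 2017, sketch): reuse the goals
    that were actually achieved in the data instead of searching over all of G."""
    return {achieved_goal(s2) for (_, _, s2) in transitions}
```

The first rule maximizes the elimination power of a batch at the price of a search over the goal space; the second is cheap to compute but may miss the most informative goal.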
We further note that, in some special cases, the achieved goal is the same as the goal that maximizes the Bellman error, which is formalized in the following lemma.

Lemma 4.1 (Equivalence of Relabeling Methods). We say a hypothesis class is deterministic if, for any $(f, s, a, g) \in \mathcal{F} \times \mathcal{S} \times \mathcal{A} \times \mathcal{G}$, there exists a unique $s' \in \mathcal{S}$ such that $f_h(s, a, g) = r^\star_h(s, a, g) + \max_{a' \in \mathcal{A}} f_{h+1}(s', a', g)$. Suppose that $r^\star_H(s, a, g) = \mathbf{1}\{s = g\}$ and $r^\star_{H-1}(s, a, g) = 0$ for all $(s, a, g) \in \mathcal{S} \times \mathcal{A} \times \mathcal{G}$. Then for a trajectory $(s_1, a_1, \ldots, a_H, s_{H+1})$, we have

$$s_H \in \operatorname*{argmax}_{g \in \mathcal{G}} \ell_2\big[s_{H-1}, a_{H-1}, g, f_{H-1}, r^\star_{H-1}, f_H, P^\star_{H-1}\big], \quad \text{where} \quad \ell_2[s, a, g, f, r, f', P] = f(s, a, g) - r(s, a, g) - \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[\max_{a' \in \mathcal{A}} f'(s', a', g)\big],$$

for any $f \in \mathcal{F}$, provided that $\mathcal{F}$ and $P^\star_{H-1}$ are deterministic. That is, the achieved goal is a goal that maximizes the Bellman error.

Proof. See Appendix B.1 for a detailed proof.

Lemma 4.1 shows that, in the example in Section 3, the relabeling method in Algorithm 1 is the same as the relabeling method in Andrychowicz et al. (2017), which theoretically justifies their method. This equivalence also suggests that the empirical performance of existing relabeling methods on standard goal-conditioned RL benchmarks can serve as an indirect measure of our algorithm's efficacy.

Repeated sampling with the same policy. In Algorithm 1, we sample multiple trajectories using the policy $\pi^t$. This procedure differs from GOLF proposed by Jin et al. (2021a), and we only obtain an upper bound on the sample complexity instead of a regret bound. However, this procedure is necessary since the maximum operator cannot be interchanged with the expectation operator. More specifically, the loss we want to evaluate is

$$\max_{g \in \mathcal{G}} \big| \mathbb{E}_{(s, a, s') \sim \mu}\big[\ell_1(s, a, s', g, Q_h, r_h, Q_{h+1})\big] \big|,$$

where $\ell_1$ is defined in (4), whereas with only one trajectory per policy we can only evaluate

$$\mathbb{E}_{(s, a, s') \sim \mu}\Big[\max_{g \in \mathcal{G}} \big|\ell_1(s, a, s', g, Q_h, r_h, Q_{h+1})\big|\Big].$$

The latter can be large even for the true hypothesis, and hence cannot be used for hypothesis elimination.
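The gap between these two quantities is easy to see numerically. The toy computation below is our own construction (not from the paper): a single stochastic transition with two equally likely next states and two goals. For the true Q-function, the maximum over goals of the averaged residual is close to zero, while averaging the per-sample maximum gives 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
goals = ("A", "B")

# Toy model: from (s0, a) the next state is "A" or "B" with probability 1/2 each,
# the step-h reward is 0, and the step-(h+1) value is 1{s' = g}.
def q_h(g):            # true Q_h(s0, a, g) = P(s' = g) = 1/2
    return 0.5
def q_hp1(s_next, g):  # true Q_{h+1}(s', a', g), independent of a'
    return 1.0 if s_next == g else 0.0

samples = rng.choice(goals, size=10_000)  # sampled next states s'
residual = lambda s_next, g: q_h(g) - 0.0 - q_hp1(s_next, g)

# Max over goals OUTSIDE the expectation (what GOALIVE estimates with n_elim rollouts):
outside = max(abs(np.mean([residual(s, g) for s in samples])) for g in goals)
# Max over goals INSIDE the expectation (what single-trajectory relabeling would give):
inside = np.mean([max(abs(residual(s, g)) for g in goals) for s in samples])

print(f"max_g |E[residual]|  ~ {outside:.3f}")  # close to 0: the true hypothesis passes
print(f"E[max_g |residual|]  ~ {inside:.3f}")   # 0.500: the true hypothesis looks wrong
```

This is why GOALIVE re-executes the same policy for $n_{\mathrm{elim}}$ episodes before taking the maximum over goals.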
5. Analysis of GOALIVE

In this section, we present the analysis of GOALIVE. We first introduce the complexity measure we use and then present the sample complexity of GOALIVE.

5.1. Complexity Measure: Goal-conditioned Bellman-Eluder Dimension

In this section, we propose a new complexity measure, the goal-conditioned Bellman-Eluder dimension, which quantifies the gain of using goals in model-free algorithms. To enable general function approximation, we modify the Bellman-Eluder dimension, which was proposed by Jin et al. (2021a) to characterize the complexity of an RL problem with general function approximation. We start by developing a goal-conditioned version of the distributional Eluder dimension.

Definition 5.1 (Goal-conditioned $\epsilon$-independence between Distributions). Let $\mathcal{F}$ be a function class defined on $\mathcal{X} \times \mathcal{G}$, and $\nu, \mu_1, \ldots, \mu_n$ be probability measures over $\mathcal{X}$. We say $\nu$ is goal-conditioned $\epsilon$-independent of $\{\mu_1, \mu_2, \ldots, \mu_n\}$ with respect to $\mathcal{F}$ and $g^\star$ if there exists $f \in \mathcal{F}$ such that

$$\sqrt{\sum_{i=1}^{n} \Big(\max_{g \in \mathcal{G}} \big|\mathbb{E}_{x \sim \mu_i}[f(x, g)]\big|\Big)^2} \le \epsilon, \quad \text{but} \quad \big|\mathbb{E}_{x \sim \nu}[f(x, g^\star)]\big| > \epsilon.$$

Definition 5.2 (Goal-conditioned Distributional Eluder Dimension). Let $\mathcal{F}$ be a function class defined on $\mathcal{X} \times \mathcal{G}$, and $\Pi$ be a family of probability measures over $\mathcal{X}$. The goal-conditioned distributional Eluder dimension $\dim_{\mathrm{GOAL\text{-}DE}}(\mathcal{F}, \Pi, \epsilon)$ is the length of the longest sequence $\{\rho_1, \ldots, \rho_n\} \subseteq \Pi$ such that there exists $\epsilon' \ge \epsilon$ where $\rho_i$ is goal-conditioned $\epsilon'$-independent of $\{\rho_1, \ldots, \rho_{i-1}\}$ for all $i \in [n]$.

When the goal space contains only the goal we care about, that is, $\mathcal{G} = \{g^\star\}$, Definition 5.2 degenerates to the original definition of the distributional Eluder dimension in Jin et al. (2021a). In Definition 5.1, we can view $\mathbb{E}_{x \sim \mu_i}[f(x, g)]$ as the loss of hypothesis $f$ with respect to the goal $g$, evaluated on the distribution $\mu_i$. Intuitively, $\nu$ being goal-conditioned independent of $\{\mu_1, \ldots, \mu_n\}$ means that there exists a hypothesis $f$ whose loss is small with respect to all $g \in \mathcal{G}$ on all of the $\{\mu_i\}_{i=1}^n$, but whose loss is considerably larger on $\nu$ for the target goal $g^\star$. Since we can choose any goal for hypothesis elimination, the loss of a hypothesis that has not been eliminated should be small with respect to all $g \in \mathcal{G}$. Therefore, we incorporate multiple goals in Definition 5.1 by taking a maximum over $g \in \mathcal{G}$. Equipped with Definitions 5.1 and 5.2, we are ready to present the definition of the goal-conditioned Bellman-Eluder (GOAL-BE) dimension.

Definition 5.3 (GOAL-conditioned Bellman-Eluder (GOAL-BE) Dimension). Let $(I - \mathcal{T}_h)\mathcal{F} := \{f_h - \mathcal{T}_h f_{h+1} : f \in \mathcal{F}\}$ be the set of Bellman residuals induced by $\mathcal{F}$ at step $h$, and $\Pi = \{\Pi_h\}_{h=1}^H$ be a collection of $H$ probability measure families over $\mathcal{S} \times \mathcal{A}$. The $\epsilon$-goal-conditioned Bellman-Eluder dimension of $\mathcal{F}$ with respect to $\Pi$ is defined as

$$\dim_{\mathrm{GOAL\text{-}BE}}(\mathcal{F}, \Pi, \epsilon) := \max_{h \in [H]} \dim_{\mathrm{GOAL\text{-}DE}}\big((I - \mathcal{T}_h)\mathcal{F}, \Pi_h, \epsilon\big).$$

Similar to Jin et al. (2021a), the GOAL-BE dimension also depends on the choice of the distribution class $\Pi$. To simplify the presentation, we only consider $\mathcal{D}_{\mathcal{F}} := \{\mathcal{D}_{\mathcal{F}, h}\}_{h=1}^H$, where $\mathcal{D}_{\mathcal{F}, h}$ denotes the set of all probability measures over $\mathcal{S} \times \mathcal{A}$ at the $h$-th step that can be generated by executing the greedy policy $\pi_f$ with regard to the target goal $g^\star$ induced by some $f \in \mathcal{F}$, i.e., $\pi_{f, h}(\cdot) = \mathrm{argmax}_{a \in \mathcal{A}} f_h(\cdot, a, g^\star)$ for all $h \in [H]$.

Comparison with the Original Bellman-Eluder Dimension. Definition 5.3 can be viewed as a modification of the original Bellman-Eluder dimension proposed by Jin et al. (2021a). In fact, Definition 5.3 is more general: it reduces to the original definition of the Bellman-Eluder dimension when $\mathcal{G} = \{g^\star\}$. Intuitively, since we have multiple losses for the same hypothesis, we are able to eliminate more hypotheses with the same data, which improves the sample efficiency and reduces the complexity of the original problem. This intuition is formalized in the following two lemmas. Lemma 5.4 shows that the GOAL-BE dimension is not larger than the original BE dimension. Lemma 5.5 shows that, in the example in Section 3, the GOAL-BE dimension is exponentially smaller than the original BE dimension. Together, Lemmas 5.4 and 5.5 show that using goal-conditioned value functions provably reduces the complexity of an RL problem.

Lemma 5.4 (Strict Improvement over the Original Bellman-Eluder Dimension). We define $\mathcal{F}_{g^\star} = \mathcal{F}_{g^\star, 1} \times \cdots \times \mathcal{F}_{g^\star, H}$ and $\mathcal{F}_{\mathcal{G}} = \mathcal{F}_{\mathcal{G}, 1} \times \cdots \times \mathcal{F}_{\mathcal{G}, H}$, where $\mathcal{F}_{g^\star, h} = \{Q_h(\cdot, \cdot, g^\star) \mid Q \in \mathcal{F}_h\}$ and $\mathcal{F}_{\mathcal{G}, h} = \{Q_h(\cdot, \cdot, g) \mid Q \in \mathcal{F}_h, g \in \mathcal{G}\}$. We then have

$$\dim_{\mathrm{GOAL\text{-}BE}}(\mathcal{F}, \Pi, \epsilon) \le \dim_{\mathrm{BE}}(\mathcal{F}_{g^\star}, \Pi, \epsilon) \le \dim_{\mathrm{BE}}(\mathcal{F}_{\mathcal{G}}, \Pi, \epsilon). \quad (5)$$

Proof. See Appendix B.2 for a detailed proof.

Lemma 5.5 (Exponential Improvement in the Best Case). In the example in Section 3, we have

$$\dim_{\mathrm{BE}}(\mathcal{F}_{g^\star}, \mathcal{D}_{\mathcal{F}}, \epsilon) \ge 2^d - 1, \qquad \dim_{\mathrm{GOAL\text{-}BE}}(\mathcal{F}, \mathcal{D}_{\mathcal{F}}, \epsilon) = 1,$$

where $\mathcal{F}_{g^\star} = \mathcal{F}_{g^\star, 1} \times \cdots \times \mathcal{F}_{g^\star, H}$ and $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$, with $\mathcal{F}_{g^\star, h}$ and $\mathcal{F}_h$ defined in (3).

Proof. See Appendix B.3 for a detailed proof.

5.2. Sample Complexity of GOALIVE

We first present the realizability assumption on the hypothesis class $\mathcal{F}$, which is commonly used in previous work (Jin et al., 2021a; Sun et al., 2019).

Assumption 5.6 (Realizability). We assume that $Q^\star_h \in \mathcal{F}_h$ for all $h \in [H]$.
Realizability requires that the hypothesis class is well-specified and is able to characterize the connection between different goals. We also introduce the definition of the covering number, which is widely used in previous work (Jin et al., 2021a;b; Liu et al., 2022).

Definition 5.7 (Covering Number of the Hypothesis Class). The $\epsilon$-covering number of a set $\mathcal{V}$ under a metric $\rho$, denoted $N(\mathcal{V}, \epsilon, \rho)$, is the minimum integer $n$ such that there exists a subset $\mathcal{V}_0 \subseteq \mathcal{V}$ with $|\mathcal{V}_0| = n$ such that for any $x \in \mathcal{V}$ there exists $y \in \mathcal{V}_0$ with $\rho(x, y) \le \epsilon$. For two hypotheses $f = \{f_h\}_{h=1}^H$ and $f' = \{f'_h\}_{h=1}^H$, we define

$$\|f - f'\|_\infty = \max_{(h, s, a, g) \in [H] \times \mathcal{S} \times \mathcal{A} \times \mathcal{G}} |f_h(s, a, g) - f'_h(s, a, g)|.$$

For $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$, we define $N(\mathcal{F}, \epsilon) = \max_{h \in [H]} N(\mathcal{F}_h, \epsilon, \|\cdot\|_\infty)$.

Definition 5.8 (Covering Number of the Goal Space). We say $\mathcal{G}_0$ is an $\epsilon$-covering of the goal space $\mathcal{G}$ if for any $g \in \mathcal{G}$ there exists $g_0 \in \mathcal{G}_0$ such that

$$|r_h(s, a, g) - r_h(s, a, g_0)| \le \epsilon \quad \text{and} \quad |f_h(s, a, g) - f_h(s, a, g_0)| \le \epsilon$$

hold for all $f \in \mathcal{F}$, $h \in [H]$, and $(s, a) \in \mathcal{S} \times \mathcal{A}$. Let $\mathcal{G}(\epsilon)$ be an $\epsilon$-covering of $\mathcal{G}$ with the smallest cardinality, and let $N(\mathcal{G}, \epsilon) = |\mathcal{G}(\epsilon)|$. When $|\mathcal{G}| < \infty$, we have $N(\mathcal{G}, \epsilon) \le |\mathcal{G}|$.

Equipped with the assumption and definitions above, we are ready to present the sample complexity of GOALIVE.

Theorem 5.9 (Sample Complexity of GOALIVE). Under Assumption 5.6, if we choose $\zeta_{\mathrm{goal}} = \epsilon/(600 H \sqrt{d})$, $\zeta_{\mathrm{act}} = 2\epsilon/H$, $\zeta_{\mathrm{elim}} = \epsilon/(4 H \sqrt{d})$, $n_{\mathrm{act}} = 2592 H^2 \iota/\epsilon^2$, and $n_{\mathrm{elim}} = 4608 H^2 d \log_2\big(N(\mathcal{F}, \zeta_{\mathrm{elim}}/8)\big)\, \iota/\epsilon^2$, where $d = \dim_{\mathrm{GOAL\text{-}BE}}(\mathcal{F}, \mathcal{D}_{\mathcal{F}}, \epsilon/H)$ and $\iota = \log_2\big[N(\mathcal{G}, \zeta_{\mathrm{goal}}) H d/(\delta \epsilon)\big]$, then with probability at least $1 - \delta$, Algorithm 1 outputs an $O(\epsilon)$-optimal policy using at most $O\big(H^3 d^2 \log_2\big[N(\mathcal{F}, \zeta_{\mathrm{elim}}/8)\big]\, \iota/\epsilon^2\big)$ episodes.

Proof. See Appendix C for a detailed proof.

Theorem 5.9 shows that RL problems with a low GOAL-BE dimension can be solved efficiently by GOALIVE under the realizability assumption. The sample complexity of GOALIVE is $\widetilde{O}(H^3 d^2/\epsilon^2)$, which is polynomial in $1/\epsilon$ and the horizon. The dependency on the complexity measure is the same as in Jin et al. (2021a). However, as we show in Lemmas 5.4 and 5.5, the GOAL-BE dimension is not larger than the original BE dimension, and it can be exponentially smaller in the best case. The cost we pay for the improvement in the complexity measure is the log-covering number of the goal space. For the example in Section 3.1, we have $|\mathcal{G}| = 2^d$; therefore, $\log N(\mathcal{G}, \zeta_{\mathrm{goal}}) \le H \log 2$, which is polynomial in the horizon $H$.
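To give a feel for the magnitudes involved, the following back-of-the-envelope Python sketch plugs the constants of Theorem 5.9 as stated above, together with the iteration bound $dH + 1$ from its proof, into the example of Section 3.1, where the GOAL-BE dimension is 1 and both $N(\mathcal{F}, \cdot)$ and $N(\mathcal{G}, \cdot)$ are at most $2^{H-2}$. The helper name `goalive_budget` and the specific choices of $H$, $\epsilon$, and $\delta$ are our own illustrative assumptions.

```python
import math

def goalive_budget(H, eps, delta, d_goal_be, log2_cover_F, log2_cover_G):
    """Parameter choices of Theorem 5.9 (as stated) and the resulting episode budget.

    d_goal_be    : GOAL-BE dimension dim_GOAL-BE(F, D_F, eps / H)
    log2_cover_F : log2 N(F, zeta_elim / 8)
    log2_cover_G : log2 N(G, zeta_goal)
    """
    iota = math.log2(2 ** log2_cover_G * H * d_goal_be / (delta * eps))
    n_act = 2592 * H**2 * iota / eps**2
    n_elim = 4608 * H**2 * d_goal_be * log2_cover_F * iota / eps**2
    # Algorithm 1 runs for at most d_goal_be * H + 1 iterations (Lemma C.2),
    # each costing n_act + n_elim episodes.
    return (d_goal_be * H + 1) * (n_act + n_elim)

# Example of Section 3.1: dim_GOAL-BE = 1 (Lemma 5.5) and |F| = |G| = 2^(H-2),
# so both log-covering numbers are at most H - 2.
for H in (20, 40, 80):
    episodes = goalive_budget(H, eps=0.1, delta=0.1, d_goal_be=1,
                              log2_cover_F=H - 2, log2_cover_G=H - 2)
    # The budget grows polynomially in H, whereas the single-goal model-free
    # lower bound of Lemma 3.2 grows like 2^H.
    print(f"H = {H:2d}: ~{episodes:.2e} episodes")
```

The constants are large, but the point of the calculation is the scaling: the episode budget is a fixed polynomial in $H$, in contrast to the $\Omega(2^H)$ lower bound for model-free algorithms without multiple goals in Lemma 3.2.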
Comparison with previous work on multitask RL. The setting we study can be viewed as a special case of multitask RL. The difference is that we are only concerned with the suboptimality with respect to the target task $g^\star$ rather than all tasks. Previous work has also developed sample-efficient algorithms for multitask RL: Lu et al. (2022) designed a model-free algorithm for multitask RL and demonstrated its sample efficiency. However, the complexity measure they used is the original Eluder dimension, which cannot exploit the connection between goals. Moreover, they considered tasks that are randomly generated, which differs from the empirical work on goal relabeling, where the algorithm can set the goal during the training phase. Therefore, their results cannot explain the success of goal relabeling.

Connection with reward-free exploration. Our algorithm is similar to reward-free exploration in the sense that the exploration strategy does not rely on the original reward. Existing results on reward-free exploration use model-based algorithms (Jin et al., 2020a; Wang et al., 2020). However, model-based algorithms tend to have a larger asymptotic bias. For complex problems, the dynamics cannot be learned perfectly, and the final policy can be highly suboptimal (Pong et al., 2018). Therefore, empirical work that uses goal relabeling often applies model-free algorithms, and its success cannot be explained by work that uses model-based algorithms. In fact, our algorithm can be viewed as a model-free algorithm for reward-free exploration.

Comparison with Zhu & Zhang (2023). In the context of goal-conditioned RL, previous research, including the notable work by Zhu & Zhang (2023), has provided a theoretical analysis. However, their analysis focuses on offline RL and does not design an algorithm for provably efficient exploration. Moreover, they assume that the dataset covers the optimal policy with respect to all $g \in \mathcal{G}$, whereas our analysis does not rely on such a stringent assumption.

Acknowledgement

Zhaoran Wang acknowledges National Science Foundation (Awards 2048075, 2008827, 2015568, 1934931), Simons Institute (Theory of Reinforcement Learning), Amazon, J.P. Morgan, and Two Sigma for their support.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Pieter Abbeel, O., and Zaremba, W. Hindsight experience replay. Advances in Neural Information Processing Systems, 30, 2017.

Cai, Q., Yang, Z., Jin, C., and Wang, Z. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019.

Cheng, Y., Feng, S., Yang, J., Zhang, H., and Liang, Y. Provable benefit of multitask representation learning in reinforcement learning. arXiv preprint arXiv:2206.05900, 2022.

Colas, C., Fournier, P., Chetouani, M., Sigaud, O., and Oudeyer, P.-Y. CURIOUS: Intrinsically motivated modular multi-goal reinforcement learning. In International Conference on Machine Learning, pp. 1331–1340. PMLR, 2019.

Du, S. S., Kakade, S. M., Lee, J. D., Lovett, S., Mahajan, G., Sun, W., and Wang, R. Bilinear classes: A structural framework for provable generalization in RL. arXiv preprint arXiv:2103.10897, 2021.

Fang, M., Zhou, T., Du, Y., Han, L., and Zhang, Z. Curriculum-guided hindsight experience replay. Advances in Neural Information Processing Systems, 32, 2019.

Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pp. 1704–1713. PMLR, 2017.

Jin, C., Krishnamurthy, A., Simchowitz, M., and Yu, T. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pp. 4870–4879. PMLR, 2020a.

Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137–2143. PMLR, 2020b.

Jin, C., Liu, Q., and Miryoosefi, S. Bellman Eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. Advances in Neural Information Processing Systems, 34:13406–13418, 2021a.

Jin, Y., Yang, Z., and Wang, Z. Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, pp. 5084–5096. PMLR, 2021b.

Li, A., Pinto, L., and Abbeel, P. Generalized hindsight for reinforcement learning. Advances in Neural Information Processing Systems, 33:7754–7767, 2020.
Liu, Z., Zhang, Y., Fu, Z., Yang, Z., and Wang, Z. Learning from demonstration: Provably efficient adversarial policy imitation with linear function approximation. In International Conference on Machine Learning, pp. 14094–14138. PMLR, 2022.

Liu, Z., Lu, M., Xiong, W., Zhong, H., Hu, H., Zhang, S., Zheng, S., Yang, Z., and Wang, Z. Maximize to explore: One objective function fusing estimation, planning, and exploration. Advances in Neural Information Processing Systems, 36, 2024.

Lu, R., Zhao, A., Du, S. S., and Huang, G. Provable general function class representation learning in multitask bandits and MDPs. arXiv preprint arXiv:2205.15701, 2022.

Nair, A. V., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. Visual reinforcement learning with imagined goals. Advances in Neural Information Processing Systems, 31, 2018.

Pitis, S., Chan, H., Zhao, S., Stadie, B., and Ba, J. Maximum entropy gain exploration for long horizon multi-goal reinforcement learning. In International Conference on Machine Learning, pp. 7750–7761. PMLR, 2020.

Pong, V., Gu, S., Dalal, M., and Levine, S. Temporal difference models: Model-free deep RL for model-based control. arXiv preprint arXiv:1802.09081, 2018.

Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing - solving sparse reward tasks from scratch. In International Conference on Machine Learning, pp. 4344–4353. PMLR, 2018.

Savinov, N., Raichuk, A., Marinier, R., Vincent, D., Pollefeys, M., Lillicrap, T., and Gelly, S. Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274, 2018.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pp. 2898–2933. PMLR, 2019.

Uehara, M., Zhang, X., and Sun, W. Representation learning for online and offline RL in low-rank MDPs. arXiv preprint arXiv:2110.04652, 2021.

Wang, R., Du, S. S., Yang, L., and Salakhutdinov, R. R. On reward-free reinforcement learning with linear function approximation. Advances in Neural Information Processing Systems, 33:17816–17826, 2020.

Zhang, T., Ren, T., Yang, M., Gonzalez, J., Schuurmans, D., and Dai, B. Making linear MDPs practical via contrastive representation learning. In International Conference on Machine Learning, pp. 26447–26466. PMLR, 2022.

Zhang, Y., Abbeel, P., and Pinto, L. Automatic curriculum learning through value disagreement. Advances in Neural Information Processing Systems, 33:7648–7659, 2020.

Zhu, H. and Zhang, A. Provably efficient offline goal-conditioned reinforcement learning with general function approximation and single-policy concentrability. arXiv preprint arXiv:2302.03770, 2023.

List of Notation

In the sequel, we present a list of notations used in the paper.

| Notation | Explanation |
|---|---|
| $\mathcal{S}, \mathcal{A}, \mathcal{G}$ | The state, action, and goal spaces, respectively. |
| $H$ | The length of an episode. |
| $t$ | The index of the iteration in Algorithm 1 (GOALIVE). |
| $\mathcal{E}$ | The average Bellman error defined in (2). |
| $\Phi$ | The $\mathcal{F}$-profile defined in Definition 3.1. |
| $\mathcal{F}$ | The hypothesis class used in model-free algorithms. |
| $g^\star$ | The target goal we are concerned with. |
| $\dim_{\mathrm{BE}}, \dim_{\mathrm{GOAL\text{-}BE}}$ | The original BE dimension and the GOAL-BE dimension. |
| $\{\mathcal{T}_h\}_{h=1}^H$ | The Bellman operators defined in Section 2. |
Structure of the Appendix

We provide a detailed proof of Lemma 3.2 in Appendix A, the proofs of the lemmas in Sections 4 and 5.1 in Appendix B, the proof of Theorem 5.9 in Appendix C, and the proofs of auxiliary lemmas in Appendix D.

Table of Contents
A. Proof of Lemma 3.2
B. Proof of Lemmas in Sections 5.1 and 4
C. Proof of Theorem 5.9
D. Proof of Lemmas in Appendix C

A. Proof of Lemma 3.2

In this section, we first provide the proof of Lemma 3.2. We then provide more detail on the action-value function $Q^p_h$.

Proof of Lemma 3.2. Our proof consists of two parts. In the first part, we show the inefficiency of model-free algorithms without multiple goals. In the second part, we show that there exists a goal-conditioned model-free algorithm that achieves provably efficient exploration.

[Figure 2: An example of the MDP construction in Section 3, with d = 2 and H = 4. All models are deterministic, and each model is uniquely indexed by a sequence of actions p. We use p⋆ to denote the index of the true model; here p⋆ = (2, 1). The last transition is designed such that the agent always lands in a state that contains a 0 unless it follows the path p⋆.]

Inefficiency of Model-free Algorithms without Multiple Goals. Our proof of the inefficiency of model-free algorithms here is similar to the proof of Theorem 2 of Sun et al. (2019). We construct another class of non-factored models such that (1) learning in this new class is intractable, and (2) the two families are indistinguishable to any model-free algorithm without multiple goals. The new model class is obtained by transforming each $P^p \in \mathcal{M}$ into $\widetilde{P}^p$. $\widetilde{P}^p$ has the same state space and transitions as $P^p$, except for the transition from horizon $H-1$ to $H$, which still does not depend on the action. In the transition indexed by $p = (p_1, p_2, \ldots, p_d)$, we transit to

$$s_H = \begin{cases} [1]^d & \text{when } s_{H-1} = p, \\ [0]^d & \text{otherwise.} \end{cases}$$

The reward function is the same as in the original model class. This example is similar to the one in Sun et al. (2019). Solving it is equivalent to solving a multi-armed bandit problem with one optimal arm among $2^{H-2}$ arms. Therefore, the sample complexity of any algorithm in the new example is $\Omega(2^H)$.

To prove that the two model families are indistinguishable for model-free algorithms without multiple goals, as defined in Definition 3.1, we show that the original $\mathcal{F}$-profiles, namely $\Phi_{\mathcal{F}, g^\star}$ in Definition 3.1, in $P^p$ are identical to those in $\widetilde{P}^p$. This implies that any model-free algorithm behaves in the same way in $P^p$ and $\widetilde{P}^p$, so that the sample complexity must be identical, and hence $\Omega(2^H)$. To continue, we introduce the definition of optimal planning, which is an important tool to connect these two model families.

Definition A.1 (Optimal Planning). Let $\mathcal{M}$ be a class of transitions. For a transition $P = \{P_h\}_{h=1}^H \in \mathcal{M}$, we denote by $\pi^P = \{\pi^P_h\}_{h=1}^H$, $V^P = \{V^P_h\}_{h=1}^H$, and $Q^P = \{Q^P_h\}_{h=1}^H$ the optimal policy, value function, and action-value function when the transition of the MDP is $P$. We denote by $\mathrm{OP}(P)$ the mapping that maps a transition $P$ to its optimal policy and optimal Q-function, that is, $\mathrm{OP}(P) := (\pi^P, Q^P)$. We then define $\mathrm{OP}(\mathcal{M}) := \{(\pi^P, Q^P) \mid P \in \mathcal{M}\}$.
Let M = {Pp}p {1,2}d and f M = { e Pp}p { 1,1}d. Let F = {Fh}H h=1, Π = {Πh}H h=1 to be the Q class and policy classes from OP(M, G), e F = { e Fh}H h=1 and eΠ = {eΠh}H h=1 be the policy class from OP( f M, G). Since all MDPs of interest have fully deterministic dynamics, and non-zero rewards only occur at the last step, it suffices to show that for any sequence of actions a = (a1, . . . , a H), (1) the final reward has the same distribution for P p and e P p, and (2) the original F-profiles [Qh(sh, a, g )]Qh Fh,a A and [Qh(sh, a, g )]Qh e Fh,a A are equivalent at all states generated by taking a in P p and e P p, respectively. The equivalence of the reward is obvious, In the following, we study the equivalence of the original F-profiles. In P p and at level H, by the definition of the reward function, the original F-profile is [1]|FH| for the state without 0 and [0]|FH| otherwise. Thus, when the action sequence a = p, the original F-profile of the reached state is [1]|Q|. Otherwise, the original F-profile of the reached state is [0]|Q|. Similarly, in e P p the original F-profile is [0]| e FH| if the state is [0]d, and it is [1]| e FH| otherwise. The equivalence here is obvious since | e FH| = |FH| = 2H 2. For level H 1, no matter the true model path p, the Qp associated with path p has value Qp H 1(a, +1, g ) = 1 {a = p } at state a. Hence the Q-profile at a can be represented as [1 {a = p }]p { 1,1}d, for both P p and e P p. We remark that the original F-profile does not depend on the true model p because all models agree on the dynamics before the last step. Similarly, for h < H 1 where each state has two actions { 1, 1}, we have: Qp (a1:h 1, 1, g ) = 1 {a1:h 1 1 = p 1:h} , Qp (a1:h 1, 2, g ) = 1 {a1:h 1 2 = p 1:h} . Hence, the original F-profile can be represented as: h 1 {a1:h 1 1 = p 1:h} , 1 {a1:h 1 2 = p 1:h} i which is the same between P p and e P p. Therefore, the model P p and e P p have exactly the same original F-profile for all action sequences, implying that any model-free algorithm without multiple goals, must have the same behavior on both transition classes. Since the transition class f M = { e P p}p admits an information-theoretic sample complexity lower bound of Ω(2H), the same lower bound applies to M = {P p}p for model-free algorithms without multiple goals. Sample Efficiency of Goal-conditioned Model-free Algorithm This can be directly proved by combining Lemma 5.5 and Theorem 5.9. By Theorem 5.9, when we use Algorithm GOALIVE, we can obtain a 4ϵ-optimal policy with at most 15000H3d2 0 log2(N(F, ζelim/8)) ι/ϵ2 episodes with probability 1 δ. Here ι = log2[N(G, ζgoal)Hd0/(δϵ)] and d0 = dim GOAL BE F, DF, ϵ/H . Since both F and G is finite, we have N(G, ζgoal) = N(F, ζelim/8) = 2H 2. By Lemma 5.5, we have d0 = 1. Therefore, the number of episodes is bounded from the above by 15000H7 log2[H/(δϵ)]/ϵ2, which is poly(H, 1/ϵ, log(1/δ)). Thus, we conclude the proof of Lemma 3.2. How Does Goal Relabeling Improve Sample Efficiency? B. Proof of Lemmas in 5.1 ans 4 B.1. Proof of Lemma 4.1 Proof of Lemma 4.1. Since we have QH(s, a, g) = 1s=g, for any s S, there exists a unique g G, such that QH(s, a, g) = 1. Since PH 1 is deterministic, there exists a unique g G, such that TH 1f H(s H 1, a H 1, g ) = 1g=g . Since F is deterministic, there exists a unique g G, such that f H 1(s H 1, a H 1, g ) = 1g=g . We also know that f H(s H, a H, g) = 1 if and only is g = s H. Therefore, we have f H 1(s H 1, a H 1, g ) r H 1(s H 1, a H 1, g ) max a A f H(s H, a , g ) {0, 1} for any g G. 
Therefore, Lemma 4.1 holds if we have f H 1(s H 1, a H 1, g ) r H 1(s H 1, a H 1, g ) max a A f H(s H, a , g ) = 1 when g = s H. In the following, we consider the case that f H 1(s H 1, a H 1, g ) r H 1(s H 1, a H 1, g ) max a A f H(s H, a , g ) = 0 when g = s H. In this case, we have f H 1(s H 1, a H 1, g ) = 0 when g = s H. Therefore, we have f H 1(s H 1, a H 1, g ) = 1g =s H, and f H 1(s H 1, a H 1, g) r H 1(s H 1, a H 1, g) max a A f H(s H, a , g) = 1g=s H 1g=s H = 0 for all g G. Therefore, we conclude the proof of Lemma 4.1. B.2. Proof of Lemma 5.4 Proof of Lemma 5.4. By the definition of Fh,g and Fh,g, we have Fh,g Fh,g. By Definition 2.2, we have dim DE((I Th)Fh,g , Πh, ϵ) dim DE((I Th)Fh,g, Πh, ϵ). By Definition 2.3, we have dim BE(Fg , Π, ϵ) dim BE(Fg, Π, ϵ), which finish the first part of the proof. In the following part of the proof, we show that dim GOAL BE(F, Π, ϵ) dim BE(Fg , Π, ϵ). It is sufficient to show that dim GOAL DE((I Th)F, Πh, ϵ) dim DE((I Th)Fg , Π, ϵ). We only need to show that if ν is goal-conditioned ϵindependent with {µ1, . . . , µn}, we then have ν is ϵ-independent with {µ1, . . . , µn}. By the definition of goal-conditioned ϵ-independence, we have |Es,a νfh(s, a, g ) rh(s, a, g ) Thfh+1(s, a, g )| ϵ and i=1 max g G |Es,a µifh(s, a, g) rh(s, a, g) Thfh+1(s, a, g)|2 ϵ for some f F. Therefore, we have v u u t i=1 |Es,a µifh(s, a, g ) rh(s, a, g ) Thfh+1(s, a, g )|2 ϵ. We conclude the proof by the definition of ϵ-independence. How Does Goal Relabeling Improve Sample Efficiency? B.3. Proof of Lemma 5.5 Proof of Lemma 5.5. By the definition of F, we know that fh(s, a, g) r h(s, a, g) Thfh+1(s, a, g) = 0 for any (f, h, s, a, g) F [H 2] S AG. Therefore, we have dim DE((I Th)Fh,g , DF,h, ϵ) = dim GOAL DE((I Th)Fh, DF,h, ϵ) = 1 for h [H 2]. Therefore, we only need to consider the case where h = H 1. For a hypothesis p {1, 2}d, we denote by f p = {f p h }H h=1 F the corresponding value. By the setting of the dynamic system, when we use the optimal policy with respect to g and f p, we will arrive the state s H 1 = p. Therefore, we have DF,H 1 = {δ(s,a) | s {1, 2}d, a A}. For any f p F, δ(s H 1,a) DF,H 1, we have E(s,a) µf p H 1(s, a, g ) = 0, r H 1(s, a, g ) = 0, E(s,a) µTH 1f p H(s, a, g ) = 0 when s H 1 = p and s H 1 = p. We also have E(s,a) µf p H 1(s, a, g ) = 1, r H 1(s, a, g ) = 0, E(s,a) µTH 1f p H(s, a, g ) = 0 when s H 1 = p and s H 1 = p. Since we have 2d hypotheses in total, we consider the sequence µ1 = δ(s1 H 1,a), . . . , µn = δ(sn H 1,a), such that n = 2d 1, µi = δ(p ,a) for i [n], µi = µj when i = j. We set p[i] = s1 H 1 and choose f i = {f i h} the corresponding value function. We can easily verify that E(s,a) µi[f j H 1(s, a, g ) r H 1(s, a, g ) TH 1f j H(s, a, g )] = 0 when i < j. Therefore, we have E(s,a) µi[f j H 1(s, a, g ) r H 1(s, a, g ) TH 1f j H(s, a, g )] 2 = 0. We also have E(s,a) µi[f i H 1(s, a, g ) r H 1(s, a, g ) TH 1f i H(s, a, g )] = 1. Therefore, µi is ϵ-independent with {µ1, , µi 1}. By the definition of distributional Eluder dimension, we have dim DE((I TH 1)FH 1,g , DF,H 1, ϵ) 2d 1 for ϵ < 1. For the second part of the proof, we first show that f H 1(s, a, g) r H 1(s, a, g) Es max a f H(s , a , g) < 1 if and only if f H 1 = Q H 1, f H = Q H. First, in the given hypothesis class, we have f H = Q H holds for all f F. When we select g = s H, we have f H 1(s, a, g) r H 1(s, a, g) Es max a f H(s , a , g) = |f H 1(s, a, s H) 1| . When f H 1 = QH 1, we have f H 1(s, a, s H) = 0, which contradict with |f H 1(s, a, s H) 1| < 1. 
Therefore, we have f H 1 = Q H 1. How Does Goal Relabeling Improve Sample Efficiency? If ν is goal-conditioned ϵ-independent with µ, we have max g G Es,a µ f H 1(s, a, g) r H 1(s, a, g) Es max a f H(s , a , g) < ϵ (7) Es,a ν f H 1(s, a, g ) r H 1(s, a, g ) Es max a f H(s , a , g ) > ϵ. (8) By (7), we have f H 1 = Q H 1, f H = Q H, which implies Es,a ν f H 1(s, a, g ) r H 1(s, a, g ) Es max a f H(s , a , g ) = 0, which contradicts with (8). Therefore, for any µ, ν DF,H 1, ν is not goal-conditioned ϵ-independent with µ. Therefore, we have dim GOAL DE((I TH 1)F, DF,H 1, ϵ) = 1, which concludes the proof. C. Proof of Theorem 5.9 Proof of Theorem 5.9. We denote by T the index of the last iteration of Algorithm 1. By Algorithm 1, we know that the elimination procedure is activated at the t-th iterations for t [T 1], and is not activated in the T-th iteration. We also know that the output policy is πT , which is the greedy policy with respect to f T . We use ht to denote the step that the elimination procedure is activated in the t-th iteration. The following lemma provides a characterization of {ft}T 1 t=1 and the elimination procedure. Lemma C.1. We have E(f t, πt, ht, g ) > ϵ/H for all t [T ] with probability 1 δ/8. We also have E(f, πt, ht) < ϵ/(H d) for all t [T ] when f Bt with probability 1 δ/8. Here E(f t, πt, ht, g ) and E(f, πt, ht) are defined in (2), and T = min{T 1, d H + 1}. Proof. See Appendix D.1 for a detailed proof. Lemma C.1 shows that, the average Bellman error of f t with respect to πt is large, and hypotheses with a large average Bellman error with respect to πt will be eliminated with high probability. We denote by E1 the event in Lemma C.1. The following lemma shows that T d H + 1 when E1 holds, which bounds the number of iterations in Algorithm 1. Lemma C.2. When we condition on E1, we have T d H + 1. Proof. See Appendix D.2 for a detailed proof. We also have the following lemma, which shows that the average Bellman error of f T is small. Lemma C.3. When T d H + 1, we have PH h=1 E(f T , πT , h, g ) < 4ϵ holds with probability 1 δ/4. Proof. See Appendix D.3 for a detailed proof. In addition, we have the following lemma, which show that the true hypothesis will not be eliminated with high probability. Lemma C.4. We define Q = {Q h}H h=1, where Q h is the optimal goal-conditioned action-value function we defined in 2. Then with probability 1 δ/4, we have Q Bt for all t [T ]. Here T = min{T 1, d H + 1}. Proof. See Appendix D.4 for a detailed proof. How Does Goal Relabeling Improve Sample Efficiency? We denote by E3 the event we defined in Lemma C.4. In the following part of the proof, we condition on Events E1, E2, and E3. In the following, we show that the suboptimality of πT is small when we condition on E1, E2, and E3. When we condition on Event E1, We have V 1 (s1, g ) < f T 1 (s1, a1, g ) by Line 4 of Algorithm 1. Therefore, we have V 1 (x1, g ) V πT 1 (x1, g ) max a f T 1 (s1, a, g ) V πT (s1, g ). (9) Since f T H+1(s, a, g) = 0 and V πT H+1(s, a, g) = 0 for all (s, a, g) S A G, we have max a f T 1 (s1, a, g ) V πT (s1, g ) = h=1 EπT f T h (sh, ah, g ) r h(sh, ah, g ) f T h+1(sh+1, ah+1, g ) h=1 EπT QπT h (sh, ah, g ) r h(sh, ah, g ) QπT h+1(sh+1, ah+1, g ) . (10) Here the expectation EπT [ ] is taken with respect to sh+1 P h( | sh, ah) and ah πT h ( | sh, g ). By Bellman equation, we have EπT QπT h (sh, ah, g ) r h(sh, ah, g ) QπT h+1(sh+1, ah+1, g ) = 0. (11) By (2), we have EπT f T h (sh, ah, g ) r h(sh, ah, g ) f T h+1(sh+1, ah+1, g ) = E(f T , πT , h, g ). 
Combining (9), (10), (11), and (12), we have
$$V^*_1(s_1, g^*) - V^{\pi^T}_1(s_1, g^*) \le \sum_{h=1}^{H} \mathcal{E}(f^T, \pi^T, h, g^*).$$
When we condition on $E_2$ in Lemma C.3, we have $\sum_{h=1}^{H} \mathcal{E}(f^T, \pi^T, h, g^*) \le 4\epsilon$. Therefore, when we condition on $E_1$, $E_2$, and $E_3$, the returned policy is $O(\epsilon)$-optimal. By Lemmas C.1, C.3, and C.4, we have $P(\cap_{i=1}^{3} E_i) \ge 1 - \delta$. Therefore, with probability at least $1 - \delta$, Algorithm 1 terminates within $T \le dH + 1$ iterations and outputs a $4\epsilon$-optimal policy using at most
$$(dH + 1)(n_{\mathrm{act}} + n_{\mathrm{elim}}) \le 15000\, H^3 d^2 \log^2\big(N(\mathcal{F}, \zeta_{\mathrm{elim}}/8)\big)\, \iota / \epsilon^2$$
episodes. Thus, we conclude the proof of Theorem 5.9.

D. Proof of Lemmas in Appendix C

D.1. Proof of Lemma C.1

Proof. First, we have the following lemma, which shows that in the activation phase, $\widehat{\mathcal{E}}$ in Line 6 of Algorithm 1 is a good estimator of $\mathcal{E}$ in (2).

Lemma D.1 (Concentration in the Activation Phase). With probability at least $1 - \delta/8$, we have $\big|\widehat{\mathcal{E}}(f^t, \pi^t, h, g^*) - \mathcal{E}(f^t, \pi^t, h, g^*)\big| \le \epsilon/(6H)$ for all $(t, h) \in [dH + 1] \times [H]$.

Proof. See Appendix D.5 for a detailed proof.

Since the elimination phase is activated for $t < T$, we have $\widehat{\mathcal{E}}(f^t, \pi^t, h^t, g^*) \ge \zeta_{\mathrm{act}}$. Therefore, when we condition on the event in Lemma D.1, we have $\mathcal{E}(f^t, \pi^t, h^t, g^*) \ge \zeta_{\mathrm{act}} - \epsilon/(6H) \ge \epsilon/H$ for $t \le \min\{T - 1, dH + 1\}$, which concludes the proof of the first part of Lemma C.1.

We also have the following lemma, which shows that in the elimination phase, $\widehat{\mathcal{E}}$ in Line 12 of Algorithm 1 is a good estimator of $\mathcal{E}$ in (2).

Lemma D.2 (Concentration in the Elimination Phase). With probability at least $1 - \delta/8$, we have $\big|\widehat{\mathcal{E}}(f, \pi^t, h) - \mathcal{E}(f, \pi^t, h)\big| \le \epsilon/(4H\sqrt{d})$ for all $(f, t, h) \in \mathcal{F} \times [dH + 1] \times [H]$. Here $\widehat{\mathcal{E}}$ is defined in Line 12 of Algorithm 1.

Proof. See Appendix D.6 for a detailed proof.

By the definition of $\mathcal{B}^t$ in Line 13 of Algorithm 1, we have $\widehat{\mathcal{E}}(f, \pi^t, h^t) \le \zeta_{\mathrm{elim}}$ for all $f \in \mathcal{B}^t$. Therefore, when we condition on the event in Lemma D.2, we have $\mathcal{E}(f, \pi^t, h^t) \le \zeta_{\mathrm{elim}} + \epsilon/(4H\sqrt{d}) \le \epsilon/(H\sqrt{d})$ for all $f \in \mathcal{B}^t$ and $t \le \min\{T - 1, dH + 1\}$, which concludes the proof of the second part of Lemma C.1.

D.2. Proof of Lemma C.2

Proof. We prove Lemma C.2 by contradiction. Suppose that $T > dH + 1$. Then there exist $h \in [H]$ and $t_1 < \cdots < t_{d+1} \le dH + 1$ such that the elimination phase is activated at step $h$ in the $t_i$-th iteration for all $i \in [d+1]$. When we condition on $E_1$ in Lemma C.1, we have $\mathcal{E}(f^{t_i}, \pi^{t_i}, h, g^*) > \epsilon/H$ for all $i \in [d+1]$. Since $f^{t_i} \in \mathcal{B}^{t_j}$ when $i > j$, we have $\mathcal{E}(f^{t_i}, \pi^{t_j}, h) \le \epsilon/(H\sqrt{d})$ when $i > j$. Therefore, we have
$$\sqrt{\sum_{j=1}^{i-1} \mathcal{E}(f^{t_i}, \pi^{t_j}, h)^2} \le \sqrt{d} \cdot \frac{\epsilon}{H\sqrt{d}} = \frac{\epsilon}{H}.$$
Therefore, the roll-in distributions of $\pi^{t_1}, \ldots, \pi^{t_{d+1}}$ at step $h$ form a goal-conditioned $(\epsilon/H)$-independent sequence of length $d + 1$, which contradicts the definition of the GOAL-BE dimension. Therefore, we have $T \le dH + 1$ when we condition on $E_1$, which concludes the proof of Lemma C.2.

D.3. Proof of Lemma C.3

Proof. Since the elimination phase is not activated in the $T$-th iteration, we have $\widehat{\mathcal{E}}(f^T, \pi^T, h, g^*) \le \zeta_{\mathrm{act}}$ for all $h \in [H]$. Therefore, when we condition on the event in Lemma D.1, we have $\mathcal{E}(f^T, \pi^T, h, g^*) \le \zeta_{\mathrm{act}} + \epsilon/(6H) \le 4\epsilon/H$. Therefore, we have $\sum_{h=1}^{H} \mathcal{E}(f^T, \pi^T, h, g^*) \le 4\epsilon$, which concludes the proof of Lemma C.3.

D.4. Proof of Lemma C.4

Proof. By the Bellman equation, we have $\mathcal{E}(Q^*, \pi, h) = 0$ for every policy $\pi$ and every $h \in [H]$. By Lemma D.2, we have $\widehat{\mathcal{E}}(Q^*, \pi^t, h) \le \epsilon/(4H\sqrt{d}) = \zeta_{\mathrm{elim}}$ for all $(t, h) \in [dH + 1] \times [H]$. By Line 13 of Algorithm 1, we have $Q^* \in \mathcal{B}^t$ for all $t \le dH + 1$. Thus, we conclude the proof of Lemma C.4.
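The aggregation step in the proof of Lemma C.2 (Appendix D.2 above) reduces to a short calculation, sketched below with illustrative toy sizes: $d$ predecessor errors of size at most $\epsilon/(H\sqrt{d})$ aggregate to at most $\epsilon/H$, while each hypothesis retains an error above $\epsilon/H$ on its own roll-in distribution, which is what forces the goal-conditioned $(\epsilon/H)$-independent sequence of length $d+1$.

```python
import math

d, H, eps = 16, 10, 0.25                 # toy sizes; d plays the role of dim GOAL-BE

# If the elimination phase fires d + 1 times at the same step h, each f^{t_i}
# has a large error on its own roll-in distribution (it triggered activation)
# and a small max-over-goals error on every earlier one (it was not eliminated).
own_error = eps / H + 1e-9               # E(f^{t_i}, pi^{t_i}, h, g*) > eps / H
prior_error = eps / (H * math.sqrt(d))   # E(f^{t_i}, pi^{t_j}, h) <= eps / (H sqrt(d)), j < i

# At most d predecessors, so the aggregated error is sqrt(d) * eps/(H sqrt(d)) = eps/H.
aggregate = math.sqrt(d * prior_error ** 2)
assert aggregate <= eps / H + 1e-12 and own_error > eps / H

# Hence the d + 1 roll-in distributions form a goal-conditioned (eps/H)-independent
# sequence, contradicting dim GOAL-BE = d; so at most d activations per step.
print("forced independent sequence length:", d + 1)
```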
D.5. Proof of Lemma D.1

Proof. Consider a fixed pair $(t, h) \in [dH + 1] \times [H]$. Since $\big|f_h(s, a, g^*) - r_h(s, a, g^*) - \max_{a' \in \mathcal{A}} f_{h+1}(s', a', g^*)\big| \le 3$, by Azuma-Hoeffding's inequality,
$$\big|\widehat{\mathcal{E}}(f^t, \pi^t, h, g^*) - \mathcal{E}(f^t, \pi^t, h, g^*)\big| \le 6\sqrt{\frac{2}{n_{\mathrm{act}}} \log\frac{8H(dH + 1)}{\delta}}$$
holds with probability at least $1 - \delta/\big(8H(dH + 1)\big)$. Here $\widehat{\mathcal{E}}$ is defined in Line 6 of Algorithm 1. By taking a union bound, the same inequality holds for all $(t, h) \in [dH + 1] \times [H]$ with probability at least $1 - \delta/8$. Since we set $n_{\mathrm{act}} = 2592\, H^2 \iota/\epsilon^2$, we have $\big|\widehat{\mathcal{E}}(f^t, \pi^t, h, g^*) - \mathcal{E}(f^t, \pi^t, h, g^*)\big| \le \epsilon/(6H)$. Thus, we conclude the proof of Lemma D.1.

D.6. Proof of Lemma D.2

Proof. Let $\mathcal{G}(\zeta_{\mathrm{goal}})$ be a $\zeta_{\mathrm{goal}}$-cover of $\mathcal{G}$ with cardinality $N(\mathcal{G}, \zeta_{\mathrm{goal}})$, and let $\mathcal{Z}$ be a $\zeta_{\mathrm{elim}}/8$-cover of $\mathcal{F}$ with cardinality $N(\mathcal{F}, \zeta_{\mathrm{elim}}/8)$. We define
$$\widehat{\mathcal{E}}\big(\hat{f}, \pi^t, h^t, \mathcal{G}(\zeta_{\mathrm{goal}})\big) = \max_{g \in \mathcal{G}(\zeta_{\mathrm{goal}})} \Big|\frac{1}{n_{\mathrm{elim}}} \sum_{(s, a, s') \in \mathcal{D}^t_h} \big[\hat{f}_h(s, a, g) - r_h(s, a, g) - \max_{a' \in \mathcal{A}} \hat{f}_{h+1}(s', a', g)\big]\Big|,$$
where $\mathcal{D}^t_h$ is the dataset collected in the elimination phase. By applying Azuma-Hoeffding's inequality to all $(t, \hat{f}, g) \in [dH + 1] \times \mathcal{Z} \times \mathcal{G}(\zeta_{\mathrm{goal}})$ and taking a union bound, we have
$$\big|\widehat{\mathcal{E}}(\hat{f}, \pi^t, h^t, g) - \mathcal{E}(\hat{f}, \pi^t, h^t, g)\big| \le 6\sqrt{\frac{2}{n_{\mathrm{elim}}} \log\frac{8H(dH + 1) N(\mathcal{G}, \zeta_{\mathrm{goal}}) N(\mathcal{F}, \zeta_{\mathrm{elim}}/8)}{\delta}} \quad (13)$$
for all $(t, \hat{f}, g) \in [dH + 1] \times \mathcal{Z} \times \mathcal{G}(\zeta_{\mathrm{goal}})$ with probability at least $1 - \delta/8$. By the definition of the goal cover and our choice of $\zeta_{\mathrm{goal}}$, we have
$$\big|\widehat{\mathcal{E}}(f, \pi^t, h^t) - \mathcal{E}(f, \pi^t, h^t)\big| \le 6\zeta_{\mathrm{goal}} + \big|\widehat{\mathcal{E}}\big(f, \pi^t, h^t, \mathcal{G}(\zeta_{\mathrm{goal}})\big) - \mathcal{E}\big(f, \pi^t, h^t, \mathcal{G}(\zeta_{\mathrm{goal}})\big)\big| \le \frac{\epsilon}{100 H\sqrt{d}} + \max_{g \in \mathcal{G}(\zeta_{\mathrm{goal}})} \big|\widehat{\mathcal{E}}(f, \pi^t, h^t, g) - \mathcal{E}(f, \pi^t, h^t, g)\big| \quad (14)$$
for all $f \in \mathcal{F}$. Combining (13) and (14) and using the choice of $n_{\mathrm{elim}}$, we have
$$\big|\widehat{\mathcal{E}}(\hat{f}, \pi^t, h^t) - \mathcal{E}(\hat{f}, \pi^t, h^t)\big| \le \frac{\epsilon}{100 H\sqrt{d}} + \frac{\epsilon}{9 H\sqrt{d}} \quad (15)$$
for all $(t, \hat{f}) \in [dH + 1] \times \mathcal{Z}$ with probability at least $1 - \delta/8$. We denote the event in (15) by $E_4$. By the definition of $\mathcal{E}$ in (2), we have
$$\big|\mathcal{E}(\hat{f}, \pi, h) - \mathcal{E}(f, \pi, h)\big| = \Big|\max_{g \in \mathcal{G}} \mathcal{E}(f, \pi, h, g) - \max_{g \in \mathcal{G}} \mathcal{E}(\hat{f}, \pi, h, g)\Big| \le \max_{g \in \mathcal{G}} \big|\mathcal{E}(f, \pi, h, g) - \mathcal{E}(\hat{f}, \pi, h, g)\big| \le 2\|f - \hat{f}\|_\infty.$$
Similarly, we have $\big|\widehat{\mathcal{E}}(\hat{f}, \pi, h) - \widehat{\mathcal{E}}(f, \pi, h)\big| \le 2\|f - \hat{f}\|_\infty$. For any $f \in \mathcal{F}$, we select $\hat{f} \in \mathcal{Z}$ with $\|f - \hat{f}\|_\infty \le \zeta_{\mathrm{elim}}/8$. When we condition on the event $E_4$ in (15), we have
$$\big|\widehat{\mathcal{E}}(f, \pi^t, h^t) - \mathcal{E}(f, \pi^t, h^t)\big| \le \big|\widehat{\mathcal{E}}(\hat{f}, \pi^t, h^t) - \mathcal{E}(\hat{f}, \pi^t, h^t)\big| + \big|\mathcal{E}(\hat{f}, \pi^t, h^t) - \mathcal{E}(f, \pi^t, h^t)\big| + \big|\widehat{\mathcal{E}}(\hat{f}, \pi^t, h^t) - \widehat{\mathcal{E}}(f, \pi^t, h^t)\big| \le \frac{\epsilon}{100 H\sqrt{d}} + \frac{\epsilon}{9 H\sqrt{d}} + \frac{\zeta_{\mathrm{elim}}}{2} \le \frac{\epsilon}{4 H\sqrt{d}}$$
for all $(t, f) \in [dH + 1] \times \mathcal{F}$ with probability at least $1 - \delta/8$. Thus, we complete the proof of Lemma D.2.
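As a sanity check on the constants in Lemma D.1, the following sketch (toy sizes, illustrative only) verifies that with residuals bounded by $3$ and $n_{\mathrm{act}} = 2592\,H^2\iota/\epsilon^2$ for $\iota = \log(8H(dH+1)/\delta)$, the Azuma-Hoeffding deviation $6\sqrt{2\iota/n_{\mathrm{act}}}$ is at most $\epsilon/(6H)$, and that an i.i.d. sample of bounded residuals indeed concentrates well within this bound.

```python
import math
import numpy as np

H, d, delta, eps = 2, 2, 0.1, 0.5        # toy sizes, illustrative only

iota = math.log(8 * H * (d * H + 1) / delta)         # union bound over (t, h) pairs
n_act = math.ceil(2592 * H**2 * iota / eps**2)
bound = 6.0 * math.sqrt(2.0 * iota / n_act)          # Azuma-Hoeffding deviation
assert bound <= eps / (6 * H) + 1e-12
print(f"n_act = {n_act}, deviation bound = {bound:.4f}, eps/(6H) = {eps / (6 * H):.4f}")

# Monte-Carlo sanity check: i.i.d. residuals bounded by 3 concentrate well within
# the bound (Hoeffding is a worst-case inequality, so the empirical gap is smaller).
rng = np.random.default_rng(0)
residuals = rng.uniform(-3.0, 3.0, size=n_act)       # mean zero by construction
print(f"empirical |mean| = {abs(residuals.mean()):.4f} <= {bound:.4f}")
```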