# How Does Goal Relabeling Improve Sample Efficiency?

Sirui Zheng¹, Chenjia Bai², Zhuoran Yang³, Zhaoran Wang¹

¹Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208, USA. ²Shanghai Artificial Intelligence Laboratory. ³Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA. Correspondence to: Sirui Zheng.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Hindsight experience replay and goal relabeling are successful in reinforcement learning (RL) since they enable agents to learn from failures. Despite their successes, we lack a theoretical understanding, such as (i) why hindsight experience replay improves sample efficiency and (ii) how to design a relabeling method that achieves sample efficiency. To this end, we construct an example to show the information-theoretic improvement in sample efficiency achieved by goal relabeling. Our example reveals that goal relabeling can enhance sample efficiency and exploit the rich information in observations through better hypothesis elimination. Based on these insights, we develop an RL algorithm called GOALIVE. To analyze the sample complexity of GOALIVE, we introduce a complexity measure, the goal-conditioned Bellman-Eluder (GOAL-BE) dimension, which characterizes the sample complexity of goal-conditioned RL problems. Compared to the Bellman-Eluder dimension, the goal-conditioned version offers an exponential improvement in the best case. To the best of our knowledge, our work provides the first characterization of the theoretical improvement in sample efficiency achieved by goal relabeling.

1. Introduction

Numerous RL problems encountered in real-world scenarios involve sparse rewards and vast state spaces (Silver et al., 2017; Savinov et al., 2018; Riedmiller et al., 2018), making them challenging to solve due to the lack of meaningful feedback. A widely used technique to address these challenges is hindsight experience replay and goal relabeling (Andrychowicz et al., 2017). Such algorithms typically employ a goal-dependent reward. In each iteration, the agent updates the value function with respect to a relabeled goal rather than the target goal. Numerous empirical studies have demonstrated the effectiveness of hindsight experience replay and goal relabeling in various scenarios (Andrychowicz et al., 2017; Pong et al., 2018; Fang et al., 2019; Colas et al., 2019; Li et al., 2020; Pitis et al., 2020; Zhang et al., 2020).

Despite its empirical success, the theoretical understanding of hindsight experience replay and goal relabeling is still elusive. Empirical studies have employed various methods for these techniques; however, these methods lack theoretical guarantees (Andrychowicz et al., 2017; Pong et al., 2018; Nair et al., 2018; Pitis et al., 2020). Additionally, the design of provably efficient algorithms incorporating goal relabeling remains unclear, as existing research on efficient exploration (Jiang et al., 2017; Cai et al., 2019; Jin et al., 2020b) does not incorporate goal relabeling. While prior research has produced sample-efficient algorithms for multitask RL and reward-free RL (Jin et al., 2020a; Lu et al., 2022; Cheng et al., 2022), which share similarities with goal-conditioned RL, these results do not fully capture the benefits of goal relabeling.
Consequently, we aim to enhance the theoretical understanding of goal relabeling by addressing the following questions. First, do hindsight experience replay and goal relabeling provably improve sample efficiency? Second, if the answer is positive, how does goal relabeling contribute to this improvement? Third, can we design a principled goal-conditioned algorithm that provably achieves better sample efficiency for problems with sparse rewards and large state spaces? Our contributions address these questions and can be summarized as follows.

For the first question, we provide an affirmative answer via a constructive proof. We show in Section 3 that there exist MDPs for which model-free algorithms using goal-conditioned value functions achieve polynomial sample complexity, whereas any model-free method given a value function class without multiple goals incurs sample complexity exponential in the horizon. In other words, using goal-conditioned value functions exponentially improves the sample complexity in the best case. To the best of our knowledge, this is the first result illustrating the information-theoretic improvement achieved by using goal-conditioned value functions in model-free algorithms.

For the second question, we show in Section 3.2 that goal relabeling allows the algorithm to better utilize the information in the observations. By incorporating goal-conditioned value functions into model-free algorithms, we can evaluate multiple Bellman errors with the rewards corresponding to multiple goals. Such a procedure allows us to eliminate more hypotheses using the value function of the same state-action pair with respect to different goals, which is more sample efficient.

For the third question, we propose a novel algorithm called GOAL-conditioned Optimism-Led Iterative Value-function Elimination (GOALIVE) in Section 4 based on the above intuition. By using goal relabeling, the agent can obtain rich feedback from the observations even when the reward with respect to the target goal is sparse. Additionally, we employ general function approximation to handle large state spaces. We show that the relabeling method employed in existing algorithms (Andrychowicz et al., 2017; Li et al., 2020; Pitis et al., 2020) can be regarded as a modification of our algorithm. To analyze the proposed algorithm, we introduce a new complexity measure called the GOAL-conditioned Bellman-Eluder (GOAL-BE) dimension in Section 5.1, which characterizes the complexity of goal-conditioned RL problems. We demonstrate that the sample complexity of GOALIVE can be upper-bounded in terms of the GOAL-BE dimension. We prove that the proposed complexity measure is not larger than the original Bellman-Eluder dimension, which shows that our definition is more general. We also show in Section 5 that the GOAL-BE dimension can be exponentially smaller than the original one in the best case. To the best of our knowledge, this is the first study to develop a provably efficient goal-conditioned algorithm and explain the improvement achieved by goal relabeling.

1.1. Related Work

Our work is closely related to the line of research on provably efficient exploration in the function approximation setting (Jiang et al., 2017; Jin et al., 2020b; Cai et al., 2019; Du et al., 2021; Uehara et al., 2021; Zhang et al., 2022; Liu et al., 2024).
This line of research typically considers exploration for a single task and is not readily applicable to scenarios involving multiple goals (Jiang et al., 2017; Jin et al., 2020b; Cai et al., 2019). Sun et al. (2019) observed that model-free algorithms can be inefficient since they fail to exploit the information in observations. Our work shows that using goal-conditioned value functions improves the sample efficiency of model-free algorithms. The study of efficient exploration in multitask settings (Lu et al., 2022; Cheng et al., 2022) is more closely related to our work, as it considers multiple goals. However, these studies cannot explain the benefits of using multitask methods for solving a single task. The research on reward-free exploration (Jin et al., 2020a; Wang et al., 2020) also aligns with our work, as both aim to develop exploration strategies that do not rely on the original reward. Nonetheless, these studies employ model-based algorithms, whereas model-free algorithms are more prevalent in empirical work.

Our work is inspired by the empirical findings on hindsight experience replay and goal relabeling (Andrychowicz et al., 2017; Pong et al., 2018; Fang et al., 2019; Colas et al., 2019; Li et al., 2020; Pitis et al., 2020; Zhang et al., 2020). We provide a theoretical explanation for the improvement in sample efficiency achieved through goal relabeling.

Notations. We define $[n] = \{1, \ldots, n\}$ when $n$ is an integer. For a set $\mathcal{F}$, we denote by $|\mathcal{F}|$ the cardinality of $\mathcal{F}$.

2. Preliminary

In this paper, we focus on goal-conditioned reinforcement learning, where the goal serves as a parameter of the reward function. For example, the reward function can be designed as an indicator function that takes the goal as input and assigns non-zero values only when the final state matches the specified goal. More specifically, we consider an episodic MDP $(\mathcal{S}, \mathcal{A}, \mathcal{G}, H, P^\star, r^\star)$ with a state space $\mathcal{S} \subseteq \mathbb{R}^d$, an action space $\mathcal{A}$, a goal space $\mathcal{G}$, a horizon $H$, transition kernels $P^\star = \{P^\star_h\}_{h=1}^H$, and known reward functions $r^\star = \{r^\star_h\}_{h=1}^H$. We assume that the reward functions are bounded and deterministic, that is, $r^\star_h \in [0, 1]$ for all $h \in [H]$. The agent iteratively interacts with the environment as follows. At the beginning of each episode, the agent determines a policy $\pi = \{\pi_h\}_{h=1}^H$, where $\pi_h \colon \mathcal{S} \times \mathcal{G} \to \Delta(\mathcal{A})$ for any $h \in [H]$. Without loss of generality, we assume that the initial state is fixed to $s_{\mathrm{init}} \in \mathcal{S}$ across all episodes. At the $h$-th step, the agent receives a state $s_h$ and takes an action $a_h \sim \pi_h(\cdot \mid s_h, g^\star)$, where $g^\star$ is the target goal. Subsequently, the agent receives a reward $r^\star_h(s_h, a_h, g^\star)$ and the next state $s_{h+1} \sim P^\star_{h+1}(\cdot \mid s_h, a_h)$. The episode ends after the agent receives the last state $s_{H+1}$. For a given policy $\pi = \{\pi_h\}_{h=1}^H$, we define the value function $V^\pi_h$ and the Q-function $Q^\pi_h$ for any $h \in [H]$ and $g \in \mathcal{G}$ as

$$V^\pi_h(s, g) := \mathbb{E}_{\pi, P^\star}\Big[\sum_{i=h}^{H} r^\star_i(s_i, a_i, g) \,\Big|\, s_h = s\Big], \qquad Q^\pi_h(s, a, g) := \mathbb{E}_{\pi, P^\star}\Big[\sum_{i=h}^{H} r^\star_i(s_i, a_i, g) \,\Big|\, s_h = s, a_h = a\Big]. \quad (1)$$

Here the expectation $\mathbb{E}_{\pi, P^\star}[\cdot]$ in (1) is taken with respect to $s_{i+1} \sim P^\star_i(\cdot \mid s_i, a_i)$ and $a_i \sim \pi_i(\cdot \mid s_i, g)$ for $i \in \{h, h+1, \ldots, H\}$. For convenience, we define $V^\pi_{H+1}(s, g) = 0$ for any state $s \in \mathcal{S}$ and policy $\pi$. We then define the goal-conditioned Bellman operator $\mathcal{T}_h$ by

$$\mathcal{T}_h f(s, a, g) := r^\star_h(s, a, g) + \mathbb{E}_{s' \sim P^\star_h}\big[\max_{a'} f(s', a', g)\big],$$

so that $Q^\star_h(s, a, g) = \mathcal{T}_h Q^\star_{h+1}(s, a, g)$. For simplicity, we define the expected total reward $J(\pi, g)$ as $J(\pi, g) = V^\pi_1(s_{\mathrm{init}}, g)$.

The objective of goal-conditioned RL is to find a policy $\pi$ that maximizes the expected total reward with regard to the target goal $g^\star$. Specifically, we define the optimal Q-function $Q^\star_h$ and the optimal value function $V^\star_h$ as $Q^\star_h(s, a, g) = \max_\pi Q^\pi_h(s, a, g)$ and $V^\star_h(s, g) = \max_\pi V^\pi_h(s, g)$ for any $(s, a, g) \in \mathcal{S} \times \mathcal{A} \times \mathcal{G}$. Correspondingly, we define the optimal goal-conditioned policy $\pi^\star$ as the policy that satisfies $Q^\star_h(s, a, g) = Q^{\pi^\star}_h(s, a, g)$ for any $(s, a, g) \in \mathcal{S} \times \mathcal{A} \times \mathcal{G}$.

In this paper, we consider RL with value function approximation. Formally, the learner is given a hypothesis class $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$, where $\mathcal{F}_h \subseteq (\mathcal{S} \times \mathcal{A} \times \mathcal{G} \mapsto [0, H - h + 1])$ is the set of approximators of $Q^\star_h$ defined in (1). We also define $f_{H+1} \equiv 0$ for convenience. For a hypothesis $f = \{f_h\}_{h=1}^H$, we define the goal-conditioned average Bellman error by

$$\mathcal{E}(f, \pi, h, g) := \mathbb{E}_\pi\big[f_h(s_h, a_h, g) - r^\star_h(s_h, a_h, g) - \max_{a' \in \mathcal{A}} f_{h+1}(s_{h+1}, a', g)\big], \qquad \mathcal{E}(f, \pi, h) := \max_{g \in \mathcal{G}} |\mathcal{E}(f, \pi, h, g)|. \quad (2)$$

By the Bellman equation, we have $\mathcal{E}(Q^\star, \pi, h) = 0$ for every policy $\pi$ and $h \in [H]$.

An important tool in our analysis is the Bellman-Eluder dimension, which characterizes the complexity of an RL problem when implementing model-free algorithms. In the following, we introduce its definition, beginning with the concept of $\epsilon$-independence between distributions.

Definition 2.1 ($\epsilon$-independence between distributions). Let $\mathcal{F}$ be a function class defined on $\mathcal{X}$, and $\nu, \mu_1, \ldots, \mu_n$ be probability measures over $\mathcal{X}$. We say $\nu$ is $\epsilon$-independent of $\{\mu_1, \mu_2, \ldots, \mu_n\}$ with respect to $\mathcal{F}$ if there exists $f \in \mathcal{F}$ such that

$$\sqrt{\sum_{i=1}^{n} \big(\mathbb{E}_{x \sim \mu_i}[f(x)]\big)^2} \le \epsilon, \quad \text{but} \quad \big|\mathbb{E}_{x \sim \nu}[f(x)]\big| > \epsilon.$$

Intuitively, $\nu$ being independent of $\{\mu_1, \ldots, \mu_n\}$ means that there exists a hypothesis $f$ whose loss is small on all of the distributions $\{\mu_i\}_{i=1}^n$ but rather large on $\nu$. Jin et al. (2021a) propose a dimension based on this notion of independence.

Definition 2.2 (Distributional Eluder dimension). Let $\mathcal{F}$ be a function class defined on $\mathcal{X}$, and $\Pi$ be a family of probability measures over $\mathcal{X}$. The distributional Eluder dimension $\dim_{\mathrm{DE}}(\mathcal{F}, \Pi, \epsilon)$ is the length of the longest sequence $\{\rho_1, \ldots, \rho_n\} \subseteq \Pi$ such that there exists $\epsilon' \ge \epsilon$ where $\rho_i$ is $\epsilon'$-independent of $\{\rho_1, \ldots, \rho_{i-1}\}$ for all $i \in [n]$.

Based on Definitions 2.1 and 2.2, Jin et al. (2021a) propose the BE dimension, which characterizes the complexity of an RL problem solved by a model-free algorithm with general function approximation in single-task settings.

Definition 2.3 (Bellman-Eluder (BE) dimension). Let $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$, where $\mathcal{F}_h \subseteq (\mathcal{S} \times \mathcal{A} \mapsto [0, H - h + 1])$. Let $(I - \mathcal{T}_h)\mathcal{F} := \{f_h - \mathcal{T}_h f_{h+1} : f \in \mathcal{F}\}$ be the set of Bellman residuals induced by $\mathcal{F}$ at step $h$, and $\mathcal{D} = \{\mathcal{D}_h\}_{h=1}^H$ be a collection of $H$ probability measure families over $\mathcal{S} \times \mathcal{A}$. The $\epsilon$-Bellman-Eluder dimension of $\mathcal{F}$ with respect to $\mathcal{D}$ is defined as

$$\dim_{\mathrm{BE}}(\mathcal{F}, \mathcal{D}, \epsilon) := \max_{h \in [H]} \dim_{\mathrm{DE}}\big((I - \mathcal{T}_h)\mathcal{F}, \mathcal{D}_h, \epsilon\big).$$

We remark that this definition is not directly applicable to goal-conditioned RL. The hypothesis class $\mathcal{F}$ in Definition 2.3 does not take the goal $g$ as input, so it remains unclear how to incorporate the goal space within the given definition.
3. Mitigating Information Loss with Goal-conditioned Value Functions

In this section, we present an example demonstrating that goal-conditioned model-free algorithms can exponentially improve sample efficiency compared with model-free algorithms without multiple goals. To the best of our knowledge, this is the first result illustrating the information-theoretic improvement achieved by using a goal-conditioned value function. We introduce the example in Section 3.1 and provide the analysis in Section 3.2.

3.1. Example

We consider an episodic MDP with horizon $H > 3$ and set $d = H - 2$. The state space is $\mathcal{S} = \{1, 2\}^d$, and the action space is $\mathcal{A} = \{1, 2\}$. Our transition class contains $2^d$ transitions, each of which is uniquely indexed by a path of length $d$, $p = (p_1, p_2, \ldots, p_d)$ with $p_i \in \{1, 2\}$. All the transitions are the same in the first $H - 1$ steps. For $h \le H - 2$, when we take action $a$ in the state $s_h = (s_{h,1}, \ldots, s_{h,h}, \ldots, s_{h,d})$ at the $h$-th step, we transit to $s_{h+1} = (s_{h,1}, \ldots, a, \ldots, s_{h,d})$, that is, the $h$-th coordinate of the state is replaced by $a$. The transition at the $(H-1)$-th step does not depend on the action $a_{H-1}$. In the transition indexed by $p = (p_1, p_2, \ldots, p_d)$, when we select action $a$ in $s_{H-1} = (s_{H-1,1}, \ldots, s_{H-1,d})$, we transit to $s_H = \big(\mathbf{1}\{s_{H-1,1} = p_1\}, \ldots, \mathbf{1}\{s_{H-1,d} = p_d\}\big)$. The goal space is $\mathcal{G} = \{0, 1\}^d$, and the goal-conditioned reward is $r^\star_h(s, a, g) = 0$ for $h \in [H-1]$ and $r^\star_H(s, a, g) = \mathbf{1}\{s = g\}$. We fix the initial state as $s_1 = [1]^d$. We want to obtain the optimal policy with respect to the reward $\{r^\star_h(s, a, g^\star)\}_{h=1}^H$, where $g^\star = [1]^d$. A graphical illustration of the setting is provided in Figure 1.

[Figure 1: An example of the MDP construction in Section 3, with d = 2 and H = 4. All models are deterministic, and each model is uniquely indexed by a sequence of actions p. We use p⋆ to denote the index of the true model; here p⋆ = (2, 1). The last transition is designed such that the agent always lands in a state that contains a 0 unless it follows the path p⋆.]

3.2. Sample Efficiency of Model-free Algorithms

In this subsection, we use the provided example to show that goal-conditioned model-free algorithms can be exponentially more sample-efficient than algorithms that do not incorporate goals. We first introduce the definition of model-free algorithms, then the value function class we use, and finally the analysis showing that goal-conditioned algorithms exponentially improve the sample efficiency in the given example.

Definition 3.1 (Goal-conditioned Model-free Algorithm). Given a function class $\mathcal{F} \colon (\mathcal{S} \times \mathcal{A} \times \mathcal{G}) \to \mathbb{R}$, define the original $\mathcal{F}$-profile $\Phi_{\mathcal{F}, g^\star} \colon \mathcal{S} \to \mathbb{R}^{|\mathcal{F}| \times |\mathcal{A}|}$ by $\Phi_{\mathcal{F}, g^\star}(s) := [f(s, a, g^\star)]_{f \in \mathcal{F}, a \in \mathcal{A}}$. An algorithm is model-free using $\mathcal{F}$ without multiple goals if it accesses $s$ exclusively through $\Phi_{\mathcal{F}, g^\star}(s)$ for all $s \in \mathcal{S}$ during its entire execution. We also define the goal-conditioned $\mathcal{F}$-profile $\Phi_{\mathcal{F}, \mathcal{G}} \colon \mathcal{S} \to \mathbb{R}^{|\mathcal{F}| \times |\mathcal{A}| \times |\mathcal{G}|}$ by $\Phi_{\mathcal{F}, \mathcal{G}}(s) := [f(s, a, g)]_{f \in \mathcal{F}, a \in \mathcal{A}, g \in \mathcal{G}}$. An algorithm is goal-conditioned model-free using $\mathcal{F}$ and $\mathcal{G}$ if it accesses $s$ exclusively through $\Phi_{\mathcal{F}, \mathcal{G}}(s)$ for all $s \in \mathcal{S}$ during its entire execution.

Definition 3.1 is a modification of Definition 1 in Sun et al. (2019). In the definition, $\mathcal{G}$ is the space of goals, and $f(s, a, g)$ is the action value of $(s, a)$ when we are interested in the reward indexed by the goal $g$. We modified Definition 1 in Sun et al. (2019) to incorporate goals.

Value Function Class. We consider solving this MDP with model-free algorithms. In the following, we introduce the value function class we employ. For a hypothesis $p = (p_1, \ldots, p_d) \in \{1, 2\}^d$, we define $Q^p_h(s, a, g)$ as the optimal action-value function of $(s, a)$ with respect to the transition $p$ and the reward indexed by $g$. It is obvious that $Q^p_H(s, a, g) = \mathbf{1}\{s = g\}$. For $h < H$, $Q^p_h(s, a, g)$ is defined as follows. For $s_h = (s_{h,1}, \ldots, s_{h,d})$ and $p = (p_1, \ldots, p_d)$, we define

$$f_{h, \tilde d}(s_h, a, p) = \begin{cases} \mathbf{1}\{s_{h, \tilde d} = p_{\tilde d}\} & \text{when } \tilde d < h, \\ \mathbf{1}\{a = p_{\tilde d}\} & \text{when } \tilde d = h, \end{cases}$$

for $\tilde d \le h$. By the definition of the setting, $f_{h, \tilde d}(s_h, a, p)$ is the $\tilde d$-th component of $s_H$ in the transition indexed by $p$ when we take action $a$ in the state $s_h$. Therefore, when we denote by $Q^p_h$ the optimal value function in the MDP with the transition indexed by $p$, we have

$$Q^p_h(s_h, a, g) = \prod_{\tilde d = 1}^{\min\{h, d\}} \mathbf{1}\big\{f_{h, \tilde d}(s_h, a, p) = g_{\tilde d}\big\},$$

where $g = (g_1, \ldots, g_d)$. Intuitively, $Q^p_h$ represents the feasibility of reaching the goal state $g$ from the state-action pair $(s_h, a)$. We then define $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$ and $\mathcal{F}_{g^\star} = \mathcal{F}_{g^\star, 1} \times \cdots \times \mathcal{F}_{g^\star, H}$, where

$$\mathcal{F}_h := \big\{Q^p_h(\cdot, \cdot, \cdot) \mid p \in \{1, 2\}^d\big\}, \qquad \mathcal{F}_{g^\star, h} := \big\{Q^p_h(\cdot, \cdot, g^\star) \mid p \in \{1, 2\}^d\big\}. \quad (3)$$

By definition, $\mathcal{F}$ contains goal-conditioned value functions, while $\mathcal{F}_{g^\star}$ only contains value functions with respect to $g^\star$. In the example above, we have the following lemma, which shows that goal-conditioned value functions exponentially improve the sample efficiency of model-free algorithms.

Lemma 3.2 (Exponential Improvement of Goal-conditioned Model-free Algorithms). Fix $\delta, \epsilon \in (0, 1]$. In the given example, any model-free algorithm without multiple goals that uses $\mathcal{F}_{g^\star}$ in (3) as the value function class and uses $o(2^H)$ trajectories outputs a policy $\widehat\pi$ with $V^{\widehat\pi}(s_1, g^\star) \le V^\star(s_1, g^\star) - 1/2$ with probability at least $1/3$. Meanwhile, there exists a goal-conditioned model-free algorithm using $\mathcal{F}$ in (3) as the value function class (Algorithm 1) that, with probability $1 - \delta$, outputs $\widehat\pi$ satisfying $V^{\widehat\pi}(s_1, g^\star) \ge V^\star(s_1, g^\star) - \epsilon$ using at most $\mathrm{poly}(H, 1/\epsilon, \log(1/\delta))$ trajectories for any transition in this family.

Proof. See Appendix A for a detailed proof.

Interpretation. Our observation is that goal-conditioned model-free algorithms are more effective at utilizing the rich information in the data than original model-free algorithms. In a goal-conditioned algorithm, we can obtain nonzero rewards even when the reward with respect to the target goal is sparse, since we can use the rewards with respect to different goals for hypothesis elimination. As a result, we can eliminate more hypotheses using Bellman errors with respect to different goals, which improves the sample efficiency. In Definition 3.1, a model-free algorithm without multiple goals accesses the state $s$ only through the values concerning the target goal $g^\star$. Therefore, it can only eliminate hypotheses using the Bellman error with respect to $g^\star$, which can be zero for most of the hypotheses when the reward is sparse. By using the rewards and value functions with respect to multiple goals, we obtain denser feedback, leading to more effective hypothesis elimination.

To illustrate the above idea, we revisit the example in Section 3.1. We consider the case where the true transition is $p^\star = (2, 1)$ and the target goal is $g^\star = (1, 1)$. For the trajectory $\tau = (s_1, a_1, \ldots, s_4, a_4, s_5)$, we calculate the Bellman error and provide it in Table 1. As shown in Table 1, since the reward with respect to the target goal $g^\star$ is sparse, the Bellman error of most of the hypotheses is zero if we only consider the target goal $g^\star$. Therefore, we can only eliminate one hypothesis using the Bellman error with respect to $g^\star$. Meanwhile, we can obtain nonzero rewards and eliminate all the wrong hypotheses using the rewards and Bellman errors with respect to $g = (1, 0)$.
Therefore, we can achieve an exponential improvement in sample efficiency using goal relabeling.

Rationale for Focusing Our Analysis on Model-Free Algorithms. In the above analysis, we focus on model-free algorithms as they are widely used in existing works that apply goal relabeling (Andrychowicz et al., 2017; Li et al., 2020; Pitis et al., 2020). Sun et al. (2019) show that model-based algorithms are more sample efficient as they leverage more supervision. However, model-based algorithms necessitate an additional, often complex, planning phase to derive an optimal policy. This phase, particularly in environments with complex dynamics, introduces substantial computational costs and a heightened risk of error accumulation. Errors in estimating the dynamics can magnify during the planning phase, potentially resulting in significantly suboptimal policies. In contrast, goal-conditioned model-free algorithms avoid the complexities of modeling the dynamics and subsequent planning, choosing instead to directly approximate the value function. This direct approach reduces the risk of error propagation inherent in planning, providing a potentially more robust solution in scenarios where accurately modeling the dynamics is challenging. Consequently, even though the goal-conditioned value function's theoretical complexity may match or surpass that of the dynamics, the practical advantages of operational efficiency, robustness, and applicability make goal-conditioned model-free methods a compelling choice in complex environments. Therefore, model-free algorithms are more popular in real-life applications, making it crucial to theoretically analyze the improvements achieved by goal relabeling in model-free algorithms.

4. Algorithm: Goal-conditioned Optimism-Led Iterative Value-function Elimination

In Section 3, we construct an example showing that using goal-conditioned value functions can exponentially improve the sample efficiency in the best case. It is natural to ask how to design an algorithm that provably achieves this improved sample efficiency. In Section 3.2 and Table 1, we show that goal-conditioned value functions improve the sample efficiency because they make the feedback denser and allow better hypothesis elimination. Based on this idea, we propose an algorithm called GOAL-conditioned Optimism-Led Iterative Value-function Elimination (GOALIVE), which is a modification of the algorithm OLIVE proposed in Jiang et al. (2017). At a high level, our algorithm eliminates hypotheses $Q_h \in \mathcal{F}_h$ using the average Bellman error and the data we have collected. After elimination, we choose the exploration policy $\pi^t$ by selecting the most optimistic hypothesis with respect to the target goal $g^\star$ and the corresponding greedy policy.

| Hypothesis | $g = (0,0)$ | $g = (0,1)$ | $g = (1,0)$ | $g = (1,1) = g^\star$ | Eliminated (all goals) | Eliminated (only $g^\star$) |
|---|---|---|---|---|---|---|
| $p = (1,1)$ | 1 | 0 | 1 | 0 | Yes | No |
| $p = (1,2)$ | 0 | 1 | 1 | 0 | Yes | No |
| $p = (2,1) = p^\star$ | 0 | 0 | 0 | 0 | No | No |
| $p = (2,2)$ | 0 | 0 | 1 | 1 | Yes | Yes |

Table 1. Comparison between using multiple goals and using only the target goal $g^\star$ for hypothesis elimination. The table presents the Bellman error $Q^p_3(s_3, a_3, g) - r^\star_3(s_3, a_3, g) - Q^p_4(s_4, a_4, g)$ for each hypothesis $p$ and goal $g$, where $s_3 = (2, 2)$ and $s_4 = (1, 0)$; the true hypothesis is $p^\star = (2, 1)$ and the target goal is $g^\star = (1, 1)$. The table demonstrates that all incorrect hypotheses can be eliminated after goal relabeling with the given data, whereas only one hypothesis can be eliminated using the Bellman error with respect to the target goal $g^\star$.
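The computation behind Table 1 can be reproduced in a few lines. The Python sketch below (ours, for illustration only) builds the hypothesis class $Q^p$ on the $d = 2$, $H = 4$ instance, evaluates the per-goal Bellman residuals on the observed transition $s_3 = (2,2) \to s_4 = (1,0)$, and reports which hypotheses each relabeling strategy eliminates.

```python
from itertools import product

d = 2
goals = list(product((0, 1), repeat=d))        # G = {0, 1}^d
hypotheses = list(product((1, 2), repeat=d))   # transition indices p
g_star, p_star = (1, 1), (2, 1)                # target goal and true model

def q_last(s, g):
    # Q^p_H(s, a, g) = 1{s = g}; it does not depend on a or p.
    return 1.0 if tuple(s) == tuple(g) else 0.0

def q_penultimate(s, a, p, g):
    # Q^p_{H-1}: at step H-1 the action is ignored, and under hypothesis p the
    # next state is (1{s_1 = p_1}, ..., 1{s_d = p_d}).
    predicted = tuple(1 if s[i] == p[i] else 0 for i in range(d))
    return 1.0 if predicted == tuple(g) else 0.0

# Observed transition from the trajectory of Table 1: s_3 = (2, 2), s_4 = (1, 0).
s3, a3, s4 = (2, 2), 1, (1, 0)
r3 = 0.0  # rewards are zero before the last step

eliminated_all, eliminated_target = set(), set()
for p in hypotheses:
    for g in goals:
        residual = q_penultimate(s3, a3, p, g) - r3 - q_last(s4, g)
        if abs(residual) > 0.5:              # nonzero Bellman error: p is inconsistent
            eliminated_all.add(p)
            if g == g_star:
                eliminated_target.add(p)

print("eliminated with all goals :", sorted(eliminated_all))      # every p except (2, 1)
print("eliminated with only g*   :", sorted(eliminated_target))   # only (2, 2)
```

Running the sketch recovers exactly the last two columns of Table 1: relabeling with every goal rules out all three wrong hypotheses from a single transition, while the target goal alone rules out only one.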
Algorithm 1: GOAL-conditioned Optimism-Led Iterative Value-function Elimination (GOALIVE)

1: Input: hypothesis class $\mathcal{F}$, elimination thresholds $\zeta_{\mathrm{act}}$ and $\zeta_{\mathrm{elim}}$, numbers of episodes $n_{\mathrm{act}}$ and $n_{\mathrm{elim}}$.
2: Initialize: $\mathcal{B}^0 \leftarrow \mathcal{F}$, $\mathcal{D}^t_h \leftarrow \emptyset$ for all $h$ and $t$.
3: for iteration $t = 1, 2, \ldots$ do
4: Choose policy $\pi^t = \pi_{f^t}$, where $f^t = \mathrm{argmax}_{f \in \mathcal{B}^{t-1}} f(s_1, \pi_f(s_1), g^\star)$.
5: Execute $\pi^t$ for $n_{\mathrm{act}}$ episodes and update $\mathcal{D}^t_h$ to include the new $(s_h, a_h, s_{h+1})$ tuples.
6: Estimate $\widehat{\mathcal{E}}(f^t, \pi^t, h, g^\star)$ for all $h \in [H]$, where $\widehat{\mathcal{E}}(f, \pi^t, h, g) := \frac{1}{|\mathcal{D}^t_h|} \sum_{(s, a, s') \in \mathcal{D}^t_h} \ell_1[s, a, s', g, f_h, r_h, f_{h+1}]$ and $\ell_1$ is defined in (4).
7: if $\sum_{h=1}^H \widehat{\mathcal{E}}(f^t, \pi^t, h, g^\star) \le H \zeta_{\mathrm{act}}$ then
8: Terminate and output $\pi^t$.
9: end if
10: Pick any $h \in [H]$ for which $\widehat{\mathcal{E}}(f^t, \pi^t, h, g^\star) \ge \zeta_{\mathrm{act}}$ and set $\mathcal{D}^t_h = \emptyset$.
11: Execute $\pi^t$ for $n_{\mathrm{elim}}$ episodes and update $\mathcal{D}^t_h$ to include the new $(s_h, a_h, s_{h+1})$ tuples.
12: Estimate $\widehat{\mathcal{E}}(f, \pi^t, h)$ for all $f \in \mathcal{F}$, where $\widehat{\mathcal{E}}(f, \pi^t, h) := \max_{g \in \mathcal{G}} \big| \frac{1}{|\mathcal{D}^t_h|} \sum_{(s, a, s') \in \mathcal{D}^t_h} \ell_1[s, a, s', g, f_h, r_h, f_{h+1}] \big|$.
13: Update $\mathcal{B}^t = \{ f \in \mathcal{B}^{t-1} : \widehat{\mathcal{E}}(f, \pi^t, h) \le \zeta_{\mathrm{elim}} \}$.
14: end for

The pseudocode of GOALIVE is given in Algorithm 1. In each iteration, our algorithm performs three main steps.

Optimistic planning: we compute the most optimistic hypothesis $f^t$ with regard to the target goal $g^\star$ and choose $\pi^t$ to be its greedy policy (Line 4). We then use $\pi^t$ to interact with the environment (Line 5).

Computing the Bellman error: we evaluate the average Bellman error of $f^t$ under $\pi^t$ (Line 6). We activate the elimination phase if the Bellman error is large, and otherwise return $\pi^t$ (Lines 7-8).

Eliminating hypotheses with large Bellman error: we pick a step $h \in [H]$ at which the estimated Bellman error exceeds the activation threshold $\zeta_{\mathrm{act}}$ (Line 10), and eliminate all hypotheses in the candidate set whose Bellman error at step $h$ exceeds the elimination threshold $\zeta_{\mathrm{elim}}$ (Line 13).

We highlight that we evaluate the average Bellman error with regard to all the goals in the goal space, instead of only $g^\star$, which allows better hypothesis elimination.

Connection with Existing Algorithms. The main difference between our algorithm and the original OLIVE proposed in Jiang et al. (2017) is that we use goal relabeling when evaluating the Bellman error. More specifically, for hypothesis $f^t$, step $h$, and policy $\pi^t$, we use

$$g^t_h = \operatorname*{argmax}_{g \in \mathcal{G}} \Big| \sum_{(s, a, s') \in \mathcal{D}_h} \ell_1[s, a, s', g, f_h, r_h, f_{h+1}] \Big|, \quad \text{where} \quad \ell_1[s, a, s', g, f_h, r_h, f_{h+1}] = f_h(s, a, g) - r_h(s, a, g) - \max_{a' \in \mathcal{A}} f_{h+1}(s', a', g), \quad (4)$$

for hypothesis elimination, instead of only using the target goal $g^\star$. Intuitively, goal relabeling allows us to eliminate more hypotheses in the elimination phase and improves the sample efficiency compared with algorithms that do not use multiple goals. The idea of using the Bellman error for goal relabeling has also appeared in previous work (Zhang et al., 2020); our analysis provides a theoretical justification for this relabeling method. We remark that we use the average Bellman error to address stochastic dynamics. In general, Equation (4) is difficult to compute, as it requires iterating over the goal space. Existing algorithms (Andrychowicz et al., 2017; Li et al., 2020; Pitis et al., 2020) use the state space as the goal space: they view the states in the data as achieved goals and minimize the Bellman error with respect to the achieved goals instead of all possible goals. However, this sacrifices sample efficiency, since evaluating the Bellman error on multiple goals could eliminate more hypotheses. Therefore, their algorithms can be viewed as modifications of Algorithm 1 that favor computational efficiency over sample efficiency.
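The following Python sketch (ours, with illustrative helper names) contrasts the relabeling rule of Equation (4), which searches the whole goal space for the goal with the largest empirical Bellman residual, with the achieved-goal heuristic used by the HER-style methods cited above; `achieved_goal` is an assumed mapping from states to goals, and `f` is an assumed mapping from step indices to value functions with `f[H + 1]` the zero function.

```python
import numpy as np

def relabeled_goal_eq4(f, h, transitions, goals, reward, actions):
    """Goal relabeling in the spirit of Eq. (4): return the goal with the largest
    |empirical average Bellman residual| of hypothesis f at step h on the data."""
    def abs_avg_residual(g):
        vals = [f[h](s, a, g) - reward[h](s, a, g)
                - max(f[h + 1](s2, a2, g) for a2 in actions)
                for (s, a, s2) in transitions]
        return abs(float(np.mean(vals)))
    return max(goals, key=abs_avg_residual)

def relabeled_goals_her_style(transitions, achieved_goal):
    """HER-style relabeling (Andrychowicz et al., 2017, sketch): reuse the goals
    that were actually achieved in the data instead of searching over all of G."""
    return {achieved_goal(s2) for (_, _, s2) in transitions}
```

The first rule maximizes the elimination power of a batch at the price of a search over the goal space; the second is cheap to compute but may miss the most informative goal.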
We further note that, in some special cases, the achieved goal is the same as the goal that maximizes the Bellman error, which is formalized in the following lemma.

Lemma 4.1 (Equivalence of Relabeling Methods). We say a hypothesis class is deterministic if, for any $(f, s, a, g) \in \mathcal{F} \times \mathcal{S} \times \mathcal{A} \times \mathcal{G}$, there exists a unique $s' \in \mathcal{S}$ such that $f_h(s, a, g) = r^\star_h(s, a, g) + \max_{a' \in \mathcal{A}} f_{h+1}(s', a', g)$. Suppose that $r^\star_H(s, a, g) = \mathbf{1}\{s = g\}$ and $r^\star_{H-1}(s, a, g) = 0$ for all $(s, a, g) \in \mathcal{S} \times \mathcal{A} \times \mathcal{G}$. Then for a trajectory $(s_1, a_1, \ldots, a_H, s_{H+1})$, we have

$$s_H \in \operatorname*{argmax}_{g \in \mathcal{G}} \ell_2\big[s_{H-1}, a_{H-1}, g, f_{H-1}, r^\star_{H-1}, f_H, P^\star_{H-1}\big], \quad \text{where} \quad \ell_2[s, a, g, f, r, f', P] = f(s, a, g) - r(s, a, g) - \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[\max_{a' \in \mathcal{A}} f'(s', a', g)\big],$$

for any $f \in \mathcal{F}$, provided that $\mathcal{F}$ and $P^\star_{H-1}$ are deterministic. That is, the achieved goal is a goal that maximizes the Bellman error.

Proof. See Appendix B.1 for a detailed proof.

Lemma 4.1 shows that, in the example in Section 3, the relabeling method in Algorithm 1 is the same as the relabeling method in Andrychowicz et al. (2017), which theoretically justifies their method. This equivalence also suggests that the empirical performance of existing relabeling methods on standard goal-conditioned RL benchmarks can serve as an indirect measure of our algorithm's efficacy.

Repeated sampling with the same policy. In Algorithm 1, we sample multiple trajectories using the policy $\pi^t$. This procedure differs from GOLF proposed by Jin et al. (2021a), and we only obtain an upper bound on the sample complexity instead of a regret bound. However, this procedure is necessary since the maximum operator cannot be interchanged with the expectation operator. More specifically, the loss we want to evaluate is

$$\max_{g \in \mathcal{G}} \big| \mathbb{E}_{(s, a, s') \sim \mu}\big[\ell_1(s, a, s', g, Q_h, r_h, Q_{h+1})\big] \big|,$$

where $\ell_1$ is defined in (4), whereas with only one trajectory per policy we can only evaluate

$$\mathbb{E}_{(s, a, s') \sim \mu}\Big[\max_{g \in \mathcal{G}} \big|\ell_1(s, a, s', g, Q_h, r_h, Q_{h+1})\big|\Big].$$

The latter can be large even for the true hypothesis, and hence cannot be used for hypothesis elimination.
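The gap between these two quantities is easy to see numerically. The toy computation below is our own construction (not from the paper): a single stochastic transition with two equally likely next states and two goals. For the true Q-function, the maximum over goals of the averaged residual is close to zero, while averaging the per-sample maximum gives 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
goals = ("A", "B")

# Toy model: from (s0, a) the next state is "A" or "B" with probability 1/2 each,
# the step-h reward is 0, and the step-(h+1) value is 1{s' = g}.
def q_h(g):            # true Q_h(s0, a, g) = P(s' = g) = 1/2
    return 0.5
def q_hp1(s_next, g):  # true Q_{h+1}(s', a', g), independent of a'
    return 1.0 if s_next == g else 0.0

samples = rng.choice(goals, size=10_000)  # sampled next states s'
residual = lambda s_next, g: q_h(g) - 0.0 - q_hp1(s_next, g)

# Max over goals OUTSIDE the expectation (what GOALIVE estimates with n_elim rollouts):
outside = max(abs(np.mean([residual(s, g) for s in samples])) for g in goals)
# Max over goals INSIDE the expectation (what single-trajectory relabeling would give):
inside = np.mean([max(abs(residual(s, g)) for g in goals) for s in samples])

print(f"max_g |E[residual]|  ~ {outside:.3f}")  # close to 0: the true hypothesis passes
print(f"E[max_g |residual|]  ~ {inside:.3f}")   # 0.500: the true hypothesis looks wrong
```

This is why GOALIVE re-executes the same policy for $n_{\mathrm{elim}}$ episodes before taking the maximum over goals.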
5. Analysis of GOALIVE

In this section, we present the analysis of GOALIVE. We first introduce the complexity measure we use and then present the sample complexity of GOALIVE.

5.1. Complexity Measure: Goal-conditioned Bellman-Eluder Dimension

In this section, we propose a new complexity measure, the goal-conditioned Bellman-Eluder dimension, which quantifies the gain of using goals in model-free algorithms. To enable general function approximation, we modify the Bellman-Eluder dimension, which was proposed by Jin et al. (2021a) to characterize the complexity of an RL problem with general function approximation. We start by developing a goal-conditioned version of the distributional Eluder dimension.

Definition 5.1 (Goal-conditioned $\epsilon$-independence between Distributions). Let $\mathcal{F}$ be a function class defined on $\mathcal{X} \times \mathcal{G}$, and $\nu, \mu_1, \ldots, \mu_n$ be probability measures over $\mathcal{X}$. We say $\nu$ is goal-conditioned $\epsilon$-independent of $\{\mu_1, \mu_2, \ldots, \mu_n\}$ with respect to $\mathcal{F}$ and $g^\star$ if there exists $f \in \mathcal{F}$ such that

$$\sqrt{\sum_{i=1}^{n} \Big(\max_{g \in \mathcal{G}} \big|\mathbb{E}_{x \sim \mu_i}[f(x, g)]\big|\Big)^2} \le \epsilon, \quad \text{but} \quad \big|\mathbb{E}_{x \sim \nu}[f(x, g^\star)]\big| > \epsilon.$$

Definition 5.2 (Goal-conditioned Distributional Eluder Dimension). Let $\mathcal{F}$ be a function class defined on $\mathcal{X} \times \mathcal{G}$, and $\Pi$ be a family of probability measures over $\mathcal{X}$. The goal-conditioned distributional Eluder dimension $\dim_{\mathrm{GOAL\text{-}DE}}(\mathcal{F}, \Pi, \epsilon)$ is the length of the longest sequence $\{\rho_1, \ldots, \rho_n\} \subseteq \Pi$ such that there exists $\epsilon' \ge \epsilon$ where $\rho_i$ is goal-conditioned $\epsilon'$-independent of $\{\rho_1, \ldots, \rho_{i-1}\}$ for all $i \in [n]$.

When the goal space contains only the goal we care about, that is, $\mathcal{G} = \{g^\star\}$, Definition 5.2 degenerates to the original definition of the distributional Eluder dimension in Jin et al. (2021a). In Definition 5.1, we can view $\mathbb{E}_{x \sim \mu_i}[f(x, g)]$ as the loss of hypothesis $f$ with respect to the goal $g$, evaluated on the distribution $\mu_i$. Intuitively, $\nu$ being goal-conditioned independent of $\{\mu_1, \ldots, \mu_n\}$ means that there exists a hypothesis $f$ whose loss is small with respect to all $g \in \mathcal{G}$ on all of the $\{\mu_i\}_{i=1}^n$, but whose loss is considerably larger on $\nu$ for the target goal $g^\star$. Since we can choose any goal for hypothesis elimination, the loss of a hypothesis that has not been eliminated should be small with respect to all $g \in \mathcal{G}$. Therefore, we incorporate multiple goals in Definition 5.1 by taking a maximum over $g \in \mathcal{G}$. Equipped with Definitions 5.1 and 5.2, we are ready to present the definition of the goal-conditioned Bellman-Eluder (GOAL-BE) dimension.

Definition 5.3 (GOAL-conditioned Bellman-Eluder (GOAL-BE) Dimension). Let $(I - \mathcal{T}_h)\mathcal{F} := \{f_h - \mathcal{T}_h f_{h+1} : f \in \mathcal{F}\}$ be the set of Bellman residuals induced by $\mathcal{F}$ at step $h$, and $\Pi = \{\Pi_h\}_{h=1}^H$ be a collection of $H$ probability measure families over $\mathcal{S} \times \mathcal{A}$. The $\epsilon$-goal-conditioned Bellman-Eluder dimension of $\mathcal{F}$ with respect to $\Pi$ is defined as

$$\dim_{\mathrm{GOAL\text{-}BE}}(\mathcal{F}, \Pi, \epsilon) := \max_{h \in [H]} \dim_{\mathrm{GOAL\text{-}DE}}\big((I - \mathcal{T}_h)\mathcal{F}, \Pi_h, \epsilon\big).$$

Similar to Jin et al. (2021a), the GOAL-BE dimension also depends on the choice of the distribution class $\Pi$. To simplify the presentation, we only consider $\mathcal{D}_{\mathcal{F}} := \{\mathcal{D}_{\mathcal{F}, h}\}_{h=1}^H$, where $\mathcal{D}_{\mathcal{F}, h}$ denotes the set of all probability measures over $\mathcal{S} \times \mathcal{A}$ at the $h$-th step that can be generated by executing the greedy policy $\pi_f$ with regard to the target goal $g^\star$ induced by some $f \in \mathcal{F}$, i.e., $\pi_{f, h}(\cdot) = \mathrm{argmax}_{a \in \mathcal{A}} f_h(\cdot, a, g^\star)$ for all $h \in [H]$.

Comparison with the Original Bellman-Eluder Dimension. Definition 5.3 can be viewed as a modification of the original Bellman-Eluder dimension proposed by Jin et al. (2021a). In fact, Definition 5.3 is more general: it reduces to the original definition of the Bellman-Eluder dimension when $\mathcal{G} = \{g^\star\}$. Intuitively, since we have multiple losses for the same hypothesis, we are able to eliminate more hypotheses with the same data, which improves the sample efficiency and reduces the complexity of the original problem. This intuition is formalized in the following two lemmas. Lemma 5.4 shows that the GOAL-BE dimension is not larger than the original BE dimension. Lemma 5.5 shows that, in the example in Section 3, the GOAL-BE dimension is exponentially smaller than the original BE dimension. Together, Lemmas 5.4 and 5.5 show that using goal-conditioned value functions provably reduces the complexity of an RL problem.

Lemma 5.4 (Strict Improvement over the Original Bellman-Eluder Dimension). We define $\mathcal{F}_{g^\star} = \mathcal{F}_{g^\star, 1} \times \cdots \times \mathcal{F}_{g^\star, H}$ and $\mathcal{F}_{\mathcal{G}} = \mathcal{F}_{\mathcal{G}, 1} \times \cdots \times \mathcal{F}_{\mathcal{G}, H}$, where $\mathcal{F}_{g^\star, h} = \{Q_h(\cdot, \cdot, g^\star) \mid Q \in \mathcal{F}_h\}$ and $\mathcal{F}_{\mathcal{G}, h} = \{Q_h(\cdot, \cdot, g) \mid Q \in \mathcal{F}_h, g \in \mathcal{G}\}$. We then have

$$\dim_{\mathrm{GOAL\text{-}BE}}(\mathcal{F}, \Pi, \epsilon) \le \dim_{\mathrm{BE}}(\mathcal{F}_{g^\star}, \Pi, \epsilon) \le \dim_{\mathrm{BE}}(\mathcal{F}_{\mathcal{G}}, \Pi, \epsilon). \quad (5)$$

Proof. See Appendix B.2 for a detailed proof.

Lemma 5.5 (Exponential Improvement in the Best Case). In the example in Section 3, we have

$$\dim_{\mathrm{BE}}(\mathcal{F}_{g^\star}, \mathcal{D}_{\mathcal{F}}, \epsilon) \ge 2^d - 1, \qquad \dim_{\mathrm{GOAL\text{-}BE}}(\mathcal{F}, \mathcal{D}_{\mathcal{F}}, \epsilon) = 1,$$

where $\mathcal{F}_{g^\star} = \mathcal{F}_{g^\star, 1} \times \cdots \times \mathcal{F}_{g^\star, H}$ and $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$, with $\mathcal{F}_{g^\star, h}$ and $\mathcal{F}_h$ defined in (3).

Proof. See Appendix B.3 for a detailed proof.

5.2. Sample Complexity of GOALIVE

We first present the realizability assumption on the hypothesis class $\mathcal{F}$, which is commonly used in previous work (Jin et al., 2021a; Sun et al., 2019).

Assumption 5.6 (Realizability). We assume that $Q^\star_h \in \mathcal{F}_h$ for all $h \in [H]$.
Realizability requires that the hypothesis class is well-specified and is able to characterize the connection between different goals. We also introduce the definition of the covering number, which is widely used in previous work (Jin et al., 2021a;b; Liu et al., 2022).

Definition 5.7 (Covering Number of the Hypothesis Class). The $\epsilon$-covering number of a set $\mathcal{V}$ under a metric $\rho$, denoted $N(\mathcal{V}, \epsilon, \rho)$, is the minimum integer $n$ such that there exists a subset $\mathcal{V}_0 \subseteq \mathcal{V}$ with $|\mathcal{V}_0| = n$ such that for any $x \in \mathcal{V}$ there exists $y \in \mathcal{V}_0$ with $\rho(x, y) \le \epsilon$. For two hypotheses $f = \{f_h\}_{h=1}^H$ and $f' = \{f'_h\}_{h=1}^H$, we define

$$\|f - f'\|_\infty = \max_{(h, s, a, g) \in [H] \times \mathcal{S} \times \mathcal{A} \times \mathcal{G}} |f_h(s, a, g) - f'_h(s, a, g)|.$$

For $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$, we define $N(\mathcal{F}, \epsilon) = \max_{h \in [H]} N(\mathcal{F}_h, \epsilon, \|\cdot\|_\infty)$.

Definition 5.8 (Covering Number of the Goal Space). We say $\mathcal{G}_0$ is an $\epsilon$-covering of the goal space $\mathcal{G}$ if for any $g \in \mathcal{G}$ there exists $g_0 \in \mathcal{G}_0$ such that

$$|r_h(s, a, g) - r_h(s, a, g_0)| \le \epsilon \quad \text{and} \quad |f_h(s, a, g) - f_h(s, a, g_0)| \le \epsilon$$

hold for all $f \in \mathcal{F}$, $h \in [H]$, and $(s, a) \in \mathcal{S} \times \mathcal{A}$. Let $\mathcal{G}(\epsilon)$ be an $\epsilon$-covering of $\mathcal{G}$ with the smallest cardinality, and let $N(\mathcal{G}, \epsilon) = |\mathcal{G}(\epsilon)|$. When $|\mathcal{G}| < \infty$, we have $N(\mathcal{G}, \epsilon) \le |\mathcal{G}|$.

Equipped with the assumption and definitions above, we are ready to present the sample complexity of GOALIVE.

Theorem 5.9 (Sample Complexity of GOALIVE). Under Assumption 5.6, if we choose $\zeta_{\mathrm{goal}} = \epsilon/(600 H \sqrt{d})$, $\zeta_{\mathrm{act}} = 2\epsilon/H$, $\zeta_{\mathrm{elim}} = \epsilon/(4 H \sqrt{d})$, $n_{\mathrm{act}} = 2592 H^2 \iota/\epsilon^2$, and $n_{\mathrm{elim}} = 4608 H^2 d \log_2\big(N(\mathcal{F}, \zeta_{\mathrm{elim}}/8)\big)\, \iota/\epsilon^2$, where $d = \dim_{\mathrm{GOAL\text{-}BE}}(\mathcal{F}, \mathcal{D}_{\mathcal{F}}, \epsilon/H)$ and $\iota = \log_2\big[N(\mathcal{G}, \zeta_{\mathrm{goal}}) H d/(\delta \epsilon)\big]$, then with probability at least $1 - \delta$, Algorithm 1 outputs an $O(\epsilon)$-optimal policy using at most $O\big(H^3 d^2 \log_2\big[N(\mathcal{F}, \zeta_{\mathrm{elim}}/8)\big]\, \iota/\epsilon^2\big)$ episodes.

Proof. See Appendix C for a detailed proof.

Theorem 5.9 shows that RL problems with a low GOAL-BE dimension can be solved efficiently by GOALIVE under the realizability assumption. The sample complexity of GOALIVE is $\widetilde{O}(H^3 d^2/\epsilon^2)$, which is polynomial in $1/\epsilon$ and the horizon. The dependency on the complexity measure is the same as in Jin et al. (2021a). However, as we show in Lemmas 5.4 and 5.5, the GOAL-BE dimension is not larger than the original BE dimension, and it can be exponentially smaller in the best case. The cost we pay for the improvement in the complexity measure is the log-covering number of the goal space. For the example in Section 3.1, we have $|\mathcal{G}| = 2^d$; therefore, $\log N(\mathcal{G}, \zeta_{\mathrm{goal}}) \le H \log 2$, which is polynomial in the horizon $H$.
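To give a feel for the magnitudes involved, the following back-of-the-envelope Python sketch plugs the constants of Theorem 5.9 as stated above, together with the iteration bound $dH + 1$ from its proof, into the example of Section 3.1, where the GOAL-BE dimension is 1 and both $N(\mathcal{F}, \cdot)$ and $N(\mathcal{G}, \cdot)$ are at most $2^{H-2}$. The helper name `goalive_budget` and the specific choices of $H$, $\epsilon$, and $\delta$ are our own illustrative assumptions.

```python
import math

def goalive_budget(H, eps, delta, d_goal_be, log2_cover_F, log2_cover_G):
    """Parameter choices of Theorem 5.9 (as stated) and the resulting episode budget.

    d_goal_be    : GOAL-BE dimension dim_GOAL-BE(F, D_F, eps / H)
    log2_cover_F : log2 N(F, zeta_elim / 8)
    log2_cover_G : log2 N(G, zeta_goal)
    """
    iota = math.log2(2 ** log2_cover_G * H * d_goal_be / (delta * eps))
    n_act = 2592 * H**2 * iota / eps**2
    n_elim = 4608 * H**2 * d_goal_be * log2_cover_F * iota / eps**2
    # Algorithm 1 runs for at most d_goal_be * H + 1 iterations (Lemma C.2),
    # each costing n_act + n_elim episodes.
    return (d_goal_be * H + 1) * (n_act + n_elim)

# Example of Section 3.1: dim_GOAL-BE = 1 (Lemma 5.5) and |F| = |G| = 2^(H-2),
# so both log-covering numbers are at most H - 2.
for H in (20, 40, 80):
    episodes = goalive_budget(H, eps=0.1, delta=0.1, d_goal_be=1,
                              log2_cover_F=H - 2, log2_cover_G=H - 2)
    # The budget grows polynomially in H, whereas the single-goal model-free
    # lower bound of Lemma 3.2 grows like 2^H.
    print(f"H = {H:2d}: ~{episodes:.2e} episodes")
```

The constants are large, but the point of the calculation is the scaling: the episode budget is a fixed polynomial in $H$, in contrast to the $\Omega(2^H)$ lower bound for model-free algorithms without multiple goals in Lemma 3.2.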
Comparison with previous work on multitask RL. The setting we study can be viewed as a special case of multitask RL. The difference is that we are only concerned with the suboptimality with respect to the target task $g^\star$ rather than all tasks. Previous work has also developed sample-efficient algorithms for multitask RL: Lu et al. (2022) designed a model-free algorithm for multitask RL and demonstrated its sample efficiency. However, the complexity measure they used is the original Eluder dimension, which cannot exploit the connection between goals. Moreover, they considered tasks that are randomly generated, which differs from the empirical work on goal relabeling, where the algorithm can set the goal during the training phase. Therefore, their results cannot explain the success of goal relabeling.

Connection with reward-free exploration. Our algorithm is similar to reward-free exploration in the sense that the exploration strategy does not rely on the original reward. Existing results on reward-free exploration use model-based algorithms (Jin et al., 2020a; Wang et al., 2020). However, model-based algorithms tend to have a larger asymptotic bias. For complex problems, the dynamics cannot be learned perfectly, and the final policy can be highly suboptimal (Pong et al., 2018). Therefore, empirical work that uses goal relabeling often applies model-free algorithms, and its success cannot be explained by work that uses model-based algorithms. In fact, our algorithm can be viewed as a model-free algorithm for reward-free exploration.

Comparison with Zhu & Zhang (2023). In the context of goal-conditioned RL, previous research, including the notable work by Zhu & Zhang (2023), has provided a theoretical analysis. However, their analysis focuses on offline RL and does not design an algorithm for provably efficient exploration. Moreover, they assume that the dataset covers the optimal policy with respect to all $g \in \mathcal{G}$, whereas our analysis does not rely on such a stringent assumption.

Acknowledgement

Zhaoran Wang acknowledges National Science Foundation (Awards 2048075, 2008827, 2015568, 1934931), Simons Institute (Theory of Reinforcement Learning), Amazon, J.P. Morgan, and Two Sigma for their support.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Pieter Abbeel, O., and Zaremba, W. Hindsight experience replay. Advances in Neural Information Processing Systems, 30, 2017.

Cai, Q., Yang, Z., Jin, C., and Wang, Z. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019.

Cheng, Y., Feng, S., Yang, J., Zhang, H., and Liang, Y. Provable benefit of multitask representation learning in reinforcement learning. arXiv preprint arXiv:2206.05900, 2022.

Colas, C., Fournier, P., Chetouani, M., Sigaud, O., and Oudeyer, P.-Y. CURIOUS: Intrinsically motivated modular multi-goal reinforcement learning. In International Conference on Machine Learning, pp. 1331–1340. PMLR, 2019.

Du, S. S., Kakade, S. M., Lee, J. D., Lovett, S., Mahajan, G., Sun, W., and Wang, R. Bilinear classes: A structural framework for provable generalization in RL. arXiv preprint arXiv:2103.10897, 2021.

Fang, M., Zhou, T., Du, Y., Han, L., and Zhang, Z. Curriculum-guided hindsight experience replay. Advances in Neural Information Processing Systems, 32, 2019.

Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pp. 1704–1713. PMLR, 2017.

Jin, C., Krishnamurthy, A., Simchowitz, M., and Yu, T. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pp. 4870–4879. PMLR, 2020a.

Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137–2143. PMLR, 2020b.

Jin, C., Liu, Q., and Miryoosefi, S. Bellman Eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. Advances in Neural Information Processing Systems, 34:13406–13418, 2021a.

Jin, Y., Yang, Z., and Wang, Z. Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, pp. 5084–5096. PMLR, 2021b.

Li, A., Pinto, L., and Abbeel, P. Generalized hindsight for reinforcement learning. Advances in Neural Information Processing Systems, 33:7754–7767, 2020.
Liu, Z., Zhang, Y., Fu, Z., Yang, Z., and Wang, Z. Learning from demonstration: Provably efficient adversarial policy imitation with linear function approximation. In International Conference on Machine Learning, pp. 14094–14138. PMLR, 2022.

Liu, Z., Lu, M., Xiong, W., Zhong, H., Hu, H., Zhang, S., Zheng, S., Yang, Z., and Wang, Z. Maximize to explore: One objective function fusing estimation, planning, and exploration. Advances in Neural Information Processing Systems, 36, 2024.

Lu, R., Zhao, A., Du, S. S., and Huang, G. Provable general function class representation learning in multitask bandits and MDPs. arXiv preprint arXiv:2205.15701, 2022.

Nair, A. V., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. Visual reinforcement learning with imagined goals. Advances in Neural Information Processing Systems, 31, 2018.

Pitis, S., Chan, H., Zhao, S., Stadie, B., and Ba, J. Maximum entropy gain exploration for long horizon multi-goal reinforcement learning. In International Conference on Machine Learning, pp. 7750–7761. PMLR, 2020.

Pong, V., Gu, S., Dalal, M., and Levine, S. Temporal difference models: Model-free deep RL for model-based control. arXiv preprint arXiv:1802.09081, 2018.

Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing - solving sparse reward tasks from scratch. In International Conference on Machine Learning, pp. 4344–4353. PMLR, 2018.

Savinov, N., Raichuk, A., Marinier, R., Vincent, D., Pollefeys, M., Lillicrap, T., and Gelly, S. Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274, 2018.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pp. 2898–2933. PMLR, 2019.

Uehara, M., Zhang, X., and Sun, W. Representation learning for online and offline RL in low-rank MDPs. arXiv preprint arXiv:2110.04652, 2021.

Wang, R., Du, S. S., Yang, L., and Salakhutdinov, R. R. On reward-free reinforcement learning with linear function approximation. Advances in Neural Information Processing Systems, 33:17816–17826, 2020.

Zhang, T., Ren, T., Yang, M., Gonzalez, J., Schuurmans, D., and Dai, B. Making linear MDPs practical via contrastive representation learning. In International Conference on Machine Learning, pp. 26447–26466. PMLR, 2022.

Zhang, Y., Abbeel, P., and Pinto, L. Automatic curriculum learning through value disagreement. Advances in Neural Information Processing Systems, 33:7648–7659, 2020.

Zhu, H. and Zhang, A. Provably efficient offline goal-conditioned reinforcement learning with general function approximation and single-policy concentrability. arXiv preprint arXiv:2302.03770, 2023.

List of Notation

In the sequel, we present a list of notations used in the paper.

| Notation | Explanation |
|---|---|
| $\mathcal{S}, \mathcal{A}, \mathcal{G}$ | The state, action, and goal spaces, respectively. |
| $H$ | The length of an episode. |
| $t$ | The index of the iteration in Algorithm 1 (GOALIVE). |
| $\mathcal{E}$ | The average Bellman error defined in (2). |
| $\Phi$ | The $\mathcal{F}$-profile defined in Definition 3.1. |
| $\mathcal{F}$ | The hypothesis class used in model-free algorithms. |
| $g^\star$ | The target goal we are concerned with. |
| $\dim_{\mathrm{BE}}, \dim_{\mathrm{GOAL\text{-}BE}}$ | The original BE dimension and the GOAL-BE dimension. |
| $\{\mathcal{T}_h\}_{h=1}^H$ | The Bellman operators defined in Section 2. |
Structure of the Appendix

We provide a detailed proof of Lemma 3.2 in Appendix A, the proofs of the lemmas in Sections 4 and 5.1 in Appendix B, the proof of Theorem 5.9 in Appendix C, and the proofs of auxiliary lemmas in Appendix D.

Table of Contents
A. Proof of Lemma 3.2
B. Proof of Lemmas in Sections 5.1 and 4
C. Proof of Theorem 5.9
D. Proof of Lemmas in Appendix C

A. Proof of Lemma 3.2

In this section, we first provide the proof of Lemma 3.2. We then provide more detail on the action-value function $Q^p_h$.

Proof of Lemma 3.2. Our proof consists of two parts. In the first part, we show the inefficiency of model-free algorithms without multiple goals. In the second part, we show that there exists a goal-conditioned model-free algorithm that achieves provably efficient exploration.

[Figure 2: An example of the MDP construction in Section 3, with d = 2 and H = 4. All models are deterministic, and each model is uniquely indexed by a sequence of actions p. We use p⋆ to denote the index of the true model; here p⋆ = (2, 1). The last transition is designed such that the agent always lands in a state that contains a 0 unless it follows the path p⋆.]

Inefficiency of Model-free Algorithms without Multiple Goals. Our proof of the inefficiency of model-free algorithms here is similar to the proof of Theorem 2 of Sun et al. (2019). We construct another class of non-factored models such that (1) learning in this new class is intractable, and (2) the two families are indistinguishable to any model-free algorithm without multiple goals. The new model class is obtained by transforming each $P^p \in \mathcal{M}$ into $\widetilde{P}^p$. $\widetilde{P}^p$ has the same state space and transitions as $P^p$, except for the transition from horizon $H-1$ to $H$, which still does not depend on the action. In the transition indexed by $p = (p_1, p_2, \ldots, p_d)$, we transit to

$$s_H = \begin{cases} [1]^d & \text{when } s_{H-1} = p, \\ [0]^d & \text{otherwise.} \end{cases}$$

The reward function is the same as in the original model class. This example is similar to the one in Sun et al. (2019). Solving it is equivalent to solving a multi-armed bandit problem with one optimal arm among $2^{H-2}$ arms. Therefore, the sample complexity of any algorithm in the new example is $\Omega(2^H)$.

To prove that the two model families are indistinguishable for model-free algorithms without multiple goals, as defined in Definition 3.1, we show that the original $\mathcal{F}$-profiles, namely $\Phi_{\mathcal{F}, g^\star}$ in Definition 3.1, in $P^p$ are identical to those in $\widetilde{P}^p$. This implies that any model-free algorithm behaves in the same way in $P^p$ and $\widetilde{P}^p$, so that the sample complexity must be identical, and hence $\Omega(2^H)$. To continue, we introduce the definition of optimal planning, which is an important tool to connect these two model families.

Definition A.1 (Optimal Planning). Let $\mathcal{M}$ be a class of transitions. For a transition $P = \{P_h\}_{h=1}^H \in \mathcal{M}$, we denote by $\pi^P = \{\pi^P_h\}_{h=1}^H$, $V^P = \{V^P_h\}_{h=1}^H$, and $Q^P = \{Q^P_h\}_{h=1}^H$ the optimal policy, value function, and action-value function when the transition of the MDP is $P$. We denote by $\mathrm{OP}(P)$ the mapping that maps a transition $P$ to its optimal policy and optimal Q-function, that is, $\mathrm{OP}(P) := (\pi^P, Q^P)$. We then define $\mathrm{OP}(\mathcal{M}) := \{(\pi^P, Q^P) \mid P \in \mathcal{M}\}$.
Let M = {Pp}p {1,2}d and f M = { e Pp}p { 1,1}d. Let F = {Fh}H h=1, Π = {Πh}H h=1 to be the Q class and policy classes from OP(M, G), e F = { e Fh}H h=1 and eΠ = {eΠh}H h=1 be the policy class from OP( f M, G). Since all MDPs of interest have fully deterministic dynamics, and non-zero rewards only occur at the last step, it suffices to show that for any sequence of actions a = (a1, . . . , a H), (1) the final reward has the same distribution for P p and e P p, and (2) the original F-profiles [Qh(sh, a, g )]Qh Fh,a A and [Qh(sh, a, g )]Qh e Fh,a A are equivalent at all states generated by taking a in P p and e P p, respectively. The equivalence of the reward is obvious, In the following, we study the equivalence of the original F-profiles. In P p and at level H, by the definition of the reward function, the original F-profile is [1]|FH| for the state without 0 and [0]|FH| otherwise. Thus, when the action sequence a = p, the original F-profile of the reached state is [1]|Q|. Otherwise, the original F-profile of the reached state is [0]|Q|. Similarly, in e P p the original F-profile is [0]| e FH| if the state is [0]d, and it is [1]| e FH| otherwise. The equivalence here is obvious since | e FH| = |FH| = 2H 2. For level H 1, no matter the true model path p, the Qp associated with path p has value Qp H 1(a, +1, g ) = 1 {a = p } at state a. Hence the Q-profile at a can be represented as [1 {a = p }]p { 1,1}d, for both P p and e P p. We remark that the original F-profile does not depend on the true model p because all models agree on the dynamics before the last step. Similarly, for h < H 1 where each state has two actions { 1, 1}, we have: Qp (a1:h 1, 1, g ) = 1 {a1:h 1 1 = p 1:h} , Qp (a1:h 1, 2, g ) = 1 {a1:h 1 2 = p 1:h} . Hence, the original F-profile can be represented as: h 1 {a1:h 1 1 = p 1:h} , 1 {a1:h 1 2 = p 1:h} i which is the same between P p and e P p. Therefore, the model P p and e P p have exactly the same original F-profile for all action sequences, implying that any model-free algorithm without multiple goals, must have the same behavior on both transition classes. Since the transition class f M = { e P p}p admits an information-theoretic sample complexity lower bound of Ω(2H), the same lower bound applies to M = {P p}p for model-free algorithms without multiple goals. Sample Efficiency of Goal-conditioned Model-free Algorithm This can be directly proved by combining Lemma 5.5 and Theorem 5.9. By Theorem 5.9, when we use Algorithm GOALIVE, we can obtain a 4ϵ-optimal policy with at most 15000H3d2 0 log2(N(F, ζelim/8)) ι/ϵ2 episodes with probability 1 δ. Here ι = log2[N(G, ζgoal)Hd0/(δϵ)] and d0 = dim GOAL BE F, DF, ϵ/H . Since both F and G is finite, we have N(G, ζgoal) = N(F, ζelim/8) = 2H 2. By Lemma 5.5, we have d0 = 1. Therefore, the number of episodes is bounded from the above by 15000H7 log2[H/(δϵ)]/ϵ2, which is poly(H, 1/ϵ, log(1/δ)). Thus, we conclude the proof of Lemma 3.2. How Does Goal Relabeling Improve Sample Efficiency? B. Proof of Lemmas in 5.1 ans 4 B.1. Proof of Lemma 4.1 Proof of Lemma 4.1. Since we have QH(s, a, g) = 1s=g, for any s S, there exists a unique g G, such that QH(s, a, g) = 1. Since PH 1 is deterministic, there exists a unique g G, such that TH 1f H(s H 1, a H 1, g ) = 1g=g . Since F is deterministic, there exists a unique g G, such that f H 1(s H 1, a H 1, g ) = 1g=g . We also know that f H(s H, a H, g) = 1 if and only is g = s H. Therefore, we have f H 1(s H 1, a H 1, g ) r H 1(s H 1, a H 1, g ) max a A f H(s H, a , g ) {0, 1} for any g G. 
Therefore, Lemma 4.1 holds if we have f H 1(s H 1, a H 1, g ) r H 1(s H 1, a H 1, g ) max a A f H(s H, a , g ) = 1 when g = s H. In the following, we consider the case that f H 1(s H 1, a H 1, g ) r H 1(s H 1, a H 1, g ) max a A f H(s H, a , g ) = 0 when g = s H. In this case, we have f H 1(s H 1, a H 1, g ) = 0 when g = s H. Therefore, we have f H 1(s H 1, a H 1, g ) = 1g =s H, and f H 1(s H 1, a H 1, g) r H 1(s H 1, a H 1, g) max a A f H(s H, a , g) = 1g=s H 1g=s H = 0 for all g G. Therefore, we conclude the proof of Lemma 4.1. B.2. Proof of Lemma 5.4 Proof of Lemma 5.4. By the definition of Fh,g and Fh,g, we have Fh,g Fh,g. By Definition 2.2, we have dim DE((I Th)Fh,g , Πh, ϵ) dim DE((I Th)Fh,g, Πh, ϵ). By Definition 2.3, we have dim BE(Fg , Π, ϵ) dim BE(Fg, Π, ϵ), which finish the first part of the proof. In the following part of the proof, we show that dim GOAL BE(F, Π, ϵ) dim BE(Fg , Π, ϵ). It is sufficient to show that dim GOAL DE((I Th)F, Πh, ϵ) dim DE((I Th)Fg , Π, ϵ). We only need to show that if ν is goal-conditioned ϵindependent with {µ1, . . . , µn}, we then have ν is ϵ-independent with {µ1, . . . , µn}. By the definition of goal-conditioned ϵ-independence, we have |Es,a νfh(s, a, g ) rh(s, a, g ) Thfh+1(s, a, g )| ϵ and i=1 max g G |Es,a µifh(s, a, g) rh(s, a, g) Thfh+1(s, a, g)|2 ϵ for some f F. Therefore, we have v u u t i=1 |Es,a µifh(s, a, g ) rh(s, a, g ) Thfh+1(s, a, g )|2 ϵ. We conclude the proof by the definition of ϵ-independence. How Does Goal Relabeling Improve Sample Efficiency? B.3. Proof of Lemma 5.5 Proof of Lemma 5.5. By the definition of F, we know that fh(s, a, g) r h(s, a, g) Thfh+1(s, a, g) = 0 for any (f, h, s, a, g) F [H 2] S AG. Therefore, we have dim DE((I Th)Fh,g , DF,h, ϵ) = dim GOAL DE((I Th)Fh, DF,h, ϵ) = 1 for h [H 2]. Therefore, we only need to consider the case where h = H 1. For a hypothesis p {1, 2}d, we denote by f p = {f p h }H h=1 F the corresponding value. By the setting of the dynamic system, when we use the optimal policy with respect to g and f p, we will arrive the state s H 1 = p. Therefore, we have DF,H 1 = {δ(s,a) | s {1, 2}d, a A}. For any f p F, δ(s H 1,a) DF,H 1, we have E(s,a) µf p H 1(s, a, g ) = 0, r H 1(s, a, g ) = 0, E(s,a) µTH 1f p H(s, a, g ) = 0 when s H 1 = p and s H 1 = p. We also have E(s,a) µf p H 1(s, a, g ) = 1, r H 1(s, a, g ) = 0, E(s,a) µTH 1f p H(s, a, g ) = 0 when s H 1 = p and s H 1 = p. Since we have 2d hypotheses in total, we consider the sequence µ1 = δ(s1 H 1,a), . . . , µn = δ(sn H 1,a), such that n = 2d 1, µi = δ(p ,a) for i [n], µi = µj when i = j. We set p[i] = s1 H 1 and choose f i = {f i h} the corresponding value function. We can easily verify that E(s,a) µi[f j H 1(s, a, g ) r H 1(s, a, g ) TH 1f j H(s, a, g )] = 0 when i < j. Therefore, we have E(s,a) µi[f j H 1(s, a, g ) r H 1(s, a, g ) TH 1f j H(s, a, g )] 2 = 0. We also have E(s,a) µi[f i H 1(s, a, g ) r H 1(s, a, g ) TH 1f i H(s, a, g )] = 1. Therefore, µi is ϵ-independent with {µ1, , µi 1}. By the definition of distributional Eluder dimension, we have dim DE((I TH 1)FH 1,g , DF,H 1, ϵ) 2d 1 for ϵ < 1. For the second part of the proof, we first show that f H 1(s, a, g) r H 1(s, a, g) Es max a f H(s , a , g) < 1 if and only if f H 1 = Q H 1, f H = Q H. First, in the given hypothesis class, we have f H = Q H holds for all f F. When we select g = s H, we have f H 1(s, a, g) r H 1(s, a, g) Es max a f H(s , a , g) = |f H 1(s, a, s H) 1| . When f H 1 = QH 1, we have f H 1(s, a, s H) = 0, which contradict with |f H 1(s, a, s H) 1| < 1. 
Therefore, we have f H 1 = Q H 1. How Does Goal Relabeling Improve Sample Efficiency? If ν is goal-conditioned ϵ-independent with µ, we have max g G Es,a µ f H 1(s, a, g) r H 1(s, a, g) Es max a f H(s , a , g) < ϵ (7) Es,a ν f H 1(s, a, g ) r H 1(s, a, g ) Es max a f H(s , a , g ) > ϵ. (8) By (7), we have f H 1 = Q H 1, f H = Q H, which implies Es,a ν f H 1(s, a, g ) r H 1(s, a, g ) Es max a f H(s , a , g ) = 0, which contradicts with (8). Therefore, for any µ, ν DF,H 1, ν is not goal-conditioned ϵ-independent with µ. Therefore, we have dim GOAL DE((I TH 1)F, DF,H 1, ϵ) = 1, which concludes the proof. C. Proof of Theorem 5.9 Proof of Theorem 5.9. We denote by T the index of the last iteration of Algorithm 1. By Algorithm 1, we know that the elimination procedure is activated at the t-th iterations for t [T 1], and is not activated in the T-th iteration. We also know that the output policy is πT , which is the greedy policy with respect to f T . We use ht to denote the step that the elimination procedure is activated in the t-th iteration. The following lemma provides a characterization of {ft}T 1 t=1 and the elimination procedure. Lemma C.1. We have E(f t, πt, ht, g ) > ϵ/H for all t [T ] with probability 1 δ/8. We also have E(f, πt, ht) < ϵ/(H d) for all t [T ] when f Bt with probability 1 δ/8. Here E(f t, πt, ht, g ) and E(f, πt, ht) are defined in (2), and T = min{T 1, d H + 1}. Proof. See Appendix D.1 for a detailed proof. Lemma C.1 shows that, the average Bellman error of f t with respect to πt is large, and hypotheses with a large average Bellman error with respect to πt will be eliminated with high probability. We denote by E1 the event in Lemma C.1. The following lemma shows that T d H + 1 when E1 holds, which bounds the number of iterations in Algorithm 1. Lemma C.2. When we condition on E1, we have T d H + 1. Proof. See Appendix D.2 for a detailed proof. We also have the following lemma, which shows that the average Bellman error of f T is small. Lemma C.3. When T d H + 1, we have PH h=1 E(f T , πT , h, g ) < 4ϵ holds with probability 1 δ/4. Proof. See Appendix D.3 for a detailed proof. In addition, we have the following lemma, which show that the true hypothesis will not be eliminated with high probability. Lemma C.4. We define Q = {Q h}H h=1, where Q h is the optimal goal-conditioned action-value function we defined in 2. Then with probability 1 δ/4, we have Q Bt for all t [T ]. Here T = min{T 1, d H + 1}. Proof. See Appendix D.4 for a detailed proof. How Does Goal Relabeling Improve Sample Efficiency? We denote by E3 the event we defined in Lemma C.4. In the following part of the proof, we condition on Events E1, E2, and E3. In the following, we show that the suboptimality of πT is small when we condition on E1, E2, and E3. When we condition on Event E1, We have V 1 (s1, g ) < f T 1 (s1, a1, g ) by Line 4 of Algorithm 1. Therefore, we have V 1 (x1, g ) V πT 1 (x1, g ) max a f T 1 (s1, a, g ) V πT (s1, g ). (9) Since f T H+1(s, a, g) = 0 and V πT H+1(s, a, g) = 0 for all (s, a, g) S A G, we have max a f T 1 (s1, a, g ) V πT (s1, g ) = h=1 EπT f T h (sh, ah, g ) r h(sh, ah, g ) f T h+1(sh+1, ah+1, g ) h=1 EπT QπT h (sh, ah, g ) r h(sh, ah, g ) QπT h+1(sh+1, ah+1, g ) . (10) Here the expectation EπT [ ] is taken with respect to sh+1 P h( | sh, ah) and ah πT h ( | sh, g ). By Bellman equation, we have EπT QπT h (sh, ah, g ) r h(sh, ah, g ) QπT h+1(sh+1, ah+1, g ) = 0. (11) By (2), we have EπT f T h (sh, ah, g ) r h(sh, ah, g ) f T h+1(sh+1, ah+1, g ) = E(f T , πT , h, g ). 
Combining (9), (10), (11), and (12), we have
$$V^*_1(s_1, g^*) - V^{\pi^T}_1(s_1, g^*) \le \sum_{h=1}^{H} \mathcal{E}(f^T, \pi^T, h, g^*).$$
When we condition on $E_2$ in Lemma C.3, we have $\sum_{h=1}^{H} \mathcal{E}(f^T, \pi^T, h, g^*) \le 4\epsilon$. Therefore, when we condition on $E_1$, $E_2$, and $E_3$, the returned policy is $O(\epsilon)$-optimal. By Lemmas C.1, C.3, and C.4, we have $P(\cap_{i=1}^{3} E_i) \ge 1 - \delta$. Therefore, with probability at least $1 - \delta$, Algorithm 1 terminates within $T \le dH + 1$ iterations and outputs a $4\epsilon$-optimal policy using at most
$$(dH + 1)(n_{\mathrm{act}} + n_{\mathrm{elim}}) \le 15000\, H^3 d^2 \log^2\big(N(\mathcal{F}, \zeta_{\mathrm{elim}}/8)\big)\, \iota / \epsilon^2$$
episodes. Thus, we conclude the proof of Theorem 5.9.

D. Proof of Lemmas in Appendix C

D.1. Proof of Lemma C.1

Proof. First, we have the following lemma, which shows that in the activation phase, $\widehat{\mathcal{E}}$ in Line 6 of Algorithm 1 is a good estimator of $\mathcal{E}$ in (2).

Lemma D.1 (Concentration in the Activation Phase). With probability at least $1 - \delta/8$, we have $\big|\widehat{\mathcal{E}}(f^t, \pi^t, h, g^*) - \mathcal{E}(f^t, \pi^t, h, g^*)\big| \le \epsilon/(6H)$ for all $(t, h) \in [dH + 1] \times [H]$.

Proof. See Appendix D.5 for a detailed proof.

Since the elimination phase is activated for $t < T$, we have $\widehat{\mathcal{E}}(f^t, \pi^t, h^t, g^*) \ge \zeta_{\mathrm{act}}$. Therefore, when we condition on the event in Lemma D.1, we have $\mathcal{E}(f^t, \pi^t, h^t, g^*) \ge \zeta_{\mathrm{act}} - \epsilon/(6H) \ge \epsilon/H$ for $t \le \min\{T - 1, dH + 1\}$, which concludes the proof of the first part of Lemma C.1.

We also have the following lemma, which shows that in the elimination phase, $\widehat{\mathcal{E}}$ in Line 12 of Algorithm 1 is a good estimator of $\mathcal{E}$ in (2).

Lemma D.2 (Concentration in the Elimination Phase). With probability at least $1 - \delta/8$, we have $\big|\widehat{\mathcal{E}}(f, \pi^t, h) - \mathcal{E}(f, \pi^t, h)\big| \le \epsilon/(4H\sqrt{d})$ for all $(f, t, h) \in \mathcal{F} \times [dH + 1] \times [H]$. Here $\widehat{\mathcal{E}}$ is defined in Line 12 of Algorithm 1.

Proof. See Appendix D.6 for a detailed proof.

By the definition of $\mathcal{B}^t$ in Line 13 of Algorithm 1, we have $\widehat{\mathcal{E}}(f, \pi^t, h^t) \le \zeta_{\mathrm{elim}}$ for all $f \in \mathcal{B}^t$. Therefore, when we condition on the event in Lemma D.2, we have $\mathcal{E}(f, \pi^t, h^t) \le \zeta_{\mathrm{elim}} + \epsilon/(4H\sqrt{d}) \le \epsilon/(H\sqrt{d})$ for all $f \in \mathcal{B}^t$ and $t \le \min\{T - 1, dH + 1\}$, which concludes the proof of the second part of Lemma C.1.

D.2. Proof of Lemma C.2

Proof. We prove Lemma C.2 by contradiction. Suppose that $T > dH + 1$. Then there exist $h \in [H]$ and $t_1 < \cdots < t_{d+1} \le dH + 1$ such that the elimination phase is activated at step $h$ in the $t_i$-th iteration for all $i \in [d+1]$. When we condition on $E_1$ in Lemma C.1, we have $\mathcal{E}(f^{t_i}, \pi^{t_i}, h, g^*) > \epsilon/H$ for all $i \in [d+1]$. Since $f^{t_i} \in \mathcal{B}^{t_j}$ when $i > j$, we have $\mathcal{E}(f^{t_i}, \pi^{t_j}, h) \le \epsilon/(H\sqrt{d})$ when $i > j$. Therefore, we have
$$\sqrt{\sum_{j=1}^{i-1} \mathcal{E}(f^{t_i}, \pi^{t_j}, h)^2} \le \sqrt{d} \cdot \frac{\epsilon}{H\sqrt{d}} = \frac{\epsilon}{H}.$$
Therefore, the roll-in distributions of $\pi^{t_1}, \ldots, \pi^{t_{d+1}}$ at step $h$ form a goal-conditioned $(\epsilon/H)$-independent sequence of length $d + 1$, which contradicts the definition of the GOAL-BE dimension. Therefore, we have $T \le dH + 1$ when we condition on $E_1$, which concludes the proof of Lemma C.2.

D.3. Proof of Lemma C.3

Proof. Since the elimination phase is not activated in the $T$-th iteration, we have $\widehat{\mathcal{E}}(f^T, \pi^T, h, g^*) \le \zeta_{\mathrm{act}}$ for all $h \in [H]$. Therefore, when we condition on the event in Lemma D.1, we have $\mathcal{E}(f^T, \pi^T, h, g^*) \le \zeta_{\mathrm{act}} + \epsilon/(6H) \le 4\epsilon/H$. Therefore, we have $\sum_{h=1}^{H} \mathcal{E}(f^T, \pi^T, h, g^*) \le 4\epsilon$, which concludes the proof of Lemma C.3.

D.4. Proof of Lemma C.4

Proof. By the Bellman equation, we have $\mathcal{E}(Q^*, \pi, h) = 0$ for every policy $\pi$ and every $h \in [H]$. By Lemma D.2, we have $\widehat{\mathcal{E}}(Q^*, \pi^t, h) \le \epsilon/(4H\sqrt{d}) = \zeta_{\mathrm{elim}}$ for all $(t, h) \in [dH + 1] \times [H]$. By Line 13 of Algorithm 1, we have $Q^* \in \mathcal{B}^t$ for all $t \le dH + 1$. Thus, we conclude the proof of Lemma C.4.
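The aggregation step in the proof of Lemma C.2 (Appendix D.2 above) reduces to a short calculation, sketched below with illustrative toy sizes: $d$ predecessor errors of size at most $\epsilon/(H\sqrt{d})$ aggregate to at most $\epsilon/H$, while each hypothesis retains an error above $\epsilon/H$ on its own roll-in distribution, which is what forces the goal-conditioned $(\epsilon/H)$-independent sequence of length $d+1$.

```python
import math

d, H, eps = 16, 10, 0.25                 # toy sizes; d plays the role of dim GOAL-BE

# If the elimination phase fires d + 1 times at the same step h, each f^{t_i}
# has a large error on its own roll-in distribution (it triggered activation)
# and a small max-over-goals error on every earlier one (it was not eliminated).
own_error = eps / H + 1e-9               # E(f^{t_i}, pi^{t_i}, h, g*) > eps / H
prior_error = eps / (H * math.sqrt(d))   # E(f^{t_i}, pi^{t_j}, h) <= eps / (H sqrt(d)), j < i

# At most d predecessors, so the aggregated error is sqrt(d) * eps/(H sqrt(d)) = eps/H.
aggregate = math.sqrt(d * prior_error ** 2)
assert aggregate <= eps / H + 1e-12 and own_error > eps / H

# Hence the d + 1 roll-in distributions form a goal-conditioned (eps/H)-independent
# sequence, contradicting dim GOAL-BE = d; so at most d activations per step.
print("forced independent sequence length:", d + 1)
```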
D.5. Proof of Lemma D.1

Proof. Consider a fixed pair $(t, h) \in [dH + 1] \times [H]$. Since $\big|f_h(s, a, g^*) - r_h(s, a, g^*) - \max_{a' \in \mathcal{A}} f_{h+1}(s', a', g^*)\big| \le 3$, by Azuma-Hoeffding's inequality,
$$\big|\widehat{\mathcal{E}}(f^t, \pi^t, h, g^*) - \mathcal{E}(f^t, \pi^t, h, g^*)\big| \le 6\sqrt{\frac{2}{n_{\mathrm{act}}} \log\frac{8H(dH + 1)}{\delta}}$$
holds with probability at least $1 - \delta/\big(8H(dH + 1)\big)$. Here $\widehat{\mathcal{E}}$ is defined in Line 6 of Algorithm 1. By taking a union bound, the same inequality holds for all $(t, h) \in [dH + 1] \times [H]$ with probability at least $1 - \delta/8$. Since we set $n_{\mathrm{act}} = 2592\, H^2 \iota/\epsilon^2$, we have $\big|\widehat{\mathcal{E}}(f^t, \pi^t, h, g^*) - \mathcal{E}(f^t, \pi^t, h, g^*)\big| \le \epsilon/(6H)$. Thus, we conclude the proof of Lemma D.1.

D.6. Proof of Lemma D.2

Proof. Let $\mathcal{G}(\zeta_{\mathrm{goal}})$ be a $\zeta_{\mathrm{goal}}$-cover of $\mathcal{G}$ with cardinality $N(\mathcal{G}, \zeta_{\mathrm{goal}})$, and let $\mathcal{Z}$ be a $\zeta_{\mathrm{elim}}/8$-cover of $\mathcal{F}$ with cardinality $N(\mathcal{F}, \zeta_{\mathrm{elim}}/8)$. We define
$$\widehat{\mathcal{E}}\big(\hat{f}, \pi^t, h^t, \mathcal{G}(\zeta_{\mathrm{goal}})\big) = \max_{g \in \mathcal{G}(\zeta_{\mathrm{goal}})} \Big|\frac{1}{n_{\mathrm{elim}}} \sum_{(s, a, s') \in \mathcal{D}^t_h} \big[\hat{f}_h(s, a, g) - r_h(s, a, g) - \max_{a' \in \mathcal{A}} \hat{f}_{h+1}(s', a', g)\big]\Big|,$$
where $\mathcal{D}^t_h$ is the dataset collected in the elimination phase. By applying Azuma-Hoeffding's inequality to all $(t, \hat{f}, g) \in [dH + 1] \times \mathcal{Z} \times \mathcal{G}(\zeta_{\mathrm{goal}})$ and taking a union bound, we have
$$\big|\widehat{\mathcal{E}}(\hat{f}, \pi^t, h^t, g) - \mathcal{E}(\hat{f}, \pi^t, h^t, g)\big| \le 6\sqrt{\frac{2}{n_{\mathrm{elim}}} \log\frac{8H(dH + 1) N(\mathcal{G}, \zeta_{\mathrm{goal}}) N(\mathcal{F}, \zeta_{\mathrm{elim}}/8)}{\delta}} \quad (13)$$
for all $(t, \hat{f}, g) \in [dH + 1] \times \mathcal{Z} \times \mathcal{G}(\zeta_{\mathrm{goal}})$ with probability at least $1 - \delta/8$. By the definition of the goal cover and our choice of $\zeta_{\mathrm{goal}}$, we have
$$\big|\widehat{\mathcal{E}}(f, \pi^t, h^t) - \mathcal{E}(f, \pi^t, h^t)\big| \le 6\zeta_{\mathrm{goal}} + \big|\widehat{\mathcal{E}}\big(f, \pi^t, h^t, \mathcal{G}(\zeta_{\mathrm{goal}})\big) - \mathcal{E}\big(f, \pi^t, h^t, \mathcal{G}(\zeta_{\mathrm{goal}})\big)\big| \le \frac{\epsilon}{100 H\sqrt{d}} + \max_{g \in \mathcal{G}(\zeta_{\mathrm{goal}})} \big|\widehat{\mathcal{E}}(f, \pi^t, h^t, g) - \mathcal{E}(f, \pi^t, h^t, g)\big| \quad (14)$$
for all $f \in \mathcal{F}$. Combining (13) and (14) and using the choice of $n_{\mathrm{elim}}$, we have
$$\big|\widehat{\mathcal{E}}(\hat{f}, \pi^t, h^t) - \mathcal{E}(\hat{f}, \pi^t, h^t)\big| \le \frac{\epsilon}{100 H\sqrt{d}} + \frac{\epsilon}{9 H\sqrt{d}} \quad (15)$$
for all $(t, \hat{f}) \in [dH + 1] \times \mathcal{Z}$ with probability at least $1 - \delta/8$. We denote the event in (15) by $E_4$. By the definition of $\mathcal{E}$ in (2), we have
$$\big|\mathcal{E}(\hat{f}, \pi, h) - \mathcal{E}(f, \pi, h)\big| = \Big|\max_{g \in \mathcal{G}} \mathcal{E}(f, \pi, h, g) - \max_{g \in \mathcal{G}} \mathcal{E}(\hat{f}, \pi, h, g)\Big| \le \max_{g \in \mathcal{G}} \big|\mathcal{E}(f, \pi, h, g) - \mathcal{E}(\hat{f}, \pi, h, g)\big| \le 2\|f - \hat{f}\|_\infty.$$
Similarly, we have $\big|\widehat{\mathcal{E}}(\hat{f}, \pi, h) - \widehat{\mathcal{E}}(f, \pi, h)\big| \le 2\|f - \hat{f}\|_\infty$. For any $f \in \mathcal{F}$, we select $\hat{f} \in \mathcal{Z}$ with $\|f - \hat{f}\|_\infty \le \zeta_{\mathrm{elim}}/8$. When we condition on the event $E_4$ in (15), we have
$$\big|\widehat{\mathcal{E}}(f, \pi^t, h^t) - \mathcal{E}(f, \pi^t, h^t)\big| \le \big|\widehat{\mathcal{E}}(\hat{f}, \pi^t, h^t) - \mathcal{E}(\hat{f}, \pi^t, h^t)\big| + \big|\mathcal{E}(\hat{f}, \pi^t, h^t) - \mathcal{E}(f, \pi^t, h^t)\big| + \big|\widehat{\mathcal{E}}(\hat{f}, \pi^t, h^t) - \widehat{\mathcal{E}}(f, \pi^t, h^t)\big| \le \frac{\epsilon}{100 H\sqrt{d}} + \frac{\epsilon}{9 H\sqrt{d}} + \frac{\zeta_{\mathrm{elim}}}{2} \le \frac{\epsilon}{4 H\sqrt{d}}$$
for all $(t, f) \in [dH + 1] \times \mathcal{F}$ with probability at least $1 - \delta/8$. Thus, we complete the proof of Lemma D.2.
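As a sanity check on the constants in Lemma D.1, the following sketch (toy sizes, illustrative only) verifies that with residuals bounded by $3$ and $n_{\mathrm{act}} = 2592\,H^2\iota/\epsilon^2$ for $\iota = \log(8H(dH+1)/\delta)$, the Azuma-Hoeffding deviation $6\sqrt{2\iota/n_{\mathrm{act}}}$ is at most $\epsilon/(6H)$, and that an i.i.d. sample of bounded residuals indeed concentrates well within this bound.

```python
import math
import numpy as np

H, d, delta, eps = 2, 2, 0.1, 0.5        # toy sizes, illustrative only

iota = math.log(8 * H * (d * H + 1) / delta)         # union bound over (t, h) pairs
n_act = math.ceil(2592 * H**2 * iota / eps**2)
bound = 6.0 * math.sqrt(2.0 * iota / n_act)          # Azuma-Hoeffding deviation
assert bound <= eps / (6 * H) + 1e-12
print(f"n_act = {n_act}, deviation bound = {bound:.4f}, eps/(6H) = {eps / (6 * H):.4f}")

# Monte-Carlo sanity check: i.i.d. residuals bounded by 3 concentrate well within
# the bound (Hoeffding is a worst-case inequality, so the empirical gap is smaller).
rng = np.random.default_rng(0)
residuals = rng.uniform(-3.0, 3.0, size=n_act)       # mean zero by construction
print(f"empirical |mean| = {abs(residuals.mean()):.4f} <= {bound:.4f}")
```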