Goal-Conditioned Q-learning as Knowledge Distillation

Alexander Levine and Soheil Feizi
University of Maryland, College Park, Maryland, USA
{alevine0, sfeizi}@cs.umd.edu

Many applications of reinforcement learning can be formalized as goal-conditioned environments, where, in each episode, there is a goal that affects the rewards obtained during that episode but does not affect the dynamics. Various techniques have been proposed to improve performance in goal-conditioned environments, such as automatic curriculum generation and goal relabeling. In this work, we explore a connection between off-policy reinforcement learning in goal-conditioned settings and knowledge distillation. In particular, the current Q-value function and the target Q-value estimate are both functions of the goal, and we would like to train the Q-value function to match its target for all goals. We therefore apply Gradient-Based Attention Transfer (Zagoruyko and Komodakis 2017), a knowledge distillation technique, to the Q-function update. We empirically show that this can improve the performance of goal-conditioned off-policy reinforcement learning when the space of goals is high-dimensional. We also show that this technique can be adapted to allow for efficient learning in the case of multiple simultaneous sparse goals, where the agent can attain a reward by achieving any one of a large set of objectives, all specified at test time. Finally, to provide theoretical support, we give examples of classes of environments where (under some assumptions) standard off-policy algorithms such as DDPG require at least O(d^2) replay buffer transitions to learn an optimal policy, while our proposed technique requires only O(d) transitions, where d is the dimensionality of the goal and state space. Code and appendix are available at https://github.com/alevine0/ReenGAGE.

Introduction

In recent years, many works have focused on applying deep reinforcement learning to goal-conditioned tasks, through approaches such as goal relabeling (Andrychowicz et al. 2017; Nair et al. 2018; Yang et al. 2021; Fang et al. 2019) and automatic curriculum generation (Florensa et al. 2018; Sukhbaatar et al. 2018; Zhang, Abbeel, and Pinto 2020). In this work, we focus on model-free off-policy goal-conditioned RL, and present a novel technique for improving performance in this setting. Our approach relies on a connection between the standard Bellman update used in off-policy reinforcement learning in a goal-conditioned setting, and knowledge distillation: the task of training a student network to model the same function as a (generally more complex) teacher network.¹ In brief, the Bellman update can be viewed as an instance of (conditional, stochastic) knowledge distillation, where the current Q-value estimate is the student, the target Q-value network (averaged over transitions) is the teacher, and the independent variable is the goal that the agent is attempting to reach. We use this insight to develop a novel off-policy algorithm that in some instances has improved performance over baselines for goal-conditioned tasks.

¹Some works use the term "knowledge distillation" to refer to the particular method for this task proposed by (Hinton, Vinyals, and Dean 2015), while others, such as (Gou et al. 2021), use it to refer to the task in general; we use the latter definition.

Our main contributions are as follows:
1. We propose ReenGAGE, a novel technique for goal-conditioned off-policy reinforcement learning, and evaluate its performance.
2. We propose Multi-ReenGAGE, a variant of ReenGAGE well-suited for goal-conditioned environments with many simultaneous sparse goals.
3. We provide theoretical justification for ReenGAGE by showing that it is in some cases asymptotically more efficient, in terms of the total number of replay buffer transitions required to learn an optimal policy, than standard off-policy algorithms such as DDPG.

In most of this work, we focus on continuous-action control problems; we extend our method to discrete action spaces in the appendix. Note that while we mostly focus on using ReenGAGE on top of HER (Andrychowicz et al. 2017) and DDPG (Lillicrap et al. 2016) in this work, it can be easily applied alongside any goal-relabeling scheme or automated curriculum, and can be adapted for other off-policy algorithms such as SAC (Haarnoja et al. 2018) or TD3 (Fujimoto, Hoof, and Meger 2018). In particular, we include an application to SAC in the appendix.

Preliminaries and Notation

We consider control problems defined by goal-conditioned MDPs (S, A, G, T, R), where S, A, and G denote sets of states, actions, and goals, respectively; G and A are assumed to be continuous spaces; T : S × A → P(S) is a stochastic transition function; and R : S × G → ℝ is a reward function. At every step, starting at state s ∈ S, an agent chooses an action a ∈ A. The system then transitions to s′ ∼ T(s, a), and the agent receives the reward R(s′, g). For now, we assume that the reward function R(s′, g) is known a priori to the learning algorithm (while the transition function is not): this just means that we know how to interpret the objective which we are trying to achieve. Note that existing goal relabeling techniques, such as HER (Andrychowicz et al. 2017), implicitly make this assumption, as it is necessary to compute the rewards of relabeled transitions. We will discuss cases where this assumption can be relaxed in later sections.

We consider both shaped rewards, in which R(s′, g) is continuous and differentiable everywhere, as well as sparse rewards, where R(s′, g) is not assumed to be differentiable, but maintains some constant value c_low (e.g., c_low = −1 or 0) with ∇_g R(s′, g) = 0 for a substantial fraction of inputs. (There may be some boundary points where R(s′, g) = c_low but ∇_g R(s′, g) is not defined or ∇_g R(s′, g) ≠ 0; in theoretical discussion we will assume these are of measure zero.)

The objective of goal-conditioned RL is to find a policy π : S × G → A such that the discounted future reward

Σ_{t=0}^∞ γ^t R(s_{t+1}, g)    (1)

is maximized in expectation. One common approach is to find the policy π and Q-function Q : S × A × G → ℝ that solve the Bellman equation for Q-learning (Watkins and Dayan 1992), conditioned on a goal g:

Q(s, a, g) = E_{s′∼T(s,a)}[R(s′, g) + γ Q(s′, π(s′, g), g)]    (2)

∀s, g:  π(s, g) = argmax_a Q(s, a, g).    (3)

If functions π and Q satisfy these, then π is guaranteed to be an optimal policy.
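To make the setup concrete, the following is a minimal PyTorch sketch (ours, not the paper's architecture; all layer sizes and class names are illustrative) of goal-conditioned critic and actor networks. The only structural point it illustrates is that both networks take the goal g as an input, as in Equations 2 and 3.

```python
# Illustrative goal-conditioned actor/critic modules (sizes and names are ours).
import torch
import torch.nn as nn

class GoalConditionedCritic(nn.Module):
    def __init__(self, state_dim, action_dim, goal_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, g):
        # Q(s, a, g): one scalar per (state, action, goal) triple.
        return self.net(torch.cat([s, a, g], dim=-1)).squeeze(-1)

class GoalConditionedActor(nn.Module):
    def __init__(self, state_dim, action_dim, goal_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # bounded actions
        )

    def forward(self, s, g):
        # pi(s, g): an action in [-1, 1]^action_dim, conditioned on the goal.
        return self.net(torch.cat([s, g], dim=-1))
```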
In practice, off-policy RL techniques, notably DDPG (Lillicrap et al. 2016), can be used to solve for these functions iteratively by drawing tuples (s, a, s′, g) from a replay buffer:

L_critic = E_{(s,a,s′,g)∼Buffer}[ L_mse( Q_θ(s, a, g),  R(s′, g) + γ Q_{θ′}(s′, π_{ϕ′}(s′, g), g) ) ]    (4)

L_actor = E_{(s,g)∼Buffer}[ −Q_θ(s, π_ϕ(s, g), g) ],    (5)

where θ and ϕ are the current critic and actor parameters, and θ′ and ϕ′ are target parameters, which are periodically updated to more closely match the current estimates. Note that Equation 2 should ideally hold for all (s, a, g): therefore the distribution of (s, a, g) in the replay buffer does not need to precisely follow any particular distribution, assuming sufficient visitation of possible tuples.² The only necessary constraint on the buffer distribution is that the distribution of s′ matches the transition function:

∀s, a, g:  Pr_Buffer[s′ | s, a, g] = Pr_T[s′ | s, a],    (6)

so that the relation in Equation 2 is respected. This means that the goal which is included in the buffer need not necessarily reflect a true historical experience of the agent during training, but can instead be relabeled to enhance training (Andrychowicz et al. 2017; Nair et al. 2018; Yang et al. 2021; Fang et al. 2019). Interestingly, (Schroecker and Isbell 2020) shows that HER (Andrychowicz et al. 2017), a popular relabeling technique, actually does not respect Equation 6 when the transition function is nondeterministic, and therefore may exhibit hindsight bias.

²In practice, replay buffers which better match the behavioral distribution result in better training, due to sources of "extrapolation error", including incomplete visitation and model inductive bias; see (Fujimoto, Meger, and Precup 2019).

Proposed Method

From Equation 2, we can take the gradient with respect to g:

∇_g Q(s, a, g) = ∇_g E_{s′∼T(s,a)}[R(s′, g) + γ Q(s′, π(s′, g), g)]
             = E_{s′∼T(s,a)}[∇_g R(s′, g) + γ ∇_g Q(s′, π(s′, g), g)].    (7)

Because the gradient of the Q-value function is equal to the expectation of the gradient of the sum of the reward and the next-step Q-value, this suggests that we can augment the standard DDPG critic loss with a gradient term, which estimates this expected gradient using the replay buffer samples:

L_ReenGAGE = E_{(s,a,s′,g)∼Buffer}[ L_mse( Q_θ(s, a, g),  R(s′, g) + γ Q_{θ′}(s′, π_{ϕ′}(s′, g), g) )
           + α L_mse( ∇_g Q_θ(s, a, g),  ∇_g R(s′, g) + γ ∇_g Q_{θ′}(s′, π_{ϕ′}(s′, g), g) ) ],    (8)

where α is a constant hyperparameter. Note that the second MSE term is applied to a vector: thus we are fitting ∇_g Q_θ(s, a, g) in all dim(g) dimensions. This allows more information to flow from the target function to the current Q-function (a dim(g)-dimensional vector instead of a scalar), and may therefore improve training. We call our method Reinforcement learning with Gradient Attention for Goal-seeking Efficiently, or ReenGAGE.

In the case of shaped rewards, we can use this loss function directly. In the case of sparse rewards, ∇_g R(s′, g) is not necessarily defined or available. However, it is also zero for a substantial fraction of inputs, and, if R(s′, g) = c_low, then ∇_g R(s′, g) = 0 with high probability. Therefore, we use the gradient loss term only when training on tuples where R(s′, g) = c_low, and assume ∇_g R(s′, g) = 0:

L^(sparse)_ReenGAGE = E_{(s,a,s′,g)∼Buffer}[ L_mse( Q_θ(s, a, g),  R(s′, g) + γ Q_{θ′}(s′, π_{ϕ′}(s′, g), g) )
                    + 1_{R(s′,g)=c_low} · α L_mse( ∇_g Q_θ(s, a, g),  γ ∇_g Q_{θ′}(s′, π_{ϕ′}(s′, g), g) ) ].    (9)

In this sparse case, if ReenGAGE is used alone (i.e., without goal relabeling), then the reward function R does not need to be known explicitly a priori. Instead, the observed values of the rewards R(s′, g) from the training rollouts may be used.
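As an illustration of how the losses in Equations 8 and 9 can be computed with automatic differentiation, here is a sketch of a ReenGAGE-style critic loss (our own simplification, not the authors' released code; the function name and the masking details are illustrative). The gradient of the current Q with respect to the goal is taken with create_graph=True so that the gradient-matching term can itself be backpropagated through, while the target gradient is treated as a fixed regression target.

```python
import torch
import torch.nn.functional as F

def reengage_critic_loss(critic, target_critic, target_actor,
                         s, a, s_next, g, r, alpha, gamma,
                         sparse=False, c_low=-1.0, grad_r=None):
    """Sketch of the ReenGAGE critic loss (Eq. 8; Eq. 9 when sparse=True).

    r:      batch of rewards R(s', g).
    grad_r: batch of dR(s', g)/dg (dense-reward case only; ignored if sparse).
    """
    g = g.clone().detach().requires_grad_(True)  # differentiate w.r.t. the goal

    # Standard DDPG point target.
    q_next = target_critic(s_next, target_actor(s_next, g), g)
    y = r + gamma * q_next

    q = critic(s, a, g)
    loss = F.mse_loss(q, y.detach())

    # Gradient of the current Q w.r.t. g; create_graph=True enables the
    # "double backpropagation" needed to optimize this term.
    dq_dg = torch.autograd.grad(q.sum(), g, create_graph=True)[0]
    # Gradient of the target Q w.r.t. g; used as a fixed supervision signal.
    dy_dg = torch.autograd.grad((gamma * q_next).sum(), g, retain_graph=True)[0]

    if not sparse:
        dy_dg = dy_dg + grad_r               # add the known dR/dg (dense case)
        mask = torch.ones_like(r)
    else:
        mask = (r == c_low).float()          # use the term only where R(s',g) = c_low

    grad_err = ((dq_dg - dy_dg.detach()) ** 2).mean(dim=-1)   # per-sample MSE over dim(g)
    loss = loss + alpha * (mask * grad_err).sum() / mask.sum().clamp(min=1.0)
    return loss
```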
Note that ReenGAGE can only be used in goal-conditioned reinforcement learning problems: in particular, we cannot use gradients with respect to states or actions in a way similar to Equation 7, because, unlike in Equation 7,

∇_{s,a} E_{s′∼T(s,a)}[(·)] ≠ E_{s′∼T(s,a)}[∇_{s,a}(·)],    (10)

since the sampling distribution depends on s and a.

Connection to Knowledge Distillation

We can view Equation 4 as the loss function of a regression problem, fitting Q_θ to the target. We treat g as the independent variable, and s and a as parameters:

∀g:  Q_θ(g; s, a) ≈ Targ.(g; s, a),  where  Targ.(g; s, a) = E_{s′∼T(s,a)}[ R(s′, g) + γ Q_{θ′}(s′, π_{ϕ′}(s′, g), g) ].    (11)

This can be framed as a knowledge distillation problem: we can view Targ.(g; s, a) as a (difficult to compute) teacher function, and we are trying to fit the network Q_θ(g; s, a) to represent the same function of g. Note, however, that conventional knowledge distillation (Hinton, Vinyals, and Dean 2015), which matches the logits of one network to another for classification problems in order to provide richer supervision than simply matching the class label output, cannot be applied here because the output is scalar. However, (Zagoruyko and Komodakis 2017) proposes Gradient-based Attention Transfer, which instead matches the gradients of the student to the teacher using a regularization term. Applied to our Q-function and target, this is:

L_GAT = L_MSE( Q_θ(g; s, a), Targ.(g; s, a) ) + α ‖∇_g Q_θ(g; s, a) − ∇_g Targ.(g; s, a)‖₂²,    (12)

which is in fact the ReenGAGE loss function. Therefore we can think of ReenGAGE as applying knowledge distillation (specifically, Gradient-based Attention Transfer) to the Q-value update.³ See Figure 1 for an illustration.

³(Zagoruyko and Komodakis 2017) actually uses the ℓ₂ distance between the gradients as the regularization term, rather than its square. However, because our gradient estimate is stochastic (in particular, we are using samples of s′ ∼ T(s, a) rather than the expectation), we instead use the mean squared error, so that the current Q gradients will converge to the population mean. (Zagoruyko and Komodakis 2017) notes that their particular choice of the ℓ₂ norm is arbitrary: other metrics should work.

[Figure 1: Illustration comparing the information flow from the Q-value target function to the current Q-function in ReenGAGE, compared to standard DDPG. In ReenGAGE, for each (s, a, s′, g) tuple, the gradient with respect to the goal is used as supervision, while in standard DDPG, only point values are used. Note that in stochastic environments, each tuple only provides a stochastic estimate of the target gradient (in ReenGAGE) or target point value (in DDPG).]

While the computation of the loss function gradient is somewhat more complex here than in standard training, involving mixed partial derivatives, (Zagoruyko and Komodakis 2017) notes that it can still be performed efficiently using modern automatic differentiation packages; in fact, this "double backpropagation" should only scale the computation time by a constant factor (Etmann 2019). Further discussion of this and empirical runtime comparisons are provided in the appendix.

(Zagoruyko and Komodakis 2017) also propose Activation-based Attention Transfer, which transfers intermediate-layer activations from the teacher to the student network rather than gradients; in fact, they report better performance using this method than the gradient method. However, this is not applicable in our case. Firstly, in the dense reward case, we cannot model the reward function in this way. Secondly, unlike the gradient operator, activations are nonlinear: so, even in the sparse case, we cannot assume that the activations of a converged Q-network perfectly modeling the expected target will be equal to the expected activations of the target network (i.e., there is no activation-based equivalent to Equation 7). See (Hsu et al. 2022) for a review of sources of auxiliary network information that can be used for knowledge distillation.
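For readers who prefer the supervised view, the same regularizer in its original student/teacher form (the structure of Equation 12) looks as follows. This is a hedged sketch with illustrative names, not a reference implementation of (Zagoruyko and Komodakis 2017); scalar-output networks are assumed so that it mirrors the Q-value setting.

```python
import torch
import torch.nn.functional as F

def gat_distillation_loss(student, teacher, x, alpha):
    # Gradient-based Attention Transfer in student/teacher form (cf. Eq. 12).
    x = x.clone().detach().requires_grad_(True)

    t_out = teacher(x)
    # Teacher gradient w.r.t. the input: a fixed supervision signal.
    t_grad = torch.autograd.grad(t_out.sum(), x)[0].detach()

    s_out = student(x)
    # Student gradient w.r.t. the input; create_graph=True enables the
    # double backpropagation discussed in the text.
    s_grad = torch.autograd.grad(s_out.sum(), x, create_graph=True)[0]

    return F.mse_loss(s_out, t_out.detach()) + alpha * F.mse_loss(s_grad, t_grad)
```

In ReenGAGE, the "teacher" is the (stochastically estimated) Bellman target as a function of the goal, and the "input" is the goal itself.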
Toy Example Experiments

We first apply ReenGAGE to a simple sparse-reward environment, which we call Continuous Seek. This task is a continuous variant of the discrete Bit-Flipping environment proposed in (Andrychowicz et al. 2017). In our proposed task, the objective is to navigate from an initial state in d-dimensional space to a desired goal state by, at each step, adding an ℓ∞-bounded vector to the current state. Formally:

s, g ∈ [−D, D]^d
T(s, a) = s + a (clipped into [−D, D]^d)
R(s, g) = −1 + 1_{‖s − g‖∞ ≤ ϵ}
initial state s₀ = 0.

In our experiments, we use D = 5, ϵ = 0.1, and we run for 10 steps per episode. We test with d = 5, 10, and 20. The chance that a random state achieves the goal is approximately 1/50^d, so this is an extremely sparse reward problem (as sparse as Bit-Flipping with ≈5.6d bits). As a baseline, we use DDPG with HER. See Figure 2 for results.

[Figure 2: Continuous Seek results for d = 5, 10, and 20 (one panel each; X-axis: training steps, Y-axis: success rate; curves: DDPG+HER and ReenGAGE+HER with α = 0.1/d, 0.2/d, 0.3/d). Lines show the mean and standard deviation over 20 random seeds (kept the same for all experiments). The Y-axis represents the success rate, defined as the fraction of test episodes for which the goal is ever reached.]

For the baseline and each value of α, we performed a grid search over learning rates {0.00025, 0.0005, 0.001, 0.0015} and batch sizes {128, 256, 512}; the curves shown represent the best hyperparameter settings for each α, defined as maximizing the area under the curves. See the appendix for results for all hyperparameter settings. We studied the learning rate specifically to ensure that the ReenGAGE regularization term is not simply scaling up the loss function with similar gradient updates. Other hyperparameters were kept fixed and are listed in the appendix.

We see that ReenGAGE clearly improves over the baseline for larger-dimensionality goals (d = 10 and d = 20): this shows that ReenGAGE can improve the performance of DDPG in such high-dimensional goal settings. See the appendix for a similar experiment with SAC as the base off-policy learning algorithm instead of DDPG.
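For reference, here is a minimal NumPy sketch of the Continuous Seek environment specified above. The class name, the uniform goal distribution, and the unit ℓ∞ action bound are our assumptions where the text does not pin them down.

```python
import numpy as np

class ContinuousSeek:
    """Sketch of the Continuous Seek environment described above.
    Goal sampling (uniform over [-D, D]^d) and the unit action bound are assumptions."""

    def __init__(self, d=10, D=5.0, eps=0.1, horizon=10, rng=None):
        self.d, self.D, self.eps, self.horizon = d, D, eps, horizon
        self.rng = rng if rng is not None else np.random.default_rng()

    def reset(self):
        self.t = 0
        self.s = np.zeros(self.d)                        # initial state s0 = 0
        self.g = self.rng.uniform(-self.D, self.D, self.d)
        return self.s.copy(), self.g.copy()

    def step(self, a):
        a = np.clip(a, -1.0, 1.0)                        # l_inf-bounded action (assumed unit bound)
        self.s = np.clip(self.s + a, -self.D, self.D)    # T(s, a) = clip(s + a)
        self.t += 1
        reached = np.max(np.abs(self.s - self.g)) <= self.eps
        r = -1.0 + float(reached)                        # sparse reward, c_low = -1
        done = self.t >= self.horizon
        return self.s.copy(), r, done, {"is_success": reached}
```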
Robotics Experiments

We tested our method on HandReach, the environment from the OpenAI Gym Robotics suite (Plappert et al. 2018) with the highest-dimensional goal space (d = 15). In this sparse-reward environment, the agent controls a simulated robotic hand with 20-dimensional actions controlling the hand's joints; the goal is to move all of the fingertips to the specified 3-dimensional positions. As a baseline, we use the released DDPG+HER code from (Plappert et al. 2018), with all hyperparameters as originally presented, and only modify the critic loss term. Results are presented in Figure 3.

In this environment, we see that ReenGAGE greatly speeds up convergence compared to the baseline. However, at a high value of α, the success rate declines after first converging. This shows that ReenGAGE may cause some instability if the gradient loss term is too large, and that tuning the coefficient α is necessary (see also the Limitations section below).

[Figure 3: HandReach results. (a) Rendering of the HandReach simulation environment; figure taken from (Plappert et al. 2018). (b) Performance of ReenGAGE on HandReach at three values of α, compared to the DDPG+HER baseline from (Plappert et al. 2018). Lines represent the mean and standard deviation for the same set of 5 seeds. The X-axis is the number of training epochs, as defined in (Plappert et al. 2018), while the Y-axis is the success rate, defined by (Plappert et al. 2018) as the fraction of test episodes where the final state satisfies the goal. (c) Detailed view of (b), showing the epochs before convergence, where the advantage of ReenGAGE is most clear.]

We also tried our method on the HandManipulateBlock environment from the same paper; however, in this lower-dimensional goal environment (d = 7) ReenGAGE was not shown to improve performance. This is compatible with our observation from the Continuous Seek environment that ReenGAGE leads to greater improvements for higher-dimensional tasks, as the dimensionality of the additional goal-gradient information that ReenGAGE propagates increases. Results are provided in the appendix.

Multi-ReenGAGE: ReenGAGE for Multiple Simultaneous Goals

In this section, we propose a variant of ReenGAGE for a specific class of RL environments: environments where the agent is rewarded for achieving any goal in a large set of arbitrary sparse goals, all of which are specified at test time. Formally, we consider goals in the form g = {g₁, ..., g_n}, where n may vary but n ≤ n_max. We consider {0, 1} rewards, where the reward function takes the form:

R(s′, g) = 1 if ∃ gᵢ ∈ g such that R_item(s′, gᵢ) = 1;  0 otherwise.    (13)

In our experiments, we only consider cases where the goals are mutually exclusive, so this is equivalent to:

R(s′, g) = Σ_{gᵢ∈g} R_item(s′, gᵢ).    (14)

We assume that either: (i) the function R_item is known a priori to the agent, or (ii) the item rewards R_item(s′, gᵢ) are observed separately for each goal gᵢ at each time step during training.

This scenario presents several challenges. Firstly, many goal relabeling strategies cannot be directly applied here: strategies such as HER (Andrychowicz et al. 2017) assume that achieved states can be projected down into the space of goals. In this case, the space of goals is much larger than the space of possible states, so this assumption is broken.

[Figure 4: Multi-ReenGAGE results. (a) and (c): Illustrations of the Drive Seek and Noisy Seek environments, respectively: cyan points show goals that were never reached, blue points show goals that were reached, and magenta points show (rounded) non-goal states that were reached. In (c), we see a Noisy Seek agent (correctly) avoiding the trap of going to the nearby isolated points in favor of seeking the larger cluster. (b) and (d): Results for Drive Seek and Noisy Seek, respectively (X-axis: training steps, Y-axis: average reward; curves: DDPG and Multi-ReenGAGE, "M-RG" in the legends, at several values of α). We see that Multi-ReenGAGE substantially improves over standard DDPG for both tasks. Lines are an average of (the same) 5 random seeds. For Noisy Seek, we also show the performance of a perfect greedy agent, which simply goes towards the nearest individual goal. For Noisy Seek, evaluations with more values of α are included in the appendix. Note that for both experiments, the agent takes as input a list of goal coordinates, rather than an image: the agents do not use convolutional layers to interpret the goals. (On Drive Seek, where the coordinates are bounded, we attempted learning from images as well; ReenGAGE still outperformed DDPG, but overall performance was worse for both. This experiment is presented in the appendix.)]
Secondly, we suggest that standard Q-learning is somewhat unsuited to this kind of problem, because it loses information about which goal led to a reward. For instance, if there are 100 goals gᵢ, and a reward is received for a certain state s′, there is no direct indication of which goal was satisfied. This means that a very large number of episodes may need to be run in order to learn the effect of each individual goal on the reward.

We now describe our approach. For concreteness, we will assume that the agent uses an architecture based on Deep Sets (Zaheer et al. 2017) to process the goal-set input (although we believe our technique can likely be adapted to more complex neural set architectures, such as Set Transformer (Lee et al. 2019)). Concretely, this means that our Q-function takes the form:

Q_θ(s, a, g) := Q^head_{θ_h}( s, a, Σ_{gᵢ∈g} Q^encoder_{θ_e}(s, gᵢ) )    (15)

and the policy has a similar architecture. Note that Q^encoder_{θ_e} outputs a vector-valued embedding for a given goal gᵢ. From this baseline, we introduce a set of scalar gate variables bᵢ:

Q_θ(s, a, g) := Q^head_{θ_h}( s, a, Σ_{i=1}^n bᵢ Q^encoder_{θ_e}(s, gᵢ) ).    (16)

Each gate bᵢ is set to 1. However, if bᵢ were zero, this would be equivalent to the goal gᵢ being absent from the set g. We then treat the gate variables as differentiable. If a certain goal gᵢ contributes to the Q-function (i.e., if it is likely to be satisfied), then we expect Q_θ(s, a, g) to be highly sensitive to bᵢ; in other words, we expect ∂Q_θ/∂bᵢ to be large. Then ∇_b Q represents the importance of each goal to the Q-value function.

Our key insight is that we can use a ReenGAGE-style loss to transfer ∇_b Q from the target to the current Q-value estimate, therefore preserving attention on the relevant goals. However, this requires us to have a value for ∇_b R(s′, g). Note that the reward can be written as:

R(s′, g) = Σ_{gᵢ∈g} bᵢ R_item(s′, gᵢ).    (17)

Setting bᵢ = 0 is again like gᵢ being absent. Interpolating:

∂R/∂bᵢ := R_item(s′, gᵢ),    (18)

which gives us a ground-truth reward gradient we can compute. This yields the following loss function:

L_Multi-ReenGAGE = L_DDPG-Critic + α L_mse( ∇_b Q_θ(s, a, g),  R_item(s′, g) + γ ∇_b Q_{θ′}(s′, π_{ϕ′}(s′, g), g) ),    (19)

where [R_item(s′, g)]ᵢ := R_item(s′, gᵢ).

In practice, we make two modifications to this algorithm. First, we use bᵢ² as the gate rather than bᵢ.⁴ While algebraically this should do nothing but multiply the gradient loss term by 4, it is important for vectorized implementation; see the appendix for details. Second, we share the encoder Q^encoder between the Q-function and the policy π. This is so the policy does not have to learn to interpret the goal set from scratch, and it is empirically important (see the ablation study in the appendix). We train the encoder only during critic training.

⁴And use 2R_item(s′, g) instead of R_item(s′, g).
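The sketch below shows one possible implementation of the gated Deep Sets critic (Equation 16) and the gradient-matching term of Equation 19 under the squared-gate convention of footnote 4. Padding of variable-size goal sets, layer sizes, and function names are our own choices, the target actor is assumed to take the (masked) goal set as input, and the term shown is meant to be added to the standard DDPG critic loss with coefficient α.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSetCritic(nn.Module):
    # Deep Sets critic with per-goal gates (cf. Eq. 16); sizes are illustrative.
    def __init__(self, state_dim, action_dim, goal_item_dim, emb=64, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(                          # Q^encoder(s, g_i)
            nn.Linear(state_dim + goal_item_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, emb))
        self.head = nn.Sequential(                             # Q^head(s, a, pooled)
            nn.Linear(state_dim + action_dim + emb, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, s, a, goals, goal_mask, b):
        # goals: (B, n_max, goal_item_dim); goal_mask: (B, n_max), 1 for real goals, 0 for padding.
        n = goals.shape[1]
        s_rep = s.unsqueeze(1).expand(-1, n, -1)
        enc = self.encoder(torch.cat([s_rep, goals], dim=-1))  # (B, n_max, emb)
        gate = (b ** 2) * goal_mask                            # squared gates (footnote 4)
        pooled = (gate.unsqueeze(-1) * enc).sum(dim=1)         # permutation-invariant sum-pooling
        return self.head(torch.cat([s, a, pooled], dim=-1)).squeeze(-1)

def multi_reengage_grad_term(critic, target_critic, target_actor,
                             s, a, s_next, goals, goal_mask, r_item, gamma):
    # Gradient-matching term of Eq. 19, with 2 * R_item in the target because of
    # the b^2 gate convention; r_item has shape (B, n_max).
    b = torch.ones_like(goal_mask, requires_grad=True)         # gates set to 1

    q = critic(s, a, goals, goal_mask, b)
    dq_db = torch.autograd.grad(q.sum(), b, create_graph=True)[0]

    a_next = target_actor(s_next, goals, goal_mask)            # policy shares the encoder
    q_next = target_critic(s_next, a_next, goals, goal_mask, b)
    dtarget_db = 2.0 * r_item + gamma * torch.autograd.grad(q_next.sum(), b)[0]

    return F.mse_loss(dq_db * goal_mask, dtarget_db.detach() * goal_mask)
```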
Experiments

We test Multi-ReenGAGE on two environments: Drive Seek and Noisy Seek. Both environments are constructed such that a greedy strategy of simply going to the closest individual goal is not optimal, so the entire goal set must be considered. We describe the environments informally here and provide additional detail in the appendix.

Drive Seek is a deterministic environment, where the continuous position s_pos ∈ [−10, 10]² always moves with constant ℓ₂ speed 1, in a direction determined by a velocity vector s_vel on the unit circle. At each step, the agent takes a 1-dimensional action a ∈ [−0.5, 0.5], which represents angular acceleration: it specifies an angle in radians which is added to the angle of the velocity vector. At the edges of the space, the state position wraps around to the opposite edge. The objective is to reach any of up to n_max = 200 goals. The goals all lie on integer coordinates in [−10, 10]², and the agent receives a reward if its coordinates round to a goal. In addition to s_pos and s_vel, the agent receives an observation of its current rounded position. The agent also receives as input the current list of goal coordinates. Note that the agent cannot simply stay at a single goal, or take an arbitrary path between goals: it is constrained to making wide turns. Therefore all goals must be considered in planning an optimal trajectory.

Noisy Seek is a randomized environment. In it, s ∈ ℝ², a ∈ ℝ², with ‖a‖₂ ≤ 1, and the transition function is defined as T(s, a) = N(s + a, I). In other words, the agent moves through space at a capped speed, and noise is constantly added to the position. The goals are defined as integer coordinates in a similar manner to Drive Seek, but without a box constraint. Additionally, the goal distribution is such that goals tend to be clustered together. Note that a greedy agent that simply goes to the nearest goal is suboptimal, because the probability of consistently reaching that one goal is low: it is better to seek clusters.

Results are presented in Figure 4. We see that Multi-ReenGAGE substantially outperforms the baseline of DDPG on both environments.
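For concreteness, here is a minimal sketch of a Noisy Seek-style transition as described above. Goal-cluster sampling and episode handling are omitted, and the rounding rule for goal satisfaction is our reading of "defined as integer coordinates".

```python
import numpy as np

def noisy_seek_step(s, a, goal_set, rng):
    """One transition of a Noisy Seek-style environment (sketch).

    s: current position in R^2;  a: action with ||a||_2 <= 1;
    goal_set: set of integer-coordinate goal tuples, e.g. {(3, 4), (3, 5)}.
    """
    norm = np.linalg.norm(a)
    if norm > 1.0:                                   # enforce the speed cap ||a||_2 <= 1
        a = a / norm
    s_next = rng.normal(loc=s + a, scale=1.0)        # T(s, a) = N(s + a, I)
    # Reward 1 if the rounded next position coincides with any goal in the set.
    r = 1.0 if tuple(np.round(s_next).astype(int)) in goal_set else 0.0
    return s_next, r
```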
Theoretical Properties

Bias

In the Preliminaries section, we discuss that goal relabeling strategies can exhibit bias if Equation 6 is not respected. In the dense reward case, our method does not cause bias of this sort (although such bias may be present if our method is combined with a relabeling strategy). However, in the sparse reward case, if the transitions are nondeterministic, our method may cause a similar bias. In particular, note that, in the sparse case, Equation 9 effectively trains the gradient of the Q-value function to match the following target:

∇_g Q_θ(s, a, g) := E_{s′∼T(s,a)}[ γ ∇_g Q_{θ′}(s′, π_{ϕ′}(s′, g), g) | R(s′, g) = c_low ]
                 = ∇_g E_{s′∼T(s,a)}[ R(s′, g) + γ Q_{θ′}(s′, π_{ϕ′}(s′, g), g) | R(s′, g) = c_low ],    (20)

where the last line holds exactly if the boundary points where R(s′, g) = c_low but ∇_g R(s′, g) ≠ 0 are of measure zero (and the derivative is defined at all such points).

Equation 20 shows us that, in the sparse case, our method trains the gradient of the Q-function to match an expected target gradient where the expectation is taken over a biased distribution: if

Pr_T[s′ | s, a; R(s′, g) = c_low] ≠ Pr_T[s′ | s, a],    (21)

then this will cause bias in our target gradient estimate, in a similar manner to the hindsight bias of HER described by (Schroecker and Isbell 2020). Note that this is only an issue in nondeterministic environments: in deterministic environments, for a given (s, a, g), either R(s′, g) is always ≠ c_low, in which case the gradient term is never involved in training, or R(s′, g) is always = c_low, in which case

Pr_T[s′ | s, a; R(s′, g) = c_low] = Pr_T[s′ | s, a] = 1.    (22)

Learning Efficiency

In this section, we provide examples of classes of environments for which our method will result in provably more efficient learning than standard DDPG-style updates. We treat both the dense reward case (in which we have access to the gradient of the reward function) and the sparse reward case (in which we do not).

Dense Reward Example

Consider the class of simple, deterministic environments described as follows:

g, s, a ∈ ℝ^d;  ‖a‖₂ ≤ 1
R(s, g) = gᵀs
T(s, a) = s + Ua,

where U ∈ ℝ^{d×d} is an unknown orthogonal (rotation) matrix. This environment class is illustrated in Figure 5. Environments of this class are parameterized by U, so the learning task is to estimate U. We make the following assumptions:

- The hypothesis class consists of all environments with dynamics of the type described above. We therefore take as an inductive bias that each model in the considered model class consists of a Q-function Q_Û and policy π_Û which are in the form of the optimal Q-function and policy for an estimate of U, notated as Û ∈ ℝ^{d×d} (constrained to be orthogonal).
- We assume that Q_Û and π_Û share the same estimated parameter Û. (This is analogous to, although admittedly stronger than, the parameter sharing we used for Multi-ReenGAGE.) Taken with the above assumption, this implies that a = π_Û(s, g) maximizes Q_Û(s, a, g), so we do not need to train π separately. Similar parameter sharing occurs between the target policy and Q-function.
- We are comparing our method, minimizing the loss in Equation 8, with minimizing the vanilla DDPG loss (Equation 4).
- States and actions in the replay buffer are in general position.

[Figure 5: Example dense reward environment class. The agent receives a reward proportional to the projection of the state vector onto the goal vector at each step, and can change the state vector by adding an action vector with ℓ₂ norm up to 1 at each step. However, at each step, this action is distorted by an unknown rotation U: the agent must learn to compensate for this distortion.]

In this case, the following proposition holds:

Proposition 1. Under the above assumptions, minimizing the ReenGAGE loss can learn U (and therefore learn the optimal policy) using O(d) unique replay buffer transitions. However, minimizing the standard DDPG loss requires at least O(d²) unique transitions to successfully learn U.

Proofs are provided in the appendix. This result shows that, in some cases, ReenGAGE requires asymptotically less replay data to successfully learn to perform a task than standard DDPG.
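A few lines suffice to instantiate the dense-reward environment class of Proposition 1; generating the unknown orthogonal U via a QR decomposition is an implementation choice of ours, not something specified in the text.

```python
import numpy as np

class RotatedControlEnv:
    """Dense-reward environment class of Proposition 1 (sketch):
    R(s, g) = g^T s,   T(s, a) = s + U a,   U an unknown orthogonal matrix."""

    def __init__(self, d, rng=None):
        self.rng = rng if rng is not None else np.random.default_rng()
        q, _ = np.linalg.qr(self.rng.normal(size=(d, d)))
        self.U = q                          # random orthogonal matrix, hidden from the agent
        self.d = d

    def step(self, s, a, g):
        norm = np.linalg.norm(a)
        if norm > 1.0:                      # enforce ||a||_2 <= 1
            a = a / norm
        s_next = s + self.U @ a             # deterministic transition
        return s_next, float(g @ s_next)    # reward R(s', g) = g^T s'
```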
Sparse Reward Case

The result shown above might be unsurprising to many readers. Specifically, because the gradient ∇_g R(s′, g) is used by our method and not by standard DDPG, in the dense-reward case our method is utilizing more information from the environment (to the extent that R, which we assume that agents know a priori, is part of the "environment") than the standard algorithm. However, here we show a class of sparse reward environments for which the same result holds, despite ∇_g R(s′, g) being unavailable. The environments are constructed as follows:

g, a ∈ ℝ^d;  ‖a‖₂ ≤ 1;  s ∈ ℝ^{2d}, where the state vector consists of two halves, denoted s₁ and s₂; we write s as (s₁; s₂)
R(s, g) = gᵀs₁
T(s, a) = (0; s₁ + Ua) if s₁ ≠ 0;  (s₂; 0) if s₁ = 0,

where U ∈ ℝ^{d×d} is an unknown orthogonal (rotation) matrix. We define c_low = 0. See Figure 6 for an illustration.

[Figure 6: Example sparse reward environment class. If s₁ is initially nonzero, then the (rotated) action Ua is added to it, as in the dense case. However, the resulting vector is immediately stored in s₂, and s₁ is zeroed: this means that no immediate reward is obtained. In the next step, with s₁ zero, the action is ignored and s₂ is reloaded into s₁, resulting in a reward that depends on the previous action.]

Note that this satisfies the sparseness properties: namely, R(s′, g) = 0 = c_low at least every other step; and, when R(s′, g) = 0, then ∇_g R(s′, g) = 0 (assuming general position). It is also deterministic, so we do not need to worry about the bias discussed in the previous section. We can therefore apply the sparse version of our method (Equation 9), which does not use gradient feedback from the reward:

Proposition 2. Under the same assumptions as Proposition 1 (replacing Equation 8 with Equation 9), minimizing the ReenGAGE loss can learn U in the sparse environment class using O(d) unique replay buffer transitions. However, minimizing the standard DDPG loss requires at least O(d²) unique transitions to successfully learn U.

This example is admittedly a bit contrived: the single-step reward can always be computed without knowledge of the parameter U. However, it may still give insight about real-world scenarios in which predicting the immediate reward is much easier than understanding the long-term dynamics.

Note that these two scaling results apply to the number of replay buffer transitions. In particular, if a goal relabeling algorithm is used on top of DDPG, then O(d²) replay buffer transitions may be able to be constructed from O(d) observed training rollout transitions, so standard DDPG combined with goal relabeling might only require O(d) training rollout transitions. However, this would be computationally expensive, and may not work in practice for particular goal relabeling algorithms. (HER, for instance, only relabels using achieved states from the same episode: if the episode length is O(1) in d, then O(d²) observed training rollout transitions would still be required for DDPG+HER.) Also, goal relabeling techniques require a priori knowledge of the function R, while in the sparse example, ReenGAGE does not (although in the case of this example, we assume that we are using the correct "hypothesis class", i.e., the functional form of Q_Û: constructing this in practice would likely require knowing R).
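The sparse environment class of Proposition 2 can be sketched similarly; the "load/unload" structure of Figure 6 is made explicit below, with the zero-test on s₁ taken literally as exact equality.

```python
import numpy as np

def sparse_rotated_step(s1, s2, a, g, U):
    """One step of the sparse environment class of Proposition 2 (sketch).
    State is (s1; s2); reward depends only on s1 and is zero on 'loading' steps."""
    norm = np.linalg.norm(a)
    if norm > 1.0:                                     # enforce ||a||_2 <= 1
        a = a / norm
    if np.any(s1 != 0):                                # s1 nonzero: apply the rotated action,
        s1_next, s2_next = np.zeros_like(s1), s1 + U @ a   # stash the result in s2, zero out s1
    else:                                              # s1 zero: reload s2 into s1, ignore a
        s1_next, s2_next = s2.copy(), np.zeros_like(s2)
    r = float(g @ s1_next)                             # R(s', g) = g^T s1'; zero on loading steps
    return s1_next, s2_next, r
```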
Related Works

Many prior approaches have been taken to the goal-conditioned reinforcement learning problem (Schaul et al. 2015). See (Liu, Zhu, and Zhang 2022) for a recent survey of this area. One line of work for this problem involves automated curriculum generation: here, the idea is to select goals during training that are dynamically chosen to be the most informative (Florensa et al. 2018; Sukhbaatar et al. 2018; Zhang, Abbeel, and Pinto 2020). In the off-policy reinforcement learning setting, a related technique becomes a possibility: one can relabel past experiences with counterfactual goals. This allows a single experienced transition to be used to train for multiple goals, and the relabeled goals can be chosen using various heuristics to improve training (Andrychowicz et al. 2017; Nair et al. 2018; Yang et al. 2021; Fang et al. 2019). Note that our proposed method can be combined with any of these off-policy techniques. (Schroecker and Isbell 2020) discusses bias that can result from some goal relabeling techniques. (Eysenbach, Salakhutdinov, and Levine 2021) proposes a method based on recursive classification which is in practice similar to hindsight relabeling, but requires less parameter tuning.

In alternative approaches to goal-conditioned RL, (Eysenbach et al. 2022) has proposed an on-policy goal-conditioned reinforcement learning technique using contrastive learning, while (Janner et al. 2022) and (Janner, Li, and Levine 2021) propose model-based techniques. Note that our proposed method is distinct from policy distillation (Rusu et al. 2016): the goal of policy distillation is to consolidate one or more already-trained policy networks into a smaller network, whereas our method is intended to improve initial training.

Some prior (Manchin, Abbasnejad, and Hengel 2019; Choi, Lee, and Zhang 2017; Wu, Khetarpal, and Precup 2021) and concurrent (Bertoin et al. 2022) works have focused on using attention-based mechanisms to improve either the performance or interpretability of reinforcement learning algorithms. However, to our knowledge, ours is the first to apply gradient-based attention transfer to the critic update to enhance goal-conditioned off-policy reinforcement learning.

Some prior works have been proposed for goal-seeking with structured, complex goals made up of sub-goals, similar to (and in some cases more general than) the multi-goal setting that Multi-ReenGAGE is designed for. Some of these works (Oh et al. 2017; Sohn, Oh, and Lee 2018) use a hierarchical policy; however, such a structure may be unable to represent the true optimal policy (Vaezipoor et al. 2021). (Vaezipoor et al. 2021) proposes a method without this limitation, although the setting considered (Linear Temporal Logic) is different from the multi-goal setting considered here, in that a reward is achieved at most once per episode. (Touati and Ollivier 2021) proposes a method for arbitrary reward functions specified at test time, under discrete action spaces; in concurrent work (Touati, Rapin, and Ollivier 2023), this is generalized to continuous actions. (Janner et al. 2022), a model-based technique mentioned above, can use reward function gradient information to adapt to an arbitrary shaped (i.e., non-sparse) reward function at test time.

Limitations and Conclusion

ReenGAGE has some important limitations. For example, we have seen that the hyperparameter α requires tuning and can vary greatly (likely due to diverse scales in goal coordinates and rewards), and the benefits of ReenGAGE seem limited to tasks with high goal dimension. Still, ReenGAGE represents a novel approach to goal-conditioned RL, with benefits demonstrated both empirically and theoretically.
In future work, we are particularly interested in exploring the use of Multi-ReenGAGE in safety and robustness applications. In particular, the ability to encode many simultaneous goals at test time could allow the agent to consider many backup goals, all of which are acceptable, rather than forcing the agent to focus on only a single goal (resulting in total failure if that goal is unreachable).

Acknowledgements

This project was supported in part by NSF CAREER AWARD 1942230 and ONR YIP award N00014-22-1-2271.

References

Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Pieter Abbeel, O.; and Zaremba, W. 2017. Hindsight experience replay. Advances in Neural Information Processing Systems, 30.
Bertoin, D.; Zouitine, A.; Zouitine, M.; and Rachelson, E. 2022. Look where you look! Saliency-guided Q-networks for generalization in visual Reinforcement Learning. In Oh, A. H.; Agarwal, A.; Belgrave, D.; and Cho, K., eds., Advances in Neural Information Processing Systems.
Choi, J.; Lee, B.-J.; and Zhang, B.-T. 2017. Multi-Focus Attention Network for Efficient Deep Reinforcement Learning. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence.
Etmann, C. 2019. A Closer Look at Double Backpropagation. arXiv, abs/1906.06637.
Eysenbach, B.; Salakhutdinov, R.; and Levine, S. 2021. C-Learning: Learning to Achieve Goals via Recursive Classification. In International Conference on Learning Representations.
Eysenbach, B.; Zhang, T.; Salakhutdinov, R.; and Levine, S. 2022. Contrastive Learning as Goal-Conditioned Reinforcement Learning. arXiv preprint arXiv:2206.07568.
Fang, M.; Zhou, T.; Du, Y.; Han, L.; and Zhang, Z. 2019. Curriculum-guided hindsight experience replay. Advances in Neural Information Processing Systems, 32.
Florensa, C.; Held, D.; Geng, X.; and Abbeel, P. 2018. Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning, 1515–1528. PMLR.
Fujimoto, S.; Hoof, H.; and Meger, D. 2018. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, 1587–1596. PMLR.
Fujimoto, S.; Meger, D.; and Precup, D. 2019. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, 2052–2062. PMLR.
Gou, J.; Yu, B.; Maybank, S. J.; and Tao, D. 2021. Knowledge distillation: A survey. International Journal of Computer Vision, 129(6): 1789–1819.
Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 1861–1870. PMLR.
Hinton, G. E.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network. arXiv, abs/1503.02531.
Hsu, Y.-C.; Smith, J.; Shen, Y.; Kira, Z.; and Jin, H. 2022. A Closer Look at Knowledge Distillation with Features, Logits, and Gradients. arXiv preprint arXiv:2203.10163.
Janner, M.; Du, Y.; Tenenbaum, J. B.; and Levine, S. 2022. Planning with Diffusion for Flexible Behavior Synthesis. arXiv preprint arXiv:2205.09991.
Janner, M.; Li, Q.; and Levine, S. 2021. Offline Reinforcement Learning as One Big Sequence Modeling Problem. In Beygelzimer, A.; Dauphin, Y.; Liang, P.; and Vaughan, J. W., eds., Advances in Neural Information Processing Systems.
Lee, J.; Lee, Y.; Kim, J.; Kosiorek, A.; Choi, S.; and Teh, Y. W. 2019. Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, 3744–3753. PMLR.
Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2016. Continuous control with deep reinforcement learning. In Bengio, Y.; and LeCun, Y., eds., 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Conference Track Proceedings.
Liu, M.; Zhu, M.; and Zhang, W. 2022. Goal-Conditioned Reinforcement Learning: Problems and Solutions. In Raedt, L. D., ed., Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, 5502–5511. International Joint Conferences on Artificial Intelligence Organization. Survey Track.
Manchin, A.; Abbasnejad, E.; and Hengel, A. v. d. 2019. Reinforcement learning with attention that works: A self-supervised approach. In International Conference on Neural Information Processing, 223–230. Springer.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540): 529–533.
Nair, A. V.; Pong, V.; Dalal, M.; Bahl, S.; Lin, S.; and Levine, S. 2018. Visual reinforcement learning with imagined goals. Advances in Neural Information Processing Systems, 31.
Oh, J.; Singh, S.; Lee, H.; and Kohli, P. 2017. Zero-shot task generalization with multi-task deep reinforcement learning. In International Conference on Machine Learning, 2661–2670. PMLR.
Plappert, M.; Andrychowicz, M.; Ray, A.; McGrew, B.; Baker, B.; Powell, G.; Schneider, J.; Tobin, J.; Chociej, M.; Welinder, P.; et al. 2018. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464.
Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; and Dormann, N. 2021. Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research, 22(268): 1–8.
Rusu, A. A.; Colmenarejo, S. G.; Gülçehre, Ç.; Desjardins, G.; Kirkpatrick, J.; Pascanu, R.; Mnih, V.; Kavukcuoglu, K.; and Hadsell, R. 2016. Policy Distillation. In ICLR (Poster).
Schaul, T.; Horgan, D.; Gregor, K.; and Silver, D. 2015. Universal Value Function Approximators. In Bach, F.; and Blei, D., eds., Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, 1312–1320. Lille, France: PMLR.
Schroecker, Y.; and Isbell, C. 2020. Universal value density estimation for imitation learning and goal-conditioned reinforcement learning. arXiv preprint arXiv:2002.06473.
Sohn, S.; Oh, J.; and Lee, H. 2018. Hierarchical reinforcement learning for zero-shot generalization with subtask dependencies. Advances in Neural Information Processing Systems, 31.
Sukhbaatar, S.; Lin, Z.; Kostrikov, I.; Synnaeve, G.; Szlam, A.; and Fergus, R. 2018. Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play. In International Conference on Learning Representations.
Touati, A.; and Ollivier, Y. 2021. Learning one representation to optimize all rewards. Advances in Neural Information Processing Systems, 34: 13–23.
Touati, A.; Rapin, J.; and Ollivier, Y. 2023. Does Zero-Shot Reinforcement Learning Exist? In The Eleventh International Conference on Learning Representations.
Vaezipoor, P.; Li, A. C.; Icarte, R. A. T.; and McIlraith, S. A. 2021. LTL2Action: Generalizing LTL instructions for multi-task RL. In International Conference on Machine Learning, 10497–10508. PMLR.
Watkins, C. J.; and Dayan, P. 1992. Q-learning. Machine Learning, 8(3): 279–292.
Wu, H.; Khetarpal, K.; and Precup, D. 2021. Self-Supervised Attention-Aware Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(12): 10311–10319.
Yang, R.; Fang, M.; Han, L.; Du, Y.; Luo, F.; and Li, X. 2021. MHER: Model-based Hindsight Experience Replay. In Deep RL Workshop NeurIPS 2021.
Zagoruyko, S.; and Komodakis, N. 2017. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. In International Conference on Learning Representations.
Zaheer, M.; Kottur, S.; Ravanbakhsh, S.; Poczos, B.; Salakhutdinov, R. R.; and Smola, A. J. 2017. Deep Sets. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Zhang, Y.; Abbeel, P.; and Pinto, L. 2020. Automatic curriculum learning through value disagreement. Advances in Neural Information Processing Systems, 33: 7648–7659.