# Continual Auxiliary Task Learning

Matthew McLeod, Chunlok Lo, Matthew Schlegel, Andrew Jacobsen, Raksha Kumaraswamy
Department of Computing Science, University of Alberta
{mmcleod2,chunlok,mkschleg,ajjacobs,kumarasw}@ualberta.ca

Martha White, Adam White
Department of Computing Science, University of Alberta
Canada CIFAR AI Chair, Alberta Machine Intelligence Institute (Amii)
{whitem,amw8}@ualberta.ca

Learning auxiliary tasks, such as multiple predictions about the world, can provide many benefits to reinforcement learning systems. A variety of off-policy learning algorithms have been developed to learn such predictions, but as yet there is little work on how to adapt the behavior to gather useful data for those off-policy predictions. In this work, we investigate a reinforcement learning system designed to learn a collection of auxiliary tasks, with a behavior policy learning to take actions to improve those auxiliary predictions. We highlight the inherent non-stationarity in this continual auxiliary task learning problem, for both prediction learners and the behavior learner. We develop an algorithm based on successor features that facilitates tracking under non-stationary rewards, and prove that the separation into learning successor features and rewards provides convergence rate improvements. We conduct an in-depth study into the resulting multi-prediction learning system.

1 Introduction

In never-ending learning systems, the agent often faces long periods of time when the external reward is uninformative. A smart agent should use this time to practice reaching subgoals, learning new skills, and refining model predictions. Later, the agent should use this prior learning to efficiently maximize external reward. The agent engages in this self-directed learning during times when its primary drives (e.g., hunger) are satisfied. At other times, the agent might have to trade off directly acting towards internal auxiliary learning objectives and taking actions that maximize reward. In this paper we investigate how an agent should select actions to balance the needs of several auxiliary learning objectives in a reward-free setting, where no external reward is present. In particular, we assume the agent's auxiliary objectives are to learn a diverse set of value functions corresponding to a set of fixed policies. Our solution, at a high level, is straightforward: each auxiliary value function is learned in parallel and off-policy, and the behavior selects actions to maximize learning progress. Prior work investigated similar questions in a state-less, bandit-like setting, where neither off-policy learning nor function approximation is required [Linke et al., 2020]. Otherwise, the majority of prior work has focused on how the agent could make use of auxiliary learning objectives, not how behavior could be used to improve auxiliary task learning. Some work has looked at defining (predictive) features, such as successor features and a basis of policies [Barreto et al., 2018, Borsa et al., 2019, Barreto et al., 2020, 2019]; universal value function approximators [Schaul et al., 2015]; and features based on value predictions [Schaul and Ring, 2013, Schlegel et al., 2021].
The other focus has been exploration: using auxiliary learning objectives to generate bonuses to aid exploration on the main task [Pathak et al., 2017, Stadie et al., 2015, Badia et al., 2020, Burda et al., 2019]; using a given set of policies in a call-return fashion for scheduled auxiliary control [Riedmiller et al., 2018]; and discovering subgoals in environments where it is difficult for the agent to reach particular parts of the state-action space [Machado et al., 2017, Colas et al., 2019, Zhang et al., 2020, Andrychowicz et al., 2017, Pong et al., 2019]. In all of these works, the behavior was either fixed or optimized for the main task.

The problem of adapting the behavior to optimize many auxiliary predictions in the absence of external reward is sufficiently complex to merit study in isolation. It involves several inter-dependent learning mechanisms, multiple sources of non-stationarity, and high variance due to off-policy updating. If we cannot design learning systems that efficiently learn their auxiliary objectives in isolation, then the agent is unlikely to learn its auxiliary tasks while additionally balancing external reward maximization. Further, understanding how to efficiently learn a collection of auxiliary objectives is complementary to the goals of using those auxiliary objectives. It could amplify the auxiliary task effect in UNREAL [Jaderberg et al., 2017], improve the efficiency and accuracy of learning successor features and universal value function approximators, and improve the quality of the sub-policies used in scheduled auxiliary control. It can also benefit the numerous systems that discover options, skills, and subgoals [Gregor et al., 2017, Eysenbach et al., 2019a, Veeriah et al., 2019, Pitis et al., 2020, Nair et al., 2020, Pertsch et al., 2020, Colas et al., 2019, Eysenbach et al., 2019b], by providing improved algorithms to learn the resulting auxiliary tasks. For example, for multiple discovered subgoals, the agent can adapt its behavior to efficiently learn policies to reach each subgoal.

In this paper we introduce an architecture for parallel auxiliary task learning. As this is the first work to tackle this question in reinforcement learning with function approximation, numerous algorithmic challenges arise. We first formalize the problem of learning multiple predictions as a reinforcement learning problem, and highlight that the rewards for the behavior policy are inherently non-stationary due to changes in learning progress over time. We develop a strategy that uses successor features to exploit the stationarity of the dynamics, whilst allowing for fast tracking of changes in the rewards, and prove that this separation provides a faster convergence rate than standard value function algorithms like temporal difference learning. We empirically show that this separation facilitates tracking both for prediction learners with non-stationary targets and for the behavior.

2 Problem Formulation

We consider the multi-prediction problem, in which an agent continually interacts with an environment to obtain accurate predictions. This interaction is formalized as a Markov decision process (MDP), defined by a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, and a transition probability function $P(s, a, s')$. The agent's goal, when taking actions, is to gather data that is useful for learning $N$ predictions, where each prediction corresponds to a general value function (GVF) [Sutton et al., 2011].
A GVF question is formalized as a three-tuple $(\pi, \gamma, c)$, where the target is the expected return of the cumulant, defined by $c : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$, when following policy $\pi : \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$, discounted by $\gamma : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$. More precisely, the target is the action-value $Q^\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$ for $G_t \doteq C_{t+1} + \gamma_{t+1} G_{t+1}$, with $C_{t+1} \doteq c(S_t, A_t, S_{t+1})$ and $\gamma_{t+1} \doteq \gamma(S_t, A_t, S_{t+1})$. The extension of $\gamma$ to transitions allows for a broader class of problems, including easily specifying termination, without complicating the theory [White, 2017]. The expectation is under policy $\pi$, with transitions according to $P$. The prediction targets could also be state-value functions; we assume the targets are action-values in this work to provide a unified discussion of successor features for both the GVF and behavior learners.

At each time step, the agent produces $N$ predictions, $\hat{Q}^{(j)}_t(S_t, A_t)$ for prediction $j$ with true target $Q^{(j)}_t(S_t, A_t)$. We assume the GVF questions can change over time, and so $Q^{(j)}$ can change with time. The goal is to have low error in the predictions, in terms of the root mean-squared value error (RMSVE) under a state-action weighting $d : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$:
$$\text{RMSVE}(\hat{Q}, Q) \doteq \sqrt{\sum_{s, a} d(s, a)\big(\hat{Q}(s, a) - Q(s, a)\big)^2}. \qquad (1)$$
The total error up to time step $t$, across all predictions, is $\text{TE}_t \doteq \sum_{k \le t} \sum_{j=1}^{N} \text{RMSVE}(\hat{Q}^{(j)}_k, Q^{(j)}_k)$.

[Figure: the multi-prediction learning system, with Task 1 through Task N prediction learners, a weight-change intrinsic reward, and the state/observation coming from the environment.]

The agent's goal is to gather data and update its predictions to make TE small. This goal can itself be formalized as a reinforcement learning problem, by defining rewards for the behavior policy that depend on the agent's predictions. Such rewards are often called intrinsic rewards. For example, if we could directly measure the RMSVE, one potential intrinsic reward would be the decrease in the RMSVE after taking action $A_t$ from state $S_t$ and transitioning to $S_{t+1}$. This reflects the agent's learning progress: how much it was able to learn due to that new experience. The reward is high if the action generated data that resulted in substantial learning. While the RMSVE is the most direct measure of learning progress, it cannot be calculated without the true values.

Many intrinsic rewards have been considered to estimate the learning progress of predictions. A recent work provided a thorough survey of the different options, as well as an empirical study [Linke et al., 2020]. Their conclusion was that, for reasonable prediction learners, simple learning progress measures like the change in weights are effective for producing useful data gathering. We rely on this conclusion here, and formalize the problem using the $\ell_1$ norm of the change in weights. Other intrinsic rewards could be swapped into the framework; but, because our focus is on the non-stationarity in the system and because empirically we found this weight-change intrinsic reward to be effective, we opt for this simple choice upfront. We provide generic pseudocode for a multi-prediction reinforcement learning system in Algorithm 1.

Algorithm 1 Multi-Prediction Learning System
Input: $N$ GVF questions
Initialize behavior policy parameters $\theta_0$ and GVF learner weights $w^{(1)}_0, \ldots, w^{(N)}_0$
Obtain initial observation $S_0$
for $t = 0, 1, \ldots$ do
  Choose action $A_t$ according to $\pi_{\theta_t}(\cdot \mid S_t)$
  Observe next state vector $S_{t+1}$ and $\gamma_{t+1}$
  // Update predictions with new data
  for $j = 1$ to $N$ do
    $c \leftarrow c^{(j)}(S_t, A_t, S_{t+1})$, $\gamma \leftarrow \gamma^{(j)}(S_t, A_t, S_{t+1})$
    Update $w^{(j)}_t$ with $(S_t, A_t, c, S_{t+1}, \gamma)$
  // Compute intrinsic reward, update behavior
  $R_{t+1} \leftarrow \sum_{j=1}^{N} \lVert w^{(j)}_{t+1} - w^{(j)}_t \rVert_1$
  Update $\theta_t$ with $(S_t, A_t, R_{t+1}, S_{t+1}, \gamma_{t+1})$
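To make the interaction between the pieces concrete, the following is a minimal Python sketch of this loop. The environment interface (`env.reset`, `env.step`), the learner interfaces (`.weights`, `.update`, `.select_action`), and all names are illustrative assumptions rather than the implementation used in our experiments; only the weight-change intrinsic reward mirrors Algorithm 1 directly.

```python
import numpy as np

def weight_change_reward(old_weights, new_weights):
    """Intrinsic reward: sum of l1-norm weight changes across all GVF learners."""
    return sum(np.sum(np.abs(w_new - w_old))
               for w_old, w_new in zip(old_weights, new_weights))

def run_multi_prediction(env, behavior, gvf_learners, gvfs, num_steps):
    """Generic multi-prediction loop (sketch of Algorithm 1).

    gvfs: list of (policy, cumulant_fn, gamma_fn) tuples defining each GVF.
    gvf_learners: off-policy learners exposing .weights and .update(...).
    behavior: a control learner exposing .select_action and .update(...).
    """
    s = env.reset()
    for t in range(num_steps):
        a = behavior.select_action(s)
        s_next, gamma_b = env.step(a)   # behavior's transition-based discount

        old_w = [learner.weights.copy() for learner in gvf_learners]
        for learner, (pi, cumulant, gamma) in zip(gvf_learners, gvfs):
            c = cumulant(s, a, s_next)
            g = gamma(s, a, s_next)
            learner.update(s, a, c, s_next, g, target_policy=pi)
        new_w = [learner.weights for learner in gvf_learners]

        # Intrinsic reward for the behavior: how much the predictions moved.
        r_intrinsic = weight_change_reward(old_w, new_w)
        behavior.update(s, a, r_intrinsic, s_next, gamma_b)
        s = s_next
```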
Note that the behavior agent also has a separate transition-based $\gamma$, which enables us to encode both continuing and episodic problems. For example, the pseudo-termination for a GVF could be a particular state in the environment, such as a doorway. The discount for the GVF would be zero in that state, even though it is not a true terminal state; the behavior discount $\gamma_{t+1}$ would not be zero.

3 Non-stationarity Induced by Learning

On the surface, the multi-prediction problem outlined in the previous section is a relatively straightforward reinforcement learning problem. The behavior policy learns to maximize cumulative reward, and simultaneously learns predictions about its environment. Many RL systems incorporate prediction learning, either as auxiliary tasks or to learn a model. However, unlike standard RL problems, the rewards for the behavior are non-stationary when using intrinsic rewards, even under stationary dynamics. Further, the prediction problems themselves are non-stationary due to a changing behavior.

To understand this more deeply, consider first the behavior rewards. On each time step, the predictions are updated. Progressively, they get more and more accurate. Imagine a scenario where they can become perfectly accurate, such as in the tabular setting with stationary cumulants. The behavior rewards are high in early learning, when predictions are inaccurate. As predictions become more and more accurate, the change in weights gets smaller, until eventually the behavior rewards are near zero. This means that when the behavior revisits a state, the reward distribution has actually changed. More generally, in the function approximation setting, the behavior rewards will continue to change with time, not necessarily decaying to zero.

The prediction problems are also non-stationary, for two reasons. First, the cumulants themselves might be non-stationary, even if the transition dynamics are stationary. For example, the cumulant could correspond to the amount of food at a location in the environment, which slowly gets depleted. Or, the cumulant could depend on a hidden variable, which makes the outcome appear non-stationary. Second, even with a stationary cumulant, the prediction learning problem can be non-stationary due to a changing behavior policy. As the behavior policy changes, the state distribution changes. Implicitly, when learning off-policy, the predictions are minimizing an objective weighted by the state visitation under the behavior policy. As the behavior changes, the underlying objective is actually changing, resulting in a non-stationary prediction problem.

Though there has been some work on learning under non-stationarity in RL and bandits, none to our knowledge has addressed the multi-prediction setting in MDPs. There has been some work developing reinforcement learning algorithms for non-stationary MDPs, but largely for the tabular setting [Sutton and Barto, 2018, Da Silva et al., 2006, Abdallah and Kaisers, 2016, Cheung et al., 2020] or assuming periodic shifts [Chandak et al., 2020a,b, Padakandla et al., 2020]. There has also been some work in the non-stationary multi-armed bandit setting [Garivier and Moulines, 2008, Koulouriotis and Xanthopoulos, 2008, Besbes et al., 2014]. Non-stationary rewards for the behavior that decay over time have been considered in the bandit setting, under the name rotting bandits [Levine et al., 2017, Seznec et al., 2019]; these algorithms do not obviously extend to the RL setting.
4 Handling the Non-Stationarity in a Multi-prediction System

In this section, we describe a unified approach to handle non-stationarity in both the GVF and behavior learners, using successor features. We first discuss how to use successor features to learn under non-stationary cumulants, for prediction. Then we discuss using successor features for control, which allows us to leverage this approach for the non-stationary rewards of the behavior. We then discuss state re-weightings, and how to mitigate non-stationarity due to a changing behavior.

4.1 Successor Features for Non-stationary Rewards

Successor features provide an elegant way to learn value functions under non-stationarity. The separation into learning stationary successor features and rewards enables more effective tracking of non-stationary rewards, as we explain in this section and formally prove in Section 5. Assume that there is a weight vector $w \in \mathbb{R}^d$ and features $x(s, a) \in \mathbb{R}^d$ for each state and action $(s, a)$ such that $r(s, a) = \langle x(s, a), w \rangle$. Recursively define
$$\psi^\pi(s, a) \doteq \mathbb{E}_\pi\big[x(S_t, A_t) + \gamma_{t+1} \psi^\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\big].$$
$\psi^\pi(s, a)$ is called the successor features: the discounted cumulative sum of feature vectors if we follow policy $\pi$. For $\psi_t \doteq \psi^\pi(S_t, A_t)$ and $x_t \doteq x(S_t, A_t)$, we can see that $Q^\pi(s, a) = \langle \psi^\pi(s, a), w \rangle$:
$$\begin{aligned}
\langle \psi^\pi(s, a), w \rangle &= \mathbb{E}_\pi[\langle x_t, w \rangle \mid S_t = s, A_t = a] + \mathbb{E}_\pi[\gamma_{t+1} \langle \psi_{t+1}, w \rangle \mid S_t = s, A_t = a] \\
&= r(s, a) + \mathbb{E}_\pi[\gamma_{t+1} \langle x_{t+1}, w \rangle \mid S_t = s, A_t = a] + \mathbb{E}_\pi[\gamma_{t+1} \gamma_{t+2} \langle \psi_{t+2}, w \rangle \mid S_t = s, A_t = a] \\
&= r(s, a) + \mathbb{E}_\pi[\gamma_{t+1} r_{t+1} \mid S_t = s, A_t = a] + \mathbb{E}_\pi[\gamma_{t+1} \gamma_{t+2} \langle \psi_{t+2}, w \rangle \mid S_t = s, A_t = a] \\
&= \ldots = \mathbb{E}_\pi[r(s, a) + \gamma_{t+1} r_{t+1} + \gamma_{t+1} \gamma_{t+2} r_{t+2} + \ldots \mid S_t = s, A_t = a] = Q^\pi(s, a).
\end{aligned}$$
If we have features $x(s, a) \in \mathbb{R}^d$ that allow us to represent the immediate reward, then successor features provide a good representation to approximate the GVF. We simply learn another set of parameters $w_c \in \mathbb{R}^d$ that predict the immediate cumulant (or reward), $c(s, a) \approx \langle x(s, a), w_c \rangle$. These parameters $w_c$ are updated using a standard regression update, and $Q^\pi(s, a) \approx \langle \psi^\pi(s, a), w_c \rangle$.

The successor features $\psi^\pi(s, a)$ themselves, however, also need to be approximated: outside of the tabular setting, we cannot explicitly maintain a separate $\psi^\pi(s, a)$ for each $(s, a)$. Notice that each element of $\psi^\pi(s, a)$ corresponds to a true expected return: the cumulative discounted sum of a reward feature into the future. Therefore, $\psi^\pi(s, a)$ can be approximated using any value function approximation method, such as temporal difference (TD) learning. We learn parameters $w_\psi$ for the approximation $\hat{\psi}(s, a; w_\psi) = [\hat{\psi}_1(s, a; w_\psi), \ldots, \hat{\psi}_d(s, a; w_\psi)]^\top \in \mathbb{R}^d$, where $\hat{\psi}_m(s, a; w_\psi) \approx \psi^\pi_m(s, a)$. We can use any function approximator for $\hat{\psi}(s, a; w_\psi)$, such as linear function approximation with tile coding, where $w_\psi$ linearly weights the tile-coding features to produce $\hat{\psi}(s, a; w_\psi)$, or neural networks, where $w_\psi$ are the parameters of the network.

We summarize the algorithm using successor features for non-stationary rewards/cumulants, called SF-NR, in Algorithm 2. We provide an update formula for the approximate SF using Expected Sarsa for prediction [Sutton and Barto, 2018] for simplicity, but note that any value-learning algorithm can be used here. In our experiments, we use Tree Backup [Precup, 2000] because it reduces variance from off-policy learning; we provide the pseudocode in Appendix D. Algorithm 2 assumes that the reward features $x(s, a)$ are given, but of course these can be learned as well. Ideally, we would learn a compact set of reward features such that the cumulants can be accurately estimated as a linear function of these reward features.
A compact (smaller) set of reward features is preferred because it makes the SF more computationally efficient to learn.

Algorithm 2 Successor Features for Non-stationary Rewards (SF-NR)
Input: $(S_t, A_t, S_{t+1}, C_{t+1}, \gamma_{t+1})$, $\pi$, $w_\psi$, $w_c$
$x \leftarrow x(S_t, A_t)$
$\hat{\psi} \leftarrow \hat{\psi}(S_t, A_t; w_\psi)$
$\hat{\psi}' \leftarrow \sum_{a'} \pi(a' \mid S_{t+1})\, \hat{\psi}(S_{t+1}, a'; w_\psi)$
for $m = 1$ to $d$ do
  $\delta_m \leftarrow x_m + \gamma_{t+1} \hat{\psi}'_m - \hat{\psi}_m$
  $w_\psi \leftarrow w_\psi + \alpha\, \delta_m \nabla_{w_\psi} \hat{\psi}_m$
$w_c \leftarrow w_c + \alpha_c\, (C_{t+1} - \langle x, w_c \rangle)\, x$

There are two key advantages to the separation into learning successor features and immediate cumulant estimates. First, it easily allows different or changing cumulants to be used, for the same policy, using the same successor features. The transition dynamics summarized in the stationary successor features can be learned slowly to high accuracy and re-used. This re-use property is why these representations have been used for transfer [Barreto et al., 2017, 2018, 2020]. The property is pertinent for us because it allows us to more easily track changes in the cumulant: the regression updates can quickly adjust the parameters $w_c$, and exploit the already-learned successor features to more quickly track value estimates. Small changes in the rewards can result in large changes in the values; without the separation, therefore, it can be more difficult to directly track the value estimates. Second, the separation allows us to take advantage of online regression algorithms with strong convergence guarantees. Many optimizers and accelerations are designed for the supervised setting, rather than for temporal difference algorithms. Once the successor features are learned, the prediction problem reduces to a supervised learning problem. We can therefore further improve tracking by leveraging these algorithms to learn and track the immediate cumulant. We formalize the convergence rate improvements from this separation in Section 5.

4.2 GPI with Successor Features for Control

In this section we outline a control algorithm for non-stationary rewards. SF-NR provides a method for updating the value estimate as rewards change. The behavior for the multi-prediction problem has changing rewards, and so could benefit from SF-NR. But SF-NR only provides a mechanism to efficiently track action-values for a fixed policy, not for a changing policy. Instead, we turn to the idea of constraining the behavior to act greedily with respect to the values for a set of policies, introduced as Generalized Policy Improvement (GPI) [Barreto et al., 2018, 2020]. For our system, this is particularly natural, as we are already learning successor features for a collection of policies. Let us start there, where we assume our set of policies is $\Pi = \{\pi_1, \ldots, \pi_N\}$. Assume also that we have learned the successor features for these policies, $\hat{\psi}(s, a; w^{(j)}_\psi)$, and that we have weights $w_r \in \mathbb{R}^d$ such that $\langle x(s, a), w_r \rangle \approx \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$ for behavior reward $R_{t+1}$. Then on each step, the behavior policy takes the greedy action
$$\mu(s) = \arg\max_{a} \max_{j \in \{1, \ldots, N\}} Q^{\pi_j}_r(s, a) = \arg\max_{a} \max_{j \in \{1, \ldots, N\}} \big\langle \hat{\psi}(s, a; w^{(j)}_\psi), w_r \big\rangle.$$
The resulting policy is guaranteed to be an improvement: in every state the new policy has a value at least as good as any of the policies in the set [Barreto et al., 2017, Theorem 1]. Later work also showed sample efficiency of GPI when combining known reward weights to solve novel tasks [Barreto et al., 2020]. The use of successor features has similar benefits as discussed above, because the estimates can adapt more rapidly as the rewards change, due to learning progress changing over time.
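Below is a linear function approximation sketch of the SF-NR update in Algorithm 2 and of the GPI action selection above. The Expected Sarsa form is used for the SF update, as in Algorithm 2 (our experiments use Tree Backup instead), and the step sizes, feature functions, and class names are assumptions for illustration.

```python
import numpy as np

class SFNR:
    """Successor features with separately tracked cumulant weights (SF-NR sketch)."""

    def __init__(self, num_sa_features, num_reward_features, alpha_psi=0.1, alpha_c=0.1):
        # Row m of W_psi parameterizes the m-th component of the SF.
        self.W_psi = np.zeros((num_reward_features, num_sa_features))
        self.w_c = np.zeros(num_reward_features)
        self.alpha_psi, self.alpha_c = alpha_psi, alpha_c

    def psi(self, phi):
        """Approximate SF for state-action features phi: one entry per reward feature."""
        return self.W_psi @ phi

    def q(self, phi):
        """Value estimate: inner product of the SF and the cumulant weights."""
        return self.psi(phi) @ self.w_c

    def update(self, phi, x, cumulant, gamma_next, phi_next_expected):
        """One SF-NR step.

        phi: state-action features for (S_t, A_t); x: reward features x(S_t, A_t);
        phi_next_expected: expected next state-action features under the target
        policy, sum over a' of pi(a'|S_{t+1}) * phi(S_{t+1}, a').
        """
        # Expected-Sarsa TD errors for the SF, one per reward-feature dimension.
        delta = x + gamma_next * self.psi(phi_next_expected) - self.psi(phi)
        self.W_psi += self.alpha_psi * np.outer(delta, phi)
        # Regression update tracking the (possibly non-stationary) cumulant.
        self.w_c += self.alpha_c * (cumulant - x @ self.w_c) * x


def gpi_action(sf_learners, w_r, phi_fn, state, actions):
    """GPI: greedy w.r.t. the max over the policy set of <psi_j(s, a), w_r>."""
    def value(a):
        phi = phi_fn(state, a)
        return max(sf.psi(phi) @ w_r for sf in sf_learners)
    return max(actions, key=value)
```

Because the approximation is linear, passing the expected next feature vector is equivalent to taking the expectation of the SF itself under the target policy.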
The separation is even more important here, as we know the rewards are constantly drifting, and so tracking them quickly is critical. We could adapt even more aggressively to these non-stationary rewards by anticipating trends. For example, instead of a regression update, we can model the trend (up or down) in the reward for a state and action. If the reward has been decreasing over time, then it will likely continue to decrease. Stochastic gradient descent will put more weight on recent points, but would likely predict a higher expected reward than is actually observed. For simplicity here, we still choose to use stochastic gradient descent, as it is a reasonably effective tracking algorithm, but note that performance improvements could likely be obtained by exploiting this structure in the problem.

We can consider different sets of policies for the GVFs and the behavior. However, the two are naturally coupled. First, the GPI theory shows that greedifying over a larger collection of policies provides better policies. It is sensible, then, to at least include the GVF policies in the set for the behavior. Second, the behavior needs to learn the successor features for any additional policies. Arguably, it should try to gather data to learn these well, so as to facilitate its own policy improvement, and should therefore also incorporate the learning progress for these successor features into the intrinsic reward. For this work, therefore, we assume that the behavior uses the set of GVF policies. Note that the weight-change intrinsic reward uses the concatenation of $w_\psi$ and $w_c$.

4.3 Interest and prior corrections for the changing state distribution

The final source of non-stationarity is in the state distribution. As the behavior $\mu$ changes, the state-action visitation distribution $d_{\mu_t} : \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$ changes. The state distribution implicitly weights the relative importance of states in the GVF objective, called the projected Bellman error (PBE). Correspondingly, the optimal SF solution could be changing, since the objective is changing. The impact of a changing state weighting depends on function approximation capacity, because the weighting indicates how to trade off function approximation error across states. When approximation error is low or zero, such as in the tabular setting, the weighting has little impact on the solution. Generally, however, we expect some approximation error, and so a non-negligible impact.

We can completely remove this source of non-stationarity by using prior corrections. These are products of importance sampling ratios that reweight a trajectory to match the probability of seeing that trajectory under the target policy $\pi$. Namely, they modify the state weighting to $d_\pi$, the state-action visitation distribution under $\pi$. We show this explicitly in Appendix C. Unfortunately, prior corrections can be highly problematic in a system where the behavior policy takes exploratory actions and the target policies are nearly deterministic: these corrections will often be zero, or near zero, resulting in almost no learning.

To overcome this inherent difficulty, we restrict which states are important for each predictive question. Likely, when creating a GVF, the agent is interested in predictions for that GVF only in certain parts of the space. This is similar to the idea of initiation sets for options, where an option is only executed from a small set of relevant states. We can ask: what is the GVF answer, from this smaller set of states of interest? This can be encoded with a non-negative interest function, $i(s, a)$, where some (or even many) states have an interest of zero. This interest is incorporated into the state weighting in the objective, so the agent can focus function approximation resources on states of interest. When using interest, it is sensible to use emphatic weightings [Sutton et al., 2016]. Emphatic weightings are a prior correction method, used under the excursions model [Patterson et al., 2021]. They reweight to a discounted state-action visitation under $\pi$ when starting from states proportionally to $d_\mu$. Further, they ensure states inherit the interest of any states that bootstrap off of them: even if a state has an interest of zero, we want to accurately estimate its value if an important state bootstraps off of that value. The combination of interest and emphatic weightings, which shift the state-action weighting towards visitation under $\pi$, means that we mitigate much of the non-stationarity in the state-action weighting. We provide the pseudocode for this Emphatic TB (ETB) algorithm in Appendix D.
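To make the role of interest concrete, here is a sketch of one step of the standard state-value ETD(lambda) update with an interest function, following the emphasis recursion of Sutton et al. [2016]. The action-value Emphatic TB variant we actually use (Appendix D) differs in its trace and correction terms, so treat this as illustrative only.

```python
import numpy as np

def etd_lambda_step(w, e, F, phi, phi_next, reward, gamma_t, gamma_next,
                    lam, rho, rho_prev, interest, alpha):
    """One ETD(lambda) update with an interest function (linear state values).

    phi, phi_next: features for S_t and S_{t+1}; gamma_t, gamma_next: discounts
    on entering S_t and S_{t+1}; rho_prev, rho: importance ratios pi/mu for the
    previous and current actions; interest: i(S_t). Returns updated (w, e, F).
    """
    F = rho_prev * gamma_t * F + interest        # follow-on trace F_t
    M = lam * interest + (1.0 - lam) * F         # emphasis M_t
    delta = reward + gamma_next * (w @ phi_next) - (w @ phi)
    e = rho * (gamma_t * lam * e + M * phi)      # emphatic eligibility trace
    w = w + alpha * delta * e
    return w, e, F
```

Even a state with zero interest still receives updates when states with positive interest bootstrap off of it, because the follow-on trace F carries interest forward along the trajectory.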
5 Sample Efficiency of SF-NR

As suggested in Section 4.1, the use of successor features makes SF-NR particularly well-suited to our multi-prediction problem setting. The reason for this is simple: given access to an accurate SF matrix, value function estimation reduces to a fundamentally simpler linear prediction problem. Indeed, access to an accurate SF enables one to sidestep known lower bounds on PBE estimation. For simplicity, we prove the result for value functions; the result easily extends to action-values. Denote by $v_\pi \in \mathbb{R}^{|\mathcal{S}|}$ the vector with entries $v_\pi(s)$, by $r \in \mathbb{R}^{|\mathcal{S}|}$ the vector of expected immediate rewards in each state, and by $P_\pi \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$ the matrix of transition probabilities. The following lemma, proven in Appendix A.1, relates the mean squared value error (VE) to the one-step reward prediction error.

Lemma 1. Assume there exists a $w^* \in \mathbb{R}^d$ such that $r = X w^*$. Let $\hat{r} \doteq X w$ for some $w \in \mathbb{R}^d$, and let $D = \mathrm{Diag}(\{d(s)\}_{s \in \mathcal{S}})$ for a distribution $d$ fully supported on $\mathcal{S}$, with $\lVert \cdot \rVert_D$ the weighted norm under $D$. Then the value estimate $\hat{v} \doteq \Psi^\pi w$, for $\Psi^\pi$ the matrix with rows $\psi^\pi(s)^\top$, satisfies
$$\mathrm{VE}(v_\pi, \hat{v}) \doteq \lVert v_\pi - \hat{v} \rVert_D^2 \le \frac{\lVert r - \hat{r} \rVert_D^2}{(1 - \gamma)^2}.$$

Thus we can ensure that $\mathrm{VE}(v_\pi, \hat{v}) \le \varepsilon$ by ensuring that $\lVert r - \hat{r} \rVert_D^2 \le \varepsilon (1 - \gamma)^2$. This is promising, as this latter expression is readily expressed as the objective of a linear regression problem. To illustrate the utility of this, let us look at a concrete example: suppose the agent has an accurate SF matrix and that the reward function changes at some point in the agent's deployment. Suppose we have access to a batch of transitions $\mathcal{D} \doteq \{(S_t, A_t, S'_t, R_t)\}_{t=1}^{T}$ with which we can correct our estimate of $v_\pi$, where each $(s, a, s', r) \in \mathcal{D}$ is such that $s \sim d_\mu$ for some known behavior policy $\mu$, $a \sim \mu(\cdot \mid s)$, $s' \sim P(\cdot \mid s, a)$, and $r = r(s, a, s')$. Assume for simplicity that the importance sampling ratios satisfy $\rho_t \le \rho_{\max}$, that $\lVert x(S_t) \rVert_1 \le L$, and that $r_t \le R_{\max}$, for some finite $\rho_{\max}, R_{\max}, L \in \mathbb{R}^+$. Then we can get the following result, proven in Appendix A.2, which is a straightforward application of Orabona [2019, Theorem 7.26].

Proposition 1. Define $\ell_t(w) \doteq \frac{\rho_t}{2} (r_t - \langle x(S_t), w \rangle)^2$. Suppose we apply a basic recursive least-squares estimator to minimize regret on this loss sequence, producing a sequence of iterates $w_t$, and let $\bar{w}_T \doteq \frac{1}{T} \sum_{t=1}^{T} w_t$ denote the average iterate. Then for $\hat{v}(s) = \langle \psi^\pi(s), \bar{w}_T \rangle$, we have
$$\mathrm{VE}(v_\pi, \hat{v}) \le O\!\left( \frac{d\, \rho_{\max} R_{\max}^2 \log\!\big(1 + \rho_{\max} L^2 T\big)}{(1 - \gamma)^2\, T} \right). \qquad (2)$$
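The re-evaluation step analyzed in Proposition 1 is easy to sketch: with the SF matrix held fixed, correcting the value estimates after a reward change reduces to (importance-weighted) linear regression on the one-step rewards. The following is a minimal sketch with a basic recursive least-squares estimator; the interfaces, the ridge initialization, and the use of the importance ratio as an observation weight are assumptions for illustration.

```python
import numpy as np

class RecursiveLeastSquares:
    """Basic RLS estimator for w such that r(s) is approximately <x(s), w>."""

    def __init__(self, dim, ridge=1.0):
        self.w = np.zeros(dim)
        self.P = np.eye(dim) / ridge   # inverse of the regularized design matrix

    def update(self, x, target, weight=1.0):
        # weight can carry an importance sampling ratio rho = pi(a|s) / mu(a|s).
        Px = self.P @ x
        gain = weight * Px / (1.0 + weight * (x @ Px))
        self.w = self.w + gain * (target - x @ self.w)
        self.P = self.P - np.outer(gain, Px)


def reevaluate_values(sf_matrix, reward_features, batch, rls):
    """Correct value estimates after a reward change, holding the SF fixed.

    sf_matrix: (num_states, d) with rows psi(s); reward_features: (num_states, d)
    with rows x(s); batch: iterable of (state_index, reward, rho) samples.
    """
    for s, r, rho in batch:
        rls.update(reward_features[s], r, weight=rho)
    return sf_matrix @ rls.w   # v_hat(s) = <psi(s), w>
```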
In contrast, without the SF we are faced with minimizing a harder objective: the PBE. It can be shown that minimizing the PBE is equivalent to solving a stochastic saddle-point problem, and that convergence to the saddle point of this problem has an unimprovable rate of order $O(1/\sqrt{T})$, with constants depending on the maximum eigenvalue $\lambda$ of the feature covariance matrix, a bound $\sigma$ on the gradient stochasticity, and $(1+\gamma)\rho_{\max} L^2 d$; this convergence rate translates into a performance bound of the same order [Liu et al., 2018a, Proposition 5]. Comparing with Equation 2, we observe an additional dependence of $O(\sqrt{\lambda / T})$ as well as a worse dependence, of at least $O(1/\sqrt{T})$ rather than $O(\log(T)/T)$, on all other quantities of interest, reinforcing the intuition that access to the SF enables us to more efficiently re-evaluate the value function.

6 A First Experiment Testing the Multi-prediction System

Figure 1: Tabular TMaze with 4 GVFs, with cumulants of zero except in the goals. The right plot shows the cumulants in the goals. G2 and G3 have constant cumulants, G1 has a distractor cumulant and G4 a drifter.

In this section, we investigate the utility of using SF-NR under non-stationary cumulants and rewards, both for prediction and control. We conduct the experiment in a TMaze environment, inspired by the environments used to test animal cognition [Tolman and Honzik, 1930]. The environment, depicted in Figure 1, has four GVFs, where each policy takes the fastest route to its corresponding goal. The cumulants are zero everywhere except at the goals. The cumulant can be of three types: a constant fixed value (constant), a fixed-mean and high-variance value (distractor), or a non-stationary zero-mean random walk process with low variance (drifter). Exact formulas for these cumulants are in Appendix E.1.

Utility of SF-NR for a Fixed Behavior Policy. We start by testing the utility of SF-NR for GVF learning, under a fixed policy that provides good data coverage for every GVF. The Fixed-Behavior Policy is started from random states in the TMaze, and moves towards the closest goal, with a 50/50 chance of going either direction if there is a tie. This policy is like a round-robin policy, in that one of the GVF policies is executed each episode and, in expectation, all four policies are executed the same number of times. We compare an agent that uses SF-NR and one that learns the approximate GVFs using Tree Backup (TB). TB is an off-policy temporal difference (TD) algorithm that reduces variance in the eligibility trace. We also use TB to learn the successor features in SF-NR. Both use $\lambda = 0.9$ and a stepsize method called Auto [Mahmood et al., 2012] designed for online learning. We sweep the initial stepsize and meta stepsizes for Auto. For further details about the agents and optimizer, see Appendix D.

Figure 2: Performance in Tabular TMaze, with averages over 30 runs. Panels: (a) Fixed-Behavior Policy, (b) Learned Behavior Policy, (c) Visitation Plot Comparison. (a) and (b) show average off-policy prediction RMSE, with standard errors, where the error is weighted by (a) the state distribution $d_\mu$ for the Fixed-Behavior policy and (b) a uniform state weighting when learning the behavior. (c) shows goal visitation plots for GPI with SF and TB.
In Figure 2a, we can see that SF-NR allows for much more effective learning, particularly later in learning when it more effectively tracks the non-stationary signals. LSTD performs much more poorly, likely because it corresponds to a closed-form batch solution, which uses old cumulants that are no longer reflective of the current cumulant distribution.

Investigating GPI for Learning the Behavior. Next we investigate whether SF-NR improves learning of the whole system, both for the GVFs and for the behavior policy. We use SF-NR and TB for the GVF learners, and Expected Sarsa (Sarsa) and GPI for the behavior. The GPI agent uses the GVF policies for its set of policies. The reward features for the behavior are likely different than those for the GVF learners, because the cumulants are zero in most states whereas intrinsic rewards are likely non-zero in most states. The GPI agent, therefore, learns its own SFs for each policy, also using TB. The reward weights that estimate the (changing) intrinsic rewards are learned using Auto, as are the SFs. Note that the behavior and GVF learners all share the same meta-stepsize; namely, only one shared parameter is swept.

The results, shown in Figure 2b, highlight that SF for the GVFs is critical for effective learning, though GPI and Sarsa perform similarly. The utility of SF is even greater here, with TB GVF learners inducing much worse performance than SF GVF learners. GPI and Sarsa are similar, which is likely due to the fact that Sarsa uses traces with tabular features, which allow states along the trajectory to the drifter goal to update quickly. In the following sections, we find a bigger distinction between the two. We visualize the goal visitation of GPI in Figure 2c. Once the GVF learners have a good estimate of the constant cumulant signals and the distractor cumulant signal, the agent's behavior should switch to visiting only the drifter goal, as that is the only goal where visiting would improve the GVF prediction. When using SF GVF learners, this behavior emerges, but under TB GVF learners the agent incorrectly focuses on the distractor. This is even more pronounced for Sarsa (see Appendix F.2).

7 Experiments under Function Approximation

We evaluate our system in a similar fashion to the last section, but now under function approximation. We use a benchmark problem at the end of this section, but start with experiments in the TMaze modified to be a continuous environment, with full details described in Appendix E.1. The environment observation $o_t \in \mathbb{R}^2$ corresponds to the xy-coordinates of the agent. We use tile-coded features of 2 tilings of 8 tiles for the state representation, both for TB and to learn the SF. The reward features for the GVF learners can be much simpler than the state-action features, because they only need to estimate the cumulants, which are zero in every state except the goals. The reward features are a one-hot encoding indicating whether $s'$ in the tuple $(s, a, s')$ is in one of the pseudo-termination goals of the GVFs. For the Continuous TMaze, this gives a 4-dimensional vector. The reward features for GPI are a state aggregation applied along the one-dimensional line components. Appendix E.1 contains more details on the reward features for the GVF and behavior learners.
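As a concrete (and deliberately simple) illustration of this feature design, the sketch below builds the one-hot GVF reward features from an $(x, y)$ observation; the goal coordinates and region sizes here are hypothetical placeholders, not the regions specified in Appendix E.1.

```python
import numpy as np

def gvf_reward_features(s_next, goal_regions):
    """One-hot reward features: entry j is 1 if s_next falls in GVF j's goal region."""
    x = np.zeros(len(goal_regions))
    for j, in_goal in enumerate(goal_regions):
        if in_goal(s_next):
            x[j] = 1.0
            break   # goal regions are disjoint
    return x

# Hypothetical square regions around four goal points (illustration only).
goals = [(0.0, 1.0), (0.5, 1.0), (1.0, 1.0), (0.5, 0.0)]
goal_regions = [lambda s, g=g: abs(s[0] - g[0]) < 0.05 and abs(s[1] - g[1]) < 0.05
                for g in goals]
print(gvf_reward_features((0.5, 0.98), goal_regions))   # [0. 1. 0. 0.]
```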
Results for a Fixed Behavior and Learned Behaviors. Under function approximation, SF-NR continues to enable more effective tracking of the cumulants than the other methods. For control, GPI is notably better than Sarsa, potentially because under function approximation eligibility traces are not as effective at sweeping back changes in behavior rewards, and so the separation is more important. We include visitation plots in Appendix F.1, which are similar to the tabular setting.

Figure 3: Performance in Continuous TMaze, with averages over 30 runs. Panels: (a) Fixed-Behavior Policy, (b) Learned Behavior Policy, (c) Replay with Fixed-Behavior. (a) and (b) show average off-policy prediction RMSE, with standard errors, where the error is weighted by (a) the state distribution $d_\mu$ for the Fixed-Behavior policy and (b) a uniform state weighting when learning the behavior. (c) shows RMSE in Continuous TMaze with a Fixed Behavior when incorporating replay.

Note that the efficacy of SF-NR and GPI relied on having reward features that did not overly generalize. The SF learns the expected feature vector when following the target policy. For the GVF learners, if states on the trajectory share features with states near the goal, then the value estimates will likely be higher for those states. The rewards are learned using the squared error, which, unlike some other losses, is likely only to bring cumulant estimates to near zero. These small non-zero cumulant estimates are accumulated by the SF over the entire trajectory, resulting in higher error than TB. We demonstrate this in Appendix F.3. We designed reward features to avoid this problem for our experiments, knowing that effective reward features can be, and have been, learned for SF [Barreto et al., 2020].

Results using Replay. The above results use completely online learning, with eligibility traces. A natural question is whether the more modern approach of using replay could significantly change the results. In fact, early versions of the system included replay but had surprisingly negative results, which we later realized was due to the inherent non-stationarity in the system. Replaying old cumulants and rewards, which have become outdated, actually harms the performance of the system. Once we have the separation with the SF, however, we can benefit from replay for this stationary component. We demonstrate this result in Figure 3c. We use $\lambda = 0$ for this result, because we use replay. The settings are otherwise the same as above, and we re-sweep hyperparameters for this experiment. SF-NR benefits from replay, because it uses replay only for its stationary component: the SF. TB, on the other hand, actually performs more poorly with replay. As before, LSTD, which similarly uses old cumulants, also performs poorly.
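A sketch of how replay can be restricted to the stationary component, as described above: stored transitions are replayed only through the SF update, while the cumulant weights are updated purely online from the most recent cumulants. The buffer contents and step size are assumptions, matching the linear SF parameterization sketched earlier.

```python
import random
import numpy as np

def replay_sf_updates(W_psi, buffer, alpha_psi, batch_size=8):
    """Replay stored transitions through the SF update only.

    buffer holds tuples (phi, x, gamma_next, phi_next_expected); the cumulant
    weights w_c are never touched here, so stale cumulants are never replayed.
    """
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    for phi, x, gamma_next, phi_next_expected in batch:
        delta = x + gamma_next * (W_psi @ phi_next_expected) - W_psi @ phi
        W_psi += alpha_psi * np.outer(delta, phi)
    return W_psi
```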
Incorporating Interest. To study the effects of interest, a more open-world environment is needed. We use the Open 2D World, described in Appendix E.2, to analyze this problem. At the start of each episode, the agent begins in the center of the environment. The interest for each GVF is one in states in the same quadrant as the GVF's respective goal, and zero otherwise. This enables the GVFs to focus their learning on a subset of the entire space, and thus use the function approximation resources more wisely and give a better weight-change profile as an intrinsic reward to the behavior learner. Each GVF prediction $i$ is evaluated under the state-action weighting induced by running $\pi_i$, with results in Figure 4. Both TB with interest and ETB reweight states to focus more on state visitation under the policy. Both significantly improve performance over not using interest, allowing faster learning and reaching a lower error. The reweighting under ETB more closely matches state visitation under the policy, and accounts for the impacts of bootstrapping. We find that ETB does provide some initial learning benefits. The original ETD algorithm is known to suffer from variance issues; with variance reduction, we may find that the utility of ETB is even more pronounced.

Figure 4: Using interest: shading is standard error over 30 runs.

Figure 5: Performance in Mountain Car averaged over 30 runs, with standard errors. Panels: (a) RMSE for each GVF, (b) GVF#2 goal visits: left hill top, (c) GVF#1 goal visits: right hill top. (a) shows learning curves for RMSE, with a uniform weighting over states and actions. (b) and (c) show the number of times that the agent reached the termination for each GVF.

Validation of the Multi-Prediction System in a Standard Benchmark Problem. Finally, we investigate multi-prediction learning in an environment not obviously designed for this setting: Mountain Car. The goal here is to show that multi-prediction learning is natural in many problem settings, and to show results in a standard benchmark, not designed for our setting, that has more complex transition dynamics. In the usual formulation, the agent must learn to rock back and forth, building up momentum to reach the top of the hill on the right: a classic cost-to-goal problem. This is a hard exploration task, where a random agent requires thousands of steps to reach the top of the hill from the bottom of the valley. Here we use Mountain Car to see if our approach can learn about more than just getting out of the valley quickly. We specified one GVF whose termination and policy focus on reaching the top of the left hill, and a second GVF about reaching the top of the other side. The full details of the GVFs and the setup of this task can be found in Appendix E.3. Figure 5a shows how GPI and Sarsa compare against a baseline random policy.
GPI provides much better data for GVF learning than the random policy and Sarsa, significantly reducing the RMSE of the learned GVFs. The goal visitation plots show that GPI explores the domain and visits both GVFs' goals far more often than random, and more effectively than Sarsa.

8 Conclusion

In this work, we take the first few steps towards building an effective multi-prediction learning system. We highlight the inherent non-stationarity in the problem and design algorithms based on successor features (SF) to better adapt to this non-stationarity. We show that (1) temporally consistent behavior emerges from optimizing the amount of learning across diverse GVF questions; (2) successor features are useful for tracking non-stationary rewards and cumulants, both in theory and empirically; (3) replay is well suited for learning the stationary components (the successor features), while meta-learning works well for the non-stationary components; and (4) interest functions can improve the performance of the entire system, by focusing learning on a subset of states for each prediction.

Our work also highlights several critical open questions. (1) The utility of SFs is tied to the quality of the reward features; a better understanding of how to learn these reward features is essential. (2) Continual Auxiliary Task Learning is an RL problem, and requires effective exploration approaches to find and maximize intrinsic rewards; the intrinsic rewards themselves do not provide a solution to exploration. Never-ending exploration is needed. (3) The interaction between discovering predictive questions and learning them effectively remains largely unexplored. In this work, we focused on learning for a given set of GVFs. Other work has focused on discovering useful GVFs [Veeriah et al., 2019, 2021, Nair et al., 2020, Zahavy et al., 2021]. The interaction between the two is likely to introduce additional complexity in learning behavior, including producing the automatic curricula observed in previous work [Oudeyer et al., 2007, Chentanez et al., 2005].

This work demonstrates the utility of several newer ideas in RL that are conceptually compelling, but not widely used in RL systems, namely SF and GVFs, GPI with SF for control, meta-descent step-size adaptation, and interest functions. The trials and tribulations that led to this work involved many failures using classic algorithms in RL, like replay, and, in the end, provided evidence for the utility of these newer ideas. Our journey highlights the importance of building and analyzing complete RL systems, where the interacting parts, with different timescales of learning and complex interdependencies, necessitate incorporating these conceptually important ideas. Solving these integration problems represents the next big step for RL research.

Acknowledgments and Disclosure of Funding

This work was supported by NSERC Discovery, IVADO, CIFAR through CCAI Chair funding, and by the Alberta Machine Intelligence Institute (Amii).

References

Sherief Abdallah and Michael Kaisers. Addressing environment non-stationarity by repeating Q-learning updates. The Journal of Machine Learning Research, 17(1):1582-1612, 2016.
Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, 2017.
Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martín Arjovsky, Alexander Pritzel, Andrew Bolt, et al. Never give up: Learning directed exploration strategies. In International Conference on Learning Representations, 2020.
André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
André Barreto, Diana Borsa, John Quan, Tom Schaul, David Silver, Matteo Hessel, Daniel Mankowitz, Augustin Zidek, and Rémi Munos. Transfer in deep reinforcement learning using successor features and generalised policy improvement. In International Conference on Machine Learning, 2018.
André Barreto, Diana Borsa, Shaobo Hou, Gheorghe Comanici, Eser Aygün, Philippe Hamel, Daniel Toyama, Jonathan Hunt, Shibl Mourad, David Silver, and Doina Precup. The option keyboard: Combining skills in reinforcement learning. In Advances in Neural Information Processing Systems, 2019.
André Barreto, Shaobo Hou, Diana Borsa, David Silver, and Doina Precup. Fast reinforcement learning with generalized policy updates. Proceedings of the National Academy of Sciences, 117(48):30079-30087, 2020.
Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems, 2014.
Diana Borsa, André Barreto, John Quan, Daniel Mankowitz, Rémi Munos, Hado van Hasselt, David Silver, and Tom Schaul. Universal successor features approximators. In International Conference on Learning Representations, 2019.
Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2019.
Yash Chandak, Scott M Jordan, Georgios Theocharous, Martha White, and Philip S Thomas. Towards safe policy improvement for non-stationary MDPs. Advances in Neural Information Processing Systems, 2020a.
Yash Chandak, Georgios Theocharous, Shiv Shankar, Martha White, Sridhar Mahadevan, and Philip S Thomas. Optimizing for the future in non-stationary MDPs. International Conference on Machine Learning, 2020b.
Nuttapong Chentanez, Andrew Barto, and Satinder Singh. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, 2005.
Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Reinforcement learning for non-stationary Markov decision processes: The blessing of (more) optimism. International Conference on Machine Learning, 2020.
Cédric Colas, Pierre Fournier, Mohamed Chetouani, Olivier Sigaud, and Pierre-Yves Oudeyer. CURIOUS: Intrinsically motivated modular multi-goal reinforcement learning. In International Conference on Machine Learning, 2019.
Bruno C Da Silva, Eduardo W Basso, Filipo S Perotto, Ana L. C. Bazzan, and Paulo M Engel. Improving reinforcement learning with context detection. In Autonomous Agents and Multiagent Systems, pages 810-812, 2006.
Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2019a.
Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine. Search on the replay buffer: Bridging planning and reinforcement learning. In Advances in Neural Information Processing Systems, 2019b.
Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415, 2008.
Sina Ghiassian, Andrew Patterson, Martha White, Richard S Sutton, and Adam White. Online off-policy prediction. arXiv preprint arXiv:1811.02597, 2018.
Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. International Conference on Learning Representations Workshop, 2017.
Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation. In International Conference on Machine Learning, 2017.
Andrew Jacobsen, Matthew Schlegel, Cameron Linke, Thomas Degris, Adam White, and Martha White. Meta-descent for online, continual prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations, 2017.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Dimitris E Koulouriotis and A Xanthopoulos. Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems. Applied Mathematics and Computation, 196(2):913-922, 2008.
Nir Levine, Koby Crammer, and Shie Mannor. Rotting bandits. In Advances in Neural Information Processing Systems, 2017.
Cam Linke, Nadia M Ady, Martha White, Thomas Degris, and Adam White. Adapting behavior via intrinsic reward: A survey and empirical study. Journal of Artificial Intelligence Research, 69:1287-1332, 2020.
Bo Liu, Ian Gemp, Mohammad Ghavamzadeh, Ji Liu, Sridhar Mahadevan, and Marek Petrik. Proximal gradient temporal difference learning: Stable reinforcement learning with polynomial sample complexity. Journal of Artificial Intelligence Research, 2018a.
Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems, 2018b.
Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with stationary distribution correction. Uncertainty in Artificial Intelligence, pages 1180-1190, 2020.
Marlos C Machado, Marc G Bellemare, and Michael Bowling. A Laplacian framework for option discovery in reinforcement learning. In International Conference on Machine Learning, 2017.
Ashique Mahmood, Huizhen Yu, and Richard Sutton. Multi-step off-policy learning without importance sampling ratios, 2017.
Ashique Rupam Mahmood, Richard S Sutton, Thomas Degris, and Patrick M Pilarski. Tuning-free step-size adaptation. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2012.
Ashvin Nair, Shikhar Bahl, Alexander Khazatsky, Vitchyr Pong, Glen Berseth, and Sergey Levine. Contextual imagined goals for self-supervised robotic learning. In Conference on Robot Learning, 2020.
Francesco Orabona. A modern introduction to online learning, 2019.
Pierre-Yves Oudeyer, Frédéric Kaplan, and Verena V. Hafner. Intrinsic motivation systems for autonomous mental development. In IEEE Transactions on Evolutionary Computation, 2007.
Sindhu Padakandla, KJ Prabuchandran, and Shalabh Bhatnagar. Reinforcement learning algorithm for non-stationary environments. Applied Intelligence, 50(11):3590-3606, 2020.
Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, 2017.
Andrew Patterson, Adam White, Sina Ghiassian, and Martha White. A generalized projected Bellman error for off-policy value estimation in reinforcement learning. arXiv preprint arXiv:2104.13844, 2021.
Karl Pertsch, Oleh Rybkin, Frederik Ebert, Shenghao Zhou, Dinesh Jayaraman, Chelsea Finn, and Sergey Levine. Long-horizon visual planning with goal-conditioned hierarchical predictors. In Advances in Neural Information Processing Systems, 2020.
Silviu Pitis, Harris Chan, Stephen Zhao, Bradly Stadie, and Jimmy Ba. Maximum entropy gain exploration for long horizon multi-goal reinforcement learning. In International Conference on Machine Learning, 2020.
Vitchyr H Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, and Sergey Levine. Skew-Fit: State-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698, 2019.
Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Van de Wiele, Vlad Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing: Solving sparse reward tasks from scratch. In International Conference on Machine Learning, 2018.
Tom Schaul and Mark B Ring. Better generalization with forecasts. In International Joint Conferences on Artificial Intelligence, 2013.
Tom Schaul, Dan Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, 2015.
Bruno Scherrer. Improved and generalized upper bounds on the complexity of policy iteration. Mathematics of Operations Research, 2016.
Matthew Schlegel, Andrew Jacobsen, Zaheer Abbas, Andrew Patterson, Adam White, and Martha White. General value function networks. Journal of Artificial Intelligence Research, 70:497-543, 2021.
Julien Seznec, Andrea Locatelli, Alexandra Carpentier, Alessandro Lazaric, and Michal Valko. Rotting bandits are no harder than stochastic ones. International Conference on Artificial Intelligence and Statistics, 2019.
Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018. ISBN 0262039249.
Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems, 2011.
Richard S Sutton, A Rupam Mahmood, and Martha White. An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17(1):2603-2631, 2016.
Edward Chace Tolman and Charles H Honzik. Introduction and removal of reward, and maze performance in rats. University of California Publications in Psychology, 1930.
Vivek Veeriah, Matteo Hessel, Zhongwen Xu, Janarthanan Rajendran, Richard L. Lewis, Junhyuk Oh, Hado van Hasselt, David Silver, and Satinder Singh. Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, 2019.
Vivek Veeriah, Tom Zahavy, Matteo Hessel, Zhongwen Xu, Junhyuk Oh, Iurii Kemaev, Hado van Hasselt, David Silver, and Satinder Singh. Discovery of options via meta-learned subgoals. In Advances in Neural Information Processing Systems, 2021.
Martha White. Unifying task specification in reinforcement learning. In International Conference on Machine Learning, 2017.
Tom Zahavy, Brendan O'Donoghue, André Barreto, Volodymyr Mnih, Sebastian Flennerhag, and Satinder Singh. Discovering diverse nearly optimal policies with successor features. arXiv preprint arXiv:2106.00669, 2021.
Tianren Zhang, Shangqi Guo, Tian Tan, Xiaolin Hu, and Feng Chen. Generating adjacency-constrained subgoals in hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, 2020.