# Does Zero-Shot Reinforcement Learning Exist?

Ahmed Touati, Jérémy Rapin & Yann Ollivier
Meta AI Research, Paris, {atouati,jrapin,yol}@meta.com

ABSTRACT

A zero-shot RL agent is an agent that can solve any RL task in a given environment, instantly with no additional planning or learning, after an initial reward-free learning phase. This marks a shift from the reward-centric RL paradigm towards controllable agents that can follow arbitrary instructions in an environment. Current RL agents can solve families of related tasks at best, or require planning anew for each task. Strategies for approximate zero-shot RL have been suggested using successor features (SFs) (Borsa et al., 2018) or forward-backward (FB) representations (Touati & Ollivier, 2021), but testing has been limited. After clarifying the relationships between these schemes, we introduce improved losses and new SF models, and test the viability of zero-shot RL schemes systematically on tasks from the Unsupervised RL benchmark (Laskin et al., 2021). To disentangle universal representation learning from exploration, we work in an offline setting and repeat the tests on several existing replay buffers. SFs appear to suffer from the choice of the elementary state features. SFs with Laplacian eigenfunctions do well, while SFs based on auto-encoders, inverse curiosity, transition models, low-rank transition matrix, contrastive learning, or diversity (APS) perform inconsistently. In contrast, FB representations jointly learn the elementary and successor features from a single, principled criterion. They perform best and most consistently across the board, reaching 85% of supervised RL performance with a good replay buffer, in a zero-shot manner.

1 INTRODUCTION

For breadth of applications, reinforcement learning (RL) lags behind other fields of machine learning, such as vision or natural language processing, which have effectively adapted to a wide range of tasks, often in an almost zero-shot manner, using pretraining on large, unlabelled datasets (Brown et al., 2020). The RL paradigm itself may be in part to blame: RL agents are usually trained for only one reward function or a small family of related rewards. Instead, we would like to train controllable agents that can be given a description of any task (reward function) in their environment, and then immediately know what to do, reacting instantly to commands such as "fetch this object while avoiding that area". The promise of zero-shot RL is to train without rewards or tasks, yet immediately perform well on any reward function given at test time, with no extra training, planning, or finetuning, and only a minimal amount of extra computation to process a task description (Section 2 gives the precise definition we use for zero-shot RL).

How far away are such zero-shot agents? In the RL paradigm, a new task (reward function) means re-training the agent from scratch, and providing many reward samples. Model-based RL trains a reward-free, task-independent world model, but still requires heavy planning when a new reward function is specified (e.g., Chua et al., 2018; Moerland et al., 2020). Model-free RL is reward-centric from the start, and produces specialized agents. Multi-task agents generalize within a family of related tasks only. Reward-free, unsupervised skill pre-training (e.g., Eysenbach et al., 2018) still requires substantial downstream task adaptation, such as training a hierarchical controller.
Is zero-shot RL possible? If one ignores practicality, zero-shot RL is easy: make a list of all possible rewards up to precision $\varepsilon$, then pre-learn all the associated optimal policies. Scalable zero-shot RL must somehow exploit the relationships between policies for all tasks. Learning to go from $a$ to $c$ is not independent from going from $a$ to $b$ and from $b$ to $c$, and this produces rich, exploitable algebraic relationships (Blier et al., 2021; Schaul et al., 2015).

Figure 1: Zero-shot scores of ten SF methods and FB, as a percentage of the supervised score of offline TD3 trained on the same replay buffer, averaged over the tasks, environments and replay buffers from the Unsupervised RL and ExORL benchmarks (Laskin et al., 2022; Yarats et al., 2021). FB and SFs with Laplacian eigenfunctions achieve zero-shot scores approaching supervised RL.

Suggested strategies for generic zero-shot RL so far have used successor representations (Dayan, 1993), under two forms: successor features (SFs) (Barreto et al., 2017) as in (Borsa et al., 2018; Hansen et al., 2019; Liu & Abbeel, 2021), and forward-backward (FB) representations (Touati & Ollivier, 2021). Both SFs and FB lie in between model-free and model-based RL, by predicting features of future states, or summarizing long-term state-state relationships. Like model-based approaches, they decouple the dynamics of the environment from the reward function. Contrary to world models, they require neither planning at test time nor a generative model of states or trajectories.

Yet SFs heavily depend on a choice of basic state features. To get a full zero-shot RL algorithm, a representation learning method must provide those. While SFs have been successfully applied to transfer between tasks, most of the time the basic features were handcrafted or learned using prior knowledge of the task class. Meanwhile, FB is a standalone method with no task prior and good theoretical backing, but testing has been limited to goal-reaching in a few environments.

Here:
- We systematically assess SFs and FB for zero-shot RL, including many new models of SF basic features, and improved FB loss functions.
- We use 13 tasks from the Unsupervised RL benchmark (Laskin et al., 2021), repeated on several ExORL training replay buffers (Yarats et al., 2021), to assess robustness to the exploration method.
- We systematically study the influence of basic features for SFs, by testing SFs on features from ten RL representation learning methods, such as latent next-state prediction, inverse curiosity module, contrastive learning, diversity, and various spectral decompositions.
- We expose new mathematical links between SFs, FB, and other representations in RL.
- We discuss the implicit assumptions and limitations behind zero-shot RL approaches.

2 PROBLEM AND NOTATION; DEFINING ZERO-SHOT RL

Let $\mathcal{M} = (S, A, P, \gamma)$ be a reward-free Markov decision process (MDP) with state space $S$, action space $A$, transition probabilities $P(s'|s, a)$ from state $s$ to $s'$ given action $a$, and discount factor $0 < \gamma < 1$ (Sutton & Barto, 2018). If $S$ and $A$ are finite, $P(s'|s, a)$ can be viewed as a stochastic matrix $P_{sas'} \in \mathbb{R}^{(|S||A|) \times |S|}$; in general, for each $(s, a) \in S \times A$, $P(\mathrm{d}s'|s, a)$ is a probability measure over $s' \in S$. The notation $P(\mathrm{d}s'|s, a)$ covers all cases.
Given $(s_0, a_0) \in S \times A$ and a policy $\pi\colon S \to \mathrm{Prob}(A)$, we denote by $\Pr(\cdot\,|\,s_0, a_0, \pi)$ and $\mathbb{E}[\cdot\,|\,s_0, a_0, \pi]$ the probabilities and expectations under state-action sequences $(s_t, a_t)_{t \geq 0}$ starting at $(s_0, a_0)$ and following policy $\pi$ in the environment, defined by sampling $s_t \sim P(\mathrm{d}s_t|s_{t-1}, a_{t-1})$ and $a_t \sim \pi(\mathrm{d}a_t|s_t)$. We define $P_\pi(\mathrm{d}s', \mathrm{d}a'|s, a) := P(\mathrm{d}s'|s, a)\,\pi(\mathrm{d}a'|s')$ and $P_\pi(\mathrm{d}s'|s) := \int_a P(\mathrm{d}s'|s, a)\,\pi(\mathrm{d}a|s)$, the state-action transition probabilities and state transition probabilities induced by $\pi$. Given a reward function $r\colon S \to \mathbb{R}$, the $Q$-function of $\pi$ for $r$ is $Q^\pi_r(s_0, a_0) := \sum_{t \geq 0} \gamma^t\, \mathbb{E}[r(s_{t+1})\,|\,s_0, a_0, \pi]$. For simplicity, we assume the reward $r$ depends only on the next state $s_{t+1}$ instead of the full triplet $(s_t, a_t, s_{t+1})$, but this is not essential.

We focus on offline unsupervised RL, where the agent cannot interact with the environment. The agent only has access to a static dataset of logged reward-free transitions in the environment, $\mathcal{D} = \{(s_i, a_i, s'_i)\}_{i \in \mathcal{I}}$ with $s'_i \sim P(\mathrm{d}s'_i|s_i, a_i)$. These can come from any exploration method or methods. The offline setting disentangles the effects of the exploration method and of representation and policy learning: we test each zero-shot method on several training datasets from several exploration methods.

We denote by $\rho(\mathrm{d}s)$ and $\rho(\mathrm{d}s, \mathrm{d}a)$ the (unknown) marginal distributions of states and state-actions in the dataset $\mathcal{D}$. We use both $\mathbb{E}_{s \sim \mathcal{D}}[\cdot]$ and $\mathbb{E}_{s \sim \rho}[\cdot]$ for expectations under the training distribution.

Zero-shot RL: problem statement. The goal of zero-shot RL is to compute a compact representation $\mathcal{E}$ of the environment by observing samples of reward-free transitions $(s_t, a_t, s_{t+1})$ in this environment. Once a reward function is specified later, the agent must use $\mathcal{E}$ to immediately produce a good policy, via only elementary computations without any further planning or learning. Ideally, for any downstream task, the performance of the returned policy should be close to the performance of a supervised RL baseline trained on the same dataset labeled with the rewards for that task.

Reward functions may be specified at test time either as a relatively small set of reward samples $(s_i, r_i)$, or as an explicit function $s \mapsto r(s)$ (such as 1 at a known goal state and 0 elsewhere). The method will be few-shot, zero-planning in the first case, and truly zero-shot in the second case.

3 RELATED WORK

Zero-shot RL requires unsupervised learning and the absence of any planning or fine-tuning at test time. The proposed strategies for zero-shot RL discussed in Section 1 ultimately derive from successor representations (Dayan, 1993) in finite spaces. In continuous spaces, starting with a finite number of features $\phi$, successor features can be used to produce policies within a family of tasks directly related to $\phi$ (Barreto et al., 2017; Borsa et al., 2018; Zhang et al., 2017; Grimm et al., 2019), often using hand-crafted $\phi$ or learning the $\phi$ that best linearize training rewards. VISR (Hansen et al., 2019) and its successor APS (Liu & Abbeel, 2021) use SFs with $\phi$ automatically built online via diversity criteria (Eysenbach et al., 2018; Gregor et al., 2016). We include APS among our baselines, as well as many new criteria to build $\phi$ automatically. Successor measures (Blier et al., 2021) avoid the need for $\phi$ by directly learning models of the distribution of future states: doing this for various policies yields a candidate zero-shot RL method, forward-backward representations (Touati & Ollivier, 2021), which has been tested for goal-reaching in a few environments with discrete actions.
FB uses a low-rank model of long-term state-state relationships, reminiscent of the state-goal factorization from Schaul et al. (2015).

Model-based RL (surveyed in Moerland et al. (2020)) misses the zero-planning requirement of zero-shot RL. Still, learned models of the transitions between states can be used jointly with SFs to provide zero-shot methods (the Trans, Latent, and LRA-P methods below).

Goal-oriented and multitask RL has a long history (e.g., Foster & Dayan, 2002; Sutton et al., 2011; da Silva et al., 2012; Schaul et al., 2015; Andrychowicz et al., 2017). A parametric family of tasks must be defined in advance (e.g., reaching arbitrary goal states). New rewards cannot be set a posteriori: for example, a goal-state-oriented method cannot handle dense rewards. Zero-shot task transfer methods learn on tasks and can transfer to related tasks only (e.g., Oh et al., 2017; Sohn et al., 2018); this can be used, e.g., for sim-to-real transfer (Genc et al., 2020) or slight environment changes, which is not covered here. Instead, we aim at not having any predefined family of tasks.

Unsupervised skill and option discovery methods, based for instance on diversity (Eysenbach et al., 2018; Gregor et al., 2016) or eigenoptions (Machado et al., 2017), can learn a variety of behaviors without rewards. Downstream tasks require learning a hierarchical controller to combine the right skills or options for each task. Directly using unmodified skills has limited performance without heavy finetuning (Eysenbach et al., 2018). Still, these methods can speed up downstream learning. The unsupervised aspect of some of these methods (including DIAYN and APS) has been disputed, because training still used end-of-trajectory signals, which are directly correlated to the downstream task in some common environments: without this signal, results drop sharply (Laskin et al., 2022).

4 SUCCESSOR REPRESENTATIONS AND ZERO-SHOT RL

For a finite MDP, the successor representation (SR) (Dayan, 1993) $M^\pi(s_0, a_0)$ of a state-action pair $(s_0, a_0)$ under a policy $\pi$ is defined as the discounted sum of future occurrences of each state:
$$M^\pi(s_0, a_0, s) := \mathbb{E}\Big[\textstyle\sum_{t \geq 0} \gamma^t\, \mathbb{1}_{\{s_{t+1} = s\}} \,\Big|\, (s_0, a_0), \pi\Big], \qquad \forall s \in S. \quad (1)$$
In matrix form, SRs can be written as $M^\pi = P \sum_{t \geq 0} \gamma^t P_\pi^t = P(\mathrm{Id} - \gamma P_\pi)^{-1}$, where $P_\pi$ is the state transition probability matrix. $M^\pi$ satisfies the matrix Bellman equation $M^\pi = P + \gamma P_\pi M^\pi$. Importantly, SRs disentangle the dynamics of the MDP and the reward function: for any reward $r$ and policy $\pi$, the $Q$-function can be expressed linearly as $Q^\pi_r = M^\pi r$.

Successor features and successor measures. Successor features (SFs) (Barreto et al., 2017) extend SRs to continuous MDPs by first assuming we are given a basic feature map $\phi\colon S \to \mathbb{R}^d$ that embeds states into $d$-dimensional space, and defining the expected discounted sum of future state features:
$$\psi^\pi(s_0, a_0) := \mathbb{E}\Big[\textstyle\sum_{t \geq 0} \gamma^t \phi(s_{t+1}) \,\Big|\, s_0, a_0, \pi\Big]. \quad (2)$$
SFs have been introduced to make SRs compatible with function approximation. For a finite MDP, the original definition (1) is recovered by letting $\phi$ be a one-hot state encoding into $\mathbb{R}^{|S|}$.

Alternatively, successor measures (SMs) (Blier et al., 2021) extend SRs to continuous spaces by treating the distribution of future visited states as a measure $M^\pi$ over the state space $S$,
$$M^\pi(s_0, a_0, X) := \textstyle\sum_{t \geq 0} \gamma^t \Pr(s_{t+1} \in X \,|\, s_0, a_0, \pi), \qquad \forall X \subset S. \quad (3)$$
SFs and SMs are related: by construction, $\psi^\pi(s_0, a_0) = \int_{s'} M^\pi(s_0, a_0, \mathrm{d}s')\, \phi(s')$.
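The matrix identities above can be checked directly on a small finite MDP. The following self-contained numerical sketch (ours, not from the paper; it uses a random 5-state MDP) builds $M^\pi = P(\mathrm{Id} - \gamma P_\pi)^{-1}$, verifies the Bellman equation $M^\pi = P + \gamma P_\pi M^\pi$, and checks that $Q^\pi_r = M^\pi r$ matches direct policy evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

# Random finite MDP: P[s, a, s'] = P(s'|s, a), and a fixed stochastic policy pi[s, a].
P = rng.random((nS, nA, nS)); P /= P.sum(-1, keepdims=True)
pi = rng.random((nS, nA));    pi /= pi.sum(-1, keepdims=True)

P_sa = P.reshape(nS * nA, nS)          # (s, a) -> s' transition matrix
P_pi = np.einsum("sa,sap->sp", pi, P)  # s -> s' state transition matrix induced by pi

# Successor representation in matrix form: M^pi = P (Id - gamma P_pi)^{-1}.
M = P_sa @ np.linalg.inv(np.eye(nS) - gamma * P_pi)

# Bellman check M^pi = P + gamma P_pi M^pi, using the policy-averaged next-state SR.
M_state = np.einsum("sa,sat->st", pi, M.reshape(nS, nA, nS))
assert np.allclose(M, P_sa + gamma * P_sa @ M_state)

# Q^pi_r = M^pi r for an arbitrary reward on the next state, versus direct evaluation.
r = rng.random(nS)
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, P_pi @ r)   # V^pi = (Id - gamma P_pi)^{-1} P_pi r
Q_direct = P_sa @ r + gamma * P_sa @ V
assert np.allclose(M @ r, Q_direct)
print("SR identities verified on a random 5-state MDP")
```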
Zero-shot RL from successor features and forward-backward representations. Successor representations provide a generic framework for zero-shot RL, by learning to represent the relationship between reward functions and $Q$-functions, as encoded in $M^\pi$. Given a basic feature map $\phi\colon S \to \mathbb{R}^d$ to be learned via another criterion, universal SFs (Borsa et al., 2018) learn the successor features of a particular family of policies $\pi_z$ for $z \in \mathbb{R}^d$,
$$\psi(s_0, a_0, z) = \mathbb{E}\Big[\textstyle\sum_{t \geq 0} \gamma^t \phi(s_{t+1}) \,\Big|\, (s_0, a_0), \pi_z\Big], \qquad \pi_z(s) := \arg\max_a \psi(s, a, z)^\top z. \quad (4)$$
Once a reward function $r$ is revealed, we use a few reward samples or explicit knowledge of the function $r$ to perform a linear regression of $r$ onto the features $\phi$. Namely, we estimate $z_r := \arg\min_z \mathbb{E}_{s \sim \rho}[(r(s) - \phi(s)^\top z)^2] = \mathbb{E}_\rho[\phi\phi^\top]^{-1}\, \mathbb{E}_\rho[\phi\, r]$. Then we return the policy $\pi_{z_r}$. This policy is guaranteed to be optimal for all rewards in the linear span of the features $\phi$:

Theorem 1 (Borsa et al. (2018)). Assume that (4) holds. Assume there exists a weight $w \in \mathbb{R}^d$ such that $r(s) = \phi(s)^\top w$ for all $s \in S$. Then $z_r = w$, and $\pi_{z_r}$ is the optimal policy for reward $r$.

Forward-backward (FB) representations (Touati & Ollivier, 2021) apply a similar idea to a finite-rank model of successor measures. They look for representations $F\colon S \times A \times \mathbb{R}^d \to \mathbb{R}^d$ and $B\colon S \to \mathbb{R}^d$ such that the long-term transition probabilities $M^{\pi_z}$ in (3) decompose as
$$M^{\pi_z}(s_0, a_0, \mathrm{d}s') \approx F(s_0, a_0, z)^\top B(s')\, \rho(\mathrm{d}s'), \qquad \pi_z(s) := \arg\max_a F(s, a, z)^\top z. \quad (5)$$
In a finite space, the first equation rewrites as the matrix decomposition $M^{\pi_z} = F_z^\top B \operatorname{diag}(\rho)$. Once a reward function $r$ is revealed, we estimate $z_r := \mathbb{E}_{s \sim \rho}[r(s) B(s)]$ from a few reward samples or from explicit knowledge of the function $r$ (e.g., $z_r = B(s)$ to reach $s$). Then we return the policy $\pi_{z_r}$. If the approximation (5) holds, this policy is guaranteed to be optimal for any reward function:

Theorem 2 (Touati & Ollivier (2021)). Assume that (5) holds. Then for any reward function $r$, the policy $\pi_{z_r}$ is optimal for $r$, with optimal $Q$-function $Q^\star_r = F(s, a, z_r)^\top z_r$.

For completeness, we sketch the proofs of Theorems 1 and 2 in Appendix A. Importantly, both theorems are compatible with approximation: approximate solutions provide approximately optimal policies.
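In both cases, the test-time computation reduces to a few lines. Below is a minimal sketch (ours; the names `phi`, `B` and `policy` are placeholders for pretrained networks, not the released code) of the two task-inference rules: least-squares regression of the reward onto $\phi$ for SFs, and $z_r = \mathbb{E}_\rho[r(s)B(s)]$ for FB.

```python
import torch

@torch.no_grad()
def infer_z_sf(phi, states, rewards, reg=1e-5):
    """SF task inference: z_r = E[phi phi^T]^{-1} E[phi r],
    estimated from a batch of reward samples (s_i, r_i)."""
    feats = phi(states)                                    # (n, d)
    d = feats.shape[-1]
    cov = feats.T @ feats / len(states)
    cov = cov + reg * torch.eye(d, device=feats.device, dtype=feats.dtype)
    return torch.linalg.solve(cov, feats.T @ rewards / len(states))

@torch.no_grad()
def infer_z_fb(B, states, rewards):
    """FB task inference: z_r = E_{s ~ rho}[r(s) B(s)].
    For a goal-reaching task, one can instead set z_r = B(goal) directly."""
    return (rewards.unsqueeze(-1) * B(states)).mean(dim=0)

# At test time the pretrained policy network is simply conditioned on z_r,
# with no further learning or planning:  action = policy(state, z_r)
```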
Connections between SFs and FB. A first difference between SFs and FB is that SFs must be provided with basic features $\phi$. The best $\phi$ is such that the reward functions of the downstream tasks are linear in $\phi$. But for unsupervised training without prior task knowledge, an external criterion is needed to learn $\phi$. We test a series of such criteria below. In contrast, FB uses a single criterion, avoiding the need for state featurization by learning a model of state occupancy.

Second, SFs only cover rewards in the linear span of $\phi$, while FB apparently covers any reward. But this difference is not as stark as it looks: exactly solving the FB equation (5) in continuous spaces requires $d = \infty$, and for finite $d$, the policies will only be optimal for rewards in the linear span of $B$ (Touati & Ollivier, 2021). Thus, in both cases, policies are exactly optimal only for a $d$-dimensional family of rewards. Still, FB can use an arbitrarily large $d$ without any additional input or criterion.

FB representations are related to successor features: the FB definition (5) implies that $\psi(s, a, z) := F(s, a, z)$ are the successor features of $\phi(s) := (\mathbb{E}_\rho B B^\top)^{-1} B(s)$. This follows from multiplying (3) and (5) by $B^\top (\mathbb{E}_\rho B B^\top)^{-1}$ on the right, and integrating over $s' \sim \rho$. Thus, a posteriori, FB can be used to produce both the $\phi$ and the $\psi$ of SFs, although training is different. This connection between FB and SF is one-directional: (5) is a stronger condition. In particular, $F = B = 0$ is not a solution: contrary to $\psi = \phi = 0$ in (4), there is no collapse. No additional criterion to train $\phi$ is required: $F$ and $B$ are trained jointly to provide the best rank-$d$ approximation of the successor measures $M^\pi$. This summarizes an environment by selecting the features that best describe the relationship $Q^\pi_r = M^\pi r$ between rewards and $Q$-functions.

5 ALGORITHMS FOR SUCCESSOR FEATURES AND FB REPRESENTATIONS

We now describe more precisely the algorithms used in our experiments. The losses used to train $\psi$ in SFs, and $F$, $B$ in FB, are described in Sections 5.1 and 5.2 respectively. To obtain a full zero-shot RL algorithm, SFs must specify the basic features $\phi$. Any representation learning method can be used for $\phi$. We use ten possible choices (Section 5.3) based on existing or new representations for RL: random features as a baseline, autoencoders, next-state and latent next-state transition models, inverse curiosity module, the diversity criterion of APS, contrastive learning, and finally, several spectral decompositions of the transition matrix or its associated Laplacian.

Both SFs and FB define policies as an argmax of $\psi(s, a, z)^\top z$ or $F(s, a, z)^\top z$ over actions $a$. With continuous actions, the argmax cannot be computed exactly. We train an auxiliary policy network $\pi(s, z)$ to approximate this argmax, using the same standard method for SFs and FB (Appendix G.4).

5.1 LEARNING THE SUCCESSOR FEATURES $\psi$

The successor features $\psi$ satisfy the $\mathbb{R}^d$-valued Bellman equation $\psi^\pi = P\phi + \gamma P_\pi \psi^\pi$, the collection of ordinary Bellman equations for each component of $\phi$. The $P$ in front of $\phi$ comes from using $\phi(s_{t+1})$ rather than $\phi(s_t)$ in (2). Therefore, we can train $\psi(s, a, z)$ for each $z$ by minimizing the Bellman residuals $\|\psi(s_t, a_t, z) - \phi(s_{t+1}) - \gamma\, \bar\psi(s_{t+1}, \pi_z(s_{t+1}), z)\|^2$, where $\bar\psi$ is a non-trainable target version of $\psi$, as in parametric $Q$-learning. This requires sampling a transition $(s_t, a_t, s_{t+1})$ from the dataset and choosing $z$. We sample random values of $z$ as described in Appendix G.3. This is the loss used in Borsa et al. (2018).

But this can be improved, since we do not use the full vector $\psi(s, a, z)$: only $\psi(s, a, z)^\top z$ is needed for the policies. Therefore, as in Liu & Abbeel (2021), instead of the vector-valued Bellman residual above, we just use
$$\mathcal{L}(\psi) := \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\Big[\big(\psi(s_t, a_t, z)^\top z - \phi(s_{t+1})^\top z - \gamma\, \bar\psi(s_{t+1}, \pi_z(s_{t+1}), z)^\top z\big)^2\Big] \quad (6)$$
for each $z$. This trains $\psi(\cdot, z)^\top z$ as the $Q$-function of reward $\phi^\top z$, the only case needed, while training the full vector $\psi(\cdot, z)$ amounts to training the $Q$-functions of each policy $\pi_z$ for all rewards $\phi^\top z'$ for all $z' \in \mathbb{R}^d$, including $z' = z$. We have found this improves performance.

5.2 LEARNING FB REPRESENTATIONS: THE FB TRAINING LOSS

The successor measure $M^\pi$ satisfies a Bellman-like equation $M^\pi = P + \gamma P_\pi M^\pi$, as matrices in the finite case and as measures in the general case (Blier et al., 2021). We can learn FB by iteratively minimizing the Bellman residual on the parametric model $M = F^\top B \rho$. Using a suitable norm $\|\cdot\|_\rho$ for the Bellman residual (Appendix B) leads to a loss expressed as expectations from the dataset:
$$\mathcal{L}(F, B) := \big\| F_z^\top B \rho - \big(P + \gamma P_{\pi_z} \bar F_z^\top \bar B \rho\big) \big\|_\rho^2 = \mathbb{E}_{\substack{(s_t, a_t, s_{t+1}) \sim \rho \\ s' \sim \rho}}\Big[\big(F(s_t, a_t, z)^\top B(s') - \gamma\, \bar F(s_{t+1}, \pi_z(s_{t+1}), z)^\top \bar B(s')\big)^2\Big] - 2\, \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\big[F(s_t, a_t, z)^\top B(s_{t+1})\big] + \text{Const} \quad (8)$$
where the constant term does not depend on $F$ and $B$, and where, as usual, $\bar F_z$ and $\bar B$ are non-trainable target versions of $F$ and $B$ whose parameters are updated with a slow-moving average of those of $F$ and $B$. Appendix B quickly derives this loss, with pseudocode in Appendix L. Contrary to Touati & Ollivier (2021), the last term involves $B(s_{t+1})$ instead of $B(s_t)$, because we use $s_{t+1}$ instead of $s_t$ for the successor measures (3).
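For concreteness, here is a condensed PyTorch sketch of this loss (ours, in the spirit of the Appendix L snippets but not identical to the released code; `F_net`, `B_net`, their target copies and `policy` are placeholder modules). It also includes the covariance normalization of $B$ discussed below and in Appendix B.

```python
import torch

def fb_loss(F_net, B_net, F_tgt, B_tgt, policy, batch, z, gamma, orth_coef=1.0):
    """Sketch of the FB loss (8) plus the orthonormalization of B (Appendix B).
    Independent states s' ~ rho are drawn from the same mini-batch, so the
    off-diagonal entries of an n x n matrix play the role of E_{s' ~ rho}."""
    s, a, s_next = batch["obs"], batch["action"], batch["next_obs"]

    with torch.no_grad():
        a_next = policy(s_next, z)                  # approximates argmax_a F(s', a, z)^T z
        M_tgt = gamma * F_tgt(s_next, a_next, z) @ B_tgt(s_next).T   # bootstrapped target

    F_cur = F_net(s, a, z)                          # (n, d)
    B_cur = B_net(s_next)                           # (n, d): B evaluated at s_{t+1}
    M_cur = F_cur @ B_cur.T                         # (n, n): F(s_t, a_t, z)^T B(s'_j)

    n = M_cur.shape[0]
    off_diag = ~torch.eye(n, dtype=torch.bool, device=M_cur.device)
    # First term of (8) on independent pairs, second term -2 E[F^T B(s_{t+1})] on the diagonal.
    fb = (M_cur - M_tgt)[off_diag].pow(2).mean() - 2 * M_cur.diagonal().mean()

    # Push the empirical covariance E_rho[B B^T] towards Id (otherwise F can grow while B shrinks).
    cov = B_cur.T @ B_cur / n                       # (d, d)
    orth = (cov - torch.eye(cov.shape[0], device=cov.device)).pow(2).sum()
    return fb + orth_coef * orth
```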
We sample random values of $z$ as described in Appendix G.3. We include an auxiliary loss (Appendix B) to normalize the covariance of $B$, $\mathbb{E}_\rho[B B^\top] \approx \mathrm{Id}$, as in Touati & Ollivier (2021) (otherwise one can, e.g., scale $F$ up and $B$ down, since only $F^\top B$ is fixed).

As with SFs above, only $F(\cdot, z)^\top z$ is needed for the policies, while the loss above on the vector $F$ amounts to training $F(\cdot, z)^\top z'$ for all pairs $(z, z')$. The full loss is needed for joint training of $F$ and $B$. But we include an auxiliary loss $\mathcal{L}'(F)$ to focus training on the diagonal $z' = z$. This is obtained by multiplying the Bellman gap in $\mathcal{L}$ by $B^\top (B \rho B^\top)^{-1} z$ on the right, to make $F(\cdot, z)^\top z$ appear:
$$\mathcal{L}'(F) := \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\Big[\big(F(s_t, a_t, z)^\top z - \bar B(s_{t+1})^\top (\mathbb{E}_\rho \bar B \bar B^\top)^{-1} z - \gamma\, \bar F(s_{t+1}, \pi_z(s_{t+1}), z)^\top z\big)^2\Big]. \quad (9)$$
This trains $F(\cdot, z)^\top z$ as the $Q$-function for reward $B^\top (\mathbb{E}_\rho B B^\top)^{-1} z$. Though $\mathcal{L} = 0$ implies $\mathcal{L}' = 0$, adding $\mathcal{L}'$ reduces the error on the part used for the policies. This departs from Touati & Ollivier (2021).

5.3 LEARNING BASIC FEATURES $\phi$ FOR SUCCESSOR FEATURES

SFs must be provided with basic state features $\phi$. Any representation learning method can be used to supply $\phi$. We focus on prominent RL representation learning baselines, and on those used in previous zero-shot RL candidates such as APS. We now describe the precise learning objective for each.

Random Features (Rand). We use a non-trainable, randomly initialized network as features.

Autoencoder (AEnc). We learn a decoder $f\colon \mathbb{R}^d \to S$ to recover the state from its representation $\phi$:
$$\min_{f, \phi}\ \mathbb{E}_{s \sim \mathcal{D}}\big[(f(\phi(s)) - s)^2\big]. \quad (10)$$

Inverse Curiosity Module (ICM) aims at extracting the controllable aspects of the environment (Pathak et al., 2017). The idea is to train an inverse dynamics model $g\colon \mathbb{R}^d \times \mathbb{R}^d \to A$ to predict the action used for a transition between two consecutive states. We use the loss
$$\min_{g, \phi}\ \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\big[\| g(\phi(s_t), \phi(s_{t+1})) - a_t \|^2\big]. \quad (11)$$

Transition model (Trans). This is a one-step forward dynamics model $f\colon \mathbb{R}^d \times A \to S$ that predicts the next state from the current state representation:
$$\min_{f, \phi}\ \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\big[(f(\phi(s_t), a_t) - s_{t+1})^2\big]. \quad (12)$$

Latent transition model (Latent). This is similar to the transition model, but instead of predicting the next state, it predicts its representation:
$$\min_{f, \phi}\ \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\big[(f(\phi(s_t), a_t) - \phi(s_{t+1}))^2\big]. \quad (13)$$
A clear failure case of this loss is when all states are mapped to the same representation. To avoid this collapse, we compute $\phi(s_{t+1})$ using a non-trainable version of $\phi$, with parameters corresponding to a slowly moving average of the parameters of $\phi$, similarly to BYOL (Grill et al., 2020); see the code sketch below.

Diversity methods (APS). VISR (Hansen et al., 2019) and its successor APS (Liu & Abbeel, 2021) tackle zero-shot RL using SFs with features $\phi$ built online from a diversity criterion. This criterion maximizes the mutual information between a policy parameter and the features of the states visited by a policy using that parameter (Eysenbach et al., 2018; Gregor et al., 2016). VISR and APS use, respectively, a variational or nearest-neighbor estimator for the mutual information. We directly use the code provided for APS, and refer to (Liu & Abbeel, 2021) for the details. Contrary to other methods, APS is not offline: it needs to be trained on its own replay buffer.
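Several of the feature-learning objectives above, (10)-(13), are plain regression losses. As an illustration, here is a minimal PyTorch sketch (ours; `phi`, `phi_ema` and `f` are placeholder modules) of the latent transition loss (13) with the BYOL-style moving-average target used to prevent collapse.

```python
import torch

def latent_transition_loss(phi, phi_ema, f, batch):
    """Sketch of the latent transition loss (13): predict the representation of the
    next state, using a frozen moving-average copy phi_ema of phi as the target
    (BYOL-style) to avoid the collapse where every state gets the same features."""
    s, a, s_next = batch["obs"], batch["action"], batch["next_obs"]
    with torch.no_grad():
        target = phi_ema(s_next)                    # slow-moving target features
    pred = f(torch.cat([phi(s), a], dim=-1))        # one-step prediction in latent space
    return (pred - target).pow(2).mean()

@torch.no_grad()
def ema_update(phi, phi_ema, tau=0.01):
    """After each gradient step: theta_ema <- (1 - tau) * theta_ema + tau * theta."""
    for p, p_ema in zip(phi.parameters(), phi_ema.parameters()):
        p_ema.mul_(1 - tau).add_(p, alpha=tau)
```

Here `phi_ema` would be initialized as a detached copy of `phi` (e.g. `copy.deepcopy(phi)`) and updated with `ema_update` after every optimizer step.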
Laplacian Eigenfunctions (Lap). Wu et al. (2018) consider the symmetrized MDP graph Laplacian induced by an exploratory policy $\pi$, defined as $\mathcal{L} = \mathrm{Id} - \tfrac{1}{2}\big(P_\pi \operatorname{diag}(\rho)^{-1} + \operatorname{diag}(\rho)^{-1} P_\pi^\top\big)$. They propose to learn the eigenfunctions of $\mathcal{L}$ via the spectral graph drawing objective (Koren, 2003):
$$\min_\phi\ \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}}\big[\|\phi(s_t) - \phi(s_{t+1})\|^2\big] + \lambda\, \mathbb{E}_{\substack{s \sim \mathcal{D} \\ s' \sim \mathcal{D}}}\big[(\phi(s)^\top \phi(s'))^2 - \|\phi(s)\|_2^2 - \|\phi(s')\|_2^2\big] \quad (14)$$
where the second term is an orthonormality regularization ensuring that $\mathbb{E}_{s \sim \rho}[\phi(s)\phi(s)^\top] \approx \mathrm{Id}$, and $\lambda > 0$ is the regularization weight. This is implicitly contrastive, pushing the features of $s_t$ and $s_{t+1}$ closer while keeping features apart overall. Such eigenfunctions have long been argued to play a key role in RL (Mahadevan & Maggioni, 2007; Machado et al., 2017).

Low-Rank Approximation of $P$ (LRA-P). We learn features by estimating a low-rank model of the transition probability densities: $P(\mathrm{d}s'|s, a) \approx \chi(s, a)^\top \mu(s')\, \rho(\mathrm{d}s')$. Knowing $\rho$ is not needed: the corresponding loss on $\chi^\top \mu - P/\rho$ is readily expressed as expectations over the dataset,
$$\min_{\chi, \mu}\ \mathbb{E}_{\substack{(s_t, a_t) \sim \rho \\ s' \sim \rho}}\Big[\Big(\chi(s_t, a_t)^\top \mu(s') - \frac{P(\mathrm{d}s'|s_t, a_t)}{\rho(\mathrm{d}s')}\Big)^2\Big] = \mathbb{E}_{\substack{(s_t, a_t) \sim \rho \\ s' \sim \rho}}\big[(\chi(s_t, a_t)^\top \mu(s'))^2\big] - 2\, \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \rho}\big[\chi(s_t, a_t)^\top \mu(s_{t+1})\big] + \text{Const} \quad (16)$$
We normalize $\mathbb{E}_\rho[\mu\mu^\top] \approx \mathrm{Id}$ with the same loss used for $B$. Then we use SFs with $\phi := \mu$. If the model $P = \chi^\top \mu \rho$ is exact, this provides exact optimal policies for any reward (Appendix E, Thm. 3). This loss is implicitly contrastive: it compares samples $s_{t+1}$ to independent samples $s' \sim \rho$. It is an asymmetric extension of the Laplacian loss (14) (Appendix F). The loss (16) is also a special case of the FB loss (8), obtained by setting $\gamma = 0$, omitting $z$, and substituting $(\chi, \mu)$ for $(F, B)$. Indeed, FB learns a finite-rank model of $P(\mathrm{Id} - \gamma P_\pi)^{-1}$, which equals $P$ when $\gamma = 0$. A related loss is introduced in Ren et al. (2022), but it involves a second unspecified, arbitrary probability distribution $p$, which must cover the whole state space and whose analytic expression must be known. It is unclear how to set a suitable $p$ in general.

Contrastive Learning (CL) methods learn representations by pushing positive pairs (similar states) closer together while keeping negative pairs apart. Here, two states are considered similar if they lie close on the same trajectory. We use a SimCLR-like objective (Chen et al., 2020):
$$\min_{\chi, \phi}\ -\mathbb{E}_{\substack{k \sim \mathrm{Geom}(1 - \gamma_{\mathrm{CL}}) \\ (s_t, s_{t+k}) \sim \mathcal{D}}}\Big[\log \frac{\exp(\mathrm{cosine}(\chi(s_t), \phi(s_{t+k})))}{\mathbb{E}_{s' \sim \mathcal{D}} \exp(\mathrm{cosine}(\chi(s_t), \phi(s')))}\Big] \quad (17)$$
where $s_{t+k}$ is the state encountered at step $t + k$ along the sub-trajectory that starts at $s_t$, where $k$ is sampled from a geometric distribution with parameter $(1 - \gamma_{\mathrm{CL}})$, and $\mathrm{cosine}(u, v) = \frac{u^\top v}{\|u\|_2 \|v\|_2}$ for $u, v \in \mathbb{R}^d$ is the cosine similarity function. Here $\gamma_{\mathrm{CL}} \in [0, 1)$ is a parameter not necessarily set to the MDP's discount factor $\gamma$. CL requires a dataset made of full trajectories instead of isolated transitions.

CL is tightly related to the spectral decomposition of the successor measure $\sum_t \gamma_{\mathrm{CL}}^t P_\pi^{t+1}$, where $\pi$ is the behavior policy generating the dataset trajectories. Precisely, assuming that $\chi$ and $\phi$ are centered with unit norm, and expanding the log and exp at second order, the loss (17) becomes
$$\tfrac{1}{2}\, \mathbb{E}_{\substack{s \sim \mathcal{D} \\ s' \sim \mathcal{D}}}\big[(\chi(s)^\top \phi(s'))^2\big] - \mathbb{E}_{\substack{k \sim \mathrm{Geom}(1 - \gamma_{\mathrm{CL}}) \\ (s_t, s_{t+k}) \sim \mathcal{D}}}\big[\chi(s_t)^\top \phi(s_{t+k})\big] \quad (18)$$
(compare (16)). Now, the law of $s_{t+k}$ given $s_t$ is given by the stochastic matrix $(1 - \gamma_{\mathrm{CL}}) \sum_t \gamma_{\mathrm{CL}}^t P_\pi^{t+1}$, the rescaled successor measure of $\pi$. Then one finds that (18) is minimized when $\chi$ and $\phi$ provide the singular value decomposition of this matrix in $L^2(\rho)$ norm (Appendix C). Formal links between contrastive learning and spectral methods can be found in Tian (2022); Balestriero & Le Cun (2022).
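A minimal PyTorch sketch of the contrastive objective (17) (ours, not the released implementation; `chi` and `phi` are placeholder encoders), with the geometric sampling of $k$ and in-batch states standing in for the $\mathbb{E}_{s' \sim \mathcal{D}}$ term in the denominator:

```python
import torch
import torch.nn.functional as nnF

def cl_loss(chi, phi, traj_states, gamma_cl=0.98):
    """Sketch of the contrastive loss (17). traj_states: (n, T, obs_dim) sub-trajectories.
    Positives are (s_t, s_{t+k}) with k ~ Geom(1 - gamma_cl); the other trajectories in
    the batch serve as negatives (approximating the E_{s' ~ D} denominator)."""
    n, T, _ = traj_states.shape
    k = torch.distributions.Geometric(1 - gamma_cl).sample((n,)).long() + 1
    k = k.clamp(max=T - 1)                                 # keep t + k inside the trajectory
    anchors = chi(traj_states[:, 0])                       # chi(s_t), taking t = 0 here
    candidates = phi(traj_states[torch.arange(n), k])      # phi(s_{t+k}) for each trajectory

    # Cosine similarities between every anchor and every candidate.
    logits = nnF.normalize(anchors, dim=-1) @ nnF.normalize(candidates, dim=-1).T   # (n, n)
    labels = torch.arange(n, device=logits.device)         # the matching pair is the positive
    return nnF.cross_entropy(logits, labels)               # -E[log exp(sim+) / sum_j exp(sim_j)]
```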
Low-Rank Approximation of SR (LRA-SR). The CL method implicitly factorizes the successor measure of the exploration policy in Monte Carlo fashion, by sampling pairs $(s_t, s_{t+k})$ on the same trajectory. This may suffer from high variance. To mitigate this, we propose to factorize this successor measure by temporal-difference learning instead of Monte Carlo. This is achieved with an FB-like loss (8), except that we drop the policies $\pi_z$ and learn successor measures for the exploration policy only:
$$\min_{\chi, \phi}\ \mathbb{E}_{\substack{(s_t, s_{t+1}) \sim \mathcal{D} \\ s' \sim \mathcal{D}}}\Big[\big(\chi(s_t)^\top \phi(s') - \gamma\, \bar\chi(s_{t+1})^\top \bar\phi(s')\big)^2\Big] - 2\, \mathbb{E}_{(s_t, s_{t+1}) \sim \mathcal{D}}\big[\chi(s_t)^\top \phi(s_{t+1})\big] \quad (19)$$
with $\bar\chi$ and $\bar\phi$ target versions of $\chi$ and $\phi$. We normalize $\phi$ so that $\mathbb{E}_\rho[\phi\phi^\top] \approx \mathrm{Id}$, with the same loss as for $B$. Of all SF variants tested, this is the closest to FB.

Figure 2: Zero-shot scores for each task, with supervised online and offline TD3 as toplines. Average over 3 replay buffers and 10 random seeds.

6 EXPERIMENTAL RESULTS ON BENCHMARKS

Each of the 11 methods (FB and 10 SF-based models) has been tested on 13 tasks in 4 environments from the Unsupervised RL and ExORL benchmarks (Laskin et al., 2021; Yarats et al., 2021): Maze (reach 20 goals), Walker (stand, walk, run, flip), Cheetah (walk, run, walk backwards, run backwards), and Quadruped (stand, walk, run, jump); see Appendix G.1. Each task and method was repeated for 3 choices of replay buffer from ExORL: RND, APS, and Proto (except for the APS method, which can only train on the APS buffer). Each of these 403 settings was repeated with 10 random seeds. The full setup is described in Appendix G. The representation dimension is $d = 50$, except for Maze ($d = 100$). After model training, tasks are revealed by 10,000 reward samples as in (Liu & Abbeel, 2021), except for Maze, where a known goal is presented and used to set $z_r$ directly. The code can be found at https://github.com/facebookresearch/controllable_agent.

As toplines, we use online TD3 (with task rewards, and free environment interactions not restricted to a replay buffer), and offline TD3 (restricted to each replay buffer labelled with task rewards). Offline TD3 gives an idea of the best achievable performance given the training data in a buffer.

In Fig. 2 we plot the performance of each method for each task in each environment, averaged over the three replay buffers and ten random seeds. Appendix H contains the full results and more plots. Compared to offline TD3 as a reference, on the Maze tasks, Lap, LRA-P, and FB perform well. On the Walker tasks, ICM, Trans, Lap, LRA-SR, and FB perform well. On the Cheetah tasks, ICM, Trans, Lap, LRA-SR, and FB perform well. On the Quadruped tasks, many methods perform well, including, surprisingly, random features.

Appendix I plots some of the learned features on Maze. On Maze, none of the encoder-based losses learn good policies, contrary to FB and the spectral SF methods. We believe this is because $(x, y)$ is already a good representation of the 2D state $s$, so the encoders do nothing. Yet SFs on $(x, y)$ cannot solve the task (SFs on rewards of the form $ax + by$ don't recover goal-oriented tasks): planning with SFs requires specific representations.

Fig. 1 reports aggregated scores over all tasks.
To average across tasks, we normalize scores: for each task and replay buffer, performance is expressed as a percentage of the performance of offline TD3 on the same replay buffer, a natural supervised topline given the data. These normalized scores are averaged over all tasks in each environment, then over environments to yield the scores in Fig. 1. The variations over environments, replay buffers and random seeds are reported in Appendix H. These results are broadly consistent over replay buffers. Buffer-specific results are reported in Appendices H.3 H.4. Sometimes a replay buffer is restrictive, as attested by poor offline TD3 performance, starkly so for the Proto buffer on all Quadruped tasks. APS does not work well as a zero-shot RL method, but it does work well as an exploration method: on average, the APS and RND replay buffers have close results. FB and Lap are the only methods that perform consistently well, both over tasks and over replay buffers. Averaged on all tasks and buffers, FB reaches 81% of supervised offline TD3 performance, and 85% on the RND buffer. The second-best method is Lap with 74% (78% on the Proto buffer). Published as a conference paper at ICLR 2023 7 DISCUSSION AND LIMITATIONS Can a few features solve many reward functions? Why are some features better? The choice of features is critical. Consider goal-reaching tasks: the obvious way to learn to reach arbitrary goal states via SFs is to define one feature per possible goal state (one-hot encoding). This requires |𝑆| features, and does not scale to continuous spaces. But much better features exist. For instance, with an size-𝑛cycle 𝑆= {0, . . . , 𝑛 1} with actions that move left and right modulo 𝑛, then two features suffice instead of 𝑛: SFs with 𝜙(𝑠) = (cos(2𝜋𝑠/𝑛), sin(2𝜋𝑠/𝑛)) provides exact optimal policies to reach any arbitrary state. On a 𝑑-dimensional grid 𝑆= {0, . . . , 𝑛 1}𝑑, just 2𝑑features (a sine and cosine in each direction) are sufficient to reach any of the 𝑛𝑑goal states via SFs. Goal-reaching is only a subset of possible RL tasks, but this clearly shows that some features are better than others. The sine and cosine are the main eigenfunctions of the graph Laplacian on the grid: such features have long been argued to play a special role in RL (Mahadevan & Maggioni, 2007; Machado et al., 2017). FB-based methods are theoretically known to learn such eigenfunctions (Blier et al., 2021). Yet a precise theoretical link to downstream performance is still lacking. Are these finite-rank models reasonable? FB crucially relies on a finite-rank model of 𝛾𝑡𝑃𝑡 𝜋, while some SF variants above rely on finite-rank models of 𝑃or the corresponding Laplacian. It turns out such approximations are very different for 𝑃or for 𝛾𝑡𝑃𝑡 𝜋. Unfortunately, despite the popularity of low-rank 𝑃assumptions in the theoretical literature, 𝑃𝜋 is always close to Id in situations where 𝑠𝑡+1 is close to 𝑠𝑡, such as any continuous-time physical system (Appendix D). Any low-rank model of 𝑃𝜋will be poor; actually 𝑃𝜋is better modeled as Id low-rank, thus approximating the Laplacian. On the other hand, 𝑃𝑡 𝜋with large 𝑡gets close to rank one under weak assumptions (ergodicity), as 𝑃𝑡 𝜋converges to an equilibrium distribution when 𝑡 . Thus the spectrum of the successor measures 𝛾𝑡𝑃𝑡 𝜋is usually more spread out (details and examples in Appendix D), and a low-rank model makes sense. The eigenvalues of 𝑃𝜋are close to 1 and there is little signal to differentiate between eigenvectors, but differences become clearer over time on 𝑃𝑡 𝜋. 
This may explain the better performance of FB and LRA-SR compared to LRA-P. Limitations. First, these experiments are still small-scale and performance is not perfect, so there is space for improvement. Also, all the environments tested here were deterministic: all algorithms still make sense in stochastic environments, but experimental conclusions may differ. Second, even though these methods can be coupled with any exploration technique, this will obviously not cover tasks too different from the actions in the replay buffer, as with any offline RL method. These zero-shot RL algorithms learn to summarize the long-term future for a wide range of policies (though without synthesizing trajectories). This is a lot: in contrast, world models only learn policyindependent one-step transitions. So the question remains of how far this can scale. A priori, there is no restriction on the inputs (e.g., images, state history...). Still, for large problems, some form of prior seems unavoidable. For SFs, priors can be integrated in 𝜙, but rewards must be linear in 𝜙. For FB, priors can be integrated in 𝐵 s input. Touati & Ollivier (2021) use FB with pixel-based inputs for 𝐹, but only the agent s (𝑥, 𝑦) position for 𝐵 s input: this recovers all rewards that are functions of (𝑥, 𝑦) (linear or not). Breaking the symmetry of 𝐹and 𝐵reduces the strain on the model by restricting predictions to fewer variables, such as an agent s future state instead of the full environment. 8 CONCLUSIONS Zero-shot RL methods avoid test-time planning by summarizing long-term state-state relationships for particular policies. We systematically tested forward-backward representations and many new models of successor features on zero-shot tasks. We also uncovered algebraic links between SFs, FB, contrastive learning, and spectral methods. Overall, SFs suffer from their dependency to basic feature construction, with only Laplacian eigenfunctions working reliably. Notably, planning with SFs requires specific features: SFs can fail with generic encoder-type feature learning, even if the learned representation of states is reasonable (such as (𝑥, 𝑦)). Forward-backward representations were best across the board, and provide reasonable zero-shot RL performance. Published as a conference paper at ICLR 2023 ACKNOWLEDGEMENTS The authors would like to thank Olivier Delalleau, Armand Joulin, Alessandro Lazaric, Sergey Levine, Matteo Pirotta, Andrea Tirinzoni, and the anonymous reviewers for helpful comments and questions on the research and manuscript. Marcin Andrychowicz, Dwight Crow, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob Mc Grew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In NIPS, 2017. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. ar Xiv preprint ar Xiv:1607.06450, 2016. Randall Balestriero and Yann Le Cun. Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. ar Xiv preprint ar Xiv:2205.11508, 2022. André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, David Silver, and Hado P van Hasselt. Successor features for transfer in reinforcement learning. In NIPS, 2017. Léonard Blier, Corentin Tallec, and Yann Ollivier. Learning successor states and goal-dependent values: A mathematical viewpoint. ar Xiv preprint ar Xiv:2101.07123, 2021. Diana Borsa, André Barreto, John Quan, Daniel Mankowitz, Rémi Munos, Hado van Hasselt, David Silver, and Tom Schaul. 
Universal successor features approximators. ar Xiv preprint ar Xiv:1812.07626, 2018. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877 1901, 2020. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597 1607. PMLR, 2020. Kurtland Chua, Roberto Calandra, Rowan Mc Allister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in neural information processing systems, 31, 2018. Bruno Castro da Silva, George Konidaris, and Andrew G Barto. Learning parameterized skills. In ICML, 2012. Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613 624, 1993. Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2018. David Foster and Peter Dayan. Structure in the space of value functions. Machine Learning, 49(2): 325 346, 2002. Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actorcritic methods. In International conference on machine learning, pp. 1587 1596. PMLR, 2018. Sahika Genc, Sunil Mallya, Sravan Bodapati, Tao Sun, and Yunzhe Tao. Zero-shot reinforcement learning with deep attention convolutional neural networks. ar Xiv preprint ar Xiv:2001.00605, 2020. Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. ar Xiv preprint ar Xiv:1611.07507, 2016. Published as a conference paper at ICLR 2023 Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271 21284, 2020. Christopher Grimm, Irina Higgins, Andre Barreto, Denis Teplyashin, Markus Wulfmeier, Tim Hertweck, Raia Hadsell, and Satinder Singh. Disentangled cumulants help successor representations transfer to new tasks. ar Xiv preprint ar Xiv:1911.10866, 2019. Steven Hansen, Will Dabney, Andre Barreto, Tom Van de Wiele, David Warde-Farley, and Volodymyr Mnih. Fast task inference with variational intrinsic successor features. ar Xiv preprint ar Xiv:1906.05030, 2019. Yehuda Koren. On spectral graph drawing. In International Computing and Combinatorics Conference, pp. 496 508. Springer, 2003. Michael Laskin, Denis Yarats, Hao Liu, Kimin Lee, Albert Zhan, Kevin Lu, Catherine Cang, Lerrel Pinto, and Pieter Abbeel. Urlb: Unsupervised reinforcement learning benchmark. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. Michael Laskin, Hao Liu, Xue Bin Peng, Denis Yarats, Aravind Rajeswaran, and Pieter Abbeel. Cic: Contrastive intrinsic control for unsupervised skill discovery. ar Xiv preprint ar Xiv:2202.00161, 2022. David A Levin, Yuval Peres, and Elisabeth L Wilmer. Markov chains and mixing times. American Mathematical Soc., Providence, 2009. Hao Liu and Pieter Abbeel. APS: Active pretraining with successor features. 
In International Conference on Machine Learning, pp. 6736 6747. PMLR, 2021. Clare Lyle, Mark Rowland, and Will Dabney. Understanding and preventing capacity loss in reinforcement learning. ar Xiv preprint ar Xiv:2204.09560, 2022. Chen Ma, Dylan R Ashley, Junfeng Wen, and Yoshua Bengio. Universal successor features for transfer reinforcement learning. ar Xiv preprint ar Xiv:2001.04025, 2020. Marlos C Machado, Marc G Bellemare, and Michael Bowling. A Laplacian framework for option discovery in reinforcement learning. In International Conference on Machine Learning, pp. 2295 2304. PMLR, 2017. Sridhar Mahadevan and Mauro Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8(10), 2007. Thomas M Moerland, Joost Broekens, and Catholijn M Jonker. Model-based reinforcement learning: A survey. ar Xiv preprint ar Xiv:2006.16712, 2020. Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. In International Conference on Machine Learning, pp. 2661 2670. PMLR, 2017. Bernt Øksendal. Stochastic differential equations. Springer, 1998. Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International conference on machine learning, pp. 2778 2787. PMLR, 2017. Tongzheng Ren, Tianjun Zhang, Lisa Lee, Joseph E Gonzalez, Dale Schuurmans, and Bo Dai. Spectral decomposition representation for reinforcement learning. ar Xiv preprint ar Xiv:2208.09515, 2022. Published as a conference paper at ICLR 2023 Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In Francis Bach and David Blei (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1312 1320, Lille, France, 07 09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/ schaul15.html. Sungryull Sohn, Junhyuk Oh, and Honglak Lee. Hierarchical reinforcement learning for zero-shot generalization with subtask dependencies. Advances in Neural Information Processing Systems, 31, 2018. Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. 2nd edition. Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pp. 761 768, 2011. Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. ar Xiv preprint ar Xiv:1801.00690, 2018. Yuandong Tian. Deep contrastive learning is provably (almost) principal component analysis. ar Xiv preprint ar Xiv:2201.12680, 2022. Ahmed Touati and Yann Ollivier. Learning one representation to optimize all rewards. Advances in Neural Information Processing Systems, 34, 2021. Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. Yifan Wu, George Tucker, and Ofir Nachum. The Laplacian in RL: Learning representations with efficient approximations. In International Conference on Learning Representations, 2018. Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. 
Reinforcement learning with prototypical representations. In International Conference on Machine Learning, pp. 11920 11931. PMLR, 2021. Jingwei Zhang, Jost Tobias Springenberg, Joschka Boedecker, and Wolfram Burgard. Deep reinforcement learning with successor features for navigation across similar environments. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2371 2378. IEEE, 2017. Published as a conference paper at ICLR 2023 LIST OF APPENDICES Appendix A sketches a proof of the basic properties of SFs and the FB representation, Theorems 1 and 2, respectively from Borsa et al. (2018) and Touati & Ollivier (2021). Appendix B derives the loss (8) we use to learn 𝐹and 𝐵, and the auxiliary orthonormalization loss. Appendix C describes the precise mathematical relationship between the contrastive loss (17) and successor measures of the exploration policy. Appendix D further discusses why low-rank models make more sense on successor measures than on the transition matrix 𝑃. Appendix E proves that if 𝑃is low-rank given by 𝜒 𝜇, then successor features using 𝜙= 𝜇(but not 𝜒) provide optimal policies. Appendix F proves that the loss (16) used to learn low-rank 𝑃is an asymmetric version of the Laplacian eigenfunction loss (14). Appendix G describes the detailed experimental setup: environments, architectures, policy training, methods for sampling of 𝑧, and hyperparameters. The full code can be found at https://github.com/facebookresearch/controllable_agent Appendix H contains the full table of experimental results, as well as aggregate plots over several variables (per task, per replay buffer, etc.). Appendix I analyzes the learned features. We use feature rank analysis to study the degree of feature collapse for some methods. We also provide t-SNE visualizations of the learned embeddings for all methods (for the Maze environment). Appendix J provides hyperparameter sensitivity plots (latent dimension 𝑑, learning rate, batch size, mixing ratio for 𝑧sampling). Appendix K describes a further baseline where we take a goal-oriented method inspired from Ma et al. (2020), and extend it to dense rewards by linearity. Since results were very poor except for Maze (which is goal-oriented), we did not discuss it in the main text. Appendix L provides Py Torch snippets for the key losses, notably the FB loss, the SF loss as well as the various feature learning methods for SF. Published as a conference paper at ICLR 2023 A SKETCH OF PROOF OF THEOREMS 1 AND 2 To provide an intuition behind Theorems 1 and 2 on SFs and FB, we include here a sketch of proof in the finite state case. Full proofs in the general case can be found in the Appendix of Touati & Ollivier (2021). For Theorem 1 (SFs), let us assume that the reward is linear in the features 𝜙, namely, 𝑟(𝑠) = 𝜙(𝑠) 𝑤 for some 𝑤 R𝑑. Then by definition, 𝑧𝑟= 𝑤since 𝑧𝑟is the linear regression of the reward on the features (assuming features are linearly independent). Using the definition (4) of the successor features 𝜓, and taking the dot product with 𝑧𝑟, we obtain 𝜓(𝑠0, 𝑎0, 𝑧𝑟) 𝑧𝑟= E 𝑡 𝛾𝑡𝜙(𝑠𝑡+1) 𝑧𝑟| 𝑠0, 𝑎0, 𝜋𝑧𝑟 𝑡 𝛾𝑡𝑟(𝑠𝑡+1) | 𝑠0, 𝑎0, 𝜋𝑧𝑟 since 𝑟= 𝜙 𝑤. This means that 𝜓(𝑠0, 𝑎0, 𝑧𝑟) 𝑧𝑟is the 𝑄-function of reward 𝑟for policy 𝜋𝑧𝑟. At the same time, by the definition (4), 𝜋𝑧𝑟is defined as the argmax of 𝜓(𝑠0, 𝑎0, 𝑧𝑟) 𝑧𝑟. Therefore, the policy 𝜋𝑧𝑟is the argmax of its own 𝑄-function, meaning it is the optimal policy for reward 𝑟. 
For Theorem 2 (FB), let us assume that FB perfectly satisfies the training criterion (5), namely, 𝑀𝜋𝑧= 𝐹 𝑧𝐵diag(𝜌) in matrix form. Thanks to the definition (1) of successor representations 𝑀𝜋, for any policy 𝜋𝑧, the 𝑄-function for the reward 𝑟can be written as 𝑄𝜋𝑧 𝑟 = 𝑀𝜋𝑧𝑟in matrix form. This is equal to 𝐹 𝑧𝐵diag(𝜌)𝑟. Thus, if we define 𝑧𝑟:= 𝐵diag(𝜌)𝑟= E𝑠 𝜌[𝐵(𝑠)𝑟(𝑠)], we obtain 𝑄𝜋𝑧 𝑟 = 𝐹 𝑧𝑧𝑟for any 𝑧 R𝑑. In particular, the latter holds for 𝑧= 𝑧𝑟as well: 𝐹 𝑧𝑟𝑧𝑟is the 𝑄-function of 𝜋𝑧𝑟. Again, the policies 𝜋𝑧are defined in (5) as the greedy policies of 𝐹 𝑧𝑧, for any 𝑧. Therefore, 𝜋𝑧𝑟is the argmax of its own 𝑄-function. Hence, 𝜋𝑧𝑟is the optimal policy for the reward 𝑟. Published as a conference paper at ICLR 2023 B DERIVATION OF THE FORWARD-BACKWARD LOSS Here we quickly derive the loss (8) used to train 𝐹and 𝐵such that 𝑀𝜋(𝑠, 𝑎, d𝑠 ) 𝐹(𝑠, 𝑎) 𝐵(𝑠 ) 𝜌(d𝑠 ). Training is based on the Bellman equation satisfied by 𝑀𝜋. This holds separately for each policy parameter 𝑧, so in this section we omit 𝑧for simplicity. Here 𝜌(d𝑠 ) is the distribution of states in the dataset. Importantly, the resulting loss does not require to know this measure 𝜌(d𝑠 ), only to be able to sample states 𝑠 𝜌from the dataset. The successor measure 𝑀𝜋satisfies a Bellman-like equation 𝑀𝜋= 𝑃+ 𝛾𝑃𝜋𝑀𝜋, as matrices in the finite case and as measures in the general case (Blier et al., 2021). We can learn FB by iteratively minimizing the Bellman residual 𝑀𝜋 (𝑃+ 𝛾𝑃𝜋𝑀𝜋) on the parametric model 𝑀= 𝐹 𝐵𝜌. 𝑀𝜋(𝑠, 𝑎, d𝑠 ) is a measure on 𝑠 for each (𝑠, 𝑎), so it is not obvious how to measure the size of the Bellman residual. In general, we can define a norm on such objects 𝑀by taking the density with respect to the reference measure 𝜌, 𝑀 2 𝜌:= E(𝑠,𝑎) 𝜌 𝑠 𝜌 [ ( 𝑀(𝑠, 𝑎, d𝑠 ) where 𝑀(𝑠,𝑎,d𝑠 ) 𝜌(d𝑠 ) is the density of 𝑀with respect to 𝜌. 1 For finite states, 𝑀is a matrix 𝑀𝑠𝑎𝑠 and this is just a 𝜌-weighted Frobenius matrix norm, 𝑀 2 𝜌= 𝑠𝑎𝑠 𝑀2 𝑠𝑎𝑠 𝜌(𝑠, 𝑎)/𝜌(𝑠 ). (This is also how we proceed to learn a low-rank approximation of 𝑃in (16).) We define the loss on 𝐹and 𝐵as the norm of the Bellman residual on 𝑀for the model 𝑀= 𝐹 𝐵𝜌. As usual in temporal difference learning, we use fixed, non-trainable target networks 𝐹and 𝐵for the right-hand-side of the Bellman equation. Thus, the Bellman residual is 𝐹 𝐵𝜌 ( 𝑃+ 𝛾𝑃𝜋 𝐹 𝐵𝜌 ) , and the loss is ℒ(𝐹, 𝐵) := 𝐹 𝐵𝜌 ( 𝑃+ 𝛾𝑃𝜋 𝐹 𝐵𝜌 ) 2 When computing the norm 𝜌, the denominator 𝜌cancels out with the 𝐹 𝐵𝜌terms, but we are left with a 𝑃/𝜌term. This term can still be integrated, because integrating 𝑃(d𝑠 |𝑠, 𝑎)/𝜌(d𝑠 ) under 𝑠 𝜌is equivalent to directly integrating under 𝑠 𝑃(d𝑠 |𝑠, 𝑎), namely, integrating under transitions (𝑠𝑡, 𝑎𝑡, 𝑠𝑡+1) in the environment. This plays out as follows: ℒ(𝐹, 𝐵) = E(𝑠𝑡.𝑎𝑡) 𝜌 𝑠 𝜌 [ ( 𝐹(𝑠𝑡, 𝑎𝑡) 𝐵(𝑠 ) 𝑃(d𝑠 |𝑠𝑡, 𝑎𝑡) 𝜌(d𝑠 ) 𝛾E𝑠𝑡+1 𝑃(d𝑠𝑡+1|𝑠𝑡,𝑎𝑡)[ 𝐹(𝑠𝑡+1, 𝜋(𝑠𝑡+1)) 𝐵(𝑠 )] ) 2] = E(𝑠𝑡,𝑎𝑡,𝑠𝑡+1) 𝜌 𝑠 𝜌 [ ( 𝐹(𝑠𝑡, 𝑎𝑡) 𝐵(𝑠 ) 𝛾 𝐹(𝑠𝑡+1, 𝜋(𝑠𝑡+1)) 𝐵(𝑠 ) ) 2] 2 E(𝑠𝑡,𝑎𝑡,𝑠𝑡+1) 𝜌 [ 𝐹(𝑠𝑡, 𝑎𝑡) 𝐵(𝑠𝑡+1) ] + Const (24) where Const is a constant term that we can discard since it does not depend on 𝐹and 𝐵. Apart from the discarded constant term, all terms in this final expression can be sampled from the dataset. Note that we have a 2𝐹(𝑠𝑡, 𝑎𝑡) 𝐵(𝑠𝑡+1) term where Touati & Ollivier (2021) have a 2𝐹(𝑠𝑡, 𝑎𝑡) 𝐵(𝑠𝑡) term: this is because we define successor representations (1) using 𝑠𝑡+1 while Touati & Ollivier (2021) use 𝑠𝑡. See also Appendix L for pseudocode (including the orthonormalization loss, and double networks for 𝑄-learning as described in Appendix G). 1This is the dual norm on measures of the 𝐿2(𝜌) norm on functions. 
It amounts to learning a model of 𝑀(𝑠, 𝑎, d𝑠 ) by learning relative densities to reach 𝑠 knowing we start at (𝑠, 𝑎), relative to the average density 𝜌(d𝑠 ) in the dataset. Published as a conference paper at ICLR 2023 The orthonormalization loss. An auxiliary loss is used to normalize 𝐵so that E𝑠 𝜌[𝐵(𝑠)𝐵(𝑠) ] Id. This loss is ℒnorm(𝐵) := E𝜌[𝐵𝐵 ] Id 2 Frobenius (25) = E𝑠 𝜌, 𝑠 𝜌 [ (𝐵(𝑠) 𝐵(𝑠 ))2 𝐵(𝑠) 2 2 𝐵(𝑠 ) 2 2 ] + Const. (26) (The more complex expression in Touati & Ollivier (2021) has the same gradients up to a factor 4.) The auxiliary loss ℒ (9). Learning 𝐹(𝑠, 𝑎, 𝑧) is equivalent to learning 𝐹(𝑠, 𝑎, 𝑧) 𝑧 for all vectors 𝑧 . Yet the definition of the policies 𝜋𝑧in FB only uses 𝐹(𝑠, 𝑎, 𝑧) 𝑧. Thus, as was done for SFs in Section 5.1, one may wonder if there is a scalar rather than vector loss to train 𝐹, that would reduce the error in the directions of 𝐹used to define 𝜋𝑧. In the case of FB, the full vector loss is needed to train 𝐹and 𝐵. However, one can add the following auxiliary loss on 𝐹to reduce the errors in the specific direction 𝐹(𝑠, 𝑎, 𝑧) 𝑧. This is obtained as follows. Take the Bellman gap in the main FB loss (22): this Bellman gap is 𝐹 𝐵𝜌 ( 𝑃+ 𝛾𝑃𝜋 𝐹 𝐵𝜌 ) . To specialize this Bellman gap in the direction of 𝐹(𝑠, 𝑎, 𝑧) 𝑧, multiply by 𝐵 (𝐵𝜌𝐵 ) 1𝑧on the right: this yields 𝐹 𝑧 ( 𝑃𝐵 (𝐵𝜌𝐵 ) 1𝑧+ 𝛾𝑃𝜋 𝐹 𝑧 ) (using that we compute the loss at 𝐵= 𝐵). This new loss is the Bellman gap on 𝐹 𝑧, with reward 𝑃𝐵 (𝐵𝜌𝐵 ) 1𝑧. This is the loss ℒ described in (9). It is a particular case of the main FB loss: ℒ= 0 implies ℒ = 0, since we obtained ℒ by multiplying the Bellman gap of ℒ. We use ℒ on top of the main FB loss to reduce errors in the direction 𝐹(𝑠, 𝑎, 𝑧) 𝑧. However, in the end, the differences are modest. Published as a conference paper at ICLR 2023 C RELATIONSHIP BETWEEN CONTRASTIVE LOSS AND SVD OF SUCCESSOR MEASURES Here we prove the precise relationship between the contrastive loss (17) and SVDs of the successor measure of the exploration policy. Intuitively, both methods push states together if they lie on the same trajectory, by increasing the dot product between the representations of 𝑠𝑡and 𝑠𝑡+𝑘. This is formalized as follows. Let 𝜋be the policy used to produce the trajectories in the dataset. Define 𝑀:= (1 𝛾CL) 𝑡 0 𝛾𝑡 CL𝑃𝑡+1 𝜋 (27) where 𝛾CL is the parameter of the geometric distribution used to choose 𝑘when sampling 𝑠𝑡and 𝑠𝑡+𝑘. 𝑀is a stochastic matrix in the discrete case, and a probability measure over 𝑆in the general case: it is the normalized version of the successor measure (3) with 𝜋the exploration policy. By construction, the distribution of 𝑠𝑡+𝑘knowing 𝑠𝑡with 𝑘 Geom(1 𝛾CL) is described by 𝑀. Therefore, we can rewrite the loss as E𝑘 Geom(1 𝛾CL) (𝑠𝑡,𝑠𝑡+𝑘) 𝒟 [ log exp(cosine(𝜙(𝑠𝑡), 𝜇(𝑠𝑡+𝑘))) E𝑠 𝒟exp(cosine(𝜙(𝑠𝑡), 𝜇(𝑠 ))) = E𝑠 𝜌, 𝑠 𝜌 [ 𝑀(𝑠, d𝑠 ) 𝜌(d𝑠 ) log exp(cosine(𝜙(𝑠), 𝜇(𝑠 ))) + E𝑠 𝜌[log E𝑠 𝒟exp(cosine(𝜙(𝑠), 𝜇(𝑠 )))] (29) Assume that 𝜙and 𝜇are centered with unit norm, namely, 𝜙(𝑠) 2 = 1 and E𝑠 𝜌𝜙(𝑠) = 0 and likewise for 𝜇. With unit norm, the cosine becomes just a dot product, and the loss is = E𝑠 𝜌, 𝑠 𝜌 [ 𝑀(𝑠, d𝑠 ) 𝜌(d𝑠 ) 𝜙(𝑠) 𝜇(𝑠 ) + E𝑠 𝜌 [ log E𝑠 𝒟exp(𝜙(𝑠) 𝜇(𝑠 )) ] . (30) A second-order Taylor expansion provides log E exp 𝑋= E 𝑋+ 1 2(E 𝑋)2 + 𝑂(|𝑋|3) (31) and therefore, with E 𝜙= E 𝜇= 0, the loss is approximately [ 𝑀(𝑠, d𝑠 ) 𝜌(d𝑠 ) 𝜙(𝑠) 𝜇(𝑠 ) 2 E𝑠 𝜌, 𝑠 𝜌 [ ( 𝜙(𝑠) 𝜇(𝑠 ) ) 2] (32) 2 E𝑠 𝜌, 𝑠 𝜌 𝜙(𝑠) 𝜇(𝑠 ) 𝑀(𝑠, d𝑠 ) + Const (33) where the constant term does not depend on 𝜙and 𝜇. This is minimized when 𝜙 𝜇is the SVD of 𝑀/𝜌in the 𝐿2(𝜌) norm. 
Published as a conference paper at ICLR 2023 D DO FINITE-RANK MODELS ON THE TRANSITION MATRIX AND ON SUCCESSOR MEASURES MAKE SENSE? It turns out finite-rank models are very different for 𝑃or for 𝛾𝑡𝑃𝑡 𝜋. In typical situations, the spectrum of 𝑃is concentrated around 1 while that of 𝛾𝑡𝑃𝑡 𝜋is much more spread-out. Despite the popularity of low-rank 𝑃in theoretical RL works, 𝑃is never close to low-rank in continuous-time systems: then 𝑃is actually always close to the identity. Generally speaking, 𝑃 cannot be low-rank if most actions have a small effect. By definition, for any feature function 𝜙, (𝑃𝜋𝜙)(𝑠) = E[𝜙(𝑠𝑡+1)|𝑠𝑡= 𝑠]. Intuitively, if actions have a small effect, then 𝑠𝑡+1 is close to 𝑠𝑡, and 𝜙(𝑠𝑡+1) 𝜙(𝑠𝑡) for continuous 𝜙. This means that 𝑃𝜋𝜙is close to 𝜙, so that 𝑃𝜋is close to the identity on a large subspace of feature functions 𝜙. In the theory of continuous-time Markov processes, the time-𝑡transition kernel is given by 𝑃𝑡= 𝑒𝑡𝐴with 𝐴the infinitesimal generator of the process (Levin et al., 2009, 20.1) (Øksendal, 1998, 8.1), hence 𝑃𝑡is Id +𝑂(𝑡) for small timesteps 𝑡. In general, the transition matrix 𝑃is better modeled as Id + low-rank, which corresponds to a low-rank model of the Markov chain Laplacian Id 𝑃. On the other hand, though 𝛾𝑡𝑃𝑡 𝜋is never exactly low-rank (it is invertible), it has meaningful low-rank approximations under weak assumptions. For large 𝑡, 𝑃𝑡 𝜋becomes rank-one under weak assumptions (ergodicity), as it converges to the equilibrium distribution of the transition kernel. For large 𝛾, the sum 𝛾𝑡𝑃𝑡 𝜋is dominated by large 𝑡. Most eigenvalues of 𝑃𝜋are close to 1, but taking powers 𝑃𝑡 𝜋sharpens the differences between eigenvalues: with 𝛾close to 1, going from 𝑃𝜋to 𝛾𝑡𝑃𝑡 𝜋= (Id 𝛾𝑃𝜋) 1 changes an eigenvalue 1 𝜀into 1/𝜀. In short, on 𝑃itself, there is little learning signal to differentiate between eigenvectors, but differences become visible over time. This may explain why FB works better than low-rank decompositions directly based on 𝑃or the Laplacian. For instance, consider the nearest-neighbor random walk on a length-𝑛cycle {0, 1, . . . , 𝑛 1 mod 𝑛}, namely, moving in dimension 1. (This extends to any-dimensional grids.) The associated 𝑃𝜋is not low-rank in any reasonable sense: the corresponding stochastic matrix is concentrated around the diagonal, and many eigenvalues are close to 1. Precisely, the eigenvalues are cos(2𝑘𝜋/𝑛) with integer 𝑘= {0, . . . , 𝑛/2}. This is 1 2𝜋2(𝑘/𝑛)2 when 𝑘 𝑛. Half of the eigenvalues are between However, when 𝛾 1, 𝛾𝑡𝑃𝑡 𝜋has one eigenvalue 1/(1 𝛾) and the other eigenvalues are 1 1 cos(2𝑘𝜋/𝑛) 𝑛2/2𝜋2𝑘2 with positive integer 𝑘: there is one large eigenvalue, then the others decrease like cst/𝑘2. With such a spread-out spectrum, a finite-rank model makes sense. Published as a conference paper at ICLR 2023 E WHICH FEATURES ARE ANALOGOUS BETWEEN LOW-RANK 𝑃AND SFS? Here we explain the relationship between a low-rank model of 𝑃and successor features. More precisely, if transition probabilities from (𝑠, 𝑎) to 𝑠 can be written exactly as 𝜒(𝑠, 𝑎) 𝜇(𝑠 ), then SFs with basic features 𝜙:= 𝜇will provide optimal policies for any reward function (Theorem 3). Indeed, under the finite-rank model 𝑃(d𝑠 |𝑠, 𝑎) = 𝜒(𝑠, 𝑎) 𝜇(𝑠 )𝜌(d𝑠 ), rewards only matter via the reward features E𝑠 𝜌𝜇(𝑠 )𝑟(𝑠 ): namely, two rewards with the same reward features have the same 𝑄-function, as the dynamics produces the same expected rewards. Then 𝑄-functions are linear in these reward features, and using successor features with 𝜙:= 𝜇provides the correct 𝑄-functions, as follows. 
E WHICH FEATURES ARE ANALOGOUS BETWEEN LOW-RANK P AND SFS?

Here we explain the relationship between a low-rank model of P and successor features. More precisely, if the transition probabilities from (s, a) to s′ can be written exactly as χ(s, a)^⊤μ(s′), then SFs with basic features φ := μ will provide optimal policies for any reward function (Theorem 3).

Indeed, under the finite-rank model P(ds′|s, a) = χ(s, a)^⊤μ(s′)ρ(ds′), rewards only matter via the reward features E_{s′∼ρ}[μ(s′)r(s′)]: namely, two rewards with the same reward features have the same Q-function, as the dynamics produces the same expected rewards. Then Q-functions are linear in these reward features, and using successor features with φ := μ provides the correct Q-functions, as follows.

Theorem 3. Assume that P(ds′|s, a) = χ(s, a)^⊤μ(s′)ρ(ds′). Then successor features using the basic features φ := μ provide optimal policies for any reward function.

This is why we use μ rather than χ for the SF basic features. This is also why we avoided the traditional notation P(s′|s, a) = φ(s, a)^⊤μ(s′) often used for low-rank P, which induces a conflict of notation with the φ in SFs and suggests the wrong analogy.

Meanwhile, χ plays a role more analogous to the SFs ψ, although for one-step transitions instead of multistep transitions as in SFs: Q-functions are linear combinations of the features χ. In particular, the optimal Q-function for reward r is Q*_r = χ^⊤w_r for some w_r. But contrary to successor features, there is no simple correspondence to compute w_r from r.

Proof. Let π be any policy. On a finite space, in matrix notation and omitting ρ for simplicity, the assumption P = χ^⊤μ implies P_π = χ_π^⊤μ, where χ_π(s) := E_{a∼π(s)}[χ(s, a)] are the π-averaged features. Then

Q^π_r = Σ_{t≥0} γ^t P P_π^t r    (34)
= Σ_{t≥0} γ^t χ^⊤μ (χ_π^⊤μ)^t r    (35)
= χ^⊤ ( Σ_{t≥0} γ^t (μχ_π^⊤)^t ) μ r.    (36)

Thus, Q-functions are expressed as Q^π_r(s, a) = χ(s, a)^⊤ w(π, r) with w(π, r) = (Σ_{t≥0} γ^t (μχ_π^⊤)^t) μr. Moreover, rewards only matter via μr: namely, two rewards with the same μr have the same Q-function for every policy. In full generality on continuous spaces with ρ again, the same holds with μ(s)ρ(ds) instead of μ(s), and E_{s∼ρ}[μ(s)r(s)] instead of μr.

Now, let r be any reward function, and let r′ be its L²(ρ)-orthogonal projection onto the space generated by the features μ. By construction, r − r′ is L²(ρ)-orthogonal to μ, namely E_ρ[μ(r − r′)] = 0. So E_ρ[μr] = E_ρ[μr′]. Therefore, by the above, r and r′ have the same Q-function for every policy. By definition, r′ lies in the linear span of the features μ. By Theorem 1, SFs with features φ = μ will provide optimal policies for r′. Since r and r′ have the same Q-function for every policy, an optimal policy for r′ is also optimal for r.
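As a sanity check on the proof above, here is a small numerical sketch (ours, not from the paper): build a finite MDP whose transition matrix is exactly of the form χ(s, a)^⊤μ(s′), compute Q^π_r = Σ_t γ^t P P_π^t r for a random policy and reward (the convention used in the proof, with rewards on the next state), and verify that Q^π_r lies in the span of the features χ. All sizes are arbitrary.

import numpy as np

nS, nA, d, gamma = 8, 3, 2, 0.9                       # arbitrary sizes; d is the rank
rng = np.random.default_rng(1)
chi = rng.random((nS * nA, d))                        # chi(s, a), rows indexed by s * nA + a
mu = rng.random((nS, d))                              # mu(s')
chi /= (chi @ mu.T).sum(axis=1, keepdims=True)        # rescale chi so that the rows of P sum to 1
P = chi @ mu.T                                        # P(s'|s, a) = chi(s, a)^T mu(s'), stochastic and exactly rank d
pi = rng.integers(0, nA, size=nS)                     # a random deterministic policy
P_pi = np.stack([P[s * nA + pi[s]] for s in range(nS)])
r = rng.random(nS)                                    # a random reward on states
Q = P @ np.linalg.inv(np.eye(nS) - gamma * P_pi) @ r  # Q^pi_r = sum_t gamma^t P P_pi^t r
w, *_ = np.linalg.lstsq(chi, Q, rcond=None)
print(np.max(np.abs(chi @ w - Q)))                    # ~1e-15: Q^pi_r is linear in chi, as in Theorem 3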
F RELATIONSHIP BETWEEN LAPLACIAN EIGENFUNCTIONS AND LOW-RANK P LEARNING

The loss (16) used to learn a low-rank model of the transition probabilities P is an asymmetric version of the Laplacian eigenfunction loss (14) with λ = 1. Said equivalently, if we use the low-rank P loss (16) constrained with χ = μ to learn a low-rank model of P_π instead of P (with π the exploration policy), then we get the Laplacian eigenfunction loss (14) with λ = 1.

Indeed, set λ = 1 in (14). Assume that the distributions of s_t and s_{t+1} in the dataset are identical on average (this happens, e.g., if the dataset is made of long trajectories, or if ρ is close enough to the invariant distribution of the exploration policy). Then, in the Laplacian loss (14), the norms from the first term cancel those from the second, and the Laplacian loss simplifies to

(14) = E_{(s_t, s_{t+1})∼D}[ ‖φ(s_t) − φ(s_{t+1})‖² ] + E_{s∼D, s′∼D}[ (φ(s)^⊤φ(s′))² − ‖φ(s)‖²₂ − ‖φ(s′)‖²₂ ]
= E_{s_t∼D}[ ‖φ(s_t)‖² ] + E_{s_{t+1}∼D}[ ‖φ(s_{t+1})‖² ] − 2 E_{(s_t, s_{t+1})∼D}[ φ(s_t)^⊤φ(s_{t+1}) ]    (38)
  + E_{s∼D, s′∼D}[ (φ(s)^⊤φ(s′))² − ‖φ(s)‖²₂ − ‖φ(s′)‖²₂ ]    (39)
= −2 E_{(s_t, s_{t+1})∼D}[ φ(s_t)^⊤φ(s_{t+1}) ] + E_{s∼D, s′∼D}[ (φ(s)^⊤φ(s′))² ].    (40)

This is the same as the low-rank loss (16) if we omit actions a and constrain χ = μ.

Figure 3: Maze, Walker, Cheetah and Quadruped environments used in our experiments. In the Maze domain (left), we show an example of an initial state (yellow point) and the 20 test goals (red circles).

G EXPERIMENTAL SETUP

In this section we provide additional information about our experiments. Code snippets for the main losses are given in Appendix L. The full code can be found at https://github.com/facebookresearch/controllable_agent

G.1 ENVIRONMENTS

All the environments considered in this paper are based on the DeepMind Control Suite (Tassa et al., 2018).

Point-mass Maze: a 2-dimensional continuous maze with four rooms. The states are 4-dimensional vectors consisting of the position and velocity of the point mass (x, y, v_x, v_y), and the actions are 2-dimensional vectors. At test time, we assess the performance of the agents on 20 goal-reaching tasks (5 goals in each room, described by their (x, y) coordinates).

Walker: a planar walker. States are 24-dimensional vectors consisting of the positions and velocities of the robot joints, and actions are 6-dimensional vectors. We consider 4 different tasks at test time: the walker_stand reward is a combination of terms encouraging an upright torso and some minimal torso height, while the walker_walk and walker_run rewards include a component encouraging some minimal forward velocity. The walker_flip reward includes a component encouraging some minimal angular momentum.

Cheetah: a running planar biped. States are 17-dimensional vectors consisting of the positions and velocities of the robot joints, and actions are 6-dimensional vectors. We consider 4 different tasks at test time: the cheetah_walk and cheetah_run rewards are linearly proportional to the forward velocity up to some desired value: 2 m/s for walk and 10 m/s for run. Similarly, the cheetah_walk_backward and cheetah_run_backward rewards encourage reaching some minimal backward velocities.

Quadruped: a four-legged ant navigating in 3D space. States and actions are 78-dimensional and 12-dimensional vectors, respectively. We consider 4 tasks at test time: the quadruped_stand reward encourages an upright torso. quadruped_walk and quadruped_run include a term encouraging some minimal torso velocity. quadruped_jump includes a term encouraging some minimal height of the center of mass.

G.2 ARCHITECTURES

We use the same architectures for all methods. The backward representation network B(s) and the feature network φ(s) are represented by a feedforward neural network with three hidden layers, each with 256 units, that takes a state as input and outputs an L²-normalized embedding of radius √d. For both the successor features ψ(s, a, z) and the forward network F(s, a, z), we first preprocess (s, a) and (s, z) separately by two feedforward networks with two hidden layers (each with 1024 units) into a 512-dimensional space. We then concatenate the two outputs and pass them into another 2-layer feedforward network (each layer with 1024 units) to output a d-dimensional vector. For the policy network π(s, z), we first preprocess s and (s, z) separately by two feedforward networks with two hidden layers (each with 1024 units) into a 512-dimensional space. We then concatenate the two outputs and pass them into another 2-layer feedforward network (each layer with 1024 units) to output a d_A-dimensional vector, to which we apply a Tanh activation, as the action space is [−1, 1]^{d_A}. For all the architectures, we apply layer normalization (Ba et al., 2016) and a Tanh activation in the first layer in order to standardize the states and actions. We use ReLU for the rest of the layers. We also pre-normalize z: z ← √d · z/‖z‖₂ in the input of F, π and ψ. Empirically, we observed that removing the preprocessing and passing a concatenation of (s, a, z) directly to the network leads to unstable training. The same holds when we preprocess (s, a) and z instead of (s, a) and (s, z), which suggests that the preprocessing of z should be influenced by the current state. For maze environments, we added an additional hidden layer after the preprocessing (for both the policy and the forward / successor-feature networks), as it helped improve the results.
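For concreteness, here is a minimal PyTorch sketch of our reading of the forward / successor-feature architecture described above. This is an illustration, not the released code: the layer sizes follow the text, while class and attribute names, and the exact placement of activations after the 512-dimensional trunk outputs, are assumptions.

import torch
import torch.nn as nn

class ForwardNet(nn.Module):
    """Sketch of F(s, a, z) / psi(s, a, z): two preprocessing trunks, then a joint head."""
    def __init__(self, s_dim, a_dim, z_dim, d):
        super().__init__()
        # Each trunk: two hidden layers of 1024 units, 512-dimensional output;
        # layer norm + Tanh on the first layer, ReLU afterwards.
        def trunk(in_dim):
            return nn.Sequential(nn.Linear(in_dim, 1024), nn.LayerNorm(1024), nn.Tanh(),
                                 nn.Linear(1024, 1024), nn.ReLU(),
                                 nn.Linear(1024, 512), nn.ReLU())
        self.sa_trunk = trunk(s_dim + a_dim)
        self.sz_trunk = trunk(s_dim + z_dim)
        # Joint head: 2 layers of 1024 units, outputs a d-dimensional vector.
        self.head = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, d))
        self.z_dim = z_dim

    def forward(self, s, a, z):
        z = (self.z_dim ** 0.5) * nn.functional.normalize(z, dim=-1)   # pre-normalize z to radius sqrt(d)
        h = torch.cat([self.sa_trunk(torch.cat([s, a], dim=-1)),
                       self.sz_trunk(torch.cat([s, z], dim=-1))], dim=-1)
        return self.head(h)

The policy network follows the same pattern, with trunks for s and (s, z) and a final Tanh output of size d_A.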
G.3 SAMPLING OF z

We mix two methods for sampling z:

1. We sample z uniformly on the sphere of radius √d in R^d (so each component of z is of size about 1).
2. We sample z using the formula for z_r corresponding to the reward for reaching a random goal state s in Theorems 1–2. Namely, we set z = B(s) for FB, and z = (Σ_{i=1}^m φ(s_i)φ(s_i)^⊤)⁺ φ(s) for SFs, where s ∼ ρ is a random state sampled from the replay buffer and the s_i are states in a minibatch.

For the main series of results, we used a 50% mix ratio for these two methods. Different algorithms can benefit from different ratios: this is explored in Appendix J.

G.4 LEARNING THE POLICIES π_z: POLICY NETWORK

As the action space is continuous, we cannot compute the argmax over actions in closed form. Instead, we consider a latent-conditioned policy network π_η : S × Z → A, and we learn the policy parameters η by performing stochastic gradient ascent on the objective E_{s,z}[F(s, π_η(s, z), z)^⊤z] for FB, or E_{s,z}[ψ(s, π_η(s, z), z)^⊤z] for SFs. We also incorporate techniques introduced in the TD3 paper (Fujimoto et al., 2018) to address function approximation error in actor-critic methods: double networks, and target policy smoothing by adding noise ε to the actions.

Let θ₁ and θ₂ be the parameters of the two forward networks and let ω be the parameters of the backward network. Let θ̄₁, θ̄₂ and ω̄ be the parameters of their corresponding target networks. Let {(s_i, a_i, s_i^next)}_{i∈I} ⊂ D be a mini-batch of size |I| = b of transitions, and let {z_i}_{i∈I} be a mini-batch of size |I| = b of latent variables sampled according to Section G.3. The empirical version of the main FB loss in (8) is (with an additional 1/2 factor), for k = 1, 2:

ℒ(θ_k, ω) = 1/(2b(b−1)) Σ_{(i,j)∈I², i≠j} ( F_{θ_k}(s_i, a_i, z_i)^⊤ B_ω(s_j^next) − γ min_{l=1,2} F_{θ̄_l}(s_i^next, π_η(s_i^next, z_i) + ε_i, z_i)^⊤ B_{ω̄}(s_j^next) )² − (1/b) Σ_{i∈I} F_{θ_k}(s_i, a_i, z_i)^⊤ B_ω(s_i^next)    (41)

where ε_i is sampled from a truncated centered Gaussian with variance σ² (for policy smoothing).

The empirical version of the auxiliary F loss in (9) is, for k = 1, 2:

(1/b) Σ_{i∈I} ( F_{θ_k}(s_i, a_i, z_i)^⊤ z_i − B_ω(s_i^next)^⊤ Cov⁺ z_i − γ min_{l=1,2} F_{θ̄_l}(s_i^next, π_η(s_i^next, z_i) + ε_i, z_i)^⊤ z_i )²    (42)

where Cov⁺ is the pseudo-inverse of the empirical covariance matrix Cov = (1/b) Σ_{i∈I} B_ω(s_i)B_ω(s_i)^⊤. We use 1/d as the regularization coefficient in front of this auxiliary loss.

For policy training, the empirical loss is:

−(1/b) Σ_{i∈I} min_{l=1,2} F_{θ_l}(s_i, π_η(s_i, z_i) + ε_i, z_i)^⊤ z_i.    (43)

The same techniques are also used for SFs.

G.5 HYPERPARAMETERS

Table 1 summarizes the hyperparameters used in our experiments. A hyperparameter sensitivity analysis for two domains (Walker and Cheetah) is included in Appendix J.

Table 1: Hyperparameters used in our experiments.

Hyperparameter | Value
Replay buffer size | 5×10⁶ (10×10⁶ for maze)
Representation dimension | 50 (100 for maze)
Batch size | 1024
Discount factor γ | 0.98 (0.99 for maze)
Optimizer | Adam
Learning rate | 10⁻⁴
Mixing ratio for z sampling | 0.5
Momentum coefficient for target networks | 0.99
Stddev σ for policy smoothing | 0.2
Truncation level for policy smoothing | 0.3
Number of gradient steps | 10⁶
Number of reward labels for task inference | 10⁴
Discount factor γ_CL for CL | 0.6 (0.2 for maze)
Regularization weight for orthonormality loss (spectral methods) | 1
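As an illustration of how the "number of reward labels for task inference" above is used at test time, here is a small sketch (ours, not the released code) of zero-shot task inference from reward-labelled states: z_r = E_{s∼ρ}[r(s)B(s)] for FB, and a linear regression of the reward on the features φ for SFs. The function and attribute names are hypothetical.

import torch

def infer_z_fb(backward_net, states, rewards):
    # FB task inference: z_r = E_{s ~ rho}[r(s) B(s)], estimated on reward-labelled states.
    with torch.no_grad():
        B = backward_net(states)                        # (n_labels, d)
    return (rewards.unsqueeze(1) * B).mean(dim=0)       # (d,)

def infer_z_sf(feature_net, states, rewards):
    # SF task inference: w_r solves the linear regression r(s) ~ phi(s)^T w.
    with torch.no_grad():
        phi = feature_net(states)                       # (n_labels, d)
    cov = phi.T @ phi / phi.shape[0]
    return torch.linalg.pinv(cov) @ (rewards.unsqueeze(1) * phi).mean(dim=0)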
Hyperparameter tuning. Since all methods share the same core, we chose to use the same hyperparameters for all of them rather than tuning per method, which could have led to less robust conclusions. We did not do hyperparameter sweeps for each baseline and task: first, because this would have been too computationally intensive given the number of setups, and second, because this would be too close to going back to a supervised method for each task. Instead, to avoid overfitting, we tuned architectures and hyperparameters by hand on the Walker environment only, with the RND replay buffer, and reused these parameters across all methods and tasks (except for Maze, on which all methods behave differently). We identified some trends by monitoring learning curves and downstream performance on Walker, and we fixed a configuration that led to overall good performance for all the methods. We avoided a full sweep to avoid overfitting to Walker and to focus on robustness. For instance, the learning rate 10⁻⁴ seemed to work well with all methods, as seen in Appendix J.

Some trends were common to all methods: indeed, all the methods share a common core (training of the successor features ψ or F, and training of the policies π) and differ only by the training of the basic features φ or B. For the basic features φ, the various representation learning losses were easy to fit and had low hyperparameter sensitivity: we always observed a smooth decrease of the losses until convergence. The challenging part was learning ψ and F and their corresponding policies, which is common across methods.

H DETAILED EXPERIMENTAL RESULTS

We report the full experimental results, first as a table (Section H.1). In Section H.2 we plot aggregated results for easier interpretation: first, aggregated across all tasks, with a plot of the variability for each method; second, aggregated over environments (since each environment corresponds to a trained zero-shot model); third, for each individual task but still aggregated over replay buffers. In Section H.3 we plot all individual results per task and replay buffer. In Section H.4 we plot aggregate results split by replay buffer.
H.1 FULL TABLE OF RESULTS Buffer Domain Task Method Rand AEnc ICM Latent Trans Lap LRA-P CL LRA-SR FB APS cheetah run 145 7 97 3 432 8 49 14 133 15 198 4 8 1 2 0 247 10 267 33 25 3 run-backward 189 20 365 6 404 3 32 3 382 3 221 4 1 0 14 6 261 5 238 7 98 21 walk 665 60 404 19 928 54 302 67 287 53 900 49 75 31 2 1 918 22 844 51 144 11 walk-backward 653 75 982 0 986 0 453 105 985 0 937 17 8 1 150 51 983 0 981 1 452 77 maze reach 11 5 5 1 8 3 10 4 15 5 432 18 436 16 12 3 145 8 410 16 59 7 quadruped jump 784 5 727 12 164 20 554 24 624 14 718 18 309 50 102 29 632 20 649 23 311 24 run 487 1 459 5 91 19 382 14 411 16 491 3 238 21 53 15 448 6 476 8 196 10 stand 966 3 925 16 248 46 752 36 890 17 963 1 497 53 56 9 872 20 924 13 417 10 walk 543 19 444 8 108 25 489 20 490 20 524 13 228 26 45 14 463 18 712 29 205 6 walker flip 158 10 317 31 452 3 299 53 471 15 454 12 340 18 69 26 186 21 413 16 39 2 run 96 4 127 8 290 9 359 11 263 12 289 10 115 11 63 11 204 26 346 14 35 3 stand 486 27 617 37 925 19 868 57 864 24 895 9 643 35 205 51 591 35 822 26 176 17 walk 177 30 462 58 724 41 857 8 816 30 386 40 159 20 76 44 671 19 817 15 34 2 cheetah run 58 8 84 10 333 14 27 5 322 5 142 2 149 3 0 0 209 10 210 13 - run-backward 149 6 262 7 329 3 30 5 274 7 146 2 133 6 16 13 230 6 157 7 - walk 287 38 327 51 961 24 112 11 929 13 722 19 770 31 1 0 860 31 908 18 - walk-backward 664 39 973 3 987 0 101 15 982 0 798 16 629 31 151 86 979 1 742 46 - maze reach 27 5 7 1 7 2 15 4 8 1 571 16 556 15 19 5 134 7 326 16 - quadruped jump 196 29 185 37 137 27 209 32 282 27 177 26 184 25 57 14 113 12 183 24 - run 134 17 234 8 88 13 123 17 191 14 125 14 166 19 65 16 99 9 137 14 - stand 413 56 321 42 220 29 270 38 436 34 231 52 264 34 86 12 215 47 287 53 - walk 148 21 171 21 122 10 156 24 212 12 135 16 172 21 40 13 100 26 280 52 - walker flip 133 12 346 6 510 13 443 30 456 10 548 33 281 22 78 21 551 13 507 18 - run 84 2 234 9 259 18 347 20 303 9 280 22 183 28 45 5 391 15 336 9 - stand 415 26 905 9 910 13 582 62 951 5 937 5 687 41 253 49 874 23 902 25 - walk 125 21 632 33 839 17 791 17 832 16 883 30 300 31 92 16 867 12 917 7 - cheetah run 64 3 68 5 96 8 183 26 66 4 50 5 6 1 163 14 138 17 247 9 - run-backward 96 6 162 18 160 22 60 4 143 17 90 7 2 0 124 13 82 8 185 17 - walk 289 22 337 28 401 40 567 70 286 30 330 65 29 11 622 28 446 36 827 41 - walk-backward 469 38 542 70 743 60 345 59 572 100 499 40 14 1 517 60 352 45 793 66 - maze reach 9 2 4 1 4 1 8 4 8 4 707 12 759 7 736 3 532 20 710 8 - quadruped jump 770 7 474 40 176 13 663 20 806 14 490 48 447 31 326 72 731 9 651 8 - run 465 4 415 10 98 13 418 10 478 8 399 26 301 12 263 40 461 6 429 3 - stand 919 19 770 28 426 45 830 31 973 2 720 27 552 32 529 70 944 11 815 2 - walk 586 28 486 37 90 13 527 21 552 33 410 30 310 14 210 42 516 30 528 10 - walker flip 267 28 332 18 461 9 34 3 399 19 569 30 512 27 50 9 454 23 578 10 - run 96 8 167 12 251 11 47 22 250 6 299 20 325 10 38 5 350 14 388 8 - stand 516 36 733 22 813 10 191 62 853 18 836 38 904 21 326 41 828 20 890 15 - walk 152 35 457 33 518 74 30 2 607 31 748 75 818 36 53 10 853 15 760 19 - Table 2: Score of each method, split by task and replay buffer. Average over ten random seeds, with 1𝜎estimated standard deviation on this average estimator. For Maze, we report the average over the 20 goals defined in the environment. We highlight the three leading methods (four when confidence intervals overlap) for each task. 
H.2 AGGREGATE PLOTS OF RESULTS

Here we plot the results, first averaged over everything, then by environment averaged over the tasks of that environment, and finally by task.

Figure 4: Zero-shot scores of ten SF methods and FB, aggregated over tasks using normalized scores as described in the text. To assess variability, the box plot on the right shows the variations of the distribution of normalized scores over random seeds, environments, and replay buffers.

Figure 5: Zero-shot scores averaged over tasks for each environment, with supervised online and offline TD3 as toplines. Average over 3 replay buffers and 10 random seeds.

Figure 6: Zero-shot scores for each task, with supervised online and offline TD3 as toplines. Average over 3 replay buffers and 10 random seeds.

H.3 FULL PLOTS OF RESULTS PER TASK AND REPLAY BUFFER

Figure 7: Per-task results on the RND replay buffer, average over 10 random seeds.

Figure 8: Per-task results on the APS replay buffer, average over 10 random seeds.

Figure 9: Per-task results on the Proto replay buffer, average over 10 random seeds.

H.4 INFLUENCE OF THE REPLAY BUFFER

Here we plot the influence of the replay buffer, by reporting results separated by replay buffer but averaged over the tasks corresponding to each environment (Fig. 10). Overall, there is a clear failure case of the Proto buffer on the Quadruped environment: the TD3 supervised baseline performs poorly for all tasks in that environment. Otherwise, results are broadly consistent across the different replay buffers: with a few exceptions, the same methods succeed or fail on the same environments.

Figure 10: Results on each replay buffer: RND (top), APS (middle), Proto (bottom). Average over 4 tasks for the Walker, Cheetah and Quadruped environments, average over 20 goals for Maze.

In Fig. 11, we plot results aggregated over all tasks but split by replay buffer. Box plots further show variability over random seeds and environments for a given replay buffer.
Overall, method rankings are broadly consistent between RND and APS, except for CL. Note that the aggregated normalized score for Proto is largely influenced by the failure on Quadruped: normalization by a very low baseline (Fig. 10, Quadruped plot for Proto) somewhat artificially pushes Trans high up (this shows the limit of using scores normalized by the offline TD3 score), while the rankings of the other methods are more similar to RND and APS. FB and Lap work very well on all replay buffers, with LRA-SR and Trans a bit behind due to their failures on some tasks.

Figure 11: Zero-shot scores by replay buffer, as a percentage of the supervised score of offline TD3 trained on the same buffer, averaged over tasks, environments and random seeds. Top: RND buffer; middle: APS buffer; bottom: Proto buffer.

I ANALYSIS OF THE LEARNED FEATURES

I.1 FEATURE RANK

Here we test the hypothesis that feature collapse for some methods is responsible for some cases of bad performance. This is especially relevant for Maze, where some methods may have little incentive to learn more features beyond the original two features (x, y). We report in Table 3 the effective rank of the learned features B or φ: this is computed as in Lyle et al. (2022), as the fraction of eigenvalues of E_{s∼ρ}[B(s)B(s)^⊤] or E_{s∼ρ}[φ(s)φ(s)^⊤] above a certain threshold.

Table 3: Feature rank (Lyle et al., 2022) of φ or B for each method, computed as (1/d) #{ σ ∈ eig( (1/n) Σ_{i=1}^n φ(s_i)φ(s_i)^⊤ ) | σ > ε }, trained on the RND replay buffer and averaged over 10 random seeds. We use n = 100,000 samples to estimate the covariance, and ε = 10⁻⁴.

Domain | Rand | AEnc | ICM | Latent | Trans | Lap | LRA-P | CL | LRA-SR | FB
Maze | 0.38 | 0.31 | 0.33 | 0.32 | 0.15 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
Walker | 1.0 | 0.91 | 1.0 | 1.0 | 0.90 | 1.0 | 1.0 | 0.65 | 1.0 | 1.0
Cheetah | 1.0 | 0.96 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
Quadruped | 1.0 | 0.98 | 0.30 | 1.0 | 0.15 | 1.0 | 0.84 | 1.0 | 1.0 | 1.0

For most methods and all environments except Maze, more than 90% (often 100%) of the eigenvalues are above 10⁻⁴, so the effective rank is close to full. The Maze environment is a clear exception: on Maze, for the Rand, AEnc, Trans, Latent and ICM methods, only about one third of the eigenvalues are above 10⁻⁴. The other methods keep 100% of the eigenvalues above 10⁻⁴. This is perfectly aligned with the performance of each method on Maze.

So rank reduction does happen for some methods. This reflects the fact that the two features (x, y) already convey the necessary information about states and dynamics, but are not sufficient to solve the problem via successor features. In the Maze environment, auto-encoders or transition models can perfectly optimize their loss just by keeping the original two features (x, y), and they have no incentive to learn other features, so the effective rank could have been 2. These methods might benefit from auxiliary losses to prevent eigenvalue collapse, similar to the orthonormalization loss used for B. We did not include such losses, because we wanted to keep the methods (auto-encoders, ICM, transition models...) as used in the literature. But even with an auxiliary loss, keeping a full rank could be achieved just by keeping (x, y) and then blowing up some additional irrelevant features.
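For reference, here is a minimal sketch (ours) of the effective-rank computation just described; feature_net, the sampled states tensor and the default threshold are placeholders.

import torch

def feature_rank(feature_net, states, eps=1e-4):
    # Effective rank as in Lyle et al. (2022): fraction of eigenvalues of the
    # empirical covariance E_rho[phi(s) phi(s)^T] above a threshold eps.
    with torch.no_grad():
        phi = feature_net(states)                    # (n, d), n ~ 100,000 samples
    cov = phi.T @ phi / phi.shape[0]                 # empirical covariance, (d, d)
    eigvals = torch.linalg.eigvalsh(cov)
    return (eigvals > eps).float().mean().item()     # fraction of eigenvalues above eps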
I.2 EMBEDDING VISUALIZATION VIA T-SNE

We visualize the learned state feature embeddings for Maze by projecting them into a 2-dimensional space using t-SNE (Van der Maaten & Hinton, 2008) in Fig. 12. AEnc, Trans, FB, LRA-P and CL all recover a picture of the maze. For the other methods the picture is less clear (notably, Rand produces a near-circle, possibly as an instance of concentration-of-measure theorems). Whether t-SNE preserves the shape of the maze does not appear to be correlated with performance: AEnc and Trans learn nice-looking features but perform poorly, while the t-SNE of Lap is not visually interpretable but its performance is good.

Figure 12: Visualization of the embedding vectors obtained by each method (φ for SFs and B for FB) on the maze domain, after projecting them into a two-dimensional space with t-SNE.

J HYPERPARAMETER SENSITIVITY

Figure 13: Return for each method trained in the Walker domain on the RND replay buffer, for different choices of hyperparameters (latent dimension, mixing ratio for z sampling, batch size, learning rate). Average over Walker tasks and 5 random seeds.

Figure 14: Return for each method trained in the Cheetah domain, on the RND replay buffer, for different choices of hyperparameters (latent dimension, mixing ratio for z sampling, batch size, learning rate). Average over Cheetah tasks and 5 random seeds.

K GOAL-ORIENTED BASELINES

Goal-oriented methods such as universal value functions (UVFs) (Schaul et al., 2015) learn a Q-function Q(s, a, g) indexed by a goal description g. When taking for g the set of all possible target states s′ ∈ S, these methods can learn to reach arbitrary target states. For instance, Ma et al. (2020) use a mixture of universal SFs and UVFs to learn goal-conditioned agents and policies.

In principle, such goal-oriented methods are not designed to deal with dense rewards, which are linear combinations of goals g. Indeed, the optimal Q-function for a linear combination of goals is not the linear combination of the optimal Q-functions. Nevertheless, we may still try to see how such linear combinations perform. The linear combination may be applied at the level of the Q-functions, or at the level of some goal descriptors, as follows.

Here, as an additional baseline for our experiments, we test a slight modification of the scheme from Ma et al. (2020), as suggested by one reviewer. We learn two embeddings (ψ, w) using state-reaching tasks. The Q-function for reaching state s′ (goal g = s′) is modeled as

Q(s_t, a_t, s′) = ψ(s_t, a_t, w(s′))^⊤ w(s′),    (44)

trained for the reward 1_{s_{t+1}=s′}. A policy network π(s, w(s′)) outputs the action at s for reaching goal s′. Thus, the training loss is

ℒ(ψ, w) := E_{(s_t, a_t, s_{t+1})∼ρ, s′∼ρ}[ ( ψ(s_t, a_t, w(s′))^⊤ w(s′) − 1_{s_{t+1}=s′} − γ ψ(s_{t+1}, π(s_{t+1}, w(s′)), w(s′))^⊤ w(s′) )² ].    (45)

Similarly to the other baselines, the policy network π(s, w(s′)) is trained by gradient ascent on the policy parameters to maximize

E_{s∼ρ, s′∼ρ}[ Q(s, π(s, w(s′)), s′) ] = E_{s∼ρ, s′∼ρ}[ ψ(s, π(s, w(s′)), w(s′))^⊤ w(s′) ].    (46)

At test time, given a reward r, we proceed as in FB and estimate

z = E_{s∼ρ}[ r(s) w(s) ]    (47)

and use the policy π(s, z). This amounts to extending from goal-reaching tasks to dense tasks by linearity on w, even though in principle this method should only optimize for single-goal rewards. We use the same architectures for (ψ, w) as for (F, B), with double networks ψ₁, ψ₂ and policy smoothing as in (43).
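For concreteness, here is a small sketch (our illustration, not the paper's code) of the training loss (44)-(45) for this baseline; target networks, double networks and policy smoothing are omitted for brevity, and all function names are hypothetical.

import torch
import torch.nn.functional as F

def uvf_loss(psi_net, w_net, policy_net, obs, action, next_obs, goal, discount):
    # Model Q(s_t, a_t, s') = psi(s_t, a_t, w(s'))^T w(s'), Eq. (44), trained on the
    # sparse goal-reaching reward 1{s_{t+1} = s'} with goals s' resampled from the buffer.
    w = w_net(goal)                                              # (b, d)
    q = (psi_net(obs, action, w) * w).sum(dim=-1)                # Q(s_t, a_t, s')
    with torch.no_grad():
        next_action = policy_net(next_obs, w)                    # pi(s_{t+1}, w(s'))
        next_q = (psi_net(next_obs, next_action, w) * w).sum(dim=-1)
        reward = (next_obs == goal).all(dim=-1).float()          # 1{s_{t+1} = s'}; almost always 0 in continuous spaces
        target_q = reward + discount * next_q
    return F.mse_loss(q, target_q)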
The sparse reward 1_{s_{t+1}=s′} could suffer from large variance in continuous state spaces. We mitigate this in one of two ways:

1. biasing the sampling of s′ by setting s′ to s_{t+1} half of the time;
2. replacing the reward by a less sparse one, 1_{‖s_{t+1} − s′‖₂ ≤ ε}.

Table 4 reports the normalized scores for each domain, for the two variants just described, trained on the RND replay buffer, averaged over tasks and over 10 random seeds. The first variant scores 84% on Maze, 62% on Quadruped tasks, 17% on Walker tasks, and 2% on Cheetah tasks. The second variant is worse overall. So overall, this works well on Maze (as expected, since this is a goal-oriented problem), moderately well on Quadruped tasks (where most other methods work well), and poorly on the other environments. This is expected, as this method is not designed to handle dense combinations of goals.

Table 4: Normalized score for each domain of the two variants of USF from Ma et al. (2020), trained on the RND replay buffer, averaged over tasks and 10 random seeds.

Reward | Maze | Walker | Cheetah | Quadruped
1_{s=s′} | 83.82 | 17.06 | 1.60 | 61.92
1_{‖s−s′‖₂ ≤ ε} | 87.44 | 12.42 | 0.74 | 20.25

Relationship with FB. The use of universal SFs in Ma et al. (2020) is quite different from the use of universal SFs in Barreto et al. (2017) and Borsa et al. (2018). The latter is mathematically related to FB, as described at the end of Section 4. The former uses SFs as an intermediate tool for a goal-oriented model, and is more distantly related to FB (notably, it is designed to deal with single goals, not linear combinations of goals such as dense rewards).

A key difference between FB and goal-oriented methods is the following. Above, we use ψ(s, a, w(g))^⊤w(g) as a model of the optimal Q-function for the policy with goal g. This is reminiscent of F(s, a, z)^⊤B(g) with z = B(g), the value of z used to reach g in FB. However, even if we restrict FB to goal-reaching by using z = B(g) only (a significant restriction), FB learns F(s, a, B(g))^⊤B(g′) to model the successor measure, i.e., the number of visits to each goal g′ for the policy with goal g, starting at (s, a). Thus, FB learns an object indexed by (s, a, g, g′) for all pairs (g, g′). So FB learns more information (it models successor measures instead of Q-functions), and allows for recovering linear combinations of goals in a principled way. Meanwhile, even assuming perfect neural network optimization in goal-reaching methods, there is no reason the goal-oriented policies would be optimal for arbitrary linear combinations of goals, only for single goals.

L PSEUDOCODE OF TRAINING LOSSES

Here we provide PyTorch snippets for the key losses, notably the FB loss and the SF loss, as well as the various feature learning methods for SF.
(The snippets assume import torch and import torch.nn.functional as F, plus a TruncatedNormal utility providing smoothed target actions for policy smoothing.)

def compute_fb_loss(agent, obs, action, next_obs, z, discount):
    # compute target successor measure
    with torch.no_grad():
        mu = agent.policy_net(next_obs, z)
        next_action = TruncatedNormal(mu=mu, stddev=agent.cfg.stddev, clip=agent.cfg.stddev_clip)  # target policy smoothing
        target_F1, target_F2 = agent.forward_target_net(next_obs, z, next_action)  # batch x z_dim
        target_B = agent.backward_target_net(next_obs)  # batch x z_dim
        target_M1, target_M2 = [torch.einsum("sd, td -> st", target_Fi, target_B) for target_Fi in [target_F1, target_F2]]  # batch x batch
        target_M = torch.min(target_M1, target_M2)

    # compute the main FB loss, Eq. (41)
    F1, F2 = agent.forward_net(obs, z, action)
    B = agent.backward_net(next_obs)
    M1, M2 = [torch.einsum("sd, td -> st", Fi, B) for Fi in [F1, F2]]  # batch x batch
    I = torch.eye(*M1.size(), device=M1.device)
    off_diag = ~I.bool()
    fb_offdiag = 0.5 * sum((M - discount * target_M)[off_diag].pow(2).mean() for M in [M1, M2])
    fb_diag = -sum(M.diag().mean() for M in [M1, M2])
    fb_loss = fb_offdiag + fb_diag

    # compute the auxiliary loss, Eq. (42)
    next_Q1, next_Q2 = [torch.einsum("sd, sd -> s", target_Fi, z) for target_Fi in [target_F1, target_F2]]
    next_Q = torch.min(next_Q1, next_Q2)
    cov = torch.matmul(B.T, B) / B.shape[0]
    inv_cov = torch.linalg.pinv(cov)
    implicit_reward = (torch.matmul(B, inv_cov) * z).sum(dim=1)  # B(s')^T Cov^+ z
    target_Q = implicit_reward.detach() + discount * next_Q  # batch_size
    Q1, Q2 = [torch.einsum("sd, sd -> s", Fi, z) for Fi in [F1, F2]]
    q_loss = F.mse_loss(Q1, target_Q) + F.mse_loss(Q2, target_Q)
    q_loss /= agent.cfg.z_dim
    fb_loss += q_loss

    # compute the orthonormality loss, Eqs. (25)-(26)
    Cov = torch.matmul(B, B.T)
    orth_loss_diag = -2 * Cov.diag().mean()
    orth_loss_offdiag = Cov[off_diag].pow(2).mean()
    orth_loss = orth_loss_offdiag + orth_loss_diag
    fb_loss += agent.cfg.ortho_coef * orth_loss

    return fb_loss

Listing 1: Pytorch code for the FB training loss

def compute_sf_loss(agent, obs, action, next_obs, z, discount):
    # compute target q-value
    with torch.no_grad():
        mu = agent.policy_net(next_obs, z)
        next_action = TruncatedNormal(mu=mu, stddev=agent.cfg.stddev, clip=agent.cfg.stddev_clip)
        next_F1, next_F2 = agent.successor_target_net(next_obs, z, next_action)  # batch x z_dim
        target_phi = agent.feature_net(next_obs).detach()  # batch x z_dim, basic features of the next state
        next_Q1, next_Q2 = [torch.einsum("sd, sd -> s", next_Fi, z) for next_Fi in [next_F1, next_F2]]
        next_Q = torch.min(next_Q1, next_Q2)
        target_Q = torch.einsum("sd, sd -> s", target_phi, z) + discount * next_Q

    F1, F2 = agent.successor_net(obs, z, action)
    Q1, Q2 = [torch.einsum("sd, sd -> s", Fi, z) for Fi in [F1, F2]]
    sf_loss = F.mse_loss(Q1, target_Q) + F.mse_loss(Q2, target_Q)

    return sf_loss

Listing 2: Pytorch code for the SF training loss

def compute_phi_loss(agent, obs, next_obs):
    phi = agent.feature_net(obs)
    next_phi = agent.feature_net(next_obs)
    loss = (phi - next_phi).pow(2).mean()

    # compute the orthonormality loss
    Cov = torch.matmul(phi, phi.T)
    I = torch.eye(*Cov.size(), device=Cov.device)
    off_diag = ~I.bool()
    orth_loss_diag = -2 * Cov.diag().mean()
    orth_loss_offdiag = Cov[off_diag].pow(2).mean()
    orth_loss = orth_loss_offdiag + orth_loss_diag
    loss += orth_loss

    return loss

Listing 3: Pytorch code for the Laplacian eigenfunctions (Lap) loss
def compute_phi_loss(agent, obs, future_obs):
    future_phi = agent.feature_net(future_obs)
    mu = agent.mu_net(obs)
    future_phi = F.normalize(future_phi, dim=1)
    mu = F.normalize(mu, dim=1)
    logits = torch.einsum("sd, td -> st", mu, future_phi)  # batch x batch
    I = torch.eye(*logits.size(), device=logits.device)
    off_diag = ~I.bool()
    logits_off_diag = logits[off_diag].reshape(logits.shape[0], logits.shape[0] - 1)
    loss = -logits.diag() + torch.logsumexp(logits_off_diag, dim=1)
    loss = loss.mean()

    return loss

Listing 4: Pytorch code for the contrastive (CL) loss

def compute_phi_loss(agent, obs, action, next_obs):
    phi = agent.feature_net(obs)
    next_phi = agent.feature_net(next_obs)
    predicted_action = agent.inverse_dynamic_net(torch.cat([phi, next_phi], dim=1))
    loss = (action - predicted_action).pow(2).mean()

    return loss

Listing 5: Pytorch code for the ICM loss

def compute_phi_loss(agent, obs, action, next_obs):
    phi = agent.feature_net(obs)
    predicted_next_obs = agent.forward_dynamic_net(torch.cat([phi, action], dim=1))
    loss = (predicted_next_obs - next_obs).pow(2).mean()

    return loss

Listing 6: Pytorch code for the Trans loss

def compute_phi_loss(agent, obs, action, next_obs):
    phi = agent.feature_net(obs)
    with torch.no_grad():
        next_phi = agent.target_feature_net(next_obs)
    predicted_next_obs = agent.forward_dynamic_net(torch.cat([phi, action], dim=1))
    loss = (predicted_next_obs - next_phi.detach()).pow(2).mean()

    # update the target feature network (tau: target update rate)
    for param, target_param in zip(agent.feature_net.parameters(), agent.target_feature_net.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

    return loss

Listing 7: Pytorch code for the Latent loss

def compute_phi_loss(agent, obs):
    phi = agent.feature_net(obs)
    predicted_obs = agent.decoder(phi)
    loss = (predicted_obs - obs).pow(2).mean()

    return loss

Listing 8: Pytorch code for the AEnc loss

def compute_phi_loss(agent, obs, action, next_obs):
    phi = agent.feature_net(next_obs)
    mu = agent.mu_net(torch.cat([obs, action], dim=1))
    P = torch.einsum("sd, td -> st", mu, phi)
    I = torch.eye(*P.size(), device=P.device)
    off_diag = ~I.bool()
    loss = -2 * P.diag().mean() + P[off_diag].pow(2).mean()

    # compute the orthonormality loss
    Cov = torch.matmul(phi, phi.T)
    orth_loss_diag = -2 * Cov.diag().mean()
    orth_loss_offdiag = Cov[off_diag].pow(2).mean()
    orth_loss = orth_loss_offdiag + orth_loss_diag
    loss += orth_loss

    return loss

Listing 9: Pytorch code for the LRA-P loss

def compute_phi_loss(agent, obs, action, next_obs, discount):
    phi = agent.feature_net(next_obs)
    mu = agent.mu_net(obs)
    SR = torch.einsum("sd, td -> st", mu, phi)
    with torch.no_grad():
        target_phi = agent.target_feature_net(next_obs)
        target_mu = agent.target_mu_net(next_obs)
        target_SR = torch.einsum("sd, td -> st", target_mu, target_phi)

    I = torch.eye(*SR.size(), device=SR.device)
    off_diag = ~I.bool()
    loss = -2 * SR.diag().mean() + (SR - discount * target_SR.detach())[off_diag].pow(2).mean()

    # compute the orthonormality loss
    Cov = torch.matmul(phi, phi.T)
    orth_loss_diag = -2 * Cov.diag().mean()
    orth_loss_offdiag = Cov[off_diag].pow(2).mean()
    orth_loss = orth_loss_offdiag + orth_loss_diag
    loss += orth_loss

    return loss

Listing 10: Pytorch code for the LRA-SR loss
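As a usage note, here is a sketch (ours, not from the released code) of how a loss such as compute_fb_loss above could be wired into one offline gradient step; the replay-buffer API, the sample_z helper and the optimizer attribute are hypothetical placeholders.

def fb_update_step(agent, replay_buffer, batch_size=1024):
    obs, action, next_obs = replay_buffer.sample(batch_size)   # hypothetical buffer API
    z = agent.sample_z(batch_size)                             # mix of uniform-sphere and B(s) samples, cf. Section G.3
    loss = compute_fb_loss(agent, obs, action, next_obs, z, agent.cfg.discount)
    agent.fb_optimizer.zero_grad()
    loss.backward()
    agent.fb_optimizer.step()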