# Bootstrapped Representations in Reinforcement Learning

Charline Le Lan 1, Stephen Tu 2, Mark Rowland 2, Anna Harutyunyan 2, Rishabh Agarwal 2, Marc G. Bellemare 2, Will Dabney 2

1 University of Oxford, 2 Google DeepMind. Correspondence to: Charline Le Lan.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

In reinforcement learning (RL), state representations are key to dealing with large or continuous state spaces. While one of the promises of deep learning algorithms is to automatically construct features well-tuned for the task they try to solve, such a representation might not emerge from end-to-end training of deep RL agents. To mitigate this issue, auxiliary objectives are often incorporated into the learning process and help shape the learnt state representation. Bootstrapping methods are today's method of choice to make these additional predictions. Yet, it is unclear which features these algorithms capture and how they relate to those from other auxiliary-task-based approaches. In this paper, we address this gap and provide a theoretical characterization of the state representation learnt by temporal difference learning (Sutton, 1988). Surprisingly, we find that this representation differs from the features learned by Monte Carlo and residual gradient algorithms for most transition structures of the environment in the policy evaluation setting. We describe the efficacy of these representations for policy evaluation, and use our theoretical analysis to design new auxiliary learning rules. We complement our theoretical results with an empirical comparison of these learning rules for different cumulant functions on classic domains such as the four-room domain (Sutton et al., 1999) and Mountain Car (Moore, 1990).

1. Introduction

The process of representation learning is crucial to the success of reinforcement learning at scale. In deep reinforcement learning, a neural network is used to parameterise a representation $\phi$ which is linearly mapped into a value function (Figure 1) (Yu and Bertsekas, 2009; Bellemare et al., 2019; Levine et al., 2017); this approach often leads to state-of-the-art performance in the field (Mnih et al., 2015). State representations are key to the stability and quality of this learning process. However, a representation supporting the downstream task of interest might not emerge from end-to-end training. Hence, auxiliary objectives are often incorporated into the training process to help the agent combine its inputs into useful features (Sutton et al., 2011; Jaderberg et al., 2017; Bellemare et al., 2017; Lyle et al., 2021), and the resulting network's representation can help the agent estimate the value function.

Figure 1: In deep RL, we see the penultimate layer of the network as the representation $\phi$, which is linearly transformed into a value prediction $\hat V_{\phi,w}$ and auxiliary predictions $\Psi(x)$ by bootstrapping methods.
To construct representations supporting these characteristics, different kinds of auxiliary tasks have thus been incorporated into the learning process, such as controlling visual aspects of observed states (Jaderberg et al., 2017), predicting the values of several policies (Bellemare et al., 2019; Dabney et al., 2021), predicting values over multiple discount factors (Fedus et al., 2019), or predicting next-state observations (Jaderberg et al., 2017; Gelada et al., 2019) and rewards (Dabney et al., 2021; Lyle et al., 2021; Farebrother et al., 2023). While these tasks have mainly been trained by bootstrapping, a precise characterization of the representations they learn is lacking. This paper aims to fill this gap. We study the representations learnt by TD learning when training auxiliary tasks consisting of predicting the expected return of a fixed policy for several cumulant functions (Section 3). More generally, this analysis informs bootstrapped representations arising from algorithms such as Q-learning (Watkins and Dayan, 1992), n-step Q-learning (Hessel et al., 2018; Kapturowski et al., 2019; Schwarzer et al., 2021) or Retrace (Munos et al., 2016). Our key insight is that the way we train these value functions, for instance by TD learning, Monte Carlo or residual gradient, influences the resulting features. In particular, we show that when trained by TD learning, these features converge to the top-$d$ real invariant subspace of the transition matrix $P^\pi$, when it exists (Theorem 1). We present an empirical evaluation that supports our theoretical characterizations and shows the importance of the choice of a learning rule to learn the value function in Section 5.

In Section 4, we characterise the goodness of these representations by quantifying the approximation error of a linear prediction of the value function from these frozen representations in the TD learning and batch Monte Carlo settings (Subsection 4.1). We find that to minimize this error, the cumulants need to depend on the dynamics of the environment, but in a different way depending on whether we learn the main value function by batch Monte Carlo or TD learning. Then, we show that random cumulants, which have been used in the literature (Lyle et al., 2021; Farebrother et al., 2023), can be good pseudo-reward functions for some particular structures of the successor representation (Dayan, 1993) by providing an error bound that arises from sampling a small number of random pseudo-reward functions (Subsection 4.2). Finally, we find that one way to improve this bound is to sample pseudo-reward functions which depend on the dynamics of the environment. Inspired by this observation, we propose a novel auxiliary task method with adaptive cumulants and show that the resulting pretrained features outperform training from scratch on the Four Rooms and Mountain Car domains (Subsection 5.3).

2. Background

We consider a Markov decision process (MDP) $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, R, P, \gamma \rangle$ (Puterman, 1994) with finite state space $\mathcal{S}$, finite set of actions $\mathcal{A}$, transition kernel $P : \mathcal{S} \times \mathcal{A} \to \mathscr{P}(\mathcal{S})$, deterministic reward function $R : \mathcal{S} \times \mathcal{A} \to [-R_{\max}, R_{\max}]$, and discount factor $\gamma \in [0, 1)$. A stationary policy $\pi : \mathcal{S} \to \mathscr{P}(\mathcal{A})$ is a mapping from states to distributions over actions, describing a particular way of interacting with the environment. We denote the set of all policies by $\Pi$.
We write $P^\pi : \mathcal{S} \to \mathscr{P}(\mathcal{S})$ for the transition kernel induced by a policy $\pi \in \Pi$,

$$P^\pi(s, s') = \sum_{a \in \mathcal{A}} P(s, a)(s')\, \pi(a \mid s),$$

and $r^\pi : \mathcal{S} \to [-R_{\max}, R_{\max}]$ for the expected reward function

$$r^\pi(s) = \mathbb{E}_\pi[R(S_0, A_0) \mid S_0 = s,\ A_0 \sim \pi(\cdot \mid S_0)].$$

For any policy $\pi \in \Pi$, the value function $V^\pi(s)$ measures the expected discounted sum of rewards received when starting from state $s \in \mathcal{S}$ and acting according to $\pi$:

$$V^\pi(s) := \mathbb{E}_{\pi, P}\Big[\sum_{t=0}^{\infty} \gamma^t R(S_t, A_t) \,\Big|\, S_0 = s,\ A_t \sim \pi(\cdot \mid S_t)\Big].$$

It satisfies the Bellman equation (Bellman, 1957)

$$V^\pi(s) = r^\pi(s) + \gamma\, \mathbb{E}_{S' \sim P^\pi(\cdot \mid s)}[V^\pi(S')],$$

which can be expressed using vector notation (Puterman, 1994). Assuming that there are $S = |\mathcal{S}|$ states and viewing $r^\pi$ and $V^\pi$ as vectors in $\mathbb{R}^S$ and $P^\pi$ as an $\mathbb{R}^{S \times S}$ transition matrix, we have

$$V^\pi = r^\pi + \gamma P^\pi V^\pi = (I - \gamma P^\pi)^{-1} r^\pi.$$

We are interested in approximating the value function $V^\pi$ using a linear combination of features (Yu and Bertsekas, 2009; Levine et al., 2017; Bellemare et al., 2019). We call the mapping $\phi : \mathcal{S} \to \mathbb{R}^d$ a state representation, where $d \in \mathbb{N}^+$. Usually, we are interested in reducing the number of parameters needed to approximate the value function and have $d \ll S$. Given a feature vector $\phi(s)$ for a state $s \in \mathcal{S}$ and a weight vector $w \in \mathbb{R}^d$, the value function approximant at $s$ can be expressed as $\hat V_{\phi,w}(s) = \phi(s)^\top w$. We write the feature matrix $\Phi \in \mathbb{R}^{S \times d}$ whose rows correspond to the per-state feature vectors $(\phi(s),\ s \in \mathcal{S})$. This leads to the more concise value function approximation $\hat V_{\Phi,w} = \Phi w$.

In the classic linear function approximation literature, the feature map $\phi$ is held fixed, and the agent adapts only the weights $w$ to attempt to improve its predictions. By contrast, in deep reinforcement learning, $\phi$ itself is parameterized by a neural network and is typically updated alongside $w$ to improve predictions about the value function. We measure the accuracy of the linear approximation $\hat V_{\phi,w}$ in terms of the squared $\xi$-weighted $\ell_2$ norm, for $\xi \in \mathscr{P}(\mathcal{S})$,

$$\| \hat V_{\phi,w} - V^\pi \|^2_\xi = \sum_{s \in \mathcal{S}} \xi(s)\big(\phi(s)^\top w - V^\pi(s)\big)^2.$$

The $\xi$-weighted norm describes the importance of each state. We assume that $\xi(s) > 0$ for all states $s \in \mathcal{S}$.

2.1. Auxiliary Tasks

In deep reinforcement learning, the agent can use its representation $\phi$ to make additional predictions on a set of $T$ auxiliary task functions $\{\psi_t \in \mathbb{R}^S\}_{t \in \{1, \dots, T\}}$, where each $\psi_t$ maps states to real values (Jaderberg et al., 2017; Bellemare et al., 2019; Dabney et al., 2021). These predictions are used to refine the representation itself. We collect these targets into an auxiliary task matrix $\Psi \in \mathbb{R}^{S \times T}$ whose rows are $\psi(s) = [\psi_1(s), \dots, \psi_T(s)]$. We are interested in the case of linear task approximation where $W \in \mathbb{R}^{d \times T}$ is a weight matrix, and want to choose $\Phi$ and $W$ such that $\hat\Psi = \Phi W \approx \Psi^\pi$.

In this paper, we consider a variety of auxiliary tasks that ultimately involve predicting the value functions of auxiliary cumulants, also referred to as general value functions (GVFs; Sutton et al., 2011). By construction, these tasks can be decomposed into a non-zero cumulant function $g : \mathcal{S} \to \mathbb{R}^T$, mapping each state to $T$ real values, and an expected discounted next-state term when acting according to $\pi$:

$$\psi^\pi(s) = g(s) + \gamma\, \mathbb{E}_{S' \sim P^\pi(\cdot \mid s)}[\psi^\pi(S')].$$

In matrix form, this recurrence can be expressed as follows:

$$\Psi^\pi = G + \gamma P^\pi \Psi^\pi = (I - \gamma P^\pi)^{-1} G,$$

where $G \in \mathbb{R}^{S \times T}$ is a cumulant matrix whose columns correspond to each pseudo-reward vector. An example of a family of auxiliary tasks following this structure is the successor representation (SR) (Dayan, 1993).
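Since the state space is finite, the quantities above can be computed directly when $P^\pi$ is known or estimated. The short NumPy sketch below is our own illustration (the MDP, names and sizes are arbitrary, not taken from the paper's codebase); it instantiates $V^\pi = (I - \gamma P^\pi)^{-1} r^\pi$ and $\Psi^\pi = (I - \gamma P^\pi)^{-1} G$, where $G = I$ recovers the successor representation discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)
S, T, gamma = 6, 3, 0.9

# A random stochastic transition matrix P^pi and reward vector r^pi.
P = rng.random((S, S))
P /= P.sum(axis=1, keepdims=True)      # rows sum to 1
r = rng.uniform(-1.0, 1.0, size=S)

# Closed-form value function: V^pi = (I - gamma P^pi)^{-1} r^pi.
V = np.linalg.solve(np.eye(S) - gamma * P, r)

# General value functions for a cumulant matrix G (one column per task):
# Psi^pi = (I - gamma P^pi)^{-1} G.  Taking G = I gives the successor representation.
G = rng.normal(size=(S, T))
Psi = np.linalg.solve(np.eye(S) - gamma * P, G)
SR = np.linalg.inv(np.eye(S) - gamma * P)

# Sanity checks: both satisfy their Bellman equations.
assert np.allclose(V, r + gamma * P @ V)
assert np.allclose(Psi, G + gamma * P @ Psi)
print(V.round(3))
```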
The SR encodes a state in terms of the expected discounted time spent in other states and satisfies the following recursive form:

$$\psi^\pi(s, s') = \mathbb{I}[s = s'] + \gamma\, \mathbb{E}_{S' \sim P^\pi(\cdot \mid s)}[\psi^\pi(S', s')], \quad \text{for all } s' \in \mathcal{S}.$$

The SR is a collection of value functions associated with the cumulant matrix $G = I$. Here we focus our analysis on its tabular form, noting that it can be extended to larger state spaces in a number of ways (Barreto et al., 2017; Janner et al., 2020; Blier et al., 2021; Thakoor et al., 2022; Farebrother et al., 2023).

2.2. Monte Carlo Representations

To understand how auxiliary tasks shape representations, we start by presenting the simple case where the values of auxiliary cumulants are predicted in a supervised way. Here, the targets $\Psi^\pi = (I - \gamma P^\pi)^{-1} G$ are obtained by Monte Carlo rollouts, that is, using the fixed policy to perform roll-outs and collecting the sum of rewards. The goal is to minimize the loss below:

$$L^{\mathrm{MC}}_{\mathrm{aux}}(\Phi, W) = \big\| \Xi^{1/2}(\Phi W - \Psi^\pi) \big\|^2_F.$$

This method results in the network's representation $\Phi$, assuming a linear, fully-connected last layer, corresponding to the $k$ principal components of the auxiliary task matrix $\Psi^\pi$ if the network is otherwise unconstrained (Bellemare et al., 2019).

Proposition 1 (Monte Carlo representations). If $\mathrm{rank}(\Psi^\pi) \geq d \geq 1$, all representations spanning the top-$d$ left singular vectors of $\Psi^\pi$ with respect to the inner product $\langle x, y \rangle_\Xi$ are global minimizers of $L^{\mathrm{MC}}_{\mathrm{aux}}$ and can be recovered by stochastic gradient descent.

In large environments, it is not practical to collect full trajectories to estimate $\Psi^\pi$. Instead, practitioners learn them by bootstrapping (Sutton and Barto, 1998).

2.3. Temporal Difference Learning with a Deep Network

Temporal difference learning (TD; Sutton, 1988) is the method of choice for these auxiliary predictions. The main idea of this approach is bootstrapping (Sutton and Barto, 1998): the current estimate of the auxiliary task function is used to generate targets replacing the true values $\Psi^\pi$, in order to learn a new approximant of the auxiliary task function. In this paper, we consider one-step temporal difference learning, where we replace the targets by a one-step prediction from the currently approximated auxiliary task function. The analysis can easily be extended to n-step temporal difference learning; see Appendix G for further details.

In deep reinforcement learning, both the representation $\phi$ and the weights $W$ are learnt simultaneously by minimizing the following loss function:

$$L^{\mathrm{TD}}_{\mathrm{aux}}(\phi, W) = \mathbb{E}_{\substack{s \sim \xi \\ s' \sim P^\pi(\cdot \mid s)}} \Big[ \big\| \phi(s)^\top W - \mathrm{SG}\big(g(s)^\top + \gamma\, \phi(s')^\top W\big) \big\|^2 \Big],$$

where $\mathrm{SG}$ denotes a stop gradient and means that $\phi$ and $W$ are treated as constants when taking the gradient from automatic differentiation tools (Bradbury et al., 2018; Abadi et al., 2016; Paszke et al., 2019). Written in matrix form, we have

$$L^{\mathrm{TD}}_{\mathrm{aux}}(\Phi, W) = \big\| \Xi^{\frac{1}{2}} \big(\Phi W - \mathrm{SG}(G + \gamma P^\pi \Phi W)\big) \big\|^2_F.$$

Here, $\Xi \in \mathbb{R}^{S \times S}$ is a diagonal matrix with elements $\{\xi(s) : s \in \mathcal{S}\}$ on the diagonal. For clarity of exposition, we express this loss with universal value functions, but the analysis can be extended to state-action values at the cost of additional complexity. The idea is to reduce the mean squared error between the approximant $\hat\psi$ and the target values by stochastic gradient descent (SGD). Taking the gradient of $L$ with respect to $\Phi$ and $W$, we obtain the semi-gradient update rule

$$\Phi \leftarrow \Phi - \alpha\, \Xi \big((I - \gamma P^\pi)\Phi W - G\big) W^\top,$$
$$W \leftarrow W - \alpha\, \Phi^\top \Xi \big((I - \gamma P^\pi)\Phi W - G\big), \tag{1}$$

for a step size $\alpha$.
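To make the update rule concrete, here is a minimal NumPy transcription of Equation (1), applied with expected (matrix-form) updates rather than sampled transitions; the function and variable names are ours and the hyperparameters are illustrative.

```python
import numpy as np

def td_aux_update(Phi, W, P, G, Xi, gamma=0.9, alpha=0.01):
    """One semi-gradient TD step on the features and the weights (Equation 1),
    written with expected (matrix) updates rather than sampled transitions.

    Phi: (S, d) features, W: (d, T) weights, P: (S, S) transition matrix,
    G: (S, T) cumulants, Xi: (S, S) diagonal state-weighting matrix."""
    delta = (np.eye(P.shape[0]) - gamma * P) @ Phi @ W - G   # bootstrapped error, (S, T)
    return Phi - alpha * Xi @ delta @ W.T, W - alpha * Phi.T @ Xi @ delta

# Tiny example with G = I (the SR tasks) and a uniform state distribution.
rng = np.random.default_rng(1)
S, d, gamma = 8, 3, 0.9
A = rng.random((S, S)); A = A + A.T
P = A / A.sum(axis=1, keepdims=True)          # a reversible chain, for well-behaved dynamics
G, Xi = np.eye(S), np.eye(S) / S
Phi, W = rng.normal(size=(S, d)), rng.normal(size=(d, S))
for _ in range(5000):
    Phi, W = td_aux_update(Phi, W, P, G, Xi)
print(np.linalg.norm((np.eye(S) - gamma * P) @ Phi @ W - G))  # residual of the rank-d fit
```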
Because the values of the targets change over time, the loss $L$ does not have a proper gradient field (Dann et al., 2014) except in some particular cases (Barnard, 1993; Ollivier, 2018), and hence classic analyses of stochastic gradient descent (Bottou et al., 2018) do not apply.

3. Bootstrapped Representations

We now study the $d$-dimensional features that arise when performing value estimation of a fixed set of cumulants, and how the choice of a learning method such as TD learning affects the learnt representations. Our first result characterizes representations that bootstrap themselves. We assume that the features $\Phi$ are updated in a tabular manner under the dynamics in Equation (1). To simplify the presentation, we now make the following invertibility assumption.

Assumption 1. We assume that $\Phi^\top \Xi (I - \gamma P^\pi) \Phi$ is invertible for any full-rank representation $\Phi \in \mathbb{R}^{S \times d}$.

This standard assumption is for instance verified when $\xi$ is the stationary distribution over states under $\pi$ of an aperiodic, irreducible Markov chain (see e.g. Sutton et al., 2016).

An interesting characterization of the dynamical system in Equation (1) is its set of critical points. For a given $\Phi$, we write

$$W^{\mathrm{TD}}_{\Phi, G} \in \{ W \in \mathbb{R}^{d \times T} \mid \nabla_W L^{\mathrm{TD}}_{\mathrm{aux}}(\Phi, W) = 0 \}.$$

Using classic linear algebra, we find that the weights $W^{\mathrm{TD}}$ obtained at convergence correspond to the LSTD solution (Bradtke and Barto, 1996; Boyan, 2002; Zhang et al., 2021):

$$W^{\mathrm{TD}}_{\Phi, G} = \big( \Phi^\top \Xi (I - \gamma P^\pi) \Phi \big)^{-1} \Phi^\top \Xi\, G.$$

A key notion for our analysis is the concept of invariant subspace of a linear mapping.

Definition 1 (Gohberg et al., 2006). A representation $\Phi \in \mathbb{R}^{S \times d}$ spans a real invariant subspace of a linear mapping $M : \mathbb{R}^S \to \mathbb{R}^S$ if the column span of $\Phi$ is preserved by $M$, that is, in matrix form, $\mathrm{span}(M\Phi) \subseteq \mathrm{span}(\Phi)$.

For instance, any real eigenvector of $M$ generates one of its one-dimensional real invariant subspaces. We are now equipped with the tools to enumerate the set of critical representations $\{\Phi \in \mathbb{R}^{S \times d} \mid \nabla_\Phi L^{\mathrm{TD}}_{\mathrm{aux}}(\Phi, W^{\mathrm{TD}}_\Phi) = 0\}$ in the lemma below.

Lemma 1 (Critical representations for TD). All full-rank representations which are critical points of $L^{\mathrm{TD}}_{\mathrm{aux}}$ span real invariant subspaces of $(I - \gamma P^\pi)^{-1} G G^\top \Xi$, that is,

$$\mathrm{span}\big((I - \gamma P^\pi)^{-1} G G^\top \Xi\, \Phi\big) \subseteq \mathrm{span}(\Phi).$$

Proof. The proof is given in Appendix C and relies on the view of LSTD as an oblique projection (Scherrer, 2010).

In the particular case of an identity cumulant matrix and a uniform distribution over states, this set can be more directly expressed as the representations invariant under the transition dynamics.

Corollary 1. If $G = I$ and $\Xi = I/S$, all full-rank representations which are critical points of $L^{\mathrm{TD}}_{\mathrm{aux}}$ span real invariant subspaces of $P^\pi$.

Similarly to how the top principal components of a matrix explain most of its variability (Hotelling, 1933), these critical representations are not equally informative of the dynamics of the environment. This motivates the need to understand the behavior of the updates from Equation (1). To ease the analysis, we assume that the weights $W$ have converged perfectly to $W^{\mathrm{TD}}_{\Phi, G}$ at each time step (Le Lan et al., 2023) and consider the following continuous-time dynamics:

$$\frac{d}{dt}\Phi = -\nabla_\Phi L(\Phi, W^{\mathrm{TD}}_{\Phi, G}) = F(\Phi), \tag{2}$$

$$F(\Phi) := -2\, \Xi \big( (I - \gamma P^\pi) \Phi W^{\mathrm{TD}}_{\Phi, G} - G \big) (W^{\mathrm{TD}}_{\Phi, G})^\top.$$
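The LSTD weights and the dynamics $F(\Phi)$ above are easy to evaluate numerically. The sketch below is our own construction (assuming a reversible, hence real-diagonalisable, $P^\pi$, with $G = I$ and uniform $\xi$); it checks that features spanning an invariant subspace of $P^\pi$ are critical points of Equation (2), while generic features are not.

```python
import numpy as np

rng = np.random.default_rng(2)
S, d, gamma = 10, 3, 0.9

# A reversible (hence real-diagonalisable) transition matrix: row-normalised symmetric weights.
A = rng.random((S, S)); A = A + A.T
P = A / A.sum(axis=1, keepdims=True)
Xi = np.eye(S) / S
G = np.eye(S)
L = np.eye(S) - gamma * P

def lstd_weights(Phi):
    # LSTD solution W_TD for fixed features Phi.
    return np.linalg.solve(Phi.T @ Xi @ L @ Phi, Phi.T @ Xi @ G)

def F(Phi):
    # Right-hand side of the continuous-time feature dynamics, Equation (2).
    W = lstd_weights(Phi)
    return -2 * Xi @ (L @ Phi @ W - G) @ W.T

# Features spanning a real invariant subspace of P (here, d eigenvectors) are critical points.
eigvals, eigvecs = np.linalg.eig(P)
order = np.argsort(-eigvals.real)
Phi_inv = eigvecs[:, order[:d]].real           # a top-d invariant subspace
Phi_rand = rng.normal(size=(S, d))             # generic features are not critical
print(np.abs(F(Phi_inv)).max())    # ~0 up to numerical error
print(np.abs(F(Phi_rand)).max())   # noticeably nonzero
```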
We can characterize the behavior of the above critical representations in terms of the notion of a top-$d$ invariant subspace of $P^\pi$.

Definition 2 (Top-$d$ invariant subspace). Let $\lambda_1, \dots, \lambda_S$ be the (possibly complex) eigenvalues of a linear mapping $M : \mathbb{R}^S \to \mathbb{R}^S$, ordered by decreasing real part, $\mathrm{Re}(\lambda_i) \geq \mathrm{Re}(\lambda_{i+1})$ for $i \in \{1, \dots, S-1\}$, and assume that $\mathrm{Re}(\lambda_d) > \mathrm{Re}(\lambda_{d+1})$. A representation $\Phi \in \mathbb{R}^{S \times d}$ spans a top-$d$ invariant subspace of $M$ if (i) it is an $M$-invariant subspace and (ii) $\mathrm{span}(\Phi) \cap \mathrm{span}(v_j) = \{0\}$ for all $j \in \{d+1, \dots, S\}$, where $v_j$ is a corresponding eigenvector for eigenvalue $\lambda_j$.

Our key result is that certain non top-$d$ invariant subspaces of $P^\pi$ correspond to unstable critical points of the ordinary differential equation (2).

Theorem 1 (TD representations). Assume $P^\pi$ real diagonalisable, $G = I$ and a uniform distribution $\xi$ over states. Let $(v_1, \dots, v_S)$ denote the eigenvectors of $P^\pi$ corresponding to the eigenvalues $(\lambda_1, \dots, \lambda_S)$. Under the dynamics in Equation (2), all real invariant subspaces of dimension $d$ are critical points, and any real non top-$d$ invariant subspace $\Phi = (v_{i_1}, \dots, v_{i_d})$ is unstable for gradient descent.

The result above takes a step toward establishing that the TD algorithm converges towards a real top-$d$ invariant subspace or diverges with probability 1. While real diagonalisable transition matrices always induce real invariant subspaces, complex eigenvalues do not guarantee their existence, and in such a case, where there is no top-$d$ real invariant subspace, the representation learning algorithm does not converge.

Figure 2: Left: A simple 3-state MDP. Right: Five subspaces, each represented by a circle, spanned by $\Phi$ during the last training steps of gradient descent on $L^{\mathrm{TD}}_{\mathrm{aux}}$ for $d = 2$.

As an illustration, consider the three-state MDP depicted in Figure 2, left, whose transition matrix is complex diagonalisable and given by

$$P^\pi = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}.$$

Its eigenvalues are $\lambda_1 = 1$, associated with the real eigenvector $e_1$, and the complex conjugate pair $(\lambda_2, \bar\lambda_2) = (e^{2\pi i / 3}, e^{-2\pi i / 3})$, associated with the pair of real vectors $(e_2, e_3)$ spanning the corresponding invariant subspace. Hence, the real invariant subspaces of $P^\pi$ are $\{0\}$, $\mathrm{span}(e_1)$, $\mathrm{span}(e_2, e_3)$ and $\mathrm{span}(e_1, e_2, e_3)$. Note that there is no 2-dimensional real invariant subspace containing the top eigenvector $e_1$. Consequently, the 2-dimensional representation learnt by gradient descent on the TD learning rule with $G = I$ does not converge and rotates in the higher-dimensional subspace $\mathrm{span}(e_1, e_2, e_3)$ (see Figure 2, right).

To understand the importance of the stop-gradient in TD learning, it is useful to study the representations arising from the minimization of the following loss function:

$$L^{\mathrm{res}}_{\mathrm{aux}}(\Phi, W) = \big\| \Xi^{\frac{1}{2}} \big( \Phi W - (G + \gamma P^\pi \Phi W) \big) \big\|^2_F,$$

which corresponds to residual gradient algorithms (Baird, 1995). While it has been remarked on before that the weights minimizing $L^{\mathrm{res}}_{\mathrm{aux}}(\Phi, W)$ for a fixed $\Phi$ differ from $W^{\mathrm{TD}}_{\Phi, G}$ (see Lemma 8; Lagoudakis and Parr, 2003; Scherrer, 2010), this objective function also has a different optimal representation.

Proposition 2 (Residual representations). Let $d \in \{1, \dots, S\}$ and $F_d$ be the top-$d$ left singular vectors of $G$ with respect to the inner product $\langle x, y \rangle_\Xi = y^\top \Xi x$, for all $x, y \in \mathbb{R}^S$. All representations spanning $(I - \gamma P^\pi)^{-1} F_d$ are global minimizers of $L^{\mathrm{res}}_{\mathrm{aux}}$ and can be recovered by stochastic gradient descent.

While TD and Monte Carlo representations are in general different, in the particular case of symmetric transition matrices and orthogonal cumulant matrices, they are the same.

Corollary 2 (Symmetric transition matrices). If a cumulant matrix $G \in \mathbb{R}^{S \times T}$ (with $T \leq S$) has unit-norm, orthogonal columns (e.g. $G = I$), the representations learnt from the supervised objective $L^{\mathrm{MC}}_{\mathrm{aux}}$ and the TD update rule $L^{\mathrm{TD}}_{\mathrm{aux}}$ are the same for symmetric transition matrices $P^\pi$ under a uniform state distribution $\xi$.

This is because eigenvectors and singular vectors are identical in that setting and the eigenvalues of the successor representation are all positive.
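This coincidence is easy to observe numerically. The sketch below is our own check with illustrative matrices (standard inner product, i.e., uniform $\xi$): it compares the top-$d$ eigenvector span and the top-$d$ left singular vector span of the SR for a symmetric transition matrix, and for a real-diagonalisable but non-symmetric one.

```python
import numpy as np

def top_d_spans(P, d=3, gamma=0.9):
    """Top-d eigenvector span and top-d left singular vector span of the SR."""
    SR = np.linalg.inv(np.eye(len(P)) - gamma * P)
    eigvals, eigvecs = np.linalg.eig(SR)
    E = eigvecs[:, np.argsort(-eigvals.real)[:d]].real
    U, _, _ = np.linalg.svd(SR)
    return E, U[:, :d]

def same_span(X, Y):
    proj = lambda Z: Z @ np.linalg.pinv(Z.T @ Z) @ Z.T
    return np.allclose(proj(X), proj(Y))

# A symmetric transition matrix: a random walk on a path with holding probabilities.
S = 10
P_sym = np.zeros((S, S))
for i in range(S - 1):
    P_sym[i, i + 1] = P_sym[i + 1, i] = 0.3
P_sym += np.diag(1.0 - P_sym.sum(axis=1))

# A real-diagonalisable but non-symmetric chain (row-normalised symmetric weights).
rng = np.random.default_rng(8)
A = rng.random((S, S)); A = A + A.T
P_rev = A / A.sum(axis=1, keepdims=True)

E, U = top_d_spans(P_sym)
print(same_span(E, U))   # True: eigenvectors and singular vectors of the SR coincide
E, U = top_d_spans(P_rev)
print(same_span(E, U))   # typically False once P is no longer symmetric
```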
4. Representations for Policy Evaluation

With the results from the previous section, the question that naturally arises is which approach results in better representations. To provide an answer, we consider a two-stage procedure. First, we learn a representation $\Phi$ by predicting the values of $T$ auxiliary cumulants simultaneously, using one of the learning rules described in Section 3. Then, we retain this representation and perform policy evaluation. If the value function is estimated on-policy, it converges towards the LSTD solution (Tsitsiklis and Van Roy, 1996) $\hat V^{\mathrm{TD}} = \Phi w^{\mathrm{TD}}_\Phi$, where

$$w^{\mathrm{TD}}_\Phi = \big( \Phi^\top \Xi (I - \gamma P^\pi) \Phi \big)^{-1} \Phi^\top \Xi\, r^\pi.$$

We are interested in whether this value function results in low approximation error on average over random reward functions.

Definition 3 ($\ell_1$-ball optimal representation for TD). We say that a representation $\Phi^*_{\mathrm{TD}}$ is $\ell_1$-ball optimal for TD learning when it minimizes the error

$$\mathbb{E}_{r^\pi}\big[ \| \Phi w^{\mathrm{TD}}_\Phi - V^\pi \|^2_\xi \big], \tag{3}$$

where the expectation is over the reward functions $r^\pi$ sampled uniformly over the $\ell_1$-ball $\{r^\pi \in \mathbb{R}^S \mid \|r^\pi\|_1 \leq 1\}$.

This set of rewards models an unknown reward function. Here $\Phi^*_{\mathrm{TD}}$ depends on the transition dynamics of the environment but not on the reward function.

Lemma 2. A representation $\Phi^*_{\mathrm{TD}}$ is $\ell_1$-ball optimal for TD learning iff it is a solution of the following optimization problem:

$$\Phi^*_{\mathrm{TD}} \in \arg\min_\Phi \big\| \Xi^{1/2} \big( \Phi W^{\mathrm{TD}}_{\Phi, I} - (I - \gamma P^\pi)^{-1} \big) \big\|^2.$$

When $P^\pi$ is symmetric and $\Xi = I/|\mathcal{S}|$, the minimum is achieved by both the top-$d$ left singular vectors and the top-$d$ invariant subspace of the SR. However, as the misalignment between the top-$d$ left and top-$d$ right singular vectors of $(I - \gamma P^\pi)$ increases, the top-$d$ invariant subspace results in lower error compared to the top-$d$ singular vectors (see Figure 3); note that here, none of them achieves $\Phi^*_{\mathrm{TD}}$ and hence $G = I$ is not $\ell_1$-ball optimal for TD learning.

Table 1: Different types of representation loss and their induced representations. The supervised targets $\Psi \in \mathbb{R}^{S \times T}$ are $(I - \gamma P^\pi)^{-1} G$. $\mathrm{SVD}(M)$ denotes the top-$d$ left singular vectors of $M$, $\mathrm{INV}(M)$ the top-$d$ invariant subspace of $M$, and $\Sigma_d \in \mathbb{R}^{d \times d}$ the diagonal matrix with the top-$d$ singular values of $(I - \gamma P^\pi)^{-1}$ on its diagonal.

| Main algorithm | $\ell_1$-ball optimal representation | Representation loss | Learnt representation |
| --- | --- | --- | --- |
| Batch MC | $\mathrm{SVD}\big((I - \gamma P^\pi)^{-1}\big)$ | MC | $\mathrm{SVD}\big((I - \gamma P^\pi)^{-1} G\big)$ |
| Residual | $\mathrm{SVD}\big((I - \gamma P^\pi)^{-1}\big)\, \Sigma_d$ | Residual | $(I - \gamma P^\pi)^{-1}\, \mathrm{SVD}(G)$ |
| TD | $\Phi^*_{\mathrm{TD}}$ | TD | $\mathrm{INV}\big((I - \gamma P^\pi)^{-1} G G^\top \Xi\big)$ |

Figure 3: MC (left) and TD (right) approximation errors as a function of the misalignment of the top left and right singular vectors of the SR, induced by greedifying the policy. Trained with $L^{\mathrm{MC}}_{\mathrm{aux}}$, $L^{\mathrm{TD}}_{\mathrm{aux}}$, $G = I$, $d = 1$ on a 4-state room.

As a comparison, we study which representations are $\ell_1$-ball optimal for linear batch Monte Carlo policy evaluation. In that setting, we are given a dataset consisting of states and their associated values, which can be estimated by the realisation of the random return (Bellemare et al., 2017; Sutton and Barto, 2018), and the weights are learnt by least squares regression.
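The two-stage protocol behind Definition 3 (and Definition 4 below) can be sketched as follows: freeze $\Phi$, fit the main value function by LSTD or by least squares, and average the $\xi$-weighted error over sampled rewards. The code is our own and only illustrative; in particular, the reward sampling is a rough stand-in for the uniform distribution over the $\ell_1$-ball.

```python
import numpy as np

rng = np.random.default_rng(3)
S, d, gamma = 12, 4, 0.9
A = rng.random((S, S)); A = A + A.T
P = A / A.sum(axis=1, keepdims=True)
Xi = np.eye(S) / S
SR = np.linalg.inv(np.eye(S) - gamma * P)     # successor representation
L = np.eye(S) - gamma * P

def td_value_error(Phi, r):
    # LSTD weights for the main value function, then xi-weighted squared error.
    w = np.linalg.solve(Phi.T @ Xi @ L @ Phi, Phi.T @ Xi @ r)
    err = Phi @ w - SR @ r
    return err @ Xi @ err

def mc_value_error(Phi, r):
    # Least-squares regression of V^pi on the frozen features.
    V = SR @ r
    w = np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi @ V)
    err = Phi @ w - V
    return err @ Xi @ err

# Compare two frozen representations: top-d left singular vectors of the SR
# versus d random features, averaged over rewards drawn near the l1 sphere
# (a crude proxy for the uniform l1-ball distribution).
U, _, _ = np.linalg.svd(SR)
Phi_svd, Phi_rand = U[:, :d], rng.normal(size=(S, d))
rewards = rng.laplace(size=(256, S))
rewards /= np.abs(rewards).sum(axis=1, keepdims=True)
for name, Phi in [("SVD(SR)", Phi_svd), ("random", Phi_rand)]:
    print(name,
          np.mean([mc_value_error(Phi, r) for r in rewards]),
          np.mean([td_value_error(Phi, r) for r in rewards]))
```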
As above, we want the features minimizing the resulting approximation error averaged over a set of possible reward functions.

Definition 4 ($\ell_1$-ball optimal representation for MC). We say that a representation $\Phi^*_{\mathrm{MC}}$ is $\ell_1$-ball optimal for batch Monte Carlo when it minimizes the error

$$\mathbb{E}_{\|r^\pi\|_1 \leq 1}\big[ \| \Phi w^{\mathrm{MC}}_\Phi - V^\pi \|^2_\xi \big], \tag{4}$$

where $\hat V^{\mathrm{MC}} = \Phi w^{\mathrm{MC}}_\Phi$ is the value function learnt at convergence and $w^{\mathrm{MC}}_\Phi = (\Phi^\top \Xi \Phi)^{-1} \Phi^\top \Xi V^\pi$.

Unlike TD, a representation $\Phi^*_{\mathrm{MC}}$ can be learnt by training $L^{\mathrm{MC}}_{\mathrm{aux}}$ with $G = I$.

Lemma 3. A representation $\Phi^*_{\mathrm{MC}}$ is $\ell_1$-ball optimal for batch Monte Carlo policy evaluation if its column space spans the top-$d$ left singular vectors (with respect to the inner product $\langle x, y \rangle_\Xi$) of $(I - \gamma P^\pi)^{-1}$.

We summarize in Table 1 our representation learning results mentioned throughout Section 3 and Section 4. For completeness, we also include $\ell_1$-ball optimal representations for residual algorithms. Proofs can be found in Appendix D.

4.1. TD and Monte Carlo Need Different Cumulants

Having characterized which features common auxiliary tasks capture and what representations are desirable to support training the main value function, we now show that MC policy evaluation and TD learning need different cumulants. In large environments, we are interested in cumulant matrices encoding a small number of tasks $T \ll S$.

Lemma 4. Denote $B_T \in \mathbb{R}^{S \times T}$ the top-$T$ right singular vectors of the SR and $O(T)$ the set of orthogonal matrices in $\mathbb{R}^{T \times T}$. Training auxiliary tasks in a MC way with any $G$ from the set $\{G \in \mathbb{R}^{S \times T} \mid \exists M \in O(T),\ G = B_T M\}$ results in an $\ell_1$-ball optimal representation for batch Monte Carlo.

We showed in Section 3 that training auxiliary tasks by TD does not always converge when the transition matrix has complex eigenvalues. Maybe surprisingly, we find that this is not problematic when learning the main value function by TD. Indeed, the rotation of its own weights balances the rotation of the underlying representation.

Lemma 5. Let $\{\Phi_\omega\}$ be the set of rotating representations from Figure 2 learnt by TD learning with $G = I$ and $d = 2$. All these representations are equally good for learning the main value function by TD learning; that is, for all $\omega \in [0, 1]$,

$$\mathbb{E}_{\|r\|_2^2 < 1}\big[ \| \Phi_\omega w^{\mathrm{TD}}_{\Phi_\omega} - V^\pi \|^2_F \big]$$

is constant and independent of $\omega$.

Although $G = I$ does not always lead to $\Phi^*_{\mathrm{TD}}$ when training $L^{\mathrm{TD}}_{\mathrm{aux}}$, by analogy with the MC setting, we assume that $G = I$ leads to overall desirable representations. Assuming $\Xi = I/|\mathcal{S}|$, this means we would like the subspace spanned by top-$d$ invariant subspaces of $(I - \gamma P^\pi)^{-1}$ to be the same as the subspace spanned by the top-$d$ invariant subspaces of $(I - \gamma P^\pi)^{-1} G G^\top$.

Lemma 6. The set of cumulant matrices $G \in \mathbb{R}^{S \times T}$ that preserve the top-$T$ invariant subspaces of the successor representation by TD learning are the top-$T$ orthogonal invariant subspaces of $(I - \gamma P^\pi)^{-1}$, that is, satisfying $G^\top G = I$ by orthogonality and $\mathrm{span}\big((I - \gamma P^\pi)^{-1} G\big) \subseteq \mathrm{span}(G)$ by the invariance property.

Figure 4: Subspace distance between $\Phi$ and the top-$d$ left singular vectors of the SR (left), and between $\Phi$ and a top-$d$ $P^\pi$-invariant subspace (middle), over the course of training $L^{\mathrm{TD}}_{\mathrm{aux}}$, $L^{\mathrm{MC}}_{\mathrm{aux}}$ and $L^{\mathrm{res}}_{\mathrm{aux}}$ for $10^5$ steps, averaged over 30 seeds ($d = 3$). MDPs with real diagonalisable (left, middle) and symmetric (right) transition matrices are randomly generated. Shaded areas represent 95% confidence intervals.
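As an illustration of the contrast drawn in this subsection, the sketch below builds the two families of cumulants suggested by Lemma 4 (the top-$T$ right singular vectors of the SR, up to an orthogonal rotation) and by our reading of Lemma 6 (an orthonormal basis of a top-$T$ invariant subspace of the SR), and checks their properties numerically. This is our own construction, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(4)
S, T, gamma = 20, 5, 0.9
A = rng.random((S, S)); A = A + A.T
P = A / A.sum(axis=1, keepdims=True)        # reversible, hence real-diagonalisable
SR = np.linalg.inv(np.eye(S) - gamma * P)   # successor representation

def is_invariant(G):
    # Does the SR map span(G) into itself?
    coeffs, *_ = np.linalg.lstsq(G, SR @ G, rcond=None)
    return np.allclose(SR @ G, G @ coeffs)

# Lemma 4-style cumulants (good for batch MC): top-T right singular vectors of the SR,
# possibly rotated by any T x T orthogonal matrix M.
_, _, Vt = np.linalg.svd(SR)
M, _ = np.linalg.qr(rng.normal(size=(T, T)))
G_mc = Vt[:T].T @ M

# Lemma 6-style cumulants (good for TD): an orthonormal basis of a top-T invariant
# subspace of the SR, obtained here by orthonormalising its top-T eigenvectors.
eigvals, eigvecs = np.linalg.eig(SR)
top = eigvecs[:, np.argsort(-eigvals.real)[:T]].real
G_td, _ = np.linalg.qr(top)

for name, G in [("MC cumulants", G_mc), ("TD cumulants", G_td)]:
    print(name, "orthonormal:", np.allclose(G.T @ G, np.eye(T)),
          "SR-invariant span:", is_invariant(G))
```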
Unlike the MC case, a desirable cumulant matrix should encode the exact same information as the representation being learnt and the choice of a parametrization here matters. 4.2. A Deeper Analysis of Random Cumulants We now study random cumulants which have mainly been used in the literature (Dabney et al., 2021; Lyle et al., 2021; Farebrother et al., 2023) as a heuristic to learn representations. We aim to explain their recent achievements as a pretraining technique (Farebrother et al., 2023) and their effectiveness in sparse reward environments (Lyle et al., 2021). Proposition 3 (MC Error bound). Let G RS T be a sample from a standard gaussian distribution and assume d T. Let Fd be the top-d left singular vectors of the successor representation (I γP π) 1 and ˆFd be the top left singular vectors of (I γP π) 1G. Denote σ1 σ2 ... σS the singular values of the SR and dist(Fd, ˆFd) the sin θ distance between the subspaces spanned by Fd and ˆFd. Denoting p = T d, we have E[dist(Fd, ˆFd)] Proof. A proof can be found in Appendix E and follows arguments from random matrix theory. This bound fundamentally depends on the ratio of the singular values σd+1/σd of the successor representation. As the oversampling parameter (T d) grows, the right hand side tends towards 0. In particular, for the right hand side to be less than ϵ, we need the oversampling parameter to satisfy (T d) 1/ϵ2. We investigate to which extent this result holds empirically for the TD objective in Subsection 4.1. 5. Empirical Analysis In this section, we illustrate empirically the correctness of our theoretical characterizations from Section 3 and compare the goodness of different cumulants on the four room (Sutton and Barto, 2018) and Mountain car (Moore, 1990) domains. Let PΦ = Φ(Φ Φ) Φ . Here, any distance between two subspaces Φ and Φ is measured using the normalized subspace distance, 2 (Tang, 2019) defined by dist(Φ, Φ ) = 1 1 d Tr (PΦ PΦ) [0, 1]. 5.1. Synthetic Matrices To begin, we train the TD, supervised and residual update rules from Section 3 up to convergence knowing the exact transition matrices P π. In Figure 4 left and middle, we randomly sample 30 real diagonalisable matrices P π R50 50 to prevent any convergence issue from the TD update rule. In Figure 4 right, we generate symmetric transition matrices P π R50 50. To illustrate the theory, we run gradient descent on each learning rule by expressing the weights implicitly as a function of the features (see Equation (2) for TD for instance). Figure 4, left, middle show that these auxiliary task algorithms learn different representations and successfully recover our theoretical characterizations (Proposition 1, Theorem 1) from Table 1, right. Figure 4 right illustrates that the supervised and TD rules converge to the same representation for symmetric P π, as predicted by our theory (Corollary 2). 5.2. 
5.2. Efficacy of Random Cumulants

Following our theoretical analysis from Subsection 4.2, our aim is to illustrate the goodness of random cumulants at recovering the left singular vectors of the successor representation on the four-room domain (Sutton et al., 1999), and to investigate to which extent an analogous result holds empirically for the TD rule. We investigate the importance of three properties of a distribution: isotropy, norm, and orthogonality of the columns. We consider random cumulants from different distributions: a standard Gaussian $\mathcal{N}(0, I)$, a Gaussian distribution whose columns are normalized to be unit-norm, the $O(N)$ Haar distribution, and random indicator functions.

Figure 5: Subspace distance after $5 \times 10^5$ training steps, averaged over 30 seeds ($d = 5$), between $\Phi$ learnt with $L^{\mathrm{MC}}_{\mathrm{aux}}$ and the top left singular vectors of the SR (left), and between $\Phi$ learnt with $L^{\mathrm{TD}}_{\mathrm{aux}}$ and the top invariant subspaces of the SR (right), for different random cumulants on the four-room domain. Shaded areas represent estimates of 95% confidence intervals.

Figure 5, left, shows that the indicator distribution, which is not isotropic, performs worse overall for the supervised objective, and that when the number of tasks is large enough, orthogonality between the columns of the cumulant matrix leads to better accuracy. In comparison, Figure 5, right, studies the goodness of random cumulants at recovering the top-$d$ invariant subspaces of the SR and depicts a different picture. Here, the Gaussian distribution achieves the highest error irrespective of the number of tasks sampled, while the normalized Gaussian achieves lower error, suggesting that the norm of the columns matters for TD training. The indicator distribution performs well for many numbers of sampled tasks, indicating that the isotropy of the distribution is not as important for TD as it is for supervised training. Finally, the orthogonal cumulants achieve the lowest error when the number of tasks is large enough, showing this is an important property for both kinds of training.

5.3. Offline Pre-training

In this section we follow a similar evaluation protocol as that of Farebrother et al. (2023), but applied to the Four Rooms and Mountain Car domains to allow a clear investigation of the various cumulant generation methods and the effects of their corresponding GVFs as a representation pre-training method for reinforcement learning. Details can be found in Appendix A.

We consider four cumulant functions. The first two are stationary and are generated before offline pre-training begins. For Exact SVD, we compute the top-$k$ right singular vectors of the successor representation matrix of the uniform random policy. For Normal, we generate cumulant functions sampled from a standard Normal distribution. The other two cumulant functions are learned during offline pre-training using a separate neural network. RNI (Farebrother et al., 2023) learns a set of indicator functions which are trained to be active in a particular percentage of the states (15% in this experiment). Clustering Contrastive Representations (CCR) learns cumulants by learning a representation of the state using CPC (Oord et al., 2018), and then performing online clustering of the learned representations with $k$ clusters. The online clustering method we use differs slightly from standard approaches in that we maintain an estimate of the frequency $p_i$ with which each cluster center is assigned to a state, and the assigned cluster is identified with $\arg\min_i p_i \|\phi(x) - b_i\|$, where $\phi(x)$ is the learned CPC representation and $b_i$ is cumulant $i$'s centroid.
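The text above specifies the CCR assignment rule $\arg\min_i p_i \|\phi(x) - b_i\|$ but not the rest of the clustering loop; the sketch below is our own minimal interpretation, in which the centroid update, the frequency estimate and the one-hot cumulant output are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def assign_cluster(phi_x, centroids, freqs):
    """Frequency-weighted cluster assignment used for the CCR cumulants:
    pick argmin_i p_i * ||phi(x) - b_i||, discouraging clusters that already fire often."""
    dists = np.linalg.norm(centroids - phi_x[None, :], axis=1)
    return int(np.argmin(freqs * dists))

def ccr_cumulant(phi_x, centroids, freqs, step=0.01):
    """One online clustering step: returns a one-hot cumulant vector for the current state
    and updates the winning centroid and the running assignment frequencies in place."""
    k = centroids.shape[0]
    i = assign_cluster(phi_x, centroids, freqs)
    centroids[i] += step * (phi_x - centroids[i])   # move winning centroid toward phi(x)
    freqs *= (1 - step); freqs[i] += step           # exponential-moving-average frequency
    g = np.zeros(k); g[i] = 1.0
    return g

# Toy usage with random vectors standing in for the learned CPC representation phi(x).
rng = np.random.default_rng(7)
k, rep_dim = 8, 16
centroids = rng.normal(size=(k, rep_dim))
freqs = np.full(k, 1.0 / k)
for _ in range(1000):
    g = ccr_cumulant(rng.normal(size=rep_dim), centroids, freqs)
```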
Examples of the cumulants produced by these four methods, and their corresponding value functions, are given in Appendix A.

Figure 6: Comparing effects of offline pre-training on the Four Rooms (left) and sparse Mountain Car (right) domains for different cumulant generation methods. Results are averages over 30 seeds.

Figure 6 compares the online performance after pre-training, for various cumulant functions, with the online performance of DQN without pre-training. Two take-aways are readily apparent. First, offline pre-training speeds up online learning, as expected. Second, the two best performing methods are both sensitive to the structure of the environment dynamics, directly in the case of Exact SVD and indirectly through the CPC representation for CCR.

6. Related Work

Optimal representations. Bellemare et al. (2019) define a notion of optimal representations for batch Monte Carlo optimization based on the worst approximation error of the value function across the set of all possible policies, later relaxed by Dabney et al. (2021). Instead, we do not consider the control setting but focus on policy evaluation. Ghosh and Bellemare (2020) and Le Lan et al. (2022) characterize the stability, approximation and generalization errors of the SR (Dayan, 1993) and Schur representations, which are $P^\pi$-invariant, a key property to ensure stability. In contrast, we formalize that predicting value functions by TD learning from $G = I$ leads to $P^\pi$-invariant subspaces.

Auxiliary tasks. Lyle et al. (2021) analyse the representations learnt by several auxiliary tasks such as random cumulants (Osband et al., 2018; Dabney et al., 2021), assuming real diagonalizability of the transition matrix $P^\pi$ and constant weights $W$. They found that in the limit of an infinity of Gaussian random cumulants, the subspace spanned by TD representations converges in distribution towards the left singular vectors of the successor representation. Instead, our theoretical analysis holds for any transition matrix, and both the weights $W$ and the features $\Phi$ are updated at each time step. Recently, Farebrother et al. (2023) rely on a random binary cumulant matrix whose sparsity is controlled by means of a quantile regression loss. Finally, other auxiliary tasks regroup self-supervised learning methods (Schwarzer et al., 2021; Guo et al., 2020). Tang et al. (2023) demonstrate that these algorithms perform an eigendecomposition of a real diagonalisable transition matrix $P^\pi$, under some assumptions, suggesting a close connection to TD auxiliary tasks. Closely related, Touati and Ollivier (2021) and Blier et al. (2021) propose an unsupervised pretraining algorithm to learn representations based on an eigendecomposition of the transition matrix $P^\pi$. They demonstrate the usefulness of their approach on discrete and continuous mazes, pixel-based Ms. Pacman and the Fetch Reach virtual robot arm.
7. Conclusion

In this paper, we have studied representations learnt by bootstrapping methods and proved their benefit for value-based deep RL agents. Based on an analysis of the TD continuous-time dynamical system, we generalized existing work (Lyle et al., 2021) and provided evidence that TD representations are actually different from Monte Carlo representations. Our investigation demonstrated that an identity cumulant matrix provides as much information as the TD and supervised auxiliary algorithms can carry; this work also shows that it is possible to design more compact pseudo-reward functions, though this requires prior knowledge about the transition dynamics. This led us to propose new families of cumulants which also proved useful empirically.

We assumed in this paper that the TD updates are carried out in a tabular way, that is, that there is no generalization between states when we update the features. An exciting opportunity for future work is to extend the theoretical results to the case where the representation is parametrized by a neural network. Other avenues for future work include scaling up the representation learning methods introduced here.

Acknowledgements

The authors would like to thank Yunhao Tang, Doina Precup and the anonymous reviewers for detailed feedback on this manuscript. We also thank Jesse Farebrother and Joshua Greaves for help with the Proto-Value Networks codebase (Farebrother et al., 2023). Thank you to Matthieu Geist, Bruno Scherrer, Chris Dann, Diana Borsa, Remi Munos, David Abel, Daniel Guo, Bernardo Avila Pires and Yash Chandak for useful discussions too. We would also like to thank the Python community (Van Rossum and Drake Jr, 1995; Oliphant, 2007) for developing tools that enabled this work, including NumPy (Oliphant, 2006; Walt et al., 2011; Harris et al., 2020), SciPy (Jones et al., 2001), Matplotlib (Hunter, 2007) and JAX (Bradbury et al., 2018).

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Xiaoqiang, Z. (2016). TensorFlow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation.

Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30-37. Elsevier.

Barnard, E. (1993). Temporal-difference methods and Markov models. IEEE Transactions on Systems, Man, and Cybernetics, 23(2):357-365.

Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. (2017). Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems.

Bellemare, M., Dabney, W., Dadashi, R., Ali Taiga, A., Castro, P. S., Le Roux, N., Schuurmans, D., Lattimore, T., and Lyle, C. (2019). A geometric perspective on optimal representations for reinforcement learning. In Advances in Neural Information Processing Systems.

Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning.

Bellman, R. (1957). Dynamic Programming. Princeton University Press, Princeton, New Jersey.

Blier, L., Tallec, C., and Ollivier, Y. (2021). Learning successor states and goal-dependent values: A mathematical viewpoint.
ar Xiv preprint ar Xiv:2101.07123. Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machine learning. Siam Review, 60(2):223 311. Boyan, J. A. (2002). Technical update: Least-squares temporal difference learning. Machine learning, 49(2):233 246. Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., Vander Plas, J., Wanderman-Milne, S., and Zhang, Q. (2018). JAX: composable transformations of Python+Num Py programs. Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine learning, 22(1):33 57. Dabney, W., Barreto, A., Rowland, M., Dadashi, R., Quan, J., Bellemare, M. G., and Silver, D. (2021). The valueimprovement path: Towards better representations for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence. Dann, C., Neumann, G., Peters, J., et al. (2014). Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15:809 883. Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613 624. Farebrother, J., Greaves, J., Agarwal, R., Lan, C. L., Goroshin, R., Castro, P. S., and Bellemare, M. G. (2023). Proto-value networks: Scaling representation learning with auxiliary tasks. In Proceedings of the International Conference on Learning Representations. Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. G., and Larochelle, H. (2019). Hyperbolic discounting and learning over multiple horizons. ar Xiv preprint ar Xiv:1902.06865. Ge, R., Jin, C., and Zheng, Y. (2017). No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the International Conference on Machine Learning. Gelada, C., Kumar, S., Buckman, J., Nachum, O., and Bellemare, M. G. (2019). Deep MDP: Learning continuous latent space models for representation learning. In Proceedings of the International Conference on Machine Learning. Ghosh, D. and Bellemare, M. G. (2020). Representations for stable off-policy reinforcement learning. In Proceedings of the International Conference on Machine Learning. Gohberg, I., Lancaster, P., and Rodman, L. (2006). Invariant subspaces of matrices with applications. SIAM. Guo, Z. D., Pires, B. A., Piot, B., Grill, J.-B., Altché, F., Munos, R., and Azar, M. G. (2020). Bootstrap latentpredictive representations for multitask reinforcement learning. In Proceedings of the International Conference on Machine Learning. Halko, N., Martinsson, P.-G., and Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217 288. Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., et al. (2020). Array programming with numpy. Nature, 585(7825):357 362. Bootstrapped Representations in Reinforcement Learning Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. (2018). Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417. Hunter, J. D. (2007). Matplotlib: A 2d graphics environment. 
Computing in science & engineering, 9(3):90 95. Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. (2017). Reinforcement learning with unsupervised auxiliary tasks. In Proceedings of the International Conference on Learning Representations. Janner, M., Mordatch, I., and Levine, S. (2020). Gammamodels: Generative temporal difference learning for infinite-horizon prediction. In Advances in Neural Information Processing Systems. Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. (2017). How to escape saddle points efficiently. In Proceedings of the International Conference on Machine Learning. Jones, E., Oliphant, T., and Peterson, P. (2001). Sci Py: open source scientific tools for Python. Kapturowski, S., Ostrovski, G., Quan, J., Munos, R., and Dabney, W. (2019). Recurrent experience replay in distributed reinforcement learning. In Proceedings of the International conference on learning representations. Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107 1149. Le Lan, C., Greaves, J., Farebrother, J., Rowland, M., Pedregosa, F., Agarwal, R., and Bellemare, M. G. (2023). A novel stochastic gradient descent algorithm for learning principal subspaces. In Proceedings of the International Conference on Artificial Intelligence and Statistics. Le Lan, C., Tu, S., Oberman, A., Agarwal, R., and Bellemare, M. G. (2022). On the generalization of representations in reinforcement learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics. Levine, N., Zahavy, T., Mankowitz, D., Tamar, A., and Mannor, S. (2017). Shallow updates for deep reinforcement learning. In Advances in Neural Information Processing Systems. Lyle, C., Rowland, M., Ostrovski, G., and Dabney, W. (2021). On the effect of auxiliary tasks on representation dynamics. In Proceedings of the International Conference on Artificial Intelligence and Statistics. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529 533. Moore, A. W. (1990). Efficient memory-based learning for robot control. Technical report, University of Cambridge. Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. (2016). Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems. Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the International Conference on Machine Learning. Oliphant, T. E. (2006). A guide to Num Py, volume 1. Trelgol Publishing USA. Oliphant, T. E. (2007). Python for scientific computing. Computing in Science & Engineering, 9(3):10 20. Ollivier, Y. (2018). Approximate temporal difference learning is a gradient descent for reversible policies. ar Xiv preprint ar Xiv:1805.00869. Oord, A. v. d., Li, Y., and Vinyals, O. (2018). Representation learning with contrastive predictive coding. ar Xiv preprint ar Xiv:1807.03748. Osband, I., Aslanides, J., and Cassirer, A. (2018). Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems. 
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., De Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems. Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons. Scherrer, B. (2010). Should one compute the temporal difference fix point or minimize the bellman residual? the unified oblique projection view. In Proceedings of the International Conference on Machine Learning. Bootstrapped Representations in Reinforcement Learning Schwarzer, M., Anand, A., Goel, R., Hjelm, R. D., Courville, A., and Bachman, P. (2021). Data-efficient reinforcement learning with self-predictive representations. In Proceedings of the International Conference on Learning Representations. Sutton, R., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181 211. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine learning, 3(1):9 44. Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An Introduction. MIT Press. Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An Introduction. MIT Press, 2nd edition. Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporaldifference learning. Journal of Machine Learning Research, 17(1):2603 2631. Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems. Tang, C. (2019). Exponentially convergent stochastic kpca without variance reduction. In Advances in Neural Information Processing Systems. Tang, Y., Guo, Z. D., Richemond, P. H., Pires, B. Á., Chandak, Y., Munos, R., Rowland, M., Azar, M. G., Lan, C. L., Lyle, C., et al. (2023). Understanding selfpredictive learning for reinforcement learning. ar Xiv preprint ar Xiv:2212.03319. Thakoor, S., Rowland, M., Borsa, D., Dabney, W., Munos, R., and Barreto, A. (2022). Generalised policy improvement with geometric policy composition. In Proceedings of the International Conference on Machine Learning. Touati, A. and Ollivier, Y. (2021). Learning one representation to optimize all rewards. In Advances in Neural Information Processing Systems. Tsitsiklis, J. and Van Roy, B. (1996). Analysis of temporaldiffference learning with function approximation. In Advances in Neural Information Processing Systems. Van Rossum, G. and Drake Jr, F. L. (1995). Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam. Walt, S. v. d., Colbert, S. C., and Varoquaux, G. (2011). The numpy array: a structure for efficient numerical computation. Computing in science & engineering, 13(2):22 30. Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine learning, 8:279 292. Yu, H. and Bertsekas, D. P. (2009). Basis function adaptation methods for cost approximation in mdp. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning. Zhang, S., Yao, H., and Whiteson, S. (2021). Breaking the deadly triad with a target network. 
In Proceedings of the International Conference on Machine Learning. Bootstrapped Representations in Reinforcement Learning A. Additional Empirical Results A.1. Additional details for Subsection 5.1 In this experiment, we selected a step size α = 0.08 for all the algorithms. We also choose a uniform data distribution Ξ = I/|S| and a cumulant matrix G = I for simplicity. A.2. Additional details for Subsection 5.2 In this experiment, we use a step size α = 5e 3 and train the different learning rules for 500k steps with 3 seeds. We consider the transition matrices induced by an epsilon greedy policy on the four room domain (Sutton et al., 1999) with ϵ = 0.8 and train the supervised and TD update rules as described in Subsection 5.1. We provide additionnal empirical results in Figure 7. Gaussian Orthogonal Group Indicator Normalized Gaussian 0 20 40 60 80 Number of tasks Monte Carlo Approximation Error 0 20 40 60 80 Number of tasks Monte Carlo Approximation Error 0 20 40 60 80 Number of tasks TD approximation Error 0 20 40 60 80 Number of tasks TD approximation Error Figure 7: Monte Carlo and TD approximation errors after 5.105 training steps on the learning rules LMC aux (on the left column) and LTD aux (on the right column) in the four-room domain for different distributions of cumulant, averaged over 30 seeds, for d = 5. Shaded areas represent estimates of 95% confidence intervals. A.3. Additional details for Subsection 5.3 Four Rooms is a tabular gridworld environment where the agent begins in a room in the top left corner and must navigate to the goal state in the lower right corner. The actions are up, down, left and right and have deterministic effects. The reward function is one upon transitioning into the goal state and zero otherwise. Mountain Car is a two-dimensional continuous state environment where the agent must move an under-powered car from the bottom of a valley to a goal state at the top of the nearby hill. The agent observes the continuous-valued position and velocity of the car, and controls it with three discrete actions which apply positive, negative, and zero thrust to the car. In this sparse reward version of the domain the reward is one for reaching the goal and zero otherwise. In this domain, we compute the Exact SVD by first discretizing the state space into approximately 2000 states, and compute an approximate P π by simulating transitions from uniformly random continuous states belonging to each discretized state. In this evaluation we first pre-train a network representation offline with a large fixed dataset produced from following the uniform random policy. During offline pre-training the agent does not observe the reward, and instead learns action-value functions, GVFs, for each of several cumulant functions. After pre-training, the GVF head is removed and replaced with a single action-value function head. This network is then trained online with DQN on the true environmental reward. Note that we allow gradients to propagate into the network representation during online training. Bootstrapped Representations in Reinforcement Learning In Four Rooms all methods use k = 40 cumulants and in Mountain Car all methods use k = 80 cumulants. The inputs to the network were a one-hot encoding of the observation in the four-room domain and the usual position and velocity feature vector in Mountain car. The offline pre-training dataset contains 100000 and 200000 transitions for four-room and mountain car respectively. 
In both cases the dataset is generated and used to fill a (fixed) replay buffer, and then the agent is trained for 400000 updates (each update using a minibatch of 32 transitions sampled uniformly from the buffer/dataset). The learning rate for both offline and online training was the same as the standard DQN learning rate (0.00025), and similarly for the optimizer epsilon. The network architecture is a simple fully connected MLP with Re LU activations (Nair and Hinton, 2010) and two hidden layers of size 512 (first) and 256 (second), followed by a linear layer to give action-values. We provide visualizations of the cumulants produced by each method and their corresponding value functions in Figure 8, Figure 9, Figure 11, Figure 12. Figure 10 and Figure 13 also show the norm and srank of the representations being learned during offline pretraining. Figure 8: Examples of the learned cumulants produced during offline pre-training in Four Rooms under the uniform random policy. Figure 9: Examples of the learned (general) value functions produced during offline pre-training in Four Rooms under the uniform random policy. In each case, the GVFs shown correspond to the value functions learned for the cumulants in Figure 8. Bootstrapped Representations in Reinforcement Learning 0 50 100 150 200 250 300 350 400 Offline Updates (x 1000) CCR Exact SVD Normal RNI 0 50 100 150 200 250 300 350 400 Offline Updates (x 1000) CCR Exact SVD Normal RNI Figure 10: For the GVF-based representations during offline training in Four Rooms, their (left) L2 norm and (right) srank. Figure 11: Examples of the learned cumulants produced during offline pre-training in sparse Mountain Car under the uniform random policy. Figure 12: Examples of the learned (general) value functions produced during offline pre-training in sparse Mountain Car under the uniform random policy. In each case, the GVFs shown correspond to the value functions learned for the cumulants in Figure 11. Bootstrapped Representations in Reinforcement Learning 0 50 100 150 200 250 300 350 400 Offline Updates (x 1000) CCR Exact SVD Normal RNI 0 50 100 150 200 250 300 350 400 Offline Updates (x 1000) CCR Exact SVD Normal RNI Figure 13: For the GVF-based representations during offline training in sparse Mountain Car, their (left) L2 norm and (right) srank. 1.0 0.5 0.0 0.5 Position 1.0 0.5 0.0 0.5 Position CCR Exact SVD CCR Exact SVD Figure 14: As CCR is effectively a clustering of states, we might ask in which states each cluster is active. We show an example of the maximally active cluster index for CCR (left) and Exact SVD (right) in the Four Rooms (top) and sparse Mountain Car (bottom) domains. These examples are for 20 cumulants setting. Bootstrapped Representations in Reinforcement Learning B. Proofs for Section 2 Proposition 1 (Monte Carlo representations). If rank(Ψπ) d 1, all representations spanning the top-d left singular vectors of Ψπ with respect to the inner product x, y Ξ are global minimizers of LMC aux and can be recovered by stochastic gradient descent. Proof. Let Fd denote the top d left singular vectors of Ψ. arg min Φ RS d min W Rd T Ξ1/2(ΦW Ψ) 2 F = arg min Φ RS d P Ξ1/2ΦΞ1/2Ψ 2 F = {Φ RS d | M GLd(R), Φ = Fd M} The problem on the left side is a bilinear optimization problem in the variables (Φ, W). 
C. Proofs for Section 3

Throughout the appendix, we use the notation $L := I - \gamma P^\pi$. The beginning of this section is dedicated to proving the main result of Section 3, Theorem 1. Before that, we introduce the following necessary lemma.

Lemma 7. Let $\Phi\in\mathbb{R}^{S\times d}$ and $\Psi\in\mathbb{R}^{S\times T}$. Let $P_\Phi$ be a (possibly oblique) projection onto $\operatorname{span}(\Phi)$. We have $P_\Phi\Psi = \Psi \iff \operatorname{span}(\Psi) \subseteq \operatorname{span}(\Phi)$.

Proof. $P_\Phi$ can be written as $P_\Phi = \Phi(X^\top\Phi)^{-1}X^\top$, where $\Phi, X\in\mathbb{R}^{S\times d}$ and $X^\top\Phi\in\mathbb{R}^{d\times d}$ is invertible. Write $P_\Phi = \Phi Q$ with $Q = (X^\top\Phi)^{-1}X^\top$.

($\Rightarrow$) Suppose $\Psi\in\mathbb{R}^{S\times T}$ is such that $P_\Phi\Psi = \Psi$. Then $\Psi = \Phi(Q\Psi)$. For any $\omega\in\mathbb{R}^T$, $\Psi\omega = \Phi(Q\Psi)\omega$, so $\Psi\omega\in\operatorname{span}(\Phi)$. Hence $\operatorname{span}(\Psi)\subseteq\operatorname{span}(\Phi)$.

($\Leftarrow$) Suppose $\operatorname{span}(\Psi)\subseteq\operatorname{span}(\Phi)$. Denote $(e_t)$ the standard basis. We have $P_\Phi\Psi = \sum_t P_\Phi(\Psi e_t)e_t^\top$. Note that $\Psi e_t\in\operatorname{span}(\Psi)\subseteq\operatorname{span}(\Phi)$, hence there exists $y_t\in\mathbb{R}^d$ such that $\Psi e_t = \Phi y_t$. Now,
$$P_\Phi\Psi = \sum_t P_\Phi(\Phi y_t)e_t^\top = \sum_t \Phi(X^\top\Phi)^{-1}X^\top\Phi y_te_t^\top = \sum_t \Phi y_te_t^\top = \sum_t \Psi e_te_t^\top = \Psi.$$

Lemma 1 (Critical representations for TD). All full rank representations which are critical points of $L^{TD}_{aux}$ span real invariant subspaces of $(I-\gamma P^\pi)^{-1}GG^\top\Xi$, that is, $\operatorname{span}((I-\gamma P^\pi)^{-1}GG^\top\Xi\Phi)\subseteq\operatorname{span}(\Phi)$.

Proof. We start from the gradients of the objective. For a fixed $\Phi$,
$$\nabla_W \|\Xi^{1/2}(\Phi W - G - \gamma P^\pi\,SG[\Phi W])\|_F^2 = 2\Phi^\top\Xi(\Phi W - G - \gamma P^\pi\Phi W),$$
and for a fixed $W$,
$$\nabla_\Phi \|\Xi^{1/2}(\Phi W - G - \gamma P^\pi\,SG[\Phi W])\|_F^2 = 2\Xi(\Phi W - G - \gamma P^\pi\Phi W)W^\top.$$
By Assumption 1, $\Phi^\top\Xi L\Phi$ is invertible for all full rank representations $\Phi$. Hence, for a fixed full rank $\Phi$,
$$\nabla_W \|\Xi^{1/2}(\Phi W - G - \gamma P^\pi\,SG[\Phi W])\|_F^2 = 0 \iff W^*_\Phi = (\Phi^\top\Xi L\Phi)^{-1}\Phi^\top\Xi G.$$
The second fixed-point equation gives $0 = \Xi(L\Phi W - G)W^\top$, i.e. $L\Phi WW^\top = GW^\top$. Plugging in the expression for $W^*_\Phi$,
$$L\Phi(\Phi^\top\Xi L\Phi)^{-1}\Phi^\top\Xi G\big((\Phi^\top\Xi L\Phi)^{-1}\Phi^\top\Xi G\big)^\top = G\big((\Phi^\top\Xi L\Phi)^{-1}\Phi^\top\Xi G\big)^\top$$
$$\iff L\Phi(\Phi^\top\Xi L\Phi)^{-1}\Phi^\top\Xi GG^\top\Xi\Phi(\Phi^\top\Xi L\Phi)^{-\top} = GG^\top\Xi\Phi(\Phi^\top\Xi L\Phi)^{-\top}$$
$$\iff \Phi(\Phi^\top\Xi L\Phi)^{-1}\Phi^\top\Xi GG^\top\Xi\Phi = L^{-1}GG^\top\Xi\Phi \quad\text{(right-multiplying by $(\Phi^\top\Xi L\Phi)^\top$ and left-multiplying by $L^{-1}$)}$$
$$\iff \Pi_{L^\top\Xi\Phi}\,L^{-1}GG^\top\Xi\Phi = L^{-1}GG^\top\Xi\Phi,$$
where $\Pi_X = \Phi(X^\top\Phi)^{-1}X^\top$ is the oblique projection onto $\operatorname{span}(\Phi)$ orthogonally to $\operatorname{span}(X)$. This is equivalent to $\Pi^\perp_{L^\top\Xi\Phi}L^{-1}GG^\top\Xi\Phi = 0$, which by Lemma 7 is equivalent to $\operatorname{span}(\Phi)$ being an invariant subspace of $L^{-1}GG^\top\Xi$. In other words, all non-degenerate full-rank $\Phi$ which are critical points span invariant subspaces of $L^{-1}GG^\top\Xi$.

Corollary 1. If $G = I$ and $\Xi = I/S$, all full rank representations which are critical points of $L^{TD}_{aux}$ span real invariant subspaces of $P^\pi$.

Proof. Let $G = I$ and $\Xi = I/S$. By Lemma 1, all full rank representations which are critical points of $L^{TD}_{aux}$ span real invariant subspaces of $(I-\gamma P^\pi)^{-1}$. Let $\Phi$ be a representation spanning an invariant subspace of $(I-\gamma P^\pi)^{-1}$. By definition, $\operatorname{span}((I-\gamma P^\pi)^{-1}\Phi)\subseteq\operatorname{span}(\Phi)$. Because $(I-\gamma P^\pi)$ is invertible, $\dim((I-\gamma P^\pi)^{-1}\Phi) = \dim(\Phi)$, so in fact $\operatorname{span}((I-\gamma P^\pi)^{-1}\Phi) = \operatorname{span}(\Phi)$. Hence, for any $w_1\in\mathbb{R}^d$ there exists $w_2\in\mathbb{R}^d$ such that $\Phi w_1 = (I-\gamma P^\pi)^{-1}\Phi w_2$, that is $(I-\gamma P^\pi)\Phi w_1 = \Phi w_2$. It follows that $P^\pi\Phi w_1 = \Phi(w_1 - w_2)/\gamma \in \operatorname{span}(\Phi)$, so $\operatorname{span}(P^\pi\Phi)\subseteq\operatorname{span}(\Phi)$. We conclude that $\Phi$ spans an invariant subspace of $P^\pi$.
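A quick numerical companion to Lemma 1 and Corollary 1 (a sketch under the $G = I$, uniform-$\Xi$ setting, on a toy symmetric $P^\pi$): the semi-gradient of the TD auxiliary loss vanishes when the columns of $\Phi$ are eigenvectors of $P^\pi$, but not for a generic $\Phi$.

```python
# Sketch: representations spanning real invariant subspaces of P^pi are critical points of L^TD_aux.
import numpy as np

S, d, gamma = 12, 3, 0.9
P = 0.5 * np.eye(S) + 0.25 * np.roll(np.eye(S), 1, axis=1) + 0.25 * np.roll(np.eye(S), -1, axis=1)
L = np.eye(S) - gamma * P                 # L = I - gamma P^pi
Xi = np.eye(S) / S                        # uniform state distribution
G = np.eye(S)                             # cumulant matrix

def td_semi_gradient(Phi):
    W = np.linalg.solve(Phi.T @ Xi @ L @ Phi, Phi.T @ Xi @ G)   # W^TD_Phi
    return 2 * Xi @ (L @ Phi @ W - G) @ W.T                      # gradient in Phi, target frozen

Phi_invariant = np.linalg.eigh(P)[1][:, -d:]   # P is symmetric here, so eigenvectors are real
Phi_random = np.random.default_rng(0).normal(size=(S, d))

print(np.linalg.norm(td_semi_gradient(Phi_invariant)))   # numerically zero: a critical point
print(np.linalg.norm(td_semi_gradient(Phi_random)))      # clearly non-zero
```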
Theorem 1 (TD representations). Assume $P^\pi$ is real diagonalisable, $G = I$ and the distribution $\xi$ over states is uniform. Let $(v_1,\dots,v_S)$ denote the eigenvectors of $P^\pi$ corresponding to the eigenvalues $(\lambda_1,\dots,\lambda_S)$. Under the dynamics in Equation (2), all real invariant subspaces of dimension $d$ are critical points, and any real non-top-$d$ invariant subspace $\Phi = (v_{i_1},\dots,v_{i_d})$ is unstable for gradient descent.

Proof. Consider the objective
$$L^{TD}(\Phi) = \|\Xi^{1/2}(\Phi W^{TD}_{\Phi,G} - G - \gamma P^\pi\,SG[\Phi W^{TD}_{\Phi,G}])\|_F^2, \qquad W^{TD}_{\Phi,G} = (\Phi^\top\Xi L\Phi)^{-1}\Phi^\top\Xi G,$$
with $L := I - \gamma P^\pi$ as before. Observe that, for a fixed $W$,
$$\nabla_\Phi\|\Xi^{1/2}(\Phi W - G - \gamma P^\pi\,SG[\Phi W])\|_F^2 = 2\Xi(L\Phi W - G)W^\top.$$
We therefore consider the continuous-time dynamics
$$\frac{d}{dt}\Phi = -\nabla_\Phi L(\Phi) := F(\Phi), \qquad (5)$$
$$F(\Phi) := -\Xi(L\Phi W^{TD}_{\Phi,G} - G)(W^{TD}_{\Phi,G})^\top = -\Xi L(\Pi_{L^\top\Xi\Phi} - I)L^{-1}GG^\top\Xi\Phi(\Phi^\top\Xi L\Phi)^{-\top}.$$
Consider the case $G = I$ and $\Xi = I/S$; constant positive factors are dropped below. The proof strategy consists in constructing a perturbation $\Delta\in\mathbb{R}^{S\times d}$, as a function of $\Phi$, $L$ and $G$, such that $\nabla_\Phi F(\Phi)[\Delta] = \lambda\Delta$ for some $\lambda$ with $\operatorname{Re}(\lambda) > 0$. For every non-top-$d$ invariant subspace, we thus show that the Jacobian of the dynamics $F$ has an eigenvalue with positive real part.

Differentiating,
$$\nabla_\Phi F(\Phi)[\Delta] = -(L\Delta\,W_\Phi + L\Phi\,dW_\Phi[\Delta])W_\Phi^\top - (L\Phi W_\Phi - I)(dW_\Phi[\Delta])^\top,$$
$$dW_\Phi[\Delta] = -(\Phi^\top L\Phi)^{-1}(\Delta^\top L\Phi + \Phi^\top L\Delta)(\Phi^\top L\Phi)^{-1}\Phi^\top + (\Phi^\top L\Phi)^{-1}\Delta^\top,$$
where we use the identity $W_\Phi = (\Phi^\top L\Phi)^{-1}\Phi^\top$. Now, if $\Delta^\top\Phi = 0$, then by the $L$-invariance of $\Phi$ we also have $\Delta^\top L\Phi = 0$, and hence $W_\Phi\Delta = 0$. This gives the simplification
$$dW_\Phi[\Delta] = -(\Phi^\top L\Phi)^{-1}\Phi^\top L\Delta(\Phi^\top L\Phi)^{-1}\Phi^\top + (\Phi^\top L\Phi)^{-1}\Delta^\top = -WL\Delta W + (\Phi^\top L\Phi)^{-1}\Delta^\top,$$
using the shorthand $W = W_\Phi$. With the optimality condition $(L\Phi W - I)W^\top = 0$ and $W\Delta = 0$,
$$(L\Delta W + L\Phi\,dW_\Phi[\Delta])W^\top = L\Delta WW^\top - L\Phi WL\Delta WW^\top, \qquad (L\Phi W - I)(dW_\Phi[\Delta])^\top = (L\Phi W - I)\Delta(\Phi^\top L\Phi)^{-\top} = -\Delta(\Phi^\top L\Phi)^{-\top},$$
so that
$$\nabla_\Phi F(\Phi)[\Delta] = -L\Delta WW^\top + L\Phi WL\Delta WW^\top + \Delta(\Phi^\top L\Phi)^{-\top}.$$
Furthermore, let us write $L\Phi = \Phi N$ for some invertible $N\in\mathbb{R}^{d\times d}$; such an $N$ exists since $\Phi$ is $L$-invariant and has rank $d$. With this notation we see immediately that $L\Phi W = P_\Phi$, since $\Phi^\top L\Phi = \Phi^\top\Phi N$ implies $W = (\Phi^\top L\Phi)^{-1}\Phi^\top = N^{-1}(\Phi^\top\Phi)^{-1}\Phi^\top$ and therefore $L\Phi W = \Phi NW = \Phi(\Phi^\top\Phi)^{-1}\Phi^\top = P_\Phi$. Hence
$$\nabla_\Phi F(\Phi)[\Delta] = -P^\perp_\Phi L\Delta\,WW^\top + \Delta(\Phi^\top L\Phi)^{-\top}.$$
This is the starting point of our analysis. Choosing $\Delta = P^\perp_\Phi v_iu^\top$ and using the identity $P^\perp_\Phi L\Delta = (1-\gamma\lambda_i)\Delta$ (established below) yields
$$\nabla_\Phi F(\Phi)[\Delta] = \Delta\big[N - (1-\gamma\lambda_i)I\big]N^{-1}(\Phi^\top\Phi)^{-1}N^{-\top}.$$
Let the columns of $\Phi$ be $(v_{i_1},\dots,v_{i_d})$ with corresponding eigenvalues $(\lambda_{i_1},\dots,\lambda_{i_d})$. Then we have the identities
$$L\Phi = \Phi(I-\gamma\Lambda),\quad \Lambda = \operatorname{diag}(\lambda_{i_1},\dots,\lambda_{i_d}),\quad W = (I-\gamma\Lambda)^{-1}(\Phi^\top\Phi)^{-1}\Phi^\top,\quad WW^\top = (I-\gamma\Lambda)^{-1}(\Phi^\top\Phi)^{-1}(I-\gamma\Lambda)^{-1},$$
so that
$$\nabla_\Phi F(\Phi)[\Delta] = -P^\perp_\Phi L\Delta\,WW^\top + \Delta(\Phi^\top\Phi)^{-1}(I-\gamma\Lambda)^{-1}.$$
Next, we choose $\Delta = P^\perp_\Phi v_iu^\top$, with the index $i$ and vector $u$ to be specified. Note that if we want $\Delta \ne 0$ we are constrained to choosing $i \notin \{i_1,\dots,i_d\}$. With this choice,
$$P^\perp_\Phi L\Delta = P^\perp_\Phi\big[(1-\gamma\lambda_i)v_iu^\top - LP_\Phi v_iu^\top\big] = (1-\gamma\lambda_i)\Delta,$$
where the last equality holds since, by the $L$-invariance of $\Phi$, we have $P^\perp_\Phi LP_\Phi = 0$. Hence
$$\nabla_\Phi F(\Phi)[\Delta] = -(1-\gamma\lambda_i)\Delta(I-\gamma\Lambda)^{-1}(\Phi^\top\Phi)^{-1}(I-\gamma\Lambda)^{-1} + \Delta(\Phi^\top\Phi)^{-1}(I-\gamma\Lambda)^{-1} = \gamma\Delta(\lambda_iI-\Lambda)(I-\gamma\Lambda)^{-1}(\Phi^\top\Phi)^{-1}(I-\gamma\Lambda)^{-1}.$$
So we just need to choose $u$ to be a left eigenvector of the matrix $(\lambda_iI-\Lambda)(I-\gamma\Lambda)^{-1}(\Phi^\top\Phi)^{-1}(I-\gamma\Lambda)^{-1}$ associated with its largest eigenvalue; it then remains to show that this matrix has at least one positive eigenvalue. Using the fact that the eigenvalues of a product $AB$ of size-conforming matrices equal the eigenvalues of $BA$,
$$\operatorname{eig}\big((\lambda_iI-\Lambda)(I-\gamma\Lambda)^{-1}(\Phi^\top\Phi)^{-1}(I-\gamma\Lambda)^{-1}\big) = \operatorname{eig}\big((I-\gamma\Lambda)^{-1}(\lambda_iI-\Lambda)(I-\gamma\Lambda)^{-1}(\Phi^\top\Phi)^{-1}\big) = \operatorname{eig}\big((\Phi^\top\Phi)^{-1/2}(I-\gamma\Lambda)^{-1}(\lambda_iI-\Lambda)(I-\gamma\Lambda)^{-1}(\Phi^\top\Phi)^{-1/2}\big).$$
Now, we need an auxiliary lemma.

Proposition 4. Let $A, B$ be symmetric matrices, with $B$ invertible. We have
$$\lambda_{\max}(BAB) \ \ge\ \frac{\lambda_{\max}(A)}{\lambda_{\max}(B^{-2})}.$$

Proof. Let $q$ be the unit eigenvector associated with the largest eigenvalue of $A$.
Since $BAB$ is symmetric, we can use the variational characterization of the largest eigenvalue:
$$\lambda_{\max}(BAB) = \sup_{\|v\|=1}v^\top BABv \ \ge\ \frac{q^\top Aq}{\|B^{-1}q\|^2}\quad\text{(choosing } v = B^{-1}q/\|B^{-1}q\|\text{)}\ \ge\ \frac{\lambda_{\max}(A)}{\lambda_{\max}(B^{-2})},$$
since $\|B^{-1}q\| \le \|B^{-1}\|_{op}$.

With this proposition, we can conclude the proof of Theorem 1. In particular, since we assume that $\Phi$ contains at least one eigenvector associated with $\lambda_{d+1},\dots,\lambda_S$, then as long as we choose the index $i$ of $v_i$ in the set $[d]\cap\{i_1,\dots,i_d\}^c$ (where the complement is taken with respect to $[S]$), we satisfy both (a) $\Delta \ne 0$ and (b) there is at least one positive diagonal entry in $\lambda_iI - \Lambda$ (this is where the strict gap assumption $\lambda_d > \lambda_{d+1}$ enters). Hence $\lambda_{\max}\big((I-\gamma\Lambda)^{-1}(\lambda_iI-\Lambda)(I-\gamma\Lambda)^{-1}\big) > 0$, and a further application of Proposition 4 with $B = (\Phi^\top\Phi)^{-1/2}$ shows that the matrix in the last display of the eigenvalue computation also has a positive eigenvalue, which concludes the proof.

Proposition 2 (Residual representations). Let $d\in\{1,\dots,S\}$ and $F_d$ be the top-$d$ left singular vectors of $G$ with respect to the inner product $\langle x,y\rangle_\Xi = y^\top\Xi x$, for all $x,y\in\mathbb{R}^S$. All representations spanning $(I-\gamma P^\pi)^{-1}F_d$ are global minimizers of $L^{res}_{aux}$ and can be recovered by stochastic gradient descent.

Proof. We can write the loss function to be minimized as
$$J(\Phi) = \min_{W\in\mathbb{R}^{d\times T}}\|\Xi^{1/2}(\Phi W - (G + \gamma P^\pi\Phi W))\|_F^2 = \min_{W\in\mathbb{R}^{d\times T}}\|\Xi^{1/2}((I-\gamma P^\pi)\Phi W - G)\|_F^2,$$
so that
$$\arg\min_{\Phi\in\mathbb{R}^{S\times d}}\ \min_{W\in\mathbb{R}^{d\times T}}\|\Xi^{1/2}((I-\gamma P^\pi)\Phi W - G)\|_F^2 = \arg\min_{\Phi\in\mathbb{R}^{S\times d}}\|P^\perp_{\Xi^{1/2}(I-\gamma P^\pi)\Phi}\,\Xi^{1/2}G\|_F^2 = \{\Phi\in\mathbb{R}^{S\times d}\mid \Phi = (I-\gamma P^\pi)^{-1}F_dM,\ M\in GL_d(\mathbb{R})\}.$$
This set of representations can be recovered by stochastic gradient descent efficiently, i.e., with a number of SGD iterations scaling at most polynomially in all problem-specific parameters (Ge et al., 2017; Jin et al., 2017).

Corollary 2 (Symmetric transition matrices). If a cumulant matrix $G\in\mathbb{R}^{S\times T}$ (with $T \le S$) has unit-norm, orthogonal columns (e.g. $G = I$), the representations learnt from the supervised objective $L^{MC}_{aux}$ and the TD update rule $L^{TD}_{aux}$ are the same for symmetric transition matrices $P^\pi$ under a uniform state distribution $\xi$.

Proof. Assume that $P^\pi$ is symmetric, so that $L$ and $L^{-1}$ are also symmetric. By Proposition 1, running SGD on the supervised objective $L^{MC}_{aux}$ using $\Psi = L^{-1}G$ as targets results in a representation spanning the top-$d$ left singular vectors of $L^{-1}G$, which are the same as the top-$d$ left singular vectors of $L^{-1}$. By assumption $G$ is orthogonal, hence $GG^\top = I$, and because $L^{-1}GG^\top$ is symmetric, all its eigenvalues are real. By Theorem 1, running gradient descent on $L^{TD}_{aux}$ using $G$ as the cumulant matrix converges to the top-$d$ eigenvectors of $L^{-1}GG^\top = L^{-1}$; indeed, the subspaces given by the span of eigenvectors of $L^{-1}$ are the only $L^{-1}$-invariant subspaces, and these eigenvectors are also the singular vectors of $L^{-1}$ since this matrix is symmetric. Because $P^\pi$ is a row-stochastic matrix, its spectral radius satisfies $\rho(P^\pi) = 1$, and therefore $\lambda(P^\pi)\subseteq[-1,1]$. Hence $\frac{1}{1-\gamma\lambda}\in[1/(1+\gamma),\,1/(1-\gamma)]$, so the eigenvalues of $L^{-1}$ are positive. Because $L^{-1}$ is symmetric, its singular values are exactly its eigenvalues. Hence the top-$d$ eigenvectors are the top-$d$ singular vectors, and the conclusion follows.
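The following sketch simulates a discretised version of the TD feature dynamics analysed in Theorem 1 on a toy symmetric $P^\pi$, where Corollary 2 applies and the top-$d$ real invariant subspace is simply the top-$d$ eigenspace. The step size, iteration count and periodic QR re-orthonormalisation are practical choices for the sketch, not part of the analysed dynamics.

```python
# Sketch: semi-gradient TD feature dynamics drift towards the top-d eigenspace of a symmetric P^pi.
import numpy as np

rng = np.random.default_rng(0)
S, d, gamma, lr = 12, 3, 0.9, 0.05
P = 0.5 * np.eye(S) + 0.25 * np.roll(np.eye(S), 1, axis=1) + 0.25 * np.roll(np.eye(S), -1, axis=1)
L, Xi, G = np.eye(S) - gamma * P, np.eye(S) / S, np.eye(S)

Phi = rng.normal(size=(S, d))
for _ in range(5000):
    W = np.linalg.solve(Phi.T @ Xi @ L @ Phi, Phi.T @ Xi @ G)    # W^TD_Phi
    Phi = Phi - lr * 2 * Xi @ (L @ Phi @ W - G) @ W.T            # semi-gradient step on Phi
    Phi = np.linalg.qr(Phi)[0]                                   # keep Phi well conditioned (same span)

top_d = np.linalg.eigh(P)[1][:, -d:]                             # top-d eigenvectors of P^pi
print(np.linalg.norm(Phi @ Phi.T - top_d @ top_d.T, 2))          # sin-theta distance, close to 0
```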
D. Proofs for Section 4

Lemma 2. A representation $\Phi^*_{TD}$ is $\ell_1$-ball optimal for TD learning iff it is a solution of the following optimization problem:
$$\Phi^*_{TD} \in \arg\min_{\Phi\in\mathbb{R}^{S\times d}}\|\Xi^{1/2}(\Phi W^{TD}_{\Phi,I} - (I-\gamma P^\pi)^{-1})\|_F^2.$$

Proof. By definition, a representation is $\ell_1$-ball optimal for TD learning when it is a minimizer of Equation (3), that is,
$$\Phi^*_{TD} \in \arg\min_{\Phi\in\mathbb{R}^{S\times d}}\ \mathbb{E}_{r^\pi}\|\Phi w^{TD}_\Phi - V^\pi\|^2_\xi, \qquad (6)$$
where the expectation is over reward functions $r^\pi$ sampled uniformly over the $\ell_1$ ball $\|r^\pi\|_1 \le 1$ and $w^{TD}_\Phi = (\Phi^\top\Xi(I-\gamma P^\pi)\Phi)^{-1}\Phi^\top\Xi r^\pi$. Write $P^\perp_{L^\top\Xi\Phi} = I - P_{L^\top\Xi\Phi}$, with $P_X = \Phi(X^\top\Phi)^{-1}X^\top$ the oblique projection onto $\operatorname{span}(\Phi)$ orthogonally to $\operatorname{span}(X)$. We have
$$\mathbb{E}_{\|r\|_1\le 1}\|\Phi w^{TD}_\Phi - V^\pi\|^2_\xi = \mathbb{E}_{\|r\|_1\le 1}\|\Xi^{1/2}P^\perp_{L^\top\Xi\Phi}(I-\gamma P^\pi)^{-1}r\|_2^2 = \mathbb{E}_{\|r\|_1\le 1}\operatorname{tr}\big(r^\top L^{-\top}(P^\perp_{L^\top\Xi\Phi})^\top\Xi P^\perp_{L^\top\Xi\Phi}L^{-1}r\big) = \operatorname{tr}\big(L^{-\top}(P^\perp_{L^\top\Xi\Phi})^\top\Xi P^\perp_{L^\top\Xi\Phi}L^{-1}\,\mathbb{E}[rr^\top]\big) \propto \|\Xi^{1/2}P^\perp_{L^\top\Xi\Phi}L^{-1}\|_F^2,$$
because $r$ is sampled from an isotropic distribution, so $\mathbb{E}[rr^\top]\propto I$. Finally, $\|\Xi^{1/2}P^\perp_{L^\top\Xi\Phi}L^{-1}\|_F^2 = \|\Xi^{1/2}(\Phi W^{TD}_{\Phi,I} - (I-\gamma P^\pi)^{-1})\|_F^2$, which concludes.

Lemma 3. A representation $\Phi^*_{MC}$ is $\ell_1$-ball optimal for batch Monte Carlo policy evaluation if its column space spans the top-$d$ left singular vectors (with respect to the inner product $\langle x,y\rangle_\Xi$) of $(I-\gamma P^\pi)^{-1}$.

Proof. We have
$$\mathbb{E}_{\|r\|_1\le 1}\|\hat V_{MC} - V^\pi\|^2_\xi = \mathbb{E}_{\|r\|_1\le 1}\|P^\perp_{\Xi^{1/2}\Phi}\Xi^{1/2}(I-\gamma P^\pi)^{-1}r\|_2^2 = \mathbb{E}_{\|r\|_1\le 1}\operatorname{tr}\big(r^\top L^{-\top}\Xi^{1/2}P^\perp_{\Xi^{1/2}\Phi}\Xi^{1/2}L^{-1}r\big) = \operatorname{tr}\big(L^{-\top}\Xi^{1/2}P^\perp_{\Xi^{1/2}\Phi}\Xi^{1/2}L^{-1}\,\mathbb{E}[rr^\top]\big) \propto \|P^\perp_{\Xi^{1/2}\Phi}\Xi^{1/2}L^{-1}\|_F^2.$$
Write $(I-\gamma P^\pi)^{-1} = F\Sigma B^\top$ the weighted SVD of $(I-\gamma P^\pi)^{-1}$, where $F\in\mathbb{R}^{S\times S}$ satisfies $F^\top\Xi F = I$ and $B\in\mathbb{R}^{S\times S}$ satisfies $B^\top B = I$, and write $F_d$ for the top-$d$ left singular vectors corresponding to the top-$d$ singular values on the diagonal of $\Sigma$. By definition, an $\ell_1$-ball optimal representation is a solution of
$$\arg\min_{\Phi\in\mathbb{R}^{S\times d}}\mathbb{E}_{\|r\|_1\le 1}\|\hat V_{MC} - V^\pi\|^2_\xi = \arg\min_{\Phi\in\mathbb{R}^{S\times d}}\|P^\perp_{\Xi^{1/2}\Phi}\Xi^{1/2}L^{-1}\|_F^2 = \arg\min_{\Phi\in\mathbb{R}^{S\times d}}\|P^\perp_{\Xi^{1/2}\Phi}\Xi^{1/2}F\Sigma B^\top\|_F^2.$$
By the Eckart–Young theorem, $\|P^\perp_{\Xi^{1/2}F_d}\Xi^{1/2}F\Sigma B^\top\|_F^2 \le \|P^\perp_{\Xi^{1/2}\Phi}\Xi^{1/2}F\Sigma B^\top\|_F^2$ for any $\Phi\in\mathbb{R}^{S\times d}$. Hence, the set of optimal representations is $\{F_dM,\ M\in GL_d(\mathbb{R})\}$.
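As a sanity check of Lemma 3, the sketch below compares the batch Monte Carlo approximation error of a representation spanning the top-$d$ $\Xi$-weighted left singular vectors of the SR with that of a random representation, on the same toy ring MDP used above (an illustrative choice).

```python
# Sketch: the top-d weighted left singular vectors of the SR minimise the expected MC error.
import numpy as np

rng = np.random.default_rng(0)
S, d, gamma = 12, 3, 0.9
P = 0.5 * np.eye(S) + 0.25 * np.roll(np.eye(S), 1, axis=1) + 0.25 * np.roll(np.eye(S), -1, axis=1)
Xi = np.eye(S) / S
SR = np.linalg.inv(np.eye(S) - gamma * P)                       # (I - gamma P)^(-1)

sqrtXi = np.sqrt(Xi)
F_d = np.linalg.inv(sqrtXi) @ np.linalg.svd(sqrtXi @ SR)[0][:, :d]   # top-d Xi-weighted left sing. vecs

def mc_error(Phi):
    # E_r ||Phi w^MC_Phi - V^pi||_xi^2 over isotropic r is proportional to
    # || P^perp_{Xi^(1/2) Phi} Xi^(1/2) SR ||_F^2
    Q = np.linalg.qr(sqrtXi @ Phi)[0]
    return np.linalg.norm((np.eye(S) - Q @ Q.T) @ sqrtXi @ SR, 'fro') ** 2

print(mc_error(F_d))                                            # the minimum over rank-d Phi
print(mc_error(rng.normal(size=(S, d))))                        # strictly larger
```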
Lemma 8. Write $F_d\Sigma_dB_d^\top$ for the truncated weighted SVD of the successor representation $(I-\gamma P^\pi)^{-1}$. A representation is $\ell_1$-ball optimal for residual policy evaluation if its column space spans $F_d\Sigma_d$.

Proof. Write $(I-\gamma P^\pi)^{-1} = F\Sigma B^\top$ the weighted SVD of $(I-\gamma P^\pi)^{-1}$, where $F\in\mathbb{R}^{S\times S}$ satisfies $F^\top\Xi F = I$ and $B\in\mathbb{R}^{S\times S}$ satisfies $B^\top B = I$, and write $F_d$ for the top-$d$ left singular vectors corresponding to the top-$d$ singular values on the diagonal of $\Sigma$. For a fixed $\Phi\in\mathbb{R}^{S\times d}$, the solution of $\min_{w\in\mathbb{R}^d}\|\Xi^{1/2}(\Phi w - (r^\pi + \gamma P^\pi\Phi w))\|_F^2$ is the Bellman-residual-minimizing approximation (Lagoudakis and Parr, 2003), given by
$$w^{res}_\Phi = \big((\Phi - \gamma P^\pi\Phi)^\top\Xi(\Phi - \gamma P^\pi\Phi)\big)^{-1}(\Phi - \gamma P^\pi\Phi)^\top\Xi r^\pi.$$
Hence the value approximant can be expressed by means of an orthogonal projection matrix as
$$\Phi w^{res}_\Phi = (I-\gamma P^\pi)^{-1}\Xi^{-1/2}P_{\Xi^{1/2}(I-\gamma P^\pi)\Phi}\Xi^{1/2}r^\pi,$$
where $P_X = X(X^\top X)^{-1}X^\top$ denotes an orthogonal projection. By definition, a representation that is $\ell_1$-ball optimal for residual policy evaluation is a solution of
$$\arg\min_{\Phi\in\mathbb{R}^{S\times d}}\mathbb{E}_{\|r\|_1\le1}\|\hat V_{res} - V^\pi\|^2_\xi = \arg\min_{\Phi\in\mathbb{R}^{S\times d}}\|\Xi^{1/2}(I-\gamma P^\pi)^{-1}\Xi^{-1/2}P_{\Xi^{1/2}(I-\gamma P^\pi)\Phi}\Xi^{1/2} - \Xi^{1/2}(I-\gamma P^\pi)^{-1}\|_F^2 = \arg\min_{\Phi\in\mathbb{R}^{S\times d}}\|\Xi^{1/2}(I-\gamma P^\pi)^{-1}P^\perp_{\Xi^{1/2}(I-\gamma P^\pi)\Phi}\|_F^2.$$
Write $L^{-1} = U\Sigma V^\top$ and let $U_d, \Sigma_d, V_d$ correspond to the top-$d$ singular values, so that $U_d\in\mathbb{R}^{S\times d}$, $\Sigma_d\in\mathbb{R}^{d\times d}$, $V_d\in\mathbb{R}^{S\times d}$ and $V^\top V_d = [I_d;\,0]$. Choosing $\Phi$ such that $L\Phi = V_d$, that is $\Phi = L^{-1}V_d = U\Sigma V^\top V_d = U_d\Sigma_d$, gives $P^\perp_{L\Phi} = P^\perp_{V_d}$ and $L^{-1}P^\perp_{V_d} = U_{-d}\Sigma_{-d}V_{-d}^\top$ (the bottom $S-d$ components of the SVD), so the objective equals the sum of the last $S-d$ squared singular values, its minimal possible value. Hence a representation spanning $F_d\Sigma_d$ is $\ell_1$-ball optimal.

E. Proofs for Subsection 4.1

Lemma 4. Denote $B_T$ the top-$T$ right singular vectors of the SR and $O(T,S)$ the set of orthogonal matrices in $\mathbb{R}^{T\times S}$. Training auxiliary tasks by Monte Carlo with any $G$ from the set $\{G\in\mathbb{R}^{S\times T}\mid\exists M\in O(T,S),\ G = B_TM\}$ results in an $\ell_1$-ball optimal representation for batch Monte Carlo.

Proof. By Lemma 3, a representation is $\ell_1$-ball optimal for batch Monte Carlo policy evaluation if it spans the top-$d$ left singular vectors of the successor representation. Let $G\in\mathbb{R}^{S\times T}$ be a cumulant matrix and consider
$$L^{SL}_{aux}(\Phi) = \min_{W\in\mathbb{R}^{d\times T}}\|\Phi W - (I-\gamma P^\pi)^{-1}G\|_F^2.$$
By Proposition 1, training on such a loss with $G = I$ results in a representation spanning the same subspace as the top-$d$ left singular vectors of the SR, that is $\{\Phi\in\mathbb{R}^{S\times d}\mid\exists M\in GL_d(\mathbb{R}),\ \Phi = F_dM\}$, where $F_d$ are the top-$d$ left singular vectors of the SR. We note that the matrix $G$ leading to a representation spanning that subspace is not unique. In particular, training with any matrix from the set of cumulant matrices $\mathcal{G}(G) = \{G'\in\mathbb{R}^{S\times T}\mid\exists M\in O(T,S),\ G' = GM\}$ results in the same representation, where $O(T,S)$ denotes the set of orthogonal matrices in $\mathbb{R}^{T\times S}$ (rows with unit $\ell_2$ norm). We are interested in finding a cumulant matrix $G\in\mathbb{R}^{S\times T}$ with $T < S$ such that training the Monte Carlo loss $L^{MC}_{aux}$ results in a representation spanning the top-$d$ left singular vectors of the successor representation. Denote $B_T$ the top-$T$ right singular vectors of the SR; then the set $\mathcal{G}(B_T)$ satisfies the requirement. In particular, this finding is consistent with the case $S = T$, because $\mathcal{G}(B_S) = \{G\in\mathbb{R}^{S\times T}\mid\exists M\in O(T),\ G = B_SM\} = \mathcal{G}(I_S)$. Indeed, let $G\in\mathcal{G}(B_S)$: there exists $M\in O(S)$ such that $G = B_SM = I_S(B_SM)$, and because $B_SM\in O(S)$ we have $G\in\mathcal{G}(I_S)$, hence $\mathcal{G}(B_S)\subseteq\mathcal{G}(I_S)$. Conversely, let $G\in\mathcal{G}(I_S)$: there exists $M\in O(S)$ such that $G = I_SM = (B_SB_S^\top)M = B_S(B_S^\top M)$, and because $B_S^\top M\in O(S)$ we have $G\in\mathcal{G}(B_S)$, hence $\mathcal{G}(I_S)\subseteq\mathcal{G}(B_S)$. As a conclusion, $\mathcal{G}(I_S) = \mathcal{G}(B_S)$.

Lemma 5. Let $\{\Phi_\omega\}$ be the set of rotating representations from Figure 2 learnt by TD learning with $G = I$ and $d = 2$. All these representations are equally good for learning the main value function by TD learning, that is, for all $\omega\in[0,1]$, $\mathbb{E}_{\|r\|_2^2 \le 1}\|\Phi_\omega w^{TD}_{\Phi_\omega} - V^\pi\|^2$ is constant and independent of $\omega$.

Proof. Let us consider the case of the three-state circular example, and take an orthogonal basis for the invariant subspaces of $P^\pi$. By definition, $P^\pi e_1 = e_1$ and $P^\pi[e_2, e_3] = [e_2, e_3]\Lambda$, so $Le_1 = (1-\gamma)e_1$ and $L[e_2,e_3] = (I-\gamma P^\pi)[e_2,e_3] = [e_2,e_3] - \gamma[e_2,e_3]\Lambda = [e_2,e_3](I-\gamma\Lambda)$. Suppose the representation is $\Phi = [e_1,\ \omega e_2 + (1-\omega)e_3] = [e_1,e_2,e_3]\,\Omega$ for some $\omega\in[0,1]$, with
$$\Omega = \begin{pmatrix}1 & 0\\ 0 & \omega\\ 0 & 1-\omega\end{pmatrix}.$$
Then $L\Phi = [(1-\gamma)e_1,\ [e_2,e_3](I-\gamma\Lambda)]\,\Omega$, hence
$$L\Phi = [e_1,e_2,e_3]\begin{pmatrix}1-\gamma & 0\\ 0 & I-\gamma\Lambda\end{pmatrix}\Omega,\qquad \Phi^\top L\Phi = \Omega^\top[e_1,e_2,e_3]^\top[e_1,e_2,e_3]\begin{pmatrix}1-\gamma & 0\\ 0 & I-\gamma\Lambda\end{pmatrix}\Omega = \Omega^\top\begin{pmatrix}1-\gamma & 0\\ 0 & I-\gamma\Lambda\end{pmatrix}\Omega,$$
and therefore
$$(\Phi^\top L\Phi)^{-1} = \begin{pmatrix}(1-\gamma)^{-1} & 0\\ 0 & \big(u^\top(I-\gamma\Lambda)u\big)^{-1}\end{pmatrix}, \qquad u = (\omega,\ 1-\omega)^\top.$$
Note that $u^\top(I-\gamma\Lambda)u = \omega^2(I-\gamma\Lambda)_{1,1} + (1-\omega)^2(I-\gamma\Lambda)_{2,2}$, the off-diagonal contributions cancelling for the rotation block $\Lambda$. The TD value operator is then
$$\hat V_{TD} = \Phi(\Phi^\top L\Phi)^{-1}\Phi^\top = [e_1,e_2,e_3]\,\Omega\,(\Phi^\top L\Phi)^{-1}\,\Omega^\top[e_1,e_2,e_3]^\top = \frac{1}{1-\gamma}e_1e_1^\top + \frac{\omega^2e_2e_2^\top + \omega(1-\omega)e_3e_2^\top + \omega(1-\omega)e_2e_3^\top + (1-\omega)^2e_3e_3^\top}{u^\top(I-\gamma\Lambda)u}.$$
One can then check that $\mathbb{E}_{\|r\|_2^2\le 1}\|\Phi_\omega w^{TD}_{\Phi_\omega} - V^\pi\|^2 \propto \|\Phi(\Phi^\top L\Phi)^{-1}\Phi^\top - (I-\gamma P^\pi)^{-1}\|_F^2$, which is independent of $\omega$.
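Returning to Lemma 4, the sketch below builds cumulants from the top-$T$ right singular vectors of the SR of a toy MDP (here a mildly asymmetric random walk on a ring, an illustrative choice) and checks that the Monte Carlo auxiliary targets $SR\,G$ then share their top-$d$ left singular subspace with the SR, unlike generic Gaussian cumulants of the same size.

```python
# Sketch: cumulants built from the top-T right singular vectors of the SR preserve its
# top-d left singular subspace in the Monte Carlo auxiliary targets SR @ G.
import numpy as np

rng = np.random.default_rng(0)
S, d, T, gamma = 12, 3, 5, 0.9
P = 0.5 * np.eye(S) + 0.3 * np.roll(np.eye(S), 1, axis=1) + 0.2 * np.roll(np.eye(S), -1, axis=1)
SR = np.linalg.inv(np.eye(S) - gamma * P)

U, s, Vt = np.linalg.svd(SR)                       # uniform Xi, so the plain SVD is the weighted one
F_d, B_T = U[:, :d], Vt[:T].T                      # top-d left / top-T right singular vectors

def top_left_subspace(targets, k):
    return np.linalg.svd(targets)[0][:, :k]

def sin_theta(A, B):                               # operator-norm sin-theta distance between spans
    Qa, Qb = np.linalg.qr(A)[0], np.linalg.qr(B)[0]
    return np.linalg.norm(Qa @ Qa.T - Qb @ Qb.T, 2)

print(sin_theta(top_left_subspace(SR @ B_T, d), F_d))                          # ~0
print(sin_theta(top_left_subspace(SR @ rng.normal(size=(S, T)), d), F_d))      # typically well above 0
```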
Lemma 6. The set of cumulant matrices $G\in\mathbb{R}^{S\times T}$ that preserve the top-$T$ invariant subspaces of the successor representation under TD learning are the top-$T$ orthogonal invariant subspaces of $(I-\gamma P^\pi)^{-1}$, that is, those satisfying $G^\top G = I$ (orthogonality) and $\operatorname{span}((I-\gamma P^\pi)^{-1}G)\subseteq\operatorname{span}(G)$ (invariance).

Proof. Let $\Phi\in\mathbb{R}^{S\times d}$ span an invariant subspace of $L^{-1}$. By definition, there exists a block-diagonal matrix $J_\Phi\in\mathbb{R}^{d\times d}$ such that $L^{-1}\Phi = \Phi J_\Phi$. Let $G\in O(S,T)$ span the top-$T$ invariant subspaces of $L^{-1}$. By definition, there exists a block-diagonal matrix $J_G\in\mathbb{R}^{T\times T}$ such that $L^{-1}G = GJ_G$. Hence, we have
$$(L^{-1}GG^\top)\Phi = L^{-1}(GG^\top\Phi) = L^{-1}\Phi = \Phi J_\Phi,$$
where $GG^\top\Phi = \Phi$ by orthonormality of $G$, since $GG^\top$ acts as the identity on $\operatorname{span}(G)\supseteq\operatorname{span}(\Phi)$. Then $\Phi$ spans an invariant subspace of $L^{-1}GG^\top$.

F. Proofs for Subsection 4.2

We now proceed to the proof of Proposition 3. Before that, we introduce some necessary notations and lemmas.

F.1. Notations

Let $O(S,d) := \{A\in\mathbb{R}^{S\times d} : A^\top A = I\}$.

Definition 5. Let $A, B\in O(S,d)$. The principal angles $\Theta$ between $A$ and $B$ are given by writing the SVD of $A^\top B = U\cos(\Theta)V^\top$.

Definition 6. Let $A, B\in O(S,d)$ with principal angles $\Theta$. We define the distance $d(A,B)$ as $d(A,B) := \|\sin\Theta\|_{op}$.

Proposition 5. Let $A, B\in O(S,d)$. We have the identities
$$d(A,B) = \|AA^\top - BB^\top\|_{op} = \|\sin\Theta\|_{op} = \|A^\top B_\perp\|_{op},$$
where $B_\perp\in O(S,S-d)$ satisfies $BB^\top + B_\perp B_\perp^\top = I$.
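The identities of Definition 6 and Proposition 5 are easy to check numerically; the sketch below compares the three expressions for the $\sin\theta$ distance on random $d$-dimensional subspaces.

```python
# Sketch: the three expressions for the sin-theta subspace distance agree numerically.
import numpy as np

rng = np.random.default_rng(0)
S, d = 12, 3
A = np.linalg.qr(rng.normal(size=(S, d)))[0]              # A in O(S, d)
B = np.linalg.qr(rng.normal(size=(S, d)))[0]              # B in O(S, d)

cosines = np.clip(np.linalg.svd(A.T @ B)[1], 0.0, 1.0)    # cos(Theta): cosines of principal angles
via_angles = np.max(np.sqrt(1.0 - cosines ** 2))          # ||sin(Theta)||_op
via_projections = np.linalg.norm(A @ A.T - B @ B.T, 2)    # ||A A^T - B B^T||_op
B_perp = np.linalg.svd(B, full_matrices=True)[0][:, d:]   # orthonormal basis of span(B)^perp
via_complement = np.linalg.norm(A.T @ B_perp, 2)          # ||A^T B_perp||_op

print(via_angles, via_projections, via_complement)        # all three coincide
```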
F.2. Approximate matrix decompositions

Lemma 9 (Deterministic error bound). Let $A$ be an $S\times S$ matrix. Fix $d\le S$ and partition the SVD of $A$ as
$$A = \begin{bmatrix}U_1 & U_2\end{bmatrix}\begin{bmatrix}\Sigma_1 & 0\\ 0 & \Sigma_2\end{bmatrix}\begin{bmatrix}V_1 & V_2\end{bmatrix}^\top,$$
where $\Sigma_1$ is $d\times d$ (the dimensions of all other factors are determined by this selection). Put $A_d := U_1\Sigma_1V_1^\top$, the rank-$d$ approximation of $A$. Let $\Omega$ be an $S\times\ell$ test matrix ($\ell\ge d$). Put $Y = A\Omega$, $\Omega_1 = V_1^\top\Omega$ and $\Omega_2 = V_2^\top\Omega$. We have
$$\|(I-P_Y)A_d\|_{op}^2 \le \|\Sigma_2\Omega_2\Omega_1^\dagger\|_{op}^2.$$

Proof. This proof is adapted from Theorem 9.1 of Halko et al. (2011). Write $A_d = \hat U\hat\Sigma\hat V^\top$ the full SVD of $A_d$. By invariance of the spectral norm under unitary transformations,
$$\|(I-P_Y)A_d\|_{op}^2 = \|\hat U^\top(I-P_Y)\hat U(\hat U^\top A_d)\|_{op}^2 = \|(I - P_{\hat U^\top Y})(\hat U^\top A_d)\|_{op}^2.$$
First assume the diagonal entries of $\Sigma_1$ are not all strictly positive. Then $\Sigma_2$ is zero as a consequence of the ordering of the singular values, and
$$\operatorname{range}(\hat U^\top Y) = \operatorname{range}\begin{bmatrix}\Sigma_1\Omega_1\\ 0\end{bmatrix} = \operatorname{range}\begin{bmatrix}\Sigma_1V_1^\top\\ 0\end{bmatrix} = \operatorname{range}(\hat U^\top A_d),$$
so we can conclude that $\|(I-P_Y)A_d\|_{op}^2 = 0$, assuming that $V_1^\top$ and $\Omega_1$ have full row rank.

Now assume that the diagonal entries of $\Sigma_1$ are strictly positive. Let
$$Z = \hat U^\top Y\,\Omega_1^\dagger\Sigma_1^{-1} = \begin{bmatrix}I_d\\ F\end{bmatrix}, \qquad F = \Sigma_2\Omega_2\Omega_1^\dagger\Sigma_1^{-1}.$$
By construction, $\operatorname{range}(Z)\subseteq\operatorname{range}(\hat U^\top Y)$, hence
$$\|(I - P_{\hat U^\top Y})(\hat U^\top A_d)\|_{op}^2 \le \|(I-P_Z)\hat U^\top A_d\|_{op}^2 \le \|A_d^\top\hat U(I-P_Z)\hat U^\top A_d\|_{op} \le \|\hat\Sigma(I-P_Z)\hat\Sigma\|_{op}.$$
Following the proof of Theorem 9.1 of Halko et al. (2011), we have
$$(I-P_Z) \preceq \begin{bmatrix}F^\top F & B\\ B^\top & I_{S-d}\end{bmatrix}, \qquad B = (I_d + F^\top F)^{-1}F^\top\in\mathbb{R}^{d\times(S-d)}.$$
Consequently,
$$\hat\Sigma(I-P_Z)\hat\Sigma \preceq \begin{bmatrix}\Sigma_1F^\top F\Sigma_1 & 0\\ 0 & 0\end{bmatrix};$$
$\hat\Sigma(I-P_Z)\hat\Sigma$ is PSD and, by the conjugation rule, the matrix on the right-hand side dominates it. It follows that
$$\|\hat\Sigma(I-P_Z)\hat\Sigma\|_{op} \le \|\Sigma_1F^\top F\Sigma_1\|_{op} = \|F\Sigma_1\|_{op}^2 = \|\Sigma_2\Omega_2\Omega_1^\dagger\|_{op}^2.$$

Lemma 10 (Average spectral error). Let $A$ be an $S\times S$ matrix with singular values $\sigma_1\ge\sigma_2\ge\dots$. Fix a target rank $2\le d\le S$ and an oversampling parameter $p\ge 2$ with $p+d\le S$. Draw an $S\times(d+p)$ standard Gaussian matrix $\Omega$ and construct the sample matrix $Y = A\Omega$. Then we have
$$\mathbb{E}\|(I-P_Y)A_d\|_{op} \le \sqrt{\tfrac{d}{p-1}}\,\sigma_{d+1} + \tfrac{e\sqrt{d+p}}{p}\Big(\textstyle\sum_{j>d}\sigma_j^2\Big)^{1/2}.$$

Proof. By Lemma 9 and linearity of expectation,
$$\mathbb{E}\|(I-P_Y)A_d\|_{op} \le \mathbb{E}\|\Sigma_2\Omega_2\Omega_1^\dagger\|_{op} \le \sqrt{\tfrac{d}{p-1}}\,\sigma_{d+1} + \tfrac{e\sqrt{d+p}}{p}\Big(\textstyle\sum_{j>d}\sigma_j^2\Big)^{1/2},$$
where the last inequality comes from Theorem 10.6 of Halko et al. (2011).

Lemma 11. Let $A\in\mathbb{R}^{m\times n}$ and fix $d < n$. Let $\sigma_1\ge\sigma_2\ge\dots\ge\sigma_n$ denote the singular values of $A$ listed in decreasing order, and suppose that $\sigma_d > 0$. Let $A_d$ denote the rank-$d$ approximation of $A$. Fix any matrix $Y\in\mathbb{R}^{m\times T}$. We have
$$\|(I-P_Y)A_d\|_{op} \ge \|(I-P_Y)P_{A_d}\|_{op}\,\sigma_d.$$

Proof. Decompose $P_Y^\perp P_{A_d}$ as $P_Y^\perp P_{A_d} = P_Y^\perp A_dA_d^\dagger$. Then
$$\|P_Y^\perp P_{A_d}\|_{op} = \|P_Y^\perp A_dA_d^\dagger\|_{op} \le \|P_Y^\perp A_d\|_{op}\|A_d^\dagger\|_{op} = \|P_Y^\perp A_d\|_{op}/\sigma_d,$$
where the inequality comes from the sub-multiplicativity of the operator norm.

Proposition 6. Let $A$ be an $S\times S$ matrix with singular values $\sigma_1\ge\sigma_2\ge\dots$. Fix a target rank $2\le d\le S$ and an oversampling parameter $p\ge 2$ with $p+d\le S$. Draw an $S\times(d+p)$ standard Gaussian matrix $\Omega$ and construct the sample matrix $Y = A\Omega$. Then we have
$$\mathbb{E}\|(I-P_Y)P_{A_d}\|_{op} \le \frac{1}{\sigma_d}\Big(\sqrt{\tfrac{d}{p-1}}\,\sigma_{d+1} + \tfrac{e\sqrt{d+p}}{p}\big(\textstyle\sum_{j>d}\sigma_j^2\big)^{1/2}\Big).$$

Proof. By Lemma 11 and linearity of expectation, $\frac{1}{\sigma_d}\mathbb{E}\|(I-P_Y)A_d\|_{op} \ge \mathbb{E}\|(I-P_Y)P_{A_d}\|_{op}$. Applying Lemma 10 gives the stated bound. Observe that as the oversampling factor $p$ grows, the right-hand side tends to zero; however, to bring it below a tolerance $\varepsilon$, $p$ must scale roughly as $1/\varepsilon^2$, which is consistent with the concentration-of-measure argument underlying the bound.

F.3. Analysis

Proposition 3 (MC error bound). Let $G\in\mathbb{R}^{S\times T}$ be a sample from a standard Gaussian distribution and assume $d\le T$. Let $F_d$ be the top-$d$ left singular vectors of the successor representation $(I-\gamma P^\pi)^{-1}$ and $\hat F_d$ be the top-$d$ left singular vectors of $(I-\gamma P^\pi)^{-1}G$. Denote $\sigma_1\ge\sigma_2\ge\dots\ge\sigma_S$ the singular values of the SR and $\operatorname{dist}(F_d,\hat F_d)$ the $\sin\theta$ distance between the subspaces spanned by $F_d$ and $\hat F_d$. Denoting $p = T - d$, we have
$$\mathbb{E}[\operatorname{dist}(F_d,\hat F_d)] \le \frac{1}{\sigma_d}\Big(\sqrt{\tfrac{d}{p-1}}\,\sigma_{d+1} + \tfrac{e\sqrt{d+p}}{p}\big(\textstyle\sum_{j>d}\sigma_j^2\big)^{1/2}\Big).$$

Proof. For $l\in\{d,\dots,S\}$, let $F_l\in O(S,l)$ be the top-$l$ left singular vectors of $(I-\gamma P^\pi)^{-1}$ and $\hat F_l$ the top-$l$ left singular vectors of $(I-\gamma P^\pi)^{-1}G$. We have
$$d(F_d,\hat F_d) = \|\hat F_{d,\perp}^\top F_d\|_{op} = \|P^\perp_{\hat F_d}P_{F_d}\|_{op} \le \|P^\perp_{L^{-1}G}P_{F_d}\|_{op}\quad\text{as } \operatorname{span}(\hat F_d)\subseteq\operatorname{span}(L^{-1}G),$$
and, writing $\hat F_T$ for an orthonormal basis of $\operatorname{span}(L^{-1}G)$,
$$\|P^\perp_{L^{-1}G}P_{F_d}\|_{op} = \|\hat F_{T,\perp}^\top F_d\|_{op} = \|F_d^\top\hat F_{T,\perp}\|_{op} = \|P_{F_d}P^\perp_{\hat F_T}\|_{op} = \|P^\perp_{\hat F_T}P_{F_d}\|_{op}\ \text{(by symmetry of the projection matrices)} = \|(I-P_{L^{-1}G})P_{(L^{-1})_d}\|_{op} \le \tfrac{1}{\sigma_d}\|(I-P_{L^{-1}G})(L^{-1})_d\|_{op}\ \text{(by Lemma 11)}.$$
Now taking the expectation with respect to $G$ and applying Proposition 6 with $A = L^{-1}$ and $\Omega = G$,
$$\mathbb{E}[d(F_d,\hat F_d)] \le \frac{1}{\sigma_d}\Big(\sqrt{\tfrac{d}{T-d-1}}\,\sigma_{d+1} + \tfrac{e\sqrt{T}}{T-d}\big(\textstyle\sum_{j>d}\sigma_j^2\big)^{1/2}\Big).$$
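The sketch below illustrates the behaviour promised by Proposition 3 on a toy SR: as the number $T$ of Gaussian cumulants (and hence the oversampling $p = T - d$) grows, the top-$d$ left singular subspace of $SR\,G$ gets closer to that of the SR. The toy ring MDP and the averaging over 200 draws are illustrative choices.

```python
# Sketch: more random Gaussian cumulants make the learned top-d subspace closer to the SR's.
import numpy as np

rng = np.random.default_rng(0)
S, d, gamma = 12, 3, 0.9
P = 0.5 * np.eye(S) + 0.25 * np.roll(np.eye(S), 1, axis=1) + 0.25 * np.roll(np.eye(S), -1, axis=1)
SR = np.linalg.inv(np.eye(S) - gamma * P)
F_d = np.linalg.svd(SR)[0][:, :d]                      # uniform Xi: plain SVD is the weighted one

def sin_theta(A, B):                                   # operator-norm sin-theta distance between spans
    Qa, Qb = np.linalg.qr(A)[0], np.linalg.qr(B)[0]
    return np.linalg.norm(Qa @ Qa.T - Qb @ Qb.T, 2)

for T in (5, 8, 12):
    dists = [sin_theta(np.linalg.svd(SR @ rng.normal(size=(S, T)))[0][:, :d], F_d)
             for _ in range(200)]
    print(T, np.mean(dists))                           # the average distance tends to shrink with T
```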
G. n-step TD case

We now provide intuition about how our theory extends from 1-step to n-step temporal difference learning. The n-step TD loss corresponds to
$$L^{TD}_{aux}(\Phi, W) = \Big\|\Xi^{1/2}\Big(\Phi W - \sum_{k=0}^{n-1}\gamma^k(P^\pi)^kG - \gamma^n(P^\pi)^n\,SG[\Phi W]\Big)\Big\|_F^2.$$
Our analysis mostly stays the same, with the exception that the successor representation $(I-\gamma P^\pi)^{-1}$ becomes $(I-\gamma^n(P^\pi)^n)^{-1}$ and the cumulant matrix $G$ becomes $G_n = \sum_{k=0}^{n-1}\gamma^k(P^\pi)^kG$. Following the same proof strategy as that of Lemma 1, we find that the critical representations are given by the real invariant subspaces of $(I-\gamma^n(P^\pi)^n)^{-1}G_nG_n^\top\Xi$. Following the proof strategy of Theorem 1 then shows that all non-top real invariant subspaces of $(I-\gamma^n(P^\pi)^n)^{-1}G_nG_n^\top\Xi$ are unstable. This implies that the n-step TD algorithm also converges towards a real top-$d$ invariant subspace of $(I-\gamma^n(P^\pi)^n)^{-1}G_nG_n^\top\Xi$, or diverges, with probability 1.
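Finally, a small sketch of the n-step quantities above: it builds $(I-\gamma^n(P^\pi)^n)^{-1}$ and $G_n$ for a toy $P^\pi$ and checks that their product coincides with the 1-step $(I-\gamma P^\pi)^{-1}G$, so the auxiliary value functions being bootstrapped are unchanged, while the operator whose real invariant subspaces characterise the TD features does change with $n$. The toy ring MDP is an illustrative choice.

```python
# Sketch: the n-step SR and cumulant compose to the 1-step SR times G.
import numpy as np

S, n, gamma = 12, 4, 0.9
P = 0.5 * np.eye(S) + 0.25 * np.roll(np.eye(S), 1, axis=1) + 0.25 * np.roll(np.eye(S), -1, axis=1)
G = np.eye(S)

SR_1 = np.linalg.inv(np.eye(S) - gamma * P)
SR_n = np.linalg.inv(np.eye(S) - gamma ** n * np.linalg.matrix_power(P, n))
G_n = sum(gamma ** k * np.linalg.matrix_power(P, k) for k in range(n)) @ G

print(np.allclose(SR_n @ G_n, SR_1 @ G))     # True: the same auxiliary value functions
```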