# Learning One Representation to Optimize All Rewards

Ahmed Touati (Mila, University of Montreal), ahmed.touati@umontreal.ca

Yann Ollivier (Facebook Artificial Intelligence Research, Paris), yol@fb.com

Work done during an internship at Facebook Artificial Intelligence Research Paris. Code: https://github.com/ahmed-touati/controllable_agent. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

## Abstract

We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process. It provides explicit near-optimal policies for any reward specified a posteriori. During an unsupervised phase, we use reward-free interactions with the environment to learn two representations via off-the-shelf deep learning methods and temporal difference (TD) learning. In the test phase, a reward representation is estimated either from reward observations or from an explicit reward description (e.g., a target state). The optimal policy for that reward is directly obtained from these representations, with no planning. We assume access to an exploration scheme or replay buffer for the first phase.

The corresponding unsupervised loss is well-principled: if training is perfect, the policies obtained are provably optimal for any reward function. With imperfect training, the sub-optimality is proportional to the unsupervised approximation error. The FB representation learns long-range relationships between states and actions, via a predictive occupancy map, without having to synthesize states as in model-based approaches. This is a step towards learning controllable agents in arbitrary black-box stochastic environments.

This approach compares well to goal-oriented RL algorithms on discrete and continuous mazes, pixel-based Ms. Pacman, and the FetchReach virtual robot arm. We also illustrate how the agent can immediately adapt to new tasks beyond goal-oriented RL.

## 1 Introduction

We consider one kind of unsupervised reinforcement learning problem: given a Markov decision process (MDP) but no reward information, is it possible to learn and store a compact object that, for any reward function specified later, provides the optimal policy for that reward, with a minimal amount of additional computation? In a sense, such an object would encode in a compact form the solutions of all possible planning problems in the environment. This is a step towards building agents that are fully controllable after first exploring their environment in an unsupervised way.

Goal-oriented RL methods [ACR+17, PAR+18] compute policies for a series of rewards specified in advance (such as reaching a set of target states), but cannot adapt in real time to new rewards, such as weighted combinations of target states or dense rewards. Learning a model of the world is another possibility, but it still requires explicit planning for each new reward; moreover, synthesizing accurate trajectories of states over long time ranges has proven difficult [Tal17, KST+18]. Instead, we exhibit an object that is both simpler to learn than a model of the world, and contains the information to recover near-optimal policies for any reward provided a posteriori, without a planning phase.

[BBQ+18] learn optimal policies for all rewards that are linear combinations of a finite number of feature functions provided in advance by the user.
This limits applications: e.g., goal-oriented tasks would require one feature per goal state, thus using infinitely many features in continuous spaces. We reuse a policy parameterization from [BBQ+18], but introduce a novel representation with better properties, based on state occupancy prediction instead of expected featurizations. We use theoretical advances on successor state learning from [BTO21]. We obtain the following.

- We prove the existence of a learnable "summary" of a reward-free discrete or continuous MDP that provides an explicit formula for optimal policies for any reward specified later. This takes the form of a pair of representations $F: S \times A \times Z \to Z$ and $B: S \times A \to Z$ from state-actions into a representation space $Z \simeq \mathbb{R}^d$, with policies $\pi_z(s) := \arg\max_a F(s,a,z)^\top z$. Once a reward is specified, a value of $z$ is computed from reward values and $B$; then $\pi_z$ is used. Rewards may be specified either explicitly as a function, or as target states, or by samples as in usual RL setups.
- We provide a well-principled unsupervised loss for $F$ and $B$. If FB training is perfect, then the policies are provably optimal for all rewards (Theorem 2). With imperfect training, the sub-optimality is proportional to the FB training error (Theorems 8 and 9). In finite spaces, perfect training is possible with a large enough dimension $d$ (Proposition 6). Explicitly, $F$ and $B$ are trained so that $F(s,a,z)^\top B(s',a')$ approximates the long-term probability of reaching $(s',a')$ from $(s,a)$ when following $\pi_z$. This is akin to a model of the environment, without synthesizing state trajectories.
- We provide a TD-like algorithm to train $F$ and $B$ for this unsupervised loss, with function approximation, adapted from recent methods for successor states [BTO21]. No sparse rewards are used: every transition reaches some state $s'$, so every step is exploited. As usual with TD, learning seeks a fixed point but the loss itself is not observable.
- We prove viability of the method on several environments, from mazes to pixel-based Ms. Pacman and a virtual robotic arm. For single-state rewards (learning to reach arbitrary states), we provide quantitative comparisons with goal-oriented methods such as HER. (Our method is not a substitute for HER: in principle they could be combined, with HER improving replay buffer management for our method.) For more general rewards, which cannot be tackled a posteriori by trained goal-oriented models, we provide qualitative examples. We also illustrate qualitatively the sub-optimalities (long-range behavior is preserved but local blurring of rewards occurs) and the representations learned.

## 2 Problem and Notation

Let $\mathcal{M} = (S, A, P, \gamma)$ be a reward-free Markov decision process with state space $S$ (discrete or continuous), action space $A$ (discrete for simplicity, but this is not essential), transition probabilities $P(s'|s,a)$ from state $s$ to $s'$ given action $a$, and discount factor $0 < \gamma < 1$ [SB18]. If $S$ is finite, $P(s'|s,a)$ can be viewed as a matrix; in general, for each $(s,a)\in S\times A$, $P(\mathrm{d}s'|s,a)$ is a probability measure on $s'\in S$. The notation $P(\mathrm{d}s'|s,a)$ covers all cases.

Given $(s_0,a_0)\in S\times A$ and a policy $\pi: S\to\mathrm{Prob}(A)$, we denote by $\Pr(\cdot\,|\,s_0,a_0,\pi)$ and $\mathbb{E}[\cdot\,|\,s_0,a_0,\pi]$ the probabilities and expectations under state-action sequences $(s_t,a_t)_{t\geq 0}$ starting at $(s_0,a_0)$ and following policy $\pi$ in the environment, defined by sampling $s_t \sim P(\mathrm{d}s_t|s_{t-1},a_{t-1})$ and $a_t \sim \pi(s_t)$.
For any policy $\pi$ and state-action $(s_0,a_0)$, define the successor measure $M^\pi(s_0,a_0,\cdot)$ as the measure over $S\times A$ representing the expected discounted time spent in each set $X \subset S\times A$:

$$M^\pi(s_0,a_0,X) := \sum_{t\geq 0} \gamma^t \Pr\big((s_t,a_t)\in X \mid s_0,a_0,\pi\big) \qquad (1)$$

for each $X \subset S\times A$. Viewing $M^\pi$ as a measure deals with both discrete and continuous spaces.

Given a reward function $r: S\times A\to\mathbb{R}$, the Q-function of $\pi$ for $r$ is $Q^\pi_r(s_0,a_0) := \sum_{t\geq 0}\gamma^t\,\mathbb{E}[r(s_t,a_t)\mid s_0,a_0,\pi]$. We assume that rewards are bounded, so that all Q-functions are well-defined. We state the results for deterministic reward functions, but this is not essential. We abuse notation and write greedy policies as $\pi(s) = \arg\max_a Q(s,a)$ instead of $\pi(s)\in\arg\max_a Q(s,a)$. Ties may be broken in any way.

We consider the following informal problem: given a reward-free MDP $(S,A,P,\gamma)$, can we compute a convenient learnable object $E$ such that, once a reward function $r: S\times A\to\mathbb{R}$ is specified, we can easily (with no planning) compute, from $E$ and $r$, a policy whose performance is close to maximal?

## 3 Encoding All Optimal Policies via the Forward-Backward Representation

We first present forward-backward (FB) representations of a reward-free MDP as a way to summarize all optimal policies via explicit formulas. The resulting learning procedure is described in Section 4.

Core idea. The main algebraic idea is as follows. Assume, at first, that $S$ is finite. For a fixed policy, the Q-function depends linearly on the reward: namely, $Q^\pi_r(s,a) = \sum_{s',a'} M^\pi(s,a,s',a')\, r(s',a')$ where $M^\pi(s,a,s',a') = \sum_{t\geq 0}\gamma^t \Pr((s_t,a_t)=(s',a')\mid s,a,\pi)$. This rewrites as $Q^\pi_r = M^\pi r$, viewing everything as vectors and matrices indexed by state-actions.

Now let $(\pi_z)_{z\in\mathbb{R}^d}$ be any family of policies parameterized by $z$. Assume that for each $z$, we can find $d\times(S\times A)$ matrices $F_z$ and $B$ such that $M^{\pi_z} = F_z^\top B$. Then $Q^{\pi_z}_r = F_z^\top B r$. Specializing to $z_R := Br$, the Q-function of policy $\pi_{z_R}$ on reward $r$ is $F_{z_R}^\top z_R$. So far $\pi_z$ was unspecified; but if we define $\pi_z(s) := \arg\max_a (F_z^\top z)_{sa}$ at each state $s$, then by definition $\pi_{z_R}$ is the greedy policy with respect to $F_{z_R}^\top z_R$. At the same time, $F_{z_R}^\top z_R$ is the Q-function of $\pi_{z_R}$ for reward $r$: thus, $\pi_{z_R}$ is the greedy policy of its own Q-function, and is therefore optimal for reward $r$.

Thus, if we manage to find $F$, $B$, and $\pi_z$ such that $\pi_z = \arg\max_a F_z^\top z$ and $F_z^\top B = M^{\pi_z}$ for all $z\in\mathbb{R}^d$, then we obtain the optimal policy for any reward $r$, just by computing $z_R = Br$ and applying policy $\pi_{z_R}$. This criterion on $(F, B, \pi_z)$ is entirely unsupervised. Since $F$ and $B$ depend on $\pi_z$ but $\pi_z$ is defined via $F$, this is a fixed point equation. An exact solution exists for $d$ large enough (Appendix, Prop. 6), while a smaller $d$ provides lower-rank approximations $M^{\pi_z} \approx F_z^\top B$.

In Section 4 we present a well-grounded algorithm to learn such $F$, $B$, and $\pi_z$. In short, we learn two representations $F$ and $B$ such that $F(s_0,a_0,z)^\top B(s',a')$ is approximately the long-term probability $M^{\pi_z}(s_0,a_0,s',a')$ to reach $(s',a')$ if starting at $(s_0,a_0)$ and following policy $\pi_z$. Then all optimal policies can be computed from $F$ and $B$. We think of $F$ as a representation of the future of a state, and $B$ as the ways to reach a state (Appendix B.4): if $F^\top B$ is large, then the second state is reachable from the first. This is akin to a model of the environment, without synthesizing state trajectories.
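As an illustration of this algebraic step (and only of this step), here is a minimal tabular sketch in Python/NumPy. It fixes a single policy, so it omits the action dependence and the fixed-point condition on $\pi_z$; all names are ours, and the factorization chosen ($F = (M^\pi)^\top$, $B = \mathrm{Id}$) is just one trivially exact choice, not a learned one.

```python
import numpy as np

# Illustrative only: the identity Q^pi_r = M^pi r for a single fixed policy on 6 states.
n, gamma = 6, 0.9
rng = np.random.default_rng(0)
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)          # state-to-state kernel under the fixed policy

M_pi = np.linalg.inv(np.eye(n) - gamma * P_pi)   # M^pi = sum_t gamma^t P_pi^t

# Any exact factorization M^pi = F^T B reproduces all Q-functions of this policy:
F, B = M_pi.T, np.eye(n)                         # trivial choice with d = n
r = rng.random(n)                                # a reward chosen a posteriori
z_R = B @ r                                      # z_R = B r
assert np.allclose(M_pi @ r, F.T @ z_R)          # Q^pi_r = M^pi r = F^T (B r)
```

The whole point of the FB construction is that the same factorization trick, applied simultaneously to the family $(\pi_z)$ with the self-consistency condition above, yields optimal rather than fixed-policy Q-functions.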
General statement. In continuous spaces with function approximation, $F_z$ and $B$ become functions $S\times A\to\mathbb{R}^d$ instead of matrices; since $F_z$ depends on $z$, $F$ itself is a function $S\times A\times\mathbb{R}^d\to\mathbb{R}^d$. The sums over states will be replaced with expectations under the data distribution $\rho$.

Definition 1 (Forward-backward representation). Let $Z = \mathbb{R}^d$ be a representation space, and let $\rho$ be a measure on $S\times A$. A pair of functions $F: S\times A\times Z\to Z$ and $B: S\times A\to Z$, together with a parametric family of policies $(\pi_z)_{z\in Z}$, is called a forward-backward representation of the MDP with respect to $\rho$, if the following conditions hold for any $z\in Z$ and $(s,a), (s_0,a_0)\in S\times A$:

$$\pi_z(s) = \arg\max_a F(s,a,z)^\top z, \qquad M^{\pi_z}(s_0,a_0,\mathrm{d}s,\mathrm{d}a) = F(s_0,a_0,z)^\top B(s,a)\,\rho(\mathrm{d}s,\mathrm{d}a) \qquad (2)$$

where $M$ is the successor measure defined in (1), and the last equality is between measures.

Theorem 2 (FB representations encode all optimal policies). Let $(F, B, (\pi_z))$ be a forward-backward representation of a reward-free MDP with respect to some measure $\rho$. Then, for any bounded reward function $r: S\times A\to\mathbb{R}$, the following holds. Set

$$z_R := \int r(s,a)\,B(s,a)\,\rho(\mathrm{d}s,\mathrm{d}a), \qquad (3)$$

assuming the integral exists. Then $\pi_{z_R}$ is an optimal policy for reward $r$ in the MDP. Moreover, the optimal Q-function $Q^\star$ for reward $r$ is $Q^\star(s,a) = F(s,a,z_R)^\top z_R$.

For instance, for a single reward located at state-action $(s,a)$, the optimal policy is $\pi_{z_R}$ with $z_R = B(s,a)$. (In that case the factor $\rho(\mathrm{d}s,\mathrm{d}a)$ does not matter, because scaling the reward does not change the optimal policy.)

We present in Section 4 an algorithm to learn FB representations. The measure $\rho$ will be the distribution of state-actions visited in a training set or under an exploration policy: then $z_R = \mathbb{E}_{(s,a)\sim\rho}[r(s,a)B(s,a)]$ can be obtained by sampling from visited states.

In finite spaces, exact FB representations exist, provided the dimension $d$ is larger than $\#S \cdot \#A$ (Appendix, Prop. 6). In infinite spaces, arbitrarily good approximations can be obtained by increasing $d$, corresponding to a rank-$d$ approximation of the cumulated transition probabilities $M^\pi$. Importantly, the optimality guarantee extends to approximate $F$ and $B$, with optimality gap proportional to the error $\|F^\top B - M^{\pi}/\rho\|$ (Appendix, Theorems 8 and 9, with various norms on $F^\top B - M^\pi/\rho$). For instance, if, for some reward $r$, the error

$$\Big| F(s_0,a_0,z_R)^\top B(s,a) - M^{\pi_{z_R}}(s_0,a_0,\mathrm{d}s,\mathrm{d}a)/\rho(\mathrm{d}s,\mathrm{d}a) \Big|$$

is at most $\varepsilon$ on average over $(s,a)\sim\rho$, for every $(s_0,a_0)$, then $\pi_{z_R}$ is $3\varepsilon\,\|r\|_\infty/(1-\gamma)$-optimal for $r$. These results justify using some norm of $F^\top B - M^{\pi_z}/\rho$, averaged over $z\in\mathbb{R}^d$, as a training loss for unsupervised reinforcement learning. (Below, we average over $z$ sampled from a fixed rescaled Gaussian. If prior information is available on the rewards $r$, the corresponding distribution of $z_R$ may be used instead.)

If $B$ is fixed in advance and only $F$ is learned, the method has similar properties to successor features based on $B$ (Appendix B.4). But one may set a large $d$ and let $B$ be learned: arguably, by Theorem 2, the resulting features linearize optimal policies as much as possible. The features learned in $F$ and $B$ may have broader interest.

## 4 Learning and Using Forward-Backward Representations

Our algorithm starts with an unsupervised learning phase, where we learn the representations $F$ and $B$ in a reward-free way, by observing state transitions in the environment, generated from any exploration scheme. Then, in a reward estimation phase, we estimate a policy parameter $z_R = \mathbb{E}[r(s,a)B(s,a)]$ from some reward observations, or directly set $z_R$ if the reward is known (e.g., set $z_R = B(s,a)$ to reach a known target $(s,a)$). In the exploitation phase, we directly use the policy $\pi_{z_R}(s) = \arg\max_a F(s,a,z_R)^\top z_R$.
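The following sketch shows one possible parameterization of $F$, $B$, and $\pi_z$ for discrete actions, in the spirit of the description above. The architecture details (layer sizes, how actions are handled) are our own illustrative assumptions, not the networks described in the paper's Appendix D.

```python
import torch
import torch.nn as nn

class ForwardNet(nn.Module):
    """F(s, ., z): one d-dimensional vector per discrete action (illustrative architecture)."""
    def __init__(self, state_dim, n_actions, d, hidden=256):
        super().__init__()
        self.n_actions, self.d = n_actions, d
        self.net = nn.Sequential(
            nn.Linear(state_dim + d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions * d),
        )

    def forward(self, s, z):                                # s: (N, state_dim), z: (N, d)
        out = self.net(torch.cat([s, z], dim=-1))
        return out.view(-1, self.n_actions, self.d)         # (N, n_actions, d)

class BackwardNet(nn.Module):
    """B(s, .): one d-dimensional vector per discrete action (illustrative architecture)."""
    def __init__(self, state_dim, n_actions, d, hidden=256):
        super().__init__()
        self.n_actions, self.d = n_actions, d
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions * d),
        )

    def forward(self, s):
        return self.net(s).view(-1, self.n_actions, self.d)  # (N, n_actions, d)

def greedy_action(F, s, z):
    """pi_z(s) = argmax_a F(s, a, z)^T z."""
    q = torch.einsum("nad,nd->na", F(s, z), z)               # Q(s, a) = F(s, a, z)^T z
    return q.argmax(dim=-1)
```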
The unsupervised learning phase. No rewards are used in this phase, and no family of tasks has to be specified manually. $F$ and $B$ are trained off-policy from observed transitions in the environment. The first condition of FB representations, $\pi_z(s) = \arg\max_a F(s,a,z)^\top z$, is just taken as the definition of $\pi_z$ given $F$. In turn, $F$ and $B$ are trained so that the second condition (2), $F(\cdot,\cdot,z)^\top B = M^{\pi_z}/\rho$, holds for every $z$. Here $\rho$ is the (unknown) distribution of state-actions in the training data.

Training is based on the Bellman equation for the successor measure $M^\pi$,

$$M^\pi(s_0,a_0,\{(s',a')\}) = \mathbb{1}_{s_0=s',\,a_0=a'} + \gamma\,\mathbb{E}_{s_1\sim P(\mathrm{d}s_1|s_0,a_0)}\, M^\pi(s_1,\pi(s_1),\{(s',a')\}). \qquad (4)$$

We leverage a well-principled algorithm from [BTO21] in the single-policy setting: it learns the successor measure of a policy without using the sparse reward $\mathbb{1}_{s_0=s',\,a_0=a'}$ (which would vanish in continuous spaces). Other successor measure algorithms could be used, such as C-learning [ESL21]. The algorithm from [BTO21] uses a parametric model $m^\theta(s_0,a_0,s',a')$ to represent $M^\pi(s_0,a_0,\mathrm{d}s',\mathrm{d}a') \approx m^\theta(s_0,a_0,s',a')\,\rho(\mathrm{d}s',\mathrm{d}a')$. It is not necessary to know $\rho$, only to sample states from it. Given an observed transition $(s_0,a_0,s_1)$ from the training set, generate an action $a_1 \sim \pi(a_1|s_1)$, and sample another state-action $(s',a')$ from the training set, independently from $(s_0,a_0,s_1)$. Then update the parameter $\theta$ by $\theta \leftarrow \theta + \eta\,\delta\theta$ with learning rate $\eta$ and

$$\delta\theta := \partial_\theta m^\theta(s_0,a_0,s_0,a_0) + \partial_\theta m^\theta(s_0,a_0,s',a')\,\big(\gamma\,m^\theta(s_1,a_1,s',a') - m^\theta(s_0,a_0,s',a')\big). \qquad (5)$$

This computes the density $m^\theta$ of $M^\pi$ with respect to the distribution $\rho$ of state-actions in the training set. Namely, the true successor state density $m = M^\pi/\rho$ is a fixed point of (5) in expectation [BTO21, Theorem 6] (and is the only fixed point in the tabular or overparameterized case). Variants exist, such as using a target network for $m^\theta(s_1,a_1,s',a')$ on the right-hand side, as in DQN.

Thus, we first choose a parametric model $F_\theta$, $B_\theta$ for the representations $F$ and $B$, and set $m^\theta_z(s_0,a_0,s',a') := F_\theta(s_0,a_0,z)^\top B_\theta(s',a')$. Then we iterate the update (5) over many state-actions and values of $z$. This results in Algorithm 1. At each step, a value of $z$ is picked at random, together with a batch of transitions $(s_0,a_0,s_1)$ and a batch of state-actions $(s',a')$ from the training set, with $(s',a')$ independent from $z$ and $(s_0,a_0,s_1)$. For sampling $z$, we use a fixed distribution (rescaled Gaussians, see Appendix D). Any number of values of $z$ may be sampled: this does not use up training samples. We use a target network with soft updates (Polyak averaging) as in DDPG. For training we also replace the greedy policy $\pi_z = \arg\max_a F(s,a,z)^\top z$ with a regularized version $\pi_z = \mathrm{softmax}(F(s,a,z)^\top z/\tau)$ with fixed temperature $\tau$ (Appendix D). Since there is unidentifiability between $F$ and $B$ (Appendix, Remark 7), we normalize $B$ via an auxiliary loss in Algorithm 1. A sketch of one such gradient step appears below.

For exploration in this phase, we use the policies being learned: the exploration policy chooses a random value of $z$ from some distribution (e.g., Gaussian), and follows $\pi_z$ for some time (Appendix, Algorithm 1). However, the algorithm can also work from an existing dataset of off-policy transitions.
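For concreteness, here is a simplified sketch of one gradient step on a loss whose negative gradient matches update (5) under the parameterization $m^\theta_z = F_\theta^\top B_\theta$, using a target network and the softmax-regularized policy. It omits several details of Algorithm 1 (the auxiliary normalization loss on $B$, Polyak averaging, the exact sampling of $z$), and the tensor shapes and helper names are our assumptions, building on the illustrative `ForwardNet`/`BackwardNet` modules above.

```python
import torch

def fb_td_loss(F, B, F_tgt, B_tgt, s0, a0, s1, s_p, a_p, z, gamma=0.99, temp=1.0):
    """One FB temporal-difference loss evaluation.
    s0, s1: (N, state_dim) transition batch; a0: (N,) long; z: (N, d);
    s_p, a_p: an independent batch of state-actions (s', a') sampled from the replay buffer."""
    idx0, idxp = torch.arange(len(a0)), torch.arange(len(a_p))
    F0a = F(s0, z)[idx0, a0]                         # F(s0, a0, z): (N, d)
    Bp = B(s_p)[idxp, a_p]                           # B(s', a'):    (M, d)
    m0 = F0a @ Bp.T                                  # m(s0_i, a0_i, s'_j, a'_j): (N, M)

    with torch.no_grad():                            # target / stop-gradient branch
        F1 = F_tgt(s1, z)                            # (N, A, d)
        q1 = torch.einsum("nad,nd->na", F1, z)       # Q-values defining the regularized policy
        pi1 = torch.softmax(q1 / temp, dim=-1)       # softmax relaxation of argmax_a
        B1 = B_tgt(s_p)[idxp, a_p]                   # (M, d)
        m1 = torch.einsum("nad,md->nam", F1, B1)     # m_target(s1, a, s'_j, a'_j)
        m1 = (pi1.unsqueeze(-1) * m1).sum(dim=1)     # expectation over a1 ~ pi_z(.|s1): (N, M)

    td_term = 0.5 * (m0 - gamma * m1).pow(2).mean()                 # off-diagonal part of (5)
    diag_term = (F0a * B(s0)[idx0, a0]).sum(-1).mean()              # m(s0, a0, s0, a0)
    return td_term - diag_term    # gradient descent on this matches theta <- theta + eta * delta
```

Minimizing this loss by stochastic gradient descent over sampled $z$, transition batches, and independent $(s',a')$ batches is the core of the unsupervised phase as we understand it; the published Algorithm 1 should be taken as the authoritative version.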
The reward estimation phase. Once rewards are available, we estimate a reward representation (policy parameter) $z_R$ by weighing the representation $B$ by the reward:

$$z_R := \mathbb{E}[r(s,a)\,B(s,a)] \qquad (6)$$

where the expectation must be computed over the same distribution $\rho$ of state-actions $(s,a)$ used to learn $F$ and $B$ (see Appendix B.5 for using a different distribution). Thus, if the reward is black-box as in standard RL algorithms, then the exploration policy has to be run again for some time, and $z_R$ is obtained by averaging $r(s,a)B(s,a)$ over the states visited. An approximate value for $z_R$ still provides an approximately optimal policy (Appendix, Prop. 10 and Thm. 12).

If the reward is known explicitly, this phase is unnecessary. For instance, if the reward is to reach a target state-action $(s_0,a_0)$ while avoiding some forbidden state-actions $(s_1,a_1),\ldots,(s_k,a_k)$, one may directly set

$$z_R = B(s_0,a_0) - \lambda \sum_{i=1}^k B(s_i,a_i) \qquad (7)$$

where the constant $\lambda$ sets the negative reward for forbidden states and adjusts for the unknown $\rho(\mathrm{d}s_i,\mathrm{d}a_i)$ factors in (3). This can be used for goal-oriented RL. If the reward is known algebraically as a function $r(s,a)$, then $z_R$ may be computed by averaging the function $r(s,a)B(s,a)$ over a replay buffer from the unsupervised training phase. We may also use a reward model $\hat r(s,a)$ of $r(s,a)$ trained on some reward observations from any source.

The exploitation phase. Once the reward representation $z_R$ has been estimated, the Q-function is estimated as

$$Q(s,a) = F(s,a,z_R)^\top z_R. \qquad (8)$$

The corresponding policy $\pi_{z_R}(s) = \arg\max_a Q(s,a)$ is used for exploitation (a short code sketch of these two phases is given below). Fine-tuning was not needed in our experiments, but it is possible to fine-tune the Q-function using actual rewards, by setting $Q(s,a) = F(s,a,z_R)^\top z_R + q^\theta(s,a)$, where the fine-tuning model $q^\theta$ is initialized to 0 and learned via any standard Q-learning method.

Incorporating prior information on rewards in B. Trying to plan in advance for all possible rewards in an arbitrary environment may be too generic and problem-agnostic, and may become difficult in large environments, requiring long exploration and a large $d$ to accommodate all rewards. In practice, we are often interested in rewards depending, not on the full state, but only on a part or some features of the state (e.g., a few components of the state, such as the position of an agent, or its neighborhood, rather than the full environment). If this is known in advance, the representation $B$ can be trained on that part of the state only, with the same theoretical guarantees (Appendix, Theorem 4). $F$ still needs to use the full state as input. This way, the FB model of the transition probabilities (1) only has to learn the future probabilities of the part of interest in the future states, based on the full initial state $(s_0,a_0)$.

Explicitly, if $\varphi: S\times A\to G$ is a feature map to some features $g = \varphi(s,a)$, and if we know that the reward will be a function $R(g)$, then Theorem 2 still holds with $B(g)$ everywhere instead of $B(s,a)$, and with the successor measure $M^\pi(s_0,a_0,\mathrm{d}g)$ instead of $M^\pi(s_0,a_0,\mathrm{d}s',\mathrm{d}a')$ (Appendix, Theorem 4). Learning is done by replacing $\partial_\theta m^\theta(s_0,a_0,s_0,a_0)$ with $\partial_\theta m^\theta(s_0,a_0,\varphi(s_0,a_0))$ in the first term of (5) [BTO21]. Rewards can be arbitrary functions of $g$, so this is more general than [BBQ+18], which only considers rewards linear in $g$. For instance, in Ms. Pacman below, we let $g$ be the 2D position $(x,y)$ of the agent, so we can optimize any reward function that depends on this position.
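A minimal sketch of the estimation and exploitation steps, eqs. (6)-(8), again building on the illustrative modules above (helper names and shapes are our assumptions): $z_R$ is either averaged over state-actions from the unsupervised replay buffer, or set directly from $B$ for a known goal-reaching reward as in (7).

```python
import torch

def estimate_z_R(B, states, actions, rewards):
    """Eq. (6): z_R = E_rho[ r(s,a) B(s,a) ], estimated over buffer state-actions.
    states: (N, state_dim), actions: (N,) long, rewards: (N,) float."""
    Bsa = B(states)[torch.arange(len(actions)), actions]        # (N, d)
    return (rewards.unsqueeze(-1) * Bsa).mean(dim=0)            # (d,)

def z_for_goal(B, s_goal, a_goal, forbidden=(), lam=1.0):
    """Eq. (7): reach (s_goal, a_goal) while avoiding forbidden state-actions.
    s_goal: batch of one state (1, state_dim); a_goal: int."""
    z = B(s_goal)[0, a_goal]
    for s_f, a_f in forbidden:
        z = z - lam * B(s_f)[0, a_f]
    return z

def act(F, s, z_R):
    """Eq. (8): Q(s,a) = F(s,a,z_R)^T z_R, then the greedy action."""
    z_batch = z_R.expand(len(s), -1)
    q = torch.einsum("nad,d->na", F(s, z_batch), z_R)
    return q.argmax(dim=-1)
```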
Limitations. First, this method does not solve exploration: it assumes access to a good exploration strategy. (Here we used the policies $\pi_z$ with random values of $z$, corresponding to random rewards.) Next, this task-agnostic approach is relevant if the reward is not known in advance, but may not bring the best performance on a particular reward. Mitigation strategies include: increasing $d$; using prior information on rewards by including relevant variables into $B$, as discussed above; and fine-tuning the Q-function at test time based on the initial $F^\top B$ estimate.

As reward functions are represented by a $d$-dimensional vector $z_R = \mathbb{E}[r\cdot B]$, some information about the reward is necessarily lost. Any reward uncorrelated to $B$ is treated as 0. The dimension $d$ controls how many types of rewards can be optimized well. A priori, a large $d$ may be required. Still, in the experiments, $d \approx 100$ manages navigation in a pixel-based environment with a huge state space. Appendix B.2 argues theoretically that $d = 2n$ is enough for navigation on an $n$-dimensional grid. The algorithm is linear in $d$, so $d$ can be taken as large as the neural network models can handle.

We expect this method to have an implicit bias for long-range behavior (spatially smooth rewards), while local details of the reward function may be blurred. Indeed, $F^\top B$ is optimized to approximate the successor measure $M^\pi = \sum_{t\geq 0}\gamma^t P_\pi^t$, with $P_\pi^t$ the $t$-step transition kernel for each policy $\pi$. The rank-$d$ approximation will favor large eigenvectors of $P_\pi$, i.e., small eigenvectors of the Markov chain Laplacian $\mathrm{Id} - \gamma P_\pi$. These loosely correspond to long-range (low-frequency) behavior [MM07]: presumably, $F$ and $B$ will learn spatially smooth rewards first. Indeed, experimentally, a small $d$ leads to spatial blurring of rewards and Q-functions (Fig. 3); the small numerical illustration at the end of this section makes the same point on a toy chain. Arguably, without any prior information this is a reasonable prior. [SBG17] have argued for the cognitive relevance of low-dimensional approximations of successor representations.

Variance is a potential issue in larger environments, although this did not arise in our experiments. Learning $M^\pi$ requires sampling a state-action $(s_0,a_0)$ and an independent state-action $(s',a')$. In large spaces, most state-action pairs will be unrelated. A possible mitigation is to combine FB with strategies such as Hindsight Experience Replay [ACR+17] to select goals related to the current state-action. The following may help a lot: the update of $F$ and $B$ decouples as an expectation over $(s_0,a_0)$, times an expectation over $(s',a')$. Thus, by estimating these expectations by a moving average over a dataset, it is easy to have many pairs $(s_0,a_0)$ interact with many $(s',a')$. The cost is handling full $d\times d$ matrices. This will be explored in future work.
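A small numerical illustration of this low-rank bias, under our own toy assumptions (a single fixed policy on a 20-state random walk rather than the full FB objective): the best rank-$d$ truncation of the successor matrix keeps the overall shape of the Q-function of a sharp single-state reward but blurs it locally.

```python
import numpy as np

# Illustrative only: rank-d truncation of M^pi = sum_t gamma^t P_pi^t for a random-walk chain.
n, gamma, d = 20, 0.95, 4
P = np.zeros((n, n))
for i in range(n):
    P[i, max(i - 1, 0)] += 0.5
    P[i, min(i + 1, n - 1)] += 0.5
M = np.linalg.inv(np.eye(n) - gamma * P)        # exact successor matrix
U, S, Vt = np.linalg.svd(M)
M_d = U[:, :d] @ np.diag(S[:d]) @ Vt[:d]        # best rank-d approximation (what F^T B can express)

r = np.zeros(n); r[7] = 1.0                     # reward concentrated on a single state
print(np.round(M @ r, 2))                       # exact Q^pi: peaked at state 7
print(np.round(M_d @ r, 2))                     # rank-d Q^pi: same long-range shape, locally blurred
```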
## 5 Experiments

We first consider the task of reaching arbitrary goal states. For this, we can make quantitative comparisons to existing goal-oriented baselines. Next, we illustrate qualitatively some tasks that cannot be tackled a posteriori by goal-oriented methods, such as introducing forbidden states. Finally, we illustrate some of the representations learned.

### 5.1 Environments and Experimental Setup

We run our experiments on a selection of environments that are diverse in terms of state space dimensionality, stochasticity, and dynamics. Discrete Maze is the classical gridworld with four rooms. States are represented by one-hot unit vectors. Continuous Maze is a two-dimensional environment with impassable walls. States are represented by their Cartesian coordinates $(x,y)\in[0,1]^2$. The execution of one of the actions moves the agent in the desired direction, but with normal random noise added to the position of the agent. FetchReach is a variant of the simulated robotic arm environment from [PAR+18] using discrete actions instead of continuous actions. States are 10-dimensional vectors consisting of positions and velocities of robot joints. Ms. Pacman is a variant of the Atari 2600 game Ms. Pacman, where an episode ends when the agent is captured by a monster [RUMS18]. States are obtained by processing the raw visual input directly from the screen. Frames are preprocessed by cropping, conversion to grayscale, and downsampling to 84x84 pixels. A state $s_t$ is the concatenation of the frames $(x_{t-12}, x_{t-8}, x_{t-4}, x_t)$, i.e., an 84x84x4 tensor. An action repeat of 12 is used. As Ms. Pacman is not originally a multi-goal domain, we define the goals as the 148 reachable coordinates $(x,y)$ on the screen; these can be reached only by learning to avoid monsters.

For all environments, we run algorithms for 800 epochs, with three different random seeds. Each epoch consists of 25 cycles where we interleave between gathering some amount of transitions, to add to the replay buffer, and performing 40 steps of stochastic gradient descent on the model parameters. To collect transitions, we generate episodes using some behavior policy. For both mazes, we use a uniform policy, while for FetchReach and Ms. Pacman, we use an ε-greedy policy with respect to the current approximation $F(s,a,z)^\top z$ for a sampled $z$. At evaluation time, ε-greedy policies are also used, with a smaller ε. More details are given in Appendix D.

### 5.2 Goal-Oriented Setting: Quantitative Comparisons

We investigate the FB representation over goal-reaching tasks and compare it to goal-oriented baselines: DQN (here short for goal-oriented DQN, $Q(s,a,g)$), and DQN with HER when needed. We define sparse reward functions. For Discrete Maze, the reward function is equal to one when the agent's state is exactly the goal state; we measure the quality of the obtained policy as the ratio between the true expected discounted reward of the policy for its goal and the true optimal value function, averaged over all states. For the other environments, the reward function is equal to one when the distance between the agent's position and the goal position is below some threshold, and zero otherwise. We assess policies by computing the average success rate, i.e., the average number of times the agent successfully reaches its goal.

Figure 1: Comparative performance of FB for different dimensions and DQN in FetchReach. Left: success rate averaged over 20 randomly selected goals as a function of the first 100 training epochs. Right: success rate averaged over 20 random goals after 800 training epochs.

Figure 2: Comparative performance of FB for different dimensions and DQN in Ms. Pacman. Left: success rate averaged over 20 randomly selected goals as a function of the first 200 training epochs. Right: success rate averaged over the goal space after 800 training epochs.

Figs. 1 and 2 show the comparative performance of FB for different dimensions $d$, and of DQN, in FetchReach and Ms. Pacman respectively (similar results in Discrete and Continuous Mazes are provided in Appendix D). In Ms. Pacman, DQN totally fails to learn and we had to add HER to make it work. The performance of FB consistently increases with the dimension $d$, and the best dimension matches the performance of the goal-oriented baseline. In Discrete Maze, we observe a drop of performance for $d = 25$ (Appendix D, Fig. 8): this is due to the spatial smoothing induced by the low-rank approximation and the reward being nonzero only if the agent is exactly at the goal. This spatial blurring is clear on heatmaps for $d = 25$ vs $d = 75$ (Fig. 3). With $d = 25$ the agent often stops right next to its goal.
To evaluate the sample efficiency of FB, after each epoch we evaluate the agent on 20 randomly selected goals. Learning curves are reported in Figs. 1 and 2 (left). In all environments, we observe no loss in sample efficiency compared to the goal-oriented baseline. In Ms. Pacman, FB even learns faster than DQN+HER.

### 5.3 More Complex Rewards: Qualitative Results

We now investigate FB's ability to generalize to new tasks that cannot be solved by an already trained goal-oriented model: reaching a goal with forbidden states imposed a posteriori, reaching the nearest of two goals, and choosing between a small, close reward and a large, distant one.

First, for the task of reaching a target position $g_0$ while avoiding some forbidden positions $g_1,\ldots,g_k$, we set $z_R = B(g_0) - \lambda\sum_{i=1}^k B(g_i)$ and run the corresponding ε-greedy policy defined by $F(s,a,z_R)^\top z_R$. Fig. 5 shows the resulting trajectories, which succeed at solving the task for the different domains. In Ms. Pacman, the path is suboptimal (though successful) due to the sudden appearance of a monster along the optimal path. (We only plot the initial frame; see the full series of frames along the trajectory in Appendix D, Fig. 16.) Fig. 4 (left) provides a contour plot of $\max_{a\in A} F(s,a,z_R)^\top z_R$ for the continuous maze and shows the landscape shape around the forbidden regions.

Next, we consider the task of reaching the closest target among two equally rewarding positions $g_0$ and $g_1$, by setting $z_R = B(g_0) + B(g_1)$. The optimal Q-function is not a linear combination of the Q-functions for $g_0$ and $g_1$. Fig. 6 shows successful trajectories generated by the policy $\pi_{z_R}$. On the contour plot of $\max_{a\in A} F(s,a,z_R)^\top z_R$ in Fig. 4 (right), the two rewarding positions appear as basins of attraction. Similar results for a third task are shown in Appendix D: introducing a distracting small reward next to the initial position of the agent, with a larger reward further away.

Figure 3: Heatmap of $\max_a F(s,a,z_R)^\top z_R$ for $z_R = B(g)$, with $g$ the goal state. Left: $d = 25$. Right: $d = 75$.

Figure 4: Contour plot of $\max_{a\in A} F(s,a,z_R)^\top z_R$ in Continuous Maze. Left: for the task of reaching a target while avoiding a forbidden region. Right: for two equally rewarding targets.

Figure 5: Trajectories generated by the $F^\top B$ policies for the task of reaching a target position (star shape) while avoiding forbidden positions (red shapes).

Figure 6: Trajectories generated by the $F^\top B$ policies for the task of reaching the closest among two equally rewarding positions (star shapes). (Optimal Q-values are not linear over such mixtures.)

### 5.4 Embedding Visualizations

We visualize the learned FB state embeddings for Continuous Maze by projecting them into two-dimensional space using t-SNE [VdMH08] in Fig. 7. For the forward embeddings, we set $z = 0$, corresponding to the uniform policy. We can see that FB partitions states according to the topology induced by the dynamics: states on opposite sides of walls are separated in the representation space, and states on the same side lie together. Appendix D includes embedding visualizations for different $z$ and for Discrete Maze and Ms. Pacman.

Figure 7: Visualization of FB embedding vectors on Continuous Maze after projecting them into two-dimensional space with t-SNE. Left: the states to be mapped. Middle: the F embedding. Right: the B embedding. The walls appear as large dents; the smaller dents correspond to the number of steps needed to get past a wall.
## 6 Related Work

[BBQ+18] learn optimal policies for rewards that are linear combinations of a finite number of feature functions provided in advance by the user. This approach cannot tackle generic rewards or goal-oriented RL: this would require introducing one feature per possible goal state, hence infinitely many features in continuous spaces. Our approach does not require user-provided features describing the future tasks, thanks to using successor states [BTO21] where [BBQ+18] use successor features. Schematically, and omitting actions, successor features start with user-provided features $\varphi$, then learn $\psi$ such that $\psi(s_0) = \sum_{t\geq 0}\gamma^t\,\mathbb{E}[\varphi(s_t)\mid s_0]$. This limits applicability to rewards that are linear combinations of $\varphi$. Here we use successor state probabilities, namely, we learn two representations $F$ and $B$ such that $F(s_0)^\top B(s') = \sum_{t\geq 0}\gamma^t\Pr(s_t = s'\mid s_0)$. This does not require any user-provided input. Thus we learn two representations instead of one.

The learned backward representation $B$ is absent from [BBQ+18]. $B$ plays a different role than the user-provided features $\varphi$ of [BBQ+18]: if the reward is known a priori to depend only on some features $\varphi$, we learn $B$ on top of $\varphi$, which represents all rewards that depend linearly or nonlinearly on $\varphi$. Up to a change of variables, [BBQ+18] is recovered by setting $B = \mathrm{Id}$ on top of $\varphi$, or $B = \varphi$ and $\varphi = \mathrm{Id}$, and then only training $F$. We use a similar parameterization of policies by $F(s,a,z)^\top z$ as in [BBQ+18], for similar reasons, although $z$ encodes a different object.

Successor representations were first defined in [Day93] for finite spaces, corresponding to an older object from Markov chains, the fundamental matrix [KS60, Bré99, GS97]. [SBG17] argue for their relevance to cognitive science. For successor representations in continuous spaces, a finite number of features $\varphi$ are specified first; this can be used for generalization within a family of tasks, e.g., [BDM+17, ZSBB17, GHB+19, HDB+19]. [BTO21] moves from successor features to successor states by providing pointwise occupancy map estimates even in continuous spaces, without using the sparse reward $\mathbb{1}_{s_t=s'}$. We borrow a successor state learning algorithm from [BTO21]. [BTO21] also introduced simpler versions of $F$ and $B$ for a single, fixed policy; [BTO21] does not consider the every-optimal-policy setting.

There is a long literature on goal-oriented RL. For instance, [SHGS15] learn goal-dependent value functions, regularized via an explicit matrix factorization. Goal-dependent value functions have been investigated in earlier works such as [FD02] and [SMD+11]. Hindsight experience replay (HER) [ACR+17] improves the sample efficiency of multiple-goal learning with sparse rewards. A family of rewards has to be specified beforehand, such as reaching arbitrary target states. Specifying rewards a posteriori is not possible: for instance, learning to reach target states does not extend to reaching the nearest among several goals, reaching a goal while avoiding forbidden states, or maximizing any dense reward. Hierarchical methods such as options [SPS99] can be used for multi-task RL problems. However, policy learning on top of the options is still needed after the task is known.

For finite state spaces, [JKSY20] use reward-free interactions to build a training set that summarizes a finite environment, in the sense that any optimal policies later computed on this training set instead of the true environment are provably ε-optimal, for any reward. They prove tight bounds on the necessary set size.
Policy learning still has to be done afterwards for each reward.

Acknowledgments. The authors would like to thank Léonard Blier, Diana Borsa, Alessandro Lazaric, Rémi Munos, Tom Schaul, Corentin Tallec, Nicolas Usunier, and the anonymous reviewers for numerous comments, technical questions, references, and invaluable suggestions for presentation that led to an improved text.

## 7 Conclusion

The forward-backward representation is a learnable mathematical object that summarizes a reward-free MDP. It provides near-optimal policies for any reward specified a posteriori, without planning. It is learned from black-box reward-free interactions with the environment. In practice, this unsupervised method performs comparably to goal-oriented methods for reaching arbitrary goals, but is also able to tackle more complex rewards in real time. The representations learned encode the MDP dynamics and may have broader interest.

## References

[ACR+17] Marcin Andrychowicz, Dwight Crow, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In NIPS, 2017.
[ARO+19] Ankesh Anand, Evan Racah, Sherjil Ozair, Yoshua Bengio, Marc-Alexandre Côté, and R Devon Hjelm. Unsupervised state representation learning in Atari. arXiv preprint arXiv:1906.08226, 2019.
[BBQ+18] Diana Borsa, André Barreto, John Quan, Daniel Mankowitz, Rémi Munos, Hado van Hasselt, David Silver, and Tom Schaul. Universal successor features approximators. arXiv preprint arXiv:1812.07626, 2018.
[BDM+17] André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, David Silver, and Hado P van Hasselt. Successor features for transfer in reinforcement learning. In NIPS, 2017.
[BNVB13] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.
[Bog07] Vladimir I Bogachev. Measure theory. Springer, 2007.
[Bré99] Pierre Brémaud. Markov chains: Gibbs fields, Monte Carlo simulation, and queues, volume 31. Springer, 1999.
[BTO21] Léonard Blier, Corentin Tallec, and Yann Ollivier. Learning successor states and goal-dependent values: A mathematical viewpoint. arXiv preprint arXiv:2101.07123, 2021.
[Day93] Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613-624, 1993.
[ESL21] Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine. C-learning: Learning to achieve goals via recursive classification. In International Conference on Learning Representations, 2021.
[FD02] David Foster and Peter Dayan. Structure in the space of value functions. Machine Learning, 49(2):325-346, 2002.
[GHB+19] Christopher Grimm, Irina Higgins, Andre Barreto, Denis Teplyashin, Markus Wulfmeier, Tim Hertweck, Raia Hadsell, and Satinder Singh. Disentangled cumulants help successor representations transfer to new tasks. arXiv preprint arXiv:1911.10866, 2019.
[GS97] Charles Miller Grinstead and James Laurie Snell. Introduction to probability. American Mathematical Society, 1997.
[HDB+19] Steven Hansen, Will Dabney, Andre Barreto, Tom Van de Wiele, David Warde-Farley, and Volodymyr Mnih. Fast task inference with variational intrinsic successor features. arXiv preprint arXiv:1906.05030, 2019.
[JKSY20] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870-4879. PMLR, 2020.
[KS60] J. G. Kemeny and J. L. Snell. Finite Markov Chains. Van Nostrand, New York, 1960.
[KST+18] Nan Rosemary Ke, Amanpreet Singh, Ahmed Touati, Anirudh Goyal, Yoshua Bengio, Devi Parikh, and Dhruv Batra. Modeling the long term future in model-based reinforcement learning. In International Conference on Learning Representations, 2018.
[MM07] Sridhar Mahadevan and Mauro Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8(10), 2007.
[PAR+18] Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.
[RUMS18] Paulo Rauber, Avinash Ummadisingu, Filipe Mutz, and Jürgen Schmidhuber. Hindsight policy gradients. In International Conference on Learning Representations, 2018.
[SB18] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2nd edition, 2018.
[SBG17] Kimberly L Stachenfeld, Matthew M Botvinick, and Samuel J Gershman. The hippocampus as a predictive map. Nature Neuroscience, 20(11):1643, 2017.
[SHGS15] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1312-1320, Lille, France, 07-09 Jul 2015. PMLR.
[SMD+11] Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems, Volume 2, pages 761-768, 2011.
[SPS99] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181-211, 1999.
[Tal17] Erik Talvitie. Self-correcting models for model-based reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
[VdMH08] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
[ZSBB17] Jingwei Zhang, Jost Tobias Springenberg, Joschka Boedecker, and Wolfram Burgard. Deep reinforcement learning with successor features for navigation across similar environments. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2371-2378. IEEE, 2017.