# Preferential Temporal Difference Learning

Nishanth Anand (1, 2) and Doina Precup (1, 2, 3)

(1) Mila (Quebec Artificial Intelligence Institute), Montreal, Canada. (2) School of Computer Science, McGill University, Montreal, Canada. (3) DeepMind, Montreal, Canada. Correspondence to: Nishanth Anand.

*Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).*

## Abstract

Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When the agent lands in a state, its value can be used to compute the TD-error, which is then propagated to other states. However, it may be interesting, when computing updates, to take into account other information than whether a state is visited or not. For example, some states might be more important than others (such as states which are frequently seen on successful trajectories). Or, some states might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach to re-weighting the states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.

## 1. Introduction

The value function is a crucial quantity in reinforcement learning (RL), summarizing the expected long-term return from a state or a state-action pair. The agent uses this knowledge to make informed action decisions. Temporal Difference (TD) learning methods (Sutton, 1988) enable updating the value function before the end of an agent's trajectory by contrasting its return predictions over consecutive time steps, i.e., computing the temporal-difference error (TD-error). State-of-the-art RL algorithms, e.g., Mnih et al. (2015) and Schulman et al. (2017), use this idea coupled with function approximation.

TD-learning can be viewed as a way to approximate dynamic programming algorithms in Markovian environments (Barnard, 1993). But if the Markovian assumption does not hold (as is the case when function approximation is used to estimate the value function), its use can create problems (Gordon, 1996; Sutton & Barto, 2018). To see this, consider the situation depicted in Figure 1a, where an agent starts in a fully observable state and chooses one of two available actions. Each action leads to a different long-term outcome, but the agent navigates through aliased states that are distinct but have the same representation before observing the outcome. This setting poses two challenges:

1. Temporal credit assignment: The starting state and the outcome state are temporally distant. Therefore, an efficient mechanism is required to propagate the credit (or blame) between them.
2. Partial observability: With function approximation, updating one state affects the value prediction at other states. If the generalization is poor, TD-updates at partially observable states can introduce errors, which propagate to the estimates at fully observable states.

TD(λ) is a well-known class of span-independent algorithms (van Hasselt & Sutton, 2015) for temporal credit assignment, introduced by Sutton (1988) and further developed in many subsequent works, e.g.,
Singh & Sutton (1996) and Seijen & Sutton (2014). It uses a recency heuristic: any TD-error is attributed to preceding states with an exponentially decaying weight. However, recency can lead to inefficient propagation of credit (Aberdeen, 2004; Harutyunyan et al., 2019). Generalized TD(λ) with a state-dependent λ can tackle this problem (Sutton, 1995; Sutton et al., 1999; 2016; Yu et al., 2018). This approach mainly modifies the target used in the update, not the extent to which any state is updated. Moreover, the resulting update sequence may not converge. Emphatic TD (Sutton et al., 2016) uses an independent interest function, not connected to the eligibility parameter, in order to modulate how much the states on a trajectory are updated.

In this paper, we introduce and analyze a new algorithm: Preferential Temporal Difference (Preferential TD or PTD) learning, which uses a single state-dependent preference function to emphasize the importance of a state both as an input for the update and as a participant in the target. If a state has a low preference, it will not be updated much, and its value will also not be used much as a bootstrapping target. This mechanism allows, for example, skipping states that are partially observable and propagating the information instead between fully observable states. Older POMDP literature (Loch & Singh, 1998; Theocharous & Kaelbling, 2004) has demonstrated the utility of similar ideas empirically. We provide a convergence proof for the expected updates of Preferential TD in the linear function approximation setting and illustrate its behaviour in partially observable environments.

*Figure 1. Delayed-effect MDPs: decision states are shown as boxes, goal states as circles. Feature vectors are specified next to the states. The corridor represents a chain of partially observable states.*

## 2. Preliminaries and Notation

A Markov Decision Process (MDP) is defined by a tuple $M = (S, A, P, r, \gamma)$, where $S$ is the finite set of states, $A$ is the finite set of actions, $P(s'|s, a) = \mathbb{P}\{s_{t+1} = s' \mid s_t = s, a_t = a\}$ is the transition model, $r : S \times A \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. The agent interacts with the environment in state $s_t \in S$ by selecting an action $a_t \in A$ according to its policy, $\pi(a|s) = \mathbb{P}\{a_t = a \mid s_t = s\}$. As a consequence, the agent transitions to a new state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ and receives reward $r_{t+1} = r(s_t, a_t)$. We consider the policy evaluation setting, where the agent's goal is to estimate the value function:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s], \qquad (1)$$

where $G_t = \sum_{i=t}^{\infty} \gamma^{i-t} r_{i+1}$ is the discounted return obtained by following the policy π from state $s$. If $|S|$ is very large, $v_\pi$ must be approximated using a function approximator. We consider linear approximations:

$$v_\mathbf{w}(s) = \mathbf{w}^\top \phi(s), \qquad (2)$$

where $\mathbf{w} \in \mathbb{R}^k$ is the parameter vector and $\phi(s)$ is the feature vector for state $s$. Note that the linear case encompasses both the tabular case and the use of fixed non-linear features, as detailed in Sutton & Barto (2018). By defining a matrix $\Phi$ whose rows are $\phi^\top(i)$, $i \in S$, this approximation can be written as a vector in $\mathbb{R}^{|S|}$ as $v_\mathbf{w} = \Phi \mathbf{w}$.
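
To make these objects concrete, here is a small illustrative sketch (ours, not taken from the paper): it builds a made-up three-state Markov reward process, computes the exact value function $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$, and fits a linear approximation $v_\mathbf{w} = \Phi \mathbf{w}$ by least squares. All numbers are placeholders.

```python
import numpy as np

# Toy 3-state Markov reward process under a fixed policy (illustrative numbers only).
P_pi = np.array([[0.1, 0.9, 0.0],
                 [0.0, 0.1, 0.9],
                 [0.3, 0.0, 0.7]])
r_pi = np.array([0.0, 0.0, 1.0])
gamma = 0.9

# Exact value function: v_pi = (I - gamma * P_pi)^{-1} r_pi.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Linear approximation v_w = Phi @ w with two features per state (least squares here,
# just to visualize the approximation; TD methods estimate w incrementally instead).
Phi = np.array([[1.0, 0.0],
                [1.0, 0.5],
                [1.0, 1.0]])
w, *_ = np.linalg.lstsq(Phi, v_pi, rcond=None)
print("exact v_pi:     ", v_pi)
print("approx Phi @ w: ", Phi @ w)
```
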
TD(λ) (Sutton, 1984; 1988) is an online algorithm for updating $\mathbf{w}$, which performs the following update after every time step:

$$
\begin{aligned}
e_t &= \gamma \lambda e_{t-1} + \phi(s_t), \\
\mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha_t \big(r_{t+1} + \gamma \mathbf{w}_t^\top \phi(s_{t+1}) - \mathbf{w}_t^\top \phi(s_t)\big)\, e_t,
\end{aligned} \qquad (3)
$$

where $\alpha_t$ is the learning rate, $\lambda \in [0, 1]$ is the eligibility trace parameter, used to propagate TD-errors with exponential decay to states that are further back in time, and $e_t$ is the eligibility trace.

## 3. Preferential Temporal Difference Learning

Let $\beta : S \to [0, 1]$ be a preference function, which assigns a certain importance to each state. A preference of $\beta(s) = 0$ means that the value function will not be updated at all when $s$ is visited, while $\beta(s) = 1$ means $s$ will receive a full update. Note, however, that if $s$ is not updated, its value will be completely inaccurate, and hence it should not be used as a target for predecessor states, because it would lead to a biased update. To prevent this, we can modify the return $G_t$ to a form similar to λ-returns (Watkins, 1989; Sutton & Barto, 2018), by bootstrapping according to the preference:

$$G^\beta_t = r_{t+1} + \gamma \big[\beta(s_{t+1})\, \mathbf{w}^\top \phi(s_{t+1}) + (1 - \beta(s_{t+1}))\, G^\beta_{t+1}\big]. \qquad (4)$$

The update corresponding to this return leads to the offline Preferential TD algorithm:

$$
\begin{aligned}
\mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha_t \beta(s_t)\big(G^\beta_t - \mathbf{w}^\top \phi(s_t)\big)\,\phi(s_t) \\
&= \mathbf{w}_t + \alpha_t \Big(\underbrace{\beta(s_t)\, G^\beta_t + (1 - \beta(s_t))\, \mathbf{w}^\top \phi(s_t)}_{\text{target}} - \mathbf{w}^\top \phi(s_t)\Big)\,\phi(s_t).
\end{aligned} \qquad (5)
$$

The expected target of offline PTD (cf. equation 5) can be written as a Bellman operator:

**Theorem 1.** The expected target in the forward view can be summarized using the operator

$$T^\beta v = B\big(I - \gamma P_\pi (I - B)\big)^{-1}\big(r_\pi + \gamma P_\pi B v\big) + (I - B)\, v,$$

where $B$ is the $|S| \times |S|$ diagonal matrix with $\beta(s)$ on its diagonal, and $r_\pi$ and $P_\pi$ are the state reward vector and the state-to-state transition matrix for policy π.

We obtain the desired result by considering expected updates in vector form. The complete proof is provided in Appendix A.1.

Using equation 4, one would need to wait until the end of the episode to compute $G^\beta_t$. We can turn this into an online update rule using the eligibility trace mechanism (Sutton, 1984; 1988):

$$
\begin{aligned}
e_t &= \gamma (1 - \beta(s_t))\, e_{t-1} + \beta(s_t)\, \phi(s_t), \\
\mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha_t \big(r_{t+1} + \gamma \mathbf{w}_t^\top \phi(s_{t+1}) - \mathbf{w}_t^\top \phi(s_t)\big)\, e_t,
\end{aligned} \qquad (6)
$$

where $e_t$ is the eligibility trace as before. The equivalence between the offline and online algorithms can be obtained following Sutton (1988) and Sutton & Barto (2018); Theorem 2 in Section 4 formalizes this equivalence. The update equations yield Algorithm 1.

**Algorithm 1: Preferential TD with linear function approximation**

- Input: π, γ, β, φ
- Initialize: $\mathbf{w}_{-1} = 0$, $e_{-1} = 0$
- Output: $\mathbf{w}_T$
- For $t = 0, \dots, T$:
  - Take action $a \sim \pi(\cdot \mid s_t)$, observe $r_{t+1}, s_{t+1}$
  - $v(s_t) = \mathbf{w}_t^\top \phi(s_t)$, $\; v(s_{t+1}) = \mathbf{w}_t^\top \phi(s_{t+1})$
  - $\delta_t = r_{t+1} + \gamma v(s_{t+1}) - v(s_t)$
  - $e_t = \beta(s_t)\,\phi(s_t) + \gamma(1 - \beta(s_t))\, e_{t-1}$
  - $\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha_t \delta_t e_t$
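
For concreteness, the following is a minimal NumPy sketch of Algorithm 1 (our own rendering, not the authors' released code). The environment, policy, preference, and feature interfaces (`env.reset()`, `env.step(a)`, `policy(s)`, `beta(s)`, `phi(s)`) and the constant step size are assumptions made for illustration.

```python
import numpy as np

def preferential_td(env, policy, beta, phi, num_episodes, k, alpha=0.05, gamma=1.0):
    """Online Preferential TD (Algorithm 1) with linear function approximation.

    Assumed (hypothetical) interface: env.reset() -> s, env.step(a) -> (s_next, r, done);
    policy(s) -> action; beta(s) -> preference in [0, 1]; phi(s) -> feature vector in R^k.
    A constant step size alpha is used for simplicity.
    """
    w = np.zeros(k)
    for _ in range(num_episodes):
        s = env.reset()
        e = np.zeros(k)                                       # eligibility trace
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            v_s = w @ phi(s)
            v_next = 0.0 if done else w @ phi(s_next)
            delta = r + gamma * v_next - v_s                  # TD-error
            e = beta(s) * phi(s) + gamma * (1 - beta(s)) * e  # preference-weighted trace (eq. 6)
            w = w + alpha * delta * e
            s = s_next
    return w
```

With $\beta(s) = 1$ everywhere this reduces to TD(0); with $\beta(s) = 0$, a state contributes nothing to its own update and, through the trace, its TD-error is credited to earlier preferred states instead.
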
## 4. Convergence of PTD

We consider a finite, irreducible, aperiodic Markov chain. Let $\{s_t \mid t = 0, 1, 2, \dots\}$ be the sequence of states visited by the chain and let $d_\pi(s)$ denote the steady-state probability of $s \in S$. We assume that $d_\pi(s) > 0$ for all $s \in S$. Let $D_\pi$ be the $|S| \times |S|$ diagonal matrix with $[D_\pi]_{i,i} = d_\pi(i)$. We denote the Euclidean norm on vectors, and the induced norm on matrices, by $\|\cdot\|$, i.e., $\|A\| = \max_{\|x\|=1} \|Ax\|$. We make the following assumptions to establish the convergence result. (Similar assumptions are made to establish the convergence of TD(λ) with linear function approximation for a constant λ; Tsitsiklis & Van Roy, 1997.)

**Assumption 1.** The feature matrix $\Phi \in \mathbb{R}^{|S| \times k}$ has full column rank, i.e., the column vectors $\{\phi_i \mid i = 1, \dots, k\}$ are linearly independent. Also, $\|\Phi\| \le M$, where $M$ is a constant.

**Assumption 2.** The Markov chain mixes rapidly: $|P_\pi(s_t = s \mid s_0) - d_\pi(s)| \le C\rho^t$ for all $s_0 \in S$, where $\rho < 1$ and $C$ is a constant.

**Assumption 3.** The sequence of step sizes satisfies the Robbins-Monro conditions: $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$.

The update equations of Preferential TD (cf. equation 6) can be written as

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha_t\big(b(X_t) - A(X_t)\,\mathbf{w}_t\big),$$

where $b(X_t) = e_t r_{t+1}$, $A(X_t) = e_t\big(\phi(s_t) - \gamma\phi(s_{t+1})\big)^\top$, $X_t = (s_t, s_{t+1}, e_t)$, and $e_t$ is the eligibility trace of PTD. Let $A = \mathbb{E}_{d_\pi}[A(X_t)]$ and $b = \mathbb{E}_{d_\pi}[b(X_t)]$. Let $P^\beta_\pi$ be a new transition matrix that accounts for the termination due to bootstrapping and discounting, defined as

$$P^\beta_\pi = \sum_{k=0}^{\infty}\big(\gamma P_\pi (I - B)\big)^k\,\gamma P_\pi B.$$

This is a sub-stochastic matrix for $\gamma < 1$ and a stochastic matrix when $\gamma = 1$. ($P^\beta_\pi$ is analogous to ETD's $P^\lambda_\pi$ (Sutton et al., 2016), with $\beta(s)$ taking the role of $1 - \lambda(s)$; we consider a constant discount setting.)

**Lemma 1.** The expected quantities $A$ and $b$ are given by $A = \Phi^\top D_\pi B\,(I - P^\beta_\pi)\,\Phi$ and $b = \Phi^\top D_\pi B\,\big(I - \gamma P_\pi (I - B)\big)^{-1} r_\pi$.

We use the proof template from Emphatic TD (Sutton et al., 2016) to get the desired result. The proof is provided in Appendix A.1. We are now equipped to establish the forward-backward equivalence.

**Theorem 2.** The forward and the backward views of PTD are equivalent in expectation:

$$b - A\mathbf{w} = \Phi^\top D_\pi\big(T^\beta(\Phi\mathbf{w}) - \Phi\mathbf{w}\big).$$

The proof is provided in Appendix A.2. The next two lemmas verify certain conditions on $A$ that are required to show that it is positive definite.

**Lemma 2.** $A$ has positive diagonal elements and positive row sums.

*Proof.* Consider the term $(I - P^\beta_\pi)$. $P^\beta_\pi$ is a sub-stochastic matrix for $\gamma \in [0, 1)$; hence $\sum_j [P^\beta_\pi]_{i,j} < 1$. Therefore, $\sum_j [I - P^\beta_\pi]_{i,j} = 1 - \sum_j [P^\beta_\pi]_{i,j} > 0$. Additionally, $[I - P^\beta_\pi]_{i,i} > 0$ since $[P^\beta_\pi]_{i,i} < 1$.

**Lemma 3.** The column sums of $A$ are positive.

The idea of the proof is to show that $d_\pi B P^\beta_\pi = d_\pi B$ for $\gamma = 1$. This implies $d_\pi B (I - P^\beta_\pi) = 0$ for $\gamma = 1$ and $d_\pi B (I - P^\beta_\pi) > 0$ for $\gamma \in [0, 1)$, since $d_\pi B P^\beta_\pi < d_\pi B$ in that case. Therefore, the column sums are positive. The complete proof is provided in Appendix A.2. Note that $d_\pi B$ is not the stationary distribution of $P^\beta_\pi$, as it is unnormalized.
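
Before stating the positive-definiteness result, here is a quick numerical sanity check (our own illustration, not part of the paper) of the quantities above: the sketch builds $P^\beta_\pi$ and the key matrix $D_\pi B (I - P^\beta_\pi)$ for a random chain with strictly positive preferences (zero-preference states can be excluded from the analysis) and checks the sub-stochasticity and the row- and column-sum properties used in Lemmas 2 and 3. The chain and preferences are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.95

# Random irreducible chain and strictly positive preferences (illustrative only).
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
beta = rng.uniform(0.1, 1.0, size=n)
B, I = np.diag(beta), np.eye(n)

# Stationary distribution d_pi: normalized left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d /= d.sum()

# P^beta_pi = (I - gamma P (I - B))^{-1} gamma P B.
P_beta = np.linalg.solve(I - gamma * P @ (I - B), gamma * P @ B)
key = np.diag(d) @ B @ (I - P_beta)      # key matrix D_pi B (I - P^beta_pi)

print("sub-stochastic (gamma < 1):", np.all(P_beta.sum(axis=1) < 1.0))
print("positive diagonal:         ", np.all(np.diag(key) > 0))   # Lemma 2
print("positive row sums:         ", np.all(key.sum(axis=1) > 0))  # Lemma 2
print("positive column sums:      ", np.all(key.sum(axis=0) > 0))  # Lemma 3
```
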
**Lemma 4.** $A$ is positive definite.

*Proof.* From Lemmas 2 and 3, the row sums and the column sums of $A$ are positive. Also, the diagonal elements of $A$ are positive. Therefore, $A + A^\top$ is a strictly diagonally dominant matrix with positive diagonal elements. We can use Corollary 1.22 from Varga (1999) (provided in Appendix A.1 for completeness) to conclude that $A + A^\top$ is positive definite. Hence, $A$ is also positive definite.

The next lemma shows that the bias resulting from the initial updates, i.e., before the chain has reached stationarity, is controlled.

**Lemma 5.** Under Assumption 2, $\sum_{t=0}^{\infty} \big\|\mathbb{E}_\pi[A(X_t) \mid X_0] - A\big\| \le C_1$ and $\sum_{t=0}^{\infty} \big\|\mathbb{E}_\pi[b(X_t) \mid X_0] - b\big\| \le C_2$, where $C_1$ and $C_2$ are constants.

The proof is provided in Appendix A.3. We now present the main result of the paper.

**Theorem 3.** The sequence of expected updates computed by Preferential TD converges to a unique fixed point.

*Proof.* We can establish convergence by satisfying all the conditions of a standard result from the stochastic approximation literature, such as Theorem 2 in Tsitsiklis & Van Roy (1997) or Proposition 4.8 in Bertsekas & Tsitsiklis (1996) (provided in Appendix A.3 for completeness). Assumption 3 satisfies the step-size requirement. Lemmas 1 and 4 meet conditions 3 and 4. Lemma 5 controls the noise from sampling, hence satisfying the final requirement. Therefore, the expected update sequence of Preferential TD converges.

The fixed point lies in the span of the feature matrix, and it has zero projected Bellman error. That is,

$$\Pi\big(T^\beta(\Phi\mathbf{w}_\pi) - \Phi\mathbf{w}_\pi\big) = 0,$$

where $\Pi = \Phi(\Phi^\top D_\pi \Phi)^{-1}\Phi^\top D_\pi$ is the projection matrix.

**Consequences:** It is well known that using state-dependent bootstrapping or re-weighing updates in TD(λ) can cause convergence issues even in the on-policy setting; this was demonstrated using counterexamples (Mahmood, 2017; Ghiassian et al., 2017). Our result establishes convergence when a state-dependent bootstrapping parameter is used in addition to re-weighing the updates. We analyze these counterexamples below and show that they do not apply to Preferential TD. This makes Preferential TD one of only two algorithms known to converge under standard assumptions in the function approximation setting, Emphatic TD being the other (Yu, 2015).

**Counterexample 1 (Mahmood, 2017):** A two-state MDP with

$$P_\pi = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}, \quad \Phi = \begin{pmatrix} 0.5 \\ 1 \end{pmatrix}, \quad \lambda = \begin{pmatrix} 0.99 \\ 0.8 \end{pmatrix}, \quad d_\pi = \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}, \quad \gamma = 0.99.$$

The key matrix of TD(λ) is $A = -0.0429$, which is not positive definite; therefore, the updates will not converge. The key matrix of Preferential TD for the same setup is $A = 0.009$, which is positive definite. (We set $\lambda(s_0) = 0.99$ instead of 1, a small change to the original example, because $\beta(s_0) = 0$ when $\lambda(s_0) = 1$, and such states can be excluded from the analysis.) The second counterexample is analyzed in Appendix A.3.
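
This claim is easy to check numerically. The sketch below (ours) recomputes both quantities, using the standard key matrix $\Phi^\top D_\pi (I - \gamma P_\pi \Lambda)^{-1}(I - \gamma P_\pi)\Phi$ for TD(λ) with state-dependent λ, and the expression from Lemma 1 for PTD with $\beta = 1 - \lambda$. Small rounding differences from the values quoted above are expected.

```python
import numpy as np

gamma = 0.99
P   = np.array([[0.5, 0.5], [0.5, 0.5]])
Phi = np.array([[0.5], [1.0]])
lam = np.array([0.99, 0.8])
d   = np.array([0.5, 0.5])

D, Lam, I = np.diag(d), np.diag(lam), np.eye(2)

# TD(lambda) key matrix with state-dependent lambda:
# A_td = Phi^T D (I - gamma P Lambda)^{-1} (I - gamma P) Phi
A_td = Phi.T @ D @ np.linalg.solve(I - gamma * P @ Lam, I - gamma * P) @ Phi

# Preferential TD with beta = 1 - lambda (Lemma 1):
beta = 1.0 - lam
B = np.diag(beta)
P_beta = np.linalg.solve(I - gamma * P @ (I - B), gamma * P @ B)
A_ptd = Phi.T @ D @ B @ (I - P_beta) @ Phi

print(A_td)    # approx -0.043: not positive definite, so TD(lambda) can diverge here
print(A_ptd)   # positive: PTD's expected updates converge on this example
```
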
## 5. Related Work

A natural generalization of TD(λ) is to make λ a state-dependent function (Sutton, 1995; Sutton et al., 1999; 2016). This setting produces significant improvements in many cases (Sutton, 1995; Sutton et al., 2016; White & White, 2016; Xu et al., 2018). However, it only modifies the target during bootstrapping, not the extent to which states are updated.

The closest approach in spirit to ours is Emphatic TD(λ) (Emphatic TD or ETD), which introduces a framework to reweigh the updates using an interest function ($i(s) \in \mathbb{R}^+$, $s \in S$) (Sutton et al., 2016). Emphatic TD uses the interest of past states and the state-dependent bootstrap parameter λ of the current state to construct the emphasis, $M_t \in \mathbb{R}^+$, at time $t$. The updates are then weighted using $M_t$. By this construction, the emphasis is a trajectory-dependent quantity: two different trajectories leading to a particular state place different emphasis on its updates. This can be beneficial in the off-policy case, where one may want to carry along importance sampling ratios from the past. However, a trajectory-based quantity can result in high variance, as the same state could get different updates depending upon its past.

Emphatic TD and Preferential TD share the idea of reweighing the updates based on the agent's preference for states. Emphatic TD uses a trajectory-based quantity, whereas Preferential TD uses a state-dependent parameter to reweigh the updates. Furthermore, Preferential TD uses a single parameter to update and bootstrap, instead of two in Emphatic TD. Our analysis in Section 4 suggests that the two algorithms' fixed points are also different. However, we can set up Emphatic TD to achieve a similar (though not identical) effect to PTD when $\beta \in \{0, 1\}$, by setting $\lambda = 1 - \beta$ and the interest of a state proportional to the preference for that state. In our experiments, we use this setting as well, denoted as the ETD-variable algorithm. The fact that preference is a state-dependent quantity fits well with the intuition that if a state is partially observable, we may always want to avoid involving it in bootstrapping, regardless of the trajectory on which it occurs.

Preferential TD thus lies in between TD(λ) and Emphatic TD. Like TD(λ), it uses a single state-dependent bootstrap function. Like Emphatic TD, it reweighs the updates, but using only the preference.

Other works share the idea of ignoring updates at partially observable states, e.g., Xu et al. (2017) and Thodoroff et al. (2019); they use a trajectory-based value as a substitute for the value of a partially observable state. Temporal value transport (Hung et al., 2019) uses an attention mechanism to pick past states to update, bypassing partially observable states for credit assignment. However, the theoretical properties of trajectory-dependent values or attention-based credit assignment are poorly understood at the moment. Our method is unbiased and, as seen in Section 4, it can be understood well from a theoretical standpoint. Our ideas are also related to the idea of a backward model (Chelu et al., 2020), in which an explicit model of precursor states is constructed and used to propagate TD-errors in a planning-style algorithm, with a similar motivation of avoiding partially observable states and improving credit assignment. However, our approach is model-free.

## 6. Illustrations

In this section, we test Preferential TD on policy evaluation tasks in four different settings: tabular, linear, semi-linear (a linear predictor with non-linear features), and non-linear (end-to-end training). (Code to reproduce the results can be found here.) Note that our theoretical results do not cover the last setup; however, it is quite easy to implement the algorithm in that setting.

### 6.1. Tabular

In this setting, the value function is represented as a look-up table. This experiment is provided to help understand PTD in the limited-sample setting for various choices of constant preference function.

**Task description:** We consider the 19-state random walk problem from Sutton (1988) and Sutton & Barto (2018). In this MDP, there are 19 states connected sequentially, with two terminal states on either side. From each state, the agent can choose to go left or right. The agent gets a reward of +1 when it transitions to the rightmost terminal state and -1 in the leftmost one; otherwise, rewards are 0. We set γ = 1 and the policy to uniformly random. We compare Emphatic TD, TD(λ), and PTD on this task. We consider fixed parameters for these methods, i.e., the interest in ETD is fixed, λ in ETD and TD(λ) is fixed, and β in PTD is fixed. For ETD, we selected a constant interest of 0.01 on all states (selected from {0.01, 0.05, 0.1, 0.25} based on a hyperparameter search) for each λ-α pair. We consider fixed parameters to understand the differences between the three algorithms.

**Observations:** The U-curves obtained are presented in Figure 2, which depicts the Root Mean Square Error (RMSE) over all states, obtained in the first 10 training episodes, as a function of α. The various curves correspond to different choices of fixed λ (or β). In TD(λ), when λ is close to 0, the updates are close to TD(0), and hence there is high bias. When λ is close to 1, the updates are close to Monte Carlo, and there is high variance. A good bias-variance trade-off is obtained for intermediate values of λ, resulting in low RMSE. A similar trade-off can be observed in PTD; however, the trade-off is different from TD(λ).
A β value close to 1 results in a TD(0)-style update (similar to λ = 0), but β close to 0 results in negligible updates (as opposed to a Monte Carlo update for similar values of λ). The resulting bias-variance trade-off is better. TD(λ) is very sensitive to the value of the learning rate: when λ gets close to 1, the RMSE is high even for a low learning rate, resulting in sharp U-curves, because of the high variance in the updates. This is not the case for PTD, which is relatively stable for all values of β over a wide range of learning rates. PTD exhibits the best behaviour when using a learning rate greater than 1 for some values of β. This behaviour can be attributed to the negligible updates performed when a return close to Monte Carlo (β close to 0) is used; due to the negligible updating, the variance in the updates is contained. The performance of ETD was slightly worse than the other two algorithms. ETD is very sensitive to the value of the interest function: we observed a large drop in performance for very small changes in the interest value. ETD also performs poorly with low learning rates when the interest is set high. This behaviour can be attributed to the high variance caused by the emphasis.

*Figure 2. Root Mean Square Error (RMSE) of Emphatic TD (interest 0.01), TD(λ), and Preferential TD over the first 10 episodes, as a function of the learning rate α. Different curves correspond to different values of λ or β (depending on the algorithm).*

### 6.2. Linear setting

**Task description:** We consider two toy tasks to evaluate PTD with linear function approximation. In the first task, depicted in Figure 1a, the agent starts in S1 and can pick one of the two available actions, {up, down}, transitioning onto the corresponding chain. A goal state is attached at the end of each chain. Rewards of +2 and -1 are obtained upon reaching the goal state on the upper and lower chains, respectively. Each chain contains a sequence of partially observable states (the corridor); in these states, both actions move the agent towards the goal by one state. For the second task, we attach two MDPs of the same kind as in Task 1, as shown in Figure 1b. The actions in the connecting states S1, S2, S3 transition the agent to the upper and lower chains, but both actions have the same effect in the partially observable states (the corridor). A goal state is attached at the end of each chain. To make the task challenging, a stochastic reward (as shown in Figure 1b) is given to the agent. In both tasks, the connecting states and the goal states are fully observable and have a unique representation, while the states in the corridor are represented by Gaussian noise, N(0.5, 1).

*Figure 3. The mean squared error of the fully observable states' values as a function of episodes for TD(λ), ETD-fixed, ETD-variable, and PTD on (a) Task 1 and (b) Task 2. Different panels correspond to different corridor lengths.*
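
One plausible way to realize this representation in code is sketched below (ours, not the paper's implementation). The feature dimension and corridor length are placeholders, and we read "represented by Gaussian noise" as fresh noise drawn on every visit; the preference assignment anticipates the setup described next.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM = 8        # placeholder feature dimension
CORRIDOR_LEN = 5    # placeholder corridor length

def phi(state):
    """state = ("fo", index) for fully observable decision/goal states,
    ("corridor", index) for partially observable corridor states."""
    kind, idx = state
    if kind == "fo":
        v = np.zeros(FEAT_DIM)   # unique indicator feature per FO state
        v[idx] = 1.0
        return v
    # Corridor states: Gaussian noise N(0.5, 1), redrawn on each visit.
    return rng.normal(loc=0.5, scale=1.0, size=FEAT_DIM)

def beta(state):
    """Preference: 1 on fully observable states, 0 on corridor states."""
    return 1.0 if state[0] == "fo" else 0.0
```
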
**Setup:** We tested four algorithms: TD(λ), PTD, and two versions of ETD. The first ETD version has a fixed interest on all states (referred to as ETD-fixed), and the second variant has a state-dependent interest (referred to as ETD-variable). We set the preference on fully observable states to 1 and on partially observable states to 0. For a fair comparison, we make λ a state-dependent quantity by setting it to 0 on fully observable states and to 1 on partially observable states for TD(λ) and ETD. In the case of ETD-fixed, we set the interest of all states to 0.01 (selected from {0.01, 0.05, 0.1} based on a hyperparameter search; see Appendix A.4.2 for details). For ETD-variable, we set the interest to 0.5 on fully observable states and 0 on all other states; this choice results in an emphasis of 1 on fully observable states. ETD-variable yielded very similar performance for various interest values, but with different learning rates. We varied the corridor length ({5, 10, 15, 20, 25}) in our experiments. For each length and algorithm, we chose an optimal learning rate from 20 different values. We used γ = 1, a uniformly random policy, and the value function is predicted only at fully observable states. The results are averaged across 25 seeds, and 95% confidence intervals are shown in the plots. (We use the t-distribution to compute the confidence intervals in all the settings.)

**Observations:** A state-dependent λ (or β) results in bootstrapping from fully observable states only. Additionally, PTD weighs the updates on the various states according to β. The learning curves on both tasks are reported in Figures 3 and A.2. TD(λ) performs poorly, and its errors increase with the length of the chain on both tasks. Also, the values of the fully observable states stop improving even as the number of training episodes increases. This is to be expected, as TD(λ) updates the values of all states; thus, errors in estimating the partially observable states affect the predictions at the fully observable states through the generalization in the function approximator. ETD-fixed performs a very small update on partially observable states due to the small interest. Nevertheless, this is sufficient to affect the predictions at fully observable states, and the error of ETD-fixed increases with even a small increase in the interest. ETD-fixed is also highly sensitive to the learning rate. ETD-variable, however, performs the best. With the choice of λ and interest function described previously, ETD-variable ignores the updates on corridor states and, at the same time, does not bootstrap from these states, producing a very low error in both tasks. This effect is similar to PTD. However, bootstrapping and updating are controlled by a single parameter (β) in PTD, while ETD-variable has two parameters, λ and the interest function, and both have to be optimized. Hence, while it converges slightly faster than PTD, this comes at the cost of increased tuning complexity.

### 6.3. Semi-linear setting

**Task description:** We consider two grid navigation tasks, shown in Figures 4a and 4b. In the first task, the states forming a cross at the center are partially observable. In the second task, 50% of the states are partially observable (sampled randomly for each seed). The agent starts randomly in one of the states in the top-left quadrant, and it can choose from four actions (up, down, left, right), moving in the corresponding direction by one state.
A reward of +5 is obtained when the agent transitions to the state at the bottom-right corner. The transitions are deterministic and γ = 0.99. To generate the features for these tasks, we first train a single-layer neural network to predict the value function using Monte Carlo targets on the fully observable grid. We use the output of the penultimate layer of the trained network as the features for the states. We then train a linear function approximator on top of these features to predict the value function of a uniformly random policy. We use three grid sizes ({8×8, 12×12, 16×16}) for experimentation.

**Setup:** We consider the same four algorithms in this task. We set λ = 0 and β = 1 on fully observable states, and λ = 1 and β = 0 on the other states. We set the interest to 0.01 for ETD-fixed (selected from {0.01, 0.05, 0.1} based on a hyperparameter search; see Appendix A.4.3). The interest is set to 0.5 on fully observable states and 0 on other states for ETD-variable. The results are averaged across 50 seeds, and 50% confidence intervals are shown in the plots (picked for clarity of the plotting).

**Observations:** We report the mean squared error of the value function at fully observable states as the performance measure. The learning curves on both tasks, as a function of the number of episodes, are reported in Figures 4c, 4d, and A.8. TD(λ) performed the worst, for the reasons mentioned earlier. Interestingly, ETD-fixed performed much better in this setting compared to the previous one. Its performance dropped significantly as the grid size increased in Task 1, but the drop was marginal for Task 2. ETD-variable performed much better than ETD-fixed: its performance on both tasks for grid sizes 8×8 and 12×12 was the same as PTD. However, there was a slight drop in performance for grid size 16×16. This is because, in the larger tasks, the trajectories can vary widely; as a result, the update on a particular state is reweighed differently depending upon the trajectory, causing variance. The hyperparameter tuning experiments also showed that ETD-variable is sensitive to small changes in the learning rate. PTD performs best on all grid sizes and both tasks.

*Figure 4. (Top) Grid tasks 1 and 2: the dark brown square is the terminal state, the light squares are fully observable, the darker squares are partially observable, and the walls are indicated in orange. (Bottom) The mean squared error of the fully observable states' values as a function of the number of episodes for the various algorithms on the 16×16 grid.*

### 6.4. Non-linear setting

**Task description:** The aim of this experiment is two-fold: (1) to verify whether PTD still learns better predictions when the capacity of the function approximator is decreased, and (2) to verify whether PTD is compatible with end-to-end training of a non-linear function approximator. We use the same grid tasks described in Section 6.3 for experimentation.

#### 6.4.1. Forward view (returns)

**Setup:** We consider the same four algorithms and set λ, β, and interest values as described in the previous section. We use a single-hidden-layer neural network whose input is a one-hot vector (of size n × n) indicating the agent's location in the grid if the state is fully observable, and Gaussian noise otherwise. We used a ReLU non-linearity in the hidden layer. We experimented with {1, 2, 4, 8, 16} hidden units to check whether PTD can estimate the values when the approximation capacity is limited. We searched for the best learning rate for each algorithm and network configuration from a range of values, picking it based on the area under the error curve (AUC) of the MSE averaged across episodes and 25 seeds. We used 250 episodes for all the tasks. We trained the networks using the forward-view returns for all algorithms.
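
The forward-view targets $G^\beta_t$ of equation 4 can be computed for a stored episode by a backward recursion; a minimal sketch (ours, not the experiment code) is given below, with the convention that the terminal state has value 0 and preference 1.

```python
import numpy as np

def beta_returns(rewards, next_values, next_betas, gamma=0.99):
    """Compute forward-view targets G^beta_t (equation 4) for one episode.

    rewards[t]     = r_{t+1}
    next_values[t] = v(s_{t+1}; theta), with 0 used for the terminal state
    next_betas[t]  = beta(s_{t+1}),     with 1 used for the terminal state
    """
    T = len(rewards)
    G = np.zeros(T)
    g = 0.0  # stands in for G^beta_{t+1} past the end of the episode
    for t in reversed(range(T)):
        g = rewards[t] + gamma * (next_betas[t] * next_values[t]
                                  + (1.0 - next_betas[t]) * g)
        G[t] = g
    return G
```

These targets would then presumably be regressed onto with the squared error at $s_t$ weighted by $\beta(s_t)$, mirroring the offline update in equation 5.
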
*Figure 5. The area under the curve (AUC) of the mean squared error of the fully observable states' values, averaged across seeds and episodes, as a function of the number of hidden units for the various algorithms on (a) Grid task 1 and (b) Grid task 2, for grid sizes 8, 12, and 16. The error bars indicate 90% confidence intervals across 50 seeds.*

**Observations:** The results are presented in Figure 5. Each point in the plot is the AUC of the MSE for an algorithm and neural network configuration pair, averaged across episodes and seeds. The error bars indicate 90% confidence intervals across seeds. TD(λ) performs poorly, for the reasons stated before. ETD-fixed performs slightly better than TD(λ), but the error is still high. PTD performs better than the other algorithms on both tasks and across all network sizes. The error is high only in the 16×16 grid for really small networks, with four or fewer hidden units. This is because the number of fully observable states is significantly larger for this grid size, and the network is simply not large enough to provide a good approximation for all these states. Nevertheless, PTD's predictions are significantly better than those of the other algorithms in this setting. The behaviour of ETD-variable is similar, but its variance is much higher. As before, the hyperparameter tuning experiments indicate that ETD is also much more sensitive to learning rates.

#### 6.4.2. Backward view (eligibility traces)

In this experiment, we are interested in finding out how PTD's eligibility traces fare in the non-linear setting.

**Setup:** The set of algorithms tested, the network architecture, and the task details are the same as in Section 6.4.1. We also followed the same procedure as in the previous section to select the learning rate. We trained the networks in an online fashion using eligibility traces for all algorithms.
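
Concretely, online backward-view PTD with a non-linear approximator keeps one eligibility trace per parameter and accumulates gradients of the value estimate instead of feature vectors. The sketch below (ours, with a hypothetical `env`/`policy` interface and a tiny hand-differentiated one-hidden-layer ReLU network) illustrates the update; it is not the training code used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 16, 4
W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))  # hidden-layer weights
w2 = rng.normal(scale=0.1, size=n_hidden)          # output weights

def value_and_grads(x):
    """v(x; theta) for a one-hidden-layer ReLU network, plus its gradients."""
    h_pre = W1 @ x
    h = np.maximum(h_pre, 0.0)
    v = w2 @ h
    g_w2 = h                                        # dv/dw2
    g_W1 = np.outer(w2 * (h_pre > 0.0), x)          # dv/dW1
    return v, g_W1, g_w2

def ptd_backward_episode(env, policy, beta, gamma=0.99, alpha=0.01):
    """One episode of online backward-view PTD with gradient eligibility traces."""
    global W1, w2
    e_W1, e_w2 = np.zeros_like(W1), np.zeros_like(w2)
    x, done = env.reset(), False
    while not done:
        x_next, r, done = env.step(policy(x))       # assumed interface
        v, g_W1, g_w2 = value_and_grads(x)
        v_next = 0.0 if done else value_and_grads(x_next)[0]
        delta = r + gamma * v_next - v
        b = beta(x)
        e_W1 = gamma * (1.0 - b) * e_W1 + b * g_W1  # trace over dv/dW1
        e_w2 = gamma * (1.0 - b) * e_w2 + b * g_w2  # trace over dv/dw2
        W1 += alpha * delta * e_W1
        w2 += alpha * delta * e_w2
        x = x_next
```
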
*Figure 6. The area under the curve (AUC) of the mean squared error of the fully observable states' values, averaged across seeds and episodes, as a function of the number of hidden units for the various algorithms (including forward-view PTD for reference) on (a) Grid task 1 and (b) Grid task 2, for grid sizes 8, 12, and 16. The error bars indicate 90% confidence intervals across 50 seeds.*

**Observations:** Eligibility traces facilitate online learning and provide a way to assign credit to past states, making them a valuable tool in theory. However, the prediction errors of all the algorithms are higher than their forward-view counterparts. The drop in performance occurs because the eligibility traces remember gradients that were computed with respect to past parameters, which introduces significant bias. Moreover, this bias caused instabilities in the training of PTD and ETD-variable for high learning rates. The results are presented in Figure 6. As shown, the performance of TD(λ) and ETD-fixed is still poor compared to PTD and ETD-variable. The backward views of PTD and ETD-variable have slightly higher error and variance across seeds than their forward-view equivalents when the capacity of the function approximator is small, but they match the forward-view performance when the capacity is increased. PTD still performs slightly better than ETD-variable on the 16×16 grids, indicating the usefulness of its eligibility traces with neural networks.

## 7. Conclusions and Future Work

We introduced Preferential TD learning, a novel TD learning method that updates the value function and bootstraps in the presence of state-dependent preferences, which allow these operations to be prioritized differently in different states. Our experiments show that PTD compares favourably to other TD variants, especially on tasks with partial observability. However, partial observability is not a requirement for using our method. PTD can be useful in other settings where updates can benefit from re-weighting. For instance, in an environment with bottleneck states, placing a high preference on such states could propagate credit faster. Preliminary experiments in Appendix A.4.5 corroborate this claim, and further analysis could be interesting.

We fixed the preference function from the beginning in our experiments, but an important direction for future work is to learn or adapt it from data. The preference function plays a dual role: its presence in the targets controls the amount of bootstrapping from future state values, and its presence in the updates determines the amount of re-weighting. Both roles can inspire learning approaches. The bootstrapping view opens the door to existing techniques such as meta-learning (White & White, 2016; Xu et al., 2018; Zhao et al., 2020), variance reduction (Kearns & Singh, 2000; Downey et al., 2010), and gradient interference (Bengio et al., 2020). Alternatively, we can leverage methods that identify partially observable states or important states (e.g., bottleneck states) (McGovern & Barto, 2001; Stolle & Precup, 2002; Bacon et al., 2017; Harutyunyan et al., 2019) to learn the preference function. We also believe that PTD would be useful in transfer learning, where one could learn parameterized preferences in a source task and use them in the updates on a target task to achieve faster training.

We demonstrated the utility of preferential updating in on-policy policy evaluation.
The idea of preferential updating could also be exploited in other RL settings, such as off-policy learning, control, or policy gradients, to achieve faster learning. Our algorithm could also be applied to learning General Value Functions (Comanici et al., 2018; Sutton et al., 2011), an exciting future direction. As discussed earlier, our method bridges the gap between Emphatic TD and TD(λ). Eligibility traces in PTD propagate credit to the current state based on its preference, and the remaining credit goes to past states. This idea is similar to the gradient updating scheme in the backpropagation-through-time algorithm, where the gates in recurrent neural networks control the flow of credit to past states. We suspect an interesting connection between the eligibility trace mechanism in PTD and backpropagation through time.

In the binary preference setting (β ∈ {0, 1}), our method completely discards zero-preference states in the MDP from bootstrapping and updating. The remaining states, whose preference is non-zero, participate in both. This creates a level of state abstraction in the given problem. Our ultimate goal is to train a small agent capable of learning in a large environment, where the agent cannot represent values correctly everywhere since the state space is enormous (Silver et al., 2021). A logical step towards this goal is to find ways to effectively use function approximators with limited capacity compared to the environment. Such an agent should therefore prefer to update and bootstrap from a few states well rather than poorly estimate the values of all states. PTD helps achieve this goal and opens up avenues for developing more algorithms of this flavour.

## Acknowledgements

This research was funded through grants from NSERC and the CIFAR Learning in Machines and Brains Program. We thank Ahmed Touati for the helpful discussions on theory, and the anonymous reviewers and several colleagues at Mila for the constructive feedback on the draft.

## References

Aberdeen, D. Filtered reinforcement learning. In European Conference on Machine Learning, pp. 27-38. Springer, 2004.

Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31(1), 2017.

Barnard, E. Temporal-difference methods and Markov models. IEEE Transactions on Systems, Man, and Cybernetics, 23(2):357-365, 1993.

Bengio, E., Pineau, J., and Precup, D. Interference and generalization in temporal difference learning. In International Conference on Machine Learning, pp. 767-777. PMLR, 2020.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, 1996.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Chelu, V., Precup, D., and van Hasselt, H. Forethought and hindsight in credit assignment. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Comanici, G., Precup, D., Barreto, A., Toyama, D. K., Aygün, E., Hamel, P., Vezhnevets, S., Hou, S., and Mourad, S. Knowledge representation for reinforcement learning using general value functions. OpenReview, 2018.

Downey, C., Sanner, S., et al. Temporal difference Bayesian model averaging: A Bayesian perspective on adapting lambda. In ICML, pp. 311-318. Citeseer, 2010.

Ghiassian, S., Rafiee, B., and Sutton, R. S. A first empirical study of emphatic temporal difference learning. arXiv preprint arXiv:1705.04185, 2017.

Gordon, G. J. Chattering in SARSA(λ). A CMU Learning Lab internal report. Citeseer, 1996.

Harutyunyan, A., Dabney, W., Mesnard, T., Azar, M. G., Piot, B., Heess, N., van Hasselt, H. P., Wayne, G., Singh, S., Precup, D., et al. Hindsight credit assignment. In Advances in Neural Information Processing Systems, pp. 12467-12476, 2019.

Hung, C.-C., Lillicrap, T., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., and Wayne, G. Optimizing agent behavior over long time scales by transporting value. Nature Communications, 10(1):1-12, 2019.

Kearns, M. J. and Singh, S. P. Bias-variance error bounds for temporal difference updates. In COLT, pp. 142-147. Citeseer, 2000.

Loch, J. and Singh, S. P. Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes. In ICML, pp. 323-331, 1998.

Mahmood, A. Incremental Off-policy Reinforcement Learning Algorithms. PhD thesis, University of Alberta, 2017.

McGovern, A. and Barto, A. G. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pp. 361-368, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558607781.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.

Seijen, H. and Sutton, R. True online TD(λ). In International Conference on Machine Learning, pp. 692-700, 2014.

Silver, D., Singh, S., Precup, D., and Sutton, R. S. Reward is enough. Artificial Intelligence, pp. 103535, 2021.

Singh, S. P. and Sutton, R. S. Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1-3):123-158, 1996.

Stolle, M. and Precup, D. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212-223. Springer, 2002.

Sutton, R. S. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Department of Computer and Information Science, 1984.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.

Sutton, R. S. TD models: Modeling the world at a mixture of time scales. In Machine Learning Proceedings 1995, pp. 531-539. Elsevier, 1995.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181-211, 1999.

Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, AAMAS '11, pp. 761-768, Richland, SC, 2011. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 0982657161.

Sutton, R. S., Mahmood, A. R., and White, M. An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17(1):2603-2631, 2016.

Theocharous, G. and Kaelbling, L. P. Approximate planning in POMDPs with macro-actions. In Advances in Neural Information Processing Systems, pp. 775-782, 2004.

Thodoroff, P., Anand, N., Caccia, L., Precup, D., and Pineau, J. Recurrent value functions. In 4th Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM), 2019.

Tsitsiklis, J. N. and Van Roy, B. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

van Hasselt, H. and Sutton, R. S. Learning to predict independent of span. arXiv preprint arXiv:1508.04582, 2015.

Varga, R. S. Matrix Iterative Analysis, volume 27. Springer Science & Business Media, 1999.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, 1989.

White, M. and White, A. A greedy approach to adapting the trace parameter for temporal difference learning. In Proceedings of the 2016 International Conference on Autonomous Agents and Multiagent Systems, AAMAS '16, pp. 557-565, Richland, SC, 2016. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450342391.

Xu, Z., Modayil, J., van Hasselt, H. P., Barreto, A., Silver, D., and Schaul, T. Natural value approximators: Learning when to trust past estimates. In Advances in Neural Information Processing Systems, pp. 2120-2128, 2017.

Xu, Z., van Hasselt, H. P., and Silver, D. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2396-2407, 2018.

Yu, H. On convergence of emphatic temporal-difference learning. In Conference on Learning Theory, pp. 1724-1751, 2015.

Yu, H., Mahmood, A. R., and Sutton, R. S. On generalized Bellman equations and temporal-difference learning. The Journal of Machine Learning Research, 19(1):1864-1912, 2018.

Zhao, M., Luan, S., Porada, I., Chang, X.-W., and Precup, D. Meta-learning state-based eligibility traces for more sample-efficient policy evaluation. In International Conference on Autonomous Agents and Multiagent Systems, AAMAS '20. International Foundation for Autonomous Agents and Multiagent Systems, 2020. URL http://www.ifaamas.org/Proceedings/aamas2020/pdfs/p1647.pdf.