Chaining Value Functions for Off-Policy Learning

Simon Schmitt¹,², John Shawe-Taylor², Hado van Hasselt¹
¹DeepMind, ²University College London, UK
suschmitt@google.com

Copyright 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn off-policy about policies that differ from the policy used to generate its experience. This is important to learn counterfactuals, or because the experience was generated outside of its own control. However, off-policy learning is non-trivial, and standard reinforcement-learning algorithms can be unstable and divergent.

In this paper we discuss a novel family of off-policy prediction algorithms which are convergent by construction. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain. Each step in the chain is stable and hence the complete algorithm is guaranteed to be stable. Under mild conditions this comes arbitrarily close to the off-policy TD solution when we increase the length of the chain. Hence it can compute the solution even in cases where off-policy TD diverges.

We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix. Furthermore it can be interpreted as estimating a novel objective that we call a k-step expedition: following the target policy for finitely many steps before continuing indefinitely with the behaviour policy. Empirically we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results.

1 Introduction

Value estimation is key to decision making and reinforcement learning (Sutton and Barto 2018). To accumulate knowledge and improve its policy of behaviour, an agent can estimate values off-policy, corresponding to policies that differ from the policy used to generate the experience it learns from. This can be useful to learn counterfactuals, or because the experience was generated outside of its own control. Indeed the applications of off-policy learning are manifold: learning to exploit while exploring as e.g. in ϵ-greedy, learning multiple policies concurrently (Sutton et al. 2011; Badia et al. 2020), shaping representations (Jaderberg et al. 2017), minimizing costly mistakes (Hauskrecht and Fraser 2000) or learning from demonstrations (Hester et al. 2018).

However, off-policy learning is non-trivial, because standard reinforcement-learning algorithms can be unstable: Baird (1995) showed that off-policy TD predictions can diverge to infinity in what is now known as Baird's MDP. Sutton and Barto (2018) attribute this to the popular combination of function approximation (to support large state spaces) and bootstrapping (to reduce variance) in the off-policy context, since called the deadly triad. Both are essential and ubiquitous in deep reinforcement learning (van Hasselt et al. 2018), hence algorithms that are convergent even in the face of the deadly triad are a prominent research direction. Over the years, several variants and solutions have been proposed
(Sutton et al. 2009; Maei 2011; van Hasselt, Mahmood, and Sutton 2014; Sutton, Mahmood, and White 2016), but these do not uniformly outperform off-policy TD (Hackman 2013) and sometimes suffer from high (even infinite) variance (Sutton, Mahmood, and White 2016).

In this paper we analyze a novel family of off-policy prediction algorithms that is convergent (i.e. breaks the deadly triad) and conceptually simple. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain. Each step in the chain is stable and hence the complete algorithm is guaranteed to be stable. When employing off-policy TD at each step in the chain we call it chained TD learning. While off-policy TD sometimes diverges and is unable to obtain its own solution (fixed point), we prove that chained TD always converges and that its solution comes arbitrarily close to the off-policy TD solution under mild conditions when we increase the length of the chain.

Interestingly, our approach can be interpreted as estimating the value of following the target policy for a finite number of steps k and then following the behaviour indefinitely. We call this behaviour a k-step π-expedition (k-step expedition for short), as the prediction envisions an expedition of at most k steps following a potentially novel π before continuing with the well-known behaviour µ. Naturally, longer and longer expeditions (larger k) approach the target policy. Chained TD exploits the recursive structure of this objective to reduce variance through bootstrapping. Contrary to estimating the target value directly with TD learning, this is guaranteed to be stable, as we prove in this paper. While in practice we use a finite number of value functions, we also consider what happens as k → ∞ and use this to acquire insights into the convergence of the popular albeit different technique of target networks (Mnih et al. 2015). We prove convergence of the expected chained TD update with a single learning rate and empirically confirm it on Baird's counterexample, which we augment to include rewards, where TD, TDC, GTD2 and ETD either diverge or make little progress.

2 Background

We consider state values v(s) that are parameterised by a parameter vector θ, for instance the weights of a neural network. The goal is to approximate the value of each state s under the target policy π, as defined by

$$v_\pi(s) := \mathbb{E}_\pi\left[\sum_{i=0}^{\infty} \gamma^i R_{t+i+1} \,\middle|\, S_t = s\right] = \mathbb{E}\left[R_{t+1} + \gamma v(S_{t+1}) \,\middle|\, S_t = s\right].$$

Off-policy TD (Sutton and Barto 2018) is an iterative process

$$\theta_{t+1} := \theta_t + \alpha \rho_t \left[R_t + \gamma v(S_{t+1}) - v(S_t)\right] \nabla_{\theta_t} v(S_t) \qquad (1)$$

where each update aims to improve the parameters θ_t such that the new estimate v_{θ_{t+1}} on average gets closer to the target value v_π, even when following a different policy µ. Here α is the step-size, γ is the discount and R_t is the reward observed when transitioning from state S_t to S_{t+1} after executing action A_t ∼ µ(·|S_t). In update (1), ρ_t := π(A_t|S_t)/µ(A_t|S_t) is the importance-sampling ratio between the probability of selecting action A_t under the target policy π and under the behaviour policy µ, not to be confused with the spectral radius ρ(M) of a matrix M.
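For concreteness, the following is a minimal sketch (not from the paper) of update (1) for linear value functions v_θ(s) = θ⊤ϕ(s); the feature vectors and transition quantities are assumed to be supplied by the caller.

```python
import numpy as np

def off_policy_td_step(theta, phi_s, phi_s_next, reward, rho, alpha=0.1, gamma=0.9):
    """One off-policy TD(0) update, Equation (1), with linear values v(s) = theta @ phi(s).

    rho is the importance-sampling ratio pi(A_t|S_t) / mu(A_t|S_t) of the executed action.
    """
    td_error = reward + gamma * theta @ phi_s_next - theta @ phi_s
    # The gradient of v_theta(S_t) with respect to theta is simply phi(S_t).
    return theta + alpha * rho * td_error * phi_s
```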
Unfortunately, when using function approximation, convergence of this algorithm can only be guaranteed in the on-policy setting where π = µ (Baird 1995; Sutton and Barto 2018). This is an actively pursued research area where a series of solutions have been proposed (Sutton et al. 2009; Maei 2011; van Hasselt, Mahmood, and Sutton 2014; Sutton, Mahmood, and White 2016), but these often suffer either from performing worse than off-policy TD when it does not diverge (Hackman 2013) or even from infinite variance (Sutton, Mahmood, and White 2016).

Our approach is similar in spirit to De Asis et al. (2020), who estimate a new kind of return: fixed-horizon returns (i.e. the rewards only from the next k steps) instead of the typical discounted return. This special return can also be estimated through a series of value functions and is guaranteed to converge, albeit to a different fixed point. The special case of chaining for a single step has been considered before: Wiering and van Hasselt (2007) consider bootstrapping an action value off of a state value, which itself is learnt on-policy or off-policy (Wiering and van Hasselt 2009). Mazoure et al. (2021) consider bootstrapping off of an on-policy estimate with an off-policy multi-step return. These approaches can all be interpreted as performing one step in the more general chained TD algorithms that we consider in this paper.

Algorithm 1: Sequential chained TD is described below. Concurrent chained TD is obtained by moving line 2 between lines 6 and 7. Note that T needs to be specified large enough to ensure convergence.

Input: π, µ, number of chains K, number of update steps T
Parameter: step size α
1: Initialize all {θ^k}_{0 ≤ k ≤ K} randomly, t ← 0.
2: for k ← 0 to K do
3:   for i ← 1 to T do
4:     t ← t + 1
5:     Play one action A_t with µ.
6:     Observe next state S_{t+1} and reward R_{t+1}.
7:     if k = 0 then
8:       δ ← R_{t+1} + γ v_{θ^0}(S_{t+1}) − v_{θ^0}(S_t);   ρ ← 1
9:     else
10:      δ ← R_{t+1} + γ v_{θ^{k−1}}(S_{t+1}) − v_{θ^k}(S_t)
11:      ρ ← π(A_t|S_t) / µ(A_t|S_t)
12:    end if
13:    θ^k ← θ^k + α ρ δ ∇_θ v_k(S_t)
14:  end for
15:  θ^{k+1} ← θ^k    (Only used in sequential chained TD.)
16: end for
17: return {θ^k_t}_{0 ≤ k ≤ K}

3 Chaining Off-Policy Predictors

We want an off-policy algorithm that is 1) stable (i.e., convergent) and 2) of low bias with respect to the true values v_π. To this end we propose a novel family of algorithms and show that it satisfies these desiderata. Starting with the behaviour value, the idea is to define a series of value functions {v_k}_{k ∈ ℕ₀} recursively such that they approach the desired target value:

$$\lim_{k \to \infty} v_k = v_\pi.$$

This is achieved recursively by employing an off-policy estimator OPE, such as off-policy TD learning, that estimates v_k by bootstrapping off the previous value v_{k−1}:

$$v_k := \mathbb{E}_{\tau \sim \mu}\left[\mathrm{OPE}(\pi, v_{k-1}, \tau, \mu)\right].$$

This principle can be applied to any off-policy estimator that employs trajectories τ sampled from µ and a bootstrap value v_{k−1} to predict the values of the target policy π, e.g.

$$v_k(s) := \mathbb{E}_{\tau \sim \mu}\left[\rho_t \left(R_t + \gamma v_{k-1}(S_{t+1})\right) \,\middle|\, S_t = s\right].$$

The idea of chaining off-policy estimators has a natural interpretation: v_µ is the value of the behaviour policy µ, and v_k is the value of at first performing k steps according to the target policy π and then following µ indefinitely. We call such behaviour a k-step expedition and v_k the k-step expedition value.

Definition 1. A k-step expedition from state s acts with π for k steps and then with µ indefinitely. Let the k-step expedition value of state s be the expected return of a k-step expedition from state s.

As k increases, the value v_k becomes more and more off-policy and v_0 ultimately becomes irrelevant.
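Below is a minimal Python sketch of Algorithm 1 (sequential chained TD) under linear function approximation. The transition sampler `sample_transition`, the policy probability functions `pi` and `mu`, and the feature map `phi` are assumed interfaces chosen for illustration; they are not part of the paper.

```python
import numpy as np

def sequential_chained_td(sample_transition, pi, mu, phi, K, T, d,
                          alpha=0.05, gamma=0.9, seed=0):
    """Sketch of Algorithm 1 with linear values v_k(s) = theta[k] @ phi(s).

    sample_transition() -> (s, a, r, s_next), generated by acting with mu;
    pi(a, s) and mu(a, s) return action probabilities; phi(s) returns a length-d feature vector.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(K + 1, d))              # line 1: random initialisation
    for k in range(K + 1):                           # line 2: loop over the chain
        for _ in range(T):                           # line 3: T updates per value function
            s, a, r, s_next = sample_transition()    # lines 5-6: act with mu, observe transition
            if k == 0:                               # line 8: on-policy TD(0) towards v_mu
                delta = r + gamma * theta[0] @ phi(s_next) - theta[0] @ phi(s)
                rho = 1.0
            else:                                    # lines 10-11: bootstrap on the previous link
                delta = r + gamma * theta[k - 1] @ phi(s_next) - theta[k] @ phi(s)
                rho = pi(a, s) / mu(a, s)
            theta[k] += alpha * rho * delta * phi(s)  # line 13: semi-gradient update
        if k < K:
            theta[k + 1] = theta[k]                   # line 15: hot-start (sequential variant only)
    return theta
```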
This perspective illustrates that typically v_k → v_π as k increases (i.e. that v_k becomes unbiased). While this is easy to see for tabular RL, we analyse bias and convergence in the more general case of function approximation in the following section. If estimated sequentially, convergence is guaranteed by induction: v_0 := v_µ can be estimated on-policy, and hence TD is stable (i.e. converges). Then, for each k > 0, v_k is stable because it bootstraps off a stable v_{k−1}. In the next section we prove that convergence is also guaranteed for chained off-policy learning if all parameters are updated concurrently, e.g., when learning all value functions online. A concrete stochastic update for each such value function when transitioning from state S_t to S_{t+1} and observing a reward R_{t+1} is given by

$$\theta^k_{t+1} := \theta^k_t + \alpha \rho_t \delta^k_t \nabla_{\theta^k_t} v^k_t(S_t), \qquad (2)$$

where θ^k_t are the parameters of the k-th value function after observing t transitions, v^k_t(s) := v_{θ^k_t}(s) and

$$\delta^k_t := R_{t+1} + \gamma v^{k-1}_t(S_{t+1}) - v^k_t(S_t).$$

We call this chained (off-policy) TD learning. Sequential chained TD only updates θ^k on timesteps after the previous θ^{k−1} has converged, while concurrent chained TD updates all {θ^k}_k at each timestep (see Algorithm 1). In the next sections we analyse both algorithms theoretically (Section 4) and empirically (Section 5).

4 Analysis

To analyse Algorithm 1, in this section we consider linear function approximation, so that v_θ(S_t) = θ⊤ϕ(S_t), where ϕ(S_t) are the features observed at time t. We recall that off-policy TD sometimes diverges and is unable to obtain its own solution (fixed point) θ_π := A_π^{−1} b_π; then we show that chained TD is always convergent and can compute θ_π under mild conditions via the following steps:

1. Section 4.2 observes that sequential chained TD defines a recursion of fixed points: the fixed point θ^k_* of value function v_k can be computed from θ^{k−1}_*.
2. Section 4.3 shows that this recursion approaches the off-policy solution under mild conditions: lim_{k→∞} θ^k_* = θ_π.
3. Section 4.4 proves convergence of both expected sequential and concurrent chained TD to the fixed points: lim_{t→∞} θ^k_t = θ^k_*.

Hence chained TD is convergent for any fixed k, and the attained fixed point of the k-th value function, θ^k_*, indeed approaches the off-policy TD solution θ_π := A_π^{−1} b_π under mild conditions that we investigate further in Section 4.3. Then chained TD learning is unbiased wrt. θ_π in the limit, i.e. lim_{k→∞} θ^k_* = θ_π.

Off-policy TD and chained TD can be analysed through their expected updates, which can be written in matrix form. For off-policy TD (Equation (1)) we obtain

$$\theta_{t+1} = \theta_t + \alpha \left(b_\pi - A_\pi \theta_t\right) \qquad (3)$$

and the expected update for chained TD (Equation (2)) is

$$\theta^k_{t+1} = \theta^k_t + \alpha \left(b_\pi + \gamma Y \theta^{k-1}_t - X \theta^k_t\right), \qquad (4)$$

with

$$b_\pi := \mathbb{E}_\mu\left[\rho_t R_t \phi(S_t)\right], \qquad b_\mu := \mathbb{E}_\mu\left[R_t \phi(S_t)\right], \qquad (5)$$

$$A_\pi := \mathbb{E}_\mu\left[\rho_t \phi(S_t)\left(\phi(S_t) - \gamma\phi(S_{t+1})\right)^\top\right] = \Phi^\top D_\mu (I - \gamma P_\pi)\Phi = X - \gamma Y, \qquad (6, 7)$$

$$X := \mathbb{E}_\mu\left[\rho_t \phi(S_t)\phi(S_t)^\top\right] = \Phi^\top D_\mu \Phi, \qquad (8)$$

$$Y := \mathbb{E}_\mu\left[\rho_t \phi(S_t)\phi(S_{t+1})^\top\right] = \Phi^\top D_\mu P_\pi \Phi, \qquad (9)$$

$$\Pi := \Phi\left(\Phi^\top D_\mu \Phi\right)^{-1}\Phi^\top D_\mu = \Phi X^{-1}\Phi^\top D_\mu, \qquad (10)$$

where Φ is the state-feature matrix, P_π is π's transition matrix and D_µ is a diagonal matrix with µ's steady-state distribution. A_π is called the key matrix and Π is called the projection matrix (Sutton and Barto 2018). We make the common technical assumptions that the columns of Φ are linearly independent and that µ covers all states, such that D_µ and hence X have full rank (Sutton, Mahmood, and White 2016).
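As a sketch of how these quantities can be assembled in the linear setting (passing the expected immediate rewards under the target policy, `r_pi`, is our way of expressing b_π in matrix form and is not notation from the paper):

```python
import numpy as np

def expected_td_matrices(Phi, d_mu, P_pi, r_pi, gamma):
    """Matrices of Equations (5)-(10): Phi is the S x d feature matrix, d_mu the behaviour
    state distribution, P_pi the target transition matrix and r_pi the expected rewards."""
    D_mu = np.diag(d_mu)
    X = Phi.T @ D_mu @ Phi                       # Eq. (8)
    Y = Phi.T @ D_mu @ P_pi @ Phi                # Eq. (9)
    A_pi = X - gamma * Y                         # Eqs. (6, 7)
    b_pi = Phi.T @ D_mu @ r_pi                   # Eq. (5) in matrix form
    Pi = Phi @ np.linalg.inv(X) @ Phi.T @ D_mu   # Eq. (10): TD projection matrix
    return A_pi, b_pi, X, Y, Pi
```

Under the assumptions above, X is positive definite, whereas A_π may have eigenvalues with non-positive real part, which is the situation in which expected off-policy TD (3) can diverge.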
4.1 Viewing Expected TD as Richardson Iteration

Expected TD (see Equation (3)) can be viewed as Richardson Iteration (Richardson 1911), a simple and well-studied iterative algorithm that, given M and b, converges to θ_* = M^{−1}b under the condition that all eigenvalues of M are positive. Rather than inverting A_π, expected TD learning attempts to determine the solution of A_π θ_π = b_π iteratively through Richardson Iteration, and may diverge even though A_π is invertible.

Definition 2. Given a square matrix M, a vector b and a step-size α, Richardson Iteration computes:

$$\theta_{t+1} = \theta_t + \alpha\left(b - M\theta_t\right) \qquad (11)$$

Definition 3. We call Richardson Iteration stable if θ_t converges as t → ∞.

Proposition 1. Let θ_0 be any initial value, M any square matrix with only positive eigenvalues and b any compatibly shaped vector. Then Richardson Iteration θ_t converges to θ_* = M^{−1}b for a sufficiently small step size α.

Proof. Let r_t = θ_t − θ_*, then

$$r_{t+1} = \theta_t + \alpha\left(b - M\theta_t\right) - \theta_* = \theta_t + \alpha\left(M\theta_* - M\theta_t\right) - \theta_* = (I - \alpha M)\, r_t, \qquad \text{hence} \qquad r_t = (I - \alpha M)^t r_0.$$

Since M has only positive eigenvalues we can pick α such that I − αM satisfies |λ_i| < 1 for all of its eigenvalues λ_i. Furthermore we can diagonalize I − αM = VΛV^{−1} such that (I − αM)^t = VΛ^t V^{−1}. Since all entries of Λ have absolute value smaller than 1, convergence is ensured: ‖θ_t − θ_*‖₂ = ‖r_t‖₂ → 0 as t → ∞, for any b.

4.2 Fixed Point Recursion

The expected update of sequential chained TD (4) can also be seen as Richardson Iteration. Once the (k−1)-th value function is estimated and θ^{k−1} is fixed, the chained TD update for the next value and its parameters θ^k converges to a fixed point θ^k_* that depends on θ^{k−1}:

$$\theta^k_*(\theta^{k-1}) := \lim_{t\to\infty} \theta^k_t = X^{-1}\left(\gamma Y \theta^{k-1} + b_\pi\right) \qquad (13)$$

Convergence follows by Proposition 1 for a sufficiently small step-size α because X is positive definite. Should θ^k bootstrap on the fixed point of the previous value, θ^{k−1}_*, we obtain a recursion of fixed points:

$$\theta^k_* = X^{-1}\left(\gamma Y \theta^{k-1}_* + b_\pi\right) \qquad (14)$$

4.3 Bias

The established fixed point recursion (14) can be interpreted as a transformation of the unstable off-policy TD inverse problem (determine θ_π = A_π^{−1} b_π), where Richardson Iteration and hence TD diverge, into a recursive sequence of sub-problems (given θ^{k−1}_*, determine θ^k_*) that are all stable under Richardson Iteration (see Sections 4.2 and 4.4). In this section we prove under which conditions

$$\lim_{k\to\infty} \theta^k_* = A_\pi^{-1} b_\pi,$$

i.e. that the sequence of fixed points converges to the off-policy TD solution θ_π as k increases.

Proposition 2. Let θ^k_* denote the fixed point of the k-th chained value function as defined in Eq. (14). Its bias (distance to the TD off-policy solution θ_π := A_π^{−1} b_π) is then given by

$$\theta^k_* - \theta_\pi = \gamma^k \left(X^{-1} Y\right)^k \left(\theta_0 - A_\pi^{-1} b_\pi\right)$$

for any initial value θ_0.

Proof. Given any θ_0 (e.g. without loss of generality the fixed point θ_µ of the on-policy algorithm estimating v_µ), the sequence (14) can be written in closed form as

$$\theta^k_* = W_k X^{-1} b_\pi + \left(X^{-1} Y \gamma\right)^k \theta_0, \qquad W_k := \sum_{i=0}^{k-1}\left(X^{-1} Y \gamma\right)^i. \qquad (15)$$

Since W_k is a geometric series wrt. X^{−1}Yγ it satisfies

$$I - \left(X^{-1} Y \gamma\right)^k = W_k\left(I - X^{-1} Y \gamma\right) = W_k X^{-1}\left(X - \gamma Y\right) = W_k X^{-1} A_\pi.$$

Hence

$$W_k X^{-1} = A_\pi^{-1} - \left(X^{-1} Y \gamma\right)^k A_\pi^{-1}. \qquad (16)$$

Plugging this into the closed form for θ^k_* from Equation (15):

$$\theta^k_* = W_k X^{-1} b_\pi + \left(X^{-1} Y \gamma\right)^k \theta_0 = A_\pi^{-1} b_\pi - \left(X^{-1} Y \gamma\right)^k A_\pi^{-1} b_\pi + \left(X^{-1} Y \gamma\right)^k \theta_0 = A_\pi^{-1} b_\pi + \underbrace{\gamma^k \left(X^{-1} Y\right)^k \left(\theta_0 - A_\pi^{-1} b_\pi\right)}_{\text{Bias wrt. } \theta_\pi} \qquad (17)$$

Observe that (X^{−1}Yγ)^k can be rewritten in terms of the TD projection Π and the transition matrix P_π:

$$\left(X^{-1} Y \gamma\right)^k = \underbrace{\gamma X^{-1}\Phi^\top D_\mu P_\pi}_{:=C}\Big(\gamma \underbrace{\Phi X^{-1}\Phi^\top D_\mu}_{=\Pi}\, P_\pi\Big)^{k-1}\Phi = C\left(\gamma \Pi P_\pi\right)^{k-1}\Phi. \qquad (18)$$

While we will see in the next section that chained TD is always convergent for any fixed k, Proposition 2 allows us to analyze its distance to θ_π. For a fixed k the distance depends on θ_0.
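As a quick numerical illustration (ours, not from the paper), the following sketch iterates the fixed-point recursion (14) with random stand-ins for X, Y and b_π and checks that the resulting bias matches the closed form of Proposition 2.

```python
import numpy as np
from numpy.linalg import matrix_power, norm, solve

rng = np.random.default_rng(0)
d, gamma, k = 4, 0.9, 20

# Random stand-ins for X, Y and b_pi; in an actual MDP they follow from Phi, D_mu, P_pi (Eqs. 5-9).
L = rng.normal(size=(d, d))
X = L @ L.T + d * np.eye(d)                  # positive definite, like Phi^T D_mu Phi
Y = rng.normal(size=(d, d))
b_pi = rng.normal(size=d)

theta_pi = solve(X - gamma * Y, b_pi)        # off-policy TD solution A_pi^{-1} b_pi
theta0 = rng.normal(size=d)

theta = theta0.copy()
for _ in range(k):                           # fixed-point recursion (14)
    theta = solve(X, gamma * Y @ theta + b_pi)

# Proposition 2: the bias equals gamma^k (X^{-1} Y)^k (theta_0 - theta_pi).
closed_form_bias = matrix_power(gamma * solve(X, Y), k) @ (theta0 - theta_pi)
assert np.allclose(theta - theta_pi, closed_form_bias)
print("bias after", k, "chain links:", norm(theta - theta_pi))
```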
The distance can be greatly reduced should θ_0 already be close to the solution A_π^{−1} b_π. Hence a heuristic choice of θ_0 may be beneficial, and without loss of generality we chose to use the behaviour value v_µ, i.e. θ_0 = A_µ^{−1} b_µ, which is always convergent independently of π and has recently been advocated, with a single greedification step, for offline RL (Gulcehre et al. 2020; Brandfonbrener et al. 2021).

Bias for Infinitely Long Chains (k → ∞)
In practice we can only use chains of finite length, but we can analyse what happens as the chains get longer. Below we prove that θ^k_* → θ_π if ρ(γΠP_π) < 1. Here Π is the TD projection and P_π the transition matrix. We can observe that ρ(Π) ≤ 1 and ρ(P_π) ≤ 1 hold for any MDP (see the appendix of Schmitt, Shawe-Taylor, and van Hasselt (2022)). While those are not sufficient conditions to ensure that also ρ(ΠP_π) ≤ 1, in practice it often still holds. In the appendix of (Schmitt, Shawe-Taylor, and van Hasselt 2022) we conjecture and discuss why. In Figure 1 we investigate this condition numerically: we show how often the provably convergent chained TD is unbiased in the limit of infinite k on random MDPs, and observe that this is nearly always the case. On the other hand, off-policy TD on the same MDPs diverges in roughly 20% of the cases.

Proposition 3. Let θ^k_* denote the fixed point of the k-th chained value function as defined in Eq. (14). Then the fixed-point limit θ^∞_* := lim_{k→∞} θ^k_* is equal to θ_π := A_π^{−1} b_π (i.e. θ^∞_* = θ_π) for any initial value θ_0 if either ρ(X^{−1}Yγ) < 1 or, equivalently, ρ(γΠP_π) < 1.

Proof. ρ(X^{−1}Yγ) < 1 implies lim_{k→∞} ‖(X^{−1}Yγ)^k‖₂ = 0, hence the bias in Proposition 2 vanishes as k → ∞. By a similar argument, from ρ(γΠP_π) < 1 it follows that (γΠP_π)^{k−1} converges to the zero matrix as k → ∞. Then by Equation (18) so does (X^{−1}Yγ)^k.

Hence besides being always convergent (see next section), chained TD can even be unbiased wrt. θ_π if ρ(γΠP_π) < 1. In that case the bias in Eq. (17) reduces exponentially with k.

[Figure 1 plots, for random MDPs with 25 to 200 states, how often the off-policy TD problem is solvable by off-policy TD and by chained TD.]
Figure 1: Off-policy TD sometimes diverges and is unable to obtain its own solution (fixed point). On the other hand, chained TD always converges and often solves the off-policy TD problem to arbitrary precision for sufficiently large k, even in cases where off-policy TD diverges, such as Baird's MDP. The difference between off-policy TD and chained TD becomes most apparent by looking at their worst-case scenarios: off-policy TD diverges for MDPs such as Baird's or the two-state MDP (Tsitsiklis and Van Roy 1997; Sutton, Mahmood, and White 2016). Chained TD obtains the target value in both MDPs. The latter can be modified such that chained TD becomes biased (see Section C of the appendix in Schmitt, Shawe-Taylor, and van Hasselt (2022)). However, chained TD remains convergent, i.e. it never diverges, as we have proved. One may now ask how often each algorithm is able to solve the TD problem to arbitrary precision. We investigate this numerically by checking their relevant matrices A_π and ΠP_π on random MDPs (sampling the entries of Φ from a normal distribution, and the rows of P_π and the diagonal of D_µ uniformly, re-normalizing them to sum to 1) with γ = 0.99 and as many features as states. We observe that chained TD solves the TD problem in nearly all cases, while off-policy TD diverges in about 20% of the cases. Note that chained TD remains stable even when it does not solve the problem. Hence one may argue that the worst-case scenario of chained TD is favourable.
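A sketch of the numerical check described in the caption of Figure 1, under the sampling protocol stated there; the concrete stability tests used below (all eigenvalues of A_π having positive real part for off-policy TD, following Proposition 1, and ρ(γΠP_π) < 1 for chained TD, following Proposition 3) are our reading of "solvable", not code from the paper.

```python
import numpy as np

def random_mdp_check(n_states=50, gamma=0.99, trials=200, seed=0):
    """Fraction of random MDPs on which expected off-policy TD is stable, and on which
    chained TD recovers the off-policy TD fixed point in the limit (cf. Figure 1)."""
    rng = np.random.default_rng(seed)
    td_ok = chained_ok = 0
    for _ in range(trials):
        Phi = rng.normal(size=(n_states, n_states))       # as many features as states
        P_pi = rng.uniform(size=(n_states, n_states))
        P_pi /= P_pi.sum(axis=1, keepdims=True)           # rows re-normalized to sum to 1
        d_mu = rng.uniform(size=n_states)
        d_mu /= d_mu.sum()                                # diagonal of D_mu, re-normalized
        D_mu = np.diag(d_mu)
        X = Phi.T @ D_mu @ Phi
        A_pi = Phi.T @ D_mu @ (np.eye(n_states) - gamma * P_pi) @ Phi
        Pi = Phi @ np.linalg.solve(X, Phi.T @ D_mu)       # TD projection, Eq. (10)
        td_ok += bool(np.all(np.linalg.eigvals(A_pi).real > 0))
        chained_ok += bool(np.max(np.abs(np.linalg.eigvals(gamma * Pi @ P_pi))) < 1)
    return td_ok / trials, chained_ok / trials
```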
4.4 Convergence

The previous section showed when the unstable off-policy TD inverse problem θ_π = A_π^{−1} b_π can be decomposed into a recursive sequence of sub-problems (given θ^{k−1}_*, determine θ^k_*) that approach the off-policy TD solution (lim_{k→∞} θ^k_* = θ_π). We will now show that each θ^k_* can be estimated through TD learning. To do this we prove that the corresponding Richardson Iterations converge. Later we will show that all θ^k_* can be determined concurrently, hence we do not need to wait until θ^{k−1}_t has converged before updating θ^k_t.

Sequential Estimation
We call Sequential Estimation the process where each value function v_k bootstraps off the previous value function v_{k−1} only once the latter has converged. The resulting θ^{k−1}_* is then fixed and used as a TD bootstrap target in Equation (4) to estimate the next θ^k_*. Convergence can be proved by induction. Given a convergent initial value, e.g. θ^0_* := θ_µ, or a previous solution θ^{k−1}_*, it remains to show that the induction step, Equation (4), converges with the now fixed bootstrap target θ^{k−1}_*. This update converges to θ^k_* by Proposition 1, even for unstable A_π, because X is positive definite. Hence sequential estimation is convergent. In Figure 2 (left) we estimate a sequence of value functions with their expected update for T = 250 steps each and can observe convergence to the off-policy target value. For sequential estimation we use a strictly optional hot-start heuristic where, after every 250 update steps, we initialize the next θ^{k+1}_t with the previous solution θ^k_* to accelerate convergence.

Proposition 4. Expected sequential estimation of chained TD is convergent.

Proof. Iterating Eq. (4) converges due to Proposition 1 for sufficiently small α because X is positive definite.

Concurrent Estimation
We call Concurrent Estimation the process where all value functions in the chain are updated simultaneously at each time step. In contrast to sequential training, we do not assume that the previous value function in the chain has converged. This estimation may for example be more convenient for online learning, but it requires a new proof of convergence. The proof works as follows: we will show that the matrix M (see (20)) corresponding to the joint TD update of all parameters has solely positive eigenvalues. Then viewing this expected concurrent update (see (19)) as Richardson Iteration implies the existence of a unique solution and convergence for a suitable step-size α. In Figure 2 (center) we train a sequence of value functions with their expected concurrent update and observe convergence in accordance with the proposition below. We also observe oscillations in the value predictions in early training. This effect vanishes eventually as the parameters converge. Nevertheless such oscillations may be inconvenient and their mitigation provides an interesting direction for future research. We present a simple mitigation technique of gradient normalization to reduce the pre-convergence oscillation magnitude in Figure 2 (right).

The Expected Concurrent Update of Chained TD
The expected update of all chain parameters {θ^k}_{0 ≤ k ≤ K} can be written as a joint update in matrix form using one block-structured update matrix M, where A_µ := Φ⊤D_µ(I − γP_µ)Φ and b_µ are the on-policy analogues of A_π and b_π:

$$\underbrace{\begin{bmatrix}\theta^{0}_{t+1}\\ \theta^{1}_{t+1}\\ \vdots\\ \theta^{K}_{t+1}\end{bmatrix}}_{\theta_{t+1}} = \underbrace{\begin{bmatrix}\theta^{0}_{t}\\ \theta^{1}_{t}\\ \vdots\\ \theta^{K}_{t}\end{bmatrix}}_{\theta_{t}} + \alpha\left(\underbrace{\begin{bmatrix}b_\mu\\ b_\pi\\ \vdots\\ b_\pi\end{bmatrix}}_{b} - \underbrace{\begin{bmatrix}A_\mu & & & 0\\ -\gamma Y & X & & \\ & \ddots & \ddots & \\ 0 & & -\gamma Y & X\end{bmatrix}}_{M}\,\theta_t\right) \qquad (19,\ 20)$$
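A small sketch (ours) that assembles the stacked vector b and the block matrix M of (19)-(20) from given A_µ, X, Y, b_µ and b_π, e.g. as produced by a construction like the one sketched after Section 4's definitions:

```python
import numpy as np

def concurrent_update_matrix(A_mu, X, Y, b_mu, b_pi, K, gamma):
    """Block lower-bidiagonal matrix M and stacked vector b of Equations (19)-(20),
    for a chain with value functions 0..K (block 0 is the on-policy estimate of v_mu)."""
    d = X.shape[0]
    M = np.zeros(((K + 1) * d, (K + 1) * d))
    b = np.concatenate([b_mu] + [b_pi] * K)
    M[:d, :d] = A_mu                                           # first diagonal block: A_mu
    for k in range(1, K + 1):
        M[k * d:(k + 1) * d, k * d:(k + 1) * d] = X            # diagonal blocks: X
        M[k * d:(k + 1) * d, (k - 1) * d:k * d] = -gamma * Y   # sub-diagonal blocks: -gamma Y
    return M, b

# With positive-definite A_mu and X, all eigenvalues of M are positive (Proposition 5), so
# Richardson iteration theta <- theta + alpha * (b - M @ theta) converges to M^{-1} b.
```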
Fixed Point and Convergence of the Concurrent TD Update
The formulation above allows us to employ Richardson Iteration to analyze the convergence properties of all simultaneously changing parameters by investigating M. As we will see, M has only positive eigenvalues, such that convergence to the unique solution θ_* = M^{−1}b follows. Contrary to GTD2 and TDC, a single step-size suffices.

[Figure 2 shows the predicted value against the number of update steps for three implementations (Sequential Estimation over 10,000 steps; Concurrent Estimation and Concurrent Estimation with gradient normalization over 600 steps each), with curves for the 1st up to the 32nd value in the chain and the target value of 10.]
Figure 2: Various implementations (all with step-size α = 0.1) of chained off-policy TD on Baird-Reward with discount 0.9, evaluated at state 8. Note that the target value at any state is 1/(1 − γ) = 10 and that all three displayed implementations approach the target value as k increases. Left: Sequential Estimation. Center: observe how Concurrent Estimation converges to the same correct results as Sequential Estimation at a faster pace, but with oscillations prior to reaching the target value. Right: Concurrent Estimation with gradient normalization. Note that the oscillations are reduced and that the predictions approach the target value.

Proposition 5. M has only positive eigenvalues.

Proof. We make use of the fact that the eigenvalues of a triangular block matrix are the union of the eigenvalues of its diagonal blocks. The diagonal blocks are A_µ and X. Since both are positive definite, M has positive eigenvalues.

Proposition 6. The expected concurrent update has the same unique fixed point as the sequential update: θ_* = [θ_µ, θ^1_*, …, θ^K_*].

Proof. From Proposition 5 it follows that M is invertible, hence θ_* = M^{−1}b is the unique fixed point of the joint update. Block-wise solving M^{−1}b leads to a recursion identical to Equation (14), the sequential fixed points.

Proposition 7. Expected concurrent chained TD is convergent. The expected update converges to the fixed point θ_* = [θ_µ, θ^1_*, …, θ^K_*] given a suitably small step-size.

Proof. Convergence to θ_* = M^{−1}b follows from Proposition 5 (the key matrix M has positive eigenvalues) and Proposition 1 (positive eigenvalues imply convergence). Then θ_* = [θ_µ, θ^1_*, …, θ^K_*] by Proposition 6.

5 Empirical Study

In the previous sections we have shown that the expected update of chained TD is guaranteed to converge for sequential and concurrent parameter updates. Furthermore we have shown that it is unbiased wrt. θ_π under mild assumptions. In this section we empirically study how the corresponding stochastic update for chained TD converges on a selection of MDPs, and observe favourable results. We compare to regular off-policy TD, Emphatic Temporal Differences (ETD), and two forms of Gradient Temporal Difference Learning (GTD2 and TDC). All but the first are proven to be stable and have different trade-offs in practice. In our study we observe that ETD, GTD2 and TDC can suffer more from variance than chained TD (and may even diverge for that reason) if the discount is large (γ = 0.99). However, they converge faster if the discount is small (γ = 0.9).

5.1 Methodology
While our method could also be applied offline, here we consider online off-policy learning, where the stochastic update samples one transition at a time according to µ and then updates all parameters using temporal-difference learning to estimate v_π. For chained TD we bootstrap from the previous value function in the chain, while the first chain estimates v_µ with TD(0).
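For illustration, here is a sketch of one online transition of concurrent chained TD with linear values; the array layout and the choice to compute all bootstrap targets from the pre-update parameters are our assumptions and are not prescribed by the paper.

```python
import numpy as np

def concurrent_chained_td_step(theta, phi_s, phi_s_next, reward, rho, alpha=0.05, gamma=0.9):
    """Update all chain parameters from a single transition sampled under mu.

    theta has shape (K+1, d): row 0 is the on-policy estimate of v_mu and row k bootstraps
    on row k-1. rho is the importance-sampling ratio of the executed action.
    """
    old = theta.copy()
    v_s = old @ phi_s                        # predictions v_k(S_t) for all k
    v_next = old @ phi_s_next                # bootstrap values v_k(S_{t+1}) for all k
    # k = 0: on-policy TD(0) towards v_mu (no importance weighting, bootstraps on itself).
    theta[0] += alpha * (reward + gamma * v_next[0] - v_s[0]) * phi_s
    # k >= 1: off-policy TD bootstrapping on the previous value function in the chain.
    for k in range(1, theta.shape[0]):
        delta = reward + gamma * v_next[k - 1] - v_s[k]
        theta[k] += alpha * rho * delta * phi_s
    return theta
```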
We consider three MDPs, all with a small discount of γ = 0.9 and a large discount of γ = 0.99, and evaluate algorithms according to the following experimental protocol: we evaluate the product of all relevant hyper-parameters for 100,000 transitions and select the result with the lowest mean squared error, averaged over the final 50% of transitions and over 10 seeds. We then select the best hyper-parameters and rerun the experiment with 100 new seeds. As hyper-parameters we consider all step-sizes α from the range S = {2^(−i/3) | i ∈ {1, …, 40}} (i.e. logarithmically spaced between 9.6·10^(−5) and 0.5); for GTD2 and TDC we also consider all secondary step-sizes β from the same range; for chained TD we consider chains of length 256 and evaluate the performance of only 9 indices k ∈ I = {2^i | i ∈ {0, …, 8}}. This can be seen as a more efficient concurrent equivalent of experimenting with 9 different chain lengths separately. For sequential chained TD we split the training into windows of T ∈ {25, 50, 100, 200} steps during which only one θ^k is estimated and all others are kept unchanged. To prevent pollution from accidentally good initial values we initialize all parameters from a Gaussian distribution with σ = 100, such that errors at t = 0 are high.

| RMSE for MDP | Baird (γ=0.9, no reward) | Baird-Reward (γ=0.9) | Threestate (γ=0.9) | Baird (γ=0.99, no reward) | Baird-Reward (γ=0.99) | Threestate (γ=0.99) |
|---|---|---|---|---|---|---|
| TD (no correction) | 0.0 | 10.0 | 10.1 | 0.0 | 99.3 | 102.8 |
| Off-Policy TD | div | div | div | div | div | div |
| ETD | 0.0 | div | 0.0 | 136.7 | div | div |
| GTD2 | 0.2 | 0.1 | 0.0 | 12.5 | 83.4 | 139.6 |
| TDC | 0.3 | 0.3 | 0.0 | 13.3 | 87.0 | 43.6 |
| Concurrent Chained TD | 0.0 | 0.4 | 0.1 | 0.0 | 72.6 | 77.9 |
| Sequential Chained TD | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2 |

Table 1: Evaluation of various 1-step TD algorithms on several MDPs. Observe that MDPs with large discount and rewards (Baird-Reward and Threestate) are the most challenging, and that only sequential chained TD learning obtains an RMSE close to 0. Results with an RMSE larger than 150 are considered divergent.

5.2 Diagnostic Markov Decision Processes

Baird's MDP With and Without Rewards
Baird's MDP is a classic example that demonstrates the divergence of off-policy TD with linear function approximation and has been used to evaluate the convergence of novel approaches. Originally proposed with a discount of γ = 0.99, it is often used with γ = 0.9, which results in lower-variance updates. We consider both discounts. Furthermore, we introduce a version of Baird's MDP with rewards, as the rewards of the classic MDP are all 0. By introducing rewards we are able to investigate the bias of various convergent algorithms. To see why this is interesting, consider divergent off-policy TD with a large ℓ2 regularization on θ. If the regularization is large enough it will push all parameters to 0, hence the value prediction will be 0 and match the target value of 0. This would be a stable but biased prediction if v_π ≠ 0. To measure the bias we introduce rewards such that v_π = 1/(1 − γ) (i.e. 10 or 100) and v_µ = 0, by rewarding each solid action with 1 and each dashed action with −1/6. We refer to this MDP as the Baird-Reward MDP.

The Threestate MDP
Inspired by the Twostate MDP (Tsitsiklis and Van Roy 1997; Sutton, Mahmood, and White 2016), which demonstrates the divergence of off-policy TD concisely without rewards and with only two states, we propose the Threestate MDP with one middle state, two border states and two actions: left with −1 reward and right with +1 reward, leading to the corresponding neighbouring state, or remaining if there is no further state in that direction.
The starting state distribution is uniform. As with Baird-Reward, introducing rewards permits us to measure the bias and convergence speed of various off-policy value predictors. We define the feature matrix

$$\Phi = \begin{bmatrix} 1 & 1 & 1\\ 1 & 2 & 1\\ 2 & 2 & 1 \end{bmatrix},$$

which has full rank such that any state-value combination can be represented by a linear function. Hence any observed bias is entirely due to the evaluated algorithm. The target policy is "right" in all states, while the behaviour is uniform. Again we consider γ = 0.9 and γ = 0.99 and observe that v_π = 1/(1 − γ) and v_µ = 0.

5.3 Experimental Results

Insights into 1-Step TD Estimators
In Table 1 we evaluate popular off-policy TD value estimators on three MDPs, each with two discounts (γ = 0.9 and γ = 0.99), and can observe that the larger discount is more challenging: only sequential chained TD obtains an RMSE close to 0 on all MDPs and discounts.

[Figure 3 shows the RMSE over 100,000 steps on Baird-Reward (γ = 0.99, with rewards), with curves for Uncorrected TD, Off-Policy TD, ETD, Chained TD and Sequentially Chained TD.]
Figure 3: Learning process of the 1-step TD algorithms corresponding to Table 1 on Baird's MDP with rewards. Observe that chained TD learning reduces the RMSE the most, with only sequential chained TD learning reducing the error entirely. Off-policy TD diverged and is off the scale.

Furthermore, we provide learning curves for Baird-Reward with discount γ = 0.99 in Figure 3. Learning curves corresponding to all entries in the table can be found in the appendix of (Schmitt, Shawe-Taylor, and van Hasselt 2022). At first we note that naive TD estimation (without off-policy correction) of v_µ is stable, but its bias wrt. v_π is noticeable in the MDPs with rewards (Baird-Reward and Threestate). It is desirable that an off-policy estimator is at least better than this naive baseline. However, on Baird's MDP without rewards it inadvertently predicts the correct value, hence we invite the reader to focus on Baird-Reward and Threestate. Next we observe that off-policy TD indeed either diverges or obtains a large error, where divergence could be slowed down by a low learning rate. ETD, GTD2 and TDC mostly fare well where the discount is small (γ = 0.9). For γ = 0.99, ETD diverges on the MDPs with rewards. GTD2 and TDC obtain errors on Threestate of 139.6 and 43.6 respectively; on Baird-Reward they reduce the RMSE to 83.4 and 87.0. Concurrent chained TD converges to the true value for the small discount γ = 0.9, and on Baird irrespective of the discount, while for the large discount it reduces the error to 72.6 and 77.9 on the challenging Baird-Reward and Threestate MDPs. Finally we observe that sequential chained TD converges close to the true value for all considered MDPs and discounts.

[Figure 4 shows, over 100,000 steps, the RMSE of the first eight values of a chained 1-step TD chain (top) and of a chained 8-step TD chain (bottom).]
Figure 4: Convergence behaviour with increasing k for chains with chained 1-step (top) vs. chained 8-step off-policy TD (bottom). We present the RMSE of the first eight k-th values that are learned concurrently, i.e. each bootstrapping off the previous value prediction. Observe how this leads to a sequence of increasingly better predictions. Finally, note that the RMSE of the 8th 8-step value prediction is lower on Threestate than that of the concurrent chained TD presented in Table 1, which only contains 1-step algorithms.

Chained N-step Estimators
The principle of chaining value functions can also be applied to n-step estimators.
N-step estimators predict the value of taking n steps with the target policy and then following the policy corresponding to the bootstrap target. Chaining k such estimators results in a total prediction of m = k·n steps following π. This allows us to predict the m-step expedition value v_m with a chain of fewer value functions. In Figure 4 we confirm this empirically on the Threestate MDP with γ = 0.99. One can see that a chain of length k = 8 built from 8-step estimators attains a much lower RMSE than a chain of 1-step estimators of the same length. This suggests that n-step estimators may permit the use of shorter chains. Using importance-sampling estimators to reduce the total length of the chain comes at the cost of increased variance. On the other hand, it may come with the benefit of faster convergence and lower bias. Overall there is a bias, variance and computational-complexity trade-off, and n-step estimators allow us to navigate it through the choice of n and k.

6 Conclusion

We present a novel family of off-policy value prediction algorithms that is convergent by construction. It works by chaining estimators that themselves do not need to be convergent. In particular, we prove convergence of sequential and concurrent chained TD, which comes with the intuitive interpretation of estimating the value of a k-step expedition: following π for k steps and then following µ indefinitely. Furthermore, we provide an analytic formula for the bias of chained TD which can be used to derive three insights:

- Sequential chained TD is equivalent to TD with target networks that are switched slowly (i.e. once the current objective has converged), allowing us to compute the bias of such target-network TD and to note ρ(γΠP_π) < 1 as the precise condition for its convergence.
- Sequential and concurrent chained TD are always convergent but may be biased, while off-policy TD may diverge and yield unbounded values when computing θ_π.
- Chained TD is unbiased wrt. θ_π in the theoretical limit of using infinitely many value functions if ρ(γΠP_π) < 1, e.g. on Baird's MDP where off-policy TD diverges.

Future work may be directed at chaining other updates, e.g. chained V-trace (Espeholt et al. 2018), chained Expected SARSA (van Seijen et al. 2009) or chained Retrace (Munos et al. 2016), and at investigating the bias vs. variance trade-off of those chained estimators. For example, better multi-step off-policy returns may lead to faster convergence. Chaining importance-sampling-free Q-learning can be used to estimate values off-policy even if no action probabilities were recorded. This may be useful to learn when the behaviour policy is unknown, e.g., from human demonstrations. Finally, for concurrent chaining, where all value functions in the chain are learned at the same time, the choice of which to select for acting may be taken at run-time, and potentially learnt, for example via bandits (Badia et al. 2020) or meta-gradients (Sutton 1992; Xu, van Hasselt, and Silver 2018).

Acknowledgements

We would like to thank Tom Zahavy and the anonymous AAAI 2022 reviewers for their valuable feedback.

References

Badia, A. P.; Sprechmann, P.; Vitvitskyi, A.; Guo, D.; Piot, B.; Kapturowski, S.; Tieleman, O.; Arjovsky, M.; Pritzel, A.; Bolt, A.; and Blundell, C. 2020. Never Give Up: Learning Directed Exploration Strategies. In International Conference on Learning Representations.

Baird, L. 1995. Residual Algorithms: Reinforcement Learning with Function Approximation. In Machine Learning: Proceedings of the Twelfth International Conference, 30–37.
Brandfonbrener, D.; Whitney, W. F.; Ranganath, R.; and Bruna, J. 2021. Offline RL Without Off-Policy Evaluation. In Thirty-Fifth Conference on Neural Information Processing Systems.

De Asis, K.; Chan, A.; Pitis, S.; Sutton, R.; and Graves, D. 2020. Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04): 3741–3748.

Espeholt, L.; Soyer, H.; Munos, R.; Simonyan, K.; Mnih, V.; Ward, T.; Doron, Y.; Firoiu, V.; Harley, T.; Dunning, I.; Legg, S.; and Kavukcuoglu, K. 2018. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. arXiv preprint.

Gulcehre, C.; Colmenarejo, S. G.; Wang, Z.; Sygnowski, J.; Paine, T.; Zolna, K.; Chen, Y.; Hoffman, M.; Pascanu, R.; and de Freitas, N. 2020. Addressing Extrapolation Error in Deep Offline Reinforcement Learning. In Advances in Neural Information Processing Systems.

Hackman, L. M. 2013. Faster Gradient-TD Algorithms. Master's thesis, University of Alberta.

Hauskrecht, M.; and Fraser, H. 2000. Planning treatment of ischemic heart disease with partially observable Markov decision processes. Artificial Intelligence in Medicine, 18(3): 221–244.

Hester, T.; Vecerik, M.; Pietquin, O.; Lanctot, M.; Schaul, T.; Piot, B.; Horgan, D.; Quan, J.; Sendonaris, A.; Osband, I.; Dulac-Arnold, G.; Agapiou, J.; Leibo, J.; and Gruslys, A. 2018. Deep Q-learning From Demonstrations. Proceedings of the AAAI Conference on Artificial Intelligence.

Jaderberg, M.; Mnih, V.; Czarnecki, W. M.; Schaul, T.; Leibo, J. Z.; Silver, D.; and Kavukcuoglu, K. 2017. Reinforcement learning with unsupervised auxiliary tasks. International Conference on Learning Representations.

Maei, H. R. 2011. Gradient temporal-difference learning algorithms. Ph.D. thesis, University of Alberta.

Mazoure, B.; Mineiro, P.; Srinath, P.; Sedeh, R. S.; Precup, D.; and Swaminathan, A. 2021. Improving Long-Term Metrics in Recommendation Systems using Short-Horizon Reinforcement Learning. arXiv:2106.00589.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540): 529–533.

Munos, R.; Stepleton, T.; Harutyunyan, A.; and Bellemare, M. 2016. Safe and Efficient Off-Policy Reinforcement Learning. In Lee, D.; Sugiyama, M.; Luxburg, U.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.

Richardson, L. F. 1911. The Approximate Arithmetical Solution by Finite Differences of Physical Problems Involving Differential Equations, with an Application to the Stresses in a Masonry Dam. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 210: 307–357.

Schmitt, S.; Shawe-Taylor, J.; and van Hasselt, H. 2022. Chaining Value Functions for Off-Policy Learning. CoRR, abs/2201.06468.

Sutton, R. S. 1992. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In Proceedings of the Tenth National Conference on Artificial Intelligence, 171–176. MIT Press.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA.

Sutton, R. S.; Maei, H. R.; Precup, D.; Bhatnagar, S.; Silver, D.; Szepesvári, C.; and Wiewiora, E. 2009.
Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), 993–1000. ACM.

Sutton, R. S.; Mahmood, A. R.; and White, M. 2016. An Emphatic Approach to the Problem of Off-policy Temporal Difference Learning. Journal of Machine Learning Research, 17(73): 1–29.

Sutton, R. S.; Modayil, J.; Delp, M.; Degris, T.; Pilarski, P. M.; White, A.; and Precup, D. 2011. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems, Volume 2, 761–768. International Foundation for Autonomous Agents and Multiagent Systems.

Tsitsiklis, J. N.; and Van Roy, B. 1997. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5): 674–690.

van Hasselt, H.; Doron, Y.; Strub, F.; Hessel, M.; Sonnerat, N.; and Modayil, J. 2018. Deep Reinforcement Learning and the Deadly Triad. CoRR, abs/1812.02648.

van Hasselt, H.; Mahmood, A. R.; and Sutton, R. S. 2014. Off-policy TD(λ) with a true online equivalence. In Uncertainty in Artificial Intelligence.

van Seijen, H.; van Hasselt, H.; Whiteson, S.; and Wiering, M. 2009. A theoretical and empirical analysis of Expected Sarsa. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 177–184.

Wiering, M. A.; and van Hasselt, H. 2007. Two Novel On-policy Reinforcement Learning Algorithms based on TD(λ)-methods. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, 280–287.

Wiering, M. A.; and van Hasselt, H. 2009. The QV family compared to other reinforcement learning algorithms. In 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 101–108.

Xu, Z.; van Hasselt, H. P.; and Silver, D. 2018. Meta-Gradient Reinforcement Learning. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.