# Separating value functions across time-scales

Joshua Romoff *1 2, Peter Henderson *3, Ahmed Touati 4 2, Emma Brunskill 3, Joelle Pineau 1 2, Yann Ollivier 2

*Equal contribution. 1 MILA, McGill University, 2 Facebook AI Research, 3 Stanford University, 4 MILA, Université de Montréal. Correspondence to: Joshua Romoff, Peter Henderson.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

In many finite horizon episodic reinforcement learning (RL) settings, it is desirable to optimize for the undiscounted return: in settings like Atari, for instance, the goal is to collect the most points while staying alive in the long run. Yet, it may be difficult (or even intractable) mathematically to learn with this target. As such, temporal discounting is often applied to optimize over a shorter effective planning horizon. This comes at the risk of potentially biasing the optimization target away from the undiscounted goal. In settings where this bias is unacceptable, where the system must optimize for longer horizons at higher discounts, the target of the value function approximator may increase in variance, leading to difficulties in learning. We present an extension of temporal difference (TD) learning, which we call TD(Δ), that breaks down a value function into a series of components based on the differences between value functions with smaller discount factors. The separation of a longer horizon value function into these components has useful properties in scalability and performance. We discuss these properties and show theoretical and empirical improvements over standard TD learning in certain settings.

1. Introduction

The goal of reinforcement learning (RL) algorithms is to learn a policy that optimizes the cumulative reward (return) provided by the environment. A discount factor $0 \le \gamma < 1$ can be used to optimize an exponentially decreasing function of the future return. Discounting is often used as a biased proxy for optimizing the cumulative reward, to reduce variance and make use of convenient theoretical convergence properties, making learning more efficient and stable (Bertsekas & Tsitsiklis, 1995; Prokhorov & Wunsch, 1997; Even-Dar & Mansour, 2003). However, in many of the complex tasks used for evaluating current state-of-the-art reinforcement learning systems (Mnih et al., 2013; OpenAI, 2018), it is more desirable to optimize for performance over long horizons. The optimal choice of discount factor, which balances asymptotic policy performance with learning ability, is often difficult, and solutions have ranged from scheduled curricula (OpenAI, 2018; Prokhorov & Wunsch, 1997; François-Lavet et al., 2015) to meta-gradient learning of the discount factor (Xu et al., 2018). OpenAI (2018), for example, start with a small discount factor and gradually increase it to bootstrap the learning process.

Rather than explicitly tackling the problem of discount selection, we make the observation that for any arbitrary discount factor, the discounted value function already encompasses all smaller timescales (discounts). This simple observation allows us to derive a novel method of generating separable value functions. That is, we can separate the value function into a number of partial estimators, which we call delta estimators, which approximate the difference $W_z = V_{\gamma_z} - V_{\gamma_{z-1}}$ between value functions.
Importantly, each delta estimator is learnable by itself, because it satisfies a Bellman-like equation based on the $W$s of shorter horizons. Thus, these delta estimators can be summed to yield the same discounted value function, and partial sums over the series yield the value functions at the smaller $\gamma_z$ values. The use of difference methods (the delta between two value functions at different time scales) leads us to call our method TD(Δ).

The separable nature of the full TD(Δ) estimator allows each component to be learned in a way that is optimal for that part of the overall value function. This means that, for example, the learning rate (and similarly other parameters) can be adjusted for each component, yielding overall faster convergence. Moreover, the components corresponding to smaller effective horizons can converge faster, bootstrapping the larger horizon components (at the risk of some bias). Our method provides a simple drop-in way to separate value functions in any TD-like algorithm to increase performance in a variety of settings, particularly in MDPs with dense rewards. We provide an intuitive method for setting intermediary γ values which yields performance gains, in most cases, without additional tuning. Yet, we also show that this method affords the option of fine-tuning for further performance improvement, and note that our method is compatible with adaptive γ selection methods (Xu et al., 2018). We demonstrate these benefits theoretically and highlight performance gains in a simple ring MDP, used by Kearns & Singh (2000) for a similar bias-variance analysis, by adjusting the k-step returns used to update each delta estimator. We also show how this method can be combined with TD(λ) (Sutton, 1984) and Generalized Advantage Estimation (GAE) (Schulman et al., 2015), leading to empirical gains in dense reward Atari games.

2. Related work

Many recent works approach discount factors in different ways. To our knowledge, the closest work to our own is that of Fedus et al. (2019), Sherstan et al. (2018), Sutton et al. (2011), Sutton (1995), Feinberg & Shwartz (1994), and Reinke et al. (2017), which learn ensembles of value functions at different time scales to form a generalized value function. In the case of Reinke et al. (2017), they do so to imitate the average return estimator. Feinberg & Shwartz (1994) examine an optimal policy for the mixture of two value functions with different discount factors. Similarly, Sutton (1995) presents learning value functions across different levels of temporal abstraction through mixing functions. In the case of Sherstan et al. (2018) and Sutton et al. (2011), they train a value function such that it can be queried for a given set of timescales. Finally, concurrent to this work, Fedus et al. (2019) re-weight multiple value functions across different discount factors to form a hyperbolic value function. However, we note that none of the aforementioned works utilize short-term estimates to train the longer-term value functions. Thus, while our method can similarly be used as a generalized value function, the ability to query smaller timescales is a side benefit of the performance increases yielded by separating value functions into different time scales via TD(Δ).

Some recent work has investigated how to precisely select the discount factor (François-Lavet et al., 2015; Xu et al., 2018). François-Lavet et al. (2015) suggest a particular scheduling mechanism, seen similarly in OpenAI (2018) and Prokhorov & Wunsch (1997).
Xu et al. (2018) propose a meta-gradient approach which learns the discount factor (and λ value) over time. All of these methods can be applied to our own, as we do not necessarily prescribe a final overall γ value to be used.

Finally, another broad category of work relates to our own in a somewhat peripheral way. Indeed, hierarchical reinforcement learning methods often decompose value functions or reward functions into a number of smaller systems which can be optimized somewhat separately (Dietterich, 2000; Henderson et al., 2018a; Hengst, 2002; Reynolds, 1999; Menache et al., 2002; Russell & Zimdars, 2003; van Seijen et al., 2017). These works learn hierarchical policies, paired with the decomposed value functions, which reflect the structure of the goals.

3. Background and notation

Consider a fully observable Markov Decision Process (MDP) (Bellman, 1957) $(\mathcal{S}, \mathcal{A}, P, r)$ with state space $\mathcal{S}$, action space $\mathcal{A}$, transition probabilities $P : \mathcal{S} \times \mathcal{A} \to (\mathcal{S} \to [0, 1])$ mapping state-action pairs to distributions over next states, and reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. At every timestep $t$, an agent is in a state $s_t$, can take an action $a_t$, receive a reward $r_t = r(s_t, a_t)$, and transition to its next state in the system $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. In the usual MDP setting, an agent optimizes the discounted return

$$V^\pi_\gamma(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s, \pi\Big],$$

where $\gamma$ is the discount factor and $\pi : \mathcal{S} \to (\mathcal{A} \to [0, 1])$ is the policy that the agent follows. $V^\pi_\gamma$ can be obtained as the fixed point of the Bellman operator $\mathcal{T}^\pi V^\pi = r^\pi + \gamma P^\pi V^\pi$, where $r^\pi$ and $P^\pi$ are, respectively, the expected immediate reward and the transition probability operator induced by the policy $\pi$. In the rest of the paper, we drop the superscript $\pi$ to avoid clutter in the formulas.

The value estimate $\hat{V}_\gamma$ may approximate the true value function $V_\gamma$ via temporal difference (TD) learning (Sutton, 1984). Given a transition $(s_t, a_t, r_t, s_{t+1})$, we can update our value function using the one-step TD error:

$$\delta^\gamma_t = r_t + \gamma \hat{V}_\gamma(s_{t+1}) - \hat{V}_\gamma(s_t).$$

Alternatively, given an entire trajectory, we can instead use the discounted sum of one-step TD errors, which is commonly referred to as either the λ-return (Sutton, 1984) or, equivalently, the generalized advantage estimator (GAE) (Schulman et al., 2015):

$$A(s_t) = \sum_{k=0}^{\infty} (\lambda\gamma)^k \delta^\gamma_{t+k}, \qquad (1)$$

where λ controls the bias-variance trade-off. With function approximation, we use a parameterized value function $\hat{V}_\gamma(\cdot\,; \theta)$ and then update our value function via the following loss:

$$L(\theta) = \mathbb{E}\Big[\big(\hat{V}_\gamma(s; \theta) - (\hat{V}_\gamma(s) + A(s))\big)^2\Big]. \qquad (2)$$

In actor-critic methods (Sutton et al., 2000; Konda & Tsitsiklis, 2000; Mnih et al., 2016), the value function is updated per Equation 2, and a stochastic parameterized policy (the actor, $\pi_\omega(a \mid s)$) is learned from this value estimator via the advantage function, where the loss is:

$$L(\omega) = \mathbb{E}\big[-\log \pi(a \mid s; \omega)\, A(s)\big]. \qquad (3)$$

Building on top of actor-critic methods, Proximal Policy Optimization (PPO) (Schulman et al., 2017) constrains the policy update to a given optimization region (a trust region) in the form of a clipping objective between the current and old parameters, $\omega$ and $\omega_{\text{old}}$:

$$L(\omega) = \mathbb{E}\big[\min\big(\rho(\omega) A(s),\; \psi(\omega) A(s)\big)\big], \qquad (4)$$

where $\rho(\omega) = \frac{\pi_\omega(a \mid s)}{\pi_{\omega_{\text{old}}}(a \mid s)}$ is the likelihood ratio, $\psi(\omega) = \operatorname{clip}(\rho(\omega), 1 - \epsilon, 1 + \epsilon)$ is the clipped likelihood ratio, and $\epsilon < 1$ is some small factor applied to constrain the update.
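To make the notation concrete, the following is a minimal NumPy sketch (our own helper name, not code from the paper) of the truncated λ-return/GAE in Equation 1, computed with the usual backward recursion over one-step TD errors:

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Truncated lambda-return / GAE (Eq. 1).

    rewards: r_t for t = 0..T-1
    values:  V_hat(s_t) for t = 0..T (the last entry is the bootstrap value)
    Returns A(s_t) for each t in the rollout.
    """
    T = len(rewards)
    rewards, values = np.asarray(rewards, float), np.asarray(values, float)
    deltas = rewards + gamma * values[1:] - values[:-1]   # one-step TD errors
    advantages = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):                          # A_t = delta_t + (lam * gamma) * A_{t+1}
        acc = deltas[t] + lam * gamma * acc
        advantages[t] = acc
    return advantages
```

Under this convention, the regression target in Equation 2 is simply `values[:-1] + advantages`.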
4. TD(Δ)

In this section, we introduce TD(Δ), along with several variations based on multi-step TD, TD(λ), and GAE.

4.1. Single-step TD(Δ)

Consider learning with $Z + 1$ different discount factors $\Delta := \{\gamma_0, \gamma_1, \ldots, \gamma_Z\}$. Each of these defines a corresponding value function $V_{\gamma_z}$. We define the delta functions $W_z$ by

$$W_z := V_{\gamma_z} - V_{\gamma_{z-1}}, \qquad W_0 := V_{\gamma_0}. \qquad (5)$$

This results in $Z + 1$ delta functions such that the desired $V_{\gamma_z}$ is simply the sum of the delta functions:

$$V_{\gamma_z}(s) = \sum_{i=0}^{z} W_i(s). \qquad (6)$$

We can derive a Bellman-like equation for the delta functions $W$. Indeed, $W_0 = V_{\gamma_0}$ satisfies the Bellman equation

$$W_0(s_t) = \mathbb{E}\big[r_t + \gamma_0 W_0(s_{t+1})\big], \qquad (7)$$

while the delta functions at larger time scales satisfy:

$$\begin{aligned}
W_z(s_t) &= V_{\gamma_z}(s_t) - V_{\gamma_{z-1}}(s_t) \\
&= \mathbb{E}\big[(r_t + \gamma_z V_{\gamma_z}(s_{t+1})) - (r_t + \gamma_{z-1} V_{\gamma_{z-1}}(s_{t+1}))\big] \\
&= \mathbb{E}\big[\gamma_z W_z(s_{t+1}) + \gamma_z V_{\gamma_{z-1}}(s_{t+1}) - \gamma_{z-1} V_{\gamma_{z-1}}(s_{t+1})\big] \\
&= \mathbb{E}\big[(\gamma_z - \gamma_{z-1}) V_{\gamma_{z-1}}(s_{t+1}) + \gamma_z W_z(s_{t+1})\big]. \qquad (8)
\end{aligned}$$

This is a Bellman-type equation for $W_z$, with decay factor $\gamma_z$ and rewards $(\gamma_z - \gamma_{z-1}) V_{\gamma_{z-1}}(s_{t+1})$. Thus, we can use it to define the expected TD update for $W_z$. Note that in this expression, $V_{\gamma_{z-1}}(s_{t+1})$ can be expanded as the sum of $W_i(s_{t+1})$ for $i \le z-1$, so that the Bellman equation for $W_z$ depends on the values of all delta functions $W_i$, $i \le z-1$. This way, the delta value function at a given timescale appears as an autonomous reinforcement learning problem, with rewards coming from the value function of the immediately lower timescale. Thus, for a target discounted value function $V_{\gamma_Z}(s)$, we can train all the delta components in parallel according to this TD update, bootstrapping off of the old values of all the estimators.

Of course, this requires assuming a sequence of $\gamma_z$ values, including a smallest and a largest discount, $\gamma_0$ and $\gamma_Z$. We will see in Section 6.3 that these can affect results, further allowing tuning. However, to avoid the addition of a number of hyperparameters, we assume a simple sequence where we double the effective horizon of the $\gamma_z$ values until the final $\gamma_Z$ value is reached. This simple sequence of $\gamma$s, without tuning, yields performance gains in many settings, as seen in Section 6.2.
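As an illustration of the single-step case, here is a minimal tabular sketch (our own naming and array layout, assuming a finite state space) of the expected TD update implied by Equations 7 and 8, where every component bootstraps off the old values of all the estimators:

```python
import numpy as np

def td_delta_update(W, s, r, s_next, gammas, alphas):
    """One single-step TD(Delta) update of every delta component (Eqs. 7-8).

    W:      array of shape (Z+1, num_states) holding the current estimates W_hat_z
    gammas: discounts gamma_0 <= gamma_1 <= ... <= gamma_Z
    alphas: per-component learning rates
    """
    gammas, alphas = np.asarray(gammas, float), np.asarray(alphas, float)
    Z = len(gammas) - 1
    # V_hat_{gamma_z}(s') = sum_{i <= z} W_hat_i(s')  (Eq. 6)
    v_next = np.cumsum(W[:, s_next])
    targets = np.empty(Z + 1)
    # W_0 follows the ordinary Bellman equation with discount gamma_0 (Eq. 7).
    targets[0] = r + gammas[0] * W[0, s_next]
    for z in range(1, Z + 1):
        # For z >= 1 the "reward" is (gamma_z - gamma_{z-1}) * V_hat_{gamma_{z-1}}(s')  (Eq. 8).
        targets[z] = (gammas[z] - gammas[z - 1]) * v_next[z - 1] + gammas[z] * W[z, s_next]
    # Apply all updates only after computing the targets, so that each component
    # bootstraps off the old value of every estimator.
    W[:, s] += alphas * (targets - W[:, s])
    return W
```

The per-component learning rates `alphas` are exactly the knob discussed above: shorter-horizon components can, in principle, be learned with different step sizes than longer-horizon ones.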
4.2. Multi-step TD(Δ)

In many scenarios, it has been shown that multi-step TD is more efficient than single-step TD (Sutton & Barto, 1998). We can easily extend TD(Δ) to the multi-step case as follows. To begin, since $W_0 := V_{\gamma_0}$, the multi-step target for $W_0$ is identical to the standard multi-step target with $\gamma = \gamma_0$. For all other $W$s, we can unroll both the bootstrap term and the rewards from the previous value function in Section 4.1:

$$W_0(s_t) = \mathbb{E}\Big[\sum_{i=0}^{k_0 - 1} \gamma_0^i r_{t+i} + \gamma_0^{k_0} W_0(s_{t+k_0})\Big],$$

$$\begin{aligned}
W_z(s_t) &= \mathbb{E}\big[(\gamma_z - \gamma_{z-1}) V_{\gamma_{z-1}}(s_{t+1}) + \gamma_z W_z(s_{t+1})\big] \\
&= \mathbb{E}\big[(\gamma_z - \gamma_{z-1}) r_{t+1} + \gamma_{z-1}(\gamma_z - \gamma_{z-1}) V_{\gamma_{z-1}}(s_{t+2}) + \gamma_z(\gamma_z - \gamma_{z-1}) V_{\gamma_{z-1}}(s_{t+2}) + \gamma_z^2 W_z(s_{t+2})\big] \\
&= \mathbb{E}\big[(\gamma_z - \gamma_{z-1}) r_{t+1} + (\gamma_z^2 - \gamma_{z-1}^2) V_{\gamma_{z-1}}(s_{t+2}) + \gamma_z^2 W_z(s_{t+2})\big] \\
&\;\;\vdots \\
&= \mathbb{E}\Big[\sum_{i=1}^{k_z - 1} (\gamma_z^i - \gamma_{z-1}^i) r_{t+i} + (\gamma_z^{k_z} - \gamma_{z-1}^{k_z}) V_{\gamma_{z-1}}(s_{t+k_z}) + \gamma_z^{k_z} W_z(s_{t+k_z})\Big]. \qquad (9)
\end{aligned}$$

Thus, each $W_z$ receives a fraction of the rewards from the environment up to time-step $k_z - 1$. Additionally, each $W$ bootstraps off of its own value function as well as the value at the previous time-scale. A version of this algorithm, based on $k$-step bootstrapping from Sutton & Barto (1998), can be seen in Algorithm 1. We also note that while Algorithm 1 has quadratic complexity w.r.t. $Z$, we can make the algorithm linear in implementation for large $Z$ by storing the $\hat{V}$ values at each timescale $\gamma_z$.

Algorithm 1: Multi-step TD(Δ)
  Inputs: $(\gamma_0, \gamma_1, \ldots, \gamma_Z)$, $(k_0, k_1, \ldots, k_Z)$, $(\alpha_0, \alpha_1, \ldots, \alpha_Z)$
  Initialize $\hat{W}_z(\cdot) = 0 \;\; \forall z$
  for $t = 0, 1, 2, \ldots$ do
    Take step according to policy and store $(s_t, r_t, s_{t+1})$
    if $t \ge k_Z$ then
      $\tau \leftarrow t - k_Z + 1$
      for $z \in \{0, 1, \ldots, Z\}$ do
        if $z = 0$ then
          $G^0_\tau \leftarrow \sum_{i=\tau}^{\tau+k_0-1} \gamma_0^{\,i-\tau} r_i + \gamma_0^{k_0} \hat{W}_0(s_{\tau+k_0})$
        else
          $G^z_\tau \leftarrow \sum_{i=\tau+1}^{\tau+k_z-1} (\gamma_z^{\,i-\tau} - \gamma_{z-1}^{\,i-\tau}) r_i + (\gamma_z^{k_z} - \gamma_{z-1}^{k_z}) \sum_{g=0}^{z-1} \hat{W}_g(s_{\tau+k_z}) + \gamma_z^{k_z} \hat{W}_z(s_{\tau+k_z})$
        end if
      end for
      for $z \in \{0, 1, \ldots, Z\}$ do
        $\hat{W}_z(s_\tau) \leftarrow \hat{W}_z(s_\tau) + \alpha_z \big[G^z_\tau - \hat{W}_z(s_\tau)\big]$
      end for
    end if
  end for

4.3. TD(λ, Δ)

The traditional TD(λ) (Sutton, 1984) uses the following λ-return as a target for its update rules:

$$G^{\gamma,\lambda}_t = \hat{V}_\gamma(s_t) + \sum_{k=0}^{\infty} (\lambda\gamma)^k \delta^\gamma_{t+k}. \qquad (10)$$

The underlying TD(λ) operator can be written:

$$T_\lambda V = V + (I - \lambda\gamma P)^{-1}(TV - V). \qquad (11)$$

Similarly, for each $W_z$ we can define a λ-return:

$$G^{z,\lambda_z}_t := \hat{W}_z(s_t) + \sum_{k=0}^{\infty} (\lambda_z\gamma_z)^k \delta^z_{t+k}, \qquad (12)$$

where $\delta^0_t := \delta^{\gamma_0}_t$ and $\delta^z_t := (\gamma_z - \gamma_{z-1}) \hat{V}_{\gamma_{z-1}}(s_{t+1}) + \gamma_z \hat{W}_z(s_{t+1}) - \hat{W}_z(s_t)$ are the TD errors.

4.4. TD(λ, Δ) with GAE

Since GAE is used in powerful policy gradient baselines (Schulman et al., 2017), we propose a simple extension of TD(Δ) that leverages GAE. Specifically, to train the policy we use the following generalized advantage estimator:

$$A^\Delta(s_t) := \sum_{k=0}^{\infty} (\lambda_Z\gamma_Z)^k \delta^\Delta_{t+k}, \qquad (13)$$

where $\delta^\Delta_t := r_t + \gamma_Z \sum_{z=0}^{Z} \hat{W}_z(s_{t+1}) - \sum_{z=0}^{Z} \hat{W}_z(s_t)$. Thus, we use $\gamma_Z$ as our discount factor and the sum of all our $W$ estimators as a replacement for $V_{\gamma_Z}$. This objective can easily be applied to PPO by using the policy update from Eq. 4 and replacing $A$ with $A^\Delta$. Similarly, to train each $W_z$, we use a truncated version of its respective λ-return defined in Equation 12. See Algorithm 2 for details.

Algorithm 2: PPO-TD(λ, Δ)
  Initialize policy $\omega$ and values $\theta_z \;\; \forall z$
  for $t = 0, 1, 2, \ldots$ do
    Take step according to $\pi_\omega$ and store $(s_t, a_t, r_t, s_{t+1})$
    if $t \ge T$ then
      $G^{z,\lambda_z} \leftarrow \hat{W}_z(s_{t-T}) + \sum_{k=0}^{T-1} (\lambda_z\gamma_z)^k \delta^z_{t-T+k} \;\; \forall z$
      $A^\Delta \leftarrow \sum_{k=0}^{T-1} (\lambda_Z\gamma_Z)^k \delta^\Delta_{t-T+k}$
      Update $\theta_z$ with TD (Eq. 2) using $G^{z,\lambda_z} \;\; \forall z$
      Update $\omega$ with PPO (Eq. 4) using $A^\Delta$
    end if
  end for
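To make the targets in Algorithm 2 concrete, the following is a minimal NumPy sketch (our own function name and array layout, not the released implementation) that computes the truncated per-component λ-returns of Equation 12 and the TD(Δ) advantage of Equation 13 from a finite rollout:

```python
import numpy as np

def td_delta_gae_targets(rewards, W, gammas, lams):
    """Per-component lambda-returns (Eq. 12) and the TD(Delta) advantage (Eq. 13).

    rewards: shape (T,)         rewards r_0 .. r_{T-1}
    W:       shape (Z+1, T+1)   predicted W_hat_z(s_t), including s_T for bootstrapping
    gammas, lams: per-component discounts and lambdas (index Z is the longest horizon)
    """
    Zp1, T = W.shape[0], len(rewards)
    V = W.cumsum(axis=0)        # V[z, t] = V_hat_{gamma_z}(s_t) = sum_{u <= z} W_u(s_t)  (Eq. 6)

    # Per-component TD errors: delta^0 uses the environment reward, while delta^z
    # (z >= 1) uses (gamma_z - gamma_{z-1}) * V_hat_{gamma_{z-1}}(s_{t+1}) as its reward.
    deltas = np.zeros((Zp1, T))
    deltas[0] = rewards + gammas[0] * W[0, 1:] - W[0, :-1]
    for z in range(1, Zp1):
        deltas[z] = (gammas[z] - gammas[z - 1]) * V[z - 1, 1:] + gammas[z] * W[z, 1:] - W[z, :-1]

    # Truncated lambda-returns G^{z, lambda_z} via a backward recursion (Eq. 12).
    G = np.zeros((Zp1, T))
    for z in range(Zp1):
        acc = 0.0
        for t in reversed(range(T)):
            acc = deltas[z, t] + gammas[z] * lams[z] * acc
            G[z, t] = W[z, t] + acc

    # TD(Delta) advantage for the policy (Eq. 13), with V_hat_{gamma_Z} = sum_z W_z.
    delta_full = rewards + gammas[-1] * V[-1, 1:] - V[-1, :-1]
    A = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = delta_full[t] + gammas[-1] * lams[-1] * acc
        A[t] = acc
    return G, A
```

Each row `G[z]` is then regressed against its own head via Equation 2, while `A` replaces the standard GAE in the PPO update of Equation 4.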
5. Analysis

We now analyze our estimators more formally. The goal is for our estimator to provide favorable bias-variance trade-offs under some circumstances (as we shall see experimentally). To shed light on this, we first start by illustrating when our estimator is identical to the single estimator $\hat{V}_\gamma$ (Theorem 1), which gives insight into the important quantities of our estimator that can determine when we may achieve benefits over the standard $\hat{V}_\gamma$ estimator. Then, motivated by these results and prior work by Kearns & Singh (2000), we bound the error of our estimator in terms of a variance and a bias term (Theorem 4), which also yields insight into how to trade off these quantities to achieve the best result.

5.1. Equivalence settings and improvement

In some cases, we can show that our TD(Δ) update and its variations are equivalent to the non-delta estimator $V_\gamma$ when recomposed into a value function. In particular, we focus here on linear function approximation of the form $\hat{V}_\gamma(s) := \langle \theta^\gamma, \phi(s) \rangle$ and $\hat{W}_z(s) := \langle \theta^z, \phi(s) \rangle \;\forall z$, where $\theta^\gamma$ and $\{\theta^z\}_z$ are weight vectors in $\mathbb{R}^d$ and $\phi : \mathcal{S} \to \mathbb{R}^d$ is a feature map from a state to a given $d$-dimensional feature space. Let $\theta^\gamma$ be updated using TD(λ) as follows:

$$\theta^\gamma_{t+1} = \theta^\gamma_t + \alpha \big(G^{\gamma,\lambda}_t - \hat{V}_\gamma(s_t)\big)\phi(s_t), \qquad (14)$$

where $G^{\gamma,\lambda}_t$ is the TD(λ) return defined in Equation 10. Similarly, each $\hat{W}_z$ is updated using TD(λ, Δ) as follows:

$$\theta^z_{t+1} = \theta^z_t + \alpha_z \big(G^{z,\lambda_z}_t - \hat{W}_z(s_t)\big)\phi(s_t), \qquad (15)$$

where $G^{z,\lambda_z}_t$ is the TD(Δ) return defined in Equation 12. Here, $\alpha$ and $\{\alpha_z\}_z$ are positive learning rates. The following theorem establishes the equivalence of the two algorithms.

Theorem 1. If $\alpha_z = \alpha$ and $\lambda_z\gamma_z = \lambda\gamma \;\forall z$, and if we pick the initial conditions such that $\sum_{z=0}^{Z} \theta^z_0 = \theta^\gamma_0$, then the iterates produced by TD(λ) (Eq. 14) and TD(λ, Δ) (Eq. 15) with linear function approximation satisfy:

$$\sum_{z=0}^{Z} \theta^z_t = \theta^\gamma_t \quad \forall t. \qquad (16)$$

(The proof is provided in the Supplemental.)

Note that the equivalence is achieved when $\lambda_z\gamma_z = \lambda\gamma \;\forall z$. When $\lambda$ is close to 1 and $\gamma_z < \gamma$, the latter condition implies that $\lambda_z = \frac{\lambda\gamma}{\gamma_z}$ could potentially be larger than one. One might conclude that TD($\lambda_z$) could then diverge. Fortunately, we show in the next theorem that the TD(λ) operator defined in Equation 11 is a contraction mapping for $1 \le \lambda < \frac{1+\gamma}{2\gamma}$, which implies that $\lambda\gamma < 1$.

Theorem 2. For all $\lambda \in \big[0, \frac{1+\gamma}{2\gamma}\big)$, the operator $T_\lambda$ defined as $T_\lambda V = V + (I - \lambda\gamma P)^{-1}(TV - V)$ for all $V \in \mathbb{R}^{|\mathcal{S}|}$ is well defined. Moreover, $T_\lambda$ is a contraction with respect to the max norm, and its contraction coefficient is equal to $\frac{\gamma|1-\lambda|}{1-\lambda\gamma}$. (The proof is provided in the Supplemental.)

Similarly, we can consider learning each $W_z$ using $k_z$-step TD(Δ) instead of TD(λ, Δ). In this case, the analysis of Theorem 1 could be extended to show that, with linear function approximation, standard multi-step TD and multi-step TD(Δ) are equivalent if $k_z = k \;\forall z$. However, we note that the equivalence with unmodified TD learning is the exception rather than the rule. For one, in order to achieve equivalence we require the same learning rate across every time scale. This is a strong restriction, as intuitively the shorter timescales can be learned faster than the longer ones. Further, adaptive optimizers are typically used in the nonlinear approximation setting (Henderson et al., 2018c; Schulman et al., 2017). Thus, the effective rate of learning can differ depending on the properties of each delta estimator and its target. In principle, the optimizer can automatically adapt the learning to be different for the shorter and longer time scales. Besides the learning rate, such a decomposition allows for some particularly helpful properties not afforded to the non-delta estimator. In particular, every $W_z$ delta component need not use the same $k$-step return (or λ-return) as the non-delta estimator (or the higher $W_z$ components). Specifically, if $k_z < k_{z+1} \;\forall z$ (or $\gamma_z\lambda_z < \gamma_{z+1}\lambda_{z+1} \;\forall z$), then there is the possibility for variance reduction (at the risk of some bias introduction).

5.2. Analysis for reducing $k_z$ values

To see intuitively how our method differs from the single estimator case, let us consider the tabular phased version of $k$-step TD studied by Kearns & Singh (2000). In this setting, starting from each state $s \in \mathcal{S}$, we generate $n$ trajectories $\{s^{(j)}_0 = s, a^{(j)}_0, r^{(j)}_0, \ldots, s^{(j)}_k, a^{(j)}_k, r^{(j)}_k, s^{(j)}_{k+1}, \ldots\}_{1 \le j \le n}$ following policy $\pi$. For each iteration $t$, also called phase $t$, the value function estimate for $s$ is defined as follows:

$$\hat{V}_{\gamma,t}(s) = \frac{1}{n}\sum_{j=1}^{n}\Big[\sum_{i=0}^{k-1} \gamma^i r^{(j)}_i + \gamma^k \hat{V}_{\gamma,t-1}(s^{(j)}_k)\Big]. \qquad (17)$$

The following theorem from Kearns & Singh (2000) provides an upper bound on the error in the value function estimates, defined by $\Delta^{\hat{V}_\gamma}_t := \max_s |\hat{V}_{\gamma,t}(s) - V_\gamma(s)|$.

Theorem 3 (Kearns & Singh, 2000). For any $0 < \delta < 1$, let $\epsilon = \sqrt{\frac{2\log(2k/\delta)}{n}}$. With probability $1 - \delta$,

$$\Delta^{\hat{V}_\gamma}_t \le \underbrace{\epsilon\,\frac{1-\gamma^k}{1-\gamma}}_{\text{variance term}} + \underbrace{\gamma^k\,\Delta^{\hat{V}_\gamma}_{t-1}}_{\text{bias term}}. \qquad (18)$$

(The proof is provided in the Supplemental.)

The first term, $\epsilon\,\frac{1-\gamma^k}{1-\gamma}$, in the bound in Eq. 18 is a variance term arising from sampling transitions. In particular, $\epsilon$ bounds the deviation of the empirical average of rewards from the true expected reward. The second term is a bias term due to bootstrapping off of the current value estimate.

Similarly, we consider a phased version of multi-step TD(Δ). For each phase $t$, we update each $W$ as follows:

$$\hat{W}_{z,t}(s) = \frac{1}{n}\sum_{j=1}^{n}\Big[\sum_{i=1}^{k_z-1} (\gamma_z^i - \gamma_{z-1}^i) r^{(j)}_i + (\gamma_z^{k_z} - \gamma_{z-1}^{k_z}) \hat{V}_{\gamma_{z-1},t-1}(s^{(j)}_{k_z}) + \gamma_z^{k_z} \hat{W}_{z,t-1}(s^{(j)}_{k_z})\Big]. \qquad (19)$$
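As a concrete illustration of the phased setting, the following is a small tabular sketch (our own function name and data layout, assuming the $n$ trajectories per start state have already been sampled and stored) of one phase of updates via Equations 17 and 19:

```python
import numpy as np

def phased_td_delta(W_prev, rewards, states, gammas, ks):
    """One phase of tabular phased TD(Delta) (Eqs. 17 and 19).

    W_prev:  shape (Z+1, S)         phase t-1 estimates W_hat_{z, t-1}
    rewards: shape (S, n, k_max)    rewards[s, j, i] = r_i^{(j)} for trajectories started at s
    states:  shape (S, n, k_max+1)  states[s, j, i] = s_i^{(j)}, the state reached after i steps
    gammas, ks: per-component discounts and rollout lengths (k_0 <= ... <= k_Z)
    """
    Zp1, S = W_prev.shape
    n = rewards.shape[1]
    V_prev = W_prev.cumsum(axis=0)      # V_hat_{gamma_z, t-1} = sum_{u <= z} W_hat_{u, t-1}
    W_new = np.zeros_like(W_prev)
    for s in range(S):
        for j in range(n):
            # z = 0: standard phased k-step TD with discount gamma_0 (Eq. 17).
            k0 = ks[0]
            W_new[0, s] += (np.sum(gammas[0] ** np.arange(k0) * rewards[s, j, :k0])
                            + gammas[0] ** k0 * W_prev[0, states[s, j, k0]])
            # z >= 1: mixed reward coefficients plus two bootstrap terms (Eq. 19).
            for z in range(1, Zp1):
                kz = ks[z]
                i = np.arange(1, kz)
                coeffs = gammas[z] ** i - gammas[z - 1] ** i
                boot = states[s, j, kz]
                W_new[z, s] += (np.sum(coeffs * rewards[s, j, 1:kz])
                                + (gammas[z] ** kz - gammas[z - 1] ** kz) * V_prev[z - 1, boot]
                                + gammas[z] ** kz * W_prev[z, boot])
    return W_new / n
```

When all $k_z$ are equal, summing the updated components telescopes back to the phased TD estimate of Eq. 17, consistent with the equal upper bounds noted below.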
We now establish an upper bound on the error of phased TD(Δ), defined as the sum of the errors incurred by each $W$ component, $\sum_{z=0}^{Z} \Delta^z_t$, where $\Delta^z_t := \max_s |\hat{W}_{z,t}(s) - W_z(s)|$.

Theorem 4. Assume that $\gamma_0 \le \gamma_1 \le \ldots \le \gamma_Z = \gamma$ and $k_0 \le k_1 \le \ldots \le k_Z = k$. For any $0 < \delta < 1$, let $\epsilon = \sqrt{\frac{2\log(2k/\delta)}{n}}$. With probability $1 - \delta$,

$$\sum_{z=0}^{Z} \Delta^z_t \;\le\; \epsilon\,\frac{1-\gamma^k}{1-\gamma} \;+\; \underbrace{\epsilon \sum_{z=0}^{Z-1} \frac{\gamma_z^{k_{z+1}} - \gamma_z^{k_z}}{1-\gamma_z}}_{\text{variance reduction}} \;+\; \gamma^k \sum_{z=0}^{Z} \Delta^z_{t-1} \;+\; \underbrace{\sum_{z=0}^{Z-1} (\gamma_z^{k_z} - \gamma_z^{k_{z+1}}) \sum_{u=0}^{z} \Delta^u_{t-1}}_{\text{bias introduction}}. \qquad (20)$$

(The proof is provided in the Supplemental.)

Comparing the bound for phased TD in Theorem 3 with the one for phased TD(Δ) in Theorem 4, we see that the latter allows for a variance reduction equal to $\epsilon \sum_{z=0}^{Z-1} \frac{\gamma_z^{k_{z+1}} - \gamma_z^{k_z}}{1-\gamma_z} \le 0$, but it suffers from a potential bias introduction equal to $\sum_{z=0}^{Z-1} (\gamma_z^{k_z} - \gamma_z^{k_{z+1}}) \sum_{u=0}^{z} \Delta^u_{t-1} \ge 0$. This is due to the compounding bias from all shorter-horizon estimates. We note that in the case where all $k_z$ are equal, we obtain the same upper bound for both algorithms.

It is a well known and often used result that the expected discounted return over $T$ steps is close to the infinite-horizon discounted expected return when $T \ge \frac{1}{1-\gamma}$ (Kearns & Singh, 2002). Thus, we can conveniently reduce $k_z$ for any $\gamma_z$, as long as $k_z \ge \frac{1}{1-\gamma_z}$, so that we follow this rule. Thus, if we have $T$ samples, we can have an excellent bias-variance compromise on all timescales $\ll T$ by choosing $k_z = \frac{1}{1-\gamma_z}$, so that $\gamma_z^{k_z}$ is bounded by a constant (since $\gamma_z^{1/(1-\gamma_z)} \le e^{-1}$) for all $z$. This provides intuitive ways to set both $\gamma_z$ and $k_z$ values (as well as all other parameters) without necessarily searching. We can double the effective horizon at each increasing $W_z$ (to keep a logarithmic number of value functions with respect to the horizon) and similarly adjust all other parameters for estimation.

6. Experiments

All hyperparameter settings, extended details, and the reproducibility checklist for machine learning research (Pineau, 2018) can be found in the Supplemental (link to code: github.com/facebookresearch/td-delta).

6.1. Tabular

Figure 1. (Left) $\gamma_Z = .9375$, 250 random seeds on the 5-state ring MDP. Error denotes the absolute error against the true discounted value function (pre-computed ahead of time using Value Iteration), averaged across the entire learning trajectory (5000 timesteps). Error bars denote standard error across random seeds. (Right) The average absolute error for the optimal learning rate at each k-step return up to the effective planning horizon of $\gamma_Z$.

We use the same 5-state ring MDP as in Kearns & Singh (2000), a diagram of which is available in the Supplemental for clarity, to demonstrate performance gains under decreasing k-step regimes as described in Section 5.1. For all experiments we provide a variable number of gammas, starting with 0 and increasing according to $\gamma_{z+1} = \frac{\gamma_z + 1}{2}$ until the maximum desired $\gamma_Z$ is reached. Similarly, $k_z := \frac{1}{1-\gamma_z} \;\forall z$, as described earlier. The baseline is a single estimator with $\gamma = \gamma_Z$, $k = k_Z$. We run a grid of various $\gamma_Z$ and $k_Z$ values and use standard TD-style updates (Sutton, 1988) for our experiments. We compare against the true error, which can be calculated ahead of time using value iteration (VI) (Bellman, 1957). In the case where we do not tailor $k$ (all $k_z$ are equal), as predicted by the theory in Section 5.1, the performance is exactly equal to the single estimator case. We compute the average error from the VI pre-computed optimal value function across the entire training trajectory and plot a sample of these results in Figure 1.
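The schedule above is easy to construct programmatically; a small helper (our own naming, following the doubling rule and $k_z = 1/(1-\gamma_z)$ described in the text) might look like:

```python
def delta_schedule(gamma_Z, gamma_0=0.0):
    """Discounts and k-step horizons used for TD(Delta): double the effective
    horizon 1/(1 - gamma) at each step via gamma_{z+1} = (gamma_z + 1) / 2,
    and set k_z = 1 / (1 - gamma_z)."""
    gammas = [gamma_0]
    while gammas[-1] < gamma_Z:
        gammas.append(min((gammas[-1] + 1.0) / 2.0, gamma_Z))
    ks = [max(1, round(1.0 / (1.0 - g))) for g in gammas]
    return gammas, ks

# For example, gamma_Z = 0.9375 gives gammas [0, 0.5, 0.75, 0.875, 0.9375]
# and k-step horizons [1, 2, 4, 8, 16].
```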
We supply all results in the Supplemental, across a set of 7 different γ values corresponding to effective planning horizons of (4, 8, 16, 32, 64, 125, 250). We note that performance gains tend to increase with larger γ and k values, as discussed further in the Supplemental. However, consistent with the theory, in all cases we still perform about equal to (statistically) or significantly better than the single estimator setting.

6.2. Dense reward Atari

We further demonstrate performance gains in Atari using the PPO-based version of TD(Δ). We directly update PPO with TD(λ, Δ), using the code of Kostrikov (2018). We compare against the standard PPO baseline with hyperparameters as found in Schulman et al. (2017) and Kostrikov (2018). Our architecture differs slightly from the PPO baseline, as the value function now has $Z + 1$ outputs (one for each $W$). For complete fairness, we also add another neural network architecture which replicates the parameters of TD(Δ). That is, we use a neural network value function that outputs $Z + 1$ values which are summed together before computing the value loss (we call this PPO+).

We run two versions of TD(Δ). The first version, as described in Section 4.4, uses a similar $\gamma_z$ sequence as in the ring MDP experiments (starting at $\gamma_Z = 0.99$ and halving the horizon), where $\lambda_z$ is set for each lower $\gamma_z$ such that $\gamma_z\lambda_z = \gamma_Z\lambda_Z$, as per Theorem 1. However, we note that due to the use of adaptive optimizers, performance may improve as parameters are honed for each delta estimator. Just as in the tabular setting, where $k_z$ can be reduced for lower delta estimators, in this setting as well, parity with the baseline model is not necessary and λ can effectively be reduced. To this end, we introduce a second version of our method, labelled PPO-TD($\hat{\lambda}$, Δ), where we limit $\lambda_z \le 1$.

We run experiments on the 9 games defined in Bellemare et al. (2016) as Hard with dense rewards. We chose Hard games as these games are most likely to need algorithmic improvements to solve. We chose dense reward tasks since we do not tackle the problem of exploration here (needed for tackling sparse reward settings), but rather the modeling of complex value functions, which dense reward settings are likely to benefit from.

Figure 2. The $W_z$ estimators versus the reward over a single episode in two games; the drops in value align with a lost life. This is done on a single rollout trajectory of the trained PPO-TD($\hat{\lambda}$, Δ) agent using random seed 1153780.

As seen in Table 1 (with average return across training and on hold-out no-op starts in the Supplemental), PPO-TD(λ, Δ) performs (statistically) significantly better in a certain class of games roughly related to the frequency of non-zero rewards (the density). In both versions of TD(Δ), the algorithms perform worse asymptotically than the baselines in two games, Zaxxon and Wizard of Wor, which belong to a class of games with lower density. Though PPO-TD($\hat{\lambda}$, Δ) performs somewhat better in both cases, as we will see in Section 6.3, it is still possible to improve performance further in these games by tuning the number and scale of the $\gamma_z$ factors.

One may wonder why performance improves in increasingly dense reward settings. There is a basic intuition that TD(Δ) would allow for quick learning of short-term phenomena, followed by slower learning of long-term dependencies. Such a decomposition is reflected in a rolled-out trajectory using the learned policy in Figure 2. There, the long-term $W_Z$ value declines early, according to a consistent gradient towards a lost life in the game, while short-term phenomena continue to be captured in the smaller components like $W_0$.
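For reference, here is a minimal PyTorch sketch of the kind of $Z+1$-output value head described above (the class name and layout are ours, not the released architecture):

```python
import torch
import torch.nn as nn

class DeltaValueHead(nn.Module):
    """Value head with Z+1 outputs, one per delta component W_z.

    Summing the outputs recovers V_hat_{gamma_Z} (Eq. 6), and cumulative sums
    expose the value estimate at every intermediate timescale.
    """

    def __init__(self, feature_dim: int, num_components: int):
        super().__init__()
        self.w_heads = nn.Linear(feature_dim, num_components)  # W_hat_0 .. W_hat_Z

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.w_heads(features)                           # shape (batch, Z+1)

    def value(self, features: torch.Tensor) -> torch.Tensor:
        # V_hat_{gamma_Z}(s) = sum_z W_hat_z(s), used for the advantage in Eq. 13.
        return self.w_heads(features).sum(dim=-1)
```

The PPO+ ablation uses the same head but sums the outputs before computing a single value loss, whereas TD(λ, Δ) regresses each output onto its own λ-return $G^{z,\lambda_z}$.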
Table 1. Asymptotic Atari performance (across the last 100 episodes), reported as the mean across 10 seeds ± the standard error. Significance (of a baseline over our algorithm, or of our algorithm over the best baseline) is assessed using Welch's t-test with a significance level of .05 and with bootstrap confidence intervals (Colas et al., 2018; Henderson et al., 2018b); bold marks algorithms that perform as well as or significantly better than the baselines. Reward Density is the frequency of rewards per 100 time-steps, averaged over 10k timesteps under the learned policy of the baseline (PPO). Notice how the task Zaxxon has a much lower frequency than the largest-frequency task (Hero). More information in the Supplemental.

| Algorithm | Zaxxon | Wizard of Wor | Qbert | Ms Pacman | Hero | Frostbite | Bank Heist | Amidar | Alien |
|---|---|---|---|---|---|---|---|---|---|
| PPO-TD(λ, Δ) | 396 ± 210 | 2118 ± 138 | 13428 ± 333 | 2273 ± 67 | 29074 ± 512 | 292 ± 7 | 1183 ± 13 | 731 ± 30 | 1606 ± 112 |
| PPO-TD($\hat{\lambda}$, Δ) | 3291 ± 812 | 2440 ± 89 | 13092 ± 430 | 2241 ± 78 | 29014 ± 764 | 304 ± 21 | 1166 ± 5 | 672 ± 45 | 1663 ± 113 |
| PPO+ | 7006 ± 211 | 2870 ± 218 | 10594 ± 335 | 1876 ± 89 | 23511 ± 843 | 299 ± 2 | 1199 ± 5 | 611 ± 34 | 1374 ± 85 |
| PPO | 7366 ± 223 | 3408 ± 193 | 11735 ± 387 | 1888 ± 111 | 21038 ± 972 | 294 ± 5 | 1190 ± 3 | 575 ± 54 | 1315 ± 70 |
| Reward Density | 1.15 | 1.07 | 12.26 | 13.27 | 13.46 | 5.04 | 6.3 | 4.63 | 11.33 |

6.3. Tuning and Ablation

In the previous section we demonstrated how, using a fixed set of γ and λ values tailored to an intuitive set of progressively larger horizons, we could yield performance gains in a number of environments over the single estimator case. However, a performance drop was seen in the case of Zaxxon and Wizard of Wor. Due to the bias-variance trade-off in bootstrapping from smaller delta estimators, a curriculum based on smaller horizons may effectively slow learning in some cases. However, the benefit of separating value functions in a flexible way, as we propose here, is that they can be tuned.

Figure 3. Performance of TD(Δ) variations vs. the baselines on Zaxxon and Wizard of Wor. ppo+ refers to PPO with an augmented architecture. ppoDelta refers to setting $\gamma_z\lambda_z = \gamma\lambda \;\forall z$. ppoDelta3 and ppoDelta12 use only two value functions, with horizons (3, 100) and (12, 100) respectively. The shaded region is the standard error across 10 random seeds.

In Figure 3 (with full results in the Supplemental), we show how different γ values can be used to improve asymptotic performance to match the baseline. By increasing the lowest effective horizon ($\gamma_0$) of $W_0$, we bias the algorithm less toward myopic settings and increase the rate of learning to be comparable to the baselines. Further tuning of the number of components and their parameters ($\gamma_z$, $\lambda_z$, learning rate, etc.) may further improve performance.
7. Discussion

In this work we explore temporal decomposition of the value function. More concretely, we proposed a novel way of decomposing value estimators via a Bellman update based on the difference between two value estimators with different discount factors. This has convenient theoretical and practical properties which help improve performance in certain settings. These properties have additional benefits: they allow for a natural way to distribute and parallelize training, easy inspection of performance at different discount factors, and the possibility of lifelong learning by adding or removing components. Moreover, we have also highlighted the limitations of this method (introduced bias toward myopic returns) when using the simple parameter settings we propose. However, these limitations can be overcome with the additional ability to tune parameters at different timescales. We briefly discuss the added benefits of TD(Δ) below.

Scalability: While we have not pursued it experimentally here, another benefit of separating value functions in this way is that it reflects a natural way of distributing updates across systems for large-scale problems. In fact, prior work has sought different ways to scale RL algorithms through partitioning methods (though typically through other means, like dividing the state space) (Wingate, 2004; Wingate & Seppi, 2004). Our work provides another such method for scaling RL systems in a different way. A TD(Δ) update can be spread across many machines, such that each $W_z$ is updated separately (as long as weights are synced across machines after a parallel update).

Additional tuning ability: Many of the performance improvements seen here come not necessarily from the decomposition method itself, but from the ability to set certain parameters differently for each component. The fine-grained nature of the decomposition of the value function allows for further improvement by tuning the number of delta estimators and the $\gamma_z$ values which correlate with them. In the future, a meta-gradient method such as Xu et al. (2018) proposed could be used to automatically scale delta estimators to the timescales which require more computational complexity. However, the default method for tailoring $\gamma_z$, $k_z$, and $\lambda_z$ values as described above (doubling effective horizons until the maximum horizon is reached) still yields improvements in most games tested here, without additional tuning.

Interpreting performance at different time-scales: As we mention in Section 2, another benefit of TD(Δ) is the ability to examine the value function at different time scales after a single pass of learning. That is, we can compose value functions from $\gamma_0, \ldots, \gamma_Z$ and understand the differences between different timescales. This has implications for real-world uses, with similar motivations as Sherstan et al. (2018) describe. Take for example an MDP where the bulk of rewards are in some central region, requiring following a policy π for some number of timesteps before reaching the dense reward region. By examining each $W_z$ component as we do in Figure 2, a practitioner could understand how far into a trajectory π must be followed before the dense reward region is reached. This adds a layer of interpretability to the value function which is missing in the single estimator case. Similarly, this may have a benefit in determining an optimal stopping point for the policy. In production systems where there is a cost to running a policy (time, money, or energy resources), yet the policy can be run indefinitely, a practitioner may use the $W_z$ components to determine whether the discounted return at a larger horizon is worth the cost.

TD(Δ) as an (almost) anytime algorithm: Throughout this work, we emphasize this algorithm as a complement to the selection of a final $\gamma_Z$. The longest-horizon discount factor can be chosen according to other methods (hyperparameter optimization or meta-gradient methods). However, an added benefit of our method not explored in this work is its functionality as an almost anytime algorithm.
While longer time horizons will take longer to converge, at any point in time the sum of all horizons which have converged is a suitable approximation of the value function at that intermediary point. Therefore, with enough resources, TD(Δ) could potentially, at any time, add one further timescale $Z \to Z + 1$ (initialized to $W_{Z+1} = 0$, which preserves the current $V$ estimate). This has implications for methods which already extend discount factors through a curriculum (OpenAI, 2018).

Other extensions: Our method should also extend easily to any TD-like methods, such as Sarsa(λ) and Q-learning, with few adjustments. We leave this to future work.

Conclusion: We believe that TD(Δ) is an important drop-in addition to any TD-based training method and can be applied to a number of existing model-free RL algorithms. We especially highlight the value of this method for performance tuning. We show that a simple sequence of $\gamma_z$ values based on doubling horizons can yield performance gains, especially in dense reward settings, and this performance can be enhanced further with tuning. As the complexity of modeling and training long-horizon problems increases, TD(Δ) may be another tool for scaling and honing production systems for optimal performance.

Acknowledgements

We thank Alexandre Piché, Vincent François-Lavet, and Harsh Satija for many helpful discussions about the work.

References

Bellemare, M. G., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. CoRR, 2016. URL http://arxiv.org/abs/1606.01868.

Bellman, R. A Markovian decision process. Journal of Mathematics and Mechanics, pp. 679-684, 1957.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming: An overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, pp. 560-564. IEEE, Piscataway, NJ, 1995.

Colas, C., Sigaud, O., and Oudeyer, P.-Y. How many random seeds? Statistical power analysis in deep reinforcement learning experiments. CoRR, 2018. URL http://arxiv.org/abs/1806.08295.

Dietterich, T. G. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227-303, 2000.

Even-Dar, E. and Mansour, Y. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1-25, 2003.

Fedus, W., Fatemi, M., Bengio, Y., Bellemare, M. G., and Larochelle, H. Hyperbolic discounting and learning over multiple horizons. CoRR, 2019. URL https://arxiv.org/abs/1902.06865.

Feinberg, E. A. and Shwartz, A. Markov decision models with weighted discounted criteria. Mathematics of Operations Research, 19(1):152-168, 1994.

François-Lavet, V., Fonteneau, R., and Ernst, D. How to discount deep reinforcement learning: Towards new dynamic strategies. CoRR, 2015. URL http://arxiv.org/abs/1512.02011.

Henderson, P., Chang, W.-D., Bacon, P.-L., Meger, D., Pineau, J., and Precup, D. OptionGAN: Learning joint reward-policy options using generative adversarial inverse reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018a.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018b.

Henderson, P., Romoff, J., and Pineau, J. Where did my optimum go?: An empirical analysis of gradient descent optimization in policy gradient methods. arXiv preprint arXiv:1810.02525, 2018c.
Hengst, B. Discovering hierarchy in reinforcement learning with HEXQ. In ICML, volume 2, pp. 243-250, 2002.

Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209-232, 2002.

Kearns, M. J. and Singh, S. P. Bias-variance error bounds for temporal difference updates. In COLT, pp. 142-147. Citeseer, 2000.

Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008-1014, 2000.

Kostrikov, I. PyTorch implementations of reinforcement learning algorithms. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr, 2018.

Menache, I., Mannor, S., and Shimkin, N. Q-cut - dynamic discovery of sub-goals in reinforcement learning. In European Conference on Machine Learning, pp. 295-306. Springer, 2002.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937, 2016.

OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.

Pineau, J. The machine learning reproducibility checklist (version 1.0), 2018.

Prokhorov, D. V. and Wunsch, D. C. Adaptive critic designs. IEEE Transactions on Neural Networks, 8(5):997-1007, 1997.

Reinke, C., Uchibe, E., and Doya, K. Average reward optimization with multiple discounting reinforcement learners. In International Conference on Neural Information Processing, pp. 789-800. Springer, 2017.

Reynolds, S. I. Decision boundary partitioning: Variable resolution model-free reinforcement learning. Cognitive Science Research Papers, 1999.

Russell, S. J. and Zimdars, A. Q-decomposition for reinforcement learning agents. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 656-663, 2003.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. CoRR, 2015. URL https://arxiv.org/abs/1506.02438.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, 2017. URL http://arxiv.org/abs/1707.06347.

Sherstan, C., MacGlashan, J., and Pilarski, P. M. Generalizing value estimation over timescale. FAIM Workshop on Prediction and Generative Modeling in Reinforcement Learning, 2018.

Sutton, R. S. Temporal credit assignment in reinforcement learning. Doctoral Dissertations Available from ProQuest, 1984.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.

Sutton, R. S. TD models: Modeling the world at a mixture of time scales. In Machine Learning Proceedings 1995, pp. 531-539. Elsevier, 1995.

Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057-1063, 2000.

Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pp. 761-768. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., and Tsang, J. Hybrid reward architecture for reinforcement learning. CoRR, 2017. URL http://arxiv.org/abs/1706.04208.

Wingate, D. Solving large MDPs quickly with partitioned value iteration. All Theses and Dissertations, 2004.

Wingate, D. and Seppi, K. D. P3VI: A partitioned, prioritized, parallel value iterator. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 109. ACM, 2004.

Xu, Z., van Hasselt, H., and Silver, D. Meta-gradient reinforcement learning. CoRR, 2018. URL http://arxiv.org/abs/1805.09801.