# Taylor Expansions of Discount Factors

Yunhao Tang (Columbia University, New York, USA), Mark Rowland (DeepMind, London, UK), Rémi Munos (DeepMind, Paris, France), Michal Valko (DeepMind, Paris, France). Correspondence to: yt2541@columbia.edu.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

## Abstract

In practical reinforcement learning (RL), the discount factor used for estimating value functions often differs from that used for defining the evaluation objective. In this work, we study the effect that this discrepancy of discount factors has during learning, and discover a family of objectives that interpolate value functions of two distinct discount factors. Our analysis suggests new ways for estimating value functions and performing policy optimization updates, which demonstrate empirical performance gains. This framework also leads to new insights on commonly-used deep RL heuristic modifications to policy optimization algorithms.

## 1. Introduction

One of the most popular models for reinforcement learning (RL) is the Markov decision process (MDP) with exponential discounting over an infinite horizon (Sutton and Barto, 2018; Puterman, 2014), with discounted objectives of the following form:

$$V^\pi_\gamma(x) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, x_0 = x\right].$$

Discounted models enjoy favorable theoretical properties, and are also the foundation of many practical RL algorithms that enjoy empirical success (see, e.g., Mnih et al., 2015; Schulman et al., 2015a; Lillicrap et al., 2015; Schulman et al., 2017). However, in most applications of RL, the objective of interest is the expected undiscounted cumulative return,

$$V^\pi(x) = \mathbb{E}_\pi\left[\sum_{t=0}^{T} r_t \,\middle|\, x_0 = x\right], \qquad (1)$$

where $T < \infty$ is a (possibly random) evaluation horizon, which usually also denotes the end of the trajectory. For example, $T$ could be the first time the MDP gets into a terminal state (e.g., a robot falls); when the MDP does not have a natural terminal state, $T$ could be enforced as a deterministic horizon. This creates a technical gap between algorithmic developments and implementations: it is tempting to design algorithms that optimize $V^\pi_\gamma(x)$, yet further heuristics are often needed to obtain strong practical performance.

This issue manifests itself with the policy gradient (PG) theorem (Sutton et al., 2000). Let $\pi_\theta$ be a parameterized policy. The policy gradient $\nabla_\theta V^{\pi_\theta}_\gamma(x)$ is computed as

$$\nabla_\theta V^{\pi_\theta}_\gamma(x) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t\, Q^{\pi_\theta}_\gamma(x_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t|x_t)\right]. \qquad (2)$$

However, practical implementations of PG updates usually omit the discount factors (see, for example, the high-quality open-source packages of Dhariwal et al., 2017; Achiam and OpenAI, 2018), leading to an approximate gradient of the form

$$\mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} Q^{\pi_\theta}_\gamma(x_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t|x_t)\right]. \qquad (3)$$

Most prior work on PG algorithms relies on this heuristic update to work properly in deep RL applications. The intuitive argument for dropping the factor $\gamma^t$ is that Eqn (2) optimizes $V^{\pi_\theta}_\gamma(x)$, which is very myopic compared to the objective in Eqn (1). Consequently, the exponential discount $\gamma^t$ is too aggressive for weighting updates with large $t$. As a concrete example, in many MuJoCo control tasks (Brockman et al., 2016), the most commonly used discount factor is $\gamma = 0.99$. This leads to an effective horizon of $\frac{1}{1-\gamma} = 100$, which is much smaller than the evaluation horizon $T = 1000$. This technical gap between theory and practice has been alluded to previously (e.g., by O'Donoghue et al., 2016) and is explicitly discussed by Nota and Thomas (2019).
To bypass this gap, a straightforward solution would be to naïvely increase the discount factor to $\gamma' = 1 - \frac{1}{T}$ and apply the PG in Eqn (2). In the example above, this implies using $\gamma' \approx 0.999$. Unfortunately, this rarely works well in practice, as we will also see in experiments. The failure might be due to the higher variance of the estimation (Schulman et al., 2015b) or the collapse of the action gaps (Lehnert et al., 2018; Laroche and van Seijen, 2018), which is aggravated when combined with function approximation.

Nevertheless, as a theoretical framework, it is insightful to emulate the undiscounted objective in Eqn (1) with the discounted objective $V^\pi_{\gamma'}(x)$ for $\gamma' = 1 - \frac{1}{T}$. To build intuition about this approximation, note that when the time step is small, $t \ll T$, the multiplicative factor $(\gamma')^t \approx 1$ and the cumulative rewards are almost undiscounted; even when $t = T$, we have $(\gamma')^T = (1 - \frac{1}{T})^T \approx e^{-1} \gg 0$. Overall, this is a much more accurate approximation of the undiscounted objective than $V^\pi_\gamma(x)$. This naturally prompts us to answer the following general question:

How do we evaluate and optimize $V^\pi_{\gamma'}(x)$ with estimates built for $V^\pi_\gamma(x)$, where $0 < \gamma < \gamma' \le 1$?

Main idea. We study the relation between $V^\pi_\gamma(x)$ and $V^\pi_{\gamma'}(x)$ via Taylor expansions. In Section 3, we identify a family of interpolating objectives between the more myopic objective $V^\pi_\gamma(x)$ and the true objective of interest $V^\pi_{\gamma'}(x)$. In Section 4, we start with insights on why the heuristic in Eqn (3) might be useful in practice. Then, we apply Taylor expansions directly to the heuristic updates to arrive at a family of interpolating updates. In Section 5, we build on these theoretical insights to derive improvements to established deep RL algorithms. We show their performance gains in Section 7.

## 2. Background

Consider the setup of an MDP. At any discrete time $t \ge 0$, the agent is in state $x_t \in \mathcal{X}$, takes an action $a_t \in \mathcal{A}$, receives an instant reward $r_t = r(x_t, a_t) \in [0, R_{\max}]$ and transitions to a next state $x_{t+1} \sim p(\cdot|x_t, a_t)$. For simplicity, we assume $r(x, a)$ to be deterministic. Let a policy $\pi : \mathcal{X} \rightarrow \mathcal{P}(\mathcal{A})$ be a mapping from states to distributions over actions. For a discount factor $\gamma \in [0, 1)$, define the Q-function $Q^\pi_\gamma(x, a) := \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid x_0 = x, a_0 = a\right]$ and the value function $V^\pi_\gamma(x) := \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid x_0 = x\right]$. We also define the advantage function $A^\pi_\gamma(x, a) := Q^\pi_\gamma(x, a) - V^\pi_\gamma(x)$. Here, $\mathbb{E}_\pi[\cdot]$ denotes that the trajectories $(x_t, a_t, r_t)_{t=0}^{\infty}$ are generated under policy $\pi$. Throughout the paper, we use subscripts $\gamma$ to emphasize that RL quantities implicitly depend on discount factors.

### 2.1. Linear programs for reinforcement learning

Henceforth, we assume all vectors to be column vectors. The value function $V^\pi_\gamma$ satisfies the Bellman equation $V^\pi_\gamma(x) = \mathbb{E}_\pi\left[r(x, a) + \gamma V^\pi_\gamma(x') \mid x_0 = x\right]$ (Bellman, 1957). Such equations can be encoded into a linear program (LP) (De Farias and Van Roy, 2003; Puterman, 2014). Let $V \in \mathbb{R}^{\mathcal{X}}$ be the primal variables and consider the following LP,

$$\max_{V} \ \delta_x^T V, \quad \text{s.t.} \ V = r^\pi + \gamma P^\pi V, \qquad (4)$$

where $r^\pi \in \mathbb{R}^{\mathcal{X}}$ is the state-dependent reward $r^\pi(x') := \sum_{a'} \pi(a'|x')\, r(x', a')$ and $P^\pi \in \mathbb{R}^{\mathcal{X} \times \mathcal{X}}$ is the transition matrix under $\pi$. Here, $\delta_x \in \mathbb{R}^{\mathcal{X}}$ encodes the one-hot distribution (Dirac) at $x$. Similar results hold for the LP objective $\nu^T V$ with a general distribution $\nu \in \mathcal{P}(\mathcal{X})$. It then follows that the optimal solution to the above LP is $V^* = V^\pi_\gamma$. Now, consider the dual LP to Eqn (4) with dual variables $d \in \mathbb{R}^{\mathcal{X}}$,

$$\min_{d} \ (1 - \gamma)^{-1} (r^\pi)^T d, \quad \text{s.t.} \ d = (1 - \gamma)\delta_x + \gamma (P^\pi)^T d. \qquad (5)$$

The optimal solution to the dual program has a natural probabilistic interpretation. It is the discounted visitation distribution $d^\pi_{x,\gamma}$ under policy $\pi$ with starting state $x$, defined as $d^\pi_{x,\gamma}(x') := (1 - \gamma)\sum_{t \ge 0} \gamma^t\, \mathbb{P}_\pi(x_t = x' \mid x_0 = x)$, where $\mathbb{P}_\pi(x_t = x' \mid x_0 = x)$ is the probability measure induced by the policy $\pi$ and the MDP transition kernel. By strong duality, the value function can be equivalently written as

$$V^\pi_\gamma(x) = \frac{1}{1 - \gamma}\, \mathbb{E}_{x' \sim d^\pi_{x,\gamma},\, a' \sim \pi(\cdot|x')}\left[r(x', a')\right]. \qquad (6)$$
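To make the primal-dual correspondence above concrete, here is a minimal sketch (not from the paper; the MDP, the policy and all variable names are illustrative assumptions) that builds a small random tabular MDP in NumPy and checks that the value function obtained from the matrix Bellman equation agrees with the dual expression of Eqn (6) via the discounted visitation distribution.

```python
# A minimal sketch checking the primal-dual relation of Section 2.1 on a
# randomly generated tabular MDP. All variable names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions, gamma = 5, 3, 0.9

# Random transition kernel p(y | x, a) and reward r(x, a) in [0, 1].
P_xa = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
r_xa = rng.uniform(size=(num_states, num_actions))
pi = rng.dirichlet(np.ones(num_actions), size=num_states)  # pi(a | x)

# State-level quantities under pi: r^pi and the transition matrix P^pi.
r_pi = (pi * r_xa).sum(axis=1)
P_pi = np.einsum("xa,xay->xy", pi, P_xa)

# Primal: V^pi_gamma = (I - gamma P^pi)^{-1} r^pi.
V_pi = np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)

# Dual: discounted visitation distribution d^pi_{x,gamma} from the constraint
# d = (1 - gamma) delta_x + gamma (P^pi)^T d, then the value via Eqn (6).
x0 = 0
delta_x = np.zeros(num_states); delta_x[x0] = 1.0
d_pi = np.linalg.solve(np.eye(num_states) - gamma * P_pi.T, (1 - gamma) * delta_x)

v_from_dual = (1.0 / (1.0 - gamma)) * d_pi @ r_pi
assert np.isclose(V_pi[x0], v_from_dual)   # strong duality, Eqn (6)
assert np.isclose(d_pi.sum(), 1.0)         # d^pi_{x,gamma} is a distribution
print(V_pi[x0], v_from_dual)
```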
## 3. Taylor Expansions of Value Functions

Below, we show how to estimate $V^\pi_{\gamma'}(x)$ with approximations constructed from value functions $V^\pi_\gamma(x)$ for $\gamma < \gamma'$. Unless otherwise stated, we assume $\gamma' < 1$ for a more convenient mathematical treatment of the problem.

### 3.1. Taylor expansions of discount factors

We start with some notation: we abuse the notation $V^\pi_\gamma \in \mathbb{R}^{\mathcal{X}}$ to refer both to the scalar function and to the corresponding vector. The Bellman equation for the value function is expressed in matrix form as (Puterman, 2014)

$$V^\pi_{\gamma'} = r^\pi + \gamma' P^\pi V^\pi_{\gamma'}. \qquad (7)$$

Inverting the equation,

$$V^\pi_{\gamma'} = (I - \gamma' P^\pi)^{-1} r^\pi. \qquad (8)$$

Now, we present the main result of Taylor expansions.

Proposition 3.1. The following holds for all $K \ge 0$,

$$V^\pi_{\gamma'} = \sum_{k=0}^{K} \left((\gamma' - \gamma)(I - \gamma P^\pi)^{-1} P^\pi\right)^k V^\pi_\gamma + \underbrace{\left((\gamma' - \gamma)(I - \gamma P^\pi)^{-1} P^\pi\right)^{K+1} V^\pi_{\gamma'}}_{\text{residual}}. \qquad (9)$$

When $\gamma < \gamma' < 1$, the residual norm converges to $0$ as $K \to \infty$, which implies

$$V^\pi_{\gamma'} = \sum_{k=0}^{\infty} \left((\gamma' - \gamma)(I - \gamma P^\pi)^{-1} P^\pi\right)^k V^\pi_\gamma. \qquad (10)$$

We provide a proof sketch here: note that $\gamma' P^\pi = (\gamma' - \gamma) P^\pi + \gamma P^\pi$ and apply the Woodbury matrix identity to obtain

$$(I - \gamma' P^\pi)^{-1} = (I - \gamma P^\pi)^{-1} + (\gamma' - \gamma)(I - \gamma P^\pi)^{-1} P^\pi (I - \gamma' P^\pi)^{-1}.$$

We can then recursively expand Eqn (8) $K$ times to arrive at Eqn (9). In particular, by expanding the equation once, we see that $(I - \gamma' P^\pi)^{-1}$ is equivalent to

$$(I - \gamma P^\pi)^{-1} + (\gamma' - \gamma)(I - \gamma P^\pi)^{-1} P^\pi (I - \gamma P^\pi)^{-1} + (\gamma' - \gamma)^2 \underbrace{\left((I - \gamma P^\pi)^{-1} P^\pi\right)^2 (I - \gamma' P^\pi)^{-1}}_{\text{can be expanded further}},$$

where the last term can be expanded further by plugging in the Woodbury matrix identity. See the complete proof in Appendix A.

Extensions to $\gamma' = 1$. The above result extends to the case $\gamma' = 1$. We make two assumptions: (A.1) the Markov chain induced by $\pi$ is absorbing and $T$ is the absorption time; (A.2) $r^\pi(x) = 0$ for absorbing states $x$. Under these assumptions, we can interpret such absorbing states as the terminal states. As a result, $V^\pi_{\gamma'=1}(x) = \mathbb{E}_\pi\left[\sum_{t=0}^{T} r_t \mid x_0 = x\right]$ is well-defined and Proposition 3.1 still holds; see Appendix A for the complete proof.

In practice, it is infeasible to sum up the infinite number of terms in the Taylor expansion. It is then of interest to consider the $K$-th order expansion of $V^\pi_{\gamma'}$, which truncates the infinite series. Specifically, we define the $K$-th order expansion as

$$V^\pi_{K,\gamma,\gamma'} := \sum_{k=0}^{K} \left((\gamma' - \gamma)(I - \gamma P^\pi)^{-1} P^\pi\right)^k V^\pi_\gamma. \qquad (11)$$

As $K$ increases, the $K$-th order expansion becomes increasingly close to the infinite series, which evaluates to $V^\pi_{\gamma'}(x)$. This is formalized next.

Proposition 3.2. The following bound holds for all $K \ge 0$,

$$\left| V^\pi_{\gamma'}(x) - V^\pi_{K,\gamma,\gamma'}(x) \right| \le \left(\frac{\gamma' - \gamma}{1 - \gamma}\right)^{K+1} \frac{R_{\max}}{1 - \gamma'}. \qquad (12)$$
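As a quick numerical illustration of Propositions 3.1 and 3.2 (an illustrative sketch, not one of the paper's experiments), the following computes the exact $K$-th order expansion of Eqn (11) on a random tabular MDP and shows the error against $V^\pi_{\gamma'}$ shrinking at the geometric rate $\left(\frac{\gamma'-\gamma}{1-\gamma}\right)^{K+1}$.

```python
# Exact K-th order expansions (Eqn 11) on a random tabular MDP, compared
# against V^pi_{gamma'}. All quantities are generated for illustration only.
import numpy as np

rng = np.random.default_rng(1)
num_states, gamma, gamma_p = 6, 0.5, 0.95
P_pi = rng.dirichlet(np.ones(num_states), size=num_states)   # P^pi
r_pi = rng.uniform(size=num_states)                           # r^pi

I = np.eye(num_states)
V_gamma = np.linalg.solve(I - gamma * P_pi, r_pi)       # V^pi_gamma, Eqn (8)
V_gamma_p = np.linalg.solve(I - gamma_p * P_pi, r_pi)   # V^pi_{gamma'}

# M = (gamma' - gamma) (I - gamma P^pi)^{-1} P^pi, the expansion operator.
M = (gamma_p - gamma) * np.linalg.solve(I - gamma * P_pi, P_pi)

V_K, term = np.zeros(num_states), V_gamma.copy()
for K in range(8):
    V_K += term                 # V^pi_{K,gamma,gamma'} = sum_{k<=K} M^k V^pi_gamma
    term = M @ term
    err = np.max(np.abs(V_K - V_gamma_p))
    rate = ((gamma_p - gamma) / (1 - gamma)) ** (K + 1)
    print(f"K={K}  error={err:.3e}  geometric rate ~{rate:.3e}")
```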
### 3.2. Sample-based approximations of Taylor expansions

We now describe how to estimate $V^\pi_{K,\gamma,\gamma'}(x)$ via samples. First, we build some intuition on the behavior of expansions at different orders $K$ by considering a few special cases.

Zeroth-order expansion. By setting $K = 0$, we see that

$$V^\pi_{0,\gamma,\gamma'} = V^\pi_\gamma. \qquad (13)$$

The zeroth-order expansion approximates the value function $V^\pi_{\gamma'}(x)$ of the discount factor $\gamma'$ with the value function $V^\pi_\gamma(x)$ of a lower discount factor $\gamma < \gamma'$. This is a very straightforward approximation to use, in that no sampling at all is required, but it may not be accurate.

First-order expansion. When $K = 1$, we consider the increment of the expansion,

$$V^\pi_{1,\gamma,\gamma'} - V^\pi_{0,\gamma,\gamma'} = (\gamma' - \gamma)(I - \gamma P^\pi)^{-1} P^\pi V^\pi_\gamma. \qquad (14)$$

To understand the first-order expansion, recall that in the definition of the value function, $V^\pi_\gamma = (I - \gamma P^\pi)^{-1} r^\pi$, immediate rewards $r^\pi$ are accumulated via the matrix $(I - \gamma P^\pi)^{-1}$. In general, for any $X, Y \in \mathbb{R}^{\mathcal{X}}$, we can interpret $X = (I - \gamma P^\pi)^{-1} Y$ as accumulating $Y$ as rewards to compute $X$ as value functions. By analogy, we can interpret the RHS of Eqn (14) as the value function assuming $(\gamma' - \gamma) P^\pi V^\pi_\gamma$ as immediate rewards. In other words, the first-order expansion bootstraps the zeroth-order expansion $V^\pi_\gamma$ to form a more accurate approximation. Combined with the zeroth-order expansion, we can also conveniently write the difference of the first- and zeroth-order expansions as an expectation,

$$V^\pi_{1,\gamma,\gamma'}(x) - V^\pi_{0,\gamma,\gamma'}(x) = (\gamma' - \gamma)\, \mathbb{E}_\pi\left[\sum_{t=1}^{\infty} \gamma^{t-1} V^\pi_\gamma(x_t) \,\middle|\, x_0 = x\right].$$

Let $\tau \sim \text{Geometric}(1 - \gamma)$ be a random time such that $\mathbb{P}(\tau = t) = (1 - \gamma)\gamma^{t-1}$, $t \in \mathbb{Z}_{\ge 1}$. The difference can also be expressed via this random time,

$$V^\pi_{1,\gamma,\gamma'}(x) - V^\pi_{0,\gamma,\gamma'}(x) = \frac{\gamma' - \gamma}{1 - \gamma}\, \mathbb{E}_{\pi,\tau}\left[V^\pi_\gamma(x_\tau)\right].$$

Note that from this expression, we obtain a simple unbiased estimate of $V^\pi_{1,\gamma,\gamma'}(x) - V^\pi_{0,\gamma,\gamma'}(x)$, using a sampled trajectory and a random time step $\tau$.

General $K$-th order expansion. We now present results for general $K$. Consider the incremental term,

$$V^\pi_{K,\gamma,\gamma'} - V^\pi_{K-1,\gamma,\gamma'} = (\gamma' - \gamma)^K \left((I - \gamma P^\pi)^{-1} P^\pi\right)^K V^\pi_\gamma. \qquad (15)$$

The aggregate matrix $\left((I - \gamma P^\pi)^{-1} P^\pi\right)^K$ suggests a recursive procedure that bootstraps lower-order expansions to construct higher-order expansions. To see why, we can rewrite the right-hand side of Eqn (15) as $(\gamma' - \gamma)(I - \gamma P^\pi)^{-1} P^\pi \left(V^\pi_{K-1,\gamma,\gamma'} - V^\pi_{K-2,\gamma,\gamma'}\right)$. Indeed, we can interpret the difference $V^\pi_{K,\gamma,\gamma'} - V^\pi_{K-1,\gamma,\gamma'}$ as the value function under the immediate reward $(\gamma' - \gamma) P^\pi \left(V^\pi_{K-1,\gamma,\gamma'} - V^\pi_{K-2,\gamma,\gamma'}\right)$. This generalizes the bootstrap procedure of the first-order expansion as a special case, where we naturally set $V^\pi_{-1,\gamma,\gamma'} = 0$. Given $K$ i.i.d. random times $\tau_i \sim \text{Geometric}(1 - \gamma)$, we can write $V^\pi_{K,\gamma,\gamma'}(x) - V^\pi_{K-1,\gamma,\gamma'}(x)$ as the expectation

$$\left(\frac{\gamma' - \gamma}{1 - \gamma}\right)^K \mathbb{E}_{\tau_i,\, 1 \le i \le K}\left[V^\pi_\gamma(x_{\tau_1 + \cdots + \tau_K})\right].$$

Based on the above expression, Algorithm 1 provides a subroutine that generates unbiased estimates of $V^\pi_{K,\gamma,\gamma'}(x)$ by sub-sampling an infinite trajectory $(x_t, a_t, r_t)_{t=0}^{\infty}$ with the random times (see also the sketch at the end of this subsection).

Algorithm 1: Estimating the $K$-th order expansion.
Require: a trajectory $(x_t, a_t, r_t)_{t=0}^{\infty} \sim \pi$ and discount factors $\gamma < \gamma' < 1$.
1. Compute an unbiased estimate $\hat{V}^\pi_\gamma(x_t)$ for states along the trajectory, e.g., $\hat{V}^\pi_\gamma(x_t) = \sum_{t' \ge t} \gamma^{t'-t} r_{t'}$.
2. Sample $K$ random times $\{\tau_i\}_{1 \le i \le K}$, all i.i.d. geometrically distributed, $\tau_i \sim \text{Geometric}(1 - \gamma)$.
3. Return the unbiased estimate $\sum_{k=0}^{K} \left(\frac{\gamma' - \gamma}{1 - \gamma}\right)^k \hat{V}^\pi_\gamma(x_{t_k})$, where $t_k = \sum_{i=1}^{k} \tau_i$.

Practical implementations. While the above and Algorithm 1 show how to compute one-sample estimates, in practice we might want to average multiple samples along a single trajectory for variance reduction. See Appendix F for further details on the practical estimates.

Interpretation of expansions in the dual space. Recall that $V^\pi_{\gamma'} = (I - \gamma' P^\pi)^{-1} r^\pi = I\,(I - \gamma' P^\pi)^{-1} r^\pi$, where the identity matrix $I = [\delta_{x_1}, \delta_{x_2}, \ldots, \delta_{x_{|\mathcal{X}|}}]$ concatenates the Dirac delta vectors $\delta_x$, $x \in \mathcal{X}$. Since $r^\pi$ is a constant vector, Taylor expansions essentially construct approximations to the matrix $(I - \gamma' P^\pi)^{-1}$. By grouping the matrix with the reward vector (or with the matrix of Dirac vectors), we arrive at the primal expansion (or the dual expansion),

$$\underbrace{I\, \left[(I - \gamma' P^\pi)^{-1} r^\pi\right]}_{\text{primal expansions of } V^\pi_{\gamma'}(x)} = \underbrace{\left[I\, (I - \gamma' P^\pi)^{-1}\right]}_{\text{dual expansions of } d^\pi_{x,\gamma'}} r^\pi.$$

The derivations above focus on the primal expansion view; we develop a parallel theory of dual expansions in Appendix B.
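Concretely, Algorithm 1 can be sketched as follows (illustrative assumptions throughout: the rewards are a stand-in sequence rather than data from an environment, and we truncate whenever a sampled time falls beyond the recorded horizon).

```python
# A minimal sketch of Algorithm 1: estimating V^pi_{K,gamma,gamma'}(x_0) from a
# single long trajectory using K i.i.d. Geometric(1 - gamma) random times.
import numpy as np

def estimate_taylor_expansion(rewards, K, gamma, gamma_p, rng):
    """One-sample estimate of V^pi_{K,gamma,gamma'}(x_0) from a reward sequence."""
    T = len(rewards)
    # Step 1: base estimates \hat{V}^pi_gamma(x_t) as discounted returns-to-go.
    discounts = gamma ** np.arange(T)
    v_hat = np.array([np.sum(rewards[t:] * discounts[: T - t]) for t in range(T)])
    # Step 2: K i.i.d. geometric random times (support {1, 2, ...}).
    taus = rng.geometric(p=1.0 - gamma, size=K)
    # Step 3: weighted sum of base estimates at the cumulative random times.
    estimate, t_k = v_hat[0], 0
    for k in range(1, K + 1):
        t_k += taus[k - 1]
        if t_k >= T:          # trajectory assumed long enough; truncate otherwise
            break
        estimate += ((gamma_p - gamma) / (1.0 - gamma)) ** k * v_hat[t_k]
    return estimate

rng = np.random.default_rng(0)
rewards = rng.uniform(size=2000)   # stand-in for rewards along a sampled trajectory
print(estimate_taylor_expansion(rewards, K=3, gamma=0.9, gamma_p=0.99, rng=rng))
```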
The equivalence of the primal and dual views of Taylor expansions suggests connections with seemingly disparate lines of prior work: Janner et al. (2020) propose a density model for visitation distributions of different $\gamma$ in the context of model-based RL. They show that predictions for large discount factors can be bootstrapped from predictions for small discount factors. This corresponds exactly to the dual-space expansions, which are equivalent to the primal-space expansions.

Extensions to Q-functions. In Appendix C, we show that it is possible to build approximations to $Q^\pi_{\gamma'}$ using $Q^\pi_\gamma$ as building blocks. The theoretical guarantees and estimation procedures are similar to the case of value functions.

### 3.3. Approximation errors with finite samples

Proposition 3.2 shows that the expected approximation error decays as $V^\pi_{K,\gamma,\gamma'}(x) - V^\pi_{\gamma'}(x) = O\left(\left(\frac{\gamma' - \gamma}{1 - \gamma}\right)^{K+1}\right)$ for $\gamma < \gamma' < 1$. This motivates using a high value of $K$ when constructing the approximation. However, in practice, all constituent terms of the $K$-th order expansion are random estimates, each with non-zero variance. This might lead the variance of the overall estimate to increase with $K$. As a result, $K$ mediates a trade-off between bias (expected approximation error) and variance. We formalize such intuitions in Appendix E, where we theoretically analyze the trade-off using the phased TD-learning framework (Kearns and Singh, 2000).

A numerical example. To get direct intuition about the effect of $K$, we focus on a tabular MDP example. The MDP has $|\mathcal{X}| = 10$ states and $|\mathcal{A}| = 2$ actions. All entries of the transition table $p(y|x, a)$ are generated from a Dirichlet distribution with parameters $(\alpha, \ldots, \alpha)$, with $\alpha = 0.01$. The policy $\pi(a|x)$ is uniformly random. We take $\gamma = 0.2$ and $\gamma' = 0.8$. The agent generates $N = 10$ trajectories $(x_t, a_t, r_t)_{t=0}^{T}$ with a very large horizon $T$ and a fixed starting state $x_0$. We assume access to base estimates $\hat{V}^\pi_\gamma(x_t)$, and the Taylor expansion estimates $\hat{V}^\pi_{K,\gamma,\gamma'}(x_0)$ are computed with Algorithm 1. We estimate the relative error as $\hat{E}_K(x_0) = \left|V^\pi_{\gamma'}(x_0) - \hat{V}^\pi_{K,\gamma,\gamma'}(x_0)\right| \big/ \left|V^\pi_{\gamma'}(x_0)\right|$. For further experimental details, see Appendix F.

In Figure 1(a), we show how errors vary as a function of $K$. We study two settings: (1) expected estimates (red), where $\hat{V}^\pi_{K,\gamma,\gamma'}(x_0)$ is computed analytically through access to the transition tables. In this case, as the theory suggests, the error decays exponentially. (2) Sample-based estimates (blue), with base estimates $\hat{V}^\pi_\gamma(x_t) = \sum_{s=0}^{\infty} \gamma^s r_{t+s}$. The errors decay initially with $K$ but later increase slightly as $K$ gets large. The optimal $K$ in the middle achieves the best bias-variance trade-off. Note that in this particular example, the estimates do not pay a very big price in variance for large $K$. We speculate this is because increments to the estimates are proportional to $\left(\frac{\gamma' - \gamma}{1 - \gamma}\right)^{K+1}$, which scales down additional variance terms quickly as $K$ increases.

In Figure 1(b), we study how the optimal expansion order $K^*$ depends on the noise level of the base estimates. To emulate the noise, we assume access to base estimates $\hat{V}^\pi_\gamma(x_t) = V^\pi_\gamma(x_t) + \mathcal{N}(0, \sigma^2)$ for some noise level $\sigma$. The optimal order is computed as $K^* = \arg\min_k \hat{E}_k(x_0)$. In general, we observe that when $\sigma$ increases, $K^*$ decreases. Intuitively, this implies that as the base estimates $\hat{V}^\pi_\gamma(x)$ become noisy, we should prefer smaller values of $K$ to control the variance.
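The following is a simplified emulation of the study in Figure 1(b) (an assumption of this sketch: the expansion is applied in its exact matrix form to noisy base estimates, rather than through the Monte-Carlo estimator of Algorithm 1); it reproduces the qualitative trend that the best order $K^*$ shrinks as the noise level $\sigma$ grows.

```python
# Emulating the bias-variance study: for each noise level sigma on the base
# value estimates, find the expansion order K minimizing the error against
# V^pi_{gamma'}. Illustrative only; not the paper's exact protocol.
import numpy as np

rng = np.random.default_rng(0)
num_states, gamma, gamma_p = 10, 0.2, 0.8
P_pi = rng.dirichlet(0.01 * np.ones(num_states), size=num_states)  # Dirichlet(alpha=0.01)
r_pi = rng.uniform(size=num_states)
I = np.eye(num_states)

V_gamma = np.linalg.solve(I - gamma * P_pi, r_pi)
V_gamma_p = np.linalg.solve(I - gamma_p * P_pi, r_pi)
M = (gamma_p - gamma) * np.linalg.solve(I - gamma * P_pi, P_pi)

for sigma in [0.0, 0.05, 0.2, 0.5]:
    errors = []
    for K in range(10):
        errs = []
        for _ in range(200):  # average over independent noisy base estimates
            v_noisy = V_gamma + sigma * rng.standard_normal(num_states)
            V_K, term = np.zeros(num_states), v_noisy.copy()
            for _ in range(K + 1):
                V_K += term
                term = M @ term
            errs.append(abs(V_K[0] - V_gamma_p[0]))
        errors.append(np.mean(errs))
    print(f"sigma={sigma:4.2f}  best K = {int(np.argmin(errors))}")
```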
This result bears some insights for practical applications such as downstream policy optimization, where we need to select an optimal $K$ for the task at hand.

Figure 1. Comparison of Taylor expansions with different orders: (a) trade-off of $K$; (b) optimal $K^*$. The x-axis shows the order $K$; the y-axis shows the log relative errors of the approximations. The blue curve shows the exact computations while the red curve shows the sample-based estimations. See Appendix F for more details.

## 4. Taylor Expansions of Gradient Updates

In Section 3, we discussed how to construct approximations to $V^\pi_{\gamma'}(x)$. For the purpose of policy optimization, it is of direct interest to study approximations to $\nabla_\theta V^{\pi_\theta}_{\gamma'}(x)$. As stated in Section 1, a major premise of our work is that in many practical contexts, estimating discounted values under $\gamma' \approx 1$ is difficult. As a result, directly evaluating the full gradient $\nabla_\theta V^{\pi_\theta}_{\gamma'}(x)$ is challenging, because it requires estimating the Q-functions $Q^{\pi_\theta}_{\gamma'}(x, a)$. Below, we start by showing how the decomposition of $\nabla_\theta V^{\pi_\theta}_{\gamma'}(x)$ motivates a particular form of gradient update, which is generally considered a deep RL heuristic. Then we construct approximations to this update based on Taylor expansions.

### 4.1. $V^\pi_{\gamma'}$ as a weighted mixture of $V^\pi_\gamma$

We can explicitly rewrite $V^\pi_{\gamma'}(x)$ as a weighted mixture of the value functions $V^\pi_\gamma(x')$, $x' \in \mathcal{X}$. This result was alluded to in (Romoff et al., 2019) and is formally shown below.

Lemma 4.1. Assume $\gamma < \gamma' < 1$. We can write $V^\pi_{\gamma'}(x) = (\rho^\pi_{x,\gamma,\gamma'})^T V^\pi_\gamma$, where the weight vector $\rho^\pi_{x,\gamma,\gamma'} \in \mathbb{R}^{\mathcal{X}}$ is $\left(I - \gamma (P^\pi)^T\right)\left(I - \gamma' (P^\pi)^T\right)^{-1} \delta_x$. We can also rewrite $V^\pi_{\gamma'}(x)$, using an expectation, as

$$V^\pi_{\gamma'}(x) = V^\pi_\gamma(x) + \mathbb{E}_\pi\left[\sum_{t=1}^{\infty} (\gamma' - \gamma)(\gamma')^{t-1} V^\pi_\gamma(x_t)\right]. \qquad (16)$$

When $\gamma' = 1$, $\rho^\pi_{x,\gamma,\gamma'}$ might be undefined. However, Eqn (16) still holds if assumptions A.1 and A.2 are satisfied.

### 4.2. Decomposing the full gradient $\nabla_\theta V^{\pi_\theta}_{\gamma'}(x)$

Lemma 4.1 highlights that $V^\pi_{\gamma'}(x)$ depends on $\pi$ in two ways: (1) through the value functions $V^\pi_\gamma(x')$, $x' \in \mathcal{X}$; (2) through the state-dependent distribution $\rho^\pi_{x,\gamma,\gamma'}(x')$. Let $\pi_\theta$ be a parameterized policy. For conceptual clarity, we can write $V^{\pi_\theta}_{\gamma'}(x) = F(V^{\pi_\theta}_\gamma, \rho^{\pi_\theta}_{x,\gamma,\gamma'})$ with a function $F : \mathbb{R}^{\mathcal{X}} \times \mathbb{R}^{\mathcal{X}} \rightarrow \mathbb{R}$. Though this function is simply the inner product, i.e., $F(V, \rho) = V^T \rho$, notationally it helps stress that $V^{\pi_\theta}_{\gamma'}(x)$ depends on $\theta$ through two vector arguments. Now, we can decompose $\nabla_\theta V^{\pi_\theta}_{\gamma'}(x)$.

Lemma 4.2. The full gradient $\nabla_\theta V^{\pi_\theta}_{\gamma'}(x)$ can be decomposed into the sum of two partial gradients as follows,

$$\nabla_\theta V^{\pi_\theta}_{\gamma'}(x) = \left(\nabla_V F(V, \rho)\right)^T \nabla_\theta V^{\pi_\theta}_\gamma + \left(\nabla_\rho F(V, \rho)\right)^T \nabla_\theta \rho^{\pi_\theta}_{x,\gamma,\gamma'} = \underbrace{\mathbb{E}\left[\nabla_\theta V^{\pi_\theta}_\gamma(x')\right]}_{\text{first partial gradient}} + \underbrace{\mathbb{E}\left[V^{\pi_\theta}_\gamma(x')\, \nabla_\theta \log \rho^{\pi_\theta}_{x,\gamma,\gamma'}(x')\right]}_{\text{second partial gradient}},$$

where the partial gradients are evaluated at $V = V^{\pi_\theta}_\gamma$, $\rho = \rho^{\pi_\theta}_{x,\gamma,\gamma'}$, and both expectations are with respect to $x' \sim \rho^{\pi_\theta}_{x,\gamma,\gamma'}$.

We argue that the second partial gradient introduces most of the challenges in practical optimization. Intuitively, this is because its unbiased estimator is equivalent to a REINFORCE gradient estimator, which requires estimating discounted values that accumulate $V^{\pi_\theta}_\gamma(x')$ as rewards under the discount factor $\gamma'$. By the premise of our work, this estimation would be difficult. We detail the discussion in Appendix D.
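A minimal numerical check of Lemma 4.1 (illustrative, not from the paper): on a random tabular MDP, the weight vector $\rho^\pi_{x,\gamma,\gamma'}$ re-weights $V^\pi_\gamma$ into $V^\pi_{\gamma'}(x)$.

```python
# Verifying Lemma 4.1 on a random tabular MDP (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(2)
num_states, gamma, gamma_p = 7, 0.6, 0.97
P_pi = rng.dirichlet(np.ones(num_states), size=num_states)
r_pi = rng.uniform(size=num_states)
I = np.eye(num_states)

V_gamma = np.linalg.solve(I - gamma * P_pi, r_pi)
V_gamma_p = np.linalg.solve(I - gamma_p * P_pi, r_pi)

x0 = 0
delta_x = np.zeros(num_states); delta_x[x0] = 1.0
# rho^pi_{x,gamma,gamma'} = (I - gamma (P^pi)^T) (I - gamma' (P^pi)^T)^{-1} delta_x
rho = (I - gamma * P_pi.T) @ np.linalg.solve(I - gamma_p * P_pi.T, delta_x)

assert np.isclose(rho @ V_gamma, V_gamma_p[x0])   # Lemma 4.1
print(rho @ V_gamma, V_gamma_p[x0])
```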
The following result characterizes the first partial gradient.

Proposition 4.3. For any $\gamma < \gamma' < 1$, the first partial gradient $\left(\nabla_V F(V^{\pi_\theta}_\gamma, \rho^{\pi_\theta}_{x,\gamma,\gamma'})\right)^T \nabla_\theta V^{\pi_\theta}_\gamma$ can be expressed as

$$\mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} (\gamma')^t\, Q^{\pi_\theta}_\gamma(x_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t|x_t)\right]. \qquad (17)$$

When $\gamma' = 1$, under assumptions A.1 and A.2, the first partial gradient exists and is expressed as

$$\mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} Q^{\pi_\theta}_\gamma(x_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t|x_t)\right]. \qquad (18)$$

Connections to a common deep RL heuristic. Many high-quality deep RL implementations (see, e.g., Dhariwal et al., 2017; Achiam and OpenAI, 2018) implement parameter updates which are very similar to Eqn (18). As such, Proposition 4.3 provides some insight on why implementing such a heuristic might be useful in practice: though in general Eqn (18) is not a gradient (Nota and Thomas, 2019), it is a partial gradient of $V^{\pi_\theta}_{\gamma'=1}(x)$, which is usually the objective of interest at evaluation time. Compared with the formula of the vanilla PG in Eqn (2), Eqn (18) offsets the over-discounting by $\gamma^t$ via a uniform average over states.

However, it is worth noting that in deep RL practice, the definition of the evaluation horizon $T$ might slightly differ from that specified in A.1. In such cases, Proposition 4.3 does not hold. By A.1, $T$ is the absorption time that defines when the MDP enters a terminal absorbing state. In many applications, however, for MDPs without a natural terminal state, $T$ is usually enforced by an external time constraint which does not depend on states. In other words, an environment can terminate even when it does not enter any terminal state (see, e.g., Brockman et al., 2016 for such examples). To bypass this subtle technical gap, one idea is to incorporate time steps as part of the state, $\tilde{x} \leftarrow [x, t]$. This technique was hinted at in early work such as (Schulman et al., 2015b) and empirically studied in (Pardo et al., 2018). In this case, the random absorption time $T$ depends fully on the augmented states, and Proposition 4.3 holds.

### 4.3. Taylor expansions of partial gradients

We now consider approximations to the first partial gradient $\left(\nabla_V F(V^{\pi_\theta}_\gamma, \rho^{\pi_\theta}_{x,\gamma,\gamma'})\right)^T \nabla_\theta V^{\pi_\theta}_\gamma = (\rho^{\pi_\theta}_{x,\gamma,\gamma'})^T \nabla_\theta V^{\pi_\theta}_\gamma$. Since $\nabla_\theta V^{\pi_\theta}_\gamma$ does not depend on $\gamma'$, the approximation is effectively with respect to the weight vector $\rho^{\pi_\theta}_{x,\gamma,\gamma'}$. Below, we show results for the $K$-th order approximation.

Proposition 4.4. Assume $\gamma < \gamma' < 1$. For any $x \in \mathcal{X}$, define the $K$-th order Taylor expansion of $\rho^\pi_{x,\gamma,\gamma'}$ as

$$\rho^\pi_{x,K,\gamma,\gamma'} = \sum_{k=0}^{K} \left((\gamma' - \gamma)\left(I - \gamma (P^\pi)^T\right)^{-1} (P^\pi)^T\right)^k \delta_x.$$

It can be shown that $V^\pi_{K,\gamma,\gamma'}(x) = (\rho^\pi_{x,K,\gamma,\gamma'})^T V^\pi_\gamma$ and $\left\| \rho^\pi_{x,K,\gamma,\gamma'} - \rho^\pi_{x,\gamma,\gamma'} \right\| = O\left(\left(\frac{\gamma' - \gamma}{1 - \gamma}\right)^{K+1}\right)$.

We now build some intuition about these approximations. Note that in general we can write the partial gradient as a weighted mixture of local gradients $Q_t \nabla_\theta \log \pi_\theta(a_t|x_t)$, where $Q_t := Q^{\pi_\theta}_\gamma(x_t, a_t)$,

$$\mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} w_{K,\gamma,\gamma'}(t)\, Q_t\, \nabla_\theta \log \pi_\theta(a_t|x_t)\right], \qquad (19)$$

for some weight function $w_{K,\gamma,\gamma'}(t) \in \mathbb{R}$. When $K \to \infty$, $\lim_{K \to \infty} w_{K,\gamma,\gamma'}(t) = (\gamma')^t$ and we recover the original first partial gradient defined in Eqn (17); when $K = 0$, $w_{K,\gamma,\gamma'}(t) = \gamma^t$ recovers the vanilla PG in Eqn (2). For other values of $K$, we show the analytic weights $w_{K,\gamma,\gamma'}(t)$ in Appendix D; see also the numerical sketch below. Similar to how $V^\pi_{K,\gamma,\gamma'}$ interpolates $V^\pi_\gamma$ and $V^\pi_{\gamma'}$, the $K$-th order expansion of the partial gradient interpolates the full partial gradient and the vanilla PG. In practice, we might expect an intermediate value of $K$ to achieve the best bias-variance trade-off for the update.
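The analytic weights are given in Appendix D; the sketch below is only one assumed way to compute $w_{K,\gamma,\gamma'}(t)$ numerically, by evaluating the expansion of $\rho$ on a deterministic chain MDP (where the state index equals the time step) and composing it with the $\gamma^t$ weighting of the per-state policy gradients. It checks the two limits stated above, $w_{0,\gamma,\gamma'}(t) = \gamma^t$ and $w_{\infty,\gamma,\gamma'}(t) = (\gamma')^t$.

```python
# Numerical interpolating weights w_{K,gamma,gamma'}(t) on a deterministic
# chain (an illustrative construction, not a formula quoted from Appendix D).
import numpy as np

def expansion_weights(K, gamma, gamma_p, horizon):
    # Deterministic chain: P^pi shifts state t to state t + 1.
    P = np.eye(horizon, k=1)
    I = np.eye(horizon)
    A = (gamma_p - gamma) * np.linalg.solve(I - gamma * P.T, P.T)
    delta0 = np.zeros(horizon); delta0[0] = 1.0
    rho_K, term = np.zeros(horizon), delta0.copy()
    for _ in range(K + 1):                   # rho_{x,K} = sum_{k<=K} A^k delta_x
        rho_K += term
        term = A @ term
    # Compose with the gamma^{t-u} weighting of each per-state policy gradient.
    t = np.arange(horizon)
    return np.array([np.sum(rho_K[: s + 1] * gamma ** (s - t[: s + 1])) for s in t])

gamma, gamma_p, H = 0.9, 0.99, 60
w0 = expansion_weights(0, gamma, gamma_p, H)
w_big = expansion_weights(200, gamma, gamma_p, H)
assert np.allclose(w0, gamma ** np.arange(H))                  # K = 0: vanilla PG weights
assert np.allclose(w_big, gamma_p ** np.arange(H), atol=1e-6)  # large K: (gamma')^t
```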
## 5. Policy optimization with Taylor expansions

Based on the theoretical insights of the previous sections, we propose two algorithmic changes to baseline algorithms. Based on Section 3, we propose Taylor expansion advantage estimation; based on Section 4, we propose Taylor expansion update weighting. It is important to note that other algorithmic changes are possible, which we leave to future work.

### 5.1. Baseline near on-policy algorithm

We briefly introduce the background of near on-policy policy optimization algorithms (Schulman et al., 2015a; Mnih et al., 2016; Schulman et al., 2017; Espeholt et al., 2018). We assume that the data are collected under a behavior policy, $(x_t, a_t, r_t)_{t=0}^{\infty} \sim \mu$, which is close to the target policy $\pi_\theta$. The on-policyness is ensured by constraining $D(\pi_\theta, \mu) \le \varepsilon$ for some divergence $D$ and threshold $\varepsilon > 0$. Usually, $\varepsilon$ is chosen to be small enough that little off-policy correction is needed for estimating value functions. With the data $(x_t, a_t, r_t)_{t=0}^{\infty}$, the algorithms estimate Q-functions $\hat{Q}^{\pi_\theta}_\gamma \approx Q^{\pi_\theta}_\gamma$. Then the estimates $\hat{Q}^{\pi_\theta}_\gamma(x, a)$ are used as plug-in alternatives to the Q-functions in the definition of gradient updates, such as Eqn (2), for sample-based updates.

### 5.2. Taylor expansion Q-function estimation

In Section 3, we discussed how to construct approximations to $Q^{\pi_\theta}_{\gamma'}$ using $Q^{\pi_\theta}_\gamma$ as building blocks. As the first algorithmic change, we propose to construct the $K$-th order expansion $Q^{\pi_\theta}_{K,\gamma,\gamma'}$ as a plug-in alternative to $Q^{\pi_\theta}_\gamma$ when combined with downstream optimization. Since $Q^{\pi_\theta}_{K,\gamma,\gamma'} \approx Q^{\pi_\theta}_{\gamma'}$, we expect the optimization subroutine to account for an objective with a longer effective horizon. In many baseline algorithms, we have access to a value function critic $V_\phi(x)$ and a subroutine which produces Q-function estimates $\hat{Q}^{\pi_\theta}_\gamma(x, a)$ (e.g., $\hat{Q}^{\pi_\theta}_\gamma(x_t, a_t) = \sum_{s=0}^{\infty} \gamma^s r_{t+s}$). We then construct the $K$-th order expansion $\hat{Q}^{\pi_\theta}_{K,\gamma,\gamma'}(x, a)$ using $\hat{Q}^{\pi_\theta}_\gamma$. This procedure is similar to Algorithm 1, and we show the full algorithm in Appendix C (a sketch follows after Algorithm 3 below). See also Appendix F for further experimental details. The overall procedure is summarized in Algorithm 2.

Algorithm 2: Taylor expansion Q-function estimation.
Require: policy $\pi_\theta$ with parameter $\theta$ and learning rate $\alpha$.
While not converged:
1. Collect partial trajectories $(x_t, a_t, r_t)_{t=1}^{T} \sim \mu$.
2. Estimate Q-functions $\hat{Q}^{\pi_\theta}_\gamma(x_t, a_t)$.
3. Construct the $K$-th order Taylor expansion estimator $\hat{Q}^{\pi_\theta}_{K,\gamma,\gamma'}(x_t, a_t)$ using $\hat{Q}^{\pi_\theta}_\gamma(x_t, a_t)$.
4. Update the parameter via gradient ascent, $\theta \leftarrow \theta + \alpha \sum_{t=1}^{T} \hat{Q}^{\pi_\theta}_{K,\gamma,\gamma'}(x_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t|x_t)$.

### 5.3. Taylor expansion update weighting

In Section 4, we discussed the Taylor expansion approximation $\rho^{\pi_\theta}_{x,K,\gamma,\gamma'}$ to the weight vector $\rho^{\pi_\theta}_{x,\gamma,\gamma'}$. As the second algorithmic change to the baseline algorithm, we update parameters in the direction of the $K$-th order approximation to the partial gradient,

$$\theta \leftarrow \theta + \alpha \left(\rho^{\pi_\theta}_{x,K,\gamma,\gamma'}\right)^T \nabla_\theta V^{\pi_\theta}_\gamma.$$

Eqn (19) shows that the update effectively translates into adjusting the weights $w_t = w_{K,\gamma,\gamma'}(t)$. When combined with the other components of the algorithm, the pseudocode is shown in Algorithm 3. Under this framework, the common deep RL heuristic can be recovered by setting $w_t = 1$.

Algorithm 3: Taylor expansion update weighting.
Require: policy $\pi_\theta$ with parameter $\theta$ and learning rate $\alpha$.
While not converged:
1. Collect partial trajectories $(x_t, a_t, r_t)_{t=1}^{T} \sim \mu$.
2. Estimate Q-functions $\hat{Q}_t = \hat{Q}^{\pi_\theta}_\gamma(x_t, a_t)$.
3. Compute weights for each state, $w_t = w_{K,\gamma,\gamma'}(t)$, and average, $g_\theta = \sum_{t=1}^{T} w_t\, \hat{Q}_t\, \nabla_\theta \log \pi_\theta(a_t|x_t)$.
4. Update parameters, $\theta \leftarrow \theta + \alpha g_\theta$.
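By analogy with Algorithm 1 (the exact Q-function procedure is deferred to Appendix C, so the following should be read as an assumed sketch rather than the paper's algorithm), a one-sample estimate of $Q^{\pi}_{K,\gamma,\gamma'}(x_t, a_t)$ can be built from base estimates $\hat{Q}^{\pi}_\gamma$ along the same trajectory and plugged into step 3 of Algorithm 2.

```python
# An assumed sketch of the K-th order Q-function estimator used in Algorithm 2,
# built by analogy with Algorithm 1 (geometric random times measured from t).
import numpy as np

def taylor_q_estimate(q_hat, t, K, gamma, gamma_p, rng):
    """q_hat[s] is a base estimate of Q^pi_gamma(x_s, a_s) along one trajectory."""
    estimate, s = q_hat[t], t
    for k in range(1, K + 1):
        s += rng.geometric(p=1.0 - gamma)          # tau_k ~ Geometric(1 - gamma)
        if s >= len(q_hat):                        # trajectory assumed long enough
            break
        estimate += ((gamma_p - gamma) / (1.0 - gamma)) ** k * q_hat[s]
    return estimate

rng = np.random.default_rng(0)
q_hat = rng.uniform(size=1000)   # stand-in base estimates \hat{Q}^pi_gamma(x_s, a_s)
print(taylor_q_estimate(q_hat, t=0, K=1, gamma=0.99, gamma_p=0.999, rng=rng))
```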
## 6. Related work

Discount factors in RL. Discount factors impact RL agents in various ways. A number of works suggest that RL problems with large discount factors are generally more difficult to solve (Jiang et al., 2016), potentially due to the increased complexity of the optimal value functions or the collapse of the action gaps (Lehnert et al., 2018; Laroche and van Seijen, 2018). However, optimal policies defined with small discounts can be very sub-optimal for RL objectives with a large discount factor. To ensure the numerical stability of using large discounts, prior work has suggested nonlinear transformations of the Bellman targets for Q-learning (Pohlen et al., 2018; van Hasselt et al., 2019; Kapturowski et al., 2018; Van Seijen et al., 2019). On the other hand, when data is scarce, small discount factors might prove useful due to their implicit regularization effect (Amit et al., 2020). As such, there is a trade-off mediated by the choice of discount factor. Similar trade-off effects are most well-known in the context of TD($\lambda$), where $\lambda \in [0, 1]$ trades off the bias and variance of the TD updates (Sutton and Barto, 2018; Kearns and Singh, 2000).

Adapting discount factors and multiple discount factors. In general, when selecting a single optimal discount factor for training is difficult, it might be desirable to adjust the discount during training. This can be achieved by human-designed (Prokhorov and Wunsch, 1997; François-Lavet et al., 2015) or blackbox adaptation (Xu et al., 2018). Alternatively, it might also be beneficial to learn with multiple discount factors at the same time, which can improve TD-learning (Sutton, 1995) or representation learning (Fedus et al., 2019). Complementary to all such work, we study the connections between value functions defined with different discounts.

Taylor expansions for RL. Recently, Taylor expansions were applied to study the relationship between $V^\pi_\gamma$ and $V^\mu_\gamma$, i.e., value functions under the same discount factor but different policies $\pi \neq \mu$ (Tang et al., 2020). This is useful in the context of off-policy learning. Our work is orthogonal and could potentially be combined with this approach.

## 7. Experiments

In this section, we evaluate the empirical performance of the new algorithmic changes to the baseline algorithms. We focus on robotics control experiments with continuous state and action spaces. The tasks are available in OpenAI Gym (Brockman et al., 2016), with backends such as MuJoCo (Todorov et al., 2012) and Bullet physics (Coumans, 2015). We label the tasks as gym (G) and bullet (B), respectively. We always compare the undiscounted cumulative rewards evaluated under a default evaluation horizon $T = 1000$.

Hyper-parameters. Throughout the experiments, we use the same hyper-parameters across all algorithms. The learning rate is tuned for the baseline PPO and fixed across all algorithms. See Appendix F for further details.

### 7.1. Taylor expansion Q-function estimation

We use $\hat{Q}^{\pi_\theta}_{K,\gamma,\gamma'}(x, a)$ with $K = 1$ as the Q-function estimator plugged into the gradient update. When combined with PPO (Schulman et al., 2017), the resulting algorithm is named PPO($K$). We compare with the baseline PPO and TRPO (Schulman et al., 2015a). In practice, we consider a mixture advantage estimator,

$$\hat{Q}^{\pi_\theta}(x, a) = (1 - \eta)\, \hat{Q}^{\pi_\theta}_\gamma(x, a) + \eta\, \hat{Q}^{\pi_\theta}_{K,\gamma,\gamma'}(x, a),$$

with $\eta \in [0, 1]$ a constant that interpolates between PPO (i.e., $\eta = 0$) and PPO($K$). Note that $\eta$ should be selected to balance the numerical scales of the two extremes; as a result, we usually find $\eta$ to work well when it is small in absolute scale ($\eta = 0.01$ works best).
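As a small illustration of the mixture estimator above (stand-in arrays; only $\eta$ is taken from the experiments), the interpolation between PPO ($\eta = 0$) and PPO($K$) is a single line.

```python
# Mixing the standard and Taylor-expansion Q estimates (illustrative values).
import numpy as np

rng = np.random.default_rng(0)
eta = 0.01
q_gamma = rng.uniform(size=1000)      # \hat{Q}^{pi_theta}_gamma(x_t, a_t), stand-in
q_taylor_k = rng.uniform(size=1000)   # \hat{Q}^{pi_theta}_{K,gamma,gamma'}(x_t, a_t), stand-in
q_mixed = (1.0 - eta) * q_gamma + eta * q_taylor_k   # plug-in for the PG update
```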
Results. In Figure 2, we compare a few baselines: (1) PPO with $\gamma = 0.99$ (default); (2) PPO with a high discount factor $\gamma = 1 - \frac{1}{T} = 0.999$; (3) PPO with the Taylor expansion based advantage estimator, PPO($K$). Throughout, we use a single hyper-parameter $\eta = 0.01$. We see that in general, PPO($K$) leads to better performance (faster learning, better asymptotic performance, or smaller variance across 5 seeds). This shows that Taylor expansion Q-function estimation can lead to performance gains across tasks, given that the hyper-parameter $\eta$ is carefully tuned. We provide a detailed ablation study on $\eta$ in Appendix F, where we show how the overall performance across the benchmark tasks varies as $\eta$ changes from small to large values. A second observation is that simply increasing the discount factor to $\gamma = 1 - \frac{1}{T} = 0.999$ generally degrades performance. This confirms the instability of directly applying high discount factors, which motivated this work. We also compare with the open-source implementation of Romoff et al. (2019) in Appendix F, where they estimate $\hat{Q}^\pi_{\gamma'}$ based on recursive bootstraps of Q-function differences. Conceptually, this is similar to Taylor expansions with $K = \infty$. We show that without the careful trade-off mediated by a smaller $K$, this algorithm does not improve performance out of the box.

Figure 2. Comparison of Taylor expansion Q-function estimation with other baselines on (a) HalfCheetah (G), (c) Walker2d (G), (d) HalfCheetah (B), (f) Walker2d (B). Each curve shows median ± std results across 5 seeds. Taylor expansion outperforms PPO baselines with both the lower and the higher discount factor.

### 7.2. Taylor expansion update weighting

As introduced in Section 5, we weigh the local gradients $\hat{Q}_t \nabla_\theta \log \pi_\theta(a_t|x_t)$ with the $K$-th order expansion weights $w_{K,\gamma,\gamma'}(t)$. Here, we take $\gamma' = 1 - \frac{1}{T}$. Note that since $K = \infty$ corresponds to $\lim_{K \to \infty} w_{K,\gamma,\gamma'}(t) = (\gamma')^t \approx 1$, this limit is very close to the commonly implemented PPO baseline. We hence expect the algorithm to work better with relatively large values of $K$, and we set $K = 100$ throughout the experiments. In practice, we find the performance to be fairly robust to the choice of $K$. We provide further analysis and an ablation study in Appendix F.

Results. We compare a few baselines: (1) default PPO; (2) PPO with time limit (Pardo et al., 2018), in which the states are augmented with time steps, $\tilde{x} \leftarrow [x, t]$, such that the augmented states $\tilde{x}$ are Markovian; (3) PPO with Taylor expansion update weighting, PPO($K$). In Figure 3, we see that in general, PPO($K$) and PPO with time limit outperform the baseline PPO. We speculate that the performance gains arise from the following empirical motivation: since the evaluation stops at $t = T$, local gradients close to $t = T$ should be weighed down because they do not contribute as much to the final objective. However, the default PPO ignores this effect and weighs all updates uniformly. To tackle this issue, PPO($K$) explicitly weighs down late updates, while PPO with time limit augments the state space to restore stationarity. Empirically, though in some cases PPO with time limit also outperforms PPO($K$), it behaves fairly unstably in other cases.

Figure 3. Comparison of Taylor expansion update weighting with other baselines on (a) HalfCheetah (G), (c) Walker2d (G), (d) HalfCheetah (B), (f) Walker2d (B). Each curve shows median ± std results across 5 seeds. Taylor expansion outperforms the default PPO baseline most stably.

Extensions to off-policy algorithms. Above, we mainly focused on on-policy algorithms. The setup is simpler because the data are collected (near) on-policy. It is possible to extend similar results to off-policy algorithms (Mnih et al., 2015; Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018).
Due to space limits, we present extended results in Appendix F, where we show how to combine similar techniques with off-policy actor-critic algorithms such as TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018) in continuous control domains.

## 8. Conclusion

We have proposed a family of objectives that interpolate value functions defined with two discount factors. We have shown that similar techniques are applicable to other cumulative quantities defined through discounts, such as PG updates. This framework allows us to trade off bias and variance when estimating value functions or gradient updates, and it led to empirical performance gains. We also highlighted a new direction for bridging the gap between theory and practice: the gap between a fully discounted objective (in theory) and an undiscounted objective (in practice). By building a better understanding of this gap, we shed light on seemingly opaque heuristics which are necessary to achieve good empirical performance. We expect this framework to be useful for new practical algorithms.

Acknowledgements. Yunhao thanks Tadashi Kozuno and Shipra Agrawal for discussions on the discrepancy between policy gradient theory and practice. Yunhao acknowledges the support from Google Cloud Platform for computational resources.

## References

Joshua Achiam and OpenAI. Spinning Up in Deep Reinforcement Learning. https://github.com/openai/spinningup, 2018.

Ron Amit, Ron Meir, and Kamil Ciosek. Discount factor as a regularizer in reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2020.

Richard Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, pages 679-684, 1957.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv, 2016.

Erwin Coumans. Bullet physics simulation. In ACM SIGGRAPH 2015 Courses, page 1, 2015.

Daniela Pucci De Farias and Benjamin Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850-865, 2003.

Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines, 2017.

Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning, 2018.

William Fedus, Carles Gelada, Yoshua Bengio, Marc G Bellemare, and Hugo Larochelle. Hyperbolic discounting and learning over multiple horizons. arXiv, 2019.

Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2016.

Vincent François-Lavet, Raphael Fonteneau, and Damien Ernst. How to discount deep reinforcement learning: Towards new dynamic strategies. NIPS Deep Reinforcement Learning Workshop, 2015.

Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, 2018.

Charles Miller Grinstead and James Laurie Snell. Introduction to Probability. American Mathematical Society, 2012.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, 2018.

Hado van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, 2010.

Michael Janner, Igor Mordatch, and Sergey Levine. Gamma-models: Generative temporal difference learning for infinite-horizon prediction. In Advances in Neural Information Processing Systems, 2020.

Nan Jiang, Satinder P Singh, and Ambuj Tewari. On structural properties of MDPs that bound loss due to shallow planning. In Proceedings of the International Joint Conference on Artificial Intelligence, 2016.

Steven Kapturowski, Georg Ostrovski, John Quan, Rémi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2018.

Michael J Kearns and Satinder P Singh. Bias-variance error bounds for temporal difference updates. In Proceedings of the Conference on Learning Theory, 2000.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Romain Laroche and Harm van Seijen. In reinforcement learning, all objective functions are not equal. 2018.

Lucas Lehnert, Romain Laroche, and Harm van Seijen. On value function representation of long horizon problems. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2015.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2016.

Chris Nota and Philip S Thomas. Is the policy gradient a gradient? In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2019.

Brendan O'Donoghue, Rémi Munos, Koray Kavukcuoglu, and Volodymyr Mnih. Combining policy gradient and Q-learning. In Proceedings of the International Conference on Learning Representations, 2016.

Fabio Pardo, Arash Tavakoli, Vitaly Levdik, and Petar Kormushev. Time limits in reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2018.

Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Večerík, Matteo Hessel, Rémi Munos, and Olivier Pietquin. Observe and look further: Achieving consistent performance on Atari. arXiv, 2018.

Danil V Prokhorov and Donald C Wunsch. Adaptive critic designs. IEEE Transactions on Neural Networks, 8(5):997-1007, 1997.

Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
Joshua Romoff, Peter Henderson, Ahmed Touati, Emma Brunskill, Joelle Pineau, and Yann Ollivier. Separating value functions across time-scales. In Proceedings of the International Conference on Machine Learning, 2019.

Sheldon M Ross. Introduction to Probability Models. Academic Press, 2014.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In Proceedings of the International Conference on Learning Representations, 2015.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, 2015a.

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv, 2017.

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, 2014.

Samarth Sinha, Jiaming Song, Animesh Garg, and Stefano Ermon. Experience replay with likelihood-free importance weights. arXiv, 2020.

Richard S Sutton. TD models: Modeling the world at a mixture of time scales. In Proceedings of the International Conference on Machine Learning, 1995.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 2000.

Yunhao Tang, Michal Valko, and Rémi Munos. Taylor expansion policy optimization. In Proceedings of the International Conference on Machine Learning, 2020.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.

Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

Hado van Hasselt, John Quan, Matteo Hessel, Zhongwen Xu, Diana Borsa, and André Barreto. General non-linear Bellman equations. arXiv, 2019.

Harm Van Seijen, Mehdi Fatemi, and Arash Tavakoli. Using a logarithmic mapping to enable lower discount factors in reinforcement learning. In Advances in Neural Information Processing Systems, 2019.

Zhongwen Xu, Hado P van Hasselt, and David Silver. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, 2018.

Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2008.