# toward_efficient_gradientbased_value_estimation__2905c9ee.pdf Toward Efficient Gradient-Based Value Estimation Arsalan Sharifnassab 1 Richard Sutton 1 Gradient-based methods for value estimation in reinforcement learning have favorable stability properties, but they are typically much slower than Temporal Difference (TD) learning methods. We study the root causes of this slowness and show that Mean Square Bellman Error (MSBE) is an ill-conditioned loss function in the sense that its Hessian has large condition-number. To resolve the adverse effect of poor conditioning of MSBE on gradient based methods, we propose a low complexity batch-free proximal method that approximately follows the Gauss-Newton direction and is asymptotically robust to parameterization. Our main algorithm, called RANS, is efficient in the sense that it is significantly faster than the residual gradient methods while having almost the same computational complexity, and is competitive with TD on the classic problems that we tested. 1. Introduction Value estimation is a core problem in reinforcement learning (Sutton & Barto, 2018), and is a key ingredient in several policy optimization methods, e.g., (Bhatnagar et al., 2009; Minh et al., 2015; Lillicrap et al., 2015) . The popular class of value estimation algorithms based on temporal difference learning via forward bootstrapping; including TD(λ) (Sutton, 1988), Expected Sarsa (van Seijen et al., 2009), and Q-learning (Watkins & Dayan, 1992) have found substantial empirical success when combined with proper policy optimization (Minh et al., 2015; Minh et al., 2016; Lillicrap et al., 2015). Nevertheless, these algorithms are not gradient-based optimization methods (Barnard, 1993) and their convergence cannot be guaranteed for general function approximation setting (Baird, 1995; Tsitsiklis & Van Roy, 1997; Brandfonbrener & Bruna, 2019). The stabil- 1Authors are with the Department of Computing Science, University of Alberta, Canada. Correspondence to: Arsalan Sharifnassab . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). ity problem of TD learning has inspired other classes of value estimation algorithms that involve optimizing a loss function through gradient updates. This includes Residual Gradient (RG) algorithm for minimizing Mean Squared Bellman Error (MSBE) (Baird, 1995), Gradient-TD algorithms for minimizing projected Bellman error (Sutton et al., 2009; Maei et al., 2010; Maei, 2011; Hackman, 2012), and their extensions for optimizing a dual formulation of BE2 (Liu et al., 2015; Macua et al., 2014; Dai et al., 2017). These algorithms enjoy the general robustness and convergence properties of Stochastic Gradient Descent (SGD), but are known to be slower than TD in tabular and linear function approximation settings (Baird, 1995; Schoknecht & Merke, 2003; Gordon, 1999; Ghiassian & Sutton, 2021) . In this paper, we investigate the root causes of the slowness problem of gradient-based value estimation by taking a deeper look into the landscape of MSBE, and propose linear complexity methods to alleviate these problems. We provide theoretical results showing that MSBE is an ill-conditioned loss function in the senses that the condition-number of its Hessian matrix is typically very large. This explains slowness of gradient-based value estimation methods, because gradient descent in general is slow in minimizing ill-conditioned loss functions. 
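As a concrete numerical illustration of this point (a minimal sketch, not an experiment from the paper; all names and numbers are illustrative), the snippet below runs gradient descent on two toy quadratics with condition numbers 1 and 1000. The number of iterations needed to reach a fixed accuracy grows roughly in proportion to the condition number.

```python
import numpy as np

def gd_iterations(H, x0, tol=1e-6, max_iter=200_000):
    """Gradient descent on f(x) = 0.5 * x^T H x with the optimal constant
    step-size 2/(lambda_min + lambda_max); returns the number of iterations
    needed to drive the loss below tol."""
    eigs = np.linalg.eigvalsh(H)
    alpha = 2.0 / (eigs[0] + eigs[-1])
    x = x0.copy()
    for t in range(max_iter):
        if 0.5 * x @ H @ x < tol:
            return t
        x = x - alpha * (H @ x)
    return max_iter

x0 = np.ones(2)
H_well = np.diag([1.0, 1.0])       # condition number 1
H_ill  = np.diag([1.0, 1000.0])    # condition number 1000
print(gd_iterations(H_well, x0))   # converges in a single step
print(gd_iterations(H_ill, x0))    # takes on the order of thousands of steps
```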
In contrast, algorithms like Newton and Gauss-Newton methods are invariant to conditioning of the loss. Unfortunately a direct implementation of these methods requires matrix inversion, which is computationally costly even if computed incrementally. We propose a linear complexity incremental algorithm, called Residual Approximate Gauss-Newton (RAN), that incorporates a trace to approximate the Gauss-Newton direction and then updates the weights along that trace. We show that RAN can be equivalently formulated as a batch-free proximal algorithm. A weakness of RAN is that it requires double sampling (Baird, 1995), which limits its use in stochastic environments. We propose a double-sampling-free extension of RAN by following similar ideas that underlie GTD-type methods. The resulting algorithms significantly outperform RG and GTD2, being orders of magnitudes faster on the simple classic environments that we tested, while having almost similar computational complexity to RG and GTD2. We then turn our focus to a second cause of slowness of gradient-based value estimation: under function approximation, sample gradients of MSBE involve large outliers Toward Efficient Gradient-Based Value Estimation that carry important information, resulting in large variance of stochastic updates. Outliers of this type often appear in every episode (usually at pre-terminal transitions), and are specific to gradient-based value estimation methods (i.e. such outliers do not appear in TD learning). We propose a general technique called outlier-splitting, which results in no information loss as opposed to the standard clipping methods. Our main value estimation algorithm, called RAN with outlier-Splitting (RANS), has linear computational complexity and has only one effective hyper-parameter (and some other hyper-parameters that can be set to their default values), thanks to its adaptive step-size mechanism. Our empirical results on a few classic control environments with neural network function approximation show significant improvement over RG, and achieving competitive performance to TD. 2. Background We consider a discounted Markov Decision Process (MDP) defined by the tuple (S, A, R, p, γ), where S is a finite set of states, A is a finite set of actions, R is a set of rewards, p : S A S R [0, 1] is the environment dynamics determining the probability of the next state and immediate reward given a current state and action pair, and γ [0, 1] is a discount factor. We fix a stationary policy π : S A [0, 1], and let pπ(s , a , r|s, a) = p(St+1 = s , Rt+1 = r|St = s, At = a)π(At+1 = a |St+1 = s ). We consider an episodic and online setting where a data stream (S1, A1, R1), (S2, A2, R2), . . . is generated according to the policy π. The action-value function qπ : S A R, at each state s and action a, is the expected discounted sum of rewards obtained by starting from state s and action a and following policy π. We then define the value function vπ : S R as vπ(s) = Ea π( |s)[qπ(s, a)]. In value estimation, we aim to obtain an estimate of the true action-values qπ, usually through a function qw : S A R parameterized by a d-dimensional weight vector w. Corresponding to qw is a Bellman residual at each state and action pair (s, a), defined as δw(s, a) def= Es ,a ,r pπ( , , |s,a) r+γqw(s , a ) qw(s, a) . According to Bellman equations (Sutton & Barto, 2018; Bertsekas & Tsitsiklis, 1996), qw = qπ if and only if δw(s, a) = 0 for all (s, a) S A. 
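As a minimal tabular sketch of this definition (using state values rather than action values, and a toy three-state chain of our own choosing), the Bellman residual can be computed directly and checked to vanish exactly at the true values:

```python
import numpy as np

# Toy 3-state Markov chain under a fixed policy (illustrative numbers).
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
r = np.array([1.0, 0.0, 2.0])     # expected reward in each state
gamma = 0.9

def bellman_residual(v):
    """delta_v(s) = E[r + gamma * v(S') - v(s)] for every state s."""
    return r + gamma * (P @ v) - v

v_pi = np.linalg.solve(np.eye(3) - gamma * P, r)   # true values v_pi
print(bellman_residual(v_pi))            # ~[0, 0, 0]
print(bellman_residual(np.zeros(3)))     # nonzero for an arbitrary estimate
```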
In this view, MSBE(w), defined below, serves as a proxy for the quality of estimates w: MSBED(w) def= E(s,a) D δw(s, a)2 , (1) where D is some distribution over states and action pairs. When the distribution is online, we drop the subscript D and write MSBE( ). For simplicity of notation, we also write Es,a[ ] to denote the expectation with respect to state and action pairs sampled from the online distribution. In the same vein, we consider a parameterized estimate vw : S R of value function vπ, and let MSBEV D(w) def= Es D δw(s)2 , (2) where δw(s) = Ea π( |s)[δw(s, a)], and D is some distribution over states. Gradient-based value estimation methods use gradientbased optimization algorithms to minimize MSBE or other related objectives such as MSPBE (Sutton et al., 2009). The first and simplest method in this category is the RG algorithm (Baird, 1995). In this algorithm, to obtain an unbiased sample estimate of w(δw(St, At)2), we require independent samples (St+1, At+1, Rt) and (S t+1, A t+1, R t) from pπ( , , |St, At). For simplicity of notation, at time t, we let δt def= Rt + γqw(St+1, At+1) qw(St, At), (3) δ t def= R t + γqw(S t+1, A t+1) qw(St, At), (4) where wt = w. The RG update is then w w αδ t wδt. (5) The requirement for two independent sample transitions at time t is called double sampling (Sutton & Barto, 2018). In stochastic environments, double sampling is possible only if we have a correct model of the world. In other words, MSBE minimizer is not learnable if an exact model of the underlying stochastic environment is not available, which is the case in real-world applications (Sutton & Barto, 2018). A general technique to circumvent double sampling is using Fenchel duality to obtain an equivalent saddle point formulation of MSBE (Dai et al., 2017, Du et al., 2017) as min w max ˆδ( , ) Es,a δw(s, a) ˆδ(s, a) 1 2 ˆδ(s, a)2 , (6) where ˆδ(s, a) is an auxiliary variable that serves as a proxy of δw(s, a). In practice, one can consider a parametric approximation ˆδθ( , ) of ˆδ( , ), and perform gradient updates on the resulting minimax problem minw maxθ Es,a h δw(s, a) ˆδθ(s, a) 1 2 ˆδθ(s, a)2i : w w αˆδθ(St, At) wδt, θ θ + η δt ˆδθ(St, At) θˆδθ(St, At), (7) (Sutton et al., 2009; Liu et al., 2020; Dai et al., 2017). Intuitively, this is similar to the RG algorithm in (5) except for using the parametric approximation ˆδθ(St, At) instead of δ t, and updating ˆδθ(s, a) by SGD on Es,a (ˆδθ(s, a) δw(s, a))2 . The GTD2 algorithm (Sutton et al., 2009) is a special case of (7) in which qw and ˆδθ are linear approximations of the form qw(s, a) = φT s,aw and ˆδθ(s, a) = φT s,aθ, for feature vectors φs,a (Liu et al., 2020; Dai et al., 2017). Toward Efficient Gradient-Based Value Estimation 3. MSBE loss is ill-conditioned The condition-number of a symmetric square matrix, H, is defined as the ratio of its largest to smallest singular values, maxx: x =1 |x T Hx|/ miny: y =1 |y T Hy|. For a quadratic function f(x) = x T Hx, we define the conditionnumber, C(f), of f as the condition-number of its Hessian matrix H. Intuitively, level sets (or contours) of a convex quadratic function have an elliptical shape, and the condition-number C(f) equals the squared ratio between the largest and the smallest diameters of each of these ellipsoids (see Fig. 1). We say that f is ill-conditioned if C(f) is very large. Then, the level sets of an ill-conditioned quadratic function have the shape of ellipsoids that are thin. 
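As a small worked example (a sketch; the environment is the two-state loop used in Figure 1 below), the condition-number of the tabular MSBE Hessian, which under a uniform state distribution is proportional to (I - gamma P)^T (I - gamma P), can be computed directly. For gamma = 0.8 it equals ((1 + gamma)/(1 - gamma))^2 = 81:

```python
import numpy as np

gamma = 0.8
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])                  # two-state deterministic loop
M = np.eye(2) - gamma * P
A = M.T @ M                                 # tabular MSBE Hessian, up to a positive scale
eigs = np.linalg.eigvalsh(A)
print(eigs[-1] / eigs[0])                   # 81.0 = ((1 + gamma) / (1 - gamma))**2
```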
It is known that the convergence rate of the gradient descent on a quadratic loss f scales with C(f) (Polyak, 1964), which can be very slow for ill-conditioned loss functions. In this section, we consider linear function approximation. In this case, MSBEV D( ) defined in (2) is a convex quadratic function. We denote the condition-number of MSBEV D( ) under uniform distribution D by C. We let l be the average episode length, defined as the expected time until termination when starting from a state, uniformly averaged over all states. We also let h def= Es unif P(St+1 = s|St = s) be the the average self-loop probability. Note that h is typically much smaller than 1. Theorem 3.1. In the tabular case, the following statements hold for any discount factor γ [0, 1]: a) For any MDP and under any policy, we have 4 min 1 (1 γ)2 , l2 (8) where l is the average episode length and h is the average self-loop probability. b) For any n 1, there exists an n-state MDP and a policy for which C γ4n2/(1 γ)2. The proof is given in Appendix A.1. A similar result also holds for the condition-number of MSBE defined in (1) (see Proposition A.1 in Appendix A.2). Theorem 3.1 shows that MSBE is typically ill-conditioned in the tabular case. This explains the slow convergence of gradient-based methods for minimizing MSBE. As an example, the bound in (8) implies that for γ = .99 and for any MDP and policy pair with average episode length at least 100 and average self-loop probability no larger than 0.1, we have C > 2000. Moreover, Theorem 3.1 (b) implies that for γ = .99, there is a 100-state Markov chain for which C > 96, 000, 000. Lower bounds similar to Theorem 3.1 are not possible for non-tabular linear function approximation. This is because different feature representations can improve or worsen the Figure 1. Level sets of MSBE (gray curves) in a 2-state loop environment with p(s0 s1) = p(s1 s0) = 1 for γ = 0.8. Here, condition number of MSBE is 81, and is equal to the squared ratio between the diameters (red) of each ellipsoid. The solution trajectory of RG (blue) for α = .9 and the Gauss-Newton direction (green) are also depicted. In this environment, C = O(1/(1 γ)2) (see Theorem 3.1), which rapidly grows for larger γ. condition-number. To see why, note that in the linear function approximation case and under uniform state distribution, MSBEV unif(w) = w T ΦT (I γP)T (I γP)Φw, where Φ is an n d matrix, each row of which is a feature vector of a state; and P is the transition matrix. For the specific choice Φ = (I γP) 1 we obtain C = 1, while for the case that Φ is not full-rank, we have C = . In general, since underparameterized function approximation reduces parameters dimension, it usually improves condition-number. Fig. 2 illustrates dependence of C on the number of features, d, in an extended Boyan chain environment (Boyan, 2002) with 200 states and with random binary features (see Appendix G.1 for details). We observe that smaller d results in better condition-number; but this comes at the cost of larger value-error at MSBE minimum (the red curve in Fig. 2), where by value-error we mean Es unif Ea π( |s)[ qw(s, a) qπ(s, a) 2] . See Appendix B for more experiments on condition number under linear function approximation. 4. A review of the Gauss-Newton method Consider an expected loss function of the form F(w) = Ef[f 2(w)], and the associated Hessian matrix HF = E[ f f T ] + E[f Hf], where Hf denotes Hessian of sample function f. 
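Writing this decomposition out explicitly (with the constant factor of two dropped, as in the text; equivalently, one may take F(w) = (1/2) E[f^2(w)]):
\[
\nabla F(w) \;=\; \mathbb{E}\big[f(w)\,\nabla f(w)\big],
\qquad
H_F \;=\; \nabla^2 F(w) \;=\; \mathbb{E}\big[\nabla f(w)\,\nabla f(w)^{\top}\big] \;+\; \mathbb{E}\big[f(w)\,H_f(w)\big].
\]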
The first term on the right hand side, E[ f f T ], is called the Gauss-Newton matrix and is denoted by G. The Gauss-Newton algorithm then updates w as w w αG 1 F(w). In the special case that functions f are linear, we have Hf = 0 and thereby HF = G. In this case, Gauss-Newton and Newton methods become equivalent. However, the Gauss-Newton algorithm has two advantages in the non-linear case. Firstly, G 1 F(w) is Toward Efficient Gradient-Based Value Estimation Figure 2. Condition-number of MSBE (blue) and value-error at MSBE minimizer (red) versus number of features, in a 200-state extended Boyan chain with random binary features. always a descent direction, as opposed to the Newton updates that may climb uphill and converge to local maxima or saddle points (Nesterov & Polyak, 2006). Secondly, G can be computed in terms of gradients while H entails second order derivatives which are not as easily accessible in certain settings (Nocedal & Wright, 1999). As opposed to gradient descent which is prohibitively slow in ill-conditioned problems, Newton and Gauss-Newton methods are invariant to conditioning of the loss function. Some recent works proposed using Gauss-Newton method for value estimation (Gottwald et al., 2021). However, these algorithms require matrix inversion, which is computationally costly even if computed incrementally via Sherman Morrison formula with quadratic complexity (Sherman & Morrison, 1950). In the next section, we propose an incremental low complexity approach that approximately follows the Gauss-Newton direction. 5. Our first algorithm: RAN The Gauss-Newton direction for minimizing MSBE is m GN(w) = G 1 w MSBE(w)/2, where Gw = Es,a δw(s, a) δw(s, a)T is the Gauss-Newton matrix and MSBE(w) = 2Es,a δw(s, a) δw(s, a) . Then, m GN is the minimizer of the following quadratic function: L(m) def= 1 2 Es,a h δw(s, a) δw(s, a)T m 2i . (9) This is because for any m, m L(m) = Es,a δw(s, a)T m δw(s, a) δw(s, a) = Gwm MSBE(w)/2, and therefore m L(m GN) = 0. We follow a two time scale approach (Yao et al., 2009, Bhatnagar et al., 2009; Dabney & Thomas, 2014) to incrementally find an approximate minimizer m of L and update w along that direction. More concretely, given a β > 0 and λ [0, 1], at Algorithm 1 RAN Parameters: step-sizes α, β, and decay parameter λ Initialize: m = 0 and w for t = 1, 2, . . . do consider δt and δ t defined in (3) and (4), respectively m λm + β δ t δt m m β(m T δt) δt w w αm end for time t, we update m along an unbiased sample gradient of βL(m) + (1 λ) m 2, m λm + β δ t m T δ t δt, (10) where δt and δ t are defined in (3) and (4), and (1 λ) m 2 is a Levenberg Marquardt regularizer (Marquardt, 1963)1. We then update w along m, i.e, w w αm. Algorithm 1 gives the pseudo code of the RAN algorithm. For better stability and faster convergence, the update of m in RAN is of the form m λm + β δ t m T δt δt. (11) which is the same as (10) except for using δt instead of δ t. The updates in (11) have lower variance compared to (10), and additionally δt δT t in (11) is positive semidefinite, as opposed to the finite sample estimate of the Gauss-Newton matrix 1 τ Pτ t=1 δt δ T t in (10). For fixed w, the update in (11) is in expectation along Et[ δt δT t ] + (1 λ)I 1 MSBE(w). (12) In Appendix C, we provide further intuition for RAN, by presenting a derivation of Algorithm 1 as a proximal method with momentum for minimizing MSBE. 
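The per-step computation of Algorithm 1 can be sketched as follows for linear function approximation q_w(s, a) = phi(s, a)^T w (a sketch only; the interface and variable names are ours, and the two transitions passed in stand for the double sample required by (3) and (4)). For non-linear q_w, the gradient of delta_t is simply obtained by backpropagation through gamma q_w(S_{t+1}, A_{t+1}) - q_w(S_t, A_t).

```python
import numpy as np

def ran_step(w, m, phi, phi_next, phi_next2, r, r2, gamma, alpha, beta, lam):
    """One step of RAN (Algorithm 1) with linear q_w(s, a) = phi(s, a)^T w.

    (phi, phi_next, r) and (phi, phi_next2, r2) are two independently sampled
    transitions from the same (S_t, A_t) -- the double sample needed to form
    delta_t and delta'_t."""
    delta2     = r2 + gamma * phi_next2 @ w - phi @ w   # delta'_t, as in (4)
    grad_delta = gamma * phi_next - phi                 # gradient of delta_t w.r.t. w
    m = lam * m + beta * delta2 * grad_delta            # trace toward Gauss-Newton direction
    m = m - beta * (m @ grad_delta) * grad_delta        # curvature correction along grad_delta
    w = w - alpha * m                                   # update weights along the trace
    return w, m
```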
In this view, m serves as a momentum of MSBE gradients, to which we add a correction term equal to the gradient of a penalty function that aims to regularize the change in δw(s, a) for all state action pairs (s, a). Convergence of the RAN algorithm can be shown in twotime-scale regime where αt, βt 0, with α diminishing faster than β (i.e., αt/βt 0)2. Convergence of such two-time-scale algorithms is well-studied (Kushner & Yin, 2003; Konda & Tsitsiklis, 1999; Bhatnagar et al., 2009), 1We have empirically observed that λ = 1 often leads to slow convergence, because it causes large inertia in m, and therefore large oscillations in w. The best performance is achieved for λ (0.99, 0.9999). 2Note that the two-time-scale view is only for the purpose of convergence analysis, and in practice we consider fixed or adaptive step-sizes whose ratio needs not go to zero. Toward Efficient Gradient-Based Value Estimation Figure 3. The Hallway experiment discussed in Section 5. under some smoothness and irreducibly conditions. In Appendix D, we discuss different conditions for convergence of Algorithm 1 in the two-time-scale regime. Moreover, in this regime, RAN is robust to reparameterization: Proposition 5.1 (Informal). For λ = 1 and asymptotically small step-sizes α 0 and α/β 0, the trajectory of w in the RAN algorithm is invariant to any differentiable and bijective non-linear transformation on parameterization. The formal version of Proposition 5.1 and its proof are given in Appendix E. We evaluated the performance of RAN in a simple benchmark environment. Consider an environment with n states and one action, in which each state i = 1, 2 . . . , n transits to state min(i + 1, n) with probability 1 ϵ, and transits to a terminal state with probability ϵ, for some ϵ [0, 1). This is a generalization of the Hallway environment (Baird, 1995), and is known to be a challenging task for the RG algorithm (Baird, 1995). We tested Algorithm 1 in this environment with n = 50, ϵ = 0.01, and γ = 0.99 in the tabular setting (see Appendix G.2 for the details of this experiment). The learning curves are depicted in Fig. 3. We observe that, in this experiment, Algorithm 1 is about 30 times faster than RG, and reaches a convergence rate close to TD(0) (Sutton, 1988; Sutton & Barto, 2018). 6. Double-sampling-free RAN algorithm In Algorithm 1, we require double sampling to compute δ t. In this section, we propose a Double-Sampling-Free version of RAN, called DSF-RAN. Double sampling is easily doable in deterministic environments (Saleh & Jiang, 2019; Zhang et al., 2020), in which case δ t can be computed using an independent sample A t+1 from the policy. However, for double sampling in stochastic environments, we require a model to get an independent sample S t+1 of the next state, which is typically possible only in simulated environments. To resolve the double sampling issue of RAN in stochastic Algorithm 2 DSF-RAN Parameters: step-sizes α, β, η, and decay parameter λ Initialize: m = 0, w, θ. for t = 1, 2, . . . do δt = γ wqw(St+1, At+1) wqw(St, At) m λm + β ˆδθ(St, At) δt m m β(m T δt) δt w w αm θ θ + η δt ˆδθ(St, At) θˆδθ(St, At) end for Figure 4. The Baird s star experiment discussed in Section 6 environments, we use the technique discussed in Section 2, which was also used in the GTD2 algorithm. More specifically, instead of δ t in Algorithm 1, we use a parametric approximation ˆδθ(St, At) of δw(St, At), parameterized by θ. Similar to GTD2 (see (7)), we then learn θ through SGD on Es,a (ˆδθ(s, a) δw(s, a))2 . 
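A minimal sketch of these coupled updates, again for linear q_w(s, a) = phi(s, a)^T w and a linear residual model delta_hat_theta(s, a) = phi(s, a)^T theta as in GTD2 (the interface and variable names are ours), is given below.

```python
import numpy as np

def dsf_ran_step(w, theta, m, phi, phi_next, r, gamma, alpha, beta, eta, lam):
    """One step of DSF-RAN with linear q_w = phi^T w and a linear auxiliary
    model delta_hat_theta = phi^T theta standing in for the second sample."""
    delta      = r + gamma * phi_next @ w - phi @ w   # sampled residual delta_t
    delta_hat  = phi @ theta                          # learned proxy for E[delta | s, a]
    grad_delta = gamma * phi_next - phi               # gradient of delta_t w.r.t. w
    m = lam * m + beta * delta_hat * grad_delta       # delta_hat replaces delta'_t of RAN
    m = m - beta * (m @ grad_delta) * grad_delta
    w = w - alpha * m
    theta = theta + eta * (delta - delta_hat) * phi   # SGD on E[(delta_hat - delta)^2]
    return w, theta, m
```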
Pseudo code of DSF-RAN is given in Algorithm 2. We tested RAN and DNS-RAN algorithms on Baird s Star environment (Baird, 1995), that is a Markov chain with six states, each represented by seven features (see Appendix G.3 for details of this experiment). The results are illustrated in Fig. 4. We observe that in this environment, RAN and DNSRAN converge about 200 times faster than RG and GTD2 algorithms, respectively. It is well-known that off-policy TD(0) is unstable in this environment (Baird, 1995). 7. The problem of outliers In this section we argue that the gradient of MSBE involves large outliers and discuss its impact on the RAN algorithm. For simplicity, temporarily suppose that the set of actions is a singleton, A = {a}. In the function approximation case, successive states St and St+1 often have similar representations. As a result, qw(St, a) and γ qw(St+1, a) are often similar, rendering δt = γ qw(St+1, a) qw(St, a) to be small (Zhang et al., 2020). This would not have been Toward Efficient Gradient-Based Value Estimation problematic if δt was small for all t, in which case we could compensate by increasing the step-size. However, δt can occasionally be large, for example when St+1 is a terminal state in which case δt = γ qw(St, a), or when St+1 is far from St (e.g., in large jump transitions). Although these outliers occur with low probability, they carry important information. For example, the pre-terminal transitions are important because they pin down the estimated values to the terminal values. In environments with larger action sets, if the policy has small entropy, At+1 and At would have similar representations with high probability, causing δt to be small. We now discuss how these outliers affect RAN. The updates of m in Algorithm 1 involve a momentum (of MSBE gradient) term λm + δ t δt and a correction term β( δT t m) δt that aims to slowly modify m towards the approximate Gauss-Newton direction. However, when δt is an outlier, β( δT t m) δt can grow very large, cause an overshoot, and completely change the direction of m. In particular, if β δt 2 > 1, then magnitude of the correction term would be larger than the projection of m on δt, i.e. β( δT t m) δt, δt > δT t m , (13) which results in an overshoot along δt. Such overshoots hinder m from tracking the approximate Gauss-Netwon direction. To reduce the adverse effect of outliers, one can reduce stepsize β, at the cost of slowed down learning. Another popular solution is gradient clipping (Zhang et al., 2019). However, as discussed in the first paragraph of this section, the outliers in our problem carry important information, which can be lost via gradient clipping. 8. Outlier-splitting We now propose outlier-splitting as a general metatechnique for stochastic optimization, appropriate for the case that data contains rare sample functions with abnormally large gradients, and these sample functions carry important information that would be lost in gradient clipping. We first explain the key idea by an example. Consider minimizing f1 + + fn for smooth functions f1, . . . , fn. Suppose that f1 is an outlier in the sense that the norm of its gradient is locally k times larger than the gradient norms of other functions, for some integer k > 1. The idea is that instead of applying SGD on f1 + + fn, we break down f1 into k copies of f1/k and apply SGD on f1/k + + f1/k + f2 + + fn in a random order. The latter updates are outlier-free while being equivalent to the former updates in expectation. 
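The outlier phenomenon is easy to reproduce with synthetic features (an illustrative sketch, not an experiment from the paper): when successive states have similar representations, the gradient of delta_t is small, but at a transition into a terminal state it collapses to the full negative feature vector of the current state and becomes much larger.

```python
import numpy as np

rng = np.random.default_rng(0)
phi_s      = rng.normal(size=64)
phi_s_next = phi_s + 0.05 * rng.normal(size=64)   # successive states: similar features
gamma = 0.99

grad_typical  = gamma * phi_s_next - phi_s        # typical transition: near-cancellation
grad_terminal = -phi_s                            # terminal next state, with q(terminal) = 0
print(np.linalg.norm(grad_typical))               # small (a few tenths here)
print(np.linalg.norm(grad_terminal))              # roughly ||phi_s||, over an order of magnitude larger
```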
We now proceed to a formal description. Consider SGD on an objective function F = E[f]. For any Algorithm 3 Outlier-splitting for online SGD, applied to loss function F = E[f] Parameters: step-size β, outlier threshold ρ, trace parameter λξ, outlier sampling probability σ. Initialize: ˆξ = 0, w. for t = 1, 2, . . . do ˆξ λξ ˆξ + (1 λξ)ξ(ft, w) ξ = ˆξ/(1 λt ξ) bias-corrected trace estimate k = ξ(ft, w)/(ρ ξ) + 1 w w (β/k) ft(w) if k > 1 then Store (f, k, k 1) in the outlier buffer end if With probability min(1, σ length of outlier bufffer): Sample (f, k , j) uniformly form outlier buffer k = max k , ξ(f, w)/(ρ ξ) + 1 w w (β/k ) f(w) if j > 1 then Replace (f, k , j) with (f, k , j 1) in the buffer else if j = 1 then Remove (f, k , j) from the outlier buffer end if end for sample sample function f and any point w, we consider a non-negative measure ξ(f, w); e.g., ξ(f, w) = f(w) or f(w) 2. Let ξ be a trace of ξ, updated by ξ λξ ξ + (1 λξ)ξ(ft, wt), where λξ (0, 1) is a constant close to 1. We say that ft is an outlier if ξ(ft, wt) ρ ξt, for some outlier threshold ρ > 1. The pseudo code of the outlier-splitting method for online SGD is given in Algorithm 3. At time t of this algorithm, we let k = ξ(ft, wt) If ft is an outlier (equivalently k > 1), instead of ft we pretend to have k copies of ft/k. We use one of these copies to do a gradient update at time t, and store the remaining k 1 copies in a buffer to use them for future updates. These copies are stored in one cell of an outlier-buffer as a tuple (ft, k, k 1), where k 1 indicates the number of remaining copies to be used for future updates. In each iteration we perform one update based on the online sample, and perform at most one update based on a sample from the buffer. More concretely, in each iteration t, after applying a gradient update w w (β/k) ft(w), we take a sample (f, kf, j) from the outlier buffer with some positive probability, and perform a gradient update w w (β/kf) f(w). We now show that the outlier buffer is stable. The expected Toward Efficient Gradient-Based Value Estimation number of copies, k 1, added to the buffer at time t satisfies Et [ξ(ft, wt)] ρE[ ξt] = 1 where the inequality is due to (14) and the approximate equality is because ξt is a long-time average. On the other hand, as the length of the outlier buffer increases, the probability of performing a sample update from the buffer goes to 1. In this case, arrival rate to the buffer, 1/ρ, is smaller than its departure rate, 1; implying stability of the outlier buffer. 9. Our main algorithm: RANS Our final algorithm, RAN with outlier Splitting (RANS), is a combination of RAN, outlier-splitting, and adaptive stepsize ideas. In order to improve updates of m, we employ an adaptive vector step-size β that evolves according to a mechanism quite similar to RMSProp (Kochenderfer & Wheeler, 2019), as we discuss next. Consider a trace vector νt of ( δt)2 updated according to νt λ νt 1 + (1 λ )( δt)2, where ( δt)2 is the entrywise square vector of δt, and λ [0, 1) is a constant. We consider an outlier-measure ξt = 1 νt δt, δt (15) where 1/ νt is entrywise square root, and and , denote entrywise product and inner product of two vectors, respectively. We then compute the trace ξ and k as in Section 8: ξt λ ξt + (1 λ )ξt and k = ξt/(ρ ξt) + 1. We finally fix an η (0, 1) and choose the step-size 1 νt . (16) The pseudo code of RANS is given in Algorithm 4 in Appendix F. 
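A compact Python rendering of Algorithm 3 is given below. This is a sketch under our own reading of the pseudo code: xi(f, w) is taken to be the sample-gradient norm, the split factor is the floor of xi/(rho * xi_bar) plus one, and sample_stream and grad are assumed user-supplied callbacks. Note that a stored sample is re-evaluated at the current w before being replayed, so buffered updates use fresh gradients rather than stale ones.

```python
import numpy as np
import random

def outlier_split_sgd(sample_stream, grad, w, beta,
                      rho=1.2, lam_xi=0.999, sigma=0.02, num_steps=10_000):
    """Online SGD with outlier-splitting (compact rendering of Algorithm 3).

    sample_stream() yields one sample function f per call; grad(f, w) returns
    the sample gradient of f at w."""
    xi_trace, buffer = 0.0, []
    for t in range(1, num_steps + 1):
        f = sample_stream()
        g = grad(f, w)
        xi = np.linalg.norm(g)                           # outlier measure xi(f, w)
        xi_trace = lam_xi * xi_trace + (1 - lam_xi) * xi
        xi_bar = xi_trace / (1 - lam_xi ** t)            # bias-corrected trace
        k = int(xi // (rho * xi_bar)) + 1                # split factor
        w = w - (beta / k) * g                           # apply one of the k copies now
        if k > 1:
            buffer.append([f, k, k - 1])                 # store the remaining k - 1 copies
        if buffer and random.random() < min(1.0, sigma * len(buffer)):
            i = random.randrange(len(buffer))            # at most one buffered update per step
            f_b, k_b, j = buffer[i]
            g_b = grad(f_b, w)                           # fresh gradient at current w
            k_b = max(k_b, int(np.linalg.norm(g_b) // (rho * xi_bar)) + 1)
            w = w - (beta / k_b) * g_b
            if j > 1:
                buffer[i] = [f_b, k_b, j - 1]
            else:
                buffer.pop(i)
    return w
```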
The algorithm involves applying the outliersplitting method on the updates of m in RAN, and using the adaptive step-size in (16). We now shows that the outlier-splitting mechanism in RANS effectively prevents overshoots of type (13) in the updates of m. Given the above choice of βt, we have 1 k βt δt, δt = 1 k η ρ ξt 1 νt δt, δt η ρ ξt 1 νt δt, δt = η, where the first equality is from the definition of βt in (16), the inequality is due to the definition of k, and the last equality follows from the definition of ξt in (15). This implies that k β( δT t m) δt, δt η δT t m . (17) Therefore overshoots of type (13) do not occur in RANS. The RANS algorithm has hyperparameters α, η, ρ, λ, λ , and σ (the outlier sampling probability). Setting η = 0.2 and ρ = 1.2 are always good choices. Furthermore, our experiments show that the parameters λ, λ , and σ can be set to the default values λ = 0.999, λ = 0.9999, and σ = 0.02 without much performance degradation. In this case, the RANS algorithm would have essentially one hyperparameter α, just like RG and TD algorithms with Adam optimizer (Kingma & Ba, 2014). The per-iteration computational complexity of RANS is at most twice the RG algorithm with Adam optimizer. 10. Experiments Similar to TD, the RANS algorithm can be utilized within any control loop. More specifically, one can used RAN instead of TD to estimate Q-values and plug these estimates into the policy update of interest, including actor-critic algorithms like A3C (Minh et al., 2016), deterministic policy gradient algorithms (Silver et al., 2014) like DDPG (Lillicrap et al., 2015), and greedy/soft-max policy updates like DQN (Minh et al., 2015). In this section, we assess the performance of softmax policy updates and deterministic policy gradient methods when the Q-functions in these algorithms are calculated using RANS. For softmax policy updates, we conducted experiments on Acrobot and Cartpole environments. We used a single-layer neural network with 64 hidden units with Re LU activation to learn the action-values via three algorithms, TD(0), RG, and RANS. The actions are chosen according to a softmax distribution on the action-values. Fig. 5 illustrates expected returns versus number of step. We trained TD(0) and RG using Adam optimizer. Refer to Appendix G.4 for complementary experimental results and details of the experiments. The results show that the RANS algorithm outperforms RG and TD on these environments. For the deterministic policy gradient actor updates, we conducted experiments on Mu Jo Co environments: Hopper and Half Cheetah. We employed a two-layer feedforward neural network with 400 and 300 hidden Re LU units respectively for both the actor and critic, and a final tanh unit following the output of the actor. The actor was trained using deterministic policy gradient policy updates (Silver et al., 2014; Lillicrap et al., 2015), while the critics were trained by three algorithms: RANS, Adam TD with delayed target network update, and Adam RG. We considered an on-policy setting where samples are drawn from the current policy Toward Efficient Gradient-Based Value Estimation Figure 5. Performance of RANS, TD(0), and RG on classic control tasks. A single-layer neural network with 64 hidden Re LU units was used to learn the Q-values, and a softmax distribution on the Q-values was used as the policy. in an online manner and are directly fed into the actor and critic training algorithms. We did not use replay buffers or batch updates. Fig. 
6 depicts the learning curves of these algorithms. Refer to Appendix G.4 for additional experimental results and details of the experiments. The results indicate that the RANS algorithm surpasses TD and RG in these environments. It is important to note that the results cannot be fairly compared to the state-of-the-art because our setting is on-policy and does not take advantage of replay buffers and batch updates. We leave the integration of these techniques into the RANS ideas for future work. 11. Related works Poor conditioning of MSBE was previously observed in (Wang & Ueda, 2021) through study of an example Markov chains. More specifically, Wang and Ueda (2021) analyzed a particular n-state Markov chain and showed that the condition-number of MSBE in this Markov chain scales with n2. They also showed that the condition-number scales with 1/(1 γ)2 in another example Markov chain. In comparison, our lower bound in Theorem 3.1 (a) holds for every Markov chain, and the lower bound in Theorem 3.1 (b) scales with n2/(1 γ)2. A prevalent explanation for slowness of gradient-based value estimation methods is the so called information flow in the wrong direction (Baird, 1995). More concretely, each update in RG can be decomposed into a forward bootstrapping component (or a TD update) and a backward bootstrapping component (the so called wrong direction of information flow). A common approach for accelerating the gradient updates is by suppressing the second component (e.g., via some sort of combination with TD updates), especially in early stages of training. The acceleration gained in the residual algorithm (Baird, 1995), TDC (Sutton et al., 2009), TDRC, and QRC (Ghiassian et al., 2020) can be understood from this perspective. In contrast, acceleration gained in our algorithms does not rely on combinations with TD updates. Use of Gauss-Newton method for value estimation was explicitly proposed in (Gottwald et al., 2021; Gottwald & Shen, 2022), recently. Value estimation algorithms based on Kalman filter (Choi & Van Roy, 2006; Geist & Pietquin, 2010) are also known to have an equivalent form to online Gauss-Newton updates (Geist & Pietquin, 2010). Sun and Bagnell (2015) studied MSBE minimization with Newton method. However, all of the above methods involve approximating a variant of the Hessian or Gauss-Newton matrices and solving a system of linear equations in each iteration, which is computationally costly. Yao et al. (2009) proposed a low complexity two time-scale method, called LMS-2, for stochastic linear regression. Our RAN algorithm can be perceived as a generalization of the LMS-2 algorithm to MSBE minimization under non-linear function approximation. Several other algorithms including least squares TD (Sutton & Barto, 2018) and (Devraj & Meyn, 2017) also leverage matrix gain for improved convergence, under linear function approximation. In the same spirit, natural gradient methods (Amari, 1998; Kakade, 2001; Martens, 2020) also enjoy robustness to parameterization. Dabney and Thomas (2014), Knight and Lerner (2018), and Achiam et al. (2019) proposed natural gradients algorithms for value estimation. Dabney and Thomas (2014) also proposed a low complexity two time scale implementation that has high-level algorithmic similarities to the RAN algorithm. In Section 5 and Appendix C we showed that the RAN algorithm can be perceived as a proximal method. A proximal method for value estimation, called GTD2-MP, was proposed in (Liu et al., 2020; Mahadevan et al., 2014). 
However, these works consider a Bregman divergence that does not depend on the value estimates. In fact, Toward Efficient Gradient-Based Value Estimation Figure 6. Performance of RANS, TD with target networks, and RG algorithms on simple Mu Jo Co environments. The Q-values, trained via these algorithms, were integrated into a standard deterministic policy gradient control loop for policy training. as the step-size goes to zero, update direction of GTD2MP tends to the expected GTD2 update direction. Schulman et al. (2015), Sun and Bagnell (2015), and Zhu and Murray (2022) considered proximal methods with value dependent penalties of the form E[(vwt+1(St) vwt(St))2]. Although the resulting expected update direction Es[ vw(s) vw(s)T ] 1 MSBE(w) is robust to parameterization, it is not robust against poor conditioning. For example, in the tabular case, this expected update direction simplifies to MSBE(w), which is the same as RG. In contrast, in the proximal view of the RAN algorithm, we used penalties of type E[(δwt+1(St, At) δwt(St, At))2], which provides robustness to the conditioning of MSBE, as discussed in Section 5 and Appendix C. The control algorithm SBEED (Dai et al., 2018) involves mirror descent RG along with some other ideas including entropy regularization, akin to SAC (Haarnoja et al., 2018), and blending RG with naive residual gradient (Sutton and Barto, 2018, Chapter 11). The entropy regularization technique, in particular, is known to produce significant performance gains. It is noteworthy that the entropy regularization technique can be used in conjunction with RANS and DFSRAN algorithms, as well. Karampatziakis and Langford (2010) and Tian and Sutton (2019) proposed a method, called sliding-step, to reduce the adverse effect of outliers in certain problems. This method is pretty similar to the outlier-splitting algorithm, with the only difference that in the sliding-step method, all k updates w w ft(w)/k are applied sequentially and before time t + 1, while the outlier-splitting method spreads these updates over a long time. Another simple approach is using momentum to reduce the variance of updates. However, smoothing large outliers requires large momentum parameters, in which case the delayed effect of gradients propagate far into future and become out-dated, pushing w along outdated outlier gradient even if the outlier gradient at current w is reversed. 12. Future works and discussion In this paper, we highlighted causes that underlie slowness of gradient-based value estimation methods, and proposed low complexity techniques to resolve them. Our focus was on the on-policy case, however the proposed algorithms are easily applicable for off-policy learning when combined with standard importance sampling techniques. We provided evidence for the potential of the proposed algorithms via experiments on a few classic environments. Other than applying standard techniques (such as batch updates, replay buffers, different forms of step-size adaptation, etc.) and testing the algorithms on more complex environments, there are several directions for future research. This includes adopting the unbiased gradient estimate of (9) in (10) instead of the biased estimate in (11), and comparing these methods with other means of solving (9), including conjugate gradient and low rank approximation of the Gauss-Newton matrix. Another important direction is further exploration of the proposed double-sampling-free methods in stochastic environments with neural network function approximation. 
On the theory side, it would be interesting to study condition-number of MSBE, and in general the shape of MSBE landscape, under linear and non-linear function approximation under common feature representations in asymptotically large environments. Moreover, non-asymptotic behavior and finite sample complexity analysis of the proposed methods would be very helpful for understanding the effectiveness of these algorithms in reducing sensitivity to condition-number. 13. Acknowledgments The authors want to thank Yi wan, Sina Ghiassian, John N. Tsitsiklis, and Saber Salehkaleybar for their valuable feedback in various stages of development of this work. Toward Efficient Gradient-Based Value Estimation Achiam, J., Knight, E., and Abbeel, P. Towards characterizing divergence in deep Q-learning. ar Xiv preprint ar Xiv:1903.08894, 2019. Amari, S. I. Natural gradient works efficiently in learning. Neural Computation, 10(2):251 276, 1998. Baird, L. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the International Conference on Machine Learning, pp. 30 37. 1995. Barnard, E. Temporal-difference methods and Markov models. IEEE Transactions on Systems, Man, and Cybernetics, 23(2):357 365, 1993. Bertsekas, D. and Tsitsiklis, J. N. Neuro-dynamic programming. Athena Scientific, 1996. Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. Natural actor-critic algorithms. Automatica, 45, 2009. Boyan, J. A. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2):233 246, 2002. Brandfonbrener, D. and Bruna, J. Geometric insights into the convergence of nonlinear TD learning. ar Xiv preprint ar Xiv:1905.12185, 2019. Choi, D. and Van Roy, B. A generalized Kalman filter for fixed point approximation and efficient temporaldifference learning. Discrete Event Dynamic Systems, 16(2):207 239, 2006. Dabney, W. and Thomas, P. Natural temporal difference learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014. Dai, B., He, N., Pan, Y., Boots, B., and Song, L. Learning from conditional distributions via dual embeddings. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 1458 1467, 2017. Deb, R. and Bhatnagar, S. Gradient temporal difference with momentum: Stability and convergence. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 6488 6496, 2022. Devraj, A. M. and Meyn, S. Zap Q-learning. In Advances in Neural Information Processing Systems, volume 30, 2017. Du, S., Chen, J., Li, L., Xiao, L., and Zhou, D. Stochastic variance reduction methods for policy evaluation. In Proceedings of the International Conference on Machine Learning, pp. 1049-1058, 2017. Geist, M. and Pietquin, O. Kalman temporal differences. Journal of artificial intelligence research, 39:483 532, 2010. Ghiassian, S. and Sutton, R. S. An empirical comparison of off-policy prediction learning algorithms in the four rooms environment. ar Xiv preprint ar Xiv:2109.05110, 2021. Ghiassian, S., Patterson, A., Garg, S., Gupta, D., White, A., and White, M. Gradient temporal-difference learning with regularized corrections. In Proceedings of the International Conference on Machine Learning, pp. 3524 3534, 2020. Gordon, G. J. Approximate solutions to Markov decision processes. Ph.D. thesis, Carnegie Mellon University, 1999. Gottwald, M. and Shen, H. On the compatibility of multistep lookahead and hessian approximation for neural residual gradient. 
In Proceedings of the Multi-disciplinary Conference on Reinforcement Learning and Decision Making, 2022. Gottwald, M., Gronauer, S., Shen, H., and Diepold, K. Analysis and optimization of Bellman residual errors with neural function approximation. ar Xiv preprint ar Xiv:2106.08774, 2021. Hackman, L. M. Faster Gradient-TD Algorithms. M.Sc. thesis, University of Alberta, Edmonton, 2012. Juditsky, A. and Nemirovski, A. Optimization for Machine Learning. MIT Press, 2011. Kakade, S. M. A natural policy gradient. In Advances in Neural Information Processing Systems, volume 14, 2001. Karampatziakis, N. and Langford, J. Online importance weight aware updates. ar Xiv preprint ar Xiv:1011.1576, 2010. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Knight, E. and Lerner, O. Natural gradient deep Q-learning. ar Xiv preprint ar Xiv:1803.07482, 2018. Kochenderfer, M. J. and Wheeler, T. A. Algorithms for optimization. MIT Press, 2019. Konda, V. and Tsitsiklis, J. Actor-critic algorithms. In Advances in Neural Information Processing Systems, volume 12, 1999. Kushner, H. and Yin, G. G. Stochastic approximation and recursive algorithms and applications. Springer Science & Business Media, 2003. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, Toward Efficient Gradient-Based Value Estimation T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. ar Xiv preprint ar Xiv:1509.02971, 2015. Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., and Petrik, M. Finite-sample analysis of proximal gradient td algorithms. ar Xiv preprint ar Xiv:2006.14364, 2020. Macua, S. V., Chen, J., Zazo, S., and Sayed, A. H. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control, 60(5): 1260 1274, 2014. Maei, H. R. Gradient temporal-difference learning algorithms. Ph.D. thesis, University of Alberta, Edmonton, 2011. Maei, H. R., Szepesv ari, C., Bhatnagar, S., and Sutton, R. S. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning, pp. 719 726, 2010. Mahadevan, S., Liu, B., Thomas, P., Dabney, W., Giguere, S., Jacek, N., Gemp, I., and Liu, J. Proximal reinforcement learning: A new theory of sequential decision making in primal-dual spaces. ar Xiv preprint ar Xiv:1405.6757, 2014. Marquardt, D. W. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431 441, 1963. Martens, J. New insights and perspectives on the natural gradient method. Journal of Machine Learning Research, 21(1):5776 5851, 2020. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540): 529 533, 2015. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, pp. 1928 1937, 2016. Nesterov, Y. and Polyak, B. T. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177 205, 2006. Nocedal, J. and Wright, S. J. Numerical optimization. Springer, 1999. Polyak, B. T. Some methods of speeding up the convergence of iteration methods. 
USSR Computational Mathematics and Mathematical Physics, 4(5):1 17, 1964. Saleh, E. and Jiang, N. Deterministic bellman residual minimization. In Proceedings of Optimization Foundations for Reinforcement Learning Workshop at Neur IPS, 2019. Schoknecht, R. and Merke, A. TD(0) converges provably faster than the residual gradient algorithm. In Proceedings of International Conference on Machine Learning, pp. 680 687, 2003. Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. ar Xiv preprint ar Xiv:1506.02438, 2015. Sherman, J. and Morrison, W. J. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Annals of Mathematical Statistics, 21(1): 124 127, 1950. Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. International Conference on Machine Learning, pp. 387395, 2014. Strang, G. Linear algebra and its applications. Thomson, Brooks/Cole, Belmont, CA, 2006. Sun, W. and Bagnell, J. A. Online bellman residual algorithms with predictive error guarantees. 2015. Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9 44, 1988. Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 2018. Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesv ari, C., and Wiewiora, E. Fast gradientdescent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning, pp. 993 1000, 2009. Tian, T. and Sutton, R. S. Extending sliding-step importance weighting from supervised learning to reinforcement learning. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 67 82. Springer, 2019. Tsitsiklis, J. and Van Roy, B. Analysis of temporaldifference learning with function approximation. In Advances in Neural Information Processing Systems, volume 9, 1996. Van Seijen, H., Van Hasselt, H., Whiteson, S., and Wiering, M. A theoretical and empirical analysis of expected Sarsa. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 177 184, 2009. Toward Efficient Gradient-Based Value Estimation Wang, Z. T. and Ueda, M. A convergent and efficient deep Q network algorithm. ar Xiv preprint ar Xiv:2106.15419, 2021. Watkins, C. J. and Dayan, P. Q-learning. Machine learning, 8(3):279 292, 1992. Yao, H., Bhatnagar, S., and Szepesv ari, C. LMS-2: Towards an algorithm that is as cheap as LMS and almost as efficient as RLS. In Proceedings of the 48h Conference on Decision and Control (CDC), pp. 1181 1188. IEEE, 2009. Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. ar Xiv preprint ar Xiv:1905.11881, 2019. Zhang, S., Boehmer, W., and Whiteson, S. Deep residual reinforcement learning. In Proceedings of the 19th International Conference on Autonomous Agents and Multiagent Systems, pp. 1611 1619, 2020. Zhu, R. J. and Murray, J. M. Gradient descent temporal difference-difference learning. ar Xiv preprint ar Xiv:2209.04624, 2022. Toward Efficient Gradient-Based Value Estimation A. Proof of condition-number bounds In this appendix, we first present the proof of Theorem 3.1 that involves bounds on the condition-number of MSBEV ( ). We then establish similar bounds for MSBE( ) defined in (1). A.1. 
Proof of Theorem 3.1 Note that an MDP with a fixed policy boils down to a Markov chain with termination. Consider a Markov chain with termination that has n non-terminal states, and let P be its associated n n transition matrix. Note that if transitions from a state can terminate with positive probability, sum over the corresponding row of P will be less than one. Let A def= (I γP)T (I γP). (18) In the tabular setting and under uniform state distribution, we have MSBEV (w) = w T Aw/n. Therefore, the conditionnumber C of MSBEV ( ) is equal to the condition-number of A. Let λmax and λmin denote the largest and smallest eigenvalues of A, respectively. It follows that λmin . (19) Proof of Part (a). We first propose an upper bound for λmin and then a lower bound for λmax. For states i = 1, . . . , n, let li be the expected number of steps until termination when we start from state i and follow the Markov chain s transitions. Then, for any state i = 1, . . . , n, we have j=1 Pijlj. (20) Let l def= [l1, . . . , ln]T be the vector representation of l1, . . . , ln. Then, (20) can be written in the vector form as l = 1 + Pl. (21) where 1 is the vector of all ones. It then follows that (I γP)l = l γPl = l γ(l 1) = (1 γ)l + γ1, (22) where the second equality follows from (21). Let l def= (l1 + , ln)/n be the mean of l1, . . . , ln. Cauchy-Schwarz inequality implies that For the smallest eigenvalue of A, we have λmin l T Al = (I γP)l 2 = (1 γ)l + γ1 2 = (I γ)2 l 2 + 2γ(1 γ)1T l + γ2n = (I γ)2 + 2γ(1 γ)nl + γ2n (I γ)2 + 2γ(1 γ)l + γ2 Toward Efficient Gradient-Based Value Estimation where the first inequality follows from the definition of the smallest eigenvalue of a symmetric matrix, the first equality is due to the definition of A in (18), the second equality results from (22), the fourth equality is from the definition of l, and the second inequality follows from (23). Let trace(A) be the trace of A defined as the sum of diagonal entries of A. It is well-known that the trace of any matrix is equal to the sum of eigenvalues of that matrix (Strang, 2006). Therefore, for the largest eigenvalue of A, we have j=1 (Iji γPji)2 (1 γPii)2 + X j =i γ2P 2 ji i=1 (1 γPii)2 i=1 (1 γPii) where the first inequality is because trace(A) equals the sum of eigenvalues of A, the first equality is from the definition of trace, the second equality is due to the definition of A in (18), the third inequality follows from the Cauchy-Schwarz inequality, and the last equality is from the definition h = Pn i=1 Pii/n in the theorem statement. Plugging (24) and (25) into (19), we obtain λmin (1 γh)2 λmin (1 γh)2 (1 γ + γ/l)2 (1 γh)2 2(1 γ)2 + 2γ2/l2 (1 γh)2 4 min 1 (1 γ)2 , l2 where the first and second inequalities are due to (25) and (24), respectively. This implies (8) and completes the proof of Part (a) of Theorem 3.1. Proof of Part (b). Consider an n-state Markov chain with transition matrix 0 0 1 ... ... ... ... 0 0 1 In what follows, we derive bounds on the largest and smallest eigenvalues of A defined in (18). Let ϵ = γ/(n 1) and v = [ ϵ, . . . , ϵ, 1]T . Then, Pv = 1, and as a result, v T Av = (I γP)v 2 = v γ1 2 = (n 1)(γ + ϵ)2 + (1 γ)2 (n 1)(γ + ϵ)2 = n2γ2 where the first equality is from the definition of A in (18), and the last equality is due to the definition of ϵ. On the other hand, v 2 = (n 1)ϵ2 + 1 = γ2 n 1 + 1 = n 1 + γ2 n 1 n n 1. (28) Toward Efficient Gradient-Based Value Estimation It follows that λmax v T Av v n2γ2/(n 1) n/(n 1) = nγ2, (29) where the second inequality is due to (27) and (28). 
In order to bound the smallest eigenvalue of A, let x = [γ, . . . , γ, 1]T . Therefore Px = 1 and (I γP)x = x γPx = x γ1 = [0, . . . , 0, 1 γ]T . (30) It follows that x 2 = (I γP)x 2 x 2 = (1 γ)2 x 2 = (1 γ)2 (n 1)γ2 + 1 (1 γ)2 where the first equality is from the definition of A in (18), the second equality is due to (30), and the third equality follows from the definition of x. Plugging (29) and (31) into (19), we obtain (1 γ)2/(nγ2) = γ4n2 This completes the proof of Part (b) of Theorem 3.1. A.2. Condition-number of the MSBE defined in terms of action-values Theorem 3.1 involves bounds on the condition-number of MSBEV ( ) defined in (2). In this appendix, we establish similar bounds for MSBE( ) defined in (1). Given an MDP and a policy π, we consider an induced augmented Markov chain that is a Markov chain whose states are the state-action pairs of the MDP and its transition probabilities are as follows. For any s, s S and a, a A, the probability of transition from (s, a) to (s , a ) in the induced augmented Markov chain is p π(s , a |s, a) def= Z r pπ(s , a , r|s, a) dr (32) where pπ is defined in Section 2. We consider a tabular setting, and denote the condition-number of MSBED( ) under uniform distribution D on state-action pairs by C . We let h def= 1 nm a A p π(s, a|s, a) (33) be the self-loop probability in the induced augmented Markov chain, where n is the number of states and m = |A| is the number of actions. Also let l be the expected number of steps until termination when starting from a uniformly random state-action pair. The following proposition is the counterpart of Theorem 3.1 for C . Proposition A.1. In the tabular case, the following statements hold for any discount factor γ [0, 1]: a) For any MDP and any policy π, 4 min 1 (1 γ)2 , l 2 . (34) b) For any n, m > 0, there exists an MDP with n states and m actions, and a policy π for which, C γ4(nm)2/(1 γ)2. Proof. We can perceive the dynamics under any given MDP and policy as an induced augmented Markov chain defined in (32). Applying the proof of Theorem 3.1 on this induced augmented Markov chain implies Proposition A.1. B. Experiment on condition number under linear function approximation We ran an experiment to investigate the growth of condition number, C, in an extended Boyan chain under linear function approximation. Fig. 7 shows the dependence of C on the size of extended Boyan chain, under standard Boyan feature vectors (see Appendix G.1 for details). The number of standard Boyan chain features d in this experiments, satisfies n = 4d 3, where n is the number of states. We observe that the condition number can grow very large under linear function approximation even when d/n < 1 (in this case d/n 1/4). Toward Efficient Gradient-Based Value Estimation Figure 7. Condition-number of MSBE versus number of states in an extended Boyan chain under linear function approximation with Boyan chain s standard features (n = 4d 3). C. RAN as a proximal algorithm In this section, we provide further intuition for the RAN algorithm, by showing that Algorithm 1 can be equivalently derived as a proximal algorithm with momentum for minimizing MSBE. Given an objective function f, a proximal algorithm in its general form aims to find an approximate solution for a proximal operator of the following form in each iteration wt+1 argmin w f(w) + Dt(w, wref) , (35) where wref is a reference point, usually equal to wt or a trace of past w s, and Dt is a penalty function (also called divergence) that encourages wt+1 to stay close to wref. 
In the special case that Dt is a fixed Bregman divergence, (35) boils down to the mirror-descent algorithm (Juditsky & Nemirovski, 2011). However, in general, Dt can be a time varying function and can depend on the local shape of the objective f. Back to the value estimation problem, for any consecutive state-action pairs (s, a) and (s , a ), and any pair w and w of weights, we let δw(s, a, s , a ) def= γqw(s , a ) qw(s, a), (36) and δw,w (s, a, s , a ) def= δw(s, a, s , a ) δw (s, a, s , a ). (37) Consider a divergence measure of the form Dt(w, wt) = c Es,a h Es ,a |s,a δw,wt(s, a, s , a )2 i , for some constant c > 0. Then, the proximal operator in (35) turns into argmin w MSBE(w) + c Es,a h Es ,a |s,a δw,wt(s, a, s , a )2 i . (38) To obtain a low complexity incremental version of the above proximal updates, we consider doing sample updates along the gradient of (38) at time t. For this purpose3, we work with wref = wt 1 instead of wref = wt, i.e. we consider the proximal objective MSBE(w) + c Es,a h Es ,a |s,a δw,wt 1(s, a, s , a )2 and its unbiased sample gradient gt def= δ t c δwt,wt 1(St, At, St+1, At+1) δt δ t c (wt wt 1)T δt δt, (39) 3Here we avoid using wref = wt because in this case, for any (s, a, s , a ), δwt,wt(s, a, s , a ) = 0. This implies that w δw,wt(s, a, s , a )2 w=wt = 0. As such, the penalty would not affect gradient of the proximal objective at w = wt. Toward Efficient Gradient-Based Value Estimation where the approximate equality is because (wt wt 1)T δt is a first order approximation of δwt,wt 1. Let ˆgt = δ t c(wt wt 1)T δt δt and consider the approximate gradient update w w βˆgt. We further employ a momentum to reduce the variance of these updates. The resulting momentum based algorithm is of the form mt = λmt 1 + βˆgt, wt+1 = wt αmt. (40) Let η = 1/α. Since wt wt 1 = αmt 1, ˆgt simplifies to ˆgt = δ t m T t 1 δt δt. Plugging this into (40), we obtain updates that are identical to (11). This establishes our earlier claim that the RAN algorithm can be equivalently formulated as a proximal algorithm with momentum for minimizing MSBE. As another intuitive perspective for Algorithm 1, we can perceive m as a momentum of MSBE gradients, to which we add a correction term equal to a sample gradient of the penalty Es,a Es ,a |s,a δw,wt(s, a, s , a )2 , in each iteration. This momentum spreads the effect of each MSBE gradient over an O(1/(1 λ))-long horizon, which provides enough time for the correction updates to trim the direction of those MSBE gradients. The correction terms are small along directions that δ is small, allowing MSBE gradients to accumulate along those directions. Conversely, the correction terms are large along directions that δ is large, preventing m from growing large along those directions. This leads to accelerated convergence along the directions where δ is small, while preventing instability along directions where δ is large. D. Convergence of the RAN algorithm In this appendix, we study conditions for convergence of Algorithm (11) under different choices of λ. Convergence proofs of two-time-scale approaches are well-studied, and generally involve tedious yet pretty standard statistical arguments. As such, similar to several other papers (Konda & Tsitsiklis, 1999; Dabney & Thomas, 2014), here we keep our arguments in a high level, and only discuss the steps of the proof without going into the proof details. Throughout this Appendix, we consider irreducible and aperiodic Markov chains with finite number of states. 
D. Convergence of the RAN algorithm

In this appendix, we study conditions for convergence of Algorithm (11) under different choices of $\lambda$. Convergence proofs for two-time-scale approaches are well studied and generally involve tedious yet fairly standard statistical arguments. As such, similar to several other papers (Konda & Tsitsiklis, 1999; Dabney & Thomas, 2014), we keep our arguments at a high level and only outline the steps of the proofs without going into the details. Throughout this appendix, we consider irreducible and aperiodic Markov chains with a finite number of states. We assume that the function approximation $q_w(s,a)$ is a differentiable function of $w$ with bounded and Lipschitz-continuous derivatives. By boundedness of derivatives we mean that there exists a $C > 0$ such that for all consecutive state-action pairs $(s, a, s', a')$ and any $w$,
$$\big\| \nabla_w \delta_w(s, a, s', a') \big\| \;<\; C, \qquad (41)$$
where $\delta_w(s, a, s', a')$ is defined in (36). Given a distribution $D$ over state-action pairs, for any $w \in \mathbb{R}^d$ we let
$$\hat G_w \;\stackrel{\text{def}}{=}\; \mathbb{E}_{s,a \sim D}\Big[ \mathbb{E}_{s',a'|s,a}\big[ \nabla\delta_w(s, a, s', a')\, \nabla\delta_w(s, a, s', a')^\top \big] \Big]. \qquad (42)$$
We study convergence in three different regimes of $\lambda$, namely $\lambda = 1$, $\lambda = 1 - c\beta_t$ for some constant $c > 0$, and constant $\lambda < 1$.

Case 1 ($\lambda = 1$). We assume that $\alpha_t$ and $\beta_t$ are decreasing positive sequences satisfying
$$\sum_{t=0}^{\infty} \beta_t = \infty, \qquad \sum_{t=0}^{\infty} \beta_t^2 < \infty, \qquad \frac{\alpha_t}{\beta_t} \to 0. \qquad (43)$$
We further assume that $\hat G_w$ defined in (42) is uniformly positive definite, in the sense that there is an $\epsilon > 0$ such that for any $w$ and any $x \in \mathbb{R}^d$ we have $x^\top \hat G_w x \ge \epsilon \|x\|^2$. In this case, for fixed $w$, the updates in (11) converge to $\hat G_w^{-1} \nabla \mathrm{MSBE}_D(w)$. Since $\alpha_t / \beta_t \to 0$, $w$ is updated much more slowly than $m$. As such, $m$ is updated with much larger step-sizes, perceiving $w$ as almost stationary, and therefore $m$ converges to an asymptotically small neighborhood of $\hat G_w^{-1} \nabla \mathrm{MSBE}_D(w)$. Since $\hat G_w$ is uniformly positive definite, $-m$ is then a descent direction for $\mathrm{MSBE}_D(w)$, so the update $w \leftarrow w - \alpha_t m$ decreases $\mathrm{MSBE}_D$. Standard proof techniques for stochastic approximation algorithms can then be used to establish convergence of $w$ to a stationary point of MSBE.

Case 2 ($\lambda = 1 - c\beta_t$, for some constant $c > 0$). In this case, the update rule (11) boils down to
$$
m \;\leftarrow\; \lambda m + \beta_t \big( \delta'_t - m^\top \nabla\delta_t \big) \nabla\delta_t
\;=\; (1 - c\beta_t)\, m + \beta_t \big( \delta'_t - m^\top \nabla\delta_t \big) \nabla\delta_t
\;=\; m - \beta_t \Big( \big( cI + \nabla\delta_t \nabla\delta_t^\top \big) m - \delta'_t\, \nabla\delta_t \Big)
\;=\; m - \frac{\beta_t}{2}\, \nabla_m \Big( m^\top \big( cI + \nabla\delta_t \nabla\delta_t^\top \big) m - 2\, \delta'_t\, \nabla\delta_t^\top m \Big),
$$
where $I$ is the identity matrix. Therefore, in each iteration, $m$ is updated along a sample gradient of the loss function
$$\hat L(m) \;=\; m^\top \big( cI + \hat G_w \big) m \;-\; \nabla_w \mathrm{MSBE}(w)^\top m,$$
where $\hat G_w$ is defined in (42). Thus, assuming (43), $m$ asymptotically converges to the minimizer of $\hat L$, which is a positive multiple of $\big( cI + \hat G_w \big)^{-1} \nabla_w \mathrm{MSBE}(w)$. Since $cI + \hat G_w$ is uniformly positive definite, $-m$ is a descent direction for $\mathrm{MSBE}(w)$. Standard proof techniques for stochastic approximation algorithms can then be used to establish convergence of $w$ to a stationary point of MSBE.

Case 3 (constant $\lambda < 1$). As opposed to Cases 1 and 2, here we do not need two-time-scale step-sizes. More specifically, we assume that $\alpha_t > 0$ is constant and that $\beta_t$ is a decreasing positive sequence satisfying $\sum_{t=0}^{\infty} \beta_t = \infty$ and $\sum_{t=0}^{\infty} \beta_t^2 < \infty$. We also assume that $w$ remains bounded, which can be enforced either by projecting $w$ onto a compact set or by adequate normalization of $\alpha_t$ (see (Konda & Tsitsiklis, 1999) and (Kushner & Yin, 2003)). It follows from (41) that, for any time $t$,
$$|\delta_t| \;=\; \big| \delta_{w_t}(S_t, A_t, S_{t+1}, A_{t+1}) \big| \;\le\; C' + C\, \|w_t\|, \qquad (45)$$
for some constant $C' > 0$. Then, from (11) we obtain
$$
\|m_t\| \;=\; \big\| \lambda m_{t-1} + \beta_t \big( \delta'_t - m_{t-1}^\top \nabla\delta_t \big) \nabla\delta_t \big\|
\;\le\; \lambda \|m_{t-1}\| + \beta_t \big( |\delta'_t| + \|m_{t-1}\|\, \|\nabla\delta_t\| \big) \|\nabla\delta_t\|
\;\le\; \lambda \|m_{t-1}\| + \beta_t \big( |\delta'_t| + C \|m_{t-1}\| \big) C
\;\le\; \lambda \|m_{t-1}\| + \beta_t\, C \big( C' + C \|w_t\| + C \|m_{t-1}\| \big)
\;=\; \big( \lambda + \beta_t C^2 \big) \|m_{t-1}\| + \beta_t \big( C C' + C^2 \|w_t\| \big),
$$
where the second and third inequalities follow from (41) and (45), respectively. When $\beta_t$ is small enough that $\lambda + \beta_t C^2 < 1$, it follows from the boundedness assumption on $w$ that $\|m_t\| = O\big( \beta_t / (1-\lambda) \big)$. Therefore, (11) can be expressed as
$$
m_t \;=\; \lambda m_{t-1} + \beta_t\, \delta'_t\, \nabla\delta_t - \beta_t \big( m_{t-1}^\top \nabla\delta_t \big) \nabla\delta_t
\;=\; \lambda m_{t-1} + \beta_t\, \delta'_t\, \nabla\delta_t + O\big( \beta_t^2 / (1-\lambda) \big)
\;=\; \sum_{\tau \ge 0} \lambda^\tau \beta_{t-\tau}\, \delta'_{t-\tau}\, \nabla\delta_{t-\tau} \;+\; O\big( \beta_t^2 / (1-\lambda)^2 \big).
$$
Therefore, $m$ is essentially a momentum of sample gradients of MSBE, and convergence of $w$ to a stationary point of MSBE follows from standard techniques for analyzing SGD algorithms with momentum.
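As an illustration of the Case 1 claim (our own sanity check, not taken from the paper): for a fixed $w$, the $m$-recursion with $\lambda = 1$ should settle near $\hat G_w^{-1} g_w$, where $g_w = \mathbb{E}_{s\sim D}\big[\mathbb{E}_{s'}[\bar\delta_w]\,\mathbb{E}_{s'}[\nabla\delta_w]\big]$ is proportional to $\nabla \mathrm{MSBE}_D(w)$. The sketch below checks this numerically on an arbitrary three-state chain with tabular features; the chain, rewards, and step-size are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma, beta = 3, 0.9, 0.01
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])          # placeholder transition matrix
r_bar = np.array([1.0, 0.0, -1.0])       # placeholder (deterministic) rewards
Phi = np.eye(n)                          # tabular features
w = rng.normal(size=n)                   # w is frozen (two-time-scale limit)

# Closed-form G_hat as in (42) and g under the uniform distribution D.
G_hat, g = np.zeros((n, n)), np.zeros(n)
for s in range(n):
    diffs = gamma * Phi - Phi[s]         # row s' holds grad delta for next state s'
    G_hat += (P[s][:, None, None] * np.einsum('ij,ik->ijk', diffs, diffs)).sum(0) / n
    delta_bar = r_bar[s] + gamma * P[s] @ (Phi @ w) - Phi[s] @ w
    g += delta_bar * (gamma * P[s] @ Phi - Phi[s]) / n
m_star = np.linalg.solve(G_hat, g)

# Stochastic m-recursion with lambda = 1 and two independent next-state samples.
m = np.zeros(n)
for _ in range(100_000):
    s = rng.integers(n)
    s1, s2 = rng.choice(n, size=2, p=P[s])          # independent next samples
    grad_delta = gamma * Phi[s1] - Phi[s]
    delta_p = r_bar[s] + gamma * Phi[s2] @ w - Phi[s] @ w
    m += beta * (delta_p - m @ grad_delta) * grad_delta

print(np.round(m, 2), np.round(m_star, 2))          # the two should be close
```

With a decaying $\beta_t$ as in (43) the residual noise around the fixed point would also vanish; a small constant step-size is used here only to keep the sketch short.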
E. Asymptotic robustness of RAN to parameterization

In this appendix, we present a formal version of Proposition 5.1 and its proof. Here, we assume that $\lambda = 1$ and consider an asymptotically small step-size regime with $\alpha \to 0$ and $\alpha/\beta \to 0$. As discussed in Section 5 and Appendix D, when $\alpha/\beta \to 0$, $m$ in the RAN algorithm converges to the approximate Gauss-Newton direction
$$m_{\mathrm{GN}}(w, q) \;=\; \hat G_{w,q}^{-1}\, g_{w,q}, \qquad (47)$$
$$\hat G_{w,q} \;=\; \mathbb{E}_{s,a \sim D}\Big[ \mathbb{E}_{s',a'|s,a}\Big[ \big( \gamma \nabla_w q_w(s',a') - \nabla_w q_w(s,a) \big) \big( \gamma \nabla_w q_w(s',a') - \nabla_w q_w(s,a) \big)^\top \Big] \Big], \qquad (48)$$
$$g_{w,q} \;=\; \mathbb{E}_{s,a \sim D}\Big[ \mathbb{E}_{s',a',r|s,a}\big[ r + \gamma q_w(s',a') - q_w(s,a) \big]\; \mathbb{E}_{s',a'|s,a}\big[ \gamma \nabla_w q_w(s',a') - \nabla_w q_w(s,a) \big] \Big], \qquad (49)$$
where $D$ is a distribution over state-action pairs. In this case, when $\alpha \to 0$, the trajectory of $w$ approaches the trajectory of the following ordinary differential equation (ODE):
$$\dot w \;=\; -\, m_{\mathrm{GN}}(w, q), \qquad (50)$$
where $\dot w$ is the derivative of $w$ with respect to time. We refer to (50) as the ODE formulation of two-time-scale RAN for the $q$ function.

Let $u : \mathbb{R}^d \to \mathbb{R}^d$ be a (possibly non-linear) bijective and differentiable mapping, and suppose that the Jacobian of $u(\cdot)$ is invertible everywhere. We define $\tilde q$ as a reparameterization of the $q$ function by $u$. More specifically, for any state-action pair $(s,a)$ and any $v \in \mathbb{R}^d$, we let
$$\tilde q_v(s, a) \;=\; q_{u(v)}(s, a). \qquad (51)$$
Consider the ODE formulation of two-time-scale RAN for the $\tilde q$ function:
$$\dot v \;=\; -\, m_{\mathrm{GN}}(v, \tilde q), \qquad (52)$$
for $m_{\mathrm{GN}}(\cdot,\cdot)$ defined in (47). The following proposition is a formal version of Proposition 5.1. It draws a connection between the solution trajectories of (50) and (52).

Proposition 6.1 (Formal). Let $u : \mathbb{R}^d \to \mathbb{R}^d$ be a (possibly non-linear) bijective and differentiable mapping with invertible Jacobian, and consider the corresponding reparameterization $\tilde q$ of the $q$ function as in (51). Let $w_t$ and $v_t$, for $t \ge 0$, be solution trajectories of the ODE formulations of two-time-scale RAN in (50) and (52), respectively. If $w_0 = u(v_0)$, then $w_t = u(v_t)$ for all $t \ge 0$. Moreover, $q_{w_t}(s,a) = \tilde q_{v_t}(s,a)$ for all times $t \ge 0$ and all state-action pairs $(s,a)$.

The proposition suggests that the two-time-scale RAN algorithm with asymptotically small step-sizes is invariant to any non-linear bijective reparameterization of the $q$ function.

Proof of Proposition 6.1. For any $v \in \mathbb{R}^d$, let $J_v$ be the Jacobian matrix of $u$ at $v$. Then, for any state-action pair $(s,a)$ and any $v \in \mathbb{R}^d$,
$$\nabla_v \tilde q_v(s,a) \;=\; \nabla_v\, q_{u(v)}(s,a) \;=\; J_v^\top\, \nabla_w q_w(s,a)\big|_{w=u(v)}, \qquad (53)$$
where the first equality is from the definition of $\tilde q$ in (51), and the second equality follows from the chain rule for differentiation. Then,
$$
\hat G_{v,\tilde q}
\;=\; \mathbb{E}_{s,a \sim D}\Big[ \mathbb{E}_{s',a'|s,a}\Big[ \big( \gamma \nabla_v \tilde q_v(s',a') - \nabla_v \tilde q_v(s,a) \big) \big( \gamma \nabla_v \tilde q_v(s',a') - \nabla_v \tilde q_v(s,a) \big)^\top \Big] \Big]
\;=\; \mathbb{E}_{s,a \sim D}\Big[ \mathbb{E}_{s',a'|s,a}\Big[ J_v^\top \big( \gamma \nabla_w q_w(s',a') - \nabla_w q_w(s,a) \big) \big( \gamma \nabla_w q_w(s',a') - \nabla_w q_w(s,a) \big)^\top J_v \Big] \Big]\Big|_{w=u(v)}
\;=\; J_v^\top\, \hat G_{w,q}\, J_v \Big|_{w=u(v)}, \qquad (54)
$$
where the first and last equalities are due to (48), and the second equality follows from (53). In the same vein,
$$
g_{v,\tilde q}
\;=\; \mathbb{E}_{s,a \sim D}\Big[ \mathbb{E}_{s',a',r|s,a}\big[ r + \gamma \tilde q_v(s',a') - \tilde q_v(s,a) \big]\; \mathbb{E}_{s',a'|s,a}\big[ \gamma \nabla_v \tilde q_v(s',a') - \nabla_v \tilde q_v(s,a) \big] \Big]
\;=\; \mathbb{E}_{s,a \sim D}\Big[ \mathbb{E}_{s',a',r|s,a}\big[ r + \gamma \tilde q_v(s',a') - \tilde q_v(s,a) \big]\; \mathbb{E}_{s',a'|s,a}\big[ \gamma J_v^\top \nabla_w q_w(s',a') - J_v^\top \nabla_w q_w(s,a) \big] \Big]\Big|_{w=u(v)}
\;=\; \mathbb{E}_{s,a \sim D}\Big[ \mathbb{E}_{s',a',r|s,a}\big[ r + \gamma q_w(s',a') - q_w(s,a) \big]\; \mathbb{E}_{s',a'|s,a}\big[ \gamma J_v^\top \nabla_w q_w(s',a') - J_v^\top \nabla_w q_w(s,a) \big] \Big]\Big|_{w=u(v)}
\;=\; J_v^\top\, g_{w,q} \Big|_{w=u(v)}, \qquad (55)
$$
where the first and last equalities are due to (49), the second equality follows from (53), and the third equality is from the definition of $\tilde q$ in (51). Plugging (54) and (55) into (50) and (52), we obtain
$$
\dot v \;=\; -\, m_{\mathrm{GN}}(v, \tilde q)
\;=\; -\, \hat G_{v,\tilde q}^{-1}\, g_{v,\tilde q}
\;=\; -\, \big( J_v^\top \hat G_{w,q} J_v \big)^{-1} g_{v,\tilde q} \Big|_{w=u(v)}
\;=\; -\, J_v^{-1} \hat G_{w,q}^{-1} J_v^{-\top}\, g_{v,\tilde q} \Big|_{w=u(v)}
\;=\; -\, J_v^{-1} \hat G_{w,q}^{-1}\, g_{w,q} \Big|_{w=u(v)}
\;=\; -\, J_v^{-1}\, m_{\mathrm{GN}}(w, q) \Big|_{w=u(v)}
\;=\; J_v^{-1}\, \dot w \Big|_{w=u(v)},
$$
where the equalities are respectively due to (52), (47), (54), the assumption that the Jacobian $J_v$ of $u$ is invertible, (55), (47), and (50). Therefore, if at time $t$ we have $w_t = u(v_t)$, then $\dot w_t = J_{v_t} \dot v_t = \frac{d}{dt}\, u(v_t)$. This, together with the assumption $w_0 = u(v_0)$, implies that $w_t = u(v_t)$ for all $t \ge 0$. It then follows from the definition of $\tilde q$ in (51) that $q_{w_t}(s,a) = \tilde q_{v_t}(s,a)$, for all times $t \ge 0$ and all state-action pairs $(s,a)$. This completes the proof of Proposition 6.1.

F. Pseudo code of RANS

The pseudo code of the RANS algorithm is given in Algorithm 4. Note that in the two-time-scale regime, where $\alpha, \eta \to 0$ and $\alpha/\eta \to 0$, outlier-splitting has no effect on the expected direction of the $m$ updates. In this case, convergence of the RANS algorithm follows from arguments similar to those in Appendix D.

Algorithm 4: RANS
Hyper-parameters: step-sizes $\alpha$ and $\eta \in (0,1)$, outlier threshold $\rho > 1$, momentum and trace parameters $\lambda, \lambda' \in [0,1)$, outlier sampling probability $\sigma$. Good default values for $\eta$, $\rho$, $\lambda$, $\lambda'$, and $\sigma$ based on our experiments are $\eta = 0.2$, $\rho = 1.2$, $\lambda = 0.999$, $\lambda' = 0.9999$, and $\sigma = 0.02$.
Initialize: $m = 0$, $\hat\nu = 0$, $\hat\xi = 0$, and $w$.
for $t = 1, 2, \ldots$ do
    Take two independent samples $(S_{t+1}, A_{t+1})$ and $(S'_{t+1}, A'_{t+1})$, and compute $\delta_t$ and $\delta'_t$ as in (3) and (4)
    $\hat\nu \leftarrow \lambda' \hat\nu + (1 - \lambda') (\nabla\delta_t)^2$    [$(\nabla\delta_t)^2$ is the entrywise square of $\nabla\delta_t$]
    $\nu_t = \hat\nu / (1 - \lambda'^{\,t})$    [bias-corrected trace of $(\nabla\delta_t)^2$]
    $\xi_t = \big\langle (1/\sqrt{\nu_t}) \odot \nabla\delta_t,\; \nabla\delta_t \big\rangle$    [outlier measure]
    $\hat\xi \leftarrow \lambda' \hat\xi + (1 - \lambda')\, \xi_t$
    $\bar\xi_t = \hat\xi / (1 - \lambda'^{\,t})$    [bias-corrected trace of $\xi$]
    $k = \lfloor \xi_t / (\rho\, \bar\xi_t) \rfloor + 1$    [outlier-splitting factor]
    $\beta = \eta / \big( \rho\, \bar\xi_t \sqrt{\nu_t} \big)$    [entrywise vector step-size]
    $m \leftarrow \lambda m + \big( \delta'_t - m^\top \nabla\delta_t \big)\, \beta \odot \nabla\delta_t / k$
    $w \leftarrow w - \alpha m$
    if $k > 1$ then
        Store $(S_t, A_t, S_{t+1}, A_{t+1}, r_t, S'_{t+1}, A'_{t+1}, r'_t, k, k-1)$ in the outlier buffer    [the last entry in the tuple indicates the remaining number of future updates based on this sample]
    end if
    With probability $\min(1, \sigma \cdot \text{length of the outlier buffer})$, do an update using an outlier-buffer sample:
        Sample $(S_\tau, A_\tau, S_{\tau+1}, A_{\tau+1}, r_\tau, S'_{\tau+1}, A'_{\tau+1}, r'_\tau, k', j)$ uniformly from the outlier buffer
        Let $\delta_\tau$ and $\delta'_\tau$ be as in (3) and (4), respectively
        $\xi = \big\langle (1/\sqrt{\nu_t}) \odot \nabla\delta_\tau,\; \nabla\delta_\tau \big\rangle$
        $k'' = \max\big( k',\; \lfloor \xi / (\rho\, \bar\xi_t) \rfloor + 1 \big)$
        $m \leftarrow \lambda m + \big( \delta'_\tau - m^\top \nabla\delta_\tau \big)\, \beta \odot \nabla\delta_\tau / k''$
        $w \leftarrow w - \alpha m$
        if $j > 1$ then
            Replace $(S_\tau, \ldots, k', j)$ with $(S_\tau, \ldots, k', j-1)$ in the outlier buffer
        else if $j = 1$ then
            Remove $(S_\tau, \ldots, k', j)$ from the outlier buffer
        end if
end for
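The following is a minimal sketch of the main loop of Algorithm 4 under linear function approximation. It is our own reading of the listing above, not the authors' code: the exact form of the step-size normalization $\beta$ and of the outlier measure follow the reconstruction given there, the buffer handling is simplified, and the class name, feature inputs, and epsilon safeguard are placeholders we introduce for illustration.

```python
import numpy as np

class RANS:
    """Sketch of Algorithm 4 (RANS) for linear q_w = w . phi; simplified buffer handling."""

    def __init__(self, d, alpha=0.001, eta=0.2, rho=1.2,
                 lam=0.999, lam_p=0.9999, sigma=0.02, seed=0):
        self.w, self.m = np.zeros(d), np.zeros(d)
        self.nu_hat, self.xi_hat, self.t = np.zeros(d), 0.0, 0
        self.alpha, self.eta, self.rho = alpha, eta, rho
        self.lam, self.lam_p, self.sigma = lam, lam_p, sigma
        self.buffer, self.rng = [], np.random.default_rng(seed)
        self.eps = 1e-12  # numerical safeguard, not part of the original listing

    def _apply(self, phi, phi1, phi2, r, beta, k):
        # One momentum/weight update using 1/k of the sample's contribution.
        grad_delta = self.gamma * phi1 - phi
        delta_p = r + self.gamma * self.w @ phi2 - self.w @ phi
        self.m = self.lam * self.m + (delta_p - self.m @ grad_delta) * beta * grad_delta / k
        self.w = self.w - self.alpha * self.m

    def step(self, phi, phi1, phi2, r, gamma=0.99):
        self.gamma, self.t = gamma, self.t + 1
        grad_delta = gamma * phi1 - phi
        # Bias-corrected traces of (grad delta)^2 and of the outlier measure xi.
        self.nu_hat = self.lam_p * self.nu_hat + (1 - self.lam_p) * grad_delta**2
        nu = self.nu_hat / (1 - self.lam_p**self.t)
        xi = np.sum(grad_delta**2 / (np.sqrt(nu) + self.eps))
        self.xi_hat = self.lam_p * self.xi_hat + (1 - self.lam_p) * xi
        xi_bar = self.xi_hat / (1 - self.lam_p**self.t)
        k = int(xi / (self.rho * xi_bar)) + 1            # outlier-splitting factor
        beta = self.eta / (self.rho * xi_bar * (np.sqrt(nu) + self.eps))
        self._apply(phi, phi1, phi2, r, beta, k)
        if k > 1:                                        # keep the remaining k-1 shares
            self.buffer.append([phi, phi1, phi2, r, k, k - 1])
        # Occasionally replay one stored outlier share.
        if self.buffer and self.rng.random() < min(1.0, self.sigma * len(self.buffer)):
            i = self.rng.integers(len(self.buffer))
            phi_o, phi1_o, phi2_o, r_o, k_old, j = self.buffer[i]
            xi_o = np.sum((gamma * phi1_o - phi_o)**2 / (np.sqrt(nu) + self.eps))
            k_new = max(k_old, int(xi_o / (self.rho * xi_bar)) + 1)
            self._apply(phi_o, phi1_o, phi2_o, r_o, beta, k_new)
            if j > 1:
                self.buffer[i][5] = j - 1
            else:
                self.buffer.pop(i)
```

A usage loop would call `step(phi, phi1, phi2, r)` once per environment transition, with `phi1` and `phi2` built from two independently sampled next state-action pairs.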
G. Details of experiments

G.1. Experiments of Section 3 and Appendix B

An $n$-state extended Boyan chain with termination is a Markov chain with states $0, 1, \ldots, n-1$ and transition probabilities $P(1 \to 0) = 1$ and $P(i \to i-1) = P(i \to i-2) = 0.5$ for $i = 2, 3, \ldots, n-1$. Furthermore, state $0$ transitions to a terminal state with probability $1$. By standard features, we mean feature representations similar to those of (Boyan, 2002). More specifically, given $d > 1$ and $n = 4d - 3$, we consider $d$ standard features for the $n$-state extended Boyan chain as follows. In the $i$-th standard feature vector, for $i = 0, 1, \ldots, d-1$, the $j$-th entry is equal to $1 - |j - 4i|/4$ for $j = \max(0, 4i-3), \ldots, \min(n-1, 4i+3)$, and all other entries are zero. For the special case of $n = 13$ and $d = 4$, the standard features coincide with the features considered in (Boyan, 2002). By random binary features we mean an $n \times d$ feature matrix $\Phi$ with i.i.d. entries that take the values $0$ and $1$ with equal probability.
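For concreteness, here is a short sketch (ours, not the authors' code) that builds the extended Boyan chain transition matrix and the standard feature matrix described above, and then evaluates the condition-number of the uniform-distribution MSBE Hessian under linear function approximation, taken here to be $\Phi^\top (I - \gamma P)^\top (I - \gamma P) \Phi$ in analogy with the tabular matrix $A$ used in Appendix A; treat that formula as our assumption.

```python
import numpy as np

def extended_boyan(d, gamma=0.995):
    """Extended Boyan chain (n = 4d - 3 states) with standard tent features."""
    n = 4 * d - 3
    P = np.zeros((n, n))                 # substochastic: state 0 exits to terminal
    P[1, 0] = 1.0
    for i in range(2, n):
        P[i, i - 1] = P[i, i - 2] = 0.5
    Phi = np.zeros((n, d))
    for i in range(d):                   # tent feature centered at state 4i
        for j in range(max(0, 4 * i - 3), min(n - 1, 4 * i + 3) + 1):
            Phi[j, i] = 1.0 - abs(j - 4 * i) / 4.0
    return P, Phi

def msbe_condition_number(P, Phi, gamma):
    # Assumed Hessian of MSBE under the uniform state distribution and linear
    # values v_w = Phi w (up to a constant factor), analogous to the tabular A.
    M = (np.eye(len(P)) - gamma * P) @ Phi
    H = M.T @ M
    eig = np.linalg.eigvalsh(H)          # ascending eigenvalues
    return eig[-1] / eig[0]

P, Phi = extended_boyan(d=4)             # n = 13, the classic Boyan chain size
print(msbe_condition_number(P, Phi, gamma=0.995))
```

Sweeping `d` (and hence `n = 4d - 3`) in this sketch reproduces the kind of growth of the condition-number with chain size that Fig. 7 reports.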
For evaluating value-errors in Fig. 2, we consider a reward of $1$ for the transition at state $0$ (which leads to the terminal state), and a reward of $0$ for all other transitions. Note that the reward function does not affect the condition-number. Each point in this first experiment is the median over 100 independent runs with different random feature matrices. We use the median instead of the mean to eliminate the adverse effect of unbounded values in degenerate cases (e.g., an infinite condition-number in the case of low-rank feature matrices). For both experiments, in Fig. 2 and Fig. 7, we use discount factor $\gamma = 0.995$. Condition-numbers and value-errors in these experiments are computed with respect to the uniform state distribution.

G.2. Experiment of Section 5

We considered an extension of the Hallway environment with 50 states and one action, in which each state $s = 1, 2, \ldots, 50$ transitions to state $\min(s+1, 50)$ with probability $0.99$, and transitions to a terminal state with probability $0.01$. The experiment was tabular with $\gamma = 1$. All rewards were set to zero, in which case the correct q-values are $q_\pi(s, \cdot) = 0$ for all states $s$. All algorithms were initialized with $q(s, \cdot) = 1$, for $s = 1, \ldots, 50$. Fig. 3 illustrates learning curves of the value-error $\sum_{s=1}^{50} \big( q(s,1) - q_\pi(s,1) \big)^2 / 50$ for the RAN (Algorithm 1), RG, and TD(0) algorithms. Each point is an average over 100 independent runs. We optimized the parameters for RG and Algorithm 1. The parameters used in the experiments are as follows. For RAN, we set $\alpha = 0.025$, $\beta = 0.4$, and $\lambda = 0.9998$. For RG and TD(0) we used $\alpha = 0.5$.

G.3. Experiment of Section 6

We ran an experiment on Baird's Star environment (Baird, 1995). We performed off-policy learning with a uniform state distribution. All rewards were set to zero, in which case the correct q-values are $q_\pi(s, \cdot) = 0$ for all states $s$. We used the initial point $w = [2, 1, 1, 1, 1, 1, 1]$ for all algorithms. Fig. 4 illustrates learning curves of the value-error $\sum_{s=1}^{6} \big( q_w(s,1) - q_\pi(s,1) \big)^2 / 6$ for the RAN (Algorithm 1), DSF-RAN (Algorithm 2), RG, GTD2, and TD(0) algorithms. Each point is an average over 10 independent runs. TD(0) was unstable on this environment, so we chose a very small step-size $\alpha = 10^{-5}$ for TD(0). For all other algorithms, we used optimized parameters as follows. For RAN, we set $\alpha = 2$, $\beta = 0.15$, and $\lambda = 0.995$. For DSF-RAN, we set $\alpha = 1$, $\beta = 0.15$, $\eta = 0.3$, and $\lambda = 0.995$. For RG we used $\alpha = 0.3$. For GTD2 we set $\alpha = 0.15$ and $\beta = 0.3$.

G.4. Experiment of Section 10

We ran an experiment on the classic control tasks Acrobot and Cartpole to test the performance of the RANS algorithm. We used a single-layer neural network with 64 hidden ReLU units to learn the action-values, while choosing actions according to a softmax distribution over the action-values, $a \sim \mathrm{Softmax}\big( q_w(s, \cdot) \big)$. The network for $q_w$ was trained with three algorithms: TD(0), RG, and RANS (Algorithm 4). Since RANS incorporates adaptive step-sizes, for a fair comparison we trained TD(0) and RG using the Adam optimizer. For each algorithm, we performed training for 100 randomly generated seeds. For each training seed, in order to obtain an estimate of expected returns, we took an average over 400 independent environment simulations once every 500 steps. In Fig. 8 we plot the average of these estimated expected returns over the 100 training seeds. In the Cartpole environment, once an algorithm reaches the maximum score of 500, it does not see any failure for many episodes in a row.
In this case, the agent starts to forget the actions that prevented failure and led it to success, causing the performance to drop before rising again. This phenomenon, known as catastrophic forgetting, leads to large oscillations in the learning curves. In order to hide the effect of catastrophic forgetting on the learning curves, in Fig. 5 we eliminated the 50 worst return estimates (corresponding to the 50 worst training seeds) at each point and plotted the mean of the top 50 return estimates. The shades in Fig. 5 and Fig. 8 show 99 percent confidence intervals over the averaged data. We did not use a replay buffer or batch updates. We used a small quadratic regularizer with coefficient $10^{-5}$ on the weights of the neural network. For all algorithms, we used optimized parameters that maximize the area under the learning curves, as follows. In the Acrobot experiment: for TD(0), we used softmax coefficient 1 and the Adam optimizer with step-size 0.005; for RG, we used softmax coefficient 16 and the Adam optimizer with step-size 0.001; for RANS, we used softmax coefficient 16 and $\alpha = 0.005$, and all other parameters were set to their default values described in Algorithm 4. In the Cartpole experiment: for TD(0), we used softmax coefficient 0.005 and the Adam optimizer with step-size 0.3; for RG, we used softmax coefficient 0.002 and the Adam optimizer with step-size 0.3; for RANS, we used softmax coefficient 8 and $\alpha = 0.001$, and all other parameters were set to their default values described in Algorithm 4.

In another experiment, we evaluated the performance of RANS on the simple MuJoCo environments Hopper and HalfCheetah. We used the standard (300, 400) two-layer DDPG neural network architecture (Lillicrap et al., 2015) for the actor and the Q-network. The network for $q_w$ was trained with three algorithms: TD with delayed targets, RG, and RANS (Algorithm 4). For TD and RG we used the Adam optimizer. The actor was trained using a deterministic policy gradient, in principle similar to (Lillicrap et al., 2015), but with the Adam optimizer and without a delayed target actor. For each algorithm, we performed training for 100 randomly generated seeds. In Fig. 9 we plot average returns over the 100 training seeds. We used no replay buffer and no batch updates in our experiments. We employed a small quadratic regularizer with coefficient $10^{-5}$ on the weights of the neural network for $q_w$. For all algorithms, we optimized all parameters to maximize the area under the return curves.

Parameters used for the Hopper environment are as follows. For TD, we used a delayed target with Polyak averaging factor 0.001 and the Adam optimizer with step-size 0.005 for the Q-network updates; we set $\gamma = 0.99$ and used actor learning rate $10^{-8}$. For RG, we used the Adam optimizer with step-size 0.001 for the Q-network updates, actor learning rate $10^{-8}$, and $\gamma = 0.999$. For RANS, we set $\alpha = 0.001$ and all other parameters to their default values (see Algorithm 4); in this experiment, we used actor learning rate $10^{-7}$ and $\gamma = 0.999$.

Parameters used for the HalfCheetah environment are as follows. For TD, we used a delayed target with Polyak averaging factor $4 \times 10^{-5}$ and the Adam optimizer with step-size $3 \times 10^{-5}$ for the Q-network updates; we set $\gamma = 0.99$ and used actor learning rate $10^{-5}$. For RG, we used the Adam optimizer with step-size $10^{-5}$ for the Q-network updates, actor learning rate $10^{-5}$, and $\gamma = 0.999$. For RANS, we set $\alpha = 0.003$ and all other parameters to their default values (see Algorithm 4);
in this experiment, we used actor learning rate $10^{-6}$ and $\gamma = 0.999$.

Figure 8. Performance of RANS, TD(0), and RG on classic control tasks. The only difference from Fig. 5 is that here each point is an average of estimated returns over 100 independent training runs. See Appendix G.4 for details.

Figure 9. Performance of RANS, TD with target networks, and RG on simple MuJoCo environments. The only difference from Fig. 6 is that here each point is an average of estimated returns over 100 independent training runs, whereas Fig. 6 depicts the average of the top 50 percent of estimated returns.