# Bias in Natural Actor-Critic Algorithms

Philip S. Thomas (PTHOMAS@CS.UMASS.EDU)
School of Computer Science, University of Massachusetts, Amherst, MA 01002 USA

Abstract

We show that several popular discounted reward natural actor-critics, including the popular NAC-LSTD and eNAC algorithms, do not generate unbiased estimates of the natural policy gradient as claimed. We derive the first unbiased discounted reward natural actor-critics using batch and iterative approaches to gradient estimation. We argue that the bias makes the existing algorithms more appropriate for the average reward setting. We also show that, when Sarsa(λ) is guaranteed to converge to an optimal policy, the objective function used by natural actor-critics has only global optima, so policy gradient methods are guaranteed to converge to globally optimal policies as well.

1. Introduction

Natural actor-critics are an increasingly popular class of algorithms for finding locally optimal policies for continuous-action Markov decision processes (MDPs). We show that the existing discounted natural actor-critic algorithms (Degris et al., 2012; Peters & Schaal, 2006; 2008) do not produce unbiased estimates of the natural policy gradient as sometimes purported (Peters & Schaal, 2006; 2008), since they are missing a $\gamma^t$ term. Some algorithms do not claim to follow unbiased estimates of the natural policy gradient (Degris et al., 2012); however, they are still missing the term, which results in additional bias. Although the missing term is just a $\gamma^t$, we argue that it has significant ramifications, beyond voiding some convergence guarantees. We prove that, for a set of Markov decision processes, these biased discounted reward natural actor-critics are actually unbiased average reward natural actor-critics. We derive the first unbiased discounted-reward natural actor-critics, but we argue that our unbiased algorithms are not practical, and only support their use when maximization of the discounted-reward objective is imperative. The goal of this paper is to raise awareness about what the algorithms that are in use are actually doing, not to present a superior method.

We also address a common misconception about policy gradient algorithms: that they are inferior to algorithms like Sarsa(λ) because they are local methods that can become stuck at arbitrarily bad locally optimal policies. We prove that, in all of the settings where Sarsa(λ) is guaranteed to reach a globally optimal policy, the average and discounted-reward objective functions have only global optima, and hence policy gradient methods also converge to globally optimal policies.

2. Notation

We are interested in the problem of finding locally optimal policies for Markov decision processes (MDPs). An MDP is a tuple, $(S, A, P, R, d_0, \gamma)$. $S$ and $A$ denote the sets of possible states and actions, which may be countable (discrete) or uncountable (continuous).¹ $P$ is called the transition function, where $P^a_{ss'} = \Pr(s_{t+1}=s' \mid s_t=s, a_t=a)$, where $t \in \mathbb{N}_0$ denotes the time step, $s, s' \in S$, and $a \in A$. $R^a_s$ is the expected value of the scalar reward, $r_t$, when action $a$ is taken in state $s$, i.e., $R^a_s = \mathbb{E}[r_t \mid s_t=s, a_t=a]$. We assume that $r_t \in [-r_{\max}, r_{\max}]$ for some uniformly bounding constant $r_{\max}$. The initial state distribution is $d_0$, where $d_0(s) = \Pr(s_0=s)$, and $\gamma \in [0, 1]$ is a discount factor.

¹We abuse notation by writing summations and probabilities over $S$ and $A$. If these sets are continuous, the summations and probabilities should be replaced with integrals and probability densities. We also treat $s_t$ and $a_t$ as random variables or observed values depending on context.
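To make the notation concrete, here is a minimal sketch (our illustration, not code from the paper) of a small discrete MDP stored as numpy arrays in the $(S, A, P, R, d_0, \gamma)$ form, with a zero-reward absorbing state of the kind used for episodic MDPs below. The chain dynamics and the helper `sample_transition` are purely illustrative.

```python
# A minimal sketch (not from the paper) of a small discrete MDP in the
# (S, A, P, R, d0, gamma) notation above, with a zero-reward absorbing state.
import numpy as np

n_states, n_actions = 4, 2                     # |S|, |A|
P = np.zeros((n_states, n_actions, n_states))  # P[s, a, s'] = Pr(s_{t+1}=s' | s_t=s, a_t=a)
R = np.zeros((n_states, n_actions))            # R[s, a] = E[r_t | s_t=s, a_t=a]
d0 = np.array([1.0, 0.0, 0.0, 0.0])            # initial state distribution
gamma = 0.9

# Simple chain: action 0 moves right with reward 1, action 1 stays;
# state 3 is a zero-reward absorbing state.
for s in range(n_states - 1):
    P[s, 0, s + 1] = 1.0
    P[s, 1, s] = 1.0
    R[s, 0] = 1.0
P[n_states - 1, :, n_states - 1] = 1.0

def sample_transition(s, a, rng):
    """Sample (s', r) given (s, a); rewards here are deterministic for simplicity."""
    s_next = rng.choice(n_states, p=P[s, a])
    return s_next, R[s, a]

rng = np.random.default_rng(0)
print(sample_transition(0, 0, rng))
```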
A parameterized policy, $\pi$, is a distribution over actions given a state and a parameter vector $\theta \in \mathbb{R}^n$. That is, $\pi(s, a, \theta) = \Pr(a_t=a \mid s_t=s, \theta)$. We assume that, for all $s$, $a$, and $\theta$, $\pi(s, a, \theta)$ is differentiable with respect to $\theta$. The state value function, $V^\theta$, when using policy parameters $\theta$, gives the expected sum of discounted rewards (the expected discounted return) accrued from a state, $s$, when following the policy with parameters $\theta$. That is, $V^\theta(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0=s, \theta\right]$. Similarly, the state-action value function is $Q^\theta(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0=s, a_0=a, \theta\right]$. The discounted state distribution, $d^\theta$, gives the probability of each state when using policy parameters $\theta$, with a discount applied to states that occur at later times: $d^\theta(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \Pr(s_t=s \mid s_0, \theta)$. The objective function, $J : \mathbb{R}^n \to \mathbb{R}$, gives the expected discounted return for using the provided policy parameters for one episode: $J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid \theta\right]$, where an episode is one sequence of states, actions, and rewards, starting from a state sampled from $d_0$ and following the dynamics specified by $P$ and $R$.

We call an MDP episodic if there are one or more states in which the process terminates and, for all policies, every episode reaches a terminal state within some finite time, $T$. To model episodic MDPs in a unified manner with non-episodic MDPs, we assume that there is only one admissible action in terminal states, and that it causes a transition to an absorbing state with zero reward, which we call a post-terminal absorbing state. This absorbing state also has only one admissible action, which causes a self-transition with zero reward. We allow $\gamma = 1$ only when the MDP is episodic.²

²If $\gamma = 1$, we modify the sum in the definition of $d^\theta$ to sum to $T$ rather than $\infty$.

If $S$ and $A$ are discrete, then the goal is to find a policy parameter vector that maximizes $J$. If $S$ or $A$ is continuous, we search for locally optimal policy parameters, $\theta^\ast$, that is, parameters satisfying $\nabla J(\theta^\ast) = 0$.

3. Policy Gradient

Gradient ascent algorithms for maximizing $J$ are called policy gradient algorithms. Their basic update is $\theta_{t+1} = \theta_t + \alpha_t \nabla J(\theta_t)$, where $\{\alpha_t\}$ is a scalar step size schedule. Policy gradient methods may also use unbiased estimates of the gradient, making them stochastic gradient ascent algorithms. Gradient ascent is guaranteed to converge to a local maximum if $J$ is continuously differentiable, $\nabla J$ is Lipschitz, $\sum_{t=0}^{\infty}\alpha_t = \infty$, and $\sum_{t=0}^{\infty}\alpha_t^2 < \infty$ (Bertsekas & Tsitsiklis, 2000). We assume that these constraints are satisfied.

The policy gradient, $\nabla J(\theta)$, is the direction $\Delta\theta$ that maximizes $J(\theta + \Delta\theta)$ subject to $\|\Delta\theta\|^2 = \epsilon^2$, for infinitesimally small $\epsilon$, where $\|\cdot\|$ denotes the Euclidean (L2) norm. Amari (1998) suggested that Riemannian distance may be a more appropriate metric than Euclidean distance for parameter space, and he called the direction satisfying this modified constraint the natural gradient. Kakade (2002) suggested the application of natural gradients to policy gradients to get the natural policy gradient. Bagnell & Schneider (2003) then derived a proper Riemannian distance metric,³ based on Amari and Kakade's work, and showed that the natural policy gradient is covariant.

³Recent work has proposed the use of a different metric that accounts not only for how the distribution over actions (the policy) changes as the parameters change, but also for how the state distribution changes as the parameters change (Morimura et al., 2009).
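For intuition about the natural gradient just described, the following hedged sketch (ours, not the paper's) computes both the ordinary and the natural gradient of $J(\theta) = \sum_a \pi(a \mid \theta)R(a)$ for a single-state Gibbs (softmax) policy, premultiplying by the inverse Fisher information matrix as in Amari (1998) and Kakade (2002). The reward vector, parameter values, and helper names are assumptions for illustration.

```python
# Illustrative sketch (not from the paper): ordinary vs. natural gradient of
# J(theta) = sum_a pi(a|theta) R(a) for a single-state softmax policy.
import numpy as np

R = np.array([1.0, 0.0, -1.0])          # reward for each of three actions
theta = np.array([0.2, -0.1, 0.0])      # one policy parameter per action (tabular softmax)

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # d/dtheta log pi(a|theta) for a softmax policy: e_a - pi(theta)
    g = -pi(theta)
    g[a] += 1.0
    return g

p = pi(theta)
# likelihood-ratio form of the gradient: sum_a pi(a) R(a) grad log pi(a)
grad_J = sum(p[a] * R[a] * grad_log_pi(theta, a) for a in range(3))
# Fisher information matrix of the policy
fisher = sum(p[a] * np.outer(grad_log_pi(theta, a), grad_log_pi(theta, a)) for a in range(3))
# pseudo-inverse because the softmax Fisher matrix is rank-deficient
nat_grad = np.linalg.pinv(fisher) @ grad_J

print("vanilla gradient:", grad_J)
print("natural gradient:", nat_grad)
```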
Building on this line of work, Bhatnagar et al. (2009) created several provably convergent policy gradient and natural policy gradient algorithms for the average reward setting.

Let $f_\varpi(s, a) = \varpi^\top \psi_{sa}$ be a linear function approximator with weight vector $\varpi = [w^\top, v^\top]^\top$, where $|w| = |\theta|$, and feature vector $\psi_{sa} = [(\nabla_\theta \log \pi(s, a, \theta))^\top, \phi(s)^\top]^\top$, for arbitrary uniformly bounded $\phi$. If

$$\sum_{s,a} d^\theta(s)\,\pi(s, a, \theta)\left[Q^\theta(s, a) - f_\varpi(s, a)\right]\frac{\partial f_\varpi(s, a)}{\partial \varpi} = 0, \qquad (1)$$

then the natural policy gradient is $\widetilde{\nabla} J(\theta) = w$ (Kakade, 2002).⁴

⁴Notice that if $|\phi(s)| = 0$, we can drop $v$ from Equation (1) to get the exact constraint specified by Sutton et al. (2000). Equation (1) follows immediately since $\sum_a \pi(s, a, \theta)\, v^\top\phi(s)\, \partial f_\varpi(s,a)/\partial w = 0$ for all $s$, $\theta$, $\phi$, and $\varpi$. We include the baseline, $v^\top\phi(s)$, since it can reduce the variance of gradient estimates (Sutton et al., 2000). Also, for simplicity later, we assume that $\phi(s) = 0$ for the post-terminal absorbing state.

4. Finding w

To find $w$ that satisfy (1), Sutton et al. (2000), working in the $|v| = 0$ setting, suggest letting $f_\varpi : S \times A \to \mathbb{R}$ be an approximation to $Q^\theta$ with parameter vector $\varpi = w$. They claim that learning $f_\varpi$ by following $\pi(s, \cdot, \theta)$ and updating $\varpi$ by a rule such as $\Delta\varpi_t \propto \frac{\partial}{\partial\varpi}\left[\hat Q^\theta(s_t, a_t) - f_\varpi(s_t, a_t)\right]^2$, where $\hat Q^\theta(s, a)$ is some unbiased estimate of $Q^\theta(s, a)$, will result in satisfactory $w$. However, this is only true for the average reward setting, or for the discounted setting when $\gamma = 1$, because in the discounted setting $d^\theta$ in (1) is the discounted weighting of states encountered, whereas the states observed when merely following $\pi(s, \cdot, \theta)$ come from the undiscounted state distribution.

Peters & Schaal (2006; 2008) observed that the scheme proposed by Sutton et al. (2000) is a forward TD(1) algorithm. Because forward and backward TD(λ) are approximately equivalent, they suggest using least squares temporal difference (LSTD), a backwards TD(λ) method, to approximate $Q^\theta$ with $f_\varpi$, where $\lambda = 1$. They call the resulting algorithms the natural actor-critic using LSTD (NAC-LSTD) and the episodic natural actor-critic (eNAC). Because the scheme proposed by Sutton et al., and thus TD(1), does not incorporate the $\gamma^t$ weighting in the discounted state distribution, it produces $w$ that do not satisfy (1), and thus biased estimates of the natural policy gradient. Natural actor-critic algorithms that do not include the $\gamma^t$ term continue to be published (Degris et al., 2012).

One solution would be to convert the discounted MDP into an equivalent undiscounted MDP (Bertsekas & Tsitsiklis, 1996). To do this, each observed trajectory must be truncated after each transition with probability $1 - \gamma$. Notice that NAC-LSTD is not biased when $\gamma = 1$, because then the discounted and undiscounted state distributions are identical.⁵ So, after the trajectories are truncated, the existing NAC-LSTD algorithm could be used with $\gamma = 1$ to find a policy for the original MDP. However, this approach may discard significant amounts of data when truncating episodes.

⁵It is unclear whether eNAC would be unbiased in this situation, as described in Section 6.
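As a hedged sketch of the naive conversion just described (our illustration, not the paper's code), `truncate_trajectory` below cuts each stored trajectory after every transition with probability $1 - \gamma$; the surviving prefix has an expected length of roughly $1/(1-\gamma)$ steps, which is one way to see how much data this approach can discard.

```python
# Sketch (an assumption-laden illustration, not the paper's code) of the naive
# conversion above: truncate each trajectory after every transition with
# probability 1 - gamma, then treat the problem as undiscounted (gamma = 1).
import numpy as np

def truncate_trajectory(trajectory, gamma, rng):
    """trajectory: list of (s, a, r) tuples from one episode.
    Returns the prefix that survives geometric truncation with continue-probability gamma."""
    kept = []
    for step in trajectory:
        kept.append(step)
        if rng.random() > gamma:     # terminate with probability 1 - gamma
            break
    return kept

rng = np.random.default_rng(1)
episode = [(s, 0, 1.0) for s in range(50)]                     # a dummy 50-step episode
print(len(truncate_trajectory(episode, gamma=0.9, rng=rng)))   # expected length ~ 1/(1-gamma) = 10
```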
Instead, we propose the use of all of the observed data, with proper discounting, in order to produce unbiased gradient estimates. We present a new objective function, $H$, and prove that the local minima of this objective give $w$ satisfying (1). We then provide the stochastic gradient descent updates for this objective.

When using policy parameters $\theta$, the discounting from the discounted state distribution can be shifted into the objective function in order to properly satisfy (1). We select a $w$ that is a component of a local minimum of the objective function $H$:

$$H(\varpi) = \sum_{t=0}^{\infty}\sum_s \gamma^t \Pr(s_t = s \mid \theta)\sum_a \pi(s, a, \theta)\left[Q^\theta(s, a) - f_\varpi(s, a)\right]^2. \qquad (2)$$

This objective function is always finite because either $\gamma < 1$ or the MDP is episodic. If the MDP is episodic, it must enter the post-terminal absorbing state within a finite number of steps. In this state, $\psi_{sa} = 0$ and $Q^\theta(s, a) = 0$ for all $\pi$ and the one admissible $a$, so $\sum_a \pi(s, a, \theta)\,\gamma^t\left[Q^\theta(s, a) - f_\varpi(s, a)\right]^2 = 0$ for all $\varpi$. Hence, if the MDP is episodic, only a finite number of terms in the infinite sum will be non-zero.

We propose performing stochastic gradient descent on $H$ to obtain a local minimum, where $\partial H(\varpi)/\partial\varpi = 0$:

$$\sum_{t=0}^{\infty}\sum_s \gamma^t \Pr(s_t = s \mid \theta)\sum_a \pi(s, a, \theta)\left[Q^\theta(s, a) - f_\varpi(s, a)\right]\frac{\partial f_\varpi(s, a)}{\partial\varpi} = 0.$$

By the definition of $d^\theta$, this is equivalent to (1) (for $\gamma < 1$, the two differ only by the positive factor $1 - \gamma$). Hence, when gradient descent on $H$ has converged, the resulting $w$ component of $\varpi$ satisfies (1). Notice that the expectation in (2) is over the observed probabilities of states and actions at time $t$ when using $\theta$. Hence, we can update $\varpi$ via stochastic gradient descent:

$$\varpi \leftarrow \varpi + \eta_t\, \gamma^t\left[\hat Q^\theta(s_t, a_t) - f_\varpi(s_t, a_t)\right]\frac{\partial f_\varpi(s_t, a_t)}{\partial\varpi}, \qquad (3)$$

where $\hat Q^\theta$ is an unbiased estimate of $Q^\theta$ and $\{\eta_t\}$ is a step size schedule that satisfies the typical decay constraints. The substitution of $\hat Q^\theta$ for $Q^\theta$ does not influence convergence (Bertsekas & Tsitsiklis, 2000). Because $\partial f_\varpi(s, a)/\partial\varpi$ is zero for terminal states and the post-terminal absorbing state, the above update need only be performed for the pre-terminal states. With $|v| = 0$, this differs from the method proposed by Sutton et al. (2000) only by the sum over time and the $\gamma^t$ term.

5. Algorithms and Convergence

A simple algorithm to find $w$ would be to execute episodes and then perform the update in (3) using the Monte Carlo return, $\hat Q^\theta(s_t, a_t) = \sum_{\tau=0}^{\infty} \gamma^\tau r_{t+\tau}$, as the unbiased estimate of $Q^\theta(s_t, a_t)$. This is a forward TD(1) algorithm, with an additional discount applied to updates based on the time at which they occur. However, this algorithm requires that entire trajectories be stored in memory. To overcome this, we can derive the equivalent backwards update by following Sutton and Barto's derivation of backwards TD(λ) (Sutton & Barto, 1998). The resulting on-policy backwards algorithm for estimating $Q^\theta$ for a fixed policy parameter vector $\theta$ is:

$$e_{t+1} = \gamma\lambda e_t + \gamma^t \frac{\partial f_\varpi(s_t, a_t)}{\partial\varpi},$$
$$\delta_t = r_t + \gamma f_\varpi(s_{t+1}, a_{t+1}) - f_\varpi(s_t, a_t),$$
$$\varpi_{t+1} = \varpi_t + \eta_t \delta_t e_{t+1},$$

where $\lambda$ is a decay parameter for eligibility traces, as in TD(λ), and $s_t$, $a_t$, and $r_t$ come from using policy parameter vector $\theta$. Although the backwards and forward algorithms are only approximately equivalent (Sutton & Barto, 1998), their convergence guarantees are the same (Bertsekas & Tsitsiklis, 1996). Hence, if $\lambda = 1$ and $\eta_t$ is decayed appropriately, the modified backwards TD(λ) algorithm above will produce $w$ satisfying (1). The only difference between this algorithm and Sarsa(λ) is the $\gamma^t$ in the equation for $e_{t+1}$.
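A minimal sketch of this backward update for a single episode, under our reading of the three equations above (this is not the author's code): $f_\varpi$ is linear in a user-supplied feature function `psi`, the step size `eta` is held constant for brevity, and bootstrapping past the final transition is zero, matching the post-terminal absorbing state convention.

```python
# Hedged sketch of the backward critic update above: linear f_w(s,a) = w.psi(s,a)
# with the extra gamma^t factor on the eligibility trace.  psi() is a placeholder
# feature function that the user must supply.
import numpy as np

def critic_episode(psi, episode, w, gamma, lam, eta):
    """episode: list of (s, a, r) tuples ending with the last pre-terminal transition.
    Updates and returns the critic weights w."""
    e = np.zeros_like(w)                                 # eligibility trace
    for t in range(len(episode)):
        s, a, r = episode[t]
        f_sa = w @ psi(s, a)
        # value of the next state-action pair; zero past the end of the episode
        f_next = w @ psi(*episode[t + 1][:2]) if t + 1 < len(episode) else 0.0
        e = gamma * lam * e + (gamma ** t) * psi(s, a)   # e_{t+1} = gamma*lam*e_t + gamma^t psi
        delta = r + gamma * f_next - f_sa                # TD error
        w = w + eta * delta * e
    return w
```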
One can then reproduce the work of Bradtke & Barto (1996) to create LSTD in this new setting, which approximates $V^\theta$ in a least squares manner. This can be extended, following the work of Lagoudakis & Parr (2001), to create LSQ, which approximates $Q^\theta$ in a least squares manner. The resulting LSQ algorithm in NAC-LSTD changes only by the introduction of a $\gamma^t$ term: $z_{t+1} = \lambda z_t + \gamma^t \hat\phi_t$.⁶

⁶This equation uses the notation of Peters & Schaal (2008).

To create an episodic algorithm, we convert (1) into a system of linear equations using the assumption that all episodes terminate within $T$ steps. We rewrite (1) by replacing the infinite sum in $d^\theta$ with a finite one, because $\partial f_\varpi(s, a)/\partial\varpi$ is zero for absorbing states:

$$\sum_s\sum_{t=0}^{T}\Pr(s_t = s)\sum_a \pi(s, a, \theta)\,\gamma^t\left[Q^\theta(s, a) - \varpi^\top\psi_{sa}\right]\psi_{sa} = 0.$$

This can be written as $A\varpi = b$, where

$$b = \sum_{s,a}\sum_{t=0}^{T}\Pr(s_t = s)\,\pi(s, a, \theta)\,\gamma^t\, Q^\theta(s, a)\,\psi_{sa} \quad\text{and}\quad A = \sum_{s,a}\sum_{t=0}^{T}\Pr(s_t = s)\,\pi(s, a, \theta)\,\gamma^t\,\psi_{sa}\psi_{sa}^\top.$$

We can then generate unbiased estimates of $A$ and $b$ from sample trajectories. As the number of observed trajectories grows, our estimates of $A$ and $b$ converge to their true values, giving an unbiased estimate of the natural gradient. The resulting episodic natural actor-critic algorithm is presented in Algorithm 1. Notice that this is different from eNAC (Peters & Schaal, 2008), so we call it eNAC2.

Algorithm 1: episodic Natural Actor-Critic 2 (eNAC2)
1: Input: Parameterized policy $\pi(s, a, \theta)$ with initial parameters $\theta$, basis functions $\phi(s)$ for the state-value estimation, update frequency parameter $k$, discount parameter $\gamma$, decay constant $\beta$, step size schedule $\{\eta_t\}$, and maximum episode duration $T$.
2: $A \leftarrow 0$
3: $b \leftarrow 0$; $\tau \leftarrow 0$
4: for ep = 0, 1, 2, ... do
5:   Run an episode and remember the trajectory,
6:   $\{s_t, a_t, s_{t+1}, r_t\}$, where $t \in [0, T-1]$.
7:   Update Statistics:
8:   $A \leftarrow A + \sum_{t=0}^{T} f(t)\, \psi_{s_t a_t}\psi_{s_t a_t}^\top$
9:   $b \leftarrow b + \sum_{t=0}^{T} f(t)\, \psi_{s_t a_t} \sum_{\hat t = t}^{T} \gamma^{\hat t - t}\, r_{\hat t}$
10:  $[w_{ep}^\top, v_{ep}^\top]^\top = (A^\top A)^{-1} A^\top b$
11:  Update Actor (Natural Policy Gradient):
12:  if TYPE1, $ep - k \ge 0$, and $\angle(w_{ep}, w_{ep-k}) \le \epsilon$, or
13:    TYPE2 and $(ep + 1) \bmod k = 0$ then
14:      $\theta \leftarrow \theta + \eta_\tau\, w_{ep}/\|w_{ep}\|_2$
15:      $\tau \leftarrow \tau + 1$; $A \leftarrow \beta A$; $b \leftarrow \beta b$

For both algorithms presented, the user must select either TYPE1 or TYPE2 updates. In the former, which emulates the update scheme proposed by Peters & Schaal (2008), the policy is updated when the gradient estimate has converged, while in the latter, which emulates the two-timescale update scheme proposed by Bhatnagar et al. (2009), the policy is updated after a constant number of time steps. The user must also select $f(t) = \gamma^t$ to get the unbiased algorithm or $f(t) = 1$ to get the biased algorithm. The unbiased algorithms are only truly unbiased when $\lambda = 1$, $\beta = 0$ (if $\beta$ is present), and $\epsilon \to 0$ (TYPE1) or $k \to \infty$ (TYPE2), in which case they compute and ascend the exact natural policy gradient.
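The batch estimate in Algorithm 1 can be sketched as follows (a hedged illustration under our reading of the pseudocode, not the author's implementation). Here `psi(s, a)` is an assumed user-supplied compatible feature vector $[\nabla_\theta \log\pi;\ \phi(s)]$, `episodes` is a list of episodes given as lists of (s, a, r) tuples, and the `unbiased` flag switches $f(t)$ between $\gamma^t$ and 1.

```python
import numpy as np

def enac2_estimate(psi, episodes, gamma, unbiased=True):
    """Accumulate A and b over episodes and solve for [w, v] as in Algorithm 1."""
    dim = len(psi(*episodes[0][0][:2]))
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for episode in episodes:
        rewards = [r for (_, _, r) in episode]
        for t, (s, a, _) in enumerate(episode):
            f_t = gamma ** t if unbiased else 1.0
            # Monte Carlo return from time t: sum_{t_hat >= t} gamma^(t_hat - t) r_{t_hat}
            G_t = sum(gamma ** k * rewards[t + k] for k in range(len(episode) - t))
            feats = psi(s, a)
            A += f_t * np.outer(feats, feats)
            b += f_t * G_t * feats
    # least-squares solve of A omega ~= b, corresponding to line 10 of Algorithm 1
    omega, *_ = np.linalg.lstsq(A, b, rcond=None)
    return omega   # the first |theta| entries form the natural gradient estimate w
```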
NAC-LSTD and eNAC2 have computational complexity proportional to $|\varpi|^2$ per time step just to update statistics, and $|\varpi|^3$ to compute the natural policy gradient estimate for policy improvement steps. The complexity of policy improvement steps can be reduced to $|\varpi|^2$ by using the Sherman-Morrison formula to maintain estimates of $A^{-1}$ directly. It can be further reduced to linear by using the modified Sarsa(λ) algorithm in place of LSTD to find $w$ satisfying (1). We call the resulting algorithm the Natural Actor-Critic using Sarsa(λ), or NAC-S.

Notice that some mean-zero terms can be removed from the Sarsa(λ) update, and the resulting algorithm, provided in Algorithm 2, can be viewed as the discounted reward and eligibility trace extension of the Natural-Gradient Actor-Critic with Advantage Parameters (Bhatnagar et al., 2009).⁷ NAC-S can also be viewed as INAC (Degris et al., 2012) or NTD (Morimura et al., 2005) corrected to include the $\gamma^t$ term and with the option of computing exact gradient estimates or using two timescales.

⁷To get Bhatnagar's algorithm, select TYPE2 updates with $k = 1$ and $f(t) = 1$, and replace the discounted TD error with the average reward TD error.

Algorithm 2: Natural Actor-Critic using Sarsa(λ), NAC-S(λ)
1: Input: Parameterized policy $\pi(s, a, \theta)$ with initial parameters $\theta$, basis functions $\phi(s)$ for the state-value estimation, update frequency parameter $k$, discount parameter $\gamma$, eligibility decay rate $\lambda$, and step size schedules $\{\alpha^w_t\}$, $\{\alpha^v_t\}$, and $\{\eta_t\}$.
2: $w_0 \leftarrow 0$; $v_0 \leftarrow 0$; count $\leftarrow 0$
3: for episode = 0, 1, 2, ... do
4:   Draw initial state $s_0 \sim d_0(\cdot)$
5:   $e^w_{-1} = 0$; $e^v_{-1} = 0$; $\tau_1 = 0$; $\tau_2 = 0$
6:   for t = 0, 1, 2, ... do
7:     $a_t \sim \pi(s_t, \cdot, \theta)$; $s_{t+1} \sim P(s_t, a_t, \cdot)$;
8:     $r_t \sim R^{a_t}_{s_t}$; count $\leftarrow$ count + 1
9:     Update Critic (Sarsa):
10:    $\delta_t = r_t + \gamma\, v_t^\top\phi(s_{t+1}) - v_t^\top\phi(s_t)$
11:    $e^w_t = \gamma\lambda\, e^w_{t-1} + f(t)\left[\nabla_\theta \log\pi(s_t, a_t, \theta)\right]$
12:    $e^v_t = \gamma\lambda\, e^v_{t-1} + f(t)\,\phi(s_t)$
13:    $w_{t+1} = w_t + \alpha^w_{t-\tau_1}\left[\delta_t - w_t^\top\left[\nabla_\theta\log\pi(s_t, a_t, \theta)\right]\right]e^w_t$
14:    $v_{t+1} = v_t + \alpha^v_{t-\tau_1}\,\delta_t\, e^v_t$
15:    Update Actor (Natural Policy Gradient):
16:    if TYPE1, $t - k \ge 0$, and $\angle(w_t, w_{t-k}) \le \epsilon$, or
17:      TYPE2 and (count mod $k$ = 0) then
18:        $\theta \leftarrow \theta + \eta_{\tau_2}\, w_{t+1}/\|w_{t+1}\|_2$; $\tau_1 = t$; $\tau_2 = \tau_2 + 1$
19:    if $s_{t+1}$ terminal then break out of loop over t

Notice that in all algorithms presented in this paper, the natural gradient is normalized. This normalization is optional. It may void convergence guarantees, and it can make it difficult to achieve empirical convergence; however, in practice we find it easier to find fixed step sizes that work on difficult problems when using normalized updates to $\theta$. Amari (1998) defined the natural gradient as only a direction and even discarded scaling constants in his derivation of a closed form for the natural gradient.

Peters & Schaal (2008) claim that the natural actor-critics compute and ascend the natural gradient of $J$, and thus will converge to a locally optimal policy, at which point $\nabla J(\theta) = 0$, assuming the step size schedules are properly decayed and that the natural actor-critic's estimates of the natural gradient are unbiased. As stated previously, when $\lambda = 1$, $\beta = 0$ (if $\beta$ is present), and $\epsilon \to 0$ (TYPE1) or $k \to \infty$ (TYPE2), the natural gradient estimates will be exact. In practice, large $k$ or small $\epsilon$ and small fixed step sizes usually result in convergence.
Policy gradient approaches are typically purported to have one significant drawback: whereas Q-based methods converge to globally optimal policies for problems with discrete states and actions, policy gradient algorithms can become stuck in arbitrarily bad local optima (e.g., Peters & Bagnell, 2010; Peters, 2010). We argue that, with assumptions similar to those required by Q-learning and Sarsa, ascending the policy gradient results in convergence to a globally optimal policy as well.⁸

⁸Notice that this applies to all algorithms that ascend the policy gradient or natural policy gradient with a well-behaved metric tensor.

First, we assume that $S$ and $A$ are countable and that every state-action pair is observed infinitely often. Second, we assume that for all $\theta$, all states $s$, and all actions $a$ and $\hat a$, where $a \neq \hat a$, there is a direction $d_\theta$ of change to $\theta$ that causes the probability of $a$ in state $s$ to increase while that of $\hat a$ decreases, while all other action probabilities remain unchanged. These two assumptions are satisfied by policy parameterizations such as tabular Gibbs softmax action selection (Sutton & Barto, 1998).

We argue that at all suboptimal $\theta$, the policy gradient will be non-zero. For any policy that is not globally optimal, there exists a reachable state for which increasing the probability of a specific action $a$ while decreasing the probability of $\hat a$ would increase $J$ (see Section 4.2 of Sutton and Barto's book (Sutton & Barto, 1998)). By our first assumption, this state-action pair is reached by the policy, and by our second assumption, there is a direction, $d_\theta$, of change to $\theta$ that can make exactly this change. So, the directional derivative of $J$ at $\theta$ in the direction $d_\theta$ is non-zero, and therefore the gradient of $J$ at $\theta$ must also be non-zero. Hence, $\theta$ cannot be a local optimum.

Policy gradient is typically applied to problems with continuous state or action sets, in which case the assumptions above cannot be satisfied, so convergence to only a local optimum can be guaranteed. However, the above argument suggests that, in practice and on continuous problems, local optima can be avoided by increasing exploration and the representational power of the policy parameterization. If, however, one desires a specific low-dimensional policy parameterization, such as a proportional-derivative controller with limited exploration, then increasing the exploration and representational power of the policy may not be an acceptable option, in which case local optima may be unavoidable.
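The following is a small numerical check (ours, not from the paper) that a tabular Gibbs softmax parameterization satisfies the second assumption above: for any chosen pair of actions there is a parameter direction that raises one action's probability and lowers the other's while leaving the remaining action probabilities unchanged to first order. The state index is suppressed because the parameters for different states are independent.

```python
# Numerical sketch (our illustration): the softmax Jacobian is symmetric with
# range equal to the zero-sum vectors, so a direction achieving the required
# change exists; we recover it with a pseudo-inverse and verify it numerically.
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.array([0.3, -0.2, 0.1, 0.0])        # parameters for one state, four actions
p = softmax(theta)
# Jacobian of the softmax: J[i, j] = d pi_i / d theta_j = pi_i (delta_ij - pi_j)
J = np.diag(p) - np.outer(p, p)

a, a_hat = 0, 2
target = np.zeros(4)
target[a], target[a_hat] = 1.0, -1.0
d_theta = np.linalg.pinv(J) @ target            # direction of change to theta

eps = 1e-5
delta_pi = (softmax(theta + eps * d_theta) - p) / eps
print(np.round(delta_pi, 4))   # approximately [+1, 0, -1, 0]: only pi(a) and pi(a_hat) move
```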
6. Analysis of Biased Algorithms

In this section we analyze how the bias changes performance. Recall that, without the correct discounting, $w$ are the weights that minimize the squared error in the $Q^\theta$ estimate, with states sampled from actual episodes. With the proper discounting, states that are visited at later times factor less into $w$. Because $w$ will be the change to the policy parameters, this means that in the biased algorithms the change to the policy parameters considers states that are visited at later times just as much as states that are visited earlier. This suggests that the biased algorithms may be optimizing a different objective function, similar to

$$\bar J(\theta) = (1-\gamma)\sum_s \bar d^\theta(s)\, V^\theta(s), \qquad (4)$$

where $\bar d^\theta$ is the stationary distribution of the Markov chain induced by the policy $\pi$. More formally, we assume that $\bar d^\theta(s) = \lim_{t\to\infty}\Pr(s_t = s \mid s_0, \theta)$ exists and is independent of $s_0$ for all policies. Notice that $\bar J$ is not interesting for episodic MDPs since, for all policies, $\bar d^\theta(s)$ is non-zero only for the post-terminal absorbing state. So, henceforth, our discussion is limited to the non-episodic setting.

For comparison, we can write $J$ in the same form: $J(\theta) = \sum_s d_0(s)V^\theta(s)$. This original objective function gives the expected discounted return from an episode. This means that for small $\gamma$, it barely considers the quality of the policy at states that are visited late in a trajectory. On the other hand, $\bar J$ considers states based on their visitation frequency, regardless of when they are visited. Kakade (2001) showed that $\bar J$, which includes discounting in $V^\theta$, is the typical average reward objective function.

[Figure 1. The optimal policy (optimal); the mean action selected by the biased NAC-LSTD, eNAC2, and INAC (biased); the mean action selected by the unbiased NAC-LSTD, eNAC2, and NAC-S, as well as a random restart hill-climbing algorithm (unbiased); and the mean action selected by eNAC (eNAC). Note that the mean action is the policy parameter for each algorithm after training.]

To see that the biased algorithms appear to optimize something closer to this average reward objective, consider an MDP with $S = [0, 1]$, where $s_0 = 0$, $s = 1$ is terminal, $s_{t+1} = s_t + 0.01$, and $R^a_s = -(a - s)^2$. The optimal policy is to select $a_t = s_t$. We parameterize the policy with one parameter, such that $a_t \sim N(\theta, \sigma^2)$ for all states, where $N$ is a normal distribution with small constant variance, $\sigma^2$. If $\gamma = 1$, the optimal parameter, $\theta^\ast$, is $\theta^\ast = 0.5$. Both the biased and unbiased algorithms converge to this $\theta^\ast$. However, when $\gamma = 0.995$ or $\gamma = 0.5$, the optimal $\theta^\ast$ decreases in order to receive more reward initially. We found that the unbiased natural actor-critics properly converge to the new optimal $\theta^\ast$, as does a simple hill-climbing algorithm that we implemented as a control. However, the biased algorithms still converge to $\theta \approx 0.5$.⁹ We found that eNAC converges to $\theta$ that differ from those of all other algorithms when $\gamma \neq 1$, which suggests that eNAC, but not eNAC2, may have additional bias. These results are presented in Figure 1.

⁹We used random restarts for all methods and observed no local optima.
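For intuition about this example, the discounted-optimal parameter can be computed in closed form: maximizing $J(\theta) = -\sum_t \gamma^t\left[(\theta - s_t)^2 + \sigma^2\right]$ gives the $\gamma^t$-weighted mean of the visited states, whereas weighting all visited states equally, as the biased algorithms effectively do, gives $\theta \approx 0.5$ for every $\gamma$. The sketch below is ours; the values reported in Figure 1 come from running the actual algorithms.

```python
import numpy as np

states = 0.01 * np.arange(100)      # s_0 = 0.00, ..., s_99 = 0.99; s = 1 is terminal

def discounted_optimal_theta(gamma):
    # argmax_theta sum_t gamma^t * -((theta - s_t)^2 + sigma^2) = gamma^t-weighted mean of s_t
    weights = gamma ** np.arange(len(states))
    return float(np.sum(weights * states) / np.sum(weights))

for gamma in (1.0, 0.995, 0.5):
    print(gamma, round(discounted_optimal_theta(gamma), 3))
# gamma = 1.0   -> 0.495 (unweighted mean; biased and unbiased methods agree here)
# gamma = 0.995 -> ~0.45 (the discounted objective favors earlier states)
# gamma = 0.5   -> ~0.01 (almost all weight falls on the first few states)
```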
This difference raises the question of whether the biased algorithms actually compute the natural policy gradient in the average reward setting. In the remainder of this section, we prove that they do whenever

$$\sum_s V^\theta(s)\frac{\partial \bar d^\theta(s)}{\partial\theta} = 0. \qquad (5)$$

This can happen, for example, if the distribution over states does not depend on the policy, in which case $\partial \bar d^\theta(s)/\partial\theta = 0$.

The typical objective for average reward learning is $\bar J(\theta) = \lim_{n\to\infty}\frac{1}{n}\sum_{t=0}^{n-1}\mathbb{E}[r_t \mid \theta]$, which is equivalent to the definition in (4) (Kakade, 2001). The corresponding state-action value function is $\bar Q^\theta(s, a) = \sum_{t=0}^{\infty}\mathbb{E}[r_t - \bar J(\theta) \mid s_0 = s, a_0 = a, \theta]$. Kakade (2002) stated that the natural gradient of $\bar J$ is $\widetilde\nabla \bar J(\theta) = w$ if

$$\sum_{s,a}\bar d^\theta(s)\,\pi(s, a, \theta)\left[\bar Q^\theta(s, a) - f_\varpi(s, a)\right]\frac{\partial f_\varpi(s, a)}{\partial\varpi} = 0. \qquad (6)$$

Thus, the unbiased average reward natural policy gradient is given by $w$ satisfying (6). The biased algorithms sample states, $s$, from $\bar d^\theta$ and actions, $a$, from $\pi(s, \cdot, \theta)$, and perform gradient descent on the squared difference between $Q^\theta(s, a)$ and $f_\varpi(s, a)$. So, they select $w$ that satisfy

$$\sum_{s,a}\bar d^\theta(s)\,\pi(s, a, \theta)\left[Q^\theta(s, a) - f_\varpi(s, a)\right]\frac{\partial f_\varpi(s, a)}{\partial\varpi} = 0. \qquad (7)$$

Notice that (7) uses the discounted state-action value function while (6) uses the average reward state-action value function. To determine if and when the biased algorithms compute $\widetilde\nabla\bar J(\theta)$, we must determine when a constant multiple of the solutions to (7) satisfies (6). To do this, we solve (7) for $w$ and substitute a constant, $k > 0$, times these $w$ into (6) to generate a constraint that, when satisfied, results in the biased algorithms producing the same direction (but not necessarily magnitude) as the average reward natural policy gradient. When doing so, we assume that $v = 0$, since it does not influence the solutions to either equation.

First, we establish a lemma that relates the policy gradient theorem using the average reward state distribution but the discounted reward state-action value function (the left hand side of Lemma 1) to the derivative of $\bar J$ without proper application of the chain rule.

Lemma 1. For all $\theta$,
$$\sum_{s,a}\bar d^\theta(s)\frac{\partial\pi(s, a, \theta)}{\partial\theta}Q^\theta(s, a) = (1-\gamma)\sum_s\bar d^\theta(s)\frac{\partial V^\theta(s)}{\partial\theta}.$$

Proof. For every state $s$,
$$\frac{\partial V^\theta(s)}{\partial\theta} = \frac{\partial}{\partial\theta}\sum_a\pi(s, a, \theta)Q^\theta(s, a) = \sum_a\left[\frac{\partial\pi(s, a, \theta)}{\partial\theta}Q^\theta(s, a) + \pi(s, a, \theta)\sum_{s'} P^a_{ss'}\,\gamma\,\frac{\partial V^\theta(s')}{\partial\theta}\right].$$
Solving for $\sum_a\frac{\partial\pi(s, a, \theta)}{\partial\theta}Q^\theta(s, a)$ yields
$$\sum_a\frac{\partial\pi(s, a, \theta)}{\partial\theta}Q^\theta(s, a) = \frac{\partial V^\theta(s)}{\partial\theta} - \gamma\sum_a\pi(s, a, \theta)\sum_{s'}P^a_{ss'}\frac{\partial V^\theta(s')}{\partial\theta}.$$
Summing both sides over all states, weighted by $\bar d^\theta$, gives
$$\sum_{s,a}\bar d^\theta(s)\frac{\partial\pi(s, a, \theta)}{\partial\theta}Q^\theta(s, a) = \sum_s\bar d^\theta(s)\frac{\partial V^\theta(s)}{\partial\theta} - \gamma\sum_s\bar d^\theta(s)\sum_a\pi(s, a, \theta)\sum_{s'}P^a_{ss'}\frac{\partial V^\theta(s')}{\partial\theta}.$$
Because $\bar d^\theta$ is the stationary distribution, $\sum_s\bar d^\theta(s)\sum_a\pi(s, a, \theta)P^a_{ss'} = \bar d^\theta(s')$, so the right hand side equals $\sum_s\bar d^\theta(s)\frac{\partial V^\theta(s)}{\partial\theta} - \gamma\sum_{s'}\bar d^\theta(s')\frac{\partial V^\theta(s')}{\partial\theta} = (1-\gamma)\sum_s\bar d^\theta(s)\frac{\partial V^\theta(s)}{\partial\theta}$.

Solving (7) for $w$, which gives the direction of the biased algorithms, we get
$$w = \left[\sum_{s,a}\bar d^\theta(s)\pi(s, a, \theta)\psi_{sa}\psi_{sa}^\top\right]^{-1}\sum_{s,a}\bar d^\theta(s)\pi(s, a, \theta)Q^\theta(s, a)\psi_{sa}.$$
Substituting $k$ times this $w$ into (6) for $w$ and canceling the product of the Fisher information matrix and its inverse gives
$$\sum_{s,a}\bar d^\theta(s)\pi(s, a, \theta)\bar Q^\theta(s, a)\psi_{sa} = k\sum_{s,a}\bar d^\theta(s)\pi(s, a, \theta)Q^\theta(s, a)\psi_{sa},$$
that is,
$$\frac{\partial\bar J(\theta)}{\partial\theta} = k(1-\gamma)\sum_s\bar d^\theta(s)\frac{\partial V^\theta(s)}{\partial\theta}, \qquad (8)$$
by substitution of the policy gradient theorem (Sutton et al., 2000) and Lemma 1. Thus, when (8) holds for some $k$, the biased algorithms produce the direction of the unbiased average reward natural policy gradient. If we let $k = 1$, we will still get a constraint that results in the two directions being the same, although if the constraint is not satisfied, it does not mean the two are different (since a different $k$ may result in (8) being satisfied). Setting $k = 1$ and substituting (4) for $\bar J(\theta)$, we get:
$$\frac{\partial}{\partial\theta}\left[(1-\gamma)\sum_s\bar d^\theta(s)V^\theta(s)\right] = (1-\gamma)\sum_s\bar d^\theta(s)\frac{\partial V^\theta(s)}{\partial\theta},$$
$$\sum_s\left[\frac{\partial\bar d^\theta(s)}{\partial\theta}V^\theta(s) + \bar d^\theta(s)\frac{\partial V^\theta(s)}{\partial\theta}\right] = \sum_s\bar d^\theta(s)\frac{\partial V^\theta(s)}{\partial\theta},$$
$$\sum_s V^\theta(s)\frac{\partial\bar d^\theta(s)}{\partial\theta} = 0,$$
which is exactly (5). We have shown that when (5) holds, the biased algorithms compute the average reward natural policy gradient.
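Lemma 1 is an identity, and it can be checked numerically. The sketch below (our check, not part of the paper) builds a small random MDP with a tabular softmax policy, computes $\bar d^\theta$, $V^\theta$, and $Q^\theta$ exactly, and compares the two sides of Lemma 1, approximating $\partial V^\theta/\partial\theta$ by finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)          # transition probabilities P[s, a, s']
R = rng.random((nS, nA))                   # expected rewards
theta = rng.standard_normal((nS, nA))      # tabular softmax parameters

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def V_and_Q(theta):
    pi = policy(theta)
    P_pi = np.einsum('sa,sab->sb', pi, P)                 # state transition matrix under pi
    R_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)  # exact discounted values
    Q = R + gamma * P @ V
    return V, Q, P_pi

def stationary(P_pi):
    # left eigenvector of P_pi with eigenvalue 1, normalized to a distribution
    w, v = np.linalg.eig(P_pi.T)
    d = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return d / d.sum()

V, Q, P_pi = V_and_Q(theta)
dbar = stationary(P_pi)
pi = policy(theta)

# Left-hand side: sum_{s,a} dbar(s) * dpi(s,a)/dtheta * Q(s,a)
lhs = np.zeros((nS, nA))
for s in range(nS):
    J_s = np.diag(pi[s]) - np.outer(pi[s], pi[s])   # softmax Jacobian for state s
    lhs[s] = dbar[s] * (J_s @ Q[s])

# Right-hand side: (1 - gamma) * sum_s dbar(s) * dV(s)/dtheta, by finite differences
rhs = np.zeros((nS, nA))
eps = 1e-6
for s in range(nS):
    for a in range(nA):
        th = theta.copy()
        th[s, a] += eps
        V_eps, _, _ = V_and_Q(th)
        rhs[s, a] = (1 - gamma) * dbar @ ((V_eps - V) / eps)

print(np.max(np.abs(lhs - rhs)))   # small (finite-difference error), so the two sides agree
```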
7. Discussion and Conclusion

We have shown that NAC-LSTD and eNAC produce biased estimates of the natural gradient. We argued that they, and INAC, act more like average reward natural actor-critics that do not properly account for how changes to $\theta$ change the expected return via $\bar d^\theta$. We proved that in certain situations the biased algorithms produce unbiased estimates of the natural policy gradient for the average reward setting. The bias stems from improper discounting when approximating the state-action value function using compatible function approximation. We derived the properly discounted algorithms to produce the unbiased NAC-LSTD and eNAC2, as well as the biased and unbiased NAC-S, a linear time complexity alternative to the squared to cubic time complexity NAC-LSTD and eNAC2.

However, the unbiased algorithms have a critical drawback that limits their practicality. The unbiased algorithms discount their updates by $\gamma^t$, which can result in poor data efficiency, particularly when $\gamma$ is small.¹⁰ With small $\gamma$, the updates will decay to zero rapidly, causing the unbiased algorithms to ignore data collected after a short burn-in period. However, in some cases, this data inefficiency is unavoidable. Consider an MDP like the one presented earlier, where the set of states that occur early and the set of states that occur later are disjoint. In this setting, the discounted reward objective mandates that data recorded late in trajectories must be ignored. In this situation, the rapid decay of updates is a curse of the choice of objective function. However, if the states that are visited early in a trajectory are also visited later in a trajectory, off-policy methods may be able to take advantage of data from late in an episode to provide meaningful updates even for the discounted reward setting. They may also be able to properly use data from previous policies to improve the estimates of the natural policy gradient in a principled manner. These are possible avenues for future research.

¹⁰Although the unbiased algorithms suffer from data efficiency problems, they are still more efficient than the naïve truncation approach discussed in Section 4, which discards data.

As stated in the introduction, the goal of this paper is to raise awareness about what the algorithms that are in use are actually doing, not to present a superior method. Until the data efficiency of these unbiased algorithms is improved, perhaps by leveraging off-policy techniques, they are of little practical value. We only recommend their use when optimization of the discounted-reward objective is absolutely critical.

Another interesting extension would be to determine how $\gamma$ should be selected in the biased algorithms. Recall that (4) is the average reward objective, for all $\gamma$. This suggests that in the biased algorithms, $\gamma$ may be selected by the researcher. Smaller values of $\gamma$ are known to result in faster convergence of value function estimates (Szepesvari, 1997), while larger $\gamma$ typically results in smoother value functions that may be easier to approximate accurately with few features.

Lastly, we argued that, with certain policy parameterizations, policy gradient methods converge to globally optimal policies for discrete problems, and we suggested that local optima may be avoided in continuous problems by increasing exploration and the policy's representational power. Future work may attempt to provide global convergence guarantees for a subset of the continuous-action setting by intelligently increasing the representational power of the policy when it becomes stuck in a local optimum.

References

Amari, S. Natural gradient works efficiently in learning. Neural Computation, 10:251–276, 1998.

Bagnell, J. A. and Schneider, J. Covariant policy search. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 1019–1024, 2003.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.

Bertsekas, D. P. and Tsitsiklis, J. N. Gradient convergence in gradient methods. SIAM J. Optim., 10:627–642, 2000.

Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.

Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.

Degris, T., Pilarski, P. M., and Sutton, R. S. Model-free reinforcement learning with continuous action in practice. In Proceedings of the 2012 American Control Conference, 2012.

Kakade, S. Optimizing average reward using discounted rewards. In Proceedings of the 14th Annual Conference on Computational Learning Theory, 2001.

Kakade, S. A natural policy gradient. In Advances in Neural Information Processing Systems, volume 14, pp. 1531–1538, 2002.

Lagoudakis, M. and Parr, R. Model-free least-squares policy iteration. In Neural Information Processing Systems: Natural and Synthetic, pp. 1547–1554, 2001.

Morimura, T., Uchibe, E., and Doya, K. Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. In International Symposium on Information Geometry and its Application, 2005.

Morimura, T., Uchibe, E., Yoshimoto, J., and Doya, K. A generalized natural actor-critic algorithm. In Neural Information Processing Systems: Natural and Synthetic, 2009.

Peters, J. Policy gradient methods. Scholarpedia, 5(11):3698, 2010.

Peters, J. and Bagnell, J. A. Policy gradient methods. Encyclopedia of Machine Learning, 2010.
Peters, J. and Schaal, S. Policy gradient methods for robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006.

Peters, J. and Schaal, S. Natural actor-critic. Neurocomputing, 71:1180–1190, 2008.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pp. 1057–1063, 2000.

Szepesvari, C. S. The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems, volume 10, pp. 1064–1070, 1997.