# Mixing-Time Regularized Policy Gradient

Tetsuro Morimura, IBM Research - Tokyo, 5-6-52 Toyosu, Koto-ku, Tokyo, Japan, tetsuro@jp.ibm.com
Takayuki Osogami, IBM Research - Tokyo, 5-6-52 Toyosu, Koto-ku, Tokyo, Japan, osogami@jp.ibm.com
Tomoyuki Shirai, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka, Japan, shirai@imi.kyushu-u.ac.jp

Policy gradient reinforcement learning (PGRL) has been receiving substantial attention as a means of seeking stochastic policies that maximize the cumulative reward. However, the learning speed of PGRL is known to decrease substantially when PGRL explores policies that give Markov chains with long mixing times. We study a new approach that regularizes how PGRL explores the policies by using the hitting time of the Markov chains. The hitting time gives an upper bound on the mixing time, and the proposed approach improves the learning efficiency by keeping the mixing time of the Markov chains short. In particular, we propose a method of temporal-difference learning for estimating the gradient of the hitting time. Numerical experiments show that the proposed method outperforms conventional methods of PGRL.

## 1 Introduction

Policy gradient reinforcement learning (PGRL) attempts to find a policy that maximizes the average reward, based on gradient ascent in the policy parameter space (Gullapalli 1990; Williams 1992; Baxter and Bartlett 2001). Since PGRL can optimize the parameters that control the randomness of the policy, it can find appropriate stochastic policies, in contrast with value-function-based approaches (Sutton and Barto 1998). Meanwhile, PGRL methods often require an excessively large number of learning steps to construct a good stochastic policy. The number of learning steps depends on the mixing time of the Markov chains given by the intermediate policies that the PGRL explores (Bartlett and Baxter 2000; Baxter and Bartlett 2000; 2001; Kakade 2003). Roughly speaking, the mixing time of a Markov chain is the number of steps needed for the Markov chain to come sufficiently close to its stationary distribution.

In this paper, we give a new PGRL method that regularizes the hitting time as a bound on the mixing time, for which hitting-time regressions based on temporal-difference learning are proposed. This keeps the mixing time of the Markov chain short and can improve the learning efficiency.

The organization and the contributions of this paper are summarized as follows. In Section 2, we briefly review PGRL and present the motivation and an outline of our mixing-time regularization. The relation among the bias and variance of an estimator of an arbitrary linear sum over the stationary distribution of a Markov chain, the mixing time, and the hitting time is described in Section 3. As our first theoretical contribution, it is proved that the bias and variance of the estimator are bounded via the Cesàro mixing time, which in turn is bounded via the worst-case expected hitting time. In Section 4, we derive a new framework of PGRL with mixing-time regularization, where, as the second contribution, a sufficient condition for convergence to a local optimum is also provided in terms of the strength of the regularization term. A method for estimating this regularization term is proposed in Section 5. Numerical experiments in Section 6 show that the proposed method outperforms conventional PGRL methods.
## 2 Background of Policy Gradient Reinforcement Learning

Problems of PGRL are usually modeled on a Markov decision process (MDP) (Bertsekas 1995; Sutton and Barto 1998). It is defined by the quintuplet $(S, A, p_T, R, \pi)$, where $S = \{1, \ldots, |S|\}$ and $A = \{1, \ldots, |A|\}$ are finite sets of states and actions, respectively. Also, $p_T : S \times A \times S \to [0, 1]$ is the state transition probability function of a state $s_t$, an action $a_t$, and the following state $s_{t+1}$ at time $t \in \mathbb{N}$, i.e., $p_T(s_{t+1}|s_t, a_t) \triangleq \Pr(s_{t+1}|s_t, a_t)$. (Although it should be $\Pr(S_{t+1}=s_{t+1} \mid S_t=s_t, A_t=a_t)$ for the random variables $S_{t+1}$, $S_t$, and $A_t$ to be precise, we write $\Pr(s_{t+1}|s_t, a_t)$ for brevity; the same rule is applied to the other distributions.) The function $R : S \times A \times S \to \mathbb{R}$ is a bounded reward function of $s_t$, $a_t$, and $s_{t+1}$, which defines the immediate reward $r_{t+1} = R(s_t, a_t, s_{t+1})$ observed by the learning agent at each time step. The action probability function $\pi : A \times S \times \mathbb{R}^d \to [0, 1]$ defines the decision-making rule of the learning agent and is also called a policy, i.e., $\pi(a|s;\theta) \triangleq \Pr(a|s,\theta)$, where $\theta \in \mathbb{R}^d$ is a policy parameter. The policy is parameterized by the user and is controlled by tuning $\theta$. Here, we make the following usual assumption for the MDP (Bartlett and Baxter 2000; Baxter and Bartlett 2001; Kakade 2002).

Assumption 1. The Markov chain on $S$ prescribed by $M(\theta) \triangleq \{S, A, p_T, R, \pi, \theta\}$ is always ergodic (irreducible and aperiodic).

Under Assumption 1, there exists a unique stationary distribution $d_\theta(s)$, which satisfies the balance equations
$$d_\theta(s_{t+1}) = \sum_{s_t \in S}\sum_{a_t \in A} p_T(s_{t+1}|s_t, a_t)\,\pi(a_t|s_t;\theta)\,d_\theta(s_t).$$
This stationary distribution is equal to the limiting distribution and is independent of the initial state, $d_\theta(s') = \lim_{t\to\infty}\Pr(S_t = s' \mid S_0 = s, M(\theta)),\ \forall s \in S$.

The goal of PGRL is to find the policy parameter $\theta$ that maximizes the average immediate reward in the infinitely long run, the so-called average reward,
$$\eta(\theta) \triangleq \lim_{T\to\infty}\frac{1}{T}\,\mathbb{E}_\theta\!\left[\sum_{t=0}^{T-1} R(S_t, A_t, S_{t+1})\right] = \sum_{s_t\in S}\sum_{a_t\in A}\sum_{s_{t+1}\in S} d_\theta(s_t)\,\pi(a_t|s_t;\theta)\,p_T(s_{t+1}|s_t, a_t)\,R(s_t, a_t, s_{t+1}),$$
where $\mathbb{E}_\theta[\,\cdot\,]$ denotes the expectation operator over the random variables in the Markov chain $M(\theta)$, e.g., $S_t$, $A_t$, etc. The derivative of the average reward with respect to the policy parameter, $\nabla\eta(\theta) \triangleq [\partial\eta(\theta)/\partial\theta_1, \ldots, \partial\eta(\theta)/\partial\theta_d]^\top$, is referred to as the policy gradient (PG). The average reward $\eta$ is increased by updating the policy parameter $\theta_t$ at time $t$ as
$$\theta_{t+1} = \theta_t + \alpha_t\, G_{\theta_t}\nabla\eta(\theta_t), \tag{1}$$
where $\alpha_t$ is a learning rate. The matrix $G_\theta \in \mathbb{R}^{d\times d}$ is an arbitrary uniformly bounded positive definite matrix, which often consists of the Fisher information matrix of the policy (Kakade 2002) and/or the stationary state distribution (Morimura et al. 2008). This framework is called PGRL (Baxter and Bartlett 2001).

In the normal setting of reinforcement learning, the state transition probability or the reward function is unknown. Thus the PG $\nabla\eta(\theta)$ cannot be computed analytically and needs to be estimated. However, as described in the next section, the number of time steps needed to make the estimate of $\nabla\eta(\theta)$ almost unbiased increases as the mixing time of the Markov chain $M(\theta)$ gets larger. To control the magnitude of the mixing time, we consider adding a regularization term $l(\theta)$ to the policy update of Eq. (1) as
$$\theta_{t+1} = \theta_t + \alpha_t\, G_{\theta_t}\nabla\eta(\theta_t) - \lambda_t\, l(\theta_t), \tag{2}$$
where $\lambda_t$ is a regularization parameter.
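As a concrete illustration of the regularized update in Eq. (2), the following minimal Python sketch performs one such step. It is not the paper's implementation; the callables `estimate_policy_gradient` and `estimate_regularizer_gradient` are hypothetical placeholders for whatever estimators are plugged in (e.g., GPOMDP for $\nabla\eta$ and the hitting-time-derivative estimator of Section 5).

```python
import numpy as np

def regularized_pg_step(theta, alpha_t, lambda_t,
                        estimate_policy_gradient,
                        estimate_regularizer_gradient,
                        G=None):
    """One step of the update in Eq. (2):
    theta <- theta + alpha_t * G * grad_eta(theta) - lambda_t * l(theta).
    Both estimator callables are user-supplied placeholders."""
    grad_eta = estimate_policy_gradient(theta)       # estimate of the policy gradient
    reg_grad = estimate_regularizer_gradient(theta)  # estimate of l(theta)
    if G is None:
        G = np.eye(theta.shape[0])                   # identity: plain (vanilla) gradient
    return theta + alpha_t * (G @ grad_eta) - lambda_t * reg_grad
```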
## 3 Mixing Time and Hitting Time for PGRL

We introduce a mixing time and analyze its effect on the biases and variances of estimators required for the policy update, such as $\nabla\eta(\theta)$ or $G_\theta$ in Eq. (1), on a finite-time Markov chain. Then, as a bound on the mixing time, a hitting time is also introduced.

### Cesàro mixing time

In PGRL, a standard mixing time, which is the time needed to get close to the stationary state distribution $d_\theta(s)$, is often used in connection with the policy gradient estimator with a discount factor (Bartlett and Baxter 2000; Baxter and Bartlett 2000; 2001). Here we consider an alternative formulation of a mixing time, which is useful for discussing its effects on an estimation problem on a finite-time Markov chain. That is the Cesàro mixing time $m(\varepsilon, \theta)$ (Lovász and Winkler 1998; Levin, Peres, and Wilmer 2008), which is defined as
$$m(\varepsilon, \theta) \triangleq \min\Big\{t' \,\Big|\, \forall t \ge t',\ \max_{s_0\in S}\frac{1}{2}\sum_{s\in S}\big|\nu_\theta(t, s_0, s) - d_\theta(s)\big| \le \varepsilon\Big\}, \tag{3}$$
where $\nu_\theta(t, s_0, s)$ is the time-averaged probability of visiting $s$ within $t$ time steps given an initial state $S_0 = s_0$,
$$\nu_\theta(t, s_0, s) \triangleq \frac{1}{t}\sum_{k=0}^{t-1} e_{|S|}(s_0)^\top P_\theta^{k}\, e_{|S|}(s).$$
(The only difference between the standard mixing time and the Cesàro mixing time of Eq. (3) is the definition of $\nu_\theta(t, s_0, s)$: for the standard mixing time, $e_{|S|}(s_0)^\top P_\theta^{t}\, e_{|S|}(s)$ is used for $\nu_\theta(t, s_0, s)$ in Eq. (3).) The matrix $P_\theta$ is a matrix representation of the state transition probability given the policy $\pi(a|s;\theta)$, i.e., $\{P_\theta\}_{s,s'} \triangleq \sum_{a\in A}\pi(a|s;\theta)\,p_T(s'|s, a)$, where $\{X\}_{i,j}$ or $\{x\}_i$ denotes the $(i,j)$-th or $i$-th element of a matrix $X$ or a vector $x$, respectively. The vector $e_n(k)$ denotes the $n$-dimensional column vector whose $k$-th element is 1 and whose other elements are zero.

### Cesàro mixing time as a bound on estimation bias and variance on a finite-time Markov chain

Let us consider the problem of predicting a general linear combination of an arbitrary function $f(s) \in [-C, C]$ and the stationary state distribution $d_\theta(s)$,
$$g_\theta \triangleq \sum_{s\in S} f(s)\, d_\theta(s), \tag{4}$$
which relates to many sufficient statistics computed in REINFORCE (Williams 1992), GPOMDP (Baxter and Bartlett 2001), and other PGRL methods (Kakade 2002; Peters and Schaal 2008; Morimura et al. 2009). A natural estimator of $g_\theta$ on a finite-time Markov chain of time length $t$ is
$$\hat g_t(s_0) = \frac{1}{t}\sum_{k=0}^{t-1} f(S_k). \tag{5}$$
We show a connection between the bias/variance of this estimator and the mixing time in the following propositions, as part of the main contribution.

Proposition 1. The Cesàro mixing time $m(\varepsilon, \theta)$ of Eq. (3) is an upper bound on the number of time steps required to decrease the maximum absolute expected bias of the estimator $\hat g_t(s_0)$ of Eq. (5) for $g_\theta$ of Eq. (4),
$$\mathrm{Bias}_\theta(t) \triangleq \max_{s_0\in S}\big|\mathbb{E}_\theta[g_\theta - \hat g_t(s_0)]\big|,$$
below $2C\varepsilon$; i.e., the inequality $\mathrm{Bias}_\theta(m(\varepsilon,\theta)) \le 2C\varepsilon$ holds.

Proof. Noting that $\mathbb{E}_\theta[\hat g_t(s_0)] = \sum_{s\in S} f(s)\,\nu_\theta(t, s_0, s)$ and $|f(s)| \le C$, the bias is bounded as
$$\mathrm{Bias}_\theta(t) = \max_{s_0\in S}\big|g_\theta - \mathbb{E}_\theta[\hat g_t(s_0)]\big| = \max_{s_0\in S}\Big|\sum_{s\in S} f(s)\,d_\theta(s) - \sum_{s\in S} f(s)\,\nu_\theta(t, s_0, s)\Big| \le C\max_{s_0\in S}\sum_{s\in S}\big|d_\theta(s) - \nu_\theta(t, s_0, s)\big|. \tag{6}$$
If $t'$ is equal to or greater than $m(\varepsilon, \theta)$, the inequality $\sum_{s\in S}|d_\theta(s) - \nu_\theta(t', s_0, s)| \le 2\varepsilon$ holds for every $s_0$ by the definition of the Cesàro mixing time (Eq. (3)), and thus the claim $\mathrm{Bias}_\theta(t') \le 2C\varepsilon$ is proved.

Proposition 2. The Cesàro mixing time $m(\varepsilon, \theta)$ of Eq. (3) is an upper bound on the number of time steps required to decrease the maximum expected (pseudo) variance of the estimator $\hat g_t(s_0)$ of Eq. (5) for $g_\theta$ of Eq. (4),
$$\mathrm{Var}_\theta(t) \triangleq \max_{s_0\in S}\mathbb{E}_\theta\big[\{g_\theta - \hat g_t(s_0)\}^2\big],$$
below $4C^2\varepsilon$; i.e., the inequality $\mathrm{Var}_\theta(m(\varepsilon,\theta)) \le 4C^2\varepsilon$ holds.

Proof. Because $|g_\theta - \hat g_t(s_0)| \le 2C$, this variance is bounded as $\mathrm{Var}_\theta(t) \le 2C\,\mathrm{Bias}_\theta(t)$. With Proposition 1, if $t' \ge m(\varepsilon, \theta)$, then $\mathrm{Var}_\theta(t') \le 4C^2\varepsilon$ holds.

Propositions 1 and 2 mean
$$\min\{t \ge 0 \mid \forall t' \ge t,\ \mathrm{Bias}_\theta(t') \le 2C\varepsilon\} \le m(\varepsilon, \theta), \qquad \min\{t \ge 0 \mid \forall t' \ge t,\ \mathrm{Var}_\theta(t') \le 4C^2\varepsilon\} \le m(\varepsilon, \theta),$$
and indicate that the magnitude of the Cesàro mixing time $m(\varepsilon, \theta)$ can have a great effect on the bias and variance when estimating a linear combination of the stationary state distribution on a finite-time MDP. The bias and variance can increase as $m(\varepsilon, \theta)$ gets larger. Thus, in order to learn a policy efficiently in PGRL, it is necessary to keep $m(\varepsilon, \theta)$ small.
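To make the quantities in Eq. (3) tangible, the following sketch computes the Cesàro averages $\nu_\theta(t, s_0, \cdot)$ and searches for the mixing time of a chain whose transition matrix is known. This is purely illustrative, since in the RL setting $P_\theta$ is not available, and for simplicity it returns the first $t$ at which the total-variation criterion is met rather than checking the "for all $t \ge t'$" clause of the definition.

```python
import numpy as np

def cesaro_mixing_time(P, d_stat, eps, t_max=10_000):
    """Approximate Cesaro mixing time m(eps) for a known transition matrix P
    with stationary distribution d_stat (a simplification of Eq. (3)):
    returns the first t at which the worst-case (over initial states)
    total-variation distance between nu(t, s0, .) and d_stat is <= eps."""
    n = P.shape[0]
    P_power = np.eye(n)              # P^0
    running_sum = np.zeros((n, n))
    for t in range(1, t_max + 1):
        running_sum += P_power       # accumulates P^0 + ... + P^(t-1)
        nu = running_sum / t         # Cesaro average; row s0 is nu(t, s0, .)
        tv = 0.5 * np.abs(nu - d_stat[None, :]).sum(axis=1).max()
        if tv <= eps:
            return t
        P_power = P_power @ P
    return None                      # criterion not met within t_max steps
```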
### Hitting time as a bound on the Cesàro mixing time

The hitting time is the first time at which the Markov chain $M(\theta)$ hits a state $s' \in S$ from an initial state $s \in S$, which is defined as
$$\tau_\theta(s, s') \triangleq \min\{t \ge 0 \mid S_0 = s,\ S_t = s',\ M(\theta)\}.$$
The expected hitting time is given as $h_\theta(s, s') \triangleq \mathbb{E}[\tau_\theta(s, s')]$, where $\mathbb{E}[\,\cdot\,]$ is the expectation with respect to the distribution of $\tau_\theta(s, s')$. Also, the worst-case (expected) hitting time is
$$h^{\ast}(\theta) \triangleq \max_{s, s' \in S}\{h_\theta(s, s')\}. \tag{7}$$
We use the following proposition for a bound on the Cesàro mixing time in terms of the hitting time.

Proposition 3 (Levin, Peres, and Wilmer 2008). The Cesàro mixing time is bounded by using the worst-case hitting time as
$$m(\varepsilon, \theta) \le \frac{h^{\ast}(\theta)}{\varepsilon} + 1.$$
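For intuition about the quantity in Eq. (7), the sketch below computes the expected hitting times $h_\theta(s, s')$ exactly from a known transition matrix by solving the linear system implied by the hitting-time Bellman equation of Section 5, and then takes the maximum over state pairs. In the RL setting this matrix is unknown, which is exactly why Section 5 estimates these quantities from samples; the code is an illustrative sketch, not part of the proposed algorithm.

```python
import numpy as np

def worst_case_hitting_time(P):
    """Exact h*(theta) = max_{s,s'} h_theta(s, s') for a known transition
    matrix P of an ergodic chain. For each target s', the hitting times
    satisfy h(s', s') = 0 and h(s, s') = 1 + sum_u P[s, u] h(u, s') for
    s != s', i.e., (I - P_restricted) h = 1 with the target row/column removed."""
    n = P.shape[0]
    H = np.zeros((n, n))
    for target in range(n):
        keep = [s for s in range(n) if s != target]
        A = np.eye(n - 1) - P[np.ix_(keep, keep)]
        h = np.linalg.solve(A, np.ones(n - 1))
        H[keep, target] = h
    return H.max(), H
```

Proposition 3 is what licenses using $h^{\ast}(\theta)$, which this kind of computation would give exactly, as a surrogate bound on the Cesàro mixing time.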
## 4 Mixing-Time Regularized Policy Gradient

We derive a framework of policy gradient with mixing-time-bound regularization. The results in the previous section indicate that, in order to compute the statistics for the policy update efficiently, a learner should keep the magnitude of the Cesàro mixing time low. (Note that it is also known that keeping the magnitude of a mixing time low encourages exploration (Kearns and Singh 2002; Kakade 2003).) Thus we want to control the Cesàro mixing time directly during the learning process. However, to the best of our knowledge, there is no practical method in RL for estimating and controlling it. On the other hand, as shown in Proposition 3, the worst-case hitting time is an upper bound on the Cesàro mixing time, and its derivative with respect to the policy parameter can be easily estimated, as described in Section 5. Accordingly, we consider a natural approach for keeping the Cesàro mixing time low, in which the derivative of the worst-case hitting time is used for the regularization term $l(\theta)$ in Eq. (2), such as
$$\theta_{t+1} = \theta_t + \alpha_t\, G_{\theta_t}\nabla\eta(\theta_t) - \lambda_t\nabla h^{\ast}(\theta_t), \tag{8}$$
or $\theta_{t+1} = \theta_t + \alpha_t\, G_{\theta_t}\nabla\eta(\theta_t) - \lambda_t\, G_{\theta_t}\nabla h^{\ast}(\theta_t)$. The update of the policy tries to increase the average reward and also to decrease the worst-case hitting time, an upper bound on the Cesàro mixing time. However, the intended objective of the proposed approach is not to decrease an already excessively long Cesàro mixing time, but rather to prevent the Cesàro mixing time from increasing. This suppression of the Cesàro mixing time can be understood by analogy with the natural gradient (Amari 1998) or the natural policy gradient (Kakade 2002), which tries to prevent the learning parameter from falling into regions that cause learning plateaus; one cannot expect the natural gradient to perform well in regions with heavy learning plateaus.

In practice, it is important to balance the effects of the last two terms in Eq. (8). Conceivably, the mixing time might be huge or diverge at the optimal policy. Thus it may be necessary to decrease the regularization parameter with time $t$, e.g., $\lambda_t := x/(y + t^2)$, where $x > 0$ and $y \ge 0$ are constants. We give a proposition on the setting of $\alpha_t$ and $\lambda_t$ that guarantees convergence to a local optimum.

Proposition 4. Let a Lipschitz continuity condition on $\nabla\eta$ hold, i.e., there is some constant $L > 0$ satisfying $\|\nabla\eta(\theta) - \nabla\eta(\theta')\| \le L\|\theta - \theta'\|,\ \forall\theta, \theta' \in \mathbb{R}^d$, where $\|\cdot\|$ denotes the Euclidean norm, and let $\nabla h^{\ast}$ be uniformly bounded, i.e., there is a constant $M$ satisfying $\|\nabla h^{\ast}(\theta)\| < M,\ \forall\theta \in \mathbb{R}^d$. Assume that the learning rate $\alpha_t \ge 0$ and the regularization parameter $\lambda_t \ge 0$ satisfy $\sum_{t=0}^{\infty}\alpha_t^2 < \infty$ and, for some positive constant $c$, $\lambda_t \le c\,\alpha_t^2$, respectively. Then, under the update rule of Eq. (8), $\eta(\theta_t)$ converges to a finite value and $\lim_{t\to\infty}\nabla\eta(\theta_t) = 0$. Furthermore, every limit point of $\theta_t$ is a stationary point of $\eta$.

Proof. See the associated technical report (Morimura, Osogami, and Shirai 2014).

Alternatively, the regularization term can be decreased in consideration of the achieved objective value. For example, the following heuristic update can be used,
$$\theta := \theta + \alpha\nabla\eta(\theta) - \lambda\max\big(0,\ \bar\eta - \hat\eta(\theta)\big)\nabla h^{\ast}(\theta), \tag{9}$$
where $\bar\eta$ is a targeted average reward set manually, which does not have to be the maximum average reward. Also, $\hat\eta(\theta)$ is an estimate of the current average reward, which can simply be updated at time $t$ as $\hat\eta(\theta) := (1-\alpha)\hat\eta(\theta) + \alpha R_{t+1}$, where $R_{t+1} = R(S_t, A_t, S_{t+1})$ is the reward random variable. Hereinafter, the update rule of the mixing-time regularized PG with Eq. (8) will be referred to as Option 1, and that with Eq. (9) as Option 2.

For the implementation, we need to estimate $\nabla\eta(\theta)$ and $\nabla h^{\ast}(\theta)$. Since there are several existing methods for estimating $\nabla\eta(\theta)$, it remains to provide a method for estimating $\nabla h^{\ast}(\theta)$, which is described in the following section.

## 5 Estimation of the Mixing-Time-Bound Derivative $\nabla h^{\ast}(\theta)$

We derive an estimation method for the derivative of the worst-case hitting time, $\nabla h^{\ast}(\theta)$, as the mixing-time regularization term. Since direct estimation of $\nabla h^{\ast}(\theta)$ is difficult due to the non-linearity of $h^{\ast}(\theta)$ resulting from its maximization operation (see Eq. (7)), we take the following stepwise approach. First, the expected hitting time and its derivative are estimated as $\hat h(s, s')$ and $\widehat{\nabla h}(s, s')$, respectively. Second, the state pair $(s^{\ast}, s^{\ast\prime})$ that maximizes $\hat h(s, s')$ is found. Then we use $\widehat{\nabla h}(s^{\ast}, s^{\ast\prime})$ as an estimate of $\nabla h^{\ast}(\theta)$. Estimation methods for the expected hitting time and its derivative are described in the subsequent subsections. Note that we derive those estimation methods under a fixed policy, but this is standard in the PGRL literature: there is a policy evaluation step that evaluates the performance of the current (fixed) policy, and then the policy is updated based on the results of the evaluation. A concrete algorithm for the mixing-time regularized PG is shown in Algorithm 1.

### Estimation of the hitting time

The problem of estimating the hitting time $h_\theta(s, s')$ can be reduced to that of estimating the value function (Sutton and Barto 1998) on a finite-time-horizon MDP with an absorbing state $s'$, where the reward is always 1. This observation leads to the following recursive formula, which we call the hitting-time Bellman equation,
$$h_\theta(s, s') = \begin{cases} 0 & \text{if } s = s', \\ 1 + \mathbb{E}_\theta[h_\theta(S_{t+1}, s') \mid S_t = s] & \text{otherwise.}\end{cases} \tag{10}$$
Let us consider an estimator of the expected hitting time, $\hat h : S \times S \to \mathbb{R}_+$. From the hitting-time Bellman equation, $\hat h$ at time $t$ can be updated on the basis of temporal-difference learning of the value function (Sutton and Barto 1998): if $s_t \ne s_{t+1}$,
$$\hat h(s_t, i) := \hat h(s_t, i) + \alpha_t\,\delta^{(h)}_t(i), \quad \forall i \in \{s \mid s \ne s_t,\ s \in S\},$$
where $\delta^{(h)}_t$ is a temporal-difference function for the hitting time,
$$\delta^{(h)}_t(i) \triangleq 1 + \hat h(s_{t+1}, i) - \hat h(s_t, i).$$
Throughout, we keep $\hat h(s, s) = 0,\ \forall s \in S$.
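A minimal tabular sketch of the temporal-difference update for $\hat h$ described above follows; the array and argument names are mine, and the diagonal of the table is treated as pinned to zero as in the text.

```python
import numpy as np

def td_update_hitting_time(h_hat, s_t, s_next, alpha):
    """One TD update of the hitting-time table h_hat (shape [|S|, |S|]),
    where h_hat[s, i] estimates the expected hitting time from s to i.
    For every target i != s_t:
        delta_h(i) = 1 + h_hat[s_next, i] - h_hat[s_t, i]
        h_hat[s_t, i] += alpha * delta_h(i)."""
    if s_t == s_next:            # the update in the text is applied when s_t != s_{t+1}
        return h_hat
    n = h_hat.shape[0]
    for i in range(n):
        if i == s_t:             # h_hat[s, s] stays 0 and is never updated
            continue
        delta = 1.0 + h_hat[s_next, i] - h_hat[s_t, i]
        h_hat[s_t, i] += alpha * delta
    return h_hat
```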
We also derive an alternative approach for estimating $h_\theta$ on the basis of least-squares temporal-difference learning (Bradtke and Barto 1996; Boyan 2002). The sufficient statistics for this estimation are defined as $A \in \mathbb{R}^{|S|\times|S|}$ and $b \in \mathbb{R}^{|S|}$. The update rule at time $t$ is given as, if $s_t \ne s_{t+1}$,
$$A := \beta A + e_{|S|}(s_t)\{e_{|S|}(s_t) - e_{|S|}(s_{t+1})\}^\top, \qquad b := \beta b + e_{|S|}(s_t),$$
where $\beta \in [0, 1]$ is a forgetting rate. The estimated value with the sufficient statistics is given as
$$\hat h(i, j) := \begin{cases} \{A^{-1}_{/j}\, b_{/j}\}_{i} & \text{if } i < j, \\ 0 & \text{if } i = j, \\ \{A^{-1}_{/j}\, b_{/j}\}_{i-1} & \text{otherwise,}\end{cases}$$
where $X_{/y}$ denotes the partitioned matrix in which the $y$-th column and row are removed from a matrix $X$, and $x_{/y}$ the partitioned vector in which the $y$-th element is removed from a vector $x$. Empirically, estimation with least-squares temporal-difference learning is more efficient than estimation with temporal-difference learning. However, the computational cost of the least-squares temporal-difference approach is also much higher, $O(|S|^3)$ per time step in our scenario, due to the matrix inversion. Note that the eligibility-trace technique can be applied to these estimators analogously to TD($\lambda$) (Sutton and Barto 1998) or LSTD($\lambda$) (Boyan 2002); we omit it due to lack of space.

Algorithm 1: A mixing-time-bound regularized PGRL algorithm with temporal-difference learning (Option 1).
Given: a policy $\pi(a|s;\theta)$ with a policy parameter $\theta \in \mathbb{R}^d$; hyper-parameters $\alpha^{(h)}_t, \alpha^{(\nabla)}_t, \alpha^{(\pi)}_t, \lambda_t \ge 0$.
Set: an initial hitting-time function $\hat h : S \times S \to \mathbb{R}_+$ s.t. $\hat h(s, s) = 0,\ \forall s \in S$; an initial hitting-time-derivative function $\widehat{\nabla h} : S \times S \to \mathbb{R}^d$ s.t. $\widehat{\nabla h}(s, s) = 0,\ \forall s \in S$; an initial state $s_0 \in \{1, \ldots, |S|\}$ (drawn from $\Pr(s_0)$).
For $t = 0$ to $T-1$ do:
1. (Interaction with the environment) Choose and execute an action $a_t \sim \pi(a|s_t;\theta)$; observe the following state $s_{t+1}$ and the reward $r_{t+1}$.
2. (Update $\hat h$ and $\widehat{\nabla h}$ for all $i \in \{s \mid s \ne s_t,\ s \in S\}$)
   $\hat h(s_t, i) := \hat h(s_t, i) + \alpha^{(h)}_t\{1 + \hat h(s_{t+1}, i) - \hat h(s_t, i)\}$,
   $\widehat{\nabla h}(s_t, i) := \widehat{\nabla h}(s_t, i) + \alpha^{(\nabla)}_t\{-\widehat{\nabla h}(s_t, i) + \hat h(s_{t+1}, i)\nabla\log\pi(a_t|s_t;\theta) + \widehat{\nabla h}(s_{t+1}, i)\}$.
3. (Compute the worst-case hitting-time-derivative estimate $\widehat{\nabla h}^{\ast}$ as the mixing-time regularization term) Find $(s^{\ast}, s^{\ast\prime}) := \arg\max_{s, s' \in S}\hat h(s, s')$ and compute $\widehat{\nabla h}^{\ast} := \widehat{\nabla h}(s^{\ast}, s^{\ast\prime})$.
4. (Compute $\Delta\theta$ as the update direction for the parameter $\theta$) Run an arbitrary PG algorithm, such as GPOMDP (Baxter and Bartlett 2001) or NPG (Kakade 2002).
5. (Update the policy parameter $\theta$) $\theta := \theta + \alpha^{(\pi)}_t\Delta\theta - \lambda_t\widehat{\nabla h}^{\ast}$.
End for.
Return: the policy $\pi(a|s;\theta)$.

### Estimation of the hitting-time derivative

We consider an estimator of the hitting-time derivative with respect to the policy parameter $\theta$, $\widehat{\nabla h} : S \times S \to \mathbb{R}^d$. By taking the partial derivative of the hitting-time Bellman equation of Eq. (10), the following recursive formula, called the hitting-time-derivative Bellman equation, is derived,
$$\nabla h_\theta(s, s') = \begin{cases} 0 & \text{if } s = s', \\ \mathbb{E}_\theta\big[h_\theta(S_{t+1}, s')\nabla\log\pi(A_t|s;\theta) + \nabla h_\theta(S_{t+1}, s') \mid S_t = s\big] & \text{otherwise.}\end{cases} \tag{11}$$
The lower part comes from the derivative of $\mathbb{E}_\theta[h_\theta(S_{t+1}, s') \mid S_t = s] = \sum_{a_t, s_{t+1}} h_\theta(s_{t+1}, s')\,p_T(s_{t+1}|s, a_t)\,\pi(a_t|s;\theta)$. From the Bellman equation of Eq. (11), an update rule of $\widehat{\nabla h}$ on the basis of temporal-difference learning (Sutton and Barto 1998) is given at time $t$ as, if $s_t \ne s_{t+1}$,
$$\widehat{\nabla h}(s_t, i) := \widehat{\nabla h}(s_t, i) + \alpha_t\,\delta^{(\nabla)}_t(i), \quad \forall i \in \{s \mid s \ne s_t,\ s \in S\},$$
where $\delta^{(\nabla)}_t(i) \in \mathbb{R}^d$ is a temporal-difference vector function for the hitting-time derivative,
$$\delta^{(\nabla)}_t(i) \triangleq \hat h(s_{t+1}, i)\nabla\log\pi(a_t|s_t;\theta) + \widehat{\nabla h}(s_{t+1}, i) - \widehat{\nabla h}(s_t, i).$$
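Putting the pieces together, here is a minimal per-step sketch in the spirit of Algorithm 1 (Option 1): it applies the TD updates for $\hat h$ and $\widehat{\nabla h}$, picks the worst-case pair, and takes a regularized parameter step. The callables `grad_log_pi` and `policy_grad_step` stand in for the policy score function and an arbitrary PG estimator (e.g., GPOMDP); they, and all variable names, are assumptions of this sketch rather than the paper's code.

```python
import numpy as np

def mixing_time_regularized_step(theta, h_hat, grad_h_hat,
                                 s_t, a_t, s_next,
                                 grad_log_pi, policy_grad_step,
                                 alpha_h, alpha_g, alpha_pi, lam):
    """One iteration of an Algorithm-1-style update (Option 1).
    h_hat:      [|S|, |S|] hitting-time estimates, diagonal kept at 0.
    grad_h_hat: [|S|, |S|, d] hitting-time-derivative estimates."""
    score = grad_log_pi(theta, s_t, a_t)       # d/dtheta log pi(a_t | s_t; theta)
    n = h_hat.shape[0]
    for i in range(n):                         # TD updates for every target i != s_t
        if i == s_t:
            continue
        h_hat[s_t, i] += alpha_h * (1.0 + h_hat[s_next, i] - h_hat[s_t, i])
        grad_h_hat[s_t, i] += alpha_g * (h_hat[s_next, i] * score
                                         + grad_h_hat[s_next, i]
                                         - grad_h_hat[s_t, i])
    # worst-case pair and its derivative estimate (the regularization term)
    s_star, s_star2 = np.unravel_index(np.argmax(h_hat), h_hat.shape)
    reg_grad = grad_h_hat[s_star, s_star2]
    # regularized policy-parameter update
    delta_theta = policy_grad_step(theta, s_t, a_t, s_next)  # e.g., a GPOMDP direction
    theta = theta + alpha_pi * delta_theta - lam * reg_grad
    return theta, h_hat, grad_h_hat
```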
We also give a least-squares-based estimation of $\nabla h_\theta$, where the sufficient statistics are defined as $A \in \mathbb{R}^{|S|\times|S|}$ and $C^{(i)} \in \mathbb{R}^{|S|\times d}$, $i \in \{1, \ldots, |S|\}$. These are updated at time $t$ as, if $s_t \ne s_{t+1}$,
$$A := \beta A + e_{|S|}(s_t)\{e_{|S|}(s_t) - e_{|S|}(s_{t+1})\}^\top, \qquad C^{(i)} := \beta C^{(i)} + e_{|S|}(s_t)\,\hat h(s_{t+1}, i)\,\nabla\log\pi(a_t|s_t;\theta)^\top, \quad \forall i,$$
where $\beta \in [0, 1]$ is a forgetting rate. The estimated value with the sufficient statistics is given as
$$\widehat{\nabla h}(i, j) := \begin{cases} \{A^{-1}_{/j}\, C^{(j)}_{/j}\}_{i,:} & \text{if } i < j, \\ 0 & \text{if } i = j, \\ \{A^{-1}_{/j}\, C^{(j)}_{/j}\}_{i-1,:} & \text{otherwise,}\end{cases}$$
where $\{X\}_{i,:}$ denotes the $i$-th row of a matrix $X$.

## 6 Experiments

We investigate the effect of the mixing-time-bound regularization through numerical experiments. To isolate this effect, all of the applied methods use the simple, standard policy gradient method of (Baxter and Bartlett 2001) to estimate $\nabla\eta$. In particular, the baseline methods compared with the proposed methods are GPOMDP with the ordinary policy gradient (Baxter and Bartlett 2001) and GPOMDP with the natural policy gradient (Kakade 2002). In our proposed methods with Option 1 and Option 2, GPOMDP is used for $\nabla\eta$ and temporal-difference learning is used for $h_\theta$ and $\nabla h_\theta$.

The task is the simple two-state MDP of (Kakade 2002), where each state $s \in \{1, 2\}$ has self- and cross-transition actions, $A = \{\text{self}, \text{cross}\}$, and each state transition is deterministic. The reward function is set as $R(1, \text{self}, 1) = 1$, $R(2, \text{self}, 2) = 2$, and $R(i, \text{cross}, j) = 0$ for every feasible pair $(i, j)$. The policy with $\theta \in \mathbb{R}^2$ is represented by the sigmoidal function $\pi(\text{self}|s;\theta) = 1/(1 + \exp(-\theta_s))$ and $\pi(\text{cross}|s;\theta) = 1 - \pi(\text{self}|s;\theta)$. The hyper-parameters of all the methods were tuned. The targeted average reward $\bar\eta$, which is a hyper-parameter of the proposed method with Option 2, was set as $\bar\eta := 0.75\max_{\theta\in\mathbb{R}^2}\eta(\theta) = 1.5$. (Although not shown in the paper, we also tested other values for $\bar\eta$; the performance differed only a little when $\bar\eta$ is in $[0.6\max_\theta\eta(\theta),\ 0.9\max_\theta\eta(\theta)]$.)

Figure 1 shows a performance comparison for the initial policy parameter $\theta_0 = [2.2, -2.2]^\top$, which corresponds to $\pi(\text{self}|1;\theta_0) = \pi(\text{cross}|2;\theta_0) \approx 0.9$; a severe plateau occurs with the baseline methods during the first $10^4$ steps due to the large magnitude of the mixing time. From this result, we confirm that the proposed approach of mixing-time regularization can improve the performance of the conventional PGRL methods.

[Figure 1: Performance comparison (average reward vs. time steps) at the initial policy parameter $\theta_0 = [2.2, -2.2]^\top$ on the 2-state MDP; curves: proposed method with Option 1, proposed method with Option 2, policy gradient, natural policy gradient.]

Since it is known that the performance of PGRL methods is severely affected by the initialization of the policy, we also investigated the dependence on the initial policy parameter in Figure 2. The results indicate that our approach can largely soften this dependence. The comparison between the proposed methods with Options 1 and 2 indicates that, if a good targeted average reward $\bar\eta$ is set, Option 2 is a good heuristic.

[Figure 2: Success rates for achieving a targeted average reward of 1.9 ($= 0.95\max_\theta\eta(\theta)$) within $10^4$ time steps for various initial policy parameters $\theta_0$; panels: (A) proposed method with Option 1, (B) proposed method with Option 2, (C) policy gradient (ordinary GPOMDP), (D) natural policy gradient.]
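For concreteness, the two-state MDP and the sigmoidal policy used above can be written down in a few lines; the function and variable names are mine, and the learning loop that would consume `step` is omitted.

```python
import numpy as np

# Two-state MDP from the experiments: actions 0 = "self", 1 = "cross",
# deterministic transitions, rewards R(1, self, 1) = 1, R(2, self, 2) = 2,
# and 0 for every cross transition. States are 0-indexed here.
REWARD_SELF = np.array([1.0, 2.0])

def policy_self_prob(theta, s):
    """pi(self | s; theta) = 1 / (1 + exp(-theta_s))."""
    return 1.0 / (1.0 + np.exp(-theta[s]))

def step(theta, s, rng):
    """Sample one transition (a, s_next, r) under the current policy."""
    a = 0 if rng.random() < policy_self_prob(theta, s) else 1
    s_next = s if a == 0 else 1 - s          # deterministic transitions
    r = REWARD_SELF[s] if a == 0 else 0.0
    return a, s_next, r
```

With the initialization $\theta_0 = [2.2, -2.2]^\top$ discussed above, `policy_self_prob` gives $\pi(\text{self}|1;\theta_0) \approx 0.9$, reproducing the plateau-prone starting point of Figure 1.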
## 7 Conclusion

In this paper, we proved in Propositions 1 and 2 that the Cesàro mixing time bounds the bias and variance of an estimate on a finite-time Markov chain. We then proposed an approach for regularizing the Cesàro mixing time via the hitting time in policy gradient reinforcement learning, which keeps the mixing time of the Markov chain short and improves the learning efficiency. That is to say, the proposed methods for suppressing the hitting time can prevent the mixing time from increasing heavily and thus can avoid large estimation errors in the policy gradient and the hitting time. A sufficient condition for the convergence of the proposed approach was also presented in Proposition 4. For the implementation of this approach, several methods for estimating the hitting time and its derivative were presented, based on temporal-difference learning or least-squares temporal-difference learning. Finally, we demonstrated the effectiveness of the proposed methods through numerical experiments.

Further theoretical analysis, especially for the case where the policy-gradient and regularization terms are estimated from samples rather than known exactly, will be necessary to understand the properties and efficiency more deeply, specifically in terms of convergence and sample complexity. Regarding scalability in the number of states, while this paper considers all state pairs in the calculation of the hitting time, one can focus only on particular state pairs if one knows which states are desirable, undesirable, and so on. There are several interesting directions for theoretical analysis as well as algorithmic studies. Empirical studies on more challenging domains are also important future work.

## Acknowledgments

A part of this research was supported by JST, CREST.

## References

Amari, S. 1998. Natural gradient works efficiently in learning. Neural Computation 10(2):251-276.

Bartlett, P. L., and Baxter, J. 2000. Estimation and approximation bounds for gradient-based reinforcement learning. In Annual Conference on Computational Learning Theory, 133-141.

Baxter, J., and Bartlett, P. L. 2000. Reinforcement learning in POMDPs via direct gradient ascent. In International Conference on Machine Learning, 41-48.

Baxter, J., and Bartlett, P. L. 2001. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research 15:319-350.

Bertsekas, D. P. 1995. Dynamic Programming and Optimal Control, Volumes 1 and 2. Athena Scientific.

Boyan, J. A. 2002. Technical update: Least-squares temporal difference learning. Machine Learning 49(2-3):233-246.

Bradtke, S. J., and Barto, A. G. 1996. Linear least-squares algorithms for temporal difference learning. Machine Learning 22(1-3):33-57.

Gullapalli, V. 1990. A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks 3(6):671-692.

Kakade, S. 2002. A natural policy gradient. In Advances in Neural Information Processing Systems, volume 14. MIT Press.

Kakade, S. 2003. On the Sample Complexity of Reinforcement Learning. Ph.D. thesis, University College London.

Kearns, M., and Singh, S. 2002. Near-optimal reinforcement learning in polynomial time. Machine Learning 49(2-3):209-232.

Levin, D.; Peres, Y.; and Wilmer, E. 2008. Markov Chains and Mixing Times. American Mathematical Society.

Lovász, L., and Winkler, P. 1998. Mixing times. In Aldous, D., and Propp, J., eds., DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 85-133.

Morimura, T.; Uchibe, E.; Yoshimoto, J.; and Doya, K. 2008. A new natural policy gradient by stationary distribution metric. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases.
Morimura, T.; Uchibe, E.; Yoshimoto, J.; and Doya, K. 2009. A generalized natural actor-critic algorithm. In Advances in Neural Information Processing Systems, volume 22.

Morimura, T.; Osogami, T.; and Shirai, T. 2014. Mixing-time regularized policy gradient. Technical Report RT0961, IBM Research.

Peters, J., and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71(7-9):1180-1190.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning. MIT Press.

Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8:229-256.