# Near Optimal Policy Optimization via REPS

Aldo Pacchiano (Microsoft Research) apacchiano@microsoft.com · Jonathan Lee (Stanford University) jnl@stanford.edu · Peter L. Bartlett (UC Berkeley) peter@berkeley.edu · Ofir Nachum (Google) ofirnachum@google.com

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract

Since its introduction a decade ago, relative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains, not to mention providing algorithmic components used by many recently proposed reinforcement learning (RL) algorithms. While REPS is well-known in the community, there exist no guarantees on its performance when using stochastic, gradient-based solvers. In this paper we aim to fill this gap by providing guarantees and convergence rates for the sub-optimality of a policy learned using first-order optimization methods applied to the REPS objective. We first consider the setting in which we are given access to exact gradients and demonstrate how near-optimality of the objective translates to near-optimality of the policy. We then consider the setting of stochastic gradients and introduce a technique that uses generative access to the underlying Markov decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.

1 Introduction

Introduced by Peters et al. [23], relative entropy policy search (REPS) is an algorithm for learning agent policies in a reinforcement learning (RL) context. REPS has demonstrated successful policy learning in a variety of challenging simulated and real-world robotic tasks, encompassing table tennis [23], tether ball [12], beer pong [1], and ball-in-a-cup [8], among others. Beyond these direct applications of REPS, the mathematical tools and algorithmic components underlying REPS have inspired and been utilized as a foundation for a number of later algorithms, with their own collection of practical successes [13, 25, 20, 22, 15, 2, 18, 21].

At its core, the REPS algorithm is derived via an application of convex duality [22, 19], in which a Kullback-Leibler (KL)-regularized version of the max-return objective in terms of state-action distributions is transformed into a logsumexp objective in terms of state-action advantages (i.e., the difference of the value of the state-action pair compared to the value of the state alone, with respect to some learned state value function). If this dual objective is optimized, then the optimal policy of the original primal problem may be derived as a softmax of the state-action advantages. This basic derivation may be generalized, using any number of entropic regularizers on the original primal to yield a dual problem in the form of a convex function of advantages, whose optimal values may be transformed back to optimal regularized policies [7].

While the motivation for the REPS objective through the lens of convex duality is attractive, it leaves two main questions unanswered regarding the theoretical soundness of using such an approach. First, in practice, the dual objective in terms of advantages is likely not optimized fully. Rather, standard gradient-based solvers only provide guarantees on the near-optimality of a returned candidate solution. While convex duality asserts a relationship between primal and dual variables at the exact optimum,
it is far from clear whether a near-optimal dual solution is guaranteed to yield a near-optimal primal solution, and this is further complicated by the fact that the primal candidate solution must be transformed to yield an agent policy. The second of the two main practical difficulties is due to the form of the dual objective. Specifically, the form of the dual objective as a convex function of advantages frustrates the use of gradient-based solvers in stochastic settings. That is, the advantage of a state-action pair involves an expectation over next states, i.e., an expectation with respect to the transition function of the underlying Markov decision process (MDP). In practical settings, one does not have explicit knowledge of this transition function. Rather, one only has access to stochastic samples from this transition function, and so calculation of unbiased gradients of the REPS objective is not directly feasible.

In this paper, we provide solutions to these two main difficulties. For the first issue, we present guarantees on the near-optimality of a policy derived from dual variables optimized via a first-order gradient method, relying on a key property of the REPS objective that bounds suboptimality in terms of gradient norms. For the second issue, we propose and analyze a stochastic gradient descent procedure that makes use of a plug-in estimator of the REPS gradients. Under some mild assumptions on the MDP, our estimators need only sample transitions from a behavior policy rather than full access to a generative model (where one can uniformly sample transitions). We combine these results to yield high-probability convergence rates of REPS to a near-optimal policy. In this way, we show that REPS enjoys not only favorable practical performance but also strong theoretical guarantees.

2 Related Work

As REPS is a popular and influential work, there exist a number of previous papers that have studied its performance guarantees. These previous works predominantly study REPS as an iterative algorithm, where each step comprises an exact optimization of the REPS objective and the derived policy is then used as the reference distribution for the KL regularization of the next step. This iterative scheme may be interpreted as a form of mirror descent or similar proximal algorithms [6], and this interpretation can provide guarantees on convergence to a near-optimal policy [29, 22]. However, because this approach assumes the ability to optimize the REPS objective exactly, it still suffers from the practical limitations discussed above; specifically, (1) translation of near-optimality of advantages to near-optimality of the policy and (2) the ability to compute unbiased gradients when one does not have explicit knowledge of the MDP dynamics. Our analysis attacks these issues head-on, providing guarantees on first-order optimization methods applied to the REPS objective. To maintain focus we do not consider iterative application of REPS, although extending our guarantees to the iterative setting is a promising direction for future research. In a somewhat related vein, a number of works use REPS-inspired derivations to yield dynamic programming algorithms [13, 14, 26] and subsequently provide guarantees on the convergence of approximate dynamic programming in these settings. Our results focus on the use of REPS in a convex programming context, and on optimizing these programs via standard gradient-based solvers.
The use of convex programming for RL in this way has recently received considerable interest. Works in this area typically propose to learn near-optimal policies through saddle-point optimization [9, 28, 10, 4, 11, 17]. Instead of solving the primal or dual max-return problem directly, these works optimize the Lagrangian in the form of a min-max bilinear problem. The Lagrangian form helps to mitigate the two main issues we identify with advantage learning, since (1) the candidate primal solution can be used to derive a policy in a significantly more direct fashion than using the candidate dual solution, and (2) the bilinear form of the Lagrangian is immediately amenable to stochastic gradient computation. In contrast to these works, our analysis focuses on learning exclusively in the dual (advantage) space. The first part of our results is most comparable to the work of [4], which proposes a saddle-point optimization with runtime $O(1/\epsilon)$, assuming access to known dynamics. While our results yield an $O(1/\epsilon^2)$ rate, we show that it can be achieved by optimizing the dual objective alone. More similar to our work is the analysis of Bas-Serrano et al. [5], which considers an objective similar to REPS, but in terms of Q-values as opposed to state ($V$) values. Beyond these structural differences, our proof techniques also differ. For example, our result on the suboptimality of the policy derived from dual variables (Lemma 4) is arguably simpler than the analogous result in Bas-Serrano et al. [5], which uses a two-step process to first connect suboptimality of the dual variables to constraint violation in the primal, and then connects this to suboptimality of the policy.

3 Contributions

The main contributions of this paper are the following:

1. We prove several structural results regarding entropy-regularized objectives for reinforcement learning and leverage them to prove convergence guarantees for Accelerated Gradient Descent on the dual (REPS) objective under mild assumptions on the MDP (see Theorem 2). For discounted MDPs we show that an $\epsilon$-optimal policy can be found after $O\!\left(\frac{1}{(1-\gamma)^2\epsilon^2}\right)$ steps and an $\epsilon$-optimal regularized policy can be found in $O\!\left(\frac{1}{(1-\gamma)^2\epsilon}\right)$ steps.

2. Similarly, we show that a simple version of stochastic gradient descent using biased plug-in gradient estimators can be used to find an $\epsilon$-optimal policy after $O\!\left(\frac{1}{(1-\gamma)^8\epsilon^8}\right)$ iterations (see Theorem 3) and an $\epsilon$-optimal regularized policy in $O\!\left(\frac{1}{(1-\gamma)^8\epsilon^4}\right)$ steps. Although our rates fall short of those achievable by alternating optimization methods, we are the first to show meaningful convergence guarantees for a purely dual approach based on on-policy access to samples from the underlying MDP.

3. In Appendix I, we extend our results beyond the REPS objective and consider the use of Tsallis entropy regularizers. Similar to our results for the REPS objective, we show that for discounted MDPs an $\epsilon$-optimal policy can be found after $O\!\left(\frac{1}{(1-\gamma)^2\epsilon^2}\right)$ steps and an $\epsilon$-optimal regularized policy can be found in $O\!\left(\frac{1}{(1-\gamma)^2\epsilon}\right)$ steps.

4 Background

In this section we review the basics of Markov decision processes and their linear programming primal and dual formulations (see Section 4.1), as well as some facts about the geometry of convex functions.

4.1 RL as an LP

We consider a discounted Markov decision process (MDP) described by a tuple $\mathcal{M} = (S, A, P, r, \mu, \gamma)$, where $S$ is a finite state space, $A$ is a finite action space, $P$ is a transition probability matrix, $r$ is a reward vector, $\mu$ is an initial state distribution, and $\gamma \in (0, 1)$ is a discount factor.
We make the following assumption regarding the reward values $\{r_{s,a}\}$.

Assumption 1 (Unit rewards). For all $(s, a) \in S \times A$, the rewards satisfy $r_{s,a} \in [0, 1]$.

The agent interacts with $\mathcal{M}$ via a policy $\pi : S \to \Delta_A$. The agent is initialized at a state $s_0$ sampled from the initial state distribution $\mu$, and at each time $k = 0, 1, \ldots$ it uses its policy to sample an action $a_k \sim \pi(s_k)$. The MDP provides an immediate reward $r_{s_k, a_k}$ and transitions randomly to a next state $s_{k+1}$ according to the probabilities $P_{a_k}(s_{k+1} \mid s_k)$. Given a policy $\pi$ we define its infinite-horizon discounted reward as $V^\pi := \mathbb{E}_\pi\!\left[\sum_{k=0}^\infty \gamma^k r_{s_k,a_k}\right]$, where we use $\mathbb{E}_\pi$ to denote the expectation over trajectories induced by the MDP $\mathcal{M}$ and the policy $\pi$. In RL, the agent's objective is to find an optimal policy $\pi^\star$; that is, a policy maximizing $V^\pi$ over all policies $\pi : S \to \Delta_A$. We denote the optimal policy as $\pi^\star := \arg\max_\pi V^\pi$. We now review the definitions of state value functions and visitation distributions.

Definition 1. We define the value vector $v^\pi \in \mathbb{R}^{|S|}$ of a policy $\pi$ as $v^\pi_s := \mathbb{E}_\pi\!\left[\sum_{k=0}^\infty \gamma^k r_{s_k,a_k} \mid s_0 = s\right]$.

Definition 2. Given a policy $\pi$ we define its state-action visitation distribution $\lambda^\pi \in \mathbb{R}^{|S| \times |A|}$ as $\lambda^\pi_{s,a} := (1-\gamma)\,\mathbb{E}_\pi\!\left[\sum_{k=0}^\infty \gamma^k \mathbf{1}(s_k = s, a_k = a)\right]$. Notice that by definition $\sum_{s,a} \lambda^\pi_{s,a} = 1$.

We note that any vector $\lambda$ with nonnegative entries may be used to define a policy $\pi_\lambda$ as:

$$\pi_\lambda(a \mid s) := \frac{\lambda_{s,a}}{\sum_{a' \in A} \lambda_{s,a'}}. \qquad (1)$$

Note that $\pi_{\lambda^\pi} = \pi$, while the visitation distribution $\lambda^{\pi_\lambda}$ of $\pi_\lambda$ is not necessarily equal to $\lambda$.

Definition 3. Given a policy $\pi$ we define its state visitation distribution as $\lambda^\pi_s := (1-\gamma)\,\mathbb{E}_\pi\!\left[\sum_{k=0}^\infty \gamma^k \mathbf{1}(s_k = s)\right]$. Notice that $\lambda^\pi_s = \sum_a \lambda^\pi_{s,a}$.

The optimal visitation distribution $\lambda^\star$ is defined as $\lambda^\star := \arg\max_{\lambda^\pi} \sum_{s,a} \lambda^\pi_{s,a} r_{s,a}$. It can be shown [24, 9] that solving for the optimal visitation distribution is equivalent to the following linear program over the simplex $\Delta_{S \times A}$ of distributions over state-action pairs:

$$\max_{\lambda \in \Delta_{S \times A}} \; \sum_{s,a} \lambda_{s,a} r_{s,a}, \quad \text{s.t.} \quad \sum_{a} \lambda_{s,a} = (1-\gamma)\mu_s + \gamma \sum_{s',a'} P_{a'}(s \mid s')\,\lambda_{s',a'} \;\; \forall s \in S, \qquad \text{(Primal-$\lambda$)}$$

where we write $P \in \mathbb{R}^{|S||A| \times |S|}$ to denote the transition operator. Specifically, the $|S|$ constraints of Primal-$\lambda$ restrict any feasible $\lambda$ to be the state-action visitation distribution of some policy (given by $\pi_\lambda$). The dual of this LP is given by

$$\min_{v} \; (1-\gamma)\sum_s \mu_s v_s, \quad \text{s.t.} \quad 0 \geq A^v_{s,a} \;\; \forall s \in S,\, a \in A, \qquad \text{(Dual-$v$)}$$

where $A^v_{s,a} := r_{s,a} - v_s + \gamma \sum_{s'} P_a(s' \mid s)\,v_{s'}$ is the advantage evaluated at $(s,a) \in S \times A$. It can be shown [24, 9] that the unique primal solution $\lambda^\star$ is exactly $\lambda^{\pi^\star}$ and the unique dual solution $v^\star$ is $v^{\pi^\star}$. We finish this section by defining the notion of suboptimality satisfied by the final policy produced by the algorithms that we propose.

Definition 4. Let $\epsilon > 0$. We say that a policy $\pi$ is $\epsilon$-optimal if $\max_{s \in S} |v^\pi_s - v^{\pi^\star}_s| \leq \epsilon$.

Our objective is to design algorithms that, for any $\epsilon > 0$, return an $\epsilon$-optimal policy.
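To make these objects concrete, here is a minimal numerical sketch (our own illustrative code, not from the paper) that computes the discounted state-action visitation distribution of a fixed policy on a small randomly generated MDP by solving the linear flow equations, recovers the policy from it via Equation 1, and checks the constraints of Primal-$\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# A small random MDP: P[a, s, s'] = P_a(s' | s), rewards in [0, 1] (Assumption 1).
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))
mu = np.full(nS, 1.0 / nS)                      # initial state distribution

# An arbitrary stochastic policy pi(a | s).
pi = rng.random((nS, nA))
pi /= pi.sum(axis=1, keepdims=True)

# State-to-state transitions under pi: P_pi[s, s'] = sum_a pi(a|s) P_a(s'|s).
P_pi = np.einsum('sa,asz->sz', pi, P)

# The state visitation lambda^pi solves  lambda = (1-gamma) mu + gamma P_pi^T lambda.
lam_s = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, (1 - gamma) * mu)
lam_sa = lam_s[:, None] * pi                    # lambda^pi_{s,a} = lambda^pi_s * pi(a|s)
print(lam_sa.sum())                             # 1.0: visitations form a distribution

# Recover the policy via Equation 1: pi_lambda(a|s) = lambda_{s,a} / sum_{a'} lambda_{s,a'}.
pi_rec = lam_sa / lam_sa.sum(axis=1, keepdims=True)
print(np.abs(pi_rec - pi).max())                # ~0, i.e. pi_{lambda^pi} = pi

# Check the flow constraints of (Primal-lambda).
lhs = lam_sa.sum(axis=1)
rhs = (1 - gamma) * mu + gamma * np.einsum('asz,sa->z', P, lam_sa)
print(np.abs(lhs - rhs).max())                  # ~0
```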
4.2 Regularized Policy Search

Following Belousov & Peters [7], we consider regularizing Primal-$\lambda$ with a convex function $F : \Delta_{S \times A} \to \mathbb{R} \cup \{\infty\}$. The resulting regularized LP is given by

$$\max_{\lambda \in \Delta_{S \times A}} \; J_P(\lambda) := \sum_{s,a} \lambda_{s,a} r_{s,a} - F(\lambda), \quad \text{s.t.} \quad \sum_{a} \lambda_{s,a} = (1-\gamma)\mu_s + \gamma \sum_{s',a'} P_{a'}(s \mid s')\,\lambda_{s',a'}. \qquad \text{(Primal Reg-$\lambda$)}$$

Henceforth we denote the primal objective function as $J_P(\lambda) = \sum_{s,a} \lambda_{s,a} r_{s,a} - F(\lambda)$. Any feasible $\lambda$ that satisfies the $|S|$ constraints in this regularized LP is the (true) state-action visitation distribution of some policy $\pi$; therefore, the optimal $\lambda^\star$ of this problem can be used to derive an optimal $F$-regularized max-return policy $\pi_{F,\star} := \pi_{\lambda^\star}$. To simplify our derivations, we introduce the definition of the convex conjugate of a convex function, oftentimes referred to as the Fenchel conjugate:

Definition 5 (Fenchel conjugate). Let $F : \mathcal{D} \to \mathbb{R}$ be a convex function over a convex domain $\mathcal{D} \subseteq \mathbb{R}^d$. We denote its $\mathcal{D}$-constrained Fenchel conjugate by $F^\star : \mathbb{R}^d \to \mathbb{R}$, defined as $F^\star(u) = \max_{x \in \mathcal{D}} \langle x, u \rangle - F(x)$.

The dual $J_D$ of the regularized problem is given by the following optimization problem [7, 19]:

$$\min_v \; J_D(v) := (1-\gamma) \sum_s v_s \mu_s + F^\star(A^v), \qquad (2)$$

where $F^\star$ is the $\Delta_{S \times A}$-constrained Fenchel conjugate of $F$. The vector quantity $A^v$ inside $F^\star$ is known as the advantage. That is, it quantifies the advantage (the difference in estimated value) of taking an action $a$ at state $s$, with respect to some state value function $v$. Using Fenchel-Rockafellar duality, the optimal solution $v^\star$ of the dual function $J_D$ may be used to derive an optimal primal solution $\lambda^\star$ as:

$$\lambda^\star = \nabla F^\star(A^{v^\star}). \qquad (3)$$

Algorithm 1 Relative Entropy Policy Search [Sketch].
Input: initial iterate $v_0$, accuracy level $\epsilon > 0$, gradient optimization algorithm $\mathcal{O}$.
1. Optimize the objective in Equation 2 using $\mathcal{O}$ to yield a candidate dual solution $\hat v_\epsilon$, where $F$ satisfies Equation 4.
2. Use the candidate dual solution to derive a candidate primal solution $\hat\lambda_\epsilon$ (Equation 3).
3. Extract a candidate policy $\pi_{\hat\lambda_\epsilon}$ via Equation 1.
Return: $\pi_{\hat\lambda_\epsilon}$.

Relative Entropy Policy Search (REPS) is derived by setting $F(\lambda) := D_{\mathrm{KL}}(\lambda \,\|\, q)$, the KL divergence of $\lambda$ from some reference distribution $q \in \Delta_{S \times A}$. The reader should think of $q$ as the visitation distribution of a behavior policy. As we can see, the derivation we provide here further generalizes to arbitrary regularizers $F$. We focus on the specific choice

$$F(\lambda) := \frac{1}{\eta} D_{\mathrm{KL}}(\lambda \,\|\, q) \qquad (4)$$

for some scalar $\eta > 0$. In this case $F^\star : \mathbb{R}^{|S| \times |A|} \to \mathbb{R}$ equals $F^\star(u) = \frac{1}{\eta} \log\!\left(\sum_{s,a} \exp(\eta u_{s,a})\, q_{s,a}\right)$, its gradient satisfies

$$[\nabla F^\star(u)]_{s,a} = \frac{\exp(\eta u_{s,a})\, q_{s,a}}{\sum_{s',a'} \exp(\eta u_{s',a'})\, q_{s',a'}},$$

and therefore the dual function equals

$$J_D(v) := (1-\gamma) \sum_s v_s \mu_s + \frac{1}{\eta} \log\!\left(\sum_{s,a} \exp(\eta A^v_{s,a})\, q_{s,a}\right), \qquad \text{(Dual Reg-$v$)}$$

and the dual problem equals the unconstrained minimization problem

$$\min_v \; J_D(v). \qquad (5)$$

The objective of REPS is to find the minimizer $v^\star_\eta$ of Dual Reg-$v$ (with regularization level $\eta$).
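The logsumexp form of $F^\star$ and its softmax gradient quoted above follow from a short, standard computation, recorded here for completeness (this derivation is ours, not quoted from the paper):

```latex
\begin{align*}
F^\star(u) &= \max_{\lambda \in \Delta_{S\times A}} \ \langle \lambda, u\rangle
  - \tfrac{1}{\eta}\sum_{s,a}\lambda_{s,a}\log\tfrac{\lambda_{s,a}}{q_{s,a}} .
\intertext{Introducing a Lagrange multiplier $\kappa$ for the constraint
$\sum_{s,a}\lambda_{s,a}=1$ and setting the derivative in $\lambda_{s,a}$ to zero gives}
0 &= u_{s,a} - \tfrac{1}{\eta}\Big(\log\tfrac{\lambda_{s,a}}{q_{s,a}} + 1\Big) - \kappa
\quad\Longrightarrow\quad
\lambda_{s,a} = \frac{q_{s,a}\, e^{\eta u_{s,a}}}{\sum_{s',a'} q_{s',a'}\, e^{\eta u_{s',a'}}}
= [\nabla F^\star(u)]_{s,a}.
\intertext{Substituting this maximizer back into the objective yields}
F^\star(u) &= \tfrac{1}{\eta}\log\Big(\sum_{s,a} q_{s,a}\, e^{\eta u_{s,a}}\Big).
\end{align*}
```

Since $\frac{1}{\eta} D_{\mathrm{KL}}(\cdot \,\|\, q)$ is $\frac{1}{\eta}$-strongly convex with respect to $\|\cdot\|_1$ over the simplex (Pinsker's inequality), the conjugate-duality result stated as Lemma 1 below implies that $F^\star$ is $\eta$-smooth with respect to $\|\cdot\|_\infty$.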
Algorithm 1 raises the two practical issues discussed in Section 1. Specifically, optimization algorithms applied to REPS will typically only give guarantees on the near-optimality of $\hat v_\epsilon$. We will need to translate near-optimality of $\hat v_\epsilon$ into near-primal-optimality (w.r.t. $J_P(\lambda^\star_\eta)$) of $\hat\lambda_\epsilon$, and then translate that into near-optimality of the final returned policy $\pi_{\hat\lambda_\epsilon}$. Secondly, first-order optimization of the REPS objective requires access to the gradient $\nabla_v J_D(v)$, which involves computing $\nabla F^\star(A^v)$. Exact computation of this quantity is often infeasible in practical scenarios where one does not have access to $P$, but rather only stochastic generative access to samples from $P$. We show how to compute approximate (biased) gradients of $J_D(v)$ using samples from the distribution $q$ (here thought of as the visitation distribution of a behavior policy) and how to use them to derive convergence rates for Relative Entropy Policy Search.

5 Relative Entropy Policy Search

We start by deriving some general results regarding the geometry of regularized linear programs. Our first result (Lemma 2) characterizes the smoothness properties of the dual of a regularized LP. This will prove crucial in later sections, where we make use of this result to derive convergence rates for the REPS objective. We start by recalling the definitions of strong convexity and smoothness of a function.

Definition 6. A function $f : \mathbb{R}^n \to \mathbb{R}$ is $\beta$-strongly convex w.r.t. a norm $\|\cdot\|$ if $f(x) \geq f(y) + \langle \nabla f(y), x - y \rangle + \frac{\beta}{2}\|x - y\|^2$.

Let us also define smoothness:

Definition 7. A function $h$ is $\xi$-smooth¹ w.r.t. a norm $\|\cdot\|_\star$ if $h(u) \leq h(w) + \langle \nabla h(w), u - w \rangle + \frac{\xi}{2}\|u - w\|_\star^2$.

¹Smoothness is independent of the convexity properties of $h$.

We will now characterize the smoothness properties of the dual of a regularized linear program. Let us start by considering the generic linear program

$$\max_{\lambda \in \mathcal{D}} \; \langle r, \lambda \rangle, \quad \text{s.t.} \quad E\lambda = b,$$

where $r \in \mathbb{R}^n$, $E \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $\mathcal{D}$ is a convex domain. Let us regularize this objective using a function $F$ that is $\beta$-strongly convex with respect to the norm $\|\cdot\|$:

$$\max_{\lambda \in \mathcal{D}} \; \langle r, \lambda \rangle - F(\lambda), \quad \text{s.t.} \quad E\lambda = b. \qquad \text{(Reg LP)}$$

The Lagrangian of problem Reg LP is given by $g_L(\lambda, v) = \langle r, \lambda \rangle - F(\lambda) + \sum_{i=1}^m v_i \left(b_i - (E\lambda)_i\right)$. Therefore, the dual function $g_D : \mathbb{R}^m \to \mathbb{R}$ of the original primal regularized LP is

$$g_D(v) = \langle v, b \rangle + \max_{\lambda \in \mathcal{D}} \langle \lambda, r - E^\top v \rangle - F(\lambda) = \langle v, b \rangle + F^\star(r - E^\top v),$$

where the last equality follows from the definition of the Fenchel conjugate of $F$. It is possible to relate the smoothness properties of $F^\star$ to the strong convexity of $F$. A crucial result that we will use is the following:

Lemma 1. If $F$ is $\beta$-strongly convex w.r.t. $\|\cdot\|$ over $\mathcal{D}$, then $F^\star$ is $\frac{1}{\beta}$-smooth w.r.t. the dual norm $\|\cdot\|_\star$.

Definitions 6 and 7 are stated in terms of a generic norm $\|\cdot\|$ and its dual $\|\cdot\|_\star$. When applied to the REPS objective in Equation 2, using these general-norm definitions of smoothness and strong convexity allows us to obtain guarantees with a milder dependence on $|S|$ and $|A|$ than would be possible if we were to use their $\ell_2$-norm characterization instead. We can use the result of Lemma 1 to characterize the smoothness properties of the dual function of a generic regularized LP.

Lemma 2. Consider the regularized LP Reg LP with $r \in \mathbb{R}^n$, $E \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and where $F$ is $\beta$-strongly convex w.r.t. the norm $\|\cdot\|$. The dual function $g_D : \mathbb{R}^m \to \mathbb{R}$ of this regularized LP is $\frac{\|E\|_{\|\cdot\|,\star}^2}{\beta}$-smooth w.r.t. the dual norm $\|\cdot\|_\star$, where we use $\|E\|_{\|\cdot\|,\star}$ to denote the $\|\cdot\|$ norm of the vector of $\|\cdot\|_\star$ norms of $E$'s rows.

As a consequence of Lemma 2, we can bound the smoothness parameter of $J_D$ in the REPS objective:

Lemma 3. The dual function $J_D(v)$ is $(|S| + 1)\eta$-smooth in the $\|\cdot\|_\infty$ norm.

5.1 Structural results for the REPS objective

Armed with Lemma 2, we are ready to derive some useful structural properties of the REPS objective. In this section we present two main results. First, we show that under some mild assumptions it is possible to relate the gradient magnitude of any candidate solution of $J_D$ to its suboptimality gap; second, we show an $\ell_\infty$ bound on the norm of the optimal dual solution $v^\star_\eta$. For most of the analysis we make the following assumptions.

Assumption 2. There is $\beta > 0$ such that $q_{s,a} \geq \beta$ for all $(s,a) \in S \times A$.

We introduce the following assumption on the discounted state visitation distribution of arbitrary policies in the MDP, paraphrased from Wang [27]:

Assumption 3. There exists $\rho > 0$ such that for any policy $\pi$, the discounted state visitation distribution $\lambda^\pi$ (defined as $\lambda^\pi_s = \sum_a \lambda^\pi_{s,a}$) satisfies $\lambda^\pi_s \geq \rho$ for all states $s \in S$.

Suppose we have a candidate dual solution $\tilde v$ for $J_D(v)$ in Dual Reg-$v$, with corresponding candidate primal solution

$$\tilde\lambda = \frac{\exp(\eta A^{\tilde v}) \circ q}{\tilde Z}, \quad \text{where} \quad \tilde Z = \sum_{s,a} \exp(\eta A^{\tilde v}_{s,a})\, q_{s,a}$$

and the operators $\exp$ and $\circ$ act pointwise. We denote the corresponding candidate policy (computed using Equation 1) associated with $\tilde v$ as $\tilde\pi(a \mid s)$. This candidate policy induces a discounted visitation distribution $\lambda^{\tilde\pi}$ that may be substantially different from $\tilde\lambda$. We now show that it is possible to control the deviation of the primal objective value of $\lambda^{\tilde\pi}$ from $J_P(\lambda^\star_\eta)$ in terms of $\|\nabla_v J_D(\tilde v)\|_1$:

Lemma 4. Let $\tilde v \in \mathbb{R}^{|S|}$ be arbitrary and let $\tilde\lambda$ be its corresponding candidate primal variable. If $\|\nabla_v J_D(\tilde v)\|_1 \leq \epsilon'$ and Assumptions 2 and 3 hold, then whenever $|S| \geq 2$,

$$J_P(\lambda^{\tilde\pi}) \;\geq\; J_P(\lambda^\star_\eta) \;-\; \epsilon' \cdot O\!\left(\frac{1}{1-\gamma} + \|\tilde v\|_\infty + \frac{1 + \log\frac{1}{\epsilon'^3\beta}}{\eta}\right),$$

where $\lambda^\star_\eta$ is the $J_P$ optimum.
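The quantities appearing in Lemma 4 (the dual objective $J_D$, its gradient, the candidate primal $\tilde\lambda$, and the candidate policy) are straightforward to compute when $P$ is known. The following minimal sketch, with our own illustrative names and a plain NumPy implementation rather than anything from the paper, computes them and checks the gradient formula by finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, eta = 5, 3, 0.9, 5.0

P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)   # P[a, s, s'] = P_a(s'|s)
r = rng.random((nS, nA))                                           # rewards in [0, 1]
mu = np.full(nS, 1.0 / nS)                                         # initial distribution
q = np.full((nS, nA), 1.0 / (nS * nA))                             # reference distribution

def advantage(v):
    # A^v_{s,a} = r_{s,a} - v_s + gamma * sum_{s'} P_a(s'|s) v_{s'}
    return r - v[:, None] + gamma * np.einsum('asz,z->sa', P, v)

def dual_objective(v):
    # J_D(v) = (1-gamma) <mu, v> + (1/eta) log sum_{s,a} q_{s,a} exp(eta A^v_{s,a})
    A = advantage(v)
    m = (eta * A).max()                                            # stabilized logsumexp
    return (1 - gamma) * mu @ v + (np.log((q * np.exp(eta * A - m)).sum()) + m) / eta

def candidate_primal(v):
    # tilde-lambda: q-weighted softmax of eta * A^v, i.e. the gradient of F* at A^v
    A = advantage(v)
    w = q * np.exp(eta * (A - A.max()))
    return w / w.sum()

def dual_gradient(v):
    # (grad J_D(v))_s = (1-gamma) mu_s + sum_{s',a} lam_{s',a} (gamma P_a(s|s') - 1(s'=s))
    lam = candidate_primal(v)
    return (1 - gamma) * mu + gamma * np.einsum('asz,sa->z', P, lam) - lam.sum(axis=1)

def policy_from(lam):
    # Equation 1
    return lam / lam.sum(axis=1, keepdims=True)

v = rng.standard_normal(nS)
g = dual_gradient(v)
eps = 1e-5
g_fd = np.array([(dual_objective(v + eps * np.eye(nS)[s]) -
                  dual_objective(v - eps * np.eye(nS)[s])) / (2 * eps) for s in range(nS)])
print(np.abs(g - g_fd).max())                  # ~0: the analytic gradient matches
print(policy_from(candidate_primal(v)))        # the candidate policy associated with v
```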
We finish this section by proving a bound on the norm of the dual variables. This bound will inform our optimization algorithms, as it will allow us to set up the right constraints.

Lemma 5. Under Assumptions 1, 2 and 3, the optimal dual variables are bounded as

$$\|v^\star_\eta\|_\infty \;\leq\; \frac{1 + c_0}{1-\gamma}, \qquad (7)$$

where $c_0$ is a constant depending on $\eta$ and $\beta$. From now on we use the notation $D$ to refer to the quantity on the right-hand side of Equation 7.

5.2 Convergence rates

As a warm-up, we derive convergence rates for the case when we have access to exact knowledge of the transition dynamics $P$, and therefore to exact gradients. We analyze the effect of using Accelerated Gradient Descent (see Section F for a full description of the algorithm) on the REPS objective $J_D(v)$.

Theorem 1 (Accelerated Gradient Descent for general norms; Theorem 4.1 in Allen-Zhu & Orecchia [3]). Let $D_\star$ be an upper bound on $\|x_0 - x^\star\|_2^2$. Given a $\xi$-smooth function $h$ w.r.t. the $\|\cdot\|_\star$ norm over domain $\mathcal{D}$, $T$ iterations of Algorithm 4 ensure $h(y_T) - h(x^\star) \leq \frac{4\xi D_\star}{T^2}$.

We want to find almost-optimal solutions (in function value). Let us define an $\epsilon$-optimal solution:

Definition 8. Let $\epsilon > 0$. We say that $x$ is an $\epsilon$-optimal solution of a $\xi$-smooth function $h : \mathbb{R}^d \to \mathbb{R}$ if $h(x) - h(x^\star) \leq \epsilon$, where $h(x^\star) = \min_{x \in \mathbb{R}^d} h(x)$.

We can also show the following bound on the gradient norm of any $\epsilon$-optimal solution of $h$.

Lemma 6. If $x$ is an $\epsilon$-optimal solution for the $\xi$-smooth function $h : \mathbb{R}^d \to \mathbb{R}$ w.r.t. the norm $\|\cdot\|_\star$, then the gradient of $h$ at $x$ satisfies $\|\nabla h(x)\| \leq \sqrt{2\xi\epsilon}$, where $\|\cdot\|$ is the dual norm of $\|\cdot\|_\star$.

When $h = J_D$, the Dual Reg-$v$ function in the reinforcement learning setting, we set $\|\cdot\|_\star = \|\cdot\|_\infty$ and $\|\cdot\| = \|\cdot\|_1$. We are ready to prove convergence guarantees for Algorithm 4 when applied to the objective $J_D$.

Lemma 7. Let Assumptions 1, 2 and 3 hold. Let $\mathcal{D} = \{v : \|v\|_\infty \leq D\}$ and let $c_0$ be the constant from Lemma 5, so that $D = \frac{1+c_0}{1-\gamma}$. After $T$ steps of Algorithm 4, the objective function $J_D$ evaluated at the iterate $\bar v_T = y_T$ satisfies:

$$J_D(\bar v_T) - J_D(v^\star_\eta) \;\leq\; \frac{4\eta\,(|S|+1)^2\,(1+c_0)^2}{(1-\gamma)^2\,T^2}.$$

Proof. This result follows simply by invoking the guarantees of Theorem 1, making use of the fact that $J_D$ is $(|S|+1)\eta$-smooth as proven in Lemma 3, observing that as a consequence of Lemma 5 we have $v^\star_\eta \in \mathcal{D}$, and using the inequality $\|x\|_2^2 \leq |S|\,\|x\|_\infty^2$ for $x \in \mathbb{R}^{|S|}$.

Lemma 7 can easily be turned into the following guarantee on the final dual function value:

Corollary 1. Let $\epsilon > 0$. If Algorithm 4 is run for at least $T$ rounds, where $T \geq \frac{2\eta^{1/2}(|S|+1)(1+c_0)}{(1-\gamma)\sqrt{\epsilon}}$, then $\bar v_T$ is an $\epsilon$-optimal solution for the dual objective $J_D$.

If $T$ satisfies the conditions of Corollary 1, a simple use of Lemma 6 allows us to bound the $\ell_1$ norm of the dual function's gradient at $\bar v_T$: $\|\nabla J_D(\bar v_T)\|_1 \leq \sqrt{2(|S|+1)\eta\,\epsilon}$. Denoting by $\pi_T$ the policy induced by $\lambda_{\bar v_T}$, and recalling that $\lambda^\star_\eta$ is the candidate primal solution corresponding to $v^\star_\eta$, a simple application of Lemma 4 then bounds $J_P(\lambda^\star_\eta) - J_P(\lambda^{\pi_T})$ by $\sqrt{2(|S|+1)\eta\,\epsilon}$ times a factor involving $D$ and $\frac{1+\log\frac{|S||A|}{\beta}}{\eta}$. The following is the equivalent notion of $\epsilon$-optimality for regularized objectives:

Definition 9. Let $\epsilon > 0$. We say $\tilde\pi$ is an $\epsilon$-optimal regularized policy if $J_P(\lambda^{\tilde\pi}) \geq J_P(\lambda^\star_\eta) - \epsilon$.

This leads us to the main result of this section:

Corollary 2. For any $\epsilon > 0$, let $c_0'' = 1 + \log\frac{|S||A|}{\beta^2\epsilon^4}$. If $T \geq \frac{4\eta\,(|S|+1)^{3/2}\,(2+c_0'')}{(1-\gamma)^2\,\epsilon}$, then $J_P(\lambda^{\pi_T}) \geq J_P(\lambda^\star_\eta) - \epsilon$. Thus Algorithm 4 achieves an $O\!\left(\frac{1}{(1-\gamma)^2\epsilon}\right)$ rate of convergence to an $\epsilon$-optimal regularized policy.

We proceed to show that an appropriate choice of $\eta$ can be leveraged to obtain an $\epsilon$-optimal policy.

Theorem 2. For any $\epsilon > 0$, let $\eta = \frac{2}{\epsilon}\log\frac{|S||A|}{\beta}$. If $T \geq \frac{(|S|+1)^{3/2}\,(2+c_0'')^2}{(1-\gamma)^2\,\epsilon^2}$, then $\pi_T$ is an $\epsilon$-optimal policy.

The main difficulty in deriving the guarantees of Theorem 2 lies in the need to translate the function-value optimality guarantees of Accelerated Gradient Descent into $\epsilon$-optimality guarantees for the candidate policy $\pi_T$. This is where our results from Lemma 4 have proven fundamental. It remains to show that it is possible to obtain an $\epsilon$-optimal policy when access to the true model is only available via samples.
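To illustrate the exact-gradient setting analyzed above, the sketch below runs plain projected gradient descent on $J_D$ (rather than the Accelerated Gradient Descent of Algorithm 4, which is described in Section F) on a small random MDP with known $P$, and compares the extracted policy against the optimum computed by value iteration. It is our own toy example; the projection radius, step size, and iteration count are heuristic stand-ins rather than the values prescribed by the theory:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma, eta = 4, 2, 0.9, 20.0

P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)   # P_a(s'|s)
r = rng.random((nS, nA))
mu = np.full(nS, 1.0 / nS)
q = np.full((nS, nA), 1.0 / (nS * nA))

def advantage(v):                       # same helpers as in the earlier sketch
    return r - v[:, None] + gamma * np.einsum('asz,z->sa', P, v)

def candidate_primal(v):
    A = advantage(v)
    w = q * np.exp(eta * (A - A.max()))
    return w / w.sum()

def dual_gradient(v):
    lam = candidate_primal(v)
    return (1 - gamma) * mu + gamma * np.einsum('asz,sa->z', P, lam) - lam.sum(axis=1)

# Projected gradient descent on J_D over {v : ||v||_inf <= D}.
D = 2.0 / (1 - gamma)                   # generous radius in the spirit of Lemma 5
lr = 1.0 / (4 * eta)                    # roughly 1/(l2-smoothness of J_D); a heuristic
v = np.zeros(nS)
for _ in range(20000):
    v = np.clip(v - lr * dual_gradient(v), -D, D)

lam_hat = candidate_primal(v)
pi_hat = lam_hat / lam_hat.sum(axis=1, keepdims=True)   # Equation 1

def policy_value(pi):                   # v^pi from the Bellman equations
    P_pi = np.einsum('sa,asz->sz', pi, P)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, (pi * r).sum(axis=1))

v_star = np.zeros(nS)                   # optimal values via value iteration, for reference
for _ in range(1000):
    v_star = (r + gamma * np.einsum('asz,z->sa', P, v_star)).max(axis=1)

print(np.linalg.norm(dual_gradient(v), 1))            # shrinks with more iterations (Lemma 6)
print(np.abs(policy_value(pi_hat) - v_star).max())    # shrinks as eta and T grow (Theorem 2)
```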
6 Stochastic Gradients

In this section we show how to obtain stochastic (albeit biased) gradient estimators $\widehat\nabla_v J_D(v)$ of $\nabla_v J_D(v)$ (see Algorithm 2). We use $\widehat\nabla_v J_D(v)$ to perform biased stochastic gradient descent steps on $J_D(v)$ (see Algorithm 3). In Lemma 8 we prove guarantees on the bias and variance of this estimator, and we show rates of convergence in function value to the optimum of $J_D(v)$ in Lemma 10. We turn these results into guarantees on the optimality of the final candidate policy in Theorem 3.

Let us start by noting that

$$(\nabla_v J_D(v))_s = (1-\gamma)\mu_s + \mathbb{E}_{(s',a)\sim q,\; s''\sim P_a(\cdot \mid s')}\!\left[ B^v_{s',a}\left(\gamma\,\mathbf{1}(s''=s) - \mathbf{1}(s'=s)\right)\right],$$

where $B^v_{s,a} = \frac{\exp(\eta A^v_{s,a})}{Z}$ and $Z = \sum_{s,a} \exp(\eta A^v_{s,a})\, q_{s,a}$. We will make use of this characterization to devise a plug-in estimator of this quantity.

Algorithm 2 Biased Gradient Estimator
Input: number of samples $t$.
Collect samples $\{(s_\ell, a_\ell, s'_\ell)\}_{\ell=1}^{t}$ such that $(s_\ell, a_\ell) \sim q$ and $s'_\ell \sim P_{a_\ell}(\cdot \mid s_\ell)$.
For each $(s,a) \in S \times A$:
  Build the empirical estimators $\widehat A^v(t) \in \mathbb{R}^{|S| \times |A|}$ and $\widehat q(t) \in \mathbb{R}^{|S| \times |A|}$.
  Compute the estimators $\widehat B^v_{s,a}(t) = \frac{\exp(\eta \widehat A^v_{s,a}(t))}{\widehat Z(t)}$, where $\widehat Z(t) = \sum_{s,a} \exp(\eta \widehat A^v_{s,a}(t))\, \widehat q_{s,a}(t)$.
Produce a final sample $(s_{t+1}, a_{t+1}) \sim q$ and $s'_{t+1} \sim P_{a_{t+1}}(\cdot \mid s_{t+1})$.
Compute $\widehat\nabla_v J_D(v)$ such that

$$(\widehat\nabla_v J_D(v))_s = (1-\gamma)\mu_s + \widehat B^v_{s_{t+1},a_{t+1}}(t)\left(\gamma\,\mathbf{1}(s'_{t+1}=s) - \mathbf{1}(s_{t+1}=s)\right).$$

Output: $\widehat\nabla_v J_D(v)$.

We now proceed to bound the bias of this estimator:

Lemma 8. Let $\delta, \epsilon \in (0,1)$ with $\epsilon \leq \min(\beta, \frac14)$. With probability at least $1-\delta$, for all $t \in \mathbb{N}$ such that $t \geq \ln\ln(2t)\,\frac{120\left(\ln\frac{41.6|S||A|}{\delta}+1\right)}{\beta\,\epsilon^2}\max\!\left(480,\ \eta^2\gamma^2\|v\|_\infty^2\right)$, the plug-in estimator $\widehat\nabla_v J_D(v)$ satisfies

$$\max_{u\in\{1,2,\infty\}} \big\|\hat g - \mathbb{E}_{t+1}[\hat g]\big\|_u \leq \frac{8}{\beta}, \qquad \max_{u\in\{1,2,\infty\}} \big\|\mathbb{E}_{t+1}[\hat g] - g\big\|_u \leq 8\epsilon,$$

together with a corresponding bound on $\mathbb{E}_{t+1}\big[\|\hat g - \mathbb{E}_{t+1}[\hat g]\|_2^2\big]$, where $\hat g = \widehat\nabla_v J_D(v)$, $g = \nabla_v J_D(v)$, and $\mathbb{E}_{t+1}[\cdot] = \mathbb{E}_{s_{t+1},a_{t+1},s'_{t+1}}[\,\cdot \mid \widehat B^v(t)]$.

We will now make use of Lemma 8, along with the following guarantee for projected Stochastic Gradient Descent, to prove convergence guarantees for Algorithm 3.

Algorithm 3 Biased Stochastic Gradient Descent
Input: desired accuracy $\epsilon$, learning rates $\{\tau_t\}_{t=1}^{\infty}$, and a number-of-samples function $n : \mathbb{N} \to \mathbb{N}$.
Initialize $v_0 = 0$.
For $t = 1, \ldots, T$:
  Obtain $\widehat\nabla_v J_D(v_{t-1})$ with $n(t)$ samples via Algorithm 2.
  Perform the update $v'_t \leftarrow v_{t-1} - \tau_t \widehat\nabla_v J_D(v_{t-1})$; $v_t \leftarrow \Pi_{\mathcal{D}}(v'_t)$, where $\Pi_{\mathcal{D}}$ denotes the projection onto $\mathcal{D} = \{v : \|v\|_\infty \leq D\}$.
Output: $\bar v_T$ (the averaged iterate).

The following holds:

Lemma 9. Let $f : \mathbb{R}^d \to \mathbb{R}$ be an $L$-smooth function. We consider the update $x'_{t+1} = x_t - \tau\left(\nabla f(x_t) + \zeta_t + b_t\right)$, $x_{t+1} = \Pi_{\mathcal{D}}(x'_{t+1})$, where $\zeta_t$ plays the role of zero-mean noise and $b_t$ of bias. Then

$$f(x_{t+1}) - f(x^\star) \leq \frac{\|x_t - x^\star\|_2^2 - \|x_{t+1} - x^\star\|_2^2}{2\tau} + 2\tau\|\nabla f(x_t)\|_2^2 + 5\tau\|b_t\|_2^2 + 5\tau\|\zeta_t\|_2^2 + \|b_t\|_1\|x_t - x^\star\|_\infty - \langle \zeta_t, x_t - x^\star\rangle.$$

Lemma 8 implies the following guarantee for the projected stochastic gradient algorithm with biased gradients $\widehat\nabla_v J_D(v)$:

Lemma 10. We assume $\eta \leq \frac{4}{\beta}$. Set $\epsilon_t = \frac{8|S|D}{\sqrt{t}}$ and $\tau_t = \frac{1}{16|S|\sqrt{t}}$. Suppose we take $t$ gradient steps using $n(t)$ samples from $q \otimes P$ at each step (possibly reusing the samples for multiple gradient computations), with $n(t) \gtrsim \frac{t}{\beta\,|S|^2 D^2}\ln\frac{100|S||A|t^2}{\delta}$. Then with probability at least $1 - 3\delta$, simultaneously for all $t \in \mathbb{N}$ such that $t \geq 64|S|^2\eta^2 D^2$, we have $J_D(\bar v_t) \leq J_D(v^\star_\eta) + \widetilde O\!\left(\frac{|S|\,D^2}{\sqrt{t}}\right)$.

Lemma 10 implies that, making use of $N$ samples, it is possible to find a candidate $\bar v_N$ such that $J_D(\bar v_N) \leq J_D(v^\star_\eta) + \widetilde O(\cdot)$. This in turn implies, by a simple use of Lemma 6, that $\|\nabla J_D(\bar v_N)\|_1 \leq \widetilde O\!\left(\frac{|S|^{1/2}D}{\sqrt{\beta}\,N^{1/4}}\right)$. If we denote by $\pi_N$ the policy induced by $\lambda_{\bar v_N}$, a simple application of Lemma 4 yields

$$J_P(\lambda^{\pi_N}) \;\geq\; J_P(\lambda^\star_\eta) - \widetilde O\!\left(\frac{|S|^{1/2}D}{(1-\gamma)\sqrt{\beta}\,N^{1/4}}\right).$$

Thus Algorithm 3 achieves an $O\!\left(\frac{1}{(1-\gamma)^8\epsilon^4}\right)$ rate of convergence to an $\epsilon$-optimal regularized policy. We proceed to show how $\eta$ can be set to obtain an $\epsilon$-optimal policy:

Theorem 3 (Informal). For any $\epsilon > 0$, let $\eta = \frac{2}{\epsilon}\log\frac{|S||A|}{\beta}$. If $N \geq \widetilde O\!\left(\frac{1}{\epsilon^8(1-\gamma)^8\beta^2}\right)$, then with probability at least $1-\delta$ it is possible to find a candidate $\bar v_N$ such that $\pi_N$ is an $\epsilon$-optimal policy.
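To make Algorithms 2 and 3 concrete, here is a minimal simulation sketch. It is our own illustrative code: the step-size schedule, the sample schedule $n(t)$, and the projection radius are heuristic stand-ins for the values prescribed by Lemmas 8 and 10, and unvisited state-action pairs are handled by a simple default rather than anything specified in the paper. It builds the plug-in estimator $\widehat\nabla_v J_D(v)$ from samples $(s,a)\sim q$, $s'\sim P_a(\cdot\mid s)$ and runs biased projected SGD:

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma, eta = 5, 3, 0.9, 10.0

P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)   # unknown to the learner
r = rng.random((nS, nA))
mu = np.full(nS, 1.0 / nS)
q = np.full((nS, nA), 1.0 / (nS * nA))        # behavior visitation distribution
q_flat = q.ravel()

def sample_q_P(n):
    """Draw (s, a) ~ q and s' ~ P_a(. | s), n times (vectorized categorical sampling)."""
    idx = rng.choice(nS * nA, size=n, p=q_flat)
    s, a = np.divmod(idx, nA)
    cdf = P[a, s].cumsum(axis=1)
    s_next = (rng.random((n, 1)) > cdf).sum(axis=1)
    return s, a, s_next

def biased_gradient(v, n):
    """Plug-in estimator in the spirit of Algorithm 2 (illustrative simplification)."""
    s, a, s_next = sample_q_P(n)
    counts = np.zeros((nS, nA)); vnext = np.zeros((nS, nA))
    np.add.at(counts, (s, a), 1.0)
    np.add.at(vnext, (s, a), v[s_next])
    q_hat = counts / n                                            # hat q(t)
    ev_next = np.divide(vnext, counts, out=np.zeros_like(vnext), where=counts > 0)
    A_hat = r - v[:, None] + gamma * ev_next                      # hat A^v(t); unvisited pairs default to 0
    w = np.exp(eta * (A_hat - A_hat.max()))
    B_hat = w / (w * q_hat).sum()                                 # hat B^v = exp(eta hat A^v) / hat Z
    s1, a1, s1n = sample_q_P(1)                                   # one fresh sample for the stochastic part
    g = (1 - gamma) * mu
    g[s1n[0]] += gamma * B_hat[s1[0], a1[0]]
    g[s1[0]] -= B_hat[s1[0], a1[0]]
    return g

# Biased projected SGD (Algorithm 3) with heuristic schedules.
D = 2.0 / (1 - gamma)
v = np.zeros(nS)
T = 2000
for t in range(1, T + 1):
    lr = 1.0 / (16 * nS * np.sqrt(t))            # schedule in the spirit of Lemma 10
    g = biased_gradient(v, n=100 + 2 * t)        # growing per-step sample size n(t)
    v = np.clip(v - lr * g, -D, D)

# Inspect the result: candidate policy and exact dual gradient norm (uses the true P, for diagnostics only).
A = r - v[:, None] + gamma * np.einsum('asz,z->sa', P, v)
lam = q * np.exp(eta * (A - A.max())); lam /= lam.sum()
pi_hat = lam / lam.sum(axis=1, keepdims=True)                     # Equation 1
g_exact = (1 - gamma) * mu + gamma * np.einsum('asz,sa->z', P, lam) - lam.sum(axis=1)
print(np.linalg.norm(g_exact, 1))    # should decrease as T and n(t) grow (Lemmas 6 and 10)
print(pi_hat)
```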
7 Conclusion

This work presents an analysis of first-order optimization methods for the REPS objective in reinforcement learning. We prove a convergence rate of $O(1/\epsilon^2)$ for accelerated gradient descent on the dual of the KL-regularized max-return LP in the case of a known transition function. For the unknown case, we propose a biased stochastic gradient descent method relying only on samples from a behavior policy and show that it converges to an $\epsilon$-optimal policy at a rate of $O(1/\epsilon^8)$. Several interesting questions remain open. First, while directly optimizing the dual via gradient methods is convenient from an algorithmic perspective, prior unregularized saddle-point methods have been shown to achieve a faster $O(1/\epsilon)$ convergence [4]. An important open direction is thus to understand whether faster rates are possible in order to bridge this gap, or whether optimizing the regularized dual directly is fundamentally limited. Second, we only considered MDPs with finite state and action spaces. It is therefore of interest to see whether these ideas readily extend to infinite or very large spaces through function approximation.

Acknowledgments and Disclosure of Funding

This research was supported by the Google-BAIR Commons program. JL is supported by an NSF GRFP fellowship.

References

[1] Abdolmaleki, A., Lioutikov, R., Peters, J. R., Lau, N., Reis, L. P., and Neumann, G. Model-based relative entropy stochastic search. In Advances in Neural Information Processing Systems, pp. 3537-3545, 2015.
[2] Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018.
[3] Allen-Zhu, Z. and Orecchia, L. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537, 2014.
[4] Bas-Serrano, J. and Neu, G. Faster saddle-point optimization for solving large-scale Markov decision processes. arXiv preprint arXiv:1909.10904, 2019.
[5] Bas-Serrano, J., Curi, S., Krause, A., and Neu, G. Logistic Q-learning, 2020.
[6] Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.
[7] Belousov, B. and Peters, J. f-divergence constrained policy improvement. arXiv preprint arXiv:1801.00056, 2017.
[8] Boularias, A., Kober, J., and Peters, J. Relative entropy inverse reinforcement learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 182-189, 2011.
[9] Chen, Y. and Wang, M. Stochastic primal-dual methods and sample complexity of reinforcement learning. arXiv preprint arXiv:1612.02516, 2016.
[10] Chen, Y., Li, L., and Wang, M. Scalable bilinear learning using state and action features. arXiv preprint arXiv:1804.10328, 2018.
[11] Cheng, C.-A., Combes, R. T., Boots, B., and Gordon, G. A reduction from reinforcement learning to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pp. 3514-3524. PMLR, 2020.
[12] Daniel, C., Neumann, G., and Peters, J. Hierarchical relative entropy policy search. In Artificial Intelligence and Statistics, pp. 273-281, 2012.
[13] Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates, 2017.
[14] Geist, M., Scherrer, B., and Pietquin, O. A theory of regularized Markov decision processes. arXiv preprint arXiv:1901.11275, 2019.
[15] Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
[16] Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv preprint arXiv:1810.08240, 2018.
[17] Jin, Y. and Sidford, A. Efficiently solving MDPs with stochastic mirror descent. In International Conference on Machine Learning, pp. 4890-4900. PMLR, 2020.
[18] Kostrikov, I., Nachum, O., and Tompson, J. Imitation learning via off-policy distribution matching, 2019.
[19] Nachum, O. and Dai, B. Reinforcement learning via Fenchel-Rockafellar duality. arXiv preprint arXiv:2001.01866, 2020.
[20] Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2775-2785, 2017.
[21] Nachum, O., Dai, B., Kostrikov, I., Chow, Y., Li, L., and Schuurmans, D. AlgaeDICE: Policy gradient from arbitrary experience, 2019.
[22] Neu, G., Jonsson, A., and Gómez, V. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
[23] Peters, J., Mülling, K., and Altun, Y. Relative entropy policy search. In AAAI, volume 10, pp. 1607-1612. Atlanta, 2010.
[24] Puterman, M. L. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.
[25] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897, 2015.
[26] Vieillard, N., Kozuno, T., Scherrer, B., Pietquin, O., Munos, R., and Geist, M. Leverage the average: An analysis of regularization in RL. arXiv preprint arXiv:2003.14089, 2020.
[27] Wang, M. Primal-dual learning: Sample complexity and sublinear run time for ergodic Markov decision problems. arXiv preprint arXiv:1710.06100, 2017.
[28] Wang, M. Randomized linear programming solves the discounted Markov decision problem in nearly-linear (sometimes sublinear) running time. arXiv preprint arXiv:1704.01869, 2017.
[29] Zimin, A. and Neu, G. Online learning in episodic Markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems, pp. 1583-1591, 2013.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] See Related Work and Conclusion.
(c) Did you discuss any potential negative societal impacts of your work? [N/A] Theoretical paper analyzing existing algorithms.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] Assumptions are labeled "Assumption" and numbered throughout.
(b) Did you include complete proofs of all theoretical results? [Yes] See Appendix.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A]
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?
[N/A] (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A] 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [N/A] (b) Did you mention the license of the assets? [N/A] (c) Did you include any new assets either in the supplemental material or as a URL? [N/A] (d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] 5. If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]