# Provably Efficient Exploration in Policy Optimization

Qi Cai 1, Zhuoran Yang 2, Chi Jin 3, Zhaoran Wang 1

Abstract

While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an optimistic version of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves an $\tilde{O}(\sqrt{d^2H^3T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.1

1. Introduction

Coupled with powerful function approximators such as neural networks, policy optimization plays a key role in the tremendous empirical successes of deep reinforcement learning (Silver et al., 2016; 2017; Duan et al., 2016; OpenAI, 2019; Wang et al., 2018). In sharp contrast, the theoretical understanding of policy optimization remains rather limited from both computational and statistical perspectives. More specifically, from the computational perspective, it remained unclear until recently whether policy optimization converges to the globally optimal policy in a finite number of iterations, even given infinite data. Meanwhile, from the statistical perspective, it still remains unclear how to attain the globally optimal policy with a finite regret or sample complexity.

1 Department of Industrial Engineering and Management Sciences, Northwestern University. 2 Department of Operations Research and Financial Engineering, Princeton University. 3 Department of Electrical Engineering, Princeton University. Correspondence to: Qi Cai, Zhuoran Yang, Chi Jin, Zhaoran Wang. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s). 1 The full version of this paper is available on arXiv via https://arxiv.org/pdf/1912.05830.pdf.

A line of recent work (Fazel et al., 2018; Yang et al., 2019a; Abbasi-Yadkori et al., 2019a;b; Bhandari & Russo, 2019; Liu et al., 2019; Agarwal et al., 2019; Wang et al., 2019) answers the computational question affirmatively by proving that a wide variety of policy optimization algorithms, such as policy gradient (PG) (Williams, 1992; Baxter & Bartlett, 2000; Sutton et al., 2000), natural policy gradient (NPG) (Kakade, 2002), trust-region policy optimization (TRPO) (Schulman et al., 2015), proximal policy optimization (PPO) (Schulman et al., 2017), and actor-critic (AC) (Konda & Tsitsiklis, 2000), converge to the globally optimal policy at sublinear rates of convergence, even when they are coupled with neural networks (Liu et al., 2019; Wang et al., 2019). However, such computational efficiency guarantees rely on the regularity condition that the state space is already well explored.
Such a condition is often implied by assuming either access to a simulator (also known as the generative model) (Koenig & Simmons, 1993; Azar et al., 2011; 2012a;b; Sidford et al., 2018a;b; Wainwright, 2019) or finite concentrability coefficients (Munos & Szepesvári, 2008; Antos et al., 2008; Farahmand et al., 2010; Tosatto et al., 2017; Yang et al., 2019b; Chen & Jiang, 2019), both of which are often unavailable in practice. In a more practical setting, the agent sequentially explores the state space, and meanwhile exploits the information at hand by taking the actions that lead to higher expected total rewards. Such an exploration-exploitation tradeoff is better captured by the aforementioned statistical question regarding the regret or sample complexity, which remains even more challenging to answer than the computational question. As a result, such a lack of statistical understanding hinders the development of more sample-efficient policy optimization algorithms beyond heuristics. In fact, empirically, vanilla policy gradient is known to exhibit a possibly worse sample complexity than random search (Mania et al., 2018), even in basic settings such as linear-quadratic regulators. Meanwhile, theoretically, vanilla policy gradient can be shown to suffer from exponentially large variance in the well-known combination lock setting (Kakade, 2003; Leffler et al., 2007; Azar et al., 2012a), which only has a finite state space.

In this paper, we aim to answer the following fundamental question:

Can we design a policy optimization algorithm that incorporates exploration and is provably sample-efficient?

To answer this question, we propose the first policy optimization algorithm that incorporates exploration in a principled manner. In detail, we develop an Optimistic variant of the PPO algorithm, namely OPPO. Our algorithm is also closely related to NPG and TRPO. At each update, OPPO solves a Kullback-Leibler (KL)-regularized policy optimization subproblem, where the linear component of the objective function is defined using the action-value function. As is shown subsequently, solving such a subproblem corresponds to one iteration of infinite-dimensional mirror descent (Nemirovsky & Yudin, 1983) or dual averaging (Xiao, 2010), where the action-value function plays the role of the gradient. To encourage exploration, we explicitly incorporate a bonus function into the action-value function, which quantifies the uncertainty that arises from only observing finite historical data. Through uncertainty quantification, such a bonus function ensures the (conservative) optimism of the updated policy. Compared with NPG, TRPO, and PPO, OPPO only augments the action-value function with the bonus function in an additive manner, which makes it easily implementable in practice.
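Concretely, each OPPO update solves a KL-regularized subproblem of the following form, stated here informally as a preview; the precise version, with the estimated action-value function and the exploration bonus, is given in (3.2)-(3.3) and Algorithm 1:
$$\pi^{k}\leftarrow\mathop{\mathrm{argmax}}_{\pi}\ L^{k-1}(\pi)-\alpha^{-1}\,\mathbb{E}_{\pi^{k-1}}\big[D_{\mathrm{KL}}\big(\pi\,\|\,\pi^{k-1}\big)\big],\qquad \pi_h^{k}(\cdot\,|\,x)\propto\pi_h^{k-1}(\cdot\,|\,x)\cdot\exp\big\{\alpha\,Q_h^{k-1}(x,\cdot)\big\},$$
where $L^{k-1}$ is a local linear approximation of the expected total reward, $\alpha>0$ is a stepsize, and $Q_h^{k-1}$ is an optimistic (bonus-augmented) estimate of the action-value function.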
Theoretically, we establish the sample efficiency of OPPO in an episodic setting of Markov decision processes (MDPs) with full-information feedback, where the transition dynamics are linear in features (Yang & Wang, 2019b;a; Jin et al., 2019; Ayoub et al., 2020; Zhou et al., 2020). In particular, we allow the transition dynamics to be nonstationary within each episode. See also the work of (Du et al., 2019a; Van Roy & Dong, 2019; Lattimore & Szepesvari, 2019) for a related discussion on the necessity of the linear representation. In detail, we prove that OPPO attains a $\sqrt{d^2H^3T}$-regret up to logarithmic factors, where $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps taken by the agent. Note that such a regret does not depend on the numbers of states and actions, and therefore allows them to be even infinite. In particular, OPPO attains such a regret without knowing the transition dynamics or accessing a simulator. Moreover, we prove that, even when the reward functions are adversarially chosen across the episodes, OPPO attains the same regret in terms of competing with the globally optimal policy in hindsight (Cesa-Bianchi & Lugosi, 2006; Bubeck & Cesa-Bianchi, 2012). In comparison, existing algorithms based on value iteration, e.g., optimistic least-squares value iteration (LSVI) (Jin et al., 2019), do not allow adversarially chosen reward functions. Such a notion of robustness partially justifies the empirical advantages of KL-regularized policy optimization (Neu et al., 2017; Geist et al., 2019). To the best of our knowledge, OPPO is the first provably sample-efficient policy optimization algorithm that incorporates exploration.

1.1. Related Work

Our work is based on the aforementioned line of recent work (Fazel et al., 2018; Yang et al., 2019a; Abbasi-Yadkori et al., 2019a;b; Bhandari & Russo, 2019; Liu et al., 2019; Agarwal et al., 2019; Wang et al., 2019) on the computational efficiency of policy optimization, which covers PG, NPG, TRPO, PPO, and AC. In particular, OPPO is based on PPO (and similarly, NPG and TRPO), which is shown to converge to the globally optimal policy at sublinear rates in tabular and linear settings, as well as nonlinear settings involving neural networks (Liu et al., 2019; Wang et al., 2019). However, without assuming access to a simulator or finite concentrability coefficients, both of which imply that the state space is already well explored, it remains unclear whether any of such algorithms is sample-efficient, that is, attains a finite regret or sample complexity. In comparison, by incorporating uncertainty quantification into the action-value function at each update, which explicitly encourages exploration, OPPO not only attains the same computational efficiency as NPG, TRPO, and PPO, but is also shown to be sample-efficient with a $\sqrt{d^2H^3T}$-regret up to logarithmic factors.

Our work is closely related to another line of work (Even-Dar et al., 2009; Yu et al., 2009; Neu et al., 2010a;b; Zimin & Neu, 2013; Neu et al., 2012; Rosenberg & Mansour, 2019b;a) on online MDPs with adversarially chosen reward functions, which mostly focuses on the tabular setting. Assuming the transition dynamics are known and the full information of the reward functions is available, the work of (Even-Dar et al., 2009) establishes a $\tau^2\sqrt{T\log|\mathcal{A}|}$-regret, where $\mathcal{A}$ is the action space, $|\mathcal{A}|$ is its cardinality, and $\tau$ upper bounds the mixing time of the MDP. See also the work of (Yu et al., 2009), which establishes a $T^{2/3}$-regret in a similar setting. Assuming the transition dynamics are known but only the bandit feedback of the received rewards is available, the work of (Neu et al., 2010a;b; Zimin & Neu, 2013) establishes an $H^2\sqrt{|\mathcal{A}|T}/\beta$-regret (Neu et al., 2010b), a $T^{2/3}$-regret (Neu et al., 2010a), and a $\sqrt{H|\mathcal{S}||\mathcal{A}|T}$-regret (Zimin & Neu, 2013), respectively, all up to logarithmic factors. Here $\mathcal{S}$ is the state space and $|\mathcal{S}|$ is its cardinality. In particular, it is assumed by (Neu et al., 2010b) that, with probability at least $\beta$, any state is reachable under any policy.
Assuming the full information of the reward functions is available but the transition dynamics are unknown, the work of (Neu et al., 2012; Rosenberg & Mansour, 2019b) establishes an $H|\mathcal{S}||\mathcal{A}|\sqrt{T}$-regret (Neu et al., 2012) and an $H|\mathcal{S}|\sqrt{|\mathcal{A}|T}$-regret (Rosenberg & Mansour, 2019b), respectively, both up to logarithmic factors. Assuming the transition dynamics are unknown and only the bandit feedback of the received rewards is available, the recent work of (Rosenberg & Mansour, 2019a) establishes an $H|\mathcal{S}|\sqrt{|\mathcal{A}|T}/\beta$-regret up to logarithmic factors. In particular, it is assumed by (Rosenberg & Mansour, 2019a) that, with probability at least $\beta$, any state is reachable under any policy. Without such an assumption, an $H^{3/2}|\mathcal{S}||\mathcal{A}|^{1/4}T^{3/4}$-regret is established. In the latter two settings with unknown transition dynamics, all the existing algorithms (Neu et al., 2012; Rosenberg & Mansour, 2019b;a) follow the gradient direction with respect to the visitation measure, and thus differ from most practical policy optimization algorithms. In comparison, OPPO is not restricted to the tabular setting and indeed follows the gradient direction with respect to the policy. OPPO is simply an optimistic variant of NPG, TRPO, and PPO, which makes it also a practical policy optimization algorithm. In particular, when specialized to the tabular setting, our setting corresponds to the third setting with $d=|\mathcal{S}|^2|\mathcal{A}|$, where OPPO attains an $H^{3/2}|\mathcal{S}|^2|\mathcal{A}|\sqrt{T}$-regret up to logarithmic factors.

Broadly speaking, our work is related to a vast body of work on value-based reinforcement learning in tabular (Jaksch et al., 2010; Osband et al., 2014; Osband & Van Roy, 2016; Azar et al., 2017; Dann et al., 2017; Strehl et al., 2006; Jin et al., 2018) and linear settings (Yang & Wang, 2019b;a; Jin et al., 2019; Ayoub et al., 2020; Zhou et al., 2020), as well as nonlinear settings involving general function approximators (Wen & Van Roy, 2017; Jiang et al., 2017; Du et al., 2019b; Dong et al., 2019). In particular, our setting is the same as the linear setting studied by (Ayoub et al., 2020; Zhou et al., 2020), which generalizes the one proposed by (Yang & Wang, 2019a). We remark that our setting differs from the linear setting studied by (Yang & Wang, 2019b; Jin et al., 2019). It can be shown that the two settings are incomparable in the sense that one does not imply the other (Zhou et al., 2020). Also, our setting is related to the low-Bellman-rank setting studied by (Jiang et al., 2017; Dong et al., 2019). In comparison, we focus on policy-based reinforcement learning, which is significantly less studied in theory. In particular, compared with the work of (Yang & Wang, 2019b;a; Jin et al., 2019; Ayoub et al., 2020; Zhou et al., 2020), which focuses on value-based reinforcement learning, OPPO attains the same $\sqrt{T}$-regret even in the presence of adversarially chosen reward functions. Compared with optimism-led iterative value-function elimination (OLIVE) (Jiang et al., 2017; Dong et al., 2019), which handles the more general low-Bellman-rank setting but is only sample-efficient, OPPO simultaneously attains computational efficiency and sample efficiency in the linear setting.
Despite the differences between policy-based and value-based reinforcement learning, our work shows that the general principle of optimism in the face of uncertainty (Auer et al., 2002; Bubeck & Cesa-Bianchi, 2012) can be carried over from existing algorithms based on value iteration, e.g., optimistic LSVI, into policy optimization algorithms, e.g., NPG, TRPO, and PPO, to make them sample-efficient, which further leads to a new general principle of conservative optimism in the face of uncertainty and adversary that additionally allows adversarially chosen reward functions.

1.2. Notation

We denote by $\|\cdot\|_2$ the $\ell_2$-norm of a vector or the spectral norm of a matrix. We denote by $\Delta(\mathcal{A})$ the set of probability distributions on a set $\mathcal{A}$ and correspondingly define $\Delta(\mathcal{A}\,|\,\mathcal{S},H)=\{\{\pi_h(\cdot\,|\,x)\}_{h=1}^H:\pi_h(\cdot\,|\,x)\in\Delta(\mathcal{A})\text{ for any }x\in\mathcal{S}\text{ and }h\in[H]\}$ for any set $\mathcal{S}$ and $H\in\mathbb{Z}_+$. For $p_1,p_2\in\Delta(\mathcal{A})$, we denote by $D_{\mathrm{KL}}(p_1\,\|\,p_2)$ the KL-divergence,
$$D_{\mathrm{KL}}(p_1\,\|\,p_2)=\sum_{a\in\mathcal{A}}p_1(a)\log\frac{p_1(a)}{p_2(a)}.$$
Throughout this paper, we denote by $C,C',C'',\ldots$ absolute constants whose values can vary from line to line.

2. Preliminaries

2.1. MDPs with Adversarial Rewards

In this paper, we consider an episodic MDP $(\mathcal{S},\mathcal{A},H,\mathcal{P},r)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, respectively, $H$ is the length of each episode, $\mathcal{P}_h(\cdot\,|\,\cdot,\cdot)$ is the transition kernel from a state-action pair to the next state at the $h$-th step of each episode, and $r_h^k:\mathcal{S}\times\mathcal{A}\to[0,1]$ is the reward function at the $h$-th step of the $k$-th episode. We assume that the reward function is deterministic, which is without loss of generality, as our subsequent regret analysis readily generalizes to the setting where the reward function is stochastic.

At the beginning of the $k$-th episode, the agent determines a policy $\pi^k=\{\pi_h^k\}_{h=1}^H\in\Delta(\mathcal{A}\,|\,\mathcal{S},H)$. We assume that the initial state $x_1^k$ is fixed to $x_1\in\mathcal{S}$ across all the episodes, which is without loss of generality, as our subsequent regret analysis readily generalizes to the setting where $x_1^k$ is sampled from a fixed distribution across all the episodes. Then the agent iteratively interacts with the environment as follows. At the $h$-th step, the agent receives a state $x_h^k$ and takes an action $a_h^k$ following $\pi_h^k(\cdot\,|\,x_h^k)$. Subsequently, the agent receives the reward $r_h^k(x_h^k,a_h^k)$ and the next state $x_{h+1}^k$ following $\mathcal{P}_h(\cdot\,|\,x_h^k,a_h^k)$. The $k$-th episode ends after the agent receives the last reward $r_H^k(x_H^k,a_H^k)$.

We allow the reward function $r^k=\{r_h^k\}_{h=1}^H$ to be adversarially chosen by the environment at the beginning of the $k$-th episode, which can depend on the $(k-1)$ historical trajectories. The reward function $r_h^k$ is revealed to the agent after it takes the action $a_h^k$ at the state $x_h^k$, which together determine the received reward $r_h^k(x_h^k,a_h^k)$. We define the regret in terms of competing with the globally optimal policy in hindsight (Cesa-Bianchi & Lugosi, 2006; Bubeck & Cesa-Bianchi, 2012) as
$$\mathrm{Regret}(T)=\max_{\pi\in\Delta(\mathcal{A}\,|\,\mathcal{S},H)}\sum_{k=1}^K\big(V_1^{\pi,k}(x_1^k)-V_1^{\pi^k,k}(x_1^k)\big), \qquad (2.1)$$
where $T=HK$ is the total number of steps taken by the agent in all the $K$ episodes.

For any policy $\pi=\{\pi_h\}_{h=1}^H\in\Delta(\mathcal{A}\,|\,\mathcal{S},H)$, the value function $V_h^{\pi,k}:\mathcal{S}\to\mathbb{R}$ associated with the reward function $r^k=\{r_h^k\}_{h=1}^H$ is defined by
$$V_h^{\pi,k}(x)=\mathbb{E}_\pi\Big[\sum_{i=h}^H r_i^k(x_i,a_i)\,\Big|\,x_h=x\Big]. \qquad (2.2)$$
Here we denote by $\mathbb{E}_\pi[\cdot]$ the expectation with respect to the randomness of the state-action sequence $\{(x_h,a_h)\}_{h=1}^H$, where the action $a_h$ follows the policy $\pi_h(\cdot\,|\,x_h)$ at the state $x_h$ and the next state $x_{h+1}$ follows the transition dynamics $\mathcal{P}_h(\cdot\,|\,x_h,a_h)$. Correspondingly, we define the action-value function (also known as the Q-function) $Q_h^{\pi,k}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ as
$$Q_h^{\pi,k}(x,a)=\mathbb{E}_\pi\Big[\sum_{i=h}^H r_i^k(x_i,a_i)\,\Big|\,x_h=x,\,a_h=a\Big]. \qquad (2.3)$$
By the definitions in (2.2) and (2.3), we have the following Bellman equation,
$$V_h^{\pi,k}(x)=\big\langle Q_h^{\pi,k}(x,\cdot),\,\pi_h(\cdot\,|\,x)\big\rangle_{\mathcal{A}},\qquad Q_h^{\pi,k}=r_h^k+\mathbb{P}_h V_{h+1}^{\pi,k}, \qquad (2.4)$$
where $V_{H+1}^{\pi,k}$ is a zero function. Here $\langle\cdot,\cdot\rangle_{\mathcal{A}}$ denotes the inner product over $\mathcal{A}$, where the subscript is omitted subsequently if it is clear from the context. Also, $\mathbb{P}_h$ is the operator form of the transition kernel $\mathcal{P}_h(\cdot\,|\,\cdot,\cdot)$, which is defined by
$$(\mathbb{P}_h f)(x,a)=\mathbb{E}\big[f(x')\,\big|\,x'\sim\mathcal{P}_h(\cdot\,|\,x,a)\big] \qquad (2.5)$$
for any function $f:\mathcal{S}\to\mathbb{R}$.
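To make the interaction protocol of Section 2.1 concrete, the following is a minimal Python sketch (ours, for illustration only) of a single episode under full-information feedback, using a randomly generated tabular MDP; all names and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 5, 3, 4                               # a small tabular MDP, for illustration
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[h, x, a] = P_h(. | x, a)

def uniform_policy(x):                          # pi_h(. | x); uniform here for simplicity
    return np.ones(A) / A

def run_episode(policy, reward_fns, x1=0):
    """One episode of the protocol in Section 2.1 (full-information feedback):
    the adversary fixes r^k = {r_h^k} before the episode, but each r_h^k is
    revealed to the agent only after it takes a_h^k at x_h^k."""
    x, total = x1, 0.0
    for h in range(H):
        a = rng.choice(A, p=policy(x))          # a_h^k ~ pi_h^k(. | x_h^k)
        total += reward_fns[h][x, a]            # observe r_h^k(x_h^k, a_h^k) in [0, 1]
        x = rng.choice(S, p=P[h, x, a])         # x_{h+1}^k ~ P_h(. | x_h^k, a_h^k)
    return total

rewards = [rng.uniform(0, 1, size=(S, A)) for _ in range(H)]   # the adversary's choice r^k
print("total reward of one episode:", round(run_episode(uniform_policy, rewards), 3))
```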
By allowing the reward function to be adversarially chosen in each episode, our setting generalizes the stationary setting commonly adopted by the existing work on value-based reinforcement learning (Jaksch et al., 2010; Osband et al., 2014; Osband & Van Roy, 2016; Azar et al., 2017; Dann et al., 2017; Strehl et al., 2006; Jin et al., 2018; 2019; Yang & Wang, 2019b;a), where the reward function is fixed across all the episodes.

2.2. Linear Function Approximations

We consider the linear setting where the transition dynamics are linear in a feature map, which is formalized in the following assumption.

Assumption 2.1 (Linear MDP). We assume that the MDP $(\mathcal{S},\mathcal{A},H,\mathcal{P},r)$ is a linear MDP with the known feature map $\psi:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to\mathbb{R}^d$, that is, for any $h\in[H]$, there exists $\theta_h\in\mathbb{R}^d$ with $\|\theta_h\|_2\le\sqrt{d}$ such that
$$\mathcal{P}_h(x'\,|\,x,a)=\psi(x,a,x')^\top\theta_h$$
for any $(x,a,x')\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}$. Also, we assume that
$$\Big\|\int_{\mathcal{S}}\psi(x,a,x')\,V(x')\,\mathrm{d}x'\Big\|_2\le\sqrt{d}\,H$$
for any $(x,a)\in\mathcal{S}\times\mathcal{A}$ and any function $V:\mathcal{S}\to[0,H]$.

See (Ayoub et al., 2020; Zhou et al., 2020) for various examples of linear MDPs, including the one proposed by (Yang & Wang, 2019a). In particular, a tabular MDP corresponds to the linear MDP with $d=|\mathcal{S}|^2|\mathcal{A}|$ and the feature vector $\psi(x,a,x')$ being the canonical basis $e_{(x,a,x')}$ of $\mathbb{R}^{|\mathcal{S}|^2|\mathcal{A}|}$. See also (Du et al., 2019a; Van Roy & Dong, 2019; Lattimore & Szepesvari, 2019) for a related discussion on the necessity of the linear representation.

We remark that (Yang & Wang, 2019b; Jin et al., 2019) study another variant of linear MDPs, where the transition kernel can be written as $\mathcal{P}_h(x'\,|\,x,a)=\varphi(x,a)^\top\mu_h(x')$ for any $h\in[H]$ and $(x,a,x')\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}$. Here $\varphi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^d$ is a known feature map and $\mu_h:\mathcal{S}\to\mathbb{R}^d$ is an unknown function on $\mathcal{S}$ for any $h\in[H]$. Although the variant of linear MDPs defined in Assumption 2.1 and the one studied by (Yang & Wang, 2019b; Jin et al., 2019) both cover the tabular setting and the one proposed by (Yang & Wang, 2019a) as special cases, they are two different definitions of linear MDPs as their feature maps $\psi(\cdot,\cdot,\cdot)$ and $\varphi(\cdot,\cdot)$ are defined on different domains. It can be shown that the two variants are incomparable in the sense that one does not imply the other (Zhou et al., 2020).
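As a quick illustration of the tabular special case above (our own numerical check, not from the paper), the following snippet builds the canonical-basis features and verifies that the transition probabilities are linear in them:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A = 4, 3
d = S * S * A                                   # d = |S|^2 |A| in the tabular case

def psi(x, a, x_next):
    """Canonical-basis feature psi(x, a, x') = e_{(x, a, x')} in R^{|S|^2 |A|}."""
    e = np.zeros(d)
    e[(x * A + a) * S + x_next] = 1.0
    return e

P_h = rng.dirichlet(np.ones(S), size=(S, A))    # an arbitrary tabular kernel P_h(. | x, a)
theta_h = P_h.reshape(-1)                       # stacking the probabilities gives theta_h,
                                                # so that P_h(x' | x, a) = psi(x, a, x')^T theta_h
x, a, x_next = 2, 1, 3
assert np.isclose(psi(x, a, x_next) @ theta_h, P_h[x, a, x_next])

# Note: ||theta_h||_2 <= sqrt(|S||A|) <= sqrt(d), consistent with Assumption 2.1.
print("check passed:", P_h[x, a, x_next])
```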
3. Algorithm and Theory

3.1. Optimistic PPO (OPPO)

We present Optimistic PPO (OPPO) in Algorithm 1, which involves a policy improvement step and a policy evaluation step.

Algorithm 1 Optimistic PPO (OPPO)
1: Initialize $\{\pi_h^0\}_{h=1}^H$ as uniform distributions on $\mathcal{A}$ and $\{Q_h^0\}_{h=1}^H$ as zero functions.
2: for episode $k=1,2,\ldots,K$ do
3:   Receive the initial state $x_1^k$.
4:   for step $h=1,2,\ldots,H$ do
5:     Update the policy by
6:       $\pi_h^k(\cdot\,|\,\cdot)\propto\pi_h^{k-1}(\cdot\,|\,\cdot)\exp\{\alpha\,Q_h^{k-1}(\cdot,\cdot)\}$.
7:     Take the action following $a_h^k\sim\pi_h^k(\cdot\,|\,x_h^k)$.
8:     Observe the reward function $r_h^k(\cdot,\cdot)$.
9:     Receive the next state $x_{h+1}^k$.
10:  Initialize $V_{H+1}^k(\cdot)$ as a zero function.
11:  for step $h=H,H-1,\ldots,1$ do
12:    $w_h^k\leftarrow\mathop{\mathrm{argmin}}_{w\in\mathbb{R}^d}\sum_{\tau=1}^{k-1}\big(V_{h+1}^\tau(x_{h+1}^\tau)-\phi_h^\tau(x_h^\tau,a_h^\tau)^\top w\big)^2+\lambda\|w\|_2^2$.
13:    $\Lambda_h^k\leftarrow\sum_{\tau=1}^{k-1}\phi_h^\tau(x_h^\tau,a_h^\tau)\,\phi_h^\tau(x_h^\tau,a_h^\tau)^\top+\lambda\cdot I$.
14:    $\phi_h^k(\cdot,\cdot)\leftarrow\int_{\mathcal{S}}\psi(\cdot,\cdot,x')\,V_{h+1}^k(x')\,\mathrm{d}x'$.
15:    $\Gamma_h^k(\cdot,\cdot)\leftarrow\beta\cdot[\phi_h^k(\cdot,\cdot)^\top(\Lambda_h^k)^{-1}\phi_h^k(\cdot,\cdot)]^{1/2}$.
16:    $Q_h^k(\cdot,\cdot)\leftarrow r_h^k(\cdot,\cdot)+\phi_h^k(\cdot,\cdot)^\top w_h^k+\Gamma_h^k(\cdot,\cdot)$.
17:    $Q_h^k(\cdot,\cdot)\leftarrow\min\{Q_h^k(\cdot,\cdot),H-h+1\}^+$.
18:    $V_h^k(\cdot)\leftarrow\langle Q_h^k(\cdot,\cdot),\pi_h^k(\cdot\,|\,\cdot)\rangle_{\mathcal{A}}$.

Policy Improvement Step. In the $k$-th episode, OPPO updates $\pi^k$ based on $\pi^{k-1}$ (Lines 4-9 of Algorithm 1). In detail, we define the following linear function of the policy $\pi\in\Delta(\mathcal{A}\,|\,\mathcal{S},H)$,
$$L^{k-1}(\pi)=V_1^{\pi^{k-1},k-1}(x_1^k)+\mathbb{E}_{\pi^{k-1}}\Big[\sum_{h=1}^H\big\langle Q_h^{\pi^{k-1},k-1}(x_h,\cdot),\,\pi_h(\cdot\,|\,x_h)-\pi_h^{k-1}(\cdot\,|\,x_h)\big\rangle\,\Big|\,x_1=x_1^k\Big], \qquad (3.1)$$
which is a local linear approximation of $V_1^{\pi,k-1}(x_1^k)$ (Schulman et al., 2015; 2017). In particular, we have that $L^{k-1}(\pi^{k-1})=V_1^{\pi^{k-1},k-1}(x_1^k)$. The policy improvement step is defined by
$$\pi^k\leftarrow\mathop{\mathrm{argmax}}_{\pi\in\Delta(\mathcal{A}\,|\,\mathcal{S},H)}\ L^{k-1}(\pi)-\alpha^{-1}\,\mathbb{E}_{\pi^{k-1}}\big[\widetilde{D}_{\mathrm{KL}}(\pi\,\|\,\pi^{k-1})\,\big|\,x_1=x_1^k\big], \qquad (3.2)$$
where
$$\widetilde{D}_{\mathrm{KL}}(\pi\,\|\,\pi^{k-1})=\sum_{h=1}^H D_{\mathrm{KL}}\big(\pi_h(\cdot\,|\,x_h)\,\big\|\,\pi_h^{k-1}(\cdot\,|\,x_h)\big).$$
Here the KL-divergence regularizes $\pi$ to be close to $\pi^{k-1}$ so that $L^{k-1}(\pi)$ well approximates $V_1^{\pi,k-1}(x_1^k)$, which further ensures that the updated policy $\pi^k$ improves the expected total reward (associated with the reward function $r^{k-1}$) upon $\pi^{k-1}$. Also, $\alpha>0$ is the stepsize, which is specified in Theorem 3.1. By executing the updated policy $\pi^k$, the agent receives the state-action sequence $\{(x_h^k,a_h^k)\}_{h=1}^H$ and observes the reward function $r^k$, which together determine the received rewards $\{r_h^k(x_h^k,a_h^k)\}_{h=1}^H$.

The policy improvement step defined in (3.2) corresponds to one iteration of NPG (Kakade, 2002), TRPO (Schulman et al., 2015), and PPO (Schulman et al., 2017). In particular, PPO solves the same KL-regularized policy optimization subproblem as in (3.2) at each iteration, while TRPO solves an equivalent KL-constrained subproblem. In the special case where the reward function $r_h^{k-1}$ is linear in the feature map $\phi_h^{k-1}$ defined subsequently, which implies that the Q-function $Q_h^{\pi^{k-1},k-1}$ is also linear in $\phi_h^{k-1}$, the updated policy $\pi^k$ can be equivalently obtained by one iteration of NPG when the policy is parameterized by an energy-based distribution (Agarwal et al., 2019; Wang et al., 2019). Such a policy improvement step can also be cast as one iteration of infinite-dimensional mirror descent (Nemirovsky & Yudin, 1983) or dual averaging (Xiao, 2010), where the Q-function plays the role of the gradient (Liu et al., 2019; Wang et al., 2019). The updated policy $\pi^k$ obtained in (3.2) takes the following closed form,
$$\pi_h^k(\cdot\,|\,x)\propto\pi_h^{k-1}(\cdot\,|\,x)\cdot\exp\big\{\alpha\,Q_h^{\pi^{k-1},k-1}(x,\cdot)\big\} \qquad (3.3)$$
for any $h\in[H]$ and $x\in\mathcal{S}$. However, the Q-function $Q_h^{\pi^{k-1},k-1}$ remains to be estimated through the subsequent policy evaluation step. We denote by $Q_h^{k-1}$ the estimated Q-function, which replaces the Q-function $Q_h^{\pi^{k-1},k-1}$ in (3.1)-(3.3) and is correspondingly used in Line 6 of Algorithm 1.

Policy Evaluation Step. At the end of the $k$-th episode, OPPO evaluates the policy $\pi^k$ based on the $(k-1)$ historical trajectories (Lines 11-18 of Algorithm 1). In detail, for any $h\in[H]$, we define the empirical mean-squared Bellman error (MSBE) (Sutton & Barto, 2018) as
$$M_h^k(w)=\sum_{\tau=1}^{k-1}\big(V_{h+1}^\tau(x_{h+1}^\tau)-\phi_h^\tau(x_h^\tau,a_h^\tau)^\top w\big)^2,\quad\text{where}\quad\phi_h^\tau(\cdot,\cdot)=\int_{\mathcal{S}}\psi(\cdot,\cdot,x')\,V_{h+1}^\tau(x')\,\mathrm{d}x',\quad V_{h+1}^\tau(\cdot)=\big\langle Q_{h+1}^\tau(\cdot,\cdot),\,\pi_{h+1}^\tau(\cdot\,|\,\cdot)\big\rangle_{\mathcal{A}}, \qquad (3.4)$$
while we initialize $V_{H+1}^\tau$ as a zero function on $\mathcal{S}$. The policy evaluation step is defined by iteratively updating the estimated Q-function $Q^k=\{Q_h^k\}_{h=1}^H$ associated with the reward function $r^k=\{r_h^k\}_{h=1}^H$,
$$w_h^k\leftarrow\mathop{\mathrm{argmin}}_{w\in\mathbb{R}^d}\ M_h^k(w)+\lambda\|w\|_2^2,\qquad Q_h^k(\cdot,\cdot)\leftarrow\min\big\{r_h^k(\cdot,\cdot)+\phi_h^k(\cdot,\cdot)^\top w_h^k+\Gamma_h^k(\cdot,\cdot),\,H-h+1\big\}^+, \qquad (3.5)$$
in the order of $h=H,H-1,\ldots,1$. Here $\lambda>0$ is the regularization parameter, which is specified in Theorem 3.1. Also, $\Gamma_h^k:\mathcal{S}\times\mathcal{A}\to\mathbb{R}_+$ is a bonus function, which quantifies the uncertainty in estimating the Q-function $Q_h^{\pi^k,k}$ based on only finite historical data. In particular, the weight vector $w_h^k$ obtained in (3.5) and the bonus function $\Gamma_h^k$ take the following closed forms,
$$w_h^k=(\Lambda_h^k)^{-1}\Big(\sum_{\tau=1}^{k-1}\phi_h^\tau(x_h^\tau,a_h^\tau)\,V_{h+1}^\tau(x_{h+1}^\tau)\Big),\qquad\Gamma_h^k(\cdot,\cdot)=\beta\cdot\big[\phi_h^k(\cdot,\cdot)^\top(\Lambda_h^k)^{-1}\phi_h^k(\cdot,\cdot)\big]^{1/2}, \qquad (3.6)$$
where $\Lambda_h^k=\sum_{\tau=1}^{k-1}\phi_h^\tau(x_h^\tau,a_h^\tau)\,\phi_h^\tau(x_h^\tau,a_h^\tau)^\top+\lambda\cdot I$. Here $\beta>0$ scales with $d$, $H$, and $K$, which is specified in Theorem 3.1. The policy evaluation step defined in (3.5) corresponds to one iteration of least-squares temporal difference (LSTD) (Bradtke & Barto, 1996; Boyan, 2002).
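To make the two steps of Algorithm 1 concrete, the following is a compact Python sketch (ours, not the authors' implementation) for a small tabular MDP treated as a linear MDP with the canonical-basis features of Section 2.2. The constants `alpha`, `beta`, and `lam` are illustrative placeholders rather than the theoretical choices prescribed in Theorem 3.1, and the integral in Line 14 of Algorithm 1 reduces to a finite sum.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H, K = 4, 3, 5, 50
d = S * A * S                                    # tabular case: d = |S|^2 |A|
alpha, beta, lam = 0.5, 0.1, 1.0                 # illustrative constants only

P = rng.dirichlet(np.ones(S), size=(H, S, A))    # true P_h(. | x, a); unknown to the agent
psi = np.zeros((S, A, S, d))                     # canonical-basis features psi(x, a, x')
for x in range(S):
    for a in range(A):
        for xn in range(S):
            psi[x, a, xn, (x * A + a) * S + xn] = 1.0

pi = np.full((H, S, A), 1.0 / A)                 # pi^0: uniform distributions on A
Q = np.zeros((H, S, A))                          # Q^0: zero functions
hist = [[] for _ in range(H)]                    # per step h: (phi_h^tau(x, a), V_{h+1}^tau(x')) pairs

for k in range(1, K + 1):
    r = rng.uniform(0, 1, size=(H, S, A))        # adversarial reward r^k (full information)

    # Policy improvement (Lines 4-6): pi_h^k is proportional to pi_h^{k-1} * exp(alpha * Q_h^{k-1}).
    logits = np.log(pi) + alpha * Q
    pi = np.exp(logits - logits.max(axis=2, keepdims=True))
    pi /= pi.sum(axis=2, keepdims=True)

    # Execute pi^k and record the trajectory (Lines 7-9).
    x, traj = 0, []
    for h in range(H):
        a = rng.choice(A, p=pi[h, x])
        xn = rng.choice(S, p=P[h, x, a])
        traj.append((x, a, xn))
        x = xn

    # Policy evaluation (Lines 10-18): ridge regression plus a UCB bonus.
    V_next = np.zeros(S)                         # V_{H+1}^k = 0
    for h in range(H - 1, -1, -1):               # backward over steps
        Lam = lam * np.eye(d)
        b = np.zeros(d)
        for f, v in hist[h]:                     # data from episodes tau = 1, ..., k-1
            Lam += np.outer(f, f)
            b += f * v
        w = np.linalg.solve(Lam, b)              # closed-form ridge estimate w_h^k
        phi = np.einsum('xaud,u->xad', psi, V_next)          # phi_h^k(x, a)
        Lam_inv = np.linalg.inv(Lam)
        bonus = beta * np.sqrt(np.einsum('xad,de,xae->xa', phi, Lam_inv, phi))
        Q[h] = np.clip(r[h] + phi @ w + bonus, 0.0, H - h)   # truncate to [0, H - h + 1] (1-indexed)
        V_h = np.sum(Q[h] * pi[h], axis=1)       # V_h^k(x) = <Q_h^k(x, .), pi_h^k(. | x)>
        x_h, a_h, x_next = traj[h]
        hist[h].append((phi[x_h, a_h].copy(), V_next[x_next]))   # store for future episodes
        V_next = V_h

print("optimistic estimate of V_1 at the initial state:", round(float(V_next[0]), 3))
```

The backward pass mirrors Lines 10-18 (a ridge-regression estimate of $\mathbb{P}_h V_{h+1}^k$ plus the UCB bonus, truncated to $[0,H-h+1]$), while the forward pass mirrors the multiplicative-weights update of Line 6.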
To interpret the policy evaluation step, note that
$$\mathbb{E}\big[V_{h+1}^\tau(x')\,\big|\,x'\sim\mathcal{P}_h(\cdot\,|\,x,a)\big]=(\mathbb{P}_h V_{h+1}^\tau)(x,a)=\phi_h^\tau(x,a)^\top\theta_h$$
for any $\tau\in[k-1]$ and $(x,a)\in\mathcal{S}\times\mathcal{A}$ in the empirical MSBE defined in (3.4). Hence $\phi_h^k(\cdot,\cdot)^\top w_h^k$ in (3.5) is an estimator of $\mathbb{P}_h V_{h+1}^k$ in the Bellman equation defined in (2.4) (with $V_{h+1}^{\pi,k}$ replaced by $V_{h+1}^k$). Meanwhile, we construct the bonus function $\Gamma_h^k$ according to (3.6) so that $\phi_h^k(\cdot,\cdot)^\top w_h^k+\Gamma_h^k(\cdot,\cdot)$ is an upper confidence bound (UCB), that is, it holds that
$$\phi_h^k(x,a)^\top w_h^k+\Gamma_h^k(x,a)\ \ge\ (\mathbb{P}_h V_{h+1}^k)(x,a)$$
with high probability, which is subsequently characterized in Lemma 4.3. Here the inequality holds uniformly for any $(h,k)\in[H]\times[K]$ and $(x,a)\in\mathcal{S}\times\mathcal{A}$. As the fact that $r_h^k\in[0,1]$ for any $h\in[H]$ implies that $Q_h^{\pi^k,k}\in[0,H-h+1]$, we truncate $Q_h^k$ to the range $[0,H-h+1]$ in (3.5), which is correspondingly used in Line 17 of Algorithm 1.

3.2. Regret Analysis

We establish an upper bound of the regret of OPPO (Algorithm 1) in the following theorem. Recall that the regret is defined in (2.1) and $T=HK$ is the total number of steps taken by the agent, where $H$ is the length of each episode and $K$ is the total number of episodes. Also, $|\mathcal{A}|$ is the cardinality of $\mathcal{A}$ and $d$ is the dimension of the feature map $\psi$.

Theorem 3.1 (Total Regret). Let $\alpha=\sqrt{2\log|\mathcal{A}|/(HT)}$ in (3.2) and Line 6 of Algorithm 1, $\lambda=1$ in (3.5) and Line 12 of Algorithm 1, and $\beta=C\sqrt{dH^2\log(dT/\zeta)}$ in (3.6) and Line 15 of Algorithm 1, where $C>1$ is an absolute constant and $\zeta\in(0,1]$. Under Assumption 2.1 and the assumption that $\log|\mathcal{A}|=O(d^2[\log(dT/\zeta)]^2)$, the regret of OPPO satisfies
$$\mathrm{Regret}(T)\le C'\sqrt{d^2H^3T}\,\log(dT/\zeta)$$
with probability at least $1-\zeta$, where $C'>0$ is an absolute constant.

Proof. See Section 4 for a proof sketch and Appendix C for a detailed proof.

Theorem 3.1 proves that OPPO attains a $\sqrt{d^2H^3T}$-regret up to logarithmic factors, where the dependency on the total number of steps $T$ is optimal. In the stationary setting where the reward function and the initial state are fixed across all the episodes, such a regret translates to a $d^2H^4/\varepsilon^2$-sample complexity (up to logarithmic factors) following the argument of (Jin et al., 2018) (Section 3.1). Here $\varepsilon>0$ measures the suboptimality of the obtained policy $\pi^{\bar k}$ in the following sense,
$$\max_{\pi\in\Delta(\mathcal{A}\,|\,\mathcal{S},H)}V_1^\pi(x_1)-\mathbb{E}\big[V_1^{\pi^{\bar k}}(x_1)\big]\le\varepsilon,$$
where $\bar k$ is sampled from $[K]$ uniformly at random. Here we denote the value function by $V_1^\pi$ and the initial state by $x_1=x_1^k$ for any $k\in[K]$, as the reward function and initial state are fixed across all the episodes. Moreover, compared with the work of (Yang & Wang, 2019b;a; Jin et al., 2019; Ayoub et al., 2020; Zhou et al., 2020), OPPO additionally allows adversarially chosen reward functions without exacerbating the regret, which leads to a notion of robustness. Also, as a tabular MDP satisfies Assumption 2.1 with $d=|\mathcal{S}|^2|\mathcal{A}|$ and $\psi$ being the canonical basis of $\mathbb{R}^d$, Theorem 3.1 yields an $|\mathcal{S}|^2|\mathcal{A}|\sqrt{H^3T}$-regret in the tabular setting. Our subsequent discussion intuitively explains how OPPO achieves such a notion of robustness while attaining the $\sqrt{d^2H^3T}$-regret (up to logarithmic factors).
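As a side note, one way to see the stated regret-to-sample-complexity conversion is the following back-of-the-envelope calculation (ours, ignoring logarithmic factors and absolute constants):
$$\frac{1}{K}\sum_{k=1}^K\big(V_1^{\pi^*}(x_1)-V_1^{\pi^k}(x_1)\big)\le\frac{\mathrm{Regret}(T)}{K}\lesssim\frac{\sqrt{d^2H^3\cdot HK}}{K}=\sqrt{\frac{d^2H^4}{K}},$$
so the uniformly sampled policy $\pi^{\bar k}$ is $\varepsilon$-suboptimal in expectation once $K\gtrsim d^2H^4/\varepsilon^2$ episodes have been collected.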
Discussion of Mechanisms. In the sequel, we consider the ideal setting where the transition dynamics are known, which, by the Bellman equation defined in (2.4), allows us to access the Q-function $Q_h^{\pi,k}$ for any policy $\pi$ and $(h,k)\in[H]\times[K]$ once given the reward function $r^k$. The following lemma connects the difference between two policies to the difference between their expected total rewards through the Q-function.

Lemma 3.2 (Performance Difference). For any policies $\pi,\pi'\in\Delta(\mathcal{A}\,|\,\mathcal{S},H)$ and $k\in[K]$, it holds that
$$V_1^{\pi',k}(x_1^k)-V_1^{\pi,k}(x_1^k)=\mathbb{E}_{\pi'}\Big[\sum_{h=1}^H\big\langle Q_h^{\pi,k}(x_h,\cdot),\,\pi_h'(\cdot\,|\,x_h)-\pi_h(\cdot\,|\,x_h)\big\rangle\,\Big|\,x_1=x_1^k\Big]. \qquad (3.7)$$

Proof. See Appendix A.1 for a detailed proof.

For notational simplicity, we omit the conditioning on $x_1=x_1^k$, e.g., in (3.7) of Lemma 3.2, subsequently. The following lemma characterizes the policy improvement step defined in (3.2), where the updated policy $\pi^k$ takes the closed form in (3.3).

Lemma 3.3 (One-Step Descent). For any distributions $p^*,p\in\Delta(\mathcal{A})$, state $x\in\mathcal{S}$, and function $Q:\mathcal{S}\times\mathcal{A}\to[0,H]$, it holds for $p'\in\Delta(\mathcal{A})$ with $p'(\cdot)\propto p(\cdot)\cdot\exp\{\alpha\,Q(x,\cdot)\}$ that
$$\big\langle Q(x,\cdot),\,p^*(\cdot)-p(\cdot)\big\rangle\le\frac{\alpha H^2}{2}+\alpha^{-1}\big(D_{\mathrm{KL}}(p^*\,\|\,p)-D_{\mathrm{KL}}(p^*\,\|\,p')\big).$$

Proof. See Appendix A.2 for a detailed proof.
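The one-step descent inequality, as reconstructed above, can be sanity-checked numerically; the snippet below (ours, purely illustrative) verifies it on random distributions and random Q-values.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

H, n_actions, alpha = 5, 6, 0.3
for _ in range(1000):
    p_star = rng.dirichlet(np.ones(n_actions))
    p = rng.dirichlet(np.ones(n_actions))
    q_vals = rng.uniform(0, H, size=n_actions)      # Q(x, .) with values in [0, H]

    p_new = p * np.exp(alpha * q_vals)              # p'(.) proportional to p(.) exp(alpha Q(x, .))
    p_new /= p_new.sum()

    lhs = float(q_vals @ (p_star - p))
    rhs = alpha * H ** 2 / 2 + (kl(p_star, p) - kl(p_star, p_new)) / alpha
    assert lhs <= rhs + 1e-9

print("one-step descent inequality held on all random instances")
```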
Corresponding to the definition of the regret in (2.1), we define the globally optimal policy in hindsight (Cesa-Bianchi & Lugosi, 2006; Bubeck & Cesa-Bianchi, 2012) as
$$\pi^*=\mathop{\mathrm{argmax}}_{\pi\in\Delta(\mathcal{A}\,|\,\mathcal{S},H)}\sum_{k=1}^K V_1^{\pi,k}(x_1^k), \qquad (3.8)$$
which by definition attains zero regret. In the ideal setting where the Q-function $Q_h^{\pi^k,k}$ associated with the reward function $r^k$ is known and the updated policy $\pi_h^{k+1}$ takes the closed form in (3.3), Lemma 3.3 implies
$$\big\langle Q_h^{\pi^k,k}(x,\cdot),\,\pi_h^*(\cdot\,|\,x)-\pi_h^k(\cdot\,|\,x)\big\rangle\le\frac{\alpha H^2}{2}+\alpha^{-1}\Big(D_{\mathrm{KL}}\big(\pi_h^*(\cdot\,|\,x)\,\|\,\pi_h^k(\cdot\,|\,x)\big)-D_{\mathrm{KL}}\big(\pi_h^*(\cdot\,|\,x)\,\|\,\pi_h^{k+1}(\cdot\,|\,x)\big)\Big) \qquad (3.9)$$
for any $(h,k)\in[H]\times[K]$ and $x\in\mathcal{S}$. Combining (3.9) with Lemma 3.2, we obtain
$$\sum_{k=1}^K\big(V_1^{\pi^*,k}(x_1^k)-V_1^{\pi^k,k}(x_1^k)\big)\le\frac{\alpha H^3K}{2}+\alpha^{-1}\sum_{h=1}^H\mathbb{E}_{\pi^*}\big[D_{\mathrm{KL}}\big(\pi_h^*(\cdot\,|\,x_h)\,\|\,\pi_h^1(\cdot\,|\,x_h)\big)\big]\le\frac{\alpha H^3K}{2}+\alpha^{-1}H\log|\mathcal{A}|. \qquad (3.10)$$
Here the first inequality follows from telescoping the right-hand side of (3.9) across all the episodes and the fact that the KL-divergence is nonnegative. Also, the second inequality follows from the initialization of the policy and Q-function in Line 1 of Algorithm 1. Setting $\alpha=\sqrt{2\log|\mathcal{A}|/(HT)}$ in (3.10), we establish a $\sqrt{2H^3T\log|\mathcal{A}|}$-regret in the ideal setting.

Such an ideal setting demonstrates the key role of the KL-divergence in the policy improvement step defined in (3.2), where $\alpha>0$ is the stepsize. Intuitively, without the KL-divergence, that is, setting $\alpha\to\infty$, the upper bound of the regret on the right-hand side of (3.10) tends to infinity. In fact, for any $\alpha<\infty$, the updated policy $\pi_h^k$ in (3.3) is conservatively greedy with respect to the Q-function $Q_h^{\pi^{k-1},k-1}$ associated with the reward function $r^{k-1}$. In particular, the regularization effect of both $\pi_h^{k-1}$ and $\alpha$ in (3.3) ensures that $\pi_h^k$ is not fully committed to performing well only with respect to $r^{k-1}$, just in case the subsequent adversarially chosen reward function $r^k$ significantly differs from $r^{k-1}$. In comparison, the fully greedy policy improvement step, which is commonly adopted by the existing work on value-based reinforcement learning (Jaksch et al., 2010; Osband et al., 2014; Osband & Van Roy, 2016; Azar et al., 2017; Dann et al., 2017; Strehl et al., 2006; Jin et al., 2018; 2019; Yang & Wang, 2019b;a), lacks such a notion of robustness.

On the other hand, an intriguing question is whether being conservatively greedy is less sample-efficient than being fully greedy in the stationary setting, where the reward function is fixed across all the episodes. In fact, in the ideal setting where the Q-function $Q_h^{\pi^{k-1},k-1}$ associated with the reward function $r^{k-1}$ in (3.3) is known, the fully greedy policy improvement step with $\alpha\to\infty$ corresponds to one step of policy iteration (Sutton & Barto, 2018), which converges to the globally optimal policy within $K=H$ episodes and hence equivalently induces an $H^2$-regret. However, in the realistic setting, the Q-function $Q_h^{\pi^{k-1},k-1}$ in (3.1)-(3.3) is replaced by the estimated Q-function $Q_h^{k-1}$ in Line 6 of Algorithm 1, which is obtained by the policy evaluation step defined in (3.5). As a result of the estimation uncertainty that arises from only observing finite historical data, it is indeed impossible to do better than the $\sqrt{T}$-regret even in the tabular setting (Jin et al., 2018), which is shown to be an information-theoretic lower bound. In the linear setting, OPPO attains such a lower bound in terms of the total number of steps $T=HK$. In other words, in the stationary setting, being conservatively greedy suffices to achieve sample efficiency, which complements its advantages in terms of robustness in the more challenging setting with adversarially chosen reward functions.

4. Proof Sketch

4.1. Regret Decomposition

For simplicity of discussion, we define the model prediction error as
$$\iota_h^k=r_h^k+\mathbb{P}_h V_{h+1}^k-Q_h^k, \qquad (4.1)$$
which arises from estimating $\mathbb{P}_h V_{h+1}^k$ in the Bellman equation defined in (2.4) (with $V_{h+1}^{\pi^k,k}$ replaced by $V_{h+1}^k$) based on only finite historical data. Also, we define the following filtration generated by the state-action sequences and reward functions.

Definition 4.1 (Filtration). For any $(k,h)\in[K]\times[H]$, we define $\mathcal{F}_{k,h,1}$ as the $\sigma$-algebra generated by the following state-action sequences and reward functions,
$$\{(x_i^\tau,a_i^\tau)\}_{(\tau,i)\in[k-1]\times[H]}\cup\{r^\tau\}_{\tau\in[k]}\cup\{(x_i^k,a_i^k)\}_{i\in[h]}.$$
For any $(k,h)\in[K]\times[H-1]$, we define $\mathcal{F}_{k,h,2}$ as the $\sigma$-algebra generated by
$$\{(x_i^\tau,a_i^\tau)\}_{(\tau,i)\in[k-1]\times[H]}\cup\{r^\tau\}_{\tau\in[k]}\cup\{(x_i^k,a_i^k)\}_{i\in[h]}\cup\{x_{h+1}^k\},$$
while for any $k\in[K]$, we define $\mathcal{F}_{k,H,2}$ as the $\sigma$-algebra generated by
$$\{(x_i^\tau,a_i^\tau)\}_{(\tau,i)\in[k]\times[H]}\cup\{r^\tau\}_{\tau\in[k+1]}.$$
The $\sigma$-algebra sequence $\{\mathcal{F}_{k,h,m}\}_{(k,h,m)\in[K]\times[H]\times[2]}$ is a filtration with respect to the timestep index
$$t(k,h,m)=(k-1)\cdot 2H+(h-1)\cdot 2+m. \qquad (4.2)$$
In other words, for any $t(k,h,m)\le t(k',h',m')$, it holds that $\mathcal{F}_{k,h,m}\subseteq\mathcal{F}_{k',h',m'}$.

By the definition of the $\sigma$-algebra $\mathcal{F}_{k,h,m}$ for $(k,h,m)\in[K]\times[H]\times[2]$, the estimated value function $V_h^k$ and Q-function $Q_h^k$ are measurable to $\mathcal{F}_{k-1,H,2}$, as they are obtained based on the $(k-1)$ historical trajectories and the reward function $r^k$ adversarially chosen by the environment at the beginning of the $k$-th episode, both of which are measurable to $\mathcal{F}_{k-1,H,2}$. In the following lemma, we decompose the regret defined in (2.1) into three terms. Recall that the globally optimal policy in hindsight $\pi^*$ is defined in (3.8) and the model prediction error $\iota_h^k$ is defined in (4.1).

Lemma 4.2 (Regret Decomposition). It holds that
$$\mathrm{Regret}(T)=\underbrace{\sum_{k=1}^K\sum_{h=1}^H\mathbb{E}_{\pi^*}\big[\big\langle Q_h^k(x_h,\cdot),\,\pi_h^*(\cdot\,|\,x_h)-\pi_h^k(\cdot\,|\,x_h)\big\rangle\big]}_{\text{(i)}}+\underbrace{M_{K,H,2}}_{\text{(ii)}}+\underbrace{\sum_{k=1}^K\sum_{h=1}^H\big(\mathbb{E}_{\pi^*}\big[\iota_h^k(x_h,a_h)\big]-\iota_h^k(x_h^k,a_h^k)\big)}_{\text{(iii)}}, \qquad (4.3)$$
which is independent of the linear setting in Assumption 2.1. Here $\{M_{k,h,m}\}_{(k,h,m)\in[K]\times[H]\times[2]}$ is a martingale adapted to the filtration $\{\mathcal{F}_{k,h,m}\}_{(k,h,m)\in[K]\times[H]\times[2]}$, both with respect to the timestep index $t(k,h,m)$ defined in (4.2) of Definition 4.1.

Proof. See Appendix B.1 for a detailed proof.

Lemma 4.2 allows us to characterize the regret by upper bounding terms (i), (ii), and (iii) in (4.3), respectively. In detail, term (i) corresponds to the right-hand side of (3.7) in Lemma 3.2 with the Q-function $Q_h^{\pi^k,k}$ replaced by the estimated Q-function $Q_h^k$, which is obtained by the policy evaluation step defined in (3.5). In particular, as the updated policy $\pi_h^{k+1}$ is obtained by the policy improvement step in Line 6 of Algorithm 1 using $Q_h^k$, term (i) can be upper bounded following an analysis similar to the discussion in Section 3.2, which is based on Lemmas 3.2 and 3.3 as well as (3.10). Also, by the Azuma-Hoeffding inequality, term (ii) is a martingale that scales as $O(B_M\sqrt{T_M})$ with high probability, where $T_M$ is the total number of timesteps and $B_M$ is an upper bound on the martingale differences. More specifically, we prove that $T_M=2HK=2T$ and $B_M=2H$ in Appendix C, which implies that term (ii) is $O(\sqrt{H^2T})$ with high probability. Meanwhile, term (iii) corresponds to the model prediction error, which is characterized subsequently in Section 4.2.
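For concreteness, the Azuma-Hoeffding step behind the stated bound on term (ii) can be spelled out as follows (a standard calculation; the probability split and constants here are generic rather than the exact ones used in Appendix C). For a martingale with $T_M$ differences, each bounded in absolute value by $B_M$,
$$\mathbb{P}\big(|M_{K,H,2}|\ge t\big)\le 2\exp\Big(-\frac{t^2}{2\,T_M B_M^2}\Big)\quad\Longrightarrow\quad|M_{K,H,2}|\le B_M\sqrt{2\,T_M\log(4/\zeta)}\ \text{ with probability at least }1-\zeta/2,$$
and plugging in $T_M=2T$ and $B_M=2H$ gives an $O(\sqrt{H^2T})$ bound up to logarithmic factors.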
Note that the regret decomposition in (4.3) of Lemma 4.2 is independent of the linear setting in Assumption 2.1, and therefore applies to any form of estimated Q-functions $Q_h^k$ in more general settings. In particular, as long as we can upper bound term (iii) in (4.3), our regret analysis can be carried over even beyond the linear setting.

4.2. Model Prediction Error

To upper bound term (iii) in (4.3) of Lemma 4.2, we characterize the model prediction error $\iota_h^k$ defined in (4.1) in the following lemma. Recall that the bonus function $\Gamma_h^k$ is defined in (3.6).

Lemma 4.3 (Upper Confidence Bound). Let $\lambda=1$ in (3.5) and Line 12 of Algorithm 1, and $\beta=C\sqrt{dH^2\log(dT/\zeta)}$ in (3.6) and Line 15 of Algorithm 1, where $C>1$ is an absolute constant and $\zeta\in(0,1]$. Under Assumption 2.1, it holds with probability at least $1-\zeta/2$ that
$$-2\Gamma_h^k(x,a)\le\iota_h^k(x,a)\le 0$$
for any $(k,h)\in[K]\times[H]$ and $(x,a)\in\mathcal{S}\times\mathcal{A}$.

Proof. See Appendix B.2 for a detailed proof.

Lemma 4.3 demonstrates the key role of uncertainty quantification in achieving sample efficiency. More specifically, due to the uncertainty that arises from only observing finite historical data, the model prediction error $\iota_h^k(x,a)$ can possibly be large for the state-action pairs $(x,a)$ that are less visited or even unseen. However, as is shown in Lemma 4.3, explicitly incorporating the bonus function $\Gamma_h^k$ into the estimated Q-function $Q_h^k$ ensures that $\iota_h^k(x,a)\le 0$ with high probability for any $(k,h)\in[K]\times[H]$ and $(x,a)\in\mathcal{S}\times\mathcal{A}$. In other words, the estimated Q-function $Q_h^k$ is optimistic in the face of uncertainty, as $\iota_h^k(x,a)\le 0$, or equivalently,
$$Q_h^k(x,a)\ge r_h^k(x,a)+(\mathbb{P}_h V_{h+1}^k)(x,a), \qquad (4.4)$$
implies that $\mathbb{E}_{\pi^*}[\iota_h^k(x_h,a_h)]$ in term (iii) of (4.3) is upper bounded by zero. Also, Lemma 4.3 implies that $\iota_h^k(x_h^k,a_h^k)\ge-2\Gamma_h^k(x_h^k,a_h^k)$ with high probability for any $(k,h)\in[K]\times[H]$. As a result, it only remains to upper bound the cumulative sum $\sum_{k=1}^K\sum_{h=1}^H\Gamma_h^k(x_h^k,a_h^k)$ corresponding to term (iii) in (4.3), which can be characterized by the elliptical potential lemma (Dani et al., 2008; Rusmevichientong & Tsitsiklis, 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011; Jin et al., 2019). See Appendix C for a detailed proof.

To illustrate the intuition behind the model prediction error $\iota_h^k$ defined in (4.1), we define the implicitly estimated transition dynamics as
$$\hat{\mathcal{P}}_{k,h}(x'\,|\,x,a)=\psi(x,a,x')^\top w_h^k,$$
where $w_h^k$ is defined in (3.6). Correspondingly, the policy evaluation step defined in (3.5) takes the following equivalent form (ignoring the truncation step for simplicity of discussion),
$$Q_h^k=r_h^k+\hat{\mathbb{P}}_{k,h}V_{h+1}^k+\Gamma_h^k. \qquad (4.5)$$
Here $\hat{\mathbb{P}}_{k,h}$ is the operator form of the implicitly estimated transition kernel $\hat{\mathcal{P}}_{k,h}(\cdot\,|\,\cdot,\cdot)$, which is defined by
$$(\hat{\mathbb{P}}_{k,h}f)(x,a)=\int_{\mathcal{S}}\hat{\mathcal{P}}_{k,h}(x'\,|\,x,a)\,f(x')\,\mathrm{d}x'$$
for any function $f:\mathcal{S}\to\mathbb{R}$. Correspondingly, by (4.1) and (4.5) we have
$$\iota_h^k=(\mathbb{P}_h-\hat{\mathbb{P}}_{k,h})V_{h+1}^k-\Gamma_h^k,$$
where $\mathbb{P}_h-\hat{\mathbb{P}}_{k,h}$ is the error that arises from implicitly estimating the transition dynamics based on only finite historical data. Such a model estimation error enters the regret in (4.3) of Lemma 4.2 only through the model prediction error $(\mathbb{P}_h-\hat{\mathbb{P}}_{k,h})V_{h+1}^k$, which allows us to bypass explicitly estimating the transition dynamics and, instead, employ the estimated Q-function $Q_h^k$ obtained by the policy evaluation step defined in (4.5). As is shown in Appendix B.2, the bonus function $\Gamma_h^k$ upper bounds
$$\big|\big((\mathbb{P}_h-\hat{\mathbb{P}}_{k,h})V_{h+1}^k\big)(x,a)\big| \qquad (4.6)$$
with high probability for any $(k,h)\in[K]\times[H]$ and $(x,a)\in\mathcal{S}\times\mathcal{A}$, which then ensures the optimism of the estimated Q-function $Q_h^k$ in the sense of (4.4).
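To give a feel for the elliptical potential argument invoked above, the following self-contained snippet (ours, purely illustrative, with randomly generated features rather than the features produced by OPPO) shows that the cumulative bonus $\sum_k[\phi_k^\top(\Lambda_k)^{-1}\phi_k]^{1/2}$ grows sublinearly in the number of episodes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, lam = 10, 2000, 1.0

Lam = lam * np.eye(d)                      # Lambda_1 = lam * I
cum_bonus, history = 0.0, []
for k in range(1, K + 1):
    phi = rng.normal(size=d)
    phi /= np.linalg.norm(phi)             # features with norm at most 1
    cum_bonus += np.sqrt(phi @ np.linalg.solve(Lam, phi))
    Lam += np.outer(phi, phi)              # Lambda_{k+1} = Lambda_k + phi phi^T
    history.append(cum_bonus)

# The cumulative bonus grows roughly like sqrt(d * k * log k), so the average
# bonus per episode vanishes as k grows.
for k in (100, 500, 2000):
    print(k, round(history[k - 1], 2), round(history[k - 1] / k, 4))
```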
5. Conclusion

We study the sample efficiency of policy-based reinforcement learning in the episodic setting of linear MDPs with full-information feedback. We propose an optimistic variant of the proximal policy optimization algorithm, dubbed OPPO, which incorporates the principle of optimism in the face of uncertainty into policy optimization. When applied to the episodic MDP with unknown transition and adversarial reward, OPPO provably achieves a near-optimal $\sqrt{d^2H^3T}$-regret. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explicitly incorporates exploration.

Acknowledgements

The authors would like to thank Lingxiao Wang, Wen Sun, and Sham Kakade for pointing out a technical issue in the initial version regarding the covering number of value functions in the linear setting. Such a technical issue is fixed in the camera-ready version with a definition of the linear MDP different from the one in the initial version. The authors would like to thank Csaba Szepesvári, Lin F. Yang, Yining Wang, and Simon S. Du for helpful discussions. The authors would also like to thank the anonymous reviewers for their valuable comments and the program chairs for helping with preparing the camera-ready version.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312-2320, 2011.
Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvári, C., and Weisz, G. POLITEX: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, volume 97, pp. 3692-3702, 2019a.
Abbasi-Yadkori, Y., Lazic, N., Szepesvári, C., and Weisz, G. Exploration-enhanced POLITEX. arXiv preprint arXiv:1908.10479, 2019b.
Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.
Antos, A., Szepesvári, C., and Munos, R. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems, pp. 9-16, 2008.
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.
Ayoub, A., Jia, Z., Szepesvári, C., Wang, M., and Yang, L. Model-based reinforcement learning with value-targeted regression. arXiv preprint arXiv:2006.01107, 2020.
Azar, M. G., Munos, R., Ghavamzadaeh, M., and Kappen, H. J. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011.
Azar, M. G., Gómez, V., and Kappen, H. J. Dynamic policy programming. Journal of Machine Learning Research, 13(Nov):3207-3245, 2012a.
Azar, M. G., Munos, R., and Kappen, B. On the sample complexity of reinforcement learning with a generative model. arXiv preprint arXiv:1206.6461, 2012b.
Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263-272, 2017.
Baxter, J. and Bartlett, P. L. Direct gradient-based reinforcement learning. In International Symposium on Circuits and Systems, pp. 271-274, 2000.
Bhandari, J. and Russo, D. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.
Boyan, J. A. Least-squares temporal difference learning. Machine Learning, 49(2-3):233-246, 2002.
Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33-57, 1996.
Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.
Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge, 2006.
Chen, J. and Jiang, N. Information-theoretic considerations in batch reinforcement learning. arXiv preprint arXiv:1905.00360, 2019.
Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, pp. 208-214, 2011.
Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. Conference on Learning Theory, 2008.
Dann, C., Lattimore, T., and Brunskill, E. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5713-5723, 2017.
Dong, K., Peng, J., Wang, Y., and Zhou, Y. $\sqrt{n}$-regret for learning in Markov decision processes with function approximation and low Bellman rank. arXiv preprint arXiv:1909.02506, 2019.
Du, S. S., Kakade, S. M., Wang, R., and Yang, L. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019a.
Du, S. S., Luo, Y., Wang, R., and Zhang, H. Provably efficient Q-learning with function approximation via distribution shift error checking oracle. arXiv preprint arXiv:1906.06321, 2019b.
Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329-1338, 2016.
Even-Dar, E., Kakade, S. M., and Mansour, Y. Online Markov decision processes. Mathematics of Operations Research, 34(3):726-736, 2009.
Farahmand, A.-m., Szepesvári, C., and Munos, R. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, pp. 568-576, 2010.
Fazel, M., Ge, R., Kakade, S. M., and Mesbahi, M. Global convergence of policy gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039, 2018.
Geist, M., Scherrer, B., and Pietquin, O. A theory of regularized Markov decision processes. arXiv preprint arXiv:1901.11275, 2019.
Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4):1563-1600, 2010.
Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pp. 1704-1713, 2017.
Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4863-4873, 2018.
Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
Kakade, S. M. A natural policy gradient. In Advances in Neural Information Processing Systems, 2002.
Kakade, S. M. On the Sample Complexity of Reinforcement Learning. PhD thesis, University of London, 2003.
Koenig, S. and Simmons, R. G. Complexity analysis of real-time reinforcement learning. In Association for the Advancement of Artificial Intelligence, pp. 99-107, 1993.
Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems, 2000.
Lattimore, T. and Szepesvári, C. Learning with good feature representations in bandits and in RL with a generative model. arXiv preprint arXiv:1911.07676, 2019.
Leffler, B. R., Littman, M. L., and Edmunds, T. Efficient reinforcement learning with relocatable action models. In Association for the Advancement of Artificial Intelligence, pp. 572-577, 2007.
Liu, B., Cai, Q., Yang, Z., and Wang, Z. Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306, 2019.
Mania, H., Guy, A., and Recht, B. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815-857, 2008.
Nemirovsky, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
Neu, G., Antos, A., György, A., and Szepesvári, C. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, pp. 1804-1812, 2010a.
Neu, G., György, A., and Szepesvári, C. The online loop-free stochastic shortest-path problem. In Conference on Learning Theory, volume 2010, pp. 231-243, 2010b.
Neu, G., György, A., and Szepesvári, C. The adversarial stochastic shortest path problem with unknown transition probabilities. In International Conference on Artificial Intelligence and Statistics, pp. 805-813, 2012.
Neu, G., Jonsson, A., and Gómez, V. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
OpenAI. OpenAI Five. https://openai.com/five/, 2019.
Osband, I. and Van Roy, B. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
Osband, I., Van Roy, B., and Wen, Z. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
Rosenberg, A. and Mansour, Y. Online stochastic shortest path with bandit feedback and unknown transition function. In Advances in Neural Information Processing Systems, pp. 2209-2218, 2019a.
Rosenberg, A. and Mansour, Y. Online convex optimization in adversarial Markov decision processes. arXiv preprint arXiv:1905.07773, 2019b.
Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395-411, 2010.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897, 2015.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pp. 5186-5196, 2018a.
Sidford, A., Wang, M., Wu, X., and Ye, Y. Variance reduced value iteration and faster algorithms for solving Markov decision processes. In Symposium on Discrete Algorithms, pp. 770-787, 2018b.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, M. L. PAC model-free reinforcement learning. In International Conference on Machine Learning, pp. 881-888, 2006.
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.
Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 2000.
Tosatto, S., Pirotta, M., D'Eramo, C., and Restelli, M. Boosted fitted Q-iteration. In International Conference on Machine Learning, pp. 3434-3443, 2017.
Van Roy, B. and Dong, S. Comments on the Du-Kakade-Wang-Yang lower bounds. arXiv preprint arXiv:1911.07910, 2019.
Wainwright, M. J. Variance-reduced Q-learning is minimax optimal. arXiv preprint arXiv:1906.04697, 2019.
Wang, L., Cai, Q., Yang, Z., and Wang, Z. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019.
Wang, W. Y., Li, J., and He, X. Deep reinforcement learning for NLP. In Association for Computational Linguistics, pp. 19-21, 2018.
Wen, Z. and Van Roy, B. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3):762-782, 2017.
Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.
Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543-2596, 2010.
Yang, L. and Wang, M. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019a.
Yang, L. and Wang, M. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pp. 6995-7004, 2019b.
Yang, Z., Chen, Y., Hong, M., and Wang, Z. On the global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. arXiv preprint arXiv:1907.06246, 2019a.
Yang, Z., Xie, Y., and Wang, Z. A theoretical analysis of deep Q-learning. arXiv preprint arXiv:1901.00137, 2019b.
Yu, J. Y., Mannor, S., and Shimkin, N. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737-757, 2009.
Zhou, D., He, J., and Gu, Q. Provably efficient reinforcement learning for discounted MDPs with feature mapping. arXiv preprint arXiv:2006.13165, 2020.
Zimin, A. and Neu, G. Online learning in episodic Markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems, pp. 1583-1591, 2013.