# Provably Efficient Exploration in Policy Optimization

Qi Cai 1, Zhuoran Yang 2, Chi Jin 3, Zhaoran Wang 1

Abstract

While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an optimistic version of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves an $\tilde{O}(\sqrt{d^2H^3T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.1

1. Introduction

Coupled with powerful function approximators such as neural networks, policy optimization plays a key role in the tremendous empirical successes of deep reinforcement learning (Silver et al., 2016; 2017; Duan et al., 2016; OpenAI, 2019; Wang et al., 2018). In sharp contrast, the theoretical understanding of policy optimization remains rather limited from both computational and statistical perspectives. More specifically, from the computational perspective, it remained unclear until recently whether policy optimization converges to the globally optimal policy in a finite number of iterations, even given infinite data. Meanwhile, from the statistical perspective, it still remains unclear how to attain the globally optimal policy with a finite regret or sample complexity.

1 Department of Industrial Engineering and Management Sciences, Northwestern University. 2 Department of Operations Research and Financial Engineering, Princeton University. 3 Department of Electrical Engineering, Princeton University. Correspondence to: Qi Cai, Zhuoran Yang, Chi Jin, Zhaoran Wang. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s). 1 The full version of this paper is available on arXiv via https://arxiv.org/pdf/1912.05830.pdf.

A line of recent work (Fazel et al., 2018; Yang et al., 2019a; Abbasi-Yadkori et al., 2019a;b; Bhandari & Russo, 2019; Liu et al., 2019; Agarwal et al., 2019; Wang et al., 2019) answers the computational question affirmatively by proving that a wide variety of policy optimization algorithms, such as policy gradient (PG) (Williams, 1992; Baxter & Bartlett, 2000; Sutton et al., 2000), natural policy gradient (NPG) (Kakade, 2002), trust-region policy optimization (TRPO) (Schulman et al., 2015), proximal policy optimization (PPO) (Schulman et al., 2017), and actor-critic (AC) (Konda & Tsitsiklis, 2000), converge to the globally optimal policy at sublinear rates of convergence, even when they are coupled with neural networks (Liu et al., 2019; Wang et al., 2019). However, such computational efficiency guarantees rely on the regularity condition that the state space is already well explored.
Such a condition is often implied by assuming either access to a simulator (also known as the generative model) (Koenig & Simmons, 1993; Azar et al., 2011; 2012a;b; Sidford et al., 2018a;b; Wainwright, 2019) or finite concentrability coefficients (Munos & Szepesvári, 2008; Antos et al., 2008; Farahmand et al., 2010; Tosatto et al., 2017; Yang et al., 2019b; Chen & Jiang, 2019), both of which are often unavailable in practice. In a more practical setting, the agent sequentially explores the state space, and meanwhile exploits the information at hand by taking the actions that lead to higher expected total rewards. Such an exploration-exploitation tradeoff is better captured by the aforementioned statistical question regarding the regret or sample complexity, which remains even more challenging to answer than the computational question. As a result, such a lack of statistical understanding hinders the development of more sample-efficient policy optimization algorithms beyond heuristics. In fact, empirically, vanilla policy gradient is known to exhibit a possibly worse sample complexity than random search (Mania et al., 2018), even in basic settings such as linear-quadratic regulators. Meanwhile, theoretically, vanilla policy gradient can be shown to suffer from exponentially large variance in the well-known combination lock setting (Kakade, 2003; Leffler et al., 2007; Azar et al., 2012a), which only has a finite state space.

In this paper, we aim to answer the following fundamental question:

Can we design a policy optimization algorithm that incorporates exploration and is provably sample-efficient?

To answer this question, we propose the first policy optimization algorithm that incorporates exploration in a principled manner. In detail, we develop an Optimistic variant of the PPO algorithm, namely OPPO. Our algorithm is also closely related to NPG and TRPO. At each update, OPPO solves a Kullback-Leibler (KL)-regularized policy optimization subproblem, where the linear component of the objective function is defined using the action-value function. As is shown subsequently, solving such a subproblem corresponds to one iteration of infinite-dimensional mirror descent (Nemirovsky & Yudin, 1983) or dual averaging (Xiao, 2010), where the action-value function plays the role of the gradient. To encourage exploration, we explicitly incorporate a bonus function into the action-value function, which quantifies the uncertainty that arises from only observing finite historical data. Through uncertainty quantification, such a bonus function ensures the (conservative) optimism of the updated policy. Compared with NPG, TRPO, and PPO, OPPO only augments the action-value function with the bonus function in an additive manner, which makes it easily implementable in practice.
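Concretely, each OPPO update solves a KL-regularized subproblem of the following form, stated here informally as a preview; the precise version, with the estimated action-value function and the exploration bonus, is given in (3.2)-(3.3) and Algorithm 1:
$$\pi^{k}\leftarrow\mathop{\mathrm{argmax}}_{\pi}\ L^{k-1}(\pi)-\alpha^{-1}\,\mathbb{E}_{\pi^{k-1}}\big[D_{\mathrm{KL}}\big(\pi\,\|\,\pi^{k-1}\big)\big],\qquad \pi_h^{k}(\cdot\,|\,x)\propto\pi_h^{k-1}(\cdot\,|\,x)\cdot\exp\big\{\alpha\,Q_h^{k-1}(x,\cdot)\big\},$$
where $L^{k-1}$ is a local linear approximation of the expected total reward, $\alpha>0$ is a stepsize, and $Q_h^{k-1}$ is an optimistic (bonus-augmented) estimate of the action-value function.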
Theoretically, we establish the sample efficiency of OPPO in an episodic setting of Markov decision processes (MDPs) with full-information feedback, where the transition dynamics are linear in features (Yang & Wang, 2019b;a; Jin et al., 2019; Ayoub et al., 2020; Zhou et al., 2020). In particular, we allow the transition dynamics to be nonstationary within each episode. See also the work of (Du et al., 2019a; Van Roy & Dong, 2019; Lattimore & Szepesvari, 2019) for a related discussion on the necessity of the linear representation. In detail, we prove that OPPO attains a $\sqrt{d^2H^3T}$-regret up to logarithmic factors, where $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps taken by the agent. Note that such a regret does not depend on the numbers of states and actions, and therefore allows them to be even infinite. In particular, OPPO attains such a regret without knowing the transition dynamics or accessing a simulator. Moreover, we prove that, even when the reward functions are adversarially chosen across the episodes, OPPO attains the same regret in terms of competing with the globally optimal policy in hindsight (Cesa-Bianchi & Lugosi, 2006; Bubeck & Cesa-Bianchi, 2012). In comparison, existing algorithms based on value iteration, e.g., optimistic least-squares value iteration (LSVI) (Jin et al., 2019), do not allow adversarially chosen reward functions. Such a notion of robustness partially justifies the empirical advantages of KL-regularized policy optimization (Neu et al., 2017; Geist et al., 2019). To the best of our knowledge, OPPO is the first provably sample-efficient policy optimization algorithm that incorporates exploration.

1.1. Related Work

Our work is based on the aforementioned line of recent work (Fazel et al., 2018; Yang et al., 2019a; Abbasi-Yadkori et al., 2019a;b; Bhandari & Russo, 2019; Liu et al., 2019; Agarwal et al., 2019; Wang et al., 2019) on the computational efficiency of policy optimization, which covers PG, NPG, TRPO, PPO, and AC. In particular, OPPO is based on PPO (and similarly, NPG and TRPO), which is shown to converge to the globally optimal policy at sublinear rates in tabular and linear settings, as well as nonlinear settings involving neural networks (Liu et al., 2019; Wang et al., 2019). However, without assuming access to a simulator or finite concentrability coefficients, both of which imply that the state space is already well explored, it remains unclear whether any of such algorithms is sample-efficient, that is, attains a finite regret or sample complexity. In comparison, by incorporating uncertainty quantification into the action-value function at each update, which explicitly encourages exploration, OPPO not only attains the same computational efficiency as NPG, TRPO, and PPO, but is also shown to be sample-efficient with a $\sqrt{d^2H^3T}$-regret up to logarithmic factors.

Our work is closely related to another line of work (Even-Dar et al., 2009; Yu et al., 2009; Neu et al., 2010a;b; Zimin & Neu, 2013; Neu et al., 2012; Rosenberg & Mansour, 2019b;a) on online MDPs with adversarially chosen reward functions, which mostly focuses on the tabular setting. Assuming the transition dynamics are known and the full information of the reward functions is available, the work of (Even-Dar et al., 2009) establishes a $\tau^2\sqrt{T\log|\mathcal{A}|}$-regret, where $\mathcal{A}$ is the action space, $|\mathcal{A}|$ is its cardinality, and $\tau$ upper bounds the mixing time of the MDP. See also the work of (Yu et al., 2009), which establishes a $T^{2/3}$-regret in a similar setting. Assuming the transition dynamics are known but only the bandit feedback of the received rewards is available, the work of (Neu et al., 2010a;b; Zimin & Neu, 2013) establishes an $H^2\sqrt{|\mathcal{A}|T}/\beta$-regret (Neu et al., 2010b), a $T^{2/3}$-regret (Neu et al., 2010a), and a $\sqrt{H|\mathcal{S}||\mathcal{A}|T}$-regret (Zimin & Neu, 2013), respectively, all up to logarithmic factors. Here $\mathcal{S}$ is the state space and $|\mathcal{S}|$ is its cardinality. In particular, it is assumed by (Neu et al., 2010b) that, with probability at least $\beta$, any state is reachable under any policy.
Assuming the full information of the reward functions is available but the transition dynamics are unknown, the work of (Neu et al., 2012; Rosenberg & Mansour, 2019b) establishes an $H|\mathcal{S}||\mathcal{A}|\sqrt{T}$-regret (Neu et al., 2012) and an $H|\mathcal{S}|\sqrt{|\mathcal{A}|T}$-regret (Rosenberg & Mansour, 2019b), respectively, both up to logarithmic factors. Assuming the transition dynamics are unknown and only the bandit feedback of the received rewards is available, the recent work of (Rosenberg & Mansour, 2019a) establishes an $H|\mathcal{S}|\sqrt{|\mathcal{A}|T}/\beta$-regret up to logarithmic factors. In particular, it is assumed by (Rosenberg & Mansour, 2019a) that, with probability at least $\beta$, any state is reachable under any policy. Without such an assumption, an $H^{3/2}|\mathcal{S}||\mathcal{A}|^{1/4}T^{3/4}$-regret is established. In the latter two settings with unknown transition dynamics, all the existing algorithms (Neu et al., 2012; Rosenberg & Mansour, 2019b;a) follow the gradient direction with respect to the visitation measure, and thus differ from most practical policy optimization algorithms. In comparison, OPPO is not restricted to the tabular setting and indeed follows the gradient direction with respect to the policy. OPPO is simply an optimistic variant of NPG, TRPO, and PPO, which makes it also a practical policy optimization algorithm. In particular, when specialized to the tabular setting, our setting corresponds to the third setting with $d=|\mathcal{S}|^2|\mathcal{A}|$, where OPPO attains an $H^{3/2}|\mathcal{S}|^2|\mathcal{A}|\sqrt{T}$-regret up to logarithmic factors.

Broadly speaking, our work is related to a vast body of work on value-based reinforcement learning in tabular (Jaksch et al., 2010; Osband et al., 2014; Osband & Van Roy, 2016; Azar et al., 2017; Dann et al., 2017; Strehl et al., 2006; Jin et al., 2018) and linear settings (Yang & Wang, 2019b;a; Jin et al., 2019; Ayoub et al., 2020; Zhou et al., 2020), as well as nonlinear settings involving general function approximators (Wen & Van Roy, 2017; Jiang et al., 2017; Du et al., 2019b; Dong et al., 2019). In particular, our setting is the same as the linear setting studied by (Ayoub et al., 2020; Zhou et al., 2020), which generalizes the one proposed by (Yang & Wang, 2019a). We remark that our setting differs from the linear setting studied by (Yang & Wang, 2019b; Jin et al., 2019). It can be shown that the two settings are incomparable in the sense that one does not imply the other (Zhou et al., 2020). Also, our setting is related to the low-Bellman-rank setting studied by (Jiang et al., 2017; Dong et al., 2019). In comparison, we focus on policy-based reinforcement learning, which is significantly less studied in theory. In particular, compared with the work of (Yang & Wang, 2019b;a; Jin et al., 2019; Ayoub et al., 2020; Zhou et al., 2020), which focuses on value-based reinforcement learning, OPPO attains the same $\sqrt{T}$-regret even in the presence of adversarially chosen reward functions. Compared with optimism-led iterative value-function elimination (OLIVE) (Jiang et al., 2017; Dong et al., 2019), which handles the more general low-Bellman-rank setting but is only sample-efficient, OPPO simultaneously attains computational efficiency and sample efficiency in the linear setting.
Despite the differences between policy-based and value-based reinforcement learning, our work shows that the general principle of optimism in the face of uncertainty (Auer et al., 2002; Bubeck & Cesa-Bianchi, 2012) can be carried over from existing algorithms based on value iteration, e.g., optimistic LSVI, into policy optimization algorithms, e.g., NPG, TRPO, and PPO, to make them sample-efficient, which further leads to a new general principle of conservative optimism in the face of uncertainty and adversary that additionally allows adversarially chosen reward functions.

1.2. Notation

We denote by $\|\cdot\|_2$ the $\ell_2$-norm of a vector or the spectral norm of a matrix. We denote by $\Delta(\mathcal{A})$ the set of probability distributions on a set $\mathcal{A}$ and correspondingly define $\Delta(\mathcal{A}\,|\,\mathcal{S},H)=\{\{\pi_h(\cdot\,|\,x)\}_{h=1}^H:\pi_h(\cdot\,|\,x)\in\Delta(\mathcal{A})\text{ for any }x\in\mathcal{S}\text{ and }h\in[H]\}$ for any set $\mathcal{S}$ and $H\in\mathbb{Z}_+$. For $p_1,p_2\in\Delta(\mathcal{A})$, we denote by $D_{\mathrm{KL}}(p_1\,\|\,p_2)$ the KL-divergence,
$$D_{\mathrm{KL}}(p_1\,\|\,p_2)=\sum_{a\in\mathcal{A}}p_1(a)\log\frac{p_1(a)}{p_2(a)}.$$
Throughout this paper, we denote by $C,C',C'',\ldots$ absolute constants whose values can vary from line to line.

2. Preliminaries

2.1. MDPs with Adversarial Rewards

In this paper, we consider an episodic MDP $(\mathcal{S},\mathcal{A},H,\mathcal{P},r)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, respectively, $H$ is the length of each episode, $\mathcal{P}_h(\cdot\,|\,\cdot,\cdot)$ is the transition kernel from a state-action pair to the next state at the $h$-th step of each episode, and $r_h^k:\mathcal{S}\times\mathcal{A}\to[0,1]$ is the reward function at the $h$-th step of the $k$-th episode. We assume that the reward function is deterministic, which is without loss of generality, as our subsequent regret analysis readily generalizes to the setting where the reward function is stochastic.

At the beginning of the $k$-th episode, the agent determines a policy $\pi^k=\{\pi_h^k\}_{h=1}^H\in\Delta(\mathcal{A}\,|\,\mathcal{S},H)$. We assume that the initial state $x_1^k$ is fixed to $x_1\in\mathcal{S}$ across all the episodes, which is without loss of generality, as our subsequent regret analysis readily generalizes to the setting where $x_1^k$ is sampled from a fixed distribution across all the episodes. Then the agent iteratively interacts with the environment as follows. At the $h$-th step, the agent receives a state $x_h^k$ and takes an action $a_h^k$ following $\pi_h^k(\cdot\,|\,x_h^k)$. Subsequently, the agent receives the reward $r_h^k(x_h^k,a_h^k)$ and the next state $x_{h+1}^k$ following $\mathcal{P}_h(\cdot\,|\,x_h^k,a_h^k)$. The $k$-th episode ends after the agent receives the last reward $r_H^k(x_H^k,a_H^k)$.

We allow the reward function $r^k=\{r_h^k\}_{h=1}^H$ to be adversarially chosen by the environment at the beginning of the $k$-th episode, which can depend on the $(k-1)$ historical trajectories. The reward function $r_h^k$ is revealed to the agent after it takes the action $a_h^k$ at the state $x_h^k$, which together determine the received reward $r_h^k(x_h^k,a_h^k)$. We define the regret in terms of competing with the globally optimal policy in hindsight (Cesa-Bianchi & Lugosi, 2006; Bubeck & Cesa-Bianchi, 2012) as
$$\mathrm{Regret}(T)=\max_{\pi\in\Delta(\mathcal{A}\,|\,\mathcal{S},H)}\sum_{k=1}^K\big(V_1^{\pi,k}(x_1^k)-V_1^{\pi^k,k}(x_1^k)\big), \qquad (2.1)$$
where $T=HK$ is the total number of steps taken by the agent in all the $K$ episodes.

For any policy $\pi=\{\pi_h\}_{h=1}^H\in\Delta(\mathcal{A}\,|\,\mathcal{S},H)$, the value function $V_h^{\pi,k}:\mathcal{S}\to\mathbb{R}$ associated with the reward function $r^k=\{r_h^k\}_{h=1}^H$ is defined by
$$V_h^{\pi,k}(x)=\mathbb{E}_\pi\Big[\sum_{i=h}^H r_i^k(x_i,a_i)\,\Big|\,x_h=x\Big]. \qquad (2.2)$$
Here we denote by $\mathbb{E}_\pi[\cdot]$ the expectation with respect to the randomness of the state-action sequence $\{(x_h,a_h)\}_{h=1}^H$, where the action $a_h$ follows the policy $\pi_h(\cdot\,|\,x_h)$ at the state $x_h$ and the next state $x_{h+1}$ follows the transition dynamics $\mathcal{P}_h(\cdot\,|\,x_h,a_h)$. Correspondingly, we define the action-value function (also known as the Q-function) $Q_h^{\pi,k}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ as
$$Q_h^{\pi,k}(x,a)=\mathbb{E}_\pi\Big[\sum_{i=h}^H r_i^k(x_i,a_i)\,\Big|\,x_h=x,\,a_h=a\Big]. \qquad (2.3)$$
By the definitions in (2.2) and (2.3), we have the following Bellman equation,
$$V_h^{\pi,k}(x)=\big\langle Q_h^{\pi,k}(x,\cdot),\,\pi_h(\cdot\,|\,x)\big\rangle_{\mathcal{A}},\qquad Q_h^{\pi,k}=r_h^k+\mathbb{P}_h V_{h+1}^{\pi,k}, \qquad (2.4)$$
where $V_{H+1}^{\pi,k}$ is a zero function. Here $\langle\cdot,\cdot\rangle_{\mathcal{A}}$ denotes the inner product over $\mathcal{A}$, where the subscript is omitted subsequently if it is clear from the context. Also, $\mathbb{P}_h$ is the operator form of the transition kernel $\mathcal{P}_h(\cdot\,|\,\cdot,\cdot)$, which is defined by
$$(\mathbb{P}_h f)(x,a)=\mathbb{E}\big[f(x')\,\big|\,x'\sim\mathcal{P}_h(\cdot\,|\,x,a)\big] \qquad (2.5)$$
for any function $f:\mathcal{S}\to\mathbb{R}$.
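To make the interaction protocol of Section 2.1 concrete, the following is a minimal Python sketch (ours, for illustration only) of a single episode under full-information feedback, using a randomly generated tabular MDP; all names and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 5, 3, 4                               # a small tabular MDP, for illustration
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[h, x, a] = P_h(. | x, a)

def uniform_policy(x):                          # pi_h(. | x); uniform here for simplicity
    return np.ones(A) / A

def run_episode(policy, reward_fns, x1=0):
    """One episode of the protocol in Section 2.1 (full-information feedback):
    the adversary fixes r^k = {r_h^k} before the episode, but each r_h^k is
    revealed to the agent only after it takes a_h^k at x_h^k."""
    x, total = x1, 0.0
    for h in range(H):
        a = rng.choice(A, p=policy(x))          # a_h^k ~ pi_h^k(. | x_h^k)
        total += reward_fns[h][x, a]            # observe r_h^k(x_h^k, a_h^k) in [0, 1]
        x = rng.choice(S, p=P[h, x, a])         # x_{h+1}^k ~ P_h(. | x_h^k, a_h^k)
    return total

rewards = [rng.uniform(0, 1, size=(S, A)) for _ in range(H)]   # the adversary's choice r^k
print("total reward of one episode:", round(run_episode(uniform_policy, rewards), 3))
```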
By allowing the reward function to be adversarially chosen in each episode, our setting generalizes the stationary setting commonly adopted by the existing work on value-based reinforcement learning (Jaksch et al., 2010; Osband et al., 2014; Osband & Van Roy, 2016; Azar et al., 2017; Dann et al., 2017; Strehl et al., 2006; Jin et al., 2018; 2019; Yang & Wang, 2019b;a), where the reward function is fixed across all the episodes.

2.2. Linear Function Approximations

We consider the linear setting where the transition dynamics are linear in a feature map, which is formalized in the following assumption.

Assumption 2.1 (Linear MDP). We assume that the MDP $(\mathcal{S},\mathcal{A},H,\mathcal{P},r)$ is a linear MDP with the known feature map $\psi:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to\mathbb{R}^d$, that is, for any $h\in[H]$, there exists $\theta_h\in\mathbb{R}^d$ with $\|\theta_h\|_2\le\sqrt{d}$ such that
$$\mathcal{P}_h(x'\,|\,x,a)=\psi(x,a,x')^\top\theta_h$$
for any $(x,a,x')\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}$. Also, we assume that
$$\Big\|\int_{\mathcal{S}}\psi(x,a,x')\,V(x')\,\mathrm{d}x'\Big\|_2\le\sqrt{d}\,H$$
for any $(x,a)\in\mathcal{S}\times\mathcal{A}$ and any function $V:\mathcal{S}\to[0,H]$.

See (Ayoub et al., 2020; Zhou et al., 2020) for various examples of linear MDPs, including the one proposed by (Yang & Wang, 2019a). In particular, a tabular MDP corresponds to the linear MDP with $d=|\mathcal{S}|^2|\mathcal{A}|$ and the feature vector $\psi(x,a,x')$ being the canonical basis $e_{(x,a,x')}$ of $\mathbb{R}^{|\mathcal{S}|^2|\mathcal{A}|}$. See also (Du et al., 2019a; Van Roy & Dong, 2019; Lattimore & Szepesvari, 2019) for a related discussion on the necessity of the linear representation.

We remark that (Yang & Wang, 2019b; Jin et al., 2019) study another variant of linear MDPs, where the transition kernel can be written as $\mathcal{P}_h(x'\,|\,x,a)=\varphi(x,a)^\top\mu_h(x')$ for any $h\in[H]$ and $(x,a,x')\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}$. Here $\varphi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^d$ is a known feature map and $\mu_h:\mathcal{S}\to\mathbb{R}^d$ is an unknown function on $\mathcal{S}$ for any $h\in[H]$. Although the variant of linear MDPs defined in Assumption 2.1 and the one studied by (Yang & Wang, 2019b; Jin et al., 2019) both cover the tabular setting and the one proposed by (Yang & Wang, 2019a) as special cases, they are two different definitions of linear MDPs as their feature maps $\psi(\cdot,\cdot,\cdot)$ and $\varphi(\cdot,\cdot)$ are defined on different domains. It can be shown that the two variants are incomparable in the sense that one does not imply the other (Zhou et al., 2020).
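As a quick illustration of the tabular special case above (our own numerical check, not from the paper), the following snippet builds the canonical-basis features and verifies that the transition probabilities are linear in them:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A = 4, 3
d = S * S * A                                   # d = |S|^2 |A| in the tabular case

def psi(x, a, x_next):
    """Canonical-basis feature psi(x, a, x') = e_{(x, a, x')} in R^{|S|^2 |A|}."""
    e = np.zeros(d)
    e[(x * A + a) * S + x_next] = 1.0
    return e

P_h = rng.dirichlet(np.ones(S), size=(S, A))    # an arbitrary tabular kernel P_h(. | x, a)
theta_h = P_h.reshape(-1)                       # stacking the probabilities gives theta_h,
                                                # so that P_h(x' | x, a) = psi(x, a, x')^T theta_h
x, a, x_next = 2, 1, 3
assert np.isclose(psi(x, a, x_next) @ theta_h, P_h[x, a, x_next])

# Note: ||theta_h||_2 <= sqrt(|S||A|) <= sqrt(d), consistent with Assumption 2.1.
print("check passed:", P_h[x, a, x_next])
```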
3. Algorithm and Theory

3.1. Optimistic PPO (OPPO)

We present Optimistic PPO (OPPO) in Algorithm 1, which involves a policy improvement step and a policy evaluation step.

Algorithm 1 Optimistic PPO (OPPO)
1: Initialize $\{\pi_h^0\}_{h=1}^H$ as uniform distributions on $\mathcal{A}$ and $\{Q_h^0\}_{h=1}^H$ as zero functions.
2: for episode $k=1,2,\ldots,K$ do
3:   Receive the initial state $x_1^k$.
4:   for step $h=1,2,\ldots,H$ do
5:     Update the policy by
6:       $\pi_h^k(\cdot\,|\,\cdot)\propto\pi_h^{k-1}(\cdot\,|\,\cdot)\exp\{\alpha\,Q_h^{k-1}(\cdot,\cdot)\}$.
7:     Take the action following $a_h^k\sim\pi_h^k(\cdot\,|\,x_h^k)$.
8:     Observe the reward function $r_h^k(\cdot,\cdot)$.
9:     Receive the next state $x_{h+1}^k$.
10:  Initialize $V_{H+1}^k(\cdot)$ as a zero function.
11:  for step $h=H,H-1,\ldots,1$ do
12:    $w_h^k\leftarrow\mathop{\mathrm{argmin}}_{w\in\mathbb{R}^d}\sum_{\tau=1}^{k-1}\big(V_{h+1}^\tau(x_{h+1}^\tau)-\phi_h^\tau(x_h^\tau,a_h^\tau)^\top w\big)^2+\lambda\|w\|_2^2$.
13:    $\Lambda_h^k\leftarrow\sum_{\tau=1}^{k-1}\phi_h^\tau(x_h^\tau,a_h^\tau)\,\phi_h^\tau(x_h^\tau,a_h^\tau)^\top+\lambda\cdot I$.
14:    $\phi_h^k(\cdot,\cdot)\leftarrow\int_{\mathcal{S}}\psi(\cdot,\cdot,x')\,V_{h+1}^k(x')\,\mathrm{d}x'$.
15:    $\Gamma_h^k(\cdot,\cdot)\leftarrow\beta\cdot[\phi_h^k(\cdot,\cdot)^\top(\Lambda_h^k)^{-1}\phi_h^k(\cdot,\cdot)]^{1/2}$.
16:    $Q_h^k(\cdot,\cdot)\leftarrow r_h^k(\cdot,\cdot)+\phi_h^k(\cdot,\cdot)^\top w_h^k+\Gamma_h^k(\cdot,\cdot)$.
17:    $Q_h^k(\cdot,\cdot)\leftarrow\min\{Q_h^k(\cdot,\cdot),H-h+1\}^+$.
18:    $V_h^k(\cdot)\leftarrow\langle Q_h^k(\cdot,\cdot),\pi_h^k(\cdot\,|\,\cdot)\rangle_{\mathcal{A}}$.

Policy Improvement Step. In the $k$-th episode, OPPO updates $\pi^k$ based on $\pi^{k-1}$ (Lines 4-9 of Algorithm 1). In detail, we define the following linear function of the policy $\pi\in\Delta(\mathcal{A}\,|\,\mathcal{S},H)$,
$$L^{k-1}(\pi)=V_1^{\pi^{k-1},k-1}(x_1^k)+\mathbb{E}_{\pi^{k-1}}\Big[\sum_{h=1}^H\big\langle Q_h^{\pi^{k-1},k-1}(x_h,\cdot),\,\pi_h(\cdot\,|\,x_h)-\pi_h^{k-1}(\cdot\,|\,x_h)\big\rangle\,\Big|\,x_1=x_1^k\Big], \qquad (3.1)$$
which is a local linear approximation of $V_1^{\pi,k-1}(x_1^k)$ (Schulman et al., 2015; 2017). In particular, we have that $L^{k-1}(\pi^{k-1})=V_1^{\pi^{k-1},k-1}(x_1^k)$. The policy improvement step is defined by
$$\pi^k\leftarrow\mathop{\mathrm{argmax}}_{\pi\in\Delta(\mathcal{A}\,|\,\mathcal{S},H)}\ L^{k-1}(\pi)-\alpha^{-1}\,\mathbb{E}_{\pi^{k-1}}\big[\widetilde{D}_{\mathrm{KL}}(\pi\,\|\,\pi^{k-1})\,\big|\,x_1=x_1^k\big], \qquad (3.2)$$
where
$$\widetilde{D}_{\mathrm{KL}}(\pi\,\|\,\pi^{k-1})=\sum_{h=1}^H D_{\mathrm{KL}}\big(\pi_h(\cdot\,|\,x_h)\,\big\|\,\pi_h^{k-1}(\cdot\,|\,x_h)\big).$$
Here the KL-divergence regularizes $\pi$ to be close to $\pi^{k-1}$ so that $L^{k-1}(\pi)$ well approximates $V_1^{\pi,k-1}(x_1^k)$, which further ensures that the updated policy $\pi^k$ improves the expected total reward (associated with the reward function $r^{k-1}$) upon $\pi^{k-1}$. Also, $\alpha>0$ is the stepsize, which is specified in Theorem 3.1. By executing the updated policy $\pi^k$, the agent receives the state-action sequence $\{(x_h^k,a_h^k)\}_{h=1}^H$ and observes the reward function $r^k$, which together determine the received rewards $\{r_h^k(x_h^k,a_h^k)\}_{h=1}^H$.

The policy improvement step defined in (3.2) corresponds to one iteration of NPG (Kakade, 2002), TRPO (Schulman et al., 2015), and PPO (Schulman et al., 2017). In particular, PPO solves the same KL-regularized policy optimization subproblem as in (3.2) at each iteration, while TRPO solves an equivalent KL-constrained subproblem. In the special case where the reward function $r_h^{k-1}$ is linear in the feature map $\phi_h^{k-1}$ defined subsequently, which implies that the Q-function $Q_h^{\pi^{k-1},k-1}$ is also linear in $\phi_h^{k-1}$, the updated policy $\pi^k$ can be equivalently obtained by one iteration of NPG when the policy is parameterized by an energy-based distribution (Agarwal et al., 2019; Wang et al., 2019). Such a policy improvement step can also be cast as one iteration of infinite-dimensional mirror descent (Nemirovsky & Yudin, 1983) or dual averaging (Xiao, 2010), where the Q-function plays the role of the gradient (Liu et al., 2019; Wang et al., 2019). The updated policy $\pi^k$ obtained in (3.2) takes the following closed form,
$$\pi_h^k(\cdot\,|\,x)\propto\pi_h^{k-1}(\cdot\,|\,x)\cdot\exp\big\{\alpha\,Q_h^{\pi^{k-1},k-1}(x,\cdot)\big\} \qquad (3.3)$$
for any $h\in[H]$ and $x\in\mathcal{S}$. However, the Q-function $Q_h^{\pi^{k-1},k-1}$ remains to be estimated through the subsequent policy evaluation step. We denote by $Q_h^{k-1}$ the estimated Q-function, which replaces the Q-function $Q_h^{\pi^{k-1},k-1}$ in (3.1)-(3.3) and is correspondingly used in Line 6 of Algorithm 1.

Policy Evaluation Step. At the end of the $k$-th episode, OPPO evaluates the policy $\pi^k$ based on the $(k-1)$ historical trajectories (Lines 11-18 of Algorithm 1). In detail, for any $h\in[H]$, we define the empirical mean-squared Bellman error (MSBE) (Sutton & Barto, 2018) as
$$M_h^k(w)=\sum_{\tau=1}^{k-1}\big(V_{h+1}^\tau(x_{h+1}^\tau)-\phi_h^\tau(x_h^\tau,a_h^\tau)^\top w\big)^2,\quad\text{where}\quad\phi_h^\tau(\cdot,\cdot)=\int_{\mathcal{S}}\psi(\cdot,\cdot,x')\,V_{h+1}^\tau(x')\,\mathrm{d}x',\quad V_{h+1}^\tau(\cdot)=\big\langle Q_{h+1}^\tau(\cdot,\cdot),\,\pi_{h+1}^\tau(\cdot\,|\,\cdot)\big\rangle_{\mathcal{A}}, \qquad (3.4)$$
while we initialize $V_{H+1}^\tau$ as a zero function on $\mathcal{S}$. The policy evaluation step is defined by iteratively updating the estimated Q-function $Q^k=\{Q_h^k\}_{h=1}^H$ associated with the reward function $r^k=\{r_h^k\}_{h=1}^H$,
$$w_h^k\leftarrow\mathop{\mathrm{argmin}}_{w\in\mathbb{R}^d}\ M_h^k(w)+\lambda\|w\|_2^2,\qquad Q_h^k(\cdot,\cdot)\leftarrow\min\big\{r_h^k(\cdot,\cdot)+\phi_h^k(\cdot,\cdot)^\top w_h^k+\Gamma_h^k(\cdot,\cdot),\,H-h+1\big\}^+, \qquad (3.5)$$
in the order of $h=H,H-1,\ldots,1$. Here $\lambda>0$ is the regularization parameter, which is specified in Theorem 3.1. Also, $\Gamma_h^k:\mathcal{S}\times\mathcal{A}\to\mathbb{R}_+$ is a bonus function, which quantifies the uncertainty in estimating the Q-function $Q_h^{\pi^k,k}$ based on only finite historical data. In particular, the weight vector $w_h^k$ obtained in (3.5) and the bonus function $\Gamma_h^k$ take the following closed forms,
$$w_h^k=(\Lambda_h^k)^{-1}\Big(\sum_{\tau=1}^{k-1}\phi_h^\tau(x_h^\tau,a_h^\tau)\,V_{h+1}^\tau(x_{h+1}^\tau)\Big),\qquad\Gamma_h^k(\cdot,\cdot)=\beta\cdot\big[\phi_h^k(\cdot,\cdot)^\top(\Lambda_h^k)^{-1}\phi_h^k(\cdot,\cdot)\big]^{1/2}, \qquad (3.6)$$
where $\Lambda_h^k=\sum_{\tau=1}^{k-1}\phi_h^\tau(x_h^\tau,a_h^\tau)\,\phi_h^\tau(x_h^\tau,a_h^\tau)^\top+\lambda\cdot I$. Here $\beta>0$ scales with $d$, $H$, and $K$, which is specified in Theorem 3.1. The policy evaluation step defined in (3.5) corresponds to one iteration of least-squares temporal difference (LSTD) (Bradtke & Barto, 1996; Boyan, 2002).
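To make the two steps of Algorithm 1 concrete, the following is a compact Python sketch (ours, not the authors' implementation) for a small tabular MDP treated as a linear MDP with the canonical-basis features of Section 2.2. The constants `alpha`, `beta`, and `lam` are illustrative placeholders rather than the theoretical choices prescribed in Theorem 3.1, and the integral in Line 14 of Algorithm 1 reduces to a finite sum.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H, K = 4, 3, 5, 50
d = S * A * S                                    # tabular case: d = |S|^2 |A|
alpha, beta, lam = 0.5, 0.1, 1.0                 # illustrative constants only

P = rng.dirichlet(np.ones(S), size=(H, S, A))    # true P_h(. | x, a); unknown to the agent
psi = np.zeros((S, A, S, d))                     # canonical-basis features psi(x, a, x')
for x in range(S):
    for a in range(A):
        for xn in range(S):
            psi[x, a, xn, (x * A + a) * S + xn] = 1.0

pi = np.full((H, S, A), 1.0 / A)                 # pi^0: uniform distributions on A
Q = np.zeros((H, S, A))                          # Q^0: zero functions
hist = [[] for _ in range(H)]                    # per step h: (phi_h^tau(x, a), V_{h+1}^tau(x')) pairs

for k in range(1, K + 1):
    r = rng.uniform(0, 1, size=(H, S, A))        # adversarial reward r^k (full information)

    # Policy improvement (Lines 4-6): pi_h^k is proportional to pi_h^{k-1} * exp(alpha * Q_h^{k-1}).
    logits = np.log(pi) + alpha * Q
    pi = np.exp(logits - logits.max(axis=2, keepdims=True))
    pi /= pi.sum(axis=2, keepdims=True)

    # Execute pi^k and record the trajectory (Lines 7-9).
    x, traj = 0, []
    for h in range(H):
        a = rng.choice(A, p=pi[h, x])
        xn = rng.choice(S, p=P[h, x, a])
        traj.append((x, a, xn))
        x = xn

    # Policy evaluation (Lines 10-18): ridge regression plus a UCB bonus.
    V_next = np.zeros(S)                         # V_{H+1}^k = 0
    for h in range(H - 1, -1, -1):               # backward over steps
        Lam = lam * np.eye(d)
        b = np.zeros(d)
        for f, v in hist[h]:                     # data from episodes tau = 1, ..., k-1
            Lam += np.outer(f, f)
            b += f * v
        w = np.linalg.solve(Lam, b)              # closed-form ridge estimate w_h^k
        phi = np.einsum('xaud,u->xad', psi, V_next)          # phi_h^k(x, a)
        Lam_inv = np.linalg.inv(Lam)
        bonus = beta * np.sqrt(np.einsum('xad,de,xae->xa', phi, Lam_inv, phi))
        Q[h] = np.clip(r[h] + phi @ w + bonus, 0.0, H - h)   # truncate to [0, H - h + 1] (1-indexed)
        V_h = np.sum(Q[h] * pi[h], axis=1)       # V_h^k(x) = <Q_h^k(x, .), pi_h^k(. | x)>
        x_h, a_h, x_next = traj[h]
        hist[h].append((phi[x_h, a_h].copy(), V_next[x_next]))   # store for future episodes
        V_next = V_h

print("optimistic estimate of V_1 at the initial state:", round(float(V_next[0]), 3))
```

The backward pass mirrors Lines 10-18 (a ridge-regression estimate of $\mathbb{P}_h V_{h+1}^k$ plus the UCB bonus, truncated to $[0,H-h+1]$), while the forward pass mirrors the multiplicative-weights update of Line 6.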
To interpret the policy evaluation step, note that
$$\mathbb{E}\big[V_{h+1}^\tau(x')\,\big|\,x'\sim\mathcal{P}_h(\cdot\,|\,x,a)\big]=(\mathbb{P}_h V_{h+1}^\tau)(x,a)=\phi_h^\tau(x,a)^\top\theta_h$$
for any $\tau\in[k-1]$ and $(x,a)\in\mathcal{S}\times\mathcal{A}$ in the empirical MSBE defined in (3.4). Hence $\phi_h^k(\cdot,\cdot)^\top w_h^k$ in (3.5) is an estimator of $\mathbb{P}_h V_{h+1}^k$ in the Bellman equation defined in (2.4) (with $V_{h+1}^{\pi,k}$ replaced by $V_{h+1}^k$). Meanwhile, we construct the bonus function $\Gamma_h^k$ according to (3.6) so that $\phi_h^k(\cdot,\cdot)^\top w_h^k+\Gamma_h^k(\cdot,\cdot)$ is an upper confidence bound (UCB), that is, it holds that
$$\phi_h^k(x,a)^\top w_h^k+\Gamma_h^k(x,a)\ \ge\ (\mathbb{P}_h V_{h+1}^k)(x,a)$$
with high probability, which is subsequently characterized in Lemma 4.3. Here the inequality holds uniformly for any $(h,k)\in[H]\times[K]$ and $(x,a)\in\mathcal{S}\times\mathcal{A}$. As the fact that $r_h^k\in[0,1]$ for any $h\in[H]$ implies that $Q_h^{\pi^k,k}\in[0,H-h+1]$, we truncate $Q_h^k$ to the range $[0,H-h+1]$ in (3.5), which is correspondingly used in Line 17 of Algorithm 1.

3.2. Regret Analysis

We establish an upper bound of the regret of OPPO (Algorithm 1) in the following theorem. Recall that the regret is defined in (2.1) and $T=HK$ is the total number of steps taken by the agent, where $H$ is the length of each episode and $K$ is the total number of episodes. Also, $|\mathcal{A}|$ is the cardinality of $\mathcal{A}$ and $d$ is the dimension of the feature map $\psi$.

Theorem 3.1 (Total Regret). Let $\alpha=\sqrt{2\log|\mathcal{A}|/(HT)}$ in (3.2) and Line 6 of Algorithm 1, $\lambda=1$ in (3.5) and Line 12 of Algorithm 1, and $\beta=C\sqrt{dH^2\log(dT/\zeta)}$ in (3.6) and Line 15 of Algorithm 1, where $C>1$ is an absolute constant and $\zeta\in(0,1]$. Under Assumption 2.1 and the assumption that $\log|\mathcal{A}|=O(d^2[\log(dT/\zeta)]^2)$, the regret of OPPO satisfies
$$\mathrm{Regret}(T)\le C'\sqrt{d^2H^3T}\,\log(dT/\zeta)$$
with probability at least $1-\zeta$, where $C'>0$ is an absolute constant.

Proof. See Section 4 for a proof sketch and Appendix C for a detailed proof.

Theorem 3.1 proves that OPPO attains a $\sqrt{d^2H^3T}$-regret up to logarithmic factors, where the dependency on the total number of steps $T$ is optimal. In the stationary setting where the reward function and the initial state are fixed across all the episodes, such a regret translates to a $d^2H^4/\varepsilon^2$-sample complexity (up to logarithmic factors) following the argument of (Jin et al., 2018) (Section 3.1). Here $\varepsilon>0$ measures the suboptimality of the obtained policy $\pi^{\bar k}$ in the following sense,
$$\max_{\pi\in\Delta(\mathcal{A}\,|\,\mathcal{S},H)}V_1^\pi(x_1)-\mathbb{E}\big[V_1^{\pi^{\bar k}}(x_1)\big]\le\varepsilon,$$
where $\bar k$ is sampled from $[K]$ uniformly at random. Here we denote the value function by $V_1^\pi$ and the initial state by $x_1=x_1^k$ for any $k\in[K]$, as the reward function and initial state are fixed across all the episodes. Moreover, compared with the work of (Yang & Wang, 2019b;a; Jin et al., 2019; Ayoub et al., 2020; Zhou et al., 2020), OPPO additionally allows adversarially chosen reward functions without exacerbating the regret, which leads to a notion of robustness. Also, as a tabular MDP satisfies Assumption 2.1 with $d=|\mathcal{S}|^2|\mathcal{A}|$ and $\psi$ being the canonical basis of $\mathbb{R}^d$, Theorem 3.1 yields an $|\mathcal{S}|^2|\mathcal{A}|\sqrt{H^3T}$-regret in the tabular setting. Our subsequent discussion intuitively explains how OPPO achieves such a notion of robustness while attaining the $\sqrt{d^2H^3T}$-regret (up to logarithmic factors).
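As a side note, one way to see the stated regret-to-sample-complexity conversion is the following back-of-the-envelope calculation (ours, ignoring logarithmic factors and absolute constants):
$$\frac{1}{K}\sum_{k=1}^K\big(V_1^{\pi^*}(x_1)-V_1^{\pi^k}(x_1)\big)\le\frac{\mathrm{Regret}(T)}{K}\lesssim\frac{\sqrt{d^2H^3\cdot HK}}{K}=\sqrt{\frac{d^2H^4}{K}},$$
so the uniformly sampled policy $\pi^{\bar k}$ is $\varepsilon$-suboptimal in expectation once $K\gtrsim d^2H^4/\varepsilon^2$ episodes have been collected.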
Discussion of Mechanisms. In the sequel, we consider the ideal setting where the transition dynamics are known, which, by the Bellman equation defined in (2.4), allows us to access the Q-function $Q_h^{\pi,k}$ for any policy $\pi$ and $(h,k)\in[H]\times[K]$ once given the reward function $r^k$. The following lemma connects the difference between two policies to the difference between their expected total rewards through the Q-function.

Lemma 3.2 (Performance Difference). For any policies $\pi,\pi'\in\Delta(\mathcal{A}\,|\,\mathcal{S},H)$ and $k\in[K]$, it holds that
$$V_1^{\pi',k}(x_1^k)-V_1^{\pi,k}(x_1^k)=\mathbb{E}_{\pi'}\Big[\sum_{h=1}^H\big\langle Q_h^{\pi,k}(x_h,\cdot),\,\pi_h'(\cdot\,|\,x_h)-\pi_h(\cdot\,|\,x_h)\big\rangle\,\Big|\,x_1=x_1^k\Big]. \qquad (3.7)$$

Proof. See Appendix A.1 for a detailed proof.

For notational simplicity, we omit the conditioning on $x_1=x_1^k$, e.g., in (3.7) of Lemma 3.2, subsequently. The following lemma characterizes the policy improvement step defined in (3.2), where the updated policy $\pi^k$ takes the closed form in (3.3).

Lemma 3.3 (One-Step Descent). For any distributions $p^*,p\in\Delta(\mathcal{A})$, state $x\in\mathcal{S}$, and function $Q:\mathcal{S}\times\mathcal{A}\to[0,H]$, it holds for $p'\in\Delta(\mathcal{A})$ with $p'(\cdot)\propto p(\cdot)\cdot\exp\{\alpha\,Q(x,\cdot)\}$ that
$$\big\langle Q(x,\cdot),\,p^*(\cdot)-p(\cdot)\big\rangle\le\frac{\alpha H^2}{2}+\alpha^{-1}\big(D_{\mathrm{KL}}(p^*\,\|\,p)-D_{\mathrm{KL}}(p^*\,\|\,p')\big).$$

Proof. See Appendix A.2 for a detailed proof.
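The one-step descent inequality, as reconstructed above, can be sanity-checked numerically; the snippet below (ours, purely illustrative) verifies it on random distributions and random Q-values.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

H, n_actions, alpha = 5, 6, 0.3
for _ in range(1000):
    p_star = rng.dirichlet(np.ones(n_actions))
    p = rng.dirichlet(np.ones(n_actions))
    q_vals = rng.uniform(0, H, size=n_actions)      # Q(x, .) with values in [0, H]

    p_new = p * np.exp(alpha * q_vals)              # p'(.) proportional to p(.) exp(alpha Q(x, .))
    p_new /= p_new.sum()

    lhs = float(q_vals @ (p_star - p))
    rhs = alpha * H ** 2 / 2 + (kl(p_star, p) - kl(p_star, p_new)) / alpha
    assert lhs <= rhs + 1e-9

print("one-step descent inequality held on all random instances")
```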
Corresponding to the definition of the regret in (2.1), we define the globally optimal policy in hindsight (Cesa-Bianchi & Lugosi, 2006; Bubeck & Cesa-Bianchi, 2012) as
$$\pi^*=\mathop{\mathrm{argmax}}_{\pi\in\Delta(\mathcal{A}\,|\,\mathcal{S},H)}\sum_{k=1}^K V_1^{\pi,k}(x_1^k), \qquad (3.8)$$
which by definition attains zero regret. In the ideal setting where the Q-function $Q_h^{\pi^k,k}$ associated with the reward function $r^k$ is known and the updated policy $\pi_h^{k+1}$ takes the closed form in (3.3), Lemma 3.3 implies
$$\big\langle Q_h^{\pi^k,k}(x,\cdot),\,\pi_h^*(\cdot\,|\,x)-\pi_h^k(\cdot\,|\,x)\big\rangle\le\frac{\alpha H^2}{2}+\alpha^{-1}\Big(D_{\mathrm{KL}}\big(\pi_h^*(\cdot\,|\,x)\,\|\,\pi_h^k(\cdot\,|\,x)\big)-D_{\mathrm{KL}}\big(\pi_h^*(\cdot\,|\,x)\,\|\,\pi_h^{k+1}(\cdot\,|\,x)\big)\Big) \qquad (3.9)$$
for any $(h,k)\in[H]\times[K]$ and $x\in\mathcal{S}$. Combining (3.9) with Lemma 3.2, we obtain
$$\sum_{k=1}^K\big(V_1^{\pi^*,k}(x_1^k)-V_1^{\pi^k,k}(x_1^k)\big)\le\frac{\alpha H^3K}{2}+\alpha^{-1}\sum_{h=1}^H\mathbb{E}_{\pi^*}\big[D_{\mathrm{KL}}\big(\pi_h^*(\cdot\,|\,x_h)\,\|\,\pi_h^1(\cdot\,|\,x_h)\big)\big]\le\frac{\alpha H^3K}{2}+\alpha^{-1}H\log|\mathcal{A}|. \qquad (3.10)$$
Here the first inequality follows from telescoping the right-hand side of (3.9) across all the episodes and the fact that the KL-divergence is nonnegative. Also, the second inequality follows from the initialization of the policy and Q-function in Line 1 of Algorithm 1. Setting $\alpha=\sqrt{2\log|\mathcal{A}|/(HT)}$ in (3.10), we establish a $\sqrt{2H^3T\log|\mathcal{A}|}$-regret in the ideal setting.

Such an ideal setting demonstrates the key role of the KL-divergence in the policy improvement step defined in (3.2), where $\alpha>0$ is the stepsize. Intuitively, without the KL-divergence, that is, setting $\alpha\to\infty$, the upper bound of the regret on the right-hand side of (3.10) tends to infinity. In fact, for any $\alpha<\infty$, the updated policy $\pi_h^k$ in (3.3) is conservatively greedy with respect to the Q-function $Q_h^{\pi^{k-1},k-1}$ associated with the reward function $r^{k-1}$. In particular, the regularization effect of both $\pi_h^{k-1}$ and $\alpha$ in (3.3) ensures that $\pi_h^k$ is not fully committed to performing well only with respect to $r^{k-1}$, just in case the subsequent adversarially chosen reward function $r^k$ significantly differs from $r^{k-1}$. In comparison, the fully greedy policy improvement step, which is commonly adopted by the existing work on value-based reinforcement learning (Jaksch et al., 2010; Osband et al., 2014; Osband & Van Roy, 2016; Azar et al., 2017; Dann et al., 2017; Strehl et al., 2006; Jin et al., 2018; 2019; Yang & Wang, 2019b;a), lacks such a notion of robustness.

On the other hand, an intriguing question is whether being conservatively greedy is less sample-efficient than being fully greedy in the stationary setting, where the reward function is fixed across all the episodes. In fact, in the ideal setting where the Q-function $Q_h^{\pi^{k-1},k-1}$ associated with the reward function $r^{k-1}$ in (3.3) is known, the fully greedy policy improvement step with $\alpha\to\infty$ corresponds to one step of policy iteration (Sutton & Barto, 2018), which converges to the globally optimal policy within $K=H$ episodes and hence equivalently induces an $H^2$-regret. However, in the realistic setting, the Q-function $Q_h^{\pi^{k-1},k-1}$ in (3.1)-(3.3) is replaced by the estimated Q-function $Q_h^{k-1}$ in Line 6 of Algorithm 1, which is obtained by the policy evaluation step defined in (3.5). As a result of the estimation uncertainty that arises from only observing finite historical data, it is indeed impossible to do better than the $\sqrt{T}$-regret even in the tabular setting (Jin et al., 2018), which is shown to be an information-theoretic lower bound. In the linear setting, OPPO attains such a lower bound in terms of the total number of steps $T=HK$. In other words, in the stationary setting, being conservatively greedy suffices to achieve sample efficiency, which complements its advantages in terms of robustness in the more challenging setting with adversarially chosen reward functions.

4. Proof Sketch

4.1. Regret Decomposition

For simplicity of discussion, we define the model prediction error as
$$\iota_h^k=r_h^k+\mathbb{P}_h V_{h+1}^k-Q_h^k, \qquad (4.1)$$
which arises from estimating $\mathbb{P}_h V_{h+1}^k$ in the Bellman equation defined in (2.4) (with $V_{h+1}^{\pi^k,k}$ replaced by $V_{h+1}^k$) based on only finite historical data. Also, we define the following filtration generated by the state-action sequences and reward functions.

Definition 4.1 (Filtration). For any $(k,h)\in[K]\times[H]$, we define $\mathcal{F}_{k,h,1}$ as the $\sigma$-algebra generated by the following state-action sequences and reward functions,
$$\{(x_i^\tau,a_i^\tau)\}_{(\tau,i)\in[k-1]\times[H]}\cup\{r^\tau\}_{\tau\in[k]}\cup\{(x_i^k,a_i^k)\}_{i\in[h]}.$$
For any $(k,h)\in[K]\times[H-1]$, we define $\mathcal{F}_{k,h,2}$ as the $\sigma$-algebra generated by
$$\{(x_i^\tau,a_i^\tau)\}_{(\tau,i)\in[k-1]\times[H]}\cup\{r^\tau\}_{\tau\in[k]}\cup\{(x_i^k,a_i^k)\}_{i\in[h]}\cup\{x_{h+1}^k\},$$
while for any $k\in[K]$, we define $\mathcal{F}_{k,H,2}$ as the $\sigma$-algebra generated by
$$\{(x_i^\tau,a_i^\tau)\}_{(\tau,i)\in[k]\times[H]}\cup\{r^\tau\}_{\tau\in[k+1]}.$$
The $\sigma$-algebra sequence $\{\mathcal{F}_{k,h,m}\}_{(k,h,m)\in[K]\times[H]\times[2]}$ is a filtration with respect to the timestep index
$$t(k,h,m)=(k-1)\cdot 2H+(h-1)\cdot 2+m. \qquad (4.2)$$
In other words, for any $t(k,h,m)\le t(k',h',m')$, it holds that $\mathcal{F}_{k,h,m}\subseteq\mathcal{F}_{k',h',m'}$.

By the definition of the $\sigma$-algebra $\mathcal{F}_{k,h,m}$ for $(k,h,m)\in[K]\times[H]\times[2]$, the estimated value function $V_h^k$ and Q-function $Q_h^k$ are measurable to $\mathcal{F}_{k-1,H,2}$, as they are obtained based on the $(k-1)$ historical trajectories and the reward function $r^k$ adversarially chosen by the environment at the beginning of the $k$-th episode, both of which are measurable to $\mathcal{F}_{k-1,H,2}$. In the following lemma, we decompose the regret defined in (2.1) into three terms. Recall that the globally optimal policy in hindsight $\pi^*$ is defined in (3.8) and the model prediction error $\iota_h^k$ is defined in (4.1).

Lemma 4.2 (Regret Decomposition). It holds that
$$\mathrm{Regret}(T)=\underbrace{\sum_{k=1}^K\sum_{h=1}^H\mathbb{E}_{\pi^*}\big[\big\langle Q_h^k(x_h,\cdot),\,\pi_h^*(\cdot\,|\,x_h)-\pi_h^k(\cdot\,|\,x_h)\big\rangle\big]}_{\text{(i)}}+\underbrace{M_{K,H,2}}_{\text{(ii)}}+\underbrace{\sum_{k=1}^K\sum_{h=1}^H\big(\mathbb{E}_{\pi^*}\big[\iota_h^k(x_h,a_h)\big]-\iota_h^k(x_h^k,a_h^k)\big)}_{\text{(iii)}}, \qquad (4.3)$$
which is independent of the linear setting in Assumption 2.1. Here $\{M_{k,h,m}\}_{(k,h,m)\in[K]\times[H]\times[2]}$ is a martingale adapted to the filtration $\{\mathcal{F}_{k,h,m}\}_{(k,h,m)\in[K]\times[H]\times[2]}$, both with respect to the timestep index $t(k,h,m)$ defined in (4.2) of Definition 4.1.

Proof. See Appendix B.1 for a detailed proof.

Lemma 4.2 allows us to characterize the regret by upper bounding terms (i), (ii), and (iii) in (4.3), respectively. In detail, term (i) corresponds to the right-hand side of (3.7) in Lemma 3.2 with the Q-function $Q_h^{\pi^k,k}$ replaced by the estimated Q-function $Q_h^k$, which is obtained by the policy evaluation step defined in (3.5). In particular, as the updated policy $\pi_h^{k+1}$ is obtained by the policy improvement step in Line 6 of Algorithm 1 using $Q_h^k$, term (i) can be upper bounded following an analysis similar to the discussion in Section 3.2, which is based on Lemmas 3.2 and 3.3 as well as (3.10). Also, by the Azuma-Hoeffding inequality, term (ii) is a martingale that scales as $O(B_M\sqrt{T_M})$ with high probability, where $T_M$ is the total number of timesteps and $B_M$ is an upper bound on the martingale differences. More specifically, we prove that $T_M=2HK=2T$ and $B_M=2H$ in Appendix C, which implies that term (ii) is $O(\sqrt{H^2T})$ with high probability. Meanwhile, term (iii) corresponds to the model prediction error, which is characterized subsequently in Section 4.2.
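For concreteness, the Azuma-Hoeffding step behind the stated bound on term (ii) can be spelled out as follows (a standard calculation; the probability split and constants here are generic rather than the exact ones used in Appendix C). For a martingale with $T_M$ differences, each bounded in absolute value by $B_M$,
$$\mathbb{P}\big(|M_{K,H,2}|\ge t\big)\le 2\exp\Big(-\frac{t^2}{2\,T_M B_M^2}\Big)\quad\Longrightarrow\quad|M_{K,H,2}|\le B_M\sqrt{2\,T_M\log(4/\zeta)}\ \text{ with probability at least }1-\zeta/2,$$
and plugging in $T_M=2T$ and $B_M=2H$ gives an $O(\sqrt{H^2T})$ bound up to logarithmic factors.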
Note that the regret decomposition in (4.3) of Lemma 4.2 is independent of the linear setting in Assumption 2.1, and therefore applies to any form of estimated Q-functions $Q_h^k$ in more general settings. In particular, as long as we can upper bound term (iii) in (4.3), our regret analysis can be carried over even beyond the linear setting.

4.2. Model Prediction Error

To upper bound term (iii) in (4.3) of Lemma 4.2, we characterize the model prediction error $\iota_h^k$ defined in (4.1) in the following lemma. Recall that the bonus function $\Gamma_h^k$ is defined in (3.6).

Lemma 4.3 (Upper Confidence Bound). Let $\lambda=1$ in (3.5) and Line 12 of Algorithm 1, and $\beta=C\sqrt{dH^2\log(dT/\zeta)}$ in (3.6) and Line 15 of Algorithm 1, where $C>1$ is an absolute constant and $\zeta\in(0,1]$. Under Assumption 2.1, it holds with probability at least $1-\zeta/2$ that
$$-2\Gamma_h^k(x,a)\le\iota_h^k(x,a)\le 0$$
for any $(k,h)\in[K]\times[H]$ and $(x,a)\in\mathcal{S}\times\mathcal{A}$.

Proof. See Appendix B.2 for a detailed proof.

Lemma 4.3 demonstrates the key role of uncertainty quantification in achieving sample efficiency. More specifically, due to the uncertainty that arises from only observing finite historical data, the model prediction error $\iota_h^k(x,a)$ can possibly be large for the state-action pairs $(x,a)$ that are less visited or even unseen. However, as is shown in Lemma 4.3, explicitly incorporating the bonus function $\Gamma_h^k$ into the estimated Q-function $Q_h^k$ ensures that $\iota_h^k(x,a)\le 0$ with high probability for any $(k,h)\in[K]\times[H]$ and $(x,a)\in\mathcal{S}\times\mathcal{A}$. In other words, the estimated Q-function $Q_h^k$ is optimistic in the face of uncertainty, as $\iota_h^k(x,a)\le 0$, or equivalently,
$$Q_h^k(x,a)\ge r_h^k(x,a)+(\mathbb{P}_h V_{h+1}^k)(x,a), \qquad (4.4)$$
implies that $\mathbb{E}_{\pi^*}[\iota_h^k(x_h,a_h)]$ in term (iii) of (4.3) is upper bounded by zero. Also, Lemma 4.3 implies that $\iota_h^k(x_h^k,a_h^k)\ge-2\Gamma_h^k(x_h^k,a_h^k)$ with high probability for any $(k,h)\in[K]\times[H]$. As a result, it only remains to upper bound the cumulative sum $\sum_{k=1}^K\sum_{h=1}^H\Gamma_h^k(x_h^k,a_h^k)$ corresponding to term (iii) in (4.3), which can be characterized by the elliptical potential lemma (Dani et al., 2008; Rusmevichientong & Tsitsiklis, 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011; Jin et al., 2019). See Appendix C for a detailed proof.

To illustrate the intuition behind the model prediction error $\iota_h^k$ defined in (4.1), we define the implicitly estimated transition dynamics as
$$\hat{\mathcal{P}}_{k,h}(x'\,|\,x,a)=\psi(x,a,x')^\top w_h^k,$$
where $w_h^k$ is defined in (3.6). Correspondingly, the policy evaluation step defined in (3.5) takes the following equivalent form (ignoring the truncation step for simplicity of discussion),
$$Q_h^k=r_h^k+\hat{\mathbb{P}}_{k,h}V_{h+1}^k+\Gamma_h^k. \qquad (4.5)$$
Here $\hat{\mathbb{P}}_{k,h}$ is the operator form of the implicitly estimated transition kernel $\hat{\mathcal{P}}_{k,h}(\cdot\,|\,\cdot,\cdot)$, which is defined by
$$(\hat{\mathbb{P}}_{k,h}f)(x,a)=\int_{\mathcal{S}}\hat{\mathcal{P}}_{k,h}(x'\,|\,x,a)\,f(x')\,\mathrm{d}x'$$
for any function $f:\mathcal{S}\to\mathbb{R}$. Correspondingly, by (4.1) and (4.5) we have
$$\iota_h^k=(\mathbb{P}_h-\hat{\mathbb{P}}_{k,h})V_{h+1}^k-\Gamma_h^k,$$
where $\mathbb{P}_h-\hat{\mathbb{P}}_{k,h}$ is the error that arises from implicitly estimating the transition dynamics based on only finite historical data. Such a model estimation error enters the regret in (4.3) of Lemma 4.2 only through the model prediction error $(\mathbb{P}_h-\hat{\mathbb{P}}_{k,h})V_{h+1}^k$, which allows us to bypass explicitly estimating the transition dynamics and, instead, employ the estimated Q-function $Q_h^k$ obtained by the policy evaluation step defined in (4.5). As is shown in Appendix B.2, the bonus function $\Gamma_h^k$ upper bounds
$$\big|\big((\mathbb{P}_h-\hat{\mathbb{P}}_{k,h})V_{h+1}^k\big)(x,a)\big| \qquad (4.6)$$
with high probability for any $(k,h)\in[K]\times[H]$ and $(x,a)\in\mathcal{S}\times\mathcal{A}$, which then ensures the optimism of the estimated Q-function $Q_h^k$ in the sense of (4.4).
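To give a feel for the elliptical potential argument invoked above, the following self-contained snippet (ours, purely illustrative, with randomly generated features rather than the features produced by OPPO) shows that the cumulative bonus $\sum_k[\phi_k^\top(\Lambda_k)^{-1}\phi_k]^{1/2}$ grows sublinearly in the number of episodes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, lam = 10, 2000, 1.0

Lam = lam * np.eye(d)                      # Lambda_1 = lam * I
cum_bonus, history = 0.0, []
for k in range(1, K + 1):
    phi = rng.normal(size=d)
    phi /= np.linalg.norm(phi)             # features with norm at most 1
    cum_bonus += np.sqrt(phi @ np.linalg.solve(Lam, phi))
    Lam += np.outer(phi, phi)              # Lambda_{k+1} = Lambda_k + phi phi^T
    history.append(cum_bonus)

# The cumulative bonus grows roughly like sqrt(d * k * log k), so the average
# bonus per episode vanishes as k grows.
for k in (100, 500, 2000):
    print(k, round(history[k - 1], 2), round(history[k - 1] / k, 4))
```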
5. Conclusion

We study the sample efficiency of policy-based reinforcement learning in the episodic setting of linear MDPs with full-information feedback. We propose an optimistic variant of the proximal policy optimization algorithm, dubbed OPPO, which incorporates the principle of optimism in the face of uncertainty into policy optimization. When applied to the episodic MDP with unknown transition and adversarial reward, OPPO provably achieves a near-optimal $\sqrt{d^2H^3T}$-regret. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explicitly incorporates exploration.

Acknowledgements

The authors would like to thank Lingxiao Wang, Wen Sun, and Sham Kakade for pointing out a technical issue in the initial version regarding the covering number of value functions in the linear setting. Such a technical issue is fixed in the camera-ready version with a definition of the linear MDP different from the one in the initial version. The authors would like to thank Csaba Szepesvári, Lin F. Yang, Yining Wang, and Simon S. Du for helpful discussions. The authors would also like to thank the anonymous reviewers for their valuable comments and the program chairs for helping with preparing the camera-ready version.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312-2320, 2011.
Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvári, C., and Weisz, G. POLITEX: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, volume 97, pp. 3692-3702, 2019a.
Abbasi-Yadkori, Y., Lazic, N., Szepesvári, C., and Weisz, G. Exploration-enhanced POLITEX. arXiv preprint arXiv:1908.10479, 2019b.
Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.
Antos, A., Szepesvári, C., and Munos, R. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems, pp. 9-16, 2008.
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.
Ayoub, A., Jia, Z., Szepesvári, C., Wang, M., and Yang, L. Model-based reinforcement learning with value-targeted regression. arXiv preprint arXiv:2006.01107, 2020.
Azar, M. G., Munos, R., Ghavamzadaeh, M., and Kappen, H. J. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011.
Azar, M. G., Gómez, V., and Kappen, H. J. Dynamic policy programming. Journal of Machine Learning Research, 13(Nov):3207-3245, 2012a.
Azar, M. G., Munos, R., and Kappen, B. On the sample complexity of reinforcement learning with a generative model. arXiv preprint arXiv:1206.6461, 2012b.
Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263-272, 2017.
Baxter, J. and Bartlett, P. L. Direct gradient-based reinforcement learning. In International Symposium on Circuits and Systems, pp. 271-274, 2000.
Bhandari, J. and Russo, D. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.
Boyan, J. A. Least-squares temporal difference learning. Machine Learning, 49(2-3):233-246, 2002.
Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33-57, 1996.
Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.
Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge, 2006.
Chen, J. and Jiang, N. Information-theoretic considerations in batch reinforcement learning. arXiv preprint arXiv:1905.00360, 2019.
Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, pp. 208-214, 2011.
Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. Conference on Learning Theory, 2008.
Dann, C., Lattimore, T., and Brunskill, E. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5713-5723, 2017.
Dong, K., Peng, J., Wang, Y., and Zhou, Y. $\sqrt{n}$-regret for learning in Markov decision processes with function approximation and low Bellman rank. arXiv preprint arXiv:1909.02506, 2019.
Du, S. S., Kakade, S. M., Wang, R., and Yang, L. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019a.
Du, S. S., Luo, Y., Wang, R., and Zhang, H. Provably efficient Q-learning with function approximation via distribution shift error checking oracle. arXiv preprint arXiv:1906.06321, 2019b.
Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329-1338, 2016.
Even-Dar, E., Kakade, S. M., and Mansour, Y. Online Markov decision processes. Mathematics of Operations Research, 34(3):726-736, 2009.
Farahmand, A.-m., Szepesvári, C., and Munos, R. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, pp. 568-576, 2010.
Fazel, M., Ge, R., Kakade, S. M., and Mesbahi, M. Global convergence of policy gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039, 2018.
Geist, M., Scherrer, B., and Pietquin, O. A theory of regularized Markov decision processes. arXiv preprint arXiv:1901.11275, 2019.
Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4):1563-1600, 2010.
Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pp. 1704-1713, 2017.
Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4863-4873, 2018.
Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
Kakade, S. M. A natural policy gradient. In Advances in Neural Information Processing Systems, 2002.
Kakade, S. M. On the Sample Complexity of Reinforcement Learning. PhD thesis, University of London, 2003.
Koenig, S. and Simmons, R. G. Complexity analysis of real-time reinforcement learning. In Association for the Advancement of Artificial Intelligence, pp. 99-107, 1993.
Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems, 2000.
Lattimore, T. and Szepesvári, C. Learning with good feature representations in bandits and in RL with a generative model. arXiv preprint arXiv:1911.07676, 2019.
Leffler, B. R., Littman, M. L., and Edmunds, T. Efficient reinforcement learning with relocatable action models. In Association for the Advancement of Artificial Intelligence, pp. 572-577, 2007.
Liu, B., Cai, Q., Yang, Z., and Wang, Z. Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306, 2019.
Mania, H., Guy, A., and Recht, B. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815-857, 2008.
Nemirovsky, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
Neu, G., Antos, A., György, A., and Szepesvári, C. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, pp. 1804-1812, 2010a.
Neu, G., György, A., and Szepesvári, C. The online loop-free stochastic shortest-path problem. In Conference on Learning Theory, volume 2010, pp. 231-243, 2010b.
Neu, G., György, A., and Szepesvári, C. The adversarial stochastic shortest path problem with unknown transition probabilities. In International Conference on Artificial Intelligence and Statistics, pp. 805-813, 2012.
Neu, G., Jonsson, A., and Gómez, V. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
OpenAI. OpenAI Five. https://openai.com/five/, 2019.
Osband, I. and Van Roy, B. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
Osband, I., Van Roy, B., and Wen, Z. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
Rosenberg, A. and Mansour, Y. Online stochastic shortest path with bandit feedback and unknown transition function. In Advances in Neural Information Processing Systems, pp. 2209-2218, 2019a.
Rosenberg, A. and Mansour, Y. Online convex optimization in adversarial Markov decision processes. arXiv preprint arXiv:1905.07773, 2019b.
Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395-411, 2010.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897, 2015.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pp. 5186-5196, 2018a.
Sidford, A., Wang, M., Wu, X., and Ye, Y. Variance reduced value iteration and faster algorithms for solving Markov decision processes. In Symposium on Discrete Algorithms, pp. 770-787, 2018b.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, M. L. PAC model-free reinforcement learning. In International Conference on Machine Learning, pp. 881-888, 2006.
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.
Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 2000.
Tosatto, S., Pirotta, M., D'Eramo, C., and Restelli, M. Boosted fitted Q-iteration. In International Conference on Machine Learning, pp. 3434-3443, 2017.
Van Roy, B. and Dong, S. Comments on the Du-Kakade-Wang-Yang lower bounds. arXiv preprint arXiv:1911.07910, 2019.
Wainwright, M. J. Variance-reduced Q-learning is minimax optimal. arXiv preprint arXiv:1906.04697, 2019.
Wang, L., Cai, Q., Yang, Z., and Wang, Z. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019.
Wang, W. Y., Li, J., and He, X. Deep reinforcement learning for NLP. In Association for Computational Linguistics, pp. 19-21, 2018.
Wen, Z. and Van Roy, B. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3):762-782, 2017.
Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.
Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543-2596, 2010.
Yang, L. and Wang, M. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019a.
Yang, L. and Wang, M. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pp. 6995-7004, 2019b.
Yang, Z., Chen, Y., Hong, M., and Wang, Z. On the global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. arXiv preprint arXiv:1907.06246, 2019a.
Yang, Z., Xie, Y., and Wang, Z. A theoretical analysis of deep Q-learning. arXiv preprint arXiv:1901.00137, 2019b.
Yu, J. Y., Mannor, S., and Shimkin, N. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737-757, 2009.
Zhou, D., He, J., and Gu, Q. Provably efficient reinforcement learning for discounted MDPs with feature mapping. arXiv preprint arXiv:2006.13165, 2020.
Zimin, A. and Neu, G. Online learning in episodic Markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems, pp. 1583-1591, 2013.