# Sample-Optimal Parametric Q-Learning Using Linearly Additive Features

Lin F. Yang, Mengdi Wang

Equal contribution. Department of Operations Research and Financial Engineering, Princeton University. Correspondence to: Lin Yang, Mengdi Wang. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Consider a Markov decision process (MDP) that admits a set of state-action features which can linearly express the process's probabilistic transition model. We propose a parametric Q-learning algorithm that finds an approximately optimal policy using a sample size proportional to the feature dimension K and invariant with respect to the size of the state space. To further improve its sample efficiency, we exploit the monotonicity property and intrinsic noise structure of the Bellman operator, provided the existence of anchor state-actions that imply implicit non-negativity in the feature space. We augment the algorithm with techniques of variance reduction, monotonicity preservation, and confidence bounds. It is proved to find a policy that is ϵ-optimal from any initial state with high probability using Õ(K/(ϵ²(1−γ)³)) sample transitions for arbitrarily large-scale MDPs with a discount factor γ ∈ (0, 1). A matching information-theoretic lower bound is proved, confirming the sample optimality of the proposed method with respect to all parameters (up to polylog factors).

1. Introduction

Markov decision problems (MDP) are known to suffer from the curse of dimensionality. A basic theoretical question is: suppose that one can query sample transitions from any state of the system using any action, how many samples are needed for learning a good policy? In the tabular setting where the MDP has S states and A actions, the necessary and sufficient sample size for finding an approximately optimal policy is Θ̃(SA/((1−γ)³ϵ²)), where γ ∈ (0, 1) is a discount factor (Azar et al., 2013; Sidford et al., 2018a). (Throughout, Õ(f), Θ̃(f), Ω̃(f) ignore polylog(f) factors.) However, this theoretically sharp result does not generalize to practical problems where S and A can be arbitrarily large or infinite.

Let us consider MDPs with structural knowledge. Suppose that each state-action pair (s, a) admits a feature vector φ(s, a) ∈ R^K that can express the transition dynamics conditioned on (s, a). In practice, the abstract state variable s can be a sequence of historical records or a raw-pixel image, containing much information that is not related to the decision process. More general settings of MDPs with structural knowledge have been considered in Azizzadenesheli et al. (2016); Jiang et al. (2017) and references therein. In this paper, we focus on an important and very basic class of structured MDPs, where the features can represent the transition distributions P(·|·,·) through an unknown linear additive model.

The feature-based linear transition model is related to the commonly used linear Q-function model. We show that they are essentially equivalent when there is zero Bellman error (a notion introduced in Munos & Szepesvári (2008)). A similar argument has been made in Parr et al. (2008). The model also contains as a special case the soft state aggregation model (Singh et al., 1995; Duan et al., 2018). In this setting, we study the theoretical sample complexity of learning a good policy by querying state-transition samples.
We also aim to develop efficient policy learning algorithms with provable sample efficiency. We study the following two questions:

Q1: How many observations of state-action-state transitions are necessary for finding an ϵ-optimal policy?

Q2: How many samples are sufficient for finding an ϵ-optimal policy with high probability, and how can one find it?

To answer Q1, an information-theoretic lower bound is provided (Theorem 1), suggesting that, regardless of the learning algorithm, the necessary sample size for finding a good policy with high probability is Ω̃(K/((1−γ)³ϵ²)), where K is the dimension of the feature space.

To answer Q2, we develop Q-learning-like algorithms that take as input state-transition samples and output a parameterized policy. A basic parametric Q-learning algorithm performs approximate value-iteration estimates on a few points of the Q-function, so that actual updates happen on the parameters. This idea originates from phased Q-learning (Kearns & Singh, 1999) and fitted value iteration (Munos & Szepesvári, 2008; Antos et al., 2008a;b). Our algorithm is simpler and does not require function fitting. Convergence and approximation error analyses are provided even when the MDP cannot be fully expressed using the features. Despite its simplicity, the basic algorithm has complexity Õ(K/((1−γ)⁷ϵ²)), which is not sample-optimal.

Furthermore, we develop an accelerated version of parametric Q-learning that involves taking mini-batches, computing confidence bounds, and using monotonicity-preserving and variance reduction techniques. It uses some ideas from fast solvers for tabular MDPs (Sidford et al., 2018b;a). To fully exploit the monotonicity property of the Bellman operator in the algorithm, we need an additional anchor assumption, i.e., there exists a (small) set of state-actions that can represent the remaining ones using convex combinations. The anchors can be viewed as vertices of the state-action space, and they imply an intrinsic non-negativity in the feature space which is needed for monotonic policy improvement. We show that the algorithm takes just enough samples per update to keep the value/policy iterates within a sequence of narrow confidence regions that monotonically improve to the near-optimal solutions. It finds an ϵ-optimal policy (regardless of the initial state) with probability at least 1 − δ using Θ̃(K/((1−γ)³ϵ²) · log(1/δ)) samples. It matches the information-theoretic lower bound up to log(·) factors, thus the algorithm is nearly sample-optimal. If γ = 0.99, this algorithm is (1−γ)⁻⁴ = 10⁸ times faster than the basic algorithm.

Our model, algorithms, and analyses relate to previous literature on the sample complexity of tabular MDPs, reinforcement learning with function approximation, linear models, etc. A detailed account of the related literature is given in Section 6. All technical proofs are given in the appendix. To our best knowledge, this work provides the first sample-optimal algorithm and sharp complexity analysis (up to polylog factors) for MDPs with linear models.

2. Markov Decision Process, Features, Linear Models

In this section we introduce the basics of the Markov decision process and the feature-based linear transition model.

2.1. Preliminaries

In a discounted Markov decision process (DMDP, or MDP for short), there is a finite set of states S and a finite set of actions A. Let S = |S| and A = |A|. At any state s ∈ S, an agent is allowed to play an action a ∈ A. She receives an immediate reward r(s, a) ∈ [0, 1] after playing a at s, and then the process transitions to the next state s′ ∈ S with probability P(s′|s, a), where P is the collection of transition distributions. The full MDP instance can be described by the tuple M = (S, A, P, r, γ). The agent would like to find a policy π : S → A that maximizes the long-term expected reward starting from every state s, i.e.,

v^π(s) := E[ Σ_{t=0}^∞ γ^t r(s_t, π(s_t)) | s_0 = s ],

where γ ∈ (0, 1) is a discount factor. We call v^π ∈ R^S the value function of policy π. A policy π* is said to be optimal if it attains the maximal possible value at every state. In fact, it is known (see e.g. Puterman (2014)) that there is a unique optimal value function v* such that

∀s ∈ S:  v*(s) = max_π v^π(s) = v^{π*}(s).

A policy π is said to be ϵ-optimal if it achieves near-optimal cumulative reward from any initial state, i.e.,

v^π(s) ≥ v*(s) − ϵ,  ∀s ∈ S,

or equivalently ‖v^π − v*‖_∞ ≤ ϵ for short. We denote the Bellman operator T : R^S → R^S as

∀s ∈ S:  [T v](s) = max_{a∈A} [ r(s, a) + γ P(·|s, a)⊤ v ].

A vector v is the optimal value of the DMDP if and only if it satisfies the Bellman equation v = T v. The Q-function of a policy π is defined as Q^π(s, a) = r(s, a) + γ Σ_{s′} P(s′|s, a) v^π(s′), and the optimal Q-function is denoted by Q* = Q^{π*}. We overload the notation T to also denote the Bellman operator in the space of Q-functions, i.e., T : R^{S×A} → R^{S×A} such that

[T Q](s, a) = r(s, a) + γ P(·|s, a)⊤ max_{a′} Q(·, a′).

A vector Q ∈ R^{S×A} is the optimal Q-function if and only if it satisfies the Bellman equation Q = T Q.

We use O, Ω, Θ to denote leading orders, and we use Õ, Ω̃, Θ̃ to omit polylog factors. We use ≲ to denote "approximately less than", ignoring non-leading-order terms and constant and polylog factors.
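Since the remainder of the paper repeatedly manipulates T, V, and Q, the following minimal sketch (ours, with synthetic numbers rather than anything from the paper) spells out the Q-function Bellman operator on a small tabular DMDP and iterates it to a fixed point:

```python
# Minimal illustration (not from the paper): the Bellman operator T on Q-functions
# for a small synthetic tabular DMDP, iterated until Q ~= Q*.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)      # P[s, a, :] is a probability distribution
r = rng.random((S, A))                 # rewards in [0, 1]

def bellman_T(Q):
    """[T Q](s, a) = r(s, a) + gamma * sum_s' P(s'|s, a) * max_a' Q(s', a')."""
    v = Q.max(axis=1)                  # state values induced by Q
    return r + gamma * P @ v           # shape (S, A)

Q = np.zeros((S, A))
for _ in range(500):                   # value iteration on Q; T is a gamma-contraction
    Q_next = bellman_T(Q)
    if np.max(np.abs(Q_next - Q)) < 1e-10:
        break
    Q = Q_next

policy = Q.argmax(axis=1)              # greedy policy with respect to Q ~= Q*
print("optimal values v*(s):", np.round(Q.max(axis=1), 3))
print("greedy policy:", policy)
```

Because T is a γ-contraction in the ℓ_∞ norm, this iteration converges geometrically; the sample-based algorithms below emulate the same iteration while replacing P(·|s, a)⊤v with empirical estimates.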
2.2. Feature-based Linear Transition Model

We study Markov decision processes with structural knowledge. Suppose that the learning agent is given a set of K feature functions φ₁, φ₂, . . . , φ_K : S × A → R. The feature map φ maps the raw state and action (s, a) into the K-dimensional vector

φ(s, a) = [φ₁(s, a), φ₂(s, a), . . . , φ_K(s, a)]⊤ ∈ R^K.

Suppose the feature vector φ(s, a) is sufficient to express the future dynamics of the process conditioned on the current raw state and action. In particular, we focus on a basic linear model given below.

Definition 1 (Feature-based Linear Transition Model). Consider a DMDP instance M = (S, A, P, r, γ) and a feature map φ : S × A → R^K. We say that M admits a linear feature representation φ if for every s, a, s′,

P(s′|s, a) = Σ_{k∈[K]} φ_k(s, a) ψ_k(s′)

for some functions ψ₁, . . . , ψ_K : S → R. We denote the set of all such MDP instances as M^trans(S, A, γ, φ). We denote by M^trans_K(S, A, γ) the set of all DMDP instances that admit a K-dimensional linear feature representation.

Remark 1 (Independence of rewards). The feature representation φ(s, a) in Definition 1 captures the transition dynamics of the Markov process under different actions. It is a form of structural knowledge about the environment. It has nothing to do with the rewards r(s, a).

Remark 2 (Combining state features and action features). In many settings one may be given a state-only feature map φ₁ and an action-only feature map φ₂. In this case, one can construct a joint state-action feature by taking the tensor product φ(s, a) = φ₁(s) ⊗ φ₂(a). As long as the MDP admits a linear transition model in both φ₁ and φ₂, it also admits a linear representation in the product feature φ = φ₁ ⊗ φ₂.

Remark 3 (Relation to soft state aggregation). The feature-based linear transition model (Definition 1) contains a noteworthy special case. When each φ(s, a) ∈ R^K and each ψ_k ∈ R^S is a probability density function, the linear transition model reduces to a soft state aggregation model (Singh et al., 1995; Duan et al., 2018). In the soft state aggregation model, each state can be represented by a mixture of latent meta-states, through aggregation and disaggregation distributions. There would be K meta-states, which can be viewed as the leading modes of the process. In contrast, our feature-based transition model is much more general: the feature map φ can be anything as long as it is representative of the transition distributions, and it captures information about not only the states but also the actions.
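As a concrete instance of Definition 1, the sketch below (ours; the dimensions and distributions are arbitrary) builds a transition kernel that factors through K features. With nonnegative factors as chosen here, it is exactly the soft state aggregation special case of Remark 3:

```python
# Minimal synthetic sketch of Definition 1 (not from the paper): a transition kernel
# that factors as P(s'|s,a) = sum_k phi_k(s,a) * psi_k(s'), here with nonnegative
# factors so that it also matches the soft state aggregation model of Remark 3.
import numpy as np

rng = np.random.default_rng(1)
S, A, K = 6, 2, 3

Phi = rng.random((S * A, K))
Phi /= Phi.sum(axis=1, keepdims=True)   # each phi(s,a) is a probability vector over [K]
Psi = rng.random((K, S))
Psi /= Psi.sum(axis=1, keepdims=True)   # each psi_k(.) is a distribution over next states

P = Phi @ Psi                           # rows indexed by (s,a); columns by next state s'
assert np.allclose(P.sum(axis=1), 1.0)  # each P(.|s,a) is a valid distribution
assert P.shape == (S * A, S)

# The kernel has rank at most K, regardless of how large the state space S is.
print("rank of P:", np.linalg.matrix_rank(P), "<= K =", K)
```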
2.3. Relation to Linear Q-function Model

Linear models are commonly used for approximating value functions or Q-functions using given features (sometimes referred to as basis functions). The proposed linear transition model is closely related to the linear Q-function model, in which the Q^π's are assumed to admit a linear representation. First, it is easy to see that if the MDP admits a linear transition model using φ, the Q-functions admit a linear model.

Proposition 1. Let M ∈ M^trans(S, A, γ, φ). Then Q^π ∈ Span(r, φ) for all π.

Next we show that the two models are essentially equivalent in terms of expressibility. A similar conclusion was made by Parr et al. (2008), who showed that a particular solution obtained using the linear value model is equivalent to a solution obtained using a linear transition model. Recall the notion of Bellman error first introduced in Munos & Szepesvári (2008).

Definition 2 (Bellman Error). Let F ⊂ R^{S×A} be a class of Q-functions. Given the Bellman operator T, the Bellman error of F is

d(T F, F) = sup_{g∈F} inf_{f∈F} ‖f − T g‖_∞.

We show that the linear transition model is equivalent to the linear Q-function model with zero Bellman error.

Proposition 2 (Equivalence to Zero Bellman Error). Let M = (S, A, P, r, γ) be an MDP instance with Bellman operator T. Let φ : S × A → R^K be a feature map, and let F = Span(r, φ) (so that r ∈ F). Then d(T F, F) = 0 if and only if M ∈ M^trans(S, A, γ, φ).

Note that even if Q^π ∈ Span(r, φ) for all policies π, value-iteration-based methods may still fail when the Bellman operator T does not preserve the (r, φ) representation. In particular, if the Q-functions admit linear representations using φ but the transition kernel P does not, the Bellman error can be arbitrarily large. The Bellman error may be large even after projection or function fitting, which is a common source of unstable and oscillating behaviors in approximate dynamic programming (Tsitsiklis & Van Roy, 1996; Munos & Szepesvári, 2008).
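To make one direction of Proposition 2 concrete, here is a short derivation (ours, using only Definition 1 and the definitions above) of why a linear transition model keeps the Bellman operator inside Span(r, φ). The converse direction is established in the appendix.

```latex
% One direction of Proposition 2 (our sketch): a linear transition model forces zero
% Bellman error for F = Span(r, phi).  Take any Q in F and write V_Q(s') := max_{a'} Q(s', a').
\begin{align*}
  [\mathcal{T} Q](s,a)
    &= r(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\, V_Q(s') \\
    &= r(s,a) + \gamma \sum_{k=1}^{K} \phi_k(s,a)
       \underbrace{\sum_{s'} \psi_k(s')\, V_Q(s')}_{=:\; w_Q(k)}
     \;=\; r(s,a) + \gamma\, \phi(s,a)^\top w_Q .
\end{align*}
% Hence T Q again lies in Span(r, phi), so inf_{f in F} ||f - T Q||_inf = 0 for every Q in F,
% i.e. d(T F, F) = 0.
```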
3. Information-Theoretic Sample Complexity

Let us study the feature-based MDP model (Definition 1). It comes with the structural knowledge that each state-action pair (s, a) can be represented by the feature vector φ(s, a) ∈ R^K. However, this model cannot be parameterized by a small number of parameters: the full transition model with known feature map φ cannot be specified unless all the unknown values ψ_k(s′), for s′ ∈ S and k ∈ [K], are given. Its model size is S·K, which can be arbitrarily large for arbitrarily large S.

Given the state-action features, we aim to learn a near-optimal parametrized policy using a small number of samples, which hopefully depends on K but not on S. Suppose that we are given a generative model (Kakade, 2003), where the agent is able to query transition samples and rewards from any state-action pair (s, a) ∈ S × A. Such a generative model is commonly available in simulation systems. We then ask: how many samples are necessary to obtain an approximately optimal policy? Our first theorem provides a firm answer.

Theorem 1 (Sample Complexity Lower Bound). Let M = (S, A, P, r, γ) be an instance of a DMDP, and let A be any algorithm that queries sample transitions of M and outputs a policy. Let π_{A,M,N} be the output of A using N samples. Then

inf_A sup_{M ∈ M^trans_K(S,A,γ)} P[ sup_{s∈S} ( v*(s) − v^{π_{A,M,N}}(s) ) ≥ ϵ ] ≥ 1/3   whenever   N = O( K / ((1−γ)³ ϵ² log ϵ^{-1}) ),

provided ϵ ≤ ϵ₀ for some constant ϵ₀ > 0.

Theorem 1 suggests that, in order to solve the feature-based MDP to precision level ϵ with probability at least 2/3, any algorithm needs at least Ω̃(K/((1−γ)³ϵ²)) sample transitions in the worst case. In the tabular setting without any features, the sample complexity lower bound is known to be Ω(SA/((1−γ)³ϵ²)) (Azar et al., 2013). Our lower bound can be proved by constructing a reduction from the feature-based model to a smaller-size tabular MDP with K state-action pairs. In the reduction, the features are constructed as indicator functions corresponding to a K-partition of the state space. We postpone the proof to the appendix.

4. A Basic Parametric Q-Learning Method

We develop a Q-learning algorithm for MDPs that admit feature representations, assuming access to a generative model.

4.1. Algorithm

Recall that phased Q-learning (Kearns & Singh, 1999) takes the form

Q(s, a) ← r(s, a) + (γ/m) Σ_{i=1}^m max_{a′} Q(s′_i, a′),

where the s′_i are sample states generated from P(·|s, a). In the tabular setting, one needs to keep track of all the Q(s, a) values. Given the feature map φ : S × A → R^K, we instead parameterize the Q-functions, value functions, and policies using w ∈ R^K:

Q_w(s, a) := r(s, a) + γ φ(s, a)⊤ w,        (1)
V_w(s) := max_{a∈A} Q_w(s, a),              (2)
π_w(s) := arg max_{a∈A} Q_w(s, a).          (3)

A scalable learning algorithm should keep track of only the parameter w, from which one can decode the high-dimensional value and policy functions according to (1)-(3).

Algorithm 1 gives a parametric phased Q-learning method. It queries state-action transitions and makes Q-learning-like updates on the parameter w. Each iteration picks a small set of state-action pairs K and performs approximate value iteration on K. The set K can be picked almost arbitrarily. To obtain a convergence bound, we assume that the state-action pairs in K are not too alike, i.e., the regularity condition (4) holds for some value L > 0.

Assumption 1 (Representative States and Regularity of Features). There exists a representative state-action set K ⊂ S × A with |K| = K and a scalar L ≥ 1 such that

‖φ(s, a)⊤ Φ_K^{-1}‖_1 ≤ L,  ∀(s, a),        (4)

where Φ_K ∈ R^{K×K} is the matrix whose rows are the feature vectors φ(s, a) for (s, a) ∈ K.

Algorithm 1  Phased Parametric Q-Learning (PPQ-Learning)
 1: Input: A DMDP M = (S, A, P, r, γ) with a generative model
 2: Input: Integer N > 0
 3:
 4: Initialize: R ← Θ( log N / (1−γ) ), w ← 0 ∈ R^K
 5: Repeat:
 6: for t = 1, 2, . . . , R do
 7:   Pick a representative set K ⊂ S × A satisfying (4).
 8:   Q ← 0 ∈ R^K
 9:   for (s, a) ∈ K do
10:     Obtain N/(KR) samples {s^(j)} i.i.d. from P(·|s, a)
11:     Q[(s, a)] ← (KR/N) Σ_{j=1}^{N/(KR)} Π_{[0,(1−γ)^{-1}]}[ V_w(s^(j)) ]
12:       ▷ Π_{[a,b]} projects a number onto the interval [a, b]
13:   end for
14:   w ← Φ_K^{-1} Q
15: end for
16: Output: w ∈ R^K
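The following runnable sketch (ours) mirrors Algorithm 1 on a small synthetic feature-based MDP. The representative set, iteration count R, and per-pair sample size are chosen for readability rather than according to the theory, and the projection Π_{[0,(1−γ)^{-1}]} appears as a clip:

```python
# Runnable sketch of Algorithm 1 (PPQ-Learning) on a small synthetic feature-based MDP.
# Our illustration only: constants and sample sizes are chosen for readability,
# not to match the theoretical settings behind Theorem 2.
import numpy as np

rng = np.random.default_rng(2)
S, A, K, gamma = 20, 3, 4, 0.9

# Features: K representative pairs get unit feature vectors, all other pairs get
# convex combinations of them, so the regularity condition (4) holds with L = 1.
Phi = rng.random((S, A, K)); Phi /= Phi.sum(axis=2, keepdims=True)
reps = [(k, k % A) for k in range(K)]
for k, (s, a) in enumerate(reps):
    Phi[s, a] = np.eye(K)[k]
Phi_K = np.stack([Phi[s, a] for (s, a) in reps])

# Transition kernel linear in the features (Definition 1) and bounded rewards.
Psi = rng.random((K, S)); Psi /= Psi.sum(axis=1, keepdims=True)
P = Phi @ Psi                                  # P[s, a, :] = sum_k Phi[s,a,k] * Psi[k,:]
r = rng.random((S, A))

def V_w(w):
    """V_w(s) = max_a [r(s,a) + gamma * phi(s,a)^T w], clipped to [0, 1/(1-gamma)]."""
    return np.clip((r + gamma * Phi @ w).max(axis=1), 0.0, 1.0 / (1.0 - gamma))

# PPQ-Learning main loop: estimate P(.|s,a)^T V_w on the representative set, then
# recover the parameter via w <- Phi_K^{-1} Q_hat (line 14 of Algorithm 1).
R, m = 60, 200                                 # iterations, samples per (s,a) per round
w = np.zeros(K)
for _ in range(R):
    values = V_w(w)
    Q_hat = np.array([values[rng.choice(S, size=m, p=P[s, a])].mean() for s, a in reps])
    w = np.linalg.solve(Phi_K, Q_hat)

# Decode pi_w from w and compare it against exact value iteration on the true model.
pi_w = (r + gamma * Phi @ w).argmax(axis=1)
v_star = np.zeros(S)
for _ in range(1000):
    v_star = (r + gamma * P @ v_star).max(axis=1)
v_pi = np.zeros(S)
for _ in range(1000):
    v_pi = r[np.arange(S), pi_w] + gamma * P[np.arange(S), pi_w] @ v_pi
print("max policy sub-optimality:", float(np.max(v_star - v_pi)))
```

Note that only the K-dimensional parameter w is ever stored; the value V_w and policy π_w are decoded on demand via (1)-(3).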
4.2. Error Bound and Sample Complexity

We show that the basic parametric Q-learning method enjoys the following error bound.

Theorem 2 (Convergence of Algorithm 1). Suppose Assumption 1 holds. Suppose that the DMDP instance M = (S, A, P, r, γ) has an approximate transition model P̃ that admits a linear feature representation φ (Definition 1), such that for some ξ ∈ [0, 1],

‖P(·|s, a) − P̃(·|s, a)‖_TV ≤ ξ,  ∀(s, a).

Let Algorithm 1 take N > 0 samples and output a parameter w ∈ R^K. Then, with probability at least 1 − δ,

‖v^{π_w} − v*‖_∞ ≲ ( L/(1−γ)³ ) · ( √( K/(N(1−γ)) ) + ξ ) · polylog(NKδ^{-1}),

where L is the regularity constant from Assumption 1.

Remark 4 (Policy optimality guarantee). Our bound applies to v^{π_w}, i.e., the actual performance of the policy π_w in the real MDP. It is stated in the ℓ_∞ norm, i.e., the policy is near-optimal from every initial state. This is the strongest form of optimality guarantee for solving an MDP.

Remark 5 (Approximation error due to model misspecification). When the feature-based transition model is inexact up to ξ in total variation, there is an approximation gap in the policy's performance of order O( Lξ · polylog(NKδ^{-1}) / (1−γ)³ ). It suggests that, even if the observed feature values φ(s, a) cannot fully express the state and action, the Q-learning method can still find approximately optimal policies. The level of degradation depends on the total-variation divergence between the true transition distribution and its closest feature-based transition model.

Remark 6 (Sample complexity of Algorithm 1). When the MDP is fully realizable under the features, we have ξ = 0. Then the number of samples needed for achieving ϵ policy error is Õ( K/((1−γ)⁷ϵ²) ), treating the regularity constant L as O(1). It is independent of the size of the original state space, but depends linearly on K. Its dependence on 1/(1−γ) matches that of tabular phased Q-learning (Kearns & Singh, 1999), which has complexity O( SA/((1−γ)⁷ϵ²) ) (Sidford et al., 2018a).

Despite the fact that the MDP model has S·K unknown parameters, the basic parametric Q-learning method can produce good policies even with small data. However, there remains a gap between the currently achievable sample complexity (Theorem 2) and the lower bound (Theorem 1).

5. Sample-Optimal Parametric Q-Learning

In this section we accelerate the basic parametric Q-learning algorithm to maximize its sample efficiency. To do so, we need to modify the algorithm in nontrivial ways in order to take full advantage of the MDP's structure.

5.1. Anchor States and Monotonicity

In order to use samples more efficiently, we need to leverage the monotonicity of the Bellman operator (i.e., T v₁ ≥ T v₂ if v₁ ≥ v₂). However, when the Q-function is parameterized as a linear function of w, noisy updates on w may easily break the pointwise monotonicity in the Q space. To remedy this issue, we impose an additional assumption to ensure that monotonicity can be preserved implicitly.

Assumption 2 (Anchor State-Action Pairs). There exists a set of anchor state-action pairs K such that for any (s, a) ∈ S × A, its feature vector can be represented as a convex combination of the anchor features {φ(s_k, a_k) | k ∈ K}: there exist weights {λ_k} such that

φ(s, a) = Σ_{k∈K} λ_k φ(s_k, a_k),   Σ_{k∈K} λ_k = 1,   λ_k ≥ 0.

The anchors (s_k, a_k) can be viewed as vertices of the state-action space. They imply that the transition kernel P admits a nonnegative factorization, which can be seen by transforming φ linearly such that each anchor corresponds to a unit feature vector. This implicit non-negativity is a key to pointwise monotonic policy/value updates. The notion of anchor is a natural analog of the anchor word condition from topic modeling (Arora et al., 2012) and nonnegative matrix factorization (Donoho & Stodden, 2004). A similar notion of anchor state has been studied in the context of soft state aggregation models to uniquely identify latent meta-states (Duan et al., 2018).

Under the anchor assumption, without loss of generality, we will assume throughout that the φ's are nonnegative, each φ(s, a) is a vector of probabilities, and there are K anchors with unit feature vectors.
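Assumption 2 can be checked mechanically when there are exactly K linearly independent anchors: the convex-combination weights of every feature vector are obtained by a change of basis, as in the following sketch (ours; the features are synthetic and the anchors are assumed known in advance):

```python
# Sketch (ours) of checking Assumption 2 with exactly K linearly independent anchors:
# solve phi(s,a) = sum_k lambda_k * phi(s_k,a_k) by a change of basis, then verify
# that the recovered weights are nonnegative and sum to one.
import numpy as np

rng = np.random.default_rng(3)
K, n_pairs = 4, 50

anchors = rng.dirichlet(2.0 * np.ones(K), size=K)   # K anchor feature vectors (rows)
assert abs(np.linalg.det(anchors)) > 1e-12          # anchors are linearly independent
weights = rng.dirichlet(np.ones(K), size=n_pairs)   # ground-truth convex weights
Phi_all = weights @ anchors                          # phi(s,a) = sum_k lambda_k phi(s_k,a_k)

def anchor_weights(phi_sa, anchor_feats):
    """Recover lambda from phi(s,a) = anchor_feats^T lambda."""
    return np.linalg.solve(anchor_feats.T, phi_sa)

lam = np.array([anchor_weights(p, anchors) for p in Phi_all])
assert np.all(lam >= -1e-10) and np.allclose(lam.sum(axis=1), 1.0)
assert np.allclose(lam, weights)
print("Assumption 2 holds for all", n_pairs, "state-action pairs")
```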
5.2. Achieving The Optimal Sample Complexity

We develop a sample-optimal algorithm, which is implemented in Algorithm 2. Let us explain the features that enable it to find more accurate policies. Some of the ideas are due to Sidford et al. (2018b;a), where they were used to develop fast solvers for tabular MDPs.

Parametrization. For the purpose of preserving monotonicity, Algorithm 2 employs a new parametric form. It uses a collection of parameters θ = {w^(i)}_{i=1}^Z instead of a single vector, with Z = Õ(1/(1−γ)). The parameterized value and policy functions take the form

V_θ(s) := max_{h∈[Z]} max_{a∈A} [ r(s, a) + γ φ(s, a)⊤ w^(h) ]   and   π_θ(s) ∈ arg max_{a∈A} max_{h∈[Z]} [ r(s, a) + γ φ(s, a)⊤ w^(h) ].    (5)

Given θ, one can compute V_θ(s) and π_θ(s) by solving a one-step optimization problem. If a takes continuous values, this requires solving a nonlinear optimization problem.

Computing confidence bounds. In Step 13 and Step 18, the algorithm computes confidence bounds ϵ^(i,j) for the estimated values of P V_θ. These bounds tightly measure the distance from V_θ to the desired solution path, according to probabilistic concentration arguments. With these bounds, we can precisely shift our estimator downwards so that certain properties hold (e.g., monotonicity, explained next) while not incurring additional error.

Monotonicity preservation. The algorithm guarantees that the following condition holds throughout: V_θ ≤ T_{π_θ} V_θ, pointwise. We call this the monotonicity property; together with the monotonicity of the Bellman operator, it guarantees (by an induction proof) that

V_θ ≤ T_{π_θ} V_θ ≤ T²_{π_θ} V_θ ≤ · · · ≤ T^∞_{π_θ} V_θ = v^{π_θ} ≤ v*.

Algorithm 2 uses two algorithmic tricks to preserve the monotonicity property throughout the iterations. First, the parametric forms of V_θ and π_θ in (5) take the maximum across all previous parameters (indexed by h = (i, j)). This guarantees that V_θ is monotonically improving throughout the outer and inner iterations. Second, the algorithm shifts all the estimated values downwards by a term corresponding to their confidence bound (last equation of Line 13 and Line 18 of Algorithm 2). As a result, the estimated expectation is always smaller than the true expected value. By virtue of the nonnegativity of the features (due to Assumption 2), the estimate φ(s, a)⊤ w^(i,j) of the exact inner product P(·|s, a)⊤ V_{θ^(i,j−1)} for an arbitrary (s, a) is also shifted downwards. Then we have

φ(s, a)⊤ w^(i,j) ≤ P(·|s, a)⊤ V_{θ^(i,j−1)} ≤ P(·|s, a)⊤ V_{θ^(i,j)}.

By maximizing the left-hand side over a, we see that the monotonicity property is preserved inductively. See Lemma 8 (in the appendix) for a more detailed proof.
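The sketch below (ours) isolates this shift-and-clip step on a synthetic linear-model MDP. A Hoeffding-style confidence width stands in for the sharper Bernstein-style bound and the variance reduction used in Algorithm 2:

```python
# Sketch (ours) of the downward-shift step: estimate P(.|s_k,a_k)^T V at the anchors,
# subtract a Hoeffding-style confidence width (the paper uses a sharper Bernstein-type
# bound plus variance reduction), and clip to [0, 1/(1-gamma)].
import numpy as np

rng = np.random.default_rng(4)
S, A, K, gamma, delta, m = 30, 2, 5, 0.9, 1e-3, 5000
H = 1.0 / (1.0 - gamma)

# Anchor features are unit vectors; all other features are convex combinations of them.
Phi = rng.dirichlet(np.ones(K), size=(S, A))
anchors = [(k, 0) for k in range(K)]
for k, (s, a) in enumerate(anchors):
    Phi[s, a] = np.eye(K)[k]
Psi = rng.random((K, S)); Psi /= Psi.sum(axis=1, keepdims=True)
P = Phi @ Psi                                     # linear transition model (Definition 1)

V = rng.uniform(0.0, H, size=S)                   # any current value estimate in [0, H]

w = np.zeros(K)
for k, (s, a) in enumerate(anchors):
    samples = rng.choice(S, size=m, p=P[s, a])
    mean = V[samples].mean()                      # empirical estimate of P(.|s_k,a_k)^T V
    conf = H * np.sqrt(np.log(2 * K / delta) / (2 * m))   # Hoeffding confidence width
    w[k] = np.clip(mean - conf, 0.0, H)           # shift down, then clip

# With convex-combination features, phi(s,a)^T w underestimates P(.|s,a)^T V for every
# state-action pair simultaneously, with high probability.
lower = Phi @ w                                   # shape (S, A)
truth = P @ V                                     # shape (S, A)
print("uniform lower bound holds:", bool(np.all(lower <= truth + 1e-12)))
```

Because every φ(s, a) is a convex combination of anchor features (Assumption 2), shifting the anchor estimates downwards makes φ(s, a)⊤w a simultaneous lower bound on P(·|s, a)⊤V, which is exactly what the induction behind the monotonicity property needs.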
Variance reduction. The algorithm uses an outer loop and an inner loop for approximately iterating the Bellman operator. Each outer iteration performs a pre-estimation of a reference vector P V_{θ^(i,0)} (Step 13), which is used throughout the inner loop. For instance, let θ^(i,j) be the parameters at outer iteration i and inner iteration j. To obtain an entry Q^(i,j)(s, a) of the new Q-function, we need to estimate P(·|s, a)⊤ V_{θ^(i,j−1)} with sufficient accuracy, so we write

P(·|s, a)⊤ V_{θ^(i,j−1)} = P(·|s, a)⊤ ( V_{θ^(i,j−1)} − V_{θ^(i,0)} ) + P(·|s, a)⊤ V_{θ^(i,0)}.

Note that the reference P(·|s, a)⊤ V_{θ^(i,0)} is already approximated with high accuracy in Step 13. This allows the inner loop to successively refine the value and policy, while each inner iteration uses a smaller number of sample transitions to estimate the offset P(·|s, a)⊤ ( V_{θ^(i,j−1)} − V_{θ^(i,0)} ) (Step 18).

Putting together the preceding techniques, Algorithm 2 performs carefully controlled Bellman updates so that the estimated value and policy functions monotonically improve to the optimal ones. The algorithm contains R′ = Θ(log[ϵ^{-1}(1−γ)^{-1}]) outer iterations. Each outer iteration (indexed by i) starts with a policy satisfying v* − V_{θ^(i,0)} ≤ (H/2^i)·1 and ends with one satisfying v* − V_{θ^(i+1,0)} ≤ (H/2^{i+1})·1. The algorithm takes multiple rounds of mini-batches, where the sample size of each mini-batch is picked to be just enough to guarantee that the accumulation of the total error stays within ϵ. The algorithm fully exploits the monotonicity property of the Bellman operator as well as the error accumulation in the Markov process (to be explained in the proof outline below).

5.3. Optimal Sample Complexity Guarantee

In this section, we analyze the sample complexity of the algorithm provided in the last section.

Theorem 3 (Near-Optimal Sample Complexity). Suppose M = (S, A, P, r, γ) is an MDP instance admitting the feature representation φ : S × A → R^K. Suppose that Assumption 2 holds. Let δ, ϵ ∈ (0, 1) be parameters. Then Algorithm 2 takes

N = Θ( K/((1−γ)³ϵ²) · log^{4/3}( K/(ϵδ(1−γ)) ) · log²( 1/(ϵ(1−γ)) ) )

samples and outputs θ such that π_θ is ϵ-optimal from every initial state, with probability at least 1 − δ.

Theorem 3 is proved through a series of lemmas, which we defer to the appendix. Here we sketch the key ideas.

Proof Sketch. Let H = 1/(1−γ) for short. Each outer-loop iteration decreases the policy error upper bound by at least half. If θ^(i,0) is the parameter when the i-th outer iteration begins, we expect v* − V_{θ^(i,0)} ≤ (H/2^i)·1 with high probability. Therefore, after R′ = log(H/ϵ) outer iterations, we expect v* − V_{θ^(R′,0)} ≤ (H/2^{R′})·1 = O(ϵ)·1.

Now we analyze how many samples are sufficient within one outer-loop iteration. We show that the final error is mainly due to ϵ^(i,0), which comes from estimating the reference function P V_{θ^(i,0)} (Line 13). This error is carried through the entire inner loop since V_{θ^(i,0)} is used repeatedly (Line 18). A key step of the proof is to show that the error contributed by ϵ^(i,0) throughout the inner-loop iterations is small. By using the monotonicity property, we can show that

ϵ^(i,0)(s, a) ≲ √( σ_{v*}(s, a)/m ),  ∀(s, a),

where ≲ denotes "approximately less than" (ignoring non-leading terms), and σ_{v*} : S × A → R is an intrinsic variance function of the MDP:

σ_{v*}(s, a) := Var_{s′∼P(·|s,a)}[ v*(s′) ].

Using the monotonicity property, we prove by induction that

v* − v^{π_{θ^(i,R)}} ≤ v* − V_{θ^(i,R)} ≤ γ P_{π*} ( v* − V_{θ^(i,R−1)} ) + ϵ^(i,0)_{π*}
  ≤ · · · ≤ γ^R ( v* − V_{θ^(i,0)} ) + Σ_{t=0}^{R−1} γ^t (P_{π*})^t ϵ^(i,0)_{π*}
  ≲ (I − γ P_{π*})^{-1} ϵ^(i,0)_{π*}
  ≲ (I − γ P_{π*})^{-1} √( σ^{π*}_{v*} / m ),  pointwise w.h.p.,

where σ^{π*}_{v*}(s) = σ_{v*}(s, π*(s)), ϵ^(i,0)_{π*}(s) = ϵ^(i,0)(s, π*(s)), and m is the mini-batch size. Now we have found a connection between the error accumulation of the algorithm and the intrinsic variance of the MDP.
By a form of conditional law of total variance of the Markov process (Lemma 7) and using the convex combination property (Assumption 2), one has

(I − γ P_{π*})^{-1} √( σ^{π*}_{v*} ) ≲ √(H³) · 1,  pointwise.

Therefore the inner loop accumulates error Õ( √(H³/m) ), so a mini-batch size on the order of m = Õ(H³/ϵ²) = Õ((1−γ)^{-3}ϵ^{-2}) per anchor is enough. Finally, we prove by induction that all the desired events happen with sufficiently high probability, so that the iterates improve monotonically to the optimal solution within a sequence of carefully controlled error bars. The total number of outer iterations is only polylogarithmic, therefore the total sample size needed scales with (1−γ)^{-3}, up to the K/ϵ² and polylog factors stated in Theorem 3.

Algorithm 2  Optimal Phased Parametric Q-Learning (OPPQ-Learning)
 1: Input: A DMDP M = (S, A, P, r, γ) with anchor state-action pairs K; feature map φ : S × A → R^K
 2: Input: ϵ, δ ∈ (0, 1)
 3: Output: θ ⊂ R^K with |θ| = Θ[(1−γ)^{-1} log² ϵ^{-1}]
 4:
 5: Initialize: R′ ← Θ(log[ϵ^{-1}(1−γ)^{-1}]), R ← Θ[R′(1−γ)^{-1}]        ▷ numbers of outer/inner iterations
 6: {w^(i,j), ϵ^(i,j), w̄^(i,j)}_{i∈[0,R′], j∈[0,R]} ⊂ R^K initialized as 0 vectors   ▷ initialize parameters
 7: m ← C ϵ^{-2}(1−γ)^{-3} log^{4/3}(R R′ K δ^{-1})                        ▷ mini-batch size for the outer loop
    m₁ ← C (1−γ)^{-2} log(R R′ K δ^{-1}) for some constant C               ▷ mini-batch size for the inner loop
 8: θ^(0,0) ← {0} ⊂ R^K                                                    ▷ initialize the output to contain a single 0-vector
 9: Iterate:
10: Outer loop:
11: for i = 0, 1, . . . , R′ do
12:   for each k ∈ [K] do
13:     Obtain state samples x_k^(1), x_k^(2), . . . , x_k^(m) ∈ S from P(·|s_k, a_k) for (s_k, a_k) ∈ K. Let
          w^(i,0)(k) ← (1/m) Σ_{ℓ=1}^m V_{θ^(i,0)}(x_k^(ℓ)),   z^(i,0)(k) ← (1/m) Σ_{ℓ=1}^m V²_{θ^(i,0)}(x_k^(ℓ))
            ▷ empirical estimates of P_K V_{θ^(i,0)} and P_K V²_{θ^(i,0)}
          σ^(i,0)(k) ← z^(i,0)(k) − (w^(i,0)(k))²
            ▷ empirical estimate of the variance P_K V²_{θ^(i,0)} − (P_K V_{θ^(i,0)})²
          ϵ^(i,0)(k) ← Θ[ √( log(R R′ K δ^{-1}) σ^(i,0)(k) / m ) + log(R R′ K δ^{-1}) (1−γ)^{-1} / m^{3/4} ]
            ▷ confidence bound of the empirical estimator w^(i,0)
          w̄^(i,0)(k) ← max{ 0, min[ w^(i,0)(k) − ϵ^(i,0)(k), (1−γ)^{-1} ] }
            ▷ shift and clip the estimate
14:   end for
15:   Inner loop:
16:   for j = 1, 2, . . . , R do
17:     for each k ∈ [K] do
18:       Obtain state samples x_k^(1), x_k^(2), . . . , x_k^(m₁) ∈ S from P(·|s_k, a_k) for (s_k, a_k) ∈ K. Let
            w^(i,j)(k) ← (1/m₁) Σ_{ℓ=1}^{m₁} [ V_{θ^(i,j−1)}(x_k^(ℓ)) − V_{θ^(i,0)}(x_k^(ℓ)) ] + w^(i,0)(k)
              ▷ empirical estimate of P_K V_{θ^(i,j−1)}
            ϵ^(i,j)(k) ← ϵ^(i,0)(k) + Θ[ (1−γ)^{-1} 2^{-i} √( log(R R′ K δ^{-1}) / m₁ ) ]
              ▷ confidence bound of the estimate of P_K V_{θ^(i,j−1)}
            w̄^(i,j)(k) ← max{ 0, min[ w^(i,j)(k) − ϵ^(i,j)(k), (1−γ)^{-1} ] }
              ▷ shift and clip the estimate
19:     end for
20:     θ^(i,j) ← θ^(i,j−1) ∪ {w̄^(i,j)}          ▷ attach the newly estimated parameter to θ
21:   end for
22:   θ^(i+1,0) ← θ^(i,R)                          ▷ prepare the next outer loop
23: end for
24: Return θ^(R′,R)

Remark 7 (Sample Optimality of Algorithm 2). Theorem 3 matches the information-theoretic lower bound of Theorem 1 up to polylog factors with respect to all parameters S, A, K, ϵ, 1−γ (note that Theorem 1 still holds under the anchor restriction). Therefore it is a sample-optimal method for solving the feature-based MDP: no other method can outperform it by more than polylog factors.

Remark 8 (About Anchor State-Actions). The proof of Theorem 3 relies on the anchor assumption. The monotonicity property can be preserved because the anchor state-action pairs imply an implicit non-negative factorization of the transition kernel. The convex combination property of the anchor state-actions is used in analyzing the error accumulation, as needed by the conditional law of total variance. The anchor condition is commonly believed to be a key to identifying nonnegative models; see for example Donoho & Stodden (2004).
We believe this is the first observation that it also relates to sample-optimal reinforcement learning. Note that it is possible that the number of anchors is greater than the number of features K; in that case one can append new (dependent) features to make them equal. In this sense Assumption 2 always holds, and the actual sample complexity depends on the number of anchors (instead of the number of features). In addition, the anchors can be pre-computed as long as the feature map φ is known.

Remark 9 (Significance of the (1−γ)^{-4} Improvement). Let us compare the sample complexities of Algorithms 1 and 2. They differ by a multiplicative gap of (1−γ)^{-4}. Recall that γ ∈ (0, 1) is the discount factor. One can view (1−γ)^{-1} = 1 + γ + γ² + · · · as an approximate horizon. If γ = 0.99, the MDP essentially has 100 time steps, and (1−γ)^{-4} = 10⁸, i.e., Algorithm 2 is 10⁸ times faster. It needs only a tiny portion (1/10⁸) of the samples required by the basic algorithm. We see that clever algorithmic use of the monotonicity and variance structures of the MDP saves big.

6. Related Literatures

There is a body of work studying the sample complexity of the tabular DMDP (i.e., the finite-state, finite-action case without structural knowledge). Sample-based algorithms for learning value and policy functions have been studied in Kearns & Singh (1999); Kakade (2003); Singh & Yee (1994); Azar et al. (2011b; 2013); Sidford et al. (2018b;a) and many others. Among these papers, Azar et al. (2013) obtains the first tight sample bound for finding an ϵ-optimal value function, and Sidford et al. (2018a) obtains the first tight sample bound for finding an ϵ-optimal policy; both complexities are of the form Õ[|S||A|(1−γ)^{-3}ϵ^{-2}]. Lower bounds have been shown in Azar et al. (2011a); Even-Dar et al. (2006) and Azar et al. (2013); Azar et al. (2013) gives the first tight lower bound Ω[|S||A|(1−γ)^{-3}ϵ^{-2}].

Our result is relevant to the large body of work using linear models and basis functions to approximate value and Q-functions. For instance, Tsitsiklis & Van Roy (1997); Nedić & Bertsekas (2003); Lagoudakis & Parr (2003); Melo et al. (2008); Parr et al. (2008); Sutton et al. (2009); Lazaric et al. (2012); Tagorti & Scherrer (2015) and Maei et al. (2010) study both policy evaluation and optimization by assuming values lie in a linear space. Tsitsiklis & Van Roy (1997) studied the convergence of the temporal-difference learning algorithm for approximating the value function of a fixed policy. Nedić & Bertsekas (2003) studies the policy evaluation problem using least squares. Parr et al. (2008) studies the relationship between using linear functions to represent values and to represent transition models. Melo et al. (2008) studies the almost sure convergence of Q-learning-like methods using linear function approximation. Sutton et al. (2009) shows that off-policy temporal-difference learning is convergent with linear function approximation. These earlier works primarily focused on convergence using linear function approximation, without analyzing the sample complexity.

Fitted value iteration (VI) applies to more general function approximators of the value function (Munos & Szepesvári, 2008; Antos et al., 2008a; Farahmand et al., 2010; Antos et al., 2008b), where v* is approximated within a low-dimensional function space F. These works show that the error of fitted VI is affected by the Bellman error of the space F.
Their results apply to a general class of function spaces, where the statistical error depends on a polynomial of 1/ϵ, 1/(1−γ), and the intrinsic dimension of the function space. It appears that their results work for the ℓ_p norm of the policy error, which is proportional to ϵ^{Θ(p)} with high probability. Their results do not apply to the ℓ_∞ policy error, which is the focus of the current paper. More recently, Lazaric et al. (2012) and Tagorti & Scherrer (2015) analyze the sample complexity of least-squares temporal-difference methods for evaluating a fixed policy.

Recently, Jiang et al. (2017) study the case where a form of Bellman error is decomposable and has a small rank. They show that the number of trajectories needed depends on the Bellman rank rather than the number of states. Chen et al. (2018) propose a primal-dual method for policy learning that uses linear models and state-action features for both the value function and the state-action distribution. To our best knowledge, there is no existing result that solves the linear-model MDP with provably optimal sample complexity.

This paper studies the information-theoretic sample complexity of solving MDPs with a feature-based linear transition model. It provides the first sharp sample complexity upper and lower bounds for learning a policy using a generative model. It also provides a sample-optimal parametric Q-learning method that involves computing confidence bounds, variance reduction, and monotonic improvement. We hope that establishing sharp results for this basic linear model will shed light on more general structured models and motivate faster solutions.

Acknowledgment

We thank Zhuoran Yang for pointing out a flaw in the initial proof of Proposition 2. We thank the anonymous reviewers for their helpful comments.

References

Antos, A., Szepesvári, C., and Munos, R. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems, pp. 9-16, 2008a.

Antos, A., Szepesvári, C., and Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89-129, 2008b.

Arora, S., Ge, R., and Moitra, A. Learning topic models - going beyond SVD. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pp. 1-10. IEEE, 2012.

Azar, M. G., Munos, R., Ghavamzadeh, M., and Kappen, H. Reinforcement learning with a near optimal rate of convergence. 2011a.

Azar, M. G., Munos, R., Ghavamzadeh, M., and Kappen, H. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011b.

Azar, M. G., Munos, R., and Kappen, H. J. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325-349, 2013.

Azizzadenesheli, K., Lazaric, A., and Anandkumar, A. Reinforcement learning in rich-observation MDPs using spectral methods. arXiv preprint arXiv:1611.03907, 2016.

Bertsekas, D. P. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 2005.

Chen, Y., Li, L., and Wang, M. Scalable bilinear π learning using state and action features. In Proceedings of the 35th International Conference on Machine Learning, pp. 834-843, Stockholm, Sweden, 2018. PMLR.

Donoho, D. and Stodden, V. When does non-negative matrix factorization give a correct decomposition into parts? In Advances in Neural Information Processing Systems, pp. 1141-1148, 2004.
Duan, Y., Ke, Z. T., and Wang, M. State aggregation learning from Markov transition data. arXiv preprint arXiv:1811.02619, 2018.

Even-Dar, E., Mannor, S., and Mansour, Y. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079-1105, 2006.

Farahmand, A.-m., Szepesvári, C., and Munos, R. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, pp. 568-576, 2010.

Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning, pp. 1704-1713. JMLR.org, 2017.

Kakade, S. M. On the Sample Complexity of Reinforcement Learning. PhD thesis, University of London, London, England, 2003.

Kearns, M. J. and Singh, S. P. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems, pp. 996-1002, 1999.

Lagoudakis, M. G. and Parr, R. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107-1149, 2003.

Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13(Oct):3041-3074, 2012.

Maei, H. R., Szepesvári, C., Bhatnagar, S., and Sutton, R. S. Toward off-policy learning control with function approximation. In ICML, pp. 719-726, 2010.

Melo, F. S., Meyn, S. P., and Ribeiro, M. I. An analysis of reinforcement learning with function approximation. In Proceedings of the 25th International Conference on Machine Learning, pp. 664-671. ACM, 2008.

Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815-857, 2008.

Nedić, A. and Bertsekas, D. P. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13(1-2):79-110, 2003.

Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 752-759. ACM, 2008.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Advances in Neural Information Processing Systems 31, pp. 5192-5202, 2018a.

Sidford, A., Wang, M., Wu, X., and Ye, Y. Variance reduced value iteration and faster algorithms for solving Markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 770-787. Society for Industrial and Applied Mathematics, 2018b.

Singh, S. P. and Yee, R. C. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227-233, 1994.

Singh, S. P., Jaakkola, T., and Jordan, M. I. Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, pp. 361-368, 1995.
Sutton, R. S., Maei, H. R., and Szepesvári, C. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pp. 1609-1616, 2009.

Tagorti, M. and Scherrer, B. On the rate of convergence and error bounds for LSTD(λ). In International Conference on Machine Learning, pp. 1521-1529, 2015.

Tsitsiklis, J. N. and Van Roy, B. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1-3):59-94, 1996.

Tsitsiklis, J. N. and Van Roy, B. An analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1075-1081, 1997.