# Nearly Horizon-Free Offline Reinforcement Learning

Tongzheng Ren¹, Jialian Li², Bo Dai³, Simon S. Du⁴, Sujay Sanghavi¹,⁵
¹ UT Austin  ² Tsinghua University  ³ Google Research, Brain Team  ⁴ University of Washington  ⁵ Amazon Search
tongzheng@utexas.edu, lijialian7@163.com, bodai@google.com, ssdu@cs.washington.edu, sanghavi@mail.utexas.edu

**Abstract.** We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes (MDPs). For tabular MDPs with $S$ states and $A$ actions, or linear MDPs with anchor points and feature dimension $d$, given $K$ episodes of collected data with minimum visiting probability $d_m$ of (anchor) state-action pairs, we obtain nearly horizon $H$-free sample complexity bounds for offline reinforcement learning when the total reward is upper bounded by 1. Specifically, for offline policy evaluation we obtain an $\widetilde{O}\big(\sqrt{1/(K d_m)}\big)$ error bound for the plug-in estimator, which matches the lower bound up to logarithmic factors and has no additional $\mathrm{poly}(H, S, A, d)$ dependency in the higher-order term. For offline policy optimization, we obtain an $\widetilde{O}\big(\sqrt{1/(K d_m)} + \min\{S, d\}/(K d_m)\big)$ suboptimality gap for the empirical optimal policy, which approaches the lower bound up to logarithmic factors and a higher-order term, improving upon the best known result of [1] that has additional $\mathrm{poly}(H, S, d)$ factors in the main term. To the best of our knowledge, these are the first set of nearly horizon-free bounds for episodic time-homogeneous offline tabular MDPs and linear MDPs with anchor points. Central to our analysis is a simple yet effective recursion-based method for bounding a total variance term in the offline scenario, which could be of independent interest.

## 1 Introduction

Reinforcement Learning (RL) aims to learn to make sequential decisions that maximize the long-term reward in unknown environments, and has demonstrated success in game playing [2, 3], robotics [4], and automatic algorithm design [5]. These successes rely on being able to deploy the algorithms in a way that directly interacts with the respective environment, allowing them to improve their policies in a trial-and-error manner. However, such direct interaction with real environments can be expensive or even impossible in other real-world applications, e.g., education [6], health and medicine [7, 8], conversational AI [9] and recommendation systems [10]. Instead, we are often given a collection of logged experiences generated in the past by potentially multiple and possibly unknown policies. This lack of access to real-time interaction with an environment led to the field of offline reinforcement learning [11]. Within this field, offline policy evaluation (OPE) focuses on evaluating a policy, and offline policy optimization (OPO) focuses on improving policies; both rely only on the given fixed past experiences without any further interaction.

OPE and OPO are in general notoriously difficult, as unbiased estimators of the policy value can suffer from variance that grows exponentially with the horizon in the worst case [12, 13]. To overcome this curse of horizon in OPE, [14, 15] first introduced marginalized importance sampling (MIS) based estimators. They showed that if (1) all of the logged experiences are generated from the same behavior policy, and (2) the behavior policy is known, then the exponential dependency on the horizon can be improved to a polynomial dependency.
Subsequently, [16–18] showed that the polynomial dependency can be achieved even without assumptions (1) and (2). The basic idea behind these MIS-based estimators is to estimate the marginal state-action density ratio between the target policy and the empirical data, so as to adjust for the distribution mismatch between them. On the algorithmic side, marginal density ratio estimation can be implemented by plug-in estimators [15, 19], temporal-difference updates [20, 21], or solving a min-max optimization [16, 17, 22, 18]. These OPE estimators can also be used as one component of OPO, resulting in the algorithms of [23–25], which also inherit the polynomial dependency on the horizon.

In a different but related line of work, [26, 27] recently showed that, for the online episodic time-homogeneous tabular Markov Decision Process (MDP), which allows interaction with the environment, the sample complexity only has polylog dependency on the horizon. This motivates us to consider the following question:

Can offline reinforcement learning escape from the polynomial dependency on the horizon?

In this paper, we provide an affirmative answer to this question. Specifically, considering the episodic time-homogeneous tabular MDP with $S$ states, $A$ actions and horizon $H$, or the linear MDP with anchor points and feature dimension $d$, and assuming the total reward of any episode is upper bounded by 1 almost surely, we obtain the following nearly $H$-free bounds:

- For offline policy evaluation (OPE), we show that the plug-in estimator has a finite-sample error of $\widetilde{O}\big(\sqrt{1/(K d_m)}\big)$ (Theorems 1 and 5), where $K$ is the number of episodes and $d_m$ is the minimum visiting probability of (anchor) state-action pairs, which matches the lower bound up to logarithmic factors. We emphasize that the bound has no additional $\mathrm{poly}(H, S, A, d)$ dependency in the higher-order term, unlike the known results of [19, 24].
- For offline policy optimization (OPO), we show that the policy obtained by model-based planning on the empirical MDP has a suboptimality gap of $\widetilde{O}\big(\sqrt{1/(K d_m)} + \min\{S, d\}/(K d_m)\big)$ (Theorems 3 and 6), which matches the lower bound up to logarithmic factors and a higher-order term. This also improves upon the best known result from [1] by removing additional $\mathrm{poly}(H, S, d)$ factors in the main term.

To the best of our knowledge, these are the first set of nearly horizon-free bounds for both OPE and OPO on time-homogeneous tabular MDPs and linear MDPs with anchor points. To achieve such sharp bounds, we propose a novel recursion-based method to bound a total variance term (introduced below) that broadly emerges in offline reinforcement learning, which could be of independent interest.

**Technique Overview.** With a sequence of fairly standard steps from the literature, we can bound the error of the plug-in estimator via terms related to the square root of the total variance (also known as the Cramér–Rao-type lower bound illustrated in [13]),
$$\sum_{h\in[H]}\sum_{s,a}\nu^{\pi}_h(s,a)\,\mathrm{Var}_{P(\cdot\mid s,a)}\big[V^{\pi}_{h+1}(s')\big],$$
where $\nu^{\pi}_h$ is the reaching probability, $P$ is the transition and $V^{\pi}$ is the value function under policy $\pi$ (for formal definitions, see Section 3). An improper bound on this term introduces unnecessary dependency on the horizon, either in the main term or in the higher-order term. We instead bound this term with a recursive method, by observing that this total variance term can be approximately upper bounded by the square root of the total variance of the squared value function, $\sum_{h,s,a}\nu^{\pi}_h(s,a)\,\mathrm{Var}_{P(\cdot\mid s,a)}\big[V^{\pi}_{h+1}(s')^2\big]$, plus some remaining term (see Lemma 1 and Lemma 3 for the details).
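In schematic form (this is a simplification for intuition only; constants, logarithmic factors, and the exact remainder terms are given in Lemma 1 and Lemma 3), the observation reads, for every $i \ge 0$,
$$\sum_{h,s,a}\nu^{\pi}_h(s,a)\,\mathrm{Var}_{P(\cdot\mid s,a)}\!\Big[V^{\pi}_{h+1}(s')^{2^{i}}\Big] \;\lesssim\; \sqrt{\frac{1}{K d_m}\sum_{h,s,a}\nu^{\pi}_h(s,a)\,\mathrm{Var}_{P(\cdot\mid s,a)}\!\Big[V^{\pi}_{h+1}(s')^{2^{i+1}}\Big]} \;+\; (\text{remainder}_i),$$
while, since the total reward (and hence every value function) is bounded by 1, each of these total variance terms is trivially at most $H$.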
Applying this argument recursively, we can finally obtain a $\mathrm{polylog}\,H$ upper bound on the total variance, which eventually gets rid of the polynomial dependency on $H$. The idea of higher-order expansion has been investigated in [28, 27]. We notice that the recursion introduced in [28] was designed for the infinite-horizon setting, and how to generalize their technique to the finite-horizon setting is still unclear. Our recursion is conceptually more similar to the recursion in [27]. However, [27] considered the online setting, where one only needs to bound the error on the visited state-action pairs. For the offline setting, we need to bound the error on every state-action pair that can be reached by the policy $\pi$. This introduces the reaching probability into the total variance term, which we need to handle using the MDP structure. As a result, our recursion is significantly different from their counterpart, especially in the case of the linear MDP.

**Organization.** Our paper is organized as follows: in Section 2, we review the related literature on offline reinforcement learning, and we briefly introduce the problem we consider in Section 3. In Sections 4 and 5, we present our results on offline policy evaluation and offline policy optimization for tabular MDPs, respectively, and in Section 6, we show how to generalize our results to the linear MDP with anchor points. We conclude and discuss our results in Section 7.

## 2 Related Work

In this section, we briefly discuss the related literature in three categories: offline policy evaluation, offline policy optimization, and horizon-free online reinforcement learning. Notice that, for settings that assume an additional generative model, typical model-based algorithms first query an equal number of samples from each state-action pair and then perform offline policy evaluation/optimization based on the queried data. Thus we view reinforcement learning with a generative model as a special instance of offline reinforcement learning. To make the comparison fair, for methods and analyses that do not assume Assumption 1, we scale the error and sample complexity by assuming the per-step reward is upper bounded by $1-\gamma$ and $1/H$ in the infinite-horizon and finite-horizon settings, respectively.

Table 1: A comparison of existing offline policy evaluation results. The sample complexity in the infinite-horizon setting is the number of transition queries needed, while in the episodic setting it is the number of episodes needed. If Non-Uniform Reward is "Yes", the MDP we consider satisfies Assumption 1; otherwise, we assume the per-step reward is upper bounded by $1-\gamma$ and $1/H$, respectively. (Entries that could not be recovered are left blank.)

| Analysis | Setting | Non-Uniform Reward | Sample Complexity |
|---|---|---|---|
| [28] | Infinite-horizon tabular | Yes | $\widetilde{O}\big(\tfrac{1}{d_m(1-\gamma)\epsilon^2}\big)$ |
| [29] | Infinite-horizon tabular | Yes | $\widetilde{\Omega}\big(\tfrac{1}{d_m(1-\gamma)\epsilon^2}\big)$ |
| [19] | Finite-horizon time-inhomogeneous tabular | No |  |
| [13] | Finite-horizon time-inhomogeneous tabular | No |  |
| This work | Finite-horizon time-homogeneous tabular/linear | Yes | $\widetilde{O}\big(\tfrac{1}{d_m\epsilon^2}\big)$ |
| Lower bound | Finite-horizon time-homogeneous tabular/linear | Yes | $\widetilde{\Omega}\big(\tfrac{1}{d_m\epsilon^2}\big)$ |

**Offline Policy Evaluation.** For OPE in infinite-horizon tabular MDPs, [28] showed that the plug-in estimator achieves an error that matches the lower bound in [29] up to logarithmic factors under Assumption 1 (see Table 1). For OPE in finite-horizon time-inhomogeneous tabular MDPs, [19, 24] provided an error bound under the uniform reward assumption that matches the lower bound of [13] up to logarithmic factors and an additional higher-order term.
We here consider the time-homogeneous MDP and obtain an error bound of $\widetilde{O}\big(\sqrt{\tfrac{1}{Kd_m}}\big)$ that does not have the additional $SA$ factor in the higher-order term, in contrast to [19, 24]. Beyond the tabular setting, [30] considered the performance of the plug-in estimator with linear function approximation under the assumption of a linear MDP without anchor points, and [31, 32] provided more detailed analyses of the statistical properties of different kinds of estimators under different assumptions, which are not directly comparable to our work. Recently, there have also been works [e.g., 33] focusing on interval estimation of the policy value for practical applications.

**Offline Policy Optimization.** Offline policy optimization for infinite-horizon MDPs dates back to [35]. [28] recently showed that a perturbed version of model-based planning can find an $\epsilon$-optimal policy within $\widetilde{O}\big(\tfrac{1}{d_m(1-\gamma)\epsilon^2}\big)$ queries of transitions in infinite-horizon tabular MDPs when the total reward is upper bounded by 1, which matches the lower bound up to logarithmic factors. For the finite-horizon time-inhomogeneous tabular MDP setting, [24] showed that model-based planning can identify an $\epsilon$-optimal policy with a number of episodes that matches the lower bound for time-inhomogeneous MDPs up to logarithmic factors. When it comes to the finite-horizon time-homogeneous tabular MDP and the linear MDP with anchor points, [1] provided an episode complexity for model-based planning that is a factor of $\min\{H, S, d\}$ away from the lower bound. We provide a $\widetilde{O}\big(\tfrac{1}{d_m\epsilon^2} + \tfrac{\min\{S,d\}}{d_m\epsilon}\big)$ episode complexity, which matches the lower bound up to logarithmic factors and a higher-order term. A recent work [36] considered solving offline policy optimization with model-free Q-learning with variance reduction. Although their algorithm can match the sample complexity lower bound when the per-step reward is upper bounded by 1, model-free algorithms generally need a large minimum number of episodes of data to finish the algorithm (a.k.a. the sample size barrier in [28]), and it is unclear how to translate their results to the setting with total reward upper bounded by 1.

Table 2: A comparison of existing offline learning results. The sample complexity in the infinite-horizon setting is the number of transition queries needed, while in the episodic setting it is the number of episodes needed. If Non-Uniform Reward is "Yes", the MDP we consider satisfies Assumption 1; otherwise, we assume the per-step reward is upper bounded by $1-\gamma$ and $1/H$, respectively. (Entries that could not be recovered are left blank.)

| Analysis | Setting | Non-Uniform Reward | Sample Complexity |
|---|---|---|---|
| [34] | Infinite-horizon tabular | No | $\widetilde{O}\big(\tfrac{1}{d_m(1-\gamma)\epsilon^2}\big)$ |
| [28] | Infinite-horizon tabular | Yes | $\widetilde{O}\big(\tfrac{1}{d_m(1-\gamma)\epsilon^2}\big)$ |
| [24] | Finite-horizon time-inhomogeneous tabular | No |  |
| [1] | Finite-horizon time-homogeneous linear | No |  |
| [27] | Finite-horizon time-homogeneous tabular (online) | Yes | $\widetilde{O}\big(\tfrac{SA}{\epsilon^2} + \tfrac{S^2A}{\epsilon}\big)$ |
| This work | Finite-horizon time-homogeneous tabular/linear | Yes | $\widetilde{O}\big(\tfrac{1}{d_m\epsilon^2} + \tfrac{\min\{S,d\}}{d_m\epsilon}\big)$ |
| Lower bound | Finite-horizon time-homogeneous tabular/linear | Yes | $\widetilde{\Omega}\big(\tfrac{1}{d_m\epsilon^2}\big)$ |

**Horizon-Free Online Reinforcement Learning.** Several works have obtained nearly horizon-free sample complexity bounds for online reinforcement learning. In the time-homogeneous setting, whether the sample complexity needs to scale polynomially with $H$ was an open problem raised by [37]. The problem was first addressed by [26], who proposed an $\epsilon$-net over the optimal policies and a simulation-based algorithm to obtain a sample complexity that only scales logarithmically with $H$, though their dependency on $S$, $A$ and $\epsilon$ is not optimal and their algorithm is not computationally efficient.
The sample complexity bound was later substantially improved by [27], who obtained an $\widetilde{O}\big(\tfrac{SA}{\epsilon^2} + \tfrac{S^2A}{\epsilon}\big)$ bound, which nearly matches the contextual bandit (tabular MDP with $H=1$) lower bound [38] up to the additive $\tfrac{S^2A}{\epsilon}$ term. Their key ideas are (1) a new bonus function and (2) a recursion-based approach to obtain a tight bound on the sample complexity. Such a recursion-based approach cannot be directly applied to the offline setting, and we develop a novel recursion method suited to the offline scenario.

## 3 Problem Setup

**Notation.** Throughout this paper, we use $[N]$ to denote the set $\{1, 2, \dots, N\}$ for $N\in\mathbb{Z}_+$, and $\Delta(E)$ to denote the set of probability measures over the event set $E$. Moreover, for simplicity, we use $\iota$ to denote $\mathrm{polylog}(S, A, H, d, 1/\delta)$ factors (which can change from line to line), where $\delta$ is the failure probability. We use $\widetilde{O}$ and $\widetilde{\Omega}$ to denote upper and lower bounds up to logarithmic factors.

### 3.1 Markov Decision Process

The Markov Decision Process (MDP) is one of the most standard models studied in reinforcement learning, usually denoted as $M = (\mathcal{S}, \mathcal{A}, R, P, \mu)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $R:\mathcal{S}\times\mathcal{A}\to\Delta(\mathbb{R}_+)$ is the reward, $P:\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ is the transition, and $\mu$ is the initial state distribution. We additionally define $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}_+$ to denote the expected reward. We focus on the episodic MDP with horizon¹ $H\in\mathbb{Z}_+$ and the time-homogeneous setting², in which $P$ and $R$ do not depend on the level $h\in[H]$. A (potentially non-stationary) policy is defined as $\pi = (\pi_1, \pi_2, \dots, \pi_H)$, where $\pi_h:\mathcal{S}\to\Delta(\mathcal{A})$ for all $h\in[H]$. Following the standard definition, we define the value function $V^{\pi}_h(s) := \mathbb{E}_{P,\pi}\big[\sum_{t=h}^{H} R(s_t, a_t)\mid s_h = s\big]$ and the action-value function (i.e., the Q-function) $Q^{\pi}_h(s,a) := \mathbb{E}_{P,\pi}\big[\sum_{t=h}^{H} R(s_t, a_t)\mid (s_h, a_h) = (s,a)\big]$, which are the expected cumulative rewards under the transition $P$ and policy $\pi$ starting from $s_h = s$ and $(s_h, a_h) = (s,a)$. Notice that, even though $P$ and $R$ are invariant under the change of $h$, $V^{\pi}$ and $Q^{\pi}$ always depend on $h$ in the episodic setting, which introduces additional technical difficulties compared with the infinite-horizon setting. The expected cumulative reward under policy $\pi$ is defined as $v^{\pi} := \mathbb{E}_{\mu}[V^{\pi}_1(s)]$, and our ultimate goal is to find the optimal policy of $M$, which can be written as $\pi^{*} = \arg\max_{\pi} v^{\pi}$. We additionally define the reaching probabilities $\nu^{\pi}_h(s) = \mathbb{P}_{\mu, P, \pi}(s_h = s)$ and $\nu^{\pi}_h(s,a) = \mathbb{P}_{\mu, P, \pi}(s_h = s, a_h = a)$, which represent the probability of reaching state $s$ and state-action pair $(s,a)$ at time step $h$. Obviously we have $\sum_s \nu^{\pi}_h(s) = 1$ and $\sum_{s,a}\nu^{\pi}_h(s,a) = 1$ for all $h\in[H]$, and we also have the relations $\nu^{\pi}_h(s,a) = \nu^{\pi}_h(s)\,\pi_h(a\mid s)$ and $\nu^{\pi}_{h+1}(s') = \sum_{s,a}\nu^{\pi}_h(s,a)P(s'\mid s,a)$. With $\nu^{\pi}_h$ at hand, we can write $v^{\pi}$ in an equivalent way, $v^{\pi} = \sum_{h\in[H]}\sum_{s,a}\nu^{\pi}_h(s,a)\,r(s,a)$, which provides a dual perspective on policy evaluation [18].

¹ A common belief is that we can always reproduce results between the episodic time-homogeneous setting and the infinite-horizon setting by substituting the horizon $H$ in the episodic setting with the effective horizon $\frac{1}{1-\gamma}$ of the infinite-horizon setting. However, this argument does not always hold; for example, the dependency-decoupling technique used in [34, 28] cannot be directly applied in the episodic setting.

² Some previous works consider the time-inhomogeneous setting [e.g., 39], where $P$ and $R$ can vary for different $h\in[H]$. It is noteworthy that an additional $H$ factor in the sample complexity is needed to identify an $\epsilon$-optimal policy for time-inhomogeneous MDPs compared with time-homogeneous MDPs. Transferring the analysis from the time-homogeneous setting to the time-inhomogeneous setting is trivial (we only need to replace $S$ with $HS$), but not vice versa, as an analysis for the time-inhomogeneous setting probably does not exploit the invariant transition sufficiently.

### 3.2 Offline Reinforcement Learning

Generally, $P$ and $r$ are not revealed to the learner, which means we can only learn about $M$ and identify the optimal policy with data from different kinds of sources. In offline reinforcement learning, the learner only has access to a collection of data $D = \{(s_i, a_i, r_i, s'_i)\}_{i\in[n]}$, where $r_i\sim R(s_i, a_i)$ and $s'_i\sim P(\cdot\mid s_i, a_i)$, collected over $K$ episodes with a (known or unknown) behavior policy (so that $n = KH$). For simplicity, define $n(s,a)$ as the number of data points with $(s_i, a_i) = (s,a)$, and $n(s,a,s')$ as the number of data points with $(s_i, a_i, s'_i) = (s,a,s')$. With $D$, the learner is generally asked to perform two kinds of tasks. The first one is offline policy evaluation (a.k.a. off-policy evaluation), which aims to estimate $v^{\pi}$ for a given $\pi$.
The second one is offline policy optimization, which aims to find a policy $\hat\pi$ that performs well on $M$. We are interested in the statistical limits imposed by the limited amount of data, and in how to approach these limits with simple and computationally efficient algorithms.

## 4 Offline Policy Evaluation

In this section, we first consider offline policy evaluation for the tabular MDP with a finite number of states $S = |\mathcal{S}| < \infty$ and a finite number of actions $A = |\mathcal{A}| < \infty$, which is the basis for the more general settings. We first introduce the plug-in estimator we consider, which is equivalent to different kinds of estimators that are widely used in practice. Then we describe the assumptions we use, show the error bound of the plug-in estimator, and provide a proof sketch of the error bound.

### 4.1 The Plug-in Estimator

Here we introduce the plug-in estimator. We first build the empirical MDP $\widehat{M}$ with the data:
$$\hat{P}(s'\mid s,a) = \frac{n(s,a,s')}{n(s,a)}, \qquad \hat{r}(s,a) = \frac{\sum_{i\in[n]} r_i\,\mathbb{1}_{(s_i,a_i)=(s,a)}}{n(s,a)},$$
where $\mathbb{1}$ is the indicator function. Then we correspondingly define $\hat{V}^{\pi}_h$, $\hat{Q}^{\pi}_h$ and finally the estimator $\hat{v}^{\pi}$, for all $h\in[H]$, by substituting $P$ and $r$ in $V^{\pi}_h$, $Q^{\pi}_h$ and $v^{\pi}$ with $\hat{P}$ and $\hat{r}$. Such computation can be efficiently implemented with dynamic programming. We also introduce the reaching probabilities $\hat{\nu}^{\pi}_h(s) = \mathbb{P}_{\mu,\hat{P},\pi}(s_h = s)$ and $\hat{\nu}^{\pi}_h(s,a) = \mathbb{P}_{\mu,\hat{P},\pi}(s_h = s, a_h = a)$ in the empirical MDP $\widehat{M}$, which will be helpful in our analysis. The plug-in estimator has been studied in [30] under the assumption of a linear transition. It is known that the plug-in estimator is equivalent to the MIS estimator proposed in [19] and to a certain version of the DualDICE estimator with batch updates proposed in [16].

### 4.2 Theoretical Guarantee

Here we first summarize the assumptions we use for the tabular MDP.

**Assumption 1 (Bounded Total Reward).** For any policy $\pi$, we have $\sum_{h\in[H]} r_h \le 1$ almost surely, where $s_1\sim\mu$, $a_h\sim\pi(\cdot\mid s_h)$, $r_h\sim R(s_h, a_h)$ and $s_{h+1}\sim P(\cdot\mid s_h, a_h)$ for all $h\in[H]$. This also means $\mathbb{P}_{r\sim R(s,a)}(r > 1) = 0$ for all $(s,a)$.

This is the key assumption used in [26, 27] to escape the polynomial dependence on the horizon in the episodic setting. As discussed in [37, 26, 27], this assumption is also more general than the uniformly bounded reward assumption that $r(s,a)\le 1/H$ for all $(s,a)$. Thus, all of the results in this paper can be generalized to the uniformly bounded reward setting with a proper scaling of the total reward bound.

**Assumption 2 (Data Coverage).** For all $(s,a)$, $n(s,a)\ge n\,d_m$.

This assumption has been used in [19, 24] and is similar to the concentration coefficient assumption originating from [40].
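As an illustration of the plug-in estimator of Section 4.1 and of the quantities in Assumption 2, here is a minimal sketch in Python. The function name, array layout, and divide-by-zero guard are illustrative choices and not part of the paper; it assumes the data are given as integer-indexed tuples $(s, a, r, s')$.

```python
import numpy as np

def plug_in_estimate(data, S, A, H, pi, mu):
    """Plug-in OPE via the empirical MDP (sketch of Section 4.1).

    data: list of (s, a, r, s_next) tuples with integer states/actions.
    pi:   array (H, S, A) with pi[h, s, a] = pi_h(a | s).
    mu:   initial state distribution, shape (S,).
    """
    n_sa = np.zeros((S, A))
    n_sas = np.zeros((S, A, S))
    r_sum = np.zeros((S, A))
    for s, a, r, s_next in data:
        n_sa[s, a] += 1
        n_sas[s, a, s_next] += 1
        r_sum[s, a] += r

    # Empirical transition P_hat(s'|s,a) and mean reward r_hat(s,a).
    # Under Assumption 2 every n(s,a) >= n*d_m > 0; the guard only avoids 0/0 otherwise.
    denom = np.maximum(n_sa, 1)
    P_hat = n_sas / denom[:, :, None]
    r_hat = r_sum / denom

    # Backward dynamic programming on the empirical MDP gives V_hat_h and v_hat.
    V_hat = np.zeros(S)                       # V_hat_{H+1} = 0
    for h in reversed(range(H)):
        Q_hat = r_hat + P_hat @ V_hat         # Q_hat_h(s,a) = r_hat + <P_hat(.|s,a), V_hat_{h+1}>
        V_hat = (pi[h] * Q_hat).sum(axis=1)   # V_hat_h(s) = sum_a pi_h(a|s) Q_hat_h(s,a)

    v_hat = float(mu @ V_hat)                 # plug-in estimate of v^pi
    d_m_hat = n_sa.min() / len(data)          # empirical analogue of d_m in Assumption 2
    return v_hat, d_m_hat
```

The backward pass costs $O(HS^2A)$ time, which is the sense in which the plug-in estimator is computationally efficient; the guarantees below operate on exactly this $(\hat{P},\hat{r})$ pair.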
Intuitively, the performance of offline reinforcement learning should depend on $d_m$, since state-action pairs with fewer visits introduce more uncertainty. Notice that $d_m \in (0, \tfrac{1}{SA}]$. Assuming access to a generative model, we can query an equal number of samples from each state-action pair, in which case $d_m = (SA)^{-1}$. For offline data sampled with a fixed behavior policy $\pi_{\mathrm{BEH}}$, we can take $d_m$ to be $\frac{1}{H}\min_{s,a}\sum_{h\in[H]}\nu^{\pi_{\mathrm{BEH}}}_h(s,a)$, which measures the quality of exploration of $\pi_{\mathrm{BEH}}$. When the number of episodes $K = \widetilde{\Omega}(1/d_m)$, by standard concentration, we know $\min_{s,a} n(s,a) \gtrsim n\,d_m = HKd_m$.

We remark that some recent works (Liu et al. [41], Yin et al. [36]) define the data coverage via coverage of the visitation of the optimal policy when considering offline policy optimization. This kind of data coverage can be subsumed by our Assumption 2, by considering the sub-MDP that only consists of the state-action pairs that can be visited by the optimal policy. Since the optimal policy never leaves this sub-MDP, any near-optimal policy on this sub-MDP is near-optimal on the original MDP, and hence our results can be directly applied under this alternative definition of data coverage.

With these assumptions at hand, we can present our main results.

**Theorem 1.** Under Assumption 1 and Assumption 2, suppose $K = \widetilde{\Omega}(1/d_m)$. Then
$$|\hat{v}^{\pi} - v^{\pi}| \;\lesssim\; \sqrt{\frac{\iota}{K d_m}}$$
holds with probability at least $1-\delta$, where $K$ is the number of episodes, $d_m$ is the minimum visiting probability and $\iota$ absorbs the polylog factors.

To demonstrate the tightness of our upper bound, we provide a minimax lower bound in Theorem 2.³

**Theorem 2.** There exists a pair of MDPs $M_1$ and $M_2$, and offline data $D$ with $|D| = KH$ and minimum visiting probability $d_m$, such that for some absolute constant $c_0$,
$$\inf_{\hat{v}^{\pi}}\;\sup_{M_i\in\{M_1, M_2\}}\; \mathbb{P}_{M_i}\!\left( |\hat{v}^{\pi}(D) - v^{\pi}| > c_0\sqrt{\tfrac{1}{K d_m}} \right)$$
is bounded below by a positive absolute constant, where $\hat{v}^{\pi}$ is any estimator that takes $D$ as input.

**Remark.** Theorems 1 and 2 show that, even with the simplest plug-in estimator, we can match the minimax lower bound for offline policy evaluation in time-homogeneous MDPs up to logarithmic factors. As a result, we can conclude that the time-homogeneous MDP is not harder than bandits for offline policy evaluation.

³ To the best of our knowledge, no lower bound has been provided for the finite-horizon time-homogeneous setting, so we provide a minimax lower bound here; the proof can be found in Appendix B.

**Remark.** The assumption that $K = \widetilde{\Omega}(1/d_m)$ is a mild and necessary assumption: with only $o(1/d_m)$ episodes, there can exist some under-explored state-action pair, which unavoidably leads to constant error (this is also how we construct the hard instance for the minimax lower bound).

**Remark.** Central to our analysis is a recursion-based upper bound on the total variance term (see Lemma 1), which enables a sharper bound for the plug-in estimator compared with previous work [e.g., 19, 24] that includes unnecessary $\mathrm{poly}(S, A)$ factors in the higher-order term.

### 4.3 Proof Sketch

The detailed proof can be found in Appendix A; here we sketch the argument briefly.
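To preview where the $\sqrt{1/(Kd_m)}$ rate comes from, we spell out the aggregation step that recurs throughout the sketch; this is a simplified rendering with loose constants, not the precise argument of Appendix A. For each fixed $(s,a)$ and $h$, Bernstein's inequality applied to the $n(s,a)$ samples of $s'\sim P(\cdot\mid s,a)$, together with $V^{\pi}_{h+1}\in[0,1]$ (Assumption 1), gives with high probability
$$\big|\big\langle P(\cdot\mid s,a)-\hat{P}(\cdot\mid s,a),\,V^{\pi}_{h+1}\big\rangle\big| \;\lesssim\; \sqrt{\frac{\iota\,\mathrm{Var}_{P(\cdot\mid s,a)}\big[V^{\pi}_{h+1}(s')\big]}{n(s,a)}} \;+\; \frac{\iota}{n(s,a)}.$$
Weighting by $\nu^{\pi}_h(s,a)$, using $n(s,a)\ge nd_m = HKd_m$ (Assumption 2) and $\sum_{h,s,a}\nu^{\pi}_h(s,a)=H$, the Cauchy–Schwarz inequality yields
$$\sum_{h,s,a}\nu^{\pi}_h(s,a)\sqrt{\frac{\iota\,\mathrm{Var}_{P(\cdot\mid s,a)}\big[V^{\pi}_{h+1}(s')\big]}{n(s,a)}} \;\le\; \sqrt{\frac{\iota}{Kd_m}\sum_{h,s,a}\nu^{\pi}_h(s,a)\,\mathrm{Var}_{P(\cdot\mid s,a)}\big[V^{\pi}_{h+1}(s')\big]},$$
while the lower-order terms sum to at most $\iota/(Kd_m)$. The whole difficulty therefore lies in controlling the total variance term on the right.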
With the value difference lemma (see Lemma 4 in Appendix A), we need to focus on bounding the term
$$\sum_{h\in[H]}\sum_{s,a}\nu^{\pi}_h(s,a)\,\big\langle P(\cdot\mid s,a) - \hat{P}(\cdot\mid s,a),\, V^{\pi}_{h+1}\big\rangle,$$
which, by Bernstein's inequality and the Cauchy–Schwarz inequality together with Assumption 1 and Assumption 2 (the aggregation displayed above), can be upper bounded with high probability by
$$\sqrt{\frac{\iota}{K d_m}\sum_{h,s,a}\nu^{\pi}_h(s,a)\,\mathrm{Var}_{P(\cdot\mid s,a)}\big[V^{\pi}_{h+1}(s')\big]} \;+\; \frac{\iota}{K d_m}.$$
To bound the total variance term $\sum_{h,s,a}\nu^{\pi}_h(s,a)\,\mathrm{Var}_{P(\cdot\mid s,a)}\big[V^{\pi}_{h+1}(s')\big]$, we use a novel recursion-based method based on the following observation:

**Lemma 1 (Variance Recursion for Evaluation).** For the value function $V^{\pi}_h(s)$ induced by any policy $\pi$ and any $i\ge 0$,
$$\sum_{h,s,a}\nu^{\pi}_h(s,a)\,\mathrm{Var}_{P(\cdot\mid s,a)}\!\big[V^{\pi}_{h+1}(s')^{2^{i}}\big] \;\le\; \underbrace{\sum_{h,s,a}\nu^{\pi}_h(s,a)\,\big|\big\langle P(\cdot\mid s,a) - \hat{P}(\cdot\mid s,a),\, V^{\pi}_{h+1}(\cdot)^{2^{i+1}}\big\rangle\big|}_{(1)} \;+\; (\text{additional terms; see Appendix A}).$$

Here the term in (1) can be bounded in terms of $\sum_{h,s,a}\nu^{\pi}_h(s,a)\,\mathrm{Var}_{P(\cdot\mid s,a)}\big[V^{\pi}_{h+1}(s')^{2^{i+1}}\big]$, the total variance of the higher-order value function, via Bernstein's inequality. Thus Lemma 1 can be applied iteratively to obtain a tight bound for the total variance term. Define
$$\mathcal{E}_1(i) := \sum_{h,s,a}\nu^{\pi}_h(s,a)\,\big|\big\langle P(\cdot\mid s,a) - \hat{P}(\cdot\mid s,a),\, V^{\pi}_{h+1}(\cdot)^{2^{i}}\big\rangle\big|.$$
Applying Lemma 1 with $V^{\pi}_h(s)$, we can write the following recursion:
$$\mathcal{E}_1(i) \;\lesssim\; \sqrt{\frac{\iota\big(\mathcal{E}_1(i+1) + 2^{i+1}\big)}{K d_m}} \;+\; \frac{\iota}{K d_m}.$$
Solving this recursion, we obtain the result in Theorem 1.

## 5 Offline Policy Optimization

In this section, we further consider offline policy optimization for the tabular MDP, which is the ultimate goal of offline reinforcement learning. We first introduce the model-based planning algorithm, which is probably the simplest algorithm for offline policy optimization. Then we analyze the performance gap between the policy obtained by model-based planning and the optimal policy.

### 5.1 Model-Based Planning

We consider the optimal policy on the empirical MDP $\widehat{M}$, which can be defined as
$$\hat\pi = \arg\max_{\pi}\, \hat{v}^{\pi}. \qquad (2)$$
Here $\hat\pi$ can be obtained by dynamic programming on the empirical MDP, which is also known as model-based planning (a short illustrative sketch is given before Section 5.3). We remark that our analysis is independent of the algorithm used for solving (2). In other words, the result also applies to optimization-based planning with the empirical MDP [42, 43], as long as it solves (2).

### 5.2 Theoretical Guarantee

Theorem 3 provides an upper bound on the sub-optimality of $\hat\pi$.

**Theorem 3.** Under Assumption 1 and Assumption 2, suppose $K = \widetilde{\Omega}(1/d_m)$. Then
$$v^{*} - v^{\hat\pi} \;\lesssim\; \sqrt{\frac{\iota}{K d_m}} + \frac{S\,\iota}{K d_m}$$
holds with probability at least $1-\delta$, where $K$ is the number of episodes, $d_m$ is the minimum visiting probability and $\iota$ absorbs the polylog factors.

We also provide a minimax lower bound under the finite-horizon time-homogeneous setting.

**Theorem 4.** There exists a pair of MDPs $M_1$ and $M_2$, and offline data $D$ with $|D| = KH$ and minimum state-action pair visiting frequency $d_m$, such that for some absolute constant $c_0$,
$$\inf_{\hat\pi}\;\sup_{M_i\in\{M_1, M_2\}}\; \mathbb{P}_{M_i}\!\left( v^{*} - v^{\hat\pi(D)} > c_0\sqrt{\tfrac{1}{K d_m}} \right)$$
is bounded below by a positive absolute constant, where $\hat\pi$ is any planning algorithm that takes $D$ as input.

**Remark.** Theorem 3 provides a bound approaching the minimax lower bound in Theorem 4 up to logarithmic factors and a higher-order term, which shows that the error of offline policy optimization does not scale polynomially with the horizon. Notice that if $d_m = \tfrac{1}{SA}$, we obtain an error bound of $\widetilde{O}\big(\sqrt{\tfrac{SA}{K}} + \tfrac{S^2A}{K}\big)$, which translates to a sample complexity of $\widetilde{O}\big(\tfrac{SA}{\epsilon^2} + \tfrac{S^2A}{\epsilon}\big)$ that matches the best known sample complexity for the online finite-horizon time-homogeneous setting [27]. We conjecture that the additional $S$ factor in the higher-order term is only an artifact of our analysis (see Lemma 2) and can be eliminated with a more delicate analysis. We leave this as an open problem.

**Remark.** There are also works considering local policy optimization [e.g., 44–46] when the offline data are not sufficiently exploratory. We want to emphasize that, as Theorem 4 suggests, to identify the globally optimal policy, we do need the offline data to be sufficiently exploratory.
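Before the proof sketch, here is a minimal sketch of the planning step (2) on the empirical MDP, i.e., finite-horizon value iteration on $(\hat{P}, \hat{r})$. As remarked above, any solver of (2) works; the function below (whose name and greedy tie-breaking are illustrative choices, not the paper's) is simply the dynamic-programming one.

```python
import numpy as np

def model_based_planning(P_hat, r_hat, mu, H):
    """Model-based planning (Section 5.1): solve (2) on the empirical MDP.

    P_hat: empirical transition, shape (S, A, S); r_hat: empirical reward, shape (S, A),
    both built as in Section 4.1. Returns a deterministic non-stationary policy
    pi_hat[h, s] and its empirical value mu @ V_hat = max_pi v_hat^pi.
    """
    S, A, _ = P_hat.shape
    V_hat = np.zeros(S)                   # V_hat_{H+1} = 0
    pi_hat = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q_hat = r_hat + P_hat @ V_hat     # Q_hat_h(s, a) on the empirical MDP
        pi_hat[h] = Q_hat.argmax(axis=1)  # greedy action at level h
        V_hat = Q_hat.max(axis=1)         # optimal backup
    return pi_hat, float(mu @ V_hat)
```

Theorem 3 above bounds the suboptimality $v^{*} - v^{\hat\pi}$ incurred when the returned $\hat\pi$ is executed on the true MDP $M$.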
### 5.3 Proof Sketch

The detailed proof can be found in Appendix A; here we sketch the argument briefly. Notice that
$$v^{*} - v^{\hat\pi} = \big(v^{*} - \hat{v}^{\pi^{*}}\big) + \big(\hat{v}^{\pi^{*}} - \hat{v}^{\hat\pi}\big) + \big(\hat{v}^{\hat\pi} - v^{\hat\pi}\big) \;\le\; \underbrace{v^{*} - \hat{v}^{\pi^{*}}}_{\text{Error on Fixed Policy}} + \underbrace{\hat{v}^{\hat\pi} - v^{\hat\pi}}_{\text{Error on Data-Dependent Policy}}, \qquad (3)$$
since $\hat{v}^{\pi^{*}} \le \hat{v}^{\hat\pi}$ by the optimality of $\hat\pi$ on $\widehat{M}$. We can directly apply Theorem 1 to bound the error on the fixed policy. For the error on the data-dependent policy, since the policy $\hat\pi$ depends on the data, we need to consider
$$\sum_{h,s,a}\nu^{\hat\pi}_h(s,a)\,\big\langle P(\cdot\mid s,a) - \hat{P}(\cdot\mid s,a),\, V^{*}_{h+1}\big\rangle \qquad (4)$$
$$\sum_{h,s,a}\nu^{\hat\pi}_h(s,a)\,\big\langle P(\cdot\mid s,a) - \hat{P}(\cdot\mid s,a),\, \hat{V}^{\hat\pi}_{h+1} - V^{*}_{h+1}\big\rangle \qquad (5)$$
As $V^{*}$ is independent of $\hat{P}$, the term in (4) can be handled with the techniques used in offline policy evaluation. However, due to the dependency between $\hat\pi$ and $\hat{P}$, we need to deal with the term in (5) more carefully. To decouple the dependency of $\hat\pi$ and $\hat{P}$, we first introduce the following lemma:

**Lemma 2.** For any $V_h(s)\in[0,1]$, $\forall h\in[H], s\in\mathcal{S}$, we have with high probability
$$\big|\big\langle P(\cdot\mid s,a) - \hat{P}(\cdot\mid s,a),\, V_h\big\rangle\big| \;\lesssim\; \sqrt{\frac{S\,\iota\,\mathrm{Var}_{P(\cdot\mid s,a)}[V_h(s')]}{n(s,a)}} + \frac{S\,\iota}{n(s,a)}.$$

**Remark.** Lemma 2 has been widely used in the design and analysis of online reinforcement learning algorithms [e.g., 27]. It holds even when $V_h(s)$ depends on $\hat{P}(\cdot\mid s,a)$, at the cost of an additional $S$ factor. This is the source of the additional $S$ factor in the higher-order term, and we believe a more fine-grained analysis can help avoid this additional $S$ factor.

We also have the following recursion for (4) and (5):

**Lemma 3 (Variance Recursion for Optimization).** For $V_h(s) = V^{*}_h(s)$ and for $V_h(s) = \hat{V}^{\hat\pi}_h(s)$, we have
$$\sum_{h,s,a}\nu^{\hat\pi}_h(s,a)\,\mathrm{Var}_{P(\cdot\mid s,a)}\!\big[V_{h+1}(s')^{2^{i}}\big] \;\le\; \sum_{h,s,a}\nu^{\hat\pi}_h(s,a)\,\big|\big\langle P(\cdot\mid s,a) - \hat{P}(\cdot\mid s,a),\, V_{h+1}(\cdot)^{2^{i+1}}\big\rangle\big| + \sum_s \mu(s)V_1(s) + (\text{additional }(P - \hat{P})\text{-type terms; see Appendix A}).$$

Applying Bernstein's inequality or Lemma 2, and then Lemma 3, we obtain recursions of the form
$$\mathcal{E}_2(i) \;\lesssim\; \sqrt{\frac{\iota\big(\mathcal{E}_2(i+1) + 2^{i+1}(v^{\hat\pi} + \mathcal{E}_2(0))\big)}{K d_m}} + \frac{\iota}{K d_m}, \qquad \mathcal{E}_3(i) \;\lesssim\; \sqrt{\frac{S\,\iota\big(\mathcal{E}_3(i+1) + 2^{i+1}(v^{\hat\pi} - \hat{v}^{\hat\pi} + \mathcal{E}_3(0))\big)}{K d_m}} + \frac{S\,\iota}{K d_m},$$
where $\mathcal{E}_2(i)$ and $\mathcal{E}_3(i)$ are defined analogously to $\mathcal{E}_1(i)$ for the terms in (4) and (5). With Assumption 1, we know $v^{\hat\pi}\le 1$. Also, with (3)–(5), we have $v^{\hat\pi} - \hat{v}^{\hat\pi} \lesssim \mathcal{E}_2(0) + \mathcal{E}_3(0)$, which closes the recursion. Solving it, we obtain the result in Theorem 3.

## 6 Extensions to Linear MDP with Anchor Points

In this section, we first introduce the definition of the linear MDP with anchor points [47, 1], and then generalize our results on offline policy evaluation and optimization to this setting.

**Definition 1 (Linear MDP with Anchor Points [47, 1]).** For the MDP $M = (\mathcal{S}, \mathcal{A}, R, P, \mu)$, assume there is a feature map $\phi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^d$ such that $r$ and $P$ admit a linear representation:
$$r(s,a) = \langle\phi(s,a), \theta_r\rangle, \qquad P(s'\mid s,a) = \langle\phi(s,a), \mu(s')\rangle,$$
where $\theta_r\in\mathbb{R}^d$ and $\mu$ is an unknown (signed) measure over $\mathcal{S}$. Furthermore, we assume there exists a set of anchor state-action pairs $\mathcal{K}$ such that for every $(s,a)\in\mathcal{S}\times\mathcal{A}$ there exist coefficients $\{\lambda^{s,a}_k\}_{k\in\mathcal{K}}$ with
$$\phi(s,a) = \sum_{k\in\mathcal{K}}\lambda^{s,a}_k\,\phi(s_k, a_k), \qquad \sum_{k\in\mathcal{K}}\lambda^{s,a}_k = 1, \qquad \lambda^{s,a}_k\ge 0,\ \forall k\in\mathcal{K}.$$

With the definition of the linear MDP and the anchor point assumption, we can find that
$$r(s,a) = \sum_{k\in\mathcal{K}}\lambda^{s,a}_k\, r(s_k, a_k), \qquad P(s'\mid s,a) = \sum_{k\in\mathcal{K}}\lambda^{s,a}_k\, P(s'\mid s_k, a_k),$$
which leads to an empirical estimate of $\hat{P}$ and $\hat{r}$ by replacing $\{r(s_k,a_k)\}_{k\in\mathcal{K}}$ and $\{P(s'\mid s_k,a_k)\}_{k\in\mathcal{K}}$ with their empirical counterparts $\{\hat{r}(s_k,a_k)\}_{k\in\mathcal{K}}$ and $\{\hat{P}(s'\mid s_k,a_k)\}_{k\in\mathcal{K}}$ estimated from the offline data. Following [47, 1], we additionally make the following assumption on the offline data:

**Assumption 3 (Anchor Point Data [47, 1]).** Assume $|\mathcal{K}| = d$. For the offline data $D = \{(s_i, a_i, r_i, s'_i)\}_{i\in[n]}$, we have $(s_i, a_i)\in\{(s_k, a_k)\}_{k\in\mathcal{K}}$ for all $i\in[n]$. Furthermore, $n(s_k, a_k)\ge n\,d_m$ for all $k\in\mathcal{K}$.
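To illustrate how the model is built under Assumption 3, here is a minimal sketch: the transition and reward are estimated only at the $d$ anchor pairs and then extrapolated to every $(s,a)$ through the convex weights $\lambda^{s,a}_k$ of Definition 1. The function name, the dictionary layout, the assumption that the weights are given, and the dense integer-indexed state space are illustrative choices, not part of the paper.

```python
import numpy as np

def anchor_point_model(data_by_anchor, lam, S):
    """Empirical model for a linear MDP with anchor points (Section 6), a sketch.

    data_by_anchor: dict mapping anchor index k to a list of (r, s_next) samples
                    collected at the anchor pair (s_k, a_k), as in Assumption 3.
    lam:            array (S, A, d) with the convex weights lambda_k^{s,a} from
                    Definition 1 (assumed known here).
    Returns P_hat of shape (S, A, S) and r_hat of shape (S, A) for all pairs.
    """
    d = lam.shape[2]                      # |K| = d anchor pairs under Assumption 3
    P_anchor = np.zeros((d, S))
    r_anchor = np.zeros(d)
    for k, samples in data_by_anchor.items():
        n_k = len(samples)                # Assumption 3: n(s_k, a_k) >= n * d_m
        for r, s_next in samples:
            P_anchor[k, s_next] += 1.0 / n_k
            r_anchor[k] += r / n_k

    # Extrapolate to every (s, a) through the anchor representation:
    #   P_hat(s'|s,a) = sum_k lambda_k^{s,a} P_hat(s'|s_k,a_k), and likewise for r_hat.
    P_hat = np.einsum('sak,kt->sat', lam, P_anchor)
    r_hat = lam @ r_anchor                # shape (S, A)
    return P_hat, r_hat
```

The returned $(\hat{P}, \hat{r})$ can then be fed to the same evaluation and planning routines used in the tabular case.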
We now present the main theorems on offline policy evaluation and offline policy optimization for the linear MDP with anchor points.

**Theorem 5.** Under Assumption 1 and Assumption 3, suppose $n = \widetilde{\Omega}(H/d_m)$. Then for a given policy $\pi$, the plug-in estimator $\hat{v}^{\pi}$ satisfies
$$|\hat{v}^{\pi} - v^{\pi}| \;\lesssim\; \sqrt{\frac{H\,\iota}{n\,d_m}}$$
with probability at least $1-\delta$, where $n$ is the number of offline data points, $d_m$ is the minimum visiting probability of anchor points and $\iota$ absorbs the polylog factors.

**Theorem 6.** Under Assumption 1 and Assumption 3, suppose $n = \widetilde{\Omega}(H/d_m)$. Then for the $\hat\pi$ obtained by model-based planning,
$$v^{*} - v^{\hat\pi} \;\lesssim\; \sqrt{\frac{H\,\iota}{n\,d_m}} + \frac{d\,H\,\iota}{n\,d_m}$$
holds with probability at least $1-\delta$, where $n$ is the number of offline data points, $d$ is the feature dimension, $d_m$ is the minimum visiting probability of anchor points and $\iota$ absorbs the polylog factors.

Proofs of both theorems can be found in Appendix C. From a high-level perspective, we observe that for the unseen state-action pairs we still have a Bernstein-type concentration bound and a lemma similar to Lemma 2. Hence we can apply the recursion introduced in Lemma 1 and use techniques similar to the tabular case to obtain the desired results. We want to remark that such results demonstrate the broad applicability of our recursion-based analysis in different kinds of offline scenarios.

**Remark.** Compared with the results in [1], we remove the additional dependency on $\min\{H, |\mathcal{S}|, d\}$ in the main term and approach the optimal complexity shown in [47] up to logarithmic factors, which shows that model-based planning is minimax optimal for linear MDPs with anchor points and demonstrates the effectiveness of our recursion-based analysis.

**Remark.** Here we do not directly replace the term $n/H$ with $K$, as we make the relatively strong assumption that all of the offline data are sampled from the anchor points. One potential question is how general the anchor point assumption is. We notice that, as [30] showed, the popular FQI algorithm for linear MDPs provides an estimate of $\hat{P}$ as
$$\hat{P}(s'\mid s,a) = \phi(s,a)^{\top}\hat{\Sigma}^{-1}\sum_{i\in[n]}\phi(s_i,a_i)\,\mathbb{1}_{s'=s'_i}, \qquad \text{where } \hat{\Sigma} = \sum_{i\in[n]}\phi(s_i,a_i)\phi(s_i,a_i)^{\top}.$$
As $\sum_{s'}P(s'\mid s,a) = 1$, by Definition 1 we have $\sum_{s'}P(s'\mid s,a) = \phi(s,a)^{\top}\sum_{s'}\mu(s')$, while $\sum_{s'}\hat{P}(s'\mid s,a) = \phi(s,a)^{\top}\hat{\Sigma}^{-1}\sum_{i\in[n]}\phi(s_i,a_i)$. Thus, if $\phi(s,a)^{\top}\hat{\Sigma}^{-1}\phi(s_i,a_i)\ge 0$ for all $(s,a)$ and $i$, then this forms a linear MDP with anchor points with $\mathcal{K} = \{(s_i,a_i)\}_{i\in[n]}$ and $\lambda^{s,a}_k = \phi(s,a)^{\top}\hat{\Sigma}^{-1}\phi(s_k,a_k)$, and our analysis can be adapted to this case with $n\,d_m$ replaced by $\lambda_{\min}(\hat{\Sigma})$ using Freedman's inequality [48], where $\lambda_{\min}(\hat{\Sigma})$ denotes the minimum eigenvalue of $\hat{\Sigma}$. We notice that if $\phi(s,a)^{\top}\hat{\Sigma}^{-1}\phi(s_i,a_i) < 0$ and the $s'_i$ are distinct for each $i$ (which is probable for exponentially large $|\mathcal{S}|$), then $\hat{P}(s'_i\mid s,a) < 0$, which can be pathological for the algorithm and the analysis. We expect that a well-behaved transition estimator (i.e., $\hat{P}$) for linear MDPs shares similar results even without the anchor point assumption. We leave this as an open problem.

## 7 Conclusion

In this paper, we revisit offline reinforcement learning on episodic time-homogeneous MDPs. We show that, if the total reward is properly normalized, offline reinforcement learning is not harder than its offline bandit counterpart. Specifically, we provide performance guarantees for algorithms based on empirical MDPs that match the lower bound up to logarithmic factors for offline policy evaluation, and up to logarithmic factors and a higher-order term for offline policy optimization, and neither has polynomial dependency on $H$. There are still several open problems. For example, can we provide a sharper analysis for the policy obtained by model-based planning without any additional factors in the higher-order term? Can we extend to MDPs with more general assumptions? We leave these problems as future work.
## Acknowledgement

The authors thank the anonymous reviewers for their constructive feedback. TR would like to thank Ming Yin and Yu Bai for helpful discussions. SS gratefully acknowledges funding from NSF grants 1564000 and 1934932, and SSD gratefully acknowledges funding from NSF Awards IIS-2110170 and DMS-2134106.

## References

[1] Qiwen Cui and Lin F Yang. Is plug-in solver sample-efficient for feature-based reinforcement learning? arXiv preprint arXiv:2010.05673, 2020.

[2] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[3] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[4] OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020.

[5] Hanjun Dai, Elias B Khalil, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. arXiv preprint arXiv:1704.01665, 2017.

[6] Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy evaluation across representations with applications to educational games. In AAMAS, pages 1077–1084, 2014.

[7] Susan A Murphy, Mark J van der Laan, James M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001.

[8] Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25(1):16–18, 2019.

[9] Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, and Rosalind Picard. Approximating interactive human evaluation with self-play for open-domain dialog systems. arXiv preprint arXiv:1906.09308, 2019.

[10] Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H Chi. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 456–464, 2019.

[11] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

[12] Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In Artificial Intelligence and Statistics, pages 608–616. PMLR, 2015.

[13] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661. PMLR, 2016.

[14] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. arXiv preprint arXiv:1810.12429, 2018.

[15] Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. arXiv preprint arXiv:1906.03393, 2019.

[16] Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pages 2318–2328, 2019.
[17] Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and Q-function learning for off-policy evaluation. In International Conference on Machine Learning, pages 9659–9668. PMLR, 2020.

[18] Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. Off-policy evaluation via the regularized Lagrangian. arXiv preprint arXiv:2007.03438, 2020.

[19] Ming Yin and Yu-Xiang Wang. Asymptotically efficient off-policy evaluation for tabular reinforcement learning. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 3948–3958. PMLR, 26–28 Aug 2020.

[20] Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation. In International Conference on Machine Learning, pages 1372–1383. PMLR, 2017.

[21] Carles Gelada and Marc G Bellemare. Off-policy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3647–3655, 2019.

[22] Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072, 2020.

[23] Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.

[24] Ming Yin, Yu Bai, and Yu-Xiang Wang. Near optimal provable uniform convergence in off-policy evaluation for reinforcement learning. arXiv preprint arXiv:2007.03760, 2020.

[25] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with stationary distribution correction. In Uncertainty in Artificial Intelligence, pages 1180–1190. PMLR, 2020.

[26] Ruosong Wang, Simon S Du, Lin F Yang, and Sham M Kakade. Is long horizon reinforcement learning more difficult than short horizon reinforcement learning? arXiv preprint arXiv:2005.00527, 2020.

[27] Zihan Zhang, Xiangyang Ji, and Simon S Du. Is reinforcement learning more difficult than bandits? A near-optimal algorithm escaping the curse of horizon. arXiv preprint arXiv:2009.13503, 2020.

[28] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. arXiv preprint arXiv:2005.12900, 2020.

[29] Ashwin Pananjady and Martin J Wainwright. Instance-dependent ℓ∞-bounds for policy evaluation in tabular reinforcement learning. IEEE Transactions on Information Theory, 67(1):566–585, 2020.

[30] Yaqi Duan and Mengdi Wang. Minimax-optimal off-policy evaluation with linear function approximation. arXiv preprint arXiv:2002.09516, 2020.

[31] Nathan Kallus and Masatoshi Uehara. Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. arXiv preprint arXiv:1909.05850, 2019.

[32] Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research, 21(167):1–63, 2020.

[33] Yihao Feng, Tongzheng Ren, Ziyang Tang, and Qiang Liu. Accountable off-policy evaluation with kernel Bellman statistics. In International Conference on Machine Learning, pages 3102–3111. PMLR, 2020.
[34] Alekh Agarwal, Sham Kakade, and Lin F Yang. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67–83. PMLR, 2020.

[35] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.

[36] Ming Yin, Yu Bai, and Yu-Xiang Wang. Near-optimal offline reinforcement learning via double variance reduction, 2021.

[37] Nan Jiang and Alekh Agarwal. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, pages 3395–3398. PMLR, 2018.

[38] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

[39] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31, pages 4863–4873. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/d3b1fb02964aa64e257f9f26a31f72cf-Paper.pdf.

[40] Rémi Munos. Error bounds for approximate policy iteration. In ICML, volume 3, pages 560–567, 2003.

[41] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473, 2019.

[42] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

[43] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012. ISBN 026201825X.

[44] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning. Citeseer, 2002.

[45] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Provably good batch reinforcement learning without great exploration. arXiv preprint arXiv:2007.08202, 2020.

[46] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.

[47] Lin Yang and Mengdi Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004. PMLR, 2019.

[48] David A Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100–118, 1975.

[49] Tor Lattimore and Marcus Hutter. Near-optimal PAC bounds for discounted MDPs. Theoretical Computer Science, 558:125–143, 2014.

[50] Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

[51] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5717–5727, 2017.

[52] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1042–1051. PMLR, 09–15 Jun 2019. URL http://proceedings.mlr.press/v97/chen19e.html.

[53] Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning: A theoretical comparison. In Jonas Peters and David Sontag, editors, Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 of Proceedings of Machine Learning Research, pages 550–559. PMLR, 03–06 Aug 2020. URL http://proceedings.mlr.press/v124/xie20a.html.
[54] Tengyang Xie and Nan Jiang. Batch value-function approximation with only realizability. arXiv preprint arXiv:2008.04990, 2020.