# Batch Value-function Approximation with Only Realizability

Tengyang Xie¹, Nan Jiang¹

¹Department of Computer Science, University of Illinois at Urbana-Champaign, Illinois, USA. Correspondence to: Nan Jiang.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

**Abstract.** We make progress in a long-standing problem of batch reinforcement learning (RL): learning $Q^\star$ from an exploratory and polynomial-sized dataset, using a realizable and otherwise arbitrary function class. In fact, all existing algorithms demand function-approximation assumptions stronger than realizability, and the mounting negative evidence has led to a conjecture that sample-efficient learning is impossible in this setting (Chen & Jiang, 2019). Our algorithm, BVFT, breaks the hardness conjecture (albeit under a stronger notion of exploratory data) via a tournament procedure that reduces the learning problem to pairwise comparison, and solves the latter with the help of a state-action-space partition constructed from the compared functions. We also discuss how BVFT can be applied to model selection, among other extensions and open problems.

## 1. Introduction

What is the minimal function-approximation assumption that enables polynomial sample complexity when we try to learn $Q^\star$ from an exploratory batch dataset? Existing algorithms and analyses, those that have largely laid the theoretical foundation of modern reinforcement learning, have always demanded assumptions that are substantially stronger than the most basic one: realizability, i.e., that $Q^\star$ (approximately) lies in the function class. These strong assumptions have recently compelled Chen & Jiang (2019) to conjecture an information-theoretic barrier: that polynomial learning is impossible in batch RL, even with exploratory data and realizable function approximation.

In this paper, we break this barrier with an algorithm called Batch Value-Function Tournament (BVFT). Via a tournament procedure, BVFT reduces the learning problem to that of identifying $Q^\star$ from a pair of candidate functions. In this subproblem, we create a piecewise constant function class of statistical complexity $O(1/\epsilon^2)$ that can express both candidate functions up to small discretization errors, and use the projected Bellman operator associated with the class to identify $Q^\star$. We present the algorithm in Section 4 and prove its sample complexity in Sections 5 and 6.

A limitation of our approach is the use of a relatively stringent version of the concentrability coefficient from Munos (2003) to measure the exploratoriness of the dataset (see Assumption 1). Section 7.2 investigates the difficulties in relaxing the assumption, and Appendix D discusses how to mitigate the pathological behavior of the algorithm when the assumption does not hold. As another limitation, BVFT enumerates over the function class and is computationally inefficient for training. That said, the algorithm is efficient when the function class has a polynomial cardinality, making it applicable to another problem in batch RL: model selection (Farahmand & Szepesvári, 2011).¹ In Section 7.1, we review the literature on this important problem and discuss how BVFT has significantly advanced the state of the art on the theoretical front.

Footnote 1: We use the phrase "model selection" as in the context of, e.g., cross validation; the word "model" does not refer to MDP dynamics but rather to value functions for our purposes.
## 2. Related Work

**Stronger function-approximation assumptions in existing theory.** The theory of batch RL has struggled for a long time to provide sample-efficiency guarantees when realizability is the only assumption imposed on the function class. An intuitive reason is that learning $Q^\star$ is roughly equivalent to minimizing the Bellman error, but the latter cannot be estimated from data (Jiang, 2019; Sutton & Barto, 2018, Chapter 11.6), leading to the infamous double-sampling difficulty (Baird, 1995; Antos et al., 2008). Stronger or additional assumptions have been proposed to circumvent the issue, including low inherent Bellman errors (Munos & Szepesvári, 2008; Antos et al., 2008), averager classes (Gordon, 1995), and additional function approximation of importance weights (Xie & Jiang, 2020).

**State abstractions.** State abstractions are the simplest form of function approximation. (They are also special cases of the aforementioned averagers.) In fact, certainty equivalence with a state abstraction that can express $Q^\star$, known as a $Q^\star$-irrelevant abstraction, is known to be consistent, i.e., $Q^\star$ will be correctly learned if each abstract state-action pair receives an infinite amount of data (Littman & Szepesvári, 1996; Li et al., 2006, Theorem 4). While this observation is an important inspiration for our algorithm, making it useful for an arbitrary and unstructured function class is highly nontrivial and is one of the main algorithmic contributions of this paper. Furthermore, our finite-sample analysis deviates significantly from the tabular-style proofs in the abstraction literature (Paduraru et al., 2008; Jiang, 2018), where $\ell_\infty$ concentration bounds are established assuming that each abstract state receives sufficient data (c.f. Footnote 8). In our analysis, the structure of the abstraction is arbitrary, and it is much more convenient to treat abstractions as piecewise constant classes over the original state space and use tools from statistical learning theory to establish concentration results under a weighted $\ell_2$ norm; see Section 5.2.3 for details.

**Tournament algorithms.** Our algorithm design also draws inspiration from existing tournament algorithms. Closest related is the Scheffé tournament for density estimation (Devroye & Lugosi, 2012), which minimizes the total-variation (TV) distance from the true density among the candidate models, and has been applied to RL by Sun et al. (2019). Interestingly, the main challenge in TV-distance minimization is very similar to ours at a high level: the TV distance of a single model cannot be estimated from data when the support of the distribution has a large or infinite cardinality. Similar to the Scheffé tournament, our algorithm compares pairs of candidate value functions, which is key to overcoming the fundamental unlearnability of Bellman errors. Tournament algorithms are also found in RL when the goal is to select the best state abstraction from a candidate set (Hallak et al., 2013; Jiang et al., 2015). These works will be discussed in Section 7.1 in the context of model selection.

**Lower bounds.** Wang et al. (2020); Amortila et al. (2020); Zanette (2020); Chen et al. (2021) have recently proved hardness results under $Q^\star$ realizability in batch RL.
These results do not contradict ours because they deploy a weaker data assumption; see Appendix A.2 for discussion. Rather, their negative results and our positive result are complementary, and together they provide a fine-grained characterization of the landscape of batch RL.

## 3. Preliminaries

### 3.1. Markov Decision Processes

Consider an infinite-horizon discounted Markov Decision Process $(\mathcal{S}, \mathcal{A}, P, R, \gamma, d_0)$, where $\mathcal{S}$ is the finite state space that can be arbitrarily large, $\mathcal{A}$ is the finite action space, $P: \mathcal{S}\times\mathcal{A} \to \Delta(\mathcal{S})$ is the transition function, $R: \mathcal{S}\times\mathcal{A} \to [0, R_{\max}]$ is the reward function, $\gamma \in [0,1)$ is the discount factor, and $d_0 \in \Delta(\mathcal{S})$ is the initial state distribution. A (deterministic and stationary) policy $\pi: \mathcal{S}\to\mathcal{A}$ induces a distribution over the infinite trajectory $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$, where $s_0 \sim d_0$, $a_0 = \pi(s_0)$, $r_0 = R(s_0, a_0)$, $s_1 \sim P(s_0, a_0)$, and so on. We use $\mathbb{E}[\,\cdot\,|\,\pi]$ to denote expectation with respect to such a distribution. The expected discounted return of a policy is defined as $J(\pi) := \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_t \,|\, \pi]$, and our goal is to optimize $J(\pi)$. Note that the random variable $\sum_{t=0}^{\infty} \gamma^t r_t$ is always bounded in the range $[0, V_{\max}]$, where $V_{\max} = R_{\max}/(1-\gamma)$.

In the discounted setting, there is a policy $\pi^\star: \mathcal{S}\to\mathcal{A}$ that simultaneously optimizes the expected return for all starting states. This policy can be obtained as the greedy policy of the $Q^\star$ function, i.e., $\pi^\star(s) = \pi_{Q^\star}(s) := \arg\max_{a\in\mathcal{A}} Q^\star(s,a)$, where we use $\pi_{(\cdot)}$ to denote a policy that greedily chooses actions according to a real-valued function over $\mathcal{S}\times\mathcal{A}$. The optimal Q-value function, $Q^\star$, can be uniquely defined through the Bellman optimality equation $Q^\star = \mathcal{T}Q^\star$, where $\mathcal{T}: \mathbb{R}^{\mathcal{S}\times\mathcal{A}} \to \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ is the optimality operator, defined as $(\mathcal{T}f)(s,a) := R(s,a) + \gamma\,\mathbb{E}_{s'\sim P(s,a)}[V_f(s')]$, with $V_f(s) := \max_{a} f(s,a)$.

### 3.2. Batch Data

We assume that the learner has access to a batch dataset $D$ consisting of i.i.d. $(s, a, r, s')$ tuples, where $(s,a)\sim\mu$, $r = R(s,a)$, and $s'\sim P(s,a)$. Such an i.i.d. assumption is standard for finite-sample analyses in the ADP literature (Munos & Szepesvári, 2008; Farahmand et al., 2010; Chen & Jiang, 2019), and can often be relaxed at the cost of significant technical burdens and complications (see, e.g., Antos et al., 2008). We will also use $\mu(s)$ and $\mu(a|s)$ to denote the marginal of $s$ and the conditional of $a$ given $s$.

To learn a near-optimal policy in batch RL, an exploratory dataset is necessary, and we measure the degree of exploration as follows:

**Assumption 1.** We assume that $\mu(s,a) > 0$ for all $s, a$. We further assume that
(1) there exists a constant $1 \le C_A < \infty$ such that for any $s\in\mathcal{S}$ and $a\in\mathcal{A}$, $\mu(a|s) \ge 1/C_A$; and
(2) there exists a constant $1 \le C_S < \infty$ such that for any $s\in\mathcal{S}$, $a\in\mathcal{A}$, $s'\in\mathcal{S}$, $P(s'|s,a)/\mu(s') \le C_S$, and also $d_0(s)/\mu(s) \le C_S$.
It will be convenient to define $C = C_S C_A$.
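When the MDP and the data distribution are small and fully known, Assumption 1 can be checked directly. The following is a minimal sketch under that premise; the array names and shapes (`P` of shape $|\mathcal{S}|\times|\mathcal{A}|\times|\mathcal{S}|$, `mu` of shape $|\mathcal{S}|\times|\mathcal{A}|$, `d0` of shape $|\mathcal{S}|$) are our own illustrative conventions and not part of the paper.

```python
import numpy as np

def concentrability_constants(P, mu, d0):
    """Smallest C_A and C_S satisfying Assumption 1 for a tabular MDP with
    transitions P[s, a, s'], data distribution mu[s, a], and initial dist d0[s]."""
    mu_s = mu.sum(axis=1)                       # state marginal mu(s)
    mu_a_given_s = mu / mu_s[:, None]           # conditional mu(a|s)
    C_A = 1.0 / mu_a_given_s.min()              # condition (1): mu(a|s) >= 1/C_A
    # condition (2): both P(s'|s,a)/mu(s') and d0(s)/mu(s) are bounded by C_S
    C_S = max((P / mu_s[None, None, :]).max(), (d0 / mu_s).max())
    return C_A, C_S

# toy example with 2 states and 2 actions
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
mu = np.array([[0.3, 0.2], [0.25, 0.25]])
d0 = np.array([0.5, 0.5])
C_A, C_S = concentrability_constants(P, mu, d0)
print(C_A, C_S, C_A * C_S)                      # C = C_S * C_A
```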
The first statement is very standard, asserting that the data distribution puts enough probability on all actions. For example, with a small number of actions, a uniformly random behavior policy ensures that $C_A = |\mathcal{A}|$ satisfies this assumption. The second statement measures the exploratoriness of $\mu$'s state marginal by $C_S$, and two comments are in order. First, this is a form of concentrability assumption, which not only enforces the data to be exploratory, but also implicitly imposes restrictions on the MDP's dynamics (see the reference to $P$ in Assumption 1). While the latter may be undesirable, Chen & Jiang (2019, Theorem 4) show that such a restriction is unavoidable when learning with a general function class. Second, the version of the concentrability coefficient we use was introduced by Munos (2003, Eq.(6)), and is more stringent than its more popular variants (e.g., Munos, 2007; Farahmand et al., 2010). That said, (1) hardness results exist under a weaker form of the assumption (see Appendix A.2), and (2) whenever the transition dynamics admit a low-rank stochastic factorization, there always exist data distributions that yield a small $C_S$ even though $|\mathcal{S}|$ can be arbitrarily large; see Appendix A.1, where we also discuss how Assumption 1 compares to no inherent Bellman errors in the context of low-rank MDPs. We investigate why it is difficult to work with more relaxed assumptions in Section 7.2, and discuss how to mitigate the negative consequences when the assumption is violated in Appendix D.

A direct consequence of Assumption 1, which we will use later to control error propagation and distribution shift, is the following proposition.

**Proposition 1.** Let $\nu$ be a distribution over $\mathcal{S}\times\mathcal{A}$ and $\pi$ be a policy. Let $\nu' = P(\nu)\times\pi$ denote the distribution specified by the generative process $(s', a') \sim \nu'$ $\Leftrightarrow$ $(s,a)\sim\nu$, $s'\sim P(\cdot\,|s,a)$, $a' = \pi(s')$. Under Assumption 1, we have $\|\nu'/\mu\|_\infty := \max_{s,a} \nu'(s,a)/\mu(s,a) \le C$. Also note that $\|(d_0\times\pi)/\mu\|_\infty \le C$.

**Additional Notations.** For any real-valued function of $(s,a,r,s')$, we use $\mathbb{E}_\mu[\cdot]$ as shorthand for the expectation of the function when $(s,a)\sim\mu$, $r = R(s,a)$, $s'\sim P(s,a)$. Also, for any $f: \mathcal{S}\times\mathcal{A}\to\mathbb{R}$, define $\|f\|_{2,\mu}^2 := \mathbb{E}_\mu[f^2]$; $\|\cdot\|_{2,\mu}$ is a weighted $\ell_2$ norm and satisfies the triangle inequality. We also use $\|f\|_{2,D}^2$ to denote the empirical approximation of $\|f\|_{2,\mu}^2$ based on the dataset $D$.

### 3.3. Value-function Approximation

Since the state space $\mathcal{S}$ can be prohibitively large, function approximation is necessary for scaling RL to large and complex problems. In the value-function approximation setting, we are given a function class $\mathcal{F} \subset (\mathcal{S}\times\mathcal{A}\to[0, V_{\max}])$ to model $Q^\star$. Unlike prior works that measure the approximation error of $\mathcal{F}$ using inherent Bellman errors (Munos, 2007; Antos et al., 2008), which amounts to assuming that $\mathcal{F}$ is (approximately) closed under $\mathcal{T}$, we will measure the error using Definition 1, where zero error only implies realizability, $Q^\star\in\mathcal{F}$. In fact, given that the assumptions required by all existing algorithms are substantially stronger than realizability, Chen & Jiang (2019, Conjecture 8) conjecture that polynomial sample complexity is unattainable in batch RL when we only impose realizability on $\mathcal{F}$, which is why our result may be surprising.

**Definition 1.** Let $\epsilon_{\mathcal{F}} := \inf_{f\in\mathcal{F}} \|f - Q^\star\|_\infty$.² Let $f^\star$ denote the $f$ that attains the infimum. We assume $\mathcal{F}$ is finite but possibly exponentially large (as in Chen & Jiang (2019)), i.e., we can only afford $\mathrm{poly}\log|\mathcal{F}|$ in the sample complexity. For continuous function classes that admit a finite $\ell_\infty$ covering number (Agarwal, 2011), our approach and analysis immediately extend by replacing $\mathcal{F}$ with its $\epsilon$-net at the cost of slightly increasing $\epsilon_{\mathcal{F}}$.

Footnote 2: It is possible to define $\epsilon_{\mathcal{F}}$ under a weighted $\ell_2$ norm, though making this change in our current proof yields a worse (albeit polynomial) sample complexity; see Appendix E for more details.

**Algorithm 1** Batch Value-Function Tournament (BVFT)
1: **Input:** dataset $D$, function class $\mathcal{F}$, discretization parameter $\epsilon_{\mathrm{dct}}\in(0, V_{\max})$.
2: **for** $f\in\mathcal{F}$ **do**
3: $f \leftarrow$ discretize the output of $f$ with resolution $\epsilon_{\mathrm{dct}}$ (see Footnote 4).
4: **end for**
5: **for** $f\in\mathcal{F}$ **do**
6: **for** $f'\in\mathcal{F}$ **do**
7: Define $\phi$ s.t. $\phi(s,a) = \phi(\tilde s,\tilde a)$ iff $f(s,a) = f(\tilde s,\tilde a)$ and $f'(s,a) = f'(\tilde s,\tilde a)$.
8: $\mathcal{E}(f; f') \leftarrow \|f - \hat{\mathcal{T}}^\mu_\phi f\|_{2,D}$ (see Eq.(1) for the definition of $\hat{\mathcal{T}}^\mu_\phi$; the dependence on $f'$ is only through $\phi$).
9: **end for**
10: **end for**
11: $\hat f \leftarrow \arg\min_{f\in\mathcal{F}} \max_{f'\in\mathcal{F}} \mathcal{E}(f; f')$.
12: **Output:** $\hat\pi = \pi_{\hat f}$.
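To make the tournament concrete, the following is a minimal sketch of Algorithm 1 for a small tabular problem. It assumes the dataset is a list of `(s, a, r, s_next)` tuples and each candidate in `F` is a $|\mathcal{S}|\times|\mathcal{A}|$ numpy array; the function names and the simplified rounding grid are our own and not from the paper, and no attempt is made at efficiency or at matching the exact grid of Footnote 4.

```python
import numpy as np
from collections import defaultdict

def bvft(dataset, F, gamma, V_max, eps_dct):
    """Sketch of Algorithm 1 (BVFT). `dataset`: list of (s, a, r, s_next);
    `F`: list of candidate Q-functions, each a |S| x |A| numpy array."""
    # line 3: discretize each candidate (simplified grid, cf. Footnote 4)
    F_bar = [np.round(f / eps_dct) * eps_dct for f in F]

    def loss(i, j):
        """E(f_i; f_j), lines 7-8 for one ordered pair of candidates."""
        f, fp = F_bar[i], F_bar[j]
        # line 7: phi groups (s, a) by the joint discretized values of f and f'
        phi = lambda s, a: (f[s, a], fp[s, a])
        # projected backup onto the piecewise-constant class G_phi: because the
        # class is constant on each group, least squares reduces to group means
        # of the regression targets r + gamma * V_f(s')
        tot, cnt = defaultdict(float), defaultdict(int)
        for (s, a, r, s_next) in dataset:
            tot[phi(s, a)] += r + gamma * f[s_next].max()
            cnt[phi(s, a)] += 1
        backup = {g: tot[g] / cnt[g] for g in cnt}
        # line 8: empirical weighted L2 distance between f and its backup
        sq = [(f[s, a] - backup[phi(s, a)]) ** 2 for (s, a, _, _) in dataset]
        return np.sqrt(np.mean(sq))

    # line 11: keep the candidate surviving all pairwise comparisons
    scores = [max(loss(i, j) for j in range(len(F))) for i in range(len(F))]
    i_hat = int(np.argmin(scores))
    return F[i_hat], F[i_hat].argmax(axis=1)   # line 12: greedy policy of f_hat
```

The inner helper makes explicit that, because $\mathcal{G}_\phi$ is piecewise constant, the projected Bellman update of Eq.(1) reduces to a per-group average of the regression targets $r + \gamma V_f(s')$.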
### 3.4. Polynomial Learning

Our goal is to devise a statistically efficient algorithm with the following kind of guarantee: with high probability we learn an $\epsilon$-optimal policy $\hat\pi$, that is, $J(\pi^\star) - J(\hat\pi) \le \epsilon\, V_{\max}$, when $\mathcal{F}$ is realizable and the dataset $D$ is only polynomially large. The polynomial may depend on the effective horizon $1/(1-\gamma)$, the statistical complexity of the function class $\log|\mathcal{F}|$, the concentrability coefficient $C$, (the inverse of) the suboptimality gap $\epsilon$, and $1/\delta$, where $\delta$ is the failure probability. Our results also accommodate the more general setting in which $\mathcal{F}$ is not exactly realizable, in which case the suboptimality of $\hat\pi$ is allowed to contain an additional term proportional to the approximation error $\epsilon_{\mathcal{F}}$, up to a polynomial multiplicative factor.

## 4. Algorithm and the Guarantee

In this section, we introduce and provide intuitions for our algorithm, and state its sample complexity guarantee, which will be proved in the subsequent sections.

The design of our algorithm is based on an important observation inspired by the state-abstraction literature (see Section 2): when the function class $\mathcal{F}$ is piecewise constant and realizable, batch learning with exploratory data is consistent using, e.g., Fitted Q-Iteration. This is because piecewise constant classes are very stable: their associated projected Bellman operators are always $\gamma$-contractions under $\ell_\infty$, implying that $Q^\star$ is the only fixed point of such operators when statistical errors are ignored;³ we will actually establish these properties in Section 5.1. While realizability is the only expressivity condition assumed, being piecewise constant is a major structural assumption, and is too restrictive to accommodate practical function-approximation schemes such as linear predictors or neural networks, let alone the completely unstructured set of functions one would encounter in model selection (Section 7.1).

Footnote 3: $Q^\star$ is always a fixed point of the projected Bellman update operator associated with any realizable function class, but there is no uniqueness guarantee in general.

How can we make use of this observation? An immediate idea is improper learning, i.e., augmenting $\mathcal{F}$ (which is not piecewise constant in general and may have an arbitrary structure) to its smallest piecewise constant superset, which automatically inherits realizability from $\mathcal{F}$. To do so, we may first discretize the output of each function $f\in\mathcal{F}$ up to a small discretization error $\epsilon_{\mathrm{dct}}$,⁴ and partition $\mathcal{S}\times\mathcal{A}$ by grouping state-action pairs together only when the output of every $f\in\mathcal{F}$ (after discretization) is constant across them. The problem is that the resulting function class is way too large compared to $\mathcal{F}$: its statistical complexity, measured by the number of groups, can be as large as $(V_{\max}/\epsilon_{\mathrm{dct}})^{|\mathcal{F}|}$, doubly exponential in $\mathrm{poly}\log|\mathcal{F}|$, which is all we can afford!

Footnote 4: When $V_{\max}/\epsilon_{\mathrm{dct}}$ is an odd integer, discretization onto the regular grid $\{\epsilon_{\mathrm{dct}}, 3\epsilon_{\mathrm{dct}}, \ldots, V_{\max} - \epsilon_{\mathrm{dct}}\}$ guarantees at most $\epsilon_{\mathrm{dct}}$ approximation error, and the cardinality of the grid is $V_{\max}/(2\epsilon_{\mathrm{dct}})$. For arbitrary $\epsilon_{\mathrm{dct}}\in(0, V_{\max})$, a similar discretization yields a cardinality of $\lceil V_{\max}/(2\epsilon_{\mathrm{dct}})\rceil$, and we upper-bound it by $V_{\max}/\epsilon_{\mathrm{dct}}$ throughout the analysis for convenience.

To turn this idea into a polynomial algorithm, we note that the statistical complexity of the superset is affordable when $|\mathcal{F}|$ is constant, say $|\mathcal{F}| = 2$. This provides us with a procedure that identifies $Q^\star$ out of two candidate functions. To handle an exponentially large $\mathcal{F}$, we simply perform pairwise comparisons between all pairs $f, f'\in\mathcal{F}$, and output the function that has survived all pairwise comparisons involving it.
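To see the gap in scale concretely, suppose the discretization grid has $V_{\max}/\epsilon_{\mathrm{dct}} = 100$ levels and $|\mathcal{F}| = 2^{30}$; these numbers are chosen purely for illustration and are not from the paper. Then

$$\underbrace{(V_{\max}/\epsilon_{\mathrm{dct}})^{|\mathcal{F}|}}_{\text{joint partition over all of }\mathcal{F}} = 100^{2^{30}}, \qquad \underbrace{(V_{\max}/\epsilon_{\mathrm{dct}})^{2}}_{\text{partition for one pair }(f,f')} = 10^{4},$$

so each of the at most $|\mathcal{F}|^2$ pairwise partitions is of manageable size even though the joint partition is hopeless; the pairwise partition size enters the sample complexity polynomially (Proposition 5 and Lemma 11), while the number of pairs enters only through a union bound, i.e., logarithmically.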
Careful readers may wonder what happens when $Q^\star\notin\{f, f'\}$, as realizability is then obviously violated. As we will show in Section 6, the outcomes of these bad comparisons simply do not matter: $Q^\star$ is never involved in such comparisons, and any other function $f$ will always be checked against $f' = Q^\star$, which is enough to expose the deficiency of a bad $f$.

The above reasoning ignores approximation and estimation errors, which we handle in the actual algorithm and its analysis; see Algorithm 1. Below we state its sample complexity guarantee, which is the main theorem of this paper.

**Theorem 2.** Under Assumption 1, with probability at least $1-\delta$, BVFT (Algorithm 1) with $\epsilon_{\mathrm{dct}} = \frac{(1-\gamma)^2 \epsilon V_{\max}}{16\sqrt{C}}$ returns a policy $\hat\pi$ that satisfies
$$J(\pi^\star) - J(\hat\pi) \le \frac{(4 + 8\sqrt{C})\,\epsilon_{\mathcal{F}}}{(1-\gamma)^2} + \epsilon\, V_{\max},$$
with a sample complexity of⁵ $\tilde O\!\left(\frac{C^2 \ln\frac{|\mathcal{F}|}{\delta}}{(1-\gamma)^{8}\,\epsilon^{4}}\right)$ (obtained by plugging the above choice of $\epsilon_{\mathrm{dct}}$ into Eq.(5) in Section 6).

Footnote 5: $\tilde O(\cdot)$ suppresses poly-logarithmic dependencies.

The most outstanding characteristic of the sample complexity is the $1/\epsilon^4$ rate. In fact, the poor dependencies on $C$ and $1/(1-\gamma)$ are both due to the $1/\epsilon^4$: when we rewrite the guarantee in terms of the suboptimality gap as a function of $n = |D|$, we see an $O(\sqrt{C}\, n^{-1/4}/(1-\gamma)^2)$ estimation-error term, featuring the standard $\sqrt{C}$ penalty due to distribution shift and quadratic-in-horizon error propagation. The $1/\epsilon^4$ rate comes from two sources: one factor of $1/\epsilon^2$ is due to the worst-case statistical complexity of the piecewise constant classes created during pairwise comparisons; the other $1/\epsilon^2$ is the standard statistical rate. While standard, proving $O(1/\epsilon^2)$ concentration bounds in our analysis turns out to be technically challenging and requires some clever tricks. We refer mathematically inclined readers to Section 5.2.3 for how we overcome those challenges.

We prove Theorem 2 in the next two sections. Section 5 establishes the essential properties of the pairwise-comparison step in Line 8, where we view the problem at a somewhat abstract level to attain proof modularity. Section 6 uses the results of Section 5 to prove the final guarantee.

## 5. Value-function Validation using a Piecewise Constant Function Class

In this section we analyze a subproblem that is crucial to our algorithm: given a piecewise constant class $\mathcal{G}_\phi \subset (\mathcal{S}\times\mathcal{A}\to[0,V_{\max}])$ (induced by $\phi$, a partition of $\mathcal{S}\times\mathcal{A}$)⁶ with small realizability error $\epsilon_\phi := \epsilon_{\mathcal{G}_\phi}$, we show that we can compute a statistic for any given function $f_0: \mathcal{S}\times\mathcal{A}\to[0,V_{\max}]$, and the statistic will be a good surrogate for $\|f_0 - Q^\star\|$ as long as Assumption 1 holds and the sample size is polynomially large. We use $|\phi|$ to denote the number of equivalence classes induced by $\phi$. As Section 4 and Algorithm 1 have already alluded to, later we will invoke this result when comparing two candidate value functions $f$ and $f'$ (with $f_0 = f$), and define $\phi$ as the coarsest partition that can express both $f$ and $f'$; when $Q^\star\in\{f, f'\}$, $\epsilon_\phi$ will be small. To maintain the modularity of the analysis, however, we will view $\phi$ as an arbitrary partition of $\mathcal{S}\times\mathcal{A}$ in this section.

Footnote 6: We treat $\phi$ as mapping $\mathcal{S}\times\mathcal{A}$ to an arbitrary finite codomain, and $g(s,a) = g(\tilde s,\tilde a)$ for all $g\in\mathcal{G}_\phi$ iff $\phi(s,a) = \phi(\tilde s,\tilde a)$.
The statistic we compute is $\|f_0 - \hat{\mathcal{T}}^\mu_\phi f_0\|_{2,D}$ (c.f. Line 8 of Algorithm 1), where $\hat{\mathcal{T}}^\mu_\phi$ is defined as follows:

**Definition 2.** Define $\hat{\mathcal{T}}^\mu_\phi$ as the sample-based projected Bellman update operator associated with $\mathcal{G}_\phi$: for any $f: \mathcal{S}\times\mathcal{A}\to[0,V_{\max}]$,
$$\hat{\mathcal{T}}^\mu_\phi f := \arg\min_{g\in\mathcal{G}_\phi} \sum_{(s,a,r,s')\in D} \big(g(s,a) - r - \gamma V_f(s')\big)^2. \quad (1)$$

### 5.1. Warm-up: $|D|\to\infty$ and $\epsilon_\phi = 0$

To develop intuitions, we first consider the special case of $|D|\to\infty$ and $\epsilon_\phi = 0$. In this scenario, we can show that $Q^\star$ is the unique fixed point of $\hat{\mathcal{T}}^\mu_\phi$, which justifies using $\|f_0 - \hat{\mathcal{T}}^\mu_\phi f_0\|$ as a surrogate for $\|f_0 - Q^\star\|$. The concepts and lemmas introduced here will also be useful for the later analysis of the general case. We start by defining $\mathcal{T}^\mu_\phi$ as $\hat{\mathcal{T}}^\mu_\phi$ when $|D|\to\infty$.

**Definition 3.** Define $\mathcal{T}^\mu_\phi$ as the projected Bellman update where the projection is onto $\mathcal{G}_\phi$, weighted by $\mu$. That is, for any $f:\mathcal{S}\times\mathcal{A}\to[0,V_{\max}]$,
$$\mathcal{T}^\mu_\phi f := \arg\min_{g\in\mathcal{G}_\phi} \mathbb{E}_\mu\!\left[(g(s,a) - r - \gamma V_f(s'))^2\right]. \quad (2)$$

Next, we show that it is possible to define an MDP $M_\phi$ such that $\mathcal{T}^\mu_\phi$ coincides with the Bellman update of $M_\phi$. Readers familiar with state abstractions may find the definition unusual, as the abstract MDP associated with $\phi$ is typically defined over the compressed (or abstract) state space instead of the original one (e.g., Ravindran & Barto, 2004). We define $M_\phi$ over $\mathcal{S}$ because (1) our $\phi$ is an arbitrary partition of $\mathcal{S}\times\mathcal{A}$, which does not necessarily induce a consistent notion of abstract states, and (2) even when it does, the MDPs defined over $\mathcal{S}$ are dual representations to, and share many important properties with, the classical notion of abstract MDPs (Jiang, 2018).

**Definition 4.** Define $M_\phi = (\mathcal{S}, \mathcal{A}, P_\phi, R_\phi, \gamma, d_0)$, where
$$R_\phi(s,a) = \frac{\sum_{\tilde s,\tilde a:\,\phi(\tilde s,\tilde a)=\phi(s,a)} \mu(\tilde s,\tilde a)\, R(\tilde s,\tilde a)}{\sum_{\tilde s,\tilde a:\,\phi(\tilde s,\tilde a)=\phi(s,a)} \mu(\tilde s,\tilde a)}, \qquad
P_\phi(s'|s,a) = \frac{\sum_{\tilde s,\tilde a:\,\phi(\tilde s,\tilde a)=\phi(s,a)} \mu(\tilde s,\tilde a)\, P(s'|\tilde s,\tilde a)}{\sum_{\tilde s,\tilde a:\,\phi(\tilde s,\tilde a)=\phi(s,a)} \mu(\tilde s,\tilde a)}.$$

**Lemma 3.** $\mathcal{T}^\mu_\phi$ is the Bellman update operator of $M_\phi$.

Lemma 3 implies that $\mathcal{T}^\mu_\phi$ is a $\gamma$-contraction under $\ell_\infty$ and has a unique fixed point, namely the optimal Q-function of $M_\phi$. It then suffices to show that $Q^\star$ is such a fixed point.

**Proposition 4.** When $\epsilon_\phi = 0$, $Q^\star$ is the unique fixed point of $\mathcal{T}^\mu_\phi$.

### 5.2. The General Case

In the general case, we want to show that $\|f_0 - \hat{\mathcal{T}}^\mu_\phi f_0\|_{2,D}$ and $\|f_0 - Q^\star\|$ control each other. The central result of this section is the following proposition:

**Proposition 5.** Fix any $\epsilon_1, \tilde\epsilon > 0$. Suppose
$$|D| \ge \frac{32 V_{\max}^2 |\phi| \ln\frac{8V_{\max}}{\tilde\epsilon\delta}}{\tilde\epsilon^2} + \frac{50 V_{\max}^2 |\phi| \ln\frac{80V_{\max}}{\epsilon_1\delta}}{\epsilon_1^2}.$$
Then, with probability at least $1-\delta$, for any $\nu\in\Delta(\mathcal{S}\times\mathcal{A})$ such that $\|\nu/\mu\|_\infty \le C$,
$$\|f_0 - Q^\star\|_{2,\nu} \le \frac{2\epsilon_\phi + \sqrt{C}\big(\|f_0 - \hat{\mathcal{T}}^\mu_\phi f_0\|_{2,D} + \epsilon_1 + \tilde\epsilon\big)}{1-\gamma}. \quad (3)$$
At the same time,
$$\|f_0 - \hat{\mathcal{T}}^\mu_\phi f_0\|_{2,D} \le (1+\gamma)\|f_0 - Q^\star\|_\infty + 2\epsilon_\phi + \tilde\epsilon + \epsilon_1. \quad (4)$$

Proving the proposition requires quite some preparation. We group the helper lemmas according to their nature in Sections 5.2.1 to 5.2.3, and prove Proposition 5 in Section 5.2.4.
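Before turning to the helper lemmas, the following sketch makes Definition 4 concrete by constructing $M_\phi$ from tabular quantities. The array layout is our own convention (not the paper's), and the closing comment records what Lemma 3 asserts rather than something the code proves.

```python
import numpy as np

def abstract_mdp(P, R, mu, phi):
    """Construct M_phi = (P_phi, R_phi) of Definition 4 from tabular arrays:
    transitions P[s, a, s'], rewards R[s, a], data distribution mu[s, a],
    and a labeling phi[s, a] assigning each state-action pair to a group."""
    P_phi = np.zeros_like(P)
    R_phi = np.zeros_like(R)
    for g in np.unique(phi):
        mask = (phi == g)                 # members of this group
        w = mu * mask
        w = w / w.sum()                   # mu restricted to the group, normalized
        # every member of the group shares the mu-averaged reward and
        # next-state distribution of the group
        R_phi[mask] = (w * R).sum()
        P_phi[mask] = np.einsum('sa,sat->t', w, P)
    return P_phi, R_phi

def bellman_backup(P, R, gamma, f):
    """(T f)(s, a) = R(s, a) + gamma * E_{s'~P(s,a)}[max_a' f(s', a')]."""
    return R + gamma * P @ f.max(axis=1)

# Per Lemma 3, bellman_backup(P_phi, R_phi, gamma, f) coincides with the
# projected update T^mu_phi f, i.e., the mu-weighted projection of T f onto G_phi.
```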
#### 5.2.1. Error Propagation

The first two lemmas allow us to characterize error propagation in later proofs. That is, they will help answer the question: if we find $\|f_0 - \mathcal{T}^\mu_\phi f_0\|_{2,\mu}$ to be small (but nonzero), why does it imply that $\|f_0 - Q^\star\|$ is small? While results of a similar nature exist in the state-abstraction literature, they often bound $\|f_0 - Q^\star\|_\infty$ by $\|f_0 - \mathcal{T}^\mu_\phi f_0\|_\infty$, where error propagation is easy to handle (Jiang, 2018). However, $\|f_0 - \mathcal{T}^\mu_\phi f_0\|_\infty$ can only be reliably estimated if each group of state-action pairs receives a sufficient portion of the data, which is not guaranteed in our setting due to the arbitrary nature of the $\phi$ created in Line 7. This forces us to work with the $\mu$-weighted $\ell_2$ norm and carefully characterize how error propagation shifts the distributions.

In fact, it is precisely this analysis that demands the strong definition of the concentrability coefficient $C_S$ in Assumption 1: as we will later show in the proof of Proposition 5 (Section 5.2.4), the error propagates according to the dynamics of $M_\phi$ instead of that of $M$ (c.f. the $P_\phi(\nu)$ term in Eq.(10)). Therefore, popular definitions of the concentrability coefficient (e.g., Munos, 2007; Antos et al., 2008; Farahmand et al., 2010; Xie & Jiang, 2020), which all consider state distributions induced in $M$, do not fit our analysis. Fortunately, the $C_S$ defined in Assumption 1 has a very nice property: it automatically carries over to $M_\phi$ no matter what $\phi$ is.

**Lemma 6.** Any $C < \infty$ that satisfies Assumption 1 for the true MDP $M$ also satisfies the same assumption in $M_\phi$. As a further consequence, Proposition 1 is also satisfied when $P$ is replaced by $P_\phi$.

#### 5.2.2. Error of $Q^\star$ under $\mathcal{T}^\mu_\phi$

The next lemma parallels Proposition 4 in Section 5.1, where we showed that $\|Q^\star - \mathcal{T}^\mu_\phi Q^\star\| = 0$ when $\epsilon_\phi = 0$. When $\epsilon_\phi$ is non-zero, we need a more robust version of this result showing that $\|Q^\star - \mathcal{T}^\mu_\phi Q^\star\|$ is controlled by $\epsilon_\phi$.

**Lemma 7.** $\|Q^\star - \mathcal{T}^\mu_\phi Q^\star\|_\infty \le 2\epsilon_\phi$.

#### 5.2.3. Concentration Bounds

We need two concentration events: that $\hat{\mathcal{T}}^\mu_\phi f_0$ is close to $\mathcal{T}^\mu_\phi f_0$, and that $\|f_0 - \hat{\mathcal{T}}^\mu_\phi f_0\|_{2,D}$ is close to $\|f_0 - \hat{\mathcal{T}}^\mu_\phi f_0\|_{2,\mu}$. We will split the failure probability $\delta$ evenly between these events.

**Concentration of $\hat{\mathcal{T}}^\mu_\phi f_0$.** We begin with the former, which requires a standard result for realizable least-squares regression. The proof is deferred to Appendix B.5.

**Lemma 8** (Concentration bound for least-squares regression). Consider a real-valued regression problem with feature space $\mathcal{X}$ and label space $\mathcal{Y}\subset[0,V_{\max}]$. Let $(x_i, y_i)\sim P_{X,Y}$ be $n$ i.i.d. data points. Let $\mathcal{H}\subset(\mathcal{X}\to\mathcal{Y})$ be a hypothesis class with $\ell_\infty$ covering number $N = \mathcal{N}_\infty(\mathcal{H}, \epsilon_0)$ that realizes the Bayes-optimal regressor, i.e., $h^\star = (x\mapsto\mathbb{E}[Y|X=x])\in\mathcal{H}$. Let $\hat h = \arg\min_{h\in\mathcal{H}} \hat{\mathbb{E}}[(h(X)-Y)^2]$ be the empirical risk minimizer (ERM), where $\hat{\mathbb{E}}$ is the empirical expectation based on $\{(x_i,y_i)\}_{i=1}^n$. Then, with probability at least $1-\delta$,
$$\mathbb{E}\big[(h^\star(X) - \hat h(X))^2\big] \le \frac{8V_{\max}^2 \log\frac{N}{\delta}}{n} + 8V_{\max}\epsilon_0.$$

We then use Lemma 8 to prove that $\|\hat{\mathcal{T}}^\mu_\phi f - \mathcal{T}^\mu_\phi f\|_{2,\mu}$ is small.

**Lemma 9.** Fix $f: \mathcal{S}\times\mathcal{A}\to[0,V_{\max}]$. With probability at least $1-\delta$, $\|\hat{\mathcal{T}}^\mu_\phi f - \mathcal{T}^\mu_\phi f\|_{2,\mu} \le \tilde\epsilon$, as long as
$$|D| \ge \frac{16V_{\max}^2\big(2|\phi|\log(4V_{\max}/\tilde\epsilon) + \log(2/\delta)\big)}{\tilde\epsilon^2}.$$

**Technical challenge and proof idea.** Recall that $\hat{\mathcal{T}}^\mu_\phi f$ is the ERM (in $\mathcal{G}_\phi$) of the least-squares regression problem $(s,a)\mapsto r + \gamma V_f(s')$, so $\|\hat{\mathcal{T}}^\mu_\phi f - \mathcal{T}^\mu_\phi f\|_{2,\mu}$ essentially measures the $\mu$-weighted $\ell_2$ distance between the ERM and the population risk minimizer. Proving this is straightforward when the regression problem is realizable, as Lemma 8 would be directly applicable.⁷ In our case, however, the regression problem is in general non-realizable (except for $f = Q^\star$) and can incur arbitrarily large approximation errors, as $\mathcal{G}_\phi$ does not necessarily contain the Bayes-optimal regressor $\mathcal{T}f$. The key proof idea is to leverage a special property of piecewise constant classes⁸ to reduce the analysis to the realizable case: regressing $(s,a)\mapsto r + \gamma V_f(s')$ over $\mathcal{G}_\phi$ is equivalent to regressing $x\mapsto r + \gamma V_f(s')$ (with $x = \phi(s,a)$) over a tabular function class, where the $(s,a)\,|\,x$ portion of the data-generation process is treated as part of the inherent label noise. After switching to this alternative view, the tabular class over the codomain of $\phi$ is fully expressive and always realizable, which makes Lemma 8 applicable. See Appendix B.6 for the full proof of Lemma 9.

Footnote 7: See, e.g., Lemma 16 of Chen & Jiang (2019), where the (approximate) realizability of any such regression problem is guaranteed by the assumption of low inherent Bellman error.

Footnote 8: An alternative (and much messier) approach is to prove scalar-valued concentration bounds for $\hat{\mathcal{T}}^\mu_\phi f$ in each group of state-action pairs. Those groups with few data points will have high uncertainty, but they also contribute little to $\|\cdot\|_{2,\mu}$. Compared to this approach, our proof is much simpler.
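As a rough sanity check on the $|\phi|$ dependence in Lemma 9 (this bookkeeping is ours, with constants only indicative): under the tabular view above, a function in $\mathcal{G}_\phi$ is specified by one value in $[0, V_{\max}]$ per group, so

$$\mathcal{N}_\infty(\mathcal{G}_\phi, \epsilon_0) \le \Big\lceil \frac{V_{\max}}{2\epsilon_0} \Big\rceil^{|\phi|}
\quad\Longrightarrow\quad
\log \mathcal{N}_\infty(\mathcal{G}_\phi, \epsilon_0) \le |\phi| \log \Big\lceil \frac{V_{\max}}{2\epsilon_0} \Big\rceil.$$

Plugging this into Lemma 8 with $\epsilon_0$ of order $\tilde\epsilon^2/V_{\max}$ makes both terms on the right-hand side of order $\tilde\epsilon^2$ once $n \gtrsim V_{\max}^2\big(|\phi|\log(V_{\max}/\tilde\epsilon) + \log(1/\delta)\big)/\tilde\epsilon^2$, which matches the sample-size requirement in Lemma 9 up to constants.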
**Concentration of $\|f_0 - g\|_{2,D}$.** The second concentration result we need is an upper bound on $\big|\,\|f_0-g\|_{2,D} - \|f_0-g\|_{2,\mu}\,\big|$ for all $g\in\mathcal{G}_\phi$ simultaneously. We need to union bound over $g\in\mathcal{G}_\phi$ because our statistic is $\|f_0-g\|_{2,D}$ with $g = \hat{\mathcal{T}}^\mu_\phi f_0$, which is a data-dependent function.

**Lemma 10.** With probability at least $1-\delta/2$, for all $g\in\mathcal{G}_\phi$, $\big|\,\|f_0-g\|_{2,D} - \|f_0-g\|_{2,\mu}\,\big| \le \epsilon_1$, as long as
$$|D| \ge \frac{50V_{\max}^2|\phi|\ln\frac{80V_{\max}}{\epsilon_1\delta}}{\epsilon_1^2}.$$

**Technical challenge and proof idea.** It is straightforward to bound $\big|\,\|f_0-g\|_{2,D}^2 - \|f_0-g\|_{2,\mu}^2\,\big|$ (note the squares), but a naive conversion to a bound on the desired quantity (the difference without squares) would result in an $O(n^{-1/4})$ rate. To obtain an $O(n^{-1/2})$ rate, we consider two situations separately, depending on whether $\|f_0-g\|_{2,\mu}$ is below or above a certain threshold: when it is below the threshold, we can use Bernstein's inequality to exploit the low variance of $(f_0-g)^2$; when it is above the threshold, we obtain the bound by factoring the difference of squares. Combining these two cases with an $O(\epsilon_1)$ threshold yields a clean $O(n^{-1/2})$ result; see the proof details in Appendix B.7.

#### 5.2.4. Proof of Proposition 5

We are now ready to prove Proposition 5. Due to the space limit we only provide a proof sketch in the main text.

**Proof sketch.** To prove Eq.(3), define $\pi_{f,f'}$ as the policy $s\mapsto\arg\max_a \max\{f(s,a), f'(s,a)\}$. Consider any $\nu$ such that $\|\nu/\mu\|_\infty\le C$. We have
$$\|Q^\star - f_0\|_{2,\nu} \le \|Q^\star - \mathcal{T}^\mu_\phi Q^\star\|_{2,\nu} + \|\mathcal{T}^\mu_\phi Q^\star - \mathcal{T}^\mu_\phi f_0\|_{2,\nu} + \|f_0 - \mathcal{T}^\mu_\phi f_0\|_{2,\nu}.$$
The first term can be bounded via Lemma 7. The second term is bounded by $\gamma\|Q^\star - f_0\|_{2,\,P_\phi(\nu)\times\pi_{f_0,Q^\star}}$, where $P_\phi(\nu)\times\pi_{f_0,Q^\star}$ is a distribution that also satisfies $\|(\cdot)/\mu\|_\infty\le C$ (Proposition 1) and hence can be handled by recursion. The third can be bounded by $\sqrt{C}\,\|f_0 - \mathcal{T}^\mu_\phi f_0\|_{2,\mu}$ due to $\|\nu/\mu\|_\infty\le C$, and $\|f_0 - \mathcal{T}^\mu_\phi f_0\|_{2,\mu}$ can be related to $\|f_0 - \hat{\mathcal{T}}^\mu_\phi f_0\|_{2,D}$ by the concentration bounds established in Section 5.2.3, which are satisfied due to the choice of $|D|$ in the proposition statement.

To prove Eq.(4), we can similarly relate $\|f_0 - \hat{\mathcal{T}}^\mu_\phi f_0\|_{2,D}$ to $\|f_0 - \mathcal{T}^\mu_\phi f_0\|_{2,\mu}$ via the concentration bounds, and
$$\|f_0 - \mathcal{T}^\mu_\phi f_0\|_{2,\mu} \le \|f_0 - \mathcal{T}^\mu_\phi f_0\|_\infty \le \|f_0 - Q^\star\|_\infty + \|Q^\star - \mathcal{T}^\mu_\phi Q^\star\|_\infty + \|\mathcal{T}^\mu_\phi Q^\star - \mathcal{T}^\mu_\phi f_0\|_\infty \le (1+\gamma)\|f_0 - Q^\star\|_\infty + 2\epsilon_\phi,$$
where the last step uses Lemma 7 and the $\gamma$-contraction of $\mathcal{T}^\mu_\phi$.

## 6. Proof of Theorem 2

With the careful analysis of the pairwise-comparison step given in Section 5, we are now ready to analyze Algorithm 1. Roughly speaking, we will make the following arguments. For the output $\hat f$: if $\max_{f'} \mathcal{E}(\hat f; f')$ is small, then $\hat f \approx Q^\star$ (Eq.(3) of Proposition 5). And $\max_{f'}\mathcal{E}(\hat f; f')$ will indeed be small, because $\max_{f'}\mathcal{E}(f^\star; f')$ is small, where $f^\star\in\mathcal{F}$ is the best approximation of $Q^\star$ from Definition 1 (Eq.(4) of Proposition 5).

Before we delve into the proof of Theorem 2, we need yet another lemma, which connects the $\epsilon_\phi$ of Section 5 to the approximation error of $\mathcal{F}$. As Section 4 has suggested, this is feasible because we are only concerned with the comparisons involving $f^\star$; $\epsilon_\phi$ may be arbitrarily large otherwise.

**Lemma 11.** The $\phi$ induced in Line 7 satisfies $|\phi| \le (V_{\max}/\epsilon_{\mathrm{dct}})^2$. When $f^\star\in\{f, f'\}$, we further have $\epsilon_\phi \le \epsilon_{\mathcal{F}} + \epsilon_{\mathrm{dct}}$.
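The reasoning behind Lemma 11 is short and worth recording; this sketch is ours, and $\mathrm{dct}(\cdot)$ denotes the discretization of Line 3 (notation we introduce here). Each cell of $\phi$ is indexed by a pair of discretized values, one from each of the two compared functions, and each discretized function takes at most $V_{\max}/\epsilon_{\mathrm{dct}}$ values (Footnote 4), so $|\phi| \le (V_{\max}/\epsilon_{\mathrm{dct}})^2$. Moreover, when $f^\star\in\{f, f'\}$, the function $\mathrm{dct}(f^\star)$ is by construction constant on every cell of $\phi$, hence belongs to $\mathcal{G}_\phi$, and

$$\epsilon_\phi \;\le\; \|\mathrm{dct}(f^\star) - Q^\star\|_\infty \;\le\; \|\mathrm{dct}(f^\star) - f^\star\|_\infty + \|f^\star - Q^\star\|_\infty \;\le\; \epsilon_{\mathrm{dct}} + \epsilon_{\mathcal{F}}.$$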
**Proof of Theorem 2.** Among the $(f, f')$ pairs enumerated in Lines 5 and 6, we will only be concerned with the cases where either $f = f^\star$ or $f' = f^\star$, and there are $2|\mathcal{F}|$ such pairs. We require that, with probability $1-\delta$, Proposition 5 holds for all these $2|\mathcal{F}|$ pairs. To guarantee this, we set the sample size $|D|$ to the expression in the statement of Proposition 5, with $|\phi|$ replaced by its upper bound from Lemma 11 and $\delta$ replaced by $\delta/(2|\mathcal{F}|)$ (union bound). We also let $\epsilon_1 = \tilde\epsilon$ to simplify the expressions, and will set the concrete value of $\tilde\epsilon$ later. The following sample size satisfies all of the above:
$$|D| \ge \frac{82\, V_{\max}^4 \ln\frac{160\, V_{\max}|\mathcal{F}|}{\tilde\epsilon\delta}}{\tilde\epsilon^2\,\epsilon_{\mathrm{dct}}^2}. \quad (5)$$

Let $\phi$ be the partition induced by $\hat f$ and $f^\star$. According to Eq.(3), for any $\nu$ s.t. $\|\nu/\mu\|_\infty\le C$,
$$\|\hat f - Q^\star\|_{2,\nu} \le \frac{2\epsilon_\phi + \sqrt{C}\big(\|\hat f - \hat{\mathcal{T}}^\mu_\phi \hat f\|_{2,D} + 2\tilde\epsilon\big)}{1-\gamma} \le \frac{2\epsilon_\phi + \sqrt{C}\big(\mathcal{E}(\hat f; f^\star) + 2\tilde\epsilon\big)}{1-\gamma} \le \frac{2\epsilon_{\mathcal{F}} + 2\epsilon_{\mathrm{dct}} + \sqrt{C}\big(\max_{f'\in\mathcal{F}}\mathcal{E}(\hat f; f') + 2\tilde\epsilon\big)}{1-\gamma}.$$
It then remains to bound $\max_{f'}\mathcal{E}(\hat f; f')$. Note that
$$\max_{f'\in\mathcal{F}}\mathcal{E}(\hat f; f') = \min_{f\in\mathcal{F}}\max_{f'\in\mathcal{F}}\mathcal{E}(f; f') \le \max_{f'\in\mathcal{F}}\mathcal{E}(f^\star; f').$$
For any $f'$, let $\phi'$ be the partition of $\mathcal{S}\times\mathcal{A}$ induced by $f^\star$ and $f'$. Then
$$\mathcal{E}(f^\star; f') = \|f^\star - \hat{\mathcal{T}}^\mu_{\phi'} f^\star\|_{2,D} \le (1+\gamma)\|f^\star - Q^\star\|_\infty + 2\epsilon_{\phi'} + 2\tilde\epsilon \;\;\text{(Eq.(4))}\;\; \le 4\epsilon_{\mathcal{F}} + 2\epsilon_{\mathrm{dct}} + 2\tilde\epsilon.$$
Combining the above results, we have for any $\nu$ s.t. $\|\nu/\mu\|_\infty\le C$,
$$\|\hat f - Q^\star\|_{2,\nu} \le \frac{2\epsilon_{\mathcal{F}} + 2\epsilon_{\mathrm{dct}} + \sqrt{C}\big(4\epsilon_{\mathcal{F}} + 2\epsilon_{\mathrm{dct}} + 2\tilde\epsilon + 2\tilde\epsilon\big)}{1-\gamma} \le \frac{(2+4\sqrt{C})\,\epsilon_{\mathcal{F}} + 4\sqrt{C}\,(\epsilon_{\mathrm{dct}} + \tilde\epsilon)}{1-\gamma}.$$
Finally, since any state-action distribution induced by any (potentially non-stationary) policy satisfies the condition of Proposition 5, by Chen & Jiang (2019, Lemma 13) we have
$$J(\pi^\star) - J(\hat\pi) \le \frac{2}{1-\gamma}\sup_{\nu:\,\|\nu/\mu\|_\infty\le C}\|\hat f - Q^\star\|_{2,\nu} \le \frac{(4+8\sqrt{C})\,\epsilon_{\mathcal{F}}}{(1-\gamma)^2} + \frac{8\sqrt{C}\,(\epsilon_{\mathrm{dct}} + \tilde\epsilon)}{(1-\gamma)^2}.$$
To guarantee that $\frac{8\sqrt{C}(\epsilon_{\mathrm{dct}} + \tilde\epsilon)}{(1-\gamma)^2} \le \epsilon V_{\max}$, we set $\epsilon_{\mathrm{dct}} = \tilde\epsilon = \frac{(1-\gamma)^2\epsilon V_{\max}}{16\sqrt{C}}$. Plugging this back into Eq.(5) yields the sample complexity in the theorem statement.

## 7. Discussions and Conclusions

### 7.1. Application to Model Selection

When learning $Q^\star$ from a batch dataset in practice, one would like to try different algorithms, different function approximators, and even different hyperparameters for a fixed algorithm, and see which combination gives the best result, as is always the case in machine-learning practice. In supervised learning, this can be done by a simple cross-validation procedure on a holdout dataset. In batch RL, however, how to perform such a model-selection step in a provable manner has been a wide-open problem.⁹

There exists a limited amount of theoretical work on this topic, which often considers a restrictive setting where the base algorithms are model-based learners using nested state abstractions (Hallak et al., 2013; van Seijen et al., 2014; Jiang et al., 2015).¹⁰ The only finite-sample guarantee we are aware of, given by Jiang et al. (2015), provides an oracle inequality with respect to an upper bound on $\|f - Q^\star\|$ based on how much the base state abstractions violate the bisimulation (or model-irrelevance) criterion (Whitt, 1978; Even-Dar & Mansour, 2003; Li et al., 2006) and $\ell_\infty$ concentration bounds, and the guarantee does not scale to the case where the number of base algorithms is super-constant.

In comparison, BVFT provides a more direct approach with a much stronger guarantee: let $Q_1, \ldots, Q_m$ be the outputs of different base algorithms. We can simply run BVFT on the holdout dataset with $\mathcal{F} = \{Q_i\}_{i=1}^m$. The only function-approximation assumption we need is that one of the $Q_i$'s is a good approximation of $Q^\star$, which is hardly an assumption, as there is little we can do if all the base algorithms produce bad results.
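In code, this use case is a thin wrapper around the sketch given after Algorithm 1; the base learners `run_fqi` and `run_lspi` below are hypothetical stand-ins (not from the paper) for whatever algorithms and hyperparameter settings one wishes to compare.

```python
# Model selection with BVFT: the candidate class is the finite set of
# Q-functions returned by the base algorithms on the training split, and
# BVFT is run on a held-out split of the batch data.
candidates = [
    run_fqi(train_data, num_iterations=50),    # hypothetical base learner 1
    run_fqi(train_data, num_iterations=200),   # same learner, other hyperparameters
    run_lspi(train_data, feature_degree=3),    # hypothetical base learner 2
]
gamma = 0.99
Q_selected, pi_selected = bvft(holdout_data, candidates, gamma,
                               V_max=1.0 / (1 - gamma), eps_dct=0.05)
```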
Compared to prior works, our approach is much more agnostic with respect to the details of the base algorithms, our loss and guarantees are directly related to $\|f - Q^\star\|$ as opposed to relying on (possibly loose) upper bounds based on bisimulation, and our statistical guarantee scales to an exponentially large $\mathcal{F}$ as opposed to a constant-sized one.

Another common approach to model selection is to estimate $J(\pi)$ for each candidate $\pi$ via off-policy evaluation (OPE).¹¹ OPE-based model selection has very different characteristics compared to BVFT, and the two may be used together to complement each other; see a more detailed comparison and discussion in Appendix F.

Footnote 9: See Mandel et al. (2014) and Paine et al. (2020) for empirical advances on this problem.

Footnote 10: An exception is the work of Farahmand & Szepesvári (2011), which requires the additional assumption that a regression procedure can approximate $\mathcal{T}f$, and uses it to compute $\|f - \mathcal{T}f\|$.

Footnote 11: As a side note, BVFT can be adapted to OPE when $Q^\pi\in\mathcal{F}$ for the target policy $\pi$, as long as we change the max operator in $\hat{\mathcal{T}}^\mu_\phi$ to $\pi$, though Assumption 1 will still be needed.

### 7.2. On the Assumption of Exploratory Data

As noted in Section 3.2, our Assumption 1 adopts a relatively stringent definition of the concentrability coefficient. A more standard definition is the following, as appeared in the hardness conjecture of Chen & Jiang (2019):

**Assumption 2.** Let $d^\pi_t$ be the distribution of $(s_t, a_t)$ when we start from $s_0\sim d_0$ and follow policy $\pi$; we call such a distribution admissible. We assume that there exists $C < \infty$ such that $\|d^\pi_t/\mu\|_\infty \le C$ for any (possibly non-stationary) policy $\pi$ and any $t\ge 0$.

In Appendix C we construct three scenarios to illustrate the difficulties (and sometimes possibilities) in extending our algorithm and its guarantees to a weaker data assumption such as Assumption 2; due to the space limit we only include a high-level summary of the results below. In the first construction, we show that BVFT fails under Assumption 2 in a very simple MDP if we are allowed to provide a contrived $\mu$ distribution to the learner, in which data is unnaturally missing in certain states (Figure 1). Motivated by the unnaturalness of the construction, we attempt to circumvent the hardness by imposing an additional mild assumption on top of Assumption 2, namely that $\mu$ must itself be admissible. While it becomes much more difficult to construct a counterexample against the algorithm, it is still possible to design a scenario in which our analysis breaks down seriously (Figure 2). We conclude with a positive result showing that the actual assumption we need is somewhere in between Assumptions 1 and 2, in that our algorithm and analysis work for a simple and natural on-policy case which obviously violates Assumption 1; formulating a tighter version of the assumption in a natural and interpretable manner remains future work.

### 7.3. Conclusions

We conclude the paper with a few open problems. Is it possible to circumvent the failure modes discussed in Section 7.2 with novel algorithmic ideas, so that a variant of BVFT only requires a weaker assumption on the data? On a related note, the original hardness conjecture of Chen & Jiang (2019) remains unresolved: our positive result assumes a stronger data assumption, and the negative results of Wang et al. (2020); Amortila et al. (2020) assume weaker ones. Furthermore, when the data is seriously under-exploratory, to the extent that it is impossible to compete with $\pi^\star$ (Fujimoto et al., 2019; Liu et al., 2019; 2020), what is the minimal function-approximation assumption that enables polynomial learning?
In particular, requiring that $\mathcal{F}$ realizes $Q^\star$ no longer makes sense, as we do not even attempt to compete with $\pi^\star$. Recent works often suggest that we compete with a policy $\pi$ whose occupancy is covered by $\mu$, but as of now very strong expressivity assumptions are needed to achieve such an ambitious goal (e.g., Jiang & Huang, 2020, Proposition 9). It will be interesting to explore more humble objectives and see whether the algorithmic and analytical ideas in this work extend to the more realistic setting of learning with non-exploratory data.

## Acknowledgments

The authors thank Akshay Krishnamurthy, Yu Bai, and Yu-Xiang Wang for discussions related to Lemma 10, and Alekh Agarwal for numerous comments and discussions after the initial draft of the paper was released. Nan Jiang acknowledges support from the DEVCOM Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196 (ARL IoBT CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

## References

Agarwal, S. E0 370 Statistical Learning Theory: Covering Numbers, Pseudo-Dimension, and Fat-Shattering Dimension. Indian Institute of Science, 2011. https://www.shivani-agarwal.net/Teaching/E0370/Aug-2011/Lectures/5.pdf.

Amortila, P., Jiang, N., and Xie, T. A variant of the Wang-Foster-Kakade lower bound for the discounted setting. arXiv preprint arXiv:2011.01075, 2020.

Antos, A., Szepesvári, C., and Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.

Baird, L. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pp. 30–37. Elsevier, 1995.

Barreto, A. d. M. S., Pineau, J., and Precup, D. Policy iteration based on stochastic factorization. Journal of Artificial Intelligence Research, 50:763–803, 2014.

Chen, J. and Jiang, N. Information-theoretic considerations in batch reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, pp. 1042–1051, 2019.

Chen, L., Scherrer, B., and Bartlett, P. L. Infinite-horizon offline reinforcement learning with linear function approximation: Curse of dimensionality and algorithm. arXiv preprint arXiv:2103.09847, 2021.

Devroye, L. and Lugosi, G. Combinatorial Methods in Density Estimation. Springer Science & Business Media, 2012.

Even-Dar, E. and Mansour, Y. Approximate equivalence of Markov decision processes. In Learning Theory and Kernel Machines, pp. 581–594. Springer, 2003.

Farahmand, A.-m. and Szepesvári, C. Model selection in reinforcement learning. Machine Learning, 85(3):299–332, 2011.

Farahmand, A.-m., Szepesvári, C., and Munos, R. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, pp. 568–576, 2010.

Feng, Y., Ren, T., Tang, Z., and Liu, Q. Accountable off-policy evaluation with kernel Bellman statistics. In Proceedings of the 37th International Conference on Machine Learning (ICML-20), 2020.

Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062, 2019.
Gordon, G. J. Stable function approximation in dynamic programming. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 261–268, 1995.

Hallak, A., Di-Castro, D., and Mannor, S. Model selection in Markovian processes. In Proceedings of the 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 374–382, 2013.

Jiang, N. CS 598: Notes on State Abstractions. University of Illinois at Urbana-Champaign, 2018. http://nanjiang.cs.illinois.edu/files/cs598/note4.pdf.

Jiang, N. On value functions and the agent-environment boundary. arXiv preprint arXiv:1905.13341, 2019.

Jiang, N. and Huang, J. Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems, 33, 2020.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pp. 652–661, 2016.

Jiang, N., Kulesza, A., and Singh, S. Abstraction selection in model-based reinforcement learning. In Proceedings of the 32nd International Conference on Machine Learning, pp. 179–188, 2015.

Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, 2017.

Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137–2143. PMLR, 2020.

Lagoudakis, M. G. and Parr, R. Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107–1149, 2003.

Li, L., Walsh, T. J., and Littman, M. L. Towards a unified theory of state abstraction for MDPs. In Proceedings of the 9th International Symposium on Artificial Intelligence and Mathematics, pp. 531–539, 2006.

Littman, M. L. and Szepesvári, C. A generalized reinforcement-learning model: Convergence and applications. In ICML, volume 96, pp. 310–318, 1996.

Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5361–5371, 2018.

Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473, 2019.

Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. Provably good batch reinforcement learning without great exploration. arXiv preprint arXiv:2007.08202, 2020.

Mandel, T., Liu, Y.-E., Levine, S., Brunskill, E., and Popovic, Z. Offline policy evaluation across representations with applications to educational games. In Proceedings of the 13th International Conference on Autonomous Agents and Multi-Agent Systems, pp. 1077–1084, 2014.

Munos, R. Error bounds for approximate policy iteration. In ICML, volume 3, pp. 560–567, 2003.

Munos, R. Performance bounds in Lp-norm for approximate value iteration. SIAM Journal on Control and Optimization, 46(2):541–561, 2007.

Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.

Paduraru, C., Kaplow, R., Precup, D., and Pineau, J. Model-based reinforcement learning with state aggregation. In 8th European Workshop on Reinforcement Learning, 2008.
Paine, T. L., Paduraru, C., Michi, A., Gulcehre, C., Zolna, K., Novikov, A., Wang, Z., and de Freitas, N. Hyperparameter selection for offline reinforcement learning. arXiv preprint arXiv:2007.09055, 2020.

Precup, D., Sutton, R. S., and Singh, S. P. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766, 2000.

Ravindran, B. and Barto, A. Approximate homomorphisms: A framework for nonexact minimization in Markov decision processes. In Proceedings of the 5th International Conference on Knowledge-Based Computer Systems, 2004.

Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, 2019.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Uehara, M., Huang, J., and Jiang, N. Minimax weight and Q-function learning for off-policy evaluation. In International Conference on Machine Learning, pp. 9659–9668. PMLR, 2020.

van Seijen, H., Whiteson, S., and Kester, L. Efficient abstraction selection in reinforcement learning. Computational Intelligence, 30(4):657–699, 2014.

Wang, R., Foster, D. P., and Kakade, S. M. What are the statistical limits of offline RL with linear function approximation? arXiv preprint arXiv:2010.11895, 2020.

Whitt, W. Approximations of dynamic programs, I. Mathematics of Operations Research, 3(3):231–243, 1978.

Xie, T. and Jiang, N. Q* approximation schemes for batch reinforcement learning: A theoretical comparison. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 550–559, 2020.

Zanette, A. Exponential lower bounds for batch reinforcement learning: Batch RL can be exponentially harder than online RL. arXiv preprint arXiv:2012.08005, 2020.