# On The Statistical Complexity of Offline Decision-Making

Thanh Nguyen-Tang 1   Raman Arora 1

*Equal contribution. 1 Department of Computer Science, Johns Hopkins University, Baltimore 21218, USA. Correspondence to: Thanh Nguyen-Tang. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024.

Abstract

We study the statistical complexity of offline decision-making with function approximation, establishing (near) minimax-optimal rates for stochastic contextual bandits and Markov decision processes. The performance limits are captured by the pseudo-dimension of the (value) function class and a new characterization of the behavior policy that strictly subsumes all previous notions of data coverage in the offline decision-making literature. In addition, we seek to understand the benefits of using offline data in online decision-making and show nearly minimax-optimal rates in a wide range of regimes.

1. Introduction

Reinforcement learning (RL) has achieved remarkable empirical success in a wide range of challenging tasks, from playing video games at the same level as humans (Mnih et al., 2015), surpassing champions at the game of Go (Silver et al., 2018), to defeating top-ranked professional players in StarCraft (Vinyals et al., 2019). However, many of these systems require extensive online interaction in gameplay with other players who are experts at the task, or some form of self-play (Li et al., 2016; Ouyang et al., 2022). Such online interaction may not be affordable in many real-world scenarios due to concerns about cost, safety, and ethics (e.g., healthcare and autonomous driving). Even in domains where online interaction is possible (e.g., dialogue systems), we would still prefer to utilize available historical, pre-collected datasets to learn useful decision-making policies efficiently. Such an approach would allow leveraging plentiful data, possibly replicating the success that supervised learning has had recently (LeCun et al., 2015). Offline RL has emerged as an alternative that allows learning from existing datasets and is particularly attractive when online interaction is prohibitive (Ernst et al., 2005; Lange et al., 2012; Levine et al., 2020).

Nevertheless, learning good policies from offline data presents a unique challenge not present in online decision-making: distributional shift. In essence, the policy that interacts with the environment and collects data differs from the target policy we aim to learn. This challenge becomes more pronounced in real-world problems with large state spaces, where it necessitates function approximation to generalize from observed states to unseen ones.

Representation learning is a basic challenge in machine learning. It is not surprising, then, that function approximation plays a pivotal role in reinforcement learning problems with large state spaces, mirroring its significance in statistical learning theory (Vapnik, 2013). Empirically, deep RL, which employs neural networks for function approximation, has achieved remarkable success across diverse tasks (Mnih et al., 2015; Schulman et al., 2017). The choice of function approximation class determines the inductive bias we inject into learning, e.g., our belief that the learner's environment is relatively simple even though the state space may be large.
It is natural, then, to understand different function approximation classes in terms of a tight characterization of their complexity and learnability. In statistical supervised learning, specific combinatorial properties of the function class are known to completely characterize sample-efficient supervised learning in both realizable and agnostic settings (Vapnik & Chervonenkis, 1971; Alon et al., 1997; Attias et al., 2023). For offline RL, a similar characterization is not known. With that as our motivation, we pose the following fundamental question, which has largely remained unanswered: What is a sufficient and necessary condition for learnability in offline RL with function approximation?

We note that, given the additional challenge of distribution shift in offline RL, such a characterization would depend not only on the properties of the function class but, more importantly, on the quality of the offline dataset. The existing literature on offline RL provides theoretical understanding only for limited scenarios of distributional shift. These works capture the quality of offline data via a notion of data coverage. The strongest of these notions is uniform coverage, which requires that the behavior policy (i.e., the policy used for data collection) cover the space of all feasible trajectories with sufficient probability (Munos & Szepesvári, 2008; Chen & Jiang, 2019). Recent works consider weaker assumptions, including single-policy concentrability, which only requires coverage of the trajectories of the target policy that we want to compete against (Liu et al., 2019; Rashidinejad et al., 2021; Yin & Wang, 2021), and relative condition numbers (Agarwal et al., 2021; Uehara & Sun, 2021). While the works above do not take function approximation into account in their notion of data coverage, the Bellman residual ratio (Xie et al., 2021a), the Bellman error transfer coefficient (Song et al., 2022), and data diversity (Nguyen-Tang & Arora, 2023) directly incorporate function approximation into the notion of offline data quality. Many of these notions are incompatible in that they give differing views of the landscape of offline learning. Further, there are no known lower bounds establishing the necessity of any of these assumptions. This suggests that there are gaps in our understanding of when offline learning is feasible.

In this paper, we work towards giving a more comprehensive and tighter characterization of problems that are learnable with offline data. Our key contributions are as follows.

First, we introduce the notion of policy transfer coefficients, which subsumes other notions of data coverage. In conjunction with the pseudo-dimension of the function class, policy transfer coefficients give a tight characterization of offline learnability. Specifically, the class of offline learning problems characterized as learnable by policy transfer coefficients subsumes the problems characterized as learnable in prior literature.

Second, we provide (nearly) matching minimax lower and upper bounds for offline learning. Our results encompass offline learning in the setting of multi-armed bandits, contextual bandits, and Markov decision processes. We consider a variety of function approximation classes, including linear, neural-network-based, and, more generally, any function class with bounded L1 covering numbers.

Third, we extend our results to the setting of hybrid offline-online learning and formally characterize the value of offline data in online learning problems.

Finally, we overcome various technical challenges, such as establishing a uniform Bernstein's inequality for Bellman-like losses using empirical L1 covering numbers (see Appendix A and Remark A.4) and removing the blowup in the number of iterations of the Hedge algorithm (see Remark 5.3). These may be of independent interest in their own right.
The rest of the paper is organized as follows. In Section 2, we introduce a formal setup for offline decision-making problems, where we focus mainly on the contextual bandit model to avoid deviating from the main points. In Section 3, we introduce policy transfer coefficients, a new notion of data coverage. In Section 4 and Section 5, we provide lower bounds and upper bounds, respectively, for offline decision-making (in the contextual bandit model). In Section 6, we consider a hybrid offline-online setting. We extend our results to Markov decision processes (MDPs) in Section 7. We conclude with a discussion in Section 8.

2. Background and Problem Formulation

2.1. Stochastic contextual bandits

We represent a stochastic contextual bandit environment by a tuple $(\mathcal{X}, \mathcal{A}, D)$, where $\mathcal{X}$ denotes the set of contexts, $\mathcal{A}$ denotes the space of actions, and $D \in \Delta(\mathcal{X} \times \mathcal{Y})$ denotes an unknown joint distribution over contexts and rewards. Without loss of generality, we take $\mathcal{Y} := [0, 1]^{\mathcal{A}}$. A learner interacts with the environment as follows. At each time step, the environment samples $(X, Y) \sim D$, the learner is presented with the context $X$, she commits to an action $a \in \mathcal{A}$, and observes the reward $Y(a)$. We model the learner as stochastic: the learner maintains a stochastic policy $\pi : \mathcal{X} \to \Delta(\mathcal{A})$, i.e., a map from the context space to distributions over the action space. The value $V^{\pi}$ of a policy $\pi$ is defined to be its expected reward, $V^{\pi}_{D} := \mathbb{E}_{(X,Y) \sim D,\, A \sim \pi(\cdot|X)}[Y(A)]$. The sub-optimality of $\hat{\pi}$ w.r.t. any policy $\pi$ is defined as:

$\mathrm{SubOpt}^{\pi}_{D}(\hat{\pi}) = V^{\pi}_{D} - V^{\hat{\pi}}_{D}$.   (1)

We often suppress the subscript in $V^{\pi}_{D}$ and $\mathrm{SubOpt}^{\pi}_{D}(\hat{\pi})$.

2.2. Offline data

Let $S = \{(x_i, a_i, r_i)\}_{i \in [n]}$ be a dataset collected by a (fixed, but unknown) behavior policy $\mu$, i.e., $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} D$, $a_i \sim \mu(\cdot|x_i)$, and $r_i = y_i(a_i)$ for all $i \in [n]$. The goal of offline learning is to learn a policy $\hat{\pi}$ from the offline data such that it has small sub-optimality $\mathrm{SubOpt}^{\pi}_{D}(\hat{\pi})$ for as wide a range of comparator policies $\pi$ as possible (possibly including an optimal policy $\pi^{*}_{D} \in \arg\max_{\pi} V^{\pi}_{D}$).
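To make the interaction protocol and the offline dataset $S$ concrete, the following is a minimal simulation sketch. The finite context set, Bernoulli rewards, and Dirichlet-sampled behavior policy are illustrative assumptions, not part of the formal setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_contexts, n_actions, n = 5, 3, 1000

# Illustrative environment: a context distribution and mean rewards E[Y(a) | X = x].
context_probs = rng.dirichlet(np.ones(n_contexts))
mean_reward = rng.uniform(0, 1, size=(n_contexts, n_actions))  # plays the role of f*(x, a)

# A fixed (but, to the learner, unknown) behavior policy mu(a | x).
behavior = rng.dirichlet(np.ones(n_actions), size=n_contexts)

# Offline dataset S = {(x_i, a_i, r_i)}: contexts ~ D, actions ~ mu(.|x_i), rewards r_i = y_i(a_i).
xs = rng.choice(n_contexts, size=n, p=context_probs)
acts = np.array([rng.choice(n_actions, p=behavior[x]) for x in xs])
rewards = rng.binomial(1, mean_reward[xs, acts]).astype(float)

def value(policy):
    """V^pi_D = E_{x ~ D, a ~ pi(.|x)}[Y(a)] for a policy given as a (contexts x actions) matrix."""
    return float(context_probs @ np.sum(policy * mean_reward, axis=1))

# Sub-optimality of the behavior policy against the optimal policy pi*_D, as in Eq. (1).
greedy = np.eye(n_actions)[mean_reward.argmax(axis=1)]
print("SubOpt of mu w.r.t. pi*:", value(greedy) - value(behavior))
```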
2.3. Function approximation

A central aspect of any value-based method for a sequential decision-making problem is to employ a function class $\mathcal{F} \subseteq [0, 1]^{\mathcal{X} \times \mathcal{A}}$ for modeling rewards in terms of contexts and actions; it is typical to solve a regression problem using the squared loss. The choice of the function class reflects the learner's inductive bias or prior knowledge about the task at hand. In particular, we often make the following realizability assumption (Chu et al., 2011; Agarwal et al., 2012; Foster & Rakhlin, 2020).

Assumption 2.1 (Realizability). There exists an $f^{*} \in \mathcal{F}$ such that $f^{*}(x, a) = \mathbb{E}[Y(a) \mid X = x]$ for all $(x, a) \in \mathcal{X} \times \mathcal{A}$.

We will utilize a well-understood characterization of the complexity of real-valued function classes from statistical learning, that of pseudo-dimension (Pollard, 1984).

Definition 2.2 (Pseudo-dimension). A set $\{z_1, \ldots, z_d\} \subseteq \mathcal{Z}$ is said to be shattered by $\mathcal{H} \subseteq \mathbb{R}^{\mathcal{Z}}$ if there exists a vector $r \in \mathbb{R}^{d}$ such that for every $\epsilon \in \{\pm 1\}^{d}$ there exists $h \in \mathcal{H}$ with $\mathrm{sign}(h(z_i) - r_i) = \epsilon_i$ for all $i \in [d]$. The pseudo-dimension $\mathrm{Pdim}(\mathcal{H})$ is the cardinality of the largest set shattered by $\mathcal{H}$.

For example, neural networks of depth $L$, with ReLU activations, and total number of parameters (weights and biases) equal to $W$ have pseudo-dimension $O(WL \log W)$ (Bartlett et al., 2019).

Assumption 2.3. Define the function class associated with each fixed action, $\mathcal{F}(\cdot, a) := \{f(\cdot, a) : f \in \mathcal{F}\}$. We assume that $\sup_{a \in \mathcal{A}} \mathrm{Pdim}(\mathcal{F}(\cdot, a)) \le d$.

The reason we assume that the function class associated with each action has bounded pseudo-dimension is that we can provide (nearly) matching lower and upper bounds for such a function class. We also provide upper bounds in terms of the covering number of a function class.

Definition 2.4 (Covering number). Let $S = \{z_1, \ldots, z_n\} \subseteq \mathcal{Z}$, and let $\hat{P}_S(\cdot) = \frac{1}{n} \sum_{i=1}^{n} \delta_{z_i}(\cdot)$ be the empirical distribution, where $\delta_z$ is the Dirac measure at $z$. For any $p > 0$ and any $\mathcal{H} \subseteq \mathbb{R}^{\mathcal{Z}}$, let $N_p(\mathcal{H}, \epsilon, L_p(\hat{P}_S))$ denote the size of the smallest $\mathcal{H}'$ such that for every $h \in \mathcal{H}$ there exists $h' \in \mathcal{H}'$ with $\|h - h'\|_{L_p(\hat{P}_S)} \le \epsilon$, where $\|h - h'\|_{L_p(\hat{P}_S)} := \bigl(\frac{1}{n} \sum_{i=1}^{n} |h(z_i) - h'(z_i)|^p\bigr)^{1/p}$ is a pseudo-metric on $\mathcal{H}$. We define the (worst-case) $L_p$ covering number $N_p(\mathcal{H}, \epsilon, n)$ as $N_p(\mathcal{H}, \epsilon, n) := \sup_{S : |S| = n} N_p(\mathcal{H}, \epsilon, L_p(\hat{P}_S))$.

Notation. Let $f(x, \pi) := \mathbb{E}_{a \sim \pi(\cdot|x)} f(x, a)$ for all $x$, and let $\mathcal{F}(\cdot, \Pi) := \{f(\cdot, \pi) : f \in \mathcal{F}, \pi \in \Pi\}$, where $\Pi := \Delta(\mathcal{A})^{\mathcal{X}}$ is the set of all possible (Markovian) policies. Let $D \times \pi$ denote the distribution of the random variable $(x, a, r)$, where $(x, y) \sim D$, $a \sim \pi(\cdot|x)$, and $r = y(a)$. Let $l_f$ denote the random variable $(f(x, a) - r)^2$ where $(x, a, r) \sim D \times \mu$. The empirical and population means of $l_f$ are denoted $\hat{P} l_f = \frac{1}{n} \sum_{i=1}^{n} l_f(x_i, a_i, r_i)$ and $P l_f = \mathbb{E}_{(x,a,r) \sim D \times \mu}[l_f(x, a, r)]$, respectively. The empirical and population means of $f$ under policy $\pi$ are denoted $\hat{P} f(\cdot, \pi) = \frac{1}{n} \sum_{i=1}^{n} f(x_i, \pi)$ and $P f(\cdot, \pi) := \mathbb{E}_{x \sim D}[f(x, \pi)]$, respectively. We write $a \lesssim b$ to mean $a = O(b)$, suppressing only absolute constants, and $a \mathrel{\tilde{\lesssim}} b$ to mean $a = \tilde{O}(b)$, further suppressing log factors. Define $[x]_{1} := \max\{-x, x\}$.

3. Offline Decision-Making as Transfer Learning

We view offline decision-making as transfer learning, where the goal is to utilize pre-collected experience for learning new tasks. A key observation we leverage is that there are parallels in how the two areas capture distribution shift, a common challenge in both settings. Transfer learning uses various notions of distributional discrepancy (Ben-David et al., 2010; David et al., 2010; Germain et al., 2013; Sugiyama et al., 2012; Mansour et al., 2012; Tripuraneni et al., 2020; Watkins et al., 2023) to capture the distribution shift between the source tasks and the target task, much like how offline decision-making uses various notions of data coverage to measure the distributional mismatch due to offline data. We consider a new notion of data coverage inspired by transfer learning, which we show shortly to tightly capture the statistical complexity of offline decision-making from a behavior policy.

Definition 3.1 (Policy transfer coefficients). Given any policy $\pi$, $\rho \ge 0$ is said to be a policy transfer exponent from $\mu$ to $\pi$ w.r.t. $(D, \mathcal{F})$ if there exists a finite constant $C$, called the policy transfer factor, such that:

$\forall f \in \mathcal{F} : \quad \bigl(\mathbb{E}_{D \times \pi}[f^{*} - f]\bigr)^{2\rho} \le C\, \mathbb{E}_{D \times \mu}\bigl[(f^{*} - f)^{2}\bigr]$.   (2)

Any such pair $(\rho, C)$ is said to be a policy transfer coefficient from $\mu$ to $\pi$ w.r.t. $(D, \mathcal{F})$. We denote the minimal policy transfer exponent by $\rho_{\pi}$ (if $\rho$ is a policy transfer exponent, then so is any $\rho' \ge \rho$). The minimal policy transfer factor corresponding to $\rho_{\pi}$ is denoted $C_{\pi}$.
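The quantity in Definition 3.1 can be evaluated directly in small, fully specified problems. The sketch below estimates the smallest transfer factor for a fixed exponent $\rho = 1$ over a finite stand-in for $\mathcal{F}$; the specific distributions and the finite class are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_contexts, n_actions, n_funcs = 4, 3, 50

context_probs = rng.dirichlet(np.ones(n_contexts))
f_star = rng.uniform(0, 1, size=(n_contexts, n_actions))       # true mean reward f*
F = rng.uniform(0, 1, size=(n_funcs, n_contexts, n_actions))   # a finite stand-in for the class F
mu = rng.dirichlet(np.ones(n_actions), size=n_contexts)        # behavior policy mu(a|x)
pi = rng.dirichlet(np.ones(n_actions), size=n_contexts)        # target policy pi(a|x)

def transfer_factor(rho):
    """Smallest C with (E_{D x pi}[f* - f])^{2 rho} <= C E_{D x mu}[(f* - f)^2] over f in F (Eq. (2))."""
    gap = f_star[None] - F                                      # shape: (n_funcs, contexts, actions)
    lhs = np.abs(np.einsum('x,xa,fxa->f', context_probs, pi, gap)) ** (2 * rho)
    rhs = np.einsum('x,xa,fxa->f', context_probs, mu, gap ** 2)
    return float(np.max(lhs / rhs))

print("policy transfer factor at rho = 1:", transfer_factor(1.0))
```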
Remark 3.2. Our definition of policy transfer resembles and is directly inspired by the notion of transfer exponent of Hanneke & Kpotufe (2019), which we refer to as the Hanneke-Kpotufe (HK) transfer exponent for distinction. A direct adaptation of the HK transfer exponent would result in:

$\forall f \in \mathcal{F} : \quad \bigl(\mathbb{E}_{D \times \pi}[(f^{*} - f)^{2}]\bigr)^{\rho} \le C\, \mathbb{E}_{D \times \mu}\bigl[(f^{*} - f)^{2}\bigr]$.   (3)

Note the difference between the LHS of Equation (2) and that of Equation (3). When defining policy transfer coefficients, we use the $\ell_2$-distance $\mathbb{E}_{D \times \mu}[(f^{*} - f)^{2}]$ w.r.t. the behavior policy to control the expected value gap $\mathbb{E}_{D \times \pi}[f^{*} - f]$, whereas the HK transfer exponent directly requires a bound on the squared distance $\mathbb{E}_{D \times \pi}[(f^{*} - f)^{2}]$ w.r.t. the target policy. (The $\ell_2$-distance from $f$ to $f^{*}$ corresponds to the excess risk of $f$ w.r.t. the squared loss when assuming realizability.) While this appears to be a small change, our notion bears a deeper connection with offline decision-making that the HK transfer exponent cannot capture. Perhaps the best way to demonstrate this is through a concrete example where the policy transfer coefficient tightly captures the learnability of offline decision-making while the HK transfer exponent fails; we return to this in Example 3.3. For now, it suffices to argue the tightness of our notion more generally. Indeed, our policy transfer exponent allows a tighter characterization of transferability for offline decision-making, as indicated by Jensen's inequality: $(\mathbb{E}_{D \times \pi}[f^{*} - f])^{2} \le \mathbb{E}_{D \times \pi}[(f^{*} - f)^{2}]$. Now, consider a random variable $\zeta$ distributed according to the Bernoulli distribution $\mathrm{Ber}(p)$. Then, we have $\mathbb{E}[\zeta^{2}] / |\mathbb{E}[\zeta]|^{2} = 1/p$. This means that, in this example, the transfer factor $C_1$ (corresponding to $\rho = 1$) in our definition is smaller than the transfer factor implied by the original definition of Hanneke & Kpotufe (2019) by a factor of $p$, which is significant for small values of $p$.

3.1. Relations with other notions of data coverage

In this section, we highlight the properties of policy transfer coefficients and compare them with other notions of data coverage considered in the offline decision-making literature. Perhaps the most common notions of data coverage are single-policy concentrability coefficients (Liu et al., 2019; Rashidinejad et al., 2021), relative condition numbers (for linear function classes) (Agarwal et al., 2021; Uehara & Sun, 2021), and data diversity (Nguyen-Tang & Arora, 2023). (The data diversity of Nguyen-Tang & Arora (2023), in fact, generalizes the Bellman error transfer coefficient of Song et al. (2022) to allow an additive error.) We demonstrate that policy transfer coefficients strictly generalize all these prior notions, in the sense that bounds on the prior notions of data coverage always imply bounds on policy transfer coefficients, but not vice versa. Specifically, there are problem instances for which the existing measures of data coverage tend to infinity, yet these problems are learnable given the characterization in terms of transfer coefficients.

Compared with concentrability coefficients. The concentrability coefficient between $\pi$ and $\mu$ is defined as $\kappa_{\pi} := \sup_{x,a} \frac{\pi(a|x)}{\mu(a|x)}$. The finiteness of $\kappa_{\pi}$ is widely used as a sufficient condition for sample-efficient offline decision-making. By definition, the policy transfer factor corresponding to the policy transfer exponent of $1$ is always upper-bounded by $\kappa_{\pi}$. The finiteness of $\kappa_{\pi}$ requires the support of $\mu$ to contain that of $\pi$. However, offline decision-making does not even need overlapping supports between the target policy and the behavior policy.
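The upper bound by $\kappa_{\pi}$ at exponent $\rho = 1$ follows from a two-step argument, spelled out below (Jensen's inequality followed by a change of measure, the latter valid whenever the support of $\pi(\cdot|x)$ is contained in that of $\mu(\cdot|x)$):

```latex
\begin{align*}
\bigl(\mathbb{E}_{D \times \pi}[f^{*} - f]\bigr)^{2}
&\le \mathbb{E}_{D \times \pi}\bigl[(f^{*} - f)^{2}\bigr]
  && \text{(Jensen's inequality)} \\
&= \mathbb{E}_{x \sim D} \sum_{a} \mu(a \mid x)\,
   \frac{\pi(a \mid x)}{\mu(a \mid x)}\,(f^{*} - f)^{2}(x, a) \\
&\le \kappa_{\pi}\, \mathbb{E}_{D \times \mu}\bigl[(f^{*} - f)^{2}\bigr],
\end{align*}
```

so $(\rho, C) = (1, \kappa_{\pi})$ always satisfies Equation (2). Example 3.3 below shows that the converse fails: $\kappa_{\pi}$ can be infinite while the policy transfer coefficients remain benign.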
Example 3.3. Consider $|\mathcal{X}| = 1$ and $\mathcal{A} = \mathbb{R}^{d}$. Let $\mathcal{F}$ be the class of $d$-dimensional halfspaces that pass through the origin, and let $\Pi$ be the set of uniform distributions over $d$-dimensional spheres centered at the origin, $\Pi = \{\mathrm{Uniform}(\{a \in \mathbb{R}^{d} : \|a\| = r\}) : r \ge 0\}$.

Example 3.3, taken from Hanneke & Kpotufe (2019), gracefully reflects the correctness of the policy transfer exponent at characterizing learnability in offline settings, whereas both the HK transfer exponent and the concentrability coefficient fail to do so. In this case, by direct computation we have $\mathbb{E}_{D \times \pi}[f^{*} - f] = 0$; thus the minimal policy transfer exponent from any behavior policy $\mu \in \Pi$ to any different policy $\pi \in \Pi$ approaches zero, i.e., $\rho_{\pi} \to 0$, while the HK transfer exponent, denoted $\rho^{\mathrm{HK}}_{\pi}$, is $1$ and the concentrability coefficient $\kappa_{\pi}$ is infinite. These different values of the data quality measures translate into different predictive bounds for offline learning in Example 3.3. For concreteness, we summarize these bounds in Table 1.

Data quality measure | Predictive bound
Policy transfer exponent ($\rho_{\pi} \to 0$) | $\to 0$
HK transfer exponent ($\rho^{\mathrm{HK}}_{\pi} = 1$) | $\propto n^{-1/2}$
Concentrability coefficient ($\kappa_{\pi} = \infty$) | $\infty$

Table 1. Predictive bounds for offline learning of Example 3.3 stemming from different measures of data quality. The bounds for the policy transfer exponent and the HK transfer exponent are obtained from Theorem 5.1 (note that our results are applicable to HK transfer exponents as well), while the bound for the concentrability coefficient is obtained from Rashidinejad et al. (2021).

Which of the above data quality measures gives a tight characterization of the learnability of the problem in Example 3.3? Under the realizability assumption, the value $V^{\pi}$ of any policy in $\Pi$ is equal to $1/2$, regardless of the policy. Thus, the true sub-optimality is zero. This is tightly matched by the predictive bound of our policy transfer exponent, while the HK transfer exponent and the concentrability coefficient give vacuous bounds, as shown in Table 1.
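As a quick numerical sanity check of Example 3.3 (a Monte-Carlo illustration, not part of the formal argument), the snippet below verifies that any homogeneous halfspace has mean $1/2$ under any sphere-uniform policy, so $\mathbb{E}_{D \times \pi}[f^{*} - f] \approx 0$ for every pair of halfspaces and every radius.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5

def sphere_uniform(radius, size):
    """Sample uniformly from the sphere {a in R^d : ||a|| = radius}."""
    g = rng.normal(size=(size, d))
    return radius * g / np.linalg.norm(g, axis=1, keepdims=True)

def halfspace(w):
    """A homogeneous halfspace through the origin: f_w(a) = 1{<w, a> >= 0}."""
    return lambda a: (a @ w >= 0).astype(float)

f_star, f = halfspace(rng.normal(size=d)), halfspace(rng.normal(size=d))
for radius in [0.1, 1.0, 10.0]:
    a = sphere_uniform(radius, 200_000)
    # By symmetry each halfspace has mean 1/2 under pi, so E_pi[f* - f] vanishes.
    print(radius, float(np.mean(f_star(a) - f(a))))
```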
Another interesting remark is that, for offline decision-making, it is not always necessary to learn the true reward function, a task that requires a sample complexity of $\Theta(\frac{d}{\epsilon})$; this is what the HK transfer exponent at the value of $1$ captures; see (Hanneke & Kpotufe, 2019, Example 1).

Compared with the data diversity of Nguyen-Tang & Arora (2023). The notion of data diversity, recently proposed by Nguyen-Tang & Arora (2023), is motivated by the notion of task diversity in transfer learning (Tripuraneni et al., 2020). Nguyen-Tang & Arora (2023) show that data diversity subsumes both concentrability coefficients and relative condition numbers, and that it can be used to derive strong guarantees for offline decision-making and state-of-the-art bounds when the function class is finite or linear. Their data diversity notion is, in fact, the policy transfer factor corresponding to the policy transfer exponent $\rho = 1$. The following example, which is modified from (Hanneke & Kpotufe, 2019, Example 3), shows that the data diversity can be infinite while the minimal transfer exponent is finite.

Example 3.4. Consider $|\mathcal{X}| = 1$, $\mathcal{A} = [-1, 1]$, and $\mathcal{F} = \{f_t : t \in [-1, 1]\}$, where $f_t(a) = \mathbb{1}\{a \ge t\}$ is a one-dimensional threshold, and let $f_0$ be the optimal mean reward function. Let $\iota > 0$ be any positive scalar. Consider the following behavior distribution: $\mu(a) \propto a^{2\iota - 1}$ for $a \ge 0$, and $\mu(a)$ is uniform for $a \in [-1, 0]$. Let $\pi$ be the uniform distribution over $\mathcal{A}$. By direct computation, for any $t \ge 0$, $|\mathbb{E}_{\pi}[f_0 - f_t]| \propto t$ and $\mathbb{E}_{\mu}[(f_0 - f_t)^{2}] \propto t^{2\iota}$. Thus, no $\rho < 2\iota$ can be a policy transfer exponent, as $\lim_{t \to 0} |\mathbb{E}_{\pi}[f_0 - f_t]|^{\rho} / \mathbb{E}_{\mu}[(f_0 - f_t)^{2}] = \infty$. In particular, the bound of Nguyen-Tang & Arora (2023) becomes vacuous. However, as long as the squared Bellman error $\mathbb{E}_{\mu}[(f_0 - f_t)^{2}]$ can predict the Bellman error $|\mathbb{E}_{\pi}[f_0 - f_t]|$, one should expect that offline learning is still possible, albeit at a slower rate.

As our policy transfer coefficients cover the data diversity of Nguyen-Tang & Arora (2023) as a special case ($\rho = 1$), which, in turn, is a generalization of the relative condition number, we refer to Nguyen-Tang & Arora (2023) for a detailed comparison with the relative condition number.

4. Lower Bounds

Let $\mathcal{B}(\rho, C, d)$ denote the class of offline learning problem instances with any distribution $D$ over contexts and rewards, any function class $\mathcal{F}$ that satisfies Assumptions 2.1 and 2.3, a behavior policy $\mu$, and all policies $\pi \in \Pi$ whose policy transfer coefficients w.r.t. $\mu$ are $(\rho, C)$. For this class, we give a lower bound on the sub-optimality of any offline learning algorithm.

Theorem 4.1. For any $C > 0$, $\rho \ge 1$, and $n \ge d \max\{2^{2\rho - 4} C,\ C^{\frac{1}{\rho - 1}}/32\}$, we have

$\inf_{\hat{\pi}(\cdot)} \sup_{(D, \mu, \pi, \mathcal{F}) \in \mathcal{B}(\rho, C, d)} \mathbb{E}_{D}\bigl[\mathrm{SubOpt}^{\pi}_{D}(\hat{\pi})\bigr] \gtrsim \left(\frac{Cd}{n}\right)^{\frac{1}{2\rho}},$

where the infimum is taken over all offline algorithms $\hat{\pi}(\cdot)$ (randomized mappings from the offline data to a policy).

The lower bound in Theorem 4.1 is information-theoretic, i.e., it applies to any algorithm for the problem class $\mathcal{B}(\rho, C, d)$. The lower bound is obtained by constructing a set of hard contextual bandit (CB) instances $\{D_i\}$ that are supported on $d$ data points. Then, for each $D_i$, we pick the hardest comparator policy $\pi = \pi^{*}_{D_i}$ and design a behavior policy that satisfies the policy transfer condition. We pick a simple enough function class that satisfies realizability and ensures that the policy transfer exponents and the pseudo-dimension are bounded. We then proceed to show that, for a given behavior policy $\mu$ and any two CB instances $D_i$ and $D_j$ that are close to each other (i.e., $\mathrm{KL}[(D_i \times \mu)^{n} \,\|\, (D_j \times \mu)^{n}]$ is small), the corresponding optimal policies disagree. A complete proof is given in Appendix B.

5. Upper Bounds

Next, we show that there exists an offline learning algorithm that is agnostic to the minimal policy transfer coefficient of any policy, yet can compete uniformly with all comparator policies, as long as their minimal policy transfer exponent is finite. For this algorithm, we give an upper bound for VC-type classes that matches the lower bound in the previous section up to log factors, ignoring the dependence on $K = |\mathcal{A}|$. For more general function classes, we only provide upper bounds.

The general recipe for our algorithm (Algorithm 1) is rather standard. We follow the actor-critic framework for offline RL studied in several prior works (Zanette et al., 2021; Xie et al., 2021a; Nguyen-Tang & Arora, 2023). The algorithm alternates between computing a pessimistic estimate of the critic and improving the actor with the celebrated Hedge algorithm (Freund & Schapire, 1997).

Algorithm 1 Hedge for Offline Decision-Making (OfDM-Hedge)
1: Input: Offline data $S$, function class $\mathcal{F}$
2: Hyperparameters: Confidence parameter $\beta$, learning rate $\eta$, number of iterations $T$
3: Initialize $\pi_1(\cdot|x) = \mathrm{Uniform}(\mathcal{A})$ for all $x \in \mathcal{X}$
4: for $t = 1$ to $T$ do
5:   Pessimism: $f_t = \arg\min_{f \in \mathcal{F} :\ \hat{P} l_f \le \hat{P} l_{\hat{f}} + \beta} \hat{P} f(\cdot, \pi_t)$, where $\hat{f} \in \arg\min_{f \in \mathcal{F}} \hat{P} l_f$
6:   Hedge: $\pi_{t+1}(a|x) \propto \pi_t(a|x)\, e^{\eta f_t(x, a)}$ for all $(x, a)$
7: end for
8: Output: A randomized policy $\hat{\pi}$ given by the uniform distribution over $\{\pi_t\}_{t \in [T]}$.
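A schematic implementation of Algorithm 1 for a finite context set, finite action set, and finite function class is given below; solving the constrained arg-min by enumeration and representing $\mathcal{F}$ as an array of reward tables are simplifying assumptions for illustration.

```python
import numpy as np

def ofdm_hedge(S, F, beta, eta, T):
    """S: list of (x, a, r); F: array of shape (n_funcs, n_contexts, n_actions) of candidate reward models."""
    xs = np.array([x for x, _, _ in S])
    acts = np.array([a for _, a, _ in S])
    rs = np.array([r for _, _, r in S])
    n_contexts, n_actions = F.shape[1], F.shape[2]

    # Empirical squared losses  P_hat l_f  for every f in F, and the version space used by the pessimism step.
    losses = np.mean((F[:, xs, acts] - rs) ** 2, axis=1)
    feasible = losses <= losses.min() + beta

    pi = np.full((n_contexts, n_actions), 1.0 / n_actions)
    iterates = []
    for _ in range(T):
        # Pessimism: among feasible f, pick the one minimizing the empirical value  P_hat f(., pi_t).
        emp_values = np.mean(np.sum(F[:, xs, :] * pi[xs], axis=2), axis=1)
        emp_values[~feasible] = np.inf
        f_t = F[np.argmin(emp_values)]

        iterates.append(pi.copy())
        # Hedge update: pi_{t+1}(a|x) proportional to pi_t(a|x) * exp(eta * f_t(x, a)).
        pi = pi * np.exp(eta * f_t)
        pi = pi / pi.sum(axis=1, keepdims=True)

    return iterates  # the output policy is the uniform mixture over these iterates
```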
The following result bounds the sub-optimality of OfDM-Hedge up to absolute constants, which we ignore for ease of exposition (see Theorem C.1 for the exact constants).

Theorem 5.1. Fix any $\delta \in (0, 1)$ and $\epsilon \ge 0$, and assume that $|\mathcal{A}| = K$. Then, for any $(D, \mathcal{F})$ such that Assumption 2.1 holds, invoking Algorithm 1 with $\beta = \Theta\bigl(\epsilon + \frac{\log(N_1(\mathcal{F}, \epsilon, n)/\delta)}{n}\bigr)$ and $\eta = \Theta\bigl(\sqrt{\log K / T}\bigr)$ returns a policy $\hat{\pi}$ such that, with probability at least $1 - \delta$, for all $\pi \in \Pi$ and $T \in \mathbb{N}$,

$\mathbb{E}[\mathrm{SubOpt}^{\pi}_{D}(\hat{\pi}) \mid S] \;\lesssim\; \left(C_{\pi}\Bigl(\epsilon + \frac{\log(N_1(\mathcal{F}, \epsilon, n)/\delta)}{n}\Bigr)\right)^{\frac{1}{2\rho_{\pi}}} + \sqrt{\epsilon + \frac{\log(N_1(\mathcal{F}(\cdot, \Pi), \epsilon, n)/\delta)}{n}} + \sqrt{\frac{\log K}{T}}.$

The upper bound consists of three main terms, which represent the transfer cost of offline decision-making, the standard statistical learning error, and the optimization error of the Hedge algorithm, respectively. Note that Theorem 5.1 does not require Assumption 2.3.

Remark 5.2 (Adaptivity to policy transfer coefficients). OfDM-Hedge does not need to know the policy transfer coefficients of any target policy, yet it adaptively competes with any target policy $\pi$ characterized by the minimal policy transfer exponent $\rho_{\pi}$ and policy transfer factor $C_{\pi}$.

Remark 5.3 (No cost blowup from the Hedge algorithm). Note that the first two terms in the upper bound of Theorem 5.1 are completely agnostic to the number of iterations $T$. Thus, scaling up $T$ to drive the last term to zero does not increase the effective dimension, a desirable property that is absent in the bounds of similar algorithms provided in Zanette et al. (2021); Xie et al. (2021a); Nguyen-Tang & Arora (2023). This difference stems from the fact that prior work uses uniform convergence over all policies that can be generated by the Hedge algorithm over $T$ steps. We, however, recognize that any policy $\pi_t$ generated by a step of the Hedge algorithm is only ever used in one single form, $f(\cdot, \pi_t)$ for $f \in \mathcal{F}$. Thus, it suffices to control the complexity of $\mathcal{F}(\cdot, \Pi)$, where $\Pi$ is the set of all possible stationary policies. The benefit of this simple trick becomes clearer in comparison with other works that suffer sub-optimal rates due to this cost blowup of the Hedge algorithm; we refer the reader to Section 7.2 for further discussion.

Remark 5.4 (Potential barrier for offline decision-making in the fast-transfer regime $\rho_{\pi} < 1$). The second term in the bound above vanishes when $|\mathcal{X}| = 1$ (i.e., in a multi-armed bandit setting). This yields error rates faster than $1/\sqrt{n}$ if the rate of transfer from $\mu$ to $\pi$ is fast (i.e., $\rho_{\pi} < 1$). One such example is Example 3.3, where $\rho_{\pi} \to 0$, which our upper bounds capture rather precisely. In the general case with $|\mathcal{X}| > 1$, OfDM-Hedge is unable to take advantage of a fast policy transfer exponent $\rho_{\pi} < 1$, as the bound also involves the second term, stemming from standard statistical learning over the context domain $\mathcal{X}$. This is quite different from the supervised learning setting of Hanneke & Kpotufe (2019), where one can always ensure a fast transfer rate when $\rho_{\pi} < 1$. Further, note that our lower bound (Theorem 4.1) excludes the fast-transfer regime $\rho_{\pi} < 1$. So, it remains unclear whether any algorithm in the offline setting can leverage fast transfer.
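One way to read Theorem 5.1 is to specialize it to a finite function class (a case not worked out explicitly in this section): taking $\epsilon \to 0$, every $L_1$ cover can be taken to be $\mathcal{F}$ itself, so $N_1(\mathcal{F}, \epsilon, n) \le |\mathcal{F}|$, and by Lemma F.5, $\log N_1(\mathcal{F}(\cdot, \Pi), \epsilon, n) \le K \log|\mathcal{F}|$. Up to constants and for $T$ taken large enough, direct substitution gives a rate of the form below; this is a sketch obtained from the substitutions above, not a separate statement proved in the paper.

```latex
\mathbb{E}\bigl[\mathrm{SubOpt}^{\pi}_{D}(\hat{\pi}) \mid S\bigr]
\;\lesssim\;
\left(\frac{C_{\pi}\,\log(|\mathcal{F}|/\delta)}{n}\right)^{\frac{1}{2\rho_{\pi}}}
\;+\;
\sqrt{\frac{K \log|\mathcal{F}| + \log(1/\delta)}{n}}.
```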
Example 5.5 (VC-type / parametric classes). Under the same setting as Theorem 5.1, if Assumption 2.3 holds, the sub-optimality of OfDM-Hedge is bounded by (ignoring $\log(1/\delta)$ terms):

$\sqrt{\frac{Kd\log(dn)}{n}} + \left(\frac{C_{\pi} K d \log(dn)}{n}\right)^{\frac{1}{2\rho_{\pi}}}.$

The VC-type parametric classes include the $d$-dimensional linear class as a special case. The bound above follows from Theorem 5.1 together with the following inequalities: $\max_{a \in \mathcal{A}} N_1(\mathcal{F}(\cdot, a), \epsilon, n) \le e(d + 1)\bigl(\frac{2e}{\epsilon}\bigr)^{d}$ (Haussler, 1995) and $\max\{N_1(\mathcal{F}(\cdot, \Pi), \epsilon, n),\ N_1(\mathcal{F}, \epsilon, n)\} \le \bigl(\max_{a \in \mathcal{A}} N_1(\mathcal{F}(\cdot, a), \epsilon/K, n)\bigr)^{K}$ (Lemma F.5). In the common large-sample, difficult-transfer regime, i.e., $n \gtrsim Kd\log(dn)$ and $\rho_{\pi} \ge 1$, the upper bounds for the parametric classes in Example 5.5 match the lower bound in Theorem 4.1, ignoring constants, log factors, and the dependence on $K$.

The upper bound in Theorem 5.1 also applies to nonparametric classes, though we do not provide a (matching) lower bound for this case; we leave that for future work.

Example 5.6 (Nonparametric classes). Under the same setting as Theorem 5.1, assume that for all $a$, the metric entropy of $\mathcal{F}(\cdot, a)$ scales polynomially with the inverse of the scale, i.e., for some $p > 0$, $\log N_1(\mathcal{F}(\cdot, a), \epsilon, n) \lesssim \epsilon^{-p}$ for all $a \in \mathcal{A}$ and $\epsilon \ge 0$. Then, the upper bound in Theorem 5.1 is of order (ignoring $\log(1/\delta)$ terms) $\bigl(C_{\pi} K\bigr)^{\frac{1}{2\rho_{\pi}}} n^{-\frac{1}{2\rho_{\pi}(p+1)}} + \sqrt{K}\, n^{-\frac{1}{2(p+1)}}$. In the typical large-sample, difficult-transfer regime, the above yields an error rate of $C\, n^{-\frac{1}{2\rho_{\pi}(p+1)}}$, which vanishes with $n$. Typical examples of nonparametric classes include infinite-dimensional linear classes, with $p = 2$ (Zhang, 2002); the class of 1-Lipschitz functions over $[0, 1]^{d}$, with $p = d$ (Slivkins, 2011); the class of Hölder-smooth functions of order $\beta$ over $[0, 1]^{d}$, with $p = d/\beta$ (Rigollet & Zeevi, 2010); and neural networks with spectrally bounded norms, with $p = 2$ (Bartlett et al., 2017).

6. Offline Data-assisted Online Decision-Making

In this section, we consider a hybrid setting where, in addition to the offline data $S = \{(x_i, a_i, r_i)\}_{i \in [n]}$, the learner is allowed to interact with the environment for $m$ rounds. The goal is to output a policy $\hat{\pi}_{\mathrm{hyb}}$ with small $\mathrm{SubOpt}^{\pi}_{D}(\hat{\pi}_{\mathrm{hyb}})$ w.r.t. a comparator $\pi$ with a high value $V^{\pi}_{D}$. We assume realizability (Assumption 2.1) and, for simplicity, we focus only on VC-type function classes with pseudo-dimension at most $d$. Let $\rho = \rho_{\pi^{*}}$ and $C = C_{\pi^{*}}$ denote the policy transfer coefficients for an optimal policy $\pi^{*}$. To avoid deviating from the main point, in this section we focus on the large-sample, difficult-transfer regime, where $n \gtrsim Kd\log(dn)$ and $\rho \ge 1$. The key question we ask is whether a learner can perform better in a hybrid setting than in a purely online or purely offline setting.

6.1. Lower bounds

We start with a lower bound for any hybrid learner on the class of problems $\mathcal{B}(\rho, C, d)$ from Section 4.

Theorem 6.1. For any $C > 0$, $\rho \ge 1$, and sample size $n \ge d \max\{2^{2\rho - 4} C,\ C^{\frac{1}{\rho - 1}}/32\}$, we have

$\inf_{\hat{\pi}_{\mathrm{hyb}}(\cdot)} \sup_{(D, \mu, \pi, \mathcal{F}) \in \mathcal{B}(\rho, C, d)} \mathbb{E}_{D}\bigl[\mathrm{SubOpt}^{\pi}_{D}(\hat{\pi}_{\mathrm{hyb}})\bigr] \gtrsim \min\left\{\left(\frac{Cd}{n}\right)^{\frac{1}{2\rho}},\ \sqrt{\frac{d}{m}}\right\},$

where the infimum is taken over all possible hybrid algorithms $\hat{\pi}_{\mathrm{hyb}}(\cdot)$.

Ignoring log factors and the dependence on $K$, the first term in the lower bound matches the upper bound for OfDM-Hedge. Therefore, if that is the dominating term, the learner does not benefit from online interaction. The second term, $\sqrt{d/m}$, matches the upper bound of the state-of-the-art online learner (Simchi-Levi & Xu, 2022) with an online interaction budget of $m$. In the regime where the latter term is dominating, there is no advantage to having offline data. To summarize, the lower bound suggests that if the policy transfer coefficient is known a priori to the hybrid learner, there is no benefit to mixing the offline data with the online data, at least in the worst case. That is, obtaining nearly minimax-optimal rates for hybrid learning is akin to either discarding the online data and running the best offline learner, or ignoring the offline data and running the best online learner. Which algorithm to run depends on the transfer coefficient.
That there is no benefit to mixing online and offline data is a phenomenon that has also been observed in the related setting of policy finetuning (Xie et al., 2021b).

6.2. Upper bounds

The discussion following the lower bound suggests different algorithmic approaches (purely offline vs. purely online) for different regimes defined in terms of the policy transfer coefficient. However, it is unrealistic to assume that the learner has prior knowledge of the transfer coefficient. We present a hybrid learning algorithm (Algorithm 2) that offers the best of both worlds: without requiring knowledge of the policy transfer coefficient, it produces a policy with nearly minimax-optimal rates. The key algorithmic idea is rather simple and natural. We invoke both an offline policy optimization algorithm and an online learner, resulting in policies $\hat{\pi}_{\mathrm{off}}$ and $\hat{\pi}_{\mathrm{on}}$, respectively. Half of the interaction budget, i.e., $m/2$ rounds, is utilized for learning $\hat{\pi}_{\mathrm{on}}$. For the remaining $m/2$ rounds, we run the EXP4 algorithm (Auer et al., 2002) with $\hat{\pi}_{\mathrm{off}}$ and $\hat{\pi}_{\mathrm{on}}$ as expert policies. The output of EXP4 is a uniform distribution over all of its iterates. We denote this randomized policy by $\hat{\pi}_{\mathrm{hyb}}$; we can equivalently represent $\hat{\pi}_{\mathrm{hyb}}$ as a distribution over $\{\hat{\pi}_{\mathrm{off}}, \hat{\pi}_{\mathrm{on}}\}$.

Algorithm 2 Hybrid Learning Algorithm
1: Input: Offline data $S$, function class $\mathcal{F}$, $m$ online interactions
2: $\hat{\pi}_{\mathrm{off}} \leftarrow \mathrm{OfflineLearner}(S, \mathcal{F})$
3: $\hat{\pi}_{\mathrm{on}} \leftarrow \mathrm{OnlineLearner}(\frac{m}{2}, \mathcal{F})$
4: $\hat{\pi}_{\mathrm{hyb}} \leftarrow \mathrm{EXP4}(\frac{m}{2}, \{\hat{\pi}_{\mathrm{off}}, \hat{\pi}_{\mathrm{on}}\})$
5: Output: $\hat{\pi}_{\mathrm{hyb}}$

Proposition 6.2. Algorithm 2 returns a randomized policy $\hat{\pi}_{\mathrm{hyb}}$ such that, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,

$\max_{\pi \in \{\hat{\pi}_{\mathrm{off}}, \hat{\pi}_{\mathrm{on}}\}} \mathbb{E}_{D}[\mathrm{SubOpt}^{\pi}_{D}(\hat{\pi}_{\mathrm{hyb}}) \mid S, S_{\mathrm{on}}] \le 4\sqrt{\frac{2\log 2}{m}} + 32\sqrt{\frac{\log(\log(m/2)/\delta)}{m}},$

where $S$ is the offline data and $S_{\mathrm{on}}$ is the online data collected by $\hat{\pi}_{\mathrm{on}}$ in Algorithm 2.

The first term in the bound above comes from the guarantee for EXP4 with two experts (Lattimore & Szepesvári, 2020, Theorem 18.3), and the remaining term results from an (improved) online-to-batch conversion (Nguyen-Tang et al., 2023, Lemma A.5) together with the assumption that contexts are sampled i.i.d. Note that the result above does not require any assumption on $\mathcal{F}$ and $D$.

Next, we implement Algorithm 2 using FALCON+ (Simchi-Levi & Xu, 2022, Algorithm 2) as the online learner and OfDM-Hedge as the offline learner. Then, we have the following bound on the output of the hybrid learning algorithm.

Theorem 6.3. Assume that the closure of $\mathcal{F}$ is convex. Then, given that Assumptions 2.1 and 2.3 hold, we have

$\mathbb{E}_{D}\bigl[\mathrm{SubOpt}^{\pi^{*}_{D}}_{D}(\hat{\pi}_{\mathrm{hyb}})\bigr] \;\tilde{\lesssim}\; \sqrt{\frac{1}{m}} + \min\left\{\left(\frac{C K d}{n}\right)^{\frac{1}{2\rho}},\ K\sqrt{\frac{d}{m}}\right\}.$

In the bound above, the first term is the standard error rate of EXP4, the term $\bigl(\frac{C K d}{n}\bigr)^{\frac{1}{2\rho}}$ is the error rate of OfDM-Hedge (see Theorem 5.1), and $K\sqrt{d/m}$ is the error rate of FALCON+ (which we tailor to our setting in Appendix D.1). Applying Proposition 6.2 yields the minimum over the error rates of offline and online learning. In the regime where the EXP4 cost is dominated by the second term, i.e., when $m \gtrsim \bigl(\frac{n}{C K d}\bigr)^{1/\rho}$, the upper bound on the sub-optimality of hybrid learning nearly matches the lower bound, modulo log factors and the pesky dependence on $K$.
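Below is a schematic of Algorithm 2 with the offline learner, the online learner, and the environment left as black-box callables; the interface names (offline_learner, online_learner, env.sample_context, env.play) and the importance-weighted EXP4 update are illustrative assumptions rather than details fixed by the paper.

```python
import numpy as np

def hybrid_learner(S, F, m, offline_learner, online_learner, env):
    """Algorithm 2: aggregate an offline-learned and an online-learned policy with EXP4.

    Hypothetical interfaces: offline_learner(S, F) -> policy, online_learner(budget, F) -> policy,
    env.sample_context() -> x, env.play(x, action_dist) -> (reward, sampled_action);
    a policy maps a context x to a probability vector over the K actions.
    """
    pi_off = offline_learner(S, F)            # line 2
    pi_on = online_learner(m // 2, F)         # line 3: half the interaction budget

    T = m // 2                                # line 4: EXP4 with experts {pi_off, pi_on}
    eta = np.sqrt(2.0 * np.log(2) / max(T, 1))
    w, mixtures = np.ones(2), []
    for _ in range(T):
        p = w / w.sum()
        mixtures.append(p)
        x = env.sample_context()
        experts = np.stack([pi_off(x), pi_on(x)])      # shape (2, K)
        action_dist = p @ experts
        r, a = env.play(x, action_dist)
        # Importance-weighted reward estimates, as in standard EXP4.
        r_hat = experts[:, a] * r / max(action_dist[a], 1e-12)
        w = w * np.exp(eta * r_hat)

    # Output pi_hyb: a uniform mixture over the EXP4 iterates, equivalently a distribution over the two experts.
    return np.mean(mixtures, axis=0), (pi_off, pi_on)
```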
6.3. Related works for the hybrid setting

Xie et al. (2021b) consider a related setting of policy finetuning, where the online learner is given an additional reference policy $\mu$ that is close to the optimal policy and can collect data using $\mu$ at any point. They do not consider function approximation and utilize single-policy concentrability coefficients. Song et al. (2022) consider the same hybrid setting as ours (albeit for RL) and show that offline data of good quality can help avoid the need to explore, thereby resulting in an oracle-efficient algorithm, which in general is not possible. The statistical complexity they establish for their algorithm is not minimax-optimal and can get arbitrarily worse with the quality of the offline data. Our guarantees, on the other hand, are adaptive, as they revert to online learning guarantees when the quality of the offline data is low. Wagenmaker & Pacchiano (2023) give instance-dependent bounds for hybrid RL with linear function approximation under a uniform coverage condition. Other related works include hybrid RL in tabular MDPs (Li et al., 2023), a Bayesian framework for incorporating offline data into the prior distribution (Tang et al., 2023), and one-shot online learning using offline data (Zhang & Zanette, 2023).

7. Offline Decision-Making in MDPs

In this section, we extend our results to offline decision-making in Markov decision processes (MDPs). We show that the key insights developed for the contextual bandit model extend naturally to offline learning in MDPs, as we establish nearly matching upper and lower bounds.

Setup. Let $M = \mathrm{MDP}(\mathcal{X}, \mathcal{A}, [H], \{P_h\}_{h \in [H]}, \{r_h\}_{h \in [H]})$ denote an episodic Markov decision process with state space $\mathcal{X}$, action space $\mathcal{A}$, horizon length $H$, transition kernels $P_h : \mathcal{X} \times \mathcal{A} \to \Delta(\mathcal{X})$, and mean reward functions $r_h : \mathcal{X} \times \mathcal{A} \to [0, 1]$. For any policy $\pi = (\pi_1, \ldots, \pi_H)$, where $\pi_h : \mathcal{X} \to \Delta(\mathcal{A})$, let $\{Q^{\pi}_h\}_{h \in [H]}$ and $\{V^{\pi}_h\}_{h \in [H]}$ denote the action-value functions and the state-value functions, respectively. For any policy $\pi$, define the Bellman operator $[\mathcal{T}^{\pi}_{h} g](x, a) := \mathbb{E}_{x' \sim P_h(\cdot|x,a)}\bigl[r_h(x, a) + g(x', \pi_{h+1})\bigr]$. For simplicity, we assume that the MDP starts from the same initial state $s_1$ in every episode. Here $\mathbb{E}_{\pi}[\cdot]$ denotes the expectation with respect to the randomness of the trajectory $(x_h, a_h, \ldots, x_H, a_H)$, with $a_i \sim \pi_i(\cdot|x_i)$ and $x_{i+1} \sim P_i(\cdot|x_i, a_i)$ for all $i$. We assume that $|r_h| \le 1$ for all $h$. The learner has access to a dataset $S = \{(x^{(t)}_h, a^{(t)}_h, r^{(t)}_h)\}_{h \in [H], t \in [n]}$ collected using a behavior policy $\mu$. Define the (value) sub-optimality as $\mathrm{SubOpt}^{\pi}_{M}(\hat{\pi}) := V^{\pi}_{1}(s_1) - V^{\hat{\pi}}_{1}(s_1)$. Wherever clear, we drop the subscript $M$ in $Q^{\pi}_{M}$, $V^{\pi}_{M}$, $d^{\pi}_{M}$, and $\mathrm{SubOpt}^{\pi}_{M}(\hat{\pi})$.

Function approximation. We consider a function approximation class $\mathcal{F} = (\mathcal{F}_1, \ldots, \mathcal{F}_H)$, where $\mathcal{F}_h \subseteq [0, H - h + 1]^{\mathcal{X} \times \mathcal{A}}$. We make the following standard assumptions regarding how the function class $\mathcal{F}$ interacts with the underlying MDP $M$.

Assumption 7.1 (Realizability). For all $\pi \in \Pi$ and $h \in [H]$, $Q^{\pi}_{h} \in \mathcal{F}_h$.

Assumption 7.2 (Bellman completeness). For all $\pi \in \Pi$ and $h \in [H]$: if $f_{h+1} \in \mathcal{F}_{h+1}$, then $\mathcal{T}^{\pi}_{h} f_{h+1} \in \mathcal{F}_h$.

For simplicity, we focus on VC-type/parametric $\mathcal{F}$, though, again, our upper bounds yield a vanishing error rate for nonparametric classes as well.

Assumption 7.3. $\sup_{h \in [H], a \in \mathcal{A}} \mathrm{Pdim}(\mathcal{F}_h(\cdot, a)) \le d$.

Definition 7.4 (Policy transfer coefficients). Given any policy $\pi$, the minimal policy transfer exponent $\rho_{\pi}$ w.r.t. $(M, \mathcal{F}, \mu)$ is the smallest $\rho \ge 0$ such that there exists a finite constant $C$ with, for every $h \in [H]$,

$\forall f_h, g_h \in \mathcal{F}_h : \quad |\mathbb{E}_{\pi}[f_h - g_h]|^{2\rho} \le C\, \mathbb{E}_{\mu}[(f_h - g_h)^{2}]$.   (4)

The smallest such $C$ w.r.t. $\rho_{\pi}$, denoted $C_{\pi}$, is called the policy transfer factor. The pair $(\rho_{\pi}, C_{\pi})$ is said to be the policy transfer coefficient of $\pi$.
7.1. Lower bounds

Let $\mathcal{M}(\rho, C, d)$ denote the class of offline learning problem instances with any MDP $M$, any function class $\mathcal{F}$ that satisfies Assumptions 7.1, 7.2, and 7.3, a behavior policy $\mu$, and all policies $\pi \in \Pi$ whose policy transfer coefficients w.r.t. $\mu$ are $(\rho, C)$. For this class, we give a lower bound on the sub-optimality of any offline learning algorithm.

Theorem 7.5. For $\rho \ge 1$ and $n \ge \frac{C d H^{2\rho}}{32}$, we have

$\inf_{\hat{\pi}} \sup_{(M, \mathcal{F}, \pi, \mu) \in \mathcal{M}(\rho, C, d)} \mathbb{E}_{M}\bigl[\mathrm{SubOpt}^{\pi}_{M}(\hat{\pi})\bigr] \gtrsim \left(\frac{H^{2} C d}{n}\right)^{\frac{1}{2\rho}}.$

7.2. Upper bounds

We now establish upper bounds for this setting. Due to space limitations, we only state the main result here and defer the details (including concrete algorithms) to Appendix E.2.

Theorem 7.6. Let $\delta > 0$ and assume that $|\mathcal{A}| = K$. Then, there exists a learning algorithm that, for any problem instance in the set $\mathcal{M}(\rho, C, d)$ and given offline data $S$ of size $n$, returns a policy $\hat{\pi}$ such that, with probability at least $1 - \delta$ over the randomness of generating $S$, we have

$\mathbb{E}\bigl[\mathrm{SubOpt}^{\pi}_{M}(\hat{\pi}) \mid S\bigr] \;\tilde{\lesssim}\; H \left(\frac{H^{2} C_{\pi}\bigl(Kd + \log(1/\delta)\bigr)}{n}\right)^{\frac{1}{2\rho_{\pi}}}$

relative to any comparator $\pi \in \Pi$.

There is a gap of $O(H)$ between our upper and lower bounds, which can potentially be closed using variance-aware algorithms. Nonetheless, Theorem 7.6 improves and generalizes the results of Xie et al. (2021a) and Nguyen-Tang & Arora (2023) on several fronts. First, since policy transfer coefficients subsume other notions of data coverage, our results hold for a larger class of problem instances. Second, our results hold for any function class with a finite $L_1$ covering number. In contrast, the guarantees of Xie et al. (2021a) hold only for finite function classes, and those of Nguyen-Tang & Arora (2023) hold only for function classes with finite domain-wide $L_\infty$ covering numbers. Third, even when specialized to the setting of Xie et al. (2021a); Nguyen-Tang & Arora (2023) (i.e., $\rho_{\pi} = 1$ and a finite function class), our bounds decay as $n^{-1/2}$, which is optimal, whereas their bounds scale as $n^{-1/4}$ in the worst case. Finally, we provide a lower bound for general function approximation.

8. Discussion

We study the statistical complexity of offline decision-making with value function approximation. We identify a large class of offline learning problems, characterized by the pseudo-dimension of the value function class and a new characterization of the offline data, for which we provide tight minimax lower and upper bounds. We also provide insights into the role of offline data in online decision-making from a minimax perspective.

We remark that our results do not imply that pseudo-dimension and policy transfer coefficients are necessary conditions for learnability in offline decision-making with function approximation. Consequently, there are several notable gaps in our current understanding. First, our lower bounds apply only to parametric function classes, i.e., function classes with finite pseudo-dimension, whereas our upper bounds also hold for non-parametric function classes. Can we provide a lower bound for general function classes, including non-parametric ones? Second, in the hybrid setting with a parametric function class, the upper bound of our adaptive algorithm matches the lower bound only in the regime where $m \gtrsim \bigl(\frac{n}{C K d}\bigr)^{1/\rho}$. Is it possible to design a fully adaptive algorithm (i.e., without knowledge of the policy transfer coefficients) whose upper bound matches the lower bound in every regime of $m$, including the practical scenario where the online exploration budget $m$ is small?
Third, what are the necessary conditions for the value function class and the data quality that fully characterize the statistical complexity of offline decision-making with value function approximation? Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Acknowledgements This research was supported, in part, by DARPA GARD award HR00112020004 and NSF CAREER award IIS1943251. Agarwal, A., Dud ık, M., Kale, S., Langford, J., and Schapire, R. Contextual bandit learning with predictable rewards. In Artificial Intelligence and Statistics, pp. 19 26. PMLR, 2012. Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. J. Mach. Learn. Res., 22(98):1 76, 2021. Alon, N., Ben-David, S., Cesa-Bianchi, N., and Haussler, D. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM (JACM), 44(4):615 631, 1997. Attias, I., Hanneke, S., Kalavasis, A., Karbasi, A., and Velegkas, G. Optimal learners for realizable regression: PAC learning and online learning. ar Xiv preprint ar Xiv:2307.03848, 2023. Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48 77, 2002. On The Statistical Complexity of Offline Decision-Making Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrallynormalized margin bounds for neural networks. Advances in neural information processing systems, 30, 2017. Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1 17, 2019. Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine learning, 79:151 175, 2010. Chen, J. and Jiang, N. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pp. 1042 1051. PMLR, 2019. Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208 214. JMLR Workshop and Conference Proceedings, 2011. Cucker, F. and Smale, S. On the mathematical foundations of learning. Bulletin of the American mathematical society, 39(1):1 49, 2002. David, S. B., Lu, T., Luu, T., and P al, D. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 129 136. JMLR Workshop and Conference Proceedings, 2010. Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 2005. Foster, D. and Rakhlin, A. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pp. 3199 3210. PMLR, 2020. Foster, D., Agarwal, A., Dud ık, M., Luo, H., and Schapire, R. Practical contextual bandits with regression oracles. In International Conference on Machine Learning, pp. 1539 1548. PMLR, 2018. Foster, D. J., Krishnamurthy, A., Simchi-Levi, D., and Xu, Y. Offline reinforcement learning: Fundamental barriers for value function approximation. ar Xiv preprint ar Xiv:2111.10919, 2021. Freund, Y. 
and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119 139, 1997. Germain, P., Habrard, A., Laviolette, F., and Morvant, E. A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In International conference on machine learning, pp. 738 746. PMLR, 2013. Hanneke, S. and Kpotufe, S. On the value of target data in transfer learning. Advances in Neural Information Processing Systems, 32, 2019. Haussler, D. Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69 (2):217 232, 1995. Krishnamurthy, A., Agarwal, A., Huang, T.-K., Daum e III, H., and Langford, J. Active learning for cost-sensitive classification. Journal of Machine Learning Research, 20 (65):1 50, 2019. Lange, S., Gabel, T., and Riedmiller, M. Batch reinforcement learning. In Reinforcement learning, pp. 45 73. Springer, 2012. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press, 2020. Le Cun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436 444, 2015. Lee, W. S., Bartlett, P. L., and Williamson, R. C. The importance of convexity in learning with squared loss. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pp. 140 146, 1996. Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. ar Xiv preprint ar Xiv:2005.01643, 2020. Li, G., Zhan, W., Lee, J. D., Chi, Y., and Chen, Y. Reward-agnostic fine-tuning: Provable statistical benefits of hybrid reinforcement learning. ar Xiv preprint ar Xiv:2305.10282, 2023. Li, J., Monroe, W., Ritter, A., Galley, M., Gao, J., and Jurafsky, D. Deep reinforcement learning for dialogue generation. ar Xiv preprint ar Xiv:1606.01541, 2016. Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. Off-policy policy gradient with stationary distribution correction. In Globerson, A. and Silva, R. (eds.), Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, volume 115 of Proceedings of Machine Learning Research, pp. 1180 1190. AUAI Press, 2019. Mansour, Y., Mohri, M., and Rostamizadeh, A. Multiple source adaptation and the R enyi divergence. ar Xiv preprint ar Xiv:1205.2628, 2012. On The Statistical Complexity of Offline Decision-Making Mehta, N. A. and Williamson, R. C. From stochastic mixability to fast rates. Advances in Neural Information Processing Systems, 27, 2014. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. nature, 518(7540): 529 533, 2015. Munos, R. and Szepesv ari, C. Finite-time bounds for fitted value iteration. J. Mach. Learn. Res., 9:815 857, 2008. Nguyen-Tang, T. and Arora, R. On sample-efficient offline reinforcement learning: Data diversity, posterior sampling and beyond. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. Nguyen-Tang, T., Yin, M., Gupta, S., Venkatesh, S., and Arora, R. On instance-dependent bounds for offline reinforcement learning with linear function approximation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 9310 9318, 2023. 
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730 27744, 2022. Pollard, D. Convergence of stochastic processes. David Pollard, 1984. Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 34:11702 11716, 2021. Rigollet, P. and Zeevi, A. Nonparametric bandits with covariates. ar Xiv preprint ar Xiv:1003.1630, 2010. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140 1144, 2018. Simchi-Levi, D. and Xu, Y. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Mathematics of Operations Research, 47(3):1904 1931, 2022. Slivkins, A. Contextual bandits with similarity information. In Proceedings of the 24th annual Conference On Learning Theory, pp. 679 702. JMLR Workshop and Conference Proceedings, 2011. Song, Y., Zhou, Y., Sekhari, A., Bagnell, J. A., Krishnamurthy, A., and Sun, W. Hybrid RL: Using both offline and online data can make RL efficient. ar Xiv preprint ar Xiv:2210.06718, 2022. Srebro, N., Sridharan, K., and Tewari, A. Optimistic rates for learning with a smooth loss. ar Xiv preprint ar Xiv:1009.3896, 2010. Sugiyama, M., Suzuki, T., and Kanamori, T. Density ratio estimation in machine learning. Cambridge University Press, 2012. Tang, D., Jain, R., Hao, B., and Wen, Z. Efficient online learning with offline datasets for infinite horizon mdps: A bayesian approach. ar Xiv preprint ar Xiv:2310.11531, 2023. Tripuraneni, N., Jordan, M., and Jin, C. On the theory of transfer learning: The importance of task diversity. Advances in neural information processing systems, 33: 7852 7862, 2020. Tsybakov, A. B. On nonparametric estimation of density level sets. The Annals of Statistics, 25(3):948 969, 1997. Uehara, M. and Sun, W. Pessimistic model-based offline reinforcement learning under partial coverage. ar Xiv preprint ar Xiv:2107.06226, 2021. Van Erven, T., Grunwald, P., Mehta, N. A., Reid, M., Williamson, R., et al. Fast rates in statistical and online learning. 2015. Vapnik, V. The nature of statistical learning theory. Springer science & business media, 2013. Vapnik, V. and Chervonenkis, A. Y. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2): 264 280, 1971. Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in Star Craft II using multi-agent reinforcement learning. Nature, 575 (7782):350 354, 2019. Wagenmaker, A. and Pacchiano, A. Leveraging offline data in online reinforcement learning. In International Conference on Machine Learning, pp. 35300 35338. PMLR, 2023. Watkins, A., Ullah, E., Nguyen-Tang, T., and Arora, R. Optimistic rates for multi-task representation learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 
On The Statistical Complexity of Offline Decision-Making Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 34:6683 6694, 2021a. Xie, T., Jiang, N., Wang, H., Xiong, C., and Bai, Y. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in neural information processing systems, 34:27395 27407, 2021b. Yin, M. and Wang, Y.-X. Towards instance-optimal offline reinforcement learning with pessimism. Advances in neural information processing systems, 34, 2021. Zanette, A., Wainwright, M. J., and Brunskill, E. Provable benefits of actor-critic methods for offline reinforcement learning. Advances in neural information processing systems, 34:13626 13640, 2021. Zhang, R. and Zanette, A. Policy finetuning in reinforcement learning via design of experiments using offline data. ar Xiv preprint ar Xiv:2307.04354, 2023. Zhang, T. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527 550, 2002. Zhang, T. Mathematical analysis of machine learning algorithms. Cambridge University Press, 2023. On The Statistical Complexity of Offline Decision-Making A. Uniform Bernstein s Inequality A central tool for our analysis is the uniform concentrability of a subset of functions near the ERM (empirical risk minimizer) predictor. In order to be able to match the lower bounds of offline decision-making (at least) for parametric classes, we require that every ball centered around the ERM predictor within a sufficiently small radius (in fact, of the order of O( 1 n)) has an O( 1 n) order in the excess risk. While fast rates are in general not possible, they are so under certain conditions (Van Erven et al., 2015), including the bounded, squared loss in parametric classes that we consider. A uniform Bernstein s inequality is also presented in (Cucker & Smale, 2002, Theorem B), but using a strong notion of domain-wide L covering numbers. (Zhang, 2023, Theorem 3.21) presents a version of uniform Bernstein s inequality using the population L1 covering numbers. The population L1 however requires the knowledge of the data distribution. Here, we present a more practical version of uniform Bernstein s inequality using the empirical L1 covering numbers, which is, at least in principle, computable given the empirical data. More importantly, and also as a key technical result in this section, we prove in Proposition A.3 a uniform Bernstein s inequality for Bellman-like loss functions using the empirical L1 covering numbers. Proposition A.3 applies to handle the data structure generated by RL and, as a special case, naturally applies to contextual bandits. These results are central to our analysis tool and might be of independent interest. Notably, our proof for Proposition A.3 relies on an elegant argument of localizing a function class into a set of balls centered around the covering functions, and then performing uniform convergence in each of such local balls before combining them via a union bound. This localization argument is inspired by a similar argument by (Mehta & Williamson, 2014). Before stating the uniform Bernstein s inequality for Bellman-like loss functions in Proposition A.3, we start with the uniform Bernstein s inequality for a simpler case, which we also use in our analysis and also serves a good point for demonstrating the localization argument. A.1. 
Uniform Bernstein s inequality for generic case Proposition A.1 (Uniform Bernstein s inequality for generic case). Let G be a set of functions g : Z [ b, b]. Fix n N. Denote ˆEg := 1 n Pn i=1 g(zi) where {zi}i i.i.d. P and Eg = Ez P [g(z)], and V[g] is the variance of g(z). Then, for any δ > 0, with probability at least 1 δ, we have g G : Eg ˆEg inf ϵ>0 2V[g] log(2N1(G, ϵ, n)/δ) n + 62b log(6N1(G, ϵ, n)/δ) Fix ϵ > 0 and let N = N1(G, ϵ, n). Let {gi}i [N] be an ϵ-cover of G w.r.t. L1( ˆPn). For any i [N], denote Gi = {g G : ˆE|g gi| ϵ}. We have G i [N]Gi. The proof of Proposition A.1 relies on the following lemma. Lemma A.2. Fix any i [N]. With probability at least 1 δ, sup g Gi E|g gi| 12ϵ + 12b log(3/δ) Proof of Lemma A.2. The key idea to obtain fast rates in the above bounds is leveraging the non-negativity of the random objects of interest by employing Lemma F.2. With probability at least 1 δ, for any g Gi, we have E|g gi| 4ˆE|g gi| + inf ϵ 8ϵ + 12b log(3N1(Gi, ϵ , n)/δ) 12ϵ + 12b log(3/δ) where the second inequality follows from that ˆE|g gi| ϵ, that we choose ϵ = ϵ, and that the ϵ-ball Gi can be ϵ-covered by one point, thus N1(Gi, ϵi, n) = 1. On The Statistical Complexity of Offline Decision-Making Proof of Proposition A.1. Denote the event E = E1 E2, where i [N], Egi ˆEgi 2b log(N/δ) 2V[gi] log(N/δ) sup i [N] sup g Gi E|g gi| 12ϵ + 12b log(3N/δ) Now consider under the event E. Since G i [N]Gi, for any g G, there must exist i [N] such that g Gi (consequently, ˆE|g gi| ϵ). Also notice that V[g] V[gi] = E[g2] E[g2 i ] + E[gi]2 E[g]2 2b E|g gi| + 2b|E[gi] E[g]| 4b E|g gi|. Thus, we have Eg ˆEg = Egi ˆEgi + E(g gi) + ˆE(gi g) Egi ˆEgi + E|g gi| + ˆE|g gi| 2b log(N/δ) 2V[gi] log(N/δ) n + E|g gi| + ˆE|g gi| 2b log(N/δ) 2V[g] log(N/δ) 2|V[g] V[gi]| log(N/δ) n + E|g gi| + ˆE|g gi| 2b log(N/δ) 2V[g] log(N/δ) n + b log(N/δ) b |V[g] V[gi]| + E|g gi| + ˆE|g gi| 5b log(N/δ) 2V[g] log(N/δ) n + 5E|g gi| + ˆE|g gi| 5b log(N/δ) 2V[g] log(N/δ) n + 5(12ϵ + 12b log(3N/δ) 62b log(3N/δ) 2V[g] log(N/δ) where the fourth inequality follows AM-GM inequality . Finally, note that, by Bernstein s inequality and the union bound, Pr(E1) 1 δ; by Lemma A.2 and the union bound, Pr(E2) 1 δ. Thus, by the union bound again, Pr(E) 1 2δ. A.2. Uniform Bernstein s inequality for Bellman-like loss classes We now establish the uniform Bernstein s inequality for the Bellman-like loss functions. The nature of the result and the proof is similar to those for Proposition A.1, but only more involved as we deal with a more structural random tuple. Consider a tuple of random variables (x, a, r, x ) X A [0, 1] X distributed according to distribution P. For any u : X A R and g : X A R, we define the random variable M(u, g) = (u(x, a) r g(x ))2 (g (x, a) r g(x ))2, where g (x, a) := E(r,x ) P ( |x,a) [r + g(x )]. Let {(xt, at, rt, x t)}t [n] be an i.i.d. sample from P. We write Mt in replace of M when we replace (x, a, r, x ) in M by (xt, at, rt, x t). We consider function classes U {u : X A [0, b]} and G {g : X [0, b]}. We assume, for simplicity, that r + g(x ) [0, b] almost surely. Proposition A.3 (Uniform Bernstein s inequality for Bellman-like loss functions). Fix any ϵ > 0. 
With probability at least 1 δ, for any u U, g G, E[(u(x, a) g (x, a))2] 2 t=1 Mt(u, g) + inf ϵ>0 108bϵ + b2 36 log N1(U, ϵ, n) + 83 log N1(G, ϵ, n) + 108 log(12/δ) In addition, with probability at least 1 δ, for any u U, g G, t=1 Mt(u, g) inf ϵ>0 30bϵ + b2 4 log N1(U, ϵ, n) + 28 log N1(G, ϵ, n) + 28 log(6/δ) On The Statistical Complexity of Offline Decision-Making Remark A.4. Krishnamurthy et al. (2019) develop a uniform Freedman-type inequality for an indicator-type martingale difference sequence, which does not require the data-generating policy to be non-adaptive as in our case. Applying the uniform Freedman-type inequality of Krishnamurthy et al. (2019) as-is to the stochastic contextual bandit with a non-adaptive data-generating policy results in larger constants and an additional factor of log(K) (see (Foster et al., 2018) for why this additional log K is the case). However, there are deeper reasons why our uniform Bernstein s inequality is preferred over the uniform Freedman-type inequality of Krishnamurthy et al. (2019). First, it seems impossible even if we want to apply the uniform Freedman-type inequality of Krishnamurthy et al. (2019) as-is to the RL setting. The reason is that the martingale structure in the uniform Freedman-type inequality of Krishnamurthy et al. (2019) is formed only through the adaptive policy that selects actions based on the historical data, and they require i.i.d. assumption on {xt}t 1. In contrast, in RL, even when we assume that the initial state is i.i.d. from a fixed distribution, the states from h 2 are not independent once the policy that generates the offline data is adaptive. Second, the uniform Freedman-type inequality of Krishnamurthy et al. (2019) cannot seem to easily apply to the RL setting considered in Proposition A.1, where we consider multiple target regression functions g , whereas they require one fixed target function. Fix ϵ > 0. For simplicity, we write Ng = N1(G, ϵ, n) and Nu = N1(U, ϵ, n). Lemma A.5. For any u U and any g G, we have E[M(u, g)] = E[(u(x, a) g (x, a))2], V[M(u, g)] 4b2E[M(u, g)]. Let {ui}i Nu(ϵ) and {gj}j Ng(ϵ) be the corresponding ϵ-cover of U and G. Let Ui be the set of functions u U such that u ui L1({(xt,at)}t [n]) ϵ. Similarly, we define Gj. The following lemma establishes a finite function class variant of Proposition A.3, over (a finite number of) the covering functions of U and G. Lemma A.6. With probability 1 δ, for any (i, j) [Nu] [Ng], we have E[(ui(x, a) g j (x, a))2] 2 t=1 Mt(ui, gj) + 12b2 log(Nu Ng/δ) In addition, with probability 1 δ, for any (i, j) [Nu] [Ng], we have t=1 Mt(ui, gj) 4b2 log(Nu Ng/δ) Proof of Lemma A.6. For any (i, j) [Nu] [Ng], by the Freedman s inequality, with probability at least 1 δ, for any t [0, 1 E[(ui(x, a) g j (x, a))2] 1 t=1 Mt(ui, gj) + (e 2)λV[M(ui, gj)] + log(1/δ) t=1 Mt(ui, gj) + 4b2(e 2)λEM(ui, gj) + log(1/δ) where the second inequality follows from Lemma A.5 (the second inequality). By choosing λ = 1 8(e 2)b2 , the above inequality becomes: E[(ui(x, a) g j (x, a))2] 2 t=1 Mt(ui, gj) + 12 log(1/δ) Taking the union bound over (i, j) [Nu] [Ng] completes the proof for the first part. The second part follows similarly except the only difference is that we choose λ = 1 4(e 2)b2 in the Freedman s inequality. On The Statistical Complexity of Offline Decision-Making The following lemma shows that a ball in the L1 distance is provably contained within a ball with an empirical L1 distance with a slightly larger radius by a margin that scales at fast rates O( 1 Lemma A.7. 1. 
Fix i [Nu]. With probability at least 1 δ, sup u Ui E|u ui| 12ϵ + 12b log(3/δ) 2. Fix j [Ng]. With probability at least 1 δ, sup g Gj E|g gj| 12ϵ + 12b log(3/δ) 3. Fix j [Ng]. With probability at least 1 δ, we have t=1 Ex P ( |xt,at)|g gj|(x ) 12ϵ + 12b log(3/δ) Proof of Lemma A.7. The nature of these results are similar to Lemma A.2 in the generic case, where the key idea to obtain fast rates in the estimation errors is leveraging the non-negativity of the random objects of interest by employing Lemma F.2. We start with Part 1. With probability at least 1 δ, for any u Ui, we have E|u ui| 4ˆE|u ui| + inf ϵ 8ϵ + 12b log(3N1(ϵ , Ui, n)/δ) 12ϵ + 12b log(3/δ) where the second inequality follows from that ˆE|u ui| ϵ, that we choose ϵ = ϵ, and that the ϵ-ball Ui can be ϵ-covered by one point, thus N1(ϵ, Ui, n) = 1. Part 2 follows exactly as Part 1. For Part 3, we notice t=1 |g gj|(x t)|{(xt, at)}t [n] t=1 Ex P ( |xt,at)|g gj|(x ). Thus, the same arguments in Part 1 and Part 2 apply to Part 3. We are now ready to prove Proposition A.3. Proof of Proposition A.3. We start with the following simple discretization: For any u, u , g, g , we have M(u, g) M(u , g ) 2b|u u |(x, a) + 4b|g g |(x ) + 2b Ex P ( |x,a)|g g |(x ), (5) (u g )2(x, a) (u g )2(x, a) 2b|u u |(x, a) + 2b Ex P ( |x,a)|g g |(x ). (6) Consider the following event: sup i [Nu] sup u Ui E|u ui| 12ϵ + 12b log(3Nu/δ) sup j [Ng] sup g Gj E|g gj| 12ϵ + 12b log(3Ng/δ) sup j [Ng] sup g Gj ˆEEx P ( |x,a)|g gj|(x ) 12ϵ + 12b log(3Ng/δ) E[(ui(x, a) g j (x, a))2] 2 t=1 Mt(ui, gj) + 12b2 log(Nu Ng/δ) On The Statistical Complexity of Offline Decision-Making Note that for any u U and g G, there exist i [Nu] and j Ng such that u Ui and g Gj. Thus, under event E1, for any u U, g G, we have E[(u(x, a) g (x, a))2] E[(ui(x, a) g j (x, a))2] + 2b E|u ui| + 2b E|g gj| t=1 Mt(ui, gj) + 12b2 log(Nu Ng/δ) n + 2b E|u ui| + 2b E|g gj| t=1 Mt(u, g) + 4bˆE|u u | + 8bˆE|g g | + 4bˆEEx P ( |x,a)|g g |(x ) + 12b2 log(Nu Ng/δ) n + 2b E|u ui| + 2b E|g gj| t=1 Mt(u, g) + 4bϵ + 8bϵ + 4b(12ϵ + 12b log(3Ng/δ) + 12b2 log(Nu Ng/δ) n + 2b(12ϵ + 12b log(3Nu/δ) n ) + 2b(12ϵ + 12b log(3Ng/δ) t=1 Mt(u, g) + 108bϵ + b2 36 log(Nu) + 83 log(Ng) + 108 log(3/δ) where the first inequality follows from Equation (6), the second inequality follows from E1, the third inequality follows from Equation (5), and the fourth inequality follows from E1. By Lemma A.7, Lemma A.6 and the union bound, we have Pr(E1) 1 4δ. Thus, we complete the first part of the theorem. For the second part, the proof follows similarly. In particular, we consider the following event: sup j [Ng] sup g Gj ˆEEx P ( |x,a)|g gj|(x ) 12ϵ + 12b log(3Ng/δ) t=1 Mt(ui, gj) 4b2 log(Nu Ng/δ) It follows from Lemma A.6 (second part), Lemma A.7, and a union bound, that Pr(E2) 1 2δ. Finally, note that under event E2, for any u U, g G, we have t=1 Mt(u, g) 1 t=1 Mt(ui, gj) + 2bˆE|u ui| + 4bˆE|g gj| + 2bˆEEx P ( |x,a)|g gj|(x ) 4b2 log(Nu Ng/δ) n + 2bϵ + 4bϵ + 2b(12ϵ + 12b log(3Ng/δ) 30bϵ + b2 4 log(Nu) + 28 log(Ng) + 28 log(3/δ) where the first inequality follows from Eq. (5), the second inequality follows from the conditions in event E2. B. Proofs of Section 4 Proof of Theorem 4.1. Fix any (C, ρ, d) as stated in the theorem. Let ϵ = Cd 32n 1/(2ρ). We have 0 < ϵ < 1/2 and ϵ2ρ 2 Construction of hard instances. Let A = {a1, a2}. Pick any d mutually distinct points x1, . . . , xd. 
We construct a family of the context-reward distributions Dσ indexed by σ { 1, 1}d where Dσ(x, y) = PX(x) P σ Y |X(y), PX(xi) = 1 d, P σ Y |X(y(a1)|xi) = Ber( 1 2), and P σ Y |X(y(a2)|xi) = Ber( 1 2). Choose any µ such that On The Statistical Complexity of Offline Decision-Making µ(a2|xi) = ϵ2ρ 2 C 1, i [d]. Now we choose the function class F as follows: Fa1 = {f(x) 1 2} and Fa2 = {fα(xi) := 1 2, i [d]|α { 1, 1}d}. Verification. We now verify that for any σ, (Dσ, µ, π Dσ, F) B(ρ, C, d). First, it is easy to see that Pdim(Fa2) = d and Pdim(Fa1) = 0. Second, we have f σ(xi, a1) = 1 2 and f σ(xi, a2) = 1 2, where f σ(x, a) := EPσ[Y (a)|X = x], thus f σ F, σ. Third, for any f F, we have f( , a2) = fα for some α { 1, 1}d and for any policy π, we have Eσ,π(f) = ϵ2 d X i=1 π(a2|xi)1{σi = αi} where Eσ,π(f) denotes the excess risk of f under Dσ π, i.e., under the squared loss and realizability, Eσ,π(f) = EDσ π[(f f )2]. Since we choose µ(a2|xi) = ϵ2ρ 2 C 1, i [d], we have Eσ,µ(f) = ϵ2ρ C dist(σ, α) ϵ2 dist(σ, α) C Eρ σ,π (f), since dist(σ, α) d and ρ 1. Thus, we have (Dσ, µ, π Pσ, F) B(ρ, C, d) for any σ { 1, 1}d. Reduction to testing. For any policy π, we have V π σ := V π Dσ = Pd i=1 1 d π(a1|xi)( 1 2) + π(a2|xi)( 1 2) . Thus, we can compute the sub-optimality of the output policy ˆπ of any offline learner by V σ V ˆπ σ = ϵ 2d (1{σi} ˆπ(a2|xi)σi) . Let ˆσi = 1{ˆπ(a2|xi) 1 2}. We have 1{σi} ˆπ(a2|xi)σi |σi ˆσi| 4 , and thus, we have Eσ [V σ V π σ ] ϵ 4d Eσ [dist(σ, ˆσ)] , (7) where dist(σ, ˆσ) := Pd i=1 1{σi = ˆσi} denotes the Hamming distance between two binary vectors σ and ˆσ, and Eσ denotes the expectation over the randomness of ˆσ with respect to Pσ. The worst-case Hamming distance supσ { 1,1}d Eσ [dist(σ, ˆσ)] can be lower-bounded using the standard tools in hypothesis testing: sup σ { 1,1}d Eσ [dist(σ, ˆσ)] d 2 min σ,σ :dist(σ,σ )=1 inf ψ [Dσ(ψ = σ) + Dσ (ψ = σ )] 1 2 max σ,σ :dist(σ,σ )=1 KL ((Dσ µ)n (Dσ µ)n) where the first inequality follows Assouad s lemma (Tsybakov, 1997, Lemma 2.12) and the second inequality follows from (Tsybakov, 1997, Theorem 2.12). We now compute the KL distance: For any σ and σ such that dist(σ, σ ) = 1, let i [d] be the (only) coordinate that σ On The Statistical Complexity of Offline Decision-Making differs from σ , we have KL ((Dσ µ)n (Dσ µ)n) = n KL (Dσ µ Dσ µ) = n EPXKL (µ Pσ(Y |X) µ Pσ (Y |X)) i=1 KL µ(a1|xi)Ber(1 2) + µ(a2|xi)Ber(1 2 + σi ϵ 2) µ(a1|xi)Ber(1 2) + µ(a2|xi)Ber(1 2 + σ i ϵ 2) i=1 µ(a2|xi)KL Ber(1 2 + σi ϵ 2) Ber(1 2 + σ i ϵ 2) d µ(a2|xi )KL Ber(1 d µ(a2|xi )ϵ2 = 16nϵ2ρ where the first inequality uses the convexity of KL divergence, the second inequality uses a basic KL upper bound that KL Ber 1 2 16ϵ2 for any z { 1, 1}, and the second-last equality plugs in the choice of µ, and the last equality plugs in the choice of ϵ. Now, plugging the above inequality into Equation (8) and Equation (7), we have max σ { 1,1}d Eσ [V σ V π σ ] ϵ C. Proofs of Section 5 We restate Theorem 5.1 in an exact form with disclosed constants. Theorem C.1. Fix any δ [0, 1]. Invoke Algorithm 1 with β = 32ϵ + 4 log N1(F,ϵ,n)+24 log(12/δ) n . Then, for any (D, F) such that Assumption 2.1 holds, with probability at least 1 δ, for any π Π, E[Sub Optπ D(ˆπ)|S] Cρ(π) inf ϵ 0 172ϵ + 44 log N1(F, ϵ, n) + 156 log(24/δ) 1/(2ρ(π)) + 4 + 1{|X|>1} inf ϵ 0 2 log(2N1(F( , Π), ϵ, n)/δ) n + 62 log(6N1(F( , Π), ϵ, n)/δ) Our proof relies on the following lemmas. Lemma C.2. Fix any ϵ > 0, and δ [0, 1]. 
Define the version space F(β) := {f F : ˆPlf ˆPl ˆ f β}, where β = 32ϵ + 4 log N1(F,ϵ,n)+24 log(12/δ) n . Then, with probability at least 1 δ, f F(β), and sup f F(β) ED µ[(f f )2] 172ϵ + 44 log N1(F, ϵ, n) + 156 log(24/δ) Proof of Lemma C.2. Lemma C.2 is an instantiation of Proposition A.3 for the contextual bandit case where we have a tuple of random variables (x, a, r) instead of having an additional transition variable x . In the contextual bandit case, we simply use G = {0} the set of the zero function, and remove all the terms regarding N1(G, ϵ, n) (which is 1 in this case) in the RHS of Proposition A.3. On The Statistical Complexity of Offline Decision-Making Lemma C.3. Choose η = q log K 4(e 2)T and T log K e 2 . Then we have t=1 ˆP(ft( , π) ft( , πt)) 4 p Proof of Lemma C.3. It suffices to show a stronger result, that any x X and π Π, t=1 ft(x, π) ft(x, πt) 4 p This is a standard guarantee of the Hedge algorithm, where a complete proof can be found at (Nguyen-Tang & Arora, 2023, Lemma B.6). Proof of Theorem C.1. V π V πt = Pf ( , π) Pf ( , πt) = ED π[f ft] + (P ˆP)ft( , π) + ˆP(ft( , π) ft( , πt)) + ( ˆP P)f ( , πt) + ˆP(ft( , πt) f ( , πt)), ED π[f ft] | {z } I1 + ˆP(ft( , π) ft( , πt)) | {z } I2 +2 sup g F( ,Π) |(P ˆP)g| + ˆP(ft( , πt) f ( , πt)) | {z } I4 We bound each term I1, I2, I3, I4 separately. Regarding term I2, by Lemma C.3, we have t=1 ˆP(ft( , π) ft( , πt)) 4 p Consider the event E = E1 E2 E3, where sup g F( ,Π) |(P ˆP)g| 1{|X|>1} inf ϵ>0 2 log(2N1(F( , Π), ϵ, n)/δ) n + 62 log(6N1(F( , Π), ϵ, n)/δ) E2 := {f F(β)} , sup f F(β) ED µ[(f f )2] 172ϵ + 44 log N1(F, ϵ, n) + 156 log(24/δ) Due to pessimism of Algorithm 1, ft = arg min f F: ˆ P lf ˆ P l ˆ f β ˆPf( , πt). Thus, under E2, I4 = ˆPft( , πt) ˆPf ( , πt) 0. By the policy transfer definition, we have I1 = ED π[f ft] CπED µ[(f f)2] 1/(2ρπ) . Thus, under event E, we have t=1 V πt Cπ 172ϵ + 44 log N1(F, ϵ, n) + 156 log(24/δ) 1/(2ρπ) + 4 + 1{|X| > 1} inf ϵ>0 2 log(2N1(F( , Π), ϵ, n)/δ) n + 62 log(6N1(F( , Π), ϵ, n)/δ) Now we bound the probability of each of the events E1, E2, E3 and combine them via the union bound for the probability of E. Note that if |X| = 1, we have a trivial inequality I3 = 0. Combining with Proposition A.1, we have Pr(E1) 1 δ. By Lemma C.2, we have Pr(E2 E3) 1 δ. Thus, Pr(E) 1 2δ. On The Statistical Complexity of Offline Decision-Making D. Proofs for Section 6 D.1. The expected sub-optimality guarantees for FALCON+ (Simchi-Levi & Xu, 2022, Algorithm 2) When we employ FALCON+ (Simchi-Levi & Xu, 2022, Algorithm 2) for the online learning step in Algorithm 2, the returned policy ˆπon has the expected regret over m/2 rounds of order p KEF,δ/ log(m/2)(m)m, where EF,δ(n) is a learning guarantee for the offline regression oracle (see their Assumption 2). Assumption D.1 ((Simchi-Levi & Xu, 2022, Assumption 2)). Let π be an arbitrary policy. Given n training samples of the form (xi, ai, ri(ai)) i.i.d. drawn according to (xi, ri) D and ai π( |xi), the offline regression oracle returns a predictor ˆf : X A R. For any δ > 0, with probability at least 1 δ, we have Ex DX ,a π( |x) h ( ˆf(x, a) f (x, a))2i EF,δ(n). Note that, the output for running FALCON++ for m/2 iterates is a sequence of m/2 policies. ˆπon is an uniform distribution over such m/2 iterates. Since the context x is i.i.d. 
from DX , we can transform the expected regret p KEF,δ/ log(m/2)(m)m into the expected sub-optimality bound of order p KEF,δ/ log(m/2)(m) + 1 m using the improved online-to-batch argument (Nguyen-Tang et al., 2023), wherein 1 m is the cost of converting a regret bound into a sub-optimality bound. It is known that if the closure of F is convex (which we assume here for simplicity of comparison), the sample complexity of learning F using squared loss is Pdim(F) ϵ (log(1/ϵ) + log(1/δ)) (Lee et al., 1996). Thus, EF,δ(n) Pdim(F) n log(n/δ). Finally, note that Pdim(F) Kd. D.2. Proof of Theorem 6.1 Proof of Theorem 6.1. Construct the hard problem instances exactly like the ones in the proof of Theorem 4.1, except that we now choose The verification and the reduction to testing follow exactly, except for the step in which we compute the KL divergence of the observations between two different models. Specifically, denote Qσ as the probability of the pre-collected data and the online data under the model Pσ. Let Son = { xi, ai, ri}i [m] be the random online data collected during the online phase by a hybrid algorithm ALG. Note that at depends on xt, { xi, ai, ri}i [t 1], and Soff for any t [m]. We denote this conditional distribution by PALG( at| xt, { xi, ai, ri}i [t 1], Soff). Note that the conditional distribution PALG depends only on the algorithm ALG and invariant to any underlying model Pσ, thus we have Qσ(Son) Qσ (Son) = Y Pσ( ri| xi, ai) Pσ ( ri| xi, ai). Consider any σ and σ such that dist(σ, σ ) = 1. We thus have X Soff Son Qσ(Son) log Qσ(Son) Qσ (Son) = X Soff Son Qσ(Son) Y Pσ( ri| xi, ai) Pσ ( ri| xi, ai) i=1 KL Ber(1 2 + σi ϵ 2) Ber(1 2 + σ i ϵ 2) m Hence, we have KL(Qσ Qσ ) = X Soff Son Qσ(Soff Son) log Qσ(Soff Son) Qσ (Soff Son) Soff Qσ(Soff) log Qσ(Soff) Qσ (Soff) + X Soff Son Qσ(Son) log Qσ(Son) d 1/4 + 1/4 = 1/2, On The Statistical Complexity of Offline Decision-Making where the first term in the first inequality follows from the upper bound of KL of two distributions of the offline data in the proof of Theorem 4.1, the second term of the first inequality follows from a direct calculation, and the last inequality follows from the choice of ϵ presented at the beginning of the proof. The worst-case Hamming distance supσ { 1,1}d Eσ [dist(σ, ˆσ)] can be lower-bounded using the standard tools in hypothesis testing: sup σ { 1,1}d Eσ [dist(σ, ˆσ)] d 2 min σ,σ :dist(σ,σ )=1 inf ψ [Qσ(ψ = σ) + Qσ (ψ = σ )] 1 2 max σ,σ :dist(σ,σ )=1 KL (Qσ Qσ ) where the first inequality follows Assouad s lemma (Tsybakov, 1997, Lemma 2.12) and the second inequality follows from (Tsybakov, 1997, Theorem 2.12). Let ˆσi = 1{ˆπ(a2|xi) 1 2}. We have 1{σi} ˆπ(a2|xi)σi |σi ˆσi| 4 , and thus, we have sup σ { 1,1}d Eσ [V σ V π σ ] ϵ 4d sup σ { 1,1}d Eσ [dist(σ, ˆσ)] ϵ E. Proofs for Section 7 E.1. Proof of Theorem 7.5 Proof of Theorem 7.5. The proof structure follows similarly as that of Theorem 4.1. Construction of a family of hard MDPs. Each MDP Mσ is characterized by σ { 1, 1}d. For any σ, Mσ is a deterministic MDP with the state space is S = {x1, x1, x1, . . . , xd, xd, xd}, the action space A = {a1, a2} (see Figure 1). Each MDP Mσ starts uniformly at one of d states x1, . . . , xd. Form each state xi for any i [d], one can follow the blue path by taking action a1 or the red path by taking action a2. This always lead to an absorbing state ( xi in the blue path and xi in the red path). Taking any action from an absorbing state leads to the same state. 
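To make the construction concrete, the following sketch (our own schematic encoding; the state labels, the transition helper, and the parameter values are illustrative assumptions, not the paper's notation) simulates the deterministic dynamics just described and, anticipating the reward labelling given next (every blue arrow pays 1/2 and every red arrow pays 1/2 + σ_i ϵ/2), verifies the first-step Q-values stated below.

def transition(state, action):
    # Deterministic dynamics of M_sigma: from x_i, action a1 enters the "blue" absorbing
    # state bar_x_i and action a2 the "red" absorbing state tilde_x_i; absorbing states self-loop.
    kind, i = state                        # kind is one of "x", "bar_x", "tilde_x"
    if kind == "x":
        return ("bar_x", i) if action == "a1" else ("tilde_x", i)
    return state

def total_return(i, first_action, sigma, eps, H):
    # Accumulates the H per-step rewards along the trajectory that starts at x_i and takes
    # first_action; after the first step the trajectory is stuck on its chain, so later
    # actions are irrelevant, which is why these values do not depend on the policy.
    state, total = ("x", i), 0.0
    for _ in range(H):
        state = transition(state, first_action)
        on_blue = state[0] == "bar_x"
        total += 0.5 if on_blue else 0.5 + sigma[i] * eps / 2.0
    return total

sigma, eps, H = [+1, -1, +1], 0.1, 5
for i in range(len(sigma)):
    print(total_return(i, "a1", sigma, eps, H),    # Q^pi_1(x_i, a1) = H/2
          total_return(i, "a2", sigma, eps, H))    # Q^pi_1(x_i, a2) = H(1/2 + sigma_i * eps/2)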
If one starts from xi for any i [d], the reward for every blue arrow in the graph (resp. red arrow) is 1 2). Also note that all the MDPs in the family share the same (deterministic) dynamics (they are only different by reward labelling). We can compute exactly Q-functions of each policy under each MDP Mσ. Note that starting from h 2, the value function does not depend on the action being taking. The total reward in any trajectory essentially depends on which initial state one starts with and which action one takes from the initial state. Thus under any MDP Mσ, its Q-value functions under a policy π have the property that they are completely agnostic to π which comes handy in satisfying the value realizability and Bellman completeness. This property is inspired by the construction by (Foster et al., 2021), though our constructions are different and much simpler. Specifically, for any π, we have Qπ h( xi, a) = H h + 1 2 , h 1, a A, i [d] Qπ h( xi, a) = (H + 1 h)(1 2 + σi ϵ 2), h 1, a A, i [d] Qπ 1(xi, a1) = H Qπ 1(xi, a2) = H(1 2 + σi ϵ 2), i [d]. On The Statistical Complexity of Offline Decision-Making Figure 1. Hard MDPs We construction the following function class F = {f σ : σ { 1, 1}d} where f σ 1 (xi, a1) = H 2 , f σ 1 (xi, a2) = H(1 2 + σi ϵ 2), h [2, H], f σ h ( xi, a) = (H + 1 h)1 2, f σ h ( xi, a) = (H + 1 h)(1 2 + σi ϵ 2). It is easy to see that for any σ { 1, 1}d, the pair (F, Mσ) satisfies the value realizability and the Bellman completness. The value realizability follows from that F is constructed as the bare minimum function class that contains all possible Q-value functions of the MDPs in the class. The Bellman completeness follows from the value realizability and that the Bellman backup under any policy π on a function in F does not depend on π. 2 + π1(a2|xi)(1 2 + σi ϵ 2) . Thus, we have Eσ [V σ V π σ ] Hϵ 4d Eσ [dist(σ, ˆσ)] , (9) For any f F, we have f = f α for some α { 1, 1}d. For any policy π, we have h [H], Eπ[(f σ h f α h )2] = ϵ2(H 1 + 1)2 d X i=1 π1(a2|xi)1{σi = αi} µ(a2|xi) = ϵ2ρ 2H2ρ 2 On The Statistical Complexity of Offline Decision-Making This is valid only when C (ϵH)2ρ 2. Then we have h [H], C|Eπ[f σ h f α h ]|2 CEπ[(f σ h f α h )2] Eµ[(f σ h f α h )2]ρ. The offline data can be equivalently reduced into Sn = {(st 1, at 1, rt 1)}t [n] because the information in the first timestep h = 1 fully captures the information in subsequent time steps h 1. For any σ and σ such that dist(σ, σ ) = 1, let i [d] be the (only) coordinate that σ differs from σ , we have KL (Prσ(Sn) Prσ (Sn)) 16n d µ(a2|xi )ϵ2 16n H2ρ 2ϵ2ρ where we choose ϵ = Cd 32n H2ρ 2 The worst-case Hamming distance supσ { 1,1}d Eσ [dist(σ, ˆσ)] can be lower-bounded using the standard tools in hypothesis testing: sup σ { 1,1}d Eσ [dist(σ, ˆσ)] d 2 min σ,σ :dist(σ,σ )=1 inf ψ h Pr σ (ψ = σ) + Pr σ (ψ = σ ) i 1 2 max σ,σ :dist(σ,σ )=1 KL Pr σ (Sn) Pr σ (Sn) ! where the first inequality follows Assouad s lemma (Tsybakov, 1997, Lemma 2.12) and the second inequality follows from (Tsybakov, 1997, Theorem 2.12). Plugging into Equation (9), we have sup σ { 1,1}d Eσ [V σ V π σ ] Hϵ 4d sup σ { 1,1}d Eσ [dist(σ, ˆσ)] E.2. Proof of Theorem 7.6 We first re-state Theorem 7.6. Theorem E.1. Let Assumption 7.1, Assumption 7.2, and Assumption 7.3 hold and assume that |A| = K. 
There exists a (possibly randomized) learning algorithm ˆπ such that for any MDP M, and any δ [0, 1], with probability at least 1 δ, for any π such that ρπ 0.5, E [Sub Optπ M(ˆπ)|S] = C 1 2ρπ π H1 1 2ρπ H2ϵ + H3 d(ϵ, n) + log(H/δ) where d(ϵ, n) := maxh [H]{log N1(Fh, ϵ, n) log N1(Fh( , Πh), ϵ, n)}. We provide a specific algorithm, Algorithm 3, that obtains the bounds in Theorem E.1. Algorithm 3 is essentially the Of DM-Hedge algorithm (Algorithm 1) extended to MDPs. Note that Algorithm 3 also already appears in the prior works of (Xie et al., 2021a; Nguyen-Tang & Arora, 2023). On The Statistical Complexity of Offline Decision-Making Algorithm 3 Hedge for Offline Decision-Making in MDP (Of DM-Hedge-MDP) 1: Input: Offline data S, function class F 2: Hyperparameters: Confidence parameter β, learning rate η, iteration number T 3: Initialize π(1) = {π(1) h }h [H], where π(1) h ( |x) = Uniform(A), x Xh 4: for t = 1 to T do 5: Pessimism: f (t) = arg min f F(β,π(t)) f1(s1, π(t) 1 ) where F(β, π(t)) := h [H] Li(fh, fh+1, π(t)) inf g F h [H] Li(gh, fh+1, π(t)) β 6: Hedge: π(t+1) h (a|x) π(t) h (a|x)eηf (t) h (x,a), (x, a, h) 7: end for 8: Output: A randomized policy ˆπ as a uniform distribution over {π(t)}t [T ]. Notations. For convenience, we denote the element-wise functionals indexed by functions and policies: Li(fh, fh+1, π) := (fh(x(i) h , a(i) h ) r(i) h fh+1(x(i) h+1, πh+1))2, Zi(fh, fh+1, π) := Li(fh, fh+1, π) Li(T π h fh+1, fh+1, π), Eπ h (fh, fh+1)(x, a) := (T π h fh fh+1)(x, a). A key starting point for the proof of Theorem E.1 of our upper bounds in this section is the error decomposition lemma that relies on a notion of induced MDPs, used originally in (Zanette et al., 2021) and adopted in (Nguyen-Tang & Arora, 2023). Definition E.2 (Induced MDPs). For any policy π and any sequence of functions Q = {Qh}h [H] {X A R}H, the (Q, π)-induced MDPs, denoted by M(Q, π) is the MDP that is identical to the original MDP M except only that the expected reward of M(Q, π) is given by {rπ,Q h }h [H], where rπ,Q h (x, a) := rh(x, a) (T π h fh fh+1)(x, a). By definition of M(π, Q), Q is the fixed point of the Bellman equation Qh = T π h,M(π,Q)Qh+1. Lemma E.3. For any policy π and any sequence of functions Q = {Qh}h [H] {X A R}H, we have Qπ M(π,Q) = Q, where M(π, Q) is the induced MDP given in Definition E.2. A key lemma that we use is the following error decomposition. Lemma E.4 (Error decomposition). For any action-value functions Q {S A R}H and any policies π, π Π, we have Sub Opt M π ( π) = h=1 Eπ[E π h (Qh, Qh+1)(xh, ah)] + Q1(x1, π1) V π 1,M(x1) + Sub Opt M(Q, π) π ( π). Proof of Lemma E.4. We have Sub Opt M π ( π) = V π 1 (x1) V π 1 (x1) = V π 1 (x1) V π 1,M(Q, π)(x1) + V π 1,M(Q, π)(x1) V π 1 (x1) + V π 1,M(Q, π)(x1) V π 1,M(Q, π)(x1) h=1 Eπ[E π h (Qh, Qh+1)(xh, ah)] + Q1(x1, π1) V π 1 (x1) + Sub Opt M(Q, π) π ( π), On The Statistical Complexity of Offline Decision-Making where in the last equality, for the first term, we use, by Definition E.2, that V π 1 (x1) V π 1,M(Q, π)(x1) = h=1 Eπ h rh(xh, ah) r π,Q h (xh, ah) i = h=1 Eπ[E π h (Qh, Qh+1)(xh, ah)], for the second term, we use, by Lemma E.3, that V π 1,M(Q, π)(x1) = Qπ 1,M(Q, π)(x1, π1) = Q1(x1, π). Lemma E.5 ((Nguyen-Tang & Arora, 2023)). Consider an arbitrary sequence of value functions {Qt}t [T ] such that maxh,t Qt h b and define the following sequence of policies {πt}t [T +1] where π1( |s) = Uniform(A), s, πt+1 h (a|s) πt h(a|s) exp ηQt h(s, a) , (s, a, h, t). Suppose η = q ln K 4(e 2)b2T and T ln K (e 2). 
For an arbitrary policy π Π, we have V π 1,M(πt,Qt)(x1) V πt 1,M(πt,Qt)(x1) 4Hb p Proof of Theorem E.1. By Lemma E.4, for every π, t, we have Sub Opt M π (π(t)) = h=1 Eπ[(T π(t) h f (t) h+1 f (t) h )(sh, ah)] + f (t) 1 (s1, π(t) 1 ) V π(t) 1,M (s1) | {z } Bt + Sub Opt M(f (t),π(t)) π (π(t)) | {z } Ct Thus, we have E [Sub Optπ M(ˆπ)|S] = 1 t=1 At + Bt + Ct. Note that by Lemma E.5, we have Thus, we now only need to bound PT t=1 At and PT t=1 Bt. Note that for every π and π, we have h=1 Eπ (T π h fh+1 fh)(sh, ah) h=1 C1/(2ρπ) π Eµ (T π h fh+1 fh)(sh, ah)2 1/(2ρπ) C1/(2ρπ) π H1 1/(2ρπ) h=1 (T π h fh+1 fh)(sh, ah)2 #!1/(2ρπ) where the first inequality follows from the Bellman completeness assumption and the transfer exponent definition, the second inequality follows from Jensen s inequality for a concave function x 7 x1/(2ρπ) as long as ρπ 1/2. Thus, to bound PT t=1 At, it suffices to bound the in-distribution squared Bellman error Eµ h PH h=1(T π(t) h f (t) h+1 f (t) h )(sh, ah)2i . This relies on the uniform Bernstein s inequality for Bellman-like loss functions we have developed in Appendix A. On The Statistical Complexity of Offline Decision-Making Lemma E.6 (Uniform Bernstein s inequality for Bellman-like loss functions). Fix any ϵ > 0. With probability at least 1 δ, for any f F, π Π, E[(fh T π h fh+1)2] 2 t=1 Zt(fh, fh+1, π) 108Hϵ + H2 36 log N1(Fh, ϵ, n) + 83 log N1(Fh+1( , Πh+1), ϵ, n) + 108 log(12/δ) In addition, with probability at least 1 δ, for any f F, π Π, t=1 Zt(fh, fh+1, π) inf ϵ>0 32Hϵ + H2 4 log N1(Fh, ϵ, n) + 28 log N1(Fh+1( , Πh+1), ϵ, n) + 24 log(6/δ) Proof of Lemma E.6. This is a direct application of Proposition A.3. Now let s fix ϵ 0 and δ [0, 1]. We use c to denote an absolute constant that can change its value at every of its appearance, as we are not interested in absolute constant factors and would like to simplify the presentation. Set β in Algorithm 3 by β = c H2ϵ + H3 d(ϵ, n) + log(H/δ) By the second part of Lemma E.6, Assumption 7.1, and Assumption 7.2, we have Pr ( π, Qπ F(β, π)) 1 δ. Thus, we have Pr (Bt 0, t) 1 δ. For bounding PT t=1 At, note that we have i=1 Zi(f (t) h , f (t) h+1, π(t)) X h [H] Li(f (t) h , f (t) h+1, π(t)) inf g F h [H] Li(gh, f (t) h+1, π(t)) β, where the first inequality follows from Assumption 7.2 and the second inequality follows from the design of Algorithm 3. Thus, by the first part of Lemma E.6, with probability at least 1 δ, we have h=1 (T π(t) h f (t) h+1 f (t) h )(sh, ah)2 # c H2ϵ + H3 d(ϵ, n) + log(H/δ) Plugging the above inequality into the RHS of Equation (10) leads to an upper bound on PT t=1 At. F. Support Lemmas Lemma F.1 (Freedman s inequality). Let X1, . . . , XT be any sequence of real-valued random variables. Denote Et[ ] = E[ |X1, . . . , Xt 1]. Assume that Xt R for some R > 0 and Et[Xt] = 0 for all t. Define the random variables t=1 Xt, V := i=1 Et[X2 t ]. On The Statistical Complexity of Offline Decision-Making Then for any δ > 0, with probability at least 1 δ, for any λ [0, 1/R], S (e 2)λV + ln(1/δ) The following lemma exploits the non-negativity of the function class to obtain a fast estimation error rate when relating the population quantity to the empirical one. Lemma F.2. Consider any function class H {Z [0, b]} for some b > 0 and let Sn = {z1, . . . , zn} be an i.i.d. sample from a distribution P (Z). For any δ (0, 1), with probability at least 1 δ over the randomness of Sn, we have h H : Ph 4 ˆPnh + inf ϵ>0 8ϵ + 12b ln(3N1(H, ϵ, Sn)/δ) Remark F.3. 
Lemma F.2 is a generalization of (Zhang, 2023, Theorem 4.12) from functions with range [0, 1] to functions with an arbitrary range [0, b]; setting b = 1 in the lemma above recovers (Zhang, 2023, Theorem 4.12). Proof of Lemma F.2. We start from (Zhang, 2023, Theorem 4.12, with γ = 0.5 in their theorem), which corresponds to the case b = 1 of the lemma above. Let H/b := {h/b : h ∈ H}. Since 0 ≤ h ≤ 1 for every h ∈ H/b, we can apply (Zhang, 2023, Theorem 4.12) to H/b: with probability at least 1 − δ, for every h ∈ H, we have
\[ \frac{Ph}{b} \le \frac{4 \hat{P}_n h}{b} + \inf_{\epsilon > 0}\left\{ 8\epsilon + \frac{12 \ln(3 N_1(\mathcal{H}/b, \epsilon, S_n)/\delta)}{n} \right\} = \frac{4 \hat{P}_n h}{b} + \inf_{\epsilon > 0}\left\{ 8\epsilon + \frac{12 \ln(3 N_1(\mathcal{H}, \epsilon b, S_n)/\delta)}{n} \right\}. \]
The above inequality implies that
\[ Ph \le 4 \hat{P}_n h + \inf_{\epsilon > 0}\left\{ 8 b \epsilon + \frac{12 b \ln(3 N_1(\mathcal{H}, \epsilon b, S_n)/\delta)}{n} \right\} \le 4 \hat{P}_n h + \inf_{\epsilon > 0}\left\{ 8\epsilon + \frac{12 b \ln(3 N_1(\mathcal{H}, \epsilon, S_n)/\delta)}{n} \right\}, \]
where the last inequality follows from replacing ϵ by ϵ/b in the first infimum. Remark F.4. It is possible to obtain a tighter bound, stated in terms of the Rademacher complexity of the function class H in the lemma above, if we are willing to make the additional assumption that each function in H is smooth (non-negativity is already satisfied in the lemma above). Such fast rates are achievable via the optimistic-rate framework of (Srebro et al., 2010), and the smooth and non-negative conditions hold in our case anyway since we use the squared loss. However, this route comes at the cost of a large absolute constant in the upper bound. Moreover, the stronger bound ultimately does not benefit our case, as our bounds still depend on log-covering numbers rather than entirely on Rademacher complexity. Lemma F.5. Let F ⊆ {X × A → R} and let Π = {X → Δ(A)} be the set of all policies. Suppose that |A| = K. For any p ≥ 1, n ∈ N, and ϵ > 0, we have
\[ \max\{ N_p(\mathcal{F}, \epsilon, n), N_p(\mathcal{F}(\cdot, \Pi), \epsilon, n) \} \le \prod_{a \in \mathcal{A}} N_p\!\left(\mathcal{F}(\cdot, a), \tfrac{\epsilon}{K^{1/p}}, n\right). \]
Proof of Lemma F.5. We first prove that
\[ N_p(\mathcal{F}(\cdot, \Pi), \epsilon, n) \le \prod_{a \in \mathcal{A}} N_p\!\left(\mathcal{F}(\cdot, a), \tfrac{\epsilon}{K^{1/p}}, n\right); \]
the other inequality can be proved similarly. Fix any ϵ > 0, p ≥ 1, and n ∈ N. Let N_a = N_p(F(·, a), ϵ/K^{1/p}, n) for each a ∈ A. For any g ∈ F(·, Π), we have g = f(·, π) for some f ∈ F and π ∈ Π. Let f′ be such that ((1/n) ∑_{i=1}^n |f(x_i, a) − f′(x_i, a)|^p)^{1/p} ≤ ϵ/K^{1/p} for every a ∈ A; such an f′ can be chosen from the product of the per-action covers, which has at most ∏_{a∈A} N_a elements. Define g′ = f′(·, π). We have
\[ \| g - g' \|_{p,n}^p = \frac{1}{n}\sum_{i=1}^n |f(x_i, \pi) - f'(x_i, \pi)|^p = \frac{1}{n}\sum_{i=1}^n \Big| \mathbb{E}_{a \sim \pi(\cdot|x_i)}\big[ f(x_i, a) - f'(x_i, a) \big] \Big|^p \le \frac{1}{n}\sum_{i=1}^n \Big( \mathbb{E}_{a \sim \pi(\cdot|x_i)}\big[ |f - f'|(x_i, a) \big] \Big)^p \le \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{a \sim \pi(\cdot|x_i)}\big[ |f - f'|^p(x_i, a) \big] \le \frac{1}{n}\sum_{i=1}^n \max_{a \in \mathcal{A}} |f - f'|^p(x_i, a) \le \frac{1}{n}\sum_{i=1}^n \sum_{a \in \mathcal{A}} |f - f'|^p(x_i, a) = \sum_{a \in \mathcal{A}} \frac{1}{n}\sum_{i=1}^n |f(x_i, a) - f'(x_i, a)|^p \le \sum_{a \in \mathcal{A}} \frac{\epsilon^p}{K} = \epsilon^p, \]
where the second inequality follows from Jensen's inequality. Hence ∥g − g′∥_{p,n} ≤ ϵ, which completes the proof of the first inequality.
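To close this appendix, we include a small Monte Carlo sanity check (ours, purely illustrative; the function class, the data distribution, and the sample sizes are arbitrary choices) of the shape of Lemma F.2. For a finite class H the empirical cover can be taken to be the class itself at every radius, so the bound reduces to Ph ≤ 4 P̂_n h + 12 b ln(3|H|/δ)/n.

import numpy as np

rng = np.random.default_rng(0)
b, n, delta = 2.0, 300, 0.05
thetas = np.linspace(0.5, 10.0, num=20)
H = [lambda z, t=t: b * np.abs(np.sin(t * z)) for t in thetas]   # finite class of [0, b]-valued functions

z = rng.uniform(0.0, 1.0, size=n)                # the sample S_n, drawn i.i.d. from P = Unif(0, 1)
z_pop = rng.uniform(0.0, 1.0, size=1_000_000)    # large independent sample as a Monte Carlo proxy for P

worst_gap = -np.inf
for h in H:
    Ph, Pnh = h(z_pop).mean(), h(z).mean()       # population mean Ph and empirical mean \hat{P}_n h
    bound = 4.0 * Pnh + 12.0 * b * np.log(3 * len(H) / delta) / n
    worst_gap = max(worst_gap, Ph - bound)
print("max over H of (Ph - bound):", worst_gap)  # negative with high probability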