# Model Selection in Batch Policy Optimization

Jonathan N. Lee 1 2, George Tucker 2, Ofir Nachum 2, Bo Dai 2

1 Department of Computer Science, Stanford University, USA. 2 Google Research, Mountain View, USA. Correspondence to: Jonathan N. Lee.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

We study the problem of model selection in batch policy optimization: given a fixed, partial-feedback dataset and M model classes, learn a policy with performance that is competitive with the policy derived from the best model class. We formalize the problem in the contextual bandit setting with linear model classes by identifying three sources of error that any model selection algorithm should optimally trade off in order to be competitive: (1) approximation error, (2) statistical complexity, and (3) coverage. The first two sources are common in model selection for supervised learning, where optimally trading off these two is well studied. In contrast, the third source is unique to batch policy optimization and is due to the dataset shift inherent to the setting. We first show that no batch policy optimization algorithm can achieve a guarantee addressing all three simultaneously, revealing a stark contrast between the difficulties of batch policy optimization and the positive results available in supervised learning. Despite this negative result, we show that relaxing any one of the three error sources enables the design of algorithms achieving near-oracle inequalities for the remaining two. We conclude with experiments demonstrating the efficacy of these algorithms.

1. Introduction

Model selection and hyperparameter tuning are fundamental tasks in supervised learning and statistical learning theory. In these settings, given a model class, the standard goal is to minimize risk, which can always be decomposed as a sum of the approximation error (i.e., bias) and the estimation error (i.e., variance) of the model class. A trade-off between these two often exists: a large model class may require a lot of data for generalization, while a small one may suffer from large approximation error. At its core, model selection is the problem of choosing a model class so as to automatically balance these quantities. A vast literature exists on algorithms and selection rules, such as hold-out methods, cross-validation, and structural risk minimization, that nearly optimally achieve this trade-off in supervised learning (Massart, 2007; Lugosi & Nobel, 1999; Bartlett et al., 2002; Bartlett, 2008). That is, one can select a model class from a collection and nearly match the performance of the best class in the collection. The practical implications have been equally, if not more, impactful, as evidenced by the widespread use of model validation and selection in machine learning applications, where methods like cross-validation on held-out data are standard and essential steps for practitioners. In recent years, interest has turned to model selection in online bandits and reinforcement learning (Agarwal et al., 2017; Foster et al., 2019; Pacchiano et al., 2020; Lee et al., 2021; Modi et al., 2020; Chatterji et al., 2020; Muthukumar & Krishnamurthy, 2021; Ghosh et al., 2021).
However, in contrast to the extensive understanding of model selection in supervised learning and the growing literature in online learning, relatively little is known about model selection in the context of batch (or offline) bandits and reinforcement learning. Batch policy optimization, or offline policy optimization, is a promising paradigm for learning decision-making policies by leveraging large datasets of interactions with the environment (Lange et al., 2012; Levine et al., 2020). The goal is for the learner to find a good policy without interacting with the environment in an online fashion, so as to avoid dangerous or costly deployments of sub-optimal policies. While a number of works provide theoretically sample-efficient algorithms for batch policy optimization (Munos & Szepesvári, 2008; Jin et al., 2019; Nachum et al., 2019b; Liu et al., 2020; Xie et al., 2021), their effectiveness in practice has been limited by the lack of tools for validation and selection when faced with multiple options for model classes or hyperparameter settings. It is clear that batch policy optimization needs an analogue of methods like cross-validation in supervised learning.

Traditionally, this need has motivated a large body of literature dedicated to the problem of batch policy evaluation (e.g., Precup, 2000; Jiang & Li, 2016; Nachum et al., 2019a). These works aim to estimate the values (i.e., online performance) of a set of arbitrary candidate policies, typically via value function regression on the fixed batch dataset or some form of importance sampling. Unfortunately, in practice these methods warrant their own hyperparameter selection (e.g., the choice of model class used to estimate value functions), an observation that has been noted by several authors (Tang & Wiens, 2021; Kumar et al., 2021; Paine et al., 2020). The policy evaluator could thus suffer from the same issues as batch policy optimization algorithms. The question of how to achieve effective model selection in the batch setting thus remains open.

Motivated by this challenge, in this paper we formalize and study theoretically the problem of model selection for batch policy optimization in the setting of contextual bandits with linear models, and we make progress towards understanding what is and is not possible in the batch setting compared to supervised learning. The problem is as follows. The learner is given access to a collection of model classes F_1, . . . , F_M in order to estimate value functions. Equipped with a single model class F_k, a base algorithm produces a policy π̂_k. The learner's goal is thus to leverage all M classes to produce a policy π̂ that nearly matches the performance of the best π̂_k. That is, the learner should perform as if the model class F_k that produces the best π̂_k were known in advance. To ground our results, we focus on the contextual bandit setting and model classes F_k that are linear with respect to a collection of known feature maps.

1.1. Contributions

Three Model Selection Criteria. In Section 3, we identify three sources of error that a model selection algorithm should trade off in order to be competitive with the best π̂_k for linear model classes. Two natural sources, borrowed from supervised learning bounds, are the approximation error and the statistical complexity of the model class.
However, unlike supervised learning, a third source that contributes to error is the coverage of the fixed dataset in conjunction with the properties of the model class. This can be interpreted as the effect of dataset shift. The fixed dataset may not sufficiently cover the relevant states and actions (in contextual bandits, one need only worry about action-distribution mismatch), and some model classes may be better equipped to handle this. Finally, we aim to ensure that model selection preserves the property that the learned policy competes well against any well-covered comparator policy (Jin et al., 2021; Zanette et al., 2021; Xie et al., 2021).

Hardness of Model Selection. Our first technical contribution (Theorem 2) shows that it is provably impossible to perform model selection so as to optimally trade off all three error sources. This is perhaps surprising for two reasons. Firstly, in supervised learning and general risk minimization, oracle inequalities that nearly optimally balance approximation error and statistical complexity are achievable through myriad procedures, and a vast literature exists on this topic. Secondly, recent work by Su et al. (2020) shows that the analogous problem of estimator selection for batch policy evaluation is possible up to an inexact oracle inequality. These observations suggest that the difficulty of model selection is a unique characteristic of the batch policy selection/optimization problem. Furthermore, since the negative result applies to the more restrictive setting of linear contextual bandits, it automatically implies the same for broader classes of problems, such as general function approximators and multi-step reinforcement learning.

Positive Results. Despite this negative result, we show in Section 5 that positive results for model selection are possible in some instances. Namely, as long as just one of the three error sources is ignored, there exist algorithms achieving an oracle inequality that optimally trades off the remaining two. We provide experimental results demonstrating the effectiveness of these algorithms.

1.2. Related Work

Pessimistic Policy Optimization. The principle of pessimism in batch policy optimization has recently attracted great interest as both a heuristic and a principled method that performs (provably) well on the given data distribution in order to combat the dataset shift problem. Work by Jin et al. (2021); Xiao et al. (2021); Xie et al. (2021); Zanette et al. (2021); Uehara & Sun (2021) has sought to theoretically quantify the benefit of pessimistic methods for batch learning under different coverage conditions. In particular, Jin et al. (2021) show that it is possible to recover stronger regret bounds when competing against policies that are well-covered by the data. However, these theoretical studies typically assume a single model class and either assume realizability (Jin et al., 2021; Uehara & Sun, 2021) or require that the approximation error is known in advance (Xie et al., 2021), which is often not the case. The goal of our paper is to investigate and make progress on these unaddressed issues when multiple model classes are available with unknown approximation error.

Policy Evaluation and Selection. Another related line of work has considered the problem of batch policy evaluation, where one seeks to estimate the value of a target policy using a fixed dataset (Duan et al., 2020). Often, the end goal is to evaluate a number of policies and select the one with maximal value (Yang et al., 2020).
In principle, one could use such an evaluation method to select the estimated best policy from several candidates, which are generated by some batch policy optimization method using different model classes. However, one of the main drawbacks of these policy evaluation methods is that they are subject to their own hyperparameters and modeling errors, which may compromise the final model selection regret bound. For example, to ensure accurate policy evaluation, one might be tempted to use a large model class, but the resulting estimation error can easily leak into the regret bound even when a small model class would suffice, rendering any model selection guarantee vacuous. Tang & Wiens (2021); Kumar et al. (2021); Paine et al. (2020); Zhang & Jiang (2021) have previously pointed out this problem and attempted to address it with practical solutions, but theoretical guarantees are less understood. This is precisely what we wish to understand in the current paper. Farahmand & Szepesvári (2011) studied the problem of selecting action-value functions for reinforcement learning, but also assumed access to an estimator, which may suffer from the aforementioned problems. Xie & Jiang (2021) also considered selecting action-value functions for RL as an application of their algorithm, but did not consider the end-to-end model selection problem with coverage, and the resulting guarantee is not competitive with an oracle. Representation learning (Agarwal et al., 2020; Papini et al., 2021; Zhang et al., 2021) is also related, but most current work is in the online setting with assumed realizability or fixed statistical complexity. Most similar in setting to our work is that of Su et al. (2020), who show that it is possible to select from a collection of batch policy evaluators in a way that optimally trades off approximation error and estimation error up to constants, as long as the estimators are properly ordered. However, they considered only evaluation and not the selection problem, where the best policy must be chosen from among candidates that are themselves evaluated on the given data. We show in Section 4 that this additional task raises complications, since different target policies may yield different coverage properties, but these complications can be overcome if the problem is slightly relaxed.

2. Preliminaries

Notation. For n ∈ N, we define the set [n] := {1, . . . , n}. Let S ∈ R^{d×d} be a positive semi-definite matrix and x ∈ R^d a vector. By default, ‖x‖ denotes the ℓ2-norm of x and ‖S‖ denotes the spectral norm of S. ‖x‖_S = √(x^⊤ S x) is the Mahalanobis norm. ‖·‖_{ψ2} and ‖·‖_{ψ1} denote the sub-Gaussian and sub-exponential norms, respectively (see Appendix A for precise definitions). We use C, C1, C2, . . . to denote absolute constants that do not depend on problem-relevant parameters. We use δ > 0 to denote a desired failure probability and assume δ ≤ 1/e. We write a ≲ b to mean that there exists a constant C > 0 such that a ≤ Cb.

2.1. Contextual Bandits

We consider the contextual bandit setting (Lattimore & Szepesvári, 2018) with state space X, action space A, and a fixed distribution D over X × R^A. A learner interacts with the environment through the following protocol: the environment samples a pair (X, Y) ∼ D with X ∈ X and Y ∈ R^A. The learner observes X and then commits to an action a ∈ A. A reward Y(a) is incurred, and subsequently only the value of Y(a) is revealed to the learner.
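To make the interaction protocol and the batch data it produces concrete, the following is a minimal toy sketch (not from the paper): a small finite-state simulator logs (x_i, a_i, y_i(a_i)) triples, with actions chosen by an arbitrary logging rule standing in for a behavior policy. The state space, reward table, and logging rule are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n = 5, 3, 200

# Toy environment: f[x, a] is the mean reward of action a in state x,
# and the observed reward is Y(a) = f(x, a) + Gaussian noise.
f = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))

def sample_interaction(logging_rule):
    """One round of the protocol: draw (X, Y), reveal only Y(a) for the chosen a."""
    x = rng.integers(n_states)                       # X ~ D (marginal over states)
    y = f[x] + rng.normal(0.0, 1.0, size=n_actions)  # full (unobserved) reward vector
    a = logging_rule(x)                              # action may depend on x, not on y
    return x, a, y[a]                                # only Y(a) is revealed

# An arbitrary logging rule standing in for a (possibly unknown) behavior policy.
def behavior(x):
    return int(rng.integers(n_actions))

dataset = [sample_interaction(behavior) for _ in range(n)]  # D = {(x_i, a_i, y_i)}
print(dataset[:3])
```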
The learner's objective is to determine a policy π̂ : X → Δ(A), which maps states to distributions over actions, such that the expected reward E_π̂[Y(A)] is large. Here E_π̂ denotes the expectation over state-action-rewards induced by π̂ with A ∼ π̂(·|X). Similarly, we use P_π̂ to denote the probability measure under π̂. Given a state x and action a, we write the expected reward function as f(x, a) = E[Y(a) | X = x]. We assume f(x, a) ∈ [−1, 1] and that the noise terms η(a) = Y(a) − f(X, a) are independent across actions and sub-Gaussian with ‖η(a)‖_{ψ2} ≤ 1. We use E_X and P_X to denote the expectation and probability measure over just the marginal distribution of the state X.

2.2. Batch Learning

As mentioned before, we consider the batch setting for policy optimization (Lange et al., 2012; Levine et al., 2020; Xiao et al., 2021). Rather than learning via direct interaction with the environment, the learner is given access to a fixed dataset D = {(x_i, a_i, y_i)}_{i∈[n]} of n ∈ N prior interactions, where the reward vector y_i(·) is drawn from D conditioned on x_i and y_i = y_i(a_i), as in the interface above. Similar to the potential outcomes framework under unconfoundedness (Lin, 2013; Imbens & Rubin, 2015), we assume that a_i ⫫ (y_j(a))_{j∈[n], a∈A} | x_i. Intuitively, this ensures that the process selecting actions does not peek at the outcomes, either directly or through a confounding variable. For example, the data could be collected by a behavior policy. From this data, the learner produces a policy π̂ to minimize regret with respect to a comparator policy π:

Reg(π, π̂) = E_X[ f(X, π(X)) − f(X, π̂(X)) ].

The comparator π can be any deterministic policy, including the optimal policy. (For deterministic policies, we write π(x) ∈ A to denote the action on which all of the probability mass lies given the state x ∈ X. We assume all comparator and learned policies are deterministic, but all our results can be easily extended.) The typical notion of regret is obtained by maximizing over π: Reg(π̂) := max_π Reg(π, π̂). Like recent work (Jin et al., 2021; Zanette et al., 2021; Uehara & Sun, 2021; Xie et al., 2021), the motivation for this flexibility is that the globally optimal policy may not be well-covered by the dataset; in such cases, we are interested in proving regret bounds that compete well against comparators that are nearly optimal but also well-covered by D.
In the batch setting, it is often assumed that, for the data in D, states are generated as x_i ∼ D marginally and a_i comes from a (potentially unknown) fixed stochastic behavior policy µ conditioned on x_i (Levine et al., 2020). However, the assumption about D stated above, which we use throughout most of the paper, is more general: it does not require that x_i or a_i come from any particular distribution at all, as in a fixed-design setting. Nevertheless, we will eventually address the behavior policy setting in Section 5, so we briefly introduce context and a definition. It is common to assume that µ sufficiently covers the state-action space, often by means of a concentrability coefficient, which captures the worst-case ratio of the density under an arbitrary policy π to that of µ (Munos & Szepesvári, 2008).

Definition 1. The concentrability coefficient with respect to a data collection policy µ is defined as C(µ) := sup_{π, x, a} π(a|x) / µ(a|x).

It should be noted that the assumption that such concentrability coefficients are small can be very strong. When using function approximation, it may not be necessary to require that µ have non-zero density everywhere, as long as one can still sample-efficiently find a good fit on the given data distribution. Exploiting the properties of function approximation can thus overcome many coverage issues.

3. Model Selection for Batch Linear Bandit

In this section, we introduce linear model classes for the contextual bandit problem (Chu et al., 2011; Abbasi-Yadkori et al., 2011; Jin et al., 2019; 2021) and present a corresponding batch regret bound for a single model class. We identify three sources of sub-optimality in the resulting regret bound and then formally state the goal of model selection as balancing these three sources.

3.1. Batch Regret for a Single Model Class

Let F ⊂ (X × A → R) be a model class defined by a known d-dimensional feature mapping φ : X × A → R^d:

F := { (x, a) ↦ ⟨φ(x, a), θ⟩ : θ ∈ R^d }.   (1)

Algorithm 1 Pessimistic Linear Learner
1: Input: dataset D, linear model class F, confidence parameter 0 < δ ≤ 1/e, regularization parameter λ > 0.
2: Set V ← (λ/n) I_d + (1/n) Σ_{i∈[n]} φ(x_i, a_i) φ(x_i, a_i)^⊤
3: Set θ̂ ← V^{−1} · (1/n) Σ_{i∈[n]} φ(x_i, a_i) y_i
4: for x ∈ X do
5:   f̂(x, a) ← ⟨φ(x, a), θ̂⟩ − β_{λ,δ}(n, d) ‖φ(x, a)‖_{V^{−1}}
6:   Set π̂(x) ← argmax_{a∈A} f̂(x, a), with ties broken arbitrarily
7: end for
8: Return: π̂

Leveraging the batch dataset D, Algorithm 1 (which is standard and essentially a restatement of that of Jin et al. (2021)) performs ridge regression with regularizer λ/n on the rewards observed in D and returns a policy π̂ that conservatively chooses actions based on both the best-fit estimator and a penalty term determined by the coverage. The penalty term is modulated by a coefficient that we set equal to

β_{λ,δ}(n, d) = √( (5d + 10 d^{1/2} log^{1/2}(1/δ) + 10 log(1/δ)) / n ).

In general, F may not actually contain the optimal regressor f that defines the true reward function. In such cases, we say that F may suffer from misspecification or approximation error. It is thus important to ensure that the sub-optimality of the extracted policy scales gracefully with the approximation error of the model, even when the approximation error is not known. The result below, which to the best of our knowledge has not been explicitly stated in the literature, follows from a simple application of a standard regression analysis (e.g., Hsu et al. (2012b, Section 3.1)) to handle concentration, combined with the penalized action-selection method akin to that of Jin et al. (2021).

Theorem 1. Let π̂ be the output policy of Algorithm 1 with λ > 0. Define

ϵ(π, π̂) = E_X[ |f(X, π(X)) − ⟨φ(X, π(X)), θ̄⟩| ] + E_X[ |⟨φ(X, π̂(X)), θ̄⟩ − f(X, π̂(X))| ],

where θ̄ ∈ argmin_{θ∈R^d} Σ_{i∈[n]} ( ⟨φ(x_i, a_i), θ⟩ − f(x_i, a_i) )². If ‖θ̄‖ ≤ √d, then with probability at least 1 − δ, for any policy π (including the optimal policy),

Reg(π, π̂) ≤ ϵ(π, π̂) + 2 β_{λ,δ}(n, d) · E_X ‖φ(X, π(X))‖_{V^{−1}} = ϵ(π, π̂) + Õ(√(d/n)) · E_X ‖φ(X, π(X))‖_{V^{−1}},

where the first term is the approximation error, the coefficient 2β_{λ,δ}(n, d) captures the statistical complexity, the factor E_X‖φ(X, π(X))‖_{V^{−1}} captures the coverage, and V is the regularized empirical covariance matrix of the data, defined in Algorithm 1. (Throughout the paper, we use Õ to omit polylog factors of problem-dependent parameters such as δ^{−1}, the dimension d, and the number of model classes M, defined in Section 3.2.)
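The following is a minimal numpy sketch of Algorithm 1 for finite action spaces, together with the data-dependent coverage term appearing in Theorem 1. The constants in `beta` follow the expression reconstructed above, and the `phi(x, a)` interface and helper names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def beta(n, d, delta, lam):
    # Penalty coefficient from Section 3.1 (constants as reconstructed there).
    return np.sqrt((5 * d + 10 * np.sqrt(d * np.log(1 / delta)) + 10 * np.log(1 / delta)) / n)

def pessimistic_linear_learner(data, phi, n_actions, delta=0.05, lam=1.0):
    """Algorithm 1: ridge fit plus a pessimistic penalty on poorly covered features.

    data: list of (x, a, y) triples; phi(x, a): feature vector in R^d.
    Returns the greedy pessimistic policy along with (theta_hat, V^{-1}).
    """
    n = len(data)
    d = phi(*data[0][:2]).shape[0]
    V = (lam / n) * np.eye(d)
    g = np.zeros(d)
    for x, a, y in data:
        v = phi(x, a)
        V += np.outer(v, v) / n          # regularized empirical covariance
        g += y * v / n
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ g                # ridge estimate
    b = beta(n, d, delta, lam)

    def policy(x):
        scores = [phi(x, a) @ theta_hat - b * np.sqrt(phi(x, a) @ V_inv @ phi(x, a))
                  for a in range(n_actions)]
        return int(np.argmax(scores))

    return policy, theta_hat, V_inv

def coverage_term(states, policy, phi, V_inv):
    # Empirical estimate of E_X ||phi(X, pi(X))||_{V^{-1}}, the coverage factor in Theorem 1.
    vals = [np.sqrt(phi(x, policy(x)) @ V_inv @ phi(x, policy(x))) for x in states]
    return float(np.mean(vals))
```

Because β and the coverage factor are computable from the data alone, the complexity and coverage pieces of the bound can be evaluated directly; only the approximation error is unobservable, which is what Remark 1 below emphasizes.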
The theorem reveals that the sub-optimality in the regret bound is due primarily to three sources:

1. Approximation error: this represents how far the closest function in F is from representing the true reward function f. Here, "closest" means the solution θ̄ of the fixed-design regression problem. When f(x, a) = ⟨φ(x, a), θ̄⟩, we say the data is realizable and clearly ϵ(π, π̂) = 0. (There could be multiple elements in the argmin defining θ̄, but θ̄ is purely analytical, so we can set it to the realizing value in the realizable case.) We discuss approximation error and its various forms in more detail in Appendix B.

2. Statistical complexity: this represents the effective size of F for learning. In the linear case, it is concisely encapsulated by the dimension d of the feature map φ. From the definition of β, we have β_{λ,δ}(n, d) = Õ(√(d/n)).

3. Coverage properties: this source plays an important role in the batch learning setting, where sub-optimality can be due to the dataset shift between D and the learned policy π̂. In the linear setting, it is represented by the mismatch between the covariance matrix V and the feature distribution of the comparator policy π. That is, if the directions of features that π visits do not coincide with the directions covered by V, then one should expect the error due to mismatch to be large. One critical benefit of pessimism is that we need only consider the mismatch with the comparator π (e.g., the optimal policy), as opposed to a worst-case policy, the maximum eigenvalue of V^{−1}, or a concentrability coefficient, all of which could lead to a substantially larger regret bound. This ensures that π̂ is competitive against all comparator policies π that are well-covered under the dataset D, simultaneously (Jin et al., 2021).

We note that θ̄ depends on the states and actions in the dataset D. Several additional remarks are in order.

Remark 1. The bound in Theorem 1 can be viewed as a data-dependent bound (dependent on the state-action pairs in the training dataset D), which is desirable for two main reasons. The first is that it is consistent and easily comparable with the prior work of Jin et al. (2021); Zanette et al. (2021), who proved similar bounds (albeit under the assumption of realizability). The approximation error ϵ(π, π̂) can be naturally viewed as all of the sub-optimality not accounted for by the usual estimation error; i.e., it recovers the prior work in the realizable case with ϵ(π, π̂) = 0, since θ̄ can be set to the realizing value. The second reason is that, as we will see, the data-dependent bound is convenient for model selection, since we can evaluate it directly and it avoids any distributional assumptions and quantities until absolutely necessary (Duan et al., 2020; Xie et al., 2021). In statistical learning settings, data-dependent generalization bounds are desirable precisely for the purpose of model selection. They also tend to be tighter than their worst-case counterparts (Antos et al., 2002; Bartlett et al., 2002).

Remark 2. In general, one does not know the approximation error ϵ(π, π̂), or even non-trivial upper bounds on it, which is the initial motivation for model selection both here and in supervised learning. We work with this particular quantity since it is naturally one of the tightest. Alternatives can be easily derived with some work, but there is little reason to artificially inflate the bound. See Appendix B for further discussion.

3.2. Model Selection Objectives

With these sources of error in mind, we now introduce the general model selection problem for batch policy optimization. We assume a collection of M linear model classes F_1, . . . , F_M such that, for k ∈ [M],

F_k = { (x, a) ↦ ⟨φ_k(x, a), θ⟩ : θ ∈ R^{d_k} },

where φ_k is a known d_k-dimensional feature map for model class F_k.
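One concrete way to represent such a collection in code is as a list of feature maps. The sketch below (an illustrative assumption, mirroring the truncation construction used in the experiments of Section 6) builds the maps by keeping prefixes of a single base map, which also makes them nested in the sense of Definition 2 in Section 4.

```python
import numpy as np

def make_truncated_classes(phi_base, dims):
    """Return feature maps phi_1, ..., phi_M, where phi_k keeps the first d_k coordinates.

    phi_base(x, a): base feature vector; dims: increasing list of dimensions d_k.
    Prefix maps of a common base map are nested (Definition 2)."""
    return [(lambda x, a, d=d: phi_base(x, a)[:d]) for d in dims]

# Example: a random 100-dimensional base map over small finite X and A.
rng = np.random.default_rng(0)
table = rng.normal(size=(20, 10, 100))
phi_base = lambda x, a: table[x, a]
feature_maps = make_truncated_classes(phi_base, dims=[15, 20, 30, 50, 75, 100])
print(feature_maps[2](0, 0).shape)  # (30,)
```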
We desire an algorithm with the following guarantee: given an input dataset D of n interactions and model classes F_1, . . . , F_M, the algorithm outputs a policy π̂ such that, with probability at least 1 − δ, Reg(π, π̂) is bounded above by

Õ( min_{k∈[M]} { ϵ_k(π, π̂) + √(d_k/n) · E_X ‖φ_k(X, π(X))‖_{V_k^{−1}} } )   (2)

for all deterministic policies π. Here, ϵ_k and V_k are the approximation error and regularized empirical covariance matrix corresponding to class k, as defined in the previous section for a single model class.

Interpretation. In words, the main goal of model selection is to achieve performance that is nearly as good as the performance that could be achieved had the optimal model class been known in advance. Observe that the desired bound is essentially the best single-model-class guarantee from Theorem 1, applied to each of the M model classes. Such an inequality is often referred to as an oracle inequality, because an oracle with knowledge of the best class could simply choose it. We emphasize that achieving the desired bound in (2) requires carefully balancing all three error sources (approximation, complexity, coverage). Importantly, note that we aim to maintain the property that π̂ is competitive against any well-covered comparator π. This stands in stark contrast to oracle inequalities in supervised learning, which typically require balancing only approximation error and statistical complexity.

4. A Negative Result for Model Selection

Having introduced the main goal of model selection in the batch problem, we present our first major contribution, which establishes a fundamental hardness of the model selection problem in the batch policy optimization setting. We show that, unlike in standard learning problems, it is actually impossible to optimally trade off all three error sources that comprise the oracle inequality in (2).

Before arriving at the theorem, we identify a condition imposing additional structure on the model classes F_1, . . . , F_M so as to actually make model selection easier. Otherwise, without structure, the selection problem is trivially impossible. In particular, we consider the setting where the model classes are nested.

Definition 2. The collection of linear model classes F_1, . . . , F_M with respective feature maps φ_1, . . . , φ_M is said to be nested if, for each k ∈ [M − 1], the first d_k coordinates of φ_{k+1} are the same as φ_k.

The nestedness condition imposes structure sufficient for adaptive policy value estimation via a variant of the algorithm of Su et al. (2020). Nestedness effectively requires that the model classes are ordered by complexity. Nonetheless, for policy optimization, we will see that this additional structure is still insufficient. We denote by A a model selection algorithm that takes as input the nested model classes F_1, . . . , F_M and a dataset D of n interactions and outputs a learned policy π̂ = A(F_{1:M}, D). The following theorem states that even for such nested model classes, near-optimal model selection is impossible; in fact, performance can be arbitrarily worse.

Theorem 2. Let F_1, . . . , F_M be a particular nested collection of linear model classes (Definition 2). For any α > 0, there is n = Θ(α²) such that, for any algorithm A, there is a contextual bandit instance D with comparator π and a dataset D of n interactions consistent with D satisfying

E_D[ Reg(π, A(F_{1:M}, D)) ] / min_{k∈[M]} { ϵ_k(π, π̂) + √(d_k/n) · E_X ‖φ_k(X, π(X))‖_{V_k^{−1}} } ≥ α.

Here, E_D denotes the expectation with respect to the randomness of the observed rewards in the dataset, which are distributed according to the bandit instance D (conditioned on the states and actions).
The result follows by reducing the problem to a batch multi-armed bandit problem and designing appropriate nested model classes that ensure the oracle inequality is much better than what is achievable by any algorithm on the bandit problem. The dataset is constructed to be sufficiently imbalanced while still being faithful to the underlying conditional distribution induced by D. The construction is illustrative of the core problems that allow the oracle to achieve very small regret. We remark that the result does not necessarily apply to all collections of model classes; it relies on the existence of such a nested collection. See Appendix C for a detailed proof.

The theorem shows that, in general, the oracle (encapsulated by the denominator), which picks the best regret bound induced by the model classes from Theorem 1, can be made to achieve regret that is arbitrarily better than the regret of any model selection algorithm on a sufficiently imbalanced dataset. The result suggests that the desired bound of (2) is too ambitious, and it highlights a separation in difficulty between model selection in the batch policy optimization setting (where an additional coverage error enters the oracle inequality) and standard statistical learning. Note that because this is a hardness result in the restricted class of linear contextual bandits, the implication applies much more broadly to policy optimization, for example with general function classes and multi-step reinforcement learning. Interestingly, the nestedness condition should make positive results for model selection easier (by restricting the set of problem instances that an algorithm must deal with), yet the negative result still holds under this condition. This suggests that policy optimization is deceptively hard compared to policy evaluation, where nestedness is indeed sufficient (Su et al., 2020).

We remark briefly that the lower bound by a factor of α cannot trivially be due to, e.g., polylogarithmic factors in n that may be omitted from the denominator. This is because n can be chosen to be quadratic in α, not exponential. The lower bound, as shown explicitly in the proof of Theorem 2, is indeed due to an imbalanced dataset that causes worse coverage error in the numerator.
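The flavor of the imbalanced-dataset construction can be reproduced numerically. The sketch below is purely illustrative: it follows the two-armed instance described in Appendix C (gap set to Δ = 1/(2√n₂)) and checks how often a natural rule based on empirical means picks the wrong arm when one arm is heavily under-sampled; the formal argument covers any selection rule, and the comparison with the oracle's much smaller regret is carried out in the appendix.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 10_000, 25                     # heavily imbalanced logged data
gap = 1.0 / (2.0 * np.sqrt(n2))         # Delta = 1/(2*sqrt(n2)) as in Appendix C
mu = np.array([gap, 0.0])               # instance nu_2: arm 1 is better by Delta

trials, mistakes = 5_000, 0
for _ in range(trials):
    m1 = rng.normal(mu[0], 1.0, size=n1).mean()
    m2 = rng.normal(mu[1], 1.0, size=n2).mean()
    if m2 > m1:                         # empirical-mean rule picks the worse arm
        mistakes += 1

err = mistakes / trials
print(f"gap = {gap:.3f}, wrong-arm frequency ~= {err:.2f}")
print(f"expected regret ~= {err * gap:.4f}  (order 1/sqrt(n2))")
```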
5. Positive Results for Special Cases

In the last section, we showed that attempting to optimally trade off all three error sources, (1) approximation error, (2) complexity, and (3) the coverage property, is not possible in general for any model selection algorithm, and thus we cannot hope to make progress without further assumptions. The question remains whether there are other settings in which model selection is possible. In this section, we explore relaxations of the model selection objective. Rather than requiring that all three error sources be addressed, we aim to trade off two at a time. We will show that non-trivial model selection guarantees are indeed possible in certain settings.

5.1. Balancing complexity and coverage

We first consider the problem of minimizing regret when the approximation error is zero, or when we are willing to ignore its contribution to the regret: for example, when we have multiple feature representations {φ_k} that all satisfy realizability, but some handle coverage better or induce more favorable distributions under the behavior policy µ. Algorithm 2 shows a simple selection rule for this case. The main idea is to use each model class F_k to generate an estimate θ̂_k and covariance matrix V_k. Then, when extracting the policy, actions are chosen pessimistically, but the action that achieves the highest pessimistic value among all classes is selected.

Algorithm 2 Complexity-Coverage Selection
1: Input: dataset D, linear model classes F_1, . . . , F_M, confidence parameter δ > 0, regularization parameter λ > 0.
2: for k ∈ [M] do
3:   V_k ← (λ/n) I_{d_k} + (1/n) Σ_{i∈[n]} φ_k(x_i, a_i) φ_k(x_i, a_i)^⊤
4:   θ̂_k ← V_k^{−1} · (1/n) Σ_{i∈[n]} φ_k(x_i, a_i) y_i
5: end for
6: for x ∈ X do
7:   f̂_k(x, a) ← ⟨φ_k(x, a), θ̂_k⟩ − β_{λ,δ}(n, d_k) ‖φ_k(x, a)‖_{V_k^{−1}}
8:   Set (π̂(x), k̂(x)) ← argmax_{a∈A, k∈[M]} f̂_k(x, a), with ties broken arbitrarily
9: end for
10: Return: π̂

The following theorem shows that this procedure optimally trades off complexity and coverage.

Theorem 3. Given arbitrary linear model classes F_1, . . . , F_M, Algorithm 2 outputs a policy π̂ such that, with probability at least 1 − δ, for any comparator policy π, the regret Reg(π, π̂) is bounded above by

min_{k∈[M]} { 2 β_{λ,δ/M}(n, d_k) · E_X ‖φ_k(X, π(X))‖_{V_k^{−1}} } + Σ_{k∈[M]} ϵ_k(π, π̂).

The detailed proof is given in Appendix D. In the case where realizability holds for all of the model classes (ϵ_k = 0 for all k), an exact oracle inequality is achieved, and with high probability Reg(π, π̂) is bounded above by

O( min_{k∈[M]} √( d_k log(M/δ) / n ) · E_X ‖φ_k(X, π(X))‖_{V_k^{−1}} ).

However, when positive approximation error is involved, it can accumulate in the regret bound. We note that Algorithm 2 is similar in concept to the online representation selection algorithm of Papini et al. (2021).
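A minimal sketch of Algorithm 2 in numpy: fit every class by ridge regression, then, per state, pick the action and class with the highest pessimistic value. The constants in `beta` mirror the earlier reconstruction, and the finite-action interface is an illustrative assumption.

```python
import numpy as np

def beta(n, d, delta, lam):
    return np.sqrt((5 * d + 10 * np.sqrt(d * np.log(1 / delta)) + 10 * np.log(1 / delta)) / n)

def complexity_coverage_selection(data, feature_maps, n_actions, delta=0.05, lam=1.0):
    """Algorithm 2: per-class pessimistic values, maximized over actions AND classes.

    (A union bound over the M classes, as in Theorem 3, would replace delta by delta/M.)"""
    n = len(data)
    fits = []
    for phi in feature_maps:
        d = phi(*data[0][:2]).shape[0]
        V = (lam / n) * np.eye(d)
        g = np.zeros(d)
        for x, a, y in data:
            v = phi(x, a)
            V += np.outer(v, v) / n
            g += y * v / n
        V_inv = np.linalg.inv(V)
        fits.append((phi, V_inv @ g, V_inv, beta(n, d, delta, lam)))

    def policy(x):
        best_a, best_val = 0, -np.inf
        for phi, theta, V_inv, b in fits:
            for a in range(n_actions):
                v = phi(x, a)
                val = v @ theta - b * np.sqrt(v @ V_inv @ v)   # pessimistic value f_hat_k(x, a)
                if val > best_val:
                    best_a, best_val = a, val
        return best_a

    return policy
```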
5.2. Balancing complexity and approximation error

We now consider the setting where the worst-case coverage properties are tolerable, but we would like to optimally trade off approximation error and statistical complexity. We examine two methods: (i) an adaptive method inspired by the SLOPE estimator (Su et al., 2020), and (ii) the classical hold-out method. While the hold-out method is desirable for its simplicity, the SLOPE method is potentially capable of achieving stronger theoretical guarantees under a slightly stronger nestedness condition on the model classes.

Hitherto, for the dataset D, we required only that the actions a_i be chosen independently of all potential outcomes conditional on x_i, without reference to any behavior policy µ. We will now explicitly assume that each (x_i, a_i, y_i) in the dataset D is sampled jointly from D and a fixed behavior policy µ, as discussed in Section 2.2. This is stronger than before, but it is a standard setting for batch learning (Xie et al., 2021). Henceforth, we simply use E_µ[·] to denote the expectation E_{X,A ∼ D×µ}[·] over the joint distribution. We also consider a relaxed version of the approximation error, which can be written in terms of the statistical approximation error between F_k and the true reward function f:

ϵ̃_k = min_{θ∈R^{d_k}} 2 √( C(µ) · E_µ[ ( ⟨φ_k(X, A), θ⟩ − f(X, A) )² ] ).

Define θ̄_k = argmin_{θ∈R^{d_k}} E_µ[ ( ⟨φ_k(X, A), θ⟩ − f(X, A) )² ], which is well-defined when E_µ[ φ_k(X, A) φ_k(X, A)^⊤ ] ≻ 0. If F_k satisfies realizability, then ϵ̃_k = 0 as before. However, this version depends on the concentrability coefficient C(µ) for the worst-case dataset shift (Definition 1), which we are willing to tolerate in this section.

5.2.1. SLOPE Method

The first method, which we state in Algorithm 3, is inspired by the SLOPE estimator, originally designed for evaluation (Su et al., 2020).

Algorithm 3 SLOPE Method
1: Input: dataset D, linear model classes F_1, . . . , F_M, confidence parameter δ > 0
2: for k ∈ [M] do
3:   Estimate the covariance matrix V_k and parameters θ̂_k as in Algorithm 2.
4:   Set f̂_k(x, a) ← ⟨φ_k(x, a), θ̂_k⟩
5:   Set π̂_k(x) ← argmax_{a∈A} f̂_k(x, a), with ties broken arbitrarily
6: end for
7: for ℓ ∈ [M] do
8:   for k ∈ [M] do
9:     Set v̂_k(π̂_ℓ) ← E_X[ f̂_k(X, π̂_ℓ(X)) ]
10:    Set ξ_k ← ζ_k(δ/M) · E_X[ max_a ‖φ_k(X, a)‖_{V_k^{−1}} ]
11:    Define the intervals I_{k,ℓ} = [ v̂_k(π̂_ℓ) − 2ξ_k, v̂_k(π̂_ℓ) + 2ξ_k ]
12:  end for
13:  Select the model class for evaluating π̂_ℓ: k̂(ℓ) = min{ k : ∩_{j=k}^{M} I_{j,ℓ} is non-empty }, and set v̂(π̂_ℓ) ← v̂_{k̂(ℓ)}(π̂_ℓ)
14: end for
15: Set k̂ = argmax_{k∈[M]} v̂(π̂_k)
16: Return: π̂ = π̂_{k̂}

The algorithm begins by generating estimates θ̂_k using the dataset D. The main idea is to then estimate the values of the π̂_k policies using an improved variant of the SLOPE estimator (which may be of independent interest) in order to achieve the optimal trade-off between approximation error and complexity. We describe this sub-procedure and its differences from the original in Appendix E. However, we require additional structure on the model classes in order to meet the pre-conditions of the improved SLOPE estimator. In particular, we assume the model classes are nested in the sense of Definition 2 (see the precedents Bartlett et al. (2002); Foster et al. (2019)). We also assume the following distributional conditions.

Assumption 1. For all k ∈ [M], E_µ[φ_k(X, A)] = 0 and Σ_k := E_µ[ φ_k(X, A) φ_k(X, A)^⊤ ] ≻ 0. Furthermore, under D × µ, the features φ_k(X, A) are sub-Gaussian with ‖Σ_k^{−1/2} φ_k(X, A)‖_{ψ2} ≤ 1.

Assumption 2. For all k ∈ [M], ‖θ̄_k‖ ≤ 1.

Centering is assumed for ease of exposition. The eigenvalue lower bound is useful for the random-design linear regression analysis. The sub-Gaussian condition is standard. Assumption 2 is analogous to the precondition of Theorem 1, but it is distribution-dependent rather than data-dependent. Define the confidence coefficient used in Algorithm 3 as

ζ_k(δ) = C_1 ‖V_k^{−1/2}‖ √( log(4 d_k/δ) / n ) + √( ( C_2 d_k + C_3 d_k^{1/2} log^{1/2}(4 d_k/δ) + C_4 log(4 d_k/δ) ) / n )

for sufficiently large constants C_1, C_2, C_3, C_4 > 0 defined in Appendix E. Under these assumptions, we have the following theorem.

Theorem 4. Let F_1, . . . , F_M be a nested collection of linear model classes. For λ_k = 1 for all k ∈ [M], Algorithm 3 outputs a policy π̂ such that, with probability at least 1 − 4δ, for any comparator policy π, Reg(π, π̂) is bounded above by

12 min_{k∈[M]} { ϵ̃_k + ζ_k(δ/M) · E_X[ max_a ‖φ_k(X, a)‖_{V_k^{−1}} ] }.

We give the detailed proof in Appendix E. Theorem 4 shows that it is possible to obtain a near-oracle inequality when we are willing to forgo the coverage property and focus solely on trading off approximation error and complexity. Concisely, the regret is bounded above by

Õ( min_k { ϵ̃_k + √( d_k ‖V_k^{−1}‖ log(d_k M/δ) / n ) · E_X[ max_a ‖φ_k(X, a)‖_{V_k^{−1}} ] } ).

The coefficient ζ_k(δ/M) on the second term is slightly larger than β_{λ_k,δ/M}(n, d_k) due to the dependence on ‖V_k^{−1/2}‖, but they are of approximately the same order in d and n^{−1}. We emphasize that the factor E_X[ max_a ‖φ_k(X, a)‖_{V_k^{−1}} ] is different from the coverage error defined in (2), which is E_X[ ‖φ_k(X, π(X))‖_{V_k^{−1}} ]. This version still captures the effect of distribution shift on the features, but it depends on the worst-case policy rather than on the comparator π. Thus, we are unable to maintain competitiveness against well-covered policies. Note that only the approximation error ϵ̃_k depends on C(µ), and that ϵ̃_k is a weaker form of the approximation error than the previously used ϵ_k(π, π̂). A natural question is whether a bound of the form min_k { ϵ̃_k + ζ_k(δ/M) · E_X[ ‖φ_k(X, π(X))‖_{V_k^{−1}} ] }, which would satisfy all three criteria, is possible with this slightly weaker approximation error. We show in Appendix C that the argument in the proof of Theorem 2 still applies in this case, and thus a bound of this form is still not possible.
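The interval-intersection step of Algorithm 3 (lines 7–16) is easy to state in code. The sketch below assumes that the per-class value estimates v̂_k(π̂_ℓ) and widths ξ_k have already been computed from the fitted θ̂_k and confidence coefficients ζ_k; the function and variable names are illustrative, not the paper's.

```python
import numpy as np

def slope_select(v_hat, xi):
    """SLOPE-style selection (Algorithm 3, lines 7-16).

    v_hat[k][l]: empirical value of policy pi_hat_l under class k's reward model.
    xi[k]:       confidence width for class k (larger classes: less bias, larger xi).
    Returns the index of the policy with the largest adaptively estimated value."""
    M = len(xi)
    values = np.zeros(M)
    for l in range(M):
        lo = np.array([v_hat[k][l] - 2 * xi[k] for k in range(M)])
        hi = np.array([v_hat[k][l] + 2 * xi[k] for k in range(M)])
        k_hat = M - 1
        for k in range(M):
            # Smallest k whose interval intersects those of all larger classes.
            if lo[k:].max() <= hi[k:].min():
                k_hat = k
                break
        values[l] = v_hat[k_hat][l]
    return int(np.argmax(values))

# Toy usage: three classes, three candidate policies.
v_hat = [[0.40, 0.35, 0.30], [0.52, 0.44, 0.31], [0.50, 0.47, 0.33]]
xi = [0.20, 0.05, 0.02]
print(slope_select(v_hat, xi))
```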
5.2.2. Hold-out Method

We now analyze the performance of the hold-out method, a classical model selection tool in supervised learning and risk minimization. The dataset D is partitioned into D_in and D_out, where D_in is used to train the candidate estimators θ̂_k for each model class, and D_out is used to estimate their regression error as a proxy for selection. As before, we define π̂_k(x) ∈ argmax_{a∈A} ⟨φ_k(x, a), θ̂_k⟩. We denote the empirical regression loss on the independent hold-out set by

L̂_k(θ) = (1/|D_out|) Σ_{(x_i, a_i, y_i) ∈ D_out} ( ⟨φ_k(x_i, a_i), θ⟩ − y_i )².

The hold-out method simply chooses the model class with the smallest empirical loss: k̂ ∈ argmin_{k∈[M]} L̂_k(θ̂_k).

Theorem 5. Given arbitrary linear model classes F_1, . . . , F_M, let π̂ = π̂_{k̂}, where k̂ ∈ argmin_{k∈[M]} L̂_k(θ̂_k). Then there is a constant C > 0 such that, with probability at least 1 − 2δ, Reg(π, π̂) is bounded above by

min_k { ϵ̃_k + C √(C(µ)) · ‖θ̂_k − θ̄_k‖_{Σ_k} } + C √(C(µ)) · (1 + max_ℓ ‖θ̂_ℓ‖) · log^{1/2}(M/δ) / |D_out|^{1/4}.

The detailed proof is provided in Appendix F. Note that, for simplicity, we have stated the bound abstractly in terms of the estimation errors ‖θ̂_k − θ̄_k‖_{Σ_k} and the norm max_ℓ ‖θ̂_ℓ‖, where Σ_k is defined in Assumption 1. Standard analyses of random-design linear regression (Hsu et al., 2012b) provide further bounds on these quantities. While we are able to select so as to achieve error on the order of the best model class, this is only achieved with estimation error that depends on the concentrability coefficient C(µ). There is also some residual estimation error of order O(1/n_out^{1/4}), which is slower than the typical O(1/√n); however, this term has no dependence on d, assuming max_ℓ ‖θ̂_ℓ‖ is of constant size.
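A sketch of the hold-out rule in numpy: fit θ̂_k on D_in by ridge regression, score each class by its squared regression error on D_out, and return the greedy policy of the winning class. The 80/20 split mirrors the experiments below; everything else is an illustrative assumption.

```python
import numpy as np

def holdout_select(data, feature_maps, n_actions, split=0.8, lam=1.0, seed=0):
    """Hold-out model selection (Section 5.2.2)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_in = int(split * len(data))
    d_in = [data[i] for i in idx[:n_in]]
    d_out = [data[i] for i in idx[n_in:]]

    best_k, best_loss, thetas = 0, np.inf, []
    for k, phi in enumerate(feature_maps):
        d = phi(*d_in[0][:2]).shape[0]
        A = lam * np.eye(d)
        g = np.zeros(d)
        for x, a, y in d_in:                       # ridge fit of theta_hat_k on D_in
            v = phi(x, a)
            A += np.outer(v, v)
            g += y * v
        theta = np.linalg.solve(A, g)
        thetas.append(theta)
        loss = np.mean([(phi(x, a) @ theta - y) ** 2 for x, a, y in d_out])
        if loss < best_loss:                       # k_hat = argmin_k L_hat_k(theta_hat_k)
            best_k, best_loss = k, loss

    phi, theta = feature_maps[best_k], thetas[best_k]
    return lambda x: int(np.argmax([phi(x, a) @ theta for a in range(n_actions)]))
```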
6. Experiments

To complement our primarily theoretical results, we study the utility of the above model selection algorithms in synthetic experiments and compare them empirically. In separate experiments on balancing complexity-coverage and approximation-complexity, we generated a collection of feature maps, each defining a linear model class. We first evaluated the performance of the base algorithms, using Algorithm 1 with each model class. We then compared this performance to the proposed selection algorithms. All results were averaged over 20 trials. For clarity, error bands represent standard error only for the model selection algorithms. Further details can be found in the appendix.

Complexity-Coverage Trade-off. In this setting, we studied the case where all feature maps are capable of realizing the true reward function. That is, the learner need not deal with any approximation error, but it can benefit from selecting a good representation. We let |X| = 20 and |A| = 10 and generated d = |X||A|-dimensional feature maps through the following procedure: for model class k, a random collection of d_hid,k vectors of size |X||A| is generated, ensuring that a linear combination of them exactly equals f. We then randomly project them to d dimensions to produce φ_k. For model selection, we implemented Algorithm 2. Figure 1 compares the performance. As expected, the model class with the smallest d_hid performs best. The model selection algorithm is nearly able to match this performance, even without knowing which feature map uses the smallest d_hid.

Figure 1: In the fully realizable case, the performance of Algorithm 2 for model selection ("C-C Selection") is compared against base algorithms that each use a single model class, as a function of dataset size. Each model class is defined by a different feature representation generated with underlying dimension d_hid. The error band represents standard error.

Approximation Error-Complexity Trade-off. Next, we consider trading off approximation error and complexity with nested function classes. We again let |A| = 10, but allowed X to be infinite, with feature vectors generated from zero-mean normal distributions with different covariance matrices. The feature vectors were given ambient dimension d = 100, but the reward function was designed using only the first 30 coordinates. Model classes were generated by truncating the full feature vectors to the dimensions {15, 20, 30, 50, 75, 100}. Thus, the first two classes suffer from approximation error, while the last three are excessively large. For model selection, we considered both the SLOPE method and the hold-out method with an 80/20 data split. Figure 2 shows that both model selection algorithms are eventually able to match the performance of the best model class. Interestingly, the hold-out method performs consistently better than the SLOPE method when data is scarce.

Figure 2: The performance of the SLOPE and hold-out methods is compared against base algorithms that each use a single model class of varying dimension (d ∈ {15, 20, 30, 50, 75, 100}), as a function of dataset size. Both are eventually able to nearly match the performance of the best model class, but the hold-out method is consistently better with less data. Error bands represent standard error.

7. Discussion

We introduced the theoretical study of model selection for batch policy optimization, identifying three sources of error to consider when selecting model classes. We showed that balancing all three is not possible in general while remaining competitive with an oracle, but relaxing any one of them allows the design of effective model selection algorithms. Several open questions remain. First, while the negative result is general, the positive results thus far apply only to the linear contextual bandit setting. We expect that the challenges become more complex for reinforcement learning and general model classes, but similar trade-offs may be observed for general function classes that handle coverage in terms of comparator-specific concentrability coefficients (Xie et al., 2021; Uehara & Sun, 2021). Another interesting direction is to understand more formally the strong empirical performance of the hold-out method.

Acknowledgements

We thank Christoph Dann and the anonymous reviewers for their detailed feedback in improving the paper. Most of the work was completed while JNL was an intern at Google Research. JNL was partially supported by NSF GRFP.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320, 2011.

Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. Corralling a band of bandit algorithms. In Conference on Learning Theory, pp. 12–38. PMLR, 2017.

Agarwal, A., Kakade, S., Krishnamurthy, A., and Sun, W. Flambe: Structural complexity and representation learning of low rank mdps. arXiv preprint arXiv:2006.10814, 2020.

Antos, A., Kégl, B., Linder, T., and Lugosi, G. Data-dependent margin-based generalization bounds for classification. Journal of Machine Learning Research, 3(Jul): 73–98, 2002.
Bartlett, P. L. Fast rates for estimation error and oracle inequalities for model selection. Econometric Theory, pp. 545 552, 2008. Bartlett, P. L., Boucheron, S., and Lugosi, G. Model selection and error estimation. Machine Learning, 48(1): 85 113, 2002. Bubeck, S., Perchet, V., and Rigollet, P. Bounded regret in stochastic multi-armed bandits. In Conference on Learning Theory, pp. 122 134. PMLR, 2013. Chatterji, N., Muthukumar, V., and Bartlett, P. Osom: A simultaneously optimal algorithm for multi-armed and linear contextual bandits. In International Conference on Artificial Intelligence and Statistics, pp. 1844 1854, 2020. Chen, J. and Jiang, N. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pp. 1042 1051. PMLR, 2019. Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208 214. JMLR Workshop and Conference Proceedings, 2011. Duan, Y., Jia, Z., and Wang, M. Minimax-optimal offpolicy evaluation with linear function approximation. In International Conference on Machine Learning, pp. 2701 2709. PMLR, 2020. Duchi, J. Lecture notes for statistics 311/electrical engineering 377. Stanford University, 2019. Farahmand, A.-m. and Szepesv ari, C. Model selection in reinforcement learning. Machine learning, 85(3):299 332, 2011. Foster, D. and Rakhlin, A. Beyond ucb: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pp. 3199 3210. PMLR, 2020. Foster, D., Krishnamurthy, A., and Luo, H. Model selection for contextual bandits. ar Xiv preprint ar Xiv:1906.00531, 2019. Ghosh, A., Sankararaman, A., and Kannan, R. Problemcomplexity adaptive model selection for stochastic linear bandits. In International Conference on Artificial Intelligence and Statistics, pp. 1396 1404. PMLR, 2021. Hsu, D., Kakade, S., Zhang, T., et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012a. Hsu, D., Kakade, S. M., and Zhang, T. Random design analysis of ridge regression. In Conference on learning theory, pp. 9 1. JMLR Workshop and Conference Proceedings, 2012b. Imbens, G. W. and Rubin, D. B. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015. Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652 661. PMLR, 2016. Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. ar Xiv preprint ar Xiv:1907.05388, 2019. Jin, Y., Yang, Z., and Wang, Z. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pp. 5084 5096. PMLR, 2021. Kumar, A., Singh, A., Tian, S., Finn, C., and Levine, S. A workflow for offline model-free robotic reinforcement learning. ar Xiv preprint ar Xiv:2109.10813, 2021. Lange, S., Gabel, T., and Riedmiller, M. Batch reinforcement learning. In Reinforcement learning, pp. 45 73. Springer, 2012. Lattimore, T. and Szepesv ari, C. Bandit algorithms. preprint, 2018. Lee, J., Pacchiano, A., Muthukumar, V., Kong, W., and Brunskill, E. Online model selection for reinforcement learning with function approximation. In International Conference on Artificial Intelligence and Statistics, pp. 3340 3348. PMLR, 2021. 
Model Selection in Batch Policy Optimization Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. ar Xiv preprint ar Xiv:2005.01643, 2020. Lin, W. Agnostic notes on regression adjustments to experimental data: Reexamining freedman s critique. The Annals of Applied Statistics, 7(1):295 318, 2013. Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. Provably good batch off-policy reinforcement learning without great exploration. Advances in Neural Information Processing Systems, 33:1264 1274, 2020. Lugosi, G. and Nobel, A. B. Adaptive model selection using empirical complexities. Annals of Statistics, pp. 1830 1864, 1999. Massart, P. Concentration inequalities and model selection. 2007. Modi, A., Jiang, N., Tewari, A., and Singh, S. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pp. 2010 2020. PMLR, 2020. Munos, R. and Szepesv ari, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9 (5), 2008. Muthukumar, V. and Krishnamurthy, A. Universal and data-adaptive algorithms for model selection in linear contextual bandits. ar Xiv preprint ar Xiv:2111.04688, 2021. Nachum, O., Chow, Y., Dai, B., and Li, L. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. ar Xiv preprint ar Xiv:1906.04733, 2019a. Nachum, O., Dai, B., Kostrikov, I., Chow, Y., Li, L., and Schuurmans, D. Algaedice: Policy gradient from arbitrary experience. ar Xiv preprint ar Xiv:1912.02074, 2019b. Pacchiano, A., Phan, M., Abbasi-Yadkori, Y., Rao, A., Zimmert, J., Lattimore, T., and Szepesvari, C. Model selection in contextual stochastic bandit problems. ar Xiv preprint ar Xiv:2003.01704, 2020. Paine, T. L., Paduraru, C., Michi, A., Gulcehre, C., Zolna, K., Novikov, A., Wang, Z., and de Freitas, N. Hyperparameter selection for offline reinforcement learning. ar Xiv preprint ar Xiv:2007.09055, 2020. Papini, M., Tirinzoni, A., Restelli, M., Lazaric, A., and Pirotta, M. Leveraging good representations in linear contextual bandits. ar Xiv preprint ar Xiv:2104.03781, 2021. Precup, D. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80, 2000. Rudelson, M. and Vershynin, R. Hanson-wright inequality and sub-gaussian concentration. Electronic Communications in Probability, 18:1 9, 2013. Su, Y., Srinath, P., and Krishnamurthy, A. Adaptive estimator selection for off-policy evaluation. In International Conference on Machine Learning, pp. 9196 9205. PMLR, 2020. Tang, S. and Wiens, J. Model selection for offline reinforcement learning: Practical considerations for healthcare settings. ar Xiv preprint ar Xiv:2107.11003, 2021. Uehara, M. and Sun, W. Pessimistic model-based offline rl: Pac bounds and posterior sampling under partial coverage. ar Xiv preprint ar Xiv:2107.06226, 2021. Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. ar Xiv preprint ar Xiv:1011.3027, 2010. Vershynin, R. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. Xiao, C., Wu, Y., Mei, J., Dai, B., Lattimore, T., Li, L., Szepesvari, C., and Schuurmans, D. On the optimality of batch policy optimization algorithms. In International Conference on Machine Learning, pp. 11362 11371. PMLR, 2021. Xie, T. and Jiang, N. 
Batch value-function approximation with only realizability. In International Conference on Machine Learning, pp. 11404 11413. PMLR, 2021. Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. Bellman-consistent pessimism for offline reinforcement learning. ar Xiv preprint ar Xiv:2106.06926, 2021. Yang, M., Dai, B., Nachum, O., Tucker, G., and Schuurmans, D. Offline policy selection under uncertainty. ar Xiv preprint ar Xiv:2012.06919, 2020. Zanette, A., Wainwright, M. J., and Brunskill, E. Provable benefits of actor-critic methods for offline reinforcement learning. ar Xiv preprint ar Xiv:2108.08812, 2021. Zhang, S. and Jiang, N. Towards hyperparameter-free policy selection for offline reinforcement learning. In Thirty Fifth Conference on Neural Information Processing Systems, 2021. Zhang, W., He, J., Zhou, D., Zhang, A., and Gu, Q. Provably efficient representation learning in low-rank markov decision processes. ar Xiv preprint ar Xiv:2106.11935, 2021. Model Selection in Batch Policy Optimization A. Sub-Gaussian and Sub-Exponential Random Variables In this section, we review basic definitions and properties of sub-Gaussian and sub-exponential random variables. See (Vershynin, 2018) for a comprehensive introduction. Let X R be a random variable. We define the norms: X ψ2 := sup p N p 1/2 (E|X|p)1/p (3) X ψ1 := sup p N p 1 (E|X|p)1/p (4) Definition 3. The random variable X is sub-Gaussian with parameter X ψ2 if X ψ2 < . It is sub-exponential with parameter X ψ1 if X ψ1 < . For a non-negative real value τ 0 we write X sub G(τ 2) to indicate that X ψ2 τ. Similarly, we write X sub E(τ) to suggest X ψ1 τ. We note that this definition of sub-Gaussian random variables is equivalent up to constant factors with an alternative popular definition when EX = 0. This definition requires that EeλX exp λ2τ 2 2 for all λ R. Let X Rd be a random vector. Then, we write X ψ2 = supv Rd : v 1 v X ψ2. The same notational conventions above apply to the vector X. We now state several basic results concerning sub-Gaussian and sub-exponential random variables that will be used throughout the remaining proofs. Lemma 1 ((Vershynin, 2010), Lemma 2.7.7). Let X and Y be (potentially dependent) real-valued random variables. Then, the following holds: XY ψ1 X ψ2 Y ψ2. Lemma 2. Let X Rd be a random vector with second moment matrix Σ = EXX . Let v Rd be such that v 1. Then, v Σv 2 X 2 ψ2. Proof. We have v Σv = Ev XX v = E(v X)2. From the definition of ψ2, we have that E|v X|2 2 v X 2 ψ2. Finally, we note that v X ψ2 X ψ2 since v 1. Lemma 3. Let X satisfy EX = 0 and X ψ2 τ. Then, E exp (λX) exp 5λ2τ 2 2 for all λ R. Proof. Note that X ψ2 τ implies that (E|X|p)1/p τ p. for all p. Theorem 3.10 of (Duchi, 2019) shows that this implies the stated condition. B. Proof of Theorem 1 Theorem 1. Let ˆπ be the output policy of Algorithm 1 with λ > 0. Define, ϵ(π, ˆπ) = EX [|f(X, π(X)) φ(X, π(X)), θ |] + EX [| φ(X, ˆπ(X)), θ f(X, ˆπ(X))|] where θ arg minθ Rd P i [n] φ(xi, ai) θ f(xi, ai) 2 . If θ d, then with probability at least 1 δ, for any policy π (including the optimal policy), Reg(π, ˆπ) is bounded above by ϵ(π, ˆπ) | {z } approx. error + 2βλ,δ(n, d) | {z } complexity EX φ(X, π(X)) V 1 | {z } coverage d n EX φ(X, π(X)) V 1 where V is the regularized empirical covariance matrix of the data, defined in Algorithm 15. 5Throughout the paper, we use e O to omit polylog factors of problem-dependent parameters such as δ 1, dimension d, and number of model classes M (defined in Section 3.2). 
Model Selection in Batch Policy Optimization We first leverage a basic result in the analysis of fixed design linear regression problems. For convenience, we will write φi = φ(xi, ai), fi = f(xi, ai) and noise ηi = ηi(ai). That is, yi = fi + ηi. Recall the following definitions: i [n] φiφ i (5) θ arg min θ Rd 1 n φ i θ fi 2 (7) B.1. Concentration Lemma 4. Conditioned on (xi, ai)i [n], with probability at least 1 δ, C1d + C2d1/2 log1/2(1/δ) + C3 log(1/δ) where C1 = 5, C2 = 10, and C3 = 10. Proof. From the definition of ˆθ, we have i φi (fi + ηi) θ V (9) i [n] φiφ i θ + 1 i φiηi θ V (10) where in the last equality we have used the fact that θ is the solution to minθ P i(φ i θ fi)2 and therefore satisfies the normal equations: X i φiφ i θ = X i φifi (11) i φiηi V (12) n V 1θ V + 1 i φiηi V (13) i φiηi (14) where the last inequality follows since σ1/2 max(V 1) (λ/n) 1/2. To bound the second term, we apply the Lemma 5, stated below, which is an application of standard fixed-design linear regression results of (Hsu et al., 2012b). This shows that i φiηi 2 5d + 10 p d log(1/δ) + 10 log(1/δ) with probability at least 1 δ. Applying this to the previous bound on θ θ V gives the result. Model Selection in Batch Policy Optimization i φiηi 2 > 5d + 10 p d log(1/δ) + 10 log(1/δ) n | x1:n, a1:n Proof of Lemma 5. Let η = (η1, . . . , ηn) and Φ = φ1 φn . Note then that 1 1 n λId + Φ Φ 1/2 Φ η , where η is a random 1-sub-Gaussian random vector consisting of independent entries. Note that we have that Eηi = 0 and ηi ψ2 1, which, by Lemma 3 implies E exp (λX) exp λ2σ2 2 where σ2 5. The concentration result then follows from a version of the Hanson-Wright inequality due to (Hsu et al., 2012a), a restatement of which can be found in Lemma 12: Aη 2 > C1 tr(AA ) + C2 q tr((AA )2) log(1/δ) + C3σ2 AA log(1/δ) δ (15) where A = λId + Φ Φ 1/2 Φ and C1 = 5, C2 = 10, and C3 = 10. We may then bound the relevant quantities. Let Φ Φ = UΛU be the spectral decomposition of Φ Φ where U is unitary and Λ 0 is diagonal. tr(AA ) = tr λId + Φ Φ 1/2 Φ Φ λId + Φ Φ 1/2 (16) = tr λId + Φ Φ 1 Φ Φ (17) tr (λId + Λ) 1Λ (18) tr((AA )2) = tr λId + Φ Φ 1/2 Φ Φ λId + Φ Φ 1 Φ Φ λId + Φ Φ 1/2 (20) tr (λId + Λ) 1 Λ(λId + Λ) 1Λ (21) AA = λId + Φ Φ 1/2 Φ Φ λId + Φ Φ 1/2 (23) = U(λId + Λ) 1/2Λ(λId + Λ) 1/2U (24) where the very last inequality uses the that the unitary matrix preserves the norm an the maximal eigenvalue of (λId + Λ) 1/2Λ(λId + Λ) 1/2 is at most 1. We therefore conclude that i φiηi 2 C1d + C2 p d log(1/δ) + C3 log(1/δ) with probability at least 1 δ. B.2. Full Proof Proof of Theorem 1. for this proof, all expectations E denote EX, the expectation over the state random variable X from D. By adding and subtracting, the regret may be decomposed simply as Reg(π, ˆπ) = E [f(X, π(X)) φ(X, π(X)), θ ] (27) + E [ φ(X, ˆπ(X)), θ f(X, ˆπ(X))] (28) + E [ φ(X, π(X)), θ φ(X, ˆπ(X)), θ ] (29) ϵ(π, ˆπ) + E [ φ(X, π(X)), θ φ(X, ˆπ(X)), θ ] (30) Model Selection in Batch Policy Optimization For the remainder of the proof, we focus on bounding the second term. Adding and subtracting again, we have E [ φ(X, π(X)) φ(X, ˆπ(X)), θ ] (31) E h φ(X, π(X)) θ φ(X, ˆπ(X)) ˆθ i + E h φ(X, ˆπ(X)) ˆθ φ(X, ˆπ(X)) θ i (32) E h φ(X, π(X)) θ φ(X, ˆπ(X)) ˆθ i + ˆθ θ V E φ(X, ˆπ(X)) V 1 (33) where the last inequality is due to Cauchy-Schwarz. Now, we apply the result of Lemma 4 to get that the event C1d + C2d1/2 log1/2(1/δ) + C3 log(1/δ) βλ,δ(n, d) (35) occurs with probability at least 1 δ for absolute constants C1, C2, C3 > 0 defined there. 
Conditioning on this event, we have
$$\mathbb{E}\left[\langle\phi(X, \pi(X)) - \phi(X, \hat\pi(X)), \theta_*\rangle\right] \leq \mathbb{E}\left[\phi(X, \pi(X))^\top\theta_* - \phi(X, \hat\pi(X))^\top\hat\theta\right] + \beta_{\lambda,\delta}(n, d)\,\mathbb{E}\|\phi(X, \hat\pi(X))\|_{V^{-1}}$$
$$\leq \mathbb{E}\left[\phi(X, \pi(X))^\top\theta_* - \phi(X, \pi(X))^\top\hat\theta\right] + \beta_{\lambda,\delta}(n, d)\,\mathbb{E}\|\phi(X, \pi(X))\|_{V^{-1}}$$
$$\leq \|\hat\theta - \theta_*\|_V\,\mathbb{E}\|\phi(X, \pi(X))\|_{V^{-1}} + \beta_{\lambda,\delta}(n, d)\,\mathbb{E}\|\phi(X, \pi(X))\|_{V^{-1}}$$
$$\leq 2\beta_{\lambda,\delta}(n, d)\,\mathbb{E}\|\phi(X, \pi(X))\|_{V^{-1}},$$
where the second inequality applies the penalized action selection for the policy $\hat\pi$, the third inequality applies Cauchy–Schwarz, and the last inequality once again applies the condition on the concentration of $\|\hat\theta - \theta_*\|_V$.

B.3. Discussion of Approximation Error

In this paper, we work with a fairly general notion of approximation error $\epsilon_k(\pi, \hat\pi)$. Note that it depends both on the comparator policy $\pi$ and the learned policy $\hat\pi$, and it tends to be small when $\theta_*$ produces rewards similar to $f$ on both of these policies. It is exactly zero when realizability is satisfied, i.e., $f \in \mathcal{F}_k$, as is assumed in most related work. The reason for this choice is that it allows a large degree of flexibility, as many natural alternatives upper bound it, for example those given below. Here we point out two alternatives that appear frequently in bandit and RL theory. (A toy numerical comparison of the two notions is sketched at the end of this subsection.)

1. Perhaps the most common assumption is a worst-case difference between $f$ and the model class $\mathcal{F}_k$ (Jin et al., 2019; Foster & Rakhlin, 2020):
$$\epsilon_{k,\text{worst-case}} = \min_{\hat f \in \mathcal{F}_k}\;\sup_{x \in \mathcal{X}, a \in \mathcal{A}}\;|f(x, a) - \hat f(x, a)|.$$
The obvious disadvantage of this version is that certain states or contexts might be irrelevant but still lead to large prediction errors. Furthermore, the minimizing $\hat f$ does not generally have any convenient statistical properties (e.g., satisfying first-order optimality conditions in the linear case with respect to some relevant distribution).

2. Versions of the minimum squared error are also commonly used (Chen & Jiang, 2019) when it is assumed that the data is generated from a behavior policy $\mu$, an assumption we do not consider until Section 5:
$$\epsilon_{k,\text{sq}} = \min_{\hat f \in \mathcal{F}_k}\;\mathbb{E}_\mu\left(\hat f(X, A) - f(X, A)\right)^2.$$
This is a natural formulation from a statistical perspective as well, and it partially remedies some of the problems of the worst-case approximation error since we care only about those states and actions induced under $\mu$. Unfortunately, it typically brings a concentrability coefficient into the mix. We leverage this as an upper bound to achieve some positive results in Section 5.2.

We remark that numerous prior works assume that an upper bound on the approximation error is known. However, it is unrealistic in practice to expect such information to be available, and it furthermore trivializes the model selection problem. While we consider upper bounds on $\epsilon_k(\pi, \hat\pi)$ in Section 5.2, we make no assumption that the value of this upper bound is known.
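As a concrete illustration of the difference between the two notions above, the following toy sketch computes both for a misspecified one-dimensional linear class on a finite set of state-action pairs. The scalar feature map, the Dirichlet behavior distribution, and the grid search over $\theta$ are illustrative assumptions of this sketch, not part of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 20                                     # finite set of (x, a) pairs
phi = rng.normal(size=n_pairs)                   # scalar feature phi(x, a)
f = 0.7 * phi + 0.3 * rng.normal(size=n_pairs)   # true mean rewards (class is misspecified)
mu = rng.dirichlet(np.ones(n_pairs))             # behavior distribution over the pairs

# Worst-case notion: min_theta max_{x,a} |f(x,a) - theta * phi(x,a)|, via a grid search.
grid = np.linspace(-5.0, 5.0, 20001)
eps_worst = min(np.max(np.abs(f - phi * t)) for t in grid)

# Behavior-weighted notion: min_theta E_mu (theta * phi - f)^2, closed form in 1-D.
theta_mu = np.sum(mu * phi * f) / np.sum(mu * phi ** 2)
eps_sq = np.sum(mu * (phi * theta_mu - f) ** 2)

print(f"worst-case error: {eps_worst:.3f}, mu-weighted squared error: {eps_sq:.3f}")
```

The sup-norm minimizer is found only approximately here, whereas the $\mu$-weighted minimizer is a weighted least-squares solution, illustrating the statistical convenience noted in item 2.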
C. Proof of Theorem 2

Theorem 2. Let $\mathcal{F}_1, \ldots, \mathcal{F}_M$ be a particular nested collection of linear model classes (Definition 2). For any $\alpha > 0$, there is $n = \Theta(\alpha^2)$ such that, for any algorithm $\mathcal{A}$, there is a contextual bandit instance $\mathcal{D}$ with comparator $\pi$ and a dataset $D$ of $n$ interactions consistent with $\mathcal{D}$ that satisfies
$$\mathbb{E}_D\left[\mathrm{Reg}\left(\pi, \mathcal{A}(\mathcal{F}_{1:M}, D)\right)\right] \geq \alpha\cdot\min_{k \in [M]}\left\{\epsilon_k(\pi, \hat\pi_k) + \sqrt{\frac{d_k}{n}}\,\mathbb{E}_X\|\phi_k(X, \pi(X))\|_{V_k^{-1}}\right\}.$$

Let us begin by describing the dataset and observations for a given contextual bandit instance. The set of contextual bandit instances is determined by the possible distributions $\mathcal{D}$ over state–reward pairs. For our construction, we fix the state-action pairs in the dataset $\{(x_i, a_i)\}_{i=1}^n$ across all instances that we consider. To be clear, the distribution of the rewards in the dataset under different instances will differ, but the covariates are held fixed. Note that this is not necessary for the lower bound, but it suffices for our example.

Proof of Theorem 2. The main idea of the proof is to construct a difficult contextual bandit problem with $\mathcal{A} = \{a_1, a_2\}$ and show that the oracle can leverage a pair of model classes satisfying Definition 2 to achieve small regret. The hardness of the contextual bandit problem will be shown via a reduction to a multi-armed bandit (MAB). The model classes will be chosen such that one is well-specified, while the other has no approximation error in some instances but large approximation error in others.

To start, consider a class of two two-armed bandit instances $\mathcal{E} = \{\nu_1, \nu_2\}$ (i.e., no states) which are identified by their product distributions over the rewards of both arms. We let $\nu_1 = N(-\Delta, 1)\otimes N(-2\Delta, 1)$ and $\nu_2 = N(-\Delta, 1)\otimes N(0, 1)$ for $\Delta > 0$ to be determined later. That is, across both instances, arm $a_1$ has the same reward distribution, but arm $a_2$ can have mean either $-2\Delta$ or $0$. We let $\mathbb{E}_{\nu_i}$ denote the expectation associated with instance $\nu_i$. We let $D$ be a dataset consisting of $n_1 > 0$ samples from $a_1$ and $n_2 > 0$ samples from $a_2$, with $n_1$ and $n_2$ to be determined precisely later. Such a construction is similar to that of standard lower bounds in bandits (Bubeck et al., 2013; Lattimore & Szepesvári, 2018); however, since we are in the offline setting, the dataset is given. We now establish a regret lower bound for any algorithm $\mathcal{A}$ that outputs an arm $\mathcal{A}(D) \in \{a_1, a_2\}$ as a function of the data $D$. The next lemma follows from a standard application of Le Cam's two-point method and similar results for the offline multi-armed bandit problem (Xiao et al., 2021).

Lemma 6. Let $\Delta = \frac{1}{2\sqrt{n_2}}$. Then, for any algorithm $\mathcal{A}$,
$$\max_{i,j}\;\mathbb{E}_{\nu_i}\left[Y(a_j) - Y(\mathcal{A}(D))\right] \geq \frac{1}{8\sqrt{n_2}}.$$

Henceforth, we define $\Delta := \frac{1}{2\sqrt{n_2}}$. We now construct a linear contextual bandit instance and apply a reduction to the MAB setting so that we may leverage the stated lower bound. Let $\mathcal{X}$ be a singleton (that is, states have no effect) and again $\mathcal{A} = \{a_1, a_2\}$. Since there is only a single state, we omit the notational dependence of policies and functions on the state. (For a deterministic policy $\pi$, we simply use $\pi \in \mathcal{A}$ to denote the selected action, and $n_\pi$ denotes the number of samples of arm $\pi$ in the dataset.) We again consider two instances $\mathcal{E} = \{\nu_1, \nu_2\}$, each of which governs the data distribution, denoted by $\mathcal{D}_{\nu_i}$. For $\mathcal{D}_{\nu_1}$, we set $Y \sim N(-\Delta, 1)\otimes N(-2\Delta, 1)$ and, for $\mathcal{D}_{\nu_2}$, we set $Y \sim N(-\Delta, 1)\otimes N(0, 1)$, where $\Delta$ is defined above. Note that this ensures that the noise in either instance is given by the centered standard normal distribution: $(Y(a) - f(a)) \sim N(0, 1)$ for all $a \in \mathcal{A}$. We use $\pi^\star$ to denote the optimal policy (action), which depends on the instance: in $\nu_1$ we have $\pi^\star = a_1$, and in $\nu_2$ we have $\pi^\star = a_2$. Finally, we assume that the batch dataset $D$ again consists of $n_1 > 0$ samples of $a_1$ and $n_2 > 0$ samples of $a_2$ with $n_1 \geq n_2$, giving a total of $n = n_1 + n_2$ samples. Exact quantities will be determined at the end.

Next, we construct two linear model classes $\mathcal{F}_1$ and $\mathcal{F}_2$. For $\mathcal{F}_1$, we use the following one-dimensional feature map $\phi_1 : \mathcal{A}\to\mathbb{R}$:
$$\phi_1(a) = \begin{cases} 1 & a = a_1 \\ 0 & a = a_2 \end{cases}.$$
$\mathcal{F}_1$ thus has some ability to make predictions about the mean of $a_1$, but the features are trivial for $a_2$, potentially leading to approximation error. For $\mathcal{F}_2$, we set $\phi_2 : \mathcal{A}\to\mathbb{R}^2$ as
$$\phi_2(a) = \begin{cases} (1, 0) & a = a_1 \\ (0, 1) & a = a_2 \end{cases}.$$
Note that this model is well-specified, as $f(a) = \phi_2(a)^\top\theta$ by setting $\theta = (f(a_1), f(a_2))^\top$.
It is also evident that $\mathcal{F}_1$ and $\mathcal{F}_2$ are nested according to Definition 2. Define the least-squares solutions
$$\theta_{k,*} = \arg\min_{\theta \in \mathbb{R}^{d_k}}\;\sum_{i \in [n]}\left(\phi_k(a^{(i)})^\top\theta - f(a^{(i)})\right)^2 \quad\text{for } k \in \{1, 2\},$$
where $a^{(i)}$ denotes the action of the $i$th datapoint in $D$. It is easy to verify that the following conditions hold as long as $n_1 > 0$ and $n_2 > 0$:
1. In $\nu_1$: $\theta_{1,*} = -\Delta$ and $\theta_{2,*} = (-\Delta, -2\Delta)$.
2. In $\nu_2$: $\theta_{1,*} = -\Delta$ and $\theta_{2,*} = (-\Delta, 0)$.

We now summarize the estimation error and approximation error for both model classes. The contribution of the estimation error follows directly from the definitions. We summarize the results in the following fact. Note that we need not include expectations over the state since there is only one state.

Fact 1. Let $V_k = \frac{1}{n}\left(\lambda I_{d_k} + \sum_{i \in [n]}\phi_k(a^{(i)})\phi_k(a^{(i)})^\top\right)$ for $k \in \{1, 2\}$ and $\lambda > 0$. For any instance in $\mathcal{E}$ and comparator $\pi \in \mathcal{A}$, the following inequalities hold:
$$\|\phi_1(\pi)\|_{V_1^{-1}} \leq \sqrt{\frac{n}{n_1}}, \qquad \|\phi_2(\pi)\|_{V_2^{-1}} \leq \sqrt{\frac{n}{n_\pi}}.$$

Proof. The results follow by direct calculation, using the fact that $n_1, n_2 > 0$.

For the approximation error, we can clearly see that, for model class $\mathcal{F}_2$, $\epsilon_2(\pi, \hat\pi) = 0$ for all instances in $\mathcal{E}$ and all $\hat\pi$ and $\pi$, since the model is well-specified. For $\mathcal{F}_1$, the approximation error depends on the instance. Observe that in $\nu_2$ we have
$$\epsilon_1(\pi, \hat\pi) = |\langle\phi_1(\hat\pi), \theta_{1,*}\rangle - f(\hat\pi)| + |\langle\phi_1(\pi), \theta_{1,*}\rangle - f(\pi)| = 0$$
regardless of what $\pi$ and $\hat\pi$ are. Furthermore, in $\nu_1$ we have $\epsilon_1(\pi^\star, \hat\pi) \leq 2\Delta$ in the worst case.

Combining the results for both estimation error and approximation error, in instance $\nu_1$ we have
$$\epsilon_1(\pi^\star, \hat\pi) + \sqrt{\tfrac{d_1}{n}}\,\|\phi_1(\pi^\star)\|_{V_1^{-1}} \leq 2\Delta + \tfrac{1}{\sqrt{n_1}} = \tfrac{1}{\sqrt{n_2}} + \tfrac{1}{\sqrt{n_1}}, \qquad \epsilon_2(\pi^\star, \hat\pi) + \sqrt{\tfrac{d_2}{n}}\,\|\phi_2(\pi^\star)\|_{V_2^{-1}} \leq \sqrt{\tfrac{2}{n_1}},$$
and in instance $\nu_2$ we have
$$\epsilon_1(\pi^\star, \hat\pi) + \sqrt{\tfrac{d_1}{n}}\,\|\phi_1(\pi^\star)\|_{V_1^{-1}} \leq \tfrac{1}{\sqrt{n_1}}, \qquad \epsilon_2(\pi^\star, \hat\pi) + \sqrt{\tfrac{d_2}{n}}\,\|\phi_2(\pi^\star)\|_{V_2^{-1}} \leq \sqrt{\tfrac{2}{n_2}}.$$
Therefore, we conclude that
$$\min_{k}\left\{\epsilon_k(\pi^\star, \hat\pi) + \sqrt{\tfrac{d_k}{n}}\,\mathbb{E}_X\|\phi_k(X, \pi^\star)\|_{V_k^{-1}}\right\} \leq \frac{\sqrt{2}}{\sqrt{n_1}}$$
for all $\nu \in \mathcal{E}$. Furthermore, note that this contextual bandit setting reduces exactly to the MAB problem, and thus Lemma 6 requires that the regret of any algorithm $\mathcal{A}$ be lower bounded as
$$\mathbb{E}_\nu\left[\mathrm{Reg}(\pi^\star, \mathcal{A}(D))\right] \geq \frac{1}{8\sqrt{n_2}}$$
for some $\nu \in \mathcal{E}$ with our given choice of $\Delta$. Therefore, there is a constant $C_1 > 0$ such that, for any algorithm $\mathcal{A}$, there exists $\nu \in \mathcal{E}$ satisfying
$$\mathbb{E}_\nu\left[\mathrm{Reg}(\pi^\star, \mathcal{A}(D))\right] \geq C_1\sqrt{\frac{n_1}{n_2}}\,\min_{k}\left\{\epsilon_k(\pi^\star, \hat\pi_k) + \sqrt{\frac{d_k}{n}}\,\mathbb{E}_X\|\phi_k(X, \pi^\star)\|_{V_k^{-1}}\right\}.$$
Finally, we are left with choosing $n_1$ and $n_2$ as the numbers of samples in the dataset (which are the same across all of the instances). Choosing $n_1 = \Omega(\alpha^2 n_2)$ ensures the claim, which is possible since it was assumed that $n = \Omega(\alpha^2)$.

C.1. Proof of Lemma 6

Proof. Note that
$$\max_{i,j}\;\mathbb{E}_{\nu_i}\left[Y(a_j) - Y(\mathcal{A}(D))\right] = \max_i\;\mathbb{E}_{\nu_i}\left[Y(a_i) - Y(\mathcal{A}(D))\right]$$
by definition of the instances $\nu_1$ and $\nu_2$. For convenience, we simply write $\mathcal{A}$ instead of $\mathcal{A}(D)$. Then,
$$\max_i\;\mathbb{E}_{\nu_i}\left[Y(a_i) - Y(\mathcal{A})\right] \geq \frac{1}{2}\left(\mathbb{E}_{\nu_1}\left[Y(a_1) - Y(\mathcal{A})\right] + \mathbb{E}_{\nu_2}\left[Y(a_2) - Y(\mathcal{A})\right]\right)$$
$$\geq \frac{\Delta}{2}\left(\mathbb{P}_{\nu_1}(\mathcal{A} \neq a_1) + \mathbb{P}_{\nu_2}(\mathcal{A} \neq a_2)\right) \geq \frac{\Delta}{2}\left(1 - \|\mathbb{P}_{\nu_1} - \mathbb{P}_{\nu_2}\|_{TV}\right) \geq \frac{\Delta}{2}\left(1 - \sqrt{\tfrac{1}{2}D_{\mathrm{KL}}\!\left(\mathbb{P}_{\nu_1}\,\|\,\mathbb{P}_{\nu_2}\right)}\right),$$
where the last two inequalities follow from the definition of the total variation distance and Pinsker's inequality, respectively. Then, we apply the tensorization of the KL divergence over the product distribution induced by the dataset to get
$$D_{\mathrm{KL}}\left(\mathbb{P}_{\nu_1}\,\|\,\mathbb{P}_{\nu_2}\right) = n_1 D_{\mathrm{KL}}\left(N(-\Delta, 1)\,\|\,N(-\Delta, 1)\right) + n_2 D_{\mathrm{KL}}\left(N(-2\Delta, 1)\,\|\,N(0, 1)\right) = n_2 D_{\mathrm{KL}}\left(N(-2\Delta, 1)\,\|\,N(0, 1)\right).$$
For normal distributions, $D_{\mathrm{KL}}\left(N(-2\Delta, 1)\,\|\,N(0, 1)\right) = 2\Delta^2$. Therefore, choosing $\Delta = \frac{1}{2\sqrt{n_2}}$, we have
$$\max_i\;\mathbb{E}_{\nu_i}\left[Y(a_i) - Y(\mathcal{A})\right] \geq \frac{\Delta}{2}\left(1 - \tfrac{1}{2}\right) = \frac{1}{8\sqrt{n_2}}.$$

D. Proof of Theorem 3

Theorem 3. Given arbitrary linear model classes $\mathcal{F}_1, \ldots, \mathcal{F}_M$, Algorithm 2 outputs a policy $\hat\pi$ such that, with probability at least $1 - \delta$, for any comparator policy $\pi$, the regret $\mathrm{Reg}(\pi, \hat\pi)$ is bounded above by
$$\min_{k \in [M]}\left\{2\beta_{\lambda,\delta/M}(n, d_k)\,\mathbb{E}_X\|\phi_k(X, \pi(X))\|_{V_k^{-1}}\right\} + \sum_{k \in [M]}\epsilon_k(\pi, \hat\pi).$$
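Before the proof, the following is a minimal sketch of the selection rule analyzed below, which chooses the model class and the action jointly to maximize a penalized value estimate. It is not a verbatim transcription of Algorithm 2; the per-class estimates `theta_hats[k]`, matrices `V_invs[k]`, and radii `betas[k]` (standing in for $\beta_{\lambda,\delta/M}(n, d_k)$) are assumed to be computed as in the single-class sketch given after Theorem 1.

```python
import numpy as np

def select_class_and_action(x, actions, feature_maps, theta_hats, V_invs, betas):
    """Jointly maximize the penalized value over model classes k and actions a."""
    best_score, best_k, best_a = -np.inf, None, None
    for k, phi_k in enumerate(feature_maps):       # feature_maps[k](x, a) -> R^{d_k}
        for a in actions:
            p = phi_k(x, a)
            score = p @ theta_hats[k] - betas[k] * np.sqrt(p @ V_invs[k] @ p)
            if score > best_score:
                best_score, best_k, best_a = score, k, a
    return best_k, best_a   # hat{k}(x) and hat{pi}(x)
```

This is exactly the pessimistic rule of the single-class case, with the maximization extended over the $M$ classes; the proof below exploits the resulting inequality against any fixed class $k$.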
Proof of Theorem 3. For all $k \in [M]$, let $\theta_{k,*}$ denote the solution to
$$\min_{\theta \in \mathbb{R}^{d_k}}\;\sum_{i \in [n]}\left(\phi_k(x_i, a_i)^\top\theta - f(x_i, a_i)\right)^2.$$
Throughout the proof, all expectations $\mathbb{E}$ denote $\mathbb{E}_X$ over $\mathcal{D}$ and, within expectations over $X$, we write $\hat k := \hat k(X)$ for shorthand. From the definition of regret, we have
$$\mathrm{Reg}(\pi, \hat\pi) = \mathbb{E}\left[f(X, \pi(X)) - f(X, \hat\pi(X))\right]$$
$$= \mathbb{E}\left[f(X, \pi(X)) - \langle\phi_{\hat k}(X, \pi(X)), \theta_{\hat k,*}\rangle\right] + \mathbb{E}\left[\langle\phi_{\hat k}(X, \hat\pi(X)), \theta_{\hat k,*}\rangle - f(X, \hat\pi(X))\right] + \mathbb{E}\left[\langle\phi_{\hat k}(X, \pi(X)), \theta_{\hat k,*}\rangle - \langle\phi_{\hat k}(X, \hat\pi(X)), \theta_{\hat k,*}\rangle\right].$$
We now focus on bounding the contribution of the estimation error to the regret. Using the condition in Lemma 4 to bound each $\|\hat\theta_k - \theta_{k,*}\|_{V_k}$, we have
$$\mathbb{E}\left[\langle\phi_{\hat k}(X, \pi(X)), \theta_{\hat k,*}\rangle - \langle\phi_{\hat k}(X, \hat\pi(X)), \theta_{\hat k,*}\rangle\right]$$
$$\leq \mathbb{E}\left[\langle\phi_{\hat k}(X, \pi(X)), \theta_{\hat k,*}\rangle - \langle\phi_{\hat k}(X, \hat\pi(X)), \hat\theta_{\hat k}\rangle\right] + \mathbb{E}\left[\|\hat\theta_{\hat k} - \theta_{\hat k,*}\|_{V_{\hat k}}\,\|\phi_{\hat k}(X, \hat\pi(X))\|_{V_{\hat k}^{-1}}\right]$$
$$\leq \mathbb{E}\left[\langle\phi_{\hat k}(X, \pi(X)), \theta_{\hat k,*}\rangle - \langle\phi_{\hat k}(X, \hat\pi(X)), \hat\theta_{\hat k}\rangle\right] + \mathbb{E}\left[\beta_{\lambda,\delta}(n, d_{\hat k})\,\|\phi_{\hat k}(X, \hat\pi(X))\|_{V_{\hat k}^{-1}}\right].$$
Next, we apply the selection rule that determines $\hat k$ and $\hat\pi$ simultaneously, both of which are designed to maximize the penalized value estimate across actions and model classes. For any fixed $k \in [M]$, the previous display is bounded by
$$\mathbb{E}\left[\langle\phi_{\hat k}(X, \pi(X)), \theta_{\hat k,*}\rangle - \langle\phi_k(X, \pi(X)), \hat\theta_k\rangle\right] + \beta_{\lambda,\delta}(n, d_k)\,\mathbb{E}\left\|\phi_k(X, \pi(X))\right\|_{V_k^{-1}}. \qquad (49)$$
There is now a potential mismatch between the predictions under $\theta_{\hat k,*}$ and the predictions under $\hat\theta_k$. To handle this, we turn to the approximation error. For $k \in [M]$, define $\epsilon_k(X) := |f(X, \pi(X)) - \phi_k(X, \pi(X))^\top\theta_{k,*}|$. The first term in the previous display can be bounded using predictions under model class $k$ up to additive factors of $\epsilon_{\hat k}(X)$ and $\epsilon_k(X)$:
$$\langle\phi_{\hat k}(X, \pi(X)), \theta_{\hat k,*}\rangle \leq |\langle\phi_{\hat k}(X, \pi(X)), \theta_{\hat k,*}\rangle - f(X, \pi(X))| + |\langle\phi_k(X, \pi(X)), \theta_{k,*}\rangle - f(X, \pi(X))| + \langle\phi_k(X, \pi(X)), \theta_{k,*}\rangle$$
$$= \epsilon_{\hat k}(X) + \epsilon_k(X) + \langle\phi_k(X, \pi(X)), \theta_{k,*}\rangle.$$
Then, conditioned on the same event from Lemma 4 and using the approximation error bound above, we may further bound (49) by
$$\mathbb{E}\left[\langle\phi_k(X, \pi(X)), \theta_{k,*}\rangle - \langle\phi_k(X, \pi(X)), \hat\theta_k\rangle\right] + \mathbb{E}\left[\epsilon_{\hat k}(X) + \epsilon_k(X)\right] + \beta_{\lambda,\delta}(n, d_k)\,\mathbb{E}\|\phi_k(X, \pi(X))\|_{V_k^{-1}}$$
$$\leq \mathbb{E}\left[\epsilon_{\hat k}(X) + \epsilon_k(X)\right] + 2\beta_{\lambda,\delta}(n, d_k)\,\mathbb{E}\|\phi_k(X, \pi(X))\|_{V_k^{-1}}.$$
Applying this upper bound to the regret and using the fact that it holds for any fixed $k \in [M]$, we get
$$\mathrm{Reg}(\pi, \hat\pi) \leq \mathbb{E}\left[f(X, \pi(X)) - \langle\phi_{\hat k}(X, \pi(X)), \theta_{\hat k,*}\rangle\right] + \mathbb{E}\left[\langle\phi_{\hat k}(X, \hat\pi(X)), \theta_{\hat k,*}\rangle - f(X, \hat\pi(X))\right] + \mathbb{E}\left[\epsilon_{\hat k}(X)\right] + \min_{k \in [M]}\mathbb{E}\left[\epsilon_k(X) + 2\beta_{\lambda,\delta}(n, d_k)\,\|\phi_k(X, \pi(X))\|_{V_k^{-1}}\right] \qquad (50)$$
$$\leq \sum_{k \in [M]}\epsilon_k(\pi, \hat\pi) + \min_{k \in [M]}\left\{2\beta_{\lambda,\delta}(n, d_k)\,\mathbb{E}_X\|\phi_k(X, \pi(X))\|_{V_k^{-1}}\right\}.$$
Note here that $\hat k$ depends on $X$, and thus we cannot readily replace the $\hat k$-dependent approximation-error terms in (50) with $\max_{k'}\epsilon_{k'}(\pi, \hat\pi)$. The sum over approximation errors is therefore used to simplify the bound in terms of the quantities $\epsilon_{k'}(\pi, \hat\pi)$ alone. Finally, note that Lemma 4 establishes concentration of $\|\hat\theta_k - \theta_{k,*}\|_{V_k}$ for all $k \in [M]$ simultaneously with probability at least $1 - M\delta$. Changing variables $\delta' = M\delta$ proves the result.

E. Proof of Theorem 4

The SLOPE-inspired algorithm for balancing complexity and approximation error is stated completely in Algorithm 3.

E.1. Single Model Guarantee

We first begin with an independent result for a single $d$-dimensional model class $\mathcal{F}$, which shows that one can bound the regret of a learned policy $\hat\pi$ by the approximation error $\epsilon$ of the optimal parameter $\bar\theta$ plus an estimation error term that depends on the complexity $d$ of the model class. For clarity of notation, we drop the model class index $k$ from the subscripts. Recall the definitions $\phi_i := \phi(x_i, a_i)$ and $\hat\theta = V^{-1}\frac{1}{n}\sum_{i \in [n]}\phi_i y_i$, and define
$$\hat\pi(x) \in \arg\max_{a \in \mathcal{A}}\;\langle\hat\theta, \phi(x, a)\rangle, \qquad \bar\theta \in \arg\min_{\theta \in \mathbb{R}^d}\;\mathbb{E}_\mu\left(\langle\phi(X, A), \theta\rangle - f(X, A)\right)^2, \qquad \epsilon = \min_{\theta \in \mathbb{R}^d}\;2\sqrt{C(\mu)\,\mathbb{E}_\mu\left(\langle\phi(X, A), \theta\rangle - f(X, A)\right)^2}.$$
Proposition 1. For the above definitions, the following inequality holds with probability at least $1 - \delta$:
$$\|\hat\theta - \bar\theta\|_V \leq \sqrt{\frac{\lambda}{n}} + C_4\sqrt{\frac{d}{n}}\,\|V^{-1/2}\|\log(4d/\delta) + \sqrt{\frac{C_1 d + C_2\sqrt{d\log(4d/\delta)} + C_3\log(4d/\delta)}{n}}.$$
Furthermore, under the same event, the regret $\mathrm{Reg}(\pi, \hat\pi)$ is bounded above by
$$\epsilon + 2\left(\sqrt{\frac{\lambda}{n}} + C_4\sqrt{\frac{d}{n}}\,\|V^{-1/2}\|\log(4d/\delta) + \sqrt{\frac{C_1 d + C_2\sqrt{d\log(4d/\delta)} + C_3\log(4d/\delta)}{n}}\right)\mathbb{E}_X\max_a\|\phi(X, a)\|_{V^{-1}},$$
where $C_4 = 192$ and $C_{1:3}$ are defined in Lemma 4.

Proof. The regret decomposes as
$$\mathrm{Reg}(\pi, \hat\pi) = \mathbb{E}_X\left[f(X, \pi(X)) - f(X, \hat\pi(X))\right] = \mathbb{E}_X\left[f(X, \pi(X)) - \langle\phi(X, \pi(X)), \bar\theta\rangle + \langle\phi(X, \hat\pi(X)), \bar\theta\rangle - f(X, \hat\pi(X))\right] + \mathbb{E}_X\left[\langle\phi(X, \pi(X)) - \phi(X, \hat\pi(X)), \bar\theta\rangle\right].$$
By Jensen's inequality and Definition 1, the first expectation can be bounded as
$$\mathbb{E}_X\left[f(X, \pi(X)) - \langle\phi(X, \pi(X)), \bar\theta\rangle + \langle\phi(X, \hat\pi(X)), \bar\theta\rangle - f(X, \hat\pi(X))\right] \leq 2\sqrt{C(\mu)\,\mathbb{E}_\mu\left(\langle\phi(X, A), \bar\theta\rangle - f(X, A)\right)^2} = \epsilon.$$
For the second expectation, it remains to show that the policy $\hat\pi$ selects actions nearly as well as $\pi$ with respect to $\bar\theta$. For convenience, define $\phi(\pi) := \mathbb{E}_X\phi(X, \pi(X))$ for features $\phi$ and policy $\pi$. Then,
$$\langle\phi(\pi), \bar\theta\rangle - \langle\phi(\hat\pi), \bar\theta\rangle = \langle\phi(\pi), \bar\theta\rangle - \langle\phi(\hat\pi), \hat\theta\rangle + \langle\phi(\hat\pi), \hat\theta\rangle - \langle\phi(\hat\pi), \bar\theta\rangle$$
$$\leq \langle\phi(\pi), \bar\theta\rangle - \langle\phi(\pi), \hat\theta\rangle + \langle\phi(\hat\pi), \hat\theta\rangle - \langle\phi(\hat\pi), \bar\theta\rangle$$
$$\leq \|\bar\theta - \hat\theta\|_V\,\mathbb{E}_X\|\phi(X, \pi(X))\|_{V^{-1}} + \|\bar\theta - \hat\theta\|_V\,\mathbb{E}_X\|\phi(X, \hat\pi(X))\|_{V^{-1}} \leq 2\|\bar\theta - \hat\theta\|_V\,\mathbb{E}_X\max_a\|\phi(X, a)\|_{V^{-1}},$$
where the first inequality uses the fact that $\hat\pi$ selects actions to maximize the reward predicted by $\hat\theta$, the second applies Cauchy–Schwarz, and the last takes the worst-case action. Thus, we focus on the concentration of $\|\hat\theta - \bar\theta\|_V$. We use the previous definitions $\phi_i = \phi(x_i, a_i)$, $f_i := f(x_i, a_i)$, and $\eta_i := \eta_i(a_i)$. Furthermore, we define the error term $e_i = f_i - \phi_i^\top\bar\theta$. Then
$$\hat\theta - \bar\theta = V^{-1}\frac{1}{n}\sum_i\phi_i\left(\phi_i^\top\bar\theta + e_i + \eta_i\right) - \bar\theta = -\frac{\lambda}{n}V^{-1}\bar\theta + V^{-1}\frac{1}{n}\sum_i\phi_i\eta_i + V^{-1}\frac{1}{n}\sum_i\phi_i e_i.$$
The first term is bounded above as $\frac{\lambda}{n}\|V^{-1}\bar\theta\|_V \leq \sqrt{\frac{\lambda}{n}}\,\|\bar\theta\| \leq \sqrt{\frac{\lambda}{n}}$. For the second term, we appeal to Lemma 5 to show that
$$\frac{1}{n}\left\|V^{-1}\sum_i\phi_i\eta_i\right\|_V = \frac{1}{\sqrt{n}}\left\|\left(\lambda I_d + \Phi^\top\Phi\right)^{-1/2}\Phi^\top\eta\right\| \leq \sqrt{\frac{C_1 d + C_2\sqrt{d\log(1/\delta)} + C_3\log(1/\delta)}{n}}$$
conditionally on $x_{1:n}$ and $a_{1:n}$, with probability at least $1 - \delta$, for the constants $C_1, C_2, C_3 > 0$ defined in Lemma 5. For the third term, we note that the expectation inside the norm is zero and use Lemma 9 to show concentration, yielding
$$\frac{1}{n}\left\|V^{-1}\sum_i\phi_i e_i\right\|_V \leq 64(1 + 2\|\bar\theta\|)\sqrt{d/n}\,\|V^{-1/2}\|\log(2d/\delta) \leq C_4\sqrt{d/n}\,\|V^{-1/2}\|\log(2d/\delta)$$
with probability at least $1 - \delta$, for the constant $C_4 = 192$, since $\|\bar\theta\| \leq 1$ by assumption. Therefore, by the union bound, we conclude that
$$\|\hat\theta - \bar\theta\|_V \leq \sqrt{\frac{\lambda}{n}} + C_4\sqrt{\frac{d}{n}}\,\|V^{-1/2}\|\log(4d/\delta) + \sqrt{\frac{C_1 d + C_2\sqrt{d\log(4d/\delta)} + C_3\log(4d/\delta)}{n}}$$
with probability at least $1 - \delta$, where $C_4 = 192$ and $C_{1:3}$ are defined in Lemma 4.

Proposition 1 ensures the validity of choosing
$$\zeta_k(\delta) := \sqrt{\frac{\lambda_k}{n}} + 192\sqrt{\frac{d_k}{n}}\,\|V_k^{-1/2}\|\log(4d_k/\delta) + \sqrt{\frac{5d_k + 10\sqrt{d_k\log(4d_k/\delta)} + 10\log(4d_k/\delta)}{n}}.$$
Note that we have used the assumption that $\|\bar\theta_k\| \leq 1$.

E.2. An Improved Analysis of the SLOPE Estimator

To simplify notation, we denote $f(\pi) = \mathbb{E}_X f(X, \pi(X))$ and $\phi(\pi) = \mathbb{E}_X\phi(X, \pi(X))$ for deterministic policies $\pi$ and features $\phi$. Algorithm 3 relies on the validity of a version of the SLOPE estimator introduced by (Su et al., 2020). Recall that we construct the following value estimators:
$$\hat v_k(\pi) = \mathbb{E}_X\left\langle\phi_k(X, \pi(X)), \hat\theta_k\right\rangle.$$
Proposition 1 ensures that they satisfy the following guarantee.

Lemma 7. Let the event of Proposition 1 hold for all model classes $k \in [M]$. Then, for any $k$ and policy $\pi$,
$$|\hat v_k(\pi) - f(\pi)| \leq \epsilon_k + \zeta_k(\delta)\,\mathbb{E}_X\max_a\|\phi_k(X, a)\|_{V_k^{-1}}.$$

Proof. The event ensures that $\|\hat\theta_k - \bar\theta_k\|_{V_k} \leq \zeta_k(\delta)$; this holds for all model classes simultaneously with probability at least $1 - M\delta$. Therefore, we have
$$\hat v_k(\pi) - f(\pi) = \langle\phi_k(\pi), \hat\theta_k\rangle - f(\pi) = \langle\phi_k(\pi), \hat\theta_k\rangle - \langle\phi_k(\pi), \bar\theta_k\rangle + \langle\phi_k(\pi), \bar\theta_k\rangle - f(\pi)$$
$$\leq \left|\langle\phi_k(\pi), \hat\theta_k\rangle - \langle\phi_k(\pi), \bar\theta_k\rangle\right| + \sqrt{C(\mu)\,\mathbb{E}_\mu\left(\langle\phi_k(X, A), \bar\theta_k\rangle - f(X, A)\right)^2}$$
$$\leq \epsilon_k + \|\hat\theta_k - \bar\theta_k\|_{V_k}\,\mathbb{E}_X\max_a\|\phi_k(X, a)\|_{V_k^{-1}} \leq \epsilon_k + \zeta_k(\delta)\,\mathbb{E}_X\max_a\|\phi_k(X, a)\|_{V_k^{-1}},$$
where we have again used Jensen's inequality, concentrability in Definition 1, and Cauchy–Schwarz.
Next, we verify that $\epsilon_k$ is nonincreasing in $k$, while $\zeta_k(\delta)\,\mathbb{E}_X\max_a\|\phi_k(X, a)\|_{V_k^{-1}}$ is nondecreasing in $k$.

Lemma 8. The following conditions hold for all $k \in [M - 1]$:
1. $\epsilon_k \geq \epsilon_{k+1}$, and
2. $\zeta_k(\delta)\,\mathbb{E}_X\max_a\|\phi_k(X, a)\|_{V_k^{-1}} \leq \zeta_{k+1}(\delta)\,\mathbb{E}_X\max_a\|\phi_{k+1}(X, a)\|_{V_{k+1}^{-1}}$.

Proof. The first condition is straightforward: since the model classes are nested, for any $\theta \in \mathbb{R}^{d_k}$ there is $\theta' \in \mathbb{R}^{d_{k+1}}$ that equals $\theta$ in the top $d_k$ coordinates and is zero in the bottom $d_{k+1} - d_k$ coordinates. This $\theta'$ achieves the same excess risk as $\theta$, and therefore $\epsilon_k \geq \epsilon_{k+1}$. For the second condition, observe that one immediately has $\zeta_k(\delta) \leq \zeta_{k+1}(\delta)$. It suffices to show that the second factor is also nondecreasing. Lemma 11 shows that, in general, for nested vectors and positive definite matrices,
$$\|\phi_k(X, a)\|_{V_k^{-1}} \leq \|\phi_{k+1}(X, a)\|_{V_{k+1}^{-1}}, \qquad (53)$$
proving the claim.

We are now ready to prove that the SLOPE estimator from Algorithm 3 is adaptive. Note that this result is proved in a general form and may be of independent interest, as it requires fewer assumptions than that of (Su et al., 2020). Consider estimators $\hat v_1, \ldots, \hat v_M$ of a quantity $v \in \mathbb{R}$ and define
$$\hat k = \min\left\{k : |\hat v_k - \hat v_\ell| \leq 2\xi_k + 2\xi_\ell \text{ for all } \ell > k\right\}, \qquad \hat v := \hat v_{\hat k}.$$

Theorem 6. Let $\hat v_1, \ldots, \hat v_M$ be estimators of a quantity $v \in \mathbb{R}$ with parameters $(\psi_k)_{k \in [M]}$ and $(\xi_k)_{k \in [M]}$ satisfying
1. $|\hat v_k - v| \leq \psi_k + \xi_k$ for all $k \in [M]$,
2. $\psi_k \geq \psi_{k+1}$ for all $k \in [M - 1]$,
3. $\xi_k \leq \xi_{k+1}$ for all $k \in [M - 1]$.
Then the estimator $\hat v$ defined above satisfies
$$|\hat v - v| \leq C\min_k\{\psi_k + \xi_k\},$$
where $C = 5$.

Proof. Let $k^* = \arg\min_{k \in [M]}\{\psi_k + \xi_k\}$. To prove the claim, we handle two cases: (1) $\hat k < k^*$ and (2) $\hat k > k^*$; otherwise, the selection is already correct. In the first case, the interval of $\hat k$ intersects all intervals with larger index, including that of $k^*$. Therefore,
$$|v - \hat v_{\hat k}| \leq |\hat v_{\hat k} - \hat v_{k^*}| + |v - \hat v_{k^*}| \leq 2\xi_{\hat k} + 2\xi_{k^*} + \psi_{k^*} + \xi_{k^*} \leq 5(\psi_{k^*} + \xi_{k^*}),$$
where the last inequality uses $\xi_{\hat k} \leq \xi_{k^*}$, which holds since $\hat k < k^*$. In the second case, the index $i := \hat k - 1$ satisfies $i \geq k^*$ and, by minimality of $\hat k$, fails the selection condition: there exists $j \in [\hat k, M]$ with
$$2\xi_i + 2\xi_j < |\hat v_i - \hat v_j| \leq \psi_i + \xi_i + \psi_j + \xi_j.$$
It follows that $\xi_i + \xi_j \leq \psi_i + \psi_j \leq 2\psi_{k^*}$ since $k^* \leq i, j$. Therefore,
$$|v - \hat v_{\hat k}| \leq \psi_{\hat k} + \xi_{\hat k} \leq \psi_{k^*} + \xi_j \leq \psi_{k^*} + \xi_i + \xi_j \leq 3\psi_{k^*} \leq 3(\psi_{k^*} + \xi_{k^*}),$$
where we used $\psi_{\hat k} \leq \psi_{k^*}$ and $\xi_{\hat k} \leq \xi_j$ since $\hat k \leq j$. Since all cases have been handled, the claim is satisfied with $C = 5$.
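The selection rule of Theorem 6 admits a direct implementation. The sketch below is a minimal version under the theorem's assumptions (known, nondecreasing widths $\xi_k$ and unknown, nonincreasing biases $\psi_k$); in Algorithm 3 the rule is instantiated with $\psi_k = \epsilon_k$ and $\xi_k = \zeta_k(\delta)\,\mathbb{E}\max_a\|\phi_k(X,a)\|_{V_k^{-1}}$. The function and example values are illustrative.

```python
def slope_select(v_hat, xi):
    """Return the index chosen by the SLOPE-style rule of Theorem 6.

    v_hat[k] estimates v with |v_hat[k] - v| <= psi_k + xi[k]; xi is nondecreasing
    in k while the (unknown) psi_k is nonincreasing.
    """
    M = len(v_hat)
    for k in range(M):
        if all(abs(v_hat[k] - v_hat[l]) <= 2 * xi[k] + 2 * xi[l] for l in range(k + 1, M)):
            return k            # smallest index whose interval meets all later ones
    return M - 1                # unreachable for M >= 1: the condition is vacuous at k = M - 1

# Example: widths grow with k while the unseen bias shrinks; true v is near 1.0.
v_hat = [1.6, 1.15, 0.95, 1.05]
xi = [0.05, 0.1, 0.2, 0.4]
k_hat = slope_select(v_hat, xi)   # the combined estimate is v_hat[k_hat]
```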
E.3. Proof of Model Selection Guarantee

Equipped with the single-model-class guarantees and the SLOPE estimator guarantee, we are ready to prove the final result, which is restated below.

Theorem 4. Let $\mathcal{F}_1, \ldots, \mathcal{F}_M$ be a nested collection of linear model classes. For $\lambda_k = 1$ for all $k \in [M]$, Algorithm 3 outputs a policy $\hat\pi$ such that, with probability at least $1 - 4\delta$, for any comparator policy $\pi$, $\mathrm{Reg}(\pi, \hat\pi)$ is bounded above by
$$12\min_{k \in [M]}\left\{\epsilon_k + \zeta_k(\delta/M)\,\mathbb{E}_X\max_a\|\phi_k(X, a)\|_{V_k^{-1}}\right\}.$$

Proof of Theorem 4. Recall that Proposition 1 guarantees that $\|\hat\theta_k - \bar\theta_k\|_{V_k} \leq \zeta_k(\delta)$ for all $k$ with probability at least $1 - M\delta$. From here on, assume this event holds. To derive the result, we first note that Lemma 8 ensures the ordering properties of $\epsilon_k$ and $\zeta_k(\delta)\,\mathbb{E}\max_a\|\phi_k(X, a)\|_{V_k^{-1}}$. Therefore, it is valid to apply Theorem 6 with $\psi_k = \epsilon_k$ and $\xi_k = \zeta_k(\delta)\,\mathbb{E}\max_a\|\phi_k(X, a)\|_{V_k^{-1}}$. Note that the selection rule ensures that $\hat\pi = \hat\pi_{\hat\ell}$ where $\hat\ell = \arg\max_\ell\hat v(\hat\pi_\ell)$.

From the regret decomposition, we have, for any fixed $\ell \in [M]$,
$$\mathrm{Reg}(\pi, \hat\pi) = f(\pi) - f(\hat\pi) = f(\pi) - \hat v(\hat\pi) + \hat v(\hat\pi) - f(\hat\pi)$$
$$\leq f(\pi) - \hat v(\hat\pi_{\hat\ell}) + C\min_k\left\{\epsilon_k + \zeta_k(\delta)\,\mathbb{E}\max_a\|\phi_k(X, a)\|_{V_k^{-1}}\right\}$$
$$\leq f(\pi) - \hat v(\hat\pi_\ell) + C\min_k\left\{\epsilon_k + \zeta_k(\delta)\,\mathbb{E}\max_a\|\phi_k(X, a)\|_{V_k^{-1}}\right\}$$
$$= f(\pi) - f(\hat\pi_\ell) + f(\hat\pi_\ell) - \hat v(\hat\pi_\ell) + C\min_k\left\{\epsilon_k + \zeta_k(\delta)\,\mathbb{E}\max_a\|\phi_k(X, a)\|_{V_k^{-1}}\right\}$$
$$\leq f(\pi) - f(\hat\pi_\ell) + 2C\min_k\left\{\epsilon_k + \zeta_k(\delta)\,\mathbb{E}\max_a\|\phi_k(X, a)\|_{V_k^{-1}}\right\}$$
$$\leq \epsilon_\ell + 2\zeta_\ell(\delta)\,\mathbb{E}\max_a\|\phi_\ell(X, a)\|_{V_\ell^{-1}} + 2C\min_k\left\{\epsilon_k + \zeta_k(\delta)\,\mathbb{E}\max_a\|\phi_k(X, a)\|_{V_k^{-1}}\right\},$$
where the first and third inequalities follow from Theorem 6 with the constant $C = 5$, the second uses the selection rule for $\hat\ell$, and the last inequality uses Proposition 1. Therefore,
$$\mathrm{Reg}(\pi, \hat\pi) \leq (2C + 1)\,\epsilon_\ell + 2(C + 1)\,\zeta_\ell(\delta)\,\mathbb{E}\max_a\|\phi_\ell(X, a)\|_{V_\ell^{-1}}.$$
Recall that $\ell \in [M]$ was arbitrary and that the assumed event occurs with probability at least $1 - M\delta$. The proof of the claim is completed by the change of variables $\delta' = M\delta$; the claim is then satisfied with the constant $12 \geq \max\{2C + 1, 2(C + 1)\}$.

E.4. Technical Lemmas

Lemma 9. With probability at least $1 - \delta$,
$$\left\|V^{-1/2}\sum_i\phi_i e_i\right\| \leq C(1 + 2\|\bar\theta\|)\sqrt{nd}\,\|V^{-1/2}\|\log(2d/\delta), \qquad (54)$$
where $C = 64$.

Proof. Note that we have
$$\mathbb{E}_\mu\left[\phi_i e_i\right] = \mathbb{E}_\mu\left[\phi_i\left(f_i - \phi_i^\top\bar\theta\right)\right] = \mathbb{E}_\mu\left[\phi(X, A)f(X, A)\right] - \Sigma\bar\theta,$$
which equals zero by the first-order optimality conditions applied to the minimizer $\bar\theta$. Therefore, it suffices to show concentration of $\sum_i\Sigma^{-1/2}\phi_i e_i$ around its mean. Define $\tilde\phi_i = \Sigma^{-1/2}\phi_i$ and let $\tilde\phi_i^j$ denote the $j$th coordinate of $\tilde\phi_i$. Note that
$$Z_i^j := \tilde\phi_i^j\left(f_i - \tilde\phi_i^\top\Sigma^{1/2}\bar\theta\right)$$
is sub-exponential with parameter $\|Z_i^j\|_{\psi_1} \leq C_1 + C_2\|\Sigma^{1/2}\bar\theta\| \leq C_1 + 2C_2\|\bar\theta\|$ for constants $C_1, C_2 > 0$, by Lemma 1 and Lemma 2. This follows because $\|\tilde\phi_i\|_{\psi_2} \leq 1$ and $f_i \in [-1, 1]$ by assumption. Therefore, by Bernstein's inequality,
$$\left|\sum_i\tilde\phi_i^j e_i\right| \leq 32\sqrt{(1 + 2\|\bar\theta\|)^2\,n\log(2/\delta)} + 32(1 + 2\|\bar\theta\|)\log(2/\delta) \leq 64(1 + 2\|\bar\theta\|)\sqrt{n}\log(2/\delta)$$
with probability at least $1 - \delta$, where we have used the fact that $\delta \leq 1/e$ so that $\sqrt{\log(2/\delta)} \leq \log(2/\delta)$. Taking the union bound over all coordinates $j \in [d]$ and applying standard norm inequalities, we have
$$\left\|\sum_i\tilde\phi_i e_i\right\| \leq 64(1 + 2\|\bar\theta\|)\sqrt{nd}\log(2/\delta)$$
with probability at least $1 - d\delta$. The result then follows by the change of variables $\delta' = d\delta$ and applying the $\ell_2$ matrix norm inequality.

F. Proof of Theorem 5

We define the expected regression loss as $L_k(\theta) = \mathbb{E}\hat L_k(\theta)$ for $\theta \in \mathbb{R}^{d_k}$. We first require the following concentration result for the held-out empirical losses, which follows from Bernstein's inequality.

Lemma 10. There is a constant $C > 0$ such that, with probability at least $1 - \delta$,
$$|\hat L_k(\hat\theta_k) - L_k(\hat\theta_k)| \leq C\,\frac{(1 + \|\hat\theta_k\|)^4}{\sqrt{n_{\mathrm{out}}}}\log(2M/\delta)$$
for all $k \in [M]$.

Proof. Define $Z_{k,i} = \langle\phi_{k,i}, \hat\theta_k\rangle - y_i$. Note that $\|Z_{k,i}\|_{\psi_2} \leq 2\|\hat\theta_k\| + 2$, since $\|\Sigma_k^{-1/2}\phi_{k,i}\|_{\psi_2} \leq 1$ and $\|y_i\|_{\psi_2} \leq 2$. Therefore $\|Z_{k,i}^2\|_{\psi_1} \leq (2 + 2\|\hat\theta_k\|)^2$. By Bernstein's inequality, we have
$$|\hat L_k(\hat\theta_k) - L_k(\hat\theta_k)| = \left|\frac{1}{n_{\mathrm{out}}}\sum_i Z_{k,i}^2 - \mathbb{E}\left[Z_{k,i}^2\right]\right| \leq 4\sqrt{\frac{(2 + 2\|\hat\theta_k\|)^4\log(2M/\delta)}{n_{\mathrm{out}}}} + 64(2 + 2\|\hat\theta_k\|)^2\,\frac{\log(2M/\delta)}{n_{\mathrm{out}}}$$
with probability at least $1 - \delta$ for all $k$. Since it is assumed that $\delta \leq 1/e$, we have $\frac{1}{n_{\mathrm{out}}} \leq \frac{1}{\sqrt{n_{\mathrm{out}}}}$ and $\sqrt{\log(2M/\delta)} \leq \log(2M/\delta)$. Applying these two upper bounds to the terms above and summing gives the result.

Armed with this result, we turn to the proof of Theorem 5, restated below.

Theorem 5. Given arbitrary linear model classes $\mathcal{F}_1, \ldots, \mathcal{F}_M$, let $\hat\pi = \hat\pi_{\hat k}$ where $\hat k \in \arg\min_{k \in [M]}\hat L_k(\hat\theta_k)$. Then there is a constant $C > 0$ such that, with probability at least $1 - 2\delta$, $\mathrm{Reg}(\pi, \hat\pi)$ is bounded above by
$$\min_k\left\{\epsilon_k + C\sqrt{C(\mu)}\,\|\hat\theta_k - \bar\theta_k\|_{\Sigma_k}\right\} + C\sqrt{C(\mu)}\,\frac{(1 + \max_\ell\|\hat\theta_\ell\|)^2\log^{1/2}(M/\delta)}{n_{\mathrm{out}}^{1/4}}.$$
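The selection rule in Theorem 5 simply picks the class whose fitted regressor has the smallest empirical squared loss on a held-out portion of the batch data. A minimal sketch is below; the container names and the assumption that $n_{\mathrm{out}}$ held-out samples from the behavior policy are available (separate from those used to fit each $\hat\theta_k$) are conventions of this sketch rather than a transcription of the paper's pseudocode.

```python
import numpy as np

def holdout_select(theta_hats, feature_maps, heldout):
    """Return argmin_k of the empirical regression loss hat{L}_k(hat{theta}_k).

    heldout: list of (x, a, y) tuples drawn from the behavior policy and not used
    to fit theta_hats; feature_maps[k](x, a) -> R^{d_k}.
    """
    losses = []
    for k, phi_k in enumerate(feature_maps):
        residuals = np.array([phi_k(x, a) @ theta_hats[k] - y for (x, a, y) in heldout])
        losses.append(float(np.mean(residuals ** 2)))
    k_hat = int(np.argmin(losses))
    return k_hat, losses
```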
Proof of Theorem 5. Recall that $Y(A) = f(X, A) + \eta(A)$ is the observed random reward when taking action $A$, where $\eta$ is the noise vector. By linearity of expectation, for any $k \in [M]$ and $\theta \in \mathbb{R}^{d_k}$,
$$L_k(\theta) = \mathbb{E}_\mu\left(\langle\phi_k(X, A), \theta\rangle - Y(A)\right)^2 = \mathbb{E}_\mu\left(\langle\phi_k(X, A), \theta\rangle - f(X, A)\right)^2 + \mathbb{E}_\mu\left(f(X, A) - Y(A)\right)^2 + 2\,\mathbb{E}_\mu\left[\left(\langle\phi_k(X, A), \theta\rangle - f(X, A)\right)\left(f(X, A) - Y(A)\right)\right]$$
$$= \mathbb{E}_\mu\left(\langle\phi_k(X, A), \theta\rangle - f(X, A)\right)^2 + \mathbb{E}_\mu\left(f(X, A) - Y(A)\right)^2,$$
where the cross term vanishes by applying the tower rule of expectation conditioned on $(X, A)$, since $\eta$ is zero-mean independent noise. Then,
$$\mathbb{E}_\mu\left(\langle\phi_{\hat k}(X, A), \hat\theta_{\hat k}\rangle - f(X, A)\right)^2 = L_{\hat k}(\hat\theta_{\hat k}) - \mathbb{E}_\mu\left(f(X, A) - Y(A)\right)^2 \qquad (55)$$
$$\leq \hat L_{\hat k}(\hat\theta_{\hat k}) - \mathbb{E}_\mu\left(f(X, A) - Y(A)\right)^2 + C\,\frac{(1 + \|\hat\theta_{\hat k}\|)^4}{\sqrt{n_{\mathrm{out}}}}\log(M/\delta) \qquad (56)$$
$$\leq \hat L_k(\hat\theta_k) - \mathbb{E}_\mu\left(f(X, A) - Y(A)\right)^2 + C\,\frac{(1 + \max_\ell\|\hat\theta_\ell\|)^4}{\sqrt{n_{\mathrm{out}}}}\log(M/\delta) \qquad (57)$$
$$\leq L_k(\hat\theta_k) - \mathbb{E}_\mu\left(f(X, A) - Y(A)\right)^2 + 2C\,\frac{(1 + \max_\ell\|\hat\theta_\ell\|)^4}{\sqrt{n_{\mathrm{out}}}}\log(M/\delta) \qquad (58)$$
$$= \mathbb{E}_\mu\left(\langle\phi_k(X, A), \hat\theta_k\rangle - f(X, A)\right)^2 + 2C\,\frac{(1 + \max_\ell\|\hat\theta_\ell\|)^4}{\sqrt{n_{\mathrm{out}}}}\log(M/\delta) \qquad (59)$$
with probability at least $1 - \delta$, where the first and third inequalities follow from Lemma 10 and the second follows from the choice of $\hat k$ to minimize $\hat L_k(\hat\theta_k)$. Therefore, we can apply the following regret bound for any $k \in [M]$:
$$\mathrm{Reg}(\pi, \hat\pi) = f(\pi) - f(\hat\pi) = f(\pi) - \langle\phi_{\hat k}(\hat\pi), \hat\theta_{\hat k}\rangle + \langle\phi_{\hat k}(\hat\pi), \hat\theta_{\hat k}\rangle - f(\hat\pi)$$
$$\leq f(\pi) - \langle\phi_{\hat k}(\pi), \hat\theta_{\hat k}\rangle + \langle\phi_{\hat k}(\hat\pi), \hat\theta_{\hat k}\rangle - f(\hat\pi)$$
$$\leq 2\sqrt{C(\mu)\,\mathbb{E}_\mu\left(\langle\phi_{\hat k}(X, A), \hat\theta_{\hat k}\rangle - f(X, A)\right)^2}$$
$$\leq 2\sqrt{C(\mu)\,\mathbb{E}_\mu\left(\langle\phi_k(X, A), \hat\theta_k\rangle - f(X, A)\right)^2 + 2C\,C(\mu)\,\frac{(1 + \max_\ell\|\hat\theta_\ell\|)^4}{\sqrt{n_{\mathrm{out}}}}\log(M/\delta)}$$
$$\leq 2\sqrt{C(\mu)\,\mathbb{E}_\mu\left(\langle\phi_k(X, A), \bar\theta_k\rangle - f(X, A)\right)^2} + 2\sqrt{C(\mu)}\,\|\hat\theta_k - \bar\theta_k\|_{\Sigma_k} + 2\sqrt{2C\,C(\mu)}\,\frac{(1 + \max_\ell\|\hat\theta_\ell\|)^2\log^{1/2}(M/\delta)}{n_{\mathrm{out}}^{1/4}}$$
$$\leq \epsilon_k + 2\sqrt{C(\mu)}\,\|\hat\theta_k - \bar\theta_k\|_{\Sigma_k} + 2\sqrt{2C\,C(\mu)}\,\frac{(1 + \max_\ell\|\hat\theta_\ell\|)^2\log^{1/2}(M/\delta)}{n_{\mathrm{out}}^{1/4}},$$
where the first inequality uses the fact that $\hat\pi = \hat\pi_{\hat k}$ maximizes the reward predicted by $\hat\theta_{\hat k}$, the second follows from Jensen's inequality and Definition 1, the third applies the previous display, and the last two use $\sqrt{a + b} \leq \sqrt{a} + \sqrt{b}$ together with the triangle inequality in $\|\cdot\|_{\Sigma_k}$. Taking the minimum over $k$ completes the proof.

Balancing approximation error and coverage. We conclude by remarking that a final case may be considered in which we ignore the model selection criterion of statistical complexity and aim to balance only approximation error and coverage. In this case, the problem becomes trivial, since we are permitted to take arbitrarily large model classes until realizability is achieved.

G. Supporting Lemmas

In this section, we state several independent results that support the proofs of the main results.

G.1. Nestedness Properties

Let $M \in \mathbb{R}^{d\times d}$ be a positive definite matrix of the form
$$M = \begin{pmatrix} A & B \\ B^\top & D \end{pmatrix},$$
where $A \in \mathbb{R}^{d_1\times d_1}$ is also positive definite, $D \in \mathbb{R}^{d_2\times d_2}$, and $B \in \mathbb{R}^{d_1\times d_2}$.

Lemma 11. Let $a \in \mathbb{R}^{d_1}$ and $b \in \mathbb{R}^{d_2}$ be arbitrary. The following inequality holds:
$$\begin{pmatrix} a \\ b \end{pmatrix}^\top M^{-1}\begin{pmatrix} a \\ b \end{pmatrix} \geq a^\top A^{-1}a. \qquad (61)$$

Proof. By the Schur complement inverse rules,
$$\begin{pmatrix} a \\ b \end{pmatrix}^\top M^{-1}\begin{pmatrix} a \\ b \end{pmatrix} = a^\top A^{-1}a + a^\top A^{-1}B(M/A)^{-1}B^\top A^{-1}a - 2b^\top(M/A)^{-1}B^\top A^{-1}a + b^\top(M/A)^{-1}b$$
$$\geq a^\top A^{-1}a + a^\top A^{-1}B(M/A)^{-1}B^\top A^{-1}a - a^\top A^{-1}B(M/A)^{-1}B^\top A^{-1}a = a^\top A^{-1}a,$$
where the inequality follows from minimizing the quadratic over $b$.

G.2. Concentration of Quadratic Forms

The following is a restatement of Lemma 14 of (Hsu et al., 2012b) for convenience, which can be interpreted as a version of the Hanson–Wright inequality (Rudelson & Vershynin, 2013).

Lemma 12. Let $A \in \mathbb{R}^{m\times n}$ be a matrix and $\Sigma = AA^\top$. Let $X \in \mathbb{R}^n$ be a random vector with independent coordinates $X_1, \ldots, X_n$ such that $\mathbb{E}X_i = 0$ and $\mathbb{E}\exp(\lambda X_i) \leq \exp\left(\frac{\lambda^2\sigma^2}{2}\right)$ for all $\lambda \in \mathbb{R}$. Then,
$$\mathbb{P}\left(\|AX\|^2 > \sigma^2\,\mathrm{tr}\,\Sigma + 2\sigma^2\sqrt{\mathrm{tr}(\Sigma^2)\log(1/\delta)} + 2\sigma^2\|\Sigma\|\log(1/\delta)\right) \leq \delta. \qquad (65)$$
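The nestedness inequality of Lemma 11, which underlies Equation (53), can be sanity-checked numerically. The block below draws a random positive definite block matrix and verifies the claim; the dimensions and the random seed are arbitrary choices of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 3, 2
R = rng.normal(size=(d1 + d2, d1 + d2))
M = R @ R.T + np.eye(d1 + d2)          # random positive definite block matrix
A = M[:d1, :d1]                        # its top-left d1 x d1 block
a = rng.normal(size=d1)
b = rng.normal(size=d2)
ab = np.concatenate([a, b])

lhs = ab @ np.linalg.solve(M, ab)      # [a; b]^T M^{-1} [a; b]
rhs = a @ np.linalg.solve(A, a)        # a^T A^{-1} a
assert lhs >= rhs - 1e-12              # Lemma 11: the full quadratic form is at least as large
print(lhs, rhs)
```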
H. Additional Experiment Details

In this section, we provide additional details regarding the experimental results presented in Section 6. We start with details common to both of the settings considered. In order to evaluate the performance of the algorithms, within each trial we generated a test set of $n_{\mathrm{test}} = 500$ samples. All algorithms were thus compared on the same data within a trial. For both the batch dataset and the test set, noise was artificially added to the rewards by sampling from a standard normal distribution $N(0, 1)$, so that $\eta(a) \sim \mathrm{subG}(1)$ for all $a \in \mathcal{A}$. Regret was computed by taking the difference between the value of the optimal policy $\pi^\star$ and that of the learned policy, evaluated on the same test set. Thus, the reported points approximately (up to noise) represent $\mathrm{Reg}(\pi^\star, \hat\pi_n)$, where $\hat\pi_n$ is the learned policy after $n$ batch samples.

The data collection policy was generated as a policy independent of the observed state, i.e., $\mu(a\,|\,x) = \mu(a\,|\,x')$ for all $a, x, x'$. We generated the probabilities of sampling arms by sampling from a standard Dirichlet distribution over $|\mathcal{A}|$ values. For the algorithms, the penalization terms (i.e., the estimation error bonuses) typically depend on constants being chosen sufficiently large to ensure that a confidence interval is valid. However, choosing large values in practice can lead to unnecessarily slow convergence. We found that multiplying the penalties by $C = 0.1$ yielded good performance in most settings.

H.1. Complexity-Coverage Setting

In this setting, all random quantities were generated by sampling multivariate normal distributions. We first generated a random vector $(f(x, a))_{x \in \mathcal{X}, a \in \mathcal{A}}$, which specifies the average reward for each state-action pair. In order to generate a set of linear models (feature maps) that all satisfy realizability, we began with an input dimension $d_{\mathrm{hid}}$, randomly generated $d_{\mathrm{hid}} - 1$ vectors $v_1, \ldots, v_{d_{\mathrm{hid}}-1}$ of length $|\mathcal{X}||\mathcal{A}|$, and solved for the last vector $v_{d_{\mathrm{hid}}}$ by subtracting the others from the reward vector. This ensures that a particular linear combination of the $v$'s equals the vector $(f(x, a))_{x \in \mathcal{X}, a \in \mathcal{A}}$. This procedure was repeated for various values of $d_{\mathrm{hid}}$, and the resulting feature vectors were scaled up to dimension $d = |\mathcal{X}||\mathcal{A}|$ by multiplying by a random matrix $A$ with elements generated from $N(0, 1)$. This ensures that the feature maps are not simply equivalent linear transformations of one another. (A short sketch of this construction is given at the end of this appendix.)

H.2. Approximation-Complexity Setting

In contrast to the previous setting, we considered an infinite state space where, for each action, a $d = 100$-dimensional covariate vector is sampled from a multivariate normal distribution with mean $0$ and covariance matrix $\Sigma_a$, where $\Sigma_a$ was also randomly generated. As mentioned in the main text, we constructed model classes by truncating the original covariate vector to smaller dimensions, thus inducing a nested structure. Since the true reward model has dimension $d = 30$, some of these choices result in misspecified models. For the SLOPE method, in order to estimate the predicted values used to construct each $\hat v_k$, we used a validation set of unlabeled samples (i.e., samples with no revealed reward).
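For concreteness, the following is a sketch of the realizable feature-map construction described in Section H.1, under assumed sizes for the state and action spaces; the variable names are illustrative and the exact scaling conventions used in the experiments may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, d_hid = 5, 4, 3
D = n_states * n_actions                    # ambient dimension d = |X||A|

f = rng.normal(size=D)                      # mean rewards (f(x, a)), one entry per pair
V = rng.normal(size=(D, d_hid - 1))         # d_hid - 1 random feature directions
v_last = f - V.sum(axis=1)                  # last direction: subtract the others from f
Phi_hid = np.column_stack([V, v_last])      # Phi_hid @ ones(d_hid) == f  (realizable)

A = rng.normal(size=(d_hid, D))             # random N(0, 1) matrix lifting to dimension D
Phi = Phi_hid @ A                           # final feature map, one row per (x, a) pair

# Realizability check: f is (generically) in the column span of Phi.
theta, *_ = np.linalg.lstsq(Phi, f, rcond=None)
print(np.max(np.abs(Phi @ theta - f)))      # numerically ~0
```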