# Experiment Planning with Function Approximation

Aldo Pacchiano (Broad Institute & Boston University, apacchia@broadinstitute.org), Jonathan N. Lee (Stanford University, jnl@stanford.edu), Emma Brunskill (Stanford University, ebrun@cs.stanford.edu)

We study the problem of experiment planning with function approximation in contextual bandit problems. In settings where there is a significant overhead to deploying adaptive algorithms (for example, when the execution of the data collection policies is required to be distributed, or a human in the loop is needed to implement these policies), producing in advance a set of policies for data collection is paramount. We study the setting where a large dataset of contexts, but not rewards, is available and may be used by the learner to design an effective data collection strategy. Although this problem has been well studied when rewards are linear [53], results are still missing for more complex reward models. In this work we propose two experiment planning strategies compatible with function approximation. The first is an eluder planning and sampling procedure that can recover optimality guarantees depending on the eluder dimension [42] of the reward function class. For the second, we show that a uniform sampler achieves competitive optimality rates in the setting where the number of actions is small. We conclude by introducing a statistical gap that fleshes out the fundamental differences between planning and adaptive learning, and by providing results for planning with model selection.

1 Introduction

Data-driven decision-making algorithms have achieved impressive empirical success in various domains such as online personalization [4, 48], games [36, 43], dialogue systems [32] and robotics [25, 35]. In many of these decision-making scenarios it is often advantageous to take contextual information into account when making decisions. This recognition has sparked a growing interest in studying adaptive learning algorithms in the setting of contextual bandits [26, 33, 6] and reinforcement learning (RL) [46]. Adaptive learning scenarios involve the deployment of data collection policies, where learners observe rewards or environment information and use this knowledge to shape subsequent data collection strategies. Nonetheless, the practical implementation of adaptive policies in real-world experiments currently presents significant challenges. First, there are significant infrastructure requirements and associated overheads, which demand skills and resources many organizations lack. For example, while there are end-user services that enable organizations to automatically send different text messages to different individuals, such services typically do not offer adaptive bandit algorithms. Second, the resulting reward signal may be significantly delayed. As an example, the effect of personalized health screening reminders on a patient making a doctor's appointment may take weeks to materialize. Therefore, while it is increasingly recognized that there exist other settings where context-specific policies are likely to be beneficial, including behavioural science [8], many organizations working in these settings may find it infeasible to run experiments with adaptive policies. To bridge this gap, there is a need to explore the deployment of non-adaptive or static strategies that can effectively collect data with minimal or no updates.
Surprisingly, limited research has been conducted on this particular setting, highlighting the importance of further exploration in this area.

In this work, we consider how to design a static experimental sampling strategy for contextual multi-armed bandit settings which, when executed, yields a dataset from which we can compute a near-optimal contextual policy (one with small simple regret). Our framework closely adheres to the experiment planning problem formulation originally introduced by [53]. The problem involves a learner interacting with a contextual bandit problem where the reward function is unknown. Our assumption is that the learner possesses a substantial offline dataset comprising $m$ sample contexts, none of which include reward information. The learner's objective is to utilize this data to design a static policy sequence of length $T$ (where $m \ge T$) that enables the collection of valuable reward information when deployed in the real world. The ultimate goal is to produce an almost optimal policy using the data gathered during deployment. To address this scenario, the authors of [53] propose a two-phase approach involving Planner and Sampler algorithms. In the planner phase, the learner employs the context data to generate a set of sampling policies, which are then used in the sampler phase. The primary focus of [53] lies in the analysis of the linear case, assuming the reward to be a linear function of the context vectors. Their key algorithmic contribution is the linear Planner algorithm, which encompasses a reward-free instance of LinUCB. The analysis presented in [53] demonstrates that the number of samples required under the static sampler to achieve an $\varepsilon$-optimal policy (also referred to as $\varepsilon$ simple regret) scales as $\widetilde{O}(d^2/\varepsilon^2)$, where $d$ represents the dimension of the underlying space. This result matches the online minimax sample complexity for the simple regret problem [11, 1]. The algorithm proposed by [53] effectively constructs a static policy sequence of adequate span, utilizing the linear structure of the problem. However, in many problem settings, linearity alone may not be sufficient to capture the true nature of the underlying reward model. This limitation is particularly evident in scenarios such as genetic perturbation experimentation [40] and other similar contexts. In such cases, the availability of algorithms that do not rely on linearity becomes crucial. Unfortunately, extending the results and techniques of [53] to the generic function approximation regime is not straightforward. Here we consider the case where the reward is realized by an unknown function $f_\star$ belonging to a known function class $\mathcal{F}$ (which can be more complex than linear).

Adaptive Learning with Function Approximation. In adaptive learning settings, where a learner can change the data collection policies as it observes rewards and therefore has much more flexibility than in experiment planning scenarios, various adaptive learning procedures compatible with generic function approximation have been proposed for contextual bandit problems. Among these, two significant methods relevant to our discussion are the Optimistic Least Squares algorithm (OptLS) introduced by [42] and the SquareCB algorithm introduced by [17]. Both of these methods offer guarantees for cumulative regret.
Specifically, the cumulative regret of OptLS scales as $\widetilde{O}\big(\sqrt{d_{\mathrm{eluder}} \log(|\mathcal{F}|)\, T}\big)$, while the cumulative regret of SquareCB scales as $\widetilde{O}\big(\sqrt{|\mathcal{A}| \log(|\mathcal{F}|)\, T}\big)$, where $\mathcal{A}$ corresponds to the set of actions. The eluder dimension^1 ($d_{\mathrm{eluder}}$) is a statistical complexity measure introduced by [42] that enables deriving guarantees for adaptive learning algorithms based on the principle of optimism in the face of uncertainty in contextual bandits and reinforcement learning [31, 22, 37, 9]. By employing an online-to-batch conversion strategy, these two methods imply that the number of samples required to achieve $\varepsilon$ simple regret using an adaptive algorithm is at most $\widetilde{O}\big(d_{\mathrm{eluder}} \log(|\mathcal{F}|)/\varepsilon^2\big)$ and $\widetilde{O}\big(|\mathcal{A}| \log(|\mathcal{F}|)/\varepsilon^2\big)$ respectively.

^1 We formally introduce this quantity in Section 4. Here we use a simpler notation to avoid confusion.

Contributions. In this paper, we address the function approximation setting within the (static) contextual bandit experiment planning problem. Although in experiment planning the data collection policies have to be produced before interacting with the environment, we establish that, surprisingly, the adaptive $\varepsilon$ simple regret rates achieved by OptLS and SquareCB are also attainable by a static experiment sampling strategy. To achieve this, we introduce and analyze two experiment planning algorithms. The EluderPlanner algorithm (see Section 4) utilizes confidence intervals derived from least squares optimization to construct a static set of policies to be executed during the sampling phase. When the algorithm has access to a sufficient number of offline samples to generate $T \gtrsim d_{\mathrm{eluder}} \log(|\mathcal{F}|)/\varepsilon^2$ static policies, the learner can produce an $\varepsilon$-optimal policy $\hat{\pi}_T$, thus matching the online-to-batch $\varepsilon$ simple regret rates of OptLS. Since the eluder dimension of linear function classes of dimension $d$ is, up to logarithmic factors, of order $O(d)$, our results recover the linear experiment planning rates of [53]. Additionally, in Section 5, we demonstrate that collecting $T \gtrsim |\mathcal{A}| \log(|\mathcal{F}|)/\varepsilon^2$ uniform samples is sufficient to obtain an $\varepsilon$-optimal policy $\hat{\pi}_T$. This matches the online-to-batch $\varepsilon$ simple regret rates of SquareCB. These two results yield conclusions similar to those known in the linear setting [53], but for general function classes.

These results prompt two important questions: (1) are existing adaptive cumulative regret algorithms already optimal for simple regret minimization, and (2) is static experiment planning sufficient to match the simple regret guarantees of adaptive learning for realizable contextual bandits? In Section 6, we provide a negative answer to both questions. We present a structured class of bandit problems where a different adaptive algorithm can require significantly fewer samples than implied by the bounds for OptLS and SquareCB. Our result is for the realizable setting [2, 18, 17, 44, 9], where the true reward function $f_\star$ belongs to a known function class $\mathcal{F}$ and where the true complexity of simple regret problems in adaptive learning scenarios was not previously known. This complements recent results [34] showing that online-to-batch strategies are suboptimal for minimizing simple regret in the agnostic contextual bandit setting.^2 For the second question, our statistical lower bound shows that, for this class of problems, a significantly smaller number of samples is needed to find an $\varepsilon$-optimal policy when using adaptive learning than when using a static policy.

^2 In the agnostic case the learner is instead presented with an arbitrary policy class that may or may not be related to the mean reward function $f_\star$. In this setting a recent study by [34] introduces an algorithm that achieves sharp rates characterized in terms of an instance-dependent quantity.
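To make the online-to-batch conversion used throughout this discussion explicit, the following display sketches the standard argument; the paper's formal version of this step is its Lemma B.1, the cumulative regret notation $R_T$ is ours, and constants and logarithmic factors are omitted.

```latex
% Run an adaptive algorithm for T rounds with policies \pi_1, \dots, \pi_T and
% cumulative regret R_T, and output \hat{\pi}_T = \mathrm{Uniform}(\pi_1, \dots, \pi_T).
% The expected simple regret of \hat{\pi}_T is the average instantaneous regret:
\mathbb{E}\Big[\max_{\pi} \mathbb{E}_{x \sim \mathcal{P}, a \sim \pi(x)}[f_\star(x,a)]
  - \mathbb{E}_{x \sim \mathcal{P}, a \sim \hat{\pi}_T(x)}[f_\star(x,a)]\Big]
  \;=\; \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\Big[\max_{a \in \mathcal{A}} f_\star(x_t, a)
  - f_\star\big(x_t, \pi_t(x_t)\big)\Big]
  \;=\; \frac{\mathbb{E}[R_T]}{T}.
% Requiring R_T / T \le \varepsilon with R_T = \widetilde{O}\big(\sqrt{d_{\mathrm{eluder}} \log(|\mathcal{F}|)\, T}\big)
% (OptLS) or R_T = \widetilde{O}\big(\sqrt{|\mathcal{A}| \log(|\mathcal{F}|)\, T}\big) (SquareCB) yields
T \;\gtrsim\; \frac{d_{\mathrm{eluder}} \log(|\mathcal{F}|)}{\varepsilon^2}
\qquad \text{and} \qquad
T \;\gtrsim\; \frac{|\mathcal{A}| \log(|\mathcal{F}|)}{\varepsilon^2},
% respectively, which are the adaptive simple regret sample complexities quoted above.
```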
This result complements related work in reinforcement learning with realizable linear state-action value function approximation [52]. Finally, we address the problem of model selection, where the learner is presented with a family of reward function classes $\{\mathcal F_i\}_{i=1}^M$ and is guaranteed that the true reward function $f_\star$ belongs to $\mathcal F_{i_\star}$, where the index $i_\star$ is unknown. We demonstrate that as long as $T \gtrsim |\mathcal A| \log\big(\max(M, |\mathcal F_{i_\star}|)\big)/\varepsilon^2$ uniform samples are available, it is possible to construct an $\varepsilon$-optimal policy $\hat\pi_T$. Further details regarding these results can be found in Section 7.

2 Further Related Work

Best Arm Identification and Design of Experiments. Previous work on efficiently producing an almost optimal policy for non-contextual linear and multi-armed bandit settings is based on best-arm identification procedures [16, 20, 13, 45, 47, 51]. These are strategies that react adaptively to the collected rewards and achieve instance-dependent sample complexities. Other papers in the design of experiments literature [24, 15, 29, 41] introduce methods for designing a non-adaptive policy that can be used to find an optimal arm with high probability in non-contextual scenarios.

Safe Optimal Design. When safety constraints are imposed on the exploration policy, works such as [54] and [49] have studied the problem of producing a safe and efficient static policy used to collect data that can then be used for policy evaluation in the multi-armed bandit and linear settings. This is in contrast with the unconstrained contextual function approximation scenario we study.

Reward Free Reinforcement Learning. In reinforcement learning settings, a related line of research [21, 50, 10] considers the problem of producing enough exploratory data to allow for policy optimization for all functions in a reward class. These works allow the learner to interact with the world for a number of steps sufficient to collect data that allows zero-shot estimation of the optimal policy within a reward family. This stands in contrast with the experiment planning setup, where the objective is to produce a sequence of policies for static deployment at test time with the objective of learning the unknown reward function.

3 Problem Definition

We study a two-phase interaction between a learner and its environment consisting of a Planning and a Sampling phase. During the planning phase the learner has access to $m \ge T$ i.i.d. offline context samples from a context distribution $\mathcal P$ supported on a context space $\mathcal X$, as well as knowledge of a reward function class $\mathcal F$ where each element $f \in \mathcal F$ has domain $\mathcal X \times \mathcal A$ and $\mathcal A$ is the action set.^3 During the sampling phase the learner interacts with a contextual bandit problem for $T$ steps: at each time-step $t$ a context $x_t \sim \mathcal P$ is revealed to the learner, the learner takes an action $a_t$ and receives a reward $r_t$. We adopt the commonly used realizability assumption, which is widely employed in the contextual bandit literature [2, 18, 17, 44, 9].

^3 This assumption is based on the fact that gathering offline context samples can be significantly less costly compared to conducting multiple rounds of experimental deployment.

Assumption 3.1 (Realizability). There exists a function $f_\star \in \mathcal F$ such that $r_t = f_\star(x_t, a_t) + \xi_t$, where $\xi_t$ is a conditionally zero-mean 1-subgaussian random variable for all $t \in [T]$ during the sampling phase.

During the planning phase the learner is required to build a static policy sequence $\{\pi_i\}_{i=1}^T$ (a policy is a mapping from $\mathcal X$ to $\Delta_{\mathcal A}$, the set of distributions over actions) that, without knowledge of the reward, can be deployed to collect data during the sampling phase. In contrast with adaptive learning procedures, in the experiment planning problem the policies in $\{\pi_i\}_{i=1}^T$ have to be fully specified during the planning phase and cannot depend on the reward values the learner observes during the sampling phase. The learner's objective is to use the reward data collected during the sampling phase to produce an $\varepsilon$-optimal policy $\hat\pi_T$ such that

$$\mathbb E_{x\sim\mathcal P,\, a\sim\hat\pi_T(x)}\big[f_\star(x,a)\big] \;\ge\; \max_{\pi}\ \mathbb E_{x\sim\mathcal P,\, a\sim\pi(x)}\big[f_\star(x,a)\big] \;-\; \varepsilon$$

for a suboptimality parameter $\varepsilon > 0$. Because we have assumed realizability of the reward function, the optimal policy among all possible policies, $\pi_\star \in \arg\max_\pi \mathbb E_{x\sim\mathcal P,\, a\sim\pi(x)}[f_\star(x,a)]$, can be written as $\pi_\star(x) = \arg\max_{a\in\mathcal A} f_\star(x,a)$ for all $x \in \mathcal X$.
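As a concrete, if schematic, picture of the two-phase protocol just defined, the minimal Python sketch below separates planning (contexts only, no rewards) from sampling (rewards observed, policies fixed in advance). The names planning_phase, sampling_phase, make_policy and env_step, as well as the representation of contexts, actions and policies, are illustrative choices of ours and not part of the paper.

```python
# A minimal sketch of the planning/sampling protocol of Section 3 (illustrative only).
from typing import Callable, List, Tuple

Context = Tuple[float, ...]                  # abstract context representation
Action = int
Policy = Callable[[Context], Action]         # deterministic policy: context -> action


def planning_phase(offline_contexts: List[Context],
                   make_policy: Callable[[int, List[Context]], Policy],
                   T: int) -> List[Policy]:
    """Planning phase: only offline contexts (no rewards) are available.

    `make_policy(t, prefix)` stands in for any planner rule, e.g. the eluder-based
    rule of Section 4 or the uniform rule of Section 5."""
    assert len(offline_contexts) >= T, "the planner assumes m >= T offline contexts"
    return [make_policy(t, offline_contexts[:t]) for t in range(T)]


def sampling_phase(policies: List[Policy],
                   env_step: Callable[[Policy], Tuple[Context, Action, float]]
                   ) -> List[Tuple[Context, Action, float]]:
    """Sampling phase: the pre-committed policies are executed in order and the
    resulting (context, action, reward) triples are recorded. No policy here may
    depend on the rewards observed during this phase."""
    return [env_step(pi) for pi in policies]
```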
Throughout this work we make the following boundedness assumptions.

Assumption 3.2 (Bounded Function Range). The range of $\mathcal F$ is bounded, i.e. for all $(x, a) \in \mathcal X \times \mathcal A$ and all $f \in \mathcal F$, $|f(x,a)| \le B$ for some constant $B > 0$.

Assumption 3.3 (Bounded Noise). The noise variables are bounded, $|\xi_t| \le \bar B$ for all $t \in [T]$, for some constant $\bar B > 0$.

Assumptions 3.2 and 3.3 imply $|r_t| \le B + \bar B$ for all $t \in [T]$.

4 Planning Under the Eluder Dimension

The eluder dimension is a sequential notion of complexity for function classes that was originally introduced by [42]. The eluder dimension can be used to bound the regret of optimistic least squares algorithms in contextual bandits [42, 9] and reinforcement learning [37]. Informally speaking, the eluder dimension is the length of the longest sequence of points one must observe to accurately estimate the function value at any other point. We use the eluder dimension definition from [42].

Definition 4.1 ($\varepsilon$-dependence). Let $\mathcal G$ be a scalar function class with domain $\mathcal Z$ and $\varepsilon > 0$. An element $z \in \mathcal Z$ is $\varepsilon$-dependent on $\{z_1, \dots, z_n\} \subseteq \mathcal Z$ w.r.t. $\mathcal G$ if any pair of functions $g, g' \in \mathcal G$ satisfying $\sqrt{\sum_{i=1}^n (g(z_i) - g'(z_i))^2} \le \varepsilon$ also satisfies $g(z) - g'(z) \le \varepsilon$. Furthermore, $z \in \mathcal Z$ is $\varepsilon$-independent of $\{z_1, \dots, z_n\}$ w.r.t. $\mathcal G$ if it is not $\varepsilon$-dependent on $\{z_1, \dots, z_n\}$.

Definition 4.2 ($\varepsilon$-eluder dimension). The $\varepsilon$ non-monotone eluder dimension $\vec d_{\mathrm{eluder}}(\mathcal G, \varepsilon)$ of $\mathcal G$ is the length of the longest sequence of elements in $\mathcal Z$ such that every element is $\varepsilon$-independent of its predecessors. Moreover, we define the $\varepsilon$-eluder dimension $d_{\mathrm{eluder}}(\mathcal G, \varepsilon)$ as $d_{\mathrm{eluder}}(\mathcal G, \varepsilon) = \max_{\varepsilon' \ge \varepsilon} \vec d_{\mathrm{eluder}}(\mathcal G, \varepsilon')$.

By definition the $\varepsilon$-eluder dimension increases as $\varepsilon$ is driven down to zero. Let $\mathcal D$ be a dataset of $(\mathcal X, \mathcal A)$ pairs. For any $f, f'$ with support $\mathcal X \times \mathcal A$ we use the notation $\|f - f'\|_{\mathcal D}$ to denote the data norm of the difference between functions $f$ and $f'$:

$$\|f - f'\|_{\mathcal D} \;=\; \sqrt{\sum_{(x,a)\in\mathcal D} \big(f(x,a) - f'(x,a)\big)^2}.$$

Algorithm 1 EluderPlanner
1: Input: $m \ge T$ samples $\{x_\ell \sim \mathcal P\}_{\ell=1}^m$, function class $\mathcal F$, confidence radius function $\beta_{\mathcal F} : (t, \delta) \mapsto \mathbb R$.
2: Initialize data buffer $\mathcal D_1 = \emptyset$.
3: for $t = 1, \dots, T$ do
4:   For any $x \in \mathcal X$ define the uncertainty radius of action $a \in \mathcal A$ as $\omega(x, a, \mathcal D_t) = \max_{f, f' \in \mathcal F\ \mathrm{s.t.}\ \|f - f'\|_{\mathcal D_t} \le 4\beta_{\mathcal F}(t,\delta)}\ f(x,a) - f'(x,a)$.
5:   Define the policy $\pi_t$ as $\pi_t(x) = \arg\max_{a\in\mathcal A} \omega(x, a, \mathcal D_t)$ for all $x \in \mathcal X$.
6:   Update $\mathcal D_{t+1} = \mathcal D_t \cup \{(x_t, \pi_t(x_t))\}$.
7: end for

Algorithm 2 Sampler
1: Input: number of time-steps $T$ (and the planner's policies $\{\pi_t\}_{t=1}^T$).
2: Initialize data buffer $\tilde{\mathcal D}_1 = \emptyset$.
3: for $t = 1, \dots, T$ do
4:   Define deployment policy $\tilde\pi_t = \pi_t$.
5:   Observe context $\tilde x_t \sim \mathcal P$.
6:   Play action $\tilde a_t = \tilde\pi_t(\tilde x_t)$ and receive reward $\tilde r_t$.
7:   Update $\tilde{\mathcal D}_{t+1} = \tilde{\mathcal D}_t \cup \{(\tilde x_t, \tilde a_t, \tilde r_t)\}$.
8: end for
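For concreteness, the sketch below instantiates Algorithm 1 for a finite function class by brute force. The helper names (data_norm, uncertainty_radius, eluder_planner), the lookup-table policies restricted to the offline contexts, and the quadratic enumeration over pairs in $\mathcal F$ are illustrative simplifications of ours; a practical implementation would replace the enumeration with an optimization oracle over $\mathcal F$.

```python
# Brute-force sketch of Algorithm 1 (EluderPlanner) for a finite function class.
import itertools
import math
from typing import Callable, Dict, List, Sequence, Tuple

Context = int                                              # illustrative: integer-indexed contexts
Action = int
FunctionClass = List[Callable[[Context, Action], float]]   # finite class F


def data_norm(f, f_prime, dataset: Sequence[Tuple[Context, Action]]) -> float:
    """||f - f'||_D = sqrt( sum_{(x, a) in D} (f(x, a) - f'(x, a))^2 )."""
    return math.sqrt(sum((f(x, a) - f_prime(x, a)) ** 2 for x, a in dataset))


def uncertainty_radius(x: Context, a: Action, dataset, F: FunctionClass, beta: float) -> float:
    """omega(x, a, D): largest disagreement at (x, a) among pairs f, f' in F whose
    data-norm distance on D is at most 4 * beta (line 4 of Algorithm 1)."""
    return max(f(x, a) - g(x, a)
               for f, g in itertools.product(F, F)
               if data_norm(f, g, dataset) <= 4 * beta)


def eluder_planner(contexts: List[Context], actions: List[Action], F: FunctionClass,
                   beta_fn: Callable[[int], float], T: int) -> List[Dict[Context, Action]]:
    """Returns static policies pi_1, ..., pi_T, each a context -> action lookup table
    defined (for simplicity) only on the distinct offline contexts."""
    policies: List[Dict[Context, Action]] = []
    dataset: List[Tuple[Context, Action]] = []              # D_t: context-action pairs, no rewards
    support = sorted(set(contexts))
    for t in range(1, T + 1):
        beta = beta_fn(t)
        pi_t = {x: max(actions, key=lambda a: uncertainty_radius(x, a, dataset, F, beta))
                for x in support}
        policies.append(pi_t)
        x_t = contexts[t - 1]
        dataset.append((x_t, pi_t[x_t]))                    # the planner never observes rewards
    return policies
```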
The EluderPlanner algorithm takes as input $m \ge T$ i.i.d. samples from the context distribution, $\{x_\ell \sim \mathcal P\}_{\ell=1}^m$, and a realizable function class $\mathcal F$ satisfying Assumption 3.1. The learner iterates over $T$ of these $m$ samples in a sequential manner. We use the name $x_t$ to denote the $t$-th input sample and call $\pi_t$ the (deterministic) exploration policy produced upon processing samples $1, \dots, t-1$. We use the notation $\mathcal D_t = \{(x_\ell, \pi_\ell(x_\ell))\}_{\ell=1}^{t-1}$ to denote the dataset composed of the first $t-1$ (ordered) samples from $\{x_\ell\}_{\ell=1}^m$ and actions from the policies $\{\pi_\ell\}_{\ell=1}^{t-1}$. We adopt the convention that $\mathcal D_1 = \emptyset$. The output of the EluderPlanner algorithm is the sequence of policies $\{\pi_t\}_{t=1}^T$. Policy $\pi_t$ is defined as $\pi_t(x) = \arg\max_{a\in\mathcal A} \omega(x, a, \mathcal D_t)$, where $\omega(x, a, \mathcal D_t)$ is an uncertainty measure defined as

$$\omega(x, a, \mathcal D_t) \;=\; \max_{f, f' \in \mathcal F\ \mathrm{s.t.}\ \|f - f'\|_{\mathcal D_t} \le 4\beta_{\mathcal F}(t,\delta)}\ f(x,a) - f'(x,a), \qquad (1)$$

and the confidence radius function $\beta_{\mathcal F}(t, \delta)$ equals

$$\beta_{\mathcal F}(t, \delta) \;=\; C\,(B + \bar B)\sqrt{\log\big(|\mathcal F|\, t/\delta\big)} \qquad (2)$$

for some universal constant $C > 0$ defined in Proposition A.7. This is a complexity radius implied by the guarantees of least squares regression (Lemma A.8) and constraints imposed by Lemma 4.6. See Appendix A.1 and Proposition A.7 for a detailed discussion of this definition.

We call the combination of the EluderPlanner and the Sampler the eluder planning algorithm. Its performance guarantees are obtained by relating the regret of a constructed sequence of optimistic policies $\{\tilde\pi^{\mathrm{opt}}_t\}_{t=1}^T$, based on the data collected by $\{\pi_t\}_{t=1}^T$, to the sum of uncertainty radii $\sum_{t=1}^T \omega(x_t, \pi_t(x_t), \mathcal D_t)$ of the context-action pairs collected during the planning phase. Using optimism when constructing this policy sequence is important. In contrast with the sequence of greedy policies $\pi^{\mathrm{greedy}}_t(x) = \arg\max_{a\in\mathcal A} \hat f_t(x,a)$, where $\hat f_t = \arg\min_{f\in\mathcal F}\sum_{(x,a,r)\in\tilde{\mathcal D}_t}(f(x,a) - r)^2$ results from least squares over $\tilde{\mathcal D}_t$, the sequence of optimistic policies satisfies a regret bound. Thus, playing a uniform policy from the sequence $\{\tilde\pi^{\mathrm{opt}}_t\}_{t=1}^T$ satisfies a simple regret guarantee. The sum of uncertainty radii $\sum_{t=1}^T \omega(x_t, \pi_t(x_t), \mathcal D_t)$ is bounded using a lemma we borrow from [9].

Lemma 4.3 (Lemma 3 from [9]). Let $\mathcal F$ be a function class satisfying Assumption 3.2 with $\varepsilon$-eluder dimension $d_{\mathrm{eluder}}(\mathcal F, \varepsilon)$. For all $T \in \mathbb N$ and any dataset sequence $\{\mathcal D_t\}_{t=1}^\infty$ with $\mathcal D_1 = \emptyset$ and $\mathcal D_t = \{(x_\ell, a_\ell)\}_{\ell=1}^{t-1}$ of context-action pairs, the following inequality on the sum of the uncertainty radii holds:

$$\sum_{t=1}^T \omega(x_t, a_t, \mathcal D_t) \;\le\; O\Big(\min\big(BT,\ B\, d_{\mathrm{eluder}}(\mathcal F, B/T)\big) \;+\; \beta_{\mathcal F}(T, \delta)\sqrt{d_{\mathrm{eluder}}(\mathcal F, B/T)\, T}\Big).$$

Extracting Policy $\hat\pi_T$ from Data. At the end of the execution of the Sampler algorithm, the learner has access to a sequence of datasets $\{\tilde{\mathcal D}_t\}_{t=1}^T$ of context, action and reward triplets. The learner uses this sequence of datasets to compute the sequence of regression functions

$$\hat f_t \;=\; \arg\min_{f\in\mathcal F} \sum_{(\tilde x_\ell, \tilde a_\ell, \tilde r_\ell)\in\tilde{\mathcal D}_t} \big(f(\tilde x_\ell, \tilde a_\ell) - \tilde r_\ell\big)^2. \qquad (3)$$

Standard least squares (LS) results (see Lemma A.8) imply that $\|\hat f_t - f_\star\|_{\tilde{\mathcal D}_t} \le \beta_{\mathcal F}(t, \delta)$ with high probability for all $t \in \mathbb N$. These confidence sets are used to define a sequence of optimistic (deterministic) policies

$$\tilde\pi^{\mathrm{opt}}_t(x) \;=\; \arg\max_{a\in\mathcal A}\ \max_{f\ \mathrm{s.t.}\ \|\hat f_t - f\|_{\tilde{\mathcal D}_t} \le \beta_{\mathcal F}(t,\delta)}\ f(x, a). \qquad (4)$$

The candidate optimal policy $\hat\pi_T$ is then defined as $\hat\pi_T = \mathrm{Uniform}(\tilde\pi^{\mathrm{opt}}_1, \dots, \tilde\pi^{\mathrm{opt}}_T)$.
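The policy-extraction step just described admits the following sketch for a finite class. Function names and the exact indexing of the dataset prefixes are illustrative choices of ours, and the brute-force minimum over $\mathcal F$ stands in for the least squares oracle of Equation (3).

```python
# Sketch of the policy extraction step: least squares fits, optimistic policies,
# and the uniform mixture hat pi_T (illustrative, finite function class).
import math
import random
from typing import Callable, List, Sequence, Tuple

Context = int
Action = int
Triple = Tuple[Context, Action, float]


def least_squares_fit(F, data: Sequence[Triple]):
    """hat f = argmin_{f in F} sum_{(x, a, r) in data} (f(x, a) - r)^2, as in Equation (3)."""
    return min(F, key=lambda f: sum((f(x, a) - r) ** 2 for x, a, r in data))


def optimistic_policy(F, data: Sequence[Triple], beta: float,
                      actions: List[Action]) -> Callable[[Context], Action]:
    """pi_opt(x) = argmax_a max_{f : ||hat f - f||_data <= beta} f(x, a), as in Equation (4)."""
    f_hat = least_squares_fit(F, data)

    def norm(f, g):                                    # data norm restricted to (x, a) pairs
        return math.sqrt(sum((f(x, a) - g(x, a)) ** 2 for x, a, _ in data))

    confidence_set = [f for f in F if norm(f_hat, f) <= beta]

    def pi(x: Context) -> Action:
        return max(actions, key=lambda a: max(f(x, a) for f in confidence_set))

    return pi


def extract_policy(F, data: Sequence[Triple], beta_fn: Callable[[int], float],
                   actions: List[Action]) -> Callable[[Context], Action]:
    """hat pi_T = Uniform(pi_opt_1, ..., pi_opt_T): each call samples one optimistic policy.

    `data` is the full list of triples collected by the Sampler; the prefix data[:t]
    plays the role of the dataset tilde D_{t+1}."""
    T = len(data)
    policies = [optimistic_policy(F, data[:t], beta_fn(t + 1), actions) for t in range(T)]

    def pi_hat(x: Context) -> Action:
        return random.choice(policies)(x)

    return pi_hat
```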
The main result of this section is the following theorem.

Theorem 4.4. Let $\varepsilon > 0$. There exists a universal constant $c > 0$ such that if

$$T \;\ge\; \frac{c\,\max^2(B, \bar B, 1)\, d_{\mathrm{eluder}}(\mathcal F, B/T)\, \log(|\mathcal F| T/\delta)}{\varepsilon^2} \qquad (5)$$

then with probability at least $1 - \delta$ the eluder planning algorithm's policy $\hat\pi_T$ is $\varepsilon$-optimal.

Remark 4.5. The results of this section also hold for the structured bandits setting [23, 27], that is, scenarios where $\mathcal X = \{\emptyset\}$ consists of a single trivial context.

Both sides of the inequality defining $T$ in Theorem 4.4 depend on $T$. Provided $d_{\mathrm{eluder}}(\mathcal F, B/T)$ is sublinear as a function of $T$, there exists a finite value of $T$ satisfying Equation (5). When $d_{\mathrm{eluder}}(\mathcal F, \varepsilon) \le \bar d_{\mathrm{eluder}} \log(1/\varepsilon)$ for some $\bar d_{\mathrm{eluder}}$ and all $\varepsilon > 0$, setting $T = c\,\max^2(B, \bar B, 1)\, d_{\mathrm{eluder}}(\mathcal F, B\varepsilon)\log(|\mathcal F|/(\varepsilon\delta))/\varepsilon^2$ is enough to guarantee the conditions of Theorem 4.4 are satisfied. Examples 4 and 5 from [42] show the eluder dimension of linear or generalized linear function classes can be written in this way; in these cases $\bar d_{\mathrm{eluder}}$ is a constant multiple of the underlying dimension of the space. Using standard techniques, the results of Theorem 4.4 can be adapted to the case where the function class $\mathcal F$ is infinite but admits a metrization and a covering number. In this case the sample complexity scales not with $\log(|\mathcal F|)$ but with the logarithm of the covering number of $\mathcal F$. For example, in the case of linear functions, the logarithm of the covering number of $\mathcal F$ under the $\ell_2$ norm scales as $d$, the ambient dimension of the space, up to logarithmic factors of $T$ (see for example Lemma D.1 of [14]). Plugging this into the sample guarantees of Theorem 4.4 recovers the sample complexity for linear experiment planning from [53].

4.1 Proof Sketch of Theorem 4.4

Let $\tilde x_1, \dots, \tilde x_T$ be the sampler's context sequence. The proof works by showing that with probability at least $1 - \delta$ the regret of the $\{\tilde\pi^{\mathrm{opt}}_t\}_{t=1}^T$ sequence satisfies the regret bound

$$\sum_{t=1}^T \max_{a\in\mathcal A} f_\star(\tilde x_t, a) - f_\star\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) \;\le\; \widetilde O\Big(\sqrt{d_{\mathrm{eluder}}(\mathcal F, B/T)\, T\, \log(|\mathcal F| T/\delta)}\Big).$$

A key part of the proof involves relating the balls defined by the data norms of the planner datasets $\{\mathcal D_t\}_{t=1}^T$ to those induced by the data norms of the sampler datasets $\{\tilde{\mathcal D}_t\}_{t=1}^T$. The precise statement of this relationship is provided in Lemma 4.6, and its proof can be found in Appendix C.

Lemma 4.6. With probability at least $1 - \delta$,

$$\big\{(f, f')\ \mathrm{s.t.}\ \|f - f'\|_{\tilde{\mathcal D}_t} \le 2\beta_{\mathcal F}(t, \delta)\big\} \;\subseteq\; \big\{(f, f')\ \mathrm{s.t.}\ \|f - f'\|_{\mathcal D_t} \le 4\beta_{\mathcal F}(t, \delta)\big\} \qquad (6)$$

simultaneously for all $t \in [T]$, where $\{\mathcal D_t\}_{t=1}^T$ is the dataset sequence resulting from the execution of EluderPlanner, $\{\tilde{\mathcal D}_t\}_{t=1}^T$ is the dataset sequence resulting from the execution of the Sampler, and $\beta_{\mathcal F}(t, \delta)$ is the confidence radius function defined in Equation (2).

Let $\hat f_t$ be as in Equation (3) and define $\mathcal E$ as the event where the standard least squares results (see Lemma A.8) hold, i.e. $\|\hat f_t - f_\star\|_{\tilde{\mathcal D}_t} \le \beta_{\mathcal F}(t, \delta)$ for all $t \in [T]$, and Equation (6) from Lemma 4.6 also holds for all $t \in [T]$. The results of Lemmas A.8 and 4.6 imply $\mathbb P(\mathcal E) \ge 1 - \delta$. For context $x$ and action $a$ let us denote by $f^{x,a}_t$ a function achieving the inner maximum in the definition of $\tilde\pi^{\mathrm{opt}}_t$ (see Equation (4)). When $\mathcal E$ holds the policies $\{\tilde\pi^{\mathrm{opt}}_t\}_{t=1}^T$ are optimistic in the sense that $f^{x, \tilde\pi^{\mathrm{opt}}_t(x)}_t\big(x, \tilde\pi^{\mathrm{opt}}_t(x)\big) \ge \max_{a\in\mathcal A} f_\star(x, a)$, and therefore

$$\max_{a\in\mathcal A} f_\star(\tilde x_t, a) - f_\star\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) \;\overset{(i)}{\le}\; f^{\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)}_t\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) - f_\star\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) \;\overset{(ii)}{\le}\; \max_{(f, f') \in B_t(2, \tilde{\mathcal D}_t)} f\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) - f'\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) \;\overset{(iii)}{\le}\; \max_{(f, f') \in B_t(4, \mathcal D_t)} f\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) - f'\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) \;=:\; (\star),$$

where $B_t(\gamma, \mathcal D) = \{(f, f') \in \mathcal F \times \mathcal F\ \mathrm{s.t.}\ \|f - f'\|_{\mathcal D} \le \gamma\,\beta_{\mathcal F}(t, \delta)\}$. Inequality $(i)$ follows by optimism, $(ii)$ is a consequence of $\big(f^{\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)}_t, f_\star\big) \in B_t(2, \tilde{\mathcal D}_t)$ when $\mathcal E$ holds, since in this case $\|\hat f_t - f_\star\|_{\tilde{\mathcal D}_t} \le \beta_{\mathcal F}(t, \delta)$ and $\|f^{\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)}_t - \hat f_t\|_{\tilde{\mathcal D}_t} \le \beta_{\mathcal F}(t, \delta)$, and $(iii)$ follows because when $\mathcal E$ holds, Equation (6) of Lemma 4.6 is satisfied and therefore $B_t(2, \tilde{\mathcal D}_t) \subseteq B_t(4, \mathcal D_t)$. The right hand side $(\star)$ of the display above satisfies

$$(\star) \;\le\; \omega\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t), \mathcal D_t\big) \;\le\; \max_{a\in\mathcal A} \omega(\tilde x_t, a, \mathcal D_t) \;=\; \omega\big(\tilde x_t, \pi_t(\tilde x_t), \mathcal D_t\big).$$
We relate the sum of uncertainty radii $\{\omega(\tilde x_t, \pi_t(\tilde x_t), \mathcal D_t)\}_{t=1}^T$ to those of the planner, $\{\omega(x_t, \pi_t(x_t), \mathcal D_t)\}_{t=1}^T$, via Hoeffding's inequality (see Lemma A.1) and conclude that, with high probability,

$$\sum_{t=1}^T \max_{a\in\mathcal A} f_\star(\tilde x_t, a) - f_\star\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) \;\le\; \sum_{t=1}^T \omega(x_t, \pi_t(x_t), \mathcal D_t) \;+\; O\Big(B\sqrt{T \log(1/\delta)}\Big).$$

Lemma 4.3 allows us to bound the sum of these uncertainty radii as

$$\sum_{t=1}^T \omega(x_t, \pi_t(x_t), \mathcal D_t) \;\le\; O\Big(\min\big(BT,\ B\, d_{\mathrm{eluder}}(\mathcal F, B/T)\big) + \beta_{\mathcal F}(T, \delta)\sqrt{d_{\mathrm{eluder}}(\mathcal F, B/T)\, T}\Big)$$

and therefore, with high probability,

$$\sum_{t=1}^T \max_{a\in\mathcal A} f_\star(\tilde x_t, a) - f_\star\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) \;\le\; O\Big(\min\big(BT,\ B\, d_{\mathrm{eluder}}(\mathcal F, B/T)\big) + \beta_{\mathcal F}(T, \delta)\sqrt{d_{\mathrm{eluder}}(\mathcal F, B/T)\, T} + B\sqrt{T\log(1/\delta)}\Big).$$

Converting this cumulative regret bound into a simple regret one (Lemma B.1) finalizes the result. A detailed version of the proof can be found in Appendix C.1.

5 Uniform Sampling Strategies

In this section we show that a uniform sampling strategy can produce an $\varepsilon$-optimal policy with probability at least $1 - \delta$ after collecting $O\big(\max^2(B, \bar B)\,|\mathcal A| \log(|\mathcal F|/(\varepsilon\delta))/\varepsilon^2\big)$ samples. This procedure achieves the same simple regret rate as converting SquareCB's cumulative regret into simple regret.^4 In contrast with the eluder planning algorithm, the uniform sampling strategy does not require a planning phase. Instead, it consists of running the Sampler (Algorithm 2) with $\tilde\pi_j = \mathrm{Uniform}(\mathcal A)$ for all $j \in [T]$. Given a dataset $\tilde{\mathcal D}_T$ of contexts, actions and rewards collected during the sampling phase, we solve the least squares problem

$$\hat f_T \;=\; \arg\min_{f\in\mathcal F} \sum_{(x_i, a_i, r_i)\in\tilde{\mathcal D}_T} \big(f(x_i, a_i) - r_i\big)^2$$

and define the policy $\hat\pi_T$ as $\hat\pi_T(x) = \arg\max_{a\in\mathcal A} \hat f_T(x, a)$. The main result of this section is the following.

^4 The regret bound of the SquareCB algorithm scales as $O\big(\sqrt{|\mathcal A| \log(|\mathcal F|)\, T}\,\log(T/\delta)\big)$.

Theorem 5.1. There exists a universal constant $c > 0$ such that if $T \ge c\,\max^2(B, \bar B)\,|\mathcal A| \log(|\mathcal F|/(\varepsilon\delta))/\varepsilon^2$ then with probability at least $1 - \delta$ the uniform planning algorithm's policy $\hat\pi_T$ is $\varepsilon$-optimal.

The proof of Theorem 5.1 can be found in Appendix D.
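The uniform sampling strategy of this section is simple enough to sketch in a few lines, again assuming a finite function class that can be enumerated; the names uniform_sampling_policies and greedy_from_uniform_data are ours.

```python
# Sketch of the uniform sampling strategy: random static policies, one least squares
# fit on the collected data, and a greedy policy (illustrative, finite function class).
import random
from typing import Callable, List, Sequence, Tuple

Context = int
Action = int


def uniform_sampling_policies(actions: List[Action], T: int) -> List[Callable[[Context], Action]]:
    """The static 'plan': every deployment policy plays a uniformly random action."""
    return [lambda _x: random.choice(actions) for _ in range(T)]


def greedy_from_uniform_data(F, data: Sequence[Tuple[Context, Action, float]],
                             actions: List[Action]) -> Callable[[Context], Action]:
    """Fit hat f_T by least squares on the uniformly collected data and act greedily."""
    f_hat = min(F, key=lambda f: sum((f(x, a) - r) ** 2 for x, a, r in data))
    return lambda x: max(actions, key=lambda a: f_hat(x, a))
```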
Comparison Between OptLS and SquareCB. The regret rate after $T$ steps of the OptLS algorithm applied to a contextual bandit problem with discrete function class $\mathcal F$ scales^5 (up to logarithmic factors) as $O\big(\sqrt{d_{\mathrm{eluder}}(\mathcal F, B/T) \log(|\mathcal F|/\delta)\, T}\big)$. In contrast, the SquareCB algorithm satisfies a regret guarantee (up to logarithmic factors) of order $O\big(\sqrt{|\mathcal A| \log(|\mathcal F|/\delta)\, T}\big)$, where the eluder dimension dependence is substituted by a polynomial dependence on the number of actions $|\mathcal A|$. When the number of actions is small, or even constant, the regret rate of SquareCB can be much smaller than that of OptLS. The opposite is true when the number of actions is large or even infinite. Converting these cumulative regret bounds to simple regret implies that the number of samples required to produce an $\varepsilon$-optimal policy from the adaptive OptLS policy sequence scales as $d_{\mathrm{eluder}}(\mathcal F, B/T) \log(|\mathcal F|/\delta)/\varepsilon^2$, whereas for the adaptive SquareCB policy sequence it scales as $|\mathcal A| \log(|\mathcal F|/\delta)/\varepsilon^2$. The results of Theorems 4.4 and 5.1 recover these rates in the experiment planning setting.

^5 We obviate the $T$ dependence of $d_{\mathrm{eluder}}$ for readability.

6 Gap Between Experiment Planning and Adaptive Learning

The results of Sections 4 and 5 imply planning bounds that are comparable to the corresponding online-to-batch guarantees for OptLS [42] and SquareCB [17]. The main result of this section, Theorem 6.1, shows there are problems where the number of samples required for experiment planning can be substantially larger than the number of samples required by an adaptive learning algorithm. This result also implies the suboptimality of algorithms such as SquareCB and OptLS for adaptive learning. In order to state our results we consider an action set $\mathcal A_{\mathrm{tree}}$ indexed by the nodes of a binary tree of height $L$, defined here as having $L$ levels and $2^L - 1$ nodes.

[Figure 1: a binary tree whose nodes index the actions $a_{l,i}$.]

We call $a_{l,i}$ the $i$-th action of the $l$-th level of the tree (see Figure 1 for an example). Let $\varepsilon > 0$. We define a function class $\mathcal F_{\mathrm{tree}}$ indexed by paths from the root node to a leaf. For any such path $p = \{a_{1,1}, a_{2,i_2}, \dots, a_{L,i_L}\}$ the function $f^{(p)}$ equals

$$f^{(p)}(a) \;=\; \begin{cases} 1 & \text{if } a = a_{L,i_L}, \\ 1 - 2\varepsilon & \text{if } a \in p \setminus \{a_{L,i_L}\}, \\ 1 - 12\varepsilon & \text{otherwise.} \end{cases}$$

The following result fleshes out a separation between planning and adaptive learning with action set $\mathcal A_{\mathrm{tree}}$ and function class $\mathcal F_{\mathrm{tree}}$ in the setting where at time $T$ the learner produces a guess $\hat\pi_T$ for the optimal policy.

Theorem 6.1. Let $\varepsilon > 0$ and $T \in \mathbb N$. Consider the action set $\mathcal A_{\mathrm{tree}}$, the function class $\mathcal F_{\mathrm{tree}}$ and a reward noise process such that $\xi_t \sim N(0, 1)$ conditionally for all $t \in [T]$. For any planning algorithm $\mathrm{Alg}$ there is a function $f_\star \in \mathcal F_{\mathrm{tree}}$ such that when $T \le \frac{2^{L-5}}{9\varepsilon^2}$ and $\mathrm{Alg}$ interacts with $f_\star$ then

$$\mathbb E_{\mathrm{Alg}, f_\star}\big[\mathbb E_{a\sim\hat\pi_T}[f_\star(a)]\big] \;\le\; \max_{a\in\mathcal A_{\mathrm{tree}}} f_\star(a) - \varepsilon.$$

Moreover, there is an adaptive algorithm $\mathrm{Alg}_{\mathrm{adaptive}}$ such that if $T \ge \frac{2L \log(2L/\varepsilon)}{\varepsilon^2}$ then

$$\mathbb E_{\mathrm{Alg}_{\mathrm{adaptive}}}\big[\mathbb E_{a\sim\hat\pi_T}[f(a)]\big] \;\ge\; \max_{a\in\mathcal A_{\mathrm{tree}}} f(a) - \varepsilon$$

for all $f \in \mathcal F_{\mathrm{tree}}$. Here $\mathbb E_{\mathrm{Alg}, \tilde f}[\cdot]$ is the expectation over the randomness of $\mathrm{Alg}$ and the environment for target function $\tilde f$, and $\hat\pi_T$ is the algorithm's final policy guess after the sampling phase.

The main insight behind Theorem 6.1 is that adaptive strategies to find an optimal action in the $\mathcal F_{\mathrm{tree}}$ function class can make use of the conditional structure of the action space by trying to determine a path of actions from the root to the leaf containing the optimal action. An adaptive strategy can determine this path by querying only nodes that are adjacent to it. In contrast, a static experiment planning strategy cannot leverage this structure and has to query all leaf nodes. Theorem 6.1 thus implies a gap between the adaptive and experiment planning problems. Moreover, since the eluder dimension of $\mathcal F_{\mathrm{tree}}$ scales with $2^L$ (see Lemma D.1), OptLS and SquareCB are suboptimal adaptive algorithms for this model class: in contrast with the $O\big(2L\log(2L/\varepsilon)/\varepsilon^2\big)$ upper bound in Theorem 6.1, converting the cumulative regret bounds of OptLS and SquareCB yields guarantees scaling as $O\big(2^L \log(|\mathcal F_{\mathrm{tree}}|)/\varepsilon^2\big)$.

7 Model Selection

In this section we consider a setting where the learner has access to function (model) classes $\mathcal F_1, \dots, \mathcal F_M$, all with domain $\mathcal X \times \mathcal A$. The learner is promised there exists an index $i_\star$ such that $f_\star \in \mathcal F_{i_\star}$; the value of the index $i_\star$ is not known to the learner. We will show the uniform sampling strategy has a sample complexity that scales logarithmically with the number of models and with the complexity of the optimal model class $|\mathcal F_{i_\star}|$. In this setting the Sampler collects two uniform-action datasets $(\tilde{\mathcal D}^{(\mathrm{Train})}_T, \tilde{\mathcal D}^{(\mathrm{Test})}_T)$ of size $T/2$ each. Using the "train" dataset $\tilde{\mathcal D}^{(\mathrm{Train})}_T$ the learner computes candidate reward models $\hat f^{(i)}_T$ for all $i \in [M]$ by solving the least squares problems

$$\hat f^{(i)}_T \;=\; \arg\min_{f\in\mathcal F_i} \sum_{(x_\ell, a_\ell, r_\ell)\in\tilde{\mathcal D}^{(\mathrm{Train})}_T} \big(f(x_\ell, a_\ell) - r_\ell\big)^2.$$

Using the "test" dataset the learner extracts a guess $\hat i$ for the optimal model class index $i_\star$ by solving

$$\hat i \;=\; \arg\min_{i\in[M]} \sum_{(x_\ell, a_\ell, r_\ell)\in\tilde{\mathcal D}^{(\mathrm{Test})}_T} \big(\hat f^{(i)}_T(x_\ell, a_\ell) - r_\ell\big)^2.$$

The candidate policy $\hat\pi_T$ is defined as $\hat\pi_T(x) = \arg\max_{a\in\mathcal A} \hat f^{(\hat i)}_T(x, a)$. Let us start by relating the expected least squares loss of the candidate model $\hat f^{(\hat i)}_T$ to the size of the optimal model class $|\mathcal F_{i_\star}|$.

Proposition 7.1. There exists a universal constant $c > 0$ such that, with probability at least $1 - \delta$,

$$\mathbb E_{x\sim\mathcal P,\, a\sim\mathrm{Uniform}(\mathcal A)}\Big[\big(f_\star(x, a) - \hat f^{(\hat i)}_T(x, a)\big)^2\Big] \;\le\; \frac{c\,\max^2(B, \bar B)\,\log\big(T\max(M, |\mathcal F_{i_\star}|)/\delta\big)}{T}.$$

The proof of Proposition 7.1 can be found in Appendix E.1.
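The train/test model-selection procedure described above can be sketched as follows under the same finite-class assumption; the names fit_least_squares and model_selection_policy are ours, and the brute-force minimizations stand in for least squares oracles.

```python
# Sketch of the uniform sampling model-selection algorithm of Section 7
# (illustrative, finite function classes).
from typing import Callable, List, Sequence, Tuple

Context = int
Action = int
Triple = Tuple[Context, Action, float]


def fit_least_squares(F, data: Sequence[Triple]):
    """Least squares over a finite class: argmin_{f in F} sum (f(x, a) - r)^2."""
    return min(F, key=lambda f: sum((f(x, a) - r) ** 2 for x, a, r in data))


def model_selection_policy(classes: Sequence[Sequence[Callable[[Context, Action], float]]],
                           train: Sequence[Triple], test: Sequence[Triple],
                           actions: List[Action]) -> Callable[[Context], Action]:
    """Fit one candidate hat f^(i) per class on the train split, select the index hat i
    whose candidate has the smallest test-split squared error, and act greedily."""
    candidates = [fit_least_squares(F_i, train) for F_i in classes]
    test_losses = [sum((f(x, a) - r) ** 2 for x, a, r in test) for f in candidates]
    i_hat = min(range(len(classes)), key=lambda i: test_losses[i])
    f_hat = candidates[i_hat]
    return lambda x: max(actions, key=lambda a: f_hat(x, a))
```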
The uniform sampling model-selection algorithm has the following performance guarantee.

Theorem 7.2. There exists a universal constant $c_1 > 0$ such that if $T \ge c_1 \max^2(B, \bar B)\,|\mathcal A| \log\big(\max(M, |\mathcal F_{i_\star}|)/(\varepsilon\delta)\big)/\varepsilon^2$ then with probability at least $1 - \delta$ the candidate policy $\hat\pi_T$ of the uniform sampling model-selection algorithm is $\varepsilon$-optimal.

Proof. Due to the realizability assumption ($r_t = f_\star(x_t, a_t) + \xi_t$), the instantaneous regret of $\hat\pi_T$ on context $x \in \mathcal X$ equals $\max_a f_\star(x, a) - f_\star(x, \hat\pi_T(x))$. Let $\pi_\star(x) = \arg\max_{a\in\mathcal A} f_\star(x, a)$. Just as in the proof of Theorem 5.1 (see Equation 25), we can relate the suboptimality of $\hat\pi_T$ to the expected least squares loss of $\hat f^{(\hat i)}_T$ under the uniform policy:

$$\mathbb E_{x\sim\mathcal P}\Big[\max_{a\in\mathcal A} f_\star(x, a) - f_\star(x, \hat\pi_T(x))\Big] \;\le\; 2\sqrt{|\mathcal A|\,\mathbb E_{x\sim\mathcal P,\, a\sim\mathrm{Uniform}(\mathcal A)}\Big[\big(f_\star(x, a) - \hat f^{(\hat i)}_T(x, a)\big)^2\Big]}. \qquad (8)$$

Finally, Proposition 7.1 implies there is a universal constant $c > 0$ such that

$$\mathbb E_{x\sim\mathcal P,\, a\sim\mathrm{Uniform}(\mathcal A)}\Big[\big(f_\star(x, a) - \hat f^{(\hat i)}_T(x, a)\big)^2\Big] \;\le\; \frac{c\,(B^2 + \bar B^2)\,\log\big(T\max(M, |\mathcal F_{i_\star}|)/\delta\big)}{T}$$

with probability at least $1 - \delta$. Plugging this result back into Equation (8), the suboptimality of $\hat\pi_T$ can be upper bounded as

$$\mathbb E_{x\sim\mathcal P}\Big[\max_{a\in\mathcal A} f_\star(x, a) - f_\star(x, \hat\pi_T(x))\Big] \;\le\; 2\sqrt{\frac{c\,|\mathcal A|\,(B^2 + \bar B^2)\,\log\big(T\max(M, |\mathcal F_{i_\star}|)/\delta\big)}{T}}.$$

Setting $g(T) = 4c\,|\mathcal A|\,(B^2 + \bar B^2)\,\log\big(T\max(M, |\mathcal F_{i_\star}|)/\delta\big)/T$ in Lemma A.6 implies there exists a universal constant $c_1 > 0$ such that $g(T) \le \varepsilon^2$ for all $T \ge c_1 \max^2(B, \bar B)\,|\mathcal A| \log\big(\max(M, |\mathcal F_{i_\star}|)/(\varepsilon\delta)\big)/\varepsilon^2$, which concludes the proof.

The results of Theorem 7.2 are best understood by contrasting them with the uniform sampling algorithm whose input model class equals the union of the function classes, $\mathcal F_{\mathrm{all}} = \cup_{i\in[M]} \mathcal F_i$. Applying the results of Theorem 5.1 to $\mathcal F_{\mathrm{all}}$ yields a sample complexity scaling with $\max_{i\in[M]} \log(|\mathcal F_i|)$, a quantity that could be much larger than $\log(|\mathcal F_{i_\star}|)$. In contrast, the uniform sampling model-selection algorithm achieves a sample complexity scaling with $\log(|\mathcal F_{i_\star}|)$, at a price logarithmic in the number of model classes $M$. This logarithmic dependence on $M$ stands apart from model selection algorithms for cumulative regret scenarios such as Corral [5, 39], ECE [30] and Regret Balancing [38, 12], which instead have a polynomial dependence on $M$. The uniform sampling model-selection algorithm is agnostic to the value of $\varepsilon$: the results of Theorem 7.2 hold for any $\varepsilon$. If $\varepsilon$ is known in advance the learner can instead compute the model class index $\hat i = \min\big\{ i \in [M]\ \mathrm{s.t.}\ T \ge c\,\max^2(B, \bar B)\,|\mathcal A| \log(|\mathcal F_i|/(\varepsilon\delta))/\varepsilon^2 \big\}$ and use the uniform sampling strategy for $\mathcal F_{\hat i}$. For this choice of $\varepsilon$, Theorem 5.1 guarantees bounds similar to those of Theorem 7.2. Unfortunately, in contrast with the uniform sampling model-selection algorithm, this method would be valid only for a single choice of $\varepsilon$.

8 Conclusion

In this work we have introduced the first set of algorithms for the experiment planning problem for contextual bandits with general function approximation. We have developed the EluderPlanner algorithm, which produces a static policy sequence that, after deployment, can be used to recover an $\varepsilon$-optimal policy. We showed it is enough for the number of static policies, and therefore the number of samples collected during the sampling phase, to be as large as the number of samples required by an adaptive procedure based on an online-to-batch conversion of the OptLS algorithm. Similarly, we demonstrated that the uniform sampling strategy enjoys the same online-to-batch conversion sample complexity as the SquareCB algorithm. These results seemingly suggest that simple regret rates for adaptive learning may also be achieved in experiment planning scenarios. We show this is not the case.
There exist structured bandit problems for which adaptive learning may require a number of samples that is substantially smaller than the number of samples required by a static policy sequence. This is significant because it implies the suboptimality of the rates achieved by existing adaptive learning algorithms such as OptLS and SquareCB, and also because it draws a clear distinction between adaptive learning and experiment planning. It implies these algorithms are either suboptimal or their upper bound analysis is not tight; we believe the former to be the case. This is an important open question we hope to see addressed in future research. We have also introduced the first model selection results for the experiment planning problem. Many important questions remain regarding this setting. Chief among them is to characterize the true statistical complexity of experimental design for contextual bandits with general function approximation. Our results indicate the eluder dimension is not the sharpest statistical complexity measure to characterize learning here. Developing a new notion of complexity, as well as an accompanying algorithm that achieves the true statistical lower bound for the problem of experiment planning, remains an exciting and important open question to tackle in future research.

Acknowledgments and Disclosure of Funding

We thank Akshay Krishnamurthy for helpful discussions. Aldo Pacchiano would like to thank the support of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard. This work was supported in part by funding from the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard. This work was supported in part by NSF grant #2112926.

References

[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24:2312–2320, 2011.
[2] A. Agarwal, M. Dudík, S. Kale, J. Langford, and R. Schapire. Contextual bandit learning with predictable rewards. In Artificial Intelligence and Statistics, pages 19–26. PMLR, 2012.
[3] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646. PMLR, 2014.
[4] A. Agarwal, S. Bird, M. Cozowicz, L. Hoang, J. Langford, S. Lee, J. Li, D. Melamed, G. Oshri, O. Ribas, et al. Making contextual decisions with low technical debt. arXiv preprint arXiv:1606.03966, 2016.
[5] A. Agarwal, H. Luo, B. Neyshabur, and R. E. Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, pages 12–38. PMLR, 2017.
[6] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135. PMLR, 2013.
[7] A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26. JMLR Workshop and Conference Proceedings, 2011.
[8] C. J. Bryan, E. Tipton, and D. S. Yeager. Behavioural science is unlikely to change the world without a heterogeneity revolution. Nature Human Behaviour, 5(8):980–989, 2021.
[9] J. Chan, A. Pacchiano, N. Tripuraneni, Y. S. Song, P. Bartlett, and M. I. Jordan. Parallelizing contextual bandits. arXiv preprint arXiv:2105.10590, 2021.
[10] J. Chen, A. Modi, A. Krishnamurthy, N. Jiang, and A. Agarwal.
On the statistical efficiency of reward-free exploration in non-linear RL. Advances in Neural Information Processing Systems, 35:20960–20973, 2022.
[11] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
[12] A. Cutkosky, C. Dann, A. Das, C. Gentile, A. Pacchiano, and M. Purohit. Dynamic balancing for model selection in bandits and RL. In International Conference on Machine Learning, pages 2276–2285. PMLR, 2021.
[13] R. Degenne, P. Ménard, X. Shang, and M. Valko. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pages 2432–2442. PMLR, 2020.
[14] S. S. Du, S. M. Kakade, J. D. Lee, S. Lovett, G. Mahajan, W. Sun, and R. Wang. Bilinear classes: A structural framework for provable generalization in RL. arXiv preprint arXiv:2103.10897, 2021.
[15] H. Esfandiari, A. Karbasi, A. Mehrabian, and V. Mirrokni. Regret bounds for batched bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7340–7348, 2021.
[16] T. Fiez, L. Jain, K. G. Jamieson, and L. Ratliff. Sequential experimental design for transductive linear bandits. Advances in Neural Information Processing Systems, 32, 2019.
[17] D. Foster and A. Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pages 3199–3210. PMLR, 2020.
[18] D. Foster, A. Agarwal, M. Dudik, H. Luo, and R. Schapire. Practical contextual bandits with regression oracles. In International Conference on Machine Learning, pages 1539–1548. PMLR, 2018.
[19] D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100–118, 1975.
[20] Y. Jedra and A. Proutiere. Optimal best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 33:10007–10017, 2020.
[21] C. Jin, A. Krishnamurthy, M. Simchowitz, and T. Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020.
[22] C. Jin, Q. Liu, and S. Miryoosefi. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. Advances in Neural Information Processing Systems, 34:13406–13418, 2021.
[23] K.-S. Jun and C. Zhang. Crush optimism with pessimism: Structured bandits beyond asymptotic optimality. Advances in Neural Information Processing Systems, 33:6366–6376, 2020.
[24] J. Kiefer and J. Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960.
[25] J. Kober, J. A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
[26] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in Neural Information Processing Systems, 20(1):96–1, 2007.
[27] T. Lattimore and R. Munos. Bounded regret for finite-armed structured bandits. Advances in Neural Information Processing Systems, 27, 2014.
[28] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[29] T. Lattimore, C. Szepesvari, and G. Weisz. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, pages 5662–5670. PMLR, 2020.
[30] J. Lee, A. Pacchiano, V. Muthukumar, W. Kong, and E. Brunskill. Online model selection for reinforcement learning with function approximation. In International Conference on Artificial Intelligence and Statistics, pages 3340–3348. PMLR, 2021.
[31] G. Li, P. Kamath, D. J. Foster, and N. Srebro. Understanding the eluder dimension. Advances in Neural Information Processing Systems, 35:23737–23750, 2022.
[32] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
[33] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.
[34] Z. Li, L. Ratliff, K. G. Jamieson, L. Jain, et al. Instance-optimal PAC algorithms for contextual bandits. Advances in Neural Information Processing Systems, 35:37590–37603, 2022.
[35] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[36] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[37] I. Osband and B. Van Roy. Model-based reinforcement learning and the eluder dimension. Advances in Neural Information Processing Systems, 27, 2014.
[38] A. Pacchiano, C. Dann, C. Gentile, and P. Bartlett. Regret bound balancing and elimination for model selection in bandits and RL. arXiv preprint arXiv:2012.13045, 2020.
[39] A. Pacchiano, M. Phan, Y. Abbasi Yadkori, A. Rao, J. Zimmert, T. Lattimore, and C. Szepesvari. Model selection in contextual stochastic bandit problems. Advances in Neural Information Processing Systems, 33:10328–10337, 2020.
[40] A. Pacchiano, D. Wulsin, R. A. Barton, and L. Voloch. Neural design for genetic perturbation experiments. arXiv preprint arXiv:2207.12805, 2022.
[41] Y. Ruan, J. Yang, and Y. Zhou. Linear bandits with limited adaptivity and learning distributional optimal design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 74–87, 2021.
[42] D. Russo and B. Van Roy. Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems, 26, 2013.
[43] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[44] D. Simchi-Levi and Y. Xu. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Mathematics of Operations Research, 2021.
[45] M. Soare, A. Lazaric, and R. Munos. Best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 27, 2014.
[46] R. S. Sutton. Introduction: The challenge of reinforcement learning. In Reinforcement Learning, pages 1–3. Springer, 1992.
[47] C. Tao, S. Blanco, and Y. Zhou. Best arm identification in linear bandits with linear dimension dependency. In International Conference on Machine Learning, pages 4877–4886. PMLR, 2018.
[48] A. Tewari and S. A. Murphy. From ads to interventions: Contextual bandits in mobile health. Mobile Health: Sensors, Analytic Methods, and Applications, pages 495–517, 2017.
[49] R. Wan, B. Kveton, and R. Song. Safe exploration for efficient policy evaluation and comparison. In International Conference on Machine Learning, pages 22491–22511. PMLR, 2022.
[50] R. Wang, S. S. Du, L. Yang, and R. R. Salakhutdinov. On reward-free reinforcement learning with linear function approximation. Advances in Neural Information Processing Systems, 33:17816–17826, 2020.
[51] L. Xu, J. Honda, and M. Sugiyama. A fully adaptive algorithm for pure exploration in linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 843–851. PMLR, 2018.
[52] A. Zanette. Exponential lower bounds for batch reinforcement learning: Batch RL can be exponentially harder than online RL. In International Conference on Machine Learning, pages 12287–12297. PMLR, 2021.
[53] A. Zanette, K. Dong, J. N. Lee, and E. Brunskill. Design of experiments for stochastic contextual linear bandits. Advances in Neural Information Processing Systems, 34:22720–22731, 2021.
[54] R. Zhu and B. Kveton. Safe optimal design with applications in off-policy learning. In International Conference on Artificial Intelligence and Statistics, pages 2436–2447. PMLR, 2022.