# Experiment Planning with Function Approximation

Aldo Pacchiano (Broad Institute & Boston University, apacchia@broadinstitute.org), Jonathan N. Lee (Stanford University, jnl@stanford.edu), Emma Brunskill (Stanford University, ebrun@cs.stanford.edu)

We study the problem of experiment planning with function approximation in contextual bandit problems. In settings where there is a significant overhead to deploying adaptive algorithms (for example, when the execution of the data collection policies is required to be distributed, or a human in the loop is needed to implement these policies), producing in advance a set of policies for data collection is paramount. We study the setting where a large dataset of contexts, but not rewards, is available and may be used by the learner to design an effective data collection strategy. Although this problem has been well studied when rewards are linear [53], results are still missing for more complex reward models. In this work we propose two experiment planning strategies compatible with function approximation. The first is an eluder planning and sampling procedure that can recover optimality guarantees depending on the eluder dimension [42] of the reward function class. For the second, we show that a uniform sampler achieves competitive optimality rates in the setting where the number of actions is small. We conclude by introducing a statistical gap that fleshes out the fundamental differences between planning and adaptive learning, and by providing results for planning with model selection.

1 Introduction

Data-driven decision-making algorithms have achieved impressive empirical success in various domains such as online personalization [4, 48], games [36, 43], dialogue systems [32] and robotics [25, 35]. In many of these decision-making scenarios it is often advantageous to take contextual information into account when making decisions. This recognition has sparked a growing interest in studying adaptive learning algorithms in the setting of contextual bandits [26, 33, 6] and reinforcement learning (RL) [46]. Adaptive learning scenarios involve the deployment of data collection policies, where learners observe rewards or environment information and use this knowledge to shape subsequent data collection strategies. Nonetheless, the practical implementation of adaptive policies in real-world experiments currently presents significant challenges. First, there are significant infrastructure requirements and associated overheads, which demand skills and resources many organizations lack. For example, while there are end-user services that enable organizations to automatically send different text messages to different individuals, such services typically do not offer adaptive bandit algorithms. Second, the resulting reward signal may be significantly delayed. As an example, the effect of personalized health screening reminders on a patient making a doctor's appointment may take weeks to materialize. Therefore, while it is increasingly recognized that there exist other settings where context-specific policies are likely to be beneficial, including behavioural science [8], many organizations working in these settings may find it infeasible to run experiments with adaptive policies. To bridge this gap, there is a need to explore the deployment of non-adaptive or static strategies that can effectively collect data with minimal or no updates.
Surprisingly, limited research has been conducted on this particular setting, highlighting the importance of further exploration in this area.

In this work, we consider how to design a static experimental sampling strategy for contextual multi-armed bandit settings which, when executed, yields a dataset from which we can compute a near-optimal contextual policy (one with small simple regret). Our framework closely adheres to the experiment planning problem formulation originally introduced by [53]. The problem involves a learner interacting with a contextual bandit problem where the reward function is unknown. Our assumption is that the learner possesses a substantial offline dataset comprising $m$ sample contexts, none of which include reward information. The learner's objective is to utilize this data to design a static policy sequence of length $T$ (where $m \ge T$) that enables the collection of valuable reward information when deployed in the real world. The ultimate goal is to produce an almost optimal policy using the data gathered during deployment. To address this scenario, the authors of [53] propose a two-phase approach involving Planner and Sampler algorithms. In the planner phase, the learner employs the context data to generate a set of sampling policies, which are then used in the sampler phase. The primary focus of [53] lies in the analysis of the linear case, assuming the reward to be a linear function of the context vectors. Their key algorithmic contribution is the linear Planner algorithm, which encompasses a reward-free instance of LinUCB. The analysis presented in [53] demonstrates that the number of samples required under the static sampler to achieve an $\varepsilon$-optimal policy (also referred to as $\varepsilon$ simple regret) scales as $\widetilde{O}(d^2/\varepsilon^2)$, where $d$ represents the dimension of the underlying space. This result matches the online minimax sample complexity for the simple regret problem [11, 1]. The algorithm proposed by [53] effectively constructs a static policy sequence of adequate span, utilizing the linear structure of the problem. However, in many problem settings, linearity alone may not be sufficient to capture the true nature of the underlying reward model. This limitation is particularly evident in scenarios such as genetic perturbation experimentation [40] and other similar contexts. In such cases, the availability of algorithms that do not rely on linearity becomes crucial. Unfortunately, extending the results and techniques of [53] to the generic function approximation regime is not straightforward. Here we consider the case where the reward is realized by an unknown function $f_\star$ belonging to a known function class $\mathcal{F}$ (which can be more complex than linear).

Adaptive Learning with Function Approximation. In adaptive learning settings, where a learner can change the data collection policies as it observes rewards and therefore has much more flexibility than in experiment planning scenarios, various adaptive learning procedures compatible with generic function approximation have been proposed for contextual bandit problems. Among these, two significant methods relevant to our discussion are the Optimistic Least Squares algorithm (OptLS) introduced by [42] and the SquareCB algorithm introduced by [17]. Both of these methods offer guarantees for cumulative regret.
Specifically, the cumulative regret of OptLS scales as $\widetilde{O}\big(\sqrt{d_{\mathrm{eluder}} \log(|\mathcal{F}|)\, T}\big)$, while the cumulative regret of SquareCB scales as $\widetilde{O}\big(\sqrt{|\mathcal{A}| \log(|\mathcal{F}|)\, T}\big)$, where $\mathcal{A}$ corresponds to the set of actions. The eluder dimension^1 ($d_{\mathrm{eluder}}$) is a statistical complexity measure introduced by [42] that enables deriving guarantees for adaptive learning algorithms based on the principle of optimism in the face of uncertainty in contextual bandits and reinforcement learning [31, 22, 37, 9]. By employing an online-to-batch conversion strategy, these two methods imply that the number of samples required to achieve $\varepsilon$ simple regret using an adaptive algorithm is at most $\widetilde{O}\big(d_{\mathrm{eluder}} \log(|\mathcal{F}|)/\varepsilon^2\big)$ and $\widetilde{O}\big(|\mathcal{A}| \log(|\mathcal{F}|)/\varepsilon^2\big)$ respectively.

^1 We formally introduce this quantity in Section 4. Here we use a simpler notation to avoid confusion.

Contributions. In this paper, we address the function approximation setting within the (static) contextual bandit experiment planning problem. Although in experiment planning the data collection policies have to be produced before interacting with the environment, we establish that, surprisingly, the adaptive $\varepsilon$ simple regret rates achieved by OptLS and SquareCB are also attainable by a static experiment sampling strategy. To achieve this, we introduce and analyze two experiment planning algorithms. The EluderPlanner algorithm (see Section 4) utilizes confidence intervals derived from least squares optimization to construct a static set of policies to be executed during the sampling phase. When the algorithm has access to a sufficient number of offline samples to generate $T \gtrsim d_{\mathrm{eluder}} \log(|\mathcal{F}|)/\varepsilon^2$ static policies, the learner can produce an $\varepsilon$-optimal policy $\hat{\pi}_T$, thus matching the online-to-batch $\varepsilon$ simple regret rates of OptLS. Since the eluder dimension of linear function classes of dimension $d$ is, up to logarithmic factors, of order $O(d)$, our results recover the linear experiment planning rates of [53]. Additionally, in Section 5, we demonstrate that collecting $T \gtrsim |\mathcal{A}| \log(|\mathcal{F}|)/\varepsilon^2$ uniform samples is sufficient to obtain an $\varepsilon$-optimal policy $\hat{\pi}_T$. This matches the online-to-batch $\varepsilon$ simple regret rates of SquareCB. These two results yield conclusions similar to those known in the linear setting [53], but for general function classes.

These results prompt two important questions: (1) are existing adaptive cumulative regret algorithms already optimal for simple regret minimization, and (2) is static experiment planning sufficient to match the simple regret guarantees of adaptive learning for realizable contextual bandits? In Section 6, we provide a negative answer to both questions. We present a structured class of bandit problems where a different adaptive algorithm can require significantly fewer samples than implied by the bounds for OptLS and SquareCB. Our result is for the realizable setting [2, 18, 17, 44, 9], where the true reward function $f_\star$ belongs to a known function class $\mathcal{F}$ and where the true complexity of simple regret problems in adaptive learning scenarios was not previously known. This complements recent results [34] showing that online-to-batch strategies are suboptimal for minimizing simple regret in the agnostic contextual bandit setting.^2 For the second question, our statistical lower bound shows that, for this class of problems, a significantly smaller number of samples is needed to find an $\varepsilon$-optimal policy when using adaptive learning than when using a static policy.

^2 In the agnostic case the learner is instead presented with an arbitrary policy class that may or may not be related to the mean reward function $f_\star$. In this setting a recent study by [34] introduces an algorithm that achieves sharp rates characterized in terms of an instance-dependent quantity.
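To make the online-to-batch conversion used throughout this discussion explicit, the following display sketches the standard argument; the paper's formal version of this step is its Lemma B.1, the cumulative regret notation $R_T$ is ours, and constants and logarithmic factors are omitted.

```latex
% Run an adaptive algorithm for T rounds with policies \pi_1, \dots, \pi_T and
% cumulative regret R_T, and output \hat{\pi}_T = \mathrm{Uniform}(\pi_1, \dots, \pi_T).
% The expected simple regret of \hat{\pi}_T is the average instantaneous regret:
\mathbb{E}\Big[\max_{\pi} \mathbb{E}_{x \sim \mathcal{P}, a \sim \pi(x)}[f_\star(x,a)]
  - \mathbb{E}_{x \sim \mathcal{P}, a \sim \hat{\pi}_T(x)}[f_\star(x,a)]\Big]
  \;=\; \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\Big[\max_{a \in \mathcal{A}} f_\star(x_t, a)
  - f_\star\big(x_t, \pi_t(x_t)\big)\Big]
  \;=\; \frac{\mathbb{E}[R_T]}{T}.
% Requiring R_T / T \le \varepsilon with R_T = \widetilde{O}\big(\sqrt{d_{\mathrm{eluder}} \log(|\mathcal{F}|)\, T}\big)
% (OptLS) or R_T = \widetilde{O}\big(\sqrt{|\mathcal{A}| \log(|\mathcal{F}|)\, T}\big) (SquareCB) yields
T \;\gtrsim\; \frac{d_{\mathrm{eluder}} \log(|\mathcal{F}|)}{\varepsilon^2}
\qquad \text{and} \qquad
T \;\gtrsim\; \frac{|\mathcal{A}| \log(|\mathcal{F}|)}{\varepsilon^2},
% respectively, which are the adaptive simple regret sample complexities quoted above.
```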
This result complements related work in reinforcement learning with realizable linear state-action value function approximation [52]. Finally, we address the problem of model selection, where the learner is presented with a family of reward function classes $\{\mathcal F_i\}_{i=1}^M$ and is guaranteed that the true reward function $f_\star$ belongs to $\mathcal F_{i_\star}$, where the index $i_\star$ is unknown. We demonstrate that as long as $T \gtrsim |\mathcal A| \log\big(\max(M, |\mathcal F_{i_\star}|)\big)/\varepsilon^2$ uniform samples are available, it is possible to construct an $\varepsilon$-optimal policy $\hat\pi_T$. Further details regarding these results can be found in Section 7.

2 Further Related Work

Best Arm Identification and Design of Experiments. Previous work on efficiently producing an almost optimal policy for non-contextual linear and multi-armed bandit settings is based on best-arm identification procedures [16, 20, 13, 45, 47, 51]. These are strategies that react adaptively to the collected rewards and achieve instance-dependent sample complexities. Other papers in the design of experiments literature [24, 15, 29, 41] introduce methods for designing a non-adaptive policy that can be used to find an optimal arm with high probability in non-contextual scenarios.

Safe Optimal Design. When safety constraints are imposed on the exploration policy, works such as [54] and [49] have studied the problem of producing a safe and efficient static policy used to collect data that can then be used for policy evaluation in the multi-armed bandit and linear settings. This is in contrast with the unconstrained contextual function approximation scenario we study.

Reward Free Reinforcement Learning. In reinforcement learning settings, a related line of research [21, 50, 10] considers the problem of producing enough exploratory data to allow for policy optimization for all functions in a reward class. These works allow the learner to interact with the world for a number of steps sufficient to collect data that allows zero-shot estimation of the optimal policy within a reward family. This stands in contrast with the experiment planning setup, where the objective is to produce a sequence of policies for static deployment at test time with the objective of learning the unknown reward function.

3 Problem Definition

We study a two-phase interaction between a learner and its environment consisting of a Planning and a Sampling phase. During the planning phase the learner has access to $m \ge T$ i.i.d. offline context samples from a context distribution $\mathcal P$ supported on a context space $\mathcal X$, as well as knowledge of a reward function class $\mathcal F$ where each element $f \in \mathcal F$ has domain $\mathcal X \times \mathcal A$ and $\mathcal A$ is the action set.^3 During the sampling phase the learner interacts with a contextual bandit problem for $T$ steps: at each time-step $t$ a context $x_t \sim \mathcal P$ is revealed to the learner, the learner takes an action $a_t$ and receives a reward $r_t$. We adopt the commonly used realizability assumption, which is widely employed in the contextual bandit literature [2, 18, 17, 44, 9].

^3 This assumption is based on the fact that gathering offline context samples can be significantly less costly compared to conducting multiple rounds of experimental deployment.

Assumption 3.1 (Realizability). There exists a function $f_\star \in \mathcal F$ such that $r_t = f_\star(x_t, a_t) + \xi_t$, where $\xi_t$ is a conditionally zero-mean 1-subgaussian random variable for all $t \in [T]$ during the sampling phase.

During the planning phase the learner is required to build a static policy sequence $\{\pi_i\}_{i=1}^T$ (a policy is a mapping from $\mathcal X$ to $\Delta_{\mathcal A}$, the set of distributions over actions) that, without knowledge of the reward, can be deployed to collect data during the sampling phase. In contrast with adaptive learning procedures, in the experiment planning problem the policies in $\{\pi_i\}_{i=1}^T$ have to be fully specified during the planning phase and cannot depend on the reward values the learner observes during the sampling phase. The learner's objective is to use the reward data collected during the sampling phase to produce an $\varepsilon$-optimal policy $\hat\pi_T$ such that

$$\mathbb E_{x\sim\mathcal P,\, a\sim\hat\pi_T(x)}\big[f_\star(x,a)\big] \;\ge\; \max_{\pi}\ \mathbb E_{x\sim\mathcal P,\, a\sim\pi(x)}\big[f_\star(x,a)\big] \;-\; \varepsilon$$

for a suboptimality parameter $\varepsilon > 0$. Because we have assumed realizability of the reward function, the optimal policy among all possible policies, $\pi_\star \in \arg\max_\pi \mathbb E_{x\sim\mathcal P,\, a\sim\pi(x)}[f_\star(x,a)]$, can be written as $\pi_\star(x) = \arg\max_{a\in\mathcal A} f_\star(x,a)$ for all $x \in \mathcal X$.
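As a concrete, if schematic, picture of the two-phase protocol just defined, the minimal Python sketch below separates planning (contexts only, no rewards) from sampling (rewards observed, policies fixed in advance). The names planning_phase, sampling_phase, make_policy and env_step, as well as the representation of contexts, actions and policies, are illustrative choices of ours and not part of the paper.

```python
# A minimal sketch of the planning/sampling protocol of Section 3 (illustrative only).
from typing import Callable, List, Tuple

Context = Tuple[float, ...]                  # abstract context representation
Action = int
Policy = Callable[[Context], Action]         # deterministic policy: context -> action


def planning_phase(offline_contexts: List[Context],
                   make_policy: Callable[[int, List[Context]], Policy],
                   T: int) -> List[Policy]:
    """Planning phase: only offline contexts (no rewards) are available.

    `make_policy(t, prefix)` stands in for any planner rule, e.g. the eluder-based
    rule of Section 4 or the uniform rule of Section 5."""
    assert len(offline_contexts) >= T, "the planner assumes m >= T offline contexts"
    return [make_policy(t, offline_contexts[:t]) for t in range(T)]


def sampling_phase(policies: List[Policy],
                   env_step: Callable[[Policy], Tuple[Context, Action, float]]
                   ) -> List[Tuple[Context, Action, float]]:
    """Sampling phase: the pre-committed policies are executed in order and the
    resulting (context, action, reward) triples are recorded. No policy here may
    depend on the rewards observed during this phase."""
    return [env_step(pi) for pi in policies]
```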
Throughout this work we make the following boundedness assumptions.

Assumption 3.2 (Bounded Function Range). The range of $\mathcal F$ is bounded, i.e. for all $(x, a) \in \mathcal X \times \mathcal A$ and all $f \in \mathcal F$, $|f(x,a)| \le B$ for some constant $B > 0$.

Assumption 3.3 (Bounded Noise). The noise variables are bounded, $|\xi_t| \le \bar B$ for all $t \in [T]$, for some constant $\bar B > 0$.

Assumptions 3.2 and 3.3 imply $|r_t| \le B + \bar B$ for all $t \in [T]$.

4 Planning Under the Eluder Dimension

The eluder dimension is a sequential notion of complexity for function classes that was originally introduced by [42]. The eluder dimension can be used to bound the regret of optimistic least squares algorithms in contextual bandits [42, 9] and reinforcement learning [37]. Informally speaking, the eluder dimension is the length of the longest sequence of points one must observe to accurately estimate the function value at any other point. We use the eluder dimension definition from [42].

Definition 4.1 ($\varepsilon$-dependence). Let $\mathcal G$ be a scalar function class with domain $\mathcal Z$ and $\varepsilon > 0$. An element $z \in \mathcal Z$ is $\varepsilon$-dependent on $\{z_1, \dots, z_n\} \subseteq \mathcal Z$ w.r.t. $\mathcal G$ if any pair of functions $g, g' \in \mathcal G$ satisfying $\sqrt{\sum_{i=1}^n (g(z_i) - g'(z_i))^2} \le \varepsilon$ also satisfies $g(z) - g'(z) \le \varepsilon$. Furthermore, $z \in \mathcal Z$ is $\varepsilon$-independent of $\{z_1, \dots, z_n\}$ w.r.t. $\mathcal G$ if it is not $\varepsilon$-dependent on $\{z_1, \dots, z_n\}$.

Definition 4.2 ($\varepsilon$-eluder dimension). The $\varepsilon$ non-monotone eluder dimension $\vec d_{\mathrm{eluder}}(\mathcal G, \varepsilon)$ of $\mathcal G$ is the length of the longest sequence of elements in $\mathcal Z$ such that every element is $\varepsilon$-independent of its predecessors. Moreover, we define the $\varepsilon$-eluder dimension $d_{\mathrm{eluder}}(\mathcal G, \varepsilon)$ as $d_{\mathrm{eluder}}(\mathcal G, \varepsilon) = \max_{\varepsilon' \ge \varepsilon} \vec d_{\mathrm{eluder}}(\mathcal G, \varepsilon')$.

By definition the $\varepsilon$-eluder dimension increases as $\varepsilon$ is driven down to zero. Let $\mathcal D$ be a dataset of $(\mathcal X, \mathcal A)$ pairs. For any $f, f'$ with support $\mathcal X \times \mathcal A$ we use the notation $\|f - f'\|_{\mathcal D}$ to denote the data norm of the difference between functions $f$ and $f'$:

$$\|f - f'\|_{\mathcal D} \;=\; \sqrt{\sum_{(x,a)\in\mathcal D} \big(f(x,a) - f'(x,a)\big)^2}.$$

Algorithm 1 EluderPlanner
1: Input: $m \ge T$ samples $\{x_\ell \sim \mathcal P\}_{\ell=1}^m$, function class $\mathcal F$, confidence radius function $\beta_{\mathcal F} : (t, \delta) \mapsto \mathbb R$.
2: Initialize data buffer $\mathcal D_1 = \emptyset$.
3: for $t = 1, \dots, T$ do
4:   For any $x \in \mathcal X$ define the uncertainty radius of action $a \in \mathcal A$ as $\omega(x, a, \mathcal D_t) = \max_{f, f' \in \mathcal F\ \mathrm{s.t.}\ \|f - f'\|_{\mathcal D_t} \le 4\beta_{\mathcal F}(t,\delta)}\ f(x,a) - f'(x,a)$.
5:   Define the policy $\pi_t$ as $\pi_t(x) = \arg\max_{a\in\mathcal A} \omega(x, a, \mathcal D_t)$ for all $x \in \mathcal X$.
6:   Update $\mathcal D_{t+1} = \mathcal D_t \cup \{(x_t, \pi_t(x_t))\}$.
7: end for

Algorithm 2 Sampler
1: Input: number of time-steps $T$ (and the planner's policies $\{\pi_t\}_{t=1}^T$).
2: Initialize data buffer $\tilde{\mathcal D}_1 = \emptyset$.
3: for $t = 1, \dots, T$ do
4:   Define deployment policy $\tilde\pi_t = \pi_t$.
5:   Observe context $\tilde x_t \sim \mathcal P$.
6:   Play action $\tilde a_t = \tilde\pi_t(\tilde x_t)$ and receive reward $\tilde r_t$.
7:   Update $\tilde{\mathcal D}_{t+1} = \tilde{\mathcal D}_t \cup \{(\tilde x_t, \tilde a_t, \tilde r_t)\}$.
8: end for
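For concreteness, the sketch below instantiates Algorithm 1 for a finite function class by brute force. The helper names (data_norm, uncertainty_radius, eluder_planner), the lookup-table policies restricted to the offline contexts, and the quadratic enumeration over pairs in $\mathcal F$ are illustrative simplifications of ours; a practical implementation would replace the enumeration with an optimization oracle over $\mathcal F$.

```python
# Brute-force sketch of Algorithm 1 (EluderPlanner) for a finite function class.
import itertools
import math
from typing import Callable, Dict, List, Sequence, Tuple

Context = int                                              # illustrative: integer-indexed contexts
Action = int
FunctionClass = List[Callable[[Context, Action], float]]   # finite class F


def data_norm(f, f_prime, dataset: Sequence[Tuple[Context, Action]]) -> float:
    """||f - f'||_D = sqrt( sum_{(x, a) in D} (f(x, a) - f'(x, a))^2 )."""
    return math.sqrt(sum((f(x, a) - f_prime(x, a)) ** 2 for x, a in dataset))


def uncertainty_radius(x: Context, a: Action, dataset, F: FunctionClass, beta: float) -> float:
    """omega(x, a, D): largest disagreement at (x, a) among pairs f, f' in F whose
    data-norm distance on D is at most 4 * beta (line 4 of Algorithm 1)."""
    return max(f(x, a) - g(x, a)
               for f, g in itertools.product(F, F)
               if data_norm(f, g, dataset) <= 4 * beta)


def eluder_planner(contexts: List[Context], actions: List[Action], F: FunctionClass,
                   beta_fn: Callable[[int], float], T: int) -> List[Dict[Context, Action]]:
    """Returns static policies pi_1, ..., pi_T, each a context -> action lookup table
    defined (for simplicity) only on the distinct offline contexts."""
    policies: List[Dict[Context, Action]] = []
    dataset: List[Tuple[Context, Action]] = []              # D_t: context-action pairs, no rewards
    support = sorted(set(contexts))
    for t in range(1, T + 1):
        beta = beta_fn(t)
        pi_t = {x: max(actions, key=lambda a: uncertainty_radius(x, a, dataset, F, beta))
                for x in support}
        policies.append(pi_t)
        x_t = contexts[t - 1]
        dataset.append((x_t, pi_t[x_t]))                    # the planner never observes rewards
    return policies
```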
The EluderPlanner algorithm takes as input $m \ge T$ i.i.d. samples from the context distribution, $\{x_\ell \sim \mathcal P\}_{\ell=1}^m$, and a realizable function class $\mathcal F$ satisfying Assumption 3.1. The learner iterates over $T$ of these $m$ samples in a sequential manner. We use the name $x_t$ to denote the $t$-th input sample and call $\pi_t$ the (deterministic) exploration policy produced upon processing samples $1, \dots, t-1$. We use the notation $\mathcal D_t = \{(x_\ell, \pi_\ell(x_\ell))\}_{\ell=1}^{t-1}$ to denote the dataset composed of the first $t-1$ (ordered) samples from $\{x_\ell\}_{\ell=1}^m$ and actions from the policies $\{\pi_\ell\}_{\ell=1}^{t-1}$. We adopt the convention that $\mathcal D_1 = \emptyset$. The output of the EluderPlanner algorithm is the sequence of policies $\{\pi_t\}_{t=1}^T$. Policy $\pi_t$ is defined as $\pi_t(x) = \arg\max_{a\in\mathcal A} \omega(x, a, \mathcal D_t)$, where $\omega(x, a, \mathcal D_t)$ is an uncertainty measure defined as

$$\omega(x, a, \mathcal D_t) \;=\; \max_{f, f' \in \mathcal F\ \mathrm{s.t.}\ \|f - f'\|_{\mathcal D_t} \le 4\beta_{\mathcal F}(t,\delta)}\ f(x,a) - f'(x,a), \qquad (1)$$

and the confidence radius function $\beta_{\mathcal F}(t, \delta)$ equals

$$\beta_{\mathcal F}(t, \delta) \;=\; C\,(B + \bar B)\sqrt{\log\big(|\mathcal F|\, t/\delta\big)} \qquad (2)$$

for some universal constant $C > 0$ defined in Proposition A.7. This is a complexity radius implied by the guarantees of least squares regression (Lemma A.8) and constraints imposed by Lemma 4.6. See Appendix A.1 and Proposition A.7 for a detailed discussion of this definition.

We call the combination of the EluderPlanner and the Sampler the eluder planning algorithm. Its performance guarantees are obtained by relating the regret of a constructed sequence of optimistic policies $\{\tilde\pi^{\mathrm{opt}}_t\}_{t=1}^T$, based on the data collected by $\{\pi_t\}_{t=1}^T$, to the sum of uncertainty radii $\sum_{t=1}^T \omega(x_t, \pi_t(x_t), \mathcal D_t)$ of the context-action pairs collected during the planning phase. Using optimism when constructing this policy sequence is important. In contrast with the sequence of greedy policies $\pi^{\mathrm{greedy}}_t(x) = \arg\max_{a\in\mathcal A} \hat f_t(x,a)$, where $\hat f_t = \arg\min_{f\in\mathcal F}\sum_{(x,a,r)\in\tilde{\mathcal D}_t}(f(x,a) - r)^2$ results from least squares over $\tilde{\mathcal D}_t$, the sequence of optimistic policies satisfies a regret bound. Thus, playing a uniform policy from the sequence $\{\tilde\pi^{\mathrm{opt}}_t\}_{t=1}^T$ satisfies a simple regret guarantee. The sum of uncertainty radii $\sum_{t=1}^T \omega(x_t, \pi_t(x_t), \mathcal D_t)$ is bounded using a lemma we borrow from [9].

Lemma 4.3 (Lemma 3 from [9]). Let $\mathcal F$ be a function class satisfying Assumption 3.2 with $\varepsilon$-eluder dimension $d_{\mathrm{eluder}}(\mathcal F, \varepsilon)$. For all $T \in \mathbb N$ and any dataset sequence $\{\mathcal D_t\}_{t=1}^\infty$ with $\mathcal D_1 = \emptyset$ and $\mathcal D_t = \{(x_\ell, a_\ell)\}_{\ell=1}^{t-1}$ of context-action pairs, the following inequality on the sum of the uncertainty radii holds:

$$\sum_{t=1}^T \omega(x_t, a_t, \mathcal D_t) \;\le\; O\Big(\min\big(BT,\ B\, d_{\mathrm{eluder}}(\mathcal F, B/T)\big) \;+\; \beta_{\mathcal F}(T, \delta)\sqrt{d_{\mathrm{eluder}}(\mathcal F, B/T)\, T}\Big).$$

Extracting Policy $\hat\pi_T$ from Data. At the end of the execution of the Sampler algorithm, the learner has access to a sequence of datasets $\{\tilde{\mathcal D}_t\}_{t=1}^T$ of context, action and reward triplets. The learner uses this sequence of datasets to compute the sequence of regression functions

$$\hat f_t \;=\; \arg\min_{f\in\mathcal F} \sum_{(\tilde x_\ell, \tilde a_\ell, \tilde r_\ell)\in\tilde{\mathcal D}_t} \big(f(\tilde x_\ell, \tilde a_\ell) - \tilde r_\ell\big)^2. \qquad (3)$$

Standard least squares (LS) results (see Lemma A.8) imply that $\|\hat f_t - f_\star\|_{\tilde{\mathcal D}_t} \le \beta_{\mathcal F}(t, \delta)$ with high probability for all $t \in \mathbb N$. These confidence sets are used to define a sequence of optimistic (deterministic) policies

$$\tilde\pi^{\mathrm{opt}}_t(x) \;=\; \arg\max_{a\in\mathcal A}\ \max_{f\ \mathrm{s.t.}\ \|\hat f_t - f\|_{\tilde{\mathcal D}_t} \le \beta_{\mathcal F}(t,\delta)}\ f(x, a). \qquad (4)$$

The candidate optimal policy $\hat\pi_T$ is then defined as $\hat\pi_T = \mathrm{Uniform}(\tilde\pi^{\mathrm{opt}}_1, \dots, \tilde\pi^{\mathrm{opt}}_T)$.
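The policy-extraction step just described admits the following sketch for a finite class. Function names and the exact indexing of the dataset prefixes are illustrative choices of ours, and the brute-force minimum over $\mathcal F$ stands in for the least squares oracle of Equation (3).

```python
# Sketch of the policy extraction step: least squares fits, optimistic policies,
# and the uniform mixture hat pi_T (illustrative, finite function class).
import math
import random
from typing import Callable, List, Sequence, Tuple

Context = int
Action = int
Triple = Tuple[Context, Action, float]


def least_squares_fit(F, data: Sequence[Triple]):
    """hat f = argmin_{f in F} sum_{(x, a, r) in data} (f(x, a) - r)^2, as in Equation (3)."""
    return min(F, key=lambda f: sum((f(x, a) - r) ** 2 for x, a, r in data))


def optimistic_policy(F, data: Sequence[Triple], beta: float,
                      actions: List[Action]) -> Callable[[Context], Action]:
    """pi_opt(x) = argmax_a max_{f : ||hat f - f||_data <= beta} f(x, a), as in Equation (4)."""
    f_hat = least_squares_fit(F, data)

    def norm(f, g):                                    # data norm restricted to (x, a) pairs
        return math.sqrt(sum((f(x, a) - g(x, a)) ** 2 for x, a, _ in data))

    confidence_set = [f for f in F if norm(f_hat, f) <= beta]

    def pi(x: Context) -> Action:
        return max(actions, key=lambda a: max(f(x, a) for f in confidence_set))

    return pi


def extract_policy(F, data: Sequence[Triple], beta_fn: Callable[[int], float],
                   actions: List[Action]) -> Callable[[Context], Action]:
    """hat pi_T = Uniform(pi_opt_1, ..., pi_opt_T): each call samples one optimistic policy.

    `data` is the full list of triples collected by the Sampler; the prefix data[:t]
    plays the role of the dataset tilde D_{t+1}."""
    T = len(data)
    policies = [optimistic_policy(F, data[:t], beta_fn(t + 1), actions) for t in range(T)]

    def pi_hat(x: Context) -> Action:
        return random.choice(policies)(x)

    return pi_hat
```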
The main result of this section is the following theorem.

Theorem 4.4. Let $\varepsilon > 0$. There exists a universal constant $c > 0$ such that if

$$T \;\ge\; \frac{c\,\max^2(B, \bar B, 1)\, d_{\mathrm{eluder}}(\mathcal F, B/T)\, \log(|\mathcal F| T/\delta)}{\varepsilon^2} \qquad (5)$$

then with probability at least $1 - \delta$ the eluder planning algorithm's policy $\hat\pi_T$ is $\varepsilon$-optimal.

Remark 4.5. The results of this section also hold for the structured bandits setting [23, 27], that is, scenarios where $\mathcal X = \{\emptyset\}$ consists of a single trivial context.

Both sides of the inequality defining $T$ in Theorem 4.4 depend on $T$. Provided $d_{\mathrm{eluder}}(\mathcal F, B/T)$ is sublinear as a function of $T$, there exists a finite value of $T$ satisfying Equation (5). When $d_{\mathrm{eluder}}(\mathcal F, \varepsilon) \le \bar d_{\mathrm{eluder}} \log(1/\varepsilon)$ for some $\bar d_{\mathrm{eluder}}$ and all $\varepsilon > 0$, setting $T = c\,\max^2(B, \bar B, 1)\, d_{\mathrm{eluder}}(\mathcal F, B\varepsilon)\log(|\mathcal F|/(\varepsilon\delta))/\varepsilon^2$ is enough to guarantee the conditions of Theorem 4.4 are satisfied. Examples 4 and 5 from [42] show the eluder dimension of linear or generalized linear function classes can be written in this way; in these cases $\bar d_{\mathrm{eluder}}$ is a constant multiple of the underlying dimension of the space. Using standard techniques, the results of Theorem 4.4 can be adapted to the case where the function class $\mathcal F$ is infinite but admits a metrization and a covering number. In this case the sample complexity scales not with $\log(|\mathcal F|)$ but with the logarithm of the covering number of $\mathcal F$. For example, in the case of linear functions, the logarithm of the covering number of $\mathcal F$ under the $\ell_2$ norm scales as $d$, the ambient dimension of the space, up to logarithmic factors of $T$ (see for example Lemma D.1 of [14]). Plugging this into the sample guarantees of Theorem 4.4 recovers the sample complexity for linear experiment planning from [53].

4.1 Proof Sketch of Theorem 4.4

Let $\tilde x_1, \dots, \tilde x_T$ be the sampler's context sequence. The proof works by showing that with probability at least $1 - \delta$ the regret of the $\{\tilde\pi^{\mathrm{opt}}_t\}_{t=1}^T$ sequence satisfies the regret bound

$$\sum_{t=1}^T \max_{a\in\mathcal A} f_\star(\tilde x_t, a) - f_\star\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) \;\le\; \widetilde O\Big(\sqrt{d_{\mathrm{eluder}}(\mathcal F, B/T)\, T\, \log(|\mathcal F| T/\delta)}\Big).$$

A key part of the proof involves relating the balls defined by the data norms of the planner datasets $\{\mathcal D_t\}_{t=1}^T$ to those induced by the data norms of the sampler datasets $\{\tilde{\mathcal D}_t\}_{t=1}^T$. The precise statement of this relationship is provided in Lemma 4.6, and its proof can be found in Appendix C.

Lemma 4.6. With probability at least $1 - \delta$,

$$\big\{(f, f')\ \mathrm{s.t.}\ \|f - f'\|_{\tilde{\mathcal D}_t} \le 2\beta_{\mathcal F}(t, \delta)\big\} \;\subseteq\; \big\{(f, f')\ \mathrm{s.t.}\ \|f - f'\|_{\mathcal D_t} \le 4\beta_{\mathcal F}(t, \delta)\big\} \qquad (6)$$

simultaneously for all $t \in [T]$, where $\{\mathcal D_t\}_{t=1}^T$ is the dataset sequence resulting from the execution of EluderPlanner, $\{\tilde{\mathcal D}_t\}_{t=1}^T$ is the dataset sequence resulting from the execution of the Sampler, and $\beta_{\mathcal F}(t, \delta)$ is the confidence radius function defined in Equation (2).

Let $\hat f_t$ be as in Equation (3) and define $\mathcal E$ as the event where the standard least squares results (see Lemma A.8) hold, i.e. $\|\hat f_t - f_\star\|_{\tilde{\mathcal D}_t} \le \beta_{\mathcal F}(t, \delta)$ for all $t \in [T]$, and Equation (6) from Lemma 4.6 also holds for all $t \in [T]$. The results of Lemmas A.8 and 4.6 imply $\mathbb P(\mathcal E) \ge 1 - \delta$. For context $x$ and action $a$ let us denote by $f^{x,a}_t$ a function achieving the inner maximum in the definition of $\tilde\pi^{\mathrm{opt}}_t$ (see Equation (4)). When $\mathcal E$ holds the policies $\{\tilde\pi^{\mathrm{opt}}_t\}_{t=1}^T$ are optimistic in the sense that $f^{x, \tilde\pi^{\mathrm{opt}}_t(x)}_t\big(x, \tilde\pi^{\mathrm{opt}}_t(x)\big) \ge \max_{a\in\mathcal A} f_\star(x, a)$, and therefore

$$\max_{a\in\mathcal A} f_\star(\tilde x_t, a) - f_\star\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) \;\overset{(i)}{\le}\; f^{\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)}_t\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) - f_\star\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) \;\overset{(ii)}{\le}\; \max_{(f, f') \in B_t(2, \tilde{\mathcal D}_t)} f\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) - f'\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) \;\overset{(iii)}{\le}\; \max_{(f, f') \in B_t(4, \mathcal D_t)} f\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) - f'\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) \;=:\; (\star),$$

where $B_t(\gamma, \mathcal D) = \{(f, f') \in \mathcal F \times \mathcal F\ \mathrm{s.t.}\ \|f - f'\|_{\mathcal D} \le \gamma\,\beta_{\mathcal F}(t, \delta)\}$. Inequality $(i)$ follows by optimism, $(ii)$ is a consequence of $\big(f^{\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)}_t, f_\star\big) \in B_t(2, \tilde{\mathcal D}_t)$ when $\mathcal E$ holds, since in this case $\|\hat f_t - f_\star\|_{\tilde{\mathcal D}_t} \le \beta_{\mathcal F}(t, \delta)$ and $\|f^{\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)}_t - \hat f_t\|_{\tilde{\mathcal D}_t} \le \beta_{\mathcal F}(t, \delta)$, and $(iii)$ follows because when $\mathcal E$ holds, Equation (6) of Lemma 4.6 is satisfied and therefore $B_t(2, \tilde{\mathcal D}_t) \subseteq B_t(4, \mathcal D_t)$. The right hand side $(\star)$ of the display above satisfies

$$(\star) \;\le\; \omega\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t), \mathcal D_t\big) \;\le\; \max_{a\in\mathcal A} \omega(\tilde x_t, a, \mathcal D_t) \;=\; \omega\big(\tilde x_t, \pi_t(\tilde x_t), \mathcal D_t\big).$$
We relate the sum of uncertainty radii $\{\omega(\tilde x_t, \pi_t(\tilde x_t), \mathcal D_t)\}_{t=1}^T$ to those of the planner, $\{\omega(x_t, \pi_t(x_t), \mathcal D_t)\}_{t=1}^T$, via Hoeffding's inequality (see Lemma A.1) and conclude that, with high probability,

$$\sum_{t=1}^T \max_{a\in\mathcal A} f_\star(\tilde x_t, a) - f_\star\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) \;\le\; \sum_{t=1}^T \omega(x_t, \pi_t(x_t), \mathcal D_t) \;+\; O\Big(B\sqrt{T \log(1/\delta)}\Big).$$

Lemma 4.3 allows us to bound the sum of these uncertainty radii as

$$\sum_{t=1}^T \omega(x_t, \pi_t(x_t), \mathcal D_t) \;\le\; O\Big(\min\big(BT,\ B\, d_{\mathrm{eluder}}(\mathcal F, B/T)\big) + \beta_{\mathcal F}(T, \delta)\sqrt{d_{\mathrm{eluder}}(\mathcal F, B/T)\, T}\Big)$$

and therefore, with high probability,

$$\sum_{t=1}^T \max_{a\in\mathcal A} f_\star(\tilde x_t, a) - f_\star\big(\tilde x_t, \tilde\pi^{\mathrm{opt}}_t(\tilde x_t)\big) \;\le\; O\Big(\min\big(BT,\ B\, d_{\mathrm{eluder}}(\mathcal F, B/T)\big) + \beta_{\mathcal F}(T, \delta)\sqrt{d_{\mathrm{eluder}}(\mathcal F, B/T)\, T} + B\sqrt{T\log(1/\delta)}\Big).$$

Converting this cumulative regret bound into a simple regret one (Lemma B.1) finalizes the result. A detailed version of the proof can be found in Appendix C.1.

5 Uniform Sampling Strategies

In this section we show that a uniform sampling strategy can produce an $\varepsilon$-optimal policy with probability at least $1 - \delta$ after collecting $O\big(\max^2(B, \bar B)\,|\mathcal A| \log(|\mathcal F|/(\varepsilon\delta))/\varepsilon^2\big)$ samples. This procedure achieves the same simple regret rate as converting SquareCB's cumulative regret into simple regret.^4 In contrast with the eluder planning algorithm, the uniform sampling strategy does not require a planning phase. Instead, it consists of running the Sampler (Algorithm 2) with $\tilde\pi_j = \mathrm{Uniform}(\mathcal A)$ for all $j \in [T]$. Given a dataset $\tilde{\mathcal D}_T$ of contexts, actions and rewards collected during the sampling phase, we solve the least squares problem

$$\hat f_T \;=\; \arg\min_{f\in\mathcal F} \sum_{(x_i, a_i, r_i)\in\tilde{\mathcal D}_T} \big(f(x_i, a_i) - r_i\big)^2$$

and define the policy $\hat\pi_T$ as $\hat\pi_T(x) = \arg\max_{a\in\mathcal A} \hat f_T(x, a)$. The main result of this section is the following.

^4 The regret bound of the SquareCB algorithm scales as $O\big(\sqrt{|\mathcal A| \log(|\mathcal F|)\, T}\,\log(T/\delta)\big)$.

Theorem 5.1. There exists a universal constant $c > 0$ such that if $T \ge c\,\max^2(B, \bar B)\,|\mathcal A| \log(|\mathcal F|/(\varepsilon\delta))/\varepsilon^2$ then with probability at least $1 - \delta$ the uniform planning algorithm's policy $\hat\pi_T$ is $\varepsilon$-optimal.

The proof of Theorem 5.1 can be found in Appendix D.
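The uniform sampling strategy of this section is simple enough to sketch in a few lines, again assuming a finite function class that can be enumerated; the names uniform_sampling_policies and greedy_from_uniform_data are ours.

```python
# Sketch of the uniform sampling strategy: random static policies, one least squares
# fit on the collected data, and a greedy policy (illustrative, finite function class).
import random
from typing import Callable, List, Sequence, Tuple

Context = int
Action = int


def uniform_sampling_policies(actions: List[Action], T: int) -> List[Callable[[Context], Action]]:
    """The static 'plan': every deployment policy plays a uniformly random action."""
    return [lambda _x: random.choice(actions) for _ in range(T)]


def greedy_from_uniform_data(F, data: Sequence[Tuple[Context, Action, float]],
                             actions: List[Action]) -> Callable[[Context], Action]:
    """Fit hat f_T by least squares on the uniformly collected data and act greedily."""
    f_hat = min(F, key=lambda f: sum((f(x, a) - r) ** 2 for x, a, r in data))
    return lambda x: max(actions, key=lambda a: f_hat(x, a))
```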
Comparison Between OptLS and SquareCB. The regret rate after $T$ steps of the OptLS algorithm applied to a contextual bandit problem with discrete function class $\mathcal F$ scales^5 (up to logarithmic factors) as $O\big(\sqrt{d_{\mathrm{eluder}}(\mathcal F, B/T) \log(|\mathcal F|/\delta)\, T}\big)$. In contrast, the SquareCB algorithm satisfies a regret guarantee (up to logarithmic factors) of order $O\big(\sqrt{|\mathcal A| \log(|\mathcal F|/\delta)\, T}\big)$, where the eluder dimension dependence is substituted by a polynomial dependence on the number of actions $|\mathcal A|$. When the number of actions is small, or even constant, the regret rate of SquareCB can be much smaller than that of OptLS. The opposite is true when the number of actions is large or even infinite. Converting these cumulative regret bounds to simple regret implies that the number of samples required to produce an $\varepsilon$-optimal policy from the adaptive OptLS policy sequence scales as $d_{\mathrm{eluder}}(\mathcal F, B/T) \log(|\mathcal F|/\delta)/\varepsilon^2$, whereas for the adaptive SquareCB policy sequence it scales as $|\mathcal A| \log(|\mathcal F|/\delta)/\varepsilon^2$. The results of Theorems 4.4 and 5.1 recover these rates in the experiment planning setting.

^5 We obviate the $T$ dependence of $d_{\mathrm{eluder}}$ for readability.

6 Gap Between Experiment Planning and Adaptive Learning

The results of Sections 4 and 5 imply planning bounds that are comparable to the corresponding online-to-batch guarantees for OptLS [42] and SquareCB [17]. The main result of this section, Theorem 6.1, shows there are problems where the number of samples required for experiment planning can be substantially larger than the number of samples required by an adaptive learning algorithm. This result also implies the suboptimality of algorithms such as SquareCB and OptLS for adaptive learning. In order to state our results we consider an action set $\mathcal A_{\mathrm{tree}}$ indexed by the nodes of a binary tree of height $L$, defined here as having $L$ levels and $2^L - 1$ nodes.

[Figure 1: a binary tree whose nodes index the actions $a_{l,i}$.]

We call $a_{l,i}$ the $i$-th action of the $l$-th level of the tree (see Figure 1 for an example). Let $\varepsilon > 0$. We define a function class $\mathcal F_{\mathrm{tree}}$ indexed by paths from the root node to a leaf. For any such path $p = \{a_{1,1}, a_{2,i_2}, \dots, a_{L,i_L}\}$ the function $f^{(p)}$ equals

$$f^{(p)}(a) \;=\; \begin{cases} 1 & \text{if } a = a_{L,i_L}, \\ 1 - 2\varepsilon & \text{if } a \in p \setminus \{a_{L,i_L}\}, \\ 1 - 12\varepsilon & \text{otherwise.} \end{cases}$$

The following result fleshes out a separation between planning and adaptive learning with action set $\mathcal A_{\mathrm{tree}}$ and function class $\mathcal F_{\mathrm{tree}}$ in the setting where at time $T$ the learner produces a guess $\hat\pi_T$ for the optimal policy.

Theorem 6.1. Let $\varepsilon > 0$ and $T \in \mathbb N$. Consider the action set $\mathcal A_{\mathrm{tree}}$, the function class $\mathcal F_{\mathrm{tree}}$ and a reward noise process such that $\xi_t \sim N(0, 1)$ conditionally for all $t \in [T]$. For any planning algorithm $\mathrm{Alg}$ there is a function $f_\star \in \mathcal F_{\mathrm{tree}}$ such that when $T \le \frac{2^{L-5}}{9\varepsilon^2}$ and $\mathrm{Alg}$ interacts with $f_\star$ then

$$\mathbb E_{\mathrm{Alg}, f_\star}\big[\mathbb E_{a\sim\hat\pi_T}[f_\star(a)]\big] \;\le\; \max_{a\in\mathcal A_{\mathrm{tree}}} f_\star(a) - \varepsilon.$$

Moreover, there is an adaptive algorithm $\mathrm{Alg}_{\mathrm{adaptive}}$ such that if $T \ge \frac{2L \log(2L/\varepsilon)}{\varepsilon^2}$ then

$$\mathbb E_{\mathrm{Alg}_{\mathrm{adaptive}}}\big[\mathbb E_{a\sim\hat\pi_T}[f(a)]\big] \;\ge\; \max_{a\in\mathcal A_{\mathrm{tree}}} f(a) - \varepsilon$$

for all $f \in \mathcal F_{\mathrm{tree}}$. Here $\mathbb E_{\mathrm{Alg}, \tilde f}[\cdot]$ is the expectation over the randomness of $\mathrm{Alg}$ and the environment for target function $\tilde f$, and $\hat\pi_T$ is the algorithm's final policy guess after the sampling phase.

The main insight behind Theorem 6.1 is that adaptive strategies to find an optimal action in the $\mathcal F_{\mathrm{tree}}$ function class can make use of the conditional structure of the action space by trying to determine a path of actions from the root to the leaf containing the optimal action. An adaptive strategy can determine this path by querying only nodes that are adjacent to it. In contrast, a static experiment planning strategy cannot leverage this structure and has to query all leaf nodes. Theorem 6.1 thus implies a gap between the adaptive and experiment planning problems. Moreover, since the eluder dimension of $\mathcal F_{\mathrm{tree}}$ scales with $2^L$ (see Lemma D.1), OptLS and SquareCB are suboptimal adaptive algorithms for this model class: in contrast with the $O\big(2L\log(2L/\varepsilon)/\varepsilon^2\big)$ upper bound in Theorem 6.1, converting the cumulative regret bounds of OptLS and SquareCB yields guarantees scaling as $O\big(2^L \log(|\mathcal F_{\mathrm{tree}}|)/\varepsilon^2\big)$.

7 Model Selection

In this section we consider a setting where the learner has access to function (model) classes $\mathcal F_1, \dots, \mathcal F_M$, all with domain $\mathcal X \times \mathcal A$. The learner is promised there exists an index $i_\star$ such that $f_\star \in \mathcal F_{i_\star}$; the value of the index $i_\star$ is not known to the learner. We will show the uniform sampling strategy has a sample complexity that scales logarithmically with the number of models and with the complexity of the optimal model class $|\mathcal F_{i_\star}|$. In this setting the Sampler collects two uniform-action datasets $(\tilde{\mathcal D}^{(\mathrm{Train})}_T, \tilde{\mathcal D}^{(\mathrm{Test})}_T)$ of size $T/2$ each. Using the "train" dataset $\tilde{\mathcal D}^{(\mathrm{Train})}_T$ the learner computes candidate reward models $\hat f^{(i)}_T$ for all $i \in [M]$ by solving the least squares problems

$$\hat f^{(i)}_T \;=\; \arg\min_{f\in\mathcal F_i} \sum_{(x_\ell, a_\ell, r_\ell)\in\tilde{\mathcal D}^{(\mathrm{Train})}_T} \big(f(x_\ell, a_\ell) - r_\ell\big)^2.$$

Using the "test" dataset the learner extracts a guess $\hat i$ for the optimal model class index $i_\star$ by solving

$$\hat i \;=\; \arg\min_{i\in[M]} \sum_{(x_\ell, a_\ell, r_\ell)\in\tilde{\mathcal D}^{(\mathrm{Test})}_T} \big(\hat f^{(i)}_T(x_\ell, a_\ell) - r_\ell\big)^2.$$

The candidate policy $\hat\pi_T$ is defined as $\hat\pi_T(x) = \arg\max_{a\in\mathcal A} \hat f^{(\hat i)}_T(x, a)$. Let us start by relating the expected least squares loss of the candidate model $\hat f^{(\hat i)}_T$ to the size of the optimal model class $|\mathcal F_{i_\star}|$.

Proposition 7.1. There exists a universal constant $c > 0$ such that, with probability at least $1 - \delta$,

$$\mathbb E_{x\sim\mathcal P,\, a\sim\mathrm{Uniform}(\mathcal A)}\Big[\big(f_\star(x, a) - \hat f^{(\hat i)}_T(x, a)\big)^2\Big] \;\le\; \frac{c\,\max^2(B, \bar B)\,\log\big(T\max(M, |\mathcal F_{i_\star}|)/\delta\big)}{T}.$$

The proof of Proposition 7.1 can be found in Appendix E.1.
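The train/test model-selection procedure described above can be sketched as follows under the same finite-class assumption; the names fit_least_squares and model_selection_policy are ours, and the brute-force minimizations stand in for least squares oracles.

```python
# Sketch of the uniform sampling model-selection algorithm of Section 7
# (illustrative, finite function classes).
from typing import Callable, List, Sequence, Tuple

Context = int
Action = int
Triple = Tuple[Context, Action, float]


def fit_least_squares(F, data: Sequence[Triple]):
    """Least squares over a finite class: argmin_{f in F} sum (f(x, a) - r)^2."""
    return min(F, key=lambda f: sum((f(x, a) - r) ** 2 for x, a, r in data))


def model_selection_policy(classes: Sequence[Sequence[Callable[[Context, Action], float]]],
                           train: Sequence[Triple], test: Sequence[Triple],
                           actions: List[Action]) -> Callable[[Context], Action]:
    """Fit one candidate hat f^(i) per class on the train split, select the index hat i
    whose candidate has the smallest test-split squared error, and act greedily."""
    candidates = [fit_least_squares(F_i, train) for F_i in classes]
    test_losses = [sum((f(x, a) - r) ** 2 for x, a, r in test) for f in candidates]
    i_hat = min(range(len(classes)), key=lambda i: test_losses[i])
    f_hat = candidates[i_hat]
    return lambda x: max(actions, key=lambda a: f_hat(x, a))
```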
The uniform sampling model-selection algorithm has the following performance guarantee.

Theorem 7.2. There exists a universal constant $c_1 > 0$ such that if $T \ge c_1 \max^2(B, \bar B)\,|\mathcal A| \log\big(\max(M, |\mathcal F_{i_\star}|)/(\varepsilon\delta)\big)/\varepsilon^2$ then with probability at least $1 - \delta$ the candidate policy $\hat\pi_T$ of the uniform sampling model-selection algorithm is $\varepsilon$-optimal.

Proof. Due to the realizability assumption ($r_t = f_\star(x_t, a_t) + \xi_t$), the instantaneous regret of $\hat\pi_T$ on context $x \in \mathcal X$ equals $\max_a f_\star(x, a) - f_\star(x, \hat\pi_T(x))$. Let $\pi_\star(x) = \arg\max_{a\in\mathcal A} f_\star(x, a)$. Just as in the proof of Theorem 5.1 (see Equation 25), we can relate the suboptimality of $\hat\pi_T$ to the expected least squares loss of $\hat f^{(\hat i)}_T$ under the uniform policy:

$$\mathbb E_{x\sim\mathcal P}\Big[\max_{a\in\mathcal A} f_\star(x, a) - f_\star(x, \hat\pi_T(x))\Big] \;\le\; 2\sqrt{|\mathcal A|\,\mathbb E_{x\sim\mathcal P,\, a\sim\mathrm{Uniform}(\mathcal A)}\Big[\big(f_\star(x, a) - \hat f^{(\hat i)}_T(x, a)\big)^2\Big]}. \qquad (8)$$

Finally, Proposition 7.1 implies there is a universal constant $c > 0$ such that

$$\mathbb E_{x\sim\mathcal P,\, a\sim\mathrm{Uniform}(\mathcal A)}\Big[\big(f_\star(x, a) - \hat f^{(\hat i)}_T(x, a)\big)^2\Big] \;\le\; \frac{c\,(B^2 + \bar B^2)\,\log\big(T\max(M, |\mathcal F_{i_\star}|)/\delta\big)}{T}$$

with probability at least $1 - \delta$. Plugging this result back into Equation (8), the suboptimality of $\hat\pi_T$ can be upper bounded as

$$\mathbb E_{x\sim\mathcal P}\Big[\max_{a\in\mathcal A} f_\star(x, a) - f_\star(x, \hat\pi_T(x))\Big] \;\le\; 2\sqrt{\frac{c\,|\mathcal A|\,(B^2 + \bar B^2)\,\log\big(T\max(M, |\mathcal F_{i_\star}|)/\delta\big)}{T}}.$$

Setting $g(T) = 4c\,|\mathcal A|\,(B^2 + \bar B^2)\,\log\big(T\max(M, |\mathcal F_{i_\star}|)/\delta\big)/T$ in Lemma A.6 implies there exists a universal constant $c_1 > 0$ such that $g(T) \le \varepsilon^2$ for all $T \ge c_1 \max^2(B, \bar B)\,|\mathcal A| \log\big(\max(M, |\mathcal F_{i_\star}|)/(\varepsilon\delta)\big)/\varepsilon^2$, which concludes the proof.

The results of Theorem 7.2 are best understood by contrasting them with the uniform sampling algorithm whose input model class equals the union of the function classes, $\mathcal F_{\mathrm{all}} = \cup_{i\in[M]} \mathcal F_i$. Applying the results of Theorem 5.1 to $\mathcal F_{\mathrm{all}}$ yields a sample complexity scaling with $\max_{i\in[M]} \log(|\mathcal F_i|)$, a quantity that could be much larger than $\log(|\mathcal F_{i_\star}|)$. In contrast, the uniform sampling model-selection algorithm achieves a sample complexity scaling with $\log(|\mathcal F_{i_\star}|)$, at a price logarithmic in the number of model classes $M$. This logarithmic dependence on $M$ stands apart from model selection algorithms for cumulative regret scenarios such as Corral [5, 39], ECE [30] and Regret Balancing [38, 12], which instead have a polynomial dependence on $M$. The uniform sampling model-selection algorithm is agnostic to the value of $\varepsilon$: the results of Theorem 7.2 hold for any $\varepsilon$. If $\varepsilon$ is known in advance the learner can instead compute the model class index $\hat i = \min\big\{ i \in [M]\ \mathrm{s.t.}\ T \ge c\,\max^2(B, \bar B)\,|\mathcal A| \log(|\mathcal F_i|/(\varepsilon\delta))/\varepsilon^2 \big\}$ and use the uniform sampling strategy for $\mathcal F_{\hat i}$. For this choice of $\varepsilon$, Theorem 5.1 guarantees bounds similar to those of Theorem 7.2. Unfortunately, in contrast with the uniform sampling model-selection algorithm, this method would be valid only for a single choice of $\varepsilon$.

8 Conclusion

In this work we have introduced the first set of algorithms for the experiment planning problem for contextual bandits with general function approximation. We have developed the EluderPlanner algorithm, which produces a static policy sequence that, after deployment, can be used to recover an $\varepsilon$-optimal policy. We showed it is enough for the number of static policies, and therefore the number of samples collected during the sampling phase, to be as large as the number of samples required by an adaptive procedure based on an online-to-batch conversion of the OptLS algorithm. Similarly, we demonstrated that the uniform sampling strategy enjoys the same online-to-batch conversion sample complexity as the SquareCB algorithm. These results seemingly suggest that simple regret rates for adaptive learning may also be achieved in experiment planning scenarios. We show this is not the case.
There exist structured bandit problems for which adaptive learning may require a number of samples that is substantially smaller than the number of samples required by a static policy sequence. This is significant because it implies the suboptimality of the rates achieved by existing adaptive learning algorithms such as OptLS and SquareCB, and also because it draws a clear distinction between adaptive learning and experiment planning. It implies these algorithms are either suboptimal or their upper bound analysis is not tight; we believe the former to be the case. This is an important open question we hope to see addressed in future research. We have also introduced the first model selection results for the experiment planning problem. Many important questions remain regarding this setting. Chief among them is to characterize the true statistical complexity of experimental design for contextual bandits with general function approximation. Our results indicate the eluder dimension is not the sharpest statistical complexity measure to characterize learning here. Developing a new notion of complexity, as well as an accompanying algorithm that achieves the true statistical lower bound for the problem of experiment planning, remains an exciting and important open question to tackle in future research.

Acknowledgments and Disclosure of Funding

We thank Akshay Krishnamurthy for helpful discussions. Aldo Pacchiano would like to thank the support of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard. This work was supported in part by funding from the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard. This work was supported in part by NSF grant #2112926.

References

[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24:2312–2320, 2011.
[2] A. Agarwal, M. Dudík, S. Kale, J. Langford, and R. Schapire. Contextual bandit learning with predictable rewards. In Artificial Intelligence and Statistics, pages 19–26. PMLR, 2012.
[3] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646. PMLR, 2014.
[4] A. Agarwal, S. Bird, M. Cozowicz, L. Hoang, J. Langford, S. Lee, J. Li, D. Melamed, G. Oshri, O. Ribas, et al. Making contextual decisions with low technical debt. arXiv preprint arXiv:1606.03966, 2016.
[5] A. Agarwal, H. Luo, B. Neyshabur, and R. E. Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, pages 12–38. PMLR, 2017.
[6] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135. PMLR, 2013.
[7] A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26. JMLR Workshop and Conference Proceedings, 2011.
[8] C. J. Bryan, E. Tipton, and D. S. Yeager. Behavioural science is unlikely to change the world without a heterogeneity revolution. Nature Human Behaviour, 5(8):980–989, 2021.
[9] J. Chan, A. Pacchiano, N. Tripuraneni, Y. S. Song, P. Bartlett, and M. I. Jordan. Parallelizing contextual bandits. arXiv preprint arXiv:2105.10590, 2021.
[10] J. Chen, A. Modi, A. Krishnamurthy, N. Jiang, and A. Agarwal.
On the statistical efficiency of reward-free exploration in non-linear RL. Advances in Neural Information Processing Systems, 35:20960–20973, 2022.
[11] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
[12] A. Cutkosky, C. Dann, A. Das, C. Gentile, A. Pacchiano, and M. Purohit. Dynamic balancing for model selection in bandits and RL. In International Conference on Machine Learning, pages 2276–2285. PMLR, 2021.
[13] R. Degenne, P. Ménard, X. Shang, and M. Valko. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pages 2432–2442. PMLR, 2020.
[14] S. S. Du, S. M. Kakade, J. D. Lee, S. Lovett, G. Mahajan, W. Sun, and R. Wang. Bilinear classes: A structural framework for provable generalization in RL. arXiv preprint arXiv:2103.10897, 2021.
[15] H. Esfandiari, A. Karbasi, A. Mehrabian, and V. Mirrokni. Regret bounds for batched bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7340–7348, 2021.
[16] T. Fiez, L. Jain, K. G. Jamieson, and L. Ratliff. Sequential experimental design for transductive linear bandits. Advances in Neural Information Processing Systems, 32, 2019.
[17] D. Foster and A. Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pages 3199–3210. PMLR, 2020.
[18] D. Foster, A. Agarwal, M. Dudik, H. Luo, and R. Schapire. Practical contextual bandits with regression oracles. In International Conference on Machine Learning, pages 1539–1548. PMLR, 2018.
[19] D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100–118, 1975.
[20] Y. Jedra and A. Proutiere. Optimal best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 33:10007–10017, 2020.
[21] C. Jin, A. Krishnamurthy, M. Simchowitz, and T. Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020.
[22] C. Jin, Q. Liu, and S. Miryoosefi. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. Advances in Neural Information Processing Systems, 34:13406–13418, 2021.
[23] K.-S. Jun and C. Zhang. Crush optimism with pessimism: Structured bandits beyond asymptotic optimality. Advances in Neural Information Processing Systems, 33:6366–6376, 2020.
[24] J. Kiefer and J. Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960.
[25] J. Kober, J. A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
[26] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in Neural Information Processing Systems, 20(1):96–1, 2007.
[27] T. Lattimore and R. Munos. Bounded regret for finite-armed structured bandits. Advances in Neural Information Processing Systems, 27, 2014.
[28] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[29] T. Lattimore, C. Szepesvari, and G. Weisz. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, pages 5662–5670. PMLR, 2020.
[30] J. Lee, A. Pacchiano, V. Muthukumar, W. Kong, and E. Brunskill. Online model selection for reinforcement learning with function approximation. In International Conference on Artificial Intelligence and Statistics, pages 3340–3348. PMLR, 2021.
[31] G. Li, P. Kamath, D. J. Foster, and N. Srebro. Understanding the eluder dimension. Advances in Neural Information Processing Systems, 35:23737–23750, 2022.
[32] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
[33] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.
[34] Z. Li, L. Ratliff, K. G. Jamieson, L. Jain, et al. Instance-optimal PAC algorithms for contextual bandits. Advances in Neural Information Processing Systems, 35:37590–37603, 2022.
[35] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[36] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[37] I. Osband and B. Van Roy. Model-based reinforcement learning and the eluder dimension. Advances in Neural Information Processing Systems, 27, 2014.
[38] A. Pacchiano, C. Dann, C. Gentile, and P. Bartlett. Regret bound balancing and elimination for model selection in bandits and RL. arXiv preprint arXiv:2012.13045, 2020.
[39] A. Pacchiano, M. Phan, Y. Abbasi Yadkori, A. Rao, J. Zimmert, T. Lattimore, and C. Szepesvari. Model selection in contextual stochastic bandit problems. Advances in Neural Information Processing Systems, 33:10328–10337, 2020.
[40] A. Pacchiano, D. Wulsin, R. A. Barton, and L. Voloch. Neural design for genetic perturbation experiments. arXiv preprint arXiv:2207.12805, 2022.
[41] Y. Ruan, J. Yang, and Y. Zhou. Linear bandits with limited adaptivity and learning distributional optimal design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 74–87, 2021.
[42] D. Russo and B. Van Roy. Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems, 26, 2013.
[43] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[44] D. Simchi-Levi and Y. Xu. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Mathematics of Operations Research, 2021.
[45] M. Soare, A. Lazaric, and R. Munos. Best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 27, 2014.
[46] R. S. Sutton. Introduction: The challenge of reinforcement learning. In Reinforcement Learning, pages 1–3. Springer, 1992.
[47] C. Tao, S. Blanco, and Y. Zhou. Best arm identification in linear bandits with linear dimension dependency. In International Conference on Machine Learning, pages 4877–4886. PMLR, 2018.
[48] A. Tewari and S. A. Murphy. From ads to interventions: Contextual bandits in mobile health. Mobile Health: Sensors, Analytic Methods, and Applications, pages 495–517, 2017.
[49] R. Wan, B. Kveton, and R. Song. Safe exploration for efficient policy evaluation and comparison. In International Conference on Machine Learning, pages 22491–22511. PMLR, 2022.
[50] R. Wang, S. S. Du, L. Yang, and R. R. Salakhutdinov. On reward-free reinforcement learning with linear function approximation. Advances in Neural Information Processing Systems, 33:17816–17826, 2020.
[51] L. Xu, J. Honda, and M. Sugiyama. A fully adaptive algorithm for pure exploration in linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 843–851. PMLR, 2018.
[52] A. Zanette. Exponential lower bounds for batch reinforcement learning: Batch RL can be exponentially harder than online RL. In International Conference on Machine Learning, pages 12287–12297. PMLR, 2021.
[53] A. Zanette, K. Dong, J. N. Lee, and E. Brunskill. Design of experiments for stochastic contextual linear bandits. Advances in Neural Information Processing Systems, 34:22720–22731, 2021.
[54] R. Zhu and B. Kveton. Safe optimal design with applications in off-policy learning. In International Conference on Artificial Intelligence and Statistics, pages 2436–2447. PMLR, 2022.