# neural_design_for_genetic_perturbation_experiments__efd39d5e.pdf Published as a conference paper at ICLR 2023 NEURAL DESIGN FOR GENETIC PERTURBATION EXPERIMENTS Aldo Pacchiano Microsoft Research NYC apacchiano@microsoft.com Drausin Wulsin & Robert A. Barton & Luis Voloch Immunai {drausin,robert.barton,luis}@immunai.com The problem of how to genetically modify cells in order to maximize a certain cellular phenotype has taken center stage in drug development over the last few years (with, for example, genetically edited CAR-T, CAR-NK, and CAR-NKT cells entering cancer clinical trials). Exhausting the search space for all possible genetic edits (perturbations) or combinations thereof is infeasible due to cost and experimental limitations. This work provides a theoretically sound framework for iteratively exploring the space of perturbations in pooled batches in order to maximize a target phenotype under an experimental budget. Inspired by this application domain, we study the problem of batch query bandit optimization and introduce the Optimistic Arm Elimination (OAE) principle designed to find an almost optimal arm under different functional relationships between the queries (arms) and the outputs (rewards). We analyze the convergence properties of OAE by relating it to the Eluder dimension of the algorithm s function class and validate that OAE outperforms other strategies in finding optimal actions in experiments on simulated problems, public datasets well-studied in bandit contexts, and in genetic perturbation datasets when the regression model is a deep neural network. OAE also outperforms the benchmark algorithms in 3 of 4 datasets in the Gene Disco experimental planning challenge. 1 INTRODUCTION We are inspired by the problem of finding the genetic perturbations that maximize a given function of a cell (a particular biological pathway or mechanism, for example the proliferation or exhaustion of particular immune cells) while performing the least number of perturbations required. In particular, we are interested in prioritizing the set of genetic knockouts (via sh RNA or CRISPR) to perform on cells that would optimize a particular scalar cellular phenotype. Since the space of possible perturbations is very large (with roughly 20K human protein-coding genes) and each knockout is expensive, we would like to order the perturbations strategically so that we find one that optimizes the particular phenotype of interest in fewer total perturbations than, say, just brute-force applying all possible knockouts. In this work we consider only single-gene knockout perturbations since they are the most common, but multi-gene perturbations are also possible (though considerably more technically complex to perform at scale). While a multi-gene perturbation may be trivially represented as a distinct (combined) perturbation in our framework, we leave for future work the more interesting extension of embedding, predicting, and planning these multi-gene perturbations using previously observed single-gene perturbations. With this objective in mind we propose a simple method for improving a cellular phenotype under a limited budget of genetic perturbation experiments. Although this work is inspired by this concrete biological problem, our results and algorithms are applicable in much more generality to the setting of experimental design with neural network models. We develop and evaluate a family of algorithms for the zero noise batch query bandit problem based on the Optimistic Arm Elimination principle (OAE). We focus on developing tractable versions of these algorithms compatible with neural network function approximation. Published as a conference paper at ICLR 2023 During each time-step OAE fits a reward model on the observed responses seen so far while at the same time maximizing the reward on all the arms yet to be pulled. The algorithm then queries the batch of arms whose predicted reward is maximal among the arms that have not been tried out. We conduct a series of experiments on synthetic and public data from the UCI Dua & Graff (2017) database and show that OAE is able to find the optimal arm" using fewer batch queries than other algorithms such as greedy and random sampling. Our experimental evaluation covers both neurally realizable and not neurally realizable function landscapes. The performance of OAE against benchmarks is comparable in both settings, demonstrating that although our presentation of the OAE algorithm assumes realizability for the sake of clarity, it is an assumption that is not required in practice. In the setting where the function class is realizable i.e. the function class F used by OAE contains the function generating the rewards, and the evaluation is noiseless we show two query lower bounds for the class of linear and 1 Lipshitz functions. We validate OAE on the public CMAP dataset Subramanian et al. (2017), which contains tens of thousands of genetic sh RNA knockout perturbations, and show that it always outperforms a baseline and almost always outperforms a simpler greedy algorithm in both convergence speed to an optimal perturbation and the associated phenotypic rewards. These results illustrate how perturbational embeddings learned from one biological context can still be quite useful in a different biological context, even when the reward functions of these two contexts are different. Finally we also benchmark our methods in the Gene Disco dataset and algorithm suite (see Mehrjou et al. (2021)) and show OAE to be competitive against benchmark algorithms in the task of maximizing Hit Ratios. 2 RELATED WORK Bayesian Optimization The field of Bayesian optimization has long studied the problem of optimizing functions severely limited by time or cost Jones et al. (1998). For example, Srinivas et al. (2009) introduce the GP-UCB algorithm for optimizing unknown functions. Other approaches based on adaptive basis function regression have also been used to model the payoff function as in Snoek et al. (2015). These algorithms have been used in the drug discovery context. Mueller et al. (2017) applied Bayesian optimization to the problem of optimizing biological phenotypes. Very recently, Gene Disco was released as a benchmark suite for evaluating active learning algorithms for experiment design in drug discovery Mehrjou et al. (2021). Perhaps the most relevant to our setting are the many works that study the batch acquisition setting in Bayesian active learning and optimization such as Kirsch et al. (2019); Kathuria et al. (2016) and the GP BUCB algorithm of Desautels et al. (2014). In this work we move beyond the typical parametric and Bayesian assumptions from these works towards algorithms that work in conjunction with neural network models. We provide guarantees for the no noise setting we study based on the Eluder dimension Russo & Van Roy (2013). Parallel Bandits Despite its wide applicability in many scientific applications, batch learning has been studied relatively seldom in the bandit literature. Despite this, recent work (Chan et al., 2021) show that in the setting of contextual linear bandits (Abbasi-Yadkori et al., 2011), the finite sample complexity of parallel learning matches that of sequential learning irrespective of the batch size provided the number of batches is large enough. Unfortunately, this is rarely the regime that matters in many practical applications such as drug development where the size of the experiment batch may be large but each experiment may be very time consuming, thus limiting their number. In this work we specifically address this setting in our experimental evaluation in Section E. Structure Learning Prior work in experiment design tries to identify causal structures with a fixed budget of experiments Ghassami et al. (2018). Scherrer et al Scherrer et al. (2021) proposes a mechanism to select intervention targets to enable more efficient causal structure learning. Sussex et al. (2021) extend the amount of information contained in each experiment by simultaneously intervening on multiple variables. Causal matching, where an experimenter can perform a set of interventions aimed to transform the system to a desired state, is studied in Zhang et al. (2021). Neural Bandits Methods such as Neural UCB and Shallow Neural UCB Zhou et al. (2020); Xu et al. (2020) are designed to add an optimistic bonus to model predictions of a nature that can be analytically computed as is extremely reminiscent of the one used in linear bandits (Auer, 2002; Dani et al., 2008), thus their theoretical validity depends on the linearizing conditions to hold. More Published as a conference paper at ICLR 2023 recently (Pacchiano et al., 2021b) have proposed the use of pseudo-label optimisim for the Bank Loan problem where they propose an algorithm that adds optimism to neural network predictions through the addition of fake data and is only analyzed in the classification setting. Our algorithms instead add optimism to their predictions. The later is achieved via two methods, either by explicitly encouraging it to fit a model whose predictions are large in unseen data, or by computing uncertainties. Active Learning Active learning is relatively well studied problem Settles (2009); Dasgupta (2011); Hanneke et al. (2014) particularly in the context of supervised learning. See for example Balcan et al. (2009), Dasgupta et al. (2007), Settles (2009), Hanneke et al. (2014). There is a vast amount of research on active learning for classification (see for example Agarwal (2013), Dekel et al. (2010) and Cesa-Bianchi et al. (2009) where the objective is to learn a linearly parameterized response model P(y|x). Broadly speaking there are two main sample construction approaches, diversity Sener & Savarese (2017); Geifman & El-Yaniv (2017); Gissin & Shalev-Shwartz (2019) and uncertainty sampling Tong & Koller (2001); Schohn & Cohn (2000); Balcan et al. (2009); Settles et al. (2007), successful in the large Guo & Schuurmans (2007); Wang & Ye (2015); Chen & Krause (2013); Wei et al. (2015); Kirsch et al. (2019) and small batch sizes regimes respectively. Diversity sampling methods produce spread out samples to better cover the space while uncertainty-based methods estimate model uncertainty to select what points to label. Hybrid approaches are common as well. A common objective in the active learning literature is to collect enough samples to produce a model that minimizes the population loss over the data distribution. This is in contrast with the objective we study in this work, which is to find a point in the dataset with a large response. There is a rich literature dedicated to the development of active learning algorithms for deep learning applications both in the batch and single sample settings Settles et al. (2007); Ducoffe & Precioso (2018); Beluch et al. (2018); Ash et al. (2021). 3 PROBLEM DEFINITION Let y : A R be a response function over A Rd. We assume access to a function class F Fun(A, R) where Fun(A, R) denotes the set of functions from A to R. Following the typical online learning terminology we call A the set of arms. In this work we allow A to be infinite, although we only consider finite A in practice. In our setting the experiment designer (henceforth called the learner) interacts with y and A in a sequential manner. During the t th round of this interaction, aided by F and historical query and response information the learner is required to query a batch of b N arms {at,i}b i=1 A and observe noiseless responses {yt,i = y (at,i)}b i=1 after which these response values are added to the historical dataset Dt+1 = Dt {(at,i, yt,i)}b i=1. In this work we do not assume that y F. Instead we allow the learner access to a function class F to aid her in producing informative queries. This is a common situation in the setting of neural experiment design, where we may want to use a DNN model to fit the historical responses and generate new query points without prior knowledge of whether it accurately captures y . Our objective is to develop a procedure that can recover an almost optimal arm a A in the least number of arm pulls possible. We consider the following objective, τ quantile optimality. Find an arm aτ A belonging to the top τ quantile1 of {y (a)}a A. Although ϵ optimality (find an arm aϵ A such that y (aϵ) + ϵ maxa A y (a) for ϵ 0) is the most common criterion considered in the optimization literature, for it to be meaningful it requires knowledge of the scale of maxa A y (a). In some scenarios this may be hard to know in advance. Thus in our experiments we focus on the setting of τ quantile optimality as a more relevant practical performance measure. This type of objective has been considered by many works in the bandit literature (see for example Szorenyi et al. (2015); Zhang & Ong (2021)). Moreover, it is a measure of optimality better related to practical objectives used in experiment design evaluation, such as hit ratio in the Gene Disco benchmark library Mehrjou et al. (2021). We show in Section E that our algorithms are successful at producing almost optimal arms under this criterion after a small number of queries. The main challenge we are required to overcome in this problem is designing a smart choice of batch 1In the case of an infinite set A quantile optimality is defined with respect to a measure over A. Published as a conference paper at ICLR 2023 queries {at,i}B i=1 that balances the competing objectives of exploring new regions of the arm space and zooming into others that have shown promising rewards. In this work we focus on the case where the observed response values y (a) of any arm a A are noiseless. In the setting of neural perturbation experiments the responses are the average of many expression values across a population of cells, and thus it is safe to assume the observed response is almost noiseless. In contrast with the noisy setting, when the response is noiseless, querying the same arm twice is never necessary. We leave the question on how to design algorithms for noisy responses in the function approximation regime for future work. although note it can be reduced to our setting if we set the exploitation round per data point sufficiently large. Evaluation. After the queries t ℓ=1{aℓ,i}b i=1 the learner will output a candidate approximate optimal arm ˆat among all the arms whose labels she has queried (all arms in Dt+1) by considering (ˆat, ˆyt) = arg max(a,y) Dt+1 y, the point with the maximal observed reward so far. Given a quantile value τ we measure the performance of our algorithms by considering the first timestep tτ first where a τ quantile optimal point ˆat was proposed. 4 OPTIMISTIC ARM ELIMINATION With these objectives in mind, we introduce a family of algorithms based on the Optimistic Arm Elimination Algorithm (OAE) principle. We call Ut to the subset of arms yet to be queried by our algorithm. At time t any OAE algorithm produces a batch of query points of size b from2 Ut. Our algorithms start round t by fitting an appropriate response predictor eft : Ut R based on the historical query points Dt and their observed responses so far. Instead of only fitting the historical responses with a square loss and produce a prediction function eft, we encourage the predictions of eft to be optimistic on the yet-to-be-queried points of Ut. We propose two tractable ways of achieving this. First by fitting a model (or an ensemble of models) ef o t to the data in Dt and explicitly computing a measure of uncertainty ut : Ut R of its predictions on Ut. We define the optimistic response predictor eft(a) = ef o t (a) + ut(a). Second, we achieve this by defining eft to be the approximate solution of a constrained objective, eft arg max f F A(f, Ut) s.t. X (a,y) Dt (f(a) y)2 γt. (1) where γt a possibly time-dependent parameter satisfying γt 0 and A(f, U) is an acquisition objective tailored to produce an informative arm (or batch of arms) from Ut. We consider a couple of acquisition objectives Aavg(f, U) = 1 |U| P a U f(a), Ahinge(f, U) = 1 |U| P a U (max(0, f(a)))p for some p > 0 and Asoftmax(f, U) = log P a U exp(f(a)) and Asum(f, U) = P a U f(a). An important acquisition functions of theoretical interest, although hard to optimize in practice are Amax(f, U) = maxa U f(a) and its batch version Amax,b(f, U) = max B U,|B|=b P a B f(a). Regardless of whether eft was computed via Equation 1 or it is an uncertainty aware objective of the form eft(a) = ef o t (a) + eut(a), our algorithm then produces a query batch Bt by solving Bt arg max B Ut,|B|=b Aavg( eft, B). (2) The principle of Optimism in the Face of Uncertainty (OFU) allows OAE algorithms to efficiently explore new regions of the space by acting greedily with respect to a model that fits the rewards of the arms in Dt as accurately as possible but induces large responses from the arms she has not tried. If y F, and eft is computed by solving Equation 1, it can be shown the optimistic model overestimates the true response values i.e. P a Bt eft(a) P a B ,t y (a) where B ,t = arg max B Ut,|B|=b P a B y (a). Consult Appendix D.2 for a proof and an explanation of the relevance of this observation. Acting greedily based on an optimistic model means the learner tries out the arms that may achieve the highest reward according to the current model plausibility set. After pulling these arms, the learner can successfully update the model plausibility set and repeat this procedure. 2The batch equals Ut when |Ut| b. Published as a conference paper at ICLR 2023 Algorithm 1 Optimistic Arm Elimination Principle (OAE) Input Action set A Rd, num batches N, batch size b Initialize Unpulled arms U1 = A. Observed points and labels dataset D1 = for t = 1, , N do If t = 1 then: Sample uniformly a size b batch Bt U1. Else: Solve for eft and compute Bt arg max B Ut||B|=b A( eft, B). Observe batch rewards Yt = {y (a) for a Bt} Update Dt+1 = Dt {(Bt, Yt)} and Ut+1 = Ut\Bt . 4.1 TRACTABLE IMPLEMENTATIONS OF OAE In this section we go over the algorithmic details behind the approximations that we have used when implementing the different OAE methods we have introduced in Section 4. 4.1.1 OPTIMISTIC REGULARIZATION In order to produce a tractable implementation of the constrained problem 1 we approximate it with the optimism regularized objective, eft arg min f F (a,y) Dt (f(a) y)2 λreg A(f, Ut) | {z } Optimism Regularizer And define Bt following Equation 2. Problem 3 is compatible with DNN function approximation. In our experiments we set the acquisition function to Aavg, Ahinge with p = 4 and Asoftmax. The resulting methods are Mean Opt, Hinge PNorm and Max Opt. Throughout this work Random Batch corresponds to uniform arm selection and Greedy to setting the optimism regularizer to 0. 4.1.2 ENSEMBLE METHODS We consider two distinct methods (Ensemble and Ensemble Noise Y) to produce uncertainty estimations based on ensemble predictions. In both we fit M models { bf i t}M i=1 to Dt and define ef o t (a) = maxi=1, ,M bf i t(a) + mini=1, ,M bf i t(a) 2 , ut(a) = max i=1, ,M bf i t(a) f o t (a). So that eft = ef o t + ut. We explore two distinct methods to produce M different models { bf i t}M i=1 fit to Dt that differ in the origin of the model noise. Ensemble produces M models resulting of independent random initialization of their model parameters. The Ensemble Noise Y method injects label noise into the dataset responses. For all i {1, , M} we build a dataset Di t = {(a, y + ξ) for (a, y) Dt, ξ N(0, 1)} where ξ is an i.i.d. zero mean Gaussian random sample. The functions { bf i t}M i=1 are defined as, bf i t arg min f F (a,y) Di t (f(a) y)2. In this case the uncertainty of the ensemble predictions is the result of both the random parameter initialization of the { bf i t}M i=1 and the label noise . This noise injection procedure draws its inspiration from methods such as RLSVI and NARL Russo (2019) and Pacchiano et al. (2021a). 4.2 DIVERSITY SEEKING VERSIONS OF OAE In the case b > 1, the explore / exploit trade-off is not the sole consideration in selecting the arms that make up Bt. In this case, we should also be concerned about selecting sufficiently diverse points within the batch Bt to maximize the information gathering ability of the batch. In Section 4.2 we show how to extend Algorithm 1 (henceforth referred to as vanilla OAE) to effectively induce Published as a conference paper at ICLR 2023 query diversity. We introduce two versions OAE Dv D and OAE Seq which we discuss in more detail below. A detailed description of tractable implementations of these algorithms can be found in Appendix C. 4.2.1 DIVERSITY VIA DETERMINANTS Inspired by diversity-seeking methods in the Determinantal Point Processes (DPPs) literature Kulesza & Taskar (2012), we introduce the OAE Dv D algorithm. Inspired by the Dv D algorithm Parker Holder et al. (2020) we propose to augment the vanilla OAE objective with a diversity regularizer. Bt arg max B Ut,|B|=b Asum( eft, B) + Div( eft, B). (4) OAE Dv D s Div regularizer is inspired by the theory of Determinantal Point Processes and equals a weighted log-determinant objective. OAE Dv D has access to a kernel function K : Rd Rd R and at the start of every time step t N it builds a kernel matrix Kt R|Ut| |Ut|, Kt[i, j] = K(ai, aj), i, j Ut. For any subset B Ut we define the OAE Dv D diversity-aware score as, Div(f, B) = λdiv log (Det(Kt[B, B])) (5) Where Kt[B, B] corresponds to the submatrix of Kt with columns (and rows) indexed by B and λdiv is a diversity regularizer. Since the resulting optimization problem in Equation 6 may prove to be extremely hard to solve, we design a greedy maximization algorithm to produce a surrogate solution. The details can be found in Appendix B.1. OAE Dv D induces diversity leveraging the geometry of the action space. 4.2.2 SEQUENTIAL BATCH SELECTION RULES In this section we introduce OAE Seq a generalization of the OAE algorithm designed to produce in batch diversity. OAE Seq produces a query batch by solving a sequence of b optimization problems. The first element at,1 of batch Bt is chosen as the arm in Ut achieving the most optimistic prediction over plausible models (following objective 1 and any of the tractable implementations defined in Section C, either optimistic regularization or ensemble methods). To produce the second point at,2 in the batch (provided b > 1) we temporarily add the pair (at,1, yt,1) to the data buffer Dt, where yt,1 is a virtual reward estimator for at,1. Using this fake labels augmented data-set we select at,2 following the same optimistic selection method used for at,1. Although other choices are possible in practice we set yt,1 as a mid-point between an optimistic and pessimistic prediction of the value of y (at,1). The name OAE Seq derives from the sequential way in which the batch is produced. If OAE Seq selected arm a to be in the batch, and this arm has a non-optimistic virtual reward y(a) that is low relative to the optimistic values of other arms, then OAE Seq will not select too many arms close to a in the same batch. OAE Seq induces diversity not through the geometry of the arm space but in a way that is intimately related to the plausible arm values in the function class. A similar technique of adding hallucinated values to induce diversity has been proposed before, for example in the GP BUCB algorithm of Desautels et al. (2014). Ours is the first time this general idea has been tested in conjunction with scalable neural network function approximation algorithms. A detailed discussion of this algorithm can be found in Section B.1.1. 4.3 THE STATISTICAL COMPLEXITY OF ZERO NOISE BATCH LEARNING In this section we present our main theoretical results regarding OAE with function approximation. In our results we use the Eluder dimension Russo & Van Roy (2013) to characterize the complexity of the function class F. This is appropriate because our algorithms make use of the optimism principle to produce their queries Bt. We show two novel results. First, we characterize the sample complexity of zero noise parallel optimistic learning with Eluder classes with dimension d. Perhaps surprisingly the regret of Vanilla OAE with batch size b has the same regret profile the case b = 1 up to a constant burn in factor of order bd. Second, our results holds under model misspecification, that is when y F at the cost of a linear dependence in the misspecification error. Although our results are for the noiseless setting (the subject of this work), we have laid the most important part of the groundwork Published as a conference paper at ICLR 2023 to extend them to the case of noisy evaluation. We explain why in Appendix D.2. We relegate the formal definition of the Eluder to Appendix D.2. In this section we measure the misspecification of y via the norm. We assume y satisfies minf F f y ω where f y = maxa A |f(a) y (a)|. Let ey = arg minf F y f be the projection of y onto F. We analyze OAE where eft is computed by solving 1 with γt (t 1)bω2 with acquisition objective Amax,b. We will measure the performance of OAE via its regret defined as, Reg F,b(T) = a B ,t y (a) X Where B ,t = max B Ut,|B|=b P a B y (a). The main result in this section is, Theorem 4.1. The regret of OAE with Amax,b acquisition function satisfies, Reg F,b(T) = O dim E(F, αT )b + ω p dim E(F, αT )Tb . With αt = max 1 t2 , inf{ f1 f2 : f1, f2 F, f1 = f2} and C = maxf F,a,a A |f(a) f(a )|. The proof of theorem 4.1 can be found in Appendix D.2. This result implies the regret is bounded by a quantity that grows linearly with ω, the amount of misspecification but otherwise only with the scale of dim E(F, αT )b. The misspecification part of the regret scales as the same rate as a sequential algorithm running Tb batches of size 1. When ω = 0, the regret is upper bounded by O (dim E(F, αT )d). For example, in the case of linear models, the authors of Russo & Van Roy (2013) show dim E(F, ϵ) = O (d log(1/ϵ)). This shows that for example sequential OAE when b = 1 achieves the lower bound of Lemma D.1 up to logarithmic factors. In the setting of linear models, the b dependence in the rate above is unimprovable by vanilla OAE without diversity aware sample selection. This is because an optimistic algorithm may choose to use all samples in each batch to explore a single unexplored one dimensional direction. Theoretical analysis for OAE Dv D and OAE Seq is left for future work. In Appendix D.1 we also show lower bounds for the query complexity for linear and Lipshitz classes. 5 TRANSFER LEARNING ACROSS GENETIC PERTURBATION DATASETS Figure 1: Regression mean squared error loss for models that predict cell-line specific phenotypic rewards from VCAP-derived perturbational features. In order to show the effectiveness of OAE in the large batch - small number of iterations regime we consider genetic perturbations from the CMAP dataset Subramanian et al. (2017), which contains a 978-gene bulk expression readout from thousands of single-gene sh RNA knockout perturbations3 across a number of cell lines. We consider the setting in which we have observed the effect of knockouts in one biological context (i.e., cell line) and would like to use it to plan a series of knockout experiments in another. Related applications may have different biological contexts, from different cell types or experimental conditions. We use the level 5 CMAP observations, each of which contains of 978-gene transcriptional readout from an sh RNA knockout of a particular gene in a particular cell line. In our experiments, we choose to optimize a cellular proliferation phenotype, defined as a function on the 978-gene expression space. See Appendix F for details. We use the 4 cells lines with the most number genetic perturbations in common: VCAP (prostate cancer, n = 14602 ), HA1E (kidney epithelium, n = 10487), MCF7 (breast cancer, n = 6638), and A375 (melanoma, n = 10033). We first learn a 100-dimensional action (perturbation) embedding ai for each perturbation in VCAP with an autoencoder. The autoencoder has a 100-dimension 3The sh RNA perturbations are just a subset of the 1M+ total perturbations across different perturbation classes. Published as a conference paper at ICLR 2023 Figure 3: Linear. τ-quantile batch convergence of Mean Opt on genetic perturbation datasets. bottleneck layer and two intermediate layers of 1500 and 300 Re LU units with dropout and batch normalization and is trained using the Adam optimizer on mean squared reconstruction loss. We use these 100-dimensional perturbations embeddings as the features to train the eft functions for each of the other cell types. According to our OAE algorithm, we train a fresh feed-forward neural network with two intermediate layers (of 100 and 10 units) for after observing the phenotypic rewards for each batch of 50 gene (knockout) perturbations. Figure 6 summarizes this approach. Figure 2: NN 100-10 τ-quantile batch convergence on genetic perturbation datasets of Mean Opt (left) and Ensemble, Ensemble Noise Y (right). Figure 1 shows the mean squared error loss of models trained to predict the cell-line specific phenotypic reward from the 100-dimensional VCAPderived perturbational features. These models are trained on successive batches of perturbations sampled via Random Batch and using the same NN 1500-300 hidden layer architecture of the decoder. Not surprisingly, the loss for the VCAP reward is one of the lowest, but that of two other cell lines (HA1E and MCF7) are also quite similar, showing the NN 1500300 neural net function class is flexible to learn the reward function in one context from the perturbational embedding in another. In all of our experiments we consider either a linear or a neural network function class with Re LU activations. In all of them we consider a batch size b = 50 and a number of batches N = 20. Figure 2 shows the convergence and reward results for the 4 cell lines when the neural network architecture equals NN 100-10. Since the perturbation action features were learned on the VCAP dataset (though agnostic to any phenotypic reward), the optimal VCAP perturbations are found quite quickly by all versions of Mean Opt including Greedy. Interestingly, Mean Opt still outperforms Random Batch in the HA1E and MCF7 cell lines but not on A375. When the neural network architecture equals NN 10-5, Mean Opt is only competitive with Random Batch on the VCAP and MCF7 datasets (see figure 37 in Appendix G.4). Moreover, when F is a class of linear functions, Mean Opt can beat Random Batch only in VCAP. This can be explained by looking at the regression fit plots of figure 4. The baseline loss value is the highest for A375 even for NN 100-10, thus indicating this function class is too far from the true responses values for A375. The loss curves for NN 10-5 lie well above those for NN 100-10 for all datasets thus explaining the degradation in performance when switching from a NN 100-10 to a smaller capacity Published as a conference paper at ICLR 2023 of NN 10-5. Finally, the linear fit achieves a very small loss for VCAP explaining why Mean Opt still outperforms Random Batch in VCAP with linear models. In all other datasets the linear fit is subpar to the NN 100-10, explaining why Mean Opt in NN 100-10 works better than in linear ones. We note that Ensemble Optimism is competitive with Random Batch in both NN 10-5 and NN 100-10 architectures in 7 our of the 8 experiments we conducted. In Appendix G.4 the reader can find results for Max Opt and Hinge Opt. Both methods underperform in comparison with Mean Opt. In Appendix E.2.1 and E.2.2 the reader will find experiments using tractable versions of OAE Dv D and OAE Seq. In Appendix E and G, we present extensive additional experiments and discussion of our findings over different network architectures (including linear), and over a variety of synthetic and public datasets from the UCI database Dua & Graff (2017). Figure 4: Training set regression fit curves evolution over training for the suite of gene perturbation datasets. Each training batch contains 10 datapoints. 5.0.1 GENEDISCO EXPERIMENTAL PLANNING BENCHMARK We test a Bayesian OAE algorithm against the Gene Disco benchmark Mehrjou et al. (2021), which assess the "Hit Rate" of different experimental planning algorithms over a number of pooled CRISPR experiments. We assess our performance against the other acquisition functions provided in the public implementation4 that select batches based solely on uncertainty considerations. We use the public implementation of Gene Disco and do not change the neural network architecture provided corresponding to a Bayesian Neural Network with a hidden layer of size 32. We use the 808 dimensional Achilles treatment descriptors and built eft by adding the BNN s uncertainty to the model base predictions. We test our algorithm in the Schmidt et al. 2021 (IFNg), Schmidt et al. 2021 (IL2), Zhuang et al. 2019 (NK), and Zhu et al. 2021 (Sars Cov2) datasets and tested for performance using the Hit Ratio metric after collecting 50 batches of size 16. This is defined as the ratio of arm pulls lying in the top .05 quantile of genes with the largest absolute value. Our results are in Table 5. OAE outperforms the other algorithms by a substantial margin in 3 out of the 4 datasets that we tested. Dataset Top Uncertain Soft Uncertain OAE-Bayesian Schmidt et al. 2021 (IFNg) 0.057 0.046 0.062 Schmidt et al. 2021 (IL2) 0.083 0.081 0.107 Zhuang et al. 2019 (NK) 0.035 0.047 0.085 Zhu et al. 2021 (Sars Cov2) 0.035 0.049 0.0411 Figure 5: Hit Ratio Results after 50 batches of size 16. BNN model trained with Achilles treatment descriptors. Final Hit Ratio average of 5 independent runs. Top Uncertain selects the 16 points with the largest uncertainty values and Soft Uncertain samples them using a softmax distribution. 6 CONCLUSION In this work we introduce a variety of algorithms inspired in the OAE principle for noiseless batch bandit optimization. We also show lower bounds for the query complexity for linear and Lipshitz classes as well as a novel regret upper bound in terms of the Eluder dimension of the query class. Our theoretical results hold under misspecification. Through a variety of experiments in synthetic, public and biological data we show that the different incarnations of OAE we propose in this work can quickly search through a space of actions for the almost optimal ones. This work focused in the case where the responses are noiseless. Extending our methods and experimental evaluation to the case where the responses are noisy is an exciting avenue for future work. 4https://github.com/genedisco/genedisco-starter Published as a conference paper at ICLR 2023 Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24:2312 2320, 2011. Alekh Agarwal. Selective sampling algorithms for cost-sensitive multiclass prediction. In International Conference on Machine Learning, pp. 1220 1228. PMLR, 2013. Jordan Ash, Surbhi Goel, Akshay Krishnamurthy, and Sham Kakade. Gone fishing: Neural active learning with fisher embeddings. Advances in Neural Information Processing Systems, 34:8927 8939, 2021. P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397 422, 2002. Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78 89, 2009. William H Beluch, Tim Genewein, Andreas Nürnberger, and Jan M Köhler. The power of ensembles for active learning in image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9368 9377, 2018. Erdem Bıyık, Kenneth Wang, Nima Anari, and Dorsa Sadigh. Batch active learning using determinantal point processes. ar Xiv preprint ar Xiv:1906.07975, 2019. Nicolo Cesa-Bianchi, Claudio Gentile, and Francesco Orabona. Robust bounds for classification via selective sampling. In Proceedings of the 26th annual international conference on machine learning, pp. 121 128, 2009. Jeffrey Chan, Aldo Pacchiano, Nilesh Tripuraneni, Yun S Song, Peter Bartlett, and Michael I Jordan. Parallelizing contextual linear bandits. ar Xiv preprint ar Xiv:2105.10590, 2021. Yuxin Chen and Andreas Krause. Near-optimal batch mode active learning and adaptive submodular optimization. In International Conference on Machine Learning, pp. 160 168. PMLR, 2013. V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, pp. 355 366. Omnipress, 2008. Sanjoy Dasgupta. Two faces of active learning. Theoretical computer science, 412(19):1767 1781, 2011. Sanjoy Dasgupta, Daniel J Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. Advances in neural information processing systems, 20, 2007. Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Robust selective sampling from single and multiple teachers. In COLT, pp. 346 358, 2010. Thomas Desautels, Andreas Krause, and Joel W. Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. Journal of Machine Learning Research, 15(119): 4053 4103, 2014. URL http://jmlr.org/papers/v15/desautels14a.html. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive. ics.uci.edu/ml. Melanie Ducoffe and Frederic Precioso. Adversarial active learning for deep networks: a margin based approach. ar Xiv preprint ar Xiv:1802.09841, 2018. Yonatan Geifman and Ran El-Yaniv. Deep active learning over the long tail. ar Xiv preprint ar Xiv:1711.00941, 2017. Amir Emad Ghassami, Saber Salehkaleybar, Negar Kiyavash, and Elias Bareinboim. Budgeted experiment design for causal structure learning. In International Conference on Machine Learning, pp. 1724 1733. PMLR, 2018. Published as a conference paper at ICLR 2023 Daniel Gissin and Shai Shalev-Shwartz. Discriminative active learning. ar Xiv preprint ar Xiv:1907.06347, 2019. Yuhong Guo and Dale Schuurmans. Discriminative batch mode active learning. Advances in neural information processing systems, 20, 2007. Insu Han, Prabhanjan Kambadur, Kyoungsoo Park, and Jinwoo Shin. Faster greedy map inference for determinantal point processes. In International Conference on Machine Learning, pp. 1384 1393. PMLR, 2017. Ruth E Hanna and John G Doench. Design and analysis of CRISPR-Cas experiments. Nat. Biotechnol., 38(7):813 823, July 2020. Steve Hanneke et al. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131 309, 2014. Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455 492, 1998. Tarun Kathuria, Amit Deshpande, and Pushmeet Kohli. Batched gaussian process bandit optimization via determinantal point processes. Advances in Neural Information Processing Systems, 29: 4206 4214, 2016. Andreas Kirsch, Joost Van Amersfoort, and Yarin Gal. Batchbald: Efficient and diverse batch acquisition for deep bayesian active learning. Advances in neural information processing systems, 32:7026 7037, 2019. Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. ar Xiv preprint ar Xiv:1207.6083, 2012. Arash Mehrjou, Ashkan Soleymani, Andrew Jesson, Pascal Notin, Yarin Gal, Stefan Bauer, and Patrick Schwab. Genedisco: A benchmark for experimental design in drug discovery. ar Xiv preprint ar Xiv:2110.11875, 2021. Sérgio Moro, Paulo Cortez, and Paulo Rita. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62:22 31, 2014. Jonas Mueller, David Reshef, George Du, and Tommi Jaakkola. Learning optimal interventions. In Artificial Intelligence and Statistics, pp. 1039 1047. PMLR, 2017. George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions i. Mathematical programming, 14(1):265 294, 1978. Aldo Pacchiano, Philip Ball, Jack Parker-Holder, Krzysztof Choromanski, and Stephen Roberts. Towards tractable optimism in model-based reinforcement learning. In Uncertainty in Artificial Intelligence, pp. 1413 1423. PMLR, 2021a. Aldo Pacchiano, Shaun Singh, Edward Chou, Alex Berg, and Jakob Foerster. Neural pseudo-label optimism for the bank loan problem. Advances in Neural Information Processing Systems, 34, 2021b. Jack Parker-Holder, Aldo Pacchiano, Krzysztof M Choromanski, and Stephen J Roberts. Effective diversity in population based reinforcement learning. Advances in Neural Information Processing Systems, 33:18050 18062, 2020. Daniel Russo. Worst-case regret bounds for exploration via randomized value functions. Advances in Neural Information Processing Systems, 32, 2019. Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems, 26, 2013. Nino Scherrer, Olexa Bilaniuk, Yashas Annadani, Anirudh Goyal, Patrick Schwab, Bernhard Schölkopf, Michael C Mozer, Yoshua Bengio, Stefan Bauer, and Nan Rosemary Ke. Learning neural causal models with active interventions. ar Xiv preprint ar Xiv:2109.02429, 2021. Published as a conference paper at ICLR 2023 Greg Schohn and David Cohn. Less is more: Active learning with support vector machines. In ICML, volume 2, pp. 6. Citeseer, 2000. Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. ar Xiv preprint ar Xiv:1708.00489, 2017. Burr Settles. 2009. Burr Settles, Mark Craven, and Soumya Ray. Multiple-instance active learning. Advances in neural information processing systems, 20, 2007. Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pp. 2171 2180. PMLR, 2015. Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. ar Xiv preprint ar Xiv:0912.3995, 2009. Aravind Subramanian, Rajiv Narayan, Steven M Corsello, David D Peck, Ted E Natoli, Xiaodong Lu, Joshua Gould, John F Davis, Andrew A Tubelli, Jacob K Asiedu, David L Lahr, Jodi E Hirschman, Zihan Liu, Melanie Donahue, Bina Julian, Mariya Khan, David Wadden, Ian C Smith, Daniel Lam, Arthur Liberzon, Courtney Toder, Mukta Bagul, Marek Orzechowski, Oana M Enache, Federica Piccioni, Sarah A Johnson, Nicholas J Lyons, Alice H Berger, Alykhan F Shamji, Angela N Brooks, Anita Vrcic, Corey Flynn, Jacqueline Rosains, David Y Takeda, Roger Hu, Desiree Davison, Justin Lamb, Kristin Ardlie, Larson Hogstrom, Peyton Greenside, Nathanael S Gray, Paul A Clemons, Serena Silver, Xiaoyun Wu, Wen-Ning Zhao, Willis Read-Button, Xiaohua Wu, Stephen J Haggarty, Lucienne V Ronco, Jesse S Boehm, Stuart L Schreiber, John G Doench, Joshua A Bittker, David E Root, Bang Wong, and Todd R Golub. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell, 171(6):1437 1452.e17, November 2017. Scott Sussex, Andreas Krause, and Caroline Uhler. Near-optimal multi-perturbation experimental design for causal structure learning. ar Xiv preprint ar Xiv:2105.14024, 2021. Balazs Szorenyi, Róbert Busa-Fekete, Paul Weng, and Eyke Hüllermeier. Qualitative multi-armed bandits: A quantile-based approach. In International Conference on Machine Learning, pp. 1660 1668. PMLR, 2015. Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov):45 66, 2001. Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019. Zheng Wang and Jieping Ye. Querying discriminative and representative samples for batch mode active learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 9(3):1 23, 2015. Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in data subset selection and active learning. In International conference on machine learning, pp. 1954 1963. PMLR, 2015. Pan Xu, Zheng Wen, Handong Zhao, and Quanquan Gu. Neural contextual bandits with deep representation and shallow exploration. ar Xiv preprint ar Xiv:2012.01780, 2020. Mengyan Zhang and Cheng Soon Ong. Quantile bandits for best arms identification. In International Conference on Machine Learning, pp. 12513 12523. PMLR, 2021. Vicky Zhang, Chandler Squires, and Caroline Uhler. Matching a desired causal state via shift interventions. Advances in Neural Information Processing Systems, 34, 2021. Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with ucb-based exploration, 2020. Published as a conference paper at ICLR 2023 1 Introduction 1 2 Related Work 2 3 Problem Definition 3 4 Optimistic Arm Elimination 4 4.1 Tractable Implementations of OAE . . . . . . . . . . . . . . . . . . . . . . . . . . 5 4.1.1 Optimistic Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 5 4.1.2 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 4.2 Diversity seeking versions of OAE . . . . . . . . . . . . . . . . . . . . . . . . . . 5 4.2.1 Diversity via Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . 6 4.2.2 Sequential batch selection rules . . . . . . . . . . . . . . . . . . . . . . . 6 4.3 The Statistical Complexity of Zero Noise Batch Learning . . . . . . . . . . . . . . 6 5 Transfer Learning Across Genetic Perturbation Datasets 7 5.0.1 Gene Disco Experimental Planning Benchmark . . . . . . . . . . . . . . . 9 6 Conclusion 9 A Transfer Learning Across Genetic Perturbation Datasets 14 B Complementary description of the methods 15 B.1 Appendix Diversity via Determinants . . . . . . . . . . . . . . . . . . . . . . . . 15 B.1.1 Sequential batch selection rules . . . . . . . . . . . . . . . . . . . . . . . 16 C Tractable Implementations of OAE Dv D and OAE Seq 17 C.0.1 Diversity via Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . 17 C.0.2 Sequential Batch Selection Rules . . . . . . . . . . . . . . . . . . . . . . 17 D Theoretical Results 18 D.1 Quantifying the Query Complexity of F . . . . . . . . . . . . . . . . . . . . . . . 18 D.2 Optimism and its properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 D.3 Noisy responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 E Experiments 25 E.1 Vanilla OAE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 E.1.1 Synthetic One Dimensional Datasets . . . . . . . . . . . . . . . . . . . . . 26 E.1.2 Public Supervised Learning Datasets . . . . . . . . . . . . . . . . . . . . . 29 E.2 Experiments Diversity Seeking Objectives . . . . . . . . . . . . . . . . . . . . . . 30 E.2.1 OAE Dv D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Published as a conference paper at ICLR 2023 E.2.2 OAE Seq . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 F Cellular proliferation phenotype 34 G Further Experiments 35 G.1 Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 G.2 Regression Fitted Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 G.3 UCI public datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 G.4 Transfer Learning Biological Datasets . . . . . . . . . . . . . . . . . . . . . . . . 38 A TRANSFER LEARNING ACROSS GENETIC PERTURBATION DATASETS In this section we have placed a diagrammatic version of the data pipeline described in section 5. Figure 6: Neural design for genetic perturbation experiments. (a) Learn a perturbation action embedding space by training an autoencoder on the gene expression resulting from a large set of observed genetic perturbations in a particular biological context (e.g., sh RNA gene knockouts for a particular cell line in CMAP). (b) Select an initial batch of B perturbation actions to perform in parallel within a related (but different) biological context. Selection can be random (uniform) or influenced by prior information about the relationship between genes and the phenotype to be optimized. (c) Perform the current batch of experimental perturbations in vitro and observed their corresponding phenotypic rewards. (d) Concatenate the latest batch s features and observed rewards to those of previous batches to update the perturbation reward training set. (e) Train a new perturbation reward regression (with some degree of pre-defined optimism) network on the observed perturbation rewards. (f) Use this regressor to predict the optimistic rewards for the currently unobserved perturbations. (g) Select the next batch from these unobserved perturbations with the highest optimistic reward. Published as a conference paper at ICLR 2023 B COMPLEMENTARY DESCRIPTION OF THE METHODS B.1 APPENDIX DIVERSITY VIA DETERMINANTS Inspired by diversity-seeking methods in the Determinantal Point Processes (DPPs) literature Kulesza & Taskar (2012), we introduce the OAE Dv D algorithm. Inspired by the Dv D algorithm ?? we propose the use DPPs can be used to produces diverse subsets by sampling proportionally to the determinant of the kernel matrix of points within the subset. From a geometric perspective, the determinant of the kernel matrix represents the volume of a parallelepiped spanned by feature maps corresponding to the kernel choice. We seek to maximize this volume, effectively filling the feature space. Using a determinant score to induce diversity has been proposed as a strategy in other domains, most notably in the form of the Diversity via Determinants (Dv D) algorithm from Parker-Holder et al. (2020) for Population Based Reinforcement Learning. It is from this work that we take inspiration to name our algorithm OAE Dv D. The idea of using DPPs for diversity guided active learning has been explored by Bıyık et al. (2019). In the active learning setting the objective function used to build the batch is purely driven by the diversity objective. The method works as follows. At time t, OAE Dv D constructs a regression estimator eft using the arms and responses in Dt (for example by solving problem 1). Instead of using Equation 2, our algorithm selects batch Bt by optimizing a diversity aware objective of the form, Bt arg max B Ut,|B|=b Asum( eft, B) + Div( eft, B). (6) OAE Dv D s Div regularizer is inspired by the theory of Determinantal Point Processes and equals a weighted log-determinant objective. OAE Dv D has access to a kernel function K : Rd Rd R and at the start of every time step t N it builds a kernel matrix Kt R|Ut| |Ut|, Kt[i, j] = K(ai, aj), i, j Ut. For any subset B Ut we define the OAE Dv D diversity-aware score as, Div(f, B) = λdiv log (Det(Kt[B, B])) (7) Where Kt[B, B] corresponds to the submatrix of Kt with columns (and rows) indexed by B and λdiv is a diversity regularizer. Since the resulting optimization problem in Equation 6 may prove to be extremely hard to solve, we design a greedy maximization algorithm to produce a surrogate solution. We build the batch Bt greedily. The first point at,1 in the batch is selected to be the point in Ut that maximizes the response eft. For all i 2 the point at,i in Ut is selected from Ut\{at,j}i 1 j=1 such that, at,i = max a Ut\{at,j}i 1 j=1 Asum( eft, {a} {at,j}i 1 j=1) + Div( eft, {a} {at,j}i 1 j=1) Algorithm 2 Optimistic Arm Elimination - Dv D (OAE Dv D) Input Action set A Rd, num batches N, batch size b, λdiv Initialize Unpulled arms U1 = A. Observed points and labels dataset D1 = for t = 1, , N do if t = 1 then Sample uniformly a size b batch Bt U1. Compute Bt using the greedy procedure described above. Observe batch rewards Yt = {y (a) for a Bt} Update Dt+1 = Dt {(Bt, Yt)}. Update Ut+1 = Ut\Bt . We define a reward augmented kernel matrix Kaug t (i, j) = Kt(i, j) r exp e ft(i)+ e ft(j) λdet . This matrix satisfies Kaug t [B, B] = diag q exp( eft/λdet) [B, B] Kt[B, B] diag q exp( eft/λdet) [B, B] Published as a conference paper at ICLR 2023 for all B Ut. Since the determinant of a product of matrices is the product of their determinants it follows that Det(Kaug t [B, B]) = exp P a B eft(a)/λdet Det (Kt[B, B]). Thus for all B Ut, equation 6 with diversity score 7 can be rewritten as Bt arg max B Ut,|B|=b log (Det(Kaug t [B, B])) This is because log (Det(Kaug t [B, B])) = a B e ft(a) λdiv + log (Det(Kt[B, B])) for all B Ut. It is well known the log-determinant set function for a positive semidefinite matrix is submodular (see for example section 2.2 of Han et al. (2017)). It has long been established that the greedy algorithm achieves an approximation ratio of (1 1 e) for the constrained submodular optimization problem (see Nemhauser et al. (1978)). This justifies the choices we have made behind the greedy algorithm we use to select Bt. B.1.1 SEQUENTIAL BATCH SELECTION RULES In this section we introduce OAE Seq a generalization of the OAE algorithm designed to produce in batch diversity. OAE Seq produces a query batch by solving a sequence of b optimization problems. OAE Seq uses first e Dt,0 = Dt the set of arms pulled so far as well as e Ut,0 = Ut (the set of arms yet to be pulled) to produce a function eft,1 that determines the initial arm in the batch via the greedy choice at,1 = arg maxa Ut eft,1(a). The function eft,1 is computed using the same method as any vanilla OAE procedure. Having chosen this arm, in the case when b > 1, a virtual reward eyt,1 (possibly different from eft(at,1)) is assigned to the query arm at,1, and datasets e Dt,1 = Dt {(at,1, eyt,1)} and e Ut,1 = Ut\{at,1} are defined. The same optimization procedure that produced eft,1 is used to output eft,2 now with e Dt,1 and e Ut,1 as inputs. Arm at,2 is defined as the greedy choice at,2 = arg maxa e Ut,1 eft,2(a). The remaining batch elements (if any) are determined by successive repetition of this process so that e Dt,i = Dt {(at,j, eyt,j)}i j=1 and e Ut,i = Ut\{at,j}i j=1. The trace of this procedure leaves behind a sequence of functions and datasets {( eft,i, e Ut,i)}b i=1 such that at,i arg max e Ut,i 1 eft,i(a). Algorithm 3 Optimistic Arm Elimination - Batch Sequential (OAE Seq) Input Action set A Rd, number of batches N, batch size b, pessimism-optimism balancing parameter α. Initialize Unpulled arms U1 = A. Observed points and labels dataset D1 = for t = 1, , N do if t = 1 then Sample uniformly a size b batch Bt U1. Solve for {( eft,i, e Ut,i)}b i=1 via the procedure described in the text. Compute Bt = {at,i arg max a e Ut,i 1 eft,i(a)}b i=1. (8) Observe batch rewards Yt = {f (a) for a Bt} Update Dt+1 = Dt {(Bt, Yt)} D. Update Ut+1 = Ut\Bt . To determine the value of the virtual rewards eyt,i, we consider a variety of options. We start by discussing the case when the fake reward eyt,i = eft(at,i) and the acquisition function equals Asum(f, U). Published as a conference paper at ICLR 2023 When γt = 0, the fake reward satisfies eyt,i = eft(at,i) and y F it follows that eft,i = eft independent of i [B] is a valid choice for the function ensemble { eft,i}b i=1. In this case the query batch Bt can be computed by solving for eft, eft arg max f F a Ut f(a) s.t. X (a,y) e Dt,i (f(a) y)2 = 0. (9) And defining the batch Bt as Bt = arg max B Ut s.t. |B|=b a B eft(a). The equivalence between this definition of Bt and the sequential batch selection rule follows by noting the equality constraint from 9 ensures that eft,i = eft is a valid solution for each of the intermediate problems defining the sequence {( eft,i, e Ut,i)}B i=1. OAE Seq is designed with a more general batch selection rule that may yield distinct intermediate arm selection functions { eft,i}. In our experiments we compute the virtual reward eyt,i as an average of ef optimistic t,i and ef pessimistic t,i , optimistic and a pessimistic estimators of the responses in Ut,i 1. We consider two mechanisms for computing ef optimistic t,i and ef pessimistic t,i . First when the computation of eft,i is based on producing an uncertainty function ut,i and a base model ef o t,i, we define ef optimistic t,i = ef o t,i +ut,i and ef pessimistic t,i = ef o t,i ut,i. Second, we define ef optimistic t,i and ef pessimistic t,i as the solutions of the constrained objectives ef optimistic t,i = arg max f Ft,i A(f, Ut) and ef pessimistic t,i = arg min f Ft,i A(f, Ut) Where Ft,i = {f F s.t. P (a,y) Dt,i 1 (f(a) y)2 γt}. In both cases we define the fictitious rewards as a weighted average of the pessimistic and optimistic predictors eyt,i = α ef optimistic t,i + (1 α) ef pessimistic t,i where α [0, 1] is an optimism weighting parameter while we keep the functions used to define what points are part of the batch as eft,i = ef optimistic t,i . The principle of adding hallucinated values to induce diversity has been proposed before for example in the GP BUCB algorithm of Desautels et al. (2014). C TRACTABLE IMPLEMENTATIONS OF OAE Dv D AND OAE Seq In this section we go over the algorithmic details behind the approximations that we have used when implementing the different OAE methods we have introduced in Section 4. C.0.1 DIVERSITY VIA DETERMINANTS The OAE Dv D algorithm differs from OAE only in the way in which the query batch Bt is computed. OAE Dv D uses equation 6 instead of equation 2. Solving for Bt is done via the greedy algorithm described in Section 4.2.1. The function eft can be computed via a regularized optimization objective or using an ensemble. In our experimental evaluation we opt for defining eft via the regularization route as the result of solving problem 3. More experimental details including the type of kernel used are explained in Section E.2. In our experimental evaluation we use the Mean Opt objective to produce eft and we refer to the resulting method as Det D. C.0.2 SEQUENTIAL BATCH SELECTION RULES We explore different ways of defining the functions { eft,i}b i=1. Depending on what procedure we use to optimize and produce these functions we will obtain different versions of OAE Seq. We use the name Seq B to denote the OAE Seq method that fits the functions { ef optimistic t,i }b i=1 and Published as a conference paper at ICLR 2023 { ef pessimistic t,i }b i=1 by solving the regularized objectives, ef optimistic t,i arg min f F (a,y) Dt,i 1 (f(a) y)2 λreg A(f, Ut,i 1), ef pessimistic t,i arg min f F (a,y) Dt,i 1 (f(a) y)2+λreg A(f, Ut,i 1). In our experiments we set the acquisition function to Aavg. For some value5 of λreg > 0. The functions { eft,i}b i=1 are defined as eft,i = ef optimistic t,i and the virtual rewards as eyt,i = α ef optimistic t,i + (1 α) ef pessimistic t,i where α [0, 1] is an optimism-pessimism weighting parameter. In our experimental evaluation we use α = 1 2. More experimental details are presented in Section E.2. The functions { ef optimistic t,i }b i=1 and { ef pessimistic t,i }b i=1 can be defined with the use of an ensemble. Borrowing the definitions of Section 4.1.2 ef optimistic t,i = ef o t,i+ ut,i, ef pessimistic t,i = ef o t,i ut,i. Where ef o t,i and ut,i are computed by first fitting an ensemble of models { bf j t,i}M j=1 using dataset Dt,i 1. In our experimental evaluation we explore the use of Ensemble and Ensemble Noise Y optimization styles to fit the models { bf j t,i}M j=1. In our experiments we use the names Ensemble Seq B and Ensemble Seq B Noise Y to denote the resulting sequential batch selection methods. More details of our implementation can be found in Section E.2. D THEORETICAL RESULTS D.1 QUANTIFYING THE QUERY COMPLEXITY OF F Let ϵ 0 and define tϵ opt(A, f) to be the first time-step when an ϵ optimal point ˆat is proposed by a learner (possibly randomized) when interacting with arm set A Rd and the pseudo rewards are noiseless evaluations {f(a)}a A with f F. We define the query complexity of the A, F pair as, Tϵ(A, F) = min Alg max f F E tϵ opt(A, f) Where the minimum iterates over all possible learning algorithms. We can lower bound of the problem complexity for several simple problem classes, Lemma D.1. When A = { x 1 for x Rd} and 1. If F is the class of linear functions defined by vectors in the unit ball F = {x θ x : θ 1 for θ Rd} then Tϵ(A, F) d when ϵ < 1 d and Tϵ(A, F) d ϵd otherwise. 2. If F is the class of 1-Lipschitz functions functions then Tϵ(A, F) 1 Proof. As a consequence of Yao s principle, we can restrict ourselves to deterministic algorithms. Indeed, min Alg max f F E tϵ opt(A, f) = max DF min Det Alg Ef DF tϵ opt(A, f) Thus, to prove the lower bound we are after it is enough to exhibit a distribution DF over instances f F and show a lower bound for the expected tϵ opt(A, f) where the expectation is taken using DF. With the objective of proving item 1 let DF be the uniform distribution over the sphere Unif d(1). By symmetry it is easy to see that Eθ Unif d(r) [θi] = 0, i d and r 0. 5As we have pointed out in Section 4.2.2 setting λreg = 0 reduces OAE Seq to vanilla OAE. Published as a conference paper at ICLR 2023 Varθ Unif d(r) (θi) = Eθ Unif d(r)[θ2 i ] (i) = r2 Equality (i) follows because i=1 Eθ Unif d(r)[θ2 i ] = Eθ Unif d(r)[ θ 2] = r2 and because by symmetry for all i, j [d] the second moments agree, Eθ Unif d(r)[θ2 i ] = Eθ Unif d(r)[θ2 j] Eθ Unif d(r)[|θi|] q Eθ Unif d(r)[θ2 i ] = r Let Det Alg be the optimal deterministic algorithm for DF and a1 be its first action. Since DF is the unofrm distribution over the sphere, inequality 11 expected scale of the reward reward experienced is upper bounded by 1 d, and furthermore, since a1 = 1, the expected second moment of the reward experienced (where expectations are taken over DF) equals 1 We now employ a conditional argument, if Det Alg has played a1 and observed a reward r1, We assume that up to time m algorithm Det Alg has played actions a1, , am and received rewards r1, , rm. Given these outcomes, Det Alg can recover the component of θ lying in span(a1, , am). Let am+1 be Det Alg s action at time m+1. By assumption this is a deterministic function of a1, , am and r1, , rm. Since θ is drawn from Unif d(1), the expected squared dot product between the component of a m+1 = Proj(am+1, span(a1, , am) ) satisfies, Eθ DF|{a 1 θ=ri}m i=1 h θ a m+1 2i = 1 θ0 m 2 d m 1 a0 m+1 2 where θ0 m = Proj(θ, span(a1, , am)). The last inequality follows because the conditional distribution of Proj(θ, span(a1, , am) ) given a1, , am and r1, , rm is a uniform distribution over the d m dimensional sphere of radius p 1 θ0m 2, the scale of a m+1 is q 1 a0 m+1 2 and we have assumed the a0 m+1 = 0. Thus, the agreement of a m+1 with Proj(θ, span(a1, , am) ) satisfies Equation 10. We consider the expected square norm of the recovered θ up to time m. This is the random variable θ0 m 2 = Pm t=1 θ a t 2 where a t = Proj(at, span(a1, , at 1) ). Thus, Eθ DF θ0 m 2 = Eθ DF t=1 Eθ DF|{a 1 θ=ri}t 1 i=1 h θ a t 2i# (i) = Eθ DF Equality (i) holds because of 12. Recall that by Equation 10, Eθ DF θ0 1 2 = 1 Published as a conference paper at ICLR 2023 Thus by the above equalities, Eθ DF θ0 2 2 = 1 Unrolling these equalities further we conclude that Eθ DF θ0 m 2 = m This implies the expected square agreement between the learner s virtual guess ˆat is upper bounded by m d . Thus, when ϵ < 1 d, the expected number of queries required is at least d. When ϵ > d, the expected number of queries instead satisfies a lower bound of d ϵd . We now show shift our attention to Lipschitz functions. First we introduce the following simple construction of a 1 Lipschitz function over a small ball of radius ϵ. We use this construction throughout our proof. Let x Rd be an arbitrary vector, define B(x, ϵ) as the ball centered around x of radius 2ϵ under the 2 norm and S(x, 2ϵ) as the sphere (the surface of B(x, 2ϵ)) centered around x Define the function f ϵ x : Rd R as, f ϵ x(z) = minz S(x,2ϵ) z z 2 if z B(x, 2ϵ) 0 o.w. It is easy to see that f ϵ x is 1 Lipschitz. We consider three different cases, 1. If z1, z2 B(x, 2ϵ)c then |f ϵ x(z1) f ϵ x(z2)| = 0 z1 z2 . The result follows. 2. If z1 B(x, 2ϵ) but z2 B(x, 2ϵ)c. Let z3 be the intersection point in the line going from z1 to z2 lying on S(x, 2ϵ). Then |f ϵ x(z1) f ϵ x(z2)| = minz S(x,2ϵ) z1 z 2 z1 z3 2 z1 z2 2. 3. If z1, z2 B(x, 2ϵ). It is easy to see that |f ϵ x(z1) f ϵ x(z2)| = | z1 x 2 z2 x 2|. And therefore by the triangle inequality applied to x, z1, z2, that | z1 x 2 z2 x 2| z1 z2 2. The result follows. Let N(B(0, 1), 2ϵ) be a 2ϵ packing of the unit ball. For simplicity we ll use the notation N2ϵ = |N(B(0, 1), 2ϵ)|. Define the set of functions Fϵ = {f ϵ x for all x N(B(0, 1), 2ϵ)} and define DF as the uniform distribution over Fϵ F. Similar to the case when F is the set of linear functions, we make use of Yao s principle. Let Det Alg be an optimal deterministic algorithm for DF. Let ai be Det Alg s i th query point and ri be the i th reward it receives. If the ground truth was f ϵ x and the algorithm does not sample a query point from inside B(x, 2ϵ), it will receive a reward of 0 and thus would not have found an ϵ optimal point. Thus topt(f ϵ x) first time to pull an arm in B(x, 1). As a consequence of this fact, Ef ϵx DF [1(a1 B(x, 2ϵ))] 1 N2ϵ . Hence Ef ϵx DF [1(a1 B(x, 2ϵ))] N2ϵ 1 N2ϵ . Therefore, N2ϵ ℓ N2ϵ ℓ+ 1 N2ϵ ℓ N2ϵ ℓ+ 1 (i) Covering(B(0, 1), 4ϵ) (ii) 1 4ϵ d where inequality (i) is a consequence of Lemma 5.5 and inequality (ii) from Lema 5.7 in Wainwright (2019). Published as a conference paper at ICLR 2023 Translating to Quantile Optimality . The results of Lemma D.1 can be interpreted in the langauge of quantile optimality by imposing a uniform measure over the sphere. In this case ϵ optimality is equivalent (approximately) to a 1 2 d(1 ϵ)2 quantile. The results of Lemma D.1 hold regardless of the batch size B. It is thus impossible to design an algorithm that can single out an ϵ-optimal arm in less than Tϵ(A, F) queries for all problems defined by the pair A, F simultaneously. D.2 OPTIMISM AND ITS PROPERTIES The objective of this section is to prove Theorem 4.1 which we restart for the reader s convenience. Theorem 4.1. The regret of OAE with Amax,b acquisition function satisfies, Reg F,b(T) = O dim E(F, αT )b + ω p dim E(F, αT )Tb . With αt = max 1 t2 , inf{ f1 f2 : f1, f2 F, f1 = f2} and C = maxf F,a,a A |f(a) f(a )|. Let s start by defining the ϵ Eluder dimension, a complexity measure introduced by Russo & Van Roy (2013) to analyze optimistic algorithms. Throughout this section we ll use the notation g A = p P a A g2(a) to denote the data norm of function g : A R. Definition D.2. Let ϵ 0 and Z = {ai}n i=1 A be a sequence of arms. 1. An action a is ϵ dependent on Z with respect to F if any f, f F satisfying p Pn i=1(f(ai) f (ai))2 ϵ also satisfies |f(a) f (a)| ϵ. 2. An action a is ϵ independent of Z with respect to F if a is not ϵ dependent on Z. 3. The ϵ eluder dimension dim E(F, ϵ) of a function class F is the length of the longest sequence of elements in A such that for some ϵ ϵ, every element is ϵ independent of its predecessors. Lemma D.3. Let s assume y satisfies minf F y f ω where g = maxa Z |g(a)|. Let ey = arg minf F y f be the projection of y onto F. If eft is computed by solving 1 with γt (t 1)ω2b with acquisition objective Amax,b the response predictions of eft over values Bt satisfy, X a B ,t y (a) bω + X a B ,t ey (a) bω + X where B ,t = max B Ut,|B|=b P a B y (a). Proof. Let Ft be the subset of F satisfying f Ft if P (x,y) Dt(f(x) y)2 (t 1)ω2 γt. By definition ey Ft. Substituting the definition of Amax,b into Equation 1, eft arg max f Ft max B Ut,|B|=b Since ey Ft, max f Ft,B Ut,|B|=b a B f(b) = X a e B ,t ey (a) X a B ,t ey (a). Where e B ,t = max B Ut,|B|=b P a B ey (a) and B ,t = max B Ut,|B|=b P a B y (a). Finally, for all a A, we have ey (a) + ω y (a). This finalizes the proof. m Since ey (a) + ω y (a). for all a A, Lemma D.3 implies, a B ,t y (a) X eft(a) ey (a) (13) Published as a conference paper at ICLR 2023 We define the width of a subset e F F at an action a A by, r e F(a) = sup f,f e F |f(a) f (a)|. And use the shorthand notation rt(a) to denote r Ft(a) where Ft = {f F s.t. P (a,y) Dt(f(a) y)2 γt}. Equation 13 implies, a B ,t y (a) X a Bt rt(a) (14) In order to bound the contribution of the sum Pt ℓ=1 P a Bt rt(a) we use a similar technique as in Russo & Van Roy (2013). First we prove a generalization of Proposition 3 of Russo & Van Roy (2013) to the case of parallel feedback. Proposition D.4. If {γt 0|t N} is a nondecreasing sequence and Ft = {f F : P (a,y) Dt(f(a) y)2 γt} then, a Bt 1(rt(a) > ϵ) O γtd Where d = dim E(F, ϵ). Proof. We will start by upper bounding the number of disjoint sequences in Dt 1 that an action a Bt can be ϵ dependent on when rt(a) > ϵ. If rt(a) > ϵ there exist f, f Ft such that f(a) f(a) > ϵ. By definition if a Bt is ϵ dependent on a sequence S t 1 ℓ=1Bℓ= Dt then f f 2 S > ϵ2 (since otherwise f f 2 S ϵ2 would imply a to be ϵ independent of Dt). Thus if a is ϵ dependent on K disjoint sequences S1, , SK Dt, then f f 2 Dt Kϵ2. By the triangle inequality, f f Dt f ey Dt + f ey Dt 2 γt. Thus it follows that ϵ K < 2 γt and therefore Next we prove a lower bound for K. In order to do so we prove a slightly more general statement. Consider a batched sequence of arms { aℓ,i}i [b],ℓ [τ] where for the sake of the argument aℓ,i is not necessarily meant to be aℓ,i. We use the notation Bℓ= { aℓ,i}i [b] to denote the ℓ th batch in { aℓ,i}i [b],ℓ [τ] and Dt = t 1 ℓ=1 Bℓ. Let τ N and define e K as the largest integer such that e Kd + b τb. We show there is a batch number ℓ τ and in-batch index i such that aℓ ,i is ϵ dependent on a subset of disjoint sequences of size at least K 2 out of a set of e K disjoint sequences S1, , S K Dℓ 1. First let s start building the Si sequences by setting Si =the i th element in { aℓ,i}i [b],ℓ [τ] ordered lexicographically. This will involve elements of up to batch B K/b . Since we are going to apply the same argument recursively in our proof, let s denote the current batch index in the construction of S1, , S e K as ℓ, this is, we set S1, , S e K D ℓ. At the start of the argument ℓ= K/b . If there is an arm a B ℓ+1 such that a is ϵ dependent on at least e K/2 of the { Si} e K i=1 sequences, the result would follow. Otherwise, it must be the case that for all a B ℓ+1 there are at least e K 2 sequences in { Si} e K i=1 on which a is ϵ independent. Published as a conference paper at ICLR 2023 Let s consider a bipartite graph with edge sets B ℓ+1 and { Si} e K i=1. We draw an edge between a B ℓ+1 and Sj { Si} e K i=1 if a is ϵ independent on Sj. If for all a B ℓ+1 there are at least e K 2 sequences in { Si} e K i=1 on which a is ϵ independent, Lemma D.5 implies the existence of a matching of size at least min( e K 8 , b) between elements in B ℓ+1 and the sequences in { Si} e K i=1. The case min( e K 8 , b) = e K 8 can only occur when (τ 1)b 8d b and therefore when τb 8bd + b. In case min( e K 8 , b) = b. If we reach ℓ= τ it must be the case that at least (τ 1)b points could be accommodated into the e K sequences. By definition of e K it must be the case that each sub-sequence Si satisfies |Si| d. Since each element of subsequence Si is ϵ independent of its predecesors, |Si| = d. In this case we would conclude there is an element in Bτ that is ϵ dependent on e K and therefore at least e K 2d subsequences. If τb 8γτ d ϵ2 + 2d + b then e K Combining the results of the two last paragraphs we conclude that if τb 8γτ d ϵ2 + 2d + 2b + 8bd then there is a batch index ℓ [τ] such that there is an arm aℓ ,i Bℓ that is ϵ dependent of at least e K ϵ2 + 1 disjoint sequences contained in Dℓ . Let s apply the previous result to the sequence of at,i for i [b] and t R such that rt(at,i) > ϵ. An immediate consequence of the previous results is that if Pt ℓ=1 P a Bℓ1(rℓ(a) > ϵ) 8γtd ϵ2 + 2d + 2b+8bd there must exist an arm in Bt such that it is ϵ dependent on at least 4γt ϵ2 +1 disjoint sequences of Dt. Nonetheless, Equation 15 implies this is impossible. Thus, Pt ℓ=1 P a Bℓ1(rℓ(a) > ϵ) 8γtd ϵ2 + 2d + 2b + 8bd 8γtd ϵ2 + 12bd. The result follows. Finally, the RHS of equation 14 can be upper bounded using a modified version of Lemma 2 of Russo & Van Roy (2013) (which can be found in Appendix D.2 ) yielding, a B ,t y (a) X T +O min (dim E(F, αT )b, T) C + p dim E(F, αT )ωT . Where αt = max 1 t2 , inf{ f1 f2 : f1, f2 F, f1 = f2} and C = maxf F,a,a A |f(a) f(a )|. The quantity in the left is known as regret. This result implies the regret is bounded by a quantity that grows linearly with ω, the amount of misspecification but otherwise only with the scale of dim E(F, αT )b. Our result is not equivalent to splitting the datapoints in b parts and adding b independent upper bounds. The resulting upper bound in the later case will have in a term of the form O b p dim E(F, αT )ωT whereas in our analysis this term does not depend on b. When ω = 0, the regret is upper bounded by O (dim E(F, αT )d). For example, in the case of linear models, the authors of Russo & Van Roy (2013) show dim E(F, αT ) = O (d log(1/ϵ)). This shows that for example sequential OAE when b = 1 achieves the lower bound of Lemma D.1 up to logarithmic factors. In the setting of linear models, the b dependence in the rate above is unimprovable by vanilla OAE without diversity aware sample selection. This is because an optimistic algorithm may choose to use all samples in each batch to explore explore a single unexplored one dimensional direction. Theoretical analysis for OAE Dv D and OAE Seq is left for future work. Lemma D.5. Let G be a bipartite graph with node set A B such that |A| = K and |B| = b. If for all nodes v B it follows that N(v) e K 2 b there exists a perfect matching between the nodes in B and the ones in A. 2. If instead e K 2 < b there exists a subset of e K 8 nodes in A with a perfect matching to B. Proof. The first item follows immediately from Hall s marriage theorem. Notice that in this case the neighboring set of any subset of nodes in B has a cardinality of at least e K 2 and therefore it is at least the size of B. The conditions of Hall s theorem are satisfied thus implying the existence of a perfect matching between the nodes in B to the nodes in A. Published as a conference paper at ICLR 2023 In the second scenario, let s first prove there exists a subset A of A of size at least e K 4 such that every element in A has at least b 4 neighbors in B. We prove this condition by the way of contradiction. There are at least b e K 2 edges in the graph. Suppose there were at most L vertices in A with degree greater or equal to b 4. This value of L must satisfy the inequality, Lb + ( e K L) b This is because the maximum number of edges a vertex in A can have equals b. Thus, L e K All nodes in L have degree at least b 4. If we restrict ourselves to a subset e L of L of size e K 8 and since in this scenario e K 4, Hall s stable marriage theorem implies there is a perfect matching between e L to B. The result follows. Lemma 2 of Russo & Van Roy (2013) adapted to notation of OAE. Lemma D.6. If {γt 0|t N} is a nondecreasing sequence and Ft = {f F s.t. P (a,y) Dt(f(a) y)2 γt} then with probability 1 for all T N, a Bt rt(a) O dim E(F, αT )b + ω p dim E(F, αT )Tb . Where αt = max 1 2 inf{ f1 f2 : f1, f2 F, f1 = f2} and C = maxf F,a,a A |f(a) f(a )|. Proof. The proof of lemma D.6 follows the proof template as that of Lemma 2 in Russo & Van Roy (2013). We reproduce it here for completeness. For ease of notation we use d = dim E(F, αT ). We first re-order the sequence {r1(a1,1), r1(a1,2), , r1(a1,b), , r T (a T,b)) as ri1 ri T b. We have, a Bt rt(a) = j=1 1(rij αT )rij + j=1 1(rij > αT )rij j=1 1(rij > αT )rij Where inequality (i) holds by definition of αT noting that j=1 1(rij αT )rij = 0, if αT = 1 2 inf{ f1 f2 : f1, f2 F, f1 = f2}. j=1 1(rij αT )rij 1, if αT = 1 Proposition D.4 (since d dim E(F, ϵ) for all ϵ αT ) implies that for all ij with rij > αT , we can bound j as a Bt 1(rt(a) > rij) O Published as a conference paper at ICLR 2023 Algorithm 4 Noisy Batch Selection Principle (OAE) Input Action set A Rd, num batches N, batch size b Initialize Observed points and labels dataset D1 = for t = 1, , N do If t = 1 then: Sample uniformly a size b batch Bt U1. Else: Solve for eft and compute Bt arg max B Ut||B|=b A( eft, B). Observe batch rewards Yt = {y (a) for a Bt} Update Dt+1 = Dt {(Bt, Yt)} and Ut+1 = Ut\Bt . So there is a constant c > 0 such that rij O q γT d (j cbd)+ . For all j cbd notice that Pcbd j=1 rij = O (bd) since the radii rij are all of constant size. We conclude that, γT d (j cbd)+ γT d Z T b cbd Substituting γT = ω2Tb we conclude, j=1 rij = O bd + ω Thus finalizing the result. Combining the result of Lemma D.6 and Equation 14 finalizes the proof of Theorem 4.1. D.3 NOISY RESPONSES In this section we describe how our results imply improved bounds for the setting with noisy responses. In this case we assume that yt,i = y (at,i) + ξt,i where ξt,i is conditionally zero mean. We consider the following optimistic least squares algorithm, E EXPERIMENTS We demonstrate the effectiveness of our OAE algorithm in several problem settings across public and synthetic datasets. We evaluate the algorithmic implementations described in Section 4.1 by setting the acquisition function to Aavg(f, U) and the batch selection rule as in Equation 2 for the vanilla OAE methods and as Equations 6 and 8 for OAE s diversity inducing versions OAE Dv D and OAE Seq respectively. All neural network architectures use Re LU activations. All ensemble methods use an ensemble size of M = 10 and xavier parameter initialization. Small vs Large Batch Regimes. Oftentimes the large batch - small number of iterations regime is the most interesting scenario from a practical perspective Hanna & Doench (2020). In scientific settings like pooled genetic perturbations, each experiment may take a long time (many weeks or months) to conclude, but it is possible to conduct a batch of experiments in parallel together. We study this regime in Section 5. Published as a conference paper at ICLR 2023 Figure 7: (top row) Synthetic one dimensional datasets (bottom) Evolution of Mean Opt in the Multi Valley One Dim Hole dataset using λreg = 0.001. From left to right: Iterations 5, 15, 25, 45. E.1 VANILLA OAE We test Mean Opt, Hinge PNorm, Max Opt, Ensemble and Ensemble Noise Y s performance (see Section 4.1 for a detailed description of each of these algorithms) over different values of the regularization parameter λreg, including the greedy choice of λreg = 0, henceforth referred to as Greedy. We compare these algorithms to the baseline method that selects points Bt by selecting a uniformly random batch of size B from the set Ut, henceforth referred to as Random Batch and against each other. We conduct experiments on three kind of datasets. First in Section E.1.1 we capture the behavior of OAE in a set of synthetic one dimensional datasets specifically designed to showcase different landscapes for y ranging from uni-modal to multi-modal with missing values. In Section E.1.2 we conduct similar experiments on public datasets from the UCI database Dua & Graff (2017). In both sections E.1.1 and E.1.2 all of our experiments have a batch size of 3, a time horizon of N = 150 and over two types of network architectures. In Section 5 we consider the setting in which we have observed the effect of knockouts in one biological context (i.e., cell line) and would like to use it to plan a series of knockout experiments in another. We test OAE in this context by showing the effectiveness of the Mean Opt, Hinge PNorm, Max Opt, Ensemble and Ensemble Noise Y methods in successfully leveraging the learned features from a source cell line in the optimization of a particular cellular proliferation phenotype for several target cell lines. All of our vanilla OAE methods show that better expressivity of the underlying model class F allows for better performance (as measured by the number of trials it requires to find a response within a particular response quantile from the optimum). Low capacity (in our experiments Re LU neural networks with two layers of sizes of 10 and 5) models have a harder time competing against Random Batch than larger ones (Re LU neural networks with two layers of sizes 100 and 10). We also present results for linear and very high capacity models (two layer of sizes 300 and 100). E.1.1 SYNTHETIC ONE DIMENSIONAL DATASETS Figure 7 shows different one dimensional synthetic datasets that used to validate our methods. The leftmost, the One Hill Valley One Dim dataset consists of 1000 arms uniformly sampled from the interval [ 10, 10]. The responses y are unimodal. The learner s goal is to find the arm with x coordinate value equals to 3 as it is the one achieving the largest response. We use the dataset One Hill Valley One Dim Hole to test for OAE s ability to find the maximum when the surrounding points are not present in the dataset. The remaining two datasets Multi Valley One Dim and Multi Valley One Dim Hole are built with the problem of multimodal optimization in mind. Each of these datasets have 4 local maxima. We use Multi Valley One Dim to test the OAE s ability to avoid getting stuck in local optima. The second dataset mimics the construction of the One Hill Valley One Dim Hole dataset and on top of testing the algorithm s ability to escape local optima, it also is meant to test what happens when the global optimum s neighborhood isn t present in the dataset. Since one of the algorithms we test is the greedy algorithm (corresponding to λ = 0), the Hole datasets are meant to present a challenging situation for this class of algorithms. Published as a conference paper at ICLR 2023 Figure 8: NN 10-5. τ-quantile batch convergence (left) and corresponding rewards over batches (right) on the One Hill Valley One Dim (easier) and Multi Valley One Dim (harder), One Hill Valley One Dim Hole and Multi Valley One Dim Hole synthetic datasets show how the OAE algorithm can achieve higher reward faster than Random Batch or Greedy. The neural network architecture has two hidden layers of sizes 10 and 5. Figure 9: NN 100-10. τ-quantile batch convergence (left) and corresponding rewards over batches (right) on the suite of synthetic datasets when the neural network architecture has two hidden layers of sizes 100 and 10. We test several neural network architectures across all experiments. In all these cases we use Re LU activation functions trained for 5000 steps via the Adam optimizer using batches of size 10. In our tests we use a batch size B = 3, a number of batches N = 150 and repeat each experiment a total of 25 times, reporting average results with standard error bars at each time step. First we compare the Mean Opt version of OAE with a simple Random Batch strategy that selects a random batch of size B from the dataset points that have not been queried yet and a Greedy algorithm corresponding to Mean Opt with λreg = 0. Figures 8, 9 and 10 show representative results for Mean Opt with (λreg = 0.001, 0.01, 0.1) accross three different neural network architectures, NN 10-5, NN 100-10, and NN 300-100. In all cases, the high optimism versions of Mean Opt perform substantially better than Random Batch. In both multimodal datasets Greedy underperforms with respect to the versions of Mean Opt with λreg > 0. This points to the usefulness of optimism when facing multimodal optimization landscapes. We also note that for example NN 10-5 is the best performing architecture for Mean Opt with λreg = 0.001 despite the regression loss of fitting the Multi Valley One Dim s responses with a NN 10-5 architecture not reaching zero (see Figure 13). This indicates the function class F need not contain y for Mean Opt to achieve good performance. Figure 10: NN 300-100. τ-quantile batch convergence on the suite of synthetic datasets when the network architecture has two hidden layers of sizes 300 and 100. Published as a conference paper at ICLR 2023 Moreover, it also indicates the use of higher capacity models, despite being able to achieve better regression fit may not perform better than lower capacity ones in the task of finding a good performing arm. We leave the task of designing smart strategies to select the optimal network architecture for future work. It suffices to note all of the architectures used in our experimental evaluation performed better than more naive strategies such as Random Batch. Second, in Figures 11 and 12 we evaluate Mean Opt vs ensemble implementations of OAE across the two neural network architectures NN 10-5 and NN 100-10. We observe that Ensemble performs competitively with all other methods in the one dimensional datasets and outperforms all in the One Hill Valley One Dim Hole. In the multi dimensional datasets, Mean Opt performs better than Ensemble, Ensemble Noise Y and Greedy. In this case the most optimistic version of Mean Opt (λreg = .1) is the best performing of all. This may indicate that in multi modal environments the optimism injected by the random initialization of the ensemble models in Ensemble or the reward noise in Ensemble Noise Y do not induce an exploration strategy as effective as the explicit optimistic fit of Mean Opt. In unimodal datasets, the opposite is true, Mean Opt with a large regularizer (λreg = 0.1) underperforms Ensemble, Ensemble Noise Y and Greedy. In Appendix G.1 the reader may find similar results for Max Opt and Hinge Opt. Similar results hold in that case. Figure 11: NN 10-5. τ-quantile batch convergence of Mean Optimism vs Ensemble Optimism vs Ensemble Noise Y. Figure 12: NN 100-10. τ-quantile batch convergence of Mean Optimism vs Ensemble Optimism vs Ensemble Noise Y. Figure 13: Training set regression fit curves evolution over training for the suite of synthetic datasets. Each training batch contains 10 datapoints. Published as a conference paper at ICLR 2023 Figure 15: NN 10-5. τ-quantile batch convergence on the Adult and Bike Sharing Day datasets with regression-fitted response values. E.1.2 PUBLIC SUPERVISED LEARNING DATASETS Figure 14: NN 100-10. τ-quantile batch convergence (left) Bike Sharing Day, Bike Sharing Hour and Blog Feedback datasets with original response values. We test our methods on public binary classification (Adult, Bank) and regression (Bike Sharing Day, Bike Sharing Hour, Blog Feedback) datasets from the UCI repository (Dua & Graff, 2017). In our implementation, the versions of the UCI datasets we use have the following characteristics. Due to our internal data processing that splits the data into train, test and validation sets the number of datapoints we consider may be different from the size of their public versions. Our code converts the datasets categorical attributes into numerical ones using one hot encodings. That explains the discrepancy between the number of attributes listed in the public description of these datasets and ours (see https://archive.ics. uci.edu/ml/index.php). The Bike Sharing Day dataset consists of 658 datapoints each with 13 attributes. The Bikesharing Hour dataset consists of 15642 datapoints each with 13 attributes. The Blog Feedback dataset consists of 52396 datapoints each with 280 attributes. The Adult dataset consists of 32561 datapoints each with 119 attributes. The Bank dataset (Moro et al., 2014) consists of 32950 datapoints each with 60 attributes. To evaluate our algorithm we assume the response (regression target or binary classification label) values are noiseless. We consider each observation i in a dataset to represent a discrete action, each of which has features ai and reward y (ai) from the response. In all of our experiments we use a batch size of 3 and evaluate over 25 independent runs each with 150 batches. We first use all 5 public datasets to test OAE in the setting when the true function class F is known (in this case, a neural network) is known contain the function OAE learns over the course of the batches. We train a neural network under a simple mean squared error regression fit to the binary responses (for the binary classification datasets) or real-valued responses (for the regression datasets). This regression neural network consists of a neural network with two hidden layers. In Figure 15 we present results where the neural network layers have sizes 10 and 5 and the responses are fit to the Adult classification dataset and the regression dataset Bike Sharing Day. In Appendix G.2 and Figure 32 we present results where we fit a two layer neural network model of dimensions 100 and 10 to the responses of the Bank classification dataset and the Bike Sharing Hour regression dataset. In each case we train the regression fitted responses on the provided datasets using 5, 000 batch gradient Published as a conference paper at ICLR 2023 Figure 16: Training set regression fit curves evolution over training for the UCI datasets. Each training batch contains 10 datapoints. Figure 17: Linear. τ-quantile batch convergence on the Bike Sharing Day, Bike Sharing Hour and Blog Feedback datasets. steps (with a batch size of 10). During test time we use the real-valued response predicted by our regression model as the reward for the corresponding action. This ensures the dataset s true reward response model has the same architecture as the reward model used by OAE. We use the same experimental parameters and comparison algorithms as in the synthetic dataset experiments. Figure 15 shows the results on the binary classification Adult and regression Bike Sharing Day datasets using these fitted responses. We observe the Mean Opt algorithm handily outperforms Random Batch on both datasets. Appendix G.2 shows similar results for Bike Sharing Hour and Bank. We also compare the performance of Mean Opt, Ensemble and Ensemble Noise Y in the Adult and Bike Sharing Day datasets. In both cases, ensemble methods achieved better performance than Mean Opt. It remains an exciting open question to verify whether these observations translates into a general advantage for ensemble methods in the case when y F. Given OAE s strong performance when the true and learned reward functions are members of the same function class F, we next explore the performance when they are not necessarily in the same class by revising the problem on the regression datasets to use their original, real-world responses. Figure 14 shows results for the Bike Sharing Day, Bike Sharing Hour and Blog Feedback datasets. In this case Ensemble outperforms both Random Batch in τ-quantile convergence time. In all of these plots we observe that high optimism approaches underperform compared with low optimism ones. Ensemble and Greedy (the degenerate version λreg = 0 of Mean Opt) achieve the best performance across all three datasets. We observe the same phenomenon take place even when F is a class of linear functions (see figure 18). Just as we observed in the case of the suite of synthetic datasets, setting F to be a class of linear models Mean Opt still achieves substantial performance gains w.r.t Random Batch (see figure 17) despite its regression fit loss never reaching absolute zero (see figure 16). In Appendix G.3, figure 33 we compare the performance of Max Opt and Hinge Opt with Random Batch when F is a class of neural networks with hidden layers of sizes 100 and 10. We observe the performance of Max Opt, although beats Random Batch is suboptimal in comparison with Mean Opt. In contrast Hinge Opt has a similar performance to Mean Opt. E.2 EXPERIMENTS DIVERSITY SEEKING OBJECTIVES In this section we explore how diversity inducing objectives can sometimes result in better performing variants of OAE. We implement and test Det D, Seq B, Ensemble Seq B and Published as a conference paper at ICLR 2023 Figure 18: Linear. τ-quantile batch convergence on the Bike Sharing Day, Bike Sharing Hour and Blog Feedback datasets. Figure 20: NN 10-5.τ-quantile batch convergence performance of Det D vs Random Batch in the suite of synthetic datasets. Ensemble Seq B Noise Y. In all the experiments we have conducted we kept the batch size b and the number of batches N for each dataset equal to the settings used in Section E.1. The neural network architectures are the same we have considered before, two layer networks with Re LU activations. E.2.1 OAE Dv D Figure 19: NN 100-10. τ-quantile batch convergence performance of Det D vs Random Batch in the Bike Sharing Day dataset. We implemented and tested the Det D algorithm described in Section C.0.1. In our experiments we set λdiv = 1 and set eft to be the result of solving the Mean Opt objective (see problem 3) for different values of λreg. We set K to satisfy, K(a1, a2) = exp a1 a1 a2 a2 We see that across the suite of synthetic datasets and architectures (NN 10-5 and NN 100-10) the performance of Mean Opt degrades when diversity is enforced (see figures 20 and 21 and compare with figures 8 and 9). A similar phenomenon is observed in the suite of UCI datasets (see figure 19 for Det D results on the Bike Sharing dataset and figure 14 for comparison). In contrast, we note that Det D beats the performance of Random Batch in the A373, MCF7 and HA1E datasets and over the two neural architectures NN 10-5 and NN 100-10. Nonetheless, the Det D is not able to beat Random Batch in the VCAP dataset. These results indicate that a diversity objective that relies only on the geometry of the arm space and does not take into account the response values may be beneficial when y F but could lead to a deterioration in performance when y F. The algorithm designer should be careful when balancing diversity objectives and purely optimism driven exploration strategies, since the optimal combination depends on the nature of the dataset. It remains an interesting avenue for future research to design strategies that diagnose in advance the appropriate diversity/optimism balance for OAE Dv D. Published as a conference paper at ICLR 2023 Figure 21: NN 100-10. τ-quantile batch convergence performance of Det D vs Random Batch in the suite of synthetic datasets. Figure 22: NN 10-5. τ-quantile batch convergence performance of Det D vs Random Batch in the suite of genetic perturbation datasets. E.2.2 OAE Seq In this section we present our experimental evaluation of the three tractable implementations of OAE Seq we described in Section C.0.2. In our experiments we set the optimism-pessimism weighting parameter α = 1/2 and the acquisition function to Aavg. We are primarily concerned with answering whether augmenting the Mean Opt, Ensemble and Ensemble Noise Y methods with a sequential in batch selection mechanism leads to an improved performance for OAE. We answer this question in the affirmative. In our experimental results we show that across datasets and neural network architectures adding in batch sequential optimism either improves or leads to no substantial degradation in the number of batches OAE requires to arrive at good arm. Figure 24: NN 100-10. τ-quantile batch convergence performance of (left) Mean Opt vs Seq B, (center) Ensemble vs Ensemble Seq B and (right) Ensemble Noise Y vs Ensemble Seq B Noise Y on the suite of synthetic datasets. Figure 23: NN 100-10. τ-quantile batch convergence performance of Det D vs Random Batch in the suite of genetic perturbation datasets. Published as a conference paper at ICLR 2023 Figure 26: NN 10-5. τ-quantile batch convergence performance of (left) Mean Opt vs Seq B, (center) Ensemble vs Ensemble Seq B and (right) Ensemble Noise Y vs Ensemble Seq B Noise Y on the suite of synthetic datasets. Figure 25: NN 100-10. τ-quantile batch convergence performance of Mean Opt vs Seq B on the gene perturbation datasets. More precisely in figures 26 and 24 we show that in the set of synthetic datasets adding a sequential in batch selection rule improves the performance of OAE across the board for (almost) all datasets and all methods Mean Opt, Ensemble and Ensemble Noise Y and two neural architectures NN 10-5 and NN 100-10. Similar gains are observed when incorporating a sequential batch selection rule to OAE in the Bike Sharing Day and Bike Sharing Hour datasets when F corresponds to the class NN 100-10 (see figure 27). Finally, we observe the performance of OAE Seq either did not degrade or slightly improved that of Mean Opt, Ensemble and Ensemble Noise Y in the Blog Feedback and the genetic perturbation datasets (see figures 34, 25 and 38). We conclude that incorporating a sequential batch decision rule, although it may be computationally expensive, is a desirable strategy to adopt. It Published as a conference paper at ICLR 2023 Figure 27: NN 100-10. τ-quantile batch convergence performance of (left) Mean Opt vs Seq B, (center) Ensemble vs Ensemble Seq B and (right) Ensemble Noise Y vs Ensemble Seq B Noise Y on the Bike Sharing Day and Bike Sharing Hour datasets. is interesting to note that in contrast with Det D the diversity induced by Seq B did not alleviate the subpar performance of Mean Opt in the set of genetic perturbation datasets. This can be explained by Seq B induces query diversity by fitting fake responses and it therefore may be limited by the expressiveness of F. The Det D instead injects diversity by using the geometry of the arm space and may bypass the limitations of exploration strategies induced by F. Unfortunately Det D may result in suboptimal performance when y F. F CELLULAR PROLIFERATION PHENOTYPE Let G be the list of genes present in CMAP also associated with proliferation phenotype according to the Seurat cell cycle signature, and let xi,g represent the level 5 gene expression of perturbation ai for gene g G. We define the proliferation reward for perturbation ai as the average expression of the genes in G, f prolif (ai) = 1 For convenience, G = {AURKB, BIRC5, CCNB2, CCNE2, CDC20, CDC45, CDK1, CENPE, GMNN, KIF2C, LBR, NCAPD2, NUSAP1, PCNA, PSRC1, RFC2, RPA2, SMC4, STMN1, TOP2A, UBE2C, UBR7, USP1}. Published as a conference paper at ICLR 2023 G FURTHER EXPERIMENTS G.1 SYNTHETIC DATASETS Figure 28: NN 10-5. Max Optimism comparison vs Random Batch over the suite of synthetic datasets. Figure 29: NN 100-10. Max Optimism comparison vs Random Batch over the suite of synthetic datasets. Published as a conference paper at ICLR 2023 Figure 30: NN 10-5. Hinge PNorm comparison vs Random Batch over the suite of synthetic datasets. Figure 31: NN 100-10. Hinge PNorm comparison vs Random Batch over the suite of synthetic datasets. Published as a conference paper at ICLR 2023 G.2 REGRESSION FITTED DATASETS Figure 32: NN 100-10. Bike Sharing Hour and Bank Regression Fitted Datasets. Our algorithms perform substantially better than Random Batch. Optimistic approaches are better than Greedy in the Bank dataset. G.3 UCI PUBLIC DATASETS Figure 33: NN 100-10. Max Optimism and Hinge PNorm Optimism comparison vs Random Batch in the Bike Sharing Day, Bike Sharing Hour and Blog Feedback datasets. Published as a conference paper at ICLR 2023 Figure 34: NN 100-10. τ-quantile batch convergence performance of (left) Mean Opt vs Seq B, (center) Ensemble vs Ensemble Seq B and (right) Ensemble Noise Y vs Ensemble Seq B Noise Y on the Blog Feedback dataset. G.4 TRANSFER LEARNING BIOLOGICAL DATASETS Figure 35: NN 10-5. Max Optimism and Hinge PNorm Optimism comparison vs Random Batch. Published as a conference paper at ICLR 2023 Figure 36: NN 100-10. Max Optimism and Hinge PNorm Optimism comparison vs Random Batch. Published as a conference paper at ICLR 2023 Figure 37: NN 10-5. τ-quantile batch convergence of Mean Opt (left), Ensemble and Ensemble Noise Y (right) on genetic perturbation datasets. Published as a conference paper at ICLR 2023 Figure 38: NN 10-5. τ-quantile batch convergence performance of (left) Mean Opt vs Seq B, (center) Ensemble vs Ensemble Seq B and (right) Ensemble Noise Y vs Ensemble Seq B Noise Y on the genetic perturbation datasets.