# Adaptive Sampling for Discovery

Ziping Xu (Department of Statistics, University of Michigan, zipingxu@umich.edu), Eunjae Shim (Department of Chemistry, University of Michigan, eunjae@umich.edu), Ambuj Tewari (Department of Statistics, University of Michigan, tewaria@umich.edu), Paul Zimmerman (Department of Chemistry, University of Michigan, paulzim@umich.edu)

Abstract: In this paper, we study a sequential decision-making problem, called Adaptive Sampling for Discovery (ASD). Starting with a large unlabeled dataset, algorithms for ASD adaptively label points with the goal of maximizing the sum of responses. This problem has wide applications to real-world discovery problems, for example drug discovery with the help of machine learning models. ASD algorithms face the well-known exploration-exploitation dilemma: an algorithm needs to choose points that yield information to improve model estimates, but it also needs to exploit the model. We rigorously formulate the problem and propose a general information-directed sampling (IDS) algorithm. We provide theoretical guarantees for the performance of IDS in linear, graph, and low-rank models. The benefits of IDS are shown in both simulation experiments and real-data experiments for discovering chemical reaction conditions.

1 Introduction

Machine learning (ML) models are becoming increasingly popular for discovery problems in areas like drug discovery [40, 5, 38] and scientific discovery [13, 23, 19]. Despite its success in real-data applications, the theoretical aspects of the problem are not fully understood in the machine learning literature. In this paper, we consider a particular case of discovery and formalize it as a problem that we call Adaptive Sampling for Discovery (ASD). In ASD, an algorithm sequentially picks $T$ samples $X_1, \dots, X_T$ out of a size-$n$ unlabeled dataset $S^n$ ($T \ll n$) and obtains their labels $Y_1, \dots, Y_T$ with the goal of maximizing the sum $\sum_{t=1}^T Y_t$. The labels correspond to the quality of discoveries, which can be continuous or ordered categorical variables depending on the type of discovery. As a sequential decision-making problem, ASD is substantially connected to the existing literature on multi-armed bandits (MAB) and other related decision-making problems, which we briefly review in the rest of this section.

Multi-armed bandits. MAB is a decision-making problem in which the goal is to maximize the total reward by sequentially pulling arms, each of which corresponds to a reward distribution [27]. MAB faces the well-known exploration-exploitation dilemma, and ASD must deal with the same dilemma if one treats each unlabeled point as an arm. The major difference between ASD and MAB is that in MAB an arm can be pulled infinitely many times after being pulled once, while in ASD a sample can be labeled only once. This is because a discovery can only be made once: one receives no credit for making the same discovery twice. In fact, as we will show later, ASD reduces to an MAB when each unique value in $S^n$ has more than $T$ repeated samples and $n \geq T^2$. Moreover, a variant of MAB called Sleeping Experts [22, 24] can be seen as a more general problem setup than ASD. One can intuitively interpret Sleeping Experts as a bandit problem where the set of available arms varies. More generally, bandit algorithms that allow an adversarial choice of the action set can be used to solve the ASD problem.
However, due to its generality, the regret bounds developed in [24] are loose or vacuous in our setup; detailed discussions are deferred to Section 2. Due to the similarities between bandits and ASD, we adopt the same performance measure, regret, from the bandit literature: regret is the highest sum of labels that could have been achieved minus the actual sum of labels. There is a long line of work in the bandit literature on resolving the exploration-exploitation tradeoff. Popular methods include UCB (upper confidence bound) and TS (Thompson sampling). UCB keeps an optimistic view of the uncertain model [4, 26], which encourages the algorithm to select arms with higher uncertainty. TS [1] adopts a Bayesian framework, sampling a reward vector from its posterior distribution and selecting arms accordingly. In this paper, we adopt another family of methods called Information-Directed Sampling (IDS). IDS [33] balances the expected information gain about the unknown model against the instant regret with respect to the posterior distribution under a Bayesian framework. IDS has been shown to outperform UCB in problems with rich structural information, e.g. sparse linear bandits [14]. As we show below (Proposition 1), the ASD problem is not interesting in the unstructured case, which means we have to use structural assumptions. That makes IDS a natural choice.

Traditional MAB algorithms designed for finite-armed bandits are not applicable to ASD since the number of unlabeled samples $n$ can be much larger than the horizon $T$. There are works considering infinite-armed bandits by assuming structural information across arms. For instance, the linear bandit assumes a linear model for reward generation [27], [41] assumes Lipschitz continuity, and [42] considers a neural network model for reward generation. However, there are few works studying sleeping experts with structure across arms beyond linear structure.

Other related literature. With a similar adaptive sampling procedure, Active Learning (AL) aims at achieving better model accuracy with a limited number of labels. As the reader may have noticed, the goal of AL aligns with that of ASD in the early stage, when the model has high uncertainty and the predictions are highly unreliable, while in the later stage ASD aims at better discovery performance. Indeed, there are active learning algorithms based on the idea of maximum information gain [3]. [29] applies the idea to the matrix completion problem, significantly improving prediction accuracy over random selection. Other works with a similar goal go by the name sequential exploration [6, 7, 17]; these works lack theoretical justification and their algorithms are not generally applicable. A more relevant setup in the active learning literature is Active Search (AS), an active learning setting with the goal of identifying as many members of a given class as possible under a labeling budget. Work on AS centers on Bayesian optimal methods named Efficient Nonmyopic Search (ENS) [21, 20, 32], which are not computationally efficient, and the approximate algorithms for the Bayesian optimal method do not provide strong theoretical guarantees. A more detailed comparison between our proposed approach and ENS is given in Appendix I.

Main contributions. In this paper, we formulate the adaptive sampling for discovery problem and propose a generic algorithm using the information-directed sampling (IDS) strategy.
We theoretically analyze the regret of IDS under generalized linear model, low-rank matrix model, and graph model assumptions. The analysis for the linear model follows directly from IDS for linear bandits; the regret analyses for the low-rank model and the graph model are new even in the bandit literature. The results are complemented by simulation studies. We apply the algorithms to real-world problems in reaction condition discovery, where our algorithm discovers about 20% more plausible reaction conditions than random policies and other simple baseline policies adapted from the bandit literature.

2 Formulation

In this section, we formally define ASD and rigorously discuss its connections to MAB.

Notation. We first introduce some notation used repeatedly in this paper. For any finite set $S$, we let $\mathcal{D}(S)$ be the set of all distributions over the set. For any positive integer $n$, we let $[n] = \{1, \dots, n\}$. We use the standard $O(\cdot)$ and $\Omega(\cdot)$ notation to hide universal constant factors, and write $a \lesssim b$ and $a \gtrsim b$ to indicate $a = O(b)$ and $a = \Omega(b)$. ASD is a sequential decision-making problem over discrete time steps. In general, we let $\mathcal{F}_t$ denote the observations up to decision step $t$. We adopt a Bayesian framework and let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \mathcal{F}_t]$ be the conditional expectation given the history up to $t$. Let $\mathbb{P}_t$ be the probability measure of the posterior distribution.

Adaptive sampling for discovery. We formulate the ASD problem as follows. Consider a problem with covariate space $\mathcal{X} \subseteq \mathbb{R}^d$ and label space $\mathcal{Y} \subseteq [0, 1]$. For categorical discoveries, the labels are ordered such that higher labels reflect better discoveries. Given an input $x \in \mathcal{X}$, its label is generated from a distribution $D_{\theta|x}$ over $\mathcal{Y}$ parametrized by unknown parameters $\theta$. We adopt a Bayesian framework and denote the prior distribution of $\theta$ by $\phi$. Given a sequence of $t-1$ observations $\mathcal{F}_t = \{(X_i, Y_i)\}_{i=1}^{t-1}$, the posterior distribution of $\theta$ is denoted by $\phi(\cdot \mid \mathcal{F}_t)$. We also let $f_\theta(x) = \mathbb{E}_{Y \sim D_{\theta|x}}[Y]$ be the expected label of input $x$ under model $\theta$. ASD starts with an unlabeled dataset $S^n \subseteq \mathcal{X}$ of $n$ samples. Our goal is to adaptively label $T$ samples $\{X_1, \dots, X_T\} \subseteq S^n$ to maximize the sum of labels $\sum_{t=1}^T Y_t$. Each labeled point is removed from the available unlabeled set, i.e. each arm can only be pulled once. At each step $t$, an algorithm chooses an input $X_t$ from the remaining unlabeled set $S^n_t := S^n \setminus \{X_1, \dots, X_{t-1}\}$, and the environment returns the label $Y_t \sim D_{\theta|X_t}$.

To measure the performance of any algorithm, we borrow the definition of Bayesian regret from the bandit literature [27]. Note that this regret definition is also used for Sleeping Experts [22], a more general framework than ASD. With respect to any chosen random sequence $(X_1, \dots, X_T)$, we define the regret by

$$R_T = \max_{\{x_1, \dots, x_T\} \subseteq S^n} \sum_{t=1}^T f_\theta(x_t) - \sum_{t=1}^T f_\theta(X_t). \tag{1}$$

An algorithm $\mathcal{A}$ is a sequence of policies $(\pi_1, \dots, \pi_T)$. Let $\mathcal{D}(S^n_t) \subseteq \mathbb{R}^{n-t+1}$ be the set of all distributions over $S^n_t$, represented as $(n-t+1)$-dimensional vectors. Each $\pi_t$ is a mapping from the history $\mathcal{F}_t$ to a distribution in $\mathcal{D}(S^n_t)$. The Bayesian regret of an algorithm is $BR(T, \mathcal{A}) := \mathbb{E} R_T$, where the expectation is taken over the randomness in the choice of the inputs $X_t$, the labels $Y_t$, and the prior distribution on $\theta$. Let $X^*_1, \dots, X^*_T$ be the sequence that realizes the maximum in (1); note that $X^*_1, \dots, X^*_T$ are random variables depending on $\theta$. In general, the goal of the algorithm is to achieve a sublinear regret upper bound.
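To make the interaction protocol concrete before connecting it to bandits, the following is a minimal sketch of one ASD episode in Python. The environment interface (`sample_label`), the `policy` signature, and all names are illustrative assumptions of ours, not part of any released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_asd(S_n, sample_label, policy, T):
    """Run one ASD episode: adaptively label T points out of S_n.

    S_n          : (n, d) array of unlabeled inputs.
    sample_label : function x -> noisy label Y ~ D_{theta|x}.
    policy       : function (history, remaining) -> chosen dataset index.
    """
    history = []                      # F_t = {(X_i, Y_i)}
    remaining = list(range(len(S_n)))
    total = 0.0
    for t in range(T):
        i = policy(history, remaining)
        x, y = S_n[i], sample_label(S_n[i])
        history.append((x, y))        # each point is labeled at most once,
        remaining.remove(i)           # so it leaves the unlabeled pool
        total += y
    return total, history

# Example baseline: the uniformly random policy compared in Section 5.
def random_policy(history, remaining):
    return remaining[rng.integers(len(remaining))]
```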
Connections to bandit problems. This problem degenerates to a classical bandit problem when $\mathcal{X}$ is discrete and $S^n$ contains infinitely many unlabeled samples for each unique value in $\mathcal{X}$. In the bandit literature some frequentist results are available; the problem has minimax regret rate $\sqrt{|\mathcal{X}| T}$. However, in our problem we generally consider a large unlabeled set ($n \gg T$). In fact, when $|\mathcal{X}|$ is larger than $T^2$, Proposition 1 shows that linear regret is unavoidable without a structural assumption. Therefore, we discuss different types of structural information that allow information sharing across different inputs.

Proposition 1 (Lower bound for the unstructured case). Consider the following set of ASD problems. Let $n = T^2$, $S^n = [n]$ and $D_{\theta|x} = N(\theta_x, 1)$ for $x \in [n]$ and $\theta \in [0, 1]^n$. For any algorithm $\mathcal{A}$, there exists some prior distribution over the set of ASD problems described above such that $BR(T, \mathcal{A}) \gtrsim T$.

Comparison with sleeping experts. The sleeping experts setting considers a generalized bandit framework where the available (awake) action sets $\mathcal{A}_t$ change over time. It is the same as ASD except that in ASD the changes of the unlabeled dataset (the awake action set) follow a particular dynamic: whenever an input is labeled, it is removed from the unlabeled dataset. Sleeping experts, however, typically assume a stochastic distribution on $\mathcal{A}_t$ or adversarial changes [24, 9], which makes the resulting regret bounds loose. To be specific, [22] derives regret bounds for unstructured bandits, which leads to a regret bound of order $\sqrt{nT}$ that is vacuous in our setting. For the structured case, [24] derives the same regret bound as in this paper for the linear model, due to its strong structural assumption. There is little literature on sleeping bandits with complicated structures beyond the linear model: the regret bounds in the bandit literature for the other two models we study, the graph and low-rank matrix models, are not applicable, both resulting in an $\Omega(T)$ regret bound, which affirms the necessity of studying the ASD problem in its own right.

3 Information-Directed Sampling and a Generic Regret Bound

In this section, we introduce the Information-Directed Sampling strategy and develop a generic regret bound. IDS runs by evaluating the information gain about the indices of the top $T$ inputs from labeling a given input $x$. It balances obtaining low expected regret in the current step against acquiring much information about the inputs that are worth labeling. We first introduce entropy and information gain.

Definition 1 (Entropy and information gain). For any random variable $X \in \mathcal{X}$ with density $P$, its entropy is defined as $H(X) = -\int_{x \in \mathcal{X}} P(x) \log(P(x)) \, dx$. Let $H(X \mid Z)$ be the conditional entropy of $X$ given $Z$. The mutual information between any two random variables $X$ and $Y$ is defined as $I(X; Y) = H(X) - H(X \mid Y)$, and the conditional mutual information is $I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$. Furthermore, we write $H_t$ and $I_t$ for the entropy and mutual information under the posterior distribution at step $t$.

For IDS, we let $\Delta_t(x) = \mathbb{E}_t[\max_{x' \in S^n_t} f_\theta(x') - f_\theta(x)]$ be the expected instant regret. Let the information gain of selecting $x$ at step $t$ be $g_t(x) = I_t(X^*_{t,1}, \dots, X^*_{t,T-t+1}; Y_t \mid \mathcal{F}_{t-1}, X_t = x)$, where

$$(X^*_{t,1}, \dots, X^*_{t,T-t+1}) = \Big\{ (x_1, \dots, x_{T-t+1}) : x_i \in S^n_t, \; f_\theta(x_1) \geq \dots \geq f_\theta(x_{T-t+1}) \geq \max_{x' \neq x_1, \dots, x_{T-t+1}} f_\theta(x') \Big\}$$

is the random vector of the top $T - t + 1$ unlabeled points. Note that $\{X^*_{t,1}, \dots, X^*_{t,T-t+1}\} \subseteq \{X^*_1, \dots, X^*_T\}$. Intuitively, $g_t(x)$ captures the amount of information about the indices of the top $T - t + 1$ points gained by labeling $x$.
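Before turning to the IDS ratio itself, here is a small numerical companion to Definition 1 for discrete random variables, using the identity $I(X;Y) = H(X) + H(Y) - H(X, Y)$; the joint distribution below is a made-up example, not data from the paper.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution given as an array."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(p_xy):
    """p_xy[i, j] = P(X = i, Y = j)."""
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

# A joint distribution where Y is informative about X:
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(mutual_information(p_xy))  # positive: observing Y reduces uncertainty in X
```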
IDS minimizes the following ratio:

$$\pi^{IDS}_t \in \arg\min_{\pi \in \mathcal{D}(S^n_t)} \Psi_{t,\lambda}(\pi) := \frac{(\Delta_t^\top \pi)^\lambda}{g_t^\top \pi},$$

for some constant $\lambda > 0$, where $\Delta_t \in \mathbb{R}^{n-t+1}$ and $g_t \in \mathbb{R}^{n-t+1}$ are the vectors of expected single-step regrets and information gains. The constant $\lambda$ weighs instant regret against information gain: a higher $\lambda$ prefers points with lower instant regret. An abstract algorithm is given in Algorithm 1.

Algorithm 1: IDS (Information-Directed Sampling) for Discovery
Input: unlabeled dataset $S^n$, prior distribution $\phi$, total number of steps $T$, constant $\lambda$.
Initialize history $\mathcal{F}_0 = \{\}$, $S^n_1 = S^n$.
for $t = 1$ to $T$ do
  Calculate $\Delta_t(x)$ and $g_t(x)$ for each $x \in S^n_t$ using the posterior $\phi(\cdot \mid \mathcal{F}_{t-1})$.
  Sample $X_t \sim \arg\min_\pi \Psi_{t,\lambda}(\pi)$ and receive the label $Y_t \sim D_{\theta|X_t}$.
  Update the history $\mathcal{F}_t = \mathcal{F}_{t-1} \cup \{(X_t, Y_t)\}$ and the unlabeled dataset $S^n_{t+1} = S^n_t \setminus \{X_t\}$.
end for

We can show that the following generic Bayesian regret upper bound holds.

Lemma 1. Suppose $\pi^{IDS} = (\pi_t)_{t \in [T]}$ and $\Psi_{t,\lambda}(\pi_t) \leq \Psi_{*,\lambda}$. Then IDS with constant $\lambda$ has the regret bound

$$BR(T, IDS) \leq \big( \Psi_{*,\lambda} \, H(X^*_1, \dots, X^*_T) \, T^{\lambda - 1} \big)^{1/\lambda}.$$

As in the bandit problem, the regret bound depends on a worst-case bound on the information ratio and on the entropy of the random variables for the top $T$ unlabeled points. We now discuss how to bound information ratios under various structural assumptions.

4 Bounding the Information Ratio under Structural Assumptions

In this section, we study three cases with different structural information.

4.1 Generalized linear model

In this subsection, we discuss a generalized regression model, which includes logistic regression, corresponding to a natural binary discovery problem. Consider a $d$-dimensional generalized linear regression problem. The label $Y$ given an input $X$ is generated by $Y = \mu(X^\top \theta^*) + \epsilon$, where $\mu$ is usually referred to as the link function. We assume that $Y$ is uniformly bounded in $[0, 1]$. In addition, we assume an upper bound on the derivative of $\mu$.

Assumption 1. The first-order derivative of $\mu$ is upper-bounded by $L_\mu$.

Theorem 1. Under Assumption 1, we have $\Psi_{*,2} \leq L_\mu d / 2$, which gives the Bayesian regret bound

$$BR(T, IDS) \leq \sqrt{d \, H(X^*_1, \dots, X^*_T) \, L_\mu T}. \tag{2}$$

A naive bound for $H(X^*_1, \dots, X^*_T)$ from [33] does not apply here because it introduces a $\log(n)$ dependency. Instead, we notice that $X^*_1, \dots, X^*_T$ is a deterministic function of $\theta$ and $S^n$, so we can bound the entropy of $\theta$ instead. Assuming a Gaussian prior on $\theta$, $H(\theta) \lesssim d \log(d)$. The regret bound in (2) is analogous to the regret bound for linear bandits [33, 27]; theoretically, IDS does not show an advantage here. This is partly because carefully designed algorithms for the linear model already exploit the structural information efficiently. [14] showed that IDS achieves a much tighter bound for sparse linear bandits than UCB-type algorithms and Thompson sampling. We show in Appendix G that IDS enjoys a regret bound of $O(s T^{2/3})$ in the discovery setting with a sparse linear structure, with $s$ being the sparsity.
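To make Algorithm 1 concrete, the following is a minimal sketch for the Bayesian linear model of this subsection, assuming a $N(0, \sigma_\theta^2 I)$ prior and Gaussian noise so that the posterior is available in closed form. The exact information gain $g_t$ is replaced by a crude sampling-based variance proxy in the spirit of Section 5.1, and the minimization over distributions $\pi$ is restricted to point masses for brevity; all function and variable names are our own illustrative choices, not the paper's released code.

```python
import numpy as np

def ids_linear(S, sample_label, T, s_th=1.0, s_eps=1.0, lam=2,
               n_mc=200, seed=0):
    """Sketch of Algorithm 1 for a Bayesian linear model."""
    rng = np.random.default_rng(seed)
    n, d = S.shape
    prec = np.eye(d) / s_th**2              # posterior precision matrix
    b = np.zeros(d)                         # precision-weighted mean
    remaining, picks = list(range(n)), []
    for t in range(T):
        cov = np.linalg.inv(prec)
        thetas = rng.multivariate_normal(cov @ b, cov, size=n_mc)
        preds = thetas @ S[remaining].T     # f_theta(x) per posterior draw
        # Delta_t(x): posterior-expected shortfall vs. best remaining point.
        delta = np.mean(preds.max(axis=1, keepdims=True) - preds, axis=0)
        g = preds.var(axis=0) + 1e-12       # crude variance proxy for g_t(x)
        i = int(np.argmin(delta**lam / g))  # Psi_{t,lam} over point masses
        x = S[remaining[i]]
        y = sample_label(x)                 # oracle label Y ~ D_{theta|x}
        prec += np.outer(x, x) / s_eps**2   # conjugate Gaussian update
        b += y * x / s_eps**2
        picks.append(remaining.pop(i))
    return picks
```

In the exact algorithm, $\Psi_{t,\lambda}$ is minimized over all of $\mathcal{D}(S^n_t)$; for the bandit information ratio with $\lambda = 2$, [33] show the minimizer is supported on at most two points, so the point-mass restriction above is a simplification rather than the method itself.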
4.2 Low-rank matrix

In this subsection, we discuss a novel problem setup: low-rank matrix discovery. The low-rank matrix completion problem has been extensively studied in the literature [25, 18, 31]. However, most algorithms select entries uniformly at random. In many applications, appropriately identifying a subset of missing entries in an incomplete matrix is of paramount practical importance; for example, we can encourage users to rate certain movies for movie-rating matrix recovery. While selecting the users that reveal more information is important, it is also important to push the movies that align with users' tastes. Active matrix completion has long been studied in the literature [8, 29, 15]. As far as we know, no work has considered a discovery setup, where the goal of adaptive sampling is not only to obtain better recovery accuracy but also to maximize the sum of observed entries.

Formulation. Consider an unknown matrix $Y \in \mathbb{R}^{m_1 \times m_2}$ whose entries $(i, j) \in [m_1] \times [m_2]$ are sampled as $Y_{i,j} = e_i^\top M e_j + \epsilon_{i,j}$, where $\epsilon_{i,j}$ is the noise for entry $(i, j)$ and $M \in \mathbb{R}^{m_1 \times m_2}$ is an unknown rank-$r$ matrix with $r \ll \min\{m_1, m_2\}$. The matrix $M$ admits the SVD $M = U D V^\top$, where $U \in \mathbb{R}^{m_1 \times r}$ and $V \in \mathbb{R}^{m_2 \times r}$ are orthogonal matrices and $D \in \mathbb{R}^{r \times r}$. For simplicity, we let $m_1 = m_2 = m$; our analysis easily extends to the case $m_1 \neq m_2$. Note that this setup is closely connected to the low-rank bandit problem, except that the actions are standard basis pairs and can only be selected once. The best results in the literature [28] achieve a regret bound of $O(m^{3/2} \sqrt{rT})$. Since in our case $T \leq m^2$, that regret bound becomes vacuous, which implies that we need further assumptions on the structure of the matrix. A common assumption in matrix completion is incoherence, an important concept that measures how the subspace aligns with the standard basis.

Definition 2 (Coherence). The coherence of a subspace $U$ with respect to the $i$-th standard basis vector $e_i$ is defined as $\mu_i(U) = \|P_U e_i\|_2^2$, where $P_U$ is the orthogonal projection onto the subspace defined by $U$. Note that $\mu_i(U) = \|U^\top e_i\|_2^2$ as well.

Assumption 2 (Incoherence). There exists some constant $\gamma > 0$ such that the unknown matrix $M$ is incoherent, i.e. $\max_{i \in [m]} \mu_i(U) \leq \sqrt{\gamma r / m}$ and $\max_{i \in [m]} \mu_i(V) \leq \sqrt{\gamma r / m}$.

To state our results, we further define some constants.

Remark 1. Let $d$ be the maximum singular value of the unknown matrix $M$. We have
$$\max_{i,j} |M_{i,j}| \leq \max_{i,j} \mu_i(U) \mu_j(V) \|D\|_F \leq \gamma r^{3/2} d / m =: B.$$

Theorem 2. Under the low-rank matrix model with Assumption 2, the worst-case information ratio can be bounded by $\Psi_{*,3} \leq 4B(B^2 + 1) r^3 \gamma^2$, which gives the regret bound
$$BR(T, IDS) \leq \big( 4B(B^2 + 1) \, r^3 \gamma^2 \, H(M) \, T^2 \big)^{1/3}.$$

The entropy term $H(M)$ depends on the distribution of $M$. As an example, let $M = UV^\top$, where $U, V \in \mathbb{R}^{m \times r}$ and each element of $U$ and $V$ is sampled independently from a standard Gaussian. Then $H(M) \leq H((U, V)) = H(U) + H(V) \lesssim mr \log(mr)$. In general, the regret bound is still sublinear when $mr^4 \lesssim T \leq m^2$.

4.3 Graph model

We consider the discovery problem with graph feedback. Specifically, we are given a graph $G = (S^n, E)$, where the unlabeled set $S^n$ is the node set. By labeling a node $x \in S^n$, we receive noisy outcomes for all the nodes connected to $x$. Let the side information at step $t$ be $O_t(X_t) = \{\tilde{Y}_{x'}\}_{x' : (x', X_t) \in E}$, where $\tilde{Y}_{x'} = \theta_{x'} + \epsilon_{x'}$ for node value $\theta_{x'}$ and zero-mean noise $\epsilon_{x'}$, given that $x$ is selected. In addition to the side information, we also receive the label for $X_t$, which is sampled in the same manner.
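The following sketch illustrates this graph-feedback observation model: labeling node $x$ returns its noisy value plus noisy side observations for every neighbor of $x$. Names are illustrative; `adj` is an assumed boolean adjacency matrix over $S^n$, and the node values follow the Gaussian setup used in Section 5.2.

```python
import numpy as np

def observe(x, theta, adj, sigma_eps, rng):
    """Return (Y_x, side_info) for labeling node x."""
    y = theta[x] + rng.normal(0.0, sigma_eps)
    neighbors = np.flatnonzero(adj[x])
    side_info = {int(v): theta[v] + rng.normal(0.0, sigma_eps)
                 for v in neighbors}
    return y, side_info

rng = np.random.default_rng(0)
n = 900
theta = rng.standard_normal(n)               # node values, as in Section 5.2
adj = rng.random((n, n)) < 0.01              # random graph with p = 0.01
adj = np.triu(adj, 1)                        # drop self-loops / lower triangle
adj = adj | adj.T                            # symmetrize
y, side = observe(0, theta, adj, sigma_eps=0.1, rng=rng)
```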
This setup is also studied in the sleeping expert literature [9], which gives a regret bound depending on $\mathbb{E}[\sum_x T_x / Q_x]$, where $T_x$ is the total number of visits to action $x$ and $Q_x$ is the total number of observations of action $x$. The ratio can be low when an algorithm tends to select nodes with high degrees, but their algorithms are not designed for that purpose. Therefore, the structure is not sufficiently exploited and the regret bound can be loose. To measure the complexity of a graph, we introduce the following two definitions.

Definition 3 (Maximal independent set). An independent set is a set of vertices in a graph no two of which are adjacent. A maximal independent set (MIS) is an independent set that is not a subset of any other independent set. We denote the cardinality of the smallest maximal independent set of a graph $G$ by $C(G)$.

Definition 4 (Clique cover number). A clique of a graph $G = (K, E)$ is a subset $S \subseteq K$ such that the subgraph formed by $S$ and $E$ is complete. A clique cover of a graph $G = (K, E)$ is a partition of $K$, denoted by $\mathcal{Q}$, such that each $Q \in \mathcal{Q}$ is a clique. The cardinality of the smallest clique cover is called the clique cover number, denoted by $\chi(G)$.

Remark 2. The clique cover number is guaranteed to be at least the cardinality of the smallest maximal independent set, i.e. $C(G) \leq \chi(G)$.

Since the nodes in the graph are diminishing, we further define a stable version of the MIS.

Definition 5. Let $\mathcal{G}_t(G)$ be the set of all subgraphs of $G$ with $N - t$ nodes and define $C_t(G) = \max_{G_t \in \mathcal{G}_t(G)} C(G_t)$.

Assumption 3. We assume that $|\theta_x + \epsilon_x| \leq B$ almost surely for all $x \in S^n$ and some constant $B > 0$.

Theorem 3. Under Assumption 3, the Bayesian regret of IDS with $\lambda = 2$ can be upper bounded by
$$BR(T, IDS) \lesssim \min\big\{ (B \, C_T(G) \, T)^{2/3}, \; (\chi(G) \, T)^{1/2} \big\}.$$

Let us discuss certain graphs where the regret bound is low. First, if the graph can be decomposed into $K \leq T$ complete subgraphs, then $\chi(G) \leq K$ and we have a regret bound of $\sqrt{KT}$. An example in which $C(G)$ is low while $\chi(G)$ is high is a star-shaped graph, with a set of nodes in the center of the graph connected to all other nodes; see Figure 2 in the Appendix for an illustration.

4.4 A generic result for models with structural information

As shown in Sections 4.2 and 4.3, a $T^{2/3}$ regret upper bound with a small coefficient can be derived when the model has certain structural information. We give the following generic result, stating that as long as there exists a policy $\mu$ whose information gain can be lower bounded by the instant regret of the Thompson sampling policy, an upper bound on $\Psi_{*,3}$ can be derived.

Proposition 2. Assume that there exists a policy $\mu$ such that $g_t^\top \mu \geq \varphi (\Delta_t^\top \pi^{ts}_t)^2$, where $\pi^{ts}_t$ is the Thompson sampling policy at step $t$, for some constant $\varphi$, and assume that the instant regrets are uniformly bounded by $B$. Then $\Psi_{*,3} \leq 2B/\varphi$, which gives a regret bound of $O((B T H(\theta) / \varphi)^{2/3})$.

5 Experiments

We start with simulation studies on three problems: the (generalized) linear model, the low-rank matrix model, and the graph model. Since the information gain can be hard to calculate in some problems, we introduce sampling-based approximate algorithms, which generate random samples from the posterior distribution and evaluate a variance term instead of the original information gain. We also introduce general Thompson sampling algorithms for all three problem setups.

5.1 Approximate algorithms

It may not be efficient to evaluate the information gain in certain problem setups. For example, one way to generate a random low-rank matrix is to generate its row and column spaces from a Gaussian distribution; we then have no closed form for the posterior distribution of the resulting matrix, nor can we evaluate the information gain. To that end, we follow an approximate algorithmic design in [33], which replaces the information gain with the conditional variance $v_t(x) = \mathrm{Var}_t\big(\mathbb{E}[Y_{t,x} \mid X^*_1, \dots, X^*_T, X_t = x]\big)$, evaluated using random samples from the posterior distribution obtained by MCMC. The conditional variance is a lower bound on the information gain. An approximate algorithm is given by Algorithm 2 in Appendix H.
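The following is a minimal sketch of this sampling-based proxy: draw posterior samples of $\theta$, group them by the identity of the (sampled) best remaining point, and take the variance of the group-conditional mean prediction, i.e. the variance-of-conditional-expectation term in the law of total variance. For simplicity we condition only on the top-1 point rather than the full top-$(T - t + 1)$ set; names are illustrative.

```python
import numpy as np

def variance_proxy(preds):
    """preds[m, j]: predicted label of point j under posterior sample m."""
    _, n_pts = preds.shape
    best = preds.argmax(axis=1)              # sampled identity of X*_{t,1}
    overall = preds.mean(axis=0)             # unconditional mean prediction
    v = np.zeros(n_pts)
    for a in np.unique(best):
        mask = best == a
        p_a = mask.mean()                    # posterior P(X*_{t,1} = a)
        cond = preds[mask].mean(axis=0)      # E[f_theta(x) | X*_{t,1} = a]
        v += p_a * (cond - overall) ** 2     # Var of conditional expectation
    return v
```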
Proposition 3. The following lower bound on the information gain holds: $g_t(x) \geq v_t(x)$.

Compared algorithms. For the rest of this section, we compare IDS (the approximate algorithm), TS, UCB (where a bandit version is available), and the random policy. As a method closely connected to IDS, Thompson sampling (also referred to as posterior sampling) is compared in our experiments: at each round, Thompson sampling [36] samples a $\theta_t$ from the posterior distribution and then selects the input with the lowest instant regret with respect to $\theta_t$. TS does not account for the amount of information gained, and thus can be insufficient in exploiting the structural information. UCB is an important algorithm design in the bandit literature; though it is not specifically designed for ASD, we apply UCB to (generalized) linear ASD by manually removing arms that have been selected.

5.2 Simulation studies

Linear and logistic regression models. The data-generating distributions are specified as follows. The unlabeled dataset is sampled from $N(0, \sigma_x^2)$, the prior distribution of $\theta$ is $N(0, \sigma_\theta^2)$, and the noise $\epsilon$ is sampled from $N(0, \sigma_\epsilon^2)$. We study the effects of $d = 20, 50, 100$. The same setup is used for the logistic regression model. The posterior distribution of $\theta$ in simple linear regression is also Gaussian; for logistic regression, we use a Laplace approximation to the posterior distribution [16]. We use GLM-UCB [12] for logistic regression.

Graph model. We consider different types of graphs. For each graph, the node values are sampled from a standard Gaussian distribution and the noise is sampled from a Gaussian with zero mean and variance $\sigma_\epsilon^2$; we test $\sigma_\epsilon^2 = 0.1, 1.0, 10$. We experiment on graphs with $N = 900$ nodes. For a better demonstration of the early-stage performance, we only show the first 50 steps. We experiment with the following types of graphs: 1) random connections: each pair of nodes is connected with probability $p = 0.01$; 2) complete graph: every pair of nodes has an edge; 3) star graph: 2/3 of the nodes are randomly connected with probability $p = 0.01$, and the remaining 1/3 of the nodes are connected to all nodes (an example of a star graph is given in Figure 2). Since no node in a complete graph provides more information than any other, we expect TS and IDS to perform the same. We expect IDS to outperform TS more significantly on star graphs than on random graphs, as there are more nodes gaining global information.

Low-rank matrix. We sample each entry of $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{m \times r}$ from $N(0, \sigma_0^2)$ and let the unknown matrix be $Y = UV^\top + \epsilon$, with each entry of $\epsilon$ drawn from $N(0, \sigma_\epsilon^2)$. Since we do not have a closed form for the posterior distribution, posterior samples are generated by an MCMC algorithm. In the experiment, we let $m = 30$, which gives 900 steps at maximum. For a better demonstration of the early stage, we show the cumulative regrets over the first 100 steps.
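As a concrete illustration of such posterior sampling, here is a sketch of a random-walk Metropolis sampler for the low-rank posterior, assuming independent $N(0, \sigma_0^2)$ priors on the entries of $U, V$ and Gaussian noise on the observed entries. A production sampler (such as the MCMC used in our experiments) would be tuned more carefully; all names here are our own illustrative choices.

```python
import numpy as np

def lowrank_posterior_samples(obs, m, r, s0=1.0, s_eps=0.1,
                              n_samples=200, step=0.02, seed=0):
    """obs: dict {(i, j): y_ij} of labeled entries. Returns draws of M."""
    rng = np.random.default_rng(seed)
    U = rng.normal(0, s0, (m, r))
    V = rng.normal(0, s0, (m, r))

    def log_post(U, V):
        lp = -(np.sum(U**2) + np.sum(V**2)) / (2 * s0**2)   # Gaussian prior
        for (i, j), y in obs.items():                        # likelihood
            lp -= (y - U[i] @ V[j]) ** 2 / (2 * s_eps**2)
        return lp

    lp = log_post(U, V)
    samples = []
    for _ in range(n_samples):
        U2 = U + step * rng.standard_normal(U.shape)         # symmetric
        V2 = V + step * rng.standard_normal(V.shape)         # proposal
        lp2 = log_post(U2, V2)
        if np.log(rng.random()) < lp2 - lp:                  # accept/reject
            U, V, lp = U2, V2, lp2
        samples.append(U @ V.T)                              # draw of M
    return samples
```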
Results. Figure 1 shows the simulation results for the four models above. IDS has lower cumulative regret at all steps for the linear model, the logistic model, the random graph, and the star graph. TS and IDS always outperform the random policy and UCB (where available), showing the benefit of developing algorithms specifically for ASD. In the graph experiments, IDS and TS perform almost the same on the complete graph because any input gains information about the full graph. IDS has slightly lower regret than TS on the random graph because of the weak structural information, and significantly lower regret than TS and random on the star graph because of the stronger structural information. As shown in panels (b, d), the cumulative regret at step 100 grows linearly with $d$. Higher noise also leads to higher regret, as shown in panels (f, h, j, l). IDS performs better across different levels of $d$ for the linear models and across different noise levels for the random graph, star graph, and matrix models.

[Figure 1 plots omitted.] Figure 1: Cumulative regret curves for linear regression (a, b), logistic regression (c, d), random graph, complete graph, star graph (e-j), and low-rank matrix (k, l). Columns 1 and 3 are the cumulative regret curves for the different models. Columns 2 and 4 are cumulative regrets at an early step for different hyperparameters. For the linear and logistic regression models, $d$ ranges over {20, 50, 100}. For the graph models, $\sigma_\epsilon^2$ ranges over {0.1, 1.0, 10}. For the low-rank matrix, $\sigma_\epsilon^2$ ranges over {0.1, 0.5, 1.0}.

5.3 Reaction condition discovery

To further test the usefulness of IDS and TS for ASD in real-world problems, we consider a reaction condition discovery problem in organic chemistry.

Background and dataset. In the chemical literature, a single reaction condition that gives the highest yield for one set of reacting molecules is often demonstrated on a couple dozen analogous molecules. However, for molecules with structures that differ significantly from this baseline, one needs to rediscover working reaction conditions. This poses a significant challenge, as the number of plausible reaction conditions can grow rapidly due to factors such as the many combinations of individual reagents, varying concentrations of the reagents, and temperature. Therefore, strategies that can identify as many high-yielding reaction conditions as early as possible would greatly facilitate the preparation of functional molecules. A similar reaction condition discovery problem has been studied under a transfer learning framework [30, 37]. The IDS algorithms may therefore be highly useful for examining experimental data compared to UCB, TS, and random policies. Two experimental datasets on reactions that form carbon-nitrogen bonds, which are of high importance for synthesizing drug molecules, were selected as specific test cases. The first dataset, Photoredox Nickel Dual-Catalysis (PNDC) [11], comes from a reaction that involves two types of catalysts (a photocatalyst and a nickel catalyst) that work together to form carbon-nitrogen bonds. The variables in the dataset are the identity of the photocatalyst, the relative catalyst concentration, and the relative substrate concentration, totaling $n = 80$ reaction conditions for discovery. The second dataset, C-N Cross-Coupling with Isoxazoles (CNCCI) [2], comes from a collection of cross-coupling reactions in the presence of an additive.
This dataset was designed to test the reaction's tolerance to the additive (isoxazole) and includes 330 candidate reactions. Both datasets have continuous response variables (reaction yield).

Results. Table 1 presents the results of linear IDS, TS, and UCB for PNDC and CNCCI. We tuned hyperparameters for all three algorithms; details are given in Appendix H. Over 17 steps, IDS showed a dominant performance on all 11 molecules tested in PNDC, with an average of 1.58 more discoveries than the random policy. Either IDS or TS performs best on the 9 combinations of catalyst and base tested in the CNCCI dataset. Over 66 steps, IDS discovers 14.1 more plausible reaction conditions on average than random.

Table 1: Summary of cumulative regrets at 20% of the total number of steps. Each column corresponds to a target molecule (PNDC) or a catalyst-base combination (CNCCI). The regrets are averages over 10 independent runs, with standard deviations in brackets. The best algorithm for each column is marked in bold. (-) denotes that the algorithm is deterministic. The column labels are the indices of the target molecules, whose details are provided in Appendix H.

(a) Results for PNDC

| Algorithm | X2 | X3 | X4 | X5 | X6 | X8 |
|---|---|---|---|---|---|---|
| Random | 8.77 (1.44) | 7.74 (0.75) | 2.12 (0.20) | 1.70 (0.15) | 4.96 (0.72) | 9.90 (1.47) |
| IDS | **6.53 (0.20)** | **5.70 (0.52)** | **1.56 (0.06)** | **0.81 (0.01)** | **3.44 (0.81)** | **9.35 (0.96)** |
| TS | 7.64 (1.11) | 6.67 (1.09) | 2.15 (0.48) | 1.22 (0.38) | 4.07 (1.30) | 9.78 (1.69) |
| UCB | 9.63 (-) | 7.66 (-) | 2.58 (-) | 1.42 (-) | 5.26 (-) | 10.43 (-) |

| Algorithm | X11 | X12 | X13 | X14 | X15 |
|---|---|---|---|---|---|
| Random | 7.15 (1.21) | 5.27 (0.92) | 4.44 (0.43) | 5.72 (0.35) | 2.65 (0.47) |
| IDS | **2.40 (0.12)** | **3.95 (0.19)** | **3.22 (0.21)** | **4.94 (0.07)** | **1.09 (0.03)** |
| TS | 2.90 (1.76) | 4.42 (0.90) | 4.79 (1.17) | 5.39 (0.64) | 1.35 (0.43) |
| UCB | 3.58 (-) | 5.28 (-) | 4.53 (-) | 5.86 (-) | 2.03 (-) |

(b) Results for CNCCI

| Algorithm | L1+B1 | L2+B1 | L3+B1 | L4+B1 | L1+B2 |
|---|---|---|---|---|---|
| Random | 25.65 (2.11) | 23.08 (1.98) | 19.90 (1.19) | 15.96 (0.83) | 29.59 (1.26) |
| IDS | 15.55 (1.52) | **7.90 (0.28)** | **10.03 (0.88)** | 10.81 (0.79) | **10.76 (0.51)** |
| TS | **11.84 (1.04)** | 9.82 (2.23) | 11.58 (4.31) | **10.39 (0.81)** | 12.49 (1.10) |
| UCB | 13.63 (-) | 10.08 (-) | 12.00 (-) | 11.98 (-) | 17.41 (-) |

| Algorithm | L4+B2 | L1+B3 | L3+B3 | L4+B3 |
|---|---|---|---|---|
| Random | 28.78 (2.59) | 19.84 (0.81) | 29.03 (2.22) | 27.08 (2.51) |
| IDS | **10.51 (1.32)** | **8.63 (0.97)** | **8.76 (0.32)** | **9.44 (0.52)** |
| TS | 12.68 (2.33) | 10.61 (0.99) | 12.44 (7.39) | 9.80 (2.39) |
| UCB | 15.60 (-) | 12.42 (-) | 11.49 (-) | 12.32 (-) |

6 Discussion

In this paper, we posed and comprehensively studied the Adaptive Sampling for Discovery problem and proposed the information-directed sampling algorithm along with its regret analysis. The results are complemented by both simulation and real-data experiments. There are a few open directions following the results in this paper. First, although our paper shows bandit-like regret bounds, the ASD problem is by nature different from bandits in that the unlabeled set is diminishing, which may further reduce the complexity of the problem; for example, in the linear model, all the unlabeled points may fall on a lower-dimensional hyperplane, which would reduce the dependence on $d$. Second, frequentist regret may be studied instead of adopting a Bayesian framework. More practical model classes, e.g. random forests and neural networks, may be considered in future work. Sampling from the posterior can be slow for certain models; a fast and generalizable IDS algorithm for more complicated and larger-scale real-data applications may be considered.

7 Acknowledgements

We acknowledge the support of the National Science Foundation via grants IIS-2007055 and CHE-1551994.

References

[1] Shipra Agrawal and Navin Goyal.
Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39.1-39.26. JMLR Workshop and Conference Proceedings, 2012.
[2] Derek T. Ahneman, Jesús G. Estrada, Shishi Lin, Spencer D. Dreher, and Abigail G. Doyle. Predicting reaction performance in C-N cross-coupling using machine learning. Science, 360(6385):186-190, 2018.
[3] Kasra Arnavaz, Aasa Feragen, Oswin Krause, and Marco Loog. Bayesian active learning for maximal information gain on model parameters. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 10524-10531. IEEE, 2021.
[4] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235-256, 2002.
[5] Jürgen Bajorath, Steven Kearnes, W. Patrick Walters, Nicholas A. Meanwell, Gunda I. Georg, and Shaomeng Wang. Artificial intelligence in drug discovery: into the great wide open, 2020.
[6] J. Eric Bickel and James E. Smith. Optimal sequential exploration: A binary learning model. Decision Analysis, 3(1):16-32, 2006.
[7] David B. Brown and James E. Smith. Optimal sequential exploration: Bandits, clairvoyants, and wildcats. Operations Research, 61(3):644-665, 2013.
[8] Shayok Chakraborty, Jiayu Zhou, Vineeth Balasubramanian, Sethuraman Panchanathan, Ian Davidson, and Jieping Ye. Active matrix completion. In 2013 IEEE 13th International Conference on Data Mining, pages 81-90. IEEE, 2013.
[9] Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, and Scott Yang. Online learning with sleeping experts and feedback graphs. In International Conference on Machine Learning, pages 1370-1378. PMLR, 2019.
[10] Shi Dong and Benjamin Van Roy. An information-theoretic analysis for Thompson sampling with many actions. Advances in Neural Information Processing Systems, 31, 2018.
[11] Spencer D. Dreher and Shane W. Krska. Chemistry informer libraries: Conception, early experience, and role in the future of cheminformatics. Accounts of Chemical Research, 54(7):1586-1596, 2021.
[12] Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems, 23, 2010.
[13] Yolanda Gil, Mark Greaves, James Hendler, and Haym Hirsh. Amplify scientific discovery with artificial intelligence. Science, 346(6206):171-172, 2014.
[14] Botao Hao, Tor Lattimore, and Wei Deng. Information directed sampling for sparse linear bandits. Advances in Neural Information Processing Systems, 34, 2021.
[15] Zichang He, Bo Zhao, and Zheng Zhang. Active sampling for accelerated MRI with low-rank tensors. arXiv preprint arXiv:2012.12496, 2020.
[16] Tommi S. Jaakkola and Michael I. Jordan. A variational approach to Bayesian logistic regression models and their extensions. In Sixth International Workshop on Artificial Intelligence and Statistics, pages 283-294. PMLR, 1997.
[17] Babak Jafarizadeh and Reidar Bratvold. The two-factor price process in optimal sequential exploration. Journal of the Operational Research Society, 72(7):1637-1647, 2021.
[18] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 665-674, 2013.
[19] Xiaowei Jia, Jared Willard, Anuj Karpatne, Jordan S. Read, Jacob A. Zwart, Michael Steinbach, and Vipin Kumar. Physics-guided machine learning for scientific discovery: An application in simulating lake temperature profiles.
ACM/IMS Transactions on Data Science, 2(3):1-26, 2021.
[20] Shali Jiang, Gustavo Malkomes, Matthew Abbott, Benjamin Moseley, and Roman Garnett. Efficient nonmyopic batch active search. Advances in Neural Information Processing Systems, 31, 2018.
[21] Shali Jiang, Gustavo Malkomes, Geoff Converse, Alyssa Shofner, Benjamin Moseley, and Roman Garnett. Efficient nonmyopic active search. In International Conference on Machine Learning, pages 1714-1723. PMLR, 2017.
[22] Varun Kanade, H. Brendan McMahan, and Brent Bryan. Sleeping experts and bandits with stochastic action availability and adversarial rewards. In Artificial Intelligence and Statistics, pages 272-279. PMLR, 2009.
[23] Anuj Karpatne, Gowtham Atluri, James H. Faghmous, Michael Steinbach, Arindam Banerjee, Auroop Ganguly, Shashi Shekhar, Nagiza Samatova, and Vipin Kumar. Theory-guided data science: A new paradigm for scientific discovery from data. IEEE Transactions on Knowledge and Data Engineering, 29(10):2318-2331, 2017.
[24] Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2):245-272, 2010.
[25] Vladimir Koltchinskii, Karim Lounici, and Alexandre B. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302-2329, 2011.
[26] Tor Lattimore. Optimally confident UCB: Improved regret for finite-armed bandits. arXiv preprint arXiv:1507.07880, 2015.
[27] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[28] Yangyi Lu, Amirhossein Meisami, and Ambuj Tewari. Low-rank generalized linear bandit problems. In International Conference on Artificial Intelligence and Statistics, pages 460-468. PMLR, 2021.
[29] Simon Mak, Henry Shaowu Yushi, and Yao Xie. Information-guided sampling for low-rank matrix completion. arXiv preprint arXiv:1706.08037, 2017.
[30] Gregory McInnes, Rachel Dalton, Katrin Sangkuhl, Michelle Whirl-Carrillo, Seung-been Lee, Philip S. Tsao, Andrea Gaedigk, Russ B. Altman, and Erica L. Woodahl. Transfer learning enables prediction of CYP2D6 haplotype function. PLoS Computational Biology, 16(11):e1008399, 2020.
[31] Luong Trung Nguyen, Junhan Kim, and Byonghyo Shim. Low-rank matrix completion: A contemporary survey. IEEE Access, 7:94215-94237, 2019.
[32] Quan Nguyen and Roman Garnett. Nonmyopic multiclass active search for diverse discovery. arXiv preprint arXiv:2202.03593, 2022.
[33] Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. Advances in Neural Information Processing Systems, 27, 2014.
[34] Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442-2471, 2016.
[35] Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. Operations Research, 66(1):230-252, 2018.
[36] Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1-96, 2018.
[37] Eunjae Shim, Joshua Kammeraad, Ziping Xu, Ambuj Tewari, Tim Cernak, and Paul Zimmerman. Predicting reaction conditions from limited data through active transfer learning. Chem. Sci., 2022.
[38] Thomas J. Struble, Juan C. Alvarez, Scott P. Brown, Milan Chytil, Justin Cisar, Renee L. DesJarlais, Ola Engkvist, Scott A. Frank, Daniel R. Greve, Daniel J. Griffin, et al.
Current and future roles of artificial intelligence in medicinal chemistry synthesis. Journal of Medicinal Chemistry, 63(16):8667-8682, 2020.
[39] Kiran K. Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Statistically and computationally efficient linear meta-representation learning. Advances in Neural Information Processing Systems, 34, 2021.
[40] Jessica Vamathevan, Dominic Clark, Paul Czodrowski, Ian Dunham, Edgardo Ferran, George Lee, Bin Li, Anant Madabhushi, Parantu Shah, Michaela Spitzer, et al. Applications of machine learning in drug discovery and development. Nature Reviews Drug Discovery, 18(6):463-477, 2019.
[41] Yizao Wang, Jean-Yves Audibert, and Rémi Munos. Algorithms for infinitely many-armed bandits. Advances in Neural Information Processing Systems, 21, 2008.
[42] Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with UCB-based exploration. In International Conference on Machine Learning, pages 11492-11502. PMLR, 2020.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] The limitations are discussed in Section 6.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] Depending on the area of application, the impact can be different. The potential impact of reaction discovery is discussed in Section 6.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes]
(b) Did you include complete proofs of all theoretical results? [Yes] The proofs are given in the Appendix.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The code and datasets are in the supplemental material.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] They are provided along with the code.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Figure 1 and Table 1.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] They are given in Appendix H.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [N/A]
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We include our code in the supplementary material.
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [Yes] The data were released with their publications.
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] The data do not contain personally identifiable information or offensive content.
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]