# Active Search in Intensionally Specified Structured Spaces

Dino Oglic (dino.oglic@uni-bonn.de), Institut für Informatik III, Universität Bonn, Germany
Roman Garnett (garnett@wustl.edu), Dep. of Computer Science & Eng., Washington University in St. Louis, USA
Thomas Gärtner (tg@thomasgaertner.org), School of Computer Science, The University of Nottingham, UK

We consider an active search problem in intensionally specified structured spaces. The ultimate goal in this setting is to discover structures from structurally different partitions of a fixed but unknown target class. An example of such a process is computer-aided de novo drug design. In the past 20 years, several Monte Carlo search heuristics have been developed for this process. Motivated by these hand-crafted search heuristics, we devise a Metropolis-Hastings sampling scheme in which the acceptance probability is given by a probabilistic surrogate of the target property, modeled with a max-entropy conditional model. The surrogate model is updated in each iteration upon the evaluation of a selected structure. The proposed approach is consistent, and the empirical evidence indicates that it achieves a large structural variety of discovered targets.

## 1 Introduction

We consider an active classification problem in structured spaces, where the goal is not to learn a hypothesis but to discover a diverse set of structures exhibiting a target property. A variant of this problem where the only goal is to discover targets is known as active search (Garnett et al. 2012). In the applications we consider, the search space is specified only intensionally and its cardinality is at least exponential in the size of its combinatorial objects (e.g., the number of edges in a graph). Thus, the extension of the search space can neither be completely stored on a disk nor enumerated in feasible time. The structures we aim to discover are characterized by a target property that is a priori not known for any structure and is expensive to evaluate on each structure. The evaluation process can be noisy and is simulated with an oracle. The structures exhibiting the target property are typically rare, and we cannot assume that they are concentrated in a small region of the search space. We are thus interested in finding a diverse set of candidates that spans the whole space and is likely to exhibit the target property.

Taking drug discovery as our main motivating example, several problems have been identified as the cause for the huge cost associated with attrition (Scannell et al. 2012; Schneider and Schneider 2016), i.e., drug candidates failing later stages of the development process, and increased use of algorithmic support has been proposed as a remedy (Woltosz 2012). In particular, (i) the chemspace, i.e., the space of potentially synthesizable compounds, is huge (estimates are often larger than 10^60); (ii) there are many activity cliffs, i.e., small changes in structure can have large effects on pharmaceutical activity; and (iii) existing compound libraries focus on a very restricted area of the chemspace. De novo design approaches (Schneider and Fechner 2005) aim to overcome these problems by constructing desired molecular structures from scratch.
In the past 20 years, several Monte Carlo search heuristics have been developed for de novo design of drug-like molecules (Schneider and Fechner 2005). A common property of these search heuristics is the generation of molecular structures using Markov chains. Several search heuristics incorporate an additional scoring step in which the generated structures are accepted or rejected with a probability based on a hand-crafted, energy-based scoring function. The whole process can be seen as Metropolis sampling from an expert-designed distribution. Throughout the constructive process this designed distribution is either kept static or manually updated as the process evolves.

Motivated by these hand-crafted search heuristics, we propose a data-driven approach that learns the target class of desired structures as it observes the results of new experiments. To deal with the intensionally specified search space, we assume that a proposal generator can be constructed which is specific to the application domain and has support on all parts of the space that contain the targets. Similar to the described Monte Carlo search heuristics, we model this proposal generator with a Markov chain given by its transition kernel. The transition kernel can be either conditional or independent, and in the latter case the proposal generator is an uninformed sampler. As the target structures are typically rare and expensive to evaluate, the cost per discovered structure would be prohibitively high for plain Monte Carlo search performed by evaluating each proposed structure. To overcome this, our approach relies on a max-entropy conditional model that acts as a probabilistic surrogate for the oracle evaluations. This conditional model is updated in each iteration upon the evaluation of a selected structure. As this changes the distribution of the Metropolis sampler in the following discovery step, we cannot assume that the sampled structures are drawn independently from identical distributions. We analyze the theoretical properties of this process in Section 3, where we show its consistency and bound the mixing time of the Metropolis-Hastings chain with an independent proposal generator. To study the empirical performance in silico, i.e., without conducting lab experiments, we design synthetic testbeds that share many characteristics with drug design (Section 4). In particular, instead of the chemspace, we consider the space of all graphs of a given size and aim at constructing graphs with rare and structurally non-smooth properties such as having a Hamiltonian cycle or being connected and planar. We conclude with a discussion where we contrast our approach to other related approaches (Section 5).

## 2 Algorithm

Algorithm 1 DE-NOVO-DESIGN
Input: target property $y^\ast \in \mathcal{Y}$, conditional exponential family model $p(y \mid x, \theta)$ with a regularization parameter $\lambda > 0$, proposal generator $G$, evaluation oracle $O$, and budget $B \in \mathbb{N}$
Output: list of structures $x_1, x_2, \ldots, x_B \in \mathcal{X}^B$
1: $\theta_1 \leftarrow 0$
2: for $t = 1, 2, \ldots, B$ do
3: &nbsp;&nbsp; $x_t \sim G$
4: &nbsp;&nbsp; repeat
5: &nbsp;&nbsp;&nbsp;&nbsp; $x' \sim G$ and $u \sim U[0, 1]$
6: &nbsp;&nbsp;&nbsp;&nbsp; if $u < p(y^\ast \mid x', \theta_t) / p(y^\ast \mid x_t, \theta_t)$ then $x_t \leftarrow x'$ end if
7: &nbsp;&nbsp; until CHAIN MIXED
8: &nbsp;&nbsp; $y_t \leftarrow O(x_t)$ and $w_t \leftarrow 1 / p(y^\ast \mid x_t, \theta_t)$
9: &nbsp;&nbsp; $\theta_{t+1} \leftarrow \arg\min_\theta -\frac{1}{t} \sum_{i=1}^{t} w_i \ln p(y_i \mid x_i, \theta) + \lambda \|\theta\|_{\mathcal{H}}^2$
10: end for

Algorithm 1 gives a pseudo-code description of our approach. To model the evaluation of the target property, our algorithm takes as input an oracle which outputs a label for a given structure.
To reflect the expensiveness of these evaluations, the oracle can be accessed only a limited number of times, given by a budget. Other parameters of the algorithm are the proposal generator, the target property, and parameters specifying a set of models from the conditional exponential family. In the next section, we demonstrate that for this choice of a conditional model the probabilistic surrogate for the oracle evaluations is a max-entropy model subject to constraints on the first moments of the sample.

Denote by $\mathcal{X}$ the space of candidate structures, by $\mathcal{Y}$ the space of properties, and by $\mathcal{H}$ a Hilbert space with inner product $\langle \cdot, \cdot \rangle$. The parameter set $\Theta \subseteq \mathcal{H}$ is usually a compact subset of the Hilbert space and, together with the sufficient statistics $\phi\colon \mathcal{X} \times \mathcal{Y} \to \mathcal{H}$ of $y \mid x$, specifies the set of conditional exponential models as

$$p(y \mid x, \theta) = \exp\left(\langle \phi(x, y), \theta \rangle - A(\theta \mid x)\right), \qquad (1)$$

where $A(\theta \mid x) = \ln \int_{\mathcal{Y}} \exp\left(\langle \phi(x, y), \theta \rangle\right) dy$ and $\theta \in \Theta$. In practice, we do not directly specify the parameter set $\Theta$ but instead simply regularize the importance-weighted negative log-likelihood of the sample by adding the term $\lambda \|\theta\|_{\mathcal{H}}^2$. To account for this, the algorithm takes as input a hyperparameter which controls the regularization.

The constructive process is initialized by setting the parameter vector of the conditional exponential family to zero (line 1). This implies that the first sample is unbiased and uninformed. Then, the algorithm starts iterating until we deplete the oracle budget $B$ (line 2). In the initial steps of each iteration (lines 3-7), the Metropolis-Hastings algorithm (Metropolis et al. 1953) is used to sample from the posterior $p(x \mid y^\ast, \theta_t) = p(y^\ast \mid x, \theta_t)\, p_0(x) / p_0(y^\ast)$, where $p_0(y^\ast)$ is the marginal probability of $y^\ast \in \mathcal{Y}$ and $p_0(x)$ is the stationary distribution of the proposal generator $G$, defined by a transition kernel $g$ for which the detailed balance condition holds (Andrieu et al. 2003). Thus, to obtain samples from the posterior $p(x \mid y^\ast, \theta_t)$, the Metropolis-Hastings acceptance criterion is

$$\frac{p(y^\ast \mid x', \theta_t)}{p(y^\ast \mid x_t, \theta_t)} \cdot \frac{p_0(x')\, g(x' \to x_t)}{p_0(x_t)\, g(x_t \to x')} = \frac{p(y^\ast \mid x', \theta_t)}{p(y^\ast \mid x_t, \theta_t)}, \qquad (2)$$

where $x'$ is the proposed candidate, $x_t$ is the last accepted state, $\theta_t$ is the parameter vector of the conditional exponential family model, and $g(x_t \to x')$ denotes the probability of the transition from state $x_t$ to state $x'$. After the Metropolis-Hastings chain has mixed (line 7), the algorithm outputs its last accepted state $x_t$ as a candidate structure and presents it to the evaluation oracle (line 8). The oracle evaluates it, providing feedback $y_t$ to the algorithm. The labeled pair $(x_t, y_t)$ is then added to the training sample and an importance weight is assigned to it (line 8). The importance weighting is needed for the consistency of the algorithm because the samples are neither independent nor identically distributed. Finally, the conditional exponential family model is updated by optimizing the weighted negative log-likelihood of the sample (line 9). This model is then used by the algorithm to sample a candidate structure in the next iteration.

The optimization problem in line 9 is convex in $\theta$ and the representer theorem (Wahba 1990) guarantees that it is possible to express the solution $\theta_{t+1}$ as a linear combination of sufficient statistics, i.e., $\theta_{t+1} = \sum_{i=1}^{t} \sum_{c \in \mathcal{Y}} \alpha_{ic}\, \phi(x_i, c)$ for some $\alpha_{ic} \in \mathbb{R}$. Hence, a globally optimal solution can be found and a set of conditional exponential family models can be specified using only a joint input-output kernel and a regularization parameter.
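To make the loop structure of Algorithm 1 concrete, the following Python sketch mirrors its steps under simplifying assumptions: `propose`, `oracle`, and `surrogate` are hypothetical placeholders for a domain-specific proposal generator, the evaluation oracle, and a conditional exponential family surrogate exposing a `prob` method and an importance-weighted `fit` (line 9); a fixed burn-in stands in for the CHAIN MIXED test.

```python
import random

def de_novo_design(propose, oracle, surrogate, target, budget, burn_in=1000):
    """Minimal sketch of Algorithm 1 (DE-NOVO-DESIGN).

    propose()        -- draws a structure from the proposal generator G
    oracle(x)        -- returns a (possibly noisy) label for x
    surrogate.prob(y, x)        -- surrogate probability p(y | x, theta_t)
    surrogate.fit(xs, ys, ws)   -- solves the weighted, regularized problem of line 9
    All of these are placeholders for domain-specific components.
    """
    xs, ys, ws = [], [], []          # accumulated sample and importance weights
    discovered = []
    for _ in range(budget):
        x_t = propose()              # line 3: initialize the chain
        for _ in range(burn_in):     # lines 4-7: Metropolis-Hastings with an
            x_new = propose()        # independent proposal generator
            ratio = surrogate.prob(target, x_new) / surrogate.prob(target, x_t)
            if random.random() < ratio:
                x_t = x_new          # accept the proposed structure
        y_t = oracle(x_t)            # line 8: expensive oracle evaluation
        w_t = 1.0 / surrogate.prob(target, x_t)   # importance weight
        xs.append(x_t); ys.append(y_t); ws.append(w_t)
        surrogate.fit(xs, ys, ws)    # line 9: update the conditional model
        discovered.append(x_t)
    return discovered
```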
## 3 Theoretical analysis

In this section, we first show that Algorithm 1 uses a max-entropy conditional model as a probabilistic surrogate for the oracle. We then prove that Algorithm 1 is consistent and analyze the mixing time of an independent Metropolis-Hastings chain for sampling from the posterior $p(x \mid y^\ast, \theta)$.

### 3.1 Max-entropy probabilistic surrogate

In previous work it was shown that exponential family models are max-entropy models subject to constraints on the first moments of the sample (Jaynes 1957). The following proposition is an adaptation of this max-entropy result to conditional exponential family models. For the sake of completeness, a proof is provided in Appendix A.

Proposition 1. Let $\mathcal{P}$ denote the set of all conditional distributions that have square integrable densities with respect to a base measure defined on the domain of a sufficient statistic $\phi(x, y)$ and support on the entire domain of $\phi(x, y)$. A max-entropy conditional distribution from $\mathcal{P}$ that satisfies a set of constraints on the first moments of the sample can be represented as a conditional exponential model. To specify this distribution it is sufficient to find the maximum a posteriori estimator from the conditional exponential family of models.

This proposition guarantees that conditional exponential family models objectively encode the information from the sample into the model. In fact, any other choice of the conditional model makes additional assumptions about the samples that reduce the entropy and introduce a potentially undesirable bias into the process.

### 3.2 Consistency

In this section, we show that Algorithm 1 converges in probability to the best model from a parameter set $\Theta$. For this, we assume that $\Theta$ is a compact subset of a Euclidean space and that there exist constants $R, r > 0$ such that $\|\theta\| \le R$ for all $\theta \in \Theta$ and $\|\phi(x, y)\| = \sqrt{k\left((x, y), (x, y)\right)} \le r$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$. In finite-dimensional Euclidean spaces closed spheres are compact sets and, in line with our previous assumption, we can take $\Theta$ to be the sphere of radius $R$ centered at the origin. In infinite-dimensional spaces closed spheres are not compact sets; in this case it is possible to find an approximate finite-dimensional basis of the kernel feature space using the Cholesky decomposition of the kernel matrix (Fine and Scheinberg 2002) and to define $\Theta$ as in the finite-dimensional case. We note that this is a standard step for many kernel-based approaches in machine learning (Bach 2007).

Given the stationary distribution $p_0(x)$ of the proposal generator and the conditional label distribution of the evaluation oracle $p_0(y \mid x)$, the latent data-generating distribution is $p_0(x, y) = p_0(y \mid x)\, p_0(x)$. We measure the difference between this data-generating distribution and our conditional exponential family model, parameterized with a vector $\theta$, using the Kullback-Leibler divergence (Akaike 1973; White 1982). Eliminating the parameter-free terms from this divergence measure, we obtain the loss function of $\theta$,

$$L(\theta) = -\int_{\mathcal{X} \times \mathcal{Y}} p_0(x, y) \ln p(y \mid x, \theta).$$

We assume that there exists a unique minimizer of the loss function $L(\theta)$ in the interior of the parameter set $\Theta$ and denote this minimizer by $\theta^\ast$. If the optimal parameter vector $\theta^\ast \in \Theta$ satisfies $\mathbb{E}_{p_0(y \mid x)}[\phi(x, y)] = \mathbb{E}_{p(y \mid x, \theta^\ast)}[\phi(x, y)]$ for all $x \in \mathcal{X}$, the model is said to be well-specified. In our case, sample points are obtained from a query distribution that depends on previous samples, i.e., $x_i \sim q(x \mid x_1, \ldots, x_{i-1})$,
but labels are still obtained from the conditional label distribution $y_i \sim p_0(y \mid x_i)$, independently of $x_j$ ($j < i$).

The main difficulty in proving the consistency of the approach in the general case, where the queried structures are neither independent nor identically distributed, comes from the fact that standard concentration bounds do not hold in this setting. A workaround frequently encountered in the literature is to assume that the model is well-specified, as in this case the sampling process is consistent irrespective of the query distribution. Before proving convergence in the general case, we first briefly consider the cases of independent samples and well-specified models.

For the common case in which the training sample is drawn independently from a distribution $q(x)$, let

$$\hat{\theta}_n = \arg\max_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \frac{p_0(x_i)}{q(x_i)} \ln p(y_i \mid x_i, \theta). \qquad (3)$$

The sequence of optimizers $\{\hat{\theta}_n\}_{n \in \mathbb{N}}$ converges to the optimal parameter vector $\theta^\ast$ (White 1982; Shimodaira 2000). For $q(x) = p_0(x)$, $\hat{\theta}_n$ is the maximum likelihood estimate of $\theta$ over an i.i.d. sample $\{(x_i, y_i)\}_{i=1}^{n}$. Moreover, for $\Theta = \{\theta \mid \|\theta\| \le R\}$ the latter optimization problem is equivalent to finding the maximum a posteriori estimator with a Gaussian prior on $\theta$ (Altun, Smola, and Hofmann 2004).

In the case of a well-specified model, for all $x \in \mathcal{X}$ it holds that $\mathbb{E}_{p_0(y \mid x)}[\phi(x, y)] = \mathbb{E}_{p(y \mid x, \theta^\ast)}[\phi(x, y)]$. Thus, for all marginal distributions $p_0(x)$, the gradient of the loss is zero at $\theta^\ast$, i.e.,

$$\nabla L(\theta^\ast) = \int_{\mathcal{Y}} \phi(x, y) \left(p(y \mid x, \theta^\ast) - p_0(y \mid x)\right) = 0.$$

In other words, if the model is well-specified, the maximum likelihood estimator is consistent for all query distributions.

We now proceed to the general case, for which we do not make the assumption that the model is well-specified, and again show that the optimizer $\theta_t$ converges to the optimal parameter vector $\theta^\ast$. At iteration $t$ of Algorithm 1 an instance is selected by sampling from the query distribution $q(x \mid D_{t-1}) = p(x \mid y^\ast, \theta_t)$, where $\theta_t$ denotes a parameter vector from $\Theta$ which is completely determined by the previously seen data $D_{t-1}$. Thus, a candidate sampled at iteration $t$ depends on previous samples through the parameter vector, and the independence between input-output pairs within the sample is lost. As a result, the convergence of the sequence $\{\theta_t\}_{t \in \mathbb{N}}$ to $\theta^\ast$ in the general case of a misspecified model cannot be guaranteed by the previous results relying on the independence assumption (Shimodaira 2000).

To show consistency in this general case, we first rewrite the objective which is optimized at iteration $t$ of Algorithm 1. For a fixed target property $y^\ast$, the parameter vector $\theta_{t+1}$ is obtained by solving the following problem:

$$\theta_{t+1} = \arg\min_{\theta \in \Theta} \frac{1}{t} \sum_{i=1}^{t} \frac{A(\theta \mid x_i) - \langle \phi(x_i, y_i), \theta \rangle}{p(y^\ast \mid x_i, \theta_i)} + \lambda \|\theta\|^2. \qquad (4)$$

Assuming the parameter set is well behaved (Theorem 2), the objective in Eq. (4) is convex and can be optimized using standard optimization techniques. Before we show that the sequence of optimizers $\theta_t$ converges to the optimal parameter vector $\theta^\ast$, let us formally define the empirical loss of a parameter vector $\theta$ given the data $D_t$ available at iteration $t$,

$$L(\theta \mid D_t) = \frac{p_0(y^\ast)}{t} \sum_{i=1}^{t} \frac{A(\theta \mid x_i) - \langle \phi(x_i, y_i), \theta \rangle}{p(y^\ast \mid x_i, \theta_i)}.$$

The following theorem and corollary show that Algorithm 1 is consistent in the general case of misspecified models and a sample of structures which are neither independent nor identically distributed. The proofs are provided in Appendix A.
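As an illustration of how the objective in Eq. (4) could be optimized in practice, the sketch below works in the primal with an explicit, finite-dimensional joint feature map and a finite label set; the function and argument names are our own, not the authors' implementation, and the kernelized representer-theorem form is replaced by a direct parameterization.

```python
import numpy as np
from scipy.optimize import minimize

def fit_weighted_cef(phi, X, Y, W, labels, lam):
    """Sketch of the optimization in Eq. (4): importance-weighted, l2-regularized
    negative log-likelihood of a conditional exponential family with a finite
    label set.  phi(x, y) is a joint feature map returning a numpy vector,
    X and Y are the observed structures and labels, W the importance weights."""
    d = phi(X[0], labels[0]).shape[0]

    def objective(theta):
        val = lam * np.dot(theta, theta)
        for x, y, w in zip(X, Y, W):
            scores = np.array([np.dot(phi(x, c), theta) for c in labels])
            log_partition = np.logaddexp.reduce(scores)       # A(theta | x)
            nll = log_partition - np.dot(phi(x, y), theta)     # -ln p(y | x, theta)
            val += w * nll / len(X)
        return val

    result = minimize(objective, np.zeros(d), method="L-BFGS-B")
    return result.x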
Theorem 2. Let $p(y \mid x, \theta)$ denote the conditional exponential family distribution parameterized with a vector $\theta \in \Theta$, where $\Theta$ is a compact subset of a $d$-dimensional Euclidean space $\mathbb{R}^d$. Let $p_0(x, y)$ denote a latent data-generating distribution such that, for all $x \in \mathcal{X}$, the support of the likelihood function $p_0(y \mid x)$ is contained in the support of $p(y \mid x, \theta)$ for all $\theta \in \Theta$. Let $|\ln p(y \mid x, \theta)| \le h(x, y)$ for all $\theta \in \Theta$ and some function $h\colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ which is Lebesgue integrable with respect to the measure $p_0(x, y)$. Then for all $0 < \varepsilon, \delta < 1$ there exists $N(\varepsilon, \delta) \in \Omega\left(\frac{1}{\delta}\right)$ such that for all $t \ge N(\varepsilon, \delta)$ we have $P\left(\sup_{\theta \in \Theta} |L(\theta) - L(\theta \mid D_t)| \le \varepsilon\right) \ge 1 - \delta$.

Corollary 3. The sequence of estimators $\{\theta_t\}_{t \ge 1}$ converges in probability to $\theta^\ast \in \Theta$.

### 3.3 Mixing time analysis

Having shown the consistency of Algorithm 1, we proceed to bound the mixing time of the Metropolis-Hastings chain. For that, we consider an independent proposal generator $G$ and provide a simple coupling analysis to bound the worst-case mixing time of an independent Metropolis-Hastings chain for sampling from the posterior $p(x \mid y^\ast, \theta_t)$ (Vembu, Gärtner, and Boley 2009). This allows us to utilize perfect sampling algorithms such as coupling from the past (Propp and Wilson 1996) to draw samples from the posterior. We assume $|\mathcal{X}|$ parallel and identical chains are started from all possible states $x \in \mathcal{X}$ and an identical random bit sequence is used to simulate all the chains. Thus, whenever two chains move to a common state, all the future transitions of the two chains are the same. From that point on it is sufficient to track only one of the chains. This is called a coalescence (Huber 1998). Propp and Wilson (1996) have shown that if all the chains were started at time $-T$ and have coalesced to a single chain by step $-T'$ with $T > T' > 0$, then samples drawn at time $0$ are exact samples from the stationary distribution.

For conditional exponential family models $p(y \mid x, \theta) > 0$, and the lower bound can be controlled with the regularization parameter. Thus, there will always be a path with non-zero probability between any two target structures. As is the case with other Metropolis algorithms, for difficult problems where clusters of targets are far apart in the search space, the mixing will be slower as the model becomes more confident. The following proposition (a proof is provided in Appendix A) gives a worst-case bound on the mixing time of an independent Metropolis-Hastings chain for sampling from the posterior distribution $p(x \mid y^\ast, \theta_t)$.

Proposition 4. For all $0 < \varepsilon < 1$, with probability $1 - \varepsilon$, the mixing time $\tau(\varepsilon)$ of an independent Metropolis-Hastings chain for sampling from the posterior distribution $p(x \mid y^\ast, \theta_t)$ is bounded from above by

$$\tau(\varepsilon) \le \frac{\ln \varepsilon}{\ln\left(1 - \exp\left(-4 r \|\theta_t\|\right)\right)}.$$
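The bound in Proposition 4 is easy to evaluate numerically; the snippet below (our own illustration, not part of the original analysis) shows how the worst-case mixing time grows with the norm of $\theta_t$, i.e., as the surrogate becomes more confident.

```python
import math

def mixing_time_bound(r, theta_norm, eps):
    """Worst-case mixing-time bound from Proposition 4 for an independent
    Metropolis-Hastings chain: tau(eps) <= ln(eps) / ln(1 - exp(-4 r ||theta_t||))."""
    coalescence_prob = math.exp(-4.0 * r * theta_norm)
    return math.log(eps) / math.log(1.0 - coalescence_prob)

# e.g., r = 1, ||theta_t|| = 1, eps = 0.01  ->  roughly 250 steps
print(mixing_time_bound(1.0, 1.0, 0.01))
```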
## 4 Experiments

Having provided theoretical justification for our approach in the previous section, here we evaluate its effectiveness with a series of synthetic experiments designed to mimic the construction of cocktail recipes and graphs with desired properties. The main reason for not evaluating the approach on a real-world problem is not the lack of a proposal generator for such a domain but the lack of a suitable experimental setup (the usual retrospective analysis on labeled data is not suitable for active search in intensionally specified structured spaces). For instance, to apply the approach to the design of molecules, our main motivating example, an independent proposal generator can be used (Goldberg and Jerrum 1997), as well as numerous samplers outlined in Schneider and Fechner (2005).

In the first set of experiments, we design cocktails of different flavors: dry, creamy, and juicy. The recipes are represented as sparse real-valued vectors such that the non-zero values in these vectors indicate the proportions of the respective ingredients (i.e., the vectors are normalized). In the second set of experiments, the goal is to design Hamiltonian and connected planar graphs, as well as the respective complements of these classes. As we cannot expect to be able to perfectly distinguish each of these classes from its complement, due to the hardness of complete graph kernels (Gärtner, Flach, and Wrobel 2003), we cannot expect to learn to perfectly generate these concepts. The main objective of these experiments is to demonstrate that our approach can discover a diverse set of target structures in non-smooth problems which act as in silico proxies for the drug design task. In particular, in the construction of Hamiltonian graphs and their complements, there are numerous Hamiltonian graphs which become non-Hamiltonian with the removal of a single edge. Such graphs are structurally very similar and close in the design space. Thus, these testbeds can mimic well the activity cliffs specific to drug design, where very similar structures have different protein binding affinities.

In our empirical evaluation, we compare Algorithm 1 to k-NN active search with 1- and 2-step look-ahead (Garnett et al. 2012) and a greedy method which discovers structures by repeatedly performing argmax search over samples from a proposal generator using the learned conditional label distribution (selected structures are labeled by an oracle and the model is updated in each iteration). In the first step of this evaluation, we measure the improvement of each of the considered approaches over plain Monte Carlo search performed with a proposal generator. We assess the performance of the approaches with correct-construction curves, which show the cumulative number of distinct target structures discovered as a function of the budget expended. To quantify the improvement of the approaches over plain Monte Carlo search, we measure the lift of the correct-construction curves. In particular, for sampling from the minority class of a proposal generator the lift is computed as the ratio between the number of distinct structures from this class generated by an algorithm and the number of such structures observed in a sample (of the same size) from the distribution of the proposal generator.

In the second step of our empirical evaluation, we assess the structural diversity of the targets discovered by an algorithm. We do this by incorporating diversity into the correct-construction curves. Namely, we take a sample of 50 000 structures from the proposal generator and filter out targets. We consider these as undiscovered targets and compute the average distance between an undiscovered structure and a subsample of budget size from this set of structures. With this average distance as radius, we circumscribe a sphere around each of the undiscovered targets. Then, instead of construction curves defined by the number of discovered targets, we use construction curves defined by the number of spheres having a discovered target structure within them. To quantify the effectiveness of the considered algorithms in discovering structurally diverse targets, we normalize these sphere-based construction curves with one such curve corresponding to an ideal algorithm that only generates targets; the output of this algorithm can be represented with a subsample of budget size from the undiscovered target structures. Implementation details for all algorithms are provided in Appendix C.
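The two evaluation curves described above could be computed along the lines of the following sketch. Here `dist` is a placeholder for a structure metric (e.g., a kernel-induced distance), the target rate of the proposal generator is assumed to be known and positive, and the final normalization against the ideal-algorithm curve is omitted; all names are illustrative rather than the paper's implementation.

```python
import numpy as np

def lift_curve(new_target_found, baseline_target_rate):
    """Lift of a correct-construction curve: cumulative number of distinct
    targets found, divided by the number expected from plain Monte Carlo
    sampling of the proposal generator (its target rate times the budget).
    new_target_found[i] is 1 iff step i produced a previously unseen target."""
    cumulative = np.cumsum(new_target_found)
    expected = baseline_target_rate * np.arange(1, len(new_target_found) + 1)
    return cumulative / expected

def sphere_coverage_curve(discovered, undiscovered, radius, dist):
    """Diversity-aware construction curve: after each evaluation, count how many
    spheres of the given radius around 'undiscovered' reference targets contain
    at least one discovered target."""
    covered = np.zeros(len(undiscovered), dtype=bool)
    curve = []
    for x in discovered:
        for j, u in enumerate(undiscovered):
            if not covered[j] and dist(x, u) <= radius:
                covered[j] = True
        curve.append(int(covered.sum()))
    return curve
```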
We have simulated Algorithm 1 with the uniform proposal generator over the space of graphs with 7 and 10 nodes (Wormald 1987). For the space of cocktails, we have developed a frequency-based sampler from a small set of cocktails collected from www.webtender.com. This sampler generates cocktail recipes by first sampling the number of ingredients from a Poisson distribution and then selecting the recipe ingredients based on their co-occurrence frequency in the collected data set (a sketch of such a sampler is given below). The parameters of this proposal generator are moment-matched with respect to the collected cocktail data set. As this proposal generator almost always samples recipes with 2-10 ingredients, for $n$ possible ingredients the number of different ingredient combinations is $\sum_{k=2}^{10} \binom{n}{k}$ (approximately $n^{10}$). As the sampler is developed based on a set of cocktails with 335 ingredients, there are approximately $10^{24}$ different combinations of ingredients in this search space. Thus, this is a huge search space that can provide an insight into the properties of the discovery process on large-scale problems. To label the cocktails generated by this proposal generator, we have trained decision trees for each of the considered flavor profiles using a labeling of these cocktails according to flavor.

Figure 1: Lift of correct-construction curves for the considered graph and cocktail concepts. The lift indicates how much more likely it is to see a target compared to Monte Carlo search with a proposal generator.

Figure 2: Dispersion of discovered targets relative to an algorithm with the identical proposal generator that outputs only targets. The reported curves can be seen as the percentage of discovered target class partitions given a budget.

All the reported results were obtained by averaging over 5 runs of the algorithm. The Metropolis-Hastings sampling was performed with a burn-in sample of 50 000 proposals and sampling was done for 50 rounds/batches. In each round we take 10 i.i.d. samples by running 10 Metropolis-Hastings chains in parallel (note that samples from different rounds are dependent). To allow for models of varying complexity, we have estimated the conditional exponential family regularization parameter in each round using 5-fold stratified cross-validation. As the competing approaches, argmax and k-NN active search (Garnett et al. 2012), are not designed to search for targets without a priori provided labeled structures, we have made a minor modification to our problem setting and warm-started each method with a random sample of 5 target and the same number of non-target structures. For graphs these were chosen uniformly from the search space, and for cocktails uniformly from the available sample of cocktails. Note that without this warm-start the argmax search estimates the distribution of target structures with a single peak around the first discovered target. Moreover, the k-NN probabilistic model cannot learn a property until it sees more than k labeled structures, and it is unlikely to observe a target in k successive samples from a proposal generator.
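A minimal sketch of such a frequency-based proposal generator is given below, assuming a co-occurrence count matrix estimated from the collected recipes; the equal-proportion recipes, the smoothing constant, and all names are our simplifications for illustration rather than the authors' implementation.

```python
import numpy as np

def make_cocktail_sampler(cooccurrence, mean_ingredients, rng=None):
    """Sketch of a frequency-based cocktail proposal generator: the number of
    ingredients is Poisson-distributed (moment-matched to the data) and each
    ingredient is drawn in proportion to its co-occurrence with those already
    selected.  cooccurrence is an (n x n) count matrix over n ingredients."""
    rng = rng or np.random.default_rng()
    n = cooccurrence.shape[0]
    marginal = cooccurrence.sum(axis=1)

    def sample():
        k = max(2, min(10, rng.poisson(mean_ingredients)))    # keep 2-10 ingredients
        chosen = [rng.choice(n, p=marginal / marginal.sum())]
        while len(chosen) < k:
            scores = cooccurrence[chosen].sum(axis=0) + 1e-9   # smooth zero counts
            scores[chosen] = 0.0                               # no repeated ingredients
            chosen.append(rng.choice(n, p=scores / scores.sum()))
        recipe = np.zeros(n)
        recipe[chosen] = 1.0 / k    # normalized vector; equal proportions for the sketch
        return recipe

    return sample
```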
In Figure 1, we show the lift of the correct-construction curves for all the considered approaches. We have defined these correct-construction curves by considering isomorphic graphs and cocktails with equal sets of ingredients (ignoring portions of each ingredient) as identical structures. The plots indicate that our approach and k-NN active search are able to emphasize the target class in all the domains for all the considered properties. Moreover, for our approach the magnitude of this emphasis is increasing over time, and it is more likely to generate a target as the process evolves. In all domains and for all properties, k-NN active search discovers more target structures than our approach. For graph properties, we see that argmax search also discovers more targets than our approach. For cocktails, argmax search discovers many cocktails with identical sets of ingredients and different portions of these (such cocktails are considered identical in the correct-construction curves). Thus, if we are only interested in discovering target structures without considering structural diversity between them, our empirical evaluation indicates that it is better to use k-NN active search than Algorithm 1.

In Figure 2, we show the dispersion of target structures discovered by each of the considered approaches. The plots indicate that our approach achieves a large structural variety of discovered targets. In all domains and for all properties, our approach outperforms both k-NN active and greedy argmax search. These experiments also indicate that k-NN active search explores more than argmax search. In some of the plots, a dip can be observed in the curves for k-NN active and argmax search. This can be explained by the exploitative nature of these algorithms and the fact that the search is focused on a small region of the space until all the targets from it are discovered. In contrast to this, our approach discovers targets from the whole space and can cover a large number of spheres centered at undiscovered samples with a relatively small number of targets. Thus, if we are interested in discovering diverse target structures, our results indicate that it is better to use Algorithm 1 than k-NN active or argmax search.

## 5 Discussion

Active search with a k-NN probabilistic model (Garnett et al. 2012) is a related approach with a problem setting similar to that of de novo design. The key distinction between the investigated problem setting and k-NN active search is in the requirement to discover structures from the whole domain. Garnett et al. (2012) assume that an extensional description in the form of a finite subset of the domain is explicitly given as input to the algorithm. In this work we require only an intensional description of the domain. For instance, for the domain of graphs on $n \in \mathbb{N}$ vertices, the intensional description is just the number of vertices, while the extensional one consists of a list of all graphs on $n$ vertices. In many cases, considering intensional descriptions is much more promising because an algorithm with an extensional description of an exponentially large or uncountable search space can only consider small and often arbitrary subsets of this space. The second key distinction between k-NN active search and de novo design is in the assessment of their outcomes. In particular, both approaches try to find, as soon as possible, as many target structures as possible.
However, k-NN active search is designed only to discover members of a target class, while Algorithm 1 is designed to find members of distinct structural partitions of a target class. This is very useful in domains where there are numerous isofunctional structures and in which k-NN active search outputs structures from a small number of structural partitions of a target class.

Recently, active search has been applied to a problem related to our cocktail construction task: interactive exploration of patterns in a cocktail dataset (Paurat, Garnett, and Gärtner 2014). The difference between our setting and that of Paurat, Garnett, and Gärtner (2014) is in the requirement to generate novel and previously unseen cocktails exhibiting a target property rather than searching for patterns in an existing cocktail dataset. In addition to this, active search has been applied to real-world problems where the search space is given by a single combinatorial graph and some subset of its nodes is interesting (Wang, Garnett, and Schneider 2013). This is different from the applications we consider here, for which the search space is the space of all graphs of a given size.

As the investigated problem setting can be seen as a search in structured spaces, our approach is, with certain distinctions, closely related to structured output prediction (Tsochantaridis et al. 2004; Daumé III, Langford, and Marcu 2009). In structured output prediction the goal is to find a mapping from an instance space to a structured output space. A common approach is to find a joint scoring function, from the space of input-output pairs to the set of reals, and to predict the output structure which maximizes the scoring function for each test input. Finding a good scoring function can often be cast as a convex optimization problem with exponentially many constraints. It can be solved efficiently if the so-called separation and/or decoding sub-problems can be solved efficiently. One difference between the investigated setting and structured output prediction is in the assumption of how input-output pairs are created. In particular, structured output prediction assumes that the provided outputs are optimal for the given inputs. In many de novo design problems, it is infeasible to find the best possible output for a given input. For de novo drug design this assumption implies that we would need to know the best molecule from the space of all synthesizable molecules with respect to different properties, such as binding affinity to specific protein sites. Moreover, as the decoding problem is formulated under the assumption that the input-output pairs are optimal, the greedy argmax approach to solving it does not incorporate exploration. As a result, similar to argmax search, these methods generate structures from a very small number of structural partitions of the target class. Other differences are in the iterative nature of de novo design and in the hardness of the separation or decoding sub-problems that most structured output prediction approaches need to solve. Another related sub-problem is that of finding preimages (Weston, Schölkopf, and Bakir 2004), which is typically also hard in the context of structured domains except for some special cases such as strings (Giguère et al. 2015). Related to the proposed approach are also methods for interactive learning and optimization, as well as Bayesian optimization.
Interactive learning and optimization methods implement a two-step iterative process in which an agent interacts with a user until a satisfactory solution is obtained. Some well-known interactive learning and optimization methods tackle problems in information retrieval (Yue and Joachims 2009; Shivaswamy and Joachims 2012) and reinforcement learning (Wilson, Fern, and Tadepalli 2012; Jain et al. 2013). However, these methods are only designed to construct a single output from the domain of real-valued vectors and cannot be directly applied to structured domains. Bayesian optimization (Brochu, Cora, and de Freitas 2010; Shahriari et al. 2015), on the other hand, is an approach to sequential optimization of an expensive, black-box, real-valued objective. Rather than seeking a set of high-quality items, Bayesian optimization focuses on finding the single highest-scoring point in the domain. We, in contrast, consider discrete labels and wish to maximize the number of diverse targets found in an intensionally specified structured space. In drug design, this emphasis on exploring all parts of the search space is known as scaffold-hopping (Schneider and Fechner 2005) and it is related to the problem of attrition (Schneider and Schneider 2016). Namely, in order to address this problem it is not sufficient to search for a molecule with the highest activity level, as it can be toxic or bind to an undesired protein in addition to the target protein. If attrition is to be reduced, an algorithm needs to find a number of structurally different molecules binding to a target protein. As our approach achieves a large structural variety of discovered targets, it has the potential to tackle this difficult problem.

## Acknowledgments

We are grateful for access to the University of Nottingham High Performance Computing Facility. Part of this work was also supported by the German Science Foundation (grant GA 1615/1-1).

## References

Akaike, H. 1973. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory. Akadémiai Kiadó.
Aldous, D. 1983. Random walks on finite groups and rapidly mixing Markov chains. Séminaire de Probabilités XVII.
Altun, Y.; Smola, A. J.; and Hofmann, T. 2004. Exponential families for conditional random fields. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence.
Andrieu, C.; de Freitas, N.; Doucet, A.; and Jordan, M. I. 2003. An introduction to MCMC for machine learning. Machine Learning.
Azuma, K. 1967. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal.
Bach, F. 2007. Active learning for misspecified generalized linear models. In Advances in Neural Information Processing Systems 19.
Beygelzimer, A.; Dasgupta, S.; and Langford, J. 2009. Importance weighted active learning. In Proceedings of the 26th International Conference on Machine Learning.
Borgwardt, K. M. 2007. Graph kernels. Ph.D. Dissertation, Ludwig Maximilians University Munich.
Brochu, E.; Cora, V. M.; and de Freitas, N. 2010. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.
Cameron, P. J. 1998. Introduction to Algebra. Oxford University Press.
Carl, B., and Stephani, I. 1990. Entropy, Compactness, and the Approximation of Operators. Cambridge University Press.
Daumé III, H.; Langford, J.; and Marcu, D. 2009. Search-based structured prediction. Machine Learning.
Dixon, J. D., and Wilf, H. S. 1983. The random selection of unlabeled graphs. Journal of Algorithms.
Fine, S., and Scheinberg, K. 2002. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research.
Garnett, R.; Krishnamurthy, Y.; Xiong, X.; Schneider, J.; and Mann, R. P. 2012. Bayesian optimal active search and surveying. In Proceedings of the 29th International Conference on Machine Learning.
Gärtner, T.; Flach, P. A.; and Wrobel, S. 2003. On graph kernels: Hardness results and efficient alternatives. In Proceedings of the 16th Annual Conference on Computational Learning Theory.
Gelfand, I. M., and Fomin, S. V. 1963. Calculus of Variations. Prentice-Hall.
Giguère, S.; Rolland, A.; Laviolette, F.; and Marchand, M. 2015. Algorithms for the hard pre-image problem of string kernels and the general problem of string prediction. In Proceedings of the 32nd International Conference on Machine Learning.
Goldberg, L. A., and Jerrum, M. 1997. Randomly sampling molecules. In Proceedings of the 8th ACM-SIAM Symposium on Discrete Algorithms.
Guruswami, V. 2000. Rapidly mixing Markov chains: A comparison of techniques (survey). Technical report, MIT.
Huber, M. 1998. Exact sampling and approximate counting techniques. In Proceedings of the 30th Annual ACM Symposium on the Theory of Computing.
Jain, A.; Wojcik, B.; Joachims, T.; and Saxena, A. 2013. Learning trajectory preferences for manipulators via iterative improvement. In Advances in Neural Information Processing Systems 26.
Jaynes, E. T. 1957. Information theory and statistical mechanics. Physical Review.
Metropolis, N.; Rosenbluth, A. W.; Rosenbluth, M. N.; Teller, A. H.; and Teller, E. 1953. Equation of state calculations by fast computing machines. The Journal of Chemical Physics.
Paurat, D.; Garnett, R.; and Gärtner, T. 2014. Interactive exploration of larger pattern collections: A case study on a cocktail dataset. In Proceedings of KDD IDEA.
Propp, J. G., and Wilson, D. B. 1996. Exact sampling with coupled Markov chains and applications to statistical mechanics. In Proceedings of the 7th International Conference on Random Structures and Algorithms.
Scannell, J. W.; Blanckley, A.; Boldon, H.; and Warrington, B. 2012. Diagnosing the decline in pharmaceutical R&D efficiency. Nature Reviews Drug Discovery.
Schneider, G., and Fechner, U. 2005. Computer-based de novo design of drug-like molecules. Nature Reviews Drug Discovery.
Schneider, P., and Schneider, G. 2016. De novo design at the edge of chaos. Journal of Medicinal Chemistry.
Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R. P.; and de Freitas, N. 2015. Taking the human out of the loop: A review of Bayesian optimization. Technical report, Universities of Harvard, Oxford, Toronto, and Google DeepMind.
Shimodaira, H. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference.
Shivaswamy, P., and Joachims, T. 2012. Online structured prediction via coactive learning. In Proceedings of the 29th International Conference on Machine Learning.
Tsochantaridis, I.; Hofmann, T.; Joachims, T.; and Altun, Y. 2004. SVM learning for interdependent and structured output spaces. In Proceedings of the 21st International Conference on Machine Learning.
Vembu, S.; Gärtner, T.; and Boley, M. 2009. Probabilistic structured predictors. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence.
Wahba, G. 1990. Spline Models for Observational Data. SIAM.
Wang, X.; Garnett, R.; and Schneider, J. 2013. Active search on graphs. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Weston, J.; Schölkopf, B.; and Bakir, G. H. 2004. Learning to find pre-images. In Advances in Neural Information Processing Systems 16.
White, H. 1982. Maximum likelihood estimation of misspecified models. Econometrica.
Wilson, A.; Fern, A.; and Tadepalli, P. 2012. A Bayesian approach for policy learning from trajectory preference queries. In Advances in Neural Information Processing Systems 25.
Woltosz, W. S. 2012. If we designed airplanes like we design drugs... Journal of Computer-Aided Molecular Design.
Wormald, N. C. 1987. Generating random unlabelled graphs. SIAM Journal on Computing.
Yue, Y., and Joachims, T. 2009. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th International Conference on Machine Learning.