# BOSS: Bayesian Optimization over String Spaces

Henry B. Moss, STOR-i Centre for Doctoral Training, Lancaster University, UK (h.moss@lancaster.ac.uk)
Daniel Beck, Computing and Information Systems, University of Melbourne, Australia (d.beck@unimelb.edu.au)
Javier González, Microsoft Research Cambridge, UK
David S. Leslie, Dept. of Mathematics and Statistics, Lancaster University, UK
Paul Rayson, School of Computing and Communications, Lancaster University, UK

This article develops a Bayesian optimization (BO) method which acts directly over raw strings, proposing the first uses of string kernels and genetic algorithms within BO loops. Recent applications of BO over strings have been hindered by the need to map inputs into a smooth and unconstrained latent space; learning this projection is computationally and data-intensive. Our approach instead builds a powerful Gaussian process surrogate model based on string kernels, naturally supporting variable-length inputs, and performs efficient acquisition function maximization for spaces with syntactical constraints. Experiments demonstrate considerably improved optimization over existing approaches across a broad range of constraints, including the popular setting where syntax is governed by a context-free grammar.

## 1 Introduction

Many tasks in chemistry, biology and machine learning can be framed as optimization problems over spaces of strings. Examples include the design of synthetic genes [González et al., 2014, Tanaka and Iwata, 2018] and chemical molecules [Griffiths and Hernández-Lobato, 2020, Gómez-Bombarelli et al., 2018], as well as problems in symbolic regression [Kusner et al., 2017] and kernel design [Lu et al., 2018]. Common to these applications is the high cost of evaluating a particular input, for example requiring resource- and labor-intensive wet-lab tests. Consequently, most standard discrete optimization routines are unsuitable, as they require many evaluations.

Bayesian optimization [Shahriari et al., 2015, BO] has recently emerged as an effective strategy for the applications above, due to its ability to find good solutions within heavily restricted evaluation budgets. However, the vast majority of BO approaches assume a low-dimensional, mostly continuous space; string inputs must therefore be converted to fixed-size vectors such as bags-of-ngrams or latent representations learned through an unsupervised model, typically a variational autoencoder [Kingma and Welling, 2014, VAE].

In this work, we remove this encoding step and propose a BO architecture that operates directly on raw strings through the lens of convolution kernels [Haussler, 1999]. In particular, we employ a Gaussian process [Rasmussen, 2003, GP] with a string kernel [Lodhi et al., 2002] as the surrogate model for the objective function, measuring the similarity between strings by examining shared non-contiguous sub-sequences. String kernels provide an easy and user-friendly way to deploy BO loops directly over strings, avoiding the expensive architecture tuning required to find a useful VAE. At the same time, by using the kernel trick to work in much richer feature spaces than bag-of-ngrams vectors, string kernels can encode the non-contiguity known to be informative when modeling genetic sequences [Vert, 2007] and SMILES [Anderson et al., 1987] representations of molecules [Cao et al., 2012] (see Figure 1). We show that our string kernel's two parameters can be reliably fine-tuned to model complex objective functions with just a handful of function evaluations, without needing the large collections of unlabeled data required to train VAEs.
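For intuition about the non-contiguity that string kernels exploit, the short sketch below (an illustration only, not part of the paper's method; the two example SMILES strings are arbitrary) counts the length-4 character sub-sequences, gaps allowed, shared by two strings that differ in a single character.

```python
from itertools import combinations

def subsequences(s: str, n: int) -> set:
    """All length-n character sub-sequences of s, allowing gaps (non-contiguous)."""
    return {"".join(s[i] for i in idx) for idx in combinations(range(len(s)), n)}

# Two example SMILES strings that differ locally but share global structure.
a = "CC(=O)Oc1ccccc1C(=O)O"
b = "CC(=O)Nc1ccccc1C(=O)O"
shared = subsequences(a, 4) & subsequences(b, 4)
print(f"{len(shared)} shared length-4 sub-sequences")
```

Despite the single-character difference, the vast majority of the two strings' non-contiguous sub-sequences coincide; this shared structure is exactly the similarity signal the string kernel surrogate model exploits.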
Figure 1: Similar molecules have SMILES strings with local differences (red) but common non-contiguous sub-sequences.

Figure 2: BO loop for molecule design using a string kernel surrogate model (a) and genetic algorithms for acquisition function maximization (b).

Devising a BO framework directly over strings raises the question of how to maximize acquisition functions, the heuristics used to select new evaluation points. Standard BO uses numerical methods to maximize these functions, but such methods are not applicable when the inputs are discrete structures such as strings. To address this challenge, we employ a suite of genetic algorithms [Whitley, 1994] to provide efficient exploration of string spaces under a range of syntactical constraints.

Our contributions can be summarized as follows:

- We introduce string kernels into BO, providing powerful GP surrogate models of complex objective functions with just two data-driven parameters (Figure 2a).
- We propose a suite of genetic algorithms suitable for efficiently optimizing acquisition functions under a variety of syntactical constraints (Figure 2b); a minimal sketch follows this list.
- We demonstrate that our framework outperforms established baselines across four scenarios encompassing a range of applications and a diverse set of constraints.
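The following is a minimal sketch of a genetic-algorithm acquisition maximizer of the kind referenced above, assuming only an acquisition function `acq` that scores candidate strings. The point-mutation and single-point-crossover operators, the parameter defaults, and the assumption of unconstrained strings of length at least two are illustrative choices, not the exact operators used in this paper.

```python
import random

def ga_maximize_acquisition(acq, alphabet, population, generations=25,
                            mutation_rate=0.1, elite_frac=0.2):
    """Evolve a population of candidate strings towards high acquisition values."""
    for _ in range(generations):
        ranked = sorted(population, key=acq, reverse=True)
        n_elite = max(2, int(elite_frac * len(ranked)))
        elite = ranked[:n_elite]                       # keep the best strings unchanged
        children = []
        while len(children) < len(population) - n_elite:
            p1, p2 = random.sample(elite, 2)           # two distinct parents
            cut = random.randint(1, min(len(p1), len(p2)) - 1)
            child = list(p1[:cut] + p2[cut:])          # single-point crossover
            for i in range(len(child)):                # character-level point mutation
                if random.random() < mutation_rate:
                    child[i] = random.choice(alphabet)
            children.append("".join(child))
        population = elite + children
    return max(population, key=acq)
```

For example, `ga_maximize_acquisition(acq, alphabet="ACGT", population=initial_genes)` would evolve candidate gene strings; in the constrained settings considered later, the mutation and crossover operators must additionally be restricted so that every child remains syntactically valid.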
## 2 Related Work

**BO by feature extraction** BO has previously been applied to find genes with desirable features: a high-cost string optimization problem over a small alphabet of four bases. Genes are represented either as codon frequencies (a bag-of-ngrams over triplets of characters) [González et al., 2014] or as a one-hot encoding of the gene at each location in the string [Tanaka and Iwata, 2018]. Although these representations are sufficient to allow BO to improve over random gene designs, each mapping discards information known to be important when modeling genes. A bag-of-ngrams representation ignores positional and contextual information by modeling characters as having an equal effect regardless of position or context, whereas a one-hot encoding fails to exploit translational invariance. Moreover, by assuming that all potential genes belong to a small fixed set of candidates, González et al. [2014] and Tanaka and Iwata [2018] ignore the need to provide an efficient acquisition optimization routine. This assumption is unrealistic for many real gene design loops and is tackled directly in our work.

**BO with VAEs** Kusner et al. [2017], Gómez-Bombarelli et al. [2018] and Lu et al. [2018] use VAEs to learn latent representations of string spaces following the syntactical constraints given by context-free grammars (CFGs). Projecting a variable-length and constrained string space to an unconstrained latent space of fixed dimension requires a sophisticated mapping, which in turn requires a lot of data to learn. As BO problems never have enough string-evaluation pairs to learn a supervised mapping, VAEs must be trained to reconstruct a large collection of valid strings sampled from the CFG. A representation learned in this purely unsupervised manner will likely be poorly aligned with the problem's objective function, under-representing variation and over-emphasizing sub-optimal areas of the original space. Consequently, VAEs often explore only limited regions of the space and have dead areas that decode to invalid strings [Griffiths and Hernández-Lobato, 2020]. Moreover, performance is sensitive to the arbitrary choice of the closed region of latent space considered for BO.

| String $s$ | $c_{\text{"genic"}}(s)$ | $c_{\text{"geno"}}(s)$ | $c_{\text{"ge"}}(s)$ |
|---|---|---|---|
| "genetics" | $\lambda_m^5 \lambda_g^2$ | $0$ | $\lambda_m^2 (1 + \lambda_g^2)$ |
| "genomic"  | $\lambda_m^5 \lambda_g^2$ | $\lambda_m^4$ | $\lambda_m^2$ |
| "genomes"  | $0$ | $\lambda_m^4$ | $\lambda_m^2 (1 + \lambda_g^4)$ |

Table 1: Contribution values $c_u(s)$ of the sample sub-sequences $u \in \{$"genic", "geno", "ge"$\}$ when evaluating the strings "genetics", "genomic" and "genomes"; each non-zero entry corresponds to one or more (possibly non-contiguous) occurrences of $u$ within $s$.

**Evolutionary algorithms in BO** The closest existing idea to our work is that of Kandasamy et al. [2018], where an evolutionary algorithm optimizes acquisition functions over a space of neural network architectures. However, their approach does not support syntactically constrained spaces and, as it is based solely on local mutations, cannot perform the global search required for large string spaces. Moreover, as their kernel is based on an optimal transport distance between individual network layers, it does not model the non-contiguous features supported by string kernels. Contemporaneous work of Swersky et al. [2020] also considers BO over strings and proposes an evolutionary algorithm based on generative modeling for their acquisition function optimization. However, their approach relies on ensembles of neural networks rather than GP surrogate models, is suitable only for strings of up to 100 characters, and does not support spaces with syntactic constraints.

## 3 Preliminaries

**Bayesian optimization** In its standard form, BO seeks to maximize a smooth function $f: \mathcal{X} \to \mathbb{R}$ over a compact set $\mathcal{X} \subset \mathbb{R}^d$ in as few evaluations as possible. Smoothness is exploited to predict the performance of not-yet-evaluated points, allowing evaluations to be focused on promising areas of the space. BO loops have two key components: a surrogate model and an acquisition function.

**Surrogate model** To predict the values of $f$ across $\mathcal{X}$, a surrogate model is fit to the previously collected (and potentially noisy) evaluations $\mathcal{D}_t = \{(x_i, y_i)\}_{i=1,\dots,t}$, where $y_i = f(x_i) + \epsilon_i$ for i.i.d. Gaussian noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. As is standard in the literature, we use a GP surrogate model [Rasmussen, 2003]. A GP provides non-parametric regression of a particular smoothness controlled by a kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ measuring the similarity between two points.

**Acquisition function** The other crucial ingredient of BO is an acquisition function $\alpha_t: \mathcal{X} \to \mathbb{R}$, measuring the utility of making a new evaluation given the predictions of the surrogate model. We use the simple yet effective search strategy of expected improvement (EI): evaluating points yielding the largest expected improvement over current evaluations. Although any BO acquisition function is compatible with our framework, we choose EI as it provides an effective search whilst not incurring significant BO overheads. Under a GP, EI has a closed-form expression and can be efficiently calculated (see Shahriari et al. [2015]). A single iteration of the BO loop is completed by evaluating the location with maximal utility, $x_{t+1} = \arg\max_{x \in \mathcal{X}} \alpha_t(x)$, and the loop is repeated until the optimization budget is exhausted.
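To make the closed-form EI claim concrete, the snippet below evaluates the standard maximization-form EI from a GP's posterior mean and standard deviation; the candidate predictions and incumbent value are made up purely for illustration.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_y):
    """Closed-form EI for a GP posterior N(mu, sigma^2) at each candidate."""
    sigma = np.maximum(sigma, 1e-12)               # guard against zero predictive variance
    z = (mu - best_y) / sigma
    return sigma * (z * norm.cdf(z) + norm.pdf(z))

# Illustrative posterior predictions at three candidate strings.
mu = np.array([0.42, 0.55, 0.30])
sigma = np.array([0.10, 0.05, 0.20])
print(expected_improvement(mu, sigma, best_y=0.50))
```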
**String Kernels (SKs)** SKs are a family of kernels that operate on strings, measuring their similarity through the number of sub-strings they share. Specific SKs are then formally defined by the particular notion of sub-string they encompass, which determines the underlying feature space of the kernel. In this work, we employ the Sub-sequence String Kernel (SSK) [Lodhi et al., 2002, Cancedda et al., 2003], which uses sub-sequences of characters as features. These sub-sequences may be non-contiguous, giving rise to an exponentially sized feature space. While enumerating such a space would be infeasible, the SSK uses the kernel trick to avoid computation in the primal space, enabled via an efficient dynamic programming algorithm. By matching occurrences of sub-sequences, SSKs can provide a rich contextual model of string data, moving far beyond the capabilities of popular bag-of-ngrams representations, where only contiguous occurrences of sub-strings are modeled. Formally, an $n$th-order SSK between two strings $a$ and $b$ is defined as

$$k_n(a, b) = \sum_{u \in \Sigma^n} c_u(a)\, c_u(b) \quad \text{for} \quad c_u(s) = \lambda_m^{|u|} \sum_{1 \le i_1 < \dots < i_{|u|} \le |s| \,:\, u = s_{i_1} \dots s_{i_{|u|}}} \lambda_g^{\, i_{|u|} - i_1 + 1 - |u|}$$
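For illustration, the definition above can be evaluated by brute force over short strings (a sketch only; the decay values chosen are arbitrary and the paper instead relies on the efficient dynamic-programming computation). Each matching occurrence contributes one match-decay factor per matched character and one gap-decay factor per skipped position, reproducing the entries of Table 1.

```python
from itertools import combinations

def contribution(u: str, s: str, lam_m: float, lam_g: float) -> float:
    """c_u(s): sum over all (possibly gappy) occurrences of u in s."""
    total = 0.0
    for idx in combinations(range(len(s)), len(u)):
        if "".join(s[i] for i in idx) == u:
            gaps = idx[-1] - idx[0] + 1 - len(u)       # skipped positions in the window
            total += (lam_m ** len(u)) * (lam_g ** gaps)
    return total

def ssk(a: str, b: str, n: int, lam_m: float, lam_g: float) -> float:
    """Brute-force n-th order SSK: sum of c_u(a) * c_u(b) over sub-sequences u."""
    subseqs = {"".join(a[i] for i in idx) for idx in combinations(range(len(a)), n)}
    return sum(contribution(u, a, lam_m, lam_g) * contribution(u, b, lam_m, lam_g)
               for u in subseqs)

# Reproduce one Table 1 entry: c_"ge"("genomes") = lam_m^2 * (1 + lam_g^4).
lam_m, lam_g = 0.5, 0.8
print(contribution("ge", "genomes", lam_m, lam_g))       # 0.25 * (1 + 0.8**4)
print(ssk("genomic", "genomes", 3, lam_m, lam_g))
```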