# variational_search_distributions__ef591816.pdf Published as a conference paper at ICLR 2025 VARIATIONAL SEARCH DISTRIBUTIONS Daniel M. Steinberg, Rafael Oliveira, Cheng Soon Ong & Edwin V. Bonilla Data61, CSIRO, Australia {dan.steinberg, rafael.dossantosdeoliveira, cheng-soon.ong, edwin.bonilla}@data61.csiro.au We develop variational search distributions (VSD), a method for conditioning a generative model of discrete, combinatorial designs on a rare desired class by efficiently evaluating a black-box (e.g. experiment, simulation) in a batch sequential manner. We call this task active generation; we formalize active generation s requirements and desiderata, and formulate a solution via variational inference. VSD uses off-the-shelf gradient based optimization routines, can learn powerful generative models for desirable designs, and can take advantage of scalable predictive models. We derive asymptotic convergence rates for learning the true conditional generative distribution of designs with certain configurations of our method. After illustrating the generative model on images, we empirically demonstrate that VSD can outperform existing baseline methods on a set of real sequence-design problems in various protein and DNA/RNA engineering tasks. 1 INTRODUCTION We consider a variant of the active search problem (Garnett et al., 2012; Jiang et al., 2017; Vanchinathan et al., 2015), where we wish to find members (designs) of a rare desired class in a batch sequential manner with a fixed black-box evaluation (e.g. experiment) budget. We call sequential active learning of a generative model of these designs active generation. Examples of rare designs are compounds that could be useful pharmaceutical drugs, or highly active enzymes for catalyzing chemical reactions. We assume the design space is discrete or partially discrete, high-dimensional and practically innumerable. For example, the number of possible configurations of a single protein is 20O(100) (see, e.g., Sarkisyan et al., 2016). Learning a generative model of these designs allows us to circumvent the need for traversing the whole search space. We are interested in this active generation objective for a variety of reasons. We may wish to study the properties of the fitness landscape (Papkou et al., 2023) to gain a better scientific understanding of a phenomenon such as natural evolution. Or, we may not be able to completely specify the constraints and objectives of a task, but we would like to characterize the space of, and generate new feasible designs. For example, we want enzymes that can degrade plastics in an industrial setting, but we may not yet know the exact conditions (e.g. temperature, p H), some of which may be anticorrelated with enzyme catalytic activity. Alternatively, if we know these multiple objectives and constraints, we may only want to generate designs from a Pareto set. Assuming we can take advantage of a prior distribution over designs, we formulate the search problem as inferring the posterior distribution over rare, desirable designs. Importantly, this posterior can be used for generating new designs. Specifically, we use (black-box) variational inference (VI) (Ranganath et al., 2014), and so refer to our method as variational search distributions (VSD). Our major contributions are: (1) we formulate the batch active generation objective over a (practically) innumerable discrete design space, (2) we present a variational inference algorithm, VSD, which solves this objective, (3) we show that VSD performs well theoretically and empirically, and (4) we connect active generation to other recent advances in black-box optimization (BBO) of discrete sequences that use generative models. VSD uses off-the-shelf gradient based optimization routines, is able to learn powerful generative models, and can take advantage of scalable predictive models. In our experiments we show that VSD can outperform existing baseline methods on a set of real applications. Finally, we evaluate our approach on the related sequential BBO problem, where we want to find the globally optimal design for a specific objective and show competitive performance when compared with state-of-the-art methods, e.g., based on latent space optimization (LSO) (Gruver et al., 2023). Published as a conference paper at ICLR 2025 In this section we formalize our problem and describe its requirements and desiderata. We also develop our proposed solution, based on variational inference, which we will refer to as variational search distributions (VSD). 2.1 THE PROBLEM OF ACTIVE GENERATION We are given a design space X, which can be discrete or mixed discrete-continuous and high dimensional, and where for each instance that we choose x X, we measure some corresponding property of interest (so-called fitness) y R. For example, in our motivating application of DNA/RNA or protein sequences, X = VM, where V is the sequence vocabulary (e.g., amino acid labels, |V| = 20) and M is the length of the sequence. However, we do not limit the application of our method to sequences. Using this framing, a real world experiment (e.g., measuring the activity of an enzyme) can be modeled as an unknown relationship, y = f (x) + ϵ, (1) for some black-box function (the experiment), f , and measurement error ϵ R, distributed according to p(ϵ) with Ep(ϵ)[ϵ] = 0. Instead of modeling the whole space, X, we are only interested in a set of events which we choose based on fitness, S X. In particular for active generation we wish to learn a generative model, q(x), that only returns samples x(s) S by efficiently querying the black-box function in Equation 1. We assume that S are rare events in a high dimensional space, and that we have access to a prior belief, p(x), which helps narrow in on this subset of X. We are given an initial dataset, DN := {(yn, xn)}N n=1, which may contain only a few instances of xn S. Given p(x) and DN we aim to generate batches of unique candidates, {xbt}B b=1, for black-box (experimental) evaluation in a series of rounds, t {1, . . . , T}, where B = O(1000) and we desire xbt S. After each round DN is augmented with the black-box results of the batch, i.e. DN DN {(xbt, ybt)}B b=1. As we shall see later, our solution allows us to satisfy the following requirements and additional desiderata for active generation. Requirements & Desiderata. Active generation requirements (R) and other desiderata (D). (R1) Rare feasible designs, S, are rare events in X that need to be identified (R2) Sequential non-myopic candidate generation, x S, for sequential black-box evaluation (R3) Discrete search over (combinatorially) large design spaces, e.g. x X = VM (R4) Batch generation of up to O(1000) diverse candidate designs per round (R5) Generative models, x(s) q(x), that are taskspecific for rare, feasible designs (D1) Guaranteed convergence for certain choices of priors, variational distributions and predictive models (D2) Gradient based optimization strategies for candidate searching (D3) Scalable predictive models that enable highthroughput evaluation/experiments. Like active search (Garnett et al., 2012) in our case we are interested in the solution space of the super level-set, SSLS := {x : f (x) > τ} for a threshold τ R (e.g., wild-type fitness). As we only have access to noisy measurements, y, our task is to estimate the super level-set distribution, p(x|y > τ), using active generation. Estimating this distribution over SSLS is computationally and statistically challenging and, therefore, we cast this as a variational inference problem. We also consider the case of BBO for which SBBO := argmaxx f (x), and we show that we can accommodate this in our variational framework by iteratively raising τt per round, t. We visualize the properties and models involved in active generation as applied to a continuous fitness landscape in Figure 1. 2.2 VARIATIONAL SEARCH DISTRIBUTIONS We cast the estimation of p(x|y > τ) as a sequential optimization problem. A suitable objective for a round, t, is to minimize a divergence, ϕ t = argmin ϕ D[p(x|y > τ) q(x|ϕ)] (2) Published as a conference paper at ICLR 2025 (a) argmaxx f (x) (b) {x : f (x) > τ} (d) p(x|y > τ) Figure 1: Fitness landscape properties and models. (a) A noise-less fitness landscape, f (x), and the maximum fitness design, SBBO = {x }, as the white . (b) The super level-set, SSLS, of all fit designs as the white hatched area. (c) Prior belief p(x). (d) The density/mass function of the super level-set, p(x|y > τ), as blue contours. Our goal is to sequentially estimate a generative model for the distribution of the super level-set (d). We assume a noisy relationship between f and y, so the super level-set will not have a hard boundary, and p(x|y > τ) will be defined over all X. where q(x|ϕ) is a parameterized distribution from which we sample candidate designs xbt, (R5), and which we aim to match to p(x|y > τ). The difficulty is that we cannot directly evaluate or empirically sample from p(x|y > τ). However, if we consider the reverse Kullback-Leibler (KL) divergence, argmin ϕ DKL[q(x|ϕ) p(x|y > τ)] = argmin ϕ Eq(x|ϕ) p(x) log p(y > τ|x) , (3) where we have expanded p(x|y > τ) using Bayes rule and dropped the constant term p(y > τ), we note that we no longer require evaluation of p(x|y > τ) directly. We recognize the right-hand side of Equation 3 as the well known (negative) variational evidence lower bound (ELBO), LELBO(ϕ) := Eq(x|ϕ)[log p(y > τ|x)] DKL[q(x|ϕ) p(x)] . (4) For this we assume access to a prior distribution over the space of designs, p(x), that may be informed from the data at hand (or pre-trained). Henceforth, as we will develop a sequential algorithm, we will denote this prior as p(x|D0). We note the relationship between log p(y > τ|x) and the probability of improvement (PI) acquisition function from Bayesian optimization (BO) (Kushner, 1964), log p(y > τ|x) log p(y > τ|x, DN) = log Ep(y|x,DN)[1[y > τ]] = log αP I(x, DN, τ) . (5) Here 1 : {false, true} {0, 1} is the indicator function and p(y|x, DN) is typically estimated using the posterior predictive distribution of a Gaussian process (GP) given data, DN. So p(y > τ|x, DN) = Ψ((µN(x) τ)/σN(x)), where Ψ( ) is a cumulative standard normal distribution function, and µN(x), σ2 N(x) are the posterior predictive mean and variance, respectively, of the GP. We refer to this estimation strategy as GP-PI, and rewrite the ELBO accordingly, LELBO(ϕ, τ, DN) = Eq(x|ϕ)[log αP I(x, DN, τ)] DKL[q(x|ϕ) p(x|D0)] . (6) We refer to the method that maximizes the objective in Equation 6 as variational search distributions (VSD), since we are using the variational posterior distribution as a means of searching the space of fit designs, satisfying (R1), (R2) and (R4). It is well known that when the true posterior is a member of the variational family indexed by ϕ, the above variational inference procedure has the potential to recover the exact posterior distribution. To recommend candidates for black-box evaluation we sample a set of designs from our search distribution each round, b=1 q(x|ϕ t ), where ϕ t = argmax ϕ LELBO(ϕ, τ, DN) . (7) We discuss the relationship between VSD and BO in Appendix G. In general, because of the discrete combinatorial nature of our problem, we cannot use the re-parameterization trick (Kingma & Welling, 2014) to estimate the gradients of the ELBO straightforwardly. Instead, we use the score function gradient, also known as REINFORCE (Williams, 1992; Mohamed et al., 2020) with standard gradient descent methods (D2) such as Adam (Kingma & Ba, 2014), ϕLELBO(ϕ, τ, DN) = Eq(x|ϕ)[(log αP I(x, DN, τ) log q(x|ϕ) + log p(x|D0)) ϕ log q(x|ϕ)] . (8) Published as a conference paper at ICLR 2025 Here we use Monte-Carlo sampling to approximate the expectation with a suitable variance reduction scheme, such as control variates (Mohamed et al., 2020). We find that the exponentially smoothed average of the ELBO works well in practice, and is the same strategy employed in Daulton et al. (2022). VSD implements black-box variational inference (Ranganath et al., 2014) for parameter estimation, and despite the high-dimensional nature of X, we find we only need O(1000) samples to estimate the required expectations for ELBO optimization on problems with M = O(100), satisfying (R3). Note that Equation 6 8 do not involve any data (DN) directly, only indirectly through the acquisition function. Hence the scalability of VSD is dependent on the complexity of training the underlying estimator of p(y|x, DN). 2.3 CLASS PROBABILITY ESTIMATION So far our method indirectly computes PI by transforming the predictions of a GP surrogate model, p(y|x, DN), as in Equation 5. Instead we may choose to follow the reasoning used by Bayesian optimization by density-ratio estimation (BORE) in Tiao et al. (2021); Oliveira et al. (2022); Song et al. (2022), and directly estimate the quantity we care about, p(y > τ|x, DN). This can be accomplished using class probability estimation (CPE) on the labels z := 1[y > τ] {0, 1} so p(y > τ|x, DN) = p(z = 1|x, DN) πθ(x), where πθ : X [0, 1]. We can estimate the class probabilities using a proper scoring rule (Gneiting & Raftery, 2007) such as log-loss, LCPE(θ, Dz N) := 1 n=1 zn log πθ(xn) + (1 zn) log(1 πθ(xn)), (9) where Dz N = {(zn, xn)}N n=1. The VSD objective and gradient estimator using CPE then become, LELBO(ϕ, θ) = Eq(x|ϕ)[log πθ(x)] DKL[q(x|ϕ) p(x|D0)] , (10) ϕLELBO(ϕ, θ) = Eq(x|ϕ)[(log πθ(x) log q(x|ϕ) + log p(x|D0)) ϕ log q(x|ϕ)] . (11) into which we plug θ t = argminθ LCPE(θ, Dz N). We refer to this strategy as CPE-PI. Using a CPE enables the use of more scalable estimators than GP-PI, satisfying our desideratum (D3). This is crucial if we choose to run more than a few rounds of experiments with B = O(1000). Since VSD is a black-box method we may choose to use CPEs that are non-differentiable, such as decision tree ensembles. The complete VSD algorithm is given in Algorithm 1 and depicted in Figure B.1. We have allowed for a threshold function, τt = fτ({y : y DN}, γt), that can be used to modify the threshold each round. For example, an empirical quantile function τt = ˆQy(γt) where γt (0, 1) as in Tiao et al. (2021). Or a constant τ for estimating a constant distribution of the super level-set. Algorithm 1 VSD optimization loop with CPE-PI. Require: Threshold function fτ and γ1, dataset DN, black-box f , prior p(x|D0), CPE πθ(x), variational family q(x|ϕ), budget T and B. 1: function FITMODELS(DN, τ) 2: Dz N {(zn, xn)}N n=1, where zn = 1[yn > τ] 3: θ argminθ LCPE(θ, Dz N) 4: ϕ argmaxϕ LELBO(ϕ, θ ) 5: return ϕ , θ 6: for round t {1, . . . , T} do 7: τt fτ({y : y DN}, γt) 8: ϕ t , θ t FITMODELS(DN, τt) 9: {xbt}B b=1 q(x|ϕ t ) 10: {ybt}B b=1 {f (xbt) + ϵbt}B b=1 11: DN DN {(xbt, ybt)}B b=1 12: τ fτ({y : y DN}, γ ) 13: ϕ , θ FITMODELS(DN, τ ) 14: return ϕ , θ 2.4 THEORETICAL ANALYSIS We show that VSD sampling distributions converge to a target distribution that characterizes the level set given by τ, satisfying (D1) in two general settings. We first derive results assuming f is drawn Published as a conference paper at ICLR 2025 from a Gaussian process, i.e., f GP(0, k), with a positive-semidefinite covariance (or kernel) function k : X X R (Appendix E), using GP-PI as the CPE for VSD. These results are then extended to probabilistic classifiers based on wide neural networks (NNs) (Appendix F) by means of the neural tangent kernel (NTK) for the given architecture (Jacot et al., 2018). For the analysis, we set B = 1 and N = t, though having B > 1 should improve the rates by a multiplicative factor. Theorem 2.1. Under mild assumptions (E.1 to E.5), the variational distribution of VSD equipped with GP-PI converges to the level-set distribution in probability at the following rate: DKL[p(x|y > τt, Dt) p(x|y > τt, f )] OP(t 1/2) . (12) This result is based on showing that the GP posterior variance vanishes at an optimal rate of O(t 1) in our setting (Lemma E.5). We also analyze the rate at which VSD finds feasible designs, or hits , compared to an oracle with full knowledge of f . After T rounds, the number of hits found by VSD is HT = PT t=1 1[yt > τt 1], where yt follows Equation 1 and xt p(x|y > τt 1, Dt 1). The number of hits, H T , from an agent that fully knows f is the same but for generating conditioned on f with xt p(x|y > τt 1, f ). Using this definition and Theorem 2.1, we prove the following. Corollary 2.1. Under the settings in Theorem 2.1, we also have that: E[|HT H T |] O( E[HT ] is related to the empirical recall measure (18) up to the normalization constant, but it does not account for repeated hits, which are treated as false discoveries (false positives) under recall. Lastly, for NN-based CPEs, we obtain convergence rates dependent on the spectrum of the NTK (Proposition F.2), which we instantiate for infinitely wide Re LU networks below. For the full results and proofs, please see Appendix E for the GP-based analysis and Appendix F for the NTK results. Corollary 2.2. Let πθ be modeled via a fully connected Re LU network. Then, under assumptions on identifiability and sampling (F.1 to F.6), in the infinite-width limit, VSD with CPE-PI achieves: DKL[p(x|y > τt, Dt) p(x|y > τt, f )] e OP t 1 2(M+1) . (14) This result finally indicates that, when equipped with flexible NN-based CPEs, VSD is also capable of recovering the target distribution for arbitrary sequence lengths in combinatorial problems. 3 RELATED WORK We will consider related work firstly in terms of methods that have similar components to VSD, then secondly in terms of related problems to our specification of active generation. VSD can be viewed as one of many methods that makes use of the variational bound (Staines & Barber, 2013), max x f (x) max ϕ Eq(x|ϕ)[f (x)] . (15) The maximum is always greater than or equal to the expected value of a random variable. This bound is useful for black-box optimization (BBO) of f , and becomes tight if q(x|ϕ) δ(x ). See Appendix G for more detail and VSD s relation to BO. Other well known methods that make use of this bound are natural evolution strategies (NES) (Wierstra et al., 2014), variational optimization (VO) (Staines & Barber, 2013; Bird et al., 2018), estimation of distribution algorithms (EDA) (Larra naga & Lozano, 2001; Brookes et al., 2020), and Bayesian optimization with probabilistic reparametrization (BOPR) (Daulton et al., 2022). For learning the parameters of the variational distribution, ϕ, they variously make use of maximum likelihood estimation or the score function gradient estimator (REINFORCE) (Williams, 1992). Algorithms that explicitly modify Equation 15 to stop the collapse of q(x|ϕ) to a point mass for batch design include design by adaptive sampling (Db AS) (Brookes & Listgarten, 2018) and conditioning by adaptive sampling (Cb AS) (Brookes et al., 2019). They use fixed samples x(s) from q(x|ϕ t 1) for approximating the expectation, and then optimize ϕ using a weighted maximum-likelihood or variational style procedure. Though Db AS and Cb AS were formulated for offline (non-sequential) tasks, they have often been used in a sequential setting (Ren et al., 2022). We can take a unifying view of algorithms that use a surrogate model for f by recognizing the general gradient estimator, Eq(x|ϕ )[w(x) ϕ log q(x|ϕ)] . (16) Published as a conference paper at ICLR 2025 Method w(x) ϕ Fixed x(s) q(x|ϕ ) per round? VSD log πθ (x) + log p(x|D0) log q(x|ϕ) ϕ No (REINFORCE) Cb AS πθ (x)p(x|D0)/q(x|ϕ t 1) ϕ t 1 Yes (importance Monte Carlo) Db AS πθ (x) ϕ t 1 Yes (Monte Carlo) BORE πθ (x) ϕ No (REINFORCE) BOPR α(x, DN) ϕ No (REINFORCE) Table 1: How related methods can be adapted from Equation 16. VSD, Cb AS and Db AS may also use a cumulative distribution representation of αPI(x, DN, τ) in place of πθ (x). where we give each component in Table 1. For our experiments BORE has been adapted to discrete X by using the score function gradient estimator, which we denote by BORE , while Cb AS and Db AS have been adapted to use a CPE their original derivations use a PI acquisition function. A number of finite horizon methods have been applied to biological sequence BBO tasks, such as Amortized BO (Swersky et al., 2020), GFlow Nets (Jain et al., 2022), and the reinforcement learning based Dyna PPO (Angermueller et al., 2019). LSO-like methods (G omez-Bombarelli et al., 2018; Tripp et al., 2020; Stanton et al., 2022; Gruver et al., 2023) tackle optimization of sequences by encoding them into a continuous latent space within which candidate optimization or generation takes place. Selected candidates are decoded back into sequences before black box evaluation; see Gonz alez-Duque et al. (2024) for a comprehensive survey. VSD does not require a latent space nor an encoder, and as such can be seen as an amortized variant of probabilistic reparameterisation methods (Daulton et al., 2022) or continuous relaxations (Michael et al., 2024). Heuristic stochastic search methods such as Ada Lead (Sinai et al., 2020) and proximal exploration (PEX) (Ren et al., 2022) have also demonstrated strong empirical performance on these tasks. We compare the properties of the most relevant methods to our problem in Table 2. In contrast to just finding the maximum using BBO, active generation considers another problem generating samples from a rare set of feasible solutions. Generation methods that estimate the super level-set distribution, p(x|y > τ), include Cb AS, which optimizes the forward KL divergence, DKL[p(x|y > τ) q(x|ϕ)] using importance weighted cross entropy estimation (Rubinstein, 1999). Batch-BORE (Oliveira et al., 2022) also optimizes the reverse KL divergence and uses CPE, but with Stein variational inference (Liu & Wang, 2016) for diverse batch candidates (with a continuous relaxation for discrete variables). There is a rich literature on the related task of active learning and BO for level set estimation (LSE) (Bryan et al., 2005; Gotovos et al., 2013; Bogunovic et al., 2016; Zhang et al., 2023a). However, we focus on learning a generative model of a discrete space. For active generation VSD, Cb AS and Db AS all use an acquisition function defined in the original domain, X, to weight gradients (see Equation 16) for learning a conditional generative model, from which xbt are sampled. An alternative is to use guided generation, that is to train an unconditional generative model, and then have a discriminative model guide (condition) the samples from the unconditional model at test time. This plug-and-play of a discriminative model has shown promise for controlled image and text generation of pre-trained models (Nguyen et al., 2017; Dathathri et al., 2020; Li et al., 2022; Zhang et al., 2023b). La MBO (Stanton et al., 2022) and La MBO-2 (Gruver et al., 2023) take a guided generation approach to solve the active generation problem. La MBO uses an (unconditional) masked language model auto-encoder, and then optimizes sampling from its latent space using an acquisition function as a guide. La MBO-2 takes a similar approach, but uses a diffusion process as the unconditional model, and modifies a Langevin sampling de-noising process with an acquisition function guide. 4 EXPERIMENTS Firstly we test our method, VSD, on its ability to generate complex, structured designs, x, in a single round by training it to generate a subset of handwritten digits from flattened MNIST images (Le Cun et al., 1998) in Sec. 4.1. We then compare VSD on two sequence design tasks against existing baseline methods. The first of these tasks (Sec. 4.2) is to generate as many unique, fit sequences as possible using the datasets DHFR (Papkou et al., 2023), Trp B (Johnston et al., 2024) and TFBIND8 (Barrera et al., 2016). These datasets contain near complete evaluations of X, and to our knowledge DHFR and Trp B are novel in the machine learning literature. The second (Sec. 4.3) is a more traditional black-box optimization task of finding the maximum of an unknown function; using Published as a conference paper at ICLR 2025 Rare x S (R1) Sequential (R2) Discrete X (R3) Batch {xbt}B b=1 (R4) Generative q(x|ϕ) (R5) Guaranteed (D1) Gradient descent (D2) Scalable (D3) General acq./reward fn. Amortization BOPR (Daulton et al., 2022) BORE (Tiao et al., 2021) Batch BORE (Oliveira et al., 2022) Db AS (Brookes & Listgarten, 2018) Cb AS (Brookes et al., 2019) Amortized BO (Swersky et al., 2020) GFlow Nets (Jain et al., 2022) Dyna PPO (Angermueller et al., 2019) Ada Lead (Sinai et al., 2020) PEX (Ren et al., 2022) GGS (Kirjner et al., 2024) LSO e.g. (Tripp et al., 2020) La MBO (Stanton et al., 2022) La MBO-2 (Gruver et al., 2023) VSD (ours) Table 2: Feature table of competing methods: has feature, does not have feature, partially has feature, or requires only simple modification. We follow Swersky et al. (2020) in their definition of amortization referring to the ability to use q(x|ϕ t 1) for warm-starting the optimization of ϕt. datasets AAV (Bryant et al., 2021), GFP (Sarkisyan et al., 2016) and the biologically inspired Ehrlich functions (Stanton et al., 2024). The corresponding datasets involve |V| {4, 20}, 4 M 237 and 65, 000 < |X| < 20237. We discuss the settings and properties of these datasets in greater detail in Appendix C. For the biological sequence experiments we run a predetermined number of experimental rounds, T = 10 or 32 for the Ehrlich functions. We set the batch size to B = 128, and use five different seeds for random initialization. We compare against Db AS (Brookes & Listgarten, 2018), Cb AS (Brookes et al., 2019), Ada Lead (Sinai et al., 2020), and PEX (Ren et al., 2022) all of which we have adapted to use a CPE, BORE (Tiao et al., 2021) which we have adapted to use the score function gradient estimator, and a na ıve baseline that uses random samples from the prior, p(x|D0). To reduce confounding, all methods share the same surrogate model, acquisition functions, priors and variational distributions where possible. We compare against La MBO-2 (Gruver et al., 2023) on the Ehrlich functions, it uses its own surrogate and generative models. 4.1 CONDITIONAL GENERATION OF HANDWRITTEN DIGITS Our motivating application for VSD is to model the space of fit DNA and protein sequences, which are string-representations of complex 3-dimensional structures. In this experiment we aim to demonstrate, by analogy, that VSD can generate sequences that represent 2-dimensional structures. For this task, we have chosen to unroll (reverse the order of every odd row, and flatten) down-scaled (14 14 pixel, 8-bit) MNIST (Le Cun et al., 1998) images into sequences, x, where M = 196 and |V| = 8. We then train long short-term memory (LSTM) recurrent neural network (RNN) and decoder-only causal transformer generative models on the entire MNIST training set by maximum likelihood (ML). These generative distributions are used as the prior models, p(x|D0), for VSD and we detail their form in Appendix C.3. The task is then to use VSD in one round to estimate the pos- (a) LSTM Prior (b) Transformer Prior (c) LSTM Posterior (d) Transformer Posterior Figure 2: (a) and (b) are samples from the LSTM and transformer priors, respectively. (c) and (d) show samples from the LSTM and transformer VSD variational distributions respectively. We also report the samples mean scores according to the CPE probabilities. Published as a conference paper at ICLR 2025 terior p(x|y {3, 5}) using a CPE trained on labels zn = 1[yn {3, 5}]. We use a convolutional architecture for the CPE given in Appendix C.4, and it achieves a test balanced accuracy score of 99%. We parameterize the variational distributions, q(x|ϕ), in the same way as the priors, and initialize these distribution parameters from the prior distribution parameters. During training with ELBO the prior distribution parameters are locked, and we run training for 5000 iterations. This is exactly lines 8 and 9 in Algorithm 1. Samples are visualized from the resulting variational distributions with the corresponding priors in Figure 2. We see that the prior LSTM and transformer are able to generate convincing digits once the sampled sequences are re-rolled , and that VSD is able to effectively refine these distributions, even though it does not have access to any data directly only scores from the CPE. Both the LSTM and transformer yield qualitatively similar results, and have similar mean scores from the CPE. 4.2 FITNESS LANDSCAPES In this setting we wish to find fit sequences x SSLS, so we fix τ over all rounds. We only consider the combinatorially (near) complete datasets to avoid any pathological behavior from relying on machine learning oracles (Surana et al., 2024). Results are presented in Figure 3. The primary measures by which we compare methods are precision, recall and performance, Precisiont = 1 min{t B, |S|} b=1 1[ybr > τ] 1[xbr / X q b 1,r], (17) Recallt = 1 min{TB, |S|} b=1 1[ybr > τ] 1[xbr / X q b 1,r], (18) Performancet = b=1 ybr 1[xbr / X q b 1,r]. (19) Here X q br X is the set of experimentally queried sequences by the bth batch member of the rth round, including the initial training set. These measures are comparable among probabilistic and non probabilistic methods. Precision and recall measure the ability of a method to efficiently explore S, where min{t B, |S|} is the size of the selected set at round t (bounded by the number of good solutions), and min{TB, |S|} is the number of positive elements possible in the experimental budget. Performance measures the cumulative fitness of the unique batch members, but unlike Jain et al. (2022) we do not normalize this measure. For exact experimental settings we refer the reader to Appendix C.1. We set τ to be that of the wild-type sequences in the DHFR and Trp B datasets, and use τ = 0.75 for TFBIND8. We find that a uniform prior over sequences, and a mean field variational distribution (Equation 22) are adequate for these experiments, as is a simple MLP for the CPE. Results are presented in Figure 3. VSD is the best performing method by most of the measures. We have found the Ada Lead and PEX evolutionary-search based methods to be effective on lower-dimensional problems (TFBIND8 being the lowest here), however we consistently observe their performance degrading as the dimension of the problem increases. We suspect this is a direct consequence of their random mutation strategies being suited to exploration in low dimensions, but less efficient in higher dimensions compared to the learned generative models employed by VSD, Cb AS, and Db AS. Our modified version of BORE (which is just the expected log-likelihood component of Equation 10) performs badly in all cases, and this is a direct consequence of its variational distribution collapsing to a point mass. In a non-batch setting this behavior is not problematic, but shows the importance of the KL divergence of VSD in this batch setting. We replicate these experiments in Appendix D.1 using GP-PI, also backed by our guarantees. In all cases VSD s results remain similar or improve slightly, whereas the other methods results remain similar or degrade. We report on batch diversity scores in Appendix D.3. 4.3 BLACK-BOX OPTIMIZATION In this experiment we use VSD on the related task of BBO for finding SBBO. We set τt adaptively by specifying it as an empirical quantile Qt y of the observed target values at round t, τt = Qt y(γt=pη t 1) (20) Published as a conference paper at ICLR 2025 (c) TFBIND8 Figure 3: Fitness landscape results. Precision (Equation 17), recall (Equation 18) and performance (Equation 19) higher is better for the combinatorially (near) complete datasets, DHFR, Trp B and TFBIND8. The random method is implemented by drawing B samples uniformly. where pt 1 is a percentile from the previous round, and η [0, 1] is an annealing parameter for τt that trades off exploration and exploitation. Performance is measured by simple regret rt, which quantifies the fitness gap between the globally optimal design and the best design found, rt = y max y {ybi}B,t b=1,i=1. (21) Here y is the fitness value of the globally optimal sequence x . We use the higher dimensional AAV (y =19.54), GFP (y =4.12) and Ehrlich functions (y =1) datasets/benchmarks to show that VSD can scale to higher dimensional problems. X of AAV and GFP is completely intractable to fully explore experimentally, and so we use a predictive oracle trained on all the original experimental data as the ground-truth black-box function. We use the CNN-based oracles from Kirjner et al. (2024) for these experiments. However, we note here that some oracles used in these experiments do not predict well out-of-distribution (Surana et al., 2024), which limits their real-world applicability. The Ehrlich functions (Stanton et al., 2024) are challenging biologically inspired closed-form simulations that cover all X. We compare against a genetic algorithm (GA), Cb AS and La MBO-2 (Gruver et al., 2023) for sequences of length M = {15, 32, 64} using the POLI and POLI-BASELINES benchmarks and baselines software (Gonz alez-Duque et al., 2024). For these experiments we use CNNs for the CPEs all experimental settings are in Appendix C.2. The results are summarized in Figure 4 and 5. Batch diversity scores for these experiments are presented in Appendix D.3, and for HOLO Ehrlich function implementation results see Appendix D.2. VSD is among the leading methods for all experiments. VSD takes better advantage of the more complex variational distributions than Cb AS and Db AS since it can sample from the adapted variational distribution while learning it. We can see that Ada Lead, PEX and often BORE all perform worse than random for reasons previously mentioned. Simple regret can drop below zero for AAV & GFP since an oracle is used as the black box function, but the global maximizer is taken from the experimental data. VSD outperforms Cb AS on the Ehrlich function benchmarks, and is competitive with La MBO-2. We also present an ablation study in Appendix D.4. Published as a conference paper at ICLR 2025 (a) Independent (c) Transformer Figure 4: AAV & GFP BBO results. Simple regret (Equation 21) lower is better on GFP and AAV with independent and auto-regressive variational distributions. The PEX and Ada Lead results are replicated between the plots, since they are unaffected by choice of variational distribution. Figure 5: Ehrlich function (POLI implementation) BBO results. VSD and Cb AS with different variational distributions; mean field (MF), LSTM and transformer (TFM), compared against genetic algorithm (GA) and La MBO-2 baselines. 5 CONCLUSION We have presented the problem of active generation sequentially learning a generative model for designs of a rare class by efficiently evaluating a black-box function and a method for efficiently generating samples which we call variational search distributions (VSD). Underpinned by variational inference, VSD satisfies critical requirements and important desiderata, and we show that VSD converges asymptotically to the true level-set distribution at the same rate as a Monte-Carlo estimator with full knowledge of the true distribution. We showcased the benefits of our method empirically on a set of combinatorially complete and high dimensional sequential-design biological problems and show that it can effectively learn powerful generative models of fit designs. There is a close connection between active generation and black-box optimization, and with the advent of powerful generative models we hope that our explicit framing of generation of fit sequences will lead to further study of this connection. Finally, our framework can be generalized to more complex application scenarios, involving learning generative models over Pareto sets, SPareto, in a multi-objective setting, or other challenging combinatorial optimization problems (Bengio et al., 2021), such as graph structures (Annadani et al., 2023), and mixed discrete-continuous variables. All of which are worth investigating as future work directions. For the code implementing the models and experiments in this paper, please see https://github.com/csiro-funml/variationalsearch. Published as a conference paper at ICLR 2025 ACKNOWLEDGMENTS This work is funded by the CSIRO Science Digital and Advanced Engineering Biology Future Science Platforms, and was supported by resources and expertise provided by CSIRO IMT Scientific Computing. We would like to thank the anonymous reviewers, and especially reviewer xamp for the constructive feedback and advice, which greatly increased the relevance and quality of this work. Yasin Abbasi-Yadkori. Online Learning for Linearly Parametrized Control Problems. Phd, University of Alberta, 2012. Christof Angermueller, David Dohan, David Belanger, Ramya Deshpande, Kevin Murphy, and Lucy Colwell. Model-based reinforcement learning for biological sequence design. In International conference on learning representations (ICLR), 2019. Yashas Annadani, Nick Pawlowski, Joel Jennings, Stefan Bauer, Cheng Zhang, and Wenbo Gong. Bayes DAG: Gradient-based posterior inference for causal discovery. In Thirty-seventh Conference on Neural Information Processing Systems (Neur IPS), 2023. Maximilian Balandat, Brian Karrer, Daniel R. Jiang, Samuel Daulton, Benjamin Letham, Andrew Gordon Wilson, and Eytan Bakshy. Bo Torch: A Framework for Efficient Monte-Carlo Bayesian Optimization. In Advances in Neural Information Processing Systems 33, 2020. Luis A Barrera, Anastasia Vedenko, Jesse V Kurland, Julia M Rogers, Stephen S Gisselbrecht, Elizabeth J Rossin, Jaie Woodard, Luca Mariani, Kian Hong Kock, Sachi Inukai, et al. Survey of variation in human transcription factors reveals prevalent dna binding changes. Science, 351 (6280):1450 1454, 2016. David G. T. Barrett and Benoit Dherin. Implicit gradient regularization. In International Conference on Learning Representations (ICLR). Open Review, 2021. Heinz Bauer. Probability theory and elements of measure theory. Academic Press, 2nd edition, 1981. Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: A methodological tour d horizon. European Journal of Operational Research, 290(2): 405 421, 2021. ISSN 0377-2217. Thomas Bird, Julius Kunze, and David Barber. Stochastic variational optimization. ar Xiv preprint ar Xiv:1809.04855, 2018. Ilija Bogunovic, Jonathan Scarlett, Andreas Krause, and Volkan Cevher. Truncated variance reduction: A unified approach to bayesian optimization and level-set estimation. Advances in neural information processing systems, 29, 2016. St ephane Boucheron, G abor Lugosi, and Pascal Massart. Concentration inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. David Brookes, Hahnbeom Park, and Jennifer Listgarten. Conditioning by adaptive sampling for robust design. In International conference on machine learning, pp. 773 782. PMLR, 2019. David Brookes, Akosua Busia, Clara Fannjiang, Kevin Murphy, and Jennifer Listgarten. A view of estimation of distribution algorithms through the lens of expectation-maximization. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion, pp. 189 190, 2020. David H Brookes and Jennifer Listgarten. Design by adaptive sampling. ar Xiv preprint ar Xiv:1810.03714, 2018. Brent Bryan, Robert C Nichol, Christopher R Genovese, Jeff Schneider, Christopher J Miller, and Larry Wasserman. Active learning for identifying function threshold boundaries. Advances in neural information processing systems, 18, 2005. Published as a conference paper at ICLR 2025 Drew H Bryant, Ali Bashir, Sam Sinai, Nina K Jain, Pierce J Ogden, Patrick F Riley, George M Church, Lucy J Colwell, and Eric D Kelsic. Deep diversification of an aav capsid protein by machine learning. Nature Biotechnology, 39(6):691 696, 2021. Lin Chen and Sheng Xu. Deep neural tangent kernel and laplace kernel have the same RKHS. In International Conference on Learning Representations, 2021. Sayak Ray Chowdhury and Aditya Gopalan. On Kernelized Multi-armed Bandits. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 2017. Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations, 2020. Samuel Daulton, Xingchen Wan, David Eriksson, Maximilian Balandat, Michael A Osborne, and Eytan Bakshy. Bayesian optimization over discrete and mixed spaces via probabilistic reparameterization. Advances in Neural Information Processing Systems (Neur IPS), 35:12760 12774, 2022. Lester E. Dubins and David A. Freedman. A sharper form of the Borel-Cantelli lemma and the strong law. The Annals of Mathematical Statistics, 36(3):800 807, 1965. Audrey Durand, Odalric-Ambrym Maillard, and Joelle Pineau. Streaming kernel regression with provably adaptive mean, variance, and regularization. Journal of Machine Learning Research, 19, 2018. Rick Durrett. Probability: theory and examples. Cambridge University Press, New York, NY, 5th edition, 2019. Ronen Eldan, Dan Mikulincer, and Tselil Schramm. Non-asymptotic approximations of neural networks by Gaussian processes. In 34th Annual Conference on Learning Theory (COLT), volume 134 of Proceedings of Machine Learning Research, 2021. Henry E. Fleming. Equivalence of regularization and truncated iteration in the solution of ill-posed image reconstruction problems. Linear Algebra and Its Applications, 130(C):133 150, 1990. E. Garc ıa-Portugu es. Notes for Nonparametric Statistics. Bookdown, 2024. URL https:// bookdown.org/egarpor/NP-UC3M/. Version 6.9.1. ISBN 978-84-09-29537-1. Roman Garnett. Bayesian optimization. Cambridge University Press, 2023. Roman Garnett, Yamuna Krishnamurthy, Xuehan Xiong, Jeff G. Schneider, and Richard P. Mann. Bayesian optimal active search and surveying. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012. icml.cc / Omnipress, 2012. Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477):359 378, 2007. Rafael G omez-Bombarelli, Jennifer N Wei, David Duvenaud, Jos e Miguel Hern andez-Lobato, Benjam ın S anchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Al an Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 4(2):268 276, 2018. Miguel Gonz alez-Duque, Richard Michael, Simon Bartels, Yevgen Zainchkovskyy, Søren Hauberg, and Wouter Boomsma. A survey and benchmark of high-dimensional Bayesian optimization of discrete sequences. ar Xiv preprint ar Xiv:2406.04739, 2024. Miguel Gonz alez-Duque, Simon Bartels, and Richard Michael. poli: a libary of discrete sequence objectives, January 2024. URL https://github.com/ Machine Learning Life Science/poli. Alkis Gotovos, Nathalie Casati, Gregory Hitz, and Andreas Krause. Active learning for level set estimation. In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, pp. 1344 1350, 2013. Published as a conference paper at ICLR 2025 Nate Gruver, Samuel Stanton, Nathan Frey, Tim GJ Rudner, Isidro Hotzel, Julien Lafrance-Vanasse, Arvind Rajpal, Kyunghyun Cho, and Andrew G Wilson. Protein design with guided discrete diffusion. Advances in neural information processing systems (Neur IPS), 36, 2023. Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359 366, 1989. Jiri Hron, Yasaman Bahri, Jascha Sohl-Dickstein, and Roman Novak. Infinite attention: NNGP and NTK for deep attention networks. In Proceedings of the 37th International Conference on Machine Learning (ICML). PMLR 108, 2020. Arthur Jacot, Franck Gabriel, and Cl ement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, Montreal, Canada, 2018. Moksh Jain, Emmanuel Bengio, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Bonaventure FP Dossou, Chanakya Ajit Ekbote, Jie Fu, Tianyu Zhang, Michael Kilgour, Dinghuai Zhang, et al. Biological sequence design with gflownets. In International Conference on Machine Learning, pp. 9786 9801. PMLR, 2022. Shali Jiang, Gustavo Malkomes, Geoff Converse, Alyssa Shofner, Benjamin Moseley, and Roman Garnett. Efficient nonmyopic active search. In International Conference on Machine Learning, pp. 1714 1723. PMLR, 2017. Kadina E. Johnston, Patrick J. Almhjell, Ella J. Watkins-Dulaney, Grace Liu, Nicholas J. Porter, Jason Yang, and Frances H. Arnold. A combinatorially complete epistatic fitness landscape in an enzyme active site. Proceedings of the National Academy of Sciences, 121(32):e2400439121, 2024. doi: 10.1073/pnas.2400439121. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Yoshua Bengio and Yann Le Cun (eds.), 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. Andrew Kirjner, Jason Yim, Raman Samusevich, Shahar Bracha, Tommi S Jaakkola, Regina Barzilay, and Ila R Fiete. Improving protein optimization with smoothed fitness landscapes. In The Twelfth International Conference on Learning Representations, 2024. Harold J Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 1964. Pedro Larra naga and Jose A Lozano. Estimation of distribution algorithms: A new tool for evolutionary computation, volume 2. Springer Science & Business Media, 2001. Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In 33rd Conference on Neural Information Processing Systems (Neur IPS 2019), Vancouver, Canada, 2019. Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusionlm improves controllable text generation. Advances in Neural Information Processing Systems, 35:4328 4343, 2022. Chaoyue Liu, Libin Zhu, and Mikhail Belkin. On the linearity of large non-linear models: When and why the tangent kernel is constant. In Advances in Neural Information Processing Systems, 2020. Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. Advances in neural information processing systems, 29, 2016. Published as a conference paper at ICLR 2025 Odalric-Ambrym Maillard. Self-normalization techniques for streaming confident regression. HAL, τ). It uses samples from the current learned approximation of this distribution, q(x|ϕ t ), for proposing candidates to evaluate each round. Unmodified batch BO in the discrete setting without any specialization requires a list of candidates, from which a batch of candidates are selected per round using a surrogate model with a batch acquisition function, e.g. see Wilson et al. (2017). The surrogate model s hyper-parameters, θ, are estimated by minimizing negative log marginal likelihood, LNLML(θ, DN). Mechanistically, active generation learns a generative model of valuable candidates to circumvent the requirement of having to select candidates from a list. This is especially important for searching over the space of sequences, X, where enumerating feasible candidates is often intractable. Furthermore, active generation naturally lends itself to large batch sizes, while explicit batch optimization is often computationally intensive and is limited to O(10) candidates per batch Wilson et al. (2017). As mentioned in section 3, alternatives to active generation for specializing batch BO to the discrete domain also include LSO (Tripp et al., 2020) and amortized BO (Swersky et al., 2020) among others. MKTTTL LFLVGALTQ MKTATL LFLVGTLTQ MKTTTL DFLVGALTQ MKTYTL LFLVGTLTQ MKTTTL LFLVGALTT 1.2 3.6 0.1 2.8 0.3 MKTTTL LFLVGALTS MKTFTL LFLVGVLTQ MKTTIL LFLVGTLTQ MKTSTL LFLVGTLTC MKTATL LFLVGALTT Unlabelled data k9QA1Fhfb Edo2ta DUCQNZ/q4ivxs1W37Pnw Z9C4I5a JF5HEXNu3CUsy Ll GTIJxgw CX+Gw BI2CSV5YWG4An YFl3xg YQYp N8Nyal BF25YZ0STX9m RIp+z Ljh JSYy Zpb Cvr Lc2i Vp P/0w YFJrv DUm Sq QJ6x2UVJISnmt Habjo Tm DOXEAm Ba2F0p G4MGhv ZPGt Cs Pjkt+D0Ry/Y6e0c/2zt H8zta JAt8o10SEB+k X3yhxy RPm Hkw Vly Vpx V59Ftu Gvu+qz Ude Y9m+RVu F+e AJ+Et Ag=q(x|ω {(xbt, ybt)}B Model training S = {x : f (x) > ω} (a) VSD Active Generation MKTTTL LFLVGALTQ MKTATL LFLVGTLTQ MKTTTL DFLVGALTQ MKTYTL LFLVGTLTQ MKTTTL LFLVGALTT 1.2 3.6 0.1 2.8 0.3 {(xbt, ybt)}B Model training MKTTTL LFLVGALTQ MKTFTL LFLVGTLTQ MKTTIL LFLVGTLTQ MKTSTL LFLVGTLTQ MKTTTL LFLVGTLTQ MKTTTL LFLHGTLTQ MKTCTL LFLVGTLTQ MKTTTL LFLVGQLTQ MKTNTL LFLVGTLTQ MKTTTL LKLVGTLTQ MKTTTL LFLVGALTT p(y|x, DN, ω) Unlabelled data ω({xb}, DN, ε Candidate list LNLML(ω, DN) (b) Batch BO Figure B.1: Depictions of (a) active generation as implemented by VSD, and (b) batch Bayesian optimization as applied to discrete sequences. Please see the text for a discussion of the differences between these approaches. C EXPERIMENTAL DETAILS We use three well established datasets; a green fluorescent protein (GFP) from Aequorea Victoria (Sarkisyan et al., 2016), an adeno-associated virus (AAV) Bryant et al. (2021); and DNA binding activity to a human transcription factor (TFBIND8) (Trabucco et al., 2022; Barrera et al., 2016). These datasets have been used variously by Brookes & Listgarten (2018); Brookes et al. (2019); Angermueller et al. (2019); Kirjner et al. (2024); Jain et al. (2022) among others. The GFP task is to maximize fluorescence, this protein consists of 238 amino acids, of which 237 can mutate. The AAV task us to maximize the genetic payload that can be delivered, and the associated protein has 28 amino acids, all of which can mutate. A complete combinatorial assessment is infeasible for these tasks, and so we use the convolution neural network oracle presented in Kirjner et al. (2024) as in-silico ground truth. TFBIND8 contains a complete combinatorial assessment of the effect of changing 8 nucleotides on binding to human transcription factor SIX6 REF R1 (Barrera et al., 2016). The dataset we use contains all 65536 sequences prepared by Trabucco et al. (2022). We also use two novel datasets from recent works that experimentally assess the (near) complete combinatorial space of short sequences. The first dataset measures the antibiotic resistance of Escherichia coli metabolic gene fol A, which encodes dihydrofolate reductase (DHFR) (Papkou et al., 2023). Only a sub-sequence of this gene is varied (9 nucleic acids which encode 3 amino acids), and so a near-complete (99.7%) combinatorial scan is available. For variants that have no fitness (resistance) data available, we give a score of 1. The next dataset is near-complete combinatorial scan of four interacting amino acid residues near the active site of the enzyme tryptophan synthase (Trp B) (Johnston et al., 2024), with 159,129 unique sequences and fitness values, we use 0.2 for Published as a conference paper at ICLR 2025 the missing fitness values (we do not use the authors imputed values). These residues are explicitly shown to exhibit epistasis or non-additive effects on catalytic function which makes navigating this landscape a more interesting challenge from an optimization perspective. Finally, we use the recently proposed Ehrlich functions (Stanton et al., 2024) benchmark. These functions are challenging closed form biological analogues, specifically designed to test BBO methods on high dimensional sequence design tasks without having to resort to physical experimentation or machine learning oracles. We use the POLI and POLI-BASELINES software package for the benchmark and baselines (Gonz alez-Duque et al., 2024), and test on both the original HOLO implementation (Stanton et al., 2024) as well as the native POLI implementation of these functions. The properties of these datasets and benchmarks are presented in Table C.1. Dataset |V| M |Xavailable| |X| TFBIND8 4 8 65,536 65,536 Trp B 20 4 159,129 160,000 DHFR 4 9 261,333 262,144 AAV 20 28 42,340 2028 GFP 20 237 51,715 20237 Ehrlich-15 20 15 2015 2015 Ehrlich-32 20 32 2032 2032 Ehrlich-64 20 64 2064 2064 Table C.1: Alphabet size, sequence length, and number of available sequences for each of the datasets we use in this work. We optimize VSD, Cb AS, Db AS and BORE for a minimum of 3000 iterations each round (5000 for all experiments but the Ehrlich functions) using Adam (Kingma & Ba, 2014). When we use a CPE, Ada Lead s κ parameter is set to 0.5 since the CPE already incorporates the appropriate threshold. C.1 FITNESS LANDSCAPES SETTINGS For the DHFR and Trp B experiments we set maximum fitness in the training dataset to be that of the wild type, and τ to be slightly below the wild type fitness value (so we have 10 positive examples to train the CPE with). We use a randomly selected Ntrain = 2000 below the wild-type fitness to initially train the CPE, we also explicitly include the wild-type. The thresholds and wildtype fitness values are; DHRF: τ = 0.1, ywt = 0, Trp B: τ = 0.35, ywt = 0.409. We follow the same procedure for the TFBIND8 experiment, however, there is no notion of a wild-type sequence in this data, and so we set τ = 0.75, and ytrain max = 0.85. We use a uniform prior over sequences, p(x) = QM m=1 Categ(xm|1 |V| 1), since these are relatively small search spaces, and the subsequences of nucleic/amino acids have been specifically selected for their task. Similarly, we find that relatively simple independent (mean-field) variational distributions of the form in Equation 22 and MLP based CPEs work best for these experiments (details in Sec. C.4). C.2 BLACK-BOX OPTIMIZATION SETTINGS We follow Kirjner et al. (2024) in the experimental settings for the AAV and GFP datasets, but we modify the maximum fitness training point and training dataset sizes to make them more amenable to a sequential optimization setting. The initial percentiles, schedule, and max training fitness values are; AAV: p0 = 0.8, η = 0.7, ymax = 5, GFP: p0 = 0.8, η = 0.7 ymax = 1.9. We aim for p T = 0.99. The edit distance between x and the fittest sequence in the CPE training data is 8 for GFP, and 13 for AAV. We again use a random Ntrain = 2000 for training the CPEs, which in this case are CNNs architecture specifics are in Sec. C.4. For the Ehrlich function experiment, we use sequence lengths of M = {15, 32, 64} with 2 motifs for the shorter sequence lengths, and 8 motifs for M = 64. All use a motif length of 4 and a quantization of 4. B = 128, T = 32 and only 128 random samples of the function are used for DN these are resampled for each seed. As before, 5 different random seeds are used for these trials, and for VSD we use an the same scheduling function for τt as in Equation 20, with p0 = 0.5 and η = 0.87 (so p T = 0.99). The lower initial percentile is used since the training dataset is much Published as a conference paper at ICLR 2025 smaller than in the other experiments, and we find allowing for more exploration initially improves VSD s performance. In these higher dimensional settings, we find that performance of the methods heavily relies on using an informed prior (in the case of VSD and Cb AS), or initial variational distribution (in the case of Db AS and BORE). To this end, we follow Brookes et al. (2019) and fit the initial variational distribution to the CPE training sequences (regardless of fitness), but we use maximum likelihood. For the more complex variational distributions (LSTM and transformer), we have to be careful not to over-fit so we implement early stopping and data augmentation techniques. Then for VSD and Cb AS we copy this distribution and fix its parameters for the remainder of the experiment for use as a prior. We also use this prior for the Random method, but Ada Lead and PEX use alternative generative heuristics. For these experiments we use the simple independent variational distribution and the same LSTM and causal decoder-only transformer models from Sec. 4.1. C.3 VARIATIONAL DISTRIBUTIONS In this section we summarize the main variational distribution architectures considered for VSD, BORE, Cb AS and Db AS, and the sampling distributions for the Random baseline method. Somewhat surprisingly, we find that we often obtain good results for the biological sequence experiments using a simple independent (or mean-field) variational distribution, especially in lower dimensional settings, m=1 Categ(xm|softmax(ϕm)), (22) where xm V and ϕm R|V|. However, this simple mean-field distribution was not capable of generating convincing handwritten digits or, in some cases, higher-dimensional sequences. We have also tested a variety of transition variational distributions, q(xt|xt 1, ϕ) = m=1 Categ(xtm|softmax(NNm(xt 1, ϕ))), (23) where NNm(xt 1, ϕ) is the mth vector output of a neural network that takes a sequence from the previous round, xt 1, as input. We have implemented multiple neural net encoder/decoder architectures for NNm(xt 1, ϕ), but we did not consider architectures of the form NNm(ϕ) since the variational distribution in Equation 22 can always learn a ϕm = NNm(ϕ ). We found that none of these transition architectures significantly outperformed the mean-field distribution (Equation 22) when it was initialized well (e.g. fit to the CPE training sequences), see Sec. D.4 for results. We also implemented auto-regressive variational distributions of the form, q(x|ϕ) = Categ(x1|softmax(ϕ1)) m=2 q(xm|x1:m 1, ϕ1:m) where, (24) q(xm|x1:m 1, ϕ1:m) = Categ(xm|softmax(LSTM(xm 1, ϕm 1:m))), Categ(xm|softmax(DTransformer(x1:m 1, ϕ1:m))). For a LSTM RNN and a decoder-only transformer with a causal mask, for the latter see Phuong & Hutter (2022, Algorithm 10 & Algorithm 14) for maximum likelihood training and sampling implementation details respectively. We list the configurations of the LSTM and transformer variational distributions in Table C.2. We use additive positional encoding for all of these models. When using these models for priors or initialization of variational distributions, we find that over-fitting can be an issue. To circumvent this, we use early stopping for larger training datasets, or data augmentation techniques for smaller training datasets (as in the case of the Ehrlich functions). C.4 CLASS PROBABILITY ESTIMATOR ARCHITECTURES For the fitness landscape experiments on the smaller combinatorially complete datasets we use a two-hidden layer MLP, with an input embedding layer. The architecture is given in Figure C.2 (a). For the larger dimensional AAV and GFP datasets and Ehrlich function benchmark, we use the Published as a conference paper at ICLR 2025 Configuration Digits AAV GFP Ehrlich 15 Ehrlich 32 Ehrlich 64 LSTM Layers 5 4 4 3 3 3 Network size 128 32 32 32 32 64 Embedding size 4 10 10 10 10 10 Transformer Layers 4 1 1 2 2 2 Network Size 256 64 64 32 64 128 Attention heads 8 2 2 1 2 3 Embedding size 32 20 20 10 20 30 Table C.2: LSTM and transformer network configuration. convolutional architecture given in Figure C.2 (b). On all but the Ehrlich benchmark, five fold cross validation was used to select the hyper parameters before the CPEs are trained on the whole training set for use in the subsequent experimental rounds. For the Ehrlich benchmark we do not use crossvalidation to select the CPE hyper parameters but we do use an additive ensemble of 10 randomly initialized CNNs for the CPE following La MBO-2. Model updates are performed by retraining on the whole query set. D ADDITIONAL EXPERIMENTAL RESULTS D.1 FITNESS LANDSCAPES GAUSSIAN PROCESS PROBABILITY OF IMPROVEMENT Here we present additional fitness landscape experimental results, where we have used a GP as a surrogate model for p(y|x, DN) in conjunction with a complementary Normal CDF as the PI acquisition function. This is one of the main frameworks supported by our theoretical analysis. VSD, Db AS, Cb AS and BORE can make use of the GP-PI acquisition function, and so BORE is BOPR (Daulton et al., 2022) in this instance since we are not using a CPE. PEX and Ada Lead only use the GP surrogate, as per their original formulation. The GP uses a simple categorical kernel with automatic relevance determination from Balandat et al. (2020), k(x, x ) = σ exp 1[xm = x m] lm where σ and lm are hyper-parameters controlling scale and length-scale respectively. See Figure D.3 for the results. D.2 EHRLICH FUNCTION HOLO RESULTS See Figure D.4 for BBO results on the original HOLO Ehrlich function implementation (Stanton et al., 2024). We present additional diversity scores for these and the POLI implementation in Sec. D.3. D.3 DIVERSITY SCORES The diversity of batches of candidates is a common thing to report in the literature, and to that end we present the diversity of our results here. We have taken the definition of pair-wise diversity from (Jain et al., 2022) as, Diversityt = 1 B(B 1) xj DBt\{xi} Lev(xi, xj), (26) where Lev : X X N0 is the Levenshtein distance. We caution the reader as to the interpretation of these results however, as more diverse batches often do not lead to better performance, precision, recall or simple regret (as can be seen from the Random method results). Though insufficient diversity can also explain poor performance, as in the case of BORE. Results for the fitness landscape experiment are presented in Figure D.5, and black-box optimization for AAV & GFP in Figure D.6 and Ehrlich functions in Figure D.7. Published as a conference paper at ICLR 2025 Sequential( Embedding( num_embeddings=A, embedding_dim=8 ), Dropout(p=0.2), Flatten(), Leaky Re LU(), Linear( in_features=8 * M, out_features=32 ), Leaky Re LU(), Linear( in_features=32, out_features=1 ), ) (a) MLP architecture Sequential( Embedding( num_embeddings=A, embedding_dim=10 ), Dropout(p=0.2), Conv1d( in_channels=10, out_channels=16, kernel_size=3 or 7, ), Leaky Re LU(), Max Pool1d( kernel_size=2 or 4, stride=2 or 4, ), Conv1d( in_channels=16, out_channels=16, kernel_size=7, ), Leaky Re LU(), Max Pool1d( kernel_size=2 or 4, stride=2 or 4, ), Flatten(), Lazy Linear( out_features=128 ), Leaky Re LU(), Linear( in_features=128, out_features=1 ), ) (b) CNN architecture Figure C.2: CPE architectures used for the experiments in Py Torch syntax. A = |V|, M = M, GFP uses a max pooling kernel size and stride of 4, all other datasets and benchmarks use 2. The Ehrlich function benchmark uses and ensemble of 10 randomly initialized CNNs that are additively combined. The Ehrlich-15 functions use a kernel size of 3, all other BBO experiments use a kernel size of 7. La MBO-2 uses the same kernel size as our CNNs for the Ehrlich functions. D.4 ABLATIONS VARIATIONAL AND PRIOR DISTRIBUTIONS In Figure D.8 we present ablation results for VSD using different priors and variational distributions. We use the BBO experimental datasets for this task as they are higher-dimensional and so more sensitive to these design choices. We test the following prior and variational posterior distributions: IU Independent categorical variational posterior distribution of the form in Equation 22, and a uniform prior distribution, p(x) = QM m=1 Categ(xm|1 |V| 1). I Independent categorical prior and variational posterior of the form in Equation 22. The prior is fit using ML on the initial CPE training data. LSTM LSTM prior and variational posterior of the form Equation 24. The prior is fit using ML on the initial CPE training data. DTFM Decoder-only causal transformer prior and variational posterior of the form Equation 24. The prior is fit using ML on the initial CPE training data. Published as a conference paper at ICLR 2025 (c) TFBIND8 Figure D.3: Fitness landscape results using GP-PI. Precision (Equation 17), recall (Equation 18) and performance (Equation 19) higher is better for the combinatorially (near) complete datasets, DHFR and Trp B and TFBIND8. The random method is implemented by drawing B samples uniformly. Figure D.4: Ehrlich function (HOLO implementation) BBO results. VSD and Cb AS with different variational distributions; mean field (MF), LSTM and transformer (TFM), compared against genetic algorithm (GA) and La MBO-2 baselines. TAE Independent categorical prior and a transition-style auto-encoder variational posterior of the form Equation 23, where we use two-hidden layer MLPs for the encoder and decoder. The prior is fit using ML on the initial CPE training data. TCNN Independent categorical prior and a transition-style convolutional auto-encoder variational posterior of the form Equation 23, where we use a convolutional encoder, and transpose convolutional decoder. The prior is fit using ML on the initial CPE training data. We use the informed-independent priors with the transition variational distributions since they are somewhat counter-intuitive to use as priors themselves. Published as a conference paper at ICLR 2025 (c) TFBIND8 Figure D.5: Fitness landscape diversity results. Higher is more diverse, as defined by Equation 26. (a) Independent (c) Transformer Figure D.6: Black-box optimization results for diversity on GFP and AAV with independent and auto-regressive variational distributions. Higher is more diverse, as defined by Equation 26. The PEX and Ada Lead results are replicated between the plots, since they are unaffected by choice of variational distribution. From Figure D.8 we can see that while using an uninformative prior works in the lower-dimensional fitness landscape experiments, using an informative prior is crucial for these higher dimensional problems. We found a similar result when using this uninformative prior with Cb AS, or using a uniform initialization with Db AS and BORE. The methods are not able to make any significant progress within the experimental budget given. The independent and transition variational distributions achieve similar performance, whereas the auto-regressive models generally outperform all others. This is because of the LSTM and transformer s superior generalization performance when generating sequences measured both when training the priors (on held-out sequences) and during VSD adaptation. E THEORETICAL ANALYSIS FOR GP-BASED CPES In this section, we present theoretical results concerning VSD and its estimates when equipped with Gaussian process regression models (Rasmussen & Williams, 2006). We show that VSD sampling distributions converge to a target distribution that characterizes the level set given by τ. The approximation error mainly depends on the predictive uncertainty of the probabilistic model with respect to the true underlying function f . For the analysis, we will assume that f is drawn from a Gaussian process, i.e., f GP(0, k), with a positive-semidefinite covariance (or kernel) function k : X X R. In this case, we can show that the predictive uncertainty of the model converges Published as a conference paper at ICLR 2025 Figure D.7: Black-box optimization results for diversity on the POLI and HOLO implementations of the Ehrlich functions. Higher is more diverse, as defined by Equation 26. (in probability) to zero as the number of observations grows. From this result, we prove asymptotic convergence guarantees for VSD equipped with GP-PI-based CPEs. These results form the basis for our analysis of CPEs based on neural networks (Appendix F). E.1 GAUSSIAN PROCESS POSTERIOR Let f GP(0, k) be a zero-mean Gaussian process with a positive-semidefinite covariance function k : X X R. Assume that we are given a set DN := {(xi, yi)}N i=1 of N 1 observations yi = f (xi) + ϵi, where ϵ N 0, σ2 ϵ and xi X. The GP posterior predictive distribution at any x X is then given by (Rasmussen & Williams, 2006): f (x)|DN N µN(x), σ2 N(x) (27) µN(x) = k N(x) (KN + σ2 ϵ I) 1y N (28) k N(x, x ) = k(x, x ) k N(x) (KN + σ2 ϵ I) 1k N(x ) (29) σ2 N(x) = k N(x, x) , (30) where k N(x) := [k(x, xi)]N i=1 RN, KN := [k(xi, xj)]N,N i,j=1 RN N, and y N := [yi]N i=1 RN. Batch size. In the following, we will assume a batch of size B = 1 to keep the proofs simple. With this assumption, at every iteration t 1, we have N = t observations available in the dataset. We would, however, like to emphasize that sampling a batch of multiple observations, instead of a single observation, per iteration should only improve the convergence rates by a constant (batchsize-dependent) multiplicative factor. Therefore, our results remain valid as an upper bound for the convergence rates of VSD in the batch setting. E.2 BACKGROUND We will consider an underlying probability space (Ω, A, P), where Ωis the sample space, A denotes the σ-algebra of events, and P is a probability measure. For any event A A, we have that P[A] [0, 1] quantifies the probability of that event. For events involving a random variable, e.g., χ : (Ω, A) (R, BR), where BR denotes the Borel σ-algebra of the real line with its usual topology, we will let: P[χ > 0] = P[{ω Ω: χ(ω) > 0}] . (31) Published as a conference paper at ICLR 2025 Figure D.8: Ablation results for the AAV and GFP BBO experiments. VSD is trialed with different prior and variational posterior combinations, I indicates a simple independent informed prior and posterior, IU is the same but with a uniform prior, LSTM and DTFM are the LSTM and decoder only transformer prior and posteriors, TCNN and TAE are transition convolutional encoder-decoder and auto-encoder posteriors, with informed independent priors. See text for details. We will also use conditional expectations, i.e., given a σ-sub-algebra S of A, the conditional expectation E[χ|S] is a S-measurable random variable such that: A E[χ|S] d P = Z A χ d P = E[χ|A] . (32) We will denote by {Ft} t=0 an increasing filtration on A. For instance, we could set Ft as the σalgebra generated by the random variables in the algorithm (i.e., the candidates, target observations, etc.) at time t. For more details on the measure-theoretic definition of probability, we refer the reader to classic textbooks in the area (e.g. Bauer, 1981; Durrett, 2019) We will use the following well known notation for asymptotic convergence results. For a given strictly positive function g : N R, we define O(g(t)) as the set of functions asymptotically bounded by g (up to a constant factor) as: O(g(t)) := h : N R lim sup t |h(t)| g(t) < , (33) and for convergence in probability we use its stochastic counterpart: OP(g(t)) := ρ : N (Ω, A) (R, BR) lim C lim sup t P |ρ(t)| g(t) > C = 0 , (34) which is equivalent to: ε > 0, Cε (0, ) : P |ρt| ε, t Tε , (35) Published as a conference paper at ICLR 2025 for some Tε N. For almost sure convergence, we may also say that a sequence of random variables ρt, t N, is almost surely O(g(t)) if P[ρt O(g(t))] = 1. A deeper overview on these notations and their properties can be found in Garc ıa-Portugu es (2024). In addition, as it is common in the multi-armed bandits literature, we use the variants e O and e OP to denote asymptotic rates which are valid up to logarithmic factors. E.3 AUXILIARY RESULTS We start with a few technical results which will form the basis for our derivations. The following recursive relations allow us to derive convergence rates for the variance of a GP posterior by analyzing how much it reduces per iteration. Lemma E.1 (Chowdhury & Gopalan (2017, Appendix F)). The posterior mean and covariance functions of a Gaussian process given t 1 observations obey the following recursive identities: µt(x) = µt 1(x) + k(x, xt) σ2ϵ + σ2 t 1(xt)(yt µt 1(x)) (36) kt(x, x ) = kt 1(x, x ) kt 1(x, xt)kt 1(xt, x ) σ2ϵ + σ2 t 1(xt) (37) σ2 t (x) = σ2 t 1(x) k2 t 1(x, xt) σ2ϵ + σ2 t 1(xt) , (38) for x, x X. We will also make use of the following version of the second Borel-Cantelli lemma adapted from Durrett (2019, Thr. 4.5.5) and its original statement in Dubins & Freedman (1965). Lemma E.2 (Second Borel-Cantelli lemma). Let {At} t=1 be a sequence of events where At Ft, for all t N, and let χt : ω 7 1[ω At], for ω Ω. Then the following holds with probability 1: PT t=1 χt PT t=1 P[At|Ft 1] = L < , (39) assuming P[A1|F0] > 0. In addition, if lim T PT t=1 P[At|Ft 1] = , then L = 1. The next result provides us with an upper bound on the posterior variance of a Gaussian process which is valid for any covariance function. Lemma E.3. Let k : X X R be any positive-semidefinite kernel on X, and let k : X X R be a kernel defined as: k(x, x ) = k(x, x), x = x 0, x = x , (40) for x, x X. Given any set of observations {xi, yi}t i=1, for t 1, denote by σ2 t the predictive variance of a GP model with prior covariance given by k, and let σ2 t denote the predictive variance of a GP model configured with k as prior covariance function, where both models are given the same set of observations. Then the following holds for all t 0: σ2 t (x) σ2 t (x) = σ2 ϵ σ2 0(x) σ2ϵ + Nt(x) σ2 0(x), x X , (41) where Nt(x) denotes the number of observations at x, and σ2 0(x) = σ2 0(x) := k(x, x), for x X. Proof. It is not hard to show that k defines a valid positive-semidefinite covariance function whenever k is positive semidefinite. We will then focus on proving the main statement by an induction argument. The proof that the statement holds for the base case at t = 0 is trivial given the definition: σ2 0(x) = k(x, x) = k(x, x) = σ2 0(x), x X . (42) Now assume that, for a given t > 0, it holds that σ2 t (x) σ2 t (x), for all x X. We will then check if the inequality remains valid at t + 1. By Lemma E.1, we have that: σ2 t+1(x) = σ2 t (x) k2 t (x, xt+1) σ2 t (xt+1) + σ2ϵ (43) Published as a conference paper at ICLR 2025 For any x X such that x = xt+1, we know that kt(x, xt+1) 0, so that (again by Lemma E.1): k2 t (x, xt+1) k2(x, xt+1) = 0 , (44) which shows that: x = xt+1, σ2 t+1(x) σ2 t (x) σ2 t (x) = σ2 t+1(x) . (45) At x = xt+1, we can rewrite σ2 t+1(x) = σ2 t+1(xt+1) as: σ2 t+1(xt+1) = σ2 ϵ σ2 t (xt+1) σ2 t (xt+1) + σ2ϵ . (46) We then check the difference: σ2 t+1(xt+1) σ2 t+1(xt+1) = σ2 ϵ σ2 t (xt+1) σ2 t (xt+1) + σ2ϵ σ2 ϵ σ2 t (xt+1) σ2 t (xt+1) + σ2ϵ = σ2 ϵ σ2 t (xt+1)( σ2 t (xt+1) + σ2 ϵ ) σ2 ϵ σ2 t (xt+1)(σ2 t (xt+1) + σ2 ϵ ) (σ2 t (xt+1) + σ2ϵ )( σ2 t (xt+1) + σ2ϵ ) = σ4 ϵ (σ2 t (xt+1) σ2 t (xt+1)) (σ2 t (xt+1) + σ2ϵ )( σ2 t (xt+1) + σ2ϵ ) 0 , since σ2 t (xt+1) σ2 t (xt+1) by our assumption for time t. Therefore, we have shown that: σ2 t (x) σ2 t (x) = σ2 t+1(x) σ2 t+1(x) , x X . (48) From the conclusion above and the base case, the inequality in the main result follows by induction. Now we derive an explicit form for σ2 t . Note that this case corresponds to an independent Gaussian model, i.e., f (x) f (x ) whenever x = x , for f GP(0, k). For any t 1, this model s predictive variance at any x X is given by: σ2 t 1(x), x = xt σ2 ϵ σ2 t 1(xt) σ2ϵ + σ2 t 1(xt) = 1 σ2 t 1(xt) + 1 1 , x = xt (49) Looking at the reciprocal, we have that: t 1, 1 σ2 t (x) = 1 σ2 t 1(xt) + 1[xt = x] σ2ϵ , x X. (50) Therefore, every observation at x is simply adding a factor of σ 2 ϵ to σ 2 t (x). Unwrapping this recursion leads us to: t 1, 1 σ2 t (x) = 1 σ2 0(x) + 1 i=1 1[xi = x] , x X . (51) The result in Lemma E.3 then follows as the reciprocal of the above, which concludes the proof. Lemma E.4. Let f GP(0, k) for a given k : X X R, where σ2 X := supx X k(x, x) < , and |X| < . Then f is almost surely bounded, and: E sup x X |f (x)| σX p 2 log |X| . (52) Proof. The result follows by an application of a concentration inequality for the maximum of a finite collection of sub-Gaussian random variables (Boucheron et al., 2013, Sec. 2.5). Note that {f (x)}x X is a collection of |X| Gaussian, and therefore sub-Gaussian, random variables with sub-Gaussian parameter given by σ2 X σ2 t (x), for all X. Applying the maximal inequality for a finite collection sub-Gaussian random variables (Boucheron et al., 2013, Thr. 2.5), we have that: E max x X f (x) σX p 2 log |X| < . (53) By symmetry, we know that f (x) is also sub-Gaussian with the same parameter, so that the bound remains valid for maxx X f (x). As a consequence, the expected value of the maximum of |f (x)| is upper bounded by the same constant. On a finite set, the maximum and the supremum coincide. As the expected value of the supremum is finite, the supremum must be almost surely finite by Markov s inequality, and therefore f is almost surely bounded. Published as a conference paper at ICLR 2025 E.4 ASYMPTOTIC CONVERGENCE The main assumption we will be working with in this section is the following. Assumption E.1. The objective function is a sample from a Gaussian process f GP(0, k), where k : X X R is a bounded positive-semidefinite kernel on X. The next result allows us to derive a convergence rate for the posterior variance of a GP as a function of the sampling probabilities. This result might also be useful by itself for other sampling problems involving GP-based approximations. Lemma E.5. Let {xt}t 1 be a sequence of X-valued random variables adapted to the filtration {Ft}t 1. For a given x X, assume that the following holds: T N : T T , t=1 P[xt = x | Ft 1] BT > 0 , (54) for a some sequence of lower bounds {Bt}t N. Then, under Assumption E.1, given observations at {xi}t i=1, the following holds with probability 1: σ2 t (x) O(B 1 t ). (55) In addition, if Bt , then limt Btσ2 t (x) σ2 ϵ . Proof. At any iteration t, the posterior variance σ2 t of a GP model is upper bounded by a worst case assumption of no correlation between observations (see Lemma E.3). In this case, we have that: σ2 t (x) σ2 t (x) = σ2 ϵ σ2 0(x) σ2ϵ + Nt σ2 0(x) , (56) where σ2 0(x) := k(x, x) = k(x, x), and Nt := Nt(x) t denotes the total number of observations taken at x as of iteration t. Without loss of generality, assume that σ2 0(x) = 1. The only random variable to be bounded in Equation 56 is Nt. Let χt := 1[xt = x], so that: i=1 1[xt = x] , t 1. (57) We now apply the second Borel-Cantelli lemma (Lemma E.2) to Nt. Namely, let b Nt denote the sum of conditional expectations of {χi}t i=1 given available data, i.e.: i=1 E[χi | Fi 1] = i=1 E[1[xt = x] | Fi 1] = i=1 P[xi = x | Fi 1] . (58) By Lemma E.2, we know that the following holds for some L R: b Nt = L < . (59) Hence, Nt is asymptotically equivalent to b Nt. Applying this fact to σ2 t , we have that: lim t Bt σ2 t (x) = lim t Btσ2 ϵ σ2ϵ + Nt = lim t Btσ2 ϵ σ2ϵ + L b Nt lim t Btσ2 ϵ σ2ϵ + LBt L lim t min{LBt, σ2 ϵ } Published as a conference paper at ICLR 2025 which holds with probability 1. Lastly, note that, if Bt , then L = 1 by Lemma E.2, and the last limit above becomes σ2 ϵ . The main result then follows by an application of Lemma E.3 and the definition of the big-O notation (see Equation 33).1 We assume a finite search space, which is the case for spaces of discrete sequences of bounded length. However, we conjecture that our results can be extended to continuous or mixed discretecontinuous search spaces via a discretization argument under further assumptions on the kernel k (e.g., ensuring that f is Lipschitz continuous, as in Srinivas et al. (2010)). Assumption E.2. The search space X is finite, |X| < . We assume that our family of variational distributions is rich enough to be able to represent the PI-based distribution p(x|y > τt, Dt), which is the optimum of our variational objective when the optimal classifier is given by GP-PI. Although this assumption could be seen as strong, note that, due to Gaussian noise, the classification probability p(y > τt|x, Dt) should be a reasonably smooth function of x, which facilitates the approximation of the resulting posterior by a generative model. Assumption E.3. For every t 0, p(x|y > τt, Dt) is a member of the variational family, i.e.: ϕ t : D[q(x|ϕ t ) p(x|y > τt, Dt)] = 0. (61) The next assumption is a technical one to ensure that the thresholds will not diverge to infinity. Assumption E.4. The sequence of thresholds is almost surely bounded:2 sup t N |τt| τ < . (62) We can now state our main result regarding the GP-based approximations learned by VSD. Theorem E.1. Let assumptions E.1 to E.4 hold. Then the following holds with probability 1 for VSD equipped with GP-PI: σ2 t (x) O(t 1) , (63) at every x X such that p(x) > 0. Proof. Let ℓt(x) := p(y > τt|x, Dt). For any given x X where p(x) > 0, by Assumption E.2, we have that the next candidate will be sampled according to: t 0, P[xt+1 = x | Ft] = p(x|y > τt, Dt) = ℓt(x)p(x) Ep(x)[ℓt(x)] where we used the fact that Ep(x)[ℓt(x)] 1, since ℓt(x) 1, for all x X. As p(x) > 0, we only have to derive a lower bound on ℓt(x) to apply Lemma E.5 and derive a convergence rate. A lower bound on ℓt(x) is given by: t 0, ℓt(x) = Ψ σ2 t (x) + σ2ϵ where Ψ( ) denotes the cumulative distribution function of a standard normal random variable, and denotes the essential supremum of a function under P (the probability measure of the underlying abstract probability space). Therefore, if limt µt < , we will have that limt ℓt(x) > 0, and the sum in Lemma E.5 will diverge. By Jensen s inequality for conditional expectations, we have that: t 0, µt = E[f | Ft] E[ f | Ft]. (66) 1Recall that for convergent sequences lim and lim sup coincide. 2We do not require τ to be known, only finite. Published as a conference paper at ICLR 2025 As E[E[ f | Ft]] = E[ f ] < (cf. Lemma E.4), an application of Markov s inequality implies that: lim a P[E[ f | Ft] a] lim a 1 a E[ f ] = 0. (67) Furthermore, mt := E[ f | Ft] also defines a non-negative martingale, and by the martingale convergence theorem (Durrett, 2019, Thr. 4.2.11), limt mt = m := E[ f | F ] is well defined and E[E[ f | F ]] = E[ f ] < . Again, by Markov s inequality, for any a > 0, we have that: P h lim t µt a E[ f ] i E [limt µt ] a E[ f ] E [limt E[ f | Ft]] a E[ f ] = 1 Therefore, for any a > 0 and any given x X, with probability at least 1 1 a, the following holds: lim t P[xt = x | Ft 1] p(x) lim t ℓt 1(x) p(x) lim t Ψ µt 1 + τ p(x)Ψ a E[ f ] + τ =: b (a) > 0 . Hence, for any εa (0, b (a)), there is Na N, such that P[xt = x | Ft 1] b (a) εa > 0, for all t Na. As a result, Pt t =1 P[xt = x | Ft 1] (b (a) εa)(t Na), for all t Na, which asymptotically diverges at a rate proportional to t. By Lemma E.5 and the definition of the big-O notation, for any x X, we then have that: a > 0, P lim sup t tσ2 t (x) σ2 ϵ < 1 1 Taking the limit as a , we can finally conclude that: P lim sup t tσ2 t (x) < = 1, (71) i.e., σ2 t is almost surely O(t 1), which concludes the proof. Remark E.1. The convergence rate in Theorem E.1 is optimal and cannot be further improved. As shown by previous works in the online learning literature (Mutn y & Krause, 2018; Takeno et al., 2024), a lower bound on the GP variance at each iteration t 1 is given by σ2 t (x) σ2 ϵ (σ2 ϵ + t) 1 (assuming k(x, x) = 1), which is the case when every observation in the dataset was collected at the same point x X (see Takeno et al., 2024, Lem. 4.2). Therefore, the lower and upper bounds on the asymptotic convergence rates for the GP variance differ by only up to a multiplicative constant. The result in Theorem E.1 now allows us to derive a convergence rate for VSD s approximations to the level-set distributions. To do so, however, we will require the following mild assumption, which is satisfied by any prior distribution which has support on the entire domain X. Assumption E.5. The prior distribution is such that p(x) > 0, for all x X. Theorem 2.1. Under mild assumptions (E.1 to E.5), the variational distribution of VSD equipped with GP-PI converges to the level-set distribution in probability at the following rate: DKL[p(x|y > τt, Dt) p(x|y > τt, f )] OP(t 1/2) . (12) Proof. We first prove an upper bound for the KL divergence in terms of the PI approximation error. We then derive a bound for this term and apply Theorem E.1 to obtain a convergence rate. KL bound formulation. Let ℓt(x) := p(y > τt|x, Dt) and ℓ t (x) := p(y > τt|x, f ), for x X. From the definition of the KL divergence, we have that: DKL[p(x|y > τt, Dt) p(x|y > τt, f )] = Ep(x|y>τt,Dt)[log p(x|y > τt, Dt) log p(x|y > τt, f )] = Ep(x|y>τt,Dt)[log ℓt(x) log ℓ t (x)] + log Ep(x)[ℓ t (x)] log Ep(x)[ℓt(x)] = Ep(x|y>τt,Dt) + log Ep(x)[ℓ t (x)] Ep(x)[ℓt(x)] Published as a conference paper at ICLR 2025 For logarithms, we know that log(1 + a) a, for all a > 1, which shows that: = log 1 + ℓt(x) ℓ t (x) ℓ t (x) ℓt(x) ℓ t (x) ℓ t (x) (73) log Ep(x)[ℓ t (x)] Ep(x)[ℓt(x)] = log 1 + Ep(x)[ℓ t (x) ℓt(x)] Ep(x)[ℓt(x)] Ep(x)[ℓ t (x) ℓt(x)] Ep(x)[ℓt(x)] . (74) Combining the above into Equation 72 yields: DKL[p(x|y > τt, Dt) p(x|y > τt, f )] Ep(x|y>τt,Dt) ℓt(x) ℓ t (x) ℓ t (x) + Ep(x)[ℓ t (x) ℓt(x)] Ep(x)[ℓt(x)] . (75) The denominator in the expression above is such that: t 0, ℓ t (x) = p(y > τt|x, f ) = Ψ f (x) τt , x X . (76) By Lemma E.4, we know that E[ f ] < , which implies that P[ f < ] = 1 by Markov s inequality. Next, we derive a bound for the approximation error term. Error bound. We now derive an upper bound for the difference ℓt(x) := ℓt(x) ℓ t (x) and then show that it asymptotically vanishes. Applying Taylor s theorem to Ψ, we can bound ℓt as a function of the approximation error between the mean µt and the true function f as: t 0, | ℓt(x)| = σ2 t (x) + σ2ϵ σ2 t (x) + σ2ϵ f (x) τt σϵµt(x) f (x) p σ2 t (x) + σ2ϵ + τt( p σ2 t (x) + σ2ϵ σϵ) σ2 t (x) + σ2ϵ |σϵµt(x) f (x) p σ2 t (x) + σ2ϵ | + |τt|σt(x) σ2ϵ σϵ|µt(x) f (x)| + σt(x)(|f (x)| + |τt|) since supϵ R dΨ(ϵ) 2π < 1, and we used the fact that σϵ p σ2 t (x) + σ2ϵ σt(x) + σϵ to obtain the last two inequalities. Convergence rate. To derive a convergence rate, given any x X and t 0, we have that: E[| ℓt(x)| | Ft] σϵE[|µt(x) f (x)| | Ft] + σt(x)(E[|f (x)| | Ft] + |τt|) We know that E[|f (x)| | Ft] is almost surely bounded, and by Jensen s inequality, it also holds that: E[|µt(x) f (x)| | Ft] σt(x). (79) Applying Theorem E.1, we then have that: | ℓt(x)| OP(t 1/2). (80) Since µt E[ f |Ft] OP(1), we also have that: 1 Ep(x)[ℓt(x)] OP(1) . (81) Lastly, we know that 1 ℓ t (x) OP(1) by Equation 76 and the observation that f OP(1). The main result then follows by combining the rates above into Equation 75. Published as a conference paper at ICLR 2025 E.5 PERFORMANCE ANALYSIS At every iteration t 1, VSD samples xt from (an approximation to) the target p(x|y > τt 1, Dt 1) and obtains an observation yt p(y|xt). A positive hit consists of an event yt > τt 1, where τt 1 is computed based on the data available in Dt 1 or a constant. Therefore, we can compute the probability of a positive hit for a given realization of f as: P[yt > τt 1 | Dt 1, f ] = Ep(x|y>τt 1,Dt 1)[p(y > τt 1|x, f )] . (82) Then the expected number of hits HT after T 1 iterations is given by: E[HT | f ] = t=1 Ep(x|y>τt 1,Dt 1)[p(y > τt 1|x, f )] . (83) We will compare this quantity with the expected number of hits H T obtained by a sampling distribution with full knowledge of the objective function f : E[H T | f ] = t=1 Ep(x|y>τt 1,f )[p(yt > τt 1|x, f )] . (84) The next result allows us to bound the difference between these two quantities. Corollary 2.1. Under the settings in Theorem 2.1, we also have that: E[|HT H T |] O( Proof. For all T 1, we have that: E[HT H T ] = E t=1 Ep(x|y>τt 1,Dt 1)[p(y > τt 1|x, f )] Ep(x|y>τt 1,f )[p(yt > τt 1|x, f )] x X p(y > τt 1|x, f ) (p(x|y > τt 1, Dt 1) p(x|y > τt 1, f )) x X p(y > τt|x, f )p(x) ℓt(x) Ep(x )[ℓt(x )] ℓ t (x) Ep(x )[ℓ t (x )] x X p(x) | ℓt(x)| min{Ep(x )[ℓt(x )] , Ep(x )[ℓ t (x )]} since p(y > τt 1|x, f ) 1, for all t 1. As both µt and f are in OP(1), min{Ep(x )[ℓt(x )] , Ep(x )[ℓ t (x )]} is lower bounded by some constant. As ℓt(x) OP(t 1/2), for T large enough and some C > 0, we then have that: E[|HT H T |] C which follows by an application of the Euler-Maclaurin formula, since R T 1 1 T 2 and the remainder term asymptotically vanishes. Remark E.2. If the oracle achieves E[H T ] = T, the error bound in Corollary 2.1 suggests an increasing rate of positive hits by VSD as 1 T E[HT ] 1 CT 1/2, for some constant C > 0 and large enough T. Therefore, VSD should asymptotically achieve a full rate of 1 positive hit per iteration in the single-point batch setting we consider. Note, however, that the results above do not discount for repeated samples, though should still indicate that VSD achieves a high discovery rate over the course of its execution. Published as a conference paper at ICLR 2025 F VSD WITH NEURAL NETWORK CPES In this section, we consider VSD with class probability estimators that are not based on GP regression, which was the case for the previous section, while specifically focusing on neural network models. We will, however, show that with a kernel-based formulation we are able to capture the classification models based on neural networks which we use. This is possible by analyzing the behavior of infinite-width neural networks (Jacot et al., 2018; Lee et al., 2019), whose approximation error with respect to the finite-width model can be bounded (Liu et al., 2020; Eldan et al., 2021). Although our classifiers are learned by minimizing the cross-entropy (CE) loss, we can connect their approximations with theoretical results from the infinite-width neural network (NN) literature, which are mostly based on the mean squared error (MSE) loss. Recall that, given a dataset Dz N := {(xn, zn)}N n=1 with binary labels zn {0, 1}, the cross-entropy loss for a probabilistic classifier πθ : X [0, 1] parameterized by θ is given by3: LCPE(θ, Dz N) := 1 n=1 zn log πθ(xn) + (1 zn) log(1 πθ(xn)) . (87) The MSE loss for the same model corresponds to: LMSE(θ, Dz N) := 1 n=1 (zn πθ(xn))2 . (88) The following result establishes a connection between the two loss functions. Proposition F.1. Given a binary classification dataset Dz N of size N 1, the following holds for the cross-entropy and the mean-square error losses: LCPE(θ, Dz N) LMSE(θ, Dz N), N N . (89) Proof. Applying the basic logarithmic inequality log(1 + a) a, for all a > 1, to the crossentropy loss definition yields: LCPE(θ, Dz N) := 1 n=1 zn log πθ(xn) + (1 zn) log(1 πθ(xn)) n=1 zn(πθ(xn) 1) (1 zn)πθ(xn) n=1 2znπθ(xn) zn πθ(xn) n=1 zn 2znπθ(xn) + πθ(xn) . Now note that zn = z2 n, for zn {0, 1}, and πθ(xn) πθ(xn)2, as πθ(xn) [0, 1], for all n {1, . . . , N}. Making these substitutions in Equation 90, we obtain: LCPE(θ, Dz N) 1 n=1 z2 n 2znπθ(xn) + πθ(xn)2 = LMSE(θ, Dz N) , (91) which concludes the proof. The result in Proposition F.1 suggests that minimizing the cross-entropy loss will lead us to minimize the MSE loss as well, since the latter is upper bounded by the former. This result provides us with theoretical justification to derive convergence results based on the MSE loss, which has been better analyzed in the NN literature (Jacot et al., 2018; Lee et al., 2019), as a proxy to establish convergence guarantees for the CE-based VSD setting. 3We implicitly assume that 0 < πθ(xn) < 1, for n {1, . . . , N}, so that the CE loss is well defined. This assumption can, however, be relaxed when dealing with the MSE loss, which remains well defined otherwise. Published as a conference paper at ICLR 2025 F.1 LINEAR APPROXIMATIONS VIA THE NEURAL TANGENT KERNEL For this analysis, we will follow a frequentist setting. Namely, let π denote the unknown true classifier, i.e., π(x) := p(y > τ|x, f ), for x X. We assume that π is an unknown, fixed element of a reproducing kernel Hilbert space (RKHS) associated with a given kernel (Sch olkopf & Smola, 2001). In the case of infinite-width neural networks, we know that under certain assumptions the NN trained via gradient descent under the MSE loss will asymptotically converge to a kernel ridge regression solution whose kernel is given by the neural tangent kernel (NTK, Jacot et al., 2018). This asymptotic solution is equivalent to the posterior mean of a Gaussian process that assumes no observation noise. However, for a finite number of training steps s < , the literature has shown that gradient-based training provides a form of implicit regularization, which we use to ensure robustness to label noise. Moreover, although our analysis will be based on the NTK, the approximation error between the infinite-width and the finite-width NN vanishes with the square root of the network width for most popular NN architectures (Liu et al., 2020). Therefore, we can assume that these approximation guarantees will remain useful for wide-enough, finite-width NN models. Reproducing kernel Hilbert spaces. The RKHS Fk associated with a positive-semidefinite kernel k : X X R is a Hilbert space of functions over X with an inner product , k and corresponding norm k := p , k such that, for every π Fk, the reproducing property π(x) = π, k( , x) k holds for all x X (Sch olkopf & Smola, 2001). Implicit regularization. Several results in the literature have shown that training overparameterized neural networks via gradient descent provides a form of implicit regularization on the learned model (Fleming, 1990; Yao et al., 2007; Soudry et al., 2018; Barrett & Dherin, 2021), with some of the same behavior extending to the stochastic gradient setting (Smith et al., 2021). In earlier works, Fleming (1990) showed a direct equivalence between an early stopped gradient-descent linear model and the solution of a regularized least-squares problem with a penalty on the parameters vector Euclidean norm. In the NTK regime, the network output predictions at iteration s N of gradient descent are given by (Lee et al., 2019): ˆπN(x) = π0(x) + k N(x) K 1 N (I e νs KN )(z N π0(XN)) , x X , (92) where π0 represents the network s initialization, ν > 0 denotes the learning rate, the kernel k corresponds to the NTK associated with the given architecture, and the data is represented by XN := {xi}N i=1 X and z N := [zi]N i=1 {0, 1}N. Rearranging terms and performing basic algebraic manipulations, the equation above can be shown to be equivalent to: ˆπN(x) = π0(x) + k N(x) (KN + ΣN) 1(z N π0(XN)) , x X , (93) where ΣN := KN(eνs KN I) 1 corresponds to a data-dependent regularization matrix. The above is equivalent to the solution of the a regularized least-squares problem, as we show below. Lemma F.1. Assume k : X X R is positive definite and π0 = 0. Then Equation 93 solves the following regularized least-squares problem: ˆπN argmin π Fk i=1 (π(xi) zi)2 + R1/2 N π 2 k , (94) where RN := ΦN(e νs KN I) 1Φ N, ΦN := [φ(x1), . . . , φ(x N)], and φ(x) := k( , x) Fk represents the kernel s canonical feature map, for x X. Proof. The least-squares loss can be rewritten as: i=1 (π(xi) zi)2 + π 2 RN = Φ Nπ z N 2 2 + R1/2 N π 2 k , (95) where 2 denotes the Euclidean norm of a vector. Taking the functional gradient with respect to π Fk and equating it to zero, we have that an optimal solution ˆπ satisfies: ℓN(ˆπ) = 2ΦN(Φ N ˆπ z N) + 2RN ˆπ = 0 . (96) Published as a conference paper at ICLR 2025 The minimum-norm solution is then given by: ˆπ = (ΦNΦ N + RN)+ΦNz N , (97) where A+ denotes the Moore-Penrose pseudo-inverse of an operator A : Fk Fk. Let ΦN = UNSNV N represent the singular value decomposition of ΦN in the RKHS (Mollenhauer et al., 2020), where UN := [u1, . . . , u N], VN := [v1, . . . , v N], with {ui}N i=1 Fk and {vi}N i=1 RN denoting the left and right singular vectors, respectively, and SN RN N corresponds to the diagonal matrix of singular values of ΦN. There are N non-zero singular values, since k is assumed to be positive definite, and we have the correspondence KN = Φ NΦN = VNΛNV N with ΛN = S2 N representing the diagonal matrix of eigenvalues of KN, which is full-rank for a positive-definite kernel with distinct entries XN := {xi}N i=1 X. Applying the SVD to derive the pseudo-inverse in Equation 97 then yields: ˆπ = (ΦNΦ N + ΦN(e νs KN I) 1Φ N)+ΦNz N = (UNΛNU N + UNSNV N(e νs VNΛNV N I) 1VNSNU N)+UNSNV Nz N = UN(ΛN + ΛN(e νsΛN I) 1) 1SNV Nz N = UNSNV NVN(ΛN + ΛN(e νsΛN I) 1) 1V Nz N = ΦN(KN + KN(e νs KN I) 1) 1z N = ΦN(KN + ΣN) 1z N , which concludes the proof. For our analysis, we will assume that the classifier network is zero initialized with π0 = 0, noting that the least-squares problem can always be solved for the residuals z π0(x) and then have π0 added back to the solution. We refer the reader to Lee et al. (2019) for further discussion on the effect of the network initialization. Approximation for finite-width networks. For fully connected, convolutional or residual networks equipped with smooth activation functions (e.g., sigmoid or tanh), Liu et al. (2020) showed that the approximation error between the linear model and the finite-width NN is e O(m 1/2), where m denotes the minimum layer width, and the e O notation corresponds to the O-notation with logarithmic factors suppressed. NTK results for other activation functions and different neural network architectures, such as multi-head attention (Hron et al., 2020), are also available in the literature. F.2 ASSUMPTIONS In the following, we present a series of mild technical assumptions needed for our theoretical analysis of NN-based CPEs. For this analysis, we mainly assume that the true classifier π (x) = p(y > τ|x, f ) is a fixed, though unknown, element of the RKHS Fk given by a bounded NTK k, which is formalized by the following two assumptions. As in the GP case, we assume batches of size B = 1 to simplify the analysis. Assumption F.1. There is π Fk such that: π (x) = p(y > τ|x, f ), x X. (99) For a rich enough RKHS, such assumption is mild, especially given that most popular NN architectures offer universal approximation guarantees (Hornik et al., 1989). Assumption F.2. The NTK k corresponding to the network architecture in πθ is positive definite and bounded in X. We will also assume that the threshold is fixed to simplify the analysis. However, our results should asymptotically hold for time-varying thresholds as long as the limit limt τt = τ exists. Assumption F.3. The threshold is fixed, i.e., τt = τ R, for all t 1. Published as a conference paper at ICLR 2025 The following assumption on label noise should always hold for Bernoulli random variables (Boucheron et al., 2013). Any upper bound on the sub-Gaussian parameter should suffice for the analysis (e.g., σζ 1 for Bernoulli variables). Assumption F.4. For all t N and all x X, label noise ζ = 1[y > τ] π (x), with y p(y|x, f ), is σζ-sub-Gaussian: a R, E [exp (aζ)] exp for some σζ 0. The next assumption ensures a sufficient amount of sampling is asymptotically achieved over the domain X, which we still assume is finite. Assumption F.5. For any t 1, the variational family is such that sampling probabilities are bounded away from 0, i.e.: b > 0 : t N, q(x|ϕt) b, x X . (101) The assumption above only imposes mild constraints on the generative models q(x|ϕ), so that probabilities for all candidates x X are never exactly 0, though still allowed to be arbitrarily small. Assumption F.6. The learning rate νt at each round t is such that: 0 < ν νt 1 λmax(Kt) , t N , (102) for some ν > 0, where λmax( ) denotes the maximum eigenvalue of a matrix, and Kt denotes the NTK matrix evaluated at the training points available at iteration t 1. This last assumption ensures that a gradient descent algorithm is convergent (Fleming, 1990), though we use it to bound the spectrum of the implicit regularization matrix Σt after a finite number of training steps s < (a.k.a. early stopping), which is needed for our results. We highlight that, under mild assumptions on the data distribution, λmax(Kt) O(1) w.r.t. the number of data points (Murray et al., 2023), so that the bound in Assumption F.6 will not vanish. Therefore, such assumption is easily satisfied by maintaining a sufficiently small learning rate. F.3 APPROXIMATION ERROR FOR NN-BASED CPES Similar to the GP-PI setting, we will assume a batch size of 1, so that we can simply use the iteration index t 0 for our estimators. We recall that convergence rates for the batch setting should only be affected by a batch-size-dependent multiplicative factor, preserving big-O convergence rates. We start by defining the following proxy variance: t 1, ˆσ2 t (x) = k(x, x) kt(x) (Kt + Σt) 1kt(x) , x X, (103) where Σt is the implicit regularization matrix in Equation 93 due to early stopping. The proxy variance is then equivalent to a GP posterior variance under the assumption of heteroscedastic (i.e., input dependent) Gaussian noise with covariance matrix given by Σt. Given its similarities, we have that if enough sampling is asymptotically guaranteed, we can apply the same convergence results available for the GP-PI-based CPE, i.e., ˆσ2 t O(t 1) almost surely. Lemma F.2. Let assumptions F.2, F.5 and F.6 hold. Then the following almost surely holds for the proxy variance: ˆσ2 t O(t 1) . (104) Proof. We first observe that ˆσ2 t (103) is upper bounded by the posterior predictive variance of a GP model assuming i.i.d. Gaussian noise (cf. Sec. E.1) with variance ρ satisfying: ρ λmax(Σt) , t N , (105) Published as a conference paper at ICLR 2025 which is such that: Σt = Kt(eνts Kt I) 1 Kt(νts Kt) 1 since e A I+A, for any Hermitian matrix A, where denotes the Loewner partial ordering in the space of positive-semidefinite matrices, i.e., A B if and only if A B is positive semidefinite. Noting that the sum of sampling probabilities at any point x X diverges as t by Assumption F.5, the result then follows by applying Lemma E.5 to the GP predictive variance upper bound with noise variance set to ρ := (sν ) 1 (Assumption F.6). Lemma F.3. Let assumptions F.1 to F.6 hold. Then, given any δ (0, 1], the following holds with probability at least 1 δ for the approximation error between ˆπt and π : t 1, |ˆπt(x) π (x)| βt(δ)ˆσt(x), x X, (107) where βt(δ) := π k + σζ p 2ρ 1 log(det(I + ρ 1Kt)1/2/δ), and ρ := e s Proof. The result above is a direct application of Theorem 3.5 in Maillard (2016) which provides an upper confidence bound on the kernelized least-squares regressor approximation error (another version of the same result is also available in Durand et al. (2018, Thr. 1)). Let ζi := zi π (xi) denote the label noise in observation i, for i {1, . . . , t}. Expanding the definition of ˆπt (93) with π0 = 0, given any t N and x X, we can decompose the approximation error as: |π (x) ˆπt(x)| = |π (x) kt(x) (Kt + Σt) 1zt| = |π (x) kt(x) (Kt + Σt) 1(π t + ζt)| |π (x) kt(x) (Kt + Σt) 1π t | + |kt(x) (Kt + Σt) 1ζt| where we applied the triangle inequality to obtain the last line. Analyzing the two terms on the right-hand side, by the reproducing property, we now have for the first term: |π (x) kt(x) (Kt + Σt) 1π t | = | π , (I Φt(Kt + Σt) 1Φ t )φ(x) k| = | π , (I + ΦtΣ 1 t Φ t ) 1φ(x) k| π k (I + ΦtΣ 1 t Φ t ) 1φ(x) k φ(x) (I + ΦtΣ 1 t Φ t ) 2φ(x) φ(x) (I + ΦtΣ 1 t Φ t ) 1φ(x) = π kˆσt(x), where the second equality follows by an application of Woodbury s identity, the first inequality is due to Cauchy-Schwarz, the second inequality is due to the fact that A 2 A 1 whenever A I, and the last line follows from the definition of ˆσ2 t . For the remaining in term (108), we have that: |kt(x) (Kt + Σt) 1ζt| = | φ(x), Φt(Kt + Σt) 1ζt k| = | φ(x), (I + ΦtΣ 1 t Φ t ) 1ΦtΣ 1 t ζt k| = | (I + ΦtΣ 1 t Φ t ) 1/2φ(x), (I + ΦtΣ 1 t Φ t ) 1/2ΦtΣ 1 t ζt k| φ(x) (I + ΦtΣ 1 t Φ t ) 1φ(x) (I + ΦtΣ 1 t Φ t ) 1/2ΦtΣ 1 t ζt k ζ t Σ 1 t Φ t (I + ΦtΣ 1 t Φ t ) 1ΦtΣ 1 t ζt, where we applied the identity B (BB + A) 1 = (I + B A 1B) 1B A 1, which holds for an invertible matrix A (Searle, 1982), to obtain the second equality, and the upper bound follows by the Cauchy-Schwarz inequality. For the norm of the noise-dependent term, we will apply a Published as a conference paper at ICLR 2025 concentration bound by Abbasi-Yadkori (2012), which first requires a few transformations towards a non-time-varying regularization factor. Applying the SVD Φt = Ut St V t (Mollenhauer et al., 2020), as in the proof of Lemma F.1, we have that Σt = Vt Dt V t , where Dt := Λt(eνtsΛt I) 1 and Λt = S2 t, which leads us to: (I + ΦtΣ 1 t Φ t ) 1/2ΦtΣ 1 t ζt 2 k = ζ t Σ 1 t Φ t (I + ΦtΣ 1 t Φ t ) 1ΦtΣ 1 t ζt = ζ t Vt D 1 t St U t (I + Ut St D 1 t St U t ) 1Ut St D 1 t V t ζt = ζ t Vt S2 t D 2 t (I + D 1 t S2 t) 1V t ζt = ζ t Vt(S 2 t D2 t + Dt) 1V t ζt , (111) where we applied the identity (I + AB) 1A = A(I + BA) 1 (Searle, 1982). For the eigenvalues of Σt, we have the following lower bound: Dt = Λt(eνtsΛt I) 1 = Λte νtsΛt(I e νtsΛt) 1 Λte νtsΛt(νtsΛt) 1 = 1 νtse νtsΛt 1 ν se s I , where the first inequality is due to e A I A, and the last inequality holds by Assumption F.6. Hence, setting ρ := e s sν , we have that: (I + ΦtΣ 1 t Φ t ) 1/2ΦtΣ 1 t ζt 2 k ζ t Vt(ρ2S 2 t + ρI) 1V t ζt = ρζ t Vt St(ρI + S2 t) 1St V t ζt = ρ 1ζ t Vt St(ρI + S2 t) 1St V t ζt = ρ 1ζ t Φ t (ρI + ΦtΦ t ) 1Φtζt = (ρI + ΦtΦ t ) 1/2Φtζt 2 k By Abbasi-Yadkori (2012, Cor. 3.6), given any δ (0, 1], we then have that the following holds with probability at least 1 δ: t 1, (I + ΦtΣ 1 t Φ t ) 1/2ΦtΣ 1 t ζt 2 k 2σ2 ζ log det(I + ρ 1Kt)1/2 Finally, combining the bounds above into Equation 108 leads to the result in Lemma F.3. For the next result, we need to define the following quantity: ξT := max XT X:|XT | T 1 2 log det(I + ρ 1K(XT )) , (115) where K(XT ) := [k(x, x )]x,x XT R|XT | |XT |. Note that ξT corresponds to the maximum information gain of a GP model (Srinivas et al., 2010) with covariance function given by the NTK, assuming Gaussian observation noise with variance given by ρ. Then ξT is mainly dependent on the eigenvalue decay of the kernel under its spectral decomposition (Vakili et al., 2021). For the spectrum of the NTK, a few results are available in the literature (Murray et al., 2023). Proposition F.2. Let assumptions F.1 to F.6 hold. Then, given δ (0, 1], the following holds with probability at least 1 δ for VSD equipped with a wide enough NN-based CPE model ˆπt: DKL[p(x|y > τt, Dt) p(x|y > τt, f )] OP Published as a conference paper at ICLR 2025 Proof. The result follows by applying the same steps as in the proof of Theorem 2.1. We note that ℓ t (x) = π (x) > 0, due to observation noise, so that ℓ t (x) 1 OP(1). Similarly, Lemma F.3 implies that |ˆπt(x) π (x)| βt(δ)σt(x) with probability at least 1 δ simultaneously over all x X, so that ratio-dependent terms in Theorem 2.1 should remain bounded in probability. The upper bound in the result then follows by noticing that in our case | ℓt(x)| βt(δ)ˆσt(x) with high probability, where ˆσt O(t 1/2) by Lemma F.2, and βt(δ) O( ξt) by Lemma F.3 and the definition of ξt in Equation 115. The result above tells us that VSD equipped with an NN-based CPE can recover a similar asymptotic convergence guarantee to the one we derived for the GP-PI case, depending on the choice of NN architecture and more specifically on the spectrum of its associated NTK. In the case of a fully connected multi-layer Re LU network, for example, Chen & Xu (2021) showed an equivalence between the RKHS of the Re LU NTK and that of the Laplace kernel k(x, x ) = exp( C x x ). As the latter is equivalent to a Mat ern kernel with smoothness parameter set to 0.5 (Rasmussen & Williams, 2006), the corresponding information gain bound is ξt e O(t d 1+d ), where d here denotes the dimensionality of the domain X (Vakili & Olkhovskaya, 2023). In the case of discrete sequences of length M, the dimensionality of X is determined by M. Hence, we have proven Corollary 2.2.4 Corollary 2.2. Let πθ be modeled via a fully connected Re LU network. Then, under assumptions on identifiability and sampling (F.1 to F.6), in the infinite-width limit, VSD with CPE-PI achieves: DKL[p(x|y > τt, Dt) p(x|y > τt, f )] e OP t 1 2(M+1) . (14) Similar steps can be applied to derive convergence guarantees for VSD with other neural network architectures based on the eigenspectrum of their NTK (Murray et al., 2023) and following the recipe in, e.g., Vakili et al. (2021) or Srinivas et al. (2010). G VSD AS A BLACK-BOX OPTIMIZATION LOWER BOUND A natural question to ask is how VSD relates to the BO objective for probability of improvement (Garnett, 2023, Ch.7), x t = argmax x log αP I(x, DN, τ) . (117) Firstly, we can see that the expected log-likelihood of term of Equation 6 lower-bounds this quantity. Proposition G.1. For a parametric model, q(x|ϕ), given ϕ Φ Rm and q P : X Φ [0, 1], max x log αP I(x, DN, τ) max ϕ Eq(x|ϕ)[log αP I(x, DN, τ)] , (118) and the bound becomes tight as q(x|ϕ t ) δ(x t ), a Dirac delta function at the maximizer x t . Taking the argmax of the RHS will result in the variational distribution collapsing to a delta distribution at x t for an appropriate choice of q(x|ϕ). The intuition for Equation 118 is that the expected value of a random variable is always less than or equal to its maximum. The proof of this is in Daulton et al. (2022); Staines & Barber (2013). Extending this lower bound, we can show the following. Proposition G.2. For a divergence D : P(X) P(X) [0, ), and a prior p0 P(X), max x log αP I(x, DN, τ) max ϕ Eq(x|ϕ)[log αP I(x, DN, τ)] D[q(x|ϕ) p0(x)] . (119) We can see that this bound is trivially true given the range of divergences, and this covers VSD as a special case. However, this bound is tight if and only if p0 concentrates as a Dirac delta at x t with an appropriate choice of q(x|ϕ). In any case, the lower bound remains valid for any choice of informative prior p0 or even a uninformed prior, which allows us to maintain the framework flexible to incorporate existing prior information whenever that is available. 4Here e OP suppresses logarithmic factors, as in e O, and holds in probability.