# automated_model_selection_with_bayesian_quadrature__66bb878b.pdf Automated Model Selection with Bayesian Quadrature Henry Chai 1 Jean-Franc ois Ton 2 Michael A. Osborne 3 4 Roman Garnett 1 We present a novel technique for tailoring Bayesian quadrature (BQ) to model selection. The state-of-the-art for comparing the evidence of multiple models relies on Monte Carlo methods, which converge slowly and are unreliable for computationally expensive models. Although previous research has shown that BQ offers sample efficiency superior to Monte Carlo in computing the evidence of an individual model, applying BQ directly to model comparison may waste computation producing an overly-accurate estimate for the evidence of a clearly poor model. We propose an automated and efficient algorithm for computing the most-relevant quantity for model selection: the posterior model probability. Our technique maximizes the mutual information between this quantity and observations of the models likelihoods, yielding efficient sample acquisition across disparate model spaces when likelihood observations are limited. Our method produces moreaccurate posterior estimates using fewer likelihood evaluations than standard Bayesian quadrature and Monte Carlo estimators, as we demonstrate on synthetic and real-world examples. 1. Introduction Model selection is a fundamental problem that arises in the course of scientific inquiry: which of several candidate models best explains an observed dataset D? The Bayesian approach to model selection involves computing posterior model probabilities, the probability that each model generated the observations. This approach requires computing model evidences, which can be expressed as integrals of the 1Department of Computer Science and Engineering, Washington University in St. Louis, Saint Louis, MO, USA 2Department of Statistics, University of Oxford, Oxford, United Kingdom 3Department of Engineering Science, University of Oxford, Oxford, United Kingdom 4Mind Foundry, Oxford, United Kingdom. Correspondence to: Henry Chai . Proceedings of the 36 th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s). form Z = R f(D | θ) π(θ) dθ, where θ is a vector of model parameters, f(D | θ) is a likelihood, and π(θ) is a prior. Unfortunately, for many real-world model selection tasks, these integrals are computationally intractable and must be estimated numerically. Numerous commonly-used techniques to estimate such integrals rely on Monte Carlo estimators (Metropolis et al., 1953; Hastings, 1970). These methods converge slowly in terms of the number of required integrand samples as they do not incorporate knowledge about sample locations. This makes such methods ill-suited for settings where the integrand is expensive to evaluate. One alternative is Bayesian quadrature (BQ) (Larkin, 1972; Diaconis, 1988; O Hagan, 1991; Rasmussen & Ghahramani, 2003), which relies on a probabilistic belief on the integrand that can be conditioned on observations to derive a posterior belief about the value of the integral. The theoretical properties of kernel quadrature methods (including BQ) have been studied at length: these methods can achieve faster convergence rates than Monte Carlo estimators (Briol et al., 2015; Bach, 2017; Karvonen et al., 2018), even when the underlying model is misspecified (Kanagawa et al., 2016; 2017), a commonly-cited pitfall of kernel-based methods. There has been significant work investigating how traditional, Monte Carlo based methods can be adapted to efficiently estimate posterior model probabilities for model selection (Neal, 2001; Green, 1995; Carlin & Chib, 1995; Meng & Wong, 1996; Godsill, 2001; Skilling, 2004). However, no analogous work has appeared to adapt BQ for this important task. That is our goal in this work. We propose a principled adaptation of BQ designed to automate model selection. Specifically, we define a novel acquisition function for active selection of locations to observe model likelihoods. This acquisition function corresponds to the mutual information between observations of the model likelihood and a quantity specifically relevant to the task of model selection: the posterior model probabilities. This allows our method to automatically select informative sample locations across multiple model parameter spaces unlike previous active BQ approaches to model selection (Osborne et al., 2012; Gunter et al., 2014; Chai & Garnett, 2019), which focused on accurately estimating individual model evidences. We illustrate the shortcomings of such approaches using a toy motivating example. Experiments conducted on Automated Model Selection with Bayesian Quadrature real-world and synthetic data demonstrate that our proposed method can outperform existing BQ techniques and specialized Monte Carlo methods in terms of efficiently reaching accurate estimates of posterior model probabilities. 2. Related Work Much work has been devoted to developing Monte Carlo methods specifically designed for model selection. Broadly speaking, these methods can be broken down into two groups: within-model approaches, such as annealed importance sampling (AIS) (Neal, 2001), nested sampling (Skilling, 2004), and bridge sampling (Bennett, 1976; Meng & Wong, 1996), and trans-dimensional approaches such as Green s (1995) reversible jump MCMC and Godsill s (2001) composite model space framework, a generalization of the product-space approach proposed by Carlin & Chib (1995). Within-model approaches estimate each model s model evidence separately, whereas trans-dimensional approaches directly estimate posterior model probabilities. In our experiments we compare our method against one prominent Monte Carlo method from each category: bridge sampling (within-model) and reversible jump MCMC (transdimensional). All commonly used Monte Carlo methods for model selection have pros and cons, and their performance on specific tasks can be greatly affected by open-ended modeling choices such as the choice of intermediate densities for AIS or the choice of pseudo-priors for the composite model space framework of Godsill (2001). It is beyond the scope of this work to comprehensively analyze all widely used Monte Carlo model selection methods; however, we believe the chosen benchmarks to be reasonable and competitive. In particular, the design choices associated with bridge sampling are easily justified and give rise to greater transparency in our experimental design as opposed to many possible alternatives. Among the commonly used trans-dimensional methods, the original product space approach of Carlin & Chib (1995) made use of a Gibbs sampler, which requires conjugate conditional likelihoods that do not exist in many model selection settings. Godsill (2001) showed that replacing the Gibbs sampler in their composite space model with a Metropolis Hastings proposal mechanism gives rise to Green s (1995) reversible jump MCMC. Godsill (2001) also claimed that the use of such a Metropolis Hastings proposal mechanism is preferable to the use of a Gibbs sampler in nested model settings, that is, settings where there is an overlap in model parameter spaces. As our real-world experimental setting is a model selection task between nested models, the choice to compare against reversible jump MCMC is well justified. Previous work has been done on adapting BQ to situations where the integrand of an intractable integral is known to be nonnegative a priori (Osborne et al., 2012; Gunter et al., 2014; Chai & Garnett, 2019). Such integrals occur frequently in machine learning tasks, including model selection: the model evidence is an integral of probability distributions, which are nonnegative everywhere. These methods make use of warped Gaussian processes (GPs) (Snelson et al., 2004) to weakly enforce the nonnegativity constraint. They have been shown to outperform BQ algorithms that are agnostic to a priori information on a variety of model selection tasks. However, we will show that our method can lead to even greater improvements. Furthermore, our methodology is compatible with the use of warped GPs; in fact, our proposed method can be seen as an instantiation of the framework laid out by Chai & Garnett (2019) with a novel acquisition function. As a final note, our focus in this manuscript is on the traditional Bayesian approach to model selection, which involves the computation of model posteriors. We acknowledge the existence of several alternative approaches to Bayesian model selection (Bernardo & Smith, 1994; Vehtari, 2001; Watanabe, 2010). An extension of our proposed method to these alternatives is a potential line of future inquiry. 3. Background 3.1. Bayesian Model Selection For the purposes of this work, a model is defined as a parametric family of probability distributions that can be used to explain some observed dataset, D. Given a finite set of candidate models {M1, . . . , Mk}, the Bayesian approach to inference in this setting is to reason about the conditional or posterior distribution over models via Bayes theorem: Pr(Mi | D) = Pr(D | Mi) Pr(Mi) = Pr(D | Mi) Pr(Mi) Pk j=1 Pr(D | Mj) Pr(Mj) (1) where Pr(Mi | D) is known as the posterior probability of model Mi, Pr(D | Mi) is the model evidence of model Mi and Pr(Mi) is the prior probability of model Mi. The computation of model evidences requires integrating out the model parameters that control the likelihood of a given model generating the observed data: Pr(D | Mi) = Z Pr(D | Mi, θi) Pr(θi) dθi (2) where θi is the vector of model parameters corresponding to model Mi, Pr(D | Mi, θi) is the likelihood of D under Mi parameterized by θi, and Pr(θi) is the prior probability of the model parameters θi. Given posterior model probabilities, one common practice is to find M = arg max M Pr(Mi | D) and then treat M Automated Model Selection with Bayesian Quadrature as the true, data-generating model for the purpose of future inference tasks. We will refer to this variant of the model selection task as model choice. An alternative to model choice is model averaging: instead of simply using the single, most-likely candidate model, model averaging takes a fully-Bayesian viewpoint by using the posterior model probabilities to marginalize out the choice of model for subsequent inference tasks. 3.2. Mutual Information The mutual information between two random variables is a measure of how much information observing the value of one provides about the other. Formally, given two random variables X and Y , the mutual information of X and Y is I(X; Y ) = H(X) H(X | Y ), (3) where H(X) is the entropy of X, and H(X | Y ) is the conditional entropy of X given Y . If X is discrete with PMF p and domain X, then the entropy is defined to be x X p(x) log p(x) (4) and if X is continuous with PDF p and domain X, then we instead use the differential entropy: X p(x) log p(x) dx. (5) The conditional entropy H(X | Y ) is defined to be the expected (differential) entropy of the posterior distribution p(X | Y ), where the expectation is taken with respect to Y . Therefore the mutual information can be interpreted as the expected information gained about X (that is, the expected reduction in entropy) when measuring Y . 3.3. Bayesian Quadrature Given some intractable integral Z = R f(θ) π(θ) dθ, Bayesian quadrature places a Gaussian process (GP) prior belief on the function f(θ) (or occasionally on the product f(θ) π(θ) directly). A GP specifies a probability distribution over functions, where the joint distribution of the function s value at finitely many locations is multivariate normal. A GP is fully specified by its first two moments: a mean function µ(x) and a covariance function Σ(x, x ). Given a set of observations at locations x D = {x1, . . . , xn} with corresponding function values f(x D), a GP can be conditioned on these observations to arrive at a posterior GP with mean µD(x) = µ(x) + Σ(x, x D)Σ(x D, x D) 1(f(x D) µ(x D)) and covariance ΣD(x, x ) = Σ(x, x ) Σ(x, x D)Σ(x D, x D) 1Σ(x D, x ). For a comprehensive overview of GPs, see (Rasmussen & Williams, 2006). Bayesian quadrature makes use of the fact that GPs are closed under linear functionals (Rasmussen & Ghahramani, 2003), meaning that a GP belief on f induces a Gaussian belief on L[f], where L is any linear functional. As integration against a probability distribution is such a functional, if p(f) = GP(f; µ, Σ), then p(Z) = N(Z; m, K), where m = Z µ(x) π(x) dx ; (6) K = ZZ Σ(x, x ) π(x) π(x ) dx dx . (7) A design choice that must be addressed when using BQ is where to observe the integrand f. One natural approach is to select sample locations so as to minimize one s uncertainty about Z which is equivalent to minimizing the entropy of Z. Because the posterior variance of a GP does not depend on the observed function values (see above), if a GP prior is placed directly on f, then an optimal sampling design (w.r.t. this objective) can be specified in advance (Minka, 2000). However, it is often appropriate to specify a GP prior not on f but on an affine transformation of f in order to incorporate some a priori information. Doing so introduces a dependency between the posterior variance and observed function values. Thus, the optimal sampling sequence cannot be precomputed. One option in this setting is to sequentially select sample locations in order to minimize the expected entropy of Z as proposed by Osborne et al. (2012). As an alternative, Gunter et al. (2014) propose an active sampling mechanism that iteratively minimizes the entropy of the integrand instead of the value of the integral as doing so is more computationally efficient and numerically stable. 4. Motivation Consider the task of selecting between two models, M1 and M2, given data D. Suppose that BQ is used to estimate the model evidences for both models. After some number of iterations of BQ, the posterior beliefs (implicitly conditioned on BQ evaluations) on the model evidences are plotted on the same axis; as an example, see Figure 1. Figure 1 depicts a situation where there is high uncertainty about both model evidences; however, the uncertainty in the posterior model probabilities is low. In particular, for this toy example, Pr(M1 | D) is almost certainly close to one, while Pr(M2 | D) is almost certainly close to zero. This example illustrates the fact that it is not necessary to have low-entropy estimates of model evidences to have lowentropy estimates of posterior model probabilities and thus, methods that aim to achieve accurate estimates of model evidences may be inefficiently sampling observations when the goal to is to achieve accurate estimates of the posterior model probabilities. Automated Model Selection with Bayesian Quadrature p Pr(D | M2) p Pr(D | M1) Figure 1. A toy example of the posteriors for two model evidences; observe that both posterior beliefs are Gaussian, a direct result of the closure of GPs under linear functionals (see (6) and (7)) We propose an adaptation of the standard BQ algorithm to the task of model selection. The principal novelty of our proposed method is in selecting where to observe the likelihood functions of the models involved. Rather than selecting locations with the goal of achieving accurate estimates of the model evidences, as previous work has considered at length (Osborne et al., 2012; Gunter et al., 2014), our method seeks to achieve accurate estimates of posterior model probabilities, a more essential quantity to model selection. Our method makes use of the mutual information between model likelihood observations and posterior model probabilities, allowing them to choose informative sample locations across multiple model parameter spaces simultaneously. Suppose that we have observed some dataset D and have k candidate models {M1, . . . , Mk}, to explain D. Let ℓi(θi) = p(D | θi, Mi) be the likelihood for Mi. We assume that these likelihoods have mutually independent Gaussian process priors: p(ℓi) = GP(ℓi; µi, Σi). Now the model evidence of Mi: Θi ℓi(θi) πi(θi) dθi (8) is normally distributed as p(ai) = N(ai; mi, Ki), where mi and Ki are given by the BQ identities (6) and (7). The posterior probability of Mi can be expressed as: zi = ai Pk j=1 aj . (9) Formally, we consider the mutual information between ℓi(θi) and z = [z1, . . . , zk]: I ℓi(θi); z = H ℓi(θi) H ℓi(θi) | z . (10) Here ℓi(θi) is just a univariate Gaussian and its entropy is H ℓi(θi) = 1 2 log 2πe Σi(θi, θi). Interestingly, the conditional random variable ℓi(θi) | z is also a univariate Gaussian. To see this, consider the joint density between ℓi(θi) and b i = [b1, . . . , bi 1, bi+1, . . . , bk] where bj = zjaj + zj X j [k] ,j =j aj , (11) zj = zj 1 , [k] = {1, . . . , k} . As all ai are independent and Gaussian, b i follows a multivariate Gaussian distribution. We will denote the mean of bj by βj, the variance of bj by sj,j and the covariance between bj and bj by sj,j where βj = zjmj + zj X j [k] ,j =j mj , (12) sj,j = z2 j Kj + z2 j X j [k] ,j =j Kj , (13) sj,j = zjzj Kj + zj zj Kj + zjzj X j [k] , j =j,j Kj . (14) Furthermore, ℓi(θi) is jointly Gaussian with b i; the covariance between ℓi(θi) and bj is zj Li(θi) where Θi Σi(θi, θ i) πi(θ i) dθ i , (15) a result that follows from our assumption that all GP priors are mutually independent and the fact that GPs are closed under integration. Lastly, we note that observing z is equivalent to observing that b i = 0; in essence, this follows because an observation of z collapses the joint Gaussian distribution between all model evidences down to a hyperplane, where each point with support under the conditional belief satisfies the invariant that bj = zjaj + zj P j [k] ,j =j aj = 0 j [k]. Thus, we can conclude that ℓi(θi) | z and ℓi(θi) | b i = 0 have the same distribution. Using the fact that Gaussians are closed under conditioning, it follows that ℓi(θi) | z is a Gaussian random variable and its entropy is H ℓi(θi) | z = 1 2 log 2πe Σi|z(θi, θi) where Σi|z(θi, θi) = Σi(θi, θi) z1Li(θi) ... zi 1Li(θi) zi+1Li(θi) ... zk Li(θi) s1,1 . . . s1,k ... ... ... si 1,1 . . . si 1,k si+1,1 . . . si+1,k ... ... ... sk1 . . . skk z1Li(θi) ... zi 1Li(θi) zi+1Li(θi) ... zk Li(θi) The mutual information between ℓi(θi) and z can now be framed as an expectation over z: I(ℓi(θi); z) = 1 2 log Σi(θi, θi) Z log Σi|z(θi, θi) p(z) dz. (17) Automated Model Selection with Bayesian Quadrature To design our next observation, we choose to evaluate the likelihood at the point maximizing the expected information gain about z, where we maximize over the parameter spaces of all models. Formally, we choose to observe the likelihood ℓi (θ i ) where θ i = arg max θi Θi I ℓi(θi); z , (18) i = arg max i {1,...,k} θ i . (19) Unfortunately, the integral in (17) is intractable. Luckily, it can be accurately and efficiently estimated using standard numerical techniques. Specifically, we can generates samples of z by drawing from the posterior distribution over a = [a1, . . . , ak] using (9). Also note that this estimation is only performed as a means of selecting likelihood observation locations; the effect of inaccuracies in this estimate on the estimated model evidences will be negligible. As a final note, observe that it is imperative to omit an element from the vector b i as specifying k 1 of the k model posteriors fully determines the last one: the covariance matrix of the full random vector b = [b1, . . . , bk] is singular, making the square matrix in (16) non-invertible. The choice to omit bi is largely arbitrary, although it does simplify the notation slightly. 6. Experiments We perform experiments on synthetic and real-world data in which we compare our proposed method against roundrobin BQ, where likelihood evaluations are evenly distributed between all model parameter spaces, and two Monte Carlo based benchmarks: bridge sampling (Meng & Wong, 1996) and reversible jump MCMC (Green, 1995). Our implementation of bridge sampling follows the one described by Gronau et al. (2017): specifically, we use the optimal bridge function defined by Meng & Wong (1996) and a Gaussian proposal distribution with moments fit to samples from the true posterior distribution (as suggested by Overstall & Forster (2010)). Our choice of diffeomorphism for reversible jump MCMC varies by experimental setting and is described in the relevant sections below. For all BQ methods, constant-mean GP priors with Mat ern covariance functions (ν = 3/2) were placed on the log of the model likelihoods and all GP hyperparameters were fit in accordance with the framework defined by Chai & Garnett (2019). Our implementation of round-robin BQ uses uncertainty sampling to select locations to observe log-likelihoods, as proposed by Gunter et al. (2014). 6.1. Synthetic Experiments For our synthetic experiments, we consider a model selection task between two zero-mean GP models: one chosen to have a squared exponential covariance and one chosen to have a Mat ern covariance with ν = 5/2. The observed dataset D consists of 5d observations from a d-dimensional, zero-mean GP with a squared exponential covariance. Each model is parameterized by the d length-scales of their respective covariance functions (for the sake of simplicity, all other GP hyperparameters were set to be the same as the true, data-generating GP s). In this setting, prior knowledge of the experimental setting suggests that the two likelihood functions are similar. Thus an appropriate choice of diffeomorphism is the identity function and the corresponding Jacobian is always 1. The intractable integrals associated with our proposed method are estimated using 10 000 simple Monte Carlo (SMC) samples, i.e., samples drawn from the probability distribution being integrated against. We allot a budget of 50d total likelihood evaluations and initialize each BQ based method with 5d randomly sampled likelihood observations from both model parameter spaces. We run experiments with d ranging from 1 to 4 and for each value of d, we consider 100 different, randomly sampled observed datasets. All methods are evaluated on the fractional error of their z1 estimates with ground truth values being determined by exhaustive SMC. As Figure 2 shows, our proposed method outperforms all benchmarks compared against. Furthermore, the difference in performance between our proposed method and both round-robin BQ and bridge sampling is significant at the 1% significance level across all dimensions according to a one-sided paired t-test; the difference between our proposed method and reversible jump MCMC is significant at the 1% significance level for d =1, 2, and 3. 6.2. Real-World Experiments Our real-world application is a model selection problem from the field of astrophysics. Given spectrographic observations of quasar emissions, astrophysicts are interested in inferring the existence of damped Lyman-α absorbers (DLAs) between the quasar and earth. DLAs are large gaseous clouds containing neutral hydrogen at high densities. The distribution of DLAs throughout the universe is of interest as it informs galaxy formation models. Their location and size can be inferred from quasar spectra as they cause distinctive dips in the observed flux at well-defined wavelengths. Garnett et al. (2017) developed a model that specifies the likelihood that a quasar emission spectrum contains arbitrarily many DLAs. Their likelihood model for n DLAs is parameterized by 2n parameters, two for each putative DLA: one that corresponds to its size and one that corresponds to its distance from earth. Garnett et al. (2017) also specified a data-driven prior over these parameters; computing the model evidence for any number of DLAs requires integrating the likelihood against this prior, an intractable integral. Automated Model Selection with Bayesian Quadrature 10 20 30 40 50 0 total likelihood evaluations fractional error of z1 20 40 60 80 100 0 total likelihood evaluations fractional error of z1 round robin mutual information with z1 bridge RJMCMC 40 60 80 100 120 140 total likelihood evaluations fractional error of z1 50 100 150 200 total likelihood evaluations fractional error of z1 Figure 2. Fractional errors of z1 for all tested methods as a function of the total number of model likelihood observations For both of the experiments below, the diffeomorphism associated with our implementation of reversible jump MCMC is again the identity function and the corresponding Jacobian factor is 1. The intractable integrals associated with our proposed method are estimated using quasi-Monte Carlo (Caflisch, 1998). We evaluated all methods on the absolute error of their estimates of the log Bayes factor: log Bij = log zi log zj (Jeffreys, 1961; Kass & Raftery, 1995), another potential quantity of interest in model selection tasks. We consider the absolute error instead of the fractional error as the target quantity is a log value. We make use of Bartolucci et al. (2006) s work to translate the output Markov chain into a log odds estimate. 6.2.1. TWO MODELS In our first experiment on this dataset, we consider two candidate models for each quasar emission spectrum: the first corresponding to a single DLA and the second corresponding to two DLAs. In this experimental setting, we use the data-driven prior of Garnett et al. (2017) as the proposal distribution for the two additional parameters when transitioning from the single DLA model to the two DLA model. We select 20 spectra from phase III of the Sloan Digital Sky Survey (SDSS III) (Eisenstein et al., 2011) where the model posteriors for the above models are known a priori to be between 0.4 and 0.6; this choice makes the model selection task difficult in some sense. We allot a budget of 150 total likelihood evaluations and initialize each BQ based method with 25 randomly sampled likelihood observations from both model parameter spaces. We repeat the experiment 5 times for each spectra, using a different initialization for each trial. As Figure 3 shows, our proposed method outperforms all benchmarks compared against. The difference in performance between our proposed method and both Monte Carlo based benchmarks is significant at the 1% significance level according to a one-sided paired t-test. Automated Model Selection with Bayesian Quadrature 60 80 100 120 140 10 total likelihood evaluations absolute error of log B12 Figure 3. Absolute errors of log odds for all tested methods as a function of the total number of model likelihood observations. This figure shares a legend with Figure 2 6.2.2. THREE MODELS To demonstrate the ability of our method to generalize to greater than two models, we ran an additional experiment on this dataset where we consider three candidate models for a quasar emission spectrum known to contain three DLAs: in addition to the two models from the previous section, we consider a third model that corresponds to the presence of three DLAs. We again use the data-driven prior of Garnett et al. (2017) as the proposal distribution for the two additional parameters when transitioning from the two DLA model to the three DLA model. We allot a budget of 180 total likelihood evaluations and initialize each BQ based method with 25 randomly sampled likelihood observations from each model parameter spaces. We repeat the experiment 10 times, using a different initialization for each trial. As Figure 4 shows, our proposed method once again outperforms all benchmarks compared against. The difference in performance between our proposed method and both Monte Carlo based benchmarks is significant at the 5% significance level according to a one-sided paired t-test. 6.3. Model Choice Experiments In situations where one is performing model choice as opposed to model averaging, the quantity of interest is not the model posterior probabilities but rather the model with the highest posterior probability, a related but different object. We considered an alternative acquisition function that targets this quantity, which we briefly present here. Given two candidate models M1 and M2, the goal in model choice is to determine the value of the indicator random variable [z1 > z2], where we have adopted the Iverson bracket notation. Therefore, instead of considering the mutual informa- 80 100 120 140 160 180 0 total likelihood evaluations absolute error of log B31 80 100 120 140 160 180 0 total likelihood evaluations absolute error of log B32 Figure 4. Absolute errors of log odds against M3 for all tested methods as a function of the total number of model likelihood observations. This figure shares a legend with Figure 2 tion between ℓi(θi) and z, one could conceivably consider the mutual information between ℓi(θi) and [z1 > z2] directly. Formally, this quantity can be expressed as I [z1 > z2]; ℓi(θi) = H [z1 > z2] H [z1 > z2] | ℓi(θi) . (20) Much like our proposed method, this alternative acquisition function searches over both models parameter spaces for the next evaluation location: θ i = arg max θi Θi I [z1 > z2]; ℓi(θi) , (21) for i {1, 2}. Because the quantity H [z1 > z2] does not involve either ℓ1(θ1) or ℓ2(θ2), it can be safely ignored when searching for this maximum. Unfortunately, the second term Automated Model Selection with Bayesian Quadrature in the mutual information (20) is intractable, much like the integral in (17). However, writing H [z1 > z2] | ℓi(θi) = Z H Pr [z1 > z2] | ℓi(θi) p ℓi(θi) dℓi(θi) , (22) we may recognize the expression as a one-dimensional integral that can also be estimated numerically. 20 40 60 80 100 0 total likelihood evaluations fractional error of z1 mutual information with [z1 > z2] Figure 5. Fractional errors of z1 for all tested methods as well as an alternative method considered specifically for the task of model choice. Please refer to Figure 2 for the omitted legend entries. Despite the arguably more rational choice to target [z1 > z2] instead of z, this approach did not perform well in our experiments. Figure 5 shows the performance of this method in the 2-dimensional synthetic experimental setting described above; this method s relative performance continues to drop off as the number of dimensions increases. We also considered a different performance metric: the fraction of trials where the model with the higher ground truth posterior probability is correctly identified. It is reasonable to expect an acquisition function that targets [z1 > z2] to outperform our method that targets z when considering this metric. The results for the 2-dimensional synthetic experimental setting are shown in Figure 6. We attribute the poor performance of this alternative acquisition function to the fact that the implied alternative objective of this acquisition function, the entropy of [z1 > z2], is less reliable than the entropy of z and thus, this method has a tendency to become overly confident too quickly. If at any point one of the models achieves a much higher posterior model probability, then this method samples the model likelihoods at functionally uninformative locations until the budget of evaluations has been expended. This is because from the perspective of this alternative method, the objective has already been optimized: Pr(z1 > z2) will either be very close to zero or very close to one with very little uncertainty. 20 40 60 80 100 total likelihood evaluations percentage of correct trials Figure 6. Fraction of trials where the correct model or model with the higher ground truth posterior probability would have been selected for all tested methods as well as an alternative method considered specifically for the task of model choice. Please refer to Figure 2 for the omitted legend entries. In future work, we hope to improve the performance of this alternative method, potentially by targeting the random variable z1 z2 instead of [z1 > z2] as we hypothesize that incorporating the magnitude of the difference will encourage continued exploration of the model parameter spaces. 7. Conclusion In this paper, we have presented a novel, BQ based method for automated model selection. Our proposed method makes use of a novel acquisition function that targets the entropy of the posterior model probabilities, quantities specifically relevant to the task of model selection. This allows our method to actively sample locations across multiple model parameter spaces simultaneously. Our experiments conducted on real-world and synthetic data show that our proposed method can outperform both previously published BQ approaches to model selection as well as Monte Carlo based model selection techniques in terms of achieving accurate posterior model probability estimates. One exciting line of inquiry that we hope to study in future work concerns the design of trans-dimensional covariance functions. In certain settings, our assumption that the model evidences are independent does not accurately reflect our a priori knowledge e.g. in model selection tasks with nested model parameters, we should expect model evidences to be at least slightly correlated. These types of relationships could be captured by a multi-task GP (Cressie, 1993; Yu et al., 2005) where the covariance between model likelihoods is learned alongside individual model likelihoods simultaneously. Automated Model Selection with Bayesian Quadrature Acknowledgements HC and RG were supported by the National Science Foundation under award numbers IIA 1355406 and IIS 1845434. JT was supported by EPSRC and MRC CDT in Next Generational Statistical Science under award number EP/L016710/1. We also extend thanks to the Big Bayes Group at Oxford University for access to computational resources. Bach, F. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1 38, 2017. Bartolucci, F., Scaccia, L., and Mira, A. Efficient Bayes factor estimation from the reversible jump output. Biometrika, 93(1):41 52, 2006. Bennett, C. H. Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics, 22(2):245 268, 1976. Bernardo, J. M. and Smith, A. F. Bayesian Theory. John Wiley & Sons, 1994. Briol, F.-X., Oates, C. J., Girolami, M., Osborne, M. A., and Sejdinovic, D. Probabilistic Integration: A Role in Statistical Computation? ar Xiv preprint ar Xiv:1512.00933v6 [stat.ML], 2015. Caflisch, R. E. Monte Carlo and quasi-Monte Carlo methods. Acta Numerica, 7:1 49, 1998. Carlin, B. P. and Chib, S. Bayesian model choice via Markov chain Monte Carlo methods. Journal of the Royal Statistical Society. Series B (Methodological), 57:473 484, 1995. Chai, H. and Garnett, R. Improving Quadrature for Constrained Integrands. In Proceedings of The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019. Cressie, N. A. Statistics for Spatial Data. John Wiley & Sons, 1993. Diaconis, P. Bayesian Numerical Analysis. Statistical Decision Theory and Related Topics, 4(1):163 175, 1988. Eisenstein, D. J., Weinberg, D. H., Agol, E., Aihara, H., Allende Prieto, C., Anderson, S. F., Arns, J. A., Aubourg, E., Bailey, S., Balbinot, E., and et al. SDSS-III: Massive Spectroscopic Surveys of the Distant Universe, the Milky Way, and Extra-Solar Planetary Systems. The Astronomical Journal, 142:72, September 2011. Garnett, R., Ho, S., Bird, S., and Schneider, J. Detecting Damped Lyman-α Absorbers with Gaussian Processes. Monthly Notices of the Royal Astronomical Society, 472 (2):1850 1865, 2017. Godsill, S. J. On the relationship between Markov chain Monte Carlo methods for model uncertainty. Journal of Computational and Graphical Statistics, 10(2):230 248, 2001. Green, P. J. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711 732, 1995. Gronau, Q. F., Sarafoglou, A., Matzke, D., Ly, A., Boehm, U., Marsman, M., Leslie, D. S., Forster, J. J., Wagenmakers, E.-J., and Steingroever, H. A tutorial on bridge sampling. Journal of Mathematical Psychology, 81:80 97, 2017. Gunter, T., Osborne, M. A., Garnett, R., Hennig, P., and Roberts, S. J. Sampling for Inference in Probabilistic Models with Fast Bayesian Quadrature. Advances in Neural Information Processing Systems (NIPS) 27, 2014. Hastings, W. K. Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika, 57 (1):97 109, 1970. Jeffreys, H. Theory of Probability. Oxford University Press, 3rd edition, 1961. Kanagawa, M., Sriperumbudur, B. K., and Fukumizu, K. Convergence guarantees for kernel-based quadrature rules in misspecified settings. In Advances in Neural Information Processing Systems (NIPS) 29, 2016. Kanagawa, M., Sriperumbudur, B. K., and Fukumizu, K. Convergence analysis of deterministic kernel-based quadrature rules in misspecified settings. ar Xiv preprint ar Xiv:1709.00147, 2017. Karvonen, T., Oates, C. J., and S arkk a, S. A Bayes Sard Cubature Method. ar Xiv preprint ar Xiv:1804.03016, 2018. Kass, R. E. and Raftery, A. E. Bayes factors. Journal of the American Statistical Association, 90(430):773 795, 1995. Larkin, F. M. Gaussian measure in Hilbert space and applications in numerical analysis. Rocky Mountain Journal of Mathematics, 2(3):379 422, 1972. Meng, X. and Wong, W. H. Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6(4):831 860, 1996. Automated Model Selection with Bayesian Quadrature Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087 1092, 1953. Minka, T. P. Deriving quadrature rules from Gaussian processes. Technical report, Statistics Department, Carnegie Mellon University, 2000. Neal, R. M. Annealed importance sampling. Statistics and Computing, 11(2):125 139, 2001. O Hagan, A. Bayes Hermite quadrature. Journal of Statistical Planning and Inference, 29:245 260, 1991. Osborne, M. A., Garnett, R., Ghahramani, Z., Duvenaud, D., Roberts, S. J., and Rasmussen, C. E. Active learning of model evidence using Bayesian quadrature. Advances in Neural Information Processing Systems (NIPS) 25, 2012. Overstall, A. M. and Forster, J. J. Default Bayesian model determination methods for generalised linear mixed models. Computational Statistics and Data Analysis, 54(12): 3269 3288, 2010. Rasmussen, C. E. and Ghahramani, Z. Bayesian Monte Carlo. Advances in Neural Information Processing Systems (NIPS) 16, 2003. Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006. Skilling, J. Nested sampling. Bayesian Inference and Maximum Entropy Methods in Science and Engineering, 735: 395 405, 2004. Snelson, E., Ghahramani, Z., and Rasmussen, C. E. Warped Gaussian processes. Advances in Neural Information Processing Systems (NIPS) 17, 2004. Vehtari, A. Bayesian model assessment and selection using expected utilities. Ph D thesis, Helsinki University of Technology; Teknillinen korkeakoulu, 2001. Watanabe, S. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11:3571 3594, 2010. Yu, K., Tresp, V., and Schwaighofer, A. Learning Gaussian processes from multiple tasks. In Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005.