# Variational Bayesian Optimal Experimental Design

Adam Foster, Martin Jankowiak, Eli Bingham, Paul Horsfall, Yee Whye Teh, Tom Rainforth, Noah Goodman

Department of Statistics, University of Oxford, Oxford, UK; Uber AI Labs, Uber Technologies Inc., San Francisco, CA, USA; Stanford University, Stanford, CA, USA

adam.foster@stats.ox.ac.uk

**Abstract.** Bayesian optimal experimental design (BOED) is a principled framework for making efficient use of limited experimental resources. Unfortunately, its applicability is hampered by the difficulty of obtaining accurate estimates of the expected information gain (EIG) of an experiment. To address this, we introduce several classes of fast EIG estimators by building on ideas from amortized variational inference. We show theoretically and empirically that these estimators can provide significant gains in speed and accuracy over previous approaches. We further demonstrate the practicality of our approach on a number of end-to-end experiments.

## 1 Introduction

Tasks as seemingly diverse as designing a study to elucidate human cognition, selecting the next query point in an active learning loop, and designing online feedback surveys all constitute the same underlying problem: designing an experiment to maximize the information gathered. Bayesian optimal experimental design (BOED) forms a powerful mathematical abstraction for tackling such problems [8, 23, 37, 43] and has been successfully applied in numerous settings, including psychology [30], Bayesian optimization [16], active learning [15], bioinformatics [42], and neuroscience [38]. In the BOED framework, we construct a predictive model $p(y|\theta, d)$ for possible experimental outcomes $y$, given a design $d$ and a particular value of the parameters of interest $\theta$.
We then choose the design that optimizes the expected information gain (EIG) in $\theta$ from running the experiment,

$$\text{EIG}(d) = \mathbb{E}_{p(y|d)}\big[H[p(\theta)] - H[p(\theta|y, d)]\big], \qquad (1)$$

where $H[\cdot]$ represents the entropy and $p(\theta|y, d) \propto p(\theta)p(y|\theta, d)$ is the posterior resulting from running the experiment with design $d$ and observing outcome $y$. In other words, we seek the design that, in expectation over possible experimental outcomes, most reduces the entropy of the posterior over our target latent variables. If the predictive model is correct, this forms a design strategy that is (one-step) optimal from an information-theoretic viewpoint [24, 37].

The BOED framework is particularly powerful in sequential contexts, where it allows the results of previous experiments to be used in guiding the designs for future experiments. For example, as we ask a participant a series of questions in a psychology trial, we can use the information gathered from previous responses to ask more pertinent questions in the future, which will, in turn, return more information. This ability to design self-adaptive experiments can substantially increase their efficiency: fewer iterations are required to uncover the same level of information.

In practice, however, the BOED approach is often hampered by the difficulty of obtaining fast and high-quality estimates of the EIG: due to the intractability of the posterior $p(\theta|y, d)$, it constitutes a nested expectation problem, and so conventional Monte Carlo (MC) estimation methods cannot be applied [33]. Moreover, existing methods for tackling nested expectations have, in general, far worse convergence rates than those for conventional expectations [22, 30, 32].

*Part of this work was completed by AF during an internship with Uber AI Labs.*

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
For example, nested MC (NMC) can only achieve, at best, a rate of $O(T^{-1/3})$ in the total computational cost $T$ [33], compared with $O(T^{-1/2})$ for conventional MC. To address this, we propose a variational BOED approach that sidesteps the double intractability of the EIG in a principled manner and yields estimators with convergence rates in line with those for conventional estimation problems.

To this end, we introduce four efficient and widely applicable variational estimators for the EIG. The different methods each present distinct advantages. For example, two allow training with implicit likelihood models, while one allows for asymptotic consistency even when the variational family does not contain the target distribution. We theoretically confirm the advantages of our estimators, showing that they all have a convergence rate of $O(T^{-1/2})$ when the variational family contains the target distribution. We further verify their practical utility using a number of experiment design problems inspired by applications from science and industry, showing that they provide significant empirical gains in EIG estimation over previous methods and that these gains lead, in turn, to improved end-to-end performance. To maximize the space of potential applications and users for our estimators, we provide² a general-purpose implementation of them in the probabilistic programming system Pyro [5], exploiting Pyro's first-class support for neural networks and variational methods.

## 2 Background

The BOED framework is a model-based approach for choosing an experiment design $d$ in a manner that optimizes the information gained about some parameters of interest $\theta$ from the outcome $y$ of the experiment. For instance, we may wish to choose the question $d$ in a psychology trial to maximize the information gained about an underlying psychological property of the participant $\theta$ from their answer $y$ to the question.
In general, we adopt a Bayesian modelling framework with a prior $p(\theta)$ and a predictive model $p(y|\theta, d)$. The information gained about $\theta$ from running experiment $d$ and observing $y$ is the reduction in entropy from the prior to the posterior:

$$\text{IG}(y, d) = H[p(\theta)] - H[p(\theta|y, d)]. \qquad (2)$$

At the point of choosing $d$, however, we are uncertain about the outcome. Thus, in order to define a metric to assess the utility of the design $d$, we take the expectation of $\text{IG}(y, d)$ under the marginal distribution over outcomes $p(y|d) = \mathbb{E}_{p(\theta)}[p(y|\theta, d)]$ as per (1). We can further rearrange this as

$$\text{EIG}(d) = \mathbb{E}_{p(y,\theta|d)}\left[\log \frac{p(\theta|y, d)}{p(\theta)}\right] = \mathbb{E}_{p(y,\theta|d)}\left[\log \frac{p(y, \theta|d)}{p(y|d)\,p(\theta)}\right] = \mathbb{E}_{p(y,\theta|d)}\left[\log \frac{p(y|\theta, d)}{p(y|d)}\right] \qquad (3)$$

with the result that the EIG can also be interpreted as the mutual information between $\theta$ and $y$ given $d$, or the epistemic uncertainty in $y$ averaged over the prior $p(\theta)$. The Bayesian optimal design is defined as $d^* = \arg\max_{d \in \mathcal{D}} \text{EIG}(d)$, where $\mathcal{D}$ is the set of permissible designs.

Computing the EIG is challenging since neither $p(\theta|y, d)$ nor $p(y|d)$ can, in general, be found in closed form. Consequently, the integrand is intractable and conventional MC methods are not applicable. One common way of getting around this is to employ a nested MC (NMC) estimator [30, 43]

$$\hat{\mu}_{\text{NMC}}(d) = \frac{1}{N}\sum_{n=1}^{N} \log \frac{p(y_n|\theta_{n,0}, d)}{\frac{1}{M}\sum_{m=1}^{M} p(y_n|\theta_{n,m}, d)} \quad \text{where} \quad \theta_{n,m} \overset{\text{i.i.d.}}{\sim} p(\theta), \;\; y_n \sim p(y|\theta = \theta_{n,0}, d). \qquad (4)$$

Rainforth et al. [33] showed that this estimator, which has a total computational cost $T = O(NM)$, is consistent in the limit $N, M \to \infty$ with RMSE convergence rate $O(N^{-1/2} + M^{-1})$, and that it is asymptotically optimal to set $M \propto \sqrt{N}$, yielding an overall rate of $O(T^{-1/3})$.

²Implementations of our methods are available at http://docs.pyro.ai/en/stable/contrib.oed.html. To reproduce the results in this paper, see https://github.com/ae-foster/pyro/tree/vboed-reproduce.

Given a base EIG estimator, a variety of different methods can be used for the subsequent optimization over designs, including some specifically developed for BOED [1, 29, 32]. In our experiments, we
will adopt Bayesian optimization [39], due to its sample efficiency, robustness to multi-modality, and ability to deal naturally with noisy objective evaluations. However, we emphasize that our focus is on the base EIG estimator and that our estimators can be used more generally with different optimizers.

The static design setting we have implicitly assumed thus far can be generalized to sequential contexts, in which we design $T$ experiments $d_1, \ldots, d_T$ with outcomes $y_1, \ldots, y_T$. We assume experiment outcomes are conditionally independent given the latent variables and designs, i.e.

$$p(y_{1:T}, \theta|d_{1:T}) = p(\theta) \prod_{t=1}^{T} p(y_t|\theta, d_t). \qquad (5)$$

Having conducted experiments $1, \ldots, t-1$, we can design $d_t$ by incorporating data in the standard Bayesian fashion: at experiment iteration $t$, we replace the prior $p(\theta)$ in (3) with $p(\theta|d_{1:t-1}, y_{1:t-1})$, the posterior conditional on the first $t-1$ designs and outcomes. We can thus conduct an adaptive sequential experiment in which we optimize the choice of the design $d_t$ at each iteration.

## 3 Variational Estimators

Though consistent, the convergence rate of the NMC estimator is prohibitively slow for many practical problems. As such, EIG estimation often becomes the bottleneck for BOED, particularly in sequential experiments where the BOED calculations must be fast enough to operate in real time. In this section we show how ideas from amortized variational inference [10, 17, 34, 40] can be used to sidestep the double intractability of the EIG, yielding estimators with much faster convergence rates and thereby alleviating the EIG bottleneck. A key insight into why such fundamental gains can be made is that the NMC estimator is inefficient because a separate estimate of the integrand in (3) is made for each $y_n$.
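Before turning to the variational alternatives, the NMC baseline (4) can be made concrete. The sketch below applies it to a toy conjugate Gaussian model (an illustrative assumption, not one of the paper's benchmarks) in which the EIG has the closed form $\frac{1}{2}\log(1 + d\,\sigma_0^2/\sigma^2)$, so the estimate can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model (an assumption for illustration only):
#   prior      theta ~ N(0, sigma0^2)
#   likelihood y | theta, d ~ N(theta, sigma^2 / d),  d = number of repeats
sigma0, sigma = 1.0, 1.0

def log_lik(y, theta, d):
    var = sigma**2 / d
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (y - theta) ** 2 / var

def eig_analytic(d):
    # Closed form for this Gaussian model: 0.5 * log(1 + d * sigma0^2 / sigma^2)
    return 0.5 * np.log(1 + d * sigma0**2 / sigma**2)

def eig_nmc(d, N=2000, M=2000):
    # Outer samples: theta_{n,0} ~ p(theta), y_n ~ p(y | theta_{n,0}, d)
    theta0 = rng.normal(0.0, sigma0, size=N)
    y = theta0 + rng.normal(0.0, sigma / np.sqrt(d), size=N)
    # Inner samples theta_{n,m} ~ p(theta) estimate the marginal p(y_n | d)
    theta_inner = rng.normal(0.0, sigma0, size=(N, M))
    log_inner = log_lik(y[:, None], theta_inner, d)
    log_marg = np.logaddexp.reduce(log_inner, axis=1) - np.log(M)
    return np.mean(log_lik(y, theta0, d) - log_marg)

eig_hat, eig_true = eig_nmc(4), eig_analytic(4)
print(eig_hat, eig_true)
```

Note the $O(NM)$ cost: every outer sample pays for a fresh $M$-sample inner estimate of $p(y_n|d)$, which is precisely the inefficiency that sharing information across values of $y$ avoids.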
The variational approaches we introduce instead look to directly learn a functional approximation (for example, an approximation of $y \mapsto p(y|d)$) and then evaluate this approximation at multiple points to estimate the integral, thereby allowing information to be shared across different values of $y$. If $M$ evaluations are made in learning the approximation, the total computational cost is now $T = O(N + M)$, yielding substantially improved convergence rates.

**Variational posterior $\hat{\mu}_{\text{post}}$.** Our first approach, which we refer to as the variational posterior estimator $\hat{\mu}_{\text{post}}$, is based on learning an amortized approximation $q_p(\theta|y, d)$ to the posterior $p(\theta|y, d)$ and then using this to estimate the EIG:

$$\text{EIG}(d) \approx \mathcal{L}_{\text{post}}(d) \triangleq \mathbb{E}_{p(y,\theta|d)}\left[\log \frac{q_p(\theta|y, d)}{p(\theta)}\right], \qquad \hat{\mu}_{\text{post}}(d) \triangleq \frac{1}{N}\sum_{n=1}^{N} \log \frac{q_p(\theta_n|y_n, d)}{p(\theta_n)}, \qquad (6)$$

where $y_n, \theta_n \overset{\text{i.i.d.}}{\sim} p(y, \theta|d)$ and $\hat{\mu}_{\text{post}}(d)$ is a MC estimator of $\mathcal{L}_{\text{post}}(d)$. We draw samples of $p(y, \theta|d)$ by sampling $\theta \sim p(\theta)$ and then $y|\theta \sim p(y|\theta, d)$. We can think of this approach as amortizing the cost of the inner expectation, instead of running inference separately for each $y$. To learn a suitable $q_p(\theta|y, d)$, we show in Appendix A that $\mathcal{L}_{\text{post}}(d)$ forms a variational lower bound $\text{EIG}(d) \geq \mathcal{L}_{\text{post}}(d)$ that is tight if and only if $q_p(\theta|y, d) = p(\theta|y, d)$. Barber and Agakov [3] used this bound to estimate mutual information in the context of transmission over noisy channels, but the connection to experiment design has not previously been made. This result means we can learn $q_p(\theta|y, d)$ by introducing a family of variational distributions $q_p(\theta|y, d, \phi)$ parameterized by $\phi$ and then maximizing the bound with respect to $\phi$:

$$\phi^* = \arg\max_{\phi} \mathbb{E}_{p(y,\theta|d)}\left[\log q_p(\theta|y, d, \phi)\right], \qquad \text{EIG}(d) \approx \mathcal{L}_{\text{post}}(d; \phi^*). \qquad (7)$$

Provided that we can generate samples from the model, this maximization can be performed using stochastic gradient methods [35] and the unbiased gradient estimator

$$\nabla_\phi \mathcal{L}_{\text{post}}(d; \phi) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\phi \log q_p(\theta_i|y_i, d, \phi) \quad \text{where} \quad y_i, \theta_i \overset{\text{i.i.d.}}{\sim} p(y, \theta|d), \qquad (8)$$

and we note that no reparameterization is required as $p(y, \theta|d)$ is independent of $\phi$.
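As a sanity check on (6)-(8), the sketch below evaluates $\hat{\mu}_{\text{post}}$ on a toy conjugate Gaussian model (an illustrative assumption) whose exact posterior is known. Plugging the exact posterior in as $q_p$ makes the bound tight, while an over-dispersed $q_p$ yields a strictly smaller value, as a lower bound must; in practice $q_p$ would of course be learned by stochastic gradient ascent as in (8).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative conjugate model (an assumption, not one of the paper's benchmarks):
#   theta ~ N(0, 1),   y | theta, d ~ N(theta, 1/d)
def mu_post(d, q_scale=1.0, N=100_000):
    theta = rng.normal(size=N)
    y = theta + rng.normal(size=N) / np.sqrt(d)
    # Exact posterior p(theta | y, d) is N(m, s2):
    s2 = 1.0 / (1.0 + d)
    m = d * y * s2
    # q_p(theta | y, d): the exact posterior when q_scale = 1, mis-specified otherwise
    qs2 = s2 * q_scale**2
    log_q = -0.5 * np.log(2 * np.pi * qs2) - 0.5 * (theta - m) ** 2 / qs2
    log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * theta**2
    return np.mean(log_q - log_prior)          # MC estimate of L_post(d)

true_eig = 0.5 * np.log(1 + 4)                 # analytic EIG at d = 4
lpost_tight = mu_post(4)                       # tight: recovers the EIG
lpost_loose = mu_post(4, q_scale=2.0)          # over-dispersed q_p: strictly lower
print(lpost_tight, lpost_loose, true_eig)
```

The gap between `lpost_loose` and `true_eig` is exactly the expected KL divergence from the true posterior to the over-dispersed $q_p$, in line with the bound being tight if and only if $q_p$ matches the posterior.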
After $K$ gradient steps we obtain variational parameters $\phi_K$ that approximate $\phi^*$, which we use to compute a corresponding EIG estimator by constructing a MC estimator for $\mathcal{L}_{\text{post}}(d; \phi)$ as per (6) with $q_p(\theta_n|y_n, d) = q_p(\theta_n|y_n, d, \phi_K)$. Interestingly, the gap $\text{EIG}(d) - \mathcal{L}_{\text{post}}(d)$ turns out to equal the expected forward KL divergence³ $\mathbb{E}_{p(y|d)}\left[\text{KL}\left(p(\theta|y, d) \,\|\, q_p(\theta|y, d, \phi)\right)\right]$, so we can view this approach as learning an amortized proposal by minimizing this expected KL divergence.

**Variational marginal $\hat{\mu}_{\text{marg}}$.** In some scenarios, $\theta$ may be high-dimensional, making it difficult to train a good variational posterior approximation. An alternative approach that can be attractive in such cases is to instead learn an approximation $q_m(y|d)$ to the marginal density $p(y|d)$ and substitute this into the final form of the EIG in (3). As shown in Appendix A, this yields an upper bound

$$\text{EIG}(d) \leq \mathcal{U}_{\text{marg}}(d) \triangleq \mathbb{E}_{p(y,\theta|d)}\left[\log \frac{p(y|\theta, d)}{q_m(y|d)}\right], \qquad \hat{\mu}_{\text{marg}}(d) \triangleq \frac{1}{N}\sum_{n=1}^{N} \log \frac{p(y_n|\theta_n, d)}{q_m(y_n|d)}, \qquad (9)$$

where again $y_n, \theta_n \overset{\text{i.i.d.}}{\sim} p(y, \theta|d)$ and the bound is tight when $q_m(y|d) = p(y|d)$. Analogously to $\hat{\mu}_{\text{post}}$, we can learn $q_m(y|d)$ by introducing a variational family $q_m(y|d, \phi)$ and then performing stochastic gradient descent to minimize $\mathcal{U}_{\text{marg}}(d, \phi)$. As with $\hat{\mu}_{\text{post}}$, this bound was studied in a mutual information context [31], but it has not been utilized for BOED before.

**Variational NMC $\hat{\mu}_{\text{VNMC}}$.** As we will show in Section 4, $\hat{\mu}_{\text{post}}$ and $\hat{\mu}_{\text{marg}}$ can provide substantially faster convergence rates than NMC. However, this comes at the cost of converging towards a biased estimate if the variational family does not contain the target distribution. To address this, we propose another EIG estimator, $\hat{\mu}_{\text{VNMC}}$, which allows one to trade off resources between the fast learning of a biased estimator permitted by variational approaches, and the ability of NMC to eliminate this bias.⁴ We can think of the NMC estimator as approximating $p(y|d)$ using $M$ samples from the prior.
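As an aside before developing $\hat{\mu}_{\text{VNMC}}$, the marginal bound (9) admits the same kind of sanity check on a toy conjugate Gaussian model (an illustrative assumption): with the exact marginal as $q_m$ the upper bound is tight, while a mis-specified $q_m$ inflates it.

```python
import numpy as np

rng = np.random.default_rng(2)

# Same illustrative conjugate model: theta ~ N(0,1), y | theta, d ~ N(theta, 1/d).
def mu_marg(d, q_var=None, N=100_000):
    theta = rng.normal(size=N)
    y = theta + rng.normal(size=N) / np.sqrt(d)
    log_lik = -0.5 * np.log(2 * np.pi / d) - 0.5 * d * (y - theta) ** 2
    # True marginal is p(y|d) = N(0, 1 + 1/d); q_var lets us mis-specify q_m
    v = (1 + 1 / d) if q_var is None else q_var
    log_qm = -0.5 * np.log(2 * np.pi * v) - 0.5 * y**2 / v
    return np.mean(log_lik - log_qm)           # MC estimate of U_marg(d)

true_eig = 0.5 * np.log(1 + 4)                 # analytic EIG at d = 4
umarg_tight = mu_marg(4)                       # tight upper bound
umarg_loose = mu_marg(4, q_var=0.5)            # mis-specified q_m inflates the bound
print(umarg_tight, umarg_loose, true_eig)
```

Here the excess of `umarg_loose` over the EIG is the KL divergence from $p(y|d)$ to the too-narrow $q_m$, which is why minimizing $\mathcal{U}_{\text{marg}}$ drives $q_m$ towards the true marginal.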
At a high level, $\hat{\mu}_{\text{VNMC}}$ is based around learning a proposal $q_v(\theta|y, d)$ and then using samples from this proposal to make an importance sampling estimate of $p(y|d)$, potentially requiring far fewer samples than NMC. Formally, it is based around a bound that can be arbitrarily tightened, namely

$$\text{EIG}(d) \leq \mathcal{U}_{\text{VNMC}}(d, L) \triangleq \mathbb{E}\left[\log p(y|\theta_0, d) - \log \frac{1}{L}\sum_{\ell=1}^{L} \frac{p(y, \theta_\ell|d)}{q_v(\theta_\ell|y, d)}\right] \qquad (10)$$

where the expectation is taken over $y, \theta_{0:L} \sim p(y, \theta_0|d) \prod_{\ell=1}^{L} q_v(\theta_\ell|y, d)$, which corresponds to one sample $y, \theta_0$ from the model and $L$ samples from the approximate posterior conditioned on $y$. To the best of our knowledge, this bound has not previously been studied in the literature. As with $\hat{\mu}_{\text{post}}$ and $\hat{\mu}_{\text{marg}}$, we can minimize this bound to train a variational approximation $q_v(\theta|y, d, \phi)$. Important features of $\mathcal{U}_{\text{VNMC}}(d, L)$ are summarized in the following lemma; see Appendix A for the proof.

**Lemma 1.** For any given model $p(\theta)p(y|\theta, d)$ and valid $q_v(\theta|y, d)$,

1. $\text{EIG}(d) = \lim_{L \to \infty} \mathcal{U}_{\text{VNMC}}(d, L) \leq \mathcal{U}_{\text{VNMC}}(d, L_2) \leq \mathcal{U}_{\text{VNMC}}(d, L_1)$ for $L_2 \geq L_1 \geq 1$,
2. $\mathcal{U}_{\text{VNMC}}(d, L) = \text{EIG}(d)$ for all $L \geq 1$ if $q_v(\theta|y, d) = p(\theta|y, d)$ for all $y, \theta$,
3. $\mathcal{U}_{\text{VNMC}}(d, L) - \text{EIG}(d) = \mathbb{E}_{p(y|d)}\left[\text{KL}\left(\prod_{\ell=1}^{L} q_v(\theta_\ell|y, d) \,\Big\|\, \frac{1}{L}\sum_{\ell=1}^{L} p(\theta_\ell|y, d) \prod_{k \neq \ell} q_v(\theta_k|y, d)\right)\right]$.

Like the previous bounds, the VNMC bound is tight when $q_v(\theta|y, d) = p(\theta|y, d)$. Importantly, the bound is also tight as $L \to \infty$, even for imperfect $q_v$. This means we can obtain asymptotically unbiased EIG estimates even when the true posterior is not contained in the variational family. Specifically, we first train $\phi$ using $K$ steps of stochastic gradient on $\mathcal{U}_{\text{VNMC}}(d, L)$ with some fixed $L$. To form a final EIG estimator, however, we use a MC estimator of $\mathcal{U}_{\text{VNMC}}(d, M)$ where typically $M \gg L$. This final estimator is an NMC estimator that is consistent as $N, M \to \infty$ with $\phi_K$ fixed:

$$\hat{\mu}_{\text{VNMC}}(d) \triangleq \frac{1}{N}\sum_{n=1}^{N}\left[\log p(y_n|\theta_{n,0}, d) - \log \frac{1}{M}\sum_{m=1}^{M} \frac{p(y_n, \theta_{n,m}|d)}{q_v(\theta_{n,m}|y_n, d, \phi_K)}\right] \qquad (11)$$

where $\theta_{n,0} \overset{\text{i.i.d.}}{\sim} p(\theta)$, $y_n \sim p(y|\theta = \theta_{n,0}, d)$ and $\theta_{n,m} \sim q_v(\theta|y = y_n, d, \phi_K)$. In practice, performance is greatly enhanced when the proposal $q_v$ is a good, if inexact, approximation to the posterior.
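Both regimes of Lemma 1 can be observed numerically. In the sketch below (the same illustrative conjugate Gaussian model as before, an assumption for demonstration), using the exact posterior as $q_v$ makes even $M = 1$ inner sample essentially exact, while the prior proposal, which recovers plain NMC, is badly biased at small $M$; a learned, inexact $q_v$ would sit between these extremes.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative conjugate model again: theta ~ N(0,1), y | theta, d ~ N(theta, 1/d).
def log_normal(x, m, v):
    return -0.5 * np.log(2 * np.pi * v) - 0.5 * (x - m) ** 2 / v

def mu_vnmc(d, M, use_posterior_proposal=True, N=20_000):
    theta0 = rng.normal(size=N)
    y = theta0 + rng.normal(size=N) / np.sqrt(d)
    s2, m = 1.0 / (1 + d), d * y / (1 + d)     # exact posterior moments
    if use_posterior_proposal:
        pm, pv = m[:, None], s2                # q_v = exact posterior
    else:
        pm, pv = 0.0, 1.0                      # q_v = prior (recovers plain NMC)
    theta = pm + np.sqrt(pv) * rng.normal(size=(N, M))
    # Importance sampling estimate of log p(y|d) inside the NMC estimator
    log_w = (log_normal(theta, 0.0, 1.0)            # log p(theta)
             + log_normal(y[:, None], theta, 1 / d) # log p(y|theta,d)
             - log_normal(theta, pm, pv))           # log q_v(theta|y,d)
    log_marg = np.logaddexp.reduce(log_w, axis=1) - np.log(M)
    return np.mean(log_normal(y, theta0, 1 / d) - log_marg)

true_eig = 0.5 * np.log(1 + 4)
vnmc_good = mu_vnmc(4, M=1)                               # near-exact even at M = 1
vnmc_prior = mu_vnmc(4, M=1, use_posterior_proposal=False)  # heavily biased upward
print(vnmc_good, vnmc_prior, true_eig)
```

With the exact posterior as proposal, the importance weight $p(\theta)p(y|\theta,d)/q_v(\theta|y,d)$ equals $p(y|d)$ identically, so the inner estimate has zero variance and the bound collapses to the EIG for any $M$.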
This significantly improves upon traditional $\hat{\mu}_{\text{NMC}}$, which sets $q_v(\theta|y, d) = p(\theta)$ in (11).

³See Appendix A for a proof. A comparison with the reverse KL divergence can be found in Appendix G.

⁴In Appendix F, we describe a method using $q_m(y|d)$ as a control variate that can also eliminate this bias and lower the variance of NMC, requiring additional assumptions about the model and variational family.

**Implicit likelihood and $\hat{\mu}_{m+\ell}$.** So far we have assumed that we can evaluate $p(y|\theta, d)$ pointwise. However, many models of interest have implicit likelihoods from which we can draw samples, but which we cannot evaluate directly. For example, models with nuisance latent variables $\psi$ (such as random effects models) are implicit likelihood models because $p(y|\theta, d) = \mathbb{E}_{p(\psi|\theta)}[p(y|\theta, \psi, d)]$ is intractable, but can still be straightforwardly sampled from. In this setting, $\hat{\mu}_{\text{post}}$ is applicable without modification because it only requires samples from $p(y|\theta, d)$ and not evaluations of this density. Although $\hat{\mu}_{\text{marg}}$ is not directly applicable in this setting, it can be modified to accommodate implicit likelihoods. Specifically, we can utilize two approximate densities: $q_m(y|d)$ for the marginal and $q_\ell(y|\theta, d)$ for the likelihood. We then form the approximation

$$\text{EIG}(d) \approx I_{m+\ell}(d) \triangleq \mathbb{E}_{p(y,\theta|d)}\left[\log \frac{q_\ell(y|\theta, d)}{q_m(y|d)}\right], \qquad \hat{\mu}_{m+\ell}(d) \triangleq \frac{1}{N}\sum_{n=1}^{N} \log \frac{q_\ell(y_n|\theta_n, d)}{q_m(y_n|d)}. \qquad (12)$$

Unlike the previous three cases, $I_{m+\ell}(d)$ is not a bound on $\text{EIG}(d)$, meaning it is not immediately clear how to train $q_m(y|d)$ and $q_\ell(y|\theta, d)$ to achieve an accurate EIG estimator. The following lemma shows that we can bound the EIG estimation error of $I_{m+\ell}$. The proof is in Appendix A.

**Lemma 2.** For any given model $p(\theta)p(y|\theta, d)$ and valid $q_m(y|d)$ and $q_\ell(y|\theta, d)$, we have

$$|I_{m+\ell}(d) - \text{EIG}(d)| \leq -\mathbb{E}_{p(y,\theta|d)}\left[\log q_m(y|d) + \log q_\ell(y|\theta, d)\right] + C, \qquad (13)$$

where $C = -H[p(y|d)] - \mathbb{E}_{p(\theta)}\left[H[p(y|\theta, d)]\right]$ does not depend on $q_m$ or $q_\ell$. Further, the RHS of (13) equals 0 if and only if $q_m(y|d) = p(y|d)$ and $q_\ell(y|\theta, d) = p(y|\theta, d)$ for almost all $y, \theta$.
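To illustrate (12) and the tightness condition of Lemma 2, the sketch below uses a toy random effects model (an illustrative assumption) in which the marginalized densities happen to be available in closed form; we substitute them where the trained $q_\ell$ and $q_m$ would go, so $\hat{\mu}_{m+\ell}$ should recover the EIG, here the mutual information between $\theta$ and $y$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative random effects model (an assumption for this sketch):
#   theta ~ N(0, 1)       target parameter
#   psi   ~ N(0, 0.5^2)   nuisance random effect
#   y = theta + psi + N(0, 0.5^2) observation noise
# p(y|theta,d) = E_psi[p(y|theta,psi,d)] is "implicit": easy to sample from, and
# here we plug in the exact marginalized densities where trained q_l, q_m would go.
def log_normal(x, m, v):
    return -0.5 * np.log(2 * np.pi * v) - 0.5 * (x - m) ** 2 / v

def mu_m_plus_l(N=100_000):
    theta = rng.normal(size=N)
    psi = 0.5 * rng.normal(size=N)
    y = theta + psi + 0.5 * rng.normal(size=N)
    log_ql = log_normal(y, theta, 0.5**2 + 0.5**2)    # q_l(y|theta): psi integrated out
    log_qm = log_normal(y, 0.0, 1 + 0.5**2 + 0.5**2)  # q_m(y): theta, psi integrated out
    return np.mean(log_ql - log_qm)

true_eig = 0.5 * np.log(1.5 / 0.5)   # analytic mutual information for this model
est = mu_m_plus_l()
print(est, true_eig)
```

Only samples of $y$ are ever needed, never evaluations of the intractable $p(y|\theta, d)$, which is what makes this estimator usable with implicit likelihoods.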
This lemma implies that we can learn $q_m(y|d)$ and $q_\ell(y|\theta, d)$ by maximizing $\mathbb{E}_{p(y,\theta|d)}[\log q_m(y|d) + \log q_\ell(y|\theta, d)]$ using stochastic gradient ascent, and substituting these learned approximations into (12) for the final EIG estimator. To the best of our knowledge, this approach has not previously been considered in the literature. We note that, in general, $q_m$ and $q_\ell$ are learned separately and there need not be any weight sharing between them. See Appendix A.4 for a discussion of the case when we couple $q_m$ and $q_\ell$ so that $q_m(y|d) = \mathbb{E}_{p(\theta)}[q_\ell(y|\theta, d)]$.

**Using estimators for sequential BOED.** In sequential settings, we also need to consider the implications of replacing $p(\theta)$ in the EIG with $p(\theta|d_{1:t-1}, y_{1:t-1})$. At first sight, it appears that, while $\hat{\mu}_{\text{marg}}$ and $\hat{\mu}_{m+\ell}$ only require samples from $p(\theta|d_{1:t-1}, y_{1:t-1})$, $\hat{\mu}_{\text{post}}$ and $\hat{\mu}_{\text{VNMC}}$ also require its density to be evaluated, a potentially severe limitation. Fortunately, we can, in fact, avoid evaluating this posterior density. We note that, from (5), we have $p(\theta|y_{1:t-1}, d_{1:t-1}) = p(\theta) \prod_{i=1}^{t-1} p(y_i|\theta, d_i) \big/ p(y_{1:t-1}|d_{1:t-1})$. Substituting this into the integrand of (6) gives

$$\mathcal{L}_{\text{post}}(d_t) = \mathbb{E}_{p(\theta|y_{1:t-1},d_{1:t-1})\,p(y_t|\theta,d_t)}\left[\log \frac{q_p(\theta|y_t, d_t)}{p(\theta)\prod_{i=1}^{t-1} p(y_i|\theta, d_i)}\right] + \log p(y_{1:t-1}|d_{1:t-1}) \qquad (14)$$

where $p(\theta)\prod_{i=1}^{t-1} p(y_i|\theta, d_i)$ can be evaluated exactly and the additive constant $\log p(y_{1:t-1}|d_{1:t-1})$ does not depend on the new design $d_t$, $\theta$, or any of the variational parameters, and so can be safely ignored. Making the same substitution in (11) shows that we can also estimate $\mathcal{U}_{\text{VNMC}}(d_t, L)$ up to a constant, which can then be similarly ignored. As such, any inference scheme for sampling $p(\theta|d_{1:t-1}, y_{1:t-1})$, approximate or exact, is compatible with all our approaches.

Table 1: Summary of EIG estimators. Baseline methods are explained in Section 5.
| Estimator | Implicit | Bound | Consistent | Eq. |
|---|---|---|---|---|
| $\hat{\mu}_{\text{post}}$ | yes | Lower | no | (6) |
| $\hat{\mu}_{\text{marg}}$ | no | Upper | no | (9) |
| $\hat{\mu}_{\text{VNMC}}$ | no | Upper | yes | (11) |
| $\hat{\mu}_{m+\ell}$ | yes | neither | no | (12) |
| $\hat{\mu}_{\text{NMC}}$ | no | Upper | yes | (4) |
| $\hat{\mu}_{\text{laplace}}$ | no | neither | no | (75) |
| $\hat{\mu}_{\text{LFIRE}}$ | yes | neither | no | (76) |
| $\hat{\mu}_{\text{DV}}$ | yes | Lower | no | (77) |

**Selecting an estimator.** Having proposed four estimators, we briefly discuss how to choose between them in practice. For reference, a summary of our estimators is given in Table 1, along with several baseline approaches. First, $\hat{\mu}_{\text{marg}}$ and $\hat{\mu}_{m+\ell}$ rely on approximating a distribution over $y$; $\hat{\mu}_{\text{post}}$ and $\hat{\mu}_{\text{VNMC}}$ approximate distributions over $\theta$. We may prefer the former two estimators if $\dim(y) \ll \dim(\theta)$, as this leaves us with a simpler density estimation problem, and vice versa. Second, $\hat{\mu}_{\text{marg}}$ and $\hat{\mu}_{\text{VNMC}}$ require an explicit likelihood whereas $\hat{\mu}_{\text{post}}$ and $\hat{\mu}_{m+\ell}$ do not. If an explicit likelihood is available, it typically makes sense to use it: one would never use $\hat{\mu}_{m+\ell}$ over $\hat{\mu}_{\text{marg}}$, for example. Finally, if the variational families do not contain the target densities, $\hat{\mu}_{\text{VNMC}}$ is the only method guaranteed to converge to the true $\text{EIG}(d)$ in the limit as the computational budget increases, so we might prefer $\hat{\mu}_{\text{VNMC}}$ when computation time and cost are not constrained.

## 4 Convergence rates

We now investigate the convergence of our estimators. We start by breaking the overall error down into three terms: (I) variance in MC estimation of the bound; (II) the gap between the bound and the tightest bound possible given the variational family; and (III) the gap between the tightest possible bound and $\text{EIG}(d)$. With variational EIG approximation $B(d) \in \{\mathcal{L}_{\text{post}}(d), \mathcal{U}_{\text{marg}}(d), \mathcal{U}_{\text{VNMC}}(d, L), I_{m+\ell}(d)\}$, optimal variational parameters $\phi^*$, learned variational parameters $\phi_K$ after $K$ stochastic gradient iterations, and MC estimator $\hat{\mu}(d, \phi_K)$, we have, by the triangle inequality,

$$\left\|\hat{\mu}(d, \phi_K) - \text{EIG}(d)\right\|_2 \leq \underbrace{\left\|\hat{\mu}(d, \phi_K) - B(d, \phi_K)\right\|_2}_{\text{I}} + \underbrace{\left\|B(d, \phi_K) - B(d, \phi^*)\right\|_2}_{\text{II}} + \underbrace{\left|B(d, \phi^*) - \text{EIG}(d)\right|}_{\text{III}}$$

where we have used the notation $\|X\|_2 \triangleq \sqrt{\mathbb{E}[X^2]}$ to denote the $L^2$ norm of a random variable. By the weak law of large numbers, term I scales as $N^{-1/2}$ and can thus be arbitrarily reduced by taking more MC samples.
Provided that our stochastic gradient scheme converges, term II can be reduced by increasing the number of stochastic gradient steps $K$. Term III, however, is a constant that can only be reduced by expanding the variational family (or increasing $L$ for $\hat{\mu}_{\text{VNMC}}$). Each approximation $B(d)$ thus converges to a biased estimate of $\text{EIG}(d)$, namely $B(d, \phi^*)$. As established by the following theorem, if we set $N \propto K$, the rate of convergence to this biased estimate is $O(T^{-1/2})$, where $T$ represents the total computational cost, with $T = O(N + K)$.

**Theorem 1.** Let $\mathcal{X}$ be a measurable space and $\Phi$ be a convex subset of a finite-dimensional inner product space. Let $X_1, X_2, \ldots$ be i.i.d. random variables taking values in $\mathcal{X}$ and $f : \mathcal{X} \times \Phi \to \mathbb{R}$ be a measurable function. Let

$$\mu(\phi) \triangleq \mathbb{E}[f(X_1, \phi)], \qquad \hat{\mu}_N(\phi) \triangleq \frac{1}{N}\sum_{n=1}^{N} f(X_n, \phi),$$

and suppose that $\sup_{\phi \in \Phi} \|f(X_1, \phi)\|_2 < \infty$. Then $\sup_{\phi \in \Phi} \|\hat{\mu}_N(\phi) - \mu(\phi)\|_2 = O(N^{-1/2})$. Suppose further that Assumption 1 in Appendix B holds and that $\phi^*$ is the unique minimizer of $\mu$. After $K$ iterations of the Polyak-Ruppert averaged stochastic gradient descent algorithm of [28] with gradient estimator $\nabla_\phi f(X_t, \phi)$, we have $\|\mu(\phi_K) - \mu(\phi^*)\|_2 = O(K^{-1/2})$ and, combining with the first result,

$$\|\hat{\mu}_N(\phi_K) - \mu(\phi^*)\|_2 = O(N^{-1/2} + K^{-1/2}) = O(T^{-1/2}) \quad \text{if } N \propto K.$$

The proof relies on standard results from MC and stochastic optimization theory; see Appendix B. We note that the assumptions required for the latter, though standard in the literature, are strong. In practice, $\phi_K$ can converge to a local optimum $\tilde{\phi}$, rather than the global optimum $\phi^*$, introducing an additional asymptotic bias $B(d, \tilde{\phi}) - B(d, \phi^*)$ into term III.

Theorem 1 can be applied directly to $\hat{\mu}_{\text{marg}}$, $\hat{\mu}_{\text{post}}$, and $\hat{\mu}_{\text{VNMC}}$ (with fixed $M = L$), showing that they converge respectively to $\mathcal{U}_{\text{marg}}(d, \phi^*)$, $\mathcal{L}_{\text{post}}(d, \phi^*)$, and $\mathcal{U}_{\text{VNMC}}(d, L, \phi^*)$ at a rate $O(T^{-1/2})$ if $N \propto K$ and the assumptions are satisfied. For $\hat{\mu}_{m+\ell}$, we combine Theorem 1 and Lemma 2 to obtain the same $O(T^{-1/2})$ convergence rate; see the supplementary material for further details.
The key property of $\hat{\mu}_{\text{VNMC}}$ is that we need not set $M = L$ and can remove the asymptotic bias by increasing $M$ with $N$. We begin by training $\phi$ with a fixed value of $L$, decreasing the error term $\|\mathcal{U}_{\text{VNMC}}(d, L, \phi_K) - \mathcal{U}_{\text{VNMC}}(d, L, \phi^*)\|_2$ at the fast rate $O(K^{-1/2})$ until $|\mathcal{U}_{\text{VNMC}}(d, L, \phi^*) - \text{EIG}(d)|$ becomes the dominant error term. At this point, we start to increase $N, M$. Using the NMC convergence results discussed in Sec. 2, if we set $M \propto \sqrt{N}$, then $\hat{\mu}_{\text{VNMC}}$ converges to $\text{EIG}(d)$ at a rate $O((NM)^{-1/3})$. Note that the total cost of the $\hat{\mu}_{\text{VNMC}}$ estimator is $T = O(KL + NM)$, where typically $M \gg L$. The first stage, costing $KL$, is fast variational training of an amortized importance sampling proposal for $p(y|d) = \mathbb{E}_{p(\theta)}[p(y|\theta, d)]$. The second stage, costing $NM$, is slower refinement to remove the asymptotic bias using the learned proposal in an NMC estimator.

Table 2: Bias squared and variance from 5 runs, averaged over designs, of EIG estimators applied to four benchmarks. We use "-" to denote that a method does not apply and "*" when it is superseded by other methods. Bold indicates the estimator with the lowest empirical mean squared error.

| Estimator | A/B test Bias² | Var | Preference Bias² | Var | Mixed effects Bias² | Var | Extrapolation Bias² | Var |
|---|---|---|---|---|---|---|---|---|
| $\hat{\mu}_{\text{post}}$ | $1.33 \times 10^{-2}$ | $7.15 \times 10^{-3}$ | $4.26 \times 10^{-2}$ | $8.53 \times 10^{-3}$ | $2.34 \times 10^{-3}$ | $2.92 \times 10^{-3}$ | $1.24 \times 10^{-4}$ | $5.16 \times 10^{-5}$ |
| $\hat{\mu}_{\text{marg}}$ | $7.45 \times 10^{-2}$ | $6.41 \times 10^{-3}$ | $\mathbf{1.10 \times 10^{-3}}$ | $\mathbf{1.99 \times 10^{-3}}$ | - | - | - | - |
| $\hat{\mu}_{\text{VNMC}}$ | $3.44 \times 10^{-3}$ | $3.38 \times 10^{-3}$ | $4.17 \times 10^{-3}$ | $9.04 \times 10^{-3}$ | - | - | - | - |
| $\hat{\mu}_{m+\ell}$ | * | * | * | * | $\mathbf{3.06 \times 10^{-3}}$ | $\mathbf{5.94 \times 10^{-5}}$ | $\mathbf{6.90 \times 10^{-6}}$ | $\mathbf{1.84 \times 10^{-5}}$ |
| $\hat{\mu}_{\text{NMC}}$ | $4.70 \times 10^{0}$ | $3.47 \times 10^{-1}$ | $7.60 \times 10^{-2}$ | $8.36 \times 10^{-2}$ | - | - | - | - |
| $\hat{\mu}_{\text{laplace}}$ | $\mathbf{1.92 \times 10^{-4}}$ | $\mathbf{1.47 \times 10^{-3}}$ | $8.42 \times 10^{-2}$ | $9.70 \times 10^{-2}$ | - | - | - | - |
| $\hat{\mu}_{\text{LFIRE}}$ | $2.29 \times 10^{0}$ | $6.20 \times 10^{-1}$ | $1.30 \times 10^{-1}$ | $1.41 \times 10^{-2}$ | $1.41 \times 10^{-1}$ | $6.67 \times 10^{-2}$ | - | - |
| $\hat{\mu}_{\text{DV}}$ | $4.34 \times 10^{0}$ | $8.85 \times 10^{-1}$ | $9.23 \times 10^{-2}$ | $8.07 \times 10^{-3}$ | $9.10 \times 10^{-3}$ | $5.56 \times 10^{-4}$ | $7.84 \times 10^{-6}$ | $4.11 \times 10^{-5}$ |

One can think of the standard NMC approach as a special case of $\hat{\mu}_{\text{VNMC}}$ in which we naively choose $p(\theta)$ as the proposal. That is, standard NMC skips the first stage and hence does not benefit from the improved convergence rate of learning an amortized proposal.
It typically requires a much higher total cost to achieve the same accuracy as VNMC.

## 5 Related work

We briefly discuss alternative approaches to EIG estimation for BOED that will form our baselines for empirical comparisons. The nested Monte Carlo (NMC) baseline was introduced in Sec. 2. Another established approach is to use a Laplace approximation to the posterior [22, 25]; this approach is fast but is limited to continuous variables and can exhibit large bias. Kleinegesse and Gutmann [18] recently suggested an implicit likelihood approach based on the Likelihood-Free Inference by Ratio Estimation (LFIRE) method of Thomas et al. [41]. We also consider a method based on the Donsker-Varadhan (DV) representation of the KL divergence [11], as used by Belghazi et al. [4] for mutual information estimation. Though not previously considered in BOED, we include it as a baseline for illustrative purposes. For a full discussion of the DV bound and a number of other variational bounds used in deep learning, we refer to the recent work of Poole et al. [31]. For further discussion of related work, see Appendix C.

## 6 Experiments

### 6.1 EIG estimation accuracy

We begin by benchmarking our EIG estimators against the aforementioned baselines. We consider four experiment design scenarios inspired by applications of Bayesian data analysis in science and industry. First, A/B testing is used across marketing and design [6, 19] to study population traits. Here, the design is the choice of the A and B group sizes and the Bayesian model is a Gaussian linear model. Second, revealed preference [36] is used in economics to understand consumer behaviour. We consider an experiment design setting in which we aim to learn the underlying utility function of an economic agent by presenting them with a proposal (such as offering them a price for a commodity) and observing their revealed preference.
Third, fixed effects and random effects (nuisance variables) are combined in mixed effects models [14, 20]. We consider an example inspired by item-response theory [13] in psychology. We seek information only about the fixed effects, making this an implicit likelihood problem. Finally, we consider an experiment where labelled data from one region of design space must be used to predict labels in a target region by extrapolation [27]. In summary, we have two models with explicit likelihoods (A/B testing, preference) and two that are implicit (mixed effects, extrapolation). Full details of each model are presented in Appendix D.

For each scenario, we estimated the EIG across a grid of designs with a fixed computational budget for each estimator, and calculated the true EIG analytically or with brute-force computation as appropriate; see Table 2 for the results. Whilst the Laplace method, unsurprisingly, performed best for the Gaussian linear model, where its approximation becomes exact, we see that our methods are otherwise more accurate. All our methods outperformed NMC.

Figure 1: Convergence of RMSE for $\hat{\mu}_{\text{post}}$ and $\hat{\mu}_{\text{marg}}$. (a) Convergence in the number of MC samples $N$ with a fixed number $K$ of gradient updates of the variational parameters. (b) Convergence in time when increasing $K$ with $N$ fixed. (c) Convergence in time when setting $N = K$ and increasing both (dashed lines represent theoretical rates). (d) Final RMSE with $N + K = 5000$ fixed, for different $K$. Each graph shows the mean, with shading representing ±1 standard error, from 100 trials.

### 6.2 Convergence rates

We now investigate the empirical convergence characteristics of our estimators. Throughout, we consider a single design point from the A/B test example. We start by examining the convergence of $\hat{\mu}_{\text{post}}$ and $\hat{\mu}_{\text{marg}}$ as we allocate the computational budget in different ways.
We first consider the convergence in $N$ after a fixed number $K$ of updates to the variational parameters. As shown in Figure 1a, the RMSE initially decreases as we increase $N$, before plateauing due to the bias in the estimator. We also see that $\hat{\mu}_{\text{post}}$ substantially outperforms $\hat{\mu}_{\text{marg}}$. We next consider the convergence as a function of wall-clock time when $N$ is held fixed and we increase $K$. We see in Figure 1b that, as expected, the errors decrease with time, and that when a small value of $N = 5$ is taken, we again see a plateauing effect, with the variance of the final MC estimator now becoming the limiting factor. In Figure 1c we take $N = K$ and increase both, obtaining the predicted convergence rate $O(T^{-1/2})$ (shown by the dashed lines). We conjecture that the better performance of $\hat{\mu}_{\text{post}}$ is likely due to $\theta$ being lower dimensional ($\dim = 2$) than $y$ ($\dim = 10$). In Figure 1d, we instead fix $T = N + K$ to investigate the optimal trade-off between optimization and MC error: it appears that $K/T$ between 0.5 and 0.9 gives the lowest RMSE.

Figure 2: Convergence of $\hat{\mu}_{\text{VNMC}}$ taking $M = N$. "Steps" refers to pre-training of the variational posterior (i.e. $K$), with 0 steps corresponding to $\hat{\mu}_{\text{NMC}}$. Means and confidence intervals as per Fig. 1.

Finally, we show how $\hat{\mu}_{\text{VNMC}}$ can improve over NMC by using an improved variational proposal for estimating $p(y|d)$. In Figure 2, we plot the EIG estimates obtained by first running $K$ steps of stochastic gradient with $L = 1$ to learn $q_v(\theta|y, d)$, before increasing $M$ and $N$. We see that spending some of our time budget training $q_v(\theta|y, d)$ leads to noticeable improvements in the estimation, but also that it is important to increase $N$ and $M$. Rather than plateauing like $\hat{\mu}_{\text{post}}$ and $\hat{\mu}_{\text{marg}}$, $\hat{\mu}_{\text{VNMC}}$ continues to improve after the initial training period, albeit at a slower $O(T^{-1/3})$ rate.

### 6.3 End-to-end sequential experiments

We now demonstrate the utility of our methods for designing sequential experiments.
First, we demonstrate that our variational estimators are sufficiently robust and fast to be used for adaptive experiments with a class of models that are of practical importance in many scientific disciplines. To this end, we run an adaptive psychology experiment with human participants recruited from Amazon Mechanical Turk to study how humans respond to features of stylized faces. To account for fixed effects (those common across the population) as well as individual variations that we treat as nuisance variables, we use the mixed effects regression model introduced in Sec. 6.1. See Appendix D for full details of the experiment.

To estimate the EIG for different designs, we use $\hat{\mu}_{m+\ell}$, since it yields the best performance on our mixed effects model benchmark (see Table 2). Our EIG estimator is integrated into a system that presents participants with a stimulus, receives their response, learns an updated model, and designs the next stimulus, all online. Despite the relative simplicity of the design problem (with 36 possible designs), using BOED with $\hat{\mu}_{m+\ell}$ leads to a more certain (i.e. lower entropy) posterior than random design; see Figure 3.

Figure 3: Evolution of the posterior entropy of the fixed effects in the Mechanical Turk experiment in Sec. 6.3. We depict the mean and ±1 standard error from 10 experimental trials.

Figure 4: Evolution of the posterior in the sequential CES experiment. (a) Total entropy of a mean-field variational approximation of the posterior. (b)(c)(d) The RMSE of the posterior approximations of $\rho$, $\alpha$ and $u$ as compared to the true values used to simulate agent responses. Note the scale of the vertical axis is logarithmic. All plots show the mean and ±1 standard error from 10 independent runs.

Second, we consider a more challenging scenario in which a random design strategy gleans very little. We compare random design against two BOED strategies: $\hat{\mu}_{\text{marg}}$ and $\hat{\mu}_{\text{NMC}}$.
Building on the revealed preference example in Sec. 6.1, we consider an experiment to infer an agent's utility function, which we model using the Constant Elasticity of Substitution (CES) model [2] with latent variables ρ, α, u. We seek designs for which the agent's response will be informative about θ = (ρ, α, u). See Appendix D for full details. We estimate the EIG using µ̂_marg, because the dimension of y is smaller than that of θ, and select designs d ∈ [0, 100]^6 using Bayesian optimization. To investigate parameter recovery, we simulate agent responses from the model with fixed values of ρ, α, u. Figure 4 shows that using BOED with our marginal estimator reduces the posterior entropy and concentrates on the true parameter values more quickly than both baselines. Random design makes no inroads into the learning problem, while BOED based on NMC particularly struggles at the outset, when p(θ | d_{1:t-1}, y_{1:t-1}), the prior at iteration t, has high variance. Our method selects informative designs throughout.

7 Discussion

We have developed efficient EIG estimators that are applicable to a wide range of experimental design problems. By tackling the double intractability of the EIG in a principled manner, they provide substantially improved convergence rates relative to previous approaches, and our experiments show that these theoretical advantages translate into significant practical gains. Our estimators are well-suited to modern deep probabilistic programming languages, and we have provided an implementation in Pyro. We note that the interplay between variational and MC methods in EIG estimation is not directly analogous to that in standard inference settings, because the NMC EIG estimator is itself inherently biased. Our µ̂_VNMC estimator allows one to play off the advantages of the two approaches, namely the fast learning of variational methods and the asymptotic consistency of NMC.

Acknowledgements

We gratefully acknowledge research funding from Uber AI Labs.
MJ would like to thank Paul Szerlip for help generating the sprites used in the Mechanical Turk experiment. AF would like to thank Patrick Rebeschini, Dominic Richards and Emile Mathieu for their help and support. AF gratefully acknowledges funding from EPSRC grant no. EP/N509711/1. YWT's and TR's research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071.

References

[1] Billy Amzal, Frédéric Y Bois, Eric Parent, and Christian P Robert. Bayesian-optimal design via interacting particle systems. Journal of the American Statistical Association, 101(474):773–785, 2006.
[2] Kenneth J Arrow, Hollis B Chenery, Bagicha S Minhas, and Robert M Solow. Capital-labor substitution and economic efficiency. The Review of Economics and Statistics, pages 225–250, 1961.
[3] David Barber and Felix Agakov. The IM algorithm: a variational approach to information maximization. Advances in Neural Information Processing Systems, 16:201–208, 2003.
[4] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Devon Hjelm, and Aaron Courville. Mutual information neural estimation. In International Conference on Machine Learning, pages 530–539, 2018.
[5] Eli Bingham, Jonathan P Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D Goodman. Pyro: Deep universal probabilistic programming. The Journal of Machine Learning Research, 20(1):973–978, 2019.
[6] George EP Box, J Stuart Hunter, and William G Hunter. Statistics for experimenters. In Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ, 2005.
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In 4th International Conference on Learning Representations, ICLR, 2016.
[8] Kathryn Chaloner and Isabella Verdinelli. Bayesian experimental design: A review. Statistical Science, pages 273–304, 1995.
[9] Alex R Cook, Gavin J Gibson, and Christopher A Gilligan. Optimal observation times in experimental epidemic processes. Biometrics, 64(3):860–868, 2008.
[10] Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The Helmholtz machine. Neural Computation, 7(5):889–904, 1995.
[11] Monroe D Donsker and SR Srinivasa Varadhan. Asymptotic evaluation of certain Markov process expectations for large time. Communications on Pure and Applied Mathematics, 28(1):1–47, 1975.
[12] Sylvain Ehrenfeld. Some experimental design problems in attribute life testing. Journal of the American Statistical Association, 57(299):668–679, 1962.
[13] Susan E Embretson and Steven P Reise. Item response theory. Psychology Press, 2013.
[14] Andrew Gelman, Hal S Stern, John B Carlin, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian data analysis. Chapman and Hall/CRC, 2013.
[15] Daniel Golovin, Andreas Krause, and Debajyoti Ray. Near-optimal Bayesian active learning with noisy observations. In Advances in Neural Information Processing Systems, pages 766–774, 2010.
[16] José Miguel Hernández-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918–926, 2014.
[17] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[18] Steven Kleinegesse and Michael U Gutmann. Efficient Bayesian experimental design for implicit models. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 476–485, 2019.
[19] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M Henne. Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery, 18(1):140–181, 2009.
[20] John Kruschke. Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. Academic Press, 2014.
[21] Tuan Anh Le, Maximilian Igl, Tom Rainforth, Tom Jin, and Frank Wood. Auto-Encoding Sequential Monte Carlo. International Conference on Learning Representations (ICLR), 2018.
[22] Jeremy Lewi, Robert Butera, and Liam Paninski. Sequential optimal design of neurophysiology experiments. Neural Computation, 21(3):619–687, 2009.
[23] Dennis V Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, pages 986–1005, 1956.
[24] Dennis V Lindley. Bayesian statistics, a review, volume 2. SIAM, 1972.
[25] Quan Long, Marco Scavino, Raúl Tempone, and Suojin Wang. Fast estimation of expected information gains for Bayesian experimental designs based on Laplace approximations. Computer Methods in Applied Mechanics and Engineering, 259:24–39, 2013.
[26] Chao Ma, Sebastian Tschiatschek, Konstantina Palla, Jose Miguel Hernandez-Lobato, Sebastian Nowozin, and Cheng Zhang. EDDI: Efficient dynamic discovery of high-value information with partial VAE. arXiv preprint arXiv:1809.11142, 2018.
[27] David JC MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590–604, 1992.
[28] Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
[29] Peter Müller. Simulation based optimal design. Handbook of Statistics, 25:509–518, 2005.
[30] Jay I Myung, Daniel R Cavagnaro, and Mark A Pitt. A tutorial on adaptive design optimization. Journal of Mathematical Psychology, 57(3-4):53–67, 2013.
[31] Ben Poole, Sherjil Ozair, Aäron van den Oord, Alexander A Alemi, and George Tucker. On variational lower bounds of mutual information. NeurIPS Workshop on Bayesian Deep Learning, 2018.
[32] Tom Rainforth. Automating Inference, Learning, and Design using Probabilistic Programming. PhD thesis, University of Oxford, 2017.
[33] Tom Rainforth, Robert Cornish, Hongseok Yang, Andrew Warrington, and Frank Wood. On nesting Monte Carlo estimators. In International Conference on Machine Learning, pages 4264–4273, 2018.
[34] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 1278–1286, 2014.
[35] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[36] Paul A Samuelson. Consumption theory in terms of revealed preference. Economica, 15(60):243–253, 1948.
[37] Paola Sebastiani and Henry P Wynn. Maximum entropy sampling and optimal Bayesian experimental design. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1), 2000.
[38] Ben Shababo, Brooks Paige, Ari Pakman, and Liam Paninski. Bayesian inference and online experimental design for mapping neural microcircuits. In Advances in Neural Information Processing Systems, pages 1304–1312, 2013.
[39] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.
[40] Andreas Stuhlmüller, Jacob Taylor, and Noah Goodman. Learning stochastic inverses. In Advances in Neural Information Processing Systems, pages 3048–3056, 2013.
[41] Owen Thomas, Ritabrata Dutta, Jukka Corander, Samuel Kaski, and Michael U Gutmann. Likelihood-free inference by ratio estimation. arXiv preprint arXiv:1611.10242, 2016.
[42] Joep Vanlier, Christian A Tiemann, Peter AJ Hilbers, and Natal AW van Riel. A Bayesian approach to targeted experiment design. Bioinformatics, 28(8):1136–1142, 2012.
[43] Benjamin T Vincent and Tom Rainforth. The DARC toolbox: automated, flexible, and efficient delayed and risky choice experiments using Bayesian adaptive design. 2017.