# Amortized Monte Carlo Integration

Adam Goliński*¹² Frank Wood³ Tom Rainforth*¹

Abstract: Current approaches to amortizing Bayesian inference focus solely on approximating the posterior distribution. Typically, this approximation is, in turn, used to calculate expectations for one or more target functions, a computational pipeline which is inefficient when the target function(s) are known upfront. In this paper, we address this inefficiency by introducing AMCI, a method for amortizing Monte Carlo integration directly. AMCI operates similarly to amortized inference but produces three distinct amortized proposals, each tailored to a different component of the overall expectation calculation. At runtime, samples are produced separately from each amortized proposal, before being combined into an overall estimate of the expectation. We show that while existing approaches are fundamentally limited in the level of accuracy they can achieve, AMCI can theoretically produce arbitrarily small errors for any integrable target function using only a single sample from each proposal at runtime. We further show that it is able to empirically outperform the theoretically optimal self-normalized importance sampler on a number of example problems. Furthermore, AMCI allows not only for amortizing over datasets but also amortizing over target functions.

1. Introduction

At its core, Bayesian modeling is rooted in the calculation of expectations: the eventual aim of modeling is typically to make a decision or to construct predictions for unseen data, both of which take the form of an expectation under the posterior (Robert, 2007). This aim can thus be summarized in the form of one or more expectations $\mathbb{E}_{p(x|y)}[f(x)]$, where $f(x)$ is a target function and $p(x|y)$ is the posterior distribution on $x$ for some data $y$, which we typically only know up to a normalizing constant $p(y)$. More generally, expectations with respect to distributions with unknown normalization constants are ubiquitous throughout the sciences (Robert & Casella, 2013).

Sometimes $f(x)$ is not known up front. Here it is typically convenient to first approximate $p(x|y)$, e.g. in the form of Monte Carlo (MC) samples, and then later use this approximation to calculate estimates, rather than addressing the target expectations directly. However, it is often the case in practice that a particular target function, or class of target functions, is known a priori. For example, in decision-based settings $f(x)$ takes the form of a loss function, while any posterior predictive distribution constitutes a set of expectations with respect to the posterior, parameterized by the new input. Though often overlooked, it is well established that in such target-aware settings the aforementioned pipeline of first approximating $p(x|y)$ and then using this as a basis for calculating $\mathbb{E}_{p(x|y)}[f(x)]$ is suboptimal as it ignores relevant information in $f(x)$ (Hesterberg, 1988; Wolpert, 1991; Oh & Berger, 1992; Evans & Swartz, 1995; Meng & Wong, 1996; Chen & Shao, 1997; Gelman & Meng, 1998; Lacoste-Julien et al., 2011; Owen, 2013; Rainforth et al., 2018b). As we will later show, the potential gains in such scenarios can be arbitrarily large.

*Equal contribution. ¹Department of Statistics, University of Oxford, United Kingdom. ²Department of Engineering Science, University of Oxford, United Kingdom. ³Department of Computer Science, University of British Columbia, Vancouver, Canada. Correspondence to: Adam Goliński.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
In this paper, we extend these ideas to amortized inference settings (Stuhlmüller et al., 2013; Kingma & Welling, 2014; Ritchie et al., 2016; Paige & Wood, 2016; Le et al., 2017; 2018a; Webb et al., 2018), wherein one looks to amortize the cost of inference across different possible datasets by learning an artifact that assists the inference process at runtime for a given dataset. Existing approaches do not operate in a target-aware fashion, such that even if the inference network learns proposals that perfectly match the true posterior for every possible dataset, the resulting estimator is still sub-optimal.

To address this, we introduce AMCI, a framework for performing Amortized Monte Carlo Integration. Though still based on learning amortized proposal distributions, AMCI varies from standard amortized inference approaches in three respects. First, it operates in a target-aware fashion, incorporating information about $f(x)$ into the amortization artifacts. Second, rather than using self-normalization, AMCI employs three distinct proposals for separately estimating $\mathbb{E}_{p(x)}[p(y|x)\max(f(x),0)]$, $\mathbb{E}_{p(x)}[p(y|x)\min(f(x),0)]$, and $\mathbb{E}_{p(x)}[p(y|x)]$, before combining these into an overall estimate. This breakdown allows for arbitrary performance improvements compared to self-normalized importance sampling (SNIS). Finally, to account for cases in which multiple possible target functions may be of interest, AMCI also allows for amortization over parametrized functions $f(x;\theta)$.

Remarkably, AMCI is able to achieve an arbitrarily low error at runtime using only a single sample from each proposal, given sufficiently powerful amortization artifacts, contrary to the fundamental limitations on the accuracy of conventional amortization approaches. This ability is based around its novel breakdown of the target expectation into separate components, the subsequent utility of which extends beyond the amortized setting we consider here.

2. Background

2.1. Importance Sampling

Importance sampling (IS), in its most basic form, is a method for approximating an expectation $\mathbb{E}_{\pi(x)}[f(x)]$ when it is either not possible to sample from $\pi(x)$ directly, or when the simple MC estimate $\frac{1}{N}\sum_{n=1}^N f(x_n)$, where $x_n \sim \pi(x)$, has problematically high variance (Hesterberg, 1988; Wolpert, 1991). Given a proposal $q(x)$ from which we can sample and whose density we can evaluate, it forms the following estimate:

$$\mu := \mathbb{E}_{\pi(x)}[f(x)] = \int f(x)\,\frac{\pi(x)}{q(x)}\,q(x)\,dx \quad (1)$$
$$\approx \hat\mu := \frac{1}{N}\sum_{n=1}^N f(x_n)\,w_n \quad (2)$$

where $x_n \sim q(x)$ and $w_n := \pi(x_n)/q(x_n)$ is known as the importance weight of sample $x_n$.

In practice, one often does not have access to the normalized form of $\pi(x)$. For example, in Bayesian inference settings, we typically have $\pi(x) = p(x|y) \propto p(x,y)$. Here we can use our samples to approximate both the normalization constant and the unnormalized integral. Thus if $\pi(x) \propto \gamma(x)$, we have

$$\mathbb{E}_{\pi(x)}[f(x)] = \frac{\int f(x)\,\frac{\gamma(x)}{q(x)}\,q(x)\,dx}{\int \frac{\gamma(x)}{q(x)}\,q(x)\,dx} \approx \frac{\sum_{n=1}^N f(x_n)\,w_n}{\sum_{n=1}^N w_n} \quad (3)$$

where $x_n \sim q(x)$ and $w_n := \gamma(x_n)/q(x_n)$. This approach is known as self-normalized importance sampling (SNIS). Conveniently, we can also construct the SNIS estimate lazily by calculating the empirical measure, i.e. storing the weighted samples,

$$\hat\pi(x) = \sum_{n=1}^N \bar w_n\,\delta_{x_n}(x), \quad \text{where } \bar w_n := \frac{w_n}{\sum_{m=1}^N w_m}, \quad (4)$$

and then using this to construct the estimate in (3) when $f(x)$ becomes available. As such, we can also think of SNIS as a method for Bayesian inference as, informally speaking, the empirical measure produced can be thought of as an approximation of the posterior.

For a general unknown target, the optimal proposal, i.e. the proposal that results in the estimator having the lowest possible variance, is the target distribution $q(x) = \pi(x)$ (see e.g. (Rainforth, 2017, §5.3.2.2)). However, this no longer holds if we have some information about $f(x)$. In this target-aware scenario, the optimal behavior turns out to depend on whether we are self-normalizing or not. For the non-self-normalized case, the optimal proposal can be shown to be $q^*(x) \propto \pi(x)\,|f(x)|$ (Owen, 2013).
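To make the SNIS estimator (3) concrete, the following is a minimal numpy sketch. The Gaussian target and proposal in the usage are our own toy example, not one taken from the paper.

```python
import numpy as np

def snis(f, gamma, q_sample, q_density, n):
    """Self-normalized importance sampling, eq. (3): estimates E_pi[f(x)]
    where pi(x) is known only up to normalization, pi(x) proportional to gamma(x)."""
    x = q_sample(n)                       # x_n ~ q(x)
    w = gamma(x) / q_density(x)           # unnormalized weights w_n
    return np.sum(f(x) * w) / np.sum(w)   # ratio of weighted sums

# Toy check (our own example): target pi = N(1, 1) given unnormalized,
# proposal q = N(0, 2); the estimate of E[x] should approach 1.
rng = np.random.default_rng(0)
gamma = lambda x: np.exp(-0.5 * (x - 1.0) ** 2)  # normalizer deliberately omitted
q_density = lambda x: np.exp(-0.125 * x ** 2) / (2.0 * np.sqrt(2.0 * np.pi))
mu_hat = snis(lambda x: x, gamma, lambda n: rng.normal(0.0, 2.0, n),
              q_density, 200_000)
```

Note that only the ratio of weighted sums is ever formed, so the unknown normalizer of $\pi(x)$ cancels out.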
Interestingly, in the case where $f(x) \ge 0 \;\forall x$, this leads to an exact estimator, i.e. $\hat\mu = \mu$ (with $\hat\mu$ as per (2)). To see this, notice that the normalizing constant for $q^*(x)$ is $\int \pi(x) f(x)\,dx = \mu$ and hence $q^*(x) = \pi(x) f(x)/\mu$. Therefore, even when $N = 1$, any possible value of the resulting sample $x_1$ yields a $\hat\mu$ satisfying $\hat\mu = f(x_1)\pi(x_1)/q^*(x_1) = \mu$.

In the self-normalized case, the optimal proposal instead transpires to be $q^*(x) \propto \pi(x)\,|f(x) - \mu|$ (Hesterberg, 1988). In this case, one can no longer achieve a zero-variance estimator for finite $N$ and nonconstant $f(x)$. Instead, the achievable error is lower bounded by (Owen, 2013)

$$\mathbb{E}\big[(\hat\mu - \mu)^2\big] \ge \frac{1}{N}\left(\mathbb{E}_{\pi(x)}\big[|f(x) - \mu|\big]\right)^2, \quad (5)$$

creating a fundamental limit on the performance of SNIS, even when information about $f(x)$ is incorporated. Given that these optimal proposals make use of the true expectation $\mu$, we will clearly not have access to them in practice. However, they provide a guide for the desirable properties of a proposal and can be used as targets for adaptive IS methods (see (Bugallo et al., 2017) for a recent review).

2.2. Inference Amortization

Inference amortization involves learning an amortization artifact that takes in datasets and produces proposals tailored to the corresponding inference problems. This amortization artifact typically takes the form of a parametrized proposal, $q(x; \varphi(y; \eta))$, which takes in data $y$ and produces proposal parameters using an inference network $\varphi(y; \eta)$, which itself has parameters $\eta$. When clear from the context, we will use the shorthand $q(x; y, \eta)$ for this proposal. Though the exact process varies with context, the inference network is usually trained either by drawing latent-data sample pairs from the joint $p(x, y)$ (Paige & Wood, 2016; Le et al., 2017; 2018b), or by drawing mini-batches from a large dataset using stochastic variational inference approaches (Hoffman et al., 2013; Kingma & Welling, 2014; Rezende et al., 2014; Ritchie et al., 2016).
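To make the shape of such an amortization artifact concrete, here is a deliberately minimal sketch of our own (not the architecture used in the paper): the "inference network" $\varphi(y;\eta)$ is just an affine map from the data to Gaussian proposal parameters. An affine map happens to suffice for the conjugate toy model $p(x) = \mathcal{N}(0, 1)$, $p(y|x) = \mathcal{N}(y; x, 1)$, whose posterior is $\mathcal{N}(y/2, 1/2)$; real inference networks are neural networks.

```python
import numpy as np

def phi(y, eta):
    """Inference network phi(y; eta): maps data y to the parameters
    (mean, std) of a Gaussian proposal q(x; phi(y; eta))."""
    a, b, log_s = eta
    return a * y + b, np.exp(log_s)

def q_sample(y, eta, n, rng):
    """Draw from the amortized proposal for dataset y."""
    mean, std = phi(y, eta)
    return rng.normal(mean, std, n)

# For p(x) = N(0, 1), p(y|x) = N(y; x, 1) the posterior is N(y/2, 1/2),
# so the optimal "network" parameters are a = 0.5, b = 0, s = sqrt(0.5).
eta_star = (0.5, 0.0, 0.5 * np.log(0.5))
rng = np.random.default_rng(0)
xs = q_sample(2.0, eta_star, 100_000, rng)  # proposal samples for y = 2
```

The same artifact is reused for every new $y$ at runtime, which is the entire point of amortization.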
Once trained, it provides an efficient means of approximately sampling from the posterior of a particular dataset, e.g. using SNIS. Out of several variants, we focus on the method introduced by Paige & Wood (2016), as this is the one AMCI builds upon. In their approach, $\eta$ is trained to minimize the expectation of $D_{KL}\big(p(x|y)\,\|\,q(x; y, \eta)\big)$ across possible datasets $y$, giving the objective

$$J(\eta) = \mathbb{E}_{p(y)}\Big[D_{KL}\big(p(x|y)\,\|\,q(x; y, \eta)\big)\Big] = -\mathbb{E}_{p(x,y)}\big[\log q(x; y, \eta)\big] + \text{const wrt } \eta. \quad (6)$$

We note that the distribution $p(y)$ over which we are taking the expectation is actually chosen somewhat arbitrarily: it simply dictates how much the network prioritizes a good amortization for one dataset over another; different choices are equally valid and imply different loss functions. This objective requires us to be able to sample from the joint distribution $p(x, y)$, and it can be optimized using gradient methods since the gradient can be easily evaluated:

$$\nabla_\eta J(\eta) = -\mathbb{E}_{p(x,y)}\big[\nabla_\eta \log q(x; y, \eta)\big]. \quad (7)$$

3. Amortized Monte Carlo Integration

Amortized Monte Carlo integration (AMCI) is a framework for amortizing the cost of calculating expectations $\mu(y, \theta) := \mathbb{E}_{\pi(x;y)}[f(x; \theta)]$. Here $y$ represents changeable aspects of the reference distribution $\pi(x; y)$ (e.g. the dataset) and $\theta$ represents changeable parameters of the target function $f(x; \theta)$. The reference distribution is typically known only up to a normalization constant, i.e. $\pi(x; y) = \gamma(x; y)/Z$ where $\gamma(x; y)$ can be evaluated pointwise, but $Z$ is unknown. AMCI can still be useful in settings where $Z$ is known, but here we can simply use its known value rather than constructing a separate estimator. Amortization can be performed across $y$ and/or $\theta$. When amortizing over $y$, the function does not need to be explicitly parameterized; we just need to be able to evaluate it pointwise. Similarly, when amortizing over $\theta$, the reference distribution can be fixed.
In fact, AMCI can be used for a parameterized set of conventional integration problems $\int_{x\in\mathcal{X}} f(x;\theta)\,dx$ by exploiting the fact that

$$\int_{x\in\mathcal{X}} f(x;\theta)\,dx = \mathbb{E}_{\pi(x)}\big[f(x;\theta)/\pi(x)\big] \quad (8)$$

for any $\pi(x)$ satisfying $\pi(x) \neq 0$ for all $x \in \mathcal{X}$ for which $f(x;\theta) \neq 0$. For consistency of notation with the amortized inference literature, we will presume a Bayesian setting in the rest of this section, i.e. $\pi(x; y) = p(x|y)$ and $\gamma(x; y) = p(x, y)$.

3.1. Estimator

Existing amortized inference methods implicitly evaluate expectations using SNIS (or some other form of self-normalized estimator (Paige & Wood, 2016; Le et al., 2018a)), targeting the posterior as the optimal proposal $q^*(x; y) \propto p(x|y)$. Not only is this proposal suboptimal when information about the target function is available, there is a lower bound on the accuracy the SNIS approach itself can achieve, as shown in (5). AMCI overcomes these limitations by breaking down the overall expectation into separate components and constructing separate estimates for each.

We can first break down the target expectation into the ratio of the unnormalized expectation and the normalization constant:

$$\mu(y,\theta) := \mathbb{E}_{p(x|y)}\big[f(x;\theta)\big] = \frac{\mathbb{E}_{p(x)}\big[f(x;\theta)\,p(y|x)\big]}{\mathbb{E}_{p(x)}\big[p(y|x)\big]} = \frac{\mathbb{E}_{q_1(x;y,\theta)}\left[\frac{f(x;\theta)\,p(x,y)}{q_1(x;y,\theta)}\right]}{\mathbb{E}_{q_2(x;y)}\left[\frac{p(x,y)}{q_2(x;y)}\right]} =: \frac{E_1}{E_2} \quad (9)$$

where $q_1(x; y, \theta)$ and $q_2(x; y)$ are two separate proposals, used respectively for each of the two expectations $E_1$ and $E_2$. We note that the proposal $q_1(x; y, \theta)$ may depend not only on the observed dataset $y$, but also on the parameters of the target function $\theta$. We can now generate separate MC estimates for $E_1$ and $E_2$, and take their ratio to estimate the overall expectation, $\mu(y,\theta) \approx \hat\mu(y,\theta) := \hat E_1/\hat E_2$, where

$$\hat E_1 = \frac{1}{N}\sum_{n=1}^N \frac{f(x'_n;\theta)\,p(x'_n, y)}{q_1(x'_n; y, \theta)}, \quad x'_n \sim q_1(x; y, \theta), \qquad \hat E_2 = \frac{1}{M}\sum_{m=1}^M \frac{p(x_m, y)}{q_2(x_m; y)}, \quad x_m \sim q_2(x; y). \quad (10)$$

The key idea behind AMCI is that we can now separately train each of these proposals to be a good estimator for its respective expectation, rather than rely on a single proposal to estimate both, as is implicitly the case for SNIS.
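A minimal sketch of the two-proposal estimator (10) for $f \ge 0$. The model in the usage is our own toy choice, not the paper's: the conjugate pair $p(x) = \mathcal{N}(0, 1)$, $p(y|x) = \mathcal{N}(y; x, 1)$ with $f(x) = e^x$, picked so the ground truth $\mu(y) = e^{y/2 + 1/4}$ is available in closed form.

```python
import numpy as np

def amci_two_proposal(f, joint, q1, q2, n, m):
    """AMCI estimate, eq. (10), for f >= 0: independent proposals for the
    unnormalized integral E1 and the marginal likelihood E2 = p(y).
    Each proposal is a (sampler, density) pair."""
    (q1_sample, q1_density), (q2_sample, q2_density) = q1, q2
    x1 = q1_sample(n)
    e1_hat = np.mean(f(x1) * joint(x1) / q1_density(x1))  # estimates E1
    x2 = q2_sample(m)
    e2_hat = np.mean(joint(x2) / q2_density(x2))          # estimates E2
    return e1_hat / e2_hat

# Toy model (ours): p(x) = N(0, 1), p(y|x) = N(y; x, 1), f(x) = e^x,
# so the posterior is N(y/2, 1/2) and mu(y) = exp(y/2 + 1/4).
rng = np.random.default_rng(0)
y = 1.0
norm = lambda x, mu, s: np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
joint = lambda x: norm(x, 0.0, 1.0) * norm(y, x, 1.0)   # p(x, y) for fixed y
q1 = (lambda n: rng.normal((y + 1) / 2, 1.0, n), lambda x: norm(x, (y + 1) / 2, 1.0))
q2 = (lambda n: rng.normal(y / 2, 1.0, n), lambda x: norm(x, y / 2, 1.0))
est = amci_two_proposal(np.exp, joint, q1, q2, 200_000, 200_000)
```

Here $q_1$ is centered near the mode of $f(x)\,p(x,y)$ and $q_2$ near the posterior, mirroring how the two proposals specialize.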
Consider, for example, the case where $f(x;\theta) \ge 0$. If $q_1(x; y, \theta) \propto f(x;\theta)\,p(x|y)$ and $q_2(x; y) \propto p(x|y)$, then both $\hat E_1$ and $\hat E_2$ will form exact estimators (as per Section 2.1), even if $N = M = 1$. Consequently, we achieve an exact estimator for $\mu(y,\theta)$, allowing for arbitrarily large improvements over any SNIS estimator, because SNIS forces $q_1(x; y, \theta)$ and $q_2(x; y)$ to be the same distribution.

More generally, the optimal proposals for $E_1$ and $E_2$ are $q_1(x; y, \theta) \propto |f(x;\theta)|\,p(x|y)$ and $q_2(x; y) \propto p(x|y)$ respectively, with the latter always resulting in an exact estimator for $E_2$. Thus the more $|f(x;\theta)|\,p(x|y)$ varies from $p(x|y)$, the worse the conventional approach of only amortizing the posterior will perform, and the harder it becomes to construct a reasonable SNIS estimator even when information about $f(x;\theta)$ is incorporated. Separately learning $q_1(x; y, \theta)$ and $q_2(x; y)$ means that each becomes a better individual proposal and the overall estimator improves.

It turns out that we do not actually require the previous assumption of $f(x;\theta) \ge 0 \;\forall x, \theta$ to achieve a zero-variance estimator. Specifically, if we let¹

$$f^+(x;\theta) = \max(f(x;\theta), 0) \quad \text{and} \quad (11)$$
$$f^-(x;\theta) = -\min(f(x;\theta), 0) \quad (12)$$

denote truncations of the target function into its positive and negative components (as per the concept of positivisation (Owen, 2013, §9.13)), then we can break down the overall expectation as follows:

$$\mu(y,\theta) = \frac{\mathbb{E}_{q_1^+(x;y,\theta)}\left[\frac{f^+(x;\theta)\,p(x,y)}{q_1^+(x;y,\theta)}\right] - \mathbb{E}_{q_1^-(x;y,\theta)}\left[\frac{f^-(x;\theta)\,p(x,y)}{q_1^-(x;y,\theta)}\right]}{\mathbb{E}_{q_2(x;y)}\left[\frac{p(x,y)}{q_2(x;y)}\right]} =: \frac{E_1^+ - E_1^-}{E_2} \quad (13)$$

where we now have three expectations and three proposals. Analogously to (10), we can construct estimates for each expectation separately and then combine them: $\mu(y,\theta) \approx \hat\mu(y,\theta) := (\hat E_1^+ - \hat E_1^-)/\hat E_2$, where

$$\hat E_1^+ = \frac{1}{N}\sum_{n=1}^N \frac{f^+(x^+_n;\theta)\,p(x^+_n, y)}{q_1^+(x^+_n; y, \theta)}, \quad x^+_n \sim q_1^+(x; y, \theta), \qquad \hat E_1^- = \frac{1}{K}\sum_{k=1}^K \frac{f^-(x^-_k;\theta)\,p(x^-_k, y)}{q_1^-(x^-_k; y, \theta)}, \quad x^-_k \sim q_1^-(x; y, \theta),$$
$$\hat E_2 = \frac{1}{M}\sum_{m=1}^M \frac{p(x_m, y)}{q_2(x_m; y)}, \quad x_m \sim q_2(x; y), \quad (14)$$

which forms the AMCI estimator.
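The full estimator (14) can be sketched as follows. The usage illustrates the zero-variance behavior discussed above on a hand-picked example of our own (not from the paper): $\gamma(x) = e^{-x^2/2}$ as the unnormalized posterior for a fixed $y$ and the sign-changing target $f(x) = x$, for which the optimal $q_1^\pm$ are (reflected) Rayleigh distributions and the true value is $\mu = 0$.

```python
import numpy as np

def amci_estimate(f, joint, q_pos, q_neg, q2, n, k, m):
    """Full AMCI estimator, eq. (14): mu_hat = (E1+_hat - E1-_hat) / E2_hat,
    with f split into f+ = max(f, 0) and f- = -min(f, 0).
    Each proposal is a (sampler, density) pair."""
    def mc_term(g, proposal, n_samples):
        sample, density = proposal
        x = sample(n_samples)
        return np.mean(g(x) * joint(x) / density(x))
    e_pos = mc_term(lambda x: np.maximum(f(x), 0.0), q_pos, n)   # E1+ estimate
    e_neg = mc_term(lambda x: np.maximum(-f(x), 0.0), q_neg, k)  # E1- estimate
    e2 = mc_term(lambda x: np.ones_like(x), q2, m)               # E2 estimate
    return (e_pos - e_neg) / e2

# Toy example (ours): gamma(x) = exp(-x^2/2) (unnormalized N(0, 1)), f(x) = x,
# so mu = 0. The optimal proposals are q1+/- ∝ f+/- gamma (Rayleigh) and
# q2 ∝ gamma, for which each term is exact even with a single sample.
rng = np.random.default_rng(0)
gamma = lambda x: np.exp(-0.5 * x ** 2)
q_pos = (lambda n: rng.rayleigh(1.0, n), lambda x: x * np.exp(-0.5 * x ** 2))
q_neg = (lambda n: -rng.rayleigh(1.0, n), lambda x: -x * np.exp(-0.5 * x ** 2))
q2 = (lambda n: rng.normal(0.0, 1.0, n),
      lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi))
est = amci_estimate(lambda x: x, gamma, q_pos, q_neg, q2, 1, 1, 1)
```

With these optimal proposals every importance ratio is constant, so one sample per proposal already returns the exact answer.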
The theoretical capability of this estimator is summarized in the following result, the proof for which is given in Appendix B.

Theorem 1. If the following hold for a given $\theta$ and $y$,

$$\mathbb{E}_{p(x)}\big[f^+(x;\theta)\,p(y|x)\big] < \infty, \quad (15)$$
$$\mathbb{E}_{p(x)}\big[f^-(x;\theta)\,p(y|x)\big] < \infty, \quad (16)$$
$$\mathbb{E}_{p(x)}\big[p(y|x)\big] < \infty, \quad (17)$$

and we use the corresponding set of optimal proposals $q_1^+(x; y, \theta) \propto f^+(x;\theta)\,p(x, y)$, $q_1^-(x; y, \theta) \propto f^-(x;\theta)\,p(x, y)$, and $q_2(x; y) \propto p(x, y)$, then the AMCI estimator defined in (14) satisfies

$$\mathbb{E}\big[\hat\mu(y,\theta)\big] = \mu(y,\theta), \quad \operatorname{Var}\big[\hat\mu(y,\theta)\big] = 0 \quad (18)$$

for any $N \ge 1$, $K \ge 1$, and $M \ge 1$, such that it forms an exact estimator for that $\theta, y$ pair.

¹Practically, it may sometimes be beneficial to truncate about another point, $c$, by instead using $f^+(x;\theta) = \max(f(x;\theta) - c, 0)$ and $f^-(x;\theta) = -\min(f(x;\theta) - c, 0)$, then adding $c$ onto our final estimate.

Though our primary motivation for developing the AMCI estimator is its attractive properties in an amortization setting, we note that it may still be of use in static expectation calculation settings. Namely, the fact that it can achieve an arbitrarily low mean squared error for a given number of samples means it forms an attractive alternative to SNIS more generally, particularly when we are well-placed to hand-craft highly effective proposals, and in adaptive importance sampling settings.

We note that individual elements of this estimator have previously appeared in the literature. For example, the general concept of using multiple proposals has been established in the context of multiple importance sampling (Veach & Guibas, 1995). The use of two separate proposals for the unnormalized target and the normalizing constant (i.e. (10)), on the other hand, was recently independently suggested by Lamberti et al. (2018) in a non-amortized setting. However, we believe that the complete form of the AMCI estimator in (14) has not previously been suggested, nor its theoretical benefits or amortization considered.

3.2. Amortization

To evaluate (14), we need to learn three amortized proposals: $q_1^+(x; y, \theta)$, $q_1^-(x; y, \theta)$, and $q_2(x; y)$. Learning $q_2(x; y)$ is equivalent to the standard inference amortization problem, and so we just use the objective given by (6), as described in Section 2.2. The approaches for learning $q_1^+(x; y, \theta)$ and $q_1^-(x; y, \theta)$ are equivalent, other than the function that is used in the estimators. Therefore, for simplicity, we introduce our amortization procedure in the case where $f(x;\theta) \ge 0 \;\forall x, \theta$, such that we need only learn a single proposal, $q_1(x; y, \theta)$, for the numerator as per (10). This trivially extends to the full AMCI setup by separately repeating the same training procedure for $q_1^+(x; y, \theta)$ and $q_1^-(x; y, \theta)$.

3.2.1. Fixed Function f(x)

We first consider the scenario where $f(x)$ is fixed (i.e. we are not amortizing over function parameters $\theta$), and hence in this section we drop the dependence of $q_1$ on $\theta$. To learn the parameters $\eta$ for the first amortized proposal $q_1(x; y, \eta)$, we need to adjust the target in (6) to incorporate the effect of the target function. Let $E_1(y) := \mathbb{E}_{p(x)}\big[f(x)\,p(y|x)\big]$ and $g(x|y) := f(x)\,p(x,y)/E_1(y)$, i.e. the normalized optimal proposal for $q_1$. Naively adjusting (6) leads to a doubly intractable objective:

$$\tilde J_1(\eta) = \mathbb{E}_{p(y)}\Big[D_{KL}\big(g(x|y)\,\|\,q_1(x; y, \eta)\big)\Big] = -\mathbb{E}_{p(y)}\left[\int \frac{f(x)\,p(x,y)}{E_1(y)}\,\log q_1(x; y, \eta)\,dx\right] + \text{const wrt } \eta. \quad (19)$$

Here the double intractability comes from the fact that we do not know $E_1(y)$ and, at least at the beginning of the training process, we cannot estimate it efficiently either. To address this, we use our previous observation that the expectation over $p(y)$ in the above objective is chosen somewhat arbitrarily. Namely, it dictates the relative priority of different datasets $y$ during training and not the optimal proposal for each individual datapoint; disregarding the finite capacity of the network, the global optimum is still always $D_{KL}\big(g(x|y)\,\|\,q_1(x; y, \eta)\big) = 0 \;\forall y$.
We thus maintain a well-defined objective if we choose a different reference distribution over datasets. In particular, if we take the expectation with respect to $h(y) \propto E_1(y) = p(y)\,\mathbb{E}_{p(x|y)}[f(x)]$, we get

$$J_1(\eta) = \mathbb{E}_{h(y)}\Big[D_{KL}\big(g(x|y)\,\|\,q_1(x; y, \eta)\big)\Big] = -c^{-1}\,\mathbb{E}_{p(x,y)}\big[f(x)\,\log q_1(x; y, \eta)\big] + \text{const wrt } \eta \quad (20)$$

where $c = \mathbb{E}_{p(x,y)}[f(x)] > 0$ is a positive constant that does not affect the optimization: it is the normalization constant for the distribution $h(y)$ and can thus be ignored. Each term in this expectation can now be evaluated directly, meaning we can again run stochastic gradient descent algorithms to optimize it. Note that this does not require evaluation of the density $p(x, y)$, only the ability to draw samples from it.

Interestingly, this choice of $h(y)$ can be interpreted as giving larger importance to the values of $y$ which yield larger $E_1(y)$. Informally, we can think about this choice as attempting to minimize the L1 errors of our estimates, that is $\mathbb{E}_{p(y)}\big[|E_1(y) - \hat E_1(y)|\big]$, presuming that the error in our estimation scales with the magnitude of the true value $E_1(y)$. More generally, if we choose $h(y) \propto E_1(y)\,\lambda(y)$ for some positive evaluable function $\lambda : \mathcal{Y} \to \mathbb{R}^+$, we get a tractable objective of the form

$$J_1(\eta; \lambda) = -\mathbb{E}_{p(x,y)}\big[\lambda(y)\,f(x)\,\log q_1(x; y, \eta)\big]$$

up to a constant scaling factor and offset. We can thus use this trick to adjust the relative preference given to different datasets, while ensuring the objective is tractable.
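A sketch of how the tractable objective (20) is estimated in practice from joint samples. This is our own illustration: for simplicity the amortized proposal is collapsed to a unit-variance Gaussian whose mean $m$ stands in for the inference network output, so the loss can be compared at two candidate values of $m$.

```python
import numpy as np

def j1_loss(f, log_q1, x_joint, y_joint):
    """Monte Carlo estimate of the tractable objective, eq. (20):
    J1(eta) = -E_{p(x,y)}[ f(x) log q1(x; y, eta) ], up to a positive
    scale and a constant offset. x_joint, y_joint ~ p(x, y)."""
    return -np.mean(f(x_joint) * log_q1(x_joint, y_joint))

# Toy check (ours): p(x) = N(0, 1), p(y|x) = N(y; x, 1), f(x) = e^x.
# Ignoring y, the f-weighted target f(x)p(x) ∝ N(1, 1), so a Gaussian
# "proposal" with mean m = 1 should score better than one with m = 3.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100_000)       # x ~ p(x)
y = x + rng.normal(0.0, 1.0, 100_000)   # y ~ p(y|x)
log_q = lambda m: (lambda xs, ys: -0.5 * (xs - m) ** 2 - 0.5 * np.log(2 * np.pi))
loss_good = j1_loss(np.exp, log_q(1.0), x, y)
loss_bad = j1_loss(np.exp, log_q(3.0), x, y)
```

Each term only needs $f(x)$ and $\log q_1$ evaluated at joint samples, which is exactly why the objective sidesteps the intractable $E_1(y)$.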
Since, in practice, perfect performance is unattainable over the entire space of $\theta$, the choice of $p(\theta)$ will have a significant effect on the performance of the system. Incorporating $p(\theta)$ is straightforward: we take the expectation of the fixed target function training objective over $\theta$. In this setting, our inference network $\varphi$ needs to take $\theta$ as input when determining the parameters of $q_1$, and hence we let $q_1(x; y, \theta, \eta) := q_1(x; \varphi(y, \theta; \eta))$. If $E_1(y, \theta) := \mathbb{E}_{p(x)}\big[f(x;\theta)\,p(y|x)\big]$, $g(x|y; \theta) := f(x;\theta)\,p(x,y)/E_1(y,\theta)$, and $h(y, \theta) \propto p(\theta)\,E_1(y, \theta)$, we get an objective analogous to (20):

$$J_1(\eta) = \mathbb{E}_{h(y,\theta)}\Big[D_{KL}\big(g(x|y;\theta)\,\|\,q_1(x; y, \theta, \eta)\big)\Big] = -c^{-1}\,\mathbb{E}_{p(x,y)p(\theta)}\big[f(x;\theta)\,\log q_1(x; y, \theta, \eta)\big] + \text{const wrt } \eta \quad (21)$$

where $c = \mathbb{E}_{p(x,y)p(\theta)}\big[f(x;\theta)\big] > 0$ is again a positive constant that does not affect the optimization.

3.3. Efficient Training

If $f(x;\theta)$ and $p(x)p(\theta)$ are mismatched, i.e. $f(x;\theta)$ is large in regions where $p(x)p(\theta)$ is low, training by naïvely sampling from $p(x)p(\theta)$ can be inefficient. Instead, it is preferable to try to sample from $g(\theta, x) \propto p(x)\,p(\theta)\,f(x;\theta)$. Though this is itself an intractable distribution, it represents a standard, rather than an amortized, inference problem, and so it is much more manageable than the overall training. Namely, as the samples do not depend on the proposal we are learning or the datasets, we can carry out this inference process as a pre-training step that is substantially less costly than the problem of training the inference networks itself.

One approach is to construct an MCMC sampler targeting $g(\theta, x)$ to generate the samples, which can be done upfront before training. Another is to use an importance sampler:

$$J_1(\eta) = -c^{-1}\,\mathbb{E}_{q^*(\theta,x)\,p(y|x)}\left[\frac{p(\theta)\,p(x)\,f(x;\theta)}{q^*(\theta, x)}\,\log q_1(x; y, \theta, \eta)\right] + \text{const wrt } \eta \quad (22)$$

where $q^*(\theta, x)$ is a proposal as close to $g(\theta, x)$ as possible. In the case of non-parameterized functions $f(x)$, there is no need to take an expectation over $p(\theta)$, and we instead desire to sample from $g(x) \propto p(x)\,f(x)$.

4. Experiments

Even though AMCI is theoretically able to achieve exact estimators with a finite number of samples, this will rarely be the case for practical problems, for which learning perfect proposals is not typically realistic, particularly in amortized contexts (Cremer et al., 2018).

Figure 1: Relative mean squared errors (as per (25)) for [left] the one-dimensional and [right] the five-dimensional tail integral example. The solid lines for each estimator indicate the median of $\delta(y, \theta)$, estimated using a common set of 100 samples of $y, \theta \sim p(y)p(\theta)$, with the corresponding $\delta(y, \theta)$ then each separately estimated using 100 samples of the respective $\hat\delta(y, \theta)$. The shading instead shows the estimates from replacing $\delta(y, \theta)$ with the 25% and 75% quantiles of $\hat\delta(y, \theta)$ for a given $y$ and $\theta$. The median of $\delta(y, \theta)$ is at times outside of this shaded region as $\delta(y, \theta)$ is often dominated by a few large outliers. The dashed line shows the median of $\delta(y, \theta)$, with the $\delta(y, \theta)$ corresponding to the ReMSE-optimal SNIS estimator, namely $\big(\mathbb{E}_{p(x|y)}[|f(x;\theta) - \mu(y,\theta)|]\big)^2/N$ as per (5), which is itself estimated (with only nominal error) using $10^6$ samples. We note that the error for SNIS with the $q_2$ proposal is to a large extent flat because there is not a single sample in the estimator for which $f(x;\theta) > 0$, such that it returns $\hat\mu(y,\theta) = 0$ and hence gives $\delta(y,\theta) = 1$. In figure (b) the SNIS $q_m$ line reaches a ReMSE value of $10^{18}$ at $N = 2$, and the y-axis limits have been readjusted to allow clear comparison at higher $N$. This effect is caused by the bias of SNIS: these extremely high errors for SNIS $q_m$ arise when all $N$ samples happen to be drawn from distribution $q_1$; for further explanation and the full picture see Figure 5 in Appendix A. [Panels: (a) one-dimensional tail integral, (b) five-dimensional tail integral; x-axis: number of samples $N$.]
It is therefore necessary to test its empirical performance to assert that gains are possible with inexact proposals. To this end, we investigate AMCI's performance on two illustrative examples. Our primary baseline is the SNIS approach implicitly used by most existing inference amortization methods, namely the SNIS estimator with proposal $q_2(x; y)$. Though this effectively represents the previous state-of-the-art in amortized expectation calculation, it turns out to be a very weak baseline. We therefore introduce another simple approach one could hypothetically consider using: training separate proposals as per AMCI, but then using these to form a mixture distribution proposal for an SNIS estimator. For example, in the scenario where $f(x;\theta) \ge 0 \;\forall x, \theta$ (such that we only need to learn two proposals), we can use

$$q_m(x; y, \theta) = \tfrac{1}{2}\,q_1(x; y, \theta) + \tfrac{1}{2}\,q_2(x; y) \quad (23)$$

as an SNIS proposal that takes into account the needs of both $E_1$ and $E_2$. We refer to this method as the mixture SNIS estimator and emphasize that it represents a novel amortization approach in its own right.

We also compare AMCI to the theoretically optimal SNIS estimator, i.e. the error bound given by (5). As we will show, AMCI is often able to empirically outperform this bound, thereby giving better performance than any approach based on SNIS, whether that approach is amortized or not. This is an important result and, in particular, it highlights that the potential significance of the AMCI estimator extends beyond the amortized setting we consider here.

We further considered using SNIS with proposal $q_1(x; y, \theta)$. However, this transpires to perform extremely poorly throughout (far worse than $q_2(x; y)$), and so we omit its results from the main paper, giving them in Appendix A. In all experiments, we use the same number of samples from each proposal to form the estimate (i.e. $N = M = K$). An implementation of AMCI and our experiments is available at http://github.com/talesa/amci.
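The mixture SNIS baseline (23) can be sketched as follows; the conjugate-Gaussian model in the usage is our own toy example (not one of the paper's experiments), with ground truth $\mu(y) = e^{y/2 + 1/4}$.

```python
import numpy as np

def mixture_snis(f, joint, q1, q2, n, rng):
    """Mixture-SNIS baseline, eq. (23): SNIS with q_m = 0.5 q1 + 0.5 q2,
    built from separately trained proposals (two-proposal case, f >= 0).
    Each proposal is a (sampler, density) pair."""
    (s1, d1), (s2, d2) = q1, q2
    pick = rng.random(n) < 0.5
    # draw from both components, keep one per sample (simple, if wasteful)
    x = np.where(pick, s1(n), s2(n))
    w = joint(x) / (0.5 * d1(x) + 0.5 * d2(x))  # weights under the mixture q_m
    return np.sum(f(x) * w) / np.sum(w)

# Toy model (ours): p(x) = N(0, 1), p(y|x) = N(y; x, 1), f(x) = e^x,
# so mu(y) = exp(y/2 + 1/4).
rng = np.random.default_rng(0)
y = 1.0
norm = lambda x, mu, s: np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
joint = lambda x: norm(x, 0.0, 1.0) * norm(y, x, 1.0)
q1 = (lambda n: rng.normal(1.0, 1.0, n), lambda x: norm(x, 1.0, 1.0))
q2 = (lambda n: rng.normal(0.5, 1.0, n), lambda x: norm(x, 0.5, 1.0))
mu_hat = mixture_snis(np.exp, joint, q1, q2, 200_000, rng)
```

The key point is that the mixture density, not the component the sample came from, appears in the denominator of the weights.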
4.1. Tail Integral Calculation

We start with the conceptually simple problem of calculating tail integrals for Gaussian distributions, namely

$$p(x) = \mathcal{N}(x; 0, \Sigma_1), \quad p(y|x) = \mathcal{N}(y; x, \Sigma_2), \quad f(x;\theta) = \prod_{i=1}^D \mathbb{1}_{x_i > \theta_i}, \quad p(\theta) = \text{Uniform}\big(\theta; [0, u_D]^D\big) \quad (24)$$

where $D$ is the dimensionality, we set $\Sigma_2 = I$, and $\Sigma_1$ is a fixed covariance matrix (for details see Appendix C). This problem was chosen because it permits easy calculation of the ground truth expectations by exploiting analytic simplifications, while remaining numerically challenging for values of $\theta$ far away from the mean when we do not use these simplifications. We performed one- and five-dimensional variants of the experiment. We use normalizing flows (Rezende & Mohamed, 2015) to construct our proposals, providing a flexible and powerful means of representing the target distributions; details are given in Appendix C. Training was done by using importance sampling to generate the values of $\theta$ and $x$ as per (22), with $q^*(\theta, x) = p(\theta)\,\text{HalfNormal}(x; \theta, \operatorname{diag}(\Sigma_2))$.

To evaluate AMCI and our baselines we use the relative mean squared error (ReMSE) $\delta(y, \theta) = \mathbb{E}\big[\hat\delta(y, \theta)\big]$, where

$$\hat\delta(y, \theta) = \frac{\big(\mu(y, \theta) - \hat\mu(y, \theta)\big)^2}{\mu(y, \theta)^2} \quad (25)$$

and $\hat\mu(y, \theta)$ is our estimate of $\mu(y, \theta)$. We then consider summary statistics across different $\{y, \theta\}$, such as the median when $y, \theta \sim p(y)p(\theta)$.² In calculating this, $\delta(y, \theta)$ was separately estimated for each value of $y$ and $\theta$ using 100 samples of $\hat\delta(y, \theta)$ (i.e. 100 realizations of the estimator).

As shown in Figure 1, AMCI outperformed SNIS in both the one- and five-dimensional cases. For the one-dimensional example, AMCI significantly outperformed all of SNIS $q_2$, SNIS $q_m$, and the theoretically optimal SNIS estimator. SNIS $q_2$, the approach implicitly taken by existing inference amortization methods, typically failed to place even a single sample in the tail of the distribution, even for large $N$.
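The evaluation metric (25) amounts to averaging squared errors over repeated realizations of the estimator and normalizing by $\mu^2$; a minimal sketch:

```python
import numpy as np

def rel_mse(mu_hats, mu):
    """Relative mean squared error, eq. (25): delta(y, theta) estimated by
    averaging (mu_hat - mu)^2 / mu^2 over realizations of the estimator."""
    mu_hats = np.asarray(mu_hats, dtype=float)
    return np.mean((mu_hats - mu) ** 2) / mu ** 2

# e.g. two realizations 1.0 and 3.0 of an estimator for mu = 2.0
delta = rel_mse([1.0, 3.0], 2.0)
```

Normalizing by $\mu(y,\theta)^2$ makes errors comparable across $\{y, \theta\}$ pairs whose true values differ by orders of magnitude.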
Interestingly, SNIS $q_m$ closely matched the theoretical SNIS bound, suggesting that this amortized proposal is very close to the theoretically optimal one. However, this still constituted significantly worse performance than AMCI, taking about $10^3\times$ more samples to achieve the same relative error, demonstrating the ability of AMCI to outperform the best possible SNIS estimator. For the five-dimensional example, AMCI again significantly outperformed our main baseline, SNIS $q_2$. Though it still also outperformed SNIS $q_m$, its advantage was smaller than in the one-dimensional case, and it did not outperform the SNIS theoretical bound. SNIS $q_m$ itself did not match the bound as closely as in the one-dimensional example either, suggesting that the proposals learned were worse than in the one-dimensional case. Further comparisons based on using the mean squared error (instead of ReMSE) are given in Appendix A and show qualitatively similar behavior.

4.2. Planning Cancer Treatment

To demonstrate how AMCI might be used in a more real-world scenario, we now consider an illustrative example relating to cancer diagnostic decisions. Imagine that an oncologist is trying to decide whether to administer a treatment to a cancer patient. Because the treatment is highly invasive, they only want to administer it if there is a realistic chance of it being successful, i.e. that the tumor shrinks sufficiently to allow a future operation to be carried out.

²Variability in $\delta(y, \theta)$ between different instances of $\{y, \theta\}$ is considered in Figures 7 and 8 in Appendix A.

Figure 2: Relative mean squared errors for the cancer example. Conventions as per Figure 1. It is worth noting that it took about $10^4\times$ more samples for the SNIS $q_2$ estimator to achieve the same level of accuracy as the AMCI estimator.
However, they are only able to make noisy observations about the current size of the tumor, and there are various unknown parameters pertaining to its growth, such as the patient's predisposition to the treatment. To aid in the oncologist's decision, the clinic provides a simulator of tumor evolution, a model of the latent factors required for this simulator, and a loss function for administering the treatment given the final tumor size. We wish to construct an amortization of this simulator, so that we can quickly and directly predict the expected loss of administering the treatment from a pair of noisy observations of the tumor size taken at separate points in time. A detailed description of the model and proposal setup is given in Appendix C.3.

To evaluate the learned proposals we followed the same procedure as for the tail integral example. Results are presented in Figure 2. AMCI again significantly outperformed the literature baseline of SNIS $q_2$: it took about $N = 10^4$ samples for SNIS $q_2$ to achieve the level of relative error that AMCI achieved at $N = 2$. AMCI further maintained an advantage over SNIS $q_m$, which itself again closely matched the optimal SNIS estimator. Further comparisons are given in Appendix A and show qualitatively similar behavior.

5. Discussion

In all experiments AMCI performed better than SNIS with either $q_2$ or $q_m$ as its proposal. Moreover, it is clear that AMCI is indeed able to break the theoretical bound on the achievable performance of SNIS estimators: in some cases AMCI outperforms the best achievable error of any SNIS estimator, regardless of the proposal the latter uses. Interestingly, the mixture SNIS estimator we also introduce proved to be a strong baseline, as it closely matched the theoretical bound in both experiments.
However, such an effective mixture proposal is only possible thanks to learning the multiple inference artifacts we suggest as part of the AMCI framework, and its performance was still generally inferior to AMCI itself.

We now consider the question of when we expect AMCI to work particularly well compared to SNIS, and the scenarios where it is less beneficial, or potentially even harmful. We first note that scaling with increasing dimensionality is a challenge for both, because the importance sampling upon which they rely suffers from the curse of dimensionality. However, the scaling of AMCI should be no worse than that of existing amortization approaches, as each of the amortized proposals is trained in isolation and corresponds to a conventional inference amortization.

We can gain more insight into the relative performance of the two approaches in different settings using an informal asymptotic analysis in the limit of a large number of samples. Assuming $f(x; \theta) \ge 0 \;\forall x, \theta$ for simplicity,³ both AMCI and SNIS can be expressed in the form of (10), where for SNIS we set $q_1(x; y, \theta) = q_2(x; y)$, $N = M$, and share samples between the estimators. Separately applying the central limit theorem to $\hat{E}_1$ and $\hat{E}_2$ yields

$$\hat{\mu}(y, \theta) = \frac{\hat{E}_1}{\hat{E}_2} \to \frac{E_1 + \sigma_1 \xi_1}{E_2 + \sigma_2 \xi_2} \quad \text{as } N, M \to \infty, \qquad (26)$$

where $\xi_1, \xi_2 \sim \mathcal{N}(0, 1)$ and

$$\sigma_1^2 = \frac{1}{N} \mathrm{Var}_{q_1(x;y,\theta)}\left[\frac{f(x;\theta)\,p(x,y)}{q_1(x;y,\theta)}\right], \qquad \sigma_2^2 = \frac{1}{M} \mathrm{Var}_{q_2(x;y)}\left[\frac{p(x,y)}{q_2(x;y)}\right].$$

Asymptotically, the mean squared error of $\hat{\mu}(y, \theta)$ is dominated by its variance. Thus, by taking a first-order Taylor expansion of $1/\hat{E}_2$ about $1/E_2$, we get, for large $M$,

$$\mathbb{E}\left[\left(\hat{\mu}(y,\theta) - \mu(y,\theta)\right)^2\right] \approx \frac{\sigma_1^2 + \sigma_2^2 \mu(y,\theta)^2 - 2\mu(y,\theta)\sigma_1\sigma_2 \mathrm{Corr}[\xi_1,\xi_2]}{E_2^2} = \frac{\mu(y,\theta)^2 \sigma_2^2}{E_2^2}\left(\left(\kappa - \mathrm{Corr}[\xi_1,\xi_2]\right)^2 + 1 - \mathrm{Corr}[\xi_1,\xi_2]^2\right), \qquad (29)$$

where the approximation from the Taylor expansion becomes exact in the limit $M \to \infty$, and $\kappa := \sigma_1/(\mu(y,\theta)\,\sigma_2)$ is a measure of the relative accuracy of the two estimators. See (43) in Appendix D.1 for a more verbose derivation.
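To make the estimator being analyzed concrete, the following is a minimal sketch of the AMCI ratio estimator $\hat{\mu} = \hat{E}_1/\hat{E}_2$ alongside SNIS on a toy conjugate-Gaussian problem. The model, target function, and hand-picked Gaussian proposals (standing in for learned amortized $q_1$ and $q_2$) are our own illustrative choices, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate-Gaussian setup (our own illustrative choice):
# p(x, y) ∝ exp(-x^2/2 - (y - x)^2/2), so the posterior is
# p(x|y) = N(y/2, 1/2), and the target function is f(x) = x^2.
y = 1.5

def log_p_xy(x):                       # unnormalized log p(x, y)
    return -0.5 * x**2 - 0.5 * (y - x)**2

def f(x):
    return x**2

# Hand-picked Gaussian proposals stand in for the learned amortized
# proposals: q1 targets f(x) p(x, y), q2 targets p(x, y).
def sample_q1(n):
    return rng.normal(y / 2, 1.2, n)   # wider, to cover the x^2-weighted target

def log_q1(x):
    return -0.5 * ((x - y / 2) / 1.2)**2 - np.log(1.2 * np.sqrt(2 * np.pi))

def sample_q2(n):
    return rng.normal(y / 2, np.sqrt(0.5), n)

def log_q2(x):
    return -(x - y / 2)**2 - 0.5 * np.log(np.pi)

def amci(n, m):
    # Two separate importance-sampling estimates, combined as a ratio.
    x1, x2 = sample_q1(n), sample_q2(m)
    E1_hat = np.mean(f(x1) * np.exp(log_p_xy(x1) - log_q1(x1)))  # ≈ ∫ f(x) p(x, y) dx
    E2_hat = np.mean(np.exp(log_p_xy(x2) - log_q2(x2)))          # ≈ p(y), up to shared constants
    return E1_hat / E2_hat

def snis(n):
    # Self-normalized importance sampling with the single proposal q2.
    x = sample_q2(n)
    w = np.exp(log_p_xy(x) - log_q2(x))
    return np.sum(w * f(x)) / np.sum(w)

truth = 0.5 + (y / 2)**2               # E[x^2] under N(y/2, 1/2)
print(amci(1000, 1000), snis(2000), truth)
```

Because `E1_hat` and `E2_hat` both use the same unnormalized p(x, y), any constants dropped from the joint cancel in the ratio, exactly as in (26).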
For a given value of σ₂, the value of κ for SNIS is completely dictated by the problem: in general, the larger the mismatch between f(x; θ)p(x, y) and p(x, y), the larger κ will be. This yields the expected result that the errors for SNIS become large in this setting. For AMCI, we can control κ by ensuring a good proposal for both Ê₁ and Ê₂, and, if desired, by adjusting M and N (relative to a fixed budget M + N). Consequently, we can achieve better errors than SNIS by driving κ down.

On the other hand, as f(x; θ)p(x, y) and p(x, y) become increasingly well matched, then κ → 1 and we find that AMCI has little to gain over SNIS. In fact, we see that AMCI can potentially be worse than SNIS in this setting: when f(x; θ)p(x, y) and p(x, y) are closely matched, we also have Corr[ξ₁, ξ₂]² → 1 for SNIS, such that we observe a canceling effect, potentially leading to very low errors. Achieving Corr[ξ₁, ξ₂]² → 1 can be more difficult for AMCI, potentially giving rise to a higher error. However, it could be possible to mitigate this by correlating the estimates, e.g. through common random numbers.

³The results trivially generalize to general f(x) with a suitable adjustment of the definition of σ₁.

Figure 3: Results for the one-dimensional tail integral model in a setting with large mismatch [top] and low mismatch [bottom], with (y, θ) equal to (1, 3) and (3, 0.1) respectively. The left column illustrates the shape of the proposal q1 and the achievable quality of fit to f(x; θ)p(x|y); we see that AMCI is able to learn very accurate proposals in both cases. The right column compares the performance of the AMCI and SNIS estimators, where we see that the gain for AMCI is much larger when the mismatch is large. Uncertainty bands in column two are estimated over 1000 runs and are almost imperceptibly small.
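The asymptotic error formula is easy to sanity-check numerically: we can draw correlated (ξ₁, ξ₂) directly and compare the empirical mean squared error of the ratio in (26) against the prediction of (29). All constants below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative constants; small sigmas keep us in the
# asymptotic regime where the first-order Taylor expansion is accurate.
E1, E2 = 2.0, 1.0
sigma1, sigma2, rho = 0.05, 0.04, 0.6
mu = E1 / E2
kappa = sigma1 / (mu * sigma2)         # relative accuracy of the two estimators

# Draw correlated standard normals (xi1, xi2) with Corr[xi1, xi2] = rho
# and form the ratio estimator from (26).
n = 1_000_000
xi1 = rng.standard_normal(n)
xi2 = rho * xi1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
mu_hat = (E1 + sigma1 * xi1) / (E2 + sigma2 * xi2)

empirical_mse = np.mean((mu_hat - mu)**2)
predicted_mse = (mu**2 * sigma2**2 / E2**2) * (
    (kappa - rho)**2 + 1 - rho**2)
print(empirical_mse, predicted_mse)    # the two values should agree closely
```

Setting rho close to kappa makes the predicted error collapse, which is the canceling effect for SNIS discussed above; driving kappa down instead is the lever available to AMCI.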
To assess whether this theory manifests in practice, we revisit our tail integral example, comparing large and small mismatch scenarios. The results, shown in Figure 3, agree with these theoretical findings. In Appendix D we further show that reusing samples for both Ê₁ and Ê₂ in AMCI can be beneficial when the targets are well matched.

More generally, as Theorem 1 tells us that the AMCI estimator can achieve an arbitrarily low error for any given target function, while SNIS cannot, we know that its potential gains are larger the more accurate we are able to make our proposals. As such, as advances elsewhere in the field allow us to produce increasingly effective amortized proposals, e.g. through advanced normalizing flow approaches (Grathwohl et al., 2019; Kingma & Dhariwal, 2018), the potential gains from using AMCI will grow.

Acknowledgments

We would like to thank Yee Whye Teh for helpful discussions at the early stages of the project. AG is supported by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems. FW is supported by DARPA D3M, under Cooperative Agreement FA8750-17-2-0093, Intel under its LBNL NERSC Big Data Center, and an NSERC Discovery grant. TR is supported by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement no. 617071. His research leading to these results also received funding from EPSRC under grant EP/P026753/1.

References

Bugallo, M. F., Elvira, V., Martino, L., Luengo, D., Miguez, J., and Djuric, P. M. Adaptive importance sampling: the past, the present, and the future. IEEE Signal Processing Magazine, 34(4):60–79, 2017.

Chen, M.-H. and Shao, Q.-M. On Monte Carlo methods for estimating ratios of normalizing constants. The Annals of Statistics, 25(4):1563–1594, 1997.

Cremer, C., Li, X., and Duvenaud, D. Inference suboptimality in variational autoencoders.
Proceedings of the International Conference on Machine Learning (ICML), 2018.

Enderling, H. and Chaplain, M. A. Mathematical modeling of tumor growth and treatment. Current Pharmaceutical Design, 20(30):4934–40, 2014.

Evans, M. and Swartz, T. Methods for approximating integrals in statistics with special emphasis on Bayesian integration problems. Statistical Science, pp. 254–272, 1995.

Gelman, A. and Meng, X.-L. Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Statistical Science, 13(2):163–185, 1998.

Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: free-form continuous dynamics for scalable reversible generative models. International Conference on Learning Representations (ICLR), 2019.

Hahnfeldt, P., Panigrahy, D., Folkman, J., and Hlatky, L. Tumor development under angiogenic signaling. Cancer Research, 59(19):4770–4775, 1999.

Hesterberg, T. C. Advances in importance sampling. PhD thesis, Stanford University, 1988.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. Journal of Machine Learning Research (JMLR), 2013.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. Advances in Neural Information Processing Systems (NIPS), 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.

Lacoste-Julien, S., Huszár, F., and Ghahramani, Z. Approximate inference for the loss-calibrated Bayesian. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Lamberti, R., Petetin, Y., Septier, F., and Desbouvries, F. A double proposal normalized importance sampling estimator. 2018 IEEE Statistical Signal Processing Workshop (SSP), pp. 238–242, 2018.

Le, T.
A., Baydin, A. G., and Wood, F. Inference compilation and universal probabilistic programming. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

Le, T. A., Igl, M., Jin, T., Rainforth, T., and Wood, F. Auto-encoding sequential Monte Carlo. International Conference on Learning Representations (ICLR), 2018a.

Le, T. A., Kosiorek, A. R., Siddharth, N., Teh, Y. W., and Wood, F. Revisiting reweighted wake-sleep. arXiv:1805.10469, 2018b.

Meng, X.-L. and Wong, W. H. Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6:831–860, 1996.

Oh, M.-S. and Berger, J. O. Adaptive importance sampling in Monte Carlo integration. Journal of Statistical Computation and Simulation, 41(3-4):143–168, 1992.

Owen, A. B. Monte Carlo theory, methods and examples. 2013.

Paige, B. and Wood, F. Inference networks for sequential Monte Carlo in graphical models. Proceedings of the International Conference on Machine Learning (ICML), 2016.

Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. Advances in Neural Information Processing Systems (NIPS), 2017.

Rainforth, T. Automating inference, learning, and design using probabilistic programming. PhD thesis, 2017.

Rainforth, T., Cornish, R., Yang, H., Warrington, A., and Wood, F. On Nesting Monte Carlo Estimators. Proceedings of the International Conference on Machine Learning (ICML), 2018a.

Rainforth, T., Zhou, Y., Lu, X., Teh, Y. W., Wood, F., Yang, H., and van de Meent, J.-W. Inference trees: Adaptive inference with exploration. arXiv preprint arXiv:1806.09550, 2018b.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. Proceedings of the International Conference on Machine Learning (ICML), 2015.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models.
Proceedings of the International Conference on Machine Learning (ICML), 2014.

Ritchie, D., Horsfall, P., and Goodman, N. D. Deep amortized inference for probabilistic programs. arXiv:1610.05735, 2016.

Robert, C. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Science & Business Media, 2007.

Robert, C. and Casella, G. Monte Carlo statistical methods. Springer Science & Business Media, 2013.

Stuhlmüller, A., Taylor, J., and Goodman, N. Learning stochastic inverses. Advances in Neural Information Processing Systems (NIPS), 2013.

Veach, E. and Guibas, L. J. Optimally combining sampling techniques for Monte Carlo rendering. Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, pp. 419–428, 1995.

Webb, S., Goliński, A., Zinkov, R., Siddharth, N., Rainforth, T., Teh, Y. W., and Wood, F. Faithful inversion of generative models for effective amortized inference. Advances in Neural Information Processing Systems (NIPS), 2018.

Wolpert, R. L. Monte Carlo integration in Bayesian statistical analysis. Contemporary Mathematics, 115:101–116, 1991.