# Importance Weighting and Variational Inference

Justin Domke¹ and Daniel Sheldon¹,²

¹ College of Information and Computer Sciences, University of Massachusetts Amherst
² Department of Computer Science, Mount Holyoke College

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

**Abstract.** Recent work used importance sampling ideas for better variational bounds on likelihoods. We clarify the applicability of these ideas to pure probabilistic inference, by showing the resulting Importance Weighted Variational Inference (IWVI) technique is an instance of augmented variational inference, thus identifying the looseness in previous work. Experiments confirm IWVI's practicality for probabilistic inference. As a second contribution, we investigate inference with elliptical distributions, which improves accuracy in low dimensions, and convergence in high dimensions.

## 1 Introduction

Probabilistic modeling is used to reason about the world by formulating a joint model $p(z, x)$ for unobserved variables $z$ and observed variables $x$, and then querying the posterior distribution $p(z \mid x)$ to learn about hidden quantities given evidence $x$. Common tasks are to draw samples from $p(z \mid x)$ or compute posterior expectations. However, it is often intractable to perform these tasks directly, so considerable research has been devoted to methods for approximate probabilistic inference.

Variational inference (VI) is a leading approach for approximate inference. In VI, $p(z \mid x)$ is approximated by a distribution $q(z)$ in a simpler family for which inference is tractable. The process to select $q$ is based on the following decomposition [22, Eqs. 11-12]:

$$\log p(x) = \underbrace{\mathbb{E}_{q(z)} \log \frac{p(z, x)}{q(z)}}_{\text{ELBO}[q(z) \,\|\, p(z,x)]} + \underbrace{\mathrm{KL}\left[q(z) \,\|\, p(z \mid x)\right]}_{\text{divergence}} \tag{1}$$

The first term is a lower bound of $\log p(x)$ known as the "evidence lower bound" (ELBO). Selecting $q$ to make the ELBO as big as possible simultaneously obtains a lower bound of $\log p(x)$ that is as tight as possible and drives $q$ close to $p$ in KL-divergence.

The ELBO is closely related to importance sampling. For fixed $q$, let $R = p(z, x)/q(z)$ where $z \sim q$. This random variable satisfies $p(x) = \mathbb{E}\, R$, which is the foundation of importance sampling. Similarly, we can write by Jensen's inequality that $\log p(x) \geq \mathbb{E} \log R = \text{ELBO}[q \| p]$, which is the foundation of modern black-box versions of VI (BBVI) [19], in which Monte Carlo samples are used to estimate $\mathbb{E} \log R$, in the same way that IS estimates $\mathbb{E}\, R$.

Critically, the only property VI uses to obtain a lower bound is $p(x) = \mathbb{E}\, R$. Further, it is straightforward to see that Jensen's inequality yields a tighter bound when $R$ is more concentrated about its mean $p(x)$. So, it is natural to consider different random variables with the same mean that are more concentrated, for example the sample average $R_M = \frac{1}{M} \sum_{m=1}^{M} R_m$. Then, by identical reasoning, $\log p(x) \geq \mathbb{E} \log R_M$. The last quantity is the objective of importance-weighted auto-encoders [5]; we call it the importance weighted ELBO (IW-ELBO), and the process of selecting $q$ to maximize it importance-weighted VI (IWVI).

However, at this point we should pause. The decomposition in Eq. 1 makes it clear exactly in what sense standard VI, when optimizing the ELBO, makes $q$ close to $p$. By switching to the one-dimensional random variable $R_M$, we derived the IW-ELBO, which gives a tighter bound on $\log p(x)$. For learning applications, this may be all we want. But for probabilistic inference, we are left uncertain exactly in what sense $q$ "is close to" $p$, and how we should use $q$ to approximate $p$, say, for computing posterior expectations.
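To make the bound-tightening concrete, here is a minimal numerical sketch (not from the paper): a conjugate model $p(z) = \mathcal{N}(0,1)$, $p(x \mid z) = \mathcal{N}(x; z, 1)$, for which $\log p(x)$ is available in closed form, paired with a deliberately mismatched $q$. The model, the choice of $q$, and the sample sizes are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy conjugate model: p(z) = N(0,1), p(x|z) = N(x; z, 1), so p(x) = N(x; 0, 2).
x = 1.5
log_px = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))

def log_weight(z):
    """log R = log p(z, x) - log q(z), with a mismatched q(z) = N(0.5, 1)."""
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)
    return log_joint - norm.logpdf(z, 0.5, 1.0)

def iw_elbo_estimate(M, n_batches=20000):
    z = rng.normal(0.5, 1.0, size=(n_batches, M))
    logw = log_weight(z)
    # log of the sample average R_M, computed stably via log-sum-exp.
    log_RM = np.logaddexp.reduce(logw, axis=1) - np.log(M)
    return log_RM.mean()

print(f"log p(x)         = {log_px:.4f}")
for M in [1, 5, 20, 100]:
    print(f"IW-ELBO (M={M:3d}) = {iw_elbo_estimate(M):.4f}")
```

As $M$ grows, the estimated IW-ELBO climbs toward the exact $\log p(x)$, illustrating $\log p(x) \geq \mathbb{E} \log R_M$.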
Our first contribution is to provide a new perspective on IWVI by highlighting a precise connection between IWVI and self-normalized importance sampling (NIS) [17], which instructs us how to use IWVI for pure inference applications. Specifically, IWVI is an instance of augmented VI. Maximizing the IW-ELBO corresponds exactly to minimizing the KL divergence between joint distributions $q_M$ and $p_M$, where $q_M$ is derived from NIS over a batch of $M$ samples from $q$, and $p_M$ is the joint distribution obtained by drawing one sample from $p$ and $M - 1$ "dummy" samples from $q$. This has strong implications for probabilistic inference (as opposed to learning), which is our primary focus. After optimizing $q$, one should compute posterior expectations using NIS. We show that not only does IWVI significantly tighten bounds on $\log p(x)$, but, by using $q$ this way at test time, it significantly reduces estimation error for posterior expectations.

Previous work has connected IWVI and NIS by showing that the importance weighted ELBO is a lower bound of the ELBO applied to the NIS distribution [6, 16, 2]. Our work makes this relationship precise as an instance of augmented VI, and exactly quantifies the gap between the IW-ELBO and the conventional ELBO applied to the NIS distribution, which is a conditional KL divergence.

Our second contribution is to further explore the connection between variational inference and importance sampling by adapting ideas of defensive sampling [17] to VI. Defensive importance sampling uses a widely dispersed $q$ distribution to reduce variance by avoiding situations where $q$ places essentially no mass in an area where $p$ has density. This idea is incompatible with regular VI due to its mode-seeking behavior, but it is quite compatible with IWVI. We show how to use elliptical distributions and reparameterization to achieve a form of defensive sampling with almost no additional overhead to black-box VI (BBVI). Elliptical VI provides small improvements over Gaussian BBVI in terms of ELBO and posterior expectations. In higher dimensions, these improvements diminish, but elliptical VI provides significant improvements in convergence reliability and speed. This is consistent with the notion that a defensive $q$ distribution is advisable when $q$ is not well matched to $p$ (e.g., before optimization has completed).

## 2 Variational Inference

Consider again the "ELBO decomposition" in Eq. 1. Variational inference maximizes the evidence lower bound (ELBO) over $q$. Since the divergence is non-negative, this tightens a lower bound on $\log p(x)$. But, of course, since the ELBO and the divergence sum to the constant $\log p(x)$, maximizing the ELBO is equivalent to minimizing the divergence. Thus, variational inference can be thought of as simultaneously solving two problems:

- **probabilistic inference**, or finding a distribution $q(z)$ that is close to $p(z \mid x)$ in KL-divergence;
- **bounding the marginal likelihood**, or finding a lower bound on $\log p(x)$.

The first problem is typically used with Bayesian inference: a user specifies a model $p(z, x)$, observes some data $x$, and is interested in the posterior $p(z \mid x)$ over the latent variables. While Markov chain Monte Carlo is most commonly used for these problems [9, 23], the high computational expense motivates VI [11, 3]. While a user might be interested in any aspect of the posterior, for concreteness we focus on "posterior expectations", where the user specifies some arbitrary $t(z)$ and wants to approximate $\mathbb{E}_{p(z \mid x)}\, t(z)$.
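To fix ideas about how $q$ is actually selected in this first problem, here is a minimal sketch of a reparameterized black-box VI loop in the spirit of [19]. The one-dimensional toy target, the manual gradient formulas, and the learning rate are our own illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unnormalized toy target: log p(z, x) = -(z - 2)^2 / (2 * 0.5^2), i.e. a
# Gaussian "posterior" with mean 2 and std 0.5 (an arbitrary illustrative choice).
def grad_log_joint(z):
    return -(z - 2.0) / 0.25

# Gaussian q(z|w) = N(m, exp(ls)^2), reparameterized as z = m + exp(ls) * eps.
m, ls = 0.0, 0.0
lr = 0.01
for step in range(2000):
    eps = rng.normal(size=64)
    s = np.exp(ls)
    z = m + s * eps
    g = grad_log_joint(z)
    # Reparameterization gradients of the ELBO; the entropy of N(m, s^2) is
    # ls + const, so it contributes 1 to d/dls and nothing to d/dm.
    grad_m = g.mean()
    grad_ls = (g * eps * s).mean() + 1.0
    m += lr * grad_m
    ls += lr * grad_ls

print(f"fitted q: mean={m:.3f}, std={np.exp(ls):.3f}")  # approximately 2.0, 0.5
```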
The second problem is typically used to support maximum likelihood learning. Suppose that $p_\theta(z, x)$ is some distribution over observed data $x$ and hidden variables $z$. In principle, one would like to set $\theta$ to maximize the marginal likelihood over the observed data. When the integral $p_\theta(x) = \int p_\theta(z, x)\, dz$ is intractable, one can optimize the lower bound $\mathbb{E}_{q(z)} \log\left(p_\theta(z, x)/q(z)\right)$ instead [22], over both $\theta$ and the parameters of $q$. This idea has been used to great success recently with variational auto-encoders (VAEs) [10].

## 3 Importance Weighting

Recently, ideas from importance sampling have been applied to obtain tighter ELBOs for learning in VAEs [5]. We review the idea and then draw novel connections to augmented VI that make it clear how to apply these ideas to probabilistic inference.

*Figure 1: How the density of $R_M$ changes with $M$. (Distribution and setting as in Fig. 2.)*

Take any random variable $R$ such that $\mathbb{E}\, R = p(x)$, which we will think of as an estimator of $p(x)$. Then it's easy to see via Jensen's inequality that

$$\log p(x) = \underbrace{\mathbb{E} \log R}_{\text{lower bound}} + \underbrace{\mathbb{E} \log \frac{p(x)}{R}}_{\text{looseness}}, \tag{2}$$

where the first term is a lower bound on $\log p(x)$, and the second (non-negative) term is the looseness. The bound will be tight if $R$ is highly concentrated.

While Eq. 2 looks quite trivial, it is a generalization of the ELBO decomposition in Eq. 1. To see that, use the random variable

$$R = \omega(z) = \frac{p(z, x)}{q(z)}, \quad z \sim q, \tag{3}$$

which clearly obeys $\mathbb{E}\, R = p(x)$, and for which Eq. 2 becomes Eq. 1. The advantage of Eq. 2 over Eq. 1 is increased flexibility: alternative estimators $R$ can give a tighter bound on $\log p(x)$. One natural idea is to draw multiple i.i.d. samples from $q$ and average the estimates, as in importance sampling (IS). This gives the estimator

$$R_M = \frac{1}{M} \sum_{m=1}^{M} \frac{p(z_m, x)}{q(z_m)}, \quad z_m \sim q. \tag{4}$$

It's always true that $\mathbb{E}\, R_M = p(x)$, but the distribution of $R_M$ places less mass near zero for larger $M$, which leads to a tighter bound (Fig. 1). This leads to a tighter importance weighted ELBO (IW-ELBO) lower bound on $\log p(x)$, namely

$$\text{IW-ELBO}_M[q(z) \,\|\, p(z, x)] := \mathbb{E}_{q(z_{1:M})} \log \frac{1}{M} \sum_{m=1}^{M} \frac{p(z_m, x)}{q(z_m)}, \tag{5}$$

where $z_{1:M}$ is a shorthand for $(z_1, \ldots, z_M)$ and $q(z_{1:M}) = q(z_1) \cdots q(z_M)$. This bound was first introduced by Burda et al. [5] in the context of supporting maximum likelihood learning of a variational auto-encoder.

### 3.1 A generative process for the importance weighted ELBO

While Eq. 2 makes clear that optimizing the IW-ELBO tightens a bound on $\log p(x)$, it isn't obvious what connection this has to probabilistic inference. Is there some divergence that is being minimized? Theorem 1 shows this can be understood by constructing augmented distributions $p_M(z_{1:M}, x)$ and $q_M(z_{1:M})$ and then applying the ELBO decomposition in Eq. 1 to the joint distributions.

**Theorem 1 (IWVI).** Let $q_M(z_{1:M})$ be the density of the generative process described by Alg. 1, which is based on self-normalized importance sampling over a batch of $M$ samples from $q$. Let $p_M(z_{1:M}, x) = p(z_1, x)\, q(z_{2:M})$ be the density obtained by drawing $z_1$ and $x$ from $p$ and drawing the dummy samples $z_{2:M}$ from $q$. Then

$$q_M(z_{1:M}) = \frac{p_M(z_{1:M}, x)}{\frac{1}{M} \sum_{m=1}^{M} \omega(z_m)}. \tag{6}$$

Further, the ELBO decomposition in Eq. 1 applied to $q_M$ and $p_M$ is

$$\log p(x) = \text{IW-ELBO}_M[q(z) \,\|\, p(z, x)] + \mathrm{KL}\left[q_M(z_{1:M}) \,\|\, p_M(z_{1:M} \mid x)\right]. \tag{7}$$

**Algorithm 1** A generative process for $q_M(z_{1:M})$

1. Draw $\hat{z}_1, \hat{z}_2, \ldots, \hat{z}_M$ independently from $q(z)$.
2. Choose $m \in \{1, \ldots, M\}$ with probability $\omega(\hat{z}_m) \big/ \sum_{m'=1}^{M} \omega(\hat{z}_{m'})$.
3. Set $z_1 = \hat{z}_m$, set $z_{2:M} = \hat{z}_{-m}$ (the unselected variables), and return $z_{1:M}$.
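The generative process of Alg. 1 is easy to simulate. The sketch below (an illustration under the same toy conjugate model as above, not from the paper) draws from the marginal $q_M(z_1)$, i.e., the NIS distribution, and shows its moments approaching those of the exact posterior as $M$ grows, as Theorem 1 suggests.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Toy model from before: p(z) = N(0,1), p(x|z) = N(x; z, 1), observed x = 1.5.
# Exact posterior: p(z|x) = N(x/2, 1/2); q(z) = N(0.5, 1) is mismatched.
x = 1.5

def log_omega(z):  # log importance weight: log p(z, x) - log q(z)
    return norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1) - norm.logpdf(z, 0.5, 1)

def sample_qM_marginal(M, n):
    """Algorithm 1: self-normalized selection; returns z_1 ~ q_M(z_1)."""
    zhat = rng.normal(0.5, 1.0, size=(n, M))
    logw = log_omega(zhat)
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # Select index m with probability proportional to omega(zhat_m).
    m = (w.cumsum(axis=1) > rng.uniform(size=(n, 1))).argmax(axis=1)
    return zhat[np.arange(n), m]

for M in [1, 10, 100]:
    z1 = sample_qM_marginal(M, 200000)
    print(f"M={M:3d}: mean={z1.mean():+.3f}, var={z1.var():.3f}")
print(f"posterior: mean={x/2:+.3f}, var=0.500")
```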
We will call the process of maximizing the IW-ELBO Importance Weighted Variational Inference (IWVI). (Burda et al. used "Importance Weighted Auto-encoder" for optimizing Eq. 5 as a bound on the likelihood of a variational auto-encoder, but this terminology ties the idea to a particular model, and is not suggestive of the probabilistic inference setting.)

The generative process for $q_M$ in Alg. 1 is very similar to self-normalized importance sampling. The usual NIS distribution draws a batch of size $M$, and then selects a single variable with probability in proportion to its importance weight. NIS is exactly equivalent to the marginal distribution $q_M(z_1)$. The generative process for $q_M(z_{1:M})$ additionally keeps the unselected variables and relabels them as $z_{2:M}$.

Previous work [6, 2, 16, 12] investigated a similar connection between NIS and the importance-weighted ELBO. In our notation, they showed that

$$\log p(x) \geq \text{ELBO}\left[q_M(z_1) \,\|\, p(z_1, x)\right] \geq \text{IW-ELBO}_M[q(z) \,\|\, p(z, x)]. \tag{8}$$

That is, they showed that the IW-ELBO lower bounds the ELBO between the NIS distribution and $p$, without quantifying the gap in the second inequality. Our result makes it clear exactly what KL-divergence is being minimized by maximizing the IW-ELBO, and in what sense doing this makes $q$ close to $p$. As a corollary, we also quantify the gap in the inequality above (see Thm. 2 below).

A recent decomposition [12, Claim 1] is related to Thm. 1, but based on different augmented distributions $q_M^{\text{IS}}$ and $p_M^{\text{IS}}$. This result is fundamentally different in that it holds $q_M^{\text{IS}}$ "fixed" to be an independent sample of size $M$ from $q$, and modifies $p_M^{\text{IS}}$ so its marginals approach $q$. This does not inform inference. Contrast this with our result, where $q_M(z_1)$ gets closer and closer to $p(z_1 \mid x)$, and can be used for probabilistic inference. See the appendix (Section A.3.2) for details.

Identifying the precise generative process is useful if IWVI will be used for general probabilistic queries, which is a focus of our work and, to our knowledge, has not been investigated before. For example, the expected value of $t(z)$ can be approximated as

$$\mathbb{E}_{p(z \mid x)}\, t(z) = \mathbb{E}_{p_M(z_1 \mid x)}\, t(z_1) \approx \mathbb{E}_{q_M(z_1)}\, t(z_1) = \mathbb{E}_{q(z_{1:M})} \sum_{m=1}^{M} \frac{\omega(z_m)}{\sum_{m'=1}^{M} \omega(z_{m'})}\, t(z_m). \tag{9}$$

The final equality is established by Lemma 4 in the appendix. Here, the inner approximation is justified since IWVI minimizes the joint divergence between $q_M(z_{1:M})$ and $p_M(z_{1:M} \mid x)$. However, this is not equivalent to minimizing the divergence between $q_M(z_1)$ and $p_M(z_1 \mid x)$, as the following result shows.

**Theorem 2.** The marginal and joint divergences relevant to IWVI are related by

$$\mathrm{KL}\left[q_M(z_{1:M}) \,\|\, p_M(z_{1:M} \mid x)\right] = \mathrm{KL}\left[q_M(z_1) \,\|\, p(z_1 \mid x)\right] + \mathrm{KL}\left[q_M(z_{2:M} \mid z_1) \,\|\, q(z_{2:M})\right].$$

As a consequence, the gap in the first inequality of Eq. 8 is exactly $\mathrm{KL}[q_M(z_1) \| p(z_1 \mid x)]$ and the gap in the second inequality is exactly $\mathrm{KL}[q_M(z_{2:M} \mid z_1) \| q(z_{2:M})]$. The first term is the divergence between the marginal of $q_M$, i.e., the standard NIS distribution, and the posterior. In principle, this is exactly the divergence we would like to minimize to justify Eq. 9. However, the second term is not zero, since the selection phase in Alg. 1 leaves $z_{2:M}$ distributed differently under $q_M$ than under $q$. Since this term is irrelevant to the quality of the approximation in Eq. 9, IWVI truly minimizes an upper bound. Thus, IWVI can be seen as an instance of auxiliary variational inference [1], where a joint divergence upper-bounds the divergence of interest.
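Eq. 9 is just the familiar self-normalized importance sampling estimator applied to the optimized $q$. A short sketch, reusing the toy conjugate model from above (an illustrative assumption, not the paper's experiments):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Estimate E_{p(z|x)} t(z) via Eq. 9: p(z) = N(0,1), p(x|z) = N(x; z, 1),
# x = 1.5, q(z) = N(0.5, 1). Exact posterior N(0.75, 0.5), so
# E[z^2 | x] = 0.5 + 0.75^2.
x, M, n = 1.5, 50, 10000

z = rng.normal(0.5, 1.0, size=(n, M))
logw = norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1) - norm.logpdf(z, 0.5, 1)
wbar = np.exp(logw - logw.max(axis=1, keepdims=True))
wbar /= wbar.sum(axis=1, keepdims=True)        # self-normalized weights

t = z**2                                       # test function t(z) = z^2
estimate = (wbar * t).sum(axis=1).mean()       # Eq. 9, averaged over n batches
print(f"NIS estimate of E[z^2 | x]: {estimate:.4f}  (exact: {0.5 + 0.75**2:.4f})")
```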
*Figure 2: Two Gaussian (N) and two Student-T (T) variational distributions, all with constant variance and one of two means (A or B). (a) The target $p$ and the four candidate variational distributions. (b) Reweighted densities $q_M(z_1)$ for each distribution. (c) The IW-ELBO. (Higher is better.) (d) Moment error $\|\mathbb{E}_{q_M} t(z_1) - \mathbb{E}_p\, t(z)\|_2^2$ for $t(z) = (z, z^2)$. (Lower is better.) For $M = 1$ it is better to use a mean closer to one mode of $p$. For large $M$, a mean in the center is superior, and the heavy tails of the Student-T lead to better approximation of $p$ and better performance both in terms of IW-ELBO and moment error.*

## 4 Importance Sampling Variance

This section considers the family for the variational distribution. For small $M$, the mode-seeking behavior of VI will favor weak tails, while for large $M$, the variance reduction provided by importance weighting will favor wider tails.

The most common variational distribution is the Gaussian. One explanation for this is the Bayesian central limit theorem, which, in many cases, guarantees that the posterior is asymptotically Gaussian. Another is that it's safest to have weak tails: since the objective is $\mathbb{E} \log R$, small values of $R$ are most harmful. So, VI wants to avoid cases where $q(z) \gg p(z, x)$, which is difficult if $q$ is heavy-tailed. (This is the mode-seeking behavior of the KL-divergence [24].)

With IWVI, the situation changes. Asymptotically in $M$, $R_M$ in Eq. 4 concentrates around $p(x)$, and so it is the variance of $R_M$ that matters, as formalized in the following result.

**Theorem 3.** For large $M$, the looseness of the IW-ELBO is given by the variance of $R$. Formally, if there exists some $\delta > 0$ such that $\mathbb{E}\, |R - p(x)|^{2+\delta} < \infty$ and $\limsup_{M \to \infty} \mathbb{E}[1/R_M] < \infty$, then

$$\lim_{M \to \infty} M \Big( \underbrace{\log p(x) - \text{IW-ELBO}_M[q(z) \,\|\, p(z, x)]}_{\mathrm{KL}[q_M \| p_M]} \Big) = \frac{\mathbb{V}[R]}{2\, p(x)^2}.$$

Maddison et al. [13] give a related result. Their Proposition 1 applied to $R_M$ gives the same conclusion (after an argument based on the Marcinkiewicz-Zygmund inequality; see appendix), but requires the sixth central moment to exist, whereas we require only existence of $\mathbb{E}\, |R - p(x)|^{2+\delta}$ for some $\delta > 0$. The $\limsup$ assumption on $\mathbb{E}\, 1/R_M$ is implied by assuming that $\mathbb{E}\, 1/R_M < \infty$ for any finite $M$ (or for $R$ itself). Rainforth et al. [18, Theorem 1 in Appendix] provide a related asymptotic for errors in gradient variance, assuming at least the third moment exists. Directly minimizing the variance of $R$ is equivalent to minimizing the $\chi^2$ divergence between $q(z)$ and $p(z \mid x)$, as explored by Dieng et al. [7]. Overdispersed VI [21] reduces the variance of score-function estimators using heavy-tailed distributions.

The quantity inside the parentheses on the left-hand side is exactly the KL-divergence between $q_M$ and $p_M$ in Eq. 7, and accordingly, even for constant $q$, this divergence asymptotically decreases at a $1/M$ rate.

The variance of $R$ is a well-explored topic in traditional importance sampling. Here the situation is reversed from traditional VI: since $R$ is non-negative, it is very large values of $R$ that can cause high variance, which occurs when $q(z) \ll p(z, x)$. The typical recommendation is "defensive sampling", or using a widely-dispersed proposal [17]. For these reasons, we believe that the best form for $q$ will vary depending on the value of $M$. Figure 2 explores a simple example of this in 1-D.
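Theorem 3 is easy to probe numerically. The sketch below (with an arbitrary unnormalized Gaussian target and a deliberately over-dispersed proposal, both our own assumptions) compares the empirical IW-ELBO gap against the asymptotic prediction $\mathbb{V}[R]/(2\, p(x)^2 M)$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Unnormalized target p(z, x) = exp(-z^2 / 2), so p(x) = sqrt(2*pi);
# defensive (wide) proposal q = N(0, 1.5^2).
log_px = 0.5 * np.log(2 * np.pi)

def log_R(z):  # log [p(z, x) / q(z)]
    return -0.5 * z**2 - norm.logpdf(z, 0, 1.5)

z = rng.normal(0, 1.5, size=2_000_000)
var_R = np.exp(log_R(z)).var()
coef = var_R / (2 * np.exp(log_px) ** 2)   # Theorem 3: gap ~ coef / M for large M

for M in [10, 30, 100]:
    z = rng.normal(0, 1.5, size=(200000, M))
    iw_elbo = (np.logaddexp.reduce(log_R(z), axis=1) - np.log(M)).mean()
    gap = log_px - iw_elbo
    print(f"M={M:4d}: empirical gap={gap:.5f}, asymptotic prediction={coef / M:.5f}")
```

The two columns should roughly agree for the larger values of $M$, illustrating the $1/M$ rate.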
## 5 Elliptical Distributions

Elliptical distributions are a generalization of Gaussians that includes the Student-T, Cauchy, scale mixtures of Gaussians, and many others. The following short review assumes a density function exists, enabling a simpler presentation than the typical one based on characteristic functions [8].

We first describe the special case of spherical distributions. Take some density $\rho(r)$ for a non-negative scalar $r$, with $\int_0^\infty \rho(r)\, dr = 1$. Define the spherical random variable $\epsilon$ corresponding to $\rho$ as

$$\epsilon = r u, \quad r \sim \rho, \quad u \sim \mathcal{S}, \tag{10}$$

where $\mathcal{S}$ represents the uniform distribution over the unit sphere in $d$ dimensions. The density of $\epsilon$ can be found using two observations. First, it is constant for all $\epsilon$ with a fixed radius $\|\epsilon\|$. Second, if $q_\epsilon(\epsilon)$ is integrated over $\{\epsilon : \|\epsilon\| = r\}$, the result must be $\rho(r)$. Using these, it is not hard to show that the density must be

$$q_\epsilon(\epsilon) = g\left(\|\epsilon\|_2^2\right), \quad g(a) = \frac{\rho(\sqrt{a})}{S_{d-1}\, a^{(d-1)/2}}, \tag{11}$$

where $S_{d-1}$ is the surface area of the unit sphere in $d$ dimensions (and so $S_{d-1}\, a^{(d-1)/2}$ is the surface area of the sphere with radius $\sqrt{a}$) and $g$ is the density generator.

Generalizing this, take some positive definite matrix $\Sigma$ and some vector $\mu$. Define the elliptical random variable $z$ corresponding to $\rho$, $\Sigma$, and $\mu$ by

$$z = r A^\top u + \mu, \quad r \sim \rho, \quad u \sim \mathcal{S}, \tag{12}$$

where $A$ is some matrix such that $A^\top A = \Sigma$. Since $z$ is an affine transformation of $\epsilon$, it is not hard to show by the Jacobian-determinant formula for changes of variables that the density of $z$ is

$$q(z \mid \mu, \Sigma) = \frac{1}{\sqrt{|\Sigma|}}\, g\left((z - \mu)^\top \Sigma^{-1} (z - \mu)\right), \tag{13}$$

where $g$ is again as in Eq. 11. The mean and covariance are $\mathbb{E}[z] = \mu$ and $\mathbb{C}[z] = \left(\mathbb{E}[r^2]/d\right) \Sigma$.

For some distributions, $\rho(r)$ can be found from observing that $r$ has the same distribution as $\|\epsilon\|$. For example, with a Gaussian, $r^2 = \|\epsilon\|^2$ is a sum of $d$ i.i.d. squared Gaussian variables, and so, by definition, $r \sim \chi_d$.

## 6 Reparameterization and Elliptical Distributions

Suppose the variational family $q(z \mid w)$ has parameters $w$ to optimize during inference. The reparameterization trick is based on finding some density $q_\epsilon(\epsilon)$ independent of $w$ and a reparameterization function $T(\epsilon; w)$ such that $T(\epsilon; w)$ is distributed as $q(z \mid w)$. Then, the ELBO can be re-written as

$$\text{ELBO}[q(z \mid w) \,\|\, p(z, x)] = \mathbb{E}_{q_\epsilon(\epsilon)} \log \frac{p(T(\epsilon; w), x)}{q(T(\epsilon; w) \mid w)}.$$

The advantage of this formulation is that the expectation is independent of $w$. Thus, computing the gradient of the term inside the expectation for a random $\epsilon$ gives an unbiased estimate of the gradient. By far the most common case is the multivariate Gaussian distribution, in which case the base density $q_\epsilon(\epsilon)$ is just a standard Gaussian and, for some $A_w$ such that $A_w^\top A_w = \Sigma_w$,

$$T(\epsilon; w) = A_w^\top \epsilon + \mu_w. \tag{14}$$

### 6.1 Elliptical Reparameterization

To understand Gaussian reparameterization from the perspective of elliptical distributions, note the similarity of Eq. 14 to Eq. 12. Essentially, the reparameterization in Eq. 14 combines $r$ and $u$ into $\epsilon = r u$. This same idea can be applied more broadly: for any elliptical distribution whose density generator $g$ is independent of $w$, the reparameterization in Eq. 14 will be valid, provided that $\epsilon$ comes from the corresponding spherical distribution. While this independence is true for Gaussians, it is not the case for other elliptical distributions.

If $\rho_w$ itself is a function of $w$, Eq. 14 must be generalized. In that case, think of the generative process (for $v$ sampled uniformly from $[0, 1]$)

$$T(u, v; w) = F_w^{-1}(v)\, A_w^\top u + \mu_w, \tag{15}$$

where $F_w^{-1}(v)$ is the inverse CDF corresponding to the distribution $\rho_w(r)$. Here, we should think of the vector $(u, v)$ playing the role of $\epsilon$ above, and the base density $q_{u,v}(u, v)$ as being a spherical density for $u$ and a uniform density for $v$. To calculate derivatives with respect to $w$, backpropagation through $A_w$ and $\mu_w$ is simple using any modern autodiff system. So, if the inverse CDF $F_w^{-1}$ has a closed form, autodiff can be directly applied to Eq. 15.
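As an illustration of Eq. 15 in the closed-form case, the sketch below samples a multivariate Student-T elliptically, using the fact that for this distribution $r^2/d$ follows an F-distribution (see Section 6.3 below) together with scipy's F inverse CDF. The dimension, degrees of freedom, scale matrix, and tooling are all our own illustrative assumptions.

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(5)

# Elliptical Student-T: radial variable satisfies r^2 = d * F with F ~ F(d, nu),
# so r = sqrt(d * F^{-1}(v)) gives a closed-form inverse radial CDF.
d, nu = 3, 7.0
mu = np.array([1.0, -2.0, 0.5])
A = np.array([[1.0, 0.2, 0.0],
              [0.0, 0.8, 0.1],
              [0.0, 0.0, 0.5]])       # A^T A = Sigma

n = 500_000
u = rng.normal(size=(n, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)   # uniform on the unit sphere
v = rng.uniform(size=n)
r = np.sqrt(d * f_dist.ppf(v, d, nu))           # r = F^{-1}(v), radial part

z = r[:, None] * (u @ A) + mu                   # Eq. 15 with closed-form inverse

Sigma = A.T @ A
print("sample mean:", z.mean(axis=0).round(3))  # should approach mu
# C[z] = (E[r^2]/d) Sigma = nu/(nu-2) * Sigma, i.e. ratio 1.4 for nu = 7.
print("cov / Sigma diagonal ratio:", (np.cov(z.T) / Sigma).diagonal().round(3))
```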
If the inverse CDF does not have a simple closed form, the following section shows that only the CDF is actually needed, provided that one can at least sample from $\rho(r)$.

### 6.2 Dealing with CDFs without closed-form inverses

For many distributions $\rho$, the inverse CDF may not have a simple closed form, yet highly efficient samplers still exist (most commonly custom rejection samplers with very high acceptance rates). In such cases, one can still achieve the effect of Eq. 15 on a random $v$ using only the CDF (not the inverse). The idea is to first directly generate $r \sim \rho_w$ using the specialized sampler, and only then find the corresponding $v = F_w(r)$ using the closed-form CDF. To understand this, observe that if $r \sim \rho_w$ and $v \sim \text{Uniform}[0, 1]$, then the pairs $(r, F_w(r))$ and $(F_w^{-1}(v), v)$ are identically distributed. Then, via the implicit function theorem,

$$\nabla_w F_w^{-1}(v) = -\frac{\nabla_w F_w(r)}{\nabla_r F_w(r)}.$$

All gradients can then be computed by pretending that one had started with $v$ and computed $r$ using the inverse CDF.

### 6.3 Student-T distributions

The following experiments will consider Student-T distributions. The spherical T distribution can be defined as $\epsilon = \sqrt{\nu}\, \delta / s$, where $\delta \sim \mathcal{N}(0, I)$ and $s \sim \chi_\nu$ [8]. Equivalently, write $r = \|\epsilon\| = \sqrt{\nu}\, t / s$ with $t \sim \chi_d$. This shows that $r$ is (up to the factor $\sqrt{\nu}$) the ratio of two independent $\chi$ variables, and thus determined by an F-distribution, the CDF of which could be used directly in Eq. 15.

We found a slightly bespoke simplification helpful. As there is no need for gradients with respect to $d$ (which is fixed), we represent $\epsilon$ as $\epsilon = (\sqrt{\nu}\, t / s)\, u$, leading to reparameterizing the elliptical T distribution as

$$T(u, t, v; w) = \frac{\sqrt{\nu_w}\, t}{F^{-1}(v)}\, A_w^\top u + \mu_w,$$

where $F$ is the CDF for the $\chi_{\nu_w}$ distribution. This is convenient since the CDF of the $\chi$ distribution is more widely available than that of the F distribution.

## 7 Experiments

All the following experiments compare E-IWVI using Student-T distributions to IWVI using Gaussians. Regular VI is equivalent to IWVI with $M = 1$.

We consider experiments on three distributions. In the first two, a computable $\log p(x)$ enables estimation of the KL-divergence, and the computable true mean and variance of the posterior enable a precise evaluation of test-integral estimation. On these, we used a fixed set of $10{,}000 \cdot M$ random inputs to $T$ and optimized using batch L-BFGS, avoiding heuristic tuning of a learning-rate sequence.

A first experiment considered random Dirichlet distributions $p(\theta \mid \alpha)$ over the probability simplex in $K$ dimensions, $\theta \in \Delta_K$. Each parameter $\alpha_k$ is drawn i.i.d. from a Gamma distribution with a shape parameter of 10. Since this density is defined only over the probability simplex, we borrow from Stan the strategy of transforming to an unconstrained $z \in \mathbb{R}^{K-1}$ space via a stick-breaking process [23] (see the sketch after Fig. 3). To compute test integrals over variational distributions, the reverse transformation is used. Results are shown in Fig. 3.

*Figure 3: Random Dirichlets, averaged over 20 repetitions. Panels show the degrees of freedom $\nu$ chosen by E-IWVI (for $K \in \{3, 5, 10, 20, 50\}$), the estimated KL, and the covariance error, each as a function of $M$, for VI, IWVI, and E-IWVI. Top left shows an example posterior for $K = 3$. The test-integral error is $\|\mathbb{C}[\theta] - \hat{\mathbb{C}}[\theta]\|_F$, where $\hat{\mathbb{C}}$ is the empirical covariance of samples drawn from $q_M(z_1)$ and then transformed to $\Delta_K$. In all cases, IWVI is able to reduce the error of VI to negligible levels. E-IWVI provides an accuracy benefit in low dimensions but little when $K = 20$.*
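The stick-breaking sketch referenced above, mapping between $\Delta_K$ and $\mathbb{R}^{K-1}$: the $\log(K - k)$ centering offset follows our reading of Stan's simplex transform [23] and should be treated as an assumption; the forward and inverse maps below are at least mutually consistent.

```python
import numpy as np

def simplex_to_unconstrained(theta):
    """Stick-breaking map from a simplex point to R^{K-1} (Stan-style sketch)."""
    K = len(theta)
    remaining = 1.0
    y = np.empty(K - 1)
    for k in range(K - 1):
        frac = theta[k] / remaining            # fraction of the remaining stick
        y[k] = np.log(frac / (1 - frac)) + np.log(K - k - 1)
        remaining -= theta[k]
    return y

def unconstrained_to_simplex(y):
    """Inverse map: peel off one stick fraction per coordinate."""
    K = len(y) + 1
    theta = np.empty(K)
    remaining = 1.0
    for k in range(K - 1):
        frac = 1.0 / (1.0 + np.exp(-(y[k] - np.log(K - k - 1))))
        theta[k] = frac * remaining
        remaining -= theta[k]
    theta[-1] = remaining
    return theta

theta = np.array([0.1, 0.3, 0.4, 0.2])
y = simplex_to_unconstrained(theta)
print(np.allclose(unconstrained_to_simplex(y), theta))  # round-trip: True
```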
A second experiment uses Minka's clutter model [15]: $z \in \mathbb{R}^d$ is a hidden object location, and $x = (x_1, \ldots, x_n)$ is a set of $n$ noisy observations, with $p(z) = \mathcal{N}(z; 0, 100 I)$ and $p(x_i \mid z) = 0.25\, \mathcal{N}(x_i; z, I) + 0.75\, \mathcal{N}(x_i; 0, 10 I)$. The posterior $p(z \mid x)$ is a mixture of $2^n$ Gaussians, for which we can do exact inference for moderate $n$. Results are shown in Fig. 4.

*Figure 4: Clutter distributions, averaged over 50 repetitions. Panels show the $\nu$ chosen by E-IWVI, the estimated KL, and the moment error, each as a function of $M$, for $(d = 2, n = 15)$ and $(d = 10, n = 20)$. The error shown is the error in the estimated second moment $\mathbb{E}[z z^\top]$. IWVI reduces the errors of VI by orders of magnitude. E-IWVI provides a diminishing benefit in higher dimensions.*

Finally, we considered a (non-conjugate) logistic regression model with a Cauchy prior with a scale of 10, using stochastic gradient descent with various step sizes. On these higher-dimensional problems, we found that when the step size was perfectly tuned and optimization had many iterations, both methods performed similarly in terms of the IW-ELBO. E-IWVI never performed worse, and sometimes performed very slightly better. E-IWVI exhibited superior convergence behavior and was easier to tune, as illustrated in Fig. 5, where E-IWVI converges at least as well as IWVI for all step sizes. We suspect this is because when $w$ is far from optimal, both the IW-ELBO and the gradient variance are better with E-IWVI.

*Figure 5: Logistic regression comparing IWVI (red) and E-IWVI (blue) with $M \in \{1, 5, 20, 100\}$ and various step sizes. The IW-ELBO is shown after 2,000 (dashed lines) and 10,000 (solid) iterations. A larger $M$ consistently improves both methods. E-IWVI converges more reliably, particularly on higher-dimensional data. From top: Madelon ($d = 500$), Sonar ($d = 60$), Mushrooms ($d = 112$).*

## Acknowledgements

We thank Tom Rainforth for insightful comments regarding asymptotics and Theorem 3, and Linda Siew Li Tan for comments regarding Lemma 7. This material is based upon work supported by the National Science Foundation under Grant No. 1617533.

## References

[1] Felix V. Agakov and David Barber. An auxiliary variational method. In Neural Information Processing, Lecture Notes in Computer Science, pages 561–566. Springer, Berlin, Heidelberg, 2004.

[2] Philip Bachman and Doina Precup. Training deep generative models: Variations on a theme. In NIPS Workshop: Advances in Approximate Bayesian Inference, 2015.

[3] Robert Bamler, Cheng Zhang, Manfred Opper, and Stephan Mandt. Perturbative black box variational inference. In NIPS, 2017.

[4] Peter J. Bickel and Kjell A. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics, Volume I. CRC Press, 2015.

[5] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In ICLR, 2016.

[6] Chris Cremer, Quaid Morris, and David Duvenaud. Reinterpreting importance-weighted autoencoders. In ICLR Workshop, 2017.

[7] Adji Bousso Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David Blei. Variational inference via χ upper bound minimization. In NIPS, pages 2729–2738, 2017.
[8] Kaitai Fang, Samuel Kotz, and Kai Wang Ng. Symmetric Multivariate and Related Distributions. Number 36 in Monographs on Statistics and Applied Probability. Chapman and Hall, 1990.

[9] W. R. Gilks, A. Thomas, and D. J. Spiegelhalter. A language and program for complex Bayesian modelling. The Statistician, 43(1):169–177, 1994.

[10] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[11] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1–45, 2017.

[12] Tuan Anh Le, Maximilian Igl, Tom Rainforth, Tom Jin, and Frank Wood. Auto-encoding sequential Monte Carlo. In ICLR, 2018.

[13] Chris J. Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Whye Teh. Filtering variational objectives. In NIPS, pages 6576–6586, 2017.

[14] Józef Marcinkiewicz and Antoni Zygmund. Quelques théorèmes sur les fonctions indépendantes. Fundamenta Mathematicae, 29:60–90, 1937.

[15] Thomas Minka. Expectation propagation for approximate Bayesian inference. In UAI, 2001.

[16] Christian A. Naesseth, Scott W. Linderman, Rajesh Ranganath, and David M. Blei. Variational sequential Monte Carlo. In AISTATS, 2018.

[17] Art Owen. Monte Carlo Theory, Methods and Examples. 2013.

[18] Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. In ICML, 2018.

[19] Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black box variational inference. In AISTATS, 2014.

[20] Gabriel Romon. Bounds on moments of sample mean. https://math.stackexchange.com/questions/2901196/bounds-on-moments-of-sample-mean, 2018.

[21] Francisco J. R. Ruiz, Michalis K. Titsias, and David M. Blei. Overdispersed black-box variational inference. In UAI, 2016.

[22] L. K. Saul, T. Jaakkola, and M. I. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61–76, 1996.

[23] Stan Development Team. Modeling Language User's Guide and Reference Manual, Version 2.17.0, 2017.

[24] Tom Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.