# Hierarchical Importance Weighted Autoencoders

Chin-Wei Huang¹ ² Kris Sankaran¹ Eeshan Dhekane¹ Alexandre Lacoste² Aaron Courville¹ ³

¹Mila, University of Montreal ²Element AI ³CIFAR member. Correspondence to: Chin-Wei Huang.

## Abstract

Importance weighted variational inference (Burda et al., 2015) uses multiple i.i.d. samples to obtain a tighter variational lower bound. We believe a joint proposal has the potential to reduce the number of redundant samples, and introduce a hierarchical structure to induce correlation. The hope is that the proposals coordinate to make up for one another's errors, reducing the variance of the importance estimator. Theoretically, we analyze the conditions under which convergence of the estimator variance can be connected to convergence of the lower bound. Empirically, we confirm that maximizing the lower bound does implicitly minimize variance. Further analysis shows that this is a result of the negative correlation induced by the proposed hierarchical meta-sampling scheme, and that inference performance also improves as the number of samples increases.

## 1. Introduction

Recent advances in variational inference (Kingma & Welling, 2014; Rezende et al., 2014) make it efficient to model complex distributions using a latent variable model with an intractable marginal likelihood $p(x) = \int_z p(x|z)p(z)\,dz$, where $z$ is an unobserved vector distributed according to a prior, e.g. $p(z) = \mathcal{N}(0, I)$. The use of an inference network, or encoder, allows inference to be amortized by conditioning directly on the data, $q(z|x)$, to approximate the true posterior $p(z|x)$. This is known as the Variational Autoencoder (VAE).

Normally, learning is achieved by maximizing a lower bound on the marginal likelihood, since the latter is not tractable in general. Naturally, one would like to reduce the gap between the bound and the desired objective. Burda et al. (2015) devise a family of lower bounds with progressively smaller gaps using multiple i.i.d. samples from the variational distribution $q(z|x)$, which they call the Importance Weighted Autoencoder (IWAE). Cremer et al. (2017) and Bachman & Precup (2015) observe that IWAE can be interpreted as using a corrected variational distribution in the standard variational lower bound: the proposal is corrected towards the true posterior by the importance weighting, and approaches it as the number of samples increases.

Intuitively, when only one sample is drawn to estimate the variational lower bound, the loss function heavily penalizes the drawn sample, and thus the encoder. The decoder is then adjusted to maximize the likelihood in a biased manner, as it treats the sample as real, observed data. In the IWAE setup, the inference model is allowed to make mistakes, since a sample with a high loss is penalized less owing to its importance weight. Drawing multiple samples also allows us to represent the distribution at a higher resolution. This motivates us to construct a joint sampler such that the empirical distribution drawn from it better represents the posterior distribution.
More specifically, we consider a hierarchical sampler whose latent variable $z_0$ acts at a meta level to decide the salient points of the true posterior; we call the resulting model the Hierarchical Importance Weighted Autoencoder (H-IWAE). Doing so allows for (1) summarizing the distribution using the latent variable (Edwards & Storkey, 2017), and (2) counteracting the bias induced by each proposal. To analyze the latter effect, we look at the variance of the Monte Carlo estimate of the lower bound, and show that maximizing the lower bound implicitly reduces the variance. Our main contributions are as follows:

- We propose a hierarchical model to induce dependency among the samples for approximate inference, and derive a hierarchical importance weighted lower bound.
- We analyze the convergence of the variance of the Monte Carlo estimate of the lower bound, and draw a connection to the convergence of the bound itself.
- Empirically, we explore different weighting heuristics, validate the hypothesis that the joint samples tend to be negatively correlated, and perform experiments on a suite of standard density estimation tasks.

*Figure 1. Latent variable model (left) with hierarchical proposals as the inference model (right). Dashed lines indicate amortization.*

## 2. Background

In a typical latent variable model, we assume a joint density that factorizes as $p(x, z) = p(x|z)p(z)$, where $x$ and $z$ are the observed and unobserved random variables, respectively. Learning is achieved by maximizing the marginal log-likelihood of the data, $\log p(x) = \log \int_z p(x, z)\,dz$, which is in general intractable due to the integration, since a neural network is usually used to parameterize the likelihood function $p(x|z)$.

### 2.1. Variational Inference

One way to estimate the marginal likelihood is via the variational lower bound. Concretely, we introduce a variational distribution $q(z)$[^1] and maximize the bound on the right-hand side:

$$\log p(x) = \log \mathbb{E}_{q(z)}\left[\frac{p(x, z)}{q(z)}\right] \geq \mathbb{E}_{q(z)}\left[\log \frac{p(x, z)}{q(z)}\right] =: \mathcal{L}(q),$$

which is known as the evidence lower bound (ELBO). The tightness of the ELBO is characterized by the reverse Kullback-Leibler (KL) divergence $D_{\mathrm{KL}}(q(z)\,\|\,p(z|x))$, which equals 0 if and only if $q(z) = p(z|x)$.

[^1]: For notational convenience we omit the conditioning on $x$ for amortized inference.

### 2.2. Importance Weighted Autoencoder

Burda et al. (2015) noticed that the likelihood ratio in the ELBO, $p(x, z)/q(z)$, resembles the importance weight in importance sampling, and introduced a family of lower bounds called the Importance Weighted Lower Bound (IWLB):

$$\log p(x) = \log \mathbb{E}_{z_1, \dots, z_K \sim q(z)}\left[\frac{1}{K}\sum_{j=1}^{K} \frac{p(x, z_j)}{q(z_j)}\right] \geq \mathbb{E}\left[\log \frac{1}{K}\sum_{j=1}^{K} \frac{p(x, z_j)}{q(z_j)}\right] =: \mathcal{L}_K(q),$$

with $\mathcal{L}_K(q) = \mathcal{L}(q)$ when $K = 1$. In practice, the expectation over the product measure is estimated by sampling $z_j \sim q(z)$ for $j = 1, \dots, K$ and setting

$$\hat{\mathcal{L}}_K(q) = \log \frac{1}{K}\sum_{j=1}^{K} \frac{p(x, z_j)}{q(z_j)}.$$

One property of this family of bounds is monotonicity: $\mathcal{L}_M(q) \geq \mathcal{L}_N(q)$ for $M > N$. Strong consistency of $\hat{\mathcal{L}}_K(q)$ can also be proved since the $z_j$'s are independently and identically distributed under $q(z)$, in which case $\hat{\mathcal{L}}_K(q) \to \log p(x)$ almost surely as $K \to \infty$. This asymptotic property motivates the use of a large number of samples in practice, but there are two practical concerns. First, even though evaluating the $z_j$'s under the generative model $p(x, z)$ can be done in parallel (at a sublinear rate in general), memory scales as $O(K)$. An online algorithm can be adopted to reduce memory to $O(1)$, but at the cost of $O(K)$ sequential evaluation (Huang & Courville, 2017). Second, for any finite $K$ and any choice of proposal such that $q(z)$ is not proportional to $p(x, z)$,[^2] $\mathcal{L}_K(q)$ is strictly a lower bound on the marginal likelihood.

[^2]: This is a reasonable assumption given the finite approximating capacity of the chosen family of $q$ and the finiteness of the recognition model for amortization.
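To make the estimator concrete, the following minimal sketch (Python/NumPy; `log_p_joint` and `log_q` are hypothetical callables standing in for the generative-model and proposal log densities) computes one draw of $\hat{\mathcal{L}}_K(q)$ as a numerically stable log-mean-exp of the log importance weights:

```python
import numpy as np

def iwlb_estimate(log_p_joint, log_q, z_samples):
    """One Monte Carlo draw of the K-sample IWLB:
    log (1/K) * sum_j p(x, z_j) / q(z_j), computed in log space."""
    log_w = np.array([log_p_joint(z) - log_q(z) for z in z_samples])
    m = log_w.max()  # log-sum-exp shift for numerical stability
    return m + np.log(np.exp(log_w - m).sum()) - np.log(len(log_w))
```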
This motivates research into improving the surrogate loss functions (such as $\mathcal{L}_K(q)$) for the intractable $\log p(x)$. Nowozin (2018), for instance, interprets $\hat{\mathcal{L}}_K(q)$ as a biased estimator of $\log p(x)$, and introduces a family of estimators with reduced bias. The bias of the new estimator can be shown to be reduced to $O(K^{-(m+1)})$ for $m < K$ (compared to $O(K^{-1})$ for $\hat{\mathcal{L}}_K(q)$). However, the new estimator can have high variance and is no longer a lower bound.

Alternatively, one can look at the dispersion[^3] of the likelihood ratio $w_j = \frac{p(x, z_j)}{q(z_j)}$, which is itself a random variable. The gap between $\log p(x)$ and $\mathcal{L}_K(q)$ can be explained by Jensen's inequality and the fact that $\log$ is a strictly concave function: $\log \mathbb{E}[w] \geq \mathbb{E}[\log w]$, where $w$ is a positive random variable. Equality holds if and only if $w$ is almost surely a constant. This is approximately true when $w$ is taken to be the likelihood ratio $\frac{p(x, z)}{q(z)}$ and $q(z) \approx p(z|x)$. Instead, if we take $\bar{w} = \frac{1}{K}\sum_{j=1}^{K} w_j$, we see a direct reduction in variance if the $w_j$'s are all uncorrelated and identically distributed: $\mathrm{Var}(\bar{w}) = \frac{1}{K}\mathrm{Var}(w_1)$. Intuitively, the distribution of $\bar{w}$ becomes sharper with larger $K$, resulting in a smaller gap when Jensen's inequality is applied. This idea was also noticed by Domke & Sheldon (2018) and explored by Klys et al. (2018).

[^3]: By dispersion, we mean how spread out or stretched the distribution of the random variable is. Common statistical dispersion indices include variance, mean absolute difference, entropy, etc.

*Figure 2. Learned hierarchical importance sampling proposals with uniform weighting (2nd row) and the balance heuristic (3rd row). From left to right: the target distribution, a heat map of samples drawn from a sampling importance resampling procedure, and the proposals. The color scheme corresponds to the norm of $z_0$, indicating the correlation among the samples. A proposal with a Markov transition is shown in the 1st row for comparison: the dependency among its samples is visually much weaker.*

However, a clear connection between how well $\log p(x)$ is approximated and the minimization of the variance of $w$ and/or the variance of $\log w$ is lacking. We defer this discussion to Section 4, where we also provide a theoretical analysis.

### 2.3. Variance Reduction via Negative Correlation

Let $\bar{w} = \sum_{i=1}^{K} \pi_i w_i$, where the $w_i$'s are random variables and the $\pi_i$'s sum to one. The variance of $\bar{w}$ can be written as

$$\mathrm{Var}(\bar{w}) = \sum_{i=1}^{K} \pi_i^2\,\mathrm{Var}(w_i) + 2\sum_{i<j} \pi_i \pi_j\,\mathrm{Cov}(w_i, w_j),$$

so if the $w_i$'s are negatively correlated, the covariance terms are negative and the variance of $\bar{w}$ drops below that of the uncorrelated case.

## 3. Hierarchical Importance Weighted Autoencoders

### 3.1. Joint Importance Sampling

Generalizing the IWLB, we consider a joint proposal $Q(z_1, \dots, z_K)$ with marginals $q_j$ and a partition of unity $\pi_j$ (non-negative weights summing to one), and define the joint importance weighted lower bound (J-IWLB)

$$\mathcal{L}_K(Q) := \mathbb{E}_{Q}\left[\log \sum_{j=1}^{K} \pi_j(z_j)\,\frac{p(x, z_j)}{q_j(z_j)}\right].$$

This construction has two appealing features: (1) the marginals $q_j$ need not be identical, and (2) the samples $z_j$ may be correlated. Point 1 allows us to relax the necessary condition for optimality when $K = 1$, i.e. $q \propto p$. For example, let $\pi_j(z)$ be the posterior probability of the random index $j$, $\pi_j(z) = \frac{q_j(z)}{\sum_i q_i(z)}$. Then $\mathcal{L}_K(Q)$ is equivalent to using the mixture proposal $\frac{1}{K}\sum_{i=1}^{K} q_i$. For the proposals to be optimal, it is sufficient that $p$ can be decomposed as a mixture density (to wit, the $q_j$'s do not all need to be equal to $p$). Point 2 allows us to leverage the correlation among the $z_j$'s to further reduce the variance: with $z_1$ making a positive deviation from the mean, $\frac{p(x, z_1)}{q_1(z_1)} > p(x)$, one would hope $z_2$ has a higher chance of making a negative deviation to cancel the error. This can be thought of as a soft version of antithetic variates. One difficulty in optimizing $\mathcal{L}_K(Q)$ lies in estimating the marginal density $q_j$.
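The covariance term is exactly what classical antithetic variates exploit. A small NumPy illustration (a toy integrand $w = e^U$ with $U \sim \mathrm{Uniform}(0, 1)$ of our own choosing, not the actual likelihood ratio) shows the variance of the two-sample average collapsing once the pair is made negatively correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two i.i.d. draws: the variance of the average is Var(w) / 2.
u = rng.uniform(size=(n, 2))
iid_avg = np.exp(u).mean(axis=1)

# Antithetic pair (u, 1 - u): Cov(e^u, e^(1-u)) < 0, so the average
# of the pair has far lower variance than the i.i.d. average.
u1 = rng.uniform(size=n)
anti_avg = (np.exp(u1) + np.exp(1.0 - u1)) / 2.0

print(iid_avg.var())   # ~0.12
print(anti_avg.var())  # ~0.004
```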
Particular choices of $Q$, such as a multivariate normal distribution, allow for exact evaluation, since each $q_j$ is still normally distributed; this was explored by Klys et al. (2018). Another option is to define $K - 1$ invertible transformations $T_j(\cdot)$ for $j = 2, \dots, K$, and apply the mapping $z_j = T_j(z_{j-1})$, where $z_1 \sim q_1(z)$. The density of $z_j$ under $q_j$ can then be evaluated via the change-of-variable formula; this is known as normalizing flows (Rezende & Mohamed, 2015). Other, more general forms of the joint $Q$, however, do not have tractable marginal densities, and thus require further approximation.

### 3.2. Hierarchical Importance Sampling

To induce correlation among the $z_j$'s, we consider a hierarchical proposal

$$Q(z_1, \dots, z_K) = \int_{z_0} Q(z_1, \dots, z_K \mid z_0)\,dq_0(z_0) = \int_{z_0} \prod_{j=1}^{K} q_j(z_j \mid z_0)\,dq_0(z_0),$$

with the conditional independence assumption $z_i \perp z_j \mid z_0$ for $i \neq j$ and $1 \leq i, j \leq K$ (see Figure 1). First, notice that the marginals are not identical, since each $z_j$ is sampled from a different conditional $q_j(z_j|z_0)$. More specifically, each marginal is a latent variable model, which can be thought of as an infinite mixture (Ranganath et al., 2016): $q(z_j) = \int_{z_0} q_j(z_j|z_0)\,dq_0(z_0)$. Second, $z_1, \dots, z_K$ are entangled through sharing the same common random number $z_0$: the smaller the conditional entropy $H(z_j|z_0)$ is, the more mutually dependent the $z_j$'s are. We analyze the effect of this common random number in Section 6.1. While there are different ways to model the joint proposal, we emphasize the following properties that make the hierarchical proposal an appealing choice:

1. Sampling can be parallelized.
2. Empirically, we found that optimization behaves better than for a sequential model (we also tested a Markov joint proposal, i.e. $z_j$ depends only on $z_{j-1}$, and found the learned proposals do not compensate for each other; see the top row of Figure 2).
3. $z_0$ can be interpreted as a summary (Edwards & Storkey, 2017) of the empirical distribution.

**Optimization Objective.** In the spirit of the variational formalism, we need to define an objective to optimize $Q$. However, the marginal densities are no longer tractable, due to the integration over the prior measure $q_0$. We introduce an auxiliary random variable (Agakov & Barber, 2004) with density $r(z_0|z_j)$ to approximate $q_j(z_0|z_j)$,[^4] and maximize the following objective:

$$\mathcal{L}_K(Q_0) := \mathbb{E}_{Q_0}\left[\log \sum_{j=1}^{K} \pi_j(z_j, z_0)\,\frac{p(x, z_j)\,r(z_0|z_j)}{q_j(z_j|z_0)\,q_0(z_0)}\right],$$

where $Q_0$ is the joint distribution of $z_0, z_1, \dots, z_K$, and $\pi_j(z, z_0)$ is a partition of unity for all $z, z_0$. First, if we choose $\pi_j = \frac{1}{K}$ and let $r$ depend on $j$, $\mathcal{L}_K(Q_0)$ reduces to the J-IWLB $\mathcal{L}_K(Q)$ if $r_j(z_0|z_j) = q_j(z_0|z_j)$. Second, with $K = 1$, it is simply variational inference with a hierarchical model (Ranganath et al., 2016). Lastly, $\mathcal{L}_K(Q_0)$ is a lower bound on $\log p(x)$ (we relegate the proof to Appendix B). We call it the Hierarchical Importance Weighted Lower Bound (H-IWLB). This training criterion is also chosen because the inference model and generative model can then share a unified objective. It is possible to consider training the inference model with different objectives; for instance, direct minimization of the variance of the importance ratio amounts to minimization of the $\chi^2$-divergence (see Dieng et al. (2017); Müller et al. (2018)).[^5] We discuss the connection between vanishing variance and convergence of the log estimator in Section 4.

[^4]: $q_j(z_0|z_j)$ is the posterior of $z_0$ given $z_j$; that is, $q_j(z_0|z_j) \propto q_j(z_j|z_0)\,q_0(z_0)$.

[^5]: We also tried minimizing the variance of the estimator (the average of the importance ratios), using an estimate of the first moment, but we found that even the reparameterized gradient suffers from a noisy signal and usually converges to a suboptimal solution.
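As a concrete (and deliberately simplified) sketch of one Monte Carlo draw of the H-IWLB: all of $q_0$, $q_j$, and the auxiliary $r$ are 1-D Gaussians, $r(z_0|z_j)$ is arbitrarily fixed to $\mathcal{N}(z_j, 1)$ rather than learned, the weights are uniform $\pi_j = 1/K$, and `log_p_joint`, `mu`, `sigma` are hypothetical inputs (the target log density and the per-proposal offsets and scales):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def h_iwlb_draw(log_p_joint, mu, sigma):
    """One Monte Carlo draw of the H-IWLB with uniform weights pi_j = 1/K:
    log sum_j (1/K) * p(x, z_j) r(z0|z_j) / (q_j(z_j|z0) q0(z0))."""
    K = len(mu)
    z0 = rng.normal()                        # shared meta-sample z0 ~ q0 = N(0, 1)
    log_q0 = norm.logpdf(z0)
    loc = mu + z0                            # q_j(z|z0) = N(mu_j + z0, sigma_j)
    z = rng.normal(loc, sigma)               # z_j ~ q_j(.|z0), drawn in parallel
    log_qj = norm.logpdf(z, loc, sigma)
    log_r = norm.logpdf(z0, z, 1.0)          # auxiliary r(z0|z_j), fixed to N(z_j, 1)
    log_p = np.array([log_p_joint(zj) for zj in z])
    log_w = -np.log(K) + log_p + log_r - log_qj - log_q0
    m = log_w.max()                          # log-sum-exp for stability
    return m + np.log(np.exp(log_w - m).sum())
```

In the paper's setting these conditionals are parameterized by neural networks and conditioned on $x$; the sketch only mirrors the structure of the bound, e.g. `h_iwlb_draw(lambda z: norm.logpdf(z, 2.0, 0.5), np.array([-1.0, 0.0, 1.0]), np.ones(3))`.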
**Weighting Heuristics.** A family of weighting heuristics commonly used in practice are the power heuristics:

$$\pi_j^{\alpha}(z, z_0) = \frac{q_j(z|z_0)^{\alpha}}{\sum_{i=1}^{K} q_i(z|z_0)^{\alpha}}.$$

With larger $\alpha$, the heuristic puts more emphasis on the proposal under which $z$ has a higher likelihood. When $\alpha = 0$ and $\alpha = 1$, $\pi_j$ reduces to the uniform probability (an arithmetic average of the likelihood ratios) and the posterior probability of the index $j$, respectively. We compare different choices of weighting heuristics qualitatively on a toy example in Section 6.2, and quantitatively on a suite of standard density modeling tasks with latent variable models in Section 6.3.

**Training Procedure.** In our experiments, all $q_j$ and $q_0$ are conditional Gaussian distributions (also conditioned on $x$ in the amortized inference setting). This allows us to use the reparameterized gradient (Kingma & Welling, 2014; Rezende et al., 2014). However, as noted by Rainforth et al. (2018), the signal-to-noise ratio of the gradient estimator for IWAE's encoder decreases as $K$ increases, so a large $K$ would render learning of the inference model inefficient. We therefore adopt the recently proposed doubly reparameterized gradient of Tucker et al. (2018); empirically, we find the algorithm converges more stably with it.

## 4. Connecting Variance and Lower Bound

Ideally, using a joint proposal allows us to model the dependency among the $z_j$'s, and thus the negative correlation among the likelihood ratios, but we did not choose to optimize (minimize) the variance directly; instead, we maximize the lower bound. We are interested in how the two are connected, i.e. whether convergence of one implies the other. The main takeaway of this section is the following: convergence in one mode implies convergence in the other if the values of $w_j$ and $\log w_j$ cannot be extreme (we will assume the two are bounded in some sense). This is in general not true in importance sampling, where the likelihood ratio is sensitive to the choice of proposal distribution. It suggests the density $q$ needs to be controlled in some way (e.g. smoothing the standard deviation of a Gaussian to avoid over-confidence).

We work in a probability space $(\Omega, \mathcal{F}, P)$. We write $w \in L^p$ for $p \geq 1$ if $w$ is a random variable and $\mathbb{E}[|w|^p] < \infty$. A random sequence $\{w_n\}$ converges in $L^p$ to $w$ if $\mathbb{E}[|w_n - w|^p] \to 0$ as $n \to \infty$. Convergence in probability and in $L^p$ are denoted by $\xrightarrow{P}$ and $\xrightarrow{L^p}$, respectively.

Let $\{w_n\}$ be a sequence of random variables such that $\mathbb{E}[w_n] = p(x)$, and thus $\mathbb{E}[\log w_n] \leq \log p(x)$. Here $w_n$ can be thought of as the likelihood ratio $p(x, z)/q(z)$ with $z \sim q(z)$, whose expectation with respect to $z$ is exactly $p(x)$.[^6] We would like to know whether convergence of the lower bound $\mathbb{E}[\log w_n]$ to $\log p(x)$ (e.g. via maximization of the H-IWLB) implies vanishing variance under some assumption, which would justify the use of a joint/hierarchical proposal to reduce variance at a potentially faster rate. However, it is in general untrue that smaller $\mathrm{Var}(w_n)$, smaller $\mathrm{Var}(\log w_n)$, and larger $\mathbb{E}[\log w_n]$ imply one another.[^7] Instead, we discuss their limiting behavior, and conclude that the condition "$\mathrm{Var}(\log w_n)$ converges to zero" sits somewhere between $L^1$-convergence and $L^2$-convergence of $\log w_n$. To do so, we require a bit more control over the sequences $w_n$ and $\log w_n$, such as boundedness.

[^6]: The index $n$ can be thought of as indexing a sequence of parameterizations of the joint $Q_0$.

[^7]: Klys et al. (2018) also attempt to derive a bound, but their result does not hold without further assumptions. Maddison et al. (2017) also require uniform integrability to establish consistency.
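As a toy illustration of this limiting behavior (our own example, not the paper's estimator): take $p = \mathcal{N}(0, 1)$ and proposals $q_n = \mathcal{N}(1/n, 1)$, so $w_n = p(z)/q_n(z)$ with $z \sim q_n$ and $\mathbb{E}[w_n] = 1$ (i.e. $\log c = 0$). Here the bias $-\mathbb{E}[\log w_n] = D_{\mathrm{KL}}(q_n\|p) = 1/2n^2$ and the variance $\mathrm{Var}(\log w_n) = 1/n^2$ vanish together as $q_n \to p$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p = norm(0.0, 1.0)

for n in [1, 2, 4, 8, 16]:
    q = norm(1.0 / n, 1.0)                   # proposal sequence q_n -> p
    z = q.rvs(size=200_000, random_state=rng)
    log_w = p.logpdf(z) - q.logpdf(z)        # log w_n, with E[w_n] = 1
    # Bias -E[log w_n] and Var(log w_n) shrink together as q_n approaches p.
    print(n, -log_w.mean(), log_w.var())
```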
The first implication of this conclusion is that if the variance of $\log w_n$ vanishes and the sequences are bounded in some sense, then $\mathbb{E}[\log w_n]$ also converges to $\log p(x)$ (see below). The second implication is that if $\log w_n$ converges to $\log p(x)$ more uniformly ($L^2$-convergence), then the variance of $\log w_n$ also converges to zero. We discuss why this boundedness control is required at the end of this section, from the f-divergence perspective.

We now turn to the analysis of the variance of $\log w_n$. Doing so allows us to interpret $\log w_n$ as an estimator for $\log p(x)$, and we would like to answer the following question: if the variance of $\log w_n$ converges to zero, does $\mathbb{E}[\log w_n]$ converge to $\log p(x)$? The answer is positive given some boundedness condition, and the result suggests that if the variance of the estimator is infinitesimally small, so is the bias. We now state the result:

**Proposition 1.** *Assume $w_n \in L^1$ with $w_n > 0$, $\log w_n \in L^2$, and $\mathbb{E}[w_n] = c$ for some $c > 0$, for all $n \geq 1$. If $\{w_n\}$ is uniformly integrable and $\{\log w_n\}$ is bounded in $L^1$, then*

$$\lim_{n \to \infty} \mathrm{Var}(\log w_n) = 0 \implies \log w_n \xrightarrow{P} \log c.$$

The proposition tells us that if we view $\log w_n$ as an estimator for some constant $\log c$ (e.g. $\log p(x)$) with $\mathbb{E}[w_n] = c$, then vanishing variance of $\log w_n$ implies consistency. With the same integrability condition on $\log w_n$, we can conclude that $\mathbb{E}[\log w_n]$ converges to $\log p(x)$.

**Corollary 1.** *With the same conditions as Proposition 1, if we further assume $\{\log w_n\}$ is uniformly integrable, then*

$$\lim_{n \to \infty} \mathrm{Var}(\log w_n) = 0 \implies \log w_n \xrightarrow{L^1} \log c.$$

*In particular, $\mathbb{E}[\log w_n] \to \log p(x)$ as $n \to \infty$.*

Finally, the following result shows that $L^2$-convergence is sufficient for the variance of $\log w_n$ to converge to zero.

**Proposition 2.** *With the same conditions as Corollary 1,*

$$\log w_n \xrightarrow{L^2} \log c \implies \lim_{n \to \infty} \mathrm{Var}(\log w_n) = 0.$$

*Figure 3. H-IWLB trained with (dependent) and without (independent) a common random number. The hierarchical proposal trained with the common factor learns to avoid redundant sampling and effectively reduces dispersion (lower variance and standard deviation).*

*Figure 4. Correlation matrix of the $w_j$'s: (a) uncorrelated $w_j$; (b) trained with independent $z_0$; (c) trained with common $z_0$. When trained with a common $z_0$ (c), the proposals coordinate to make up for the bias made by one another, resulting in negative correlation.*

Now we turn back to the case of variational inference, where $w = p/q$ (assume $p$ is normalized for ease of exposition), and discuss the difficulty in analyzing the relationship between the variance of $w$ and the lower bound $\mathbb{E}[\log w]$. The difficulty is that the variance is more sensitive to large likelihood ratios, whereas low-density regions under $q$ do not take a heavy toll on the lower bound. More concretely, since $\mathbb{E}_q[w] = 1$,

$$\mathrm{Var}(w) = \mathbb{E}_q[(p/q)^2] - 1 = D_{\chi^2}(p\,\|\,q),$$

where $D_{\chi^2}$ is the $\chi^2$-divergence. Our insight is that the $\chi^2$-divergence is an upper bound on the forward KL divergence $D_{\mathrm{KL}}(p\,\|\,q)$ (see the appendix for a proof). This means that when the variance of $w$ decreases, the $\chi^2$-divergence and the forward KL divergence also decrease. However, a decrease in the forward KL does not imply a decrease in the reverse KL, $D_{\mathrm{KL}}(q\,\|\,p)$, whose minimization is what maximization of the lower bound $\mathbb{E}_q[\log w]$ is equivalent to.
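This identity and the ordering it implies are easy to check numerically. The sketch below (two arbitrary 1-D Gaussians of our own choosing, with $q$ wider than $p$ so the ratio is well behaved) verifies $\mathrm{Var}_q(w) = D_{\chi^2}(p\|q)$ and $D_{\mathrm{KL}}(p\|q) \leq D_{\chi^2}(p\|q)$ by Monte Carlo:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p, q = norm(0.0, 1.0), norm(0.5, 1.5)        # q wider than p, so w = p/q is bounded

zq = q.rvs(size=500_000, random_state=rng)
w = np.exp(p.logpdf(zq) - q.logpdf(zq))      # w = p/q with z ~ q; E_q[w] = 1
chi2 = (w ** 2).mean() - 1.0                 # D_chi2(p||q) = E_q[w^2] - 1
print(w.var(), chi2)                         # nearly identical, as the identity predicts

zp = p.rvs(size=500_000, random_state=rng)
kl = (p.logpdf(zp) - q.logpdf(zp)).mean()    # forward KL: D_KL(p||q) = E_p[log p/q]
print(kl, chi2)                              # KL sits below the chi-square divergence
```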
This can be explained by the convex function $f$ that defines each member of the f-divergence family, $D_f(p\,\|\,q) := \int f\!\left(\frac{p(z)}{q(z)}\right) q(z)\,dz$, where $f$ is convex with $f(1) = 0$. For the forward KL and reverse KL, the corresponding convex functions are $f_F(w) = w \log w$ and $f_R(w) = -\log w$, respectively. As $w \to 0$, $f_F(w) \to 0$ while $f_R(w) \to \infty$; as $w \to \infty$, $f_F(w) \to \infty$ while $f_R(w) \to -\infty$. This means the forward KL does not reflect regions where $p \ll q$, but is high where $p \gg q$; the reverse KL, on the other hand, explodes where $p \ll q$ and can tolerate $p \gg q$. These are known as the inclusive and exclusive properties of the two KLs (Minka et al., 2005). The $\chi^2$-divergence is defined by the convex function $f_\chi(w) = w^2 - 1$, which asymptotically dominates $f_F$: $\lim_{w \to \infty} f_\chi(w)/f_F(w) = \infty$. This means it incurs an even higher cost than the forward KL where $p \gg q$; see Figure 6 in the appendix for an illustration. The boundedness assumptions on $\log w_n$ and $w_n$ made in this section ensure that $w_n$ is unlikely to be arbitrarily large or arbitrarily close to zero.

*Table 1. Variational autoencoders trained on the statically binarized MNIST dataset (Larochelle & Murray, 2011). Each hyperparameter setting is run 5 times to compute the mean and standard deviation (uncertainties are reported in Appendix E for readability). L stands for the lower bound on the log likelihood of the dataset (training, validation and test). NLL is the negative log likelihood estimated with 2000 samples.*

| mode | IWAE | H-IWAE | IWAE | H-IWAE | H-IWAE | H-IWAE | IWAE | H-IWAE | H-IWAE | H-IWAE | IWAE | H-IWAE | H-IWAE | H-IWAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| α | - | 0 | - | 0 | 1 | 3 | - | 0 | 1 | 3 | - | 0 | 1 | 3 |
| K | 1 | 1 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 10 | 10 | 10 | 10 |
| Ltr | 83.26 | 82.92 | 82.36 | 82.19 | 82.15 | 82.43 | 81.48 | 82.25 | 81.28 | 81.32 | 80.85 | 82.30 | 80.89 | 81.17 |
| Lva | 86.57 | 85.75 | 85.40 | 85.05 | 85.03 | 85.18 | 84.45 | 85.05 | 84.04 | 84.12 | 83.84 | 85.11 | 83.77 | 83.95 |
| Lte | 86.36 | 85.50 | 85.16 | 84.79 | 84.76 | 84.90 | 84.25 | 84.85 | 83.79 | 83.90 | 83.62 | 84.86 | 83.56 | 83.72 |
| NLLva | 82.64 | 82.24 | 82.03 | 81.93 | 81.88 | 81.96 | 81.63 | 82.19 | 81.42 | 81.39 | 81.37 | 82.42 | 81.28 | 81.33 |
| NLLte | 82.37 | 81.96 | 81.77 | 81.65 | 81.60 | 81.66 | 81.37 | 81.93 | 81.16 | 81.13 | 81.13 | 82.17 | 81.04 | 81.08 |

*Table 2. Variational autoencoders trained on the dynamically binarized OMNIGLOT dataset (Lake et al., 2015). Each hyperparameter setting is run 5 times to compute the mean and standard deviation (uncertainties are reported in Appendix E for readability).*

| mode | IWAE | H-IWAE | IWAE | H-IWAE | H-IWAE | H-IWAE | IWAE | H-IWAE | H-IWAE | H-IWAE | IWAE | H-IWAE | H-IWAE | H-IWAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| α | - | 0 | - | 0 | 1 | 3 | - | 0 | 1 | 3 | - | 0 | 1 | 3 |
| K | 1 | 1 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 10 | 10 | 10 | 10 |
| Ltr | 106.40 | 104.57 | 102.66 | 103.24 | 102.88 | 103.17 | 102.01 | 101.35 | 101.37 | 102.78 | 99.71 | 99.75 | 100.81 | 101.29 |
| Lva | 109.05 | 106.48 | 105.85 | 105.27 | 105.18 | 105.56 | 104.48 | 103.76 | 103.72 | 104.70 | 102.67 | 103.21 | 103.49 | 103.66 |
| Lte | 109.90 | 107.36 | 106.70 | 106.12 | 105.93 | 106.34 | 105.37 | 104.68 | 104.62 | 105.56 | 103.53 | 104.11 | 104.29 | 104.45 |
| NLLva | 102.98 | 100.36 | 100.14 | 99.82 | 99.68 | 99.87 | 99.43 | 98.75 | 98.85 | 99.38 | 97.97 | 98.48 | 98.58 | 98.78 |
| NLLte | 103.28 | 100.79 | 100.48 | 100.16 | 100.00 | 100.18 | 99.90 | 99.23 | 99.21 | 99.80 | 98.48 | 98.80 | 99.04 | 99.22 |

## 5. Related Work

Much work has been done on improving the variational approximation, e.g. by using a more complex form of $q$ (Rezende & Mohamed, 2015; Kingma et al., 2016; Huang et al., 2018; Ranganath et al., 2016; Maaløe et al., 2016; Miller et al., 2016) in place of the conventional normal distribution. Alongside the work of Burda et al. (2015), Mnih & Rezende (2016) apply importance sampling to learning discrete latent variables, and Bornschein & Bengio (2014) and Le et al. (2018) use it to modify the wake-sleep algorithm. Cremer et al.
(2017) and Bachman & Precup (2015) interpret IWAE as using a corrected proposal, whereas Nowozin (2018) interprets the IWLB as a biased estimator of the marginal log likelihood and proposes to reduce the bias using the jackknife method. However, Rainforth et al. (2018) observe that the signal-to-noise ratio of the inference model's gradient vanishes as $K$ increases, as the magnitude of the expected gradient decays faster than its variance; Tucker et al. (2018) propose a doubly reparameterized gradient to mitigate this problem. Closest to our work is that of Klys et al. (2018), who explore a multivariate normal distribution as the joint proposal. Domke & Sheldon (2018) integrate defensive sampling into variational inference, and Wu et al. (2018) propose a differentiable antithetic sampler. Yin & Zhou (2018), Molchanov et al. (2018), and Sobolev & Vetrov (2018) propose to use multiple samples to better estimate the marginal distribution of a hierarchical proposal.

## 6. Experiments

We first demonstrate the effect of sharing a common random number on the dependency among the multiple proposals, and then apply the amortized version of hierarchical importance sampling (that is, H-IWAE) to learning deep latent Gaussian models. Details of how the inference model $Q_0(\cdot|x)$ is parameterized can be found in Appendix D.

### 6.1. Effect of Common Random Number

In this section, we analyze the effect of training with a common random number, $z_0$. As mentioned in Section 3, the correlation among the $z_j$'s is induced by tying $z_0$ as a common factor. If, for each index $j$, we draw $z_0$ independently from $q_0$, the $z_j$'s are rendered independent. This lets us test whether inference models trained with a common random number tend to output correlated $z_j$'s. Let the target in Figure 2 be the (unnormalized) posterior distribution $p(z)$, and let $w_j = \frac{p(z_j)\,r_j(z_0|z_j)}{q_j(z_j|z_0)\,q_0(z_0)}$ and $\bar{w} = \frac{1}{K}\sum_j w_j$. We maximize the H-IWLB in two settings: with and without a common $z_0$. We repeat the experiment 25 times with different random seeds, and record the corresponding H-IWLB, the variance of $\log \bar{w}$, and the variance and standard deviation of $\bar{w}$ with common $z_0$, as plotted in Figure 3. Qualitatively, we visualize the correlation matrix of the $w_j$'s in Figure 4: the $w_j$'s are indeed negatively correlated with each other when trained with a common $z_0$. When trained with an independent $z_0$ for each $z_j$, the hierarchical sampler tends to find similar solutions and is prone to incurring positively correlated biases (deviations from the mean).
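The correlation analysis of Figure 4 can be reproduced with a few lines. In this sketch, `sample_weights` is a placeholder for a function that performs one joint draw $(z_0, z_1, \dots, z_K)$ from the trained sampler and returns the $K$ importance weights:

```python
import numpy as np

def weight_correlation(sample_weights, n_draws=10_000):
    """Estimate the K x K correlation matrix of (w_1, ..., w_K) across
    repeated joint draws of (z0, z1, ..., zK).  `sample_weights` is a
    hypothetical callable returning the length-K weight vector for one
    draw from the (trained) hierarchical proposal."""
    W = np.stack([sample_weights() for _ in range(n_draws)])  # shape [n_draws, K]
    return np.corrcoef(W, rowvar=False)
```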
*Table 3. Variational autoencoders trained on the Caltech101 Silhouettes dataset (Marlin et al., 2010). Each hyperparameter setting is run 3 times to compute the statistics (µ:σ). (\*) indicates the encoder is updated twice on the same minibatch before the decoder is updated once.*

| model | α | K | Ltr | Lva | Lte | NLLva | NLLte |
|---|---|---|---|---|---|---|---|
| IWAE | - | 1 | 123.74:1.56 | 128.87:1.47 | 129.71:1.47 | 117.88:1.50 | 118.61:1.44 |
| IWAE | - | 2 | 118.67:2.16 | 124.78:2.20 | 125.56:2.12 | 114.24:2.10 | 114.79:2.11 |
| IWAE | - | 5 | 112.50:1.16 | 120.12:0.43 | 120.90:0.40 | 109.87:0.68 | 110.49:0.64 |
| IWAE (h) | - | 1 | 113.54:3.06 | 121.20:1.13 | 121.81:1.23 | 109.28:1.38 | 109.71:1.35 |
| IWAE (h) | - | 2 | 111.94:2.46 | 118.59:1.24 | 119.51:1.13 | 107.52:1.36 | 108.18:1.38 |
| IWAE (h) | - | 5 | 108.20:0.58 | 115.58:1.08 | 116.40:1.11 | 105.16:0.77 | 105.70:0.78 |
| H-IWAE | 0 | 2 | 108.69:1.14 | 118.35:0.69 | 119.18:1.01 | 106.93:0.59 | 107.51:0.74 |
| H-IWAE | 0 | 5 | 108.97:1.15 | 116.84:0.92 | 117.55:0.79 | 106.41:0.88 | 106.83:0.63 |
| H-IWAE | 1 | 2 | 111.57:0.88 | 118.03:0.74 | 118.78:0.71 | 107.01:0.79 | 107.62:0.71 |
| H-IWAE | 1 | 5 | 108.96:0.32 | 116.27:0.39 | 117.05:0.15 | 106.03:0.21 | 106.62:0.09 |
| H-IWAE | 3 | 2 | 111.02:0.72 | 118.02:0.63 | 119.05:0.45 | 107.21:0.46 | 107.90:0.35 |
| H-IWAE | 3 | 5 | 109.66:1.07 | 116.96:0.66 | 117.80:0.73 | 106.41:0.35 | 107.13:0.41 |
| (\*) H-IWAE | 1 | 2 | 108.31:1.68 | 115.31:1.08 | 116.07:0.94 | 104.27:0.93 | 104.88:0.87 |
| (\*) H-IWAE | 1 | 5 | 108.03:1.18 | 113.59:0.70 | 114.07:0.75 | 103.74:0.59 | 104.16:0.64 |

### 6.2. Effect of Weighting Heuristics

We also analyze the effect of using different weighting heuristics. We repeat the experiment of the previous subsection and apply the power heuristic with α = 0 and α = 1 as the weighting scheme, both with a common $z_0$. We find that the weighting function $\pi_j$ has a major effect on the shape of the learned proposals $q_j$ (see Figure 2). With α = 1, the proposals behave more like a mixture, and the negative correlation is stronger as the proposals specialize in different regions. We also explore training the weighting function in Appendix E.

### 6.3. Variational Autoencoder

Our final experiment applies hierarchical proposals to learning variational autoencoders on a set of standard datasets, including binarized MNIST (Larochelle & Murray, 2011), binarized OMNIGLOT (Lake et al., 2015) and Caltech101 Silhouettes (Marlin et al., 2010), using the same architecture as described in Huang et al. (2018). Results on binarized MNIST and OMNIGLOT are in Tables 1 and 2, respectively. We compare with a Gaussian IWAE as the baseline. The hyperparameters are fixed as follows: minibatch size 64; learning rate $5 \times 10^{-5}$; a linear annealing schedule over 50,000 iterations for the log density terms except $p(x|z)$ (i.e. the KL between $q(z|x)$ and $p(z)$ for the VAE); and Polyak averaging with exponential averaging coefficient 0.998. For the MNIST dataset with α = 0, arithmetic averaging does not provide a monotonic improvement in negative log-likelihood (NLL) as $K$ increases, whereas with α = 1 or 3 the performance improves with larger $K$. For the OMNIGLOT dataset, the performance is consistently better with larger $K$.

Table 3 summarizes the experiment on the Caltech101 dataset. We compare with the Gaussian IWAE and an IWAE with one single hierarchical proposal, dubbed IWAE (h), and perform a grid search over a set of hyperparameters (Appendix D), each repeated three times to compute the mean and standard deviation. We report the performance of the models by selecting the hyperparameters with the lowest average NLL on the validation set. With larger $K$, H-IWAE improves, but it is outperformed by the IWAE with a single hierarchical proposal. We speculate this is due to the increased difficulty of optimizing the inference model with the extra parameters for each $q_j$, resulting in a larger amortization bias in inference (Cremer et al., 2018).
To validate this hypothesis, we repeat the H-IWAE experiment with the balance heuristic α = 1, but update the encoder twice on the same minibatch before updating the decoder once. This gives a significant improvement over the learned model (see the last two rows of Table 3).

## 7. Conclusion

In order to approximate the posterior distribution in variational inference with a more representative empirical distribution, we propose to use a hierarchical meta-sampler, and derive a variational lower bound as the training objective. Theoretically, we provide a sufficient boundedness condition connecting the convergence of the variance of the Monte Carlo estimate of the lower bound with the convergence of the bound itself. Empirically, we show that maximizing the lower bound implicitly reduces the variance of the estimate. Our analysis shows that learning dependency among the joint samples can induce negative correlation and improve the performance of inference.

## References

Agakov, F. V. and Barber, D. An auxiliary variational method. In Neural Information Processing, 2004.

Bachman, P. and Precup, D. Training deep generative models: Variations on a theme. In NIPS Approximate Inference Workshop, 2015.

Bornschein, J. and Bengio, Y. Reweighted wake-sleep. arXiv preprint arXiv:1406.2751, 2014.

Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. In International Conference on Learning Representations, 2015.

Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations, 2016.

Cremer, C., Morris, Q., and Duvenaud, D. Reinterpreting importance-weighted autoencoders. arXiv preprint arXiv:1704.02916, 2017.

Cremer, C., Li, X., and Duvenaud, D. Inference suboptimality in variational autoencoders. In International Conference on Machine Learning, 2018.

Dieng, A. B., Tran, D., Ranganath, R., Paisley, J., and Blei, D. Variational inference via χ upper bound minimization. In Advances in Neural Information Processing Systems, 2017.

Domke, J. and Sheldon, D. R. Importance weighting and variational inference. In Advances in Neural Information Processing Systems, pp. 4475–4484, 2018.

Edwards, H. and Storkey, A. Towards a neural statistician. In International Conference on Learning Representations, 2017.

Huang, C.-W. and Courville, A. Sequentialized sampling importance resampling and scalable IWAE. In NIPS Bayesian Deep Learning Workshop, 2017.

Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. Neural autoregressive flows. In International Conference on Machine Learning, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, 2016.

Klys, J., Bettencourt, J., and Duvenaud, D. Joint importance sampling for variational inference. 2018.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. In International Conference on Artificial Intelligence and Statistics, 2011.

Le, T. A., Kosiorek, A. R., Siddharth, N., Teh, Y. W., and Wood, F. Revisiting reweighted wake-sleep.
arXiv preprint arXiv:1805.10469, 2018.

Maaløe, L., Sønderby, C. K., Sønderby, S. K., and Winther, O. Auxiliary deep generative models. In International Conference on Machine Learning, 2016.

Maddison, C. J., Lawson, J., Tucker, G., Heess, N., Norouzi, M., Mnih, A., Doucet, A., and Teh, Y. Filtering variational objectives. In Advances in Neural Information Processing Systems, pp. 6573–6583, 2017.

Marlin, B., Swersky, K., Chen, B., and Freitas, N. Inductive principles for restricted Boltzmann machine learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 509–516, 2010.

Miller, A. C., Foti, N., and Adams, R. P. Variational boosting: Iteratively refining posterior approximations. arXiv preprint arXiv:1611.06585, 2016.

Minka, T. et al. Divergence measures and message passing. Technical report, Microsoft Research, 2005.

Mnih, A. and Rezende, D. J. Variational inference for Monte Carlo objectives. arXiv preprint arXiv:1602.06725, 2016.

Molchanov, D., Kharitonov, V., Sobolev, A., and Vetrov, D. Doubly semi-implicit variational inference. arXiv preprint arXiv:1810.02789, 2018.

Müller, T., McWilliams, B., Rousselle, F., Gross, M., and Novák, J. Neural importance sampling. arXiv preprint arXiv:1808.03856, 2018.

Nowozin, S. Debiasing evidence approximations: On importance-weighted autoencoders and jackknife variational inference. In International Conference on Learning Representations, 2018.

Owen, A. B. Monte Carlo theory, methods and examples. 2013.

Rainforth, T., Kosiorek, A. R., Le, T. A., Maddison, C. J., Igl, M., Wood, F., and Teh, Y. W. Tighter variational bounds are not necessarily better. In International Conference on Machine Learning, 2018.

Ranganath, R., Tran, D., and Blei, D. Hierarchical variational models. In International Conference on Machine Learning, 2016.

Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.

Sobolev, A. and Vetrov, D. Importance weighted hierarchical variational inference. 2018.

Tucker, G., Lawson, D., Gu, S., and Maddison, C. J. Doubly reparameterized gradient estimators for Monte Carlo objectives. arXiv preprint arXiv:1810.04152, 2018.

Wu, M., Goodman, N., and Ermon, S. Differentiable antithetic sampling for variance reduction in stochastic variational inference. arXiv preprint arXiv:1810.02555, 2018.

Yin, M. and Zhou, M. Semi-implicit variational inference. arXiv preprint arXiv:1805.11183, 2018.