# Semi-Amortized Variational Autoencoders

Yoon Kim¹, Sam Wiseman¹, Andrew C. Miller¹, David Sontag², Alexander M. Rush¹

¹School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA. ²CSAIL & IMES, Massachusetts Institute of Technology, Cambridge, MA, USA. Correspondence to: Yoon Kim.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

Amortized variational inference (AVI) replaces instance-specific local inference with a global inference network. While AVI has enabled efficient training of deep generative models such as variational autoencoders (VAE), recent empirical work suggests that inference networks can produce suboptimal variational parameters. We propose a hybrid approach that uses AVI to initialize the variational parameters and runs stochastic variational inference (SVI) to refine them. Crucially, the local SVI procedure is itself differentiable, so the inference network and generative model can be trained end-to-end with gradient-based optimization. This semi-amortized approach enables the use of rich generative models without experiencing the posterior-collapse phenomenon common in training VAEs for problems like text generation. Experiments show this approach outperforms strong autoregressive and variational baselines on standard text and image datasets.

1. Introduction

Variational inference (VI) (Jordan et al., 1999; Wainwright & Jordan, 2008) is a framework for approximating an intractable distribution by optimizing over a family of tractable surrogates. Traditional VI algorithms iterate over the observed data and update the variational parameters with closed-form coordinate ascent updates that exploit conditional conjugacy (Ghahramani & Beal, 2001). This style of optimization is challenging to extend to large datasets and non-conjugate models. However, recent advances in stochastic (Hoffman et al., 2013), black-box (Ranganath et al., 2014; 2016), and amortized (Mnih & Gregor, 2014; Kingma & Welling, 2014; Rezende et al., 2014) variational inference have made it possible to scale to large datasets and rich, non-conjugate models (see Blei et al. (2017) and Zhang et al. (2017) for a review of modern methods).

In stochastic variational inference (SVI), the variational parameters for each data point are randomly initialized and then optimized to maximize the evidence lower bound (ELBO) with, for example, gradient ascent. These updates are based on a subset of the data, making it possible to scale the approach. In amortized variational inference (AVI), the local variational parameters are instead predicted by an inference (or recognition) network, which is shared (i.e. amortized) across the dataset. Variational autoencoders (VAEs) are deep generative models that utilize AVI for inference and jointly train the generative model alongside the inference network.

SVI gives good local (i.e. instance-specific) distributions within the variational family but requires performing optimization for each data point. AVI has fast inference, but having the variational parameters be a parametric function of the input may be too strict of a restriction. As a secondary effect, this may militate against learning a good generative model, since its parameters may be updated based on suboptimal variational parameters. Cremer et al.
(2018) observe that the amortization gap (the gap between the log-likelihood and the ELBO due to amortization) can be significant for VAEs, especially on complex datasets.

Recent work has targeted this amortization gap by combining amortized inference with iterative refinement during training (Hjelm et al., 2016; Krishnan et al., 2018). These methods use an encoder to initialize the local variational parameters, and then subsequently run an iterative procedure to refine them. To train with this hybrid approach, they utilize a separate training-time objective. For example, Hjelm et al. (2016) train the inference network to minimize the KL-divergence between the initial and the final variational distributions, while Krishnan et al. (2018) train the inference network with the usual ELBO objective based on the initial variational distribution.

In this work, we address the train/test objective mismatch and consider methods for training semi-amortized variational autoencoders (SA-VAE) in a fully end-to-end manner. We propose an approach that leverages differentiable optimization (Domke, 2012; Maclaurin et al., 2015; Belanger et al., 2017) and differentiates through SVI while training the inference network/generative model. We find that this method is able to both improve estimation of variational parameters and produce better generative models.

We apply our approach to train deep generative models of text and images, and observe that they outperform autoregressive/VAE/SVI baselines, in addition to direct baselines that combine VAE with SVI but do not perform end-to-end training. We also find that under our framework, we are able to utilize a powerful generative model without experiencing the posterior-collapse phenomenon often observed in VAEs, wherein the variational posterior collapses to the prior and the generative model ignores the latent variable (Bowman et al., 2016; Chen et al., 2017; Zhao et al., 2017). This problem has made it particularly difficult to utilize VAEs for text, an important open issue in the field. With SA-VAE, we are able to outperform an LSTM language model by utilizing an LSTM generative model that maintains non-trivial latent representations. Code is available at https://github.com/harvardnlp/sa-vae.

2. Background

Notation. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a scalar-valued function with partitioned inputs $u = [u_1, \dots, u_m]$ such that $\sum_{i=1}^{m} \dim(u_i) = n$. With a slight abuse of notation we define $f(u_1, \dots, u_m) = f([u_1, \dots, u_m])$. We denote $\nabla_{u_i} f(\hat{u}) \in \mathbb{R}^{\dim(u_i)}$ to be the $i$-th block of the gradient of $f$ evaluated at $\hat{u} = [\hat{u}_1, \dots, \hat{u}_m]$, and further use $\frac{df}{dv}$ to denote the total derivative of $f$ with respect to $v$, which exists if $u$ is a differentiable function of $v$. Note that in general $\nabla_{u_i} f(\hat{u}) \neq \frac{df}{du_i}$, since other components of $u$ could be a function of $u_i$.[1] We also let $H_{u_i,u_j} f(\hat{u}) \in \mathbb{R}^{\dim(u_i) \times \dim(u_j)}$ be the matrix formed by taking the $i$-th group of rows and the $j$-th group of columns of the Hessian of $f$ evaluated at $\hat{u}$. These definitions generalize straightforwardly when $f : \mathbb{R}^n \to \mathbb{R}^p$ is a vector-valued function (e.g. $\frac{df}{dv}$ is then a Jacobian).[2]

[1] This will indeed be the case in our approach: when we calculate $\mathrm{ELBO}(\lambda_K, \theta, x)$, $\lambda_K$ is a function of the data point $x$, the generative model $\theta$, and the inference network $\phi$ (Section 3).
[2] Total derivatives/Jacobians are usually denoted with row vectors, but we denote them with column vectors for clearer notation.

2.1. Variational Inference

Consider the following generative process for $x$:

$z \sim p(z), \qquad x \sim p(x \mid z; \theta)$

where $p(z)$ is the prior and $p(x \mid z; \theta)$ is given by a generative model with parameters $\theta$.
As maximizing the log-likelihood $\log p(x; \theta) = \log \int_z p(x \mid z; \theta) p(z)\, dz$ is usually intractable, variational inference instead defines a variational family of distributions $q(z; \lambda)$ parameterized by $\lambda$ and maximizes the evidence lower bound (ELBO):

$\log p(x; \theta) \geq \mathbb{E}_{q(z;\lambda)}[\log p(x \mid z)] - \mathrm{KL}[q(z;\lambda) \,\|\, p(z)] = \mathrm{ELBO}(\lambda, \theta, x)$

The variational posterior $q(z; \lambda)$ is said to collapse to the prior if $\mathrm{KL}[q(z;\lambda) \,\|\, p(z)] \approx 0$. In the general case we are given a dataset $x^{(1)}, \dots, x^{(N)}$ and need to find variational parameters $\lambda^{(1)}, \dots, \lambda^{(N)}$ and generative model parameters $\theta$ that jointly maximize $\sum_{i=1}^{N} \mathrm{ELBO}(\lambda^{(i)}, \theta, x^{(i)})$.

2.2. Stochastic Variational Inference

We can apply SVI (Hoffman et al., 2013) with gradient ascent to approximately maximize the above objective:[3]

1. Sample $x \sim p_D(x)$
2. Randomly initialize $\lambda_0$
3. For $k = 0, \dots, K-1$, set $\lambda_{k+1} = \lambda_k + \alpha \nabla_\lambda \mathrm{ELBO}(\lambda_k, \theta, x)$
4. Update $\theta$ based on $\nabla_\theta \mathrm{ELBO}(\lambda_K, \theta, x)$

Here $K$ is the number of SVI iterations and $\alpha$ is the learning rate. (Note that $\theta$ is updated based on the gradient $\nabla_\theta \mathrm{ELBO}(\lambda_K, \theta, x)$ and not the total derivative $\frac{d\,\mathrm{ELBO}(\lambda_K, \theta, x)}{d\theta}$. The latter would take into account the fact that $\lambda_k$ is a function of $\theta$ for $k > 0$.) SVI optimizes directly for instance-specific variational distributions, but may require running iterative inference for a large number of steps. Further, because of this block coordinate ascent approach, the variational parameters $\lambda$ are optimized separately from $\theta$, potentially making it difficult for $\theta$ to adapt to local optima.

[3] While we describe the various algorithms for a specific data point, in practice we use mini-batches.

2.3. Amortized Variational Inference

AVI uses a global parametric model to predict the local variational parameters for each data point. A particularly popular application of AVI is in training the variational autoencoder (VAE) (Kingma & Welling, 2014), which runs an inference network (i.e. encoder) $\mathrm{enc}(\cdot)$ parameterized by $\phi$ over the input to obtain the variational parameters:

1. Sample $x \sim p_D(x)$
2. Set $\lambda = \mathrm{enc}(x; \phi)$
3. Update $\theta$ based on $\nabla_\theta \mathrm{ELBO}(\lambda, \theta, x)$ (which in this case is equal to the total derivative)
4. Update $\phi$ based on $\frac{d\,\mathrm{ELBO}(\lambda, \theta, x)}{d\phi} = \frac{d\lambda}{d\phi} \nabla_\lambda \mathrm{ELBO}(\lambda, \theta, x)$

The inference network is learned jointly alongside the generative model with the same loss function, allowing the pair to coadapt. Additionally, inference for AVI involves running the inference network over the input, which is usually much faster than running iterative optimization on the ELBO. Despite these benefits, requiring the variational parameters to be a parametric function of the input may be too strict of a restriction and can lead to an amortization gap. This gap can propagate forward to hinder the learning of the generative model if $\theta$ is updated based on suboptimal $\lambda$.
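To make the SVI updates of Section 2.2 concrete, the following is a minimal sketch (not the authors' released code) of the inner loop, assuming PyTorch, a diagonal Gaussian $q(z;\lambda)$ with $\lambda = [\mu, \log\sigma^2]$, a standard normal prior, and a user-supplied `log_likelihood(x, z)` that returns $\log p(x \mid z; \theta)$ for the current generative model.

```python
import torch

def elbo(x, mu, log_var, log_likelihood):
    # Single-sample reparameterized estimate of ELBO(lambda, theta, x).
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)          # z ~ q(z; lambda)
    reconstruction = log_likelihood(x, z)         # approximates E_q[log p(x | z; theta)]
    # Closed-form KL[N(mu, diag(sigma^2)) || N(0, I)].
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return reconstruction - kl

def svi(x, log_likelihood, dim_z, K=20, alpha=1.0):
    # Randomly initialize lambda_0, then take K gradient-ascent steps on the ELBO.
    mu = torch.randn(dim_z) * 0.1
    log_var = torch.zeros(dim_z)
    mu.requires_grad_(True)
    log_var.requires_grad_(True)
    for _ in range(K):
        loss = -elbo(x, mu, log_var, log_likelihood)
        g_mu, g_lv = torch.autograd.grad(loss, [mu, log_var])
        with torch.no_grad():
            mu -= alpha * g_mu                    # ascent on ELBO = descent on -ELBO
            log_var -= alpha * g_lv
    return mu.detach(), log_var.detach()
```

In AVI, the loop above is replaced by a single call to the inference network, $\lambda = \mathrm{enc}(x; \phi)$.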
3. Semi-Amortized Variational Autoencoders

Semi-amortized variational autoencoders (SA-VAE) utilize an inference network over the input to give the initial variational parameters, and subsequently run SVI to refine them. One might appeal to the universal approximation theorem (Hornik et al., 1989) and question the necessity of additional SVI steps given a rich-enough inference network. However, in practice we find that the variational parameters found from VAE are usually not optimal even with a powerful inference network, and the amortization gap can be significant especially on complex datasets (Cremer et al., 2018; Krishnan et al., 2018).

SA-VAE models are trained using a combination of AVI and SVI steps:

1. Sample $x \sim p_D(x)$
2. Set $\lambda_0 = \mathrm{enc}(x; \phi)$
3. For $k = 0, \dots, K-1$, set $\lambda_{k+1} = \lambda_k + \alpha \nabla_\lambda \mathrm{ELBO}(\lambda_k, \theta, x)$
4. Update $\theta$ based on $\frac{d\,\mathrm{ELBO}(\lambda_K, \theta, x)}{d\theta}$
5. Update $\phi$ based on $\frac{d\,\mathrm{ELBO}(\lambda_K, \theta, x)}{d\phi}$

Note that for training we need to compute the total derivative of the final ELBO with respect to $\theta, \phi$ (i.e. steps 4 and 5 above). Unlike with AVI, in order to update the encoder and generative model parameters, this total derivative requires backpropagating through the SVI updates. Specifically this requires backpropagating through gradient ascent (Domke, 2012; Maclaurin et al., 2015). Following past work, this backpropagation step can be done efficiently with fast Hessian-vector products (Le Cun et al., 1993; Pearlmutter, 1994). In particular, consider the case where we perform one step of refinement, $\lambda_1 = \lambda_0 + \alpha \nabla_\lambda \mathrm{ELBO}(\lambda_0, \theta, x)$, and for brevity let $\mathcal{L} = \mathrm{ELBO}(\lambda_1, \theta, x)$. To backpropagate through this, we receive the derivative $\frac{d\mathcal{L}}{d\lambda_1}$ and use the chain rule,

$\frac{d\mathcal{L}}{d\lambda_0} = \frac{d\lambda_1}{d\lambda_0}\frac{d\mathcal{L}}{d\lambda_1} = \big(I + \alpha H_{\lambda,\lambda}\,\mathrm{ELBO}(\lambda_0, \theta, x)\big)\frac{d\mathcal{L}}{d\lambda_1} = \frac{d\mathcal{L}}{d\lambda_1} + \alpha H_{\lambda,\lambda}\,\mathrm{ELBO}(\lambda_0, \theta, x)\frac{d\mathcal{L}}{d\lambda_1}$

We can then backpropagate $\frac{d\mathcal{L}}{d\lambda_0}$ through the inference network to calculate the total derivative, i.e. $\frac{d\mathcal{L}}{d\phi} = \frac{d\lambda_0}{d\phi}\frac{d\mathcal{L}}{d\lambda_0}$. Similar rules can be used to derive $\frac{d\mathcal{L}}{d\theta}$.[4] The full forward/backward step, which uses gradient descent with momentum on the negative ELBO, is shown in Algorithm 1.

[4] We refer the reader to Domke (2012) for the full derivation.

Algorithm 1: Semi-Amortized Variational Autoencoders

Input: inference network $\phi$, generative model $\theta$, inference steps $K$, learning rate $\alpha$, momentum $\gamma$, loss function $f(\lambda, \theta, x) = -\mathrm{ELBO}(\lambda, \theta, x)$
Sample $x \sim p_D(x)$
$\lambda_0 \leftarrow \mathrm{enc}(x; \phi)$
$v_0 \leftarrow 0$
for $k = 0$ to $K - 1$ do
    $v_{k+1} \leftarrow \gamma v_k - \nabla_\lambda f(\lambda_k, \theta, x)$
    $\lambda_{k+1} \leftarrow \lambda_k + \alpha v_{k+1}$
end for
$\mathcal{L} \leftarrow f(\lambda_K, \theta, x)$
$\bar{\lambda}_K \leftarrow \nabla_\lambda f(\lambda_K, \theta, x)$
$\bar{\theta} \leftarrow \nabla_\theta f(\lambda_K, \theta, x)$
$\bar{v}_K \leftarrow 0$
for $k = K - 1$ down to $0$ do
    $\bar{v}_{k+1} \leftarrow \bar{v}_{k+1} + \alpha \bar{\lambda}_{k+1}$
    $\bar{\lambda}_k \leftarrow \bar{\lambda}_{k+1} - H_{\lambda,\lambda} f(\lambda_k, \theta, x)\, \bar{v}_{k+1}$
    $\bar{\theta} \leftarrow \bar{\theta} - H_{\theta,\lambda} f(\lambda_k, \theta, x)\, \bar{v}_{k+1}$
    $\bar{v}_k \leftarrow \gamma \bar{v}_{k+1}$
end for
$\frac{d\mathcal{L}}{d\theta} \leftarrow \bar{\theta}$
$\frac{d\mathcal{L}}{d\phi} \leftarrow \frac{d\lambda_0}{d\phi} \bar{\lambda}_0$
Update $\theta, \phi$ based on $\frac{d\mathcal{L}}{d\theta}, \frac{d\mathcal{L}}{d\phi}$

In our implementation we calculate Hessian-vector products with finite differences (Le Cun et al., 1993; Domke, 2012), which was found to be more memory-efficient than automatic differentiation (and therefore crucial for scaling our approach to rich inference networks/generative models). Specifically, we estimate $H_{u_i,u_j} f(\hat{u}) v$ with

$H_{u_i,u_j} f(\hat{u})\, v \approx \frac{1}{\epsilon}\big(\nabla_{u_i} f(\hat{u}_1, \dots, \hat{u}_j + \epsilon v, \dots, \hat{u}_m) - \nabla_{u_i} f(\hat{u}_1, \dots, \hat{u}_j, \dots, \hat{u}_m)\big)$

where $\epsilon$ is some small number (we use $\epsilon = 10^{-5}$).[5] We further clip the results (i.e. rescale the results if the norm exceeds a threshold) before and after each Hessian-vector product, as well as during SVI, which helped mitigate exploding gradients and further gave better training signal to the inference network.[6] See Appendix A for details.

[5] Since in our case the ELBO is a non-deterministic function due to sampling (and dropout, if applicable), care must be taken when calculating the Hessian-vector product with finite differences to ensure that the source of randomness is the same when calculating the two gradient expressions.
[6] Without gradient clipping, in addition to numerical issues, we empirically observed the model to degenerate to a case whereby it learned to rely too much on iterative inference, and thus the initial parameters from the inference network were poor.
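As a concrete illustration of the finite-difference estimator above, here is a minimal sketch (an assumption: PyTorch, with `f` a scalar-valued loss closing over $\theta$ and $x$) of the product $H_{\lambda,\lambda} f(\lambda, \theta, x)\, v$. The RNG state is restored between the two gradient evaluations so both see the same source of randomness, as cautioned in footnote [5].

```python
import torch

def hvp_finite_diff(f, lam, v, eps=1e-5):
    # Approximates H_{lambda,lambda} f(lam) v with
    # (grad_lambda f(lam + eps * v) - grad_lambda f(lam)) / eps.
    rng_state = torch.get_rng_state()
    lam_plus = (lam.detach() + eps * v).requires_grad_(True)
    g_plus, = torch.autograd.grad(f(lam_plus), lam_plus)
    torch.set_rng_state(rng_state)            # replay the same noise for the second gradient
    lam_base = lam.detach().requires_grad_(True)
    g_base, = torch.autograd.grad(f(lam_base), lam_base)
    return (g_plus - g_base) / eps
```

In a full implementation, the returned product would be norm-clipped before being used in the backward loop of Algorithm 1, as described above.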
Another way to provide better signal to the inference network is to train against a weighted sum $\sum_{k=0}^{K} w_k\,\mathrm{ELBO}(\lambda_k, \theta, x)$ for $w_k \geq 0$.

4. Experiments

We apply our approach to train generative models on a synthetic dataset in addition to text/images. For all experiments we utilize stochastic gradient descent with momentum on the negative ELBO. Our prior is the spherical Gaussian $\mathcal{N}(0, I)$ and the variational posterior is a diagonal Gaussian, where the variational parameters are given by the mean vector and the diagonal log-variance vector, i.e. $\lambda = [\mu, \log\sigma^2]$. In preliminary experiments we also experimented with natural gradients, other optimization algorithms, and learning the learning rates, but found that these did not significantly improve results. Full details regarding hyperparameters/model architectures for all experiments are in Appendix B.

4.1. Synthetic Data

We first apply our approach to a synthetic dataset where we have access to the true underlying generative model of discrete sequences. We generate synthetic sequential data according to the following oracle generative process with 2-dimensional latent variables and discrete tokens $x_t$:

$z_1, z_2 \sim \mathcal{N}(0, 1)$
$h_t = \mathrm{LSTM}(h_{t-1}, x_t)$
$x_{t+1} \sim \mathrm{softmax}(\mathrm{MLP}([h_t, z_1, z_2]))$

We initialize the LSTM/MLP randomly as $\theta$, where the LSTM has a single layer with hidden state/input dimension equal to 100. We generate for 5 time steps (so each example is given by $x = [x_1, \dots, x_5]$) with a vocabulary size of 1000 for each $x_t$. The training set consists of 5000 points. See Appendix B.1 for the exact setup.

We fix this oracle generative model $p(x \mid z; \theta)$ and learn an inference network (also a one-layer LSTM) with VAE and SA-VAE.[7] For a randomly selected test point, we plot the ELBO landscape in Figure 1 as a function of the variational posterior means $(\mu_1, \mu_2)$ learned from the different methods. For SVI/SA-VAE we run iterative optimization for 20 steps. Finally, we also show the optimal variational parameters found from grid search. As can be seen from Figure 1, the variational parameters from running SA-VAE are closest to the optimum, while those obtained from SVI and VAE are slightly further away.

[7] With a fixed oracle, these models are technically not VAEs, as VAE usually implies that the generative model is learned (alongside the encoder).

Figure 1. ELBO landscape with the oracle generative model as a function of the variational posterior means $\mu_1, \mu_2$ for a randomly chosen test point. Variational parameters obtained from VAE and SVI are shown as $\mu_{\mathrm{VAE}}$ and $\mu_{\mathrm{SVI}}$, and the initial/final parameters from SA-VAE are shown as $\mu_0$ and $\mu_K$ (along with the intermediate points). SVI/SA-VAE are run for 20 iterations. The optimal point, found from grid search, is shown as $\mu^\ast$.

In Table 1 we show the variational upper bounds (i.e. negative ELBO) on the negative log-likelihood (NLL) from training the various models with both the oracle/learned generative model, and find that SA-VAE outperforms VAE/SVI in both cases.

| MODEL          | ORACLE GEN | LEARNED GEN |
|----------------|------------|-------------|
| VAE            | 21.77      | 27.06       |
| SVI            | 22.33      | 25.82       |
| SA-VAE         | 20.13      | 25.21       |
| TRUE NLL (EST) | 19.63      | –           |

Table 1. Variational upper bounds for the various models on the synthetic dataset, where SVI/SA-VAE is trained/tested with 20 steps. TRUE NLL (EST) is an estimate of the true negative log-likelihood (i.e. entropy of the data-generating distribution) estimated with 1000 samples from the prior. ORACLE GEN uses the oracle generative model and LEARNED GEN learns the generative network.
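For illustration, here is a minimal sketch (an assumption: PyTorch; the exact parameterization of Appendix B.1 is not reproduced, and the start-token handling is a hypothetical choice) of the oracle generative process above: a 2-dimensional latent, a single-layer LSTM with hidden/input size 100, sequences of length 5, and a vocabulary of 1000.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN, STEPS = 1000, 100, 5
embed = nn.Embedding(VOCAB, HIDDEN)
lstm = nn.LSTMCell(HIDDEN, HIDDEN)
mlp = nn.Sequential(nn.Linear(HIDDEN + 2, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, VOCAB))

def sample_sequence():
    z = torch.randn(1, 2)                              # z_1, z_2 ~ N(0, 1)
    h = torch.zeros(1, HIDDEN)
    c = torch.zeros(1, HIDDEN)
    x_t = torch.zeros(1, dtype=torch.long)             # hypothetical start token (id 0)
    tokens = []
    for _ in range(STEPS):
        h, c = lstm(embed(x_t), (h, c))                # h_t = LSTM(h_{t-1}, x_t)
        logits = mlp(torch.cat([h, z], dim=-1))        # MLP([h_t, z_1, z_2])
        probs = torch.softmax(logits, dim=-1)
        x_t = torch.multinomial(probs, 1).squeeze(1)   # x_{t+1} ~ softmax(...)
        tokens.append(x_t.item())
    return tokens
```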
4.2. Text

The next set of experiments is focused on text modeling on the Yahoo questions corpus from Yang et al. (2017). Text modeling with deep generative models has been a challenging problem, and few approaches have been shown to produce rich generative models that do not collapse to standard language models. Ideally a deep generative model trained with variational inference would make use of the latent space (i.e. maintain a nonzero KL term) while accurately modeling the underlying distribution.

Our architecture and hyperparameters are identical to the LSTM-VAE baselines considered in Yang et al. (2017), except that we train with SGD instead of Adam, which was found to perform better for training LSTMs. Specifically, both the inference network and the generative model are one-layer LSTMs with 1024 hidden units and 512-dimensional word embeddings. The last hidden state of the encoder is used to predict the vector of variational posterior means/log variances. The sample from the variational posterior is used to predict the initial hidden state of the generative LSTM and is additionally fed as input at each time step. The latent variable is 32-dimensional. Following previous work (Bowman et al., 2016; Sønderby et al., 2016; Yang et al., 2017), for all the variational models we utilize a KL-cost annealing strategy whereby the multiplier on the KL term is increased linearly from 0.1 to 1.0 each batch over 10 epochs. Appendix B.2 has the full architecture/hyperparameters.

| MODEL                   | NLL   | KL   | PPL  |
|-------------------------|-------|------|------|
| LSTM-LM                 | 334.9 | –    | 66.2 |
| LSTM-VAE                | 342.1 | 0.0  | 72.5 |
| LSTM-VAE + INIT         | 339.2 | 0.0  | 69.9 |
| CNN-LM                  | 335.4 | –    | 66.6 |
| CNN-VAE                 | 333.9 | 6.7  | 65.4 |
| CNN-VAE + INIT          | 332.1 | 10.0 | 63.9 |
| LM                      | 329.1 | –    | 61.6 |
| VAE                     | 330.2 | 0.01 | 62.5 |
| VAE + INIT              | 330.5 | 0.37 | 62.7 |
| VAE + WORD-DROP 25%     | 334.2 | 1.44 | 65.6 |
| VAE + WORD-DROP 50%     | 345.0 | 5.29 | 75.2 |
| SVI (K = 10)            | 331.4 | 0.16 | 63.4 |
| SVI (K = 20)            | 330.8 | 0.41 | 62.9 |
| SVI (K = 40)            | 329.8 | 1.01 | 62.2 |
| VAE + SVI (K = 10)      | 331.2 | 7.85 | 63.3 |
| VAE + SVI (K = 20)      | 330.5 | 7.80 | 62.7 |
| VAE + SVI + KL (K = 10) | 330.3 | 7.95 | 62.5 |
| VAE + SVI + KL (K = 20) | 330.1 | 7.81 | 62.3 |
| SA-VAE (K = 10)         | 327.6 | 5.13 | 60.5 |
| SA-VAE (K = 20)         | 327.5 | 7.19 | 60.4 |

Table 2. Results on text modeling on the Yahoo dataset. Top results are from Yang et al. (2017), while the bottom results are from this work (+ INIT means the encoder is initialized with a pretrained language model, while models with + WORD-DROP are trained with word dropout). NLL/KL numbers are averaged across examples, and PPL refers to perplexity. K refers to the number of inference steps used for training/testing.

In addition to autoregressive/VAE/SVI baselines, we consider two other approaches that also combine amortized inference with iterative refinement. The first approach is from Krishnan et al. (2018), where the generative model takes a gradient step based on the final variational parameters and the inference network takes a gradient step based on the initial variational parameters, i.e. we update $\theta$ based on $\nabla_\theta \mathrm{ELBO}(\lambda_K, \theta, x)$ and update $\phi$ based on $\frac{d\lambda_0}{d\phi} \nabla_\lambda \mathrm{ELBO}(\lambda_0, \theta, x)$. The forward step (steps 1-3 in Section 3) is identical to SA-VAE. We refer to this baseline as VAE + SVI.

In the second approach, based on Salakhutdinov & Larochelle (2010) and Hjelm et al. (2016), we train the inference network to minimize the KL-divergence between the initial and the final variational distributions, keeping the latter fixed. Specifically, letting $g(\nu, \omega) = \mathrm{KL}[q(z;\nu) \,\|\, q(z;\omega)]$, we update $\theta$ based on $\nabla_\theta \mathrm{ELBO}(\lambda_K, \theta, x)$ and update $\phi$ based on $\frac{d\lambda_0}{d\phi} \nabla_\nu g(\lambda_0, \lambda_K)$.
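Since the variational posteriors here are diagonal Gaussians with $\lambda = [\mu, \log\sigma^2]$, $g(\nu, \omega)$ has a closed form. The following is a minimal sketch (an assumption, not the authors' code) of this term as used by the VAE + SVI + KL baseline, with the refined parameters $\lambda_K$ detached so that only $\lambda_0$, and hence the inference network, receives gradients.

```python
import torch

def gaussian_kl(mu_0, log_var_0, mu_K, log_var_K):
    # KL[ N(mu_0, diag(exp(log_var_0))) || N(mu_K, diag(exp(log_var_K))) ], summed over dims.
    var_0, var_K = log_var_0.exp(), log_var_K.exp()
    return 0.5 * torch.sum(
        log_var_K - log_var_0 + (var_0 + (mu_0 - mu_K).pow(2)) / var_K - 1.0
    )

# Encoder objective for the VAE + SVI + KL baseline (refined parameters held fixed):
# loss_phi = gaussian_kl(mu_0, log_var_0, mu_K.detach(), log_var_K.detach())
```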
Note that the inference network is not updated based on $\frac{dg}{d\phi}$, which would take into account the fact that both $\lambda_0$ and $\lambda_K$ are functions of $\phi$. We found $g(\lambda_0, \lambda_K)$ to perform better than the reverse direction $g(\lambda_K, \lambda_0)$. We refer to this setup as VAE + SVI + KL.

Results from the various models are shown in Table 2. Our baseline models (LM/VAE/SVI in Table 2) are already quite strong and outperform the models considered in Yang et al. (2017). However, models trained with VAE/SVI make negligible use of the latent variable and practically collapse to a language model, negating the benefits of using latent variables.[8] In contrast, models that combine amortized inference with iterative refinement make use of the latent space and the KL term is significantly above zero.[9] VAE + SVI and VAE + SVI + KL do not outperform a language model, and while SA-VAE only modestly outperforms it, to our knowledge this is one of the first instances in which we are able to train an LSTM generative model that does not ignore the latent code and outperforms a language model.

[8] Models trained with word dropout (+ WORD-DROP in Table 2) do make use of the latent space but significantly underperform a language model.
[9] A high KL term does not necessarily imply that the latent variable is being utilized in a meaningful way (it could simply be due to bad optimization). In Section 5.1 we investigate the learned latent space in more detail.

Figure 2. (Left) Perplexity upper bound of various models when trained with 20 steps (except for VAE) and tested with a varying number of SVI steps from random initialization. (Right) Same as the left, except that SVI is initialized with variational parameters obtained from the inference network.

One might wonder if the improvements are coming from simply having a more flexible inference scheme at test time, rather than from learning a better generative model. To test this, for the various models we discard the inference network at test time and perform SVI for a variable number of steps from random initialization. The results are shown in Figure 2 (left). It is clear that the learned generative model (and the associated ELBO landscape) is quite different: it is not possible to train with VAE and perform SVI at test time to obtain the same performance as SA-VAE (although the performance of VAE does improve slightly from 62.7 to 62.3 when we run SVI for 40 steps from random initialization). Figure 2 (right) has the results for a similar experiment where we refine the variational parameters initialized from the inference network for a variable number of steps at test time. We find that the inference network provides better initial parameters than random initialization and thus requires fewer iterations of SVI to reach the optimum. We do not observe improvements from running more refinement steps at test time than were used in training. Interestingly, SA-VAE without any refinement steps at test time has a substantially nonzero KL term (KL = 6.65, PPL = 62.0). This indicates that the posterior-collapse phenomenon when training LSTM-based VAEs for text is partially due to optimization issues.
Finally, while Yang et al. (2017) found that initializing the encoder with a pretrained language model improved performance (+ INIT in Table 2), we did not observe this on our baseline VAE model when we trained with SGD, and hence did not pursue this further.

4.3. Images

We next apply our approach to model images on the OMNIGLOT dataset (Lake et al., 2015).[10] While posterior collapse is less of an issue for VAEs trained on images, we still expect that improving the amortization gap would result in generative models that better model the underlying data and make more use of the latent space. We use a three-layer ResNet (He et al., 2016) as our inference network. The generative model first transforms the 32-dimensional latent vector to the image spatial resolution, which is concatenated with the original image and fed to a 12-layer Gated PixelCNN (van den Oord et al., 2016) with varying filter sizes, followed by a final sigmoid layer. We employ the same KL-cost annealing schedule as in the text experiments. See Appendix B.3 for the exact architecture/hyperparameters.

[10] We focus on the more complex OMNIGLOT dataset instead of the simpler MNIST dataset, as prior work has shown that the amortization gap on MNIST is minimal (Cremer et al., 2018).

| MODEL                               | NLL (KL)     |
|-------------------------------------|--------------|
| IWAE (Burda et al., 2015a)          | 103.38       |
| LADDER VAE (Sønderby et al., 2016)  | 102.11       |
| RBM (Burda et al., 2015b)           | 100.46       |
| DISCRETE VAE (Rolfe, 2017)          | 97.43        |
| DRAW (Gregor et al., 2015)          | 96.50        |
| CONV DRAW (Gregor et al., 2016)     | 91.00        |
| VLAE (Chen et al., 2017)            | 89.83        |
| VAMPPRIOR (Tomczak & Welling, 2018) | 89.76        |
| GATED PIXELCNN                      | 90.59        |
| VAE                                 | 90.43 (0.98) |
| SVI (K = 10)                        | 90.65 (0.02) |
| SVI (K = 20)                        | 90.51 (0.06) |
| SVI (K = 40)                        | 90.44 (0.27) |
| SVI (K = 80)                        | 90.27 (1.65) |
| VAE + SVI (K = 10)                  | 90.26 (1.69) |
| VAE + SVI (K = 20)                  | 90.19 (2.40) |
| VAE + SVI + KL (K = 10)             | 90.24 (2.42) |
| VAE + SVI + KL (K = 20)             | 90.21 (2.83) |
| SA-VAE (K = 10)                     | 90.20 (1.83) |
| SA-VAE (K = 20)                     | 90.05 (2.78) |

Table 3. Results on image modeling on the OMNIGLOT dataset. Top results are from prior works, while the bottom results are from this work. GATED PIXELCNN is our autoregressive baseline, and K refers to the number of inference steps during training/testing. For the variational models, the KL portion of the ELBO is shown in parentheses.

Results from the various models are shown in Table 3. Our findings are largely consistent with the results from text: the semi-amortized approaches outperform VAE/SVI baselines, and further they learn generative models that make more use of the latent representations (i.e. the KL portion of the loss is higher). Even with 80 steps of SVI we are unable to perform as well as SA-VAE trained with 10 refinement steps, indicating the importance of the good initial parameters provided by the inference network.

In Appendix C we further investigate the performance of VAE and SA-VAE as we vary the training set size and the capacity of the inference network/generative model. We find that SA-VAE outperforms VAE and has higher latent variable usage in all scenarios. We note that we do not outperform the state-of-the-art models that employ hierarchical latent variables and/or more sophisticated priors (Chen et al., 2017; Tomczak & Welling, 2018). However, these additions are largely orthogonal to our approach, and we hypothesize they will also benefit from combining amortized inference with iterative refinement.[11]

[11] Indeed, Cremer et al. (2018) observe that the amortization gap can be substantial for VAE trained with richer variational families.

5. Discussion
5.1. Learned Latent Space

For the text model we investigate what the latent variables are learning through saliency analysis with our best model (SA-VAE trained with 20 steps). Specifically, we calculate the output saliency of each token $x_t$ with respect to $z$ as

$\mathbb{E}_{q(z;\lambda)}\Big[\big\| \tfrac{d \log p(x_t \mid x_{<t}, z)}{dz} \big\|_2\Big]$

Saliency visualizations for some examples from the test set are shown in Figure 3 (top). The saliency of the end-of-sentence token is quite high, indicating that the length information is also encoded in the latent space. In the third example we observe that the left parenthesis has higher saliency than the right parenthesis (0.32 vs. 0.24 on average across the test set), as the latter can be predicted by conditioning on the former rather than on the latent representation $z$.
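A minimal sketch (an assumption: PyTorch, and a hypothetical `log_probs_fn(z)` that returns the vector of per-token $\log p(x_t \mid x_{<t}, z)$ for a fixed sentence) of this Monte Carlo saliency estimate is given below.

```python
import torch

def output_saliency(log_probs_fn, mu, log_var, n_samples=10):
    # Estimates E_{q(z;lambda)}[ || d log p(x_t | x_{<t}, z) / dz ||_2 ] for each token.
    saliency = 0.0
    for _ in range(n_samples):
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # z ~ q(z; lambda)
        z = z.detach().requires_grad_(True)
        log_probs = log_probs_fn(z)                                # shape (T,), one entry per token
        norms = []
        for t in range(log_probs.size(0)):
            grad_z, = torch.autograd.grad(log_probs[t], z, retain_graph=True)
            norms.append(grad_z.norm(p=2))                         # || d log p(x_t | ...) / dz ||_2
        saliency = saliency + torch.stack(norms) / n_samples
    return saliency                                                # shape (T,)
```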
The previous definition of saliency measures the influence of $z$ on the output $x_t$. We can also roughly measure the influence of the input $x_t$ on the latent representation $z$, which we refer to as input saliency:

$\mathbb{E}_{q(z;\lambda)}\Big[\big\| \tfrac{d \|z\|_2}{d w_t} \big\|_2\Big]$

Here $w_t$ is the encoder word embedding for $x_t$.[12] We visualize the input saliency for a test example (Figure 3, middle) and a made-up example (Figure 3, bottom). Under each input example we also visualize two samples from the variational posterior, and find that the generated examples are often meaningfully related to the input example.[13]

Figure 3. (Top) Saliency visualization of some examples from the test set. Here the saliency values are rescaled to be between 0-100 within each example for easier visualization. Red indicates higher saliency values. (Middle) Input saliency of the first test example from the top (in blue), in addition to two sample outputs generated from the variational posterior (with their saliency values in red). (Bottom) Same as the middle except we use a made-up example. Best viewed in color.

[12] As the norm of $z$ is a rather crude measure, a better measure would be obtained by analyzing the spectra of the Jacobian $\frac{dz}{dw_t}$. However, this is computationally too expensive to calculate for each token in the corpus.
[13] We first sample $z \sim q(z; \lambda_K)$ and then $x \sim p(x \mid z; \theta)$, sampling each token $x_t \sim p(x_t \mid x_{<t}, z)$.

We quantitatively analyze output saliency across part-of-speech, token position, word frequency, and log-likelihood in Figure 4: nouns (NN), adjectives (JJ), verbs (VB), numbers (CD), and the end-of-sentence token have higher saliency than conjunctions (CC), determiners (DT), prepositions (IN), and the TO token, as the latter are relatively easier to predict by conditioning on previous tokens. Similarly, on average, tokens occurring earlier have much higher saliency than those occurring later, and the end-of-sentence token has high saliency but is relatively easy to predict, with an average log-likelihood of -1.61 (vs. an average log-likelihood of -4.10 for all tokens). Appendix D has the corresponding analysis for input saliency, which is qualitatively similar.

Figure 4. Output saliency by part-of-speech tag, position, log frequency, and log-likelihood. See Section 5.1 for the definitions of output saliency. The dotted gray line in each plot shows the average saliency across all words.

These results seem to suggest that the latent variables are encoding interesting and potentially interpretable aspects of language. While left as future work, it is possible that manipulations in the latent space of a model learned this way could lead to controlled generation/manipulation of output text (Hu et al., 2017; Mueller et al., 2017).

5.2. Limitations

A drawback of our approach (and other non-amortized inference methods) is that each training step requires backpropagating through the generative model multiple times, which can be costly, especially if the generative model is expensive to compute (e.g. LSTM/PixelCNN). This may potentially be mitigated through more sophisticated meta-learning approaches (Andrychowicz et al., 2016; Marino et al., 2018), or with more efficient use of the past gradient information during SVI via averaging (Schmidt et al., 2013) or importance sampling (Sakaya & Klami, 2017). One could also consider employing synthetic gradients (Jaderberg et al., 2017) to limit the number of backpropagation steps during training. Krishnan et al. (2018) observe that it is more important to train with iterative refinement during earlier stages (we also observed this in preliminary experiments), and therefore annealing the number of refinement steps as training progresses could also speed up training.

Our approach is mainly applicable to variational families that avail themselves to differentiable optimization (e.g. gradient ascent) with respect to the ELBO, which includes much recent work on employing more flexible variational families with VAEs. In contrast, VAE + SVI and VAE + SVI + KL are applicable to more general optimization algorithms.

6. Related Work

Our work is most closely related to the line of work which uses a separate model to initialize variational parameters and subsequently updates them through an iterative procedure (Salakhutdinov & Larochelle, 2010; Cho et al., 2013; Salimans et al., 2015; Hjelm et al., 2016; Krishnan et al., 2018; Pu et al., 2017). Marino et al.
(2018) utilize meta-learning to train an inference network which learns to perform iterative inference by training a deep model to output the variational parameters for each time step. While differentiating through inference/optimization was initially explored by various researchers primarily outside the area of deep learning (Stoyanov et al., 2011; Domke, 2012; Brakel et al., 2013), it has more recently been explored in the context of hyperparameter optimization (Maclaurin et al., 2015) and as a differentiable layer of a deep model (Belanger et al., 2017; Kim et al., 2017; Metz et al., 2017; Amos & Kolter, 2017).

Initial work on VAE-based approaches to image modeling focused on simple generative models that assumed independence among pixels conditioned on the latent variable (Kingma & Welling, 2014; Rezende et al., 2014). More recent works have obtained substantial improvements in log-likelihood and sample quality by utilizing powerful autoregressive models (PixelCNN) as the generative model (Chen et al., 2017; Gulrajani et al., 2017). In contrast, modeling text with VAEs has remained challenging. Bowman et al. (2016) found that using an LSTM generative model resulted in a degenerate case whereby the variational posterior collapsed to the prior and the generative model ignored the latent code (even with richer variational families). Many works on VAEs for text have thus made simplifying conditional independence assumptions (Miao et al., 2016; 2017), used less powerful generative models such as convolutional networks (Yang et al., 2017; Semeniuta et al., 2017), or combined a recurrent generative model with a topic model (Dieng et al., 2017; Wang et al., 2018). Note that unlike sequential VAEs that employ different latent variables at each time step (Chung et al., 2015; Fraccaro et al., 2016; Krishnan et al., 2017; Serban et al., 2017; Goyal et al., 2017a), in this work we focus on modeling the entire sequence with a global latent variable.

Finally, since our work only addresses the amortization gap (the gap between the log-likelihood and the ELBO due to amortization) and not the approximation gap (due to the choice of a particular variational family) (Cremer et al., 2018), it can be combined with existing work on employing richer posterior/prior distributions within the VAE framework (Rezende & Mohamed, 2015; Kingma et al., 2016; Johnson et al., 2016; Tran et al., 2016; Goyal et al., 2017b; Guu et al., 2017; Tomczak & Welling, 2018).

7. Conclusion

This work outlines semi-amortized variational autoencoders, which combine amortized inference with local iterative refinement to train deep generative models of text and images. With this approach we find that we are able to train deep latent variable models of text with an expressive autoregressive generative model that does not ignore the latent code. From the perspective of learning latent representations, one might question the prudence of using an autoregressive model that fully conditions on its entire history (as opposed to assuming some conditional independence), given that $p(x)$ can always be factorized as $\prod_{t=1}^{T} p(x_t \mid x_{<t})$.