# Variational Laplace Autoencoders

Yookoon Park 1, Chris Dongjoo Kim 1, Gunhee Kim 1

1 Neural Processing Research Center, Seoul National University, Seoul, South Korea. Correspondence to: Gunhee Kim.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

## Abstract

Variational autoencoders (Kingma & Welling, 2014) employ an amortized inference model to approximate the posterior of latent variables. However, such amortized variational inference faces two challenges: (1) the limited posterior expressiveness of the fully-factorized Gaussian assumption and (2) the amortization error of the inference model. We present a novel approach that addresses both challenges. First, we focus on ReLU networks with Gaussian output and illustrate their connection to probabilistic PCA. Building on this observation, we derive an iterative algorithm that finds the mode of the posterior and apply a full-covariance Gaussian posterior approximation centered on the mode. Subsequently, we present a general framework named Variational Laplace Autoencoders (VLAEs) for training deep generative models. Based on the Laplace approximation of the latent variable posterior, VLAEs enhance the expressiveness of the posterior while reducing the amortization error. Empirical results on MNIST, Omniglot, Fashion-MNIST, SVHN and CIFAR-10 show that the proposed approach significantly outperforms other recent amortized or iterative methods on ReLU networks.

## 1. Introduction

Variational autoencoders (VAEs) (Kingma & Welling, 2014) are deep latent generative models that have been popularly used in various domains of data such as images, natural language and sound (Gulrajani et al., 2017; Gregor et al., 2015; Bowman et al., 2016; Chung et al., 2015; Roberts et al., 2018). VAEs introduce an amortized inference network (i.e. an encoder) to approximate the posterior distribution of the latent variable $z$ and maximize the evidence lower bound (ELBO) of the data. However, the two major limitations of VAEs are: (1) the constrained expressiveness of the fully-factorized Gaussian posterior assumption and (2) the amortization error (Cremer et al., 2018) of the inference model due to the dynamic posterior prediction.

There have been various attempts to address these problems. A representative line of work belongs to the category of normalizing flows (Rezende & Mohamed, 2015; Kingma et al., 2016; Tomczak & Welling, 2016), which apply a chain of invertible transformations with tractable densities in order to represent a more flexible posterior distribution. However, not only do they incur additional parameter overhead for the inference model, but they also remain prone to the amortization error as they entirely depend on the dynamic inference. Recently, iterative approaches based on gradient-based refinement of the posterior parameters have been proposed (Krishnan et al., 2018; Kim et al., 2018; Marino et al., 2018). These methods aim to reduce the amortization error by augmenting the dynamic inference with an additional inner-loop optimization of the posterior. Nonetheless, they still rely on the fully-factorized Gaussian assumption, and accordingly fail to enhance the expressiveness of the posterior.
We develop a novel approach that tackles both challenges by (1) iteratively updating the mode of the approximate posterior and (2) defining a full-covariance Gaussian posterior centered on the mode, whose covariance is determined by the local behavior of the generative network. By deducing the approximate posterior directly from the generative network, we are not only able to minimize the amortization error but also to model the rich correlations between the latent variable dimensions. We start from the class of neural networks with rectified linear activations (e.g. ReLU) (Montufar et al., 2014; Pascanu et al., 2014) and Gaussian output, which is universally popular for modeling continuous data such as images (Krizhevsky et al., 2012; Gregor et al., 2015) and also as a building block of deep latent models (Rezende et al., 2014; Sønderby et al., 2016). Subsequently, we present a generalized framework named Variational Laplace Autoencoders (VLAEs), which encompasses the general class of differentiable neural networks. We show that the ReLU networks are in fact a special case of VLAEs that admits efficient computation. In addition, we illustrate an example for Bernoulli output networks.

In summary, the contributions of this work are as follows:

- We relate Gaussian output ReLU networks to probabilistic PCA (Tipping & Bishop, 1999) and thereby derive an iterative update for finding the mode of the posterior and a full-covariance Gaussian posterior approximation at the mode.
- We present Variational Laplace Autoencoders (VLAEs), a general framework for training deep generative models based on the Laplace approximation of the latent variable posterior. VLAEs not only minimize the amortization error but also provide the expressive power of a full-covariance Gaussian, with no additional parameter overhead for the inference model. To the best of our knowledge, this work is the first attempt to apply the Laplace approximation to the training of deep generative models.
- We evaluate our approach on five benchmark datasets: MNIST, Omniglot, Fashion-MNIST, SVHN and CIFAR-10. Empirical results show that VLAEs bring significant improvements over other recent amortized or iterative approaches, including the VAE (Kingma & Welling, 2014), semi-amortized VAEs (Kim et al., 2018; Marino et al., 2018; Krishnan et al., 2018) and the VAE with Householder Flow (Tomczak & Welling, 2016).

## 2. Background

### 2.1. Variational Autoencoders

For the latent variable model $p_\theta(x, z) = p_\theta(x|z)p(z)$, variational inference (VI) (Hinton & Camp, 1993; Waterhouse et al., 1996; Jordan et al., 1999) fits an approximate distribution $q(z; \lambda)$ to the intractable posterior $p_\theta(z|x)$ and maximizes the evidence lower bound (ELBO):

$$\mathcal{L}_\theta(x; \lambda) = \mathbb{E}_{q(z;\lambda)}[\ln p_\theta(x, z) - \ln q(z; \lambda)] \quad (1)$$
$$= \ln p_\theta(x) - D_{KL}(q(z; \lambda)\,\|\,p_\theta(z|x)) \quad (2)$$
$$\leq \ln p_\theta(x), \quad (3)$$

where $q(z; \lambda)$ is assumed to be a simple distribution such as a diagonal Gaussian and $\lambda$ is the variational parameter of the distribution (e.g. the mean and covariance). The learning involves first finding the optimal variational parameter $\lambda^*$ that minimizes the variational gap $D_{KL}(q(z; \lambda)\,\|\,p_\theta(z|x))$ between the ELBO and the true marginal log-likelihood, and then updating the generative model $\theta$ using $\mathcal{L}_\theta(x; \lambda^*)$.

Variational autoencoders (VAEs) (Kingma & Welling, 2014) amortize the optimization problem of $\lambda$ using the inference model $\phi$ that dynamically predicts the approximate posterior as a function of $x$:

$$\mathcal{L}_{\theta,\phi}(x) = \mathbb{E}_{q_\phi(z|x)}[\ln p_\theta(x, z) - \ln q_\phi(z|x)]. \quad (4)$$

The generative model (decoder) and the inference model (encoder) are jointly optimized. While such amortized variational inference (AVI) is highly efficient, two fundamental challenges remain: (1) the limited expressiveness of the approximate posterior and (2) the amortization error, both of which are discussed in the following sections.
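To make the objective of Eq. (4) concrete, below is a minimal NumPy sketch of a single-sample Monte Carlo estimate of the ELBO for a diagonal-Gaussian posterior and a Gaussian likelihood. The function name, the toy linear decoder and all parameter values are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np

def elbo_diag_gaussian(x, mu, log_var, decode, sigma2=1.0, rng=np.random.default_rng(0)):
    """Single-sample ELBO estimate (Eq. 4) for q(z|x) = N(mu, diag(exp(log_var))),
    p(x|z) = N(decode(z), sigma2 * I) and p(z) = N(0, I)."""
    # Reparameterization: z = mu + sigma * eps
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps
    # log p(x|z): Gaussian log-likelihood with spherical covariance
    x_hat = decode(z)
    D = x.size
    log_px_z = -0.5 * (np.sum((x - x_hat) ** 2) / sigma2 + D * np.log(2 * np.pi * sigma2))
    # Closed-form KL(q(z|x) || p(z)) for diagonal Gaussians
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return log_px_z - kl

# Toy usage with a fixed linear "decoder"
W = np.random.default_rng(1).standard_normal((4, 2))
elbo = elbo_diag_gaussian(x=np.ones(4), mu=np.zeros(2), log_var=np.zeros(2),
                          decode=lambda z: W @ z)
print(elbo)
```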
### 2.2. Limited Posterior Expressiveness

VAEs approximate the posterior with the fully-factorized Gaussian $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \mathrm{diag}(\sigma^2_\phi(x)))$. However, the fully-factorized Gaussian may fail to accurately capture the complex true posterior distribution, causing the approximation error (Cremer et al., 2018). As the ELBO (Eq. (4)) tries to reduce the gap $D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x))$, it will force the true posterior $p_\theta(z|x)$ to match the fully-factorized Gaussian $q_\phi(z|x)$, which negatively affects the capacity of the generative model (Mescheder et al., 2017). Hence, it is encouraged to use more expressive families of distributions; for example, one natural expansion is to model the full covariance matrix $\Sigma$ of the Gaussian. However, this requires $O(D^2)$ variational parameters to be predicted, placing a heavy burden on the inference model. Normalizing flows (Rezende & Mohamed, 2015; Kingma et al., 2016; Tomczak & Welling, 2016) approach this issue by using a class of invertible transformations whose densities are relatively easy to compute. However, the flow-based methods have drawbacks in that they incur additional parameter overhead for the inference model and are prone to the amortization error described below.

### 2.3. The Amortization Error

Another problem of VAEs stems from the nature of amortized inference, where the variational parameter $\lambda$ is not explicitly optimized but is dynamically predicted by the inference model. The error of the dynamic inference is referred to as the amortization error and is closely related to the performance of the generative model (Cremer et al., 2018). The suboptimal posterior predictions loosen the bound in the ELBO (Eq. (2)) and result in biased gradient signals flowing to the generative parameters. Recently, Kim et al. (2018) and Marino et al. (2018) address this issue by iteratively updating the predicted variational parameters using gradient-based optimization. However, they still rely on the fully-factorized Gaussian assumption, limiting the expressive power of the posterior.

## 3. Approach

We first describe probabilistic PCA (Section 3.1) and piece-wise linear ReLU networks (Section 3.2). Based on the local linearity of such networks, we derive an iterative approach to find the mode of the posterior and define a full-covariance Gaussian posterior at the mode (Section 3.3). Finally, we present a general framework for training deep generative models using the Laplace approximation of the latent variable posterior (Section 3.4).

Figure 1. A VAE's manifold learned on 2D toy data using a 1D latent variable. The VAE with rectified linear activation learns a piece-wise linear manifold and locally performs probabilistic PCA.

### 3.1. Probabilistic PCA

Probabilistic PCA (Tipping & Bishop, 1999) relates a latent variable $z$ to data $x$ through a linear mapping $W$ as

$$p(z) = \mathcal{N}(0, I), \quad (5)$$
$$p_\theta(x|z) = \mathcal{N}(Wz + b, \sigma^2 I), \quad (6)$$

where $W, b, \sigma$ are the parameters to be learned. Note that this is basically a linear version of VAEs. Under this particular model, the posterior distribution of $z$ given $x$ can be computed in closed form (Tipping & Bishop, 1999):

$$p_\theta(z|x) = \mathcal{N}\!\left(\frac{1}{\sigma^2}\Sigma W^\top(x - b),\; \Sigma\right), \quad (7)$$

where

$$\Sigma = \left(\frac{1}{\sigma^2} W^\top W + I\right)^{-1}. \quad (8)$$
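As a small worked example of Eqs. (5)-(8), the following NumPy snippet computes the exact pPCA posterior mean and covariance in closed form; the dimensions, parameter values and the helper name are made up for illustration.

```python
import numpy as np

def ppca_posterior(x, W, b, sigma2):
    """Exact posterior p(z|x) of probabilistic PCA (Eqs. 7-8).
    Returns the mean and the full covariance Sigma = (W^T W / sigma2 + I)^{-1}."""
    d = W.shape[1]
    Sigma = np.linalg.inv(W.T @ W / sigma2 + np.eye(d))
    mean = Sigma @ W.T @ (x - b) / sigma2
    return mean, Sigma

# Toy example: 5-dimensional data, 2-dimensional latent
rng = np.random.default_rng(0)
W, b = rng.standard_normal((5, 2)), rng.standard_normal(5)
x = W @ rng.standard_normal(2) + b + 0.1 * rng.standard_normal(5)
mean, Sigma = ppca_posterior(x, W, b, sigma2=0.01)
print(mean, Sigma)
```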
### 3.2. Piece-wise Linear Neural Networks

Consider the following ReLU network $y = g_\theta(z)$:

$$h_{l+1} = \mathrm{ReLU}(W_l h_l + b_l), \quad \text{for } l = 0, \dots, L-1, \quad (9)$$
$$y = W_L h_L + b_L, \quad (10)$$

where $h_0 = z$, $\mathrm{ReLU}(x) = \max(0, x)$ and $L$ is the number of layers. Our motivation is based on the observation that neural networks with rectified linear activations (e.g. ReLU, Leaky ReLU, Maxout) are piece-wise linear (Pascanu et al., 2014; Montufar et al., 2014). That is, the network segments the input space into linear regions within which it locally behaves as a linear function:

$$g_\theta(z + \epsilon) \approx W_z(z + \epsilon) + b_z, \quad (11)$$

where the subscripts denote the dependence on $z$. To see this, note that applying a ReLU activation is equivalent to multiplying by a corresponding mask matrix $O$:

$$\mathrm{ReLU}(Wx + b) = O(Wx + b), \quad (12)$$

where $O$ is a diagonal matrix whose diagonal element $o_i$ defines the activation pattern (Pascanu et al., 2014):

$$o_i = \begin{cases} 1 & \text{if } w_i^\top x + b_i > 0, \\ 0 & \text{otherwise.} \end{cases} \quad (13)$$

The set of inputs $x$ that satisfies the activation pattern $\{o_i\}_{i=1}^d$, i.e. $\{x \mid \mathbb{I}(w_i^\top x + b_i > 0) = o_i, \text{ for } i = 1, \dots, d\}$, defines a convex polytope as it is an intersection of half-spaces. Accordingly, within this convex polytope the mask matrix $O$ is constant. We can obtain $W_z$ and $b_z$ in Eq. (11) by computing the activation masks during the forward pass and recursively multiplying them with the network weights:

$$y = W_L \mathrm{ReLU}(W_{L-1} h_{L-1} + b_{L-1}) + b_L \quad (14)$$
$$= W_L O_{L-1}(W_{L-1} h_{L-1} + b_{L-1}) + b_L \quad (15)$$
$$= W_L O_{L-1} W_{L-1} \cdots O_0 W_0 z + \dots \quad (16)$$
$$= W_z z + b_z. \quad (17)$$

Note that $W_z$ is the Jacobian of the network, $\partial g_\theta(z) / \partial z$. Similar results apply to other kinds of piece-wise linear activations such as Leaky ReLU (Maas et al., 2013) and Maxout (Goodfellow et al., 2013).

### 3.3. Posterior Inference for Piece-wise Linear Networks

Consider the following nonlinear latent generative model:

$$p(z) = \mathcal{N}(0, I), \quad (18)$$
$$p_\theta(x|z) = \mathcal{N}(g_\theta(z), \sigma^2 I), \quad (19)$$

where $g_\theta(z)$ is the ReLU network. In general, such a nonlinear model does not allow analytical computation of the posterior. Instead of using the amortized prediction $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \mathrm{diag}(\sigma^2_\phi(x)))$ like VAEs, we present a novel approach that exploits the piece-wise linearity of generative networks. The results in the previous section hint that the ReLU network $g_\theta(z)$ learns a piece-wise linear manifold of the data as illustrated in Fig. 1, meaning that the model is locally equivalent to probabilistic PCA. Based on this observation, we propose a new approach for posterior approximation which consists of two parts: (1) find the mode of the posterior where the probability density is mostly concentrated, and (2) apply a local linear approximation of the generative network (Eq. (11)) at the mode and analytically compute the posterior using the results of probabilistic PCA (Eq. (7)).

Algorithm 1: Posterior inference for piece-wise linear nets

    Input: data x, piece-wise linear generative network g_theta, inference network enc_phi,
           update steps T, decay alpha_t
    Output: full-covariance Gaussian posterior q(z|x)
    mu_0 = enc_phi(x)
    for t = 0 to T - 1 do
        Compute linear approximation g_theta(z) ~= W_t z + b_t at mu_t
        Sigma_t <- (sigma^{-2} W_t^T W_t + I)^{-1}
        mu* <- sigma^{-2} Sigma_t W_t^T (x - b_t)
        mu_{t+1} <- (1 - alpha_t) mu_t + alpha_t mu*
    end for
    Compute linear approximation g_theta(z) ~= W_T z + b_T at mu_T
    Sigma_T <- (sigma^{-2} W_T^T W_T + I)^{-1}
    q(z|x) = N(mu_T, Sigma_T)
    Return q(z|x)

Algorithm 1 outlines the proposed method; a minimal sketch of the linearization and a single update step is given below.
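The sketch below, under the simplifying assumption of a small fully-connected decoder, implements the local linearization of Eqs. (12)-(17) and a single mode update of Eq. (20) in NumPy. The helper names (`local_linearization`, `mode_update`) and the toy two-layer decoder are hypothetical, not the paper's code.

```python
import numpy as np

def local_linearization(z, weights, biases):
    """Compute (W_z, b_z) such that g(z') ~= W_z z' + b_z near z (Eqs. 12-17).
    The network is h_{l+1} = ReLU(W_l h_l + b_l), y = W_L h_L + b_L."""
    h = z
    W_z = np.eye(z.size)
    for W, b in zip(weights[:-1], biases[:-1]):
        pre = W @ h + b
        mask = (pre > 0).astype(pre.dtype)   # diagonal of the mask matrix O (Eq. 13)
        h = mask * pre                       # ReLU written via the activation mask
        W_z = (mask[:, None] * W) @ W_z      # accumulate O_l W_l ... O_0 W_0
    W_L, b_L = weights[-1], biases[-1]
    y = W_L @ h + b_L
    W_z = W_L @ W_z
    b_z = y - W_z @ z                        # offset of the local linear map
    return W_z, b_z

def mode_update(x, z, weights, biases, sigma2):
    """One inner step of Algorithm 1: the pPCA posterior mean under the
    linearization at the current estimate (Eq. 20)."""
    W_t, b_t = local_linearization(z, weights, biases)
    Sigma_t = np.linalg.inv(W_t.T @ W_t / sigma2 + np.eye(z.size))
    return Sigma_t @ W_t.T @ (x - b_t) / sigma2, Sigma_t

# Toy decoder: 2 -> 8 -> 5
rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 2)), rng.standard_normal((5, 8))]
biases = [rng.standard_normal(8), rng.standard_normal(5)]
x = rng.standard_normal(5)
mu, Sigma = mode_update(x, np.zeros(2), weights, biases, sigma2=0.1)
```

In a practical implementation the activation masks would be collected layer by layer during the forward pass of the actual decoder, as described above, rather than recomputed from explicit weight matrices.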
We derive an update equation for the posterior mode by exploiting the local linearity of ReLU networks. The result of probabilistic PCA (Eq. (7)) gives the solution for the posterior mode under the linear model $y = Wz + b$. Based on this insight, we first assume a linear approximation to the generative network, $g_\theta(z) \approx W_t z + b_t$, at the current estimate $\mu_t$ at step $t$ and update our mode estimate using the solution (Eq. (7)) under this linear model:

$$\mu_{t+1} = \frac{1}{\sigma^2}\Sigma_t W_t^\top (x - b_t), \quad (20)$$

where $\Sigma_t$ is defined as in Eq. (8). To take advantage of the efficiency of amortized inference, we initialize the estimate using an inference model (encoder) as $\mu_0 = \mathrm{enc}_\phi(x)$ and iterate the update (Eq. (20)) for $T$ steps. Fig. 2 illustrates the process of iterative mode updates. We find that smoothing the update with a decay $\alpha_t < 1$ improves the stability of the algorithm:

$$\mu_{t+1} = (1 - \alpha_t)\mu_t + \alpha_t \mu^*, \quad (21)$$
$$\text{where } \mu^* = \frac{1}{\sigma^2}\Sigma_t W_t^\top (x - b_t). \quad (22)$$

Finally, by assuming the linear model $g_\theta(z) \approx W_T z + b_T$ at $\mu_T$, the approximate Gaussian posterior is defined as

$$q(z|x) = \mathcal{N}(\mu_T, \Sigma_T), \quad (23)$$
$$\text{where } \Sigma_T = \left(\frac{1}{\sigma^2} W_T^\top W_T + I\right)^{-1}. \quad (24)$$

Figure 2. Illustration of the iterative update (Eq. (20)) for the posterior mode, drawn in the data space; panels: (a) estimate at step $t$, (b) linear approximation, (c) solution under linearity, (d) updated estimate. (a) The posterior mode $\mu^*$ corresponds to the best reconstruction $\hat{x}^*$ of data $x$ on the network manifold, i.e. $\hat{x}^* = g_\theta(\mu^*)$; $\hat{x}_t$ shows the estimate at step $t$. (b) We apply a linear approximation to the network (dashed line). (c) We solve for $\mu_{t+1}$ under this linear model (Eq. (20)). The network warps the result according to its manifold (dotted arrow). (d) Updated estimate: $\hat{x}_{t+1}$ is now closer to $\hat{x}^*$ than the previous $\hat{x}_t$. (Here we assume $\sigma^2 \to 0$ for the purpose of illustration; for $\sigma^2 > 0$, the prior shrinks the mode toward zero.)

We train our model by optimizing the ELBO in Eq. (4), plugging in $q(z|x)$ of Eq. (23). For sampling from the multivariate Gaussian and propagating the gradient to the inference model, we calculate the Cholesky decomposition $LL^\top = \Sigma_T$ and apply the reparameterization $z = \mu + L\epsilon$, where $\epsilon$ is standard Gaussian noise; a short sketch of this sampling step appears at the end of this subsection.

We highlight the notable characteristics of our approach:

- We iteratively update the posterior mode rather than solely relying on the amortized prediction. This is in spirit similar to semi-amortized inference (Kim et al., 2018; Marino et al., 2018; Krishnan et al., 2018), but the critical differences are: (1) our method can make larger jumps than prevalent gradient-based methods by exploiting the local linearity of the network, and (2) it is efficient and deterministic since it requires no sampling during updates. Fig. 3 intuitively depicts these effects.
- We gain the expressiveness of the full-covariance Gaussian posterior (Fig. 4), where the covariance is analytically computed from the local behavior of the generative network. This is in contrast with normalizing flows (Rezende & Mohamed, 2015; Kingma et al., 2016; Tomczak & Welling, 2016), which introduce extra parameter overhead for the inference model and hence are prone to the amortization error.

Figure 3. Illustration of the iterative mode update (Eq. (20)) on the ELBO landscape. The models are trained on MNIST using a 2-dim latent variable. The update paths for the posterior mode $\mu$ are depicted for five steps. The VLAE obtains the closest estimate, making large jumps during the process for faster convergence.

Figure 4. The covariance matrix of $q(z|x)$ from four different models: (a) VAE (Kingma & Welling, 2014), (b) Semi-Amortized VAE (Kim et al., 2018), (c) Householder Flow (Tomczak & Welling, 2016), (d) VLAE. While the former two only model diagonal elements, the latter two can model the full covariance. The VLAE captures the richest correlations between dimensions.
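The reparameterized sampling step mentioned above can be sketched as follows; the density evaluation is included because $-\ln q(z|x)$ enters the ELBO of Eq. (4). The function names and the toy covariance are illustrative only.

```python
import numpy as np

def sample_full_cov(mu, Sigma, rng=np.random.default_rng(0)):
    """Reparameterized sample z = mu + L eps with L L^T = Sigma (Eq. 23)."""
    L = np.linalg.cholesky(Sigma)
    eps = rng.standard_normal(mu.shape)
    return mu + L @ eps

def log_q(z, mu, Sigma):
    """log N(z; mu, Sigma), i.e. the -ln q(z|x) term of the ELBO in Eq. (4)."""
    d = mu.size
    diff = z - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

# Toy usage with a random symmetric positive-definite covariance
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + np.eye(3)
mu = np.zeros(3)
z = sample_full_cov(mu, Sigma, rng)
print(log_q(z, mu, Sigma))
```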
### 3.4. Variational Laplace Autoencoders

So far, the proposed approach assumes neural networks with rectified linear activations and Gaussian output. We here present a general framework, Variational Laplace Autoencoders (VLAEs), which is applicable to the general class of differentiable neural networks. To estimate the posterior $p_\theta(z|x)$, VLAEs employ the Laplace approximation (Bishop, 2006), which finds a Gaussian approximation based on the local curvature at the posterior mode. We present how the Laplace method is incorporated for training deep generative models, and show that the model in Section 3.3 is a special case of VLAEs that admits efficient computation.

Consider the problem of approximating the posterior $p_\theta(z|x)$ where we only have access to the unnormalized density $p_\theta(x, z)$. The Laplace method finds a Gaussian approximation $q(z|x)$ centered on the mode of $p_\theta(z|x)$, where the covariance is determined by the local curvature of $\log p_\theta(x, z)$.

Algorithm 2: Variational Laplace Autoencoders

    Input: generative model theta, inference model phi
    Sample x ~ p_data(x)
    Initialize mu = enc_phi(x)
    for t = 0 to T - 1 do
        Update mu (e.g. using gradient descent)
    end for
    Sigma <- (-grad^2_z ln p_theta(x, z)|_{z=mu})^{-1}
    q(z|x) = N(mu, Sigma)
    Sample eps ~ N(0, I)
    Compute the Cholesky decomposition L L^T = Sigma
    z <- mu + L eps
    Estimate the ELBO: L_{theta,phi}(x) = ln p_theta(x, z) - ln q_phi(z|x)
    Update generative model: theta <- theta + alpha * grad_theta L_{theta,phi}(x)
    Update inference model: phi <- phi + alpha * grad_phi L_{theta,phi}(x)

The overall procedure is largely divided into two parts: (1) finding the mode of the posterior distribution and (2) computing the Gaussian approximation centered at the mode. First, we iteratively search for the mode $\mu$ of $p_\theta(z|x)$ via

$$\nabla_z \log p_\theta(x, z)\big|_{z=\mu} = 0. \quad (25)$$

We can generally apply gradient-based optimization for this purpose. After determining the mode, we take the second-order Taylor expansion centered at $\mu$:

$$\log p_\theta(x, z) \approx \log p_\theta(x, \mu) - \frac{1}{2}(z - \mu)^\top \Lambda (z - \mu), \quad (26)$$
$$\text{where } \Lambda = -\nabla^2_z \log p_\theta(x, z)\big|_{z=\mu}. \quad (27)$$

As this form is equivalent to a Gaussian distribution, we define the approximate posterior as

$$q(z|x) = \mathcal{N}(\mu, \Sigma), \quad (28)$$
$$\text{where } \Sigma^{-1} = \Lambda = -\nabla^2_z \log p_\theta(x, z)\big|_{z=\mu}. \quad (29)$$

This posterior distribution is then used to estimate the ELBO in Eq. (4) for training the generative model. Algorithm 2 summarizes the proposed framework.

**Gaussian output ReLU networks.** We now make a connection to the approach in Section 3.3. For the generative model defined in Eqs. (18)-(19), the joint log-likelihood is

$$\log p_\theta(x, z) = -\frac{1}{2\sigma^2}(x - g_\theta(z))^\top (x - g_\theta(z)) - \frac{1}{2}z^\top z + C, \quad (30)$$

where $C$ is a constant independent of $z$. Taking the gradient with respect to $z$ and setting it to zero,

$$\nabla_z \log p_\theta(x, z) = -\frac{1}{\sigma^2}\left(\frac{\partial g_\theta(z)}{\partial z}\right)^{\top} (g_\theta(z) - x) - z = 0.$$

By assuming the linear approximation $g_\theta(z) \approx W_z z + b_z$ (Eq. (11)) and plugging it in, the solution is

$$z = \frac{1}{\sigma^2}\left(\frac{1}{\sigma^2} W_z^\top W_z + I\right)^{-1} W_z^\top (x - b_z), \quad (31)$$

which is equivalent to the update equation of Eq. (20). Moreover, under the linear model above,

$$\Lambda = -\nabla^2_z \log p_\theta(x, z) = \frac{1}{\sigma^2} W_z^\top W_z + I, \quad (32)$$

which agrees with $\Sigma^{-1}$ in Eq. (7). That is, by using the local linearity assumption, we can estimate the covariance without the expensive computation of the Hessian of the generative network.
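For a generic differentiable decoder with no such closed form, the two Laplace steps of Algorithm 2 (Eqs. (25)-(29)) might look as in the sketch below. Plain gradient ascent and finite-difference derivatives are used purely for illustration; the toy tanh decoder and all names are assumptions, and a practical implementation would instead rely on automatic differentiation as discussed in Section 3.5.

```python
import numpy as np

def numerical_grad(f, z, eps=1e-5):
    """Central-difference gradient of a scalar function f at z."""
    g = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z)
        e[i] = eps
        g[i] = (f(z + e) - f(z - e)) / (2 * eps)
    return g

def numerical_hessian(f, z, eps=1e-4):
    """Central-difference Hessian of a scalar function f at z (symmetrized)."""
    d = z.size
    H = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        H[:, i] = (numerical_grad(f, z + e) - numerical_grad(f, z - e)) / (2 * eps)
    return 0.5 * (H + H.T)

def laplace_posterior(log_joint, mu0, steps=100, lr=0.1):
    """Ascend log p(x, z) toward its mode, then form the Gaussian
    q(z|x) = N(mu, Sigma) with Sigma^{-1} = -Hessian at the mode (Eqs. 25-29)."""
    mu = mu0.copy()
    for _ in range(steps):
        mu += lr * numerical_grad(log_joint, mu)
    Lambda = -numerical_hessian(log_joint, mu)
    return mu, np.linalg.inv(Lambda)

# Toy log-joint: Gaussian likelihood around a fixed nonlinear decoder g(z) = tanh(W z)
rng = np.random.default_rng(0)
W, x, sigma2 = rng.standard_normal((4, 2)), rng.standard_normal(4), 1.0
log_joint = lambda z: (-0.5 * np.sum((x - np.tanh(W @ z)) ** 2) / sigma2
                       - 0.5 * z @ z)
mu, Sigma = laplace_posterior(log_joint, np.zeros(2))
```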
**Bernoulli output ReLU networks.** We illustrate an example of how VLAEs can be applied to Bernoulli output ReLU networks:

$$\log p_\theta(x|z) = \sum_i x_i \log y_i(z) + (1 - x_i)\log(1 - y_i(z)), \quad \text{where } y(z) = \frac{1}{1 + \exp(-g_\theta(z))}. \quad (33)$$

Using $\partial \log p_\theta(x|z) / \partial g_\theta(z) = x - y(z)$ and the chain rule, the gradient and the Hessian are

$$\nabla_z \log p_\theta(x, z) = \left(\frac{\partial g_\theta(z)}{\partial z}\right)^{\top} (x - y(z)) - z, \quad (34)$$
$$\nabla^2_z \log p_\theta(x, z) = \left(\frac{\partial^2 g_\theta(z)}{\partial z^2}\right)^{\top} (x - y(z)) - \left(\frac{\partial g_\theta(z)}{\partial z}\right)^{\top} \mathrm{diag}\big(y(z) \odot (1 - y(z))\big)\frac{\partial g_\theta(z)}{\partial z} - I. \quad (35)$$

Plugging in the linear approximation $g_\theta(z) \approx W_z z + b_z$, the result simplifies to

$$\nabla_z \log p_\theta(x, z) = W_z^\top (x - y(z)) - z, \quad (36)$$
$$\nabla^2_z \log p_\theta(x, z) = -(W_z^\top S_z W_z + I), \quad (37)$$
$$\text{where } S_z = \mathrm{diag}\big(y(z) \odot (1 - y(z))\big). \quad (38)$$

To solve Eq. (36), we first apply the first-order approximation of $y(z)$:

$$y(z') \approx y(z) + \frac{\partial y(z)}{\partial z}(z' - z) \quad (39)$$
$$= y(z) + S_z W_z (z' - z). \quad (40)$$

We plug it into Eq. (36) and solve the equation for zero:

$$z' = (W_z^\top S_z W_z + I)^{-1} W_z^\top \big(x - y(z) + S_z W_z z\big).$$

This leads to an update equation for the mode similar to the Gaussian case (Eq. (20)):

$$\mu_{t+1} = \Sigma_t W_t^\top (x - b_t), \quad (41)$$
$$\text{where } \Sigma_t = (W_t^\top S_t W_t + I)^{-1}, \quad b_t = y_t - S_t W_t \mu_t.$$

After $T$ updates, the approximate posterior distribution for the Bernoulli output distribution is defined as

$$q(z|x) = \mathcal{N}(\mu_T, \Sigma_T). \quad (42)$$

### 3.5. Efficient Computation

For a network with width $O(D)$, the calculation of $W_z$ (Eqs. (16)-(17)) and the update equation (Eq. (20)) requires a series of matrix-matrix multiplications and a matrix inversion of $O(D^3)$ complexity, whereas the standard forward or backward propagation through the network takes a series of matrix-vector multiplications of $O(D^2)$ cost. As this computation can be burdensome for bigger networks (in our experiments, the computational overhead is affordable with GPU acceleration; for example, our MNIST model takes 10 GPU hours for 1,000 epochs on one Titan X Pascal), we discuss efficient alternatives for: (1) iterative mode seeking and (2) covariance estimation.

**Iterative mode seeking.** For piece-wise linear networks, nonlinear variants of Conjugate Gradient (CG) can be effective, as CG solves a system of linear equations efficiently. It requires the evaluation of the matrix-vector products $W_z z$ and $W_z^\top r$, where $r$ is the residual $x - (W_z z + b)$. These are computable during the forward and backward passes with $O(D^2)$ complexity per iteration. See the Appendix for details. For general differentiable neural networks, gradient-based optimizers such as SGD, momentum or Adam (Kingma & Ba, 2015) can be adopted to find the solution of Eq. (25). Backpropagation through the gradient-based updates requires the evaluation of Hessian-vector products, but there are efficient approximations such as the finite differences (LeCun et al., 1993) used in Kim et al. (2018).

**Covariance estimation.** Instead of analytically calculating the precision matrix $\Lambda = \Sigma^{-1} = \sigma^{-2}W^\top W + I$ (Eq. (32)), we may directly approximate the precision matrix using truncated SVD for the top $k$ singular values and vectors of $\Lambda$. The truncated SVD can be iteratively performed using the power method, where each iteration involves evaluation of the Jacobian-vector product $W^\top W z$. This can be computed through forward and backward propagation at $O(D^2)$ cost. One way to further accelerate the convergence of the power iterations is to extend the amortized inference model to predict $k$ seed vectors for the power method. However, we still need to compute the Cholesky decomposition of the covariance matrix for sampling.
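To illustrate the matrix-free covariance estimation described above, the sketch below runs a block power iteration on $\Lambda = \sigma^{-2}W^\top W + I$ using only products $Wv$ and $W^\top u$ (which, for a real network, would correspond to a forward-mode and a reverse-mode pass through the decoder). The explicit matrix here exists only to define those products, and the helper names are hypothetical.

```python
import numpy as np

def precision_matvec(v, matvec_W, rmatvec_W, sigma2):
    """Matrix-free product Lambda v with Lambda = W^T W / sigma2 + I (Eq. 32).
    In a network, matvec_W / rmatvec_W would be a JVP and a VJP."""
    return rmatvec_W(matvec_W(v)) / sigma2 + v

def top_k_eigs(matvec, d, k, iters=50, rng=np.random.default_rng(0)):
    """Block power (orthogonal) iteration for approximate top-k eigenpairs of a
    symmetric PSD operator that is only available through matvec."""
    Q = np.linalg.qr(rng.standard_normal((d, k)))[0]
    for _ in range(iters):
        Z = np.column_stack([matvec(Q[:, j]) for j in range(k)])
        Q, _ = np.linalg.qr(Z)
    eigvals = np.array([Q[:, j] @ matvec(Q[:, j]) for j in range(k)])  # Rayleigh quotients
    return eigvals, Q

# Toy check against the explicitly formed precision matrix
rng = np.random.default_rng(1)
W, sigma2 = rng.standard_normal((20, 6)), 0.25
matvec = lambda v: precision_matvec(v, lambda u: W @ u, lambda u: W.T @ u, sigma2)
vals, vecs = top_k_eigs(matvec, d=6, k=3)
print(np.sort(vals)[::-1])
print(np.sort(np.linalg.eigvalsh(W.T @ W / sigma2 + np.eye(6)))[::-1][:3])
```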
## 4. Related Work

Cremer et al. (2018) and Krishnan et al. (2018) reveal that VAEs suffer from an inference gap between the ELBO and the marginal log-likelihood. Cremer et al. (2018) decompose this gap into the sum of the approximation error and the amortization error. The approximation error results from the choice of a particular variational family, such as fully-factorized Gaussians, which restricts the distribution to be factorial or, more technically, to have a diagonal covariance matrix. On the other hand, the amortization error is caused by the suboptimality of the variational parameters due to the amortized predictions. This work is most akin to the line of works that contribute to reducing the amortization error by iteratively updating the variational parameters to improve the approximate posterior (Kim et al., 2018; Marino et al., 2018; Krishnan et al., 2018). Distinct from the prior approaches, which rely on gradient-based optimization, our method exploits the local linearity of the network to make more efficient updates.

Regarding the approximation error, various approaches have been proposed to improve the expressiveness of the posterior approximation. Importance weighted autoencoders (Yuri et al., 2015) learn flexible posteriors using importance weighting. Tran et al. (2016) incorporate Gaussian processes to enrich the posterior representation. Maaløe et al. (2016) augment the model with auxiliary variables, and Salimans et al. (2015) use Markov chains with Hamiltonian dynamics. Normalizing flows (Tabak & Turner, 2013; Rezende & Mohamed, 2015; Kingma et al., 2016; Tomczak & Welling, 2016) transform a simple initial distribution into an increasingly flexible one by using a series of flows, invertible transformations whose determinant of the Jacobian is easy to compute. For example, the Householder flow (Tomczak & Welling, 2016) can represent a Gaussian distribution with a full covariance, alike to our VLAEs. However, as opposed to the flow-based approaches, VLAEs introduce no additional parameters and are robust to the amortization error.

The Laplace approximation has been applied to estimate the uncertainty of the weight parameters of neural networks (MacKay, 1992; Ritter et al., 2018). However, due to the high dimensionality of neural network parameters (often over millions), strong assumptions on the structure of the Hessian matrix are required to make the computation feasible (LeCun et al., 1990; Ritter et al., 2018). On the other hand, the latent dimension of deep generative models is typically in the hundreds, rendering our approach practical.

## 5. Experiments

We evaluate our approach on five popular datasets: MNIST (LeCun et al., 1998), Omniglot (Lake et al., 2013), Fashion-MNIST (Xiao et al., 2017), Street View House Numbers (SVHN) (Wang et al., 2011) and CIFAR-10 (Krizhevsky & Hinton, 2009). We verify the effectiveness of our approach not only on the ReLU networks with Gaussian output (Section 5.1), but also on the Bernoulli output ReLU networks of Section 3.4 on dynamically binarized MNIST (Section 5.2).

We compare the VLAE with other recent VAE models, including (1) the VAE (Kingma & Welling, 2014) using the standard fully-factorized Gaussian assumption, (2) the Semi-Amortized VAE (SA-VAE), which extends the VAE using gradient-based updates of variational parameters (Kim et al., 2018; Marino et al., 2018; Krishnan et al., 2018), and (3) the VAE augmented with Householder Flow (VAE+HF) (Tomczak & Welling, 2016; Rezende & Mohamed, 2015), which employs a series of Householder transformations to model the covariance of the Gaussian posterior. For bigger networks, we also include (4) the VAE augmented with Inverse Autoregressive Flow (VAE+IAF) (Kingma et al., 2016).

We experiment with two network settings: (1) a small network with one hidden layer.
The latent variable dimension is 16 and the hidden layer dimension is 256; we double both dimensions for the color datasets SVHN and CIFAR-10. (2) A bigger network with two hidden layers. The latent variable dimension is 50 and the hidden layer dimension is 500 for all datasets. For both settings, we apply ReLU activation to the hidden layers and use the same architecture for the encoder and decoder. See the Appendix for more experimental details. The code is public at http://vision.snu.ac.kr/projects/VLAE.

### 5.1. Results of Gaussian Outputs

Table 1 summarizes the results on the Gaussian output ReLU networks in the small network setting. The VLAE outperforms the other baselines by notable margins on all the datasets, proving the effectiveness of our approach. Remarkably, a single step of update (T = 1) leads to substantial improvement compared to the other models, and with more updates the performance further improves. Fig. 5 depicts how data reconstructions improve with the update steps.

Table 1. Log-likelihood results on small networks, estimated with 100 importance samples. Gaussian output is used except for the last column with Bernoulli output. T refers to the number of updates for the VLAE and SA-VAE (Kim et al., 2018) or the number of flows for the VAE+HF (Tomczak & Welling, 2016).

| Model | MNIST | Omniglot | Fashion-MNIST | SVHN | CIFAR-10 | Binary MNIST |
| --- | --- | --- | --- | --- | --- | --- |
| VAE | 612.9 | 343.5 | 606.3 | 4555 | 2364 | -96.73 |
| SA-VAE T=1 | 614.1 | 341.4 | 606.7 | 4553 | 2366 | -96.85 |
| SA-VAE T=2 | 615.2 | 346.6 | 604.1 | 4551 | 2366 | -96.73 |
| SA-VAE T=4 | 612.8 | 348.6 | 606.6 | 4553 | 2366 | -96.71 |
| SA-VAE T=8 | 612.1 | 345.5 | 608.0 | 4559 | 2365 | -96.89 |
| VAE+HF T=1 | 610.5 | 341.5 | 604.3 | 4557 | 2366 | -96.75 |
| VAE+HF T=2 | 613.1 | 343.1 | 606.5 | 4569 | 2361 | -96.52 |
| VAE+HF T=4 | 612.9 | 333.8 | 604.9 | 4564 | 2362 | -96.44 |
| VAE+HF T=8 | 615.6 | 332.6 | 605.5 | 4536 | 2357 | -96.14 |
| VLAE T=1 | 638.6 | 362.0 | 614.9 | 4639 | 2374 | -94.68 |
| VLAE T=2 | 645.4 | 372.7 | 615.5 | 4681 | 2381 | -94.46 |
| VLAE T=4 | 649.9 | 372.3 | 615.6 | 4711 | 2387 | -94.41 |
| VLAE T=8 | 650.3 | 380.7 | 618.8 | 4718 | 2392 | -94.57 |

Figure 5. Examples of images reconstructed by the VLAE with five update iterations proceeding from left to right. The pixel correlations greatly improve on the first update. The ground-truth samples are shown in the rightmost column.

Table 2 shows the results with the bigger networks, which bring considerable improvements. The VLAE again attains the best results for all the datasets. The VAE+IAF is generally strong among the baselines, whereas the SA-VAE is worse compared to the others. One distinguishing trend compared to the small-network results is that increasing the number of updates T often degrades the performance of the models. We suspect the increased depth due to the large number of updates or flows causes optimization difficulties, as we observe worse results on the training set as well.

For both network settings, the SA-VAE and VAE+HF show mixed results. We hypothesize the causes are as follows: (1) The SA-VAE update is noisy, as the gradient of the ELBO (Eq. (4)) is estimated using a single sample of z. Hence, it may cause instability during training and may require more iterations than used in our experiments for better performance. (2) Although the VAE+HF is endowed with the flexibility to represent correlations between the latent variable dimensions, it is prone to suffer from the amortization error, as it completely relies on the dynamic prediction of the inference network, whereas the inference problem becomes more complex with the enhanced flexibility.
On the other hand, we argue that the VLAE is able to make significant improvements in fewer steps because the VLAE update (Eq. (20)) is more powerful and deterministic, thanks to the local linearity of ReLU networks. Moreover, the VLAE computes the covariance of the latent variables directly from the generative model, thus providing greater expressiveness with less amortization error.

### 5.2. Results of Bernoulli Outputs

To show the generality of our approach on non-Gaussian output, we experiment with ReLU networks with Bernoulli output on dynamically binarized MNIST. Each pixel is stochastically set to 1 or 0 with a probability proportional to the pixel intensity (Salakhutdinov & Murray, 2008). The rightmost columns in Tables 1 and 2 show the results on dynamically binarized MNIST. The VLAE attains the highest log-likelihood as in the Gaussian output experiments. The results demonstrate that VLAEs are also effective for non-Gaussian output models.

Table 2. Log-likelihood results on bigger networks, estimated with 5000 importance samples. See the caption of Table 1 for details.

| Model | MNIST | Omniglot | Fashion-MNIST | SVHN | CIFAR-10 | Binary MNIST |
| --- | --- | --- | --- | --- | --- | --- |
| VAE | 1015 | 602.7 | 707.7 | 5162 | 2640 | -85.38 |
| SA-VAE T=1 | 984.7 | 598.5 | 706.4 | 5181 | 2639 | -85.20 |
| SA-VAE T=2 | 1006 | 589.8 | 708.3 | 5165 | 2639 | -85.10 |
| SA-VAE T=4 | 999.8 | 604.6 | 706.4 | 5172 | 2640 | -85.43 |
| SA-VAE T=8 | 990.0 | 602.7 | 697.0 | 5172 | 2639 | -85.24 |
| VAE+HF T=1 | 1028 | 602.5 | 715.6 | 5201 | 2637 | -85.27 |
| VAE+HF T=2 | 1020 | 603.1 | 710.4 | 5179 | 2636 | -85.31 |
| VAE+HF T=4 | 989.8 | 607.0 | 710.0 | 5209 | 2641 | -85.22 |
| VAE+HF T=8 | 944.3 | 608.5 | 714.6 | 5196 | 2640 | -85.41 |
| VAE+IAF T=1 | 1015 | 609.4 | 721.2 | 5037 | 2642 | -84.26 |
| VAE+IAF T=2 | 1057 | 617.5 | 724.1 | 5150 | 2624 | -84.16 |
| VAE+IAF T=4 | 1051 | 617.5 | 721.0 | 4994 | 2638 | -84.03 |
| VAE+IAF T=8 | 1018 | 606.7 | 725.7 | 4951 | 2639 | -83.80 |
| VLAE T=1 | 1150 | 727.4 | 817.7 | 5324 | 2687 | -83.72 |
| VLAE T=2 | 1096 | 731.1 | 825.4 | 5159 | 2686 | -83.84 |
| VLAE T=4 | 1054 | 701.2 | 826.0 | 5231 | 2683 | -83.73 |
| VLAE T=8 | 1009 | 661.2 | 821.0 | 5341 | 2639 | -83.60 |

## 6. Conclusion

We presented Variational Laplace Autoencoders (VLAEs), which apply the Laplace approximation of the posterior for training deep generative models. The iterative mode updates and the full-covariance Gaussian approximation using the curvature of the generative network enhance the expressive power of the posterior with less amortization error. The experiments demonstrated that, on ReLU networks, VLAEs outperform other amortized or iterative models. As future work, an important study may be to extend VLAEs to deep latent models based on the combination of top-down information and bottom-up inference (Kingma et al., 2016; Sønderby et al., 2016).

## Acknowledgements

This work is supported by Samsung Advanced Institute of Technology, the Korea-U.K. FP Programme through the NRF of Korea (NRF-2017K1A3A1A16067245) and an IITP grant funded by the Korea government (MSIP) (2019-0-01082).

## References

Bishop, C. M. Pattern recognition and machine learning. Springer-Verlag New York, Inc., Secaucus, NJ, 2006.

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. In CoNLL, 2016.

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. In NeurIPS, 2015.

Cremer, C., Li, X., and Duvenaud, D. Inference suboptimality in variational autoencoders. In ICML, 2018.

Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout networks. In ICML, 2013.

Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and Wierstra, D. DRAW: A recurrent neural network for image generation. In ICML, 2015.
Gulrajani, I., Kumar, K., Ahmed, F., Taiga, A. A., Visin, F., Vazquez, D., and Courville, A. C. PixelVAE: A latent variable model for natural images. In ICLR, 2017.

Hinton, G. E. and Camp, D. V. Keeping the neural networks simple by minimizing the description length of the weights. In COLT, 1993.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.

Kim, Y., Wiseman, S., Miller, A. C., Sontag, D., and Rush, A. M. Semi-amortized variational autoencoders. In ICML, 2018.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In NeurIPS, 2016.

Krishnan, R. G., Liang, D., and Hoffman, M. D. On the challenges of learning with inference networks on sparse high-dimensional data. In AISTATS, 2018.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Computer Science Department, University of Toronto, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.

Lake, B. M., Salakhutdinov, R. R., and Tenenbaum, J. One-shot learning by inverting a compositional causal process. In NeurIPS, 2013.

LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In NeurIPS, 1990.

LeCun, Y., Simard, P. Y., and Pearlmutter, B. Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors. In NeurIPS, 1993.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

Maaløe, L., Sønderby, C. K., Sønderby, S. K., and Winther, O. Auxiliary deep generative models. In ICML, 2016.

Maas, A. L., Hannun, A. Y., and Ng, A. Y. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.

MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448-472, 1992.

Marino, J., Yue, Y., and Mandt, S. Iterative amortized inference. In ICML, 2018.

Mescheder, L., Nowozin, S., and Geiger, A. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In ICML, 2017.

Montufar, G., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. In NeurIPS, 2014.

Pascanu, R., Montufar, G., and Bengio, Y. On the number of response regions of deep feedforward networks with piecewise linear activations. In ICLR, 2014.

Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. In ICML, 2015.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

Ritter, H., Botev, A., and Barber, D. A scalable Laplace approximation for neural networks. In ICLR, 2018.

Roberts, A., Engel, J., Raffel, C., Hawthorne, C., and Eck, D. A hierarchical latent vector model for learning long-term structure in music. In ICML, 2018.

Salakhutdinov, R. and Murray, I. On the quantitative analysis of deep belief networks. In ICML, 2008.

Salimans, T., Kingma, D., and Welling, M. Markov chain Monte Carlo and variational inference: Bridging the gap. In ICML, 2015.
Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. How to train deep variational autoencoders and probabilistic ladder networks. In ICML, 2016.

Tabak, E. G. and Turner, C. V. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 2013.

Tipping, M. E. and Bishop, C. M. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611-622, 1999.

Tomczak, J. M. and Welling, M. Improving variational autoencoders using Householder flow. In NeurIPS Workshop on Bayesian Deep Learning, 2016.

Tran, D., Ranganath, R., and Blei, D. M. The variational Gaussian process. In ICLR, 2016.

Wang, Y. N. T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NeurIPS, 2011.

Waterhouse, S. R., MacKay, D., and Robinson, A. J. Bayesian methods for mixtures of experts. In NeurIPS, 1996.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint, 2017.

Yuri, B., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. In ICLR, 2015.