Published as a conference paper at ICLR 2022

DEEP ATTENTIVE VARIATIONAL INFERENCE

Ifigeneia Apostolopoulou1,2, Ian Char1,2, Elan Rosenfeld1 & Artur Dubrawski1,2
1Machine Learning Department & 2Auton Lab, Carnegie Mellon University
Correspondence to: iapostol@andrew.cmu.edu, ifiaposto@gmail.com

ABSTRACT

Stochastic Variational Inference is a powerful framework for learning large-scale probabilistic latent variable models. However, typical assumptions on the factorization or independence of the latent variables can substantially restrict its capacity for inference and generative modeling. A major line of active research aims at building more expressive variational models by designing deep hierarchies of interdependent latent variables. Although these models exhibit superior performance and enable richer latent representations, we show that they incur diminishing returns: adding more stochastic layers to an already very deep model yields a small predictive improvement while substantially increasing the inference and training time. Moreover, the architecture of this class of models favors proximate interactions among latent variables in neighboring layers when designing the conditioning factors of the involved distributions. This is the first work to propose attention mechanisms for building more expressive variational distributions in deep probabilistic models by explicitly modeling both nearby and distant interactions in the latent space. Specifically, we propose the deep attentive variational autoencoder and test it on a variety of established datasets. We show that it achieves state-of-the-art log-likelihoods while using fewer latent layers and requiring less training time than existing models. The proposed holistic inference reduces the computational footprint by alleviating the need for deep hierarchies.

Project code: https://github.com/ifiaposto/Deep_Attentive_VI

1 INTRODUCTION

A core line of research in both supervised and unsupervised learning relies on deep probabilistic models. This class of models uses deep neural networks to model distributions that express hypotheses about the way in which the data have been generated. Such architectures are preferred due to their capacity to express complex, non-linear relationships between the random variables of interest while enabling tractable inference and sampling. Latent variable probabilistic models augment the set of observed variables with auxiliary latent variables. They are characterized by a posterior distribution over the latent variables, which is generally intractable and typically approximated by closed-form alternatives. They provide an explicit parametric specification of the joint distribution over the expanded random variable space, while the distribution of the observed variables is computed by marginalizing over the latent variables. The Variational Autoencoder (VAE) (Kingma & Welling, 2014) belongs to this model category. The VAE uses neural networks to learn the parametrization of both the inference network (which defines the posterior distribution of the latent variables) and the generative network (which defines the prior distribution of the latent variables and the conditional data distribution given the latent variables). Their parameters are jointly learned via stochastic variational inference (Paisley et al., 2012; Hoffman et al., 2013).
Early VAE architectures (Rezende et al., 2014) make strong assumptions about the posterior distribution; specifically, it is standard to assume that the posterior is approximately factorial. Since then, research has progressed on learning more expressive latent variable models. For example, Rezende & Mohamed (2015); Kingma et al. (2016); Chen et al. (2017) aim at modeling more complex posterior distributions, constructed with autoregressive layers. Theoretical research focuses on deriving tighter bounds (Burda et al., 2016; Li & Turner, 2016; Masrani et al., 2019) or building more informative latent variables by mitigating posterior collapse (Razavi et al., 2019a; Lucas et al., 2019). Sinha & Dieng (2021) improve generalization by enforcing regularization in the latent space for semantics-preserving transformations of the data. Taking a different approach, hierarchical VAEs (Gulrajani et al., 2017; Sønderby et al., 2016; Maaløe et al., 2019; Vahdat & Kautz, 2020; Child, 2020) leverage increasingly deep and interdependent layers of latent variables, similar to how subsequent layers in a discriminative network are believed to learn increasingly abstract representations of the data. These architectures exhibit superior generative and reconstructive capabilities since they allow for the modeling of much richer structures in the latent space.

Previous work overlooks the effect of long-range correlations among the latent variables. In this work, we propose to restructure common hierarchical, convolutional VAE architectures in order to increase the receptive field of the variational distributions. We first provide experimental evidence that conditional dependencies in deep probabilistic hierarchies may be implicitly disregarded by current models. Subsequently, we propose a decomposed, attention-guided scheme that allows long-range flow of both latent and observed information, both across different, potentially far-apart stochastic layers and within the same layer, and we investigate the importance of each proposed change through extensive ablation studies. Finally, we demonstrate that our model is computationally more economical and attains state-of-the-art performance across a diverse set of benchmark datasets.

2 PROPOSED MODEL

2.1 DEEP VARIATIONAL INFERENCE

A latent variable generative model defines the joint distribution of a set of observed variables, $x \in \mathbb{R}^D$, and auxiliary latent variables, $z$, coming from a prior distribution $p(z)$. To perform inference, the marginal likelihood of the distribution of interest, $p(x)$, can be computed by integrating out the latent variables:

$$p(x) = \int p(x, z)\, dz. \tag{1}$$

Since this integration is generally intractable, a lower bound on the marginal likelihood is maximized instead. This is done by introducing an approximate posterior distribution $q(z \mid x)$ and applying Jensen's inequality:

$$\log p(x) = \log \int p(x, z)\, dz
= \log \int q(z \mid x)\, \frac{p(x, z)}{q(z \mid x)}\, dz
\geq \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p(x \mid z)\, p(z)}{q(z \mid x)}\right]
= \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{\mathrm{KL}}\big(q(z \mid x)\,\|\, p(z)\big), \tag{2}$$

where $\theta, \phi$ parameterize $p(x, z; \theta)$ and $q(z \mid x; \phi)$, respectively. For ease of notation, we omit $\theta, \phi$ in the derivations. This objective is called the Evidence Lower Bound (ELBO) and can be optimized efficiently for continuous $z$ via stochastic gradient descent (Kingma & Welling, 2014; Rezende et al., 2014). For ease of implementation, it is common to assume that both $q(z \mid x)$ and $p(z)$ are fully factorized Gaussian distributions.
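To make the objective in Eq. (2) concrete, the following is a minimal sketch, not the paper's implementation, of a single-layer VAE with a fully factorized Gaussian posterior $q(z \mid x)$, a standard-normal prior $p(z)$, and a Bernoulli likelihood; the class name ToyVAE and the layer sizes (x_dim, h_dim, z_dim) are illustrative assumptions.

```python
# Minimal sketch of the ELBO in Eq. (2) with a factorized Gaussian q(z|x) and
# standard-normal prior p(z). Sizes and layers are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # mean of q(z|x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # logits of p(x|z)

    def elbo(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # E_q[log p(x|z)] under a Bernoulli likelihood
        log_px_z = -F.binary_cross_entropy_with_logits(
            self.dec(z), x, reduction='none').sum(-1)
        # Analytic KL(q(z|x) || N(0, I)) for diagonal Gaussians
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1)
        return (log_px_z - kl).mean()

# Training maximizes the ELBO, i.e., minimizes -model.elbo(x) with SGD/Adam.
```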
However, this assumption may be too limiting in cases of complex underlying distributions. To compensate for this modeling constraint, many works focus on stacking, and improving the stability of, multiple layers of stochastic latent features, which are partitioned into groups such that $z = \{z_1, z_2, \ldots, z_L\}$, where $L$ is the number of such groups (Rezende et al., 2014; Gulrajani et al., 2017; Kingma et al., 2016; Sønderby et al., 2016; Maaløe et al., 2019; Vahdat & Kautz, 2020; Child, 2020). Our work builds on architectures of bidirectional inference with a deterministic bottom-up pass. The schematic diagram of a stochastic layer in such a deep variational model is depicted in Figure 1. In a bidirectional inference architecture with a deterministic bottom-up pass (left part in Figure 1a), posterior sampling is preceded by a sequence of non-linear transformations, $T^q_l$, of the evidence $x$, i.e., $h_l = T^q_l(h_{l+1})$, with $h_{L+1} = x$. The inference network $q(z \mid x)$ and the generative network $p(z)$ decompose according to an identical topological ordering: $q(z \mid x) = \prod_l q(z_l \mid x, z_{<l})$ and $p(z) = \prod_l p(z_l \mid z_{<l})$.
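As an illustration of this factorization, the sketch below is a simplification under assumed dense layers and sizes, not the paper's convolutional architecture: it runs a deterministic bottom-up pass $h_l = T^q_l(h_{l+1})$ with $h_{L+1} = x$, then a top-down pass that samples each $z_l \sim q(z_l \mid x, z_{<l})$ and accumulates the per-layer KL terms against $p(z_l \mid z_{<l})$. The names HierarchicalVAE and z_to_ctx, and the way the $z_{<l}$ context is summarized, are hypothetical.

```python
# Sketch of bidirectional inference with a deterministic bottom-up pass
# (assumed dense layers; the paper's architecture is convolutional).
import torch
import torch.nn as nn

class HierarchicalVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h_dim=128, L=3):
        super().__init__()
        self.L = L
        # Deterministic bottom-up transforms T^q_l, with h_{L+1} = x
        self.bottom_up = nn.ModuleList(
            [nn.Sequential(nn.Linear(x_dim if l == L - 1 else h_dim, h_dim), nn.ReLU())
             for l in range(L)])
        # Heads producing (mu, logvar) for q(z_l | x, z_<l) and p(z_l | z_<l)
        self.q_heads = nn.ModuleList([nn.Linear(2 * h_dim, 2 * z_dim) for _ in range(L)])
        self.p_heads = nn.ModuleList([nn.Linear(h_dim, 2 * z_dim) for _ in range(L)])
        # Maps each sampled z_l into the running top-down context summarizing z_<l
        self.z_to_ctx = nn.ModuleList([nn.Linear(z_dim, h_dim) for _ in range(L)])

    def infer(self, x):
        # Bottom-up pass: compute h_L, ..., h_1 from the evidence x
        hs, h = [], x
        for l in reversed(range(self.L)):     # layer L first, layer 1 last
            h = self.bottom_up[l](h)
            hs.insert(0, h)                   # hs[0] ends up as h_1 (top of hierarchy)
        # Top-down pass: sample z_1, ..., z_L and accumulate KL(q || p) per layer
        ctx = torch.zeros(x.size(0), hs[0].size(-1), device=x.device)
        kl_total, zs = x.new_zeros(x.size(0)), []
        for l in range(self.L):
            q_mu, q_logvar = self.q_heads[l](torch.cat([hs[l], ctx], dim=-1)).chunk(2, dim=-1)
            p_mu, p_logvar = self.p_heads[l](ctx).chunk(2, dim=-1)
            z = q_mu + torch.exp(0.5 * q_logvar) * torch.randn_like(q_mu)
            # KL between the diagonal Gaussians q(z_l | x, z_<l) and p(z_l | z_<l)
            kl = 0.5 * (p_logvar - q_logvar
                        + (q_logvar.exp() + (q_mu - p_mu).pow(2)) / p_logvar.exp()
                        - 1.0).sum(-1)
            kl_total = kl_total + kl
            ctx = ctx + self.z_to_ctx[l](z)   # propagate the z_<l summary downward
            zs.append(z)
        return zs, kl_total                   # zs feeds a decoder p(x|z); kl_total enters the ELBO
```

In this simplified sketch, the locally updated context is the only channel through which $z_{<l}$ reaches layer $l$, which illustrates the kind of proximate, layer-to-layer conditioning that the attention mechanisms proposed in this paper aim to relax.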