# Densely connected normalizing flows

Matej Grcić, Ivan Grubišić and Siniša Šegvić
Faculty of Electrical Engineering and Computing, University of Zagreb
matej.grcic@fer.hr, ivan.grubisic@fer.hr, sinisa.segvic@fer.hr

Normalizing flows are bijective mappings between inputs and latent representations with a fully factorized distribution. They are very attractive due to exact likelihood evaluation and efficient sampling. However, their effective capacity is often insufficient since the bijectivity constraint limits the model width. We address this issue by incrementally padding intermediate representations with noise. We precondition the noise in accordance with previous invertible units, which we describe as cross-unit coupling. Our invertible Glow-like modules increase the model expressivity by fusing a densely connected block with Nyström self-attention. We refer to our architecture as DenseFlow since both cross-unit and intra-module couplings rely on dense connectivity. Experiments show significant improvements due to the proposed contributions and reveal state-of-the-art density estimation under moderate computing budgets.¹

¹ Code available at: https://github.com/matejgrcic/DenseFlow

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

## 1 Introduction

One of the main tasks of modern artificial intelligence is to generate images, audio waveforms, and natural-language symbols. To achieve this goal, the current state of the art uses deep compositions of non-linear transformations [1, 2] known as deep generative models [3, 4, 5, 6, 7]. Formally, deep generative models estimate an unknown data distribution $p_D$ given by a set of i.i.d. samples $\mathcal{D} = \{x_1, \dots, x_n\}$. The data distribution is approximated with a model distribution $p_\theta$ defined by the architecture of the model and a set of parameters $\theta$. While the architecture is usually handcrafted, the set of parameters $\theta$ is obtained by optimizing the likelihood across the training distribution $p_D$:

$$\theta = \operatorname*{argmin}_{\theta \in \Theta} \; \mathbb{E}_{x \sim p_D}\left[-\ln p_\theta(x)\right]. \tag{1}$$

Properties of the model (e.g. efficient sampling, the ability to evaluate the likelihood, etc.) directly depend on the definition of $p_\theta(x)$, or on the decision to avoid it. Early approaches consider an unnormalized distribution [3], which usually requires MCMC-based sample generation [8, 9, 10] with long mixing times. Alternatively, the distribution can be autoregressively factorized [7, 11], which allows likelihood estimation and powerful but slow sample generation. VAEs [4] use a factorized variational approximation of the latent representation, which allows learning an autoencoder by optimizing a lower bound of the likelihood. Diffusion models [12, 13, 14] learn to reverse a diffusion process, which is a fixed Markov chain that gradually adds noise to the data in the direction opposite to sampling until the signal is destroyed. Generative adversarial networks [5] mimic the dataset samples by competing in a minimax game. This allows them to efficiently produce high-quality samples [15], which however often do not span the entire support of the training distribution [16]. Additionally, the inability to "invert" the generation process in any meaningful way implies an inability to evaluate the likelihood.
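The objective in (1) is ordinary maximum-likelihood training: the parameters are fit by minimizing the expected negative log-likelihood over samples from $p_D$. The following minimal PyTorch sketch illustrates this idea; the `model.log_prob` interface and the training-loop names are illustrative assumptions, not the paper's implementation.

```python
import torch

def nll_loss(model, x):
    """Monte Carlo estimate of Eq. (1): E_{x ~ p_D}[-ln p_theta(x)],
    assuming `model.log_prob(x)` returns per-sample log-likelihoods."""
    return -model.log_prob(x).mean()

def train(model, loader, epochs=10, lr=1e-3):
    """Hypothetical training loop; `loader` is assumed to yield batches x ~ p_D."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x in loader:
            loss = nll_loss(model, x)   # -ln p_theta(x), averaged over the batch
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```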
Contrary to previous approaches, normalizing flows [6, 17, 18] model the likelihood using a bijective mapping to a predefined latent distribution $p(z)$, typically a multivariate Gaussian. Given the bijection $f_\theta$, the likelihood is defined using the change of variables formula:

$$p_\theta(x) = p(z)\,\left|\det \frac{\partial z}{\partial x}\right|, \qquad z = f_\theta(x). \tag{2}$$

This approach requires computation of the Jacobian determinant $\det \frac{\partial z}{\partial x}$. Therefore, during the construction of bijective transformations, a great emphasis is placed on tractable determinant computation and efficient inverse computation [18, 19]. Due to these constraints, invertible transformations require more parameters to achieve a capacity similar to that of standard NN building blocks [20]. Still, modeling $p_\theta(x)$ through the bijective formulation enables exact likelihood evaluation and efficient sample generation, which makes this approach convenient for various downstream tasks [21, 22, 23].

The bijective formulation (2) implies that the input and the latent representation have the same dimensionality. Typically, convolutional units of normalizing-flow approaches [18] internally inflate the dimensionality of the input, extract useful features, and then compress them back to the original dimensionality. Unfortunately, the capacity of such transformations is limited by the input dimensionality [24]. This issue can be addressed by expressing the model as a sequence of bijective transformations [18]. However, increasing the depth alone is a suboptimal way to improve the capacity of a deep model [25]. Recent works propose to widen the flow by increasing the input dimensionality [24, 26]. We propose an effective development of that idea which further improves performance while relaxing computational requirements.

We increase the expressiveness of normalizing flows by incremental augmentation of intermediate latent representations with Gaussian noise. The proposed cross-unit coupling applies an affine transformation to the noise, where the scaling and translation are computed from a set of previous intermediate representations. In addition, we improve intra-module coupling by proposing a transformation which fuses the global spatial context with local correlations. The proposed image-oriented architecture improves expressiveness and computational efficiency. Our models set a new state-of-the-art result in likelihood evaluation on ImageNet32 and ImageNet64.

## 2 Densely connected normalizing flows

We present a recursive view of normalizing flows and propose improvements based on incremental augmentation of latent representations and on densely connected coupling modules paired with self-attention. The improved framework is then used to develop an image-oriented architecture, which we evaluate in the experimental section.

### 2.1 Normalizing flows with cross-unit coupling

Normalizing flows (NF) achieve their expressiveness by stacking multiple invertible transformations [18]. We illustrate this with scheme (3), where each pair of consecutive latent variables $z_{i-1}$ and $z_i$ is connected via a dedicated flow unit $f_i$. Each flow unit $f_i$ is a bijective transformation with parameters $\theta_i$, which we omit to keep the notation uncluttered. The variable $z_0$ is typically the input $x$ drawn from the data distribution $p_D(x)$.

$$z_0 \xrightarrow{f_1} z_1 \xrightarrow{f_2} z_2 \xrightarrow{f_3} \cdots \xrightarrow{f_{i-1}} z_{i-1} \xrightarrow{f_i} z_i \xrightarrow{} \cdots \xrightarrow{f_K} z_K, \qquad z_K \sim \mathcal{N}(0, \mathbf{I}). \tag{3}$$

Following the change of variables formula, the log-likelihoods of consecutive random variables $z_i$ and $z_{i+1}$ can be related through the Jacobian of the corresponding transformation $J_{f_{i+1}}$ [18]:

$$\ln p(z_i) = \ln p(z_{i+1}) + \ln \left|\det J_{f_{i+1}}\right|. \tag{4}$$

This relation can be seen as a recursion. The term $\ln p(z_{i+1})$ can be recursively replaced either with another instance of (4) or evaluated under the latent distribution, which marks the termination step.
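As a concrete illustration of the recursion (3)–(4), the sketch below composes a few toy element-wise affine units and accumulates their log-determinants to evaluate $\ln p_\theta(x)$ under a Gaussian prior on $z_K$. It is a minimal example of the change-of-variables bookkeeping, not the DenseFlow architecture; the class names and the affine units are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AffineUnit(nn.Module):
    """Toy invertible unit f_i: element-wise affine map z_i = exp(log_s) * z_{i-1} + t.
    Its Jacobian is diagonal, so ln|det J_{f_i}| = sum(log_s)."""
    def __init__(self, dim):
        super().__init__()
        self.log_s = nn.Parameter(torch.zeros(dim))
        self.t = nn.Parameter(torch.zeros(dim))

    def forward(self, z):
        logdet = self.log_s.sum().expand(z.shape[0])       # per-sample ln|det J|
        return z * self.log_s.exp() + self.t, logdet

    def inverse(self, z):
        return (z - self.t) * torch.exp(-self.log_s)

class ToyFlow(nn.Module):
    """Composition z_0 -> z_1 -> ... -> z_K as in scheme (3), with z_K ~ N(0, I).
    Eq. (4) applied recursively gives ln p(x) = ln p(z_K) + sum_i ln|det J_{f_i}|."""
    def __init__(self, dim, K=4):
        super().__init__()
        self.units = nn.ModuleList(AffineUnit(dim) for _ in range(K))
        self.prior = torch.distributions.Normal(0.0, 1.0)

    def log_prob(self, x):
        z, total_logdet = x, x.new_zeros(x.shape[0])
        for f in self.units:
            z, logdet = f(z)
            total_logdet = total_logdet + logdet
        return self.prior.log_prob(z).sum(dim=1) + total_logdet

    def sample(self, num_samples, dim):
        z = self.prior.sample((num_samples, dim))
        for f in reversed(self.units):                      # invert units in reverse order
            z = f.inverse(z)
        return z
```

For example, `ToyFlow(dim=8).log_prob(torch.randn(16, 8))` returns sixteen exact log-likelihoods, and the objective (1) can be optimized by minimizing their negative mean.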
This setup is characteristic of most contemporary architectures [17, 18, 19, 27]. The standard NF formulation can be expanded by augmenting the input with a noise variable $e_i$ [24, 26]. The noise $e_i$ follows some known distribution $p'(e_i)$, e.g. a multivariate Gaussian. We further improve this approach by incrementally concatenating noise to each intermediate latent representation $z_i$. A tractable formulation of this idea can be obtained by computing a lower bound on the likelihood $p(z_i)$ through Monte Carlo sampling of $e_i$:

$$\ln p(z_i) \geq \mathbb{E}_{e_i \sim p'(e)}\left[\ln p(z_i, e_i) - \ln p'(e_i)\right]. \tag{5}$$

The learned joint distribution $p(z_i, e_i)$ approximates the product of the target distributions $p'(z_i)$ and $p'(e_i)$, which is explained in more detail in Appendix D. We transform the introduced noise $e_i$ with an element-wise affine transformation. The parameters of this transformation are computed by a learned non-linear transformation $g_i(z$
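The sketch below shows one way the incremental noise augmentation and the affine cross-unit coupling described above could be realized, under simplifying assumptions: the noise is standard Gaussian, the context is a flat vector summarizing previous representations, and the names (`CrossUnitCoupling`, `context_dim`, `noise_dim`) are hypothetical rather than the DenseFlow API.

```python
import torch
import torch.nn as nn

class CrossUnitCoupling(nn.Module):
    """Pads a latent z_i with affinely transformed Gaussian noise.
    The element-wise scale and shift are predicted by a learned non-linear
    transformation g_i of a context built from previous representations."""
    def __init__(self, context_dim, noise_dim):
        super().__init__()
        self.g = nn.Sequential(                      # illustrative stand-in for g_i
            nn.Linear(context_dim, context_dim),
            nn.ReLU(),
            nn.Linear(context_dim, 2 * noise_dim),
        )
        self.noise_dim = noise_dim
        self.base = torch.distributions.Normal(0.0, 1.0)

    def forward(self, z, context):
        e = self.base.sample((z.shape[0], self.noise_dim)).to(z.device)
        a, b = self.g(context).chunk(2, dim=1)       # log-scale and shift of the noise
        e_aug = torch.exp(a) * e + b                 # preconditioned noise
        z_aug = torch.cat([z, e_aug], dim=1)         # widened representation for the next unit
        # Log-density of the transformed noise under the Gaussian it follows after
        # the affine map (change of variables: ln N(e; 0, I) - sum(a)); this plays
        # the role of the subtracted noise term in the Monte Carlo bound (5).
        log_q_e = self.base.log_prob(e).sum(dim=1) - a.sum(dim=1)
        return z_aug, log_q_e
```

A subsequent flow unit would then process `z_aug`, while `log_q_e` would enter the training objective together with the joint log-likelihood term of (5).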