# MaCow: Masked Convolutional Generative Flow

Xuezhe Ma, Xiang Kong, Shanghang Zhang, Eduard Hovy
Carnegie Mellon University, Pittsburgh, PA, USA
{xuezhem,xiangk}@cs.cmu.edu, shanghaz@andrew.cmu.edu, hovy@cmu.edu

Flow-based generative models, conceptually attractive due to the tractability of exact log-likelihood computation and latent-variable inference as well as efficiency in training and sampling, have led to a number of impressive empirical successes and spawned many advanced variants and theoretical investigations. Despite their computational efficiency, however, the density estimation performance of flow-based generative models falls significantly behind that of state-of-the-art autoregressive models. In this work, we introduce masked convolutional generative flow (MaCow), a simple yet effective architecture for generative flow using masked convolution. By restricting the local connectivity to a small kernel, MaCow features fast and stable training along with efficient sampling, while achieving significant improvements over Glow for density estimation on standard image benchmarks, considerably narrowing the gap with autoregressive models.

## 1 Introduction

Unsupervised learning of probabilistic models is a central yet challenging problem. Deep generative models have shown promising results in modeling complex distributions such as natural images (Radford et al., 2015), audio (Van Den Oord et al., 2016), and text (Bowman et al., 2015). Multiple approaches have emerged in recent years, including Variational Autoencoders (VAEs) (Kingma and Welling, 2014), Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), autoregressive neural networks (Larochelle and Murray, 2011; Oord et al., 2016), and flow-based generative models (Dinh et al., 2014, 2016; Kingma and Dhariwal, 2018). Among these, flow-based generative models have gained popularity for their capability of estimating densities of complex distributions, efficiently generating high-fidelity syntheses, and automatically learning useful latent spaces.

Flow-based generative models typically warp a simple distribution into a complex one by mapping points from the simple distribution to the complex data distribution through a chain of invertible transformations whose Jacobian determinants are efficient to compute. This design guarantees that the density of the transformed distribution can be computed analytically, making maximum likelihood learning feasible. Flow-based generative models have attracted significant interest in improving and analyzing their algorithms both theoretically and practically, and in applying them to a wide range of tasks and domains.

In their pioneering work, Dinh et al. (2014) first proposed Non-linear Independent Component Estimation (NICE), applying flow-based models to the modeling of complex high-dimensional densities. Real NVP (Dinh et al., 2016) extended NICE with a more flexible invertible transformation and experimented with natural images. However, these flow-based generative models yield worse density estimation performance than state-of-the-art autoregressive models, and are incapable of realistic synthesis of large images compared to GANs (Karras et al., 2018; Brock et al., 2019). Recently, Kingma and Dhariwal (2018) proposed Glow, a generative flow with invertible 1×1 convolutions, which significantly improved density estimation performance on natural images.
Importantly, they demonstrated that flow-based generative models optimized towards the plain likelihood-based objective are capable of efficiently generating realistic high-resolution natural images. Prenger et al. (2018) investigated applying flow-based generative models to speech synthesis by combining Glow with WaveNet (Van Den Oord et al., 2016). Ziegler and Rush (2019) adopted variational inference to apply generative flows to discrete sequential data. Unfortunately, the density estimation performance of Glow on natural images remains behind that of autoregressive models, such as PixelRNN/CNN (Oord et al., 2016; Salimans et al., 2017), Image Transformer (Parmar et al., 2018), PixelSNAIL (Chen et al., 2017), and SPN (Menick and Kalchbrenner, 2019). There is also work (Rezende and Mohamed, 2015; Kingma et al., 2016; Zheng et al., 2017) on applying flows to variational inference.

In this paper, we propose a novel architecture of generative flow, masked convolutional generative flow (MaCow), which leverages masked convolutional neural networks (Oord et al., 2016). The bijective mapping between input and output variables is easily established, while the computation of the determinant of the Jacobian remains efficient. Compared to inverse autoregressive flow (IAF) (Kingma et al., 2016), MaCow offers stable training and efficient inference and synthesis by restricting the local connectivity to a small masked kernel, while obtaining large receptive fields by stacking multiple layers of convolutional flows with rotational ordering masks (§3.1). We also propose a fine-grained version of the multi-scale architecture adopted in previous flow-based generative models to further improve performance (§3.2). Experimenting with four benchmark image datasets, CIFAR-10, ImageNet, LSUN, and CelebA-HQ, we demonstrate the effectiveness of MaCow as a density estimator, consistently achieving significant improvements over Glow on all of these datasets. When equipped with the variational dequantization mechanism (Ho et al., 2019), MaCow considerably narrows the density estimation gap with autoregressive models (§4).

## 2 Flow-based Generative Models

In this section, we first set up notation, describe flow-based generative models, and review Glow (Kingma and Dhariwal, 2018), as it is the foundation for MaCow.

### 2.1 Notations

Throughout the paper, uppercase letters represent random variables and lowercase letters the realizations of their corresponding random variables. Let $X \in \mathcal{X}$ be the random variable of the observed data, e.g., $X$ is an image or a sentence for image and text generation, respectively. Let $P$ denote the true distribution of the data, i.e., $X \sim P$, and $D = \{x_1, \ldots, x_N\}$ be our training sample, where $x_i, i = 1, \ldots, N$, are usually i.i.d. samples of $X$. Let $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ denote a parametric statistical model indexed by the parameter $\theta \in \Theta$, where $\Theta$ is the parameter space. $p$ denotes the density of the corresponding distribution $P$. In the deep generative model literature, deep neural networks are the most widely used parametric models. The goal of generative models is to learn the parameter $\theta$ such that $P_\theta$ best approximates the true distribution $P$. In the context of maximum likelihood estimation, we minimize the negative log-likelihood of the parameters:

$$\min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} -\log p_\theta(x_i) = \min_{\theta \in \Theta} \mathbb{E}_{\tilde{P}(X)}\left[-\log p_\theta(X)\right], \tag{1}$$

where $\tilde{P}(X)$ is the empirical distribution derived from the training data $D$.
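To make Eq. (1) concrete, the following minimal PyTorch-style sketch shows how the objective is typically estimated on a minibatch. The `model.log_prob` interface is a hypothetical stand-in for any density model with a tractable likelihood (such as a normalizing flow), not an API from this paper.

```python
import torch

def nll_loss(model, x_batch):
    """Monte Carlo estimate of E_{P~(X)}[-log p_theta(X)] in Eq. (1):
    the average negative log-likelihood over a minibatch drawn from
    the empirical distribution of the training data."""
    # `model.log_prob` is a hypothetical interface returning per-example
    # log-densities log p_theta(x_i) with shape [batch_size].
    return -model.log_prob(x_batch).mean()

# A typical training step under these assumptions:
#   loss = nll_loss(model, x_batch)
#   loss.backward()
#   optimizer.step()
```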
### 2.2 Flow-based Models

In the framework of flow-based generative models, a set of latent variables $Z \in \mathcal{Z}$ is introduced with a prior distribution $p_Z(z)$, which is typically a simple distribution like a multivariate Gaussian. For a bijective function $f: \mathcal{X} \to \mathcal{Z}$ (with $g = f^{-1}$), the change-of-variables formula defines the model distribution on $\mathcal{X}$ by

$$p_\theta(x) = p_Z\big(f_\theta(x)\big) \left|\det\left(\frac{\partial f_\theta(x)}{\partial x}\right)\right|, \tag{2}$$

where $\frac{\partial f_\theta(x)}{\partial x}$ is the Jacobian of $f_\theta$ at $x$. The generative process is defined straightforwardly as:

$$z \sim p_Z(z), \quad x = g_\theta(z). \tag{3}$$

Flow-based generative models focus on certain types of transformations $f_\theta$ for which the inverse functions $g_\theta$ and the Jacobian determinants are tractable to compute. By stacking multiple such invertible transformations in a sequence, which is also called a (normalizing) flow (Rezende and Mohamed, 2015), the flow is capable of warping a simple distribution ($p_Z(z)$) into a complex one ($p(x)$) through:

$$X \overset{f_1}{\underset{g_1}{\rightleftarrows}} H_1 \overset{f_2}{\underset{g_2}{\rightleftarrows}} H_2 \overset{f_3}{\underset{g_3}{\rightleftarrows}} \cdots \overset{f_K}{\underset{g_K}{\rightleftarrows}} Z,$$

where $f = f_1 \circ f_2 \circ \cdots \circ f_K$ is a flow of $K$ transformations. For brevity, we omit the parameter $\theta$ from $f_\theta$ and $g_\theta$.

Recently, several types of invertible transformations have emerged to enhance the expressiveness of flows, among which Glow (Kingma and Dhariwal, 2018) stands out for its simplicity and effectiveness on both density estimation and high-fidelity synthesis. The following briefly describes the three types of transformations that comprise Glow.

**Actnorm.** Kingma and Dhariwal (2018) proposed an activation normalization layer (Actnorm) as an alternative to batch normalization (Ioffe and Szegedy, 2015) to alleviate challenges in model training. Similar to batch normalization, Actnorm performs an affine transformation of the activations using a scale and bias parameter per channel for 2D images, such that

$$y_{i,j} = s \odot x_{i,j} + b,$$

where both $x$ and $y$ are tensors of shape $[h \times w \times c]$ with spatial dimensions $(h, w)$ and channel dimension $c$.

**Invertible 1×1 convolution.** To incorporate a permutation along the channel dimension, Glow includes a trainable invertible 1×1 convolution layer that generalizes the permutation operation as

$$y_{i,j} = W x_{i,j},$$

where $W$ is a weight matrix of shape $c \times c$.

**Affine coupling layers.** Following Dinh et al. (2016), Glow includes affine coupling layers in its architecture (see the code sketch below for a concrete illustration):

$$\begin{aligned} x_a, x_b &= \mathrm{split}(x) \\ y_a &= x_a \\ y_b &= s(x_a) \odot x_b + b(x_a) \\ y &= \mathrm{concat}(y_a, y_b), \end{aligned}$$

where $s(x_a)$ and $b(x_a)$ are the outputs of two neural networks with $x_a$ as input. The $\mathrm{split}()$ and $\mathrm{concat}()$ functions perform their operations along the channel dimension.

From this design of Glow, we see that interactions between spatial dimensions are incorporated only in the coupling layers. The coupling layer, however, is typically memory-intensive, making it infeasible to stack a large number of coupling layers in a single model, especially when processing high-resolution images. The main goal of this work is to design a new type of transformation that simultaneously models the dependencies in both the spatial and channel dimensions while maintaining a relatively small memory footprint, thereby improving the capacity of the generative flow.

## 3 Masked Convolutional Generative Flows

In this section, we describe the architectural components of the masked convolutional generative flow (MaCow). First, we introduce the proposed flow transformation using masked convolutions in §3.1. Then, we present a fine-grained version of the multi-scale architecture adopted by previous generative flows (Dinh et al., 2016; Kingma and Dhariwal, 2018) in §3.2.
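Before turning to MaCow's components, the sketch below illustrates the affine coupling transform of §2.2, including the cheap log-determinant that every flow layer must provide. The conditioning network and the `sigmoid(log_s + 2)` stabilization are simplifying assumptions in the spirit of common open-source Glow implementations, not the exact architecture used in this paper.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Affine coupling: y_a = x_a, y_b = s(x_a) * x_b + b(x_a).
    Because y_a copies x_a, the Jacobian is triangular and its
    log-determinant is simply the sum of the log-scales."""

    def __init__(self, channels, hidden=128):
        super().__init__()
        assert channels % 2 == 0, "input is split in half along channels"
        # Small conv net predicting per-position log-scale and bias from x_a.
        self.net = nn.Sequential(
            nn.Conv2d(channels // 2, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, x):
        xa, xb = x.chunk(2, dim=1)              # split along channels
        log_s, b = self.net(xa).chunk(2, dim=1)
        s = torch.sigmoid(log_s + 2.0)          # keep scales positive and stable
        yb = s * xb + b
        logdet = s.log().flatten(1).sum(dim=1)  # log|det J| per example
        return torch.cat([xa, yb], dim=1), logdet

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=1)               # y_a equals x_a, so s and b
        log_s, b = self.net(ya).chunk(2, dim=1)  # can be recomputed exactly
        s = torch.sigmoid(log_s + 2.0)
        xb = (yb - b) / s
        return torch.cat([ya, xb], dim=1)
```

Note how inversion reuses the same network: since $y_a = x_a$, the scale and bias can be recomputed from the untouched half, which is what makes the layer invertible without inverting the network itself.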
Figure 1: Visualization of the receptive fields of four masked convolutions with rotational ordering.

### 3.1 Flow with Masked Convolutions

Applying autoregressive models to normalizing flows has been previously explored (Kingma et al., 2016; Papamakarios et al., 2017), with the idea of sequentially modeling the input random variables in an autoregressive order, so that the transformation of each variable depends only on the variables preceding it:

$$y_t = s(x_{<t}; \theta) \odot x_t + b(x_{<t}; \theta),$$

where $x_{<t}$ denotes the input variables preceding $x_t$ in the autoregressive ordering, and $s(\cdot)$ and $b(\cdot)$ are scale and bias functions analogous to those in the coupling layers above.
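As a concrete illustration of the masking idea, here is a minimal PixelCNN-style masked 2D convolution, in which each output position only sees inputs that precede it in raster-scan order. The kernel sizes and the rotational orderings MaCow actually uses (Figure 1) are the paper's design choices, so this sketch should be read as the generic building block rather than the authors' exact layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """2D convolution whose kernel is zeroed at the center position and
    at every position after it in raster-scan order, so the output at
    pixel t depends only on the inputs x_{<t}."""

    def __init__(self, in_ch, out_ch, kernel_size, **kwargs):
        super().__init__(in_ch, out_ch, kernel_size,
                         padding=kernel_size // 2, **kwargs)
        kh, kw = self.weight.shape[-2:]
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2:] = 0  # center pixel and everything to its right
        mask[kh // 2 + 1:, :] = 0    # all rows below the center
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Apply the mask at every call so gradients never reopen masked taps.
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)
```

Rotating this mask by 90° in successive layers yields the four orderings visualized in Figure 1; stacking such layers enlarges the receptive field while each individual kernel stays small.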