# Normalizing Flows are Capable Generative Models

Shuangfei Zhai 1, Ruixiang Zhang 1, Preetum Nakkiran 1, David Berthelot 1, Jiatao Gu 1, Huangjie Zheng 1, Tianrong Chen 1, Miguel Angel Bautista 1, Navdeep Jaitly 1, Josh Susskind 1

Normalizing Flows (NFs) are likelihood-based models for continuous inputs. They have demonstrated promising results on both density estimation and generative modeling tasks, but have received relatively little attention in recent years. In this work, we demonstrate that NFs are more powerful than previously believed. We present TARFLOW: a simple and scalable architecture that enables highly performant NF models. TARFLOW can be thought of as a Transformer-based variant of Masked Autoregressive Flows (MAFs): it consists of a stack of autoregressive Transformer blocks on image patches, alternating the autoregression direction between layers. TARFLOW is straightforward to train end-to-end, and capable of directly modeling and generating pixels. We also propose three key techniques to improve sample quality: Gaussian noise augmentation during training, a post-training denoising procedure, and an effective guidance method for both class-conditional and unconditional settings. Putting these together, TARFLOW sets new state-of-the-art results on likelihood estimation for images, beating the previous best methods by a large margin, and generates samples with quality and diversity comparable to diffusion models, for the first time with a stand-alone NF model. We make our code available at https://github.com/apple/ml-tarflow.

## 1. Introduction

Normalizing Flows (NFs) are a well-established likelihood-based method for unsupervised learning (Tabak & Vanden-Eijnden, 2010; Rezende & Mohamed, 2015; Dinh et al., 2014). The method follows a simple learning objective,

1 Apple. Correspondence to: Shuangfei Zhai. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada.
PMLR 267, 2025. Copyright 2025 by the author(s).

Figure 1. TARFLOW demonstrates substantial progress in the domain of normalizing flow models, achieving state-of-the-art results in both density estimation and sample generation. Left: We show the historical progression of likelihood performance on ImageNet 64x64, measured in bits per dimension (BPD), where our model significantly outperforms previous methods (see Table 2 for details). Right: Selected samples from our model trained on ImageNet 128x128 demonstrate unprecedented image quality and diversity for a normalizing flow model, establishing a new benchmark for this class of generative models.

which is to transform a data distribution into a simple prior distribution (such as Gaussian noise), keeping track of likelihoods via the change of variables formula. Normalizing Flows enjoy many unique and appealing properties, including exact likelihood computation, deterministic objective functions, and efficient computation of both the data generator and its inverse. There has been a large body of work dedicated to studying and improving NFs, and in fact NFs were the method of choice for density estimation for a number of years (Dinh et al., 2017; Kingma & Dhariwal, 2018; Chen et al., 2018; Papamakarios et al., 2017; Ho et al., 2019). However, in spite of this rich line of work, Normalizing Flows have seen limited practical adoption, in stark contrast to other generative models such as Diffusion Models (Sohl-Dickstein et al., 2015; Ho et al., 2020) and Large Language Models (Brown et al., 2020). Moreover, the state-of-the-art in Normalizing Flows has not kept pace with the rapid progress of these other generative techniques, leading to less attention from the research community. It is natural to wonder whether this situation is inherent, i.e., are Normalizing Flows fundamentally limited as a modeling paradigm?
Or, have we just not found an appropriate way to train powerful NFs and fully realize their potential? Answering this question may allow us to reopen an alternative path to powerful generative modeling, similar to how DDPM (Ho et al., 2020) enlightened the field of diffusion modeling and brought about its current renaissance.

In this work, we show that NFs are more powerful than previously believed, and in fact can compete with state-of-the-art generative models on images. Specifically, we introduce TARFLOW (short for Transformer AutoRegressive Flow): a powerful NF architecture that allows one to easily scale up the model's capacity, as well as a set of techniques that drastically improve the model's generation capability.

On the architecture side, TARFLOW is conceptually similar to Masked Autoregressive Flows (MAFs) (Papamakarios et al., 2017), where we compose a deep transformation by iteratively stacking multiple blocks of autoregressive transformations with alternating directions. The key difference is that we deploy a powerful masked Transformer (Vaswani et al., 2017) based implementation that operates in a block-autoregressive fashion (that is, predicting a block of dimensions at a time), instead of the simple masked MLPs used in MAFs, which factorize the input on a per-dimension basis. In the context of image modeling, we implement each autoregressive flow transformation with a causal Vision Transformer (ViT) (Dosovitskiy et al., 2021) on top of a sequence of image patches, given a particular order of autoregression (e.g., top left to bottom right, or the reverse). This admits a powerful non-linear transformation among all image patches, while maintaining a parallel computational graph during training.
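To make the patch-sequence setup concrete, here is a minimal sketch of how an image can be flattened into the sequence of patch tokens that a causal ViT would operate on. The shapes and patch size are illustrative only, and `patchify` is a hypothetical helper written in NumPy, not the paper's implementation.

```python
import numpy as np

def patchify(x: np.ndarray, p: int) -> np.ndarray:
    """(B, C, H, W) -> (B, N, D) with N = (H//p)*(W//p) tokens of dim D = C*p*p."""
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // p, p, W // p, p)   # split H and W into patch grids
    x = x.transpose(0, 2, 4, 1, 3, 5)           # (B, H//p, W//p, C, p, p)
    return x.reshape(B, (H // p) * (W // p), C * p * p)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 8, 8))               # toy batch of 8x8 RGB images
seq = patchify(x, p=4)                          # 2x2 grid of 4x4 patches
assert seq.shape == (2, 4, 48)
# the first token is the top-left 4x4 patch, flattened channels-first
assert np.allclose(seq[0, 0], x[0, :, :4, :4].reshape(-1))
```

Reversing the autoregression direction between flow blocks then amounts to reversing this token sequence, which keeps the implementation uniform across blocks.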
Compared to other NF design choices (Dinh et al., 2017; Grathwohl et al., 2019; Kingma & Dhariwal, 2018; Ho et al., 2019), which often have several types of interleaving modules, our model features a modular design and enjoys greater simplicity, both conceptually and practically. This in return allows for much improved scalability and training stability, which is another critical aspect for high performance models. With this new architecture, we can immediately train much stronger NF models than previously reported, resulting in state-of-the-art results on image likelihood estimation.

On the generation side, we introduce three important techniques. First, we show that for perceptual quality, it is critical to add a moderate amount of Gaussian noise to the inputs, in contrast to the small amount of uniform noise commonly used in the literature. Second, we identify a post-training score-based denoising technique that allows one to remove the noise portion of the generated samples. Third, we show for the first time that guidance (Ho & Salimans, 2022) is compatible with NF models, and we propose guidance recipes for both class-conditional and unconditional models. Putting these techniques together, we are able to achieve state-of-the-art sample quality for NF models on standard image modeling tasks.

We highlight our main results in Figure 1, and summarize our contributions as follows.

- We introduce TARFLOW, a simple and powerful Transformer-based Normalizing Flow architecture.
- We achieve state-of-the-art results on likelihood estimation on images, achieving a sub-3 BPD on ImageNet 64x64 for the first time.
- We show that Gaussian noise augmentation during training plays a critical role in producing high quality samples.
- We present a post-training score-based denoising technique that allows one to remove the noise in the generated samples.
- We show that guidance is compatible with both class-conditional and unconditional models, which drastically improves sampling quality.
Table 1. Notation.

| Notation | Meaning |
| --- | --- |
| $p_{\text{data}}(x)$ | training distribution |
| $p_{\text{model}}(y)$ | model distribution |
| $f(x)$ | the forward flow function |
| $f^t(z^t)$ | the forward function for the $t$-th flow block |
| $\mu^t, \alpha^t$ | learnable causal functions in the $t$-th flow block |
| $p_\epsilon(\epsilon)$ | the noise distribution |
| $q(y)$ | the noisy data distribution |
| $p(\bar{x})$ | the discrete model distribution |

## 2.1. Normalizing Flows

Given continuous inputs $x \sim p_{\text{data}}$, $x \in \mathbb{R}^D$, a Normalizing Flow learns a density $p_{\text{model}}$ via the change of variables formula

$$p_{\text{model}}(x) = p_0(f(x))\left|\det\left(\frac{df(x)}{dx}\right)\right|,$$

where $f: \mathbb{R}^D \mapsto \mathbb{R}^D$ is an invertible transformation for which we can also compute the determinant of the Jacobian $\det\left(\frac{df(x)}{dx}\right)$, and $p_0$ is a prior distribution. The maximum likelihood estimation (MLE) objective can then be written as

$$\min_f \; -\log p_0(f(x)) - \log\left(\left|\det \frac{df(x)}{dx}\right|\right). \tag{1}$$

In this paper, we let $p_0$ be a standard Gaussian distribution $\mathcal{N}(0, I_D)$, so Equation 1 can be explicitly written as

$$\min_f \; 0.5\,\|f(x)\|_2^2 - \log\left(\left|\det \frac{df(x)}{dx}\right|\right), \tag{2}$$

where we have omitted constant terms.

Figure 2. Left: TARFLOW consists of T flow blocks trained end to end. Right: a zoom-in view of each flow block, which contains a sequence permutation operation, a standard causal Transformer, and an affine transformation to the permuted inputs.

Equation 2 bears an intuitive interpretation: the first term encourages the model to map data samples $x$ to latent variables $z = f(x)$ of small norm, while the second term discourages the model from collapsing, i.e., the model should map proximate inputs to separated latents, which allows it to fully occupy the latent space.
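The objective in Equation 2 can be sanity-checked on a toy elementwise affine flow. This is an illustrative sketch, not the paper's model: `mu` and `alpha` are fixed vectors standing in for network outputs.

```python
import numpy as np

# Toy instance of the objective in Eq. 2 for an elementwise affine flow
# f(x) = (x - mu) * exp(-alpha), with a standard Gaussian prior.
rng = np.random.default_rng(0)
D = 4
x = rng.normal(size=D)
mu = np.zeros(D)
alpha = np.zeros(D)

z = (x - mu) * np.exp(-alpha)            # forward flow, z = f(x)
log_det = -alpha.sum()                   # log|det df/dx| for a diagonal Jacobian
loss = 0.5 * (z ** 2).sum() - log_det    # Eq. 2, constant terms omitted

# with mu = 0 and alpha = 0, f is the identity, so the loss is 0.5 * ||x||^2
assert np.isclose(loss, 0.5 * (x ** 2).sum())
```

The two terms pull in opposite directions exactly as described: shrinking `alpha` shrinks `z` and the first term, but makes `log_det` more negative and the second term larger.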
Once the model is trained, one automatically obtains a generative model via $z \sim p_0(z)$, $x = f^{-1}(z)$.

## 2.2. Block Autoregressive Flows

One appealing method for constructing a deep normalizing flow is by stacking multiple layers of autoregressive flows. This was first proposed in IAF (Kingma et al., 2016) in the context of variational inference, and later extended by MAF (Papamakarios et al., 2017) as standalone density models. In this paper, we consider a generalized formulation of MAF: block autoregressive flows. Without loss of generality, we assume an input presented in the form of a sequence $x \in \mathbb{R}^{N \times D}$, where $N$ is the sequence length and $D$ is the dimension of each block of input. Let $T \in \mathbb{N}$ be the number of flow layers in the stack of flows. Subscripts denote indexing along the sequence dimension, e.g., $x_i \in \mathbb{R}^D$, and superscripts denote flow-layer indices (see Figure 2). We then specify a flow transformation $z^T = f(x) := (f^{T-1} \circ f^{T-2} \circ \cdots \circ f^0)(x)$ as follows.

First, we choose $\{\pi^t\}$ as any fixed set of permutation functions along the sequence dimension. The $t$-th flow, $f^t$, is parameterized by two learnable functions $\mu^t, \alpha^t: \mathbb{R}^{N \times D} \to \mathbb{R}^{N \times D}$, which are both causal along the sequence dimension. We initialize with $z^0 := x$. Then, the $t$-th flow transforms $z^t \in \mathbb{R}^{N \times D}$ into $z^{t+1} \in \mathbb{R}^{N \times D}$ by first permuting, $\hat{z}^t = \pi^t(z^t)$, and then transforming each block of inputs $\{\hat{z}^t_i\}_{i \in [N]}$ as:

$$z^{t+1}_i = \begin{cases} \hat{z}^t_i & i = 0 \\ \left(\hat{z}^t_i - \mu^t_i(\hat{z}^t)\right) \odot \exp\left(-\alpha^t_i(\hat{z}^t)\right) & i > 0. \end{cases} \tag{3}$$

Note that since $\mu^t$ is causal, the $i$-th token of its output $\mu^t_i(\hat{z}^t)$ only depends on $\hat{z}^t_0, \ldots, \hat{z}^t_{i-1}$ (and likewise for $\alpha^t$), so Equation 3 can be inverted sequentially: given $z^{t+1}$, one recovers $\hat{z}^t_i$ for $i = 0, 1, \ldots, N-1$ in order, and then $z^t = (\pi^t)^{-1}(\hat{z}^t)$. This yields $x := z^0$ as the final iterate.

As for the choice of permutations $\pi^t$, in this work we set all $\pi^t$ to the reverse function $\pi^t(z)_i = z_{N-1-i}$, except for $\pi^0$, which is set to the identity. Ultimately, the entire flow transformation consists of $T$ flows $\{f^t\}$, and in each flow the input is first permuted, then causally transformed with learnable element-wise subtractive and divisive terms $\mu^t_i(\cdot)$, $\exp(\alpha^t_i(\cdot))$. It is worth noting that Equation 3 degenerates to MAF when $D = 1$.
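One flow step and its sequential inversion can be sketched as follows. As an assumption of this sketch, `causal_stats` is a hand-coded stand-in for the causal Transformer outputs $\mu, \alpha$, chosen only so that each token depends strictly on earlier tokens; the real model learns these functions.

```python
import numpy as np

def causal_stats(zhat, i):
    """Stand-in causal functions: mu_i, alpha_i depend only on tokens 0..i-1."""
    prev = zhat[:i]
    mu = prev.mean(axis=0)
    alpha = np.tanh(prev.std(axis=0))          # keep exp(alpha) well-behaved
    return mu, alpha

def flow_forward(z, perm):
    """One block-autoregressive flow step (parallelizable given full zhat)."""
    zhat = z[perm]
    out = zhat.copy()                          # token 0 passes through unchanged
    for i in range(1, len(zhat)):
        mu, alpha = causal_stats(zhat, i)
        out[i] = (zhat[i] - mu) * np.exp(-alpha)
    return out

def flow_inverse(znext, perm):
    """Inversion must run token by token, since mu, alpha need recovered tokens."""
    zhat = np.zeros_like(znext)
    zhat[0] = znext[0]
    for i in range(1, len(znext)):
        mu, alpha = causal_stats(zhat, i)      # uses zhat[:i], already recovered
        zhat[i] = znext[i] * np.exp(alpha) + mu
    inv = np.empty_like(zhat)
    inv[perm] = zhat                           # undo the permutation
    return inv

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))                    # N=8 tokens of dimension D=4
perm = np.arange(8)[::-1]                      # reverse ordering, as in the text
z_next = flow_forward(z, perm)
z_rec = flow_inverse(z_next, perm)
assert np.allclose(z, z_rec)
```

Note the asymmetry that makes training efficient: the forward pass can be computed in parallel from the full permuted sequence, while the inverse (generation) is inherently sequential.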
Intuitively, $D$ plays the role of balancing the difficulty of modeling each position in the sequence against the length of the entire sequence. This allows for extra modeling flexibility compared to the naive setting in MAF, which will become clearer in the later discussions.

In each flow transformation $f^t$, there are two operations. The first, the permutation $\pi^t$, is volume preserving, therefore its log determinant of the Jacobian is zero. The second, the autoregressive step, has a Jacobian matrix of lower triangular shape, which means its determinant needs to only account for the diagonal entries. The log determinant of the Jacobian then readily evaluates to

$$\log\left(\left|\det\left(\frac{df^t(z^t)}{dz^t}\right)\right|\right) = -\sum_{i=0}^{N-1} \sum_{j=0}^{D-1} \alpha^t_i(\hat{z}^t)_j, \tag{4}$$

where $\alpha^t_i(\hat{z}^t)_j$ denotes the $j$-th entry of the $i$-th output token.
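The triangular-Jacobian identity can be verified numerically on a small instance. As before, the causal map below is a hand-coded stand-in for the Transformer, an assumption of this sketch; the check compares the claimed closed form against a finite-difference Jacobian.

```python
import numpy as np

def forward(zflat, N, D):
    """One causal affine flow step on a flattened (N*D,) input; returns alphas too."""
    z = zflat.reshape(N, D)
    out = z.copy()
    alphas = np.zeros((N, D))                  # alpha for token 0 stays zero
    for i in range(1, N):
        mu = z[:i].mean(axis=0)                # causal: tokens 0..i-1 only
        alpha = np.tanh(z[:i].std(axis=0))
        alphas[i] = alpha
        out[i] = (z[i] - mu) * np.exp(-alpha)
    return out.reshape(-1), alphas

rng = np.random.default_rng(0)
N, D = 4, 2
zflat = rng.normal(size=N * D)
_, alphas = forward(zflat, N, D)

# assemble the Jacobian by central finite differences
eps = 1e-6
J = np.zeros((N * D, N * D))
for k in range(N * D):
    e = np.zeros(N * D)
    e[k] = eps
    J[:, k] = (forward(zflat + e, N, D)[0] - forward(zflat - e, N, D)[0]) / (2 * eps)

# lower triangular Jacobian with diagonal exp(-alpha) => log|det| = -sum(alpha)
log_det = np.log(abs(np.linalg.det(J)))
assert np.isclose(log_det, -alphas.sum(), atol=1e-4)
```

Summing this quantity over the $T$ flow blocks gives the log-determinant term of Equation 2 for the full model, so the exact likelihood is available at the cost of accumulating the $\alpha$ outputs that the forward pass already computes.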