# Self Normalizing Flows

T. Anderson Keller 1 2, Jorn W. T. Peters 1 2, Priyank Jaini 1 2, Emiel Hoogeboom 1 2, Patrick Forré 2, Max Welling 2

1 UvA-Bosch Delta Lab, 2 University of Amsterdam, Netherlands. Correspondence to: T. Anderson Keller. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Efficient gradient computation of the Jacobian determinant term is a core problem in many machine learning settings, and especially so in the normalizing flow framework. Most proposed flow models therefore either restrict to a function class with easy evaluation of the Jacobian determinant, or an efficient estimator thereof. However, these restrictions limit the performance of such density models, frequently requiring significant depth to reach desired performance levels. In this work, we propose Self Normalizing Flows, a flexible framework for training normalizing flows by replacing expensive terms in the gradient by learned approximate inverses at each layer. This reduces the computational complexity of each layer's exact update from O(D^3) to O(D^2), allowing for the training of flow architectures which were otherwise computationally infeasible, while also providing efficient sampling. We show experimentally that such models are remarkably stable and optimize to similar data likelihood values as their exact gradient counterparts, while training more quickly and surpassing the performance of functionally constrained counterparts.

1. Introduction

The framework of normalizing flows (Tabak & Turner, 2013) allows for powerful exact density estimation through the change of variables formula (Rudin, 1987). A significant challenge with this approach is the Jacobian determinant in the objective, which is generally expensive to compute. A large body of work has therefore focused on methods to evaluate the Jacobian determinant efficiently, usually by limiting the expressivity of the transformation. Two classes of functions have been proposed to achieve this: i) those with triangular Jacobians, such that the determinant only depends on the diagonal (Bogachev et al., 2005; Marzouk et al., 2016; Jaini et al., 2019), and ii) those which are Lipschitz continuous, such that the Jacobian determinant can be approximated at each iteration through an infinite series (Behrmann et al., 2019; Grathwohl et al., 2019). The drawback of both of these approaches is that they rely on strong functional constraints.

Recently, a number of works have pursued an alternative approach to training unconstrained normalizing flows by avoiding the expensive computation of the Jacobian determinant altogether during training (Gresele et al., 2020; Krämer et al., 2020). However, these works restrict the parametrization of the transformation, or the parameter updates. As a result, it can be difficult to scale these to higher dimensional data such as images.

In this work we introduce a new framework to avoid the determinant computation during training, which we name the Self Normalizing Framework. Instead of computing the log Jacobian determinant, we approximate its gradient directly. This can be achieved through the insight that the derivative of the log Jacobian determinant is given by the inverse of the Jacobian itself. In this framework, flow components learn to approximate their own inverse through a self-supervised layer-wise reconstruction loss.
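The identity behind this insight can be checked numerically in a few lines. The sketch below is our own illustration (not the authors' code): it uses PyTorch autograd to confirm that the gradient of log |det W| with respect to a square matrix W equals W^{-T}, the term which the self normalizing update replaces with a learned inverse.

```python
import torch

# Numerical check of the identity d/dW log|det W| = W^{-T}.
torch.manual_seed(0)
D = 5
W = torch.randn(D, D, requires_grad=True)

_, logabsdet = torch.slogdet(W)   # log |det W|
logabsdet.backward()              # exact gradient, stored in W.grad

analytic = torch.inverse(W.detach()).t()             # W^{-T}
print(torch.allclose(W.grad, analytic, atol=1e-5))   # expected: True
```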
Further, we then define the overall density model as a mixture of the probabilities induced by both the forward and inverse transformations, and show how both transformations can be updated symmetrically using their respective learned inverses directly in the gradient. Ultimately, this avoids the O(D^3) complexity required for computing the determinant of each layer at each training iteration, instead substituting it with an additional backwards pass of order O(D^2) required to propagate the reconstruction error gradients.

2. Related Work

The field of normalizing flows can be broadly divided into linear and non-linear flows (Papamakarios et al., 2019). Non-linear flows are generally constructed either by constraining the Jacobian to be triangular (Kingma et al., 2016; Papamakarios et al., 2017; van den Berg et al., 2018; Huang et al., 2018; Jaini et al., 2019) or by constraining a residual function to be Lipschitz (Grathwohl et al., 2019; Behrmann et al., 2019; Chen et al., 2019; Perugachi-Diaz et al., 2020).

Figure 1. Overview of self normalizing flows. A matrix W transforms x to z. The matrix R is constrained to approximate the inverse of W with a reconstruction loss E. The likelihood is efficiently optimized by approximating the gradient of the log Jacobian determinant with the learned inverse.

Linear flows are constructed using a variety of methods. Examples of linear flows are 1×1 convolutions (Kingma & Dhariwal, 2018) which have block-diagonal structure, periodic convolutions (Karami et al., 2019; Hoogeboom et al., 2019) which leverage the frequency domain, Woodbury flows (Lu & Huang, 2020) that use low-rank transformations, and relative gradient-based flows (Gresele et al., 2020) that re-frame optimization of fully connected linear flows. The disadvantage of these methods is that they are either constrained to a subset of transformations, or based on matrix decomposition structures that cannot straightforwardly be extended to convolutional weight sharing.

The work of Gresele et al. (2020) is most similar to ours in the goal of training unconstrained normalizing flows through the use of efficient gradient computations. In Gresele et al. (2020) this is achieved by applying a carefully constructed post-conditioner to the gradient of each layer, transforming it into the natural gradient (Cardoso & Laheld, 1996; Amari, 1998), thereby selectively canceling out the inverse normally required during training. However, since parameters are extensively shared for convolutions, natural gradient methods cannot be straightforwardly applied to convolutional layers. In contrast, the framework proposed in this paper makes use of the traditional gradient, allowing for more flexibility in parameterization and requiring only that an inverse function can be learned and maintained throughout training.

Related to our framework, the idea of using learned inverse functions in the setting of density estimation was proposed in early work on invertible neural networks (Rippel & Adams, 2013). In that work, similar to ours, both directions of the density model are parameterized and constrained to be approximate inverses through a reconstruction loss. The learned encoder is then used to approximate the marginal distribution of the data in latent space by finding the best fit of a tractable parametric family (such as the Beta family), and the divergence of the approximate latent distribution from the target latent distribution is then minimized.
Important differences with our work are that we compute the likelihood exactly through the change of variables formula, and use the Jacobian of our learned decoder as an approximation to the gradient of the intractable Jacobian determinant term which arises. An advantage of this approach is that we no longer need to explicitly constrain our decoder to be well conditioned, or invertible, since this constraint is implicitly imposed through the maximization of the Jacobian determinant. Additionally, our framework novelly models the density of observations as a mixture of the densities under the forward and inverse transformations, taking advantage of all learned parameters.

More broadly, the idea of greedy layer-wise learning has been explored in many forms for training neural networks (Bengio et al., 2006; Hinton et al., 2006; Löwe et al., 2019). One influential class of work uses stacked auto-encoders, or deep belief networks, for pre-training or representation learning (Bengio et al., 2006; Hinton et al., 2006; Vincent et al., 2008; Kosiorek et al., 2019). Our work leverages similar models and training paradigms, but introduces them into a modern flow framework, demonstrating additional potential uses for the learned feedback connections.

Another related class of work addresses the biological implausibility of backpropagation, also through learned layer-wise autoencoders. Target propagation (Lecun, 1986; Bengio, 2014; Lee et al., 2015; Bengio, 2020; Meulemans et al., 2020) addresses the so-called weight transport problem by training auto-encoders at each layer of a network and using these learned feedback weights to propagate targets to previous layers. Our method takes inspiration from this approach. Specifically, our method can be viewed as a hybrid of target propagation and backpropagation (Linnainmaa, 1976; Werbos, 1982; Rumelhart et al., 1986), particularly suited to unsupervised density estimation in the normalizing flow framework. The novelty of our approach in this regard lies in the use of the inverse weights directly in the update, rather than in the backward propagation of updates.

Figure 2. MNIST (left) and CIFAR-10 (right) samples generated from a self normalizing Glow model using the exact inverse (top) vs. the approximate learned inverse (bottom). From these samples we observe that the self normalizing flow models have learned to become good generative models of the data. Additionally, comparing the top and bottom rows, we see the inverse approximation is nearly exact.

3. A General Framework for Self Normalizing Flows

3.1. Preliminaries

Given an observation x ∈ R^D, it is assumed that x is generated from an underlying real vector z ∈ R^D through an invertible and differentiable transformation g, and that f = g^{-1} is also differentiable (i.e. g is a diffeomorphism). It is further assumed that z is a sample from a simple known underlying distribution p_Z, such as a standard Gaussian. Then, the probability density p_X can be computed exactly using the change of variables formula:

$$p_X(x) = p_Z(z)\,\Big|\det \frac{\partial z}{\partial x}\Big| = p_Z\big(g^{-1}(x)\big)\,\big|J_{g^{-1}}\big| = p_Z\big(f(x)\big)\,\big|J_f\big| \qquad (1)$$

where the change of volume term $|J_f| = \big|\det \frac{\partial f(x)}{\partial x}\big|$ is the determinant of the Jacobian of the transformation between z and x, evaluated at x. Typically, the functions f and g are defined as compositions of diffeomorphisms themselves, i.e. $g = g_0 \circ g_1 \circ \dots \circ g_K$ and $f = f_K \circ f_{K-1} \circ \dots \circ f_0$, where $g_k^{-1} = f_k$.
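As a concrete illustration of Equation 1 and this compositional structure, the following sketch (our own, with square linear layers W_k standing in for the f_k) evaluates log p_X(x) by accumulating a per-layer log determinant under a standard Gaussian p_Z:

```python
import math
import torch

# Change of variables (Equation 1) for f = f_K ∘ ... ∘ f_0 with f_k(h) = W_k h.
def log_prob(x, weights):
    z, log_det = x, torch.zeros(x.shape[0])
    for W in weights:
        z = z @ W.t()                            # apply f_k
        log_det = log_det + torch.slogdet(W)[1]  # accumulate log |J_{f_k}|
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=1)  # N(0, I)
    return log_pz + log_det

D = 4
weights = [torch.randn(D, D) for _ in range(3)]
print(log_prob(torch.randn(8, D), weights).shape)  # torch.Size([8])
```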
This formulation takes advantage of the fact that a composition of diffeomorphic functions is also a diffeomorphism, meaning that if each f_k is invertible and differentiable, then so is the composition f, and the change of variables formula in Equation 1 still holds. Most approaches then propose defining and parameterizing only one direction of the flow, for example the forward functions f_k, and compute the inverses g_k exactly when needed. The log-likelihood of the observations is then maximized simultaneously with respect to each f_k's vector of parameters θ_k, for all k, which requires its gradient. Using the identity $\frac{\partial \log|J|}{\partial J} = J^{-T}$, we obtain the following gradient of the loss with respect to a given layer k's parameters:

$$\nabla_{\theta_k} \log p_X(x) = \nabla_{\theta_k} \log p_Z\big(f(x)\big) + \big(\mathrm{vec}\, J_f^{-T}\big)^T \frac{\partial\, \mathrm{vec}\, J_f}{\partial \theta_k} \qquad (2)$$

Following the conventions of Magnus (2010) for matrix derivatives, we make use of the vectorization operator vec, which maps m × n matrices to mn × 1 column vectors by stacking the columns of the matrix sequentially, and we formulate the parameters θ_k as column vectors.

3.2. Self Normalizing Framework

In order to avoid the inverse Jacobian in the gradient, we instead propose to define and parameterize both the forward and inverse functions f_k and g_k, with parameters θ_k and γ_k respectively. We then constrain the parameterized inverse g_k to be approximately equal to the true inverse f_k^{-1} through a layer-wise reconstruction loss. We can thus define our maximization objective as the mixture of the log-likelihoods induced by both models, minus the reconstruction penalty:

$$\frac{1}{2} \log p^f_X(x) + \frac{1}{2} \log p^g_X(x) - \lambda \sum_{k=0}^{K} \big\| g_k\big(f_k(h_k)\big) - h_k \big\|_2^2 \qquad (3)$$

where p^f_X and p^g_X now denote the densities induced by the forward and inverse transformations separately, and $h_k = \mathrm{gradient\_stop}\big(f_{k-1} \circ \dots \circ f_0(x)\big)$ is the output of function f_{k-1} with the gradients blocked, such that only g_k and f_k receive gradients from the reconstruction loss at layer k. We see that when f = g^{-1} exactly, this is equivalent to the traditional normalizing flow framework.

By the inverse function theorem, we know that the inverse of the Jacobian of an invertible function is given by the Jacobian of the inverse function, i.e. $J_f^{-1}(x) = J_{f^{-1}}(z)$. Therefore, we see that with the above parameterization and constraint, we can approximate both the change of variables formula and the gradients of both functions in terms of the Jacobians of the respective inverse functions. Explicitly:

$$\nabla_{\theta_k} \log p^f_X(x) \approx \nabla_{\theta_k} \log p_Z\big(f(x)\big) + \big(\mathrm{vec}\, J_g^{T}\big)^T \frac{\partial\, \mathrm{vec}\, J_f}{\partial \theta_k} \qquad (4)$$

$$\nabla_{\gamma_k} \log p^g_X(x) \approx \nabla_{\gamma_k} \log p_Z\big(g^{-1}(x)\big) - \big(\mathrm{vec}\, J_f^{T}\big)^T \frac{\partial\, \mathrm{vec}\, J_g}{\partial \gamma_k} \qquad (5)$$

where Equation 5 follows from the derivation of Equation 4 and the application of the derivative of the inverse.

Figure 3. Angle (in degrees) between the exact gradient and the approximate gradient at each step of training. We see that the angle is close to 0.1 degrees for the fully connected network (FC), and less than 1 degree for the convolutional network (CNN), suggesting the approximation is close throughout training.

We note that the above approximation requires that the Jacobians of the functions are approximately inverses of one another, in addition to the functions themselves being approximate inverses. For the models presented in this work, this property is obtained for free, since the Jacobian of a linear mapping is the matrix representation of the map itself. However, for more complex mappings, this may not be exactly the case and should be constrained explicitly. We suggest this could be efficiently implemented with an additional loss analogous to a reconstruction loss, but employing Jacobian vector products instead of matrix vector products (see Section A.4).
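One possible realization of such a Jacobian consistency penalty, written as our own sketch rather than the construction in Section A.4, is to require that J_g(f(x)) J_f(x) v stays close to v for random probe vectors v, which can be evaluated with two Jacobian vector products and no explicit Jacobians:

```python
import torch
from torch.autograd.functional import jvp

# Penalize || J_g(f(x)) J_f(x) v - v ||^2 for a random probe v.
# (Pass create_graph=True to jvp if this loss should itself be trained through.)
def jacobian_consistency_loss(f, g, x):
    v = torch.randn_like(x)
    z, Jf_v = jvp(f, (x,), (v,))       # J_f(x) v
    _, JgJf_v = jvp(g, (z,), (Jf_v,))  # J_g(f(x)) J_f(x) v
    return ((JgJf_v - v) ** 2).sum()

# Example with a hypothetical linear pair f(x) = W x, g(z) = R z.
D = 6
W, R = torch.randn(D, D), torch.randn(D, D)
x = torch.randn(3, D)
print(jacobian_consistency_loss(lambda x: x @ W.t(), lambda z: z @ R.t(), x))
```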
Although there are no known convergence guarantees for such a method, we observe in practice that, with sufficiently large values of λ, most models quickly converge to solutions which maintain the desired constraint. This phenomenon is demonstrated in Figure 3, where it is shown that the angle between the true gradient and the approximate gradient at each training iteration decreases to less than 1 degree over the course of training for both models tested.

As a visual example of the quality of the inverse approximation, Figure 2 shows samples from the base distribution p_Z passed through both the true inverse f^{-1} (top) and the learned approximate inverse g (bottom) to generate samples from p_X. As demonstrated by the nearly identical samples, the approximate inverse appears to be a very close match to the true inverse. The details of the models which generated these samples are given in Section 5. Further mitigation strategies for potential optimization difficulties are discussed in Sections 6 and A.4.

Finally, in Figure 4 we see the efficiency improvements of our method compared with the exact gradient. As expected, we see the self normalizing method scales much more favorably with input dimensionality than the exact gradient method.

Figure 4. Time per batch vs. input dimensionality for a single fully connected layer, comparing the self normalizing and exact gradients. We see the self normalizing flow model scales much more favorably with input dimension.

3.3. Inference with Self Normalizing Flows

The above framework proposes approximate gradients which allow for the training of normalizing flow architectures without having to compute expensive determinants or matrix inverses. However, to compute the exact log-likelihood of an observation (i.e. perform inference), the exact log Jacobian determinant still has to be computed for each input, which may be expensive. To alleviate this limitation we take advantage of the compositional formulation of f, and the multiplicative property of the determinant. In detail, the determinant of a product of square matrices is equal to the product of their determinants. Therefore, assuming our flow is composed of square transformations f_k, the Jacobian is given by $J_f = \prod_k J_{f_k}$, and the log determinant of the Jacobian is given by:

$$\log |J_f| = \sum_k \log |J_{f_k}| \qquad (6)$$

For neural network architectures composed of sequential linear and nonlinear transformations, this allows us to separate the components of the Jacobian which are data-independent (e.g. those of linear layers) from those that are data-dependent (e.g. those of the activations). For example, given a network composed of L layers of the form $h_k = \sigma_k(W_k h_{k-1})$, where σ_k is a point-wise non-linearity, combining Equations 1 and 6 yields:

$$\log p_X(x) = \log p_Z(z) + \sum_k \log |W_k| + \sum_k \log |\Sigma_k(x)| \qquad (7)$$

where Σ_k(x) is the Jacobian of the activation function at layer k, evaluated at the sample x. Importantly, we observe that the data-independent terms, $\sum_k \log |W_k|$, are the expensive part of inference, while the data-dependent terms, $\sum_k \log |\Sigma_k(x)|$, are frequently cheap to compute analytically given the derivative of the activation. Therefore, once trained, the expensive data-independent terms need only be computed once, and their values can then be reused for all future examples, effectively amortizing the cost of the determinant computation.
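A minimal sketch of this amortization, under our own simplifying assumptions (square linear layers W_k followed by a leaky-ReLU bijection standing in for the smooth-leaky ReLU and spline activations used in the paper), caches the O(D^3) term of Equation 7 once and adds only cheap per-sample activation terms at evaluation time:

```python
import math
import torch

class AmortizedFlow:
    def __init__(self, weights, alpha=0.5):
        self.weights, self.alpha = weights, alpha
        # Data-independent, O(D^3) term of Eq. 7: computed once after training.
        self.cached_logdet = sum(torch.slogdet(W)[1] for W in weights)

    def log_prob(self, x):
        h, act_logdet = x, torch.zeros(x.shape[0])
        for W in self.weights:
            pre = h @ W.t()
            h = torch.where(pre >= 0, pre, self.alpha * pre)  # leaky bijection
            # Sigma_k(x) is diagonal, so log|Sigma_k(x)| is a cheap sum.
            act_logdet = act_logdet + torch.where(
                pre >= 0, torch.zeros_like(pre),
                torch.full_like(pre, math.log(self.alpha))).sum(dim=1)
        log_pz = -0.5 * (h ** 2 + math.log(2 * math.pi)).sum(dim=1)
        return log_pz + self.cached_logdet + act_logdet

D = 4
flow = AmortizedFlow([torch.randn(D, D) for _ in range(3)])
print(flow.log_prob(torch.randn(8, D)).shape)  # torch.Size([8])
```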
Figure 5. Negative log-likelihood on the MNIST validation set for a 2-layer fully connected flow trained with exact vs. self normalizing (SNF) gradients. Shown vs. training time (left) and vs. epochs (right). We see that both models converge to similar optima while the SNF model trains much more quickly.

4. Self Normalizing Flows

In this section we introduce two simple applications of the self normalizing framework: fully-connected and convolutional self normalizing flows.

4.1. Self Normalizing Fully Connected Layer

As a specific case of the self normalizing framework, we introduce a single fully connected self normalizing layer, as exemplified in Figure 1. Let W, R ∈ R^{D×D}, with f(x) = Wx = z and g(z) = Rz, such that W^{-1} ≈ R. We additionally denote the layer-wise reconstruction loss as E(x) = ||RWx − x||_2^2, but leave the derivation of the gradients of this term to the appendix (see Section A.1), since these are efficiently computed by standard deep learning frameworks and require no approximations. Taking the gradients of Equation 3 with respect to the parameters W and R, we get the following exact gradients:

$$\nabla_W \Big[\tfrac{1}{2} \log p^f_X(x) + \tfrac{1}{2} \log p^g_X(x) - \lambda E\Big] = \tfrac{1}{2}\Big[\frac{\partial \log p_Z(Wx)}{\partial W} + \frac{\partial \log |W|}{\partial W}\Big] - \lambda \nabla_W E = \tfrac{1}{2}\big[\delta^f_z x^T + W^{-T}\big] - \lambda \nabla_W E \qquad (8)$$

$$\nabla_R \Big[\tfrac{1}{2} \log p^f_X(x) + \tfrac{1}{2} \log p^g_X(x) - \lambda E\Big] = \tfrac{1}{2}\Big[\frac{\partial \log p_Z(R^{-1}x)}{\partial R} + \frac{\partial \log |R^{-1}|}{\partial R}\Big] - \lambda \nabla_R E = \tfrac{1}{2}\big[-R^{-T} \delta^g_z x^T R^{-T} - R^{-T}\big] - \lambda \nabla_R E \qquad (9)$$

where $\delta^f_z = \frac{\partial \log p_Z(Wx)}{\partial (Wx)}$ and $\delta^g_z = \frac{\partial \log p_Z(R^{-1}x)}{\partial (R^{-1}x)}$ can be computed by standard backpropagation.

To avoid computing matrix inverses, using our framework we substitute W^{-1} with R in Equation 8 and, symmetrically, all instances of R^{-1} with W in Equation 9. Additionally, we approximate δ^f_z ≈ δ^g_z in Equation 9 such that we do not need to forward propagate through the inverse model, and can re-use the backpropagated error from the forward model. This corresponds to assuming that R^{-1}x ≈ Wx for all x, which follows directly from R^{-1} ≈ W. Ultimately this results in the following approximations to the gradients:

$$\nabla_W \Big[\tfrac{1}{2} \log p^f_X(x) + \tfrac{1}{2} \log p^g_X(x) - \lambda E\Big] \approx \tfrac{1}{2}\big[\delta^f_z x^T + R^{T}\big] - \lambda \nabla_W E \qquad (10)$$

$$\nabla_R \Big[\tfrac{1}{2} \log p^f_X(x) + \tfrac{1}{2} \log p^g_X(x) - \lambda E\Big] \approx \tfrac{1}{2}\big[-\delta^f_x z^T - W^{T}\big] - \lambda \nabla_R E \qquad (11)$$

where $\delta^f_x = \frac{\partial \log p_Z(Wx)}{\partial x} = W^T \delta^f_z$. We see that by using such a self normalizing layer, the gradient of the log determinant of the Jacobian term, which originally required an expensive matrix inversion at each iteration, is approximately given by the weights of the inverse transformation, sidestepping computation of both the Jacobian and the inverse. Additionally, we see the δ^f_x term required for Equation 11 is already computed by traditional backpropagation, making the update efficient. Finally, we note the above gradients can trivially be extended to compositions of such layers, combined with non-linearities, by substituting the appropriate deltas for each layer, and the corresponding layer inputs and outputs for x and z respectively.
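To make these updates concrete, the following is a condensed sketch (ours, not the authors' released implementation) of the approximate gradients of Equations 10 and 11 for a single linear pair, assuming a standard Gaussian p_Z and summing over a batch of row-vector inputs:

```python
import torch

def snf_linear_grads(W, R, x, lam=1.0):
    """Approximate ascent directions for W and R (Equations 10 and 11)."""
    n = x.shape[0]
    z = x @ W.t()                 # forward pass, z = W x
    delta_z = -z                  # d log p_Z(z)/dz for p_Z = N(0, I)
    delta_x = delta_z @ W         # delta_x = W^T delta_z

    # Exact gradients of the reconstruction loss E = ||R W x - x||^2.
    Wg, Rg = W.detach().requires_grad_(), R.detach().requires_grad_()
    E = ((x @ Wg.t() @ Rg.t() - x) ** 2).sum()
    dE_dW, dE_dR = torch.autograd.grad(E, (Wg, Rg))

    # The learned inverse R (resp. W) stands in for W^{-T} (resp. R^{-T}).
    grad_W = 0.5 * (delta_z.t() @ x + n * R.t()) - lam * dE_dW
    grad_R = 0.5 * (-delta_x.t() @ z - n * W.t()) - lam * dE_dR
    return grad_W, grad_R
```

In a full model, each layer's parameters would then be updated by a gradient ascent step on these quantities (e.g. W += lr * grad_W), with the appropriate deltas and layer inputs substituted per layer as described above.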
4.2. Self Normalizing Convolutional Layer

To construct a self normalizing convolutional layer, we consider the setting where both the forward transformation f and the inverse transformation g are convolutional with the same kernel size. We note, importantly, that the inverse of a convolution operation is not necessarily another convolution. However, for sufficiently large λ, we observe that f is regularized such that it is restricted to the class of convolutions which is approximately invertible by a convolution.

We define the parameters of f to be the kernel w and, similarly, the parameters of g to be the kernel r. Then, letting f(x) = w ⋆ x = z and g(z) = r ⋆ z, such that f^{-1} ≈ g, with x, z ∈ R^D, we proceed to derive the exact gradients of the log-likelihood. Again, we ignore the reconstruction term for simplicity, as it requires no approximations. To make the derivation easier, we note that the convolution operation is a linear operation and can therefore be represented in matrix form. We define a transformation W = T(w), which maps between the convolutional kernel w and the corresponding matrix form of the convolution:

$$z = T(w)\,x = w \star x \qquad (12)$$

Letting w be a column vector and again making use of the vectorization operator vec, we compute the exact gradient of the log-likelihood with respect to the kernel weights w:

$$\nabla_w \log p^f_X(x) = \Big(\frac{\partial\, \mathrm{vec}\, T(w)}{\partial w}\Big)^T \Big[\frac{\partial \log p_Z(T(w)x)}{\partial\, \mathrm{vec}\, T(w)} + \frac{\partial \log |T(w)|}{\partial\, \mathrm{vec}\, T(w)}\Big]$$
$$= \Big(\frac{\partial\, \mathrm{vec}\, T(w)}{\partial w}\Big)^T \Big[\mathrm{vec}\big(\delta^f_z x^T\big) + \mathrm{vec}\big(T(w)^{-T}\big)\Big]$$
$$= \delta^f_z \star x + \Big(\frac{\partial\, \mathrm{vec}\, T(w)}{\partial w}\Big)^T \mathrm{vec}\big(T(w)^{-T}\big) \qquad (13)$$

where we see that the first term, $\big(\frac{\partial\, \mathrm{vec}\, T(w)}{\partial w}\big)^T \mathrm{vec}\big(\delta^f_z x^T\big)$, is given by the convolution δ^f_z ⋆ x, as is usually done with backpropagation in convolutional neural networks. Then, given our soft constraint f^{-1} ≈ g, we can approximate T(w)^{-T} with T(r)^T, giving us:

$$\nabla_w \log p^f_X(x) \approx \delta^f_z \star x + \Big(\frac{\partial\, \mathrm{vec}\, T(w)}{\partial w}\Big)^T \mathrm{vec}\big(T(r)^T\big) \qquad (14)$$

To simplify the second term we note two points. First, the transpose of a convolution can similarly be achieved by a standard convolution with a transformed kernel. We call this transformed kernel flip(r), such that T(flip(r)) = T(r)^T. Succinctly, flip(·) is implemented by swapping the input and output channel axes and mirroring the spatial (height and width) dimensions of the kernel. The second point is that the partial derivative $\big(\frac{\partial\, \mathrm{vec}\, T(w)}{\partial w}\big)^T$ is given by a rectangular matrix which, when multiplied by the vectorized form of the convolution matrix T(w), yields a constant multiple m elementwise multiplied with the kernel w. Each multiple element m_i is given by the number of times the associated kernel element w_i is shared across the matrix T(w). We provide a more thorough derivation of this in the appendix, as well as an efficient method for calculating m for arbitrary convolutions (see Section A.2). In combination, we arrive at the approximate gradient of the likelihood with respect to the kernel w:

$$\nabla_w \log p^f_X(x) \approx \delta^f_z \star x + \mathrm{flip}(r) \odot m \qquad (15)$$

The symmetric derivation can be obtained for the gradient with respect to the kernel r, as outlined below:

$$\nabla_r \log p^g_X(x) = \Big(\frac{\partial\, \mathrm{vec}\, T(r)}{\partial r}\Big)^T \Big[\frac{\partial \log p_Z(T(r)^{-1}x)}{\partial\, \mathrm{vec}\, T(r)} + \frac{\partial \log |T(r)^{-1}|}{\partial\, \mathrm{vec}\, T(r)}\Big]$$
$$= \Big(\frac{\partial\, \mathrm{vec}\, T(r)}{\partial r}\Big)^T \Big[-\mathrm{vec}\big(T(r)^{-T} \delta^g_z x^T T(r)^{-T}\big) - \mathrm{vec}\big(T(r)^{-T}\big)\Big]$$
$$\approx \Big(\frac{\partial\, \mathrm{vec}\, T(r)}{\partial r}\Big)^T \Big[-\mathrm{vec}\big(T(w)^{T} \delta^f_z x^T T(w)^{T}\big) - \mathrm{vec}\big(T(w)^{T}\big)\Big]$$
$$= -\delta^f_x \star z - \mathrm{flip}(w) \odot m \qquad (16)$$
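The second term of Equation 15 can be formed without ever building T(w). Below is our own compact sketch for a single same-padded convolution and a single sample, assuming p_Z = N(0, I); the sharing counts m are obtained here by differentiating a convolution of an all-ones image, which we believe matches the counting described above (the paper's own procedure is given in Section A.2):

```python
import torch
import torch.nn.functional as F

def snf_conv_grad_w(w, r, x, padding=1):
    """Approximate kernel gradient of Equation 15 for one sample x."""
    w = w.detach().requires_grad_()
    z = F.conv2d(x, w, padding=padding)
    delta_z = (-z).detach()                     # d log p_Z(z)/dz for N(0, I)

    # First term: the ordinary convolutional weight gradient, delta_z * x.
    data_term = torch.autograd.grad(z, w, grad_outputs=delta_z)[0]

    # Sharing counts m: how often each kernel element appears in T(w).
    ones = torch.ones_like(x)
    m = torch.autograd.grad(F.conv2d(ones, w, padding=padding).sum(), w)[0]

    # flip(r): swap in/out channel axes and mirror the spatial dimensions,
    # so that convolving with flip(r) realizes multiplication by T(r)^T.
    flip_r = r.transpose(0, 1).flip(-1, -2)

    return data_term + flip_r * m               # Equation 15

C, k, H = 3, 3, 8
w, r = torch.randn(C, C, k, k), torch.randn(C, C, k, k)
x = torch.randn(1, C, H, H)
print(snf_conv_grad_w(w, r, x).shape)  # torch.Size([3, 3, 3, 3])
```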
5. Experiments

In our first set of experiments, we train simple flows composed of the above layers and invertible non-linearities on the MNIST dataset. To evaluate our proposed approximate gradients, we compare to baseline models of the same architectures trained with the exact gradient. These architectures are designed to be small, so that it is still possible to compute the exact gradients quickly. We additionally compare with similar recent approaches to training normalizing flows with linear and convolutional layers, namely the relative gradient method of Gresele et al. (2020) and the convolution parametrizations of Hoogeboom et al. (2019; 2020).

To evaluate the scalability of our method, we perform a second set of experiments where we integrate self normalizing flows into the Glow framework (Kingma & Dhariwal, 2018) as a replacement for the 1×1 convolutional mixing layers. In this framework we train models on MNIST, CIFAR-10, and the downsized ImageNet 32x32 dataset. All experimental results are from our re-implementations for consistency. In some cases, due to differing hyper-parameters or errors in prior work, this yielded slightly different results than those published. We provide extended explanations for these discrepancies, as well as a link to our code repository, in the appendix (see Section A.3).

5.1. MNIST

On the MNIST dataset we train three classes of models: 2-layer fully connected (FC) models with smooth-leaky ReLU activations (Gresele et al., 2020), 9-layer convolutional (CNN) models with spline activations (Durkan et al., 2019), and 32-layer Glow models composed of affine coupling layers and 1×1 convolutional mixing layers (Kingma & Dhariwal, 2018). In all cases, we compare a self normalizing version with its exact gradient baseline.

Table 1. Negative log-likelihood in nats on the MNIST test set. Mean ± std. over 3 runs. Self normalizing flows (SNF) achieve comparable performance to their exact counterparts.

Model                            − log p_X(x)
Relative Grad. FC 2-Layer        1096.5 ± 0.5
Exact Gradient FC 2-Layer         947.6 ± 0.2
SNF FC 2-Layer (ours)             947.1 ± 0.2
Emerging Conv. 9-Layer            645.7 ± 3.6
SNF Conv. 9-Layer (ours)          638.6 ± 0.9
Conv. Exponential 9-Layer         638.1 ± 1.0
Exact Gradient Conv. 9-Layer      637.4 ± 0.2
Glow 2L-16K                       575.7 ± 0.8
SNF Glow 2L-16K (ours)            575.4 ± 1.4

As can be seen in Table 1, the models composed of self normalizing flow layers are nearly identical in performance to their exact gradient counterparts on the MNIST dataset. We see that the fully connected self normalizing model drastically outperforms the relative gradient method of Gresele et al. (2020), reaching the same performance as the exact gradient method. Additionally, we see that the self normalizing convolutional layer outperforms its constrained counterpart from Hoogeboom et al. (2019), likely due to the fact that the emerging convolution is unable to represent the 1×1 convolution explicitly. We hypothesize that the convolutional self normalizing flow model slightly under-performs the exact gradient method due to the convolutional-inverse constraint. We propose this constraint can be relaxed by using more complex inverse functions, potentially composed of multiple layers and non-linearities (see Section A.4), but leave this to future work. All training details can be found in the appendix (see Section A.3).

In Figure 5, the plot on the right shows that the qualitative convergence properties of the approximate gradient method are very similar to those of the exact gradient, eventually converging to nearly the same validation likelihood. However, as can be seen in the plot on the left, due to the reduced computational complexity, the approximate gradient method trains in less than half the time, and even more quickly to approximate convergence. This timing comparison is demonstrated more exactly in Table 3, where the time per training batch and time per sample is computed for all models presented in this work. As can be seen, the self normalizing flow models are the fastest of all presented models, with the exception of the relative gradient method, which appears to lag behind in likelihood performance.
From this table we also see that the relative improvements in speed are directly related to the portions of the network which are replaced with self normalizing components. Since only the 1×1 convolution is replaced in the Glow framework, there is only a slight speed increase to be had.

Finally, we quantitatively measure the quality of the gradient approximation by measuring the alignment of the directions of the approximate gradient and the exact gradient in Figure 3. Specifically, for models trained in the self normalizing framework, we measure the angular error between the approximate gradient and the true gradient at each training iteration, for each layer, and plot the average angular error over all layers for each training epoch. The shaded area denotes the standard deviation in angles. We make two observations from this figure. First, both the 2-layer FC model and the 9-layer CNN appear to have an initial slight divergence from the true gradient, but then quickly align to less than 1 degree of error, suggesting they are close approximations to the true gradient direction. Second, we observe that the average angular error of the approximate gradient is significantly larger for the convolutional model than it is for the fully connected model. When comparing this with the results in Table 1, we see that this could be contributing to the slightly lower performance of the self normalizing model when compared with the exact gradient methods. We again hypothesize this could be due to the convolutional inverse constraint, making the inverse approximation more challenging. We leave further exploration of this topic to future work, but believe that the performance could be ameliorated with improved inverse approximations, potentially achieved through more complex constrained optimization techniques.

5.2. CIFAR-10 and ImageNet 32x32

For large scale experiments, we incorporate self normalizing flow layers into the Glow framework and train the models on CIFAR-10 and ImageNet 32x32. Specifically, we use the same model architectures as those proposed in Kingma & Dhariwal (2018), with some slight changes to the optimization parameters as detailed in Section A.3. We observe that the self normalizing flow models are able to achieve competitive bits per dimension on both datasets (as seen in Table 2), while simultaneously training slightly faster than their exact gradient counterparts (as seen in Table 3).

Table 2. Bits per dimension for large scale Glow models on two larger scale natural image datasets. Mean ± std. over 3 runs. Self normalizing flows (SNF) achieve comparable performance to their exact counterparts, demonstrating that this method scales to large models (> 100 steps of flow). See Section A.3 for details.

Model       CIFAR-10        ImageNet 32x32
Glow        3.36 ± 0.002    4.12 ± 0.002
SNF Glow    3.37 ± 0.004    4.14 ± 0.007

Importantly, we experimented with values of λ in the set {1, 10, 100, 1000}, and chose λ = 1000 for all CIFAR-10 and ImageNet models due to increased stability during training. We observed a slight reduction in final likelihood performance as a result of such a large reconstruction weight, and believe this is a factor in the performance gap between the self normalizing and exact gradient models. We believe that with more tuning, or with a dynamically weighted constrained optimization method such as that presented in Platt & Barr (1988), the self normalizing model is likely to match the exact gradient model even more closely. Preliminary experiments in this direction are shown in Section A.5.
Table 3. Runtime comparison for the models presented in Tables 1 and 2. Hardware and implementation details are in Section A.3.

Model                            Dataset          Time / batch (ms)   Time / sample (ms)
Exact FC 2-Layer                 MNIST            44.9 ± 4.4          61.5 ± 5.8
Relative Gradient FC 2-Layer     MNIST            7.0 ± 0.4           69.2 ± 5.6
Self Normalizing FC 2-Layer      MNIST            18.7 ± 0.8          38.6 ± 3.1
Exact Conv. 9-Layer              MNIST            372.2 ± 24.5        241.6 ± 12.9
Emerging Conv. 9-Layer           MNIST            305.0 ± 14.5        71.7 ± 8.8
Conv. Exponential 9-Layer        MNIST            304.4 ± 11.9        84.2 ± 9.2
Self Normalizing Conv. 9-Layer   MNIST            212.5 ± 37.3        29.9 ± 6.3
Glow 2L-16K                      MNIST            583.4 ± 21.2        163.1 ± 21.8
Self Normalizing Glow 2L-16K     MNIST            476.4 ± 16.7        30.6 ± 2.2
Glow 3L-32K                      CIFAR-10         1841.3 ± 85.4       126.3 ± 13.9
Self Normalizing Glow 3L-32K     CIFAR-10         1761.2 ± 104.5      97.8 ± 12.9
Glow 3L-48K                      ImageNet 32x32   2397.2 ± 204.0      174.8 ± 16.7
Self Normalizing Glow 3L-48K     ImageNet 32x32   2047.9 ± 152.8      150.7 ± 20.8

Although it is clear that the Glow framework is not the optimal setting for the application of self normalizing layers, given that the determinant calculation of the 1×1 convolution is relatively quick to evaluate already, we present this work as a proof of scalability of our framework to models with greater than 100 steps of flow (as in the ImageNet 32x32 case), and to larger scale images.

6. Discussion

We see that the above framework yields an efficient update rule for flow-based models which appears to perform similarly to the exact gradient while taking significantly less time to train. However, in its current form, this approach is limited in a number of its own ways.

First, as described in Section 3.3, the evaluation of the exact log-likelihood is still expensive, requiring the computation of the exact log Jacobian determinant of a general function class. However, once a model is trained, the Jacobian determinants of the linear transformations only need to be computed once, and can then be re-used for all future likelihood evaluations, effectively amortizing the cost.

Second, there are no known optimization guarantees for our proposed model. Therefore, the model could converge to a sub-optimal trade-off between the negative log-likelihood and the reconstruction error, or even diverge if the inverse approximation is very poor. In practice, we observe that the reconstruction error stays very low for most models when initialized properly, and final likelihood values are only marginally impacted by the choice of λ. In future work, we intend to explore the possibility of augmented Lagrangian methods, the Modified Differential Method of Multipliers (Platt & Barr, 1988), and other constrained optimization techniques, which could provide better convergence guarantees.

We note one of the biggest constraints of the models presented here is that the inverse of the forward function may not always be given by a function of the same class, as is the case for convolution. Although not evaluated in this work, we note that the general framework proposed here would allow for more complex asymmetric self normalizing flow components, and we intend to evaluate these in future work.

Finally, as with all current exact likelihood flow methods, the dimensionality of the representation must stay consistent throughout the depth of the flow.
Currently, this problem has been approached with flow components such as coupling layers and variational data augmentation; however, these methods are either restrictive in their design or add significant noise. This is clearly one of the greatest architectural constraints for existing normalizing flows, and it remains so in this work.

7. Conclusion

In summary, we introduce Self Normalizing Flows, a new method to efficiently optimize normalizing flow layers. The method approximates the gradient of the log Jacobian determinant using learned inverses, allowing for the training of otherwise intractable normalizing flow architectures. We demonstrate that our method performs competitively with other models in the literature while simultaneously providing faster training and sampling.

References

Amari, S.-i. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

Behrmann, J., Grathwohl, W., Chen, R. T. Q., Duvenaud, D., and Jacobsen, J. Invertible residual networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 573–582. PMLR, 2019.

Bengio, Y. How auto-encoders could provide credit assignment in deep networks via target propagation. CoRR, abs/1407.7906, 2014.

Bengio, Y. Deriving differential target propagation from iterating approximate inverses, 2020.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19: Annual Conference on Neural Information Processing Systems 2006, NeurIPS 2006, Montréal, Canada, pp. 153–160, 2006.

Biewald, L. Experiment tracking with Weights and Biases, 2020. URL https://www.wandb.com/. Software available from wandb.com.

Bogachev, V. I., Kolesnikov, A. V., and Medvedev, K. V. Triangular transformations of measures. Sbornik: Mathematics, 196(3):309, 2005.

Cardoso, J.-F. and Laheld, B. H. Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44(12):3017–3030, 1996.

Chen, T. Q., Behrmann, J., Duvenaud, D., and Jacobsen, J. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 2019.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. CoRR, abs/1605.08803, 2016.

Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Neural spline flows. In Advances in Neural Information Processing Systems 32, pp. 7511–7522, 2019.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. Volume 9 of Proceedings of Machine Learning Research, pp. 249–256, Chia Laguna Resort, Sardinia, Italy, 13-15 May 2010. JMLR Workshop and Conference Proceedings.

Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.

Gresele, L., Fissore, G., Javaloy, A., Schölkopf, B., and Hyvärinen, A. Relative gradient optimization of the Jacobian term in unsupervised deep learning. CoRR, abs/2006.15090, 2020.

Gritsenko, A. A., Snoek, J., and Salimans, T. On the relationship between normalising flows and variational and denoising autoencoders.
In Workshop on Deep Generative Models for Highly Structured Data, at 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.

Hinton, G. E., Osindero, S., and Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

Hoogeboom, E., van den Berg, R., and Welling, M. Emerging convolutions for generative normalizing flows. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 2771–2780. PMLR, 2019.

Hoogeboom, E., Satorras, V. G., Tomczak, J. M., and Welling, M. The convolution exponential and generalized Sylvester flows. CoRR, abs/2006.01910, 2020.

Huang, C., Krueger, D., Lacoste, A., and Courville, A. C. Neural autoregressive flows. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML, 2018.

Jaini, P., Selby, K. A., and Yu, Y. Sum-of-squares polynomial flow. arXiv preprint arXiv:1905.02325, 2019.

Karami, M., Schuurmans, D., Sohl-Dickstein, J., Dinh, L., and Duckworth, D. Invertible convolutional flow. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS, 2019.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2014.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 10236–10245, 2018.

Kingma, D. P., Salimans, T., Józefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improving variational autoencoders with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, 2016.

Kosiorek, A., Sabour, S., Teh, Y. W., and Hinton, G. E. Stacked capsule autoencoders. In Advances in Neural Information Processing Systems 32, pp. 15512–15522, 2019.

Krämer, A., Köhler, J., and Noé, F. Training invertible linear layers through rank-one perturbations, 2020.

Lecun, Y. Learning processes in an asymmetric threshold network. In Disordered Systems and Biological Organization, Les Houches, France, pp. 233–240. Springer-Verlag, 1986.

Lee, D.-H., Zhang, S., Fischer, A., and Bengio, Y. Difference target propagation. In Proceedings of the 2015th European Conference on Machine Learning and Knowledge Discovery in Databases, Volume Part I, ECMLPKDD'15, pp. 498–515, 2015.

Linnainmaa, S. Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2):146–160, 1976. ISSN 1572-9125. doi: 10.1007/BF01931367.

Löwe, S., O'Connor, P., and Veeling, B. Putting an end to end-to-end: Gradient-isolated learning of representations. In Advances in Neural Information Processing Systems 32, pp. 3039–3051, 2019.

Lu, Y. and Huang, B. Woodbury transformations for deep generative flows. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.

Magnus, J. R. On the concept of matrix derivative. Journal of Multivariate Analysis, 101(9):2200–2206, 2010.

Marzouk, Y., Moselhy, T., Parno, M., and Spantini, A. An introduction to sampling via measure transport.
arXiv preprint arXiv:1602.05023, 2016.

Meulemans, A., Carzaniga, F. S., Suykens, J. A. K., Sacramento, J., and Grewe, B. F. A theoretical framework for target propagation, 2020.

Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2017.

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference, 2019.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035, 2019.

Perugachi-Diaz, Y., Tomczak, J. M., and Bhulai, S. Invertible DenseNets. CoRR, abs/2010.02125, 2020.

Platt, J. and Barr, A. Constrained differential optimization. In Neural Information Processing Systems, pp. 612–621. American Institute of Physics, 1988.

Rezende, D. J. and Viola, F. Taming VAEs, 2018.

Rippel, O. and Adams, R. P. High-dimensional probability estimation with deep density models, 2013.

Rudin, W. Real and Complex Analysis, volume 156. McGraw-Hill Book Company, 1987.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Tabak, E. G. and Turner, C. V. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.

van den Berg, R., Hasenclever, L., Tomczak, J. M., and Welling, M. Sylvester normalizing flows for variational inference. In Globerson, A. and Silva, R. (eds.), Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI, pp. 393–402. AUAI Press, 2018.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pp. 1096–1103, New York, NY, USA, 2008. Association for Computing Machinery.

Werbos, P. J. Applications of advances in nonlinear sensitivity analysis. In Drenick, R. F. and Kozin, F. (eds.), System Modeling and Optimization, pp. 762–770, Berlin, Heidelberg, 1982. Springer Berlin Heidelberg. ISBN 978-3-540-39459-4.