# Invertible DenseNets with Concatenated LipSwish

Yura Perugachi-Diaz (Vrije Universiteit Amsterdam, y.m.perugachidiaz@vu.nl), Jakub M. Tomczak (Vrije Universiteit Amsterdam, j.m.tomczak@vu.nl), Sandjai Bhulai (Vrije Universiteit Amsterdam, s.bhulai@vu.nl)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

We introduce Invertible Dense Networks (i-DenseNets), a more parameter-efficient extension of Residual Flows. The method relies on an analysis of the Lipschitz continuity of the concatenation in DenseNets, where we enforce invertibility of the network by satisfying the Lipschitz constraint. Furthermore, we propose a learnable weighted concatenation, which not only improves the model performance but also indicates the importance of the concatenated weighted representation. Additionally, we introduce the Concatenated LipSwish as an activation function, for which we show how to enforce the Lipschitz condition and which boosts performance. The new architecture, i-DenseNet, outperforms Residual Flow and other flow-based models on density estimation evaluated in bits per dimension, under an equal parameter budget. Moreover, we show that the proposed model outperforms Residual Flows when trained as a hybrid model, where the model is both generative and discriminative.

## 1 Introduction

Neural networks are widely used to parameterize non-linear models in supervised learning tasks such as classification. In addition, they are utilized to build flexible density estimators of the true distribution of the observed data [25, 33]. The resulting deep density estimators, also called deep generative models, can be further used to generate realistic-looking images that are hard to separate from real ones, to detect adversarial attacks [9, 17], and for hybrid modeling [27], which has the property of both predicting a label (classifying) and generating data.

Many deep generative models are trained by maximizing the (log-)likelihood function, and their architectures come in different designs. For instance, causal convolutional neural networks are used to parameterize autoregressive models [28, 29], and various neural networks can be utilized in Variational Auto-Encoders [19, 32]. Another group of likelihood-based deep density estimators, flow-based models (or flows), consists of invertible neural networks, since these compute the likelihood through the change-of-variables formula [31, 37, 36]. Whether the likelihood of a flow-based model can be computed exactly or must be approximated depends on the design of the transformation layer and the tractability of the Jacobian determinant. Many flow-based models formulate transformations that are invertible by construction and whose Jacobian is tractable [3, 6-8, 21, 30, 31, 38].

Recently, Behrmann et al. [2] proposed a different approach, namely deep residual blocks as transformation layers. The deep residual networks (ResNets) of [12] are known for their successes in supervised learning. In a ResNet block, the input of the block is added to its output, which forms the input for the next block. Since ResNets are not necessarily invertible, Behrmann et al. [2] enforce the Lipschitz constant of the transformation to be smaller than 1 (i.e., it becomes a contraction), which allows applying an iterative procedure to invert the network. Furthermore, Chen et al. [4] proposed Residual Flows, an improvement of i-ResNets that uses an unbiased estimator for the logarithm of the Jacobian determinant.
Figure 1: A schematic representation of (a) a residual block and (b) a dense block. The pink part in (b) expresses a $1 \times 1$ convolution to reduce the dimension of the last dense layer. $W_i$ denotes the (convolutional) layer at step $i$, which satisfies $\|W_i\|_2 < 1$.

In supervised learning, an architecture that uses fewer parameters and is even more powerful than the deep residual network is the Densely Connected Convolutional Network (DenseNet), which was first presented in [15]. Contrary to a ResNet block, a DenseNet layer consists of a concatenation of the input with the output. The network showed significant improvements on recognition tasks on benchmark datasets such as CIFAR10, SVHN, and ImageNet, using fewer computations and fewer parameters than ResNets while performing at a similar level.

In this work, we extend Residual Flows [2, 4] and use densely connected blocks (DenseBlocks) as the residual layer. First, we introduce Invertible Dense Networks (i-DenseNets), and we show that we can derive a bound on the Lipschitz constant to create an invertible flow-based model. Furthermore, we propose the Concatenated LipSwish (CLipSwish) as an activation function and derive a stronger Lipschitz bound. The CLipSwish function preserves more signal than the LipSwish activation function. Finally, we demonstrate how i-DenseNets can be efficiently trained as a generative model, outperforming Residual Flows and other flow-based models under an equal parameter budget.

## 2 Background

**Flow-based models** Let us consider a vector of observable variables $x \in \mathbb{R}^d$ and a vector of latent variables $z \in \mathbb{R}^d$. We define a bijective function $f : \mathbb{R}^d \rightarrow \mathbb{R}^d$ that maps a latent variable to a datapoint, $x = f(z)$. Since $f$ is invertible, we define its inverse as $F = f^{-1}$. We use the change-of-variables formula to compute the likelihood of a datapoint $x$ after taking the logarithm, that is:

$$\ln p_X(x) = \ln p_Z(z) + \ln |\det J_F(x)|, \quad (1)$$

where $p_Z(z)$ is a base distribution (e.g., the standard Gaussian) and $J_F(x)$ is the Jacobian of $F$ at $x$. The bijective transformation is typically constructed as a sequence of $K$ invertible transformations, $x = f_K \circ \dots \circ f_1(z)$, and a single transformation $f_k$ is referred to as a flow [31]. The change-of-variables formula allows evaluating the data in a tractable manner. Moreover, the flows are trained using the log-likelihood objective, where the Jacobian determinant compensates for the change of volume of the invertible transformations.

**Residual flows** Behrmann et al. [2] construct an invertible ResNet layer that is only constrained in its Lipschitz continuity. A ResNet layer is defined as $F(x) = x + g(x)$, where $g$ is modeled by a (convolutional) neural network and $F$ represents the ResNet layer (see Figure 1(a)), which is in general not invertible. However, $g$ is constructed in such a way that its Lipschitz constant is strictly lower than 1, $\mathrm{Lip}(g) < 1$, by using the spectral normalization of [10, 26]:

$$\mathrm{Lip}(g) < 1, \quad \text{if } \|W_i\|_2 < 1 \text{ for all } i, \quad (2)$$

where $\|\cdot\|_2$ is the $\ell_2$ matrix norm. Then $\mathrm{Lip}(g) = K < 1$ and $\mathrm{Lip}(F) < 1 + K$. Only in this specific case does the Banach fixed-point theorem hold, and the ResNet layer $F$ has a unique inverse. As a result, the inverse can be approximated by fixed-point iterations. Estimating the log-determinant, especially in high-dimensional spaces, is computationally intractable due to expensive computations.
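To make the inversion argument concrete, the sketch below inverts a toy layer $F(x) = x + g(x)$ with the fixed-point iteration $x \leftarrow y - g(x)$. It is a minimal NumPy illustration under our own choices: a random linear map rescaled to spectral norm 0.9 stands in for the (convolutional) network $g$, and the function names are ours rather than taken from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy contractive residual branch g: a linear map rescaled so that
# its spectral norm (largest singular value) is 0.9 < 1, i.e. Lip(g) < 1.
W = rng.normal(size=(4, 4))
W = 0.9 * W / np.linalg.norm(W, ord=2)
g = lambda x: W @ x

def forward(x):
    """Residual layer F(x) = x + g(x)."""
    return x + g(x)

def inverse(y, n_iter=200):
    """Invert F by the fixed-point iteration x <- y - g(x).

    Because Lip(g) < 1, the map x -> y - g(x) is a contraction and the
    Banach fixed-point theorem guarantees convergence to the unique x
    with x + g(x) = y.
    """
    x = y.copy()          # y is a reasonable starting point
    for _ in range(n_iter):
        x = y - g(x)
    return x

x = rng.normal(size=4)
y = forward(x)
x_rec = inverse(y)
print(np.max(np.abs(x - x_rec)))  # close to zero: the iteration recovers x
```

With $\mathrm{Lip}(g) = 0.9$, the reconstruction error shrinks roughly by a factor 0.9 per iteration, which is why a few hundred iterations suffice in this toy setting.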
Since ResNet blocks have a constrained Lipschitz constant, the log-likelihood estimation of Equation (1) can be transformed into a version where the logarithm of the Jacobian determinant is cheaper to compute, tractable, and approximated with guaranteed convergence [2]:

$$\ln p(x) = \ln p(f(x)) + \mathrm{tr}\left( \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k} \, [J_g(x)]^k \right), \quad (3)$$

where $J_g(x)$ is the Jacobian of $g$ at $x$, which satisfies $\|J_g\|_2 < 1$. The Skilling-Hutchinson trace estimator [35, 16] is used to compute the trace at a lower cost than fully computing the trace of the Jacobian. Residual Flows [4] use an improved method to estimate the power series at an even lower cost, with an unbiased estimator based on the "Russian roulette" of [18]. Intuitively, the method estimates the infinite sum of the power series by evaluating a finite number of terms. In return, this leads to computing fewer terms than invertible residual networks require. To avoid derivative saturation, which occurs when the second derivative is zero in large regions, the LipSwish activation is proposed.

## 3 Invertible Dense Networks

In this section, we propose Invertible Dense Networks, which use a DenseBlock as the residual layer. We show how the network can be parameterized as a flow-based model and refer to the resulting model as i-DenseNets. The code can be retrieved from: https://github.com/yperugachidiaz/invertible_densenets.

### 3.1 Dense blocks

The main component of the proposed flow-based model is a DenseBlock, defined as a function $F : \mathbb{R}^d \rightarrow \mathbb{R}^d$ with $F(x) = x + g(x)$, where $g$ consists of dense layers $\{h_i\}_{i=1}^{n}$. Note that an important modification to make the model invertible is to output $x + g(x)$, whereas a standard DenseBlock would only output $g(x)$. The function $g$ is expressed as follows:

$$g(x) = W_{n+1} \circ h_n \circ \dots \circ h_1(x), \quad (4)$$

where $W_{n+1}$ represents a $1 \times 1$ convolution that matches the output size $\mathbb{R}^d$. A layer $h_i$ consists of two parts concatenated to each other. The upper part is a copy of the input signal. The lower part consists of the transformed input, where the transformation is a multiplication of (convolutional) weights $W_i$ with the input signal, followed by a non-linearity $\phi$ with $\mathrm{Lip}(\phi) \leq 1$, such as ReLU, ELU, LipSwish, or tanh. As an example, a dense layer $h_2$ can be composed as follows:

$$h_2(h_1(x)) = \begin{bmatrix} h_1(x) \\ \phi(W_2 h_1(x)) \end{bmatrix}. \quad (5)$$

In Figure 1, we schematically outline a residual block (Figure 1(a)) and a dense block (Figure 1(b)). We refer to the number of dense layers in a DenseBlock as the concatenation depth, and to the channel growth size of the transformation in the lower part as the growth.
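The following PyTorch sketch mirrors the structure of Equations (4) and (5) with fully connected dense layers; it is an illustrative reading of the block, not the paper's implementation. In particular, `torch.nn.utils.spectral_norm` is used as a stand-in for the spectral normalization of the lower path, ReLU stands in for the 1-Lipschitz non-linearity $\phi$, and the per-layer scaling of Section 3.2 and the learnable concatenation of Section 3.3 are omitted.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class DenseLayer(nn.Module):
    """One dense layer h_i: concatenate the input (upper part) with a
    transformed input phi(W_i x) (lower part), as in Equation (5)."""

    def __init__(self, in_features, growth):
        super().__init__()
        # spectral_norm keeps ||W_i||_2 close to 1; the paper's code uses its
        # own spectral normalization with a strict bound below 1.
        self.lower = spectral_norm(nn.Linear(in_features, growth, bias=False))
        self.phi = nn.ReLU()  # any 1-Lipschitz non-linearity

    def forward(self, x):
        return torch.cat([x, self.phi(self.lower(x))], dim=1)

class DenseBlock(nn.Module):
    """g(x) = W_{n+1} o h_n o ... o h_1(x), Equation (4); F(x) = x + g(x)."""

    def __init__(self, dim, depth=3, growth=16):
        super().__init__()
        layers, width = [], dim
        for _ in range(depth):
            layers.append(DenseLayer(width, growth))
            width += growth                      # concatenation grows the width
        self.layers = nn.Sequential(*layers)
        # Final map back to the input dimension (the 1x1 convolution in the paper).
        self.out = spectral_norm(nn.Linear(width, dim, bias=False))

    def forward(self, x):
        return x + self.out(self.layers(x))      # residual form F(x) = x + g(x)

x = torch.randn(8, 32)
print(DenseBlock(32)(x).shape)  # torch.Size([8, 32])
```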
### 3.2 Constraining the Lipschitz constant

If we enforce the function $g$ to satisfy $\mathrm{Lip}(g) < 1$, then the DenseBlock $F$ is invertible since the Banach fixed-point theorem holds. As a result, the inverse can be approximated in the same manner as in [2]. To satisfy $\mathrm{Lip}(g) < 1$, we need to enforce $\mathrm{Lip}(h_i) < 1$ for all $n$ layers, since $\mathrm{Lip}(g) \leq \mathrm{Lip}(h_{n+1}) \cdots \mathrm{Lip}(h_1)$, where $h_{n+1}$ denotes the final $1 \times 1$ convolution $W_{n+1}$. Therefore, we first need to determine the Lipschitz constant of a dense layer $h_i$. For the full derivation, see Appendix A. We know that a function $f$ is $K$-Lipschitz if for all points $v$ and $w$ the following holds:

$$d_Y(f(v), f(w)) \leq K \, d_X(v, w), \quad (6)$$

where we assume that the distance metrics $d_X = d_Y = d$ are chosen to be the $\ell_2$-norm. Further, let two functions $f_1$ and $f_2$ be concatenated in $h$:

$$h(v) = \begin{bmatrix} f_1(v) \\ f_2(v) \end{bmatrix}, \qquad h(w) = \begin{bmatrix} f_1(w) \\ f_2(w) \end{bmatrix}, \quad (7)$$

where $f_1$ is the upper part and $f_2$ is the lower part. We can now find an analytical form that expresses a limit on $K$ for the dense layer in the form of Equation (6):

$$d(h(v), h(w))^2 = d(f_1(v), f_1(w))^2 + d(f_2(v), f_2(w))^2 \leq (K_1^2 + K_2^2) \, d(v, w)^2, \quad (8)$$

where the Lipschitz constant of $h$ consists of two parts, namely $\mathrm{Lip}(f_1) = K_1$ and $\mathrm{Lip}(f_2) = K_2$. Therefore, the Lipschitz constant of layer $h$ can be expressed as:

$$\mathrm{Lip}(h) = \sqrt{K_1^2 + K_2^2}. \quad (9)$$

With the spectral normalization of Equation (2), we can enforce the (convolutional) weights $W_i$ to be at most 1-Lipschitz. Hence, for all $n$ dense layers we apply spectral normalization to the lower part, which locally enforces $\mathrm{Lip}(f_2) = K_2 < 1$. Further, since we enforce each layer $h_i$ to be at most 1-Lipschitz and we start with $h_1$, where $f_1(x) = x$, we know that $\mathrm{Lip}(f_1) = 1$. Therefore, the Lipschitz constant of an entire layer is at most $\mathrm{Lip}(h) < \sqrt{2}$, and dividing by this limit enforces each layer to be at most 1-Lipschitz.

### 3.3 Learnable weighted concatenation

Figure 2: Range of the possible normalized parameters $\hat{\eta}_1$ and $\hat{\eta}_2$.

We have shown that we can enforce an entire dense layer to have $\mathrm{Lip}(h_i) < 1$ by applying spectral normalization to the (convolutional) weights $W_i$ and then dividing the layer $h_i$ by $\sqrt{2}$. Although learning a weighting between the upper and lower part would barely affect a standard dense layer, it matters in this case because the layers are regularized to be 1-Lipschitz. To optimize and learn the importance of the concatenated representations, we introduce learnable parameters $\eta_1$ and $\eta_2$ for, respectively, the upper and lower part of each layer $h_i$. Since the upper and lower parts of the layer can each be at most 1-Lipschitz, multiplication by these factors results in functions that are at most $\eta_1$-Lipschitz and $\eta_2$-Lipschitz. As indicated by Equation (9), the layer is then at most $\sqrt{\eta_1^2 + \eta_2^2}$-Lipschitz. Dividing by this factor results in a bound that is at most 1-Lipschitz. In practice, we initialize $\eta_1$ and $\eta_2$ at value 1 and, during training, pass them through a softplus function to avoid negative values. The normalized parameters lie in the range $\hat{\eta}_1, \hat{\eta}_2 \in [0, 1]$ and can be expressed on the unit circle as shown in Figure 2. In the special case where $\eta_1 = \eta_2$, the normalized parameters are $\hat{\eta}_1 = \hat{\eta}_2 = \tfrac{1}{\sqrt{2}}$. This case corresponds to the situation in Section 3.2 where the concatenation is not learned. An additional advantage is that the normalized $\hat{\eta}_1$ and $\hat{\eta}_2$ express the importance of the upper and lower signal. For example, when $\hat{\eta}_1 > \hat{\eta}_2$, the input signal is of more importance than the transformed signal.
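A minimal sketch of the learnable weighted concatenation, under our reading of Section 3.3: raw parameters are passed through a softplus so that $\eta_1, \eta_2$ stay positive, initialized such that $\eta_1 = \eta_2 = 1$, and the concatenated output is divided by $\sqrt{\eta_1^2 + \eta_2^2}$ so that, by Equation (9), a layer built from 1-Lipschitz parts remains 1-Lipschitz. The class and variable names are hypothetical, not taken from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedConcat(nn.Module):
    """Concatenate the upper (identity) and lower (transformed) parts with
    learnable weights eta_1, eta_2, normalized to keep the layer 1-Lipschitz."""

    def __init__(self):
        super().__init__()
        # Raw parameters chosen so that softplus(raw) = 1 at initialization,
        # i.e. eta_1 = eta_2 = 1 (the un-learned case of Section 3.2).
        init = torch.log(torch.expm1(torch.tensor(1.0)))   # softplus^{-1}(1)
        self.raw = nn.Parameter(init.repeat(2))

    def forward(self, upper, lower):
        eta = F.softplus(self.raw)                 # eta_1, eta_2 > 0
        scale = torch.sqrt(eta[0] ** 2 + eta[1] ** 2)
        # Each part is at most eta_i-Lipschitz, so by Equation (9) the
        # concatenation is at most sqrt(eta_1^2 + eta_2^2)-Lipschitz;
        # dividing by that factor restores a 1-Lipschitz bound.
        return torch.cat([eta[0] * upper, eta[1] * lower], dim=1) / scale

    def normalized(self):
        """Normalized importances (eta_1_hat, eta_2_hat) on the unit circle."""
        eta = F.softplus(self.raw)
        return eta / torch.sqrt((eta ** 2).sum())

wc = WeightedConcat()
u, l = torch.randn(4, 8), torch.randn(4, 16)
print(wc(u, l).shape, wc.normalized())  # torch.Size([4, 24]), approx (0.707, 0.707)
```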
### 3.4 CLipSwish

When a deep neural network is bounded to be 1-Lipschitz, in practice each consecutive layer reduces the Jacobian norm. As a result, the Jacobian norm of the entire network becomes much smaller than 1 and expressive power is lost. This is known as gradient-norm attenuation [1, 24]. The problem arises in activation functions in regions where the derivative is small, such as the left tail of the ReLU and the LipSwish. Non-linearities $\phi$ modeled in i-DenseNets are required to be at most 1-Lipschitz and thus face gradient-norm attenuation issues. For this reason we introduce a new activation function that mitigates these issues. Recall that Residual Flows use the LipSwish activation function [4]:

$$\mathrm{LipSwish}(x) = x \, \sigma(\beta x) / 1.1, \quad (10)$$

where $\sigma(\beta x) = 1/(1 + \exp(-\beta x))$ is the sigmoid and $\beta$ is a learnable constant, initialized at 0.5 and passed through a softplus to be strictly positive. This activation function not only satisfies $\mathrm{Lip}(\mathrm{LipSwish}) = 1$ but also resolves the derivative saturation problem [4].

However, the LipSwish function has large regions on the negative axis where its derivative is close to zero. Therefore, we propose the Concatenated LipSwish (CLipSwish), which concatenates two LipSwish functions with inputs $x$ and $-x$. This is a concatenated activation function as in [34], but using a LipSwish instead of a ReLU. Intuitively, even if an input lies in the tail of the upper part, it will have a larger derivative in the lower part and thus suffer less from gradient-norm attenuation. Since using CLipSwish increases the channel growth, and to stay in line with the channel growth that non-concatenated activation functions use, we use a lower channel growth when using CLipSwish.

To utilize the CLipSwish, we need to derive the Lipschitz continuity of the activation function $\Phi$ defined below and enforce it to be 1-Lipschitz. We could use the result obtained in Equation (9) to obtain a $\sqrt{2}$-bound; however, by using knowledge about the activation function $\Phi$, we can derive a tighter bound of $1.004 < \sqrt{2}$. In general, a tighter bound is preferred since more expressive power is preserved in the network. To start with, we define the function $\Phi : \mathbb{R} \rightarrow \mathbb{R}^2$ for a point $x$ as:

$$\Phi(x) = \begin{bmatrix} \phi_1(x) \\ \phi_2(x) \end{bmatrix} = \begin{bmatrix} \mathrm{LipSwish}(x) \\ \mathrm{LipSwish}(-x) \end{bmatrix}, \qquad \mathrm{CLipSwish}(x) = \Phi(x) / \mathrm{Lip}(\Phi), \quad (11)$$

where the LipSwish is given by Equation (10) and the derivative of $\Phi(x)$ exists. To find $\mathrm{Lip}(\Phi)$, we use the fact that for a differentiable, Lipschitz-bounded function $\Phi$ (with respect to the $\ell_2$-norm), the following identity holds:

$$\mathrm{Lip}(\Phi) = \sup_x \|J_\Phi(x)\|_2, \quad (12)$$

where $J_\Phi(x)$ is the Jacobian of $\Phi$ at $x$ and $\|\cdot\|_2$ represents the induced matrix norm, which is equal to the spectral norm of the matrix. Rewriting the spectral norm results in solving $\det(J_\Phi(x)^T J_\Phi(x) - \lambda I_n) = 0$, which gives us the final result (see Appendix A.3.1 for the full derivation):

$$\sup_x \|J_\Phi(x)\|_2 = \sup_x \sigma_{\max}(J_\Phi(x)) = \sup_x \sqrt{\phi_1'(x)^2 + \phi_2'(x)^2}, \quad (13)$$

where $\sigma_{\max}(\cdot)$ is the largest singular value. Now $\mathrm{Lip}(\Phi)$ is the upper bound of the CLipSwish and equals the supremum $\mathrm{Lip}(\Phi) = \sup_x \|J_\Phi(x)\|_2 \leq 1.004$, for all values of $\beta$. This can be computed numerically by any solver by determining the extreme values of Equation (13). Therefore, dividing $\Phi(x)$ by its upper bound 1.004 results in $\mathrm{Lip}(\mathrm{CLipSwish}) = 1$. The generalization to higher dimensions can be found in Appendix A.3.2. An analysis of the preservation of signal for the (CLip)Swish activations by simulation can be found in Section 5.1.
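As a quick numerical sanity check of the bound in Equation (13), the snippet below evaluates $\sqrt{\phi_1'(x)^2 + \phi_2'(x)^2}$ for the LipSwish pair $(x, -x)$ on a dense grid and reports the maximum for several values of $\beta$; the values stay close to 1.004. This grid search is our own illustration, not the solver-based derivation referred to in Appendix A.3.1.

```python
import numpy as np

def lipswish(x, beta):
    """LipSwish(x) = x * sigmoid(beta * x) / 1.1, Equation (10)."""
    return x / (1.0 + np.exp(-beta * x)) / 1.1

def lipswish_grad(x, beta):
    """Derivative of LipSwish with respect to x."""
    s = 1.0 / (1.0 + np.exp(-beta * x))
    return (s + beta * x * s * (1.0 - s)) / 1.1

def clipswish(x, beta, lip=1.004):
    """CLipSwish(x) = [LipSwish(x), LipSwish(-x)] / Lip(Phi), Equation (11)."""
    return np.stack([lipswish(x, beta), lipswish(-x, beta)]) / lip

print(clipswish(np.array([1.0, -1.0]), beta=0.5))  # concatenated pair, rescaled

# Numerically approximate Lip(Phi) = sup_x sqrt(phi_1'(x)^2 + phi_2'(x)^2).
x = np.linspace(-20.0, 20.0, 2_000_001)
for beta in [0.2, 0.5, 1.0, 5.0]:
    jac_norm = np.sqrt(lipswish_grad(x, beta) ** 2 + lipswish_grad(-x, beta) ** 2)
    print(f"beta={beta}: sup_x ||J_Phi(x)||_2 ~ {jac_norm.max():.4f}")
# Each supremum is close to 1.004, so dividing Phi by 1.004 gives Lip(CLipSwish) = 1.
```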
## 4 Experiments

Figure 3: Density estimation for smaller architectures of Residual Flows and i-DenseNets, trained on 2-dimensional toy data.

To make a clear comparison between the performance of Residual Flows and i-DenseNets, we train both models on 2-dimensional toy data and on high-dimensional image data: CIFAR10 [22] and ImageNet32 [5]. Since we have a constrained computational budget, we use smaller architectures for the exploration of the network architectures. An in-depth analysis of different settings and experiments can be found in Section 5. For density estimation, we run the full model with the best settings for 1,000 epochs on CIFAR10 and 20 epochs on ImageNet32, where we use single-seed results following [2, 4, 20], due to little fluctuation in performance. In all cases, we use the density estimation results of the Residual Flow and other flow-based models with uniform dequantization to create a fair comparison and benchmark these against i-DenseNets. We train i-DenseNets with learnable weighted concatenation (LC) and CLipSwish as the activation function, and utilize a similar number of parameters for i-DenseNets as for Residual Flows; this can be found in Table 2. i-DenseNets use slightly fewer parameters than the Residual Flow. A detailed description of the architectures can be found in Appendix B. To speed up training, we use 4 GPUs.

### 4.1 Toy data

Table 1: Negative log-likelihood results on test data in nats (toy data). i-DenseNets with and without LC are compared with the Residual Flow.

| Model | 2 circles | checkerboard | 2 moons |
|---|---|---|---|
| Residual Flow | 3.44 | 3.81 | 2.60 |
| i-DenseNet | 3.32 | 3.68 | 2.39 |
| i-DenseNet+LC | 3.30 | 3.66 | 2.39 |

We start by testing i-DenseNets and Residual Flows on toy data, where we use smaller architectures: instead of 100 flow blocks, we use 10 flow blocks. We train both models for 50,000 iterations and, at the end of training, visualize the learned distributions. The learned density distributions are presented in Figure 3. We observe that Residual Flows are capable of capturing high-probability areas. However, they have trouble learning the low-probability regions of the two circles and the two moons. i-DenseNets are capable of capturing all regions of the datasets. The good performance of i-DenseNets is also reflected in a better negative log-likelihood (see Table 1).

### 4.2 Density estimation

Table 2: The number of parameters of Residual Flows and i-DenseNets for the full models as trained in Chen et al. [4]. In brackets, the number of parameters of the smaller models.

| Model / Data | CIFAR10 | ImageNet32 |
|---|---|---|
| Residual Flow | 25.2M (8.7M) | 47.1M |
| i-DenseNet | 24.9M (8.7M) | 47.0M |

We test the full i-DenseNet models with LC and CLipSwish activation. To utilize a similar number of parameters as the Residual Flow trained on CIFAR10, which uses 3 scale levels with 16 flow blocks per scale, we use the same number of blocks and set the DenseNet growth to 172 with a depth of 3. The Residual Flow trained on ImageNet32 uses 3 scale levels with 32 flow blocks per scale; with the same number of blocks, we again set the DenseNet growth to 172 and the depth to 3 to utilize a similar number of parameters. A DenseNet depth of 3 proved to be the best setting for the smaller architectures; see the analysis in Section 5.

Table 3: Density estimation results in bits per dimension for models using uniform dequantization. In brackets, results for the smaller Residual Flow and i-DenseNet run for 200 epochs.

| Model | CIFAR10 | ImageNet32 |
|---|---|---|
| Real NVP [8] | 3.49 | 4.28 |
| Glow [20] | 3.35 | 4.09 |
| FFJORD [11] | 3.40 | - |
| Flow++ [13] | 3.29 | - |
| Conv SNF [14] | 3.29 | - |
| i-ResNet [2] | 3.45 | - |
| Residual Flow [4] | 3.28 (3.42) | 4.01 |
| i-DenseNet | 3.25 (3.37) | 3.98 |

The density estimation on CIFAR10 and ImageNet32 is benchmarked against the results of Residual Flows and other comparable flow-based models, retrieved from Chen et al. [4]. We measure performance in bits per dimension (bpd). The results can be found in Table 3. We find that i-DenseNets outperform Residual Flows and other comparable flow-based models on all considered datasets in terms of bpd. On CIFAR10, i-DenseNet achieves 3.25 bpd, against 3.28 bpd for the Residual Flow. On ImageNet32, i-DenseNet achieves 3.98 bpd against 4.01 bpd for the Residual Flow. Samples of the i-DenseNet models can be found in Figure 4: samples of the model trained on CIFAR10 are presented in Figure 4(b) and samples of the model trained on ImageNet32 in Figure 4(d). For more unconditional samples, see Appendix C.1.
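For reference, bits per dimension relate to the log-likelihood of Equation (1) in the usual way: the negative log-likelihood in nats is divided by the number of dimensions and by $\ln 2$. A small helper with hypothetical numbers for a $3 \times 32 \times 32$ image:

```python
import math

def bits_per_dim(nll_nats, num_dims):
    """Convert a per-image negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2))

# Hypothetical example for a 3x32x32 CIFAR10 image (3072 dimensions):
print(round(bits_per_dim(nll_nats=6922.0, num_dims=3 * 32 * 32), 2))  # ~3.25
```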
Note that this work does not compare against flow-based models using variational dequantization. Instead, we focus on extending and making a fair comparison with Residual Flows which, like other flow-based models, use uniform dequantization. For reference, note that Flow++ [13] with variational dequantization obtains 3.08 bpd on CIFAR10 and 3.86 bpd on ImageNet32, which is better than its model with uniform dequantization, which achieves 3.29 bpd on CIFAR10.

### 4.3 Hybrid modeling

Besides density estimation, we also experiment with hybrid modeling [27]. We train the joint distribution $p(x, y) = p(x)\,p(y|x)$, where $p(x)$ is modeled with a generative model and $p(y|x)$ with a classifier that uses the features of the image transformed onto the latent space.

Figure 4: Real images and samples of CIFAR10 and ImageNet32 data: (a) real CIFAR10 images, (b) samples of i-DenseNets trained on CIFAR10, (c) real ImageNet32 images, (d) samples of i-DenseNets trained on ImageNet32.

Due to the different dimensionalities of $y$ and $x$, the emphasis of the likelihood objective is likely to be on $p(x)$, and a scaling factor for a weighted maximum-likelihood objective is suggested, $\mathbb{E}_{x,y \sim \mathcal{D}}[\log p(y|x) + \lambda \log p(x)]$, where $\lambda$ is the scaling factor expressing the trade-off between the generative and discriminative parts. Unlike [27], where a linear layer is integrated on top of the latent representation, we use the architecture of [4], where a set of features is obtained after every scale level; these are concatenated and followed by a linear softmax classifier. We compare our experiments with the results of [4], where Residual Flows, coupling blocks [7], and $1 \times 1$ convolutions [20] are evaluated.

Table 4: Results of hybrid modeling on CIFAR10. Arrows indicate whether low or high values are preferred. Results are averaged over the last 5 epochs.

| Model \ Evaluation | λ = 0: Acc ↑ | λ = 1/D: Acc ↑ | λ = 1/D: bpd ↓ | λ = 1: Acc ↑ | λ = 1: bpd ↓ |
|---|---|---|---|---|---|
| Coupling | 89.77% | 87.58% | 4.30 | 67.62% | 3.54 |
| + 1×1 conv | 90.82% | 87.96% | 4.09 | 67.38% | 3.47 |
| Residual Blocks (full) | 91.78% | 90.47% | 3.62 | 70.32% | 3.39 |
| Dense Blocks (full) | 92.40% | 90.79% | 3.49 | 75.67% | 3.31 |

Table 4 presents the hybrid modeling results on CIFAR10, where we use $\lambda \in \{0, 1/D, 1\}$. We run the three models for 400 epochs and note that the model with $\lambda = 1$ had not fully converged in either accuracy or bits per dimension after training. The classifier model ($\lambda = 0$) obtains a converged accuracy after around 250 epochs. This is in line with the accuracy of the model with $\lambda = 1/D$, yet in terms of bits per dimension that model had not fully converged after 400 epochs. This indicates that even though the accuracy does not improve further, the model keeps optimizing the bits per dimension, which leaves room for future research. The results in Table 4 show the average over the last 5 epochs. We find that Dense Blocks outperform Residual Blocks for all $\lambda$ settings. Interestingly, Dense Blocks have the biggest impact when using no penalty ($\lambda = 1$) compared to the other models. We obtain an accuracy of 75.67% with 3.31 bpd, compared to 70.32% accuracy and 3.39 bpd for Residual Blocks, indicating that Dense Blocks improve classification performance by more than 5 percentage points. In general, the Dense Block hybrid model outperforms Real NVP, Glow, FFJORD, and i-ResNet in bits per dimension (see Appendix C.2 for samples of the hybrid models).
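A sketch of the weighted maximum-likelihood objective $\mathbb{E}_{x,y \sim \mathcal{D}}[\log p(y|x) + \lambda \log p(x)]$ used as a training loss. The `classifier_logits` and `log_px` inputs are hypothetical stand-ins for the classifier head and the flow likelihood; the actual feature extraction (features collected after every scale level, followed by a linear softmax classifier) follows [4] and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(classifier_logits, log_px, labels, lam):
    """Negative weighted hybrid objective: -(log p(y|x) + lam * log p(x)).

    classifier_logits: (batch, n_classes) logits for p(y|x)
    log_px:            (batch,) log-likelihoods log p(x) from the flow
    lam:               trade-off between discriminative and generative terms,
                       e.g. 0, 1/D, or 1 as in Table 4
    """
    log_pyx = -F.cross_entropy(classifier_logits, labels, reduction="mean")
    return -(log_pyx + lam * log_px.mean())

# Hypothetical batch, just to show the call:
logits = torch.randn(16, 10)
log_px = -3.3 * 3072 * torch.log(torch.tensor(2.0)) * torch.ones(16)  # ~3.3 bpd
labels = torch.randint(0, 10, (16,))
print(hybrid_loss(logits, log_px, labels, lam=1.0 / 3072))
```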
## 5 Analysis and future work

To get a better understanding of i-DenseNets, we perform additional experiments, explore different settings, analyze the results of the model, and discuss future work. We use smaller architectures for these experiments due to a limited computational budget. For Residual Flows and i-DenseNets we use 3 scale levels with 4 flow blocks per scale level instead of 16, and train the models on CIFAR10 for 200 epochs. We start with a short explanation of the limitations of 1-Lipschitz deep neural networks.

### 5.1 Analysis of activations and preservation of signals

Since gradient-norm attenuation can arise in 1-Lipschitz-bounded deep neural networks, we analyze how much signal each activation function preserves by examining the maximum and average distance ratios of sigmoid, LipSwish, and CLipSwish. Note that the maximum distance ratio approaches the Lipschitz constant, and it is desirable that the average distance ratio remains high.

Table 5: The mean and maximum ratio for different dimensions, with the sample size set to 100,000.

| Activation \ Measure | D=1 Mean | D=1 Max | D=128 Mean | D=128 Max | D=1024 Mean | D=1024 Max |
|---|---|---|---|---|---|---|
| Sigmoid | 0.22 | 0.25 | 0.21 | 0.22 | 0.21 | 0.21 |
| LipSwish | 0.46 | 1.0 | 0.51 | 0.64 | 0.51 | 0.55 |
| CLipSwish | 0.72 | 1.0 | 0.71 | 0.77 | 0.71 | 0.73 |
| Identity | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |

We sample 100,000 datapoints $v, w \sim \mathcal{N}(0, 1)$ with dimension $D \in \{1, 128, 1024\}$. We compute the mean and maximum of the sampled ratios $\ell_2(\phi(v), \phi(w)) / \ell_2(v, w)$ and analyze the expressive power of each function. Table 5 shows the results. We find that, for all dimensions, CLipSwish preserves most of the signal on average compared to the other non-linearities. This may explain why i-DenseNets with CLipSwish activation achieve better results than with, e.g., LipSwish. The experiment indicates that, on randomly sampled points, CLipSwish suffers from considerably less gradient-norm attenuation. Note that when sampling from a distribution with larger parameter values, the preference for CLipSwish is even more pronounced; see Appendix D.1.

### 5.2 Activation functions

We start by exploring different activation functions for both models and test these with the smaller architectures. We compare our CLipSwish to the LipSwish and to the LeakyLSwish as an additional baseline, which, in contrast to a standard LipSwish, allows freedom of movement in the left tail:

$$\mathrm{LeakyLSwish}(x) = \alpha x + (1 - \alpha)\,\mathrm{LipSwish}(x), \quad (14)$$

where $\alpha \in (0, 1)$ is obtained by passing a learnable parameter through a sigmoid function $\sigma$; it is initialized at $\alpha = \sigma(-3)$ to mimic the LipSwish at initialization. Note that the dimension of Residual Flows with the CLipSwish activation function is set to 652 instead of 512 to maintain a similar number of parameters (8.7M) as with the LipSwish activation.

Table 6: Results in bits per dimension for small architectures, testing different activation functions.

| Model | LipSwish | LeakyLSwish | CLipSwish |
|---|---|---|---|
| Residual Flow | 3.42 | 3.42 | 3.38 |
| i-DenseNet | 3.39 | 3.39 | 3.37 |

Table 6 shows the results of each model using the different activation functions. With 3.37 bpd, i-DenseNet with our CLipSwish as the activation function obtains the best performance compared to the other activation functions, LipSwish and LeakyLSwish. Furthermore, all i-DenseNets outperform Residual Flows with the same activation function. We want to point out that CLipSwish not only boosts the performance of i-DenseNets but also significantly improves the performance of Residual Flows, to 3.38 bpd. The running time for the forward pass, training, and sampling, expressed as a percentage faster or slower than a Residual Flow with the same activation function, can be found in Appendix D.2.
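The distance-ratio measurement behind Table 5 can be reproduced in a few lines; the sketch below does so for LipSwish and CLipSwish with $\beta = 0.5$ and the 1.1 and 1.004 normalizations described in Section 3.4. It is an illustrative re-implementation under these assumptions, not the original script (whose exact settings are in Appendix D.1).

```python
import numpy as np

rng = np.random.default_rng(0)

def lipswish(x, beta=0.5):
    return x / (1.0 + np.exp(-beta * x)) / 1.1

def clipswish(x, beta=0.5):
    # Concatenate along the feature axis and divide by the 1.004 bound.
    return np.concatenate([lipswish(x, beta), lipswish(-x, beta)], axis=-1) / 1.004

def distance_ratios(phi, dim, n=100_000):
    """Sample pairs v, w ~ N(0, I) and compute l2(phi(v), phi(w)) / l2(v, w)."""
    v = rng.standard_normal((n, dim))
    w = rng.standard_normal((n, dim))
    num = np.linalg.norm(phi(v) - phi(w), axis=-1)
    den = np.linalg.norm(v - w, axis=-1)
    return num / den

for name, phi in [("LipSwish", lipswish), ("CLipSwish", clipswish)]:
    r = distance_ratios(phi, dim=128)
    print(f"{name:10s} mean={r.mean():.2f} max={r.max():.2f}")
# CLipSwish keeps a noticeably higher mean ratio than LipSwish,
# qualitatively matching Table 5.
```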
### 5.3 DenseNets concatenation depth

Figure 5: Effect of different concatenation depths with the CLipSwish activation function for i-DenseNets, in bits per dimension.

Next, we examine the effect of different concatenation depth settings for i-DenseNets. We run experiments with the concatenation depth set to 2, 3, 4, and 5, using CLipSwish. Furthermore, to match the 8.7M parameters of the Residual Flow, we choose a fixed depth and an appropriate DenseNet growth size to obtain a similar number of parameters. This results in a DenseNet depth of 2 with growth size 318 (8.8M), depth 3 with growth 178 (8.7M), depth 4 with growth 122 (8.7M), and depth 5 with growth 92 (8.8M). The effect of each architecture can be found in Figure 5. We observe that the model with a depth of 3 obtains the best scores and, after 200 epochs, achieves the lowest bits per dimension with 3.37 bpd. A concatenation depth of 5 results in 3.42 bpd after 200 epochs, which is the least preferred. This could indicate that the corresponding DenseNet growth of 92 is too small to capture the density structure sufficiently, and that with the deeper depth the network may lose important signal. Further, the figure clearly shows how learnable weighted concatenation boosts training for all i-DenseNets after 25 epochs. See Appendix D.3 for an in-depth analysis of the learnable weighted concatenation.

Furthermore, we performed an additional experiment (see Appendix D.4) where we extended the width and depth of the ResNet connections in $g(x)$ of Residual Flows such that they match the size of the i-DenseNet. On CIFAR10, this puts the extended Residual Flow at a considerable advantage, as it utilizes 19.1M parameters instead of 8.7M. However, the model performs worse (7.02 bpd) than i-DenseNets (3.39 bpd) and even worse than its original version (3.42 bpd) in terms of bpd. A possible explanation of this phenomenon is that forcing more convolutional layers to be 1-Lipschitz increases the gradient-norm attenuation problems, so that in practice the layers become considerably less expressive. This indicates that modeling a DenseNet in $g(x)$ is indeed an important difference that gives better performance.

### 5.4 Future work

We introduced a new framework, i-DenseNets, inspired by Residual Flows and i-ResNets. We demonstrated how i-DenseNets outperform Residual Flows and alternative flow-based models for density estimation and hybrid modeling, constrained to using uniform dequantization. For future work, we want to address several interesting aspects we came across and where i-DenseNets may be further deployed and explored.

First of all, we find that, compared to Residual Flows, the gains of i-DenseNets are more pronounced for smaller architectures than for full models. Especially for exploring the network, or when a limited computational budget is available, we recommend experimenting with smaller architectures. This brings us to the second point: due to a limited budget, we trained and tested i-DenseNets on 32×32 CIFAR10 and ImageNet32 data. It would be interesting to test higher-resolution and other types of datasets. Further exploration of DenseNet depth and growth for other or higher-resolution datasets may be worthwhile. In our studies, deeper DenseNets did not result in better performance; however, it would also be beneficial to further examine the optimization of DenseNet architectures. Similarly, we showed how to constrain Dense Blocks for the $\ell_2$-norm.
For future work, it may be interesting to generalize the method to different norm types, as well as the norm used for the CLipSwish activation function. Note that CLipSwish as an activation function boosts performance not only for i-DenseNets but also for Residual Flows; we recommend this activation function for future work. We want to stress that we focused on extending Residual Flows, which use uniform dequantization. However, we believe that the performance of our network may be improved further using variational dequantization or augmentation. Finally, we found that especially the hybrid model with $\lambda = 1$ achieves better performance than its predecessors. This may be worthwhile to investigate further in the future.

## Societal Impact

We discussed methods to improve normalizing flows, which learn high-dimensional distributions. We generated realistic-looking images and also used hybrid models that both predict the label of an image and generate new images. Besides generating images, these models can be deployed to, e.g., detect adversarial attacks. Additionally, the method is applicable to many different fields, such as chemistry or physics. An increasing concern is that generative models in general have an impact on society. They can not only be used to aid society but can also be used to generate misleading information by those who misuse these models. Examples of these cases could be generating realistic-looking documents, deepfakes, or fraud committed with the wrong intentions. Even though current flow-based models are not yet able to generate flawless reproductions, this concern should be kept in mind. It even raises the question of whether these models should be used in practice when the detection of misleading information becomes difficult or even impossible.

## 6 Conclusion

In this paper, we proposed i-DenseNets, a parameter-efficient alternative to Residual Flows. Our method enforces invertibility by satisfying Lipschitz continuity in the dense layers. In addition, we introduced a version where the concatenation of features is learned during training, which indicates which representations are of importance for the model. Furthermore, we showed how to deploy the CLipSwish activation function, which significantly improves performance for both i-DenseNets and Residual Flows. Smaller architectures under an equal parameter budget were used for the exploration of different settings. The full model for density estimation was trained on 32×32 CIFAR10 and ImageNet32 data. We demonstrated the performance of i-DenseNets and compared the models to Residual Flows and other comparable flow-based models on density estimation in bits per dimension. We also demonstrated how the model can be deployed for hybrid modeling, evaluated on classification accuracy and density estimation in bits per dimension. Furthermore, we showed that a Residual Flow with ResNet connections matched in size to an i-DenseNet obtains worse performance than both the i-DenseNet and the original Residual Flow. In conclusion, i-DenseNets outperform Residual Flows and other competitive flow-based models for density estimation on all considered datasets in bits per dimension, as well as for hybrid modeling that includes classification. The obtained results clearly indicate the high potential of i-DenseNets as powerful flow-based models.

## Acknowledgments

We would like to thank Patrick Forré for his helpful feedback on the derivations.
Furthermore, this work was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.

## Funding Transparency Statement

There are no additional sources of funding to disclose.

## References

[1] Cem Anil, James Lucas, and Roger Grosse. Sorting out Lipschitz function approximation. In International Conference on Machine Learning, 2019.

[2] Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. In International Conference on Machine Learning, 2019.

[3] Rianne van den Berg, Leonard Hasenclever, Jakub M. Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.

[4] Ricky T. Q. Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, 2019.

[5] Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of ImageNet as an alternative to the CIFAR datasets. arXiv preprint arXiv:1707.08819, 2017.

[6] Nicola De Cao, Wilker Aziz, and Ivan Titov. Block neural autoregressive flow. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence, 2019.

[7] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2015.

[8] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2017.

[9] Ethan Fetaya, Jörn-Henrik Jacobsen, Will Grathwohl, and Richard Zemel. Understanding the limitations of conditional generative models. arXiv preprint arXiv:1906.01171, 2019.

[10] Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael Cree. Regularisation of neural networks by enforcing Lipschitz continuity. arXiv preprint arXiv:1804.04368, 2018.

[11] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[13] Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. arXiv preprint arXiv:1902.00275, 2019.

[14] Emiel Hoogeboom, Victor Garcia Satorras, Jakub M. Tomczak, and Max Welling. The convolution exponential and generalized Sylvester flows. arXiv preprint arXiv:2006.01910, 2020.

[15] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[16] Michael F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 19(2):433-450, 1990.

[17] Jörn-Henrik Jacobsen, Jens Behrmann, Richard Zemel, and Matthias Bethge. Excessive invariance causes adversarial vulnerability. arXiv preprint arXiv:1811.00401, 2018.

[18] Herman Kahn. Use of different Monte Carlo sampling techniques. In Proceedings of Symposium on Monte Carlo Methods, 1955.

[19] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[20] Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215-10224, 2018.
[21] Durk P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743-4751, 2016.

[22] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[23] David C. Lay. Linear Algebra and its Applications. Pearson, 2006.

[24] Qiyang Li, Saminul Haque, Cem Anil, James Lucas, Roger B. Grosse, and Jörn-Henrik Jacobsen. Preventing gradient attenuation in Lipschitz constrained convolutional networks. In Advances in Neural Information Processing Systems, pages 15390-15402, 2019.

[25] David J. C. MacKay and Mark N. Gibbs. Density networks. In Statistics and Neural Networks: Advances at the Interface, pages 129-144. Oxford University Press, Oxford, 1999.

[26] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

[27] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Hybrid models with deep and invertible features. In International Conference on Machine Learning, 2019.

[28] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[29] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747-1756, 2016.

[30] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338-2347, 2017.

[31] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

[32] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.

[33] Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125, 2013.

[34] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. arXiv preprint arXiv:1603.05201, 2016.

[35] John Skilling. The eigenvalues of mega-dimensional matrices. In Maximum Entropy and Bayesian Methods, pages 455-466. Springer, 1989.

[36] Esteban G. Tabak and Cristina V. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145-164, 2013.

[37] Esteban G. Tabak and Eric Vanden-Eijnden. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217-233, 2010.

[38] Jakub M. Tomczak and Max Welling. Improving variational auto-encoders using householder flow. arXiv preprint arXiv:1611.09630, 2016.