# Densely connected normalizing flows

Matej Grcić, Ivan Grubišić and Siniša Šegvić
Faculty of Electrical Engineering and Computing, University of Zagreb
matej.grcic@fer.hr, ivan.grubisic@fer.hr, sinisa.segvic@fer.hr

Normalizing flows are bijective mappings between inputs and latent representations with a fully factorized distribution. They are very attractive due to exact likelihood evaluation and efficient sampling. However, their effective capacity is often insufficient since the bijectivity constraint limits the model width. We address this issue by incrementally padding intermediate representations with noise. We precondition the noise in accordance with previous invertible units, which we describe as cross-unit coupling. Our invertible Glow-like modules increase the model expressivity by fusing a densely connected block with Nyström self-attention. We refer to our architecture as DenseFlow since both cross-unit and intra-module couplings rely on dense connectivity. Experiments show significant improvements due to the proposed contributions and reveal state-of-the-art density estimation under moderate computing budgets.¹

¹ Code available at: https://github.com/matejgrcic/DenseFlow

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

## 1 Introduction

One of the main tasks of modern artificial intelligence is to generate images, audio waveforms, and natural-language symbols. To achieve this goal, the current state of the art uses deep compositions of non-linear transformations [1, 2] known as deep generative models [3, 4, 5, 6, 7]. Formally, deep generative models estimate an unknown data distribution $p_D$ given by a set of i.i.d. samples $\mathcal{D} = \{x_1, \dots, x_n\}$. The data distribution is approximated with a model distribution $p_\theta$ defined by the architecture of the model and a set of parameters $\theta$. While the architecture is usually handcrafted, the set of parameters $\theta$ is obtained by optimizing the likelihood across the training distribution $p_D$:

$$\theta = \operatorname*{argmin}_{\theta \in \Theta} \; \mathbb{E}_{x \sim p_D}\left[-\ln p_\theta(x)\right]. \tag{1}$$

Properties of the model (e.g. efficient sampling, the ability to evaluate the likelihood, etc.) directly depend on the definition of $p_\theta(x)$, or on the decision to avoid it. Early approaches consider an unnormalized distribution [3], which usually requires MCMC-based sample generation [8, 9, 10] with long mixing times. Alternatively, the distribution can be autoregressively factorized [7, 11], which allows likelihood estimation and powerful but slow sample generation. VAEs [4] use a factorized variational approximation of the latent representation, which allows learning an autoencoder by optimizing a lower bound of the likelihood. Diffusion models [12, 13, 14] learn to reverse a diffusion process, which is a fixed Markov chain that gradually adds noise to the data in the direction opposite to sampling until the signal is destroyed. Generative adversarial networks [5] mimic the dataset samples by competing in a minimax game. This allows them to efficiently produce high-quality samples [15], which however often do not span the entire support of the training distribution [16]. Additionally, the inability to "invert" the generation process in any meaningful way implies an inability to evaluate the likelihood.
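The objective in (1) is ordinary maximum-likelihood training: the parameters are fit by minimizing the expected negative log-likelihood over samples from $p_D$. The following minimal PyTorch sketch illustrates this idea; the `model.log_prob` interface and the training-loop names are illustrative assumptions, not the paper's implementation.

```python
import torch

def nll_loss(model, x):
    """Monte Carlo estimate of Eq. (1): E_{x ~ p_D}[-ln p_theta(x)],
    assuming `model.log_prob(x)` returns per-sample log-likelihoods."""
    return -model.log_prob(x).mean()

def train(model, loader, epochs=10, lr=1e-3):
    """Hypothetical training loop; `loader` is assumed to yield batches x ~ p_D."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x in loader:
            loss = nll_loss(model, x)   # -ln p_theta(x), averaged over the batch
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```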
Contrary to previous approaches, normalizing flows [6, 17, 18] model the likelihood using a bijective mapping to a predefined latent distribution $p(z)$, typically a multivariate Gaussian. Given the bijection $f_\theta$, the likelihood is defined using the change of variables formula:

$$p_\theta(x) = p(z)\,\left|\det \frac{\partial z}{\partial x}\right|, \qquad z = f_\theta(x). \tag{2}$$

This approach requires computation of the Jacobian determinant $\det \frac{\partial z}{\partial x}$. Therefore, during the construction of bijective transformations, a great emphasis is placed on tractable determinant computation and efficient inverse computation [18, 19]. Due to these constraints, invertible transformations require more parameters to achieve a capacity similar to that of standard NN building blocks [20]. Still, modeling $p_\theta(x)$ through the bijective formulation enables exact likelihood evaluation and efficient sample generation, which makes this approach convenient for various downstream tasks [21, 22, 23].

The bijective formulation (2) implies that the input and the latent representation have the same dimensionality. Typically, convolutional units of normalizing-flow approaches [18] internally inflate the dimensionality of the input, extract useful features, and then compress them back to the original dimensionality. Unfortunately, the capacity of such transformations is limited by the input dimensionality [24]. This issue can be addressed by expressing the model as a sequence of bijective transformations [18]. However, increasing the depth alone is a suboptimal way to improve the capacity of a deep model [25]. Recent works propose to widen the flow by increasing the input dimensionality [24, 26]. We propose an effective development of that idea which further improves performance while relaxing computational requirements.

We increase the expressiveness of normalizing flows by incremental augmentation of intermediate latent representations with Gaussian noise. The proposed cross-unit coupling applies an affine transformation to the noise, where the scaling and translation are computed from a set of previous intermediate representations. In addition, we improve intra-module coupling by proposing a transformation which fuses the global spatial context with local correlations. The proposed image-oriented architecture improves expressiveness and computational efficiency. Our models set a new state-of-the-art result in likelihood evaluation on ImageNet32 and ImageNet64.

## 2 Densely connected normalizing flows

We present a recursive view of normalizing flows and propose improvements based on incremental augmentation of latent representations and on densely connected coupling modules paired with self-attention. The improved framework is then used to develop an image-oriented architecture, which we evaluate in the experimental section.

### 2.1 Normalizing flows with cross-unit coupling

Normalizing flows (NF) achieve their expressiveness by stacking multiple invertible transformations [18]. We illustrate this with scheme (3), where each pair of consecutive latent variables $z_{i-1}$ and $z_i$ is connected via a dedicated flow unit $f_i$. Each flow unit $f_i$ is a bijective transformation with parameters $\theta_i$, which we omit to keep the notation uncluttered. The variable $z_0$ is typically the input $x$ drawn from the data distribution $p_D(x)$.

$$z_0 \xrightarrow{f_1} z_1 \xrightarrow{f_2} z_2 \xrightarrow{f_3} \cdots \xrightarrow{f_{i-1}} z_{i-1} \xrightarrow{f_i} z_i \xrightarrow{} \cdots \xrightarrow{f_K} z_K, \qquad z_K \sim \mathcal{N}(0, \mathbf{I}). \tag{3}$$

Following the change of variables formula, the log-likelihoods of consecutive random variables $z_i$ and $z_{i+1}$ can be related through the Jacobian of the corresponding transformation $J_{f_{i+1}}$ [18]:

$$\ln p(z_i) = \ln p(z_{i+1}) + \ln \left|\det J_{f_{i+1}}\right|. \tag{4}$$

This relation can be seen as a recursion. The term $\ln p(z_{i+1})$ can be recursively replaced either with another instance of (4) or evaluated under the latent distribution, which marks the termination step.
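As a concrete illustration of the recursion (3)–(4), the sketch below composes a few toy element-wise affine units and accumulates their log-determinants to evaluate $\ln p_\theta(x)$ under a Gaussian prior on $z_K$. It is a minimal example of the change-of-variables bookkeeping, not the DenseFlow architecture; the class names and the affine units are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AffineUnit(nn.Module):
    """Toy invertible unit f_i: element-wise affine map z_i = exp(log_s) * z_{i-1} + t.
    Its Jacobian is diagonal, so ln|det J_{f_i}| = sum(log_s)."""
    def __init__(self, dim):
        super().__init__()
        self.log_s = nn.Parameter(torch.zeros(dim))
        self.t = nn.Parameter(torch.zeros(dim))

    def forward(self, z):
        logdet = self.log_s.sum().expand(z.shape[0])       # per-sample ln|det J|
        return z * self.log_s.exp() + self.t, logdet

    def inverse(self, z):
        return (z - self.t) * torch.exp(-self.log_s)

class ToyFlow(nn.Module):
    """Composition z_0 -> z_1 -> ... -> z_K as in scheme (3), with z_K ~ N(0, I).
    Eq. (4) applied recursively gives ln p(x) = ln p(z_K) + sum_i ln|det J_{f_i}|."""
    def __init__(self, dim, K=4):
        super().__init__()
        self.units = nn.ModuleList(AffineUnit(dim) for _ in range(K))
        self.prior = torch.distributions.Normal(0.0, 1.0)

    def log_prob(self, x):
        z, total_logdet = x, x.new_zeros(x.shape[0])
        for f in self.units:
            z, logdet = f(z)
            total_logdet = total_logdet + logdet
        return self.prior.log_prob(z).sum(dim=1) + total_logdet

    def sample(self, num_samples, dim):
        z = self.prior.sample((num_samples, dim))
        for f in reversed(self.units):                      # invert units in reverse order
            z = f.inverse(z)
        return z
```

For example, `ToyFlow(dim=8).log_prob(torch.randn(16, 8))` returns sixteen exact log-likelihoods, and the objective (1) can be optimized by minimizing their negative mean.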
This setup is characteristic of most contemporary architectures [17, 18, 19, 27]. The standard NF formulation can be expanded by augmenting the input with a noise variable $e_i$ [24, 26]. The noise $e_i$ follows some known distribution $p'(e_i)$, e.g. a multivariate Gaussian. We further improve this approach by incrementally concatenating noise to each intermediate latent representation $z_i$. A tractable formulation of this idea can be obtained by computing a lower bound on the likelihood $p(z_i)$ through Monte Carlo sampling of $e_i$:

$$\ln p(z_i) \geq \mathbb{E}_{e_i \sim p'(e)}\left[\ln p(z_i, e_i) - \ln p'(e_i)\right]. \tag{5}$$

The learned joint distribution $p(z_i, e_i)$ approximates the product of the target distributions $p'(z_i)$ and $p'(e_i)$, which is explained in more detail in Appendix D. We transform the introduced noise $e_i$ with an element-wise affine transformation. The parameters of this transformation are computed by a learned non-linear transformation $g_i(z$
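The sketch below shows one way the incremental noise augmentation and the affine cross-unit coupling described above could be realized, under simplifying assumptions: the noise is standard Gaussian, the context is a flat vector summarizing previous representations, and the names (`CrossUnitCoupling`, `context_dim`, `noise_dim`) are hypothetical rather than the DenseFlow API.

```python
import torch
import torch.nn as nn

class CrossUnitCoupling(nn.Module):
    """Pads a latent z_i with affinely transformed Gaussian noise.
    The element-wise scale and shift are predicted by a learned non-linear
    transformation g_i of a context built from previous representations."""
    def __init__(self, context_dim, noise_dim):
        super().__init__()
        self.g = nn.Sequential(                      # illustrative stand-in for g_i
            nn.Linear(context_dim, context_dim),
            nn.ReLU(),
            nn.Linear(context_dim, 2 * noise_dim),
        )
        self.noise_dim = noise_dim
        self.base = torch.distributions.Normal(0.0, 1.0)

    def forward(self, z, context):
        e = self.base.sample((z.shape[0], self.noise_dim)).to(z.device)
        a, b = self.g(context).chunk(2, dim=1)       # log-scale and shift of the noise
        e_aug = torch.exp(a) * e + b                 # preconditioned noise
        z_aug = torch.cat([z, e_aug], dim=1)         # widened representation for the next unit
        # Log-density of the transformed noise under the Gaussian it follows after
        # the affine map (change of variables: ln N(e; 0, I) - sum(a)); this plays
        # the role of the subtracted noise term in the Monte Carlo bound (5).
        log_q_e = self.base.log_prob(e).sum(dim=1) - a.sum(dim=1)
        return z_aug, log_q_e
```

A subsequent flow unit would then process `z_aug`, while `log_q_e` would enter the training objective together with the joint log-likelihood term of (5).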