# IMPLICIT NORMALIZING FLOWS

Published as a conference paper at ICLR 2021

Cheng Lu, Jianfei Chen, Chongxuan Li, Qiuhao Wang, Jun Zhu
Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University, Beijing, 100084 China
Center for Data Science, Peking University, Beijing, 100871 China
{lucheng.lc15,chris.jianfei.chen,chongxuanli1991}@gmail.com, dcszj@tsinghua.edu.cn, wqh19@pku.edu.cn

ABSTRACT

Normalizing flows define a probability distribution by an explicit invertible transformation z = f(x). In this work, we present implicit normalizing flows (ImpFlows), which generalize normalizing flows by allowing the mapping to be implicitly defined by the roots of an equation F(z, x) = 0. ImpFlows build on residual flows (ResFlows) with a proper balance between expressiveness and tractability. Through theoretical analysis, we show that the function space of ImpFlows is strictly richer than that of ResFlows. Furthermore, for any ResFlow with a fixed number of blocks, there exists some function for which the ResFlow has a non-negligible approximation error, while the function is exactly representable by a single-block ImpFlow. We propose a scalable algorithm to train and draw samples from ImpFlows. Empirically, we evaluate ImpFlows on several classification and density modeling tasks, and ImpFlows outperform ResFlows with a comparable amount of parameters on all the benchmarks.

1 INTRODUCTION

Normalizing flows (NFs) (Rezende & Mohamed, 2015; Dinh et al., 2014) are promising methods for density modeling. NFs define a model distribution p_x(x) by specifying an invertible transformation f(x) from x to another random variable z. By the change-of-variables formula, the model density is

ln p_x(x) = ln p_z(f(x)) + ln |det(J_f(x))|,    (1)

where p_z(z) follows a simple distribution, such as a Gaussian. NFs are particularly attractive due to their tractability, i.e., the model density p_x(x) can be directly evaluated via Eqn. (1). To achieve such tractability, NF models should satisfy two requirements: (i) the mapping between x and z is invertible; (ii) the log-determinant of the Jacobian J_f(x) is tractable. Searching for rich model families that satisfy these tractability constraints is crucial for the advance of normalizing flow research.

For the second requirement, earlier works such as inverse autoregressive flow (Kingma et al., 2016) and RealNVP (Dinh et al., 2017) restrict the model family to those with triangular Jacobian matrices. More recently, free-form Jacobian approaches such as residual flows (ResFlows) (Behrmann et al., 2019; Chen et al., 2019) have emerged. They relax the triangular Jacobian constraint by utilizing a stochastic estimator of the log-determinant, enriching the model family. However, the Lipschitz constant of each transformation block is constrained for invertibility. In general, this is not preferable because mapping a simple prior distribution to a potentially complex data distribution may require a transformation with a very large Lipschitz constant (see Fig. 3 for a 2D example). Moreover, all the aforementioned methods assume that there exists an explicit forward mapping z = f(x). Bijections with an explicit forward mapping cover only a fraction of the broad class of invertible functions suggested by the first requirement, which may limit the model capacity.
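As a concrete illustration of Eqn. (1), the following minimal sketch (our own illustration, not part of the paper) evaluates the model density of a toy one-dimensional affine flow under a standard Gaussian prior; the flow z = a·x + b and its parameters are illustrative placeholders.

```python
import torch

# Toy invertible flow z = f(x) = a * x + b (a != 0); purely illustrative, not the paper's model.
a, b = 2.0, 0.5

def log_prob(x):
    # Eqn. (1): ln p_x(x) = ln p_z(f(x)) + ln |det J_f(x)|, with p_z = N(0, 1).
    z = a * x + b
    log_pz = torch.distributions.Normal(0.0, 1.0).log_prob(z)
    log_det = torch.log(torch.tensor(a).abs())  # for a 1-D affine map, J_f(x) = a
    return log_pz + log_det

print(log_prob(torch.tensor([-1.0, 0.0, 1.0])))
```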
In this paper, we propose implicit flows (ImpFlows) to generalize NFs, allowing the transformation to be implicitly defined by an equation F(z, x) = 0. Given x (or z), the other variable can be computed by an implicit root-finding procedure z = RootFind(F(·, x)). An explicit mapping z = f(x) used in prior NFs can be viewed as a special case of ImpFlows in the form F(z, x) = f(x) − z = 0. To balance expressiveness and tractability, we present a specific form of ImpFlows, where each block is the composition of a ResFlow block and the inverse of another ResFlow block. We theoretically study the model capacity of ResFlows and ImpFlows in the function space. We show that the function family of single-block ImpFlows is strictly richer than that of two-block ResFlows by relaxing the Lipschitz constraints. Furthermore, for any ResFlow with a fixed number of blocks, there exists some invertible function that the ResFlow approximates with non-negligible error but that an ImpFlow can model exactly. On the practical side, we develop a scalable algorithm to estimate the probability density and its gradients, and to draw samples from ImpFlows. The algorithm leverages the implicit differentiation formula. Despite being more powerful, the gradient computation of ImpFlows is mostly similar to that of ResFlows, except for some additional overhead on root finding. We test the effectiveness of ImpFlows on several classification and generative modeling tasks. ImpFlows outperform ResFlows on all the benchmarks, with comparable model sizes and computational cost.

2 RELATED WORK

Expressive Normalizing Flows. There are many works focusing on improving the capacity of NFs. For example, Dinh et al. (2014; 2017); Kingma & Dhariwal (2018); Ho et al. (2019); Song et al. (2019); Hoogeboom et al. (2019); De Cao et al. (2020); Durkan et al. (2019) design dedicated model architectures with tractable Jacobians. More recently, Grathwohl et al. (2019); Behrmann et al. (2019); Chen et al. (2019) propose NFs with free-form Jacobians, which approximate the determinant with stochastic estimators. In parallel with architecture design, Chen et al. (2020); Huang et al. (2020); Cornish et al. (2020); Nielsen et al. (2020) improve the capacity of NFs by operating in a higher-dimensional space. As mentioned in the introduction, all these existing works adopt explicit forward mappings, which form only a subset of the broad class of invertible functions. In contrast, the implicit function family we consider is richer. While we primarily discuss the implicit generalization of ResFlows (Chen et al., 2019) in this paper, the general idea of utilizing implicit invertible functions could potentially be applied to other models as well. Finally, Zhang et al. (2020) formally prove that the model capacity of ResFlows is restricted by the dimension of the residual blocks. In comparison, we study another limitation of ResFlows in terms of the bounded Lipschitz constant, and compare the function families of ResFlows and ImpFlows with a comparable depth.

Continuous Time Flows. Continuous time flows (CTFs) (Chen et al., 2018b; Grathwohl et al., 2019; Chen et al., 2018a) are a flexible alternative to discrete time flows for generative modeling. They typically treat the invertible transformation as a dynamical system, which is approximately simulated by ordinary differential equation (ODE) solvers.
In contrast, the implicit function family considered in this paper does not involve differential equations, and only requires fixed-point solvers. Moreover, the theoretical guarantee is different. While CTFs typically study the universal approximation capacity in the continuous time case (i.e., the infinite depth limit), we consider the model capacity of ImpFlows and ResFlows under a finite number of transformation steps. Finally, while CTFs are flexible, their learning is challenging due to instability (Liu et al., 2020; Massaroli et al., 2020) and exceedingly many ODE solver steps (Finlay et al., 2020), making their large-scale application still an open problem.

Implicit Deep Learning. Utilizing implicit functions enhances the flexibility of neural networks, enabling the design of network layers in a problem-specific way. For instance, Bai et al. (2019) propose a deep equilibrium model as a compact replacement of recurrent networks; Amos & Kolter (2017) generalize each layer to solve an optimization problem; Wang et al. (2019) integrate logical reasoning into neural networks; Reshniak & Webster (2019) utilize the implicit Euler method to improve the stability of both forward and backward processes for residual blocks; and Sitzmann et al. (2020) incorporate periodic functions for representation learning. Different from these works, which consider implicit functions as a replacement for feed-forward networks, we develop invertible implicit functions for normalizing flows, discuss the conditions for the existence of such functions, and theoretically study the model capacity of our proposed ImpFlows in the function space.

3 IMPLICIT NORMALIZING FLOWS

We now present implicit normalizing flows, starting with a brief overview of existing work.

3.1 NORMALIZING FLOWS

As shown in Eqn. (1), a normalizing flow f : x ↦ z is an invertible function that defines a probability distribution via the change-of-variables formula. The modeling capacity of normalizing flows depends on the expressiveness of the invertible function f. Residual flows (ResFlows) (Chen et al., 2019; Behrmann et al., 2019) are a particularly powerful class of NFs due to their free-form Jacobian. ResFlows use f = f_L ∘ ⋯ ∘ f_1 to construct the invertible mapping, where each layer f_l is an invertible residual network whose Lipschitz constant is bounded by a fixed constant κ:

f_l(x) = x + g_l(x),  Lip(g_l) ≤ κ < 1,    (2)

where Lip(g) is the Lipschitz constant of a function g (see Sec. 4.1 for details).

Despite their free-form Jacobian, the model capacity of ResFlows is still limited by the Lipschitz constant of the invertible function. The Lipschitz constant of each ResFlow block f_l cannot exceed 2 (Behrmann et al., 2019), so the Lipschitz constant of an L-block ResFlow cannot exceed 2^L. However, to transform a simple prior distribution into a potentially complex data distribution, the Lipschitz constant of the transformation may need to be very large in general. Therefore, ResFlows can be undesirably deep simply to meet the Lipschitz constraints (see Fig. 3 for a 2D example). Below, we present implicit flows (ImpFlows) to relax the Lipschitz constraints.

3.2 MODEL SPECIFICATION

In general, an implicit flow (ImpFlow) is defined as an invertible mapping between random variables x and z of dimension d by finding the roots of F(z, x) = 0, where F is a function from R^{2d} to R^d.
In particular, the explicit mappings z = f(x) used in prior flow instances (Chen et al., 2019; Kingma & Dhariwal, 2018) can be expressed as an implicit function in the form F(z, x) = f(x) − z = 0. While ImpFlows are a powerful family to explore, in general they are not guaranteed to satisfy the invertibility and the tractability of the log-determinant required by NFs. In this paper, we focus on the following specific form, which achieves a good balance between expressiveness and tractability, and leave other possibilities for future studies.

Definition 1. Let g_z : R^d → R^d and g_x : R^d → R^d be two functions such that Lip(g_x) < 1 and Lip(g_z) < 1, where Lip(g) is the Lipschitz constant of a function g. A specific form of ImpFlows is defined by F(z, x) = 0, where

F(z, x) = g_x(x) − g_z(z) + x − z.    (3)

The root pairs of Eqn. (3) form a subset of R^d × R^d, which actually defines the assignment rule of a unique invertible function f. To see this, for any x_0, according to Definition 1, we can construct a contraction h_{x_0}(z) = F(z, x_0) + z with a unique fixed point, which corresponds to a unique root (w.r.t. z) of F(z, x_0) = 0, denoted by f(x_0). Similarly, in the reverse process, given a z_0, the root (w.r.t. x) of F(z_0, x) = 0 also exists and is unique, denoted by f^{-1}(z_0). These two properties are sufficient to ensure the existence and the invertibility of f, as summarized in the following theorem.

Theorem 1. Eqn. (3) defines a unique mapping f : R^d → R^d, z = f(x), and f is invertible.

See the proof in Appendix A.1. Theorem 1 characterizes the validity of the ImpFlows introduced in Definition 1. In fact, a single ImpFlow is a stack of a single ResFlow block and the inverse of another single ResFlow block, which will be formally stated in Sec. 4. We will investigate the expressiveness of the function family of ImpFlows in Sec. 4, and present a scalable algorithm to learn a deep generative model built upon ImpFlows in Sec. 5.

4 EXPRESSIVE POWER

We first present some preliminaries on Lipschitz continuous functions in Sec. 4.1 and then formally study the expressive power of ImpFlows, especially in comparison to ResFlows. In particular, we prove that the function space of ImpFlows is strictly richer than that of ResFlows in Sec. 4.2 (see an illustration in Fig. 1 (a)). Furthermore, for any ResFlow with a fixed number of blocks, there exists some function for which the ResFlow has a non-negligible approximation error, while the function is exactly representable by a single-block ImpFlow. These results are illustrated in Fig. 1 (b) and formally presented in Sec. 4.3.

Figure 1: An illustration of our main theoretical results on the expressive power of ImpFlows and ResFlows. Panel (a) shows the relationship between R^2 and I; Panel (b) shows the relationship between R^ℓ and I. Panels (a) and (b) correspond to the results in Sec. 4.2 and Sec. 4.3, respectively.

Figure 2: A 1-D motivating example. (a) Plot of the target function. (b) Results of fitting the target function using ResFlows with different numbers of blocks (L = 1, 2, 3); all of them have non-negligible approximation error due to the Lipschitz constraint. (c) An ImpFlow that can exactly represent the target function. (d) A visualization of composing a ResFlow block f_1 and the inverse f_2^{-1} of another ResFlow block to construct an ImpFlow block.
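To make the composition in Fig. 2 (d) concrete, the sketch below implements the 1-D example; the exact residual parts are our reconstruction of the partly garbled description in Appendix D.1. With g_x(x) = ReLU(−0.9x) and g_z(z) = −ReLU(0.9z), both 0.9-Lipschitz, the implicit block F(z, x) = g_x(x) − g_z(z) + x − z = 0 represents the target f(x) = 0.1x for x < 0 and f(x) = 10x for x ≥ 0, whose Lipschitz constant (10) is far beyond the bound of a two- or three-block ResFlow.

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

# Residual parts of Definition 1; both are 0.9-Lipschitz (< 1).
g_x = lambda x: relu(-0.9 * x)
g_z = lambda z: -relu(0.9 * z)

def impflow_forward(x, n_iter=200):
    # Solve F(z, x) = g_x(x) - g_z(z) + x - z = 0 for z by fixed-point iteration;
    # h(z) = g_x(x) + x - g_z(z) is a contraction in z because Lip(g_z) < 1.
    z = np.zeros_like(x)
    for _ in range(n_iter):
        z = g_x(x) + x - g_z(z)
    return z

target = lambda x: np.where(x < 0, 0.1 * x, 10.0 * x)  # the target function in Fig. 2 (a)

x = np.linspace(-1.0, 1.0, 11)
print(np.abs(impflow_forward(x) - target(x)).max())  # close to 0, up to fixed-point tolerance
```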
The detailed settings of this example can be found in Appendix D.

4.1 LIPSCHITZ CONTINUOUS FUNCTIONS

For any differentiable function f : R^d → R^d and any x ∈ R^d, we denote the Jacobian matrix of f at x as J_f(x) ∈ R^{d×d}.

Definition 2. A function f : R^d → R^d is called Lipschitz continuous if there exists a constant L such that

‖f(x_1) − f(x_2)‖ ≤ L ‖x_1 − x_2‖,  ∀ x_1, x_2 ∈ R^d.

The smallest L that satisfies the inequality is called the Lipschitz constant of f, denoted as Lip(f). In general, the definition of Lip(f) depends on the choice of the norm ‖·‖; we use the L_2-norm by default in this paper for simplicity.

Definition 3. A function f : R^d → R^d is called bi-Lipschitz continuous if it is Lipschitz continuous and has an inverse mapping f^{-1} which is also Lipschitz continuous.

It is useful to consider an equivalent definition of the Lipschitz constant in our following analysis.

Proposition 1. (Rademacher (Federer (1969), Theorem 3.1.6)) If f : R^d → R^d is Lipschitz continuous, then f is differentiable almost everywhere, and

Lip(f) = sup_{x ∈ R^d} ‖J_f(x)‖_2,

where ‖M‖_2 = sup_{v : ‖v‖_2 = 1} ‖Mv‖_2 is the operator norm of the matrix M ∈ R^{d×d}.

4.2 COMPARISON TO TWO-BLOCK RESFLOWS

We formally compare the expressive power of a single-block ImpFlow and a two-block ResFlow. We highlight the structure of the theoretical results in this subsection in Fig. 1 (a) and present a 1D motivating example in Fig. 2. All the proofs can be found in Appendix A.

On the one hand, according to the definition of ResFlows, the function family of the single-block ResFlow is

R := {f : f = g + Id, g ∈ C^1(R^d, R^d), Lip(g) < 1},    (4)

where C^1(R^d, R^d) consists of all functions from R^d to R^d with continuous derivatives and Id denotes the identity map. Besides, the function family of ℓ-block ResFlows is defined by composition:

R^ℓ := {f : f = f_ℓ ∘ ⋯ ∘ f_1 for some f_1, …, f_ℓ ∈ R}.    (5)

By the definitions in Eqn. (4) and Eqn. (5), R^1 = R. On the other hand, according to the definition of the ImpFlow in Eqn. (3), we can obtain (g_x + Id)(x) = g_x(x) + x = g_z(z) + z = (g_z + Id)(z). Equivalently, we have z = ((g_z + Id)^{-1} ∘ (g_x + Id))(x), where ∘ denotes the composition of functions, which implies that the function family of the single-block ImpFlow is

I := {f : f = f_2^{-1} ∘ f_1 for some f_1, f_2 ∈ R}.    (6)

Intuitively, a single-block ImpFlow can be interpreted as the composition of a ResFlow block and the inverse function of another ResFlow block, which may not have an explicit form (see Fig. 2 (c) and (d) for a 1D example). Therefore, it is natural to investigate the relationship between I and R^2. Before that, we first introduce a family of monotonically increasing functions that does not have an explicit Lipschitz constraint, and show that it is strictly larger than R.

Lemma 1.

R ⊊ F := {f ∈ D : inf_{x ∈ R^d, v ∈ R^d, ‖v‖_2 = 1} v^T J_f(x) v > 0},    (7)

where D is the set of all bi-Lipschitz C^1-diffeomorphisms from R^d to R^d, and A ⊊ B means A is a proper subset of B. Note that it follows from Behrmann et al. (2019, Lemma 2) that all functions in R are bi-Lipschitz, so R ⊂ D. In the 1D input case, we can get R = {f ∈ C^1(R) : inf_{x ∈ R} f′(x) > 0, sup_{x ∈ R} f′(x) < 2} and F = {f ∈ C^1(R) : inf_{x ∈ R} f′(x) > 0}. In the high-dimensional case, R and F are hard to illustrate. Nevertheless, the Lipschitz constants of the functions in R are less than 2 (Behrmann et al., 2019), while those of the functions in F can be arbitrarily large. Based on Lemma 1, we prove that the function family I of ImpFlows consists of the compositions of two functions in F, and therefore is strictly larger than R^2, as summarized in the following theorem.
Theorem 2 (Equivalent form of the function family of a single-block ImpFlow).

I = F^2 := {f : f = f_2 ∘ f_1 for some f_1, f_2 ∈ F}.    (8)

Note that the identity mapping Id ∈ F, so it is easy to get F ⊂ I. Thus, the Lipschitz constant of a single ImpFlow (and of its inverse) can be arbitrarily large. Because R ⊊ F and there exist some functions in I \ R^2 (see a constructed example in Sec. 4.3), we can get the following corollary.

Corollary 1. R ⊊ R^2 ⊊ F^2 = I.

The results on the 1D example in Fig. 2 (b) and (c) accord with Corollary 1. Besides, Corollary 1 can be generalized to the case of 2ℓ-block ResFlows and ℓ-block ImpFlows, which strongly motivates the usage of implicit layers in normalizing flows.

4.3 COMPARISON WITH MULTI-BLOCK RESFLOWS

We further investigate the relationship between R^ℓ for ℓ > 2 and I, as illustrated in Fig. 1 (b). For a fixed ℓ, the Lipschitz constant of functions in R^ℓ is still bounded, and there exist infinitely many functions that are not in R^ℓ but are in I. We construct one such function family: for any L, r ∈ R_+, define

P(L, r) := {f : f ∈ F, ∃ B_r ⊂ R^d, ∀ x, y ∈ B_r, ‖f(x) − f(y)‖_2 ≥ L ‖x − y‖_2},    (9)

where B_r is a d-dimensional ball with radius r. Obviously, P(L, r) is an infinite set. Below, we show that for any 0 < ℓ < log_2(L), R^ℓ has a non-negligible approximation error for functions in P(L, r), while these functions are exactly representable by functions in I.

Theorem 3. Given L > 0 and r > 0, we have

∀ 0 < ℓ < log_2(L),  P(L, r) ∩ R^ℓ = ∅.

Moreover, for any f ∈ P(L, r) with d-dimensional ball B_r, the minimal error for fitting f in B_r by functions in R^ℓ satisfies

inf_{g ∈ R^ℓ} sup_{x ∈ B_r} ‖f(x) − g(x)‖_2 ≥ (r/2)(L − 2^ℓ).    (10)

It follows from Theorem 3 that to model f ∈ P(L, r), we need only a single-block ImpFlow but at least a log_2(L)-block ResFlow. In Fig. 2 (b), we show a 1D case where a 3-block ResFlow cannot fit a function that is exactly representable by a single-block ImpFlow. In addition, we also prove some other properties of ImpFlows; in particular, R^3 ⊄ I. We formally present these results in Appendix B.

5 GENERATIVE MODELING WITH IMPFLOWS

ImpFlows can be parameterized by neural networks and stacked to form a deep generative model for high-dimensional data distributions. We develop a scalable algorithm to perform inference, sampling and learning in such models. For simplicity, we focus on a single block during the derivation. Formally, a parametric ImpFlow block z = f(x; θ) is defined by F(z, x; θ) = 0, where

F(z, x; θ) = g_x(x; θ) − g_z(z; θ) + x − z,    (11)

and Lip(g_x) < 1, Lip(g_z) < 1. Here θ denotes all the parameters in g_x and g_z (which does NOT mean that g_x and g_z share parameters). Note that x refers to the input of the layer, not the input data.

The inference process to compute z given x in a single ImpFlow block is solved by finding the root of F(z, x; θ) = 0 w.r.t. z, which cannot be computed explicitly because of the implicit formulation. Instead, we adopt a quasi-Newton method (i.e., Broyden's method (Broyden, 1965)) to solve this problem iteratively, as follows:

z^{[i+1]} = z^{[i]} − α B F(z^{[i]}, x; θ),  for i = 0, 1, …,    (12)

where B is a low-rank approximation of the Jacobian inverse¹ and α is the step size, which we compute dynamically with a line search method. The stop criterion is ‖F(z^{[i]}, x; θ)‖_2 < ϵ_f, where ϵ_f is a hyperparameter that balances the computation time and the precision. As Theorem 1 guarantees the existence and uniqueness of the root, the convergence of Broyden's method is also guaranteed, and is typically faster than a linear rate.
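As a sketch of the forward root-finding step in Eqn. (12), the snippet below solves F(z, x) = 0 with SciPy's Broyden solver. The toy residual maps (spectral-normalized linear layers with tanh, scaled by 0.9) are stand-ins for the paper's networks, and the paper uses its own Broyden implementation with line search rather than SciPy.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(0)
d = 4

def make_contraction(coeff=0.9):
    # Toy residual map g(u) = coeff * tanh(W u); Lip(g) <= coeff because ||W||_2 = 1.
    W = rng.standard_normal((d, d))
    W = W / np.linalg.norm(W, 2)  # normalize the spectral norm of the weight
    return lambda u: coeff * np.tanh(W @ u)

g_x, g_z = make_contraction(), make_contraction()

def F(z, x):
    # Eqn. (11): F(z, x) = g_x(x) - g_z(z) + x - z
    return g_x(x) + x - g_z(z) - z

x = rng.standard_normal(d)
sol = root(lambda z: F(z, x), x0=np.zeros(d), method="broyden1", tol=1e-10)
z = sol.x
print(np.linalg.norm(F(z, x)))  # residual ||F(z, x)||_2, small (near the requested tolerance)
```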
Another inference problem is to estimate the log-likelihood. Assume that z ∼ p(z), where p(z) is a simple prior distribution (e.g., a standard Gaussian). The log-likelihood of x can be written as

ln p(x) = ln p(z) + ln det(I + J_{g_x}(x)) − ln det(I + J_{g_z}(z)),    (13)

where J_f(x) denotes the Jacobian matrix of a function f at x. See Appendix A.4 for the detailed derivation. Exact calculation of the log-determinant terms requires O(d^3) time and is hard to scale up to high-dimensional data. Instead, we propose the following unbiased estimator of ln p(x), using the same technique as Chen et al. (2019) with the Skilling-Hutchinson trace estimator (Skilling, 1989; Hutchinson, 1989):

ln p(x) = ln p(z) + E_{n∼p(N), v∼N(0,I)} [ Σ_{k=1}^{n} ((−1)^{k+1} / k) (v^T [J_{g_x}(x)^k] v − v^T [J_{g_z}(z)^k] v) / P(N ≥ k) ],    (14)

where p(N) is a distribution supported over the positive integers. The sampling process to compute x given z can also be solved by Broyden's method, and the hyperparameters are shared with the inference process.

In the learning process, we perform stochastic gradient descent to minimize the negative log-likelihood of the data, denoted as L. For efficiency, we estimate the gradient w.r.t. the model parameters in a backpropagation manner. According to the chain rule and the additivity of the log-determinant, in each layer we need to estimate the gradients of Eqn. (13) w.r.t. x and θ. In particular, the gradient computation involves two terms: one is ∂ ln det(I + J_g(x; θ))/∂(·) and the other is (∂L/∂z)(∂z/∂(·)), where g is a function satisfying Lip(g) < 1 and (·) denotes x or θ. On the one hand, for the log-determinant term, we can use the same technique as Chen et al. (2019) and obtain an unbiased gradient estimator as follows:

∂ ln det(I + J_g(x; θ))/∂(·) = E_{n∼p(N), v∼N(0,I)} [ ( Σ_{k=0}^{n} ((−1)^k / P(N ≥ k)) v^T J_g(x; θ)^k ) (∂J_g(x; θ)/∂(·)) v ],    (15)

where p(N) is a distribution supported over the positive integers. On the other hand, (∂L/∂z)(∂z/∂(·)) can be computed according to the implicit function theorem as follows (see details in Appendix A.5):

(∂L/∂z)(∂z/∂(·)) = (∂L/∂z) J_G^{-1}(z) (∂F(z, x; θ)/∂(·)),  where G(z; θ) = g_z(z; θ) + z.    (16)

In comparison to directly differentiating through the quasi-Newton iterations of the forward pass, the implicit gradient above is simple and memory-efficient, treating the root solver as a black box. Following Bai et al. (2019), we compute (∂L/∂z) J_G^{-1}(z) by solving a linear system iteratively, as detailed in Appendix C.1. The training algorithm is formally presented in Appendix C.4.

¹We refer readers to Broyden (1965) for the calculation details of B.

Table 1: Classification error rate (%) on the test set for the vanilla ResNet, ResFlow and ImpFlow with the ResNet-18 architecture, with varying Lipschitz coefficients c. The "Vanilla" column reports the vanilla ResNet on the corresponding dataset.

|  | Vanilla | c = 0.99 | c = 0.9 | c = 0.8 | c = 0.7 | c = 0.6 |
|---|---|---|---|---|---|---|
| CIFAR10, ResFlow | 6.61 (±0.02) | 8.24 (±0.03) | 8.39 (±0.01) | 8.69 (±0.03) | 9.25 (±0.02) | 9.94 (±0.02) |
| CIFAR10, ImpFlow |  | 7.29 (±0.03) | 7.41 (±0.03) | 7.94 (±0.06) | 8.44 (±0.04) | 9.22 (±0.02) |
| CIFAR100, ResFlow | 27.83 (±0.03) | 31.02 (±0.05) | 31.88 (±0.02) | 32.21 (±0.03) | 33.58 (±0.02) | 34.48 (±0.03) |
| CIFAR100, ImpFlow |  | 29.06 (±0.03) | 30.47 (±0.03) | 31.40 (±0.03) | 32.64 (±0.01) | 34.17 (±0.02) |

6 EXPERIMENTS

We demonstrate the model capacity of ImpFlows on classification and density modeling tasks². In all experiments, we use spectral normalization (Miyato et al., 2018) to enforce the Lipschitz constraints, where the Lipschitz upper bound of each layer (called the Lipschitz coefficient) is denoted as c. For Broyden's method, we use ϵ_f = 10^{-6} and ϵ_b = 10^{-10} for training and testing to numerically ensure the invertibility and the stability during training.
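For intuition about the log-determinant estimation in Eqns. (13)–(14), here is a minimal PyTorch sketch (our own illustration, not the paper's code): it estimates ln det(I + J_g(x)) by truncating the power series Σ_{k≥1} (−1)^{k+1} tr(J_g^k)/k and replacing each trace with a Hutchinson estimate. The paper instead uses the unbiased russian-roulette reweighting of Chen et al. (2019); g here is any residual part with Lip(g) < 1, and the toy network is an assumption.

```python
import torch

def logdet_estimate(g, x, n_terms=10, n_hutchinson=1):
    # Estimate ln det(I + J_g(x)) = sum_{k>=1} (-1)^{k+1} tr(J_g(x)^k) / k for Lip(g) < 1.
    # tr(J^k) is approximated by E_v[v^T J^k v] (Hutchinson), with the powers accumulated
    # through repeated vector-Jacobian products so J is never formed explicitly.
    # Fixed truncation is biased; Eqn. (14) reweights the terms to make the estimate unbiased.
    x = x.clone().requires_grad_(True)
    gx = g(x)
    est = torch.zeros(x.shape[0])
    for _ in range(n_hutchinson):
        v = torch.randn_like(x)
        w = v
        for k in range(1, n_terms + 1):
            w = torch.autograd.grad(gx, x, grad_outputs=w, retain_graph=True)[0]
            est = est + (-1.0) ** (k + 1) / k * (w * v).flatten(1).sum(dim=1)
    return est / n_hutchinson

# Usage with a toy contractive residual part (batch of 2 samples, dimension 5):
net = torch.nn.Sequential(torch.nn.utils.spectral_norm(torch.nn.Linear(5, 5)), torch.nn.Tanh())
g = lambda u: 0.9 * net(u)  # spectral norm ~1, scaled by 0.9, so Lip(g) is roughly <= 0.9
print(logdet_estimate(g, torch.randn(2, 5)))
```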
Please see other detailed settings, including the method of estimating the log-determinant, the network architecture, the learning rate, the batch size, and so on, in Appendix D.

6.1 VERIFYING CAPACITY ON CLASSIFICATION

We first empirically compare ResFlows and ImpFlows on classification tasks. Compared with generative modeling, classification is a more direct measure of the richness of the function family, because it isolates the function fitting from generative modeling subtleties, such as log-determinant estimation. We train both models in the same settings on CIFAR10 and CIFAR100 (Krizhevsky & Hinton, 2009). Specifically, we use an architecture similar to ResNet-18 (He et al., 2016). Overall, the numbers of parameters of ResNet-18 with vanilla ResBlocks, ResFlows and ImpFlows are the same (6.5M). The detailed network structure can be found in Appendix D. The classification results are shown in Table 1. To see the impact of the Lipschitz constraints, we vary the Lipschitz coefficient c to show the difference between ResFlows and ImpFlows under the condition of a fixed Lipschitz upper bound. Given different values of c, the classification results of ImpFlows are consistently better than those of ResFlows. These results empirically validate Corollary 1, which claims that the function family of ImpFlows is richer than that of ResFlows. Besides, for a large Lipschitz upper bound c, ImpFlow blocks are comparable with the vanilla ResBlocks in terms of classification.

²See https://github.com/thu-ml/implicit-normalizing-flows for details.

Table 2: Average test log-likelihood (in nats) on tabular datasets. Higher is better.

|  | POWER | GAS | HEPMASS | MINIBOONE | BSDS300 |
|---|---|---|---|---|---|
| RealNVP (Dinh et al., 2017) | 0.17 | 8.33 | -18.71 | -13.55 | 153.28 |
| FFJORD (Grathwohl et al., 2019) | 0.46 | 8.59 | -14.92 | -10.43 | 157.40 |
| MAF (Papamakarios et al., 2017) | 0.24 | 10.08 | -17.70 | -11.75 | 155.69 |
| NAF (Huang et al., 2018) | 0.62 | 11.96 | -15.09 | -8.86 | 157.73 |
| ImpFlow (L = 20) | 0.61 | 12.11 | -13.95 | -13.32 | 155.68 |
| ResFlow (L = 10) | 0.26 | 6.20 | -18.91 | -21.81 | 104.63 |
| ImpFlow (L = 5) | 0.30 | 6.94 | -18.52 | -21.50 | 113.72 |

Table 3: Average bits per dimension of ResFlow and ImpFlow on CIFAR10, with varying Lipschitz coefficients c. Lower is better.

|  | c = 0.9 | c = 0.8 | c = 0.7 | c = 0.6 |
|---|---|---|---|---|
| ResFlow (L = 12) | 3.469 (±0.0004) | 3.533 (±0.0002) | 3.627 (±0.0004) | 3.820 (±0.0003) |
| ImpFlow (L = 6) | 3.452 (±0.0003) | 3.511 (±0.0002) | 3.607 (±0.0003) | 3.814 (±0.0005) |

6.2 DENSITY MODELING ON 2D TOY DATA

Figure 3: Checkerboard data density and the results of an 8-block ResFlow and a 4-block ImpFlow. (a) Checkerboard data (5.00 bits); (b) ResFlow, L = 8 (5.08 bits); (c) ImpFlow, L = 4 (5.05 bits).

For the density modeling tasks, we first evaluate ImpFlows on the Checkerboard data, whose density is multi-modal, as shown in Fig. 3 (a). For fairness, we follow the same experimental settings as Chen et al. (2019) (which are specified in Appendix D), except that we adopt a Sine (Sitzmann et al., 2020) activation function for all models. We note that the data distribution has a bounded support, while we want to fit a transformation f mapping it to the standard Gaussian distribution, whose support is unbounded. A perfect f requires a sufficiently large ‖J_f(x)‖_2 for some x mapped far from the mean of the Gaussian. Therefore, the Lipschitz constant of such an f is too large to be fitted by a ResFlow with 8 blocks (see Fig. 3 (b)). A 4-block ImpFlow can achieve a result of 5.05 bits, which outperforms the 5.08 bits of an 8-block ResFlow with the same number of parameters.
Such results accord with our theoretical results in Theorem 2 and strongly motivate ImpFlows.

6.3 DENSITY MODELING ON REAL DATA

We also train ImpFlows on several real density modeling datasets, including the tabular datasets (used by Papamakarios et al. (2017)), CIFAR10 and 5-bit 64×64 CelebA (Kingma & Dhariwal, 2018). For all the real datasets, we use the scalable algorithm proposed in Sec. 5. We test our performance on five tabular datasets: POWER (d = 6), GAS (d = 8), HEPMASS (d = 21), MINIBOONE (d = 43) and BSDS300 (d = 63) from the UCI repository (Dua & Graff, 2017), where d is the data dimension. For a fair comparison, on each dataset we use a 10-block ResFlow and a 5-block ImpFlow with the same amount of parameters, and a 20-block ImpFlow for a better result. The detailed network architecture and hyperparameters can be found in Appendix D. Table 2 shows the average test log-likelihood for ResFlows and ImpFlows. ImpFlows achieve better density estimation performance than ResFlows consistently on all datasets. Again, the results demonstrate the effectiveness of ImpFlows.

Then we test ImpFlows on the CIFAR10 dataset. We train a multi-scale convolutional version for both ImpFlows and ResFlows, following the same settings as Chen et al. (2019), except that we use a smaller network of 5.5M parameters for both ImpFlows and ResFlows (see details in Appendix D). As shown in Table 3, ImpFlow achieves better results than ResFlow consistently given different values of the Lipschitz coefficient c. Moreover, the computation time of ImpFlow is comparable to that of ResFlow. See Appendix C.2 for detailed results. Besides, there is a trade-off between the expressiveness and the numerical optimization of ImpFlows in larger models. Based on the above experiments, we believe that advances such as a lower-variance estimator of the log-determinant can benefit ImpFlows in larger models, which is left for future work. We also train ImpFlows on the 5-bit 64×64 CelebA dataset. For a fair comparison, we use the same settings as Chen et al. (2019). The samples from our model are shown in Appendix E.

7 CONCLUSIONS

We propose implicit normalizing flows (ImpFlows), which generalize normalizing flows by utilizing an implicit invertible mapping defined by the roots of the equation F(z, x) = 0. ImpFlows build on residual flows (ResFlows) with a good balance between tractability and expressiveness. We show that the function family of ImpFlows is richer than that of ResFlows, particularly for modeling functions with large Lipschitz constants. Based on the implicit differentiation formula, we present a scalable algorithm to train and evaluate ImpFlows. Empirically, ImpFlows outperform ResFlows on several classification and density modeling benchmarks. Finally, while this paper mostly focuses on the implicit generalization of ResFlows, the general idea of utilizing implicit functions for NFs could be extended to a wider scope. We leave this as future work.

ACKNOWLEDGEMENT

We thank Yuhao Zhou, Shuyu Cheng, Jiaming Li, Kun Xu, Fan Bao, Shihong Song and Qi An Fu for proofreading. This work was supported by the National Key Research and Development Program of China (Nos. 2020AAA0104304), NSFC Projects (Nos. 61620106010, 62061136001, U19B2034, U181146, 62076145), Beijing NSF Project (No.
JQ19016), Beijing Academy of Artificial Intelligence (BAAI), Tsinghua-Huawei Joint Research Program, Huawei Hisilicon Kirin Intelligence Engineering Development, the Mind Spore team, a grant from Tsinghua Institute for Guo Qiang, Tiangong Institute for Intelligent Computing, and the NVIDIA NVAIL Program with GPU/DGX Acceleration. C. Li was supported by the fellowship of China postdoctoral Science Foundation (2020M680572), and the fellowship of China national postdoctoral program for innovative talents (BX20190172) and Shuimu Tsinghua Scholar. J. Chen was supported by Shuimu Tsinghua Scholar. Brandon Amos and J Zico Kolter. Optnet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pp. 136 145, 2017. Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems, 2019. Jens Behrmann, Will Grathwohl, Ricky TQ Chen, David Duvenaud, and J orn-Henrik Jacobsen. Invertible residual networks. In International Conference on Machine Learning, pp. 573 582, 2019. Charles G Broyden. A class of methods for solving nonlinear simultaneous equations. Mathematics of Computation, 19(92):577 593, 1965. Changyou Chen, Chunyuan Li, Liqun Chen, Wenlin Wang, Yunchen Pu, and Lawrence Carin Duke. Continuous-time flows for efficient inference and density estimation. In International Conference on Machine Learning, pp. 824 833, 2018a. Jianfei Chen, Cheng Lu, Biqi Chenli, Jun Zhu, and Tian Tian. Vflow: More expressive generative flows with variational data augmentation. In International Conference on Machine Learning, 2020. Published as a conference paper at ICLR 2021 Ricky TQ Chen, Jens Behrmann, David K Duvenaud, and J orn-Henrik Jacobsen. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, pp. 9916 9926, 2019. Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571 6583, 2018b. Rob Cornish, Anthony L Caterini, George Deligiannidis, and Arnaud Doucet. Relaxing bijectivity constraints with continuously indexed normalising flows. In International Conference on Machine Learning, 2020. Nicola De Cao, Wilker Aziz, and Ivan Titov. Block neural autoregressive flow. In Uncertainty in Artificial Intelligence, pp. 1263 1273. PMLR, 2020. Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. In International Conference on Learning Representations Workshop, 2014. Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. In International Conference on Learning Representations, 2017. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive. ics.uci.edu/ml. Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In Advances in Neural Information Processing Systems, pp. 7511 7522, 2019. Herbert Federer. Grundlehren der mathematischen wissenschaften. In Geometric measure theory, volume 153. Springer New York, 1969. Chris Finlay, J orn-Henrik Jacobsen, Levon Nurbekyan, and Adam M Oberman. How to train your neural ode: the world of jacobian and kinetic regularization. In International Conference on Machine Learning, 2020. Will Grathwohl, Ricky TQ Chen, Jesse Betterncourt, Ilya Sutskever, and David Duvenaud. Ffjord: Free-form continuous dynamics for scalable reversible generative models. 
In International Conference on Learning Representations, 2019. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770 778, 2016. Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flowbased generative models with variational dequantization and architecture design. In International Conference on Machine Learning, pp. 2722 2730, 2019. Emiel Hoogeboom, Rianne Van Den Berg, and Max Welling. Emerging convolutions for generative normalizing flows. In International Conference on Machine Learning, pp. 2771 2780, 2019. Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. In International Conference on Machine Learning, pp. 2078 2087, 2018. Chin-Wei Huang, Laurent Dinh, and Aaron Courville. Augmented normalizing flows: Bridging the gap between generative flows and latent variable models. ar Xiv:2002.07101, 2020. Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059 1076, 1989. Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215 10224, 2018. Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743 4751, 2016. Published as a conference paper at ICLR 2021 Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. Xuanqing Liu, Tesi Xiao, Si Si, Qin Cao, Sanjiv Kumar, and Cho-Jui Hsieh. How does noise help robustness? explanation and exploration under the neural sde framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 282 290, 2020. Stefano Massaroli, Michael Poli, Michelangelo Bin, Jinkyoo Park, Atsushi Yamashita, and Hajime Asama. Stable neural flows. ar Xiv preprint ar Xiv:2003.08063, 2020. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. Didrik Nielsen, Priyank Jaini, Emiel Hoogeboom, Ole Winther, and Max Welling. Survae flows: Surjections to bridge the gap between vaes and flows. ar Xiv preprint ar Xiv:2007.02731, 2020. George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338 2347, 2017. Viktor Reshniak and Clayton Webster. Robust learning with implicit residual networks. ar Xiv preprint ar Xiv:1905.10479, 2019. Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530 1538, 2015. Vincent Sitzmann, Julien NP Martel, Alexander W Bergman, David B Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. ar Xiv preprint ar Xiv:2006.09661, 2020. John Skilling. The eigenvalues of mega-dimensional matrices. In Maximum Entropy and Bayesian Methods, pp. 455 466. Springer, 1989. Yang Song, Chenlin Meng, and Stefano Ermon. Mintnet: Building invertible neural networks with masked convolutions. 
In Advances in Neural Information Processing Systems, pp. 11002 11012, 2019. Po-Wei Wang, Priya Donti, Bryan Wilder, and Zico Kolter. Satnet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In International Conference on Machine Learning, pp. 6545 6554, 2019. Han Zhang, Xi Gao, Jacob Unterman, and Tom Arodz. Approximation capabilities of neural odes and invertible residual networks. In International Conference on Machine Learning, 2020. A ADDITIONAL LEMMAS AND PROOFS A.1 PROOF FOR THEOREM 1 Proof. (Theorem 1) Firstly, x0 Rd, the mapping hx0(z) = F(z, x0) + z is a contrative mapping, which can be shown by Lipschitz condition of gz : (F(z1, x0) + z1) (F(z2, x0) + z2) = gz(z1) gz(z2) < z1 z2 . Therefore, hx0(z) has an unique fixed point, denoted by f(x0) : hx0(f(x0)) = f(x0) F(f(x0), x0) = 0 Similarly, we also have: z0 Rd, there exists an unique g(z0) satisfying F(z0, g(z0)) = 0. Moreover, Let z0 = f(x0), we have F(f(x0), g(f(x0))) = 0. By the uniqueness, we have g(f(x0)) = x0, x0 Rd . Similarly, f(g(x0)) = x0, x0 Rd. Therefore, f is unique and invertible. Published as a conference paper at ICLR 2021 A.2 PROOF FOR THEOREM 2 We denote D as the set of all bi-Lipschitz C1-diffeomorphisms from Rd to Rd. Firstly, we prove Lemma 1 in the main text. Proof. (Lemma 1). f R, we have sup x Rd Jf(x) I 2 2 < 1, which is equivalent to sup x Rd,v Rd, v 2=1 (Jf(x) I)v 2 2 < 1 (Definition of operator norm.) sup x Rd,v Rd, v 2=1 v T (JT f (x) I)(Jf(x) I)v < 1 sup x Rd,v Rd, v 2=1 v T JT f (x)Jf(x)v 2v T Jf(x)v < 0 Note that Jf(x) is nonsingular, so x, v Rd, v 2 = 1, we have v T JT f (x)Jf(x)v > 0. Thus, 0 > sup x Rd,v Rd, v 2=1 v T JT f (x)Jf(x)v 2v T Jf(x)v sup x Rd,v Rd, v 2=1 2v T Jf(x)v inf x Rd,v Rd, v 2=1 v T Jf(x)v > 0. Note that the converse is not true, because v T Jf(x)v > 0 does not restrict the upper bound of Lipschitz constant of f. For example, when f(x) = mx where m is a positive real number, we have inf x Rd,v Rd, v 2=1 v T Jf(x)v = inf x Rd,v Rd, v 2=1 v T (m I)v = m > 0 However, m can be any large positive number. So we have R F. Lemma 2. f D, if inf x Rd,v Rd, v 2=1 v T Jf(x)v > 0, (17) then inf x Rd,v Rd, v 2=1 v T Jf 1(x)v > 0, (18) Proof. (Proof of Lemma 2). By Inverse Function Theorem, Jf 1(x) = J 1 f (f 1(x)). Because f is from Rd to Rd, we have inf x Rd,v Rd, v 2=1 v T Jf 1(x)v = inf x Rd,v Rd, v 2=1 v T J 1 f (f 1(x))v = inf x Rd,v Rd, v 2=1 v T J 1 f (x)v Let u = J 1 f (x)v and v0 = u u 2 , we have v0 2 = 1, and v T J 1 f (x)v = u T JT f (x)u = u T Jf(x)u = u 2 2v T 0 Jf(x)v0. The above equation uses this fact: for a real d d matrix A, x Rd, x T Ax = (x T Ax)T = x T AT x because x T Ax R. Note that f is Lipschitz continuous, Jf(x) 2 Lip(f). So 1 = v 2 Jf(x) 2 u 2 Lip(f) u 2, Published as a conference paper at ICLR 2021 which means u 2 1 Lip(f). inf x Rd,v Rd, v 2=1 v T J 1 f (x)v = inf x Rd,v Rd, v 2=1 u 2 2v T 0 Jf(x)v0 inf x Rd,u Rd, Jf (x)u 2=1 u 2 2 inf x Rd,v0 Rd, v0 2=1 v T 0 Jf(x)v0 1 Lip(f)2 inf x Rd,v Rd, v 2=1 v T Jf(x)v Lemma 3. f D, if f 1 R, we have inf x Rd,v Rd, v 2=1 v T Jf(x)v > 0. (19) Proof. (Proof of Lemma 3). f D, if f 1 R, then from Lemma 1, we have inf x Rd,v Rd, v 2=1 v T Jf 1(x)v > 0. Note that f 1 D, from Lemma 2 we have inf x Rd,v Rd, v 2=1 v T Jf(x)v > 0. Lemma 4. f D, if inf x Rd,v Rd, v 2=1 v T Jf(x)v > 0, then α0 > 0, s.t. 0 < α < α0, sup x Rd αJf(x) I 2 < 1. Proof. (Proof of Lemma 4). Note that f is Lipschitz continuous, so Lip(f) = supx Rd Jf(x) 2. Denote β = inf x Rd,v Rd, v 2=1 v T Jf(x)v. 
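The contraction argument above can also be checked numerically. The sketch below (an illustration under stated assumptions, not the paper's code) builds a toy block of the form in Definition 1 whose residual parts have Lipschitz constant around 0.9, and verifies that the forward and reverse fixed-point solves invert each other, as Theorem 1 predicts.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
Wx = rng.standard_normal((d, d)); Wx /= 1.1 * np.linalg.norm(Wx, 2)  # spectral norm ~0.91
Wz = rng.standard_normal((d, d)); Wz /= 1.1 * np.linalg.norm(Wz, 2)
g_x = lambda u: np.tanh(Wx @ u)   # Lip(g_x) < 1
g_z = lambda u: np.tanh(Wz @ u)   # Lip(g_z) < 1

def solve_z(x, n_iter=500):
    # z is the unique fixed point of h_x(z) = F(z, x) + z = g_x(x) + x - g_z(z).
    z = np.zeros(d)
    for _ in range(n_iter):
        z = g_x(x) + x - g_z(z)
    return z

def solve_x(z, n_iter=500):
    # the reverse direction uses the symmetric contraction x = g_z(z) + z - g_x(x).
    x = np.zeros(d)
    for _ in range(n_iter):
        x = g_z(z) + z - g_x(x)
    return x

x0 = rng.standard_normal(d)
print(np.linalg.norm(solve_x(solve_z(x0)) - x0))  # round-trip error, essentially zero
```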
α0 = β Lip(f)2 > 0. Published as a conference paper at ICLR 2021 0 < α < α0, we have sup x Rd αJf(x) I 2 2 = sup x Rd,v Rd, v 2=1 v T (αJT f (x) I)(αJf(x) I)v = 1 + sup x Rd,v Rd, v 2=1 α2v T JT f (x)Jf(x)v 2αv T Jf(x)v 1 + α2 sup x Rd,v Rd, v 2=1 v T JT f (x)Jf(x)v + 2α sup x Rd,v Rd, v 2=1 = 1 + α2 sup x Rd,v Rd, v 2=1 v T JT f (x)Jf(x)v 2α inf x Rd,v Rd, v 2=1 v T Jf(x)v = 1 + α2 sup x Rd Jf(x) 2 2 2αβ = 1 + α(αLip(f)2 2β) < 1 + α(α0Lip(f)2 2β) = 1 αβ < 1. The above equation uses this fact: for a real d d matrix A, x Rd, x T Ax = (x T Ax)T = x T AT x because x T Ax R. Proof. (Theorem 2) Denote P = {f D | f1, f2 D, f = f2 f1, where inf x Rd,v Rd, v 2=1 v T Jf1(x)v > 0, inf x Rd,v Rd, v 2=1 v T Jf2(x)v > 0}. Firstly, we show that I P. f I, assume f = f2 f1, where f1 R and f 1 2 R. By Lemma 1 and Lemma 3, we have inf x Rd,v Rd, v 2=1 v T Jf1(x)v > 0, inf x Rd,v Rd, v 2=1 v T Jf2(x)v > 0. Thus, f P. So I P. Next, we show that P I. f P, assume f = f2 f1, where inf x Rd,v Rd, v 2=1 v T Jf1(x)v > 0, inf x Rd,v Rd, v 2=1 v T Jf2(x)v > 0. From Lemma 2, we have inf x Rd,v Rd, v 2=1 v T Jf 1 2 (x)v > 0. From Lemma 4, α1 > 0, α2 > 0, s.t. 0 < α < min{α1, α2}, sup x Rd αJf1(x) I 2 < 1, sup x Rd αJf 1 2 (x) I 2 < 1. 2 min{α1, α2}. Let g = g2 g1, where g1(x) = αf1(x), g2(x) = f2(x Published as a conference paper at ICLR 2021 We have g(x) = f2( αf1(x) α ) = f(x), and Jg1(x) = αJf1(x), g 1 2 (x) = αf 1 2 (x), Jg 1 2 (x) = αJf 1 2 (x). sup x Rd Jg1(x) I 2 = sup x Rd αJf1(x) I 2 < 1, sup x Rd Jg 1 2 (x) I 2 = sup x Rd αJf 1 2 (x) I 2 < 1. Thus, g1 R and g 1 2 R and f = g2 g1. So f I. Therefore, P I. In conclusion, I = P. A.3 PROOF FOR THEOREM 3 Firstly, we prove a lemma of bi-Lipschitz continuous functions. Lemma 5. If f : (Rd, ) (Rd, ) is bi-Lipschitz continuous, then 1 Lip(f 1) f(x1) f(x2) x1 x2 Lip(f), x1, x2 Rd, x1 = x2. Proof. (Proof of Lemma 5). x1, x2 Rd, x1 = x2, we have f(x1) f(x2) Lip(f) x1 x2 x1 x2 = f 1(f(x1)) f 1(f(x2)) Lip(f 1) f(x1) f(x2) Thus, we get the results. Assume a residual flow f = f L f1 where each layer fl is an invertible residual network: fl(x) = x + gl(x), Lip(gl) κ < 1. Thus, each layer fl is bi-Lipschitz and it follows by Behrmann et al. (2019) and Lemma 5 that 1 κ fl(x1) fl(x2) x1 x2 1 + κ < 2L, x1, x2 Rd, x1 = x2. (20) By multiplying all the inequalities, we can get a bound of the bi-Lipschitz property for Res Flows, as shown in Lemma 6. Lemma 6. For Res Flows built by f = f L f1, where fl(x) = x + gl(x), Lip(gl) κ < 1, then (1 κ)L f(x1) f(x2) x1 x2 (1 + κ)L, x1, x2 Rd, x1 = x2. Next, we prove Theorem 3. Proof. (Theorem 3) According to the definition of P(L, r), we have P(L, r) F I. 0 < ℓ< log2(L), we have L 2ℓ> 0. g Rℓ, by Lemma 6, we have g(x) g(y) 2 2ℓ x y 2, x, y Br. Thus, x0 Br, we have f(x) g(x) 2 = f(x) f(x0) + g(x0) g(x) + f(x0) g(x0) 2 f(x) f(x0) 2 g(x0) g(x) + f(x0) g(x0) 2 f(x) f(x0) 2 g(x0) g(x) 2 f(x0) g(x0) 2 (L 2ℓ) x x0 2 f(x0) g(x0) 2 Published as a conference paper at ICLR 2021 sup x Br f(x) g(x) 2 sup x Br (L 2ℓ) x x0 2 f(x0) g(x0) 2 (L 2ℓ)r f(x0) g(x0) 2 Notice that the inequality above is true for any x0 Br, so we have sup x Br f(x) g(x) 2 sup x0 Br (L 2ℓ)r f(x0) g(x0) 2 = (L 2ℓ)r inf x0 Br f(x0) g(x0) 2 (L 2ℓ)r sup x0 Br f(x0) g(x0) 2 sup x Br f(x) g(x) 2 r 2(L 2ℓ), g Rℓ inf g Rℓsup x Br f(x) g(x) 2 r Because f P(L, r), infg Rℓsupx Br f(x) g(x) 2 > 0, we have Rℓ P(L, r) = . A.4 PROOF FOR EQUATION 13 Proof. 
(Equation 13) By Change of Variable formula: log p(x) = log p(z) + log | z/ x|, Since z = f(x) is defined by the equation F(z, x) = gx(x) gz(z) + x z = 0, by Implicit function theorem, we have z/ x = Jf(x) = [JF,z(z)] 1[JF,x(x)] = (I + Jgz(z)) 1(I + Jgx(x)). Thus, log | z/ x| = ln | det(I + Jgx(x))| ln | det(I + Jgz(z))| Note that any eigenvalue λ of Jgx(x) satisfies |λ| < σ(Jgx(x)) = Jgx(x) 2 < 1, so λ ( 1, 1). Thus, det(I + Jgx(x)) > 0. Similarly, det(I + Jgz(z)) > 0. Therefore, log | z/ x| = ln det(I + Jgx(x)) ln det(I + Jgz(z)) A.5 PROOF FOR EQUATION 16 Proof. (Equation 16) By implicitly differentiating two sides of F(z, x; θ) = 0 by x, we have z z x + I z z x = I + gz(z; θ) 1 I + gx(x; θ) = J 1 G (z) F(z, x; θ) Published as a conference paper at ICLR 2021 By implicitly differentiating two sides of F(z, x; θ) = 0 by θ, we have z θ = I + gz(z; θ) = J 1 G (z) F(z, x; θ) Therefore, the gradient from z to ( ) is z z ( ) = L z J 1 G (z) F(z, x; θ) B OTHER PROPERTIES OF IMPLICIT FLOWS In this section, we propose some other properties of Imp Flows. Lemma 7. For a single implicit flow f I, assume that f = f 1 2 f1, where f1(x) = x + g1(x), Lip(g1) κ < 1, (21) f2(x) = x + g2(x), Lip(g2) κ < 1, (22) then 1 κ 1 + κ f(x1) f(x2) x1 x2 1 + κ 1 κ, x1, x2 Rd, x1 = x2. (23) Proof. (Proof of Lemma 7) According to Eqn. (20), we have 1 κ f1(x1) f1(x2) x1 x2 1 + κ, x1, x2 Rd, x1 = x2, (24) 1 1 + κ f 1 2 (x1) f 1 2 (x2) x1 x2 1 1 κ, x1, x2 Rd, x1 = x2. (25) By multiplying these two inequalities, we can get the results. Theorem 4. (Limitation of the single Imp Flow). I {f : f D, x Rd, λ(Jf(x)) R = }, (26) where λ(A) denotes the set of all eigenvalues of matrix A. Proof. (Proof of Theorem 4) Proof by contradiction. Assume f I and x Rd, s.t. λ λ(Jf(x)), λ < 0. There exists a vector u = 0, Jf(x)u = λu. By Theorem 2, f1, f2 F, f = f2 f1, hence Jf(x) = Jf2(f1(x))Jf1(x). We denote A := Jf2(f1(x)), B := Jf1(x). Since f1, f2 F, we have v T Av > 0, w T Bw > 0, v, w = 0, v, w Rd. Note that B is the Jacobian of a bi-Lipschitz function at a single point, so B is non-singular. As u = 0, we have Bu = 0. Thus, (Bu)T A(Bu) = (Bu)T ((AB)u) = λu T BT u = λu T Bu The last equation uses this fact: for a real d d matrix A, x Rd, x T Ax = (x T Ax)T = x T AT x because x T Ax R. Note that the left side is positive, and the right side is negative. It s a contradiction. Published as a conference paper at ICLR 2021 Therefore, I cannot include all the bi-Lipschitz C1-diffeomorphisms. As a corollary, we have R3 I. Corollary 2. R3 I. Proof. (Proof for Corollary 2) Consider three linear functions in R: f1(x) = x + 0.46 0.20 0.85 0.00 f2(x) = x + 0.20 0.70 0.30 0.60 f3(x) = x + 0.50 0.60 0.20 0.55 We can get that f = f1 f2 f3 is in R3, and f is also a linear function with Jacobian 0.2776 0.4293 0.5290 0.6757 However, this is a matrix with two negative eigenvalues: -0.1881, -0.2100. Hence f is not in I. Therefore, R3 I. C COMPUTATION C.1 APPROXIMATE INVERSE JACOBIAN The exact computation for the Jacobian inverse term costs much for high dimension tasks. We use the similar technique in Bai et al. (2019) to compute L z J 1 G (z): solving the linear system of variable y: JT G(z)y T = ( L z )T , (27) where the left hand side is a vector-Jacobian product and it can be efficiently computed by autograd packages foy any y without computing the Jacobian matrix. In this work, we also use Broyden s method to solve the root, the same as methods in the forward pass, where the tolerance bound for the stop criterion is ϵb. Remark. 
Although the forward, inverse and backward pass of Imp Flows all need to solve the root of some equation, we can choose small enough ϵf and ϵb to ensure the approximation error is small enough. Thus, there is a trade-off between computation costs and approximation error. In practice, we use ϵf = 10 6 and ϵb = 10 10 and empirically does not observe any error accumulation. Note that such approximation is rather different from the variational inference technique in Chen et al. (2020); Nielsen et al. (2020), because we only focus on the exact log density itself. C.2 COMPUTATION TIME We evaluate the average computation time for the model trained on CIFAR10 in Table 3 on a single Tesla P100 (SXM2-16GB). See Table 4 for the details. For a fair comparision, the forward (inference) time in the training phase of Imp Flow is comparable to that of Res Flow because the log-determinant term is the main cost. The backward time of Imp Flow costs more than that of Res Flow because it requires to rewrite the backward method in Py Torch to solve the linear equation. The training time includes forward, backward and other operations (such as the Lipschitz iterations for spectral normalization). We use the same method as the release code of Res Flows (fixed-point iterations with tolerance 10 5) for the sample phase. The sample time of Imp Flow is less than that of Res Flow because the inverse of L-block Imp Flow needs to solve L fixed points while the inverse of 2L-block Res Flow needs to solve 2L fixed points. Fast sampling is particularly desirable since it is the main advantage of flow-based models over autoregressive models. Also, we evaluate the average Broyden s method iterations and the average function evaluation times during the Broyden s method. See Table 5 for the details. Published as a conference paper at ICLR 2021 Table 4: Single-batch computation time (seconds) for Res Flow and Imp Flow in Table 3 on a single Tesla P100 (SXM2-16GB). c Model Forward (Inference) Backward Training Sample 0.5 Imp Flow Fixed-point Log-det Others Inv-Jacob Others 4.152 0.138 0.445 2.370 0.090 0.562 0.441 2.905 1.003 Res Flow 2.656 0.031 2.910 0.229 0.6 Imp Flow Fixed-point Log-det Others Inv-Jacob Others 4.415 0.159 0.497 2.356 0.120 0.451 0.800 2.973 1.251 Res Flow 2.649 0.033 2.908 0.253 0.7 Imp Flow Fixed-point Log-det Others Inv-Jacob Others 4.644 0.181 0.533 2.351 0.157 0.525 0.887 3.041 1.412 Res Flow 2.650 0.030 2.908 0.312 0.8 Imp Flow Fixed-point Log-det Others Inv-Jacob Others 4.881 0.206 0.602 2.364 0.139 0.641 0.943 3.105 1.584 Res Flow 2.655 0.030 2.910 0.374 0.9 Imp Flow Fixed-point Log-det Others Inv-Jacob Others 5.197 0.258 0.707 2.357 0.137 0.774 1.033 3.201 1.807 Res Flow 2.653 0.030 2.916 0.458 Table 5: Single-batch iterations of Broyden s method during forward and backward pass for a single block of Imp Flow in Table 3. c Broyden s Method Iterations Function Evaluations 0.5 Forward 7.2 8.2 Backward 12.5 13.5 0.6 Forward 8.3 9.3 Backward 14.9 15.9 0.7 Forward 9.4 10.4 Backward 17.9 18.9 0.8 Forward 10.8 11.8 Backward 22.4 23.4 0.9 Forward 12.9 13.9 Backward 27.4 28.4 C.3 NUMERICAL SENSITIVITY We train a 20-block Imp Flow on POWER dataset with ϵf = 10 6 (see Appendix. D for detailed settings), and then test this model with different ϵf to see the numerical sensitivity of the fixed-point iterations. Table 6 shows that our model is not sensitive to ϵf in a fair range. 
C.4 TRAINING ALGORITHM We state the training algorithms for both forward and backward processes in Algorithm 1 and Algorithm 2 Published as a conference paper at ICLR 2021 Algorithm 1: Forward Algorithm For a Single-Block Imp Flow Require: gx;θ, gz;θ in Eqn. (3), stop criterion ϵf. Input: x. Output: z = f(x) and ln p(x), where f is the implicit function defined by gx;θ and gz;θ. Define h(z) = F(z, x; θ) z 0 while h(z) 2 ϵf do B The estimated inverse Jacobian of h(z) (e.g. by Broyden s method) α Line Search(z, h, B) z z αBh(z) if training then Esitamate ln det(I + Jgx(x; θ)) by Eqn. (15) Esitamate ln det(I + Jgz(z; θ)) by Eqn. (15) else Esitamate ln det(I + Jgx(x; θ)) by Eqn. (14) Esitamate ln det(I + Jgz(z; θ)) by Eqn. (14) Compute ln p(x) by Eqn. (13) Algorithm 2: Backward Algorithm For a Single-Block Imp Flow Require: gx;θ, gz;θ in Eqn. (3), stop criterion ϵb. Input: x, z, L z . Output: The gradient for x and θ from z, i.e. L z z x and L z z θ. Define G(z; θ) = gz(z; θ) + z and h(y) = y JG(z) L z y 0 while h(y) 2 ϵb do B The estimated inverse Jacobian of h(y) (e.g. by Broyden s method) α Line Search(y, h, B) y y αBh(y) Compute F (z,x;θ) x and F (z,x;θ) θ by autograd packages. z z x y F (z,x;θ) z z θ y F (z,x;θ) Published as a conference paper at ICLR 2021 Table 6: Average test log-likelihood (in nats) for different ϵf of Imp Flow on POWER dataset. ϵf 10 8 10 7 10 6 10 5 10 4 10 3 10 2 10 1 log-likelihood 0.606 0.603 0.607 0.611 0.607 0.607 0.602 0.596 D NETWORK STRUCTURES D.1 1-D EXAMPLE We specify the function (data) to be fitted is f(x) = 0.1x, x < 0 10x, x 0 For I, we can construct a fully-connected neural network with Re LU activation and 3 parameters as following: gx(x) = Re LU( 0.9x) The two networks can be implemented by spectral normalization. Assume the implicit function defined by Eqn. (3) using the above gx(x) and gz(z) is f I. Next we show that f = f I. Let f1(x) = x + Re LU( 0.9x) and f2(x) = x 0.9x), we have f 1 2 (x) = x + Re LU(9x). Therefore, f I = f 1 2 f1 = f. For every residual block of R,R2 and R3, we train a 4-layer MLP with Re LU activation and 128 hidden units, and the Lipschitz coefficient for the spectral normalization is 0.99, and the iteration number for the spectral computation is 200. The objective function is min θ Ex Unif( 1,1) (fθ(x) f(x))2 , where fθ is the function of 1 or 2 or 3 residual blocks. We use a batch size of 5000. We use the Adam optimizer, with learning rate 10 3 and weight decay 10 5. We train the model until convergence, on a single NVIDIA Ge Force GTX 1080Ti. The losses for R,R2 and R3 are 5.25, 2.47, 0.32, respectively. D.2 CLASSIFICATION For the classification tasks, we remove all the Batch Norm layers which are inside of a certain Res Block, and only maintain the Batch Norm layer in the downsampling layer. Moreover, as a single Imp Flow consists of two residual blocks with the same dimension of input and output, we replace the downsampling shortcut by a identity shortcut in each scale of Res Net-18, and add a downsampling layer (a convolutional layer) with Batch Norm after the two residual blocks of each scale. Thus, each scale consists of two Res Blocks with the same dimension of input and output, which (6.5M parameters) is different from the vanilla Res Net-18 architecture (11.2M parameters). Note that the vanilla Res Net-18 in our main text is refered to the 6.5M-parameter architecture, which is the same as the versions for Res Flow and Imp Flow. 
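Complementing the pseudocode in Algorithms 1 and 2 below, here is a small PyTorch sketch of the backward step of Eqn. (16) and Appendix C.1: it solves J_G(z)^T y = (∂L/∂z)^T and then obtains the gradient w.r.t. x as a single vector-Jacobian product with F. For clarity the linear system is solved densely on a toy block (spectral-normalized linear layers with tanh, scaled by 0.9, as stand-ins for the paper's networks), whereas the paper solves it iteratively with Broyden's method.

```python
import torch

torch.manual_seed(0)
d = 4
# Stand-in residual parts with Lipschitz constant about 0.9 (spectral norm ~1, scaled by 0.9).
net_x = torch.nn.Sequential(torch.nn.utils.spectral_norm(torch.nn.Linear(d, d)), torch.nn.Tanh())
net_z = torch.nn.Sequential(torch.nn.utils.spectral_norm(torch.nn.Linear(d, d)), torch.nn.Tanh())
g_x = lambda u: 0.9 * net_x(u)
g_z = lambda u: 0.9 * net_z(u)

F = lambda z, x: g_x(x) + x - g_z(z) - z   # Eqn. (11)
G = lambda z: g_z(z) + z                   # G(z; theta) = g_z(z; theta) + z

x = torch.randn(d)
z = torch.randn(d)          # stands in for the root z* returned by the forward solver
dL_dz = torch.randn(d)      # incoming gradient from the layers above

# Solve J_G(z)^T y = (dL/dz)^T; here with a dense Jacobian for clarity (Appendix C.1
# instead solves this linear system iteratively using only vector-Jacobian products).
J_G = torch.autograd.functional.jacobian(G, z)
y = torch.linalg.solve(J_G.T, dL_dz)

# Gradient w.r.t. x in Eqn. (16): (dL/dz) J_G^{-1}(z) dF/dx = y^T dF/dx, one VJP via autograd.
x = x.requires_grad_(True)
grad_x = torch.autograd.grad(F(z, x), x, grad_outputs=y)[0]
print(grad_x)
```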
We use the comman settings: batch size of 128, Adam optimizer with learning rate 10 3 and no weight decay, and total epoch of 150. For the spectral normalization iterations, we use a error bound of 10 3, the same as Chen et al. (2019). We train every experiment on a single NVIDIA Ge Force GTX 2080Ti. D.3 DENSITY MODELING ON TOY 2D DATA Following the same settings as Chen et al. (2019), we use 4-layer multilayer perceptrons (MLP) with fully-connected layers of 128 hidden units. We use the Adam optimizer with learning rate of 10 3 and weight decay of 10 5. Moreover, we find that 1 2π sin(2πx) is a better activation for this Published as a conference paper at ICLR 2021 task while maintain the property of 1-Lipschitz constant, so we use this activation function for all experiments, which can lead to faster convergence and better log-likelihood for both Res Flows and Imp Flows, as shown in Fig. 4. We do not use any Act Norm or Batch Norm layers. For the log-determinant term, we use brute-force computation as in Chen et al. (2019). For the forward and backward, we use the Broyden s method to compute the roots, with ϵf = 10 6. The Lipschitz coefficient for spectral normalization is 0.999, and the iteration number for spectral normalization is 20. The batch size is 5000, and we train 50000 iterations. The test batch size is 10000. Also, we vary the network depth to see the difference between Imp Flow and Res Flow. For every depth L, we use an L-block Imp Flow and a 2L-block Res Flow with the same settings as stated above, and train 3 times with different random seeds. As shown in Figure 5, the gap between Imp Flow and Res Flow shrinks as the depth grows deep, because the Lipschitz constant of Res Flow grows exponentially. Note that the dashed line is a 200-block Res Flow in Chen et al. (2019), and we tune our model better so that our models perform better with lower depth. (a) Lip Swish (5.22 bits) (b) Sine (5.08 bits) Figure 4: 8-block Res Flow with different activation function trained on Checkerboard dataset. 3 4 5 6 7 8 Depth Imp Flow Res Flow (2x depth) 200-depth Res Flow (Chen, 2019) Figure 5: Test NLL (in bits) by varying the network depth. Lower is better. D.4 DENSITY MODELING ON TABULAR DATASETS We use the same data preprocessing as Papamakarios et al. (2017), including the train/valid/test datasets splits. For all models, we use a batch size of 1000 (both training and testing) and learning rate of 10 3 for the Adam optimizer. The main settings are the same as Chen et al. (2019) on the toy2D dataset. The residual blocks are 4-layer MLPs with 128 hidden units. The Res Flows use 10 Published as a conference paper at ICLR 2021 blocks and Imp Flows use 5 blocks to ensure the same amount of parameters. And we use a 20-block Imp Flow for a better result. Also, we use the Sine activation as 1 2π sin(2πx). We do not use any Act Norm or Batch Norm layers. For the Lipschitz coefficient, we use c = 0.9 and the iteration error bound for spectral normalization is 10 3. For the settings of our scalable algorithms, we use brute-force computation of the log-determinant term for POWER and GAS datasets and use the same estimation settings as Chen et al. (2019) for HEPMASS, MINIBOONE and BSDS300 datasets. In particular, for the estimation settings, we always exactly compute 2 terms in training process and 20 terms in testing process for the logdeterminant series. We use a geometric distribution of p = 0.5 for the distribution p(N) for the log-determinant term. 
We use a single sample of (n, v) for the log-determinant estimators for both training and testing. We train each expeirment on a single NVIDIA Ge Force GTX 2080Ti for about 4 days for Res Flows and 6 days for Imp Flows. For 20-block Imp Flow, we train our model for about 2 weeks. However, we find that the 20-block Imp Flow will overfit the training dastaset for MINIBOONE because this dataset is quite small, so we use the early-stopping technique. D.5 DENSITY MODELING ON IMAGE DATASETS For the CIFAR10 dataset, we follow the same settings and architectures as Chen et al. (2019). In particular, every convolutional residual block is Lip Swish 3 3 Conv Lip Swish 1 1 Conv Lip Swish 3 3 Conv. The total architecture is Image Logit Transform(α) k Conv Block [Squeeze k Conv Block] 2, where Conv Block is i-Res Block for Res Flows and Imp Block for Imp Block, and k = 4 for Res Flows and k = 2 for Imp Flows. And the first Conv Block does not have Lip Swish as pre-activation, followed as Chen et al. (2019). We use Act Norm2d after every Conv Block. We do not use the FC layers (Chen et al., 2019). We use hidden channels as 512. We use batch size of 64 and the Adam optimizer of learning rate 10 3. The iteration error bound for spectral normalization is 10 3. We use α = 0.05 for CIFAR10. For the settings of our scalable algorithms, we use the same as Chen et al. (2019) for the logdeterminant terms. In particular, we always exactly compute 10 terms in training process and 20 terms in testing process for the log-determinant series. We use a possion distribution for the distribution p(N) for the log-determinant term. We use a single sample of (n, v) for the log-determinant estimators for both training and testing. We train each Res Flow on a single NVIDIA Ge Force GTX 2080Ti and each Imp Flow on two cards of NVIDIA Ge Force GTX 2080Ti for about 6 days for Res Flows and 8 days for Imp Flows. Although the amount of parameters are the same, Imp Flows need more GPU memory due to the implementation of Py Torch for the backward pass of implicit function. For the Celeb A dataset, we use exactly the same settings as the final version of Res Flows in Chen et al. (2019), except that we use the Sine activation of the form as 1 2π sin(2πx). E IMPFLOW SAMPLES Published as a conference paper at ICLR 2021 Figure 6: Qualitative samples on 5bit 64 64 Celeb A by Imp Flow, with a temperature of 0.8(Kingma & Dhariwal, 2018)