Published as a conference paper at ICLR 2024

LIFTING ARCHITECTURAL CONSTRAINTS OF INJECTIVE FLOWS

Peter Sorrenson, Felix Draxler, Armand Rousselot, Sander Hummerich, Lea Zimmermann & Ullrich Köthe
Computer Vision and Learning Lab, Heidelberg University
firstname.lastname@iwr.uni-heidelberg.de

ABSTRACT

Normalizing Flows explicitly maximize a full-dimensional likelihood on the training data. However, real data is typically only supported on a lower-dimensional manifold, leading the model to expend significant compute on modeling noise. Injective Flows fix this by jointly learning a manifold and the distribution on it. So far, they have been limited by restrictive architectures and/or high computational cost. We lift both constraints by a new efficient estimator for the maximum likelihood loss, compatible with free-form bottleneck architectures. We further show that naively learning both the data manifold and the distribution on it can lead to divergent solutions, and use this insight to motivate a stable maximum likelihood training objective. We perform extensive experiments on toy, tabular and image data, demonstrating the competitive performance of the resulting model.

1 INTRODUCTION

Generative modeling is one of the most important tasks in machine learning, having numerous applications across vision (Rombach et al., 2022), language modeling (Brown et al., 2020), science (Ardizzone et al., 2018; Radev et al., 2022) and beyond. One of the best-motivated approaches to generative modeling is maximum likelihood training, due to its favorable statistical properties (Hastie et al., 2009). In the continuous setting, exact maximum likelihood training is most commonly achieved by normalizing flows (Rezende & Mohamed, 2015; Dinh et al., 2015; Kobyzev et al., 2021), which parameterize an exactly invertible function with a tractable change of variables (log-determinant term). This generally introduces a trade-off between model expressivity and computational cost, where the cheapest networks to train and sample from, such as coupling block architectures, require very specifically constructed functions which may limit expressivity (Draxler et al., 2022).

In addition, normalizing flows preserve the dimensionality of the inputs, requiring a latent space of the same dimension as the data space. Due to the manifold hypothesis (Bengio et al., 2013), which suggests that realistic data lies on a low-dimensional manifold embedded in a high-dimensional data space, it is more efficient to only model distributions on a low-dimensional manifold and regard deviations from the manifold as uninformative noise. Prior works such as Caterini et al. (2021); Brehmer & Cranmer (2020) have restricted normalizing flows to low-dimensional manifolds via specially constructed bottleneck architectures (known as invertible autoencoders (Teng & Choromanska, 2019) or injective flows (Kothari et al., 2021)) where encoder and decoder share parameters. These injective flows are optimized by some version of maximum likelihood training. This is not an ideal design decision, as the restrictive architectures used in such models were originally designed for tractable change of variables calculations in normalizing flows, but such calculations are not possible in the presence of a bottleneck (Brehmer & Cranmer, 2020).
As a result, we propose to drop the restrictive constructions (such as coupling blocks), instead use an unconstrained encoder and decoder, and introduce a new technique to get around calculating the change of variables. This greatly simplifies the design of the model and makes it more expressive.

Figure 1: Free-form injective flow (FIF) training and inference. (Left) We combine a reconstruction loss L_recon. with a novel maximum likelihood loss L_NLL to obtain an injective flow without architectural constraints. (Right) We generate novel samples by decoding standard normal latent samples with our best-performing models on CelebA and MNIST. The reconstructions shown are on CelebA validation data, the samples are uncurated samples from our models.

We build on the unbiased maximum likelihood estimator used by rectangular flows (Caterini et al., 2021) to approximate the gradient of the change of variables term. We simplify the estimator considerably by replacing iterative conjugate gradient with an efficient single-step estimator. This is fast: a batch can be processed in about twice the time (or less) of an autoencoder trained on reconstruction loss only. In addition, we make a novel observation about injective flows: naively training with maximum likelihood is ill-defined due to the possibility of diverging curvature in the decoding function. To fix this problem, we propose a modification to our maximum likelihood estimator which counteracts the possibility of diverging curvature. We call our model the free-form injective flow (FIF).

To summarize, we make the following contributions:
- We introduce an efficient maximum likelihood estimator for free-form injective flows and use it to train an unconstrained injective flow for the first time (section 4.1).
- We identify pathological behavior in the naive application of maximum likelihood training in the presence of a bottleneck, and offer a solution to avoid this behavior while maintaining computational efficiency (section 4.2).
- We outperform previous injective flows and demonstrate competitive performance to generative autoencoders on toy, tabular and image data (section 5).
- We provide code to implement our model and reproduce our results at https://github.com/vislearn/FFF.

2 RELATED WORK

Injective flows jointly learn a manifold and maximize likelihood on that manifold. The latter requires estimating the Jacobian determinant of the transformation to calculate the change of variables. Efficient computation of this determinant traditionally imposed two major restrictions on normalizing flow architectures: Firstly, the latent space has to match in dimension with the data space, ruling out bottleneck architectures. Secondly, normalizing flows are restricted to certain functional forms, such as coupling and autoregressive blocks. Below we outline the existing approaches to overcome these problems and how our solution compares.

Lower-dimensional latent spaces    One set of methods attempts to use full-dimensional normalizing flows, with some additional regularization or architectural constraints such that a subspace of the latent space corresponds to the manifold. One strategy adds noise to the data to make it a full-dimensional distribution, then denoises to the manifold (Horvat & Pfister, 2021; Loaiza-Ganem et al., 2022). Another restricts the non-manifold latent dimensions to have small variance (Beitler et al., 2021; Silvestri et al., 2023; Zhang et al., 2023).
Other methods sidestep the problem by making training into a two-step procedure. First, an autoencoder is trained on reconstruction loss, then a normalizing flow is trained to learn the resulting latent distribution. In this line of work, Brehmer & Cranmer (2020) and Kothari et al. (2021) use an injective flow, while Böhm & Seljak (2022) use unconstrained networks as autoencoder. Ghosh et al. (2020) additionally regularize the decoder. Conformal embedding flows (Ross & Cresswell, 2021) ensure the decomposition of the determinant into the contribution from each block by further restricting the architecture to exclusively conformal transformations. Cramer et al. (2022) use an isometric autoencoder such that the change of variables is trivial. However, the resulting transformations are quite restrictive and cannot represent arbitrary manifolds. The most similar work to ours is the rectangular flow (Caterini et al., 2021), which estimates the gradient of the log-determinant via an iterative, unbiased estimator. The resulting method is quite slow to train, and uses injective flows, which are restrictive.

Unconstrained normalizing flow architectures    Several works attempt to reduce the constraints imposed by typical normalizing flow architectures, allowing the use of free-form networks. However, all of these methods only apply to full-dimensional architectures. FFJORD (Grathwohl et al., 2019) is a type of continuous normalizing flow (Chen et al., 2018b) which estimates the change of variables stochastically. Residual flows (Behrmann et al., 2019; Chen et al., 2019) make residual networks invertible, but require expensive iterative estimators to train via maximum likelihood. Self-normalizing flows (Keller et al., 2021) and relative gradient optimization (Gresele et al., 2020) estimate maximum likelihood gradients for the matrices used in neural networks, but restrict the architecture to use exclusively square weight matrices without skip connections. In a parallel work, we present the extension of the present paper to full-dimensional normalizing flows based on free-form neural networks, called free-form flows (Draxler et al., 2024).

Approximating maximum likelihood    Many methods optimize some bound on the full-dimensional maximum likelihood, notably the variational autoencoder (Kingma & Welling, 2014) and its variants. Cunningham et al. (2020) also optimize a variational lower bound to the likelihood. Other methods fit into the injective flow framework by jointly optimizing a reconstruction loss and some approximation to maximum likelihood on the manifold: Kumar et al. (2020); Zhang et al. (2020) approximate the log-determinant of the Jacobian by its Frobenius norm. The entropic AE (Ghose et al., 2020) maximizes the entropy of the latent distribution by a nearest-neighbor estimator while constraining its variance, resulting in a Gaussian latent space. In addition, there are other ways to regularize the latent space of an autoencoder which are not based on maximum likelihood, e.g. adversarial methods (Makhzani et al., 2015). In contrast to the above, our approach jointly learns the manifold and maximizes the likelihood on it with an unconstrained architecture, which easily accommodates a lower-dimensional latent space.

3 BACKGROUND

Notation    Let f : \mathbb{R}^D \to \mathbb{R}^d be an encoder which compresses data to a latent space and g : \mathbb{R}^d \to \mathbb{R}^D a decoder which decompresses the latent representation. A full-dimensional model has d = D, while a bottleneck model has d < D. If f \circ g : \mathbb{R}^d \to \mathbb{R}^d is the identity, then we call f and g consistent. For example, the forward and inverse function of a normalizing flow are consistent as f^{-1} = g.
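For concreteness, the following PyTorch sketch (our illustration, not the authors' released code) instantiates such a free-form encoder-decoder pair as plain multilayer perceptrons; the dimensions D and d and the hidden width are placeholder choices.

```python
import torch
import torch.nn as nn

D, d = 784, 16  # placeholder data and latent dimensions

# f : R^D -> R^d, an unconstrained encoder
encoder = nn.Sequential(
    nn.Linear(D, 256), nn.SiLU(),
    nn.Linear(256, 256), nn.SiLU(),
    nn.Linear(256, d),
)
# g : R^d -> R^D, an unconstrained decoder
decoder = nn.Sequential(
    nn.Linear(d, 256), nn.SiLU(),
    nn.Linear(256, 256), nn.SiLU(),
    nn.Linear(256, D),
)

x = torch.randn(8, D)      # a batch of data points
z = encoder(x)             # latent codes
x_hat = decoder(z)         # projections onto the learned manifold
recon_loss = ((x_hat - x) ** 2).sum(dim=1).mean()
```

Note that such a pair is not consistent by construction; consistency only emerges approximately through training on the reconstruction loss.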
Injective flows    Injective flows (Brehmer & Cranmer, 2020), also called invertible autoencoders (Teng & Choromanska, 2019), adapt invertible normalizing flow architectures to a bottleneck setting. They parameterize f and g as the composition of two invertible functions, w defined in \mathbb{R}^D and h defined in \mathbb{R}^d, with a slicing/padding operation in between:

    f = h^{-1} \circ \mathrm{slice} \circ w^{-1}  \quad \text{and} \quad  g = w \circ \mathrm{pad} \circ h,    (1)

where slice(x) selects the first d elements of x and pad(z) concatenates D - d zeros to the end of z. Since slice and pad are consistent, so too are f and g. Injective flows typically minimize a reconstruction loss (to learn a manifold which spans the data) alongside maximizing the likelihood of the data on that manifold, performing joint manifold and maximum likelihood training (alternatively, they are trained via two-step training, see section 2).

Change of variables across dimensions    The maximum likelihood objective resulting from the change of variables theorem, used to train normalizing flow models, is only well-defined when mapping between spaces of equal dimension. A result from differential geometry (Krantz & Parks, 2008) allows us to generalize the change of variables theorem to non-equal-dimension transformations through the formula:

    p_X(x) = p_Z(f(x)) \det\big( g'(f(x))^\top g'(f(x)) \big)^{-1/2},    (2)

where f and g are consistent and primes denote derivatives: g'(f(x)) is the Jacobian matrix of g evaluated at f(x). Note that, since p_X is derived as the pushforward of the latent distribution p_Z by g, the formula is valid only for x which lie on the decoder manifold (see appendix A for more details). Unlike in the full-dimensional case, there is no architecture known which can represent arbitrary manifolds and at the same time allows efficient exact computation of eq. (2) (see section 2).

Rectangular flows    Minimizing the negative logarithm of eq. (2) and adding a Lagrange multiplier to restrict the distance of data points from the decoder manifold results in the following per-sample loss term:

    L_{RF}(x) = -\log p_Z(z) + \tfrac{1}{2} \log\det\big( g'(z)^\top g'(z) \big) + \beta \| \hat{x} - x \|^2,    (3)

where z = f(x), \hat{x} = g(z), and \beta is a hyperparameter. The log-determinant term is the difficult one to optimize. Fortunately, its gradient with respect to the parameters \theta of the decoder can be estimated tractably by the following construction (Caterini et al., 2021). Note that g = g_\theta but the \theta subscript is dropped to avoid clutter. The relevant quantity is (with J = g'(z)):

    \frac{\partial}{\partial\theta_j} \tfrac{1}{2} \log\det(J^\top J) = \tfrac{1}{2} \operatorname{tr}\Big( (J^\top J)^{-1} \frac{\partial}{\partial\theta_j} (J^\top J) \Big).    (4)

The trace can be estimated (Hutchinson, 1989; Girard, 1989) by:

    \frac{\partial}{\partial\theta_j} \tfrac{1}{2} \log\det(J^\top J) \approx \frac{1}{2K} \sum_{k=1}^{K} \underbrace{\epsilon_k^\top (J^\top J)^{-1}}_{(*)} \underbrace{\frac{\partial}{\partial\theta_j}(J^\top J)\,\epsilon_k}_{(**)},    (5)

where the \epsilon_k are K samples from a distribution with E[\epsilon\epsilon^\top] = I, typically either Rademacher or standard normal. Now, we take steps to compute the above without constructing the entire network Jacobian J. First, note that (**) can be written using a Jacobian-vector product v_1 = J\epsilon_k and a vector-Jacobian product J^\top v_1 = (v_1^\top J)^\top, each readily available via automatic differentiation. For (*), Caterini et al. (2021) propose to employ the iterative conjugate gradient method: write \epsilon_k^\top (J^\top J)^{-1} = \mathrm{CG}(J^\top J; \epsilon_k)^\top, where CG(A; b) denotes the conjugate gradient solution to Ax = b. Conjugate gradient is a clever choice here, since it again only requires computing terms of the form J^\top J v using autodiff.
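As an illustration of this building block (a sketch of ours, not code from Caterini et al. (2021)), the matrix-vector product v -> J^T J v can be assembled from one forward-mode and one reverse-mode pass through the decoder, without ever materializing J:

```python
import torch
from torch.func import jvp, vjp

def JtJ_matvec(decoder, z, v):
    """Compute J^T J v with J = g'(z) (per batch element), using matrix-free autodiff products."""
    _, Jv = jvp(decoder, (z,), (v,))   # Jacobian-vector product: J v
    _, pullback = vjp(decoder, z)      # records the reverse pass through g
    (JtJv,) = pullback(Jv)             # vector-Jacobian product: J^T (J v)
    return JtJv

# usage with a stand-in linear decoder (d = 8, D = 32)
decoder = torch.nn.Linear(8, 32)
z = torch.randn(4, 8)
v = torch.randn_like(z)
print(JtJ_matvec(decoder, z, v).shape)  # torch.Size([4, 8])
```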
The parameter derivative can be made to act only on the rightmost Jacobian terms by applying the stop-gradient operation to the output of the conjugate gradient method, which returns its input, but has zero gradient. The final surrogate for the log-determinant term is therefore

    \frac{1}{2K} \sum_{k=1}^{K} \mathrm{stop\_gradient}\big( \mathrm{CG}(J^\top J; \epsilon_k) \big)^\top J^\top J \epsilon_k,    (6)

which replaces the log-determinant term in the loss. This computation yields the same gradient as the original loss in eq. (3), even though it has a different value. In the following, we present a significantly improved version of this estimator based on the insight that the encoder is an (approximate) inverse of the decoder.

4 FREE-FORM INJECTIVE FLOW (FIF)

Our modification to rectangular flows is threefold: first, we use an unconstrained autoencoder architecture (no restrictively parameterized invertible functions); second, we introduce a more computationally efficient surrogate estimator; third, we modify the surrogate to avoid pathological behavior related to manifolds with high curvature. Our per-sample loss function is:

    L(x) = -\log p_Z(z) - \frac{1}{K} \sum_{k=1}^{K} \epsilon_k^\top f'(x)\, \mathrm{stop\_gradient}\big( g'(z) \epsilon_k \big) + \beta \| \hat{x} - x \|^2,    (7)

with z = f(x). Note the negative sign before the surrogate term, which comes from sending the log-determinant gradient to the encoder rather than the decoder. We will derive and motivate this formulation of the loss in sections 4.1 and 4.2.

4.1 SIMPLIFYING THE SURROGATE

We considerably simplify the optimization of rectangular flows by a new surrogate for the log-determinant term in eq. (2), which uses the Jacobian of the encoder as an approximation for the inverse Jacobian of the decoder. This allows the surrogate to be computed in a single pass, avoiding costly conjugate gradient iterations. We do this by expanding the derivative in eq. (4):

    \tfrac{1}{2} \operatorname{tr}\Big( (J^\top J)^{-1} \frac{\partial}{\partial\theta_j}(J^\top J) \Big) = \operatorname{tr}\Big( J^+ \frac{\partial}{\partial\theta_j} J \Big),    (8)

where J^+ = (J^\top J)^{-1} J^\top is the Moore-Penrose inverse of J. The full derivation is in appendix B. To see the advantage of this formulation, consider that for encoder f and decoder g optimal with respect to the reconstruction loss, f'(\hat{x}) = g'(f(x))^+ (see appendix B.1). Using the stop-gradient operation, this leads to the following surrogate loss term:

    \frac{1}{K} \sum_{k=1}^{K} \mathrm{stop\_gradient}\big( \epsilon_k^\top f'(\hat{x}) \big)\, g'(z) \epsilon_k,    (9)

or equivalently, using the encoder Jacobian in place of the decoder Jacobian (see appendix B):

    -\frac{1}{K} \sum_{k=1}^{K} \epsilon_k^\top f'(\hat{x})\, \mathrm{stop\_gradient}\big( g'(z) \epsilon_k \big).    (10)

Each term of the sum can be computed from just two vector-Jacobian/Jacobian-vector products obtained from automatic differentiation. This is a significant improvement on the iterative conjugate gradient method needed in the original formulation of rectangular flows, which requires up to 2(d+1) such products to ensure convergence (Caterini et al., 2021). Compared to reconstruction loss only, we measure 1.5 to 2 times the wall clock time, independent of the latent dimension. Note that the surrogate is only accurate if f and g are (at least approximately) optimal with respect to the reconstruction loss. We observe stable training in practice, validating this assumption.
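The following PyTorch sketch shows our reading of eq. (10); it is an illustration, not the authors' released implementation. The decoder product is detached so that the gradient of the surrogate reaches only the encoder parameters.

```python
import torch

def logdet_surrogate(encoder, decoder, x, n_samples=1):
    """Single-step surrogate replacing +1/2 log det(g'(z)^T g'(z)) in the loss,
    with the log-determinant gradient routed to the encoder (eq. (10))."""
    z = encoder(x)
    x_hat = decoder(z)
    total = 0.0
    for _ in range(n_samples):
        eps = torch.randn_like(z)
        # v = g'(z) eps: Jacobian-vector product through the decoder, detached
        _, v = torch.autograd.functional.jvp(decoder, z.detach(), eps)
        # w = eps^T f'(x_hat): vector-Jacobian product through the encoder,
        # kept differentiable so its gradient reaches the encoder parameters
        x_hat_in = x_hat.detach().requires_grad_(True)
        w = torch.autograd.grad(encoder(x_hat_in), x_hat_in,
                                grad_outputs=eps, create_graph=True)[0]
        total = total + (w * v.detach()).sum(dim=1)
    return -total / n_samples  # note the negative sign of eq. (10)
```

During training this term is added to -log p_Z(z) and beta times the reconstruction loss; in this formulation the decoder receives gradients only through the reconstruction term.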
4.2 PROBLEMS WITH MAXIMUM LIKELIHOOD IN THE PRESENCE OF A BOTTLENECK

Rectangular flows are trained with a combination of a reconstruction and a likelihood term. We might ask what happens if we only train with the likelihood term, making an analogy to normalizing flows. In this case our loss would be:

    L_{NLL}(x) = -\log p_Z(z) + \tfrac{1}{2} \log\det\big( g'(z)^\top g'(z) \big).    (11)

Unfortunately, optimizing this loss can lead us to learn a degenerate decoder manifold, an issue raised in Brehmer & Cranmer (2020). Here we expand on their argument. First consider that if f and g are consistent, then f(\hat{x}) = f(g(f(x))) = f(x) and the per-sample loss is invariant to projections: L_{NLL}(\hat{x}) = L_{NLL}(x), since L_{NLL}(x) is a function only of f(x). This means that we can write our loss as:

    L_{NLL} = E_{p_{data}(x)}[L_{NLL}(x)] = E_{\hat{p}_{data}(\hat{x})}[L_{NLL}(\hat{x})],    (12)

where \hat{p}_{data}(\hat{x}) is the probability density of the projection of the training data onto the decoder manifold. Now consider that the negative log-likelihood loss is one part of a KL divergence, and KL divergences are always non-negative:

    \mathrm{KL}\big( \hat{p}_{data}(\hat{x}) \,\|\, p_\theta(\hat{x}) \big) = -H\big( \hat{p}_{data}(\hat{x}) \big) - E_{\hat{p}_{data}(\hat{x})}[\log p_\theta(\hat{x})] \ge 0.    (13)

Figure 2: Naive training of autoencoders with negative log-likelihood (NLL, see section 4.2) leads to pathological solutions (left). Starting with the initialization (t = 0, black), gradient steps increase the curvature of the learnt manifold (t = 1, 2, orange). This reduces NLL because the entropy of the projected data is reduced, by moving the points closer to one another. This effect is stronger than the reconstruction loss. We fix this problem by evaluating the volume change off-manifold (right). This moves the manifold closer to the data and reduces the curvature (t = 1, 2, green), until it eventually centers the manifold on the data with zero curvature (t = \infty, green). Light lines show the set of points which map to the same latent point. Data is projected onto the t = 2 manifold.

As a result, the loss is lower bounded by the entropy of the data projected onto the manifold:

    L_{NLL} = -E_{\hat{p}_{data}(\hat{x})}[\log p_\theta(\hat{x})] \ge H\big( \hat{p}_{data}(\hat{x}) \big).    (14)

Unlike in standard normalizing flow optimization, where the right hand side would be fixed, here the entropy depends on the projection learned by the model. Thus, the model could modify the projection such that entropy is as low as possible. We break this pathology down into two cases:

1. A model manifold which does not align with the data manifold but instead intersects it. For example, Brehmer & Cranmer (2020) discuss a case where a linear model learns to project a data distribution to a single point on the manifold, thus reducing its entropy to -\infty, the lowest possible value. To the best of our knowledge, this can be fixed by adding noise and a reconstruction loss with sufficiently high weight. In appendix C we prove as much for linear models and characterize the solutions, which are the same as PCA if \beta \ge 1/(2\sigma^2), where \sigma^2 is the smallest eigenvalue of the data covariance matrix.

2. A model manifold which concentrates the data by use of high curvature, see fig. 2 (left). This newly identified pathological case only occurs in nonlinear models, hence Brehmer & Cranmer (2020) did not notice this effect in their linear example. Importantly, this is not fixed by adding a reconstruction loss. Most existing injective flows avoid this by a two-stage training, which first learns a projection and then the distribution of the projected data in the latent space.

To enable jointly learning a manifold and a maximum-likelihood density on it, we need to find a fix for the pathology.

Towards a well-behaved loss    The term which leads to pathological behavior in the likelihood loss is the log-determinant.
When using the change of variables with f evaluated at \hat{x}, all that matters is the change of volume from the projected data to the latent space, so the model can decrease the loss by choosing a manifold which concentrates the projected data more tightly (the more possibility it has to expand the data, the lower the loss will be). We can counteract this effect by introducing a factor inversely proportional to the concentration. This can be achieved by the fairly simple modification of evaluating f in our estimator at x rather than \hat{x}. Namely, we modify eq. (10) to estimate the gradient of the log-determinant term by:

    -\frac{1}{K} \sum_{k=1}^{K} \epsilon_k^\top f'(x)\, \mathrm{stop\_gradient}\big( g'(z) \epsilon_k \big).    (15)

See appendix B.2 for a detailed explanation.

Figure 3: Learning a noisy 2-D sinusoid with a 1-D latent space for different reconstruction weights \beta. Color codes denote the value of the latent variable at each location. When the reconstruction term has low weight (left), the autoencoder learns to throw away information about the position along the sinusoid and only retains the orthogonal noise. Only sufficiently high weights (right) result in the desired solution, where the decoder spans the sinusoid manifold. The middle plot shows the tradeoff between reconstruction error and NLL as we transition between these regimes (box plots indicate variability across runs).

In this way, we discourage pathological solutions involving high curvature. In fig. 2 (right) we can see the effect of the modified estimator: the manifold now moves towards the data since the optimization is not dominated by diverging curvature. We note that the modified estimator is also computationally cheaper, since the vector-Jacobian product \epsilon_k^\top f'(x) can reuse the computational graph generated when computing z. Along with the results of section 4.1, this leads to the following loss (same as eq. (7)):

    L(x) = L_{NLL}(x) + \beta L_{recon.}(x)    (16)
         = -\log p_Z(z) - \frac{1}{K} \sum_{k=1}^{K} \epsilon_k^\top f'(x)\, \mathrm{stop\_gradient}\big( g'(z) \epsilon_k \big) + \beta \| \hat{x} - x \|^2.    (17)

Phase transition    Figure 3 shows that when using this loss, if \beta is large enough, the dominant manifold direction is identified. In appendix E.1.2, we show a similar experiment on MNIST.

5 EXPERIMENTS

In this section, we test the empirical performance of the proposed model. First, we show that our model is much faster than rectangular flows on tabular data. Second, we show that it outperforms previous SOTA injective flows on generating images. Finally, we compare against other generative autoencoders on the Pythae image generation benchmark (Chadebec et al., 2022), achieving the best FID score in some categories.

Implementation details    In implementing the trace estimator, we have to make a number of choices. Briefly, i) we chose to formulate the log-determinant gradient in terms of the encoder rather than the decoder as it was more stable in practice, ii) we performed traces in the order f'(x) g'(z) as this reduces variance (both orderings are valid due to the cyclic property of the trace, but since f'(x) g'(z) is a d \times d matrix whereas g'(z) f'(x) is D \times D, the former is typically easier to estimate), iii) we used a mixture of forward- and backward-mode automatic differentiation as this was compatible with our estimator, and iv) we used orthogonalized Gaussian noise in the trace estimator, to reduce variance. Full justification for these choices is given in appendix D.
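To make the objective concrete, here is a minimal per-sample implementation of eq. (17) as we read it (K = 1, standard normal latent prior), reflecting choices i)-iii) above: the log-determinant gradient is routed to the encoder, the trace is taken in the d-dimensional latent space, and forward mode (through the decoder) is mixed with backward mode (through the encoder). It is a sketch under those assumptions, not the released training code, and for simplicity it uses plain Gaussian noise rather than the orthogonalized noise of choice iv).

```python
import torch

def fif_loss(encoder, decoder, x, beta):
    """Per-sample FIF objective of eq. (17), averaged over the batch (K = 1)."""
    x = x.requires_grad_(True)
    z = encoder(x)
    x_hat = decoder(z)

    # negative log-likelihood of the latent code under N(0, I), up to a constant
    nll_latent = 0.5 * (z ** 2).sum(dim=1)

    eps = torch.randn_like(z)
    # forward mode: v = g'(z) eps, detached so only the encoder receives
    # gradient from the log-determinant surrogate
    _, v = torch.autograd.functional.jvp(decoder, z.detach(), eps)
    v = v.detach()
    # backward mode: w = eps^T f'(x), reusing the graph already built for z
    w = torch.autograd.grad(z, x, grad_outputs=eps, create_graph=True)[0]
    surrogate = (w * v).sum(dim=1)  # estimates tr(f'(x) g'(z)) per sample

    recon = ((x_hat - x) ** 2).sum(dim=1)
    return (nll_latent - surrogate + beta * recon).mean()
```

In this formulation the decoder is trained purely on reconstruction, matching the discussion in appendix D.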
5.1 TABULAR DATA

We evaluate our method on four of the tabular datasets used by Papamakarios et al. (2017), using the same data splits, and make a comparison to the published rectangular flow results (Caterini et al., 2021), see table 1. We adopt the FID-like metric from that work, which computes the Wasserstein-2 distance between the Gaussian distributions with equal mean and covariance as the test data and the data generated by the model. This is a measure of the difference of the means and covariance matrices of the generated and test datasets. We outperform rectangular flows on all datasets except GAS. In addition, we see a speedup in training time of between 1.5 and 6 times between FIF and a rerun of rectangular flows (using the published code) on the same hardware. Full experimental details are in appendix E.2.

Table 1: Free-form injective flows (FIF) are significantly faster than rectangular flows (RF), with superior performance in the FID-like metric on 3 out of 4 tabular datasets (Papamakarios et al., 2017). Both methods use K = 1. The results for RF are taken directly from Caterini et al. (2021).

| Method                     | POWER         | GAS           | HEPMASS       | MINIBOONE     |
|----------------------------|---------------|---------------|---------------|---------------|
| RF (Caterini et al., 2021) | 0.083 ± 0.015 | 0.110 ± 0.021 | 0.779 ± 0.191 | 1.001 ± 0.051 |
| FIF (ours)                 | 0.041 ± 0.007 | 0.281 ± 0.031 | 0.541 ± 0.034 | 0.598 ± 0.024 |
| Training time speedup      | 3.9×          | 2.2×          | 6.1×          | 1.5×          |

We also show an ablation study in table 6 and table 7 in the appendix, disentangling the effect of the three individual components of our method: the surrogate of eq. (10), the fix for high-curvature solutions in eq. (15), and the use of free-form architectures. Comparing to table 1, we find that the fix to estimate the negative log-likelihood with an off-manifold encoder Jacobian is crucial for good performance of free-form architectures, as the on-manifold variant diverges (see section 4.2). This is not the case for the injective architecture based on coupling flows, indicating a stabilizing regularization via the architecture.

5.2 COMPARISON TO INJECTIVE FLOWS

We compare FIF against previous injective flows on CelebA images (Liu et al., 2015) in table 2. Our models significantly improve the quality of the generated images in terms of the Fréchet inception distance (FID; Heusel et al., 2017) and Inception Score (IS; Salimans et al., 2016). The former compares generated samples to a set of reference samples, by computing the Wasserstein-2 distance between two Gaussian distributions fit to some embedding of the respective sets of samples. The latter measures diversity by the entropy of the distribution of class labels in the generated samples, where the class labels are provided by some pre-trained classifier. Samples from this model are depicted in fig. 1. For a fair comparison, we train each model on the same hardware for equal wall clock time with the code provided by the authors. The architectures of previous works were dominated by the need that most layers are invertible and have a tractable Jacobian determinant. Our loss in eq. (7) does not impose these constraints on the architecture, and we can use an off-the-shelf convolutional autoencoder with additional fully-connected layers in the latent space. Details can be found in appendix E.3.

Table 2: Comparison of injective flows on CelebA under equal computational budget. Free-form Injective Flows (FIF) outperform previous work significantly in terms of FID.

| Model                          | # parameters | FID (N sampler) | IS (N) | FID (GMM sampler) | IS (GMM) |
|--------------------------------|--------------|-----------------|--------|-------------------|----------|
| DNF (Horvat & Pfister, 2021)   | 39.4M        | 55.6 ± 0.59     | 1.9    | 52.7 ± 0.33       | 2.0      |
| Trumpet (Kothari et al., 2021) | 19.1M        | 56.2 ± 1.39     | 1.8    | 47.7 ± 2.24       | 1.9      |
| FIF (ours)                     | 34.3M        | 47.3 ± 1.39     | 1.7    | 37.4 ± 1.35       | 2.0      |
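For reference, both the FID-like tabular metric and FID itself reduce to the Wasserstein-2 distance between two Gaussians fit to (embedded) samples. Below is a minimal NumPy/SciPy sketch of that distance (our illustration; the exact normalization and embedding used in the benchmarks may differ):

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(samples_a, samples_b):
    """Wasserstein-2 distance between Gaussians fit to two sample sets."""
    mu_a, mu_b = samples_a.mean(axis=0), samples_b.mean(axis=0)
    cov_a = np.cov(samples_a, rowvar=False)
    cov_b = np.cov(samples_b, rowvar=False)
    cov_b_sqrt = sqrtm(cov_b)
    cross = sqrtm(cov_b_sqrt @ cov_a @ cov_b_sqrt)
    w2_sq = np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a + cov_b - 2 * cross.real)
    return np.sqrt(max(w2_sq, 0.0))
```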
5.3 COMPARISON TO GENERATIVE AUTOENCODERS

As free-form injective flows (FIF) do not require any specific architecture, we expand our comparison to the much broader range of generative autoencoders. This is a general class of bottleneck architectures that encode the training data to a standard normal distribution, so that the decoder can be used as a generator after training. Recently, Chadebec et al. (2022) proposed the Pythae benchmark for comparing generative autoencoders on image generation. They evaluate different training methods using two different architectures on MNIST (LeCun et al., 2010) (data dimension D = 784, latent dimension d = 16), CIFAR10 (Krizhevsky, 2009) (D = 3072, d = 256), and CelebA (Liu et al., 2015) (D = 12288, d = 64). All models are trained with the same limited computational budget. The goal of the benchmark is to provide a fair comparison of different models, not to achieve SOTA image generation results, as this would require significantly more compute.

Table 3: Pythae benchmark results on CelebA, following Chadebec et al. (2022). We train their architectures (ConvNet and ResNet) with our new training objective, achieving SOTA FID on ResNet. We draw latent samples from a standard normal (N) or a GMM fit using training data (GMM). Models with multiple variants (indicated in brackets) have been merged to indicate only the best result across variants.

| Model                                                | ConvNet+N FID | IS  | ResNet+N FID | IS  | ConvNet+GMM FID | IS  | ResNet+GMM FID | IS  |
|------------------------------------------------------|---------------|-----|--------------|-----|-----------------|-----|----------------|-----|
| VAE (Kingma & Welling, 2014)                         | 54.8          | 1.9 | 66.6         | 1.6 | 52.4            | 1.9 | 63.0           | 1.7 |
| IWAE (Burda et al., 2015)                            | 55.7          | 1.9 | 67.6         | 1.6 | 52.7            | 1.9 | 64.1           | 1.7 |
| VAE-lin NF (Rezende & Mohamed, 2015)                 | 56.5          | 1.9 | 67.1         | 1.6 | 53.3            | 1.9 | 62.8           | 1.7 |
| VAE-IAF (Kingma et al., 2016)                        | 55.4          | 1.9 | 66.2         | 1.6 | 53.6            | 1.9 | 62.7           | 1.7 |
| β-(TC)VAE (Higgins et al., 2017; Chen et al., 2018a) | 55.7          | 1.8 | 65.9         | 1.6 | 51.7            | 1.9 | 59.3           | 1.7 |
| FactorVAE (Kim & Mnih, 2018)                         | 53.8          | 1.9 | 66.4         | 1.7 | 52.4            | 2.0 | 63.3           | 1.7 |
| InfoVAE (RBF/IMQ) (Zhao et al., 2017)                | 55.5          | 1.9 | 66.4         | 1.6 | 52.7            | 1.9 | 62.3           | 1.7 |
| AAE (Makhzani et al., 2015)                          | 59.9          | 1.8 | 64.8         | 1.7 | 53.9            | 2.0 | 58.7           | 1.8 |
| MSSSIM-VAE (Snell et al., 2017)                      | 124.3         | 1.3 | 119.0        | 1.3 | 124.3           | 1.3 | 119.2          | 1.3 |
| Vanilla AE                                           | 327.7         | 1.0 | 275.0        | 2.9 | 55.4            | 2.0 | 57.4           | 1.8 |
| WAE (RBF/IMQ) (Tolstikhin et al., 2018)              | 64.6          | 1.7 | 67.1         | 1.6 | 51.7            | 2.0 | 57.7           | 1.8 |
| VQVAE (van den Oord et al., 2017)                    | 306.9         | 1.0 | 140.3        | 2.2 | 51.6            | 2.0 | 57.9           | 1.8 |
| RAE (L2/GP) (Ghosh et al., 2020)                     | 86.1          | 2.8 | 168.7        | 3.1 | 52.5            | 1.9 | 58.3           | 1.8 |
| FIF (ours)                                           | 56.9          | 2.1 | 62.3         | 1.7 | 47.3            | 1.9 | 55.0           | 1.8 |

As shown in table 3, our model performs strongly on the benchmark, achieving SOTA on CelebA in Fréchet Inception Distance (FID) (Heusel et al., 2017) on the ResNet architecture with latent codes sampled from a standard normal, and on both architectures when sampling from a Gaussian mixture model fit using training data. At the same time, the Inception Scores (IS) (Salimans et al., 2016) are high, indicating a high diversity. On the one combination where our model does not outperform the competitors, FIF still achieves a comparable FID and high Inception Score. FIF also performs strongly on the other datasets, see appendix E.4. For each method in the benchmark, ten hyperparameter configurations are trained and the best model according to FID is reported. For our method, we choose to vary \beta \in \{5, 10, 15, 20, 25\} and the number of Hutchinson samples K \in \{1, 2\}. We find the performance to be robust against these choices, and give all details on the training procedure in appendix E.4.
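The GMM sampler used above is straightforward to reproduce; the following sketch (ours, with a placeholder number of mixture components) uses scikit-learn:

```python
import torch
from sklearn.mixture import GaussianMixture

@torch.no_grad()
def sample_with_gmm(encoder, decoder, train_data, n_samples, n_components=10):
    """Fit a GMM to the latent codes of the training data and decode its samples."""
    latents = encoder(train_data).cpu().numpy()
    gmm = GaussianMixture(n_components=n_components).fit(latents)
    z, _ = gmm.sample(n_samples)
    return decoder(torch.as_tensor(z, dtype=torch.float32))
```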
6 CONCLUSION

This paper offers a computationally efficient solution to jointly learning a manifold and a distribution on it, which we call the free-form injective flow (FIF). We i) significantly improve an existing estimator for the gradient of the change of variables across dimensions, ii) note that it can be applied to unconstrained architectures, iii) analyze problems with joint manifold and maximum-likelihood training and offer a solution, and iv) implement and test our model on toy, tabular and image datasets. We find that the model is practical and scalable, outperforming comparable injective flows, and showing similar or better performance to other autoencoder generative models on the Pythae benchmark.

Several theoretical and practical questions remain for future work: We identified a previously overlooked problem with jointly learning a manifold and maximum likelihood. We propose a fix in section 4.2 that provides high-quality models, but further investigation is needed for a thorough understanding. Fitting a GMM to the latent space after training improves performance on image data, suggesting that our latent distributions are not perfectly Gaussian. We generally find that architectures with more fully-connected layers in the latent space have a more Gaussian latent distribution, suggesting that larger models suffer less from this problem. We leave potential theoretical or practical improvements to future work.

REFERENCES

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.

Lynton Ardizzone, Jakob Kruse, Carsten Rother, and Ullrich Köthe. Analyzing inverse problems with invertible neural networks. In International Conference on Learning Representations, 2018.

Jens Behrmann, Will Grathwohl, Ricky TQ Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. In International Conference on Machine Learning, 2019.

Jan Jetze Beitler, Ivan Sosnovik, and Arnold Smeulders. PIE: Pseudo-invertible encoder. arXiv:2111.00619, 2021.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.

Vanessa Böhm and Uroš Seljak. Probabilistic auto-encoder. Transactions on Machine Learning Research, 2022.

Johann Brehmer and Kyle Cranmer. Flows for simultaneous manifold learning and density estimation. Advances in Neural Information Processing Systems, 33:442-453, 2020.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.

Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
Anthony L Caterini, Gabriel Loaiza-Ganem, Geoff Pleiss, and John P Cunningham. Rectangular flows for manifold learning. Advances in Neural Information Processing Systems, 2021.

Clément Chadebec, Louis J. Vincent, and Stéphanie Allassonnière. Pythae: Unifying generative autoencoders in python - a benchmarking use case. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, 2022.

Ricky TQ Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. Advances in Neural Information Processing Systems, 31, 2018a.

Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 2018b.

Ricky TQ Chen, Jens Behrmann, David K Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. Advances in Neural Information Processing Systems, 2019.

Eike Cramer, Felix Rauh, Alexander Mitsos, Raúl Tempone, and Manuel Dahmen. Isometric manifold learning for injective normalizing flows. arXiv preprint arXiv:2203.03934, 2022.

Edmond Cunningham, Renos Zabounidis, Abhinav Agrawal, Madalina Fiterau, and Daniel Sheldon. Normalizing flows across dimensions. International Conference on Machine Learning, Workshop Track, 2020.

Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. In International Conference on Learning Representations, Workshop Track, 2015.

Felix Draxler, Christoph Schnörr, and Ullrich Köthe. Whitening convergence rate of coupling-based normalizing flows. In Advances in Neural Information Processing Systems, 2022.

Felix Draxler, Peter Sorrenson, Lea Zimmermann, Armand Rousselot, and Ullrich Köthe. Free-form flows: Make any architecture a normalizing flow. In Artificial Intelligence and Statistics, 2024.

William Falcon and The PyTorch Lightning team. PyTorch Lightning, March 2019.

Amur Ghose, Abdullah Rashwan, and Pascal Poupart. Batch norm with entropic regularization turns deterministic autoencoders into generative models. In Conference on Uncertainty in Artificial Intelligence, 2020.

Partha Ghosh, Mehdi SM Sajjadi, Antonio Vergari, Michael Black, and Bernhard Schölkopf. From variational to deterministic autoencoders. International Conference on Learning Representations, 2020.

A Girard. A fast Monte-Carlo cross-validation procedure for large least squares problems with noisy data. Numerische Mathematik, 56:1-23, 1989.

Gene H Golub and Victor Pereyra. The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate. SIAM Journal on Numerical Analysis, 10(2):413-432, 1973.

Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.

Luigi Gresele, Giancarlo Fissore, Adrián Javaloy, Bernhard Schölkopf, and Aapo Hyvarinen. Relative gradient optimization of the jacobian term in unsupervised deep learning. In Advances in Neural Information Processing Systems, 2020.

Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H.
van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357-362, 2020.

Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Springer, 2009.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.

Christian Horvat and Jean-Pascal Pfister. Denoising normalizing flow. Advances in Neural Information Processing Systems, 34:9099-9111, 2021.

J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90-95, 2007.

Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 18(3):1059-1076, 1989.

Thomas A Keller, Jorn WT Peters, Priyank Jaini, Emiel Hoogeboom, Patrick Forré, and Max Welling. Self normalizing flows. In International Conference on Machine Learning, 2021.

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In International Conference on Machine Learning, 2018.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. Advances in Neural Information Processing Systems, 29, 2016.

Ivan Kobyzev, Simon J.D. Prince, and Marcus A. Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3964-3979, 2021.

Konik Kothari, Amir Ehsan Khorashadizadeh, Maarten V. de Hoop, and Ivan Dokmanic. Trumpets: Injective flows for inference and inverse problems. In Conference on Uncertainty in Artificial Intelligence, 2021.

Steven G Krantz and Harold R Parks. Geometric Integration Theory. Springer Science & Business Media, 2008.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, Ontario, 2009.

Abhishek Kumar, Ben Poole, and Kevin Murphy. Regularized autoencoders via relaxed injective probability flow. In International Conference on Artificial Intelligence and Statistics, 2020.

Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.

Gabriel Loaiza-Ganem, Brendan Leigh Ross, Luhuan Wu, John Patrick Cunningham, Jesse C Cresswell, and Anthony L Caterini. Denoising deep generative models. In Advances in Neural Information Processing Systems, Workshop Track, 2022.
Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. In International Conference on Learning Representations, 2015.

Wes McKinney. Data structures for statistical computing in Python. In Stéfan van der Walt and Jarrod Millman (eds.), 9th Python in Science Conference, 2010.

Francesco Mezzadri. How to generate random matrices from the classical compact groups. Notices of the American Mathematical Society, 54(5):592-604, May 2007. ISSN 0002-9920.

George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. Advances in Neural Information Processing Systems, 2017.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.

Stefan T. Radev, Ulf K. Mertens, Andreas Voss, Lynton Ardizzone, and Ullrich Köthe. BayesFlow: Learning complex stochastic models with invertible neural networks. IEEE Transactions on Neural Networks and Learning Systems, 33(4):1452-1466, April 2022. ISSN 2162-2388. doi: 10.1109/TNNLS.2020.3042395.

Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, July 2015.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

Brendan Ross and Jesse Cresswell. Tractable density estimation on learned manifolds with conformal embedding flows. Advances in Neural Information Processing Systems, 34:26635-26648, 2021.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.

Gianluigi Silvestri, Daan Roos, and Luca Ambrogioni. Deterministic training of generative autoencoders using invertible layers. In The Eleventh International Conference on Learning Representations, 2023.

Jake Snell, Karl Ridgeway, Renjie Liao, Brett D Roads, Michael C Mozer, and Richard S Zemel. Learning to generate images with perceptual similarity metrics. In IEEE International Conference on Image Processing, pp. 4277-4281. IEEE, 2017.

Yunfei Teng and Anna Choromanska. Invertible autoencoder for domain adaptation. Computation, 7(2):20, 2019.

The pandas development team. pandas-dev/pandas: Pandas, February 2020.

Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein autoencoders. In International Conference on Learning Representations, 2018.

Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.

Mingtian Zhang, Yitong Sun, Chen Zhang, and Steven Mcdonagh. Spread flows for manifold modelling. In International Conference on Artificial Intelligence and Statistics, 2023.

Zijun Zhang, Ruixiang Zhang, Zongpeng Li, Yoshua Bengio, and Liam Paull. Perceptual generative autoencoders. In International Conference on Machine Learning, 2020.

Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Information maximizing variational autoencoders. arXiv:1706.02262, 2017.
A CHANGE OF VARIABLES FORMULA ACROSS DIMENSIONS

The change of variables formula describes how probability densities change as they are mapped through an injective pushforward function g. It is instructive to derive this formula when g is an invertible function. Let p_Z be a base density and p_X the pushforward density obtained by mapping samples from p_Z through g. Then we can write

    p_X(x) = \int p(x \mid z)\, p_Z(z)\, dz    (18)
           = \int \delta(x - g(z))\, p_Z(z)\, dz    (19)
           = \int \delta(x - \hat{x})\, p_Z(f(\hat{x}))\, |\det(g'(f(\hat{x})))|^{-1}\, d\hat{x}    (20)
           = p_Z(f(x))\, |\det(g'(f(x)))|^{-1},    (21)

using the change of variables \hat{x} = g(z), meaning that z = f(\hat{x}) and |\det(g'(z))|\, dz = d\hat{x}, with f the inverse of g.

Now suppose that g maps from \mathbb{R}^d to \mathbb{R}^D with d < D. We can generalize the change of variables \hat{x} = g(z) using z = f(\hat{x}) and \det(g'(z)^\top g'(z))^{1/2}\, dz = d\hat{x}, where f and g are consistent (f \circ g is the identity) (see chapter 5 of Krantz & Parks (2008)). This gives us

    p_X(x) = \int \delta(x - g(z))\, p_Z(z)\, dz    (22)
           = \int \delta(x - \hat{x})\, p_Z(f(\hat{x}))\, \det\big( g'(f(\hat{x}))^\top g'(f(\hat{x})) \big)^{-1/2}\, d\hat{x}.    (23)

This expression defines a probability density in the full ambient space \mathbb{R}^D (albeit a degenerate distribution) but we cannot easily remove the integral. However, we can convert it into an expression resembling the full-dimensional case, but defined only on the image of g:

    p_X(x) = p_Z(f(x))\, \det\big( g'(f(x))^\top g'(f(x)) \big)^{-1/2}.    (24)

Note that this expression only integrates to 1 if we restrict integration to the image of g. As such, it should only be regarded as defining a probability distribution on this manifold, not in the ambient space \mathbb{R}^D.

B DERIVATION OF GRADIENT ESTIMATOR

We expand the derivative in eq. (4):

    \frac{\partial}{\partial\theta_j} \tfrac{1}{2} \log\det(J^\top J)
      = \tfrac{1}{2} \operatorname{tr}\Big( (J^\top J)^{-1} \frac{\partial}{\partial\theta_j}(J^\top J) \Big)    (25)
      = \tfrac{1}{2} \operatorname{tr}\Big( (J^\top J)^{-1} \Big(\frac{\partial J}{\partial\theta_j}\Big)^{\!\top} J \Big) + \tfrac{1}{2} \operatorname{tr}\Big( (J^\top J)^{-1} J^\top \frac{\partial J}{\partial\theta_j} \Big)    (26)
      = \tfrac{1}{2} \operatorname{tr}\Big( J (J^\top J)^{-1} \Big(\frac{\partial J}{\partial\theta_j}\Big)^{\!\top} \Big) + \tfrac{1}{2} \operatorname{tr}\Big( (J^\top J)^{-1} J^\top \frac{\partial J}{\partial\theta_j} \Big)    (27)
      = \tfrac{1}{2} \operatorname{tr}\Big( (J^\top J)^{-1} J^\top \frac{\partial J}{\partial\theta_j} \Big) + \tfrac{1}{2} \operatorname{tr}\Big( (J^\top J)^{-1} J^\top \frac{\partial J}{\partial\theta_j} \Big)    (28)
      = \operatorname{tr}\Big( J^+ \frac{\partial J}{\partial\theta_j} \Big),    (29)

where we used the cyclic property of the trace and that tr(AB) = tr(A^\top B^\top). J^+ = (J^\top J)^{-1} J^\top is the Moore-Penrose inverse of J.

Now we will do an equivalent derivation for the encoder. Observe that, since

    (J^+ J^{+\top})^{-1} = \big( (J^\top J)^{-1} J^\top J (J^\top J)^{-1} \big)^{-1} = J^\top J,    (30)

we can rewrite the log-determinant term using the encoder Jacobian (which equals J^+ for an optimal encoder-decoder pair, see appendix B.1):

    \tfrac{1}{2} \log\det(J^\top J) = -\tfrac{1}{2} \log\det(J^+ J^{+\top}).    (31)

Note the negative sign on the right-hand side. The derivation for the derivative is very similar to that for the decoder, where we now write \tilde{J} for the encoder Jacobian and take a derivative with respect to encoder parameters \phi:

    \frac{\partial}{\partial\phi_j} \tfrac{1}{2} \log\det(\tilde{J} \tilde{J}^\top)
      = \tfrac{1}{2} \operatorname{tr}\Big( (\tilde{J} \tilde{J}^\top)^{-1} \frac{\partial}{\partial\phi_j}(\tilde{J} \tilde{J}^\top) \Big)    (32)
      = \tfrac{1}{2} \operatorname{tr}\Big( (\tilde{J} \tilde{J}^\top)^{-1} \frac{\partial \tilde{J}}{\partial\phi_j} \tilde{J}^\top \Big) + \tfrac{1}{2} \operatorname{tr}\Big( (\tilde{J} \tilde{J}^\top)^{-1} \tilde{J} \Big(\frac{\partial \tilde{J}}{\partial\phi_j}\Big)^{\!\top} \Big)    (33)
      = \tfrac{1}{2} \operatorname{tr}\Big( \frac{\partial \tilde{J}}{\partial\phi_j} \tilde{J}^\top (\tilde{J} \tilde{J}^\top)^{-1} \Big) + \tfrac{1}{2} \operatorname{tr}\Big( \frac{\partial \tilde{J}}{\partial\phi_j} \tilde{J}^\top (\tilde{J} \tilde{J}^\top)^{-1} \Big)    (34)
      = \operatorname{tr}\Big( \frac{\partial \tilde{J}}{\partial\phi_j} \tilde{J}^\top (\tilde{J} \tilde{J}^\top)^{-1} \Big)    (35)
      = \operatorname{tr}\Big( \frac{\partial \tilde{J}}{\partial\phi_j} \tilde{J}^+ \Big),    (36)

where we used the cyclic property of the trace, that tr(AB) = tr(A^\top B^\top), and that \tilde{J}^+ = \tilde{J}^\top (\tilde{J} \tilde{J}^\top)^{-1}.

Recall eq. (9) in section 4.1, which gives the surrogate for the log-determinant term:

    \frac{1}{K} \sum_{k=1}^{K} \mathrm{stop\_gradient}\big( \epsilon_k^\top f'(\hat{x}) \big)\, g'(z) \epsilon_k.    (37)

This is formulated in terms of the Jacobian of the decoder, in other words it is derived from \det(J^\top J). The equivalent term, formulated in terms of the Jacobian of the encoder, should be derived from \det(\tilde{J} \tilde{J}^\top) and is therefore (note the negative sign):

    -\frac{1}{K} \sum_{k=1}^{K} \epsilon_k^\top f'(\hat{x})\, \mathrm{stop\_gradient}\big( g'(z) \epsilon_k \big).    (38)

We write it in the order f'(\hat{x}) g'(z) rather than g'(z) f'(\hat{x}) since this reduces the variance of the estimate. See appendix D.1 for further details on this point.
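The identity in eq. (29) is easy to verify numerically; the following NumPy sketch (ours) compares it against a finite-difference derivative for a randomly parameterized tall matrix J(t) = J_0 + t J_1:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 12, 4
J0, J1 = rng.normal(size=(D, d)), rng.normal(size=(D, d))

def half_logdet(t):
    J = J0 + t * J1
    return 0.5 * np.linalg.slogdet(J.T @ J)[1]

h = 1e-5
finite_diff = (half_logdet(h) - half_logdet(-h)) / (2 * h)
analytic = np.trace(np.linalg.pinv(J0) @ J1)  # tr(J^+ dJ/dt), eq. (29)
print(finite_diff, analytic)  # agree up to finite-difference error
```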
B.1 NOTE ON OPTIMALITY WITH RESPECT TO RECONSTRUCTION LOSS

Our estimator relies on the approximation f'(\hat{x}) \approx g'(f(x))^+. If f and g are consistent (f \circ g is the identity), this guarantees that f'(\hat{x})\, g'(f(x)) = I, but not that f'(\hat{x}) is the Moore-Penrose inverse of g'(f(x)). A sufficient requirement is that f and g are optimal with respect to the reconstruction loss, that is, any variation in the functions would lead to a higher reconstruction loss. With calculus of variations, it is possible to show that such f and g are consistent, and

    (\hat{x} - x)^\top g'(f(x)) = 0    (39)

for all x. By taking the derivative with respect to x and evaluating at some x in the image of g (so \hat{x} = x) we have that

    f'(\hat{x})^\top g'(f(x))^\top g'(f(x)) - g'(f(x)) = 0    (40)

and hence

    f'(\hat{x}) = \big( g'(f(x))^\top g'(f(x)) \big)^{-1} g'(f(x))^\top = g'(f(x))^+.    (41)

In the remainder of the appendix, given an encoder-decoder pair f and g which are optimal with respect to the reconstruction loss, we refer to g as the pseudoinverse of f, and f as the pseudoinverse of g.

Figure 4: Representation of the ill-defined probability density p(x) \propto p(\hat{x}) e^{-\beta \|\hat{x} - x\|^2} (left and center). Solid black lines denote the manifold, dashed lines are a constant distance from the manifold. The probability density is constant along the manifold. The width of the cyan bands is proportional to e^{-\beta \|\hat{x} - x\|^2} and represents the probability density along the on- and off-manifold contours. While the density is reasonable for a flat manifold (left), note that the amount of probability mass associated with a region of the manifold (bounded by solid lines) is larger at some points off the manifold than on it when the manifold has curvature (center). This behavior can lead to divergent solutions when optimizing for likelihood and should be compensated for. The appropriate compensation factor is the ratio of the volume of a small region on the manifold (small blue square embedded in green manifold, right) to the equivalent region off the manifold (large blue square, right). The blue arrows represent an orthonormal frame on the manifold, and the equivalent frame in the off-manifold region.

B.2 MODIFIED ESTIMATOR

As stated in section 4.2, we modify the log-determinant estimator (eq. (38)) by replacing f'(\hat{x}) by f'(x). The loss we are trying to optimize (sending gradient from the log-determinant to the encoder) is:

    L(x) = -\log p_Z(f(x)) - \log \mathrm{vol}(f'(\hat{x})) + \beta \|\hat{x} - x\|^2    (42)

with \mathrm{vol}(f'(\hat{x})) = \sqrt{\det(f'(\hat{x}) f'(\hat{x})^\top)}. Consider the probability density p implied by interpreting this loss as a negative log-likelihood:

    p(x) \propto p(\hat{x})\, e^{-\beta \|\hat{x} - x\|^2} = p_Z(z)\, \mathrm{vol}(f'(\hat{x}))\, e^{-\beta \|\hat{x} - x\|^2},    (43)

where p(\hat{x}) is the on-manifold density. Unfortunately, this density is ill-defined and leads to pathological behavior (see fig. 4). In order to provide a correction to this density, we need a term which compensates for the volume increase or decrease of off-manifold regions in comparison to the on-manifold region they are projected to. This is depicted in fig. 4 (right). The blue arrows in the on-manifold region will span the same latent-space volume as the blue arrows in the off-manifold region. The change in volume between the depicted on-manifold region and the latent space is \mathrm{vol}(f'(\hat{x})), and \mathrm{vol}(f'(x)) between the off-manifold region and the latent space. Combining these facts means

    \mathrm{vol}(f'(\hat{x})) \cdot (\text{volume of on-manifold region}) = \mathrm{vol}(f'(x)) \cdot (\text{volume of off-manifold region})    (44)

and hence the ratio of the volume of the on-manifold region to the off-manifold region is \mathrm{vol}(f'(x)) / \mathrm{vol}(f'(\hat{x})).
Multiplying p(x) by this factor leads to

    p(x)\, \frac{\mathrm{vol}(f'(x))}{\mathrm{vol}(f'(\hat{x}))} = p_Z(z)\, \mathrm{vol}(f'(x))\, e^{-\beta \|\hat{x} - x\|^2}    (45)

and the corresponding negative log-likelihood loss is

    L(x) = -\log p_Z(f(x)) - \log \mathrm{vol}(f'(x)) + \beta \|\hat{x} - x\|^2.    (46)

The surrogate for the log-determinant term is therefore

    -\operatorname{tr}\big( f'(x)\, \mathrm{stop\_gradient}(f'(x)^+) \big).    (47)

In order to maintain computational efficiency, we approximate f'(x)^+ by g'(f(x)):

    -\operatorname{tr}\big( f'(x)\, \mathrm{stop\_gradient}(g'(f(x))) \big),    (48)

giving the stated correction.

C LINEAR MODEL TRAINED ON MAXIMUM LIKELIHOOD ALONE

Consider a linear model, trained on data with zero mean and covariance \Sigma. Let the encoder function be f(x) = Ax and suppose that A has positive singular values, meaning that AA^\top is positive definite. Let the decoder function be g(z) = A^+ z, where A^+ = A^\top (AA^\top)^{-1}. We want to minimize a combination of negative log-likelihood and a reconstruction loss (here we use 1/(2\sigma^2) instead of \beta as prefactor):

    L = L_{NLL} + \tfrac{1}{2\sigma^2} L_{recon.}    (49)
      = E_x\big[ \tfrac{1}{2} \|Ax\|^2 \big] - \tfrac{1}{2} \log\det(AA^\top) + \tfrac{1}{2\sigma^2} E_x\big[ \|A^+ A x - x\|^2 \big]    (50)
      = \tfrac{1}{2} E_x[x^\top A^\top A x] - \tfrac{1}{2} \log\det(AA^\top) + \tfrac{1}{2\sigma^2} E_x[x^\top (A^+ A - I)^2 x]    (51)
      = \tfrac{1}{2} \operatorname{tr}(A\, E_x[x x^\top]\, A^\top) - \tfrac{1}{2} \log\det(AA^\top) + \tfrac{1}{2\sigma^2} \operatorname{tr}(E_x[x x^\top](I - A^+ A))    (52)
      = \tfrac{1}{2} \operatorname{tr}(A \Sigma A^\top) - \tfrac{1}{2} \log\det(AA^\top) + \tfrac{1}{2\sigma^2} \operatorname{tr}(\Sigma (I - A^+ A)),    (53)

where A is a full rank d \times D matrix with d \le D. Before solving for the minimum, let's review some matrix calculus identities. It is often convenient to consider A as a function of a single variable x, differentiate with respect to x, and then choose x to be A_{ij}. Then the derivative is

    \frac{d}{dA_{ij}} A = E^{(ij)},    (54)

where E^{(ij)} is a matrix of zeros, except for the ij entry which is a one. We can write this as E^{(ij)}_{kl} = \delta_{ik}\delta_{jl}, where \delta is the Kronecker delta. When evaluating E^{(ij)} inside a trace we get the simple expression

    \operatorname{tr}(E^{(ij)} B) = E^{(ij)}_{kl} B_{lk} = \delta_{ik}\delta_{jl} B_{lk} = B_{ji},    (55)

using Einstein notation. The additional matrix identities we will need are Jacobi's formula for a square invertible matrix B:

    \frac{d}{dx} \det(B) = \det(B) \operatorname{tr}\Big( B^{-1} \frac{dB}{dx} \Big)    (56)

and hence

    \frac{d}{dx} \log\det(B) = \operatorname{tr}\Big( B^{-1} \frac{dB}{dx} \Big),    (57)

and we will prove the following lemma.

Lemma C.1. Suppose the matrix A depends on a variable x. Then we have the following expression for the derivative of the projection operator A^+ A:

    \frac{d}{dx}(A^+ A) = A^+ \frac{dA}{dx} (I - A^+ A) + \Big( A^+ \frac{dA}{dx} (I - A^+ A) \Big)^{\!\top}.    (58)

Proof. The following is based on the proof of lemma 4.1 in Golub & Pereyra (1973). Define the projection operator P_A = A^+ A and its complement P_A^\perp = I - P_A. Then, since P_A P_A = P_A,

    \frac{d}{dx} P_A = \frac{dP_A}{dx} P_A + P_A \frac{dP_A}{dx},    (59)

and multiplying by P_A from both sides shows that

    P_A \frac{dP_A}{dx} P_A = 0.    (60)

In addition, since A P_A = A, we have \frac{dA}{dx} P_A + A \frac{dP_A}{dx} = \frac{dA}{dx} and therefore

    A \frac{dP_A}{dx} = \frac{dA}{dx} P_A^\perp.    (61)

By similar steps but using P_A A^\top = A^\top, we can derive

    \frac{dP_A}{dx} A^\top = P_A^\perp \Big( \frac{dA}{dx} \Big)^{\!\top}.    (62)

Putting it all together gives

    \frac{d}{dx} P_A = A^+ \frac{dA}{dx} P_A^\perp + \Big( A^+ \frac{dA}{dx} P_A^\perp \Big)^{\!\top}.    (63)

Note that the second term is just the transpose of the first.

Now we are ready to find the derivative of the loss and set it to zero.

Lemma C.2. The derivative of the loss with respect to A takes the form:

    \frac{dL}{dA} = \Sigma A^\top - A^+ - \frac{1}{\sigma^2} (I - A^+ A) \Sigma A^+.    (64)

Proof. Let's apply the above identities to the first term in the loss:

    \frac{d}{dx} \tfrac{1}{2} \operatorname{tr}(A \Sigma A^\top) = \tfrac{1}{2} \operatorname{tr}\Big( \frac{dA}{dx} \Sigma A^\top + A \Sigma \Big(\frac{dA}{dx}\Big)^{\!\top} \Big)    (65)
      = \operatorname{tr}\Big( \frac{dA}{dx} \Sigma A^\top \Big),    (66)

since the trace is invariant under transposition, and hence

    \frac{\partial}{\partial A_{ij}} \tfrac{1}{2} \operatorname{tr}(A \Sigma A^\top) = \operatorname{tr}(E^{(ij)} \Sigma A^\top) = (\Sigma A^\top)_{ji}.    (67)

Applying Jacobi's formula to the second term in the loss gives:

    \frac{d}{dx} \tfrac{1}{2} \log\det(AA^\top) = \tfrac{1}{2} \operatorname{tr}\Big( (AA^\top)^{-1} \frac{d}{dx}(AA^\top) \Big)    (68)
      = \tfrac{1}{2} \operatorname{tr}\Big( (AA^\top)^{-1} \Big( \frac{dA}{dx} A^\top + A \Big(\frac{dA}{dx}\Big)^{\!\top} \Big) \Big)    (69)

and therefore

    \frac{\partial}{\partial A_{ij}} \tfrac{1}{2} \log\det(AA^\top) = \tfrac{1}{2} \operatorname{tr}\big( (AA^\top)^{-1} ( E^{(ij)} A^\top + A E^{(ji)} ) \big)    (70)
      = \tfrac{1}{2} \operatorname{tr}\big( E^{(ij)} A^\top (AA^\top)^{-1} + E^{(ij)} A^\top (AA^\top)^{-1} \big)    (71)
      = \operatorname{tr}\big( E^{(ij)} A^\top (AA^\top)^{-1} \big)    (72)
      = \big( A^\top (AA^\top)^{-1} \big)_{ji} = (A^+)_{ji},    (73)

where we used the cyclic and transpose properties of the trace and that E^{(ji)} = (E^{(ij)})^\top.
We use the formula for the derivative of the projection operator to get d dx 1 2 tr(ΣA A) = 1 dx(A A) (74) dx A (I A A) (75) again using the transpose property of the trace, and therefore 1 2 tr(ΣA A) = tr(E(ij)(I A A)ΣA ) = ((I A A)ΣA )ji (76) Putting the three expressions together, we have that d d AL = ΣA A 1 σ2 (I A A)ΣA (77) Lemma C.3. The critical points of L satisfy the following properties: 1. A = UΣ 1 2 with UU = I 2. U U commutes with Σ Proof. Using lemma C.2, the critical points satisfy σ2 (I A A)ΣA = 0 (78) By multiplying by A from the left we have AΣA = Id (79) meaning that U = AΣ 1 2 must have orthonormal rows (since UU = I). With this definition, we can write A = UΣ 1 If we now multiply by A from the right, we get σ2 ΣA A + 1 σ2 A AΣA A = 0 (80) Noting that the second and fourth terms are symmetric (since A A is symmetric), this means that the remaining terms must be symmetric: σ2 ΣA A = A AΣ 1 σ2 A AΣ (81) Since A A commutes with A A, they are simultaneously diagonalizable, and since they are both symmetric, they share an orthonormal basis of eigenvectors. Clearly A A A A/σ2 has the same basis. Since this matrix commutes with Σ, it must share a basis with Σ and hence Σ has the same basis as A A. This means that Σ commutes with A A. Expanding A in terms of U, this means that ΣA A = Σ 1 2 U UΣ 1 2 U UΣ 1 2 = A AΣ (82) and therefore ΣU U = U UΣ, meaning that Σ and U U commute. Consider as an example the case where Σ is diagonal. U U is a projection matrix and in this case must be diagonal due to commuting with Σ. As a result, it must have exactly d ones and D d zeros along the diagonal. This means that the rows of U are a basis for the d dimensional axis-aligned subspace corresponding to the d nonzero entries. In the case of a non-diagonal Σ, this generalizes to the rows of U spanning the same subspace as some subset of d eigenvectors of Σ. This leads to the expression of the loss function in the next theorem. Published as a conference paper at ICLR 2024 Figure 5: Plot of f(λi) = log λi λi/σ2 with σ = 1, showing maximum value at λi = σ2 and unbounded behavior on either side. Theorem C.4. Let Σ have the eigen-decomposition V ΛV with Λ = diag(λ). Let U U have the eigen-decomposition V EV , with E = diag(α). Then the minimum of the loss is satisfied by α such that is minimal, subject to the constraint αi {0, 1} with PD i=1 αi = d. Proof. Let s note a couple of properties. First, we have (UΣU )k = UΣk U due to Σ commuting with U U, so we can say that f(UΣU T ) = Uf(Σ)U for any matrix function f with a Taylor series. Next, U U is an orthogonal projection matrix, so E is a diagonal matrix with ones or zeros on the diagonal. We know that the rank of U is d, hence E has exactly d ones and D d zeros along the diagonal. Therefore we have the constraint αi {0, 1} with PD i=1 αi = d. Next, note that A A = U U. Now we substitute back into the loss in terms of U: 2 tr(UU ) 1 2 log det(UΣ 1U ) + 1 2σ2 tr(Σ(I U U)) (84) 2 tr(U log(Σ)U ) 1 2σ2 tr(UΣU ) + const. (85) where we used that log det(UΣ 1U ) = tr log(UΣ 1U ) = tr(U log(Σ)U ) (86) Note that tr(UU ) is constant. Consider that tr(UΣU ) = tr(U UΣU U) = tr(V EDEV ) = tr(EDE) = α λ (87) The same logic holds for the term with log(Σ). Therefore, dropping constant terms, the loss can be written in terms of α and λ: Published as a conference paper at ICLR 2024 The loss will take different values depending on which elements of α are nonzero. Define f(λi) = log λi λi/σ2. 
The loss will be minimized when the nonzero α_i correspond to those values of f(λ_i) which are minimal. Clearly f''(λ_i) = −1/λ_i^2 < 0, so f is concave with a single maximum at λ_i = σ^2 and is unbounded below on either side of this maximum (see fig. 5). Consider the two extreme cases:
1. All eigenvalues λ are smaller than σ^2. The minimal values of f(λ_i) will occur for the smallest values of λ_i. Hence the d smallest eigenvalues of Σ will be selected.
2. All eigenvalues λ are larger than σ^2. The minimal values of f(λ_i) will occur for the largest values of λ_i. Hence the d largest eigenvalues of Σ will be selected.
In the intermediate regime, there will be a phase transition between these two extremes. In the first case, the reconstructed manifold will be a projection onto the d-dimensional subspace with the lowest variance, exactly the opposite result to PCA. In the second case, the reconstructed manifold will be a projection onto the d-dimensional subspace with the highest variance, exactly the same result as PCA.

If we maximize the likelihood on the manifold without any reconstruction loss, corresponding to the limit σ^2 → ∞, we actually learn the lowest-entropy manifold. It makes more sense to learn the highest-entropy manifold, as in PCA. We can ensure this is the case by adding Gaussian noise of variance σ^2 to the data, ensuring that the minimum eigenvalue of the covariance matrix is at least σ^2, even if the original data is degenerate.

D IMPLEMENTATION DETAILS

In implementing the trace estimator, we have to make a number of choices, each elaborated below. The main reasons for each choice are given first, with more technical details deferred to later in the appendix.

Gradient to encoder or decoder. The log-determinant term can be formulated either in terms of the Jacobian of the encoder (see eq. (10)) or the decoder (see eq. (9)). As discussed in section 4.2, we find that formulating it in terms of the encoder Jacobian leads to more stable training. Since the training also minimizes the squared norm of f(x), we speculate that having the gradient from this term and the log-determinant term both being sent to the encoder allows the encoder to more efficiently shape the latent space distribution. We note the similarity of this formulation to the standard change-of-variables loss used to train normalizing flows. If we instead send the gradient of the log-determinant to the decoder, the information about how the encoder can change can only reach it via the reconstruction term, which doesn't allow the encoder to deviate significantly from being the pseudoinverse of the decoder. A change in the decoder will therefore lead to a corresponding change in the encoder, but this is a less direct process than sending gradient to the encoder directly. In addition, this formulation means the decoder is optimized only to minimize reconstruction loss, so that it will likely be an approximate pseudoinverse of the encoder, a condition we require for the accuracy of the surrogate estimator.

Space in which trace is performed. Considering eqs. (29) and (36), the central component of the surrogate is a trace (estimator). Making use of the cyclic property of the trace, i.e. tr(A^T B) = tr(B A^T) for any A, B ∈ R^{D×d}, we can choose which expansion of the trace to estimate:
∑_{i=1}^d (A^T B)_ii = tr(A^T B) = tr(B A^T) = ∑_{i=1}^D (B A^T)_ii.   (89)
The variance of a stochastic trace estimator depends on the noise used, but in general it is roughly proportional to the squared Frobenius norm of the matrix (see appendix D.3).
Given two matrices A, B ∈ R^{D×d} with d < D, it is likely that ||A^T B||_F^2 < ||B A^T||_F^2. This statement is not true for all A and B, but it is almost always fulfilled when d ≪ D. Transferred to our context: in general the matrices f'(x) ∈ R^{d×D} and g'(z) ∈ R^{D×d} are rectangular and can be multiplied together in either the f'(x) g'(z) ∈ R^{d×d} order or the g'(z) f'(x) ∈ R^{D×D} order. This matters for applying the trace since generally d < D. A more precise statement (proven in appendix D.1) is that if the entries of A and B are sampled from standard normal distributions, then E[||A^T B||_F^2] = D d^2 versus E[||B A^T||_F^2] = D^2 d. For d ≪ D the difference becomes significant.

Table 4: Different possible estimators for the gradient of the log-determinant term.

                          gradient to encoder              gradient to decoder
trace in data space       tr( g'(z) ∂_{ϕ_j} f'(x) )        tr( ∂_{θ_j} g'(z) f'(x) )
trace in latent space     tr( ∂_{ϕ_j} f'(x) g'(z) )        tr( f'(x) ∂_{θ_j} g'(z) )

The difference between the two estimators may not be large if the two matrices have special structure, in particular if they share a basis. However, since the terms being multiplied in our case are a Jacobian matrix and the derivative of another Jacobian matrix with respect to a parameter θ_j or ϕ_j, it is unlikely that any such particular structure is present. As a result, when the latent space is smaller than the data space, the preferable estimator is the one that performs the trace in the latent space, meaning that products in the estimator have the order f'(x) g'(z) (see table 4). In appendix D.1.1, we experimentally test the convergence of trace estimators with increasing Hutchinson samples, performed in both data and latent space, confirming that convergence is much faster when performing the trace in latent space.

Type of gradient. Consider the estimator
∂_{ϕ_j} tr( f'(x) stop_gradient(g'(z)) ) = ∂_{ϕ_j} E_ϵ[ ϵ^T f'(x) stop_gradient(g'(z)) ϵ ].   (90)
Ignoring the stop gradient operation for now, this requires computing terms of the form ϵ^T f'(x) g'(z) ϵ. In order to avoid calculating full Jacobian matrices, we can implement the calculation using some combination of vector-Jacobian (vjp) or Jacobian-vector (jvp) products, which are efficient to compute with backward-mode respectively forward-mode automatic differentiation. Note that we can use the result from one product as the vector for another vjp or jvp. For example, v_1 := (ϵ^T f'(x))^T ∈ R^D yields a vector, so we can compute ϵ^T f'(x) g'(z) ϵ = v_1^T g'(z) ϵ via two vector-Jacobian products. This gives us three choices: i) backward mode only (two vjp), ii) forward mode only (two jvp) or iii) a mix of both (one vjp and one jvp). We opt to use mixed mode (see appendix D.2 for further details).

Trace estimator noise. Trace estimators rely on the identity E_ϵ[ϵ^T A ϵ] = tr(A E_ϵ[ϵ ϵ^T]) = tr(A), meaning that we require only E_ϵ[ϵ ϵ^T] = I for the noise variable. The choice comes down mainly to the variance of the estimator. Among all noise vectors whose entries are sampled independently, Rademacher noise has the lowest variance (Hutchinson, 1989). However, if the entries are sampled from a standard normal distribution and then scaled such that the vector has length √d, where d is the dimension of ϵ, the entries are no longer independent and the variance of the estimator is comparable to Rademacher noise (Girard, 1989). When using a single Hutchinson sample, we choose to use scaled Gaussian noise for its low variance, and since it covers more directions than Rademacher noise (covering the hypersphere uniformly, rather than a fixed set of 2^d points).
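The following minimal sketch (ours, not the repository code) makes this comparison concrete: it estimates the trace of a random symmetric matrix with Rademacher, standard Gaussian and scaled Gaussian noise. The matrix A is just a stand-in for the symmetric part of the Jacobian product.

```python
import torch

torch.manual_seed(0)
d = 64
M = torch.randn(d, d)
A = 0.5 * (M + M.T)          # random symmetric test matrix

def hutchinson(A, eps):
    """Single-sample Hutchinson estimate eps^T A eps."""
    return eps @ A @ eps

n_trials = 20_000
estimates = {"rademacher": [], "gaussian": [], "scaled gaussian": []}
for _ in range(n_trials):
    g = torch.randn(d)
    estimates["gaussian"].append(hutchinson(A, g))
    r = torch.randint(0, 2, (d,)).float() * 2 - 1            # entries +1 / -1
    estimates["rademacher"].append(hutchinson(A, r))
    s = g / g.norm() * d ** 0.5                               # scaled to length sqrt(d)
    estimates["scaled gaussian"].append(hutchinson(A, s))

print("exact trace:", A.trace().item())
for name, vals in estimates.items():
    vals = torch.stack(vals)
    print(f"{name:16s} mean={vals.mean():8.3f}  var={vals.var():10.3f}")
```

For a generic symmetric matrix all three variances come out comparable; larger differences only appear for matrices with special structure (e.g. a dominant diagonal), as discussed in appendix D.3.4.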
When we have more than one Hutchinson sample, we additionally orthogonalize the vectors, as this further reduces variance. More details are in appendix D.3.

Number of noise samples. We can choose to use between 1 and d noise samples in the trace estimator (with d samples we can already calculate the exact trace, so more samples are not necessary). Denote the number of samples by K. We find that in general K = 1 is enough for good performance, especially if the batch size is sufficiently high.

D.1 VARIANCE OF TRACE ESTIMATOR

Theorem D.1. Let A, B ∈ R^{D×d} where the entries of both matrices are sampled from a standard normal distribution. Then
E ||A^T B||_F^2 = d^2 D and E ||B A^T||_F^2 = d D^2.   (91)

Figure 6: Relative gradient distance to the exact surrogate gradient as a function of Hutchinson samples for varying batch sizes. (Left) Trace estimation in data space. (Right) Trace estimation in latent space. (Axes: number of Hutchinson samples vs. relative gradient distance; batch sizes 1, 16, 64, 256, 512.) We estimate the trace using orthogonalized Gaussian noise (see appendix D.3.5) (solid lines), but also feature non-orthogonalized samples for batch size 1 in latent space (dashed line). Note the different scales on the y-axes and the use of symlog in the right-hand plot. We evaluate the trace estimation on the surrogate estimator with gradient to encoder as specified in table 4 for a converged model trained on conditional MNIST (d = 16, D = 784).

Proof. Consider first ||A^T B||_F^2. We can write this as
||A^T B||_F^2 = ∑_{i,j} ( ∑_k A_ki B_kj )^2 = ∑_{i,j} ∑_{k,l} A_ki A_li B_kj B_lj.   (92)
Taking an expectation over this expression, the only nonzero contributions will be from terms where the A and B factors are both quadratic, since if not, the term will be multiplied by E[X] = 0 where X is standard normal. This requires k = l, giving
E ||A^T B||_F^2 = ∑_{i,j,k} E[ A_ki^2 B_kj^2 ]   (93)
= ∑_{i,j,k} E[A_ki^2] E[B_kj^2]   (94)
= d^2 D,   (95)
since A_ki and B_kj are independent, the expectation of the square of a standard normal variable is its variance, i.e. 1, and the sum runs over i, j ∈ {1, ..., d} and k ∈ {1, ..., D}. The equivalent expression for ||B A^T||_F^2 can be obtained by swapping d and D.

D.1.1 EXPERIMENTAL CONFIRMATION

To evaluate the convergence behavior of the trace estimation, which is exact for d and D Hutchinson samples in latent and data space respectively, we compute the relative gradient distance of the resulting surrogate gradient with respect to the exact solution as a function of the number of Hutchinson samples K:
relative gradient distance(K) = ||surrogate(K) − exact||_2 / ||exact||_2.   (96)
Here, surrogate(K) denotes the gradient of the surrogate loss term after K Hutchinson samples and exact the gradient of the exact surrogate loss term, i.e. after d or D samples. In fig. 6 one can see a clear decrease in gradient distance and its variance when computing the trace in latent space instead of data space. Furthermore, we note that increasing the batch size also contributes to fast and steady convergence, which is a result of sampling an independent noise sample per batch instance.

D.2 FORWARD/BACKWARD AUTOMATIC DIFFERENTIATION

Two of the basic building blocks of automatic differentiation (autodiff) libraries are the vector-Jacobian product (vjp) and the Jacobian-vector product (jvp).
The vjp is implemented by backward-mode autodiff and computes a vector multiplied with a Jacobian matrix from the left, along with the output of the function being used:
( f(x), ϵ^T f'(x) ) = vjp(f, x, ϵ).   (97)
In PyTorch, this is implemented by the torch.autograd.functional.vjp function, or by first computing f(x), then using torch.autograd.grad. The jvp is implemented by forward-mode autodiff and computes a vector multiplied by a Jacobian matrix from the right, along with the output of the function being used:
( f(x), f'(x) ϵ ) = jvp(f, x, ϵ).   (98)
In PyTorch, this is implemented by the torch.autograd.forward_ad package.

Our preferred estimator for the log-determinant is the following (see section 4.2):
(1/K) ∑_{k=1}^K ϵ_k^T f'(x) stop_gradient( g'(z) ϵ_k ).   (99)
Therefore we need to compute terms of the form ϵ^T f'(x) g'(z) ϵ, with a stop gradient operation. The stop gradient operation is implemented by applying the .detach() method to a tensor in PyTorch. We have the following options. Note that we can use a product obtained from vjp or jvp as a vector input to a subsequent product.

Backward mode. This mode uses only vector-Jacobian products, requiring backward-mode autodiff.
1. v_1^T = ϵ^T f'(x), computed as (z, v_1^T) = vjp(f, x, ϵ)
2. v_2^T = v_1^T g'(z), computed as (x̂, v_2^T) = vjp(g, z, v_1)
3. ϵ^T f'(x) g'(z) ϵ = v_2^T ϵ

Forward mode. This mode uses only Jacobian-vector products, requiring forward-mode autodiff.
1. v_1 = g'(z) ϵ, computed as (x̂, v_1) = jvp(g, z, ϵ)
2. v_2 = f'(x) v_1, computed as (z, v_2) = jvp(f, x, v_1)
3. ϵ^T f'(x) g'(z) ϵ = ϵ^T v_2

Mixed mode. This mode uses one vector-Jacobian and one Jacobian-vector product, requiring both forward- and backward-mode autodiff.
1. v_1^T = ϵ^T f'(x), computed as (z, v_1^T) = vjp(f, x, ϵ)
2. v_2 = g'(z) ϵ, computed as (x̂, v_2) = jvp(g, z, ϵ)
3. ϵ^T f'(x) g'(z) ϵ = v_1^T v_2

We prefer using backward-mode autodiff where possible, since we find that it is slightly faster than forward mode in PyTorch. However, for our estimator of choice, we use mixed mode, since this is most easily implemented. Using backward mode would require a stop gradient operation to be introduced after the second step, but in a way that allows gradient to flow back to f'(x). While we believe this is possible if implemented carefully, we did not pursue this option. In mixed mode, we can easily detach the gradient of v_2 without affecting the first step of the calculation.
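The following sketch shows one way the mixed-mode computation could look in PyTorch. It is a simplified illustration under stated assumptions, not the reference implementation from the repository: the helper names are ours, a single scaled-Gaussian noise sample is used, and the encoder and decoder are assumed to process batch elements independently (e.g. no batch normalization).

```python
import torch
import torch.autograd.forward_ad as fwAD

def fif_surrogate(f, g, x):
    """One-sample mixed-mode surrogate: eps^T f'(x) stop_gradient(g'(f(x)) eps)."""
    x = x.detach().requires_grad_(True)   # leaf input so we can take a vjp w.r.t. it
    z = f(x)                              # encoder pass, shape (batch, d)
    eps = torch.randn_like(z)
    eps = eps / eps.norm(dim=-1, keepdim=True) * z.shape[-1] ** 0.5   # scaled Gaussian noise

    # jvp through the decoder: v2 = g'(z) eps, detached (the stop gradient).
    with fwAD.dual_level():
        dual_out = g(fwAD.make_dual(z.detach(), eps))
        v2 = fwAD.unpack_dual(dual_out).tangent.detach()

    # vjp through the encoder: v1 = eps^T f'(x); create_graph=True keeps the
    # dependence on the encoder parameters, so differentiating the surrogate
    # yields the desired gradient.
    v1 = torch.autograd.grad(z, x, grad_outputs=eps, create_graph=True)[0]

    surrogate = (v1 * v2).flatten(1).sum(dim=1)   # eps^T f'(x) g'(z) eps per sample
    return surrogate, z, g(z)

def fif_loss(f, g, x, beta):
    """Loss of eq. (46) with eq. (48) as log-det surrogate, assuming a standard
    normal latent distribution."""
    surrogate, z, x_hat = fif_surrogate(f, g, x)
    nll = 0.5 * (z ** 2).flatten(1).sum(dim=1) - surrogate
    recon = ((x_hat - x) ** 2).flatten(1).sum(dim=1)
    return (nll + beta * recon).mean()
```

Since v_2 is detached, the decoder only receives gradients through the reconstruction term, consistent with the discussion in appendix D.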
D.3 PROPERTIES OF TRACE ESTIMATOR NOISE

Hutchinson-style trace estimators (Hutchinson, 1989) have the form E_ϵ[ϵ^T A ϵ] and equal tr(A) in expectation. If A is skew-symmetric (A^T = −A), then (ϵ^T A ϵ)^T = ϵ^T A^T ϵ = −ϵ^T A ϵ and hence ϵ^T A ϵ = 0 with zero variance. Since any matrix can be decomposed into a symmetric and a skew-symmetric part, the variance in the estimator comes only from the symmetric part of A, namely A_s = (A + A^T)/2. From now on, suppose A is symmetric and if not, substitute A_s for A.

D.3.1 RADEMACHER NOISE

If the entries of ϵ are sampled independently from a distribution with zero mean and unit variance, then the variance of the estimator is minimized by the Rademacher distribution, which samples the values +1 and −1 each with probability one half. This estimator achieves the following variance for symmetric A (see proposition 1 in Hutchinson (1989)):
V_ϵ[ϵ^T A ϵ] = 2 ∑_{i≠j} A_ij^2.   (100)

D.3.2 GAUSSIAN NOISE

With standard normal noise, the estimator is unbiased, but the variance is higher (see again Hutchinson (1989)):
V_ϵ[ϵ^T A ϵ] = 2 ∑_{i,j} A_ij^2 = 2 ||A||_F^2,   (101)
i.e. twice the squared Frobenius norm.

D.3.3 SCALED GAUSSIAN NOISE

By contrast, the estimator of Girard (1989),
ϵ^T A ϵ / (ϵ^T ϵ), has expectation (1/d) tr(A),   (102)
where ϵ is a standard normal variable in R^d. The variance of this estimator for symmetric A (see theorem 2.2 in Girard (1989)) is
V_ϵ[ ϵ^T A ϵ / (ϵ^T ϵ) ] = (2/(d + 2)) σ^2(λ(A)),   (103)
where σ^2(λ(A)) denotes the variance of the eigenvalues of A. We can write this estimator in the Hutchinson form by sampling ϵ from a standard normal distribution, then normalizing it such that its length is √d. Then we have
E_ϵ[ϵ^T A ϵ] = tr(A),   (104)
V_ϵ[ϵ^T A ϵ] = (2d^2/(d + 2)) σ^2(λ(A)).   (105)

D.3.4 COMPARISON

When the dimension of A becomes large, the variances of the Rademacher and scaled Gaussian estimators are comparable. Suppose that the eigenvalues of A have zero mean (e.g. the entries are independent normal samples). Then
d σ^2(λ(A)) = ∑_i λ_i^2 = tr(A^2) = ||A||_F^2.   (106)
If we further assume that all entries of A have roughly equal magnitude, we have that
∑_{i≠j} A_ij^2 ≈ ||A||_F^2,   (107)
since the sum in the Frobenius norm is dominated by the d(d − 1) off-diagonal terms. Similarly,
(d^2/(d + 2)) σ^2(λ(A)) ≈ d σ^2(λ(A)) = ||A||_F^2,   (108)
meaning that the two estimators have approximately the same variance.

If the matrix has special structure, we might choose one estimator over the other. For example, if the standard deviation of the eigenvalues of A is small in comparison to the mean eigenvalue, the scaled Gaussian estimator is preferable, and if A is dominated by its diagonal then the Rademacher estimator is preferable. We don't expect either type of special structure in our matrices, so we consider the estimators interchangeable. We decided to use scaled Gaussian noise since it produces noise which points in all possible directions in R^d, whereas Rademacher noise is restricted to a finite set of 2^d points. We assume that there is no reason to prefer this particular set of 2^d directions and therefore sampling from all possible directions is better.

D.3.5 REDUCING VARIANCE WHEN SAMPLING MORE THAN ONE HUTCHINSON SAMPLE

When the number of Hutchinson samples K is greater than 1, it is more favorable to sample the noise vectors in a dependent way than independently. Consider the case of a d × d matrix A with K = d. Then we can get an exact estimate of the trace via
∑_{i=1}^d q_i^T A q_i = tr(Q^T A Q) = tr(A),   (109)
with orthogonal Q and q_i the i-th column of Q. If the q_i were sampled independently, we almost certainly wouldn't achieve this exact result. We therefore sample our noise vectors as the first K columns of a randomly sampled d × d orthogonal matrix and scale each column by √d. We show below that this reduces variance compared with sampling independently and make an experimental comparison (see fig. 6). If the resulting noise vectors are denoted ϵ_i, we estimate tr(A) by
\hat{tr}(A) = (1/K) ∑_{i=1}^K ϵ_i^T A ϵ_i.   (110)
This estimator is unbiased:
E_{ϵ_1,...,ϵ_K}[\hat{tr}(A)] = (1/K) ∑_{i=1}^K E_{ϵ_i}[ϵ_i^T A ϵ_i]   (111)
= (1/K) ∑_{i=1}^K tr( A E_{ϵ_i}[ϵ_i ϵ_i^T] )   (112)
= (1/K) ∑_{i=1}^K tr(A)   (113)
= tr(A),   (114)
since E_{ϵ_i}[ϵ_i ϵ_i^T] = I for all ϵ_i. This procedure is equivalent to using scaled Gaussian noise when K = 1 and is what we implement in practice for all values of K. Note that it is not necessary to use K > d since we already achieve the exact value with K = d.

A note on our sampling strategy: in practice we sample by taking the Q matrix of the QR decomposition of a d × K matrix with entries sampled from a standard normal. Since the QR decomposition performs Gram-Schmidt orthogonalization, the Q matrix is uniformly sampled from the group of orthogonal matrices if Q is square (Mezzadri, 2007). The same logic applies to the QR decomposition of non-square matrices, yielding a d × K matrix made up of the first K columns of a uniformly sampled orthogonal d × d matrix.
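As an illustration, here is a minimal sketch of this sampling strategy (ours, not the repository code): the K noise vectors are the first K columns of the Q factor of a Gaussian matrix, scaled to length √d, and averaged in a Hutchinson estimate. For K = d the estimate becomes exact up to floating-point error.

```python
import torch

def orthogonalized_noise(d, K):
    """First K columns of a random orthogonal d x d matrix, scaled to length sqrt(d)."""
    X = torch.randn(d, K)
    Q, _ = torch.linalg.qr(X)        # Q has orthonormal columns, shape (d, K)
    return Q * d ** 0.5

def trace_estimate(A, K):
    d = A.shape[0]
    eps = orthogonalized_noise(d, K)                    # (d, K)
    # average of eps_k^T A eps_k over the K samples
    return torch.einsum("ik,ij,jk->", eps, A, eps) / K

torch.manual_seed(0)
d = 32
M = torch.randn(d, d)
A = 0.5 * (M + M.T)
print("exact:", A.trace().item())
for K in (1, 4, 16, d):
    ests = torch.stack([trace_estimate(A, K) for _ in range(2000)])
    print(f"K={K:2d}  mean={ests.mean():7.3f}  std={ests.std():6.3f}")   # std -> 0 as K -> d
```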
Strictly speaking, the QR decomposition is only unique if the R matrix has a positive diagonal, and uniqueness is required for uniform sampling. Let X = QR and denote by D the sign of the diagonal of R: D is diagonal with +1 or −1 entries and D^2 = I. Uniqueness can be achieved by multiplying Q by D from the right and multiplying R by D from the left: X = (QD)(DR). The resulting uniformly sampled orthogonal matrix is QD, meaning the columns of Q are multiplied by either +1 or −1. In our setting, we have terms of the form ϵ_i^T A ϵ_i where the ϵ_i are the (scaled) columns of Q, so multiplying the columns by −1 has no effect on the trace estimate. As a result we opt not to multiply by D. Therefore, although we do not sample uniformly from the orthogonal group, the final result is equivalent to sampling uniformly.

Variance derivation. Using the formula for the variance of a sum of random variables, we have that
V_{ϵ_1,...,ϵ_K}[\hat{tr}(A)] = (1/K^2) ∑_{i,j=1}^K C_{ϵ_i,ϵ_j}[ ϵ_i^T A ϵ_i, ϵ_j^T A ϵ_j ],   (115)
where C denotes the covariance between two random variables. Note that since the permutation of the columns of a randomly sampled orthogonal matrix is arbitrary (permuting the columns results in another randomly sampled orthogonal matrix of equal probability), all columns are equivalent and we only have to distinguish between the cases i = j and i ≠ j. This leads to¹
V_{ϵ_1,...,ϵ_K}[\hat{tr}(A)] = (1/K^2) ( K v + K(K − 1) c ) = (1/K) ( v + (K − 1) c ),   (116)
with v = V_{ϵ_i}[ϵ_i^T A ϵ_i] and c = C_{ϵ_i,ϵ_j}[ϵ_i^T A ϵ_i, ϵ_j^T A ϵ_j] when i ≠ j. Each column viewed individually is a randomly sampled Gaussian vector, scaled to have length √d, hence the value of v is equal to the scaled Gaussian noise case above, namely
v = (2d^2/(d + 2)) σ^2(λ(A)),   (117)
where σ^2(λ(A)) denotes the variance of the eigenvalues of A. We also know that the variance of the estimator reduces to zero when K = d, and hence v + (d − 1) c = 0, leading to
c = −v / (d − 1).   (118)
Putting it all together leads to
V_{ϵ_1,...,ϵ_K}[\hat{tr}(A)] = (1/K) ( v − ((K − 1)/(d − 1)) v )   (119)
= ((d − K)/(K(d − 1))) v   (120)
= ( 2d^2 (d − K) / (K (d − 1)(d + 2)) ) σ^2(λ(A)),   (121)
valid for d > 1. The comparable quantity for independently sampled scaled Gaussian noise is
V_{ϵ_1,...,ϵ_K}[\hat{tr}(A)] = ( 2d^2 / (K(d + 2)) ) σ^2(λ(A)),   (122)
i.e. 1/K times the K = 1 result. The orthogonalized noise strategy always has lower variance, since the ratio between the variances is
(d − K)/(d − 1) ≤ 1,   (123)
which even reduces to zero when K = d. If d is large, the difference is not great for small K, which aligns with the fact that randomly sampled directions in R^d are close to orthogonal for large d.

¹ This argument is inspired by https://math.stackexchange.com/questions/1081345/finding-variance-of-the-sample-mean-of-a-random-sample-of-size-n-without-replace

E EXPERIMENTAL DETAILS

E.1 ROLE OF RECONSTRUCTION WEIGHT

E.1.1 TOY DATA

To analyze the model behaviour depending on the reconstruction weight β, we train the same architecture on a simple sinusoid data set with β varied between 0.01 and 100. For the generation of data points, we draw x positions from a 1D standard normal distribution and calculate the respective y positions by y = sin(πx/2). Then, isotropic Gaussian noise with σ = 0.1 is added.

Figure 7: The position of the transition point depends on the data set. The plots show the trade-off between reconstruction error and NLL with reconstruction weight β (box plots summarize 20 runs per condition). The point at which β becomes sufficiently large (transition point) shifts to lower values with increased Gaussian noise added to the data points.
We train an autoencoder architecture built with four residual blocks and a 1D latent space for 50 epochs with learning rate 0.001 until convergence. Each residual block is made up of a feedforward network with one hidden layer of width 256. For each value of β, 20 models are trained. To visualize the dimension of the data that is captured by the model, we project samples from the data distribution to the (1D) latent space and color the data points using the respective latent code as color value. Figure 3 illustrates that low reconstruction weight values β result in learning the dimension with the lowest entropy (noise) and higher values are required to learn the manifold that spans the sinusoid. Additionally, we repeat the procedure with higher noise levels (σ = 0.2, 0.3). We observe that the point at which the model transitions from learning the noise to representing the manifold is not fixed, but depends on features of the data set such as the noise (fig. 7).

Figure 8: (Left) Singular value spectra for varying reconstruction weight. (Right) Number of singular values greater or equal to one as a function of reconstruction weight. The error bars show the span of the intersection of the shaded region with the line y = 1 in the left-hand plot, rounded down to the nearest integer. For each trained model we generate 1024 samples per condition, compute the singular value spectra and average over all samples regardless of condition. The mean spectra and their standard deviation are evaluated across five trained models per reconstruction weight.

Figure 9: Conditional MNIST samples for varying reconstruction weight β. For each value of β ∈ {0.1, 1, 10, 100} (from left to right) and each condition (rows) we generate ten samples (columns) at temperature T = 1.

E.1.2 CONDITIONAL MNIST

To measure the structure of the conditional MNIST dataset learned by the generating model g, we compute the full decoder Jacobian matrix J by calculating d Jacobian-vector products, one per column of the Jacobian (see the code sketch below). We compute the singular values Σ = diag(s_1, ..., s_d) of J, which roughly speaking indicate the stretching or shrinking of the latent manifold by g. Hence, the number of non-vanishing singular values suggests the dimension of the data manifold, and the sum of the log singular values is equal to the change in entropy between the latent space and the data space, where a higher entropy indicates that more of the data manifold is spanned by the decoder. We can see this from the formula
H(p_X̂) = H(p_Z) + E_{p_Z}[ (1/2) log det( g'(z)^T g'(z) ) ],   (124)
with H the differential entropy, and noting that 2 tr(log Σ) = log det(g'(z)^T g'(z)).

We train multiple FIF models on conditional MNIST with reconstruction weights β ranging from 0.1 to 100, and evaluate their singular value spectra. The models are trained for 400 epochs, which is sufficient for convergence. The architecture used is the same as in table 11, except that it has four times as many channels in each convolutional layer. In fig. 8 it is clear that a higher reconstruction weight gives rise to a higher number of non-vanishing singular values. Hence, the reconstruction weight contributes towards learning structure of the true data manifold.
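A minimal sketch of this Jacobian computation follows; the decoder below is a hypothetical stand-in for the trained conditional MNIST model, and forward-mode support is assumed for all layers used.

```python
import torch
import torch.autograd.forward_ad as fwAD

def decoder_jacobian(g, z):
    """Assemble the full Jacobian of g at a single latent z (shape (d,)) from
    d Jacobian-vector products, one per column."""
    d = z.shape[0]
    cols = []
    for i in range(d):
        e_i = torch.zeros(d)
        e_i[i] = 1.0
        with fwAD.dual_level():
            tangent = fwAD.unpack_dual(g(fwAD.make_dual(z, e_i))).tangent  # g'(z) e_i
        cols.append(tangent.reshape(-1))
    return torch.stack(cols, dim=1)              # shape (D, d)

# hypothetical stand-in decoder with d = 16, D = 784
d, D = 16, 784
g = torch.nn.Sequential(torch.nn.Linear(d, 256), torch.nn.SiLU(), torch.nn.Linear(256, D))
J = decoder_jacobian(g, torch.randn(d))
s = torch.linalg.svdvals(J)                      # singular values, largest first
entropy_change = torch.log(s).sum()              # = 0.5 * log det(g'(z)^T g'(z)), cf. eq. (124)
print(s[:5], entropy_change.item())
```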
This observed additional structure for higher reconstruction weights is reflected in an increasing diversity of samples (see fig. 9). Nevertheless, for high reconstruction weights we note the trade-off between sample diversity and a properly learned latent distribution, which might result in out-of-distribution samples.

E.2 TABULAR DATA

We compare to the tabular data experiments in Caterini et al. (2021), using the same datasets and data splits, as well as the same latent space dimensions. We train models with roughly the same number of parameters.

Table 5: Dataset-dependent hyperparameters and average total runtime for tabular data experiments. We compare our runtime against the published rectangular flow (Caterini et al., 2021) runtimes (RNFs-ML (K = 1) model), as well as rerunning their code on our hardware.

Hyperparameter                           POWER   GAS   HEPMASS   MINIBOONE
Latent dimension                             3     2        10          21
Training epochs                             15    30        85         875
Training time (minutes):
  FIF (ours)                                38    39        41          49
  Rectangular flow (published)             113    75       138          34
  Rectangular flow (our hardware)          147    86       249          75
Training time speedup (same hardware)      3.9   2.2       6.1         1.5

Table 6: Ablation study on the effect of each component of our proposed improvement to rectangular flows (RF). By NLL estimator we denote how the loss in equation 3 is approximated. For this experiment we used our reimplementation of RF.

Model & NLL estimator (on-/off-manifold)        POWER           GAS             HEPMASS         MINIBOONE
FIF & free-form net, off manifold (eq. 15)      0.041 ± 0.007   0.281 ± 0.031   0.541 ± 0.034   0.598 ± 0.024
FIF & free-form net, on manifold (eq. 10)       19.54 ± 20.81   7.48 ± 5.40     29.03 ± 5.42    77.23 ± 16.55
FIF & coupling flow, off manifold (eq. 15)      0.11 ± 0.06     0.45 ± 0.09     1.30 ± 0.14     1.55 ± 0.04
RF & coupling flow, off manifold (eq. 15)       0.98 ± 0.69     6.16 ± 4.20     2.02 ± 0.74     1.80 ± 0.10
FIF & coupling flow, on manifold (eq. 10)       3.71 ± 2.19     0.40 ± 0.22     0.71 ± 0.05     3.13 ± 0.42
RF & coupling flow, on manifold (eq. 10)        0.33 ± 0.22     0.33 ± 0.17     0.82 ± 0.07     1.84 ± 0.11

Our main architectural difference is that we use an unconstrained autoencoder rather than an injective flow. Our encoder consists of two parts: i) a feed-forward network with two hidden layers of dimension 256 and ReLU activations (no normalization layers), which maps from the input dimension to the latent dimension; ii) a ResNet with two blocks, each with two hidden layers of dimension 256 and ReLU activations. The ResNet has input and output dimension equal to the latent space dimension. The decoder is the inverse: i) an identical ResNet to the encoder (but with separate parameters), followed by ii) a feed-forward network with two hidden layers of dimension 256 mapping from the latent space to the data space dimension. A sketch of this architecture is given below.

We use a batch size of 512, add isotropic Gaussian noise with standard deviation 0.01, use K = 1 Hutchinson samples and a reconstruction weight β = 10 for all experiments. We use the Adam optimizer with the one-cycle LR scheduler with a learning rate of 10^−4 (except for HEPMASS, which uses 3·10^−4) and weight decay of 10^−4. The number of epochs was chosen such that all experiments had approximately the same number of training iterations. We ran the model 5 times per dataset. The dataset-dependent parameters and average training times are given in table 5. We compare our training times against the published rectangular flow training times for their RNFs-ML (K = 1) model, as well as rerunning their code on our hardware (a single RTX 2070 card).
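The tabular architecture described above could be written as follows in PyTorch. This is our sketch of one plausible instantiation: the layer counts follow the description, but details such as the exact residual block layout and the example dimensions (POWER with 6 features and latent dimension 3, cf. table 5) are assumptions.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block with two ReLU hidden layers of width 256 (assumed layout)."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )
    def forward(self, x):
        return x + self.net(x)

def tabular_encoder(data_dim, latent_dim, hidden=256):
    # feed-forward projection to the latent dimension, then two ResNet blocks
    return nn.Sequential(
        nn.Linear(data_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, latent_dim),
        ResBlock(latent_dim), ResBlock(latent_dim),
    )

def tabular_decoder(data_dim, latent_dim, hidden=256):
    # mirror image: ResNet blocks in latent space, then projection to data space
    return nn.Sequential(
        ResBlock(latent_dim), ResBlock(latent_dim),
        nn.Linear(latent_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, data_dim),
    )

enc, dec = tabular_encoder(data_dim=6, latent_dim=3), tabular_decoder(data_dim=6, latent_dim=3)
```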
We find comparable FID-like scores on our rerun (except on GAS, where we could not reproduce the score, see section 5), but our hardware is slower, with runs consistently taking at least 15% longer and more than twice as long on MINIBOONE. We find that our model runs in half the time or less of the rectangular flow on the same hardware, except for MINIBOONE (about 2/3 of the time).

E.3 COMPARISON TO EXISTING INJECTIVE FLOWS

We compare against Trumpets (Kothari et al., 2021) and Denoising Normalizing Flows (DNF) (Horvat & Pfister, 2021), as they are the best-performing injective flows to our knowledge, and report performance on CelebA in table 3. Note that Trumpets default to d = 192 and DNF to d = 512, whereas we are able to reduce the bottleneck dimension to d = 64 (consistent with the Pythae benchmark in appendix E.4).

Table 7: Reconstruction losses of FIF with a free-form architecture on the POWER, GAS, HEPMASS and MINIBOONE datasets. The reconstruction error is always much higher for on-manifold training compared to off-manifold, demonstrating the instability caused by on-manifold NLL evaluation in free-form networks. Note: the large standard deviations in on-manifold runs are typically the result of a single large outlier. We remove the largest outlier where applicable (the "On manifold (outliers removed)" row).

                                  POWER           GAS              HEPMASS       MINIBOONE
On manifold                       237 ± 498       5835 ± 13006     119 ± 34      300 ± 160
On manifold (outliers removed)    14 ± 27         18 ± 31          119 ± 34      229 ± 23
Off manifold                      0.072 ± 0.002   0.188 ± 0.012    2.569 ± 0.098 1.077 ± 0.011

Both models differ in the recommended wall-clock time, and we therefore fix the wall-clock time available to each model to five hours on a single NVIDIA A40. Trumpets train the manifold and the distribution on it in two sequential steps. To accommodate both steps in the reduced training time, we vary the fraction of the five hours spent training the manifold and the distribution, and report the best FID among the variants tried. We vary the number of manifold epochs as n_manifold = 2, 5, 10, with 10 performing best.

Our free-form injective flows (FIF) are not restricted in their architecture, and we choose an off-the-shelf convolutional autoencoder, followed by a total of four fully-connected ResNet blocks, see table 11. The fully-connected blocks are important, as can be seen when comparing to the architecture used in the Pythae benchmark (see appendix E.4). We note that the Pythae benchmark could benefit from a modified architecture, but leave this modification open for future work.

For Trumpets and DNF, we point to the training details provided by the respective authors. For FIF, we choose these training hyperparameters: we train with the Adam optimizer with a learning rate of 10^−3 and a weight decay of 10^−4, linearly increase β from 20 at initialization to 40 at the end of training, use a single Hutchinson sample K = 1 and a Student-t distribution on the latent space. We set the batch size to 256.

We conclude that the Pythae benchmark could benefit from an optimized architecture, as this change would probably also improve the other methods. From the data at hand, we further conclude that the full potential of FIF has not yet been exploited, and that easy gains can be made by improving the architecture and other hyperparameters.

E.4 PYTHAE BENCHMARK ON GENERATIVE AUTOENCODERS

We compare our method to existing generative autoencoder paradigms using the benchmark from Chadebec et al. (2022). We use the provided open-source pipeline and follow the training setup described by the authors.
For MNIST and CIFAR10 this means training for 100 epochs with the Adam optimizer at a starting learning rate of 10^−4, reserving the last 10k images of the training sets as validation sets. CelebA trains for 50 epochs with a starting learning rate of 10^−3. All experiments are performed with a batch size of 100, and the learning rate is halved when the loss plateaus for 10 epochs. In accordance with the original benchmark we pick 10 sets of hyperparameters, compute their validation FID (see fig. 10) and use the model which achieves the best FID on the validation set as the final model. In table 8 we report the FID and IS of this model on the test set. To complement the metrics, we show samples from all models in fig. 11, demonstrating convincing quality. We exclude the VAEGAN from the FID comparison, as the model trains for more than double the time required for FIF and goes beyond fitting a transformation of the training data to a standard normal distribution. As described in section 5, we use the architectures from Chadebec et al. (2022) for the benchmark, which we replicate in tables 9 and 10.

E.5 COMPUTE AND DEPENDENCIES

We used approximately 150 GPU hours for computing the Pythae benchmark, and an additional 800 GPU hours for model exploration and testing. The majority of the experiments were performed on an internal cluster of A100s. The majority of compute time was spent on image datasets. We build our code upon the following Python libraries: PyTorch (Paszke et al., 2019), PyTorch Lightning (Falcon & The PyTorch Lightning team, 2019), TensorFlow (Abadi et al., 2015) for FID score evaluation, NumPy (Harris et al., 2020), Matplotlib (Hunter, 2007) for plotting and Pandas (McKinney, 2010; The pandas development team, 2020) for data evaluation.

Figure 10: Validation FID of our 10 benchmark model setups on the different datasets and architectures used by Pythae. (Panels: ConvNet and ResNet architectures on MNIST, CIFAR10 and CelebA; axes: reconstruction loss vs. validation FID.) We report results for reconstruction weights β = 5, 10, 15, 20, 25 and number of Hutchinson samples K = 1 in the solid lines and K = 2 in the dashed lines. We show the performance of standard normal sampling (N) and a Gaussian mixture model (GMM). We see that GMM sampling always improves performance. In contrast, the reconstruction weight and number of Hutchinson samples have no noticeable effect on performance, except that increasing β improves performance on ConvNet MNIST and ConvNet CIFAR10, and decreases performance on ResNet MNIST (normal sampling only).

Table 8: Table taken from Chadebec et al. (2022) with our results added at the top. We report Inception Score (IS) and Fréchet Inception Distance (FID) computed with 10k samples on the test set. The best model per dataset and sampler is highlighted in bold, the second best is underlined.
Model Sampler Conv Net Res Net MNIST CIFAR10 CELEBA MNIST CIFAR10 CELEBA FID IS FID IS FID IS FID IS FID IS FID IS FIF (ours) N 23.8 2.2 121.0 3.0 56.9 2.1 19.5 2.1 132.6 2.9 62.3 1.7 GMM 11.0 2.2 90.6 4.0 47.3 1.9 11.7 2.1 119.2 3.4 55.0 1.8 VAE N 28.5 2.1 241.0 2.2 54.8 1.9 31.3 2.0 181.7 2.5 66.6 1.6 GMM 26.9 2.1 235.9 2.3 52.4 1.9 32.3 2.1 179.7 2.5 63.0 1.7 VAMP VAMP 64.2 2.0 329.0 1.5 56.0 1.9 34.5 2.1 181.9 2.5 67.2 1.6 IWAE N 29.0 2.1 245.3 2.1 55.7 1.9 32.4 2.0 191.2 2.4 67.6 1.6 GMM 28.4 2.1 241.2 2.1 52.7 1.9 34.4 2.1 188.8 2.4 64.1 1.7 VAE-lin NF N 29.3 2.1 240.3 2.1 56.5 1.9 32.5 2.0 185.5 2.4 67.1 1.6 GMM 28.4 2.1 237.0 2.2 53.3 1.9 33.1 2.1 183.1 2.5 62.8 1.7 VAE-IAF N 27.5 2.1 236.0 2.2 55.4 1.9 30.6 2.0 183.6 2.5 66.2 1.6 GMM 27.0 2.1 235.4 2.2 53.6 1.9 32.2 2.1 180.8 2.5 62.7 1.7 β-VAE N 21.4 2.1 115.4 3.6 56.1 1.9 19.1 2.0 124.9 3.4 65.9 1.6 GMM 9.2 2.2 92.2 3.9 51.7 1.9 11.4 2.1 112.6 3.6 59.3 1.7 N (0, 1) 96.5 2.3 219.4 3.6 130.8 1.6 109.4 2.7 209.8 3.2 110.6 1.5 GMM 192.7 2.3 300.8 1.8 94.0 1.5 98.7 2.1 202.3 2.6 83.2 1.6 2-s sampler 236.1 1.5 371.2 1.0 167.9 1.0 250.8 1.1 349.0 1.2 161.0 1.2 MAF sampler 191.8 2.2 300.6 1.8 94.3 1.5 98.3 2.1 200.6 2.6 82.6 1.7 β-TC VAE N 21.3 2.1 116.6 2.8 55.7 1.8 20.7 2.0 125.8 3.4 65.9 1.6 GMM 11.6 2.2 89.3 4.1 51.8 1.9 13.3 2.1 106.5 3.7 59.3 1.7 Factor VAE N 27.0 2.1 236.5 2.2 53.8 1.9 31.0 2.0 185.4 2.5 66.4 1.7 GMM 26.9 2.1 234.0 2.2 52.4 2.0 32.7 2.1 184.4 2.5 63.3 1.7 Info VAE - RBF N 27.5 2.1 235.2 2.1 55.5 1.9 31.1 2.0 182.8 2.5 66.5 1.6 GMM 26.7 2.1 230.4 2.2 52.7 1.9 32.3 2.1 179.5 2.5 62.8 1.7 Info VAE - IMQ N 28.3 2.1 233.8 2.2 56.7 1.9 31.0 2.0 182.4 2.5 66.4 1.6 GMM 27.7 2.1 231.9 2.2 53.7 1.9 32.8 2.1 180.7 2.6 62.3 1.7 AAE N 16.8 2.2 139.9 2.6 59.9 1.8 19.1 2.1 164.9 2.4 64.8 1.7 GMM 9.3 2.2 92.1 3.8 53.9 2.0 11.1 2.1 118.5 3.5 58.7 1.8 MSSSIM-VAE N 26.7 2.2 279.9 1.7 124.3 1.3 28.0 2.1 254.2 1.7 119.0 1.3 GMM 27.2 2.2 279.7 1.7 124.3 1.3 28.8 2.1 253.1 1.7 119.2 1.3 VAEGAN N 8.7 2.2 199.5 2.2 39.7 1.9 12.8 2.2 198.7 2.2 122.8 2.0 (not compared) GMM 6.3 2.2 197.5 2.1 35.6 1.8 6.5 2.2 188.2 2.6 84.3 1.7 AE N 26.7 2.1 201.3 2.1 327.7 1.0 221.8 1.3 210.1 2.1 275.0 2.9 GMM 9.3 2.2 97.3 3.6 55.4 2.0 11.0 2.1 120.7 3.4 57.4 1.8 WAE - RBF N 21.2 2.2 175.1 2.0 332.6 1.0 21.2 2.1 170.2 2.3 69.4 1.6 GMM 9.2 2.2 97.1 3.6 55.0 2.0 11.2 2.1 120.3 3.4 58.3 1.7 WAE - IMQ N 18.9 2.2 164.4 2.2 64.6 1.7 20.3 2.1 150.7 2.5 67.1 1.6 GMM 8.6 2.2 96.5 3.6 51.7 2.0 11.2 2.1 119.0 3.5 57.7 1.8 VQVAE N 28.2 2.0 152.2 2.0 306.9 1.0 170.7 1.6 195.7 1.9 140.3 2.2 GMM 9.1 2.2 95.2 3.7 51.6 2.0 10.7 2.1 120.1 3.4 57.9 1.8 RAE - L2 N 25.0 2.0 156.1 2.6 86.1 2.8 63.3 2.2 170.9 2.2 168.7 3.1 GMM 9.1 2.2 85.3 3.9 55.2 1.9 11.5 2.1 122.5 3.4 58.3 1.8 RAE - GP N 27.1 2.1 196.8 2.1 86.1 2.4 61.5 2.2 229.1 2.0 201.9 3.1 GMM 9.7 2.2 96.3 3.7 52.5 1.9 11.4 2.1 123.3 3.4 59.0 1.8 Table 9: Conv Net, neural network architecture used for the convolutional networks, adapted from Chadebec et al. (2022). 
MNIST CIFAR10 CELEBA Encoder (1, 28, 28) (3, 32, 32) (3, 64, 64) Layer 1 Conv(128, 4, 2), BN, Re LU Conv(128, 4, 2), BN, Re LU Conv(128, 4, 2), BN, Re LU Layer 2 Conv(256, 4, 2), BN, Re LU Conv(256, 4, 2), BN, Re LU Conv(256, 4, 2), BN, Re LU Layer 3 Conv(512, 4, 2), BN, Re LU Conv(512, 4, 2), BN, Re LU Conv(512, 4, 2), BN, Re LU Layer 4 Conv(1024, 4, 2), BN, Re LU Conv(1024, 4, 2), BN, Re LU Conv(1024, 4, 2), BN, Re LU Layer 5 Linear(1024, latent dim)* Linear(4096, latent dim)* Linear(16384, latent dim)* Decoder Layer 1 Linear(latent dim, 16384) Linear(latent dim, 65536) Linear(latent dim, 65536) Layer 2 Conv T(512, 3, 2), BN, Re LU Conv T(512, 4, 2), BN, Re LU Conv T(512, 5, 2), BN, Re LU Layer 3 Conv T(256, 3, 2), BN, Re LU Conv T(256, 4, 2), BN, Re LU Conv T(256, 5, 2), BN, Re LU Layer 4 Conv(1, 3, 2), Sigmoid Conv(3, 4, 1), Sigmoid Conv T(128, 5, 2), BN, Re LU Layer 5 - - Conv T(3, 5, 1), Sigmoid #Parameters 17.2M 39.4M 33.5M Published as a conference paper at ICLR 2024 Table 10: Res Net, neural network architecture used for the residual networks, adapted from Chadebec et al. (2022). MNIST CIFAR10 CELEBA Encoder (1, 28, 28) (3, 32, 32) (3, 64, 64) Layer 1 Conv(64, 4, 2) Conv(64, 4, 2) Conv(64, 4, 2) Layer 2 Conv(128, 4, 2) Conv(128, 4, 2) Conv(128, 4, 2) Layer 3 Conv(128, 3, 2) Conv(128, 3, 1) Conv(128, 3, 2) Layer 4 Res Block* Res Block* Conv(128, 3, 2) Layer 5 Res Block* Res Block* Res Block* Layer 6 Linear(2048, latent dim)* Linear(8192, latent dim)* Res Block* Layer 7 - - Linear(2048, latent dim)* Decoder Layer 1 Linear(latent dim, 2048) Linear(latent dim, 8192) Linear(latent dim, 2048) Layer 2 Conv T(128, 3, 2) Res Block* Conv T(128, 3, 2) Layer 3 Res Block* Res Block* Res Block* Layer 4 Res Block*, Re LU Conv T(64, 4, 2) Res Block* Layer 5 Conv T(64, 3, 2), Re LU Conv T(3, 4, 2), Sigmoid Conv T(128, 5, 2), Sigmoid Layer 6 Conv T(1, 3, 2), Sigmoid - Conv T(64, 5, 2), Sigmoid Layer 6 - - Conv T(3, 4, 2), Sigmoid #Parameters 0.73M 4.8M 1.6M *The Res Blocks are composed of one Conv(32, 3, 1) followed by Conv(128, 1, 1) with Re LU. Table 11: FIF-Conv Net, neural network architecture used for comparison to Trumpet and Denoising Normalizing Flow. MNIST CELEBA Encoder (1, 28, 28) (3, 64, 64) Layer 1 Conv(32, 4, 2), BN, Re LU Conv(128, 4, 2), BN, Re LU Layer 2 Conv(64, 4, 2), BN, Re LU Conv(256, 4, 2), BN, Re LU Layer 3 Conv(128, 4, 2), BN, Re LU Conv(512, 4, 2), BN, Re LU Layer 4 Conv(256, 4, 2), BN, Re LU Conv(1024, 4, 2), BN, Re LU Layer 5 Linear(256, latent dim)* Linear(16384, latent dim)* Layer 6-9 4x Res Block(512) 4x Res Block(256) Decoder Layer 1-4 4x Res Block(512) 4x Res Block(256) Layer 5 Linear(latent dim, 4096) Linear(latent dim, 65536) Layer 6 Conv T(256, 3, 2), BN, Re LU Conv T(512, 5, 2), BN, Re LU Layer 7 Conv T(128, 3, 2), BN, Re LU Conv T(256, 5, 2), BN, Re LU Layer 8 Conv T(64, 3, 2), BN, Re LU Conv T(128, 5, 2), BN, Re LU Layer 9 Conv(1, 3, 2), Sigmoid Conv T(3, 5, 1), Sigmoid #Parameters 3.3M 34.3M *The Res Blocks(inner dim) are composed of Linear(latent dim, inner dim), Si LU, Linear(inner dim, inner dim), Si LU, Linear(inner dim, latent dim) with a skip connection. Published as a conference paper at ICLR 2024 CELEBA - GMM Info VAE - IMQ Info VAE - RBF Figure 11: Uncurated samples from the Celeb A Conv Net experiments in the Pyth AE benchmark. Our model is shown at the bottom, samples from the other models have been taken from Chadebec et al. (2022). 
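For concreteness, the fully-connected residual blocks referenced in the footnote of table 11 could be instantiated as follows. This is our sketch of the description there; how exactly the skip connection is applied is an assumption, and the example dimensions correspond to the CelebA configuration (latent dimension 64, inner dimension 256).

```python
import torch.nn as nn

class FCResBlock(nn.Module):
    """Linear(latent_dim, inner_dim), SiLU, Linear(inner_dim, inner_dim), SiLU,
    Linear(inner_dim, latent_dim), with a skip connection (cf. footnote of table 11)."""
    def __init__(self, latent_dim, inner_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, inner_dim), nn.SiLU(),
            nn.Linear(inner_dim, inner_dim), nn.SiLU(),
            nn.Linear(inner_dim, latent_dim),
        )
    def forward(self, x):
        return x + self.net(x)

# e.g. the four latent-space blocks of the CelebA FIF-ConvNet in table 11:
blocks = nn.Sequential(*[FCResBlock(latent_dim=64, inner_dim=256) for _ in range(4)])
```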
Figure 12: Possible learnt manifolds of varying curvature 1/R for data supported on a subspace, where d = 1 and D = 2.

Figure 13: Weighting the reconstruction loss higher does not lead to stable training. The plots show different reconstruction weights β (curves: MSE, NLL*, total loss). In all settings, highly curved manifolds (i.e. low radius) achieve the lowest loss.

F DETAILS ON PATHOLOGY INDUCED BY CURVATURE

As described in section 4.2, gradients from the on-manifold loss in eq. (10) cause the learned manifold to increase curvature. This is visualized in the main text in fig. 2, where the left plot shows that this loss leads to ever-increasing curvature. The reason is that the entropy of data projected to a curved manifold is smaller than the entropy of data projected to a flat manifold. Here, we provide intuition for why this happens for synthetic data where d is known and the data could in principle be perfectly reconstructed.

Figure 12 depicts projections of the data to possible model manifolds of varying curvature κ. We parameterize the curvature by varying the radius R = 1/κ. One can observe that with increasing curvature (i.e. decreasing radius), the data is projected to an increasingly small region. Correspondingly, the entropy H(p̂_data(x̂)) of the projected data becomes arbitrarily negative (just like a Gaussian with low standard deviation has arbitrarily negative entropy), lowering the achievable negative log-likelihood. Adding a reconstruction loss alone does not fix this pathology, which we illustrate in fig. 13: the reconstruction loss saturates for small radii, but the best achievable negative log-likelihood (i.e. the entropy of the data) continues to decrease with the radius. Thus, even when increasing β, the minimal possible value of the total loss is still achieved by a spuriously curved manifold.
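To complement figs. 12 and 13, the following self-contained sketch (our construction, not the experiment behind the figures) projects 1D data embedded in 2D onto circles of decreasing radius R tangent to the data subspace. The reconstruction error saturates while a Gaussian-fit proxy for the entropy of the projected data keeps decreasing, which is exactly the pathology described above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)                      # data supported on the x-axis (d=1, D=2)
pts = np.stack([x, np.zeros_like(x)], axis=1)

for R in [10.0, 1.0, 0.3, 0.1]:
    center = np.array([0.0, R])                   # circle of radius R tangent to the data line
    diff = pts - center
    proj = center + R * diff / np.linalg.norm(diff, axis=1, keepdims=True)   # nearest point on circle
    mse = np.mean(np.sum((pts - proj) ** 2, axis=1))
    # 1D manifold coordinate: signed arc length along the circle (0 at the tangent point)
    theta = np.arctan2(diff[:, 0], -diff[:, 1])
    s = R * theta
    # Gaussian-fit proxy for the differential entropy of the projected data
    entropy = 0.5 * np.log(2 * np.pi * np.e * s.var())
    print(f"R={R:5.2f}  reconstruction MSE={mse:7.4f}  projected entropy ~ {entropy:7.3f}")
```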