Published as a conference paper at ICLR 2023

AUTO-ENCODING GOODNESS OF FIT

Aaron Palmer1, Zhiyi Chi2, Derek Aguiar1, Jinbo Bi1
1Department of Computer Science, 2Department of Statistics, University of Connecticut, Storrs, CT, USA
{aaron.palmer, zhiyi.chi, derek.aguiar, jinbo.bi}@uconn.edu

We develop a new type of generative autoencoder called the Goodness-of-Fit Autoencoder (GoFAE), which incorporates GoF tests at two levels. At the minibatch level, it uses GoF test statistics as regularization objectives. At a more global level, it selects a regularization coefficient based on higher criticism, i.e., a test on the uniformity of the local GoF p-values. We justify the use of GoF tests by providing a relaxed $L_2$-Wasserstein bound on the distance between the latent distribution and a distribution class. We prove that optimization based on these tests can be done with stochastic gradient descent on a compact Riemannian manifold. Empirically, we show that our higher criticism parameter selection procedure balances reconstruction and generation using mutual information and uniformity of p-values, respectively. Finally, we show that GoFAE achieves comparable FID scores and mean squared errors with competing deep generative models while retaining statistical indistinguishability from Gaussian in the latent space based on a variety of hypothesis tests.

1 INTRODUCTION

Generative autoencoders (GAEs) aim to achieve unsupervised, implicit generative modeling by learning a latent representation of the data (Bousquet et al., 2017). A generative model, known as the decoder, maps elements in a latent space, called codes, to the data space. These codes are sampled from a pre-specified distribution, or prior. GAEs also learn an encoder that maps the data space into the latent space, by controlling the probability distribution of the transformed data, or posterior. As an important type of GAE, variational autoencoders (VAEs) maximize a lower bound on the data log-likelihood, which consists of a reconstruction term and a Kullback-Leibler (KL) divergence between the approximate posterior and prior distributions (Kingma & Welling, 2013; Rezende et al., 2014). Another class of GAEs seeks to minimize the optimal transport cost (Villani, 2008) between the true data distribution and the generative model. This objective can be simplified to minimizing a reconstruction error subject to matching the aggregated posterior to the prior distribution (Bousquet et al., 2017). This constraint is relaxed in the Wasserstein autoencoder (WAE) (Tolstikhin et al., 2017) via a penalty on the divergence between the aggregated posterior and the prior, allowing for a variety of discrepancies (Patrini et al., 2018; Kolouri et al., 2018). Regardless of the criterion, training a GAE requires balancing low reconstruction error with a regularization loss that encourages the latent representation to be meaningful for data generation (Hinton & Salakhutdinov, 2006; Ruthotto & Haber, 2021).

Overly emphasizing minimization of the divergence metric between the data-derived posterior and the prior in GAEs is problematic. In VAEs, this can manifest as posterior collapse (Higgins et al., 2017; Alemi et al., 2018; Takida et al., 2021), resulting in a latent space containing little information about the data. Meanwhile, WAEs can suffer from over-regularization when the prior distribution is too simple, e.g., an isotropic Gaussian (Rubenstein et al., 2018; Dai & Wipf, 2019).
Generally, it is difficult to decide when the posterior is close enough to the prior but not so close as to become problematic. The difficulty is rooted in several issues: (a) an absence of tight constraints on the statistical distances; (b) variation in the empirical distributions across the minibatches used in training; and (c) the difference in scale between reconstruction and regularization objectives. Unlike statistical distances, goodness-of-fit (GoF) tests are statistical hypothesis tests that assess the indistinguishability between a given (empirical) distribution and a distribution class (Stephens, 2017). In recent years, GAEs based on GoF tests have been proposed to address some of the aforementioned issues of VAEs and WAEs (Ridgeway & Mozer, 2018; Palmer et al., 2018; Ding et al., 2019). However, this emerging GAE approach still has some outstanding issues. In GAEs, GoF test statistics are optimized locally in minibatches. The issue of balancing reconstruction error and meaningful latent representation is manifested as the calibration of GoF test p-values. If GoF test p-values are too small (i.e., minibatches are distinguishable from the prior), then sampling quality is poor; conversely, an abundance of large GoF p-values may result in poor reconstruction as the posterior matches the prior too closely at the minibatch level. In addition, there currently does not exist a stochastic gradient descent (SGD) algorithm that is applicable to GoF tests, due to identifiability issues, unbounded domains, and gradient singularities.

Figure 1: Latent behaviors for the VAE, WAE, and GoFAE, inspired by Figure 1 of Tolstikhin et al. (2017). (a) The VAE requires the approximate posterior distribution (orange contours) to match the prior $P_Z$ (white contours) for each example. (b) The WAE forces the encoded distribution (green contours) to match the prior $P_Z$. (c) The GoFAE forces the encoded distribution (purple contours) to match some $P_Z$ in the prior class $\mathcal{G}$. For illustration, several $P_{Z_i} \in \mathcal{G}$ are visualized (white contours).

Our Contributions: We study the GoFAE, a framework for parametric test statistic optimization, resulting in a novel GAE that optimizes GoF tests for normality, and an algorithm for regularization coefficient selection. Note that the GoF tests are not only for Gaussians with nonsingular covariance matrices but for all Gaussians, so they can handle situations where the data distribution is concentrated on or closely around a manifold with dimension smaller than the ambient space. The framework uses Gaussian priors, as they are a standard option (Doersch, 2016; Bousquet et al., 2017; Kingma et al., 2019), and also because normality tests are better understood than GoF tests for other distributions, with more tests and more accurate calculation of p-values available. The framework can be modified to use other priors in a straightforward way, provided that the same level of understanding can be gained on GoF tests for those priors as for normality tests. See Fig. 1 for the latent space behavior of the VAE, WAE, and our GoFAE. Proofs are deferred to the appendix. Our contributions are summarized as follows.

- We propose a framework (Sec. 2) for bounding the statistical distance between the posterior and a prior distribution class $\mathcal{G}$ in GAEs, which forms the theoretical foundation for a deterministic GAE, the Goodness of Fit Autoencoder (GoFAE), that directly optimizes GoF hypothesis test statistics.
- We examine four GoF tests of normality based on correlation and empirical distribution functions (Sec. 3). Each GoF test focuses on a different aspect of Gaussianity, e.g., moments or quantiles.
- A model selection method using higher criticism of p-values is proposed (Sec. 3), which enables global normality testing and is test-based instead of performance-based. This method helps determine the range of the regularization coefficient that balances reconstruction and generation using uniformity of p-values (Fig. 3b).
- We show that gradient-based optimization of test statistics for normality can be complicated by identifiability issues, unbounded domains, and gradient singularities; we propose an SGD that optimizes over a Riemannian manifold (a Stiefel manifold in our case) and effectively solves our GAE formulation, with convergence analysis (Sec. 4).
- We show that GoFAE achieves comparable FID scores and mean squared error on three datasets while retaining statistical indistinguishability from Gaussian in the latent space (Sec. 5).

2 PRELIMINARIES FOR GOODNESS OF FIT AUTOENCODING

Background. Let $(\mathcal{X}, P_X)$ and $(\mathcal{Z}, P_Z)$ be two probability spaces. In our setup, $P_X$ is the true, but unknown, non-atomic distribution on the data space $\mathcal{X}$, while $P_Z$ is a prior distribution on the latent space $\mathcal{Z}$. An implicit generative model is defined by sampling a code $Z \sim P_Z$ and applying a mapping $G$, called the decoder, to produce $G(Z) \in \mathcal{X}$. The distribution of $G(Z)$ is given by the pushforward of $P_Z$ under $G$, commonly denoted by $G_\# P_Z$. Our concern is finding $G$ such that $G_\# P_Z$ is close to $P_X$. One approach is based on optimal transport. If $\mu$ and $\nu$ are two distributions on spaces $E_1$ and $E_2$, respectively, and $c(u, v)$ is a cost function on $E_1 \times E_2$, then the optimal transport cost to transfer $\mu$ to $\nu$ is $T_c(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(U,V) \sim \pi}[c(U, V)]$, where $\Pi(\mu, \nu)$ is the set of all joint probability distributions with marginals $\mu$ and $\nu$. If $E_1 = E_2$ and is endowed with a metric $d$, and $c(u, v) = d(u, v)^p$ with $p > 0$, then $d_{W_p}(\mu, \nu) = T_c(\mu, \nu)^{1/p}$ is known as the $L_p$-Wasserstein distance. In principle, $G$ can be learned by solving $\min_G T_c(P_X, G_\# P_Z)$. Unfortunately, the minimization is intractable. Instead, the WAE approach entails finding a mapping $F$, known as the encoder, such that the pushforward distribution $P_Y := F_\# P_X$ matches $P_Z$. We will only consider non-random encoders and decoders, that is, a fully deterministic autoencoder. This is theoretically supported by the Monge-Kantorovich equivalence (Villani, 2008; Patrini et al., 2018). In this setting, $F : \mathcal{X} \to \mathcal{Z}$ encodes $X \sim P_X$ to produce the code $Y = F(X)$, and $G : \mathcal{Z} \to \mathcal{X}$ decodes $Y$ to produce the reconstruction $G(Y)$. To emphasize the encode-decode process, we can write the reconstruction as $(G \circ F)(X) := G(F(X)) = G(Y)$ and the reconstruction distribution as the pushforward distribution $(G \circ F)_\# P_X = G_\# F_\# P_X = G_\# P_Y$. We can define the WAE objective

$$\mathrm{WAE}^{\lambda}_{c}(P_X, G) = \inf_{F}\ \Big\{ \int_{\mathcal{X}} c\big(x, G(F(x))\big)\, dP_X(x) + \lambda\, \mathcal{D}(F_\# P_X, P_Z) \Big\}, \qquad (1)$$

where $\lambda > 0$ is a regularization coefficient and $\mathcal{D}$ is a penalty on statistical discrepancy.

Wasserstein Distance: Bounds. When $\mathcal{X}$ and $\mathcal{Z}$ are Euclidean spaces, a common choice for $c$ is the squared Euclidean distance, leading to the $L_2$-Wasserstein distance $d_{W_2}$ and the following bounds.

Proposition 1. If $\mathcal{X}$ and $\mathcal{Z}$ are Euclidean spaces and the decoder $G$ is differentiable, then
(a) $d_{W_2}(G_\# P_Y, G_\# P_Z) \le \|\nabla G\|_\infty\, d_{W_2}(P_Y, P_Z)$,
(b) $d_{W_2}(P_X, G_\# P_Z) \le \big[\mathbb{E}\,\|X - G(Y)\|^2\big]^{1/2} + \|\nabla G\|_\infty\, d_{W_2}(P_Y, P_Z)$.
Combined with the triangle inequality for $d_{W_2}$, (a) and (b) in Proposition 1 imply that when the reconstruction error $[\mathbb{E}\,\|X - G(Y)\|^2]^{1/2}$ is small, proximity between the latent distribution $P_Y$ and the prior $P_Z$ is sufficient to ensure proximity of the generated distribution $G_\# P_Z$ and the data distribution $P_X$. Both (a) and (b) are similar to Patrini et al. (2018), though here a less tight bound, the squared error, is used in (b) because it is an easier objective to optimize. Given a sample of data $\{X_i\}_{i=1}^n$, the $L_2$-Wasserstein distance between the empirical distribution of $\{Y_i = F(X_i)\}_{i=1}^n$ and $P_Z$ is a natural approximation of $d_{W_2}(P_Y, P_Z)$ because it is a consistent estimator of the latter as $n \to \infty$, although it is in general biased (Xu et al., 2020).

Proposition 2. Let $\hat{P}_{X,n}$ be the empirical distribution of the samples $\{X_i\}_{i=1}^n$, and $\hat{P}_{Y,n}$ be the empirical distribution of $\{Y_i = F(X_i)\}_{i=1}^n$. Assume that $F(X)$ is differentiable with respect to $X$ with bounded gradient $\nabla F(X)$. Then,
(a) $d_{W_2}(\hat{P}_{Y,n}, P_Y) \le \|\nabla F\|_\infty\, d_{W_2}(\hat{P}_{X,n}, P_X)$,
(b) $d_{W_2}(P_Y, P_Z) \le d_{W_2}(\hat{P}_{Y,n}, P_Z) + \|\nabla F\|_\infty\, d_{W_2}(\hat{P}_{X,n}, P_X)$.

For large $n$, $d_{W_2}(\hat{P}_{X,n}, P_X)$ is small, so from (b), the proximity of the latent distribution and the prior is mainly controlled by the proximity between the empirical distribution $\hat{P}_{Y,n}$ and $P_Z$. Together with Proposition 1, this shows that in order to achieve close proximity between $G_\# P_Z$ and $P_X$, we need strong control on the proximity between $\hat{P}_{Y,n}$ and $P_Z$ in addition to a good reconstruction.

Extending to a Class. So far we have focused on a single, completely specified $P_Z$. However, as we will argue in the next section, it can be beneficial to reduce the specificity and allow for a class $\mathcal{G}$ of priors. Letting $d_{W_2}(\cdot, \mathcal{G}) = \inf_{P \in \mathcal{G}} d_{W_2}(\cdot, P)$, all the above results can be easily extended. For example, Proposition 1(a) gives $\inf_{P_Z \in \mathcal{G}} d_{W_2}(G_\# P_Y, G_\# P_Z) \le \|\nabla G\|_\infty\, d_{W_2}(P_Y, \mathcal{G})$. Once $d_{W_2}(P_Y, \mathcal{G})$ is controlled, we can identify a $P_Z \in \mathcal{G}$ that satisfies $d_{W_2}(P_Y, P_Z) = d_{W_2}(P_Y, \mathcal{G})$ and use it as the prior. Clearly, if $\mathcal{G}$ contains only one distribution, this reduces to the case where the prior is completely specified. Thus, the setting that allows for a class of priors is more general. We propose to use hypothesis tests associated with the Wasserstein distance. This does not preclude tests associated with other statistical distances, which may even be used in conjunction; many standard statistical distances are not proper metrics, but the associated tests aim to control statistical distances dominated by the Wasserstein distance. At the population level, the Wasserstein distance provides stronger separation between different distributions; at the sample level, it is useful to use different GoF tests, since each GoF test is sensitive to different characteristics of the data.

3 THE NEED FOR A HIGHER CRITICISM: FROM LOCAL TO GLOBAL TESTING

GAEs are typically trained using size-$m$ minibatches, or subsamples, of the data. A natural starting point is to push for matching the prior at the minibatch level. However, Section 3.2 discusses how focusing only on the minibatch level runs the risk of overfitting, and how this can be mitigated with higher criticism. GoF tests assess whether a sample of observed data, $\{y_i\}_{i=1}^m$, is drawn from a distribution class $\mathcal{G}$. The most general opposing hypotheses are $H_0: P_Y \in \mathcal{G}$ vs. $H_1: P_Y \notin \mathcal{G}$. A test is specified by a test statistic $T$ and a significance level $\alpha \in (0, 1)$.
If the observed value of $T$ applied to $\{y_i\}_{i=1}^m$ is $T^* = T(\{y_i\}_{i=1}^m)$, then the (lower-tail) p-value is $P\big(T(\{Y_i\}_{i=1}^m) \le T^* \mid H_0\big)$, where $\{Y_i\}_{i=1}^m \sim P_Y \in \mathcal{G}$. $H_0$ is rejected in favor of $H_1$ at level $\alpha$ if the p-value is less than $\alpha$, or equivalently, if $T^* \le T_\alpha$, where $T_\alpha$ is the $\alpha$-th quantile of $T$ under $H_0$. The probability integral transform states that if $T$ has a continuous distribution under $H_0$, the p-values are uniformly distributed on $(0, 1)$ (Murdoch et al., 2008).

Normality. Multivariate normality (MVN) has received the most attention in the study of multivariate GoF tests. If $Y \in \mathbb{R}^d$ has a normal distribution, then for every $u \in \mathcal{S}$, the projection $Y^{\mathsf T} u$ of $Y$ onto $u$ follows a univariate normal (UVN) distribution (Rao et al., 1973), where $\mathcal{S} = \{u \in \mathbb{R}^d : \|u\| = 1\}$ is the unit sphere in $\mathbb{R}^d$. Conversely, if $Y$ has a non-normal distribution, the set of $u \in \mathcal{S}$ for which $Y^{\mathsf T} u$ is normal has Lebesgue measure 0 (Shao & Zhou, 2010). Thus, one way to test MVN is to apply GoF tests of univariate normality (UVN) to a set of (random) projections of $Y$ and then make a decision based on the resulting collection of test statistics. Many UVN tests exist (Looney, 1995; Mecklin & Mundfrom, 2005), each focusing on different distributional characteristics. We next briefly describe correlation-based (CB) tests because they are directly related to the Wasserstein distance; for other test types, see Appendix B. In essence, CB tests of UVN assess the correlation between empirical and UVN quantiles (Dufour et al., 1998). Two common CB tests are Shapiro-Wilk (SW) (Shapiro & Wilk, 1965) and Shapiro-Francia (SF) (Shapiro & Francia, 1972). Their test statistics will be denoted by $T_{SW}$ and $T_{SF}$, respectively. Both are closely related to $d_{W_2}$. For example, for $m > 1$, $T_{SW}$ is a strictly decreasing function of $d_{W_2}(\hat{P}_{Y,m}, \mathcal{N}(0, 1))$ (del Barrio et al., 1999), i.e., large $T_{SW}$ values correspond to small $d_{W_2}$, justifying the rule of requiring $T^*_{SW} > T_\alpha$ in order not to reject $H_0$.

Figure 2: Illustration of the distribution family $\mathcal{G}$.

The Benefits of GoF Testing in Autoencoders. GoF tests of MVN inspect whether $P_Y$ is equal to some $P_Z \in \mathcal{G}$ (Fig. 1), where $\mathcal{G}$ is the class of all Gaussians, i.e., MVN distributions. The primary justification for preferring a general Gaussian $P_Z$ over the more commonly specified isotropic Gaussian (Makhzani et al., 2015; Kingma & Welling, 2013; Higgins et al., 2017) is that its covariance $\Sigma$ is allowed to implicitly adapt during training. For example, $\Sigma$ can be of full rank, allowing for correlations between the latent variables, or singular (i.e., a degenerate MVN, Figs. 5b, 5c). The ability to adapt to a degenerate MVN is of particular benefit. If $F$ is differentiable and the dimension of the latent space is greater than the intrinsic dimension of the data distribution, then the latent variable $Y = F(X)$ only takes values in a manifold whose dimension is less than that of the latent space (Rubenstein et al., 2018). If $P_Z$ is only allowed to be an isotropic Gaussian, or more generally a Gaussian with a covariance matrix of full rank, the regularizer in Equation 1 will promote an $F$ that fills the latent space with $F(X)$, leading to poor sample quality and wrong proportions of generated images (Rubenstein et al., 2018). Since GoF tests use the class $\mathcal{G}$ of all Gaussians, they may help avoid such a situation. Note that this precludes the use of whitening, i.e., the transformation $\Sigma^{-1/2}[X - \mathbb{E}(X)]$ to get to $\mathcal{N}(0, I)$, as the covariance matrix $\Sigma$ may be singular. In other words, whitening confines the available Gaussians to a subset much smaller than $\mathcal{G}$ (Fig. 2).
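The projection-based testing recipe described in the Normality paragraph above is simple to carry out in practice. The sketch below is our own illustration, not the paper's released code: it draws a random direction on the unit sphere, projects a batch of codes onto it, and applies the univariate Shapiro-Wilk test from SciPy; the function name `projected_shapiro_wilk` and the synthetic batch are ours.

```python
import numpy as np
from scipy import stats

def projected_shapiro_wilk(Y, rng=None):
    """Test an (m x d) batch of codes Y for MVN by projecting onto a random
    unit vector u and applying the univariate Shapiro-Wilk test to Y @ u."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(Y.shape[1])
    u /= np.linalg.norm(u)              # uniform direction on the unit sphere S
    stat, pval = stats.shapiro(Y @ u)   # T_SW* and its p-value
    return stat, pval

# A Gaussian batch should rarely be rejected at level alpha = 0.05.
rng = np.random.default_rng(0)
Y = rng.standard_normal((128, 64)) @ rng.standard_normal((64, 64))  # correlated Gaussian codes
stat, pval = projected_shapiro_wilk(Y, rng)
print(f"T_SW = {stat:.4f}, p-value = {pval:.3f}, reject H0: {pval < 0.05}")
```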
Notably, a similar benefit was exemplified by the 2-Stage VAE, where decoupling manifold learning from learning the probability measure stabilized FID scores and sharpened generated samples when the intrinsic dimension is less than the dimension of the latent space (Dai & Wipf, 2019). There are several additional benefits of GoF testing. First, it allows the use of higher criticism (HC), discussed in Section 3.2: GoF tests act at the local level (minibatch), while HC applies at the global level (training set). Second, many GoF tests produce closed-form test statistics, and thus do not require tuning and provide easily understood output. Lastly, testing for MVN via projections is unaffected by rank deficiency. This is not the case for the majority of multivariate GoF tests, e.g., the Henze-Zirkler test (Henze & Zirkler, 1990). In fact, any affine invariant test statistic for MVN must be a function of the Mahalanobis distance, consequently requiring non-singularity (Henze, 2002).

Figure 3: Model selection demonstration. If the regularization coefficient $\lambda$ is too small (a) or too big (c), the model is under- (over-) regularized and the minibatch p-values (p) are skewed right (left). As a result, global normality is rejected through the higher criticism principle for p-value uniformity ($p_{KS_{unif}} < \alpha$). (b) An appropriately chosen $\lambda$ fails to reject global normality ($p_{KS_{unif}} > \alpha$).

3.1 A LOCAL PERSPECTIVE: GOODNESS OF FIT FOR NORMALITY

In light of Section 2, the aim is to ensure proximity of $P_X$ to $G_\# P_Z$ by finding $F, G$ such that the reconstruction loss $d$ is small and $F_\# P_X$ is sufficiently close to the prior $P_Z$. As discussed above, $P_Z$ is not completely specified but is required to belong to $\mathcal{G}$. We can formulate the problem as
$$\min_{F,G}\ \mathbb{E}[d(X, G(F(X)))] \quad \text{subject to} \quad F_\# P_X \in \mathcal{G}.$$
To enforce the constraint in the minimization, a hypothesis test can be used to decide whether or not to reject the null $H_0: F_\# P_X \in \mathcal{G}$. If $H_0$ is rejected for small values of the test statistic, we can include this rejection criterion as a constraint in the optimization problem
$$\min_{F,G}\ \mathbb{E}[d(X, G(F(X)))] \quad \text{subject to} \quad \mathbb{E}\big[T(\{F(X_i)\}_{i=1}^m)\big] > T_\alpha.$$
Rewriting with a regularization coefficient $\lambda$ leads to the Lagrangian
$$\min_{F,G}\ \mathbb{E}[d(X, G(F(X)))] + \lambda\big(T_\alpha - \mathbb{E}[T(\{F(X_i)\}_{i=1}^m)]\big).$$
Since $\lambda \ge 0$ and $T_\alpha$ does not depend on $F$ or $G$, the final objective can be simplified to
$$\min_{F,G}\ \mathbb{E}\big[d(X, G(F(X))) - \lambda\, T(\{F(X_i)\}_{i=1}^m)\big]. \qquad (2)$$
When the network is trained using a single minibatch of size $m$, making $Y$ less statistically distinguishable from $\mathcal{G}$ amounts to increasing $\mathbb{E}[T(\{F(X_i)\}_{i=1}^m)]$, where the $X_i$ are i.i.d. $\sim P_X$. However, if training results in none of the minibatches yielding $T$ values small enough to reject $H_0$ at a given $\alpha$ level, this also indicates a mismatch between the latent and prior distributions. This type of mismatch cannot be detected at the minibatch level; a more global view is detailed in the next section.

3.2 A GLOBAL PERSPECTIVE: GOODNESS OF FIT FOR UNIFORMITY - HIGHER CRITICISM

Algorithm 1 Evaluating Higher Criticism
Require: trained encoder $F_\theta$, data $\{x_i\}_{i=1}^N$, GoF test $T$
1: for $i = 1 : \lfloor N/m \rfloor$ do
2:   Randomly sample a minibatch $\mathbf{X}$ of size $m$
3:   $\mathbf{Y} = F_\theta(\mathbf{X})$
4:   if projection required then
5:     $T^* = T(\mathbf{Y}u)$, where $u \in \mathcal{S}$
6:   else
7:     $T^* = T(\mathbf{Y})$
8:   Calculate the p-value of $T^*$ and store it
9: Use $KS_{unif}$ to evaluate the collected set of p-values.

Neural networks are powerful universal function approximators (Hornik et al., 1989; Cybenko, 1989). With sufficient capacity, it may be possible to train $F$ to overfit, i.e., to produce too many minibatches with large p-values.
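As explained next, under $H_0$ the minibatch p-values should look uniform, so Algorithm 1 amounts to collecting them and running a Kolmogorov-Smirnov test against Uniform(0, 1). The sketch below is our own minimal illustration of that loop; `encode` is a hypothetical trained encoder mapping a minibatch to (m x d) codes, and the projection/Shapiro-Wilk choice mirrors the sketch above.

```python
import numpy as np
from scipy import stats

def higher_criticism(encode, X, m=64, rng=None):
    """Algorithm 1 sketch: collect minibatch GoF p-values from the encoder and
    test them for uniformity on (0, 1) with a Kolmogorov-Smirnov test."""
    rng = np.random.default_rng() if rng is None else rng
    N = X.shape[0]
    pvals = []
    for _ in range(N // m):
        idx = rng.choice(N, size=m, replace=False)   # random minibatch
        Y = encode(X[idx])                           # codes, shape (m, d)
        u = rng.standard_normal(Y.shape[1])
        u /= np.linalg.norm(u)                       # random projection direction
        _, p = stats.shapiro(Y @ u)                  # local GoF p-value
        pvals.append(p)
    # KS_unif: compare the empirical distribution of p-values to Uniform(0, 1).
    _, p_ks = stats.kstest(pvals, "uniform")
    return p_ks, np.asarray(pvals)

# Global normality is rejected when p_ks < alpha, i.e. when the collected p-values
# are skewed toward 0 (under-regularized) or toward 1 (over-regularized).
```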
Under $H_0$, the probability integral transform posits p-value uniformity. Therefore, it is expected that after observing many minibatches, approximately a fraction $\alpha$ of them will have p-values less than $\alpha$. This idea of a more encompassing test is known as Tukey's higher criticism (HC) (Donoho et al., 2004). While each minibatch GoF test is concerned with optimizing for indistinguishability from the prior distribution class $\mathcal{G}$, the HC test is concerned with testing whether the collection of p-values is uniformly distributed on $(0, 1)$, which may be accomplished through the Kolmogorov-Smirnov uniformity ($KS_{unif}$) test. See Algorithm 1 for pseudo-code and Fig. 3 for an illustration of the HC process. HC and GoF form a symbiotic relationship: HC cannot exist by itself, and GoF by itself may over- or under-fit, producing p-values in incorrect proportions.

Figure 4: (a) Full GoFAE architecture. (b) Example of a singularity on the $T_{SW}$ polytope. The blue region is the polytope $U = \{x = [x_1, x_2, x_3] \in \mathbb{R}^3 : 0 \le x_1 \le x_2 \le x_3 \le 1\}$. For $x \in U$, the coordinates of $x$ are also its order statistics; the min, median, and max coordinates correspond to the x-, y-, and z-axes. The green simplex is where $T_{SW}(\{x_1, x_2, x_3\}) = 1$, and the region built with the dotted lines and the red line creates an acceptance region, outside of which is the rejection region. The light blue arrows are the derivatives at the corresponding points, and the red boundary line corresponds to the singularity where the gradient blows up. (c) Visualization of GoF optimization on a Stiefel manifold $M$.

4 RIEMANNIAN SGD FOR OPTIMIZING A GOODNESS OF FIT AUTOENCODER

In practice, $F$ and $G$ are neural networks parameterized by $\theta$ and $\phi$, respectively. The empirical GoFAE minibatch loss function based on Equation 2 is written as
$$\mathcal{L}(\theta, \phi; \{x_i\}_{i=1}^m) = \frac{1}{m}\sum_{i=1}^m d\big(x_i, G_\phi(F_\theta(x_i))\big) - \lambda\, T\big(\{F_\theta(x_i)\}_{i=1}^m\big). \qquad (3)$$
A GoF test statistic $T$ is not merely an evaluation mechanism. Its gradient will impact how the model learns which characteristics of a sample indicate strong deviation from normality, carrying over to what $P_Y$ becomes. The following desiderata may help when selecting $T$. We denote collections of sample observations by bold $\mathbf{X}$, $\mathbf{V}$, and $\mathbf{Y}$. We suppose that, with probability one under $P_X$, the following two conditions are satisfied.

GoF-Trainable: $T(F_\theta(\mathbf{X}))$ is almost everywhere continuously differentiable in the feasible region $\Omega$.
GoF-Consistent: There exists a $\theta^*$ in $\Omega$ such that $F_{\theta^*}(\mathbf{X})$ is consistent with the assumption that the $F_{\theta^*}(X_i)$ are i.i.d. samples from the target distribution.

GoF-Trainable is needed if gradient-based methods are used to optimize the network parameters. Consider a common encoder architecture $F_\theta$ that consists of multiple feature extraction layers (forming a mapping $H_\Xi(\mathbf{X})$) followed by a fully connected layer with parameter $\Theta$. Thus, $F_\theta(\mathbf{X}) = H_\Xi(\mathbf{X})\Theta = \mathbf{V}\Theta = \mathbf{Y}$, and $\theta = \{\Xi, \Theta\}$. The last layer is linear with no activation function, as shown in Fig. 4a. With this design, normality can be optimized with respect to the last layer $\Theta$, as discussed below. Given a GoF-Consistent statistic $T$, it is natural to seek such a solution $\theta^*$.

Theorem 3. Suppose $\mathbf{V} \in \mathbb{R}^{m \times k}$ is of full row rank. For $\Theta \in \mathbb{R}^{k \times d}$, define $\mathbf{Y} = \mathbf{V}\Theta$. Denote $T_{SW} = T_{SW}(\mathbf{Y}u)$, where $u \in \mathbb{R}^d$ is a unit vector. Then $T_{SW}$ is differentiable with respect to $\Theta$ almost everywhere, and $\nabla_\Theta T_{SW} = 0$ if and only if $T_{SW} = 1$.

This theorem justifies the use of $T_{SW}$ as an objective function according to the GoF desiderata.
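To make the use of a correlation-based statistic as a training objective concrete, here is a minimal differentiable sketch in PyTorch. It is a Shapiro-Francia-style surrogate: the coefficients are normalized Blom scores (approximate expected normal order statistics) rather than the exact Shapiro-Wilk weights, so it illustrates the correlation-based family rather than reproducing the paper's exact $T_{SW}$; the function name is ours.

```python
import torch

def shapiro_francia_stat(y):
    """Correlation-based normality statistic for a 1-D projected minibatch y.
    Uses normalized Blom scores as coefficients (a Shapiro-Francia-style
    approximation). Values lie in (0, 1]; larger means closer to normal."""
    m = y.shape[0]
    i = torch.arange(1, m + 1, dtype=y.dtype, device=y.device)
    q = (i - 0.375) / (m + 0.25)          # Blom plotting positions
    c = torch.special.ndtri(q)            # approximate expected normal order statistics
    b = c / torch.linalg.norm(c)          # unit-norm coefficient vector
    y_sorted, _ = torch.sort(y)           # order statistics y_(1) <= ... <= y_(m)
    num = (b @ y_sorted) ** 2
    den = torch.sum((y - y.mean()) ** 2)  # vanishes only at the Theorem 4 singularity
    return num / den

# Differentiable a.e. in y (sorting is piecewise linear), so a term like
# -lambda * shapiro_francia_stat(Y @ u) can be added to the reconstruction loss.
```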
The largest possible value of $T_{SW}$ is 1, corresponding to an inability to detect deviation from normality no matter what level $\alpha$ is specified. See the appendix for other test choices.

Identifiability, Singularities, and the Stiefel Manifold. If $\mathbf{Y} = \mathbf{V}\Theta$ is Gaussian for some $\Theta$, then $\mathbf{V}\Theta M$ is also Gaussian for any nonsingular matrix $M$. Thus, any matrix of the form $\Theta M$ is a solution. This leads to several problems when optimizing an objective function containing $\Theta$ as a variable and the test statistic as part of the objective. First, there is an identifiability issue since, without any restrictions, the set of solutions is infinite. Here, all that matters is the direction of the projection, so restricting the space to the unit sphere is reasonable. Second, almost all GoF tests for normality in the current literature are affine invariant. However, the gradient and Hessian of the test statistics will not be bounded. Fig. 4b illustrates an example of such a singularity issue.

Theorem 4. Let $T(\{y_i\}_{i=1}^m)$, $m \ge 3$, be any affine invariant test statistic that is non-constant and differentiable wherever the $y_i$ are not all equal. Then, for any $b$, as $(y_1, \ldots, y_m) \to (b, \ldots, b)$, $\sup \|\nabla T(\{y_i\}_{i=1}^m)\| \to \infty$, where $\|\cdot\|$ is the Frobenius norm.

If $\Theta$ is searched without being kept away from zero or the diagonal line, traditional SGD results will not be applicable to yield convergence of $\Theta$. A common strategy might be to lower bound the smallest singular value of $\Theta^{\mathsf T}\Theta$. However, this does not solve the issue that any re-scaling of $\Theta$ leads to another solution, so $\Theta$ must also be upper bounded. It is thus desirable to restrict $\Theta$ in a way that keeps it away from singularity by both upper and lower bounding its singular values. We propose to restrict $\Theta$ to the compact Riemannian manifold of orthonormal matrices, i.e., $\Theta \in M = \{\Theta \in \mathbb{R}^{k \times d} : \Theta^{\mathsf T}\Theta = I_d\}$, also known as the Stiefel manifold. This imposes a feasible region for $\theta$, namely $\Omega = \{\{\Xi, \Theta\} : \Theta^{\mathsf T}\Theta = I_d\}$. Optimization over a Stiefel manifold has been studied previously (Nishimori & Akaho, 2005; Absil et al., 2009; Cho & Lee, 2017; Bécigneul & Ganea, 2018; Huang et al., 2018; 2020; Li et al., 2020).

4.1 OPTIMIZING GOODNESS OF FIT TEST STATISTICS

A Riemannian metric provides a way to measure lengths and angles of tangent vectors of a smooth manifold. For the Stiefel manifold $M$ under the Euclidean metric, we have $\langle \Gamma_1, \Gamma_2 \rangle = \mathrm{trace}(\Gamma_1^{\mathsf T}\Gamma_2)$ for any $\Gamma_1, \Gamma_2$ in the tangent space of $M$ at $\Theta$, $T_\Theta M$. It is known that $T_\Theta M = \{\Gamma : \Gamma^{\mathsf T}\Theta + \Theta^{\mathsf T}\Gamma = 0,\ \Theta \in M\}$ (Li et al., 2020). For an arbitrary matrix $\tilde{\Theta} \in \mathbb{R}^{k \times d}$, the retraction back to $M$, denoted $\pi_M(\tilde{\Theta})$, can be performed via a singular value decomposition: $\pi_M(\tilde{\Theta}) = RS^{\mathsf T}$, where $\tilde{\Theta} = R\Lambda S^{\mathsf T}$, $\Lambda$ is the $d \times d$ diagonal matrix of singular values, $R \in \mathbb{R}^{k \times d}$ has orthonormal columns (i.e., $R \in M$), and $S \in \mathbb{R}^{d \times d}$ is an orthogonal matrix. The projection of the Euclidean gradient $D = \nabla_\Theta T$ onto $T_\Theta M$, denoted $\pi_{T_\Theta M}(D)$, can be accomplished by $\pi_{T_\Theta M}(D) = D - \Theta(\Theta^{\mathsf T} D + D^{\mathsf T}\Theta)/2$ (Fig. 4c). The process of mapping Euclidean gradients to $T_\Theta M$, updating $\Theta$, and retracting back to $M$ is well known (Nishimori & Akaho, 2005). The next result states that the Riemannian gradient of $T_{SW}$ has a finite second moment, which is needed for our convergence theorem in Section 4.2.

Theorem 5. Let $T = T_{SW}$ be as in Theorem 3. Denote by $\nabla_\theta$, $\nabla_\Theta$ the Riemannian gradients w.r.t. $\theta$, $\Theta$, respectively, and by $\|\cdot\|$ the Frobenius norm. Let $B = (I - J/m)\mathbf{V}$, i.e., $B$ is obtained by subtracting from each row of $\mathbf{V}$ the mean of the rows. Suppose (a) $\sup_\Xi\big(\mathbb{E}[\|B\|^4] + \mathbb{E}[\|\nabla_\Xi B\|^4]\big) < \infty$, and (b) $\sup_{\|x\|=1,\,\Xi} \mathbb{E}[\|Bx\|^{-4}] < \infty$. Then $\sup_\theta \mathbb{E}[\|\nabla_\theta T\|^2] < \infty$.
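The tangent-space projection and SVD retraction described in Section 4.1 are short to implement. The following PyTorch helpers (our own names) are a sketch of the stated formulas, not the released implementation, and are reused in the training-step sketch after Algorithm 2.

```python
import torch

def project_to_tangent(theta, grad):
    """Project a Euclidean gradient D onto the tangent space of the Stiefel
    manifold at Theta: D - Theta (Theta^T D + D^T Theta) / 2."""
    sym = theta.T @ grad + grad.T @ theta
    return grad - theta @ sym / 2

def retract_to_stiefel(theta):
    """Retract an arbitrary k x d matrix onto the Stiefel manifold via a thin
    SVD: Theta = R Lambda S^T  ->  R S^T (orthonormal columns)."""
    R, _, Sh = torch.linalg.svd(theta, full_matrices=False)
    return R @ Sh

# Sanity check: the retraction returns a matrix with orthonormal columns.
theta = torch.randn(32, 8)
Q = retract_to_stiefel(theta)
assert torch.allclose(Q.T @ Q, torch.eye(8), atol=1e-5)
```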
4.2 CONVERGENCE OF THE GOODNESS OF FIT AUTOENCODER

Since $\Theta \in M$, model training requires Riemannian SGD. The proof of Bonnabel (2013) for the convergence of SGD on a Riemannian manifold, which was extended from the Euclidean case (Bottou, 1998), requires conditions on the step size, differentiability, and a uniform bound on the stochastic gradient. We show that this result holds under a much weaker condition, eliminating the need for a uniform bound. While we apply the resulting theorem to our specific problem with a Stiefel manifold, we emphasize that Theorem 6 is applicable to other suitable Riemannian manifolds.

Theorem 6. Let $M$ be a connected Riemannian manifold with injectivity radius uniformly bounded from below by $I > 0$. Let $C \in C^{(3)}(M)$ and let $R_w$ be a twice continuously differentiable retraction. Let $z_0, z_1, \ldots$ be i.i.d. copies of $\zeta$ taking values in $\mathcal{Z}$. Let $H : \mathcal{Z} \times M \to TM$ be a measurable function such that $\mathbb{E}[H(\zeta, w)] = \nabla C(w)$ for all $w \in M$, where $TM$ is the tangent bundle of $M$. Consider the SGD update $w_{t+1} = R_{w_t}\big(-\gamma_t H(z_t, w_t)\big)$ with step sizes $\gamma_t > 0$ satisfying $\sum_t \gamma_t^2 < +\infty$ and $\sum_t \gamma_t = +\infty$. Suppose there exists a compact set $K$ such that $w_t \in K$ for all $t$ and $\sup_{w \in K} \mathbb{E}[\|H(\zeta, w)\|^2] \le A^2$ for some $A > 0$. Then $C(w_t)$ converges almost surely and $\nabla C(w_t) \to 0$ almost surely.

In our context, $C(\cdot)$ corresponds to Equation 3. The parameters of the GoFAE are $\{\theta, \phi\} = \{\Xi, \Theta, \phi\}$, where $\Xi$ and $\phi$ are defined in Euclidean space, a Riemannian manifold endowed with the Euclidean metric, and $\Theta$ lies on the Stiefel manifold. Thus, $\{\Xi, \Theta, \phi\}$ lives on a product manifold that is also Riemannian, and $\{\Xi, \Theta, \phi\}$ are updated simultaneously. Convergence of the GoFAE follows from Theorem 5 and Theorem 6, provided that $\Xi$ and $\phi$ stay within a compact set. Algorithms 2 and S1 give the pseudo-code for GoFAE optimization and for the complete GoFAE and HC pipeline, respectively.

5 EXPERIMENTS

We evaluate the generation and reconstruction performance, normality, and informativeness of the GoFAE using several GoF statistics on the MNIST (LeCun et al., 1998), CelebA (Liu et al., 2015), and CIFAR-10 (Krizhevsky et al., 2009) datasets, and compare to several other GAE models. (Code can be found at https://github.com/aripalmer/GoFAE.) We emphasize that our goal is not merely to produce competitive evaluation metrics, but to provide a principled way to balance the reconstruction versus prior matching trade-off. For MNIST and CelebA the architectures are based on Tolstikhin et al. (2017), while CIFAR-10 is from Lippe (2022). The aim is to keep the architecture consistent across models with the exception of method-specific components. See the appendix for complete architectures (App. C), training details (App. D), and additional results (App. E). The following results are from experiments on CelebA.

Figure 5: Effects of $\lambda$ on the p-value distribution and mutual information for GoFAE models. Panels: (a) p-value distribution, (b) Shapiro-Wilk, (c) Cramér-von Mises.

Algorithm 2 GoFAE Optimization
Require: test $T$, learning rates $\eta_1, \eta_2$, max iterations $J$, regularization coefficient $\lambda$
1: Initialize $\Xi, \Theta, \phi$
2: while $j < J$ do
3:   Sample a minibatch $\mathbf{X}$ of size $m$
4:   $\mathbf{V} = F^{(1)}_{\Xi_j}(\mathbf{X})$, $\mathbf{Y} = F^{(2)}_{\Theta_j}(\mathbf{V}) = \mathbf{V}\Theta_j$
5:   if the test requires projection then
6:     $T^* = T(\mathbf{Y}u)$, where $u \in \mathcal{S}$
7:   else
8:     $T^* = T(\mathbf{Y})$
9:   $\mathcal{L} = d(\mathbf{X}, G_\phi(\mathbf{Y})) - \lambda T^*$
10:  $\Xi_{j+1} = \Xi_j - \eta_1 \nabla_\Xi \mathcal{L}$ (or another optimizer)
11:  $\phi_{j+1} = \phi_j - \eta_1 \nabla_\phi \mathcal{L}$ (or another optimizer)
12:  $D = \nabla_\Theta T^*$
13:  $\Gamma = D - \Theta_j(\Theta_j^{\mathsf T} D + D^{\mathsf T}\Theta_j)/2$
14:  $\Theta'_{j+1} = \Theta_j + \eta_2 \Gamma$
15:  Compute $R\Lambda S^{\mathsf T} = \mathrm{SVD}(\Theta'_{j+1})$
16:  $\Theta_{j+1} = \pi_M(\Theta'_{j+1}) = RS^{\mathsf T}$
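For concreteness, the sketch below condenses one iteration of Algorithm 2 in PyTorch, reusing `shapiro_francia_stat`, `project_to_tangent`, and `retract_to_stiefel` from the sketches above. It is our own reading of the pseudo-code under simplifying assumptions: `H_xi` is the feature extractor, `G_phi` the decoder, `Theta` a k x d tensor with `requires_grad=True`, and `opt` a Euclidean optimizer over the parameters of `H_xi` and `G_phi` only. Following lines 12-16 as written, only the gradient of the test statistic drives the $\Theta$ update; the released implementation may differ.

```python
import torch

def gofae_step(H_xi, G_phi, Theta, x, opt, lam=10.0, eta2=1e-3):
    """One GoFAE iteration (Algorithm 2 sketch)."""
    V = H_xi(x)                                     # features, shape (m, k)
    Y = V @ Theta                                   # codes,    shape (m, d)
    u = torch.randn(Y.shape[1], device=Y.device, dtype=Y.dtype)
    u = u / u.norm()                                # projection direction on S
    T = shapiro_francia_stat(Y @ u)                 # GoF statistic T*
    recon = ((x - G_phi(Y)) ** 2).mean()
    loss = recon - lam * T                          # Eq. (3)

    # Gradient of the test statistic alone w.r.t. Theta, for the manifold step.
    D = torch.autograd.grad(T, Theta, retain_graph=True)[0]

    opt.zero_grad()
    loss.backward()                                 # Euclidean gradients for Xi, phi
    opt.step()

    with torch.no_grad():                           # Riemannian ascent step on T
        Gamma = project_to_tangent(Theta, D)        # map gradient to the tangent space
        Theta.copy_(retract_to_stiefel(Theta + eta2 * Gamma))
        Theta.grad = None                           # Theta is updated manually, not by opt
    return loss.item(), T.item()
```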
Effect of λ and Mutual Information (MI). We investigated the balance between generation and reconstruction in the GoFAE by considering the minibatch (local) GoF test p-value distributions, (global) normality, and MI as functions of $\lambda$ (Fig. 5). Global normality, as assessed through the higher criticism (HC) principle for p-value uniformity, is computed using Algorithm 1. We trained the GoFAE on CelebA using $T_{SW}$ as the test statistic for $\lambda = 10$ (Fig. 5a, blue) and $\lambda = 100{,}000$ (Fig. 5a, yellow). For small $\lambda$, the model emphasizes reconstruction and the p-value distribution is skewed right, since the penalty for deviating from normality is small (Fig. 5a, red dashed line). As $\lambda$ increases, more emphasis is placed on normality and, for some $\lambda$, the p-values are expected to be uniformly distributed (Fig. 5a, black line). If $\lambda$ is too large, the p-value distribution will be left-skewed (Fig. 5a, red solid line), which corresponds to overfitting the prior. We assessed the normality of GoFAE minibatch encodings using 30 repetitions of Algorithm 1 for different $\lambda$ (Figs. 5b-5c). The blue points and the blue solid line represent the $KS_{unif}$ test p-values ($p_{KS_{unif}}$) and their mean ($\bar{p}_{KS_{unif}} = \frac{1}{30}\sum_{i=1}^{30} p_{KS_{unif},i}$), respectively. HC rejects global normality ($\bar{p}_{KS_{unif}} < 0.05$) when the test statistic p-values are right-skewed (too many minibatches deviate from normality) or left-skewed (minibatches are too normal); this GoFAE property discourages the posterior from over-fitting the prior. The range of $\lambda$ values for which the distribution of GoF test p-values is indistinguishable from uniform is where the mean $\bar{p}_{KS_{unif}} \ge 0.05$ (Figs. 5b-5c). Our method uses the class of Gaussians, $\mathcal{G} = \{\mathcal{N}(\mu, \Sigma) : \mu \in \mathbb{R}^d, \Sigma \in \mathbb{R}^{d \times d}\}$, as the prior, where $(\mu, \Sigma)$ denote the mean vector and covariance matrix of a Gaussian distribution. When a model has finished training, we assume $F_\# P_X \in \mathcal{G}$ and use estimates of the mean and covariance $(\hat{\mu}, \hat{\Sigma})$ for each GoFAE model to generate samples. Specifically, we drew $N = 10^4$ samples $\{z_i\}_{i=1}^N$ to generate images $\{G_\phi(z_i)\}_{i=1}^N$. These images are then encoded, giving the set $\{\tilde{z}_i\}_{i=1}^N$ with $\tilde{z}_i = F_\theta(G_\phi(z_i))$, resulting in a Markov chain $\{z_i\}_{i=1}^N \to \{G_\phi(z_i)\}_{i=1}^N \to \{\tilde{z}_i\}_{i=1}^N$. Assuming the $\tilde{z}_i$ are also normal, the data processing inequality gives a lower bound on the mutual information $I(Z, \tilde{Z})$ between $Z \sim P_Z$ and $\tilde{Z} \sim F_\# G_\# P_Z$ (Figs. 5b-5c, red line). As $\lambda$ increases, the MI diminishes, suggesting that the posterior overfits the prior as the encoding becomes independent of the data. The unimodality of $\bar{p}_{KS_{unif}}$ and the monotonicity of MI suggest that $\lambda$ can be selected solely based on $KS_{unif}$, without reference to a performance criterion.
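The mutual-information quantity above can be obtained from sample covariances via the standard log-determinant identity for jointly Gaussian vectors. The helper below is our own illustration (the paper's estimator may differ in its details); it assumes paired samples $z_i$ and $\tilde{z}_i$ and the normality assumption stated in the text.

```python
import numpy as np

def gaussian_mutual_information(z, z_tilde, eps=1e-6):
    """Estimate I(Z; Z~) assuming (Z, Z~) is jointly Gaussian:
        I = 0.5 * [logdet(Cov Z) + logdet(Cov Z~) - logdet(Cov [Z, Z~])].
    z, z_tilde: paired samples of shape (N, d). Returns nats."""
    d = z.shape[1]
    joint = np.cov(np.hstack([z, z_tilde]), rowvar=False)
    joint += eps * np.eye(2 * d)              # ridge for (near-)singular covariances
    logdet_joint = np.linalg.slogdet(joint)[1]
    logdet_z = np.linalg.slogdet(joint[:d, :d])[1]
    logdet_zt = np.linalg.slogdet(joint[d:, d:])[1]
    return 0.5 * (logdet_z + logdet_zt - logdet_joint)

# Example with known ground truth: Z~ = Z + noise gives about 0.5*d*log(1 + 1/0.25),
# roughly 3.2 nats for d = 4.
rng = np.random.default_rng(0)
z = rng.standard_normal((10_000, 4))
print(gaussian_mutual_information(z, z + 0.5 * rng.standard_normal((10_000, 4))))
```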
Gaussian Degeneracy. Due to noise in the data, $P_Y$ can numerically never be singular. Nevertheless, the experiments indicated that the GoF test pushed $P_Y$ to become more and more singular, in the sense that the condition number of its covariance matrix, $\kappa(\hat{\Sigma})$, became increasingly large. As $\lambda$ continues to increase, $\kappa(\hat{\Sigma})$ also increases (Figs. 5b-5c, green line), implying a Gaussian increasingly concentrated around a lower-dimensional linear manifold. We observed the same trend in the spectrum of singular values (SVs) of $\hat{\Sigma}$ for each $\lambda$ after training with $T_{SW}$ (Fig. 6); while the spectrum did not exhibit large SVs, it had many small SVs. Notably, even when $\lambda$ was not too large, $\kappa(\hat{\Sigma})$ was relatively large, indicating a shift to a lower dimension (Figs. 5b-5c). However, MI remained relatively large and the $KS_{unif}$ test suggests that $P_Y$ is still indistinguishable from Gaussian. Together, these results are evidence that the GoFAE can adapt as needed to a reduced-dimension representation while maintaining the representation's informativeness.

Figure 6: Singular values of $\hat{\Sigma}$.

Table 1: Evaluation on CelebA by MSE, FID scores, and p-values from higher criticism. The last five columns report the mean (std) p-value of the Kolmogorov-Smirnov uniformity test when the minibatch-level GoF test is SW, SF, CVM, KS, or EP.

| Algorithm | MSE | FID (Recon.) | FID (Gen.) | SW | SF | CVM | KS | EP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AE-Baseline | 69.32 | 40.28 | 126.43 | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| VAE (fixed γ) | 101.78 | 49.29 | 56.78 | 0.01 (0.05) | 0.0 (0.01) | 0.10 (0.19) | 0.30 (0.30) | 0.23 (0.27) |
| VAE (learned γ) | 68.09 | 34.54 | 58.15 | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| 2-Stage-VAE | 68.09 | 34.54 | 49.55 | 0.51 (0.26) | 0.49 (0.27) | 0.49 (0.29) | 0.52 (0.28) | 0.49 (0.32) |
| β(10)-VAE | 235.37 | 99.59 | 101.30 | 0.56 (0.31) | 0.46 (0.35) | 0.47 (0.28) | 0.38 (0.25) | 0.44 (0.27) |
| β(2)-VAE | 129.72 | 60.72 | 65.65 | 0.19 (0.21) | 0.06 (0.11) | 0.42 (0.29) | 0.52 (0.30) | 0.44 (0.27) |
| WAE-GAN | 72.70 | 38.23 | 48.39 | 0.01 (0.04) | 0.0 (0.0) | 0.11 (0.18) | 0.14 (0.19) | 0.13 (0.15) |
| GoFAE-SW(63) | 66.11 | 34.69 | 43.76 | 0.14 (0.23) | 0.28 (0.26) | 0.01 (0.03) | 0.08 (0.17) | 0.08 (0.17) |
| GoFAE-SF(25) | 66.35 | 33.89 | 43.3 | 0.04 (0.09) | 0.12 (0.20) | 0.0 (0.01) | 0.02 (0.04) | 0.01 (0.03) |
| GoFAE-CVM(398) | 67.43 | 37.50 | 46.74 | 0.19 (0.22) | 0.34 (0.29) | 0.05 (0.08) | 0.16 (0.27) | 0.19 (0.28) |
| GoFAE-KS(63) | 66.62 | 35.14 | 43.93 | 0.0 (0.00) | 0.0 (0.00) | 0.0 (0.00) | 0.00 (0.00) | 0.00 (0.00) |
| GoFAE-EP(10) | 67.03 | 39.71 | 48.53 | 0.03 (0.08) | 0.06 (0.10) | 0.00 (0.00) | 0.00 (0.00) | 0.02 (0.04) |

Reconstruction, Generation, and Normality. We assessed the quality of the generated and test-set reconstructed images using the Fréchet Inception Distance (FID) (Heusel et al., 2017) based on $10^4$ samples, and mean squared error (MSE) (Table 1). We compared the GoFAE with the AE (Bengio et al., 2013), VAE (Kingma & Welling, 2013), VAE with learned γ (Dai & Wipf, 2019), 2-Stage VAE (Dai & Wipf, 2019), β-VAE (Higgins et al., 2017), and WAE-GAN (Tolstikhin et al., 2017); convergence was assessed by tracking the test-set reconstruction error over training epochs (Tables S3-S9 and Fig. S9). Figs. 7a and 7b are for the models presented in Table 1. For the GoFAE models, we selected the smallest $\lambda$ whose mean p-value of the HC uniformity test, $KS_{unif}$, was greater than 0.05. The correlation-based GoF tests have the most competitive performance on FID and test-set MSE. We assessed the normality of minibatch encodings across each method using several GoF tests for normality combined with $KS_{unif}$. We ran 30 repetitions of Algorithm 1 for each method and report the mean (std) of the $KS_{unif}$ p-values in Table 1. In addition to superior MSE and FID scores, the GoFAE models obtained uniform p-values under varying GoF tests. The variability across GoF tests highlights the fact that different tests are sensitive to different distributional characteristics. Qualitative and quantitative assessments (Appendix E.3) and convergence plots (Appendix E.4) are given in the appendix for MNIST, CIFAR-10, and CelebA. Finally, an ablation study provides empirical justification for Riemannian SGD in the GoFAE (Appendix E.5).
6 CONCLUSION

Figure 7: Reconstruction error on the CelebA training (a) and testing (b) sets.

We presented the GoFAE, a deterministic GAE based on optimal transport and Wasserstein distances that optimizes GoF test statistics. We showed that gradient-based optimization of the GoFAE induces identifiability issues, unbounded domains, and gradient singularities, which we resolve using Riemannian SGD. By using GoF statistics to measure deviation from the prior class of Gaussians, our model is capable of implicitly adapting its covariance matrix during training from full-rank to singular, which we demonstrate empirically. We developed a performance-agnostic model selection algorithm based on higher criticism of p-values for global normality testing. Collectively, the empirical results show that the GoFAE achieves comparable reconstruction and generation performance while retaining statistical indistinguishability from Gaussian in the latent space.

ACKNOWLEDGEMENTS

The authors thank the anonymous reviewers for their comments, which have improved this work. This work was partially supported by a National Science Foundation (NSF) grant: IIS-1718738; a National Institutes of Health (NIH) grant: K02DA043063; and a grant from the US Department of Education: a Graduate Assistance in Areas of National Need (GAANN) program. J. Bi was also supported by NIH grants 5R01MH119678-02 and 5R01DA051922-02.

REFERENCES

P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.

Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken ELBO. In International Conference on Machine Learning, pp. 159-168. PMLR, 2018.

Theodore W Anderson, Donald A Darling, et al. Asymptotic theory of certain "goodness of fit" criteria based on stochastic processes. The Annals of Mathematical Statistics, 23(2):193-212, 1952.

Ludwig Baringhaus and Norbert Henze. A consistent test for multivariate normality based on the empirical characteristic function. Metrika, 35(1):339-348, 1988.

Gary Bécigneul and Octavian-Eugen Ganea. Riemannian adaptive optimization methods. arXiv preprint arXiv:1810.00760, 2018.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.

Silvere Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217-2229, 2013.

Léon Bottou. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142, 1998.

Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl-Johann Simon-Gabriel, and Bernhard Schoelkopf. From optimal transport to generative modeling: the VEGAN cookbook. arXiv preprint arXiv:1705.07642, 2017.

Minhyung Cho and Jaehyung Lee. Riemannian approach to batch normalization. In Advances in Neural Information Processing Systems, pp. 5225-5235, 2017.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314, 1989.

Ralph B D'Agostino. Goodness-of-Fit Techniques. Routledge, 2017.

Bin Dai and David Wipf. Diagnosing and enhancing VAE models. arXiv preprint arXiv:1903.05789, 2019.

Eustasio del Barrio, Juan A Cuesta-Albertos, Carlos Matrán, and Jesús M Rodríguez-Rodríguez. Tests of goodness of fit based on the L2-Wasserstein distance. Annals of Statistics, pp. 1230-1239, 1999.
Lizhong Ding, Mengyang Yu, Li Liu, Fan Zhu, Yong Liu, Yu Li, and Ling Shao. Two generator game: Learning to sample via linear goodness-of-fit test. In Advances in Neural Information Processing Systems, pp. 11260-11271, 2019.

Carl Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.

David Donoho, Jiashun Jin, et al. Higher criticism for detecting sparse heterogeneous mixtures. The Annals of Statistics, 32(3):962-994, 2004.

Jean-Marie Dufour, Abdeljelil Farhat, Lucien Gardiol, and Lynda Khalaf. Simulation-based finite sample normality tests in linear regressions. The Econometrics Journal, 1(1):154-173, 1998.

Thomas W Epps and Lawrence B Pulley. A test for normality based on the empirical characteristic function. Biometrika, 70(3):723-726, 1983.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026-1034, 2015.

N Henze and B Zirkler. A class of invariant consistent tests for multivariate normality. Communications in Statistics - Theory and Methods, 19(10):3595-3617, 1990.

Norbert Henze. Invariant tests for multivariate normality: a critical review. Statistical Papers, 43(4):467-506, 2002.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.

Irina Higgins, Loïc Matthey, Arka Pal, Christopher P. Burgess, Xavier Glorot, Matthew M. Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.

Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.

Lei Huang, Xianglong Liu, Bo Lang, Adams Yu, Yongliang Wang, and Bo Li. Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Lei Huang, Xianglong Liu, Jie Qin, Fan Zhu, Li Liu, and Ling Shao. Projection based weight normalization: Efficient method for optimization on oblique manifold in DNNs. Pattern Recognition, 105:107317, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Diederik P Kingma, Max Welling, et al. An introduction to variational autoencoders. Foundations and Trends in Machine Learning, 12(4):307-392, 2019.

Yuta Koike. High-dimensional central limit theorems for homogeneous sums. arXiv preprint arXiv:1902.03809, 2019.

Soheil Kolouri, Phillip E Pope, Charles E Martin, and Gustavo K Rohde. Sliced-Wasserstein autoencoder: An embarrassingly simple generative model. arXiv preprint arXiv:1804.01947, 2018.

Selcuk Korkmaz, Dincer Goksuluk, and Gokmen Zararsiz. MVN: An R package for assessing multivariate normality. The R Journal, 6(2):151-162, 2014.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Jun Li, Li Fuxin, and Sinisa Todorovic. Efficient Riemannian optimization on the Stiefel manifold via the Cayley transform. arXiv preprint arXiv:2002.01113, 2020.

Phillip Lippe. UvA Deep Learning Tutorials. https://uvadlc-notebooks.readthedocs.io/en/latest/, 2022.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730-3738, 2015.

Stephen W Looney. How to use tests for univariate normality to assess multivariate normality. The American Statistician, 49(1):64-70, 1995.

Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

Christopher J Mecklin and Daniel J Mundfrom. A Monte Carlo comparison of the type I and type II error rates of tests of multivariate normality. Journal of Statistical Computation and Simulation, 75(2):93-107, 2005.

Duncan J Murdoch, Yu-Ling Tsai, and James Adcock. P-values are random variables. The American Statistician, 62(3):242-245, 2008.

Yasunori Nishimori and Shotaro Akaho. Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing, 67:106-135, 2005.

Aaron Palmer, Dipak Dey, and Jinbo Bi. Reforming generative autoencoders via goodness-of-fit hypothesis testing. In UAI, pp. 1009-1019, 2018.

Giorgio Patrini, Marcello Carioni, Patrick Forre, Samarth Bhargav, Max Welling, Rianne van den Berg, Tim Genewein, and Frank Nielsen. Sinkhorn autoencoders. arXiv preprint arXiv:1810.01118, 2018.

Calyampudi Radhakrishna Rao. Linear Statistical Inference and Its Applications, volume 2. Wiley, New York, 1973.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Karl Ridgeway and Michael C Mozer. Learning deep disentangled embeddings with the F-statistic loss. Advances in Neural Information Processing Systems, 31:185-194, 2018.

Nathan Ross. Fundamentals of Stein's method. Probability Surveys, 8:210-293, 2011.

Paul K Rubenstein, Bernhard Schoelkopf, and Ilya Tolstikhin. On the latent space of Wasserstein auto-encoders. arXiv preprint arXiv:1802.03761, 2018.

Lars Ruthotto and Eldad Haber. An introduction to deep generative modeling. arXiv preprint arXiv:2103.05180, 2021.

Yongzhao Shao and Ming Zhou. A characterization of multivariate normality through univariate projections. Journal of Multivariate Analysis, 101(10):2637-2640, 2010.

Samuel S Shapiro and RS Francia. An approximate analysis of variance test for normality. Journal of the American Statistical Association, 67(337):215-216, 1972.

Samuel Sanford Shapiro and Martin B Wilk. An analysis of variance test for normality (complete samples). Biometrika, 52(3/4):591-611, 1965.

Michael A Stephens. Goodness-of-Fit Techniques. Routledge, 2017.

Yuhta Takida, Wei-Hsiang Liao, Toshimitsu Uesaka, Shusuke Takahashi, and Yuki Mitsufuji. Preventing posterior collapse induced by oversmoothing in Gaussian VAE. arXiv preprint arXiv:2102.08663, 2021.

Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf.
Wasserstein auto-encoders. ar Xiv preprint ar Xiv:1711.01558, 2017. Published as a conference paper at ICLR 2023 Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008. Tianlin Xu, Li Kevin Wenliang, Michael Munn, and Beatrice Acciaio. Cot-gan: Generating sequential data via causal optimal transport. Advances in Neural Information Processing Systems, 33:8798 8809, 2020. Published as a conference paper at ICLR 2023 A.1 PROOF OF PROPOSITION 1. Recall that in the setup of the result, X is a random variable taking values in Rm, Y and Z are random variables taking values in Rd, F : Rm Rd and G : Rd Rm are two mappings, and X and Y are linked by Y = F(X). Proposition 1. If X and Z are Euclidean spaces and the decoder G is differentiable, then (a) d W2(G#PY , G#PZ) G d W2(PY , PZ), (b) d W2(PX, G#PZ) [E X G(Y ) 2]1/2 + G d W2(PY , PZ). Proof. (a) Recall that for any two probability measures µ and ν on the same Euclidean space, d2 W2(µ, ν) = inf π Π(µ,ν) Z y z 2π(dz, dz). Let f(x, y) = (G(x), G(y)). For any Q Π(PY , PZ), the measure Q ( ) = Q f 1( ) Π(PG(Y ), PG(Z)), so d2 W (PG(Y ), PG(Z)) Z y z 2Q (dy , dz ) = Z G(y) G(z) 2Q(dy, dz) Z y z 2Q(dy, dz). Taking infimum of the right hand side over all Q Π(PY , PZ) gives the result. (b) By triangular inequality for d W2 and part (a), d W2(PX, PG(Z)) d W2(PX, PG(Y )) + d W2(PG(y), PG(Z)) [E X G(Y ) 2]1/2 + G d W2(PY , PZ). A.2 PROOF OF PROPOSITION 2. Proposition 2. Let ˆPX,n be the empirical distribution of samples {Xi}n i=1, and ˆPY,n be the empirical distribution of {Yi = F(Xi)}n i=1. Assume that F(X) is differentiable with respect to X with bounded gradients F(X). Then, (a) d W2(ˆPY,n, PY ) F d W2(ˆPX,n, PX), (b) d W2(PY , PZ) d W2(ˆPY,n, PZ) + F d W2(ˆPX,n, PX). Proof. (a) is a direct consequence of Proposition 1 (a), except that G therein is replaced with F. (b) follows by combining (a) with d W2(PY , PZ) d W2(ˆPY,m, PZ) + d W2(ˆPY,m, PY ). A.3 PROOF OF THEOREM 3. As described in Section B.1, the Shapiro Wilk statistic of a sample X1, . . . , Xm is TSW = (Pm i=1 ai X(i))2 Pm i=1(Xi X)2 , where X(1) X(2) X(m) are the order statistics and ai are certain constants that are all different. Let a = (a1, . . . , am)T . We only need the fact that a = 1 and 1T a = 0, where 1 is the vector of m 1 s. Theorem 3. Suppose V Rm k is of full row rank. For Θ Rk d, define Y = VΘ. Denote TSW = TSW (Yu) where u Rd is a unit vector. Then, TSW is differentiable with respect to Θ almost everywhere, and ΘTSW = 0 if and only if TSW = 1. Proof. Let V be a row permutation of V such that the coordinates of V Θu are in increasing order. Denote by I the m m identity matrix and J = 11T . Put B = (I J/m)V and note that a T V = a T B as a T J = a T 11T = (1T a)T 1 = 0. Then TSW = (a T V Θu)2 (I J/m)V Θu 2 = (a T BΘu)2 BΘu 2 . (S4) Published as a conference paper at ICLR 2023 It is easy to see that TSW is differentiable at Θ if all v T i Θu are different, where v T 1 , . . . , v T m are the rows of V. Also, in this case, V is unique and BΘu = 0. On the other hand, for i = j, since u = 0 and vi = vj as V is of full row rank, the set of Θ with (vi vj)T Θu = 0 is a strict linear subspace of Rk d, so has Lebesgue measure 0. Then TSW is differentiable in Θ almost everywhere. Fix Θ with all the coordinates of VΘu being different. Then ΘTSW = 2(a T BΘu) BΘu 2 BT a (a T BΘu)(BΘu) Note that a T BΘu is a scalar. Also, a T BΘu = 0, for otherwise from (S4) TSW = 0, contradicting Lemma 3 (Shapiro & Wilk, 1965). Suppose ΘTSW = 0. 
Then from (S5), BT a (a T BΘu)(BΘu) Let c = BΘu 2 a T BΘu and h = BΘu ca. Then the above equation can be written as u T = c 1BT hu T = 0, giving BT hu T u = u 2BT h = 0. From u = 0, 0 = h T B = h T (I J/m)V = (h k1)T V with k = (h T 1)/m. Because V is of full row rank, h = k1. However, 1T h = 1T BΘu c1T a. Since 1T B = 1T (I J/m)V = 0 and as pointed earlier, 1T a = 0, then 1T h = 0. As a result k = 0, giving h = 0, i.e. BΘu = ca. Plugging this into (S5) and noting that a = 1, TSW = 1. On the other hand, by Lemma 2 of (Shapiro & Wilk, 1965), TSW = 1 if and only if BΘu = sa for some scalar s = 0. Clearly in this case all the coordinates of BΘu are different, so TSW is differentiable at Θ. Since 1 is the maximum value of TSW , then ΘTSW = 0. A.4 PROOF OF THEOREM 4. Theorem 4. Let T({yi}m i=1), m 3 be any affine invariant test statistic that is non-constant and differentiable wherever yi are not all equal. Then, for any b, as (y1, . . . , ym) (b, . . . , b) , sup T({yi}m i=1) , where is the Frobenius norm. Proof. First, s > 0, T(sy1 + b, . . . , sym + b) = T(y1, . . . , ym) then T(sy1 + b, . . . , sym + b) = 1 s T(y1, . . . , ym). By assumption T(y1, . . . , ym) = 0 for some (y, . . . , ym). Then let s 0, T(sy1 + b, . . . , sym + b) = 1 s T(y1, . . . , ym) . A.5 PROOF OF THEOREM 5. Recall X = (X1, . . . , Xm) and V = HΞ(X), where the Xi s are i.i.d. sample observations. Note that V is a function of Ξ and X but not a function of Θ. Theorem 5. Let T = TSW be as in Theorem 3. Denote by θ, Θ, the Riemannian gradient w.r.t. θ, Θ, respectively, and the Frobenius norm. Let B = (I J/m)V, i.e., B is obtained by subtracting from each row of V the mean of the rows. Suppose (a) supΞ(E[ B 4] + E[ ΞB 4]) < , and (b) sup x =1,Ξ E[ Bx 4] < . Then, supθ E[ θT 2] < . Proof. Since all the assumptions and assertion of Theorem 6 are invariant under any permutation of the rows of V, for ease of notation and without loss of generality, suppose the coordinates of VΘu are in increasing order. By the construction, the coordinates of BΘu are also increasing and have mean 0. Then T = (a T BΘu)2 By θT = ( ΘT, ΞT), it suffices to show supθ E( ΘT 2) < and supθ E( ΞT 2) < . Published as a conference paper at ICLR 2023 Regard the Stiefel manifold M as a subset of Rk d equipped with the inner product Γ1, Γ2 = tr(ΓT 1 Γ2), which is simply the Euclidean inner product of the vectorized Γ1 and Γ2. The Riemannian gradient Θ is obtained under this inner product and is equal to the orthogonal projection of Θ onto TΘM, where Θ denotes the Euclidean gradient. Then ΘT ΘT , so it is enough to show supθ E( ΘT 2) < . From (S5), ΘT = 2(a T BΘu) BΘu 2 BT a (a T BΘu)(BΘu) Then by u = 1, so by Cauchy-Schwarz inequality E( ΘT 2) 16 a 4 E( B 4)E( BΘu 4) 1/2 E( B 4) sup x =1,Ξ E( Bx 4) where the second inequality is due to the independence of u and B. Then by assumptions (b) and (c), supθ E( ΘT 2) < . Next, since Ξ lives in a Euclidean space, the Riemannian gradient of T w.r.t. Ξ consists of the partial derivatives of T w.r.t. of its coordinates. For each coordinate ξ of Ξ, ξ = 2 a T ( ξB)Θu(a T BΘu) BΘu 2 (a T BΘu)2(BΘu)T ( ξB)Θu Then by Θu = 1, | ξT| 4 a 2 ξB Taking the sum of ( ξT)2 over all the coordinates of Ξ then yields ΞT 2 16 a 4 ΞB 2 So by Cauchy-Schwarz inequality E( ΞT)2 16 a 4 E( ΞB 4)E( BΘu 4) 1/2 E( ΞB 4) sup x =1,Ξ E( Bx 4) where the second inequality is again due to the independence of u and B. Then by assumptions (b) and (c), supθ E( ΞT 2) < . A.6 PROOF OF THEOREM 6. 
The following result is a relaxation of Theorems 1 and 2 in (Bonnabel, 2013). Let (γt)t 0 = (γ0, γ1, γ2, . . .) be a sequence of step sizes. Suppose C( ) is a three times continuously differentiable cost function on a smooth connected Riemannian manifold M, (zt)t 0 is a sequence of i.i.d. random variables taking values in a measurable space Z, and H( , ) is a measurable function on Z T M such that Ez H(z, w) = C(w), where T M is the tangent bundle of M and z zt. Also suppose w0 M is independent of (zt)t 0. Theorem 6. Let M be a connected Riemannian manifold with injectivity radius uniformly bounded from below by I > 0. Let C C(3)(M) and Rw be a twice continuously differentiable retraction. Let z0, z1, . . . be i.i.d. ζ taking values in Z. Let H : Z M T M be a measurable function such that E[H(ζ, w)] = C(w) for all w M, where T M is the tangent bundle of M. Consider the SGD update wt+1 = Rwt( γt H(zt, wt)) with step size γt > 0 satisfying P γ2 t < + and P γt = + . Suppose there exists a compact set K such that all wt K and supw K E[ H(ζ, w) 2] A2 for some A > 0. Then, C(wt) converges almost surely and C(wt) 0 almost surely. Published as a conference paper at ICLR 2023 As in (Bonnabel, 2013), the proof of Theorem 6 starts with the the following result which is of interest in its own right. Proposition S1. Let (γt)t 0 and M be as in Theorem 7. Consider the update wt+1 = expwt( γt H(zt, wt)), where expw is the exponential map at w. Suppose there exists a compact set K such that wt K for all t 0. We also suppose for some A > 0, Ez( H(z, w) 2) A2 for all w K. Then, C(wt) converges a.s. and C(wt) 0 a.s. Proof of Proposition S1. The proof builds upon the one for Theorem 1 (Bonnabel, 2013). Let Ft = σ(w0, z0, . . . , zt 1). Then for t 1, wt is Ft measurable. If γt H(zt, wt < I, then expwt{s H(zt, wt)}0 s γt is the geodesic linking wt+1 and wt, so as in the proof of Theorem 1 in (Bonnabel, 2013), the Taylor formula yields C(wt+1) C(wt) γt H(zt, wt), C(wt) + γ2 t H(zt, wt) 2k2, where , is the Riemannian inner product at Twt M. The inequality has the same form as equation (5) in (Bonnabel, 2013) but with k2 = supw K0 2C , i.e., ( 2C(w))v k2 v for all w K0 and v Tw M, where K0 is the compact set of all points with distance at most I from K. Define events E 1 = , Et = { H(zt, wt) < I/γt}, t 0. (S6) Denote by 1E the indicator function of an event E. Then by C(w) 0, C(wt+1)1Et C(wt) γt H(zt, wt), C(wt) 1Et + γ2 t H(zt, wt) 2k2. (S7) Taking expectations conditional on Ft on both sides of the equality, E[C(wt+1)1Et | Ft] C(wt) γt E( H(zt, wt), C(wt) | Ft) + γt E( H(zt, wt), C(wt) 1Ec t | Ft) + γ2 t E( H(zt, wt) 2 | Ft)k2. (S8) Since zt is independent from Ft while wt is Ft-measurable, E( H(zt, wt), C(wt) | Ft) = Ez( H(z, wt), C(wt) ) = C(wt) 2, E( H(zt, wt) 2 | Ft) = Ez( H(z, wt) 2) A2, (S9) where in Ez( ), wt is treated as a fixed value. On the other hand, E( H(zt, wt), C(wt) 1Ec t | Ft) = E[H(zt, wt)1Ec t | Ft], C(wt) E[ H(zt, wt) 1Ec t | Ft]k1, where k1 = sup K C . Since Ec t = { H(zt, wt) I/γt}, by Markov inequality, E[ H(zt, wt) 1Ec t | Ft] E[ H(zt, wt) 2/(I/γt) | Ft] γt A2/I. (S10) Then from (S8), E[C(wt+1)1Et | Ft] C(wt) + γ2 t A2k γt C(wt) 2 (S11) with k = k2 + k1/I. Let Nt = C(wt)1Et 1 + A2k X s 0 and 0 < I0 I, such that d(Rw(v), expw(v)) ϱ v 2 for all w K and v Tw M with v I0. Without loss of generality, suppose ϱ > 1 and I0 = I < 1/ϱ, for otherwise we can decrease I. Define the same events Et as in (S6) and let constants k0, ..., k4 be defined as in the proof of Proposition S1. 
Let $w'_{t+1} = \exp_{w_t}(-\gamma_t H(z_t, w_t))$. Then $d(w'_{t+1}, w_{t+1})\mathbf{1}_{E_t} \le \gamma_t^2\varrho\|H(z_t, w_t)\|^2$. By assumption, $w_{t+1} \in K$. Then, on the event $E_t$, $d(w'_{t+1}, w_{t+1}) \le \varrho I^2 < I$, so $w'_{t+1} \in K_0$, where $K_0$ is defined in the proof of Proposition S1. Then
$$C(w_{t+1})\mathbf{1}_{E_t} - C(w_t) \le C(w'_{t+1})\mathbf{1}_{E_t} - C(w_t) + |C(w_{t+1}) - C(w'_{t+1})|\mathbf{1}_{E_t} \le C(w'_{t+1})\mathbf{1}_{E_t} - C(w_t) + d(w_{t+1}, w'_{t+1})\,k_1\mathbf{1}_{E_t} \le C(w'_{t+1})\mathbf{1}_{E_t} - C(w_t) + \gamma_t^2\|H(z_t, w_t)\|^2\varrho k_1.$$
Then, by (S9) and (S11),
$$E(C(w_{t+1})\mathbf{1}_{E_t} - C(w_t) \mid \mathcal{F}_t) \le \gamma_t^2 A^2 k - \gamma_t\|\nabla C(w_t)\|^2,$$
where $k = k_2 + k_1/I + \varrho k_1$. Then, following the same argument as in the proof of Proposition S1, $C(w_t)$ converges a.s. and $\sum_{t \ge 0}\gamma_t E(\|\nabla C(w_t)\|^2) < \infty$, and to show $p(w_t) := \|\nabla C(w_t)\|^2 \to 0$, it suffices to show that $p(w_t)$ converges a.s. We have
$$p(w_{t+1})\mathbf{1}_{E_t} - p(w_t) \le |p(w_{t+1}) - p(w'_{t+1})|\mathbf{1}_{E_t} + p(w'_{t+1})\mathbf{1}_{E_t} - p(w_t) \le k_4\gamma_t^2\varrho\|H(z_t, w_t)\|^2 + p(w'_{t+1})\mathbf{1}_{E_t} - p(w_t).$$
Then, by (S9) and (S13),
$$E(p(w_{t+1})\mathbf{1}_{E_t} - p(w_t)\mathbf{1}_{E_{t-1}} \mid \mathcal{F}_t) \le p(w_t)\mathbf{1}_{E_{t-1}^c} + 2k_2\gamma_t p(w_t) + (2k_4 + 2k_1k_2/I)\gamma_t^2 A^2.$$
Then, with the same argument following (S13), $p(w_t)$ converges a.s.

B GOODNESS-OF-FIT TESTS

In this section, we present several commonly used GoF tests. They are grouped into three classes: tests based on correlation (CB), tests based on the empirical distribution function (EDF), and tests based on the empirical characteristic function (ECF). The CB GoF tests were first covered in Section 3. The latter two are covered here.

Empirical Distribution Function (EDF) Tests: UVN EDF tests are based on the discrepancy between the empirical and hypothesized distribution functions (D'Agostino, 2017), encompassing two broad classes: supremum tests and quadratic tests. Kolmogorov-Smirnov (KS) is a supremum test, measuring the largest absolute distance. Two popular quadratic tests are Cramér-von Mises (CVM), which measures the integrated quadratic deviation weighted by a function $\Psi$, and Anderson-Darling (AD) (Anderson et al., 1952), which gives higher weight to distribution tails.

Empirical Characteristic Function (ECF) Tests: ECF tests are based on the weighted integral of the difference between the ECF and its pointwise limit, including Epps-Pulley (EP) for UVN (Epps & Pulley, 1983) and MVN (Baringhaus & Henze, 1988), and the Henze-Zirkler (HZ) test, a generalization of EP (Henze & Zirkler, 1990).

For consistency, $T$ and $d$ represent, respectively, the test statistic and the corresponding statistical distance for each test. We use $F_X$ to denote both the law of $X$ and its cumulative distribution function. For a random sample $X_1, X_2, \ldots, X_m$, denote its sample mean, sample variance, and empirical cumulative distribution function by $\bar{X}$, $S_m$, and $\hat{F}_{X,m}$, respectively. If the $X_i$'s are univariate, we further sort them into the order statistics $X_{(1)} \le X_{(2)} \le \cdots \le X_{(m)}$.

B.1 CB CLASS: SHAPIRO-WILK, SHAPIRO-FRANCIA

Let $X_1, \ldots, X_m$ be univariate and i.i.d. $\sim F_X$. To test $H_0: F_X \in \mathcal{G}$, where $\mathcal{G}$ is the class of normal distributions, the Shapiro-Wilk (SW) test statistic is defined as
$$T_{SW} = \frac{\big(\sum_{i=1}^m a_i X_{(i)}\big)^2}{\sum_{i=1}^m (X_i - \bar{X})^2},$$
where $a = (a_1, a_2, \ldots, a_m)^T$ is obtained via $a = \frac{M^{-1}c}{\|M^{-1}c\|}$, with $c = (c_1, \ldots, c_m)^T$ and $M$ being the mean vector and covariance matrix, respectively, of the order statistics of $m$ independent $N(0,1)$ random variables. The corresponding $L_2$-Wasserstein distance is
$$d^2_{W_2}(\hat{F}_{X,m}, \mathcal{G}) = \inf_{a,\sigma^2} d^2_{W_2}(\hat{F}_{X,m}, N(a, \sigma^2)) = S^2_m - \Big(\int_0^1 \hat{F}^{-1}_{X,m}(t)\,\Phi^{-1}(t)\,dt\Big)^2,$$
with $\Phi(\cdot)$ being the distribution function of the standard normal (del Barrio et al., 1999). Following the same notation as for the SW test, the Shapiro-Francia (SF) test of normality (Shapiro & Francia, 1972) is defined as
$$T_{SF} = \frac{\big(\sum_{i=1}^m b_i X_{(i)}\big)^2}{\sum_{i=1}^m (X_i - \bar{X})^2},$$
where $b = (b_1, b_2, \ldots, b_m)^T$ is obtained via $b = c/\|c\|$.
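As a concrete reference point for the CB class, the short sketch below (our illustration, not the paper's code base; it assumes SciPy is available) scores a single univariate sample with the Shapiro-Wilk test, which is how one projected minibatch direction would be evaluated, and adds a Kolmogorov-Smirnov check for comparison with the EDF class discussed next.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A minibatch-sized univariate sample; under H0 it comes from some N(mu, sigma^2).
x = rng.normal(loc=1.5, scale=0.7, size=128)

# Shapiro-Wilk: a statistic close to 1 and a large p-value fail to reject normality.
t_sw, p_sw = stats.shapiro(x)
print(f"SW statistic={t_sw:.4f}, p-value={p_sw:.4f}")

# Kolmogorov-Smirnov against a fitted normal (EDF class); the p-value is only
# approximate here because mu and sigma are estimated from the same data.
t_ks, p_ks = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
print(f"KS statistic={t_ks:.4f}, p-value={p_ks:.4f}")
```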
B.2 EDF CLASS: CRAMÉR-VON MISES, KOLMOGOROV-SMIRNOV

Let $X_1, \ldots, X_m$ be univariate and i.i.d. $\sim F_X$. Let $F$ be a specified univariate distribution and suppose we wish to test $H_0: F_X = F$. The Cramér-von Mises (CVM) test statistic is defined as
$$T_{CVM} = \frac{1}{12m} + \sum_{i=1}^m\Big(F(X_{(i)}) - \frac{2i-1}{2m}\Big)^2.$$
The CVM test statistic corresponds to the statistical distance
$$d_{CVM}(\hat{F}_{X,m}, F) = \int\big(\hat{F}_{X,m} - F\big)^2\,\Psi(F)\,dF,$$
where $\Psi$ is a weight function. Note that for $T_{CVM}$, $\Psi \equiv 1$. On the other hand, the Kolmogorov-Smirnov (KS) test statistic and the related statistical distance are defined, respectively, as
$$T_{KS} = \max_{1 \le i \le m}\max\Big\{\frac{i}{m} - F(X_{(i)}),\; F(X_{(i)}) - \frac{i-1}{m}\Big\}, \qquad d_{KS}(\hat{F}_{X,m}, F) = \sup\big|\hat{F}_{X,m} - F\big|.$$

B.3 ECF CLASS: HENZE-ZIRKLER

Suppose the $X_i$ are $k$-dimensional and i.i.d. $\sim P_X$. Following notation similar to Henze & Zirkler (1990), define the scaled residuals $Y_j = S_m^{-1/2}(X_j - \bar{X})$, $j = 1, \ldots, m$. Let $\phi_m(t) = \frac{1}{m}\sum_{j=1}^m \exp(i\,t^T Y_j)$ denote the empirical characteristic function of the $Y_j$. To test $H_0: F_X \in \mathcal{G}$, the Henze-Zirkler (HZ) test statistic is defined as
$$T_{HZ} = m\,\big(4\,\mathbb{I}\{S_m \text{ is singular}\} + D_{m,\beta}\,\mathbb{I}\{S_m \text{ is nonsingular}\}\big),$$
where $\beta > 0$ is a parameter,
$$D_{m,\beta} = \int\Big|\phi_m(t) - \exp\Big(-\tfrac{1}{2}\|t\|^2\Big)\Big|^2\psi_\beta(t)\,dt, \qquad \psi_\beta(t) = (2\pi\beta^2)^{-k/2}\exp\Big(-\frac{\|t\|^2}{2\beta^2}\Big).$$
After some simplification (Korkmaz et al., 2014), we have
$$D_{m,\beta} = \frac{1}{m^2}\sum_{i,j=1}^m e^{-\beta^2\|Y_i - Y_j\|^2/2} - \frac{2}{(1+\beta^2)^{k/2}\,m}\sum_{i=1}^m e^{-\beta^2\|Y_i\|^2/[2(1+\beta^2)]} + \frac{1}{(1+2\beta^2)^{k/2}}.$$
The optimal choice for $\beta$ is proposed to be $\beta = \frac{1}{\sqrt{2}}\big(\frac{m(2k+1)}{4}\big)^{1/(k+4)}$ (Henze & Zirkler, 1990). Denote the Fourier transform of a probability measure $\mu$ on $\mathbb{R}^k$ by $\hat{\mu}(t) = \int_{\mathbb{R}^k} e^{i\,t^T x}\mu(dx)$. Then
$$D_{m,\beta} = \int_{\mathbb{R}^k}\big|\hat{F}_{Y,m}(t) - \hat{G}(t)\big|^2\psi_\beta(t)\,dt,$$
where $G$ denotes the $k$-variate standard normal distribution. Thus, the HZ statistic corresponds to the following statistical distance:
$$d_{HZ}(\mu, \nu)^2 = \int_{\mathbb{R}^k}\big|\hat{\mu}(t) - \hat{\nu}(t)\big|^2\psi_\beta(t)\,dt.$$

B.4 A COMPARISON BETWEEN THE ASSOCIATED STATISTICAL DISTANCES

Proposition S2. Let $X, Y$ denote two $k$-variate random variables. Then, (a) $d_{HZ}(F_X, F_Y) \le C_1 d_{W_2}(F_X, F_Y)$; (b) $d_{KS}(F_X, F_Y) \le C_2\sqrt{d_{W_1}(F_X, F_Y)}$; (c) if $k = 1$ and the weight function $\Psi(\cdot)$ in the CVM test is bounded, then $d_{CVM}(F_X, F_Y) \le C_3 d_{KS}(F_X, F_Y)^2$. In these inequalities, the constants $C_1, C_2$ may depend on $k$, while $C_3$ may depend on $\Psi(\cdot)$.

Proof. (a) Put $\mu = F_X$ and $\nu = F_Y$. For any $\omega \in \Pi(\mu, \nu)$, by the Cauchy-Schwarz inequality and Fubini's theorem,
$$d_{HZ}(\mu, \nu)^2 = \int\Big|\int_{\mathbb{R}^k} e^{i\,t^T x}\mu(dx) - \int_{\mathbb{R}^k} e^{i\,t^T y}\nu(dy)\Big|^2\psi_\beta(t)\,dt = \int\Big|\int_{\mathbb{R}^k\times\mathbb{R}^k}\big(e^{i\,t^T x} - e^{i\,t^T y}\big)\,\omega(dx, dy)\Big|^2\psi_\beta(t)\,dt \le \int\!\!\int_{\mathbb{R}^k\times\mathbb{R}^k}\big|e^{i\,t^T x} - e^{i\,t^T y}\big|^2\,\omega(dx, dy)\,\psi_\beta(t)\,dt.$$
Using $|e^{ia} - e^{ib}|^2 \le (a - b)^2$ for $a, b \in \mathbb{R}$,
$$d_{HZ}(\mu, \nu)^2 \le \int\!\!\int|t^T x - t^T y|^2\psi_\beta(t)\,dt\,\omega(dx, dy) \le \int\!\!\int\|t\|^2\|x - y\|^2\psi_\beta(t)\,dt\,\omega(dx, dy) = \int\|t\|^2\psi_\beta(t)\,dt\,\int\!\!\int\|x - y\|^2\,\omega(dx, dy),$$
and $\int\|t\|^2\psi_\beta(t)\,dt < \infty$. Since the above inequality holds for all $\omega \in \Pi(\mu, \nu)$, $d_{HZ}(\mu, \nu) \le C\,d_{W_2}(\mu, \nu)$ with $C = \big(\int\|t\|^2\psi_\beta(t)\,dt\big)^{1/2}$. Let $C_1 = C$; then (a) follows.

(b) The connection between the Kolmogorov-Smirnov distance and the $L_1$-Wasserstein distance, in both the univariate and the multivariate case, is well studied; detailed proofs can be found in, e.g., Corollary 3.1 of Koike (2019) and Proposition 1.2 of Ross (2011).

(c) Letting $C_3 = \sup\Psi$,
$$d_{CVM}(F_X, F_Y) = \int(F_X - F_Y)^2\,\Psi(F_Y)\,dF_Y \le C_3\int|F_X - F_Y|^2\,dF_Y \le C_3\int\big(\sup|F_X - F_Y|\big)^2\,dF_Y = C_3\,d_{KS}(F_X, F_Y)^2.$$

C MODEL ARCHITECTURE

The architectures for the encoder and decoder for CelebA and MNIST are based on the WAE (Tolstikhin et al., 2017). The discriminator for the WAE-GAN followed Tolstikhin et al. (2017). The architecture for CIFAR-10 is based on Lippe (2022). When modeling a specific dataset, the specified encoder and decoder components are the same for all models. The architectures are summarized in Table S1.

Table S1: Architectures used for modeling the CelebA, MNIST, and CIFAR-10 datasets. "Conv 128x4x4" represents a convolution with 128 filters of size 4x4, FC stands for fully connected layer, ReLU for the rectified linear unit, and ConvT for transposed convolution.

CelebA
- Encoder (Adam): Input 64x64x3; Conv 128x4x4 (stride 2), 256x4x4 (stride 2), 512x4x4 (stride 2), 1024x4x4 (stride 2); FC 16384, 256; ReLU activations.
- Stiefel layer (RSGD): Input 256; Stiefel FC 64; orthogonal initialization.
- Decoder (Adam): Input 64; FC 65536; ConvT 512x4x4 (stride 2), ConvT 256x4x4 (stride 2), ConvT 128x4x4 (stride 2), ConvT 3x1x1 (stride 1); Sigmoid activation.

MNIST
- Encoder (Adam): Input 28x28x1; Conv 128x4x4 (stride 2), 256x4x4 (stride 2), 512x4x4 (stride 2), 1024x4x4 (stride 2); FC 16384, 256; ReLU activations.
- Stiefel layer (RSGD): Input 256; Stiefel FC 10; orthogonal initialization.
- Decoder (Adam): Input 10; FC 65536; ConvT 512x4x4 (stride 1), ConvT 256x4x4 (stride 1), ConvT 128x4x4 (stride 2); Sigmoid activation.

CIFAR-10
- Encoder (Adam): Input 32x32x3; Conv 64x3x3 (stride 2, pad 1), Conv 64x3x3 (stride 1, pad 1), Conv 128x3x3 (stride 2, pad 1), Conv 128x3x3 (stride 1, pad 1), Conv 128x3x3 (stride 2, pad 1); FC 2048, 256; ReLU activations.
- Stiefel layer (RSGD): Input 256; Stiefel FC 64; orthogonal initialization.
- Decoder (Adam): Input 64; FC 2048; ConvT 128x3x3 (stride 2, pad 1, output pad 1), Conv 128x3x3 (stride 1, pad 1), ConvT 64x3x3 (stride 2, pad 1, output pad 1), Conv 64x3x3 (stride 1, pad 1), ConvT 3x3x3 (stride 2, pad 1, output pad 1); Sigmoid activation.
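For concreteness, a minimal PyTorch sketch of the CelebA encoder path in Table S1 is given below. This is our illustration, not the released code: the layer widths follow the table, but the padding, the ReLU placement after the 256-unit FC layer, and the bias-free Stiefel layer are assumptions made for the example.

```python
import torch
import torch.nn as nn

class CelebAEncoder(nn.Module):
    # Follows the CelebA row of Table S1: four 4x4 stride-2 convolutions,
    # FC 16384 -> 256, then a 256 -> 64 "Stiefel" linear layer.
    def __init__(self, code_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, 512, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(512, 1024, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(1024 * 4 * 4, 256)          # 16384 -> 256
        self.stiefel = nn.Linear(256, code_dim, bias=False)
        nn.init.orthogonal_(self.stiefel.weight)         # orthogonal initialization, as in Table S1

    def forward(self, x):
        h = self.features(x).flatten(1)
        h = torch.relu(self.fc(h))
        return self.stiefel(h)                           # 64-dimensional codes

z = CelebAEncoder()(torch.randn(8, 3, 64, 64))
print(z.shape)  # torch.Size([8, 64])
```

During training, the weight of the Stiefel layer is the matrix Θ that Riemannian SGD keeps on the Stiefel manifold (see Section D and the sketch after Section E.1.3).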
D TRAINING DETAILS

D.1 TEST STATISTIC PROJECTIONS

Once a batch has been encoded, it is projected from 64D to 1D in order to be tested. Instead of randomly sampling directions from the unit sphere to implement the projections, we sample an orthonormal basis, project down each direction, and calculate the test statistic along each. This is used in two ways: 1) selecting the most pessimistic direction to optimize, or 2) optimizing the average of the statistics over directions. For example, the SW test statistic value should be large to fail to reject normality, so selecting the direction associated with the smallest statistic value can be used as a new statistic to optimize. Similarly, for tests that fail to reject for small values, selecting the direction associated with the largest value can be used. We refer to both scenarios as selecting the most pessimistic direction. Alternatively, the mean of all the statistics can be used as a new statistic.

D.2 EMPIRICAL DISTRIBUTIONS

The distribution of a test statistic must be known analytically or estimable in order to compute p-values. There are a few options for calculating the empirical distribution of a test statistic when using GoF tests with projections. The first possibility is to randomly sample a single direction from the unit sphere, project the multivariate data, and then use the corresponding distribution to determine the p-value. However, a single projection may not be particularly informative, and early testing indicated that this method led to slower convergence. Another option is to project along multiple directions and calculate the statistic along each of them. It is then possible to create a new statistic from these, for example the average, minimum, or maximum. Unfortunately, this method precludes using the original test statistic distribution for calculating p-values. To remedy this, we create an empirical distribution of this new statistic by repeatedly sampling; an empirical p-value is then produced from the test statistic samples.
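The following sketch illustrates the projection scheme and the empirical null just described. It is our own illustration, not the paper's implementation: it uses SciPy's Shapiro-Wilk test as the univariate statistic, takes the minimum over basis directions as the "most pessimistic" statistic, and builds the empirical null by Monte Carlo simulation under i.i.d. Gaussian codes.

```python
import numpy as np
from scipy import stats

def random_orthonormal_basis(d, rng):
    # QR decomposition of a Gaussian matrix yields a random orthonormal basis.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def min_sw_statistic(codes, basis):
    # Project the encoded minibatch onto each basis direction and apply SW;
    # the smallest statistic is the most pessimistic direction for SW.
    stats_per_dir = [stats.shapiro(codes @ basis[:, j])[0]
                     for j in range(basis.shape[1])]
    return min(stats_per_dir)

rng = np.random.default_rng(0)
m, d = 128, 64                        # minibatch size and code dimension
codes = rng.standard_normal((m, d))   # stand-in for an encoded minibatch

t_obs = min_sw_statistic(codes, random_orthonormal_basis(d, rng))

# Empirical null of the min-statistic: repeat under H0 (i.i.d. Gaussian codes).
null = np.array([min_sw_statistic(rng.standard_normal((m, d)),
                                  random_orthonormal_basis(d, rng))
                 for _ in range(200)])
# SW rejects for small values, so the empirical p-value is a left-tail proportion.
p_emp = (1 + np.sum(null <= t_obs)) / (1 + len(null))
print(f"min SW statistic={t_obs:.4f}, empirical p-value={p_emp:.3f}")
```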
D.3 MUTUAL INFORMATION

The mutual information is estimated after the model has been trained by sampling from $N_{64}(\hat{\mu}, \hat{\Sigma})$, where $(\hat{\mu}, \hat{\Sigma})$ are estimated either during or after training; GoFAE tracks these statistics during training. These samples are decoded and then re-encoded. Assuming the encoded data is Gaussian, the mutual information may be calculated as
$$I(Z; \tilde{Z}) = \frac{1}{2}\log\frac{|\Sigma_Z|\,|\Sigma_{\tilde{Z}}|}{|\Sigma_{Z,\tilde{Z}}|},$$
where $\Sigma_{Z,\tilde{Z}}$ denotes the covariance matrix of the joint vector $(Z, \tilde{Z})$.

D.4 FURTHER DETAILS ON MODEL SELECTION

We selected λ by grid search using a training and a validation set and an array of λ values. Each model was evaluated for p-value uniformity using Algorithm 1. Note that Algorithm 1 takes multiple minibatch (local) GoF p-values and produces a single Kolmogorov-Smirnov test statistic and p-value pair for uniformity (a single blue dot in Figs. 5b-5c). Since univariate GoF tests require projection, there are two sources of randomness that come into play when producing GoF p-values from a data set: 1) shuffling of minibatches (line 2 in Algorithm 1), and 2) random projections. Instead of selecting λ based on a single KS uniformity p-value, Algorithm 1 was run 30 times. The average of these 30 uniformity p-values was computed and compared against a pre-specified threshold, which is 0.05 by default. If the average is larger than the threshold, then λ is a possible candidate. There is a region in which a variety of λ values satisfy the threshold condition of 0.05 (Figs. 5b-5c). Since the loss function (Equation 2) should also achieve small reconstruction error, the smallest λ for which the corresponding average of 30 KS p-values is larger than 0.05 is chosen as the hyperparameter and used in the final model. Pseudo-code for the GoFAE pipeline is given in Algorithm S1.

Algorithm S1 GoFAE Pipeline: Training Locally and Assessing with Higher Criticism
Require: Train/Val/Test Set, array of λ values, repeats = R, threshold
1: for each λ in the array do
2:     Use Algorithm 2 on the Training Set
3:     for r = 1:R do
4:         Use Algorithm 1 on the Training/Validation Set
5:         Store the KS_unif p-value
6:     Compute the average of the R KS_unif p-values and store it
7: Select the smallest λ for which the average KS_unif p-value > threshold

E ADDITIONAL EXPERIMENTS

E.1 DETAILS ON EXPERIMENTS

We trained GoFAE and competing methods on MNIST (LeCun et al., 1998) and CelebA (Liu et al., 2015). For fair comparisons, we adopted a common architecture for all methods for each dataset, as described in Section C.

E.1.1 MNIST

The MNIST dataset was re-scaled to the unit interval. Training proceeded for 50 epochs using minibatches of size 128. We specified a 10-dimensional code layer. Competitor models include the β-VAE (Higgins et al., 2017), VAE (Kingma & Welling, 2013), WAE-GAN (Tolstikhin et al., 2017), VAE with learned γ (Dai & Wipf, 2019), 2-Stage VAE (Dai & Wipf, 2019), and AE (Bengio et al., 2013). Both the VAE and β-VAE had an initial learning rate of 1e-4, while the VAE with learnable γ used 1e-3. For our models, the Adam optimizer (Kingma & Ba, 2014) was used for the encoder (learning rate 3e-3, β = (0.5, 0.999)) and the decoder (learning rate 3e-3, β = (0.5, 0.999)). Riemannian SGD with learning rate 5e-3 was used to constrain Θ to the Stiefel manifold, using a one-cycle learning rate scheduler with a maximum learning rate of 1e-3. Singular value decomposition was used for retracting Θ back to M after a parameter update.

E.1.2 CELEBA

The CelebA dataset was pre-processed following the same procedure as in Tolstikhin et al. (2017). First, a 140x140 center crop is taken and the image is resized to 64x64 resolution.
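This preprocessing can be expressed with standard torchvision transforms. The sketch below is ours, assuming images are loaded as PIL images and that pixels are additionally scaled to [0, 1], consistent with the sigmoid output of the decoder.

```python
from torchvision import transforms

# 140x140 center crop, resize to 64x64, and scale pixels to [0, 1].
celeba_transform = transforms.Compose([
    transforms.CenterCrop(140),
    transforms.Resize(64),
    transforms.ToTensor(),
])

# e.g. dataset = torchvision.datasets.CelebA("data/", transform=celeba_transform)
```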
For the VAE, the Adam optimizer was used with a learning rate of 1e-4, β1 = 0.5, and β2 = 0.999. The β-VAE used the Adam optimizer with a learning rate of 1e-4, β1 = 0.5, and β2 = 0.999. The VAE with learnable γ used the Adam optimizer with a learning rate of 1e-3, β1 = 0.5, and β2 = 0.999. The WAE-GAN also used the Adam optimizer, where the encoder and decoder parameters had an initial learning rate of 3e-4 and the discriminator was set to 1e-3, with β1 = 0.5 and β2 = 0.999. For the two-stage VAE, we used the trained VAE with learned γ as stage one. A second VAE was trained using the µ, σ from the first stage as inputs, also with a trainable γ. Its architecture was a three-layer dense net with ReLU activations, and the decoder was the inverse, with the last layer being linear.

Our method has two components for the encoder, as visualized in Figure 3a in the main paper. The first component of the encoder used the Adam optimizer with the learning rate set to 3e-3, β1 = 0.5, and β2 = 0.999. The decoder also used Adam, with the same learning rate and hyperparameters. The dense layer of the encoding process is parameterized by Θ, an element of the Stiefel manifold M, and is trained with Riemannian SGD with a learning rate of 5e-3 while making use of the one-cycle learning rate scheduler with a maximum learning rate of 1e-3. As our methods require projection, a 64x64 orthonormal basis is sampled and used to project the encoded data.

E.1.3 CIFAR-10

CIFAR-10 is the final dataset and consists of 60 thousand small images. The training set contains 50 thousand images, with the remainder going to the test set. After setting a seed, the training set was split into a smaller training set of 45 thousand images and a validation set with the rest. The architecture used follows the Model Architecture section (Section C). A latent dimension of size 64 is used. Models are evaluated on average reconstruction quality: the mean-squared error of randomly selected batches is computed over a single pass through the test set and then averaged. This is repeated 30 times, with the mean and standard deviation reported. FID scores for reconstructed and generated data are also computed.

The Adam optimizer was used for the encoder and decoder of all methods. The learning rate was set at 3e-4, with β1 = 0.5 and β2 = 0.999. The discriminator for the WAE-GAN is the same as in the CelebA setup. All models are trained for 200 epochs and reduce the learning rate when hitting a plateau (ReduceLROnPlateau in PyTorch); for this, the minimum learning rate was set to 5e-5, with a patience of 10 and a factor of 0.2. For the two-stage VAE, the encoder consisted of 2 dense layers of 32 nodes each with ReLU activation and a 64-dimensional code layer, and the decoder contained 3 dense layers with ReLU activation except for the final layer, which was linear. The architecture was explored using the encoded validation set coming from stage 1 (the VAE with learnable γ). This second-stage VAE was trained for 50 epochs, with the KL term converging to zero quickly. Uniformity for the 2-Stage VAE was assessed by sampling from the test set, which was encoded after stage 1 training was complete. Similar to the models for MNIST and CelebA, the dense layer encoding for the GoFAE models, Θ, is trained with Riemannian SGD. The learning rate is 5e-4 and uses a one-cycle learning rate scheduler with a maximum learning rate of 1e-4.
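A single Riemannian SGD update of Θ can be sketched as follows. This is our illustration of one possible implementation rather than the authors' code: the Euclidean gradient is projected onto the tangent space of the Stiefel manifold and the updated matrix is retracted back onto the manifold with an SVD, matching the retraction mentioned in Section E.1.1; the shapes and the column-orthonormal orientation are chosen only for the example.

```python
import torch

def stiefel_tangent_projection(theta, euclid_grad):
    # Project a Euclidean gradient onto the tangent space of the Stiefel
    # manifold {theta : theta^T theta = I} at theta.
    a = theta.T @ euclid_grad
    return euclid_grad - theta @ (a + a.T) / 2

def svd_retraction(theta):
    # Retract onto the Stiefel manifold: replace theta by U V^T from its thin SVD,
    # i.e., the nearest matrix with orthonormal columns.
    u, _, vh = torch.linalg.svd(theta, full_matrices=False)
    return u @ vh

def riemannian_sgd_step(theta, euclid_grad, lr):
    riem_grad = stiefel_tangent_projection(theta, euclid_grad)
    return svd_retraction(theta - lr * riem_grad)

# Example: a 256 x 64 parameter with orthonormal columns (a point on the Stiefel manifold).
theta = torch.nn.init.orthogonal_(torch.empty(256, 64))
grad = torch.randn(256, 64)            # stand-in for the backpropagated gradient of the loss
theta = riemannian_sgd_step(theta, grad, lr=5e-4)
print(torch.allclose(theta.T @ theta, torch.eye(64), atol=1e-5))  # remains (numerically) orthonormal
```

The ablation in Section E.5 corresponds to skipping both the tangent projection and the retraction, i.e., applying the plain gradient step directly to Θ.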
Each encoded batch is projected onto a randomly sampled 64x64 orthonormal basis. The test statistic T is applied to each direction and the final statistic (min, average, or max) is computed.

In the following sections, we visually assess the quality of GoFAE and competing models. For the GoFAE plots, a "good" regularization coefficient λ > 0 indicates that the model should (i) fail to reject H0 at an α-level chosen a priori (close to normal, but not overly so), (ii) accurately reconstruct the input, and (iii) generate samples that are both qualitatively and quantitatively convincing.

In this section, we present the performance of the competing and GoFAE methods in Table S2. Reconstruction error, reconstruction FID, generation FID, and the mean (std) of the KS_unif p-values are reported. Additionally, we examine the effects of λ with the GoFAE-SW model in Figure S1. The GoF test pushes the transformation to become increasingly singular, in the sense that the condition number of the covariance matrix becomes increasingly large (Figure S1). As λ gets larger, the average KS uniformity p-value also increases, indicating that P_Y appears to be indistinguishable from the Gaussian class. However, as λ continues to increase, the condition number of the estimated covariance matrix also increases. Yet, for a small interval of λ, the mutual information has not completely diminished and the KS uniformity suggests P_Y is still indistinguishable from Gaussian. This visualization provides insight into how the class of Gaussians allows the model to adapt as needed during training.

Table S2: Evaluation of MNIST by MSE, FID scores (Heusel et al., 2017), and samples with p-values from higher criticism. Each row reports MSE; FID (Gen.); FID (Recon.); and the Kolmogorov-Smirnov uniformity test, mean (std), under SW, SF, CVM, KS, and EP.
AE-Baseline: 7.66; 62.29; 7.39; 0.0 (0.0); 0.0 (0.0); 0.0 (0.0); 0.0 (0.0); 0.0 (0.0)
VAE: 13.99; 16.23; 8.33; 0.18 (0.18); 0.42 (0.30); 0.07 (0.12); 0.14 (0.18); 0.03 (0.06)
β(2)-VAE: 20.98; 20.54; 11.52; 0.38 (0.31); 0.29 (0.26); 0.41 (0.29); 0.45 (0.31); 0.24 (0.25)
C-β-VAE: 20.94; 19.21; 11.38; 0.48 (0.30); 0.37 (0.28); 0.29 (0.26); 0.42 (0.31); 0.19 (0.19)
WAE-GAN: 9.49; 15.72; 7.28; 0.07 (0.12); 0.49 (0.29); 0.09 (0.15); 0.11 (0.12); 0.06 (0.12)
GoFAE-SW: 7.16; 16.58; 7.57; 0.11 (0.16); 0.48 (0.23); 0.06 (0.12); 0.17 (0.24); 0.03 (0.05)
GoFAE-SF: 7.08; 18.35; 7.47; 0.0 (0.0); 0.26 (0.27); 0.0 (0.0); 0.0 (0.0); 0.0 (0.0)
GoFAE-CVM: 7.67; 15.56; 7.71; 0.14 (0.18); 0.46 (0.28); 0.14 (0.22); 0.12 (0.21); 0.05 (0.15)
GoFAE-EP: 7.54; 15.85; 7.84; 0.25 (0.27); 0.44 (0.28); 0.12 (0.23); 0.17 (0.20); 0.05 (0.08)

Figure S1: Effects of λ. We evaluated the uniformity KS test p-values (blue dots), average KS p-values (blue line), mutual information (red line), log condition number (green line), and reconstruction error (black line) under different λ for GoFAE-SW.

E.3 ADDITIONAL PLOTS

In this section, we include additional experimental results for GoFAE and the comparison models.

E.3.1 CELEBA

This section contains images produced by GoFAE and competing GAEs on the CelebA dataset with a latent space dimension of 64. Figure S2 depicts reconstructed faces. In Table 1 in the main paper, we evaluated reconstruction quality with mean-squared error and the Fréchet Inception Distance (FID). The visual quality of the images produced by the GoFAE models is consistent with the low MSE and reconstruction FID observed in Table 1. Figure S3 contains faces generated from the models.
The GoFAE models again produced competitive FID scores, with GoFAE-SF producing the lowest (best) of the group. Tables S3 and S4 illustrate the effects of λ on the KS uniformity test, the corresponding p-value, the reconstruction error, and the test statistics for GoFAE-SW, GoFAE-SF, GoFAE-CVM, and GoFAE-KS, respectively.

There are several things to note from the tables of figures demonstrating the effects of λ on the training of GoFAE architectures (Tables S3 and S4). The first row shows the mutual information and GoF p-value distribution computed from the test set as a function of log(λ). As λ increases, the models produce P_Y that appears increasingly indistinguishable from normal. However, beyond a certain λ, the encoding no longer appears normal; GoFAE-SW clearly depicts this occurring in Table S3, row (a), left panel. Row (b) tracks the average p-value produced on the test set while training with a particular GoF test statistic. Larger λ can easily be seen to focus more on normality, as the p-values tend to start at high levels and stay there. The reconstruction error in row (c) converges quickly. Of particular interest in the log(λ) vs. KS p-value plots is the behavior of the average KS p-value: after the average KS uniformity peaks, the mutual information falls rapidly. This clearly illustrates the relationship posited in the experiments section, where the model prioritizes posterior matching. Similar plots for MNIST can be seen in Section E.3.2, with the quantitative information regarding reconstruction error, reconstruction FID, and generation FID in Table S2. Once again, the GoFAE models produce overall lower MSE than the competitor models while maintaining competitive generation FID scores.

Figure S2: Comparison of reconstruction between competing methods: (b) AE, (c) VAE (fixed γ), (d) VAE (learned γ), (e) β-VAE (β = 10), (f) WAE-GAN, and our framework: (g) GoFAE-SW, (h) GoFAE-SF, (i) GoFAE-CVM, (j) GoFAE-KS, with CelebA; panel (a) shows the ground truth.

Figure S3: Comparison of generated samples between competing methods: (b) AE, (c) VAE (fixed γ), (d) VAE (learned γ), (e) β-VAE (β = 10), (f) WAE-GAN, (g) 2-Stage VAE, and our framework: (h) GoFAE-SW, (i) GoFAE-SF, (j) GoFAE-CVM, (k) GoFAE-KS, with CelebA; panel (a) shows the ground truth.

Figure S4: Convergence of reconstruction error (a) and T_SW (b) for multiple λ.

Table S3: The effects of λ with GoFAE-SW (left) and GoFAE-SF (right) for CelebA. From top to bottom, we have: (a) the normality of GoFAE minibatch encodings using 30 repetitions of Algorithm 1 for different λ; (b) the p-value changing from epochs 1-50 under varying λ; (c) reconstruction error on the testing set with different λ; (d) test statistic values on the testing set with different λ.

Table S4: The effects of λ with GoFAE-CVM (left) and GoFAE-KS (right) for CelebA. From top to bottom, we have: (a) the normality of GoFAE minibatch encodings using 30 repetitions of Algorithm 1 for different λ;
(b) the p-value changing from epochs 1-50 under varying λ; (c) reconstruction error on the testing set with different λ; (d) test statistic values on the testing set with different λ.

E.3.2 MNIST

This section includes additional reconstructed and generated samples in Figure S5 and Figure S6. Tables S5 and S6 demonstrate the effects of λ on the KS uniformity test, the corresponding p-value, the reconstruction error, and the test statistics for GoFAE-SW, GoFAE-SF, GoFAE-CVM, and GoFAE-EP, respectively. The experiments on MNIST reach the same conclusion as on CelebA: our GoFAE methods are competitive in both reconstruction and generation.

Figure S5: Comparison of reconstruction between competing methods: (b) AE, (c) VAE, (d) β-VAE (β = 2), (e) C-β-VAE, (f) WAE-GAN, and our framework: (g) GoFAE-SW, (h) GoFAE-SF, (i) GoFAE-EP, (j) GoFAE-CVM, with MNIST; panel (a) shows the ground truth.

Figure S6: Comparison of generated samples between competing methods: (b) AE, (c) VAE, (d) β-VAE (β = 2), (e) C-β-VAE, (f) WAE-GAN, and our framework: (g) GoFAE-SW, (h) GoFAE-SF, (i) GoFAE-EP, (j) GoFAE-CVM, with MNIST; panel (a) shows the ground truth.

Table S5: The effects of λ with GoFAE-SW (left) and GoFAE-SF (right) for MNIST. From top to bottom, we have: (a) the p-value changing from epochs 1-50 under varying λ; (b) reconstruction error on the testing set with different λ; (c) test statistic values on the testing set with different λ.

Table S6: The effects of λ with GoFAE-CVM (left) and GoFAE-EP (right) for MNIST. From top to bottom, we have: (a) the p-value changing from epochs 1-50 under varying λ; (b) reconstruction error on the testing set with different λ; (c) test statistic values on the testing set with different λ.

E.3.3 CIFAR-10

This section includes MSE, FID scores, and p-values from the higher criticism evaluation in Table S7, and reconstructed and generated samples in Figure S7 and Figure S8. Tables S8 and S9 demonstrate the effects of λ on the KS uniformity test, the corresponding p-value, the reconstruction error, and the test statistics for GoFAE-SW, GoFAE-SF, GoFAE-CVM, GoFAE-KS, and GoFAE-EP, respectively. The 2-Stage VAE uses the VAE with learned γ for reconstruction. The experiments on CIFAR-10 reach the same conclusion as on CelebA and MNIST: our GoFAE methods are competitive in both reconstruction and generation.

Table S7: Evaluation of CIFAR-10 by MSE, FID scores, and samples with p-values from higher criticism.
MSE and FID scores (MSE; FID Recon.; FID Gen.):
VAE (fixed γ): 51.07 (0.04); 179.75; 188.55
VAE (learned γ): 24.10 (0.01); 97.72; 119.86
2-Stage-VAE: 24.10 (0.01); 97.72; 117.60
β(2)-VAE: 65.81 (0.07); 252.98; 262.55
WAE-GAN: 25.53 (0.00); 109.55; 120.30
GoFAE-SW: 24.57 (0.08); 96.86; 115.54
GoFAE-SF: 24.69 (0.06); 97.71; 115.76
GoFAE-CVM: 24.40 (0.06); 99.76; 119.51
GoFAE-KS: 24.47 (0.05); 102.47; 121.97
GoFAE-EP: 24.74 (0.06); 106.52; 124.79

Kolmogorov-Smirnov uniformity test, mean (std), under SW, SF, CVM, KS, and EP:
VAE (fixed γ): 0.29 (0.28); 0.28 (0.24); 0.58 (0.29); 0.54 (0.30); 0.48 (0.28)
VAE (learned γ): 0.00 (0.00); 0.00 (0.00); 0.00 (0.00); 0.01 (0.03); 0.00 (0.00)
2-Stage-VAE: 0.44 (0.27); 0.45 (0.28); 0.48 (0.30); 0.49 (0.26); 0.47 (0.28)
β(2)-VAE: 0.43 (0.25); 0.29 (0.26); 0.52 (0.33); 0.46 (0.33); 0.48 (0.31)
WAE-GAN: 0.06 (0.18); 0.01 (0.03); 0.41 (0.32); 0.37 (0.32); 0.32 (0.28)
GoFAE-SW(30): 0.47 (0.32); 0.51 (0.28); 0.37 (0.29); 0.44 (0.32); 0.23 (0.22)
GoFAE-SF(12): 0.00 (0.01); 0.10 (0.16); 0.01 (0.05); 0.08 (0.11); 0.01 (0.01)
GoFAE-CVM(5): 0.03 (0.08); 0.00 (0.01); 0.19 (0.21); 0.31 (0.31); 0.12 (0.23)
GoFAE-KS: 0.00 (0.00); 0.00 (0.00); 0.02 (0.04); 0.10 (0.18); 0.00 (0.00)
GoFAE-EP(5): 0.43 (0.25); 0.36 (0.26); 0.4 (0.25); 0.53 (0.26); 0.47 (0.27)

Figure S7: Comparison of reconstructed images between competing methods: (b) VAE (fixed γ), (c) VAE (learned γ), (d) β-VAE (β = 2), (e) WAE-GAN, and our framework: (f) GoFAE-SW, (g) GoFAE-SF, (h) GoFAE-CVM, (i) GoFAE-KS, (j) GoFAE-EP, with CIFAR-10; panel (a) shows the ground truth.

Figure S8: Comparison of generated samples between competing methods: (b) VAE (fixed γ), (c) VAE (learned γ), (d) 2-Stage-VAE, (e) β-VAE (β = 2), (f) WAE-GAN, and our framework: (g) GoFAE-SW, (h) GoFAE-SF, (i) GoFAE-CVM, (j) GoFAE-KS, (k) GoFAE-EP, with CIFAR-10; panel (a) shows the ground truth.

Table S8: The effects of λ with GoFAE-SW (left) and GoFAE-SF (right) for CIFAR-10. From top to bottom, we have: (a) the p-value changing from epochs 1-50 under varying λ; (b) reconstruction error on the testing set with different λ; (c) test statistic values on the testing set with different λ.

Table S9: The effects of λ with GoFAE-CVM (left), GoFAE-KS (middle), and GoFAE-EP (right) for CIFAR-10. From top to bottom, we have: (a) the p-value changing from epochs 1-50 under varying λ; (b) reconstruction error on the testing set with different λ; (c) test statistic values on the testing set with different λ.

E.4 CONVERGENCE ANALYSIS FOR COMPETING METHODS

In this section, we include the test set reconstruction error over training epochs for the competing methods on the MNIST, CelebA, and CIFAR-10 data. Since we adopted a common autoencoder architecture (Section C), the reconstruction error follows a similar trend across all models (Figure S9). Since the 2-Stage VAE shares the VAE with learned γ as its first stage, we assessed its convergence in the same manner as the VAE with learned γ.

Figure S9: Reconstruction error over training epochs for competing methods on (a) MNIST, (b) CelebA, and (c) CIFAR-10.
E.5 ABLATION STUDY

We proposed Riemannian SGD as a solution to several problems associated with optimizing GoF test statistics. Theoretically, this approach is necessary to prove convergence; however, it remains to be seen whether there is any impact on model performance when the constraints are removed. To explore this, we took several GoFAE models and trained them without RSGD on MNIST and CIFAR-10. We used traditional SGD in place of Riemannian SGD, so Θ is no longer retracted to the Stiefel manifold.

GoFAE models with SW, SF, CVM, and EP were retrained using the same architecture and hyperparameter configuration on MNIST. In place of the orthogonal initialization used for Θ, Kaiming initialization (He et al., 2015) was used. Table S10 summarizes the results, with the best performing model in bold for each evaluation measure. GoFAE models with SW, SF, CVM, KS, and EP were also retrained on CIFAR-10. All architectures and hyperparameters remained the same as for their Riemannian SGD counterparts, except that Θ was initialized following Kaiming initialization. Table S11 summarizes the results, with the best performing model in bold for each evaluation measure. These results show empirical support for using Riemannian SGD, particularly for generation. The results for reconstruction performance are less clear and would benefit from further experimentation.

Table S10: Comparison of Riemannian SGD and standard SGD for MNIST (MSE / Gen FID / Recon FID).
SW: Riemannian SGD 7.16 / 16.58 / 7.57; Standard SGD 7.31 / 19.47 / 7.71
SF: Riemannian SGD 7.08 / 18.35 / 7.47; Standard SGD 7.12 / 20.36 / 7.36
CVM: Riemannian SGD 7.67 / 15.56 / 7.71; Standard SGD 7.79 / 15.12 / 7.60
EP: Riemannian SGD 7.54 / 15.85 / 7.84; Standard SGD 7.54 / 16.59 / 7.63

Table S11: Comparison of Riemannian SGD and standard SGD for CIFAR-10 (MSE / Gen FID / Recon FID).
SW: Riemannian SGD 24.58 / 117.05 / 100.11; Standard SGD 23.94 / 119.48 / 101.27
SF: Riemannian SGD 24.69 / 115.76 / 97.71; Standard SGD 23.92 / 118.05 / 101.37
CVM: Riemannian SGD 24.40 / 119.51 / 99.76; Standard SGD 23.68 / 120.48 / 100.22
KS: Riemannian SGD 24.47 / 121.97 / 102.47; Standard SGD 24.03 / 122.23 / 103.22
EP: Riemannian SGD 24.74 / 124.79 / 106.52; Standard SGD 24.05 / 124.88 / 105.19
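For reference, the two initializations compared in this ablation can be set up in PyTorch as follows. This is a small sketch of our own; the bias-free layer is an assumption made for the example.

```python
import torch
import torch.nn as nn

theta_layer = nn.Linear(256, 64, bias=False)

# GoFAE default: orthogonal initialization of the Stiefel-layer weight.
nn.init.orthogonal_(theta_layer.weight)

# Ablation (standard SGD, no retraction): Kaiming initialization instead.
nn.init.kaiming_normal_(theta_layer.weight, nonlinearity="relu")
```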