# geometrically_regularized_autoencoders_for_noneuclidean_data__88e31782.pdf Published as a conference paper at ICLR 2023 GEOMETRICALLY REGULARIZED AUTOENCODERS FOR NON-EUCLIDEAN DATA Cheongjae Jang1, Yonghyeon Lee2,3, Yung-Kyun Noh1,3, Frank Chongwoo Park2,4 1Hanyang University, 2Seoul National University, 3Korea Institute for Advanced Study, 4Saige Research cjjang@hanyang.ac.kr, ylee@kias.re.kr, nohyung@hanyang.ac.kr, fcp@snu.ac.kr Regularization is almost de rigueur when designing autoencoders that are sparse and robust to noise. Given the recent surge of interest in machine learning problems involving non-Euclidean data, in this paper we address the regularization of autoencoders on curved spaces. We show that by ignoring the underlying geometry of the data and applying standard vector space regularization techniques, autoencoder performance can be severely degraded, or worse, training can fail to converge. Assuming that both the data space and latent space can be modeled as Riemannian manifolds, we show how to construct regularization terms in a coordinate-invariant way, and develop geometric generalizations of the denoising autoencoder and reconstruction contractive autoencoder such that the essential properties that enable the estimation of the derivative of the log-probability density are preserved. Drawing upon various non-Euclidean data sets, we show that our geometric autoencoder regularization techniques can have important performance advantages over vector-spaced methods while avoiding other breakdowns that can result from failing to account for the underlying geometry. 1 INTRODUCTION Regularization is almost de rigueur when designing autoencoders that are sparse and robust to noise. With appropriate regularization, autoencoders enable representations useful for downstream applications (Bengio et al., 2013), generate plausible data samples (Kingma & Welling, 2013; Rezende et al., 2014), or even obtain information on the data-generating probability density (Vincent et al., 2010; Rifai et al., 2011b). Existing work on autoencoder regularization has mostly been confined to vector spaces, i.e., the data are assumed to be drawn from a vector space. On the other hand, a significant and growing number of problems in machine learning involve data that is non-Euclidean (in some past cases the fact that the data was non-Euclidean was not recognized or ignored). Bronstein et al. (2017) reviews several deep neural network architectures and modeling principles to explicitly deal with data defined on non-Euclidean domains, e.g., data collected from sensor networks, social networks in computational social sciences, or two-dimensional meshes embedded in the three-dimensional space. Other works have also addressed manifold-valued data including human mass and shape data (Kendall, 1984; Freifeld & Black, 2012), directional data (Mardia, 2014), point cloud data (Lee et al., 2022), and MRI imaging data (Fletcher & Joshi, 2007; Banerjee et al., 2015), with several deep neural networks proposed to handle such data in a coordinate-invariant way (Huang & Van Gool, 2017; Chakraborty et al., 2020). The fundamental idea behind these works is that the geometrical structure of the curved space from which the non-Euclidean data are drawn needs to be accounted for properly, so that the output of any deep learning network applied to such input data should not depend on the particular choice of coordinates used to parametrize the data. 
Ignoring the underlying geometry of the data and simply applying standard vector space techniques can severely degrade performance, or worse, cause training to fail. Autoencoder training and its regularization are no exception.

Figure 1: Autoencoder training on spherical data sampled from the von Mises-Fisher (vMF) distribution. (a) Training data; (b) RCAE; (c) GRCAE. We train the reconstruction contractive autoencoder (RCAE) and the geometric RCAE (GRCAE). For (b)-(c), we plot the reconstruction directions of the autoencoders trained using representations obtained from different coordinate choices. The results from each coordinate choice are color-coded along with the corresponding spherical coordinate origins. (See Appendix E for more details.)

For example, consider autoencoder training on a set of data points on a sphere as shown in Figure 1. When spherical coordinate representations are used as inputs to train an autoencoder with contractive regularization, the trained reconstruction function can depend heavily on the choice of coordinates. Moreover, it often fails to learn the correct contractive directions toward data-dense regions, especially near the singularity (i.e., the spherical coordinate origins). In contrast, an autoencoder that properly reflects the spherical constraints recovers those directions successfully and yields results that are nearly invariant to the choice of coordinates.

In this paper we address the regularization of autoencoders on curved spaces. Any loss function used to train or regularize the autoencoder should be formulated in a coordinate-invariant way, i.e., invariant to the choice of local coordinates used to parametrize the data, and instead depend only on the intrinsic properties of the curved space such as the curvature or the choice of metric. Assuming that both the data space and latent space can be modeled as Riemannian manifolds, we show how to construct regularization terms and objective functions in a coordinate-invariant way. We also develop geometric generalizations of the denoising autoencoder (DAE) and the reconstruction contractive autoencoder (RCAE) such that the essential properties that enable the estimation of the score, i.e., the derivative of the log of the data-generating density, are preserved. We provide several applications that use this property, such as sampling, clustering, and filtering for non-Euclidean data, and also show that the proposed autoencoders can obtain useful representations for non-Euclidean data, especially when the data are noisy. Drawing upon various non-Euclidean data sets, we show that our geometric autoencoder regularization techniques can have important performance advantages over vector-space methods, in some cases by significant margins, while avoiding other breakdowns that can result from failing to account for the underlying geometry.

The paper is organized as follows. We describe regularized autoencoders for Euclidean data in Section 2 and propose their coordinate-invariant generalizations to non-Euclidean data in Section 3. We then provide autoencoder training case studies using non-Euclidean data sets in Section 4.

2 REGULARIZED AUTOENCODERS FOR EUCLIDEAN DATA

Mathematically, an autoencoder can be represented as the composition of two mappings f : R^D → R^d (the encoder) and g : R^d → R^D (the decoder), i.e., r = g ∘ f : R^D → R^D, with R^d the space of hidden variables. Assume there exists a data-generating probability density ρ : R^D → R from which data points in R^D are drawn.
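In code, this composition takes a particularly simple form. The following is a minimal PyTorch sketch of the encoder/decoder pair; the layer widths, depth, and activation are arbitrary placeholder choices rather than the architectures used in the experiments.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal r = g o f with encoder f: R^D -> R^d and decoder g: R^d -> R^D."""
    def __init__(self, D: int, d: int, hidden: int = 256):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(D, hidden), nn.Tanh(), nn.Linear(hidden, d))
        self.g = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, D))

    def forward(self, x):            # x: (batch, D)
        return self.g(self.f(x))     # r(x) = g(f(x)), shape (batch, D)
```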
Autoencoder training in a vector space can then be formulated as minimizing the reconstruction error R RD x g(f(x; θ1); θ2) 2 ρ(x) dx over θ = (θ1, θ2), where x RD denotes an input variable, θ1 and θ2 are respectively the parameter sets of the maps f and g, and 2 is the squared Euclidean norm. In the typical case where d < D, the autoencoder engages in a type of dimensionality reduction. By disregarding the assumption that d < D, autoencoders have been modeled in the form of deep artificial neural networks accompanied by certain regularization terms to learn useful representations of the data (Bengio et al., 2007; Vincent et al., 2008; 2010; Ranzato et al., 2007; 2008; Kingma & Welling, 2013; Rezende et al., 2014; Rifai et al., 2011b;a). For a more comprehensive review of autoencoders, we refer the reader to Goodfellow et al. (2016). In the meantime, the effects of regularization have been investigated in some detail in Alain & Bengio (2014) for the denoising autoencoder (DAE) (Vincent et al., 2010) and the reconstruction contractive autoencoder (RCAE). They point out that these regularization methods reduce the autoencoder s sensitivity to the input, while the reconstruction error increases the autoencoder s sensitivity to variations along the region of the highest density in the data space. Reconstruction and regularization together successfully capture variations in such regions while ignoring variations that Published as a conference paper at ICLR 2023 are orthogonal to those and obtain information on the data-generating probability density. Our focus will be on these two types of regularized autoencoders, i.e., the DAE and the RCAE, but the methods described in this paper are easily generalizable to other autoencoders. In the standard vector space formulation of the DAE, an input x RD is assumed to have been corrupted by some noise density q( x|x) to x RD, i.e., x q( x|x). We then seek the reconstruction function r = g f : RD RD that minimizes RD Eq( x|x) r( x) x 2 ρ(x) dx, (1) where Eq( x|x) [ ] denotes the expectation with respect to the noise density q( x|x). A trivial identity mapping r(x) = x can be avoided due to the injected noise. For the vector space formulation of the RCAE, the objective function is r(x) x 2 + σ2Tr ρ(x) dx, (2) where σ2 is a scalar weighting coefficient. The second term in (2) acts as a regularization term and can be interpreted as the Dirichlet energy of r : RD RD, measuring how variable the mapping r is (Belkin & Niyogi, 2003; Solomon et al., 2013). Minimizing this term induces the contraction of the mapping r (e.g., in the absence of the reconstruction error term in (2), an extreme contraction of r = constant would be obtained), preventing r(x) from becoming the identity mapping r(x) = x. Note that replacing r x in (2) reduces to the objective function for the contractive autoencoder (CAE) (Rifai et al., 2011b). For both the DAE under a Gaussian corruption process x N(x, σ2I) and the RCAE with σ small, the derivative of the log-probability density, which is also referred to as the score, can be estimated from the optimized r(x) as follows (Alain & Bengio, 2014): ρ ρ x(x) = r(x) x σ2 + O(σ2). (3) 3 REGULARIZED AUTOENCODERS FOR NON-EUCLIDEAN DATA In this section, we address the problem of autoencoder training for the case where the data points are drawn from an a priori known non-Euclidean space M, possibly with another non-Euclidean latent space N. 
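For later comparison with the geometric versions developed below, the Euclidean DAE objective (1) and score estimate (3) of Section 2 can be sketched in a few lines. The helper names are hypothetical, `Autoencoder` refers to the sketch above, and the corruption scale `sigma` is a free parameter.

```python
import torch

def dae_loss(r, x, sigma):
    """Monte Carlo estimate of the denoising objective (1) with Gaussian
    corruption x_tilde ~ N(x, sigma^2 I); x has shape (batch, D)."""
    x_tilde = x + sigma * torch.randn_like(x)
    return ((r(x_tilde) - x) ** 2).sum(dim=1).mean()

def euclidean_score(r, x, sigma):
    """Score estimate (3): d log rho / dx  ~  (r(x) - x) / sigma^2, up to O(sigma^2)."""
    with torch.no_grad():
        return (r(x) - x) / sigma ** 2
```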
We formulate the reconstruction error and regularization terms in a coordinate-invariant way using notions from Riemannian geometry. (We provide some mathematical backgrounds required for our formulations in Appendix A.) We then show that, as in the Euclidean case, it is possible to estimate the score for non-Euclidean data using the trained autoencoders. 3.1 COORDINATE-INVARIANT GENERALIZATIONS OF THE AUTOENCODER COMPONENTS Figure 2: Local coordinates and Riemannian metrics for the reconstruction mapping r : M M. Local coordinates are denoted in italics. Referring to Figure 2, let M be an mdimensional manifold with local coordinates x = (x1, . . . , xm) and Riemannian metric ds2 = Pm i=1 Pm j=1 gij(x) dxidxj. Throughout this paper, we use italics to represent local coordinates, e.g., a point x M has local coordinates x Rm and the mapping r : M M can be represented in local coordinates as r : Rm Rm. The metric will also be denoted in matrix form as G(x) = (gij(x)) Rm m. We now provide coordinate-invariant generalizations of each component in autoencoder training, especially in (1) and (2), while simultaneously reflecting any intrinsic properties of the manifold M. Published as a conference paper at ICLR 2023 Reconstruction error: The reconstruction error on Riemannian manifolds can be defined as dist(r(x), x)2, where dist(x, y) denotes the minimal geodesic distance between points x, y M. Probability function: Assume there exists a data-generating probability function pg : M R from which data points on M are drawn. Denote by ρg : Rm R its representation in local coordinates, satisfying ρg(x) > 0 for all x Rm and R det G(x)dx = 1 (Pennec, 1999), where p det G(x)dx is the natural volume element induced from the metric. When formulating regularized autoencoders later, pg (or ρg) is used in the form of a weighted volume element ρ(x)dx ρg(x) p det G(x)dx. In practice, the integrations involving ρ(x)dx are approximated as an equally weighted finite sum of the integrands (with ρ(x) excluded) evaluated at given data points. Contractive regularization: The conventional Dirichlet energy (appearing in (2)) has been generalized to mappings between Riemannian manifolds in the theory of harmonic maps (Eells & Sampson, 1964), based on which we formulate the contractive regularization for non-Euclidean settings. Let M be the input manifold, and let N be an n-dimensional output manifold with local coordinates y = (y1, . . . , yn) and Riemannian metric dr2 = Pn α=1 Pn β=1 hαβ(y)dyαdyβ. The metric will also be denoted in matrix form as H(y) = (hαβ(y)) Rn n. In Eells & Sampson (1964) the Dirichlet energy of a smooth map f : M N is defined as Z M Tr(J HJG 1) det G dx1 dxm, (4) where J(x) = f i xj (x) Rn m is the differential dfx : Tx M Ty N denoted in local coordinates. The energy functional in (4) is an intrinsic quantity, i.e., coordinate-invariant. We discuss the coordinate-invariance of (4) and some physical interpretations of minimizing (4) in Appendix B.1 and refer the reader to the extensive literature on the theory and applications of harmonic maps, e.g., Eells & Lemaire (1978; 1988); Park & Brockett (1994); Gu et al. (2004); Jang et al. (2021); Lee et al. (2021b). To apply this energy functional to autoencoder regularization, we replace the mapping f : M N and the Riemannian metric H(f(x)) with the reconstruction mapping r : M M and G(r(x)), respectively, and replace the natural volume element det G dx with the weighted volume element ρ(x) dx. 
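With these substitutions, the per-point integrand of the geometric contractive term becomes Tr(J^T G(r(x)) J G(x)^{-1}) with J = ∂r/∂x, and the ρ(x) dx integral is approximated by an average over training points. The term itself can be evaluated with automatic differentiation, as in the sketch below (illustrative only; `r` and `G` are assumed to be callables returning the reconstruction and the metric matrix in local coordinates).

```python
import torch
from torch.autograd.functional import jacobian

def geometric_contractive_term(r, G, x):
    """Tr(J^T G(r(x)) J G(x)^{-1}) for one point x (local coordinates, shape (m,)).
    r: R^m -> R^m reconstruction map; G: R^m -> (m, m) metric matrix."""
    J = jacobian(r, x)                      # (m, m), J[i, j] = dr^i / dx^j
    G_r = G(r(x))                           # metric evaluated at r(x)
    G_inv = torch.linalg.inv(G(x))          # G(x)^{-1}
    return torch.trace(J.T @ G_r @ J @ G_inv)
```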
Also note that for the case of non-Euclidean latent space N, the objective in (4) with the volume element ρ(x) dx can serve as a geometric regularizer for the contractive autoencoders. (See Appendix B.2 for more discussions related to non-Euclidean latent spaces.) x = Expx(v) Figure 3: A data corruption example for an input x S2. Tangent vectors (the yellow triangles) are sampled from a zero-mean Gaussian in Tx S2 and mapped via the exponential map to black dots, the corrupted points from x. The red dot x = Expx(v) for v Tx S2. Data corruption process: To corrupt an input x M, we sample a tangent vector v Tx M from an isotropic zero-mean multivariate Gaussian, i.e., a linear combination of an orthonormal basis for Tx M with coefficients sampled from N(0, σ2I), and then apply the exponential map Expx : Tx M M to v. The point x = Expx(v) can then be interpreted as a corrupted point from x. We denote by q( x|x) the noise density that samples x for given x according to this procedure. A data corruption example for data on a sphere (S2) is illustrated in Figure 3, and a rationale for adopting this way of corruption is explained in Appendix B.3. 3.2 GEOMETRICALLY REGULARIZED AUTOENCODERS We now derive coordinate-invariant formulations of the regularized autoencoders presented in (1) and (2). For the reconstruction contractive autoencoder (2), the contractive regularizer modified from (4) (as explained above) is augmented to the reconstruction error term as follows: dist(r(x), x)2 + σ2 Tr ρ(x) dx, (5) Published as a conference paper at ICLR 2023 where the reconstruction map r : M M is expressed in local coordinates as r : Rm Rm, and G(r) denotes the metric at point r(x). We refer to (5) as the geometric RCAE or GRCAE. Similarly, the geometric version of the DAE (1), referred to as GDAE, can be formulated as follows: M Eq( x|x) dist(r( x), x)2 ρ(x) dx, (6) where Eq( x|x) [ ] denotes the expectation with respect to the noise density q( x|x) presented above. From the geometric formulations (5)-(6), we can obtain the following relations between the reconstruction function r and the log of the probability function ρg(x) = ρ(x) Theorem 1. Provided σ2 is small, the derivative of the log of the probability function ρg(x) = ρ(x) det G(x) can be approximated for both the GRCAE and GDAE as x (x) = G(x) r(x) x + O(σ2). (7) The proof of Theorem 1 is provided in Appendix C. Equation (7) can be thought of as a generalization of (3) for non-Euclidean data, and we will refer to log ρg(x) x as the geometric score. The reconstruction function r : M M for proposed autoencoders is modeled as a neural network in later experiments, with implementation details provided in Appendix D. We also provide there some ideas to deal with the case of manifolds that require multiple coordinate charts and discuss another case where the data space Riemannian metric is not known a priori. 4 EXPERIMENTS In the experiments, we first demonstrate the geometric score estimation (Theorem 1) based on our geometrically regularized autoencoders for non-Euclidean data, providing a solid basis for future applications of autoencoders. We then utilize the proposed autoencoders for various applications, such as data sampling based on the Langevin Monte Carlo methods (Girolami & Calderhead, 2011) or clustering and noise filtering based on mode-seeking (Fukunaga & Hostetler, 1975; Cheng, 1995; Comaniciu & Meer, 2002), involving real-world non-Euclidean data sets. 
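All of these applications rest on the geometric score estimate of Theorem 1, which in local coordinates reduces to a one-line computation once a trained reconstruction map and the metric are available. A minimal sketch (illustrative names; `r` and `G` are callables as above):

```python
import torch

def geometric_score(r, G, x, sigma):
    """Geometric score estimate (7): d log rho_g / dx ~ G(x) (r(x) - x) / sigma^2,
    for a single point x in local coordinates (shape (m,))."""
    with torch.no_grad():
        return G(x) @ (r(x) - x) / sigma ** 2
```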
We also examine the usefulness of the proposed autoencoders in the representation learning perspective, using noisy point cloud data. 4.1 GEOMETRIC SCORE ESTIMATION For geometric score estimation, we consider the data sampled from P(n), the space of n n symmetric positive-definite matrices, endowed with the affine-invariant Riemannian metric (Fletcher & Joshi, 2007). We train the GDAE and GRCAE using synthetic data sampled from m mixtures of isotropic tangent space Gaussians for which the ground truth geometric score values are obtainable. For purposes of comparison, we also use DAE, RCAE, the least-squares log-density gradient method (LSLDG) presented in Sasaki et al. (2014), and also their extension to data on Riemannian manifolds (R-LSLDG) in Ashizawa et al. (2017). Full experimental details are provided in Appendix F. For a given data set {x1, . . . , x N} represented in local coordinates of P(n), the geometric score estimation error (Est. error), which is also defined to be coordinate-invariant, is evaluated as follows: Est. error = 1 N PN i=1 log ρg x est (xi) log ρg x (xi) G 1(xi) log ρg x est (xi) log ρg x (xi) , (8) where log ρg x est (xi) and log ρg x (xi) respectively denote the estimated value and the ground truth value for log ρg(x) x at xi. The estimation errors measured on 10, 000 test data points are averaged over five runs in Table 1. Note that the GDAE, GRCAE, and R-LSLDG methods show superior performance over DAE, RCAE, and LSLDG in estimating log ρg(x) x . Notably, GRCAE shows the best performance for higher dimensionality and a higher number of mixtures in terms of estimation error. We also obtain a similar tendency for another synthetic data set on the hypersphere Sn = {p Rn+1 | p = 1}, and the results are provided in Appendix F.4. Published as a conference paper at ICLR 2023 Table 1: Estimation errors for log ρg x for m mixtures of tangent space Gaussian data on P(n) with the standard errors in parentheses. Bolds represent the best and comparable methods from the t-test with a significance level of 5%. n m LSLDG DAE RCAE R-LSLDG GDAE (ours) GRCAE (ours) 2 2 10.5 (0.87) 8.72 (3.03) 9.04 (2.49) 3.84 (0.19) 3.55 (0.31) 2.61 (0.20) 3 2 30.0 (0.91) 17.9 (1.23) 18.1 (1.83) 5.64 (0.19) 4.95 (0.27) 4.33 (0.34) 3 3 37.6 (2.16) 21.0 (0.97) 22.4 (1.18) 7.07 (0.31) 6.86 (0.96) 5.85 (1.07) 4 2 75.5 (0.85) 42.4 (0.69) 43.8 (1.01) 9.67 (0.78) 7.72 (0.60) 6.97 (0.39) 4 3 90.8 (1.91) 59.5 (4.41) 63.3 (6.25) 12.2 (1.45) 11.8 (0.71) 10.5 (1.06) 4 4 88.4 (0.82) 73.2 (2.98) 78.1 (4.76) 13.9 (0.82) 16.0 (1.29) 12.6 (0.28) 4.2 SAMPLING ON S2 We can apply the geometric scores estimated from GDAE and GRCAE to the sampling of non Euclidean data via Riemannian Langevin Monte Carlo (RLMC) methods. The stochastic process for the RLMC methods in Girolami & Calderhead (2011) can be reformulated using the geometric score in (7) as follows: G 1(x) log ρg(x) x + Ψ(x) dt + p G 1(x)dw, (9) where dw Rm denotes the Brownian motion in an m-dimensional vector space, β(x) = 1/ p det G(x), and Ψ(x) Rm is a vector whose i-th component is given by Pm j=1 gij xj with gij as the (i, j) element of G 1(x). A discretization of the above process gives the RLMC method (Girolami & Calderhead, 2011). Note that in (9), the terms except for 1 2G 1(x) log ρg(x) x dt correspond to the Brownian motion on manifolds (Brockett, 1997). As an illustrative case study, we consider sampling on a sphere (S2). 
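A single discretized step of (9) takes the Euler-Maruyama form sketched below. The snippet is illustrative rather than a full implementation: the drift terms of (9) other than (1/2) G^{-1}(x) ∂log ρ_g/∂x depend on derivatives of the metric in the chosen chart and are left to a user-supplied callable `correction`.

```python
import numpy as np

def rlmc_step(x, score, G_inv, correction, dt, rng):
    """One Euler-Maruyama step of the Riemannian Langevin dynamics (9).
    score(x): estimated geometric score (e.g. from GDAE via (7));
    G_inv(x): inverse metric matrix at x; correction(x): remaining drift terms of (9)."""
    Ginv = G_inv(x)
    drift = 0.5 * Ginv @ score(x) + correction(x)
    L = np.linalg.cholesky(Ginv)                       # L L^T = G^{-1}(x)
    noise = L @ rng.standard_normal(x.shape) * np.sqrt(dt)
    return x + drift * dt + noise
```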
After training DAE, RCAE, GDAE, and GRCAE for data points on S2 shown in Figure 4 (left), we sample new data points that approximately follow the original data distribution by applying the RLMC methods using the geometric scores estimated from GDAE/GRCAE and its Euclidean counterparts using DAE/RCAE. We also report the results obtained from the S-Flow method (a variation of the M-Flow method (Brehmer & Cranmer, 2020) for the case of S2). The experimental details are provided in Appendix G. For a quantitative comparison of the sampling performances, the maximum mean discrepancy (MMD) (Gretton et al., 2012) between a test data set and the obtained samples is provided in Table 2. As shown in the table and Figure 4, GDAE shows the best quantitative and qualitative performance among the considered autoencoders. Even though the samples from GDAE do not yield better numerical results than those from the S-Flow method, an algorithm targeted for data sampling, it is observed that plausible samples can be obtained. Also note that the samples obtained from RCAE and GRCAE tend to be inferior to those from DAE and GDAE in this task, possibly due to the algorithmic properties of RCAE/GRCAE in which, unlike DAE/GDAE, the inputs to neural networks are strictly confined to the given training data. Thus, the reconstruction functions may not be accurately trained on other regions visited during the sampling process compared to DAE/GDAE. Table 2: The MMD measure (the lower, the better) between the test and the sampled data with standard errors in parentheses. 10 2 DAE RCAE GDAE (ours) GRCAE (ours) S-Flow four blobs 0.215 (0.015) 0.598 (0.059) 0.188 (0.018) 0.291 (0.052) 0.137 (0.038) two moons 0.147 (0.022) 0.249 (0.020) 0.144 (0.032) 0.161 (0.020) 0.100 (0.030) s-curve 0.300 (0.042) 0.248 (0.048) 0.242 (0.031) 0.231 (0.072) 0.326 (0.063) circles 0.075 (0.021) 0.353 (0.012) 0.061 (0.005) 0.323 (0.008) 0.046 (0.093) 4.3 CLUSTERING OF HYPERSPHERICAL DATA In natural language processing, the similarity between word embeddings, i.e., the vector representations that encode semantic information of the words, is often measured according to the cosine Published as a conference paper at ICLR 2023 Figure 4: Data sampling on S2. similarity (Mikolov et al., 2013). This can be equated to considering the embeddings as points on a hypersphere and measuring the distance based on the geodesic distance (Straub et al., 2015). Based on this idea, we group documents in the Newsgroup20 data set (Lang, 1995) using the GDAE trained on the document embeddings. The document embeddings are first represented as the average of the word embeddings in the document and then projected to a hypersphere. For the word embeddings, we utilize pre-trained 50-dimensional Glo Ve embeddings (Pennington et al., 2014), hence the document embeddings lie on S49. We define four clustering tasks as described in Appendix H.1. We train DAE, GDAE, LSLDG, and R-LSLDG as explained in Appendix H.2. Here we consider two variations of DAE and LSLDG, respectively; the first ones are trained on the spherical coordinate representations of the data, and the second ones on the representations in the ambient space Rn+1. After training the models, the document embeddings are iteratively updated along the gradient of the log-probability (for GDAE, this can be performed by repeatedly applying the reconstruction function on the embedding) for a fixed number of steps and then grouped as done in the mean shift clustering (Comaniciu & Meer, 2002). 
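The mode-seeking update just described amounts to iterating the trained reconstruction map and then grouping nearby endpoints. A sketch with hypothetical helper names (`r` is the trained reconstruction map acting on single points and `dist` a geodesic distance, e.g. arc length on the hypersphere):

```python
import numpy as np

def mode_seek(r, X, n_steps=50):
    """Iteratively apply the trained reconstruction function so that points drift
    toward local modes of the estimated density."""
    Z = np.asarray(X, dtype=float)
    for _ in range(n_steps):
        Z = np.stack([r(z) for z in Z])
    return Z

def group_modes(Z, dist, tol=1e-2):
    """Greedy grouping: endpoints within geodesic distance tol of an existing
    mode share a cluster label, mirroring mean shift clustering."""
    modes, labels = [], []
    for z in Z:
        for k, m in enumerate(modes):
            if dist(z, m) < tol:
                labels.append(k)
                break
        else:
            modes.append(z)
            labels.append(len(modes) - 1)
    return np.array(labels)
```

Points that converge to the same local mode thus receive the same label, as in mean shift clustering.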
The adjusted Rand index (ARI) is used as the performance metric for the clustering tasks (Hubert & Arabie, 1985). Table 3: The adjusted Rand index for the clustering results on the Newsgroup20 data set (S49) with standard errors in parentheses. The higher, the better. Tasks LSLDG LSLDG Ambient DAE DAE Ambient R-LSLDG GDAE (ours) crypt 0.174 (0.016) 0.034 (0.025) 0.205 (0.069) 0.071 (0.040) 0.221 (0.109) 0.465 (0.018) electronics 0.204 (0.006) 0.036 (0.018) 0.228 (0.082) 0.092 (0.001) 0.192 (0.099) 0.448 (0.044) med 0.153 (0.012) 0.021 (0.017) 0.227 (0.120) 0.071 (0.040) 0.246 (0.090) 0.451 (0.055) space 0.176 (0.019) 0.025 (0.026) 0.211 (0.066) 0.089 (0.001) 0.241 (0.081) 0.429 (0.070) The averaged clustering performance for five runs of each method is presented in Table 3. Note that the vector-spaced methods trained on the ambient representations of the data are hardly successful in grouping the data. On average, the R-LSLDG performs slightly better than LSLDG and DAE but with higher variance. The results obtained from GDAE show a much higher ARI than others. An additional case study for the clustering of covariance matrix data is provided in Appendix H.3. 4.4 FILTERING OF DIFFUSION TENSOR IMAGING DATA Table 4: The R-squared score for N(2) data filtering. The higher, the better. Noise level DAE GDAE (ours) MVKR 0.02 99.80 (0.00) 99.86 (0.00) 96.31 (0.01) 0.05 99.12 (0.05) 99.51 (0.03) 96.21 (0.03) 0.1 97.14 (0.16) 98.51 (0.03) 95.79 (0.04) 0.2 89.44 (0.32) 93.70 (0.47) 94.11 (0.10) For the next case study, we consider data obtained from diffusion tensor imaging (DTI). Mathematically, a DTI datum is a threedimensional image in which the value assigned to each voxel is an element of P(3). A voxel of DTI data can be treated as in the space of threedimensional normal distributions N(3), by regarding the voxel location and voxel value as Published as a conference paper at ICLR 2023 the mean and covariance of a normal distribution, respectively. In this section, following Han & Park (2014), we adopt the Fisher information Riemannian metric (FIRM) for DTI data (see Appendix D.3.1 for the definition of the FIRM). We train the GDAE on raw DTI data and apply the trained GDAE to filter the noise in the data. Figure 5: N(2) data filtering (noise 0.2). The redder, the higher error. Consider a data set comprised of N voxels {(x1, P1), . . . , (x N, PN)} of a raw DTI datum, where xi R3 and Pi P(3) denote the location and value of the i-th voxel, respectively. For the data set, we train the GDAE with the reconstruction function r : N(3) N(3); the overall training process is explained in Appendix I.1. By iteratively applying the reconstruction function of the trained GDAE, data points are mapped toward the local modes of the probability function (as can be implied from (7)), and the voxels with added noise, which usually have a lower probability, can be automatically filtered (Comaniciu & Meer, 2002). We summarize a DTI filtering algorithm based on the GDAE in Appendix I.2. We first demonstrate the effectiveness of this algorithm in a simpler setting by using a synthetic data set in two-dimensional normal distributions N(2). We train our GDAE on a noisy input artificially corrupted from clean data in Figure 5 and apply the filtering algorithm. 
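Both the P(n) experiments of Section 4.1 and the SPD-valued voxel components above rely on basic Riemannian operations on symmetric positive-definite matrices. For reference, the geodesic distance and exponential map under the affine-invariant metric can be computed as below; these are the textbook formulas (e.g. Moakher, 2005), not the paper's code, and the DTI experiments themselves use the Fisher information metric on N(3) of Appendix D.3.1.

```python
import numpy as np

def _spd_pow(P, p):
    """P^p for an SPD matrix P, via eigendecomposition (p = 0.5, -0.5, ...)."""
    w, V = np.linalg.eigh(P)
    return (V * (w ** p)) @ V.T

def affine_invariant_dist(P, Q):
    """dist(P, Q) = || log(P^{-1/2} Q P^{-1/2}) ||_F under the affine-invariant metric."""
    Pnh = _spd_pow(P, -0.5)
    w = np.linalg.eigvalsh(Pnh @ Q @ Pnh)        # eigenvalues of an SPD matrix
    return float(np.sqrt(np.sum(np.log(w) ** 2)))

def affine_invariant_exp(P, V):
    """Exp_P(V) = P^{1/2} expm(P^{-1/2} V P^{-1/2}) P^{1/2}, V symmetric in T_P P(n)."""
    Ph, Pnh = _spd_pow(P, 0.5), _spd_pow(P, -0.5)
    w, U = np.linalg.eigh(Pnh @ V @ Pnh)
    return Ph @ ((U * np.exp(w)) @ U.T) @ Ph
```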
The proposed algorithm can effectively filter out the noises qualitatively better than other filtering methods, such as the DAE-based filtering with more remaining noises and a manifold-valued kernel regression-based filtering approach (MVKR) (Banerjee et al., 2015) which tends to erroneously smooth discontinuous voxel values. A quantitative comparison in Table 4 also suggests that our GDAE-based filtering can perform well at various noise levels. The experimental details are deferred to Appendix I.3. (a) Noisy input Figure 6: DTI filtering results. We now conduct numerical experiments of DTI filtering by applying our algorithm. Data used in these experiments were obtained from the Alzheimer s Disease Neuroimaging Initiative (ADNI) database (adni.loni. usc.edu). See Appendix I.4 for the data preprocessing details. We also perform filtering based on DAE trained on nine-dimensional vector representations of voxels. Note that, critically, the DAE-based filtering has no guarantee for the filtered voxel values to be positive-definite, while GDAE-based filtering always satisfies the constraint. The input image and the filtering results from each algorithm are shown in Figure 6. We plot axial slices of the corresponding DTI data in Figures 6 (a)-(c). Voxel values are drawn as ellipsoids with colors representing the direction of the first principal axis. Since the ground truths of DTI data are not available, only qualitative comparisons between the filtering results can be made. In Figures 6 (a)-(c), it can be seen that the GDAE-based method effectively filters outliers (the abruptly changing colors appearing in Figure 6 (a)) when compared to DAE-based filtering. Using brain anatomical terms, the separation between the cerebrospinal fluid and brain parenchyma became clear after GDAE-based filtering, and as a result specifying the sulci and gyri of the brain is much easier than from the raw input or DAE-based filtering results. Furthermore, noise in the ventricle area also disappeared after GDAE-based filtering, while DAEbased filtering failed to eliminate this noise. We should note here that the DTI filtering results would require further concordance verification procedures with experts such as radiologists to ensure that only spurious artifacts are erased rather than important anatomy. Published as a conference paper at ICLR 2023 4.5 ROBUST REPRESENTATION LEARNING OF POINT CLOUD DATA We now show that we can obtain useful representations for non-Euclidean data from our proposed autoencoders, especially when noise exists in the data. We consider point cloud data in this case study. A point cloud data in RD is a set of n points in RD, represented as X = {x1, . . . , xn | xi RD}. Recently, a statistical manifold framework has been suggested for point cloud data in Lee et al. (2022), and they have observed that reflecting this geometry to train autoencoders can obtain better representations than the vector space counterparts. (See Appendix D.4.1 for more details.) Based on this choice of geometry, we train GDAE for point cloud data with n = 2, 048 and D = 3 (hence of dimension n D = 6, 144). We reflect the Fisher information metric proposed in Lee et al. (2022) in the data corruption process for GDAE and use the modified Chamfer loss as the reconstruction error for point cloud data. For comparison purposes, we also consider DAE trained on the vector representations of point cloud data. 
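For reference, the standard symmetric Chamfer distance between two point clouds can be written in a few lines; the modified variant actually used as the reconstruction error is detailed in Appendix D.4. This is an illustrative sketch, not the experimental implementation.

```python
import torch

def chamfer_distance(X, Y):
    """Symmetric Chamfer distance between point clouds X: (n, 3) and Y: (m, 3)."""
    d2 = torch.cdist(X, Y) ** 2                  # (n, m) pairwise squared distances
    return d2.min(dim=1).values.mean() + d2.min(dim=0).values.mean()
```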
We use the Fc Net (Yang et al., 2018) as the structure for the reconstruction functions with a latent space dimension of 512. Further details are provided in Appendix D.4. To verify the usefulness of the representations obtained from our trained autoencoders, we utilize them as features to train classifiers in the transfer learning setting based on some benchmark data sets following Yang et al. (2018). More specifically, we train autoencoders using the Shape Net data set (Chang et al., 2015) and obtain representations for the Model Net data set (Wu et al., 2015). We then train a linear SVM using the representations and measure the transfer classification accuracy. We inject varying noises in the data sets to verify if the obtained representations are robust to noise. More experimental details are explained in Appendix J. Table 5: Classification accuracy by transfer learning for Model Net10 and Model Net40 from Shape Net under the noise levels of 0.01, 0.05, 0.1, and 0.2. Noise level Model Net10 Model Net40 AE DAE GDAE AE GDAE AE DAE GDAE AE GDAE (ours) +R. (ours) + R. (ours) + R. (ours) + R. 0.01 93.1 92.7 93.2 92.8 93.2 87.3 87.0 87.3 88.6 88.8 0.05 91.3 91.6 92.3 93.3 92.4 83.2 84.6 85.0 86.8 86.5 0.1 88.9 90.6 91.1 90.2 91.7 75.6 79.9 83.2 80.5 82.4 0.2 79.8 81.2 88.2 84.8 88.0 64.6 67.3 77.3 72.2 76.1 The experimental results are in Table 5. The features obtained from GDAE lead to a better transfer classification accuracy than those from the vanilla autoencoder (AE) or DAE; this tendency gets stronger as the noise level increases. Our approach also shows comparable or better performances compared to another regularization method ( AE + R. in Table 5) considered in Lee et al. (2022), which tries to match the pull-back metric of the Fisher information metric (via the decoder mapping) to the identity. Combining these two regularization methods ( GDAE + R. in Table 5) shows the best overall transfer classification accuracy, demonstrating the usefulness of our geometric regularization methods in obtaining better representations of non-Euclidean data for downstream tasks. 5 CONCLUSION In this paper, we have introduced geometrically regularized autoencoders for non-Euclidean data. By constructing regularization terms in a coordinate-invariant way, we have developed two types of geometric autoencoders, the geometric reconstruction contractive autoencoder and the geometric denoising autoencoder. These autoencoders are effective in estimating the derivative of the log of the probability density of non-Euclidean data and have been successfully applied to several applications such as data sampling, mode-seeking, and representation learning tasks involving real-world non Euclidean data sets. Although training these models can be computationally demanding and may involve numerical issues when reflecting the geometry of the data, our experiments show that our approach can be a viable option for handling high-dimensional and complex non-Euclidean data. In the future, it would be worth investigating more efficient and robust methods for reflecting the geometry during training. Additionally, the idea of using geometric regularizations could be applied to other machine learning problems that involve non-Euclidean data. Published as a conference paper at ICLR 2023 ACKNOWLEDGMENTS C. Jang and Y.-K. Noh were supported by IITP Artificial Intelligence Graduate School Program for Hanyang University funded by MSIT (Grant No. 2020-0-01373). Y.-K. Noh was partly supported by NRF/MSIT (Grant No. 
2018R1A5A7059549, 2021M3E5D2A01019545) and IITP/MSIT (Grant No. IITP-2021-0-02068). Y. Lee and F. C. Park were supported in part by SRRC NRF grant 2016R1A5A1938472, IITP-MSIT (Grant No. 2022-0-00480, Development of Training and Inference Methods for Goal-Oriented AI Agents, 20%), SNU-AIIS, SNU-IAMD, and the SNU Institute for Engineering Research. REPRODUCIBILITY STATEMENT We refer the reader to the following pointers for reproducibility: Codes to train the proposed autoencoders: Supplementary Material. Proof of Theorem 1: Appendix C. Implementation details of the proposed autoencoders: Appendix D. Experimental settings: Appendix E, F, G, H, I, and J. Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-generating distribution. Journal of Machine Learning Research, 15(1):3563 3593, 2014. Georgios Arvanitidis, Lars Kai Hansen, and Søren Hauberg. Latent space oddity: on the curvature of deep generative models. In International Conference on Learning Representations, 2018. Georgios Arvanitidis, Søren Hauberg, and Bernhard Sch olkopf. Geometrically enriched latent spaces. In International Conference on Artificial Intelligence and Statistics, pp. 631 639. PMLR, 2021. Mina Ashizawa, Hiroaki Sasaki, Tomoya Sakai, and Masashi Sugiyama. Least-squares log-density gradient clustering for riemannian manifolds. In Artificial Intelligence and Statistics, pp. 537 546, 2017. Monami Banerjee, Rudrasis Chakraborty, Edward Ofori, David Vaillancourt, and Baba C Vemuri. Nonlinear regression on riemannian manifolds and its applications to neuro-image analysis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 719 727. Springer, 2015. Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373 1396, 2003. Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in neural information processing systems, pp. 153 160, 2007. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798 1828, 2013. William M Boothby. An introduction to differentiable manifolds and Riemannian geometry, volume 120. Academic press, 1986. Johann Brehmer and Kyle Cranmer. Flows for simultaneous manifold learning and density estimation. Advances in Neural Information Processing Systems, 33:442 453, 2020. Roger Brockett. Notes on stochastic processes on manifolds. In Systems and Control in the twentyfirst century, pp. 75 100. Springer, 1997. Michael M Bronstein, Joan Bruna, Yann Le Cun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18 42, 2017. Published as a conference paper at ICLR 2023 Rudrasis Chakraborty, Jose Bouza, Jonathan Manton, and Baba C Vemuri. Manifoldnet: A deep neural network for manifold-valued data with applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. ar Xiv preprint ar Xiv:1512.03012, 2015. Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE transactions on pattern analysis and machine intelligence, 17(8):790 799, 1995. Dorin Comaniciu and Peter Meer. 
Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis & Machine Intelligence, (5):603 619, 2002. Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak. Hyperspherical variational auto-encoders. ar Xiv preprint ar Xiv:1804.00891, 2018. Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. ar Xiv preprint ar Xiv:1605.08803, 2016. Boris A Dubrovin, Anatolij Timofeeviˇc Fomenko, and Serge ı Petrovich Novikov. Modern Geometry Methods and Applications Part I. The Geometry of Surfaces, Transformation Groups, and Fields. Springer, 1992. James Eells and Luc Lemaire. A report on harmonic maps. Bulletin of the London Mathematical Society, 10(1):1 68, 1978. James Eells and Luc Lemaire. Another report on harmonic maps. Bulletin of the London Mathematical Society, 20(5):385 524, 1988. James Eells and Joseph H Sampson. Harmonic mappings of riemannian manifolds. American Journal of Mathematics, 86(1):109 160, 1964. P Thomas Fletcher and Sarang Joshi. Riemannian geometry for the statistical analysis of diffusion tensor data. Signal Processing, 87(2):250 262, 2007. Oren Freifeld and Michael J Black. Lie bodies: A manifold representation of 3d human shape. In European Conference on Computer Vision, pp. 1 14. Springer, 2012. Keinosuke Fukunaga and Larry Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on information theory, 21(1):32 40, 1975. Mark Girolami and Ben Calderhead. Riemann manifold langevin and hamiltonian monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2): 123 214, 2011. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249 256. JMLR Workshop and Conference Proceedings, 2010. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http: //www.deeplearningbook.org. Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Sch olkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723 773, 2012. Xianfeng Gu, Yalin Wang, Tony F Chan, Paul M Thompson, and Shing-Tung Yau. Genus zero surface conformal mapping and its application to brain surface mapping. IEEE Transactions on Medical Imaging, 23(8):949 958, 2004. Minyeon Han and Frank C Park. Dti segmentation and fiber tracking using metrics on multivariate normal distributions. Journal of mathematical imaging and vision, 49(2):317 334, 2014. Søren Hauberg, Oren Freifeld, and Michael Black. A geometric take on metric learning. Advances in Neural Information Processing Systems, 25, 2012. Published as a conference paper at ICLR 2023 Zhiwu Huang and Luc Van Gool. A riemannian network for spd matrix learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017. Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2(1):193 218, 1985. Aapo Hyv arinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695 709, 2005. Cheongjae Jang, Yung-Kyun Noh, and Frank Chongwoo Park. A riemannian geometric framework for manifold learning of non-euclidean data. Advances in Data Analysis and Classification, 15 (3):673 699, 2021. 
Sadeep Jayasumana, Richard Hartley, Mathieu Salzmann, Hongdong Li, and Mehrtash Harandi. Kernel methods on riemannian manifolds with gaussian rbf kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(12):2464 2477, 2015. Mark Jenkinson, Christian F. Beckmann, Timothy E.J. Behrens, Mark W. Woolrich, and Stephen M. Smith. Fsl. Neuro Image, 62(2):782 790, 2012. ISSN 1053-8119. doi: https://doi.org/10. 1016/j.neuroimage.2011.09.015. URL http://www.sciencedirect.com/science/ article/pii/S1053811911010603. 20 YEARS OF f MRI. David G Kendall. Shape manifolds, procrustean metrics, and complex projective spaces. Bulletin of the London Mathematical Society, 16(2):81 121, 1984. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980. Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. Brian Kulis et al. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4): 287 364, 2013. Ken Lang. Newsweeder: Learning to filter netnews. In Machine Learning Proceedings 1995, pp. 331 339. Elsevier, 1995. Yonghyeon Lee, Jonghyuk Baek, Young Min Kim, and Frank Chongwoo Park. Imat: The iterative medial axis transform. In Computer Graphics Forum, volume 40, pp. 162 181. Wiley Online Library, 2021a. Yonghyeon Lee, Sangwoong Yoon, Min Jun Son, and Frank C Park. Regularized autoencoders for isometric representation learning. In International Conference on Learning Representations, 2021b. Yonghyeon Lee, Seungyeon Kim, Jinwon Choi, and Frank Park. A statistical manifold framework for point cloud data. In International Conference on Machine Learning, pp. 12378 12402. PMLR, 2022. Bastian Leibe and Bernt Schiele. Analyzing appearance and contour based methods for object categorization. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings., volume 2, pp. II 409. IEEE, 2003. Kantilal Varichand Mardia. Statistics of directional data. Academic press, 2014. Emile Mathieu, Charline Le Lan, Chris J Maddison, Ryota Tomioka, and Yee Whye Teh. Continuous hierarchical representations with poincar e variational auto-encoders. Advances in neural information processing systems, 32, 2019. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. ar Xiv preprint ar Xiv:1310.4546, 2013. Maher Moakher. A differential geometric approach to the geometric mean of symmetric positivedefinite matrices. SIAM Journal on Matrix Analysis and Applications, 26(3):735 747, 2005. Published as a conference paper at ICLR 2023 Sameer A Nene, Shree K Nayar, and Hiroshi Murase. Columbia object image library (coil-20). Technical Report CUCS-005-96, 1996a. Sameer A Nene, Shree K Nayar, and Hiroshi Murase. Columbia object image library (coil-100). Technical Report CUCS-006-96, 1996b. Frank C Park and Roger W Brockett. Kinematic dexterity of robotic mechanisms. The International Journal of Robotics Research, 13(1):1 15, 1994. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017. Xavier Pennec. Probabilities and statistics on riemannian manifolds: Basic tools for geometric measurements. 
In NSIP, pp. 194 198. Citeseer, 1999. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532 1543, 2014. URL http://www.aclweb.org/anthology/D14-1162. Marc'aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann L. Cun. Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems 19, pp. 1137 1144. MIT Press, 2007. Marc'aurelio Ranzato, Y lan Boureau, and Yann L. Cun. Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems 20, pp. 1185 1192. Curran Associates, Inc., 2008. Luis A P erez Rey, Vlado Menkovski, and Jacobus W Portegies. Diffusion variational autoencoders. ar Xiv preprint ar Xiv:1901.08991, 2019. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pp. 1278 1286. PMLR, 2014. Salah Rifai, Gr egoire Mesnil, Pascal Vincent, Xavier Muller, Yoshua Bengio, Yann Dauphin, and Xavier Glorot. Higher order contractive auto-encoder. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 645 660. Springer, 2011a. Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive autoencoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 833 840, 2011b. Hiroaki Sasaki, Aapo Hyv arinen, and Masashi Sugiyama. Clustering via mode seeking by direct estimation of the gradient of a log-density. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 19 34. Springer, 2014. Ondrej Skopek, Octavian-Eugen Ganea, and Gary B ecigneul. Mixed-curvature variational autoencoders. ar Xiv preprint ar Xiv:1911.08411, 2019. Justin Solomon, Leonidas Guibas, and Adrian Butscher. Dirichlet energy for analysis and synthesis of soft maps. In Computer Graphics Forum, volume 32, pp. 197 206. Wiley Online Library, 2013. Julian Straub, Jason Chang, Oren Freifeld, and John Fisher III. A dirichlet process mixture model for spherical data. In Artificial Intelligence and Statistics, pp. 930 938. PMLR, 2015. Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096 1103. ACM, 2008. Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371 3408, 2010. Published as a conference paper at ICLR 2023 Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912 1920, 2015. Jiacheng Xu and Greg Durrett. Spherical latent spaces for stable variational autoencoders. ar Xiv preprint ar Xiv:1808.10805, 2018. Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. Foldingnet: Point cloud auto-encoder via deep grid deformation. 
In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 206 215, 2018. Published as a conference paper at ICLR 2023 A MATHEMATICAL BACKGROUNDS In this section, we briefly review some notions related to differentiable manifolds and Riemannian geometry. For further mathematical details on differentiable manifolds and Riemannian geometry, we refer the reader to Boothby (1986) and Dubrovin et al. (1992). Intuitively, an m-dimensional differentiable manifold M is a space which is locally diffeomorphic1 to m-dimensional Euclidean space. For every point p M, there exists a coordinate chart (U, x), where U is an open subset of M containing p, and x is a homeomorphism2 of U to an open subset of Rm. Applying x to p gives the m coordinates of p, i.e., x(p) = (x1(p), . . . , xm(p)) Rm each xi is a real-valued function on U, the i-th coordinate function. Here x is called the local coordinates; note that other choices of local coordinates are also possible (e.g., for the sphere, both spherical coordinates and stereographic projection correspond to different local coordinates of the sphere). A differentiable manifold M endowed with a Riemannian metric is called a Riemannian manifold. The Riemannian metric is a function defined on the manifold M that assigns to each point p M a bilinear mapping Φp : Tp M Tp M R, where Tp M denotes the tangent space to M at p. Using the local coordinates x = (x1, . . . , xm), the Riemannian metric can be expressed as Φ = Pm i=1 Pm j=1 gij(x)dxidxj or ds2 = Pm i=1 Pm j=1 gij(x)dxidxj. Here gij(x) is assumed to be smooth, i.e., infinitely differentiable, and its matrix representation G = (gij) Rm m is symmetric positive-definite. The Riemannian metric allows one to calculate lengths, angles, volumes, and even define a distance metric on differentiable manifolds in an intrinsic way, i.e., in a way that is invariant to the choice of local coordinates.3 (a) A curve and a region on M (b) Corresponding curve and region in local coordinates Figure 7: A Riemannian manifold M and its local coordinate x. The length of a curve C = {x(t) M | t [0, 1]} is calculated as Length(C) = Z 1 x(t) G(x(t)) x(t) dt, (10) where x(t) Rm is the local coordinate representation of x(t) (see Figure 7). Given two fixed boundary points x(0), x(1) M, the curves that minimize the length (10) are called the minimal geodesics, and the corresponding lengths the minimal geodesic distances. The equations for geodesics are obtained as j=1 Γk ij dxi dt = 0, k = 1, . . . , m, (11) where Γk ij denote the Christoffel symbols of the second kind in M, i.e., and gkl is the (k, l) entry of G 1 Rm m. 1Two manifolds are said to be diffeomorphic if there exists a differentiable mapping between the two manifolds which is invertible and its inverse is also differentiable. Such a mapping is called a diffeomorphism. 2A continuous function is called a homeomorphism if it is invertible, and its inverse is continuous. 3See Appendix B.1 for more discussions on the coordinate transformations and the coordinate-invariance. Published as a conference paper at ICLR 2023 Figure 8: A geodesic curve x : [0, 1] M emanating from x(0) M along v Tx(0)M. Consider a geodesic curve x : [0, 1] M emanating from a point x(0) M with an initial velocity vector x(0) = v Tx(0)M as shown in Figure 8. It is known that such a geodesic is unique if it exists (Boothby, 1986), and denote by x(1) M the endpoint of the geodesic that propagates for a unit time. 
The mapping that maps the initial velocity vector v to the point x(1) is called the exponential map Expx(0) : Tx(0)M M. Note that the distance between x(0) and x(1) is the same as the norm of v. This corresponds to a generalization of propagating a line in vector space from a starting point along a vector. On Riemannian manifolds, there exists a (natural) volume element induced from the Riemannian metric G(x) which is expressed in local coordinates as p det G(x) dx1 dxm. The volume of a compact subset V M is then obtained by the following integral: Volume(V) = Z det G(x) dx1 dxm, (13) where V denotes the domain of integration expressed in local coordinates4 (see Figure 7). The integration of a bounded and continuous function f : M R on the integration domain V is also obtained using the volume element as follows: Z det G(x) dx1 dxm. (14) B FURTHER EXPLANATIONS OF THE GEOMETRIC REGULARIZATION COMPONENTS B.1 CONTRACTIVE REGULARIZATION To see why (4) is coordinate-invariant, observe that under a pair of local coordinate transformations x 7 x = ϕ(x) and y 7 y = ψ(y), G(x) = (gij(x)) Rm m, H(y) = (hαβ(y)) Rn n, and J(x) = f i xj (x) Rn m transform according to the following rules (Dubrovin et al., 1992): (i) G 7 G = Φ GΦ 1, where Φ = ϕ x Rm m; (ii) H 7 H = Ψ HΨ 1, where Ψ = ψ y Rn n; (iii) J 7 J = ΨJΦ 1,5 where it can be verified that Tr J HJG 1 remains the same. Also note that minimizing (4) induces the contraction (shrinking without distortion) of the mapping f. As an extreme case without any boundary conditions or constraints, trivial solutions are obtained as J = 0 or equivalently f = constant, which is an extreme contraction. On the other hand, provided the boundary conditions or constraints for f are well-specified, a useful physical analogy for minimizing (4) is to imagine wrapping a curved object made of marble (N) by an elastic sheet (M); harmonic maps, which are extrema of (4), can be viewed as solutions corresponding to elastic equilibria (Eells & Sampson, 1964). B.2 SOME REMARKS ON NON-EUCLIDEAN LATENT SPACES Recently, to better capture the structure of data distributions, there have been increasing works on autoencoders that deal with non-Euclidean latent spaces, e.g., hyperspheres (Davidson et al., 2018; Xu & Durrett, 2018), hyperbolic spaces or Poincar e balls (Mathieu et al., 2019), their mixtures 4We may also use V rather than V to denote the domain of integration for notational simplicity. 5We can easily verify (i) and (ii) by observing the Riemannian metric should remain the same under the local coordinate transform as ds2 = [dx] G[dx] = [dx ] G [dx ], where [dx] = [dx1, . . . , dxm] and [dx ] = [dx 1, . . . dx m] are related by [dx] = x x [dx ] = Φ 1[dx ]. We can verify (iii) by considering the chain rule (y f) y f x x x = Ψ f Published as a conference paper at ICLR 2023 (Skopek et al., 2019), or submanifolds embedded in Euclidean spaces (Rey et al., 2019). In the case of such non-Euclidean latent spaces, by reflecting the coordinate-invariant contractive regularization discussed in Section 3.1, we can formulate geometric contractive autoencoder (GCAE) as follows: dist(r(x), x)2 + σ2 Tr ρ(x) dx, (15) where the reconstruction map r : M M, the encoder mapping f : M N, and the decoder mapping g : N M are respectively expressed in local coordinates as r : Rm Rm, f : Rm Rn, and g : Rn Rm. 
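The contractive term of (15) can be evaluated exactly as in Section 3.1, now with the encoder Jacobian and the latent-space metric H. A per-point sketch (illustrative only; `f`, `G`, and `H` are assumed to be callables in local coordinates):

```python
import torch
from torch.autograd.functional import jacobian

def gcae_contractive_term(f, G, H, x):
    """Contractive term of (15): Tr(J_f^T H(f(x)) J_f G(x)^{-1}) at a point x, where
    f: R^m -> R^n is the encoder and H the latent-space metric in local coordinates."""
    J = jacobian(f, x)                              # (n, m)
    return torch.trace(J.T @ H(f(x)) @ J @ torch.linalg.inv(G(x)))
```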
Training the GCAE may induce additional regularization effects that better capture dataconcentrated regions or make the model more robust to noise, in addition to the effect of using non-Euclidean latent spaces. Confirming these experimentally is left for future work. As another choice of the Riemannian metric for latent space, we can consider the pullback of the data space metric via decoder mapping (Arvanitidis et al., 2018). It has been observed that applying this metric can better characterize the data distances in latent space and provide more meaningful results in analyzing latent representations. Adopting this metric on the latent space reveals an interesting connection between the GCAE in (15) and GRCAE in (5). This pullback metric is obtained as H(y) = g y G(g(y)) g y Rn n, and we can observe that (15) and (5) become identical when substituting this metric into (15) and considering the chain rule r y f x Rm m. B.3 DATA CORRUPTION PROCESS The corruption process x N(x, σ2I) in vector space is equivalent to set x = x + ϵ for ϵ N(0, σ2I), i.e., the endpoint of a line segment starting from x and extended according to the vector ϵ. On the Riemannian manifolds, a similar discussion is possible using the notion of the geodesic and the exponential map as discussed in Appendix A. Therefore, we corrupt an input x M by applying the exponential map Expx : Tx M M to a tangent vector v Tx M sampled from an isotropic zero-mean multivariate Gaussian, i.e., v = v1E1 + + vm Em for an orthonormal basis E1, . . . , Em for Tx M and an m-dimensional vector (v1, . . . , vm) N(0, σ2I). By using an isotropic Gaussian, the corruption process x = Expx(v) does not depend on which orthonormal basis E1, . . . , Em is used. C PROOF OF THEOREM 1 C.1 THE FIRST-ORDER NECESSARY CONDITIONS FOR GRCAE Lemma 1. Provided σ2 is small, the derivative of the log of the probability function ρg(x) = ρ(x) det G(x) can be approximated for the geometric reconstruction contractive autoencoder as ρ ρ x(x) Γ(x) = G r(x) x + O(σ2), (16) where Γ(x) Rm is a vector whose i-th component is given by 1 Proof. The first-order necessary conditions for (5) can be obtained from the following Euler Lagrange equations: = 0, i = 1, . . . , m, (17) where L is the integrand of (5) with an approximation on the squared geodesic distance to (r(x) x) G(x)(r(x) x) provided σ2 small,6 and ri denotes the i-th coordinate representation of the 6The difference between optimal r(x) and x turns out to be O(σ2). It can be verified that this approximation does not affect the final results, since the approximation error of the squared geodesic distance becomes O(σ6). Published as a conference paper at ICLR 2023 reconstruction function r. By applying L to the Euler-Lagrange equations, (17) results in ri xi = σ2 ηi(x, r, r , r ), i = 1, . . . , m, (18) where r denotes r x, r denotes the collection of 2rj x2 for j = 1, . . . , m, and ηi denotes a function of x, r, r , r .7 Assuming r, G, G 1, ρ are smooth, and their higher-order derivatives are bounded, ri xi = σ2 ηi(x, r, r , r ) = O(σ2) holds. By iterating the relation ri = xi +σ2 ηi(x, r, r , r ) = xi + O(σ2), i.e., substituting this form of ri into ηi(x, r, r , r ), the i-th component of r is obtained for small σ2 as follows: + O(σ4), (19) + O(σ4), (20) where gij denotes the (i, j) entry of G 1, ρg = ρ det G, and the identity log(det G) xj = Tr G 1 G is used to derive (20). By gathering (20) for all i, the first-order necessary conditions can be rewritten as 1 ρg x (x) = G(x) r(x) x + O(σ2). 
(21) C.2 THE FIRST-ORDER NECESSARY CONDITIONS FOR GDAE To obtain the first-order necessary conditions for GDAE, the integrand of (6) is first represented in local coordinates x = (x1, . . . , xm) in Lemma 2. Lemma 2. Provided σ2 is small, Eq( x|x) dist(r( x), x)2 in (6) can be approximated in local coordinates as Eq( x|x) dist(r( x), x)2 (22) = (r(x) x) G(x)(r(x) x) + σ2 j=1 (ri xi) gij Tr 2rj n=1 (ri xi) Γi;jk rj xn gln gij rj xk Γk lngln ! where ri is the i-th component of r, gij and gij respectively denote the (i, j) entry of G and G 1. The Γk;ij and Γk ij respectively denote the Christoffel symbol of the first and second kind in M. Proof. We represent x M and x M in local coordinates as x Rm and x Rm, respectively. Furthermore, at x, consider a nonlinear function ϕ : Rm Rm that maps the representations in local coordinates of the points near x to those in the normal coordinates; from these settings, ϕ(x) = 0 holds and let u = ϕ( x). Denote by ru : Rm Rm the reconstruction function represented in the normal coordinates. (The functions r and ru are then related by ru( u) = ϕ(r( x)), and ru(0) = ϕ(r(x)) holds.) Since the distance between x M and r( x) M corresponds to the standard vector norm of the representation of r( x) in normal coordinates at x, dist(r( x), x)2 = ru( u) 2 holds. Near the origin 7The explicit form of ηi(x, r, r , r ) is obtained as j,k,l,α gij P xl gkl + g(r)jα xl gkl + g(r)jα 2rα ρ ρ xk , where gij and g(r)ij respectively denote the (i, j) entry of G 1 and G(r(x)). Published as a conference paper at ICLR 2023 of the normal coordinates at x, ri u : Rm R (the i-th component of ru : Rm Rm) admits following Taylor s expansion: ri u( u) = ri u(0) + ri u u u + 1 2! u 2ri u u2 Then ru( u) 2 can be expressed as follows: ri u( u) 2 (24) = ru(0) 2 + 2 ru(0) ru i=1 ri u(0) u 2ri u u2 where the terms of the order (with respect to u) higher than three are omitted in (25). Consider the expectation with respect to u N(0, σ2I) of (25) (this corresponds to the expectation of dist(r( x), x)2 with respect to q( x|x)). Provided σ2 is small, we obtain Eq( x|x) dist(r( x), x)2 = Eq( u) ru( u) 2 (26) = ru(0) 2 + σ2 Tr i=1 ri u(0) Tr 2ri u u2 where q( u) denotes the noise density for u N(0, σ2I), and the terms with an order higher than σ2 are assumed to be negligible. We now represent (26) in local coordinates. For this purpose, the following equations are useful: ri u(0) = ϕi(r(x)) ϕi x (r(x) x), (27) ri u uj = ϕi(r(x)) 2ri u uj uk = xk xl , (32) k=1 Γi jkgjk = uj uk δjk, (33) uj is the (i, j) entry of x u = ϕ x 1 Rm m, and δjk = 1 for j = k and δjk = 0 otherwise in (33). Here, the higher-order terms in (27) can be neglected for small σ2, and (28)- (29) are obtained from the chain rule. We can derive (30)-(33) using the properties of the normal coordinates. By applying (27)-(33) to (26) and rearranging the terms, we obtain the result in (22). Published as a conference paper at ICLR 2023 We now provide the first-order necessary conditions for (6) using Lemma 2. Lemma 3. Provided σ2 is small, the derivative of the log of the probability function ρg(x) = ρ(x) det G(x) can be approximated for the geometric denoising autoencoder as ρ ρ x(x) Γ(x) = G(x) r(x) x + O(σ2), (34) where Γ(x) Rm is a vector whose i-th component is given by 1 Proof. The first-order necessary conditions for (6) are obtained from the following Euler-Lagrange equations: = 0, i = 1, . . . , m, (35) where L denotes (22) approximated according to Lemma 2. 
Applying L to (35) results in an equation of the form ri xi = σ2 φi(x, r, r , r ) for i = 1, . . . , m, where φi is a function of x, r, r , r .8 Assuming r, G, G 1, ρ are smooth and their higher-order derivatives are bounded, we obtain an equation identical to (20) for small σ2 by iterating ri = xi + σ2φi(x, r, r , r ) = xi + O(σ2). Hence the first-order necessary conditions are obtained as (21). C.3 PROOF OF THEOREM 1 Theorem 2. Provided σ2 is small, the derivative of the log of the probability function ρg(x) = ρ(x) det G(x) can be approximated for both the GRCAE and GDAE as ρ ρ x(x) Γ(x) = G(x) r(x) x + O(σ2), (36) where Γ(x) Rm is a vector whose i-th component is given by 1 Proof. Collecting Lemma 1 from Appendix C.1 and Lemma 3 from Appendix C.2 completes the proof. D IMPLEMENTATION OF GEOMETRICALLY REGULARIZED AUTOENCODERS In Appendix D.1, D.2, D.3, and D.4, we provide the implementation details of the geometrically regularized autoencoders for Sn, P(n), N(3), and point cloud data considered in our experiments, respectively. We also provide some ideas to deal with the case of manifolds that require multiple coordinate charts in Appendix D.5. We then discuss the case when the data space Riemannian metric is not known a priori in Appendix D.6. D.1 Sn DATA D.1.1 DATA CORRUPTION PROCESS Given an n + 1-dimensional vector with the unit norm x Sn Rn+1 and a tangent vector v Tx Sn Rn+1, the exponential map Expx : Tx Sn Sn is defined as follows: Expx(v) = cos v x + sin v 8The explicit form of φi(x, r, r , r ) is obtained as j,k,l,α gij 1 xl gkl + Γj;αβ rα xl gkl gjα rα xβ Γβ klgkl xk xl gkl + 1 ρ xk g(r)jα rα xl gklρ + P β(rα xα)Γα;jβ rβ 2(rα xα)gjα Γk lβglβρ 1 xk xl gjα(rα xα)gklρ , where g(r)ij is the (i, j) entry of G(r(x)). Published as a conference paper at ICLR 2023 where is the Euclidean norm. For the data corruption process, we first sample an n + 1-dimensional vector ϵ N(0, σ2I) and project ϵ onto Tx Sn (with the origin identified to that of the ambient Euclidean space) as v = (I xx )ϵ. As discussed in Section 3.2, the input x is then corrupted to x = Expx(v) according to (37). D.1.2 THE RECONSTRUCTION ERROR For a reconstruction function r : Sn Sn, the reconstruction error is defined as follows: dist(r( x), x)2 = arccos(r( x) x)2. (38) D.1.3 THE RECONSTRUCTION FUNCTION We implement our regularized autoencoder for Sn as a mapping r : Sn Sn. The input and output of the mapping are n + 1-dimensional vectors with the unit norm. The reconstruction function r is modeled by a neural network with two hidden layers, with the hyperbolic tangent (Tanh) activation function as follows: r(x) = Proj(x + W3h2 + b3), (39) hi = Tanh(Wihi 1 + bi), i = 1, 2, (40) where h1, h2 Rdh denote the hidden variables, h0 is set to be x Sn Rn+1, Proj : Rn+1 Sn Rn+1 is the function to project a vector onto a hypersphere by dividing the input vector by its norm, and Wi, bi for i = 1, 2, 3 respectively denote the matrix and vector parameters with sizes defined accordingly as above. We set the hidden variable dimensionality dh to 1,000 in all the experiments. D.1.4 CONTRACTIVE REGULARIZATION For the case of Sn, the contractive term of (5) (without the ρ(x) term) can be computed using r as x (I xx ) , where r x Rn+1 n+1 and x Sn Rn+1. D.1.5 AN EMPIRICAL ANALYSIS OF COMPUTATIONAL TIME Training geometrically regularized autoencoders requires more computations than training vector space autoencoders since it needs to calculate the geodesic distance, exponential map, and geometric contractive term. 
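For concreteness, these Sⁿ components (the exponential map in (37), the tangent-space corruption of D.1.1, and the squared geodesic distance in (38)) can be sketched in a few lines of PyTorch; the function names and the numerical clamping below are illustrative assumptions rather than the authors' implementation.

```python
import torch

def exp_map_sphere(x, v, eps=1e-8):
    """Exponential map on S^n as in (37): Exp_x(v) = cos(|v|) x + sin(|v|) v / |v|."""
    nv = v.norm()
    if nv < eps:                       # Exp_x(0) = x
        return x
    return torch.cos(nv) * x + torch.sin(nv) * v / nv

def corrupt_sphere(x, sigma):
    """Corruption of D.1.1: project ambient Gaussian noise onto T_x S^n, then apply Exp_x."""
    noise = sigma * torch.randn_like(x)
    v = noise - (x @ noise) * x        # v = (I - x x^T) noise
    return exp_map_sphere(x, v)

def geodesic_sq_dist_sphere(p, q):
    """Squared geodesic distance arccos(p^T q)^2 as in (38), clamped for numerical safety."""
    return torch.arccos(torch.clamp(p @ q, -1.0, 1.0)) ** 2
```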
We have experimentally measured the computational times for the models applying each geometric component explained above to the autoencoder r : M M. We have used the Pytorch library (Paszke et al., 2017) and have utilized NVIDIA Tesla V100 GPU with Intel Xeon E5-2698 v4 2.2 GHz (20-Core) CPU (also for most of the other experiments). We report in Table 6 the computational times spent for a gradient update iteration for each model with a batch size of 10,000. In the table, AE, GAE, GAE + G., GAE + E., and GAE + C. respectively represent the vanilla autoencoder (of the same input and output dimensions with other models), an autoencoder reflecting the non-Euclidean input and output structure, a model applying the geodesic distance as the reconstruction error for GAE, a model applying the exponential map in data corruption process for GAE, and a model applying the geometric regularization term for GAE. For comparison, we also consider GDAE and GRCAE. To check the time for various data dimensionality, we consider n = 2, 6, 10, 25, 50, 100 for Sn data generated as explained in Appendix F.4. From the table, we can observe that applying our geometric components can be scaled to high-dimensional Sn data reasonably well, except for the geometric contractive term, which requires the derivative of the Jacobian r x Rn+1 n+1 during the computation of gradients for an update. D.2 P(n) DATA D.2.1 MATRIX EXPONENTIAL AND MATRIX LOGARITHM We first provide the closed-form expressions of the matrix exponential on S(n), the space of n n symmetric matrices, and matrix logarithm on P(n). Given an eigendecomposition of a symmetric Published as a conference paper at ICLR 2023 Table 6: The computational times (in milliseconds) spent for a gradient update iteration for autoencoders applying different geometric components for Sn data. n AE GAE GAE + G. GAE + E. GAE + C. GDAE GRCAE 2 5.37 5.30 7.53 6.09 35.7 10.7 39.8 6 5.46 5.32 7.59 6.13 68.7 10.9 74.0 10 5.56 5.47 7.44 6.23 103 10.9 110 25 5.46 5.53 7.71 6.31 234 11.0 248 50 5.68 5.58 7.79 6.45 462 10.7 491 100 6.15 6.16 8.23 7.09 1016 11.3 1100 matrix A = RDR S(n), with R O(n) as the eigenvector matrix and D = diag(d1, . . . , dn) the matrix of corresponding eigenvalues, the matrix exponential is obtained as Exp(A) = R Exp(D) R , (41) where Exp(D) = diag(exp(d1), . . . , exp(dn)). Given an eigendecomposition of a symmetric positive-definite matrix A = RDR P(n) similarly to the above, the matrix logarithm is obtained as Log(A) = R Log(D) R , (42) where Log(D) = diag(log(d1), . . . , log(dn)). D.2.2 DATA CORRUPTION PROCESS Given an input P P(n), we first sample a n(n+1) 2 -dimensional vector ϵ N(0, σ2I). As discussed in Section 3.2, the input P is then corrupted to P P(n) as follows: P = P 1 2 Exp ([ϵ]) P 1 2 , (43) where P 1 2 = RS 1 2 R for an eigendecomposition of P = RSR , and the bracket operator is defined for ϵ = (ϵ11, . . . , ϵ1n, ϵ22, . . . , ϵ2n, . . . , ϵnn) R n(n+1) 2 as ([ϵ])ij = aijϵij for i j and ([ϵ])ij = ([ϵ])ji for i > j, with aij = 1 if i = j and 1 2 otherwise. D.2.3 THE RECONSTRUCTION ERROR For a reconstruction function r : P(n) P(n), the reconstruction error is defined from the affineinvariant Riemannian metric as follows (Fletcher & Joshi, 2007): dist(r( P), P)2 = Log P 1 2 R for an eigendecomposition of P = RSR , Log : P(n) S(n) is the matrix logarithm, and F is the matrix Frobenius norm. D.2.4 THE RECONSTRUCTION FUNCTION The reconstruction functions r : P(n) P(n) for GDAE and GRCAE should consider the special structure of P(n). 
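Before specifying how r enforces this structure, the P(n) primitives of D.2.1-D.2.3 can be sketched as follows; this is a minimal PyTorch sketch of the eigendecomposition-based matrix exponential/logarithm in (41)-(42) and the affine-invariant squared distance in (44), with hypothetical helper names and none of the numerical safeguards or approximations discussed later in D.2.6.

```python
import torch

def sym_expm(A):
    """Matrix exponential of a symmetric matrix, Exp(A) = R diag(exp(d)) R^T as in (41)."""
    d, R = torch.linalg.eigh(A)
    return R @ torch.diag(torch.exp(d)) @ R.T

def spd_logm(P):
    """Matrix logarithm of an SPD matrix, Log(P) = R diag(log(d)) R^T as in (42)."""
    d, R = torch.linalg.eigh(P)
    return R @ torch.diag(torch.log(d)) @ R.T

def affine_invariant_sq_dist(P, Q):
    """Squared affine-invariant distance ||Log(P^{-1/2} Q P^{-1/2})||_F^2 as in (44)."""
    d, R = torch.linalg.eigh(P)
    P_inv_sqrt = R @ torch.diag(d.rsqrt()) @ R.T
    return spd_logm(P_inv_sqrt @ Q @ P_inv_sqrt).square().sum()
```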
By defining a function v : P(n) S(n), the reconstruction function at P P(n) is modeled as r(P) = P 1 2 Exp(v(P))P 1 2 . (45) Hence training r reduces to training unconstrained v (when considering only the loweror uppertriangular part), and we denote by v : P(n) R n(n+1) 2 the function that returns the n(n+1) 2 - dimensional vector representation of the lower- (or upper-) triangular part of the output for v. We model v by neural networks with two hidden layers, with the hyperbolic tangent (Tanh) activation function as follows: v(P) = W3h2 + b3, (46) hi = Tanh(Wihi 1 + bi), i = 1, 2, (47) Published as a conference paper at ICLR 2023 where h1, h2 Rdh denote the hidden variables, h0 R n(n+1) 2 is set to the vector representation of the lower- (or upper-) triangular part of Log(P) with Log : P(n) S(n) as the matrix logarithm,9 and Wi, bi for i = 1, 2, 3 respectively denote the matrix and vector parameters with sizes defined accordingly as above. We set the dimensionality of the hidden variables dh to 1,000 for both GDAE and GRCAE in all the experiments. D.2.5 CONTRACTIVE REGULARIZATION For the case of P(n), we can consider the upper- (or lower-) triangular part of the matrix representations as its local coordinates, i.e., r(x) r(x) and x x, and calculate the contractive term of (5) straight-forwardly. D.2.6 AN EMPIRICAL ANALYSIS OF COMPUTATIONAL TIME Similarly to Appendix D.1.5, we have experimentally measured the computational times for each autoencoder model for P(n) data. Note that we can calculate the matrix exponential and logarithm faster and numerically more stably using the approximations from Taylor s expansion by assuming small σ2 (hence inputs becoming near zero and identity for the matrix exponential and logarithm, respectively, in (43)-(45)). We also precompute and use some quantities frequently appears in (43)- (45) such as P 1 2 and P 1 We consider two different models with different degrees of approximation for GDAE, namely GDAE-v1 and GDAE-v2. GDAE-v1 applies in (43)-(45) the second-order Taylor s expansion on the matrix exponential and logarithm, i.e., Exp(v) I +v+ 1 2v2 and Log(M) (M I) 1 2(M I)2. GDAE-v2 additionally applies the first-order Taylor s expansion near P for P 1 2 and Log( P) required to calculate r( P) based on (45) so that it can avoid performing additional eigenvalue decompositions during training. We report in Table 7 the computational times for a gradient update iteration for each model with a batch size of 5,000. To check the time for various data dimensionality, we consider n = 2, 3, 4, 7, 10, 14 for P(n) data (with dimensionality d = n(n+1) 2 = 3, 6, 10, 28, 55, 105) generated as explained in Appendix F.1. From the table, we can observe that applying our geometric components can be scaled to high-dimensional P(n) data at a rate slower than the linear rate to d, except for the geometric contractive term of which the computational time increases at a nearly quadratic rate to d when applied. Also note that even if the GDAE-v2 shows a faster computational time when n = 2, 3, it becomes slower than GDAE-v1 as the data dimensionality increases due to the heavy matrix multiplications required in the approximation. We utilize GDAE-v1 for other experiments using P(n) data. Table 7: The computational times (in milliseconds) spent for a gradient update iteration for autoencoders applying different geometric components for P(n) data. n d AE GAE GAE + G. GAE + E. GAE + C. 
GDAE-v1 GDAE-v2 GRCAE 2 3 3.02 5.10 4.66 14.6 40.1 22.1 11.2 45.0 3 6 3.31 7.13 6.28 18.1 94.7 28.8 19.9 92.3 4 10 3.29 8.96 8.85 25.1 213 38.1 38.8 220 7 28 3.32 19.8 19.1 40.4 1111 52.5 82.2 1126 10 55 3.34 39.6 39.1 63.2 4573 79.2 164 4594 14 105 3.73 62.8 62.8 98.6 28949 111 262 28776 D.3 N(3) DATA In training GDAE for diffusion tensor imaging (DTI) data in Section 4.4, the structure of the manifold N(3) (using the Fisher information Riemannian metric) should be dealt with properly. Here, we provide technical details for the modeling and training of the reconstruction function r : N(3) N(3). 9It has empirically performed better to use the vector representation obtained from Log(P) rather than P. Published as a conference paper at ICLR 2023 D.3.1 THE FISHER INFORMATION RIEMANNIAN METRIC Denoting by N(µ, Σ) a normal distribution in Rn with mean µ and covariance Σ, the space of n-dimensional normal distributions N(n) is defined as follows: N(n) = {N(µ, Σ) | µ Rn, Σ P(n)}. (48) N(n) becomes a differentiable manifold with dimension n + 1 2n(n + 1). By using the Fisher information Riemannian metric with local coordinates (µ, Σ) (for Σ, only the loweror upper-triangular is needed part), the inner product of two tangent vectors V = (Vµ, VΣ), W = (Wµ, WΣ) for Vµ, Wµ Rn and VΣ, WΣ S(n) at a point N = (µ, Σ) N(n) is defined as follows: V, W N = V µ Σ 1Wµ + 1 2Tr(Σ 1VΣΣ 1WΣ). (49) It can be easily verified that the Fisher information metric reduces to the affine-invariant metric when the Rn part of the tangent vectors is zero; the geodesic distance between two normal distributions with identical means is then obtained in closed-form from the affine-invariant distance between the two covariance matrices. However, in general, the distance between two normal distributions cannot be obtained in closed-form; a numerical algorithm to find the minimal geodesic according to this metric is proposed in Han & Park (2014), along with some physical insights of adopting the Fisher information Riemannian metric for DTI analysis. D.3.2 DATA CORRUPTION PROCESS Given an input voxel data N = (x, P) N(3), we first sample ϵp N(0, σ2P) and a sixdimensional vector ϵc N(0, σ2I). The input voxel (x, P) N(3) is then corrupted to ( x, P) N(3) as follows: x = x + ϵp, (50) P = P 1 2 Exp ([ϵc]) P 1 2 , (51) where P 1 2 = RS 1 2 R for an eigendecomposition of P = RSR , and the bracket operator [ ] is defined for ϵ = (ϵ11, ϵ12, ϵ13, ϵ22, ϵ23, ϵ33) R6 as ([ϵ])ij = aijϵij for i j and ([ϵ])ij = ([ϵ])ji for i > j, with aij = 2 if i = j and 1 otherwise. Note that this corresponds to an approximation of the data corruption process q( x|x) (for small σ2) discussed in Section 3.2. In the experiments, we have further applied the first-order Taylor s expansion on the matrix exponential in (51). D.3.3 THE RECONSTRUCTION ERROR Denote the reconstruction function by r = (rp, rc), where rp : N(3) R3 corresponds to the location part and rc : N(3) P(3) the value part of the reconstruction function, respectively (we discuss their exact modeling in the later section). Assuming σ2 is small, the reconstruction error between (x, P) N(3) and (rp( x, P), rc( x, P)) N(3) is approximated according to the inner product of the Fisher information metric defined in (49) as follows: dist (rp( x, P), rc( x, P)), (x, P) 2 (52) (rp( x, P) x) P 1(rp( x, P) x) + 1 2 rc( x, P)P 1 2 R for an eigendecomposition of P = RSR . D.3.4 THE RECONSTRUCTION FUNCTION Next, we model the reconstruction function r = (rp, rc) on N(3). 
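As a reference for the metric used above, a minimal sketch of the Fisher-information inner product in (49) is given below; the function name and the convention of passing each tangent vector as a (mean part, symmetric covariance part) pair are illustrative assumptions.

```python
import torch

def fisher_inner_product(Sigma, V, W):
    """Fisher-information inner product at N(mu, Sigma) as in (49):
    <V, W> = V_mu^T Sigma^{-1} W_mu + 0.5 * Tr(Sigma^{-1} V_S Sigma^{-1} W_S),
    where V = (V_mu, V_S) and W = (W_mu, W_S) are tangent vectors."""
    V_mu, V_S = V
    W_mu, W_S = W
    S_inv = torch.linalg.inv(Sigma)
    return V_mu @ S_inv @ W_mu + 0.5 * torch.trace(S_inv @ V_S @ S_inv @ W_S)
```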
By defining a function v : N(3) S(3), the voxel value part rc of the reconstruction function at (x, P) N(3) is modeled as rc(x, P) = P 1 2 Exp (v(x, P)) P 1 2 . (53) In this way, training rc reduces to training unconstrained v (when considering only the loweror upper-triangular part), and we denote by v : N(3) R6 the function that returns the six-dimensional vector representation of the lower- (or upper-) triangular part of the output for v. Published as a conference paper at ICLR 2023 We model rp and v by neural networks with two hidden layers, with the hyperbolic tangent (Tanh) activation function as follows: rp(x, P) = x + W3h2 + b3, (54) v(x, P) = W4h2 + b4, (55) hi = Tanh(Wihi 1 + bi), i = 1, 2, (56) where h1, h2 Rdh denote the hidden variables, and h0 is set to be (x, p) R9 with p R6 as the vector representation of the lower- (or upper-) triangular part of P. Moreover, Wi, bi for i = 1, 2, 3, 4 respectively denote the matrix and vector parameters with sizes defined accordingly as above. We set the dimensionality of the hidden variables dh to 1,000. D.4 POINT CLOUD DATA D.4.1 THE FISHER INFORMATION RIEMANNIAN METRIC FOR POINT CLOUD DATA A statistical manifold framework has been suggested to deal with point cloud data in Lee et al. (2022). Briefly speaking, they deem each point in a point cloud as a sample from a specific parametric probability density model and identify the data to the density. The Fisher information metric for these density models can then serve as a natural Riemannian metric in the point cloud data space. For a point cloud data in RD represented as X = {x1, . . . , xn | xi RD}, by using X itself as the parameter, following parametric probability density model is considered: p(x; X) = 1 2 (x xi)), (57) where K : RD R is a kernel function satisfying R RD K(u)du = 1, and Σ RD D is a symmetric positive-definite bandwidth matrix. In the experiments, we choose the Gaussian kernel function K(u) = 1/ p (2π)D exp( u u/2) with Σ = σ2 p I following Lee et al. (2022). With such a choice of density model, the Fisher information metric G Rn D n D is obtained as Gijkl = Z p(x; X) log p(x; X) Xij log p(x; X) Xkl dx (58) for i, k = 1, . . . , n and j, l = 1, . . . , D, where Gijkl is the ((i 1)D + j, (k 1)D + l) entry of G and Xij is the j-th entry of xi. We approximate the integration in (58) as a finite sum of the integrands (with p(x; X) excluded) over the data points x1, . . . , xn. We refer the reader to Lee et al. (2022) for more details. D.4.2 DATA CORRUPTION PROCESS Since applying the exponential map based on the Riemannian metric in (58) is computationally very expensive, we resort to an approximation for the data corruption process. Given a vector representation of the point cloud data x = (x 1 , . . . , x n ) Rn D, we sample a random vector of the same size ϵ N(0, σ2I), and corrupt the data as x x + G 1ϵ, where G 1 Rn D n D is the inverse of the Riemannian metric G in (58). We further approximate G to be a diagonal matrix by considering only the diagonal elements of (58) so that G 1 is computationally tractable. D.4.3 THE RECONSTRUCTION ERROR For a reconstruction function r : Rn D Rn D, consider a point cloud X = {x1, . . . , xn | xi RD} and its reconstructed version r( X) = Y = {y1, . . . , yn | yi RD}. We define the reconstruction error by slightly modifying the Chamfer distance in Yang et al. (2018) as follows (Lee et al., 2022): Reconstruction error = 1 |X| x X min y Y x y 2 + 1 y Y min x X x y 2, (59) where |X| denotes the number of elements in the set X. 
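A minimal PyTorch sketch of the Chamfer-style reconstruction error in (59) is given below, assuming the two point clouds are passed as (number of points, D) tensors; the function name is illustrative.

```python
import torch

def chamfer_reconstruction_error(X, Y):
    """Symmetric Chamfer-style reconstruction error of (59) between point clouds
    X (|X|, D) and Y (|Y|, D)."""
    d2 = torch.cdist(X, Y) ** 2                        # pairwise squared distances
    return d2.min(dim=1).values.mean() + d2.min(dim=0).values.mean()
```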
Published as a conference paper at ICLR 2023 Note that using the reconstruction error in (59) does not perfectly correspond to our original definition of GDAE in (6), which uses the geodesic distance as the reconstruction error. However, with a slight abuse of notation, we still use the term GDAE for the method minimizing the reconstruction error defined in (59). D.4.4 THE RECONSTRUCTION FUNCTION We use exactly the same point cloud autoencoder with the reconstruction function r : Rn D Rn D (i.e., the composition of the encoder f : Rn D Rd and the decoder g : Rd Rn D) adopted from the Fc Net in Yang et al. (2018). We set the latent space dimensionality d as 512. D.5 A REMARK ON IMPLEMENTATION FOR MANIFOLDS THAT REQUIRE MULTIPLE COORDINATE CHARTS Since the manifolds mainly considered in the experiments, e.g., Sn, P(n), N(n), and the statistical manifold for point cloud data, could be embedded in Euclidean spaces, we have directly used the manifold elements embedded in Euclidean spaces as the form of input and output for r : M M. If this is not the case, e.g., for the abstract manifolds, we can implement the mappings r : Rm Rm represented in local coordinates. When multiple charts {(U1, x1), . . . , (UC, x C)} are required,10 one available approach would be to implement mappings for each chart as separate neural networks and train each mapping by minimizing objective functions weighted by appropriate weight functions. In the case of GDAE, for example, we can solve C optimization problems as follows: M Eq( x|x)[dist(ri( x), x)2]ρ(x)fi(x)dx, i = 1, . . . , C, (60) where C is the number of charts, ri : Rm Rm is the mapping for the i-th chart (this is ri : Ui M represented in local coordinates), and fi : M R 0 is the weight function for the i-th chart satisfying fi(x) = 0 for x / Ui, e.g., determined based on partitions of unity (satisfying that, for all x M, there is a neighborhood of x where all but a finite number of fi(x)s are zero, and PC i=1 fi(x) = 1). The final reconstruction results for x M can be obtained by geometrically averaging (after appropriate coordinate transformations) the outputs ri(x) from each mapping with corresponding weights fi(x). Note that there is no guarantee that ri(xi(Ui)) xi(Ui) (or ri(Ui) Ui) would always hold. That is, ri(x) may be outside of Ui hence ri(x) and x may not be in the same coordinate chart for some x Ui. For small σ2, such a phenomenon (if it happens) would mostly occur at x Ui near the boundary of Ui since ri is trained so that ri(x) is close enough to x, i.e., ri(x) x = O(σ2), according to Theorem 1. We can eliminate any undesirable effect that this phenomenon may have on the optimization of the objective function in (60) or on the weighted average of the final results by choosing a weight function fi that has the value of zero in a sufficiently wide region within Ui that includes the boundary of Ui. As a side effect, this choice would also assign relatively large weights for the inner regions of Ui far from the boundary of Ui hence inducing a larger regularization effect (e.g., denoising or contraction) in those regions. Therefore, this may lead to learning the reconstruction function to direct toward the inner regions of Ui, which helps to prevent the range of ri (or ri) from going outside of xi(Ui) (or Ui). 
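As a rough illustration of the chart-wise objective in (60), the sketch below computes a Monte Carlo estimate of the i-th loss for one batch; the callables r_i, w_i, corrupt, and sq_geodesic_dist are placeholders for the chart-wise reconstruction network, the weight function f_i, the corruption process of Section 3.2, and the squared geodesic distance, respectively, and replacing the integral against ρ(x) by a batch average is an assumption of this sketch.

```python
import torch

def chart_weighted_gdae_loss(r_i, w_i, corrupt, sq_geodesic_dist, x_batch):
    """Monte Carlo estimate of the i-th chart objective in (60) for one batch.

    r_i              : chart-wise reconstruction network in local coordinates of chart i
    w_i              : weight function f_i (e.g., from a partition of unity), zero outside chart i
    corrupt          : corruption process x -> x_tilde of Section 3.2
    sq_geodesic_dist : per-sample squared geodesic distance on M
    x_batch          : (B, m) batch drawn from the data; the batch average stands in
                       for the integral against rho(x)
    """
    x_noisy = corrupt(x_batch)                             # x_tilde ~ q(x_tilde | x)
    per_sample = sq_geodesic_dist(r_i(x_noisy), x_batch)   # (B,) reconstruction errors
    return (w_i(x_batch) * per_sample).mean()
```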
D.6 A REMARK ON THE CASE WHEN THE DATA SPACE RIEMANNIAN METRIC IS NOT KNOWN a priori When the Riemannian metric for data space RD is not known a priori, we can construct the Riemannian metric by resorting to metric learning techniques that either utilize some supervision on the desired distance for a given task or reflect some intuition about the data manifold. For example, in Hauberg et al. (2012), given a set of centers {c1, . . . , c K}, ck RD, k = 1, . . . , K and a set of metrics {G1, . . . , GK}, Gk RD D, k = 1, . . . , K corresponding to each center, a smoothly 10Here Ui is an open subset of M, and xi is a homeomorphism of Ui to an open subset of Rm for i = 1, . . . , C. The collection {U1, . . . , UC} covers M, and for all i and j, the transition map xi x 1 j is smooth. Published as a conference paper at ICLR 2023 varying Riemannian metric G : RD RD D is constructed by interpolating the metrics as follows: k=1 wk(x)Gk, (61) where wk(x) = wk(x) PK i=1 wi(x), wk(x) = exp( 1 2h(x ck) Gk(x ck)), and h > 0 is a bandwidth parameter. Here we can obtain the set of metrics {G1, . . . , GK} from different metric learning methods, which learn task-specific distance metrics in a supervised manner (Kulis et al., 2013). In the unsupervised case, based on the intuition that the shortest path in data space should be near the data manifold, not the region where data are sparse, the Riemannian metric G : RD RD D can be constructed as follows: G(x) = (α h(x) + ϵ) 1 I, (62) where h : RD R>0, h(x) 1 when x is near the data manifold and h(x) 0 otherwise, and α, ϵ > 0 are scalars to control the lower and upper bound of the metric, respectively. (Some methods to construct smoothly varying h(x) are provided in Arvanitidis et al. (2021).) Consider formulating geometrically regularized autoencoders that reflect the Riemannian metrics constructed above. The calculations of the geodesics and exponential maps in these cases usually require numerical solvers (Hauberg et al., 2012). Therefore, efficient training of these autoencoders would require some methods to approximately apply the geometric components, a detailed discussion of which is left for future work. E DETAILS FOR EXPERIMENTS IN THE INTRODUCTION (a) Reference frames considered in the experiments Original data Augmented data (b) An example of original and augmented data in spherical coordinates Figure 9: Additional figures for experiments in the introduction. In (a), red, green, and blue lines represent each reference frame s X-, Y-, and Z-axis, respectively. Note that the reference frames are translated to corresponding spherical coordinate origins for better visualization. For the example provided in the introduction, we train the RCAE and the GRCAE (modeled as described in Appendix D.1) on spherical data sampled from the von Mises-Fisher (v MF) distribution with the concentration parameter of 10. To see if the trained results depend on the choice of coordinates, we consider five different spherical coordinate representations of the data for RCAE. We obtain these representations by representing the data with respect to different choices of ambient space reference frames shown in Figure 9 (a) and converting them to spherical coordinate representations. 
We then train the RCAEs using each representation and compare in the ambient space the reconstruction directions obtained from each RCAE.11 For the GRCAEs using the ambient space representations, we train the models using data represented in five different reference frames (the same as those considered for RCAE) and compare the reconstruction directions from each model after transforming them to an identical reference frame. In Figure 1 (b)-(c), we plot the reconstruction directions for a subsample of the data in Figure 1 (a) and provide a magnified view near the zenith for better visualization. 11To alleviate any topological issues arising from using spherical coordinate representations (x1, x2) which parametrizes the three-dimensional points as (cos(x1), sin(x1) cos(x2), sin(x1) sin(x2)), e.g., the fact that, for points with the x2 value near π or π, nearby points on the sphere can be mapped from two distant regions in spherical coordinates, we augment data obtained by adding or subtracting 2π from the x2 values (as shown in Figure 9 (b)) during training and only use the original data when getting the reconstruction directions. Published as a conference paper at ICLR 2023 F DETAILS FOR SECTION 4.1 F.1 SYNTHETIC DATA GENERATION We use 10,000 data sampled from m mixtures of tangent space Gaussian on P(n) in Section 4.1. For the i-th mixture, we set the mean as µi = Exp( d 2 diag(ei)) P(n), where d = n(n+1) 2 is the dimension of P(n), ei Rn is a standard vector whose i-th element is one, and Exp : S(n) P(n) is the matrix exponential. The covariance of the i-th mixture is represented using an orthonormal basis of Tµi P(n) as Σi = diag(σ1, . . . , σd) Rd d, where σk = 0.1 if k = 1+(i 1)n (i 1)(i 2) 2 (for i = 1, . . . , n) and σk = 0.01 otherwise, and the k-th basis of Tµi P(n) is defined using the bracket operator in (43) as µi [ek] µi S(n) with ek Rd as a standard vector whose k-th element is one. F.2 DEFINITION OF THE ESTIMATION ERROR For a given data set {x1, . . . , x N} represented in local coordinates of P(n), the log-density gradient estimation error (Est. error) is evaluated as follows: Est. error = 1 N PN i=1 log ρg x est (xi) log ρg x (xi) G 1(xi) log ρg x est (xi) log ρg x (xi) , (63) where log ρg x est (xi) and log ρg x (xi) respectively denote the estimated value and the ground truth value for log ρg(x) x at xi. Note that it is usually impossible to calculate the estimation performance based on (63) since the ground truth value of log ρg(x) x is generally not available. Instead, (63) can be reformulated not to require the log ρg(x) x term by using the integration by parts technique presented in Hyv arinen (2005). Ignoring the constant terms that do not depend on the estimated value, (63) can be rewritten as est (xi)G 1(xi) log ρg x est (xi) + 2Γ(xi) + 2 Pm j=1 Pm k=1 xk log ρg xj est (xi)gjk(xi) , (64) where Γ(x) Rm is a vector whose i-th component is given by 1 xi . Note that this formulation turns out to be equivalent to the minimization objective of the R-LSLDG method (Ashizawa et al., 2017). In the case of GDAE and GRCAE, further simplification is available for (64): P k xk log ρg xj est gjk = 1 σ2 Tr r x I holds from (7). Equation (64) can be used in hyperparameter tuning to evaluate the performance of the estimation in place of (63). F.3 TRAINING DAE, RCAE, GDAE, GRCAE, LSLDG, AND R-LSLDG We model the reconstruction functions r : P(n) P(n) for GDAE and GRCAE as explained in Appendix D.2. 
The reconstruction functions r : Rd Rd (d = n(n+1) 2 for P(n)) for DAE and RCAE are modeled by a neural network with two hidden layers, with the hyperbolic tangent (Tanh) activation function as follows: r(x) = x + W3h2 + b3, (65) hi = Tanh(Wihi 1 + bi), i = 1, 2, (66) where h1, h2 Rdh denote the hidden variables, h0 is set to be x Rd, and Wi, bi for i = 1, 2, 3 respectively denote the matrix and vector parameters with sizes defined accordingly as above. We set the dimensionality of the hidden variables dh to 1,000 for both DAE and RCAE. We optimize the parameters θ = {W1, b1, W2, b2, W3, b3} in (46), (47) for (6) and (5) to respectively train GDAE and GRCAE (or those in (65), (66) for (1) and (2) to train DAE and RCAE respectively). We apply the Adam algorithm (Kingma & Ba, 2015) using the Pytorch library (Paszke et al., 2017) and update the parameters for 500,000 iterations. We use the batch size of 1,000 to train RCAE and GRCAE and the batch size of 10,000 to train DAE and GDAE. The learning rate starts at 2.5e-5 and is divided by ten after 250,000 iterations with a weight decay parameter of 1e-12. We have applied the weight initialization scheme of Glorot & Bengio (2010). Dividing the initial W1, b1 by two helped to improve the estimation error for GDAE and GRCAE. Published as a conference paper at ICLR 2023 In discretizing (1), (2), (6), and (5), the expectations with respect to the data-generating probability density ρ(x) are approximated by a finite sum over the N training data points with equal weights 1 N . Given an input x (or x), the expectation with respect to the noise density q( x|x) in (6) for GDAE (or q( x|x) in (1) for DAE) is also approximated by a finite sum over finite samples of x (or x) with equal weights. In training the autoencoders, the noise parameter in the data corruption process or the weighting coefficient for the contractive term (σ2) should be chosen carefully. In this experiment, σ is selected among σ {0.01, 0.025, 0.05} to reduce the modified estimation error in (64), which does not require the true value of log ρg(x) x , on a randomly selected validation data set of sizes 20,000. Since the estimation results from the autoencoders can vary over iterations due to the stochastic nature of the optimization algorithm, instances with the minimum value of (64) during the training process are taken to be the output for each run of autoencoder training. In LSLDG, the derivative of the log-probability density, i.e., log ρ(x) x , is modeled as a weighted sum of some basis functions, with the weights optimized in a least-squares sense. The RLSLDG method estimates the derivative of the log-probability function, i.e., P j gij log ρg(x) xj for i = 1, . . . , m in local coordinates, similar to LSLDG but using basis functions defined on the Riemannian manifold. For both LSLDG and R-LSLDG, we perform crossvalidation for the hyperparameters λ {10 3, 10 2, 10 1.5, 10 1, 10 0.5, 100, 100.5, 101} and σ {10 4, 10 3, 10 2.5, 10 2, 10 1.5, 10 1, 10 0.5, 100} as described in Sasaki et al. (2014) and Ashizawa et al. (2017).12 The number of the basis functions is set to 500. The time spent in training each model is shown in Table 8. The autoencoders (DAE, RCAE, GDAE, and GRCAE) take a much longer time for training when compared to LSLDG and R-LSLDG, which are linear-in-parameter models and have closed-form solutions for the parameters. 
Furthermore, GRCAE (or RCAE) requires more computations than GDAE (or DAE) due to the contractive terms containing the derivative of the reconstruction functions. Compared to DAE and RCAE, GDAE and GRCAE involve heavier computations such as the eigenvalue decomposition (EVD) to reflect the non-Euclidean geometry of the data. Efficient parallel computation for the eigenvalue decomposition has been available by using the torch batch svd repository (MIT License).13 Table 8: The time (in seconds) spent in training each model for P(n) data. n LSLDG DAE RCAE R-LSLDG GDAE GRCAE 2 9 3062 17382 126 20671 43254 3 17 4087 19622 306 22499 67580 4 33 4014 25120 545 33166 116255 F.4 GEOMETRIC SCORE ESTIMATION OF SYNTHETIC HYPERSPHERICAL DATA This section provides the result of the geometric score estimation for the data on hypersphere Sn = {p Rn+1 | p = 1} endowed with the Riemannian metric induced from the Euclidean metric on Rn+1. We use 10,000 data generated by mapping samples from m Gaussian mixtures in Tp Sn via the exponential map Expp : Tp Sn Sn, where p = (1, 0, . . . , 0) Rn+1. For the i-th mixture, we set the mean as µi = 1 2ei Rn with ei Rn as a standard vector whose i-th element is one, and the covariance as Σi = (0.1)2I Rn n so that the ground truth log-probability derivative values are obtainable. We train the DAE, RCAE, GDAE, GRCAE, LSLDG, and R-LSLDG using the data. For the autoencoder models described in Appendix D.1 (for GDAE and GRCAE) and Appendix F.3 (for DAE and RCAE), we use the Adam optimizer (Kingma & Ba, 2015) of the Pytorch library (Paszke et al., 2017) to update the parameters for 500,000 iterations. We use the batch size of 1,000 to train RCAE and GRCAE and the batch size of 10,000 to train DAE and GDAE. From the initialization scheme of Glorot & Bengio (2010), the initial W1, b1 are further divided by 1-5 for the experiments. Other optimization and hyperparameter tuning conditions are the same as those explained in Appendix F.3. 12The definitions of λ and σ can be found in Sasaki et al. (2014) and Ashizawa et al. (2017). 13URL: https://github.com/Kinglittle Q/torch-batch-svd. Published as a conference paper at ICLR 2023 For both LSLDG and R-LSLDG, we perform cross-validation for the hyperparameters λ {10 3, 10 2, 10 1.5, 10 1} and σ {10 2, 10 1.5, 10 1, 10 0.5, 100} as described in Sasaki et al. (2014) and Ashizawa et al. (2017). The number of the basis functions is set to 500. The time spent in training each model is shown in Table 9. Table 9: The time (in seconds) spent in training each model for Sn data. n LSLDG DAE RCAE R-LSLDG GDAE GRCAE 2 33 3002 17165 240 5236 21804 6 111 3472 20369 399 6590 25864 10 203 4112 25019 570 7696 34365 The estimation errors for log ρg x obtained from each model are reported in Table 10. Similar to the results in Section 4.1, the GDAE and GRCAE perform much better than other methods (especially for higher dimensionality), while GRCAE gives the least estimation error in most cases. Table 10: The estimation errors for log ρg x for tangent space Gaussian mixture data on Sn. Bolds represent the best and comparable methods from the t-test with a significance level of 5%. 
n m LSLDG DAE RCAE R-LSLDG GDAE GRCAE 2 2 1.44 (0.07) 4.94 (0.25) 6.43 (2.05) 1.77 (0.09) 1.91 (0.18) 1.74 (0.15) 6 2 644 (10.3) 647 (266) 586 (93.9) 5.83 (0.11) 5.16 (0.16) 4.74 (0.09) 6 6 677 (21.1) 1194 (369) 1175 (259) 22.9 (0.14) 14.2 (1.05) 14.1 (1.27) 10 2 782 (6.45) 630 (107) 802 (187) 14.4 (0.35) 8.52 (0.17) 7.58 (0.07) 10 6 1201 (4.46) 739 (81.1) 1142 (67.4) 42.7 (0.12) 21.3 (3.90) 20.7 (3.57) 10 10 1700 (133) 1380 (54.8) 1655 (82.2) 50.4 (0.27) 39.3 (0.95) 38.2 (1.72) G DETAILS FOR SECTION 4.2 G.1 THE RIEMANNIAN LANGEVIN MONTE CARLO (RLMC) METHOD FOR Sn To apply the RLMC method for Sn, we should appropriately discretize (9). Recall that, from (9), the Brownian motion dx B on an n-dimensional manifold is written as G 1(x) log p det G(x) x + Ψ(x) G 1(x)dw, (67) where dw Rn is the Brownian motion in an n-dimensional vector space and Ψ(x) Rn is a vector whose i-th component is given by Pn j=1 gij xj with gij as the (i, j) element of G 1(x) Rn n. Consider spherical coordinate representations (x1, . . . , xn) which parametrizes the points on Sn as x(x1, . . . , xn) = (x1, . . . , xn+1) Rn+1 with x1 = cos(x1), xi = Qi 1 j=1 sin(xj) cos(xi) for i = 2, . . . , n, and xn+1 = Qn j=1 sin(xj). The Riemannian metric is then obtained as G(x) = diag(1, g22, . . . , gnn) Rn n with gii = Qi 1 j=1 sin2(xj) for i = 2, . . . , n. To formulate the RLMC algorithm for Sn, it is useful to represent the Brownian motion in (67) in the ambient space. Applying the Ito rule, we can obtain such a representation as follows: G 1(x) log p det G(x) x + Ψ(x) 2Ξ(x)dt, (68) x R(n+1) n is the Jacobian of the parametrization x(x) with respect to x and Ξ(x) Rn+1 is a vector whose i-th component is given by Tr( G 1) with 2xi x2 Rn n as the Hessian of xi with respect to x. After a straightforward calculation, the Brownian motion in (68) reduces to 2 xdt + Bdw, (69) Published as a conference paper at ICLR 2023 where B = x G 1(x) R(n+1) n is a matrix whose column vectors form an orthonormal basis of Tx Sn.14 Note here that the drift term ( n 2 xdt) is orthogonal to the tangent space Tx Sn. Each Langevin diffusion step of the RLMC method for Sn can then be performed by xj+1 = Expxj t (I xjx j )wj where xj Sn is the point sampled at the j-th step, s(xj) Txj Sn is the estimated geometric score at xj, and wj Rn+1 is a random vector sampled from N(0, I) with (I xjx j )wj implying the vector obtained by projecting wj onto Txj Sn. Note that (70) can be approximated up to the order of t to the update of xj+1 = Proj xj + t ( 1 t (I xjx j )wj , which straightforwardly considers the Brownian motion in the ambient space in (69). The geometric score in (70) can be estimated for a point x Sn using the reconstruction function r : Sn Sn of the GDAE trained from a given set of training data as σ2 Logx(r(x)), (71) where the logarithm map Logx : Sn Tx Sn, the inverse of the exponential map in (37), is locally defined for an input y Sn near x as follows: Logx(y) = arccos(x y) p 1 (x y)2 (I xx )y. (72) G.2 EXPERIMENTAL DETAILS We consider four synthetic data sets on S2: four blobs, two moons, an s-curve, and circles, which are generated by the following procedure. First, we make two-dimensional data sets by using the python sklearn.datasets package with zero noise level, place them on the plane z = 1, and project them to the spherical surface S2. 
Secondly, we add tangent Gaussian noises with standard deviations of 0.01 for two moons, s-curve, and circles, and 0.05 for four blobs, and project those perturbed data to the sphere by Riemannian projections. The number of training data is 800, that of validation data is 200, and that of test data is 1,000. For autoencoder-based models, we use a fully-connected neural network architecture with 3-512-512-512-512-512-3 layers and ReLU-ReLU-linear-ReLU-ReLU-linear activation functions. For S-Flow, which uses the RealNVP model (Dinh et al., 2016), the depth is eight and the length of the hidden feature is 512. For all cases, the learning rate is 1e-3 and the weight decay parameter is 1e-12. For autoencoder-based models, we search σ ∈ {0.01, 0.025, 0.05}. We use batch gradient descent, and the maximum training epoch is set to 5,000. As a validation loss used to determine the best model during training, we use the modified score estimation error in (64) for autoencoder-based models and the negative log-likelihood for S-Flow. In RLMC sampling of the autoencoder-based models, we need to determine the step size t in (70); the step size is searched over t ∈ {0.00001, 0.00005, 0.0001, 0.0005, 0.001}. We run the experiment five times with different random seeds. In the result table, the averages and standard errors of the best cases are reported, where for each run the best-model-resulting hyperparameters (i.e., σ and the step size for autoencoder-based models) are selected by computing the MMD metrics (Gretton et al., 2012) between sampled points and the validation data sets. We compute the MMD metric with the exponential kernel k(x, y) = exp(−d²_{S²}(x, y)/(2η²)), where d²_{S²}(x, y) is the squared geodesic distance between data x, y ∈ S² and the bandwidth parameter η is defined as the mean of the 1-, 2-, 3-, 4-, and 5-nearest-neighbor geodesic distances over all training data.

14 This can be easily verified from the identity BᵀB = G^{-1/2}(∂x̄/∂x)ᵀ(∂x̄/∂x)G^{-1/2} = G^{-1/2} G G^{-1/2} = I, since G(x) = (∂x̄/∂x)ᵀ(∂x̄/∂x).

H DETAILS FOR SECTION 4.3

H.1 DEFINITION OF THE CLUSTERING TASKS

We define the clusters by merging relevant topics among the 20 groups in the Newsgroup20 data set,15 resulting in religion, computers, politics, automobiles, sports, and a topic in science (one among crypt, electronics, med, and space). We define four clustering tasks by varying the topic in science.

H.2 TRAINING DAE, GDAE, LSLDG, AND R-LSLDG

The reconstruction function for GDAE is modeled as explained in Appendix D.1, and DAE is modeled in the same way as explained in Appendix F.3. We then optimize the parameters θ = {W1, b1, W2, b2, W3, b3} for (6) and (1) to train GDAE and DAE, respectively. We use the Adam optimizer (Kingma & Ba, 2015) of the PyTorch library (Paszke et al., 2017) and update the parameters for 500,000 iterations. The learning rate starts at 2.5e-3 for GDAE and 1e-3 for DAE, and is divided by ten after 250,000 iterations. Other optimization conditions are kept the same as those in Appendix F.3, and the corruption noise standard deviation σ is selected from σ ∈ {0.025, 0.05} to reduce the modified estimation error in (64) on the training data set. For both LSLDG and R-LSLDG, we perform cross-validation for the hyperparameters λ ∈ {10^-3, 10^-2, 10^-1.5, 10^-1} and σ ∈ {10^-1.2, 10^-1, 10^-0.5, 10^0} as described in Sasaki et al. (2014) and Ashizawa et al. (2017). The number of basis functions is set to 300. The time spent in training each model is shown in Table 11.
Table 11: The time (in seconds) spent in training each model for S49 data. LSLDG LSLDG Amb DAE DAE Amb R-LSLDG GDAE 288 166 5006 5015 1435 6108 H.3 CLUSTERING OF COVARIANCE MATRIX DATA In this section we provide a case study of the clustering of covariance matrix data. Following Jayasumana et al. (2015), we calculate covariance matrices from the features of image data and group the images based on the covariance matrices. For this case study, we consider the ETH-80 data set16 with 3,240 images from eight categories (Leibe & Schiele, 2003), COIL-20 data set17 with 1,440 images from twenty categories (Nene et al., 1996a), and COIL-100 data set18 with 7,200 images from a hundred categories (Nene et al., 1996b). In calculating the covariance matrix for each image, we use the feature (from every pixel of the image) consisting of {x, y, I, |Ix|, |Iy|}, where x, y are the horizontal and vertical pixel locations, I is the intensity of the pixel, and Ix and Iy are the horizontal and vertical gradients of the intensity, respectively.19 Hence the data in use are in P(5). We train DAE, GDAE, LSLDG, and R-LSLDG using the covariance matrix data and perform clustering using the trained models similarly as explained in Section 4.3. For the GDAE and DAE models respectively described in Appendix D.2 and Appendix F.3, we use the Adam optimizer (Kingma & Ba, 2015) of the Pytorch library (Paszke et al., 2017) to update the parameters for 400,000 iterations. The learning rate starts at 1e-4 and is divided by ten after 200,000 iterations with a weight decay parameter of 1e-12. The corruption noise standard deviations σ can significantly affect the clustering performance in DAE and GDAE. To choose σ, we use a (randomly chosen) half of the data as a validation set to train the DAE and GDAE models and measure the clustering performance. The value that gives the 15URL: http://qwone.com/ jason/20Newsgroups/. 16URL: https://github.com/Kai-Xuan/ETH-80. 17URL: https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php. 18URL: https://www.cs.columbia.edu/CAVE/software/softlib/coil-100.php. 19Each dimension of the features has been normalized before calculating the covariance matrices. Published as a conference paper at ICLR 2023 best clustering performance among σ {0.025, 0.05, 0.1, 0.2} is chosen, and we use the rest half of the data to train the models and report the clustering performance. Other optimization conditions are kept the same as those explained in Appendix F.3. For both LSLDG and R-LSLDG, we perform cross-validation for the hyperparameters λ {10 3, 10 2, 10 1.5, 10 1} and σ {10 2, 10 1.5, 10 1, 10 0.5, 100} as described in Sasaki et al. (2014) and Ashizawa et al. (2017). The number of the basis functions is set to 100. The clustering is performed using the identical data used to report the results from DAE and GDAE. The time spent in training each model is shown in Table 12. Table 12: The time (in seconds) spent in training each model for P(5) data. Data set LSLDG DAE R-LSLDG GDAE ETH-80 20 1184 63 14227 COIL-20 15 1221 50 13282 COIL-100 23 1178 99 14924 The averaged clustering performance for five runs of each method is presented in Table 13. For the ETH-80 data set, R-LSLDG and GDAE show better performance than others on average but without much significance. In the case of the COIL-20 data set, the clustering results from DAE show the best performance. 
The absence of a significant performance difference between DAE and GDAE in both cases may partially be due to the less peculiarity in the covariance matrices, e.g., having extremely small or large eigenvalues, for the relatively small number of categories. (According to the affine-invariant metric, the geometry of the covariance matrices can become more relevant when the relative eigenvalues of the covariance matrices get bigger or smaller.) When the number of categories gets bigger as in the COIL-100 data set, the results from GDAE show better performance with a large margin compared to DAE. Even though this case study does not show a distinct tendency, clustering using GDAE performs well among the considered methods in terms of the overall adjusted Rand index. Table 13: The adjusted Rand index for the covariance matrix data on P(5). Bolds represent the best and comparable methods from the t-test with a significance level of 5%. Data set LSLDG DAE R-LSLDG GDAE ETH-80 0.318 (0.039) 0.348 (0.016) 0.362 (0.035) 0.361 (0.022) COIL-20 0.170 (0.030) 0.655 (0.012) 0.483 (0.030) 0.616 (0.016) COIL-100 1.46e-3 (2.9e-4) 0.204 (0.009) 0.314 (0.036) 0.380 (0.015) I DETAILS FOR SECTION 4.4 I.1 TRAINING OF THE RECONSTRUCTION FUNCTION We model GDAE for N(3) as explained in Appendix D.3.4. We then optimize the parameters θ = {W1, b1, W2, b2, W3, b3, W4, b4} for (6) to train GDAE. We model DAE as explained in Appendix F.3 with d = 9, and the parameters in (65), (66) are optimized for (1). We use the Adam optimizer (Kingma & Ba, 2015) of the Pytorch library (Paszke et al., 2017) and update the parameters for 40,000 iterations. The learning rate starts at 1e-4 and is divided by 100 after 20,000 iterations. In discretizing (6), the expectation with respect to the data-generating probability density ρ(x) and the noise density q( x|x) are dealt with in the same way as for the case studies in Section 4.1 (explained in Appendix F.3), and we use the approximated reconstruction error in (52). We use the corruption noise standard deviation σ = 0.05 in training both DAE and GDAE. I.2 A GDAE-BASED DTI FILTERING ALGORITHM Based on the discussions in Section 4.4, we summarize a DTI filtering algorithm based on the GDAE in Algorithm 1. At each iteration, the input points are updated toward the reconstructed points of GDAE along the geodesic with a predefined step size ϵ. Assuming small step sizes, we approximate the update along the geodesic on N(3) by treating the updates of the location (in R3) and the value (in Published as a conference paper at ICLR 2023 P(3)) separately. The update of the voxel values then reduces to update along the geodesic according to the affine-invariant Riemannian metric on P(3) (Fletcher & Joshi, 2007), as expressed in line 5 of Algorithm 1. The functions Exp Q : S(3) P(3) and Log Q : P(3) S(3) (for Q P(3)) are defined as follows: Exp Q(A) = Q 1 2 Exp Q 1 2 Q 1 2 , (73) Log Q(P) = Q 1 2 Log Q 1 2 Q 1 2 , (74) where Q 1 2 = RS 1 2 R and Q 1 2 R for an eigendecomposition of Q = RSR , and the functions Exp( ) and Log( ) respectively denote the matrix exponential and logarithm of which the closed-forms are provided in Appendix D.2.1. When the iteration is terminated, we assign the final voxel values to their original locations. Algorithm 1 DTI Filtering Algorithm Using GDAE Given: Voxels of a DTI datum (xi, Pi) N(3), i = 1, . . . , N. Input: Reconstruction function r : N(3) N(3) of the GDAE trained from the given set of voxels, step size ϵ, the number of iterations Niter. 
Initialize: (yi,1, Qi,1) = (xi, Pi) for i = 1, . . . , N. Iteration: 1: for j = 1, . . . , Niter, i = 1, . . . , N do 2: Compute the reconstruction point (yi, Qi) = r(yi,j, Qi,j). 3: Shift the current point according to step size ϵ: 4: yi,j+1 yi,j + ϵ(yi yi,j), 5: Qi,j+1 Exp Qi,j ϵ Log Qi,j(Qi) (Qi,j+1 Qi holds when ϵ = 1). 6: end for 7: Assign Pi,f = Qi,Niter+1 for i = 1, . . . , N. Output: Filtered voxels (xi, Pi,f) N(3), i = 1, . . . , N. I.3 DETAILS FOR N(2) FILTERING EXPERIMENTS In this experiment, we consider a synthetic N(2) data set obtained by gathering the position (in R2)- value (in P(2)) pairs at 1,600 grid points (i.e., N = 1, 600) of a P(2) field on R2. For the P(2) field for clean data, we gather four distinct smoothly-varying P(2) fields as depicted in Figure 5. For the noisy data, we corrupt each value in P(2) as explained in (43) with noise standard deviations (or noise levels) in {0.02, 0.05, 0.1, 0.2}. We model the neural networks for both DAE and GDAE as explained in Appendix F.3 and Appendix D.3.4 with appropriate modifications to deal with data in N(2), respectively, and train the reconstruction functions. We use the Adam optimizer (Kingma & Ba, 2015) and update the parameters for 20,000 iterations. The learning rate starts at 1e-4 and is divided by 100 after 10,000 iterations. The corruption noise standard deviation is chosen among σ {0.01, 0.025, 0.05, 0.1} to show the best modified estimation error in (64). For the trained models, we apply Algorithm 1 (with appropriate modifications to deal with N(2) data) with Niter 300 and the step size ϵ {0.01, 0.03, 0.1, 0.3, 1.0} chosen to achieve the minimum averaged square distances (according to the affine-invariant metric) to the clean data. For the MVKR method in Banerjee et al. (2015), the number of considered nearest neighbors k {10, 20, 30, 40, 100} and the kernel bandwidth parameter σ {1, 10, 20, 30, 50} are chosen to achieve the minimum averaged square distances (according to the affine-invariant metric) to the clean data. We report the R-squared score to compare the filtering results from each method in Table 4, and the scores are calculated as follows: R-squared score = 1 P i dist(Pi,f, Pi,true)2 P i dist( P, Pi,true)2 , (75) Published as a conference paper at ICLR 2023 where dist(P1, P2) is the distance between P1, P2 P(2) obtained according to the affine-invariant metric, Pi,f P(2) is the i-th filtered value, Pi,true P(2) is the i-th value of the clean data, and P P(2) is the intrinsic mean of {P1,true, . . . , PN,true} (Moakher, 2005). I.4 PREPROCESSING DIFFUSION TENSOR IMAGING (DTI) DATA Data used in the preparation of the experiments performed in Section 4.4 were obtained from the Alzheimer s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu).20 Among the data set, we use the DTI data for a randomly chosen healthy subject. FSL library (Jenkinson et al., 2012) is utilized to fit diffusion tensors from raw diffusion-weighted image data.21 The brain extraction tool (BET), eddy current correction function (eddy correct), a linear transformation tool to match the brain template (flirt), and dtifit module are used in the preprocessing of the DTI data. The region of interest (ROI) is chosen as an axial slice that intersects with the corpus callosum region of the brain. J DETAILS FOR SECTION 4.5 Overall experimental settings have been set similarly to those in Lee et al. (2022). 
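Referring back to Algorithm 1 in Appendix I.2, the following is a minimal PyTorch sketch of a single filtering update for one voxel, using Exp_Q and Log_Q from (73)-(74); the helper names are illustrative, and the reconstruction function r is assumed to return the location and value parts separately.

```python
import torch

def _sym_apply(M, fn):
    """Apply a scalar function to the eigenvalues of a symmetric matrix M."""
    d, R = torch.linalg.eigh(M)
    return R @ torch.diag(fn(d)) @ R.T

def exp_Q(Q, A):
    """Exp_Q(A) = Q^{1/2} Exp(Q^{-1/2} A Q^{-1/2}) Q^{1/2}, as in (73)."""
    Q_sqrt, Q_inv_sqrt = _sym_apply(Q, torch.sqrt), _sym_apply(Q, torch.rsqrt)
    return Q_sqrt @ _sym_apply(Q_inv_sqrt @ A @ Q_inv_sqrt, torch.exp) @ Q_sqrt

def log_Q(Q, P):
    """Log_Q(P) = Q^{1/2} Log(Q^{-1/2} P Q^{-1/2}) Q^{1/2}, as in (74)."""
    Q_sqrt, Q_inv_sqrt = _sym_apply(Q, torch.sqrt), _sym_apply(Q, torch.rsqrt)
    return Q_sqrt @ _sym_apply(Q_inv_sqrt @ P @ Q_inv_sqrt, torch.log) @ Q_sqrt

def dti_filter_step(y, Q, r, eps):
    """One iteration of Algorithm 1 for a single voxel (y, Q): move toward the
    GDAE reconstruction r(y, Q) along the (approximate) geodesic with step size eps."""
    y_hat, Q_hat = r(y, Q)                     # reconstruction point (line 2)
    y_new = y + eps * (y_hat - y)              # location update (line 4)
    Q_new = exp_Q(Q, eps * log_Q(Q, Q_hat))    # value update along the geodesic (line 5)
    return y_new, Q_new
```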
We add noise to each point x ∈ R³ in the point clouds of the data sets (ShapeNet, ModelNet10, and ModelNet40) according to x ← x + m·v, where v ∈ S² ⊂ R³ is uniformly sampled on the unit sphere and m is sampled from a zero-mean Gaussian distribution with different levels of standard deviation (0.01, 0.05, 0.1, and 0.2 of the diagonal length of the point cloud bounding box), as done in Lee et al. (2021a). For the parameters of the kernel function in (57), we use Σ = σp² I with σp = 0.8 MED, where MED denotes the median of the distances between the points in the point cloud and their nearest points. The mean values of the MEDs of each data set are 0.0320, 0.0364, 0.0442, and 0.579 for the noise levels 0.01, 0.05, 0.1, and 0.2, respectively. To train the autoencoders defined in Appendix D.4, we use Adam (Kingma & Ba, 2015) with a learning rate of 1e-4, betas of [0.9, 0.999], a weight decay of 1e-6, and a batch size of 16; the total number of epochs is 500. The noise parameters σ for DAE and GDAE are set to 0.1 and 1e-4, respectively. The coefficient for the regularization term (defined in (13) of Lee et al. (2022)) is set to λ = 8000 for both the AE + R. and GDAE + R. methods in Table 5.

20 The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD).

21 The license information is available at URL: https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/Licence.
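For reference, a minimal PyTorch sketch of the point cloud corruption process described in Appendix J above (x ← x + m·v, with v uniform on the unit sphere and m Gaussian with standard deviation proportional to the bounding-box diagonal); the function name is illustrative and not part of the released code.

```python
import torch

def corrupt_point_cloud(X, noise_level):
    """Perturb each point x <- x + m * v, with v uniform on the unit sphere and
    m ~ N(0, (noise_level * bbox_diag)^2), following the process in Appendix J."""
    bbox_diag = (X.max(dim=0).values - X.min(dim=0).values).norm()
    v = torch.nn.functional.normalize(torch.randn_like(X), dim=1)   # uniform directions
    m = noise_level * bbox_diag * torch.randn(X.shape[0], 1)
    return X + m * v
```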