# Isometric Manifold Learning Using Hierarchical Flow

Ziqi Pan, Jianfu Zhang*, Li Niu*, Liqing Zhang
MoE Key Lab of Artificial Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
{panziqi_ai, c.sis, ustcnewly}@sjtu.edu.cn, zhang-lq@cs.sjtu.edu.cn
*Corresponding authors. Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

We propose the Hierarchical Flow (HF) model, constrained by isometric regularizations, for manifold learning; it combines manifold learning goals such as dimensionality reduction, inference, sampling, projection, and density estimation into one unified framework. Our proposed HF model is regularized not only to produce embeddings that preserve the geometric structure of the manifold, but also to project samples onto the manifold in a manner conforming to the rigorous definition of projection. Theoretical guarantees are provided for our HF model to satisfy these two desired properties. In order to detect the real dimensionality of the manifold, we also propose a two-stage dimensionality reduction algorithm, which is time-efficient thanks to the hierarchical architecture design of our HF model. Experimental results justify our theoretical analysis, demonstrate the superiority of our dimensionality reduction algorithm in terms of training time cost, and verify the effect of the aforementioned properties in improving performance on downstream tasks such as anomaly detection.

1 Introduction

Manifold learning (Pless and Souvenir 2009), a fundamental research topic in machine learning, is the task of recovering low-dimensional embeddings from nonlinear high-dimensional data that preserve the geometric structure of data manifolds (Fefferman, Mitter, and Narayanan 2016). Applications of manifold learning include nonlinear dimensionality reduction (Gisbrecht and Hammer 2015), denoising (Buades, Coll, and Morel 2005), anomaly detection (Chandola, Banerjee, and Kumar 2009), etc. The research on manifold learning has a long history, and manifold learning methods range from classical algorithms such as LLE (Roweis 2000), Locality Preserving Projections (He and Niyogi 2003), Semidefinite Embedding (Weinberger and Saul 2006), and Isomap (Tenenbaum, Silva, and Langford 2000), to modern deep learning methods based on neural network frameworks such as Generative Adversarial Nets (GANs) (Goodfellow et al. 2014), Variational Autoencoders (VAEs) (Kingma and Welling 2013), and Normalizing Flow (NF) (Dinh, Krueger, and Bengio 2014) based generative models. Classical spectral methods (e.g., Isomap and LLE) are commonly non-parametric, relying on the Multidimensional Scaling (MDS) algorithm (Carroll and Arabie 1998) and eigendecomposition. However, classical manifold learning algorithms are hard to apply to modern datasets, since these datasets are typically complex and high-dimensional (Verleysen and François 2005). Since non-parametric methods do not produce an embedding function, the inference of embeddings cannot be generalized to unseen samples. To tackle this problem, parametric methods such as SNE (Van Der Maaten 2009), t-SNE (Van der Maaten and Hinton 2008), DrLIM (Hadsell, Chopra, and LeCun 2006), and others (Gong et al. 2006; Chui and Mhaskar 2018; Mishne et al. 2019) have been proposed, which can perform out-of-sample extension.
There are also methods combining classical algorithms and neural networks for manifold learning (Pai et al. 2019; Geng et al. 2020). In terms of neural network methods (Basri and Jacobs 2017), GANs and VAEs, as generative models, can also be regarded as manifold learning frameworks, which are able to efficiently sample from the manifold by using ancestral sampling. However, GANs and VAEs lack the ability to estimate densities on the manifold. Compared to GANs and VAEs, NFs induce an explicit density on the manifold based on the change-of-variable formula and are intrinsically nonlinear bijective mappings between the embeddings and the manifold; hence, many NF-based manifold learning models have been proposed, including NFs on a prescribed manifold (Gemici, Rezende, and Mohamed 2016; Bose et al. 2020) such as tori and spheres (Rezende et al. 2020), and the M-flow (Brehmer and Cranmer 2020) for learning a manifold topologically equivalent to Euclidean space.

In this paper, we propose a Hierarchical Flow (HF) model by extending the M-flow into a hierarchical architecture constrained by isometric regularizations for manifold learning. Given a set of training samples from a manifold $\mathcal{M} \subset \mathbb{R}^D$ whose ground-truth dimensionality $K^*$ is unknown, our proposed HF model combines the following manifold learning goals in one unified framework:

- Dimensionality reduction. The model is able to detect the ground-truth dimensionality $K^*$ of $\mathcal{M}$ efficiently.
- Sampling. The model can efficiently sample from $\mathcal{M}$.
- Inference. Given a manifold sample $x \in \mathcal{M}$, the model can infer the embedding $u(x) \in \mathbb{R}^{K^*}$ of $x$ that is faithful to the geometric structure of $\mathcal{M}$. Specifically, $\forall x_1, x_2 \in \mathcal{M}$, the Euclidean distance between $u(x_1)$ and $u(x_2)$ is equal to the manifold distance (i.e., the geodesic length) between $x_1$ and $x_2$ (Petersen 2006), which we refer to as the distance preserving property.
- Projection. Given an off-manifold sample $x \notin \mathcal{M}$ from the ambient space $\mathbb{R}^D$ close to $\mathcal{M}$, the model can project $x$ onto $\mathcal{M}$ in a manner that conforms to the rigorous definition of projection, namely the projection $\tilde{x} \in \mathcal{M}$ of $x$ satisfies $\tilde{x} = \arg\min_{x' \in \mathcal{M}} d(x', x)$, and the distance from $x$ to $\mathcal{M}$ is $d(\tilde{x}, x)$, where $d: \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ is the Euclidean distance in $\mathbb{R}^D$. We refer to such a property of the model as the rigorous projection property.
- Density estimation. Given $x \in \mathcal{M}$, the model is capable of estimating the explicit density at $x$.

Existing manifold learning methods differ in their ability to achieve the above goals; see Tbl. 1 for a comparison. In terms of projection, although existing methods such as M-flow are able to project given samples onto the manifold, the projections are not guaranteed to conform to the rigorous definition mentioned above, which could affect the performance on downstream tasks such as anomaly detection, as we show in Sec. 4 and the supplementary. For our HF model, we provide theoretical guarantees on the rigorous projection property of the generator constrained by our proposed isometric regularizations, and show through experimental results that our HF model performs better on downstream tasks such as anomaly detection thanks to the rigorous projection property. Moreover, thanks to the hierarchical architecture of our HF model, we can develop a two-stage dimensionality reduction algorithm to detect the ground-truth dimensionality of the manifold in a time-efficient manner. Compared with a brute-force search algorithm, our proposed algorithm is greatly superior in terms of training time.
Our contributions are threefold:

- We propose the Hierarchical Flow (HF) model for manifold learning, which is constrained by isometric regularizations so as to satisfy the properties of distance preserving and rigorous projection with theoretical guarantees. Our proposed HF model combines the manifold learning goals mentioned above in one unified framework.
- We propose a two-stage dimensionality reduction algorithm based on the hierarchical architecture design of our HF model, which allows us to detect the ground-truth dimensionality of the manifold in a time-efficient manner.
- Experimental results justify our theoretical analysis, and show the time-efficiency of our dimensionality reduction algorithm and the superiority of our proposed HF model on downstream tasks such as anomaly detection, thanks to the aforementioned desired properties.

The remainder of this paper is organized as follows. Sec. 2 provides details of our manifold learning method, including our HF model, the properties of distance preserving and rigorous projection with theoretical guarantees, and the training objectives and algorithms for manifold learning and dimensionality reduction. Sec. 3 relates our method to prior work. We provide experimental results in Sec. 4 to justify our theoretical analysis, and conclude our paper in Sec. 5.

2 Methodology

In this section, we introduce the details of our proposed HF model and its training objectives and algorithms.

2.1 Hierarchical Generator

Given $K \leq D$, we propose to employ a generator $g: \mathbb{R}^K \to \mathbb{R}^D$ with a hierarchical architecture to characterize the data manifold $\mathcal{M}$ using $K$-dimensional embeddings. Specifically, $g: \mathbb{R}^K \to \mathbb{R}^D$ comprised of $L$ layers can be represented as
$$g = \underbrace{f_L \circ p_L}_{m_L} \circ \underbrace{f_{L-1} \circ p_{L-1}}_{m_{L-1}} \circ \cdots \circ \underbrace{f_1 \circ p_1}_{m_1}, \quad (1)$$
where $m_i = f_i \circ p_i: \mathbb{R}^{K_{i-1}} \to \mathbb{R}^{K_i}$ is the $i$-th layer of $g$, comprised of a flow module $f_i: \mathbb{R}^{K_i} \to \mathbb{R}^{K_i}$ and a padding module $p_i: \mathbb{R}^{K_{i-1}} \to \mathbb{R}^{K_i}$, with $K = K_0 \leq K_1 \leq \cdots \leq K_L = D$. The flow module $f_i: \mathbb{R}^{K_i} \to \mathbb{R}^{K_i}$ is a nonlinear bijective mapping with an explicit inverse $f_i^{-1}$ and a tractable Jacobian determinant, comprised of several coupling layers (Dinh, Krueger, and Bengio 2014) and $1 \times 1$ convolution layers. In terms of the padding module $p_i: \mathbb{R}^{K_{i-1}} \to \mathbb{R}^{K_i}$, given an input feature $x \in \mathbb{R}^{K_{i-1}}$, $p_i$ pads $K_i - K_{i-1}$ zeros at the end of $x$:
$$p_i(x) = [x; \mathbf{0}] \in \mathbb{R}^{K_i}, \quad (2)$$
and the pseudo-inverse $p_i^\dagger: \mathbb{R}^{K_i} \to \mathbb{R}^{K_{i-1}}$, also referred to as a projection module, drops the last $K_i - K_{i-1}$ elements of an input feature $x$ (Brehmer and Cranmer 2020):
$$p_i^\dagger(x) = \left[x_1, x_2, \ldots, x_{K_{i-1}}\right] \in \mathbb{R}^{K_{i-1}}, \quad (3)$$
where $x_j$ is the $j$-th element of $x$. Hence the pseudo-inverse of $g$, i.e., $g^\dagger: \mathbb{R}^D \to \mathbb{R}^K$, is given as
$$g^\dagger = p_1^\dagger \circ f_1^{-1} \circ \cdots \circ p_{L-1}^\dagger \circ f_{L-1}^{-1} \circ p_L^\dagger \circ f_L^{-1}, \quad (4)$$
which is essentially an encoder that can be used to infer the embedding $u(x) \in \mathbb{R}^K$ of a sample $x \in \mathbb{R}^D$. For the details of the implementation of $g$, see the supplementary.

We first introduce mathematical notations used throughout the paper, and then provide the theoretical guarantees for the generator $g$ to satisfy the properties of distance preserving and rigorous projection. Given $x \in \mathbb{R}^D$, we denote
$$\left[u_i(x); v_i(x)\right] \triangleq f_i^{-1}\!\left(u_{i+1}(x)\right), \quad i \in [L], \quad (5)$$
where $u_i(x) \in \mathbb{R}^{K_{i-1}}$ and $v_i(x) \in \mathbb{R}^{K_i - K_{i-1}}$, and therefore $p_i^\dagger\!\left(\left[u_i(x); v_i(x)\right]\right) = u_i(x)$. We use $u_{L+1}(x) \triangleq x$, and denote the embedding $u(x)$ of $x$ as $u(x) \triangleq u_1(x)$. We also use $[N] \triangleq \{1, 2, \ldots, N\}$ for a positive integer $N$. From the above, running inference for $x \in \mathbb{R}^D$ using $g^\dagger$ produces $\{u_i(x)\}_{i=1}^{L}$ and $\{v_i(x)\}_{i=1}^{L}$.
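As a concrete reference for the zero-padding module $p_i$ in Eq. (2) and its pseudo-inverse $p_i^\dagger$ in Eq. (3), the following is a minimal PyTorch sketch; it is an illustration rather than the authors' released implementation, and the function names `pad` and `project` are placeholders.

```python
# A minimal sketch (not the authors' implementation) of the padding module p_i
# in Eq. (2) and its pseudo-inverse (projection module) p_i^† in Eq. (3).
import torch

def pad(x: torch.Tensor, k_out: int) -> torch.Tensor:
    """p_i: append k_out - k_in zeros to the last dimension of x."""
    k_in = x.shape[-1]
    zeros = x.new_zeros(x.shape[:-1] + (k_out - k_in,))
    return torch.cat([x, zeros], dim=-1)

def project(x: torch.Tensor, k_out: int) -> torch.Tensor:
    """p_i^†: keep the first k_out coordinates and drop the rest."""
    return x[..., :k_out]

# Round trip p_i^† ∘ p_i = identity on R^{K_{i-1}}:
u = torch.randn(8, 4)                       # a batch of K_{i-1} = 4 dimensional features
assert torch.allclose(project(pad(u, 10), 4), u)
```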
Given a manifold $\mathcal{M} \subset \mathbb{R}^D$, for $i \in [L]$ and some $r > 0$, we denote
$$U_i(\mathcal{M}) \triangleq \left\{u_i(x) \mid x \in \mathcal{M}\right\}, \quad (6)$$
$$V_i(r) \triangleq \left\{v \in \mathbb{R}^{K_i - K_{i-1}} \mid \|v\|_2 \leq r\right\}, \quad (7)$$
$$A_i(\mathcal{M}; r) \triangleq \left\{f_i([u; v]) \mid u \in A_{i-1}(\mathcal{M}; r), v \in V_i(r)\right\}, \quad (8)$$
$$\tilde{A}_i(\mathcal{M}; r) \triangleq \left\{f_i([u; \mathbf{0}]) \mid u \in A_{i-1}(\mathcal{M}; r)\right\}, \quad (9)$$
where we use $A_0(\mathcal{M}; r) \triangleq U_1(\mathcal{M})$. For brevity of description, we also stipulate that $f_i(u) \triangleq f_i([u; \mathbf{0}]) = f_i \circ p_i(u)$ if the given $u$ is a $K_{i-1}$-dimensional vector. We also present an intuitive understanding of $U_i(\mathcal{M})$, $V_i(r)$, $A_i(\mathcal{M}; r)$ and $\tilde{A}_i(\mathcal{M}; r)$ in Fig. 2.

The Distance Preserving Property. We formally state the conditions for the generator $g$ to satisfy the distance preserving property in the following Prop. 1 (see the supplementary for the proof), where an orthonormal Jacobian is encouraged for the generator $g$. Recent works also show that an isometric autoencoder preserves the geometric structure of the data manifold (Gropp, Atzmon, and Lipman 2020; Yonghyeon et al. 2021). We present an intuitive illustration of the distance preserving property in Fig. 1.

Proposition 1 (Distance preserving property). Assume that $U_1(\mathcal{M})$ is a convex set, and $v_i(x) = \mathbf{0}$ for $i \in [L], x \in \mathcal{M}$. The length of the geodesic between $g(u_1)$ and $g(u_2)$ on $\mathcal{M}$ for $u_1, u_2 \in U_1(\mathcal{M})$ equals $\|u_1 - u_2\|_2$, if $J_g(u)$ is orthonormal for $u \in U_1(\mathcal{M})$, where $J_g(u)$ is the Jacobian matrix of the generator $g$ at $u$.

The Rigorous Projection Property. We also formally state the conditions for each flow module $f_i$ to satisfy the rigorous projection property in the following Prop. 2 (see the supplementary for the proof). To intuitively see how satisfying the conditions in Prop. 2 leads to a flow module $f_i$ that satisfies the rigorous projection property, refer to the supplementary. In terms of the rigorous projection property of the whole generator $g$, we directly learn from Prop. 2 that a generator $g$ comprised of only one layer (i.e., $g = f_1 \circ p_1$) satisfies the rigorous projection property, which is formally stated in Cor. 1. For $g$ comprised of more than one layer, the rigorous projection property of $g$ is approximately satisfied over an ambient space close to $\mathcal{M}$; see Rem. 1 and Fig. 2.

Proposition 2 (Rigorous projection property). Assume that $v_i(x) = \mathbf{0}$ for $x \in \mathcal{M}$. Given $x \in A_i(\mathcal{M}; r)$, by denoting $[\alpha; \beta] \triangleq f_i^{-1}(x)$ where $\alpha \in \mathbb{R}^{K_{i-1}}$ and $\beta \in \mathbb{R}^{K_i - K_{i-1}}$, the projection of $x$ onto $\tilde{A}_i(\mathcal{M}; r)$ is $\tilde{x} \triangleq f_i(\alpha)$, and the distance from $x$ to $\tilde{A}_i(\mathcal{M}; r)$, $d_E(x, \tilde{x})$, equals $\|\beta\|_2$, if $\forall u \in A_{i-1}(\mathcal{M}; r), v \in V_i(r)$,
$$I([u; v]) \triangleq J([u; v])^\top J([u; v]) \in \mathbb{R}^{K_i \times K_i} \quad (10)$$
is a diagonal matrix and $I_{jj}([u; v])$ equals $1$ for $K_{i-1} + 1 \leq j \leq K_i$, where $J([u; v]) \in \mathbb{R}^{K_i \times K_i}$ is the Jacobian of $f_i$ at $[u; v]$, $I_{jj}$ is the $(j, j)$-th element of $I$, and $d_E(\cdot, \cdot)$ is the Euclidean distance in $\mathbb{R}^{K_i}$.

Corollary 1. Given $g = f_1 \circ p_1$, assume that $v_1(x) = \mathbf{0}$ for $x \in \mathcal{M}$. Given $x \in A_1(\mathcal{M}; r)$, the projection of $x$ onto $\mathcal{M}$ is $\tilde{x} \triangleq g(u_1(x)) = g \circ g^\dagger(x)$, and the distance from $x$ to $\mathcal{M}$, $d_E(x, \tilde{x})$, is $\|v_1(x)\|_2$, if $f_1$ satisfies the conditions for the rigorous projection property in Prop. 2.

Figure 1: An intuitive illustration of the distance preserving property using the exemplar Swiss roll manifold (also see Fig. 3 of (Tenenbaum, Silva, and Langford 2000)). $U$ is obtained by a generator $g$ whose Jacobian $J_g$ is orthonormal. The correspondence between $\mathcal{M}$ and $U$ is indicated by color, and the orthonormality of $J_g$ can be seen from the coordinate lines of $\mathcal{M}$. The geodesic between $x$ and $y$ is plotted as the red line on $\mathcal{M}$, and the geodesic length $d_{\mathcal{M}}(x, y)$ is equal to the Euclidean distance $d_E(g^\dagger(x), g^\dagger(y))$ between the corresponding embeddings $g^\dagger(x)$ and $g^\dagger(y)$.
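To make the orthonormality condition in Prop. 1 concrete, the following toy check verifies numerically that an isometric map satisfies $J_g(u)^\top J_g(u) = I_K$; the generator below is an assumed arc-length parametrized circle used purely for illustration, not the HF model.

```python
# A toy numerical check (with an assumed generator, not the paper's model) of the
# orthonormality condition in Prop. 1: an isometric g satisfies J_g(u)^T J_g(u) = I_K.
import torch
from torch.autograd.functional import jacobian

def g(u: torch.Tensor) -> torch.Tensor:
    # Arc-length parametrized circle: an isometric embedding of R^1 into R^2.
    return torch.stack([torch.cos(u[0]), torch.sin(u[0])])

u = torch.tensor([0.3])
J = jacobian(g, u)            # J_g(u), shape (2, 1)
print(J.T @ J)                # ≈ [[1.]]: orthonormal columns, so distances are preserved
```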
Remark 1. Given $g$ comprised of $L \geq 2$ layers, assume that $v_i(x) = \mathbf{0}$ for $x \in \mathcal{M}, i \in [L]$. Given $x \in A_L(\mathcal{M}; r)$, let $\tilde{x}_i \triangleq f_i(u_i(x)), i \in [L]$, and $\tilde{x} \triangleq g(u_1(x)) = g \circ g^\dagger(x)$. Assume that $r$ is small, such that $A_L(\mathcal{M}; r)$ is an ambient space close to $\mathcal{M}$. Given $f_i$ that satisfies the rigorous projection property for $i \in [L]$, we have:

- The projection of $u_{i+1}(x)$ onto $\tilde{A}_i(\mathcal{M}; r)$ is $\tilde{x}_i$, and the distance from $u_{i+1}(x)$ to $\tilde{A}_i(\mathcal{M}; r)$, $d_E(u_{i+1}(x), \tilde{x}_i)$, equals $\|v_i(x)\|_2$, where $i \in [L]$.
- The projection of $x$ onto $\mathcal{M}$ is $\tilde{x}$, and the distance from $x$ to $\mathcal{M}$, $d_E(x, \tilde{x})$, is approximately $\sqrt{\sum_{i=1}^{L} \|v_i(x)\|_2^2}$.

See Fig. 2 and the supplementary for a detailed analysis.

Figure 2: An intuitive illustration of the mathematical notations and the rigorous projection for a hierarchical generator $g = f_2 \circ p_2 \circ f_1 \circ p_1: \mathbb{R}^1 \to \mathbb{R}^3$. The projections of $x \in A_2(\mathcal{M}; r)$ onto $\tilde{A}_2(\mathcal{M}; r)$ and $\mathcal{M}$ are $\tilde{x}_2 \triangleq f_2 \circ p_2 \circ p_2^\dagger \circ f_2^{-1}(x) = f_2([u_2(x); \mathbf{0}])$ and $\tilde{x} \triangleq g \circ g^\dagger(x)$, respectively. Assume that $\tilde{x}$ is in a local region around $\tilde{x}_2$, i.e., it can be regarded that $\tilde{x} \in T_{\tilde{x}_2}\tilde{A}_2(\mathcal{M}; r)$ and $d_{\tilde{A}_2(\mathcal{M}; r)}(\tilde{x}, \tilde{x}_2) \approx d_E(\tilde{x}, \tilde{x}_2)$, where $T_{\tilde{x}_2}\tilde{A}_2(\mathcal{M}; r)$ represents the tangent space of $\tilde{A}_2(\mathcal{M}; r)$ at $\tilde{x}_2$. Because $f_2$ satisfies the rigorous projection property, from Prop. 2 and the rigorous definition of projection, we have $d_E(x, \tilde{x}_2) = \|v_2(x)\|_2$ and $\mathrm{vec}(x, \tilde{x}_2) \perp T_{\tilde{x}_2}\tilde{A}_2(\mathcal{M}; r)$, where $\mathrm{vec}(x, \tilde{x}_2)$ denotes the vector from $x$ to $\tilde{x}_2$. Hence, according to the Pythagorean theorem, we have $d_E^2(x, \tilde{x}) = d_E^2(x, \tilde{x}_2) + d_E^2(\tilde{x}_2, \tilde{x}) \approx \|v_2(x)\|_2^2 + d_{\tilde{A}_2(\mathcal{M}; r)}^2(\tilde{x}, \tilde{x}_2)$. Because $J_{f_2}$ is constrained to be orthonormal over $A_1(\mathcal{M}; r)$, we have $d_{\tilde{A}_2(\mathcal{M}; r)}(\tilde{x}, \tilde{x}_2) = d_E(u_2(\tilde{x}), u_2(x))$. Given that $f_1$ satisfies the rigorous projection property, we know that $u_2(\tilde{x})$ is the projection of $u_2(x) \in A_1(\mathcal{M}; r)$ onto $\tilde{A}_1(\mathcal{M}; r)$ according to Prop. 2, since $u_1(x) = u_1(\tilde{x})$ follows from $\tilde{x} = g \circ g^\dagger(x)$. Hence $d_E(u_2(\tilde{x}), u_2(x)) = \|v_1(x)\|_2$, and $d_E(x, \tilde{x}) \approx \sqrt{\|v_1(x)\|_2^2 + \|v_2(x)\|_2^2}$.

2.2 Isometric Regularizations

Based on the aforementioned Prop. 1-2, we propose two isometric regularizations for the generator $g$ to satisfy the distance preserving property and the rigorous projection property, respectively. Both isometric regularizations are realized by pursuing an orthonormal Jacobian.

Orthonormal Regularization on $A_{i-1}(\mathcal{M}; r)$. To satisfy the distance preserving property for $g$, according to Prop. 1 and the following Prop. 3 (see the supplementary for the proof), we can constrain the Jacobian $J_{f_i}$ of $f_i$ to be orthonormal over $A_{i-1}(\mathcal{M}; r)$ for $i \in [L]$, namely
$$J_{f_i}(u) \text{ is orthonormal}, \quad \forall u \in A_{i-1}(\mathcal{M}; r). \quad (11)$$
In practice, to ensure the expressiveness of the generator, we propose to constrain $J_{f_i}$ to be orthonormal up to a global scalar $\zeta > 0$. To achieve this, we propose to utilize a sample-efficient regularization on $J_{f_i}$ based on the recently proposed Linearized Transpose (LT) technique (Pan, Niu, and Zhang 2022), which is used to efficiently estimate the spectral norm of the Jacobian. Specifically, we exploit the property of $f_i$ that $\det J_{f_i}$ is tractable. Because $\det J_{f_i}$ equals the product of all singular values of $J_{f_i}$, all singular values of $J_{f_i}$ are equal, and hence $J_{f_i}$ is orthonormal up to a scalar, if we constrain the maximum singular value (i.e., the spectral norm) of $J_{f_i}$ to equal $\sqrt[K_i]{\det J_{f_i}}$. Therefore, in practice, we estimate the spectral norm of $J_{f_i}$ by employing LT, and then constrain $\mathrm{LT}_{f_i}(u) = \sqrt[K_i]{\det J_{f_i}(u)} = \zeta$, where $\mathrm{LT}_{f_i}(u)$ is the estimated spectral norm of $J_{f_i}$ at $u \in A_{i-1}(\mathcal{M}; r)$.

Proposition 3. $J_g(z)$ is orthonormal for $z \in U_1(\mathcal{M})$, if $J_{f_i}(u)$ is orthonormal for $u \in A_{i-1}(\mathcal{M}; r)$, $i \in [L]$.
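For intuition only, "orthonormal up to a global scalar $\zeta$" means $J_{f_i}^\top J_{f_i} = \zeta^2 I$. The sketch below penalizes this condition directly with a full autograd Jacobian; it is a toy stand-in for illustration (the toy map `f` is an assumption), since the paper instead relies on the sample-efficient LT estimator, which is not reproduced here.

```python
# A naive illustration (assuming small K_i; the paper instead uses the LT estimator
# for efficiency) of constraining J_{f_i} to be orthonormal up to a global scalar zeta:
# penalize the deviation of J^T J from zeta^2 * I.
import torch
from torch.autograd.functional import jacobian

def orthonormality_penalty(f, u: torch.Tensor, zeta: float = 1.0) -> torch.Tensor:
    J = jacobian(f, u, create_graph=True)            # full Jacobian of f at u
    gram = J.T @ J                                   # J^T J, shape (K_i, K_i)
    target = (zeta ** 2) * torch.eye(u.numel())
    return ((gram - target) ** 2).sum()              # zero iff all singular values equal zeta

# Example with a toy 2D flow-like map (an assumption for illustration):
f = lambda u: torch.stack([u[0] + 0.1 * torch.tanh(u[1]), u[1]])
print(orthonormality_penalty(f, torch.randn(2)))
```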
Proxy Regularization on $A_{i-1}(\mathcal{M}; r) \times V_i(r)$. Based on Prop. 2 and Rem. 1, to satisfy the rigorous projection property for $g$, one approach is to trivially constrain $J_{f_i}$ to be orthonormal over $A_{i-1}(\mathcal{M}; r) \times V_i(r)$ by using LT. Although we can verify that this is a special case satisfying the conditions in Prop. 2, the flexibility of $f_i$ is greatly reduced (see the supplementary for a detailed discussion). To tackle this problem, we observe that for a given $z \in A_{i-1}(\mathcal{M}; r) \times V_i(r)$, by using an auxiliary module $s_i: \mathbb{R}^{K_i} \to \mathbb{R}^{K_i}$, the Jacobian of the composed mapping
$$\phi_i^z(\xi) \triangleq f_i\!\left(z - s_i(z) + s_i(z) \odot \xi\right), \quad \xi \in \mathbb{R}^{K_i}, \quad (12)$$
at $\xi = \mathbf{1}$, i.e., $J_{\phi_i^z}(\mathbf{1}) = \frac{\partial \phi_i^z}{\partial \xi}\big|_{\xi = \mathbf{1}}$, equals
$$J_{f_i}(z)\,\mathrm{diag}\!\left(\omega_1, \omega_2, \ldots, \omega_{K_i}\right), \quad (13)$$
where $\odot$ denotes elementwise multiplication, $\mathbf{1} \in \mathbb{R}^{K_i}$ is a vector with all elements being $1$, and $\omega_j$ denotes the $j$-th element of $s_i(z)$. From Prop. 4, we know that the following objective leads to an $f_i$ satisfying the conditions for the rigorous projection in Prop. 2 (see the supplementary for the proof):
$$\text{constrain all singular values of } J_{\phi_i^z}(\mathbf{1}) \text{ to be equal, s.t. } \omega_j = \sqrt[K_{i-1}]{\det J_{f_i}} \text{ for } K_{i-1} + 1 \leq j \leq K_i. \quad (14)$$
In practice, we constrain $\mathrm{LT}_{\phi_i^z}(\mathbf{1}) = \sqrt[K_i]{\det J_{\phi_i^z}(\mathbf{1})}$ by using LT for $z \in A_{i-1}(\mathcal{M}; r) \times V_i(r)$, which is a sample-efficient regularization while giving flexibility to $f_i$. We refer to such a method as the proxy regularization for $f_i$ to satisfy the rigorous projection property, and we refer to the module $s_i$ as the singular values predictor for $f_i$.

Proposition 4. All singular values of $J_{\phi_i^z}(\mathbf{1})$ equal $a > 0$ $\Rightarrow$ $J_{f_i}(z)^\top J_{f_i}(z) = \mathrm{diag}\!\left(\frac{a^2}{\omega_1^2}, \ldots, \frac{a^2}{\omega_{K_i}^2}\right)$, and $\left\{\frac{a}{\omega_j}\right\}_{j=1}^{K_i}$ are the singular values of $J_{f_i}(z)$.

2.3 Training Objectives and Algorithms

We present the training objectives and algorithms for manifold learning given a training dataset $X \triangleq \{x^{(i)} \in \mathcal{M}\}_{i=1}^{N}$ from a manifold $\mathcal{M} \subset \mathbb{R}^D$. We adopt the training strategy of M-flow (Brehmer and Cranmer 2020), which separates the manifold and density training. We first introduce the training objectives for the manifold and density training respectively, and then introduce the training algorithms, including a two-stage dimensionality reduction algorithm for detecting the ground-truth dimensionality $K^*$ of $\mathcal{M}$.

Manifold Training Objectives. During the manifold training phase, the following loss function $\mathcal{L}_m$ is minimized:
$$\mathcal{L}_m = \lambda_{dist} \mathcal{L}_{dist} + \lambda_{proj} \mathcal{L}_{proj} + \lambda_v \mathcal{L}_v, \quad (15)$$
where the $\lambda$'s are balancing hyper-parameters. We introduce each loss function in detail as follows.

$\mathcal{L}_{dist}$ is the isometric loss function used to satisfy the distance preserving property for the generator $g$, which is realized by constraining all the Jacobians of $f_i$ to be orthonormal up to $\zeta$, as introduced in Sec. 2.2. Specifically,
$$\mathcal{L}_{dist} = \sum_{i=1}^{L} \mathbb{E}_{u_i \sim A_{i-1}(\mathcal{M}; r)} \left\{ \left( \mathrm{LT}_{f_i}(u_i) - \zeta \right)^2 + \left( \sqrt[K_i]{\det J_{f_i}(u_i)} - \zeta \right)^2 \right\}.$$
Note that $\mathcal{L}_{dist}$ involves sampling $u_i$ from $A_{i-1}(\mathcal{M}; r)$. For $i = 1$, since $A_0(\mathcal{M}; r) = U_1(\mathcal{M})$, we sample $u_1$ from the $K$-dimensional normal distribution $\mathcal{N}(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are the running mean and running variance of $u_1(x)$ for $x \in X$. For $2 \leq i \leq L$, according to Eq. (8), we sample $u_{i-1}$ from $A_{i-2}(\mathcal{M}; r)$ and $v_{i-1}$ from $V_{i-1}(r)$, respectively, and then obtain $u_i = f_{i-1}([u_{i-1}; v_{i-1}])$.

$\mathcal{L}_{proj}$ is the isometric loss function used to satisfy the rigorous projection property for the generator $g$, which is realized by the proxy regularization over $A_{i-1}(\mathcal{M}; r) \times V_i(r)$, as introduced in Sec. 2.2. Specifically,
$$\mathcal{L}_{proj} = \sum_{i=1}^{L} \mathbb{E}_{z \sim A_{i-1}(\mathcal{M}; r) \times V_i(r)} \left( \mathrm{LT}_{\phi_i^z}(\mathbf{1}) - \sqrt[K_i]{\det J_{\phi_i^z}(\mathbf{1})} \right)^2,$$
where we sample $z$ from $A_{i-1}(\mathcal{M}; r) \times V_i(r)$ by using the ancestral sampling method involved in $\mathcal{L}_{dist}$. Note that from Eq. (13) we have $\det J_{\phi_i^z}(\mathbf{1}) = \det J_{f_i}(z) \prod_{j=1}^{K_i} \omega_j$.
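The identity in Eq. (13), which underlies $\mathcal{L}_{proj}$, can be checked numerically with autograd. In the sketch below, `f` and `s` are toy stand-ins for the flow module $f_i$ and the singular values predictor $s_i$; both are assumptions made purely for illustration.

```python
# A small autograd check (toy f and s assumed for illustration) of Eq. (13):
# the Jacobian of phi_z(xi) = f(z - s(z) + s(z) * xi) at xi = 1 equals J_f(z) @ diag(s(z)).
import torch
from torch.autograd.functional import jacobian

f = lambda x: torch.stack([x[0] + torch.sin(x[1]), 2.0 * x[1]])   # toy flow module
s = lambda x: torch.nn.functional.softplus(x) + 0.1               # toy singular values predictor

z = torch.randn(2)
phi = lambda xi: f(z - s(z) + s(z) * xi)

lhs = jacobian(phi, torch.ones(2))                 # J_{phi_z}(1)
rhs = jacobian(f, z) @ torch.diag(s(z))            # J_f(z) diag(omega_1, ..., omega_K)
print(torch.allclose(lhs, rhs, atol=1e-5))         # True
```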
$\mathcal{L}_v$ is the reconstruction loss function, which constrains $g$ to satisfy $g \circ g^\dagger(x) = x$ for $x \in X$. Instead of adopting the commonly used reconstruction loss
$$\frac{1}{N} \sum_{i=1}^{N} \left\| g \circ g^\dagger\!\left(x^{(i)}\right) - x^{(i)} \right\|_2, \quad (18)$$
we can use the following loss function defined on the off-manifold components produced by the encoding process:
$$\mathcal{L}_v = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{L} \left\| v_i\!\left(x^{(n)}\right) \right\|_2^2. \quad (19)$$
Since $\mathcal{L}_v$ depends only on $\{v_i(x)\}_{i=1}^{L}$ to compute the loss for a given $x \in X$, compared with Eq. (18), $\mathcal{L}_v$ requires only the encoding process (i.e., $g^\dagger(x)$) and no decoding process (i.e., $g(u(x))$, where $u(x)$ is the embedding of $x$), which reduces the computational cost and hence speeds up training. To see how $\mathcal{L}_v$ is related to a reconstruction loss for $X$, recall from Rem. 1 that the distance from $x$ to the generated manifold $\widetilde{\mathcal{M}}$ of $g$ is about $\sqrt{\sum_{i=1}^{L} \|v_i(x)\|_2^2}$. Therefore, minimizing $\mathcal{L}_v$ brings $X$ and $\widetilde{\mathcal{M}}$ closer, which is equivalent to encouraging reconstruction on $X$. The overall loss function $\mathcal{L}_m$ can be regarded as a Regularized Autoencoder (RAE) (Ghosh et al. 2020) objective, where $\mathcal{L}_v$ corresponds to the reconstruction loss of the RAE, and the other losses (i.e., $\mathcal{L}_{dist}$ and $\mathcal{L}_{proj}$) are regularizers for the decoder.

Density Training Objectives. In order to perform density estimation on $\mathcal{M}$, similar to M-flow (Brehmer and Cranmer 2020), we employ a density estimator $h: \mathbb{R}^K \to \mathbb{R}^K$ which is trained to estimate the probability density on $\mathcal{M}$ after the manifold training phase is done. Given a sample $x \in \mathcal{M}$, its density $p_{\mathcal{M}}(x)$ on $\mathcal{M}$ is explicitly derived according to the following change-of-variable formula (Dinh, Krueger, and Bengio 2014; Kobyzev, Prince, and Brubaker 2020):
$$p_{\mathcal{M}}(x) = p_U(u(x)) \left[\det\!\left(J_g(u(x))^\top J_g(u(x))\right)\right]^{-\frac{1}{2}} = p_U(u(x)), \quad (21)$$
where $u(x) = g^\dagger(x)$ is the embedding of $x$, and the second equality in Eq. (21) holds because $J_g$ is orthonormal (see Prop. 1). The density estimator $h$ is further involved in the following formula:
$$p_U(u) = p_{\tilde{U}}\!\left(h^{-1}(u)\right) \left|\det J_h\!\left(h^{-1}(u)\right)\right|^{-1}, \quad (22)$$
where $h$ is a flow module (Brehmer and Cranmer 2020), i.e., a bijection with an explicit inverse $h^{-1}$ and a tractable Jacobian determinant $\det J_h$, and the prior distribution $p_{\tilde{U}}$ is the standard normal distribution. Hence, given $u(x)$, the density $p_U(u(x))$ is tractable, and therefore $p_{\mathcal{M}}(x) = p_U(u(x))$ can be explicitly estimated. In terms of the training objective, we use maximum likelihood training, and the loss function is
$$\mathcal{L}_d = -\frac{1}{N} \sum_{i=1}^{N} \log p_U\!\left(u\!\left(x^{(i)}\right)\right). \quad (23)$$
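The following sketch traces the density evaluation of Eqs. (21)-(22) for a single embedding, with a toy affine flow standing in for the density estimator $h$; the toy scale and shift are assumptions for illustration, and the actual $h$ is a learned flow module.

```python
# A sketch of the density evaluation in Eqs. (21)-(22), with a toy affine flow
# standing in for the density estimator h (an assumption for illustration).
import torch

K = 2
a = torch.tensor([1.5, 0.7])       # toy scale of h
b = torch.tensor([0.2, -0.1])      # toy shift of h

def h_inv(u):                      # h(u~) = a * u~ + b  =>  h^{-1}(u) = (u - b) / a
    return (u - b) / a

def log_p_manifold(u):             # Eq. (21): p_M(x) = p_U(u(x)) since J_g is orthonormal
    u_tilde = h_inv(u)
    # standard-normal prior p_U~ on the base space
    log_prior = -0.5 * (u_tilde ** 2).sum() - 0.5 * K * torch.log(torch.tensor(2 * torch.pi))
    # Eq. (22): p_U(u) = p_U~(h^{-1}(u)) |det J_h(h^{-1}(u))|^{-1}
    log_det_Jh = torch.log(a.abs()).sum()
    return log_prior - log_det_Jh

u_x = torch.tensor([0.3, -1.2])    # embedding u(x) = g^dagger(x) of a manifold sample x
print(log_p_manifold(u_x))         # averaging -log_p_manifold over X gives L_d in Eq. (23)
```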
Algorithms. We present the algorithms for manifold learning using our proposed HF model. As mentioned above, the training process consists of two phases, namely the manifold phase and the density phase. During the manifold phase, we detect the ground-truth dimensionality $K^*$ of $\mathcal{M}$ by utilizing our proposed two-stage dimensionality reduction algorithm; see Alg. 1. Specifically, for the first stage, we set an embedding dimensionality $K$ that is much lower than $D$ but higher than $K^*$, and then train the first-stage generator $g_1: \mathbb{R}^K \to \mathbb{R}^D$ using $\mathcal{L}_m$ in Eq. (15). For the second stage, given a reconstruction error threshold $t$, we use the Binary Search Algorithm (BSA) to search for the minimum dimensionality $\tilde{K}$ of the second-stage generator $g_2: \mathbb{R}^{\tilde{K}} \to \mathbb{R}^K$ such that the reconstruction error on $X$ of $\tilde{g} = g_1 \circ g_2$ is not larger than $t$, where $g_1$ is not involved in the second-stage training, and $g_2$ is trained on the embedding set $\{g_1^\dagger(x) \mid x \in X\}$.

Algorithm 1: Two-stage Dimensionality Reduction
Require: Training dataset $X = \{x^{(i)}\}_{i=1}^{N} \subset \mathbb{R}^D$, dimensionality $K$ of the first stage, threshold $t$
/* Stage 1 */
1: Initialize a generator $g_1: \mathbb{R}^K \to \mathbb{R}^D$
2: Train $g_1$ using $\mathcal{L}_m$, then obtain $U = \{u(x^{(i)})\}_{i=1}^{N} \subset \mathbb{R}^K$, where $u(x) = g_1^\dagger(x)$ is the embedding of $x$
/* Stage 2 */
3: Initialize $K_1 \leftarrow 0$, $K_2 \leftarrow K$, $\tilde{K} \leftarrow \lfloor \frac{K_1 + K_2}{2} \rfloor$, $\tilde{g} \leftarrow g_1$
4: while $K_1 < K_2$ do
5:   Initialize another generator $g_2: \mathbb{R}^{\tilde{K}} \to \mathbb{R}^{K_2}$
6:   Train $g_2$ on $U$ using $\mathcal{L}_m$, then obtain $\hat{g} \triangleq g_1 \circ g_2$
7:   Estimate the mean reconstruction error $E = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{g} \circ \hat{g}^\dagger(x^{(i)}) - x^{(i)} \right\|_2$  (24)
8:   If $E \leq t$, update $K_2 \leftarrow \tilde{K}$, $g_1 \leftarrow \hat{g}$, $\tilde{g} \leftarrow \hat{g}$, $U \leftarrow \{g_2^\dagger(u) \mid u \in U\}$; otherwise update $K_1 \leftarrow \tilde{K}$
9:   Update $\tilde{K} \leftarrow \lfloor \frac{K_1 + K_2}{2} \rfloor$
10: end while
11: Obtain the detected manifold dimensionality $\tilde{K}$ as well as the final generator $\tilde{g}: \mathbb{R}^{\tilde{K}} \to \mathbb{R}^D$

After the manifold phase training is completed, the density estimator $h$ is trained on the embedding set $\{\tilde{g}^\dagger(x) \mid x \in X\}$ using $\mathcal{L}_d$. See the supplementary for the overall training algorithm.

3 Relation to Prior Work

Generative Manifold Learning. Our proposed HF model is related to generative models for manifold learning (Gropp, Atzmon, and Lipman 2020; Beitler, Sosnovik, and Smeulders 2021; Teng and Choromanska 2019; Lou et al. 2020; Caterini et al. 2021; Silvestri, Roos, and Ambrogioni 2022); see the supplementary for a discussion. Our proposed HF model can be regarded as a multi-layer extension of M-flow, and is similar to the Pseudo Invertible Encoders (PIEs) (Beitler, Sosnovik, and Smeulders 2021) regarding the hierarchical architecture. The hierarchical architecture allows us to detect the ground-truth dimensionality of the data manifold in a time-efficient manner by utilizing Alg. 1. The Isometric Autoencoders (IAEs) (Gropp, Atzmon, and Lipman 2020) propose an isometric loss for manifold learning, and we also propose isometric regularizations that constrain our HF model to satisfy the properties of distance preserving and rigorous projection with theoretical guarantees. Moreover, thanks to the rigorous projection property, we can replace the classical reconstruction loss based on the encoding-decoding process with $\mathcal{L}_v$, which involves only the encoding process, reducing the computational cost and hence speeding up training. We compare a variety of models in their ability to achieve the manifold learning goals in Tbl. 1. As we can see, most methods lack the ability to satisfy the aforementioned two properties, which may affect the performance on downstream tasks such as anomaly detection, as we show in the experimental results.

Table 1: Comparison between different models (GAN, VAE, IAE, PIE, M-flow, and HF) in their ability to achieve a variety of manifold learning goals (sampling, inference, projection, and density estimation). Regarding inference, ✓ (resp., ✗) means that the model is able to infer embeddings that satisfy (resp., are not guaranteed to satisfy) the distance preserving property. Regarding projection, ✓ (resp., ✗) means that the model is able to project samples onto the manifold in a manner that satisfies (resp., is not guaranteed to satisfy) the rigorous projection property.

Dimensionality Reduction. The dimensionality of the embeddings for most manifold learning methods is prescribed, and is not guaranteed to be equal to the ground-truth dimensionality of the manifold. For real-world datasets such as human faces (Liu et al. 2015; Karras, Laine, and Aila 2019), the real dimensionality $K^*$ of the manifold may never be known.
In the existing literature (Brehmer and Cranmer 2020), $K^*$ is typically obtained by a search process based on the criterion that a drop in performance (e.g., an increased reconstruction error) is expected when the model manifold dimensionality becomes smaller than $K^*$. Therefore, a naive approach to detect $K^*$ is to train a model for each possible value of the manifold dimensionality. For our proposed Alg. 1, the search is performed in the second stage over $\tilde{K}$. Compared to a brute-force search of $\tilde{K}$, the time complexity of the binary search algorithm is $O(\log K)$, which is much superior to the $O(K)$ of brute-force search given that $K$ is large. On the other hand, our theoretical analysis guarantees that the composed generator $\hat{g} = g_1 \circ g_2$ of the first- and second-stage generators $g_1, g_2$ still satisfies the desired properties for manifold learning. Given that $D$ is very large, although $g_1$ is a nonlinear mapping over high-dimensional spaces with a high training time cost, $g_1$ is trained only once in the first stage. Although the search process in the second stage involves training $g_2$ multiple times, each training run can be completed quickly since $g_2$ operates in low-dimensional spaces when $K$ is small. Hence, combining the training of the first and second stages results in a time-efficient algorithm for dimensionality reduction. See the supplementary for experiments.
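The stage-2 search of Alg. 1 can be summarized by the following simplified sketch; `train_and_eval_error` is a hypothetical callback standing in for training $g_2$ at a candidate dimensionality and returning the mean reconstruction error of Eq. (24), and the re-composition of generators and embeddings in Alg. 1 is omitted here.

```python
# A compact sketch of the stage-2 binary search in Alg. 1. `train_and_eval_error`
# is a hypothetical callback that trains a second-stage generator at the candidate
# dimensionality and returns the mean reconstruction error of the composed generator.
def detect_dimensionality(K: int, t: float, train_and_eval_error) -> int:
    """Return the smallest dimensionality whose reconstruction error is <= t,
    using O(log K) trainings instead of the O(K) of a brute-force sweep."""
    lo, hi = 0, K                       # lo: infeasible so far, hi: feasible (stage-1 dim)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if train_and_eval_error(mid) <= t:
            hi = mid                    # mid is enough: shrink the upper bound
        else:
            lo = mid                    # mid is too small: raise the lower bound
    return hi

# Toy usage: pretend the manifold needs at least 3 dimensions to reconstruct well.
print(detect_dimensionality(K=64, t=1e-3,
                            train_and_eval_error=lambda k: 0.0 if k >= 3 else 1.0))  # -> 3
```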
4 Experiments

We provide experimental results on a synthetic manifold for intuitive illustration in this section, and leave more results on natural image datasets to the supplementary due to space limitations. All the implementation details for our experiments can also be found in the supplementary.

For intuitive illustration, we use a 1-dimensional manifold $\mathcal{M} \triangleq \{(\cos\theta, \sin\theta) \mid \theta \in [\frac{\pi}{6}, \,\cdot\,]\}$ (i.e., a circular arc) residing in the 2-dimensional Euclidean space. We draw $N = 1{,}000$ samples from a Gaussian density $\mathcal{N}(\frac{\pi}{2}, 1)$ that is restricted to $\mathcal{M}$, obtaining a training dataset $X \triangleq \{x^{(i)} \sim \mathcal{N}(\frac{\pi}{2}, 1) \mid x^{(i)} \in \mathcal{M}\}_{i=1}^{N}$, which we use to train different models. We only show qualitative results, as in Fig. 3, due to space limitations, and present all quantitative results in the supplementary.

Figure 3: Qualitative results of different models trained on $X$. For each model, the recovered manifold is depicted using the thick black solid line, and we can intuitively observe that all the models perfectly recover the manifold curve $\mathcal{M}$. In the ambient space of $\mathcal{M}$, the contour lines with respect to distances to $\mathcal{M}$ are plotted using the thin grey dotted lines, namely samples on the same contour line have equal distances to $\mathcal{M}$. We use $d = \alpha$ to denote a contour line with distance $\alpha$ to $\mathcal{M}$. We also plot the coordinate charts of the flow module $f_1: \mathbb{R}^2 \to \mathbb{R}^2$ of the generator $g = f_1 \circ p_1$ for M-flow (in Fig. 3(a)) and HF (in Fig. 3(c)), where both M-flow and HF use a one-layer generator $g$. Specifically, the two embedding dimensions $u$ and $v$ of $f_1$ are represented in green and blue, respectively, namely samples on the same green (resp., blue) line correspond to the same $u$ (resp., $v$). Hence for M-flow and HF, given an off-manifold sample $x$, the projection $\tilde{x}'$ of $x$ onto $\mathcal{M}$ inferred by the model is given by the coordinate line of $f_1$ connecting $x$ and the inferred projection $\tilde{x}'$, which is denoted by the dashed arrow line. The ground-truth projection $\tilde{x}$ of $x$ onto $\mathcal{M}$ is given by the solid arrow line, which is the straight line connecting $x$ and the origin $(0, 0)$, since the radial direction is perpendicular to the tangent space of the arc $\mathcal{M}$. In terms of IAE, because the decoder $g: \mathbb{R}^1 \to \mathbb{R}^2$ does not induce a coordinate chart like M-flow and HF, the projection $\tilde{x}'$ of a given sample $x$ onto $\mathcal{M}$ inferred by the model is obtained by an encoding-decoding process, i.e., $\tilde{x}' = g \circ e(x)$, where $e$ is the encoder. In Fig. 3(b), we show the projections of $\{x_i\}_{i=1}^{4}$, where the ground-truth and the inferred projections of $x_i$ onto $\mathcal{M}$ are $\tilde{x}_i$ and $\tilde{x}'_i$, respectively.

Distance Preserving. In order to evaluate M-flow and HF intuitively, in Fig. 3(a) and Fig. 3(c), we refer to $x, y \in \mathcal{M}$ that are intersections of two adjacent green coordinate lines with $\mathcal{M}$ as a pair. Note that we depict coordinate charts for M-flow and HF such that $|u(x) - u(y)|$ is constant for any pair $x, y$. Therefore, the distance preserving property of M-flow and HF can be intuitively evaluated by observing whether the manifold distance on $\mathcal{M}$ (i.e., the arc length) between $x$ and $y$ is constant for every pair $x, y$. From Fig. 3(a), we learn that M-flow is not guaranteed to satisfy the distance preserving property, since $d_{\mathcal{M}}(x_1, y_1) \neq d_{\mathcal{M}}(x_2, y_2)$. However, from Fig. 3(c) we learn that our HF satisfies this property, because we can intuitively observe that $d_{\mathcal{M}}(x_1, y_1) = d_{\mathcal{M}}(x_2, y_2)$ for any pairs $x_1, y_1$ and $x_2, y_2$. In terms of IAE, the distance preserving property is not reflected in Fig. 3(b); see the supplementary for quantitative evaluations.

Rigorous Projection. Given an off-manifold sample $x$, we visualize the difference between M-flow and HF in terms of projecting $x$ onto $\mathcal{M}$. By comparing Fig. 3(a) with Fig. 3(c), we can see that the projection $\tilde{x}'$ inferred by M-flow deviates from the ground-truth projection $\tilde{x}$, whereas our HF gives the correct projection $\tilde{x}' = \tilde{x}$. Moreover, because $x$ is on the blue coordinate line $v = 0.5$, as we visualize in Fig. 3(a) and Fig. 3(c), we know that the distance from $x$ to the projection $\tilde{x}'$ given by the model is $0.5$. On the other hand, $x$ is on the contour line $d = 1.5$ (resp., $d = 0.5$) for M-flow (resp., our HF), which means that M-flow (resp., our HF) gives a biased (resp., the correct) distance from $x$ to its inferred projection $\tilde{x}'$. From Fig. 3(b), we also learn that the projections given by IAE are biased. Hence both M-flow and IAE are not guaranteed to satisfy the rigorous projection property. As visualized in Fig. 3(c), the green coordinate lines are radial while the blue coordinate lines coincide with the contour lines, which suggests that HF satisfies the rigorous projection property. The dissatisfaction of the rigorous projection property affects the performance of the model on anomaly detection based on the inferred projection distance; see the supplementary.
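For reference, the ground-truth projection geometry used to judge the models on this synthetic arc is purely radial. The snippet below illustrates that evaluation geometry (it is not any model's projection, and it is valid only when the radial foot of $x$ falls within the arc).

```python
# The evaluation geometry for the synthetic arc (not any model's projection):
# the ground-truth projection of an off-manifold x onto the unit circle is radial,
# x~ = x / ||x||, and the ground-truth distance to M is | ||x|| - 1 |.
# (This assumes the radial foot of x lies within the angular range of the arc.)
import torch

def ground_truth_projection(x: torch.Tensor):
    r = x.norm()
    return x / r, (r - 1.0).abs()

x = torch.tensor([0.0, 1.5])                    # an off-manifold sample above the arc
proj, dist = ground_truth_projection(x)
print(proj, dist)                               # tensor([0., 1.]) tensor(0.5000)
```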
5 Conclusion

In this paper, we propose the Hierarchical Flow (HF) model constrained by isometric regularizations to combine several manifold learning goals in one unified framework. The HF model is theoretically guaranteed to satisfy the properties of distance preserving and rigorous projection, which helps in manifold learning and downstream tasks. We also propose a time-efficient two-stage algorithm for dimensionality reduction based on the hierarchical architecture of our HF model. Experimental results justify our theoretical analysis and verify the effectiveness of our proposed method.

Acknowledgements

The work was supported by the National Science Foundation of China (62076162), and the Shanghai Municipal Science and Technology Major/Key Project, China (2021SHZDZX0102, 20511100300).

References

Basri, R.; and Jacobs, D. 2017. Efficient representation of low-dimensional manifolds using deep networks. In ICLR.
Beitler, J. J.; Sosnovik, I.; and Smeulders, A. 2021. PIE: Pseudo-Invertible Encoder. arXiv preprint arXiv:2111.00619.
Bose, J.; Smofsky, A.; Liao, R.; Panangaden, P.; and Hamilton, W. 2020. Latent variable modelling with hyperbolic normalizing flows. In ICML.
Brehmer, J.; and Cranmer, K. 2020. Flows for simultaneous manifold learning and density estimation. In NeurIPS.
Buades, A.; Coll, B.; and Morel, J.-M. 2005. A review of image denoising algorithms, with a new one. Multiscale Modeling & Simulation.
Carroll, J. D.; and Arabie, P. 1998. Multidimensional scaling. Measurement, Judgment and Decision Making.
Caterini, A. L.; Loaiza-Ganem, G.; Pleiss, G.; and Cunningham, J. P. 2021. Rectangular flows for manifold learning. In NeurIPS.
Chandola, V.; Banerjee, A.; and Kumar, V. 2009. Anomaly detection: A survey. ACM Computing Surveys.
Chui, C. K.; and Mhaskar, H. N. 2018. Deep nets for local manifold learning. Frontiers in Applied Mathematics and Statistics.
Dinh, L.; Krueger, D.; and Bengio, Y. 2014. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
Fefferman, C.; Mitter, S.; and Narayanan, H. 2016. Testing the manifold hypothesis. Journal of the American Mathematical Society.
Gemici, M. C.; Rezende, D.; and Mohamed, S. 2016. Normalizing flows on Riemannian manifolds. arXiv preprint arXiv:1611.02304.
Geng, C.; Wang, J.; Chen, L.; Bao, W.; Chu, C.; and Gao, Z. 2020. Uniform interpolation constrained geodesic learning on data manifold. arXiv preprint arXiv:2002.04829.
Ghosh, P.; Sajjadi, M. S.; Vergari, A.; Black, M.; and Schölkopf, B. 2020. From variational to deterministic autoencoders. In ICLR.
Gisbrecht, A.; and Hammer, B. 2015. Data visualization by nonlinear dimensionality reduction. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.
Gong, H.; Pan, C.; Yang, Q.; Lu, H.; and Ma, S. 2006. Neural Network Modeling of Spectral Embedding. In BMVC.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NeurIPS.
Gropp, A.; Atzmon, M.; and Lipman, Y. 2020. Isometric autoencoders. arXiv preprint arXiv:2006.09289.
Hadsell, R.; Chopra, S.; and LeCun, Y. 2006. Dimensionality reduction by learning an invariant mapping. In CVPR.
He, X.; and Niyogi, P. 2003. Locality preserving projections. In NeurIPS.
Karras, T.; Laine, S.; and Aila, T. 2019. A style-based generator architecture for generative adversarial networks. In CVPR.
Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Kobyzev, I.; Prince, S.; and Brubaker, M. 2020. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep Learning Face Attributes in the Wild. In ICCV.
Lou, A.; Lim, D.; Katsman, I.; Huang, L.; Jiang, Q.; Lim, S. N.; and De Sa, C. M. 2020. Neural manifold ordinary differential equations. In NeurIPS.
Mishne, G.; Shaham, U.; Cloninger, A.; and Cohen, I. 2019. Diffusion nets. Applied and Computational Harmonic Analysis.
Pai, G.; Talmon, R.; Bronstein, A.; and Kimmel, R. 2019. DIMAL: Deep isometric manifold learning using sparse geodesic sampling. In WACV.
Pan, Z.; Niu, L.; and Zhang, L. 2022. UniGAN: Reducing mode collapse using a uniform generator. In NeurIPS.
Petersen, P. 2006. Riemannian geometry. Springer.
Pless, R.; and Souvenir, R. 2009. A survey of manifold learning for images. IPSJ Transactions on Computer Vision and Applications.
Rezende, D. J.; Papamakarios, G.; Racanière, S.; Albergo, M.; Kanwar, G.; Shanahan, P.; and Cranmer, K. 2020. Normalizing flows on tori and spheres. In ICML.
Roweis, S. T. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science.
Silvestri, G.; Roos, D.; and Ambrogioni, L. 2022. Closing the gap: Exact maximum likelihood training of generative autoencoders using invertible layers. arXiv preprint arXiv:2205.09546.
Tenenbaum, J. B.; Silva, V. d.; and Langford, J. C. 2000. A global geometric framework for nonlinear dimensionality reduction. Science.
Teng, Y.; and Choromanska, A. 2019. Invertible autoencoder for domain adaptation. Computation.
Van Der Maaten, L. 2009. Learning a parametric embedding by preserving local structure. In AISTATS.
Van der Maaten, L.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research.
Verleysen, M.; and François, D. 2005. The curse of dimensionality in data mining and time series prediction. In IWANN.
Weinberger, K. Q.; and Saul, L. K. 2006. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision.
Yonghyeon, L.; Yoon, S.; Son, M.; and Park, F. C. 2021. Regularized Autoencoders for Isometric Representation Learning. In ICLR.