# Topological Autoencoders

Michael Moor ¹ ², Max Horn ¹ ², Bastian Rieck ¹ ², Karsten Borgwardt ¹ ²

Equal contribution. These authors jointly directed this work. ¹Department of Biosystems Science and Engineering, ETH Zurich, 4058 Basel, Switzerland. ²SIB Swiss Institute of Bioinformatics, Switzerland. Correspondence to: Karsten Borgwardt. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

We propose a novel approach for preserving topological structures of the input space in latent representations of autoencoders. Using persistent homology, a technique from topological data analysis, we calculate topological signatures of both the input and latent space to derive a topological loss term. Under weak theoretical assumptions, we construct this loss in a differentiable manner, such that the encoding learns to retain multi-scale connectivity information. We show that our approach is theoretically well-founded and that it exhibits favourable latent representations on a synthetic manifold as well as on real-world image data sets, while preserving low reconstruction errors.

## 1. Introduction

While topological features, in particular multi-scale features derived from persistent homology, have seen increasing use in the machine learning community (Carrière et al., 2019, Guss & Salakhutdinov, 2018, Hofer et al., 2017, 2019a,b, Ramamurthy et al., 2019, Reininghaus et al., 2015, Rieck et al., 2019a,b), employing topology directly as a constraint for modern deep learning methods remains a challenge. This is due to the inherently discrete nature of these computations, making backpropagation through the computation of topological signatures immensely difficult or only possible in certain special circumstances (Chen et al., 2019, Hofer et al., 2019a, Poulenard et al., 2018).

This work presents a novel approach that permits obtaining gradients during the computation of topological signatures. This makes it possible to employ topological constraints while training deep neural networks, as well as building topology-preserving autoencoders. Specifically, we make the following contributions:

1. We develop a new topological loss term for autoencoders that helps harmonise the topology of the data space with the topology of the latent space.
2. We prove that our approach is stable on the level of mini-batches, resulting in suitable approximations of the persistent homology of a data set.
3. We empirically demonstrate that our loss term aids in dimensionality reduction by preserving topological structures in data sets; in particular, the learned latent representations are useful in that the preservation of topological structures can improve interpretability.

## 2. Background: Persistent Homology

Persistent homology (Barannikov, 1994, Edelsbrunner & Harer, 2008) is a method from the field of computational topology, which develops tools for analysing topological features (connectivity-based features such as connected components) of data sets. We first introduce the underlying concept of simplicial homology. For a simplicial complex $K$, i.e. a generalised graph with higher-order connectivity information such as cliques, simplicial homology employs matrix reduction algorithms to assign $K$ a family of groups, the homology groups. The $d$th homology group $H_d(K)$ of $K$ contains $d$-dimensional topological features, such as connected components ($d = 0$), cycles/tunnels ($d = 1$), and voids ($d = 2$).
Homology groups are typically summarised by their ranks, thereby obtaining a simple invariant signature of a manifold. For example, a circle in $\mathbb{R}^2$ has one feature with $d = 1$ (a cycle) and one feature with $d = 0$ (a connected component).

In practice, the underlying manifold $\mathcal{M}$ is unknown and we are working with a point cloud $X := \{x_1, \dots, x_n\} \subseteq \mathbb{R}^d$ and a metric $\mathrm{dist}\colon X \times X \to \mathbb{R}$ such as the Euclidean distance. Persistent homology extends simplicial homology to this setting: instead of approximating $\mathcal{M}$ by means of a single simplicial complex, which would be an unstable procedure due to the discrete nature of $X$, persistent homology tracks changes in the homology groups over multiple scales of the metric. This is achieved by constructing a special simplicial complex, the Vietoris–Rips complex (Vietoris, 1927). For a threshold $\varepsilon \geq 0$, the Vietoris–Rips complex of $X$ at scale $\varepsilon$, denoted by $\mathcal{R}_\varepsilon(X)$, contains all simplices (i.e. subsets) of $X$ whose elements $\{x_0, x_1, \dots\}$ satisfy $\mathrm{dist}(x_i, x_j) \leq \varepsilon$ for all $i, j$. Given a matrix $A$ of pairwise distances of a point cloud $X$, we will use $\mathcal{R}_\varepsilon(A)$ and $\mathcal{R}_\varepsilon(X)$ interchangeably because constructing $\mathcal{R}_\varepsilon$ only requires distances. Vietoris–Rips complexes satisfy a nesting relation, i.e. $\mathcal{R}_{\varepsilon_i}(X) \subseteq \mathcal{R}_{\varepsilon_j}(X)$ for $\varepsilon_i \leq \varepsilon_j$, making it possible to track changes in the homology groups as $\varepsilon$ increases (Edelsbrunner et al., 2002). Figure 1 illustrates this process.

Figure 1. The Vietoris–Rips complex $\mathcal{R}_\varepsilon(X)$ of a point cloud $X$ at different scales $\varepsilon_0$, $\varepsilon_1$, and $\varepsilon_2$. As the distance threshold increases, the connectivity changes. The creation and destruction of $d$-dimensional topological features is recorded in the $d$th persistence diagram $\mathcal{D}_d$.

Since $X$ contains a finite number of points, a maximum value $\varepsilon_{\max}$ exists for which the connectivity stabilises; therefore, calculating $\mathcal{R}_{\varepsilon_{\max}}$ is sufficient to obtain topological features at all scales. We write $\mathrm{PH}(\mathcal{R}_\varepsilon(X))$ for the persistent homology calculation of the Vietoris–Rips complex. It results in a tuple $(\{\mathcal{D}_1, \mathcal{D}_2, \dots\}, \{\pi_1, \pi_2, \dots\})$ of persistence diagrams (1st component) and persistence pairings (2nd component). The $d$-dimensional persistence diagram $\mathcal{D}_d$ (Figure 1d) of $\mathcal{R}_\varepsilon(X)$ contains coordinates of the form $(a, b)$, where $a$ refers to a threshold at which a $d$-dimensional topological feature is created in the Vietoris–Rips complex, and $b \geq a$ refers to a threshold at which it is destroyed (please refer to Supplementary Section A.1 for a detailed explanation). When $d = 0$, for example, the threshold $b$ indicates at which distance two connected components in $X$ are merged into one. This calculation is known to be related to spanning trees (Kurlin, 2015) and single-linkage clustering, but the persistence diagrams and the persistence pairings carry more information than either one of these concepts. The $d$-dimensional persistence pairing contains indices $(i, j)$ corresponding to simplices $s_i, s_j \in \mathcal{R}_\varepsilon(X)$ that create and destroy the topological feature identified by $(a, b) \in \mathcal{D}_d$, respectively.

Persistence diagrams are known to be stable with respect to small perturbations in the data set (Cohen-Steiner et al., 2007). Two diagrams $\mathcal{D}$ and $\mathcal{D}'$ can be compared using the bottleneck distance $d_b(\mathcal{D}, \mathcal{D}') := \inf_{\eta\colon \mathcal{D} \to \mathcal{D}'} \sup_{x \in \mathcal{D}} \lVert x - \eta(x) \rVert_\infty$, where $\eta\colon \mathcal{D} \to \mathcal{D}'$ denotes a bijection between the points of the two diagrams, and $\lVert \cdot \rVert_\infty$ refers to the $L_\infty$ norm. We use $\mathcal{D}^X$ to refer to the set of persistence diagrams of a point cloud $X$ arising from $\mathrm{PH}(\mathcal{R}_\varepsilon(X))$.
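To make the 0-dimensional case concrete, the following is a minimal sketch, written by us for illustration only and not the implementation used in this paper, of how $\mathrm{PH}(\mathcal{R}_\varepsilon(X))$ can be computed for $d = 0$ from a distance matrix: every point is born at $\varepsilon = 0$, and a connected component dies at the length of the minimum-spanning-tree edge that merges it into another component. The function name `persistence_0d` and the toy data are ours.

```python
import numpy as np

def persistence_0d(dist):
    """Illustrative sketch: 0-dimensional persistence diagram of the
    Vietoris-Rips filtration, given a pairwise distance matrix.  Every point
    is born at 0; a component dies when the edge merging it into another
    component enters the filtration, i.e. the death times are the edge
    lengths of a minimum spanning tree."""
    n = dist.shape[0]
    parent = list(range(n))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process candidate edges in order of increasing length (the filtration).
    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    diagram = []
    for eps, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                  # this edge merges two components
            parent[ri] = rj
            diagram.append((0.0, eps))  # (birth, death) pair
    return diagram                     # n - 1 finite pairs; one class never dies

# Toy example: three points on a line.
X = np.array([[0.0], [1.0], [3.0]])
D = np.abs(X - X.T)
print(persistence_0d(D))              # [(0.0, 1.0), (0.0, 2.0)]
```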
Figure 2. An overview of our method. Given a mini-batch $X$ of the data space $\mathcal{X}$, we train an autoencoder to reconstruct $X$, leading to a reconstruction $\tilde{X}$. In addition to the usual reconstruction loss, we calculate our topological loss based on the topological differences between persistence diagrams, i.e. topological feature descriptors, calculated on the mini-batch $X$ and its corresponding latent code $Z$. The objective of our topological loss term is to constrain the autoencoder such that topological features in the data space are preserved in latent representations.

## 3. A Topology-Preserving Autoencoder

We propose a generic framework for constraining autoencoders to preserve topological structures (measured via persistent homology) of the data space in their latent encodings. Figure 2 depicts an overview of our method; the subsequent sections provide more details about the individual steps.

### 3.1. Vietoris–Rips Complex Calculation

Given a finite metric space $S$, such as a point cloud, we first calculate the persistent homology of the Vietoris–Rips complex of its distance matrix $A^S$. It is common practice to use the Euclidean distance for the calculation of $A^S$, but neither the persistent homology calculation nor our method is restricted to any particular distance; previous research (Wagner & Dłotko, 2014) shows that even similarity measures that do not satisfy the properties of a metric can be used successfully with $\mathrm{PH}(\cdot)$. Subsequently, let $\varepsilon := \max A^S$ so that $\mathcal{R}_\varepsilon$ is the corresponding Vietoris–Rips complex as described in Section 2. Given a maximum dimension¹ of $d \in \mathbb{N}_{>0}$, we obtain a set of persistence diagrams $\mathcal{D}^S$ and a set of persistence pairings $\pi^S$. The $d$th persistence pairing $\pi^S_d$ contains indices of simplices that are pertinent to the creation and destruction of $d$-dimensional topological features. We can consider each pairing to represent edge indices, namely the edges that are deemed to be topologically relevant by the computation of persistent homology (see below for more details). This works because the Vietoris–Rips complex is a clique complex, i.e. a simplicial complex that is fully determined by its edges (Zomorodian, 2010).

¹This means that we do not have to consider higher-dimensional topological features, making the calculation more efficient.

**Selecting indices from pairings.** The basic idea of our method involves selecting indices in the persistence pairing and mapping them back to a distance between two vertices. We then adjust this distance to harmonise topological features of the input space and the latent space. For 0-dimensional topological features, it is sufficient to consider the indices of edges, which are the destroyer simplices, in the pairing $\pi^S_0$. Each index corresponds to an edge in the minimum spanning tree of the data set. This calculation is computationally efficient; its worst-case complexity in the batch size $m$ is governed by the extremely slow-growing inverse Ackermann function $\alpha(\cdot)$ (Cormen et al., 2009, Chapter 22). For 1-dimensional features, where edges are paired with triangles, we obtain edge indices by selecting the edge with the maximum weight of the triangle. While this procedure, and thus our method, generalises to higher dimensions, our current implementation does not support higher-dimensional features. Since preliminary experiments showed that using 1-dimensional topological features merely increases runtime, the subsequent experiments focus only on 0-dimensional persistence diagrams. We thus use $\mathcal{D}^S$ and $\pi^S$ to denote the 0-dimensional persistence diagram and pairing of $S$, respectively.
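As an illustration of the edge selection just described, the sketch below obtains the 0-dimensional persistence pairing of a mini-batch as the edge indices of a minimum spanning tree and maps them back to distances. It assumes SciPy is available; `pairing_0d` and `paired_distances` are our own hypothetical helpers, not part of the reference implementation.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def pairing_0d(distances: np.ndarray) -> np.ndarray:
    """Return the 0-dimensional persistence pairing as an array of edge
    indices (i, j).  For the Vietoris-Rips filtration, these are exactly
    the edges of a minimum spanning tree of the distance matrix."""
    mst = minimum_spanning_tree(distances).tocoo()
    return np.stack([mst.row, mst.col], axis=1)        # shape (m - 1, 2)

def paired_distances(distances: np.ndarray, pairing: np.ndarray) -> np.ndarray:
    """Subset the distance matrix with the selected edge indices,
    i.e. the vector written A[pi] in Section 3.2."""
    return distances[pairing[:, 0], pairing[:, 1]]

# Example on a random mini-batch of size m = 5.
rng = np.random.default_rng(0)
batch = rng.normal(size=(5, 3))
dist = np.linalg.norm(batch[:, None] - batch[None, :], axis=-1)
pi = pairing_0d(dist)
print(pi)                          # 4 edges deemed topologically relevant
print(paired_distances(dist, pi))  # their lengths = death times in D_0
```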
### 3.2. Topological Autoencoder

In the following, we consider a mini-batch $X$ of size $m$ from the data space $\mathcal{X}$ as a point cloud. Furthermore, we define an autoencoder as the composition of two functions $h \circ g$, where $g\colon \mathcal{X} \to \mathcal{Z}$ represents the encoder and $h\colon \mathcal{Z} \to \mathcal{X}$ represents the decoder, denoting latent codes by $Z := g(X)$. During a forward pass of the autoencoder, we compute the persistent homology of the mini-batch in both the data as well as the latent space, yielding two tuples, i.e. $(\mathcal{D}^X, \pi^X) := \mathrm{PH}(\mathcal{R}_\varepsilon(X))$ and $(\mathcal{D}^Z, \pi^Z) := \mathrm{PH}(\mathcal{R}_\varepsilon(Z))$. The values of the persistence diagram can be retrieved by subsetting the distance matrix with the edge indices provided by the persistence pairings; we write $\mathcal{D}^X \simeq A^X[\pi^X]$ to indicate that the diagram, which is a set, contains the same information as the distances we retrieve with the pairing. We treat $A^X[\pi^X]$ as a vector in $\mathbb{R}^{|\pi^X|}$. Informally speaking, the persistent homology calculation can thus be seen as a selection of topologically relevant edges of the Vietoris–Rips complex, followed by the selection of corresponding entries in the distance matrix. By comparing both diagrams, we can construct a topological regularisation term $\mathcal{L}_t := \mathcal{L}_t(A^X, A^Z, \pi^X, \pi^Z)$, which we add to the reconstruction loss of an autoencoder, i.e.

$$\mathcal{L} := \mathcal{L}_r(X, h(g(X))) + \lambda \mathcal{L}_t, \tag{1}$$

where $\lambda \in \mathbb{R}$ is a parameter to control the strength of the regularisation (see also Supplementary Section A.6). Next, we discuss how to specify $\mathcal{L}_t$.

Since we only select edge indices from $X$ and $Z$, the PH calculation represents a selection of topologically relevant distances from the distance matrix. Each persistence diagram entry corresponds to a distance between two data points. Following standard assumptions in persistent homology (Hofer et al., 2019a, Poulenard et al., 2018), we assume that the distances are unique so that each entry in the diagram has an infinitesimal neighbourhood that only contains a single point. In practice, this can always be achieved by performing (symbolic) perturbations of the distances. Given this fixed pairing and a differentiable distance function, the persistence diagram entries are therefore also differentiable with respect to the encoder parameters. Hence, the persistence pairing does not change upon a small perturbation of the underlying distances, thereby guaranteeing the existence of the derivative of our loss function. This, in turn, permits the calculation of gradients for backpropagation.

A straightforward approach to impose the data space topology on the latent space would be to directly calculate a loss based on the selected distances in both spaces. Such an approach will not result in informative gradients for the autoencoder, as it merely compares topological features without matching² the edges between $\mathcal{R}_\varepsilon(X)$ and $\mathcal{R}_\varepsilon(Z)$. A cleaner approach would be to enforce similarity on the intersection of the selected edges in both complexes. However, this would initially include very few edges, preventing efficient training and leading to highly biased estimates of the topological alignment between the spaces.³ To overcome this, we account for the union of all selected edges in $X$ and $Z$.

²We use the term matching only to build intuition. Our approach does not calculate a matching in the sense of a bottleneck or Wasserstein distance between persistence diagrams.
³When initialising a random latent space $Z$, the persistence pairing in the latent space will select random edges, resulting in only one expected matched edge (independent of mini-batch size) between the two pairings. Thus, only one edge (referring to one pairwise distance between two latent codes) could be used to update the encoding of these two data points.

Our topological loss term decomposes into two components, each handling the directed loss occurring as topological features in one of the two spaces remain fixed. Hence, $\mathcal{L}_t = \mathcal{L}_{X \to Z} + \mathcal{L}_{Z \to X}$, with

$$\mathcal{L}_{X \to Z} := \tfrac{1}{2} \left\lVert A^X\left[\pi^X\right] - A^Z\left[\pi^X\right] \right\rVert^2 \tag{2}$$

$$\mathcal{L}_{Z \to X} := \tfrac{1}{2} \left\lVert A^Z\left[\pi^Z\right] - A^X\left[\pi^Z\right] \right\rVert^2, \tag{3}$$

respectively. The key idea for both terms is to align and preserve topologically relevant distances from both spaces. By taking the union of all selected edges (and the corresponding distances), we obtain an informative loss term that is determined by at least $|X|$ distances. This loss can be seen as a more generic version of the loss introduced by Hofer et al. (2019a), whose formulation does not take the two directed components into account and optimises the destruction values of all persistence tuples with respect to a uniform parameter (their goal is different from ours and does not require a loss term that is capable of harmonising topological features across the two spaces; please refer to Section 4 for a brief discussion). By contrast, our formulation aims to align the distances selected by the pairings in $X$ and $Z$ (which in turn will lead to an alignment of distances between $X$ and $Z$). If the two spaces are aligned perfectly, $\mathcal{L}_{X \to Z} = \mathcal{L}_{Z \to X} = 0$ because both pairings and their corresponding distances coincide. The converse implication is not true: if $\mathcal{L}_t = 0$, the persistence pairings and their corresponding persistence diagrams are not necessarily identical. Since we did not observe such behaviour in our experiments, however, we leave a more formal treatment of these situations for future work.

**Gradient calculation.** Letting $\theta$ refer to the parameters of the encoder, the chain rule yields

$$\frac{\partial \mathcal{L}_{X \to Z}}{\partial \theta} = \sum_{i=1}^{|\pi^X|} \frac{\partial \mathcal{L}_{X \to Z}}{\partial A^Z\left[\pi^X\right]_i} \cdot \frac{\partial A^Z\left[\pi^X\right]_i}{\partial \theta},$$

where $|\pi^X|$ denotes the cardinality of a persistence pairing and $A^Z[\pi^X]_i$ refers to the $i$th entry of the vector of paired distances. This derivation works analogously for $\mathcal{L}_{Z \to X}$ (with $\pi^X$ being replaced by $\pi^Z$). Furthermore, any derivative of $A^X$ with respect to $\theta$ must vanish because the distances of the input samples do not depend on the encoding by definition. These equations assume infinitesimal perturbations. The persistence diagrams change in a non-differentiable manner during the training phase. However, for any given update step, a diagram is robust to infinitesimal changes of its entries (Cohen-Steiner et al., 2007). As a consequence, our topological loss is differentiable for each update step during training. We make our code publicly available.⁴

⁴https://github.com/BorgwardtLab/topological-autoencoders
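The following PyTorch sketch shows one way to assemble Eqs. (1)–(3) on a mini-batch under the assumptions above (unique distances, pairing treated as locally constant per update step). It is a simplified illustration of ours, not the reference implementation (see footnote 4 for the actual code); in particular, the pairing is computed on detached distances with a naive Prim's algorithm, and all function names are our own.

```python
import torch

def pairwise_distances(x: torch.Tensor) -> torch.Tensor:
    """Euclidean distance matrix of a mini-batch of shape (m, d)."""
    return torch.cdist(x, x, p=2)

def pairing_0d(dist: torch.Tensor) -> torch.Tensor:
    """0-dim persistence pairing: edge indices of a minimum spanning tree.
    Computed on detached distances; the pairing itself is a fixed index set
    for the current update step, only the distances carry gradients."""
    d = dist.detach()
    m = d.shape[0]
    in_tree, edges = {0}, []
    while len(in_tree) < m:                      # naive Prim's algorithm
        best = min((d[i, j].item(), i, j) for i in in_tree
                   for j in range(m) if j not in in_tree)
        edges.append(best[1:])
        in_tree.add(best[2])
    return torch.tensor(edges, dtype=torch.long)

def topological_loss(x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """L_t = L_{X->Z} + L_{Z->X} as in Eqs. (2)-(3)."""
    ax, az = pairwise_distances(x), pairwise_distances(z)
    pi_x, pi_z = pairing_0d(ax), pairing_0d(az)
    sel = lambda a, pi: a[pi[:, 0], pi[:, 1]]    # A[pi]: vector of paired distances
    l_xz = 0.5 * torch.sum((sel(ax, pi_x) - sel(az, pi_x)) ** 2)
    l_zx = 0.5 * torch.sum((sel(az, pi_z) - sel(ax, pi_z)) ** 2)
    return l_xz + l_zx

# Combined objective on a mini-batch, cf. Eq. (1):
#   loss = reconstruction_loss(x, x_hat) + lam * topological_loss(x, z)
```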
### 3.3. Stability

Despite the aforementioned known stability of persistence diagrams with respect to small perturbations of the underlying space, we still have to analyse our topological approximation on the level of mini-batches. The following theorem guarantees that subsampled persistence diagrams are close to the persistence diagrams of the original point cloud.

**Theorem 1.** Let $X$ be a point cloud of cardinality $n$ and $X^{(m)}$ be one subsample of $X$ of cardinality $m$, i.e. $X^{(m)} \subseteq X$, sampled without replacement. We can bound the probability of the persistence diagrams of $X^{(m)}$ exceeding a threshold $\alpha$ in terms of the bottleneck distance as

$$\mathbb{P}\left[ d_b\left(\mathcal{D}^X, \mathcal{D}^{X^{(m)}}\right) > \alpha \right] \leq \mathbb{P}\left[ d_{\mathrm{H}}\left(X, X^{(m)}\right) > \alpha/2 \right],$$

where $d_{\mathrm{H}}$ refers to the Hausdorff distance between the point cloud and its subsample.

*Proof.* See Section A.2 in the supplementary materials.

For $m \to n$, each mini-batch converges to the original point cloud, so we have $\lim_{m \to n} d_{\mathrm{H}} = 0$. Please refer to Section A.3 for an analysis of empirical convergence rates as well as a discussion of a worst-case bound. Given certain independence assumptions, the next theorem approximates the expected value of the Hausdorff distance between the point cloud and a mini-batch. The calculation of an exact representation is beyond the scope of this work.

**Theorem 2.** Let $A \in \mathbb{R}^{n \times m}$ be the distance matrix between samples of $X$ and $X^{(m)}$, where the rows are sorted such that the first $m$ rows correspond to the columns of the $m$ subsampled points, with diagonal elements $a_{ii} = 0$. Assume that the entries $a_{ij}$ with $i > m$ are random samples following a distance distribution $F_D$ with $\mathrm{supp}(F_D) \subseteq \mathbb{R}_{\geq 0}$. The minimal distances $\delta_i$ for rows with $i > m$ follow a distribution $F_\delta$. Letting $Z := \max_{1 \leq i \leq n} \delta_i$ with a corresponding distribution $F_Z$, the expected Hausdorff distance between $X$ and $X^{(m)}$ for $m < n$ satisfies

$$\mathbb{E}\left[ d_{\mathrm{H}}\left(X, X^{(m)}\right) \right] = \mathbb{E}_{Z \sim F_Z}[Z], \tag{8}$$

which is bounded by integral expressions involving only $F_D$ (Eqs. 9 and 10; see Section A.4).

*Proof.* See Section A.4 in the supplementary materials.

From the bound in Eq. 10, we obtain $\mathbb{E}[d_{\mathrm{H}}(X, X^{(m)})] \to 0$ as $m \to n$, so the expected value converges as the subsample size approaches the total sample size.⁵ We conclude that our subsampling approach results in point clouds that are suitable proxies for the large-scale topological structures of the point cloud $X$.

⁵For $m = n$, the two integrals switch their order as $m(n-m) = 0 < n-1$ (for $n > 1$).

## 4. Related Work

Computational topology and persistent homology (PH) have started gaining traction in several areas of machine learning research. PH is often used as a post hoc method for analysing topological characteristics of data sets. Thus, there are several methods that compare topological features of high-dimensional spaces with different embeddings to assess the fidelity and quality of a specific embedding scheme (Khrulkov & Oseledets, 2018, Paul & Chalup, 2017, Rieck & Leitte, 2015, 2017, Yan et al., 2018). PH can also be used to characterise the training of neural networks (Guss & Salakhutdinov, 2018, Rieck et al., 2019b), as well as their decision boundaries (Ramamurthy et al., 2019). Our method differs from all these publications in that we are able to obtain gradient information to update a model while training.

Alternatively, topological features can be integrated into classifiers to improve classification performance. Hofer et al. (2017) propose a neural network layer that learns projections of persistence diagrams, which can subsequently be used as feature descriptors to classify structured data. Moreover, several vectorisation strategies for persistence diagrams exist (Adams et al., 2017, Carrière et al., 2015), making it possible to use them in kernel-based classifiers. These strategies have been subsumed (Carrière et al., 2019) in a novel architecture based on deep sets. The commonality of these approaches is that they treat persistence diagrams as being fixed; while they are capable of learning suitable parameters for classifying them, they cannot adjust input data to better approximate a certain topology.

Such topology-based adjustments have only recently become feasible. Poulenard et al. (2018) demonstrated how to optimise real-valued functions based on their topology.
This constitutes the first approach for aligning persistence diagrams by modifying input data; it requires the connectivity of the data to be known, and the optimised functions have to be node-based and scalar-valued. By contrast, our method works directly on distances and sidesteps connectivity calculations via the Vietoris–Rips complex. Chen et al. (2019) use a similar optimisation technique to regularise the decision boundary of a classifier. However, this requires discretising the space, which can be computationally expensive. Hofer et al. (2019a), the closest work to ours, also present a differentiable loss term. Their formulation enforces a single, pre-defined scale on the latent space. The learned encoding is then applied to a one-class learning task in which a scoring function is calculated based on the pre-defined scale. By contrast, the goal of our loss term is to support the model in learning a latent encoding that best preserves the data space topology in said latent space, which we use for dimensionality reduction. We thus target a different task, and can preserve multiple scales (those selected through the filtration process) that are present in the data domain.

## 5. Experiments

Our main task is to learn a latent space in an unsupervised manner such that topological features of the data space, measured using persistent homology approximations on every batch, are preserved as much as possible.

### 5.1. Experimental Setup

Subsequently, we briefly describe our data sets and evaluation metrics. Please refer to the supplementary materials for technical details (calculation, hyperparameters, etc.).

#### 5.1.1. Data Sets

We generate a SPHERES data set that consists of ten high-dimensional 100-spheres living in a 101-dimensional space that are enclosed by one larger sphere consisting of the same number of points as the total of the inner spheres (see Section A.5 for more details). We also use three image data sets (MNIST, FASHION-MNIST, and CIFAR-10), which are particularly amenable to our topology-based analysis because real-world images are known to lie on or near low-dimensional manifolds (Lee et al., 2003, Peyré, 2009).

#### 5.1.2. Baselines & Training Procedure

We compare our approach with several dimensionality reduction techniques, including UMAP (McInnes et al., 2018), t-SNE (van der Maaten & Hinton, 2008), Isomap (Tenenbaum et al., 2000), PCA, as well as standard autoencoders (AE). We apply our proposed topological constraint to this standard autoencoder architecture (TopoAE). For comparability and interpretability, each method is restricted to two latent dimensions. We split each data set into training and testing (using the predefined split if available; 90% versus 10% otherwise). Additionally, we remove 15% of the training split as a validation data set for tuning the hyperparameters. We normalised our topological loss term by the batch size $m$ in order to disentangle $\lambda$ from it. All autoencoders employ batch normalisation and are optimised using Adam (Kingma & Ba, 2014). Since t-SNE is not intended to be applied to previously unseen test samples, we evaluate this method only on the train split. In addition, significant computational scaling issues prevent us from running a hyperparameter search for Isomap on real-world data sets, so we only compare this algorithm on the synthetic data set. Please refer to Section A.6 for more details on architectures and hyperparameters.
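To make the training procedure concrete, the sketch below shows how the batch-size-normalised topological loss enters a single optimisation step. The layer sizes, learning rate, and $\lambda$ are placeholders rather than the tuned values from Section A.6, and `topological_loss` refers to the sketch from Section 3.2; this is an illustration of the wiring, not the exact architecture used in our experiments.

```python
import torch
from torch import nn, optim

# topological_loss: the sketch from Section 3.2 is assumed to be in scope.
# Placeholder encoder/decoder for 101-dimensional inputs such as SPHERES.
encoder = nn.Sequential(nn.Linear(101, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 2))
decoder = nn.Sequential(nn.Linear(2, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 101))
optimizer = optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
lam = 1.0                                        # regularisation strength, cf. Eq. (1)

def training_step(x: torch.Tensor) -> torch.Tensor:
    """One update on a mini-batch x of shape (m, d)."""
    z = encoder(x)
    x_hat = decoder(z)
    recon = nn.functional.mse_loss(x_hat, x)
    topo = topological_loss(x, z) / x.shape[0]   # normalised by batch size m
    loss = recon + lam * topo
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```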
#### 5.1.3. Evaluation

We evaluate the quality of latent representations in terms of (1) low-dimensional visualisations, (2) dimensionality reduction quality metrics (evaluated between input data and latent codes), and (3) reconstruction errors (Data MSE; evaluated between input and reconstructed data), provided that invertible transformations are available.⁶ For (2), we consider several measures (please refer to Section A.7 for more details). First, we calculate $\mathrm{KL}_\sigma$, the Kullback–Leibler divergence between density estimates of the input and latent space (Chazal et al., 2011, 2014b), where $\sigma \in \mathbb{R}_{>0}$ denotes the length scale of the Gaussian kernel, which is varied to account for multiple data scales. We chose minimising $\mathrm{KL}_{0.1}$ as our hyperparameter search objective. Furthermore, we calculate common non-linear dimensionality reduction (NLDR) quality metrics, which use the pairwise distance matrices of the input and the latent space, namely (1) the root mean square error (RMSE), which despite its name is not related to the reconstruction error of the autoencoder but merely measures to what extent the two distributions of distances coincide, (2) the mean relative rank error (MRRE), (3) the continuity (Cont), and (4) the trustworthiness (Trust). The reported measures are computed on the test splits (except for t-SNE, where no transformation between splits is available, so we report the measures on a random subsample of the train split preserving the cardinality of the test split).

⁶Invertible transformations are available for PCA and all autoencoder-based methods.
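For orientation only, the following sketch illustrates the general shape of such a density-based KL measure, assuming a plain Gaussian-kernel density estimate over pairwise distances. The precise estimator we use follows Chazal et al. (2011, 2014b) and is specified in Section A.7, so this should not be read as the exact definition of $\mathrm{KL}_\sigma$; all names are ours.

```python
import numpy as np

def gaussian_density(dist: np.ndarray, sigma: float) -> np.ndarray:
    """Per-point density estimate from a pairwise distance matrix, using a
    Gaussian kernel of length scale sigma, normalised to sum to one."""
    dens = np.exp(-dist ** 2 / sigma ** 2).sum(axis=1)
    return dens / dens.sum()

def kl_sigma(dist_x: np.ndarray, dist_z: np.ndarray, sigma: float) -> float:
    """KL divergence between density estimates of data space and latent space."""
    p = gaussian_density(dist_x, sigma)
    q = gaussian_density(dist_z, sigma)
    return float(np.sum(p * np.log(p / q)))

# Usage: dist_x, dist_z are pairwise distance matrices of X and Z = g(X);
# Table 1 reports this kind of measure for sigma in {0.01, 0.1, 1}.
```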
Figure 3. Latent representations of the SPHERES data set. Only our method is capable of representing the complicated nesting relationship inherent to the data; t-SNE, for example, tears the original data apart. For TopoAE, we used a batch size of 28. Please refer to Figure A.5 in the supplementary materials for an enlarged version.

### 5.2. Results

Next to a quantitative evaluation in terms of various quality metrics, we also discuss qualitative results in terms of visualisations, which are interpretable in case the ground truth manifold is known.

#### 5.2.1. Quantitative Results

Table 1 reports the quantitative results. Overall, we observe that our method is capable of preserving the data density over multiple length scales (as measured by KL). Furthermore, we find that TopoAE displays competitive continuity values (Cont) and reconstruction errors (Data MSE). The latter is particularly relevant as it demonstrates that imposing our topological constraints does not result in large impairments when reconstructing the input space.

| Data set | Method | KL_0.01 | KL_0.1 | KL_1 | MRRE | Cont | Trust | RMSE | Data MSE |
|---|---|---|---|---|---|---|---|---|---|
| SPHERES | Isomap | 0.181 | 0.420 | 0.00881 | 0.246 | 0.790 | 0.676 | 10.4 | |
| | PCA | 0.332 | 0.651 | 0.01530 | 0.294 | 0.747 | 0.626 | 11.8 | 0.9610 |
| | t-SNE | 0.152 | 0.527 | 0.01271 | 0.217 | 0.773 | 0.679 | 8.1 | |
| | UMAP | 0.157 | 0.613 | 0.01658 | 0.250 | 0.752 | 0.635 | 9.3 | |
| | AE | 0.566 | 0.746 | 0.01664 | 0.349 | 0.607 | 0.588 | 13.3 | 0.8155 |
| | TopoAE | 0.085 | 0.326 | 0.00694 | 0.272 | 0.822 | 0.658 | 13.5 | 0.8681 |
| | PCA | 0.356 | 0.052 | 0.00069 | 0.057 | 0.968 | 0.917 | 9.1 | 0.1844 |
| | t-SNE | 0.405 | 0.071 | 0.00198 | 0.020 | 0.967 | 0.974 | 41.3 | |
| | UMAP | 0.424 | 0.065 | 0.00163 | 0.029 | 0.981 | 0.959 | 13.7 | |
| | AE | 0.478 | 0.068 | 0.00125 | 0.026 | 0.968 | 0.974 | 20.7 | 0.1020 |
| | TopoAE | 0.392 | 0.054 | 0.00100 | 0.032 | 0.980 | 0.956 | 20.5 | 0.1207 |
| | PCA | 0.389 | 0.163 | 0.00160 | 0.166 | 0.901 | 0.745 | 13.2 | 0.2227 |
| | t-SNE | 0.277 | 0.133 | 0.00214 | 0.040 | 0.921 | 0.946 | 22.9 | |
| | UMAP | 0.321 | 0.146 | 0.00234 | 0.051 | 0.940 | 0.938 | 14.6 | |
| | AE | 0.620 | 0.155 | 0.00156 | 0.058 | 0.913 | 0.937 | 18.2 | 0.1373 |
| | TopoAE | 0.341 | 0.110 | 0.00114 | 0.056 | 0.932 | 0.928 | 19.6 | 0.1388 |
| | PCA | 0.591 | 0.020 | 0.00023 | 0.119 | 0.931 | 0.821 | 17.7 | 0.1482 |
| | t-SNE | 0.627 | 0.030 | 0.00073 | 0.103 | 0.903 | 0.863 | 25.6 | |
| | UMAP | 0.617 | 0.026 | 0.00050 | 0.127 | 0.920 | 0.817 | 33.6 | |
| | AE | 0.668 | 0.035 | 0.00062 | 0.132 | 0.851 | 0.864 | 36.3 | 0.1403 |
| | TopoAE | 0.556 | 0.019 | 0.00031 | 0.108 | 0.927 | 0.845 | 37.9 | 0.1398 |

Table 1. Embedding quality according to multiple evaluation metrics (Section 5.1). The hyperparameters of all tunable methods were selected to minimise the objective KL_0.1. Lower values are better for the KL divergences, MRRE, RMSE, and Data MSE, while higher values are better for Cont and Trust. Please refer to Supplementary Table A.2 for more σ scales and variance estimates. The column Data MSE indicates the reconstruction error. It is included to demonstrate that applying our loss term has no adverse effects.

The remaining classical measures favour the baselines (foremost the train (!) performance of t-SNE). However, we will subsequently see by inspecting the latent spaces that those classic measures fail to detect the relevant structural information, as exemplified with known ground truth manifolds such as the SPHERES data set.

#### 5.2.2. Visualisation of Latent Spaces

For the SPHERES data set (Figure 3), we observe that only our method is capable of assessing the nesting relationship of the high-dimensional spheres correctly. By contrast, t-SNE cuts open the enclosing sphere, distributing most of its points around the remaining spheres. We see that the KL divergence confirms the visual assessment that only our proposed method preserves the relevant structure of this data set. Several classical evaluation measures, however, favour t-SNE, even though this method fails to capture the global structure and nesting relationship of the enclosing sphere manifold, which accounts for half of the data set.

On FASHION-MNIST (Figure 4, leftmost column), we see that, as opposed to AE, which is purely driven by the reconstruction error, our method has the additional objective of preserving structure. Here, the constraint helps the regularised autoencoder to organise the latent space, resulting in a pattern comparable to that of UMAP, which is also topologically motivated (McInnes et al., 2018). Furthermore, we observe that t-SNE tends to fragment certain classes (dark orange, red) into multiple distinct subgroups. This likely does not reflect the underlying manifold structure, but constitutes an artefact frequently observed with this method.
For MNIST, the latent embeddings (Figure 4, middle column) demonstrate that the non-linear competitors, mostly by pulling apart distinct classes, lose some of the relationship information between clusters when compared with our proposed method or PCA. Finally, we observe that CIFAR-10 (Figure 4, rightmost column) is challenging to embed in two latent dimensions in a purely unsupervised manner. Interestingly, our method (consistently, i.e. over all runs) was able to identify a linear substructure that separates the latent space into two additional groups of classes.

## 6. Discussion and Conclusion

We presented topological autoencoders, a novel method for preserving topological information of the input space, measured in terms of persistent homology, when learning latent representations with deep neural networks. Under weak theoretical assumptions, we showed how our persistent homology calculations can be combined with backpropagation; moreover, we proved that approximating persistent homology on the level of mini-batches is theoretically justified.

Figure 4. Latent representations of the FASHION-MNIST (left column), MNIST (middle column), and CIFAR-10 (right column) data sets. The respective batch size values for our method are (95, 126, 82). Please refer to Figures A.6, A.7, and A.8 in the supplementary materials for enlarged versions.

In our experiments, we observed that our method is uniquely able to capture spatial relationships between nested high-dimensional spheres. This is relevant, as the ability to cope with several manifolds in the domain of manifold learning still remains a challenging task. On real-world data sets, we observed that our topological loss leads to competitive performance in terms of numerous quality metrics (such as a density preservation metric), while not adversely affecting the reconstruction error. In both synthetic and real-world data sets, we obtain interesting and interpretable representations, as our method does not merely pull apart different classes, but tries to spatially arrange them meaningfully. Thus, we do not observe mere distinct "clouds", but rather entangled structures, which we consider to constitute a more meaningful representation of the underlying manifolds (an auxiliary analysis in Supplementary Section A.10 confirms that our method influences topological features, measured using PH, in a beneficial manner).

**Future work.** Our topological loss formulation is highly generalisable; it only requires the existence of a distance matrix between individual samples (either globally, or on the level of batches). As a consequence, our topological loss term can be directly integrated into a variety of different architectures and is not limited to standard autoencoders. For instance, we can also apply our constraint to variational setups (see Figure A.3 for a sketch) or create a topology-aware variant of principal component analysis (dubbed "TopoPCA"; see Table A.2 for more details, as well as Figures A.6, A.7, and A.8 for the corresponding latent space representations). Employing our loss term in more involved architectures will be an exciting route for future work.

One issue with the calculation is that, given the computational complexity of calculating $\mathcal{R}_\varepsilon(\cdot)$ for higher-dimensional features, we would scale progressively worse with increasing batch size. However, in our low-dimensional setup, we observed that runtime tends to grow with decreasing batch size, i.e.
the mini-batch speed-up still dominates runtime (for more details concerning the effect of batch sizes, see Supplementary Section A.8). In future work, scaling to higher dimensions could be mitigated by approximating the calculation of persistent homology (Choudhary et al., 2018, Kerber & Sharathkumar, 2013, Sheehy, 2013) or by exploiting recent advances in parallelising it (Bauer et al., 2014, Lewis & Morozov, 2015).

Another interesting extension would be to tackle classification scenarios with topology-preserving loss terms. This might prove challenging, however, because the goal in classification is to increase class separability, which might be achieved by removing topological structures. This goal is therefore at odds with our loss term, which tries to preserve those structures. We think that such an extension might require restricting the method to a subset of scales (namely those that do not impede class separability) to be preserved in the data.

## Acknowledgements

The authors wish to thank Christian Bock for fruitful discussions and valuable feedback. This project was supported by the grant #2017-110 of the Strategic Focal Area "Personalized Health and Related Technologies" (PHRT) of the ETH Domain for the SPHN/PHRT Driver Project "Personalized Swiss Sepsis Study" and the SNSF Starting Grant "Significant Pattern Mining" (K.B., grant no. 155913). Moreover, this work was funded in part by the Alfried Krupp Prize for Young University Teachers of the Alfried Krupp von Bohlen und Halbach-Stiftung (K.B.).

## References

Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., Chepushtanova, S., Hanson, E., Motta, F., and Ziegelmeier, L. Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18(1):218–252, 2017.

Barannikov, S. A. The framed Morse complex and its invariants. Advances in Soviet Mathematics, 21:93–115, 1994.

Bauer, U., Kerber, M., and Reininghaus, J. Distributed computation of persistent homology. In McGeoch, C. C. and Meyer, U. (eds.), Proceedings of the Sixteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 31–38. Society for Industrial and Applied Mathematics, 2014.

Bibal, A. and Frénay, B. Measuring quality and interpretability of dimensionality reduction visualizations. Safe Machine Learning Workshop at ICLR, 2019.

Burago, D., Burago, Y., and Ivanov, S. A course in metric geometry, volume 33 of Graduate Studies in Mathematics. American Mathematical Society, 2001.

Carrière, M., Oudot, S. Y., and Ovsjanikov, M. Stable topological signatures for points on 3D shapes. In Proceedings of the Eurographics Symposium on Geometry Processing (SGP), pp. 1–12, Aire-la-Ville, Switzerland, 2015. Eurographics Association.

Carrière, M., Chazal, F., Ike, Y., Lacombe, T., Royer, M., and Umeda, Y. PersLay: A neural network layer for persistence diagrams and new graph topological signatures. arXiv e-prints, art. arXiv:1904.09378, 2019.

Chazal, F., Cohen-Steiner, D., Guibas, L. J., Mémoli, F., and Oudot, S. Y. Gromov–Hausdorff stable signatures for shapes using persistence. Computer Graphics Forum, 28(5):1393–1403, 2009.

Chazal, F., Cohen-Steiner, D., and Mérigot, Q. Geometric inference for probability measures. Foundations of Computational Mathematics, 11(6):733–751, 2011.

Chazal, F., de Silva, V., and Oudot, S. Y. Persistence stability for geometric complexes. Geometriæ Dedicata, 173(1):193–214, 2014a.
Chazal, F., Fasy, B. T., Lecci, F., Michel, B., Rinaldo, A., and Wasserman, L. Robust topological inference: Distance to a measure and kernel distance. arXiv e-prints, art. arXiv:1412.7197, 2014b.

Chazal, F., Fasy, B., Lecci, F., Michel, B., Rinaldo, A., and Wasserman, L. Subsampling methods for persistent homology. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37 of Proceedings of Machine Learning Research, pp. 2143–2151. PMLR, 2015a.

Chazal, F., Glisse, M., Labruère, C., and Michel, B. Convergence rates for persistence diagram estimation in topological data analysis. Journal of Machine Learning Research, 16:3603–3635, 2015b.

Chen, C., Ni, X., Bai, Q., and Wang, Y. A topological regularizer for classifiers via persistent homology. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pp. 2573–2582. PMLR, 2019.

Choudhary, A., Kerber, M., and Raghvendra, S. Improved topological approximations by digitization. arXiv e-prints, art. arXiv:1812.04966, 2018.

Cohen-Steiner, D., Edelsbrunner, H., and Harer, J. Stability of persistence diagrams. Discrete & Computational Geometry, 37(1):103–120, 2007.

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to algorithms. MIT Press, Cambridge, MA, USA, 3rd edition, 2009.

Edelsbrunner, H. and Harer, J. Persistent homology – a survey. In Goodman, J. E., Pach, J., and Pollack, R. (eds.), Surveys on discrete and computational geometry: Twenty years later, number 453 in Contemporary Mathematics, pp. 257–282. American Mathematical Society, Providence, RI, USA, 2008.

Edelsbrunner, H., Letscher, D., and Zomorodian, A. J. Topological persistence and simplification. Discrete & Computational Geometry, 28(4):511–533, 2002.

Gracia, A., González, S., Robles, V., and Menasalvas, E. A methodology to compare dimensionality reduction algorithms in terms of loss of quality. Information Sciences, 270:1–27, 2014.

Guss, W. H. and Salakhutdinov, R. On characterizing the capacity of neural networks using algebraic topology. arXiv e-prints, art. arXiv:1802.04443, 2018.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

Hofer, C., Kwitt, R., Niethammer, M., and Uhl, A. Deep learning with topological signatures. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 1633–1643. Curran Associates, Inc., 2017.

Hofer, C., Kwitt, R., Niethammer, M., and Dixit, M. Connectivity-optimized representation learning via persistent homology. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2751–2760. PMLR, 2019a.

Hofer, C. D., Graf, F., Rieck, B., Niethammer, M., and Kwitt, R. Graph filtration learning. arXiv e-prints, art. arXiv:1905.10996, 2019b.

Kerber, M. and Sharathkumar, R. Approximate Čech complexes in low and high dimensions. arXiv e-prints, art. arXiv:1307.3272, 2013.

Khrulkov, V. and Oseledets, I. Geometry score: A method for comparing generative adversarial networks. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2621–2629. PMLR, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv e-prints, art. arXiv:1412.6980, 2014.
Kurlin, V. A one-dimensional homologically persistent skeleton of an unstructured point cloud in any metric space. Computer Graphics Forum, 34(5):253–262, 2015.

Lee, A. B., Pedersen, K. S., and Mumford, D. The nonlinear statistics of high-contrast patches in natural images. International Journal of Computer Vision, 54(1–3):83–103, 2003.

Lee, J. A. and Verleysen, M. Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing, 72(7):1431–1443, 2009.

Lewis, R. and Morozov, D. Parallel computation of persistent homology using the blowup complex. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 323–331. ACM, 2015.

McInnes, L., Healy, J., and Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv e-prints, art. arXiv:1802.03426, 2018.

Mémoli, F. and Sapiro, G. Comparing point clouds. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH Symposium on Geometry Processing (SGP), pp. 32–40, New York, NY, USA, 2004. ACM.

Paul, R. and Chalup, S. K. A study on validating non-linear dimensionality reduction using persistent homology. Pattern Recognition Letters, 100:160–166, 2017.

Peyré, G. Manifold models for signals and images. Computer Vision and Image Understanding, 113(2):249–260, 2009.

Poulenard, A., Skraba, P., and Ovsjanikov, M. Topological function optimization for continuous shape matching. Computer Graphics Forum, 37(5):13–25, 2018.

Ramamurthy, K. N., Varshney, K., and Mody, K. Topological data analysis of decision boundaries with application to model selection. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5351–5360. PMLR, 2019.

Reininghaus, J., Huber, S., Bauer, U., and Kwitt, R. A stable multi-scale kernel for topological machine learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4741–4748, 2015.

Rieck, B. and Leitte, H. Persistent homology for the evaluation of dimensionality reduction schemes. Computer Graphics Forum, 34(3):431–440, 2015.

Rieck, B. and Leitte, H. Agreement analysis of quality measures for dimensionality reduction. In Carr, H., Garth, C., and Weinkauf, T. (eds.), Topological Methods in Data Analysis and Visualization IV. Springer, Cham, Switzerland, 2017.

Rieck, B., Bock, C., and Borgwardt, K. A persistent Weisfeiler–Lehman procedure for graph classification. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5448–5458. PMLR, 2019a.

Rieck, B., Togninalli, M., Bock, C., Moor, M., Horn, M., Gumbsch, T., and Borgwardt, K. Neural persistence: A complexity measure for deep neural networks using algebraic topology. In International Conference on Learning Representations (ICLR), 2019b.

The scikit-optimize contributors. scikit-optimize/scikit-optimize: v0.5.2, March 2018. https://doi.org/10.5281/zenodo.1207017.

Sheehy, D. R. Linear-size approximations to the Vietoris–Rips filtration. Discrete & Computational Geometry, 49(4):778–796, 2013.

Tenenbaum, J. B., De Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

van der Maaten, L. J. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
van der Maaten, L. J., Postma, E. O., and van den Herik, H. J. Dimensionality reduction: A comparative review. Technical Report 2009-005, Tilburg University, 2009.

Venna, J. and Kaski, S. Visualizing gene interaction graphs with local multidimensional scaling. In Proceedings of the 14th European Symposium on Artificial Neural Networks, pp. 557–562. d-side group, 2006.

Vietoris, L. Über den höheren Zusammenhang kompakter Räume und eine Klasse von zusammenhangstreuen Abbildungen. Mathematische Annalen, 97(1):454–472, 1927.

Wagner, H. and Dłotko, P. Towards topological analysis of high-dimensional feature spaces. Computer Vision and Image Understanding, 121:21–26, 2014.

Yan, L., Zhao, Y., Rosen, P., Scheidegger, C., and Wang, B. Homology-preserving dimensionality reduction via manifold landmarking and tearing. arXiv e-prints, art. arXiv:1806.08460, 2018.

Zomorodian, A. J. Fast construction of the Vietoris–Rips complex. Computers & Graphics, 34(3):263–271, 2010.