# Principle Component Trees and Their Persistent Homology

Ben Kizaric^{1,3}, Daniel Pimentel-Alarcón^{2,3}

^1 Department of Electrical Engineering, University of Wisconsin-Madison
^2 Department of Biostatistics, University of Wisconsin-Madison
^3 Wisconsin Institute for Discovery

benkizaric@gmail.com, pimentelalar@wisc.edu

**Abstract.** Low-dimensional models like PCA are often used to simplify complex datasets by learning a single approximating subspace. This paradigm has expanded to union-of-subspaces models, like those learned by subspace clustering. In this paper, we present Principal Component Trees (PCTs), a graph structure that generalizes these ideas to identify mixtures of components that together describe the subspace structure of high-dimensional datasets. Each node in a PCT corresponds to a principal component of the data, and the edges between nodes indicate the components that must be mixed to produce a subspace that approximates a portion of the data. In order to construct PCTs, we propose two angle-distribution hypothesis tests to detect subspace clusters in the data. To analyze, compare, and select the best PCT model, we define two persistent homology measures that describe their shape. We show that our construction yields two key properties of PCTs, namely ancestral orthogonality and non-decreasing singular values. Our main theoretical results show that learning PCTs reduces to PCA under multivariate normality, and that PCTs are efficient parameterizations of intersecting unions of subspaces. Finally, we use PCTs to analyze neural network latent spaces, word embeddings, and reference image datasets.

## 1 Introduction

Learning a low-dimensional structure that approximates a high-dimensional dataset is a fundamental problem in science and engineering, e.g., learning an approximating subspace using principal component analysis (PCA) (Pearson 1901; Hotelling 1933; Jolliffe 2002). This paper introduces a new approximating structure, which we call principal component trees (PCTs), of which PCA and generalized PCA (also known as subspace clustering (Elhamifar and Vidal 2013)) are special cases. PCTs build on the union-of-subspaces model, which assumes that each data sample lies near one of several low-dimensional subspaces. Subspace clustering is then the task of learning the structure of these subspaces, and has shown effectiveness in modeling many kinds of data (Vidal, Tron, and Hartley 2008; Mahmood and Pimentel-Alarcón 2022; Li, Zhao, and Zhu 2023). Baked into the majority of subspace clustering algorithms, such as the effective Sparse Subspace Clustering (SSC) method (Elhamifar and Vidal 2013), is the assumption that the subspaces containing the data are disjoint. That is, all points near the span of one subspace are not near the span of any other subspace. It is this disjointness assumption that motivates Principal Component Trees, which can represent a single subspace, multiple disjoint subspaces, and multiple intersecting subspaces.

*Figure 1: (Left) Two 2D subspace clusters with a 1D intersection and the corresponding Principal Component Tree. (Right) A PCT trained on Fashion-MNIST.*

In essence, a PCT is a graph summarizing the hierarchical structure of the principal components in the data. Each node of a PCT corresponds to a single principal component, defined by a principal vector and a singular value.
Each edge of a PCT links principal vectors that, together, define a set of subspaces that contain the data. Each of these subspaces is spanned by a branch of the PCT, i.e., the set of nodes that lie on the path between a leaf of the PCT and its root. Branches that share nodes correspond to intersecting subspaces (see Figure 1).

Our algorithm to learn a Principal Component Tree analyzes the data dynamically, recursively adding child nodes depending on the structure it detects. A node receives more than one child only if we detect disjoint subspace clusters. We make this determination with a nonparametric hypothesis test that compares the distribution of pairwise angles of the data points to the known distribution of angles under uniformity. If no disjoint structure is detected at any node, the PCT results in a degenerate tree; in this case, PCT reduces to PCA. Similarly, the union-of-subspaces model used by classical subspace clustering is the special case of PCT in which a disjoint structure is detected at the root and nowhere else. In general, data following a multivariate normal distribution will yield tall trees, and data exhibiting subspace-clustered structure will yield wide trees.

To quantitatively measure tree structure, we turn to tools from persistent homology that measure properties of graphs as they are filtered into sub-graphs. Specifically, we perform a height filtration, which measures how many nodes lie below a given height, and a width filtration, which measures the width of the tree as we prune the least significant nodes. These measures yield bounded, monotonic curves that summarize the subspace structure of the PCT and can be used to perform hypothesis tests between PCTs. Our experimental results show the effectiveness of both summaries on synthetic and real-world data, including multilingual word embeddings and the latent spaces of neural networks.

Our main theoretical contribution shows that any distribution that can be parameterized using PCA or generalized PCA can be equivalently represented with a PCT using as many or fewer parameters. Our second theoretical contribution relates to our nonparametric hypothesis test: we show that with high probability (w.h.p.) our test will accurately detect clusters of correlated data, which we treat as a proxy for subspace structure. As a corollary of this result, we show that w.h.p. our algorithm will result in a degenerate tree (i.e., will produce the same result as PCA) whenever the data is best represented by a single subspace.

Similar to PCA, our PCT enjoys several favorable properties. Firstly, the principal vector of each node is orthogonal to those of all of its ancestors and descendants, making the projection of data onto each node of the tree efficient and decomposable. We also show that the singular value of a node is less than or equal to that of its parent and ancestors. In standard PCA, the components corresponding to the largest singular values are the subset of components enabling the optimal projection approximation of the data points. We prove that the monotonically decreasing singular values give PCTs a similar property: the optimal approximating subtree of a PCT is the subtree containing the nodes with the largest singular values. The decreasing singular values are also essential to enable the aforementioned width-filtration topological measure.
## 2 Related Work

### Subspace Clustering

As mentioned, the task of subspace clustering is to partition a set of data such that each partition lies on the span of a low-dimensional subspace. While some works are principally interested in the clustering / partitioning of points, such as in motion segmentation (Vidal, Tron, and Hartley 2008) and hyperspectral imaging (Li, Zhao, and Zhu 2023), others are more interested in using the learned subspace structure for downstream tasks. For example, subspace clustering has been shown to be effective in matrix completion problems (Fan and Udell 2019) and as a feature extraction method (Hu et al. 2023; Kizaric and Pimentel-Alarcón 2022). Methods for performing subspace clustering utilize sparse self-expressiveness (Elhamifar and Vidal 2013), expectation maximization (Lipor et al. 2021), deep learning (Zhu et al. 2019), and fusion over the Grassmannian (Mahmood and Pimentel-Alarcón 2022).

An important distinction among subspace clustering techniques is the difference between axis-parallel and linear methods. In axis-parallel methods such as CLIQUE (Agrawal et al. 1998) and (Yuan et al. 2013), the subspaces are assumed to be spanned by a discrete subset of the variables / features of the dataset. In linear methods, the subspaces are allowed to span any linear combination of the variables, in a continuous space known formally as the Grassmannian. As illustrated in Figure 1, the nodes of Principal Component Trees represent arbitrary directions in the data, and as such PCTs are a generalization of linear subspace clustering. Although axis-parallel clustering is comparatively well studied, its discrete search space enables an entirely different class of algorithms, so we focus our comparisons only on linear methods.

Of particular interest to this paper are hierarchical subspace clustering techniques. These approaches come in top-down (Rafiezadeh Shahi et al. 2020; Wu et al. 2015) and bottom-up (Menon, Muthukrishnan, and Kalyani 2020) flavors. (Rafiezadeh Shahi et al. 2020) and (Wu et al. 2015) are similar to Principal Component Trees in that they are top-down methods that recursively partition the space into lower-dimensional subspaces, with the termination of node splitting based on a reconstruction error criterion. While the construction of PCTs is fundamentally a hierarchical subspace clustering algorithm, it is the only one whose nodes each represent a single basis direction, and that can generalize to PCA, disjoint subspaces, and intersecting subspaces. Similarly named similarity-search methods such as (McCartin-Lim, McGregor, and Wang 2012) and (Wichert 2009) exist, but these methods do not use subspace clustering.

### Angle Distributions

Established results on the distribution of pairwise angles in high-dimensional space are important to the PCT's adaptive splitting construction. These results were first established in (Cai, Fan, and Jiang 2013) and used to estimate intrinsic dimensionality in (Thordsen and Schubert 2022). (Menon, Muthukrishnan, and Kalyani 2020) uses pairwise angle distributions in subspace clustering, but for merging subspaces from the bottom up, unlike our top-down approach. Hypothesis tests of angle uniformity are central to directional statistics and are explored in (García-Portugués, Navarro-Esteban, and Cuesta-Albertos 2023).

### Dictionary Learning and Sparse Coding

Primarily a compression and denoising technique, dictionary learning and sparse coding seek a dictionary $V$ and sparse codes $U$ that minimize $\|X - UV\|_F + \alpha\|U\|_1$.
Implementations utilize autoencoders (Ng et al. 2011), alternating optimization (Tošić and Frossard 2011), and other techniques. Structured sparse coding places additional constraints on $U$ and $V$ in order to form advantageous structures. For example, (Jenatton et al. 2010) forces the non-zero coefficients to form a pre-defined tree hierarchy, and its optimization procedure promotes orthogonality among the branches of dictionary atoms in the tree. The atoms in (Szlam, Gregor, and LeCun 2012) are placed into overlapping groups which are learned from data, but do not form a tree.

### Topological Data Analysis & Persistent Homology

Because of its ability to discover robust patterns in arbitrary distributions like networks and point clouds, topological data analysis (TDA) has emerged as a promising approach in medical imaging and other areas (Chazal and Michel 2021). Persistent homology is a sub-field of TDA that focuses on measuring properties of graphs as they are filtered down (Pun, Lee, and Xia 2022). While persistent homology is most commonly used to construct birth-death decomposition charts, Rips filtration (Bauer 2021) and graph filtration (Lee et al. 2012) are robust approaches that enable theoretically sound statistical inference. The key property of graph filtration is that it tracks the first two Betti numbers of the graph, which are monotonic over edge filtration. Of particular relevance to our analysis of PCTs are persistent-homology-based characterizations of tree structure. (Chung et al. 2015) build minimum spanning trees from structural covariance networks, then apply graph filtration to create curves used to compare network structure. Persistent homology has also been applied to tree-structured data: both (Li et al. 2017) and (Kanari et al. 2016) filter the binary tree structure of neurons and brain arteries based on height / distance to the neuron's cell body, and persistent structures are then described by a birth-death decomposition chart. Our analysis of PCTs borrows this height-filtration approach, but instead creates monotonic curves as in (Chung et al. 2015).

## 3 Principal Component Tree

Consider a data matrix $X$ whose columns $\{x\}$ represent samples in $\mathbb{R}^D$. A principal component tree $T$ to model data $X$ is defined by a set of nodes $N_i$, $i \in \{1, \dots, |T|\}$, and a set of directed edges $E_{ij}$ indicating that node $N_i$ is the parent of node $N_j$. Given node $N_i$, we denote its (unique) parent as $P_i$, its set of ancestors as $A_i$ (with $N_i \in A_i$), its set of children as $C_i$, and its set of descendants as $D_i$. We use $S_i$ to denote the set of siblings of $N_i$, defined as the set of nodes with the same parent as $N_i$. A node $N_i$ represents two objects: (a) its associated principal component, given by a unit-length vector $v_i \in \mathbb{R}^D$, and (b) its singular value $\sigma_i \geq 0$.

Much like PCA, each node's $v_i$ represents a direction of variation in the data, and its $\sigma_i$ is proportional to the amount of variation along this direction. However, unlike PCA, these quantities describe the variation directions of specific subsets of the data, not just the data as a whole. These subsets of data and their particular variation directions are determined by the nodes and their lineage.

*Figure 2: Sample tree and data assignment for one point: $x$ chooses nodes greedily, starting from the root down.*

To see this, define the ancestral basis associated to node $N_i$ as $V_i := \{ v_j \mid N_j \in A_i \}$. This basis spans a subspace that approximates a subset of the samples in $X$.
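To make the notation concrete, the following is a minimal Python sketch, with hypothetical names not drawn from the authors' implementation, of how a PCT node and its ancestral basis might be represented.

```python
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np


@dataclass
class PCTNode:
    """One node N_i of a Principal Component Tree (illustrative layout only)."""
    v: Optional[np.ndarray] = None      # principal vector v_i, unit length; None for the empty root
    sigma: float = 0.0                  # singular value sigma_i >= 0
    parent: Optional["PCTNode"] = None  # P_i
    children: List["PCTNode"] = field(default_factory=list)  # C_i

    def ancestors(self) -> List["PCTNode"]:
        """A_i: this node plus every non-root node on the path up to the root."""
        node, out = self, []
        while node is not None and node.v is not None:
            out.append(node)
            node = node.parent
        return out

    def ancestral_basis(self) -> np.ndarray:
        """V_i: a D x |A_i| matrix whose columns are the principal vectors of A_i."""
        return np.column_stack([n.v for n in self.ancestors()])
```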
Specifically, we assign each node of the tree a subset of the data $X_i$, defined recursively as

$$X_i := \left\{\, x \in X_{P_i} \;\middle|\; \arg\max_{j \in S_i} |\langle v_j, x \rangle| = i \,\right\}.$$

Intuitively, the subset of columns assigned to node $N_i$ is the subset of columns that were assigned to its parent and that are closer to the principal component of $N_i$ than to that of any of its siblings. See Figure 2 for a one-sample example. Notice that under these definitions, the bases and data assignments of ancestors and descendants are nested, i.e., for any $N_j \in A_i$,

$$\mathrm{span}(V_j) \subseteq \mathrm{span}(V_i) \quad\text{and}\quad X_j \supseteq X_i.$$

That is, the basis of an ancestor spans a subspace of the span of each of its descendants' bases, and the data assigned to an ancestor is a superset of the data assigned to its descendants.

### Approximating Data

We now show how Principal Component Trees can be used to approximate data in a fashion similar to PCA, that is, to create an approximation $\hat{X}$ of $X$. Whereas a PCA approximation can be computed by a single matrix operation, a PCT approximation is a union of approximations given by individual nodes of their respective assigned data. For a single node, its node approximation and node residual are

$$\hat{X}_i = V_i V_i^{\top} X_i \quad\text{and}\quad R_i := X_i - \hat{X}_i,$$

respectively. The node approximation is simply a node's assigned data projected onto its ancestral basis. Ordinary least-squares projection would also require computing $(V_i^{\top} V_i)^{-1}$, but our tree construction actually yields an orthonormal basis for every $V_i$ (Property 6.1). The node residuals are used extensively in our tree construction; the descendants of $N_i$ are learned to best approximate $R_i$. Intuitively, the number of children of $N_i$ is the number of disjoint subspace clusters detectable in $R_i$, and the principal components of those children are the first singular vectors of those disjoint subspaces.

To compute the approximation of the entire dataset, we make use of the fact that for leaves of the tree (nodes with no children), the data assignments are disjoint. Furthermore, the subspaces defined by leaf nodes are not nested. More exactly, for two leaf nodes $N_\ell$ and $N_g$,

$$\mathrm{span}(V_\ell) \nsubseteq \mathrm{span}(V_g) \quad\text{and}\quad X_\ell \cap X_g = \varnothing.$$

Now that we have single-node approximations and have shown that the data is disjointly partitioned among the leaf nodes, we can compute the full approximation of the data $X$ as

$$\hat{X} := \bigcup_{\ell \in N_L} \hat{X}_\ell,$$

where $N_L$ is the set of all leaves of the tree.
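Continuing the sketch above (same hypothetical `PCTNode`), the greedy data assignment, the node approximation, and the leaf-wise full approximation could be computed roughly as follows. This is an illustration of the definitions, not the reference implementation, and it assumes the tree has at least one non-root node.

```python
import numpy as np


def iter_nodes(node: PCTNode):
    """Yield every node in the subtree rooted at `node`."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def assign(node: PCTNode, X_assigned: np.ndarray) -> None:
    """Recursively split X_i among the children of N_i by greedy |<v_j, x>| comparison."""
    node.X = X_assigned                                        # X_i, attached dynamically for brevity
    if not node.children:
        return
    V_kids = np.column_stack([c.v for c in node.children])     # one column per child
    winner = np.argmax(np.abs(V_kids.T @ X_assigned), axis=0)  # index of the closest child per sample
    for j, child in enumerate(node.children):
        assign(child, X_assigned[:, winner == j])


def node_approximation(node: PCTNode) -> np.ndarray:
    """X_hat_i = V_i V_i^T X_i; V_i is orthonormal (Property 6.1), so no inverse is needed."""
    V = node.ancestral_basis()
    return V @ (V.T @ node.X)


def full_approximation(root: PCTNode, X: np.ndarray) -> np.ndarray:
    """Union of the leaf approximations; the leaves partition the columns of X.
    Columns are returned grouped by leaf, not in the original column order."""
    assign(root, X)
    leaves = [n for n in iter_nodes(root) if not n.children]
    return np.column_stack([node_approximation(leaf) for leaf in leaves])
```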
## 4 Subspace Structure Test

### Testing the Presence of Subspace Clusters

Key to our construction of the Principal Component Tree is our adaptive splitting algorithm, which builds the tree top to bottom, expanding node $N_i$ with either (a) a single child at a time, if its data residuals $R_i$ show no sign of subspace structure, or (b) multiple children, if we detect sufficient subspace structure in $R_i$. To make this determination we use the Cramér-von Mises test to compare the distribution of the angles of the residuals $R_i$ with the known angle distribution under uniformity on the hypersphere.

We motivate the use of this test as follows. If the data is subspace clustered, there will be directions in the space that are crowded with points and directions that are comparatively empty. This non-uniform use of the space causes the distribution of angles between points of subspace-clustered data to differ from the known distribution for uniform points. We treat the degree of angle non-uniformity as a proxy for the degree of subspace clusteredness. This measure of subspace clusteredness is summarized in a single p-value $p$, where a small $p$ indicates subspace clustering.

To compute this p-value, we require three things: 1) the empirical cumulative distribution function (CDF) $\hat{P}_D(C)$, which describes the observed distribution of angles of our $D$-dimensional residuals; 2) the CDF $P_D(C)$, which gives the expected distribution of angles under the null hypothesis of no subspace structure, i.e., uniformity; and 3) the Cramér-von Mises test statistic $Q$, which quantifies the difference between $\hat{P}_D(C)$ and $P_D(C)$, and therefore the degree of subspace clustering. For notational clarity, we henceforth describe the test as applied to a generic data matrix $X$ whose columns $\{x_1, \dots, x_k, \dots, x_j, \dots, x_N\}$ represent samples in $\mathbb{R}^D$.

To perform this test, we first measure the angle distribution of our data. More precisely, we measure the distribution of absolute cosine similarities between pairs of points of $X$. Taking all pairs, we compute the $\frac{N(N-1)}{2}$ angles of $X$,

$$\angle X := \left\{ \frac{|\langle x_k, x_j \rangle|}{\|x_k\|\,\|x_j\|} \;:\; k < j \right\}, \qquad \angle x_i \in [0, 1].$$

We use the absolute cosine similarity because subspaces pass through the origin: all points on a line through the origin should be deemed similar, regardless of whether they are on the same side of the origin. We then describe the distribution of angles via the empirical distribution function

$$\hat{P}_D(C) := \frac{2}{N(N-1)} \sum_{i=1}^{N(N-1)/2} \mathbb{1}\{\angle x_i \leq C\}.$$

In order to determine whether our data $X$ exhibits subspace clusteredness, we need to compare the observed distribution of angles to an expected null distribution. In this test, our null hypothesis is that our data has angular uniformity: that the unit-length-normalized data is sampled uniformly from the hypersphere embedded in $\mathbb{R}^D$. For example, the isotropic Gaussian distribution has angular uniformity. Thankfully, the expected distribution of pairwise angles on the hypersphere has already been derived by (Cai, Fan, and Jiang 2013) for standard angles $\theta \in [0, 2\pi]$ and by (Thordsen and Schubert 2022) for absolute cosine similarities $C \in [0, 1]$. The latter is a simple transformation of the univariate beta distribution, whose CDF is

$$P_D(C) = 2\, F_{\mathrm{Beta}}\!\left( \tfrac{1+C}{2};\ \alpha = \tfrac{D-1}{2},\ \beta = \tfrac{D-1}{2} \right) - 1.$$

*Figure 3: Three datasets, their angle distributions, and subspace clustering test results. $X_A \in \mathbb{R}^3$: an isotropic normal distribution. $X_B \in \mathbb{R}^3$: data lying on two 2D subspaces with a 1D intersection, labeled $v$. $X_C \in \mathbb{R}^2$: a union of two disjoint 1D subspaces.*

With the observed and expected (null) distributions in hand, we can finally compare them via the Cramér-von Mises test. This test quantifies the distance between the two distributions through the test statistic

$$Q = \frac{N(N-1)}{2} \int_0^1 \left( P_D(C) - \hat{P}_D(C) \right)^2 \, dC.$$

The known domain of $C$ allows us to bound the otherwise improper integral of the Cramér-von Mises statistic. The computation of the final p-value $p$ is given in (Csörgő and Faraway 1996), equation 1.8.

**Practical Considerations.** For simplicity and interpretability, we leave three important implementation details about this and the following test to the technical appendix. 1) In order to reasonably compare angle distributions to those of hyperspheres, we first whiten $X$ before taking the angle distribution. 2) This explicit whitening makes the expected angle distribution $P_D(C)$ dependent on $N$, so we use Monte Carlo simulation to estimate $P_D(C)$ when $N < 100D$. 3) Whenever comparing the subspace clusteredness of two data matrices, we first project both to a common dimension.
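Setting those refinements aside, the core of this test can be sketched with numpy and scipy as follows. The names are illustrative and the whitening and Monte Carlo corrections described above are omitted; under the null hypothesis the transformed angles are uniform on [0, 1], so the one-sample Cramér-von Mises test can be run against the uniform CDF.

```python
import numpy as np
from scipy.stats import beta, cramervonmises


def null_angle_cdf(c: np.ndarray, d: int) -> np.ndarray:
    """CDF of the absolute cosine similarity of two uniformly random directions in R^d:
    P_d(C) = 2 * BetaCDF((1 + C) / 2; (d - 1) / 2, (d - 1) / 2) - 1."""
    a = (d - 1) / 2.0
    return 2.0 * beta.cdf((1.0 + np.asarray(c)) / 2.0, a, a) - 1.0


def cluster_test(X: np.ndarray) -> float:
    """p-value of the Cramer-von Mises test of angular uniformity for the columns of X (D x N).
    A small p-value suggests subspace-clustered structure."""
    d = np.linalg.matrix_rank(X)
    Xu = X / np.linalg.norm(X, axis=0, keepdims=True)   # unit-normalize each sample
    A = np.abs(Xu.T @ Xu)                               # pairwise absolute cosine similarities
    angles = A[np.triu_indices_from(A, k=1)]            # the N(N-1)/2 pairwise angles
    u = null_angle_cdf(angles, d)                       # probability integral transform
    return cramervonmises(u, "uniform").pvalue
```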
### Testing an Intersection of Subspace Clusters

In the previous subsection, we stated that we will split a node $N_i$ in two if we detect subspace clusters in its residuals $R_i$. This is a simplification; if we only do this, we may split nodes prematurely and leave our PCT construction unable to find intersections of subspaces. For example, take $X_B$ and $X_C$ in Figure 3. Both $X_B$ and $X_C$ have a very small p-value, small enough to correctly conclude that there is subspace clustering in the data. However, if we split $X_B$ right away, we would miss the fact that the two subspaces have a prominent 1D intersection (labeled $v$).

In order to build PCTs with the capacity to represent intersecting subspaces, we now present a test that extends the one of the previous subsection. This test evaluates whether potential subspace clusters plausibly intersect along a given axis / basis $v$. We propose that if some data $X$ projected onto the null space of $v$ appears more subspace clustered than the original data, then $v$ is a plausible intersection of subspace clusters present in $X$. More exactly, our test INTERSECT-TEST$(X, v)$ returns true if

$$\text{CLUSTER-TEST}(X - vv^{\top}X) < \text{CLUSTER-TEST}(X).$$

Returning to the example of Figure 3, notice that $X_C$ is actually the projection of $X_B$ onto the null space of $v$, and that $X_C$ has an even smaller p-value than $X_B$. In this case, we would conclude that $v$ is a plausible subspace intersection of $X_B$.

**Algorithm 1: CLUSTER-TEST**
Input: $X \in \mathbb{R}^{D \times K}$.
Output: $p$, the p-value of the Cramér-von Mises test, i.e., the probability that $X$ has angular uniformity; a sufficiently low $p$ indicates subspace clustering.
1. Let $r = \mathrm{RANK}(X)$.
2. Let $X_w = \mathrm{WHITEN}(X)$.
3. Let $X_{\mathrm{unit}} = \{ x_k / \|x_k\| \text{ for } x_k \in X_w \}$.
4. Let $A_{K \times K} = |X_{\mathrm{unit}}^{\top} X_{\mathrm{unit}}|$.
5. Let $\angle X = \{ A_{ij} : i > j \}$, with each $\angle x_i \in [0, 1]$.
6. Let $p = \text{CVM-TEST}(\hat{P} = \angle X,\ P = P_r(C))$.
7. Return $p$.

**Algorithm 2: EXPAND-NODE**
Input: a node $N_i$ and residuals $R_i$.
Output: either one or two child nodes of $N_i$.
1. Let $V_1(X) \mapsto (v_1, \sigma_1)$ denote $X$'s first singular vector and singular value.
2. Let $(v_1, \sigma_1) = V_1(R_i)$.
3. Let isClustered = $\big[\text{CLUSTER-TEST}(R_i) < \alpha_{\text{test}}\big]$.
4. Let hasIntersection = INTERSECT-TEST$(R_i, v_1)$.
5. If isClustered and not hasIntersection:
   - $\{R_{i,A}, R_{i,B}\} \leftarrow$ SUBSPACE-CLUSTER$(R_i)$;
   - $(v_A, \sigma_A) = V_1(R_{i,A})$; $(v_B, \sigma_B) = V_1(R_{i,B})$;
   - return NODE$(v_A, \sigma_A, N_i)$ and NODE$(v_B, \sigma_B, N_i)$.
6. Otherwise, return NODE$(v_1, \sigma_1, N_i)$.

## 5 Methodology

### Building a Principal Component Tree

Now that we have defined the structure of a Principal Component Tree in Section 3 and two important building blocks in Section 4, we can present our algorithm to construct a PCT. Put simply, we build the tree by selecting the leaf node with the largest residuals,

$$N_{\text{next}} := \arg\max_{i \in N_L} \|R_i\|_F,$$

and expanding it by giving it one or two children. The process is illustrated in Figure 4. We start with an empty root that describes no variation in the data, and we expand nodes until we have reached our target size. The two tests of subspace structure presented in Section 4 are used to determine how many children to grant $N_{\text{next}}$. In essence, if CLUSTER-TEST$(R_{\text{next}})$ finds no subspace-clustered structure in $R_{\text{next}}$, we give it one child. We also give it one child if we deem that $v$, the first singular vector of $R_{\text{next}}$, is common to subspace clusters in $R_{\text{next}}$, as determined by INTERSECT-TEST$(R_{\text{next}}, v)$. This case is like digging deeper through the space in the hope that there are even more disjoint subspace clusters somewhere in the null space of $V_{\text{next}} \cup \{v\}$. We will only split a node in two when we believe that $R_{\text{next}}$ lies on disjoint subspace clusters: when the data is sufficiently subspace clustered and $v$ is not an intersection. See Algorithm 2 for a formal specification.
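For illustration, here is a rough Python rendering of INTERSECT-TEST and of the expansion step of Algorithm 2, reusing `cluster_test` and `PCTNode` from the sketches above. `subspace_cluster` is a hypothetical stand-in for whatever binary subspace clustering routine is plugged in; none of this should be read as the authors' reference implementation.

```python
import numpy as np


def leading_singular(X: np.ndarray):
    """First left singular vector and first singular value of X."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, 0], s[0]


def intersect_test(X: np.ndarray, v: np.ndarray) -> bool:
    """True if projecting v out of X makes the data look more subspace clustered,
    i.e. v is a plausible intersection of subspace clusters in X."""
    X_null = X - np.outer(v, v) @ X          # project X onto the null space of v
    return cluster_test(X_null) < cluster_test(X)


def expand_node(node: PCTNode, R: np.ndarray, alpha_test: float = 0.05):
    """Grant `node` one child (no split) or two children (split), mirroring Algorithm 2."""
    v1, s1 = leading_singular(R)
    if cluster_test(R) < alpha_test and not intersect_test(R, v1):
        R_a, R_b = subspace_cluster(R)       # hypothetical binary subspace clustering helper
        kids = [PCTNode(*leading_singular(R_a), parent=node),
                PCTNode(*leading_singular(R_b), parent=node)]
    else:
        kids = [PCTNode(v1, s1, parent=node)]
    node.children.extend(kids)
    return kids
```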
By only evaluating the leading singular vector of $R_{\text{next}}$ as a potential subspace intersection of $R_{\text{next}}$, we ensure three desirable properties: a) only strong intersections will be detected, and not intersections that may arise from noise; b) the non-decreasing singular value property will be enforced, which is needed to select the optimal subtree of $T$ and for the width filtration; and c) the PCT will reduce to PCA in the case of Gaussian data.

The PCT construction algorithm uses a few hyperparameters. Firstly, we greedily build the tree up to a maximum number of nodes $|T|_{\max}$. There is no inherent maximum size of the tree, but the maximum height of the tree is the data dimension $D$. We also specify a significance level $\alpha_{\text{test}}$ (e.g., $0.05$), which specifies how confident the test must be that there is subspace-clustered / angle-non-uniform structure in order to split a node in two. This parameter directly influences the shape of the tree; a smaller $\alpha_{\text{test}}$ means nodes are more likely to have only one child, resulting in taller trees. This parameter is also used in Theorem 6.2 to lower-bound the probability that the PCT will reduce to PCA when trained on Gaussian / elliptical data. If $\alpha_{\text{test}} = 0$, no nodes will be split, and the PCT will always reduce to PCA. Finally, when we split a node, we use an arbitrary binary subspace clustering algorithm to partition its residuals, which we denote $\mathrm{SC}(R_i) \rightarrow \{R_{i,A}, R_{i,B}\}$. The basis vectors and singular values of the two new nodes emerging from this split are the first singular vectors and values of $R_{i,A}$ and $R_{i,B}$. As a structure, the PCT supports arbitrary numbers of children, but we focus only on the binary case in this paper.

### Persistent Homology Analysis

Thus far, we have defined the structure and properties of Principal Component Trees and outlined an algorithm to construct them using two novel hypothesis tests. In this section, we discuss how to analyze and compare the structure of PCTs so that we may use them as a tool to inspect the structure of high-dimensional data. To this end, we introduce two interpretable topological measures that describe the shape of PCTs.

*Figure 4: A sample PCT as it is constructed.*

Inspired by (Chung et al. 2015), both of these measures take the form of bounded, monotonic curves measuring the size or shape of a tree as nodes are pruned away. As a guiding principle, the more subspace clustered $X$ is, the more nodes will be split into multiple children, and the wider the tree will be (it will have more leaves). Conversely, the less subspace clustered $X$ is, the taller the tree will be. In the case where $X$ follows a multivariate normal distribution, the tree will be degenerate and maximally tall. Our two topological measures quantify this intuition by defining curves that describe the height and width of the tree. In all calculations below, the empty root node is left out.

**Height Curves.** The first topological measure of a PCT is the tree's height curve. Put simply, the height curve characterizes the distribution of node heights, similar to a CDF. For a tree $T$, it is defined as

$$H_T(n) = \sum_{i=1}^{|T|} \mathbb{1}\{h_i \leq n\},$$

where $h_i$ is the height of node $N_i$. Intuitively, the height curve will be higher for more subspace-clustered trees: more subspace clusters lead to more nodes with multiple children, which leads to more nodes at lower heights. The height curve is lower bounded by $H_T(n) = n$, which occurs only for degenerate / PCA PCTs, where there is exactly one node at each height. The height curve is upper bounded by $H_T(n) = |T|$; only trees where all $|T|$ nodes are children of the root attain this bound, in which case the tree models a union of $|T|$ one-dimensional disjoint subspace clusters.
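As an illustration, reusing the hypothetical `PCTNode` and `iter_nodes` helpers from the earlier sketches, the height curve and its area reduce to a few lines.

```python
import numpy as np


def height_curve(root: PCTNode) -> np.ndarray:
    """H_T(n) for n = 1 .. |T|: the number of nodes at height <= n (empty root excluded)."""
    heights = np.array([len(n.ancestors()) for n in iter_nodes(root) if n.v is not None])
    return np.array([np.sum(heights <= n) for n in range(1, len(heights) + 1)])


def height_auc(root: PCTNode) -> int:
    """Area under the height curve: a one-number summary of subspace clusteredness."""
    return int(height_curve(root).sum())
```

The width curve defined next can be computed analogously, after restricting the tree to the $n$ nodes with the largest singular values.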
**Width Curves.** We also propose width curves as a topological descriptor of PCT shape. For this curve, we measure the tree width (number of leaves) of the subtree of $T$ consisting of only the most significant nodes, according to their singular values. Formally, the subtree consisting of the $n$ most significant nodes is

$$T_{[n]} = \Big( \{ N_i : \sigma_i \geq \sigma_{(n)} \},\ \{ E_{ij} : \sigma_i, \sigma_j \geq \sigma_{(n)} \} \Big), \qquad (1)$$

where $\sigma_{(n)}$ is the $n$th largest singular value of the nodes in $T$. In Property 6.3 we prove that $T_{[n]}$ is the best approximating subtree of $T$. With this subtree defined, a PCT's width curve is

$$W_T(n) = \sum_{i=1}^{|T|} \mathbb{1}\{ N_i \in T_{[n]} \ \wedge\ C_i \cap T_{[n]} = \varnothing \},$$

that is, the number of nodes that are both in $T_{[n]}$ and leaves of $T_{[n]}$. It may be easier to imagine recording the width of the tree as the least significant nodes are removed.

*Figure 5: Five example PCTs with 6 nodes each, and the corresponding height/width curves. Shading indicates singular value. A and E are maximally and minimally subspace clustered. B and C differ only in singular values, which is reflected in the width curve only.*

Similar to height curves, width curves will be higher for more subspace-clustered trees. Also like height curves, width curves are lower bounded by $W_T(n) = 1$ and upper bounded by $W_T(n) = n$, for degenerate trees and maximally wide trees, respectively.

**Area Under Curves.** We have just shown that $H_T(n)$ and $W_T(n)$ are monotonic and maximized under a large degree of subspace clusteredness. Therefore, the area under each curve is an effective one-number summary of the subspace clusteredness of the dataset used to construct a tree. The areas under the height and width curves are

$$\mathcal{H}_T = \sum_{n=1}^{|T|} H_T(n) \quad\text{and}\quad \mathcal{W}_T = \sum_{n=1}^{|T|} W_T(n).$$

In Section 7 we perform hypothesis testing with these summaries to show differences in subspace clusteredness between various datasets.

## 6 Main Theoretical Results

In this section we present two theoretical results that demonstrate the representational capacity of Principal Component Trees and the ability of their construction to perform PCA under the right conditions. Detailed proofs are in the appendix.

**Theorem 6.1** *Any union of subspaces, or distribution lying on a union of subspaces, can be equivalently represented by a PCT using the same number of parameters or fewer.*

*Proof Sketch.* We prove this result for a union of two subspaces. Any mixture of two $d$-dimensional subspaces requires $2d$ vector parameters. But if these subspaces have a $c$-dimensional intersection, we can a) find that intersection analytically, and b) create a PCT representing that same mixture with only $2d - c$ parameters.

**Theorem 6.2** *When presented with data following a multivariate normal distribution, PCT construction will reduce to PCA, yielding a degenerate tree, with probability at least $(1 - \alpha_{\text{test}})^D$.*

*Proof Sketch.* If the data is truly multivariate normal, then all projection residuals will also be multivariate normal. Therefore, when we test for subspace clustering in $R_i$ at any node, we will erroneously split the node with probability $\alpha_{\text{test}}$. Repeating this test $D$ times, we will never split a node with probability $(1 - \alpha_{\text{test}})^D$.

We also prove that our PCT learning algorithm leads to two simple but useful properties of PCTs. Firstly, nodes are orthogonal to their ancestors, making the calculation of data approximations fast and decomposable. Secondly, node singular values decrease with height.
Together, these two properties are used in Property 6.3, which provides a simple and efficient way to select the best approximating subtree of an existing tree by selecting its most significant nodes. This is akin to selecting the leading principal components in PCA.

**Property 6.1** *The singular vector of any node is orthogonal to those of its ancestors; formally, $\langle v_i, v_j \rangle = 0$ for all $N_j \in A_i \setminus \{N_i\}$. As a consequence, all ancestral bases are orthonormal.*

*Proof Sketch.* This property arises from the tree construction. Each $v_i$ is a singular vector of its parent's residual, which lies in the null space of its parent's ancestral basis $V_{P_i}$, so $v_i \perp V_{P_i}$. By induction, $v_i$ is orthogonal to all of its ancestors.

**Property 6.2** *A node's singular value is less than or equal to those of its ancestors. Formally, $\sigma_i \leq \sigma_j \ \forall\, N_j \in A_i$.*

*Proof Sketch.* We show this by contradiction. Consider a node $N_i$ and its parent $P_i$, and let $R$ be the residual used to create the parent $P_i$. Both $\sigma_{P_i}$ and $\sigma_i$ are derived from $R$, but $\sigma_{P_i}$ is its leading singular value. If $\sigma_i > \sigma_{P_i}$, the Eckart-Young theorem would be violated. Therefore $\sigma_i \leq \sigma_{P_i}$, and by induction $\sigma_i \leq \sigma_j$ for all $N_j \in A_i$.

**Property 6.3** *The best size-$n$ approximating subtree of $T$ is given by $T_{[n]}$ (Equation 1), i.e., by selecting the $n$ nodes with the largest singular values. More exactly,*

$$T_{[n]} = \arg\min_{T' \subseteq T,\ |T'| = n} \| X - \hat{X}_{T'} \|_F.$$

*Proof Sketch.* We prove this in three parts. 1) We show that Property 6.1 means the approximation error is inversely proportional to the sum of the singular values in the subtree. 2) We can therefore minimize the approximation error by selecting the nodes with the largest singular values. 3) Property 6.2 implies that $T_{[n]}$ will share the root of $T$ and will have no gaps, that is, $N_i \in T_{[n]} \Rightarrow N_j \in T_{[n]} \ \forall\, N_j \in A_i$.

*Figure 6: MNIST Digits vs. Fashion-MNIST. a) Example PCTs learned from each dataset, with singular values shaded. b) The average height and width curves from bootstrap sampling. c) Areas under the width and height curves among bootstrap samples.*

## 7 Experiments

With our persistent homology measures of tree shape defined, we now analyze the structure of some real-world high-dimensional datasets by comparing the structure of Principal Component Trees learned from those datasets. Our goal is not to discriminate between samples, but to conclude something about the clusteredness of the datasets themselves. In order to do this, we: 1) take the datasets we wish to compare and pre-process them by normalizing, but not completely whitening, their singular values, so that differences in PCT structure are purely the result of subspace structuredness rather than anisotropic covariance; 2) bootstrap sample each dataset into a number of equally sized sub-datasets, since differently sized bootstrap samples would affect the hypothesis test used to construct the trees; 3) construct a PCT and calculate $\mathcal{H}_T$ and $\mathcal{W}_T$ for each bootstrap sample; and 4) compare $\mathcal{H}_T$ and $\mathcal{W}_T$ across datasets using a Wilcoxon rank-sum test. We also plot these metrics on scatter plots.

**MNIST Digits vs. Fashion.** For our first experiment, we compare the widely known MNIST Digits and Fashion-MNIST datasets (Deng 2012; Xiao, Rasul, and Vollgraf 2017). On paper, they are very similar: 60,000 32x32 images with 10 balanced classes each. When we compare the subspace clusteredness of the digits and fashion datasets, the results are clear: Fashion-MNIST is much more subspace clustered. The results in Figure 6 tell the story: both the average height and width curves in Figure 6b are higher for the fashion dataset, and this is reflected in the areas under each curve.
Anecdotally, the sample tree for MNIST Digits (Figure 6a) shows a single dominant branch of nodes, whereas the fashion tree has two dominant branches.

### Neural Network Latent Space & Non-Linear Dimensionality Reduction

"What do neural networks learn?" is a long-running question in machine learning, and there is no shortage of answers in the literature. In this experiment, we address this question by analyzing the latent space of different neural network architectures via Principal Component Trees. Specifically, we compare the latent spaces of a multi-layer perceptron (MLP) classifier, a simple autoencoder (AE), and a variational autoencoder (VAE), each trained on the MNIST Digits dataset. We also include the digits after UMAP dimensionality reduction for comparison. All latent spaces are 64-dimensional.

*Figure 7: Subspace clusteredness of the latent spaces induced by neural network architectures (Original, AE, VAE, Classifier, UMAP). Top left: pairwise p-values for rank-sum hypothesis tests between latent spaces. Top right: $\mathcal{H}_T$ plotted against $\mathcal{W}_T$ for bootstrap samples. Bottom: mean $\mathcal{H}_T$ and $\mathcal{W}_T$ for the five latent spaces.*

To accomplish this, for each bootstrap dataset we: 1) train a network on that bootstrap dataset; 2) transform the bootstrap dataset by passing it through the first 2-3 layers of the network; and 3) construct a PCT from this transformed dataset.

The results of this experiment are summarized in Figure 7. We first notice that UMAP produces the most clustered embeddings, which is expected given its objective function and its effectiveness at enhancing high-dimensional clusters. Among the neural network latent spaces, the latent space of the classification model is much more subspace clustered than the original data or the other architectures. This is ultimately unsurprising: the end goal of this model is to transform the original data into one-hot vectors, which are maximally clustered under perfect accuracy. We posit that this objective leads to more subspace clustering in the intermediate layers of the model. Looking at the standard and variational autoencoders (labeled AE and VAE), the results are consistent with the objectives of these models. The subspace clusteredness of the original data is preserved in the latent space of the simple autoencoder, likely because it simply tries to accurately reproduce the training data. For the variational autoencoder, the latent space is less subspace clustered than the original data; this is expected, as the loss function of VAEs encourages the latent space to take the form of an isotropic Gaussian. For this comparison, $\mathcal{W}_T$ appears to be a more sensitive measure than $\mathcal{H}_T$.

**Multilingual word embeddings.** For our final experiment, we compare the subspace clusteredness of multilingual word embeddings in three languages. We use the embeddings presented in (Ferreira, Martins, and Almeida 2016), where each language's embeddings were trained on the TED 2020 dataset (Cettolo, Girardi, and Federico 2012), in which the same TED talks are translated into each language. To ensure our word embeddings cover similar concepts, we examine the 1000 most common words from each language.

*Figure 8: PCT analysis of word embeddings. Top left: pairwise p-values for rank-sum hypothesis tests between language embeddings. Top right: $\mathcal{H}_T$ plotted against $\mathcal{W}_T$ for bootstrap samples. Bottom: mean $\mathcal{H}_T$ and $\mathcal{W}_T$ for the three languages.*
Under these circumstances, we would expect the embeddings to be structurally similar across languages, but that is not what we see in Figure 8. Each of the three embeddings has a different degree of subspace clusteredness, with the Italian embeddings being the least and the Russian embeddings the most clustered.

## 8 Conclusion

In conclusion, this paper introduced Principal Component Trees (PCTs) as a framework for identifying mixtures of components that collectively represent the subspace structure within high-dimensional datasets. The proposed PCTs extend conventional low-dimensional models such as PCA and the union of subspaces by incorporating a graph structure. Each node in a PCT corresponds to a principal component, with edges representing the combinations of components required to form a subspace that approximates specific portions of the data. To construct PCTs, two angle-distribution hypothesis tests are introduced to detect intersecting subspace clusters within the data. Additionally, for the analysis, comparison, and selection of optimal PCT models, we use two novel and interpretable persistent homology measures to examine their shape. The constructed PCTs exhibit two interesting properties, ancestral orthogonality and non-decreasing singular values, which in turn facilitate the computation of the persistent homology measures. Theoretical results demonstrate that learning PCTs reduces to PCA under multivariate normality, and that Principal Component Trees serve as efficient parameterizations of intersecting unions of subspaces. Finally, we showcase the practical applicability of PCTs by analyzing neural network latent spaces, word embeddings, and reference image datasets. This research advances our understanding of subspace modeling and offers a powerful tool for dissecting complex high-dimensional datasets in various domains.

## References

Agrawal, R.; Gehrke, J.; Gunopulos, D.; and Raghavan, P. 1998. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 94–105.

Bauer, U. 2021. Ripser: efficient computation of Vietoris-Rips persistence barcodes. Journal of Applied and Computational Topology, 5(3): 391–423.

Cai, T. T.; Fan, J.; and Jiang, T. 2013. Distributions of angles in random packing on spheres. Journal of Machine Learning Research, 14: 1837.

Cettolo, M.; Girardi, C.; and Federico, M. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the Conference of the European Association for Machine Translation (EAMT), 261–268.

Chazal, F.; and Michel, B. 2021. An introduction to topological data analysis: fundamental and practical aspects for data scientists. Frontiers in Artificial Intelligence, 4: 667963.

Chung, M. K.; Hanson, J. L.; Ye, J.; Davidson, R. J.; and Pollak, S. D. 2015. Persistent homology in sparse regression and its application to brain morphometry. IEEE Transactions on Medical Imaging, 34(9): 1928–1939.

Csörgő, S.; and Faraway, J. J. 1996. The exact and asymptotic distributions of Cramér-von Mises statistics. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1): 221–234.

Deng, L. 2012. The MNIST database of handwritten digit images for machine learning research [best of the web].
IEEE Signal Processing Magazine, 29(6): 141–142.

Elhamifar, E.; and Vidal, R. 2013. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11): 2765–2781.

Fan, J.; and Udell, M. 2019. Online high rank matrix completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8690–8698.

Ferreira, D. C.; Martins, A. F.; and Almeida, M. S. 2016. Jointly learning to embed and predict with multiple languages. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2019–2028.

García-Portugués, E.; Navarro-Esteban, P.; and Cuesta-Albertos, J. A. 2023. On a projection-based class of uniformity tests on the hypersphere. Bernoulli, 29(1): 181–204.

Hotelling, H. 1933. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6): 417.

Hu, P.; Li, X.; Lu, N.; Dong, K.; Bai, X.; Liang, T.; and Li, J. 2023. Prediction of New-Onset Diabetes after Pancreatectomy with Subspace Clustering based Multi-view Feature Selection. IEEE Journal of Biomedical and Health Informatics.

Jenatton, R.; Mairal, J.; Obozinski, G.; and Bach, F. R. 2010. Proximal methods for sparse hierarchical dictionary learning. In ICML, volume 1, 2. Citeseer.

Jolliffe, I. T. 2002. Principal component analysis for special types of data. Springer.

Kanari, L.; Dłotko, P.; Scolamiero, M.; Levi, R.; Shillcock, J.; Hess, K.; and Markram, H. 2016. Quantifying topological invariants of neuronal morphologies. arXiv preprint arXiv:1603.08432.

Kizaric, B. A.; and Pimentel-Alarcón, D. L. 2022. Classifying Incomplete Data with a Mixture of Subspace Experts. In 2022 58th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 1–8. IEEE.

Lee, H.; Kang, H.; Chung, M. K.; Kim, B.-N.; and Lee, D. S. 2012. Weighted functional brain network modeling via network filtration. In NIPS Workshop on Algebraic Topology and Machine Learning, volume 3. Citeseer.

Li, Q.; Zhao, X.; and Zhu, H. 2023. Semi-supervised Sparse Subspace Clustering Based on Re-weighting. Engineering Letters, 31(1).

Li, Y.; Wang, D.; Ascoli, G. A.; Mitra, P.; and Wang, Y. 2017. Metrics for comparing neuronal tree shapes based on persistent homology. PLoS ONE, 12(8): e0182184.

Lipor, J.; Hong, D.; Tan, Y. S.; and Balzano, L. 2021. Subspace clustering using ensembles of k-subspaces. Information and Inference: A Journal of the IMA, 10(1): 73–107.

Mahmood, U.; and Pimentel-Alarcón, D. 2022. Fusion Subspace Clustering for Incomplete Data. In 2022 International Joint Conference on Neural Networks (IJCNN), 1–8. IEEE.

McCartin-Lim, M.; McGregor, A.; and Wang, R. 2012. Approximate principal direction trees. arXiv preprint arXiv:1206.4668.

Menon, V.; Muthukrishnan, G.; and Kalyani, S. 2020. Subspace clustering without knowing the number of clusters: A parameter free approach. IEEE Transactions on Signal Processing, 68: 5047–5062.

Ng, A.; et al. 2011. Sparse autoencoder. CS294A Lecture Notes, 72(2011): 1–19.

Pearson, K. 1901. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11): 559–572.

Pun, C. S.; Lee, S. X.; and Xia, K. 2022. Persistent-homology-based machine learning: a survey and a comparative study. Artificial Intelligence Review, 55(7): 5169–5213.
Rafiezadeh Shahi, K.; Khodadadzadeh, M.; Tusa, L.; Ghamisi, P.; Tolosana-Delgado, R.; and Gloaguen, R. 2020. Hierarchical sparse subspace clustering (HESSC): An automatic approach for hyperspectral image analysis. Remote Sensing, 12(15): 2421.

Szlam, A.; Gregor, K.; and LeCun, Y. 2012. Fast approximations to structured sparse coding and applications to object classification. In Computer Vision - ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V, 200–213. Springer.

Thordsen, E.; and Schubert, E. 2022. ABID: Angle Based Intrinsic Dimensionality - Theory and analysis. Information Systems, 108: 101989.

Tošić, I.; and Frossard, P. 2011. Dictionary learning. IEEE Signal Processing Magazine, 28(2): 27–38.

Vidal, R.; Tron, R.; and Hartley, R. 2008. Multiframe motion segmentation with missing data using PowerFactorization and GPCA. International Journal of Computer Vision, 79: 85–105.

Wichert, A. 2009. Subspace tree. In 2009 Seventh International Workshop on Content-Based Multimedia Indexing, 38–43. IEEE.

Wu, T.; Gurram, P.; Rao, R. M.; and Bajwa, W. U. 2015. Hierarchical union-of-subspaces model for human activity summarization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 1–9.

Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.

Yuan, X.; Ren, D.; Wang, Z.; and Guo, C. 2013. Dimension projection matrix/tree: Interactive subspace visual exploration and analysis of high dimensional data. IEEE Transactions on Visualization and Computer Graphics, 19(12): 2625–2633.

Zhu, P.; Hui, B.; Zhang, C.; Du, D.; Wen, L.; and Hu, Q. 2019. Multi-view deep subspace clustering networks. arXiv preprint arXiv:1908.01978.