Understanding Doubly Stochastic Clustering

Tianjiao Ding 1, Derek Lim 2, René Vidal 1, Benjamin D. Haeffele 1

1 Mathematical Institute for Data Science, Johns Hopkins University, USA. 2 Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract. The problem of projecting a matrix onto the space of doubly stochastic matrices finds several applications in machine learning. For example, in spectral clustering, it has been shown that forming the normalized Laplacian matrix from a data affinity matrix has close connections to projecting it onto the set of doubly stochastic matrices. However, the analysis of why this projection improves clustering has been limited. In this paper we present theoretical conditions on the given affinity matrix under which its doubly stochastic projection is an ideal affinity matrix (i.e., it has no false connections between clusters, and is well-connected within each cluster). In particular, we show that a necessary and sufficient condition for a projected affinity matrix to be ideal reduces to a set of conditions on the input affinity that decompose along each cluster. Further, in the subspace clustering problem, where each cluster is defined by a linear subspace, we provide geometric conditions on the underlying subspaces which guarantee correct clustering via a continuous version of the problem. This allows us to explain theoretically the remarkable performance of a recently proposed doubly stochastic subspace clustering method.

1. Introduction

Spectral clustering is a core technique in machine learning, allowing one to cluster data with relatively general geometric arrangements based on pairwise measures of similarity (or affinity) between data points. The steps of spectral clustering are well-known: 1) Define an affinity matrix whose (i, j)-th entry measures the similarity between points i and j; 2) Normalize the affinity matrix and compute a Laplacian; 3) Use eigenvectors of the Laplacian to define an embedding of the data; and 4) Cluster the embedding (a code sketch of these steps is given below).

Figure 1. Left: An affinity from a weighted stochastic block model of 3 clusters, where the probabilities of an inter-cluster and an intra-cluster connection are both 0.5, and the weights of inter- and intra-cluster connections are drawn from N(1, .1^2) and N(1.25, .1^2) respectively. Right: A projection of the affinity onto the set of doubly stochastic matrices under the ℓ2 metric, which has few false connections and recovers the 3 clusters. (Panels: Original Affinity; Doubly Stochastic.)

However, despite the relative simplicity of defining these steps, there are several implementation details that can have a considerable impact on the overall clustering performance. For example, the definition of the affinity matrix appears to be the most important choice one needs to consider for good performance. However, a key detail that often receives relatively little attention but can have a significant impact on performance is how to normalize the affinity matrix. Indeed, the two most popular spectral clustering methods, Ratio Cut (Hagen & Kahng, 1992) and Normalized Cut (NCut) (Shi & Malik, 2000), use two different normalizations of the affinity, which emerge as continuous relaxations of two different clustering objectives.
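To make the four steps above concrete, here is a minimal sketch of a standard spectral clustering pipeline in Python. This is an illustration rather than the paper's implementation: it assumes an RBF affinity, the symmetric normalized Laplacian, and k-means on a row-normalized spectral embedding, and the helper name spectral_clustering is hypothetical.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    """Cluster the rows of X into k groups via a standard spectral pipeline."""
    # 1) Affinity: RBF kernel on pairwise squared distances.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2 * sigma ** 2))

    # 2) Normalize and form the symmetric normalized Laplacian L = I - D^{-1/2} K D^{-1/2}.
    d = K.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(X)) - D_inv_sqrt @ K @ D_inv_sqrt

    # 3) Embedding: eigenvectors of L for the k smallest eigenvalues, row-normalized.
    _, vecs = eigh(L)
    emb = vecs[:, :k]
    emb = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)

    # 4) Cluster the embedding with k-means.
    return KMeans(n_clusters=k, n_init=10).fit_predict(emb)

# Example: three Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, size=(50, 2)) for c in (0.0, 1.0, 2.0)])
labels = spectral_clustering(X, k=3)
```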
An alternative point of view is provided by Zass & Shashua (2006), who argue that these methods for normalizing the affinity matrix are closely related to projecting the affinity onto the space of doubly stochastic matrices, with the difference between methods being what distance is minimized in the projection (footnote 3).

Footnote 3: For simplicity, we abuse the word distance to include other more general distance-like measures which do not necessarily satisfy all axioms of a distance (e.g., divergences).

With this context, recall that an ideal affinity matrix satisfies two properties: 1) Connectivity, i.e., the non-zero entries (connections) between points within a cluster form a fully connected graph; and 2) No false connections, i.e., there are no connections between points from two different clusters.

Prior work has studied conditions on the data under which the affinity matrix computed from the data satisfies some of these properties. For example, the problem of subspace clustering considers clustering data which is (approximately) supported on a union of low-dimensional linear subspaces, where each linear subspace defines a cluster. In this case, an affinity matrix is typically computed by expressing each data point as a linear combination of all other data points and enforcing sparse or low-rank properties on the matrix of coefficients, which is then used to define the affinity (Vidal et al., 2016). While several sufficient conditions on the data under which no false connections exist between points from different subspaces (Property 2) have been derived (Vidal et al., 2016), such conditions do not guarantee the connectivity of the affinity. This issue is discussed in (You et al., 2016), where the trade-off between sparsity and connectivity is studied experimentally. Moreover, even when the conditions on the data that guarantee no false connections in the affinity are violated, suitable normalization can still lead to an ideal affinity. Indeed, a number of ad hoc approaches have been proposed to normalize an affinity matrix beyond the standard normalization inherent to NCut or Ratio Cut (Liu et al., 2013; Ji et al., 2014; Elhamifar & Vidal, 2013), and it has even been argued that much of the benefit claimed by many proposed clustering algorithms can actually be attributed largely to ad hoc affinity normalization (Haeffele et al., 2020).

Seeking a more principled approach to affinity normalization, Lim et al. (2020) follow the interpretation of Zass & Shashua (2006) and show empirically that normalizing the affinity matrix by projecting it onto the space of doubly stochastic matrices under the ℓ2 distance achieves state-of-the-art performance across a wide variety of common clustering datasets. This is illustrated in Figure 1, which shows that a very noisy affinity matrix with numerous false connections between the 3 underlying clusters becomes nearly ideal following projection to a doubly stochastic matrix under the ℓ2 metric.

Contributions. In this work, we give a rigorous theoretical analysis of doubly stochastic projection of affinity matrices with the ℓ2 metric (Problem (2)) to help explain its empirical success. First, we prove a necessary and sufficient condition on the input affinity for the projected affinity to have no false connections (Theorem 2.2), even when the original affinity matrix might contain a large number of false connections.
While the optimality conditions couple the primal and dual variables from different clusters, which could complicate the analysis, the condition we provide relates quantities from decoupled sub-problems, each of which concerns the doubly stochastic projection of the entries within only one cluster. Further, under some models of the input affinity, this allows us to characterize conditions under which an input affinity with false connections will have no false connections and be well-connected within each cluster following doubly stochastic projection (Corollaries 2.3 and 2.4). Then, we specialize to the subspace clustering data model, where the data is assumed to be generated from a union of linear subspaces, each defining a cluster. For this setting, we develop a continuous problem (21), i.e., the limit in which the number of data points becomes very large and the points are uniformly distributed, followed by a continuous counterpart of the decoupling theorem (Theorem 3.2). This allows for an analysis of false connections (Theorem 3.5) and of the connectivity of the projected affinities (Theorem 3.8), which depends solely on the subspace dimensions, the percentage of points within each subspace, and the angles between the subspaces. In particular, we show that there will be no false connections in the normalized affinity matrix if the subspaces are sufficiently separated in angle, or have sufficiently low dimensions, or are well balanced in terms of mixture weights. Finally, we conduct a variety of experiments that illustrate our theoretical findings and demonstrate the utility of doubly stochastic projection in different settings.

2. Doubly Stochastic Clustering

2.1. Problem Formulation

To define the doubly stochastic clustering problem that we study, consider n data points drawn from k underlying clusters, where cluster l contains n_l points and n = n_1 + ... + n_k. Define the mixture weight of each cluster l as π_l = n_l / n. Given an affinity matrix K ∈ R^{n×n}, where K_ij = K_ji ≥ 0 denotes the similarity between data points i and j, the method of doubly stochastic clustering consists of two steps. First, one projects a scaled version of the given K onto the set of doubly stochastic matrices (footnote 4)

A_n := {A ∈ R^{n×n} : A_ij ≥ 0 for all i, j; A1 = Aᵀ1 = n1}

under some notion of distance d,

A⋆ = argmin_{A ∈ A_n} d(A, (1/η) K),  (1)

where 1 is the vector of all ones and η > 0 is a scaling parameter. After that, one runs spectral clustering on the obtained A⋆ to produce the final clustering. One immediate merit of projecting onto A_n is that one does not need to choose between Ratio Cut and NCut, since both are equivalent (footnote 5) on a doubly stochastic affinity A⋆.

Footnote 4: The usual definition of a doubly stochastic matrix requires the row and column sums to be 1. Here, without loss of generality, we define the sums to be n to simplify notation in our later analysis.

Footnote 5: That is, the two cuts yield the same Laplacian I − (1/n) A⋆, up to a global scaling.

Choice of the Distance d. As briefly mentioned in Section 1, classical spectral clustering methods correspond to projecting the given affinity K onto the set of doubly stochastic matrices A_n under different distances: Ratio Cut is closely related to projection under the ℓ1 metric, and NCut to projection under the KL divergence (Zass & Shashua, 2006). One interesting and perhaps natural alternative is to use the ℓ2 norm as a distance. Remarkably, doubly stochastic projection under the ℓ2 metric yields state-of-the-art performance in subspace clustering on a variety of realistic datasets (Lim et al., 2020).
This is largely due to the empirical observation that such a projection strikes a balance between removing false connections (Property 2) and maintaining connectivity (Property 1), via a parameter η that controls the sparsity of the output affinity A⋆. This control of sparsity is, however, not present when projecting with the ℓ1 norm (footnote 6) and is also absent when projecting with the KL divergence (footnote 7).

Footnote 6: Since the output affinity A⋆ differs from the input K only in its diagonal entries, as per Proposition 1 of (Zass & Shashua, 2006), the sparsity is almost unchanged.

Footnote 7: See Proposition 2 of (Zass & Shashua, 2006).

Importance of Connectivity. Note that the absence of false connections is insufficient to guarantee correct clustering. This is because the connections within one cluster may not form a connected subgraph, resulting in the cluster being over-segmented. Further, the stability of spectral clustering (Section 1) against perturbations of the affinity is associated with whether each cluster is sufficiently well connected (see Section 7.1 of Von Luxburg, 2007). Hence, the stability of the final clustering benefits from the affinity being as connected as possible within each cluster.

Based on the discussion above, this paper focuses on doubly stochastic projection under the ℓ2 metric, which we refer to as DS-D(K, η):

A⋆ = argmin_{A ∈ A_n} ‖A − (1/η) K‖_F².  (2)

2.2. Optimality Analysis

We first study the optimality conditions of DS-D(K, η). Since the optimization problem is strongly convex, a standard primal-dual analysis gives the following necessary and sufficient conditions for global optimality.

Proposition 2.1. The optimality conditions for DS-D(K, η) are

A⋆ = (1/η) [K − α⋆1ᵀ − 1α⋆ᵀ]_+,  (3)
(1/η) [K − α⋆1ᵀ − 1α⋆ᵀ]_+ 1 = n1,  (4)

where A⋆ ∈ R^{n×n} is the unique primal solution, α⋆ ∈ R^n is a dual solution which satisfies (4), and [·]_+ = max(·, 0) is applied entrywise.

Note that from the form of the primal solution A⋆, the no-false-connections property is equivalent to saying that K_ij ≤ α⋆_i + α⋆_j for all i, j coming from different clusters. As such, we would like to find lower bounds on the entries of α⋆ that give sufficient conditions under which A⋆ enjoys the no-false-connections property. Nevertheless, it is nontrivial to bound α⋆ directly in a meaningful way due to the coupling of the entries of α⋆ in (4). However, as we show in our next result, the no-false-connections property is satisfied if and only if the DS-D(K, η) problem can be decoupled into a collection of doubly stochastic projection problems on the intra-cluster portions of the affinity matrix.

Without loss of generality, assume that the rows/columns of K are sorted according to their cluster membership, i.e., K has diagonal blocks D^(1), ..., D^(k), so that the l-th diagonal block of K, D^(l) ∈ R^{n_l×n_l}, contains the intra-cluster affinities for cluster l (while the off-diagonal blocks contain the inter-cluster affinities). Further, let i ∼ j and i ≁ j denote that points i and j are in the same or in different clusters, respectively. With this notation we then have the following result.

Theorem 2.2. The following statements are equivalent:

1. For each cluster l, there exists a dual solution α^(l) ∈ R^{n_l} of DS-D(D^(l), η/π_l) such that, with ᾱ := [α^(1)ᵀ, ..., α^(k)ᵀ]ᵀ ∈ R^n, it holds that ᾱ_i + ᾱ_j ≥ K_ij for all i ≁ j.
2. Ā := diag((1/π_1) A^(1), ..., (1/π_k) A^(k)) is the unique primal solution of DS-D(K, η), where for each cluster l, A^(l) is the unique primal solution of DS-D(D^(l), η/π_l).
3. The unique primal solution of DS-D(K, η) has no false connections.

A proof is given in the Appendix. The above theorem gives necessary and sufficient conditions for the projected doubly stochastic affinity to have no false connections. Notably, in words, this theorem implies that if one solves a DS-D problem for each within-cluster block D^(l) of K, then the solution of DS-D for the entire matrix K has no false connections if and only if that solution can be formed by concatenating all of the within-cluster solutions into a block-diagonal matrix.
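As an illustration of Proposition 2.1 and Theorem 2.2 (not code from the paper), the following sketch solves DS-D(K, η) as a small quadratic program with CVXPY and numerically compares the full projection with the block-diagonal matrix assembled from the per-cluster projections DS-D(D^(l), η/π_l). The function name ds_projection and the toy affinity are hypothetical, a generic QP solver is assumed, and the comparison is only up to solver tolerance; for large n one would use a dedicated first-order method instead.

```python
import numpy as np
import cvxpy as cp

def ds_projection(K, eta):
    """Solve DS-D(K, eta): min_{A in A_n} ||A - K/eta||_F^2 with row/column sums equal to n."""
    n = K.shape[0]
    A = cp.Variable((n, n), nonneg=True)
    constraints = [cp.sum(A, axis=1) == n, cp.sum(A, axis=0) == n]
    cp.Problem(cp.Minimize(cp.sum_squares(A - K / eta)), constraints).solve()
    return A.value

# Toy affinity with two clusters: strong diagonal blocks, weak off-diagonal block.
rng = np.random.default_rng(0)
n1, n2, eta = 30, 20, 0.05
sizes = [n1, n2]
n = n1 + n2
K = 0.3 * rng.random((n, n))
K[:n1, :n1] += 1.0
K[n1:, n1:] += 1.0
K = (K + K.T) / 2

A_full = ds_projection(K, eta)

# Per-cluster projections DS-D(D^(l), eta/pi_l), rescaled by 1/pi_l and placed block-diagonally,
# as in statement 2 of Theorem 2.2.
A_blocks = np.zeros((n, n))
start = 0
for nl in sizes:
    pi_l = nl / n
    block = ds_projection(K[start:start + nl, start:start + nl], eta / pi_l)
    A_blocks[start:start + nl, start:start + nl] = block / pi_l
    start += nl

off_diag_mass = A_full[:n1, n1:].sum()   # mass of false connections in the full solution
print("off-diagonal mass:", off_diag_mass)
print("matches block-diagonal solution:", np.allclose(A_full, A_blocks, atol=1e-3))
```

When the separation conditions of Theorem 2.2 hold for the toy K, the two solutions coincide and the off-diagonal mass is (numerically) zero; otherwise the check reports the discrepancy.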
From this result, we can give several immediate corollaries for simple properties of the affinity matrix that are sufficient to guarantee the no-false-connections property.

Corollary 2.3 (Constant Intra-cluster Connections). Suppose K is such that for each cluster l, all intra-cluster connections have value µ_l, i.e., D^(l) = µ_l 1_{n_l} 1_{n_l}ᵀ. Then the unique primal optimal for DS-D(K, η) has no false connections and is fully connected within each cluster, as long as any connection in K between clusters p ≠ q has value at most (1/2)(µ_p + µ_q − η/π_p − η/π_q).

Proof. Suppose the upper bound on connections between clusters in K holds. We first show that statement 1 in Theorem 2.2 holds. Consider the problem DS-D(D^(l), η/π_l). From Proposition 2.1, it can be seen that α^(l) := (1/2)(µ_l − η/π_l) 1_{n_l} and A^(l) := (π_l/η)[D^(l) − α^(l)1_{n_l}ᵀ − 1_{n_l}α^(l)ᵀ]_+ are, respectively, a dual optimal and the primal optimal for this problem. Indeed, a row sum of A^(l) takes the form

Σ_j A^(l)_ij = n_l (π_l/η)(µ_l − (µ_l − η/π_l)) = n_l.  (6)

Now, letting ᾱ = [α^(1)ᵀ, ..., α^(k)ᵀ]ᵀ as in Theorem 2.2, note that if point i is in cluster p and point j is in cluster q ≠ p, then ᾱ_i + ᾱ_j = (1/2)(µ_p + µ_q − η/π_p − η/π_q). Since we assume that (1/2)(µ_p + µ_q − η/π_p − η/π_q) ≥ K_ij, statement 1 in Theorem 2.2 holds. It then follows from statement 2 that the primal optimal Ā for DS-D(K, η) is fully connected within cluster l, where each connection has strength 1/π_l, and from statement 3 that Ā has no false connections.

Corollary 2.4 (Constant Sum of Top Intra-cluster Connections). Suppose K is such that for each cluster l there exists an integer σ_l such that the σ_l-nearest-neighbour graph of D^(l) (footnote 8), denoted S^(l), satisfies the following, where S^(l)_i denotes the set of neighbours of point i in S^(l):

1. (1/σ_l) Σ_{j ∈ S^(l)_1} D^(l)_{1j} = ... = (1/σ_l) Σ_{j ∈ S^(l)_{n_l}} D^(l)_{n_l j} =: e_l;
2. for all (i, j) ∈ S^(l), D^(l)_ij ≥ e_l − nη/σ_l;
3. for all (i, j) ∉ S^(l), D^(l)_ij ≤ e_l − nη/σ_l.

Then the unique primal optimal for DS-D(K, η) has no false connections and is connected within cluster l along the σ_l-nearest-neighbour graph, as long as any connection in K between clusters p ≠ q has value at most (1/2)(e_p + e_q − nη/σ_p − nη/σ_q).

Footnote 8: Namely, S^(l) contains the indices of the largest σ_l entries of each row or column of D^(l) (recall K is symmetric).

Intuitively, the above corollaries suggest that as long as any two clusters have false connections smaller than the average of the largest intra-cluster connections, minus a gap that is proportional to the parameter η, the optimal doubly stochastic projection will have no false connections and will be well connected within each cluster.

The next corollary shows that the doubly stochastic projection given by the solution of DS-D(K, η) is invariant to perturbations by low-rank matrices of the form v1ᵀ + 1vᵀ for any v ∈ R^n, such as an elementwise perturbation by a constant c ∈ R. Thus, the doubly stochastic projection can remove certain additive corruptions of the input affinity K.

Corollary 2.5. For a vector v ∈ R^n, the primal optimal of DS-D(K, η) is the same as that of DS-D(K + 1vᵀ + v1ᵀ, η). In particular, adding a constant c ∈ R to each entry of K does not change the solution, i.e., DS-D(K + c11ᵀ, η) has the same primal optimal.

Proof. For any v ∈ R^n and A ∈ A_n, note that

⟨1vᵀ + v1ᵀ, A⟩ = Σ_{i,j} v_i A_ij + Σ_{i,j} v_j A_ij = n Σ_i v_i + n Σ_j v_j = 2n Σ_i v_i,  (7)

where we used that the rows and columns of A each sum to n; this is a constant independent of A.
Thus, the objectives of DS-D(K, η) and DS-D(K + 1vᵀ + v1ᵀ, η) differ only by a constant on A_n, so their minimizers coincide. The second part of the corollary follows from taking v = (c/2)1.

3. Doubly Stochastic Subspace Clustering

Given our above analysis for the general DS-D problem, we now consider a specific data model to provide additional analysis. Specifically, we will analyze the subspace clustering model, where the data are assumed to lie on a union of (low-dimensional) linear subspaces and the goal is to cluster data points based on which linear subspace they lie in. This assumption is a reasonable model for many real-world data problems (Vidal, 2011), possibly after a preprocessing of the data such as the scattering transform (Bruna & Mallat, 2013). Past work has theoretically studied this data model in various settings (Soltanolkotabi & Candès, 2012; Soltanolkotabi et al., 2014; You & Vidal, 2015), and recent empirical work that uses doubly stochastic projection has achieved state-of-the-art empirical results for subspace clustering problems (Lim et al., 2020).

Specifically, consider k subspaces {S_l}_{l=1}^k of R^D, each of dimension dim S_l =: d_l < D. Suppose each S_l contains n_l points X^(l) = [x^(l)_1, ..., x^(l)_{n_l}] lying on the unit sphere S^{D−1}. Let Φ = ∪_{l=1}^k S_l denote the union of the subspaces, and X = [X^(1), ..., X^(k)] ∈ R^{D×n} the collection of data. Given X, one performs subspace clustering using doubly stochastic projection by first computing an affinity matrix K ∈ R^{n×n} from X via kernel functions (footnote 9) or existing subspace clustering methods, and then solving the DS-D(K, η) problem. Then, one can perform spectral clustering on the solution to DS-D(K, η) and obtain a final clustering.

Footnote 9: That is, for some positive definite function κ : R^D × R^D → R, take K_ij = κ(x_i, x_j) for every i, j. Examples for κ include the Euclidean inner product and the radial basis function kernel.

3.1. Analysis of the Continuous Problem

Since directly tackling the (discrete) DS-D problem with finitely many data points from a union-of-subspaces model is nontrivial, we consider its continuous counterpart, in which the number of data points becomes infinitely large and the points are uniformly distributed. First, we define a discrete measure associated with the data X,

µ_X(z) = (1/n) Σ_{j=1}^n δ(z − x_j),  (10)

in which δ(·) is the Dirac function on the sphere S^{D−1}, such that for any f : S^{D−1} → R and any z_0 ∈ S^{D−1},

∫_{z ∈ S^{D−1}} f(z) δ(z − z_0) dµ_{S^{D−1}} = f(z_0),  (11)

where µ_{S^{D−1}} is the uniform measure on S^{D−1}. With the above definitions, one can view the discrete objective (2) as

Σ_{i,j} ((1/η) K(x_i, x_j) − A(x_i, x_j))² = n² E_{y∼µ_X} E_{z∼µ_X} {((1/η) K(y, z) − A(y, z))²},  (13)

where we assume K ∈ L²(S^{D−1} × S^{D−1}) is a kernel function. Similarly, the constraints in (2) can be written as

for all i: Σ_j A(x_i, x_j) = n E_{z∼µ_X}{A(x_i, z)} = n,  (15)–(16)
for all i: Σ_j A(x_j, x_i) = n E_{z∼µ_X}{A(z, x_i)} = n,  (17)
for all i, j: A_ij = A(x_i, x_j) ≥ 0.  (18)

Note further that the discrete measure can be separated as

µ_X(z) = Σ_{l=1}^k (n_l/n) µ_{X^(l)}(z) = Σ_{l=1}^k π_l µ_{X^(l)}(z).  (19)

In the continuous case, one replaces the discrete measure µ_X by its continuous counterpart

µ_∞(z) := Σ_{l=1}^k π_l µ_{S^{D−1}∩S_l}(z),  (20)

where µ_{S^{D−1}∩S_l} is the uniform measure on S^{D−1} ∩ S_l. This leads to the continuous problem DS-C(K, η):

min_A E_{y∼µ_∞} E_{z∼µ_∞} {((1/η) K(y, z) − A(y, z))²}  (21)
s.t. E_{z∼µ_∞}{A(y, z)} = 1 for a.e. y ∈ S^{D−1} ∩ Φ,  (22)
     E_{y∼µ_∞}{A(y, z)} = 1 for a.e. z ∈ S^{D−1} ∩ Φ,  (23)
     A(y, z) ≥ 0 for a.e. y, z ∈ S^{D−1} ∩ Φ.  (24)
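For concreteness, here is a small sketch (not from the paper) of how one might draw the finite-sample analogue of µ_∞, namely unit-norm points distributed uniformly on S^{D−1} ∩ S_l for randomly chosen subspaces S_l; the discrete measure µ_X built from such samples approximates µ_∞ as the number of points grows. The helper name is hypothetical.

```python
import numpy as np

def sample_union_of_subspaces(D, dims, n_per, rng=None):
    """Sample unit-norm points from random subspaces of R^D with the given dimensions.

    Returns the D x n data matrix X and the ground-truth cluster labels."""
    rng = np.random.default_rng() if rng is None else rng
    blocks, labels = [], []
    for l, (d_l, n_l) in enumerate(zip(dims, n_per)):
        # Random orthonormal basis U_l of a d_l-dimensional subspace.
        U_l, _ = np.linalg.qr(rng.standard_normal((D, d_l)))
        # Gaussian coefficients normalized to unit norm give the uniform
        # distribution on the unit sphere within the subspace.
        C = rng.standard_normal((d_l, n_l))
        X_l = U_l @ (C / np.linalg.norm(C, axis=0, keepdims=True))
        blocks.append(X_l)
        labels += [l] * n_l
    return np.hstack(blocks), np.array(labels)

# Example: three subspaces of dimensions 3, 5, 5 in R^20, 100 points each.
X, y = sample_union_of_subspaces(D=20, dims=[3, 5, 5], n_per=[100, 100, 100],
                                 rng=np.random.default_rng(0))
K = np.abs(X.T @ X)   # absolute inner product kernel, as considered in Section 3.2
```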
According to an analysis of quadratically regularized optimal transport problems (Lorenz et al., 2021), A⋆ ∈ L²(S^{D−1} × S^{D−1}) is a solution if and only if there exists a dual function α⋆ ∈ L²(S^{D−1}) such that

A⋆(y, z) = (1/η) [K(y, z) − α⋆(y) − α⋆(z)]_+,  (25)
E_{y∼µ_∞}{[K(y, z) − α⋆(y) − α⋆(z)]_+} = η,  (26)
E_{z∼µ_∞}{[K(y, z) − α⋆(y) − α⋆(z)]_+} = η,  (27)

where (25)–(27) are understood pointwise almost everywhere.

Remark 3.1. Problem (21) is strongly convex, hence it has a unique primal solution (footnote 10). However, infinitely many dual solutions may exist, due to [·]_+ zeroing out negative inputs.

Footnote 10: By uniqueness, we mean that any optimal solutions can only differ from each other on a set of measure zero with respect to µ_∞.

Similar to the case of the discrete problem (Section 2), the primal solution A⋆ satisfying the no-false-connections property (which we will equivalently refer to by saying the solution is subspace preserving) is equivalent to saying that K(y, z) ≤ α⋆(y) + α⋆(z) almost surely, for all y, z from different subspaces. Again, we can separate α⋆(y) for y from different subspaces to show equivalent conditions for when the subspace preserving property is satisfied.

Theorem 3.2. For each S_l, let α^(l) : S_l → R be a dual optimal and A^(l) the unique primal optimal of DS-C(K, η/π_l) with measure µ_{S^{D−1}∩S_l}. Define ᾱ : Φ → R such that for x ∈ S_l, ᾱ(x) = α^(l)(x) (footnote 11). Define Ā : Φ × Φ → R by Ā(y, z) = (1/π_l) A^(l)(y, z) if y, z ∈ S_l for some l, and Ā(y, z) = 0 otherwise. The following statements are equivalent:

1. ᾱ(y) + ᾱ(z) ≥ K(y, z) for all y ≁ z.
2. Ā is the unique primal optimal for DS-C(K, η).
3. The unique primal optimal for DS-C(K, η) is subspace preserving.

Footnote 11: For x lying in multiple subspaces, ᾱ(x) can be defined arbitrarily, since such x lie in a set of measure zero with respect to µ_{S^{D−1}∩S_l}, provided that the subspaces are independent, disjoint, or intersecting.

Figure 2. (a) Plot of ρ_d(α) with respect to α and dimension d ∈ {10, 100, 1000}. (b) Given two subspaces of mixture weights π_1 and 1 − π_1, each of dimension d, the subspaces must have first principal angle of at least θ_min to guarantee the subspace preserving property.

3.2. Example: Inner Product Kernel

To demonstrate the effect of doubly stochastic projection for clustering subspaces, we consider the simplest kernel for K, namely the inner product kernel K(y, z) = |⟨y, z⟩|. We first define the following quantity, which turns out to be useful in the analysis later on.

Definition 3.3 (Average height of spherical cap). For any dimension d, define ρ_d(α) = ∫_{z ∈ S^{d−1}} [|z_d| − 2α]_+ dµ_{S^{d−1}}.

Remark 3.4. ρ_d(α) is decreasing in α with d fixed, and decreasing in d with α fixed. Figure 2a shows ρ_d(α) for α ∈ [0, 0.5] and d ∈ {10, 100, 1000}. More notes on ρ_d(α) are provided in the Appendix.

Theorem 3.5 (Subspace preserving property). Let K be the inner product kernel. If for any two different subspaces S_p, S_q we have

max_{y ∈ S^{D−1}∩S_p, z ∈ S^{D−1}∩S_q} |⟨y, z⟩| ≤ ρ_{d_p}^{-1}(η/π_p) + ρ_{d_q}^{-1}(η/π_q),

then the subspace preserving property holds for A⋆.

Remark 3.6. Note that with η fixed, ρ_d^{-1}(η/π) is larger when the dimension d is smaller or when the mixture weight π is larger. Moreover, with a fixed subspace arrangement Φ and mixture weights {π_l}_{l=1}^k, the subspace preserving property is guaranteed for sufficiently small η, since lim_{x→0} ρ_d^{-1}(x) = 0.5.
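To make Definition 3.3 and the condition of Theorem 3.5 more tangible, the following sketch (an illustration, not the paper's code) estimates ρ_d(α) by Monte Carlo, inverts it by bisection, and evaluates the sufficient condition for two subspaces; the function names, sample sizes, and tolerance are arbitrary choices, and the inequality direction follows the statement of Theorem 3.5 above.

```python
import numpy as np

def rho(d, alpha, n_samples=200_000, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of rho_d(alpha) = E[[|z_d| - 2*alpha]_+], z uniform on S^{d-1}."""
    g = rng.standard_normal((n_samples, d))
    z_d = g[:, -1] / np.linalg.norm(g, axis=1)   # last coordinate of a uniform point on the sphere
    return np.maximum(np.abs(z_d) - 2 * alpha, 0.0).mean()

def rho_inverse(d, target, tol=1e-4):
    """Bisection for rho_d^{-1}(target) on [0, 1/2]; rho_d is decreasing in alpha."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if rho(d, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Two subspaces of dimension d = 10 with mixture weights 0.5 each and eta = 0.01:
# Theorem 3.5 asks max |<y, z>|, i.e., cos(theta_min), to be at most the sum below.
d, eta, pi1, pi2 = 10, 0.01, 0.5, 0.5
bound = rho_inverse(d, eta / pi1) + rho_inverse(d, eta / pi2)
theta_min_deg = np.degrees(np.arccos(min(bound, 1.0)))
print(f"subspace preserving if cos(theta_min) <= {bound:.3f}, i.e., theta_min >= {theta_min_deg:.1f} deg")
```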
Remark 3.7. As seen in Figure 2b, to guarantee the subspace preserving property, the minimum angle between two subspaces must be larger, i.e., the subspaces should be more separated, when the subspaces are more imbalanced.

Theorem 3.8 (Connectivity). Let K be the inner product kernel and suppose A⋆ satisfies the subspace preserving property. For any subspace S_l, any two points y, z ∈ S_l ∩ S^{D−1}, except for a set of measure zero, have a non-zero connection in A⋆ as long as |⟨y, z⟩| > 2ρ_{d_l}^{-1}(η/π_l).

Note that from the above two results, we are guaranteed that the doubly stochastic projection will achieve the desired properties of being subspace preserving and being well connected for an appropriate choice of the parameter η.

4. Experiments

We now verify our theoretical analysis with a variety of numerical experiments.

Metrics. Since our theorems predict the no-false-connections and subspace preserving properties, we report the feature detection error (FDE) (footnote 12), defined as

FDE = (1/n) Σ_{i=1}^n Σ_{j: j ≁ i} |A_ij| / ‖A_i‖_1 = 1 − (1/n) Σ_{i=1}^n Σ_{j: j ∼ i} |A_ij| / ‖A_i‖_1,  (30)

where A_i denotes the i-th row of A, and the percent of false connections (PFC). Both metrics take value 0 when A has no false connections. Likewise, to assess the connectivity, we report the number of non-zeros (NNZ) of A, defined as the average number of entries per row of A that are larger than 10^{-8}. To further evaluate the quality of the final clustering, we run spectral clustering on A and report the clustering accuracy (ACC) and normalized mutual information (NMI).

Footnote 12: This metric is commonly used in subspace clustering, where it is known as the feature detection error (Soltanolkotabi & Candès, 2012) or the subspace preserving error (You et al., 2016; Lim et al., 2020).

4.1. Weighted Stochastic Block Model

First, we consider the problem of clustering an affinity sampled from a Weighted Stochastic Block Model (WSBM) with random edge weights (Aicher et al., 2015). A sample affinity K is obtained by first including each intra-cluster edge with probability p and each inter-cluster edge with probability q, and then drawing edge weights for the chosen edges. Intra-cluster edge weights are drawn from a normal distribution N(1.25, .1²) and inter-cluster edge weights are drawn from another normal distribution N(1, .1²). In our experiments, we take p close to q, so a successful algorithm cannot just use the difference in sparsity between blocks and must take into account the edge weights. We use a WSBM with 5 blocks and 50 points per block. We run the DS-D(K, η) projection studied in this paper on K with varying η ∈ {0.002, 0.004, 0.01} to obtain a doubly stochastic A. Table 1 reports the mean and standard deviation of the feature detection error, percent of false connections, number of non-zeros, clustering accuracy, and normalized mutual information of A over 10 samples of K from the weighted stochastic block models WSBM(.5, .4) and WSBM(.5, .5).

Figure 3. An affinity K sampled from the weighted stochastic block model WSBM(.5, .4) of 5 clusters and its doubly stochastic projections DS-D(K, η) with varying η.

Table 1. Feature detection error, number of non-zeros, clustering accuracy and normalized mutual information for the original affinity K and the output affinity from doubly stochastic clustering DS-D(K, η), with K generated by the weighted stochastic block model WSBM(p, q) of 5 clusters, where p, q are the probabilities of intra-cluster and inter-cluster edges respectively.
Mean and standard deviations are taken over 10 trials; each entry below is mean ± standard deviation.

WSBM(.5, .4):
Original affinity:  FDE .72±.00, PFC 77±.2, NNZ 104.5±.2, ACC .79±.05, NMI .53±.07
DS-D (η = .002):    FDE .02±.00, PFC 4.5±.6, NNZ 12.0±.2, ACC 1.0±.00, NMI 1.0±.00
DS-D (η = .004):    FDE .04±.01, PFC 11±.9,  NNZ 18.5±.4, ACC 1.0±.00, NMI 1.0±.00
DS-D (η = .01):     FDE .15±.01, PFC 35±1,   NNZ 34.1±.6, ACC 1.0±.00, NMI 1.0±.00

WSBM(.5, .5):
Original affinity:  FDE .77±.00, PFC 80±.2, NNZ 124.5±.2, ACC .33±.02, NMI .07±.02
DS-D (η = .002):    FDE .03±.00, PFC 5.9±.6, NNZ 12.0±.3, ACC 1.0±.00, NMI 1.0±.00
DS-D (η = .004):    FDE .06±.00, PFC 14±.8,  NNZ 18.7±.3, ACC 1.0±.00, NMI 1.0±.00
DS-D (η = .01):     FDE .17±.00, PFC 38±.9,  NNZ 35.9±.7, ACC 1.0±.00, NMI 1.0±.00

Remarkably, even when the original affinity has a significant amount of false connections and achieves low clustering accuracy, the affinity normalized by DS-D has much fewer false connections and perfect clustering accuracy. For example, the original affinity sampled from WSBM(.5, .4) has mean feature detection error and clustering accuracy of 0.72 and 0.79, while the affinity produced by DS-D (η = .004) has 0.04 and 1.0, respectively. Further, the above conclusion holds even in the challenging case of WSBM(.5, .5), where the probability p of an intra-cluster edge is the same as the probability q of an inter-cluster edge. Last but not least, with a smaller η the affinity given by DS-D is sparser, as expected; e.g., the mean number of non-zeros is 12 with η = 0.002 and 34 with η = 0.01.

4.2. Subspace Clustering

Here we conduct experiments to demonstrate the effect of doubly stochastic projection when the clusters are defined by linear subspaces. We first fix K to be the inner product kernel, and verify the conditions on the subspace preserving property and connectivity studied in Section 3.2 (footnote 13) under various subspace angles θ_min, dimensions d, and problem parameters η. Next, we further consider the kernel matrices from Least Squares Regression (Lu et al., 2012), and show the effect of doubly stochastic projection on improving clustering.

Footnote 13: While our theorems for understanding subspace clustering are stated in the continuous limit, the experiments are conducted with finite data points.

Inner Product Kernel. We generate two subspaces of dimension d in R^D with D = 20. To control the angles between the subspaces, we choose the bases of the subspaces as (with blocks stacked vertically)

U^(1) = [ I_d ; 0_{(D−d)×d} ],
U^(2) = [ diag(cos θ_1, ..., cos θ_d) ; diag(sin θ_1, ..., sin θ_d) ; 0_{(D−2d)×d} ],

where θ_1 = θ_min is the smallest angle between the subspaces, and cos(θ_i) decreases linearly from cos(θ_min) to cos(θ_d) = 0.5 cos(θ_min). Note that θ_min = 90° is a simple case where the subspaces are orthogonal, whereas θ_min = 0° is difficult since the subspaces have non-trivial intersections. After that, from each subspace we sample n = 50d points of unit norm uniformly at random, which gives 100d points in total, and the input K is taken to be the inner product kernel computed from those points; a code sketch of this construction is given below.
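A minimal sketch of the two-subspace construction just described (an illustration under the stated setup, not the authors' code): bases U^(1), U^(2) with prescribed principal angles, 50d unit-norm points per subspace, and the inner product kernel. The helper name is hypothetical and the construction assumes D ≥ 2d.

```python
import numpy as np

def two_subspaces_with_angle(D, d, theta_min_deg, rng=np.random.default_rng(0)):
    """Data from two d-dimensional subspaces of R^D whose principal angles interpolate
    linearly in cosine from theta_min down to arccos(0.5 * cos(theta_min)). Assumes D >= 2d."""
    theta_min = np.radians(theta_min_deg)
    cos_thetas = np.linspace(np.cos(theta_min), 0.5 * np.cos(theta_min), d)
    thetas = np.arccos(cos_thetas)

    U1 = np.zeros((D, d)); U1[:d, :] = np.eye(d)
    U2 = np.zeros((D, d))
    U2[:d, :] = np.diag(np.cos(thetas))
    U2[d:2 * d, :] = np.diag(np.sin(thetas))

    n = 50 * d   # points per subspace
    def sample(U):
        C = rng.standard_normal((d, n))
        return U @ (C / np.linalg.norm(C, axis=0, keepdims=True))

    X = np.hstack([sample(U1), sample(U2)])
    labels = np.array([0] * n + [1] * n)
    return X, labels

X, labels = two_subspaces_with_angle(D=20, d=5, theta_min_deg=40)
K = np.abs(X.T @ X)   # inner product kernel that is then fed to DS-D(K, eta)
```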
Figure 4 reports the feature detection error and number of non-zeros for d ∈ {2, 5, 10}, θ_min ∈ [0°, 90°], and η ∈ {5×10⁻², 10⁻², 5×10⁻³, 10⁻³}. Remarkably, even though the inner product kernel K is not subspace preserving when the subspaces are not orthogonal (i.e., θ_min ≠ 90°), its doubly stochastic projection A⋆ is subspace preserving when the subspaces are far enough from each other (i.e., θ_min is large) or when η is sufficiently small. For example, in Figures 4a, 4c and 4e, the subspace preserving property holds for A⋆ when θ_min ≥ 40° and η ≤ 0.001. Moreover, for a given η, the smallest angle between two subspaces such that A⋆ has no false connections increases with the subspace dimension d; e.g., using a rather loose η = 0.05, the smallest such θ_min is below 40° for d = 2 (Figure 4a), while it is above 60° for d = 10 (Figure 4e). Surprising as it may sound, the above phenomena are expected from Theorem 3.5. Last but not least, the doubly stochastic affinity A⋆ has fewer non-zero connections when η is smaller, which is expected from Theorem 3.8. As such, in practice one may want to tune η to balance having fewer false connections against maintaining connectivity. Nevertheless, there seems to be a range of suitable η, even for this arguably simplest inner product kernel; for instance, with η ∈ [0.001, 0.01], A⋆ is subspace preserving whenever θ_min ≥ 30° and has at least d connections inside each subspace.

Figure 4. Feature detection error and number of non-zeros of the output affinity from doubly stochastic clustering DS-D(K, η), with the input affinity K being the inner product kernel of data coming from two subspaces of dimension d ∈ {2, 5, 10} in R^D with D = 20, each containing n = 50d points of unit norm sampled uniformly at random. For each d, metrics are shown for varying smallest angle between the subspaces θ_min ∈ [0°, 90°] and problem parameter η ∈ {5×10⁻², 10⁻², 5×10⁻³, 10⁻³}. This demonstrates the geometric conditions for the subspace preserving property and connectivity in terms of θ_min, d, and η, as predicted by Theorem 3.5 and Theorem 3.8 for the continuous problem (Section 3). (Panels (a), (c), (e): subspace preserving error for d = 2, 5, 10; panels (b), (d), (f): number of non-zeros for d = 2, 5, 10.)

Least Squares Regression Kernel. Beyond the inner product kernel, we also investigate the effect of doubly stochastic projection on other kernels used for subspace clustering, such as the LSR kernel (Lu et al., 2012). We generate k subspaces of dimension d in R^D with D = 20 uniformly at random, from each of which we further sample n = 50d points of unit norm uniformly at random. First, we compute the LSR kernel K on the data and record its performance (raw). The parameter γ in LSR is set to 10. Then, we apply the doubly stochastic projection and report the metrics on A⋆. Figure 5 reports the feature detection error, number of non-zeros, clustering accuracy and normalized mutual information for k ∈ {2, 5} and d ∈ {8, 14}. As expected, clustering seems to be simpler when one has a smaller number of subspaces (k = 2) or the subspaces are of lower dimension (d = 8). Interestingly, doubly stochastic projection seems to improve clustering over the LSR kernel. This is evidenced by the fact that doubly stochastic projection not only decreases the feature detection error while still leaving a high connectivity, but also increases the final clustering quality as measured by clustering accuracy and normalized mutual information.
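The LSR affinity used above can be sketched as follows. This is an illustration rather than the authors' exact implementation: it assumes the standard least squares regression formulation of Lu et al. (2012) without a zero-diagonal constraint and a simple symmetrization of the coefficient matrix.

```python
import numpy as np

def lsr_kernel(X, gamma=10.0):
    """Least Squares Regression affinity: C = (X^T X + gamma I)^{-1} X^T X,
    symmetrized as K = (|C| + |C^T|) / 2. Columns of X are data points."""
    n = X.shape[1]
    G = X.T @ X                                   # Gram matrix of the data columns
    C = np.linalg.solve(G + gamma * np.eye(n), G)
    return (np.abs(C) + np.abs(C.T)) / 2

# Example usage with data X from a union of subspaces (e.g., sampled as in Section 3),
# followed by the doubly stochastic projection:
# K = lsr_kernel(X, gamma=10.0)
# A = ds_projection(K, eta=0.01)   # ds_projection as sketched in Section 2
```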
Figure 5. Feature detection error, number of non-zeros, clustering accuracy and normalized mutual information of the LSR kernel (Lu et al., 2012) before and after applying doubly stochastic projection, where the LSR kernel is applied to data from k subspaces of dimension d in R^D with D = 20, each containing n = 50d points of unit norm. Both the subspaces and the points are sampled uniformly at random. (Panels (a)–(h): subspace preserving error, number of non-zeros, clustering accuracy, and normalized mutual information, each shown for the raw LSR kernel and for the LSR kernel after DS.)

5. Conclusion

In this paper, we provide an analysis of projecting an affinity matrix onto the set of doubly stochastic matrices with the ℓ2 distance metric, a technique commonly applied for spectral clustering yet whose theoretical properties have not been rigorously analyzed. In particular, we establish necessary and sufficient conditions for the no-false-connections property and provide conditions on the input affinity matrix which will satisfy these conditions, along with conditions which guarantee that the clusters will be connected. Moreover, in the case when the clusters are linear subspaces, we further provide an analysis of the no-false-connections property and connectivity in terms of the subspace dimensions, the ratio of points within each subspace, and the angles between subspaces, via a continuous extension of the doubly stochastic projection. Finally, via experiments under a variety of settings, we complement the theory and demonstrate the effect of doubly stochastic projection.

Acknowledgements

The authors are grateful to the anonymous reviewers for their comments on improving the presentation and experiments. We thank Liangzu Peng for carefully proofreading a version of this manuscript. This work is supported by NSF grant 1704458 and the Northrop Grumman Mission Systems Research in Applications for Learning Machines (REALM) initiative.

References

Aicher, C., Jacobs, A. Z., and Clauset, A. Learning latent block structure in weighted networks. Journal of Complex Networks, 3(2):221–248, 2015.

Bruna, J. and Mallat, S. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, August 2013.

Elhamifar, E. and Vidal, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781, 2013.

Haeffele, B. D., You, C., and Vidal, R. A critique of self-expressive deep subspace clustering. In International Conference on Learning Representations, 2020.

Hagen, L. and Kahng, A. B. New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 11(9):1074–1085, 1992.

Ji, P., Salzmann, M., and Li, H. Efficient dense subspace clustering. In IEEE Winter Conference on Applications of Computer Vision, pp. 461–468, March 2014.

Lim, D., Vidal, R., and Haeffele, B. D. Doubly stochastic subspace clustering. 2020. URL http://arxiv.org/abs/2011.14859.

Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., and Ma, Y. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171–184, January 2013.

Lorenz, D. A., Manns, P., and Meyer, C. Quadratically regularized optimal transport. Applied Mathematics & Optimization, 83(3):1919–1949, June 2021.

Lu, C.-Y., Min, H., Zhao, Z.-Q., Zhu, L., Huang, D.-S., and Yan, S. Robust and efficient subspace segmentation via least squares regression. In European Conference on Computer Vision, pp. 347–360. Springer, 2012.
Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

Soltanolkotabi, M. and Candès, E. J. A geometric analysis of subspace clustering with outliers. Annals of Statistics, 40(4):2195–2238, 2012.

Soltanolkotabi, M., Elhamifar, E., and Candès, E. J. Robust subspace clustering. The Annals of Statistics, 42(2):669–699, 2014.

Tsakiris, M. C. and Vidal, R. Dual principal component pursuit. Journal of Machine Learning Research, 19:1–49, 2018.

Vidal, R. Subspace clustering. IEEE Signal Processing Magazine, 28(2):52–68, 2011.

Vidal, R., Ma, Y., and Sastry, S. Generalized Principal Component Analysis. Springer Verlag, 2016.

Von Luxburg, U. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

You, C. and Vidal, R. Geometric conditions for subspace-sparse recovery. In International Conference on Machine Learning, pp. 1585–1593. PMLR, 2015.

You, C., Robinson, D. P., and Vidal, R. Scalable sparse subspace clustering by orthogonal matching pursuit. In IEEE Conference on Computer Vision and Pattern Recognition, June 2016.

Zass, R. and Shashua, A. Doubly stochastic normalization for spectral clustering. In Advances in Neural Information Processing Systems, 19, 2006.

A. Additional Proofs

Proof of Theorem 2.2.

Proof. (1 ⇒ 2). Let (A^(l), α^(l)) be primal-dual optimal for DS-D(D^(l), η/π_l). By Proposition 2.1,

A^(l) = (π_l/η) [D^(l) − α^(l)1ᵀ − 1α^(l)ᵀ]_+,  (32)
(π_l/η) [D^(l) − α^(l)1ᵀ − 1α^(l)ᵀ]_+ 1 = n_l 1.  (33)

Assume now that ᾱ_i + ᾱ_j ≥ K_ij for all i ≁ j, where ᾱ = [α^(1)ᵀ, ..., α^(k)ᵀ]ᵀ. We verify that (Ā, ᾱ) are primal-dual optimal for DS-D(K, η). First, we check that (1/η)[K − ᾱ1ᵀ − 1ᾱᵀ]_+ = Ā. For any i, j coming from the same cluster l′,

(1/η) [K_ij − ᾱ_i − ᾱ_j]_+ = (1/η) [D^(l′)_{i′j′} − α^(l′)_{i′} − α^(l′)_{j′}]_+ = (1/π_{l′}) A^(l′)_{i′j′} = Ā_ij,

where i′ = i − (n_1 + ... + n_{l′−1}) and j′ = j − (n_1 + ... + n_{l′−1}) are the counterparts of i, j for indexing D^(l′). For i, j from different clusters, we have (1/η)[K_ij − ᾱ_i − ᾱ_j]_+ = 0 = Ā_ij. Now, we only need to verify that Ā1 = n1. This is immediate from the construction of Ā and (33).

(2 ⇒ 3). This is trivial.

(3 ⇒ 1). Let A⋆ be the primal optimal and α⋆ be a dual optimal for DS-D(K, η). By Proposition 2.1,

A⋆ = (1/η) [K − α⋆1ᵀ − 1α⋆ᵀ]_+,  (34)
A⋆ 1 = n1.  (35)

By assumption, A⋆ has no false connections, i.e., A⋆_ij = (1/η)[K_ij − α⋆_i − α⋆_j]_+ = 0 for all i ≁ j. That is, α⋆_i + α⋆_j ≥ K_ij. This fact, together with (35), yields

(π_l/η) [D^(l) − α^(l)1ᵀ − 1α^(l)ᵀ]_+ 1 = n_l 1

for all l ∈ {1, ..., k}, where α^(l) denotes the restriction of α⋆ to cluster l. Therefore, α^(l) is dual optimal for DS-D(D^(l), η/π_l), and statement 1 holds.

Lemma A.1. Suppose K is rotationally invariant, i.e., K(y, z) = K(Ry, Rz) for any orthogonal R. For any dimension d and constant η_0, there exists a constant α such that ∫_{y ∈ S^{d−1}} [K(y, z) − 2α]_+ dµ_{S^{d−1}} = η_0 for every z ∈ S^{d−1}.

Proof. Taking R to be the Householder matrix that sends z to e_1 gives

∫_{y ∈ S^{d−1}} [K(y, z) − 2γ]_+ dµ_{S^{d−1}} = ∫_{y ∈ S^{d−1}} [K(Ry, Rz) − 2γ]_+ dµ_{S^{d−1}}  (K rotationally invariant)  (36)
= ∫_{y ∈ S^{d−1}} [K(Ry, e_1) − 2γ]_+ dµ_{S^{d−1}}  (Householder)  (37)
= ∫_{y ∈ S^{d−1}} [K(y, e_1) − 2γ]_+ dµ_{S^{d−1}}.  (change of basis, footnote 14)  (38)

That is, the integral does not depend on z, so there exists a γ ∈ [0, 1] such that for any z ∈ S^{d−1}, ∫_{y ∈ S^{d−1}} [K(y, z) − 2γ]_+ dµ_{S^{d−1}} = η_0; hence taking α to be a constant function will satisfy the optimality condition.

Footnote 14: Let x = Ry, and note that µ_{S^{d−1}}(dy) = µ_{S^{d−1}}(dR^{-1}x) = µ_{S^{d−1}}(dx), where the second equality is due to the fact that µ_{S^{d−1}} is the uniform measure.

Notes on Definition 3.3. For any dimension d ∈ Z with d ≥ 2, let z_1² + z_2² + ... + z_d² = 1 be the coordinate representation of the unit sphere S^{d−1} of R^d. Let α ∈ [0, 1]. The goal is to compute ρ_d(α) = ∫_{S^{d−1}} [|z_d| − 2α]_+ dµ_{S^{d−1}}.
Note that

ρ_d(α) = ∫_{S^{d−1}} [|z_d| − 2α]_+ dS / ∫_{S^{d−1}} dS = (1/A_{d−1}) ∫_{S^{d−1}} [|z_d| − 2α]_+ dS  (39)
= (2/A_{d−1}) ∫_{S^{d−1}, z_d ≥ 0} [z_d − 2α]_+ dS  (40)
= (2/A_{d−1}) ∫_{S^{d−1}, z_d ≥ 2α} (z_d − 2α) dS,  (41)

where A_{d−1} is the surface area of S^{d−1}. For z_d ≥ 0, we have the equation of the sphere as z_d = sqrt(1 − (z_1² + ... + z_{d−1}²)) =: g(z_1, ..., z_{d−1}), which yields the following identities:

z_d ≥ 2α if and only if z_1² + ... + z_{d−1}² ≤ 1 − 4α²,  (42)
sqrt((∂g/∂z_1)² + ... + (∂g/∂z_{d−1})² + 1) = 1 / sqrt(1 − (z_1² + ... + z_{d−1}²)).  (43)

With those, one can perform a change of measure in (41) and get

ρ_d(α) = (2/A_{d−1}) ∫_{z_1²+...+z_{d−1}² ≤ 1−4α²} (1 − 2α / sqrt(1 − (z_1² + ... + z_{d−1}²))) dA  (44)
= (2/A_{d−1}) (V_{d−1} (1 − 4α²)^{(d−1)/2} − 2α ∫_{S^{d−1}, z_d ≥ 2α} dS),  (45)

where V_{d−1} is the (d−1)-dimensional volume enclosed by S^{d−2}, i.e., the volume of the unit ball of R^{d−1}. Note that when α = 0, (45) gives ρ_d(0) = 2V_{d−1}/A_{d−1}, which coincides with the c_d quantity in (Tsakiris & Vidal, 2018). On the other hand, for a general α ∈ [0, 1], we further have

ρ_d(α) = (2/A_{d−1}) (V_{d−1} (1 − 4α²)^{(d−1)/2} − 2α A_{d−2} ∫_0^{arccos(2α)} sin^{d−2}θ dθ)  (46)
= (2/A_{d−1}) (V_{d−1} (1 − 4α²)^{(d−1)/2} − 2α A_{d−2} ((1 − 4α²)^{(d−1)/2}/(d−1)) ₂F₁(1/2, (d−1)/2; (d+1)/2; 1 − 4α²))  (47)
= (2/A_{d−1}) (1 − 4α²)^{(d−1)/2} (V_{d−1} − 2α V_{d−1} ₂F₁(1/2, (d−1)/2; (d+1)/2; 1 − 4α²)),  (48)

where ₂F₁ is the hypergeometric function. Since the last term is hard to bound and interpret, we observe that (41) can alternatively be written as

ρ_d(α) = (2/A_{d−1}) ∫_{S^{d−1}, z_d ≥ 2α} z_d dS − (4α/A_{d−1}) ∫_{S^{d−1}, z_d ≥ 2α} dS  (49)
= (2/A_{d−1}) V_{d−1} (1 − 4α²)^{(d−1)/2} − (4α/A_{d−1}) ∫_{S^{d−1}, z_d ≥ 2α} dS,  (50)

where the second term is some scalar times the surface area of a spherical cap. With that, we have

ρ_d(α) = (2V_{d−1}/A_{d−1}) (1 − 4α²)^{(d−1)/2} − (4α/A_{d−1}) (1/2) A_{d−1} I_{1−4α²}((d−1)/2, 1/2)  (51)
= c_d (1 − 4α²)^{(d−1)/2} − 2α I_{1−4α²}((d−1)/2, 1/2),  (52)

where I is the regularized incomplete beta function. Observe that as α increases from 0 to 1/2, ρ_d(α) decreases from c_d to 0. Thus, it may be desirable to bound 1 − ρ_d(α)/c_d. Indeed, from (52) we have

1 − ρ_d(α)/c_d = 1 − (1 − 4α²)^{(d−1)/2} + (2α/c_d) I_{1−4α²}((d−1)/2, 1/2).

Assuming d is odd, i.e., (d−1)/2 is an integer, the binomial expansion gives 1 − (1 − 4α²)^{(d−1)/2} = Σ_{i=1}^{(d−1)/2} (−1)^{i−1} ((d−1)/2 choose i) (4α²)^i, and the properties of I together with the beta function B((d−1)/2, 1/2) allow the remaining term to be expanded in powers of 4α² as well.
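As a numerical sanity check of the closed form (52) (an illustration, not from the paper), one can compare it against a direct Monte Carlo estimate of Definition 3.3; betainc below is SciPy's regularized incomplete beta function and gammaln the log-Gamma function, and the constant c_d is evaluated as Γ(d/2)/(√π Γ((d+1)/2)) = 2V_{d−1}/A_{d−1}.

```python
import numpy as np
from scipy.special import betainc, gammaln

def rho_closed_form(d, alpha):
    """rho_d(alpha) = c_d (1 - 4a^2)^{(d-1)/2} - 2a * I_{1-4a^2}((d-1)/2, 1/2), per (52)."""
    c_d = np.exp(gammaln(d / 2) - gammaln((d + 1) / 2)) / np.sqrt(np.pi)
    x = 1 - 4 * alpha ** 2
    return c_d * x ** ((d - 1) / 2) - 2 * alpha * betainc((d - 1) / 2, 0.5, x)

def rho_monte_carlo(d, alpha, n_samples=500_000, rng=np.random.default_rng(0)):
    """Direct Monte Carlo estimate of Definition 3.3 with z uniform on S^{d-1}."""
    g = rng.standard_normal((n_samples, d))
    z_d = g[:, -1] / np.linalg.norm(g, axis=1)
    return np.maximum(np.abs(z_d) - 2 * alpha, 0.0).mean()

for d in (10, 100):
    for alpha in (0.0, 0.1, 0.2):
        print(d, alpha, rho_closed_form(d, alpha), rho_monte_carlo(d, alpha))
```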