Geometric Analysis of Nonlinear Manifold Clustering

Nimita Shinde (Lehigh University, nis623@lehigh.edu) · Tianjiao Ding (University of Pennsylvania, tjding@upenn.edu) · Daniel P. Robinson (Lehigh University, daniel.p.robinson@lehigh.edu) · René Vidal (University of Pennsylvania, vidalr@upenn.edu)

*These authors contributed equally to this work.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

Manifold clustering is an important problem in motion and video segmentation, natural image clustering, and other applications where high-dimensional data lie on multiple, low-dimensional, nonlinear manifolds. While current state-of-the-art methods on large-scale datasets such as CIFAR provide good empirical performance, they do not have any proof of theoretical correctness. In this work, we propose a method that clusters data belonging to a union of nonlinear manifolds. Furthermore, for a given input data sample $y$ belonging to the $l$th manifold $\mathcal{M}_l$, we provide geometric conditions that guarantee that a manifold-preserving representation of $y$ can be recovered from the solution to the proposed model. The geometric conditions require that (i) $\mathcal{M}_l$ is well-sampled in the neighborhood of $y$, with the sampling density given as a function of the curvature, and (ii) $\mathcal{M}_l$ is sufficiently separated from the other manifolds. In addition to providing a proof of correctness in this setting, a numerical comparison with state-of-the-art methods on CIFAR datasets shows that our method performs competitively, although marginally worse than methods without theoretical guarantees.

1 Introduction

Manifold clustering is a fundamental problem in data science, in which one seeks to cluster data lying close to a union of low-dimensional manifolds. It has a vast number of applications, such as clustering i) image pixels [1-3], ii) video frames [4, 5], iii) images of faces [6], hand-written digits [7] or other natural objects [8], iv) rigid-body motions [9], v) human actions [10-12], and vi) searching policies for robots [13], to name a few.

Over the past two decades, there has been a considerable amount of work on subspace clustering, the special case where each manifold is a linear or affine subspace. A key to the success of many of these works can be attributed to the idea of self-expressiveness [14]: one can write a data point as a linear (or affine) combination of other points, i.e., given a data point $y$ and a matrix of data points $X$, it holds that $y = Xc$ for some vector $c$. The observation that the support of the sparsest such $c$ should correspond to data points in the same subspace as $y$ prompted the study of the optimization problem

$$\min_{c} \; r(c) + \frac{\lambda}{2}\|e\|_2^2 \quad \text{subject to (s.t.)} \quad e = y - Xc, \tag{SC}$$

where $r(\cdot)$ is a regularizer on $c$ (e.g., $\|\cdot\|_1$ to promote sparsity [6]) and $\lambda > 0$ is a parameter that balances the two terms in the objective. By solving an instance of (SC) for each point in the dataset, one obtains one coefficient vector $c$ per point, and the matrix of all coefficients for all points can be used to build a similarity graph of the points and run spectral clustering to obtain a clustering of the data. This has spurred a fruitful line of research, leading to various formulations based on different regularizers [6, 14-20], efficient algorithms [18, 21, 22], and theoretical guarantees on $c$ having the correct support [23-32].
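To make the self-expressive model (SC) concrete, the following minimal sketch (our own illustration, not the authors' code) solves an $\ell_1$-regularized instance of (SC) for a single point $y$ by proximal gradient descent (ISTA). The step size, iteration count, and all function and variable names are choices made here for illustration only.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding, the proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def solve_sc_l1(y, X, lam, n_iter=500):
    """Solve min_c ||c||_1 + (lam/2)*||y - X c||_2^2 by proximal gradient (ISTA).

    This is only a sketch of the generic problem (SC) with r = ||.||_1; it is not
    the solver used in the paper.
    """
    n = X.shape[1]
    c = np.zeros(n)
    # Step size 1/L, where L = lam * ||X||_2^2 is a Lipschitz constant of the
    # gradient of the smooth term.
    L = lam * np.linalg.norm(X, 2) ** 2
    step = 1.0 / L
    for _ in range(n_iter):
        grad = lam * (X.T @ (X @ c - y))            # gradient of (lam/2)||y - Xc||^2
        c = soft_threshold(c - step * grad, step)   # prox of ||.||_1 with step "step"
    return c

# Toy usage on random data (in practice the column corresponding to y itself
# is excluded from X before solving).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
y = X[:, 0] + 0.5 * X[:, 1]
c = solve_sc_l1(y, X, lam=10.0)
print("support size:", np.count_nonzero(np.abs(c) > 1e-6))
```

In a full pipeline, one such coefficient vector is computed per data point and the resulting coefficient matrix is fed to spectral clustering, as described above.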
While many interesting datasets (nearly) satisfy the linear or affine subspace assumption, there are a variety of tasks with associated datasets that grossly violate it. For example, natural image datasets such as CIFAR [33] and ImageNet [34] cannot be well modeled by low-dimensional subspaces. Instead, it is more natural to assume that each cluster is modeled by a smooth low-dimensional nonlinear manifold, a more general case of manifold clustering. Notably, this is much more challenging than subspace clustering, as the global linear relationship among points in each subspace is absent.

Although a few nonlinear manifold clustering methods have achieved high clustering accuracy on large-scale datasets, they lack a theoretical justification. For example, the work in [35-53] aims to learn an embedding (or kernel) via neural networks with subspace-clustering-style loss functions on the embedded data. While these methods have progressively improved the state of the art in clustering performance, with the most recent ones achieving over 89% accuracy on CIFAR-10 [54-56], little is theoretically understood about why these methods work. In fact, the work of [57] argues the opposite, that some of these methods are provably ill-formulated and learn trivial embeddings*. The current paper provides an approach that is both theoretically grounded and empirically tested on modern large-scale datasets.

An interesting method that motivates us is sparse manifold clustering and embedding (SMCE) [58]. It views a local neighborhood of a manifold approximately as a low-dimensional affine subspace, and solves a modified version of (SC) that reweights data and adds an affine constraint $\mathbf{1}^T c = 1$. Despite its effectiveness on simpler datasets such as Extended Yale-B [59], COIL20 [60], and MNIST [7], as observed by [61], so far there is no theoretical understanding of when this method will succeed, nor has it been applied to large-scale datasets. Nevertheless, the method SMCE inspires our work.

A major difficulty in providing theoretical guarantees for SMCE is to deal with the affine constraint. This motivates us to relax the constraint as a penalization, leading to the following model based on the self-expressiveness of the input sample $y$:

$$\min_{c} \; \|Wc\|_1 + \frac{\lambda}{2}\|e\|_2^2 + \eta\,(1 - \mathbf{1}^T c)^2 \quad \text{s.t.} \quad e = y - Xc, \tag{1.1}$$

where $W$ is a diagonal matrix whose $j$-th diagonal entry $w_j$ is an increasing function of the Euclidean distance between $y$ and data sample $x_j$. The regularizer $\|Wc\|_1$ was adopted from SMCE to promote sparse solutions with support determined by close data points. Comparing (1.1) with (SC), we see two differences: (i) (1.1) replaces $r(c)$ with a new regularizer that is a function of $c$ and $W$, and (ii) (1.1) penalizes violation of the constraint $\mathbf{1}^T c = 1$ in the objective by introducing the term $(1 - \mathbf{1}^T c)^2$. Thus, as $\eta$ goes to infinity, the data sample will be represented by an affine combination of input data samples. It is then natural to ask the following question:

Question 1. Can we provide geometric conditions based on the input data samples and the structure of the underlying manifolds that ensure that every optimal solution to (1.1) is manifold preserving, i.e., the non-zero entries of an optimal solution $c$ correspond to data from the same manifold as $y$?

1.1 Contributions

In this paper, we propose a model that generates an approximately affine representation of an input data sample and provide an answer to Question 1. Our contributions are summarized below.
Formulation: We propose to perform manifold clustering via the convex optimization problem (1.1), which is based on a self-expressiveness property. Notably, the proposed formulation can be efficiently solved using a fast active-set-based method. We then construct an affinity matrix based on the solutions to (1.1), and use spectral clustering to cluster the data.

Theory: Using our new formulation, we provide an answer to Question 1 by giving theoretical guarantees for any optimal solution of (1.1) to be manifold preserving (see Definition 1). The results depend on the curvature of the manifold at $y$, the relative location of the data samples in the neighborhood of $y$ (i.e., the distribution of the samples in the neighborhood), and the separation between the samples from other manifolds and the neighborhood of $y$.

*Accuracy and running time of methods with theoretical guarantees of correctness have been reported only on datasets with a small number of samples or clusters, or low ambient dimension; see Appendix C for a review.

Experiments: We compare the performance of the proposed method with state-of-the-art alternatives on the CIFAR-10, -20, and -100 datasets. The proposed method performs consistently better than subspace clustering methods, and only marginally worse than methods based on deep networks.

1.2 Notations

For $l = 1, \ldots, L$, we let $\mathcal{M}_l$ denote a nonlinear manifold in $\mathbb{R}^D$ with intrinsic dimension $d_l \ll D$. We define $\mathcal{M} = \cup_{l=1}^{L} \mathcal{M}_l$ to be the union of manifolds. Let $N$ data samples be generated from $\mathcal{M}$, which we represent as the columns of $X = [x_1, \ldots, x_N] \in \mathbb{R}^{D \times N}$. Additionally, let $y$ be a new sample generated from the union of manifolds $\mathcal{M}$. Without loss of generality, we assume that $y \in \mathcal{M}_1$. Furthermore, for $l = 1, \ldots, L$, let $M_l$ denote the set of generated input data samples that lie on $\mathcal{M}_l$, and define $M = \cup_{l=1}^{L} M_l$ to be the set of all $N$ data samples. For a given subspace $S$, we let $P_S(x)$ denote the orthogonal projection of $x$ onto the subspace $S$.

Outline. In Section 2, we introduce our proposed model, define key quantities used in the analysis, and provide theoretical results for the proposed model. In Section 3, we perform experiments on synthetic data aimed at improving our understanding of the theoretical results, followed by a comparison with other existing methods on real data. We conclude the paper in Section 4.

2 Proposed Model and Theoretical Analysis

The model (1.1) we study is obtained by penalizing the affine constraint $\mathbf{1}^T c = 1$ in SMCE [58] with penalty parameter $\eta$. The penalty term in the model (1.1) is equivalent to homogenizing the data samples with homogenization constant $\eta$. Thus, we propose to study the equivalent problem

$$\min_{c} \; \|Wc\|_1 + \frac{\lambda}{2}\|e\|_2^2 \quad \text{s.t.} \quad e = \bar{y} - \bar{X}c, \tag{WMC}$$

where $\lambda > 0$, $W$ is a diagonal matrix with positive entries that depend on the Euclidean distances between $y$ and the input data samples, $\bar{X} \in \mathbb{R}^{(D+1) \times N}$ is the matrix whose $j$-th column, $\bar{x}_j$, is the homogenized data sample $(x_j^T, \eta)^T$ with $\eta > 0$ a chosen constant, and $\bar{y} := (y^T, \eta)^T$. One could define $w_j := W_{jj} = \|x_j - y\|_2 / \sum_k \|x_k - y\|_2$, although other reasonable choices are possible.

2.1 Definitions and assumptions

In this section, we state our assumptions and define key quantities used in the theoretical results. Our theoretical analysis provides an answer to Question 1, i.e., we provide conditions under which the non-zero entries of a solution $c^\star$ of (WMC) correspond to data samples from the same manifold as $y$. If $c^\star$ satisfies this property, it is called a manifold preserving solution, as we now define.
Definition 1 (Manifold preserving). For the data sample $y \in \mathcal{M}_1$, we say that an optimal solution $c^\star$ to (WMC) is manifold preserving if and only if $c^\star_j = 0$ for all $x_j \notin M_1$.

If one considers solving $N$ instances of problem (WMC) (each instance defined by replacing $y$ with $x_j$ and removing $x_j$ from $X$), then the $N$ solutions may be used to define a similarity graph. If the solution to each instance is manifold preserving, then the similarity graph will only have intra-cluster connections (no inter-cluster/false connections). Consequently, it would be expected that applying spectral clustering to such a graph will result in a correct clustering of the data.

We now introduce key quantities and assumptions required to prove the results in Sections 2.2 and 2.3, i.e., to derive conditions under which any optimal solution $c^\star$ to (WMC) is manifold preserving.

Assumption 1. We assume that each manifold is smooth and that the input data samples generated on each manifold are noiseless. Furthermore, we assume that at least $d_l + 2$ samples are generated on each manifold $\mathcal{M}_l$, where we recall that $d_l$ is the intrinsic dimension of manifold $\mathcal{M}_l$.

Definition 2 (Set of nearest neighbors $N$). For the data sample $y \in \mathcal{M}_1$, define $B$ to be the $D$-dimensional ball of smallest radius centered at $y$ such that it contains $d_1 + 1$ points from $M_1$, where $d_1$ is the dimension of manifold $\mathcal{M}_1$. Define $N$ to be the set of all data samples in $B$ excluding $y$, and note that $N$ may contain samples from both $M_1$ and $\cup_{l=2}^{L} M_l$.

Definition 3 (Nearest affine subspace $S$). For the data sample $y \in \mathcal{M}_1$ and $N$ in Definition 2, we define the nearest affine subspace, $S$, as the affine subspace generated by the set $N \cap M_1$, i.e.,

$$S := \mathrm{aff}(\{x_j : x_j \in N \cap M_1\}). \tag{2.1}$$

Furthermore, we define $\bar{S} := \mathrm{span}\{\bar{x}_j : x_j \in N \cap M_1\}$ as the linear subspace spanned by the homogenized data samples from the set $N \cap M_1$.

Assumption 2. For the affine subspace $S$ defined in Definition 3, we assume that $\dim(S) = d_1$. Equivalently, we assume that $\dim(\bar{S}) = d_1 + 1$.

Figure 1: The affine space $S$ approximates the tangent to the 1-dimensional manifold $\mathcal{M}_1$ at $y$. Observe that $\{x_j\}_{j=1}^{4} \subset N$ and that $\{x_1, x_2\} \subset M_1$.

The affine subspace $S$ is spanned by the $d_1 + 1$ points nearest to $y$ on the manifold $\mathcal{M}_1$ (see Figure 1 for an illustration when $\mathcal{M}_1$ is one-dimensional). We can view $S$ as an approximation of the tangent space to $\mathcal{M}_1$ at $y$. We now define the dual of (WMC), since our analysis requires knowledge of the dual variables.

Definition 4 (Weighted $\ell_1$ norm and its dual). The dual norm of the weighted $\ell_1$ norm $\|c\|_{1,W} := \|Wc\|_1$ is

$$\|c\|_{\infty,W^{-1}} := \max_{\|z\|_{1,W} \le 1} c^T z = \max_{j \in [N]} \frac{|c_j|}{w_j}.$$

We can write the dual of (WMC) as

$$\max_{\nu \in \mathbb{R}^{D+1}} \; \langle \bar{y}, \nu \rangle - \frac{1}{2\lambda}\|\nu\|_2^2 \quad \text{s.t.} \quad \|\bar{X}^T \nu\|_{\infty,W^{-1}} \le 1. \tag{2.3}$$

The constraint in (2.3) can be equivalently written as

$$|\langle \bar{x}_j, \nu \rangle| \le w_j \quad \text{for all } 1 \le j \le N. \tag{2.4}$$

We now define two more quantities (Definitions 5 and 6) that appear in our main results (see Lemma 1 and Lemma 3). Remark 1 also gives additional insight into these quantities.

Definition 5 (Dual direction). For the data sample $y \in \mathcal{M}_1$, define $\bar{X}_{\bar{S}}$ as the matrix whose columns span the linear subspace $\bar{S}$ in Definition 3, i.e., $\bar{X}_{\bar{S}} = [\bar{x}_j]_{x_j \in N \cap M_1}$. Define the reduced problem

$$\min_{c} \; \|c\|_{1,W} + \frac{\lambda}{2}\|\bar{y} - \bar{X}_{\bar{S}}c\|_2^2, \tag{2.5}$$

and its corresponding dual

$$\max_{\nu} \; \langle \bar{y}, \nu \rangle - \frac{1}{2\lambda}\|\nu\|_2^2 \quad \text{s.t.} \quad \|\bar{X}_{\bar{S}}^T \nu\|_{\infty,W^{-1}} \le 1. \tag{2.6}$$

Then, we define the dual direction $\nu^\star$ as the unique optimal solution to (2.6).

The dual direction $\nu^\star$ is used to define the separation between the manifold $\mathcal{M}_1$ and the other manifolds. We elaborate on this in Remark 1.

Definition 6 (Inradius $r(Q^W_{\bar{S}})$).
For the data sample $y \in \mathcal{M}_1$, define

$$Q^W_{\bar{S}} = \mathrm{conv}\Big\{\pm \frac{\bar{x}_j}{w_j} : x_j \in N \cap M_1\Big\}.$$

We define $r(Q^W_{\bar{S}})$ as the inradius of the symmetric convex body $Q^W_{\bar{S}}$, i.e., the radius of the largest Euclidean ball contained in $Q^W_{\bar{S}}$. For simplicity, we use the notation $r_W = r(Q^W_{\bar{S}})$.

It follows from Definition 6 that the inradius $r_W$ increases as the weights $\{w_j\}$ decrease. Since the $\{w_j\}$ are proportional to the distances between $y$ and the $d_1 + 1$ points closest to $y$ in the set $M_1$, the inradius increases when more samples are generated from $\mathcal{M}_1$ near $y$ (see Remark 1 for more details).

2.2 Manifold-preserving: general geometric conditions

Our first theoretical result provides geometric conditions that ensure that an optimal solution of (WMC), defined for a given input data sample $y$, is manifold preserving (see Definition 1).

Lemma 1. For $y \in \mathcal{M}_1$, input data $X$, and $\lambda > 0$, define the model (WMC) and let Assumptions 1 and 2 hold. Let $S$ and $\bar{S}$ be defined as in Definition 3, and let $\nu^\star$ and $r_W$ be as given in Definitions 5 and 6, respectively. Let $\mathrm{dist}(y, S)$ be the Euclidean distance between $y$ and the subspace $S$, and define

$$\gamma_W = \min_{x_k \in M \setminus M_1} \frac{w_k r_W - \frac{|\langle \bar{x}_k, P_{\bar{S}}(\nu^\star)\rangle|}{\|P_{\bar{S}}(\nu^\star)\|_2}}{\|\bar{x}_k\|_2}.$$

The following statements hold.

(a) If

$$\gamma_W > 0 \tag{2.8}$$

and

$$\mathrm{dist}(y, S) < \|\bar{y}\|_2\,\frac{\gamma_W}{1 + \gamma_W}, \tag{2.9}$$

then the interval

$$(\lambda_l, \lambda_u) := \left(\frac{1}{r_W(\|\bar{y}\|_2 - \mathrm{dist}(y,S))},\ \frac{\gamma_W}{r_W\,\mathrm{dist}(y,S)}\right)$$

is well defined and nonempty.

(b) If $\lambda \in (\lambda_l, \lambda_u)$, then every optimal solution to (WMC) is manifold preserving and nonzero.

To help the reader better understand the inequalities in (2.8) and (2.9), we give the following remark.

Remark 1. For the $d_1$-dimensional manifold $\mathcal{M}_1$, conditions (2.8) and (2.9) depend on the following:

1. Inradius $r_W$: $r_W$ (see Definition 6) provides information about the distribution of the $d_1 + 1$ data samples closest to $y$ on the manifold $\mathcal{M}_1$, i.e., the samples belonging to the set $M_1 \cap N$. The inequalities in (2.8) and (2.9) are more easily satisfied if $r_W$ increases. From the definition of the inradius $r_W$, it is clear that $r_W$ increases when the samples in the set $M_1 \cap N$ are more uniformly distributed. Furthermore, suppose we generate more samples from the manifold $\mathcal{M}_1$ near $y$. Then, the distances of the $d_1 + 1$ samples in the set $M_1 \cap N$ from $y$ will decrease, resulting in a larger inradius. Thus, $r_W$ increases when either the number of samples generated from the manifold $\mathcal{M}_1$ near $y$ increases or the samples in the set $M_1 \cap N$ are well-distributed.

2. Inner product $|\langle \bar{x}_k, P_{\bar{S}}(\nu^\star)\rangle|$: For each $x_k \in M \setminus M_1$, the inner product defines the separation between $x_k$ and the subspace $\bar{S}$. If the separation between each point $x_k \in M \setminus M_1$ and $\bar{S}$ increases, the inner product decreases. From the definition of $\gamma_W$, it is clear that the inequalities in (2.8) and (2.9) are more easily satisfied when the separation increases.

Part (a) of Lemma 1 provides conditions (based on the separation of samples belonging to other manifolds from $\bar{S}$, as well as the distribution of the input data samples in the neighborhood of $y$) for the interval $(\lambda_l, \lambda_u)$ to be nonempty. Note that $\lambda_l$ is a function of $y$ and $X$, while $\lambda_u$ is a function of $y$, $X$, and $\lambda$, since $\gamma_W$ depends on $\nu^\star$, which is an optimal solution to (2.6) with parameter $\lambda$. Part (b) of Lemma 1 states that if the hyperparameter $\lambda$ satisfies $\lambda_l < \lambda < \lambda_u$, then every solution to (WMC) is manifold preserving and nonzero. While we can choose $\lambda > \lambda_l$ based on knowledge of $(y, X)$, explicitly computing a $\lambda$ satisfying $\lambda < \lambda_u$ (when such a value exists) requires defining $\lambda_u$ as a function of $\lambda$.
Although Lemma 1 does not provide this relationship, the numerical experiments in Section 3.1 on randomly generated data clarify how $\lambda_l$ and $\lambda_u$ change as a function of $(y, X)$ and $\lambda$. Interestingly, a geometric result in [62] for a linear subspace model can be recovered from Lemma 1.

Corollary 1 (Special case of (WMC)). For $y \in \mathcal{M}_1$, input data $X$, and $\lambda > 0$, define the model (WMC) with $W = I$. Assume that $\|y\|_2 = \|x_j\|_2 = 1$ for each data sample $x_j$, and that Assumptions 1 and 2 hold. In addition to the quantities defined in Lemma 1, define $\mu = \max_{x_k \in M \setminus M_1} |\langle \bar{x}_k, P_{\bar{S}}(\nu^\star)\rangle|$ and let $r = r_W = r_I$ since $W = I$. It then follows from Lemma 1 that if

$$\mu < r \quad \text{and} \quad \mathrm{dist}(y, S) < \frac{r - \mu}{1 + r - \mu}, \tag{2.10}$$

then the interval $\left(\frac{1}{r(1 - \mathrm{dist}(y,S))},\ \frac{r - \mu}{r\,\mathrm{dist}(y,S)}\right)$ is well defined and nonempty. Moreover, if $\lambda$ is in this interval, then every optimal solution to (WMC) is manifold preserving and nonzero. Furthermore, if the manifolds are linear, then $\mathrm{dist}(y, S) = 0$ since $y \in S$. Thus, in this special case, if $\mu < r$, then for any $\lambda > 1/r$, every optimal solution to (WMC) is subspace preserving and nonzero.

2.3 Manifold-preserving: curvature-based geometric conditions

Although we believe the results in the previous section are the first of their kind, they depend on quantities related to projections onto $\bar{S}$, which we would like to avoid, if possible. To achieve this goal, we require curvature information of the manifold $\mathcal{M}_1$ as well as the reach of the manifold $\mathcal{M}_1$. Let us define the curvature and the reach of a manifold.

Definition 7 (Curvature of a manifold). Let $\mathcal{M}_1$ be a smooth manifold with $y \in \mathcal{M}_1$. If $1/\kappa$ is the radius of the largest (internal) tangent circle to $\mathcal{M}_1$ at $y$, then the curvature of $\mathcal{M}_1$ at $y$ is $\kappa$.

Definition 8 (Reach of a manifold). Let $\mathcal{M}_1$ be a smooth manifold. The reach of the manifold $\mathcal{M}_1$, $\mathrm{reach}(\mathcal{M}_1)$, is the largest value such that any $x \notin \mathcal{M}_1$ with $\mathrm{dist}(x, \mathcal{M}_1) \le \mathrm{reach}(\mathcal{M}_1)$ has a unique projection onto $\mathcal{M}_1$. If $\kappa$ is the curvature of the manifold $\mathcal{M}_1$, then $\mathrm{reach}(\mathcal{M}_1) \le 1/\kappa$.

In Figure 1, $\kappa$ denotes the curvature of the 1-dimensional manifold $\mathcal{M}_1$ at $y$. The curvature of the manifold at $y$ gives a bound on how fast the direction of a point on the manifold changes with respect to the distance traveled. The larger the change in direction (or angle), the larger the curvature. For linear manifolds, the curvature is zero. In Lemma 2, we give a relationship between the curvature of the manifold $\mathcal{M}_1$ at $y$, the radius of the ball $B$ centered at $y$, and $\mathrm{dist}(y, S)$. Then, in Lemma 3, we provide geometric conditions based on the curvature that guarantee that any optimal solution of (WMC) is manifold preserving.

Lemma 2. Let Assumptions 1 and 2 hold. For $y \in \mathcal{M}_1$, let $B$ and $S$ be as defined in Definitions 2 and 3, respectively. Let $\zeta = \zeta(B)$ be the radius of the sphere $B$, $\kappa$ be the curvature of the manifold $\mathcal{M}_1$ at $y$, $\mathrm{reach}(\mathcal{M}_1)$ be the reach of the manifold $\mathcal{M}_1$, and $\mathrm{dist}(y, S)$ be the Euclidean distance between $y$ and the subspace $S$. If $\mathrm{reach}(\mathcal{M}_1) \ge \zeta$, then

$$\mathrm{dist}(y, S) \le \frac{1}{\kappa} - \sqrt{\frac{1}{\kappa^2} - \zeta^2}. \tag{2.11}$$

Combining Lemma 1 and Lemma 2 gives our final result.

Lemma 3. For $y \in \mathcal{M}_1$, input data $X$, and $\lambda > 0$, define the model (WMC) and let Assumptions 1 and 2 hold. In addition to the quantities defined in Lemma 1, let $\zeta$ be the radius of the sphere $B$, $\kappa$ be the curvature of the manifold $\mathcal{M}_1$ at $y$, and $\mathrm{reach}(\mathcal{M}_1)$ be the reach of the manifold $\mathcal{M}_1$. The following then hold.

(a) If

$$\gamma_W > 0, \tag{2.12}$$

$$\mathrm{reach}(\mathcal{M}_1) \ge \zeta, \quad \text{and} \tag{2.13}$$

$$\frac{1}{\kappa} - \sqrt{\frac{1}{\kappa^2} - \zeta^2} < \|\bar{y}\|_2\,\frac{\gamma_W}{1 + \gamma_W}, \tag{2.14}$$

then there exists a non-empty interval

$$(\lambda_l, \lambda_u) := \left(\frac{1}{r_W\Big(\|\bar{y}\|_2 - \big(\tfrac{1}{\kappa} - \sqrt{\tfrac{1}{\kappa^2} - \zeta^2}\big)\Big)},\ \frac{\gamma_W}{r_W\big(\tfrac{1}{\kappa} - \sqrt{\tfrac{1}{\kappa^2} - \zeta^2}\big)}\right).$$

(b) If $\lambda \in (\lambda_l, \lambda_u)$, then every optimal solution to (WMC) is manifold preserving and nonzero.
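As a brief numerical aside (our own sanity check, not part of the paper's analysis), Lemma 2 can be checked on the simplest one-dimensional manifold: the unit circle in $\mathbb{R}^2$, for which $\kappa = 1$ and the reach equals 1. Taking the two nearest samples at angles $\pm\theta$ from $y = (1, 0)$, the ball $B$ has radius $\zeta = 2\sin(\theta/2)$, $S$ is the chord through the two samples, and $\mathrm{dist}(y, S) = 1 - \cos\theta$. The sketch below (all names ours) verifies that this distance never exceeds the bound in (2.11) whenever $\mathrm{reach}(\mathcal{M}_1) \ge \zeta$.

```python
import numpy as np

def lemma2_gap_on_circle(theta):
    """Compare dist(y, S) with the Lemma 2 bound for the unit circle (kappa = 1).

    y = (1, 0); the two nearest samples sit at angles +/- theta, so
    zeta = 2*sin(theta/2) and S is the vertical chord x = cos(theta).
    Requires reach >= zeta, i.e. 2*sin(theta/2) <= 1 (theta <= 60 degrees).
    """
    kappa = 1.0
    y = np.array([1.0, 0.0])
    x1 = np.array([np.cos(theta), np.sin(theta)])
    x2 = np.array([np.cos(theta), -np.sin(theta)])
    zeta = max(np.linalg.norm(y - x1), np.linalg.norm(y - x2))   # radius of B
    dist_y_S = 1.0 - np.cos(theta)                               # distance to the chord S
    bound = 1.0 / kappa - np.sqrt(1.0 / kappa**2 - zeta**2)      # right-hand side of (2.11)
    return dist_y_S, bound

for theta_deg in (10, 30, 55):
    d, b = lemma2_gap_on_circle(np.radians(theta_deg))
    print(f"theta={theta_deg:2d} deg: dist(y,S)={d:.4f} <= bound={b:.4f}: {d <= b}")
```

For small angles the two quantities nearly coincide, which matches the intuition that (2.11) becomes tight as the sampling density near $y$ increases.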
The quantity $\gamma_W$ in Lemma 3 captures the notion of distribution of the data samples in the set $M_1 \cap N$ as well as the separation of the data samples not in $M_1$ from $\bar{S}$. Condition (2.12) requires $\gamma_W$ to be positive. Condition (2.13) is related to the sampling density of the manifold $\mathcal{M}_1$ near $y$. If the number of samples generated on $\mathcal{M}_1$ near $y$ increases, then $\zeta$ will decrease. Meanwhile, if the curvature $\kappa$ of the manifold at $y$ is large, $1/\kappa$ will be small, and from Definition 8, we see that $\mathrm{reach}(\mathcal{M}_1)$ will be small. This indicates that for (2.13) to hold, one may need to generate more samples on $\mathcal{M}_1$ near $y$. Finally, (2.14) provides a lower bound on $\gamma_W$ based on $\kappa$ (the curvature) and $\zeta$ (which reflects the number of samples generated on $\mathcal{M}_1$ near $y$). When all three of these conditions hold, we have a non-empty interval $(\lambda_l, \lambda_u)$ as defined in Lemma 3(a). If $\lambda_l < \lambda < \lambda_u$, then Lemma 3(b) ensures that every optimal solution to (WMC) is manifold preserving and nonzero.

3 Computational Results

In this section, we first perform experiments on randomly generated data to understand how the quantities in Lemma 1 change as a function of the input data and the hyperparameter $\lambda$ (see Subsection 3.1). Next, in Subsection 3.2, we compare the performance of our model (WMC) with existing methods on CIFAR datasets. Our goal here is to check the performance of (WMC) and to show that while (WMC) does not outperform the state-of-the-art methods, it performs only slightly worse in terms of clustering accuracy.

3.1 Experiments on synthetic data and understanding Lemma 1

In this section, we perform two experiments designed to understand how $\lambda_l$ and $\lambda_u$ (see Lemma 1) change as a function of the number of data samples $N$ and the hyperparameter $\lambda$ for randomly generated data from two trefoil knots. The noiseless data samples are embedded in $\mathbb{R}^{100}$. The embedding of the data involves generating a random orthonormal basis of a subspace in $\mathbb{R}^{100}$, followed by projecting the samples generated from the two trefoil knots onto this subspace. We choose $\eta = 1$ and define the matrix $W$ as a diagonal matrix with $j$-th diagonal entry $w_j = \|y - x_j\|_2 / \sum_k \|y - x_k\|_2$.

Experiment 1: $\{\lambda_l, \lambda_u\}$ versus $\lambda$. For this experiment, we evaluate how the values of $\lambda_l$ and $\lambda_u$ change when only $\lambda$ changes. We generate $N_1 \in \{120, 150, 160, 200\}$ and $N_2 = 40$ data samples from two non-intersecting trefoil knots of intrinsic dimension $d_1 = d_2 = 1$. We generate 50 different random embeddings of the data samples from the trefoil knots in $\mathbb{R}^{100}$. We plot the mean values and standard deviations of $\lambda_l$ and $\lambda_u$ for different values of $\lambda \in [4.8, 204.8] \times 10^{-4}$ in Figure 2. Since $\lambda_l$ is independent of $\lambda$, its value is constant across all values of $\lambda$ for each pair $(N_1, N_2)$. It also appears that $\lambda_u$ is a monotonically increasing function of $\lambda$ that eventually reaches a "steady-state" value. When $N_1 = 120$, we observe that the conditions in Lemma 1(a) are not satisfied for any value of $\lambda$, and the interval $(\lambda_l, \lambda_u)$ is empty. When $N_1 = 150$, we observe that $\lambda_u > \lambda_l$ for $\lambda > 0.0012$, but for these $\lambda$ values it also holds that $\lambda > \lambda_u$, meaning that Lemma 1(b) is violated for all $\lambda > 0$. When $N_1 \in \{160, 200\}$, we observe that for some values of $\lambda > 0$, there exist non-empty intervals $(\lambda_l, \lambda_u)$ such that $\lambda \in (\lambda_l, \lambda_u)$. Thus, for every such $\lambda \in (\lambda_l, \lambda_u)$, it follows from Lemma 1 that every optimal solution to (WMC) is manifold preserving and nonzero.
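For readers who wish to reproduce this kind of synthetic setup, the sketch below generates samples from two trefoil knots and embeds them in $\mathbb{R}^{100}$ via a random orthonormal basis, mirroring the data-generation procedure described at the start of this subsection. The specific knot parameterization, the shift separating the two knots, and all names are our own choices; the paper does not specify them.

```python
import numpy as np

def trefoil(n, rng, center=np.zeros(3), scale=1.0):
    """Sample n points from a trefoil knot in R^3 (a standard parameterization)."""
    t = rng.uniform(0.0, 2.0 * np.pi, size=n)
    knot = np.stack([np.sin(t) + 2.0 * np.sin(2.0 * t),
                     np.cos(t) - 2.0 * np.cos(2.0 * t),
                     -np.sin(3.0 * t)], axis=0)           # shape (3, n)
    return scale * knot + center[:, None]

def embed(points3d, D, rng):
    """Embed 3-D points into R^D using a random orthonormal basis."""
    G = rng.standard_normal((D, 3))
    Q, _ = np.linalg.qr(G)                                 # Q has orthonormal columns
    return Q @ points3d                                    # shape (D, n)

rng = np.random.default_rng(0)
N1, N2, D = 200, 40, 100
# Two non-intersecting knots: the second is shifted away from the first
# (the shift is our choice; the experiment only requires the knots not to intersect).
M1 = trefoil(N1, rng)
M2 = trefoil(N2, rng, center=np.array([10.0, 0.0, 0.0]))
X = embed(np.hstack([M1, M2]), D, rng)                     # data matrix, shape (100, 240)
labels = np.array([0] * N1 + [1] * N2)
print(X.shape, labels.shape)
```

Given such data, the quantities $\lambda_l$, $\lambda_u$, and $\gamma_W$ from Lemma 1 can be evaluated per point and averaged over random embeddings, as done in Figures 2 and 3.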
We can conclude from these experiments that for certain input data $(y, X)$, there exists a range of acceptable values of $\lambda$ for which the conditions in both parts (a) and (b) of Lemma 1 are satisfied, so that every optimal solution to (WMC) is nonzero and manifold preserving.

Figure 2: Plot of $\lambda_l$, $\lambda_u$, and $\lambda$ for $(N_1, N_2)$ samples generated from two trefoil knots, with panels for $N_1 \in \{120, 150, 160, 200\}$ and $N_2 = 40$. The data generated from the knots are embedded in $\mathbb{R}^{100}$ with 50 randomly generated embeddings.

Experiment 2: $\{\lambda_l, \lambda_u\}$ versus $N$. In this experiment, we aim to understand how $\lambda_l$ and $\lambda_u$ change when the number of data samples, $N$, changes. We let $N_1 = N_2$ and gradually increase $N = N_1 + N_2$. We generate 50 different random embeddings of data samples from the trefoil knots in $\mathbb{R}^{100}$. For each pair of data $(y, X)$, we compute $\lambda_l$ and then create three values of $\lambda$ by setting $\lambda = \alpha\lambda_l$ for $\alpha \in \{2, 5, 50\}$. Then, for each $\lambda$ value, we plot $\lambda_l$ and $\lambda_u$ versus $N = N_1 + N_2$ in Figure 3. We observe that as the number of samples increases, the size of the interval $(\lambda_l, \lambda_u)$ increases linearly. We further observe that for $\alpha = 2$, the interval $(\lambda_l, \lambda_u)$ is non-empty (satisfying the conditions in Lemma 1(a)) for all $N \ge 890$, whereas for $\alpha = 5, 50$, the interval is non-empty for all $N \ge 300$. Thus, we see that a small value of $\alpha$ leads to a slower increase in $\lambda_u$, resulting in the interval $(\lambda_l, \lambda_u)$ becoming non-empty only at a much larger value of $N$. Furthermore, we also see that (i) when $\alpha = 5$, $\lambda \in (\lambda_l, \lambda_u)$ for all $N \ge 330$, and (ii) when $\alpha = 50$, $\lambda \in (\lambda_l, \lambda_u)$ for all $N \ge 510$. In other words, when $\alpha = 5$, the conditions in both parts (a) and (b) of Lemma 1 are satisfied for all $N \ge 330$, whereas when $\alpha = 50$, they are satisfied for all $N \ge 510$, making $\lambda = 5\lambda_l$ the best choice of the hyperparameter in this experiment.

Figure 3: Plot of $\lambda_l$, $\lambda_u$, and $N = N_1 + N_2$ with $N_1 = N_2$ and $\lambda = \alpha\lambda_l$ for $\alpha \in \{2, 5, 50\}$. Samples are generated from two trefoil knots and embedded in $\mathbb{R}^{100}$ with 50 randomly generated embeddings.

3.2 Experiments on real data

In this section, we compare the empirical performance of our model with existing methods on the CIFAR-10, CIFAR-20, and CIFAR-100 datasets [33]. The CIFAR dataset consists of 60000 color images of size 32x32 that are divided into 10, 20, and 100 classes for CIFAR-10, CIFAR-20, and CIFAR-100, respectively. In Section 3.2.1, we describe the steps for clustering data using our model (WMC), followed by a comparison of empirical results in Section 3.2.2. (See Appendix B.1 for more details.)

3.2.1 Clustering data using (WMC)

We follow the steps in Algorithm 1 to cluster the input images using our model (WMC).

Algorithm 1 Pseudocode for clustering data using (WMC)
Input: Images from a CIFAR dataset, $\lambda > 0$, and the number of clusters $L$
1: Generate features using the CLIP encoder [63].
2: Construct a representation matrix $C$ by solving problem (WMC) for the CLIP feature vector of each input image.
3: Construct a symmetric affinity matrix $A$ from $C$.
4: Apply spectral clustering to $A$ to get the $L$ clusters.
Output: $L$ clusters of data

The CLIP (Contrastive Language-Image Pre-Training) [63] encoder is a powerful tool for producing pre-trained feature representations. Given its prior success, we use CLIP to map each input image in CIFAR to a feature vector in Step 1 of Algorithm 1. Step 2 of Algorithm 1 employs our model (WMC) to construct a representation matrix $C$. We now briefly explain this step.
For the feature vector $y$ of each input image: (a) we define the positive diagonal distance matrix $W$ so that each diagonal entry is an increasing function of the Euclidean distance between $y$ and each of the other feature vectors; (b) we define the model (WMC) for $\lambda > 0$; and (c) we solve (WMC) using an active-set method [18] to (hopefully) obtain a manifold preserving representation for the feature vector of each input image. (The choice of $\lambda$ and $W$ is explained further in Subsection 3.2.2.) Similar to classical spectral clustering methods, the representation matrix is used to define a symmetric affinity matrix $A$; here, we choose $A = \frac{1}{2}(|C| + |C|^T)$. (We used this definition in combination with other techniques to construct $A$, which are explained in detail in Appendix B.2.) Finally, we use spectral clustering on $A$ to cluster the data.

Table 1: Comparing clustering accuracy (ACC) and Normalized Mutual Information (NMI) for our models L-WMC and E-WMC with state-of-the-art subspace clustering and deep clustering methods. Each method is used to cluster the data pre-trained using CLIP [63]. For each metric (ACC and NMI) and dataset, the best overall value achieved is bolded, and the best value achieved by our method is in blue.

| Dataset | CIFAR-10 ACC | CIFAR-10 NMI | CIFAR-20 ACC | CIFAR-20 NMI | CIFAR-100 ACC | CIFAR-100 NMI |
| --- | --- | --- | --- | --- | --- | --- |
| Methods that do not fine-tune representations | | | | | | |
| L-WMC | 96.07 | 90.54 | 63.97 | 69.55 | 69.2 | 76.49 |
| E-WMC | 94.34 | 88.79 | 64.1 | 69.58 | 69.83 | 76.63 |
| SMCE | 86.86 | 90.66 | 63.1 | 70.53 | 68.56 | 77.17 |
| EnSC [18] | 85.8 | 89.2 | 61.6 | 69.3 | 66.6 | 77.1 |
| SSC [22] | 85.4 | 84.6 | 60.9 | 65.3 | 64.6 | 72.8 |
| Methods that fine-tune representations | | | | | | |
| TEMI [64] | 96.9 | 92.6 | 61.8 | 64.5 | 73.7 | 79.9 |
| CPP [56] | 97.4 | 93.6 | 64.2 | 72.5 | 74.0 | 81.8 |

The numerical results in this row are taken from [56]. See the appendix in [56, 64] for details on fine-tuning models.

Output metrics. We use two metrics to compare the performance of the various clustering methods: Clustering Accuracy (ACC) and Normalized Mutual Information (NMI), whose values range from 0 to 100%, with higher values indicating better performance.

3.2.2 Numerical Results

In this section, we compare the numerical performance of two instances of our Algorithm 1 with the subspace clustering methods (a) Elastic Net Subspace Clustering (EnSC) [18] and (b) Sparse Subspace Clustering (SSC) [22], the manifold clustering method (c) SMCE [58], and the deep clustering methods (d) CPP [56] and (e) TEMI [64]. Each method is applied to the CLIP features extracted from the input images. The two instances of our Algorithm 1 are determined by how the weight matrix $W$ is chosen in (WMC), as we now describe.

L-WMC: Linearly Weighted Manifold Clustering model with $w_j = \|x_j - y\|_2 / \sum_k \|x_k - y\|_2$ for all $j$ in (WMC). Here, the weights $\{w_j\}$ are linearly proportional to the Euclidean distances.

E-WMC: Exponentially Weighted Manifold Clustering model with weights in (WMC) defined as $w_j = \exp(2\|x_j - y\|_2) / \sum_k \exp(2\|x_k - y\|_2)$ for all $j$. Here, the weights $\{w_j\}$ are defined as an exponential function of the Euclidean distances.

We report the clustering results for L-WMC and E-WMC in Table 1. (See Appendix B.2 for details on the choice of $\lambda$ and $\eta$.) From Table 1, we observe that the clustering accuracy of L-WMC and E-WMC is consistently better than that of EnSC, SSC, and SMCE for all three datasets. Moreover, L-WMC and E-WMC perform slightly worse than the state-of-the-art method CPP on the CIFAR-10 and CIFAR-20 datasets.
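To make Steps 2-4 of Algorithm 1 concrete, the following sketch (our illustration, under the stated weight choices) builds homogenized data, computes the L-WMC or E-WMC weights, solves (WMC) for every point with a simple weighted proximal-gradient loop standing in for the active-set solver of [18], and then forms the symmetric affinity $A = \frac{1}{2}(|C| + |C|^T)$ for spectral clustering. The function names and the use of scikit-learn are assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def weights(y, X, kind="linear"):
    """L-WMC ('linear') or E-WMC ('exp') weights from Euclidean distances."""
    d = np.linalg.norm(X - y[:, None], axis=0)
    v = d if kind == "linear" else np.exp(2.0 * d)
    return v / v.sum()

def solve_wmc(y_bar, X_bar, w, lam, n_iter=300):
    """Weighted-l1 proximal gradient for min ||Wc||_1 + (lam/2)||y_bar - X_bar c||^2.

    A stand-in for the active-set solver of [18]; the model, not speed, is the
    point of this sketch.
    """
    c = np.zeros(X_bar.shape[1])
    step = 1.0 / (lam * np.linalg.norm(X_bar, 2) ** 2)
    for _ in range(n_iter):
        grad = lam * (X_bar.T @ (X_bar @ c - y_bar))
        z = c - step * grad
        c = np.sign(z) * np.maximum(np.abs(z) - step * w, 0.0)   # prox of ||W .||_1
    return c

def cluster_wmc(F, n_clusters, lam=50.0, eta=20.0, kind="linear"):
    """F: (d, N) feature matrix (e.g., CLIP features). Returns cluster labels."""
    d, N = F.shape
    F_bar = np.vstack([F, eta * np.ones((1, N))])                # homogenized data
    C = np.zeros((N, N))
    for i in range(N):
        idx = np.arange(N) != i                                  # exclude the point itself
        w = weights(F[:, i], F[:, idx], kind)
        C[idx, i] = solve_wmc(F_bar[:, i], F_bar[:, idx], w, lam)
    A = 0.5 * (np.abs(C) + np.abs(C).T)                          # symmetric affinity
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return sc.fit_predict(A)
```

On real CIFAR data, `F` would hold the CLIP embeddings from Step 1, and the affinity post-processing variants described in Appendix B.3 can be applied to `C` before spectral clustering.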
Although our method does not outperform the existing state-of-the-art method, it is the first to perform similarly to the state-of-the-art methods while also having a theoretical guarantee of correctness (see Lemma 1 and Lemma 3).

4 Discussion

In this paper, we propose a model whose solution defines a self-expressive representation of the input data, and provide a theoretical analysis of correctness of the model (see Lemma 1 and Lemma 3). For data $(y, X)$ and hyperparameter $\lambda > 0$, we give conditions under which the solution to our model is non-trivial and manifold preserving. To the best of our knowledge, this is the first work that provides a theoretical understanding of manifold clustering models. We also show that our model performs only marginally worse than the current state-of-the-art methods on CIFAR datasets.

For a given data set, our analysis does not provide a strategy for choosing the hyperparameter $\lambda > 0$ so that it satisfies the conditions in Lemma 1. This leads to an important open question: For any given data set, can one provide a proof of existence of $\lambda > 0$ for which the conditions in Lemma 1 are satisfied? If so, can one provide a closed-form expression for such a value of $\lambda$? An answer to this question would result in an "optimal" choice for the value of $\lambda$, thus avoiding the cost of tuning $\lambda$.

Acknowledgements

We would like to thank the National Science Foundation for its generous support through grants 1704458, 2031985, and 2212457, and the Lehigh internal grant funding the project "Foundations and Applications of Mathematical Optimization and Data Science".

References

[1] Yi Ma, Harm Derksen, Wei Hong, and John Wright. Segmentation of multivariate mixed data via lossy data coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(9):1546-1562, 2007.
[2] Cheolkon Jung, LC Jiao, Juan Liu, and Yanbo Shen. Image segmentation via manifold spectral clustering. In 2011 IEEE International Workshop on Machine Learning for Signal Processing, pages 1-6. IEEE, 2011.
[3] Li Liu, Liang Kuang, Yunfeng Ji, et al. Multimodal MRI brain tumor image segmentation using sparse subspace clustering algorithm. Computational and Mathematical Methods in Medicine, 2020, 2020.
[4] Stephen Tierney, Junbin Gao, and Yi Guo. Subspace clustering for sequential data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1019-1026, 2014.
[5] Sheng Li, Kang Li, and Yun Fu. Temporal subspace clustering for human motion segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4453-4461, 2015.
[6] Ehsan Elhamifar and René Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765-2781, 2013.
[7] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[8] Zengyi Li, Yubei Chen, Yann LeCun, and Friedrich T Sommer. Neural manifold clustering and embedding. arXiv preprint arXiv:2201.10000, 2022.
[9] Alvina Goh and René Vidal. Segmenting motions of different types by unsupervised manifold clustering. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1-6. IEEE, 2007.
[10] Fei Wu, Yongli Hu, Junbin Gao, Yanfeng Sun, and Baocai Yin. Ordered subspace clustering with block-diagonal priors. IEEE Transactions on Cybernetics, 46(12):3209-3219, 2015.
[11] Behnam Gholami and Vladimir Pavlovic.
Probabilistic temporal subspace clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3066 3075, 2017. [12] Giancarlo Paoletti, Jacopo Cavazza, Cigdem Beyan, and Alessio Del Bue. Subspace clustering for action recognition with covariance representations and temporal pruning. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 6035 6042. IEEE, 2021. [13] Rafael Garcia, Bruno C da Silva, and João LD Comba. Task-based behavior generalization via manifold clustering. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6047 6052. IEEE, 2017. [14] Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering. pages 2790 2797, June 2009. [15] Can-Yi Lu, Hai Min, Zhong-Qiu Zhao, Lin Zhu, De-Shuang Huang, and Shuicheng Yan. Robust and efficient subspace segmentation via least squares regression. In Computer Vision ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part VII 12, pages 347 360. Springer, 2012. [16] Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma. Robust recovery of subspace structures by low-rank representation. IEEE transactions on pattern analysis and machine intelligence, 35(1):171 184, 2012. [17] Reinhard Heckel and Helmut Bölcskei. Robust subspace clustering via thresholding. IEEE Transactions on Information Theory, 61(11):6320 6342, 2015. [18] Chong You, Chun-Guang Li, Daniel P Robinson, and René Vidal. Oracle based active set algorithm for scalable elastic net subspace clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3928 3937, 2016. [19] John Lipor, David Hong, Yan Shuo Tan, and Laura Balzano. Subspace clustering using ensembles of k-subspaces. Information and Inference: A Journal of the IMA, 10(1):73 107, 2021. [20] Connor Lane, Ron Boger, Chong You, Manolis Tsakiris, Benjamin Haeffele, and Rene Vidal. Classifying and comparing approaches to subspace clustering with missing data. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0 0, 2019. [21] Yanxi Chen, Gen Li, and Yuantao Gu. Active orthogonal matching pursuit for sparse subspace clustering. IEEE Signal Processing Letters, 25(2):164 168, 2017. [22] Chong You, Daniel Robinson, and René Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3918 3927, 2016. [23] Manolis Tsakiris and Rene Vidal. Theoretical analysis of sparse subspace clustering with missing entries. In International Conference on Machine Learning, pages 4975 4984. PMLR, 2018. [24] Chun-Guang Li, Chong You, and René Vidal. On geometric analysis of affine sparse subspace clustering. IEEE Journal of Selected Topics in Signal Processing, 12(6):1520 1533, 2018. [25] Mahdi Soltanolkotabi and Emmanuel J Candes. A geometric analysis of subspace clustering with outliers. 2012. [26] Mahdi Soltanolkotabi, Ehsan Elhamifar, and Emmanuel J Candes. Robust subspace clustering. 2014. [27] Yu-Xiang Wang and Huan Xu. Noisy sparse subspace clustering. Journal of Machine Learning Research, 17(12):1 41, 2016. [28] Yining Wang, Yu-Xiang Wang, and Aarti Singh. A deterministic analysis of noisy sparse subspace clustering for dimensionality-reduced data. In International Conference on Machine Learning, pages 1422 1431. PMLR, 2015. [29] Daniel P Robinson, Rene Vidal, and Chong You. 
Basis pursuit and orthogonal matching pursuit for subspace-preserving recovery: Theoretical analysis. ar Xiv preprint ar Xiv:1912.13091, 2019. [30] Chong You, Chun-Guang Li, Daniel P Robinson, and René Vidal. Is an affine constraint needed for affine subspace clustering? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9915 9924, 2019. [31] Peng Wang, Huikang Liu, Anthony Man-Cho So, and Laura Balzano. Convergence and recovery guarantees of the k-subspaces method for subspace clustering. In International Conference on Machine Learning, pages 22884 22918. PMLR, 2022. [32] Tianjiao Ding, Derek Lim, René Vidal, and Benjamin D Haeffele. Understanding doubly stochastic clustering. In International Conference on Machine Learning, pages 5153 5165. PMLR, 2022. [33] Krizhevsky Alex. Learning multiple layers of features from tiny images. https://www. cs. toronto. edu/kriz/learning-features-2009-TR. pdf, 2009. [34] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211 252, 2015. [35] Pan Ji, Tong Zhang, Hongdong Li, Mathieu Salzmann, and Ian Reid. Deep subspace clustering networks. volume 2017-Decem, pages 24 33, 2017. [36] Xi Peng, Jiashi Feng, Shijie Xiao, Jiwen Lu, Zhang Yi, and Shuicheng Yan. Deep Sparse Subspace Clustering. ar Xiv [cs.CV], September 2017. [37] Mahdi Abavisani and Vishal M Patel. Deep Multimodal Subspace Clustering Networks. IEEE Journal of Selected Topics in Signal Processing, 12(6):1601 1614, April 2018. [38] Yangbangyan Jiang, Zhiyong Yang, Qianqian Xu, Xiaochun Cao, and Qingming Huang. When to Learn What: Deep Cognitive Subspace Clustering. In Proceedings of the 26th ACM international conference on Multimedia, MM 18, pages 718 726, New York, NY, USA, October 2018. Association for Computing Machinery. [39] Xi Peng, Jiashi Feng, Shijie Xiao, Wei-Yun Yau, Joey Tianyi Zhou, and Songfan Yang. Structured Auto Encoders for Subspace Clustering. IEEE Transactions on Image Processing, 27(10):5076 5086, October 2018. Publisher: IEEE. [40] Xiaoliang Tang, Xuan Tang, Wanli Wang, Li Fang, and Xian Wei. Deep Multi-view Sparse Subspace Clustering. In Proceedings of the 2018 VII International Conference on Network, Communication and Computing, ICNCC 18, pages 115 119, New York, NY, USA, December 2018. Association for Computing Machinery. [41] Pan Zhou, Yunqing Hou, and Jiashi Feng. Deep adversarial subspace clustering. In 2018 IEEE/CVF conference on computer vision and pattern recognition, pages 1596 1604, 2018. [42] Yangbangyan Jiang, Qianqian Xu, Zhiyong Yang, Xiaochun Cao, and Qingming Huang. Duet robust deep subspace clustering. In Proceedings of the 27th ACM international conference on multimedia, Mm 19, pages 1596 1604, New York, NY, USA, 2019. Association for Computing Machinery. Number of pages: 9 Place: Nice, France. [43] Ruihuang Li, Changqing Zhang, Huazhu Fu, Xi Peng, Joey Tianyi Zhou, and Qinghua Hu. Reciprocal Multi-Layer Subspace Learning for Multi-View Clustering. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8171 8179, Seoul, Korea (South), October 2019. IEEE. [44] Xiukun Sun, Miaomiao Cheng, Chen Min, and Liping Jing. Self-Supervised Deep Multi-View Subspace Clustering. In Proceedings of The Eleventh Asian Conference on Machine Learning, pages 1001 1016. PMLR, October 2019. ISSN: 2640-3498. 
[45] Meng Zeng, Yaoming Cai, Xiaobo Liu, Zhihua Cai, and Xiang Li. Spectral-Spatial Clustering of Hyperspectral Image Based on Laplacian Regularized Deep Subspace Clustering. In IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium, pages 2694 2697, July 2019. ISSN: 2153-7003. [46] Meng Zeng, Yaoming Cai, Zhihua Cai, Xiaobo Liu, Peng Hu, and Junhua Ku. Unsupervised Hyperspectral Image Band Selection Based on Deep Subspace Clustering. IEEE Geoscience and Remote Sensing Letters, 16(12):1889 1893, December 2019. Conference Name: IEEE Geoscience and Remote Sensing Letters. [47] Tong Zhang, Pan Ji, Mehrtash Harandi, Wenbing Huang, and Hongdong Li. Neural collaborative subspace clustering. 36th International Conference on Machine Learning, ICML 2019, 2019June:12777 12786, 2019. [48] Junjian Zhang, Chun-Guang Li, Chong You, Xianbiao Qi, Honggang Zhang, Jun Guo, and Zhouchen Lin. Self-Supervised Convolutional Subspace Clustering Network. 2019. [49] Lei Zhou, Xiao Bai, Dong Wang, Xianglong Liu, Jun Zhou, and Edwin Hancock. Latent Distribution Preserving Deep Subspace Clustering. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pages 4440 4446, Macao, China, August 2019. International Joint Conferences on Artificial Intelligence Organization. [50] Mohsen Kheirandishfard, Fariba Zohrizadeh, and Farhad Kamangar. Multi-Level Representation Learning for Deep Subspace Clustering. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 2028 2037, Snowmass Village, CO, USA, March 2020. IEEE. [51] Shuai Yang, Wenqi Zhu, and Yuesheng Zhu. Residual Encoder-Decoder Network For Deep Subspace Clustering. In 2020 IEEE International Conference on Image Processing (ICIP), pages 2895 2899, Abu Dhabi, United Arab Emirates, October 2020. IEEE. [52] Changqing Zhang, Huazhu Fu, Qinghua Hu, Xiaochun Cao, Yuan Xie, Dacheng Tao, and Dong Xu. Generalized Latent Multi-View Subspace Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1):86 99, January 2020. Conference Name: IEEE Transactions on Pattern Analysis and Machine Intelligence. [53] Jeya Maria Jose Valanarasu and Vishal M. Patel. Overcomplete Deep Subspace Clustering Networks. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 746 755, Waikoloa, HI, USA, January 2021. IEEE. [54] Zengyi Li, Yubei Chen, Yann Le Cun, and Friedrich T Sommer. Neural Manifold Clustering and Embedding. ar Xiv [cs.LG], January 2022. [55] Tianjiao Ding, Shengbang Tong, Kwan Ho Ryan Chan, Xili Dai, Yi Ma, and Benjamin D Haeffele. Unsupervised Manifold Linearizing and Clustering. pages 5450 5461, October 2023. [56] Tianzhe Chu, Shengbang Tong, Tianjiao Ding, Xili Dai, Benjamin David Haeffele, Rene Vidal, and Yi Ma. Image clustering via the principle of rate reduction in the age of pretrained models. ar Xiv preprint ar Xiv:2306.05272, 2023. [57] Benjamin D Haeffele, Chong You, and René Vidal. A Critique of Self-Expressive Deep Subspace Clustering. 2020. [58] Ehsan Elhamifar and René Vidal. Sparse manifold clustering and embedding. Advances in neural information processing systems, 24, 2011. [59] Athinodoros S. Georghiades, Peter N. Belhumeur, and David J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE transactions on pattern analysis and machine intelligence, 23(6):643 660, 2001. [60] Sameer A Nene, Shree K Nayar, Hiroshi Murase, et al. Columbia object image library (coil-20). 1996. 
[61] Maryam Abdolali and Nicolas Gillis. Beyond linear subspace clustering: A comparative study of nonlinear manifold clustering algorithms. arXiv preprint arXiv:2103.10656, 2021.
[62] Yu-Xiang Wang and Huan Xu. Noisy sparse subspace clustering. In International Conference on Machine Learning, pages 89-97. PMLR, 2013.
[63] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763. PMLR, 2021.
[64] Nikolas Adaloglou, Felix Michels, Hamza Kalisch, and Markus Kollmann. Exploring the limits of deep image clustering using pretrained models. arXiv preprint arXiv:2303.17896, 2023.
[65] René Brandenberg, Abhi Dattasharma, Peter Gritzmann, and David Larman. Isoradial bodies. Discrete & Computational Geometry, 32:447-457, 2004.
[66] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[67] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[68] Ery Arias-Castro, Guangliang Chen, and Gilad Lerman. Spectral clustering based on local linear approximations. 2011.
[69] Ery Arias-Castro, Gilad Lerman, and Teng Zhang. Spectral clustering based on local PCA. Journal of Machine Learning Research, 18(9):1-57, 2017.
[70] Xu Wang, Konstantinos Slavakis, and Gilad Lerman. Multi-manifold modeling in non-Euclidean spaces. In Artificial Intelligence and Statistics, pages 1023-1032. PMLR, 2015.
[71] Xu Wang, Konstantinos Slavakis, and Gilad Lerman. Riemannian multi-manifold modeling. arXiv preprint arXiv:1410.0095, 2014.
[72] Anna Little, Mauro Maggioni, and James M. Murphy. Path-based spectral clustering: Guarantees, robustness to outliers, and fast algorithms. Journal of Machine Learning Research, 21(6):1-66, 2020.

A.1 Proof of Lemma 1

Proof of Lemma 1. To prove Lemma 1, we start by stating a condition under which every solution to (WMC) is manifold preserving in the following proposition.

Proposition 1. Let $\nu^\star$ be the dual direction as defined in Definition 5. If

$$|\langle \bar{x}_j, \nu^\star\rangle| < w_j \quad \forall\, x_j \in M \setminus M_1, \tag{A.1}$$

then every optimal solution to (WMC) is manifold preserving.

The proof of Proposition 1 is given in Section A.2. Now, from (A.1), we have that

$$|\langle \bar{x}_j, \nu^\star\rangle| = |\langle \bar{x}_j, P_{\bar{S}}(\nu^\star) + P_{\bar{S}^\perp}(\nu^\star)\rangle| \le |\langle \bar{x}_j, P_{\bar{S}}(\nu^\star)\rangle| + |\langle \bar{x}_j, P_{\bar{S}^\perp}(\nu^\star)\rangle|, \tag{A.2}$$

where $P_{\bar{S}}(\cdot)$ and $P_{\bar{S}^\perp}(\cdot)$ denote the projections onto $\bar{S}$ and the subspace orthogonal to $\bar{S}$, respectively. Thus, we see that if the stronger condition

$$|\langle \bar{x}_j, P_{\bar{S}}(\nu^\star)\rangle| + |\langle \bar{x}_j, P_{\bar{S}^\perp}(\nu^\star)\rangle| < w_j \tag{A.3}$$

is satisfied for all $x_j \in M \setminus M_1$, then (A.1) is satisfied.

Bounding $|\langle \bar{x}_j, P_{\bar{S}^\perp}(\nu^\star)\rangle|$. From the KKT conditions, we have that $\nu^\star = \lambda(\bar{y} - \bar{X}_{\bar{S}}c^\star)$, where $(c^\star, \nu^\star)$ is the primal-dual optimal solution to (2.5)-(2.6).
So,

$$\|P_{\bar{S}^\perp}(\nu^\star)\|_2 = \|P_{\bar{S}^\perp}(\lambda(\bar{y} - \bar{X}_{\bar{S}}c^\star))\|_2 \overset{(i)}{=} \|P_{\bar{S}^\perp}(\lambda \bar{y})\|_2 = \lambda\,\mathrm{dist}(\bar{y}, \bar{S}), \tag{A.4}$$

where (i) follows since $\bar{X}_{\bar{S}}c^\star$ is a linear combination of data samples in $\bar{S}$ and so $\bar{X}_{\bar{S}}c^\star \in \bar{S}$. Furthermore, we have

$$\mathrm{dist}(\bar{y}, \bar{S}) = \min_a \Big\{\|\bar{y} - \bar{z}\|_2 : \bar{z} = \textstyle\sum_{x_k \in N\cap M_1} a_k \bar{x}_k\Big\} \le \min_a \Big\{\|\bar{y} - \bar{z}\|_2 : \bar{z} = \textstyle\sum_{x_k\in N\cap M_1} a_k \bar{x}_k,\ \textstyle\sum_{x_k\in N\cap M_1} a_k = 1\Big\} = \min_a \Big\{\|y - z\|_2 : z = \textstyle\sum_{x_k\in N\cap M_1} a_k x_k,\ \textstyle\sum_{x_k\in N\cap M_1} a_k = 1\Big\} = \mathrm{dist}(y, S). \tag{A.5}$$

From (A.4) and (A.5), we have

$$|\langle \bar{x}_j, P_{\bar{S}^\perp}(\nu^\star)\rangle| \le \|\bar{x}_j\|_2\,\|P_{\bar{S}^\perp}(\nu^\star)\|_2 \le \lambda\,\|\bar{x}_j\|_2\,\mathrm{dist}(y, S) \quad \forall x_j \in M\setminus M_1. \tag{A.6}$$

Combining (A.3) and (A.6), we have the stronger condition

$$|\langle \bar{x}_j, P_{\bar{S}}(\nu^\star)\rangle| + \lambda\,\|\bar{x}_j\|_2\,\mathrm{dist}(y, S) < w_j \quad \forall x_j\in M\setminus M_1, \tag{A.7}$$

and equivalently, if

$$\lambda < \frac{w_j - |\langle \bar{x}_j, P_{\bar{S}}(\nu^\star)\rangle|}{\mathrm{dist}(y, S)\,\|\bar{x}_j\|_2} \quad \forall x_j\in M\setminus M_1, \tag{A.8}$$

then (A.1) is also satisfied, and every optimal solution to (WMC) is manifold preserving. We can write (A.8) as

$$\lambda < \min_{x_j\in M\setminus M_1} \frac{w_j - |\langle \bar{x}_j, P_{\bar{S}}(\nu^\star)\rangle|}{\mathrm{dist}(y, S)\,\|\bar{x}_j\|_2}. \tag{A.9}$$

Bounding $\|P_{\bar{S}}(\nu^\star)\|_2$. The dual constraint of (2.6) can be written as

$$\|\bar{X}_{\bar{S}}^T\nu^\star\|_{\infty,W^{-1}} \le 1 \iff \|\bar{X}_{\bar{S}}^T(P_{\bar{S}}(\nu^\star) + P_{\bar{S}^\perp}(\nu^\star))\|_{\infty,W^{-1}} \le 1 \iff \|\bar{X}_{\bar{S}}^T P_{\bar{S}}(\nu^\star)\|_{\infty,W^{-1}} \le 1 \iff \Big|\Big\langle \frac{\bar{x}_k}{w_k}, P_{\bar{S}}(\nu^\star)\Big\rangle\Big| \le 1 \quad \forall x_k \in M_1\cap N. \tag{A.10}$$

Recall the definition of the symmetric convex body $Q^W_{\bar{S}} = \mathrm{conv}\{\pm\bar{x}_j/w_j : x_j\in M_1\cap N\}$ as given in Definition 6. From the last inequality in (A.10), we have that $P_{\bar{S}}(\nu^\star)$ belongs to the polar set of $Q^W_{\bar{S}}$, i.e., $P_{\bar{S}}(\nu^\star)\in (Q^W_{\bar{S}})^\circ$.

Proposition 2 ([65]). For a symmetric convex body $P$, the inradius of $P$ and the circumradius of its polar set $P^\circ$ satisfy

$$r(P)\,R(P^\circ) = 1. \tag{A.11}$$

Using the fact that $\|P_{\bar{S}}(\nu^\star)\|_2 \le R((Q^W_{\bar{S}})^\circ)$, and setting $P = Q^W_{\bar{S}}$ in (A.11), we have that

$$\|P_{\bar{S}}(\nu^\star)\|_2\, r(Q^W_{\bar{S}}) \le 1, \tag{A.12}$$

or equivalently,

$$\|P_{\bar{S}}(\nu^\star)\|_2 \le \frac{1}{r_W}. \tag{A.13}$$

Thus, from (A.1), (A.9), and (A.13), we see that if

$$\lambda < \frac{w_j r_W - \frac{|\langle \bar{x}_j, P_{\bar{S}}(\nu^\star)\rangle|}{\|P_{\bar{S}}(\nu^\star)\|_2}}{r_W\,\mathrm{dist}(y, S)\,\|\bar{x}_j\|_2} \quad \forall x_j\in M\setminus M_1, \tag{A.14}$$

then every optimal solution to (WMC) is manifold preserving. However, if $\lambda$ is too small, then the optimal solution to (WMC) will be trivial, i.e., $c^\star = 0$. Thus, to obtain a non-trivial optimal solution to (WMC), $\lambda$ cannot be too small. The following proposition provides a lower bound on $\lambda$ that guarantees a non-trivial solution.

Proposition 3. If $\lambda > \frac{1}{r_W(\|\bar{y}\|_2 - \mathrm{dist}(y,S))}$, then every optimal solution to (WMC) is nonzero.

Proof. We prove the result by contradiction. Assume that $\lambda > \frac{1}{r_W(\|\bar{y}\|_2 - \mathrm{dist}(y,S))}$ and that $c^\star = 0$ is an optimal solution to (WMC). Then, from the KKT conditions applied to (WMC), we have that $\nu^\star = \lambda(\bar{y} - \bar{X}c^\star) = \lambda\bar{y}$, where $\nu^\star$ is the optimal solution to the dual (2.3). We have that

$$\|\bar{X}^T\nu^\star\|_{\infty,W^{-1}} = \lambda\|\bar{X}^T\bar{y}\|_{\infty,W^{-1}} \ge \lambda\|\bar{X}_{\bar{S}}^T\bar{y}\|_{\infty,W^{-1}}. \tag{A.15}$$

Furthermore, we have

$$\|\bar{X}_{\bar{S}}^T\bar{y}\|_{\infty,W^{-1}} = \|\bar{X}_{\bar{S}}^T(P_{\bar{S}}(\bar{y}) + P_{\bar{S}^\perp}(\bar{y}))\|_{\infty,W^{-1}} = \|\bar{X}_{\bar{S}}^T P_{\bar{S}}(\bar{y})\|_{\infty,W^{-1}}. \tag{A.16}$$

Combining (A.15) and (A.16), we have

$$\|\bar{X}^T\nu^\star\|_{\infty,W^{-1}} \ge \lambda\|\bar{X}_{\bar{S}}^T P_{\bar{S}}(\bar{y})\|_{\infty,W^{-1}} = \lambda\,\frac{\|\bar{X}_{\bar{S}}^T P_{\bar{S}}(\bar{y})\|_{\infty,W^{-1}}}{\|P_{\bar{S}}(\bar{y})\|_2}\,\|P_{\bar{S}}(\bar{y})\|_2 \ge \lambda\Big(\min_{u\in\bar{S},\,u\ne 0}\frac{\|\bar{X}_{\bar{S}}^T u\|_{\infty,W^{-1}}}{\|u\|_2}\Big)\|P_{\bar{S}}(\bar{y})\|_2 \overset{(i)}{=} \lambda\, r_W\,\|P_{\bar{S}}(\bar{y})\|_2, \tag{A.17}$$

where the last equality follows since the inradius $r_W$ (as defined in Definition 6) can be written as

$$r_W = \min_{u\in\bar{S},\,u\ne 0}\ \max_{x_j\in M_1\cap N}\frac{|\langle \bar{x}_j/w_j, u\rangle|}{\|u\|_2} = \min_{u\in\bar{S},\,u\ne 0}\frac{\|\bar{X}_{\bar{S}}^T u\|_{\infty,W^{-1}}}{\|u\|_2}.$$

Furthermore, we have

$$\|P_{\bar{S}}(\bar{y})\|_2 = \|\bar{y} - P_{\bar{S}^\perp}(\bar{y})\|_2 \ge \|\bar{y}\|_2 - \|P_{\bar{S}^\perp}(\bar{y})\|_2 = \|\bar{y}\|_2 - \mathrm{dist}(\bar{y}, \bar{S}) \ge \|\bar{y}\|_2 - \mathrm{dist}(y, S), \tag{A.18}$$

where the last inequality follows from (A.5). Substituting this in (A.17), we have

$$\|\bar{X}^T\nu^\star\|_{\infty,W^{-1}} \ge \lambda\, r_W(\|\bar{y}\|_2 - \mathrm{dist}(y, S)) > 1, \tag{A.19}$$

where the last inequality follows since $\lambda > \frac{1}{r_W(\|\bar{y}\|_2 - \mathrm{dist}(y,S))}$. Thus, when $\lambda > \frac{1}{r_W(\|\bar{y}\|_2 - \mathrm{dist}(y,S))}$, $c^\star = 0$ cannot be an optimal solution since the constraint in the dual problem (2.3) is violated. This proves the statement of the proposition.

From Proposition 3 and (A.14), we see that if

$$\frac{1}{r_W(\|\bar{y}\|_2 - \mathrm{dist}(y, S))} < \min_{x_j\in M\setminus M_1}\frac{w_j r_W - \frac{|\langle \bar{x}_j, P_{\bar{S}}(\nu^\star)\rangle|}{\|P_{\bar{S}}(\nu^\star)\|_2}}{r_W\,\mathrm{dist}(y, S)\,\|\bar{x}_j\|_2}, \tag{A.20}$$

then we get an interval of acceptable values of $\lambda$ for which (WMC) has a non-trivial and manifold preserving solution.
Rearranging the terms in (A.20), we see that if

$$\frac{r_W\,\mathrm{dist}(y, S)}{r_W(\|\bar{y}\|_2 - \mathrm{dist}(y, S))} < \min_{x_j\in M\setminus M_1}\frac{w_j r_W - \frac{|\langle \bar{x}_j, P_{\bar{S}}(\nu^\star)\rangle|}{\|P_{\bar{S}}(\nu^\star)\|_2}}{\|\bar{x}_j\|_2} = \gamma_W,$$

i.e.,

$$\mathrm{dist}(y, S) < \|\bar{y}\|_2\,\frac{\gamma_W}{1 + \gamma_W},$$

then there exists a non-empty interval

$$(\lambda_l, \lambda_u) := \Big(\frac{1}{r_W(\|\bar{y}\|_2 - \mathrm{dist}(y,S))},\ \frac{\gamma_W}{r_W\,\mathrm{dist}(y,S)}\Big),$$

and for any $\lambda\in(\lambda_l,\lambda_u)$, every optimal solution to (WMC) is manifold preserving and nonzero.

A.2 Proof of Proposition 1

Proof of Proposition 1. To prove Proposition 1, we need to derive conditions (based on an optimal solution to the reduced dual (2.6)) under which any optimal solution to (WMC) is manifold preserving. We do so by (i) giving conditions on a primal-dual feasible solution to (WMC) and its dual (2.3) that ensure every optimal solution to (WMC) is manifold preserving (see Lemma 4), and then (ii) constructing such a primal-dual feasible pair for (WMC) from the optimal primal-dual solution to the reduced problem (2.5) and its dual (2.6).

The following lemma provides conditions for any optimal solution of (WMC) to be manifold preserving. A similar result was given in [62] for the linear sparse subspace clustering problem; in this work, we have adapted [62, Lemma 12] to our proposed model (WMC).

Lemma 4. Let $(c, e, \nu)$ be a primal-dual feasible solution to (WMC) and its dual (2.3) such that $c$ has support $B$, where $B \subseteq A \subseteq \{1, \ldots, N\}$, and the dual feasible solution satisfies:

1. $(W^{-T}\bar{X}^T)_B\nu = \mathrm{sgn}((Wc)_B)$,
2. $\nu = \lambda e$,
3. $\|\bar{X}_{A\setminus B}^T\nu\|_{\infty,W^{-1}} \le 1$,
4. $\|\bar{X}_{A^c}^T\nu\|_{\infty,W^{-1}} < 1$.

Then every optimal solution $(c^\star, e^\star, \nu^\star)$ to (WMC) and its dual (2.3) satisfies $c^\star_{A^c} = 0$.

Proof. Let $(c^\star, e^\star)$ be an optimal solution to (WMC). We then have

$$\|c^\star\|_{1,W} + \frac{\lambda}{2}\|e^\star\|_2^2 = \|c^\star_B\|_{1,W} + \|c^\star_{A\setminus B}\|_{1,W} + \|c^\star_{A^c}\|_{1,W} + \frac{\lambda}{2}\|e^\star\|_2^2 \ge \|c_B\|_{1,W} + \langle \mathrm{sgn}((Wc)_B), (Wc^\star)_B - (Wc)_B\rangle + \|c^\star_{A\setminus B}\|_{1,W} + \|c^\star_{A^c}\|_{1,W} + \frac{\lambda}{2}\|e^\star\|_2^2. \tag{A.22}$$

We now determine a lower bound on $\|e^\star\|_2^2$ from the following proposition.

Proposition 4. For any $\gamma > 0$, the function

$$g(z) := \gamma\langle e^\star, z\rangle - \frac{\gamma}{2}z^Tz \tag{A.23}$$

has the unique maximizer $z = e^\star$, and $g(e^\star) = \frac{\gamma}{2}\|e^\star\|_2^2$.

Setting $\gamma = \lambda$, we have

$$\frac{\lambda}{2}\|e^\star\|_2^2 = g(e^\star) \ge g(e) = \frac{\lambda}{2}\|e\|_2^2 + \lambda\langle e, e^\star - e\rangle. \tag{A.24}$$

Also, from condition 1 in the lemma, we have

$$\langle \mathrm{sgn}((Wc)_B), (Wc^\star)_B - (Wc)_B\rangle = \langle (W^{-T}\bar{X}^T)_B\nu, (Wc^\star)_B - (Wc)_B\rangle = \langle \nu, \bar{X}_B(c^\star_B - c_B)\rangle = \langle\nu, \bar{X}(c^\star - c)\rangle - \langle\nu, \bar{X}_{A\setminus B}c^\star_{A\setminus B}\rangle - \langle\nu, \bar{X}_{A^c}c^\star_{A^c}\rangle. \tag{A.25}$$

Combining (A.22), (A.24), and (A.25), we have

$$\|c^\star\|_{1,W} + \frac{\lambda}{2}\|e^\star\|_2^2 \ge \|c_B\|_{1,W} + \frac{\lambda}{2}\|e\|_2^2 + \big(\|c^\star_{A\setminus B}\|_{1,W} - \langle\nu, \bar{X}_{A\setminus B}c^\star_{A\setminus B}\rangle\big) + \big(\|c^\star_{A^c}\|_{1,W} - \langle\nu, \bar{X}_{A^c}c^\star_{A^c}\rangle\big) + \langle\nu, \bar{X}(c^\star - c)\rangle + \lambda\langle e, e^\star - e\rangle. \tag{A.26}$$

Now, from condition 2 in the lemma, we have that

$$\langle\nu, \bar{X}(c^\star - c)\rangle + \lambda\langle e, e^\star - e\rangle = \langle\nu, \bar{X}(c^\star - c) + e^\star - e\rangle = 0, \tag{A.27}$$

where the last equality follows since $(c, e)$ and $(c^\star, e^\star)$ are both feasible for (WMC), and so $\bar{y} = \bar{X}c + e = \bar{X}c^\star + e^\star$. Furthermore, we have that

$$\|c^\star_{A\setminus B}\|_{1,W} - \langle\nu, \bar{X}_{A\setminus B}c^\star_{A\setminus B}\rangle = \|c^\star_{A\setminus B}\|_{1,W} - \langle (W^{-T}\bar{X}^T)_{A\setminus B}\nu, (Wc^\star)_{A\setminus B}\rangle \ge \|c^\star_{A\setminus B}\|_{1,W} - \|\bar{X}_{A\setminus B}^T\nu\|_{\infty,W^{-1}}\,\|c^\star_{A\setminus B}\|_{1,W} \ge 0, \tag{A.28}$$

where the last inequality follows from condition 3 in the lemma. Similarly, we have that

$$\|c^\star_{A^c}\|_{1,W} - \langle\nu, \bar{X}_{A^c}c^\star_{A^c}\rangle \ge \|c^\star_{A^c}\|_{1,W}\big(1 - \|\bar{X}_{A^c}^T\nu\|_{\infty,W^{-1}}\big). \tag{A.29}$$

Combining (A.26), (A.27), (A.28), and (A.29), and using the fact that $B$ is the support of $c$, we have

$$\|c^\star\|_{1,W} + \frac{\lambda}{2}\|e^\star\|_2^2 \ge \|c\|_{1,W} + \frac{\lambda}{2}\|e\|_2^2 + \|c^\star_{A^c}\|_{1,W}\big(1 - \|\bar{X}_{A^c}^T\nu\|_{\infty,W^{-1}}\big). \tag{A.30}$$

From condition 4 in the lemma, we know that $\|\bar{X}_{A^c}^T\nu\|_{\infty,W^{-1}} < 1$. Since $(c^\star, e^\star)$ is optimal, this implies that $\|c^\star_{A^c}\|_{1,W} = 0$, and therefore $(Wc^\star)_{A^c} = 0$, proving the result.

Lemma 4 provides conditions for any optimal solution of (WMC) to be manifold preserving.
Now, we construct a feasible solution $(c, e, \nu)$ to (WMC) that satisfies the conditions in Lemma 4 from an optimal solution $(c^\star, e^\star, \nu^\star)$ to the reduced problem (2.5) and its dual (2.6) as follows:

$$c_j = \begin{cases} c^\star_j & \text{for } j \text{ such that } x_j \in M_1\cap N, \\ 0 & \text{otherwise,}\end{cases} \tag{A.31}$$

$$e = \bar{y} - \bar{X}c = \bar{y} - \bar{X}_{\bar{S}}c^\star = e^\star, \quad\text{and} \tag{A.32}$$

$$\nu = \nu^\star. \tag{A.33}$$

We observe that $(c, e, \nu)$ is feasible for (WMC) by construction. Let $T$ be the set of indices of data samples in the set $M_1\cap N$, and let $B$ be the support of $c^\star$. From the KKT conditions for the reduced problem (2.5) and its dual (2.6), we have

$$\nu^\star = \lambda e^\star \implies \nu = \lambda e, \quad\text{and} \tag{A.34}$$

$$W^{-T}\bar{X}_{\bar{S}}^T\nu^\star = \mathrm{sgn}(Wc^\star) \implies (W^{-T}\bar{X}^T)_B\nu = \mathrm{sgn}((Wc)_B). \tag{A.35}$$

Thus, $(c, e, \nu)$ satisfies the first two conditions in Lemma 4. Moreover, we have that

$$\|\bar{X}_{A\setminus B}^T\nu\|_{\infty,W^{-1}} \le \|\bar{X}_{A}^T\nu\|_{\infty,W^{-1}} = \|\bar{X}_{\bar{S}}^T\nu^\star\|_{\infty,W^{-1}} \le 1, \tag{A.36}$$

which shows that condition 3 from Lemma 4 is satisfied by $(c, e, \nu)$. Finally, we observe that condition 4 in Lemma 4 can be equivalently written as

$$|\langle \bar{x}_j, \nu\rangle| < w_j \quad \forall x_j\in M\setminus M_1. \tag{A.37}$$

Thus, if (A.37) is satisfied, then all conditions in Lemma 4 are satisfied, and every optimal solution to (WMC) is manifold preserving.

A.3 Proof of Lemma 2

Proof of Lemma 2. We prove the result using geometric properties of the manifold and the location of the data samples; the proof introduces several new quantities and notations. Let $\kappa$ be the curvature of the manifold $\mathcal{M}_1$ at $y$. From the definition of the curvature of the smooth manifold, $1/\kappa$ is the radius of the largest internal tangent sphere to the manifold $\mathcal{M}_1$ at $y$. Let this tangent sphere be denoted by $T$ and let $z$ be the center of the $(D-1)$-dimensional internal tangent sphere. Consider a point of intersection of the sphere $B$ with $T$ and denote it by $u$. Next, we project the point $u$ onto the line connecting $y$ and $z$, and denote the projected point by $v$. Then the points $z$, $u$, and $v$ form the vertices of a right-angled triangle, and we have

$$\|z - u\|_2^2 = \|z - v\|_2^2 + \|v - u\|_2^2. \tag{A.38}$$

We now determine bounds on each of the quantities in (A.38). Since $z$ is the center of the sphere $T$ of radius $1/\kappa$, and $u$ lies on the surface of the sphere, we have

$$\|z - u\|_2^2 = \frac{1}{\kappa^2}. \tag{A.39}$$

Now consider the three points $y$, $u$, and $v$. Since $v$ is the projection of the point $u$ onto the line connecting $y$ and $z$, the three points $y$, $u$, and $v$ also form a right-angled triangle. Thus, we have that

$$\|u - v\|_2^2 \le \|u - y\|_2^2 = \zeta^2, \tag{A.40}$$

where the last equality follows since $u$ lies on the boundary of the sphere $B$ with radius $\zeta$ centered at $y$. Finally, we bound $\|z - v\|_2$. Since $v$ is the projection of the point $u$ onto the line connecting $y$ and $z$, the points $z$, $v$, and $y$ are collinear and lie on the single line connecting $y$ and $z$. We therefore have $\|z - v\|_2 = \|y - z\|_2 - \|v - y\|_2$. Furthermore, since $z$ is the center of the tangent sphere to the manifold $\mathcal{M}_1$ at $y$, we have $\|y - z\|_2 = \frac{1}{\kappa}$, and so

$$\|z - v\|_2 = \frac{1}{\kappa} - \|v - y\|_2. \tag{A.41}$$

Proposition 5. For the point $v$ as defined above, we have $\mathrm{dist}(y, S) \le \|y - v\|_2$.

Proof. Let $B_T$ be the set consisting of $u$ and $d_1$ additional unique points of intersection of the sphere $B$ with $T$. We have that $\dim(\mathrm{aff}\{x : x\in B_T\}) \le d_1$. Since each point in the set $B_T$ lies on the boundaries of both $B$ and $T$, each point $x\in B_T$ is equidistant from both $y$ and $z$. Thus, $u$ and every point $x\in B_T$ project onto the same point $v$ on the line connecting $y$ and $z$. Furthermore, we have that

$$\|y - v\|_2 = \mathrm{dist}(y, \mathrm{aff}\{x : x\in B_T\}), \tag{A.42}$$

i.e., $v$ is the projection of $y$ onto the affine subspace $\mathrm{aff}\{x : x\in B_T\}$.
From the definition of the internal tangent sphere $T$, the affine subspace $\operatorname{aff}\{x:x\in B_T\}$ intersects the manifold $M_1$ only at points outside the sphere $B$. Thus, $\operatorname{aff}\{x:x\in B_T\}$ is an affine subspace of dimension $d_1$ spanned by points that do not lie in the interior of $B$. From the definition of $S$ (see Definition 3) and Assumption 2, we note that $S$ is the affine subspace (of dimension $d_1$) nearest to $y$ among those spanned by points inside or on the sphere $B$. Thus, we have that
$$
\operatorname{dist}(y,S)\le\operatorname{dist}\big(y,\operatorname{aff}\{x:x\in B_T\}\big).\qquad\text{(A.43)}
$$
Combining (A.42) and (A.43) proves the result.

From (A.41) and Proposition 5, we bound $\|z-v\|_2$ as
$$
\|z-v\|_2\le\frac{1}{\kappa}-\operatorname{dist}(y,S).\qquad\text{(A.44)}
$$
Combining (A.38), (A.39), (A.40) and (A.44), we have that
$$
\frac{1}{\kappa^2}\le\zeta^2+\Big(\frac{1}{\kappa}-\operatorname{dist}(y,S)\Big)^2,\qquad\text{(A.45)}
$$
or equivalently,
$$
\Big(\frac{1}{\kappa}-\operatorname{dist}(y,S)\Big)^2\ge\frac{1}{\kappa^2}-\zeta^2.\qquad\text{(A.46)}
$$
Since $\operatorname{reach}(M_1)\ge\zeta$ and, from Definition 8, $\frac{1}{\kappa}\ge\operatorname{reach}(M_1)$, we have $\frac{1}{\kappa}\ge\zeta$, and thus
$$
\frac{1}{\kappa}-\operatorname{dist}(y,S)\ge\sqrt{\frac{1}{\kappa^2}-\zeta^2}.\qquad\text{(A.47)}
$$
Rearranging the terms proves the result.

B Details of Experiments on Real Data

B.1 CIFAR Datasets and CLIP Feature Extraction

CIFAR Datasets. CIFAR consists of 50,000 training and 10,000 test color images of size 32x32. The images are equally divided into 10, 20, and 100 classes, resulting in the CIFAR-10, CIFAR-20, and CIFAR-100 datasets, respectively. The 20 classes of CIFAR-20 are obtained by merging the 100 classes of CIFAR-100 into disjoint groups.

CLIP Pre-Trained Model. Contrastive Language-Image Pre-training (CLIP) [63] is an efficient tool for image representation learning that jointly trains an image encoder and a text encoder to correctly predict image-text pairings. We use the ViT-L/14 [66] architecture for the image encoder; ViT-L/14 refers to a Large Vision Transformer model whose input images are divided into patches of size 14x14. See [67] for more information on the definition of Large models. We use the features extracted from the images in the CIFAR datasets by the CLIP pre-trained model with the ViT-L/14 architecture*.

*We use the CLIP package available at https://github.com/openai/CLIP?tab=readme-ov-file

B.2 Details of Experimental Setup

We include details of the experiments performed in Section 3.2. Note that the results reported in Table 1 for the SSC, EnSC, CPP, and TEMI methods are taken from [56]; please refer to [56, Section 4] for the corresponding experimental details.

L-WMC and E-WMC. We perform the steps in Algorithm 1 to cluster the data using L-WMC and E-WMC. Details of the pre-training using CLIP (Step 1 of Algorithm 1) are provided in Appendix B.1. To construct the representation matrix (Step 2), we define the models L-WMC and E-WMC for each input data sample, with the weight matrix W set as described in Section 3.2.2. We use the efficient active-set solver of [18] to solve the optimization problem. The experiments are performed on a machine with an Intel(R) Xeon(R) Gold 6130 CPU operating at 2.10 GHz and with 37 GB of RAM. We have provided our code in the supplementary material. We use a grid search over the following parameter values: η ∈ {1, 20, 100, 400} and λ ∈ {20, 50}·λ0, where λ0 is the smallest value of λ that generates a non-trivial (non-zero) solution. We report the best accuracy results in Table 1. Furthermore, Table 2 provides the values of the parameters λ and η corresponding to the clustering results reported in rows 1 (L-WMC) and 2 (E-WMC) of Table 1.

Method | CIFAR-10 (λ, η) | CIFAR-20 (λ, η) | CIFAR-100 (λ, η)
L-WMC  | 50, 20          | 50, 400         | 50, 20
E-WMC  | 20, 400         | 20, 400         | 50, 400

Table 2: Values of the parameters corresponding to the clustering results reported in Table 1 for L-WMC and E-WMC.
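For concreteness, the CLIP feature-extraction step described in Appendix B.1 (Step 1 of Algorithm 1) can be carried out along the following lines. This is a minimal sketch rather than the exact script used for the reported results; it assumes the openai/CLIP package referenced above together with PyTorch and torchvision, and the batch size, device choice, and CIFAR-10 split are illustrative.

```python
import clip
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the ViT-L/14 image encoder and its preprocessing transform.
model, preprocess = clip.load("ViT-L/14", device=device)

dataset = CIFAR10(root="./data", train=True, download=True, transform=preprocess)
loader = DataLoader(dataset, batch_size=256, shuffle=False, num_workers=4)

features, labels = [], []
with torch.no_grad():
    for images, targets in loader:
        # encode_image returns one feature vector per input image.
        feats = model.encode_image(images.to(device))
        features.append(feats.float().cpu())
        labels.append(targets)

features = torch.cat(features)   # N x d matrix of CLIP features
labels = torch.cat(labels)
```

The resulting feature matrix is then used as the input data for L-WMC, E-WMC, and SMCE.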
SMCE. The MATLAB code for the ADMM algorithm that solves SMCE was provided by the authors of [58]; we implemented this ADMM algorithm in Python and refer to the resulting method as SMCE in this paper. As discussed in Section 3.2.2, we apply SMCE to the pre-trained features generated using CLIP. To generate the representation matrix, SMCE solves the following optimization problem for each input sample $x_i$:
$$
\min_{c_i}\;\gamma\,\|W_i c_i\|_1+\tfrac{1}{2}\|Q_i c_i\|_2^2\quad\text{s.t.}\quad \mathbf{1}^T c_i=1,\qquad\text{(B.1)}
$$
where $W_i$ is a diagonal matrix whose $j$-th diagonal entry is
$$
[W_i]_{jj}=w_j=\frac{\|x_j-x_i\|_2}{\sum_{k\neq i}\|x_k-x_i\|_2},
$$
and $Q_i=\Big[\frac{x_j-x_i}{\|x_j-x_i\|_2}\Big]_{j\neq i}$. Note that the matrix $W_i$ is the same as the matrix W defined for our model L-WMC. Furthermore, since the $\ell_1$ regularizer does not select points that are very far from the given data point $x_i$, [58] considers only the K nearest neighbors of $x_i$ in the algorithm, where K is chosen a priori. We use a grid search to choose the values of the hyperparameters γ and K: γ ∈ {10, 20, 50} and K ∈ {10, 20, 50, 100}. Table 1 provides the best clustering result achieved by SMCE for each of the three datasets. In Table 3, we provide the parameter values corresponding to the clustering results reported for SMCE in Table 1.

Method | CIFAR-10 (γ, K) | CIFAR-20 (γ, K) | CIFAR-100 (γ, K)
SMCE   | 10, 20          | 50, 20          | 20, 20

Table 3: Values of the parameters corresponding to the clustering results reported in Table 1 for SMCE.

B.3 Constructing the Affinity Matrix

The first step in using spectral clustering to cluster the input data samples is to construct an affinity matrix that indicates pairwise similarities between the data samples. Ideally, data samples from the same cluster should be highly connected, while those from different clusters should have no connections. For models based on the self-expressiveness property, the construction of the affinity matrix involves post-processing the representation matrix C. Several post-processing strategies have been used in the literature to construct a symmetric affinity matrix from the representation matrix. We describe three commonly used strategies below; a minimal implementation sketch follows the list.

Normalize (N): This strategy normalizes each column of the representation matrix C to have unit ℓ2 norm. It can be helpful when the input data samples have different norms.

Symmetric (S): This is one of the most commonly used strategies; it constructs the affinity matrix A as A = ½(|C| + |C|^T).

k-NN (K): This strategy finds the k nearest neighbors of each data sample. Note that the representation matrix can have dense connections. When the k-NN strategy is applied to a column of C, only the k largest entries in that column are preserved while the rest are set to 0. Thus, the affinity matrix resulting from this strategy has at most k non-zero entries in each column. This strategy can be helpful when C contains many nonzero entries of very small magnitude.

We use N, S, and K to denote these three strategies from now on. Moreover, we use SNK to denote that the symmetric, normalize, and k-NN strategies are applied one after the other (and in that order) to the representation matrix C to construct the affinity matrix. Since no universally best strategy exists, we construct affinity matrices from all permutations of all non-empty subsets of the three strategies. Thus, there are 15 possible ways to construct an affinity matrix from C: N, S, K, NK, KN, SN, NS, SK, KS, SNK, SKN, KSN, KNS, NSK, NKS.
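The three post-processing strategies above are simple to implement. The following sketch (assuming NumPy; the function names, the choice of k, and the final absolute value and re-symmetrization are our illustrative choices, not taken from the paper's code) applies a strategy string such as "SNK" to a representation matrix C from left to right.

```python
import numpy as np

def normalize_cols(C):
    """N: scale each column of C to have unit l2 norm."""
    norms = np.linalg.norm(C, axis=0, keepdims=True)
    return C / np.maximum(norms, 1e-12)

def symmetrize(C):
    """S: A = (|C| + |C|^T) / 2."""
    return 0.5 * (np.abs(C) + np.abs(C).T)

def knn_sparsify(C, k):
    """K: keep only the k largest-magnitude entries of each column of C."""
    A = np.zeros_like(C)
    for j in range(C.shape[1]):
        keep = np.argsort(-np.abs(C[:, j]))[:k]
        A[keep, j] = C[keep, j]
    return A

def build_affinity(C, strategy="SNK", k=10):
    """Apply the listed strategies from left to right, e.g. 'SNK' = S, then N, then K."""
    ops = {"N": normalize_cols, "S": symmetrize, "K": lambda M: knn_sparsify(M, k)}
    A = C.copy()
    for s in strategy:
        A = ops[s](A)
    # Spectral clustering expects a nonnegative, symmetric affinity; taking the
    # absolute value and re-symmetrizing at the end is an assumption on our part.
    A = np.abs(A)
    return 0.5 * (A + A.T)
```

For example, build_affinity(C, "SNK") corresponds to the SNK combination reported for CIFAR-20 in Table 4 (the value of k used for those results is not specified here).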
After constructing the affinity matrix using each strategy or combination of strategies, we use k-means to cluster the data. For the clustering results reported in Table 1 for L-WMC, E-WMC, and SMCE, we provide the corresponding affinity-construction strategies in Table 4.

Method/Dataset | CIFAR-10 | CIFAR-20 | CIFAR-100
L-WMC          | N        | SNK      | N
E-WMC          | NK       | SNK      | N
SMCE           | S        | SNK      | NS

Table 4: Strategies used to construct the affinity matrix from the representation matrix, corresponding to the clustering results reported in Table 1 for L-WMC, E-WMC, and SMCE.

We observed that the CLIP feature vectors of the images in the CIFAR datasets have different ℓ2 norms. Thus, the columns of the resulting representation matrix C also have different ℓ2 norms. In this case, it seems intuitive to normalize the columns of C while constructing the affinity matrix. Indeed, from Table 4, we notice that the best clustering accuracy was obtained when the affinity construction involved normalizing the representation matrix C.

C Review of Manifold Clustering Methods with Theoretical Guarantees

Distance between Tangent Spaces. The work of [68] provides an algorithm that (i) fits an affine subspace to a local neighborhood of each point in the dataset and (ii) computes the similarity between two points based on the distance between the two subspaces as well as the distance between the two points. Further, they provide an upper bound on the Euclidean distances between data samples from different clusters under which perfect clustering is achieved with high probability. Interestingly, this approach was later extended to allow for faster computation based on prototypes [69] and for the ambient space being a Riemannian manifold [70, 71]. However, their theoretically guaranteed algorithms require knowledge of the intrinsic dimension of each manifold (or, implicitly, a model-selection parameter to estimate the dimension), which in practice is typically not known a priori and is hard to estimate.

Longest-Leg Path Distance. The work of [72] provides an algorithm that computes distances in a different way: for every two points x, y in a given set of data points, the distance between x and y is the minimum, over all paths connecting x and y, of the longest leg of the path. The similarity can then be computed by applying a kernel function to these distances. They provide theoretical guarantees that allow for a general number of manifolds, as well as algorithms that approximate the distances hierarchically. Nonetheless, these approaches have only been tested on smaller-scale datasets (i.e., with either the ambient dimension, the number of data samples, or the number of clusters being small), while this paper provides both theoretical guarantees and clustering performance on large-scale datasets of natural images. A direction of independent interest to the community is to understand the computational behavior of these methods on such datasets, which is not pursued in this paper.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The contributions are clearly stated in the introduction, and Sections 2 and 3 contain the main results.
Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We discuss the limitations of our work in the last section (Section 4) of the paper.
Guidelines:
- The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: The assumptions needed for each theoretical result are included in the statement of the corresponding lemma, and a complete proof of each result is given in the appendix.
Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We have included details of the experimental setup. The code is also provided to reproduce the reported results.
Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
  (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
  (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
  (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
  (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We have provided access to our code to reproduce the results. In the case of synthetic data, we have included the code to reproduce the data.
Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer.
- Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We have included the details of the experimental setup. Experimental results from other works are also included in the paper; in such cases, we have included references to the source.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: For the experiments on synthetic data, we report error bars and the experimental settings for the random data.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We have included the details of the computational resources in the appendix.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: We have conformed to the code of ethics.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: Our work does not have a direct societal impact.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The models and the data used in the paper pose no such risks.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: When existing models or results were used, we have cited the original work.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: We have included the code to reproduce the results in the paper, along with the details for executing the code, alongside the code submission.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: Our work does not involve research with human subjects or crowdsourcing.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: Our work does not involve research with human subjects or crowdsourcing.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.