Published as a conference paper at ICLR 2021

A CRITIQUE OF SELF-EXPRESSIVE DEEP SUBSPACE CLUSTERING

Benjamin D. Haeffele, Mathematical Institute for Data Science, Johns Hopkins University, Baltimore, MD, USA. bhaeffele@jhu.edu
Chong You, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA. cyou@berkeley.edu
René Vidal, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA

ABSTRACT

Subspace clustering is an unsupervised clustering technique designed to cluster data that is supported on a union of linear subspaces, with each subspace defining a cluster with dimension lower than the ambient space. Many existing formulations for this problem are based on exploiting the self-expressive property of linear subspaces, where any point within a subspace can be represented as a linear combination of other points within the subspace. To extend this approach to data supported on a union of non-linear manifolds, numerous studies have proposed learning an embedding of the original data using a neural network which is regularized by a self-expressive loss function on the data in the embedded space, so as to encourage a union of linear subspaces prior on the embedded data. Here we show that there are a number of potential flaws with this approach which have not been adequately addressed in prior work. In particular, we show that the model formulation is often ill-posed in that it can lead to a degenerate embedding of the data, which need not correspond to a union of subspaces at all and is poorly suited for clustering. We validate our theoretical results experimentally and also repeat prior experiments reported in the literature, where we conclude that a significant portion of the previously claimed performance benefits can be attributed to an ad-hoc post-processing step rather than the deep subspace clustering model.

1 INTRODUCTION AND BACKGROUND

Subspace clustering is a classical unsupervised learning problem, where one wishes to segment a given dataset into a prescribed number of clusters, and each cluster is defined as a linear (or affine) subspace with dimension lower than the ambient space. There have been a wide variety of approaches proposed in the literature to solve this problem (Vidal et al., 2016), but a large family of state-of-the-art approaches are based on exploiting the self-expressive property of linear subspaces. That is, if a point lies in a linear subspace, then it can be represented as a linear combination of other points within the subspace. Based on this fact, a wide variety of methods have been proposed which, given a dataset $Z \in \mathbb{R}^{d \times N}$ of $N$ $d$-dimensional points, find a matrix of coefficients $C \in \mathbb{R}^{N \times N}$ by solving the problem

$$\min_{C}\; F(Z, C) \;\triangleq\; \tfrac{1}{2}\|ZC - Z\|_F^2 + \lambda\,\theta(C) \;=\; \tfrac{1}{2}\big\langle Z^\top Z,\; (C - I)(C - I)^\top \big\rangle + \lambda\,\theta(C). \qquad (1)$$

Here, the first term $\tfrac{1}{2}\|ZC - Z\|_F^2$ captures the self-expressive property by requiring every data point to represent itself as an approximate linear combination of other points, i.e., $Z_i \approx Z C_i$, where $Z_i$ and $C_i$ are the $i$th columns of $Z$ and $C$, respectively. The second term, $\theta(C)$, is some regularization function designed to encourage each data point to only select other points within the correct subspace in its representation and to avoid trivial solutions (such as $C = I$).
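Before turning to how $C$ is used for clustering, the Gram-matrix identity on the right-hand side of (1) can be checked with a few lines of NumPy. The Frobenius-norm regularizer used below is chosen only because it gives a closed-form minimizer; the dimensions and the value of $\lambda$ are illustrative and are not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, lam = 5, 30, 0.1
Z = rng.standard_normal((d, N))
C = rng.standard_normal((N, N))

# Identity from (1): 0.5*||ZC - Z||_F^2 == 0.5*<Z^T Z, (C - I)(C - I)^T>
lhs = 0.5 * np.linalg.norm(Z @ C - Z, "fro") ** 2
rhs = 0.5 * np.trace(Z.T @ Z @ (C - np.eye(N)) @ (C - np.eye(N)).T)
assert np.isclose(lhs, rhs)

# Closed-form minimizer of 0.5*||ZC - Z||_F^2 + 0.5*lam*||C||_F^2
# (a Frobenius-norm regularizer; SSC/LRR instead use sparse or low-rank penalties)
G = Z.T @ Z                                   # Gram matrix: the only way the data enters
C_star = np.linalg.solve(G + lam * np.eye(N), G)
```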
Once the C matrix has been solved for, one can then define a graph affinity between data points, typically based on the magnitudes of the entries of C, and use an appropriate graph-based clustering method (e.g., spectral clustering (von Luxburg, 2007)) to produce the final clustering of the data points. One of the first methods to utilize this approach was Sparse Subspace Clustering (SSC) (Elhamifar & Vidal, 2009; 2013), where θ takes the form θSSC(C) = C 1 + δ(diag(C) = 0), with 1 denoting the ℓ1 norm and δ an indicator function which takes value if an element of the diagonal of C is non-zero and 0 otherwise. By regularizing C to be sparse, a point must represent itself using the smallest number of other points within the dataset, which in turn ideally requires a point to only select other points within its own subspace in the representation. Likewise other variants, with Low-Rank Representation (LRR) (Liu et al., 2013), Low-Rank Subspace Clustering (LRSC) (Vidal & Favaro, 2014) and Elastic-net Subspace Clustering (En SC) (You et al., 2016) being wellknown examples, take the same form as (1) with different choices of regularization. For example, θLRR(C) = C and θEn SC(C) = C 1 + τ C 2 F + δ(diag(C) = 0), where denotes the nuclear norm (sum of the singular values). A significant advantage of the majority of these methods is that it can be proven (typically subject to some technical assumptions regarding the angles between the underlying subspaces and the distribution of the sampled data points within the subspaces) that the optimal C matrix in (1) will be correct in the sense that if Ci,j is non-zero then the ith and jth columns of Z lie in the same linear subspace (Soltanolkotabi & Cand es, 2012; Lu et al., 2012; Elhamifar & Vidal, 2013; Soltanolkotabi et al., 2014; Wang et al., 2015; Wang & Xu, 2016; You & Vidal, 2015a;b; Yang et al., 2016; Tsakiris & Vidal, 2018; Li et al., 2018; You et al., 2019; Robinson et al., 2019), which has led to these methods achieving state-of-the-art performance in many applications. 1.1 SELF-EXPRESSIVE DEEP SUBSPACE CLUSTERING Although subspace clustering techniques based on self-expression display strong empirical performance and provide theoretical guarantees, a significant limitation of these techniques is the requirement that the underlying dataset needs to be approximately supported on a union of linear subspaces. This has led to a strong motivation to extend these techniques to more general datasets, such as data supported on a union of non-linear low-dimensional manifolds. From inspection of the right side of (1), one can observe that the only dependence on the data Z comes in the form of the Gram matrix Z Z. As a result, self-expressive subspace clustering techniques are amendable to the kerneltrick , where instead of taking an inner product kernel between data points, one can instead use a general kernel κ( , ) (Patel & Vidal, 2014). Of course, such an approach comes with the traditional challenge of how to select an appropriate kernel so that the embedding of the data in the Hilbert space associated with the choice of kernel results in a union of linear subspaces. The first approach to propose learning an appropriate embedding of an initial dataset X Rdx N (which does not necessarily have a union of subspaces structure) was given by Patel et al. (2013; 2015) who proposed first projecting the data into a lower dimensional space via a learned linear projector, Z = Pl X, where Pl Rd dx (d < dx) is also optimized over in addition to C in (1). 
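Whichever regularizer produces $C$ (and whether $C$ is computed from the raw data, a kernel embedding, or a learned projection as above), the final clustering step described earlier is the same: spectral clustering on a graph affinity built from the magnitudes of $C$. A minimal sketch of that step, assuming scikit-learn is available and using the common symmetrization $|C| + |C|^\top$:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_coefficients(C, n_clusters):
    """Turn a self-expressive coefficient matrix into cluster labels."""
    A = np.abs(C) + np.abs(C).T          # symmetric, non-negative affinity from |C|
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            random_state=0)
    return sc.fit_predict(A)

# Example usage: C from any of the solvers above (here a random placeholder).
rng = np.random.default_rng(0)
C = rng.standard_normal((30, 30))
labels = cluster_from_coefficients(C, n_clusters=2)
```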
To ensure that sufficient information about the original data X is preserved in the low-dimensional embedding Z, the authors further required that the linear projector satisfy the constraint that Pl P l = I and added an additional term to the objective with form X P l Pl X 2 F . However, since the projector is linear, the approach is not well suited for nonlinear manifolds, unless it is augmented with a kernel embedding, which again requires choosing a suitable kernel. More recently, given the success of deep neural networks, a large number of studies Peng et al. (2017); Ji et al. (2017); Zeng et al. (2019b;a); Xie et al. (2020); Sun et al. (2019); Li et al. (2019); Yang et al. (2019); Jiang et al. (2019); Tang et al. (2018); Kheirandishfard et al. (2020b); Zhou et al. (2019); Jiang et al. (2018); Abavisani & Patel (2018); Zhou et al. (2018); Zhang et al. (2018; 2019b;a); Kheirandishfard et al. (2020a) have attempted to learn an appropriate embedding of the data (which ideally would have a union of linear subspaces structure) via a neural network, ΦE(X, We), where We denotes the parameters of a network mapping defined by ΦE, which takes a dataset X Rdx N as input. In an attempt to encourage the embedding of the data, ΦE(X, We), to have this union of subspaces structure, these approaches minimize a self-expressive loss term, with form given in (1), on the embedded data, and a large majority of these proposed techniques can be Published as a conference paper at ICLR 2021 SEDSC Auto-Encoder Original Data Geometry Ideal Embedded Geometry Actual Embedded Geometry (Dataset/Batch Normalization) Figure 1: Illustration of our theoretical results. The goal of the SEDSC model is to train a network (typically an auto-encoder) to map data from a union of non-linear manifolds (Left) to a union of linear subspaces (Center). However, we show that for many of the formulations that have been proposed in the literature the global optimum of the model will have a degenerate geometry in the embedded space. For example, in the Dataset and Channel/Batch normalization schemes Theorem 2 shows that the globally optimal geometry will have all points clustered near the origin with the exception of two points, which will be copies of each other (to within a sign-flip) (Right). described by the general form: min We,C γF(Z, C) + g(Z, X, C) subject to Z = ΦE(X, We) (2) where g is some function designed to discourage trivial solutions (for example ΦE(X, We) = 0) and γ > 0 is some hyper-parameter to balance the terms. Several different choices of g have been proposed in the literature. The first is to place some form of normalization directly on Z. For example, Peng et al. (2017) propose an Instance Normalization regularization, g(Z, X, C) = PN i=1(Z i Zi 1)2, which attempts to constrain the norm of the embedded data points to be 1. Likewise, one could also consider Dataset Normalization schemes, which bound the norm of the entire embedded representation Z 2 F τ or Channel/Batch Normalization schemes, which bound the norm of a channel of the embedded representation (i.e., a row of Z), Zi 2 F τ, i. We note that this is quite similar to the common Batch Norm operator (Ioffe & Szegedy, 2015) used in neural network training which attempts to constrain each row of Z to have zero mean and constant norm. 
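The normalization choices above differ only in which norm of $Z \in \mathbb{R}^{d \times N}$ they control. A small sketch of each, written either as a penalty or as a rescaling (the constant $\tau$, and whether one penalizes or projects, are implementation choices rather than anything prescribed by the cited works):

```python
import numpy as np

def instance_norm_penalty(Z):
    """Instance-normalization penalty in the style of Peng et al. (2017):
    sum_i (Z_i^T Z_i - 1)^2 pushes every column (data point) toward unit norm."""
    col_sq_norms = np.sum(Z ** 2, axis=0)
    return np.sum((col_sq_norms - 1.0) ** 2)

def project_dataset_norm(Z, tau):
    """Dataset normalization: rescale so that ||Z||_F^2 = tau over the whole dataset."""
    return Z * np.sqrt(tau) / np.linalg.norm(Z, "fro")

def project_channel_norm(Z, row_sq_norm):
    """Channel/batch normalization: rescale each row (channel) of Z to a fixed
    squared norm (tau, or tau/d, depending on the convention adopted)."""
    row_norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z * np.sqrt(row_sq_norm) / row_norms
```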
Another popular form of g is to also train a decoding network ΦD( , Wd) with parameters Wd to map the self-expressive representation, ΦE(X, We)C, back to the original data to ensure that sufficient information is preserved in the self-expressive representation to recover the original data. We will refer to this as Autoencoder Regularization. This idea is essentially a generalization of the previously discussed work, which considered constrained linear encoder/decoder mappings (Patel et al., 2013; Patel & Vidal, 2014; Patel et al., 2015), to non-linear autoencodering neural networks and was first proposed by the authors of Ji et al. (2017). The problem takes the general form: min We,Wd,C γF(Z, C) + ℓ(X, ΦD(ZC, Wd)) subject to Z = ΦE(X, We), (3) where the first term is the standard self-expressive subspace clustering loss applied to the embedded representation, and the second term is a standard auto-encoder loss, with ℓtypically chosen to be the squared loss. Note that here both the encoding/decoding network and the optimal self-expression encoding, C, are trained jointly, and once problem (3) is solved one can use the recovered C matrix directly for clustering. Using the general formulation in (2) and the popular specific case in (3), Self-Expressive Deep Subspace Clustering (SEDSC) has been applied to a variety of applications, but there is relatively little that it known about it from a theoretical standpoint. Initial formulations for SEDSC were guided by the intuition that if the dataset is drawn from a union of linear subspaces, then solving problem (1) is known to induce desirable properties in C for clustering. By extension one might assume that if one also optimizes over the geometry of the learned embedding (Z) this objective might induce a desirable geometry in the embedded space (e.g., a union of linear subspaces). However, a vast majority of the prior theoretical analysis for problems of the form in (1) only considers the case where the data is held fixed and analyzes the properties of the optimal C matrix. Due to the well-known fact that neural networks are capable of producing highly-expressive mapping functions (and hence a network could produce many potential values for Z), the use of a model such as (2)/(3) is essentially using (1) as a regularization function on Z to encourage a union of subspaces geometry. To Published as a conference paper at ICLR 2021 date, however, models such as (2)/(3) have been guided largely by intuition and significant questions remain regarding what type of data geometry is encouraged by F(Z, C) when one optimizes over both the encoding matrix, C, and the network producing the embedded data representation, Z. 1.2 PAPER CONTRIBUTIONS Here we explore these questions via theoretical analysis where we show that the use of F(Z, C) as a regularization function when learning a kernel from the data in an attempt to promote beneficial data geometries in the embedded space, as is done in (2), is largely insufficient in the sense that the optimal data geometries are trivial and not conducive to successful clustering in most cases. Specifically, we first note a basic fact that the Autoencoder Regularization formulation in (3) is typically ill-posed for most commonly used networks if constraints are not placed on the magnitude of the embedded data, ΦE(X, We), either through regularization/constraints on the autoencoder weights/architecture or an explicit normalization of ΦE(X, We), such as in the Instance/Batch/Dataset Normalization schemes. 
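To make the moving parts of (3) concrete, the following is a minimal PyTorch sketch of an autoencoder-regularized SEDSC model: an encoder, a decoder, and a full $N \times N$ coefficient matrix $C$ trained jointly. The layer sizes, the $\ell_1$ choice of $\theta$, the soft penalty used for the $\operatorname{diag}(C)=0$ constraint, and the row-vector data convention are illustrative assumptions, not the exact setup of any cited work:

```python
import torch
import torch.nn as nn

class SEDSC(nn.Module):
    """Minimal sketch of the autoencoder-regularized model in eq. (3).

    Rows of X are data points, so C @ Z plays the role of ZC in the
    paper's column convention."""
    def __init__(self, dx, d, N, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dx, hidden), nn.ReLU(),
                                     nn.Linear(hidden, d))
        self.decoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                     nn.Linear(hidden, dx))
        self.C = nn.Parameter(1e-4 * torch.randn(N, N))   # self-expression coefficients

    def forward(self, X):             # X: (N, dx), the whole dataset at once
        Z = self.encoder(X)           # (N, d) embedded representation
        ZC = self.C @ Z               # self-expressive reconstruction of Z
        X_hat = self.decoder(ZC)      # decode the self-expressed representation
        return Z, ZC, X_hat

def sedsc_loss(X, Z, ZC, X_hat, C, gamma=1.0, lam=1.0, mu=1.0):
    self_express = 0.5 * (ZC - Z).pow(2).sum() + lam * C.abs().sum()  # F(Z, C), theta = l1
    recon = 0.5 * (X_hat - X).pow(2).sum()                            # autoencoder loss
    diag_pen = mu * C.diagonal().pow(2).sum()                         # soft diag(C) = 0
    return gamma * self_express + recon + diag_pen

X = torch.randn(100, 32)              # placeholder dataset
model = SEDSC(dx=32, d=8, N=100)
Z, ZC, X_hat = model(X)
loss = sedsc_loss(X, Z, ZC, X_hat, model.C)
loss.backward()                       # gradients flow to the encoder, decoder, and C
```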
Then, even assuming that the embedded representation has been suitably normalized, we show that the optimal embedded data geometry encouraged by F(Z, C) is trivial in various ways, which will depend on how the data is normalized. We illustrate our theoretical predictions with experiments on both real and synthetic data. Further, we show experimentally that much of the claimed performance benefit of the SEDSC model reported in previous work can be attributed to an ad-hoc post-processing of the C matrix first proposed in Ji et al. (2017). Notation. We will denote matrices with capital boldfaced letters, Z, vectors which are not rows/columns of a larger matrix with lower-case boldfaced letters, z, and sets with calligraphic letters, Z. The ith row of a matrix will be denoted with a superscript, Zi; the ith column of a matrix will be denoted with a subscript, Zi; the (i, j)th entry of a matrix will be denoted as Zi,j; and the ith entry of a vector will be denoted as zi. We will denote the minimum singular value of a matrix Z as σmin(Z), and we will denote the nuclear, Frobenius, ℓp, and Schatten-p norms1 for a matrix/vector as Z , Z F , Z p, and Z Sp respectively. δ(cnd) denotes an indicator function with value 0 if condition cnd is true and otherwise. 2 THEORETICAL ANALYSIS 2.1 BASIC SCALING ISSUES We begin our analysis of the SEDSC model by considering the most popular formulation, which is to employ Autoencoder Regularization as in (3). Specifically, we note that without any regularization on the autoencoder network parameters (We, Wd) or any normalization placed on the embedded representation, ΦE(X, We), the formulation in (3) is often ill-posed in the sense that the value of F(ΦE(X, We), C) can be made arbitrarily small without changing the value of the autoencoder loss by simply scaling-down the weights in the encoding network and scaling-up the weights in the decoding network in a way which doesn t change the output of the autoencoder but reduces the magnitude of the embedded representation. As we will see, a sufficient condition for this to be possible is when the non-linearity in the final layer of the encoder is positively-homogeneous2. We further note that most commonly used non-linearities in neural networks are positively homogenous, with Rectified Linear Units (Re LUs), leaky Re LUs, and max-pooling being common examples. As a result, most autoencoder architectures employed for SEDSC will require some form of regularization or normalization for the F term to have any impact on the geometry of the embedded representation (other than trivially becoming smaller in magnitude), though many proposed formulations do not take this issue into account. As a basic example, consider fully-connected single-hidden-layer networks for the encoder/decoder, ΦE(X, We) = W2 e(W1 e X)+ and ΦD(Z, Wd) = W2 d(W1 d Z)+, where (x)+ = max{x, 0} denotes the Rectified Linear Unit (Re LU) non-linearity applied entry-wise. Then, note that because the Re LU function is positively homogeneous of degree 1 for any α 0 one has ΦE(X, αWe) = (αW2 e)(αW1 e X)+ = α2W2 e(W1 e X)+ = α2ΦE(X, We) and similarly for ΦD. As a result one can scale We by any α > 0 and Wd by α 1 without changing the output of 1Recall, the Schatten-p norm of a matrix C is defined as the ℓp norm, p 1, of the singular values of C. 2Recall, a function f(x) is positively-homogeneous of degree p if f(αx) = αpf(x), α 0. 
Published as a conference paper at ICLR 2021 the autoencoder, ΦD(ΦE(X, αWe), α 1Wd) = ΦD(ΦE(X, We), Wd), but ΦE(X, αWe) F = α2 ΦE(X, We) F . From this, taking α to be arbitrarily small the magnitude of the embedding can also be made arbitrarily small, so if θ(C) in (1) is something like a norm (as is typically the case) the value of F(Z, C) in (3) can also be made arbitrarily small without changing the reconstruction loss of the autoencoder ℓ. This implies one is simply training an autoencoder without any contribution from the F(Z, C) term (other than trying to reduce the magnitude of the embedded representation). This basic idea can be easily generalized and formalized in the following statement (all proofs are provided in the Supplementary Material): Proposition 1. Consider the objective in (3) and suppose θ(C) in (1) is a function which achieves its minimum at C = 0 and satisfies θ(µC) < θ(C), µ (0, 1). If for any choice of (Wd, We) and τ1, τ2 (0, 1) there exists ( ˆ Wd, ˆ We) such that ΦE(X, ˆ We) = τ1ΦE(X, We) and ΦD(τ2Z, ˆ Wd) = ΦD(Z, Wd) then the F term in (3) can be made arbitrarily small without changing the value of the loss function ℓ. Further, if the final layer of the encoding network is positively-homogeneous with degree p = 0, such a ( ˆ Wd, ˆ We) will always exist simply by scaling the network weights of the linear (or affine) operator parameters. In practice, most of the previously cited studies employing the SEDSC model use networks which satisfy the conditions of Proposition 1 without regularization, and we note that such models can never be solved to global optimality (only approach it asymptotically) and are inherently ill-posed. From this, one can conclude that Autoencoder Regularization by itself is often insufficient to prevent trival solutions to (2). This specific issue can be easily fixed (although prior work often does not) if one ensures that the magnitude of the embedded representation is constrained to be larger than some minimum value - either through regularization/constraints placed on the network weights, such as using weight decay or coupling weights between the encoder and decoder, potentially a different choice of non-linearity on the last layer or the encoder, or explicit normalization of the embedded representation. However, the question remains what geometries are promoted by (1) even if the basic issue described above is corrected, which we explore next. 2.2 THE EFFECT OF THE AUTOENCODER LOSS Before presenting the remainder of our formal results, we first pause to discuss the effect of the autoencoder loss (ℓ) in (3). The use of the autoencoder loss is designed to ensure that the embedded representation Z retains sufficient information about the original data X so that the data can be reconstructed by the decoder, but we note that this does not necessarily impose significant constraints on the geometric arrangement of the embedded points in Z. While there is potentially some constraint on the possible geometries of Z which is imposed by the choice of encoder/decoder architectures, reasonably expressive encoders and decoders can map a finite number of data points arbitrarily into the embedded space and still decode them accurately, provided the embedding of any two points is still distinct to within some ϵ perturbation (i.e., ΦE is a one-to-one mapping on the points in X). 
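Before continuing, the rescaling argument behind Proposition 1 is easy to reproduce numerically for the single-hidden-layer ReLU autoencoder of Section 2.1. The weights below are random placeholders; the point is only that the reconstruction is unchanged while $\|Z\|_F$ collapses:

```python
import numpy as np

rng = np.random.default_rng(0)
dx, d, h, N = 8, 4, 16, 20
X = rng.standard_normal((dx, N))
W1e, W2e = rng.standard_normal((h, dx)), rng.standard_normal((d, h))
W1d, W2d = rng.standard_normal((h, d)), rng.standard_normal((dx, h))
relu = lambda A: np.maximum(A, 0.0)

encode = lambda X, W1, W2: W2 @ relu(W1 @ X)
decode = lambda Z, W1, W2: W2 @ relu(W1 @ Z)

out_orig = decode(encode(X, W1e, W2e), W1d, W2d)
for alpha in [1.0, 0.1, 0.01]:
    # scale the encoder weights by alpha and the decoder weights by 1/alpha
    Z = encode(X, alpha * W1e, alpha * W2e)          # = alpha**2 times the original Z
    out = decode(Z, W1d / alpha, W2d / alpha)
    print(f"alpha={alpha}: ||Z||_F = {np.linalg.norm(Z, 'fro'):.4f}, "
          f"output unchanged: {np.allclose(out, out_orig)}")
```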
Further, because we only evaluate the mapping on a finite set of points X, this is a much easier (and can be achieved with much simpler networks) than the well-known universal approximation regime of neural networks (which requires good approximation over the entire continuous data domain), and it is well documented that typical network architectures can optimally fit fairly arbitrary finite training data (Zhang et al., 2017). Given the above discussion, in our analysis we will first focus on the setting where the autoencoder is highly expressive. In this regime, the encoder can select weight parameters to produce an essentially arbitrary choice of Z embedding, and as long as ΦE(Xi) = ΦE(Xj), Xi = Xj then the decoder can exactly recover the original data X. As a result, for almost any choice of encoder the autoencoder loss term, ℓ, in (3) can be exactly minimized, so the optimal network weights for the model in (3) (and likewise (2)) will be those which minimize F(Z, C) (potentially to within some small ϵ perturbation so that ΦE(X) is a one-to-one mapping on the points in X). As we already know from Prop 1, this is ill-posed without some additional form of regularization, so in the following subsections we explore optimal solutions to F(Z, C) when one optimizes over both C and the embedded representation Z = ΦE(X, We) subject to three different constraints on Z: (1) Dataset Normalization Z 2 F τ, (2) Channel/Batch Normalization Zi F τ i, and (3) Instance Normalization Zi F = τ i. Finally, after characterizing solutions to F(Z, C), we give a specific example of a very simple (i.e., not highly expressive) family of architectures which can achieve these solutions, Published as a conference paper at ICLR 2021 showing that the assumption of highly expressive networks is not necessary for these solutions to be globally optimal. 2.3 DATASET AND BATCH/CHANNEL NORMALIZATION We will first consider the Dataset and Batch/Channel Normalization schemes, which will both result in very similar optimal solutions for the embedded data geometry. Recall, that this considers when the entire dataset is constrained to have a norm greater than some minimum value in the embedded space (Dataset Normalization) or when the norms of the rows of the dataset are constrained to have a norm greater than some minimum value (Batch/Channel Normalization). We note that the latter case is very closely related to batch normalization (Ioffe & Szegedy, 2015), which requires that each channel/feature (in this case a row of the embedded representation) to have zero mean and constant norm in expectation over the draw of a mini-batch. Additionally, while we do not explicitly enforce a zero-mean constraint, we will see that optimal solutions will exist which have zero mean. Now, if one optimizes F(Z, C) jointly over (Z, C) subject to the above constraint(s) on Z, then the following holds: Theorem 1. Consider the following optimization problems which jointly optimize over Z and C: (P1) min C,Z F(Z, C) s.t. Z 2 F τ (P2) min C,Z F(Z, C) s.t. Zi 2 F τ Then optimal values for C for both (P1) and (P2) are given by C arg min C 1 2σ2 min(C I)τ + λθ(C). (5) Moreover, for any optimal C , let r be the multiplicity of the smallest singular value of C I and let Q RN r be an orthonormal basis for the subspace spanned by the left singular vectors associated with the smallest singular value of C I. 
Then we have that the optimal values for $Z$ are given by

$$Z^* \in \{BQ^\top : B \in \mathcal{B}\}, \quad
\mathcal{B} = \begin{cases}
\{B : \|B\|_F^2 = \tau\}, & \sigma_{\min}(C^*-I) > 0 \\
\{B : \|B\|_F^2 \le \tau\}, & \sigma_{\min}(C^*-I) = 0
\end{cases} \ \text{(P1)}, \qquad
\mathcal{B} = \begin{cases}
\{B : \|B^i\|_F^2 = \tfrac{\tau}{d},\ \forall i\}, & \sigma_{\min}(C^*-I) > 0 \\
\{B : \|B^i\|_F^2 \le \tfrac{\tau}{d},\ \forall i\}, & \sigma_{\min}(C^*-I) = 0
\end{cases} \ \text{(P2)} \qquad (6)$$

where $\mathcal{B} \subseteq \mathbb{R}^{d \times r}$.

From the above result, one notes from (5) that optimizing $F(Z, C)$ jointly over both $Z$ and $C$ is equivalent to finding a $C$ which minimizes a trade-off between the minimum singular value of $C - I$ and the regularization $\theta(C)$. Further, we note that if such an optimal $C$ results in the minimum singular value of $C - I$ having a multiplicity of 1, then every data point in $Z$ will simply be a scaled version of the same point. Obviously, such an embedding is not useful for subspace clustering.

Characterizing optimal solutions to (5) is somewhat complicated in the general case, because the smallest singular value is a concave function of a matrix while $\theta$ is typically chosen to be convex, so one must minimize a convex-plus-concave function. Instead, we focus on the most commonly used choices of regularization, starting with $\theta_{SSC}(C) = \|C\|_1 + \delta(\operatorname{diag}(C) = 0)$, where we derive the optimal solution in the case $\sigma_{\min}(C^* - I) = 0$. We note that this corresponds to the case $Z^* = Z^*C^*$, which one typically obtains as $\lambda$ in (1) is taken to be small.

Theorem 2. Optimal solutions to the problems

$$\text{(P1)} \ \min_{Z,C} \|C\|_1 \ \ \text{s.t.} \ \operatorname{diag}(C) = 0,\ Z = ZC,\ \|Z\|_F^2 \ge \tau \qquad (7)$$
$$\text{(P2)} \ \min_{Z,C} \|C\|_1 \ \ \text{s.t.} \ \operatorname{diag}(C) = 0,\ Z = ZC,\ \|Z^i\|_F^2 \ge \tfrac{\tau}{d},\ \forall i \qquad (8)$$

are characterized by the set

$$(Z^*, C^*) \in \left\{\left([\,z \ \ z \ \ 0_{d \times (N-2)}\,]\,P,\ \ P^\top\!
\begin{bmatrix} 0 & 1 & 0 & \cdots & 0\\ 1 & 0 & 0 & \cdots & 0\\ 0 & 0 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & \cdots & 0 \end{bmatrix}\! P\right)\right\} \qquad (9)$$

where $P \in \mathbb{R}^{N \times N}$ is an arbitrary signed-permutation matrix and $z \in \mathbb{R}^d$ is an arbitrary vector which satisfies $\|z\|_F^2 \ge \tau/2$ for (P1) and $z_i^2 \ge \tau/(2d),\ \forall i \in [1, d]$ for (P2).

From the above result, we have shown that if $Z$ is normalized to have a lower-bounded norm on either the entire embedded representation or on each row, then the effect of the $F(Z, C)$ loss will be largely similar to the situation described by Proposition 1, in the sense that the loss will still attempt to push all of the points to $0$ with the exception of two points, which will be copies of each other (potentially to within a sign-flip). Again, the optimal embedded representation is clearly ill-suited for clustering, since all but two of the points are trivially driven towards $0$ in the embedded space.

In addition, we also present a result similar to Theorem 2 when $C$ is regularized by any Schatten-$p$ norm, which includes as special cases two other popular choices of regularization that have appeared in the literature: the Frobenius norm $\theta(C) = \|C\|_F$ (for $p = 2$) and the nuclear norm $\theta(C) = \|C\|_*$ (for $p = 1$).

Theorem 3. Optimal solutions to the problems

$$\text{(P1)} \ \min_{Z,C} \|C\|_{S_p} \ \ \text{s.t.} \ Z = ZC,\ \|Z\|_F^2 \ge \tau \qquad (10)$$
$$\text{(P2)} \ \min_{Z,C} \|C\|_{S_p} \ \ \text{s.t.} \ Z = ZC,\ \|Z^i\|_F^2 \ge \tfrac{\tau}{d},\ \forall i \qquad (11)$$

where $\|C\|_{S_p}$ is any Schatten-$p$ norm on $C$, are characterized by the set

$$(Z^*, C^*) \in \left\{\,(zq^\top,\ qq^\top) \ : \ q \in \mathbb{R}^N,\ \|q\|_F = 1 \,\right\} \qquad (12)$$

where $z \in \mathbb{R}^d$ is an arbitrary vector which satisfies $\|z\|_F^2 \ge \tau$ for (P1) and $z_i^2 \ge \tfrac{\tau}{d},\ \forall i$ for (P2).

Again, note that this is obviously not a good geometry for successful spectral clustering, as all the points in the dataset are simply arranged on a single line and the optimal $C^*$ is a rank-one matrix.
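The degenerate optima in Theorems 2 and 3 can be written down explicitly and checked for feasibility; a short NumPy sketch (the dimensions and the particular vectors $z$ and $q$ are arbitrary):

```python
import numpy as np

d, N, tau = 3, 8, 1.0
rng = np.random.default_rng(0)

# Theorem 2 (SSC regularizer): all points at the origin except two copies of z.
z = rng.standard_normal(d)
z *= np.sqrt(tau / 2) / np.linalg.norm(z)           # ||z||_F^2 = tau / 2
Z = np.zeros((d, N)); Z[:, 0] = z; Z[:, 1] = z
C = np.zeros((N, N)); C[0, 1] = 1.0; C[1, 0] = 1.0
assert np.allclose(Z, Z @ C)                        # self-expression holds exactly
assert np.allclose(np.diag(C), 0)                   # zero-diagonal constraint
assert np.isclose(np.linalg.norm(Z, "fro") ** 2, tau)
print("||C||_1 =", np.abs(C).sum())                 # = 2, as in the proof of Theorem 2

# Theorem 3 (Schatten-p regularizer): all points on a single line, C rank one.
q = rng.standard_normal(N); q /= np.linalg.norm(q)
z3 = rng.standard_normal(d); z3 *= np.sqrt(tau) / np.linalg.norm(z3)
Z3, C3 = np.outer(z3, q), np.outer(q, q)
assert np.allclose(Z3, Z3 @ C3)                     # qq^T acts as identity on span(q)
print("rank(C3) =", np.linalg.matrix_rank(C3),
      " nuclear norm =", np.linalg.norm(C3, "nuc"))
```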
2.4 INSTANCE NORMALIZATION To explicitly prevent the case where most of the points in the embedded space are trivially driven to 0 as in the prior two normalization schemes, another potential normalization strategy which has been proposed for (2) is to use Instance Normalization (Peng et al., 2017), where the ℓ2 norm of each embedded data point is constrained to be equal to some constant. Here again we will see that this results in somewhat trivial data geometries. Specifically, we will again focus on the choice of the θSSC(C) regularization function when we have exact equality, Z = ZC, for simplicity of presentation. From this we have the following result: Theorem 4. Optimal solutions to the problem min Z,C C 1 s.t. diag(C) = 0, Z = ZC, Zi 2 F = τ i (13) must have the property that for any column in Z , Z i , there exists another column, Z j (i = j), such that Z i = si,j Z j where si,j { 1, 1}. Further, C i 1 = 1 i and Ci,j = 0 = Z i = Z j. The above result is quite intuitive in the sense that because a given point cannot use itself in its representation (due to the diag(C) = 0 constraint), the next best thing is to have an exact copy of itself in another column. While this result is more conducive to successful clustering in the sense that points which are close in the embedded space are encouraged to merge into a single point, there are still numerous pathological geometries that can result. Specifically, there is no constraint on the number of distinct points in the representation (i.e., the number of vectors which are not copies of each other to within a sign flip), other than it must be less than N/2. As a result, the optimal C matrix can also contain an arbitrary number (in the range [1, N/2]) of connected components in the affinity graph, resulting in somewhat arbitrary spectral clustering. Example of Degenerate Geometry with Simple Networks. In section 2.2 we discussed how if the encoder/decoder are highly expressive then the optimal embedding will approach the solutions we give in our theoretical analysis. Here we show that trivial embeddings can also occur with relatively simple encoders/decoders. Specifically, consider basic encoder/decoder architectures which consist of two affine mappings with a Re LU non-linearity (denoted as ( )+) on the hidden layer: ΦE(x, We) = W2 e(W1 ex + b1 e)+ + b2 e ΦD(z, Wd) = W2 d(W1 dz + b1 d)+ + b2 d (14) where the linear operators (W matrices) can optionally be constrained (for example for convolution operations W could be required to be a Toeplitz matrix). Now if we have that the embedded dimension d is equal to the data dimension dx we will say that linear operators can express identity on X Published as a conference paper at ICLR 2021 0 25 50 75 100 125 150 175 200 Iteration/10 Accuracy (%) Extended Yale B 0 25 50 75 100 125 150 175 200 Iteration Accuracy (%) 0 25 50 75 100 125 150 175 200 Iteration/10 Accuracy (%) Raw Data Raw Data (no pp) Autoenc only Autoenc only (no pp) Full SEDSC Full SEDSC (no pp) Figure 2: Clustering accuracy results for Yale B (38 faces), COIL100, and ORL datasets with (Dashed Lines) and without (Solid Lines) the post-processing step on C matrix proposed in Ji et al. (2017). (Raw Data) Clustering on the raw data. (Autoenc only) Clustering features from an autoencoder trained without the F(Z, C) term. (Full SEDSC) The full model in (3). if there exists parameters (W2, W1) such that W2W1X = X. 
Note that if the architectures in (14) are fully-connected this implies that for a general X the number of hidden units is greater than or equal to d = dx (which can be even smaller if X is low-rank), while if the linear operators are convolutions, then we only need one convolutional channel (with the kernel being the delta function). Within this setting we have the following result: Proposition 2. Consider encoder and decoder networks with the form given in (14). Then, given any dataset X Rdx N where the linear operators in both the encoder/decoder can express identity on X and any τ > 0 there exist network parameters (We, Wd) which satisfy the following: 1. Embedded points are arbitrarily close: ΦE(Xi, We) ΦE(Xj, We) ϵ (i, j) and ϵ > 0. 2. Embedded points have norm arbitrarily close to τ: | ΦE(Xi, We) F τ| ϵ i and ϵ > 0. 3. Embedded points can be decoded exactly: ΦD(ΦE(Xi, We), Wd) = Xi, i. From the above simple example, we can see that even with very simple network architectures (i.e., not necessarily highly expressive) it is still possible to have solutions which are arbitrarily close to the global optimum described in Theorem 4, in the sense that the points can be made to be arbitrarily close to each other in the embedded space with norm arbitrarily close to τ (for any arbitrary choice of τ), while still having a perfect reconstruction of X. 3 EXPERIMENTS Here, we present experiments on both real and synthetic data that verify our theoretical predictions experimentally. We first evaluate the Autoencoder Regularization form given in (3) by repeating all of the experiments from Ji et al. (2017). In the Supplementary Material we first show that the optimization problem never reaches a stationary point due to the pathology described by Prop 1 (see Figure 4 in Supplementary Material), and below we show that the improvement in performance reported in Ji et al. (2017) is largely attributable to an ad-hoc post-processing step. Then, we present experiments on a simple synthetic dataset to illustrate our theoretical results. Repeating the Experiments of Ji et al. (2017). First we use the code provided by the authors of Ji et al. (2017) to repeat all of their original clustering experiments on the Extended Yale-B (38 faces), ORL, and COIL100 datasets. As baseline methods, we perform subspace clustering on the raw data as well as subspace clustering on embedded features obtained by training the autoencoder network without the F(Z, C) term (i.e., γ = 0 and C fixed at I in (3)). See Supplementary Material for further details. In addition to proposing the model in (3), the authors of Ji et al. (2017) also implement a somewhat arbitrary post-processing of the C matrix recovered from the SEDSC model before the final spectral Table 1: Clustering accuracy shown in Fig 2. To be consistent with Ji et al. (2017), we report the results at 1000/120/700 iterations for Yale B / COIL100 / ORL, respectively. With post-processing Without post-processing Raw Data Autoenc only Full SEDSC Raw Data Autoenc only Full SEDSC Yale B 94.40% 97.12% 96.79% 68.71% 71.96% 59.09% COIL100 66.47% 68.26% 64.96% 47.51% 44.84% 45.67% ORL 78.12% 83.43% 84.10% 72.68% 73.73% 73.50% Published as a conference paper at ICLR 2021 0.1 0.0 0.1 0.2 0.3 Original Data Reconstructed Data Auto Encoder Auto Encoder + Self-Express Original Data Domain Embedded Data Domain |C| |C| - log color scale Dataset Norm 0.4 0.2 0.0 0.2 0.4 0.6 Figure 3: Results for synthetic data using the dataset normalization scheme. 
Figure 3: Results for synthetic data using the dataset normalization scheme. (Left) Original data points (blue) and the data points at the output of the autoencoder when the full model (3) is used (red). (Center Left) Data representation in the embedded domain when just the autoencoder is trained without the $F(Z, C)$ term (blue) and when the full SEDSC model is used (red). (Center Right) The absolute value of the recovered $C$ encoding matrix when trained with the full model. (Right) The same plot as the previous column, but with a logarithmic color scale to visualize small entries.

Synthetic Data Experiments. Finally, to illustrate our theoretical predictions, we construct a simple synthetic dataset consisting of 100 points drawn from the union of two parabolas in $\mathbb{R}^2$, where the space of the embedding is also $\mathbb{R}^2$. We then train the model given in (3) with $\theta(C) = \theta_{SSC}(C) = \|C\|_1 + \delta(\operatorname{diag}(C) = 0)$, with the encoder/decoder networks being simple single-hidden-layer fully-connected networks with 100 hidden units and ReLU activations on the hidden units. Figure 3 shows the solution obtained when we directly add a normalization operator to the encoder network which normalizes the output of the encoder to have unit Frobenius norm over the entire dataset (Dataset Normalization). Additional experiments for other normalization schemes and a description of the details of our experiments can be found in the Supplementary Material.

From Figure 3 one can see that our theoretical predictions are largely confirmed experimentally. Namely, when the full SEDSC model is trained, the embedded representation is largely as predicted by Theorem 2, with almost all of the embedded points (Center Left, red points) being close to the origin with the exception of two points, which are co-linear with each other. Likewise, the $C$ matrix is dominated by two non-zero entries, with the remaining non-zero entries only visible on the logarithmic color scale. We note that since this is a very simple dataset (i.e., two parabolas without any added noise), one would expect most reasonable manifold clustering/learning algorithms to succeed; however, due to the deficiencies of the SEDSC model identified in our analysis, a trivial solution results.

4 CONCLUSIONS

We have presented a theoretical and experimental analysis of the Self-Expressive Deep Subspace Clustering (SEDSC) model. We have shown that in many cases the SEDSC model is ill-posed and results in trivial data geometries in the embedded space. Further, our attempts to replicate previously reported experiments lead us to conclude that much of the claimed benefit of SEDSC is attributable to other factors, such as post-processing of the recovered encoding matrix, $C$, and not the SEDSC model itself.
Overall, we conclude that considerably more attention needs to be given to the issues we have raised in this paper in terms of both how models for this problem are designed and how they are evaluated to ensure that one arrives at meaningful solutions and can clearly demonstrate the performance of the resulting model without other confounding factors. Acknowledgments. The authors thank Zhihui Zhu and Benjamin B ejar Haro for helpful discussions in the early stages of this work. This work was partially supported by the Northrop Grumman Mission Systems Research in Applications for Learning Machines (REALM) initiative, NSF Grants 1704458, 2031985 and 1934979, and the Tsinghua-Berkeley Shenzhen Institute Research Fund. Published as a conference paper at ICLR 2021 Mahdi Abavisani and Vishal M Patel. Deep multimodal subspace clustering networks. IEEE Journal of Selected Topics in Signal Processing, 12(6):1601 1614, 2018. Joao Costeira and Takeo Kanade. A multi-body factorization method for motion analysis. In IEEE International Conference on Computer Vision, pp. 1071 1076. IEEE, 1995. Ehsan Elhamifar and Ren e Vidal. Sparse subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2790 2797, 2009. Ehsan Elhamifar and Ren e Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765 2781, 2013. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448 456, 2015. Pan Ji, Tong Zhang, Hongdong Li, Mathieu Salzmann, and Ian Reid. Deep subspace clustering networks. In Advances in Neural Information Processing Systems, pp. 24 33, 2017. Yangbangyan Jiang, Zhiyong Yang, Qianqian Xu, Xiaochun Cao, and Qingming Huang. When to learn what: Deep cognitive subspace clustering. In Proceedings of the 26th ACM international conference on Multimedia, pp. 718 726, 2018. Yangbangyan Jiang, Qianqian Xu, Zhiyong Yang, Xiaochun Cao, and Qingming Huang. Duet robust deep subspace clustering. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 1596 1604, 2019. Mohsen Kheirandishfard, Fariba Zohrizadeh, and Farhad Kamangar. Deep low-rank subspace clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 864 865, 2020a. Mohsen Kheirandishfard, Fariba Zohrizadeh, and Farhad Kamangar. Multi-level representation learning for deep subspace clustering. In The IEEE Winter Conference on Applications of Computer Vision, pp. 2039 2048, 2020b. Chun-Guang Li, Chong You, and Ren e Vidal. On geometric analysis of affine sparse subspace clustering. IEEE Journal on Selected Topics in Signal Processing, 12(6):1520 1533, 2018. Ruihuang Li, Changqing Zhang, Huazhu Fu, Xi Peng, Tianyi Zhou, and Qinghua Hu. Reciprocal multi-layer subspace learning for multi-view clustering. In IEEE International Conference on Computer Vision, pp. 8172 8180, 2019. Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, and Yi Ma. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171 184, 2013. Can-Yi Lu, Hai Min, Zhong-Qiu Zhao, Lin Zhu, De-Shuang Huang, and Shuicheng Yan. Robust and efficient subspace segmentation via least squares regression. In European Conference on Computer Vision, pp. 347 360, 2012. N. Parikh and S. Boyd. Proximal Algorithms. 
Foundations and Trends in Optimization, 1(3):123 231, 2013. V. M. Patel and R. Vidal. Kernel sparse subspace clustering. In IEEE International Conference on Image Processing, pp. 2849 2853, 2014. V. M. Patel, H. V. Nguyen, and R. Vidal. Latent space sparse subspace clustering. In IEEE International Conference on Computer Vision, pp. 225 232, 2013. V. M. Patel, H. V. Nguyen, and R. Vidal. Latent space sparse and low-rank subspace clustering. IEEE Journal of Selected Topics in Signal Processing, 9(4):691 701, 2015. Published as a conference paper at ICLR 2021 Xi Peng, Jiashi Feng, Shijie Xiao, Jiwen Lu, Zhang Yi, and Shuicheng Yan. Deep sparse subspace clustering. ar Xiv preprint ar Xiv:1709.08374, 2017. Daniel P Robinson, Rene Vidal, and Chong You. Basis pursuit and orthogonal matching pursuit for subspace-preserving recovery: Theoretical analysis. ar Xiv preprint ar Xiv:1912.13091, 2019. Mahdi Soltanolkotabi and Emmanuel J. Cand es. A geometric analysis of subspace clustering with outliers. Annals of Statistics, 40(4):2195 2238, 2012. Mahdi Soltanolkotabi, Ehsan Elhamifar, and Emmanuel J. Cand es. Robust subspace clustering. Annals of Statistics, 42(2):669 699, 2014. Xiukun Sun, Miaomiao Cheng, Chen Min, and Liping Jing. Self-supervised deep multi-view subspace clustering. In Asian Conference on Machine Learning, pp. 1001 1016, 2019. Xiaoliang Tang, Xuan Tang, Wanli Wang, Li Fang, and Xian Wei. Deep multi-view sparse subspace clustering. In Proceedings of the 2018 VII International Conference on Network, Communication and Computing, pp. 115 119, 2018. Manolis C. Tsakiris and Ren e Vidal. Theoretical analysis of sparse subspace clustering with missing entries. In International Conference on Machine Learning, pp. 4975 4984, 2018. Ren e Vidal and Paolo Favaro. Low rank subspace clustering (LRSC). Pattern Recognition Letters, 43:47 61, 2014. Ren e Vidal, Yi Ma, and Shankar Sastry. Generalized Principal Component Analysis. Springer Verlag, 2016. Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395 416, 2007. Yining Wang, Yu-Xiang Wang, and Aarti Singh. A deterministic analysis of noisy sparse subspace clustering for dimensionality-reduced data. In International Conference on Machine Learning, pp. 1422 1431, 2015. Yu-Xiang Wang and Huan Xu. Noisy sparse subspace clustering. Journal of Machine Learning Research, 17(12):1 41, 2016. Yuan Xie, Jinyan Liu, Yanyun Qu, Dacheng Tao, Wensheng Zhang, Longquan Dai, and Lizhuang Ma. Robust kernelized multiview self-representation for subspace clustering. IEEE Transactions on Neural Networks and Learning Systems, 2020. Shuai Yang, Wenqi Zhu, and Yuesheng Zhu. Residual encoder-decoder network for deep subspace clustering. ar Xiv preprint ar Xiv:1910.05569, 2019. Yingzhen Yang, Jiashi Feng, Nebojsa Jojic, Jianchao Yang, and Thomas S Huang. ℓ0-sparse subspace clustering. In European Conference on Computer Vision, pp. 731 747, 2016. C. You and R. Vidal. Subspace-sparse representation. Arxiv, abs/1507.01307, 2015a. Chong You and Ren e Vidal. Geometric conditions for subspace-sparse recovery. In International Conference on Machine Learning, pp. 1585 1593, 2015b. Chong You, Chun-Guang Li, Daniel P. Robinson, and Ren e Vidal. Oracle based active set algorithm for scalable elastic net subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3928 3937, 2016. Chong You, Chun-Guang Li, Daniel P. Robinson, and Ren e Vidal. Is an affine constraint needed for affine subspace clustering? 
In IEEE International Conference on Computer Vision, 2019. Meng Zeng, Yaoming Cai, Zhihua Cai, Xiaobo Liu, Peng Hu, and Junhua Ku. Unsupervised hyperspectral image band selection based on deep subspace clustering. IEEE Geoscience and Remote Sensing Letters, 16(12):1889 1893, 2019a. Published as a conference paper at ICLR 2021 Meng Zeng, Yaoming Cai, Xiaobo Liu, Zhihua Cai, and Xiang Li. Spectral-spatial clustering of hyperspectral image based on laplacian regularized deep subspace clustering. In IGARSS 20192019 IEEE International Geoscience and Remote Sensing Symposium, pp. 2694 2697. IEEE, 2019b. C. Zhang, S. Bengio, M. Hardt, and B. Recht. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017. Changqing Zhang, Huazhu Fu, Qinghua Hu, Xiaochun Cao, Yuan Xie, Dacheng Tao, and Dong Xu. Generalized latent multi-view subspace clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1):86 99, 2018. Junjian Zhang, Chun-Guang Li, Chong You, Xianbiao Qi, Honggang Zhang, Jun Guo, and Zhouchen Lin. Self-supervised convolutional subspace clustering network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5473 5482, 2019a. Tong Zhang, Pan Ji, Mehrtash Harandi, Huang Wenbing, and Hongdong Li. Neural collaborative subspace clustering. In ICML, 2019b. Lei Zhou, Bai Xiao, Xianglong Liu, Jun Zhou, Edwin R Hancock, et al. Latent distribution preserving deep subspace clustering. In 28th International Joint Conference on Artificial Intelligence. York, 2019. Pan Zhou, Yunqing Hou, and Jiashi Feng. Deep adversarial subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1596 1604, 2018. Published as a conference paper at ICLR 2021 5 PROOFS OF RESULTS IN MAIN PAPER Here we present the proofs of our various results that we give in the main paper. 5.1 PROOF OF PROPOSITION 1 Proposition 1. Consider the objective in (3) and suppose θ(C) in (1) is a function which achieves its minimum at C = 0 and satisfies θ(µC) < θ(C), µ (0, 1). If for any choice of (Wd, We) and τ1, τ2 (0, 1) there exists ( ˆ Wd, ˆ We) such that ΦE(X, ˆ We) = τ1ΦE(X, We) and ΦD(τ2Z, ˆ Wd) = ΦD(Z, Wd) then the F term in (3) can be made arbitrarily small without changing the value of the loss function ℓ. Further, if the final layer of the encoding network is positively-homogeneous with degree p = 0, such a ( ˆ Wd, ˆ We) will always exist simply by scaling the network weights of the linear (or affine) operator parameters. Proof. Let (C, We, Wd) be an arbitrary triplet. For any choice of τ, µ (0, 1), note the statement conditions require that there exists ( ˆC, ˆ We, ˆ Wd) which satisfy ΦE(X, ˆ We) = τΦE(X, We) ΦD(µτZ, ˆ Wd) = ΦD(Z, Wd). Then, the function ℓevaluated at ( ˆC, ˆ We, ˆ Wd) is given by ℓ(X, ΦD(ΦE(X, ˆ We)µC, ˆ Wd)) = ℓ(X, ΦD(ΦE(X, We)C, Wd)), (16) which is equal to ℓevaluated at (C, We, Wd). Moreover, the function F evaluated at ( ˆC, ˆ We, ˆ Wd) is given by 1 2 ΦE(X, ˆ We) ΦE(X, ˆ We)µC 2 F + λθ(µC) 2 ΦE(X, We) ΦE(X, We)µC 2 F + λθ(µC). (17) Note that the above equation can be made arbitrarily small for choice of τ, µ sufficiently small, completing the first statement of the result. To see the second part of the claim, first note that the condition on the decoder that for all τ2 (0, 1) there exists ˆ Wd such that ΦD(τ2Z, ˆ Wd) = ΦD(Z, Wd) is trivially satisfied by all neural networks by simply scaling the input weights in the first layer of the decoder network by τ 1 2 . 
As a result, we are left to show that there will always exist a set of encoder weights which satisfies the conditions of the statement. To see this, w.l.o.g. let Φe(X, We) take the general form Φe(X, We) = ψ(A(h(X, We), A, b)) (18) where A( , A, b) is an arbitrary linear (or affine) operator parameterized by linear parameters A and bias terms (for affine operators) b; h(X, We) is an arbitrary function parameterized by We (note that We = {A, b, We}); and ψ is an arbitrary positively homogeneous function with degree p = 0. From this note that for any α > 0 we have the following: ψ(A(h(X, We), αA, αb)) = ψ(αA(h(X, We), A, b)) = αpψ(A(h(X, We), A, b)). (19) where the first equality is due to basic properties of linear (or affine) operators and the second equality is due to positive homogeneity. As a result, for any choice of τ1 (0, 1) we can choose a scaling α = τ 1/p 1 to achieve that for parameters ˆ We = (τ 1/p 1 A, τ 1/p 1 b, We) we have Φe(X, ˆ We) = τ1Φe(X, We), completing the result. 5.2 PROOF OF THEOREM 1 Theorem 1. Consider the following optimization problems which jointly optimize over Z and C: (P1) min C,Z F(Z, C) s.t. Z 2 F τ (P2) min C,Z F(Z, C) s.t. Zi 2 F τ Published as a conference paper at ICLR 2021 Then optimal values for C for both (P1) and (P2) are given by C arg min C 1 2σ2 min(C I)τ + λθ(C). (21) Moreover, for any optimal C , let r be the multiplicity of the smallest singular value of C I and let Q RN r be an orthonormal basis for the subspace spanned by the left singular vectors associated with the smallest singular value of C I. Then we have that optimal values for Z are given by: Z {BQ : B Rd r B}, B = {B : B 2 F = τ} σmin(C I) > 0 {B : B 2 F τ} σmin(C I) = 0 (P1) {B : Bi 2 F = τ d, i} σmin(C I) > 0 {B : Bi 2 F τ d, i} σmin(C I) = 0 (P2) Proof. The objective F(Z, C) can be reformulated as F(Z, C) = 1 2tr(Z(C I)(C I) Z )+λθ(C) = 1 i=1 Zi(C I)(C I) (Zi) +λθ(C), (23) where recall Zi denotes the ith row of Z. If we add the constraints on Z we have the following minimization problems over Z with C held fixed: i=1 Zi(C I)(C I) (Zi) s.t. Z Z = (Pd i=1 Zi 2 F τ (P1) Zi 2 F τ d i (P2) . (24) Note that if we fix the magnitude of the rows, Zi 2 F = ki for any k Rd, k 0, Pd i=1 ki τ for (P1) and k Rd, ki τ/d for (P2) and optimize over the directions of the rows, then the minimum is obtained whenever Zi is in the span of the eignevectors of (C I)(C I) with smallest eigenvalue, which implies that all the rows of an optimal Z matrix must lie in the span of Q, where Q RN r is an orthonormal basis for the subspace by the left singular vectors of C I associated with the smallest singular value of C I, which has multiplicity r. As a result, we have that optimal values of Z must take the form Z = BQ for some B Rd r. Further, we note that the following also holds: F(BQ , C) = 1 2tr(BQ (C I)(C I) QB ) + λθ(C) = 2σ2 min(C I)tr(BB ) + λθ(C) = 1 2σ2 min(C I) B 2 F + λθ(C). (25) The constraints for B are then seen by noting ZZ = BQ QB = BB , so Z 2 F = tr(ZZ ) = tr(BB ) = B 2 F and Zi 2 F = (ZZ )i,i = (BB )i,i = Bi(Bi) = Bi 2 F , and if σmin(C I) > 0 then minimizing (25) w.r.t. B subject to the constraints on Z gives that B 2 F = τ is optimal. 5.3 PROOF OF THEOREM 2 Theorem 2. Optimal solutions to the problems (P1) min Z,C C 1 s.t. diag(C) = 0, Z = ZC, Z 2 F τ (26) (P2) min Z,C C 1 s.t. diag(C) = 0, Z = ZC, Zi 2 F τ are characterized by the set (Z , C ) {[z z 0d N 2] P} 0 1 0 0 1 0 0 0 0 0 0 0 ... ... ... ... 
0 0 0 0 Published as a conference paper at ICLR 2021 where P RN N is an arbitrary signed-permutation matrix and z Rd is an arbitrary vector which satisfies z 2 F τ/2 for (P1) and z2 i τ/(2d), i [1, d] for (P2). Proof. As we observed from the proof of Theorem 1, if C I has a left null-space we can choose an optimal Z to have its rows in that null-space (and this also clearly implies we have σmin(C I) = 0). Also note that when σmin(C I) = 0 this corresponds to the case where Z = ZC Z(I C) = 0. Further, note that if q is a non-zero vector in the left null-space of C I we have that q (C I) = 0 q C = q , which implies that if we take q to be all-zero except for its ith entry, then this would imply that Ci,i must be non-zero, which would violate the diag(C) = 0 constraint, so any vector q in a left null-space of C I for a feasible C matrix must have at least two non-zero entries. As a result, solving (5) with the constraint σmin(C I) = 0 is equivalent to the following problem: min C,q C 1 s.t. q (C I) = 0, diag(C) = 0, q 0 2. (29) where 0 denotes the ℓ0 pseudo-norm of a vector defined as the number of non-zero entries in a vector. To first minimize w.r.t. C with a fixed q, we form the Lagrangian: min C L(C, Λ, Γ) = C 1 + Λ, (I C )q + Diag(Γ), C = (30) min C C 1 + qΛ Diag(Γ), C + Λ, q = Λ, q δ( qΛ Diag(Γ) 1), (31) where Λ RN and Γ RN are vectors of dual variables to enforce the (I C )q = 0 and the diag(C) = 0 constraints, respectively. This gives the dual problem max Λ,Γ Λ, q s.t. qΛ Diag(Γ) 1 = max Λ { Λ, q s.t. |qiΛj| 1 i = j} . (32) We note that (32) is separable in the entries of Λ, so if we define {ik}N k=1 to be the indexing which sorts the absolute values of the entries of q in descending order, |qi1| |qi2| |qi N |, one can easily see that an optimal choice of Λ is given by Λ i1 = sgn(qi1) |qi2| , Λ ik = sgn(qik) |qi1| k [2, N] = Λ , q = |qi1| |qi2| + 1 |qi1| k=2 |qik|. (33) If we now minimize the above w.r.t. q, note that the optimal value of the dual objective given by the above equation is invariant w.r.t. scaling the q vector by any non-zero scalar, so we can w.l.o.g. assume that |qi1| = 1 and note that this implies that problem (29) is equivalent to the following optimization problem over the magnitudes of q if we define pk = |qik|: min {pk}N k=2 1 p2 + p2 + k=3 pk s.t. 1 p2 p3 p N 0. (34) Now, note that for a non-negative scalar α 0 the minimum of α 1 + α is achieved at α = 1, so one can clearly see that the optimal value for the above problem is achieved at p2 = 1 and pk = 0, k [3, N]. From this we have that an optimal q for (29) must have exactly two nonzero entries and the non-zero entries must be equal in absolute value. Further, this also implies that C 1 = 2, and because we must have q C = q , if we scale q to have 1 for its two non-zero entries, we then have q C 1 = q 1 = 2 = C 1, so if we let (i, j) index the two non-zero entries of q we have: 2 = q C 1 = sgn(qi)(C )i+sgn(qj)(C )j 1 (C )i 1+ (C )j 1 C 1 = 2. (35) This implies that all the non-zero entries of C must lie in rows i and j, and if there is any overlap in the non-zero support of these rows the signs must match after multiplication by sgn(qi) and sgn(qj). However, since q C must equal q (which is zero everywhere except for entries i and j) and the diagonal of C must be zero, the only way this can be achieved is for the two rows to have non-overlapping non-zero support, proving that the only non-zero entries of C must be Ci,j and Cj,i which take values in { 1, 1}, depending on the choice of the signs for qi and qj. 
The result is completed by noting that since we require Z = Z C , then Z i and Z j must be equal to within a sign-flip depending on the choice of the signs of the q vector. Published as a conference paper at ICLR 2021 5.4 PROOF OF THEOREM 3 Theorem 3. Optimal solutions to the problems (P1) min Z,C C Sp s.t. Z = ZC, Z 2 F τ (36) (P2) min Z,C C Sp s.t. Z = ZC, Zi 2 F τ where C Sp is any Schatten-p norm on C, are characterized by the set (Z , C ) (zq ) (qq ) : q RN, q F = 1 (38) where z Rd is an arbitrary vector which satisfies z 2 F τ for (P1) and z2 i τ d, i for (P2). Proof. To begin, by the same arguments as in Theorem 2 we consider an optimization problem similar to (29) but for θ(C) = C Sp being any Schatten-p norm and with the q 0 constraint replaced by a q = 0 constraint: min C,q C Sp s.t. q (C I) = 0, q = 0. (39) Again forming the Lagrangian for C with q fixed we have: min C L(C, Λ) = C Sp + Λ, (I C )q = (40) min C C Sp qΛ , C + Λ, q = Λ, q δ( qΛ Sp 1) (41) which implies the dual problem is: max Λ Λ, q s.t. qΛ Sp 1 (42) where Sp denotes the dual norm. Note that for any Schatten-p norm, the dual norm is again a Schatten-p norm, but since we only evaluate the norm on rank-1 matrices this is equal to the Frobenius norm for all values of p. As a result we have for all choices of Schatten-p norm that the dual problem is equivalent to: max Λ Λ, q s.t. qΛ F 1 = max Λ { Λ, q s.t. q F Λ F 1} (43) From the above, one can easily see that the optimal choice for Λ is given as Λ = q q 2 F and the optimal objective value is 1. Further note that from primal optimality in (41) we must have that q(Λ ) C Sp which implies that C , q(Λ ) = C Sp q(Λ ) Sp = C Sp = 1. As a result, we have that C = q(Λ ) by the Cauchy-Schwarz inequality and the fact that q(Λ ) F = 1. Thus since C is a rank-1 matrix then C I can only have one singular value equal to 0, so all the rows of Z must be equal to a scaling of q. Given this structure for Z the result then follows, where we also recall that C Sp = 1, which implies the constraints on q in the statement. 5.5 PROOF OF THEOREM 4 We now present the proof for Theorem 4. Before proving the main result we first prove a simple Lemma which will be helpful. Lemma 1. For a given matrix Z Rd N and vector z Rd such that Zi F = τ i and z F = τ, let k [0, N] be the number of columns in Z which are equal to z to within a sign-flip (i.e., Zi = z). Then, if k 1 the following holds: min c { c 1 s.t. z = Zc} = 1 (44) and c i = 0 = Zi = z. Further, if k = 0 we also have min c { c 1 s.t. z = Zc} > 1 (45) (where we use the convention that the objective takes value if z = Zc has no feasible solution.) Published as a conference paper at ICLR 2021 Proof. Without loss of generality, assume the columns of Z are permuted so that Z has the form: Z = zs , Z (46) where s { 1, 1}k is a vector with k [0, N] elements each taking value 1 or 1, and Z Rd (N k) contains all the columns of Z which are not equal to z. First we consider the k 1 case and note that the Lagrangian of (44) is given as: L(c, Λ) = c 1 + Λ, z Zc (47) Now minimizing L w.r.t. c gives min c c 1 Z Λ, c = δ( Z Λ 1) (48) which gives that the dual problem to (44) (with k 1) is given by max Λ Λ, z s.t. Λ Z 1 max Λ Λ, z s.t. | Λ, Zi | 1, i [1, N] max Λ Λ, z s.t. | Λ, z | 1, | Λ, Zi | 1, i [1, N k] (49) where the final equivalence is due to the special structure of Z in (46). 
5.5 PROOF OF THEOREM 4

We now present the proof of Theorem 4. Before proving the main result we first prove a simple lemma which will be helpful.

Lemma 1. For a given matrix $Z \in \mathbb{R}^{d \times N}$ and vector $z \in \mathbb{R}^d$ such that $\|Z_i\|_F = \tau, \ \forall i$, and $\|z\|_F = \tau$, let $k \in [0, N]$ be the number of columns of $Z$ which are equal to $z$ to within a sign-flip (i.e., $Z_i = \pm z$). Then, if $k \geq 1$ the following holds:

$$\min_c \ \left\{ \|c\|_1 \ \ \text{s.t.} \ \ z = Zc \right\} = 1 \tag{44}$$

and $c^\star_i \neq 0 \implies Z_i = \pm z$. Further, if $k = 0$ we also have

$$\min_c \ \left\{ \|c\|_1 \ \ \text{s.t.} \ \ z = Zc \right\} > 1 \tag{45}$$

(where we use the convention that the objective takes value $\infty$ if $z = Zc$ has no feasible solution).

Proof. Without loss of generality, assume the columns of $Z$ are permuted so that $Z$ has the form

$$Z = \left[ zs^\top, \ \bar{Z} \right], \tag{46}$$

where $s \in \{-1, 1\}^k$ is a vector with $k \in [0, N]$ elements each taking value $1$ or $-1$, and $\bar{Z} \in \mathbb{R}^{d \times (N-k)}$ contains all the columns of $Z$ which are not equal to $\pm z$. First we consider the $k \geq 1$ case and note that the Lagrangian of (44) is given as

$$L(c, \Lambda) = \|c\|_1 + \langle \Lambda, z - Zc \rangle. \tag{47}$$

Now minimizing $L$ w.r.t. $c$ gives

$$\min_c \ \|c\|_1 - \langle Z^\top \Lambda, c \rangle = -\delta\!\left(\|Z^\top \Lambda\|_\infty \leq 1\right), \tag{48}$$

which gives that the dual problem to (44) (with $k \geq 1$) is

$$\max_\Lambda \ \langle \Lambda, z \rangle \ \ \text{s.t.} \ \ \|\Lambda^\top Z\|_\infty \leq 1 \ \iff \ \max_\Lambda \ \langle \Lambda, z \rangle \ \ \text{s.t.} \ \ |\langle \Lambda, Z_i \rangle| \leq 1, \ \forall i \in [1, N] \ \iff \ \max_\Lambda \ \langle \Lambda, z \rangle \ \ \text{s.t.} \ \ |\langle \Lambda, z \rangle| \leq 1, \ |\langle \Lambda, \bar{Z}_i \rangle| \leq 1, \ \forall i \in [1, N-k], \tag{49}$$

where the final equivalence is due to the special structure of $Z$ in (46). Clearly from (49) and the fact that $\|z\|_F = \|Z_i\|_F = \tau, \ \forall i$, it is easily seen that an optimal $\Lambda^\star$ is any vector such that $\langle \Lambda^\star, z \rangle = 1$, so as a result the optimal solution to the problem in (44) has objective value 1. Further, note that when $k \geq 1$, then due to the triangle inequality and the fact that all the vectors in $Z$ have equal norm, we can only achieve $z = Zc$ with $\|c\|_1 = 1$ if all the non-zero entries of $c$ are in the first $k$ entries, and the sign of any non-zero element of $c$ must satisfy $\mathrm{sgn}(c^\star_i) = s_i, \ i \in [1, k]$.

To see that (45) is true, first note that an optimal solution to (49) with $k \geq 1$ is to choose $\Lambda^\star = z\tau^{-2}$, and that because $\bar{Z}_i \neq \pm z, \ \forall i$, we have $|\langle \bar{Z}_i, z\tau^{-2} \rangle| < \|\bar{Z}_i\|_F \|z\tau^{-2}\|_F = 1$. Further, note that the problem in (45) (with $k = 0$) has a dual problem equivalent to (49) with the $|\langle \Lambda, z \rangle| \leq 1$ constraint removed, which shows the inequality, as we can always take $\Lambda = \alpha z\tau^{-2}$ for some $\alpha > 1$ and remain dual feasible, giving a dual value (and hence optimal objective value) for (45) strictly greater than 1.

With this result we are now ready to prove Theorem 4.

Theorem 4. Optimal solutions to the problem

$$\min_{Z, C} \ \|C\|_1 \quad \text{s.t.} \quad \mathrm{diag}(C) = 0, \ \ Z = ZC, \ \ \|Z_i\|_F^2 = \tau \ \forall i \tag{50}$$

must have the property that for any column of $Z^\star$, $Z^\star_i$, there exists another column, $Z^\star_j$ ($i \neq j$), such that $Z^\star_i = s_{i,j} Z^\star_j$ where $s_{i,j} \in \{-1, 1\}$. Further, $\|C^\star_i\|_1 = 1 \ \forall i$ and $C^\star_{i,j} \neq 0 \implies Z^\star_i = \pm Z^\star_j$.

Proof. First, note that any $Z^\star$ which satisfies the conditions of the theorem achieves the optimal objective value with $\|C^\star_i\|_1 = 1, \ \forall i$, and $C^\star_{i,j} \neq 0 \implies Z^\star_i = \pm Z^\star_j$, directly from Lemma 1, since when we find an optimal encoding $C_i$ for column $Z^\star_i$ there must exist another column of $Z^\star$ which is equal to $Z^\star_i$ to within a sign-flip ($k \geq 1$ in Lemma 1). To show that this is optimal, we proceed by contradiction and assume we have a feasible pair of matrices $(\tilde{Z}, \tilde{C})$ which does not satisfy the conditions of the theorem but $\|\tilde{C}\|_1 \leq N = \|C^\star\|_1$. Because $\tilde{Z}$ does not satisfy the conditions of the theorem, at least one column of $\tilde{Z}$ must be distinct (i.e., $\exists i : \tilde{Z}_i \neq \pm \tilde{Z}_j, \ \forall j \neq i$). As a result, for any column $\tilde{Z}_i$ which is distinct we must have $\|\tilde{C}_i\|_1 > 1$ from Lemma 1 (the $k = 0$ case). If we let $I$ denote the set of indices of the distinct columns of $\tilde{Z}$ and $I^c$ the complement of $I$, we then have

$$\|\tilde{C}\|_1 = \sum_{i \in I} \|\tilde{C}_i\|_1 + \sum_{j \in I^c} \|\tilde{C}_j\|_1 \tag{51}$$
$$\geq \sum_{i \in I} \|\tilde{C}_i\|_1 + |I^c| \tag{52}$$
$$> |I| + |I^c| = N, \tag{53}$$

where the first inequality comes from noting that any $\tilde{Z}_j, \ j \in I^c$, corresponds to the $k \geq 1$ situation in Lemma 1 (so $\|\tilde{C}_j\|_1 \geq 1$), and the second inequality comes from noting that any $\tilde{Z}_i, \ i \in I$, corresponds to the $k = 0$ situation in Lemma 1 together with the fact that $|I| \geq 1$. We thus have the contradiction and the result is completed.
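Lemma 1 can likewise be checked on toy data. The sketch below (ours, assuming cvxpy is available) solves (44)/(45) for unit-norm columns, once with a sign-flipped duplicate of $z$ present in $Z$ and once with the duplicate removed.

```python
import cvxpy as cp
import numpy as np

def min_l1_code(z, Z):
    """min ||c||_1 s.t. z = Z c, as in (44)/(45)."""
    c = cp.Variable(Z.shape[1])
    prob = cp.Problem(cp.Minimize(cp.norm(c, 1)), [Z @ c == z])
    prob.solve()
    return prob.value

rng = np.random.default_rng(0)
d, N = 5, 12
Z = rng.standard_normal((d, N))
Z /= np.linalg.norm(Z, axis=0)          # all columns have norm tau = 1
z = -Z[:, 3].copy()                     # z equals a column up to a sign-flip (k = 1)

print(min_l1_code(z, Z))                    # = 1, matching (44)
print(min_l1_code(z, np.delete(Z, 3, 1)))   # > 1, matching (45) (k = 0)
```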
5.6 PROOF OF PROPOSITION 2

Proposition 2. Consider encoder and decoder networks of the form given in (14). Then, given any dataset $X \in \mathbb{R}^{d_x \times N}$ for which the linear operators in both the encoder and decoder can express identity on $X$, and any $\tau > 0$, there exist network parameters $(W_e, W_d)$ which satisfy the following:

1. Embedded points are arbitrarily close: $\|\Phi_E(X_i, W_e) - \Phi_E(X_j, W_e)\| \leq \epsilon, \ \forall (i, j)$, for any $\epsilon > 0$.
2. Embedded points have norm arbitrarily close to $\tau$: $|\, \|\Phi_E(X_i, W_e)\|_F - \tau \,| \leq \epsilon, \ \forall i$, for any $\epsilon > 0$.
3. Embedded points can be decoded exactly: $\Phi_D(\Phi_E(X_i, W_e), W_d) = X_i, \ \forall i$.

Proof. To begin, let $(\bar{W}^1_e, \bar{W}^2_e)$ and $(\bar{W}^1_d, \bar{W}^2_d)$ be choices of linear operator parameters such that $\bar{W}^2_e \bar{W}^1_e X = \bar{W}^2_d \bar{W}^1_d X = X$, which must always exist since the operators can express identity on $X$. Now, for an arbitrary $\alpha > 0$ let $b^1_e$ be any vector such that $\alpha \bar{W}^1_e X_i + b^1_e$ is non-negative for all $i$ (note that this is always possible by taking $b^1_e$ to be a sufficiently large non-negative vector). If we now choose $b^2_e = 0$ we then have, $\forall i$ and all $\beta > 0$:

$$(\beta \bar{W}^2_e)\left[\alpha \bar{W}^1_e X_i + b^1_e\right]_+ + b^2_e = (\beta \bar{W}^2_e)(\alpha \bar{W}^1_e X_i + b^1_e) = \alpha\beta X_i + \beta \bar{W}^2_e b^1_e, \tag{54}$$

where the ReLU function becomes an identity operator due to the fact that all entries are non-negative. Likewise, we can choose $b^1_d$ to be an arbitrary vector such that $(\beta^{-1} \bar{W}^1_d)(\alpha\beta X_i + \beta \bar{W}^2_e b^1_e) + b^1_d$ is non-negative for all $X_i$, so if we choose $b^2_d = -(\alpha^{-1} \bar{W}^2_d)\left[\bar{W}^1_d \bar{W}^2_e b^1_e + b^1_d\right]$ we then have:

$$(\alpha^{-1} \bar{W}^2_d)\left[(\beta^{-1} \bar{W}^1_d)(\alpha\beta X_i + \beta \bar{W}^2_e b^1_e) + b^1_d\right]_+ + b^2_d = (\alpha^{-1} \bar{W}^2_d)\left[(\beta^{-1} \bar{W}^1_d)(\alpha\beta X_i + \beta \bar{W}^2_e b^1_e) + b^1_d\right] + b^2_d = \bar{W}^2_d \bar{W}^1_d X_i + (\alpha^{-1} \bar{W}^2_d)\left[\bar{W}^1_d \bar{W}^2_e b^1_e + b^1_d\right] + b^2_d = X_i. \tag{55}$$

As a result we have constructed a set of encoder/decoder weights which satisfies the third condition of the statement. Further, the embedded points in this construction are of the form

$$Z_i = \alpha\beta X_i + \beta \bar{W}^2_e b^1_e, \tag{56}$$

so since we can form such a construction for arbitrary $\alpha > 0$ and $\beta > 0$, we can choose $\alpha$ arbitrarily small and $\beta = \tau \|\bar{W}^2_e b^1_e\|_F^{-1}$ to give that all the embedded points $Z_i$ are arbitrarily close to the point $\tau \bar{W}^2_e b^1_e \|\bar{W}^2_e b^1_e\|_F^{-1}$, which completes the result.
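The construction used in this proof can also be instantiated directly. The following numpy sketch (our own illustration) uses identity weight matrices, so that the "express identity" assumption holds trivially, and picks $\alpha$, $\beta$, and the biases as in the proof; the specific numerical values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, tau = 3, 8, 1.0
X = rng.standard_normal((d, N))
relu = lambda A: np.maximum(A, 0.0)

W1e = W2e = W1d = W2d = np.eye(d)      # linear operators that express identity on X
alpha = 1e-6                           # alpha -> 0 collapses the embedding
b1e = np.full(d, 10.0)                 # makes alpha*W1e X_i + b1e non-negative
beta = tau / np.linalg.norm(W2e @ b1e) # sets the embedded norms close to tau

# Encoder, Eq. (54)/(56): Z_i = alpha*beta*X_i + beta*W2e b1e  (with b2e = 0).
Z = beta * (W2e @ relu(alpha * (W1e @ X) + b1e[:, None]))

# Decoder biases chosen as in the proof so that decoding is exact.
b1d = np.full(d, 10.0)                 # keeps the decoder pre-activation non-negative
b2d = -(W2d @ (W1d @ (W2e @ b1e) + b1d)) / alpha
X_rec = (W2d @ relu((W1d @ Z) / beta + b1d[:, None])) / alpha + b2d[:, None]

print(np.abs(X_rec - X).max())                          # ~0 (up to floating point)
print(np.linalg.norm(Z, axis=0))                        # all approximately tau
print(np.abs(Z - Z.mean(axis=1, keepdims=True)).max())  # ~0: embedded points collapse
```

In exact arithmetic the decoding is exact for any $\alpha > 0$; in floating point, very small $\alpha$ eventually introduces cancellation error, which is why a moderate value is used here.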
6 ADDITIONAL RESULTS AND DETAILS

Here we give a few additional results which expand on the results given in the main paper, along with extra details regarding our experiments.

6.1 EXPERIMENTS WITH REAL DATA

In addition to the results we show in the main paper, we also present additional experimental results on real data. In particular, Figure 4 (Left) shows the magnitude of the embedded representation, Z, using the original code from Ji et al. (2017) to solve model (3) on the Yale B dataset (38 faces). Note that the optimization never reaches a stationary point, with the magnitude of the embedded representation continually decreasing (as predicted by Proposition 1). Additionally, if one looks at the singular values (normalized by the largest singular value) for the embedding of data points from one class (Right), then training the autoencoder without the F(Z, C) term results in a geometry that is closer to a linear subspace. Further, the raw data is actually closer to a linear subspace than the embedding after training the full SEDSC model (comparing the Red and Blue curves). Interestingly, the fact that the autoencoder features and the raw data are closer to a linear subspace than SEDSC is also consistent with the clustering performance we show in Table 1, where for the setting without the post-processing the autoencoder-only features achieve the best clustering results, followed by the raw data, followed by SEDSC.

Figure 4: Experiments on the Extended Yale B dataset. (Left) The norm of the embedded representation Z as training proceeds (Iterations/50). (Right) The singular values of the embedded representation of points from one class, normalized by the largest singular value, versus the index of the singular value. (Raw) The singular values of the raw data. (AE) The singular values of the embedded representation from an autoencoder trained without the F(Z, C) term. (SEDSC) The singular values of the embedded representation from the full SEDSC model (3).

Details of experiments with real data. We use the code provided by the authors of Ji et al. (2017) (https://github.com/panji1990/Deep-subspace-clustering-networks). The code implements the model in (3) with $\theta(C) = \frac{1}{2}\|C\|_F^2$ and $\ell(\cdot, \cdot)$ being the squared loss. The training procedure, as described in Ji et al. (2017), involves pre-training an autoencoder network without the F(Z, C) term. Such pre-trained models for each of the three datasets are also provided along with their code. Then, the encoder and decoder networks of SEDSC are initialized with the pre-trained networks and all model parameters are trained via the Adam optimizer. The implementation details of the three methods reported in Figure 2 and Table 1, namely Raw Data, Autoenc only and Full SEDSC, are as follows. For Raw Data, we solve the model in (1) with $\theta(C) = \frac{1}{2}\|C\|_F^2$ and $\lambda$ chosen from the set {0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500} as the value that gives the highest average clustering accuracy over 10 independent trials. For Autoenc only, we use the pre-trained encoder and decoder networks to initialize SEDSC, then freeze the encoder and decoder networks and use Adam to optimize the variable C only. The results for Full SEDSC are generated by running the code as it is. Finally, the same post-processing step is adopted for all three methods (i.e., we do not fine-tune it for Raw Data and Autoenc only).
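For reference, with $\theta(C) = \frac{1}{2}\|C\|_F^2$ the problem in (1) admits a closed-form solution in C, so the Raw Data baseline can be approximated in a few lines. The sketch below is our own illustration (it uses scikit-learn for the spectral clustering step and omits the post-processing of Ji et al. (2017)); it is not the exact code used for the reported numbers.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def ridge_self_expressive(Z, lam):
    """Minimize 1/2 ||ZC - Z||_F^2 + (lam/2) ||C||_F^2: C = (Z^T Z + lam I)^{-1} Z^T Z."""
    G = Z.T @ Z
    return np.linalg.solve(G + lam * np.eye(G.shape[0]), G)

def cluster_from_C(C, n_clusters):
    A = np.abs(C) + np.abs(C).T        # symmetric affinity built from the coefficients
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(A)

# Toy example: two 1-dimensional subspaces (lines) in R^10, 30 points each.
rng = np.random.default_rng(0)
U = rng.standard_normal((10, 2))
Z = np.hstack([np.outer(U[:, 0], rng.standard_normal(30)),
               np.outer(U[:, 1], rng.standard_normal(30))])
labels = cluster_from_C(ridge_self_expressive(Z, lam=10.0), n_clusters=2)
print(labels)   # the first and last 30 points should typically fall in different clusters
```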
6.2 EXPERIMENTS WITH SYNTHETIC DATA

In addition to the results shown in the main paper, we also conduct similar experiments with synthetic data for the Instance Normalization and the Batch/Channel Normalization schemes.

Details of experiments with synthetic data. For the Dataset and Batch/Channel Normalization experiments we directly add a normalization operator to the encoder network which normalizes the output of the encoder such that the entire dataset has unit Frobenius norm (Dataset Normalization) or each row of the embedded dataset has unit norm (Batch/Channel Normalization) before passing it to the self-expressive layer. For the Instance Normalization setting we add the regularization term proposed in Peng et al. (2017), with the form $\gamma_2 \sum_{i=1}^N (Z_i^\top Z_i - 1)^2$, to the objective in (3). We use regularization hyper-parameters $(\lambda, \gamma) = (10^{-4}, 2)$ for all cases and $\gamma_2 = 10^{-4}$ for the Instance Normalization case. We first pre-train the autoencoder without the F(Z, C) term (i.e., $\gamma = 0$ and C fixed at I), and we initialize the C matrix to be the solution to (1) using the Z embedding from the pre-trained autoencoder. Following this we perform standard proximal gradient descent (Parikh & Boyd, 2013) on the full dataset, where we take a gradient descent step on all of the model parameters for the full objective excluding the $\theta(C)$ term, and then solve the proximal operator for $\theta(C)$ (a schematic sketch of this C-update is given at the end of this section).

Figure 5 shows the results of this experiment, where we plot the original dataset along with the reconstructed output of the autoencoder (Left), the embedded representation after pre-training the autoencoder (Center Left, Blue) and after fully training the model (Center Left, Red), the absolute value of the final C matrix (Center Right), and the same plot with a logarithmic color scale to better visualize small entries (Right).

Figure 5: Results for the synthetic dataset for three normalization schemes (along the rows): Instance Normalization (top); Dataset Normalization (center); Batch/Channel Normalization (bottom). The columns are the same as described in the main paper. (Left) Original data points (Blue) and the data points at the output of the autoencoder when the full model (3) is used (Red). (Center Left) Data representation in the embedded domain when just the autoencoder is trained without the F(Z, C) term (Blue) and when the full model is used (Red). (Center Right) The absolute value of the recovered C encoding matrix when trained with the full model. (Right) Same plot as the previous column but with a logarithmic color scale to visualize small entries.

From Figure 5 one can see that our theoretical predictions are largely confirmed experimentally. Namely, first examining the Batch/Channel and Dataset Normalization experiments, one sees that when the full SEDSC model is trained the embedded representation is very close to that predicted by Theorem 2, with almost all of the embedded points (Center Left, Red points) being near the origin with the exception of two points, which are co-linear with each other. Likewise, the C matrix is dominated by two non-zero entries, with the remaining non-zero entries only visible on the logarithmic color scale. Further, the Instance Normalization experiment also largely confirms our theoretical predictions, where all the points are co-linear and largely identical copies of a single point.
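Finally, the C-update of the proximal gradient scheme described above can be sketched as follows (our own illustration). For concreteness we take $\theta(C) = \lambda\|C\|_1$, whose proximal operator is entrywise soft-thresholding; this choice of $\theta$ and the step size are purely illustrative, and the simultaneous gradient updates of the network weights (which only enter the smooth part of the objective) are omitted.

```python
import numpy as np

def soft_threshold(V, t):
    """prox_{t ||.||_1}(V): entrywise soft-thresholding."""
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def prox_grad_step_C(Z, C, lam, step):
    """One proximal gradient step on C for 1/2 ||ZC - Z||_F^2 + lam ||C||_1."""
    grad = Z.T @ (Z @ C - Z)            # gradient of the smooth self-expressive term
    return soft_threshold(C - step * grad, step * lam)

rng = np.random.default_rng(0)
Z = rng.standard_normal((2, 50))        # a fixed embedding, for illustration only
C = np.zeros((50, 50))
step = 1.0 / (np.linalg.norm(Z, 2) ** 2)   # 1/L step size for the smooth term
for _ in range(200):
    C = prox_grad_step_C(Z, C, lam=0.1, step=step)
print(np.abs(C).max())
```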