Published as a conference paper at ICLR 2021

SPARSE QUANTIZED SPECTRAL CLUSTERING

Zhenyu Liao, ICSI and Department of Statistics, University of California, Berkeley, USA. zhenyu.liao@berkeley.edu
Romain Couillet, G-STATS Data Science Chair, GIPSA-lab, University Grenoble-Alpes, France. romain.couillet@gipsa-lab.grenoble-inp.fr
Michael W. Mahoney, ICSI and Department of Statistics, University of California, Berkeley, USA. mmahoney@stat.berkeley.edu

ABSTRACT

Given a large data matrix, sparsifying, quantizing, and/or performing other entry-wise nonlinear operations can have numerous benefits, ranging from speeding up iterative algorithms for core numerical linear algebra problems to providing nonlinear filters to design state-of-the-art neural network models. Here, we exploit tools from random matrix theory to make precise statements about how the eigenspectrum of a matrix changes under such nonlinear transformations. In particular, we show that very little change occurs in the informative eigenstructure, even under drastic sparsification/quantization, and consequently that very little downstream performance loss occurs when working with very aggressively sparsified or quantized spectral clustering problems. We illustrate how these results depend on the nonlinearity, we characterize a phase transition beyond which spectral clustering becomes possible, and we show when such nonlinear transformations can introduce spurious non-informative eigenvectors.

1 INTRODUCTION

Sparsifying, quantizing, and/or performing other entry-wise nonlinear operations on large matrices can have many benefits. Historically, this has been used to develop iterative algorithms for core numerical linear algebra problems (Achlioptas & McSherry, 2007; Drineas & Zouzias, 2011). More recently, this has been used to design better neural network models (Srivastava et al., 2014; Dong et al., 2019; Shen et al., 2020). A concrete example, amenable to theoretical analysis and ubiquitous in practice, is provided by spectral clustering, which can be solved by retrieving the dominant eigenvectors of X^T X, for X = [x_1, ..., x_n] ∈ R^{p×n} a large data matrix (Von Luxburg, 2007). When the amount of data n is large, the Gram kernel matrix X^T X can be enormous, impractical even to form, and leads to computationally unaffordable algorithms. For instance, the Lanczos iteration, which operates through repeated matrix-vector multiplication, suffers from an O(n²) complexity (Golub & Loan, 2013) and quickly becomes burdensome. One approach to overcoming this limitation is simple subsampling: dividing X into subsamples of size εn, for some ε ∈ (0, 1), on which one performs parallel computation, and then recombining. This leads to a computational gain, but at the cost of degraded performance, since each data point x_i loses the cumulative effect of being compared to the whole dataset. An alternative cost-reduction procedure consists in uniformly randomly zeroing-out entries from the whole matrix X^T X, resulting in a sparse matrix with only an ε fraction of nonzero entries. For spectral clustering, by focusing on the eigenspectrum of the zeroed-out matrix, Zarrouk et al. (2020) showed that the same computational gain can be achieved at the cost of a much less degraded performance: for n/p rather large, almost no degradation is observed down to very small values of ε (e.g., ε ≈ 2% for n/p ≈ 100).
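For concreteness, the uniform zeroing-out procedure of Zarrouk et al. (2020) described above can be sketched in a few lines of NumPy. This is an illustration of ours (the function name and dense-matrix setup are our own choices), not code from the paper or from that reference.

```python
import numpy as np

def uniform_sparsify_gram(X, eps, seed=None):
    """Keep each off-diagonal entry of X^T X independently with probability eps,
    via a symmetric Bernoulli mask (uniform zeroing-out)."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    G = X.T @ X
    upper = np.triu(rng.random((n, n)) < eps, k=1)  # Bernoulli(eps) above the diagonal
    mask = upper | upper.T                          # symmetrize; diagonal stays zero
    return G * mask

X = np.random.randn(100, 2000)                      # p = 100, n = 2000, so n/p = 20
K_sparse = uniform_sparsify_gram(X, eps=0.05, seed=0)
print(np.mean(K_sparse != 0))                       # roughly 0.05 of the entries survive
```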
Previous efforts showed that it is often advantageous to perform sparsification/quantization in a non-uniform manner, rather than uniformly (Achlioptas & McSherry, 2007; Drineas & Zouzias, 2011). The focus there, however, is often on (non-asymptotic bounds of) the approximation error between the original and the sparsified/quantized matrices. This, however, does not provide direct access to the actual performance for spectral clustering or other downstream tasks of interest, e.g., since the top eigenvectors are known to exhibit a phase transition phenomenon (Baik et al., 2005; Saade et al., 2014). That is, they can behave very differently from those of the original matrix, even if the matrix after treatment is close in operator or Frobenius norm to the original matrix.

Here, we focus on a precise characterization of the eigenstructure of X^T X after an entry-wise nonlinear transformation such as sparsification or quantization, in the large n, p regime, by performing simultaneously non-uniform sparsification and/or quantization (down to binarization). We consider a simple mixture data model with x ∼ N(±µ, I_p) and let K ≡ f(X^T X/√p)/√p, where f is an entry-wise thresholding/quantization operator (thereby zeroing-out/quantizing entries of X^T X); and we prove that this leads to significantly improved spectral clustering performance at the same computational cost as uniform sparsification, and at a much reduced storage cost thanks to quantization. The only (non-negligible) additional cost arises from the extra need for evaluating each entry of X^T X. Our main technical contribution (of independent interest, e.g., for those interested in entry-wise nonlinear transformations of feature matrices) consists in using random matrix theory (RMT) to derive the large n, p asymptotics of the eigenspectrum of K = f(X^T X/√p)/√p for a wide range of functions f, and then comparing to previously-established results for uniform subsampling and sparsification in (Zarrouk et al., 2020). Experiments on real-world data further corroborate our findings. Our main contributions are the following.

1. We derive the limiting eigenvalue distribution of K as n, p → ∞ (Theorem 1), and we identify:
(a) the existence of non-informative and isolated eigenvectors of K for some f (Corollary 1);
(b) in the absence of such eigenvectors, a phase transition in the dominant eigenvalue-eigenvector pair (λ̂, v̂) (Corollary 2): if the signal-to-noise ratio (SNR) ‖µ‖² of the data exceeds a certain threshold γ, then λ̂ becomes isolated from the main bulk (Von Luxburg, 2007; Joseph & Yu, 2016; Baik et al., 2005) and v̂ contains data class-structure information exploitable for clustering; if not, then v̂ contains only noise and is asymptotically orthogonal to the class-label vector.

2. Letting f be a sparsification, quantization, or binarization operator, we show:
(a) a selective non-uniform sparsification operator, such that X^T X can be drastically sparsified with very little degradation in clustering performance (Proposition 1 and Section 4.2), which significantly outperforms the random uniform sparsification scheme in (Zarrouk et al., 2020);
(b) for a given matrix storage budget (i.e., a fixed number of bits to store K), an optimal design of the quantization/binarization operators (Proposition 2 and Section 4.3), the performance of which is compared against the original X^T X and its sparsified but not quantized version.
For spectral clustering, the surprisingly small performance drop, accompanied by a huge reduction in computational cost, contributes to improved algorithms for large-scale problems. More generally, our proposed analysis sheds light on the effect of entry-wise nonlinear transformations on the eigenspectra of data/feature matrices. Thus, looking forward (and perhaps more importantly, given the use of nonlinear transformations in designing modern neural network models, as well as the recent interest in applying RMT to neural network analyses (Dobriban et al., 2018; Li & Nguyen, 2018; Seddik et al., 2018; Jacot et al., 2019; Liu & Dobriban, 2019)), we expect that our analysis opens the door to improved analysis of computationally efficient methods for large dimensional machine learning and neural network models more generally.

2 SYSTEM MODEL AND PRELIMINARIES

Basic setup. Let x_1, ..., x_n ∈ R^p be independently drawn (not necessarily uniformly) from a two-class mixture of C_1 and C_2 with

C_1 : x_i = −µ + z_i,   C_2 : x_i = +µ + z_i,   (1)

with z_i ∈ R^p having i.i.d. zero-mean, unit-variance, κ-kurtosis, sub-exponential entries, µ ∈ R^p such that ‖µ‖² → ρ ≥ 0 as p → ∞, and v ∈ {±1}^n with [v]_i = −1 for x_i ∈ C_1 and +1 for x_i ∈ C_2.¹ The data matrix X = [x_1, ..., x_n] ∈ R^{p×n} can be compactly written as

X = Z + µv^T

for Z = [z_1, ..., z_n], so that ‖v‖ = √n and both Z and µv^T have operator norm of order O(√p) in the n ∼ p regime.
(¹ The norm ‖·‖ is the Euclidean norm for vectors and the operator norm for matrices.)

In this setting, the Gram (or linear kernel) matrix X^T X achieves optimal clustering performance on the mixture model (1); see Remark 1 below. However, it is a dense n × n matrix, which quickly becomes expensive to store or to perform computation on as n increases. Thus, we consider instead the following entry-wise nonlinear transformation of X^T X:

K = { δ_{i≠j} f(x_i^T x_j/√p)/√p }_{i,j=1}^n   (2)

for f : R → R satisfying some regularity conditions (see Assumption 1 below), where δ_{i≠j} equals 1 for i ≠ j and 0 otherwise. The diagonal elements f(x_i^T x_i/√p) (i) bring no additional information for clustering and (ii) do not scale properly for p large (x_i^T x_i/√p = O(√p) instead of O(1)). Thus, following (El Karoui, 2010; Cheng & Singer, 2013), they are discarded. Most of our technical results hold for rather generic functions f, e.g., those of interest beyond sparse quantized spectral clustering, but we are particularly interested in f with nontrivial numerical properties (e.g., promoting quantization and sparsity):

Sparsification: f_1(t) = t · 1_{|t| > √2 s},   (3)
Quantization: f_2(t) = 2^{2−M} (⌊t · 2^{M−2}/(√2 s)⌋ + 1/2) · 1_{|t| ≤ √2 s} + sign(t) · 1_{|t| > √2 s},   (4)
Binarization: f_3(t) = sign(t) · 1_{|t| > √2 s}.   (5)

Here, s ≥ 0 is some truncation threshold, and M ≥ 2 is a number of information bits.² The visual representations of these f's are given in Figure 1-(left). For f_3, taking s → 0 leads to the sign function sign(t). In terms of storage, the quantization f_2 consumes 2^{M−2} + 1 bits per nonzero entry, while the binarization f_3 takes values in {±1, 0} and thus consumes 1 bit per nonzero entry.

Random matrix theory. To provide a precise description of the eigenspectrum of K for the nonlinear f of interest, to be used in the context of spectral clustering, we will provide a large dimensional characterization for the resolvent of K, defined for z ∈ C \ R_+ as

Q(z) ≡ (K − z I_n)^{−1}.   (6)

This matrix, which plays a central role in RMT analysis (Bai & Silverstein, 2010), will be used in two primary ways.
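Before detailing these two uses of Q(z), here is a short NumPy sketch (ours, not from the paper) of how the matrix K in (2) can be formed with the operators (3)-(5); the staircase quantizer follows our reading of (4) and should be treated as an assumption.

```python
import numpy as np

def kernel_K(X, f):
    """K = f(x_i^T x_j / sqrt(p)) / sqrt(p) for i != j, with zero diagonal, cf. (2)."""
    p = X.shape[0]
    G = X.T @ X / np.sqrt(p)
    K = f(G) / np.sqrt(p)
    np.fill_diagonal(K, 0.0)
    return K

def f1(t, s):                        # sparsification (3)
    return t * (np.abs(t) > np.sqrt(2) * s)

def f3(t, s):                        # binarization (5)
    return np.sign(t) * (np.abs(t) > np.sqrt(2) * s)

def f2(t, s, M=2):                   # quantization (4), as we read it (assumption)
    th = np.sqrt(2) * s
    levels = 2.0 ** (2 - M) * (np.floor(t * 2.0 ** (M - 2) / th) + 0.5)
    return np.where(np.abs(t) <= th, levels, np.sign(t))

X = np.random.randn(256, 1024)
K = kernel_K(X, lambda t: f1(t, s=1.0))
print(np.mean(K != 0))               # about erfc(1.0) ~ 0.157 of the entries are kept
```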
First, the normalized trace (1/n) tr Q(z) is the so-called Stieltjes transform of the eigenvalue distribution of K, from which the eigenvalue distribution can be recovered, and which can further be used to characterize the phase transition beyond which spectral clustering becomes theoretically possible (Corollary 2). Second, for (λ̂, v̂) an isolated eigenvalue-eigenvector pair of K and a ∈ R^n a deterministic vector, by Cauchy's integral formula, the angle between v̂ and a is given by

|v̂^T a|² = −(1/(2πi)) ∮_{Γ(λ̂)} a^T Q(z) a dz,

where Γ(λ̂) is a positively oriented contour surrounding λ̂ only. Letting a = v, this will be exploited to characterize the spectral clustering error rate (Proposition 1).

From a technical perspective, unlike linear random matrix models, K (and thus Q(z)) involves nonlinear dependence between its entries. To break this difficulty, following the ideas of (Cheng & Singer, 2013), we exploit the fact that, by the central limit theorem, z_i^T z_j/√p → N(0, 1) in distribution as p → ∞. As such, up to µv^T, which is treated separately with a perturbation argument, the entries of K asymptotically behave like a family of dependent standard Gaussian variables to which f is applied. Expanding f in a series of orthogonal polynomials with respect to the Gaussian measure allows for unwrapping this dependence. A few words on the theory of orthogonal polynomials (Andrews et al., 1999) are thus convenient to pursue our analysis.

Orthogonal polynomial framework. For a probability measure µ, let {P_l(x), l ≥ 0} be the orthonormal polynomials with respect to ⟨f, g⟩ ≡ ∫_R f g dµ, obtained by the Gram-Schmidt procedure on the monomials {1, x, x², ...}, such that P_0(x) = 1, P_l is of degree l, and ⟨P_{l1}, P_{l2}⟩ = δ_{l1 l2}. By the Riesz-Fischer theorem (Rudin, 1964, Theorem 11.43), for any function f ∈ L²(µ), the set of square-integrable functions with respect to ⟨·, ·⟩, one can formally expand f as

f(x) ∼ Σ_{l=0}^∞ a_l P_l(x),   a_l = ∫ f(x) P_l(x) µ(dx),   (7)

where f ∼ Σ_l a_l P_l indicates that ‖f − Σ_{l=0}^L a_l P_l‖ → 0 as L → ∞, with ‖f‖² = ⟨f, f⟩.
(² Also, here, ⌊·⌋ denotes the floor function, while erf(x) = (2/√π) ∫_0^x e^{−t²} dt and erfc(x) = 1 − erf(x) denote the error function and complementary error function, respectively.)

Figure 1: Visual representations of the different functions f defined in (3)-(5) (left) and their corresponding parameters a_1 and ν (right) (a_2 = 0 for each of these f's), as defined in Assumption 1. In particular, a_1 = ν = erfc(s) + 2s e^{−s²}/√π for the sparse f_1, and a_1 = √(2/π) e^{−s²}, ν = erfc(s) for the binary f_3; the corresponding expressions for the quantized f_2 involve finite sums of erf and Gaussian terms over its quantization levels.

To investigate the asymptotic behavior of K as n, p → ∞, we make the following assumption on f.

Assumption 1 (Square-integrable in Gauss space). Let ξ_p = z_i^T z_j/√p and let P_{l,p}(x) be the orthonormal polynomials with respect to the measure µ_p of ξ_p. For f ∈ L²(µ_p), f(x) ∼ Σ_{l=0}^∞ a_{l,p} P_{l,p}(x) with a_{l,p} as in (7), such that (i) Σ_{l=0}^∞ a_{l,p} P_{l,p}(x) converges in L²(µ_p) to f(x) uniformly over large p; and (ii) as p → ∞, Σ_{l=1}^∞ a²_{l,p} → ν and, for l = 0, 1, 2, a_{l,p} → a_l, with a_0 = 0.

Since ξ_p → N(0, 1) in distribution, the parameters a_0, a_1, a_2 and ν are simply moments of the standard Gaussian measure involving f. More precisely, for ξ ∼ N(0, 1),

a_0 = E[f(ξ)],   a_1 = E[ξ f(ξ)],   √2 a_2 = E[ξ² f(ξ)] − a_0,   ν = E[f²(ξ)] − a_0² ≥ a_1² + a_2².   (8)

Imposing the condition a_0 = 0 simply discards the non-informative rank-one matrix a_0 1_n 1_n^T/√p from K.
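As a quick numerical check of (8) (our own sketch, using Gauss-Hermite quadrature for the Gaussian expectations), one can evaluate (a_0, a_1, a_2, ν) for any candidate f; for the binary f_3 this should reproduce a_1 = √(2/π) e^{−s²} and ν = erfc(s) reported in Figure 1-(right).

```python
import numpy as np

def gaussian_moments(f, n_quad=200):
    """Estimate (a0, a1, a2, nu) of (8) for xi ~ N(0,1) via Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_quad)   # physicists' Hermite nodes/weights
    xi = np.sqrt(2.0) * x                            # E[g(xi)] = sum_k w_k g(sqrt(2) x_k)/sqrt(pi)
    weights = w / np.sqrt(np.pi)
    fx = f(xi)
    a0 = np.sum(weights * fx)
    a1 = np.sum(weights * xi * fx)
    a2 = (np.sum(weights * xi**2 * fx) - a0) / np.sqrt(2.0)
    nu = np.sum(weights * fx**2) - a0**2
    return a0, a1, a2, nu

# binary f3 with threshold sqrt(2)*s: expect a1 ~ sqrt(2/pi) e^{-s^2}, nu ~ erfc(s)
s = 0.5
print(gaussian_moments(lambda t: np.sign(t) * (np.abs(t) > np.sqrt(2) * s)))
# approximately (0, 0.621, 0, 0.480)
```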
The three parameters (a_1, a_2, ν) are of crucial significance in determining the spectral behavior of K (see Theorem 1 below). The sparse f_1, quantized f_2, and binary f_3 of our primary interest all satisfy Assumption 1 (counterexamples exist though, for example f(x) = e^{P(x)} for a polynomial P(x) of degree greater than two), with their corresponding a_2 = 0 (as we shall see in Corollary 1 below, this is important for the spectral clustering use case) and a_1, ν given in Figure 1-(right). With these preliminary comments, we are in position to present our main technical results.

3 MAIN TECHNICAL RESULTS

Our main technical result, from which our performance-complexity trade-off analysis will follow, provides an asymptotic deterministic equivalent Q̄(z) for the random resolvent Q defined in (6). (A deterministic equivalent is a deterministic matrix Q̄(z) such that, for any deterministic sequence of matrices A_n ∈ R^{n×n} and vectors a_n, b_n ∈ R^n of bounded (spectral and Euclidean) norms, (1/n) tr A_n(Q(z) − Q̄(z)) → 0 and a_n^T (Q(z) − Q̄(z)) b_n → 0 almost surely as n, p → ∞. We denote this relation Q(z) ↔ Q̄(z).) This is given in the following theorem. The proof is in Appendix A.1.

Theorem 1 (Deterministic equivalent). Let n, p → ∞ with p/n → c ∈ (0, ∞), let Q(z) be defined in (6), and let ℑ[·] denote the imaginary part of a complex number. Then, under Assumption 1,

Q(z) ↔ Q̄(z) = m(z) I_n + V Λ(z) V^T,   √n V = [v, 1_n],

Λ(z) = [ Θ(z)m²(z),                      (v^T 1_n/n)·Θ(z)Ω(z)m(z) ;
         (v^T 1_n/n)·Θ(z)Ω(z)m(z),       Ω(z) + ((v^T 1_n)²/n²)·Θ(z)Ω²(z) ],

with

Ω(z) = a_2²(κ−1)m³(z) / (2c² − a_2²(κ−1)m²(z)),
Θ(z) = a_1‖µ‖² / ( c + a_1 m(z)(1+‖µ‖²) + a_1‖µ‖² Ω(z)(v^T 1_n)²/n² ),

for κ the kurtosis of the entries of Z, and m(z) the unique solution, such that ℑ[m(z)]·ℑ[z] ≥ 0, to

−1/m(z) = z + a_1² m(z)/(c + a_1 m(z)) + (ν − a_1²) m(z)/c.   (9)

The major implication of Theorem 1 is that the spectral behavior of the matrix K (e.g., its eigenvalue distribution and the isolated eigenpairs discussed after (6)) depends on the nonlinear f only via the three parameters a_1, a_2, ν defined in (8). More concretely, the empirical spectral measure ω_n = (1/n) Σ_{i=1}^n δ_{λ_i(K)} of K has a deterministic limit ω as n, p → ∞, uniquely defined through its Stieltjes transform m(z) ≡ ∫ (t − z)^{−1} ω(dt) as the solution to (9).³ This limiting measure ω does not depend on the law of the independent entries of Z, so long as they are sub-exponential, with zero mean and unit variance. In particular, taking a_1 = 0 in (9) gives the (rescaled) Wigner semicircle law (Wigner, 1955), and taking ν = a_1² (i.e., a_l = 0 for l ≥ 2) gives the Marcenko-Pastur law (Marcenko & Pastur, 1967). See Remark 2 in Appendix A.2 for more discussion.
(³ For m(z) the Stieltjes transform of a measure ω, ω can be obtained via ω([a, b]) = lim_{y↓0} (1/π) ∫_a^b ℑ[m(x + iy)] dx for all continuity points a < b of ω.)

Figure 2: (Left) Histogram of the eigenvalues of K (blue) versus the limiting spectrum and spikes (red). (Right) Eigenvectors associated with the largest (top) and second largest (bottom) eigenvalues of K (blue), versus the rescaled class label √α·v/√n (red, from Corollary 2). Here f(t) = sin(t) − 3cos(t) + 3/√e, p = 800, n = 6 400, µ = 1.1·1_p/√p, v = [−1_{n/2}; 1_{n/2}], on Student-t data with κ = 5.

Spurious non-informative spikes. Going beyond just the limiting spectral measure of K, Theorem 1 also shows that isolated eigenvalues (often referred to as spikes) may be found in the eigenspectrum of K at very specific locations.
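Before turning to these spikes, note that the bulk itself is easy to evaluate numerically: the following sketch (ours; the damped fixed-point scheme is one simple choice, not the paper's) solves (9) for m(z) slightly off the real axis and recovers the limiting density by the Stieltjes inversion of footnote 3.

```python
import numpy as np

def stieltjes_m(z, a1, nu, c, n_iter=2000, damping=0.5):
    """Solve (9) for m(z), Im z > 0, by damped fixed-point iteration (a sketch)."""
    m = -1.0 / z
    for _ in range(n_iter):
        m_new = -1.0 / (z + a1**2 * m / (c + a1 * m) + (nu - a1**2) * m / c)
        m = damping * m_new + (1 - damping) * m
    return m

def limiting_density(xs, a1, nu, c, eta=1e-3):
    """Density of the limiting measure omega via Stieltjes inversion (footnote 3)."""
    return np.array([stieltjes_m(x + 1j * eta, a1, nu, c).imag / np.pi for x in xs])

# linear f(t) = t, i.e., a1 = nu = 1, gives a (rescaled) Marcenko-Pastur-type bulk
xs = np.linspace(-3, 3, 400)
density = limiting_density(xs, a1=1.0, nu=1.0, c=2.0)
```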
Such spikes and their associated eigenvectors are typically thought to provide information on the data class-structure (and they do when f is linear). However, when f is nonlinear, this is not always the case: it is possible that not all these spikes are useful, in the sense of being informative about the class structure. This is made precise in the following, where we assume µ = 0, i.e., that there is no class structure. The proof is in Appendix A.2.

Corollary 1 (Non-informative spikes). Assume µ = 0, κ ≠ 1 and a_2 ≠ 0, so that, in the notations of Theorem 1, Θ(z) = 0 and Q̄(z) = m(z) I_n + Ω(z)·(1/n) 1_n 1_n^T for Ω(z) defined in Theorem 1. Then, if x_± ≡ ±(1/a_2)√(2/(κ−1)) satisfies a_1 x_± ≠ −1 and (ν − a_1²)x_±² + a_1²x_±²/(1 + a_1x_±)² < 1/c, two eigenvalues of K converge to z_± = −1/(c x_±) − a_1²x_±/(1 + a_1x_±) − (ν − a_1²)x_±, away from the support of ω. If instead x_± = −1/a_1 for a_1 ≠ 0, then a single eigenvalue of K isolates from the support of ω.

Corollary 1 (combined with Theorem 1) says that, while the limiting spectrum ω is universal with respect to the distribution of the entries of Z, the existence and position of the spurious non-informative spikes are not universal and depend on the kurtosis κ of the distribution. (See Figure 12 in Appendix A.2 for an example.) This is far from a mathematical curiosity. Given how nonlinear transformations are used in machine learning practice, being aware of the existence of spurious non-informative spikes in the eigenspectrum of K (well separated from the bulk of eigenvalues, but corresponding to random noise instead of signal), as a function of the properties of the nonlinear f, is of fundamental importance for downstream tasks. For example, for spectral clustering, their associated eigenvectors may be mistaken for informative ones by spectral clustering algorithms, even in the complete absence of classes. This is confirmed by Figure 2, where two isolated eigenvalues (on the right side of the bulk) are observed, with only the second largest one corresponding to an eigenvector that contains class-label information. For further discussion, see Appendix A.2 and A.3.

Informative spikes. From Theorem 1, we see that the eigenspectrum of K depends on f only via a_1, a_2, and ν. In particular, a_1 and ν determine the limiting spectral measure ω. From Corollary 1, we see that a_2 contributes by (i) introducing (at most two) non-informative spikes and (ii) reducing the ratio a_1/ν (since ν = Σ_{l≥1} a_l²), thereby necessarily enlarging the support of ω (see Remark 2 in Appendix A.2 for more details). Taking a_2 = 0 thus reduces the length of the support of ω and, as such, maximizes the chance of the appearance of an informative spike (the eigenvector of which is positively correlated with the label vector v). See Remark 1 below for a more precise statement. In particular, by taking a_2 = 0 and a_1 ≠ 0, we obtain only informative spikes, and we can characterize a phase transition depending on the SNR ρ. The proof of the following is in Appendix A.3.

Corollary 2 (Informative spike and a phase transition). For a_1 > 0 and a_2 = 0, let

F(x) = x⁴ + 2x³ + (1 − cν/a_1²)x² − 2cx − c,   G(x) = (a_1/c)(1 + x) + a_1/x + (ν − a_1²)/(a_1(1 + x)),   (10)

and let γ be the largest real solution to F(γ) = 0. Then, under Assumption 1, we have √c ≤ γ ≤ √(cν)/a_1, and the largest eigenpair (λ̂, v̂) of K satisfies

λ̂ → λ = G(ρ) for ρ > γ, and λ̂ → G(γ) for ρ ≤ γ;   |v̂^T v|²/n → α = F(ρ)/(ρ(1+ρ)³) for ρ > γ, and 0 for ρ ≤ γ;   (11)

almost surely as n, p → ∞, p/n → c ∈ (0, ∞), where we recall ρ ≡ lim_p ‖µ‖² and ‖v‖ = √n.
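Corollary 2 is fully explicit, so the phase transition point γ, the limiting spike location, and the alignment α can be evaluated directly. The sketch below (ours) does so and, for the linear case a_1 = ν = 1, recovers γ = √c.

```python
import numpy as np

def phase_transition(a1, nu, c):
    """Largest real root gamma of F in (10); clustering is possible iff rho > gamma."""
    coeffs = [1.0, 2.0, 1.0 - c * nu / a1**2, -2.0 * c, -c]   # F(x) coefficients
    roots = np.roots(coeffs)
    return max(r.real for r in roots if abs(r.imag) < 1e-8)

def spike_and_alignment(rho, a1, nu, c):
    """Limiting top eigenvalue and alignment |v_hat^T v|^2 / n from (11)."""
    F = lambda x: x**4 + 2 * x**3 + (1 - c * nu / a1**2) * x**2 - 2 * c * x - c
    G = lambda x: a1 * (1 + x) / c + a1 / x + (nu - a1**2) / (a1 * (1 + x))
    gamma = phase_transition(a1, nu, c)
    if rho > gamma:
        return G(rho), F(rho) / (rho * (1 + rho) ** 3)
    return G(gamma), 0.0

print(phase_transition(1.0, 1.0, 2.0))        # ~ 1.414 = sqrt(c), the linear-kernel threshold
print(spike_and_alignment(4.0, 1.0, 1.0, 2.0))  # spike location and alignment above the transition
```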
Without loss of generality, we discuss only the case a_1 > 0.⁴ For a_1 > 0, both F(x) and G(x) are increasing functions on x ∈ (γ, ∞). Then, as expected, both λ and α increase with the SNR ρ. Moreover, the phase transition point γ is an increasing function of c and of ν/a_1². As such, the optimal f in the sense of the smallest phase transition point is the linear function f(t) = t, with a_1 = ν = 1 and γ = √c. This recovers the classical random matrix result of (Baik et al., 2005).
(⁴ Otherwise we could consider −K instead of K, and the largest eigenvalue becomes the smallest one.)

4 CLUSTERING PERFORMANCE OF SPARSE AND QUANTIZED OPERATORS

We start, in Section 4.1, by providing a sharp asymptotic characterization of the clustering error rate and demonstrating the optimality of the linear function under (1). Then, in Section 4.2, we discuss the advantageous performance of the proposed selective sparsification approach (with f_1) versus the uniform sparsification or subsampling approaches studied previously in (Zarrouk et al., 2020). Finally, in Section 4.3, we derive the optimal truncation threshold s_opt, for both the quantized f_2 and the binary f_3, so as to achieve an optimal performance-complexity trade-off for a given storage budget.

4.1 PERFORMANCE OF SPECTRAL CLUSTERING

The technical results in Section 3 provide conditions under which K admits an informative eigenvector v̂ that is non-trivially correlated with the class label vector v (and thus exploitable for spectral clustering) in the n, p → ∞ limit. Since the exact (limiting) alignment |v^T v̂| is known, along with an additional argument on the normal fluctuations of v̂, we have the following result for the performance of the spectral clustering method. The proof is in Appendix A.4.

Proposition 1 (Performance of spectral clustering). Let Assumption 1 hold, let a_1 > 0 and a_2 = 0, and let Ĉ_i = sign([v̂]_i) be the estimate of the underlying class C_i of x_i, with the convention v̂^T v ≥ 0, for v̂ the top eigenvector of K. Then, the (average) misclassification rate satisfies

(1/n) Σ_{i=1}^n δ_{Ĉ_i ≠ C_i} → (1/2) erfc( √(α/(2 − 2α)) ),

almost surely, as n, p → ∞, for α ∈ [0, 1) defined in (11).

Recall from Corollary 2 that, for a_2 = 0, the nonlinear function f (e.g., f_1, f_2, f_3) acts on the (statistical behavior of the) isolated eigenvector v̂, and thus on the spectral clustering performance per Proposition 1, only through the ratio ν/a_1². It thus suffices to evaluate the ratio ν/a_1² of different f and compare, for instance, to that of the linear f(t) = t corresponding to the original X^T X matrix. Despite being asymptotic results valid in the n, p → ∞ limit, the results of Proposition 1 and Corollary 2 closely match empirical results for moderately large n, p only in the hundreds. This is illustrated in Figure 3. Proposition 1 further confirms that the misclassification rate, being a decreasing function of α, increases with ν/a_1² (for c and ρ fixed). This leads to the following remark.

Remark 1 (Optimality of the linear function). Since both the phase transition point γ and the misclassification rate grow with ν/a_1², the linear function f(t) = t, with the minimal ν/a_1² = 1, is optimal in the sense of (i) achieving the smallest SNR ρ or the largest ratio c = lim p/n (i.e., the fewest samples n) necessary to observe an informative (isolated) eigenvector, and (ii) upon existence of such an isolated eigenvector, reaching the lowest classification error rate.

According to Remark 1, any f with ν/a_1² > 1 induces a performance degradation (compared to the optimal linear function).
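As a sanity check of Proposition 1 (our own simulation sketch, with hypothetical parameter choices), one can compare the empirical misclassification rate of sign(top eigenvector) under the model (1) with the limit (1/2) erfc(√(α/(2 − 2α))), with α computed as in Corollary 2 from the (a_1, ν) of the chosen f.

```python
import numpy as np
from scipy.special import erfc
from scipy.sparse.linalg import eigsh

def simulate_error(p=512, n=1024, rho=4.0, s=0.43, seed=0):
    """Empirical misclassification rate of sign(top eigenvector of K) for the binary f3,
    under the two-class mixture model (1)."""
    rng = np.random.default_rng(seed)
    mu = rng.standard_normal(p)
    mu *= np.sqrt(rho) / np.linalg.norm(mu)          # ||mu||^2 = rho
    v = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])
    X = rng.standard_normal((p, n)) + np.outer(mu, v)
    G = X.T @ X / np.sqrt(p)
    K = np.sign(G) * (np.abs(G) > np.sqrt(2) * s) / np.sqrt(p)   # f3, zero diagonal next
    np.fill_diagonal(K, 0.0)
    _, vecs = eigsh(K, k=1, which='LA')              # top (largest) eigenvector
    v_hat = np.sign(vecs[:, 0])
    if v_hat @ v < 0:                                # convention v_hat^T v >= 0
        v_hat = -v_hat
    return np.mean(v_hat != v)

def theoretical_error(alpha):
    """Limiting error rate of Proposition 1 for a given alignment alpha in [0, 1)."""
    return 0.5 * erfc(np.sqrt(alpha / (2.0 - 2.0 * alpha)))
```

In our runs of such a simulation, the empirical rate matches theoretical_error(alpha) up to finite-size fluctuations, in line with the moderate-dimension agreement reported around Figure 3.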
However, by choosing f to be one of the functions f_1, f_2, f_3 defined in (3)-(5), one may trade clustering performance optimality for reduced storage size and computational time. Figure 4 displays the theoretical decay in clustering performance and the gain in storage size of the sparse f_1, quantized f_2, and binary f_3, when compared to the optimal but dense linear function. As s → 0, both performance and storage size under f_1 naturally approach those of the linear function. This is unlike f_2 or f_3, which approach the sign function. For s ≫ 1, the performance under the sparse f_1 becomes comparable to that of the binary f_3 (which is significantly worse than the quantized but non-sparse f_2), but for a larger storage cost. In particular, using f_2 or f_3 in the setting of Figure 4, one can reduce the storage size by a factor of 32 or 64 (in the case of the IEEE standard single- or double-precision floating-point format), at the price of a performance drop of less than 1%.

Figure 3: (Left) Empirical alignment |v̂^T v|²/n (green crosses) and misclassification rate (purple circles), shown as markers, versus their limiting behaviors, shown as lines, for f(t) = sign(t), as a function of the SNR ρ. (Right) Misclassification rate as a function of the truncation threshold s of the sparse f_1 (blue crosses) and the binary f_3 (red circles), with ρ = 4. Here, p = 512, n = 256, µ ∼ N(0, I_p), v = [−1_{n/2}; 1_{n/2}], on Gaussian data. The results are obtained by averaging over 250 runs.

Figure 4: Clustering performance (left, with a zoom-in in the middle) and storage size (MB) (right) of f_1 (blue), f_2 with M = 2 (green), f_3 (red), and the linear f(t) = t (black), versus the truncation threshold s, for SNR ρ = 2, c = 1/2 and n = 10³, with 64 bits per entry for non-quantized matrices.

4.2 COMPARISON TO UNIFORM SPARSIFICATION AND SUBSAMPLING

From Figure 4, we see that the classification error and the storage gain of the sparse f_1 increase monotonically as the truncation threshold s grows. For f_1, the number of nonzero entries of K is approximately erfc(s)·n² at truncation threshold s. Thus, the sparsity level ε_selec = erfc(s) ∈ [0, 1] can be defined and compared to uniform sparsification or subsampling approaches. Recall (from the introduction) that the cost of spectral clustering may be reduced by subsampling the whole dataset into 1/ε_sub chunks of nε_sub data vectors each. Alternatively, as investigated recently in (Zarrouk et al., 2020), the cost can be reduced by uniformly zeroing-out X^T X with a symmetric random mask matrix B, with B_ij ∼ Bern(ε_unif) for 1 ≤ i < j ≤ n and B_ii = 0. On average, a proportion 1 − ε_unif of the entries of X^T X is set to zero, so that ε_unif ∈ [0, 1] controls the sparsity level (and thus the storage size as well as the computational time). Similar to our Corollary 2, the associated eigenvector alignment α (and thus the clustering accuracy per Proposition 1) in both cases can be derived. Specifically, taking ε_unif = a_1²/ν in (Zarrouk et al., 2020, Theorem 3.2), we obtain the same F(x) as in our Corollary 2 and therefore the same phase transition point γ and eigenvector alignment α.
As for subsampling, its performance can be obtained by letting a_1² = ν and changing c into c/ε_sub in the formulas of F(x) and G(x) of our Corollary 2. Consequently, the same clustering performance is achieved by either uniform or selective sparsification (using f_1) with

ε_unif = a_1²/ν = erfc(s) + 2s e^{−s²}/√π > erfc(s) = ε_selec,   (12)

and our proposed selective sparsification thus leads to strictly sparser matrices. Moreover, their ratio r(s) = erfc(s)/(erfc(s) + 2s e^{−s²}/√π) is a decreasing function of s and behaves as r(s) ≈ (1 + 2s²)^{−1} for s ≫ 1,⁵ meaning that the gain in storage size and computational time is more significant as the matrix becomes sparser. This is depicted in Figure 5-(left).
(⁵ We use here the asymptotic expansion erfc(s) = (e^{−s²}/(s√π)) (1 + Σ_{k=1}^∞ (−1)^k (1·3···(2k−1))/(2s²)^k).)

Figure 5: (Left) Proportion of nonzero entries under uniform versus selective sparsification f_1, and their ratio r(s), as a function of the truncation threshold s. (Right) Comparison of the 1%, 10% classification error and phase transition (i.e., 50% error) curves between subsampling (green), uniform sparsification (blue) and selective sparsification f_1 (red), as a function of the sparsity level ε and the SNR ρ, for c = 2.

Fixing α in Corollary 2 to achieve a given clustering performance level (via Proposition 1), one may then retrieve equi-performance curves in the (ε, ρ)-plane for uniform sparsification, selective sparsification, and subsampling. This is displayed in Figure 5-(right), showing that a dramatic performance gain is achieved by the proposed selective sparsification f_1. Besides, here for c = 2, as much as 80% sparsity can be obtained with selective sparsification at constant SNR ρ with virtually no performance loss (the red curves are almost flat on ε ∈ [0.2, 1]). This fails to hold for uniform sparsification (Zarrouk et al. (2020) obtain such a result only when c ≲ 0.1) or subsampling.

4.3 OPTIMALLY QUANTIZED AND BINARIZED MATRICES

From Figure 4, we see that the classification errors of the quantized f_2(M; s; t) and the binarized f_3(s; t) do not increase monotonically with the truncation threshold s. It can be shown (and is also visually confirmed in Figure 4) that, for a given M ≥ 2, the ratio ν/a_1² of both f_2 and f_3 is convex in s and has a unique minimum. This leads to the following optimal design result for f_2 and f_3, respectively, the proof of which is straightforward.

Proposition 2 (Optimal design of quantized and binarized functions). Under the assumptions and notations of Proposition 1, the classification error rate is minimized at s = s_opt with
1. s_opt the unique solution to a_1(s_opt) ν′(s_opt) = 2 a_1′(s_opt) ν(s_opt) for the quantized f_2, with a_1′(s) and ν′(s) the corresponding derivatives with respect to s of the expressions in Figure 1-(right); and
2. s_opt the unique solution to s_opt = exp(−s_opt²)/(2√π erfc(s_opt)), i.e., s_opt ≈ 0.43, for the binary f_3, with level of sparsity ε ≈ 0.54.

Therefore, the optimal threshold s_opt for the quantized f_2 or the binary f_3 under (1) is problem-independent, as it depends neither on ρ nor on c. In particular, note that (i) the binary f_3(s_opt; ·) is consistently better than f(t) = sign(t), for which ν/a_1² = π/2 ≈ 1.57, compared to ≈ 1.24 at s_opt; and (ii) the performance of the quantized f_2 can be worse, though very slightly, than that of the binary f_3 for small s, but is significantly better for not-too-small s. These are visually confirmed in the left and middle displays of Figure 4.
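The closed forms above are easy to evaluate numerically. The following sketch (ours) compares ε_selec = erfc(s) with ε_unif = a_1²/ν from (12) and solves the fixed-point equation of Proposition 2 for the binary f_3, recovering s_opt ≈ 0.43 and ε ≈ 0.54.

```python
import numpy as np
from scipy.special import erfc

def eps_selective(s):
    """Sparsity level of the selective sparsification f1 (proportion of kept entries)."""
    return erfc(s)

def eps_uniform_equivalent(s):
    """Uniform sparsity level achieving the same clustering performance, cf. (12)."""
    return erfc(s) + 2 * s * np.exp(-s**2) / np.sqrt(np.pi)

def sopt_binary(n_iter=100):
    """Fixed point s = exp(-s^2) / (2 sqrt(pi) erfc(s)) of Proposition 2 for f3."""
    s = 0.5
    for _ in range(n_iter):
        s = np.exp(-s**2) / (2 * np.sqrt(np.pi) * erfc(s))
    return s

s = 1.5
print(eps_selective(s), eps_uniform_equivalent(s))   # selective is strictly sparser
s_opt = sopt_binary()
print(s_opt, erfc(s_opt))                            # ~ 0.43 and ~ 0.54
```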
As already observed in Figure 4-(right), a significant gain in storage size can be achieved by using f_2 or f_3, versus the performance-optimal but dense linear function, with virtually no performance loss. Figure 6 compares the performance of the optimally designed f_2 and f_3 to a sparse f_1 that has approximately the same storage size.⁶ A significant drop in classification error is observed by using the quantization f_2 or the binarization f_3 rather than the sparsification f_1. Also, the performances of f_2 and f_3 are extremely close to the theoretical optimum (met by f(t) = t). This is further confirmed by Figure 6-(right) where, for the optimal f_2, the ratio ν/a_1² gets close to 1 for all M ≥ 5.
(⁶ We set the truncation threshold s of f_1 such that erfc(s) = 3/64, so that the storage size of the sparse f_1 (64 bits per nonzero entry) is the same as that of the quantized f_2 with M = 3 (3 bits per nonzero entry), which is three times that of the binary f_3.)

Figure 6: Performance of f_1 (blue), f_2 with M = 3 (green) and f_3 (red), all of the same storage size, versus the SNR for c = 4 (left) and versus c for SNR ρ = 4 (middle). (Right) Optimal threshold s_opt (green) and (ν/a_1²)_min (purple) of f_2 versus M. Curves for the linear f(t) = t are displayed in black.

Figure 7: Clustering performance of the sparse f_1 and the binary f_3 (left and middle), and proportion of nonzero entries and computational time of the top eigenvector for f_3 (right), as a function of the truncation threshold s, on the MNIST dataset: digits (0, 1) (left) and (5, 6) (middle and right), with n = 2 048 and the performance of the linear function in black. Results averaged over 100 runs.

Figure 7 next evaluates the clustering performance, the proportion of nonzero entries in K, and the computational time of the top eigenvector, for the sparse f_1 and the binary f_3, versus the linear f(t) = t, as a function of the truncation threshold s, on the popular MNIST dataset (LeCun et al., 1998). Depending on (the SNR ρ of) the task, up to 90% of the entries can be discarded almost "for free". Moreover, the curves of the binary f_3 appear strikingly close to those of the sparse f_1, showing the additional advantage of using the former to further reduce the storage size of K. More empirical results on various datasets are provided in Appendix B to confirm our observations in Figure 7.

5 CONCLUDING REMARKS

We have evaluated performance-complexity trade-offs when sparsifying, quantizing, and binarizing a linear kernel matrix via a thresholding operator. Our main technical result characterizes the change in the eigenspectrum under these operations; and we have shown that, under an information-plus-noise model, sparsification and quantization, when carefully employed, maintain the informative eigenstructure and incur almost negligible performance loss in spectral clustering. Empirical results on real data demonstrate that these conclusions hold far beyond the present statistical model. The proposed analysis can be extended in many ways, for instance by considering a multi-cluster and more involved model than (1), as in (Liao & Couillet, 2019)
(i.e., a generic K-class Gaussian mixture N(µ_a, C_a) for a ∈ {1, . . . , K}, which may help better interpret the empirical observations in Figure 7 and Appendix B), by focusing on more general kernels beyond the current inner-product type in (2), or by deriving non-asymptotic guarantees as in (Vankadara & Ghoshdastidar, 2020). Our results open the door to theoretical investigation of a broad range of cost-efficient linear algebra methods in machine learning, including subsampling techniques (Mensch et al., 2017; Roosta-Khorasani & Mahoney, 2019), distributed optimization (Wang et al., 2018), randomized linear algebra algorithms (Mahoney, 2011; Drineas & Mahoney, 2016), and quantization for improved training and/or inference (Dong et al., 2019; Shen et al., 2020). Also, given recent interest in viewing neural networks from the perspective of RMT (Li & Nguyen, 2018; Seddik et al., 2018; Jacot et al., 2019; Liu & Dobriban, 2019; Martin & Mahoney, 2019; 2020), our results open the door to understanding and improving performance-complexity trade-offs far beyond kernel methods (Rahimi & Recht, 2008; Jacot et al., 2018; Liu et al., 2020), e.g., for sparse, quantized, or even binary neural networks (Hubara et al., 2016; Lin et al., 2017).

ACKNOWLEDGMENTS

We would like to acknowledge DARPA, IARPA (contract W911NF20C0035), NSF, and ONR via its BRC on RandNLA for providing partial support of this work. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred. Couillet's work is partially supported by MIAI at University Grenoble-Alpes (ANR-19-P3IA-0003) and the HUAWEI LarDist project.

REFERENCES

Dimitris Achlioptas and Frank McSherry. Fast computation of low-rank matrix approximations. Journal of the ACM (JACM), 54(2):9-es, 2007.
George E. Andrews, Richard Askey, and Ranjan Roy. Special Functions, volume 71. Cambridge University Press, 1999.
Zhidong Bai and Jack W. Silverstein. Spectral Analysis of Large Dimensional Random Matrices, volume 20. Springer, 2010.
Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability, 33(5):1643-1697, 2005.
Xiuyuan Cheng and Amit Singer. The spectrum of random inner-product kernel matrices. Random Matrices: Theory and Applications, 2(04):1350010, 2013.
Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature, 2018.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.
Yen Do and Van Vu. The spectrum of random kernel matrices: universality results for rough and varying kernels. Random Matrices: Theory and Applications, 2(03):1350005, 2013.
Edgar Dobriban, Stefan Wager, et al. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247-279, 2018.
Zhen Dong, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. HAWQ: Hessian aware quantization of neural networks with mixed-precision. In Proceedings of the IEEE International Conference on Computer Vision, pp. 293-302, 2019.
Petros Drineas and Michael W. Mahoney. RandNLA: randomized numerical linear algebra. Communications of the ACM, 59(6):80-90, 2016.
Petros Drineas and Anastasios Zouzias.
A note on element-wise matrix sparsification via a matrixvalued Bernstein inequality. Information Processing Letters, 111(8):385 389, 2011. Noureddine El Karoui. On information plus noise kernel random matrices. The Annals of Statistics, 38(5):3191 3216, 2010. Zhou Fan and Andrea Montanari. The spectral norm of random inner-product kernel matrices. Probability Theory and Related Fields, 173(1-2):27 85, 2019. Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, 2013. Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized Neural Networks. In Advances in Neural Information Processing Systems, volume 29, pp. 4107 4115, 2016. Arthur Jacot, Franck Gabriel, and Cl ement Hongler. Neural Tangent Kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571 8580, 2018. Published as a conference paper at ICLR 2021 Arthur Jacot, Franck Gabriel, and Clement Hongler. The asymptotic spectrum of the Hessian of DNN throughout training. In International Conference on Learning Representations, 2019. Antony Joseph and Bin Yu. Impact of regularization on spectral clustering. The Annals of Statistics, 44(4):1765 1791, 2016. Arun Kadavankandy and Romain Couillet. Asymptotic gaussian fluctuations of spectral clustering eigenvectors. In 2019 IEEE 8th International Workshop on Computational Advances in Multi Sensor Adaptive Processing (CAMSAP), pp. 694 698. IEEE, 2019. Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Ping Li and Phan-Minh Nguyen. On random deep weight-tied autoencoders: Exact asymptotic analysis, phase transitions, and implications to training. In International Conference on Learning Representations, 2018. Zhenyu Liao and Romain Couillet. Inner-product kernels are asymptotically equivalent to binary discrete kernels. ar Xiv preprint ar Xiv:1909.06788, 2019. Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pp. 345 353, 2017. Fanghui Liu, Xiaolin Huang, Yudong Chen, and Johan AK Suykens. Random features for kernel approximation: A survey in algorithms, theory, and beyond. ar Xiv preprint ar Xiv:2004.11154, 2020. Sifan Liu and Edgar Dobriban. Ridge Regression: Structure, Cross-Validation, and Sketching. In International Conference on Learning Representations, 2019. Anna Lytova and Leonid Pastur. Central limit theorem for linear eigenvalue statistics of random matrices with independent entries. The Annals of Probability, 37(5):1778 1840, 2009. Michael W Mahoney. Randomized Algorithms for Matrices and Data. Foundations and Trends in Machine Learning, 3(2):123 224, 2011. Vladimir A Marcenko and Leonid Andreevich Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457, 1967. Charles H Martin and Michael W Mahoney. Traditional and heavy tailed self regularization in neural network models. In International Conference on Machine Learning, pp. 4284 4293, 2019. Charles H Martin and Michael W Mahoney. Heavy-tailed universality predicts trends in test accuracies for very large pre-trained deep neural networks. In Proceedings of the 2020 SIAM International Conference on Data Mining, pp. 505 513. SIAM, 2020. Arthur Mensch, Julien Mairal, Bertrand Thirion, and Ga el Varoquaux. 
Stochastic subsampling for factorizing huge matrices. IEEE Transactions on Signal Processing, 66(1):113 128, 2017. Vinay Uday Prabhu. Kannada-MNIST: A new handwritten digits dataset for the Kannada language. ar Xiv preprint ar Xiv:1908.01242, 2019. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pp. 1177 1184, 2008. Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled Newton methods. Mathematical Programming, 174(1-2):293 326, 2019. Walter Rudin. Principles of mathematical analysis, volume 3. Mc Graw-hill New York, 1964. Alaa Saade, Florent Krzakala, and Lenka Zdeborov a. Spectral clustering of graphs with the Bethe Hessian. In Advances in Neural Information Processing Systems, pp. 406 414, 2014. Mohamed El Amine Seddik, Mohamed Tamaazousti, and Romain Couillet. A kernel random matrixbased approach for sparse PCA. In International Conference on Learning Representations, 2018. Published as a conference paper at ICLR 2021 Mohamed El Amine Seddik, Cosme Louart, Mohamed Tamaazousti, and Romain Couillet. Random matrix theory proves that deep learning representations of GAN-data behave as gaussian mixtures. ar Xiv preprint ar Xiv:2001.08370, 2020. Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT. In AAAI, pp. 8815 8821, 2020. Jack W Silverstein and ZD Bai. On the empirical distribution of eigenvalues of a class of large dimensional random matrices. Journal of Multivariate analysis, 54(2):175 192, 1995. Jack W Silverstein and Sang-Il Choi. Analysis of the limiting spectral distribution of large dimensional random matrices. Journal of Multivariate Analysis, 54(2):295 309, 1995. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929 1958, 2014. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1 9, 2015. Terence Tao, Van Vu, and Manjunath Krishnapur. Random matrices: Universality of ESDs and the circular law. The Annals of Probability, 38(5):2023 2065, 2010. Leena C Vankadara and Debarghya Ghoshdastidar. On the optimality of kernels for highdimensional clustering. In International Conference on Artificial Intelligence and Statistics, pp. 2185 2195. PMLR, 2020. Dan Voiculescu. Addition of certain non-commuting random variables. Journal of functional analysis, 66(3):323 346, 1986. Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395 416, 2007. Shusen Wang, Fred Roosta, Peng Xu, and Michael W Mahoney. GIANT: Globally improved approximate Newton method for distributed optimization. In Advances in Neural Information Processing Systems, pp. 2332 2342, 2018. Eugene P. Wigner. Characteristic vectors of bordered matrices with infinite dimensions. Annals of Mathematics, 62(3):548 564, 1955. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms, 2017. Tayeb Zarrouk, Romain Couillet, Florent Chatelain, and Nicolas Le Bihan. Performance-complexity trade-off in large dimensional statistics. 
In 2020 IEEE International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2020. Published as a conference paper at ICLR 2021 A PROOFS AND RELATED DISCUSSIONS Under the mixture model (1), the data matrix X Rp n can be compactly written as X = Z + µv T, (13) for Z Rp n having i.i.d. zero-mean, unit-variance, κ-kurtosis, sub-exponential entries and v { 1}n so that v = n. Recall also the following notations: K = δi =jf(x T i xj/ p)/ p n i,j=1 , Q(z) (K z In) 1. (14) A.1 PROOF OF THEOREM 1 The proof of Theorem 1 comes in the following two steps: 1. show that the random quantities 1 n tr An Q(z) and a T n Q(z)bn of interest concentrate around their expectations in the sense that 1 n tr An(Q(z) E[Q(z)]) 0, a T n(Q(z) E[Q(z)])bn 0, (15) almost surely as n, p ; and 2. show that the sought-for deterministic equivalent Q(z) given in Theorem 1 is an asymptotic approximation for the expectation of the resolvent Q(z) defined in (6) in the sense that E[Q] Q 0, (16) The concentration of trace forms in the first item has been established in (Cheng & Singer, 2013; Do & Vu, 2013), and the bilinear forms follow similarly. Here we focus on the second item to show that E[Q] Q 0 in the large n, p limit. In the sequel, we use o(1) and o (1) for scalars or matrices of (almost surely if being random) vanishing absolute values or operator norms as n, p . To establish E[Q] Q 0, we need to show subsequently that: 1. under (13), the random matrix K defined in (2) admits a spiked-model approximation, that is K = K0 + UΛUT + o (1), (17) for some full rank random (noise) matrix K0 and low rank (information) matrix UΛUT to be specified; and 2. the matrix inverse ( K0 UΛUT z In) 1 can be decomposed with the Woodbury identity, so that Q = ( K0 UΛUT z In) 1 +o (1) = Q0 Q0U(Λ 1 +UT Q0U) 1UT Q0 +o (1), (18) with Q0(z) ( K0 z In) 1; and 3. the expectation of the right-hand side of (18) is close to Q in the large n, p limit, allowing us to conclude the proof of Theorem 1. To establish (17), we denote the noise-only null model with µ = 0 by writing K = K0 such that [K0]ij = δi =jf(z T i zj/ p)/ p. (19) With a combinatorial argument, it has been shown in (Fan & Montanari, 2019) that K0 K0 a2 2 1 p(ψ1T n + 1nψT) 0, (20) almost surely as n, p , for K0 such that ( K0 z In) 1 Q0(z) m(z)In and the random vector ψ Rn with its i-th entries given by [ψ]i = 1 p zi 2 E[ zi 2] = 1 p( zi 2 p). Published as a conference paper at ICLR 2021 Consider now the informative-plus-noise model K for X = Z + µv T as in (13) with [v]i = 1 and v = n. It follows from (Liao & Couillet, 2019) that K K0 a1 p v ZTµ µ 2 1 1 0 almost surely as n, p . Combining (20) with (21), we obtain K K0 UΛUT 0 almost surely as n, p , with U = 1 p[1n, v, ψ, ZTµ] Rn 4, Λ = 2 0 0 a1 µ 2 0 a1 a2 2 0 0 0 0 a1 0 0 and ( K0 z In) 1 Q0(z) m(z)In. By the Woodbury identity, we write Q = (K z In) 1 = ( K0 + UΛUT z In) 1 + o (1) = Q0 Q0U(Λ 1 + UT Q0U) 1UT Q0 + o (1) (23) with Λ 1 + UT Q0U = c 0 0 0 0 (κ 1) m(z) c 0 0 0 0 µT( 1 p E[Z Q0ZT])µ where we use the fact that E[ψ] = 0, E[ψψT] = (κ 1)In. We need to evaluate the expectation 1 p E[Z( K0 z In) 1ZT]. This is given in the following lemma. Lemma 1. Under the assumptions and notations of Theorem 1, we have 1 p E[Z( K0 z In) 1ZT] = m(z) c + a1m(z)Ip + o (1). (24) Proof of Lemma 1. For Q0 = ( K0 z In) 1, we aim to approximate the expectation E[Z Q0ZT]/p. Consider first the case where the entries of Z are i.i.d. 
Gaussian, we can write the (i, i ) entry of E[Z Q0ZT] with Stein s lemma (i.e., E[xf(x)] = E[f (x)] for x N(0, 1)) as E[Z Q0ZT]ii = j=1 E[Zij[ Q0ZT]ji ] = j=1 E [ Q0ZT]ji [ Q0]jjδii + X We first focus on the term [ Q0]jk Zij by writing Zij = Q0 K0 Zij Q0 l,m=1 [ Q0]jl [K0]lm Zij [ Q0]mk where we recall [K0]ij = δi =jf(ZTZ/ p)ij/ p so that for l = m we have pf (ZTZ/ p)lm [ZTZ]lm pf (ZTZ/ p)lm(δjl Zim + ZT liδjm) Zij = 0 for l = m. We get Zij ZT ki = 1 j,k,m [ Q0]jjf (ZTZ/ p)jm Zim[ Q0]mk ZT ki 1 j,k,l [ Q0]jlf (ZTZ/ p)lj ZT li[ Q0]jk ZT ki p[Z diag(f (ZTZ/ p) Q01n) Q0ZT]ii 1 p[Z( Q0 f (ZTZ/ p)) Q0ZT]ii Published as a conference paper at ICLR 2021 where f (ZTZ/ p) indeed represents f (ZTZ/ p) diag( ) in both cases. For the first term, since f (ZTZ/ p) diag( ) = a11n1T n + O ( p), we have 1 pf (ZTZ/ p) Q01n = a1 p 1T n Q01n 1n + O(p 1/2) = a1m(z) c 1n + O(p 1/2) (25) where O(p 1/2) is understood entry-wise. As a result, 1 p Z diag(f (ZTZ/ p) Q01n) Q0ZT = a1m(z) p Z Q0ZT + o (1). For the second term, since f (ZTZ/ p) has O(1) entries and A B n A B for A, B Rn n, we deduce that 1 p Z( Q0 f (ZTZ/ p)) Q0ZT = O( p). As a consequence, we conclude that 1 p E[Z Q0ZT] = 1 p tr Q0 Ip a1m(z) p E[Z Q0ZT] + o (1) that is 1 p E[Z Q0ZT] = m(z) c + a1m(z)Ip + o (1) where we recall that tr Q0/p = m(z)/c and thus the conclusion of Lemma 1 for the Gaussian case. The interpolation trick (Lytova & Pastur, 2009, Corollaray 3.1) can then be applied to extend the result beyond Gaussian distribution. This concludes the proof of Lemma 1. Denote A = (Λ 1 + UT Q0U) 1, it follows from Lemma 1 that E[UAUT] = 1 p(A111n1T n + A121nv T + A21v1T n + A22vv T + A33(κ 1)In + A44 µ 2In) p(A111n1T n + A121nv T + A21v1T n + A22vv T) + o (1) since µ = O(1) and v = O( n). We thus deduce from (23) that Q(z) Q(z) = m(z)In cm2(z)V A11 A12 A21 A22 with n V = [v, 1n]. Rearranging the expression we conclude the proof of Theorem 1. A.2 PROOF OF COROLLARY 1 AND RELATED DISCUSSIONS Consider the noise-only model by taking µ = 0 in Theorem 1. Then, we have K = K0 and Θ(z) = 0, so that Q(z) = m(z) + Ω(z) 1 n1n1T n, Ω(z) = a2 2(κ 1)m3(z) 2c2 a2 2(κ 1)m2(z) (26) where we recall m(z) is the solution to m(z) = z + a2 1m(z) c + a1m(z) + ν a2 1 c m(z) 1 . (27) Since the resolvent Q(z) is undefined for z R within the eigensupport of K that consists of (i) the main bulk characterized by the Stieltjes transform m(z) defined in (27) and (ii) the possible spikes, we need to find the poles of Q(z) but not those of m(z) to determine the asymptotic locations of the spikes that are away from the main bulk. Direct calculations show that the Stieltjes transforms of the possible non-informative spikes satisfy 2 κ 1 c a2 (28) Published as a conference paper at ICLR 2021 that are in fact the poles of Ω(z), for a2 = 0 and κ = 1. For κ = 1 or a2 = 0, Ω(z) has no (additional) poles, so that there is (almost surely) no spike outside the limiting spectrum. It is however not guaranteed that z R corresponding to (28) isolates from the main bulk. To this end, we introduce the following characterization of the limiting spectral measure in (27), the proof of which follows from previous work. Corollary 3 (Limiting spectrum). 
Under the notations and conditions of Theorem 1, with probability one, the empirical spectral measure ωn = 1 n Pn i=1 δλi(K0) of the noise-only model K0 (and therefore that of K as a low rank additive perturbation of K0 via (21)) converges weakly to a probability measure ω of compact support as n, p , with ω uniquely defined through its Stieltjes transform m(z) solution to (27). Moreover, 1. if we let supp(ω) be the support of ω, then supp(ω) {0} = R \ {x(m) | m R \ {{ c/a1} {0}} and x (m) > 0} (29) for x(m) the functional inverse of (27) explicitly given by m a2 1m c + a1m ν a2 1 c m, (30) 2. the measure ω has a density and its support may have up to four edges, with the associated Stieltjes transforms given by the roots of x (m) = 0, i.e., x (m) = 1 m2 a2 1c (c + a1m)2 ν a2 1 c = 0. (31) The limiting spectral measure ω of the null model K0 was first derived in (Cheng & Singer, 2013) for Gaussian distribution and then extended to sub-exponential distribution in (Do & Vu, 2013). The fact that a finite rank perturbation does not affect the limiting spectrum follows from (Silverstein & Bai, 1995, Lemma 2.6). The characterization in (29) above follows the same idea as in (Silverstein & Choi, 1995, Theorem 1.1), which arises from the crucial fact that the Stieltjes transform m(x) = R (t x) 1ω(dt) of a measure ω is an increasing function on its domain of definition and so must be its functional inverse x(m) given explicitly in (30). In plain words, Corollary 3 tells us that (i) depending on the number of real solutions to (31), the support of ω may contain two disjoint regions with four edges, and (ii) x R is outside the support of ω if and only if its associated Stieltjes transform m satisfies x (m) > 0, i.e., belonging to the increasing region of the functional inverse x(m) in (30). This is depicted in Figure 8, where for the same function f(t) = max(t, 0) 1/ 2π with a1 = 1/2, a2 = 1/(2 π) and ν = (π 1)/(2π), we observe in the top display a single region of ω for c = 2 and in the bottom display two disjoint regions (with thus four edges) for c = 1/10. The corresponding (empirical) eigenvalue histograms and limiting laws are given in Figure 9. Note, in particular, that the local extrema of the functional inverse x(m) in Figure 8 characterize the (possibly up to four) edges of the support of ω in Figure 9. According to the discussion above, it remains to check the sign of x (m) for m = q 2 κ 1 c a2 to see if they correspond to isolated eigenvalues away from the support of ω. This, after some algebraic manipulations, concludes the proof of Corollary 1. Discussions. The limiting spectral measure in Corollary 3 is indeed a mix between the popular Mar cenko-Pastur and the Wigner s semicircle law. Remark 2 (From Mar cenko-Pastur to semicircle law). As already pointed out in (Fan & Montanari, 2019), here the limiting spectral measure ω is the so-called free additive convolution (Voiculescu, 1986) of the semicircle and Mar cenko-Pastur laws, weighted respectively by a1 and p ν a2 1, i.e., ω = a1(ωMP,c 1 1) q (ν a2 1)/c ωSC (32) where we denote a1(ωMP,c 1 1) the law of a1(x 1) for x ωMP,c 1 and p (ν a2 1)/c ωSC the law of p (ν a2 1)/c x for x ωSC. Figure 10 compares the eigenvalue distributions of K0 Published as a conference paper at ICLR 2021 x(m) supp(ω) x(m) supp(ω) Figure 8: Functional inverse x(m) for m R\{{ c/a1} {0}}, with f(t) = max(t, 0) 1/ 2π, for c = 2 (above, with two edges) and c = 1/10 (bottom, with four edges). 
The support of ω can be read on the vertical axes and the values of x such that x (m) = 0 are marked in green. 0.5 0 0.5 1 0 5 Figure 9: Eigenvalues of K with µ = 0 (blue) versus the limiting laws in Theorem 1 and Corollary 3 (red) for p = 3 200, n = 1 600 (left) and p = 400, n = 4 000 (right), with f(t) = max(t, 0) 1/ 2π and Gaussian data. The values of x such that x (m) = 0 in Figure 8 are marked in green. for f(t) = a1t + a2(t2 1)/ 2 (so that ν a2 1 = a2 2) with different pairs of (a1, a2). We observe a transition from the Mar cenko-Pastur law (in the left display, with a1 = 0 and a2 = 0) to the semicircle law (in the right display, with a1 = 0 and a2 = 0). 1 0 1 2 1 0 1 2 2 1 0 1 2 Figure 10: Eigenvalues of K with µ = 0 (blue) versus the limiting laws in Theorem 1 and Corollary 3 (red) for Gaussian data, p = 1 024, n = 512 and f(t) = a1t + a2(t2 1)/ 2 with a1 = 1, a2 = 0 (left), a1 = 1, a2 = 1/2 (middle), and a1 = 0, a2 = 1 (right). Published as a conference paper at ICLR 2021 Remark 2 tells us that, depending on the ratio ν/a2 1, the eigenspectrum of K exhibits a transition from the Mar cenko-Pastur to semicircle-like shape. Note from Figure 1-(right) that, for the sparse f1, the ratio ν/a2 1 is an increasing function of the truncation threshold s and therefore, as the matrix K become sparser, it eigenspectrum changes from a Mar cenko-Pastur-type (at s = 0) to be more semicircle-like. This is depicted in Figure 11 and similar conclusions hold for quantized f2 and binary f3 in the s sopt regime. 1 0 1 2 1 0 1 2 0.5 0 0.5 Figure 11: Eigenvalues of K with µ = 0 (blue) versus the limiting laws in Theorem 1 and Corollary 3 (red) for Gaussian data, p = 1 024, n = 512 and f(t) = t 1|t|> 2s with s = 0.1 (left), s = .75 (middle), and s = 1.5 (right). As discussed after Theorem 1 and in the proof above, while the limiting eigenvalue distribution ω is universal and independent of the law of the entries of Z, so long as they are independent, subexponential, of zero mean and unit variance, as commonly observed in RMT (Tao et al., 2010), this is no longer the case for the isolated eigenvalues. In particular, according to Corollary 1, the possible non-informative spikes do depend on the kurtosis κ of the distribution. In Figure 12 we observe a farther (left) spike for Student-t (with κ = 5 and is thus not sub-exponential) than Gaussian distribution (with κ = 3), while no spike can be observed for the symmetric Bernoulli distribution (that takes values 1 with probability 1/2 so that κ = 1), with the same limiting eigenvalue distribution for f(t) = max(t, 0) 1/ 2 0 2 4 2 0 2 4 2 0 2 4 Figure 12: Eigenvalues of K with µ = 0 (blue) versus the limiting laws and spikes in Theorem 1 and Corollary 1 (red) for Student-t (with 7 degrees of freedom, left), Gaussian (middle) and Rademacher distribution (right), p = 512, n = 2 048, f(t) = max(t, 0) 1/ 2π. Emphasis on the non-informative spikes at different locations: at 2.10 for Student-t and 1.77 for Gaussian. Remark 3 (Non-informative spike in-between). When the support of ω consists of two disjoint regions (e.g., in the right plot of Figure 9), a non-informative spike may appear between these two regions, with the associated Stieltjes transform m < c/a1 in the setting of Figure 8-(bottom). This is only possibly when a1 q 2 κ 1 > a2. An example is provided in Figure 13. 
A.3 PROOF OF COROLLARY 2 AND RELATED DISCUSSIONS

Similar to our discussions in Section A.2, we need to find the zeros of det Λ(z), that is, the real solutions to H(x) = 0 with
$$H(x) = a_1 a_2^2(\kappa - 1)\left(\frac{(v^\mathsf{T} 1_n)^2}{n^2}\,\rho - 1 - \rho\right) m^3(x) - a_2^2\, c\,(\kappa - 1)\, m^2(x) + 2 a_1 c^2 (\rho + 1)\, m(x) + 2 c^3 = 0$$
for m(z) the unique solution to (9) and $\rho = \lim_p \|\mu\|^2$. Note that

1. for $a_1 a_2^2(\kappa - 1)\big((v^\mathsf{T} 1_n)^2\rho/n^2 - 1 - \rho\big) \neq 0$, there can be up to three spikes;
2. with a1 = 0 and a2 ≠ 0, we get $m^2(x) = \frac{2c^2}{a_2^2(\kappa - 1)}$ and there are at most two spikes: this is equivalent to the case of Corollary 1 with ρ = 0; in fact, taking a1 = 0 we discard the information in the signal µ, as has been pointed out in (Liao & Couillet, 2019);
3. with a2 = 0 and a1 ≠ 0 we obtain $m(x) = -\frac{c}{a_1(\rho + 1)}$; this is the case of Corollary 2.

For a given isolated eigenvalue-eigenvector pair $(\hat\lambda, \hat v)$ (assumed to be of multiplicity one), the projection $|\hat v^\mathsf{T} v|^2$ onto the label vector v can be evaluated via Cauchy's integral formula and our Theorem 1. More precisely, considering a positively oriented contour Γ that circles around only the isolated $\hat\lambda$, we write
$$\frac{1}{n} v^\mathsf{T}\hat v\,\hat v^\mathsf{T} v = -\frac{1}{2\pi i}\oint_\Gamma \frac{1}{n} v^\mathsf{T}(K - zI_n)^{-1} v\, dz = -\frac{1}{2\pi i}\oint_\Gamma \frac{1}{n} v^\mathsf{T}\big(m(z) I_n - V\Lambda(z)V^\mathsf{T}\big) v\, dz + o(1)$$
$$= \frac{1}{2\pi i}\,\frac{1}{n} v^\mathsf{T} V\left(\oint_\Gamma \Lambda(z)\, dz\right) V^\mathsf{T} v + o(1) = \frac{1}{n} v^\mathsf{T} V\,\big(\mathrm{Res}_{z=\lambda}\Lambda(z)\big)\, V^\mathsf{T} v + o(1)$$
$$= \lim_{z\to\lambda}(z - \lambda)\left(\Theta(z)\, m^2(z) + 2\,\Theta(z)\Omega(z)\,\frac{v^\mathsf{T} 1_n}{n} + \big(m(z) - \Theta(z)\Omega^2(z)\big)\,\frac{(v^\mathsf{T} 1_n)^2}{n^2}\right) + o(1)$$
where we use Theorem 1 for the second equality and recall that the asymptotic location λ of $\hat\lambda$ is away from the support of the limiting spectral measure ω, so that $\frac{1}{2\pi i}\oint_\Gamma m(z)\, dz = 0$ in the third equality. Interestingly, note at this point that taking $v^\mathsf{T} 1_n = 0$ or a2 = 0 (so that Ω(z) = 0) leads to the same simplification
$$\frac{1}{n}|v^\mathsf{T}\hat v|^2 = \lim_{z\to\lambda}(z - \lambda)\,\Theta(z)\, m^2(z) + o(1) = \lim_{z\to\lambda}(z - \lambda)\,\frac{a_1\rho\, m^2(z)}{c + a_1 m(z)(1 + \rho)} + o(1) \qquad (34)$$
$$= \frac{\rho}{1 + \rho}\,\frac{m^2(\lambda)}{m'(\lambda)} + o(1) = \frac{\rho}{1 + \rho}\left(1 - \frac{a_1^2\, c\, m^2(\lambda)}{(c + a_1 m(\lambda))^2} - \frac{\nu - a_1^2}{c}\, m^2(\lambda)\right) + o(1) \qquad (35)$$
with l'Hospital's rule and the fact that $m'(z) = \Big(\frac{1}{m^2(z)} - \frac{a_1^2 c}{(c + a_1 m(z))^2} - \frac{\nu - a_1^2}{c}\Big)^{-1}$, obtained by differentiating (9). This in particular means that, in the absence of the (noisy) non-informative spikes due to a2 = 0, or in the case of balanced classes $v^\mathsf{T} 1_n = 0$ (in fact, almost balanced classes $v^\mathsf{T} 1_n = o(n)$), we obtain the same asymptotic alignment (with respect to v) for the informative eigenvector. However, spurious non-informative and isolated eigenvectors may appear in the latter case. In the setting of Corollary 2 with a1 > 0 and a2 = 0, with the substitution $m(\lambda) = m_\rho = -\frac{c}{a_1(\rho + 1)}$ into (35) and then the change of variables $m = -\frac{c}{a_1}\,\frac{1}{1 + x}$, we obtain the expression of F(x) in Corollary 2. The phase transition condition can be similarly obtained, as discussed in Section A.2, by checking the sign of the derivative of the functional inverse $x'(m)$ as in Corollary 3. This concludes the proof of Corollary 2.
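For concreteness, the following short sketch (our own illustration; the helper names and the values of c and ρ are arbitrary choices for the example) solves the above cubic for its real roots in m, keeps only those in the region x'(m) > 0, i.e., outside the support of ω, and maps them to asymptotic eigenvalue locations through the functional inverse x(m) of (30); the parameter eta stands for (vᵀ1n)²/n², with eta = 0 for balanced classes.

```python
import numpy as np

def x_of_m(m, a1, nu, c):
    # functional inverse (30) of the noise-only Stieltjes transform
    return -1 / m - a1**2 * m / (c + a1 * m) - (nu - a1**2) * m / c

def dx_of_m(m, a1, nu, c):
    # derivative (31): x'(m) > 0 characterizes points outside supp(omega)
    return 1 / m**2 - a1**2 * c / (c + a1 * m)**2 - (nu - a1**2) / c

def spike_locations(a1, a2, nu, kappa, c, rho, eta=0.0):
    # real roots in m of the cubic H(x) = 0, with eta = (v^T 1_n)^2 / n^2;
    # only roots with x'(m) > 0 correspond to isolated eigenvalues, located at x(m)
    coeffs = [a1 * a2**2 * (kappa - 1) * (eta * rho - 1 - rho),   # m^3
              -a2**2 * c * (kappa - 1),                           # m^2
              2 * a1 * c**2 * (rho + 1),                          # m
              2 * c**3]                                           # 1
    roots = np.roots(coeffs)
    real = roots[np.abs(roots.imag) < 1e-8].real
    return np.sort([x_of_m(m, a1, nu, c) for m in real if dx_of_m(m, a1, nu, c) > 0])

# example: f(t) = max(t, 0) - 1/sqrt(2*pi), Gaussian data (kappa = 3), balanced classes,
# with illustrative values c = 1/4 and rho = 2
a1, a2, nu = 0.5, 1 / (2 * np.sqrt(np.pi)), (np.pi - 1) / (2 * np.pi)
print(spike_locations(a1, a2, nu, kappa=3.0, c=0.25, rho=2.0))
```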
Discussions. Note that, while with either a2 = 0 or $v^\mathsf{T} 1_n = 0$ we obtain the same expression for the projection $|v^\mathsf{T}\hat v|^2$, the possible spike of interest $\hat\lambda$ (and its asymptotic location λ) in these two scenarios can be rather different. More precisely,

1. with a2 = 0, there is a single possible spike $\hat\lambda$, with $m(\lambda) = m_\rho = -\frac{c}{a_1(\rho + 1)}$;
2. with $v^\mathsf{T} 1_n = 0$, there can be up to three spikes, corresponding to $m_\rho = -\frac{c}{a_1(\rho + 1)}$ and $m = \pm\sqrt{\frac{2}{\kappa - 1}}\,\frac{c}{a_2}$.

This observation leads to the following remark.

Remark 4 (Noisy top eigenvector with a2 ≠ 0). For $v^\mathsf{T} 1_n = 0$ and a2 ≠ 0, one may have $-\sqrt{\frac{2}{\kappa - 1}}\,\frac{c}{a_2} > -\frac{c}{a_1(\rho + 1)} = m_\rho$, for instance with large a2 and small a1. Since m(x) is an increasing function, the top eigenvalue-eigenvector pair of K can then be non-informative, independent of the SNR ρ, and totally useless for clustering purposes. An example is provided in Figure 2, where one observes that (i) the largest spike (on the right-hand side) corresponds to a noisy eigenvector, while the second largest spike has its eigenvector positively aligned with the label vector v; and (ii) the theoretical prediction of the eigen-alignment α in Corollary 2 still holds here due to $v^\mathsf{T} 1_n = 0$. This extends our Corollary 1 to the signal-plus-noise scenario and confirms the advantage and necessity of taking a2 = 0. As a side remark, in contrast to Remark 3 and Figure 13, where we observe that the non-informative spike can lie between the two disjoint regions of the limiting measure ω, in the case of a1 > 0 the informative spike associated with $m_\rho = -\frac{c}{a_1(\rho + 1)}$ can only appear on the right-hand side of the support of ω, since $-\frac{c}{a_1} < -\frac{c}{a_1(\rho + 1)} < 0$ for $\rho = \lim_p \|\mu\|^2 \geq 0$. See Figure 8-(bottom) for an illustration.

A.4 PROOF OF PROPOSITION 1

Note that, for $\hat v$ the top isolated eigenvector of K, with Corollary 2 we can write
$$\hat v = \sqrt{\alpha}\, v/\sqrt{n} + \sigma w \qquad (36)$$
for some $\sigma \in \mathbb{R}$ and $w \in \mathbb{R}^n$ a zero-mean random vector, orthogonal to v and of unit norm. To evaluate the asymptotic clustering performance in the setting of Proposition 1 (i.e., with the estimate $\hat{\mathcal{C}}_i = \mathrm{sign}([\hat v]_i)$ for $\hat v^\mathsf{T} v \geq 0$), we need to assess the probability $\Pr(\mathrm{sign}([\hat v]_i) < 0)$ for $x_i \in \mathcal{C}_1$ and $\Pr(\mathrm{sign}([\hat v]_i) > 0)$ for $x_i \in \mathcal{C}_2$ (recall that the class label satisfies $[v]_i = -1$ for $x_i \in \mathcal{C}_1$ and $[v]_i = +1$ for $x_i \in \mathcal{C}_2$), and it thus remains to derive σ. Note that
$$1 = \hat v^\mathsf{T}\hat v = \alpha + 2\sigma\sqrt{\alpha}\, w^\mathsf{T} v/\sqrt{n} + \sigma^2 = \alpha + \sigma^2 + o(1) \qquad (37)$$
where we recall $\|v\| = \sqrt{n}$, which, together with an argument on the normal fluctuations of $\hat v$ (Kadavankandy & Couillet, 2019), concludes the proof.
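To illustrate the decomposition (36) and (37), the following simulation sketch (our own; the dimensions p and n, the mean µ, the threshold s and the zeroed diagonal of K are illustrative choices) clusters a synthetic two-class mixture by the sign of the top eigenvector of the sparsified kernel and compares the empirical misclassification rate with the Gaussian approximation $Q(\sqrt{\alpha/(1-\alpha)})$ suggested by the normal-fluctuation argument above, with α estimated from the data.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p, n, s = 256, 2048, 0.75            # illustrative dimensions and truncation threshold
mu = np.zeros(p); mu[0] = 2.0        # deterministic mean with ||mu||^2 = 4 (illustrative SNR)

v = np.repeat([-1.0, 1.0], n // 2)                    # +-1 class-label vector
X = rng.standard_normal((p, n)) + np.outer(mu, v)     # x_i = [v]_i * mu + z_i

G = X.T @ X / np.sqrt(p)
K = (G * (np.abs(G) > np.sqrt(2) * s)) / np.sqrt(p)   # sparse f1(t) = t * 1_{|t| > sqrt(2) s}
np.fill_diagonal(K, 0.0)                              # implementation choice

v_hat = np.linalg.eigh(K)[1][:, -1]                   # top eigenvector of K
v_hat *= np.sign(v_hat @ v)                           # enforce v_hat^T v >= 0

alpha = (v_hat @ v) ** 2 / n                          # empirical alignment |v^T v_hat|^2 / n
err_emp = np.mean(np.sign(v_hat) != v)                # estimate C_i = sign([v_hat]_i)
err_th = norm.sf(np.sqrt(alpha / (1 - alpha)))        # normal-fluctuation approximation

print(f"alpha ~ {alpha:.3f}, empirical error {err_emp:.3f}, Gaussian approx. {err_th:.3f}")
```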
B ADDITIONAL EMPIRICAL RESULTS ON REAL-WORLD DATASETS

In Figure 14, we compare the clustering performance, the level of sparsity, and the computational time of the top eigenvector, of the sparse function f1 and the quantized f2 with M = 2 (so 2 bits per nonzero entry), on the MNIST dataset. We see that, different from the binary f3, with which small entries of K are set to zero, the quantized function f2, by letting the small entries of K take certain nonzero values, yields surprisingly good performance on the MNIST dataset. This performance gain comes, however, at the price of a somewhat heavy computational burden, approximately the same as for the original dense matrix XTX, since we lose the sparsity with f2; see Figure 1-(left). This may be interpreted as a trade-off between storage size and computational time.

Figure 14: Clustering performance (left), proportion of nonzero entries, and computational time of the top eigenvector (right, in markers) of sparse f1 and quantized f2 with M = 2, as a function of the truncation threshold s on the MNIST dataset: digits (0, 1) (left) and (5, 6) (middle and right), with n = 2 048 and the performance of the linear function in black. Results averaged over 100 runs.

Also, from the left and middle displays of Figure 7 and Figure 14, we see that for MNIST data, while the classification error rates on the pair of digits (0, 1) can be as low as 1%, the performances on the pair (5, 6) are far from satisfactory, with both the linear f(t) = t and the proposed f1, f2 or f3. This is a limitation of the proposed statistical model in (1), which only takes into account the first-order discriminative statistics. Indeed, it has been shown in (Liao & Couillet, 2019) that taking a2 = 0 (as in the case of the proposed f1, f2 and f3) asymptotically discards the second-order discriminative statistics in the covariance structure, and may thus result in suboptimal performance in the case of non-identity covariance. It would be of future interest to extend the current analysis to the generic Gaussian mixture classification N(µ1, C1) versus N(µ2, C2), by considering the impact of (i) asymmetric means µ1 and µ2 ≠ -µ1, (ii) the statistical information in the covariance structures C1 versus C2, and (iii) possibly a multi-class mixture model with number of classes K ≥ 3.

Figure 15: Clustering performance of sparse f1, quantized f2 (with M = 2) and binary f3 as a function of the truncation threshold s on: (left) Kuzushiji-MNIST class 3 versus 4, (middle) Fashion-MNIST class 0 versus 9, and (right) Kannada-MNIST class 4 versus 8, for n = 2 048 and the performance of the linear function in black. Results averaged over 100 runs.

Figure 15 compares the clustering performances of the proposed f1, f2, and f3 on other MNIST-like datasets, including the Fashion-MNIST (Xiao et al., 2017), Kuzushiji-MNIST (Clanuwat et al., 2018), and Kannada-MNIST (Prabhu, 2019) datasets. Then, Figure 16 compares the performances on representations of the ImageNet dataset (Deng et al., 2009) from the popular GoogLeNet (Szegedy et al., 2015), of feature dimension p = 2 048. On various real-world data or features, we made similar observations as in the case of the MNIST data in Figure 7 and Figure 14: the performances of the sparse f1 and binary f3 are very similar and generally degrade as the threshold s becomes large, while the quantized f2 yields consistently good performances that are extremely close to those of the linear function. This is in line with the (theoretically sustained) observation in (Seddik et al., 2020) that the deep representations of real-world datasets behave, in the large n, p regime, very similarly to simple Gaussian mixtures, thereby conveying a strong practical motivation for the present analysis.

Figure 16: Clustering performance of sparse f1, quantized f2 (with M = 2) and binary f3 as a function of the truncation threshold s on GoogLeNet features of the ImageNet dataset: (left) class "pizza" versus "daisy" and (right) class "hamburger" versus "coffee", for n = 1 024 and the performance of the linear function in black. Results averaged over 10 runs.
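As a complement to the runtime and sparsity measurements reported above, the short sketch below (our own; the sizes p and n, the threshold s and the use of a pure-noise data matrix are hypothetical choices, so the absolute numbers are only indicative) times the extraction of the top eigenvector from the dense kernel and from its sparsified CSR representation, and reports the resulting proportion of nonzero entries.

```python
import time
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(2)
p, n, s = 784, 4096, 2.0              # hypothetical sizes; a large s gives a very sparse K

X = rng.standard_normal((p, n))
G = X.T @ X / np.sqrt(p)
np.fill_diagonal(G, 0.0)

K_dense = G / np.sqrt(p)
K_sparse = csr_matrix((G * (np.abs(G) > np.sqrt(2) * s)) / np.sqrt(p))

t0 = time.perf_counter()
eigsh(K_dense, k=1, which="LA", tol=1e-6)    # top eigenpair from the dense kernel
t_dense = time.perf_counter() - t0

t0 = time.perf_counter()
eigsh(K_sparse, k=1, which="LA", tol=1e-6)   # top eigenpair from the sparsified kernel
t_sparse = time.perf_counter() - t0

print(f"nonzero proportion: {K_sparse.nnz / n**2:.2%}, "
      f"dense: {t_dense:.2f}s, sparse: {t_sparse:.2f}s")
```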