Journal of Machine Learning Research 20 (2019) 1-55. Submitted 1/18; Revised 10/18; Published 1/19

Spectrum Estimation from a Few Entries

Ashish Khetan khetan2@illinois.edu
Sewoong Oh swoh@illinois.edu
Department of Industrial and Enterprise Systems Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA

Editor: Moritz Hardt

Singular values of data in matrix form provide insights into the structure of the data, the effective dimensionality, and the choice of hyper-parameters for higher-level data analysis tools. However, in many practical applications such as collaborative filtering and network analysis, we only get a partial observation. Under such scenarios, we consider the fundamental problem of recovering spectral properties of the underlying matrix from a sampling of its entries. In this paper, we address the problem of directly recovering the spectrum, which is the set of singular values, and also sample-efficient approaches for recovering a spectral sum function, which is an aggregate sum of a fixed function applied to each of the singular values. Our approach is to first estimate the Schatten k-norms of a matrix for a small set of values of k, and then apply Chebyshev approximation when estimating a spectral sum function, or apply moment matching in Wasserstein distance when estimating the singular values directly. The main technical challenge is in accurately estimating the Schatten norms from a sampling of a matrix. We introduce a novel unbiased estimator based on counting small structures, called network motifs, in a graph, and provide guarantees that match its empirical performance. Our theoretical analysis shows that Schatten norms can be recovered accurately from a strictly smaller number of samples than is needed to recover the underlying low-rank matrix.
Numerical experiments suggest that we significantly improve upon a competing approach of using matrix completion methods below the matrix completion threshold, above which matrix completion algorithms recover the underlying low-rank matrix exactly.

Keywords: spectrum estimation, matrix completion, counting subgraphs.

© 2019 Ashish Khetan and Sewoong Oh. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/18-027.html.

1. Introduction

Computing and analyzing the set of singular values of data in matrix form, which is called the spectrum, provides insights into the geometry and topology of the data. Such a spectral analysis is routinely a first step in general data analysis, with the goal of checking whether there exists a lower-dimensional subspace explaining the important aspects of the data, which itself might be high-dimensional. Concretely, it is a first step in dimensionality reduction methods such as principal component analysis or canonical correlation analysis. However, spectral analysis becomes challenging in practical scenarios where the data is only partially observed. We commonly observe pairwise relations of randomly chosen pairs: each user only rates a few movies in recommendation systems, each player/team only plays against a few opponents in sports leagues, each word appears in the same sentence with a small number of other words in word count matrices, and each worker answers a few questions in crowdsourcing. In other applications, we have more structured samples. For example, in a network analysis we might be interested in the spectrum of a large network, but only get to see the connections within a small subset of nodes, corresponding to sampling a sub-matrix of the adjacency matrix. Whatever the sampling pattern is, the typical number of paired relations we observe is significantly smaller than the dimension of the data matrix.
We study all such variations in sampling patterns for partially observed data matrices, and ask the following fundamental question: can we estimate spectral properties of a data matrix from partial observations? We propose a novel approach that allows us to estimate the spectrum, i.e. the singular values. A crucial building block in our approach is that spectral properties can be accurately approximated from the first few moments of the spectrum, known as the Schatten k-norms, defined as

‖M‖_k = ( Σ_{i=1}^d σ_i(M)^k )^{1/k} ,

where σ_1(M) ≥ σ_2(M) ≥ ··· ≥ σ_d(M) ≥ 0 are the singular values of the data matrix M ∈ R^{d×d}. Once we obtain accurate estimates of Schatten k-norms, these estimates, as well as the corresponding performance guarantees, can readily be translated into accurate estimates of the spectrum of the matrix. Further, if we are interested in estimating the class of functions known as spectral sum functions of the form (4), our estimates of the Schatten norms can be used to estimate any spectral sum function using Chebyshev expansions. We want to estimate the Schatten k-norm of a positive semidefinite matrix M ∈ R^{d×d} from a subset of its entries. The restriction to positive semidefinite matrices is primarily for notational convenience; our analyses, the estimator, and the efficient algorithms naturally generalize to any non-square matrix. Namely, we can extend our framework to bipartite graphs and estimate the Schatten k-norm of any matrix for any even k. Let Ω denote the set of indices of the samples we are given, and let PΩ(M) = {(i, j, M_ij)}_{(i,j)∈Ω} denote the samples. With a slight abuse of notation, we use PΩ(M) to also denote the d×d sampled matrix

PΩ(M)_ij = M_ij if (i, j) ∈ Ω, and 0 otherwise,

and it should be clear from the context which one we refer to. Although we propose a framework that generally applies to any probabilistic sampling, it is necessary to consider specific sampling scenarios to provide tight analyses of the performance. Hence, we focus on Erdős-Rényi sampling.
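As a concrete reference point for the definition above, the following minimal sketch computes the Schatten k-norm directly from the full matrix (the function name `schatten_norm` is ours, not the paper's; this is not the partial-observation estimator developed later):

```python
import numpy as np

# Direct computation of the Schatten k-norm from the full matrix:
# ||M||_k = (sum_i sigma_i(M)^k)^(1/k).
def schatten_norm(M, k):
    sigma = np.linalg.svd(M, compute_uv=False)   # singular values, descending
    return float((sigma ** k).sum() ** (1.0 / k))

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
M = A @ A.T                                      # PSD, so singular values = eigenvalues
# For PSD M, (||M||_k)^k = Tr(M^k); check for k = 3.
assert np.isclose(schatten_norm(M, 3) ** 3, np.trace(M @ M @ M))
```

The last assertion anticipates the trace identity used throughout the paper for positive semidefinite matrices.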
There is an extensive line of research on low-rank matrix completion problems (Candès and Recht, 2009; Keshavan et al., 2010a), which addresses the fundamental question of how many samples are required to complete a matrix (i.e. estimate all the missing entries) from a small subset of sampled entries. It is typically assumed that each entry of the matrix is sampled independently with a probability p ∈ (0, 1]. We refer to this scenario as Erdős-Rényi sampling, as the resulting pattern of the samples, encoded as a graph, is distributed as an Erdős-Rényi random graph. The spectral properties of such a sampled matrix have been well studied in the literature (Friedman et al., 1989; Achlioptas and McSherry, 2001; Feige and Ofek, 2005; Keshavan et al., 2010a; Le et al., 2015). In particular, it is known that, under certain incoherence assumptions, the original matrix is close in spectral norm to the sampled one where the missing entries are filled in with zeros and properly rescaled. This suggests using the singular values of the sampled and rescaled matrix (d²/|Ω|) PΩ(M) directly for estimating the Schatten norms. However, in the sub-linear regime in which the number of samples |Ω| = d²p is comparable to or significantly smaller than the degrees of freedom in representing a symmetric rank-r matrix, which is dr − r², the spectrum of the sampled matrix is significantly different from the spectrum of the original matrix, as shown in Figure 1. We need to design novel estimators that are more sample-efficient in the sub-linear regime where d²p ≪ dr.

Figure 1 (legend: true spectrum, sampled spectrum): In yellow, we show the histogram of the singular values of a positive semi-definite matrix M ∈ R^{d×d} of size d = 1000 with rank r = 100, with σ_1 = ··· = σ_50 = 10, σ_51 = ··· = σ_100 = 5, and the rest at zero (we omit zero singular values in the plot for illustration).
In comparison, we show in black the histogram of the singular values of the sampled matrix, where each entry of M is sampled with probability p = (1/d) r^{1−2/7} (properly rescaled by 1/p to best match the original spectrum).

1.2. Summary of the approach and preview of results

We propose first estimating one or a few Schatten norms, which can be accurately estimated from samples, and then using these estimated Schatten norms to approximate the spectral properties of interest: spectral sum functions and the spectrum. We use an alternative expression of the Schatten k-norm for positive semidefinite matrices as the trace of the k-th power of M, i.e. (‖M‖_k)^k = Tr(M^k). This sum of the entries along the diagonal of M^k is the total weight of all the closed walks of length k. Consider the entries of M as weights on a complete graph K_d over d nodes (with self-loops). A closed walk of length k is defined as a sequence of nodes w = (w_1, w_2, . . . , w_{k+1}) with w_1 = w_{k+1}, where we allow repeated nodes and repeated edges. The weight of a closed walk w = (w_1, . . . , w_k, w_1) is defined as ω_M(w) ≜ ∏_{i=1}^k M_{w_i w_{i+1}}, which is the product of the weights along the walk. It follows that

(‖M‖_k)^k = Σ_{w: all length-k closed walks} ω_M(w) .    (1)

Following the notation for enumeration of small simple cycles in a graph by Alon et al. (1997), we partition this summation into groups of walks with the same pattern H, which we call a k-cyclic pseudograph. Let C_k = (V_k, E_k) denote the undirected simple cycle graph with k nodes; e.g. A_3 in Figure 2 is C_3. We expand the standard notion of simple k-cyclic graphs to include multi-edges and loops, hence the name pseudograph.

Definition 1 We define an unlabelled and undirected pseudograph H = (V_H, E_H) to be a k-cyclic pseudograph for k ≥ 3 if there exists an onto node-mapping from C_k = (V_k, E_k), i.e. f : V_k → V_H, and a one-to-one edge-mapping g : E_k → E_H such that g(e) = (f(u_e), f(v_e)) for all e = (u_e, v_e) ∈ E_k. We use H_k to denote the set of all k-cyclic pseudographs.
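Equation (1) can be checked by brute force on a tiny matrix. The helper below (our own illustrative code, not the paper's algorithm) enumerates all d^k closed walks, which is only feasible for very small d and k:

```python
import itertools
import numpy as np

# Brute-force check of Eq. (1): summing omega_M(w) over all d^k closed
# walks of length k recovers Tr(M^k) = (||M||_k)^k for a PSD matrix M.
def closed_walk_sum(M, k):
    d = M.shape[0]
    total = 0.0
    for w in itertools.product(range(d), repeat=k):
        walk = list(w) + [w[0]]               # w_{k+1} = w_1 closes the walk
        weight = 1.0
        for i in range(k):
            weight *= M[walk[i], walk[i + 1]]  # omega_M(w) = prod_i M_{w_i w_{i+1}}
        total += weight
    return total

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
M = A @ A.T                                    # small PSD test matrix
assert np.isclose(closed_walk_sum(M, 4), np.trace(np.linalg.matrix_power(M, 4)))
```

The exponential cost of this enumeration is exactly what motivates the pattern-based decomposition that follows.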
We use c(H) to denote the number of distinct node mappings f from C_k to a k-cyclic pseudograph H.

Figure 2: The 3-cyclic pseudographs H_3 = {A_1, A_2, A_3}, with c(A_1) = 1, c(A_2) = 3, and c(A_3) = 6.

In the above example, each member of H_3 is a distinct pattern that can be mapped from C_3. For A_1, it is clear that there is only one mapping from C_3 to A_1 (i.e. c(A_1) = 1). For A_2, one can map any of the three nodes to the left node of A_2, hence c(A_2) = 3. For A_3, any of the three nodes can be mapped to the bottom-left node of A_3, and one can then map the remaining nodes clockwise or counter-clockwise, resulting in c(A_3) = 6. For k ≤ 7, all the k-cyclic pseudographs are given in Section 9 (see Figures 10-15). Each closed walk w of length k is associated with one of the graphs in H_k, as there is a unique H of which the walk is an Eulerian cycle (under a one-to-one mapping of the nodes). We denote this graph by H(w) ∈ H_k. Considering the weight of a walk ω_M(w), there are multiple distinct walks with the same weight. For example, a length-3 walk w = (v_1, v_2, v_2, v_1) has H(w) = A_2, and there are 3 walks with the same weight ω(w) = (M_{v_1 v_2})² M_{v_2 v_2}, i.e. (v_1, v_2, v_2, v_1), (v_2, v_2, v_1, v_2), and (v_2, v_1, v_2, v_2). This multiplicity of the weight depends only on the structure H(w) of a walk, and it is exactly c(H(w)), the number of mappings from C_k to H(w) in Definition 1. The total sum of the weights of closed walks of length k can be partitioned by their respective patterns, which makes the computation of such terms more efficient (see Section 2) and also makes de-biasing straightforward (see Equation (3)):

(‖M‖_k)^k = Σ_{H ∈ H_k} ω_M(H) c(H) ,    (2)

where, with a slight abuse of notation, we let ω_M(H) for H ∈ H_k be the sum of all distinct weights of walks w with H(w) = H, and c(H) is the multiplicity of each of those distinct weights. This gives an alternative tool for computing the Schatten k-norm without explicitly computing the singular values.
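The multiplicities c(A_1) = 1, c(A_2) = 3, c(A_3) = 6 can be verified by enumeration. In this illustrative check (our own code, under the assumption that two walks share a weight exactly when they traverse the same multiset of undirected edges), every group of weight-sharing walks has size equal to c(H) of its pattern:

```python
import itertools
from collections import Counter
from math import comb

# Group all d^3 closed walks of length 3 on d nodes by the multiset of
# undirected edges they traverse; each group shares one weight, and the
# group size equals c(H) of the common pattern H in H_3 = {A1, A2, A3}.
d = 4
groups = Counter()
for w in itertools.product(range(d), repeat=3):
    walk = list(w) + [w[0]]                                       # close the walk
    edges = Counter(frozenset({walk[i], walk[i + 1]}) for i in range(3))
    groups[frozenset(edges.items())] += 1

sizes = Counter(groups.values())
# A1: d groups of size 1;  A2: d*(d-1) groups of size 3;  A3: C(d,3) groups of size 6
assert sizes == Counter({1: d, 3: d * (d - 1), 6: comb(d, 3)})
```

The group counts also confirm the walk budget: d·1 + d(d−1)·3 + C(d,3)·6 = d³.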
Given access only to a subset of sampled entries, one might be tempted to apply the above formula to the sampled matrix with an appropriate scaling, i.e. Σ_{H ∈ H_k} ω_{PΩ(M)}(H) c(H), to estimate ‖M‖_k^k. However, this is significantly biased. To eliminate the bias, we propose rescaling each term in (1) by the inverse of the probability of sampling that particular walk w (i.e. the probability that all edges in w are sampled). A crucial observation is that, for any sampling model that is invariant under a relabelling of the nodes, this probability only depends on the pattern H(w). In particular, this is true for Erdős-Rényi sampling. Based on this observation, we introduce a novel estimator that de-biases each group separately:

Θ̂_k(PΩ(M)) = Σ_{H ∈ H_k} (1/p(H)) ω_{PΩ(M)}(H) c(H) ,    (3)

where p(H) is the probability that a pattern H is sampled, i.e. that all edges traversed in a walk w with H(w) = H are sampled. It immediately follows that this estimator is unbiased, i.e. E_Ω[Θ̂_k(PΩ(M))] = ‖M‖_k^k, where the randomness is in Ω. However, computing this estimate can be challenging. Naive enumeration over all closed walks of length k takes time scaling as O(d Δ^{k−1}), where Δ is the maximum degree of the graph. Except for extremely sparse graphs, this is impractical. Inspired by the work of Alon et al. (1997) on counting short cycles in a graph, we introduce a novel and efficient method for computing the proposed estimate for small values of k.

Proposition 2 For a positive semidefinite matrix M and any sampling pattern Ω, the proposed estimate Θ̂_k(PΩ(M)) in (3) can be computed in time O(d^α) for k ∈ {3, 4, 5, 6, 7}, where α < 2.373 is the exponent of matrix multiplication. For k = 1 or 2, Θ̂_k(PΩ(M)) can be computed in time O(d) and O(d²), respectively.

This bound holds regardless of the degree, and the complexity can be even smaller for sparse graphs, as matrix multiplications are then more efficient. We give a constructive proof by introducing a novel algorithm achieving this complexity in Section 2.
For k ≥ 8, our approach can potentially be extended, but the complexity of the problem fundamentally changes, as it is at least as hard as counting copies of K_4 in a graph, for which the best known run-time is O(d^{α+1}) for general graphs (Kloks et al., 2000). We make the following contributions in this paper:

We give in (3) an unbiased estimator of the Schatten k-norm of a positive semidefinite matrix M from a random sampling of its entries. In general, the complexity of computing the estimate scales as O(d Δ^{k−1}), where Δ is the maximum degree (number of sampled entries in a column) of the sampled matrix. We propose a novel efficient algorithm for computing the estimate in (3) exactly for small k ≤ 7, which involves only matrix operations. This algorithm is significantly more efficient, with run-time scaling as O(d^α) independent of the degree, for all k ≤ 7 (see Proposition 2).

Under the typical Erdős-Rényi sampling, we show that the Schatten k-norm of an incoherent rank-r matrix can be approximated within any constant multiplicative error with a number of samples scaling as O(dr^{1−2/k}) (see Theorem 3). In particular, this is strictly smaller than the number of samples necessary to complete the matrix, which scales as O(dr log d). Below this matrix completion threshold, numerical experiments confirm that the proposed estimator significantly outperforms the simple heuristics of using singular values of the sampled matrices directly or applying state-of-the-art matrix completion methods (see Figure 4).

Given estimates of the first K Schatten norms, it is straightforward to approximate spectral sum functions of the form (4) using Chebyshev expansions, and also to estimate the spectrum itself using moment matching in Wasserstein distance. We apply our Schatten norm estimates to the applications of estimating the generalized rank studied in Zhang et al. (2015) and estimating the spectrum studied in Kong and Valiant (2016).
We provide performance guarantees for both applications and present experimental results suggesting that we improve upon competing methods.

We propose a new sampling model, which we call graph sampling, that preserves the structural properties of the pattern of the samples. We identify a fundamental property of the structure of the pattern (λ_{G,r} in Eq. (14)) that captures the difficulty of estimating the Schatten k-norm from such graph sampling (see Theorem 11). Under this graph sampling, we show that there are sampling patterns that are significantly more efficient for estimating the spectral properties than Erdős-Rényi sampling.

In the remainder of this section, we review existing work on Schatten norm approximation; we then provide an efficient implementation of the estimator (3) for small k in Section 2. In Section 3, we provide a theoretical analysis of our estimator under the Erdős-Rényi sampling scenario. In Section 4, we provide a theoretical analysis under the graph sampling scenario. We conclude with a discussion of interesting observations and remaining challenges in Section 5.

1.3. Related work

We review existing methods for approximating Schatten norms, counting small structures in graphs, and various applications of Schatten norms.

Estimating k-Schatten norms of a data matrix. The proposed Schatten norm estimator can be used as a black box in various applications where we want to test a property of a data matrix or a network but are limited to observing only a small portion of the data. These include, for example, network forensics, matrix spectral property testing, and testing for graph isospectral properties. Relatively little is known under the matrix completion setting studied in this paper. However, Schatten norm estimation under different resource-constrained scenarios has been studied.
Hutchinson (1990) proposes a randomized algorithm for approximating the trace of any large matrix, where the constraint is on the computational complexity. The goal is to design a random rank-one linear mapping such that the trace is preserved in expectation and the variance is small (Avron and Toledo, 2011; Roosta-Khorasani and Ascher, 2015). Li et al. (2014) propose an optimal bilinear sketching of a data matrix, where the constraint is on the memory, i.e. the size of the resulting sketch. The goal is to design a sketch of a data matrix M using minimal storage and a corresponding approximate reconstruction method for ‖M‖_k^k. Li and Woodruff (2016) propose an optimal streaming algorithm where only one pass over the data is allowed in a data stream model and the constraint is on the space complexity of the algorithm. The goal is to design a streaming algorithm using minimal space to estimate ‖M‖_k^k. Zhang et al. (2015) propose an estimator under a distributed setting where columns of the data are stored in distributed storage and the constraint is on the communication complexity. The goal is to design a distributed protocol minimizing the communication needed to estimate ‖M‖_k^k. Given a random vector X, Kong and Valiant (2016) propose an optimal estimator for the Schatten k-norm of the covariance matrix, where the constraint is on the number of samples n. The goal is to design an estimator using the minimum number of samples to estimate ‖E[XX^T]‖_k^k. One of our contributions is an efficient algorithm for computing the weighted counts of small structures in Section 2, which can significantly improve upon less sample-efficient counterparts in, for example, (Kong and Valiant, 2016).
Under the setting of (Kong and Valiant, 2016) (and also (Li et al., 2014)), the main idea of the estimator is that the weight of each length-k cycle in the observed empirical covariance matrix (1/n) Σ_{i=1}^n X_i X_i^T provides an unbiased estimate of ‖E[XX^T]‖_k^k. One prefers to sum over the weights of as many cycles as computationally allowed in order to reduce the variance. As counting all cycles is in general computationally hard, they propose counting only increasing cycles (which account for only a 1/k! fraction of all the cycles), which can be computed in time O(d^α). If one has an efficient method to count all the (weighted) cycles, then the variance of the estimator could potentially decrease by a factor of order k!. For k ≤ 7, our proposed algorithm in Section 2 provides exactly such an estimator. We replace (Kong and Valiant, 2016, Algorithm 1) with ours, and run the same experiment to showcase the improvement in Figure 3, for dimension d = 2048 and various numbers of samples n, comparing the multiplicative error in estimating ‖E[XX^T]‖_k^k for k = 7. With the same run-time, a significant gain is achieved by simply substituting our proposed algorithm for counting small structures in the sub-routine. In general, the efficient algorithm we propose might be of independent interest in various applications, and can directly replace (and significantly improve upon) other popular but less efficient counterparts. One of the main challenges under the sampling scenario considered in this paper is that existing counting methods like that of (Kong and Valiant, 2016) cannot be applied, regardless of how much computational power we have. Under the matrix completion scenario, we need to (a) sum over all small structures H ∈ H_k and not just C_k as in (Kong and Valiant, 2016); and (b) for each structure, sum over all subgraphs with the same structure and not just those walks whose labels form a monotonically increasing sequence as in (Kong and Valiant, 2016).
Algorithms for counting structures. An important problem in graph theory is to count the number of small structures, also called network motifs, in a given graph.

Figure 3: By replacing Algorithm 1 in (Kong and Valiant, 2016), which counts only increasing simple cycles, with our proposed algorithm, which counts all simple cycles, a significant gain is achieved in estimating ‖E[XX^T]‖_k^k for k = 7 (relative error |‖E[XX^T]‖_k^k − estimate| / ‖E[XX^T]‖_k^k versus the number of samples n ∈ {256, 512, 1024, 2048}).

This has many practical applications in designing good LDPC codes (Tian et al., 2004), understanding the properties of social networks (Ugander et al., 2013), and explaining gene regulation networks (Shen-Orr et al., 2002). Exact and approximate algorithms have been proposed in (Alon et al., 1997; Kloks et al., 2000; Liu and Wang, 2006; Halford and Chugg, 2006; Karimi and Banihashemi, 2013; Wang et al., 2014). The most relevant is the work of Alon et al. (1997) on counting the number of cycles C_k, where counts of various small structures called k-cyclic graphs are used as sub-routines and efficient approaches are proposed for k ≤ 7. These are similar to k-cyclic pseudographs, but with multi-edges condensed to a single edge. When counting cycles in a simple (unweighted) graph, k-cyclic graphs are sufficient, as all the edges have weight one. Hence, one does not need to track how many times an edge has been traversed; the weight of that walk is always one. In our setting, the weight of a walk depends on how many times each edge has been traversed, which we track using multi-edges. It is therefore crucial to introduce the class of k-cyclic pseudographs in our estimator. In a distributed environment, fast algorithms for counting small structures have been proposed by Elenberg et al. (2015) and Elenberg et al. (2016) for small values of k ∈ {3, 4}.
However, the main strength of their approach is in distributed computing, and under the typical centralized setting we study, this approach can be slower by a factor exponential in k for, say, k = 7.

From Schatten norms to spectral sum functions. A dominant application of Schatten norms is in approximating a family of functions of a matrix called spectral sum functions (Han et al., 2016), of the form

Σ_{i=1}^d f(σ_i(M)) ≈ Σ_{k=0}^K a_k { Σ_{i=1}^d σ_i(M)^k } .    (4)

A typical approach is to compute the coefficients of a Chebyshev approximation of f, which immediately leads to an approximation of the spectral sum function of interest as a weighted sum of Schatten k-norms. This follows from the approximation f(x) ≈ Σ_{k=0}^K a_k x^k. This approach has been widely used in fast methods for approximating the log-determinant (Pace and LeSage, 2004; Zhang and Leithead, 2007; Boutsidis et al., 2015; Aune et al., 2014; Han et al., 2015), corresponding to f(x) = log x. Practically, log-determinant computations are routinely (approximately) required in applications including Gaussian graphical models (Rue and Held, 2005), minimum-volume ellipsoids (Van Aelst and Rousseeuw, 2009), and metric learning (Davis et al., 2007). Fast methods for approximating the trace of a matrix inverse have been studied in (Wu et al., 2016; Chen, 2016), corresponding to f(x) = x^{−1}, motivated by applications in lattice quantum chromodynamics (Stathopoulos et al., 2013). Fast methods for approximating the Estrada index have been studied in (Han et al., 2016), corresponding to f(x) = exp(x). Practically, it is used in characterizing 3-dimensional molecular structure (Estrada, 2000) and measuring graph centrality (Estrada and Hatano, 2007), the entropy of a graph (Carbó-Dorca, 2008), and the bipartivity of a graph (Estrada and Rodríguez-Velázquez, 2005).
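The Chebyshev route in Eq. (4) can be sketched numerically. In this hedged example (our own illustration, assuming the spectrum is known to lie in [1, 2]), we fit f(x) = log x by a degree-K Chebyshev series, convert to power-basis coefficients a_k, and combine them with the power sums Σ_i σ_i^k, i.e. the k-th powers of the Schatten k-norms:

```python
import numpy as np

# Approximate the log-determinant sum_i log(sigma_i) via Eq. (4):
# f(x) ~ sum_k a_k x^k on the support of the spectrum, then
# sum_i f(sigma_i) ~ sum_k a_k * (sum_i sigma_i^k).
rng = np.random.default_rng(2)
sigma = rng.uniform(1.0, 2.0, size=100)                  # spectrum supported on [1, 2]
K = 8
xs = np.linspace(1.0, 2.0, 200)
c = np.polynomial.chebyshev.chebfit(xs, np.log(xs), K)   # Chebyshev-basis fit of log
a = np.polynomial.chebyshev.cheb2poly(c)                 # a[k] multiplies x^k

moments = np.array([(sigma ** k).sum() for k in range(K + 1)])
approx = float(a @ moments)                              # sum_k a_k sum_i sigma_i^k
exact = float(np.log(sigma).sum())                       # the true spectral sum
assert abs(approx - exact) / abs(exact) < 1e-3
```

Because log is smooth on [1, 2], a degree-8 expansion is already very accurate; in the paper's setting the power sums would be replaced by the estimated Schatten norms.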
Approximating the generalized rank under communication constraints has been studied in (Zhang et al., 2015), corresponding to f(x; c_1) = I(x ≥ c_1). The generalized rank approximates a necessary tuning parameter in a number of problems where low-rank solutions are sought, including robust PCA (Candès et al., 2011; Netrapalli et al., 2014) and matrix completion (Keshavan et al., 2010b,a; Jain et al., 2013), and is also required in sampling-based methods in numerical analysis (Mahoney et al., 2011; Halko et al., 2011). Similarly, (Saade et al., 2015) studied the number of singular values in an interval, corresponding to f(x; c_1, c_2) = I(c_1 ≤ x ≤ c_2). In practice, a number of eigensolvers (Polizzi, 2009; Sakurai and Sugiura, 2003; Schofield et al., 2012) require the number of eigenvalues in a given interval. For a more comprehensive list of references and applications of this framework, we refer to the related work section in (Han et al., 2016). In a recent work, Kong and Valiant (2016) provide a novel approach to the challenging problem of estimating the singular values themselves. Viewing the histogram of the singular values as a one-dimensional distribution and the Schatten k-norm as the k-th moment of this distribution, the authors provide an innovative algorithm to estimate the histogram that best matches the moments in Wasserstein distance.

Matrix completion. Low-rank matrix completion addresses the problem of recovering a low-rank matrix from its sampled entries. Tight lower and upper bounds on the sample complexity are well studied, both in the case where one wants exact recovery from noiseless samples (Candès and Recht, 2009; Keshavan et al., 2010a; Bhojanapalli and Jain, 2014), and in the case where samples are noisy and one wants approximate recovery (Keshavan et al., 2010b; Negahban and Wainwright, 2012). In practical applications, one might not have enough samples to estimate all the missing entries with sufficient accuracy.
However, one might still be able to infer important spectral properties of the data, such as the singular values or the rank. Such spectral properties can also assist in making decisions on how many more samples to collect in order to make accurate inferences on the quantity of interest. In this paper, one of the fundamental questions we ask, and answer affirmatively, is: can we accurately recover the spectral properties of a low-rank matrix from a sampling of its entries, below the matrix completion threshold?

2. Efficient Algorithm

In this section we give a constructive proof of Proposition 2, inspired by the seminal work of Alon et al. (1997), generalizing their counting algorithm for k-cyclic graphs to counting (weighted) k-cyclic pseudographs. In computing the estimate in (3), c(H) can be computed in time O(k!). Suppose p(H) has been computed; we explain how to compute p(H) for Erdős-Rényi sampling and graph sampling in Sections 3 and 4. The bottleneck then is computing the weights ω_{PΩ(M)}(H) for each H ∈ H_k. Let γ_M(H) ≜ ω_M(H) c(H). We give matrix-multiplication-based equations to compute γ_M(H) for every H ∈ H_k for k ∈ {3, 4, 5, 6, 7}. This establishes that γ_M(H), and hence ω_M(H), can be computed in time O(d^α), proving Proposition 2. For any matrix A ∈ R^{d×d}, let diag(A) be the diagonal matrix such that (diag(A))_ii = A_ii for all i ∈ [d] and (diag(A))_ij = 0 for all i ≠ j ∈ [d]. For a given matrix M ∈ R^{d×d}, define O_M to be the matrix of off-diagonal entries of M, that is, O_M ≜ M − diag(M), and define D_M ≜ diag(M). Let tr(A) denote the trace of A, that is, tr(A) = Σ_{i∈[d]} A_ii. Let A · B denote the standard matrix multiplication of two matrices A and B. Consider computing γ_M(H) for H ∈ H_3 as labeled in Figure 2:

γ_M(A_1) = tr(D_M · D_M · D_M)    (5)
γ_M(A_2) = 3 tr(D_M · O_M · O_M)    (6)
γ_M(A_3) = tr(O_M · O_M · O_M)    (7)

The first weighted sum, γ_M(A_1), is the sum of the weights of all walks of length 3 that consist of three self-loops. One can show that γ_M(A_1) = Σ_{i∈[d]} M_ii³, which is (5) in our matrix operation notation.
Similarly, γ_M(A_3) is the sum of weights of length-3 walks with no self-loop, which leads to (7), and γ_M(A_2) is the sum of weights of length-3 walks with a single self-loop, which leads to (6). The factor 3 accounts for the fact that the self-loop could have been placed at the first, second, or third position in the walk. Similarly, for each k-cyclic pseudograph in H_k for k ≤ 7, computing γ_M(H) involves a few matrix operations with run-time O(d^α). We provide the complete set of explicit expressions in Section 10. A MATLAB implementation of the estimator (3), which includes as sub-routines the computation of the weights of all k-cyclic pseudographs, is available for download at https://github.com/khetan2/Schatten_norm_estimation. The explicit formulae in Section 10, together with the implementation at the above URL, might be of interest for other problems involving counting small structures in graphs. For k = 1, the estimator simplifies to Θ̂_k(PΩ(M)) = (1/p) Σ_i PΩ(M)_ii, which can be computed in time O(d). For k = 2, the estimator simplifies to Θ̂_k(PΩ(M)) = (1/p) Σ_{i,j} PΩ(M)_ij², which can be computed in time O(|Ω|). However, for k ≥ 8, there exist walks over K_4, a clique over 4 nodes, that cannot be decomposed into simple computations involving matrix operations. The best known algorithm for the simpler task of counting K_4 has run-time scaling as O(d^{α+1}), which is fundamentally different. We refer to Section 5 for further discussion of the computational complexity beyond k = 7.

Algorithm 1 Schatten k-norm estimator
Require: PΩ(M), k, H_k, p(H) for all H ∈ H_k
Ensure: Θ̂_k(PΩ(M))
1: if k ≤ 7 then
2:   for each H ∈ H_k, compute γ_{PΩ(M)}(H) using the formulae from Eq. (5)-(7) for k = 3 and Eq. (60)-(203) for k ∈ {4, 5, 6, 7}
3:   Θ̂_k(PΩ(M)) ← Σ_{H ∈ H_k} (1/p(H)) γ_{PΩ(M)}(H)
4: else
5:   Θ̂_k(PΩ(M)) ← Algorithm 4[PΩ(M), k, H_k, p(H) for all H ∈ H_k] [Section 6]

3.
Erdős-Rényi sampling

Under the stylized but canonical Erdős-Rényi sampling, notice that the probability p(H) that we observe all edges in a walk with pattern H is

p(H) = p^{m(H)} ,

where p is the probability that an edge is sampled and m(H) is the number of distinct edges in the k-cyclic pseudograph H. Plugging this value of p(H), which can be computed in time linear in k, into the estimator (3), we get an estimate customized for Erdős-Rényi sampling. Given a rank-r matrix M, the difficulty of estimating properties of M from sampled entries is captured by the incoherence of the original matrix M, which we denote by µ(M) ∈ R (Candès and Recht, 2009). Formally, let M = UΣU^T be the singular value decomposition of a positive definite matrix, where U is a d×r orthonormal matrix and Σ = diag(σ_1, . . . , σ_r) with singular values σ_1 ≥ σ_2 ≥ ··· ≥ σ_r > 0. Let U_ia denote the entry of U in the i-th row and a-th column. The incoherence µ(M) is defined as the smallest positive value µ such that the following hold:

A1. For all i ∈ [d], we have Σ_{a=1}^r U_ia² (σ_a/σ_1) ≤ µ r/d.
A2. For all i ≠ j ∈ [d], we have |Σ_{a=1}^r U_ia U_ja (σ_a/σ_1)| ≤ µ √r/d.

The incoherence measures how well the matrix is spread out, and is a common measure of the difficulty of completing a matrix from random samples (Candès and Recht, 2009; Keshavan et al., 2010a). We note that our incoherence condition depends on the singular values, unlike the one given in (Candès and Recht, 2009), which depends only on the row and column spaces. The lower the incoherence, the more spread out the entries are, and estimation is easier. On the other hand, if there are a few entries that are much larger than the rest, estimating a property of the matrix (such as the Schatten k-norm) from uniformly sampled entries can be extremely challenging.

3.1.
Performance guarantee

For any $d \times d$ positive semidefinite matrix $M$ of rank $r$ with incoherence $\mu(M) = \mu$ and condition number $\kappa = \sigma_{\max}(M)/\sigma_{\min}(M)$, we define
$\rho^2 \;\triangleq\; (\kappa\mu)^{2k} \, g(k) \, \max\big\{1, \, (dp)^{k-1}/d\big\}$,
such that the variance of our estimator is bounded by $\mathrm{Var}\big(\hat{\Theta}_k(P_\Omega(M))/\|M\|_k^k\big) \le \rho^2 \big(r^{1-2/k}/(dp)\big)^k$, as we show in the proof of Theorem 3 in Section 8.1. Here, $g(k) = O(k!)$ is a function depending only on $k$.

Theorem 3 (Upper bound under Erdős–Rényi sampling) For any integer $k \in [3, \infty)$, any $\delta > 0$, any rank-r positive semidefinite matrix $M \in \mathbb{R}^{d \times d}$, and given i.i.d. samples of the entries of $M$ with probability $p$, the proposed estimate (3) achieves normalized error $\delta$ with probability bounded by
$\mathbb{P}\Big( \big|\hat{\Theta}_k(P_\Omega(M)) - \|M\|_k^k\big| \ge \delta \|M\|_k^k \Big) \;\le\; \frac{\rho^2}{\delta^2} \Big( \frac{r^{1-2/k}}{dp} \Big)^{k}. \quad (8)$

Consider a typical scenario where $\mu$, $\kappa$, and $k$ are finite with respect to $d$ and $r$, and the rank is sufficiently small, $r = O(d^{k/((k-1)(k-2))})$. Then the Chebyshev bound in (8) implies that a sample complexity of $d^2 p = O(d r^{1-2/k})$ is sufficient to recover $\|M\|_k^k$ up to an arbitrarily small multiplicative error and an arbitrarily small (but strictly positive) error probability. This is strictly less than the known minimax sample complexity for recovering the entire low-rank matrix, which scales as $\Theta(rd \log d)$. As we seek to estimate only a property of the matrix (i.e., the Schatten k-norm) and not the whole matrix itself, we can be more sample efficient, by a factor of $r^{2/k}$ in the rank and a factor of $\log d$ in the dimension. We emphasize that such a gain can only be established using the proposed estimator based on the structure of the k-cyclic pseudographs. We will show empirically that standard matrix completion approaches fail in the critical regime of samples below the recovery threshold of $O(rd \log d)$.
Figure 4 is a scatter plot of the absolute relative error in the estimated Schatten k-norm, $\big|\|M\|_k^k - \widehat{\|M\|_k^k}\big| / \|M\|_k^k$, for $k = 5$, for three approaches: the proposed estimator, the Schatten norm of the scaled sampled matrix (after rank-r projection), and the Schatten norm of the completed matrix, using the state-of-the-art alternating minimization algorithm (Jain et al., 2013). All three estimators are evaluated 20 times for each value of $p$. $M$ is a symmetric positive semidefinite matrix of size $d = 500$ and rank $r = 100$ (left panel) or $r = 500$ (right panel). The singular vectors $U$ of $M = U \Sigma U^\top$ are generated by a QR decomposition of a random matrix with i.i.d. $N(0, 1)$ entries, and each $\Sigma_{ii}$ is uniformly distributed over $[1, 2]$. For the low-rank matrix on the left, there is a clear critical value of $p \simeq 0.45$, above which matrix completion is exact with high probability. However, this algorithm knows the underlying rank and crucially exploits the fact that the underlying matrix is exactly low rank. In comparison, our approach is agnostic to the low-rank assumption but finds an accurate estimate that adapts to the actual rank in a data-driven manner. Using the first $r$ singular values of the (rescaled) sampled matrix fails miserably in all regimes (we truncate the error at one for illustration purposes). In this paper, we are interested in the regime where exact matrix completion is impossible because we do not have enough samples to exactly recover the underlying matrix: $p \lesssim 0.45$ in the left panel and all regimes in the right panel.
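The experimental setup above can be sketched as follows (our reading of the description; the function names and the symmetric-mask construction are our own, not from the released code). It generates the PSD test matrix and implements baseline (2), the rank-r projection of the rescaled sampled matrix:

```python
import numpy as np

def make_test_matrix(d=500, r=100, seed=0):
    """PSD test matrix as in the Figure 4 setup: U from a QR decomposition of
    a Gaussian matrix, singular values uniform on [1, 2]."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    s = rng.uniform(1.0, 2.0, size=r)
    return Q @ np.diag(s) @ Q.T, np.sort(s)[::-1]

def scaled_sample_baseline(M, p, k, r, seed=0):
    """Baseline (2): Schatten k-norm of the best rank-r projection of the
    rescaled sampled matrix (1/p) P_Omega(M), with symmetric ER sampling."""
    rng = np.random.default_rng(seed)
    mask = rng.random((M.shape[0], M.shape[0])) < p
    mask = np.triu(mask) | np.triu(mask, 1).T      # observe edges symmetrically
    A = np.where(mask, M, 0.0) / p
    sv = np.linalg.svd(A, compute_uv=False)[:r]    # best rank-r projection
    return np.sum(sv ** k)
```

At $p = 1$ the baseline recovers $\|M\|_k^k$ exactly; its failure below the completion threshold is what Figure 4 illustrates.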
The sufficient condition of $d^2 p = O(d r^{1-2/k})$ in Theorem 3 holds for a broad range of parameters where the rank is sufficiently small, $r = O(d^{k/((k-1)(k-2))})$ (to ensure that the first term in $\rho^2$ dominates).

Figure 4: The proposed estimator outperforms both baseline approaches below the matrix completion threshold. For $k = 5$, comparison of the absolute relative error in the estimated Schatten norm, $\big|\|M\|_k^k - \widehat{\|M\|_k^k}\big| / \|M\|_k^k$, for the three algorithms: (1) the proposed estimator, $\widehat{\|M\|_k^k} = \hat{\Theta}_k(P_\Omega(M))$; (2) the Schatten norm of the scaled sampled matrix, $\widehat{\|M\|_k^k} = \|(1/p) \mathcal{P}_r(P_\Omega(M))\|_k^k$; (3) the Schatten norm of the completed matrix $\widetilde{M} = \mathrm{AltMin}(P_\Omega(M))$ from (Jain et al., 2013), $\widehat{\|M\|_k^k} = \|\widetilde{M}\|_k^k$; here $\mathcal{P}_r(\cdot)$ is the standard best rank-r projection of a matrix. $\Omega$ is generated by Erdős–Rényi sampling of the matrix $M$ with probability $p$ (horizontal axis). Left panel: $d = 500$, $r = 100$; right panel: $d = 500$, $r = 500$.

However, the following results in Figure 5 on numerical experiments suggest that our analysis holds more generally, for all regimes of the rank $r$, even those close to $d$. $M$ is generated using settings similar to those of Figure 4, and the empirical probabilities are computed by averaging over 100 instances. One might hope to tighten the Chebyshev bound by exploiting the fact that the correlation among the summands in our estimator (3) is weak. This can be made precise using a recent result of Schudy and Sviridenko (2011), where a Bernstein-type bound was proved for sums of polynomials of independent random variables that are weakly correlated. The first term in the bound (9) is the natural Bernstein-type analogue of the Chebyshev bound in (8).
However, in the regime where $k$ is large or $p$ is large, the correlation among the summands becomes stronger, and the second and third terms in the bound (9) start to dominate. In the typical regime of interest where $\mu$, $\kappa$, $k$ are finite, the rank is sufficiently small, $r = O(d^{k/((k-1)(k-2))})$, and the sample complexity is $d^2 p = O(d r^{1-2/k})$, the error probability is dominated by the first term on the right-hand side of (9). Neither of the two bounds in (8) and (9) dominates the other, and depending on the values of the problem parameters, we may apply whichever is tighter. We provide a proof in Section 8.2.

Figure 5: Each colormap, in each block for $k \in \{2, 3, 4, 5, 6, 7\}$, shows the empirical probability of the event $\big\{ |\|M\|_k^k - \hat{\Theta}_k(P_\Omega(M))| / \|M\|_k^k \le \delta \big\}$, for $\delta = 0.5$ (left panel) and $\delta = 0.2$ (right panel). $\Omega$ is generated by Erdős–Rényi sampling of the matrix $M$ with probability $p$ (vertical axis); the rank $r$ varies along the horizontal axis. $M$ is a symmetric positive semidefinite matrix of size $d = 1000$. The solid lines correspond to our theoretical prediction $p = (1/d) r^{1-2/k}$.

Theorem 4 Under the hypotheses of Theorem 3, the error probability is upper bounded by
$\mathbb{P}\Big( \big|\hat{\Theta}_k(P_\Omega(M)) - \|M\|_k^k\big| \ge \delta \|M\|_k^k \Big) \;\le\; \max\Big\{ \frac{\rho^2}{\delta^2} \Big( \frac{r^{1-2/k}}{dp} \Big)^{k}, \; e^{-(dp) \left( \delta d / (\rho \, r^{k-1}) \right)^{1/k}}, \; e^{-(dp) \, \delta d / (\rho \, r^{k-1})} \Big\}. \quad (9)$

These two results show that a sample size of $d^2 p = O(d r^{1-2/k})$ is sufficient to estimate a Schatten k-norm accurately when $\mu$, $\kappa$, $k$ are finite and the rank is sufficiently small, $r = O(d^{k/((k-1)(k-2))})$. In general, we do not expect a universal upper bound that is significantly tighter for all $r$, because for the special case of $r = d$, the following corollary of (Li et al., 2014, Theorem 3.2) provides a lower bound: a sample size of $d^2 p = \Omega(d^{2-4/k})$ is necessary when $r = d$. Hence, the gap is at most a factor of $r^{2/k}$ in the sample complexity.
Corollary 5 Consider any linear observation $X \in \mathbb{R}^n$ of a matrix $M \in \mathbb{R}^{d \times d}$ and any estimate $\theta(X)$ satisfying $(1 - \delta_k) \|M\|_k^k \le \theta(X) \le (1 + \delta_k) \|M\|_k^k$ for any $M$ with probability at least $3/4$, where $\delta_k = (1.2^k - 1)/(1.2^k + 1)$. Then, $n = \Omega(d^{2-4/k})$.

For $k \in \{1, 2\}$, precise bounds can be obtained with simpler analyses. In particular, we have the following remarks, whose proofs follow immediately by applying Chebyshev's and Bernstein's inequalities along with the incoherence assumptions.

Remark 6 For $k = 1$, the probability of error in (8) is upper bounded by $\min\{\nu_1, \nu_2\}$, where
$\nu_1 = \frac{(\kappa\mu)^2}{\delta^2 \, dp}$, and $\nu_2 = 2 \exp\Big( - \frac{\delta^2 \, dp}{(\kappa\mu)^2 (2 + \delta)} \Big)$.

Remark 7 For $k = 2$, the probability of error in (8) is upper bounded by $\min\{\nu_1, \nu_2\}$, where
$\nu_1 = \frac{(\kappa\mu)^2}{\delta^2 d^2 p} \max\Big\{1, \frac{(\kappa\mu)^2 r^2}{d}\Big\}$, and $\nu_2 = 2 \exp\Big( - \frac{\delta^2 d}{(\kappa\mu)^4 + \delta (\kappa\mu)^2 r} \Big)$.

When $k = 2$ and the rank is small, $r \le C\sqrt{d}$, we only need $d^2 p = O(1)$ samples for recovery up to an arbitrarily small multiplicative error. When the rank $r$ is large, our estimator requires $d^2 p = O(d)$ for both $k \in \{1, 2\}$.

3.2. From Schatten norms to spectrum and spectral sum functions

Schatten norms by themselves are rarely of direct practical interest, but they provide a popular means of approximating functions of the singular values, which are often of great practical interest (Di Napoli et al., 2016; Zhang et al., 2015; Kong and Valiant, 2016). In this section, we consider two such applications using the first few Schatten norms explicitly: estimating the generalized rank in Section 3.2.1 and estimating the singular values in Section 3.2.2.

3.2.1. Estimating the generalized rank

For a matrix $M \in \mathbb{R}^{d \times d}$ and a given constant $c \ge 0$, its generalized rank of order $c$ is given by
$\mathrm{rank}(M, c) = \sum_{i=1}^d \mathbb{I}\{ \sigma_i(M) > c \}$.
This recovers the standard rank as a special case when $c = 0$. Without loss of generality, we assume that $\sigma_{\max}(M) \le 1$. For any given $0 \le c_2 < c_1 \le 1$ and $\delta \in [0, 1)$, our goal is to produce an estimate $\hat{r}(P_\Omega(M))$ from the sampled entries $P_\Omega(M)$ such that
$(1 - \delta) \, \mathrm{rank}(M, c_1) \;\le\; \hat{r}(P_\Omega(M)) \;\le\; (1 + \delta) \, \mathrm{rank}(M, c_2)$.
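With full access to $M$ the generalized rank is a one-liner; the difficulty in this section comes entirely from having only sampled entries. For reference (our own sketch):

```python
import numpy as np

def generalized_rank(M, c):
    """rank(M, c): the number of singular values of M strictly greater than c."""
    sv = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(sv > c))
```

Note the strict inequality: $c = 0$ recovers the standard rank, counting only strictly positive singular values.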
The reason we take two different constants $c_1, c_2$ is to handle the ambiguous case when the matrix $M$ has many eigenvalues smaller than but very close to $c_1$. If we were to set $c_2 = c_1$, then any estimator $\hat{r}(M)$ would be strictly prohibited from counting these eigenvalues. However, since these eigenvalues are so close to the threshold, distinguishing them from other eigenvalues just above the threshold is difficult. Setting $c_2 < c_1$ allows us to avoid this difficulty and focus on the more fundamental challenges of the problem. Consider the function $H_{c_1,c_2} : \mathbb{R} \to [0, 1]$ given by
$H_{c_1,c_2}(x) = \begin{cases} 1 & \text{if } x > c_1, \\ 0 & \text{if } x < c_2, \\ \frac{x - c_2}{c_1 - c_2} & \text{otherwise.} \end{cases}$
It is a piecewise-linear approximation of a step function and satisfies the following sandwich relation:
$\mathrm{rank}(M, c_1) \;\le\; \sum_{i=1}^d H_{c_1,c_2}(\sigma_i(M)) \;\le\; \mathrm{rank}(M, c_2). \quad (10)$
We exploit this sandwich relation to estimate the generalized rank. Given a polynomial $f : \mathbb{R} \to \mathbb{R}$ of finite degree $m$ approximating $H_{c_1,c_2}$, say $f(x) = a_0 + a_1 x + \cdots + a_m x^m$, we immediately have the following relation, which extends to a function on the cone of PSD matrices in the standard way:
$\sum_{i=1}^d f(\sigma_i(M)) \;=\; a_0 d + \sum_{k=1}^m a_k \|M\|_k^k. \quad (11)$
Using this equality, we propose the estimator
$\hat{r}(P_\Omega(M); c_1, c_2) \;\triangleq\; a_0 d + \sum_{k=1}^m a_k \hat{\Theta}_k(P_\Omega(M)), \quad (12)$
where we use the first several $\hat{\Theta}_k(P_\Omega(M))$'s obtained by the estimator (3). Note that the function $f$ depends on $c_1$ and $c_2$. The remaining task is to obtain the coefficients of a polynomial $f$ that is a suitable approximation of the function $H_{c_1,c_2}$. In a similar context of estimating the generalized rank from approximate Schatten norms, Zhang et al. (2015) propose using a composite function $f = q_s \circ q$, where $q$ is a finite-degree Chebyshev polynomial of the first kind such that $\sup_{x \in [0,1]} |q(x) - H_{c_1,c_2}(x)| \le 0.1$, and $q_s$ is a polynomial of degree $2s + 1$ given by
$q_s(x) = \frac{1}{B(s+1, s+1)} \int_0^x t^s (1 - t)^s \, dt$,
where $B(\cdot, \cdot)$ is the Beta function.
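As a concrete illustration (our own sketch, not the authors' code), both $H_{c_1,c_2}$ and the coefficients of $q_s$ can be written down directly; expanding $(1-t)^s$ binomially and integrating term by term gives, e.g., $q_1(x) = 3x^2 - 2x^3$:

```python
import numpy as np
from math import comb, factorial

def H(x, c1, c2):
    """Piecewise-linear surrogate H_{c1,c2} of the step function (c2 < c1)."""
    if x > c1:
        return 1.0
    if x < c2:
        return 0.0
    return (x - c2) / (c1 - c2)

def q_s_coeffs(s):
    """Coefficients a_0..a_{2s+1} (ascending) of
    q_s(x) = (1/B(s+1, s+1)) * integral_0^x t^s (1-t)^s dt."""
    B = factorial(s) * factorial(s) / factorial(2 * s + 1)  # Beta(s+1, s+1)
    a = np.zeros(2 * s + 2)
    for j in range(s + 1):           # (1-t)^s = sum_j C(s,j) (-1)^j t^j
        a[s + j + 1] = comb(s, j) * (-1) ** j / (s + j + 1) / B
    return a
```

$q_s$ fixes the endpoints, $q_s(0) = 0$ and $q_s(1) = 1$, while flattening the middle of the approximation, which is what makes the composition $q_s \circ q$ a sharper step.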
Note that, since $H_{c_1,c_2}$ is a continuous function with bounded variation, classical theory (Mason and Handscomb, 2002, Theorem 5.7) guarantees the existence of a Chebyshev polynomial $q$ of a finite constant degree, say $C_b$, that depends on $c_1$ and $c_2$. Specifically, for a given choice of thresholds $0 \le c_2 < c_1 \le 1$ and degree $s$ of the beta approximation, the estimator $\hat{r}(P_\Omega(M); c_1, c_2)$ in (12) can be computed as in Algorithm 2. The approximation of $H_{c_1,c_2}$ by $f = q_s \circ q$ and our upper bound on the estimated Schatten norms $\hat{\Theta}_k(P_\Omega(M))$ translate into the following guarantee on the generalized rank estimator $\hat{r}(P_\Omega(M); c_1, c_2)$ given in (12).

Corollary 8 Suppose $\|M\|_2 \le 1$. Under the hypotheses of Theorem 3, for any given $1 \ge c_1 > c_2 \ge 0$, there exists a constant $C_b$ such that for any $s \ge 0$ and any $\gamma > 0$, the estimate in (12) with the choice $f = q_s \circ q$ satisfies
$(1 - \delta)\big(\mathrm{rank}(M, c_1) - 2^{-s} d\big) \;\le\; \hat{r}(P_\Omega(M); c_1, c_2) \;\le\; (1 + \delta)\big(\mathrm{rank}(M, c_2) + 2^{-s} d\big)$,
with probability at least $1 - \gamma C_b (2s+1)$, where $\delta = \max_{1 \le k \le C_b(2s+1)} \sqrt{\rho^2/\gamma} \, \big( \max\{1, r^{1-2/k}\} / (dp) \big)^{k/2}$.

Algorithm 2 Generalized rank estimator (a variation of Zhang et al. (2015))
Require: $P_\Omega(M)$, $c_1$, $c_2$, $s$
Ensure: $\hat{r}(P_\Omega(M); c_1, c_2)$
1: For the given $c_1$ and $c_2$, find a Chebyshev polynomial of the first kind $q(x)$ satisfying $\sup_{x \in [0,1]} |q(x) - H_{c_1,c_2}(x)| < 0.1$ [Algorithm 7]
2: Let $C_b$ denote the degree of $q(x)$
3: Find the degree-$(2s+1)C_b$ polynomial expansion $q_s \circ q(x) = \sum_{k=0}^{(2s+1)C_b} a_k x^k$
4: $\hat{r}(P_\Omega(M); c_1, c_2) \leftarrow a_0 d + \sum_{k=1}^{(2s+1)C_b} a_k \hat{\Theta}_k(P_\Omega(M))$ [Algorithm 1]

The proof follows immediately from Theorem 3 and the following lemma, which gives a uniform bound on the approximation error between $H_{c_1,c_2}$ and $f = q_s \circ q$. Lemma 9, together with Equations (10) and (11), provides a (deterministic) functional approximation guarantee
$\mathrm{rank}(M, c_1) - d \, 2^{-s} \;\le\; \sum_{i=1}^d f(\sigma_i(M)) \;\le\; \mathrm{rank}(M, c_2) + d \, 2^{-s}$,
for any $c_2 < c_1$ and any choice of $s$, as long as $C_b$ is large enough to guarantee a 0.1 uniform error bound on the Chebyshev polynomial approximation.
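Steps 3-4 of Algorithm 2 can be sketched as follows (our own illustration; `compose` and `rank_estimate` are hypothetical helper names, not from the released code): expand the composite polynomial into coefficients $a_k$, then combine them with the Schatten norm estimates as in (12).

```python
import numpy as np
from numpy.polynomial import polynomial as P

def compose(outer, inner):
    """Coefficients (ascending) of outer(inner(x)), via Horner's scheme."""
    out = np.array([outer[-1]], dtype=float)
    for c in outer[-2::-1]:            # out <- out * inner + c
        out = P.polyadd(P.polymul(out, inner), [c])
    return out

def rank_estimate(theta_hat, f_coeffs, d):
    """Estimator (12): a_0 * d + sum_k a_k * Theta_hat_k, where f_coeffs[k] = a_k
    and theta_hat[k-1] is an estimate of ||M||_k^k."""
    a = np.asarray(f_coeffs, dtype=float)
    return a[0] * d + float(np.dot(a[1:], theta_hat[:len(a) - 1]))
```

In the paper's setting one would call `compose(q_s_coeffs(s), q_coeffs)` with the Chebyshev coefficients of $q$ and feed the result, together with the outputs of Algorithm 1, to `rank_estimate`.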
Since we can achieve a $1 \pm \delta$ approximation of each Schatten norm appearing in $f$, Theorem 3 implies the desired Corollary 8. Note that, using Remarks 6 and 7, the bounds in (9) hold for $k \in [1, \infty)$ with $r^{1-2/k}$ replaced by $\max\{1, r^{1-2/k}\}$.

Lemma 9 (Zhang et al. (2015), Lemma 1) Consider the composite polynomial $f(x) = q_s(q(x))$. Then $f(x) \in [0, 1]$ for all $x \in [0, 1]$, and moreover
$|f(x) - H_{c_1,c_2}(x)| \le 2^{-s}$, for all $x \in [0, c_2] \cup [c_1, 1]$.

In Figure 6, we evaluate the performance of the estimator (12) numerically. We construct a symmetric matrix $M$ of size $d = 1000$ and rank $r = 200$, with $\sigma_i \sim \mathrm{Unif}(0, 0.4)$ for $1 \le i \le r/2$ and $\sigma_i \sim \mathrm{Unif}(0.6, 1)$ for $r/2 + 1 \le i \le r$. We estimate $\hat{r}(P_\Omega(M); c_1, c_2)$ for Erdős–Rényi sampling $\Omega$ and the choice $c_1 = 0.6$ and $c_2 = 0.5$, which is motivated by the distribution of the $\sigma_i$. We use a Chebyshev polynomial of degree $C_b = 2$ and $s = 1$ for $q_s$; that is, $f$ has degree 6. The accuracy of the estimator can be improved by increasing $C_b$ and $s$; however, that would require estimating higher-order Schatten norms.

Figure 6: The left panel shows a histogram of the singular values of $M$ chosen for the experiment. The right panel compares the absolute error $|\hat{r}(c_1, c_2) - r(c_1)|$, as a function of the sampling probability $p$, for two choices of the Schatten norm estimates $\widehat{\|M\|_k^k}$: first, the proposed estimator $\hat{\Theta}_k(P_\Omega(M))$ in (3), and second, the Schatten norm of the completed matrix $\widetilde{M} = \mathrm{AltMin}(P_\Omega(M))$ from (Jain et al., 2013).

3.2.2. Estimating the spectrum

Given accurate estimates of the first $K$ Schatten norms of a matrix $M$, we can estimate the singular values of $M$ using a linear programming based algorithm given in (Kong and Valiant, 2016). In particular, we get the following guarantees on the estimated singular values, whose proof follows directly from the analysis techniques in the proof of (Kong and Valiant, 2016, Theorem 2).
The main idea is that, given the rank (the maximum support size of the true spectrum) and an estimate of its first $K$ moments, one can find $r$ singular values whose first $K$ moments are close to the estimated Schatten norms.

Algorithm 3 Spectrum estimator (a variation of Kong and Valiant (2016))
Require: $P_\Omega(M)$, $K$, $\epsilon$, target rank $r$, lower bound $a$ and upper bound $b$ on the positive singular values
Ensure: estimated singular values $(\hat\sigma_1, \hat\sigma_2, \ldots, \hat\sigma_r)$
1: $L \in \mathbb{R}^K$: $L_k = \hat{\Theta}_k(P_\Omega(M))$ for $k \in [K]$ [Algorithm 1]
2: $t = (b - a)/\epsilon + 1$, $x \in \mathbb{R}^t$: $x_i = a + \epsilon(i - 1)$, for $i \in [t]$
3: $V \in \mathbb{R}^{K \times t}$: $V_{ij} = x_j^i$ for $i \in [K]$, $j \in [t]$
4: $p^* = \arg\min_{p \in \mathbb{R}^t} \big\{ |V p - L|_1 : \mathbb{1}_t^\top p = 1, \; p \ge 0 \big\}$
5: $\hat\sigma_i = \min\big\{ x_j : \sum_{\ell \le j} p^*_\ell \ge \frac{i}{r+1} \big\}$, the i-th $(r+1)$-st quantile of the distribution corresponding to $p^*$

Further, our upper bound on the first $K$ moments can be translated into an upper bound on the Wasserstein distance between the two distributions, which in turn gives the following bound on the singular values. With small enough $\epsilon$ and large enough $K$ and $r$, a sample size $d^2 p > C_{r,K,\epsilon,\gamma} \, d r^{1-2/k}$ suffices to achieve arbitrarily small error.

Corollary 10 Under the hypotheses of Theorem 3, given the rank $r$, constants $0 \le a < b$ such that $\sigma_{\min} \ge a$ and $\sigma_{\max} \le b$, and estimates $\{\hat{\Theta}_k(P_\Omega(M))\}_{k \in [K]}$ of the first $K$ Schatten norms of $M$ obtained by the estimator (3), for any $0 < \epsilon \le (b - a)$ and $\gamma > 0$, Algorithm 3 runs in time $\mathrm{poly}(r, K, (b-a)/\epsilon)$ and returns an estimate $\{\hat\sigma_i\}_{i \in [r]}$ of $\{\sigma_i(M)\}_{i \in [r]}$ such that
$\sum_{i=1}^r |\hat\sigma_i - \sigma_i| \;\le\; \epsilon \, r + g(K)(b-a) \max_{k \in [K]} \sqrt{\frac{\rho^2}{\gamma}} \Big( \frac{\max\{1, r^{1-2/k}\}}{dp} \Big)^{k/2}$,
with probability at least $1 - \gamma K$, where $C$ is an absolute constant and $g(K)$ depends only on $K$.

In Figure 7, we evaluate the performance of the proposed estimator (3) in recovering the true spectrum using Algorithm 3. We compare the results with the case when the Schatten norms are estimated using matrix completion. We consider two distributions of singular values: one peak and two peaks.
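Algorithm 3 can be sketched end to end as follows. This is our own illustration, not the authors' implementation: we match normalized moments $\hat\Theta_k/r$ on the grid, and we replace the paper's $\ell_1$ linear program in step 4 with a dependency-free exponentiated-gradient solver for a least-squares surrogate; the quantile read-out in step 5 is as stated.

```python
import numpy as np

def estimate_spectrum(theta_hat, r, a, b, eps, iters=5000, lr=0.05):
    """Moment-matching spectrum estimator (a sketch of Algorithm 3).
    theta_hat[k-1] estimates ||M||_k^k; returns r estimated singular values."""
    K = len(theta_hat)
    x = np.arange(a, b + eps / 2, eps)                 # grid of candidate values
    V = np.vstack([x ** k for k in range(1, K + 1)])   # moment matrix V_kj = x_j^k
    L = np.asarray(theta_hat, dtype=float) / r         # empirical moments
    p = np.full(len(x), 1.0 / len(x))                  # start from the uniform law
    for _ in range(iters):                             # exponentiated gradient on
        g = 2 * V.T @ (V @ p - L)                      # the simplex, least-squares
        p = p * np.exp(-lr * g)                        # surrogate of the l1 LP
        p /= p.sum()
    cdf = np.cumsum(p)
    # step 5: sigma_i = smallest grid point where the CDF reaches i/(r+1)
    return [x[np.searchsorted(cdf, i / (r + 1))] for i in range(1, r + 1)]
```

For instance, the true spectrum $(1, 1, 2, 2)$ has moments $(6, 10, 18)$, and on the grid $\{1, 2\}$ the estimator recovers it exactly.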
More general spectral distributions can also be recovered accurately; however, that would require estimating higher-order Schatten norms. In both cases, the proposed estimator outperforms the matrix completion approach and achieves better accuracy as the sample size increases with $\alpha$. In each graph, the black solid line depicts the empirical cumulative distribution function (CDF) of the ground truth $\{\sigma_i\}_{i \in [r]}$, i.e., the $r$ strictly positive singular values. In the first experiment (top panel), there are $r$ singular values at one peak, $\sigma_i = 1$; in the second experiment (bottom panel), there are $r/2$ singular values at each of the two peaks, $\sigma_i = 1$ and $\sigma_i = 2$. Each cell shows the result for a choice of rank $r \in \{50, 200, 500\}$ and a parameter $\alpha \in \{3, 5, 8, 10\}$, where $\Omega$ is generated using Erdős–Rényi sampling with probability $p = (\alpha/d) r^{1-2/7}$. The matrix $M$ is symmetric of size $d = 1000$ and rank $r$ with singular values $\{\sigma_i\}_{i \in [d]}$. Each cell contains one black line, three blue lines, and three orange lines. Each blue line is the empirical CDF of $\{\hat\sigma_i\}_{i \in [d]}$ for one of three independent trials, where the $\hat\sigma_i$'s are estimated using $\{\hat{\Theta}_k(P_\Omega(M))\}_{k \in [K]}$ obtained by the estimator (3). Each orange line is the empirical CDF of $\{\tilde\sigma_i\}_{i \in [d]}$, where the $\tilde\sigma_i$'s are estimated using $\{\|\widetilde{M}\|_k^k\}_{k \in [K]}$ with $\widetilde{M} = \mathrm{AltMin}(P_\Omega(M))$. Both use Algorithm 2 of (Kong and Valiant, 2016) with $K = 7$.

4. Graph sampling

Our framework for estimating Schatten k-norms applies more generally to any random sampling, as long as the distribution is permutation invariant. In practice, we typically observe one instance of a sampled matrix and do not know how the samples were generated. Under the mild assumption that the probability of sampling an entry is independent of the value of that entry, the only information about the sampling model we have is the pattern, i.e., an unlabelled graph $G = (V, E)$ capturing, via its edges, the pattern of sampled indices.
This naturally suggests a novel sampling scenario that we call graph sampling. The Erdős–Rényi sampling model has been criticized as too rigid to explain how real-world datasets are sampled: when working with natural data, we typically get one instance of a sampled matrix without any knowledge of how those entries were sampled.

Figure 7: The proposed estimator (blue solid lines) outperforms matrix completion approaches (orange solid lines) in estimating the empirical cumulative distribution function of the $r$ strictly positive ground-truth singular values (black solid line), for two examples: one peak at $\sigma_i = 1$ (top) and two peaks at $\sigma_i = 1$ and $\sigma_i = 2$ (bottom), across $\alpha \in \{3, 5, 8, 10\}$.

In this section, we propose a new sampling model, graph sampling, that makes minimal assumptions about how the data was sampled. We assume that the pattern has been determined a priori and is represented by a deterministic graph $G = (V, E)$ with $d$ nodes $V$ and undirected edges $E$. The random sampling $\Omega$ is chosen uniformly at random over all relabelings of the nodes in $G$. Formally, for a given $G = (V, E)$, a permutation $\pi : [d] \to V$ is drawn uniformly at random, and samples are drawn according to
$P_\Omega(M) = \{(i, j, M_{ij})\}_{(\pi(i), \pi(j)) \in E}$.
As the sampling pattern $G$ is completely known to the statistician, who has only one instance of the random sampling, we are only imposing that the samples be drawn uniformly at random from all instances that share the same pattern. Further, understanding this graph sampling model has the potential to reveal the subtle dependence of the estimation problem on the underlying pattern, which is known to be hard even in the established area of matrix completion. In this section, we provide an estimator under graph sampling and characterize the fundamental limit on the achievable error.
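The sampling model above can be sketched in a few lines (our own illustration; the function name is ours). Since the observed index set is $\{(i, j) : (\pi(i), \pi(j)) \in E\} = \{(\pi^{-1}(u), \pi^{-1}(v)) : (u, v) \in E\}$, one draws a uniform permutation and pulls the pattern back through its inverse:

```python
import numpy as np

def graph_sample(M, E, rng):
    """Graph sampling (Section 4): draw a uniform permutation pi of the d nodes
    and observe the entries {(i, j, M_ij)} with (pi(i), pi(j)) in E."""
    d = M.shape[0]
    pi = rng.permutation(d)        # pi : [d] -> V, uniform over permutations
    inv = np.argsort(pi)           # inverse permutation
    return {(int(inv[u]), int(inv[v]), float(M[inv[u], inv[v]]))
            for (u, v) in E}
```

Every draw observes exactly $|E|$ entries, on an index set isomorphic to $G$; only the labelling is random.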
This limit crucially depends on the original pattern $G$ via a fundamental property $\lambda_{G,r}$, which is in general challenging to compute. However, we provide a bound on $\lambda_{G,r}$ for two extreme cases of varying difficulty: clique sampling, for which $O(r^{2-4/k})$ samples suffice, and clique-star sampling, which requires as many as $\Omega(d r^{1-4/k})$ samples. The latter is made formal by a lower bound on the minimax sample complexity. Comparing the two conditions, $O(r^{2-4/k})$ for clique sampling and $\Omega(d r^{1-4/k})$ for clique-star sampling, it follows that, depending on the pattern of the samples, the sample complexity can vary drastically, especially for low-rank matrices with $r \ll d$.

Under graph sampling, the probability $p(H)$ that we observe all edges in a walk with pattern $H$ is
$p(H) = \frac{\omega_{P_\Omega(\mathbb{1}_d \mathbb{1}_d^\top)}(H)}{\omega_{\mathbb{1}_d \mathbb{1}_d^\top}(H)}$,
where $\mathbb{1}_d \mathbb{1}_d^\top$ is the all-ones matrix; by permutation invariance, this is the ratio between the total (unweighted) number of walks with $H(w) = H$ in the original pattern $\Omega$ and that in the complete graph $K_d$. Note that although $\Omega$ is a random quantity, $\omega_{P_\Omega(\mathbb{1}\mathbb{1}^\top)}(H)$ depends only on the structure, and not the labelling, of the nodes, and hence is deterministic. Plugging this value of $p(H)$, which can be computed in time $O(d^\alpha)$ for $k \le 7$ as shown in Proposition 2 (and which in general only increases the computational complexity of the estimate by a factor of two), into the estimator (3), we get an estimate customized for graph sampling.

4.1. Performance Guarantees

Recall the graph sampling model defined in Section 1.1, where we relabel the nodes of a pattern graph $G(V, E)$ according to a uniformly random permutation and sample the entries of the matrix $M$ on the edges. We prove a fundamental lower bound on the sample complexity that crucially depends on the following property of the pattern $G$. Let $G_\pi(\widetilde{V}, \Omega)$ denote the graph after relabeling the nodes of $G = (V, E)$ with a permutation $\pi : [d] \to [d]$.
For independent Rademacher variables $u_i$, $i \in [r]$, define
$f_{G,r}(\lambda) \;\triangleq\; \max_\pi \; \mathbb{E}_u\Big[ \exp\Big( (5\lambda/d)^2 \sum_{(i,j) \in P^{(r)}(G_\pi)} u_i u_j \Big) \Big]$,
where $P^{(r)}(G_\pi) \subseteq [r] \times [r]$ is a projection of the edges $\Omega$ over $d$ nodes to a set of edges over $r$ nodes, obtained by mapping a node $i \in [d]$ to the node $1 + ((i-1) \bmod r) \in [r]$. Precisely, $(i, j) \in P^{(r)}(G_\pi)$ if there exists an edge $(i', j') \in \Omega$ such that $i = 1 + ((i'-1) \bmod r)$ and $j = 1 + ((j'-1) \bmod r)$.

Observe that $f_{G,r}(\lambda)$ is a non-decreasing function of $\lambda$. This follows from the fact that for any positive $\lambda$, any random variable $x$, and any $\epsilon > 0$, we have
$\mathbb{E}[e^{\lambda(1+\epsilon)x}] \;\ge\; \mathbb{E}[e^{\lambda x}] \big( \mathbb{E}[e^{\lambda x}] \big)^{\epsilon} \;\ge\; \mathbb{E}[e^{\lambda x}] \, e^{\epsilon \lambda \mathbb{E}[x]} \;\ge\; \mathbb{E}[e^{\lambda x}]$.
The first and second inequalities use Jensen's inequality, and the third holds when $\mathbb{E}[x] \ge 0$. Note that $\mathbb{E}_u\big[\sum_{(i,j) \in P^{(r)}(G_\pi)} u_i u_j\big] \ge 0$, since the $u_i$'s are i.i.d. Rademacher variables.

This function measures the distance between a particular low-rank matrix with Gaussian entries and its rank-one perturbation, which is used in our constructive lower bound (see Eq. (35)). Intuitively, a small $f_{G,r}(\lambda)$ implies that two rank-r matrices with separated Schatten norms look similar after graph sampling with respect to $G$. Hence, when this function is small, say less than 26/25, it is hard to distinguish which of the two (distributions of) matrices we are observing. This is captured by the largest value of $\lambda$ that still keeps $f_{G,r}(\lambda)$ sufficiently small:
$\lambda_{G,r} \;\triangleq\; \max_{\{\lambda > 0 \,:\, f_{G,r}(\lambda) \le 26/25\}} \lambda. \quad (14)$
One can replace 26/25 with any number strictly larger than one and strictly smaller than two; this only changes the probability upper bound in (15). If we sample from a graph $G$ with large $\lambda_{G,r}$, then we cannot distinguish the two distributions even if they have a large Schatten norm separation: we do not have enough samples and/or our pattern is not sample efficient. The dependence of the fundamental lower bound on the graph $G$ is captured by this quantity $\lambda_{G,r}$, which is made precise in the following theorem.
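The projection $P^{(r)}$ is easy to transcribe directly from its definition (our own sketch, using 1-indexed nodes as in the text):

```python
def project_edges(edges, r):
    """The projection P^(r): map node i in [d] (1-indexed) to 1 + ((i-1) mod r)
    and keep the resulting (possibly self-loop) edges over [r]."""
    return {tuple(sorted((1 + (i - 1) % r, 1 + (j - 1) % r)))
            for (i, j) in edges}
```

Note that edges between nodes that collide under the modular map become self-loops, which is exactly what makes $\mathbb{E}_u[\sum u_i u_j] \ge 0$ in the monotonicity argument above: self-loop terms contribute $u_i^2 = 1$, while all other terms have mean zero.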
We provide a lower bound that captures how the sample complexity depends on the pattern $G$ and also on the underlying matrix, by providing an analysis customized to each family of matrices $\mathcal{M}_{r,\mu}$ parametrized by rank and incoherence:
$\mathcal{M}_{r,\mu} \;\triangleq\; \big\{ M \in \mathbb{R}^{d \times d} \,:\, M = M^\top, \; \mathrm{rank}(M) \le r, \; \mu(M) \le \mu \big\}$.

Theorem 11 (General lower bound under graph sampling) For any $k \in [3, \infty)$, suppose we observe samples under the graph sampling model defined in Section 1.1 with respect to a pattern graph $G = (V, E)$. Then there exist universal constants $C > 0$, $C' > 0$, and $C'' > 0$ such that for any $r \ge e^{C'' k}$ and $\mu \ge C' \log r$, if $\lambda_{G,r} \ge C d r^{1/k - 1/2}$, then
$\inf_{\widetilde\Theta} \; \sup_{M \in \mathcal{M}_{r,\mu}} \; \mathbb{P}\Big( \widetilde\Theta(P_\Omega(M)) \le \tfrac{1}{2} \|M\|_k^k \;\text{ or }\; \widetilde\Theta(P_\Omega(M)) \ge 2 \|M\|_k^k \Big) \;\ge\; \frac{1}{4}$,
where the infimum is over all measurable functions of $P_\Omega(M)$ and the probability is with respect to the random sampling $\Omega$.

A proof of Theorem 11 is given in Section 8.3. It is in general challenging to evaluate $\lambda_{G,r}$ for a given graph. For the special case of clique sampling, where the pattern $G(V, E)$ is a clique over a subset of $\ell$ nodes among the $d$, we provide a sharp bound on $\lambda_{G,r}$.

Lemma 12 (Lower bound for clique sampling) If the pattern graph $G(V, E)$ is a clique over a subset of $\ell$ nodes, then $\lambda_{G,r} \ge 2^{-4} d \, (\min\{\ell, r\})^{-1/2}$.

Therefore, if $\ell \le 2^{-8} C^{-2} r^{1-2/k}$, then $\lambda_{G,r} \ge C d r^{1/k - 1/2}$, which together with Theorem 11 implies that, with probability at least 1/4, any estimator makes a multiplicative error larger than two. Hence, a sample size of $\ell(\ell+1)/2 = \Omega(r^{2-4/k})$ is necessary to achieve a multiplicative error less than two with high probability. We show that our estimator is optimal for $k = 3$ by providing a matching upper bound on the sample complexity. For any positive semidefinite matrix $M \in \mathbb{R}^{d \times d}$ of rank $r$ with incoherence $\mu(M) = \mu$, condition number $\kappa = \sigma_{\max}(M)/\sigma_{\min}(M)$, and some function $g(k) = O(k!)$ depending only on $k$, we define
$\rho^2 \;\triangleq\; (\kappa\mu)^{2k} \, g(k) \, \max\big\{1, \, r^{k-2}/\ell^{k-1}\big\}$,
such that the variance of our estimator is bounded by $\mathrm{Var}\big(\hat{\Theta}_k(P_\Omega(M))/\|M\|_k^k\big) \le \rho^2 \big(r^{1-2/k}/\ell\big)^k$, as we show for $k = 3$ in the proof of Theorem 13 in Section 8.6.
Theorem 13 (Upper bound for clique sampling) For $k = 3$, any $\delta > 0$, and any rank-r matrix $M \succeq 0$, the proposed estimator (3) achieves a multiplicative error $\delta$ with probability of error bounded by
$\mathbb{P}\Big( \big|\hat{\Theta}_k(P_\Omega(M)) - \|M\|_k^k\big| \ge \delta \|M\|_k^k \Big) \;\le\; \frac{\rho^2}{\delta^2} \Big( \frac{r^{1-2/k}}{\ell} \Big)^{k}, \quad (15)$
under graph sampling with a pattern graph $G$ that is a clique over $\ell$ nodes.

For a typical scenario with finite $\mu$ and $\kappa$, this upper bound shows that a sample size of $\ell(\ell+1)/2 = O(r^{2-4/k})$ is sufficient to achieve any arbitrarily small multiplicative error for $k = 3$, provided the rank is sufficiently small, $r = O(d^{2k/(3k-2)})$, and $\ell \ge r^{(k-2)/(k-1)}$, to ensure that the first term dominates in $\rho^2$. However, the numerical experiments suggest that our analysis holds more generally, for all regimes of the rank $r$. This matches the preceding lower bound, proving optimality of the proposed estimator for $k = 3$. Although the current analysis holds only for $k = 3$, we intentionally state the guarantee in general form, as we expect the bound to hold more generally. In particular, we believe that Lemma 19 holds for all $k \ge 3$, and thereby Theorem 13 for any fixed integer $k \in [3, \infty)$. In the numerical experiments in Figure 8, $M$ is generated using settings similar to those of Figure 4, and the empirical probabilities are computed by averaging over 100 instances.

Although our analysis does not give a tight lower bound for Erdős–Rényi sampling, there exist graph patterns for which the sample complexity is large, i.e., scales linearly in $d$. Consider clique-star sampling, where the pattern graph $G(V, E)$ has a clique on a small subset of nodes $V_1$, $|V_1| = \ell$, and the remaining nodes $V \setminus V_1$ are disconnected among themselves and fully connected to the clique $V_1$. Precisely, $G = (V, E)$ with $(i, j) \in E$ if $i \in V_1$ or $j \in V_1$.

Lemma 14 (Lower bound for clique-star sampling) Under clique-star sampling with a clique of size $\ell$, there exists an absolute constant $c$ such that $\lambda_{G,r} \ge c \, d \, \big(r \min\{\ell, r\}\big)^{-1/4}$.
Figure 8: Each colormap, in each block for $k \in \{3, 4, 5, 6\}$, shows the empirical probability of the event $\big\{ |\|M\|_k^k - \hat{\Theta}_k(P_\Omega(M))| / \|M\|_k^k \le \delta \big\}$, for $\delta = 0.5$ (left panel) and $\delta = 0.2$ (right panel). $\Omega$ is generated by clique sampling of the matrix $M$ with a clique of size $\ell$ (vertical axis); the rank $r$ varies along the horizontal axis. $M$ is a positive semidefinite matrix of size $d = 1000$. The solid lines correspond to our theoretical prediction $\ell = r^{1-2/k}$.

Together with Theorem 11, Lemma 14 implies that if $\ell \le c^4 C^{-4} r^{1-4/k}$, then with probability at least 1/4 any estimator makes a multiplicative error larger than two. This implies that the total number of edges in the pattern graph must be $\Omega(d r^{1-4/k})$ for accurate estimation. Together with the upper bound for clique sampling in Theorem 13, this shows that the sample complexity can change drastically based on the sampling pattern: clique sampling requires only $O(r^{2-4/k})$ samples (for $k = 3$), whereas clique-star sampling requires at least $\Omega(d r^{1-4/k})$. Proofs of Lemma 12 and Lemma 14 are given in Sections 8.4 and 8.5, respectively.

5. Discussion

We list some observations and future research directions.

Complexity of the estimator beyond k = 7. For $k \ge 8$, our approach of using matrix operations to count (the weights of) walks for each pattern $H \in \mathcal{H}_k$ can potentially be extended. However, the complexity of the problem changes fundamentally for $k \ge 8$. As our estimator is at least as hard as counting small structures in a simple (unweighted) graph, we can borrow known complexity results to get a lower bound. For instance, for $k \ge 8$, we need to count $K_4$ in a graph. There is no known simple matrix computation to count $K_4$ in a general graph; the best known run-time for counting all copies of $K_4$ is $O(d^{\alpha+1})$ for general graphs (Kloks et al., 2000).
For general $k$, under standard hardness assumptions, Flum and Grohe (2004) show that there is no algorithm with run-time $O(f(k) \, d^c)$ for counting cycles of length $k$, for any function $f(k)$ and any constant $c$ that does not depend on $k$. In comparison, finding one cycle of length $k$ can be done in time $2^{O(k)} d^\alpha$ (Alon et al., 1997). This implies that the complexity should scale as $O(d^{f(k)})$, and we believe $f(k)$ should be larger than $\alpha \sqrt{2k}/3$. The reason is that for $k \ge \ell(\ell-1)/2$ with an odd $\ell$, our estimator needs to count the number of cliques $K_\ell$ of size $\ell$, and for $k \ge (1/2)\ell^2$ with an even $\ell$, we also require counting $K_\ell$. The best known algorithm for counting $K_\ell$ takes time $O\big(\min\{d^{1 + \alpha \lceil (\ell-1)/3 \rceil}, \, d^{2 + \alpha \lceil (\ell-2)/3 \rceil}\}\big)$ for general graphs (Alon et al., 1997, Theorem 6.4). Putting these bounds together, we believe that the estimator takes time at least $d^{\alpha \sqrt{2k}/3}$.

Sufficiency of k ≤ 7 for practical applications. There are many practical applications where $k \le 7$ suffices for estimating a spectral sum function of a matrix from its partial observations. In the two applications discussed in this paper, generalized rank estimation and spectrum estimation, we show by numerical experiments that $k \le 7$ is sufficient. In Figure 6, we show that the generalized rank of a matrix can be estimated within 0.05 multiplicative error when the eigenvalues of the matrix are uniformly distributed in an interval. In Figure 7, we show that $k = 7$ is sufficient for estimating the spectrum for one-peak and two-peak distributions of eigenvalues. Further, for a uniform distribution of eigenvalues, Kong and Valiant (2016) use only the first $k = 7$ Schatten norms for estimating the spectrum; they note that considering higher Schatten norms beyond $k = 7$ did not significantly improve the results.

Graph sampling. Typical guarantees known for matrix completion assume Erdős–Rényi sampling.
One exception is the deterministic sampling studied by Bhojanapalli and Jain (2014), but such generality in sampling comes at the price of stricter assumptions on the matrix $M$. We propose graph sampling, which can potentially capture how estimation guarantees depend explicitly on the pattern $G$, while remaining analytically tractable. We give such examples for special graphs in Section 4, and the graph sampling model can potentially be used to bridge the gap in sampling models between theory and practice.

(Standard) rank estimation. As several popular matrix completion approaches require knowledge of the rank of the original matrix, it is of great practical interest to estimate the standard rank of a matrix from sampled entries. Our framework in Section 3.2.1 provides a way to estimate the standard rank from samples. However, a few parameters need to be tuned, such as the thresholds $c_1$ and $c_2$, the degree of the polynomial approximation, and the order of the Schatten norm. For rank estimation, Keshavan and Oh (2009) give an estimator that is provably correct in the regime where matrix completion works, justifying the requirement that popular matrix completion algorithms (Keshavan et al., 2010a; Jain et al., 2013) need to know the underlying rank. However, in the regime of our interest, which is below the standard matrix completion threshold, that algorithm fails and no guarantees are known. In more recent work, Saade et al. (2015) propose a novel rank estimator that counts the negative eigenvalues of the Bethe Hessian matrix. It is an interesting future direction to build upon our framework to provide a guideline for choosing the parameters for standard rank estimation, and to compare its performance to existing methods.

The effect of the effective rank.
One property of the Schatten norm is that as $k$ gets large and as the singular values have small effective rank (meaning that they decay fast), the summation is dominated by the largest few singular values. In such scenarios, any estimation algorithm that tracks the first few singular values correctly achieves small error. Hence, the gap between the proposed estimator and the simple Schatten $k$-norm of the rescaled sampled matrix gets smaller as the effective rank gets smaller, as depicted in Figure 9.

Figure 9: Relative error as a function of the sampling probability $p$ for the proposed estimator, the scaled sampled matrix, and matrix completion ($d = 500$, $r = 500$). For a matrix with a very small effective rank, the gap between the proposed estimator and the simple scaled sampled matrix approach is smaller.

We use the same setting as in Figure 4 with a full-rank matrix $M$ with $r = d = 500$, but the effective rank is relatively small as the singular values decay as $\sigma_i = 1/i^2$. For the current choice of $k = 5$, notice that the contribution to $\|M\|_5^5$ of the second singular value is a factor of $2^{10}$ smaller than that of the top singular value, making $M$ effectively a rank-one matrix.

Technical challenges. The technical challenge in proving bounds on the number of samples needed to estimate Schatten $k$-norms lies in getting tight bounds on the variance of the estimator. The variance is a function of the weighted counts, in the complete matrix, of each pseudograph formed by pairs of closed $k$-walks. As the weight of each walk can be positive or negative, significant cancellation occurs when we sum all the weights. However, this stochastic cancellation is hard to capture in the analysis, and we assume the worst case where all the weights are positive, which cannot occur for incoherent and well-conditioned matrices.
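The factor-of-$2^{10}$ dominance in the Figure 9 setting above can be checked directly from the singular values (a small numerical sketch of the stated setting $\sigma_i = 1/i^2$, $d = 500$, $k = 5$):

```python
import numpy as np

# Singular values decaying as sigma_i = 1/i^2, as in the Figure 9 setting
# (d = r = 500, k = 5); the matrix is full rank but effectively rank one.
d, k = 500, 5
sigma = 1.0 / np.arange(1, d + 1) ** 2
schatten_k = np.sum(sigma ** k)

# The second singular value contributes (1/2^2)^5 = 2^-10 of the top one,
# so the Schatten 5-norm is dominated by the largest singular value.
top_share = sigma[0] ** k / schatten_k
assert (sigma[1] / sigma[0]) ** k == 2.0 ** -10
assert top_share > 0.99
```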
This weakness of the analysis leads to the requirement that the rank be sufficiently small in the case of Erdős–Rényi sampling, and that $k$ be small in the case of clique sampling. We believe these bounds can be tightened, and this is reflected in the numerical simulations, which show that the same scaling holds for all small values of $k$ and for rank close to the dimension of the matrix.

6. Algorithm for estimating the Schatten k-norm for k ≥ 8

The collection of pseudographs $\mathcal{H}_k$ is partitioned into sets $\{\mathcal{H}^{\rm iso}_{k,i}\}_{1 \leq i \leq r}$, for some $r \leq k!$. The partitions $\mathcal{H}^{\rm iso}_{k,i}$ are defined such that the pseudographs in one partition are isomorphic to each other when multi-edges are condensed into one. This is useful since all the pseudographs in one partition are observed together in $G([d], \Omega)$ for any fixed subgraph of $G$. The underlying simple graph (including self-loops) for each partition $\mathcal{H}^{\rm iso}_{k,i}$ is denoted by $F_{k,i}$.

The main idea is to enumerate a list $L_\ell$ of all connected $\ell$-vertex induced subgraphs (possibly with loops) of the graph $G([d], \Omega)$, for each $1 \leq \ell \leq k$. The unbiased weighted count of all pseudographs in $\mathcal{H}_k$ is computed for each of these vertex-induced subgraphs $g \in L_\ell$. This is achieved by further enumerating a list $S_{g,\ell}$ of all $\ell$-vertex subgraphs of each $g$. Then the unbiased weight of all pseudographs $H \in \mathcal{H}_k$ that exist in a subgraph $h$ is computed and summed to obtain the estimate of the $k$-th Schatten norm. Recall the notation $P_\Omega(M)$, which denotes the partially observed matrix corresponding to the index set $\Omega$ with the unobserved entries replaced by zero. We abuse this notation and use $h(M)$ to represent the matrix $M$ restricted to the subgraph $h$ of the observed graph $G([d], \Omega)$. The connected induced subgraphs of size $k$ in a graph can be enumerated in time polynomial in $d$ and $k$ (Elbassioni, 2015).
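As a toy illustration of the enumeration step (the lists $L_\ell$ used by Algorithm 4), connected $\ell$-vertex induced subgraphs can be enumerated by brute force for small graphs. The function below is our own small-scale stand-in for the polynomial-delay enumeration of Elbassioni (2015), not the paper's implementation:

```python
import numpy as np
from itertools import combinations

def connected_induced_subgraphs(adj, l):
    """Brute-force list of connected l-vertex induced subgraphs,
    returned as sorted vertex tuples."""
    d = len(adj)
    out = []
    for verts in combinations(range(d), l):
        sub = adj[np.ix_(verts, verts)]
        # connectivity check by repeated neighbor expansion from vertex 0
        reach, frontier = {0}, {0}
        while frontier:
            frontier = {j for i in frontier for j in range(l)
                        if sub[i, j] and j not in reach}
            reach |= frontier
        if len(reach) == l:
            out.append(verts)
    return out

# A path on 4 vertices: its connected induced subgraphs on 2 vertices are
# exactly the 3 edges, and on 3 vertices the 2 sub-paths.
P4 = np.zeros((4, 4), dtype=int)
for i in range(3):
    P4[i, i + 1] = P4[i + 1, i] = 1
assert connected_induced_subgraphs(P4, 2) == [(0, 1), (1, 2), (2, 3)]
assert connected_induced_subgraphs(P4, 3) == [(0, 1, 2), (1, 2, 3)]
```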
The number of connected induced subgraphs of size $k$ in a graph is upper bounded by $(e\Delta)^k/((\Delta - 1)k)$, where $\Delta$ is the maximum degree of the graph (Uehara et al., 1999). Therefore, Algorithm 4 runs in time that is super-exponential in $k$, and polynomial in $d$ and in the number of connected induced subgraphs of size $k$ in the observed graph $G([d], \Omega)$.

Algorithm 4 Schatten $k$-norm estimator
Require: $P_\Omega(M)$, $k$, $\mathcal{H}_k$, $p(H)$ for all $H \in \mathcal{H}_k$
Ensure: $\widehat{\Theta}_k(P_\Omega(M))$
1: $\widehat{\Theta}_k(P_\Omega(M)) \leftarrow 0$
2: for $1 \leq \ell \leq k$ do
3:   Enumerate a list, $L_\ell$, of all connected $\ell$-vertex induced subgraphs (possibly with loops) of the graph $G([d], \Omega)$
4:   for all $g \in L_\ell$ do
5:     Enumerate a list $S_{g,\ell}$ of all connected $\ell$-vertex subgraphs of the graph $g$ obtained by removing one or more edges
6:     for all $h \in S_{g,\ell}$ do
7:       for $1 \leq i \leq r$ do
8:         if $h$ is isomorphic to $F_{k,i}$ then
9:           $\widehat{\Theta}_k(P_\Omega(M)) \leftarrow \widehat{\Theta}_k(P_\Omega(M)) + \sum_{H \in \mathcal{H}^{\rm iso}_{k,i}} \frac{1}{p(H)}\, \omega_{h(M)}(H)\, c(H)$
10:        end if
11:      end for
12:    end for
13:  end for
14: end for

7. Algorithm for computing the Chebyshev polynomial

Algorithm 5 Chebyshev polynomial of the first kind approximating $H_{c_1,c_2}(x)$
Require: $H_{c_1,c_2}$, $c_1$, $c_2$, and target accuracy $\delta = 0.1$
Ensure: Chebyshev polynomial $q(x)$ of the first kind
1: $g(x) \leftarrow (x - c_2)/(c_1 - c_2)$
2: $T_0(x) \leftarrow 1$, $T_1(x) \leftarrow x$
3: $i \leftarrow 1$
4: $q(x) \leftarrow \frac{1}{\pi} \int_{c_2}^{c_1} (1 - x^2)^{-1/2}\, g(x)\, T_0(x)\, dx + \frac{1}{\pi} \int_{c_1}^{1} (1 - x^2)^{-1/2}\, T_0(x)\, dx$
5: while $\sup_{x \in [0,1]} |q(x) - H_{c_1,c_2}(x)| \geq \delta$ do
6:   $q(x) \leftarrow q(x) + \frac{2 T_i(x)}{\pi} \int_{c_2}^{c_1} (1 - x^2)^{-1/2}\, g(x)\, T_i(x)\, dx + \frac{2 T_i(x)}{\pi} \int_{c_1}^{1} (1 - x^2)^{-1/2}\, T_i(x)\, dx$
7:   $i \leftarrow i + 1$
8:   $T_i(x) \leftarrow 2x\, T_{i-1}(x) - T_{i-2}(x)$
9: end while

8. Proofs

We provide proofs for the main results and the technical lemmas.

8.1. Proof of Theorem 3

Consider $\widetilde{W}$, the collection of all length-$k$ closed walks on a complete graph on $d$ vertices. Here we slightly overload the notion of a complete graph to refer to an undirected graph with not only all the $d(d-1)/2$ simple edges but also all $d$ self-loops. Construct the largest possible collection $W \subseteq \widetilde{W}$ wherein each walk has a distinct weight, that is, $\omega(w) \neq \omega(w')$ for all $w \neq w' \in W$. We partition $W$ according to the pattern among the $k$-cyclic pseudographs, which are further partitioned into four groups.
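For $k = 3$ this grouping of closed walks by pattern is concrete: $\mathrm{tr}(M^3)$ splits into triangles (three distinct edges), walks using a doubled edge plus one self-loop, and walks on a single self-loop, and reweighting each group by the inverse probability that its unique edges are observed gives an unbiased estimate under Erdős–Rényi sampling. The sketch below is our own illustration of this principle (the helper name is hypothetical), not the paper's general estimator:

```python
import numpy as np

def schatten3_estimate(M, mask, p):
    """Unbiased estimate of tr(M^3) from the entries kept in `mask`
    (symmetric 0/1), each edge/self-loop sampled independently w.p. p.

    Closed 3-walks split by pattern, each reweighted by the inverse
    probability that its unique edges are all observed:
      - triangles: 3 distinct edges            -> 1/p^3
      - doubled edge + one self-loop: 2 edges  -> 1/p^2
      - triple self-loop: 1 edge               -> 1/p
    """
    Mo = M * mask
    Mb = Mo - np.diag(np.diag(Mo))   # off-diagonal (edge) part
    D = np.diag(Mo)                  # diagonal (self-loop) part
    tri = np.trace(Mb @ Mb @ Mb)                      # 3 distinct edges
    loops = 3 * np.sum(D * np.sum(Mb ** 2, axis=1))   # M_ii * M_ij^2 walks
    tail = np.sum(D ** 3)                             # single self-loop
    return tri / p ** 3 + loops / p ** 2 + tail / p

# Sanity check: with p = 1 and all entries observed, the estimate is exactly
# tr(M^3), which equals the Schatten 3-norm cubed for a PSD matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
M = X @ X.T
est = schatten3_estimate(M, np.ones_like(M), 1.0)
assert np.isclose(est, np.trace(M @ M @ M))
assert np.isclose(est, np.sum(np.linalg.eigvalsh(M) ** 3))
```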
The estimator (3) can be re-written as

$\widehat{\Theta}_k(P_\Omega(M)) = \sum_{H \in \mathcal{H}_k} \frac{c(H)}{p(H)} \Big\{ \sum_{w: H(w) = H} \omega_M(w)\, \mathbb{I}(w \subseteq \Omega) \Big\},$

where we write $w \subseteq \Omega$ to denote the event that all the edges in the walk $w$ are sampled. The four groups are defined as follows:

- $\mathcal{H}_{k,1} \equiv \{C_k\}$ is just a (set of a) simple cycle of length $k$; there are in total $|\{w \in W : H(w) \in \mathcal{H}_{k,1}\}| = \binom{d}{k}(k!/2k) \leq d^k/(2k)$ walks corresponding to this set, and $c(C_k) = 2k$.
- $\mathcal{H}_{k,2} \equiv \{H(V_H, E_H) \in \mathcal{H}_k : |V_H| \leq k - 1$ and no self-loops$\}$; there are in total $|\{w \in W : H(w) \in \mathcal{H}_{k,2}\}| \leq d^{k-1}$ walks corresponding to this set.
- $\mathcal{H}_{k,3} \equiv \bigcup_{s=1}^{k-1} \mathcal{H}_{k,3,s}$, where $\mathcal{H}_{k,3,s} = \{H \in \mathcal{H}_k$ with $s$ self-loops$\}$; there are in total $|\{w \in W : H(w) \in \mathcal{H}_{k,3,s}\}| \leq d^{k-s}$ walks in this set.
- $\mathcal{H}_{k,4} \equiv \{H(V_H, E_H) \in \mathcal{H}_k : |V_H| = 1\}$ is a (set of a) graph with $k$ self-loops; there are in total $|\{w \in W : H(w) \in \mathcal{H}_{k,4}\}| = d$ walks corresponding to this set.

Given this unbiased estimator, we provide an upper bound on the variance of each of the partitions to prove concentration with Chebyshev's inequality. For any walk $w \in W$, let $|w|$ denote the number of unique edges (including self-loops) that the walk $w$ traverses. Let $|w \cap w'|$ denote the number of unique overlapping edges (including self-loops) of walks $w$ and $w'$. We have

$\mathrm{Var}\big(\widehat{\Theta}_k(P_\Omega(M))\big) = \sum_{\ell} \sum_{\substack{w \neq w' \in \widetilde{W} \\ |w \cap w'| = \ell}} \mathrm{Cov}\Big( \mathbb{I}(w \subseteq \Omega)\,\omega_M(w)\,\frac{c(H(w))}{p(H(w))},\ \mathbb{I}(w' \subseteq \Omega)\,\omega_M(w')\,\frac{c(H(w'))}{p(H(w'))} \Big) + \sum_{H \in \mathcal{H}_k} \frac{c(H)^2}{p(H)^2} \sum_{w: H(w) = H} \omega_M(w)^2\, \mathrm{Var}\big( \mathbb{I}(w \subseteq \Omega) \big)$
$\leq \sum_{\ell} \sum_{\substack{w \neq w' \in W \\ |w \cap w'| = \ell}} E\big[\mathbb{I}(w \subseteq \Omega)\,\mathbb{I}(w' \subseteq \Omega)\big]\, \omega_M(w)\, \omega_M(w')\, \frac{c(H(w))\, c(H(w'))}{p(H(w))\, p(H(w'))} + \sum_{H \in \mathcal{H}_k} \sum_{w: H(w) = H} \frac{c(H)^2\, \omega_M(w)^2}{p(H)^2}\, E\big[\mathbb{I}(w \subseteq \Omega)\big]. \qquad (16)$

Recall from the definition of incoherence that $|M_{ii}| \leq \sigma_1(M)\mu r/d$ and $|M_{ij}| \leq \sigma_1(M)\mu r^{1/2}/d$, and let $\alpha \equiv \sigma_1(M)\mu r^{1/2}/d$ denote the bound on the maximum off-diagonal entry, such that $|M_{ij}| \leq \alpha$ and $|M_{ii}| \leq \alpha\sqrt{r}$ for all $i, j \in [d]$.
Let $A_{p,k,\alpha,d} \equiv d^k \alpha^{2k}/p^k$ denote the target scaling of the variance. Then

$\sum_{w: H(w) \in \mathcal{H}_{k,i}} \frac{c(H)^2\, \omega_M(w)^2}{p(H)^2}\, E\big[\mathbb{I}(w \subseteq \Omega)\big] \leq$
$\frac{d^k}{2k}\, \frac{(2k)^2 \alpha^{2k}}{p^k} = 2k\, A_{p,k,\alpha,d}, \quad \text{for } i = 1, \qquad (17)$
$d^{k-1}\, \frac{f(k)^2 \alpha^{2k}}{p^k} = \frac{f(k)^2}{d}\, A_{p,k,\alpha,d}, \quad \text{for } i = 2, \qquad (18)$
$d\, \frac{f(k)^2 \alpha^{2k} r^k}{p} = \frac{f(k)^2 r^k p^{k-1}}{d^{k-1}}\, A_{p,k,\alpha,d}, \quad \text{for } i = 4, \qquad (19)$

and for $i = 3$ and for $1 \leq s \leq k - 1$, we have

$\sum_{w: H(w) \in \mathcal{H}_{k,3,s}} \frac{c(H)^2\, \omega_M(w)^2}{p(H)^2}\, E\big[\mathbb{I}(w \subseteq \Omega)\big] \leq d^{k-s}\, \frac{f(k)^2 \alpha^{2k} r^s}{p^k} = \frac{f(k)^2 r^s}{d^s}\, A_{p,k,\alpha,d}, \qquad (20)$

where $c(H)$ is the multiplicity of walks with the same weight, satisfying $c(H) \leq f(k)$. For $w \neq w'$ and $|w \cap w'| = \ell$, where the range of $\ell$ varies across equations depending upon the set to which $w, w'$ belong, we have the following:

$\sum_{\substack{w \neq w' \in W,\ |w \cap w'| = \ell \\ H(w) \in \mathcal{H}_{k,i,s},\ H(w') \in \mathcal{H}_{k,i',s'}}} E\big[\mathbb{I}(w \subseteq \Omega)\,\mathbb{I}(w' \subseteq \Omega)\big]\, \omega_M(w)\, \omega_M(w')\, \frac{c(H(w))\, c(H(w'))}{p(H(w))\, p(H(w'))} \leq$
$d^{2k - \ell - 1}\, \frac{2k\, \alpha^{2k}}{p^\ell} = 2k\, \frac{(dp)^{k-\ell}}{d}\, A_{p,k,\alpha,d}, \quad \text{for } i = i' = 1, \qquad (21)$
$f(k)^2\, d^{k-1}\, d^{k-1-(\ell+1)}\, \frac{\alpha^{2k}}{p^\ell} \leq \frac{f(k)^2 (dp)^{k-\ell}}{d^3}\, A_{p,k,\alpha,d}, \quad \text{for } i = i' = 2, \qquad (22)$
$f(k)^2\, d^{k-s}\, d^{k-s'-\ell}\, \frac{\alpha^{2k-s-s'} (\alpha\sqrt{r})^{s+s'}}{p^\ell} \leq \frac{f(k)^2 (dp)^{k-\ell}}{(d/\sqrt{r})^{s+s'}}\, A_{p,k,\alpha,d}, \quad \text{for } i = i' = 3, \qquad (23)$
$f(k)^2\, d^k\, d^{k-1-(\ell+1)}\, \frac{\alpha^{2k}}{p^\ell} \leq \frac{f(k)^2 (dp)^{k-\ell}}{d^2}\, A_{p,k,\alpha,d}, \quad \text{for } i = 1,\ i' = 2, \qquad (24)$
$f(k)^2\, d^k\, d^{k-s-(\ell+1)}\, \frac{\alpha^{2k-s} (\alpha\sqrt{r})^{s}}{p^\ell} \leq \frac{f(k)^2 (dp)^{k-\ell}}{d\,(d/\sqrt{r})^{s}}\, A_{p,k,\alpha,d}, \quad \text{for } i = 1,\ i' = 3, \qquad (25)$
$f(k)^2\, d^{k-1}\, d^{k-s-(\ell+1)}\, \frac{\alpha^{2k-s} (\alpha\sqrt{r})^{s}}{p^\ell} \leq \frac{f(k)^2 (dp)^{k-\ell}}{d^2 (d/\sqrt{r})^{s}}\, A_{p,k,\alpha,d}, \quad \text{for } i = 2,\ i' = 3, \qquad (26)$
$f(k)^2\, d\, d^{k-s-\ell}\, \frac{\alpha^{k-s} (\alpha\sqrt{r})^{k+s}}{p^\ell} \leq \frac{f(k)^2 (dp)^{k-\ell}}{d^{k-1} (d/\sqrt{r})^{k+s}}\, A_{p,k,\alpha,d}, \quad \text{for } i = 3,\ i' = 4, \qquad (27)$

where (27) is valid only for $\ell = 1$. Note that any $w$ with $H(w) \in \mathcal{H}_{k,1} \cup \mathcal{H}_{k,2}$ has no overlap with any $w'$ such that $H(w') \in \mathcal{H}_{k,4}$. Observe that $\mathrm{Var}(\widehat{\Theta}_k(P_\Omega(M)))$ as bounded in (16) is upper bounded by the sum of the quantities in (17)–(27), summed over all possible values of $1 \leq \ell \leq k - 1$ and $1 \leq s, s' \leq k - 1$. Let $h(k) \equiv f(k)^2 A_{p,k,\alpha,d}$. Observe that the quantities in (17), (18), and (20) are upper bounded by $h(k)$. The quantities in (21)–(27) are upper bounded by $h_1(k) \equiv h(k)(dp)^{k-1}/d$. The quantity in (19) is upper bounded by $h_2(k) \equiv h(k)\, r^k p^{k-1}/d^{k-1}$. Given $\|M\|_k^k \geq r(\sigma_{\min})^k$, recall the bound on the off-diagonals of the matrix $M$, $|M_{ij}| \leq \alpha = \mu\sigma_{\max}\sqrt{r}/d$, and $A_{p,k,\alpha,d} = d^k \alpha^{2k}/p^k$.
This gives $A_{p,k,\alpha,d}/\|M\|_k^{2k} \leq \kappa^{2k}\mu^{2k} r^{k-2}/(dp)^k$. Using Chebyshev's inequality and collecting all the terms in the upper bound on the variance, we have for sufficiently large $d$ the following bound:

$P\Big( \big| \widehat{\Theta}_k(P_\Omega(M)) - \|M\|_k^k \big| \geq \delta\, \|M\|_k^k \Big) \leq \frac{(\kappa\mu)^{2k} f(k)^2 r^{k-2}}{\delta^2 (dp)^k}\, \max\Big\{ 1,\ \frac{(dp)^{k-1}}{d},\ \frac{r^k p^{k-1}}{d^{k-1}} \Big\},$

where the second and the third terms in the max expression follow from evaluating $h_1(k)$ and $h_2(k)$. If the sampling probability $p$ is small enough that $dp \leq C d^{1/(k-1)}$ for some constant $C$, then the second and the third terms are smaller than the first term. Hence, the desired result in Theorem 3 follows.

8.2. Proof of Theorem 4

We prove a Bernstein-type bound on the accuracy of the estimator. The estimator (3) can be rewritten as a multilinear polynomial function of the $d(d+1)/2$ i.i.d. Bernoulli($p$) random variables:

$\widehat{\Theta}_k(P_\Omega(M)) = \sum_{w} \frac{c(H(w))}{p(H(w))}\, \omega_M(w) \prod_{(i,j) \in \mathrm{unique}(w)} \mathbb{I}\big( (i,j) \in \Omega \big),$

where $\mathbb{I}((i,j) \in \Omega)$ is a random variable that takes value 1 if the $(i,j)$-th entry of the matrix $M$ is sampled, and $\mathrm{unique}(w)$ denotes the set of the unique edges (and self-loops) that the walk $w$ traverses. Let $q$ denote the degree of the polynomial, that is, the maximum number of unique edges in a walk $w$; here $q = k$. We use the following Bernstein-type concentration result of Schudy and Sviridenko (2011) for polynomials of independent random variables.

Lemma 15 (Schudy and Sviridenko (2011), Theorem 1.3) We are given $d(d+1)/2$ independent central moment bounded random variables $\{\mathbb{I}((i,j) \in \Omega)\}_{1 \leq i \leq j \leq d}$ with the same parameter $L$, and a multilinear polynomial $\widehat{\Theta}_k(P_\Omega(M))$ of degree $q$. Then

$P\Big( \big| \widehat{\Theta}_k(P_\Omega(M)) - E\big[\widehat{\Theta}_k(P_\Omega(M))\big] \big| \geq \lambda \Big) \leq e^2 \max\Big\{ \exp\Big( -\frac{\lambda^2}{\mathrm{Var}\big[\widehat{\Theta}_k(P_\Omega(M))\big]\, R^q} \Big),\ \max_{t \in [q]} \exp\Big( -\Big( \frac{\lambda}{\mu_t L^t R^q} \Big)^{1/t} \Big) \Big\},$

where $R$ is some absolute constant and $\mu_t$ is defined as follows:

$\mu_t \equiv \max_{\substack{S \subseteq \{(i,j): i,j \in [d]\} \\ |S| = t}} \sum_{w:\ w \supseteq S} \frac{c(H(w))}{p(H(w))}\, |\omega_M(w)| \prod_{(i,j) \in \mathrm{unique}(w) \setminus S} E\big[ \mathbb{I}((i,j) \in \Omega) \big], \qquad (28)$

where $w \supseteq S$ denotes that the walk $w$ comprises the edges (and self-loops) contained in the set $S$.
The parameter $L$ is defined as follows: a random variable $Z$ is called central moment bounded with real parameter $L > 0$ if, for any integer $i \geq 1$, we have $E\big[|Z - E[Z]|^i\big] \leq i\, L\, E\big[|Z - E[Z]|^{i-1}\big]$. For each of the Bernoulli random variables $\{\mathbb{I}((i,j) \in \Omega)\}_{1 \leq i \leq j \leq d}$, $L$ is contained in $[1/4, 1]$. In the following, we show that $\mu_t \leq (\mu\sigma_{\max})^k g(k)\, r^k/(d (dp)^t)$ for $t \in [k]$. Using Lemma 15, along with $\|M\|_k^k \geq r(\sigma_{\min})^k$, the bound in (9) follows immediately.

To compute $\mu_t$, define a set of walks $W_{\ell,s,\hat{s}}$ such that $w \in W_{\ell,s,\hat{s}}$ has $0 \leq \ell \leq k$ unique edges, $0 \leq s \leq k$ unique self-loops, and $\hat{s}$ total self-loops with $\ell + \hat{s} \leq k$. For the set $S$ as required in (28), let $S_{\bar{\ell},\bar{s}}$ be a set of $\bar{\ell}$ unique edges and $\bar{s}$ unique self-loops, with $|S_{\bar{\ell},\bar{s}}| = \bar{\ell} + \bar{s}$, where $1 \leq \bar{\ell} + \bar{s} \leq k$. Therefore, we have

$\mu_t = \max_{S_{\bar{\ell},\bar{s}}:\ \bar{\ell} + \bar{s} = t}\ \sum_{\substack{0 \leq s \leq \hat{s} \leq k \\ \ell \in [k]:\ \ell + \hat{s} \leq k}}\ \sum_{w \in W_{\ell,s,\hat{s}}:\ w \supseteq S_{\bar{\ell},\bar{s}}} \frac{c(H(w))}{p(H(w))}\, |\omega_M(w)| \prod_{(i,j) \in \mathrm{unique}(w) \setminus S_{\bar{\ell},\bar{s}}} E\big[\mathbb{I}((i,j) \in \Omega)\big]$
$\leq \max_{S_{\bar{\ell},\bar{s}}:\ \bar{\ell} + \bar{s} = t}\ \sum_{\substack{0 \leq s \leq \hat{s} \leq k \\ \ell \in [k]:\ \ell + \hat{s} \leq k,\ \bar{s} \leq s}} d^{\ell - (1 + \bar{\ell})}\, f(k)\, \frac{1}{p^{\ell + s}}\, \frac{(\mu\sigma_{\max})^k\, r^{(k + \hat{s})/2}}{d^k}\, p^{\ell + s - (\bar{\ell} + \bar{s})}$
$= \max_{S_{\bar{\ell},\bar{s}}:\ \bar{\ell} + \bar{s} = t}\ \sum_{\substack{0 \leq s \leq \hat{s} \leq k \\ \ell \in [k]:\ \ell + \hat{s} \leq k,\ \bar{s} \leq s}} \frac{f(k)\, (\mu\sigma_{\max})^k\, r^{(k + \hat{s})/2}}{d\, d^{k - \ell - \bar{s}}\, (dp)^{\bar{\ell} + \bar{s}}}$
$\leq \max_{S_{\bar{\ell},\bar{s}}:\ \bar{\ell} + \bar{s} = t} \frac{k^3 f(k)\, (\mu\sigma_{\max})^k\, r^{(k + \hat{s})/2}}{d\, d^{k - \ell - \bar{s}}\, (dp)^{\bar{\ell} + \bar{s}}} \leq \frac{(\mu\sigma_{\max})^k\, g(k)\, r^k}{d\, (dp)^t}.$

8.3. Proof of Theorem 11

The proof technique is a generalization, to rank-$r$ symmetric matrices, of the proof given by Li et al. (2014) for deriving a lower bound on the size of a random bi-linear sketch needed for approximating the Schatten norm of an arbitrary matrix. It also draws on the techniques used in Andoni et al. (2013) for proving a lower bound on the size of linear sketches of moments. We prove Theorem 11 for an arbitrary fixed relabeling permutation $\pi$ of the graph nodes.
Indeed, by Yao's minimax principle, it suffices to give two distributions on matrices $M \in \mathcal{M}_r$ for which the $\|M\|_k$ values differ by a constant factor with high probability, but for any relabeling permutation $\pi$ of the nodes of the pattern graph $G$, the induced distributions on the sampled entries $P_\Omega(M)$ corresponding to the relabeled graph $G_\pi(\widetilde{V}, \Omega)$ have small total variation distance. For a positive $C > 0$ to be specified later, define $\lambda \equiv C d r^{1/k - 1/2}$. We construct distributions $\mathcal{D}_1$ and $\mathcal{D}_2$ for $M \in \mathcal{M}_{r,\mu}$ with $\mu = C' \log r$, for some absolute constant $C'$, such that the following holds:

1. $\|M\|_k \leq \lambda$ on the entire support of $\mathcal{D}_1$, and $\|M\|_k \geq 4\lambda$ on the entire support of $\mathcal{D}_2$.

2. Let $\mathcal{E}_1$ and $\mathcal{E}_2$ denote the distributions of the sampled matrix $P_\Omega(M)$ when $M$ is drawn from $\mathcal{D}_1$ and $\mathcal{D}_2$, respectively. Recall that $\Omega$ is the set of edges of the relabeled graph $G_\pi(\widetilde{V}, \Omega)$ as defined in Section 4.1. If $\lambda \leq \lambda_{G,r}$, then the total variation distance between $\mathcal{E}_1$ and $\mathcal{E}_2$ is bounded as $\mathrm{TV}(\mathcal{E}_1, \mathcal{E}_2) \leq 1/2$.

The desired result (15) follows from the above claims and the following relationship between statistical tests and estimators:

$\frac{1}{2}\, P_{M \sim \mathcal{D}_1}\big( \widetilde{\Theta}(P_\Omega(M)) < 2\lambda \big) + \frac{1}{2}\, P_{M \sim \mathcal{D}_2}\big( \widetilde{\Theta}(P_\Omega(M)) \geq 2\lambda \big) \leq \frac{1}{2}\big( 1 + \mathrm{TV}(\mathcal{E}_1, \mathcal{E}_2) \big) \leq \frac{3}{4},$

where the last inequality follows from the claim $\mathrm{TV}(\mathcal{E}_1, \mathcal{E}_2) \leq 1/2$, and the first inequality from the following characterization of the total variation distance: $\mathrm{TV}(\mathcal{E}_1, \mathcal{E}_2) = \sup_A |\mathcal{E}_1(A) - \mathcal{E}_2(A)|$.

To prove the two claims, we construct one of the desired rank-$r$ random matrices via tiling, i.e., covering the matrix with copies of a single $r \times r$ sub-matrix from the Gaussian Wigner ensemble, whose diagonal and (upper-triangular) off-diagonal entries are distributed as i.i.d. standard Gaussians. The other is constructed by adding a rank-one perturbation. Precisely, we define a random matrix drawn from $\mathcal{D}_1$ as follows. A random $r \times r$ matrix $Z$ chosen from the Gaussian Wigner ensemble, $\mathcal{G}(r, r)$, is a symmetric matrix whose entries $Z_{i,i}$ and $Z_{i,j}$ for $i < j$ are independent with $N(0, 1)$ distribution. Define $B \equiv \mathbb{1}_{d/r} \mathbb{1}_{d/r}^\top$ to be an all-ones matrix of size $(d/r) \times (d/r)$.
Let $\bar{\mathcal{D}}_1$ denote the distribution of $M_1 = Y \otimes B$, where $Y \sim \mathcal{G}(r, r)$ and $\otimes$ denotes the standard Kronecker product of two matrices. Note that the norms of $M_1$ and $Y$ are related by $\|M_1\|_k = (d/r)\|Y\|_k$. Since the Schatten norm of $Y \sim \mathcal{G}(r, r)$ takes values on the entire $\mathbb{R}_+$, we need to truncate it. We set $\mathcal{D}_1$ to be $\bar{\mathcal{D}}_1$ conditioned on the event $S_1 = \{M_1 : \|M_1\|_k \leq \lambda,\ \mu(M_1) \leq C' \log r\}$, i.e., $\mathcal{D}_1(A) = \bar{\mathcal{D}}_1(A \cap S_1)/\bar{\mathcal{D}}_1(S_1)$. We define $\mathcal{D}_2$ by adding a rank-one perturbation. Precisely, let $M_2 = M_1 + (5/d)\lambda U$, where $M_1 \sim \bar{\mathcal{D}}_1$ and $U = uu^\top \otimes B$. Here the random vector $u \in \{\pm 1\}^r$ is a vector of i.i.d. Rademacher random variables. Note that $U$ is a rank-one matrix and $\|U\|_k = (d/r)\|uu^\top\|_k = d$. We set $\mathcal{D}_2$ to be $\bar{\mathcal{D}}_2$ conditioned on the event $S_2 = \{M_2 : \|M_2\|_k \geq 4\lambda,\ \mu(M_2) \leq C' \log r\}$. Observe that $M_1 \sim \mathcal{D}_1$ and $M_2 \sim \mathcal{D}_2$ belong to $\mathbb{R}^{d \times d}$, are symmetric, and both have rank at most $r + 1$.

Let $\mathcal{E}_1$ and $\mathcal{E}_2$ denote the distributions of $P_\Omega(M)$ when $M$ is drawn from $\mathcal{D}_1$ and $\mathcal{D}_2$, respectively. We first show that their total variation distance is not too large. Using the triangle inequality, we have

$\mathrm{TV}(\mathcal{E}_1, \mathcal{E}_2) \leq \mathrm{TV}(\bar{\mathcal{E}}_1, \bar{\mathcal{E}}_2) + \mathrm{TV}(\bar{\mathcal{E}}_1, \mathcal{E}_1) + \mathrm{TV}(\bar{\mathcal{E}}_2, \mathcal{E}_2)$
$\leq \mathrm{TV}(\bar{\mathcal{E}}_1, \bar{\mathcal{E}}_2) + \mathrm{TV}(\bar{\mathcal{D}}_1, \mathcal{D}_1) + \mathrm{TV}(\bar{\mathcal{D}}_2, \mathcal{D}_2) \qquad (29)$
$= \mathrm{TV}(\bar{\mathcal{E}}_1, \bar{\mathcal{E}}_2) + P_{M_1 \sim \bar{\mathcal{D}}_1}\Big( (\|M_1\|_k \geq \lambda) \cup \big(\mu(M_1) \geq C'\sqrt{\log r}\big) \Big) + P_{M_2 \sim \bar{\mathcal{D}}_2}\Big( (\|M_2\|_k \leq 4\lambda) \cup \big(\mu(M_2) \geq C'\sqrt{\log r}\big) \Big), \qquad (30)$

where (29) follows from the data processing inequality and (30) follows from the characterization $\mathrm{TV}(\mu, \nu) = \sup_A |\mu(A) - \nu(A)|$. We next show that the three terms in (30) are sufficiently small. We first provide an upper bound on $\mathrm{TV}(\bar{\mathcal{E}}_1, \bar{\mathcal{E}}_2)$. By our construction, only the upper triangle (including the diagonal) of the upper-left $r \times r$ submatrix of $M_1 \sim \bar{\mathcal{D}}_1$ and $M_2 \sim \bar{\mathcal{D}}_2$ has unique entries; the rest are copies of these. Observe that the set of unique entries of $M_1$ (or $M_2$) corresponding to any pattern graph $G(V, E)$ are precisely the following entries of the projection graph $P^{(r)}(G)$ that is defined in Section 4.1:

$E\big(P^{(r)}(G)\big) \equiv \big\{ (i, j) : i \leq j \in [r],\ (i, j) \in P^{(r)}(G(V, E)) \big\}.$
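The tiling construction and the norm identity $\|Y \otimes B\|_k = (d/r)\|Y\|_k$ can be verified numerically; the sketch below uses a symmetrized Gaussian matrix as a stand-in for $\mathcal{G}(r, r)$ (our own illustration):

```python
import numpy as np

# Tiling: M1 = Y kron B with B the all-ones (d/r) x (d/r) matrix. B has the
# single nonzero singular value d/r, so the singular values of M1 are (d/r)
# times those of Y (plus zeros): rank(M1) <= r and ||M1||_k = (d/r)||Y||_k.
rng = np.random.default_rng(1)
r, d, k = 4, 20, 3
Y = rng.standard_normal((r, r))
Y = (Y + Y.T) / 2                    # symmetric stand-in for G(r, r)
B = np.ones((d // r, d // r))
M1 = np.kron(Y, B)

s_Y = np.linalg.svd(Y, compute_uv=False)
s_M = np.linalg.svd(M1, compute_uv=False)
assert np.linalg.matrix_rank(M1) <= r
assert np.isclose(np.sum(s_M ** k), (d / r) ** k * np.sum(s_Y ** k))
```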
(31)

For the purpose of computing the total variation distance $\mathrm{TV}(\bar{\mathcal{E}}_1, \bar{\mathcal{E}}_2)$, it is sufficient to consider only the $E(P^{(r)}(G_\pi))$ entries of $M_1$, distributed as i.i.d. standard Gaussians $N(0, I_{\ell_1 \times \ell_1})$, and the corresponding entries of $M_2$, distributed as $N(W, I_{\ell_1 \times \ell_1})$, where $\ell_1 = |E(P^{(r)}(G_\pi))|$. The random vector $W$ represents the rank-one perturbation and is distributed as $W_{i,j} = (5/d)\lambda\, u_i u_j$, for $(i, j) \in E(P^{(r)}(G_\pi))$. To bound the total variation distance between $\bar{\mathcal{E}}_1$ and $\bar{\mathcal{E}}_2$, we use the following lemma and the fact that for any two distributions $\mu$ and $\nu$, $\mathrm{TV}(\mu, \nu) \leq \sqrt{\chi^2(\mu \| \nu)}$. Let $\mu * \nu$ denote the convolution of the densities (or, equivalently, the addition of the two independent random variables).

Lemma 16 (Ingster and Suslina (2012), p. 97) It holds that $\chi^2\big( N(0, I_n) * \mu\, \big\|\, N(0, I_n) \big) \leq E\big[\exp(\langle z, z' \rangle)\big] - 1$, where $z, z' \sim \mu$ are independent.

It follows that $\mathrm{TV}(\bar{\mathcal{E}}_1, \bar{\mathcal{E}}_2) \leq \sqrt{E[e^{\langle W, W' \rangle}] - 1} \leq 1/5$ for $\lambda \leq \lambda_G$, where the expectation is taken over independent $W$ and $W'$ which are identically distributed. We show that if $\lambda \leq \lambda_G$ the last inequality holds, as follows:

$E_{W,W'}\big[ \exp(\langle W, W' \rangle) \big] = E_{u,u'}\Big[ \exp\Big( (5/d)^2\lambda^2 \sum_{(i,j) \in E(P^{(r)}(G_\pi))} u_i u'_i u_j u'_j \Big) \Big]$
$= E_u\Big[ \exp\Big( (5/d)^2\lambda^2 \sum_{(i,j) \in E(P^{(r)}(G_\pi))} u_i u_j \Big) \Big] \qquad (32)$
$\leq E_u\Big[ \exp\Big( (5/d)^2\lambda^2 \sum_{(i,j) \in P^{(r)}(G_\pi)} u_i u_j \Big) \Big] \qquad (33), (34)$
$\leq 1 + 1/25, \qquad (35)$

where (32) follows from the fact that $u, u'$ are i.i.d. Rademacher variables, (33) follows from the fact that $f_{G,r}(\lambda)$ defined in (13) is non-decreasing in $\lambda$, (34) follows from the definition of $E(P^{(r)}(G_\pi))$ in (31), and (35) follows from the definition of $\lambda_G$ in (14).

To bound the other two terms in (30), we use Wigner's semicircle law and its rate of convergence for the Gaussian Wigner ensemble $\mathcal{G}(r, r)$ defined above. Consider the empirical spectral distribution of $Z \in \mathbb{R}^{r \times r}$, $F_Z(x) \equiv \frac{1}{r} |\{i : \lambda_i(Z) \leq x\}|$.

Lemma 17 (Wigner (1955)) Define $Z = (1/\sqrt{r})\, Y$ for $Y \sim \mathcal{G}(r, r)$.
Then as $r \to \infty$ the empirical distribution $F_Z(x)$ of $Z$ converges weakly to the distribution $G(x)$ with density $g(t) = \frac{1}{2\pi}\sqrt{4 - t^2}$, $t \in [-2, 2]$.

Lemma 18 (Götze and Tikhomirov (2003)) For any positive constant $\alpha > 0$, let $\ell_{r,\alpha} = \log r\, (\log\log r)^{\alpha}$. There exist absolute positive constants $C$ and $c$ such that for $r$ large enough, $\sup_x |F_Z(x) - G(x)| \leq r^{-1} \log r\, \ell_{r,\alpha}^6 + C \exp(-c\, \ell_{r,\alpha})$.

To bound the Schatten norm of a matrix $Y \sim \mathcal{G}(r, r)$, along with Lemma 17 and Lemma 18 we use the following fact: if $F(x)$ and $G(x)$ are cumulative distribution functions of densities $\mu$, $\nu$, then for any continuous and bounded function $f$, $|\int f\, d\mu - \int f\, d\nu| \leq \|f\|\, \sup_x |F(x) - G(x)|$. Choosing $f(x) = x^k$ for $x \in [-2, 2]$, we can see that for $k = O(\log r)$ there exists a constant $C > 2$ such that with probability at least $1 - 1/80$ it holds that

$\big\| (1/\sqrt{r})\, Y \big\|_k^k = r\Big( \frac{1}{2\pi} \int_{-2}^{2} x^k \sqrt{4 - x^2}\, dx + o(1) \Big) \leq \big( 2^k + o(1) \big) r \leq C^k r. \qquad (36)$

Hence $\|Y\|_k \leq C r^{1/k + 1/2}$. By the construction of the distribution $\bar{\mathcal{D}}_1$, for $M_1 \sim \bar{\mathcal{D}}_1$, $\|M_1\|_k = (d/r)\|Y\|_k \leq C d r^{1/k - 1/2} = \lambda$. Also, by construction, $M_2 \sim \bar{\mathcal{D}}_2$ is $M_2 = M_1 + (5/d)\lambda U$ with $\|U\|_k = d$. Using the triangle inequality, we have

$\|M_2\|_k \geq \|(5/d)\lambda U\|_k - \|M_1\|_k \geq 5\lambda - C d r^{1/k - 1/2} = 4\lambda.$

Recall that the incoherence parameter $\mu(M)$ is defined as $\mu(M) = \max_{i \neq j \in [d]} |M_{i,j}|/(|\sigma_{\max}(M)|\sqrt{r}/d)$. From (36), there exists a constant $0 < C'' < 1$ such that with probability at least $1 - 1/160$ it holds that $\|Y\|_2 \geq C'' r$; the integral in (36) evaluates to $1$ for $k = 2$. Therefore, the largest singular value of $M_1$ is lower bounded: $|\sigma_{\max}(M_1)| \geq C'' d/\sqrt{r}$. Using the fact that there exists a constant $C'$ such that $\max_{i,j \in [r]} |Y_{i,j}| \leq C' \log r$ with probability at least $1 - 1/160$, we have $\mu(M_1) \leq (C'/C'') \log r$. The same $\mu(M_1)$ satisfies the upper bound on the diagonals as well. Therefore, using the union bound, the second and the third terms in (30) are upper bounded by $1/40$.

8.4. Proof of Lemma 12

Observe that for any given permutation $\pi$, $P^{(r)}(G_\pi)$ as defined in Section 4.1 is a clique over a subset of nodes $\widetilde{V}_\pi$, where $|\widetilde{V}_\pi| \leq \min\{\ell, r\}$.
From the definition of $f_{G,r}(\lambda)$ in (13), we have the following:

$E_u \exp\Big( (5/d)^2\lambda^2 \sum_{(i,j) \in P^{(r)}(G_\pi)} u_i u_j \Big) \leq E_u \exp\Big( (5/d)^2\lambda^2 \Big( \sum_{i \in \widetilde{V}_\pi} u_i \Big)^2 \Big) = \sum_{t \geq 0} \frac{(5/d)^{2t}\lambda^{2t}}{t!}\, E_u\Big[ \Big( \sum_{i \in \widetilde{V}_\pi} u_i \Big)^{2t} \Big] \leq 2 \sum_{t \geq 0} \big( 2\,(5/d)^2\lambda^2\, |\widetilde{V}_\pi| \big)^t,$

where the last inequality follows from the bound in (37). Therefore, from the definition of $\lambda_{G,r}$, we have that $\lambda_{G,r}$ is lower bounded by $2^{-4} d\, (\min\{\ell, r\})^{-1/2}$. To bound $E[(\sum_{i \in \widetilde{V}_\pi} u_i)^{2t}]$ for $t \in [1, \infty)$, using the Hoeffding bound we have

$E\Big[ \Big( \sum_{i \in \widetilde{V}_\pi} u_i \Big)^{2t} \Big] = \int_0^{|\widetilde{V}_\pi|^{2t}} P\Big( \Big( \sum_{i \in \widetilde{V}_\pi} u_i \Big)^{2t} \geq z \Big)\, dz \leq 2 \int_0^{|\widetilde{V}_\pi|^{2t}} \exp\Big( -\frac{z^{1/t}}{2\,|\widetilde{V}_\pi|} \Big)\, dz \leq 2\big( 2\,|\widetilde{V}_\pi| \big)^t\, t!, \qquad (37)$

where the integral is evaluated by variable substitution.

8.5. Proof of Lemma 14

For the given pattern graph $G$ and any given permutation $\pi$, let $\widetilde{A}_\pi \in \{0, 1\}^{r \times r}$ be the adjacency matrix of the graph $P^{(r)}(G_\pi)$ that is defined in Section 4.1. Observe that for a permutation $\pi$, $\ell_\pi$ rows of $\widetilde{A}_\pi$ are all-ones and the remaining rows are all-zeros, where $\ell_\pi \leq \min\{\ell, r\}$. Let $A_\pi$ be a copy of $\widetilde{A}_\pi$ with all the diagonal entries replaced by zero. Note that $E_u[(u^\top A_\pi u)^{2t+1}] = 0$ for all $t \geq 0$, where the $u_i$'s are i.i.d. Rademacher random variables. Define $C_\pi \equiv \exp((5/d)^2\lambda^2 \ell_\pi)$. From the definition of $f_{G,r}(\lambda)$ in (13), we have the following:

$E_u \exp\Big( (5/d)^2\lambda^2 \sum_{(i,j) \in P^{(r)}(G_\pi)} u_i u_j \Big) \leq C_\pi\, E_u \exp\big( (5/d)^2\lambda^2 (u^\top A_\pi u) \big) = C_\pi \sum_{t \geq 0} \frac{(5/d)^{4t}\lambda^{4t}}{(2t)!}\, E_u\big[ (u^\top A_\pi u)^{2t} \big] \leq 4\, C_\pi \sum_{t \geq 0} \Big( \frac{2\,(5/d)^2\lambda^2 \sqrt{\ell_\pi r}}{c} \Big)^{2t},$

where the inequality follows from the bound in (38), and $c$ is some absolute constant. Therefore, from the definition of $\lambda_{G,r}$, we have that $\lambda_{G,r}$ is lower bounded by $c\, d\, ((\min\{\ell, r\})\, r)^{-1/4}$. To bound $E_u[(u^\top A_\pi u)^{2t}]$ for $t \in [1, \infty)$, we use the Hanson–Wright inequality. Observe that $\|A_\pi\|_2 \leq \sqrt{\ell_\pi r}$ and $\|A_\pi\|_F^2 = (r - 1)\ell_\pi < \ell_\pi r$. We have

$E_u\big[ (u^\top A_\pi u)^{2t} \big] = \int_0^{(2\sqrt{r\ell_\pi})^{2t}} P\big( (u^\top A_\pi u)^{2t} \geq z \big)\, dz + \int_{(2\sqrt{r\ell_\pi})^{2t}}^{(\ell_\pi r)^{2t}} P\big( (u^\top A_\pi u)^{2t} \geq z \big)\, dz$
$\leq 2 \int_0^{(2\sqrt{r\ell_\pi})^{2t}} \exp\Big( -\frac{c\, z^{1/t}}{4\,\ell_\pi r} \Big)\, dz + 2 \int_{(2\sqrt{r\ell_\pi})^{2t}}^{(\ell_\pi r)^{2t}} \exp\Big( -\frac{c\, z^{1/(2t)}}{2\sqrt{\ell_\pi r}} \Big)\, dz \leq 2(4\ell_\pi r/c)^t\, t! + 2\big( 2\sqrt{\ell_\pi r}/c \big)^{2t}\, (2t)! \leq 4\big( 2\sqrt{\ell_\pi r}/c \big)^{2t}\, (2t)!, \qquad (38)$

where the integrals are evaluated by variable substitution.

8.6. Proof of Theorem 13

For a clique of size $m$ selected uniformly at random, we derive an upper bound on the variance of our estimator.
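The Rademacher moment bound (37) used above can be checked exactly for small $n$ by enumerating all sign vectors (our own sanity-check sketch):

```python
import math
import itertools

def rademacher_moment(n, t):
    """E[(u_1 + ... + u_n)^(2t)] for i.i.d. Rademacher u_i, computed by
    exhaustive enumeration of all 2^n sign vectors (feasible for small n)."""
    total = sum(sum(u) ** (2 * t)
                for u in itertools.product((-1, 1), repeat=n))
    return total / 2 ** n

# The second moment is exactly n; higher moments obey the sub-Gaussian
# bound E[(sum u_i)^(2t)] <= 2 (2n)^t t! of Equation (37).
for n in range(1, 9):
    assert rademacher_moment(n, 1) == n
for n, t in [(4, 2), (6, 2), (6, 3)]:
    assert rademacher_moment(n, t) <= 2 * (2 * n) ** t * math.factorial(t)
```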
Following the notation defined in the proof of Theorem 3, we have the following bound on the variance:

$\mathrm{Var}\big(\widehat{\Theta}_k(P_\Omega(M))\big) = \sum_{\ell} \sum_{\substack{w \neq w' \in \widetilde{W} \\ |w \cap w'| = \ell}} \mathrm{Cov}\Big( \mathbb{I}(w \subseteq \Omega)\,\omega_M(w)\,\frac{c(H(w))}{p(H(w))},\ \mathbb{I}(w' \subseteq \Omega)\,\omega_M(w')\,\frac{c(H(w'))}{p(H(w'))} \Big) + \sum_{H} \frac{c(H)^2}{p(H)^2} \sum_{w: H(w) = H} \omega_M(w)^2\, \mathrm{Var}\big( \mathbb{I}(w \subseteq \Omega) \big)$
$\leq \sum_{\ell} \sum_{\substack{w \neq w' \in W \\ |w \cap w'| = \ell}} \Big( E\big[ \mathbb{I}(w \subseteq \Omega)\,\mathbb{I}(w' \subseteq \Omega) \big] - E\big[ \mathbb{I}(w \subseteq \Omega) \big]\, E\big[ \mathbb{I}(w' \subseteq \Omega) \big] \Big)\, \omega_M(w)\, \omega_M(w')\, \frac{c(H(w))\, c(H(w'))}{p(H(w))\, p(H(w'))} + \sum_{H} \sum_{w: H(w) = H} \frac{c(H)^2\, \omega_M(w)^2}{p(H)^2}\, E\big[ \mathbb{I}(w \subseteq \Omega) \big], \qquad (39)$

where we abuse the earlier-defined notation $|w \cap w'|$ to denote the number of overlapping nodes of the two walks $w, w' \in W$ instead of the number of overlapping edges. Note that in pattern sampling, the covariance term for two walks that do not have any overlapping node is not zero. As earlier, we provide a bound on each of the terms in (39). The probability of any walk $w$ being sampled is $P[w \subseteq \Omega] = \binom{m}{\ell}/\binom{d}{\ell} \leq f(\ell)\, m^\ell/d^\ell$, where $\ell$ is the number of unique nodes that the walk traverses and $f(\ell)$ is an exponential function of $\ell$. Recall that the off-diagonals of the matrix $M$ are bounded as $|M_{ij}| \leq \alpha = \mu\sigma_{\max}\sqrt{r}/d$ and the diagonals as $|M_{ii}| \leq \mu\sigma_{\max} r/d$. We have

$\sum_{w: H(w) \in \mathcal{H}_{k,i}} \frac{c(H)^2\, \omega_M(w)^2}{p(H)^2}\, E\big[ \mathbb{I}(w \subseteq \Omega) \big] \leq$
$\frac{d^k}{2k}(2k)^2 f(k)\, \alpha^{2k}\, \frac{d^k}{m^k} \leq \frac{f(k)^2 (\mu\sigma_{\max})^{2k} r^k}{m^k}, \quad \text{for } i = 1, \qquad (40)$
$d^{k-1} f(k)^2\, \alpha^{2k}\, \frac{d^{k-1}}{m^{k-1}} = \frac{m}{d^2}\, \frac{f(k)^2 (\mu\sigma_{\max})^{2k} r^k}{m^k}, \quad \text{for } i = 2, \qquad (41)$
$d\, f(k)^2\, \alpha^{2k} r^k\, \frac{d}{m} = \frac{r^k m^{k-1}}{d^{2k-2}}\, \frac{f(k)^2 (\mu\sigma_{\max})^{2k} r^k}{m^k}, \quad \text{for } i = 4, \qquad (42)$

and for $i = 3$ and for $1 \leq s \leq k - 1$, we have

$\sum_{w: H(w) \in \mathcal{H}_{k,3,s}} \frac{c(H)^2\, \omega_M(w)^2}{p(H)^2}\, E\big[ \mathbb{I}(w \subseteq \Omega) \big] \leq d^{k-s} f(k)^2\, \alpha^{2k} r^s\, \frac{d^{k-s}}{m^{k-s}} = \frac{m^s r^s}{d^{2s}}\, \frac{f(k)^2 (\mu\sigma_{\max})^{2k} r^k}{m^k}. \qquad (43)$

For any two walks $w, w'$ with $\ell \geq 0$ overlapping nodes, $P[w, w' \subseteq \Omega]/(P[w \subseteq \Omega]\, P[w' \subseteq \Omega]) \leq f(k)\, d^\ell/m^\ell$.
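The clique-sampling probability $P[w \subseteq \Omega] = \binom{m}{\ell}/\binom{d}{\ell} \leq (m/d)^\ell$ used above can be verified directly (a small sketch of ours; the function name is hypothetical):

```python
from math import comb

# Under clique sampling, a walk on l distinct nodes is fully observed iff all
# l nodes land inside the size-m clique: P = C(m, l)/C(d, l), which also equals
# C(d - l, m - l)/C(d, m) and is at most (m/d)^l since m <= d.
def walk_observed_prob(d, m, l):
    return comb(m, l) / comb(d, l)

d, m = 1000, 50
for l in range(1, 7):
    p = walk_observed_prob(d, m, l)
    assert abs(p - comb(d - l, m - l) / comb(d, m)) < 1e-12
    assert p <= (m / d) ** l
```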
For $w \neq w'$ and $|w \cap w'| = \ell$, where the range of $\ell$ varies across equations depending upon the set to which $w, w'$ belong, we have the following bounds on

$\sum_{\substack{w \neq w' \in W,\ |w \cap w'| = \ell \\ H(w) \in \mathcal{H}_{k,i,s},\ H(w') \in \mathcal{H}_{k,i',s'}}} \Big( E\big[ \mathbb{I}(w \subseteq \Omega)\,\mathbb{I}(w' \subseteq \Omega) \big] - E\big[ \mathbb{I}(w \subseteq \Omega) \big]\, E\big[ \mathbb{I}(w' \subseteq \Omega) \big] \Big)\, \omega_M(w)\, \omega_M(w')\, \frac{c(H(w))\, c(H(w'))}{p(H(w))\, p(H(w'))}:$

$\leq f(k)^2 (\mu\sigma_{\max})^{2k}\, \max\{r^2, r^\ell\}/m^\ell, \quad \text{for } i = i' = 1,\ \ell \geq 1, \qquad (44)$
$\leq f(k)^2 (\mu\sigma_{\max})^{2k}\, r^2/m, \quad \text{for } i = i' = 1,\ \ell = 0, \qquad (45)$
$\leq f(k)^2 (\mu\sigma_{\max})^{2k}\, r^k/(m^\ell d^2), \quad \text{for } i = i' = 2, \qquad (46)$
$\leq f(k)^2 (\mu\sigma_{\max})^{2k}\, r^k/(m^\ell d), \quad \text{for } i = i' = 3, \qquad (47)$
$\leq f(k)^2 (\mu\sigma_{\max})^{2k}\, r^k/(d^{2k-2}/r^k), \quad \text{for } i = i' = 4, \qquad (48)$
$\leq f(k)^2 (\mu\sigma_{\max})^{2k}\, r^k/(m^\ell d), \quad \text{for } i = 1,\ i' = 2, \qquad (49)$
$\leq f(k)^2 (\mu\sigma_{\max})^{2k}\, r^k/(m^\ell d/\sqrt{r}), \quad \text{for } i = 1,\ i' = 3, \qquad (50)$
$\leq f(k)^2 (\mu\sigma_{\max})^{2k}\, r^k/(m^\ell d^{k-1}/(\sqrt{r})^k), \quad \text{for } i = 1,\ i' = 4, \qquad (51)$
$\leq f(k)^2 (\mu\sigma_{\max})^{2k}\, r^k/(m^\ell d^2/\sqrt{r}), \quad \text{for } i = 2,\ i' = 3, \qquad (52)$
$\leq f(k)^2 (\mu\sigma_{\max})^{2k}\, r^k/(m^\ell d^k/(\sqrt{r})^k), \quad \text{for } i = 2,\ i' = 4, \qquad (53)$
$\leq f(k)^2 (\mu\sigma_{\max})^{2k}\, r^k/(m^\ell d^{k-1}/(\sqrt{r})^k), \quad \text{for } i = 3,\ i' = 4, \qquad (54)$

where (44) and (45) both use (56), and (45) also uses (55). Note that $\ell$ is zero in (48). Collecting all the terms, and using Chebyshev's inequality along with $\|M\|_k^k \geq r(\sigma_{\min})^k$, we get the desired result. For any two disjoint simple cycles $w \neq w' \in \mathcal{H}_{k,1}$ with $|w \cap w'| = 0$, we have

$P\big[ w' \subseteq \Omega \big] - P\big[ w' \subseteq \Omega \mid w \subseteq \Omega \big] = \frac{(m)_k}{(d)_k} - \frac{(m - k)_k}{(d - k)_k} \leq f(k)\, \frac{m^{k-1}}{d^k}, \qquad (55)$

where $(x)_k \equiv x(x-1)\cdots(x-k+1)$ and the last inequality assumes that $k < d/2$.

Lemma 19 For $k = 3$ and any $0 \leq \ell \leq k$,

$\sum_{w \neq w' \in \mathcal{H}_{k,1}:\ |w \cap w'| = \ell} \omega_M(w)\, \omega_M(w') \leq f(k)\, (\mu\sigma_{\max})^{2k}\, \frac{\max\{r^2, r^\ell\}}{d^\ell}. \qquad (56)$

Although we give a proof for $k = 3$ only, we intentionally state the lemma for general $k$, as we expect it to hold for all $k \geq 3$. The joint walk $w \neq w' \in \mathcal{H}_{k,1} : |w \cap w'| = \ell$ corresponds to the pattern D27 for $\ell = 1$, and to D23 for $\ell = 2$, in Figure 12. Define $\bar{M} \equiv M - \mathrm{diag}(M)$, and let $\circ$ denote the Hadamard product of two matrices.
We have

$\sum_{w \neq w' \in \mathcal{H}_{k,1}:\ |w \cap w'| = 2} \omega_M(w)\, \omega_M(w') = \frac{1}{4} \sum \Big( \bar{M}^2 \circ \bar{M}^2 - (\bar{M} \circ \bar{M})^2 \Big) \circ \big( \bar{M} \circ \bar{M} \big). \qquad (57)$

Let us denote the quantity in (57) by $C_1$. We have

$\sum_{w \neq w' \in \mathcal{H}_{k,1}:\ |w \cap w'| = 1} \omega_M(w)\, \omega_M(w') = \mathrm{diag}(\bar{M}^3)^\top \mathrm{diag}(\bar{M}^3) - 2\, \mathrm{sum}\big( \mathrm{diag}\big( (\bar{M} \circ \bar{M})^3 \big) \big). \qquad (58)$

It is easy to verify Equation (56) for $k = 3$ and $\ell \in \{1, 2\}$ using the fact that $M$ is a $\mu$-incoherent symmetric matrix with its off-diagonals bounded by $\mu\sigma_{\max}(\sqrt{r}/d)$. For $\ell = 0$, the quantity in (56) is the sum over each pair of disjoint triangles. For the sum of all triangles, we have

$\sum_{w \in \mathcal{H}_{k,1}} \omega_M(w) = \frac{1}{6} \sum_i \big( \bar{M}^3 \big)_{ii} \leq (\mu\sigma_{\max})^3\, r. \qquad (59)$

Using Equations (57), (58), and (59), the bound for $\ell = 0$ follows immediately. The bound for $\ell = k$ follows by using the fact that $|M_{i,j}| \leq \mu\sigma_{\max}(\sqrt{r}/d)$ for $i \neq j \in [d]$.

9. k-cyclic pseudographs

We provide an enumeration of all $k$-cyclic pseudographs for $k \in \{4, 5, 6, 7\}$ in Figures 10–15.

Figure 10: The 4-cyclic pseudographs $\mathcal{H}_4$ (patterns B1–B7).
Figure 11: The 5-cyclic pseudographs $\mathcal{H}_5$ (patterns C1–C12).
Figure 12: The 6-cyclic pseudographs $\mathcal{H}_6$ (patterns D1–D32).
Figures 13–15: The 7-cyclic pseudographs $\mathcal{H}_7$ (patterns E1–E69).

10. Efficient computation of $\omega_M(H)$ for $k \in \{4, 5, 6, 7\}$

In this section we provide the complete matrix operations for computing the $\gamma_M(H)$'s. Equations (60)–(66) give expressions to compute $\gamma_M(H)$ for $H \in \mathcal{H}_4$ as labeled in Figure 10. Equations (67)–(78) give expressions to compute $\gamma_M(H)$ for $H \in \mathcal{H}_5$ as labeled in Figure 11. Equations (79)–(110) give expressions to compute $\gamma_M(H)$ for $H \in \mathcal{H}_6$ as labeled in Figure 12.
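As a sanity check on the triangle-sum identity behind (59): for a symmetric matrix, the total weight of all triangles equals $\mathrm{tr}(\bar{M}^3)/6$, since each unordered triple is visited by 6 closed 3-walks. A brute-force verification (our own sketch):

```python
import numpy as np
from itertools import combinations

# Each triangle {i, j, k} contributes M_ij M_jk M_ki, and is counted 6 times
# in tr(Mbar^3) (3 starting points x 2 directions), with Mbar = M - diag(M).
rng = np.random.default_rng(2)
A = rng.standard_normal((7, 7))
M = (A + A.T) / 2
Mbar = M - np.diag(np.diag(M))

brute = sum(M[i, j] * M[j, k] * M[k, i]
            for i, j, k in combinations(range(7), 3))
assert np.isclose(np.trace(Mbar @ Mbar @ Mbar) / 6, brute)
```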
Equations (111) - (203) give expressions to compute γM(H) for H H7 as labeled in Figure 15. For brevity of notations and readability, we define the following additional notations. Let A B denote the Hadamard product. For A Rd d, let sum(A) denote a vector v Rd such that vi = P j [d] Ai,j. With a slight abuse of notation, for v Rd, let sum(v) denote sum of all elements of v that is sum(v) = P i [d] vi. Let sum(γM(Hi) : γM(Hj)) Pj i =i γM(Hi ). Define R 1d d diag(1d d), that is R is an all-ones matrix except on diagonals which are zeros. Further, for brevity, we omit the subscript M from the notations γM(H), OM and DM. γ(B1) = sum(sum(D D D D)) (60) γ(B2) = sum(sum(O O O O)) (61) γ(B3) = 4 tr(O O D D) (62) γ(B4) = 2 sum(sum((O O) (O O) R)) (63) γ(B5) = 2 tr(O D O D) (64) γ(B6) = tr(O O O O) sum(γ(B2) : γ(B4)) (65) γ(B7) = tr(M M M M) sum(γ(B1) : γ(B6)) (66) Spectrum Estimation from a Few Entries γ(C1) = tr(D D D D D) (67) γ(C2) = 5 sum(sum(D O O O O)) (68) γ(C3) = 5 sum(sum((D D D) (O O))) (69) γ(C4) = 5 tr((O O O) O O) (70) γ(C5) = 5 sum(sum(D (O O) (D D))) (71) γ(C6) = 5 sum(sum(((O O) D (O O)) R)) (72) γ(C7) = 5 sum(sum((D (O O) (O O)) R)) (73) γ(C8) = 5 tr(O O O (D D)) (74) γ(C9) = 5 sum(diag(O O O) sum(O O)) 10 tr((O O O) O O)) (75) γ(C10) = tr(O O O O O) γ(C4) γ(C9) (76) γ(C11) = 5 tr(O D O D O) (77) γ(C12) = tr(M M M M M) sum(γ(C1) : γ(C11)) (78) Khetan and Oh γ(D1) = sum(sum(D D D D D D)) (79) γ(D2) = sum(sum(O O O O O O)) (80) γ(D3) = 6 sum(sum(((O O) (O O O O)) R)) (81) γ(D4) = 6 sum(sum(((O O) (D D D D)) R)) (82) γ(D5) = 9 sum(sum(((D D) (O O O O)) R)) (83) γ(D6) = 3 sum(sum(((D D) (O O) (D D)) R)) (84) γ(D7) = 6 sum(sum(((D D) (O O) (O O)) R)) (85) γ(D8) = 9 sum(sum(((O O) (D D) (O O)) R)) (86) γ(D9) = 6 sum(sum(((D D D) (O O) D) R)) (87) γ(D10) = 6 sum(sum((D (O O O O) D) R)) (88) γ(D11) = 3 sum sum(((O O) (O O)) R) sum(O O) sum (O O O O) (O O) R diag((O O) (O O) (O O)) γ(D12) = 4 tr((O O) (O O) (O O)) (90) γ(D13) = 2 sum (sum(O O)) (sum(O O)) (sum(O O)) 
sum((O O O O O O)) 3 (sum(O O O O)) (sum(O O)) (sum(O O O O O O))
γ(D14) = 3 sum(sum((D (O O) (O O) D) R)) (92)
γ(D15) = 12 sum(sum((D (O O) D (O O)) R)) (93)
γ(D16) = 6 sum sum(((O O O) O) R (O O)) sum(((O O O O) (O O)) R) (94)
γ(D17) = 6 tr((D D D) O O O) (95)
γ(D18) = 24 tr(D (O O O) O O) (96)
γ(D19) = 6 tr(D O (O O O) O) (97)
γ(D20) = 6 sum(sum((O O) ((O (D D) O) R))) sum(sum(((O O) (D D) (O O)) R)) (98)
γ(D21) = 12 tr(O (D D) O D O) (99)
γ(D22) = 6 sum sum ((O O) R (O O) ((O O) (O O)) R) sum(O O) 2 sum sum((((O O O) O) R (O O) ((O O O O) (O O)) R)) sum sum (O O) R (O O) ((O O) (O O)) R (O O) (100)
γ(D23) = 9 sum(sum(((O O) R (O O) ((O O) (O O)) R) ((O O)))) (101)
γ(D24) = 12 sum(diag(O D O O) sum((O O)) diag((O O O) D O O) diag((O O O) O D O)) (102)
γ(D25) = 6 sum(diag(O O O) sum((O O) D) 2 diag((O O O) D O O)) (103)
γ(D26) = 12 sum(diag(O O O) diag(D) sum((O O)) diag((O O O) O O) diag(D)) (104)
γ(D27) = 3 sum diag(O O O) diag(O O O) 2 diag((O O) (O O) (O O)) (4/3) γ(D23) (105)
γ(D28) = tr(O O O O O O) γ(D2) γ(D3) γ(D11) γ(D12) γ(D13) γ(D16) γ(D22) γ(D23) γ(D27) (106)
γ(D29) = 2 tr(D O D O D O) (107)
γ(D30) = 3 sum(sum((O D O) R (O D O)) sum(((O O) (D D) (O O)) R)) (108)
γ(D31) = 6 sum(sum((O D O D) R (O O)) sum(((O O) D (O O) D) R)) (109)
γ(D32) = tr(M M M M M M) tr(O O O O O O) sum(γ(D1) : γ(D26)) + γ(D2) + γ(D3) + γ(D11) + γ(D12) + γ(D13) + γ(D16) + γ(D22) + γ(D23) γ(D29) γ(D30) γ(D31) (110)
γ(E1) = sum(diag((D D D D D D D))) (111)
γ(E2) = 7 sum(sum((O O) (D D D D D))) (112)
γ(E3) = 7 sum(sum(((D D) (O O) (D D D)) R)) (113)
γ(E4) = 14 sum(sum((O O O O) (D D D))) (114)
γ(E5) = 7 sum(sum((O O O O O O) D)) (115)
γ(E6) = 7 sum(sum((D (O O) (D D D D)) R)) (116)
γ(E7) = 21 sum(sum((D (O O O O) (D D)) R)) (117)
γ(E8) = 7 sum(sum(((O O) (O O) (D D D)) R)) (118)
γ(E9) = 14 sum(sum(((O O) (D D D) (O O)) R)) (119)
γ(E10) = 7 sum(sum(((O O O O) (O O) D) R)) (120)
γ(E11) = 21 sum(sum(((O O O O) D (O O)) R)) (121)
γ(E12) = 14 sum(sum((D (O O O O) (O O)) R)) (122)
γ(E13) = 7 tr((O O O O O) O O) (123)
γ(E14) = 14 tr((O O O) O (O O O)) (124)
γ(E15) = 7 sum(sum(((O O) (O O)) R) sum((O O) D) sum(((O O O O) D (O O)) R) diag(((O O) D (O O) (O O)))) (125)
γ(E16) = 14 sum((sum(((O O) (O O)) R) sum((O O)) sum(((O O O O) (O O)) R) diag(((O O) (O O) (O O)))) diag(D)) (126)
γ(E17) = 7 sum(((sum(O O) sum(O O) sum(O O)) sum((O O O O O O)) 3 (sum((O O O O)) sum((O O)) sum((O O O O O O)))) diag(D)) (127)
Z1 = 0.5 ((sum(O O) sum(O O)) sum((O O O O)))
γ(E18) = 14 sum(sum((O O) D) Z1 sum((O O O O) D) sum((O O)) + sum((O O O O O O) D)) (128)
γ(E19) = 28 sum(diag((O O) (O O) (O O)) diag(D)) (129)
γ(E20) = 21 sum(sum((D (O O) (D D) (O O)) R)) (130)
γ(E21) = 14 sum(sum(((D D) (O O) D (O O)) R)) (131)
γ(E22) = 7 sum(sum((D (O O) (O O) (D D)) R)) (132)
γ(E23) = 7 sum(diag(O O O) diag((D D D D))) (133)
γ(E24) = 28 sum(diag((O O O) O O) sum((O O)) diag((O O O O O) O O) diag((O O O) O (O O O))) (134)
γ(E25) = 7 sum(diag(O (O O O) O) sum((O O)) 2 diag((O O O) (O O O) O)) (135)
γ(E26) = 7 sum(diag(O (O O O) O) diag((D D))) (136)
γ(E27) = 42 sum(diag((O O O) O O) diag((D D))) (137)
γ(E28) = 7 sum(diag(O O O) sum((O O O O)) 2 diag((O O O O O) O O)) (138)
γ(E29) = 7 sum(sum((D (O O) D (O O) D) R)) (139)
γ(E30) = 28 sum(diag(O D (O O O) O) diag(D)) (140)
γ(E31) = 28 tr(O D (O O O) D O) (141)
γ(E32) = 14 sum(diag(O (D D) O O) sum((O O)) diag((O O O) O (D D) O) diag((O O O) (D D) O O)) (142)
γ(E33) = 14 sum(diag(O D O O) diag((D D D))) (143)
γ(E34) = 7 tr(O (D D) O (D D) O) (144)
γ(E35) = 7 (sum(sum((((O O) R) ((O (D D D) O) R)))) sum(sum(((O O) (D D D) (O O)) R))) (145)
γ(E36) = 14 sum(sum(((O O O) O) R (O D O)) sum(((O O O O) D (O O)) R)) (146)
γ(E37) = 28 sum(sum(((O O O) D O) R (O O)) sum(((O O O O) D (O O)) R)) (147)
Z2 = (((O O) R) O O (1d 1 (sum((O O) )) (O O))) R (148)
Z3 = (O ((O O) R)) R (149)
Z4 = (O (((O O O O O) O) R)) R (150)
Z6 = ((O O O) ((O O) R)) R (151)
Z7 = (O (((O O O) (O O O)) R)) R (152)
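The expressions above are built from a small vocabulary of matrix operations (entrywise products, sum, diag, tr), and helper matrices such as Z2–Z7 cache subexpressions that several γ(·) terms share. A minimal NumPy sketch of these building blocks follows; the matrices O and D here are illustrative stand-ins, not the paper's exact operands, since the operator placement in the extracted formulas is ambiguous:

```python
import numpy as np

# Hypothetical small symmetric "observed" matrix O and its diagonal part D,
# standing in for the symbols used in the formulas above.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
O = (A + A.T) / 2
D = np.diag(np.diag(O))

# One identity these expressions lean on: the sum of all entries of a
# Hadamard (entrywise) product equals a trace of a matrix product.
lhs = np.sum(O * O)       # sum(sum(O .* O))
rhs = np.trace(O @ O.T)   # tr(O O^T)
assert np.isclose(lhs, rhs)

# Caching a shared subexpression, the role played by the Z_i helpers:
Z = O @ O                 # reused wherever O^2 appears
walk4 = np.trace(Z @ Z)   # tr(O^4) computed from the cached product
assert np.isclose(walk4, np.trace(np.linalg.matrix_power(O, 4)))
```

Precomputing such shared products is what keeps the evaluation of all γ(·) terms linear in the number of distinct subexpressions rather than in the number of formulas.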
γ(E38) = 7 sum(sum((((O O O) O) R Z2 (((O O O O) Z3) R Z4) ((Z6 (O O)) R Z7)))) (153)
Z7 = 0.5 sum(sum(O (((O O) (O O)) R) ((O O) R) O (((O O O) (O O O)) R))) (154)
γ(E39) = 7 (sum(sum((O ((O O) R) (sum((O O)) 11 d (O O)) (1d 1 (sum((O O) )) (O O))))) sum(sum((O (((O O O) O) R) (1d 1 (sum((O O) )) (O O))))) sum(sum((O ((O (O O O)) R) (sum((O O)) 11 d (O O))))) + sum(sum((O (((O O O) (O O O)) R))))) 14 Z7 (155)
γ(E40) = 21 sum(diag((D D) O O O) sum((O O)) 2 diag((D D) (O O O) O O)) (156)
γ(E41) = 7 sum(diag(O O O) sum((O O) (D D)) 2 diag((O O O) (D D) O O)) (157)
γ(E42) = 7 (sum(diag(O O O) sum(((O O) (O O)) R) 2 diag((O O O) (O O O) O)) 2 sum(diag((O O O) O O) sum((O O)) diag((O O O O O) O O) diag((O O O) O (O O O)))) 28 Z7 (158)
γ(E43) = 14 sum(diag(O O O) Z1 2 (diag((O O O) O O) sum((O O)) diag((O O O O O) O O) 0.5 diag((O O O) O (O O O)))) (159)
γ(E44) = 56 Z7 (160)
Z8 = (O (((O O O) O) R)) R (161)
Z9 = (O ((O O) R)) R (162)
Z10 = (O ((O (O O O)) R)) R (163)
Z11 = ((O O) R Z2 (((O O) Z3) R Z8) ((Z9 (O O)) R Z10)) (164)
γ(E45) = 14 (sum(0.5 sum(Z11) sum((O O))) (1/7) γ(E38) sum(sum(((O O)) Z11))) (165)
γ(E46) = 21 sum(sum(((O O)) Z11)) (166)
γ(E47) = 7 sum(sum(Z11) diag((D D))) (167)
γ(E48) = 7 tr((D D) O D O D O) (168)
γ(E49) = 14 sum(diag(D O O O) sum((O O) D) 2 diag(D (O O O) D O O)) (169)
γ(E50) = 14 sum(diag(O O D O) sum((O O) D) diag((O O O) D O D O) diag((O O O) (D D) O O)) (170)
γ(E51) = 28 sum(diag(D O D O O) sum((O O)) diag(D (O O O) D O O) diag(D (O O O) O D O)) (171)
γ(E52) = 7 sum(diag(O D O D O) sum((O O)) 2 diag((O O O) D O D O)) (172)
γ(E53) = 14 sum((sum(((((O O) R) ((O D O) R)) ((O O) D (O O)) R))) diag((D D))) (173)
γ(E54) = 7 sum(sum(((((O D O) R) ((O (D D) O) R)) ((O O) (D D D) (O O)) R))) (174)
Z12 = sum(0.5 sum(((((O O) R) ((O O) R)) (((O O) (O O)) R)) ((O O) D))) (175)
Z13 = sum(sum((((((O O O) D O) R) ((O O) R)) ((O O O O) D (O O)) R))) (176)
Z14 = 0.5 sum(sum(((((O D O) R) ((O O) R)) (((O O) D (O O)) R)) ((O O)))) (177)
Z15 = sum(sum((((((O O O) O) R) ((O D O) R)) ((O O O O) D (O O)) R))) (178)
γ(E55) = 14 (sum(0.5 (sum(((((O O) R) (((O O)) R)) ((O O) (O O)) R))) sum((O O) D)) Z13 Z12) (179)
γ(E56) = 28 (sum(0.5 (sum(((((O O) R) (((O O)) R)) ((O O) (O O)) R))) sum(D (O O))) Z13 Z12) (180)
γ(E57) = 14 (sum((sum(((((O D O) R) (((O O)) R)) ((O O) D (O O)) R))) sum((O O))) Z13 Z15 2 Z14) (181)
γ(E58) = 14 (sum(0.5 sum((((((O O) R) ((O O) R)) (((O O) (O O)) R)) D)) sum((O O))) Z15 Z12) (182)
γ(E59) = 84 Z12 (183)
γ(E60) = 42 Z14 (184)
Z25 = tr(M M M M M M M) sum(γ(E1) : γ(E60)) (185)
Z26 = tr(O O O O O O O) γ(E13) γ(E14) γ(E24) γ(E25) γ(E28) γ(E38) γ(E39) sum(γ(E42) : γ(E46)) (186)
Z16 = (1/6) ((O O R) (O O R) (O O R) ((O O O) (O O O) R) 3 (((O O) (O O) R) (O O R) ((O O O) (O O O) R))) (187)
γ(E61) = 42 sum(sum(Z16 O)) (188)
Z17 = sum(sum(0.5 ((O O R) (O O R) ((O O) (O O) R))) (0.5 diag(O O O))) (189)
γ(E62) = 28 (Z17 (6/84) γ(E61) (2/42) γ(E46) (3/56) γ(E44)) (190)
γ(E63) = Z26 γ(E61) γ(E62) (191)
γ(E64) = 7 sum(sum((D O D O D R) (O O R))) 7 sum(sum(D (O O) D (O O) D R)) (192)
γ(E65) = 7 sum(sum(D Z11 D)) (193)
Z18 = sum(((O O) R (O O) ((O O) (O O)) R) ((O O))) (194)
γ(E66) = 7 sum(((diag(O O O) diag(O O O)) 2 diag((O O) (O O) (O O)) 4 Z18) diag(D)) (195)
Z20 = 0.5 sum(sum(((O O R) (O D O R) ((O O) D (O O))) (O O))) (196)
γ(E67) = 14 (sum(diag(O O O) diag(O O D O) 2 diag((O O) (O O) D (O O))) 2 sum(Z18 diag(D)) 4 Z20) (197)
Z21 = (((O D O D) R) O O (1d 1 sum(D (O O) D, 1) D (O O) D)) R (198)
Z22 = (O ((O D O) R)) R (199)
Z23 = (O ((D (O O O) D O) R)) R (200)
Z24 = (O ((O D (O O O) D) R)) R (201)
γ(E68) = 7 sum(sum(((O O) R Z21 (((O O) D Z22) R Z23) ((Z22 D (O O)) R Z24)))) (202)
γ(E69) = Z25 Z26 sum(γ(E64) : γ(E68)) (203)
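The γ(·) weights above combine into unbiased estimators of walk counts, and hence Schatten norms, from partially observed matrices. As a toy illustration of the underlying principle (a simplification under stated assumptions, not the paper's estimator): for a symmetric matrix M with zero diagonal whose entries are revealed independently with probability p, every surviving closed walk of length 3 traverses three distinct edges, so a single rescaling by p^{-3} already makes tr(O^3) unbiased for tr(M^3). Longer walks can repeat edges and therefore pick up different powers of p, which is exactly why the formulas above weight each walk type separately. The function name and interface below are illustrative, not from the paper:

```python
import numpy as np

def trace_cubed_estimate(O, p):
    """Unbiased estimate of tr(M^3) from O = mask * M, where M is symmetric
    with zero diagonal and each edge {i, j} is observed independently with
    probability p.

    With a zero diagonal, every term M_ij M_jk M_ki in tr(M^3) uses three
    distinct edges, so E[O_ij O_jk O_ki] = p^3 M_ij M_jk M_ki and the
    p^{-3} rescaling removes the sampling bias.
    """
    return np.trace(O @ O @ O) / p**3

rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n))
M = A + A.T
np.fill_diagonal(M, 0.0)

# With everything observed (p = 1) the estimate recovers tr(M^3) exactly.
assert np.isclose(trace_cubed_estimate(M, 1.0), np.trace(M @ M @ M))

# A single partial observation gives a noisy but unbiased estimate:
# sample each unordered pair once and symmetrize the mask.
p = 0.5
upper = np.triu(rng.random((n, n)) < p, 1)
mask = (upper | upper.T).astype(float)
estimate = trace_cubed_estimate(mask * M, p)
```

Averaging `estimate` over many independent sampling masks would converge to tr(M^3); a single draw can be far off, which is why the paper's analysis focuses on the variance of such estimators.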