Journal of Machine Learning Research 20 (2019) 1-42. Submitted 10/18; Revised 3/19; Published 6/19.

Graph Reduction with Spectral and Cut Guarantees

Andreas Loukas (andreas.loukas@epfl.ch)
Laboratoire de traitement des signaux 2, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland

Editor: Karsten Borgwardt

Abstract

Can one reduce the size of a graph without significantly altering its basic properties? The graph reduction problem is hereby approached from the perspective of restricted spectral approximation, a modification of the spectral similarity measure used for graph sparsification. This choice is motivated by the observation that restricted approximation carries strong spectral and cut guarantees, and that it implies approximation results for unsupervised learning problems relying on spectral embeddings. The article then focuses on coarsening, the most common type of graph reduction. Sufficient conditions are derived for a small graph to approximate a larger one in the sense of restricted approximation. These findings give rise to algorithms that, compared to both standard and advanced graph reduction methods, find coarse graphs of improved quality, often by a large margin, without sacrificing speed.

Keywords: graph reduction and coarsening, spectral methods, unsupervised learning

1. Introduction

As graphs grow in size, it becomes pertinent to look for generic ways of simplifying their structure while preserving key properties. Simplified graph representations find profound use in the design of approximation algorithms, can facilitate storage and retrieval, and ultimately ease graph data analysis by separating overall trends from details.

There are two main ways to simplify graphs. First, one may reduce the number of edges, a technique commonly referred to as graph sparsification. In a series of works, it has been shown that it is possible to find sparse graphs that approximate all pairwise distances (Peleg and Schäffer, 1989), every cut (Karger, 1999), or every eigenvalue (Spielman and Teng, 2011), respectively referred to as spanners, cut sparsifiers, and spectral sparsifiers. Spectral sparsification techniques, in particular, can yield computational benefits whenever the number of edges is the main bottleneck (Batson et al., 2013). Indeed, they form a fundamental component of nearly-linear time algorithms for linear systems involving symmetric diagonally dominant matrices (Koutis et al., 2010; Spielman and Srivastava, 2011), and have found application to machine learning problems involving graph-structured data (Calandriello et al., 2018).

Alternatively, one may seek to reduce the size of the graph directly, i.e., the number of its vertices N, by some form of vertex selection or re-combination scheme followed by re-wiring. This idea can be traced back to the multigrid literature, which targets the acceleration of finite-element methods using cycles of multi-level coarsening, lifting, and refinement.
After being generalized to graphs, reduction methods have become pervasive in computer science and form a key element of modern graph processing pipelines, especially with regard to graph partitioning (Hendrickson and Leland, 1995; Karypis and Kumar, 1998; Kushnir et al., 2006; Dhillon et al., 2007; Wang et al., 2014) and graph visualization (Koren, 2002; Hu, 2005; Walshaw, 2006). In machine learning, reduction methods are used to create multiscale representations of graph-structured data (Lafon and Lee, 2006; Gavish et al., 2010; Shuman et al., 2016) and as a layer of graph convolutional neural networks (Bruna et al., 2014; Defferrard et al., 2016; Bronstein et al., 2017; Simonovsky and Komodakis, 2017; Ardizzone et al., 2018; Liang et al., 2018). In addition, having been shown to solve linear systems in (empirically) linear time (Koutis et al., 2011; Livne and Brandt, 2012) as well as to approximate the Fiedler vector (Urschel et al., 2014; Gandhi, 2016), reduction methods have been used to accelerate inverse problems employing graph regularization (Hirani et al., 2015; Colley et al., 2017). Some of their main benefits are the ability to deal with sparse graphs (graphs with at most O(N log N) edges) and to accelerate algorithms whose complexity depends on the number of vertices as well as edges.

In contrast to graph sparsification, there has been only circumstantial theory supporting graph reduction (Moitra, 2011; Dörfler and Bullo, 2013; Loukas and Vandergheynst, 2018). The lack of a concrete understanding of how different reduction choices affect fundamental graph properties is an issue: the significant majority of reduction algorithms in modern graph processing and machine learning pipelines have been designed based on intuition and possess no rigorous justification or provable guarantees.

1.1. A Spectral Perspective

My starting point in this work is spectral similarity, a measure that has proven useful in sparsification for determining how well a graph approximates another one. To render spectral similarity applicable to graphs of different sizes, I generalize it and restrict it over a subspace of size that is at most equal to the size of the reduced graph. I refer to the resulting definition as restricted spectral approximation^1 (or restricted approximation for short).

Despite being a statement about subspaces, restricted approximation has significant consequences. It is shown that when the subspace in question is a principal eigenspace (this is a feature-agnostic choice where one wants to preserve only the graph structure), the eigenvalues and eigenspaces of the reduced graph approximate those of the original large graph. It is then a corollary that (i) if the large graph has a good (normalized) cut, so does the smaller one; and (ii) that unsupervised learning algorithms that utilize spectral embeddings, such as spectral clustering (Von Luxburg, 2007) and Laplacian eigenmaps (Belkin and Niyogi, 2003), can also work well when run on the smaller graph and their solution is lifted.

1. Though similarly named, the definition of restricted spectral similarity previously proposed by Loukas and Vandergheynst (2018) concerns a set of vectors (rather than subspaces) and is significantly weaker than the one examined here.
More generally, it is expected that methods relying on specific eigenspaces will exhibit similar behavior when combined with coarsening, e.g., in graph signal regression (Loukas and Perraudin, 2016; Grassi et al., 2017), compression (Shahid et al., 2016), sampling (Puy et al., 2018), or forecasting (Isufi et al., 2018).

The analysis then focuses on graph coarsening, a popular type of reduction where, in each level, reduced vertices are formed by contracting disjoint sets of connected vertices (each such set is called a contraction set). I derive sufficient conditions for a small coarse graph to approximate a larger graph in the sense of restricted spectral approximation. Crucially, this result holds for any number of levels and is independent of how the subspace is chosen. Though the derived bound is global, a decoupling argument renders it locally separable over levels and contraction sets, facilitating computation. The final bound can be interpreted as measuring the local variation over each contraction set, as it involves the maximum variation of vectors supported on each contracted subgraph.

These findings give rise to algorithms for graph coarsening that I refer to as local variation algorithms. Each such algorithm starts from a predefined family of candidate contraction sets. Even though any connected set of vertices may form a valid candidate set, I opt for small well-connected sets constructed based on simple rules: in the edge-based variant, one candidate set is constructed for each pair of adjacent vertices, whereas in the neighborhood-based variant every candidate set contains a vertex along with all adjacent vertices. The algorithm then greedily^2 contracts those sets whose local variation is the smallest. Depending on how the candidate family is constructed, the proposed algorithms obtain different solutions, trading off computational complexity for reduction.

1.2. Theoretical and Practical Implications

This work improves and generalizes upon the state-of-the-art in several ways:

Instead of directly focusing on specific constructions, a general graph reduction scheme is studied, featuring coarsening as a special case. As a consequence, the implications of restricted approximation are proven in a fairly general setting where specifics of the reduction, such as the type of graph representation and the reduction matrices involved, are abstracted.

Contrary to previous results on the analysis of coarsening (Loukas and Vandergheynst, 2018), the analysis holds for multiple levels of reduction. Given that the majority of coarsening methods reduce the number of vertices by a constant factor at each level, a multi-level approach is necessary to achieve significant reduction. Along that line, the analysis also brings an intuitive insight: rather than taking the standard approach of approximating at each level the graph produced by the previous level, one should strive to preserve the properties of the original graph at every level.

2. Even after decoupling, the problem of candidate set selection is not only NP-hard but also cannot be approximated to a constant factor in polynomial time (by reduction to the maximum-weight independent set problem). For the specific case of edge-based families, where one candidate set is constructed for each pair of adjacent vertices, the greedy iterative contraction can be substituted by more sophisticated procedures accompanied by improved guarantees.
The proposed local variation algorithms are not heuristically designed, but optimize (an upper bound of) the restricted spectral approximation objective. Despite the breadth of the literature that utilizes some form of graph reduction and coarsening, the overwhelming majority of known methods are heuristics (Safro et al., 2015). A notable exception is Kron reduction (Dörfler and Bullo, 2013), an elegant method that aims to preserve the effective resistance distance. Compared to Kron reduction, the methods proposed here are accompanied by stronger spectral guarantees (i.e., beyond interlacing), do not sacrifice the sparsity of the graph, and can ultimately be more scalable, as they do not rely on the Schur complement of the Laplacian matrix.

To demonstrate the practical benefits of local variation methods, the analysis is complemented with numerical results on representative graphs ranging from scale-free to planar graphs. Compared to both standard (Karypis and Kumar, 1998) and advanced reduction methods (Ron et al., 2011; Livne and Brandt, 2012; Shuman et al., 2016), the proposed methods yield small graphs of improved spectral quality, often by a large margin, without being much slower than naive heavy-edge matching. A case in point: when examining how close the principal eigenvalues of the coarse and original graph are for a reduction of 70%, local variation methods attain on average a 2.6× smaller error; this gain becomes 3.9× if one does not include Kron reduction in the comparison.

2. Graph Reduction and Coarsening

The following section introduces graph reduction. The exposition starts by considering a general reduction scheme. It is then shown how graph coarsening arises if one additionally imposes a few natural restrictions on the interpretability of reduced variables.

2.1. Graph Reduction

Consider a positive semidefinite (PSD) matrix $L \in \mathbb{R}^{N \times N}$ whose sparsity structure captures the connectivity structure of a connected weighted undirected graph $G = (\mathcal{V}, \mathcal{E}, W)$ of $N = |\mathcal{V}|$ vertices and $M = |\mathcal{E}|$ edges. In other words, $L(i, j) \neq 0$ only if $e_{ij}$ is a valid edge between vertices $v_i$ and $v_j$ (or $i = j$). Moreover, let $x$ be an arbitrary vector of size N.

I study the following generic reduction scheme:

Scheme 1: Graph reduction
Commence by setting $L_0 = L$ and $x_0 = x$, and proceed according to the following two recursive equations:
$$L_\ell = P_\ell^{\mp} L_{\ell-1} P_\ell^{+} \quad \text{and} \quad x_\ell = P_\ell\, x_{\ell-1},$$
where $P_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ are matrices with more columns than rows, $\ell = 1, 2, \ldots, c$ is the level of the reduction, symbol $+$ (resp. $\mp$) denotes the pseudoinverse (resp. transposed pseudoinverse), and $N_\ell$ is the dimensionality at level $\ell$, such that $N_0 = N$ and $N_c = n \ll N$. Vector $x_c$ is lifted back to $\mathbb{R}^N$ by the recursion $\tilde{x}_{\ell-1} = P_\ell^{+} \tilde{x}_\ell$, where $\tilde{x}_c = x_c$.

Graph reduction thus involves a sequence of $c + 1$ graphs
$$G = G_0 = (\mathcal{V}_0, \mathcal{E}_0, W_0) \;\to\; G_1 = (\mathcal{V}_1, \mathcal{E}_1, W_1) \;\to\; \cdots \;\to\; G_c = (\mathcal{V}_c, \mathcal{E}_c, W_c) \tag{1}$$
of decreasing size $N = N_0 > N_1 > \cdots > N_c = n$, where the sparsity structure of $L_\ell$ matches that of graph $G_\ell$, and each vertex of $G_\ell$ represents one or more vertices of $G_{\ell-1}$. The multi-level design allows us to achieve a high dimensionality reduction ratio even when the per-level ratio $r_\ell = 1 - N_\ell / N_{\ell-1}$ is small. For instance, supposing that $r_\ell \geq \varrho$ for each $\ell$, then $c = O(\log(n/N) / \log(1-\varrho))$ levels suffice to reduce the dimension to n.
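To make Scheme 1 concrete, the following is a minimal numpy sketch of one reduce-then-lift cycle. It is an illustration only: the matrices in `P_list` are assumed to be given (in practice they come from a coarsening algorithm), and all names are mine rather than from an official implementation.

```python
import numpy as np

def reduce_and_lift(L, x, P_list):
    """Scheme 1: L_l = pinv(P_l).T @ L_{l-1} @ pinv(P_l) and x_l = P_l @ x_{l-1};
    afterwards x_c is lifted back to R^N via x_{l-1} = pinv(P_l) @ x_l."""
    L_l, x_l, pinvs = L, x, []
    for P in P_list:                    # forward pass through levels 1, ..., c
        P_plus = np.linalg.pinv(P)
        L_l = P_plus.T @ L_l @ P_plus   # transposed pseudoinverse on the left
        x_l = P @ x_l
        pinvs.append(P_plus)
    x_lift = x_l                        # lifting: unfold the recursion backwards
    for P_plus in reversed(pinvs):
        x_lift = P_plus @ x_lift
    return L_l, x_l, x_lift
```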
One may express the reduced quantities in a more compact form:
$$x_c = P x, \qquad L_c = P^{\mp} L P^{+}, \qquad \tilde{x} = \Pi x, \tag{2}$$
where $P = P_c \cdots P_1$, $P^{+} = P_1^{+} \cdots P_c^{+}$, $\Pi = P^{+} P$, and, for convenience, I drop zero indices and refer to a lifted vector as $\tilde{x}\,(= \tilde{x}_0)$. The rationale of this scheme is that vector $\tilde{x}$ should be the best approximation of $x$ given $P$ in an $\ell_2$-sense, which is a consequence of the following property:

Property 1: $\Pi$ is a projection matrix.

On the other hand, matrix $L$ is reduced such that $x_c^\top L_c x_c = \tilde{x}^\top L \tilde{x}$.

Though introduced here for the reduction of sparse PSD matrices representing the sparsity structure of a graph, Scheme 1 can also be applied to any PSD matrix L. This and similar reduction schemes belong to the class of Nyström methods and, to the extent of my knowledge, were first studied in the context of low-rank matrix approximation (Halko et al., 2011; Wang and Zhang, 2013). Despite the common starting point, interpreting $L$ and $L_c$ as sparse graph matrices, as is done here, incorporates a graph-theoretic twist to reduction, distinguishing it from previous approaches:^3 the constructions that we will study are eventually more scalable and interpretable, as they maintain the graph structure of L after reduction. Obtaining guarantees is also arguably more challenging in this setting, as the involved problems end up being strongly combinatorial.

3. To achieve low-rank approximation, matrix P is usually built by sampling columns of L.

2.2. Properties of Reduced Graphs

Even in the general context where P is an arbitrary $n \times N$ matrix, certain handy properties can be proven about the relation between $L_c$ and $L$. To begin with, it is simple to see that the set of positive semidefinite matrices is closed under reduction.

Property 2: If L is PSD, then so is $L_c$.

The proof is elementary: if L is PSD, then there exists a matrix S such that $L = S^\top S$, implying that $L_c = P^{\mp} L P^{+}$ can also be written as $L_c = S_c^\top S_c$ if one sets $S_c = S P^{+}$.

I further consider the spectrum of the two matrices. Sort the eigenvalues of L as $\lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_N$ and denote by $\tilde{\lambda}_k$ the k-th smallest eigenvalue of $L_c$ and by $\tilde{u}_k$ the associated eigenvector. It turns out that the eigenvalues $\tilde{\lambda}$ and $\lambda$ are interlaced.

Theorem 3: For any P with full row rank and $k = 1, \ldots, n$, we have
$$\gamma_1\, \lambda_k \;\leq\; \tilde{\lambda}_k \;\leq\; \gamma_2\, \lambda_{k+N-n},$$
with $\gamma_1 = \lambda_1((PP^\top)^{-1})$ and $\gamma_2 = \lambda_n((PP^\top)^{-1})$, respectively the smallest and largest eigenvalues of $(PP^\top)^{-1}$.

The above result is a generalization of the Cauchy interlacing theorem to the case where $PP^\top \neq I$. It also resembles the interlacing inequalities known for the normalized Laplacian, where the re-normalization is obtained by construction. Chen et al. (2004) showed in Theorem 2.7 of their paper that, after contracting $N - n$ edges, $\lambda_{k-N+n} \leq \tilde{\lambda}_k \leq \lambda_{k+N-n}$ for $k = 1, 2, \ldots, n$ and with $\lambda_\ell = 0$ when $\ell \leq 0$, resembling the upper bound above. The lower bound is akin to that given by Chung (1997, Lemma 1.15), again for the normalized Laplacian. Also notably, the inequalities are similar to those known for Kron reduction (Dörfler and Bullo, 2013, Lemma 3.6).

Theorem 3 is particularly pessimistic, as it has to hold for every possible P and L (subject to $\gamma_1$ and $\gamma_2$). Much stronger results will be obtained later on by restricting the attention to constructions that satisfy additional properties (see Theorem 13).

One can also say something about the action of $L_c$ on vectors.

Property 4: For every vector $x \in \mathrm{im}(\Pi)$, one has $x_c^\top L_c x_c = x^\top \Pi^\top L \Pi x = x^\top L x$ and $\tilde{x} = \Pi x = x$.

In other words, reduction maintains the action of L on every vector that lies in the image of $\Pi$. Naturally, after lifting, the eigenvectors $\tilde{u}_k$ of $L_c$ are included in this set.
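Theorem 3 is easy to probe numerically. The sketch below, assuming numpy and entirely illustrative, draws a random PSD matrix and a random full-row-rank P, and checks the interlacing bounds (0-indexed, so $\lambda_{k+N-n}$ becomes `lam[k + N - n]`):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 12, 5
S = rng.standard_normal((N, N))
L = S.T @ S                                 # a generic PSD matrix
P = rng.standard_normal((n, N))             # full row rank almost surely
P_plus = np.linalg.pinv(P)
Lc = P_plus.T @ L @ P_plus

lam = np.linalg.eigvalsh(L)                 # ascending eigenvalues of L
lam_c = np.linalg.eigvalsh(Lc)              # ascending eigenvalues of Lc
g = np.linalg.eigvalsh(np.linalg.inv(P @ P.T))
gamma1, gamma2 = g[0], g[-1]                # extreme eigenvalues of (P P^T)^{-1}

ok = [gamma1 * lam[k] <= lam_c[k] <= gamma2 * lam[k + N - n] for k in range(n)]
print(all(ok))                              # expected: True
```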
2.3. Coarsening as a Type of Graph Reduction

Coarsening is a type of graph reduction abiding by a set of constraints that render the graph transformation interpretable. More precisely, in coarsening one selects for each level $\ell$ a surjective (i.e., many-to-one) map $\phi_\ell : \mathcal{V}_{\ell-1} \to \mathcal{V}_\ell$ between the original vertex set $\mathcal{V}_{\ell-1}$ and the smaller vertex set $\mathcal{V}_\ell$. I refer to the set of vertices $\mathcal{V}^{(r)}_{\ell-1} \subseteq \mathcal{V}_{\ell-1}$ mapped onto the same vertex $v'_r$ of $\mathcal{V}_\ell$ as a contraction set:
$$\mathcal{V}^{(r)}_{\ell-1} = \{v \in \mathcal{V}_{\ell-1} : \phi_\ell(v) = v'_r\}.$$
For a graphical depiction of contraction sets, see Figure 1. I also constrain $\phi_\ell$ slightly by requiring that the subgraph of $G_{\ell-1}$ induced by each contraction set $\mathcal{V}^{(r)}_{\ell-1}$ is connected. It is easy to deduce that contraction sets induce a partitioning of $\mathcal{V}_{\ell-1}$ into $N_\ell$ subgraphs, each corresponding to a single vertex of $\mathcal{V}_\ell$.

Every reduced variable thus corresponds to a small set of adjacent vertices in the original graph, and coarsening amounts to a scaling operation. An appropriately constructed coarse graph aims to capture the global problem structure, whereas neglected details can be recovered in a local refinement phase.

Coarsening can be placed in the context of Scheme 1 by restricting each $P_\ell$ to lie in the family of coarsening matrices, defined next:

Definition 5 (Coarsening matrix): Matrix $P_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ is a coarsening matrix w.r.t. graph $G_{\ell-1}$ if and only if it satisfies the following two conditions:
a. It is a surjective mapping of the vertex set, meaning that if $P_\ell(r, i) \neq 0$ then $P_\ell(r', i) = 0$ for every $r' \neq r$.
b. It is locality preserving; equivalently, the subgraph of $G_{\ell-1}$ induced by the non-zero entries of $P_\ell(r, :)$ is connected for each r.

An interesting consequence of this definition is that, in contrast to general graph reduction, with coarsening matrices the expensive pseudoinverse computation can be substituted by simple transposition and re-scaling:

Proposition 6 (Easy inversion): The pseudoinverse of a coarsening matrix $P_\ell$ is given by $P_\ell^{+} = P_\ell^\top D_\ell^{-2}$, where $D_\ell$ is the diagonal matrix with $D_\ell(r, r) = \|P_\ell(r, :)\|_2$.

Proposition 6 carries two consequences. First, coarsening can be done in linear time. Each coarsening level (both in the forward and backward directions) entails multiplication by a sparse matrix. Furthermore, both $P_\ell$ and $P_\ell^{+}$ have only $N_{\ell-1}$ non-zero entries, meaning that O(N) and O(M) operations suffice to coarsen, respectively, a vector and a matrix L whose sparsity structure reflects the graph connectivity. In addition, the number of graph edges also decreases at each level. Denoting by $\mu_\ell$ the average number of edges of the subgraphs induced by the contraction sets $\mathcal{V}^{(r)}_{\ell-1}$ over every r, a quick calculation reveals that the coarsest graph has $m = M - \sum_{\ell=1}^{c} N_\ell\, \mu_\ell$ edges. If, for instance, at each level all vertices are perfectly contracted into pairs, then $\mu_\ell = 2$ and $N_\ell = N/2^\ell$, meaning that $m = M - 2N(1 - 2^{-c})$.

[Figure 1: Toy coarsening example. (a) Graph G; (b) coarse graph $G_c$. Grey discs denote contraction sets. The first three vertices of G, forming contraction set $\mathcal{V}^{(1)}_0$, are contracted onto vertex $v'_1$. All other vertices remain unaffected.]

2.4. Laplacian Consistent Coarsening

A further restriction that can be imposed is that coarsening is consistent w.r.t. the Laplacian form. Suppose that L is the combinatorial Laplacian of G, defined as
$$L(i, j) = \begin{cases} d_i & \text{if } i = j, \\ -w_{ij} & \text{if } e_{ij} \in \mathcal{E}, \\ 0 & \text{otherwise}, \end{cases}$$
where $w_{ij}$ is the weight associated with edge $e_{ij}$ and $d_i$ the weighted degree of $v_i$.
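Proposition 6 can be checked in a few lines. The sketch below, a numpy illustration of mine rather than the paper's released code, builds the coarsening matrix for the partition of Figure 1 (with the consistent weights $1/|\mathcal{V}^{(r)}|$ introduced just above) and compares $P_\ell^\top D_\ell^{-2}$ against the Moore-Penrose pseudoinverse:

```python
import numpy as np

def coarsening_matrix(contraction_sets, N):
    """Each row r is supported on contraction set V_r, with the Laplacian
    consistent weights 1/|V_r| of Section 2.4."""
    P = np.zeros((len(contraction_sets), N))
    for r, Vr in enumerate(contraction_sets):
        P[r, sorted(Vr)] = 1.0 / len(Vr)
    return P

P = coarsening_matrix([{0, 1, 2}, {3}, {4}], N=5)   # the partition of Figure 1
D = np.diag(np.linalg.norm(P, axis=1))              # D(r, r) = ||P(r, :)||_2
print(np.allclose(P.T @ np.linalg.inv(D @ D),       # P^+ = P^T D^{-2} ...
                  np.linalg.pinv(P)))               # ... matches pinv: True
```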
The following proposition can be proven:

Proposition 7 (Consistency): Let P be a coarsening matrix w.r.t. a graph with combinatorial Laplacian L. Matrix $L_c = P^{\mp} L P^{+}$ is a combinatorial Laplacian if and only if the non-zero entries of $P^{+}$ are equally valued.

It is a corollary of Propositions 6 and 7 that, in Laplacian consistent coarsening, for any $v'_r \in \mathcal{V}_\ell$ and $v_i \in \mathcal{V}_{\ell-1}$, matrices $P_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ and $P_\ell^{+} \in \mathbb{R}^{N_{\ell-1} \times N_\ell}$ are given by:
$$[P_\ell](r, i) = \begin{cases} \dfrac{1}{|\mathcal{V}^{(r)}_{\ell-1}|} & \text{if } v_i \in \mathcal{V}^{(r)}_{\ell-1} \\ 0 & \text{otherwise} \end{cases} \qquad\text{and}\qquad [P_\ell^{+}](i, r) = \begin{cases} 1 & \text{if } v_i \in \mathcal{V}^{(r)}_{\ell-1} \\ 0 & \text{otherwise}, \end{cases}$$
where the contraction sets $\mathcal{V}^{(1)}_{\ell-1}, \ldots, \mathcal{V}^{(N_\ell)}_{\ell-1}$ were defined in Section 2.3.

2.4.1. A Toy Example

The toy graph shown in Figure 1a, whose Laplacian is given by
$$L = \begin{pmatrix} 3 & -1 & -1 & -1 & 0 \\ -1 & 3 & -1 & 0 & -1 \\ -1 & -1 & 2 & 0 & 0 \\ -1 & 0 & 0 & 1 & 0 \\ 0 & -1 & 0 & 0 & 1 \end{pmatrix},$$
illustrates an example where the gray vertices $\mathcal{V}^{(1)}_0 = \{v_1, v_2, v_3\}$ of G are contracted onto vertex $v'_1$, as shown in Figure 1b. The main matrices I have defined are
$$P_1 = \begin{pmatrix} 1/3 & 1/3 & 1/3 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \qquad P_1^{+} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix},$$
$$\Pi = P_1^{+} P_1 = \begin{pmatrix} 1/3 & 1/3 & 1/3 & 0 & 0 \\ 1/3 & 1/3 & 1/3 & 0 & 0 \\ 1/3 & 1/3 & 1/3 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix},$$
and coarsening results in
$$L_c = P_1^{\mp} L P_1^{+} = \begin{pmatrix} 2 & -1 & -1 \\ -1 & 1 & 0 \\ -1 & 0 & 1 \end{pmatrix} \qquad\text{and}\qquad x_c = P_1 x = \begin{pmatrix} (x(1) + x(2) + x(3))/3 \\ x(4) \\ x(5) \end{pmatrix}.$$
Finally, when lifted, $x_c$ becomes
$$\tilde{x} = P_1^{+} x_c = \big( \tfrac{x(1)+x(2)+x(3)}{3},\; \tfrac{x(1)+x(2)+x(3)}{3},\; \tfrac{x(1)+x(2)+x(3)}{3},\; x(4),\; x(5) \big)^\top.$$
Since vertices $v_4$ and $v_5$ are not affected, the respective contraction sets $\mathcal{V}^{(2)}_0$ and $\mathcal{V}^{(3)}_0$ are singleton sets.

2.5. Properties of Laplacian Consistent Coarsening

Due to its particular construction, Laplacian consistent coarsening is accompanied by some interesting properties. I present three in the following:

Cuts. To begin with, weights of edges in $G_c$ correspond to weights of cuts in G.

Property 8: For any level $\ell$, the weight $W_\ell(r, q)$ between vertices $v'_r, v'_q \in \mathcal{V}_\ell$ is equal to
$$W_\ell(r, q) = \sum_{v_i \in \mathcal{S}^{(r)}_\ell,\; v_j \in \mathcal{S}^{(q)}_\ell} w_{ij},$$
where $\mathcal{S}^{(r)}_\ell = \{v_i \in \mathcal{V} : \phi_\ell \circ \cdots \circ \phi_1(v_i) = v'_r\} \subseteq \mathcal{V}$ contains all vertices of G contracted onto $v'_r \in \mathcal{V}_\ell$.

In the toy example, there exists a single edge of unit weight connecting vertices in $\mathcal{V}^{(1)}_0$ and $\mathcal{V}^{(2)}_0$, and as such the weight between $v'_1$ and $v'_2$ is equal to one.

Eigenvalue interlacing. For a single level of Laplacian consistent coarsening, matrix $PP^\top = P_1 P_1^\top$ is given by $\mathrm{diag}(1/|\mathcal{V}^{(1)}_0|, \ldots, 1/|\mathcal{V}^{(N_1)}_0|)$, implying that the multiplicative constants in Theorem 3 are:
$$\gamma_1 = \min_{v_i \in \mathcal{V}} |\mathcal{V}^{\phi_1(v_i)}_0| \geq 1 \qquad\text{and}\qquad \gamma_2 = \max_{v_i \in \mathcal{V}} |\mathcal{V}^{\phi_1(v_i)}_0|.$$
Above, $v'_r = \phi_1(v_i) \in \mathcal{V}_1$ is the vertex to which $v_i$ is mapped, and the set $\mathcal{V}^{\phi_1(v_i)}_0$ contains all vertices also contracted to $v'_r$. Thus, in the toy example, $\lambda_k \leq \tilde{\lambda}_k \leq 3\,\lambda_{k+2}$ for every $k \leq 3$. If multiple levels are utilized, these terms become dependent on the sequence of contractions. To obtain a general bound, let $\phi_{\ell:1}(v_i) = \phi_\ell \circ \cdots \circ \phi_1(v_i) \in \mathcal{V}_\ell$ be the vertex onto which $v_i \in \mathcal{V}$ is contracted in the $\ell$-th level.

Property 9: If $L_c$ is obtained from L by Laplacian consistent coarsening, then
$$\gamma_1 \geq \min_{v_i \in \mathcal{V}} \prod_{\ell=1}^{c} |\mathcal{V}^{\phi_{\ell:1}(v_i)}_{\ell-1}| \qquad\text{and}\qquad \gamma_2 \leq \max_{v_i \in \mathcal{V}} \prod_{\ell=1}^{c} |\mathcal{V}^{\phi_{\ell:1}(v_i)}_{\ell-1}|,$$
with the set $\mathcal{V}^{\phi_{\ell:1}(v_i)}_{\ell-1}$ containing all vertices of $\mathcal{V}_{\ell-1}$ that are contracted onto $\phi_{\ell:1}(v_i)$.

The proof follows from the diagonal form of $P_\ell \cdots P_1 P_1^\top \cdots P_\ell^\top$ and the special row structure of each $P_\ell$, but it is not included for brevity. The dependency of $\tilde{\lambda}_k$ on the size of contraction sets can be removed either by enforcing at each level that all contraction sets have identical size and dividing the graph weights by that size, or by re-normalizing each $P_\ell$ such that $P_\ell^\top = P_\ell^{+}$. The latter approach was used by Loukas and Vandergheynst (2018) but is not adopted here, as it results in $L_c$ losing its Laplacian form.
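The toy example and its interlacing constants $\gamma_1 = 1$, $\gamma_2 = 3$ can be reproduced directly; a minimal numpy sketch (illustrative, not the paper's code):

```python
import numpy as np

L = np.array([[ 3, -1, -1, -1,  0],
              [-1,  3, -1,  0, -1],
              [-1, -1,  2,  0,  0],
              [-1,  0,  0,  1,  0],
              [ 0, -1,  0,  0,  1]], dtype=float)
P1 = np.array([[1/3, 1/3, 1/3, 0, 0],
               [0,   0,   0,   1, 0],
               [0,   0,   0,   0, 1]])
P1_plus = np.linalg.pinv(P1)            # the 0/1 lifting matrix of Section 2.4.1
Lc = P1_plus.T @ L @ P1_plus
print(np.round(Lc, 6))                  # [[2,-1,-1], [-1,1,0], [-1,0,1]]

lam, lam_c = np.linalg.eigvalsh(L), np.linalg.eigvalsh(Lc)
checks = [lam[k] <= lam_c[k] + 1e-9 and lam_c[k] <= 3 * lam[k + 2] + 1e-9
          for k in range(3)]            # lambda_k <= coarse_k <= 3 lambda_{k+2}
print(all(checks))                      # expected: True
```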
Nullspace. Finally, as is desirable, the structure of the nullspace of L is preserved both by coarsening and lifting:

Property 10: If P is a (multi-level) Laplacian consistent coarsening matrix, then $P 1_N = 1_n$ and $P^{+} 1_n = 1_N$, where the subscript indicates the dimensionality of the constant vector.

Thus, we can casually ignore vectors parallel to the constant vector in our analysis.

3. Restricted Notions of Approximation

This section formalizes how a graph should be reduced such that fundamental structural properties (e.g., its spectrum and cuts) are preserved. Inspired by work in graph sparsification, I introduce a measure of approximation that is tailored to graph reduction. The new definition implies strong guarantees about the distance of the original and coarsened spectrum and gives conditions such that the cut structure of a graph is preserved by coarsening.

3.1. Restricted Spectral Approximation

One way to define how close a PSD matrix L is to its reduced counterpart is to establish an isometry guarantee w.r.t. the following induced semi-norms:
$$\|x\|_L = \sqrt{x^\top L x} \qquad\text{and}\qquad \|x_c\|_{L_c} = \sqrt{x_c^\top L_c x_c}.$$
Ideally, one would hope that there exists $\epsilon > 0$ such that
$$(1 - \epsilon)\, \|x\|_L \;\leq\; \|x_c\|_{L_c} \;\leq\; (1 + \epsilon)\, \|x\|_L \tag{3}$$
for all $x \in \mathbb{R}^N$. If the inequalities in (3) hold, matrices $L_c$ and $L$ are called $\epsilon$-similar. The objective of constructing sparse spectrally similar graphs is the main idea of spectral graph sparsifiers, a popular method for accelerating the solution of linear systems involving the Laplacian. In addition, spectral similarity carries several interesting consequences that are of great help in the construction of approximation algorithms: the eigenvalues and eigenvectors of two similar graphs are close and, moreover, all vertex partitions have similar cut size.

In contrast to graph sparsification, however, since here the dimension of the space changes, it is impossible to satisfy (3) for every $x \in \mathbb{R}^N$ unless one trivially sets $\epsilon = 1$ (this follows by a simple rank argument). To carry out a meaningful analysis, one needs to consider a subspace of dimension $k \leq n$ and aim to approximate the behavior of L solely within it. With this in mind, I define the following generalization of spectral similarity:

Definition 11 (Restricted spectral approximation): Let $\mathcal{R}$ be a k-dimensional subspace of $\mathbb{R}^N$. Matrices $L_c$ and $L$ are $(\mathcal{R}, \epsilon)$-similar if there exists an $\epsilon \geq 0$ such that
$$\|x - \tilde{x}\|_L \;\leq\; \epsilon\, \|x\|_L \qquad\text{for all } x \in \mathcal{R},$$
where $\tilde{x} = P^{+} P x$.

In addition to the restriction on $\mathcal{R}$, the above definition differs from (3) in how the error is measured. In fact, it asserts a property that is slightly stronger than an approximate isometry w.r.t. a semi-norm within $\mathcal{R}$. The strengthening of the notion of approximation deviates from the restricted spectral similarity property proposed by Loukas and Vandergheynst (2018) and is a key ingredient in obtaining multi-level bounds. Nevertheless, one may recover a restricted spectral similarity-type guarantee as a direct consequence:

Corollary 12: If $L_c$ and $L$ are $(\mathcal{R}, \epsilon)$-similar, then $(1-\epsilon)\|x\|_L \leq \|x_c\|_{L_c} \leq (1+\epsilon)\|x\|_L$ for all $x \in \mathcal{R}$.

Proof: Let S be defined such that $L = S^\top S$. By the triangle inequality:
$$\big|\, \|x\|_L - \|x_c\|_{L_c} \big| = \big|\, \|Sx\|_2 - \|S P^{+} P x\|_2 \big| \;\leq\; \|Sx - S P^{+} P x\|_2 = \|x - \tilde{x}\|_L \;\leq\; \epsilon\, \|x\|_L,$$
which is equivalent to the claimed relation.

Clearly, if $L_c$ and $L$ are $(\mathcal{R}, \epsilon)$-similar, then they are also $(\mathcal{R}', \epsilon')$-similar, where $\mathcal{R}'$ is any subspace of $\mathcal{R}$ and $\epsilon' \geq \epsilon$. As such, results about large subspaces and small $\epsilon$ are the most desirable.
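The smallest $\epsilon$ for which Definition 11 holds can be computed exactly, since both sides are quadratic forms over $\mathcal{R}$: it is the square root of the largest generalized eigenvalue of the pair of forms. A minimal numpy/scipy sketch of mine (it assumes the columns of V are linearly independent and avoid the nullspace of L, so that the right-hand form is positive definite):

```python
import numpy as np
from scipy.linalg import eigvalsh

def restricted_epsilon(L, P, V):
    """Smallest eps with ||x - P^+ P x||_L <= eps ||x||_L for all x in span(V)."""
    Pi = np.linalg.pinv(P) @ P
    E = V - Pi @ V                    # x - x_tilde for x = V c, in coordinates c
    A = E.T @ L @ E                   # quadratic form of ||x - x_tilde||_L^2
    B = V.T @ L @ V                   # quadratic form of ||x||_L^2 (must be PD)
    return float(np.sqrt(max(eigvalsh(A, B)[-1], 0.0)))
```

For $\mathcal{R} = \mathcal{U}_k$ one would pass the eigenvectors $u_2, \ldots, u_k$; by Property 10, the constant eigenvector is reproduced exactly by consistent coarsening and can be ignored.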
It will be shown in Sections 3.2 and 3.3 that the above definition implies restricted versions of the spectral and cut guarantees provided by spectral similarity. For instance, instead of attempting to approximate the entire spectrum, as done by spectral graph sparsifiers, here one can focus on a subset of the spectrum with particular significance.

3.2. Implications for the Graph Spectrum

One of the primary benefits of restricted spectral approximation is that it implies a relation between the spectra of matrices $L$ and $L_c$ that goes beyond interlacing (see Theorem 3). To this effect, consider the smallest k eigenvalues and corresponding eigenvectors and define the following matrices:
$$U_k = [u_1, u_2, \ldots, u_k] \in \mathbb{R}^{N \times k} \qquad\text{and}\qquad \Lambda_k = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k).$$
As I will show next, ensuring that $\epsilon$ in Proposition 17 is small when $\mathcal{R} = \mathcal{U}_k = \mathrm{span}(U_k)$ suffices to guarantee that the first k eigenvalues and eigenvectors of $L$ and $L_c$ are aligned. The first result concerns eigenvalues.

Theorem 13 (Eigenvalue approximation): If $L_c$ and $L$ are $(\mathcal{U}_k, \epsilon_k)$-similar, then
$$\gamma_1\, \lambda_k \;\leq\; \tilde{\lambda}_k \;\leq\; \gamma_2\, \frac{(1 + \epsilon_k)^2}{1 - \epsilon_k^2\, (\lambda_k / \lambda_2)}\, \lambda_k, \qquad\text{whenever } \epsilon_k^2 < \lambda_2 / \lambda_k.$$

Crucially, the bound depends on $\lambda_k$ instead of $\lambda_{k+N-n}$ and thus can be significantly tighter than the one given by Theorem 3. Noticing that $\epsilon_k \leq \epsilon_{k'}$ whenever $k < k'$, one also deduces that it is stronger for smaller eigenvalues. For k = 2 in particular, one has
$$\gamma_1\, \lambda_2 \;\leq\; \tilde{\lambda}_2 \;\leq\; \gamma_2\, \frac{(1 + \epsilon_2)^2}{1 - \epsilon_2^2}\, \lambda_2,$$
which can be made small by controlling $\epsilon_2$.

I also analyze the angle between principal eigenspaces of $L$ and $L_c$. I follow Li (1994) and split the eigendecompositions of $L = U \Lambda U^\top$ and $P^\top L_c P$ as
$$L = (U_k,\, U_{k^\perp}) \begin{pmatrix} \Lambda_k & \\ & \Lambda_{k^\perp} \end{pmatrix} \begin{pmatrix} U_k^\top \\ U_{k^\perp}^\top \end{pmatrix} \qquad\text{and}\qquad P^\top L_c P = (P^\top \tilde{U}_k,\, P^\top \tilde{U}_{k^\perp}) \begin{pmatrix} \tilde{\Lambda}_k & \\ & \tilde{\Lambda}_{k^\perp} \end{pmatrix} \begin{pmatrix} \tilde{U}_k^\top P \\ \tilde{U}_{k^\perp}^\top P \end{pmatrix},$$
where $\tilde{\Lambda}_k$ and $\tilde{U}_k$ are defined analogously to $\Lambda_k$ and $U_k$. Davis and Kahan (1970) defined the canonical angles between the spaces spanned by $U_k$ and $P^\top \tilde{U}_k$ as the singular values of the matrix
$$\Theta(U_k, P^\top \tilde{U}_k) = \arccos\big( U_k^\top P^\top \tilde{U}_k\, \tilde{U}_k^\top P\, U_k \big)^{1/2};$$
see also (Stewart, 1990). The smaller the sines of the canonical angles are, the closer the two subspaces lie. The following theorem reveals a connection between the Frobenius norm of the sines of the canonical angles and restricted spectral approximation.

Theorem 14 (Eigenspace approximation): If $L_c$ and $L$ are $(\mathcal{U}_k, \epsilon_k)$-similar, then
$$\big\| \sin \Theta\big( U_k,\, P^\top \tilde{U}_k \big) \big\|_F^2 \;\leq\; \frac{1}{\lambda_{k+1} - \lambda_k} \left( \gamma_1 - 1 + \lambda_k \sum_{i=2}^{k} \frac{\epsilon_i^2}{\lambda_i} \right).$$

Note that the theorem above utilizes all $\epsilon_i$ with $i \leq k$, corresponding to the restricted spectral approximation constants for $\mathcal{R} = \mathcal{U}_i$, respectively. However, all these can be trivially relaxed to $\epsilon_k$, since $\epsilon_i \leq \epsilon_k$ for all $i \leq k$.

3.3. Implications for Graph Partitioning

One of the most popular applications of coarsening is to accelerate graph partitioning (Hendrickson and Leland, 1995; Karypis and Kumar, 1998; Kushnir et al., 2006; Dhillon et al., 2007; Wang et al., 2014). In the following, I provide a rigorous justification for this choice by showing that if the (Laplacian consistent) coarsening is done well and $G_c$ contains a good normalized cut, then so will G. For the specific case of spectral clustering, I also provide an explicit bound on the coarse solution quality.

3.3.1. Existence Results

For consistent coarsening, the spectrum approximation results presented previously imply similarities between the cut-structures of $G_c$ and G. To formalize this intuition, the conductance of any subset S of $\mathcal{V}$ is defined as
$$\phi(S) = \frac{w(S, \bar{S})}{\min\{w(S),\, w(\bar{S})\}},$$
where $\bar{S} = \mathcal{V} \setminus S$ is the complement set, $w(S, \bar{S}) = \sum_{v_i \in S,\, v_j \in \bar{S}} w_{ij}$ is the weight of the cut, and $w(S) = \sum_{v_i \in S} \sum_{v_j \in \mathcal{V}} w_{ij}$ is the volume of S.
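Conductance is straightforward to evaluate from the weight matrix; a small illustrative numpy helper (names are mine):

```python
import numpy as np

def conductance(W, S):
    """phi(S) = w(S, S_bar) / min(w(S), w(S_bar)) for a weight matrix W."""
    inside = np.zeros(W.shape[0], dtype=bool)
    inside[list(S)] = True
    cut = W[inside][:, ~inside].sum()      # w(S, S_bar): weight leaving S
    vol_S = W[inside].sum()                # w(S): total degree inside S
    vol_Sbar = W[~inside].sum()            # w(S_bar)
    return cut / min(vol_S, vol_Sbar)
```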
The k-conductance of a graph measures how easy it is to cut it into k disjoint subsets $S_1, \ldots, S_k \subseteq \mathcal{V}$ of balanced volume:
$$\phi_k(G) = \min_{S_1, \ldots, S_k}\; \max_i\; \phi(S_i),$$
where the minimum is taken over all possible k-partitionings $(S_1, \ldots, S_k)$. The smaller $\phi_k(G)$ is, the easier it is to partition our graph.

As it turns out, restricted spectral approximation can be used to relate the conductance of the original and coarse graphs. To state the result, it will be useful to denote by D the diagonal degree matrix, and further to suppose that $V_k$ contains the first k eigenvectors of the normalized Laplacian $L_n = D^{-1/2} L D^{-1/2}$, whose eigenvalues are $0 = \mu_1 \leq \cdots \leq \mu_k$.

Theorem 15: For any graph G and integer $2 \leq k \leq n/2$, if $L_c$ and $L$ are $(\mathcal{R}_{2k}, \epsilon_{2k})$-similar combinatorial Laplacian matrices, then
$$\phi_k(G) \;\leq\; \phi_k(G_c) = O\!\left( \sqrt{ \gamma_2\, \frac{(1 + \epsilon_{2k})^2\, \xi_k(G)}{1 - \epsilon_{2k}^2\, (\mu_{2k}/\mu_2)}\; \phi_k(G) } \right)$$
with $\mathcal{R}_{2k} = \mathrm{span}(D^{-1/2} V_{2k})$ and $\xi_k(G) \lesssim \log k$, whenever $\epsilon_{2k}^2 < \mu_2 / \mu_{2k}$. If G is planar, then $\xi_k(G) \lesssim 1$, and if G excludes $K_h$ as a minor (but is not planar), then $\xi_k(G) \lesssim h^4$. For k = 2, supposing that $L_c$ and $L$ are $(\mathcal{R}_2, \epsilon_2)$-similar, we additionally have
$$\phi_2(G) \;\leq\; \phi_2(G_c) \;\leq\; 2\, \sqrt{ \gamma_2\, \frac{(1 + \epsilon_2)^2}{1 - \epsilon_2^2}\; \phi_2(G) }.$$

The above theorem gives a non-constructive result: it does not reveal how to find the optimal partitioning, but provides conditions such that the latter is of similar quality in the two graphs.

3.3.2. Spectral Clustering

It is also possible to derive approximation results about the solution quality of unsupervised learning algorithms that utilize the first k eigenvectors in order to partition G. I focus here on spectral clustering. To perform the analysis, let $U_k$ and $P^\top \tilde{U}_k$ be the spectral embeddings of the vertices w.r.t. $L$ and $L_c$, respectively, and define the optimal partitionings as
$$\mathcal{P}^* = \operatorname*{arg\,min}_{\mathcal{P} = \{S_1, \ldots, S_k\}} F_k(U_k, \mathcal{P}) \qquad\text{and}\qquad \widetilde{\mathcal{P}}^* = \operatorname*{arg\,min}_{\mathcal{P} = \{S_1, \ldots, S_k\}} F_k(P^\top \tilde{U}_k, \mathcal{P}), \tag{4}$$
where, for any embedding X, the k-means cost induced by partitioning $\mathcal{V}$ into clusters $S_1, \ldots, S_k$ is defined as
$$F_k(X, \mathcal{P}) = \sum_{z=1}^{k} \sum_{v_i, v_j \in S_z} \frac{\| X(i,:) - X(j,:) \|_2^2}{2\, |S_z|}.$$
One then measures the quality of $\widetilde{\mathcal{P}}^*$ by examining how far the correct minimizer $F_k(U_k, \mathcal{P}^*)$ is from $F_k(U_k, \widetilde{\mathcal{P}}^*)$. Boutsidis et al. (2015) noted that if the two quantities are close then, despite the clusters themselves possibly being different, they both feature the same quality w.r.t. the k-means objective.

An end-to-end control of the k-means error is obtained by combining the inequality derived by Loukas and Vandergheynst (2018), based on the works of (Boutsidis et al., 2015; Yu et al., 2014; Martin et al., 2018),
$$\big| F_k(U_k, \mathcal{P}^*)^{1/2} - F_k(U_k, \widetilde{\mathcal{P}}^*)^{1/2} \big| \;\leq\; 2\sqrt{2}\; \big\| \sin \Theta\big( U_k,\, P^\top \tilde{U}_k \big) \big\|_F,$$
with Theorem 14:

Corollary 16: If $L_c$ and $L$ are $(\mathcal{U}_k, \epsilon_k)$-similar, then
$$\Big( F_k(U_k, \mathcal{P}^*)^{1/2} - F_k(U_k, \widetilde{\mathcal{P}}^*)^{1/2} \Big)^2 \;\leq\; \frac{8}{\lambda_{k+1} - \lambda_k} \left( \gamma_1 - 1 + \lambda_k \sum_{i=2}^{k} \frac{\epsilon_i^2}{\lambda_i} \right).$$

Contrary to previous analysis (Loukas and Vandergheynst, 2018), the approximation result here is applicable to any number of levels, and it can be adapted to hold for the eigenvectors of the normalized Laplacian.^4 Nevertheless, it should be stressed that at this point it is an open question whether the above analysis yields benefits over other approaches tailored especially to the acceleration of spectral clustering. A plethora of such specialized algorithms are known (Tremblay and Loukas, 2019); arguing about the pros and cons of each extends beyond the scope of this work.

4. For the normalized Laplacian, one should perform (combinatorial) Laplacian consistent coarsening on a modified eigenspace, as in the proof of Theorem 15.
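For reference, the k-means cost $F_k$ in its pairwise form can be evaluated as follows; an illustrative numpy sketch, not the authors' code:

```python
import numpy as np

def kmeans_cost(X, partition):
    """F_k(X, P) = sum_z sum_{i,j in S_z} ||X(i,:) - X(j,:)||^2 / (2 |S_z|)."""
    cost = 0.0
    for S in partition:
        Y = X[list(S)]                         # embeddings of cluster S_z
        pair = Y[:, None, :] - Y[None, :, :]   # all pairwise differences
        cost += (pair ** 2).sum() / (2 * len(S))
    return cost
```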
4. Graph Coarsening by Local Variation

This section proposes algorithms for Laplacian consistent coarsening. I suppose that L is a combinatorial graph Laplacian and, given a subspace $\mathcal{R}$ and a target graph size n, aim to find an $(\mathcal{R}, \epsilon)$-similar Laplacian $L_c$ of size $n \times n$ with $\epsilon$ smaller than some threshold $\epsilon'$.

Local variation algorithms differ only in the type of contraction sets that they consider. For instance, the edge-based local variation algorithm only contracts edges, whereas in the neighborhood-based variant each contraction set is a subset of the neighborhood of a vertex. Otherwise, all local variation algorithms follow the same general methodology and aim to minimize an upper bound of $\epsilon$. To this end, two bounds are exploited: First, $L_c$ is shown to be $(\mathcal{R}, \epsilon)$-similar to L with $\epsilon \leq \prod_\ell (1 + \sigma_\ell) - 1$, where the variation cost $\sigma_\ell$ depends only on previous levels (see Section 4.1). The main difficulty with minimizing $\sigma_\ell$ is that it depends on interactions between contraction sets. For this reason, the second bound shows that these interactions can be decoupled by considering each local variation cost, i.e., the cost of contracting solely the vertices in $\mathcal{V}^{(r)}_{\ell-1}$, independently on a slightly modified subgraph (see Section 4.2). Having achieved this, Section 4.3 considers ways of efficiently identifying disjoint contraction sets with small local variation cost.

4.1. Decoupling Levels and the Variation Cost

Guaranteeing restricted spectral approximation w.r.t. subspace $\mathcal{R}$ boils down to minimizing at each level $\ell$ the variation cost
$$\sigma_\ell = \big\| \Pi_\ell^\perp A_{\ell-1} \big\|_{L_{\ell-1}} = \big\| S_{\ell-1}\, \Pi_\ell^\perp A_{\ell-1} \big\|_2,$$
where $L_{\ell-1} = S_{\ell-1}^\top S_{\ell-1}$ and $\Pi_\ell^\perp = I - P_\ell^{+} P_\ell$ is a projection matrix. Matrix $A_{\ell-1}$ captures two types of information:

1. Foremost, it encodes the behavior of the target Laplacian matrix L w.r.t. $\mathcal{R}$. This is clearly seen in the first level, for which one has that $A_0 = V V^\top L^{+1/2}$, with $V \in \mathbb{R}^{N \times k}$ being an orthonormal basis of $\mathcal{R}$.

2. When $\ell > 1$, one needs to consider $A_0$ in view of the reduction done in previous levels. The necessary modification turns out to be $A_{\ell-1} = B_{\ell-1} (B_{\ell-1}^\top L_{\ell-1} B_{\ell-1})^{+1/2}$, with $B_{\ell-1} = P_{\ell-1} B_{\ell-2} \in \mathbb{R}^{N_{\ell-1} \times N}$ expressed in a recursive manner and $B_0 = A_0$.

The following result makes explicit the connection between $\epsilon$ and $\sigma_\ell$.

Proposition 17: Matrices $L_c$ and $L$ are $(\mathcal{R}, \epsilon)$-similar with $\epsilon \leq \prod_{\ell=1}^{c} (1 + \sigma_\ell) - 1$.

Crucially, the previous result makes it possible to design a multi-level coarsening greedily, by starting from the first level and optimizing following levels one at a time. Thus, every local variation algorithm operates in the following manner:

Algorithm 1: Multi-level coarsening
1: input: Combinatorial Laplacian L, threshold $\epsilon'$, and target size n.
2: Set $\ell \leftarrow 0$, $L_\ell \leftarrow L$, and $\epsilon_\ell \leftarrow 0$.
3: while $N_\ell > n$ and $\epsilon_\ell < \epsilon'$ do
4:   $\ell \leftarrow \ell + 1$.
5:   Coarsen $L_{\ell-1}$ using Algorithm 2 with threshold $\sigma' = \frac{1 + \epsilon'}{1 + \epsilon_{\ell-1}} - 1$ and target size n. Let $L_\ell$ be the resulting Laplacian of size $N_\ell$ with variation cost $\sigma_\ell$.
6:   $\epsilon_\ell \leftarrow (1 + \epsilon_{\ell-1})(1 + \sigma_\ell) - 1$.
7: return $L_\ell$

Algorithm 1 returns a Laplacian matrix $L_c$ that is $(\mathcal{R}, \epsilon)$-similar to L with $\epsilon \leq \epsilon_c \leq \epsilon'$, where c is the last level $\ell$. On the other hand, setting $\epsilon'$ to a large value ensures that the same algorithm always attains the target reduction, at the expense of loose restricted approximation guarantees.
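In code, the multi-level driver of Algorithm 1 is only a few lines. The sketch below is a schematic rendering under the stated error-accumulation rule; `single_level` stands in for Algorithm 2 and is assumed to return the coarsened Laplacian together with its variation cost $\sigma_\ell$:

```python
def multilevel_coarsening(L, eps_max, n_target, single_level):
    """Algorithm 1: coarsen level by level while eps_l < eps_max,
    accumulating eps_l = (1 + eps_{l-1}) * (1 + sigma_l) - 1."""
    eps = 0.0
    while L.shape[0] > n_target and eps < eps_max:
        sigma_budget = (1.0 + eps_max) / (1.0 + eps) - 1.0
        L_new, sigma = single_level(L, sigma_budget, n_target)
        if L_new.shape[0] == L.shape[0]:   # no admissible contraction remains
            break
        L, eps = L_new, (1.0 + eps) * (1.0 + sigma) - 1.0
    return L, eps
```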
Remark. The variation cost simplifies when $\mathcal{R}$ is an eigenspace of L. I demonstrate this for the choice of $\mathcal{U}_k$, though an identical argument can easily be derived for any eigenspace. Denote by $\Lambda$ the diagonal $N \times N$ eigenvalue matrix, placed from top-left to bottom-right in non-decreasing order, and by U the respective full eigenvector matrix. Furthermore, let $\Lambda_k$ be the $k \times k$ sub-matrix of $\Lambda$ with the smallest k eigenvalues in its diagonal. By the unitary invariance of the spectral norm, it follows that
$$\sigma_1 = \big\| \Pi_1^\perp U_k U_k^\top L^{+1/2} \big\|_{L_0} = \big\| \Pi_1^\perp U_k U_k^\top L^{+1/2} U \big\|_{L_0} = \big\| \Pi_1^\perp U_k U_k^\top U \Lambda^{+1/2} \big\|_{L_0}.$$
Simplifying and eliminating zero columns, one may redefine $B_0 = U_k \Lambda_k^{+1/2} \in \mathbb{R}^{N \times k}$, such that once more $\sigma_1 = \|\Pi_1^\perp B_0\|_{L_0}$. This is computationally attractive because now at each level one only needs to take the pseudoinverse square root of a $k \times k$ matrix $B_{\ell-1}^\top L_{\ell-1} B_{\ell-1}$, with $k \ll N$.

4.2. Decoupling Contraction Sets and Local Variation

Suppose that $\Pi_C^\perp$ is the (complement) projection matrix obtained by contracting solely the vertices in set C, while leaving all other vertices in $\mathcal{V}_{\ell-1}$ untouched:
$$\big[ \Pi_C^\perp x \big](i) = \begin{cases} x(i) - \dfrac{\sum_{v_j \in C} x(j)}{|C|} & \text{if } v_i \in C, \\ 0 & \text{otherwise}. \end{cases}$$
(Here, for convenience, the level index is suppressed.) Furthermore, let $L_C$ be the $N_{\ell-1} \times N_{\ell-1}$ combinatorial Laplacian whose weight matrix is
$$[W_C](i, j) = \begin{cases} W_{\ell-1}(i, j) & \text{if } v_i, v_j \in C, \\ 2\, W_{\ell-1}(i, j) & \text{if } v_i \in C \text{ and } v_j \notin C, \\ 0 & \text{otherwise}. \end{cases}$$
That is, $W_C$ is zero everywhere other than at the edges touching at least one vertex in C. The following proposition shows us how to decouple the contribution of each contraction set to the variation cost.

Proposition 18: The variation cost is bounded by
$$\sigma_\ell^2 \;\leq\; \sum_{C \in \mathcal{P}_\ell} \big\| \Pi_C^\perp A_{\ell-1} \big\|_{L_C}^2,$$
where $\mathcal{P}_\ell = \{\mathcal{V}^{(1)}_{\ell-1}, \ldots, \mathcal{V}^{(N_\ell)}_{\ell-1}\}$ is the family of contraction sets of level $\ell$.

The above argument therefore entails bounding the difficult-to-optimize variation cost as a function of locally computable and independent costs $\|\Pi_C^\perp A_{\ell-1}\|_{L_C}^2$. The obtained expression is a relaxation, as it assumes that the interaction between contraction sets will be the worst possible. It might be interesting to notice that the quality of the relaxation depends on the weight of the cut between contraction sets: in the limit, the inequality converges to an equality as the weight of the cut shrinks. Also of note, the bound becomes tighter the larger the per-level dimensionality reduction requested (the smaller $N_\ell = |\mathcal{P}_\ell|$ is, the fewer inequalities are involved in the derivation).

4.3. Local Variation Coarsening Algorithms

Starting from a candidate family $\mathcal{F}_\ell = \{C_1, C_2, C_3, \ldots\}$, that is, an appropriately sized family of candidate contraction sets, the strategy will be to search for a small contraction family $\mathcal{P}_\ell = \{\mathcal{V}^{(1)}_{\ell-1}, \ldots, \mathcal{V}^{(N_\ell)}_{\ell-1}\}$ with minimal variation cost ($\mathcal{P}_\ell$ is valid if it partitions $\mathcal{V}_{\ell-1}$ into $N_\ell$ contraction sets). Every coarse vertex $v'_r \in \mathcal{V}_\ell$ is then formed by contracting the vertices in $\mathcal{V}^{(r)}_{\ell-1}$.

As a thought experiment, suppose that set $C \in \mathcal{F}_\ell$ is chosen to be part of $\mathcal{P}_\ell$. From the decoupling argument, its contribution to $\sigma_\ell^2$ will be at most $\|\Pi_C^\perp A_{\ell-1}\|_{L_C}^2$, independently of how other candidate sets are chosen. Moreover, the selection will reduce $N_{\ell-1}$ by $|C| - 1$ vertices. Thus, one needs to look for the non-singleton candidate sets C whose cost
$$\mathrm{cost}_\ell(C) = \frac{\big\| \Pi_C^\perp A_{\ell-1} \big\|_{L_C}^2}{|C| - 1} \tag{6}$$
is as small as possible. I refer to (6) as the local variation cost, because it captures the maximal variation of all signals from an appropriate subspace (implied by $A_{\ell-1}$) with support on C.
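The local variation cost (6) is computable purely from the subgraph around C. A sketch of mine, assuming dense numpy arrays for clarity (W is the level's weight matrix, A plays the role of $A_{\ell-1}$, and C must be non-singleton); since $\|M\|_{L_C}^2 = \lambda_{\max}(M^\top L_C M)$, the cost reduces to one small eigenvalue problem:

```python
import numpy as np

def local_variation_cost(W, A, C):
    """cost(C) = ||Pi_C_perp A||_{L_C}^2 / (|C| - 1), cf. equation (6)."""
    C = sorted(C)
    out = np.setdiff1d(np.arange(W.shape[0]), C)
    W_C = np.zeros_like(W)
    W_C[np.ix_(C, C)] = W[np.ix_(C, C)]           # edges internal to C
    W_C[np.ix_(C, out)] = 2 * W[np.ix_(C, out)]   # edges leaving C, doubled
    W_C[np.ix_(out, C)] = 2 * W[np.ix_(out, C)]
    L_C = np.diag(W_C.sum(axis=1)) - W_C          # combinatorial Laplacian of W_C
    B = np.zeros_like(A)                          # B = Pi_C_perp A: zero outside C,
    B[C] = A[C] - A[C].mean(axis=0)               # rows of C centered
    return np.linalg.eigvalsh(B.T @ L_C @ B).max() / (len(C) - 1)
```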
On the other hand, since any permissible contraction family $\mathcal{P}_\ell$ should be a partitioning of $\mathcal{V}_{\ell-1}$, choosing C precludes us from selecting any C' with which it intersects. Based on this intuition, Algorithm 2 sequentially examines candidate sets from $\mathcal{F}_\ell$, starting from those with minimal cost. To decide whether a candidate set C will be added to $\mathcal{P}_\ell$, the algorithm asserts that no vertex in C is marked, essentially enforcing that all contraction sets are disjoint. Accordingly, as soon as C is added to $\mathcal{P}_\ell$, all vertices in C become marked. Candidate sets with marked vertices are pruned ($C' \leftarrow C \setminus \text{marked}$) and their cost is updated. The algorithm terminates if either the target reduction is achieved, the error threshold is exceeded, or no candidate sets remain. Even though this remains implicit in the discussion, if at termination $\mathcal{P}_\ell$ does not cover every vertex of $\mathcal{V}_{\ell-1}$, then I complement it with singleton sets, featuring one vertex each (and zero cost).

Algorithm 2: Single-level coarsening by local variation
1: input: Combinatorial Laplacian $L_{\ell-1}$, threshold $\sigma'$, and target size n.
2: Form the family of candidate sets $\mathcal{F}_\ell = \{C_1, C_2, C_3, \ldots\}$ (algorithm-specific step).
3: $N_\ell \leftarrow N_{\ell-1}$, marked $\leftarrow \emptyset$, $\sigma_\ell^2 \leftarrow 0$.
4: Sort $\mathcal{F}_\ell$ in terms of increasing $\mathrm{cost}_\ell(C)$.
5: while $|\mathcal{F}_\ell| > 0$ and $N_\ell > n$ and $\sigma_\ell \leq \sigma'$ do
6:   Pop the candidate set C of minimal cost s from $\mathcal{F}_\ell$.
7:   if no vertex of C is marked and $\sigma' \geq \sqrt{\sigma_\ell^2 + (|C| - 1)\, s}$ then
8:     marked $\leftarrow$ marked $\cup\, C$, $\mathcal{P}_\ell \leftarrow \mathcal{P}_\ell \cup \{C\}$, $N_\ell \leftarrow N_\ell - |C| + 1$, $\sigma_\ell^2 \leftarrow \sigma_\ell^2 + (|C| - 1)\, s$
9:   else
10:    $C' \leftarrow C \setminus$ marked
11:    if $|C'| > 1$ then
12:      Compute $\mathrm{cost}_\ell(C')$ and insert C' into $\mathcal{F}_\ell$ while keeping the latter sorted.
13: Form the $N_\ell \times N_{\ell-1}$ coarsening matrix $P_\ell$ based on $\mathcal{P}_\ell$.
14: return $L_\ell \leftarrow P_\ell^{\mp} L_{\ell-1} P_\ell^{+}$ and $\sigma_\ell$

Undeniably, Algorithm 2 is only one of the possible ways to select a partitioning of small variation cost. However, this algorithm stands out from the other algorithms I experimented with, as it is efficient when the subspace of interest is an eigenspace (e.g., $V = U_k$), k is small, and the families $\mathcal{F}_\ell$ have been selected appropriately. Denote by $\Phi = \max_\ell \sum_{C \in \mathcal{F}_\ell} |C|$ the maximum number of vertices in all candidate sets and by $\delta = \max_{\ell,\, C \in \mathcal{F}_\ell} |C|$ the cardinality of the maximum candidate set; I refer to these measures as family weight and width, respectively. Choosing $\mathcal{R} = \mathcal{U}_k$, the computational complexity of Algorithm 2 is
$$O\Big( c k M + k^2 N + c k^3 + c\, \Phi \big( \min\{k^2 \delta + k \delta^2,\; k \delta^2 + \delta^3\} + \log \max_\ell |\mathcal{F}_\ell| \big) \Big),$$
which, up to polylogarithmic factors, is linear in the number of edges, vertices, and $\Phi$ (see Appendix B for details).

If computational complexity is of no concern, one may consider the following two more sophisticated alternatives:

The optimal algorithm. Given a candidate family, the algorithm that optimally minimizes the sum of local variation costs constructs a graph with one vertex for each subset of a candidate set, and adds an edge between every two vertices whose respective sets have a non-empty intersection. It then selects $\mathcal{P}_\ell$ as the maximum independent set of minimal weight (the weight of each vertex is the local variation cost w.r.t. the corresponding set). Unfortunately, even if the size of this graph is polynomial in N, this problem cannot be solved efficiently, since the minimum-weight independent set problem is NP-hard. Nevertheless, for the specific case where candidate sets correspond to edges, the problem simplifies to a minimum-weight matching problem, which can be computed in $O(N_{\ell-1}^3)$ time exactly, whereas a $(2 + \delta)$-approximation can be found much faster (Paz and Schwartzman, 2017).

The quadratic variant. A second possibility is to proceed as with Algorithm 2, but to re-prune every $C' \in \mathcal{F}_\ell$ after a set C is added to $\mathcal{P}_\ell$. The numerical experiments indicated that this additional step may improve the coarsening quality slightly, but it is not recommended for large graphs, as it introduces a quadratic dependency of the complexity on N.
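A compact rendering of Algorithm 2's greedy loop using a binary heap, building on the `local_variation_cost` sketch above. It is a simplification (sets whose admission would exceed the budget are simply discarded), not a faithful replica of the released implementation:

```python
import heapq
import numpy as np

def single_level_local_variation(W, A, family, sigma_max, n_target):
    """Greedily contract unmarked candidate sets of minimal local variation."""
    N = W.shape[0]
    heap = [(local_variation_cost(W, A, C), tuple(sorted(C)))
            for C in family if len(C) > 1]
    heapq.heapify(heap)
    marked, partition, n_l, sigma2 = set(), [], N, 0.0
    while heap and n_l > n_target:
        s, C = heapq.heappop(heap)
        if marked.intersection(C):                 # prune marked vertices
            C = tuple(v for v in C if v not in marked)
            if len(C) > 1:                         # re-insert with updated cost
                heapq.heappush(heap, (local_variation_cost(W, A, C), C))
            continue
        if sigma2 + (len(C) - 1) * s > sigma_max ** 2:
            continue                               # would exceed the budget
        partition.append(C)
        marked.update(C)
        n_l -= len(C) - 1
        sigma2 += (len(C) - 1) * s
    partition += [(v,) for v in range(N) if v not in marked]  # singletons
    return partition, float(np.sqrt(sigma2))
```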
4.3.1. Candidate Contraction Families

To keep coarsening efficient, I focus on families of linear weight and almost constant width. Two possibilities are considered:

Edge-based. Here $\mathcal{F}_\ell$ contains one candidate set for each edge of $G_{\ell-1}$. This is a natural choice for coarsening; indeed, most coarsening algorithms in the literature use some form of edge contraction. It is straightforward to see that in this case $\Phi = 2M$ and $\delta = 2$, meaning that the expression of the computational complexity simplifies to $O(ckM + ck^3 + k^2 N)$. The drawback of contracting edges is that at each level the graph size can only be reduced by at most a factor of 2, meaning that a large number of levels is necessary to achieve significant reduction.^5

Neighborhood-based. A more attractive choice is to construct one candidate set for the neighborhood of each vertex, including the vertex itself. Denoting by $\Delta$ the largest combinatorial degree, and since $\Phi = 2M$, the complexity here is bounded by $O\big( c M (k + \min\{k^2 \Delta + k \Delta^2,\; k \Delta^2 + \Delta^3\}) + c k^3 + k^2 N \big)$. The experiments show that the neighborhood-based variant generally achieves better reduction while being marginally slower than the edge-based variant.

As a final remark, when G is dense, the dependency on M can be dropped by sparsifying the graph before using Algorithm 1.

5. In practice, depending on the graph in question, the per-level reduction ratio $r_\ell$ is usually between 0.35 and 0.45.

5. Numerical Results

The evaluation was performed on four representative graphs, each exhibiting different structural characteristics:

Yeast. Protein-to-protein interaction network in budding yeast, analyzed by Jeong et al. (2001). The network has N = 1458 vertices, M = 1948 edges, diameter of 19, and degrees between 1 and 56.

Airfoil. Finite-element graph obtained by airflow simulation (Preis and Diekmann, 1997), consisting of N = 4000 vertices, M = 11490 edges, diameter of 65, and degrees between 1 and 9.

Minnesota. Road network with N = 2642 vertices, M = 3304 edges, diameter of 99, and degrees between 1 and 5 (Gleich, 2008).

Bunny. Point cloud consisting of N = 2503 vertices, M = 65490 edges, diameter of 15, and degrees between 13 and 97 (Turk and Levoy, 1994). The point cloud has been sub-sampled from its original size.

I compare to the following methods for multi-level graph reduction:

Heavy edge matching. At each level of the scheme, the contraction family is obtained by computing a maximum-weight matching, with the weight of each contraction set $(v_i, v_j)$ calculated as $w_{ij} / \max\{d_i, d_j\}$. In this manner, heavier edges connecting vertices that are well separated from the rest of the graph are contracted first. Heavy edge matching was first introduced in the algebraic multigrid literature and, perhaps due to its simplicity, its variants have been repeatedly used for partitioning (Karypis and Kumar, 1998; Dhillon et al., 2007) and drawing (Walshaw, 2000; Hu, 2005) graphs, as well as more recently in graph convolutional neural networks (Defferrard et al., 2016).

Algebraic distance. This method differs from heavy edge matching in that the weight of each candidate set $(v_i, v_j) \in \mathcal{E}$ is calculated as $\big( \sum_{q=1}^{Q} (x_q(i) - x_q(j))^2 \big)^{1/2}$, where $x_q$ is an N-dimensional test vector computed by successive sweeps of Jacobi relaxation. The complete method is described by Ron et al. (2011); see also (Chen and Safro, 2011). As recommended by the authors, I performed 20 relaxation sweeps.
Further, I set the number of test vectors Q equal to the dimension k of the space I aimed to approximate (a simple rank argument shows that $Q \geq k$ is necessary for the test vectors to span the space).

Affinity. This is a vertex proximity heuristic in the spirit of the algebraic distance that was proposed by Livne and Brandt (2012) in the context of their work on the lean algebraic multigrid. As per the authors' suggestion, the Q = k test vectors are here computed by a single sweep of a Gauss-Seidel iteration.

Kron reduction. At each level, the graph size is reduced by selecting a set of vertices of size N/2 (corresponding to the positive entries of the last eigenvector of L) and applying Kron reduction. The method, which was proposed by Shuman et al. (2016), is not strictly a coarsening method, as it rewires the vertices of the reduced graph, resulting in significantly denser graphs.^6 Unfortunately, the rewiring step entails finding the Schur complement of a large Laplacian submatrix and thus generally exhibits $O(N^3)$ complexity, rendering it prohibitive for graphs of more than a few thousand vertices. Despite these drawbacks, the method is quite popular because of its elegant theoretical guarantees (Dörfler and Bullo, 2013).

6. As suggested by the authors, the sparsity of reduced graphs can be controlled by spectral sparsification. The sparsification step was not included in the numerical experiments, since it often resulted in increased errors.

Depending on how the edge matching is constructed, different variants of edge contraction methods can be implemented. At the two extremes of the complexity spectrum, one either finds the maximum matching of minimum weight at $O(N^3)$ complexity (Galil, 1986) or greedily constructs a matching by visiting vertices in random order, inducing O(M) overhead (Dhillon et al., 2007). For consistency, I implemented all edge-based methods by combining Algorithm 2 with an edge-based family and substituting the local variation cost with the (negative) method-specific edge weight. This generally yields matchings of better quality (heavier weight) than visiting vertices in a predefined order, at the expense of the marginally larger O(M log M) complexity necessary for sorting the edge weights. The choice is also motivated by the observation that the computational bottleneck of (sophisticated) edge contraction methods lies in the edge weight computation.

For all experiments, I set $\epsilon' = \infty$, aiming for a fixed reduction rather than a restricted spectral approximation guarantee. The code reproducing the experiments can be accessed online.^7

7. github.com/loukasa/graph-coarsening/tree/v1.1 (DOI 175851068)

5.1. Restricted Spectral Approximation

The first experiment tests how well $L_c$ approximates the action of L w.r.t. the subspace $\mathcal{U}_k$ of smallest variation. In other words, for each method, I plot the smallest $\epsilon$ such that the following equation holds:
$$\|x - \tilde{x}\|_L \;\leq\; \epsilon\, \|x\|_L \qquad\text{for all } x \in \mathcal{U}_k. \tag{7}$$
The results are summarized in Figure 2 for two representative subspaces of size k = 10 and k = 40. With the exception of Kron reduction, which was repeated 10 times, all methods are deterministic and thus were run only once. Overall, it can be seen that local variation methods outperform other coarsening methods in many problem instances. The gap is particularly prominent for large reductions, where multiple levels are employed. Neighborhood-based contraction yields the best results overall, mainly because it achieves the same reduction in fewer levels.
Interestingly, local variation (and coarsening) methods in many cases also outperform Kron reduction, even though the latter is more demanding computationally. I elaborate further on four points stemming from the results:

In most instances, a reduction of up to 70% is feasible while maintaining a decent approximation. The attained approximation is a function of the graph in question and k. Nevertheless, in almost all experiments, the best coarsening method could reduce the graph size by at least 70% while also ensuring that $\epsilon < 1$ (horizontal black line). This is an encouraging result, illustrating that significant dimensionality reduction is often possible without sacrificing the solution quality too much.

[Figure 2: Quality of approximation comparison (smallest $\epsilon$ versus reduction ratio r, from 10% to 90%) for four representative graphs (rows) and two subspace sizes (columns). Panels: (a) yeast (k = 10), (b) yeast (k = 40), (c) airfoil (k = 10), (d) airfoil (k = 40), (e) bunny (k = 10), (f) bunny (k = 40), (g) minnesota (k = 10), (h) minnesota (k = 40). Methods: kron, affinity, algebraic dist., local var. (neigh), local var. (edges), heavy edge.]

Following intuition, it is generally harder to approximate subspaces of larger dimension k, but not excessively so. Increasing k from 10 to 40 in most cases increases $\epsilon$ only slightly. The only case where the approximation becomes profoundly better with small subspaces is with small reduction ratios r. For instance, coarsening the yeast graph results in an impressive approximation for all r < 30% when k = 10, whereas $\epsilon$ increases almost by an order of magnitude when k becomes 40.

Kron reduction is an effective way to halve the graph size but can result in poor approximation otherwise. If one is willing to sacrifice efficiency, Kron reduction effectively reduces the graph size by a factor of two (with the exception of the yeast graph). What might be startling is that the method behaves poorly for different reduction ratios. Three main factors cause this deterioration of performance. First, the sampling set is constructed based on the sign pattern of $u_N$, i.e., by keeping the vertices $v_i$ for which $u_N(i) \geq 0$, and has cardinality close to N/2 (Shuman et al., 2016). Therefore, if at any level one tries to reduce the graph size by less than half, the last-eigenvector heuristic cannot be used precisely. Second, numerical instability issues sometimes manifest when r exceeds 50%. I was not able to improve the implementation featured in the PyGSP toolbox, and in my experiments some problem instances could not be solved successfully (hence the missing markers). The third reason is described next.
Coarse levels should aim to approximate the original graph and not the preceding levels. The conventional approach in multi-level schemes is to aim, at each level, to approximate as closely as possible the graph of the previous level. This can lead to a sudden increase of error at consecutive levels (e.g., notice the minnesota error as r approaches 50%), as decisions early in the scheme can have a significant impact later on. On the other hand, local variation methods modify the cost function minimized at each level (see Proposition 17), resulting in smoother transitions between levels and tighter approximations at large r.

5.2. Spectrum Approximation

The second part of the experiments examines coarsening through the lens of spectral graph theory. The premise is that, since the spectrum of the Laplacian distills information about the graph structure, one may interpret the spectral distance as a proxy for the structural similarity of the two graphs. This is by no means a new idea: the Laplacian spectrum is a common ingredient in assessing graph similarity (Wilson and Zhu, 2008).

Tables 1 and 2 report the mean relative eigenvalue error, defined as
$$\frac{1}{k} \sum_{i=1}^{k} \frac{|\tilde{\lambda}_i - \lambda_i|}{\lambda_i},$$
for two representative values of k, respectively 10 and 40. The results for k = 80 were consistent with those presented here, and are not reported for reasons of brevity.

As expected, the reduction ratio plays a significant role in the closeness of the Laplacian spectra. Indeed, for most cases, the eigenvalue error jumps by almost an order of magnitude whenever r increases by 20%. Yet, also in most cases, acceptable errors can be achieved even when the coarse graph is as small as one third of the size of the original graph (corresponding to r = 70%).

graph      r     heavy edge   local var. (edges)   local var. (neigh.)   algebraic distance   affinity   Kron reduction
yeast      30%   0.284        0.123                0.003                 0.126                0.164      0.054
yeast      50%   1.069        0.460                0.034                 0.759                0.877      1.321
yeast      70%   5.126        3.920                0.409                 3.395                3.140      1.865
airfoil    30%   0.278        0.036                0.065                 0.219                0.258      0.345
airfoil    50%   0.527        0.201                0.197                 1.221                1.291      0.900
airfoil    70%   3.954        1.042                0.926                 5.562                5.145      2.027
bunny      30%   0.015        0.006                0.061                 0.244                0.070      0.335
bunny      50%   0.064        0.046                0.190                 0.401                0.137      0.801
bunny      70%   0.122        0.080                0.323                 0.694                0.304      1.812
minnesota  30%   0.332        0.088                0.078                 0.220                0.295      0.324
minnesota  50%   1.363        0.431                0.310                 2.394                2.676      0.873
minnesota  70%   7.452        4.553                1.892                 8.412                9.354      2.068

Table 1: Mean relative error for the first k = 10 eigenvalues, for different graphs, reduction ratios, and coarsening methods.

graph      r     heavy edge   local var. (edges)   local var. (neigh.)   algebraic distance   affinity   Kron reduction
yeast      30%   0.311        0.113                0.023                 0.113                0.162      0.120
yeast      50%   1.087        0.413                0.130                 0.522                0.676      1.196
yeast      70%   3.618        2.212                0.454                 2.387                2.474      1.946
airfoil    30%   0.278        0.095                0.181                 0.191                0.257      0.366
airfoil    50%   0.555        0.326                0.349                 0.692                0.788      0.955
airfoil    70%   2.059        0.905                0.848                 2.309                2.619      2.141
bunny      30%   0.014        0.008                0.085                 0.215                0.049      0.294
bunny      50%   0.067        0.058                0.181                 0.356                0.092      0.660
bunny      70%   0.122        0.098                0.299                 0.520                0.200      1.192
minnesota  30%   0.358        0.118                0.115                 0.223                0.308      0.342
minnesota  50%   0.967        0.468                0.383                 1.105                1.296      0.944
minnesota  70%   3.588        2.160                1.610                 4.113                4.133      2.198

Table 2: Mean relative error for the first k = 40 eigenvalues, for different graphs, reduction ratios, and coarsening methods.
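The table metric is reproduced by the following illustrative helper. Since $\lambda_1 = \tilde{\lambda}_1 = 0$ for connected graphs under consistent coarsening, the first ratio is skipped here, which is an assumption on my part about how the 0/0 term is handled:

```python
import numpy as np

def mean_relative_eigenvalue_error(L, Lc, k):
    """(1/k) * sum_i |lambda_i(Lc) - lambda_i(L)| / lambda_i(L)."""
    lam = np.linalg.eigvalsh(L)[1:k + 1]      # skip the trivial lambda_1 = 0
    lam_c = np.linalg.eigvalsh(Lc)[1:k + 1]
    return float(np.mean(np.abs(lam_c - lam) / lam))
```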
In particular, if one sorts the tested methods from best to worst, one obtains an ordering that is generally consistent across the two experiments, with local variation methods giving the best approximation by a significant margin. A case in point: for the maximum ratio, the smallest local variation error is on average 3.5 times smaller than that of the leading state-of-the-art coarsening method. The gain is 2.3 times if Kron reduction is also included in the comparison.

Overall, it can be deduced from these results that local variation methods coarsen a graph in a manner that preserves its spectrum. This is in accordance with the theoretical results: as shown by Theorem 13, if $L$ and $L_c$ act similarly w.r.t. all vectors in $\mathcal{U}_k$, then their eigenvalues cannot be far apart. Therefore, by aiming for restricted approximation, local variation methods implicitly also guarantee spectrum approximation.

5.3. Efficiency

The last experiment tests computational efficiency. I adopt a simple approach and aim to coarsen a 10-regular graph of increasing size (see footnote 8). I measure the execution time of the six different methods for graph reduction and report the mean over ten iterations, while capping computation at 100 seconds. The results are displayed in Figures 3a and 3b for subspaces of size 10 and 40, respectively.

[Figure 3: Execution time as a function of the number of edges M of the graph, for subspaces of size k = 10 (left) and k = 40 (right). Each panel compares the kron, affinity, algebraic distance, local variation (neighborhood and edge variants), and heavy edge methods against the maximum execution time of 100 seconds.]

As with most such comparisons, the actual numbers are only indicative and depend on the programming language utilized (Python), the processing paradigm (no parallelism was employed), and the hardware architecture (2.2 GHz CPU; see footnote 9). Focusing on the trends, except for Kron reduction and affinity, most methods scale quasilinearly with the number of edges. Interestingly, local variation methods are quite competitive and do not sacrifice much as compared to the straightforward heavy edge matching. One can also observe that constructing $\mathcal{F}_{\ell-1}$ based on neighborhoods results in slightly slower computation than with the edge-based construction, because in the latter case the local variation cost can be computed much more efficiently.

8. Starting from a ring, this graph is constructed by connecting each vertex with the five neighbors on either side; a construction sketch is given below.
9. I expect that a significant speedup can be achieved by compiling the code to native machine instructions, as well as by parallelizing the local variation cost computation.
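The benchmark graph of footnote 8 is straightforward to construct; the following is a minimal sketch (the function name ring_10_regular is illustrative and assumes N > 10).

```python
import numpy as np
import scipy.sparse as sp

def ring_10_regular(N):
    """Starting from a ring on N > 10 vertices, connect each vertex with the
    five neighbors on either side (cf. footnote 8); returns the Laplacian."""
    idx = np.arange(N)
    rows = np.concatenate([idx] * 5)
    cols = np.concatenate([(idx + d) % N for d in range(1, 6)])
    A = sp.csr_matrix((np.ones(5 * N), (rows, cols)), shape=(N, N))
    A = A + A.T                     # symmetrize; every vertex gets degree 10
    return sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A

L = ring_10_regular(1000)
print(L.shape, L.diagonal()[:3])    # (1000, 1000), all degrees equal to 10
```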
6. Conclusions

Graph reduction techniques are commonly used in modern graph processing and machine learning pipelines in order to facilitate the processing and analysis of large graphs. Nevertheless, more often than not, these techniques have been designed based on intuition and possess no rigorous justification or provable guarantees.

This work considered the graph reduction problem from the perspective of restricted spectral approximation, a modification of the measure utilized for the construction of graph sparsifiers. This measure is especially relevant when restricted to Laplacian eigenspaces of small variation, as it implies strong spectral and cut guarantees. The analysis of restricted spectral approximation has led to new algorithms for graph coarsening that provably approximate (a portion of) the graph spectrum, as well as the cut structure of a large graph.

Certain important questions remain open at the point of concluding this manuscript. To begin with, I am currently unaware of a rigorous way to determine how much one may benefit from reduction; that is, how small can ϵ be for a specific subspace and target size n? In addition, no polynomial-time algorithm for graph coarsening exists that provably approximates the minimal achievable ϵ. Finally, though I lack a formal proof, I also suspect that stronger cut guarantees can be derived from restricted spectral approximation. I argue that the potential of graph reduction cannot be fully realized until these fundamental questions are satisfactorily addressed.

Acknowledgments

I would like to thank the Swiss National Science Foundation (SNSF) for supporting this work in the context of the project "Deep Learning for Graph-Structured Data", grant number PZ00P2 179981. I would also like to thank Rodrigo C. G. Pena and Nguyen Q. Tran, as well as the editor Karsten Borgwardt and the anonymous reviewers, for their useful comments.

Appendix A. Deferred Proofs

A.1. Proof of Property 1

I draw up an inductive argument demonstrating that $\Pi$ is a projection matrix. The base case, i.e., that $A_c = P_c^+ P_c$ is a projection matrix, follows by the definition of the pseudoinverse:
$$A_c A_c = P_c^+ P_c P_c^+ P_c = P_c^+ P_c = A_c,$$
where one uses the property $P_c = P_c P_c^+ P_c$. For the inductive step, I argue that if $A_{\ell+1}$ is a projection matrix, the same holds for $A_\ell = P_\ell^+ A_{\ell+1} P_\ell$. To this end, let $P_\ell = U \Sigma V^\top$ be the singular-value decomposition with $\Sigma = (D \;\; 0) \in \mathbb{R}^{N_{\ell+1} \times N_\ell}$ decomposed into the $N_{\ell+1} \times N_{\ell+1}$ diagonal matrix $D$ and the all-zero matrix $0$. Then $A_\ell = V \Sigma^+ U^\top A_{\ell+1} U \Sigma V^\top$. Recalling that a projection matrix remains projective if it undergoes a similarity transformation, we deduce that $U^\top A_{\ell+1} U$ is also projective and, moreover, if $\Sigma^+ U^\top A_{\ell+1} U \Sigma$ is a projection matrix, so is $A_\ell$. However, one may write
$$\Sigma^+ U^\top A_{\ell+1} U \Sigma = \begin{pmatrix} D^{-1} \\ 0 \end{pmatrix} U^\top A_{\ell+1} U \begin{pmatrix} D & 0 \end{pmatrix} = \begin{pmatrix} D^{-1} U^\top A_{\ell+1} U D & 0 \\ 0 & 0 \end{pmatrix}.$$
As a block-diagonal matrix whose blocks are projective (again, $D^{-1} U^\top A_{\ell+1} U D$ is a similarity transformation), $\Sigma^+ U^\top A_{\ell+1} U \Sigma$ is also a projection matrix. The proof that $\Pi$ is a projection matrix concludes by letting the induction unfold backwards from $c$ to $1$ and setting $\Pi = A_1$.
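The induction above is easy to check numerically. Below is a minimal sketch for a two-level scheme; coarsening_matrix is a hypothetical test generator and not part of the algorithms in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

def coarsening_matrix(n, N):
    """n x N matrix whose rows have disjoint support (one contraction set per
    row); a hypothetical generator used only for this numerical check."""
    P = np.zeros((n, N))
    P[rng.permutation(N) % n, np.arange(N)] = rng.uniform(0.5, 1.5, size=N)
    return P

# Two-level scheme N = 12 -> 6 -> 3, mirroring the induction of the proof.
P1, P2 = coarsening_matrix(6, 12), coarsening_matrix(3, 6)
A2 = np.linalg.pinv(P2) @ P2          # base case: A_c = P_c^+ P_c
A1 = np.linalg.pinv(P1) @ A2 @ P1     # inductive step: A_l = P_l^+ A_{l+1} P_l
assert np.allclose(A1 @ A1, A1)       # Pi = A_1 is idempotent, i.e., a projection
```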
A.2. Proof of Theorem 3

The Courant–Fischer min-max theorem for a Hermitian matrix $L$ reads
$$\lambda_k = \min_{\dim(U) = k}\; \max_{x \in U} \left\{ \frac{x^\top L x}{x^\top x} \;\Big|\; x \neq 0 \right\}, \tag{8}$$
whereas the same theorem for $L_c$ gives
$$\tilde{\lambda}_k = \min_{\dim(U_c) = k}\; \max_{x_c \in U_c} \left\{ \frac{x_c^\top L_c x_c}{x_c^\top x_c} \;\Big|\; x_c \neq 0 \right\} = \min_{\dim(U_c) = k}\; \max_{P x \in U_c} \left\{ \frac{x^\top \Pi L \Pi x}{x^\top P^\top P x} \;\Big|\; x \neq 0 \right\},$$
where in the second equality I substitute $L_c = P^{\mp} L P^{+}$ and $x_c = P x$. We will need the following result:

Lemma 19 For any $P$ with full row rank, the following holds:
$$\lambda_1(P P^\top)\, x^\top \Pi x \;\leq\; x^\top P^\top P x \;\leq\; \lambda_n(P P^\top)\, x^\top \Pi x,$$
with $\lambda_1(P P^\top)$ and $\lambda_n(P P^\top)$, respectively, the smallest and largest eigenvalues of $P P^\top$.

Proof Set $D = (P P^\top)^+$, which is an $n \times n$ PSD matrix. By the properties of the Moore–Penrose inverse, $P^+ = P^\top (P P^\top)^+ = P^\top D$, and therefore $P^\top P = P^\top D D^{-1} P = P^+ D^{-1} P$. Supposing that the eigenvalues of $D$ lie in $[\lambda_{\min}(D), \lambda_{\max}(D)]$ and that $P$ is full row rank, such that $\lambda_{\min}(D) > 0$, one may write
$$\frac{1}{\lambda_{\max}(D)}\, x^\top \Pi x \;\leq\; x^\top P^\top P x = x^\top P^+ D^{-1} P x \;\leq\; \frac{1}{\lambda_{\min}(D)}\, x^\top \Pi x.$$
To grasp why the aforementioned inequality holds, first use the cyclic property of the trace to obtain
$$x^\top P^+ D^{-1} P x = \mathrm{tr}(x^\top P^+ D^{-1} P x) = \mathrm{tr}(D^{-1} P x x^\top P^+),$$
and further recall that for any symmetric $A$ and PSD matrix $B$ one has $\lambda_{\min}(A)\,\mathrm{tr}(B) \leq \mathrm{tr}(A B) \leq \lambda_{\max}(A)\,\mathrm{tr}(B)$, where $\lambda_{\min}(A)$ denotes the smallest eigenvalue of $A$ (and $\lambda_{\max}(A)$ the largest) (Fang et al., 1994). Set $A = D^{-1}$, which is by assumption symmetric, and $B = P x x^\top P^+$, which is PSD since its rank is (at most) one with the only (potentially) non-zero eigenvalue exactly
$$\mathrm{tr}(P x x^\top P^+) = x^\top P^+ P x = x^\top \Pi x = x^\top \Pi^2 x = \|\Pi x\|_2^2 \geq 0,$$
where $\Pi = \Pi^2$ due to being a projection matrix (Property 1). The desired inequality then follows since $\lambda_{\min}(D^{-1}) = 1/\lambda_{\max}(D)$, $\lambda_{\max}(D^{-1}) = 1/\lambda_{\min}(D)$, and, once more, $\mathrm{tr}(P x x^\top P^+) = x^\top \Pi x$. Finally, since $P$ is full row rank, $D$ is invertible, meaning $\lambda_{\min}(D) = 1/\lambda_n(P P^\top)$ and $\lambda_{\max}(D) = 1/\lambda_1(P P^\top)$.

From the above, it is deduced that
$$\tilde{\lambda}_k \;\geq\; \min_{\dim(U_c) = k}\; \max_{P x \in U_c} \left\{ \frac{x^\top \Pi L \Pi x}{\lambda_n(P P^\top)\, x^\top \Pi x} \;\Big|\; x \neq 0 \right\} = \frac{1}{\lambda_n(P P^\top)} \min_{\substack{\dim(U) = k,\\ U \subseteq \mathrm{im}(\Pi)}}\; \max_{x \in U} \left\{ \frac{x^\top L x}{x^\top x} \;\Big|\; x \neq 0 \right\},$$
where the equality holds since $\Pi$ is a projection matrix (see Property 1). Notice how, with the exception of the constraint $x = \Pi x$ and the multiplicative term, the final optimization problem is identical to the one for $\lambda_k$ given in (8). As such, the former's solution must be at least as large (since it is a more constrained problem) and $\tilde{\lambda}_k \geq \lambda_k / \lambda_n(P P^\top)$. Analogously, one obtains the matching inequality $\tilde{\lambda}_{k - (N - n)} \leq \lambda_k / \lambda_1(P P^\top)$ by applying the same argument to matrices $-L$ and $-L_c$ and exploiting that the $k$-th largest eigenvalue of any matrix $M$ is also the $k$-th smallest eigenvalue of $-M$.

A.3. Proof of Proposition 6

For notational convenience, I drop the level index, supposing that $c = 1$ and thus that $P$ is an $n \times N$ coarsening matrix. As we will see in the following, $P$ has rank $n$ and thus, to prove that $P^+ = P^\top D^{-2}$, it is sufficient to show that matrix $\Pi = P^\top D^{-2} P$ is a projection matrix of rank $n$ (and thus equal to $P^+ P$). Matrix $\bar{P} = D^{-1} P$ has the same sparsity structure as $P$ and is thus also a coarsening matrix. W.l.o.g., let the rows of $\bar{P}$ be sorted based on their support, such that for any two rows $r < r'$ with $\bar{P}(r, i) \neq 0$ and $\bar{P}(r', i') \neq 0$, we necessarily have $i < i'$. Furthermore, denote by $p_r$ the vector containing all non-zero entries of $P(r, :)$, such that $\|p_r\|_2 = \|P(r, :)\|_2 = D(r, r)$. Due to the disjoint support of the rows of $P$, and under this particular sorting, matrix $\Pi$ is block-diagonal. Moreover, each block $B_r$ in its diagonal is a rank-one projection matrix, as
$$B_r^2 = B_r B_r = \frac{p_r\, p_r^\top}{D(r, r)^2}\, \frac{p_r\, p_r^\top}{D(r, r)^2} = \frac{p_r\, p_r^\top}{D(r, r)^2}\, \frac{p_r^\top p_r}{D(r, r)^2} = B_r,$$
since $p_r^\top p_r = \|p_r\|_2^2 = D(r, r)^2$. We have thus arrived at the relation $\Pi^2 = \Pi$, which constitutes a necessary and sufficient condition for $\Pi$ to be a projection matrix. The block-diagonal structure of $\Pi$ also implies that its rank (as well as that of $P$) is $n$.
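The closed-form pseudoinverse of Proposition 6 can likewise be verified numerically; a minimal sketch, where the random coarsening matrix is again a hypothetical test case.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 4, 10
P = np.zeros((n, N))
P[rng.permutation(N) % n, np.arange(N)] = rng.uniform(0.5, 1.5, size=N)

# D(r, r) = ||P(r, :)||_2, so that P^+ = P^T D^{-2} (Proposition 6). Dividing
# each column of P^T by the squared row norm realizes the product P^T D^{-2}.
norms = np.linalg.norm(P, axis=1)
assert np.allclose(np.linalg.pinv(P), P.T / norms**2)
```

The identity holds because the rows of a coarsening matrix have disjoint supports and are therefore orthogonal, which makes $P P^\top = D^2$ diagonal.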
A.4. Proof of Proposition 7

Let us first remark that, by Proposition 6, $A = P^{\mp}$ is also a coarsening matrix with the same sparsity structure as $P$.

Necessity. I start by considering the nullspace of $L_c = A L A^\top$ and aim to ensure that it is equal to the span of the constant vector $\mathbf{1}$, which is a necessity for all combinatorial Laplacian matrices. Since matrix $A$ is full row rank and $L$ has rank $N - 1$, the nullspace of $L_c$ is one-dimensional. Therefore, the nullspace is correct as long as $\mathbf{1}^\top A L A^\top \mathbf{1} = 0$, which happens if either $A^\top \mathbf{1} = \alpha \mathbf{1}$ for a constant $\alpha$, or $A^\top \mathbf{1} = 0$. In both cases, $(A^\top \mathbf{1})(i) = \alpha$ for every $i$ (with $\alpha$ possibly zero). By the definition of $A$, however, we know that its rows have disjoint support and, as such, vector $A^\top \mathbf{1}$ exactly contains the non-zero entries of $A$. In other words, for the nullspace of $L_c$ to be properly formed, the non-zero entries of $A$ should either all be equal to $\alpha$ (such that $A^\top \mathbf{1} = \alpha \mathbf{1}$) or all be zero (in which case $A^\top \mathbf{1} = 0$). The latter case can clearly be discarded, as it would disconnect the graph. We have thus discovered that $A L A^\top$ has a properly formed nullspace if and only if the non-zero entries of $A$ are equal, rendering the latter condition necessary.

Sufficiency. Considering that every Laplacian of $M$ edges can be re-written as $L = S^\top S$, where $S$ is the $M \times N$ incidence matrix, one may confirm that the condition is also sufficient by showing that, for every $A$ with equal non-zero entries, the matrix $S_c = S A^\top$ is an incidence matrix of $L_c$, such that $L_c = S_c^\top S_c$. W.l.o.g., suppose that $\alpha = 1$ ($\alpha^2 L$ is a valid Laplacian for all $\alpha$). By construction, each row of $S_c$ is $S_c(q, :)^\top = A\, S(q, :)^\top$. Name as $e_{ij}$ the corresponding edge, such that $S(q, :)^\top = \delta_i - \delta_j$, where $\delta_i$ is a Dirac centered at vertex $v_i$. It follows that $S_c(q, :)^\top = A \delta_i - A \delta_j$. Obviously, if none of $v_i, v_j$ are contracted, then $S_c(q, :)^\top = \delta_i - \delta_j$, which is a valid row. Moreover, by construction of $A$, if either of $v_i, v_j$ is contracted (but not both), or if $v_i, v_j$ are contracted into different vertices, then both relations $A \delta_i = \delta_i$ and $A \delta_j = \delta_j$ hold and thus, once more, $S_c(q, :)^\top = \delta_i - \delta_j$ is a valid row. Lastly, if $v_i, v_j$ are contracted into the same vertex, then for some $r$ it must be that $A(r, i) = A(r, j)$, whereas $A(r', i) = A(r', j) = 0$ for all $r' \neq r$, and thus $S_c(q, :) = 0$, signifying that the edge is not present. Summarizing, in every case $S_c$ is a valid incidence matrix, rendering the condition also sufficient.

A.5. Proof of Property 8

For any two disjoint subsets $S_1, S_2$ of $V$, denote by $w(S_1, S_2) = \sum_{v_i \in S_1} \sum_{v_j \in S_2} w_{ij}$ the cut weight in $G$. The claim is proven by induction on the number of levels. For the base case, set $\ell = 1$ and define $C = P_1^+$, such that $c_r = C(:, r)$ is the indicator vector of the contraction set $\mathcal{V}_0^{(r)}$. It is a consequence of the Laplacian form of $L$ that, for any $v_r, v_q \in \mathcal{V}_1$ with $r \neq q$, we have
$$W_1(r, q) = -L_1(r, q) = -c_r^\top L c_q = \sum_{v_i \neq v_j} w_{ij}\, c_r(i)\, c_q(j) - \sum_{v_i} d_i\, c_r(i)\, c_q(i) = \sum_{v_i \in \mathcal{V}_0^{(r)},\, v_j \in \mathcal{V}_0^{(q)}} w_{ij} = w(S_1^{(r)}, S_1^{(q)}),$$
where the penultimate step uses $c_r(i)\, c_q(i) = 0$, since contraction sets are disjoint, and the last step exploits the equivalence $\mathcal{V}_0^{(q)} = S_1^{(q)}$. For the inductive step, consider level $\ell > 1$. Since $L_{\ell-1}$ is a Laplacian matrix, one may employ an identical argument as when $\ell = 1$ to find that the weight between vertices $v_r, v_q \in \mathcal{V}_\ell$ with $r \neq q$ is
$$W_\ell(r, q) = \sum_{v_i \in \mathcal{V}_{\ell-1}^{(r)},\, v_j \in \mathcal{V}_{\ell-1}^{(q)}} W_{\ell-1}(i, j).$$
By the induction hypothesis, however, it must be that $W_{\ell-1}(i, j) = w(S_{\ell-1}^{(i)}, S_{\ell-1}^{(j)})$, implying
$$W_\ell(r, q) = \sum_{v_i \in \mathcal{V}_{\ell-1}^{(r)},\, v_j \in \mathcal{V}_{\ell-1}^{(q)}} w(S_{\ell-1}^{(i)}, S_{\ell-1}^{(j)}) = w(S_\ell^{(r)}, S_\ell^{(q)}),$$
with the final equality being true due to the recursive definition of the sets $S_\ell^{(r)}$ and $S_\ell^{(q)}$, as well as the following property of cuts: for any two sets (call them large sets) and any partition of each into an arbitrary number of subsets, the cut between the large sets is equal to the sum of all cuts between pairs of subsets belonging to different large sets. This completes the proof.
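The base case of the induction is easy to verify on a toy graph; a minimal sketch, with the graph and contraction sets chosen only for illustration.

```python
import numpy as np

# Small weighted graph on 5 vertices.
W = np.array([[0, 2, 1, 0, 0],
              [2, 0, 3, 1, 0],
              [1, 3, 0, 0, 4],
              [0, 1, 0, 0, 2],
              [0, 0, 4, 2, 0]], dtype=float)
L = np.diag(W.sum(1)) - W

# Contract {v0, v1} and {v2, v3, v4}; c_r is the indicator of set r.
C = np.zeros((5, 2))
C[[0, 1], 0] = 1.0
C[[2, 3, 4], 1] = 1.0

Wc_01 = -C[:, 0] @ L @ C[:, 1]          # coarse weight W_1(r, q) = -c_r^T L c_q
cut = W[np.ix_([0, 1], [2, 3, 4])].sum()  # cut weight w(S^(r), S^(q)) in G
assert np.isclose(Wc_01, cut)           # Property 8: both equal 1 + 3 + 1 = 5
```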
A.6. Proof of Theorem 13

The lower bound is given by Theorem 3. For the upper bound, I reason similarly to the proof of the latter to find
$$\tilde{\lambda}_k = \min_{\dim(U_c) = k}\; \max_{x_c \in U_c} \left\{ \frac{x_c^\top L_c x_c}{x_c^\top x_c} \;\Big|\; x_c \neq 0 \right\} \;\leq\; \gamma_2 \min_{\substack{\dim(U) = k,\\ U \subseteq \mathrm{im}(\Pi)}}\; \max_{x \in U} \left\{ \frac{x^\top \Pi L \Pi x}{x^\top \Pi x} \;\Big|\; x \neq 0 \right\}.$$
Above, the inequality is due to Lemma 19, with $\gamma_2 = 1/\lambda_1(P P^\top)$. Thus, for any matrix $V$, the following inequality holds:
$$\tilde{\lambda}_k \;\leq\; \gamma_2 \max_{x \in \mathrm{span}(V)} \left\{ \frac{x^\top \Pi L \Pi x}{x^\top \Pi x} \;\Big|\; x \neq 0 \right\},$$
as long as the image of $V$ is of dimension $k$ and does not intersect the nullspace of $\Pi$. Write $U_k$ to denote the $N \times k$ matrix with the $k$ first eigenvectors of $L$, whose image is of dimension $k$, as needed. Assume for now that the nullspace requirement is also met:
$$\tilde{\lambda}_k \;\leq\; \gamma_2 \max_{x \in \mathrm{span}(U_k)} \left\{ \frac{x^\top \Pi L \Pi x}{x^\top \Pi x} \;\Big|\; x \neq 0 \right\} = \gamma_2 \max_{x \in \mathrm{span}(U_k)} \left\{ \frac{\|S \Pi x\|_2^2}{\|\Pi x\|_2^2} \;\Big|\; x \neq 0 \right\}.$$
It will be convenient to manipulate the square root of this quantity:
$$\sqrt{\frac{\tilde{\lambda}_k}{\gamma_2}} \;\leq\; \max_{a \in \mathbb{R}^k} \frac{\|S \Pi U_k a\|_2}{\|\Pi U_k a\|_2} = \frac{\|S \Pi U_k\|_2}{\|\Pi U_k\|_2} \;\leq\; \frac{\|S U_k\|_2 + \|S \Pi^\perp U_k\|_2}{\|\Pi U_k\|_2} = \frac{\sqrt{\lambda_k} + \|S \Pi^\perp U_k\|_2}{\|\Pi U_k\|_2}, \tag{9}$$
with $S$ defined such that $L = S^\top S$. The norm in the numerator is upper bounded by
$$\|S \Pi^\perp U_k\|_2 = \|S \Pi^\perp U_k \Lambda_k^{+1/2} \Lambda_k^{1/2}\|_2 \;\leq\; \|S \Pi^\perp U_k \Lambda_k^{+1/2}\|_2\, \|\Lambda_k^{1/2}\|_2 \;\leq\; \sqrt{\lambda_k}\, \|S \Pi^\perp U_k \Lambda_k^{+1/2}\|_2 = \sqrt{\lambda_k}\, \epsilon_k.$$
If the last step is not immediately obvious, one can be convinced by first exploiting the unitary invariance of the spectral norm to write $\|S \Pi^\perp U_k \Lambda_k^{+1/2}\|_2 = \|S \Pi^\perp U_k U_k^\top L^{+1/2}\|_2$, and then confirming in the proof of Proposition 17 that the latter quantity is exactly $\epsilon_k$ when $V = U_k$.

The preceding analysis assumed that the image of $U_k$ and the nullspace of $\Pi$ do not intersect. Since $\Pi^\perp = I - \Pi$ is a complement projection matrix, the previous holds when $\|\Pi^\perp U_k\|_2^2 < 1$. Since $\Pi^\perp u_1 = 0$, one may w.l.o.g. exclude the first eigenvector $u_1$ from the space of interest. For the remainder of $\mathrm{im}(U_k)$, the following holds:
$$\|\Pi^\perp U_k\|_2^2 = \max_{x \in \mathcal{U}_k,\; x \perp u_1} \frac{\|\Pi^\perp x\|_2^2}{\|x\|_2^2} \;\leq\; \frac{1}{\lambda_2} \max_{x \in \mathcal{U}_k,\; x \perp u_1} \frac{\|\Pi^\perp x\|_L^2}{\|x\|_2^2} \;\leq\; \epsilon_k^2\, \frac{\lambda_k}{\lambda_2}.$$
Therefore, when $\epsilon_k^2 < \lambda_2 / \lambda_k$, the nullspace condition is met. The proof is then concluded by substituting the bound $\|\Pi U_k\|_2^2 = 1 - \|\Pi^\perp U_k\|_2^2 \geq 1 - \epsilon_k^2\, \lambda_k / \lambda_2$ in the denominator of (9).

A.7. Proof of Theorem 14

Li (1994) showed that we can express $\sin \Theta$ as a sum of squared inner products:
$$\|\sin \Theta(U_k, P^\top \tilde{U}_k)\|_F^2 = \|\tilde{U}_{k^\perp}^\top P U_k\|_F^2 = \sum_{i \leq k} \sum_{j > k} (\tilde{u}_j^\top P u_i)^2. \tag{10}$$
If $L_c$ and $L$ are $(U_i, \epsilon_i)$-similar, it follows from Corollary 12 that $u_i^\top P^\top L_c P u_i \leq (1 + \epsilon_i)^2 \lambda_i$. Summing these inequalities for all $i \leq k$ amounts to
$$\sum_{i \leq k} (1 + \epsilon_i)^2 \lambda_i \;\geq\; \sum_{i \leq k} \sum_{j=1}^{n} \tilde{\lambda}_j\, (\tilde{u}_j^\top P u_i)^2 \;\geq\; \gamma_1 \lambda_{k+1} \sum_{i \leq k} \sum_{j > k} (\tilde{u}_j^\top P u_i)^2 + \gamma_1 \sum_{i \leq k} \sum_{j \leq k} \lambda_j\, (\tilde{u}_j^\top P u_i)^2, \tag{11}$$
where, following Theorem 3, I set $\gamma_1 = 1/\lambda_n(P P^\top)$, such that $\tilde{\lambda}_j \geq \gamma_1 \lambda_j$. Perform the following manipulation:
$$\sum_{j \leq k} \lambda_j\, (\tilde{u}_j^\top P u_i)^2 \;\geq\; \lambda_i - \lambda_k \Big( \|\Pi^\perp u_i\|_2^2 + \sum_{j > k} (\tilde{u}_j^\top P u_i)^2 \Big),$$
which together with (10) and (11) yields
$$\|\sin \Theta(U_k, P^\top \tilde{U}_k)\|_F^2 \;\leq\; \frac{\sum_{i \leq k} \big( (1 + \epsilon_i)^2 \lambda_i / \gamma_1 - \lambda_i + \lambda_k \|\Pi^\perp u_i\|_2^2 \big)}{\lambda_{k+1} - \lambda_k}. \tag{12}$$
To proceed, I note the following useful inequality:

Lemma 20 If $L$ and $L_c$ are $(U_i, \epsilon_i)$-similar, then $\|\Pi^\perp u_i\|_2^2 \leq \epsilon_i$.

Proof For every $u_i$, one sees that
$$u_i^\top P^\top L_c P u_i = u_i^\top \Pi L \Pi u_i = u_i^\top (I - \Pi^\perp) L (I - \Pi^\perp) u_i = u_i^\top L u_i - 2 u_i^\top L \Pi^\perp u_i + u_i^\top \Pi^\perp L \Pi^\perp u_i = \lambda_i - 2 \lambda_i\, u_i^\top \Pi^\perp u_i + u_i^\top \Pi^\perp L \Pi^\perp u_i,$$
meaning that
$$\|\Pi^\perp u_i\|_2^2 = \frac{1}{2} \left( 1 + \frac{u_i^\top \Pi^\perp L \Pi^\perp u_i - u_i^\top P^\top L_c P u_i}{\lambda_i} \right) \;\leq\; \frac{1 + \epsilon_i^2 - (1 - \epsilon_i)^2}{2} = \epsilon_i.$$
The last inequality is because, by restricted spectral approximation, we have $u_i^\top \Pi^\perp L \Pi^\perp u_i = \|\Pi^\perp u_i\|_L^2 \leq \epsilon_i^2 \|u_i\|_L^2 = \epsilon_i^2 \lambda_i$ and, from Corollary 12, $u_i^\top P^\top L_c P u_i = \|P u_i\|_{L_c}^2 \geq (1 - \epsilon_i)^2 \|u_i\|_L^2 = (1 - \epsilon_i)^2 \lambda_i$.

As a consequence, it follows that
$$\|\sin \Theta(U_k, P^\top \tilde{U}_k)\|_F^2 \;\leq\; \frac{\sum_{i \leq k} \big( (1 + \epsilon_i)^2 \lambda_i / \gamma_1 - \lambda_i + \lambda_k\, \epsilon_i \big)}{\lambda_{k+1} - \lambda_k},$$
which after manipulation gives the desired inequality.
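In passing, the identity at the heart of Lemma 20 can be checked numerically; a minimal sketch on a path graph, where the coarsening matrix $P$ (averaging the contracted pair) is chosen only for illustration.

```python
import numpy as np

# Path graph on 4 vertices and a coarsening that contracts {v0, v1}.
W = np.diag([1.0, 1.0, 1.0], 1)
W = W + W.T
L = np.diag(W.sum(1)) - W
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
Pp = np.linalg.pinv(P)
Pi_perp = np.eye(4) - Pp @ P                 # complement projection
Lc = Pp.T @ L @ Pp                           # L_c = P^-+ L P^+

lam, U = np.linalg.eigh(L)
u, l = U[:, 1], lam[1]                       # any eigenpair with lambda_i > 0
lhs = u @ Pi_perp @ u                        # equals ||Pi_perp u||_2^2
rhs = 0.5 * (1 + (u @ Pi_perp @ L @ Pi_perp @ u - u @ P.T @ Lc @ P @ u) / l)
assert np.isclose(lhs, rhs)                  # the identity of Lemma 20's proof
```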
A.8. Proof of Theorem 15

The lower bound is a direct consequence of consistent coarsening and holds independently of restricted spectral approximation: for any set $S_c \subseteq V_c$, define $S \subseteq V$ such that $v_i \in S$ if and only if $\varphi_c \circ \varphi_{c-1} \circ \cdots \circ \varphi_1(v_i) \in S_c$. Clearly, $w(S) \geq w_c(S_c)$, where the subscript $c$ implies that the latter volume is w.r.t. $G_c$. In addition, by the definition of Laplacian consistent coarsening, and since every contraction set belongs either to $S$ or to $\bar{S}$ (but not to both), it follows that $w(S, \bar{S}) = w_c(S_c, \bar{S}_c)$. In other words, for every $S_c$ there exists a set $S$ such that
$$\phi(S) = \frac{w(S, \bar{S})}{\min\{w(S),\, w(\bar{S})\}} \;\leq\; \frac{w_c(S_c, \bar{S}_c)}{\min\{w_c(S_c),\, w_c(\bar{S}_c)\}} = \phi_c(S_c),$$
implying also that the $k$-conductances of $G$ and $G_c$ are related by $\phi_k(G) \leq \phi_k(G_c)$.

For the upper bound, I exploit the following multi-way Cheeger inequality:

Theorem 21 (Restatement of Theorem 1.2 by Lee et al. (2014)) For every graph $G$ and every $k \in \mathbb{N}$, we have
$$\frac{\mu_k}{2} \;\leq\; \phi_k(G) = O\big(\sqrt{\mu_{2k}}\; \xi_k(G)\big),$$
with $\xi_k(G) = O(\log k)$. If $G$ is planar, then $\xi_k(G) = O(1)$, and if $G$ excludes $K_h$ as a minor (but is not planar), then $\xi_k(G) = O(h^4)$.

Further, in the standard Cheeger inequality (Alon and Milman, 1985; Alon, 1986), $k = 2$ and the upper bound is given by $\sqrt{2 \mu_2}$. Note that the eigenvalues mentioned here are those of the normalized Laplacian matrix $L_n = D^{-1/2} L D^{-1/2}$.

To this end, suppose that $V_{2k}$ contains the first $2k$ eigenvectors of $L_n$ and fix $R = \mathrm{span}(D^{-1/2} V_{2k})$. Perform consistent coarsening w.r.t. the combinatorial Laplacian $L$ (not $L_n$). Then, by definition, if $L_c$ and $L$ are $(R, \epsilon_{2k})$-similar, for every $x \in R$ one gets $\|\Pi^\perp x\|_L \leq \epsilon_{2k} \|x\|_L$. The substitution $y = D^{1/2} x$, such that $y \in \mathrm{span}(V_{2k})$, allows us to transform the semi-norms above into semi-norms concerning $L_n$ as follows:
$$\|x\|_L^2 = x^\top L x = y^\top D^{-1/2} L D^{-1/2} y = \|y\|_{L_n}^2$$
$$\|\Pi^\perp x\|_L = \|D^{-1/2} D^{1/2} \Pi^\perp D^{-1/2} D^{1/2} x\|_L = \|D^{1/2} \Pi^\perp D^{-1/2} y\|_{L_n} = \|(I - \Pi_n)\, y\|_{L_n}.$$
Above, $\Pi_n = D^{1/2} \Pi D^{-1/2}$ is the projection matrix (the set of projection matrices is closed under similarity transformations) corresponding to the coarsening matrix $P_n = P D^{-1/2}$, and now $y_c = \Pi_n y$. It follows that, for every $y \in \mathcal{V}_{2k} = \mathrm{span}(V_{2k})$, we have $\|y - \Pi_n y\|_{L_n} \leq \epsilon_{2k} \|y\|_{L_n}$, and thus $L_c^n$ and $L_n$ are $(\mathcal{V}_{2k}, \epsilon_{2k})$-similar. Combining the multi-way Cheeger inequality with Theorem 13 for $L_c^n$ and $L_n$, one obtains
$$\phi_k^2(G_c) = O\big(\tilde{\mu}_{2k}\; \xi_k(G)\big) = O\left( \gamma_2\, \frac{(1 + \epsilon_{2k})^2\, \mu_2}{\mu_2 - \epsilon_{2k}^2\, \mu_{2k}}\; \mu_{2k}\; \xi_k(G) \right) = O\left( \gamma_2\, \frac{(1 + \epsilon_{2k})^2\, \xi_k(G)}{1 - \epsilon_{2k}^2\, (\mu_{2k} / \mu_2)}\; \phi_k(G) \right),$$
where the eigenvalues above are those of $L_n$ and the preceding holds whenever $\epsilon_{2k}^2 < \mu_2 / \mu_{2k}$. Further, when $k = 2$, the upper bound simplifies to $\phi_2^2(G_c) \leq 4\, \gamma_2\, \frac{(1 + \epsilon_2)^2}{1 - \epsilon_2^2}\, \phi_2(G)$.
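For intuition, the $k = 2$ case of Theorem 21 is easy to verify by brute force on a toy graph; a minimal sketch, with the graph chosen only for illustration.

```python
import itertools
import numpy as np

# Triangle on {v0, v1, v2} with a pendant vertex v3 attached to v2.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = W.sum(1)
L = np.diag(d) - W
Ln = np.diag(d**-0.5) @ L @ np.diag(d**-0.5)     # normalized Laplacian
mu2 = np.linalg.eigvalsh(Ln)[1]

def conductance(S):
    Sbar = [v for v in range(4) if v not in S]
    return W[np.ix_(S, Sbar)].sum() / min(d[S].sum(), d[Sbar].sum())

phi2 = min(conductance(list(S))
           for r in range(1, 4) for S in itertools.combinations(range(4), r))
assert mu2 / 2 <= phi2 <= np.sqrt(2 * mu2)       # standard Cheeger inequality
```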
A.9. Proof of Proposition 17

The following analysis is slightly more general than what is claimed in the statement of Proposition 17: it holds for arbitrary PSD $L$ and $L_c$ (i.e., not necessarily Laplacian matrices), as long as the image $\mathrm{im}(\Pi)$ of the projection matrix $\Pi$ encloses the nullspace of $L$. The latter trivially holds for Laplacian consistent coarsening, as, by design, one has $\Pi \mathbf{1} = \mathbf{1}$ (see Section 2.4).

Let $V \in \mathbb{R}^{N \times k}$ be a basis of $R$. I start by proving that, for any integer $k \leq n$ and for all $x \in \mathrm{span}(V)$, the inequality $\|x - \Pi x\|_L \leq \epsilon \|x\|_L$ holds for every $\epsilon \geq \|\Pi^\perp B_0\|_L$, where $B_0 = V V^\top L^{+1/2}$. I remind the reader that $\|x\|_L = \|S x\|_2 = \|L^{1/2} x\|_2$ and $\Pi^\perp = I - P^+ P$. Furthermore, since $\mathrm{im}(\Pi)$ necessarily encloses the nullspace $\mathcal{N}$ of $L$, w.l.o.g., one may assume that within $R$ the action of matrix $L$ is invertible. To see why, note that if $x \in \mathcal{N}$, then $\|x\|_L = 0$ and $\|x - \Pi x\|_L = 0$, meaning that the inequality above is trivially satisfied. I then derive
$$\max_{x \in R} \frac{\|x - \Pi x\|_L}{\|x\|_L} = \max_{x \in R} \frac{\|S \Pi^\perp x\|_2}{\|L^{1/2} x\|_2} = \max_{x \in R} \frac{\|S \Pi^\perp V V^\top x\|_2}{\|L^{1/2} x\|_2} \tag{13}$$
$$= \max_{x \in \mathrm{im}(L^{1/2} V)} \frac{\|S \Pi^\perp V V^\top L^{+1/2} x\|_2}{\|x\|_2} \;\leq\; \|S \Pi^\perp V V^\top L^{+1/2}\|_2 = \|\Pi^\perp B_0\|_L, \tag{14}$$
where equality (13) holds because $V V^\top$ is a projection onto $R$, whereas equality (14) is true since $L$ is invertible within $R$. One should also note that, for the specific case where $V$ is an eigenspace of $L$, $\mathrm{im}(L^{1/2} V) = R$ and, as such, $\epsilon = \max_{x \in R} \|\Pi^\perp x\|_L / \|x\|_L = \|S \Pi^\perp V V^\top L^{+1/2}\|_2$ (once more, w.l.o.g., the nullspace of $L$ can be ignored). In addition, as the following technical lemma claims, in a multi-level scheme, any $\|\Pi^\perp x\|_L$ can be broken down into the contributions of each level:

Lemma 22 Define the projection matrices $\Pi_\ell = P_\ell^+ P_\ell$ and $\Pi_\ell^\perp = I - \Pi_\ell$. If $\|\Pi_\ell^\perp x_{\ell-1}\|_{L_{\ell-1}} \leq \sigma_\ell \|x_{\ell-1}\|_{L_{\ell-1}}$ at each level $\ell \leq c$, then the multi-level error is bounded by
$$\|\Pi^\perp x\|_L \;\leq\; \sum_{\ell=1}^{c} \sigma_\ell \prod_{q=1}^{\ell-1} (1 + \sigma_q)\, \|x\|_L = \left( \prod_{\ell=1}^{c} (1 + \sigma_\ell) - 1 \right) \|x\|_L.$$

Proof Recursively apply the inequality
$$\|S_{\ell-1} \Pi_\ell^\perp x_{\ell-1}\|_2 \leq \sigma_\ell \|S_{\ell-1} x_{\ell-1}\|_2 = \sigma_\ell \|S_{\ell-2} \Pi_{\ell-1} x_{\ell-2}\|_2 \leq \sigma_\ell \big( \|S_{\ell-2} x_{\ell-2}\|_2 + \|S_{\ell-2} \Pi_{\ell-1}^\perp x_{\ell-2}\|_2 \big) \leq \sigma_\ell (1 + \sigma_{\ell-1}) \|S_{\ell-2} x_{\ell-2}\|_2$$
to deduce that
$$\|S_{\ell-1} \Pi_\ell^\perp x_{\ell-1}\|_2 \;\leq\; \sigma_\ell \prod_{q=1}^{\ell-1} (1 + \sigma_q)\, \|S_0 x_0\|_2 = \sigma_\ell \prod_{q=1}^{\ell-1} (1 + \sigma_q)\, \|x\|_L.$$
The end-to-end error $\|S \Pi^\perp x\|_2$ is then controlled with a simple telescopic-series argument:
$$\|\Pi^\perp x\|_L = \|S_0 \Pi^\perp x_0\|_2 = \|S_0 x_0 - S_c x_c\|_2 \leq \|S_0 x_0 - S_1 x_1\|_2 + \|S_1 x_1 - S_2 x_2\|_2 + \cdots + \|S_{c-1} x_{c-1} - S_c x_c\|_2 = \|S_0 \Pi_1^\perp x_0\|_2 + \|S_1 \Pi_2^\perp x_1\|_2 + \cdots + \|S_{c-1} \Pi_c^\perp x_{c-1}\|_2.$$
Together, the above two results imply the desired bound.

Therefore, to control $\|\Pi^\perp B_0\|_L = \max_{b \in \mathbb{R}^N} \|S \Pi^\perp B_0\, b\|_2 / \|b\|_2$ in a multi-level scheme, one needs to make sure that, for each level $\ell = 1, \ldots, c$, the following holds:
$$\frac{\|S_{\ell-1} \Pi_\ell^\perp x_{\ell-1}\|_2}{\|S_{\ell-1} x_{\ell-1}\|_2} \leq \sigma_\ell, \quad \text{for all } x_{\ell-1} = P_{\ell-1} \cdots P_1 B_0\, b.$$
By the same argument used for the multi-level error, when $\ell = 1$, we have that $\sigma_1 = \|\Pi_1^\perp B_0\|_{L_0}$. For all other $\ell$, set $B_{\ell-1} = P_{\ell-1} \cdots P_1 B_0$ and further let $(B_{\ell-1}^\top L_{\ell-1} B_{\ell-1})^{+1/2}$ be the pseudoinverse of the matrix square root of the $N \times N$ matrix $B_{\ell-1}^\top L_{\ell-1} B_{\ell-1}$. By the substitution $b \mapsto (B_{\ell-1}^\top L_{\ell-1} B_{\ell-1})^{+1/2}\, b$, the above can be rewritten as
$$\max_{b \in \mathbb{R}^N} \frac{\|S_{\ell-1} \Pi_\ell^\perp B_{\ell-1}\, b\|_2}{\|S_{\ell-1} B_{\ell-1}\, b\|_2} = \max_{b \in \mathbb{R}^N} \frac{\|S_{\ell-1} \Pi_\ell^\perp B_{\ell-1} (B_{\ell-1}^\top L_{\ell-1} B_{\ell-1})^{+1/2}\, b\|_2}{\|b\|_2}.$$
For $\ell > 1$, therefore, $\sigma_\ell = \|\Pi_\ell^\perp A_{\ell-1}\|_{L_{\ell-1}}$, with $A_{\ell-1} = B_{\ell-1} (B_{\ell-1}^\top L_{\ell-1} B_{\ell-1})^{+1/2}$.
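To illustrate Lemma 22 numerically, the two forms of the bound can be checked against each other with a short computation; the per-level values below are hypothetical.

```python
import numpy as np

# Hypothetical per-level error bounds sigma_1, ..., sigma_c.
sigma = np.array([0.10, 0.20, 0.15])

# Lemma 22: the telescoping sum equals the product form of the bound.
telescoped = sum(s * np.prod(1 + sigma[:i]) for i, s in enumerate(sigma))
product_form = np.prod(1 + sigma) - 1
assert np.isclose(telescoped, product_form)
print(product_form)  # about 0.518: per-level errors compound multiplicatively
```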
A.10. Proof of Proposition 18

For notational simplicity, in the context of this proof I drop level indices and assume that only a single coarsening level is used; this is without loss of generality, as an identical argument holds for every level of the scheme.

Consider any $x$ and set $y = \Pi^\perp x$. Furthermore, define for each contraction set (i) the internal edge set $\mathcal{E}^{(r)} = \{e_{ij} \mid v_i, v_j \in \mathcal{V}^{(r)} \text{ and } e_{ij} \in \mathcal{E}_{\ell-1}\}$, and (ii) the boundary edge set $\partial\mathcal{E}^{(r)}$, such that $e_{ij} \in \partial\mathcal{E}^{(r)}$ if and only if $v_i \in \mathcal{V}^{(r)}$ and $v_j \notin \mathcal{V}^{(r)}$. It is true that
$$\|\Pi^\perp x\|_L^2 = \sum_{e_{ij} \in \mathcal{E}} w_{ij}\, (y(i) - y(j))^2 \;\leq\; \sum_r \Big( \underbrace{\sum_{e_{ij} \in \mathcal{E}^{(r)}} w_{ij}\, (y(i) - y(j))^2}_{a_r} + \underbrace{\sum_{e_{ij} \in \partial\mathcal{E}^{(r)}} w_{ij}\, (y(i) - y(j))^2}_{b_r} \Big).$$
In the following, I will express $a_r$ and $b_r$ as functions of the vector $y_r = \Pi^\perp_{\mathcal{V}^{(r)}} x$. Term $a_r$ is luckily independent of any other contraction set:
$$a_r = \sum_{e_{ij} \in \mathcal{E}^{(r)}} w_{ij}\, (y(i) - y(j))^2 = \sum_{e_{ij} \in \mathcal{E}^{(r)}} w_{ij}\, (y_r(i) - y_r(j))^2.$$
On the other hand, $b_r$ is smaller than
$$b_r = \sum_{e_{ij} \in \partial\mathcal{E}^{(r)}} w_{ij}\, (y(i) - y(j))^2 \;\leq\; 2 \sum_{e_{ij} \in \partial\mathcal{E}^{(r)}} w_{ij}\, (y(i) - 0)^2 + 2 \sum_{e_{ij} \in \partial\mathcal{E}^{(r)}} w_{ij}\, (0 - y(j))^2.$$
Distributing the second quantities, respectively, amongst the contraction sets that include said vertices, one gets
$$\sum_r (a_r + b_r) \;\leq\; \sum_r \Big( \sum_{e_{ij} \in \mathcal{E}^{(r)}} w_{ij}\, (y_r(i) - y_r(j))^2 + \sum_{e_{ij} \in \partial\mathcal{E}^{(r)}} (2 w_{ij})\, (y_r(i) - y_r(j))^2 \Big) = \sum_r \|y_r\|_{L_{\mathcal{V}^{(r)}}}^2 = \sum_{\mathcal{C} \in \mathcal{P}} \|y_r\|_{L_\mathcal{C}}^2.$$
The second step above used the fact that $[\Pi^\perp x](i) = 0$ for all $v_i \notin \mathcal{C}$. A decoupled bound can then be obtained as follows:
$$\|\Pi^\perp A\|_L^2 = \max_{a \in \mathbb{R}^k} \frac{\|S \Pi^\perp A\, a\|_2^2}{\|a\|_2^2} \;\leq\; \sum_{\mathcal{C} \in \mathcal{P}} \max_{a \in \mathbb{R}^k} \frac{\|\Pi^\perp_\mathcal{C} A\, a\|_{L_\mathcal{C}}^2}{\|a\|_2^2} = \sum_{\mathcal{C} \in \mathcal{P}} \|\Pi^\perp_\mathcal{C} A\|_{L_\mathcal{C}}^2.$$
The final inequality is derived by taking the square root of the last equation.

Appendix B. Complexity Analysis

The computational complexity of Algorithm 1 depends on the number of nodes $N$ and edges $M$ of $G$, the number of levels $c$, and the subspace size $k$, as well as on how the families of candidate sets are formed. To derive worst-case bounds, I denote by $\Phi_\ell = \sum_{\mathcal{C} \in \mathcal{F}_\ell} |\mathcal{C}|$ the number of vertices in all candidate sets and by $\delta = \max_{\ell,\, \mathcal{C} \in \mathcal{F}_\ell} |\mathcal{C}|$ the cardinality of the maximum candidate set over all levels. Furthermore, I suppose that the per-level reduction ratio $r_\ell$ is a constant. I start with some basic observations:

Computing $A_0, \ldots, A_{c-1}$ is possible in $O(c k M + k^2 N + c k^3)$ operations when $V = U_k$. Each $A_{\ell-1}$ is computed once for each level. For $\ell = 1$, one needs to approximate the first $k$ eigenpairs of $L$, which can be achieved in $O(k M)$ operations using inverse iteration, as described by Vishnoi et al. (2013). For consecutive levels, forming matrix $B_{\ell-1}^\top L_{\ell-1} B_{\ell-1}$ takes $O(M_{\ell-1} k + N_{\ell-1} k^2)$ operations, whereas computing the pseudo-square-root $(B_{\ell-1}^\top L_{\ell-1} B_{\ell-1})^{+1/2}$ is possible in $O(k^3)$ operations. Summed up, the costs for all levels amount to $O(k \sum_{\ell=1}^{c} M_{\ell-1} + k^2 \sum_{\ell=2}^{c} N_{\ell-1} + c k^3) = O(c k M + k^2 N + c k^3)$, where I used the observation that $\sum_{\ell=2}^{c} N_{\ell-1} = O(N)$.

At each level, the cost function is evaluated at most $\Phi_\ell$ times. One starts by computing the cost of each candidate set in $\mathcal{F}_\ell$. Moreover, every $\mathcal{C}$ added to $\mathcal{P}_\ell$ causes the pruning of at most $\sum_{v_i \in \mathcal{C}} (\varphi_i - 1)$ other sets, where $\varphi_i$ is the number of candidate sets that include $v_i$. Since $\mathcal{P}_\ell$ is a partitioning of $\mathcal{V}_{\ell-1}$, at most
$$\sum_{\mathcal{C} \in \mathcal{P}_\ell} \sum_{v_i \in \mathcal{C}} (\varphi_i - 1) \;\leq\; \sum_{v_i \in \mathcal{V}_{\ell-1}} \varphi_i - |\mathcal{F}_\ell| = \Phi_\ell - |\mathcal{F}_\ell|$$
cost re-evaluations are needed.

Given $A_{\ell-1}$, each call of $\mathrm{cost}_\ell(\mathcal{C})$ requires $O(\min\{k^2 \delta + k \delta^2,\; k \delta^2 + \delta^3\})$ operations. The involved matrices themselves can easily be formed since, excluding all-zero rows and columns, both $L_\mathcal{C}$ and $\Pi^\perp_\mathcal{C}$ are $|\mathcal{C}| \times |\mathcal{C}|$ matrices, and one can safely restrict $A_{\ell-1}$ to be of size $|\mathcal{C}| \times k$ by deleting all rows that would have been multiplied by zero. Now, by definition, the incidence matrix $S_\mathcal{C}$ of $L_\mathcal{C}$ has at most $\delta$ columns and $2\delta$ rows (since one can bundle all boundary weights of a vertex in $\mathcal{C}$ in a single row). Depending on the relative size of $k$ and $\delta$, the computation can be performed in two ways: either one forms the $k \times k$ matrix $A_{\ell-1}^\top \Pi^\perp_\mathcal{C} L_\mathcal{C} \Pi^\perp_\mathcal{C} A_{\ell-1}$ and approximates its spectral norm, paying a total of $O(k^2 \delta + k \delta^2)$; otherwise, the $2\delta \times 2\delta$ matrix $S_\mathcal{C} \Pi^\perp_\mathcal{C} A_{\ell-1} A_{\ell-1}^\top \Pi^\perp_\mathcal{C} S_\mathcal{C}^\top$ is formed and its norm is computed at a combined cost of $O(\delta^2 k + \delta^3)$.

Maintaining $\mathcal{F}_\ell$ sorted incurs $O(\Phi_\ell \log |\mathcal{F}_\ell|)$ cost. Sorting $\mathcal{F}_\ell$ during initialization entails $O(|\mathcal{F}_\ell| \log |\mathcal{F}_\ell|)$ operations. Inserting each $\mathcal{C}$ into $\mathcal{F}_\ell$ (see step 12) can be done in $O(\log |\mathcal{F}_\ell|)$ and, moreover, by the same argument used to bound the number of cost evaluations, at most $\Phi_\ell - |\mathcal{F}_\ell|$ such insertions can happen.

Other operations carry negligible cost. In particular, by implementing marked as a binary array, checking whether a vertex is marked or not can be done in constant time.

Overall, using Algorithm 2 and for $R = \mathcal{U}_k$, one can coarsen a graph in
$$O\Big( c k M + k^2 N + c k^3 + \sum_{\ell=1}^{c} \Phi_\ell \big( \min\{k^2 \delta + k \delta^2,\; k \delta^2 + \delta^3\} + \log |\mathcal{F}_\ell| \big) \Big)$$
time, where the asymptotic notation hides poly-logarithmic factors.

References

Noga Alon. Eigenvalues and expanders. Combinatorica, 6(2):83–96, 1986.

Noga Alon and Vitali D. Milman. λ1, isoperimetric inequalities for graphs, and superconcentrators. Journal of Combinatorial Theory, Series B, 38(1):73–88, 1985.

Lynton Ardizzone, Jakob Kruse, Sebastian Wirkert, Daniel Rahner, Eric W. Pellegrini, Ralf S. Klessen, Lena Maier-Hein, Carsten Rother, and Ullrich Köthe. Analyzing inverse problems with invertible neural networks. arXiv preprint arXiv:1808.04730, 2018.

Joshua Batson, Daniel A. Spielman, Nikhil Srivastava, and Shang-Hua Teng. Spectral sparsification of graphs: Theory and algorithms. Communications of the ACM, 56(8):87–94, 2013.

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
Christos Boutsidis, Prabhanjan Kambadur, and Alex Gittens. Spectral clustering via the power method – provably. In International Conference on Machine Learning, pages 40–48, 2015.

M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2014.

Daniele Calandriello, Ioannis Koutis, Alessandro Lazaric, and Michal Valko. Improved large-scale graph learning through ridge spectral sparsification. In International Conference on Machine Learning, 2018.

Guantao Chen, George Davis, Frank Hall, Zhongshan Li, Kinnari Patel, and Michael Stewart. An interlacing result on normalized Laplacians. SIAM Journal on Discrete Mathematics, 18(2):353–361, 2004.

Jie Chen and Ilya Safro. Algebraic distance on graphs. SIAM Journal on Scientific Computing, 33(6):3468–3490, 2011.

Fan R. K. Chung. Spectral Graph Theory. Number 92. American Mathematical Society, 1997.

Charles Colley, Junyuan Lin, Xiaozhe Hu, and Shuchin Aeron. Algebraic multigrid for least squares problems on graphs with applications to HodgeRank. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International, pages 627–636. IEEE, 2017.

Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.

Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11), 2007.

Florian Dörfler and Francesco Bullo. Kron reduction of graphs with applications to electrical networks. IEEE Transactions on Circuits and Systems, 60(1):150–163, 2013.

Yuguang Fang, Kenneth A. Loparo, and Xiangbo Feng. Inequalities for the trace of matrix product. IEEE Transactions on Automatic Control, 39(12):2489–2490, 1994.

Zvi Galil. Efficient algorithms for finding maximum matching in graphs. ACM Computing Surveys (CSUR), 18(1):23–38, 1986.

Shivam Gandhi. Improvement of the cascadic multigrid algorithm with a Gauss-Seidel smoother to efficiently compute the Fiedler vector of a graph Laplacian. arXiv preprint arXiv:1602.04386, 2016.

Matan Gavish, Boaz Nadler, and Ronald R. Coifman. Multiscale wavelets on trees, graphs and high dimensional data: Theory and applications to semi-supervised learning. In ICML, pages 367–374, 2010.

David Gleich. MatlabBGL: A Matlab graph library. Institute for Computational and Mathematical Engineering, Stanford University, 2008.

Francesco Grassi, Andreas Loukas, Nathanaël Perraudin, and Benjamin Ricaud. A time-vertex signal processing framework: Scalable processing and meaningful representations for time-series on graphs. IEEE Transactions on Signal Processing, 66(3):817–829, 2017.

Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
Bruce Hendrickson and Robert W. Leland. A multi-level algorithm for partitioning graphs. SC, 95(28):1–14, 1995.

Anil N. Hirani, Kaushik Kalyanaraman, and Seth Watts. Graph Laplacians and least squares on graphs. In Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International, pages 812–821. IEEE, 2015.

Yifan Hu. Efficient, high-quality force-directed graph drawing. Mathematica Journal, 10(1):37–71, 2005.

Elvin Isufi, Andreas Loukas, Nathanaël Perraudin, and Geert Leus. Forecasting time series with VARMA recursions on graphs. arXiv preprint arXiv:1810.08581, 2018.

Hawoong Jeong, Sean P. Mason, A.-L. Barabási, and Zoltan N. Oltvai. Lethality and centrality in protein networks. Nature, 411(6833):41, 2001.

David R. Karger. Random sampling in cut, flow, and network design problems. Mathematics of Operations Research, 24(2):383–413, 1999.

George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.

David Harel and Yehuda Koren. A fast multi-scale method for drawing large graphs. Journal of Graph Algorithms and Applications, 6(3):179–202, 2002.

I. Koutis, G. L. Miller, and R. Peng. Approaching optimality for solving SDD linear systems. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 235–244, 2010.

Ioannis Koutis, Gary L. Miller, and David Tolliver. Combinatorial preconditioners and multilevel solvers for problems in computer vision and image processing. Computer Vision and Image Understanding, 115(12):1638–1646, 2011.

Dan Kushnir, Meirav Galun, and Achi Brandt. Fast multiscale clustering and manifold identification. Pattern Recognition, 39(10):1876–1891, 2006.

Stephane Lafon and Ann B. Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1393–1403, 2006.

James R. Lee, Shayan Oveis Gharan, and Luca Trevisan. Multiway spectral partitioning and higher-order Cheeger inequalities. Journal of the ACM (JACM), 61(6):37, 2014.

Ren-Cang Li. Relative perturbation theory: (II) eigenspace variations. Technical report, 1994.

Jiongqian Liang, Saket Gurukar, and Srinivasan Parthasarathy. MILE: A multi-level framework for scalable graph embedding. arXiv preprint arXiv:1802.09612, 2018.

Oren E. Livne and Achi Brandt. Lean algebraic multigrid (LAMG): Fast graph Laplacian linear solver. SIAM Journal on Scientific Computing, 34(4):B499–B522, 2012.

Andreas Loukas and Nathanaël Perraudin. Stationary time-vertex signal processing. arXiv preprint arXiv:1611.00255, 2016.

Andreas Loukas and Pierre Vandergheynst. Spectrally approximating large graphs with smaller graphs. In International Conference on Machine Learning (ICML), 2018.

Lionel Martin, Andreas Loukas, and Pierre Vandergheynst. Fast approximate spectral clustering for dynamic networks. In International Conference on Machine Learning (ICML), 2018.

Ankur Moitra. Vertex sparsification and universal rounding algorithms. PhD thesis, Massachusetts Institute of Technology, 2011.

Ami Paz and Gregory Schwartzman. A (2+ϵ)-approximation for maximum weight matching in the semi-streaming model. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2153–2161. SIAM, 2017.

David Peleg and Alejandro A. Schäffer. Graph spanners. Journal of Graph Theory, 13(1):99–116, 1989.

Robert Preis and Ralf Diekmann.
PARTY – a software library for graph partitioning. In Advances in Computational Mechanics with Parallel and Distributed Processing, pages 63–71, 1997.

Gilles Puy, Nicolas Tremblay, Rémi Gribonval, and Pierre Vandergheynst. Random sampling of bandlimited signals on graphs. Applied and Computational Harmonic Analysis, 44(2):446–475, 2018.

Dorit Ron, Ilya Safro, and Achi Brandt. Relaxation-based coarsening and multiscale graph organization. Multiscale Modeling & Simulation, 9(1):407–423, 2011.

Ilya Safro, Peter Sanders, and Christian Schulz. Advanced coarsening schemes for graph partitioning. Journal of Experimental Algorithmics (JEA), 19:2.2, 2015.

Nauman Shahid, Nathanaël Perraudin, Vassilis Kalofolias, Gilles Puy, and Pierre Vandergheynst. Fast robust PCA on graphs. IEEE Journal of Selected Topics in Signal Processing, 10(4):740–756, 2016.

David I. Shuman, Mohammad Javad Faraji, and Pierre Vandergheynst. A multiscale pyramid transform for graph signals. IEEE Transactions on Signal Processing, 64(8):2119–2134, 2016.

Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011.

Daniel A. Spielman and Shang-Hua Teng. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4):981–1025, 2011.

Gilbert W. Stewart. Matrix Perturbation Theory. 1990.

Nicolas Tremblay and Andreas Loukas. Approximating spectral clustering via sampling: a review. CoRR, abs/1901.10204, 2019.

Greg Turk and Marc Levoy. Zippered polygon meshes from range images. In Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques, pages 311–318. ACM, 1994.

John C. Urschel, Xiaozhe Hu, Jinchao Xu, and Ludmil T. Zikatanov. A cascadic multigrid algorithm for computing the Fiedler vector of graph Laplacians. arXiv preprint arXiv:1412.0565, 2014.

Nisheeth K. Vishnoi et al. Lx = b. Foundations and Trends in Theoretical Computer Science, 8(1–2):1–141, 2013.

Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

Chris Walshaw. A multilevel algorithm for force-directed graph drawing. In International Symposium on Graph Drawing, pages 171–182. Springer, 2000.

Chris Walshaw. A multilevel algorithm for force-directed graph-drawing. Journal of Graph Algorithms and Applications, 7(3):253–285, 2006.

Lu Wang, Yanghua Xiao, Bin Shao, and Haixun Wang. How to partition a billion-node graph. In Data Engineering (ICDE), 2014 IEEE 30th International Conference on, pages 568–579. IEEE, 2014.

Shusen Wang and Zhihua Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. The Journal of Machine Learning Research, 14(1):2729–2769, 2013.

Richard C. Wilson and Ping Zhu. A study of graph spectra for comparing graphs and trees. Pattern Recognition, 41(9):2833–2841, 2008.

Yi Yu, Tengyao Wang, and Richard J. Samworth. A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323, 2014.