# featured_graph_coarsening_with_similarity_guarantees__490849b7.pdf Featured Graph Coarsening with Similarity Guarantees Manoj Kumar 1 Anurag Sharma 2 Shashwat Saxena 1 Sandeep Kumar 1 3 4 Graph coarsening is a dimensionality reduction technique that aims to learn a smaller-tractable graph while preserving the properties of the original input graph. However, many real-world graphs also have features or contexts associated with each node. The existing graph coarsening methods do not consider the node features and rely solely on a graph matrix(e.g., adjacency and Laplacian) to coarsen graphs. However, some recent deep learning-based graph coarsening methods are designed for specific tasks considering both node features and graph matrix. In this paper, we introduce a novel optimization-based framework for graph coarsening that takes both the graph matrix and the node features as the input and jointly learns the coarsened graph matrix and the coarsened feature matrix while ensuring desired properties. To the best of our knowledge, this is the first work that guarantees that the learned coarsened graph is ϵ [0, 1) similar to the original graph. Extensive experiments with both real and synthetic benchmark datasets elucidate the proposed framework s efficacy and applicability for numerous graph-based applications, including graph clustering, node classification, stochastic block model identification, and graph summarization. 1. Introduction Graph-based approaches with big data and machine learning are one of the strongest driving forces of the current research frontiers, creating new possibilities in various do- 1Department of Electrical Engineering, IIT Delhi, New Delhi, India 2Department of Mathematics and Computing, IIT Delhi, New Delhi, India 3Bharti School Of Telecommunications Technology Management, IIT Delhi, New Delhi, India 4Yardi School of Artificial Intelligence, IIT Delhi, New Delhi, India. Correspondence to: Manoj Kumar . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). mains, from social networks to drug discovery and from finance to material science studies. Large-scale graphs are becoming increasingly common, which is exciting since more data implies more knowledge and more training sets for learning algorithms. However, the graph data size is the real bottleneck; handling large graph data involves considerable computational hurdles to process, extract, and analyze. Therefore, graph dimensionality reduction techniques are needed. Graph coarsening or graph summarization is a promising direction for scaling up graph-based machine-learning approaches by simplifying large graphs. Coarsening aims to summarize a very large graph into a smaller and tractable graph while preserving the properties of the originally given graph. The core idea of coarsening comes from the algebraic multi-grid literature (Ruge & Stüben, 1987). Coarsening methods have been applied in various applications like graph partitioning (Hendrickson et al., 1995; Karypis & Kumar, 1998; Kushnir et al., 2006; Dhillon et al., 2007), graph summarization (Liu et al., 2018), machine learning (Lafon & Lee, 2006; Gavish et al., 2010; Shuman et al., 2015), and scientific computing(Chen et al., 2022; Hackbusch, 2013; Ruge & Stüben, 1987; Briggs et al., 2000). The recent work in (Loukas, 2019) developed a set of frameworks for graph matrix coarsening, preserving spectral and cut guarantees, but do not take node features into account. 
There are some recent deep learning-based approaches that do take node features into account (Cai et al., 2021; Ying et al., 2018; Ma & Chen, 2020). Furthermore, many real-world graphs also have features associated with each graph, e.g., node feature or edge feature (Kipf & Welling, 2017; Zügner & Günnemann, 2019; Wang et al., 2019). Existing graph coarsening methods are not designed to consider features of nodes and rely solely on structural information e.g., adjacency matrix to simplify graphs that may not be suitable for downstream tasks that require node features. Furthermore, these methods also lack a formal guarantee on the properties of the original graph and features being preserved in the coarsened graph and coarsened features. However, there are some recent deep learning-based graph coarsening technique that considers feature matrix and Laplacian matrix both designed for a particular task only, e.g., optimal transport coarsening(Ma & Chen, 2021) for graph classification, GCOND (Jin et al., Featured Graph Coarsening with Similarity Guarantees 2021), SCAL (Huang et al., 2021) for node classification, Dos Cond(Jin et al., 2022) for node and graph classification both. We introduce a novel optimization-based framework lying at the unification of graph learning (Kumar et al., 2020; 2019) and dimensionality reduction (Qiu et al., 2017; Zhu et al., 2017) for coarsening graph data, named as featured graph coarsening (FGC). It takes both the graph matrix and the node features as the input and learns the coarsened graph matrix and the coarsened feature matrix jointly while ensuring desired properties. The proposed optimization formulation is a multi-block non-convex optimization problem, which is solved efficiently by leveraging block majorizationminimization, log determinant, Dirichlet energy, and regularization frameworks. The developed algorithm is provably convergent and enforces the desired properties in the learned coarsened graph. To the best of our knowledge, the proposed method is the first coarsening method that guarantees that the original graph and coarsened graph are ϵ-similar for ϵ [0, 1). Extensive experiments elucidate the efficacy of the proposed framework for real-world applications. Furthermore, we have also performed the downstream task i.e. node classification on small and large datasets both to elucidate the efficacy of graph coarsening. We have trained graph neural network (GNN) using the coarsened graph and assign the label of original graph. We have compared the node classification accuracy and computational time required to perform coarsening and classification against the most recent state of the art methods like GCOND (Jin et al., 2021), SCAL (Huang et al., 2021). Remark: It is important here to highlight the distinction between (i) Clustering (Ng et al., 2001; Dhillon et al., 2007) (ii) Community detection (Fortunato, 2010), and (iii) Graph Coarsening methods (Loukas & Vandergheynst, 2018). Given a set of data points, the clustering and community detection algorithm aim to segregate groups with similar traits and assign them into clusters. For community detection, the data points are nodes of a given network. But these methods do not answer how these groups are related to each other. On the other hand coarsening segregates groups with similar traits and assigns them into supernodes, in addition, it also establishes how these supernodes are related to each other. 
It learns the graph over the super-nodes, the edge weights, and finally the effective feature of each super-node. The scope of coarsening is therefore wider than that of the aforementioned methods.

Notation
In terms of notation, lower-case (bold) letters denote scalars (vectors) and upper-case letters denote matrices. The $(i, j)$-th entry of a matrix $X$ is denoted by $X_{ij}$. $X^{\dagger}$ and $X^{T}$ denote the pseudo-inverse and the transpose of the matrix $X$, respectively. $X_i$ and $[X^T]_j$ denote the $i$-th column and the $j$-th row of $X$. $\langle X_i, X_j \rangle = X_i^T X_j$ denotes the inner product of two vectors. The all-zero and all-one vectors or matrices of appropriate sizes are denoted by $\mathbf{0}$ and $\mathbf{1}$, respectively. $\|X\|_1$, $\|X\|_F$, and $\|X\|_{1,2}$ denote the $\ell_1$-norm, the Frobenius norm, and the $\ell_{1,2}$-norm of $X$, respectively. $\det(X)$ is the generalized determinant of a positive definite matrix $X$.

2. Background
In this section, we review the basics of graphs and graph coarsening, the spectral similarity of graph matrices, and the $\epsilon$-similarity of graph matrices and feature matrices.

A graph with features is denoted by $G = (V, E, W, X)$, where $V = \{v_1, v_2, \ldots, v_p\}$ is the vertex set, $E \subseteq V \times V$ is the edge set, and $W$ is the adjacency (weight) matrix. We consider a simple undirected graph without self-loops: $W_{ij} > 0$ if $(i, j) \in E$ and $W_{ij} = 0$ if $(i, j) \notin E$. Finally, $X = [x_1, x_2, \ldots, x_p]^T \in \mathbb{R}^{p \times n}$ is the feature matrix, where each row vector $x_i \in \mathbb{R}^n$ is the feature vector associated with one of the $p$ nodes of $G$. Thus, each of the $n$ columns of $X$ can be seen as a signal on the same graph.

Graphs are conveniently represented by matrices, such as the Laplacian and adjacency matrices, whose positive entries correspond to edges in the graph. A matrix $L \in \mathbb{R}^{p \times p}$ is a combinatorial graph Laplacian matrix if it belongs to the following set:
$$\mathcal{S}_L = \Big\{ L \;\Big|\; L_{ij} = L_{ji} \le 0 \ \text{for} \ i \ne j, \ L_{ii} = -\sum_{j \ne i} L_{ij} \Big\}. \qquad (1)$$
$W$ and $L$ are related as $W_{ij} = -L_{ij}$ for $i \ne j$ and $W_{ij} = 0$ for $i = j$. Both $L$ and $W$ represent the same graph; however, they have very different mathematical properties. The Laplacian matrix $L$ is a symmetric, positive semi-definite matrix with zero row sums. Its non-zero off-diagonal entries encode positive edge weights as $-L_{ij}$, and $L_{ij} = 0$ implies no connectivity between vertices $i$ and $j$. The graph Laplacian is well recognized as a tool for embedding, manifold learning, spectral sparsification, clustering, and semi-supervised learning. Owing to these properties, the Laplacian representation is more desirable for building graph-based algorithms.

2.2. Graph Coarsening
Given an original graph $G = (V, E, W, X)$ with $p$ nodes, the goal of graph coarsening is to construct an appropriate "smaller" or coarsened graph $G_c = (\widetilde V, \widetilde E, \widetilde W, \widetilde X)$ with $k \ll p$ nodes, such that $G_c$ and $G$ are similar in some sense. Every node $\widetilde v_j \in \widetilde V$, $j = 1, 2, \ldots, k$, of the smaller graph is termed a "super-node" with reference to the nodes of the larger graph. In coarsening, we define a linear mapping $\pi : V \to \widetilde V$ that maps a set of nodes of $G$ having similar properties to a super-node of $G_c$, i.e., for any super-node $\widetilde v \in \widetilde V$, all nodes in $\pi^{-1}(\widetilde v) \subseteq V$ have similar properties. Furthermore, the features of a super-node $\widetilde v$ should be based on the features of the nodes $\pi^{-1}(\widetilde v) \subseteq V$ in $G$, and the edge weights $\widetilde W$ of the coarse graph should depend on the original graph's weights as well as the coarsened graph's features. Let $P \in \mathbb{R}^{k \times p}_{+}$ be the coarsening matrix, a linear map realizing $\pi : V \to \widetilde V$ such that $\widetilde X = PX$.
Each non-zero entry of $P$, i.e., $[P]_{ij} > 0$, indicates that the $j$-th node of $G$ is mapped to the $i$-th super-node of $G_c$. For example, the non-zero elements of the $j$-th row $p_j$ correspond to the node set $\pi^{-1}(\widetilde v_j) \subseteq V$. The rows of $P$ are pairwise orthogonal when every node in $V$ is mapped to only a single super-node in $\widetilde V$, which means that the grouping induced by the super-nodes is disjoint. Let the Laplacian matrices of $G$ and $G_c$ be $L \in \mathbb{R}^{p \times p}$ and $L_c \in \mathbb{R}^{k \times k}$, respectively. The Laplacian matrices $L, L_c$, the feature matrices $X, \widetilde X$, and the coarsening matrix $P$ together satisfy the following properties (Loukas, 2019):
$$L_c = C^T L C, \qquad \widetilde X = PX, \qquad X = P^{\dagger} \widetilde X = C \widetilde X, \qquad (2)$$
where $C \in \mathbb{R}^{p \times k}$ is the tall matrix that is the pseudo-inverse of $P$, known as the loading matrix. A non-zero element of $C$, i.e., $C_{ij} > 0$, implies that the $i$-th node of $G$ is mapped to the $j$-th super-node of $G_c$. The loading matrix $C \in \mathbb{R}^{p \times k}_{+}$ belongs to the following set:
$$\mathcal{C} = \Big\{ C \ge 0 \;\Big|\; \langle C_i, C_j \rangle = 0 \ \forall\, i \ne j, \ \langle C_i, C_i \rangle = d_i, \ \|C_i\|_0 \ge 1, \ \text{and} \ \|[C^T]_i\|_0 = 1 \Big\}, \qquad (3)$$
where $C_i$ and $C_j$ denote the $i$-th and $j$-th columns of $C$, which are orthogonal to each other, $[C^T]_i$ denotes the $i$-th row of $C$, and the $d_i$ are positive quantities. The matrix $C$ has $p$ rows and $k$ columns. Moreover, each row of $C$ contains exactly one non-zero entry, equal to 1, which implies $C\,\mathbf{1}_k = \mathbf{1}_p$, where $\mathbf{1}_k$ and $\mathbf{1}_p$ are the all-ones vectors of size $k$ and $p$, respectively. It also implies that $C^T C$ is a diagonal matrix with $d_i$ as its $i$-th diagonal element.

Figure 1: Toy example of featured graph coarsening: (a) original graph $G$, (b) coarsened graph $G_c$. In the toy example, the nodes $(v_1, v_2, v_3)$ of $G$ are mapped into the super-node $\widetilde v_1$ of $G_c$. The coarsening matrix $P$ and the loading matrix $C$ are
$$P = \begin{bmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} \quad \text{and} \quad C = P^{\dagger} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$
From the coarsened dimension $k \times k$ one can go back to the original dimension $p \times p$ by computing the lifted Laplacian matrix, defined as $L_l = P^T L_c P$ (Loukas & Vandergheynst, 2018).

2.3. Preserving properties of G in Gc
The coarsened graph $G_c(L_c, \widetilde X)$ should be learned such that the properties of $G$ and $G_c$ are similar. Some widely used notions of similarity are:

Definition 1 (Spectral similarity (Loukas & Vandergheynst, 2018; Loukas, 2019)). Spectral similarity is quantified by the relative eigen error (REE), defined as $\mathrm{REE} = \frac{1}{m}\sum_{i=1}^{m} \frac{|\widetilde\lambda_i - \lambda_i|}{\lambda_i}$, where $\lambda_i$ and $\widetilde\lambda_i$ are the top $m$ eigenvalues of the original graph Laplacian $L$ and the coarsened graph Laplacian $L_c$, respectively, and $m$ is the number of eigenvalues considered.

Definition 2 (Reconstruction error (Liu et al., 2018)). Let $L$ be the original Laplacian matrix and $L_l$ the lifted Laplacian matrix; the reconstruction error (RE) is defined as $\mathrm{RE} = \|L - L_l\|_F^2$.

Definition 3 (Hyperbolic error (Bravo Hermsdorff & Gunderson, 2019)). The hyperbolic error between the original Laplacian $L$ and the lifted Laplacian $L_l$ is defined as
$$\mathrm{HE} = \operatorname{arccosh}\!\left( 1 + \frac{\|(L - L_l)X\|_F^2 \, \|X\|_F^2}{2\, \operatorname{tr}(X^T L X)\, \operatorname{tr}(X^T L_l X)} \right).$$

The REE value indicates how well the eigen-properties of the original graph $G$ are preserved in the coarsened graph $G_c$. The RE and HE values indicate how well the original graph can be recovered from the coarsened graph. For a good coarsening algorithm, lower values of these quantities are desired.
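To make the mappings in (2) and the error measures in Definitions 1-3 concrete, the sketch below builds a small weighted graph, uses the loading matrix $C$ of the toy example above, and evaluates $L_c$, $\widetilde X$, the lifted Laplacian $L_l$, REE, and RE with NumPy. The 5-node path graph, the random features, and the choice $m = 3$ are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Toy original graph: a 5-node weighted path (assumed purely for illustration).
W = np.zeros((5, 5))
for i, j, w in [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 2.0), (3, 4, 2.0)]:
    W[i, j] = W[j, i] = w
L = np.diag(W.sum(axis=1)) - W                      # combinatorial Laplacian, L in S_L
X = np.random.default_rng(0).normal(size=(5, 3))    # node features, one row per node

# Loading matrix C of the toy example: nodes {1,2,3} -> super-node 1, node 4 -> 2, node 5 -> 3.
C = np.array([[1, 0, 0],
              [1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)
P = np.linalg.pinv(C)                               # coarsening matrix, P = C^dagger as in (2)

Lc = C.T @ L @ C                                    # coarsened Laplacian
Xc = P @ X                                          # coarsened features, X_tilde = P X
Ll = P.T @ Lc @ P                                   # lifted Laplacian

# Definition 1: relative eigen error on the top-m eigenvalues (m = 3 here).
m = 3
lam = np.sort(np.linalg.eigvalsh(L))[::-1][:m]
lam_c = np.sort(np.linalg.eigvalsh(Lc))[::-1][:m]
ree = np.mean(np.abs(lam_c - lam) / lam)

# Definition 2: reconstruction error between L and the lifted Laplacian.
re = np.linalg.norm(L - Ll, "fro") ** 2

print("row sums of Lc (should be ~0):", Lc.sum(axis=1))
print("REE:", ree, " RE:", re)
```

Because $C$ has orthogonal columns and a single non-zero entry per row, its pseudo-inverse is exactly the averaging matrix $P$ displayed in the toy example above.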
3. Featured Graph Coarsening (FGC)
The existing graph coarsening or summarization methods are not designed to take node features into account and rely solely on the graph matrix for learning a simpler graph (Loukas & Vandergheynst, 2018; Loukas, 2019; Bravo Hermsdorff & Gunderson, 2019; Purohit et al., 2014; Chen et al., 2022; Hajiabadi et al., 2021; Riondato et al., 2017; Kang et al., 2022; Ko et al., 2020), and are thus not suitable for graph machine learning applications. For example, many real-world graph data satisfy certain properties, e.g., the homophily assumption and smoothness (Wang et al., 2021; Kalofolias, 2016): if two nodes are connected with stronger weights, then the features of these nodes should be similar. Thus, if the original graph satisfies such a property, that property should carry over to the coarsened graph data. Current methods can only ensure spectral properties, which concern the graph matrix but not the features (Loukas & Vandergheynst, 2018; Loukas, 2019; Chen et al., 2022).

The preceding discussion suggests that a graph coarsening method (i) should jointly consider both the graph matrix $L$ and the node features $X$ of the original graph, and (ii) to ensure the desired properties on the coarsened graph data, such as smoothness and homophily, should learn $L_c$ and $\widetilde X$ jointly, each depending on the other. We approach this problem at the unification of dimensionality reduction (Qiu et al., 2017; Zhu et al., 2017) and graph learning (Kumar et al., 2020; 2019), solving the two problems jointly: reducing the dimensionality and learning a suitable graph on the reduced-dimensional data. We propose a unique optimization-based framework that uses both the features $X$ and the Laplacian matrix $L$ of the original graph to learn the loading matrix $C$ and the coarsened graph's features $\widetilde X$ jointly. Next, we briefly discuss how to learn graphs with features, and then we propose our formulation.

3.1. Graph learning from data
Let $X = [x_1, \ldots, x_p]^T$, where $x_i$ is the $n$-dimensional feature vector associated with the $i$-th node of an undirected graph. In the context of modeling signals or features with graphs, the widely used assumption is that the signal residing on the graph changes smoothly between connected nodes (Kalofolias, 2016). The Dirichlet energy (DE) quantifies the smoothness of graph signals and is defined via the graph Laplacian matrix $L \in \mathcal{S}_L$ and the feature vectors as
$$\mathrm{DE}(L, X) = \operatorname{tr}(X^T L X) = \sum_{i < j} -L_{ij}\, \|x_i - x_j\|^2. \qquad (4)$$
A lower value of the Dirichlet energy indicates a more desirable configuration (Wang et al., 2021). Smooth graph signal methods are an extremely popular family of approaches for a variety of applications in machine learning and related domains (Dong et al., 2016). When only the feature matrix $X = [x_1, \ldots, x_p]^T$ associated with an undirected graph is given, a suitable graph satisfying the smoothness property can be obtained by solving the following optimization problem:
$$\min_{L \in \mathcal{S}_L} \; -\gamma \log\det(L + J) + \operatorname{tr}(X^T L X) + \alpha h(L), \qquad (5)$$
where $L \in \mathbb{R}^{p \times p}$ denotes the desired graph matrix, $\mathcal{S}_L$ is the set of Laplacian matrices (1), $h(\cdot)$ is the regularization term, $\alpha > 0$ is the regularization parameter, and $J = \frac{1}{p}\mathbf{1}_{p \times p}$ is a constant matrix whose every element equals $\frac{1}{p}$. The rank of $L$ is $p - 1$ for a connected graph with $p$ nodes (Chung & Graham, 1997); adding $J$ to $L$ makes $L + J$ a full-rank matrix without altering the row and column space of $L$ (Kumar et al., 2020; Kalofolias, 2016).
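As a quick sanity check of the smoothness measure in (4), the snippet below evaluates the Dirichlet energy both in its trace form $\operatorname{tr}(X^T L X)$ and in the equivalent pairwise form $\sum_{i<j} W_{ij}\|x_i - x_j\|^2$; the random weight matrix and features are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random symmetric weight matrix with zero diagonal (assumed toy data).
p, n = 6, 4
W = rng.uniform(0.0, 1.0, size=(p, p))
W = np.triu(W, 1)
W = W + W.T
L = np.diag(W.sum(axis=1)) - W        # Laplacian in S_L
X = rng.normal(size=(p, n))           # one feature row per node

# Dirichlet energy, trace form of eq. (4).
de_trace = np.trace(X.T @ L @ X)

# Equivalent pairwise form: sum_{i<j} W_ij * ||x_i - x_j||^2.
de_pairs = sum(W[i, j] * np.sum((X[i] - X[j]) ** 2)
               for i in range(p) for j in range(i + 1, p))

print(de_trace, de_pairs)             # the two values agree up to round-off
```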
When the data is Gaussian distributed, $X \sim \mathcal{N}(0, L^{\dagger})$, the optimization in (5) also corresponds, for $\gamma = 1$, to the penalized maximum likelihood estimation of the inverse covariance (precision) matrix, also known as a Gaussian Markov random field (GMRF) (Ying et al., 2020). The graph $G$ inferred from $L$ and the random vector $X$ follows the Markov property: $L_{ij} \ne 0 \iff \{i, j\} \in E$ for $i \ne j$ implies that $x_i$ and $x_j$ are conditionally dependent given the rest. Furthermore, in a more general setting with a non-Gaussian distribution, (5) can be related to a log-determinant Bregman divergence regularized optimization problem, which ensures desirable properties of the learned graph matrix, e.g., connectedness and full rank.

3.2. Proposed Featured Graph Coarsening Formulation
The proposed formulation of FGC is
$$\begin{aligned} \min_{L_c,\, \widetilde X,\, C} \quad & -\gamma \log\det(L_c + J) + \operatorname{tr}(\widetilde X^T L_c \widetilde X) + \beta h(L_c) + \lambda g(C) \\ \text{s.t.} \quad & C \ge 0, \ L_c = C^T L C, \ X = C\widetilde X, \ L_c \in \mathcal{S}_L, \end{aligned} \qquad (6)$$
where $L$ and $X$ are the given Laplacian and feature matrix of a large connected graph, $\widetilde X \in \mathbb{R}^{k \times n}$ and $L_c \in \mathbb{R}^{k \times k}$ are the feature matrix and the Laplacian matrix of the learned coarsened graph, respectively, $C \in \mathbb{R}^{p \times k}$ is the loading matrix, $J = \frac{1}{k}\mathbf{1}_{k \times k}$, $h(\cdot)$ and $g(\cdot)$ are the regularization functions for $L_c$ and the loading matrix $C$, and $\beta > 0$ and $\lambda > 0$ are the regularization parameters. Fundamentally, the proposed formulation (6) aims to learn the coarsened graph matrix $L_c$, the loading matrix $C$, and the feature matrix $\widetilde X$ jointly. The constraint $X = C\widetilde X$ coarsens the feature matrix of the larger graph, $X \in \mathbb{R}^{p \times n}$, into the smaller graph's feature matrix $\widetilde X \in \mathbb{R}^{k \times n}$. The first two terms of the objective are the graph-learning terms: the $\log\det(\cdot)$ term ensures that the coarsened graph is connected, while the second term imposes the smoothness property on the coarsened graph; the third and fourth terms act as regularizers. The regularizer $g(C)$ ensures a mapping in which no node $v_i \in V$ is mapped to two different super-nodes $\widetilde v_j, \widetilde v_k \in \widetilde V$, and in which the mapping of nodes to super-nodes is balanced, so that not all or most nodes are mapped to the same super-node. This simply means that each row of $C$ should have only one non-zero element and the columns of $C$ should be sparse. An $\ell_{1,2}$-based group penalty is suggested to enforce such structure (Yuan & Lin, 2006; Ming et al., 2019).

3.3. Featured Graph Coarsening (FGC) Algorithm
Next, by using $L_c = C^T L C$ and introducing a quadratic penalty for $X = C\widetilde X$, we aim to solve the following optimization problem:
$$\min_{\widetilde X,\, C \in \mathcal{S}_C} \; -\gamma \log\det(C^T L C + J) + \operatorname{tr}(\widetilde X^T C^T L C \widetilde X) + \frac{\alpha}{2}\|C\widetilde X - X\|_F^2 + \frac{\lambda}{2}\|C^T\|_{1,2}^2, \qquad (7)$$
where $\mathcal{S}_C = \big\{ C \ge 0 \,\big|\, \|[C^T]_i\|_2^2 \le 1, \ i = 1, \ldots, p \big\}$ and $\|C^T\|_{1,2}^2 = \sum_{i=1}^{p}\|[C^T]_i\|_1^2 = \sum_{i=1}^{p}\big(\sum_{j=1}^{k} C_{ij}\big)^2$ is the squared $\ell_{1,2}$-norm of $C^T$, which ensures group sparsity in the resulting $C$ matrix; $[C^T]_i$ is the $i$-th row of $C$. For a high value of $\lambda$, the columns of the loading matrix are observed to be orthogonal; more details are given in the remark in Section 3.5. The quadratic relaxation $\frac{\alpha}{2}\|C\widetilde X - X\|_F^2$ keeps $C\widetilde X$ close to $X$ instead of enforcing the constraint exactly. Note that this relaxation can be made tight by using a sufficiently large or an iteratively increasing $\alpha$.
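For reference, the relaxed objective in (7) can be evaluated for a candidate pair $(C, \widetilde X)$ as in the sketch below; the dense NumPy implementation and the helper name `fgc_objective` are our own illustrative choices rather than code from the paper.

```python
import numpy as np

def fgc_objective(C, X_t, L, X, gamma, alpha, lam):
    """Value of the relaxed FGC objective (7) for a candidate (C, X_t).

    C   : (p, k) non-negative loading matrix
    X_t : (k, n) coarsened feature matrix (X tilde)
    L   : (p, p) Laplacian of the original graph
    X   : (p, n) original feature matrix
    """
    k = C.shape[1]
    J = np.ones((k, k)) / k                       # J = (1/k) 1_{k x k}
    Lc = C.T @ L @ C                              # coarsened Laplacian

    _, logdet = np.linalg.slogdet(Lc + J)         # log det of Lc + J (positive definite for a valid C)
    term_logdet = -gamma * logdet
    term_smooth = np.trace(X_t.T @ Lc @ X_t)      # Dirichlet energy on the coarsened graph
    term_fit = 0.5 * alpha * np.linalg.norm(C @ X_t - X, "fro") ** 2
    term_group = 0.5 * lam * np.sum(C.sum(axis=1) ** 2)   # (lambda/2) ||C^T||_{1,2}^2 for C >= 0

    return term_logdet + term_smooth + term_fit + term_group
```

A feasible starting point with respect to $\mathcal{S}_C$ is, for instance, a non-negative random $C$ with each row rescaled to unit $\ell_2$-norm, together with $\widetilde X = C^{\dagger}X$.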
3.3.1. SOME PROPERTIES OF THE $C^T L C$ MATRIX
Before we proceed with the algorithm development, some properties and intermediary lemmas regarding $C^T L C$ are presented below.

Lemma 1. If $L$ is the Laplacian matrix of a connected graph with $p$ nodes, and $C \in \mathbb{R}^{p \times k}_{+}$ is a loading matrix with $C \in \mathcal{C}$ as in (3), then the coarsened graph matrix $L_c = C^T L C$ is a connected-graph Laplacian matrix with $k$ nodes.

Lemma 2. Let $L$ be the Laplacian matrix of a connected graph with $p$ nodes, let $C \in \mathcal{C}$ be the loading matrix, and let $J = \frac{1}{k}\mathbf{1}_{k \times k}$ be a constant matrix whose every element equals $\frac{1}{k}$. The function $f(C) = -\gamma \log\det(C^T L C + J)$ is convex with respect to the loading matrix $C$.

We provide sketches of the proofs of Lemma 1 and Lemma 2 here; please refer to Appendices A.1 and A.2 for the detailed proofs.

Proof sketch of Lemma 1. Using the Cholesky decomposition and (3), we prove that $L_c = C^T L C$ is a symmetric positive semi-definite matrix. Also, $C u_k = u_p$ holds if and only if $u_k = t\mathbf{1}_k$ and $u_p = t\mathbf{1}_p$ for some $t \in \mathbb{R}$, i.e., $C u_k = u_p$ holds only for a constant vector $u_k$. Thus $C^T L C\, u_k = \mathbf{0}_k$ for the constant vector $u_k$. This shows that the constant vector is the only eigenvector spanning the null space of $L_c$, concluding that the rank of $C^T L C$ is $k - 1$.

Proof sketch of Lemma 2. Since $C^T L C + J$ is a symmetric positive definite matrix, it can be written as $C^T L C + J = Y^T Y$, where we establish that $Y = L^{\frac{1}{2}} C + \frac{1}{\sqrt{pk}}\mathbf{1}_{p \times k}$. The function $-\gamma \log\det(Y^T Y)$ is convex in $Y$; furthermore, $Y$ is an affine transformation of $C$, implying that $-\gamma \log\det(C^T L C + J)$ is also convex in $C$.

3.4. BSUM Framework
The resulting optimization problem in (7) is multi-block non-convex. We develop an efficient optimization method based on the block majorization-minimization framework. Suppose we have the optimization problem
$$\underset{x}{\text{minimize}} \; f(x) \quad \text{subject to} \; x \in \mathcal{X}, \qquad (8)$$
where $x = (x_1, x_2)$ with $x_i \in \mathcal{X}_i$, $\mathcal{X} = \prod_{i=1,2}\mathcal{X}_i$ is a closed convex set, and $f : \mathcal{X} \to \mathbb{R}$ is a continuous function. At the $t$-th iteration, block $x_1$ is updated in cyclic order by solving
$$\underset{x_1 \in \mathcal{X}_1}{\text{minimize}} \; g_1\big(x_1 \mid x_2^{(t)}\big),$$
where the continuous function $g_1(x_1 \mid x_2^{(t)})$ is a majorization function of $f$ at $x_2^{(t)}$ satisfying
$$g_1\big(x_1^{(t)} \mid x_2^{(t)}\big) = f\big(x_1^{(t)}, x_2^{(t)}\big), \qquad (9a)$$
$$g_1\big(x_1 \mid x_2^{(t)}\big) \ge f\big(x_1, x_2^{(t)}\big) \quad \forall\, x_1 \in \mathcal{X}_1, \qquad (9b)$$
$$g_1'\big(x_1; d_1 \mid x_2^{(t)}\big)\Big|_{x_1 = x_1^{(t)}} = f'\big((x_1^{(t)}, x_2^{(t)}); d\big), \qquad (9c)$$
for all $d = (d_1, 0)$ such that $x_1^{(t)} + d_1 \in \mathcal{X}_1$, where $f'(x; d)$ stands for the directional derivative of $f$ at $x$ along $d$ (Razaviyayn et al., 2012; Sun et al., 2017). Similarly, we can find a surrogate function $g_2(x_2 \mid x_1^{(t)})$ at the $t$-th iteration. If the surrogate functions $g_i$ are properly chosen, then the surrogate subproblems can be much easier to solve than (8) directly. Collecting the variables as $(C \in \mathbb{R}^{p \times k}_{+},\ \widetilde X \in \mathbb{R}^{k \times n})$, we develop a block MM-based algorithm that updates one variable at a time while keeping the other fixed.

3.5. Update of C
Treating $C$ as a variable and fixing $\widetilde X$, we obtain the following sub-problem for $C$:
$$\min_{C \in \mathcal{S}_C} \; f(C) = -\gamma \log\det(C^T L C + J) + \operatorname{tr}(\widetilde X^T C^T L C \widetilde X) + \frac{\alpha}{2}\|C\widetilde X - X\|_F^2 + \frac{\lambda}{2}\|C^T\|_{1,2}^2. \qquad (10)$$
The function $f(C)$ in (10) is strictly convex; this can be established from Lemmas 1 and 2 together with properties of the norm and trace operators (see Appendix A.5 for a detailed proof). The set $\mathcal{S}_C$ is closed and convex, so (10) is a strongly convex optimization problem, but it does not admit a closed-form solution. To obtain closed-form updates we will use the MM framework (Beck & Pan, 2018; Razaviyayn et al., 2012; Sun et al., 2017).
By using the first-order Taylor series approximation, a majorized function of $f(C)$ at $C^{(t)}$ can be obtained as
$$g\big(C \mid C^{(t)}\big) = f\big(C^{(t)}\big) + \big\langle C - C^{(t)}, \nabla f\big(C^{(t)}\big) \big\rangle + \frac{L}{2}\big\|C - C^{(t)}\big\|_F^2, \qquad (11)$$
where $f(C)$ has an $L$-Lipschitz continuous gradient with $L = \max(L_1, L_2, L_3, L_4)$, and $L_1, L_2, L_3, L_4$ are the Lipschitz constants of $-\gamma \log\det(C^T L C + J)$, $\operatorname{tr}(\widetilde X^T C^T L C \widetilde X)$, $\frac{\alpha}{2}\|C\widetilde X - X\|_F^2$, and $\frac{\lambda}{2}\|C^T\|_{1,2}^2$, respectively. More details are deferred to Appendix A.4. The majorized function $g(C \mid C^{(t)})$ satisfies all the required properties listed in (9a)-(9c). After ignoring the constant terms, the majorized problem of (10) is
$$\underset{C \in \mathcal{S}_C}{\text{minimize}} \; \frac{1}{2}\operatorname{tr}(C^T C) - \operatorname{tr}(C^T A), \qquad (12)$$
where $A = C^{(t)} - \frac{1}{L}\nabla f(C^{(t)})$ and
$$\nabla f\big(C^{(t)}\big) = -2\gamma L C^{(t)}\big(C^{(t)T} L C^{(t)} + J\big)^{-1} + \alpha\big(C^{(t)}\widetilde X - X\big)\widetilde X^T + 2 L C^{(t)} \widetilde X \widetilde X^T + \lambda C^{(t)} \mathbf{1}_{k \times k},$$
where $\mathbf{1}_{k \times k}$ is the all-ones matrix of dimension $k \times k$. Note that since $C \ge 0$, we have $|C_{ij}| = C_{ij}$, and hence $\|C^T\|_{1,2}^2$ is differentiable.

Lemma 3. Using the KKT optimality conditions, we can obtain the optimal solution of (12) as
$$C^{(t+1)} = \Big(C^{(t)} - \frac{1}{L}\nabla f\big(C^{(t)}\big)\Big)^{+}, \qquad (13)$$
where, for a matrix $X$, $(X_{ij})^{+} = \max\!\big(\tfrac{X_{ij}}{\|[X^T]_i\|_2},\, 0\big)$ and $[X^T]_i$ is the $i$-th row of $X$. Proof: The proof is deferred to Appendix A.3.

Remark. The update rule above learns a balanced mapping, i.e., the loading matrix $C$ belongs to the set (3). Let us begin by expanding $\|C^T\|_{1,2}^2 = \sum_{i=1}^{p}\big(\sum_{j=1}^{k} C_{ij}\big)^2 = \|C\|_F^2 + \sum_{i \ne j}\langle C_i, C_j \rangle$ for $i, j = 1, 2, \ldots, k$, and analyze the possible solutions under the constraints and the objective of problem (10). The trivial all-zero solution $C = \mathbf{0}_{p \times k}$, and any $C$ with a zero column, would make $(C^T L C + J)$ rank deficient and the $\log\det(\cdot)$ term infeasible, and is thus ruled out. Moreover, the minimization of $\|C\widetilde X - X\|_F^2 = \sum_{i=1}^{p}\|[C^T]_i \widetilde X - [X^T]_i\|_2^2$ ensures that no row of $C$ will be zero. Also, since $C \ge 0$ and $C \ne \mathbf{0}_{p \times k}$, the property of the Frobenius norm implies $\|C\|_F^2 \ne 0$; thus the only way to minimize (10) is to obtain $C \ge 0$ such that $\sum_{i \ne j}\langle C_i, C_j \rangle = 0$. This implies that the columns of the loading matrix $C$ are orthogonal to each other. Finally, the orthogonality of the columns combined with $C \ge 0$ implies that each row has only one non-zero entry. Thus, the learned $C$ matrix belongs to the set (3); a more detailed discussion is provided in (Kumar et al., 2023).

3.6. Update of $\widetilde X$
Fixing $C$, we obtain the following problem for $\widetilde X$:
$$\min_{\widetilde X} \; f(\widetilde X) = \operatorname{tr}(\widetilde X^T C^T L C \widetilde X) + \frac{\alpha}{2}\|C\widetilde X - X\|_F^2. \qquad (14)$$
As $C^T L C$ and $C^T C$ are positive semi-definite and positive definite matrices, respectively, the Hessian of $f(\widetilde X)$, i.e., $\nabla^2 f(\widetilde X) = 2 C^T L C + \alpha C^T C$, is a positive definite matrix. This implies that (14) is a strongly convex optimization problem, and we can obtain the closed-form solution by setting the gradient to zero, i.e., $2 C^T L C \widetilde X + \alpha C^T(C\widetilde X - X) = 0$, which gives
$$\widetilde X = \Big(\tfrac{2}{\alpha} C^T L C + C^T C\Big)^{-1} C^T X. \qquad (15)$$

Algorithm 1 FGC Algorithm
Input: $G(X, L)$, $\alpha$, $\gamma$, $\lambda$
$t \leftarrow 0$
while stopping criteria not met do
    Update $C^{(t+1)}$ and $\widetilde X^{(t+1)}$ as in (13) and (15), respectively.
    $t \leftarrow t + 1$
Output: $C$, $L_c$, and $\widetilde X$

Algorithm 1 summarizes the implementation of the featured graph coarsening (FGC) method. The worst-case computational complexity is $O(p^2 k)$, which is due to the matrix multiplication in the gradient of $f(C)$.

Theorem 1. The sequence $\{C^{(t)}, \widetilde X^{(t)}\}$ generated by Algorithm 1 converges to the set of Karush-Kuhn-Tucker (KKT) points of Problem (7). Proof: The proof is deferred to Appendix A.6.
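A compact NumPy sketch of Algorithm 1 is given below: it alternates the projected-gradient update (13) for $C$ with the closed-form update (15) for $\widetilde X$. The fixed step size used in place of $1/L$, the random initialization, the placeholder hyperparameter defaults, the small ridge term, and the fixed iteration count are our own simplifications (assumptions), not the paper's exact choices.

```python
import numpy as np

def fgc(L, X, k, gamma=500.0, alpha=500.0, lam=500.0, step=1e-3, iters=500, seed=0):
    """Minimal sketch of Algorithm 1 (FGC); returns (C, Lc, X_t)."""
    p, n = X.shape
    rng = np.random.default_rng(seed)
    J = np.ones((k, k)) / k

    # Feasible initialization: non-negative C with unit-norm rows, X_t = pinv(C) X.
    C = np.abs(rng.normal(size=(p, k)))
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    X_t = np.linalg.pinv(C) @ X

    for _ in range(iters):
        # Gradient of f(C) as in Section 3.5.
        Lc = C.T @ L @ C
        grad = (-2.0 * gamma * L @ C @ np.linalg.inv(Lc + J)
                + alpha * (C @ X_t - X) @ X_t.T
                + 2.0 * L @ C @ X_t @ X_t.T
                + lam * C @ np.ones((k, k)))

        # Update (13): A = C - (1/L) grad, then (A_ij)^+ = max(A_ij / ||i-th row of A||_2, 0).
        A = C - step * grad
        norms = np.linalg.norm(A, axis=1, keepdims=True)
        C = np.maximum(A / np.maximum(norms, 1e-12), 0.0)

        # Update (15): closed-form minimizer of (14), with a tiny ridge for numerical safety.
        M = (2.0 / alpha) * (C.T @ L @ C) + C.T @ C + 1e-12 * np.eye(k)
        X_t = np.linalg.solve(M, C.T @ X)

    return C, C.T @ L @ C, X_t
```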
3.6.1. SIMILARITY GUARANTEE
Lemma 4. Let $\widetilde X^{*}$ be the minimizer of problem (14), let $L$ and $L_c$ be the graph matrices of the original and coarsened graphs, respectively, and let $X$ be the feature matrix of the original graph. Then the two norms $\|\widetilde X^{*}\|_{L_c}$ and $\|X\|_{L}$ with respect to $L_c$ and $L$ satisfy the following inequality:
$$\|\widetilde X^{*}\|_{L_c} \le \|X\|_{L}, \qquad (16)$$
where $\|X\|_{L}^2 = \operatorname{tr}(X^T L X)$ and $\|\widetilde X^{*}\|_{L_c}^2 = \operatorname{tr}(\widetilde X^{*T} L_c \widetilde X^{*})$. Proof: The proof is deferred to Appendix A.7.

Theorem 3.1 (ϵ-similarity). The coarsened graph data $G_c(L_c, \widetilde X)$ is $\epsilon$-similar to the original graph data $G(L, X)$, i.e., there exists an $\epsilon \in [0, 1)$ such that
$$(1 - \epsilon)\,\|X\|_{L} \le \|\widetilde X\|_{L_c} \le (1 + \epsilon)\,\|X\|_{L}.$$
Proof: The proof is deferred to Appendix A.8.

4. Experiments
In this section, we demonstrate the effectiveness of the proposed algorithm through a comprehensive set of experiments conducted on both real and synthetic graph datasets. We benchmark the proposed Featured Graph Coarsening (FGC) method against the state-of-the-art methods Local Variation Edges (LVE) and Local Variation Neighbourhood (LVN), proposed in (Loukas, 2019), and GOREN (Cai et al., 2021). Throughout all the experiments, FGC shows superior performance.

Datasets: The real and synthetic datasets used in our experiments, written as dataset (p, m, n) where p is the number of nodes, m the number of edges, and n the number of features, are: (i) Cora (2708, 5278, 1433), (ii) Citeseer (3312, 4536, 3703), (iii) Polblogs (1490, 16715, 5000), (iv) ACM (3025, 13128, 1870), (v) Erdos-Renyi (ER) (1000, 25010, 5000), (vi) Watts-Strogatz (WS) (1000, 5000, 5000), (vii) Barabasi-Albert (BA) (1000, 9800, 5000), (viii) Random Geometric Graph (RGG) (1000, 7265, 5000), (ix) Minnesota (2642, 3304, 5000), (x) Yeast (2361, 13292, 5000), (xi) Airfoil (4253, 12289, 5000), (xii) Bunny (2503, 78292, 5000), (xiii) Pubmed (19717, 88648, 500), and (xiv) Coauthor Physics (Co-Phy) (34493, 247962, 8415). The features of Polblogs, ER, WS, BA, RGG, Yeast, Minnesota, Airfoil, and Bunny are generated as $X \sim \mathcal{N}(0, L^{\dagger})$, consistent with (5), where $L$ is the Laplacian matrix of the given graph. Further details of the datasets are given in Appendix B.

REE, DE, HE and RE analysis: We use the relative eigen error (REE), the reconstruction error (RE), the Dirichlet energy (DE) of $G_c$, and the hyperbolic error (HE) as metrics to evaluate the performance of the coarsening algorithms, as shown in Table 1. Lower values indicate that the coarsened graph better preserves the properties of the original graph. The baseline methods use only adjacency-matrix information to perform the coarsening. Once the coarsening matrix $P = C^{\dagger}$ is learned, which establishes the linear mapping of the nodes to the super-nodes, the matrix $P$ is further used to coarsen the feature matrix as $\widetilde X = PX$.

Comparison with deep learning based methods: We compare FGC (proposed) against GOREN (Cai et al., 2021), a deep learning-based graph coarsening approach, on real datasets. Due to the unavailability of their code, we compare only REE, since among REE, DE, RE, and HE it is the only metric reported in their paper; their REE results are taken directly from the paper. It is evident in Table 3 that FGC outperforms GOREN.

REE               Minnesota    Airfoil      Yeast        Bunny
GOREN (LVN/LVE)   0.38/0.45    0.36/0.33    0.13/0.39    0.16/0.05
FGC               0.08         0.10         0.01         0.04

Table 3: REE values of FGC (proposed) and GOREN (Cai et al., 2021) for r = 0.5. More results on r = 0.3, 0.7 are in Appendix B.1. It is evident that FGC outperforms GOREN.
ϵ-Similarity and Spectral Similarity: Here we evaluate the FGC algorithm for ϵ-similarity as discussed in Theorem (3.1) and spectral similarity. The spectral similarity 0.1 0.2 0.3 0.4 0.5 0.6 Coarsening Ratio (a) ϵ similarity 0 20 40 60 80 100 Cora Original (b) Eigenvalues plots 0 20 40 60 80 100 Citeseer Original (c) Eigenvalues plots Figure 2: Figure (a), the ϵ values lying between [0, 1) indicates that the coarsened graph Gc learned by FGC (proposed) and G are similar in terms of Drichlet energy as in Theorem 3.1. The top 100 eigenvalues of L and Lc obtained by FGC, LVN, and LVE are plotted in Figure (b) Cora dataset and Figure (c) Citeseer dataset. This indicates that the FGC better preserves the spectral properties than the existing state-of-the-art methods. Results with more datasets are provided in the appendix B.2. plots in Figure 2 are obtained for three coarsening methods FGC (proposed), LVE and LVN, and the coarsening ratio is chosen as r = 0.3. Experiments on other datasets and coarsening ratios are shown in the appendix B.1. Clustering: The input is a social network of a university Karate club consisting of 34 nodes, the goal is to classify the nodes into two groups, see Figure 3. For the FGC algorithm feature matrix, X of size 34 n is generated by sampling from X N(0, L ), where L is the Laplacian matrix of the given network. The result here is shown for n = 600, however, an n of the order of 5 34 has been observed to perform well. The FGC result also demonstrates that the features may help in improving the graph-based task, and for some cases like the one presented here the features can also be artificially generated governed by the smoothness and homophily properties. Figure 3: This figure evaluates the clustering performance of the FGC algorithm on the classic Zachary s karate club dataset (Zachary, 1977): (a) Ground truth, (b) Graclus(Dhillon et al., 2007), (c) spectral clustering ratio cut(Ng et al., 2001) (d) spectral clustering normalized cut(Ng et al., 2001) (e) LVN, and (f) FGC (Proposed). Orange nodes indicate misclassified points, FGC demonstrates a better performance, it resulted in 1 misclassified point, while the number of misclassified points for (b), (c), (d) and (e) are 11, 7, 2 and 5, respectively. The clustering experiments on Polblogs dataset are in the appendix B.5. 7 Featured Graph Coarsening with Similarity Guarantees Dataset r= k REE(L, Lc, 100) DE in 104 HE RE in log( ) FGC LVN/LVE FGC LVN/LVE FGC LVN/LVE FGC LVN/LVE 0.7 0.5 0.3 0.04 0.051 0.058 0.33/0.29 0.51/0.53 0.65/0.71 0.75 0.69 0.66 10/9.9 6.1/5.81 3.2/2.8 0.72 1.18 1.71 1.39/1.42 2.29/2.37 2.94/3.08 1.91 2.78 3.28 2.92/2.95 3.63/3.67 3.77/3.79 0.7 0.5 0.3 0.012 0.04 0.05 0.32/0.29 0.54/0.55 0.72/0.76 0.71 0.69 0.59 13/14 7.5/7.1 3.1/2.9 0.85 1.05 1.80 1.68/1.63 2.43/2.40 3.25/3.41 1.32 1.61 2.41 2.56/2.51 2.87/2.90 3.04/3.04 0.7 0.5 0.3 0.001 0.007 0.01 0.50/0.35 0.73/0.67 0.86/0.96 3.2 3.0 2.6 607/656 506/468 302/115 1.73 2.70 2.89 2.33/2.39 2.73/2.58 3.07/3.69 5.1 6.2 6.3 7.27/7.11 7.42/7.42 7.50/7.51 0.7 0.5 0.3 0.002 0.034 0.036 0.38/0.14 0.66/0.42 0.92/0.88 1.7 1.5 0.5 72/93.4 30/43 5.7/7.5 0.45 0.98 1.86 2.13/1.63 3.10/2.55 4.867/4.43 2.42 3.78 4.77 5.05/4.66 5.35/5.18 5.44/5.42 Table 1: This table summarizes the REE, DE, HE and RE values obtained by FGC (proposed), LVN and LVE on different coarsening ratios (r) for real graph datasets. It is evident that FGC (proposed) outperforms state-of-the-art methods significantly. Experiments on other benchmark algorithms are in the appendix B.1. 
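The quantities discussed above and plotted in Figure 2 can be estimated with a few lines of code. The sketch below computes one natural empirical value of the ϵ in Theorem 3.1 and the top-m eigenvalues compared in Figures 2(b)-(c); treating ϵ as the smallest value satisfying the two-sided bound is our own reading (an assumption), not the authors' reference implementation.

```python
import numpy as np

def epsilon_similarity(L, Lc, X, X_t):
    """Smallest eps with (1 - eps)||X||_L <= ||X_t||_Lc <= (1 + eps)||X||_L (Theorem 3.1).

    This is one natural empirical estimate (an assumption about how the epsilon
    of Figure 2(a) is measured), not the authors' reference code.
    """
    norm_L = np.sqrt(np.trace(X.T @ L @ X))          # ||X||_L
    norm_Lc = np.sqrt(np.trace(X_t.T @ Lc @ X_t))    # ||X_t||_Lc
    return abs(norm_Lc - norm_L) / norm_L

def top_eigenvalues(L, Lc, m=100):
    """Top-m eigenvalues of L and Lc, as compared in Figures 2(b)-(c)."""
    lam = np.sort(np.linalg.eigvalsh(L))[::-1][:m]
    lam_c = np.sort(np.linalg.eigvalsh(Lc))[::-1][:m]
    return lam, lam_c
```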
Dataset r GCN GAT APPNP Cora 0.3 85.79 0.2 85.62 0.2 87.25 0.3 Citeseer 0.3 74.64 1.3 74.81 0.8 74.36 0.4 Co-Phy 0.3 94.27 0.2 94.98 0.6 94.23 0.3 Pubmed 0.3 80.73 0.4 81.02 0.7 80.65 0.5 Table 2: The table summarizes the node classification performance of the FGC with different GNN architectures. It is evident that the proposed FGC algorithm is compatible with different GNN architectures and shows similar performance. 4.1. Visualization of sparse coarsened graph Given a sparse original graph, we can learn a sparse coarsened graph by adding the following regularizer h(Lc) = CT LC 2 F in the objective function (7). The regularizer is differentiable and convex in C. The update rule of loading matrix C is modified by adding 2LC(t)(C(t)T LC(t)) in f(C(t)), while update rule of X remains the same. Next, a visualization for 100 nodes stochastic block model graph coarsened into a 30 nodes graph using i) FGC without sparsity and ii) FGC with sparsity regularizer (FGCS) as h(Lc) = ||CT LC||2 F is provided in the Figure 4. (a) Original graph G (b) Gc without sparsity (c) Gc with sparsity Figure 4: This figure shows the visualization of 100 nodes stochastic block model graph and coarsened graphs obtained using FGC with and without sparsity regularizer for a coarsening ratio of 0.3. 4.2. Node classification using FGC In this section, we perform node classification using GNN, in which the goal is to predict a label for unlabelled nodes of a graph. The GNN is trained using the coarsened graph, and the trained model is used to assign labels to the unlabelled nodes of the original graph. The steps to perform node classification are (i) we learn a coarsen graph Gc = ( V , E, W, X) using FGC algorithm with sparsity regularizer having supernode labels Y . The labels for super-nodes can be selected using Y = argmax(PY ) (Huang et al., 2021) where P is the coarsening matrix. (ii) Graph neural network (GNN) is trained using coarsened graph data Gc. (iii) We assign labels to each node of the original graph using the trained GNN model. The learning rate and decay rate used in the node classification experiments are 0.01 and 0.0001, respectively. All the results are calculated using 10-fold cross-validation. It is evident in table Table 4 that FGC outperforms state-ofthe-art graph coarsening algorithms. Next, we used different GNN architectures like GCN (Kipf & Welling, 2016), GAT (Velivckovi c et al., 2017), and APPNP (Gasteiger et al., 2018) to train our GNN and perform the node classification task. Table 2 demonstrates that FGC is compatible with different widely used GNN architectures, giving almost similar Node Classification accuracy across all the datasets. 
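The label-lifting step of the node-classification pipeline described above, $\widetilde Y = \arg\max(PY)$ following (Huang et al., 2021), can be written in a few lines; the one-hot encoding, the toy $P$, and the label values below are illustrative assumptions.

```python
import numpy as np

def coarsen_labels(P, y, num_classes):
    """Super-node labels Y_tilde = argmax(P Y), with Y the one-hot matrix of node labels y."""
    Y = np.eye(num_classes)[y]          # (p, num_classes) one-hot node labels
    return np.argmax(P @ Y, axis=1)     # (k,) majority-style label per super-node

# Example: P maps 5 nodes to 3 super-nodes (same toy P as in Section 2.2).
P = np.array([[1/3, 1/3, 1/3, 0, 0],
              [0,   0,   0,   1, 0],
              [0,   0,   0,   0, 1]])
y = np.array([0, 0, 1, 1, 2])           # assumed node labels
print(coarsen_labels(P, y, num_classes=3))   # -> [0 1 2]
```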
Algorithm 2 Algorithm for node classification Input: G = (V, E), features X, labels Y 4 Apply a coarsening algorithm to learn P; P = C .; 5 Compute feature matrix of a coarsened graph, X = PX; 6 Compute labels of coarsened graph, Y = arg max(PY ) 7 Learn W matrix to minimize ℓ(GCNGc(W ), Y ) Output: Trained weight matrix W Featured Graph Coarsening with Similarity Guarantees Dataset r GCOND SCAL FGC 0.3 81.56 0.6 79.42 1.71 85.79 0.24 0.1 81.37 0.40 71.38 3.62 81.46 0.79 0.05 79.93 0.44 55.32 7.03 80.01 0.51 0.3 72.43 0.94 68.87 1.37 74.64 1.37 0.1 70.46 0.47 71.38 3.62 73.36 0.53 0.05 64.03 2.4 55.32 7.03 71.02 0.96 0.05 93.05 0.26 73.09 7.41 94.27 0.25 0.03 92.81 0.31 63.65 9.65 94.02 0.20 0.01 92.79 0.4 31.08 2.65 93.08 0.22 0.05 78.16 0.30 72.82 2.62 80.73 0.44 0.03 78.04 0.47 70.24 2.63 79.91 0.30 0.01 77.2 0.20 54.49 10.5 78.42 0.43 0.05 86.29 0.63 34.45 10.0 89.60 0.39 0.03 86.32 0.45 26.06 9.29 88.29 0.79 0.01 84.01 0.02 14.42 8.5 86.37 1.36 Table 4: The table summarizes the node classification accuracy on both small and large datasets for the FGC with sparsity regularizer, GCOND (Jin et al., 2021), SCAL (Huang et al., 2021). It is evident that the FGC outperforms stateof-the-art methods. However, we have compared only with GCOND and SCAL as these are the most recent technique for graph node classification using the coarsened graph. 4.3. Time Complexity Consider the input graph with p nodes, E1 edges, and a feature vector of size n for each node. If the GCN has l number of layers, the total time complexity to perform node classification using GCN is O(lp2n + lp E1n). However, the time complexity to perform coarsening as well as node classification is O(p2k + lk2n + lk E2n) where k is the number of nodes of the coarsened graph and E2 is the number of edges of the coarsened graph. Since (p >> k) and E1 >> E2 and if choose k such that k < n, then the time complexity to perform coarsening and node classification is very less as compared to performing node classification using the original graph. It is evident in Table 5 that the proposed FGC algorithm is much faster than baseline methods. Dataset(τ) GCOND SCAL FGC GCN(wh.data) Cora 329.8 27.7 1.66 2.86 Citeseer 331.3 56.2 2.22 5.24 Pubmed 202.0 54.0 30.4 58.8 Co-CS 1600 180 34.4 72.3 Table 5: The table summarizes the time complexity comparison between proposed FGC and baseline algorithms for a coarsening ratio of r = 0.05, where τ(in sec.) is the time required to perform coarsening and classification, and wh.data represents whole dataset. It is evident that the proposed FGC is much faster than the existing baselines. Dataset r Prop.FGC GCN(Whole dataset) Cora 0.3 85.79 0.24 89.50 1.23 Citeseer 0.3 74.64 1.37 78.08 1.96 Pubmed 0.05 80.73 0.44 88.89 0.59 Co-phy 0.05 94.27 0.25 96.22 0.72 Table 6: The table summarizes the node classification performance comparison of the proposed FGC with the original Graph (No Coarsening). It is evident that the node classification accuracy obtained using the coarsened graph with the proposed FGC approach is very close to the one obtained with the full graph that is the original graph. 5. Conclusion We introduced a novel framework for coarsening graph data named Featured Graph Coarsening (FGC) which considers both the graph matrix and feature matrix jointly. 
The featured graph coarsening is formulated as a multi-block non-convex optimization problem for which we developed an efficient algorithm by bringing in techniques from alternate minimization, majorization-minimization, and log determinant frameworks. The developed algorithm is provably convergent and ensures the necessary properties in the coarsened graph data. Also, the FGC method is the first coarsening method that guarantees that the original graph and coarsened graph are ϵ-similar for ϵ [0, 1). Extensive experiments with both real and synthetic datasets demonstrate the superiority of the proposed FGC framework over existing state-of-the-art methods. The proposed approach for graph coarsening will be of significant interest to the graph machine-learning community. 6. Acknowledgments We would like to thank the reviewers for their suggestions to improve this paper. We would also like to thank to whole MISN lab members for their contributions to the discussions regarding the paper and code implementation. This work is supported by the DST Inspire faculty grant MI02322G. Beck, A. and Pan, D. Convergence of an Inexact Majorization-Minimization Method for Solving a Class of Composite Optimization Problems, pp. 375 410. 01 2018. ISBN 978-3-319-97477-4. doi: 10.1007/ 978-3-319-97478-1_13. Bravo Hermsdorff, G. and Gunderson, L. A unifying framework for spectrum-preserving graph sparsification and coarsening. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Featured Graph Coarsening with Similarity Guarantees Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. Briggs, W. L., Henson, V. E., and Mc Cormick, S. F. A multigrid tutorial. Society for Industrial and Applied Mathematics., 2000. Cai, C., Wang, D., and Wang, Y. Graph coarsening with neural networks. In International Conference on Learning Representations, 2021. Chen, J., Saad, Y., and Zhang, Z. Graph coarsening: from scientific computing to machine learning. Journal of the Spanish Society of Applied Mathematics (Se MA), 79(1): 187 223, 2022. Chung, F. R. and Graham, F. C. Spectral graph theory, volume 92. American Mathematical Soc., 1997. Dhillon, I. S., Guan, Y., and Kulis, B. Weighted graph cuts without eigenvectors a multilevel approach. IEEE transactions on pattern analysis and machine intelligence, 29(11):1944 1957, 2007. Dong, X., Thanou, D., Frossard, P., and Vandergheynst, P. Learning laplacian matrix in smooth graph signal representations. IEEE Transactions on Signal Processing, 64(23):6160 6173, 2016. Fortunato, S. Community detection in graphs. Physics reports, 486(3-5):75 174, 2010. Gasteiger, J., Bojchevski, A., and Günnemann, S. Predict then propagate: Graph neural networks meet personalized pagerank. ar Xiv preprint ar Xiv:1810.05997, 2018. Gavish, M., Nadler, B., and Coifman, R. R. Multiscale wavelets on trees, graphs and high dimensional data: Theory and applications to semi supervised learning. In International Conference on Machine Learning, 2010. Hackbusch, W. Multi-grid methods and applications, volume 4. Springer Science & Business Media, 2013. Hajiabadi, M., Singh, J., Srinivasan, V., and Thomo, A. Graph summarization with controlled utility loss. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 536 546, 2021. Hendrickson, B., Leland, R. W., et al. A multi-level algorithm for partitioning graphs. SC, 95(28):1 14, 1995. Huang, Z., Zhang, S., Xi, C., Liu, T., and Zhou, M. 
Scaling up graph neural networks via graph coarsening. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 675 684, 2021. Jin, W., Zhao, L., Zhang, S., Liu, Y., Tang, J., and Shah, N. Graph condensation for graph neural networks. ar Xiv preprint ar Xiv:2110.07580, 2021. Jin, W., Tang, X., Jiang, H., Li, Z., Zhang, D., Tang, J., and Yin, B. Condensing graphs via one-step gradient matching. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 720 730, 2022. Kalofolias, V. How to learn a graph from smooth signals. In Gretton, A. and Robert, C. C. (eds.), Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pp. 920 929, Cadiz, Spain, 09 11 May 2016. Proceedings of Machine Learning Research. Kang, S., Lee, K., and Shin, K. Personalized graph summarization: formulation, scalable algorithms, and applications. ar Xiv preprint ar Xiv:2203.14755, 2022. Karypis, G. and Kumar, V. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on scientific Computing, 20(1):359 392, 1998. Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. ar Xiv preprint ar Xiv:1609.02907, 2016. Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. Ko, J., Kook, Y., and Shin, K. Incremental lossless graph summarization. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 317 327, 2020. Kumar, M., Sharma, A., and Kumar, S. A unified framework for optimization-based graph coarsening. Journal of Machine Learning Research, 24(118):1 50, 2023. Kumar, S., Ying, J., de Miranda Cardoso, J. V., and Palomar, D. Structured graph learning via laplacian spectral constraints. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. Kumar, S., Ying, J., de Miranda Cardoso, J. V., and Palomar, D. P. A unified framework for structured graph learning via spectral constraints. Journal of Machine Learning Research, 21(22):1 60, 2020. Kushnir, D., Galun, M., and Brandt, A. Fast multiscale clustering and manifold identification. Pattern Recognition, 39(10):1876 1891, 2006. Lafon, S. and Lee, A. B. Diffusion maps and coarsegraining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE transactions on pattern analysis and machine intelligence, 28(9):1393 1403, 2006. Featured Graph Coarsening with Similarity Guarantees Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207 1216, Stanford, CA, 2000. Morgan Kaufmann. Liu, Y., Safavi, T., Dighe, A., and Koutra, D. Graph summarization methods and applications: A survey. ACM computing surveys (CSUR), 51(3):1 34, 2018. Loukas, A. Graph reduction with spectral and cut guarantees. Journal of Machine Learning Research, 20(116):1 42, 2019. Loukas, A. and Vandergheynst, P. Spectrally approximating large graphs with smaller graphs. In International Conference on Machine Learning, pp. 3237 3246. Proceedings of Machine Learning Research, 2018. Ma, T. and Chen, J. 
Unsupervised learning of graph hierarchical abstractions with differentiable coarsening and optimal transport, 2020. Ma, T. and Chen, J. Unsupervised learning of graph hierarchical abstractions with differentiable coarsening and optimal transport. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 8856 8864, 2021. Ming, D., Ding, C., and Nie, F. A probabilistic derivation of lasso and l12-norm feature selections. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4586 4593, 2019. Ng, A., Jordan, M., and Weiss, Y. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 14, 2001. Purohit, M., Prakash, B. A., Kang, C., Zhang, Y., and Subrahmanian, V. Fast influence-based coarsening for large networks. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1296 1305, 2014. Qiu, Y., Zhou, G., and Xie, K. Deep approximately orthogonal nonnegative matrix factorization for clustering. ar Xiv preprint ar Xiv:1711.07437, 2017. Rajawat, K. and Kumar, S. Stochastic multidimensional scaling. IEEE Transactions on Signal and Information Processing over Networks, 3(2):360 375, 2017. Razaviyayn, M., Hong, M., and Luo, Z.-Q. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23, 09 2012. doi: 10.1137/120891009. Riondato, M., García-Soriano, D., and Bonchi, F. Graph summarization with quality guarantees. Data mining and knowledge discovery, 31(2):314 349, 2017. Ruge, J. W. and Stüben, K. Algebraic multigrid. In Multigrid methods, pp. 73 130. SIAM, 1987. Shuman, D. I., Faraji, M. J., and Vandergheynst, P. A multiscale pyramid transform for graph signals. IEEE Transactions on Signal Processing, 64(8):2119 2134, 2015. Sun, Y., Babu, P., and Palomar, D. P. Majorizationminimization algorithms in signal processing, communications, and machine learning. IEEE Transactions on Signal Processing, 65(3):794 816, 2017. doi: 10.1109/ TSP.2016.2601299. Velivckovi c, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. ar Xiv preprint ar Xiv:1710.10903, 2017. Wang, X., Ji, H., Shi, C., Wang, B., Ye, Y., Cui, P., and Yu, P. S. Heterogeneous graph attention network. In The world wide web conference, pp. 2022 2032, 2019. Wang, X., Pun, Y.-M., and So, A. Learning graphs from smooth signals under moment uncertainty. 05 2021. Ying, J., de Miranda Cardoso, J. V., and Palomar, D. Nonconvex sparse graph learning under laplacian constrained graphical model. Advances in Neural Information Processing Systems, 33:7101 7113, 2020. Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68 (1):49 67, 2006. Zachary, W. W. An information flow model for conflict and fission in small groups. Journal of anthropological research, 33(4):452 473, 1977. Zhu, P., Zhu, W., Hu, Q., Zhang, C., and Zuo, W. Subspace clustering guided unsupervised feature selection. Pattern Recognition, 66:364 374, 2017. Zügner, D. and Günnemann, S. 
Adversarial attacks on graph neural networks via meta learning. ar Xiv preprint ar Xiv:1902.08412, 2019. Featured Graph Coarsening with Similarity Guarantees A. Appendix A.1. Proof of Lemma 1 The matrix L Rp p is the Laplacian matrix of a connected graph having p nodes. From (1) it is implied that L = LT , L is positive semi-definite matrix with rank(L) = p 1 and L t1p = 0p, where t R and 1p and 0p are the all one and zero vectors of size p. In addition, we also have L up = 0p for up = t1p, this means that there is only one zero eigenvalues possible and the corresponding eigenvector is a constant vector. In order to establish Lc is the connected graph Laplacian matrix of size k, we need to prove that Lc (1) and rank(Lc) = k 1. We begin by using the Cholesky decomposition of the Laplacian matrix L, as L = ST S. Next, we can write CT LC as Lc = CT LC = CT ST SC (17) = ZT Z (18) where Z = SC and CT LC = ZT Z imply that Lc is a symmetric positive semidefinite matrix. Now, using the property of C, i.e, C t1k = t1p as in (3). In each row of the loading matrix C, there is only one non zero entry and that entry is 1 which implies that C 1k = 1p and CT LC 1k = CT L 1p = 0k which imply that the row sum of CT LC is zero and constant vector is the eigenvector corresponding to the zero eigenvalue. Thus Lc is the Laplacian matrix. Next, we need to prove that Lc is a connected graph Laplacian matrix for that we need to prove that rank(Lc) = k 1. Note that, C uk = up if and only if uk = t1k and up = t1p and C uk = up uk = t1k and up = t1p where t R which implies that C uk = up holds only for a constant vector uk. And thus, CT LC uk = 0k, for constant vector uk. This concludes that the constant vector is the only eigenvector spanning the nullspace of Lc which concludes that the rank of CT LC is k 1 which completes the proof. A.2. Proof of Lemma 2 We prove the convexity of γlog det(CT LC + J) using restricting function to line i.e. A function f : Rn R is convex if g : R R is convex where, g(t) = f(z + tv), {z dom(f), t dom(g), v Rn} (19) Since L is the Laplacian of connected original Graph G and Laplacian of coarsened graph Gc is Lc = CT LC which also represents a connected graph and proof is given in the appendix A.1. Using the property of the connected graph Laplacian matrix, Lc is a symmetric positive semi-definite matrix and has a rank k 1. Adding J = 1 k1k k which is a rank 1 matrix in Lc increases rank by 1. Lc + J becomes symmetric and positive definite matrix and we can rewrite Lc + J = CT LC + J = Y T Y . Now, we can rewrite γlog det(CT LC + J) as f(Y ) = γlog det(CT LC + J) = γlog det(Y T Y ) (20) Consider Y=Z+t V and put it in (20). However, Z and V are constant so function in Y becomes function in t i.e. g(t) is g(t) = γlog det((Z + t V )T (Z + t V )) (21) = γlog det(ZT Z + t(ZT V + V T Z) + t2V T V ) (22) = γlog det(ZT (I + t(V Z 1 + (V Z 1)T ) + t2(Z 1)T V T V Z 1)Z) (23) = γ(log det(ZT Z) + log det(I + t(P + P T ) + t2P T P)) (24) Featured Graph Coarsening with Similarity Guarantees g(t) = γ(log det(ZT Z) + log det(QQT + 2t QΛQT + t2QΛ2QT )) (25) = γ(log det(ZT Z) + log det(Q(I + 2tΛ + t2Λ2)QT )) (26) = γlog det(ZT Z) γ i=1 log(1 + 2tλi + t2λ2 i ) (27) On putting V Z 1 = P in (23) to get (24). Using eigenvalue decomposition of P matrix i.e. P = QΛQT and QQT = I and putting the values of P and I in (24) to get (25). The second derivative of g(t) with respect to t is 2λ2 i (1 + tλi)2 (1 + 2tλi + t2λ2 i )2 (28) It is clearly seen that g (t) 0, t R so it is a convex function in t. 
Now, using the restricting function to line property if g(t) is convex in t then f(Y ) is convex in Y . Consider Y = L 1 2 C + 1 kp1P k so, Y T Y = (L 1 2 C + 1 kp1P k)T (L 1 2 C + 1 kp1P k) (29) = CT LC + 1 kp(p1k k) + 1 kp1T P k L 1 2 C + 1 kp CT (L 1 2 )T 1P k (30) = CT LC + 1 k 1k k (31) = CT LC + J (32) L is a Laplacian matrix so L 1 2 is also Laplacian matrix and using the property of Laplacian matrix i.e. L 1 2 .1p k = 0p k and 1T p k.L 1 2 = 0k p in (30), we get (31). Since Y = L 1 2 C + 1 pk1p k and f(Y ) is convex in Y and C is a affine transformation of Y so γlog det(CT LC + J) is a convex function in C. A.3. Proof of Lemma 3 The Lagrangian function of (12) is: L(C, X, µ1) =1 2CT C CT A µ 1 C + µT 2 h CT 1 2 2 1, . . . CT p 2 2 1 i T (33) where µ1 is the dual variable. The KKT conditions of (12) is C A µ1 + 2 h µ21CT 1 , . . . µ2p CT p ]T = 0, (34) µT 2 h CT 1 2 2 1 CT 2 2 2 1 . . . CT p 2 2 i T 1 = 0, (35) µ 1 C = 0, (36) [CT ]i 2 2 1 (39) The optimal solution of C that satisfies all KKT conditions (34)-(40) is Ct+1 = (A)+ [AT ]i 2 (41) where A = C(t) 1 L f C(t) + and [AT ]i is the i-th row of matrix A. This concludes the proof. Featured Graph Coarsening with Similarity Guarantees A.4. Lipschitz Continuity of Function f(C) in (11) Consider the function γlog det(CT LC + J). The Lipschitz constant L1 of the function γlog det(CT LC + J) is related to the smallest non zero eigenvalue of coarsened Laplacian matrix CT LC = Lc, which is bounded away from δ (k 1)2 (Rajawat & Kumar, 2017), where δ is the minimum non zero weight of coarsened graph. However, for practical purposes, the edges with very small weights can be ignored and set to be zero, and we can assume that the non-zero weights of the coarsened graph Gc are bounded by some constant δ 0. On the other hand, we do not need a tight Lipschitz constant L1. In fact, any L 1 L1 makes the function g(C|C(t)) satisfy (9)(a)-(c). Now, consider the tr( ) term: tr( XT C1T LC1 X) tr( XT C2T LC2 X) = tr( XT C1T LC1 X) tr( XT C2T LC1 X) + tr( XT C2T LC1 X) tr( XT C2T LC2 X) (42) tr( XT (C1 C2)T LC1 X) + tr( XT C2T L(C1 C2) X (43) tr XT (C1 C2)T LC1 X F + tr XT C2T L(C1 C2) X F (44) ||tr|||| X||2 F ||L||||C1 C2||F (||C1||F +||C2||F ) (45) L2||C1 C2||F (46) We applied the triangle inequality after adding and subtracting tr( XT C2T LC1 X) in (42) to get (43). Using the property of the norm of the trace operator i.e. tr = sup A =0 ||A||F from Rn n to R in (43) to get (44). Applying the Frobenius norm property i.e. AB F A F B F in (44) to get (45). Since, in each row of C is having only one non zero entry i.e. 1 and rest entries are zero so, C1 F = C2 F = p and putting this in (45), we get (46) where, L2 = 2 p tr X 2 F ||L||F . Next, consider the function α 2 C X X 2 F : 2 C X X 2 F = α 2 tr((C X X)T (C X X)) (47) 2 tr( XT CT C X XT C X + XT X XT CT X) (48) 2 (tr( XT CT C X) tr( XT CT X) tr(XT C X) + tr(XT X)) (49) With respect to C, tr(XT X) is a constant and tr( XT CT C X), tr( XT CT X), tr(XT C X) are Lipschitz continous function and proof is very similar to the proof of tr( ) as in (42)-(46), and sum of Lipschitz continuous function is Lipschitz continuous so α 2 C X X 2 F is L3 Lipschitz continuous. Finally, consider the function λ 2 CT 2 1,2. Note that we have C 0 means all the elements of C are non-negative, |C|ij= Cij 0. 
Finally, consider the function (λ/2) ||C^T||²_{1,2}. Note that C ≥ 0 means all the elements of C are non-negative, i.e., |C_ij| = C_ij ≥ 0. With this, the ℓ1-norm becomes a plain summation, and we obtain the following:

    ||C^T||²_{1,2} = Σ_{i=1}^{p} ( Σ_{j=1}^{k} C_ij )²    (50)
                   = Σ_{i=1}^{p} ( [C^T]_i 1 )²    (51)
                   = ||C 1||²    (52)
                   = tr(1^T C^T C 1),    (53)

where 1 is the all-one vector, [C^T]_i is the i-th row of the loading matrix C, and each entry of C satisfies C_ij ≥ 0. The function tr(1^T C^T C 1) is Lipschitz continuous, and the proof is similar to that of the tr(·) term in (42)–(46), so ||C^T||²_{1,2} is L_4-Lipschitz continuous. Since the sum of Lipschitz continuous functions is Lipschitz continuous, f(C) in (10) is L-Lipschitz continuous with L = max(L_1, L_2, L_3, L_4).

A.5. f(C) in (10) is strictly convex

The −γ log det(·) term is convex (Lemma 2) and the tr(·) term is a convex quadratic in C; the Frobenius norm and the squared ℓ_{1,2} norm are also convex functions. Consider the term ||C^T||²_{1,2} = Σ_{i=1}^{p} ||[C^T]_i||_1² > 0, which implies that f(C) in (10) is strictly convex.

A.6. Proof of Theorem 1

The Lagrangian function of (7) is

    L(C, X̃, µ_1, µ_2) = −γ log det(C^T L C + J) + (α/2) ||X − C X̃||_F² + tr(X̃^T C^T L C X̃) + (λ/2) Σ_{i=1}^{p} ||[C^T]_i||_1²
                        − tr(µ_1^T C) + µ_2^T [ (||[C^T]_1||_2² − 1), …, (||[C^T]_p||_2² − 1) ]^T,    (54)

where µ_1 (of size p × k) and µ_2 (of size p × 1) are the dual variables. The KKT conditions with respect to C are

    (i) −2γ L C (C^T L C + J)^{−1} + α (C X̃ − X) X̃^T + 2 L C X̃ X̃^T + λ C 1_{k×k} − µ_1 + 2 [ µ_{21} [C^T]_1, …, µ_{2p} [C^T]_p ]^T = 0,
    (ii) µ_{2i} ( ||[C^T]_i||_2² − 1 ) = 0 for i = 1, …, p, and tr(µ_1^T C) = 0,
    (iii) C ≥ 0 and ||[C^T]_i||_2² ≤ 1 for i = 1, 2, …, p,
    (iv) µ_1 ≥ 0, µ_2 ≥ 0,

where 1_{k×k} is the k × k all-one matrix. The variable C* is derived by using the KKT condition from (13):

    −2γ L C* ((C*)^T L C* + J)^{−1} + α (C* X̃* − X) (X̃*)^T + 2 L C* X̃* (X̃*)^T + λ C* 1_{k×k} = 0.

For µ_1 = 0 and µ_{2i} [(C*)^T]_i = 0 for all i = 1, 2, …, p, we observe that C* satisfies the KKT conditions. The KKT condition with respect to X̃ is

    2 C^T L C X̃ + α C^T (C X̃ − X) = 0.

The solution of the convex optimization problem (14), X̃* = ( (2/α) (C*)^T L C* + (C*)^T C* )^{−1} (C*)^T X, satisfies this KKT condition. This concludes the proof.

A.7. Proof of Lemma 4

Let f_1(X̃) = tr(X̃^T C^T L C X̃) = ||X̃||²_{L_c} and f_2(X̃) = (α/2) ||C X̃ − X||_F². Since each update of X̃ minimizes a convex problem, the objective does not increase across iterations, so at the t-th iteration the following inequality holds:

    f_1(X̃^{t+1}) + f_2(X̃^{t+1}) ≤ f_1(X̃^{t}) + f_2(X̃^{t}) ≤ … ≤ f_1(X̃^{0}) + f_2(X̃^{0}).    (56)

In Algorithm 1, X̃ is initialized such that C X̃^{0} = X, so f_2(X̃^{0}) = 0 and f_1(X̃^{0}) = ||X||²_L. The following inequality then holds:

    f_1(X̃^{t+1}) + f_2(X̃^{t+1}) ≤ f_1(X̃^{0})  ⟹  f_1(X̃^{t+1}) ≤ f_1(X̃^{0}) = ||X||²_L,    (57)

and since this holds at every iteration, at convergence we get

    f_1(X̃*) = ||X̃*||²_{L_c} ≤ f_1(X̃^{0}) = ||X||²_L.    (58)

Since ||X̃||_{L_c} and ||X||_L are non-negative quantities,

    ||X̃||_{L_c} ≤ ||X||_L.    (59)

A.8. Proof of Theorem 3.1

We have ||X||_L = √(tr(X^T L X)) and ||X̃||_{L_c} = √(tr(X̃^T L_c X̃)). Taking the absolute difference between ||X||_L and ||X̃||_{L_c}, and using the Cholesky decomposition L = S^T S of the positive semi-definite matrix L, we get (writing X̃ = P X with C = P^†, so that C X̃ = P^† P X):

    | ||X||_L − ||X̃||_{L_c} | = | ||S X||_F − ||S P^† P X||_F |    (60)
                               ≤ || S X − S P^† P X ||_F ≤ ε ||X||_L.    (61)

Also, using Lemma 4, we get the following inequality:

    ( ||X||_L − ||X̃||_{L_c} ) / ||X||_L < 1.    (62)

Equations (61) and (62) imply that the range of ε is [0, 1). Next, by applying the property of the modulus function to (61), we obtain the following inequality for all the n samples:

    (1 − ε) ||X||_L ≤ ||X̃||_{L_c} ≤ (1 + ε) ||X||_L,    (63)

where ε ∈ [0, 1), and this concludes the proof.

B. ADDITIONAL EXPERIMENTS

In the following experiments, we apply the FGC algorithm to both real and synthetic datasets. For the synthetic datasets, the feature matrices are generated as X ∼ N(0, L†), where L† is the pseudo-inverse of the Laplacian matrix of G, and the edge weights in their weight matrices are generated uniformly at random from the range (1, 10). These results show that FGC performs strongly on both real and synthetic datasets.
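The synthetic feature generation just described can be reproduced in a few lines of numpy. The sketch below is illustrative only (graph size, sparsity, seed, and number of samples are arbitrary) and assumes that X ∼ N(0, L†) means drawing n i.i.d. zero-mean Gaussian samples with covariance equal to the Laplacian pseudo-inverse, which yields features that are smooth on the graph (low Dirichlet energy).

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 200, 50  # number of nodes and number of feature samples (toy sizes)

# A random weighted graph: weights drawn uniformly from (1, 10), each
# potential edge kept with probability 0.1.
W = rng.uniform(1, 10, size=(p, p)) * (rng.random((p, p)) < 0.1)
W = np.triu(W, 1); W = W + W.T
L = np.diag(W.sum(axis=1)) - W

# Sample X ~ N(0, L^+): each column is a smooth signal on the graph.
L_pinv = np.linalg.pinv(L)
L_pinv = 0.5 * (L_pinv + L_pinv.T)  # symmetrize against round-off
X = rng.multivariate_normal(np.zeros(p), L_pinv, size=n, check_valid="ignore").T  # p x n

print("Dirichlet energy tr(X^T L X):", np.trace(X.T @ L @ X))
```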
For simplicity, we have chosen λ to be n/2 (half the feature dimension) and tuned the other hyperparameters accordingly.

(1) The additional details of the real datasets are as follows:

Cora. Hyperparameters (γ=500, α=500, λ=716.5) are used in the FGC algorithm. The DE of G is 160963.

Citeseer. Hyperparameters (γ=500, α=500, λ=1851.5) are used in the FGC algorithm. The DE of G is 238074.

Polblogs. Hyperparameters (γ=500, α=500, λ=2500) are used in the FGC algorithm. The DE of G is 6113760.

ACM. Hyperparameters (γ=2000, α=500, λ=935) are used in the FGC algorithm. The DE of G is 1654444.

Bunny. This dataset consists of p=2503, m=78292, n=5000. Hyperparameters (γ=450, α=500, λ=2500) are used in the FGC algorithm. The DE of G is 12512526.

Airfoil. This dataset consists of p=4253, m=12289, n=5000. Hyperparameters (γ=2000, α=600, λ=2500) are used in the FGC algorithm. The DE of G is 21269451.

Yeast. This dataset consists of p=2224, m=4416, n=5000. Hyperparameters (γ=350, α=650, λ=2500) are used in the FGC algorithm. The DE of G is 10697908.

Minnesota. This dataset consists of p=2642, m=3304, n=5000. Hyperparameters (γ=500, α=550, λ=2500) are used in the FGC algorithm. The DE of G is 13207844.

(2) The details of the synthetic datasets are as follows:

Erdos Renyi (ER). It is represented as G(n, p), where n = 1000 is the number of nodes and p = 0.1 is the probability of edge creation. Hyperparameters (γ=500, α=500, λ=10) are used in the FGC algorithm. The DE of G is 4995707.

Barabasi Albert (BA). It is represented as G(n, m), where n = 1000 is the number of nodes and m = 20 is the number of edges preferentially attached from each new node to existing nodes of higher degree. Hyperparameters (γ=500, α=500, λ=1000) are used in the FGC algorithm. The DE of G is 4989862.

Watts Strogatz (WS). It is represented as G(n, k, p), where n = 1000 is the number of nodes, k = 20 is the number of nearest neighbors each node is connected to in the ring topology, and p = 0.1 is the probability of rewiring an edge. Hyperparameters (γ=500, α=500, λ=1000) are used in the FGC algorithm. The DE of G is 4997509.

Random Geometric Graph (RGG). It is represented as G(n, radius), where n = 1000 is the number of nodes and radius = 0.1 is the distance threshold for creating an edge. Hyperparameters (γ=500, α=500, λ=1000) are used in the FGC algorithm. The DE of G is 4989722.

B.1. REE and DE analysis

For the synthetic datasets as well, to measure the spectral similarity and ε-similarity of the coarsened graph G_c and the original graph G, we evaluate REE(L, L_c, 100), where L and L_c are the Laplacian matrices of G and G_c, respectively. We also compute the DE of G_c to measure its smoothness.

                      r = k/p     BA         WS         ER         RGG
    REE(L, L_c, 100)  0.3         0.623      0.202      0.138      0.234
                      0.5         0.211      0.109      0.076      0.249
                      0.7         0.084      0.201      0.055      0.252
    DE                0.3         212117     35948      293116     30112
                      0.5         208352     41889      542051     41215
                      0.7         201324     44852      534544     36483

Table 7: REE and DE results for synthetic datasets.

We have compared the FGC (proposed) algorithm with Kron reduction (Kron) and heavy edge matching (HEM) using REE and DE. The results are shown in Table 8.

    r = k/p     REE(L, L_c, 100)                     DE (in 10^4)
                FGC        Kron       HEM            FGC      Kron     HEM
    0.7         0.04       0.3815     0.3814         0.75     9.1      9.1
    0.5         0.051      0.579      0.5804         0.69     5.5      5.4
    0.3         0.058      0.7406     0.774          0.66     2.7      2.4

    0.7         0.0124     0.3153     0.3153         0.71     12.9     12.9
    0.5         0.04       0.5411     0.542          0.69     7.0      7.0
    0.3         0.05       0.778      0.807          0.59     2.7      2.5

    0.7         0.001      0.4256     0.4474         3.2      752      761
    0.5         0.007      0.67       0.708          3.0      513      373
    0.3         0.01       0.9653     0.929          2.6      132      183

    0.7         0.002      0.1568     0.1568         1.7      94.5     94.5
    0.5         0.034      0.400      0.4172         1.5      49       46.1
    0.3         0.036      0.858      0.935          0.5      8.9      5.4

Table 8: This table summarizes the REE and DE values obtained by FGC (proposed), Kron, and HEM on different coarsening ratios (r) for standard real graph datasets; each block of rows corresponds to one dataset.

It is evident that FGC (proposed) outperforms the state-of-the-art methods significantly.
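For reference, the two metrics reported above can be computed as in the following numpy sketch. It assumes, as the usage in this section suggests, that REE(L, L_c, k') is the mean relative difference between the top-k' eigenvalues of L and L_c, and that DE is the Dirichlet energy tr(X^T L X); the toy graph, loading matrix, and sizes below are arbitrary stand-ins, not the datasets above.

```python
import numpy as np

def ree(L, Lc, k_top):
    """Relative eigenvalue error over the top-k_top eigenvalues (assumed definition)."""
    ev = np.sort(np.linalg.eigvalsh(L))[::-1][:k_top]
    ev_c = np.sort(np.linalg.eigvalsh(Lc))[::-1][:k_top]
    return np.mean(np.abs(ev - ev_c) / ev)

def dirichlet_energy(L, X):
    """DE = tr(X^T L X), a measure of the smoothness of X on the graph."""
    return np.trace(X.T @ L @ X)

# Toy example: random weighted graph and a random one-hot loading matrix C.
rng = np.random.default_rng(0)
p, k = 400, 120  # coarsening ratio r = k/p = 0.3
W = rng.uniform(1, 10, size=(p, p)) * (rng.random((p, p)) < 0.05)
W = np.triu(W, 1); W = W + W.T
L = np.diag(W.sum(axis=1)) - W

C = np.zeros((p, k))
C[np.arange(p), rng.integers(0, k, size=p)] = 1.0  # one entry equal to 1 per row
Lc = C.T @ L @ C

X = rng.standard_normal((p, 20))
print("REE(L, Lc, 100):", ree(L, Lc, 100))
print("DE of the toy graph:", dirichlet_energy(L, X))
```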
Experiments on Bunny, Airfoil, Yeast, and Minnesota comparing FGC (proposed) with GOREN (HEM), GOREN (LVN), and GOREN (LVE) using REE are shown in Table 9 below, where G. stands for GOREN.

    r = k/p     REE(L, L_c, 100)
                FGC        G.HEM      G.LVN      G.LVE
    0.7         0.0167     0.258      0.082      0.007
    0.5         0.0392     0.420      0.169      0.057
    0.3         0.0777     0.533      0.283      0.094

    0.7         0.103      0.279      0.184      0.102
    0.5         0.105      0.568      0.364      0.336
    0.3         0.117      1.979      0.876      0.782

    0.7         0.007      0.291      0.024      0.113
    0.5         0.011      1.080      0.133      0.398
    0.3         0.03       3.482      0.458      2.073

    0.7         0.0577     0.357      0.114      0.118
    0.5         0.0838     0.996      0.382      0.457
    0.3         0.0958     3.423      1.572      2.073

Table 9: This table summarizes the REE values obtained by FGC (proposed), GOREN (HEM), GOREN (LVN), and GOREN (LVE) on different coarsening ratios (r) for real graph datasets; each block of rows corresponds to one dataset. It is evident that FGC (proposed) outperforms the state-of-the-art methods significantly.

B.2. Spectral similarity

We present eigenvalue plots at different coarsening ratios (r) to visualize the spectral similarity of G and G_c.

Figure 5: This figure plots the top-100 eigenvalues of the Laplacian matrices of the original and coarsened graphs for Cora ((a) r=0.3, (b) r=0.5) and Citeseer ((c) r=0.3, (d) r=0.5).

Figure 6: This figure plots the top-100 eigenvalues of the original and coarsened graphs for (a) BA and (b) RGG.

B.3. Loss Curves

Here we compute the loss curve over 10 iterations, where in each iteration C is updated 100 times with a learning rate of 1/k, on both real and synthetic datasets, to experimentally show the convergence of the proposed FGC algorithm. The figures below, for different coarsening ratios, demonstrate that the proposed FGC algorithm converges.

Figure 7: This figure shows the loss curves (loss versus number of iterations) for (a) Cora and (b) Citeseer at coarsening ratios 0.3, 0.5, and 0.7.

Figure 8: This figure shows the loss curves for (a) WS and (b) BA at different coarsening ratios.

B.4. Effects of Hyperparameters on the FGC algorithm

The FGC algorithm has three hyperparameters: (i) γ, which ensures that the coarsened graph is connected, (ii) α, which controls how faithfully the coarsened features X̃ reproduce X, and (iii) λ, which enforces sparsity and orthogonality on the loading matrix C. From Figures 9, 10, and 11, it is observed that the algorithm is not sensitive to the hyperparameters (λ, γ, α); any moderate value can be used for the FGC algorithm.

Figure 9: Panels (a)–(l) show the eigenvalue plots of the original graph and the coarsened graph obtained by FGC using hyperparameters α = 500, λ = 1000, and γ varying over 100, 200, 400, 500, 600, 700, 800, 1000, 2000, 5000, 10000, and 50000.

It is evident that for a moderate γ, i.e., between 200 and 2000, the REE is almost the same and our algorithm is consistent.
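To make the quantity behind these loss curves and hyperparameter sweeps concrete, the sketch below evaluates, on toy inputs, an objective assembled from the four terms that appear in the Lagrangian (54): the Dirichlet term tr(X̃^T C^T L C X̃), the log-det term −γ log det(C^T L C + J), the fitting term (α/2)||C X̃ − X||_F², and the sparsity/orthogonality term (λ/2)||C^T||²_{1,2}. This is only an illustrative reconstruction of the loss being tracked, not the authors' implementation, and the toy data and hyperparameter values are arbitrary.

```python
import numpy as np

def fgc_objective(C, X_tilde, X, L, gamma, alpha, lam):
    """Sum of the four terms appearing in (54); an assumed form of the FGC loss."""
    k = C.shape[1]
    Lc = C.T @ L @ C
    J = np.ones((k, k)) / k
    sign, logdet = np.linalg.slogdet(Lc + J)
    assert sign > 0  # holds when the coarsened graph is connected (Lemma 1)
    dirichlet = np.trace(X_tilde.T @ Lc @ X_tilde)
    fit = 0.5 * alpha * np.linalg.norm(C @ X_tilde - X) ** 2
    sparsity = 0.5 * lam * np.sum(np.sum(np.abs(C), axis=1) ** 2)  # ||C^T||_{1,2}^2
    return dirichlet - gamma * logdet + fit + sparsity

# Toy inputs (arbitrary): a connected random graph and a valid loading matrix.
rng = np.random.default_rng(0)
p, k, n = 100, 30, 10
W = rng.uniform(1, 10, size=(p, p)) * (rng.random((p, p)) < 0.1)
W = np.triu(W, 1); W = W + W.T
W += np.diag(np.ones(p - 1), 1) + np.diag(np.ones(p - 1), -1)  # path edges ensure connectivity
L = np.diag(W.sum(axis=1)) - W
X = rng.standard_normal((p, n))

assign = np.concatenate([np.arange(k), rng.integers(0, k, size=p - k)])  # no empty supernode
C = np.zeros((p, k)); C[np.arange(p), assign] = 1.0
X_tilde = np.linalg.pinv(C) @ X  # a simple initialization of the coarsened features

print("objective value:", fgc_objective(C, X_tilde, X, L, gamma=500, alpha=500, lam=10))
```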
Figure 10: Panels (a)–(l) show the eigenvalue plots of the original graph and the coarsened graph obtained by FGC using hyperparameters λ = 1000, γ = 600, and α varying over 100, 200, 400, 500, 600, 700, 800, 1000, 2000, 5000, 10000, and 50000.

It is evident that for a moderate α, i.e., between 100 and 2000, the REE is almost the same and our algorithm is consistent.

Figure 11: Panels (a)–(l) show the eigenvalue plots of the original graph and the coarsened graph obtained by FGC using hyperparameters α = 500, γ = 1000, and λ varying over 100, 200, 400, 500, 600, 700, 800, 1000, 2000, 5000, 10000, and 50000.

It is evident that for a moderate λ, i.e., between 100 and 2000, the REE is almost the same and our algorithm is consistent.

B.5. Clustering on the Polblogs dataset

The input is a political-blogs network consisting of 1490 nodes, and the goal is to classify the nodes into two groups. For the FGC algorithm, the feature matrix X of size 1490 × 5000 is generated by sampling X ∼ N(0, L†), where L† is the pseudo-inverse of the Laplacian matrix of the given network. The FGC algorithm and Graclus correctly classify 1250 and 829 nodes, respectively, while the performance of LVN and spectral clustering is not competitive. The FGC algorithm thus outperforms these baselines, which also demonstrates that features can help improve graph-based tasks; in some cases, such as the one presented here, the features can even be generated artificially, guided by the smoothness and homophily properties.
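As a concrete illustration of how cluster labels can be read off a coarsening, the sketch below assigns each node to the supernode indicated by its row of the loading matrix C (each row has a single entry equal to 1, as used in Appendix A.1) and scores a two-group assignment against ground-truth labels up to a swap of the two groups. The loading matrix and labels here are random stand-ins, not the ones learned by FGC on Polblogs.

```python
import numpy as np

def labels_from_loading(C):
    """Cluster label of node i = index of the single non-zero entry in row i of C."""
    return np.argmax(C, axis=1)

def two_way_accuracy(pred, truth):
    """Accuracy of a two-cluster assignment, invariant to swapping the two labels."""
    acc = np.mean(pred == truth)
    return max(acc, 1.0 - acc)

# Toy stand-in: 1490 nodes, k = 2 supernodes, random ground truth and loading matrix.
rng = np.random.default_rng(0)
p, k = 1490, 2
truth = rng.integers(0, 2, size=p)
C = np.zeros((p, k))
C[np.arange(p), rng.integers(0, k, size=p)] = 1.0

pred = labels_from_loading(C)
print("correctly classified nodes:", int(round(two_way_accuracy(pred, truth) * p)))
```

With the loading matrix learned by FGC on Polblogs, this count is the 1250 reported above; the random C used here only illustrates the bookkeeping.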