# generative_graph_dictionary_learning__7f2d7775.pdf Generative Graph Dictionary Learning Zhichen Zeng 1 Ruike Zhu 1 Yinglong Xia 2 Hanqing Zeng 2 Hanghang Tong 1 Dictionary learning, which approximates data samples by a set of shared atoms, is a fundamental task in representation learning. However, dictionary learning over graphs, namely graph dictionary learning (GDL), is much more challenging than vectorial data as graphs lie in disparate metric spaces. The sparse literature on GDL formulates the problem from the reconstructive view and often learns linear graph embeddings with a high computational cost. In this paper, we propose a Fused Gromov-Wasserstein (FGW) Mixture Model named FRAME to address the GDL problem from the generative view. Equipped with the graph generation function based on the radial basis function kernel and FGW distance, FRAME generates nonlinear embedding spaces, which, as we theoretically proved, provide a good approximation of the original graph spaces. A fast solution is further proposed on top of the expectationmaximization algorithm with guaranteed convergence. Extensive experiments demonstrate the effectiveness of the obtained node and graph embeddings, and our algorithm achieves significant improvements over the state-of-the-art methods. 1. Introduction In the era of big data and AI, graphs originate from various domains carrying rich information. Finding lowdimensional representations encoding graph structural information, i.e., graph representation learning, is the key stepping stone behind various graph-based applications in bioinformatics (Ktena et al., 2017), chemistry (Jin et al., 2017), social networks (Yanardag & Vishwanathan, 2015), and many more. Dictionary learning, which seeks for low-dimensional repre- 1Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA 2Meta, CA, USA. Correspondence to: Hanghang Tong . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). sentations for data samples based on a set of shared patterns, namely atoms, has achieved great success in vectorial data. However, graph dictionary learning (GDL) is much more challenging as graphs lie in disparate spaces (Peyr e et al., 2016; Xu, 2020). Thanks to the recent advancement of optimal transport (OT), together with powerful graph distance measures based on the Wasserstein-like distances (M emoli, 2011; Sturm, 2012; Titouan et al., 2019), a few GDL methods (Xu, 2020; Vincent-Cuaz et al., 2022; 2021; Liu et al., 2023) have been recently proposed. Most of the existing GDL methods follow a reconstructive formulation by minimizing the Wasserstein-like distances between the original and reconstructed graphs, but may suffer from several fundamental limitations. First (linear embedding), graphs are approximated by the linear combination of atoms, which in turn leads to linear embeddings with limited representation power. Second (single-level embedding), the rich information in the OT coupling is rarely exploited by existing GDL methods to generate embeddings at other levels (e.g., node or subgraph level) with quantitatively identified cross-level relationships. Third (computation), although stochastic gradient descent (SGD) provides a scalable approach for large graphs, other accompanied optimization steps along SGD still bear intensive computation. Contributions. 
In this paper, we propose a novel GDL method named FRAME to address the aforementioned limitations. Using a graph generation function based on the Fused Gromov-Wasserstein (FGW) distance and the radial basis function (RBF) kernel, the GDL problem is formulated from the generative perspective by maximizing the likelihood of generating graphs from atoms, through which the nonlinear graph-atom relationship can be captured. Besides, by utilizing the accompanying node correspondence information, the proposed method can jointly generate graph, subgraph and node embeddings. A fast solution based on the expectation-maximization (EM) algorithm is further proposed, achieving quadratic time complexity with guaranteed convergence. Theoretical analysis shows that FRAME generates a low-dimensional embedding space that well approximates the original graph space with little information loss. Extensive experiments show that FRAME achieves significant improvements on graph-level and node-level tasks, outperforming the state-of-the-art by 8.0% on graph classification, 0.5% on graph clustering, and 2.5% on node clustering, respectively.

The rest of the paper is organized as follows. Section 2 introduces the preliminaries and formulates the GDL problem. Section 3 presents the proposed FRAME and provides relevant theoretical analyses. Extensive experiments are carried out in Section 4. Related works and conclusions are given in Sections 5 and 6, respectively.

2. Problem Formulation

2.1. Notations

We use bold uppercase letters for matrices (e.g., $\mathbf{A}$), bold lowercase letters for vectors (e.g., $\mathbf{s}$), calligraphic letters for sets (e.g., $\mathcal{G}$), and lowercase letters for scalars (e.g., $\alpha$). The element $(i, j)$ of a matrix $\mathbf{A}$ is denoted as $\mathbf{A}(i, j)$. The transpose of $\mathbf{A}$ is denoted by the superscript $T$ (e.g., $\mathbf{A}^T$). An attributed graph is represented as $G = \{\mathbf{A}, \mathbf{X}\}$, where $\mathbf{A}$ and $\mathbf{X}$ denote the adjacency matrix and the node attribute matrix, respectively. The simplex histogram with $n$ bins is denoted as $\Delta_n = \{\boldsymbol{\mu} \in \mathbb{R}^n_+ \mid \sum_{i=1}^n \boldsymbol{\mu}(i) = 1\}$. The probabilistic coupling is denoted as $\Pi(\cdot, \cdot)$, and the inner product is denoted as $\langle \cdot, \cdot \rangle$. For simplicity, we denote the set of positive integers no greater than $n$ as $\mathbb{N}^+_n$.

2.2. Optimal Transport on Graphs

The OT theory seeks the optimal coupling between distributions and has achieved great success in various graph-related tasks including graph comparison (Maretic et al., 2019; Titouan et al., 2019), graph matching (Xu et al., 2019b; Zeng et al., 2023), and graph representation learning (Kolouri et al., 2021). To adopt the OT theory for graphs, a graph is represented as a probability measure on the product space of graph structure and node attributes, where the elements indicate the importance of nodes. Without any prior knowledge on nodes, uniform node importance is often the default choice, and a graph $G$ with $n$ nodes is represented as a uniform histogram $\boldsymbol{\mu} = \frac{\mathbf{1}_n}{n} \in \Delta_n$ (Vincent-Cuaz et al., 2021).

Definition 2.1. Fused Gromov-Wasserstein distance (Peyré et al., 2016; 2019; Titouan et al., 2019).
Given two graphs $G_1, G_2$ represented by probability measures $\mu_1 = \sum_{i=1}^{n_1} h_i \delta_{x_i, \mathbf{X}_1(x_i)}$ and $\mu_2 = \sum_{j=1}^{n_2} g_j \delta_{y_j, \mathbf{X}_2(y_j)}$, where $\mathbf{h} \in \Delta_{n_1}$, $\mathbf{g} \in \Delta_{n_2}$ are histograms, a cross-graph matrix $\mathbf{M} \in \mathbb{R}^{n_1 \times n_2}$ measuring cross-graph node distances based on attributes, and two intra-graph matrices $\mathbf{C}_1 \in \mathbb{R}^{n_1 \times n_1}$, $\mathbf{C}_2 \in \mathbb{R}^{n_2 \times n_2}$ measuring intra-graph node similarity based on graph structure, the $q$-FGW distance $\mathrm{FGW}_{q,\alpha}(G_1, G_2)$ is defined as:

$$\min_{\mathbf{S} \in \Pi(\mu_1, \mu_2)} \sum_{x_1 \in G_1} \sum_{y_1 \in G_2} (1-\alpha)\, \mathbf{M}(x_1, y_1)^q\, \mathbf{S}(x_1, y_1) + \sum_{x_1, x_2 \in G_1} \sum_{y_1, y_2 \in G_2} \alpha\, |\mathbf{C}_1(x_1, x_2) - \mathbf{C}_2(y_1, y_2)|^q\, \mathbf{S}(x_1, y_1)\, \mathbf{S}(x_2, y_2) \quad (1)$$

where $q$ and $\alpha$ are the order of the FGW distance and the weight hyperparameter, respectively. Intuitively, Eq. (1) measures the minimum effort of transporting $G_1$ to $G_2$ based on the node attribute dissimilarities $\mathbf{M}(x_1, y_1)$ and the structure dissimilarities $|\mathbf{C}_1(x_1, x_2) - \mathbf{C}_2(y_1, y_2)|$. For simplicity, we omit the subscripts and write $\mathrm{FGW}(G_1, G_2)$ throughout the paper. The FGW distance serves as a powerful graph distance measure as both node attributes and graph structure are exploited (Titouan et al., 2019). Based on Definition 2.1, the FGW barycenter problem is defined as follows:

Definition 2.2. Fused Gromov-Wasserstein barycenter (Titouan et al., 2019). Given $N$ graphs $G_i$ with intra-graph matrices $\mathbf{C}_{G_i}$ and weights $\lambda_i$ for $i \in \mathbb{N}^+_N$ satisfying $\sum_{i=1}^N \lambda_i = 1$, the FGW barycenter problem aims to find the barycenter graph $B$ with intra-graph matrix $\mathbf{C}_B$ and node attribute matrix $\mathbf{X}_B$ such that the weighted sum of FGW distances between $G_i$ and $B$ is minimized:

$$\mathbf{C}_B, \mathbf{X}_B = \arg\min_{\mathbf{C}_B, \mathbf{X}_B} \sum_{i=1}^N \lambda_i \, \mathrm{FGW}(G_i, B). \quad (2)$$

In the GDL problem, we are interested in finding multiple barycenters $B_1, \ldots, B_K$, namely atoms, from the input graphs $G_1, \ldots, G_N$. For clarity, we slightly abuse the subscript $i$ to denote graph $G_i$ and $k$ to denote atom $B_k$ exclusively (e.g., $\mathbf{X}_i = \mathbf{X}_{G_i}$, $\mathbf{X}_k = \mathbf{X}_{B_k}$). Following a similar approach as (Titouan et al., 2019), the cross-graph and intra-graph matrices are defined as follows: we use the L2 norm between the attributes of nodes $x \in G_i$ and $y \in B_k$ as the cross-graph matrix $\mathbf{M}_{i,k}(x, y)$, and the adjacency matrices $\mathbf{A}_i$ as the intra-graph matrices $\mathbf{C}_i$.

2.3. Generative Graph Dictionary Learning

Unlike the existing methods that formulate the problem from the reconstructive perspective, we formulate the GDL problem from the generative perspective as follows:

Definition 2.3. Generative graph dictionary learning. Given $N$ graphs $G_i$ for $i \in \mathbb{N}^+_N$ and a graph generation function $p(G_i|B_k)$ indicating the probability of generating graph $G_i$ from atom $B_k$, the generative graph dictionary learning problem aims to find $K$ attributed atoms $B_k$ and corresponding prior probabilities $\pi_k = p(B_k)$ such that the log-likelihood of generating $G_i$ from $B_k$ for $i \in \mathbb{N}^+_N$, $k \in \mathbb{N}^+_K$ is maximized. Denoting the model parameters as $\Theta = \{B_1, \ldots, B_K, \pi_1, \ldots, \pi_K\}$, the objective function of generative graph dictionary learning is formulated as:

$$\Theta = \arg\max_\Theta \log p(G_1, \ldots, G_N|\Theta), \quad \text{s.t.}\; \sum_{k=1}^K \pi_k = 1. \quad (3)$$

Table 1. Comparison with existing GDL methods along the four desired properties (nonlinear, multi-level, size-free, supervision): GWF (Xu, 2020), srGW (Vincent-Cuaz et al., 2022), GDL (Vincent-Cuaz et al., 2021), RGWD (Liu et al., 2023), and the proposed FRAME. FRAME is the only method satisfying all four properties.

In this paper, we assume graphs are i.i.d. data samples generated by the mixture of atoms (i.e., $p(G_1, \ldots, G_N|\Theta) = \prod_{i=1}^N p(G_i|\Theta)$), and a graph $G_i$ is more likely to be generated by an atom $B_k$ with a smaller $\mathrm{FGW}(G_i, B_k)$, following the RBF kernel-based graph generation function:

$$p(G_i|B_k) = \exp(-\sigma \, \mathrm{FGW}(G_i, B_k)), \quad (4)$$

where $\sigma > 0$ is the length-scale parameter of the RBF kernel.
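To make Eqs. (1) and (4) concrete, the following is a minimal sketch of how the FGW distance between two attributed graphs and the corresponding generation probability could be evaluated with the POT library (Flamary et al., 2021), which follows the same $(1-\alpha)/\alpha$ weighting as Eq. (1); `square_loss` corresponds to $q = 2$. The function and parameter names (`fgw_distance`, `generation_prob`, `sigma`) are illustrative and this is not the authors' released code.

```python
# A minimal sketch (not the authors' implementation): evaluating the FGW
# distance of Eq. (1) and the generation probability of Eq. (4) with POT.
import numpy as np
import ot  # POT: Python Optimal Transport (Flamary et al., 2021)


def fgw_distance(A1, X1, A2, X2, alpha=0.5):
    """FGW distance between two attributed graphs with uniform node masses.

    A1, A2: adjacency matrices used as intra-graph matrices C1, C2.
    X1, X2: node attribute matrices; the cross-graph matrix M uses the L2 norm.
    """
    n1, n2 = A1.shape[0], A2.shape[0]
    mu1, mu2 = ot.unif(n1), ot.unif(n2)            # uniform histograms
    M = ot.dist(X1, X2, metric='euclidean')        # cross-graph attribute distances
    # square_loss corresponds to q = 2 in Eq. (1)
    return ot.gromov.fused_gromov_wasserstein2(
        M, A1, A2, mu1, mu2, loss_fun='square_loss', alpha=alpha)


def generation_prob(A_i, X_i, A_k, X_k, sigma=1.0, alpha=0.5):
    """RBF kernel-based generation probability p(G_i | B_k) of Eq. (4)."""
    return np.exp(-sigma * fgw_distance(A_i, X_i, A_k, X_k, alpha=alpha))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A1 = rng.integers(0, 2, (6, 6)); A1 = np.triu(A1, 1); A1 = (A1 + A1.T).astype(float)
    A2 = rng.integers(0, 2, (5, 5)); A2 = np.triu(A2, 1); A2 = (A2 + A2.T).astype(float)
    X1, X2 = rng.normal(size=(6, 3)), rng.normal(size=(5, 3))
    print(generation_prob(A1, X1, A2, X2, sigma=0.5))
```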
Before presenting our solution to the generative GDL problem in the next section, let us first summarize a few key desired properties of GDL. First (P1. multi-level embedding), the existing GDL methods almost exclusively focus on producing graph-level embeddings, while the rich node-level information in the accompanying OT coupling is rarely exploited. Ideally, the learned atoms should be able to generate multi-level embeddings with quantitatively identified relationships between different levels (Du & Tong, 2019). This will not only help discover cross-level correlations, but also broaden the applicability of GDL to node-level or subgraph-level learning tasks (e.g., node classification, link prediction, etc.). Second (P2. nonlinear embedding), most of the existing GDL methods generate linear embeddings, which limits their representation power. It is desirable to generate nonlinear embeddings without increasing the computational complexity. Third (P3. size-free), most of the reconstructive formulations (Vincent-Cuaz et al., 2021; Liu et al., 2023) learn atoms of the same size, but it is preferable to generate atoms of different sizes to capture features at multiple scales. Fourth (P4. incorporate supervision), existing GDL methods only focus on the unsupervised setting, while utilizing supervision from labelled data may significantly enhance the GDL performance. Table 1 summarizes and compares the existing GDL methods. As we can see, the proposed FRAME is the only method that enjoys all four desired properties.

3. Algorithm and Analysis

In this section, we introduce and analyze the proposed FRAME. The algorithm is first introduced in Section 3.1. Relevant analyses on embedding quality, convergence, and complexity are presented in Section 3.2. Further discussion and variants of FRAME are provided in Appendix B.

3.1. FRAME Algorithm

In this subsection, we present our proposed algorithm FRAME. The overall optimization framework to maximize Eq. (3) can be divided into the expectation and maximization steps. By leveraging the posterior probability and the optimal coupling, we can generate embeddings at multiple levels and reconstruct the original graphs from the embedding space.

Expectation. We first introduce a set of latent variables $z_{i,k}$ indicating whether the graph $G_i$ is generated from the atom $B_k$. In the expectation step, we focus on computing the posterior probability $\gamma_{i,k} = p(z_{i,k} = 1|G_i, \Theta)$, i.e., the probability of $G_i$ being generated by $B_k$ under the mixture model. For the $t$-th iteration, fixing the atoms $B^{(t-1)}_k$ and corresponding prior probabilities $\pi^{(t-1)}_k$, the posterior probability $\gamma^{(t)}_{i,k}$ can be computed by Bayes' rule as follows:

$$\gamma^{(t)}_{i,k} = \frac{\pi^{(t-1)}_k \, p\big(G_i|B^{(t-1)}_k\big)}{\sum_{j=1}^K \pi^{(t-1)}_j \, p\big(G_i|B^{(t-1)}_j\big)}. \quad (5)$$

Maximization. Equipped with the posterior probabilities $\gamma^{(t)}_{i,k}$ from the expectation step, the objective function in Eq. (3) is lower bounded by the following complete log-likelihood (Bishop & Nasrabadi, 2006):

$$Q(\Theta, \Theta^{(t-1)}) = \sum_{i=1}^N \sum_{k=1}^K p\big(z_{i,k}|G_i, \Theta^{(t-1)}\big) \log p(G_i, z_{i,k}|\Theta) = \sum_{i=1}^N \sum_{k=1}^K \gamma^{(t)}_{i,k} \left[\log \pi_k - \sigma \, \mathrm{FGW}(G_i, B_k)\right]. \quad (6)$$

In the maximization step, we focus on re-estimating the model parameters to maximize the above complete log-likelihood. First (maximization w.r.t. $\pi_k$), fixing the posterior probability $\gamma^{(t)}_{i,k}$ and the atom $B^{(t-1)}_k$, the prior probability $\pi^{(t)}_k$ can be calculated by

$$\pi^{(t)}_k = \frac{\sum_{i=1}^N p\big(z_{i,k}=1|G_i, \Theta^{(t-1)}\big)}{N} = \frac{\sum_{i=1}^N \gamma^{(t)}_{i,k}}{N}. \quad (7)$$
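Below is a small sketch of the expectation step in Eq. (5) and the prior update in Eq. (7), assuming the pairwise FGW distances between graphs and the current atoms have already been computed (e.g., with the helper sketched after Eq. (4)). It works in log-space for numerical stability; the variable and function names are illustrative, not part of the paper's released code.

```python
# A minimal sketch of the E-step (Eq. 5) and the prior update (Eq. 7).
import numpy as np
from scipy.special import logsumexp


def e_step(fgw, pi, sigma):
    """Posterior responsibilities gamma[i, k] = p(z_ik = 1 | G_i, Theta).

    fgw:   (N, K) array of FGW(G_i, B_k) distances under the current atoms.
    pi:    (K,) prior probabilities of the atoms.
    sigma: length-scale of the RBF generation kernel in Eq. (4).
    """
    # log p(G_i | B_k) = -sigma * FGW(G_i, B_k), so the unnormalized
    # log posterior is log pi_k - sigma * FGW(G_i, B_k).
    log_post = np.log(pi)[None, :] - sigma * fgw
    log_post -= logsumexp(log_post, axis=1, keepdims=True)  # normalize over atoms
    return np.exp(log_post)


def m_step_priors(gamma):
    """Prior update of Eq. (7): pi_k = (1/N) * sum_i gamma[i, k]."""
    return gamma.mean(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fgw = rng.random((8, 3))     # toy distances for N = 8 graphs, K = 3 atoms
    pi = np.full(3, 1.0 / 3)
    gamma = e_step(fgw, pi, sigma=2.0)
    print(gamma.sum(axis=1))     # each row sums to 1
    print(m_step_priors(gamma))  # updated priors, summing to 1
```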
Second (maximization w.r.t. $B_k = \{\mathbf{A}_k, \mathbf{X}_k\}$), fixing the posterior probability $\gamma^{(t)}_{i,k}$ and the prior probability $\pi^{(t)}_k$, the maximization w.r.t. $B_k$ in Eq. (6) can be formulated as

$$\min_{B_k} \sum_{i=1}^N \gamma^{(t)}_{i,k} \, \mathrm{FGW}(G_i, B_k). \quad (8)$$

Note that the above optimization problem corresponds to the FGW barycenter problem in Eq. (2) and can be efficiently solved by the block coordinate descent (BCD) algorithm (Ferradans et al., 2014; Titouan et al., 2019). Specifically, the objective function in Eq. (8) is minimized iteratively w.r.t. the optimal coupling $\mathbf{S}_{i,k}$ between $G_i$ and $B_k$, the atom adjacency matrix $\mathbf{A}_k$, and the atom attribute matrix $\mathbf{X}_k$. For the $l$-th iteration, the minimization w.r.t. the three variables is computed as follows.

(1) Fixing $\mathbf{A}^{(l-1)}_k$ and $\mathbf{X}^{(l-1)}_k$, the objective function in Eq. (8) is equivalent to calculating $NK$ FGW distances independently. For each FGW distance, a fast solution based on conditional gradient (CG) (Jaggi, 2013) is proposed by (Titouan et al., 2019) with convergence to a stationary point of the non-convex problem (Lacoste-Julien, 2016). Each CG subproblem is formulated as:

$$\min_{\mathbf{S} \in \Pi(\boldsymbol{\mu}_i, \boldsymbol{\mu}_k)} \left\langle \mathbf{G}^{(l)}_{i,k}, \mathbf{S} \right\rangle, \quad \mathbf{G}^{(l)}_{i,k} = \nabla_{\mathbf{S}} \mathrm{FGW} = (1-\alpha)\mathbf{M}^{(l-1)}_{i,k} + 2\alpha \mathbf{L}^{(l-1)}_{i,k} \otimes \mathbf{S}, \quad (9)$$

where $\otimes$ is the Kronecker product, and $\mathbf{L}^{(l-1)}_{i,k}$ is a 4-dimensional tensor with $\mathbf{L}^{(l-1)}_{i,k}(x_1, x_2, y_1, y_2) = |\mathbf{A}_i(x_1, x_2) - \mathbf{A}^{(l-1)}_k(y_1, y_2)|^q$ for $x_1, x_2 \in G_i$ and $y_1, y_2 \in B_k$. Note that each subproblem in Eq. (9) corresponds to an OT problem, which can be efficiently solved by the inexact proximal point method with guaranteed convergence to the global optimum (Xie et al., 2020; Xu et al., 2019b).

(2) Fixing $\mathbf{S}^{(l)}_{i,k}$ and $\mathbf{X}^{(l-1)}_k$, the optimal intra-graph matrix $\mathbf{A}_k$ can be computed based on the first-order optimality condition as follows (Peyré et al., 2016):

$$\mathbf{A}^{(l)}_k = \frac{\sum_{i=1}^N \gamma_{i,k} \, \mathbf{S}^{(l)T}_{i,k} \mathbf{A}_i \mathbf{S}^{(l)}_{i,k}}{\boldsymbol{\mu}_k \boldsymbol{\mu}_k^T}. \quad (10)$$

(3) Fixing $\mathbf{S}^{(l)}_{i,k}$ and $\mathbf{A}^{(l)}_k$, the objective function in Eq. (8) is quadratic w.r.t. $\mathbf{X}^{(l)}_k$, whose optimal solution can be computed as follows (Cuturi & Doucet, 2014):

$$\mathbf{X}^{(l)}_k = \mathrm{diag}\left(\frac{1}{\boldsymbol{\mu}_k}\right) \sum_{i=1}^N \gamma_{i,k} \, \mathbf{S}^{(l)T}_{i,k} \mathbf{X}_i. \quad (11)$$

Combining Eqs. (5)-(11), the overall algorithm of FRAME is given in Algorithm 1.

Algorithm 1 FRAME
1: Input: $N$ graphs $G_i = \{\mathbf{A}_i, \mathbf{X}_i\}$, number of nodes $n_k$ in the $K$ atoms, hyperparameters $\alpha$, $q$, $T$, $L$.
2: Randomly initialize atoms $\pi^{(0)}_k$, $\mathbf{A}^{(0)}_k$, $\mathbf{X}^{(0)}_k$;
3: Initialize marginal distributions $\boldsymbol{\mu}_i = \frac{\mathbf{1}_{n_i}}{n_i}$, $\boldsymbol{\mu}_k = \frac{\mathbf{1}_{n_k}}{n_k}$;
4: for $t \in \mathbb{N}^+_T$ do
5:   for $i \in \mathbb{N}^+_N$ and $k \in \mathbb{N}^+_K$ do
6:     Update cross- and intra-graph matrices $\mathbf{M}^{(t)}_{i,k}$, $\mathbf{L}^{(t)}_{i,k}$;
7:     Update posterior probability $\gamma^{(t)}_{i,k}$ by Eq. (5);
8:   end for
9:   for $k \in \mathbb{N}^+_K$ do
10:    Update prior probability $\pi^{(t)}_k$ by Eq. (7);
11:    Update $\mathbf{S}^{(t)}_{i,k}$, $\mathbf{A}^{(t)}_k$, $\mathbf{X}^{(t)}_k$ by running $L$ BCD iterations over Eqs. (9)-(11);
12:  end for
13: end for
14: return posterior probabilities $\gamma^{(T)}_{i,k}$, optimal couplings $\mathbf{S}^{(T)}_{i,k}$, learned atoms $B^{(T)}_k = \{\mathbf{A}^{(T)}_k, \mathbf{X}^{(T)}_k\}$.

Semi-supervised FRAME. Existing GDL methods (Xu, 2020; Vincent-Cuaz et al., 2021; 2022; Liu et al., 2023) solely focus on the unsupervised setting, while it would be beneficial to incorporate external supervision when a small portion of graph labels is available. FRAME can naturally incorporate such supervision based on the semi-supervised EM algorithm (Nigam et al., 2000), yielding the semi-supervised variant named ss-FRAME. Specifically, in the semi-supervised setting, we are given $N_u$ unlabelled graphs $G_1, \ldots, G_{N_u}$ and $N_l$ labelled graphs $G_{N_u+1}, \ldots, G_{N_u+N_l}$ with labels $y_{N_u+1}, \ldots, y_{N_u+N_l}$. Following a similar derivation as in the unsupervised setting, we can still adopt the maximization step in Eqs. (7)-(11) but with a modified expectation step:

$$\gamma^{(t)}_{i,k} = \begin{cases} \dfrac{\pi^{(t-1)}_k \, p\big(G_i|B^{(t-1)}_k\big)}{\sum_{j=1}^K \pi^{(t-1)}_j \, p\big(G_i|B^{(t-1)}_j\big)}, & \text{if } G_i \text{ is unlabelled} \\[2ex] 1, & \text{if } G_i \text{ is labelled and } y_i = k \\[1ex] 0, & \text{if } G_i \text{ is labelled and } y_i \neq k \end{cases}$$
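To make the maximization step (line 11 of Algorithm 1) more concrete, the sketch below performs one BCD pass for a single atom: the couplings of Eq. (9) are obtained here with POT's conditional-gradient FGW solver (the paper also mentions an inexact proximal point solver), and the atom structure and attributes are then updated following the spirit of Eqs. (10)-(11). The responsibilities are normalized over graphs before the update, which leaves the minimizer of Eq. (8) unchanged; names such as `update_atom` are illustrative, and this is a simplified reading rather than the authors' released code.

```python
# A minimal sketch (one BCD pass) of the maximization step w.r.t. an atom B_k.
import numpy as np
import ot


def update_atom(graphs, gamma_k, A_k, X_k, alpha=0.5):
    """One BCD pass over Eqs. (9)-(11) for a single atom.

    graphs:   list of (A_i, X_i) pairs (adjacency and attribute matrices).
    gamma_k:  (N,) responsibilities gamma[:, k] of the atom for each graph.
    A_k, X_k: current atom adjacency / attribute matrices.
    """
    n_k = A_k.shape[0]
    mu_k = ot.unif(n_k)
    # Normalizing the weights over graphs does not change the minimizer of Eq. (8).
    lam = gamma_k / gamma_k.sum()

    # (1) Couplings S_{i,k}: solve the FGW problem for each graph (Eq. 9).
    couplings = []
    for A_i, X_i in graphs:
        mu_i = ot.unif(A_i.shape[0])
        M = ot.dist(X_i, X_k, metric='euclidean')
        S = ot.gromov.fused_gromov_wasserstein(
            M, A_i, A_k, mu_i, mu_k, loss_fun='square_loss', alpha=alpha)
        couplings.append(S)

    # (2) Structure update (Eq. 10): weighted average of realigned adjacencies,
    #     divided element-wise by the outer product of the atom histogram.
    A_new = sum(l * S.T @ A_i @ S for l, S, (A_i, _) in zip(lam, couplings, graphs))
    A_new /= np.outer(mu_k, mu_k)

    # (3) Attribute update (Eq. 11): barycentric projection of node attributes.
    X_new = sum(l * S.T @ X_i for l, S, (_, X_i) in zip(lam, couplings, graphs))
    X_new /= mu_k[:, None]

    return A_new, X_new, couplings
```

For ss-FRAME, only the E-step changes (the responsibilities of labelled graphs are clamped to their one-hot labels); the atom update above is reused unchanged.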
Towards Nonlinear Embeddings. Different from existing GDL methods that generate linear representations of graphs based on the atoms (Vincent-Cuaz et al., 2022; 2021; Liu et al., 2023), the proposed FRAME generates nonlinear embeddings at the node, subgraph and graph levels. For node embeddings, by regarding the nodes in the atoms as the bases of the node embedding space and using the optimal coupling $\mathbf{S}_{i,k}(v_p, v_q)$ as the coordinate of node $v_p \in G_i$ w.r.t. the basis $v_q \in B_k$, the node embedding $\mathbf{z}_{v_p}$ of $v_p \in G_i$ is the concatenation of the optimal couplings:

$$\mathbf{z}_{v_p} = [\gamma_{i,1}\mathbf{S}_{i,1}(v_p, :), \; \cdots, \; \gamma_{i,K}\mathbf{S}_{i,K}(v_p, :)]. \quad (12)$$

Intuitively, $\mathbf{S}_{i,k}(v_p, v_q)$ indicates the conditional probability of nodes $v_p \in G_i$ and $v_q \in B_k$ given that $G_i$ is generated from $B_k$, i.e., $p(v_p, v_q|z_{i,k} = 1)$. Therefore, each element in the node embedding $\mathbf{z}_{v_p}$ corresponds to the marginal probability of $(v_p, v_q)$, that is,

$$p(v_p, v_q) = \sum_{z_{i,k}} p(v_p, v_q|z_{i,k}) \, p(z_{i,k}) = \gamma_{i,k} \mathbf{S}_{i,k}(v_p, v_q). \quad (13)$$

Similarly, we can generate graph embeddings by measuring the joint probability of graph $G_i$ and atom $B_k$, i.e., $p(G_i, B_k)$. To be specific, we perform a linear transformation on the node embeddings in Eq. (13): $\mathbf{w}_i = \sum_{v_p \in G_i} \mathbf{z}_{v_p} \mathbf{W}$, where $\mathbf{W} \in \mathbb{R}^{\left(\sum_{k=1}^K n_{B_k}\right) \times K}$ is the indicator matrix such that each element $\mathbf{W}(v_q, k)$ indicates whether the atom node $v_q$ belongs to the atom $B_k$:

$$\mathbf{W} = \begin{bmatrix} \mathbf{1}_{n_{B_1}} & \mathbf{0}_{n_{B_1}} & \cdots & \mathbf{0}_{n_{B_1}} \\ \mathbf{0}_{n_{B_2}} & \mathbf{1}_{n_{B_2}} & \cdots & \mathbf{0}_{n_{B_2}} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}_{n_{B_K}} & \mathbf{0}_{n_{B_K}} & \cdots & \mathbf{1}_{n_{B_K}} \end{bmatrix}$$

Owing to the coupling constraint $\mathbf{S}_{i,k} \in \Pi(\boldsymbol{\mu}_i, \boldsymbol{\mu}_k)$, the summation over the elements of $\mathbf{S}_{i,k}$ equals 1, and the graph embedding for $G_i$ can be further simplified as:

$$\mathbf{w}_i = [\gamma_{i,1}, \cdots, \gamma_{i,K}]. \quad (14)$$

Similarly, for a subgraph $\mathcal{S} \subseteq G_i$, we can generate the subgraph embedding as $\mathbf{w}_{\mathcal{S}} = \sum_{v_p \in \mathcal{S}} \mathbf{z}_{v_p} \mathbf{W}$. As stated before, we assume no prior knowledge on nodes and use uniform distributions to represent graphs. Nonetheless, when a mass distribution is available or can be learned (Vincent-Cuaz et al., 2022), stronger graph and node embeddings can be generated.

Graph Reconstruction. The row-normalized optimal coupling $\hat{\mathbf{S}}_{i,k} = n_i \mathbf{S}_{i,k}$ acts as a soft permutation matrix describing the node correspondence between $G_i$ and $B_k$. Therefore, we approximate the graph reconstructed from $B_k$, denoted as $\tilde{G}_{i,k} = \{\tilde{\mathbf{A}}_{i,k}, \tilde{\mathbf{X}}_{i,k}\}$, as the realigned matrices based on $\hat{\mathbf{S}}_{i,k}$:

$$\tilde{\mathbf{A}}_{i,k} = \hat{\mathbf{S}}_{i,k} \mathbf{A}_k \hat{\mathbf{S}}^T_{i,k}, \qquad \tilde{\mathbf{X}}_{i,k} = \hat{\mathbf{S}}_{i,k} \mathbf{X}_k. \quad (15)$$

Besides, the posterior probability $\gamma_{i,k}$ describes the correspondence between $G_i$ and $B_k$. Therefore, the overall reconstructed graph of $G_i$, denoted as $\tilde{G}_i = \{\tilde{\mathbf{A}}_i, \tilde{\mathbf{X}}_i\}$, is approximated by integrating the $\tilde{G}_{i,k}$ weighted by $\gamma_{i,k}$:

$$\tilde{\mathbf{A}}_i = \sum_{k=1}^K \gamma_{i,k} \tilde{\mathbf{A}}_{i,k}, \qquad \tilde{\mathbf{X}}_i = \sum_{k=1}^K \gamma_{i,k} \tilde{\mathbf{X}}_{i,k}. \quad (16)$$

It is worth mentioning that the reconstructed graph $\tilde{G}_i$ belongs to the same metric space as $G_i$, whereas existing GDL methods (Xu, 2020; Vincent-Cuaz et al., 2022; 2021) can only reconstruct graphs in the atom space. As we will show in the next subsection, the graph reconstructed by Eq. (16) provides a good approximation of the original graph, with an upper bound on the reconstruction error.
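The following sketch shows how Eqs. (12)-(16) turn the posteriors and couplings returned by Algorithm 1 into node, subgraph and graph embeddings and into a reconstructed graph. The helper names are illustrative, and this is a simplified reading of the equations (assuming uniform node masses), not the authors' released code.

```python
# A minimal sketch of the multi-level embeddings (Eqs. 12-14) and the
# graph reconstruction (Eqs. 15-16) for a single graph G_i.
import numpy as np


def node_embeddings(gamma_i, couplings_i):
    """Eq. (12): concatenate the posterior-weighted coupling rows over atoms.

    gamma_i:     (K,) posteriors gamma[i, :] for graph G_i.
    couplings_i: list of K couplings S_{i,k}, each of shape (n_i, n_Bk).
    """
    return np.concatenate(
        [g * S for g, S in zip(gamma_i, couplings_i)], axis=1)  # (n_i, sum_k n_Bk)


def graph_embedding(gamma_i):
    """Eq. (14): the graph embedding reduces to the posterior vector."""
    return np.asarray(gamma_i)


def subgraph_embedding(Z_i, node_idx, atom_sizes):
    """w_S = sum_{v in S} z_v W, with W the block indicator matrix of the atoms."""
    W = np.zeros((sum(atom_sizes), len(atom_sizes)))
    start = 0
    for k, n_bk in enumerate(atom_sizes):
        W[start:start + n_bk, k] = 1.0
        start += n_bk
    return Z_i[node_idx].sum(axis=0) @ W


def reconstruct(gamma_i, couplings_i, atoms):
    """Eqs. (15)-(16): posterior-weighted combination of realigned atoms.

    atoms: list of K atoms (A_k, X_k).
    """
    n_i = couplings_i[0].shape[0]
    A_rec = np.zeros((n_i, n_i))
    X_rec = np.zeros((n_i, atoms[0][1].shape[1]))
    for g, S, (A_k, X_k) in zip(gamma_i, couplings_i, atoms):
        S_hat = n_i * S                      # row-normalized coupling (uniform masses)
        A_rec += g * S_hat @ A_k @ S_hat.T   # Eq. (15), structure
        X_rec += g * S_hat @ X_k             # Eq. (15), attributes
    return A_rec, X_rec                      # Eq. (16)
```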
3.2. Theoretical Analysis

In this subsection, we provide theoretical analyses of the embedding quality, the reconstruction error, the overall convergence, and the time complexity. The proofs for all the theorems and propositions are provided in Appendix A.

Bound on Graph Embedding. To elucidate the quality of the graph embeddings, we present the following theorem connecting the embedding space with the graph space.

Theorem 3.1. For two attributed graphs $G_i$, $G_j$, the graph embeddings $\mathbf{w}_i$, $\mathbf{w}_j$ given by FRAME with the 1-FGW distance satisfy the following inequality:

$$\|\mathbf{w}_i - \mathbf{w}_j\|_1 \le \|\log \mathbf{w}_i - \log \mathbf{w}_j\|_1 \le 2K\sigma \, \mathrm{FGW}_{1,\alpha}(G_i, G_j),$$

where $K$ is the number of atoms and $\sigma$ is the length-scale parameter of the RBF kernel.

Generally speaking, the L1 norm between the graph embeddings is upper bounded by the 1-FGW distance between the original graphs, indicating that the graph embedding space provides a good proxy for the original graph space. We carry out further experiments in Section 4.4 to demonstrate the close correlation between the graph and embedding spaces.

Upper Bound on Reconstruction Error. To quantify the information loss during the embedding process, we provide the following upper bound on the reconstruction error.

Theorem 3.2. For an attributed graph $G_i$ and the reconstructed graph $\tilde{G}_i$ given by Eqs. (15) and (16), the 2-FGW distance between $G_i$ and $\tilde{G}_i$ satisfies the following inequality:

$$\mathrm{FGW}_{2,\alpha}(G_i, \tilde{G}_i) \le \sum_{k=1}^K \gamma_{i,k} \left[\alpha \, \Delta_{\mathrm{GW}}^k + (1-\alpha) \, \Delta_{\mathrm{W}}^k\right],$$

where $\Delta_{\mathrm{GW}}^k = \|\mathrm{diag}(\boldsymbol{\mu}_i)(\mathbf{A}_i - \tilde{\mathbf{A}}_{i,k})\|_F^2$ and $\Delta_{\mathrm{W}}^k = \|\mathrm{diag}(\boldsymbol{\mu}_i^{\frac{1}{2}})(\mathbf{X}_i - \tilde{\mathbf{X}}_{i,k})\|_F^2$.

To the best of our knowledge, we are the first to derive a reconstruction error bound for GDL. The upper bound suggests that the obtained embedding space retains the essential information in the original graph space. Experimental results in Section 4.4 show that FRAME achieves relatively low reconstruction errors.

Convergence Analysis. The EM algorithm is guaranteed to converge under exact maximization (Wu, 1983). For the sake of efficiency, it is preferable to evaluate the maximization step inexactly with a suboptimal solution. As shown in the following theorem, the objective function of FRAME is non-decreasing and converges even with inexact maximization.

Theorem 3.3. The objective function in Eq. (3) is non-decreasing and converges along the inexact EM process.

With the above convergence guarantee, in practice we only need to solve the maximization problem in Eq. (6) inexactly with a few BCD iterations, hence dramatically reducing the running time. Further empirical results on the convergence of FRAME are provided in Appendix C.2.

Time Complexity. Without loss of generality, we assume the input graphs share a comparable size that is greater than the shared size of the atoms (i.e., $O(n_G) \ge O(n_B)$ nodes and $O(m_G) \ge O(m_B)$ edges). We have the following time complexity analysis.

Proposition 3.4. The overall time complexity of FRAME is $O(TLNK(m_G n_B + n_G n_B^2))$, where $T$ is the number of EM iterations and $L$ is the number of BCD iterations.

Generally speaking, the time complexity of FRAME is (1) quadratic w.r.t. the number of atom nodes $n_B$, which is always much smaller than the number of graph nodes $n_G$, and (2) linear w.r.t. the number of edges of the input graphs, the number of input graphs, and the number of atoms. Sharing the same big-O time complexity, FRAME expands the existing GDL methods' (Xu, 2020; Vincent-Cuaz et al., 2022; 2021) ability to generate nonlinear embeddings without increasing the time complexity. In practice, FRAME achieves faster computation as the EM algorithm generally converges in fewer iterations than gradient descent schemes.

4. Experiments

We conduct extensive experiments to validate and verify the proposed FRAME from the following aspects: How to interpret the learned embeddings (Section 4.1)? How effective are the learned graph (Section 4.2) and node (Section 4.3) embeddings? How well does the learned embedding space represent the original graph space (Section 4.4)?

4.1.
Understanding the Learned Embeddings Experiment Setup. To better understand the relationship between input graphs and learned atoms, we apply FRAME on the synthetic graphs generated by the stochastic block model with {1,2,3,5} blocks. For each category, we generate (a) 3-dimensional space (b) 4-dimensional space Figure 1. Graph embedding spaces for synthetic graphs: for 1block, for 2-block, for 3-block, and for 5-block graphs. 50 graphs with node numbers randomly sampled between 10 to 20. Besides, for {1,2,3} blocks, 3 auxiliary graphs per category are generated as the supervision for ss-FRAME. We learn two embedding spaces spanned by 3 and 4 atoms with 12 nodes respectively. Results and Analysis. The generated graph embedding spaces are shown Fig. 1. The learned graph embeddings are of high quality as the learned atoms recover the block structures of the input graphs and similar graphs are clustered around corresponding atoms. For the 3-dimensional embedding space with less atoms than the graph classes, the 5-block graphs are underrepresented as a mixture of 2-block and 3-block graphs, i.e., partially belonging to both classes. However, when 4 atoms are learned, different graphs are clearly separated in the resulting graph embedding space. Based on this observation, we set the number of atoms to be equal to the number of graph classes in the following experiments. 4.2. Graph Classification and Clustering Experiment Setup. Three real-world datasets are considered to evaluate the graph embeddings, including ENZYMES (Borgwardt et al., 2005), IMDB-M (Yanardag & Vishwanathan, 2015) and PTC-MR (Kriege & Mutzel, 2012). Detailed dataset description can be found in Appendix D. Four types of baseline methods are considered, including (1) kernel-based methods: Random Walk (RW) (Kashima et al., 2003), Shortest Path (SP) (Borgwardt & Kriegel, 2005), Weisfeiler-Lehman (WL) (Shervashidze et al., 2011) and Pyramid Match (PM) (Nikolentzos et al., 2017) kernels, (2) embedding-based methods: FGSD (Verma & Zhang, 2017), LDP (Cai & Wang, 2018), Mr Mine (Du & Tong, 2019) and FEATHER (Rozemberczki & Sarkar, 2020), (3) OT-based methods: FGW (Titouan et al., 2019), WWL (Schulz et al., 2022) and Linear FGW (Nguyen & Tsuda, 2022) and (4) GDL-based methods: GWF (Xu, 2020), GDL (Vincent Cuaz et al., 2021) and sr GW (Vincent-Cuaz et al., 2022). For a fair comparison with other GDL methods, we fix the number of atoms to be the same as the number of graph Generative Graph Dictionary Learning Table 2. Rand Index on graph clustering (%). METHOD ENZYMES IMDB-M PTC-MR RW KERNEL 17.6 0.0 34.4 0.0 50.6 0.0 SP KERNEL 58.4 0.7 33.6 0.1 50.3 0.0 WL KERNEL 61.2 0.9 49.6 0.0 49.9 0.0 PM KERNEL 69.0 0.4 52.9 0.0 51.1 0.1 FGSD 61.5 0.0 53.4 0.0 50.4 0.0 LDP 44.5 0.0 40.6 0.0 50.6 0.0 FEATHER 65.0 0.0 50.7 0.0 51.0 0.0 FGW 59.3 0.0 52.1 0.0 49.9 0.0 WWL 24.1 0.0 40.3 5.7 50.6 0.1 LINEARFGW 57.9 0.5 44.4 0.7 50.9 0.0 GWF 48.8 1.7 55.4 0.4 51.5 0.8 GDL 66.5 2.7 53.2 0.3 51.6 0.1 SRGW 55.1 0.0 48.1 0.0 50.6 0.0 FRAME 69.3 2.3 54.8 0.5 52.1 0.4 classes (i.e., one representative atom per class), with a fixed atom size n B = 5. For graph classification, we apply 10fold cross-validation on the benchmark datasets. We use a SVM classifier for classification, and the results are assessed by the classification accuracy. For graph clustering, we apply spectral clustering on top of the graph embeddings, and the results are assessed by the Rand Index (Rand, 1971). Results and Analysis. The results of graph clustering are reported in Table 2. 
It is shown that FRAME achieves an up to 0.5% improvement over the best competitor in Rand Index. We also carried out experiments on graph classification to validate the effectiveness of the proposed semi-supervised FRAME, and results are shown in Table 3. It is shown that FRAME achieves comparable performance with other GDL methods. When supervision is incorporated, the semisupervised FRAME outperforms all baselines, including the semi-supervised embedding-based methods, with an up to 8.0% improvement. 4.3. Node Clustering Experiment Setup. We consider four real-world datasets to evaluate the node embeddings, including AIDS (Riesen & Bunke, 2008), ENZYMES (Borgwardt et al., 2005), PROTEINS (Borgwardt et al., 2005) and PTC-MR (Kriege & Mutzel, 2012). Detailed dataset description can be found in Appendix D. Six well-known node embedding methods are considered, including Deep Walk (Perozzi et al., 2014), Gra Rep (Cao et al., 2015), node2vec (Grover & Leskovec, 2016), Net MF (Qiu et al., 2018), Node Sketch (Yang et al., 2019) and Mr Mine (Du & Tong, 2019). We apply spectral clustering on top of the node embeddings and evaluate the cluster- Table 3. Accuracy on graph classification (%). METHOD ENZYMES IMDB-M PTC-MR SP KERNEL 27.3 5.3 35.0 2.1 55.8 5.8 RW KERNEL 20.7 3.6 35.8 3.5 55.8 0.9 WL KERNEL 35.3 5.1 49.5 3.2 54.4 9.6 PM KERNEL 23.7 5.7 44.1 4.3 52.1 6.9 FGSD 30.5 5.8 38.9 3.4 59.6 7.2 LDP 23.8 3.7 48.7 2.2 56.7 1.9 FEATHER 25.2 4.4 49.3 3.0 55.8 5.9 FGW 26.0 8.3 39.0 7.5 58.2 4.3 WWL 23.8 3.3 33.3 0.3 55.8 0.9 LINEARFGW 17.0 4.8 32.3 2.7 53.8 4.4 GWF 24.3 5.5 40.3 2.1 58.9 4.4 GDL 37.0 6.7 40.1 2.2 54.4 4.4 SRGW 38.3 5.9 45.6 2.2 59.0 6.8 FRAME 30.2 6.0 40.6 1.9 58.4 4.7 SS-FRAME 46.3 4.1 51.9 2.7 60.3 3.1 ing results by the Rand Index (Rand, 1971). Results and Analysis. The results of the node clustering are reported in Table 4. The proposed FRAME consistently outperforms all the competitors on all datases, achieving an up to 2.5% improvement in the Rand Index compared with the best competitor. Table 4. Rand Index on node clustering (%). METHOD AIDS ENZYMES PROTEINS PTC-MR DEEPWALK 52.7 0.3 50.1 0.1 50.4 0.1 50.0 0.2 GRAREP 47.4 0.0 48.3 0.0 49.4 0.0 50.3 0.0 NODE2VEC 43.6 0.1 48.2 0.0 49.5 0.6 50.0 0.2 NETMF 55.4 0.0 50.2 0.0 50.0 0.0 50.2 0.0 NODESKETCH 47.9 0.0 48.7 0.0 49.6 0.0 50.7 0.0 MRMINE 57.0 0.3 50.6 0.0 51.5 0.0 51.5 0.0 FRAME 57.2 0.1 53.1 0.1 53.5 0.0 51.8 0.2 4.4. Experimental Analysis Hyperparameter Study. We analyze the effects of the length-scale parameter σ on the graph embeddings, and the results are shown in Fig. 2. A larger σ generates more deterministic posterior probabilities γ and pushes data samples to the apices of the embedding space, resulting in sparser graph embeddings and clearer block structure in atoms. The above observation may help guide the model tuning process. When class-specific patterns exhibit in graphs, a large σ is recommended to identify class-specific patterns more deterministically. When common patterns are shared by graphs in different classes, a small σ might be preferred to represent graphs as the mixture of a set of patterns. Generative Graph Dictionary Learning Figure 2. Hyperparameter study on σ: for 1-block graphs, for 2-block graphs, for 3-block graphs, and for 5-block graphs. Figure 3. Comparison between the graph space and the embedding space. Each point corresponds to a pair of graphs. Graph Embedding Quality. 
We first evaluate the correlation between the graph and embedding spaces by comparing the FGW distance between graph pairs and the distance between corresponding embeddings. As the results shown in Fig. 3, the learned embedding space well approximates the original graph space with a Pearson coefficient over 0.95. Besides, we compare the graph reconstruction error measured by the FGW distance between the original graph G and the reconstructed graph G of different GDL methods, and results are shown in Fig. 4. For one thing, FRAME achieves the fastest computation thanks to the fast convergence of the EM algorithm. For another, the nonlinear GWF and FRAME achieve smaller reconstruction errors compared with the linear GDL method, validating the necessity of imposing nonlinear graph-atom relationship. Besides, GDL performs relatively poor on two densely-connected datasets DBLP and IMDB-M (refer to data statistics in Table 5), indicating that complicated connectivity patterns may not be well-captured by the linear GDL methods. 5. Related Works Graph Representation Learning. Extensive efforts have been made in graph representation learning including graph kernel and graph embedding. Graph kernel (Kashima et al., 2003; Borgwardt & Kriegel, 2005; Shervashidze et al., 2011) provides a powerful graph similarity measure by computing the inner product in reproducing kernel Hilbert space (Sch olkopf et al., 2002). However, most of the kernel methods require hand-crafted features or predefined rules, Figure 4. Average graph reconstruction error and running time of different GDL methods. resulting in fixed representations that can not be adapted to specific dataset (Vincent-Cuaz et al., 2021). Some graph embedding methods (Perozzi et al., 2014; Grover & Leskovec, 2016) leverage language models to generate representations based on truncated random walks, and some methods (Ribeiro et al., 2017; Donnat et al., 2018) learn embeddings by exploring node structural roles. Graph neural networks (e.g., (Kipf & Welling, 2017; 2016) and its many follow-ups) provide an end-to-end learning framework with promising performance. However, a large amount of labeled data is often required for training and the generated representations are in general not interpretable. Besides, most of the existing work focuses on single graph embedding and may suffer from the embedding space disparity issue when dealing with multiple graphs (Du & Tong, 2019). Graph Dictionary Learning. Dictionary learning (Mallat, 1999; Mairal et al., 2009; Schmitz et al., 2018) aims to embed data samples into a linear subspace spanned by a set of shared atoms and has achieved great success in vectorial data with wide applications in classification (Raina et al., 2007; Mairal et al., 2009), clustering (Ramirez et al., 2010) and domain adaptation (Ni et al., 2013; Yang et al., 2018). It is of great interest to explore its potential to graph representation learning, but relatively sparse literature exists as graphs lie in disparate metric spaces. The GDL task is formulated from different aspects. Early works focus on the single-graph setting by factorizing the node attribute matrix with a consistency regularization between embeddings and graph topology (Thanou et al., 2014; Generative Graph Dictionary Learning Yankelevsky & Elad, 2016). Recently, GDL is revisited from the reconstructive perspective, which essentially minimizes the Gromov-Wasserstein (GW) discrepancy between the input graphs and corresponding approximations in the atom space. 
GWF (Xu, 2020) utilizes the GW barycenter as the nonlinear graph approximation, but the resulting bi-level optimization bears a high computational cost. To scale to large graph datasets, GDL (Vincent-Cuaz et al., 2021) approximates input graphs by the linear combination of atoms with an easier-to-solve GW unmixing problem. sr GW (Vincent-Cuaz et al., 2022) learns the OT coupling and the target mass distribution simultaneously to avoid pre-defined target mass distribution and further improves the computational efficiency. More recently, RGWD (Liu et al., 2023) proposes a robust GW discrepancy following a minimax formulation, based on which, a GDL method is developed to learn from noisy graph data. In our work, the GDL task is formulated from the generative perspective by maximizing the likelihood of generating input graphs from the learned atoms. Compared with the existing GDL methods, the generative formulation avoids explicit optimization of the graph embedding (e.g., GW unmixing in (Vincent-Cuaz et al., 2021)) thanks to the closedloop solution to the expectation step. Besides, by virtue of the fast convergence of EM algorithm, FRAME learns nonlinear embeddings with an empirically faster computational speed. In addition, owing to probabilistic nature underlying the generative approach, multi-level embeddings can be generated with quantitatively identified cross-level relationships. Despite different variants of the GDL formulation, the optimization frameworks essentially follow the BCD optimization scheme by sequentially optimizing the atom (e.g., SGD in GDL and the maximization step in FRAME) and the graph embedding (e.g., GW unmixing in GDL and the expectation step in FRAME). 6. Conclusion In this paper, we study the graph dictionary learning problem from the generative perspective to learn nonlinear embeddings at multiple levels. An efficient algorithm named FRAME is proposed based on the expectation-maximization algorithm. The proposed FRAME enjoys the same time complexity as the existing GDL with a faster empirical running time. Theoretical analysis shows that the learned embedding space well approximates the original graph space and captures essential graph information with an upper bound on the graph reconstruction error. Extensive experiments on both graph and node-level tasks demonstrate the effectiveness of the proposed FRAME. Acknowledgements ZZ, RZ and HT are partially supported by NSF (1947135, 1939725, and 2134079), DARPA (HR001121C0165), NIFA (2020-67021-32799), DHS (17STQAC00001-06-00), and ARO (W911NF2110088). The content of the information in this document does not necessarily reflect the position or the policy of the Government or Amazon, and no official endorsement should be inferred. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on. Bishop, C. M. and Nasrabadi, N. M. Pattern recognition and machine learning, volume 4. Springer, 2006. Borgwardt, K. M. and Kriegel, H.-P. Shortest-path kernels on graphs. In Fifth IEEE International Conference on Data Mining (ICDM 05), pp. 8 pp. IEEE, 2005. Borgwardt, K. M., Ong, C. S., Sch onauer, S., Vishwanathan, S., Smola, A. J., and Kriegel, H.-P. Protein function prediction via graph kernels. Bioinformatics, 21(suppl 1): i47 i56, 2005. Cai, C. and Wang, Y. A simple yet effective baseline for non-attributed graph classification. ar Xiv preprint ar Xiv:1811.03508, 2018. Cao, S., Lu, W., and Xu, Q. 
Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 891 900, 2015. Chen, J., Zhu, J., Teh, Y. W., and Zhang, T. Stochastic expectation maximization with variance reduction. Advances in Neural Information Processing Systems, 31, 2018. Chen, L., Gan, Z., Cheng, Y., Li, L., Carin, L., and Liu, J. Graph optimal transport for cross-domain alignment. In International Conference on Machine Learning, pp. 1542 1553. PMLR, 2020. Cuturi, M. and Doucet, A. Fast computation of wasserstein barycenters. In International Conference on Machine Learning, pp. 685 693. PMLR, 2014. Donnat, C., Zitnik, M., Hallac, D., and Leskovec, J. Learning structural node embeddings via diffusion wavelets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1320 1329, 2018. Generative Graph Dictionary Learning Du, B. and Tong, H. Mrmine: Multi-resolution multinetwork embedding. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 479 488, 2019. Ferradans, S., Papadakis, N., Peyr e, G., and Aujol, J.-F. Regularized discrete optimal transport. SIAM Journal on Imaging Sciences, 7(3):1853 1882, 2014. Flamary, R., Courty, N., Gramfort, A., Alaya, M. Z., Boisbunon, A., Chambon, S., Chapel, L., Corenflos, A., Fatras, K., Fournier, N., Gautheron, L., Gayraud, N. T., Janati, H., Rakotomamonjy, A., Redko, I., Rolet, A., Schutz, A., Seguy, V., Sutherland, D. J., Tavenard, R., Tong, A., and Vayer, T. Pot: Python optimal transport. Journal of Machine Learning Research, 22(78):1 8, 2021. URL http: //jmlr.org/papers/v22/20-451.html. Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855 864, 2016. Helma, C., King, R. D., Kramer, S., and Srinivasan, A. The predictive toxicology challenge 2000 2001. Bioinformatics, 17(1):107 108, 2001. Jaggi, M. Revisiting frank-wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pp. 427 435. PMLR, 2013. Jin, W., Coley, C., Barzilay, R., and Jaakkola, T. Predicting organic reaction outcomes with weisfeiler-lehman network. Advances in Neural Information Processing Systems, 30, 2017. Kashima, H., Tsuda, K., and Inokuchi, A. Marginalized kernels between labeled graphs. In International Conference on Machine Learning, pp. 321 328, 2003. Kipf, T. N. and Welling, M. Variational graph auto-encoders. ar Xiv preprint ar Xiv:1611.07308, 2016. Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum? id=SJU4ay Ygl. Kolouri, S., Naderializadeh, N., Rohde, G. K., and Hoffmann, H. Wasserstein embedding for graph learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum? id=AAes_3W-2z. Kriege, N. and Mutzel, P. Subgraph matching kernels for attributed graphs. In International Conference on Machine Learning, pp. 291 298, 2012. Ktena, S. I., Parisot, S., Ferrante, E., Rajchl, M., Lee, M., Glocker, B., and Rueckert, D. Distance metric learning using graph convolutional networks: Application to functional brain networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 469 477. Springer, 2017. 
Lacoste-Julien, S. Convergence rate of frank-wolfe for non-convex objectives. ar Xiv preprint ar Xiv:1607.00345, 2016. Liu, W., Xie, J., Zhang, C., Yamada, M., Zheng, N., and Qian, H. Robust graph dictionary learning. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=qx Rsces Ar BZ. Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online dictionary learning for sparse coding. In International Conference on Machine Learning, pp. 689 696, 2009. Mallat, S. A wavelet tour of signal processing. Elsevier, 1999. Maretic, H. P., Gheche, M. E., Chierchia, G., and Frossard, P. Got: an optimal transport framework for graph comparison. Advances in Neural Information Processing Systems, 32, 2019. Maretic, H. P., El Gheche, M., Chierchia, G., and Frossard, P. Fgot: Graph distances based on filters and optimal transport. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 7710 7718, 2022. Mc Lachlan, G. J. and Rathnayake, S. On the number of components in a gaussian mixture model. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5):341 355, 2014. Mc Lachlan, G. J., Lee, S. X., and Rathnayake, S. I. Finite mixture models. Annual Review of Statistics and Its Application, 6:355 378, 2019. M emoli, F. Gromov wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11:417 487, 2011. Morris, C., Kriege, N. M., Bause, F., Kersting, K., Mutzel, P., and Neumann, M. Tudataset: A collection of benchmark datasets for learning with graphs. ar Xiv preprint ar Xiv:2007.08663, 2020. Nguyen, D. H. and Tsuda, K. On a linear fused gromovwasserstein distance for graph structured data. ar Xiv preprint ar Xiv:2203.04711, 2022. Ni, J., Qiu, Q., and Chellappa, R. Subspace interpolation via dictionary learning for unsupervised domain adaptation. Generative Graph Dictionary Learning In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013. Nigam, K., Mc Callum, A. K., Thrun, S., and Mitchell, T. Text classification from labeled and unlabeled documents using em. Machine Learning, 39(2):103 134, 2000. Nikolentzos, G., Meladianos, P., and Vazirgiannis, M. Matching node embeddings for graph similarity. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2429 2435, 2017. Pan, S., Zhu, X., Zhang, C., and Philip, S. Y. Graph stream classification using labeled and unlabeled graphs. In 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp. 398 409. IEEE, 2013. Pearson, K. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71 110, 1894. Perozzi, B., Al-Rfou, R., and Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701 710, 2014. Peyr e, G., Cuturi, M., and Solomon, J. Gromov-wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning, pp. 2664 2672. PMLR, 2016. Peyr e, G., Cuturi, M., et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355 607, 2019. Qiu, J., Dong, Y., Ma, H., Li, J., Wang, K., and Tang, J. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 459 467, 2018. 
Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A. Y. Self-taught learning: transfer learning from unlabeled data. In International Conference on Machine learning, pp. 759 766, 2007. Ramirez, I., Sprechmann, P., and Sapiro, G. Classification and clustering via dictionary learning with structured incoherence and shared features. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3501 3508. IEEE, 2010. Rand, W. M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336):846 850, 1971. Ribeiro, L. F., Saverese, P. H., and Figueiredo, D. R. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 385 394, 2017. Riesen, K. and Bunke, H. Iam graph database repository for graph based pattern recognition and machine learning. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pp. 287 297. Springer, 2008. Rozemberczki, B. and Sarkar, R. Characteristic functions on graphs: Birds of a feather, from statistical descriptors to parametric models. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 1325 1334, 2020. Rozemberczki, B., Kiss, O., and Sarkar, R. Karate club: an api oriented open-source python framework for unsupervised learning on graphs. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3125 3132, 2020. Schmitz, M. A., Heitz, M., Bonneel, N., Ngole, F., Coeurjolly, D., Cuturi, M., Peyr e, G., and Starck, J.-L. Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear dictionary learning. SIAM Journal on Imaging Sciences, 11(1):643 678, 2018. Sch olkopf, B., Smola, A. J., Bach, F., et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002. Schulz, T. H., Horv ath, T., Welke, P., and Wrobel, S. A generalized weisfeiler-lehman graph kernel. Machine Learning, pp. 1 29, 2022. Shervashidze, N., Schweitzer, P., Van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(9), 2011. Siglidis, G., Nikolentzos, G., Limnios, S., Giatsidis, C., Skianis, K., and Vazirgiannis, M. Grakel: A graph kernel library in python. The Journal of Machine Learning Research, 21(1):1993 1997, 2020. Sturm, K.-T. The space of spaces: curvature bounds and gradient flows on the space of metric measure spaces. ar Xiv preprint ar Xiv:1208.0434, 2012. Thanou, D., Shuman, D. I., and Frossard, P. Learning parametric dictionaries for signals on graphs. IEEE Transactions on Signal Processing, 62(15):3849 3862, 2014. Generative Graph Dictionary Learning Titouan, V., Courty, N., Tavenard, R., and Flamary, R. Optimal transport for structured data with application on graphs. In International Conference on Machine Learning, pp. 6275 6284. PMLR, 2019. Togninalli, M., Ghisu, E., Llinares-L opez, F., Rieck, B., and Borgwardt, K. Wasserstein weisfeiler-lehman graph kernels. Advances in Neural Information Processing Systems, 32, 2019. Verma, S. and Zhang, Z.-L. Hunt for the unique, stable, sparse and fast feature learning on graphs. Advances in Neural Information Processing Systems, 30, 2017. Vincent-Cuaz, C., Vayer, T., Flamary, R., Corneli, M., and Courty, N. 
Online graph dictionary learning. In International Conference on Machine Learning, pp. 10564 10574. PMLR, 2021. Vincent-Cuaz, C., Flamary, R., Corneli, M., Vayer, T., and Courty, N. Semi-relaxed gromov-wasserstein divergence and applications on graphs. In International Conference on Learning Representations, 2022. URL https:// openreview.net/forum?id=RSha Mexjc-x. Wu, C. J. On the convergence properties of the em algorithm. The Annals of statistics, pp. 95 103, 1983. Xie, Y., Wang, X., Wang, R., and Zha, H. A fast proximal point method for computing exact wasserstein distance. In Uncertainty in Artificial Intelligence, pp. 433 453. PMLR, 2020. Xu, H. Gromov-wasserstein factorization models for graph clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 6478 6485, 2020. Xu, H., Luo, D., and Carin, L. Scalable gromov-wasserstein learning for graph partitioning and matching. Advances in Neural Information Processing Systems, 32, 2019a. Xu, H., Luo, D., Zha, H., and Duke, L. C. Gromovwasserstein learning for graph matching and node embedding. In International Conference on Machine Learning, pp. 6932 6941. PMLR, 2019b. Yanardag, P. and Vishwanathan, S. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365 1374, 2015. Yang, B., Ma, A. J., and Yuen, P. C. Domain-shared groupsparse dictionary learning for unsupervised domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7453 7460, 2018. Yang, D., Rosso, P., Li, B., and Cudre-Mauroux, P. Nodesketch: Highly-efficient graph embeddings via recursive sketching. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1162 1172, 2019. Yankelevsky, Y. and Elad, M. Dual graph regularized dictionary learning. IEEE Transactions on Signal and Information Processing over Networks, 2(4):611 624, 2016. Zeng, Z., Zhang, S., Xia, Y., and Tong, H. Parrot: Positionaware regularized optimal transport for network alignment. In Proceedings of the ACM Web Conference 2023, pp. 372 382, 2023. Generative Graph Dictionary Learning Theorem 3.1. For two attributed graphs Gi, Gj, the graph embeddings wi, wj given by FRAME with the 1-FGW distance satisfy the following inequality: wi wj 1 log wi log wj 1 2KσFGW1,α(Gi, Gj), where K is the number of atoms and σ is the length-scale parameter of the RBF kernel. Proof. For simplicity, we use di,j to denote FGW1,α(Gi, Bk). For the k-th element in wi and wj, i.e., γi,k and γj,k, according to Eq. (5), we have: | log γi,k log γj,k| = |σ(dj,k di,k) + log PK m=1 πme σdj,m PK m=1 πme σdi,m | |σ(dj,k di,k)| + | log PK m=1 πme σdj,m PK m=1 πme σdi,m | We denote the most biased barycenter as Bq satisfying q = arg maxm e σ(di,m dj,m), then we have: | log γi,k log γj,k| |σ(dj,k di,k)| + | log πqe σdj,q πqe σdi,q | = |σ(dj,k di,k)| + |σ(di,q dj,q)| where the first inequality is due to q = arg maxm e σ(di,m dj,m) and the second inequality is due to the triangle inequality of 1-FGW distance (Titouan et al., 2019). Besides, as γi,k 1, we have |γi,k γj,k| | log γi,k log γj,k|. Therefore, we can bound the L1 norm of graph embeddings by the corresponding FGW distance as follows: wi wj 1 log wi log wj 1 2KσFGW1,α(Gi, Gj), where K is the number of barycenters and σ is the length-scale parameter of the RBF kernel. Theorem 3.2. For an attributed graph Gi and the reconstructed graph Gi by Eqs. 
(15) and (16), the 2-FGW distance between Gi and Gi satisfies the following inequality: FGW2,α(Gi, Gi) k=1 γi,k [α GWk + (1 α) Wk] , where GWk = diag(µi)(Ai Ai,k) 2 F and Wk = diag(µ 1 2 i )(Xi Xi,k) 2 F . Proof. For simplicity, we omit the graph index i and use subscripts k, l to index atoms Bk, Bl. Given G = {A, X} and G = { A, X}, the 2-FGW distance has the following inner product form (Peyr e et al., 2016; Vincent-Cuaz et al., 2021): FGW2,α(G, G) = min S Π(µ,µ) (1 α)M + αL, S ( M = diag(XX T)1 T + 1diag( X X T) 2X X T L = A2µ1 T + 1µ T A2 2AS A , where diag(X) of a matrix X means taking the diagonal value of matrix X and diag(µ) of a vector µ means formulating a diagonal matrix with µ as the diagonal values. Note that G and G have the same number of nodes, hence µG = µ G = µ. For simplicity, we use Dµ to denote diag(µ). Generative Graph Dictionary Learning Since Dµ Π(µ, µ) is an suboptimal solution to FGW(G, G), we have: FGW2,α(G, G) (1 α)M + αL, Dµ = (1 α) diag(XX T)1 T + 1diag( X X T) 2X X T, Dµ | {z } Wasserstein distance + α A2µ1 T + 1µ T A2 2ADµ A, Dµ | {z } Gromov-Wasserstein distance We first check the Wasserstein distance term. Owing to the marginal constraint Dµ1 = 1DT µ = µ and the trace property Tr(XT 1DµX2Dµ) = Tr(X1 X2µµT), we have: diag(XX T)1 T + 1diag( X X T) 2X X T, Dµ 1 2µ X X TD 1 2µ 2X X TD T µ k,l=1 γkγl D 1 2µ Xk X T l D 1 2µ 2X X TD T µ k,l=1 γkγl Xk Xlµ 1 2 µ 1 2 T 2X X TD T µ k γk X2 kµ 1 2 µ 1 2 T 2X X TD T µ k,l=1 γkγl Xk Xl k=1 γk X2 k | {z } non-positive µ 1 2 µ 1 2 T 1 2µ Xk X T k D k=1 γk X X T k D T µ 1 2µ(X Xk) 2 F where the non-positivity is owing to the following inequality: k,l=1 γkγl Xk(i, j) Xl(i, j) 1 2γkγl Xk(i, j)2 + Xl(i, j)2 k=1 γk Xk(i, j)2 + l=1 γl Xl(i, j)2 ! k=1 γk X2 k(i, j) Generative Graph Dictionary Learning We then check the Gromov-Wasserstein distance term. Similarly, we have: A2µ1 T + 1µ T A2 2ADµ A, Dµ = Tr A2µµ T + µµ T A2 2ADµ AD T µ k,l=1 γkγl Ak Alµµ T k=1 2γk ADµ Ak D T µ k γk A2 kµµ T k=1 2γk ADµ Ak D T µ k,l=1 γkγl Al Ak k=1 γk A2 k | {z } non-positive k γk Dµ Ak A T k Dµ 2 k=1 γk ADµ Ak D T µ k=1 γk Dµ(A Ak) 2 F Combine Eq. (17) and (18), we prove that FGW2,α(G, G) k=1 γk h α Dµ(A Ak) 2 F + (1 α) D 1 2µ(X Xk) 2 F i . Theorem 3.3. The objective function in Eq. (3) is non-decreasing and converges along the inexact EM process. Proof. For simplicity, we denote the set of input graphs as G = {G1, , GN} and we use superscript (t) to denote parameters in the t-th EM iteration. It is shown that for two set of parameters Θ and Θ(t), the difference between the two objective functions in Eq. (3) is lower bounded as follows (Wu, 1983): L(Θ) L(Θ(t)) Q(Θ, Θ(t)) Q(Θ(t), Θ(t)) where Q(Θ, Θ(t)) = P Z p(Z|G, Θ(t)) log p(G, Z|Θ) is the maximization objective evaluated with parameters at the t-th EM iteration shown in Eq. (6). We next show that Q(Θ(t), Θ(t)) Q(Θ(t+1), Θ(t)) even with inexact maximization, i.e., finding a suboptimal solution to Eq. (6) with a few iterations. We use Q(t) (l) to denote the value of Q(π(t) (l), S(t) (l), A(t) (l), X(t) (l), γ(t)) at the l-th BCD iteration and t-th maximization step. Since Eq. (7)-(11) provides optimal maximization w.r.t. π, S, A and X (Bishop & Nasrabadi, 2006; Xie et al., 2020), the value of Q is non-decreasing along the BCD iteration, i.e., Q(t) (l) Q(t) (l+1). 
Besides, by setting the initial values for the BCD iteration as A(t+1) (0) = A(t), X(t+1) (0) = X(t), we have: Q(Θ(t), Θ(t)) = Q(t+1) (0) Q(t+1) (L) = Q(Θ(t+1), Θ(t)) which means L(Θ(t+1)) L(Θ(t)) 0 always holds even if Q(t) (L) is not the exact optimal solution to the maximization problem in Eq. (6). Therefore, we prove that the objective function in Eq. (3) is non-decreasing and converges along the inexact EM process of FRAME. Proposition 3.4. The overall time complexity of FRAME is O(TLNK(m Gn B + n Gn2 B)), where T is the number of EM iterations and L is the number of BCD iterations. Generative Graph Dictionary Learning Proof. Note that graphs Gi have sparse adjacency matrix with m Gi non-zero elements for fast computation. With N attributed input graphs with d node features, K barycenters, T expectation-maximization iterations, L block coordinate descent iterations and M conditional gradient iterations, we have the following analysis. Calculating M(t) i,k, |Ci,k(x1, x2) Ci,k(y1, y2)|q and FGW(t)(Gi, Bk) in Eq. (1) require O(n2 Gd), O(m Gn B) and O(n Gn B) complexity. Calculating all the γ(t) i,k and π(t) k in Eq. (5) and (7) both require O(NK) in total. Calculating S(t) i,k in Eq. (9) requires O(Mn Gn B) complexity. Calculating A(t) k and X(t) k in Eq. (10) and (11) require O(Nn Gn2 B) and O(Ndn Gn B) complexity respectively. Ignoring non-trivial terms, the overall complexity of FRAME is O(TLNK(m Gn B + n Gn2 B)). B. Variants and Discussion Graph Generation Kernel. The RBF kernel is adopted in FRAME for three reasons. First, the exponential term in Eq. (4) amplifies the gap between graph distances with large σ, resulting in sparse embeddings. Second, the RBF kernel maps the origin graphs into an infinite dimensional space, hence generating highly nonlinear embeddings. Third, with the RBF kernel, the maximization step corresponds to the FGW barycenter problem and can be efficiently solved by the BCD iteration (Titouan et al., 2019). Other exponential kernels such as the Gaussian kernel and the squared exponential kernel can also be adopted with minor modifications. Besides, it is also feasible to adopt other types of non-exponential kernel functions, but the resulting maximization problem may require the computationally more costly gradient descent for optimization. Reducing Atom Redundancy. It is desirable to generate discriminant atoms capturing diverse graph patterns to reduce redundancy. Therefore, we propose a regularization term that minimizes the mutual likelihood among atoms as follows l=1 log p(Bk|Bl) (19) And the resulting objective function is formulated as arg max πk,Bk k=1 γi,k[log πk σFGW(Gi, Bk)] + β σFGW(Bk, Bl) To solve the regularized GDL problem, we can follow the similar EM process but with a modified maximization step. Specifically, we can still adopt Eq. (7) to optimize πk as the regularization term is decoupled with πk. To optimize Bk, we follow the BCD iteration where all the variables except Bk are fixed, and the optimization problem is formulated as i=1 γi,k FGW(Gi, Bk) β l=1 FGW(Bl, Bk) The above problem can be regarded as a modified FGW barycenter problem with negative distances. We can still adopt the CG in Eq. (9) with a modified gradient to optimize Si,k. However, the negative terms would violate the positive semidefinite prerequisite (Peyr e et al., 2016) for applying Eq. (10) and destroy the quadratic approximation (Cuturi & Doucet, 2014) in Eq. (11). Therefore, we may not follow the closed-loop solutions in Eqs. 
C. Additional Experiments

C.1. Scalability Analysis

We conduct a scalability analysis on synthetic Erdős-Rényi random graphs, and the results are shown in Fig. 5. As shown in Fig. 5(a), the running time is sublinear for small graphs and linear for large graphs w.r.t. the number of graph edges $m_G$. This is mainly because the running time is dominated by the $O(n_G n_B^2)$ term for small graphs. Besides, in Fig. 5(b), the running time is superlinear w.r.t. the number of atom nodes $n_B$. In Figs. 5(c) and 5(d), the running time is linear w.r.t. the number of graphs $N$ and the number of atoms $K$, which is consistent with our analysis in Proposition 3.4.

Figure 5. Scalability analysis: (a) running time vs. graph size; (b) running time vs. atom size; (c) running time vs. number of graphs; (d) running time vs. number of atoms.

C.2. Graph Embedding Space Evolution

We visualize the evolution of the graph embedding space along the EM training process in Fig. 6. Along the EM optimization process, the cluster structure gradually emerges: similar graphs are closely clustered while dissimilar graphs are pushed far apart in the embedding space. Besides, we show how the value of the objective function in Eq. (3) changes along the optimization process in Fig. 7. We observe that the EM algorithm empirically converges after 6 iterations, which validates our claim that FRAME achieves empirically faster computation than the existing GDL methods that adopt stochastic gradient descent.

Figure 6. Evolution of the graph embedding space over EM iterations 1-10; markers distinguish 1-block, 2-block, 3-block, and 5-block graphs.

C.3. Convergence Analysis

We empirically evaluate the convergence of the proposed FRAME on the DBLP dataset. Specifically, we assess the convergence by measuring the value of the objective function in Eq. (3) and the difference between two consecutive posterior probabilities (i.e., $\|\gamma^{(t)} - \gamma^{(t-1)}\|_1$) along the EM iterations. The experiments are repeated 5 times, and we report the mean and standard deviation of the results in Fig. 7. Fig. 7(a) shows that the objective function is non-decreasing along the optimization process and eventually converges to a local optimum after several iterations. Besides, Fig. 7(b) demonstrates that the difference between two consecutive posteriors $\gamma^{(t-1)}$ and $\gamma^{(t)}$ approaches zero as the optimization proceeds. These results empirically validate our theoretical convergence analysis and provide further evidence for our claim that the EM-based optimization of FRAME converges empirically faster than the SGD-based optimization adopted by many GDL methods.

Figure 7. Convergence analysis of the proposed FRAME: (a) values of the objective function; (b) difference between two consecutive $\gamma$.

C.4. Learning Multi-Scale Patterns

Capturing graph patterns at multiple scales is crucial, but most existing GDL methods (Vincent-Cuaz et al., 2021; Liu et al., 2023) fail to achieve this capability as they require all atoms to be of the same size. In this subsection, we provide experimental results to validate FRAME's ability to generate atoms with different sizes that capture multi-scale patterns. We consider an input graph with patterns at multiple scales, shown in Fig. 8(a), which includes a central cycle graph with outreaching edges connecting fully-connected blocks, and apply FRAME to learn 3 atoms with sizes in {5, 10, 11}. As shown in Fig. 8, FRAME successfully identifies three principal patterns in Figs. 8(b)-8(d): a fully-connected atom, a cycle atom and a star atom.

Figure 8. Learning graph patterns at multiple scales: (a) input graph; (b) fully-connected atom; (c) cycle atom; (d) star atom.
C.5. Graph Reconstruction

We first analyze how the choice of atoms affects the graph reconstruction process. We consider atom sizes in {5, 10, 15, 20, 25, 30} and numbers of atoms in {2, 3, 4, 5, 6, 7}; the results are shown in Fig. 9. First, the reconstruction error reaches relatively small values when the atom size is close to the average graph size (i.e., $n_G = 9$ for DBLP, $n_G = 32$ for ENZYMES, $n_G = 13$ for IMDB-M, and $n_G = 14$ for PTC-MR, as shown in Table 5). On the one hand, graph patterns may not be well captured by small atoms, resulting in high reconstruction errors. On the other hand, when the atom size is large, the reconstruction error either stays unchanged or even increases, as the redundancy in atoms may introduce noise into the reconstruction process. Second, more atoms lead to smaller reconstruction errors. More atoms generate a higher-dimensional embedding space, resulting in smaller information loss during the embedding process and more accurate graph reconstruction. Interestingly, the most significant reduction of the reconstruction error happens at the point where the number of atoms equals the number of graph classes. This suggests that each atom captures the discriminative pattern of one graph class, and therefore FRAME can generate an embedding space with little redundancy when the parameters are properly chosen.

Figure 9. Graph reconstruction error analysis.

We also visualize some original graphs and the corresponding reconstructed graphs from four datasets in Fig. 10. Note that the reconstructed graphs are weighted graphs, and the edge widths are proportional to the corresponding weights. By regarding the edge weights as the probabilities of connecting two nodes, the proposed FRAME can be applied to other tasks such as link prediction and recommendation.

Figure 10. Visualization of graph reconstruction: original and reconstructed graphs for each dataset.
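The remark above about treating reconstructed edge weights as connection probabilities can be turned into a simple link-prediction heuristic. The sketch below is illustrative and not part of the paper's pipeline: given a binary observed adjacency matrix and a weighted reconstruction produced by FRAME, it ranks the currently unconnected node pairs by their reconstructed weights. The names (`link_prediction_scores`, `A_rec`) and the max-normalization are assumptions of this example.

```python
import numpy as np

def link_prediction_scores(A_obs, A_rec):
    """Rank unobserved node pairs by their reconstructed edge weights.

    A_obs: binary observed adjacency matrix (n x n).
    A_rec: weighted reconstructed adjacency matrix (n x n).
    Returns a list of ((i, j), score) sorted by decreasing score.
    """
    n = A_obs.shape[0]
    w = A_rec / max(A_rec.max(), 1e-12)   # normalize weights to [0, 1] as pseudo-probabilities
    candidates = [
        ((i, j), float(w[i, j]))
        for i in range(n) for j in range(i + 1, n)
        if A_obs[i, j] == 0               # only score currently unconnected pairs
    ]
    return sorted(candidates, key=lambda t: -t[1])

# Toy usage with a random "reconstruction":
rng = np.random.default_rng(0)
A_obs = np.triu((rng.random((6, 6)) > 0.7).astype(float), 1); A_obs = A_obs + A_obs.T
A_rec = A_obs + 0.3 * rng.random((6, 6)); A_rec = (A_rec + A_rec.T) / 2
print(link_prediction_scores(A_obs, A_rec)[:3])   # top-3 predicted links
```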
D. Reproducibility

Dataset Descriptions. All the real-world datasets we use are from (Morris et al., 2020) and are available online at https://chrsmrrs.github.io/datasets/. Here we briefly summarize the datasets used in the experiments:

AIDS (Riesen & Bunke, 2008) is a set of chemical graphs, where each graph represents one molecular compound with nodes as atoms and edges as covalent bonds. The binary graph labels indicate whether the molecule shows activity against HIV. The node labels correspond to the atom type.

DBLP (Pan et al., 2013) is a set of bibliographic networks in computer science, where each graph represents one publication with nodes as paper IDs or keywords and edges as citations. The binary graph labels indicate the published venue (DBDM/CVPR) of the paper.

ENZYMES (Borgwardt et al., 2005) is a set of enzyme graphs, where each graph represents a protein tertiary structure with nodes as atoms and edges as chemical bonds. The graph labels indicate the top-level enzyme classes. The node labels correspond to the atom type.

IMDB-M (Yanardag & Vishwanathan, 2015) is a set of movie collaboration networks, where each graph represents one movie with nodes as actors/actresses and edges indicating whether two actors/actresses co-appear in the same movie. The graph labels indicate the movie category (Comedy/Romance/Sci-Fi).

PROTEINS (Borgwardt et al., 2005) is a set of protein graphs, where each graph represents a protein structure with nodes as amino acids and edges as chemical bonds. The graph labels indicate whether the protein is an enzyme or a non-enzyme. The node labels correspond to the atom type, and the node attributes represent chemical features.

PTC-MR (Helma et al., 2001) is a set of chemical graphs, where each graph represents one molecule with nodes as atoms and edges as chemical bonds. The binary graph labels indicate carcinogenicity in male rats. The node labels correspond to the atom type.

Table 5. Dataset statistics.

DATASET    #GRAPHS  #NODES  #EDGES  SPARSITY  #FEATURES  #GRAPH CLASSES  #NODE CLASSES
AIDS         2,000   17.80   18.40      0.06          4               2             38
DBLP        19,456    9.13   19.48      0.23       NONE               2           NONE
ENZYMES        600   31.64   61.85      0.06         18               6              3
IMDB-M       1,500   12.74   53.88      0.33       NONE               3           NONE
PROTEINS     1,113   43.31   77.79      0.04          1               2              3
PTC-MR         344   13.88   14.18      0.07       NONE               2             19

Machine Configuration and Code. The proposed method is implemented in Python based on the POT toolbox (Flamary et al., 2021). The graph kernel methods are based on the GraKeL library (Siglidis et al., 2020), and the embedding-based methods are based on the KarateClub library (Rozemberczki et al., 2020). All experiments are conducted on a Linux platform with an Intel Xeon Gold 6240R CPU and an NVIDIA Tesla V100 SXM2 GPU. The code is implemented by authors from the University of Illinois and is available at https://github.com/zhichenz98/FraMe-ICML23.

E. More on Related Works

Optimal Transport on Graphs. The OT theory (Peyré et al., 2019) compares two distributions by finding the optimal coupling that minimizes a predefined cost. There has been recent interest in applying OT to structured data such as graphs, with wide applications in graph comparison, graph alignment, graph dictionary learning and so on. The key idea is to represent graphs as distributions and optimize a probabilistic coupling under the Wasserstein discrepancy (Nikolentzos et al., 2017; Maretic et al., 2019; Togninalli et al., 2019) or the Gromov-Wasserstein discrepancy (Mémoli, 2011; Sturm, 2012). Many works (Chen et al., 2020; Xu et al., 2019b; Xu, 2020; Titouan et al., 2019; Zeng et al., 2023) represent a graph as a discrete distribution on the product space of graph topology and node attributes, where elements indicate the weights of different nodes, while other works represent graphs as uniform distributions with Laplacian-like covariance matrices (Maretic et al., 2019; 2022). Afterwards, either the distances are utilized for tasks like graph comparison (Xu et al., 2019b; Titouan et al., 2019) and graph dictionary learning (Vincent-Cuaz et al., 2021; 2022; Xu, 2020; Liu et al., 2023), or the OT couplings are leveraged to model node relationships in graph alignment (Xu et al., 2019a;b; Maretic et al., 2019; 2022; Zeng et al., 2023).
Mixture Models. Mixture models are widely used to model unknown distribution shapes in various fields such as bioinformatics, engineering, and imaging (McLachlan et al., 2019). These models offer a semi-parametric approach that represents an unknown distribution as a combination of base distributions. By optimizing the weights and parameters of these base distributions, the model aims to best fit the data distribution. Initially, the method of moments (Pearson, 1894) was employed to fit the unknown distribution by solving a nonic polynomial. Subsequently, the expectation-maximization (EM) algorithm emerged as the predominant approach for optimizing mixture models. The key idea follows the principle of maximum likelihood estimation, iteratively computing the conditional expectation of the log-likelihood and maximizing the complete-data likelihood. While a mixture model provides a probabilistic clustering of data samples given the base components, the selection of the model order, i.e., the number of base components, is essential to model performance (McLachlan & Rathnayake, 2014; McLachlan et al., 2019). In this paper, we set the model order to be the same as the number of graph classes, as we expect one representative base model (i.e., atom) for each graph class.

F. Future Works and Limitations

As discussed in Appendix B, there are several possible directions to explore that can further benefit the current framework, including:

Non-exponential graph generation kernels: When adopting non-exponential graph generation kernels, the resulting formulation is no longer an FGW barycenter problem and hence may lack an efficient solution. We may need to further explore efficient solutions other than the computationally costly stochastic gradient descent.

Redundancy regularization: When incorporating the redundancy regularization in Eq. (19), the resulting problem can be regarded as an FGW barycenter problem with negative distances, which, however, cannot be directly solved by the current algorithm due to the violation of the positive semidefinite and quadratic-approximation properties.

Online learning: When dealing with graphs that arrive in sequence, it would be beneficial to build a model that can incrementally learn from new samples (Vincent-Cuaz et al., 2021). Although the current vanilla EM algorithm is not applicable to online learning, recent advances in stochastic EM algorithms (Chen et al., 2018) may provide a feasible way to develop an online generative GDL method.

Learnable mass distribution over graphs: The current model follows a common approach in OT-based graph learning frameworks where the probability mass is uniformly distributed over nodes. However, the recently proposed srGW (Vincent-Cuaz et al., 2022) provides a semi-relaxed GW discrepancy by relaxing the marginal constraint over the target graph. By relaxing the marginal constraint over the atom, each atom can function differently for different graphs, hence further increasing the representation power of the proposed model.

Graph generation: It would be of great interest to explore FRAME's ability to generate graphs. In other words, given a graph embedding space learned by FRAME, how can we efficiently sample graphs from the model? This direction not only provides a feasible way to generate graphs obeying the input graph distribution, but also suggests a possible way to construct interpretable graph encoder-decoder and graph GAN models. A minimal sampling sketch is given below.
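As a concrete illustration of this last point, the following sketch shows one naive way to sample a graph from a fitted FRAME-style mixture: pick an atom according to the mixture weights, then treat its normalized edge weights as independent Bernoulli probabilities. This is an assumption-laden toy sampler rather than a method proposed in the paper; the names (`sample_graph`, `pi`, `atoms`) and the Bernoulli edge model are ours.

```python
import numpy as np

def sample_graph(pi, atoms, rng=None):
    """Naive generative sampler: atom index k ~ Categorical(pi), then
    edges ~ Bernoulli(normalized atom weights); node features are copied from the atom.

    pi:    mixture weights, shape (K,), summing to 1.
    atoms: list of (A_k, X_k) with weighted adjacency A_k and node features X_k.
    """
    rng = rng or np.random.default_rng()
    k = rng.choice(len(pi), p=pi)                  # pick an atom
    A_k, X_k = atoms[k]
    P = A_k / max(A_k.max(), 1e-12)                # edge weights -> probabilities in [0, 1]
    U = rng.random(P.shape)
    A = np.triu((U < P).astype(float), 1)          # sample the upper triangle only
    return A + A.T, X_k.copy(), k                  # symmetric adjacency, features, atom id

# Toy usage with two hand-made atoms:
atoms = [(np.array([[0.0, 0.9], [0.9, 0.0]]), np.zeros((2, 1))),
         (np.array([[0.0, 0.2, 0.8], [0.2, 0.0, 0.2], [0.8, 0.2, 0.0]]), np.ones((3, 1)))]
A, X, k = sample_graph(np.array([0.4, 0.6]), atoms)
print(k, A)
```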